Structure and Dynamics of Discussions on the Social Media Platform Reddit
by Riccardo Carlucci
Date of Examination: 2024-04-29
Date of issue: 2024-05-28
Advisor: Prof. Dr. Stephan Herminghaus
Referee: Prof. Dr. Stephan Herminghaus
Referee: Prof. Dr. Stefan Klumpp
Files in this item
Name: dissertation.pdf
Size: 34.8 MB
Format: PDF
Abstract
English
The last fifteen years have witnessed a tremendous rise in the impact of online social media on our lives. Platforms such as Facebook, Twitter, YouTube, and Reddit have radically changed the way we communicate and consume information. Understanding the dynamics of these platforms is therefore a significant research challenge, particularly in relation to the study of how (mis-)information spreads across online spaces. In this work, we focus specifically on Reddit, one of the most popular and influential discussion platforms on the Internet. From a structural perspective, Reddit is organised into thematic communities where users can start new discussions, or participate in existing ones. First, we study the statistical properties of discussions in the most active communities. We find that highly popular discussions can be orders of magnitude larger than less popular ones, although the details of the ‘popularity distribution’ depend on the particular community. In addition, communities are characterised by fast dynamics, whereby old discussions are continuously replaced by newer ones. As a consequence, we find that most of the activity in a discussion is concentrated in the first four hours, after which it becomes progressively more difficult for late-coming users to make significant contributions. In order to understand these observations, we study a rich-get-richer model where discussions in a community compete for the attention of users. The attractiveness of a discussion depends on its current popularity, age, and an intrinsic ‘fitness’ factor determined essentially by its title. The rich-get-richer mechanism amplifies intrinsic fitness differences exponentially, although ageing and competition act as mitigating factors. 
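As a rough illustration of the attention-competition mechanism described above, the following minimal sketch simulates a single community in discrete steps. It is not the model of the dissertation: the weight `fitness * (comments + 1) * exp(-age / ageing_time)`, the uniform fitness distribution, and the names `simulate_community`, `p_new_discussion`, and `ageing_time` are all assumptions made for illustration only.

```python
import math
import random


def simulate_community(n_steps, p_new_discussion=0.05, ageing_time=200.0, seed=0):
    """Toy rich-get-richer simulation; returns the comment count of each discussion.

    Assumed (not taken from the dissertation): at every step either a new
    discussion is created with probability p_new_discussion, or one comment
    is added to an existing discussion chosen with probability proportional
    to fitness * (comments + 1) * exp(-age / ageing_time).
    """
    rng = random.Random(seed)
    births = [0]                        # creation step of each discussion
    fitness = [rng.uniform(0.1, 1.0)]   # assumed fitness distribution, strictly positive
    comments = [0]                      # current popularity of each discussion

    for t in range(1, n_steps + 1):
        if rng.random() < p_new_discussion:
            # A new discussion enters the competition with zero comments.
            births.append(t)
            fitness.append(rng.uniform(0.1, 1.0))
            comments.append(0)
            continue
        # Attractiveness grows with popularity and fitness, decays with age.
        weights = [
            f * (c + 1) * math.exp(-(t - b) / ageing_time)
            for f, c, b in zip(fitness, comments, births)
        ]
        chosen = rng.choices(range(len(comments)), weights=weights)[0]
        comments[chosen] += 1
    return comments


counts = simulate_community(2000, seed=42)
print("discussions:", len(counts), "largest:", max(counts))
```

In this sketch, `p_new_discussion` sets the balance between new discussions and new comments, while `ageing_time` controls how quickly older discussions lose out to newer ones.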
In addition, we find that the behaviour of the model depends on the interplay between three characteristic time scales: the two ‘creation time scales’ determining the frequency of new comments and new discussions, and the ‘ageing time scale’ determining how quickly discussions lose their attractiveness. We then compare the model with some of the most active Reddit communities, finding good agreement between the two. These studies are made possible by the extensive Reddit Pushshift Dataset, which provides an almost complete archive of the platform until March 2023. However, Reddit restricted access to its data in April 2023, so academic researchers may have only partial access to the platform in the future. In the last part of our work, we study how limited samples affect global inferences on the behaviour of Reddit communities. Specifically, we analyse two different subsampling schemes to address two problems: estimating the total size of a community, and estimating the distribution of the number of comments per discussion. In both cases, we find that a 5% random sample is often sufficient to obtain a reasonable approximation. These results may provide some hope in the face of the recent trend of social media platforms taking an increasingly hostile stance towards academic research.
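To make the subsampling idea concrete, here is a minimal sketch of one possible scheme on synthetic data. The abstract does not specify the two schemes studied, so everything here is an assumption: comments are kept independently with a fixed probability, community size is taken to mean the total number of comments, and the helper names `bernoulli_sample` and `estimate_total` are hypothetical.

```python
import random


def bernoulli_sample(items, rate, rng):
    """Keep each item independently with probability `rate` (assumed scheme)."""
    return [x for x in items if rng.random() < rate]


def estimate_total(sample_size, rate):
    """Estimate the full population size by scaling up with the inverse rate."""
    return sample_size / rate


rng = random.Random(1)
# Synthetic stand-in for a community: one discussion id per comment.
comment_discussion_ids = [rng.randrange(500) for _ in range(100_000)]

sample = bernoulli_sample(comment_discussion_ids, 0.05, rng)
estimate = estimate_total(len(sample), 0.05)
print("sampled comments:", len(sample), "estimated total:", round(estimate))
```

With a 5% rate the estimate fluctuates around the true total, and the relative spread shrinks as the community grows. Recovering the distribution of comments per discussion from such a sample is a harder problem that this sketch does not attempt.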
Keywords: Social Media; Reddit; Pólya Urn Models; Subsampling