USING TOPICAL INTERESTS AND SOCIAL INTERACTIONS TO IDENTIFY SIMILAR TWITTER USERS

A Thesis submitted to the faculty of
San Francisco State University
In partial fulfillment of
the requirements for
the Degree

Master of Science In
Computer Science

by Swati Pradeep Patel San Francisco, California
May 2018

Copyright by Swati Pradeep Patel
2018

CERTIFICATION OF APPROVAL
I certify that I have read “Using topical interests and social interactions to identify similar Twitter users” by Swati Pradeep Patel, and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree Master of Science in Computer Science at San Francisco State University.

Hui Yang, Ph.D.
Associate Professor of Computer Science

Hui-ming Deanna Wang, Ph.D.
Professor of Marketing

Kazunori Okada, Ph.D.
Associate Professor of Computer Science

USING TOPICAL INTERESTS AND SOCIAL INTERACTIONS TO IDENTIFY SIMILAR TWITTER USERS
Swati Pradeep Patel San Francisco, California
2018
In the present world, social networks such as Twitter have become an important medium for the diffusion of information. Quantifying the connection between users will help us understand how knowledge propagates. With this motivation, we focus on designing and implementing a set of reliable measures of the similarity between two users on the Twitter microblogging platform. These measures take into account both a user’s topical interests and her social connections. We first computed topical interest-based similarity from users’ historical tweets. We studied various topic modeling approaches and evaluated them in terms of their ability to distinguish the social relationship between Twitter users. We also designed an evaluation approach based on the concept of homophily in Twitter. To leverage another aspect of content-based similarity, we used the hashtags in tweets to measure users’ interests. Next, we used interactions between users, including retweeting, quoting, replying, and mentioning, to create a similarity measure that represents the structural aspect. Our experimental results demonstrate that the proposed similarity measures can significantly distinguish pairs that are socially connected from those that are not. As part of ongoing research, we are studying the role of these similarity measures in modeling the information diffusion process of new TV shows on the Twitter platform.
I certify that the Abstract is a correct representation of the content of this thesis.

Chair, Thesis Committee

Date

ACKNOWLEDGEMENTS
Firstly, I would like to thank my advisor, Dr. Hui Yang, for giving me the opportunity to work on this interdisciplinary and challenging project, which has numerous practical applications. I also want to thank her not only for her guidance and support throughout the project but also for helping me with important Data Mining concepts, skills, knowledge, and best practices. Secondly, I would like to extend my appreciation to Dr. Deanna Wang for giving me the opportunity to be a part of the project and for guiding it conceptually. I would especially like to thank her for the enlightening discussions on the marketing aspect of the project. Further, I would like to thank Dr. Kazunori Okada for being part of my thesis committee, for providing his time and insight, and, more importantly, for introducing me to the concepts of Machine Learning, which have been a definite help in the project. Furthermore, I would like to thank my colleague Yeqing Yan for the code reviews and the brainstorming discussions.
Last but not least, I would like to thank my parents, family, and friends, who supported me throughout my graduate program. Special thanks to my husband, Mr. Pradeep Patel, whose motivation and support have been paramount in helping me make it thus far.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1. Introduction
2. Background and Related Work
2.1. Twitter data
2.2. Related work on calculating user similarity
2.3. Topic Modeling for Twitter data
2.3.1. Introduction to Latent Dirichlet Allocation
2.3.2. jLDADMM Java Package: LDA implementation
2.3.3. Gensim package for LDA
2.3.4. jLDADMM Dirichlet Mixture Model (DMM) implementation
2.3.5. Word vectors with Topic Modeling
2.4. Multifaceted similarity measure for Twitter users
3. Methods
3.1. Data Collection and Storage
3.1.1. Streaming tweets collection
3.1.2. Collection of historical tweets
3.1.3. Collection of Social network data
3.1.4. Monitoring the Data Collection
3.2. Content-based similarity
3.2.1. Preprocessing the Twitter data for topic modeling
3.2.2. Topic modeling
3.2.3. Calculating topical similarity
3.3. Hashtag-based similarity
3.4. Social interactions-based similarity
3.4.1. Retweet Network Creation
3.4.2. Interactions-based similarity
4. Evaluation and Results
4.1. Dataset Description
4.2. Evaluation Strategies
4.2.1. Qualitative evaluation of topic modeling results
4.2.2. Evaluation datasets for similarity measure
4.2.3. Results from Topic based similarity
4.2.4. Runtime of topic modeling approaches
4.2.5. Entertainment based topical similarity (EBTS)
4.3. Results from hashtag-based similarity
4.4. Results from interactions-based similarity
4.5. Co-relation between the similarity scores
5. Conclusions and Future Directions
6. References
7. Appendix A: Twitter terms
8. Appendix B: Installation and Documentation
8.1. Data Collection
8.1.1. Tweet Streams
8.1.2. Historic Tweets
8.1.3. Social Structure Data
8.2. Topical Similarity Calculation
8.2.1. Preparation of the input documents data
8.2.2. Topic Modeling Software
8.2.3. Creating topic vectors
8.2.4. Calculating topical Similarity
8.3. Hashtag based similarity and Interactions Based Similarity

LIST OF TABLES

Table 1: Variables for LDA Model
Table 2: Topic modeling approaches
Table 3: Details of the TV Show included in the study
Table 4: Total count, mean, median and range for the historical tweets per user
Table 5: Characteristics of the Retweet Network
Table 6: Users excluded from the study by preprocessing
Table 7: Mean Similarity Scores in evaluation groups for all topic modeling approaches for 24-Legacy Dataset
Table 8: Analysis of Variance (ANOVA) and Post hoc Independent T-test for similarity scores - 24-Legacy Dataset
Table 9: Mean Similarity Scores in evaluation groups for all topic modeling approaches for The-Good-Place Dataset
Table 10: Analysis of Variance (ANOVA) and Post hoc Independent T-test for similarity scores - The-Good-Place Dataset
Table 11: Mean Similarity Scores in evaluation groups for all topic modeling approaches for This-Is-Us Dataset
Table 12: Analysis of Variance (ANOVA) and Post hoc Independent T-test for similarity scores - This-Is-Us Dataset
Table 13: Run time for topic modeling algorithms used
Table 15: Descriptive statistics Hashtags Count
Table 16: Mean Hashtag Based Similarity in Social Pair groups
Table 17: Analysis of Variance (ANOVA) and Post hoc Independent T-test for Hashtag Based Similarity
Table 18: Mean Interaction Jaccard Measure in Social Pair Groups
Table 19: Analysis of Variance (ANOVA) and Post hoc Independent T-test for Interaction Jaccard Measure
Table 20: Pearson’s Co-relation between the various similarity measures
Table 21: Document structure of historic tweet collection
Table 22: Document structure of social network collection
Table 23: Output files created by data preprocessing scripts
Table 24: Output files created by jLDA Java program
