This document describes a study that aimed to characterize HIV-vulnerable populations on Twitter by analyzing user sentiments and extracting risk-related information. Researchers collected Twitter data using different APIs and classified tweets based on predefined HIV risk words. They modeled the data as a property graph in Neo4j, with nodes for users, tweets, and hashtags, and edges representing relationships. Queries were run to find conversations between users mentioning drug- and sex-related terms, the most-mentioned users, topics discussed by followers of high-risk users, and the proximity of drug and homosexual users in the social graph. The study demonstrated how social network analysis and graph databases can help identify at-risk groups for public health interventions.
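The property-graph modeling and query step described above can be sketched in plain Python (the study itself used Neo4j; the node/edge shapes and risk-word lists below are illustrative assumptions, not the study's schema):

```python
# Sketch of the study's property-graph query logic in plain Python.
# The original work used Neo4j; the tweet records and the risk-word
# lists here are invented for illustration.

DRUG_TERMS = {"meth", "heroin"}   # hypothetical risk-word list
SEX_TERMS = {"hookup"}            # hypothetical risk-word list

# Nodes: users and tweets; edges: the reply_to field models a
# REPLIED_TO relationship between tweets.
tweets = [
    {"id": 1, "user": "alice", "text": "looking for a hookup tonight", "reply_to": None},
    {"id": 2, "user": "bob", "text": "got meth if you want", "reply_to": 1},
    {"id": 3, "user": "carol", "text": "nice weather today", "reply_to": None},
]

def risky_conversations(tweets):
    """Return user pairs whose reply threads mention both drug and sex terms."""
    by_id = {t["id"]: t for t in tweets}
    pairs = []
    for t in tweets:
        if t["reply_to"] is None:
            continue
        parent = by_id[t["reply_to"]]
        words = set((t["text"] + " " + parent["text"]).lower().split())
        if DRUG_TERMS & words and SEX_TERMS & words:
            pairs.append((parent["user"], t["user"]))
    return pairs

print(risky_conversations(tweets))  # → [('alice', 'bob')]
```

In Neo4j the same question would be a single Cypher `MATCH` over `(:User)-[:POSTED]->(:Tweet)-[:REPLIED_TO]->(:Tweet)` patterns; the dictionary walk above just makes the traversal explicit.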
Semantic web technologies offer a potential mechanism for the representation and integration of thousands of biomedical databases. Many of these databases offer cross-references to other data sources, but these are generally incomplete and prone to error. In this paper, we conduct an empirical analysis of the link structure of life science Linked Data, obtained from the Bio2RDF project. Three different link graphs for datasets, entities and terms are characterized by degree, connectivity, and clustering metrics, and their correlation is measured as well. Furthermore, we utilize the symmetry and transitivity of entity links to build a benchmark and evaluate several popular entity matching approaches. Our findings indicate that the life science data network can help find hidden links, can be used to validate links, and may offer a mechanism to integrate a wider set of resources to support biomedical knowledge discovery.
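The use of link symmetry to validate cross-references can be illustrated with a minimal sketch (the dataset names and links below are invented; the paper's actual link graphs come from Bio2RDF):

```python
# Illustrative sketch: using link symmetry to flag suspect
# cross-references. Identifiers are invented for illustration.

links = {
    ("drugbank:DB00316", "kegg:D00217"),
    ("kegg:D00217", "drugbank:DB00316"),   # symmetric pair: mutually confirmed
    ("drugbank:DB00945", "chebi:15365"),   # one-directional: unconfirmed
}

def asymmetric_links(links):
    """Return links whose reverse direction is absent (candidates for review)."""
    return {(a, b) for (a, b) in links if (b, a) not in links}

print(asymmetric_links(links))
# → {('drugbank:DB00945', 'chebi:15365')}
```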
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING - ijaia
In this paper we present and compare two methodologies for rapidly inducing multiple subject-specific
taxonomies from crawled data. The first method involves a sentence-level word co-occurrence frequency
method for building the taxonomy, while the second involves the bootstrapping of a Word2Vec based
algorithm with a directed crawler. We exploit DMOZ, the multilingual open-content
directory of the World Wide Web, to seed the crawl, and the domain name to direct
the crawl. This domain corpus is then input
to our algorithm that can automatically induce taxonomies. The induced taxonomies provide hierarchical
semantic dimensions for the purposes of faceted browsing. As part of an ongoing personal semantics
project, we applied the resulting taxonomies to personal social media data (Twitter, Gmail, Facebook,
Instagram, Flickr) with an objective of enhancing an individual’s exploration of their personal information
through faceted searching. We also perform a comprehensive corpus-based evaluation of the algorithms
based on many datasets drawn from the fields of medicine (diseases) and leisure (hobbies) and show that
the induced taxonomies are of high quality.
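The sentence-level co-occurrence method can be sketched roughly as follows, assuming a standard subsumption heuristic (a term x is taken as a parent of y if y nearly always co-occurs with x, but not vice versa); the paper's exact scoring may differ:

```python
from collections import Counter
from itertools import combinations

# Minimal sketch of sentence-level co-occurrence taxonomy induction.
# The subsumption rule used here is a common heuristic assumed for
# illustration, and stop-word filtering is omitted for brevity.

sentences = [
    "diabetes is a metabolic disease",
    "type 2 diabetes is a common disease",
    "asthma is a respiratory disease",
]

def induce_parents(sentences, threshold=0.8):
    """Return (parent, child) edges from sentence-level co-occurrence counts."""
    term_freq = Counter()
    pair_freq = Counter()
    for s in sentences:
        terms = set(s.split())
        term_freq.update(terms)
        pair_freq.update(combinations(sorted(terms), 2))
    edges = []
    for (x, y), n in pair_freq.items():
        # x subsumes y when P(x|y) is high but P(y|x) is not.
        if n / term_freq[y] >= threshold and n / term_freq[x] < threshold:
            edges.append((x, y))
        elif n / term_freq[x] >= threshold and n / term_freq[y] < threshold:
            edges.append((y, x))
    return edges

edges = induce_parents(sentences)
# "disease" emerges as a parent of both "diabetes" and "asthma".
```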
With its focus on investigating the basis for the sustained existence
of living systems, modern biology has always been a fertile, if not
challenging, domain for formal knowledge representation and automated
reasoning. With thousands of databases and hundreds of ontologies now
available, there is a salient opportunity to integrate these for
discovery. In this talk, I will discuss our efforts to build a rich
foundational network of ontology-annotated linked data, develop
methods to intelligently retrieve content of interest, uncover
significant biological associations, and pursue new avenues for drug
discovery. As the portfolio of Semantic Web technologies continues to
mature in terms of functionality, scalability, and an understanding of
how to maximize their value, researchers will be strategically poised
to pursue increasingly sophisticated KR projects aimed at improving
our overall understanding of human health and disease.
bio: Dr. Michel Dumontier is an Associate Professor of Medicine
(Biomedical Informatics) at Stanford University. His research aims to
find new treatments for rare and complex diseases. His research
interests lie in the publication, integration, and discovery of
scientific knowledge. Dr. Dumontier serves as a co-chair for the World
Wide Web Consortium Semantic Web in Health Care and Life Sciences
Interest Group (W3C HCLSIG) and is the Scientific Director for
Bio2RDF, a widely used open-source project to create and provide
linked data for life sciences.
Powering Scientific Discovery with the Semantic Web (VanBUG 2014) - Michel Dumontier
In the quest to translate the results of biomedical research into effective clinical applications, many are now trying to make sense of the large and rapidly growing amount of public biomedical data. However, substantial challenges exist in traversing the currently fragmented data landscape. In this talk, I will discuss our efforts to use Semantic Web technologies to facilitate biomedical research through the formulation, publication, integration, and exploration of facts, expert knowledge, and web services.
Bio2RDF is an open-source project that offers a large and
connected knowledge graph of Life Science Linked Data. Each dataset is expressed using its own vocabulary, which hinders integrating, searching, querying, and browsing across similar or identical types of data. With growth and content changes in the source data, manually maintaining mappings has proven untenable. The aim of this work is to develop a (semi-)automated procedure to generate high-quality mappings
between Bio2RDF and SIO using BioPortal ontologies. Our preliminary results demonstrate that the approach is promising: it can find new mappings using a transitive closure over ontology mappings. Further development of the methodology, coupled with improvements in
the ontology, will offer a better-integrated view of the Life Science Linked Data.
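The transitive-closure step can be illustrated with a small sketch; the term identifiers below are invented for illustration, not actual Bio2RDF or SIO mappings:

```python
# Sketch of finding new mappings via transitive closure over ontology
# mappings. Identifiers are invented; real mappings come from BioPortal.

mappings = {
    ("bio2rdf:Gene", "obo:SO_0000704"),
    ("obo:SO_0000704", "sio:SIO_010035"),
}

def transitive_closure(pairs):
    """Repeatedly chain (a, b) and (b, c) into (a, c) until stable."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        new = {(a, d) for (a, b) in closure for (c, d) in closure
               if b == c and (a, d) not in closure}
        if new:
            closure |= new
            changed = True
    return closure

inferred = transitive_closure(mappings) - mappings
print(inferred)  # → {('bio2rdf:Gene', 'sio:SIO_010035')}
```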
IBC FAIR Data Prototype Implementation slideshow - Mark Wilkinson
Discussion of ways of achieving FAIRness of both metadata and data. Brute-force approaches and more elegant "projection" approaches are shown.
Relevant papers are at:
doi: 10.7717/peerj-cs.110 (https://peerj.com/articles/cs-110/)
doi: 10.3389/fpls.2016.00641 (https://doi.org/10.3389/fpls.2016.00641)
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
smartAPIs: EUDAT Semantic Working Group Presentation @ RDA 9th Plenary - Mark Wilkinson
smartAPIs are an approach to the incremental, machine-aided, semantic annotation of Web APIs. Starting from existing, popular standards, we will provide enhanced tools for authoring ever-richer metadata, guided by global community knowledge encapsulated in ontologies, and aided by "smart suggestions" based on mining the metadata from previous API specifications.
The project is led by Michel Dumontier (Maastricht University). This presentation was given on his behalf by Mark Wilkinson (UPM, Madrid; Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R)
Building a Network of Interoperable and Independently Produced Linked and Ope... - Michel Dumontier
Over 15 years ago, Sir Tim Berners-Lee proclaimed the founding of an exciting new future involving intelligent agents operating over smarter data in order to perform complex tasks at the behest of their human controllers. At the heart of this vision lies an uneasy alliance between tedious formal knowledge representations and powerful analytics over big, but often messy, data. Bio2RDF, our decade-old open-source project to create Linked Data for the life sciences, has woven emergent Semantic Web technologies such as ontologies and Linked Data together to generate FAIR - Findable, Accessible, Interoperable, and Reusable - data in the form of billions of machine-accessible statements for use in downstream biomedical discovery.
This revolution in data publication has been strengthened by action from global bioinformatics institutions such as the NCBI, NCBO, EBI, and DBCLS. Notably, NCBI's PubChem has successfully coupled large-scale data integration with community-based standards to offer a remarkable biochemical knowledge resource amenable to data-hungry discovery tools. Yet, in the face of increasing pressure from researchers, funders, and publishers, will these approaches be sufficient for growing and maintaining a comprehensive knowledge graph that is inclusive of all biomedical research?
A presentation to the New Year's Event for Maastricht University's Knowledge Engineering @ Work Program. https://www.maastrichtuniversity.nl/news/kework-first-10-students-academic-workstudy-track-graduate
Scholarly Communication for Bioinformatics Students - Philip Bourne
Presentation made to the incoming bioinformatics and systems biology students at UCSD on how they could get involved in changing scholarly communication. Given February 28, 2011
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati... - Mark Wilkinson
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
Building SADI Services Tutorial - SIB Workshop, Geneva, December 2015 - Mark Wilkinson
The primary slide deck for the SADI tutorial. We explain the motivation, simple SADI services, more complex SADI services, and then do a detailed walk-through of building a service, including the Perl service code and examples of service invocation at the command line, and using the SHARE client. You will want to look at the sample data/queries in this slide deck: http://www.slideshare.net/markmoby/sample-data-and-other-ur-ls-55737183 and the example service code in this slide deck: http://www.slideshare.net/markmoby/example-code-for-the-sadi-bmi-calculator-web-service?related=1
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ... - Werner Leyh
Abstract. The aim of this work is to explore the opportunities offered by
semantic standardization to interlink primary "spatial data" (GI) from
"OpenStreetMap" (OSM) with repositories of the "Linked Open Data Cloud" (LOD).
Research in the natural sciences can generate vast amounts of spatial data,
where Wikidata could be considered the central hub between more detailed
natural science hubs on the spatial semantic web. Wikidata is a world-readable
and world-writable community-driven knowledge base. It offers the opportunity
to collaboratively construct an open-access knowledge graph that spans biology,
medicine, and all other domains of knowledge. In this study, we discuss
the opportunities and challenges of exploring Wikidata as a central
integration facility by interlinking it with OSM, a popular, community-driven
collection of free geographic data. This is empowered by the reuse of terms
and properties from commonly understood controlled vocabularies that
represent their respective well-identified knowledge domains.
URL: https://www.springerprofessional.de/en/interlinking-standardized-openstreetmap-data-and-citizen-science/13302088
DOI: https://doi.org/10.1007/978-3-319-60366-7_9
Werner Leyh, Homero Fonseca Filho
University of São Paulo (USP), São Paulo, Brazil
WernerLeyh@yahoo.com
BOUNCER: A Privacy-aware Query Processing Over Federations of RDF Datasets - Kemele M. Endris
Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential to improve citizens' quality of life. However, effective data-centric applications demand data management techniques able to process large volumes of data, which may include sensitive data, e.g., financial transactions, medical procedures, or personal data. Managing sensitive data requires the enforcement of privacy and access-control regulations, particularly during the execution of queries against datasets that include both sensitive and non-sensitive data. In this paper, we tackle the problem of enforcing privacy regulations during query processing and propose BOUNCER, a privacy-aware query engine over federations of RDF datasets. BOUNCER allows for the description of RDF datasets in terms of RDF molecule templates, i.e., abstract descriptions of the properties of the entities in an RDF dataset and their privacy regulations. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over RDF datasets that not only contain the entities relevant to answer a query, but are also regulated by policies that allow those entities to be accessed. We empirically evaluate the effectiveness of BOUNCER's privacy-aware techniques over state-of-the-art benchmarks of RDF datasets. The observed results suggest that BOUNCER can effectively enforce access-control regulations at different granularities without impacting the performance of query processing.
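The source-selection idea behind BOUNCER can be illustrated with a minimal sketch; the policy shapes, dataset names, and predicates below are invented, not BOUNCER's actual RDF molecule templates:

```python
# Illustrative sketch of privacy-aware source selection: a query's
# triple-pattern predicates are matched only against datasets whose
# access policy permits them. All names here are invented.

datasets = {
    "clinical": {"predicates": {"diagnosis", "treatment"}, "public": {"treatment"}},
    "open_drugs": {"predicates": {"interactsWith"}, "public": {"interactsWith"}},
}

def select_sources(query_predicates, datasets, authorized=False):
    """Pick datasets that both hold a predicate and allow access to it."""
    plan = {}
    for name, ds in datasets.items():
        allowed = ds["predicates"] if authorized else ds["public"]
        hits = query_predicates & allowed
        if hits:
            plan[name] = hits
    return plan

# An unauthorized user cannot route the "diagnosis" pattern anywhere.
print(select_sources({"diagnosis", "interactsWith"}, datasets))
# → {'open_drugs': {'interactsWith'}}
```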
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U... - Paolo Missier
talk for paper published at ICWE2019:
Primo F, Missier P, Romanovsky A, Mickael F, Cacho N. A customisable pipeline for continuously harvesting socially-minded Twitter users. In: Procs. ICWE'19. Daejeon, Korea; 2019.
In June 2013, the Alfred P. Sloan Foundation awarded NISO a grant to undertake a two-phase initiative to explore, identify, and advance standards and/or best practices related to a new suite of potential metrics in the community. The NISO Altmetrics Project has successfully moved to Phase Two, the formation of three working groups, A, B, & C. Working Group B, led by Kristi Holmes, PhD, Director, Galter Health Sciences Library at Northwestern University, and Mike Taylor, Senior Product Manager, Informetrics at Elsevier, is focused on the Output Types & Identifiers within the alternative metrics landscape.
Microblogging has become a popular communication tool among Internet users, with millions of users sharing opinions on different aspects of life every day. As a result, microblogging sites are rich sources of data for opinion mining and sentiment analysis. Because microblogging appeared relatively recently, only a few research works have been devoted to this topic. In our paper, we focus on using Twitter, the most popular microblogging platform, for the task of sentiment analysis. We show how to automatically collect a corpus for sentiment analysis and opinion mining purposes. We perform a linguistic analysis of the collected corpus and explain the discovered phenomena. Using the corpus, we build a sentiment classifier that can determine positive, negative, and neutral sentiments for a document. Experimental evaluations show that our proposed techniques are effective and perform better than previously proposed methods. In our evaluation we worked with English, but the proposed technique can be used with any other language. Krunal Dhardev | Dr. Kamalraj R, "Twitter Sentiment Analysis", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume 5, Issue 4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42385.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/42385/twitter-sentiment-analysis/krunal-dhardev
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection - IJERA Editor
This paper presents two approaches for finding trending topics in social networks: a keyword-based approach and a link-based approach. Conventional keyword-based approaches to topic detection focus mainly on the frequencies of (textual) words. We propose a link-based approach that focuses instead on the mentioning behavior of hundreds of users, as reflected in their posts. Anomaly detection in the Twitter data set is carried out by sequentially retrieving trending topics from Twitter via an API, together with the corresponding users for training; a computed anomaly score is then aggregated across users. The aggregated anomaly score is fed into change-point analysis or burst detection in order to pinpoint emerging topics. Because we use real-time Twitter accounts, results vary with current tweet trends. Experiments show that the proposed link-based approach performs even better than the keyword-based approach.
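The aggregation and burst-detection step can be sketched as follows, assuming a simple moving-average threshold in place of the paper's actual change-point model:

```python
# Sketch of the burst-detection step: per-user anomaly scores are
# aggregated per time window, and a window whose aggregate exceeds a
# multiple of the running mean is flagged as an emerging topic.
# The threshold rule is a simplifying assumption for illustration.

def detect_bursts(window_scores, factor=2.0, warmup=3):
    """Flag window indices whose score exceeds factor * running mean."""
    bursts, total = [], 0.0
    for i, score in enumerate(window_scores):
        if i >= warmup and score > factor * (total / i):
            bursts.append(i)
        total += score
    return bursts

# Aggregated anomaly scores per window; a spike appears at index 5.
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 6.5, 1.0]
print(detect_bursts(scores))  # → [5]
```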
Mining academic social networks is becoming increasingly necessary as the amount of data grows, and it
is a favorite research topic for many researchers. Data mining techniques are used for the mining of
academic social networks. In this paper, we present an efficient frequent item-set mining technique
for academic social networks. The proposed framework first processes the research documents, then applies
enhanced frequent item-set mining to find the strength of relationships between researchers.
The proposed method is faster than older algorithms and requires less main memory
for computation.
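The frequent item-set idea can be sketched by treating each document's author list as a transaction; the names and support threshold below are illustrative, not the paper's enhanced algorithm:

```python
from collections import Counter
from itertools import combinations

# Sketch of frequent item-set mining over an academic network: each
# document's author list is a transaction, and frequently co-occurring
# author pairs indicate stronger relationships. Names are invented.

documents = [
    {"Ana", "Ben", "Chen"},
    {"Ana", "Ben"},
    {"Ana", "Chen"},
    {"Ben", "Dana"},
]

def frequent_pairs(documents, min_support=2):
    """Return author pairs appearing together in at least min_support documents."""
    counts = Counter()
    for authors in documents:
        counts.update(combinations(sorted(authors), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(documents))
# → {('Ana', 'Ben'): 2, ('Ana', 'Chen'): 2}
```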
Crowdsourcing platforms are revolutionizing research by providing a way to collect clinical and behavioral data with unprecedented speed and efficiency. This seminar explores another digital platform called TurkPrime that is designed to support research participant recruitment. TurkPrime is a relatively new panel service that allows researchers to target specific demographic groups. If you watched our previous webinar on Amazon’s Mechanical Turk, also known as MTurk, you may find it interesting that TurkPrime offers a proportional matching sampling approach rather than MTurk’s opt-in, convenience sampling approach. Tasks that can be implemented with TurkPrime include: excluding participants on the basis of previous participation, longitudinal studies, making changes to a study while it is running, automating the approval process, increasing the speed of data collection, sending bulk e-mails and bonuses, enhancing communication with participants, monitoring dropout and engagement rates, providing enhanced sampling options, and many others.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article
Carole Goble, "Better Software, Better Research", IEEE Internet Computing, vol. 18, no. 5, Sept.-Oct. 2014, pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
LIBER Webinar: 23 Things About Research Data Management - LIBER Europe
These are the slides for the LIBER Webinar "23 Things About Research Data Management", held on 23 February 2017. A recording of the webinar is available here: https://www.youtube.com/watch?v=HGH6fVHrnKQ
FAIR Data Knowledge Graphs – from Theory to Practice - Tom Plasterer
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen. Our processes enable simple creation of dataset records and linking to source data, providing a seamless federated knowledge graph for novice and advanced users alike.
Presented May 7th, 2019 at the Knowledge Graph Conference, Columbia University.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Current Relevance of HIV
● 33.4 million cases.
● A second growth phase of HIV has already been reported in some countries.
● Need to intensify HIV prevention efforts - this is difficult.
3. How can technology help?
● Philosophical question: Can social networks help in identifying users at high risk of HIV infection?
● Goal of project: Characterize HIV-vulnerable populations by extracting user sentiments from social networks like Twitter.
4. History & Related work
● Epidemiology - Hippocrates, 400 B.C. -> Digital Epidemiology - Marcel Salathe et al., 2012.
● Unraveling Abstinence and Relapse: Smoking Cessation Reflected in Social Media - Dr. Elizabeth Murnane, CHI 2014.
● Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes - Sean D. Young et al., Elsevier Preventive Medicine, 2014.
5. Data source
● 210 notable social networks - 43 Things to Zooppa.
● Twitter was chosen because of the results published in earlier studies.
● Programmatic access to tweets using Streaming API.
○ Sample Hose (~4200 tweets/min)
○ Filter Hose (~40 tweets/min)
○ Fire Hose (~420000 tweets/min)
6. Data collection
● Streaming API
● MongoDB
○ Tweets
○ HIV Corpus
○ HIV Corpus cleaned
○ Related tweets/users
● Neo4j
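As an illustration of the collection pipeline's last step (not the project's actual loader; field names follow Twitter's v1.1 tweet JSON), a tweet document pulled from MongoDB can be flattened into the property-graph nodes and edges that Neo4j stores:

```python
def extract_graph(tweet):
    """Turn one Twitter v1.1-style tweet dict into property-graph
    nodes and edges (USER/TWEET nodes; TWEETED, MENTIONED_IN and
    IS_REPLY_FOR relationships, matching the deck's edge names)."""
    nodes, edges = [], []
    user = tweet["user"]["screen_name"]
    nodes.append(("USER", user))
    nodes.append(("TWEET", tweet["id"]))
    edges.append((user, "TWEETED", tweet["id"]))
    # @-mentions become MENTIONED_IN edges
    for m in tweet.get("entities", {}).get("user_mentions", []):
        nodes.append(("USER", m["screen_name"]))
        edges.append((m["screen_name"], "MENTIONED_IN", tweet["id"]))
    # replies become IS_REPLY_FOR edges between tweets
    if tweet.get("in_reply_to_status_id") is not None:
        edges.append((tweet["id"], "IS_REPLY_FOR",
                      tweet["in_reply_to_status_id"]))
    return nodes, edges
```

The actual import would emit these pairs as Cypher MERGE statements or a batch insert.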
7. Data classification & cleaning
● Classification
○ Filter tweets based on a pre-defined set of HIV risk words.
○ Five risk buckets: Drug, SexVenues, STI, Sex, Homosexual.
● Cleaning
○ Keep or discard tweets based on co-occurring words.
○ Manually reviewed the classified tweets, with Dr. Nella Green’s help, to curate the lists.
○ Exception and inclusion lists for every HIV risk word.
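The keep-or-discard rule can be sketched in Python. This is a minimal illustration, not the project's cleaner; the risk words, exception words and inclusion words below are invented placeholders standing in for the hand-curated lists:

```python
# Hypothetical, tiny stand-ins for the hand-curated lists.
RISK_WORDS = {"coke": "DrugBucket", "dope": "DrugBucket"}
EXCEPTIONS = {"coke": {"diet", "cola", "zero"}}   # co-occurring words that veto a match
INCLUSIONS = {"dope": {"smoke", "score"}}         # co-occurring words required for a match

def classify(tweet_text):
    """Return the set of risk buckets a tweet falls into, honouring
    the per-word exception and inclusion lists."""
    words = set(tweet_text.lower().split())
    buckets = set()
    for word, bucket in RISK_WORDS.items():
        if word not in words:
            continue
        if words & EXCEPTIONS.get(word, set()):
            continue                      # an exception word co-occurs: discard
        required = INCLUSIONS.get(word)
        if required and not (words & required):
            continue                      # no inclusion word co-occurs: discard
        buckets.add(bucket)
    return buckets
```

For example, "diet coke please" is discarded while "lines of coke" lands in the Drug bucket.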
8. Why Graph DB?
● Twitter’s deeply associative data can be easily modeled.
● Most use cases correspond to analyzing sub-structures and connectedness; queries on a graph are much faster than join bombs in relational data models.
● We use Neo4j - a mature and scalable native graph store with good support.
24. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p=(
(n:ONTOLOGY_BUCKET{id: 'DrugBucket'})-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET))
where not
(t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t)
Queries
Output:
8 (1692 ms)
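The `not (t)-[:IS_REPLY_FOR]->(:TWEET)` clause anchors each match at a conversation's root tweet, so every thread is counted exactly once. The same root-finding idea in plain Python over a set of reply edges (illustrative only; it ignores the `*2..` minimum-depth constraint of the Cypher pattern):

```python
def conversation_roots(reply_edges):
    """reply_edges: iterable of (child_tweet, parent_tweet) pairs,
    meaning child IS_REPLY_FOR parent. A root is a tweet that has
    replies but is itself not a reply to anything."""
    children = {c for c, _ in reply_edges}
    parents = {p for _, p in reply_edges}
    return parents - children

# A thread 3 -> 2 -> 1 plus a separate thread 5 -> 4
# has exactly two roots: tweets 1 and 4.
edges = [(3, 2), (2, 1), (5, 4)]
```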
25. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p=((n:ONTOLOGY_BUCKET)-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET))
where n.id in ["HomosexualTermsBucket","STIBucket","SexBucket","SexVenues"]
and not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t);
Output:
20 (2350 ms)
26. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p1=((n:ONTOLOGY_BUCKET)-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET)
-[r3]-
(o:ONTOLOGY_INSTANCE)-[r4]-(p:ONTOLOGY_BUCKET {id: 'DrugBucket'}))
where n.id in ["HomosexualTermsBucket","STIBucket","SexBucket","SexVenues"]
and not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t);
Output:
2 (207952 ms)
27. Conversations among users..
“How many conversations are happening among the drug bucket users alone, sex bucket users alone, and across drug bucket users and sex bucket users?”
MATCH p1=((n:ONTOLOGY_BUCKET {id: 'DrugBucket'})-[r]-(m:ONTOLOGY_INSTANCE)
-[r1]-
(t:TWEET)<-[r2:IS_REPLY_FOR*2..]-(t1:TWEET)
-[r3]-
(o:ONTOLOGY_INSTANCE)-[r4]-(p:ONTOLOGY_BUCKET))
where p.id in ["HomosexualTermsBucket","STIBucket","SexBucket","SexVenues"]
and not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN count(DISTINCT t);
Output:
1 (234202 ms)
28. Finding most referred users..
“List users in the descending order of referral counts”
MATCH p=((u:USER)-[r:MENTIONED_IN]->() )
RETURN u.name,count(p) AS num_mentions
ORDER BY num_mentions DESC limit 5;
Output:
+--------------------------------------+
| u.name | num_mentions |
+--------------------------------------+
| "cc7764343d" | 261 |
| "972b1707f7" | 256 |
| "9be7e77265" | 235 |
| "8dc5aaf21a" | 232 |
| "e1095646aa" | 220 |
+--------------------------------------+
(172 ms)
29. Finding most referred users..
“List users in the descending order of referral counts”
MATCH p=((u:USER)-[r:MENTIONED_IN]->(t) )
where not (t)-[:`IS_REPLY_FOR`]->(:`TWEET`)
RETURN u.name,count(p) AS num_mentions
ORDER BY num_mentions DESC limit 5;
Output:
+----------------------------------+
| u.name | num_mentions |
+----------------------------------+
| "00f4edeac2" | 28 |
| "8987f033aa" | 16 |
| "e6e67c5cef" | 10 |
| "fdf2ce82fd" | 6 |
| "86609dbd6e" | 5 |
+----------------------------------+
(198 ms)
Forbidden substructure
30. Topics of interest around a hub..
“What are the main topics in the discussions among people who are at a one-hop following distance from their sub-graph’s hubs?”
MATCH (n:USER)<-[r:FOLLOWS*1..]-(m)
OPTIONAL MATCH (m)-[r1:TWEETED]->(t:TWEET)-[o]->(p:ONTOLOGY_INSTANCE)-[q]->(s:ONTOLOGY_BUCKET {id:"DrugBucket"})
WITH COUNT(t) as count, n as hub
WHERE count >= 2
MATCH (o:ONTOLOGY_BUCKET)<-[r2*2..2]-(t1:TWEET)<-[:TWEETED]-(neighbour:USER)-[r3:FOLLOWS]-(hub)
return o.id, hub.name, count(t1)
ORDER BY count(t1) DESC limit 5
Output:
+-------------------------------------+
| o.id | hub.name | count(t1) |
+-------------------------------------+
| "SexBucket" | "b4f30295f9" | 1 |
| "DrugBucket" | "b4f30295f9" | 1 |
+-------------------------------------+
(589 ms)
31. Two most consulted drug users..
“The real world data tells us that lots of homosexual (MSM) people consume
drugs or psycho-stimulants. Identify two drug bucket users who are most
consulted by homosexual people on Twitter”
MATCH (o:ONTOLOGY_BUCKET {id:"DrugBucket"})
<-[ri1:INSTANCE_OF]-(oi1:ONTOLOGY_INSTANCE)
<-[rhr1:HAS_RISK_WORD]-(t1:TWEET)
<-[rt1:TWEETED]-(drug:USER)-[:MENTIONED_IN]->
(t:TWEET)<-[rt2:TWEETED]-(homosex:USER)
-[rt3:TWEETED]->(t2:TWEET)-[rhr2:HAS_RISK_WORD]
->(oi2:ONTOLOGY_INSTANCE)-[ri2:INSTANCE_OF]
->(o1:ONTOLOGY_BUCKET {id:"HomosexualTermsBucket"})
RETURN drug.name, count(DISTINCT t)
ORDER BY count(DISTINCT t) DESC
LIMIT 2
Output:
+------------------------------------------+
| drug.name | count(DISTINCT t) |
+------------------------------------------+
| "748d9dc913" | 26 |
| "5a74f759b8" | 13 |
+------------------------------------------+
(13825 ms)
32. Proximity of drug bucket users..
“How close are drug bucket users to other homosexual bucket users in terms
of proximity in the social graph?”
MATCH p =
(o1:ONTOLOGY_BUCKET {id:"HomosexualTermsBucket"})<-[ri1:INSTANCE_OF]-(oi1:ONTOLOGY_INSTANCE)<-[rrw1:HAS_RISK_WORD]-(t1:TWEET)
<-[rt1:TWEETED]-(u1:USER)-[r:FOLLOWS*1..3]->(u2:USER)-[rt2:TWEETED]->(t2:TWEET)
-[rrw2:HAS_RISK_WORD]->(oi2:ONTOLOGY_INSTANCE)-[ri2:INSTANCE_OF]->(o2:ONTOLOGY_BUCKET {id:"DrugBucket"})
return u1.name, length(p), count(u2)
ORDER BY length(p)
Output:
+-------------------------------------+
| u1.name | length(p) | count(u2) |
+-------------------------------------+
| "1b0056b07a"| 7 | 4 |
| "0c384be19a"| 7 | 2 |
+-------------------------------------+
(260 ms)
34. Shortest paths vs. diameter between users
● Finding user-connected components
○ Perform BFS traversal and add a property ‘subgraph’ for each node
○ Forbidden substructure - users can be connected via ontology buckets or ontology instances
● Neo4j Java Traversal Framework API Code Snippet
// Label connected components: BFS over the social relationships only,
// visiting each node exactly once.
Traverser traverser = db.traversalDescription()
    .breadthFirst()                              // BFS, not depth-first
    .relationships(RelTypes.TWEETED)
    .relationships(RelTypes.FOLLOWS)
    .relationships(RelTypes.IS_REPLY_FOR)
    .relationships(RelTypes.MENTIONED_IN)
    .evaluator(Evaluators.excludeStartPosition())
    .uniqueness(Uniqueness.NODE_GLOBAL)          // visit each node once
    .traverse(n);
(Restricting traversal to these relationship types eliminates the forbidden substructure.)
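The same component labelling can be expressed without Neo4j; a stdlib-only Python sketch (illustrative, not the project's code) over an undirected adjacency list:

```python
from collections import deque

def label_subgraphs(adj):
    """adj: {node: [neighbours]}. Returns {node: subgraph_id},
    assigning one id per connected component via BFS, mirroring the
    Neo4j traversal that sets a 'subgraph' property on each node."""
    label, next_id = {}, 0
    for start in adj:
        if start in label:
            continue                 # already reached from an earlier BFS
        label[start] = next_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in label:
                    label[nb] = next_id
                    queue.append(nb)
        next_id += 1
    return label
```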
35. Shortest paths vs. diameter between users
Find the average shortest path between any 2 users in a connected component and compare it to the diameter of that component.
match (n:USER) WITH n.subgraph as subGraphNum, count(n) as c
WHERE c >= 7
WITH collect(subGraphNum) as collectionSG
MATCH p=shortestPath((s:USER)-[:FOLLOWS|MENTIONED_IN|TWEETED|IS_REPLY_FOR*..]-(d:USER))
WHERE s.subgraph=d.subgraph and s.subgraph in collectionSG and length(p)>1
RETURN s.subgraph, sum(length(p))/count(p), max(length(p)), ((sum(length(p))/count(p))*1.0)
/max(length(p))
ORDER BY ((sum(length(p))/count(p))*1.0)/max(length(p)) DESC
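The metric computed by the Cypher above, average shortest path over a component's diameter, can also be computed directly with BFS. A stdlib-only sketch assuming a single connected, undirected component (illustrative):

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def avg_path_and_diameter(adj):
    """Average pairwise shortest-path length and diameter of one
    connected component given as {node: [neighbours]}."""
    lengths = [bfs_dist(adj, s)[d] for s, d in combinations(adj, 2)]
    return sum(lengths) / len(lengths), max(lengths)
```

On a path graph a-b-c this gives an average of 4/3 and a diameter of 2; their ratio is the value the slide's query ranks components by.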
42. Commonly discussed topics around Sex Venues
● Some tweets are geotagged
○ Neo4j Spatial plugin to create spatial index on tweets
● Find tweets tweeted near a specific Sex Venue
○ Perform a withinDistance query for the coordinates of the sex venue
● Are these tweets talking about specific topics?
○ Topic Modeling - LDA (Gensim) on tweets
43. Commonly discussed topics around Sex Venues
Find what topic HIV risk users are talking about the most around a particular Sex Venue.
REST API Code Snippet
import json
import requests

# Ask the Neo4j Spatial plugin for tweets within 2 km of the venue's
# coordinates (the point below is in downtown San Diego).
headers = {'content-type': 'application/json'}
url = "http://localhost:7474/db/data/ext/SpatialPlugin/graphdb/findGeometriesWithinDistance"
payload = {
    "layer" : "geom",
    "pointX" : -117.161324,
    "pointY" : 32.710671,
    "distanceInKm" : 2
}
r = requests.post(url, data=json.dumps(payload), headers=headers)
44. LDA on Tweets found around Sex Venues
● Cleaning tweets - remove mentions, URLs
● Stop word list - NLTK library
● Gensim - corpora & LDA modules
● Free parameters
○ Number of topics - 2, 3, 4
○ Distance radius for ‘withinDistance’ query - 2, 5, 10 km
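The cleaning step before LDA can be sketched as follows; mentions and URLs are stripped with regexes, then stop words are filtered. The stop-word set here is a tiny invented placeholder, not NLTK's list:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "at", "to", "and"}  # placeholder list

def clean_tweet(text):
    """Strip @-mentions and URLs, lowercase, drop stop words;
    returns the token list that would be fed to topic modelling."""
    text = re.sub(r"@\w+", " ", text)           # remove mentions
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token lists become the Gensim corpus on which the LDA model is trained.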
45. LDA on Colocated Tweets - Results
Topic #1 gay, san, diego, queen, flicks, glass, amp, dont, coke, get
Topic #2 gay, san, diego, ca, glass, amp, cheers, flicks, bourbon, happy
Drug & Homosexual bucket: coke, glass, dope, gay, queen, ...
Sex Venues Bucket: groovy laid back amp nasty, cheers, flicks, bourbon street, club san diego, pecs, ...
46. Interesting patterns..
“What is the longest conversation thread among any set of users?”
MATCH p = (n:TWEET)<-[r:IS_REPLY_FOR*]-(m:TWEET) RETURN p ORDER BY length(p) DESC LIMIT 1
205 nodes
(157179 ms)
48. Challenges
● Data collection
○ Sampled data (1%)
○ Twitter APIs call rate limit per user - 15 calls/15 mins.
○ Collecting users who have favorited a tweet.
○ Extracting conversations/retweet chains associated with a tweet.
● Data classification and cleaning
○ Working with microblogs.
○ Iterative process.
● Restricted visualization for Neo4j
○ Hard to decipher patterns in graph.
49. Future
● More representative dataset - Firehose API
● Innovative Data Visualizations to visualize evolving graphs
● Machine Learning for better HIV risk tweets classification.
○ Mechanical Turk for labeling
○ Logistic Regression for classification
● SD Primary Infection Cohort - overlaying a real-world HIV infection graph on top of an enriched social network
50. Conclusion
● A structured approach to model social networks and derive insights from networks like Twitter; best practices for collecting and managing Twitter data for social network analysis.
● Current results - graph queries that derive intuitions on factors influencing HIV risk behaviour.
● Vision for the future.