Mining Social Data
     FOSDEM 2013
Credits
  Speaker     Romeu "@malk_zameth" MOURA
  Company     @linagora
  License     CC-BY-SA 3.0
  SlideShare  j.mp/XXgBAn
  Sources     ● Mining Graph Data
              ● Mining the Social Web
              ● Social Network Analysis for Startups
              ● Social Media Mining and Social Network Analysis
              ● Graph Mining
I work at Linagora, a French FLOSS company.
EloData & OpenGraphMiner
Linagora's foray into ESN, data storage, graphs & mining.
Why mine social data at all?
Without being a creepy stalker
To see what humans can't.
Influence, centers of interest.
To remember what humans can't.
What worked in the past? Objectively, how did I behave until now?
To discover what humans won't.
Serendipity
Find what you were not looking for
Real-life social data
What is so specific about it?
Always graphs
Dense substructures
Every vertex is a unique entity (someone).
Several dense subgraphs: relations of pockets of people
Usually it has no good cuts
Even the best partitioning algorithms cannot find partitions that simply are not there.
There will be errors & unknowns
 Exact matching is not an option
Plenty of pollution from vanity metrics.
Sometimes very surprising ones.
Number of followers is a vanity metric
@GuyKawasaki (~1.5M followers) is much more retweeted than the user with the most followers (@justinbieber, ~34M).
Why use graphs?
What is the itch with Inductive Logic that Inductive Graphs scratch?
'Classic' Data Mining
Pros and cons
pro: Solid, well-known techniques with good performance
con: Complex structures are translated
into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.
Graph Mining
'The new deal'
pro: Expressiveness and simplicity
The input and output are graphs: no conversions, graph algorithms all around.
con: The unit of operation is comparing subgraph isomorphisms,
which is NP-complete.
Extraction
 Getting the data
Is the easy part
A commodity, really.
Social networks provide APIs
Facebook Graph API, Twitter REST API, Yammer API, etc.
Worst case:
Crawl the website
Crawling The Web For Fun And Profit:
   http://youtu.be/eQtxbaw__W8
import sys

import twitter
import networkx as nx

from recipe__get_rt_origins import get_rt_origins


def create_rt_graph(tweets):
    """Build a directed retweet graph: each edge points from the
    original author to the user who retweeted them."""
    g = nx.DiGraph()
    for tweet in tweets:
        rt_origins = get_rt_origins(tweet)
        if not rt_origins:
            continue  # not a retweet
        for rt_origin in rt_origins:
            g.add_edge(rt_origin, tweet['from_user'],
                       tweet_id=tweet['id'])
    return g


if __name__ == '__main__':
    Q = ' '.join(sys.argv[1:])  # search query from the command line
    MAX_PAGES = 15
    RESULTS_PER_PAGE = 100

    twitter_search = twitter.Twitter(domain='search.twitter.com')
    search_results = []
    for page in range(1, MAX_PAGES + 1):
        search_results.append(
            twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
        )
    all_tweets = [tweet for page in search_results
                  for tweet in page['results']]
    g = create_rt_graph(all_tweets)

    print("Num nodes:", g.number_of_nodes(), file=sys.stderr)
    print("Num edges:", g.number_of_edges(), file=sys.stderr)
    print("Num connected components:",
          len(list(nx.connected_components(g.to_undirected()))),
          file=sys.stderr)
    print("Node degrees:", sorted(d for _, d in g.degree()),
          file=sys.stderr)
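Saved as, say, rt_graph.py (the filename is illustrative), a run looks like "python rt_graph.py fosdem"; multi-word queries work because sys.argv[1:] is joined into one string. Note that the old search.twitter.com v1 endpoint this recipe targets has since been retired.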
Finding patterns
  substructures that repeat
Older options
Apriori-based, pattern growth
Stepwise pair expansion
Separate the graph by pairs, count frequencies, keep the most frequent, augment them by one; repeat (a sketch follows the steps below).
"Chunk": Separate the graph by pairs
Keep only the frequent ones
Expand them
Find your frequent pattern
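A minimal sketch of the chunk-and-count step in Python, assuming a networkx graph whose nodes carry a 'label' attribute; the attribute name and the helper frequent_pairs are illustrative, not from the talk:

from collections import Counter

def frequent_pairs(g, threshold):
    """Count each labeled node pair joined by an edge and keep
    the pairs occurring at least `threshold` times."""
    counts = Counter()
    for u, v in g.edges():
        # Sort the two labels so (a, b) and (b, a) count as one pair.
        pair = tuple(sorted((g.nodes[u]['label'], g.nodes[v]['label'])))
        counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}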
con: Chunkiness
"ChunkingLess"
Graph Based Induction
     CL-CBI [Cook et. al.]
Inputs needed
1. Minimal frequency at which we consider a conformation to be a pattern: threshold
2. Number of most frequent patterns we will retain: beam size
3. Arbitrary number of times we will iterate: levels
1. "Chunk": Separate the graph by
pairs
2. Select beam-size most frequent
ones
3. Turn selected pairs into pseudo-
nodes
4. Expand & Rechunk
Keep going back to step 2
    Until you have done it levels times.
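A rough Python sketch of that loop, reusing frequent_pairs from the sketch above. Contracting edges with nx.contracted_nodes is a simplification: real CL-GBI registers pseudo-nodes without rewriting the original graph.

import networkx as nx

def clgbi(g, threshold, beam_size, levels):
    """Chunkingless graph-based induction, heavily simplified:
    count labeled pairs, keep the beam_size most frequent ones,
    contract matching edges into pseudo-nodes, and repeat."""
    patterns = []
    for _ in range(levels):
        counts = frequent_pairs(g, threshold)
        beam = sorted(counts, key=counts.get, reverse=True)[:beam_size]
        patterns.extend(beam)
        for u, v in list(g.edges()):
            if u not in g or v not in g:
                continue  # an endpoint was already contracted away
            pair = tuple(sorted((g.nodes[u]['label'],
                                 g.nodes[v]['label'])))
            if pair in beam:
                # Merge v into u and relabel u as a pseudo-node.
                g = nx.contracted_nodes(g, u, v, self_loops=False)
                g.nodes[u]['label'] = '+'.join(pair)
    return patterns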
Decision Trees
A Tree of patterns
Finding a pattern on a branch yields a decision
DT-CLGBI
DT-CLGBI(graph set: D)
begin
  create node DT for D
  if threshold attained
     return DT as a leaf
  else
     P <- select_most_discriminative(CL-GBI(D))
     (Dy, Dn) <- branch_DT_on_predicate(P)
     DT.branch_yes <- DT-CLGBI(Dy)
     DT.branch_no  <- DT-CLGBI(Dn)
     return DT
end
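The same recursion as a Python sketch. Here find_patterns stands in for the CL-GBI call and contains for the (NP-complete) subgraph test; both are placeholders, and each graph is assumed to carry its class label in g.graph['class']:

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels) or 1
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def most_discriminative(patterns, graphs, contains):
    # Pick the pattern whose yes/no split has the lowest impurity.
    def impurity(p):
        yes = [g.graph['class'] for g in graphs if contains(g, p)]
        no = [g.graph['class'] for g in graphs if not contains(g, p)]
        n = len(graphs)
        return len(yes) / n * gini(yes) + len(no) / n * gini(no)
    return min(patterns, key=impurity)

class Node:
    def __init__(self, pattern=None, label=None):
        self.pattern = pattern   # pattern tested at this node
        self.label = label       # class label when this is a leaf
        self.yes = self.no = None

def dt_clgbi(graphs, find_patterns, contains, min_size=5):
    labels = [g.graph['class'] for g in graphs]
    if len(graphs) < min_size or len(set(labels)) == 1:
        # Threshold attained: leaf with the majority class.
        return Node(label=max(set(labels), key=labels.count))
    patterns = find_patterns(graphs)
    if not patterns:
        return Node(label=max(set(labels), key=labels.count))
    p = most_discriminative(patterns, graphs, contains)
    d_yes = [g for g in graphs if contains(g, p)]
    d_no = [g for g in graphs if not contains(g, p)]
    if not d_yes or not d_no:
        return Node(label=max(set(labels), key=labels.count))
    node = Node(pattern=p)
    node.yes = dt_clgbi(d_yes, find_patterns, contains, min_size)
    node.no = dt_clgbi(d_no, find_patterns, contains, min_size)
    return node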
