Mining Social Data
     FOSDEM 2013
Credits
  Speaker     Romeu "@malk_zameth" MOURA
  Company     @linagora
  License     CC-BY-SA 3.0
  SlideShare  j.mp/XXgBAn
  Sources     ● Mining Graph Data
              ● Mining the Social Web
              ● Social Network Analysis for Startups
              ● Social Media Mining and Social Network Analysis
              ● Graph Mining
I work at Linagora, a French FLOSS company.
EloData & OpenGraphMiner
Linagora's foray into ESN, data storage, graphs & mining.
Why mine social data at all?
Without being a creepy stalker
To see what humans can't.
Influence, centers of interest.
To remember what humans can't.
What worked in the past? Objectively, how did I behave until now?
To discover what humans won't.
Serendipity
Find what you were not looking for
Real-life social data
What is so specific about it?
Always graphs
Dense substructures
Every vertex is a unique entity (someone).
Several dense subgraphs: relations of pockets of people
Usually it has no good cuts
Even the best partitioning algorithms cannot find partitions that simply are not there.
There will be errors & unknowns
 Exact matching is not an option
Plenty of pollution from vanity metrics.
Sometimes very surprising ones.
Number of followers is a vanity metric
@GuyKawasaki (~1.5M followers) is much more retweeted than the user with the most followers (@justinbieber, ~34M).
Why use graphs?
What is the itch with Inductive Logic that Inductive Graphs scratch?
'Classic' Data Mining
Pros and cons
pro: Solid, well-known techniques with good performance
con: Complex structures are translated
into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.
Graph Mining
'The new deal'
pro: Expressiveness and simplicity
The input and output are graphs: no conversions, graph algorithms all around.
con: The unit of operation is comparing subgraph isomorphisms,
which is NP-complete.
Extraction
 Getting the data
Is the easy part
A commodity, really.
Social networks provide APIs
Facebook Graph API, Twitter REST API, Yammer API, etc.
Worst case:
Crawl the website
Crawling The Web For Fun And Profit:
   http://youtu.be/eQtxbaw__W8
import sys

import twitter
import networkx as nx

from recipe__get_rt_origins import get_rt_origins


def create_rt_graph(tweets):
    """Build a directed retweet graph: each edge points from the
    original author to the user who retweeted them."""
    g = nx.DiGraph()
    for tweet in tweets:
        rt_origins = get_rt_origins(tweet)
        if not rt_origins:
            continue  # not a retweet
        for rt_origin in rt_origins:
            g.add_edge(rt_origin, tweet['from_user'],
                       tweet_id=tweet['id'])
    return g


if __name__ == '__main__':
    Q = ' '.join(sys.argv[1:])  # search query from the command line
    MAX_PAGES = 15
    RESULTS_PER_PAGE = 100

    twitter_search = twitter.Twitter(domain='search.twitter.com')
    search_results = []
    for page in range(1, MAX_PAGES + 1):
        search_results.append(
            twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
        )
    all_tweets = [tweet for page in search_results
                  for tweet in page['results']]
    g = create_rt_graph(all_tweets)

    print("Num nodes:", g.number_of_nodes(), file=sys.stderr)
    print("Num edges:", g.number_of_edges(), file=sys.stderr)
    print("Num connected components:",
          len(list(nx.connected_components(g.to_undirected()))),
          file=sys.stderr)
    print("Node degrees:", sorted(d for _, d in g.degree()),
          file=sys.stderr)
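Saved as, say, rt_graph.py (the filename is illustrative), a run looks like "python rt_graph.py fosdem"; multi-word queries work because sys.argv[1:] is joined into one string. Note that the old search.twitter.com v1 endpoint this recipe targets has since been retired.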
Finding patterns
  substructures that repeat
Older options
Apriori-based, pattern growth
Stepwise pair expansion
Separate the graph by pairs, count frequencies, keep the most frequent, augment them by one; repeat (a sketch follows the steps below).
"Chunk": Separate the graph by pairs
Keep only the frequent ones
Expand them
Find your frequent pattern
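A minimal sketch of the chunk-and-count step in Python, assuming a networkx graph whose nodes carry a 'label' attribute; the attribute name and the helper frequent_pairs are illustrative, not from the talk:

from collections import Counter

def frequent_pairs(g, threshold):
    """Count each labeled node pair joined by an edge and keep
    the pairs occurring at least `threshold` times."""
    counts = Counter()
    for u, v in g.edges():
        # Sort the two labels so (a, b) and (b, a) count as one pair.
        pair = tuple(sorted((g.nodes[u]['label'], g.nodes[v]['label'])))
        counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}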
con: Chunkiness
"ChunkingLess"
Graph Based Induction
     CL-CBI [Cook et. al.]
Inputs needed
1. Minimal frequency at which we consider a conformation to be a pattern: threshold
2. Number of most frequent patterns we will retain: beam size
3. Arbitrary number of times we will iterate: levels
1. "Chunk": Separate the graph by
pairs
2. Select beam-size most frequent
ones
3. Turn selected pairs into pseudo-
nodes
4. Expand & Rechunk
Keep going back to step 2
    Until you have done it levels times.
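A rough Python sketch of that loop, reusing frequent_pairs from the sketch above. Contracting edges with nx.contracted_nodes is a simplification: real CL-GBI registers pseudo-nodes without rewriting the original graph.

import networkx as nx

def clgbi(g, threshold, beam_size, levels):
    """Chunkingless graph-based induction, heavily simplified:
    count labeled pairs, keep the beam_size most frequent ones,
    contract matching edges into pseudo-nodes, and repeat."""
    patterns = []
    for _ in range(levels):
        counts = frequent_pairs(g, threshold)
        beam = sorted(counts, key=counts.get, reverse=True)[:beam_size]
        patterns.extend(beam)
        for u, v in list(g.edges()):
            if u not in g or v not in g:
                continue  # an endpoint was already contracted away
            pair = tuple(sorted((g.nodes[u]['label'],
                                 g.nodes[v]['label'])))
            if pair in beam:
                # Merge v into u and relabel u as a pseudo-node.
                g = nx.contracted_nodes(g, u, v, self_loops=False)
                g.nodes[u]['label'] = '+'.join(pair)
    return patterns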
Decision Trees
A Tree of patterns
Finding a pattern on a branch yields a decision
DT-CLGBI
DT-CLGBI(graph set: D)
begin
  create node DT for D
  if threshold attained
     return DT as a leaf
  else
     P <- select_most_discriminative(CL-GBI(D))
     (Dy, Dn) <- branch_DT_on_predicate(P)
     DT.branch_yes <- DT-CLGBI(Dy)
     DT.branch_no  <- DT-CLGBI(Dn)
     return DT
end
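The same recursion as a Python sketch. Here find_patterns stands in for the CL-GBI call and contains for the (NP-complete) subgraph test; both are placeholders, and each graph is assumed to carry its class label in g.graph['class']:

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels) or 1
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def most_discriminative(patterns, graphs, contains):
    # Pick the pattern whose yes/no split has the lowest impurity.
    def impurity(p):
        yes = [g.graph['class'] for g in graphs if contains(g, p)]
        no = [g.graph['class'] for g in graphs if not contains(g, p)]
        n = len(graphs)
        return len(yes) / n * gini(yes) + len(no) / n * gini(no)
    return min(patterns, key=impurity)

class Node:
    def __init__(self, pattern=None, label=None):
        self.pattern = pattern   # pattern tested at this node
        self.label = label       # class label when this is a leaf
        self.yes = self.no = None

def dt_clgbi(graphs, find_patterns, contains, min_size=5):
    labels = [g.graph['class'] for g in graphs]
    if len(graphs) < min_size or len(set(labels)) == 1:
        # Threshold attained: leaf with the majority class.
        return Node(label=max(set(labels), key=labels.count))
    patterns = find_patterns(graphs)
    if not patterns:
        return Node(label=max(set(labels), key=labels.count))
    p = most_discriminative(patterns, graphs, contains)
    d_yes = [g for g in graphs if contains(g, p)]
    d_no = [g for g in graphs if not contains(g, p)]
    if not d_yes or not d_no:
        return Node(label=max(set(labels), key=labels.count))
    node = Node(pattern=p)
    node.yes = dt_clgbi(d_yes, find_patterns, contains, min_size)
    node.no = dt_clgbi(d_no, find_patterns, contains, min_size)
    return node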
