Mining social data


FOSDEM 2013 presentation on techniques for mining the social web with graphs.


  1. Mining Social Data (FOSDEM 2013)
  2. Credits. Speaker: Romeu "@malk_zameth" MOURA. Company: @linagora. License: CC-BY-SA 3.0. Sources: Mining Graph Data; Mining the Social Web; Social Network Analysis for Startups; Social Media Mining and Social Network Analysis; Graph Mining.
  3. I work at Linagora, a French FLOSS company.
  4. EloData & OpenGraphMiner: Linagora's foray into ESN (enterprise social networks), data storage, graphs & mining.
  5. Why mine social data at all? Without being a creepy stalker.
  6. To see what humans can't: influence, centers of interest.
  7. To remember what humans can't: what worked in the past? Objectively, how did I behave until now?
  8. To discover what humans won't.
  9. Serendipity: find what you were not looking for.
  10. Real-life social data: what is so specific about it?
  11. Always graphs.
  12. Dense substructures: every vertex is a unique entity (someone), and there are several dense subgraphs, the relations of pockets of people.
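
      A minimal NetworkX sketch of the idea (the toy graph and the names are invented for illustration): maximal cliques surface the dense pockets.

          import networkx as nx

          # Toy social graph: two tight pockets of friends joined by one bridge edge.
          g = nx.Graph()
          g.add_edges_from([("ana", "bob"), ("bob", "eve"), ("eve", "ana"),   # pocket 1
                            ("dan", "liz"), ("liz", "tom"), ("tom", "dan"),   # pocket 2
                            ("eve", "dan")])                                  # bridge

          # Each maximal clique is one of the dense substructures.
          for clique in nx.find_cliques(g):
              print(clique)
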
  13. Usually it has no good cuts: even the best partitioning algorithms cannot find partitions that simply are not there.
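
      One way to see this is to compute a global minimum cut on a real social graph. A sketch using Zachary's karate club (bundled with NetworkX): the Stoer-Wagner cut typically peels off one weakly attached member instead of recovering the two factions the club actually split into.

          import networkx as nx

          g = nx.karate_club_graph()  # a classic small social network
          cut_value, (part_a, part_b) = nx.stoer_wagner(g)
          # The "best" cut is tiny and wildly unbalanced: it isolates a
          # weakly attached node rather than splitting the two factions.
          print(cut_value, len(part_a), len(part_b))
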
  14. There will be errors & unknowns: exact matching is not an option.
  15. Plenty of vanity-metric pollution, sometimes of very surprising kinds.
  16. Number of followers is a vanity metric: @GuyKawasaki (~1.5M followers) is retweeted much more than the user with the most followers (@justinbieber, ~34M).
  17. Why use graphs? What is the itch with inductive logic that inductive graphs scratch?
  18. Classic data mining: pros and cons.
  19. Pro: solid, well-known techniques with good performance.
  20. Con: complex structures must be translated into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.
  21. Graph mining: the new deal.
  22. Pro: expressiveness and simplicity. The input and the output are graphs: no conversions, graph algorithms all around.
  23. Con: the unit of operation is comparing subgraph isomorphisms, an NP-complete problem.
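
      For instance, with NetworkX's VF2 matcher (a sketch; the toy graph is invented):

          import networkx as nx
          from networkx.algorithms import isomorphism

          g = nx.Graph([("ana", "bob"), ("bob", "eve"), ("eve", "ana"), ("eve", "dan")])
          pattern = nx.complete_graph(3)  # the substructure we look for: a triangle

          # Subgraph-isomorphism testing is NP-complete in general; VF2 just
          # makes the exponential search practical on small instances.
          matcher = isomorphism.GraphMatcher(g, pattern)
          print(matcher.subgraph_is_isomorphic())  # True: g contains a triangle
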
  24. Extraction: getting the data.
  25. It is the easy part; a commodity, really.
  26. Social networks provide APIs: the Facebook Graph API, the Twitter REST API, the Yammer API, etc.
  27. Worst case: crawl the website (crawling the web for fun and profit).
  28. Building a retweet graph from Twitter search results (Python 2):

      import sys
      import twitter
      import networkx as nx
      from recipe__get_rt_origins import get_rt_origins

      def create_rt_graph(tweets):
          # One directed edge per retweet: original author -> retweeter.
          g = nx.DiGraph()
          for tweet in tweets:
              rt_origins = get_rt_origins(tweet)
              if not rt_origins:
                  continue
              for rt_origin in rt_origins:
                  g.add_edge(rt_origin.encode('ascii', 'ignore'),
                             tweet['from_user'].encode('ascii', 'ignore'),
                             {'tweet_id': tweet['id']})
          return g

      if __name__ == '__main__':
          Q = ' '.join(sys.argv[1:])
          MAX_PAGES = 15
          RESULTS_PER_PAGE = 100

          # Page through the (pre-v1.1) Twitter search API.
          twitter_search = twitter.Twitter(domain='search.twitter.com')
          search_results = []
          for page in range(1, MAX_PAGES + 1):
              search_results.append(
                  twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page))
          all_tweets = [tweet for page in search_results
                        for tweet in page['results']]

          g = create_rt_graph(all_tweets)
          print >> sys.stderr, "Number nodes:", g.number_of_nodes()
          print >> sys.stderr, "Num edges:", g.number_of_edges()
          print >> sys.stderr, "Num connected components:", \
              len(nx.connected_components(g.to_undirected()))
          print >> sys.stderr, "Node degrees:", sorted(nx.degree(g))
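
      To try the recipe (it was written against the Python 2-era twitter and networkx 1.x libraries), save it as, say, rt_graph.py (filename hypothetical) and run python rt_graph.py fosdem. Note that the pre-v1.1 Twitter search endpoint it calls has since been retired, so treat it as historical.
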
  29. Finding patterns: substructures that repeat.
  30. Older options: Apriori-based, pattern growth.
  31. Stepwise pair expansion: separate the graph into pairs, count frequencies, keep the most frequent, augment them by one, repeat.
  32. "Chunk": separate the graph into pairs.
  33. Keep only the frequent ones.
  34. Expand them.
  35. Find your frequent pattern.
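
      A minimal sketch of the chunk-and-count steps, assuming each vertex carries a "label" attribute (the helper name and the toy graph are invented):

          from collections import Counter
          import networkx as nx

          def frequent_pairs(g, threshold):
              # "Chunk": view the graph as labelled vertex pairs (one per edge),
              # count how often each labelling occurs, keep the frequent ones.
              counts = Counter()
              for u, v in g.edges():
                  pair = tuple(sorted((g.nodes[u]["label"], g.nodes[v]["label"])))
                  counts[pair] += 1
              return {pair: n for pair, n in counts.items() if n >= threshold}

          g = nx.Graph()
          g.add_nodes_from([(1, {"label": "A"}), (2, {"label": "B"}),
                            (3, {"label": "A"}), (4, {"label": "B"}),
                            (5, {"label": "C"})])
          g.add_edges_from([(1, 2), (3, 4), (3, 5)])
          print(frequent_pairs(g, threshold=2))  # {('A', 'B'): 2}
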
  36. Con: chunkiness. Once a pair is chunked, the choice is never revisited.
  37. "Chunkingless" Graph-Based Induction: Cl-GBI [Cook et al.].
  38. Inputs needed: (1) the minimal frequency at which a conformation counts as a pattern: the threshold; (2) the number of most frequent patterns to retain: the beam size; (3) an arbitrary number of iterations: the levels.
  39. 1. "Chunk": separate the graph into pairs.
  40. 2. Select the beam-size most frequent ones.
  41. 3. Turn the selected pairs into pseudo-nodes.
  42. 4. Expand & rechunk.
  43. Keep going back to step 2 until you have done it levels times.
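
      Putting the steps together, a heavily simplified sketch building on the frequent_pairs helper above. Real Cl-GBI only registers pseudo-nodes and never rewrites the graph (that is what "chunkingless" means); this sketch contracts pairs in a copy instead, which is closer to plain GBI.

          import itertools

          def contract_pairs(g, best_pairs):
              # Step 3: turn each occurrence of a selected pair into a pseudo-node.
              h = g.copy()
              fresh = itertools.count(max(h.nodes) + 1)  # assumes integer node ids
              for u, v in list(h.edges()):
                  if u not in h or v not in h:
                      continue  # an endpoint was already merged away
                  pair = tuple(sorted((h.nodes[u]["label"], h.nodes[v]["label"])))
                  if pair in best_pairs:
                      w = next(fresh)
                      h.add_node(w, label="+".join(pair))
                      for n in (set(h[u]) | set(h[v])) - {u, v}:
                          h.add_edge(w, n)  # reattach the pair's neighbours
                      h.remove_nodes_from([u, v])
              return h

          def cl_gbi(g, threshold, beam, levels):
              # The three inputs from slide 38: threshold, beam size, levels.
              patterns = {}
              for _ in range(levels):
                  pairs = frequent_pairs(g, threshold)                      # 1. chunk
                  best = sorted(pairs, key=pairs.get, reverse=True)[:beam]  # 2. select
                  patterns.update((p, pairs[p]) for p in best)
                  g = contract_pairs(g, set(best))                          # 3-4. expand & rechunk
              return patterns
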
  44. Decision trees.
  45. A tree of patterns: finding a pattern on a branch yields a decision.
  46. DT-ClGBI.
  47. DT-ClGBI pseudocode:

      DT-ClGBI(graph D)
      begin
          create node DT for D
          if threshold attained
              return DT                       // leaf: stop splitting
          else
              // the most discriminative pattern found by Cl-GBI is the test
              P <- select_most_discriminative(Cl-GBI(D))
              (Dy, Dn) <- branch_DT_on_predicate(P)  // graphs with / without P
              for Di in Dy
                  DT.branch_yes.add_child(DT-ClGBI(Di))
              for Di in Dn
                  DT.branch_no.add_child(DT-ClGBI(Di))
      end