Mining social data

FOSDEM 2013 presentation on techniques used for mining the social web with graphs.

Published in: Technology

Transcript

  • 1. Mining Social Data FOSDEM 2013
  • 2. Credits. Speaker: Romeu "@malk_zameth" MOURA. Company: @linagora. License: CC-BY-SA 3.0. SlideShare: j.mp/XXgBAn. Sources: Mining Graph Data; Mining the Social Web; Social Network Analysis for startups; Social Media Mining and Social Network Analysis; Graph Mining.
  • 3. I work at Linagora, a French FLOSS company.
  • 4. EloData & OpenGraphMiner: Linagora's foray into ESN, data storage, graphs & mining.
  • 5. Why mine social data at all? Without being a creepy stalker
  • 6. To see what humans can't. Influence, centers of interest.
  • 7. To remember what humans can't. What worked in the past? Objectively, how did I behave until now?
  • 8. To discover what humans won't.
  • 9. Serendipity: Find what you were not looking for.
  • 10. Real-life social data: What is so specific about it?
  • 11. Always graphs
  • 12. Dense substructures: Every vertex is a unique entity (someone). Several dense subgraphs: relations within pockets of people.
  • 13. Usually it has no good cuts: Even the best partition algorithms cannot find partitions that are just not there.
  • 14. There will be errors & unknowns: Exact matching is not an option.
  • 15. Plenty of vanity metrics pollution. Sometimes very surprising ones.
  • 16. Number of followers is a vanity metric: @GuyKawasaki (~1.5M followers) is much more retweeted than the user with the most followers (@justinbieber, ~34M).
  • 17. Why use graphs? What is the itch with Inductive Logic that Inductive Graphs scratch?
  • 18. Classic Data Mining: Pros and cons
  • 19. pro: Solid known techniques of good performance
  • 20. con: Complex structures are translated into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.
  • 21. Graph Mining: The new deal
  • 22. pro: Expressiveness and simplicity. The input and output are graphs, no conversions, graph algorithms all around.
  • 23. con: The unit of operation is comparing (sub)graph isomorphisms, and subgraph isomorphism is NP-complete.
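To make the cost on slide 23 concrete, here is a minimal sketch of the two checks a graph miner keeps repeating, using networkx (the library the deck itself imports later). The toy graphs are made up for illustration and do not come from the talk.

```python
import networkx as nx
from networkx.algorithms import isomorphism

# A triangle, and the same triangle with different node names.
triangle = nx.cycle_graph(3)
relabeled = nx.relabel_nodes(triangle, {0: "a", 1: "b", 2: "c"})

# Whole-graph isomorphism: same shape, whatever the node names are.
same_shape = nx.is_isomorphic(triangle, relabeled)

# Subgraph isomorphism: does the pattern occur inside a bigger graph?
# GraphMatcher checks for a node-induced subgraph matching the pattern.
clique = nx.complete_graph(4)
path = nx.path_graph(4)
clique_has_triangle = isomorphism.GraphMatcher(clique, triangle).subgraph_is_isomorphic()
path_has_triangle = isomorphism.GraphMatcher(path, triangle).subgraph_is_isomorphic()

print(same_shape, clique_has_triangle, path_has_triangle)
```

Every candidate pattern has to be matched this way against the data, which is why the mining algorithms further on work hard to limit how many candidates they keep.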
  • 24. Extraction: Getting the data
  • 25. Is the easy part: a commodity, really.
  • 26. Social networks provide APIs: Facebook Graph API, Twitter REST API, Yammer API, etc.
  • 27. Worst case: Crawl the website. Crawling The Web For Fun And Profit: http://youtu.be/eQtxbaw__W8
  • 28.
      import sys
      import twitter
      import networkx as nx
      from recipe__get_rt_origins import get_rt_origins


      def create_rt_graph(tweets):
          """Build a directed graph with one edge from each retweet origin to the retweeter."""
          g = nx.DiGraph()
          for tweet in tweets:
              rt_origins = get_rt_origins(tweet)
              if not rt_origins:
                  continue
              for rt_origin in rt_origins:
                  # Keep the tweet id as an edge attribute
                  g.add_edge(rt_origin, tweet["from_user"], tweet_id=tweet["id"])
          return g


      if __name__ == "__main__":
          # The search query is everything given on the command line
          Q = " ".join(sys.argv[1:])
          MAX_PAGES = 15
          RESULTS_PER_PAGE = 100

          # search.twitter.com is the 2013-era search endpoint, since retired
          twitter_search = twitter.Twitter(domain="search.twitter.com")
          search_results = []
          for page in range(1, MAX_PAGES + 1):
              search_results.append(
                  twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
              )

          all_tweets = [tweet for page in search_results for tweet in page["results"]]
          g = create_rt_graph(all_tweets)

          print("Number nodes:", g.number_of_nodes(), file=sys.stderr)
          print("Num edges:", g.number_of_edges(), file=sys.stderr)
          print("Num connected components:",
                nx.number_connected_components(g.to_undirected()), file=sys.stderr)
          print("Node degrees:", sorted(d for _, d in g.degree()), file=sys.stderr)
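The script on slide 28 imports get_rt_origins from a module the slide does not show. The sketch below is one plausible stand-in, not the original helper: the regular expression, the sample tweets and their field values are all assumptions made up for illustration. It lets the graph-building step run offline, without any Twitter API.

```python
import re
import networkx as nx

# Assumed convention: "RT" or "via" followed by one or more @mentions.
RT_PATTERN = re.compile(r"\b(?:RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)


def get_rt_origins(tweet):
    """Return the lowercased screen names a tweet credits as retweet origins."""
    origins = set()
    for mentions in RT_PATTERN.findall(tweet["text"]):
        origins.update(name.lower() for name in re.findall(r"@(\w+)", mentions))
    return sorted(origins)


def create_rt_graph(tweets):
    """Directed graph with one edge from each retweet origin to the retweeter."""
    g = nx.DiGraph()
    for tweet in tweets:
        for rt_origin in get_rt_origins(tweet):
            g.add_edge(rt_origin, tweet["from_user"], tweet_id=tweet["id"])
    return g


# Hypothetical tweets shaped like the search results used on the slide.
sample_tweets = [
    {"id": 1, "from_user": "bob", "text": "RT @alice: graphs are great"},
    {"id": 2, "from_user": "carol", "text": "nice talk (via @Alice)"},
    {"id": 3, "from_user": "dave", "text": "no origin mentioned here"},
]

rt_graph = create_rt_graph(sample_tweets)
print(sorted(rt_graph.edges()))
```

Lowercasing the screen names merges "@alice" and "@Alice" into one vertex, which matters because slide 12 treats every vertex as a unique person.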
  • 29. Finding patterns: substructures that repeat
  • 30. Older options: Apriori-based, pattern growth
  • 31. Stepwise pair expansion: Separate the graph by pairs, count frequencies, keep the most frequent, augment them by one, repeat.
  • 32. "Chunk": Separate the graph by pairs
  • 33. Keep only the frequent ones
  • 34. Expand them
  • 35. Find your frequent pattern
  • 36. con: Chunkiness
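Slides 31 to 35 can be sketched in a few lines of networkx. This is a simplified illustration under stated assumptions, not the speaker's implementation: vertices carry a "label" attribute (a name chosen here), a pattern is identified only by its labels, and a single expansion step grows frequent pairs into three-vertex paths.

```python
from collections import Counter
from itertools import combinations
import networkx as nx


def pair_of(g, u, v):
    """Order-independent label pair for the edge (u, v)."""
    return tuple(sorted((g.nodes[u]["label"], g.nodes[v]["label"])))


def frequent_pairs(g, min_count):
    """Chunk the graph into edges and keep label pairs seen at least min_count times."""
    counts = Counter(pair_of(g, u, v) for u, v in g.edges())
    return {pair: n for pair, n in counts.items() if n >= min_count}


def expand_by_one(g, frequent, min_count):
    """Grow frequent pairs by one vertex into paths u - v - w and count them."""
    counts = Counter()
    for v in g.nodes():
        for u, w in combinations(list(g.neighbors(v)), 2):
            if pair_of(g, u, v) in frequent or pair_of(g, v, w) in frequent:
                ends = sorted((g.nodes[u]["label"], g.nodes[w]["label"]))
                counts[(ends[0], g.nodes[v]["label"], ends[1])] += 1
    return {path: n for path, n in counts.items() if n >= min_count}


# Made-up toy graph: three A-B edges, two of which continue to a C vertex.
toy = nx.Graph()
for name, label in [("a1", "A"), ("a2", "A"), ("a3", "A"),
                    ("b1", "B"), ("b2", "B"), ("b3", "B"),
                    ("c1", "C"), ("c2", "C")]:
    toy.add_node(name, label=label)
toy.add_edges_from([("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
                    ("b1", "c1"), ("b2", "c2")])

frequent = frequent_pairs(toy, min_count=2)
expanded = expand_by_one(toy, frequent, min_count=2)
print(frequent, expanded)
```

Matching on labels alone sidesteps the isomorphism test for these tiny patterns; larger patterns would need a real subgraph matcher.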
  • 37. "ChunkingLess" Graph Based Induction: CL-GBI [Cook et al.]
  • 38. Inputs needed:
      1. threshold: the minimal frequency at which we consider a conformation to be a pattern
      2. beam size: the number of most frequent patterns we will retain
      3. levels: the arbitrary number of times we will iterate
  • 39. 1. "Chunk": Separate the graph by pairs
  • 40. 2. Select the beam-size most frequent ones
  • 41. 3. Turn selected pairs into pseudo-nodes
  • 42. 4. Expand & Rechunk
  • 43. Keep going back to step 2 until you have done it levels times.
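The loop on slides 38 to 43 can be sketched as follows. Note the simplifications, all of them assumptions of this sketch: it physically contracts each selected pair into one pseudo-node (the "chunking" that the chunkingless variant is designed to avoid), it applies the whole beam to a single graph instead of keeping separate search states, and it identifies patterns by a "label" attribute chosen here for illustration.

```python
from collections import Counter
import networkx as nx


def pair_of(g, u, v):
    """Order-independent label pair for the edge (u, v)."""
    return tuple(sorted((g.nodes[u]["label"], g.nodes[v]["label"])))


def count_pairs(g):
    """Step 1: chunk the graph into edges and count label pairs."""
    return Counter(pair_of(g, u, v) for u, v in g.edges())


def contract_pair(g, pair):
    """Step 3: merge each non-overlapping occurrence of pair into a pseudo-node."""
    g = g.copy()
    merged = set()
    for u, v in list(g.edges()):
        if u == v or u in merged or v in merged or pair_of(g, u, v) != pair:
            continue
        for w in list(g.neighbors(v)):
            if w != u:
                g.add_edge(u, w)  # the pseudo-node inherits v's other edges
        g.remove_node(v)
        g.nodes[u]["label"] = "(%s-%s)" % pair
        merged.update((u, v))
    return g


def pair_induction(g, threshold, beam_size, levels):
    """Repeat: count pairs, keep the beam_size most frequent, contract, rechunk."""
    patterns = []
    for _ in range(levels):
        frequent = [(p, n) for p, n in count_pairs(g).most_common() if n >= threshold]
        selected = frequent[:beam_size]  # step 2
        if not selected:
            break
        for pair, n in selected:
            patterns.append((pair, n))
            g = contract_pair(g, pair)  # steps 3 and 4
    return patterns, g


# Made-up toy graph: three A-B edges, two of which continue to a C vertex.
toy = nx.Graph()
for name, label in [("a1", "A"), ("a2", "A"), ("a3", "A"),
                    ("b1", "B"), ("b2", "B"), ("b3", "B"),
                    ("c1", "C"), ("c2", "C")]:
    toy.add_node(name, label=label)
toy.add_edges_from([("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
                    ("b1", "c1"), ("b2", "c2")])

patterns, reduced = pair_induction(toy, threshold=2, beam_size=1, levels=2)
final_labels = sorted(label for _, label in reduced.nodes(data="label"))
print(patterns, final_labels)
```

After the first level the A-B pairs become pseudo-nodes, so the second level can discover the larger (A-B)-C pattern as an ordinary pair.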
  • 44. Decision Trees
  • 45. A tree of patterns: Finding a pattern on a branch yields a decision.
  • 46. DT-CLGBI
  • 47.
      DT-CLGBI(graph: D)
      begin
        create_node DT in D
        if threshold-attained
          return DT
        else
          P <- select_most_discriminative(CL-GBI(D))
          (Dy, Dn) <- branch_DT_on_predicate(P)
          for Di <- Dy
            DT.branch_yes.add-child(DT-CLGBI(Di))
          for Di <- Dn
            DT.branch_no.add-child(DT-CLGBI(Di))
          return DT
      end
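The recursion on slide 47 can be illustrated with a small runnable sketch. It rests on several assumptions that are not in the talk: candidate patterns are plain label pairs on edges (standing in for the output of the pattern miner), "most discriminative" is measured by information gain, recursion stops when a node is pure or no pattern splits it, and the training graphs and class names are invented.

```python
import math
from collections import Counter
import networkx as nx


def make_graph(labels, edges):
    """Build a graph from {node: label} and a list of edges."""
    g = nx.Graph()
    for node, label in labels.items():
        g.add_node(node, label=label)
    g.add_edges_from(edges)
    return g


def edge_patterns(g):
    """Candidate patterns: the set of label pairs found on the graph's edges."""
    return {tuple(sorted((g.nodes[u]["label"], g.nodes[v]["label"])))
            for u, v in g.edges()}


def entropy(classes):
    total = len(classes)
    return -sum(n / total * math.log2(n / total) for n in Counter(classes).values())


def build_tree(data):
    """data: list of (graph, class). Returns a class (leaf) or a nested dict."""
    classes = [c for _, c in data]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:
        return majority
    best, best_gain = None, 0.0
    for p in sorted(set().union(*(edge_patterns(g) for g, _ in data))):
        yes = [c for g, c in data if p in edge_patterns(g)]
        no = [c for g, c in data if p not in edge_patterns(g)]
        if not yes or not no:
            continue
        gain = entropy(classes) - (len(yes) * entropy(yes)
                                   + len(no) * entropy(no)) / len(data)
        if gain > best_gain:
            best, best_gain = p, gain
    if best is None:
        return majority
    d_yes = [(g, c) for g, c in data if best in edge_patterns(g)]
    d_no = [(g, c) for g, c in data if best not in edge_patterns(g)]
    return {"pattern": best, "yes": build_tree(d_yes), "no": build_tree(d_no)}


def classify(tree, g):
    """Walk the tree: finding the pattern on a branch yields the decision."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["pattern"] in edge_patterns(g) else tree["no"]
    return tree


# Invented training set: "pos" graphs contain an A-B edge, "neg" graphs do not.
training = [
    (make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]), "pos"),
    (make_graph({1: "A", 2: "B"}, [(1, 2)]), "pos"),
    (make_graph({1: "B", 2: "C"}, [(1, 2)]), "neg"),
    (make_graph({1: "C", 2: "C"}, [(1, 2)]), "neg"),
]
tree = build_tree(training)
unseen = make_graph({1: "A", 2: "B", 3: "B"}, [(1, 2), (2, 3)])
prediction = classify(tree, unseen)
print(tree, prediction)
```

Each internal node stores one pattern, and the yes and no branches mirror Dy and Dn in the pseudocode.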
