Mining social data
 


FOSDEM 2013 presentation on Techniques used for mining the social web on graphs.


Usage Rights

CC Attribution-ShareAlike License

    Mining social data: Presentation Transcript

    • Mining Social Data FOSDEM 2013
    • Credits. Speaker: Romeu "@malk_zameth" MOURA. Company: @linagora. License: CC-BY-SA 3.0. SlideShare: j.mp/XXgBAn. Sources: Mining Graph Data; Mining the Social Web; Social Network Analysis for Startups; Social Media Mining and Social Network Analysis; Graph Mining.
    • I work at Linagora, a French FLOSS company.
    • EloData & OpenGraphMiner: Linagora's foray into ESN, data storage, graphs & mining.
    • Why mine social data at all? Without being a creepy stalker
    • To see what humans can't: influence, centers of interest.
    • To remember what humans can't: What worked in the past? Objectively, how did I behave until now?
    • To discover what humans won't.
    • Serendipity: find what you were not looking for.
    • Real-life social data: what is so specific about it?
    • Always graphs
    • Dense substructures: every vertex is a unique entity (someone). Several dense subgraphs: relations within pockets of people.
    • Usually it has no good cuts: even the best partition algorithms cannot find partitions that are just not there.
    • There will be errors & unknowns: exact matching is not an option.
    • Plenty of vanity metrics pollution. Sometimes very surprising ones.
    • Number of followers is a vanity metric: @GuyKawasaki (~1.5M followers) is much more retweeted than the user with the most followers (@justinbieber, ~34M).
    • Why use graphs? What is the itch with Inductive Logic that Inductive Graphs scratch?
    • Classic Data Mining: pros and cons
    • pro: Solid, well-known techniques with good performance
    • con: Complex structures are translated into Bayesian networks or multi-relational tables, incurring either data loss or combinatorial explosion.
    • Graph Mining: the new deal
    • pro: Expressiveness and simplicity. The input and output are graphs: no conversions, graph algorithms all around.
    • con: The unit of operation is comparing graphs for (sub)graph isomorphism, and subgraph isomorphism is NP-complete.
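To make the cost of comparing isomorphisms concrete, here is a minimal brute-force sketch in plain Python. It is an illustration only, not the method used in the talk: the function name `is_isomorphic` and the small example graphs are assumptions, the graphs are treated as undirected and unlabeled, and the search simply tries every one-to-one node mapping, which is why it grows factorially with the number of nodes.

```python
from itertools import permutations


def is_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force test: try every one-to-one mapping of nodes_a onto nodes_b."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    target = {frozenset(edge) for edge in edges_b}
    for perm in permutations(nodes_b):          # n! candidate mappings
        mapping = dict(zip(nodes_a, perm))
        mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges_a}
        if mapped == target:
            return True
    return False


# A triangle with a tail, the same shape under different names, and a 4-cycle.
triangle_with_tail = (["a", "b", "c", "d"],
                      [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
relabelled = ([1, 2, 3, 4], [(2, 3), (3, 4), (4, 2), (2, 1)])
four_cycle = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])

print(is_isomorphic(*triangle_with_tail, *relabelled))
print(is_isomorphic(*triangle_with_tail, *four_cycle))
```

Four nodes mean at most 24 mappings; twenty nodes already mean more than 10^18, which is the practical face of the "con" above.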
    • Extraction: getting the data
    • It is the easy part, a commodity really.
    • Social networks provide APIs: Facebook Graph API, Twitter REST API, Yammer API, etc.
    • Worst case: crawl the website. See "Crawling The Web For Fun And Profit": http://youtu.be/eQtxbaw__W8
    • import sys
      import twitter
      import networkx as nx
      from recipe__get_rt_origins import get_rt_origins

      def create_rt_graph(tweets):
          # Directed graph: one edge from each retweeted user to the retweeter.
          g = nx.DiGraph()
          for tweet in tweets:
              rt_origins = get_rt_origins(tweet)
              if not rt_origins:
                  continue
              for rt_origin in rt_origins:
                  g.add_edge(rt_origin, tweet['from_user'], tweet_id=tweet['id'])
          return g

      if __name__ == '__main__':
          # The search query is everything passed on the command line.
          Q = ' '.join(sys.argv[1:])
          MAX_PAGES = 15
          RESULTS_PER_PAGE = 100

          # Twitter search endpoint as given on the slide.
          twitter_search = twitter.Twitter(domain='search.twitter.com')
          search_results = []
          for page in range(1, MAX_PAGES + 1):
              search_results.append(
                  twitter_search.search(q=Q, rpp=RESULTS_PER_PAGE, page=page)
              )
          all_tweets = [tweet for page in search_results for tweet in page['results']]

          g = create_rt_graph(all_tweets)

          # Print some statistics about the retweet graph.
          print("Number nodes:", g.number_of_nodes(), file=sys.stderr)
          print("Num edges:", g.number_of_edges(), file=sys.stderr)
          print("Num connected components:",
                nx.number_connected_components(g.to_undirected()), file=sys.stderr)
          print("Node degrees:", sorted(d for _, d in g.degree()), file=sys.stderr)
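The script above imports `get_rt_origins` from a recipe module that the slide does not show. The following is only a guess at what such a helper could look like: the regular expression, the `text` field, and the `retweet_edges` helper are assumptions made for illustration, while `from_user` and `id` are the fields used on the slide.

```python
import re

# Assumed convention: "RT @user" or "via @user", possibly naming several users.
RT_PATTERN = re.compile(r"\b(?:RT|via)((?:\s*:?\s*@\w+)+)", re.IGNORECASE)


def get_rt_origins(tweet):
    """Return the lower-cased screen names a tweet credits as its sources."""
    origins = set()
    for mentions in RT_PATTERN.findall(tweet.get("text", "")):
        for name in re.findall(r"@(\w+)", mentions):
            origins.add(name.lower())
    return sorted(origins)


def retweet_edges(tweets):
    """List (origin, retweeter) pairs, i.e. the edges of the retweet graph."""
    return [(origin, tweet["from_user"])
            for tweet in tweets
            for origin in get_rt_origins(tweet)]


sample_tweets = [
    {"id": 1, "from_user": "dave", "text": "RT @Alice: graphs are fun"},
    {"id": 2, "from_user": "erin", "text": "nice talk (via @bob @Carol)"},
    {"id": 3, "from_user": "frank", "text": "nothing to see here"},
]
print(retweet_edges(sample_tweets))
```

Each pair can be fed to `add_edge` exactly as `create_rt_graph` does; tweets that credit nobody contribute no edge.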
    • Finding patterns: substructures that repeat
    • Older options: Apriori-based, pattern growth
    • Stepwise pair expansion: separate the graph into pairs, count frequencies, keep the most frequent, augment them by one, repeat.
    • "Chunk": Separate the graph into pairs
    • Keep only the frequent ones
    • Expand them
    • Find your frequent pattern
    • con: Chunkiness
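The chunk, keep, expand loop above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation behind the talk: graphs are undirected, a node carries a single string label, only the single most frequent pair is kept per round, and `min_count` and `max_rounds` are hypothetical parameters.

```python
from collections import Counter


def pair_key(labels, u, v):
    """Describe an edge by the sorted labels of its two endpoints."""
    return tuple(sorted((labels[u], labels[v])))


def most_frequent_pair(labels, edges):
    counts = Counter(pair_key(labels, u, v) for u, v in edges)
    return counts.most_common(1)[0] if counts else (None, 0)


def chunk(labels, edges, pair):
    """Contract non-overlapping occurrences of `pair` into pseudo-nodes."""
    used, absorbed = set(), {}
    for u, v in edges:
        if u not in used and v not in used and pair_key(labels, u, v) == pair:
            used.update((u, v))
            absorbed[v] = u                      # v is folded into u
    new_labels = {n: l for n, l in labels.items() if n not in absorbed}
    for u in absorbed.values():
        new_labels[u] = "(" + "-".join(pair) + ")"
    new_edges = set()
    for u, v in edges:
        u, v = absorbed.get(u, u), absorbed.get(v, v)
        if u != v:
            new_edges.add((min(u, v), max(u, v)))
    return new_labels, sorted(new_edges)


def stepwise_pair_expansion(labels, edges, min_count=2, max_rounds=3):
    patterns = []
    for _ in range(max_rounds):
        pair, count = most_frequent_pair(labels, edges)
        if pair is None or count < min_count:
            break
        patterns.append((pair, count))
        labels, edges = chunk(labels, edges, pair)
    return patterns


# Three disconnected A-B-C chains.
labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "C", 7: "A", 8: "B", 9: "C"}
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9)]
patterns = stepwise_pair_expansion(labels, edges)
print(patterns)
```

On this toy graph the A-B pair is chunked first, so the equally frequent B-C pair never appears in the result: once B has been absorbed into a pseudo-node, overlapping patterns are no longer visible. That is the chunkiness problem.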
    • "Chunkingless" Graph-Based Induction: CL-GBI [Cook et al.]
    • Inputs needed:
      1. Minimal frequency at which we consider a conformation to be a pattern: threshold
      2. Number of most frequent patterns we will retain: beam size
      3. Arbitrary number of times we will iterate: levels
    • 1. "Chunk": Separate the graph into pairs
    • 2. Select the beam-size most frequent ones
    • 3. Turn selected pairs into pseudo-nodes
    • 4. Expand & Rechunk
    • Keep going back to step 2 until you have done it `levels` times.
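A rough sketch of the chunkingless loop described in the steps above, in plain Python. It rests on several simplifying assumptions: pseudo-nodes are kept as sets of original nodes alongside the untouched graph, a pattern is identified by a cheap `signature` (sorted node labels plus sorted labeled edges) rather than a true canonical form, and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def signature(nodes, labels, edges):
    """Cheap stand-in for a canonical form of the subgraph induced by `nodes`."""
    inner = sorted(tuple(sorted((labels[u], labels[v])))
                   for u, v in edges if u in nodes and v in nodes)
    return (tuple(sorted(labels[n] for n in nodes)), tuple(inner))


def connected(a, b, edges):
    return any((u in a and v in b) or (u in b and v in a) for u, v in edges)


def chunkingless_gbi(labels, edges, threshold, beam_size, levels):
    # The pool holds original nodes plus pseudo-nodes; the graph is never contracted.
    pool = {frozenset([n]) for n in labels}
    patterns = {}
    for _ in range(levels):
        # Step 1: enumerate pairs of adjacent, non-overlapping pool members.
        candidates = defaultdict(set)
        for a in pool:
            for b in pool:
                if a.isdisjoint(b) and connected(a, b, edges):
                    union = a | b
                    candidates[signature(union, labels, edges)].add(union)
        # Step 2: keep the beam_size most frequent new pairs above the threshold.
        ranked = sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        selected = [(sig, occ) for sig, occ in ranked
                    if sig not in patterns and len(occ) >= threshold][:beam_size]
        if not selected:
            break
        # Steps 3 and 4: register the pairs as pseudo-nodes and pair again.
        for sig, occurrences in selected:
            patterns[sig] = len(occurrences)
            pool |= occurrences
    return patterns


# Three disconnected A-B-C chains.
labels = {1: "A", 2: "B", 3: "C", 4: "A", 5: "B", 6: "C", 7: "A", 8: "B", 9: "C"}
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9)]
found = chunkingless_gbi(labels, edges, threshold=2, beam_size=2, levels=2)
for sig, count in found.items():
    print(sig, count)
```

Because nothing is contracted, the overlapping A-B and B-C pairs are both retained at the first level, and the A-B-C chain is found at the second.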
    • Decision Trees
    • A tree of patterns: finding a pattern on a branch yields a decision
    • DT-CLGBI
    • DT-CLGBI(graph: D)
      begin
        create_node DT in D
        if threshold-attained
          return DT
        else
          P <- select_most_discriminative(CL-GBI(D))
          (Dy, Dn) <- branch_DT_on_predicate(P)
          for Di <- Dy
            DT.branch_yes.add-child(DT-CLGBI(Di))
          for Di <- Dn
            DT.branch_no.add-child(DT-CLGBI(Di))
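The recursion in the pseudocode above can be illustrated with a small, heavily simplified Python sketch. The assumptions are significant: candidate patterns are just single labeled edges (standing in for the output of the pattern-mining step), "most discriminative" is taken to mean highest information gain, and the stopping rule is a pure node or a depth limit. The names `build_tree`, `classify` and `max_depth` are hypothetical.

```python
import math
from collections import Counter


def entropy(examples):
    counts = Counter(cls for _, cls in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def edge_patterns(graph):
    """All labeled edges of a graph given as (labels, edges)."""
    labels, edges = graph
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}


def build_tree(examples, max_depth=3):
    """examples: list of (graph, class). Returns a nested dict or a class label."""
    classes = Counter(cls for _, cls in examples)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1 or max_depth == 0:      # stopping rule (assumed)
        return majority
    candidates = set().union(*(edge_patterns(g) for g, _ in examples))
    best, best_gain, best_split = None, 0.0, None
    for pattern in sorted(candidates):
        yes = [e for e in examples if pattern in edge_patterns(e[0])]
        no = [e for e in examples if pattern not in edge_patterns(e[0])]
        if not yes or not no:
            continue
        gain = entropy(examples) - (
            len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)
        if gain > best_gain:
            best, best_gain, best_split = pattern, gain, (yes, no)
    if best is None:
        return majority
    yes, no = best_split
    return {"pattern": best,
            "yes": build_tree(yes, max_depth - 1),
            "no": build_tree(no, max_depth - 1)}


def classify(tree, graph):
    while isinstance(tree, dict):
        branch = "yes" if tree["pattern"] in edge_patterns(graph) else "no"
        tree = tree[branch]
    return tree


training = [
    (({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]), "pos"),
    (({1: "A", 2: "B"}, [(1, 2)]), "pos"),
    (({1: "B", 2: "C"}, [(1, 2)]), "neg"),
    (({1: "C", 2: "C"}, [(1, 2)]), "neg"),
]
tree = build_tree(training)
unseen = ({1: "B", 2: "A", 3: "A"}, [(1, 2), (1, 3)])
print(tree)
print(classify(tree, unseen))
```

Each internal node tests for one pattern and sends graphs down the yes or no branch, mirroring `branch_yes` and `branch_no` in the pseudocode.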