Monkigras 2012: Networks Of Data

How to think of your data as a graph, and apply social network analysis to understand it.

Transcript

  • 1. networks of data Matt Biddulph @mattb | matt@hackdiary.com Every data scientist has their own favourite way of representing their data. For some people it’s Excel, and they think in rows and columns. For others it’s matrices, and they use linear algebra to interrogate their data. For me, it’s graphs.
  • 2. We’re all pretty used to the idea that you can model human relationships in a social graph.
  • 3. “Social network analysis views social relationships in terms of network theory consisting of nodes and ties. Nodes are the individual actors within the networks, and ties are the relationships between the actors.” There’s a pretty deep area of mathematical study called Social Network Analysis that goes back at least 20 years. It tries to create insight by analysing the structure of social networks, and usually doesn’t incorporate any elements of culture or sociology in doing so.
  • 4. Centrality measures: It led to the creation of techniques like centrality measures, which try to find the nodes that are most central to the network. These might be the kind of people on Twitter who have the highest chance of being retweeted.
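
As a rough illustration of a centrality measure, the sketch below computes normalised in-degree centrality over a tiny follower graph. The usernames and edges are made up, and in-degree is only the simplest of several centrality measures; the talk does not say which one was used.

```python
from collections import defaultdict

# Hypothetical follower edges as (follower, followed). Not real data.
FOLLOWS = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("alice", "dave"),
]


def in_degree_centrality(edges):
    """Return {node: follower_count / (n - 1)} for a directed edge list."""
    nodes = set()
    followers = defaultdict(set)
    for src, dst in edges:
        nodes.update((src, dst))
        followers[dst].add(src)
    n = len(nodes)
    if n < 2:
        return {node: 0.0 for node in nodes}
    # Dividing by n - 1 scales the score to the range 0..1.
    return {node: len(followers[node]) / (n - 1) for node in nodes}


centrality = in_degree_centrality(FOLLOWS)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Here the node followed by everyone else scores 1.0, and a node nobody follows scores 0.0.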
  • 5. Community detection: There are also community detection algorithms that try to find the most tightly-knit subgraphs and cluster those nodes together. If you ran this over the network of people I follow on Twitter, it might be able to pick out my work colleagues or the people I socialise with face-to-face.
  • 6. People you may know: Sites like LinkedIn build almost-telepathic “people you may know” features by walking around the graph starting at your node and looking for people that show up a lot in your neighbourhood that you haven’t connected with yet.
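
The neighbourhood walk described above can be sketched as a friend-of-friend count. This is a minimal version under stated assumptions: the graph, the names and the `limit` parameter are hypothetical, and real sites almost certainly use richer signals than mutual connections alone.

```python
from collections import Counter

# Hypothetical undirected connections. Not real data.
CONNECTIONS = {
    "me": {"ann", "ben", "cat"},
    "ann": {"me", "ben", "dan"},
    "ben": {"me", "ann", "dan", "eve"},
    "cat": {"me", "dan"},
    "dan": {"ann", "ben", "cat"},
    "eve": {"ben"},
}


def people_you_may_know(graph, person, limit=3):
    """Rank people two hops away by how many mutual connections they share."""
    direct = graph.get(person, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and anyone already connected.
            if candidate != person and candidate not in direct:
                counts[candidate] += 1
    # Highest count first; names break ties so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


suggestions = people_you_may_know(CONNECTIONS, "me")
print(suggestions)
```

Someone who shows up through three different friends ranks above someone reachable through only one.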
  • 7. To demonstrate what these techniques can do, I downloaded some data from GitHub’s API. I wanted to identify and map London’s most-connected developers.
  • 8. [Diagram: network graph of London developers, each node labelled with a GitHub username] This diagram, created in 2009, has several dimensions. Each node is a London developer with a GitHub account. Lines show follower relationships. Nodes are sized according to number of followers, and coloured according to network centrality (red for most central). The layout shows community structure: for example, the top-left cluster is mostly Perl developers.
  • 9. [Diagram: node labels only, a subset of the usernames from the previous slide]
  • 10. Let’s go beyond purely social data. James Governor suggested I explore the connection between music taste and choice of programming language. I wrote a script to correlate Last.fm usernames with GitHub usernames and created a graph structure linking the music genre taste of each developer to the languages their GitHub projects are implemented in.
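
The original script is not shown, so the following is only a sketch of the general idea: join two per-user datasets on a shared username and count how often each (language, genre) pair co-occurs. All usernames, languages and genres here are invented, and matching accounts by identical username is an assumption.

```python
from collections import Counter
from itertools import product

# Hypothetical per-user data keyed by a shared username. Not real data.
LANGUAGES = {
    "user_a": {"Ruby"},
    "user_b": {"Ruby", "JavaScript"},
    "user_c": {"JavaScript"},
}
GENRES = {
    "user_a": {"folk"},
    "user_b": {"folk", "indie"},
    "user_c": {"indie", "rock"},
    "user_d": {"jazz"},  # no matching code account, so this user is ignored
}


def language_genre_edges(languages, genres):
    """Count users linking each (language, genre) pair."""
    edges = Counter()
    # Only users present in both datasets contribute edges.
    for user in languages.keys() & genres.keys():
        for lang, genre in product(languages[user], genres[user]):
            edges[(lang, genre)] += 1
    return edges


edges = language_genre_edges(LANGUAGES, GENRES)
for (lang, genre), weight in sorted(edges.items()):
    print(lang, genre, weight)
```

The resulting weighted edges form a graph with language nodes on one side and genre nodes on the other, which a tool such as Gephi can then lay out.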
  • 11. This diagram is just a small sample from the people I follow on GitHub and Last.fm, not enough to support a statistically significant judgement.
  • 12. In this small sample we can see that my Ruby-coding friends tend towards singer-songwriter acoustic folk, and the JavaScript coders are all about rock and indie.
  • 13. This is a great book that goes into these techniques in depth. However, it’s useful for any networked data, not just social networks. And it’s useful to anyone, not just startups.
  • 16. So let’s take a step back and think about what other kinds of graph we could form, from what kinds of data.
  • 17. I used to work in location apps at Nokia, and so I naturally think of places. Wouldn’t it be interesting to study the connections between cities instead of people? For example, people probably fly more often between NYC and LA than they do between NYC and New Jersey. We could re-draw the map based on closeness in the travel network.
  • 18. In 2011 I turned to the Hadoop cluster at Nokia and took a sample of several weeks of logs from our routing servers. These are used every time someone uses our maps application to request a driving route from one place to another. Every time someone drove from A to B, I made an edge in a “place graph” from A to B.
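
A toy version of that edge-building step might look like the sketch below. The route requests are invented, the real job ran on Hadoop over server logs whose format is not described, and skipping same-place requests is an assumption made here for tidiness.

```python
from collections import Counter

# Hypothetical route requests as (origin, destination). Not real log data.
ROUTE_REQUESTS = [
    ("London", "Brighton"),
    ("London", "Brighton"),
    ("Brighton", "London"),
    ("London", "Oxford"),
    ("Madrid", "Lisbon"),
    ("London", "London"),  # same-place request, skipped (assumption)
]


def build_place_graph(requests):
    """Count requests per directed (origin, destination) edge."""
    weights = Counter()
    for origin, destination in requests:
        if origin != destination:
            weights[(origin, destination)] += 1
    return weights


def connection_strength(weights, a, b):
    """Total traffic between two places, ignoring direction."""
    return weights[(a, b)] + weights[(b, a)]


place_graph = build_place_graph(ROUTE_REQUESTS)
print(connection_strength(place_graph, "London", "Brighton"))
```

Edge weights like these are what a clustering or layout step can use to pull frequently connected towns closer together.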
  • 19. I ran the data through Gephi and asked it to cluster the towns by the strength of connections between them. The result is a not-quite-geographic new map of the world, where two cities are close to each other if people often drive between them.
  • 20. [Map labels: UK; China; Korea, Japan, etc; Spain; Most of Europe; India; Pakistan; Finland; Russia] As you’d expect, the UK is an island and so people don’t drive in and out of it very often. Spain and Portugal are not islands, but they appear separate because they’re attached to the rest of Europe by a very narrow neck of land. So people are much more likely to fly than drive out of Spain.
  • 21. [Slide text: Times Square = Piccadilly Circus; New York, London] What kind of questions can this data answer? Say I’m coming to London for the first time and I’m familiar with New York. I could ask a friend what the equivalent of Times Square is in London. If they know both towns, they’d probably tell me that Times Square is the Piccadilly Circus of New York.
  • 22. What is the Holborn of Amsterdam? ... the De Pijp of New York? ... the Williamsburg of London? But if we delve into the place graph, we could answer much more interesting questions, and create a “neighbourhood isomorphism” from city to city. People who like the Mission in SF and Shoreditch in London could find out that Williamsburg is probably the best place for them to stay in New York.
  • 23. The Place Graph is just like the Social Graph: This is just one example of viewing data as a graph and then using Social Graph analytics on it. Many more are possible: the link structure of Wikipedia, the co-occurrence of topics in a newspaper, the implicit social network of @replies on Twitter, etc.
  • 24. Thanks! Matt Biddulph @mattb | matt@hackdiary.com