Where 2012 prototyping workshop - Presentation Transcript

  • 1. Prototyping location apps with real data. Matt Biddulph - @mattb | matt@hackdiary.com
  • 2. Whether you're a new startup looking for investment, or a team at a large company who wants the green light for a new product, nothing convinces like real running code. But how do you solve the chicken-and-egg problem of filling your early prototype with real data? (Traffic photo by TheTruthAbout - http://flic.kr/p/59kPoK; money photo by borman818 - http://flic.kr/p/61LYTT)
  • 3. As experts in processing large datasets and interpreting charts and graphs, we may think of our data in the same way that a Bloomberg terminal presents financial information. But information visualisation alone does not make a product. http://www.flickr.com/photos/financemuseum/2200062668/
  • 4. We need to communicate our understanding of the data to the rest of our product team. We need to be their eyes and ears in the data - translating human questions into code, and query results into human answers.
  • 5. Prototypes are boundary objects. Instead of communicating across disciplines using language from our own specialisms, we show what we mean in real running code and designs. We prototype as early as possible, so that we can talk in the language of the product. http://en.wikipedia.org/wiki/Boundary_object - boundary objects “allow coordination without consensus as they can allow an actor's local understanding to be reframed in the context of a wider collective activity”. http://www.flickr.com/photos/orinrobertjohn/159744546/
  • 6.–10. (A five-slide build of one diagram: a triangle whose corners are labelled Novelty, Fidelity and Desirability.) Prototyping has many potential benefits. We use this triangle to think about how to structure our work and make it clear what insights we are looking for in a particular project.
  • 11. No more lorem ipsum. By incorporating analysis and data science into product design during the prototyping phase, we avoid “lorem ipsum” - the fake text and made-up data that is often used as a placeholder in design sketches. This helps us understand real-world product use and find problems earlier. Photo by R.B. - http://flic.kr/p/8APoN4
  • 12. Helping designers explore data. Data can be complex. One of the first things we do when working with a new dataset is create internal toys - “data explorers” - to help us understand it.
  • 13.–15. “With enough data you can discover patterns and facts using simple counting that you can't discover in small data using sophisticated statistical and ML approaches.” –Dmitriy Ryaboy, paraphrasing Peter Norvig on Quora: http://b.qr.ae/ijdb2G. Philip “Flip” Kromer of Infochimps describes this process as “hitting the data with the Insight Stick.” As data scientists, one of our common tasks is to take data from almost any source and apply standard structural techniques to it without worrying too much about the domain of the data.
  • 16. Here’s a small example of exploring a dataset that I did while working in Nokia’s Location & Commerce division.
  • 17.–19. Searches are goal-driven user behaviour - someone typed something into a search box on a phone. But we can even learn from activity that isn't so explicit. When someone views a Nokia Ovi map on the web or phone, the visuals for the map are served up in square “tiles” from our servers. We can analyse the number of requests made for each tile and take it as a measure of interest or attention in that part of the world.
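The per-tile counting described here needs only the standard Web Mercator (slippy-map) tile formula. A minimal Ruby sketch of the idea - the zoom level and the "lat lng" input format are illustrative assumptions, not Nokia's actual log pipeline:

    # Convert a lat/lng into slippy-map tile coordinates at a given zoom,
    # using the standard Web Mercator formula used by OSM-style tile servers.
    def tile_for(lat, lng, zoom)
      n = 2 ** zoom
      x = ((lng + 180.0) / 360.0 * n).floor
      lat_rad = lat * Math::PI / 180.0
      y = ((1 - Math.log(Math.tan(lat_rad) + 1 / Math.cos(lat_rad)) / Math::PI) / 2 * n).floor
      [x, y]
    end

    # Tally attention per tile from "lat lng" pairs on stdin (a stand-in for
    # real tileserver logs, which would already carry z/x/y per request).
    counts = Hash.new(0)
    ARGF.each_line do |line|
      lat, lng = line.split.map(&:to_f)
      counts[tile_for(lat, lng, 12)] += 1
    end
    counts.sort_by { |_, c| -c }.first(10).each { |(x, y), c| puts "#{x}/#{y}\t#{c}" }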
  • 20. LA attention heatmap. We built a tool that could calculate metrics for every grid-square of the map of the world, and present heatmaps of that data on a city level. This view shows which map-tiles are viewed most often in LA using Ovi Maps. It's calculated from the server logs of our map-tile servers. You could think of it as a map of the attention our users give to each tile of LA.
  • 21. LA driving heatmap. This is the same area of California, but instead of map-tile attention it shows the relative number of cars on the road that are using our navigation features. This gives a whole different view on the city. We can see that it highlights major roads, and it's much harder to see where the US coastline occurs. By comparing these two heatmaps we start to understand the meaning and the potential of these two datasets.
  • 22. But of course a heatmap alone isn't a product. This is one of the visualisation sketches produced by designer Tom Coates after investigating the data using the heatmap explorer. It's much closer to something that could go into a real product.
  • 23. Tools. These are the tools I'll be using to demo some of my working processes.
  • 24. Apache Pig makes Hadoop much easier to use by creating map-reduce plans from SQL-like scripts.
  • 25. Elastic MapReduce and S3
  • 26. With Ruby scripts acting as glue for the inevitable hacking, massaging and munging of the data.
  • 27. Question: who's already working with these tools?
  • 28. All code for the workshop: https://github.com/mattb/where2012-workshop
  • 29. Demo: Starting up an Elastic MapReduce cluster.
  • 30. Realistic cities: generating a dataset of people moving around town. The first dataset we'll generate is one you could use to test any system or app involving people moving around the world - whether it's an ad-targeting system or a social network.
  • 31. You probably know about Stamen's beautiful work creating new renderings of OpenStreetMap, including this Toner style.
  • 32. When they were getting ready to launch their newest tiles, called Watercolor, they created this rendering of the access logs from their Toner tileservers. It shows which parts of the map are most viewed by users of Toner-based apps.
  • 33. Working with data and inspiration from Eric Fischer, Nathaniel Kelso of Stamen generated this map to decide how deep to pre-render each area of the world to get the maximum hit-rate on their servers. Rendering the full map to the deepest zoom would have taken years on their servers. The data used as a proxy for the attention of users is a massive capture of geocoded tweets. The more tweets per square mile, the deeper the zoom will be rendered in that area.
  • 34. We can go further than geocoded tweets and get a realistic set of POIs that people go to, with timestamps. If you search for 4sq on the Twitter streaming API you get about 25,000 tweets per hour announcing users' Foursquare checkins.
  • 35. There’s a lot of metadata available.
  • 36. If you follow the URL you get even more data.
  • 37. And if you view source, the data’s all there in JSON format.
  • 38. Demo: Gathering Foursquare tweets. So I set up a script to skim the tweets, perform the HTTP requests on 4sq.com and capture the tweet+checkin data as lines of JSON in files in S3.
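The deck doesn't reproduce the skimming script itself. A minimal Ruby sketch of the idea - read tweets as line-delimited JSON, follow the first 4sq.com link, scrape the checkin JSON embedded in the page, and append tweet+checkin records to a file for upload to S3. The output filename and the pattern for the embedded blob are assumptions, not Foursquare's documented format:

    require 'json'
    require 'open-uri'

    File.open('checkins.json', 'a') do |out|
      ARGF.each_line do |line|
        tweet = JSON.parse(line) rescue next
        # Pull the first 4sq.com short link out of the tweet text.
        url = tweet['text'].to_s[%r{http://4sq\.com/\S+}] or next
        html = URI.parse(url).read rescue next
        # Assumption: the checkin page embeds its data as a JSON blob; in
        # practice you would view source and adapt this pattern to match.
        if blob = html[/\{"checkin":.*?\}/m]
          checkin = JSON.parse(blob) rescue next
          out.puts JSON.generate('tweet' => tweet, 'checkin' => checkin)
        end
      end
    end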
  • 39. For this demo I wanted to show just people in San Francisco so I looked up a bounding-box for San Francisco.
  • 40.–41. This Pig script loads up the JSON and streams it through a Ruby script to turn the JSON into tab-separated data (because that's easier to deal with in Pig than JSON). We filter the data to San Francisco lat-longs, group the data by username and count it. Then we keep only “active” users - people with at least five checkins:

    DEFINE json2tsv `json2tsv.rb` SHIP('/home/hadoop/pig/json2tsv.rb', '/home/hadoop/pig/json.tar');
    A = LOAD 's3://mattb-4sq';
    B = STREAM A THROUGH json2tsv AS (lat:float, lng:float, venue, nick, created_at, tweet);
    SF = FILTER B BY lat > 37.604031 AND lat < 37.832371
                 AND lng > -123.013657 AND lng < -122.355301;
    PEOPLE = GROUP SF BY nick;
    PEOPLE_COUNTED = FOREACH PEOPLE GENERATE COUNT(SF) AS c, group, SF;
    ACTIVE = FILTER PEOPLE_COUNTED BY c >= 5;
    RESULT = FOREACH ACTIVE GENERATE group, FLATTEN(SF);
    STORE RESULT INTO 's3://mattb-4sq/active-sf';
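The json2tsv.rb streaming script isn't shown in the deck. A sketch of what Pig's STREAM ... AS clause expects - one tab-separated row per input JSON line, fields in the declared order - could look like this; the field names inside the JSON are assumptions:

    #!/usr/bin/env ruby
    require 'json'

    # Pig streams one JSON document per line on stdin and expects
    # tab-separated fields back in the order the AS clause declares.
    STDIN.each_line do |line|
      begin
        o = JSON.parse(line)
        tweet = o['tweet'] || {}
        checkin = o['checkin'] || {}
        fields = [checkin['lat'], checkin['lng'], checkin['venue'],
                  (tweet['user'] || {})['screen_name'],
                  tweet['created_at'], tweet['text']]
        # Tabs or newlines inside the tweet text would break the TSV contract.
        puts fields.map { |f| f.to_s.gsub(/[\t\n]/, ' ') }.join("\t")
      rescue JSON::ParserError
        next # skip malformed lines rather than failing the whole Pig job
      end
    end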
  • 42. Demo: Visualising checkins with GeoJSON and KML. You can view the path of one individual user as they arrive at SFO and get their rental car at http://maps.google.com/maps?q=http:%2F%2Fwww.hackdiary.com%2Fmisc%2Fsampledata-broton.kml&hl=en&ll=37.625585,-122.398124&spn=0.018015,0.040169&sll=37.0625,-95.677068&sspn=36.863178,82.265625&t=m&z=15&iwloc=lyrftr:kml:cFxADtCtq9UxFii5poF9Dk7kA_B4QPBI,g475427abe3071143,,
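The KML in that demo was presumably generated from the Pig output; as a sketch of the same idea in GeoJSON, assuming the column order of the active-sf output above (nick, lat, lng, venue, nick, created_at, tweet):

    require 'json'
    require 'time'

    # Read the tab-separated checkin rows for one user, order them by time
    # and emit a GeoJSON LineString tracing their path.
    rows = ARGF.map { |line| line.chomp.split("\t") }
    coords = rows
      .sort_by { |r| Time.parse(r[5]) }     # created_at
      .map { |r| [r[2].to_f, r[1].to_f] }   # GeoJSON coordinates are [lng, lat]
    puts JSON.generate(
      'type' => 'Feature',
      'geometry' => { 'type' => 'LineString', 'coordinates' => coords },
      'properties' => { 'nick' => rows.first && rows.first[0] }
    )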
  • 43. Realistic social networks: generating a dataset of social connections between people. What about the connections between people? What data could we use as a proxy for a large social graph?
  • 44. Wikipedia is full of data about people and the connections between them.
  • 45.–46. The DBpedia project extracts just the metadata from Wikipedia - the types, the links, the geo-coordinates etc.
  • 47. It’s available as a public dataset that you can attach to an Amazon EC2 instance and look through.
  • 48. There are many kinds of data in separate files (you can also choose your language).
  • 49. We're going to start with this one. It tells us what “types” each entity is on Wikipedia, parsed out from the Infoboxes on their pages.
  • 50.–51. Here are some examples - and the ones we're going to need are just the people:

    <Autism> <type> <dbpedia.org/ontology/Disease>
    <Autism> <type> <www.w3.org/2002/07/owl#Thing>
    <Aristotle> <type> <dbpedia.org/ontology/Philosopher>
    <Aristotle> <type> <dbpedia.org/ontology/Person>
    <Aristotle> <type> <www.w3.org/2002/07/owl#Thing>
    <Aristotle> <type> <xmlns.com/foaf/0.1/Person>
    <Aristotle> <type> <schema.org/Person>
    <Bill_Clinton> <type> <dbpedia.org/ontology/OfficeHolder>
    <Bill_Clinton> <type> <dbpedia.org/ontology/Person>
    <Bill_Clinton> <type> <www.w3.org/2002/07/owl#Thing>
    <Bill_Clinton> <type> <xmlns.com/foaf/0.1/Person>
    <Bill_Clinton> <type> <schema.org/Person>
  • 52. Then we’ll take the file that shows which pages link to which other Wikipedia pages.
  • 53. And we'll try to filter it down to just the human relationships:

    <http://dbpedia.org/resource/Bill_Clinton> -> Woody_Freeman
    <http://dbpedia.org/resource/Bill_Clinton> -> Yasser_Arafat
    <http://dbpedia.org/resource/Bill_Dodd> -> Bill_Clinton
    <http://dbpedia.org/resource/Bill_Frist> -> Bill_Clinton
    <http://dbpedia.org/resource/Bob_Dylan> -> Bill_Clinton
    <http://dbpedia.org/resource/Bob_Graham> -> Bill_Clinton
    <http://dbpedia.org/resource/Bob_Hope> -> Bill_Clinton
  • 54.–57. Using Pig we load up the types file and filter it to just the people (the entities of type Person from the FOAF ontology). We filter the links to only those whose subject (originating page) is a person, then filter again to only those links that link to a person, and store the result:

    TYPES = LOAD 's3://mattb/instance_types_en.nt.bz2' USING PigStorage(' ')
            AS (subj, pred, obj, dot);
    PEOPLE_TYPES = FILTER TYPES BY obj == '<http://xmlns.com/foaf/0.1/Person>';
    PEOPLE = FOREACH PEOPLE_TYPES GENERATE subj;
    LINKS = LOAD 's3://mattb/page_links_en.nt.bz2' USING PigStorage(' ')
            AS (subj, pred, obj, dot);
    SUBJ_LINKS_CO = COGROUP PEOPLE BY subj, LINKS BY subj;
    SUBJ_LINKS_FILTERED = FILTER SUBJ_LINKS_CO BY NOT IsEmpty(PEOPLE) AND NOT IsEmpty(LINKS);
    SUBJ_LINKS = FOREACH SUBJ_LINKS_FILTERED GENERATE FLATTEN(LINKS);
    OBJ_LINKS_CO = COGROUP PEOPLE BY subj, SUBJ_LINKS BY obj;
    OBJ_LINKS_FILTERED = FILTER OBJ_LINKS_CO BY NOT IsEmpty(PEOPLE) AND NOT IsEmpty(SUBJ_LINKS);
    OBJ_LINKS = FOREACH OBJ_LINKS_FILTERED GENERATE FLATTEN(SUBJ_LINKS);
    D_LINKS = DISTINCT OBJ_LINKS;
    STORE D_LINKS INTO 's3://mattb/people-graph' USING PigStorage(' ');
  • 58. This is the result in text:

    <http://dbpedia.org/resource/Bill_Clinton> -> Woody_Freeman
    <http://dbpedia.org/resource/Bill_Clinton> -> Yasser_Arafat
    <http://dbpedia.org/resource/Bill_Dodd> -> Bill_Clinton
    <http://dbpedia.org/resource/Bill_Frist> -> Bill_Clinton
    <http://dbpedia.org/resource/Bob_Dylan> -> Bill_Clinton
    <http://dbpedia.org/resource/Bob_Graham> -> Bill_Clinton
    <http://dbpedia.org/resource/Bob_Hope> -> Bill_Clinton
  • 59. And this is the 10,000-foot view.
  • 60. Colours show the results of a “Modularity” analysis that finds the clusters of communities within the graph. For example, the large cyan group containing Barack Obama is all government and royalty.
  • 61. Explore it yourself: http://biddul.ph/wikipedia-graph
  • 62. http://gephi.org - thanks to Gephi for a great graph-visualisation tool.
  • 63.–65. This is a great book that goes into these techniques in depth. However, it's useful for any networked data, not just social networks. And it's useful to anyone, not just startups.
  • 66. Realistic ranking: generating a dataset of places ordered by importance. What if we have all this data about people, places or things but we don't know whether one thing is more important than another? We can use public data to rank, compare and score.
  • 67. Wikipedia makes hourly summaries of their web traffic available. Each line of each file shows the language and name of a page on Wikipedia and how many times it was accessed that hour. We can use that attention as a proxy for the importance of concepts.
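Those hourly dump lines are plain space-separated text, so summing a page's attention is a few lines of Ruby. A minimal sketch, assuming already-decompressed pagecounts files and the published field layout (project, title, count, bytes):

    # Sum hourly request counts per English-Wikipedia page, e.g.:
    #   ruby sum_counts.rb pagecounts-20120123-*
    # Line format: "<project> <page_title> <count> <bytes>", e.g.
    #   "en Golden_Gate_Bridge 271 8998763"
    totals = Hash.new(0)
    ARGF.each_line do |line|
      project, title, count, _bytes = line.split
      totals[title] += count.to_i if project == 'en'
    end
    totals.sort_by { |_, c| -c }.first(20).each { |t, c| puts "#{c}\t#{t}" }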
  • 68. Back to DBpedia for some more data.
  • 69. This time we’re going to extract and rank things that have geotags on their page.
  • 70. The geographic coordinates file lists each entity on Wikipedia that is known to have lat-long coordinates: <Alabama> <type> <www.opengis.net/gml/_Feature>
  • 71. I pull out just the names of the pages: $ bzcat geo_coordinates_en.nt.bz2 | grep gml/_Feature | cut -d'>' -f 1 | cut -b30-
  • 72. ... which looks like this. There are over 400,000 of them:

    Van_Ness_Avenue_%28San_Francisco%29
    Recreation_Park_%28San_Francisco%29
    Broadway_Tunnel_%28San_Francisco%29
    Broadway_Street_%28San_Francisco%29
    Carville,_San_Francisco
    Union_League_Golf_and_Country_Club_of_San_Francisco
    Ambassador_Hotel_%28San_Francisco%29
    Columbus_Avenue_%28San_Francisco%29
    Grand_Hyatt_San_Francisco
    Marina_District,_San_Francisco
    Pier_70,_San_Francisco
    Victoria_Theatre,_San_Francisco
    San_Francisco_Glacier
    San_Francisco_de_Ravacayco_District
    San_Francisco_church
    Lafayette_Park,_San_Francisco,_California
    Antioch_University_%28San_Francisco%29
    San_Francisco_de_Chiu_Chiu
  • 73.–75. Using Pig we filter the page-traffic stats to just the English hits, filter the entities down to just those that are geo-features, then group and sum the statistics by page name:

    DATA = LOAD 's3://wikipedia-stats/*.gz' USING PigStorage(' ')
           AS (lang, name, count:int, other);
    ENDATA = FILTER DATA BY lang == 'en';
    FEATURES = LOAD 's3://wikipedia-stats/features.txt' USING PigStorage(' ') AS (feature);
    FEATURE_CO = COGROUP ENDATA BY name, FEATURES BY feature;
    FEATURE_FILTERED = FILTER FEATURE_CO BY NOT IsEmpty(FEATURES) AND NOT IsEmpty(ENDATA);
    FEATURE_DATA = FOREACH FEATURE_FILTERED GENERATE FLATTEN(ENDATA);
    NAMES = GROUP FEATURE_DATA BY name;
    COUNTS = FOREACH NAMES GENERATE group, SUM(FEATURE_DATA.count) AS c;
    FCOUNT = FILTER COUNTS BY c > 500;
    SORTED = ORDER FCOUNT BY c DESC;
    STORE SORTED INTO 's3://wikipedia-stats/features_out.gz' USING PigStorage('\t');
  • 76. Successfully read 442775 records from "s3://wikipedia-stats/features.txt", and 975017055 records from "s3://wikipedia-stats/pagecounts-2012012*.gz", in 4 hours, 19 minutes and 32 seconds using 4 m1.small instances. In other words, using a 4-machine Elastic MapReduce cluster I can process 50GB of data containing nearly a billion rows in about four hours.
  • 77. (Chart: San Francisco neighborhoods ranked by page views.) Here are some results. As you'd expect, the neighbourhoods that rank the highest are the most famous ones; local residential neighbourhoods come lower down the scale:

    The Castro 2479
    Chinatown 2457
    Tenderloin 2276
    Mission District 1336
    Union Square 1283
    Nob Hill 952
    Bayview-Hunters Point 916
    Alamo Square 768
    Russian Hill 721
    Ocean Beach 661
    Pacific Heights 592
    Sunset District 573
  • 78. (Chart: London neighbourhoods ranked by page views.) Here it is again for London:

    Hackney 3428
    Camden 2498
    Tower Hamlets 2378
    Newham 1850
    Enfield 1830
    Croydon 1796
    Islington 1624
    Southwark 1603
    Lambeth 1354
    Greenwich 1316
    Hammersmith and Fulham 1268
    Haringey 1263
    Harrow 1183
    Brent 1140
  • 79. To demo this ranking in a data toy that anyone can play with, I built an auto-completer using Elasticsearch. I transformed the Pig output into JSON and made an index.
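The deck doesn't show the indexing step itself. A hedged sketch using Elasticsearch's bulk HTTP API - the index name `places`, the field names and localhost:9200 are illustrative assumptions, not taken from the workshop code:

    require 'json'
    require 'net/http'

    # Read the Pig output ("<name>\t<count>") and bulk-index each place with
    # its traffic count, so searches can rank completions by popularity.
    bulk = ''
    ARGF.each_line do |line|
      name, count = line.chomp.split("\t")
      bulk << JSON.generate('index' => { '_index' => 'places', '_type' => 'place' }) << "\n"
      bulk << JSON.generate('name' => name.tr('_', ' '), 'count' => count.to_i) << "\n"
    end
    res = Net::HTTP.new('localhost', 9200).post('/_bulk', bulk)
    puts res.code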
  • 80. Demo: A weighted autocompleter with Elasticsearch. I exposed this index through a small Ruby webapp written in Sinatra.
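Such a Sinatra wrapper can stay tiny: accept a prefix, run a prefix query against the index, and order matches by the traffic count. Another hedged sketch against the same assumed index layout (the real app is in the workshop repository):

    require 'sinatra'
    require 'json'
    require 'net/http'

    # GET /complete?q=chin => JSON list of place names, best-known first.
    get '/complete' do
      query = {
        'query' => { 'prefix' => { 'name' => params['q'].to_s.downcase } },
        'sort'  => [{ 'count' => 'desc' }],
        'size'  => 10
      }
      res = Net::HTTP.new('localhost', 9200)
                     .post('/places/place/_search', JSON.generate(query))
      hits = JSON.parse(res.body)['hits']['hits']
      content_type :json
      JSON.generate(hits.map { |h| h['_source']['name'] })
    end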
  • 81. So we can easily answer questions like “which of the world’s many Chinatown districts are the best-known?”
  • 82. All code for the workshop: https://github.com/mattb/where2012-workshop
  • 83. Thanks! Matt Biddulph - @mattb | matt@hackdiary.com