Geo Analytics Tutorial - Where 2.0 2011

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Upcoming SlideShare
Geo data analytics
Geo data analytics
Loading in …3
×

O'Reilly Where 2.0 2011

Thanks to cheap storage and computing power, society is measuring and storing ever-increasing amounts of information, and tools like Hadoop now make it possible to crunch petabytes of data efficiently.

In this O'Reilly Where 2.0 tutorial, Pete Skomoroch, Sr. Data Scientist at LinkedIn, gives an overview of spatial analytics and how you can use tools like Hadoop, Python, and Mechanical Turk to process location data and derive insights about cities and people.

Topics:

* Data Science & Geo Analytics
* Useful Geo tools and Datasets
* Hadoop, Pig, and Big Data
* Cleaning Location Data with Mechanical Turk
* Spatial Tweet Analytics with Hadoop & Python
* Using Social Data to Understand Cities

Geo Analytics Tutorial - Where 2.0 2011

  1. Geo Analytics Tutorial. Pete Skomoroch, Sr. Data Scientist, LinkedIn (@peteskomoroch). #geoanalytics. Hadoop intro slides from Kevin Weil, Twitter.
  2. Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  4. Analytics & Data are Hot Topics
  10. Data Exhaust: My Delicious Tags
  11. Data Science * http://www.drewconway.com/zia/?p=2378
  12. Data Visualization ‣ http://www.dataspora.com/blog/
  13. Spatial Analysis: Map by Dr. John Snow showing clusters of cholera cases in London's 1854 Broad Street cholera outbreak. This was one of the first uses of map-based spatial analysis.
  14. Spatial Analysis • Spatial regression - estimate dependencies between variables • Gravity models - estimate the flow of people, material, or information between locations • Spatial interpolation - estimate variables at unobserved locations based on other measured values • Simulation - use models and data to predict spatial phenomena
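
Spatial interpolation in particular is easy to sketch. Below is a minimal inverse-distance-weighting (IDW) example in Python; the observation points, values, and power parameter are invented for illustration and are not from the talk. It also treats latitude/longitude as flat Euclidean coordinates, which is only reasonable over small areas.

    import math

    def idw_estimate(point, observations, power=2):
        # Weighted average of observed values, weights = 1 / distance^power.
        num = den = 0.0
        for lat, lon, value in observations:
            d = math.hypot(point[0] - lat, point[1] - lon)
            if d == 0:
                return value  # the point coincides with an observation
            w = 1.0 / d ** power
            num += w * value
            den += w
        return num / den

    # Hypothetical temperature readings: (lat, lon, degrees F)
    obs = [(37.77, -122.42, 58.0), (37.80, -122.27, 62.0), (37.34, -121.89, 70.0)]
    print(idw_estimate((37.60, -122.10), obs))
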
  15. Life Span & Food by Zip Code * http://zev.lacounty.gov/news/health/death-by-zip-code * http://www.verysmallarray.com/?p=975
  16. Where Americans Are Moving (IRS Data) ‣ (Jon Bruner) http://jebruner.com/2010/06/the-migration-map/
  17. Facebook Connectivity (Pete Warden) * http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
  18. Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  19. Useful Geo Tools • R, Matlab, SciPy, commercial geo software • R spatial packages: http://cran.r-project.org/web/views/Spatial.html • Hadoop, Amazon EC2, Mechanical Turk • Data Science Toolkit: http://www.datasciencetoolkit.org/ • 80% of the effort is often in cleaning and processing data
  20. DataScienceToolkit.org • Runs on a VM or Amazon EC2 • Street Address to Coordinates • Coordinates to Political Areas • Geodict (text extraction) • IP Address to Coordinates • New UK release on GitHub
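
As a concrete example, the street-address geocoder is exposed as a plain HTTP endpoint returning JSON. A minimal sketch with Python's requests library, assuming the /street2coordinates endpoint and the response shape described in the DSTK docs (verify against a live instance, or point the URL at your own VM or EC2 install):

    import requests

    # Geocode a street address with the Data Science Toolkit.
    address = "2543 Graystone Place, Simi Valley, CA 93065"
    url = ("http://www.datasciencetoolkit.org/street2coordinates/"
           + requests.utils.quote(address))
    result = requests.get(url).json()
    # The response is keyed by the input address string.
    info = result[address]
    print(info["latitude"], info["longitude"])
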
  21. Resources for location data • SimpleGeo • Factual • Geonames • Infochimps • Data.gov • DataWrangling.com
  22. Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  23. Hadoop: Motivation • We want to crunch 1 TB of Twitter stream data and understand spatial patterns in Tweets • Data collected from the Twitter “Garden Hose” API last spring
  24. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  25. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant; handles replication, node failure, etc. ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4,000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  26. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given a tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum the values ‣ Output: county, tweet count ‣ With 2x machines, it runs close to 2x faster.
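
In code, that recipe maps directly onto Hadoop Streaming, which lets you write the map and reduce steps as plain scripts that read stdin and write stdout. A minimal Python sketch, not from the deck; it assumes tab-separated tweet records with the county name in the first field, and the file name county_count.py is hypothetical:

    #!/usr/bin/env python
    # county_count.py: per-county tweet counts via Hadoop Streaming.
    import sys

    def mapper():
        # Emit "county<TAB>1" for every input record.
        for line in sys.stdin:
            county = line.rstrip("\n").split("\t")[0]
            if county:
                print("%s\t1" % county)

    def reducer():
        # Streaming sorts by key between map and reduce, so identical
        # counties arrive as contiguous runs; sum each run.
        current, total = None, 0
        for line in sys.stdin:
            county, count = line.rstrip("\n").split("\t")
            if county != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = county, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        if len(sys.argv) > 1 and sys.argv[1] == "map":
            mapper()
        else:
            reducer()

Run it with something like: hadoop jar hadoop-streaming.jar -input tweets -output county_counts -mapper 'county_count.py map' -reducer 'county_count.py reduce' -file county_count.py (the streaming jar's path and exact flags vary by Hadoop version).
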
  33. But... ‣ Analysis is typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: hard to manage ‣ Prototyping/exploration requires compilation ‣ analytics in Eclipse? ur doin it wrong...
  34. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  35. Why Pig? ‣ Because I bet you can read the following script.
  36. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  37. No, seriously.
  38. Pig Simplifies Analysis ‣ The Pig version is: ‣ 5% of the code, 5% of the time ‣ Within 50% of the execution time. ‣ Pig Geo: ‣ Programmable: fuzzy matching, custom filtering ‣ Easily link multiple datasets, regardless of size/structure ‣ Iterative, quick
  39. A Real Example ‣ Fire up your Elastic MapReduce Cluster. ‣ ... or follow along at http://bit.ly/whereanalytics ‣ I used Twitter’s streaming API to store some tweets ‣ Simplest thing: group by location and count with Pig ‣ http://bit.ly/where20pig ‣ Here comes some code!
  40. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...); -- load the raw tweet sample from S3 (schema truncated on the slide)
  42. tweets_with_location = FILTER tweets BY user_location != 'NULL'; -- drop tweets whose profile has no location
  43. normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location; -- lowercase so variants group together
  44. grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10; -- one bag per location string, using 10 reducers
  45. location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count; -- SIZE of each bag is the tweet count
  46. sorted_counts = ORDER location_counts BY user_count DESC; -- most-tweeted locations first
  47. STORE sorted_counts INTO 'global_location_tweets'; -- write the results out
  48. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
      brasil 37985
      indonesia 33777
      brazil 22432
      london 17294
      usa 14564
      são paulo 14238
      new york 13420
      tokyo 10967
      singapore 10225
      rio de janeiro 10135
      los angeles 9934
      california 9386
      chicago 9155
      uk 9095
      jakarta 9086
      germany 8741
      canada 8201
      7696
      7121
      jakarta, indonesia 6480
      nyc 6456
      new york, ny 6331
  49. Neat, but... ‣ Wow, that data is messy! ‣ brasil, brazil at #1 and #3 ‣ new york, nyc, and new york ny all in the top 30 ‣ Mechanical Turk to the rescue...
  50. Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  51. Code examples we’ll cover are on Github
  52. You can run them on Elastic MapReduce
  53. Cleaning Twitter Profile Location Names: Filter Exact Matches, Extract Top Tweet Locations, Clean with MTurk, Aggregate Context with Hadoop
  54. We will map locations to GeoNames IDs
  55. Start with Location Exact Matches
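
The exact-match pass needs no human input: normalize each profile location string and look it up against GeoNames place names. A toy sketch, assuming the tab-separated allCountries.txt dump from geonames.org (geonameid in column 0, name in column 1); only the unmatched strings go on to Mechanical Turk:

    # Build a name -> geonameid lookup, then split locations into
    # exact matches and leftovers for MTurk.
    def load_geonames(path):
        lookup = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                geonames_id, name = fields[0], fields[1]
                lookup.setdefault(name.lower().strip(), geonames_id)
        return lookup

    def exact_matches(locations, lookup):
        resolved, dirty = {}, []
        for loc in locations:
            hit = lookup.get(loc.lower().strip())
            if hit:
                resolved[loc] = hit
            else:
                dirty.append(loc)  # candidates for Mechanical Turk
        return resolved, dirty
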
  56. Use Mechanical Turk to improve results
  57. Workers do simple tasks for a few cents
  58. We constructed the following task
  59. Workers used a Geonames search tool
  60. Location search tool code is on Github
  61. Preparing Data to send to MTurk
  62. We use consensus answers from workers
  63. Processing MTurk Output
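
Consensus here means redundancy: give the same location to several workers and keep only the answer a clear majority agrees on. A sketch of that vote, assuming a hypothetical CSV export with 'location' and 'geonames_id' columns, one row per worker judgment:

    import csv
    from collections import Counter, defaultdict

    def consensus(path, min_agreement=0.6):
        # Tally every worker's GeoNames pick for each location string.
        votes = defaultdict(Counter)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                votes[row["location"]][row["geonames_id"]] += 1
        # Keep a mapping only when a clear majority of workers agree.
        accepted = {}
        for location, counts in votes.items():
            answer, n = counts.most_common(1)[0]
            if n / sum(counts.values()) >= min_agreement:
                accepted[location] = answer
        return accepted
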
  64. Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  65. Tokenizing and Cleaning Tweet Text ‣ Extract Tweet topics with Hadoop + Python + NLTK + Wikipedia
  66. Build Phrase Dictionary with Wikipedia
  67. Streaming Tweet Parser (Python + NLTK)
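
The parser itself isn't reproduced on the slide, but its shape is straightforward: read tweets on stdin, tokenize with NLTK, and emit the n-grams that appear in the Wikipedia phrase dictionary. A minimal sketch under those assumptions; the phrases.txt file (one phrase per line, built from Wikipedia titles) is hypothetical:

    #!/usr/bin/env python
    # Streaming phrase extractor: tweets in on stdin, "phrase<TAB>1"
    # pairs out on stdout, suitable as a Hadoop Streaming mapper.
    import sys
    from nltk.tokenize import word_tokenize  # needs NLTK's 'punkt' data

    with open("phrases.txt", encoding="utf-8") as f:
        phrases = set(line.strip().lower() for line in f)

    MAX_LEN = 3  # longest phrase we look for, in tokens

    for line in sys.stdin:
        tokens = [t.lower() for t in word_tokenize(line)]
        for n in range(1, MAX_LEN + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                if gram in phrases:
                    print("%s\t1" % gram)
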
  68. Parse Tweets and Join to Wikipedia (Pig)
  69. Aggregate by US County for Analysis
  70. Clean Data => Thematic US County Map
  71. Twitter users by county in our sample
  72. “Lady Gaga” Tweets
  73. “Tea Party” Tweets
  74. “Dallas” Tweets
  75. “Stephen Colbert” Tweets
  76. Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  77. LinkedIn Skills
  78. Skills in the Design Industry
  79. Exploring the Spatial Distribution of Skills
  80. People with “Ship Building” Skills
  81. What is the Skill profile of a given city?
  82. Expertise correlated with Santa Clara, CA
  83. Expertise correlated with Los Angeles
  84. Expertise correlated with Washington, DC
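
The deck doesn't name the statistic behind "correlated with", but a common choice for city skill profiles is lift: P(skill | city) / P(skill overall), i.e. how over-represented a skill is among a city's members. A sketch with made-up member records, not LinkedIn's actual method:

    from collections import Counter

    # Toy member records: (city, [skills]); invented for illustration.
    members = [
        ("Santa Clara, CA", ["semiconductors", "java"]),
        ("Santa Clara, CA", ["semiconductors", "hardware"]),
        ("Washington, DC", ["public policy", "java"]),
    ]

    def skill_lift(members, city):
        # lift(skill, city) = P(skill | city) / P(skill overall)
        overall, local = Counter(), Counter()
        n_total = n_city = 0
        for member_city, skills in members:
            n_total += 1
            overall.update(set(skills))
            if member_city == city:
                n_city += 1
                local.update(set(skills))
        return {s: (local[s] / n_city) / (overall[s] / n_total)
                for s in local}

    print(sorted(skill_lift(members, "Santa Clara, CA").items(),
                 key=lambda kv: -kv[1]))
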
  85. Yuba City, CA has 21.3% Unemployment
  86. Ames, Iowa has 4.7% Unemployment
  87. Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  88. Questions? Follow me at twitter.com/peteskomoroch datawrangling.com
