Geo Analytics Tutorial - Where 2.0 2011

O'Reilly Where 2.0 2011
As a result of cheap storage and computing power, society is measuring and storing increasing amounts of information.
It is now possible to efficiently crunch petabytes of data with tools like Hadoop.

In this O'Reilly Where 2.0 tutorial, Pete Skomoroch, Sr. Data Scientist at LinkedIn, gives an overview of spatial analytics and how you can use tools like Hadoop, Python, and Mechanical Turk to process location data and derive insights about cities and people.

Topics:

* Data Science & Geo Analytics
* Useful Geo tools and Datasets
* Hadoop, Pig, and Big Data
* Cleaning Location Data with Mechanical Turk
* Spatial Tweet Analytics with Hadoop & Python
* Using Social Data to Understand Cities


Geo Analytics Tutorial - Where 2.0 2011: Presentation Transcript

  • Geo Analytics Tutorial. Pete Skomoroch, Sr. Data Scientist - LinkedIn (@peteskomoroch). #geoanalytics. Hadoop intro slides from Kevin Weil, Twitter.
  • Topics ‣ Data Science & Geo Analytics ‣ Useful Geo tools and Datasets ‣ Hadoop, Pig, and Big Data ‣ Cleaning Location Data with Mechanical Turk ‣ Spatial Tweet Analytics with Hadoop & Python ‣ Using Social Data to Understand Cities ‣ Q&A
  • Analytics & Data are Hot Topics
  • Data Exhaust: My Delicious Tags
  • Data Science * http://www.drewconway.com/zia/?p=2378
  • Data Visualization ‣ http://www.dataspora.com/blog/
  • Spatial Analysis Map by Dr. John Snow of London, showing clusters of cholera cases in the 1854 Broad Street cholera outbreak. This was one of the first uses of map-based spatial analysis.
  • Spatial Analysis • Spatial regression - estimate dependencies between variables • Gravity models - estimate the flow of people, material, or information between locations • Spatial interpolation - estimate variables at unobserved locations based on other measured values • Simulation - use models and data to predict spatial phenomena
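Of these techniques, spatial interpolation is the easiest to sketch. A minimal inverse-distance-weighting (IDW) interpolator in Python; the point data here is made up for illustration, and real GIS work would use a library rather than this toy:

```python
import math

def idw_interpolate(points, target, power=2):
    """Estimate a value at `target` (x, y) from measured `points`
    [(x, y, value), ...] using inverse distance weighting."""
    num, den = 0.0, 0.0
    for x, y, value in points:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0:
            return value  # target coincides with a measurement
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical readings at three locations
readings = [(0.0, 0.0, 10.0), (1.0, 0.0, 20.0), (0.0, 1.0, 30.0)]
print(idw_interpolate(readings, (0.0, 0.0)))  # exact match -> 10.0
```

Closer measurements dominate the weighted average, which is the intuition behind most of the interpolation methods above.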
  • Life Span & Food by Zip Code * http://zev.lacounty.gov/news/health/death-by-zip-code * http://www.verysmallarray.com/?p=975
  • Where Americans Are Moving (IRS Data) ‣ (Jon Bruner) http://jebruner.com/2010/06/the-migration-map/
  • Facebook Connectivity (Pete Warden) * http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
  • Useful Geo Tools • R, Matlab, SciPy, commercial geo software • R spatial packages: http://cran.r-project.org/web/views/Spatial.html • Hadoop, Amazon EC2, Mechanical Turk • Data Science Toolkit: http://www.datasciencetoolkit.org/ • 80% of effort is often in cleaning and processing data
  • DataScienceToolkit.org • Runs on a VM or Amazon EC2 • Street address to coordinates • Coordinates to political areas • Geodict (text extraction) • IP address to coordinates • New UK release on Github
  • Resources for location data • SimpleGeo • Factual • Geonames • Infochimps • Data.gov • DataWrangling.com
  • Hadoop: Motivation • We want to crunch 1 TB of Twitter stream data and understand spatial patterns in tweets • Data collected from the Twitter “Garden Hose” API last spring
  • Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  • Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc. ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4,000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  • MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given the tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
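The map/shuffle/reduce steps above translate directly into a Hadoop Streaming style mapper/reducer pair. A minimal Python sketch; the tab-separated record layout (county in column 2) is an assumption for illustration, not the actual garden-hose schema:

```python
from itertools import groupby

def mapper(lines):
    """Map: emit (county, 1) for each tweet record.
    Assumes a tab-separated layout with the county in column 2
    (illustrative only)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2:
            yield fields[2], 1

def reducer(pairs):
    """Reduce: pairs arrive grouped by key (the shuffle guarantees
    sorting); sum the 1s for each county."""
    for county, group in groupby(pairs, key=lambda kv: kv[0]):
        yield county, sum(n for _, n in group)

# Locally, a sort between the two stages mimics Hadoop's shuffle
rows = ["1\talice\tSanta Clara", "2\tbob\tSanta Clara", "3\tcarol\tAlameda"]
counts = dict(reducer(sorted(mapper(rows))))
print(counts)  # {'Alameda': 1, 'Santa Clara': 2}
```

On a cluster, the same two functions would read stdin and write stdout in separate processes; doubling the machines splits the mapper input in half, which is where the near-2x speedup comes from.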
  • But... ‣ Analysis typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: hard to manage ‣ Prototyping/exploration requires compilation ‣ analytics in Eclipse? ur doin it wrong...
  • Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • Why Pig? ‣ Because I bet you can read the following script.
  • A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  • No, seriously.
  • Pig Simplifies Analysis ‣ The Pig version is: ‣ 5% of the code, 5% of the time ‣ within 50% of the execution time ‣ Pig Geo: ‣ Programmable: fuzzy matching, custom filtering ‣ Easily link multiple datasets, regardless of size/structure ‣ Iterative, quick
  • A Real Example ‣ Fire up your Elastic MapReduce cluster... or follow along at http://bit.ly/whereanalytics ‣ I used Twitter’s streaming API to store some tweets ‣ Simplest thing: group by location and count with Pig ‣ http://bit.ly/where20pig ‣ Here comes some code!
  • tweets = LOAD 's3://where20demo/sample-tweets' AS (user_screen_name:chararray, tweet_id:chararray, ..., user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  • tweets_with_location = FILTER tweets BY user_location IS NOT NULL;
  • normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) AS user_location;
  • grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
  • location_counts = FOREACH grouped_tweets GENERATE $0 AS location, SIZE($1) AS user_count;
  • sorted_counts = ORDER location_counts BY user_count DESC;
  • STORE sorted_counts INTO 'global_location_tweets';
  • hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
    brasil 37985
    indonesia 33777
    brazil 22432
    london 17294
    usa 14564
    são paulo 14238
    new york 13420
    tokyo 10967
    singapore 10225
    rio de janeiro 10135
    los angeles 9934
    california 9386
    chicago 9155
    uk 9095
    jakarta 9086
    germany 8741
    canada 8201
    7696
    7121
    jakarta, indonesia 6480
    nyc 6456
    new york, ny 6331
  • Neat, but... ‣ Wow, that data is messy! ‣ brasil, brazil at #1 and #3 ‣ new york, nyc, and new york, ny all in the top 30 ‣ Mechanical Turk to the rescue...
  • Code examples we’ll cover are on Github
  • You can run them on Elastic MapReduce
  • Cleaning Twitter Profile Location Names: Extract Top Tweet Locations ‣ Filter Exact Matches ‣ Clean with MTurk ‣ Aggregate Context with Hadoop
  • We will map locations to GeoNames IDs
  • Start with Location Exact Matches
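The exact-match pass can be a straight dictionary lookup against a table of known place names, with misses set aside for Mechanical Turk. A sketch; the GeoNames IDs below are placeholders, not real IDs:

```python
def exact_match(locations, geonames):
    """Split normalized location strings into matched (name -> id)
    and unmatched (to be sent to Mechanical Turk)."""
    matched, unmatched = {}, []
    for loc in locations:
        key = loc.strip().lower()
        if key in geonames:
            matched[loc] = geonames[key]
        else:
            unmatched.append(loc)
    return matched, unmatched

# Placeholder name -> GeoNames ID table, for illustration only
GEONAMES = {"london": 1001, "tokyo": 1002, "new york": 1003}
matched, todo = exact_match(["London", "nyc", "Tokyo", "brasil"], GEONAMES)
print(todo)  # ['nyc', 'brasil'] go to MTurk
```

Anything the lookup resolves is done for free; only the ambiguous long tail (nyc, brasil, ...) needs paid human judgment.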
  • Use Mechanical Turk to improve results
  • Workers do simple tasks for a few cents
  • We constructed the following task
  • Workers used a Geonames search tool
  • Location search tool code is on Github
  • Preparing Data to send to MTurk
  • We use consensus answers from workers
  • Processing MTurk Output
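The consensus step above can be as simple as a majority vote across the redundant assignments each location receives. A sketch, assuming each location was shown to three workers; real pipelines often also weight votes by worker accuracy, and the GeoNames IDs here are illustrative:

```python
from collections import Counter

def consensus(answers, min_agreement=2):
    """Given worker answers {location: [geonames_id, ...]},
    keep the majority answer when at least `min_agreement`
    workers agree; otherwise mark the location unresolved."""
    resolved, unresolved = {}, []
    for location, votes in answers.items():
        top_id, count = Counter(votes).most_common(1)[0]
        if count >= min_agreement:
            resolved[location] = top_id
        else:
            unresolved.append(location)
    return resolved, unresolved

answers = {
    "nyc": [5128581, 5128581, 5128638],          # two of three agree
    "springfield": [4250542, 4409896, 4951788],  # no agreement
}
resolved, unresolved = consensus(answers)
print(resolved)  # {'nyc': 5128581}
```

Unresolved locations can be re-posted as new HITs or dropped, depending on how much coverage the analysis needs.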
  • Tokenizing and Cleaning Tweet Text ‣ Extract tweet topics with Hadoop + Python + NLTK + Wikipedia
  • Build Phrase Dictionary with Wikipedia
  • Streaming Tweet Parser (Python + NLTK)
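A sketch of what such a streaming parser's core might look like. The slides use NLTK; to stay self-contained this approximates tokenization with a regex, and the tiny phrase set stands in for the Wikipedia-derived dictionary:

```python
import re

# Stand-in for the Wikipedia-derived phrase dictionary
PHRASES = {"lady gaga", "tea party", "stephen colbert"}
MAX_PHRASE_LEN = 2  # longest dictionary phrase, in tokens

def tokenize(text):
    """Lowercase and split tweet text, dropping @mentions and URLs."""
    text = re.sub(r"(@\w+|https?://\S+)", " ", text.lower())
    return re.findall(r"[a-z']+", text)

def extract_phrases(text):
    """Greedy longest-match lookup of dictionary phrases in a tweet."""
    tokens = tokenize(text)
    found, i = [], 0
    while i < len(tokens):
        for n in range(MAX_PHRASE_LEN, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in PHRASES:
                found.append(candidate)
                i += n
                break
        else:
            i += 1
    return found

print(extract_phrases("Watching Stephen Colbert talk about the Tea Party http://t.co/x"))
# ['stephen colbert', 'tea party']
```

Run as a Hadoop Streaming mapper, this would emit (phrase, location) pairs per tweet, which the Pig join in the next step can roll up by county.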
  • Parse Tweets and Join to Wikipedia (Pig)
  • Aggregate by US County for Analysis
  • Clean Data => Thematic US County Map
  • Twitter users by county in our sample
  • “Lady Gaga” Tweets
  • “Tea Party” Tweets
  • “Dallas” Tweets
  • “Stephen Colbert” Tweets
  • LinkedIn Skills
  • Skills in the Design Industry
  • Exploring the Spatial Distribution of Skills
  • People with “Ship Building” Skills
  • What is the Skill profile of a given city?
  • Expertise correlated with Santa Clara, CA
  • Expertise correlated with Los Angeles
  • Expertise correlated with Washington, DC
  • Yuba City, CA has 21.3% Unemployment
  • Ames, Iowa has 4.7% Unemployment
  • Questions? Follow me at twitter.com/peteskomoroch datawrangling.com