Geo Analytics Tutorial (Where 2.0 2011)
Pete Skomoroch
Sr. Data Scientist - LinkedIn (@peteskomoroch)

#geoanalytics
** Hadoop Intro slides from Kevin Weil, Twitter
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Analytics & Data are Hot Topics
Data Exhaust
               My Delicious Tags
Data Science




       * http://www.drewconway.com/zia/?p=2378
Data Visualization




          ‣   http://www.dataspora.com/blog/
Spatial Analysis

Map by Dr. John Snow of London, showing clusters of cholera cases in the 1854 Broad Street cholera outbreak. This was one of the first uses of map-based spatial analysis.
Spatial Analysis

• Spatial regression - estimate dependencies between variables
• Gravity models - estimate the flow of people, material, or information between locations
• Spatial interpolation - estimate variables at unobserved locations based on other measured values (a small code sketch follows this list)
• Simulation - use models and data to predict spatial phenomena
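To make the interpolation idea concrete, here is a minimal inverse-distance-weighting sketch in Python. IDW is just one common interpolation scheme, and the sample points, values, and power parameter below are made up for illustration.

import numpy as np

def idw_interpolate(known_xy, known_values, query_xy, power=2.0):
    """Inverse-distance-weighted estimate at query points.

    known_xy:     (n, 2) array of observed (x, y) locations
    known_values: (n,) array of measured values at those locations
    query_xy:     (m, 2) array of locations to estimate
    """
    known_xy = np.asarray(known_xy, dtype=float)
    known_values = np.asarray(known_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)

    # Pairwise distances between query points and observed points.
    d = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    d = np.where(d == 0, 1e-12, d)   # avoid division by zero at exact hits
    w = 1.0 / d ** power             # nearer observations get more weight
    return (w * known_values).sum(axis=1) / w.sum(axis=1)

# Toy example: estimate a value at (0.5, 0.5) from three observed points.
obs_xy = [(0, 0), (1, 0), (0, 1)]
obs_v = [10.0, 20.0, 30.0]
print(idw_interpolate(obs_xy, obs_v, [(0.5, 0.5)]))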
Life Span & Food by Zip Code




* http://zev.lacounty.gov/news/health/death-by-zip-code
* http://www.verysmallarray.com/?p=975
Where Americans Are Moving (IRS Data)




 ‣   (Jon Bruner) http://jebruner.com/2010/06/the-migration-map/
Facebook Connectivity (Pete Warden)




* http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Useful Geo Tools

• R, Matlab, SciPy, commercial geo software
• R spatial packages: http://cran.r-project.org/web/views/Spatial.html
• Hadoop, Amazon EC2, Mechanical Turk
• Data Science Toolkit: http://www.datasciencetoolkit.org/
• 80% of effort is often in cleaning and processing data
DataScienceToolkit.org

• Runs on a VM or Amazon EC2
• Street Address to Coordinates
• Coordinates to Political Areas
• Geodict (text extraction)
• IP Address to Coordinates
• New UK release on Github
(Example calls against these lookups follow below.)
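A small sketch of calling a Data Science Toolkit instance over HTTP. The endpoint names follow the DSTK documentation as I recall it and should be checked against your running instance; the address and IP address are illustrative values only.

import requests
from urllib.parse import quote

# Assumes a reachable DSTK instance (a local VM or the public server).
DSTK = "http://www.datasciencetoolkit.org"

# Street address -> coordinates
address = "1600 Pennsylvania Ave NW, Washington, DC"
print(requests.get(f"{DSTK}/street2coordinates/{quote(address)}").json())

# IP address -> coordinates (illustrative IP)
print(requests.get(f"{DSTK}/ip2coordinates/8.8.8.8").json())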
Resources for location data

• SimpleGeo
• Factual
• Geonames
• Infochimps
• Data.gov
• DataWrangling.com
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Hadoop: Motivation

• We want to crunch 1 TB of Twitter stream data and understand spatial patterns in Tweets
• Data collected from the Twitter “Garden Hose” API last spring
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB
    compressed/day
‣   CERN/LHC: 40 TB/day (15
    PB/year!)
‣   And growth is accelerating
‣   Need multiple machines,
    horizontal scalability
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant, handles replication, node failure, etc
‣   MapReduce-based parallel computation
    (even harder to process a PB)
‣   Generic key-value based computation interface
    allows for wide applicability
‣   Open source, top-level Apache project
‣   Scalable: Y! has a 4000-node cluster
‣   Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
cat file | grep geo | sort | uniq -c > output

‣   Challenge: how many tweets per county, given a tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=county, value=1
‣   Shuffle: sort by county
‣   Reduce: for each county, sum
‣   Output: county, tweet count
‣   With 2x machines, runs close to 2x faster.
‣   (A Hadoop Streaming version of this job, in Python, follows below.)
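As a concrete version of the county-count job, here is a minimal Hadoop Streaming sketch in Python. It assumes tab-separated tweet records with the county name in the first field, which is an assumption for illustration rather than the real table layout; the two scripts are handed to Hadoop Streaming via its -mapper and -reducer options.

#!/usr/bin/env python
# mapper.py: emit (county, 1) for each tweet record.
# Assumes tab-separated input with the county name in field 0 (illustrative).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])

#!/usr/bin/env python
# reducer.py: sum the counts for each county.
# Hadoop Streaming sorts mapper output by key, so all records
# for a given county arrive at the reducer together.
import sys

current_county, count = None, 0
for line in sys.stdin:
    county, value = line.rstrip("\n").split("\t")
    if county != current_county:
        if current_county is not None:
            print("%s\t%d" % (current_county, count))
        current_county, count = county, 0
    count += int(value)
if current_county is not None:
    print("%s\t%d" % (current_county, count))

With twice the machines, the map and reduce work splits across twice as many workers, which is where the near-2x speedup on the slide comes from.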
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins: lengthy, error-prone
‣   n-stage jobs: Hard to manage
‣   Prototyping/exploration requires compilation
‣   analytics in Eclipse? ur doin it wrong...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process data one step at a time
            ‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis

‣   The Pig version is:
‣        5% of the code, 5% of the time
‣        Within 50% of the execution time.
‣   Pig + Geo:
    ‣   Programmable: fuzzy matching, custom filtering
    ‣   Easily link multiple datasets, regardless of size/structure
    ‣   Iterative, quick
A Real Example

‣   Fire up your Elastic MapReduce Cluster.
    ‣   ... or follow along at http://bit.ly/whereanalytics
‣   I used Twitter’s streaming API to store some tweets
‣   Simplest thing: group by location and count with Pig
    ‣   http://bit.ly/where20pig


‣   Here comes some code!
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);
tweets_with_location = FILTER tweets BY user_location != 'NULL';

normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;

grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;

location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;

sorted_counts = ORDER location_counts BY user_count DESC;

STORE sorted_counts INTO 'global_location_tweets';
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil           37985
indonesia        33777
brazil           22432
london           17294
usa              14564
são paulo        14238
new york         13420
tokyo            10967
singapore        10225
rio de janeiro   10135
los angeles      9934
california       9386
chicago          9155
uk               9095
jakarta          9086
germany          8741
canada           8201
                 7696
                 7121
jakarta, indonesia  6480
nyc              6456
new york, ny     6331
Neat, but...

 ‣   Wow, that data is messy!
     ‣   brasil, brazil at #1 and #3
     ‣   new york, nyc, and new york ny all in the top 30
 ‣   Mechanical Turk to the rescue...
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Code examples we’ll cover are on Github
You can run them on Elastic MapReduce
Cleaning Twitter Profile Location Names


Pipeline: Extract Top Tweet Locations → Filter Exact Matches → Aggregate Context with Hadoop → Clean with MTurk
We will map locations to GeoNames IDs
Start with Location Exact Matches
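A minimal sketch of the exact-match pass in Python, assuming a GeoNames dump loaded as a name-to-ID dictionary. The file names and column positions follow the standard GeoNames tab-separated layout, but treat them as assumptions and adjust to the files you actually use.

# Exact-match pass: map normalized profile location strings to GeoNames IDs.
def load_geonames_index(path):
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) > 1:
                geonameid, name = cols[0], cols[1]
                index.setdefault(name.lower(), geonameid)
    return index

def exact_match(locations_path, index):
    matched, unmatched = {}, []
    with open(locations_path, encoding="utf-8") as f:
        for line in f:
            loc = line.strip().lower()
            if loc in index:
                matched[loc] = index[loc]
            else:
                unmatched.append(loc)   # these go on to MTurk cleanup
    return matched, unmatched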
Use Mechanical Turk to improve results
Workers do simple tasks for a few cents
We constructed the following task
Workers used a Geonames search tool
Location search tool code is on Github
Preparing Data to send to MTurk
We use consensus answers from workers
Processing MTurk Output
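A minimal sketch of the consensus step: take the majority GeoNames ID per location string and accept it only when enough workers agree. The CSV column names ("Input.location", "Answer.geonames_id") are illustrative; match them to your actual HIT layout.

# Consensus over Mechanical Turk answers.
import csv
from collections import Counter, defaultdict

def consensus_from_mturk(results_csv, min_agreement=2):
    votes = defaultdict(Counter)
    with open(results_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            votes[row["Input.location"]][row["Answer.geonames_id"]] += 1

    accepted, needs_review = {}, []
    for location, counter in votes.items():
        answer, count = counter.most_common(1)[0]
        if count >= min_agreement:
            accepted[location] = answer      # enough workers agreed
        else:
            needs_review.append(location)    # send back out or review by hand
    return accepted, needs_review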
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Tokenizing and Cleaning Tweet Text
‣   Extract Tweet topics with Hadoop + Python + NLTK + Wikipedia
Build Phrase Dictionary with Wikipedia
Streaming Tweet Parser (Python + NLTK)
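A minimal sketch of what a streaming parser of this kind can look like: a Hadoop Streaming mapper that tokenizes tweet text with NLTK and emits phrases found in a Wikipedia-derived phrase dictionary. The field layout, phrase file, and n-gram window are assumptions for illustration, not the original parser.

#!/usr/bin/env python
# tweet_parser.py: Hadoop Streaming mapper.
# Requires NLTK and its 'punkt' tokenizer data (nltk.download('punkt')) on each node.
import sys
import nltk

# Phrase dictionary: one lowercase phrase per line, built offline from Wikipedia titles
# (the file name is illustrative).
with open("wikipedia_phrases.txt", encoding="utf-8") as f:
    PHRASES = set(line.strip() for line in f)

MAX_NGRAM = 3  # longest phrase length to look for

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    tweet_text = fields[-1].lower()          # assumes tweet text is the last field
    tokens = nltk.word_tokenize(tweet_text)
    # Slide an n-gram window over the tokens and emit any dictionary phrase.
    for n in range(1, MAX_NGRAM + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in PHRASES:
                print("%s\t1" % phrase)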
Parse Tweets and Join to Wikipedia (Pig)
Aggregate by US County for Analysis
Clean Data => Thematic US County Map
Twitter users by county in our sample
“Lady Gaga” Tweets
“Tea Party” Tweets
“Dallas” Tweets
“Stephen Colbert” Tweets
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
LinkedIn Skills
Skills in the Design Industry
Exploring the Spatial Distribution of Skills
People with “Ship Building” Skills
What is the Skill profile of a given city?
Expertise correlated with Santa Clara, CA
Expertise correlated with Los Angeles
Expertise correlated with Washington, DC
Yuba City, CA has 21.3% Unemployment




Ames, Iowa has 4.7% Unemployment




Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Questions?
Follow me at twitter.com/peteskomoroch and datawrangling.com
