Geo Analytics Tutorial (Where 2.0 2011)
Pete Skomoroch
Sr. Data Scientist - LinkedIn (@peteskomoroch)

#geoanalytics
** Hadoop Intro slides from Kevin Weil, Twitter
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Analytics & Data are Hot Topics
Data Exhaust
               My Delicious Tags
Data Science




       * http://www.drewconway.com/zia/?p=2378
Data Visualization




          ‣   http://www.dataspora.com/blog/
Spatial Analysis

Map by Dr. John Snow of London, showing clusters of cholera cases in the 1854 Broad Street cholera outbreak. This was one of the first uses of map-based spatial analysis.
Spatial Analysis

• Spatial regression - estimate dependencies between variables
• Gravity models - estimate the flow of people, material, or information between locations
• Spatial interpolation - estimate variables at unobserved locations based on other measured values (a small code sketch follows this list)
• Simulation - use models and data to predict spatial phenomena
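To make the interpolation idea concrete, here is a minimal inverse-distance-weighting sketch in Python. IDW is just one common interpolation scheme, and the sample points, values, and power parameter below are made up for illustration.

import numpy as np

def idw_interpolate(known_xy, known_values, query_xy, power=2.0):
    """Inverse-distance-weighted estimate at query points.

    known_xy:     (n, 2) array of observed (x, y) locations
    known_values: (n,) array of measured values at those locations
    query_xy:     (m, 2) array of locations to estimate
    """
    known_xy = np.asarray(known_xy, dtype=float)
    known_values = np.asarray(known_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)

    # Pairwise distances between query points and observed points.
    d = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    d = np.where(d == 0, 1e-12, d)   # avoid division by zero at exact hits
    w = 1.0 / d ** power             # nearer observations get more weight
    return (w * known_values).sum(axis=1) / w.sum(axis=1)

# Toy example: estimate a value at (0.5, 0.5) from three observed points.
obs_xy = [(0, 0), (1, 0), (0, 1)]
obs_v = [10.0, 20.0, 30.0]
print(idw_interpolate(obs_xy, obs_v, [(0.5, 0.5)]))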
Life Span & Food by Zip Code




* http://zev.lacounty.gov/news/health/death-by-zip-code
* http://www.verysmallarray.com/?p=975
Where Americans Are Moving (IRS Data)




 ‣   (Jon Bruner) http://jebruner.com/2010/06/the-migration-map/
Facebook Connectivity (Pete Warden)




* http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Useful Geo Tools

• R, Matlab, SciPy, commercial geo software
• R spatial packages: http://cran.r-project.org/web/views/Spatial.html
• Hadoop, Amazon EC2, Mechanical Turk
• Data Science Toolkit: http://www.datasciencetoolkit.org/
• 80% of effort is often in cleaning and processing data
DataScienceToolkit.org

• Runs on a VM or Amazon EC2
• Street Address to Coordinates
• Coordinates to Political Areas
• Geodict (text extraction)
• IP Address to Coordinates
• New UK release on Github
(Example calls against these lookups follow below.)
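A small sketch of calling a Data Science Toolkit instance over HTTP. The endpoint names follow the DSTK documentation as I recall it and should be checked against your running instance; the address and IP address are illustrative values only.

import requests
from urllib.parse import quote

# Assumes a reachable DSTK instance (a local VM or the public server).
DSTK = "http://www.datasciencetoolkit.org"

# Street address -> coordinates
address = "1600 Pennsylvania Ave NW, Washington, DC"
print(requests.get(f"{DSTK}/street2coordinates/{quote(address)}").json())

# IP address -> coordinates (illustrative IP)
print(requests.get(f"{DSTK}/ip2coordinates/8.8.8.8").json())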
Resources for location data

• SimpleGeo
• Factual
• Geonames
• Infochimps
• Data.gov
• DataWrangling.com
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Hadoop: Motivation

• We want to crunch 1 TB of Twitter stream data and understand spatial patterns in Tweets
• Data collected from the Twitter “Garden Hose” API last spring
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB
    compressed/day
‣   CERN/LHC: 40 TB/day (15
    PB/year!)
‣   And growth is accelerating
‣   Need multiple machines,
    horizontal scalability
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant, handles replication, node failure, etc
‣   MapReduce-based parallel computation
    (even harder to process a PB)
‣   Generic key-value based computation interface
    allows for wide applicability
‣   Open source, top-level Apache project
‣   Scalable: Y! has a 4000-node cluster
‣   Powerful: sorted a TB of random integers in 62 seconds
MapReduce?
cat file | grep geo | sort | uniq -c > output

‣   Challenge: how many tweets per county, given a tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=county, value=1
‣   Shuffle: sort by county
‣   Reduce: for each county, sum
‣   Output: county, tweet count
‣   With 2x machines, runs close to 2x faster.
‣   (A Hadoop Streaming version of this job, in Python, follows below.)
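As a concrete version of the county-count job, here is a minimal Hadoop Streaming sketch in Python. It assumes tab-separated tweet records with the county name in the first field, which is an assumption for illustration rather than the real table layout; the two scripts are handed to Hadoop Streaming via its -mapper and -reducer options.

#!/usr/bin/env python
# mapper.py: emit (county, 1) for each tweet record.
# Assumes tab-separated input with the county name in field 0 (illustrative).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print("%s\t1" % fields[0])

#!/usr/bin/env python
# reducer.py: sum the counts for each county.
# Hadoop Streaming sorts mapper output by key, so all records
# for a given county arrive at the reducer together.
import sys

current_county, count = None, 0
for line in sys.stdin:
    county, value = line.rstrip("\n").split("\t")
    if county != current_county:
        if current_county is not None:
            print("%s\t%d" % (current_county, count))
        current_county, count = county, 0
    count += int(value)
if current_county is not None:
    print("%s\t%d" % (current_county, count))

With twice the machines, the map and reduce work splits across twice as many workers, which is where the near-2x speedup on the slide comes from.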
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins: lengthy, error-prone
‣   n-stage jobs: Hard to manage
‣   Prototyping/exploration requires compilation
‣   analytics in Eclipse? ur doin it wrong...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process data one step at a time
            ‣   Easier than SQL?
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis

‣   The Pig version is:
‣        5% of the code, 5% of the time
‣        Within 50% of the execution time.
‣   Pig + Geo:
    ‣   Programmable: fuzzy matching, custom filtering
    ‣   Easily link multiple datasets, regardless of size/structure
    ‣   Iterative, quick
A Real Example

‣   Fire up your Elastic MapReduce Cluster.
    ‣   ... or follow along at http://bit.ly/whereanalytics
‣   I used Twitter’s streaming API to store some tweets
‣   Simplest thing: group by location and count with Pig
    ‣   http://bit.ly/where20pig


‣   Here comes some code!
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);
tweets_with_location = FILTER tweets BY user_location != 'NULL';

normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;

grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;

location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;

sorted_counts = ORDER location_counts BY user_count DESC;

STORE sorted_counts INTO 'global_location_tweets';
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil           37985
indonesia        33777
brazil           22432
london           17294
usa              14564
são paulo        14238
new york         13420
tokyo            10967
singapore        10225
rio de janeiro   10135
los angeles      9934
california       9386
chicago          9155
uk               9095
jakarta          9086
germany          8741
canada           8201
                 7696
                 7121
jakarta, indonesia  6480
nyc              6456
new york, ny     6331
Neat, but...

 ‣   Wow, that data is messy!
     ‣   brasil, brazil at #1 and #3
     ‣   new york, nyc, and new york ny all in the top 30
 ‣   Mechanical Turk to the rescue...
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Code examples we’ll cover are on Github
You can run them on Elastic MapReduce
Cleaning Twitter Profile Location Names


Pipeline: Extract Top Tweet Locations → Filter Exact Matches → Aggregate Context with Hadoop → Clean with MTurk
We will map locations to GeoNames IDs
Start with Location Exact Matches
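A minimal sketch of the exact-match pass in Python, assuming a GeoNames dump loaded as a name-to-ID dictionary. The file names and column positions follow the standard GeoNames tab-separated layout, but treat them as assumptions and adjust to the files you actually use.

# Exact-match pass: map normalized profile location strings to GeoNames IDs.
def load_geonames_index(path):
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) > 1:
                geonameid, name = cols[0], cols[1]
                index.setdefault(name.lower(), geonameid)
    return index

def exact_match(locations_path, index):
    matched, unmatched = {}, []
    with open(locations_path, encoding="utf-8") as f:
        for line in f:
            loc = line.strip().lower()
            if loc in index:
                matched[loc] = index[loc]
            else:
                unmatched.append(loc)   # these go on to MTurk cleanup
    return matched, unmatched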
Use Mechanical Turk to improve results
Workers do simple tasks for a few cents
We constructed the following task
Workers used a Geonames search tool
Location search tool code is on Github
Preparing Data to send to MTurk
We use consensus answers from workers
Processing MTurk Output
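A minimal sketch of the consensus step: take the majority GeoNames ID per location string and accept it only when enough workers agree. The CSV column names ("Input.location", "Answer.geonames_id") are illustrative; match them to your actual HIT layout.

# Consensus over Mechanical Turk answers.
import csv
from collections import Counter, defaultdict

def consensus_from_mturk(results_csv, min_agreement=2):
    votes = defaultdict(Counter)
    with open(results_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            votes[row["Input.location"]][row["Answer.geonames_id"]] += 1

    accepted, needs_review = {}, []
    for location, counter in votes.items():
        answer, count = counter.most_common(1)[0]
        if count >= min_agreement:
            accepted[location] = answer      # enough workers agreed
        else:
            needs_review.append(location)    # send back out or review by hand
    return accepted, needs_review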
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Tokenizing and Cleaning Tweet Text
‣   Extract Tweet topics with Hadoop + Python + NLTK + Wikipedia
Build Phrase Dictionary with Wikipedia
Streaming Tweet Parser (Python + NLTK)
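A minimal sketch of what a streaming parser of this kind can look like: a Hadoop Streaming mapper that tokenizes tweet text with NLTK and emits phrases found in a Wikipedia-derived phrase dictionary. The field layout, phrase file, and n-gram window are assumptions for illustration, not the original parser.

#!/usr/bin/env python
# tweet_parser.py: Hadoop Streaming mapper.
# Requires NLTK and its 'punkt' tokenizer data (nltk.download('punkt')) on each node.
import sys
import nltk

# Phrase dictionary: one lowercase phrase per line, built offline from Wikipedia titles
# (the file name is illustrative).
with open("wikipedia_phrases.txt", encoding="utf-8") as f:
    PHRASES = set(line.strip() for line in f)

MAX_NGRAM = 3  # longest phrase length to look for

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    tweet_text = fields[-1].lower()          # assumes tweet text is the last field
    tokens = nltk.word_tokenize(tweet_text)
    # Slide an n-gram window over the tokens and emit any dictionary phrase.
    for n in range(1, MAX_NGRAM + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in PHRASES:
                print("%s\t1" % phrase)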
Parse Tweets and Join to Wikipedia (Pig)
Aggregate by US County for Analysis
Clean Data => Thematic US County Map
Twitter users by county in our sample
“Lady Gaga” Tweets
“Tea Party” Tweets
“Dallas” Tweets
“Stephen Colbert” Tweets
Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
LinkedIn Skills
Skills in the Design Industry
Exploring the Spatial Distribution of Skills
People with “Ship Building” Skills
What is the Skill profile of a given city?
Expertise correlated with Santa Clara, CA
Expertise correlated with Los Angeles
Expertise correlated with Washington, DC
Yuba City, CA has 21.3% Unemployment




Ames, Iowa has 4.7% Unemployment




Topics
‣   Data Science & Geo Analytics
‣   Useful Geo tools and Datasets
‣   Hadoop, Pig, and Big Data
‣   Cleaning Location Data with Mechanical Turk
‣   Spatial Tweet Analytics with Hadoop & Python
‣   Using Social Data to Understand Cities
‣   Q&A
Questions?
Follow me at twitter.com/peteskomoroch and datawrangling.com
