An overview of traditional spatial analysis tools, an introduction to Hadoop and other tools for analyzing terabytes or more of data, and a primer with examples on combining the two, using data pulled from the Twitter streaming API. Given at the O'Reilly Where 2.0 conference in March 2010.
1. Spatial Analytics Workshop
Pete Skomoroch, LinkedIn (@peteskomoroch)
Kevin Weil, Twitter (@kevinweil)
Sean Gorman, FortiusOne (@seangorman)
#spatialanalytics
2. Introduction
‣ The Rise of Spatial Analytics
‣ Spatial Analysis Techniques
‣ Hadoop, Pig, and Big Data
‣ Bringing the Two Together
‣ Conclusion
‣ Q&A
5. Spatial Analysis
Analytical techniques to determine the spatial
distribution of a variable, the relationship between
the spatial distribution of variables, and the
association of the variables in an area.
7. Spatial Analysis Types
1. Spatial autocorrelation
2. Spatial interpolation
3. Spatial interaction
4. Simulation and modeling
5. Density mapping
8. Spatial Autocorrelation
Spatial autocorrelation statistics measure and analyze
the degree of dependency among observations in a
geographic space.
First law of geography: “everything is related to everything
else, but near things are more related than distant things.”
-- Waldo Tobler
9. Moran’s I
Moran’s I - Random Variable: Moran’s I = .012
Moran’s I - Per Capita Income in Monroe County: Moran’s I = .66
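As a minimal sketch of how a global Moran's I value like the ones above can be computed, the snippet below uses the standard formula I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2. The four-area adjacency matrix and the values are hypothetical and are not the Monroe County data from the slide.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Cross-products of deviations, weighted by spatial adjacency.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_total = sum(sum(row) for row in weights)
    return (n / w_total) * (num / den)


# Hypothetical example: four areas in a row, each adjacent to its neighbours.
line_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 5, 5], line_weights)    # similar values sit together
alternating = morans_i([1, 5, 1, 5], line_weights)  # neighbours always differ
```

Positive values indicate that similar values cluster in space, values near zero indicate no spatial pattern, and negative values indicate that neighbours tend to differ.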
10. Spatial Interpolation
Spatial interpolation methods estimate the variables
at unobserved locations in geographic space based
on the values at observed locations.
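One common interpolation method is inverse distance weighting (IDW), sketched below. The slides do not say which method was used, so this is only an illustration; the function name, sample points, and `power` parameter are assumptions.

```python
import math


def idw_estimate(target, samples, power=2.0):
    """Inverse distance weighted estimate at target from (x, y, value) samples."""
    num = 0.0
    den = 0.0
    for x, y, value in samples:
        dist = math.hypot(target[0] - x, target[1] - y)
        if dist == 0.0:
            # The target coincides with an observation: use it directly.
            return value
        w = 1.0 / dist ** power  # closer observations get larger weights
        num += w * value
        den += w
    return num / den


# Hypothetical observations: (x, y, value).
observed = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint = idw_estimate((1.0, 0.0), observed)
at_sample = idw_estimate((0.0, 0.0), observed)
```

Raising `power` makes the estimate follow the nearest observation more closely; lowering it smooths the surface.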
11. Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
Chicago: $14.00
NYC: $14.00
Henry: $7.55
12. Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
Chicago: $18.50
NYC: $30.00
Henry: $16.00
13. Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
Chicago: $20.00
NYC: $37.00
Henry: $22.00
14. Spatial Interaction
Spatial interaction or “gravity models” estimate
the flow of people, material, or information
between locations in geographic space.
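In its simplest textbook form, a gravity model estimates the flow between two places as proportional to the product of their "masses" (population, supply, demand) and inversely proportional to a power of the distance between them. The sketch below assumes that basic form; the constant `k`, the exponent `beta`, and the input numbers are hypothetical and would normally be calibrated against observed flows.

```python
def gravity_flow(mass_a, mass_b, distance, k=1.0, beta=2.0):
    """Estimated flow between two locations: k * m_a * m_b / d**beta."""
    return k * mass_a * mass_b / distance ** beta


# Hypothetical masses and distances.
near = gravity_flow(100.0, 200.0, 10.0)
far = gravity_flow(100.0, 200.0, 20.0)  # same places, twice as far apart
```

With `beta=2.0`, doubling the distance cuts the estimated flow to a quarter.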
15. Introduction
‣ Motivation
‣ Execution
‣ Prototype
‣ Service
‣ API
‣ Operations
‣ UX
Global Oil Supply and Demand Gravity
Model
16. Simulation and Modeling
Simple interactions among proximal entities can
lead to intricate, persistent, and functional spatial
entities at aggregate levels (complex adaptive
systems).
17. Spatial Interdependency Analysis of the San Francisco Failure Simulation
Infrastructure              | Total Number of Links | No. Links Congested | % Links Congested | % Volume Delay
Refined Products (National) | 3,197                 | 1                   | 0.03%             | 0.05%
Refined Products (MSA)      | 8                     | 1                   | 12.50%            | 93%
Power Grid (Regional)       | 1,942                 | 4                   | 0%                | N/A
Power Grid (MSA)            | 16                    | 2                   | 13%               | N/A
18. Density Mapping
Calculating the proximity and frequency of a
spatial phenomenon by creating a probabilistic
surface.
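A density surface of this kind is often built with kernel density estimation. The sketch below assumes a Gaussian kernel evaluated on a regular grid; the event coordinates, grid, and bandwidth are hypothetical, and the slides do not state which kernel was used.

```python
import math


def kernel_density(points, grid_x, grid_y, bandwidth=1.0):
    """Gaussian kernel density of 2D points, evaluated on a grid.

    Returns surface[row][col], where row indexes grid_y and col indexes grid_x.
    """
    # Normalising constant so the surface integrates to 1 over the plane.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = 0.0
            for px, py in points:
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        surface.append(row)
    return surface


# Hypothetical events: a cluster near (1, 1) and one isolated event at (4, 4).
events = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (4.0, 4.0)]
axis = [0.0, 1.0, 2.0, 3.0, 4.0]
surface = kernel_density(events, axis, axis, bandwidth=0.5)
```

The cell at the cluster gets the highest density, the isolated event a smaller peak, and the empty space between them a value close to zero.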
21. Distributed Analytics
Queueing analysis tasks from disparate data sources
for agents to run across distributed servers to collate
back to the user as answers.
22. Disparate Data
Distributed Servers
Agents
User
Request Queue
Analysis
29. Data is Getting Big
‣ NYSE: 1 TB/day
‣ Facebook: 20+ TB compressed/day
‣ CERN/LHC: 40 TB/day (15 PB/year!)
‣ And growth is accelerating
‣ Need multiple machines, horizontal scalability
30. Hadoop
‣ Distributed file system (hard to store a PB)
‣ Fault-tolerant, handles replication, node failure, etc.
‣ MapReduce-based parallel computation (even harder to process a PB)
‣ Generic key-value based computation interface allows for wide applicability
‣ Open source, top-level Apache project
‣ Scalable: Yahoo! has a 4000-node cluster
‣ Powerful: sorted a TB of random integers in 62 seconds
31. MapReduce?
cat file | grep geo | sort | uniq -c > output
‣ Challenge: how many tweets per county, given tweets table?
‣ Input: key=row, value=tweet info
‣ Map: output key=county, value=1
‣ Shuffle: sort by county
‣ Reduce: for each county, sum
‣ Output: county, tweet count
‣ With 2x machines, runs close to 2x faster.
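The map, shuffle, and reduce steps listed above can be sketched in a single process as follows. This only illustrates the data flow and is not Hadoop code; the `county` field and the sample tweets are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: for each (row id, tweet info) pair, emit (county, 1)."""
    for _row_id, tweet in rows:
        yield tweet["county"], 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key and sort by county."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: sum the ones for each county."""
    return [(county, sum(ones)) for county, ones in grouped]


# Hypothetical input table: key=row, value=tweet info.
tweets = [
    (0, {"county": "Monroe", "text": "hello"}),
    (1, {"county": "Kings", "text": "where 2.0"}),
    (2, {"county": "Monroe", "text": "geo"}),
]
county_counts = reduce_phase(shuffle_phase(map_phase(tweets)))
```

On a real cluster the map and reduce functions run in parallel on many machines and the framework performs the shuffle, which is why adding machines speeds the job up.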
38. But...
‣ Analysis typically done in Java
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins: lengthy, error-prone
‣ n-stage jobs: hard to manage
‣ Prototyping/exploration requires compilation
‣ analytics in Eclipse? ur doin it wrong...
39. Enter Pig
‣ High level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ Easier than SQL?
40. Why Pig?
‣ Because I bet you can read the following script.
41. A Real Pig Script
‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
43. Pig Simplifies Analysis
‣ The Pig version is:
‣ 5% of the code, 5% of the time
‣ Within 50% of the execution time.
‣ Pig Geo:
‣ Programmable: fuzzy matching, custom filtering
‣ Easily link multiple datasets, regardless of size/structure
‣ Iterative, quick
44. A Real Example
‣ Fire up your EMR.
‣ ... or follow along at http://bit.ly/whereanalytics
‣ Pete used Twitter’s streaming API to store some tweets
‣ Simplest thing: group by location and count with Pig
‣ http://bit.ly/where20pig
‣ Here comes some code!
54. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30
brasil 37985
indonesia 33777
brazil 22432
london 17294
usa 14564
são paulo 14238
new york 13420
tokyo 10967
singapore 10225
rio de janeiro 10135
los angeles 9934
california 9386
chicago 9155
uk 9095
jakarta 9086
germany 8741
canada 8201
7696
7121
jakarta, indonesia 6480
nyc 6456
new york, ny 6331
55. Neat, but...
‣ Wow, that data is messy!
‣ brasil, brazil at #1 and #3
‣ new york, nyc, and new york ny all in the top 30
‣ Pete to the rescue.
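One simple way to collapse such variants is a normalization step with a hand-built alias table, sketched below. This is not necessarily the approach taken in the talk; the `ALIASES` table and function names are assumptions, and the counts are taken from the output above.

```python
from collections import Counter

# Hypothetical alias table mapping variant spellings to one canonical name.
ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}


def normalize_location(raw):
    """Lowercase, collapse whitespace, then apply the alias table."""
    cleaned = " ".join(raw.strip().lower().split())
    return ALIASES.get(cleaned, cleaned)


def merge_counts(raw_counts):
    """Sum counts for all raw strings that normalize to the same name."""
    merged = Counter()
    for location, count in raw_counts:
        merged[normalize_location(location)] += count
    return merged


raw = [
    ("brasil", 37985),
    ("brazil", 22432),
    ("new york", 13420),
    ("nyc", 6456),
    ("new york, ny", 6331),
]
merged = merge_counts(raw)
```

An alias table only covers variants someone has listed by hand; free-text profile locations usually also need fuzzy matching or a geocoder.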
74. Questions? Follow us at
twitter.com/peteskomoroch
twitter.com/kevinweil
twitter.com/seangorman