Spatial Analytics Workshop
Pete Skomoroch, LinkedIn (@peteskomoroch)
Kevin Weil, Twitter (@kevinweil)
Sean Gorman, Fortius...
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing ...
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing ...
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing ...
Spatial Analysis

      Analytical techniques to determine the spatial
      distribution of a variable, the relationship ...
Pattern Analysis
Spatial Analysis Types

     1. Spatial autocorrelation
     2. Spatial interpolation
     3. Spatial interaction
     4. ...
Spatial Autocorrelation

      Spatial autocorrelation statistics measure and analyze
      the degree of dependency among...
Moran’s I - Per Capita
Moran’s I - Random Variable   Income in Monroe County




       Moran’s I = .012              Mora...
Spatial Interpolation

      Spatial interpolation methods estimate the variables
      at unobserved locations in geograp...
$14.00
                                                   Chicago




                                                    ...
$18.50
                                                   Chicago




                                                    ...
$20.00
                                                   Chicago




                                                    ...
Spatial Interaction

      Spatial interaction or “gravity models” estimate
      the flow of people, material, or informa...
Introduction
‣   Motiviation
‣   Execution
‣   Prototype
‣   Service
‣   API
‣   Operations
‣   UX

                  Glob...
Simulation and Modeling

      Simple interactions among proximal entities can
      lead to intricate, persistent, and fu...
Spatial Interdependency Analysis of
                                                                            the San Fr...
Density Mapping

     Calculating the proximity and frequency of a
     spatial phenomenon by creating a probabilistic
   ...
New York City Fiber Density Map
Standard GIS Architectures
Distributed Analytics

      Queueing analysis tasks from disparate data sources
      for agents to run across distribute...
Disparate Data




                                               Distributed Servers
                                    ...
(http://finder.geocommons.com/overlays/20148)




       1. Rasterize
       2. Kernel
          density calc
       3. Col...
Vector Density Mapping Demo
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing ...
Data is Getting Big
‣   NYSE: 1 TB/day
‣   Facebook: 20+ TB
    compressed/day
‣   CERN/LHC: 40 TB/day (15
    PB/year!)
‣...
Hadoop
‣   Distributed file system (hard to store a PB)
‣   Fault-tolerant, handles replication, node failure, etc
‣   Map...
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                            ...
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                            ...
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                            ...
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                            ...
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                            ...
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                            ...
MapReduce?
cat file | grep geo | sort | uniq -c > output   ‣   Challenge: how many tweets per
                            ...
But...
‣   Analysis typically done in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom...
Enter Pig

            ‣   High level language
            ‣   Transformations on sets of records
            ‣   Process ...
Why Pig?
‣   Because I bet you can read the following script.
A Real Pig Script




‣   Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
No, seriously.
Pig Simplifies Analysis

‣   The Pig version is:
‣        5% of the code, 5% of the time
‣        Within 50% of the execut...
A Real Example

‣   Fire up your EMR.
    ‣   ... or follow along at http://bit.ly/whereanalytics
‣   Pete used Twitter’s ...
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_frien...
tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_frien...
tweets_with_location = FILTER tweets BY user_location !=
'NULL';
normalized_locations = FOREACH tweets_with_location
GENERATE LOWER(user_location) as user_location;
grouped_tweets = GROUP normalized_locations BY
user_location PARALLEL 10;
location_counts = FOREACH grouped_tweets GENERATE $0 as
location, SIZE($1) as user_count;
sorted_counts = ORDER location_counts BY user_count DESC;
STORE sorted_counts INTO 'global_location_tweets';
hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil           37985
indonesia    ...
Neat, but...

 ‣   Wow, that data is messy!
     ‣   brasil, brazil at #1 and #3
     ‣   new york, nyc, and new york ny a...
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing ...
Users by County
Lady Gaga
Tea Party
Dallas
Colbert
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing ...
Introduction
‣   The Rise of Spatial Analytics
‣   Spatial Analysis Techniques
‣   Hadoop, Pig, and Big Data
‣   Bringing ...
Questions?   Follow us at
             twitter.com/peteskomoroch
             twitter.com/kevinweil
             twitter.c...
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
O'Reilly Where 2.0 Spatial Analytics Workshop
Upcoming SlideShare
Loading in...5
×

O'Reilly Where 2.0 Spatial Analytics Workshop

2,303

Published on

This workshop focused on quickly uncovering patterns and generating actionable insights from large datasets using spatial analytics. The three part workshop consisted of: 1) an overview of traditional spatial analysis tools 2) an intro to Hadoop and other tools for analyzing terabytes or more of data 3) Live code examples on combining the two to uncover patterns in location data pulled from the Twitter streaming API using Mechanical Turk, Amazon Elastic MapReduce, and Pig.

Given at the O’Reilly Where 2.0 conference in March 2010.

http://en.oreilly.com/where2010/public/schedule/detail/12400

Published in: Technology
1 Comment
5 Likes
Statistics
Notes
No Downloads
Views
Total Views
2,303
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
77
Comments
1
Likes
5
Embeds 0
No embeds

No notes for slide

O'Reilly Where 2.0 Spatial Analytics Workshop

  1. 1. Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics
  2. 2. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  3. 3. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  4. 4. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  5. 5. Spatial Analysis Analytical techniques to determine the spatial distribution of a variable, the relationship between the spatial distribution of variables, and the association of the variables in an area.
  6. 6. Pattern Analysis
  7. 7. Spatial Analysis Types 1. Spatial autocorrelation 2. Spatial interpolation 3. Spatial interaction 4. Simulation and modeling 5. Density mapping
  8. 8. Spatial Autocorrelation Spatial autocorrelation statistics measure and analyze the degree of dependency among observations in a geographic space. First law of geography: “everything is related to everything else, but near things are more related than distant things.” -- Waldo Tobler
  9. 9. Moran’s I - Per Capita Moran’s I - Random Variable Income in Monroe County Moran’s I = .012 Moran’s I = .66
  10. 10. Spatial Interpolation Spatial interpolation methods estimate the variables at unobserved locations in geographic space based on the values at observed locations.
  11. 11. $14.00 Chicago $14.00 NYC $7.55 Henry Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front
  12. 12. $18.50 Chicago $30.00 NYC $16.00 Henry Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front
  13. 13. $20.00 Chicago $37.00 NYC $22.00 Henry Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front
  14. 14. Spatial Interaction Spatial interaction or “gravity models” estimate the flow of people, material, or information between locations in geographic space.
  15. 15. Introduction ‣ Motiviation ‣ Execution ‣ Prototype ‣ Service ‣ API ‣ Operations ‣ UX Global Oil Supply and Demand Gravity Model
  16. 16. Simulation and Modeling Simple interactions among proximal entities can lead to intricate, persistent, and functional spatial entities at aggregate levels (complex adaptive systems).
  17. 17. Spatial Interdependency Analysis of the San Francisco Failure Simulation Total Number of No. Links % Links %Volume Infrastructure Links Congested Congested Delay Refined Products (National) 3,197 1 0.03% 0.05% Refined Products (MSA) 12.50% 8 1 93% Power Grid (Regional) 1,942 4 0% N/A Power Grid (MSA) 16 2 13% N/A
  18. 18. Density Mapping Calculating the proximity and frequency of a spatial phenomenon by creating a probabilistic surface.
  19. 19. New York City Fiber Density Map
  20. 20. Standard GIS Architectures
  21. 21. Distributed Analytics Queueing analysis tasks from disparate data sources for agents to run across distributed servers to collate back to the user as answers.
  22. 22. Disparate Data Distributed Servers Agents User Request Queue Analysis
  23. 23. (http://finder.geocommons.com/overlays/20148) 1. Rasterize 2. Kernel density calc 3. Color map Agent Amazon EC2 User Request Queue Amazon S3
  24. 24. Vector Density Mapping Demo
  25. 25. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  26. 26. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year!) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability
  27. 27. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability ‣ Open source, top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds
  28. 28. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  29. 29. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  30. 30. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  31. 31. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  32. 32. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  33. 33. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  34. 34. MapReduce? cat file | grep geo | sort | uniq -c > output ‣ Challenge: how many tweets per county, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=county, value=1 ‣ Shuffle: sort by county ‣ Reduce: for each county, sum ‣ Output: county, tweet count ‣ With 2x machines, runs close to 2x faster.
  35. 35. But... ‣ Analysis typically done in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins: lengthy, error-prone ‣ n-stage jobs: Hard to manage ‣ Prototyping/exploration requires ‣ analytics in Eclipse? compilation ur doin it wrong...
  36. 36. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  37. 37. Why Pig? ‣ Because I bet you can read the following script.
  38. 38. A Real Pig Script ‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
  39. 39. No, seriously.
  40. 40. Pig Simplifies Analysis ‣ The Pig version is: ‣ 5% of the code, 5% of the time ‣ Within 50% of the execution time. ‣ Pig Geo: ‣ Programmable: fuzzy matching, custom filtering ‣ Easily link multiple datasets, regardless of size/structure ‣ Iterative, quick
  41. 41. A Real Example ‣ Fire up your EMR. ‣ ... or follow along at http://bit.ly/whereanalytics ‣ Pete used Twitter’s streaming API to store some tweets ‣ Simplest thing: group by location and count with Pig ‣ http://bit.ly/where20pig ‣ Here comes some code!
  42. 42. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  43. 43. tweets = LOAD 's3://where20demo/sample-tweets' as ( user_screen_name:chararray, tweet_id:chararray, ... user_friends_count:int, user_statuses_count:int, user_location:chararray, user_lang:chararray, user_time_zone:chararray, place_id:chararray, ...);
  44. 44. tweets_with_location = FILTER tweets BY user_location != 'NULL';
  45. 45. normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;
  46. 46. grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;
  47. 47. location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;
  48. 48. sorted_counts = ORDER location_counts BY user_count DESC;
  49. 49. STORE sorted_counts INTO 'global_location_tweets';
  50. 50. hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30 brasil 37985 indonesia 33777 brazil 22432 london 17294 usa 14564 são paulo 14238 new york 13420 tokyo 10967 singapore 10225 rio de janeiro 10135 los angeles 9934 california 9386 chicago 9155 uk 9095 jakarta 9086 germany 8741 canada 8201 7696 7121 jakarta, indonesia 6480 nyc 6456 new york, ny 6331
  51. 51. Neat, but... ‣ Wow, that data is messy! ‣ brasil, brazil at #1 and #3 ‣ new york, nyc, and new york ny all in the top 30 ‣ Pete to the rescue.
  52. 52. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  53. 53. Users by County
  54. 54. Lady Gaga
  55. 55. Tea Party
  56. 56. Dallas
  57. 57. Colbert
  58. 58. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  59. 59. Introduction ‣ The Rise of Spatial Analytics ‣ Spatial Analysis Techniques ‣ Hadoop, Pig, and Big Data ‣ Bringing the Two Together ‣ Conclusion ‣ Q&A
  60. 60. Questions? Follow us at twitter.com/peteskomoroch twitter.com/kevinweil twitter.com/seangorman
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×