How To Analyze Geolocation Data with Hive and Hadoop

This demo walks through a geolocation dataset from Uber and shows how to explore it with Hive and Hadoop to assess the viability of new products.

Presentation Transcript

  • Spatial Analytics with Hive
    Hive Meetup, July 24, 2013
    @cshanklin
    Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  • Why Spatial Analytics?
    • The amount of spatial data has exploded due to mobile device ubiquity and greater reliance on sensors.
    • The proliferation of consumer-oriented mapping products brings spatial analytics to the mainstream.
  • An Interesting Dataset
    • GPS data collected from Uber trips.
    • Anonymized: days and times are preserved, but not dates.
    • Obtained from InfoChimps.
  • Data Sample
    ID  Date    Time      Latitude   Longitude
    1   1/7/07  10:54:50  37.782551  -122.445368
    1   1/7/07  10:54:54  37.782745  -122.444586
    1   1/7/07  10:54:58  37.782842  -122.443688
    1   1/7/07  10:55:02  37.782919  -122.442815
    1   1/7/07  10:55:06  37.782992  -122.442112
    1   1/7/07  10:55:10  37.7831    -122.441461
    1   1/7/07  10:55:14  37.783206  -122.440829
    1   1/7/07  10:55:18  37.783273  -122.440324
    Overall: 1.1M distinct readings across 25,000 distinct trips.
  • Meanwhile, At Uber Headquarters…
  • Questions Uber Might Ask:
    • What do trips tend to look like?
    • How can we reduce wait time and make more trips?
    • Are there new products we should introduce?
  • Answering The Questions
    • Why use SQL?
      – Well understood by analysts.
      – Huge ecosystem: access Hive from any of 20+ BI tools.
    • Why Hive?
      – Supports advanced SQL analytics like windowing functions.
      – Java based, which makes it easy for third parties to add extensions.
    • Last reason
      – This is the Hive meetup. Were you expecting ABAP?
  • Getting a feel for the trips.
  • Duration
    • To get the duration, all we need to do is:
      – Subtract the first timestamp from the last timestamp.
      – Do it per trip ID (1-25000).
    • OK, how do we do it with SQL?
  • Getting First Or Last Values In A Partition
        -- Get the last observation from each trip ID.
        -- Standard approach on any SQL system that supports windowing.
        SELECT *
        FROM (
          SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY uber.id ORDER BY uber.dt DESC) as rn
          FROM uber
        ) sub1
        WHERE rn = 1;
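The "last row per partition" pattern above can be sketched outside Hive as well. The Python below is an illustration only, not part of the original deck: it partitions rows by trip ID, orders each partition by descending timestamp, and keeps row 1. The function name `last_observation_per_trip`, the tuple layout, the ISO-style dates, and the trip 2 rows are all assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (trip id, timestamp string, latitude, longitude).
# Trip 1 reuses coordinates from the deck's data sample; trip 2 is invented.
ROWS = [
    (1, "2007-01-07 10:54:50", 37.782551, -122.445368),
    (1, "2007-01-07 10:54:58", 37.782842, -122.443688),
    (1, "2007-01-07 10:54:54", 37.782745, -122.444586),
    (2, "2007-01-07 11:00:04", 37.800000, -122.400000),
    (2, "2007-01-07 11:00:00", 37.799000, -122.401000),
]


def last_observation_per_trip(rows):
    """Emulate ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) with rn = 1."""
    result = []
    # PARTITION BY id: sort by id so groupby sees each trip as one run.
    by_id = sorted(rows, key=itemgetter(0))
    for _trip_id, group in groupby(by_id, key=itemgetter(0)):
        # ORDER BY dt DESC: ISO-style timestamps sort correctly as strings.
        ordered = sorted(group, key=itemgetter(1), reverse=True)
        # rn = 1 is the first row in descending order, i.e. the latest reading.
        result.append(ordered[0])
    return result


if __name__ == "__main__":
    for row in last_observation_per_trip(ROWS):
        print(row)
```

Sorting ascending instead of descending would return the first observation of each trip, which is the same trick the SQL uses by flipping the sort order.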
  • And Hive Supports Windowing Now (0.11+)
    Name         Purpose
    CUME_DIST    Cumulative distribution: the fraction of rows in the partition with values lower than or equal to (or greater than or equal to, if ORDER BY DESC) the current row.
    DENSE_RANK   The dense rank of the row within the partition. If any rows "tie" or have the same value, they receive the same rank. DENSE_RANK does not have gaps in the ranks, in contrast to RANK.
    FIRST_VALUE  The value in the first row within the partition.
    LAST_VALUE   Surprisingly, not the opposite of FIRST_VALUE (if you want that, just change your sort order). With an ORDER BY and no explicit window frame, the frame ends at the current row, so LAST_VALUE does not see the rest of the partition. LAST_VALUE is tricky, look it up.
    LAG          Value from a prior row in the partition.
    LEAD         Value from a subsequent row in the partition.
    NTILE        Divides the rows in a partition into N groups.
    ROW_NUMBER   The row number of the row within the partition.
    RANK         The rank of the row within the partition. This differs from ROW_NUMBER in that ties receive the same value.
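The difference between ROW_NUMBER, RANK and DENSE_RANK is easiest to see on a partition that contains a tie. The following Python is a minimal sketch of the three numbering rules, not Hive's implementation; it assumes the input list is one partition that has already been sorted by the ORDER BY column, and the function names are chosen here to match the SQL names.

```python
def row_number(values):
    """ROW_NUMBER: 1, 2, 3, ... regardless of ties."""
    return list(range(1, len(values) + 1))


def rank(values):
    """RANK: tied rows share a rank, and the next rank skips ahead (gaps)."""
    ranks = []
    for i, value in enumerate(values):
        if i > 0 and value == values[i - 1]:
            ranks.append(ranks[-1])   # same value as previous row: same rank
        else:
            ranks.append(i + 1)       # otherwise the rank is the row position
    return ranks


def dense_rank(values):
    """DENSE_RANK: tied rows share a rank, and ranks never skip (no gaps)."""
    ranks = []
    for i, value in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif value == values[i - 1]:
            ranks.append(ranks[-1])       # tie: reuse the previous rank
        else:
            ranks.append(ranks[-1] + 1)   # new value: next consecutive rank
    return ranks


if __name__ == "__main__":
    partition = [10, 20, 20, 30]  # already sorted, with one tie
    print(row_number(partition))  # [1, 2, 3, 4]
    print(rank(partition))        # [1, 2, 2, 4]
    print(dense_rank(partition))  # [1, 2, 2, 3]
```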
  • Compute Trip Durations
        -- Subtract the first timestamp from the last timestamp.
        -- Use FIRST_VALUE and ROW_NUMBER to help compare first and last timestamps.
        SELECT id, (unix_timestamp(dt) - unix_timestamp(fv)) as trip_duration
        FROM (
          SELECT id, dt, fv
          FROM (
            SELECT id, dt,
                   FIRST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) as fv,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) as lastrk
            FROM uber
          ) sub1
          WHERE lastrk = 1
        ) sub2;
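As a cross-check on the query logic, the same computation (last timestamp minus first timestamp, per trip ID) can be sketched in plain Python. This is an illustration under stated assumptions rather than the deck's method: the readings are the eight rows from the data sample, the date 1/7/07 is read as 2007-01-07, and `trip_durations` is a name invented for the example.

```python
from datetime import datetime

# The eight sample readings for trip 1, with the date rewritten in ISO form
# (interpreting 1/7/07 as 2007-01-07 is an assumption).
READINGS = [
    (1, "2007-01-07 10:54:50"),
    (1, "2007-01-07 10:54:54"),
    (1, "2007-01-07 10:54:58"),
    (1, "2007-01-07 10:55:02"),
    (1, "2007-01-07 10:55:06"),
    (1, "2007-01-07 10:55:10"),
    (1, "2007-01-07 10:55:14"),
    (1, "2007-01-07 10:55:18"),
]


def trip_durations(readings):
    """Return {trip id: seconds between the first and last reading}."""
    first, last = {}, {}
    for trip_id, dt in readings:
        ts = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")
        # Track the earliest and latest timestamp seen for each trip.
        if trip_id not in first or ts < first[trip_id]:
            first[trip_id] = ts
        if trip_id not in last or ts > last[trip_id]:
            last[trip_id] = ts
    return {tid: int((last[tid] - first[tid]).total_seconds()) for tid in first}


if __name__ == "__main__":
    print(trip_durations(READINGS))
```

The eight sample rows span only the first 28 seconds of trip 1, so this sketch reports 28 rather than the 128 seconds the full dataset yields for that trip.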
  • Trip Duration SQL Output (trip_duration in seconds)
    id  trip_duration
    1   128
    2   148
    3   150
    4   336
    5   400
    6   168
    7   142
    8   558
    9   312
    10  208
    ... (25,000 total trips)
  • Duration Was Easy, What About Distance?
    • All we have is GPS readings.
    • If we draw a line through the GPS readings, its length estimates the trip distance.
    • GPS readings are 4 seconds apart, so the estimate should be close.
    [Figure labels: Actual Route, GPS Signal, Estimated Route]
  • Enter GIS Tools for Hadoop
    • esri.github.io/gis-tools-for-hadoop
    • Works with Hive and Map-Reduce.
    • Syntax similar to other spatial systems like PostGIS.
    • Open source.
  • Spatial Framework for Hadoop Functions
    Name                    Purpose
    ST_LineString           Create a line from coordinates supplied in a string.
    ST_Polygon              Create a polygon.
    ST_SetSRID              Set Spatial Reference ID. SRID 4326 corresponds to WGS84.
    ST_GeodesicLengthWGS84  Compute length of a line in meters assuming points use the World Geodetic System 1984. GPS uses the WGS84 coordinate system.
    ST_Length               Compute Cartesian length.
    ST_Contains             Determine if one spatial object contains another spatial object.
    ST_Intersects           Determine if two spatial objects intersect.
    ST_AsText               Return a text representation of a spatial object, suitable for storing in a Hive string column. Objects can also be saved in binary columns with no conversion.
    82 total spatial functions provided by Spatial Framework for Hadoop.
  • ST_LineString: Make a line.
    • 2 constructors
      – ST_LineString(1, 1, 2, 2, 3, 3); (simple constructor)
      – ST_LineString('linestring(1 1, 2 2, 3 3)'); (WKT, or Well-Known Text, constructor)
    • Neither approach is very convenient for this dataset.
    • Since SF4H is open source, I added a new constructor:
      – ST_LineString([Array of ST_Point Objects]);
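To see why the WKT constructor is awkward for per-row GPS data, consider what it takes to assemble the text by hand: every point of a trip has to be concatenated into one string before the line can be built. The Python below is a small sketch of that string assembly, assuming points are (x, y) pairs with x as longitude; `linestring_wkt` is a hypothetical helper, not a function from the Spatial Framework for Hadoop.

```python
def linestring_wkt(points):
    """Build a WKT LINESTRING from (x, y) pairs; for GPS data x is longitude."""
    if len(points) < 2:
        raise ValueError("a linestring needs at least two points")
    # repr() keeps the full precision of each coordinate.
    coords = ", ".join(f"{x!r} {y!r}" for x, y in points)
    return f"LINESTRING ({coords})"


if __name__ == "__main__":
    # The simple example from the slide.
    print(linestring_wkt([(1, 1), (2, 2), (3, 3)]))
    # Two readings from the data sample, as (longitude, latitude).
    print(linestring_wkt([(-122.445368, 37.782551), (-122.444586, 37.782745)]))
```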
  • collect_array: a custom UDAF that turns column values into arrays
    ID  Date    Time      Latitude   Longitude
    1   1/7/07  10:54:50  37.782551  -122.445368
    1   1/7/07  10:54:54  37.782745  -122.444586
    1   1/7/07  10:54:58  37.782842  -122.443688
    1   1/7/07  10:55:02  37.782919  -122.442815
    1   1/7/07  10:55:06  37.782992  -122.442112
    1   1/7/07  10:55:10  37.7831    -122.441461
    1   1/7/07  10:55:14  37.783206  -122.440829
    1   1/7/07  10:55:18  37.783273  -122.440324
        > SELECT id, collect_array(latitude) FROM uber GROUP BY id;
        (1, [ 37.782551, 37.782745, 37.782842, 37.782919, 37.782992 ... ])
        ...
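The grouping behavior of collect_array can be illustrated with a few lines of Python. This is a sketch of the idea (one list of values per group key), not the UDAF's actual Java implementation; the function signature with `key_index` and `value_index` is an assumption made for the example.

```python
from collections import defaultdict

# (id, latitude) pairs; trip 1 uses the first five sample rows,
# trip 2 is a hypothetical extra row.
ROWS = [
    (1, 37.782551),
    (1, 37.782745),
    (1, 37.782842),
    (1, 37.782919),
    (1, 37.782992),
    (2, 37.800000),
]


def collect_array(rows, key_index=0, value_index=1):
    """Group rows by a key column and collect one column's values into a list."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_index]].append(row[value_index])
    return dict(grouped)


if __name__ == "__main__":
    for trip_id, latitudes in collect_array(ROWS).items():
        print(trip_id, latitudes)
```

Here each list keeps the input order. Whether an aggregate in Hive sees rows in timestamp order depends on how the rows reach it, which the slides do not discuss, so treat ordering as something to verify when building lines from the arrays.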
  • Computing Trip Lengths Now Trivial
        -- Compute the trip lengths.
        -- Our coordinates conform to WGS84, use that to compute distances.
        -- ST_SetSRID(_, 4326) marks the object as conforming to WGS84.
        -- Group by trip ID.
        SELECT id,
               ST_GeodesicLengthWGS84(
                 ST_SetSRID(
                   ST_LineString(collect_array(point)), 4326)) as length
        FROM (
          SELECT id, ST_Point(longitude, latitude) as point
          FROM uber
        ) sub
        GROUP BY id;
    • Inner query: generate an ST_Point for each row.
    • Outer query: group the points, turn them into arrays and make a line out of each array.
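The idea of summing the distances between consecutive GPS readings can be approximated without any spatial library. The sketch below uses the haversine formula on a sphere with a mean Earth radius, which is a swapped-in simplification: ST_GeodesicLengthWGS84 works on the WGS84 ellipsoid, so its results will differ slightly. The names `haversine_m` and `path_length_m` are invented for this example, and points are (longitude, latitude) to match ST_Point(longitude, latitude).

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation, not WGS84

# The eight sample readings for trip 1 as (longitude, latitude).
TRIP_1 = [
    (-122.445368, 37.782551),
    (-122.444586, 37.782745),
    (-122.443688, 37.782842),
    (-122.442815, 37.782919),
    (-122.442112, 37.782992),
    (-122.441461, 37.7831),
    (-122.440829, 37.783206),
    (-122.440324, 37.783273),
]


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def path_length_m(points):
    """Sum the segment lengths between consecutive points of a polyline."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


if __name__ == "__main__":
    print(f"{path_length_m(TRIP_1):.1f} m over the first 28 seconds of trip 1")
```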
  • Demo: Computing Trip Distances in Hortonworks Sandbox
  • Visualizing Trip Times and Durations
  • Time For a New Product?
    • How likely is demand for an SFO rideshare?
    • How many trips even go to SFO?
  • ST_Intersects
    • Determines if two shapes intersect.
    [Figure labels: Yes, Not So Much]
  • What Trips Go To SFO?
    • Approach:
      – Draw a polygon around the SFO drop-off area.
      – Using the ST_LineStrings, see which trips intersect with this polygon.
  • SFO Drop-Off Area
    • Inserted into table locations (name string, location string) for easy joining against other shapes.
    • Data estimated using Google Maps.
    Name  Location
    SFO   ST_Polygon(37.616543, -122.392291, 37.613297, -122.392119, 37.616458, -122.389115, 37.613552, -122.389051)
  • Computing the Intersection
        SELECT count(id)
        FROM (
          SELECT id, ST_LineString(collect_array(point)) as trip
          FROM (
            SELECT id, ST_Point(longitude, latitude) AS point
            FROM uber
          ) points
          GROUP BY id
        ) trips
        JOIN (
          SELECT ST_Polygon(location) as sfo_coordinates
          FROM locations
          WHERE locations.name = "SFO"
        ) sfosub
        WHERE ST_Intersects(sfosub.sfo_coordinates, trips.trip);
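What "a trip intersects the polygon" means can be sketched with two classic planar tests: a polyline intersects a polygon if any of its vertices lies inside the polygon or any of its segments crosses a polygon edge. The Python below is an illustrative implementation of that rule, not the algorithm used by ST_Intersects. It treats longitude and latitude as flat x and y, which is an approximation that is reasonable at airport scale. The SFO corners are taken from the deck but rewritten as (longitude, latitude) and reordered so they trace the rectangle's outline; that reordering, the two test trips, and all function names are assumptions.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1 left turn, -1 right, 0 collinear."""
    value = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (value > 0) - (value < 0)


def _on_segment(a, b, p):
    """True if p, already known to be collinear with a-b, lies between a and b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 and segment q1-q2 share at least one point."""
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True  # general case: each segment straddles the other's line
    # Collinear cases: an endpoint of one segment lies on the other segment.
    return ((o1 == 0 and _on_segment(p1, p2, q1))
            or (o2 == 0 and _on_segment(p1, p2, q2))
            or (o3 == 0 and _on_segment(q1, q2, p1))
            or (o4 == 0 and _on_segment(q1, q2, p2)))


def point_in_polygon(point, ring):
    """Ray casting: count how many polygon edges a ray to the right of point crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def line_intersects_polygon(line, ring):
    """True if any vertex of line is inside ring or any segment crosses an edge."""
    if any(point_in_polygon(p, ring) for p in line):
        return True
    edges = list(zip(ring, ring[1:] + ring[:1]))
    return any(segments_intersect(a, b, c, d)
               for a, b in zip(line, line[1:])
               for c, d in edges)


# SFO drop-off corners as (longitude, latitude), reordered NW, SW, SE, NE (assumption).
SFO = [(-122.392291, 37.616543), (-122.392119, 37.613297),
       (-122.389051, 37.613552), (-122.389115, 37.616458)]

if __name__ == "__main__":
    crossing = [(-122.395, 37.615), (-122.385, 37.615)]          # hypothetical trip
    downtown = [(-122.445368, 37.782551), (-122.440324, 37.783273)]  # sample readings
    print(line_intersects_polygon(crossing, SFO))
    print(line_intersects_polygon(downtown, SFO))
```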
  • Demo: Counting Number of Trips to SFO in Sandbox
  • Counting It Up
    • 80 of 25,000 Uber trips went to SFO (0.32%).
    • An SFO rideshare product is maybe not a great idea.
  • Conclusion
    • Spatial Framework for Hadoop makes geo analytics simple with Hadoop and Hive.
    • Hive 0.11 makes it simple to slice and dice datasets with powerful analytics like windowing.
    • Open source: extend and change it to fit your needs.
  • Try It For Yourself
    • Spatial Framework for Hadoop
      – esri.github.io/gis-tools-for-hadoop
    • UDFs, extra data and Hive queries
      – github.com/cartershanklin/hive-spatial-uber (for the collect_array UDAF, queries and extra data)
      – github.com/cartershanklin/spatial-framework-for-hadoop (for the extra ST_LineString constructor)
    • Main dataset
      – infochimps.com/datasets/uber-anonymized-gps-logs
    • Hortonworks Sandbox
      – The easiest way to learn Hadoop.
      – hortonworks.com/sandbox