How To Analyze Geolocation Data with Hive and Hadoop
1. Deep Dive content by Hortonworks, Inc. is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Unported License.
Spatial Analytics with Hive
Hive Meetup – July 24, 2013
@cshanklin
Why Spatial Analytics?
• Amount of spatial data has exploded due to mobile device
ubiquity and more reliance on sensors.
• Proliferation of consumer-oriented mapping products brings
spatial analytics to the mainstream.
An Interesting Dataset
• GPS data collected from Uber trips.
• Anonymized, maintains days/times but not dates.
• Obtained from InfoChimps
Data Sample
ID  Date    Time      Latitude   Longitude
1   1/7/07  10:54:50  37.782551  -122.445368
1   1/7/07  10:54:54  37.782745  -122.444586
1   1/7/07  10:54:58  37.782842  -122.443688
1   1/7/07  10:55:02  37.782919  -122.442815
1   1/7/07  10:55:06  37.782992  -122.442112
1   1/7/07  10:55:10  37.7831    -122.441461
1   1/7/07  10:55:14  37.783206  -122.440829
1   1/7/07  10:55:18  37.783273  -122.440324
Overall: 1.1M distinct readings, 25,000 distinct trips.
Meanwhile, At Uber Headquarters…
Questions Uber Might Ask:
• What do trips tend to look like?
• How can we reduce wait time and make more trips?
• Are there new products we should introduce?
Answering The Questions
• Why Use SQL?
–Well understood by analysts.
–Huge ecosystem, access Hive from any of 20+ BI tools.
• Why Hive?
–Supports advanced SQL analytics like windowing functions.
–Java based, makes it easy for 3rd parties to add extensions.
• Last Reason
–This is the Hive meetup. Were you expecting ABAP?
Getting a feel for the trips.
Duration
• To get the duration all we need to do is:
–Subtract the first timestamp from the last timestamp.
–Do it per trip ID (1-25000).
• OK, how do we do it with SQL?
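Before turning to SQL, the computation itself can be sketched in plain Python: track the earliest and latest reading per trip ID and take the difference. This is only an illustration, not the deck's method. The sample rows are hypothetical, and the month/day/year reading of the date column is an assumption.

```python
from datetime import datetime

# Hypothetical sample rows: (trip id, date, time), mirroring the data sample columns.
# Assumption: dates are month/day/two-digit-year.
readings = [
    (1, "1/7/07", "10:54:50"),
    (1, "1/7/07", "10:54:54"),
    (1, "1/7/07", "10:55:18"),
    (2, "1/7/07", "11:00:00"),
    (2, "1/7/07", "11:02:30"),
]

def trip_durations(rows):
    """Return {trip_id: seconds between the first and last reading}."""
    first, last = {}, {}
    for trip_id, date, time in rows:
        ts = datetime.strptime(f"{date} {time}", "%m/%d/%y %H:%M:%S")
        # Keep the earliest and latest timestamp seen for each trip.
        if trip_id not in first or ts < first[trip_id]:
            first[trip_id] = ts
        if trip_id not in last or ts > last[trip_id]:
            last[trip_id] = ts
    return {t: int((last[t] - first[t]).total_seconds()) for t in first}

durations = trip_durations(readings)
```

Input order does not matter here, because the minimum and maximum are tracked explicitly rather than relying on sorted rows.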
Getting First Or Last Values In A Partition
-- Get the last observation from each trip ID.
-- Standard approach on any SQL system that supports windowing.
SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY uber.id ORDER BY uber.dt DESC) AS rn
  FROM
    uber
) sub1
WHERE
  rn = 1;
And Hive Supports Windowing Now (0.11+)
Name         Purpose
CUME_DIST    Cumulative distribution: the fraction of rows in the partition
             with values less than or equal to (greater than or equal to if
             ORDER BY DESC) the current row's value.
DENSE_RANK   The dense rank of the row within the partition. If any rows "tie"
             or have the same value, they receive the same rank. DENSE_RANK
             does not have gaps in the ranks, in contrast to RANK.
FIRST_VALUE  The value in the first row within the partition.
LAST_VALUE   Surprisingly, not the opposite of FIRST_VALUE (if you want that,
             just change your sort order). With an ORDER BY, the default
             window frame ends at the current row, so LAST_VALUE returns the
             value at the current row unless the frame is extended to
             UNBOUNDED FOLLOWING.
LAG          Value from a prior row in the partition.
LEAD         Value from a subsequent row in the partition.
NTILE        Divides rows in a partition into N groups.
ROW_NUMBER   The row number of the row within the partition.
RANK         The rank of the row within the partition. This differs from
             ROW_NUMBER in that ties receive the same value.
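The difference between ROW_NUMBER, RANK and DENSE_RANK only shows up when values tie. The following Python sketch mimics the three functions over a single partition sorted ascending. It illustrates the semantics and is not how Hive implements them. In SQL the row numbers assigned to tied rows are arbitrary, whereas here they follow Python's sort order.

```python
def window_ranks(values):
    """Return (value, row_number, rank, dense_rank) tuples for one partition,
    ordered ascending by value."""
    ordered = sorted(values)
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != prev:
            rank = row_number  # RANK jumps to the current row number, leaving gaps
            dense += 1         # DENSE_RANK advances by one, leaving no gaps
            prev = value
        out.append((value, row_number, rank, dense))
    return out

rows = window_ranks([10, 20, 20, 30])
# rows == [(10, 1, 1, 1), (20, 2, 2, 2), (20, 3, 2, 2), (30, 4, 4, 3)]
```

The two tied rows share rank 2 under both RANK and DENSE_RANK, but the following row gets rank 4 under RANK and rank 3 under DENSE_RANK.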
Compute Trip Durations
-- Subtract the first timestamp from the last timestamp.
-- Use FIRST_VALUE and ROW_NUMBER to help compare first and last timestamps.
SELECT
  id,
  (unix_timestamp(dt) - unix_timestamp(fv)) AS trip_duration
FROM (
  SELECT
    id, dt, fv
  FROM (
    SELECT
      id, dt,
      FIRST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) AS fv,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) AS lastrk
    FROM
      uber
  ) sub1
  WHERE
    lastrk = 1
) sub2;
Trip Duration SQL Output
id  trip_duration
1   128
2   148
3   150
4   336
5   400
6   168
7   142
8   558
9   312
10  208
...
(25,000 total trips)
Duration Was Easy, What About Distance?
• All we have is GPS readings.
• If we draw a line from GPS readings, it estimates trip distance.
• GPS readings are 4s apart, estimates should be close.
[Figure: the actual route, the GPS signal readings along it, and the estimated route drawn between the readings]
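The estimate described above amounts to summing the distances between consecutive GPS readings. A minimal Python sketch of that idea follows. It uses the haversine formula on a spherical Earth, which is a swapped-in approximation and not the WGS84 geodesic computation that the spatial library performs, so results may differ slightly. The function names and the mean-radius constant are choices made for this example.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def path_length_m(points):
    """Sum the distances between consecutive (latitude, longitude) readings."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))

# The eight readings from the data sample, as (latitude, longitude).
trip = [
    (37.782551, -122.445368), (37.782745, -122.444586),
    (37.782842, -122.443688), (37.782919, -122.442815),
    (37.782992, -122.442112), (37.7831, -122.441461),
    (37.783206, -122.440829), (37.783273, -122.440324),
]
length = path_length_m(trip)
```

Because the line cuts corners between readings, the estimate can only be as long as or shorter than the route actually driven. Closely spaced readings keep that error small.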
Enter GIS Tools for Hadoop
esri.github.io/gis-tools-for-hadoop
Works with Hive and Map-Reduce
Syntax similar to other spatial systems like PostGIS
Open Source
Spatial Framework for Hadoop Functions
Name                    Purpose
ST_LineString           Create a line from coordinates supplied in a string.
ST_Polygon              Create a polygon.
ST_SetSRID              Set Spatial Reference ID. SRID 4326 corresponds to
                        WGS84.
ST_GeodesicLengthWGS84  Compute length of a line in meters assuming points use
                        the World Geodetic System 1984. GPS uses the WGS84
                        coordinate system.
ST_Length               Compute Cartesian length.
ST_Contains             Determine if one spatial object contains another
                        spatial object.
ST_Intersects           Determine if two spatial objects intersect.
ST_AsText               Return a text representation of a spatial object,
                        suitable for storing in a Hive string column. Objects
                        can also be saved in binary columns with no conversion.
82 total spatial functions provided by Spatial Framework for Hadoop.
ST_LineString: Make a line.
• 2 Constructors
–ST_LineString(1, 1, 2, 2, 3, 3);
– Simple constructor.
–ST_LineString('linestring(1 1, 2 2, 3 3)');
– WKT or Well-Known-Text constructor.
• Neither approach very convenient for this dataset.
• Since SF4H is open-source I added a new constructor:
–ST_LineString([Array of ST_Point Objects]);
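To see why the WKT constructor is awkward for this dataset, consider what it needs: every reading of a trip has to be formatted and concatenated into one string before the line can be built. The Python sketch below shows that formatting step. The helper name linestring_wkt is hypothetical and not part of any library.

```python
def linestring_wkt(points):
    """Format (x, y) pairs as a WKT linestring literal.
    For GPS data, x is the longitude and y is the latitude."""
    if len(points) < 2:
        raise ValueError("a linestring needs at least two points")
    coords = ", ".join(f"{x} {y}" for x, y in points)
    return f"linestring({coords})"

wkt = linestring_wkt([(1, 1), (2, 2), (3, 3)])
# wkt == "linestring(1 1, 2 2, 3 3)"
```

Doing this inside a Hive query would mean string aggregation per trip, which is what an array-based constructor avoids.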
collect_array: Custom UDAF Turns Columns Into Arrays
ID  Date    Time      Latitude   Longitude
1   1/7/07  10:54:50  37.782551  -122.445368
1   1/7/07  10:54:54  37.782745  -122.444586
1   1/7/07  10:54:58  37.782842  -122.443688
1   1/7/07  10:55:02  37.782919  -122.442815
1   1/7/07  10:55:06  37.782992  -122.442112
1   1/7/07  10:55:10  37.7831    -122.441461
1   1/7/07  10:55:14  37.783206  -122.440829
1   1/7/07  10:55:18  37.783273  -122.440324
> SELECT id, collect_array(latitude) FROM uber GROUP BY id;
(1, [ 37.782551, 37.782745, 37.782842, 37.782919, 37.782992 ... ])
...
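As a rough picture of what such an aggregate does, the Python sketch below groups rows by trip ID and gathers one column into a list per group. It is not the UDAF's implementation. The sample rows are hypothetical, and sorting by time before collecting is an assumption added here so that each array follows the order of travel.

```python
from collections import defaultdict

# Hypothetical rows: (trip id, time, latitude), deliberately out of order.
rows = [
    (1, "10:54:54", 37.782745),
    (1, "10:54:50", 37.782551),
    (2, "11:00:00", 37.790000),
    (1, "10:54:58", 37.782842),
]

def collect_by_id(rows):
    """Group rows by trip id and collect the latitudes into one list per trip."""
    groups = defaultdict(list)
    # Sort by (id, time) so each list is in time order.
    # Assumption: times are zero-padded HH:MM:SS strings, so they sort correctly as text.
    for trip_id, _time, latitude in sorted(rows, key=lambda r: (r[0], r[1])):
        groups[trip_id].append(latitude)
    return dict(groups)

arrays = collect_by_id(rows)
# arrays == {1: [37.782551, 37.782745, 37.782842], 2: [37.79]}
```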
Computing Trip Lengths Now Trivial
-- Compute the trip lengths.
-- Our coordinates conform to WGS84, use that to compute distances.
-- ST_SetSRID(_, 4326) marks the object as conforming to WGS84.
-- Group by trip ID.
SELECT
  id,
  ST_GeodesicLengthWGS84(
    ST_SetSRID(
      ST_LineString(collect_array(point)), 4326)) AS length
FROM (
  SELECT
    id,
    ST_Point(longitude, latitude) AS point
  FROM
    uber
) sub
GROUP BY
  id;
Callouts:
• Inner query: generate an ST_Point for each row.
• Outer query: group the points, turn them into arrays and make a line out of it.
Demo
Computing Trip Distances in Hortonworks Sandbox
Visualizing Trip Times and Durations
Time For a New Product?
• How Likely is Demand for an SFO Rideshare?
• How many trips even go to SFO?
ST_Intersects
• Determines if two shapes intersect.
[Figure: two panels, labeled "Yes" (the shapes intersect) and "Not So Much" (they do not)]
What Trips Go To SFO?
• Approach:
–Draw a polygon around SFO drop-off area.
–Using the ST_LineStrings, see which trips intersect with this polygon.
SFO Drop-Off Area
• Inserted into table locations (name string, location string) for
easy joining against other shapes.
• Data estimated using Google Maps.
Name  Location (longitude, latitude pairs, listed around the perimeter)
SFO   ST_Polygon(
        -122.392291, 37.616543,
        -122.392119, 37.613297,
        -122.389051, 37.613552,
        -122.389115, 37.616458)
Computing the Intersection
-- Count the trips whose line intersects the SFO drop-off polygon.
-- Build one line per trip, then cross join against the single SFO row.
SELECT
  count(id)
FROM (
  SELECT
    id,
    ST_LineString(collect_array(point)) AS trip
  FROM (
    SELECT
      id,
      ST_Point(longitude, latitude) AS point
    FROM
      uber
  ) points
  GROUP BY
    id
) trips JOIN (
  SELECT ST_Polygon(location) AS sfo_coordinates
  FROM locations
  WHERE locations.name = "SFO"
) sfosub
WHERE
  ST_Intersects(sfosub.sfo_coordinates, trips.trip);
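To make the intersection test concrete, here is a minimal planar sketch in Python: a line intersects a polygon if any of its points falls inside the polygon or any of its segments crosses a polygon edge. This illustrates the geometric idea only. It is not the library's ST_Intersects implementation, it treats coordinates as flat x/y values, and the square and the test lines use hypothetical coordinates.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def _on_segment(a, b, p):
    """True if p (already known to be collinear with a-b) lies within a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))

def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 touches or crosses segment q1-q2."""
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and _on_segment(p1, p2, q1))
            or (o2 == 0 and _on_segment(p1, p2, q2))
            or (o3 == 0 and _on_segment(q1, q2, p1))
            or (o4 == 0 and _on_segment(q1, q2, p2)))

def point_in_polygon(pt, ring):
    """Ray-casting test; ring is a list of (x, y) vertices, implicitly closed."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Toggle for each edge that a ray going right from pt crosses.
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def line_intersects_polygon(line, ring):
    """True if any point of the line is inside the ring or any segment crosses an edge."""
    if any(point_in_polygon(p, ring) for p in line):
        return True
    edges = list(zip(ring, ring[1:] + ring[:1]))
    return any(segments_intersect(a, b, c, d)
               for a, b in zip(line, line[1:]) for c, d in edges)

square = [(0, 0), (4, 0), (4, 4), (0, 4)]               # hypothetical drop-off area
passes_through = line_intersects_polygon([(-1, 2), (5, 2)], square)
ends_inside = line_intersects_polygon([(-3, -3), (1, 1)], square)
misses = line_intersects_polygon([(5, 5), (6, 7)], square)
```

The vertices must be listed in order around the perimeter. A ring whose vertices zigzag across the shape describes a bow-tie, not a rectangle, and both tests would then give misleading answers.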
Demo
Counting Number of Trips to SFO in Sandbox
Counting It Up
• 80 / 25000 Uber trips went to SFO (0.32%)
• SFO Rideshare Product, maybe not a great idea.
Conclusion
• Spatial Framework for Hadoop makes geo analytics simple
with Hadoop and Hive.
• Hive 0.11 makes it simple to slice and dice datasets with
powerful analytics like windowing.
• Open source, extend and change to fit your needs.
Try It For Yourself
• Spatial Framework for Hadoop
–esri.github.io/gis-tools-for-hadoop
• UDFs, extra data and Hive queries
–github.com/cartershanklin/hive-spatial-uber
– (For the collect_array UDAF, queries and extra data)
–github.com/cartershanklin/spatial-framework-for-hadoop
– (For the extra ST_LineString constructor)
• Main Dataset
–infochimps.com/datasets/uber-anonymized-gps-logs
• Hortonworks Sandbox
–The easiest way to learn Hadoop.
–hortonworks.com/sandbox
Editor's Notes
If you spotted the error in this slide… we’re hiring.