Deep Dive content by Hortonworks, Inc. is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Unported License.
Spatial Analytics with Hive
Hive Meetup – July 24, 2013
@cshanklin
Why Spatial Analytics?
• Amount of spatial data has exploded due to mobile device
ubiquity and more reliance on sensors.
• Proliferation of consumer-oriented mapping products brings
spatial analytics to the mainstream.
An Interesting Dataset
• GPS data collected from Uber trips.
• Anonymized, maintains days/times but not dates.
• Obtained from InfoChimps
Data Sample
ID  Date    Time      Latitude   Longitude
1   1/7/07  10:54:50  37.782551  -122.445368
1   1/7/07  10:54:54  37.782745  -122.444586
1   1/7/07  10:54:58  37.782842  -122.443688
1   1/7/07  10:55:02  37.782919  -122.442815
1   1/7/07  10:55:06  37.782992  -122.442112
1   1/7/07  10:55:10  37.7831    -122.441461
1   1/7/07  10:55:14  37.783206  -122.440829
1   1/7/07  10:55:18  37.783273  -122.440324
Overall: 1.1M distinct readings, 25,000 distinct trips.
Meanwhile, At Uber Headquarters…
Questions Uber Might Ask:
• What do trips tend to look like?
• How can we reduce wait time and make more trips?
• Are there new products we should introduce?
Answering The Questions
• Why Use SQL?
–Well understood by analysts.
–Huge ecosystem, access Hive from any of 20+ BI tools.
• Why Hive?
–Supports advanced SQL analytics like windowing functions.
–Java based, makes it easy for 3rd parties to add extensions.
• Last Reason
–This is the Hive meetup. Were you expecting ABAP?
Getting a feel for the trips.
Duration
• To get the duration all we need to do is:
–Subtract the first timestamp from the last timestamp.
–Do it per trip ID (1-25000).
• OK, how do we do it with SQL?
Getting First Or Last Values In A Partition
-- Get the last observation from each trip ID.
-- Standard approach on any SQL system that supports windowing.
SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY uber.id ORDER BY uber.dt DESC) AS rn
  FROM
    uber
) sub1
WHERE
  rn = 1;
And Hive Supports Windowing Now (0.11+)
Name         Purpose
CUME_DIST    Cumulative distribution: the fraction of rows in the partition
             whose value is less than or equal to the current row's value
             (greater than or equal to, if ORDER BY ... DESC).
DENSE_RANK   The dense rank of the row within the partition. Rows that tie
             receive the same rank. Unlike RANK, DENSE_RANK leaves no gaps
             in the ranks.
FIRST_VALUE  The value in the first row within the partition.
LAST_VALUE   Surprisingly, not the opposite of FIRST_VALUE (if you want that,
             just change your sort order). With an ORDER BY, the default
             window frame ends at the current row, so LAST_VALUE returns the
             value at the current row (or its last tied peer) unless the
             frame is extended to UNBOUNDED FOLLOWING.
LAG          Value from a prior row in the partition.
LEAD         Value from a subsequent row in the partition.
NTILE        Divides the rows in a partition into N groups.
ROW_NUMBER   The row number of the row within the partition.
RANK         The rank of the row within the partition. Unlike ROW_NUMBER,
             ties receive the same value.
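The difference between ROW_NUMBER, RANK, DENSE_RANK and CUME_DIST is easiest to see on a partition that contains a tie. The following is a minimal Python sketch of those semantics, not Hive code; the function name `ranking_functions` and the sample values are made up for illustration.

```python
from typing import Dict, List


def ranking_functions(values: List[int]) -> List[Dict[str, float]]:
    """Mimic ROW_NUMBER, RANK, DENSE_RANK and CUME_DIST over one partition,
    ordered ascending by value. (Hypothetical helper, for illustration only.)"""
    ordered = sorted(values)
    total = len(ordered)
    rows = []
    rank = 0
    dense_rank = 0
    previous = None
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position      # RANK jumps to the row position after a tie
            dense_rank += 1      # DENSE_RANK only ever advances by one
            previous = value
        # CUME_DIST: share of rows whose value is <= the current value.
        at_or_below = sum(1 for v in ordered if v <= value)
        rows.append({
            "value": value,
            "row_number": position,
            "rank": rank,
            "dense_rank": dense_rank,
            "cume_dist": at_or_below / total,
        })
    return rows


if __name__ == "__main__":
    for row in ranking_functions([10, 20, 20, 30]):
        print(row)
```

For the values 10, 20, 20, 30 the two tied rows share RANK 2 and DENSE_RANK 2; the next row gets RANK 4 but DENSE_RANK 3, while ROW_NUMBER simply counts 1 through 4.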
Compute Trip Durations
-- Subtract the first timestamp from the last timestamp.
-- Use FIRST_VALUE and ROW_NUMBER to help compare first and last timestamps.
SELECT
  id,
  (unix_timestamp(dt) - unix_timestamp(fv)) AS trip_duration
FROM (
  SELECT
    id, dt, fv
  FROM (
    SELECT
      id, dt,
      FIRST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) AS fv,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) AS lastrk
    FROM
      uber
  ) sub1
  WHERE
    lastrk = 1
) sub2;
Trip Duration SQL Output
id trip_duration
1 128
2 148
3 150
4 336
5 400
6 168
7 142
8 558
9 312
10 208
...
(25,000 total trips)
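The same per-trip calculation can be sketched outside Hive to check the logic. This Python sketch is an illustration only: `trip_durations` is a hypothetical helper, the trip 1 timestamps come from the data sample, and the trip 2 rows are invented so that the grouping is visible.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Trip 1 timestamps come from the data sample; trip 2 is hypothetical.
READINGS = [
    (1, "1/7/07 10:54:50"),
    (1, "1/7/07 10:54:54"),
    (1, "1/7/07 10:55:18"),
    (2, "1/7/07 11:00:00"),  # hypothetical
    (2, "1/7/07 11:02:28"),  # hypothetical
]


def trip_durations(readings: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Return seconds between the first and last reading of each trip ID."""
    first: Dict[int, datetime] = {}
    last: Dict[int, datetime] = {}
    for trip_id, stamp in readings:
        ts = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
        # Track the earliest and latest timestamp seen per trip.
        if trip_id not in first or ts < first[trip_id]:
            first[trip_id] = ts
        if trip_id not in last or ts > last[trip_id]:
            last[trip_id] = ts
    return {
        trip_id: int((last[trip_id] - first[trip_id]).total_seconds())
        for trip_id in first
    }


if __name__ == "__main__":
    print(trip_durations(READINGS))
```

Only three of trip 1's readings are included here, so the 28 seconds this produces for trip 1 is shorter than the 128 seconds the full dataset gives.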
Duration Was Easy, What About Distance?
• All we have is GPS readings.
• If we draw a line through the GPS readings, its length estimates the trip distance.
• GPS readings are 4s apart, so the estimates should be close.
[Figure: the actual route, the GPS signal readings, and the estimated route]
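The idea of estimating distance from consecutive readings can be sketched as a sum of short point-to-point distances. The Python below is a rough illustration only: it uses the haversine formula on a sphere with an assumed mean Earth radius, which is a simpler stand-in for a true WGS84 ellipsoid calculation, and the function names are made up.

```python
import math
from typing import Sequence, Tuple

# Assumption: mean Earth radius in metres (spherical approximation).
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def path_length_m(points: Sequence[Tuple[float, float]]) -> float:
    """Sum the distances between consecutive (latitude, longitude) readings."""
    return sum(
        haversine_m(lat1, lon1, lat2, lon2)
        for (lat1, lon1), (lat2, lon2) in zip(points, points[1:])
    )


# The eight readings of trip 1 from the data sample.
TRIP_1 = [
    (37.782551, -122.445368),
    (37.782745, -122.444586),
    (37.782842, -122.443688),
    (37.782919, -122.442815),
    (37.782992, -122.442112),
    (37.7831, -122.441461),
    (37.783206, -122.440829),
    (37.783273, -122.440324),
]

if __name__ == "__main__":
    print(f"{path_length_m(TRIP_1):.1f} m")
```

Because the readings are only seconds apart, the straight segments stay close to the road, which is why the summed length is a reasonable estimate of the distance driven.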
Enter GIS Tools for Hadoop
esri.github.io/gis-tools-for-hadoop
Works with Hive and Map-Reduce
Syntax similar to other spatial systems like PostGIS
Open Source
Spatial Framework for Hadoop Functions
Name                    Purpose
ST_LineString           Create a line from coordinates supplied in a string.
ST_Polygon              Create a polygon.
ST_SetSRID              Set Spatial Reference ID. SRID 4326 corresponds to
                        WGS84.
ST_GeodesicLengthWGS84  Compute length of a line in meters assuming points
                        use the World Geodetic System 1984. GPS uses the
                        WGS84 coordinate system.
ST_Length               Compute Cartesian length.
ST_Contains             Determine if one spatial object contains another
                        spatial object.
ST_Intersects           Determine if two spatial objects intersect.
ST_AsText               Return a text representation of a spatial object,
                        suitable for storing in a Hive string column. Objects
                        can also be saved in binary columns with no
                        conversion.
82 total spatial functions provided by Spatial Framework for Hadoop.
ST_LineString: Make a line.
• 2 Constructors
–ST_LineString(1, 1, 2, 2, 3, 3);
– Simple constructor.
–ST_LineString('linestring(1 1, 2 2, 3 3)');
– WKT or Well-Known-Text constructor.
• Neither approach is very convenient for this dataset.
• Since Spatial Framework for Hadoop (SF4H) is open source, I added a new constructor:
–ST_LineString([Array of ST_Point Objects]);
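To see why the WKT constructor is awkward for data that arrives as one reading per row, consider what it takes to build the WKT string by hand. This is a minimal Python sketch, not part of the Spatial Framework; `linestring_wkt` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def linestring_wkt(points: Sequence[Tuple[float, float]]) -> str:
    """Format (x, y) pairs as a WKT linestring, e.g. 'linestring(1 1, 2 2)'.

    Hypothetical helper: every coordinate has to be turned into text and
    concatenated, which is the work an array-based constructor avoids.
    """
    if len(points) < 2:
        raise ValueError("a line needs at least two points")
    coords = ", ".join(f"{x!r} {y!r}" for x, y in points)
    return f"linestring({coords})"


if __name__ == "__main__":
    print(linestring_wkt([(1, 1), (2, 2), (3, 3)]))
```

With a constructor that accepts an array of points, the query can pass the grouped points directly and skip this string assembly.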
collect_array: Custom UDAF turns columns to arrays
ID  Date    Time      Latitude   Longitude
1   1/7/07  10:54:50  37.782551  -122.445368
1   1/7/07  10:54:54  37.782745  -122.444586
1   1/7/07  10:54:58  37.782842  -122.443688
1   1/7/07  10:55:02  37.782919  -122.442815
1   1/7/07  10:55:06  37.782992  -122.442112
1   1/7/07  10:55:10  37.7831    -122.441461
1   1/7/07  10:55:14  37.783206  -122.440829
1   1/7/07  10:55:18  37.783273  -122.440324
> SELECT id, collect_array(latitude) FROM uber GROUP BY id;
(1, [ 37.782551, 37.782745, 37.782842, 37.782919, 37.782992 ... ])
...
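What the aggregate does can be sketched in a few lines of Python. This is an illustration of the grouping behaviour only, not the UDAF's source; the Python function below merely borrows the name `collect_array`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collect_array(rows: Iterable[Tuple[int, float]]) -> Dict[int, List[float]]:
    """Gather the values of one column into a list per trip ID."""
    grouped: Dict[int, List[float]] = defaultdict(list)
    for trip_id, value in rows:
        grouped[trip_id].append(value)
    return dict(grouped)


if __name__ == "__main__":
    latitudes = [(1, 37.782551), (1, 37.782745), (1, 37.782842)]
    print(collect_array(latitudes))
```

This sketch keeps values in input order. The slides do not say whether the UDAF guarantees any order, so sorting readings by timestamp before building a line is a sensible precaution.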
Computing Trip Lengths Now Trivial
-- Compute the trip lengths.
-- Our coordinates conform to WGS84, use that to compute distances.
-- ST_SetSRID(_, 4326) marks the object as conforming to WGS84.
-- Group by trip ID.
SELECT
  id,
  ST_GeodesicLengthWGS84(
    ST_SetSRID(
      ST_LineString(collect_array(point)), 4326)) AS length
FROM (
  SELECT
    id,
    ST_Point(longitude, latitude) AS point
  FROM
    uber
) sub
GROUP BY
  id;
Inner query: generate an ST_Point for each row.
Outer query: group the points, turn them into arrays and make a line out of it.
Demo
Computing Trip Distances in Hortonworks Sandbox
Visualizing Trip Times and Durations
Time For a New Product?
• How Likely is Demand for an SFO Rideshare?
• How many trips even go to SFO?
ST_Intersects
• Determines if two shapes intersect.
[Figure: two examples, one where the shapes intersect ("Yes") and one where they do not ("Not So Much")]
What Trips Go To SFO?
• Approach:
–Draw a polygon around SFO drop-off area.
–Using the ST_LineStrings, see which trips intersect with this polygon.
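The approach above boils down to one geometric test: does a trip's line touch the polygon? A minimal planar sketch in Python follows. It is not how the Spatial Framework implements ST_Intersects; it treats longitude/latitude as flat x/y, which is an assumption that is only reasonable over an area as small as an airport, and all function names are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y), i.e. (longitude, latitude)


def _orientation(a: Point, b: Point, c: Point) -> float:
    """Cross product of ab and ac: sign tells which side of ab the point c is on."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _within_box(a: Point, b: Point, c: Point) -> bool:
    """True if c lies inside the bounding box of segment ab."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segment p1p2 and segment q1q2 share at least one point."""
    d1 = _orientation(q1, q2, p1)
    d2 = _orientation(q1, q2, p2)
    d3 = _orientation(p1, p2, q1)
    d4 = _orientation(p1, p2, q2)
    if ((d1 > 0 and d2 < 0) or (d1 < 0 and d2 > 0)) and \
       ((d3 > 0 and d4 < 0) or (d3 < 0 and d4 > 0)):
        return True  # proper crossing
    # Collinear cases: an endpoint lies on the other segment.
    if d1 == 0 and _within_box(q1, q2, p1):
        return True
    if d2 == 0 and _within_box(q1, q2, p2):
        return True
    if d3 == 0 and _within_box(p1, p2, q1):
        return True
    if d4 == 0 and _within_box(p1, p2, q2):
        return True
    return False


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a ray going right from point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def line_intersects_polygon(line: Sequence[Point], polygon: Sequence[Point]) -> bool:
    """True if any vertex of the line is inside the polygon or any segment crosses an edge."""
    if any(point_in_polygon(p, polygon) for p in line):
        return True
    n = len(polygon)
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    return any(
        segments_intersect(a, b, e1, e2)
        for a, b in zip(line, line[1:])
        for e1, e2 in edges
    )


if __name__ == "__main__":
    square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
    print(line_intersects_polygon([(-1.0, 1.0), (3.0, 1.0)], square))
```

A trip counts as intersecting if it ends inside the polygon or merely passes through it, which is worth keeping in mind when interpreting the count of trips "to" SFO.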
SFO Drop-Off Area
• Inserted into table locations (name string, location string) for
easy joining against other shapes.
• Data estimated using Google Maps.
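Name  Location
SFO   ST_Polygon(
        -122.392291, 37.616543,
        -122.392119, 37.613297,
        -122.389051, 37.613552,
        -122.389115, 37.616458)
Coordinates are listed as longitude, latitude (x, y), matching ST_Point(longitude, latitude), with the corners in order around the rectangle.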
Computing the Intersection
SELECT
  count(id)
FROM (
  SELECT
    id,
    ST_LineString(collect_array(point)) AS trip
  FROM (
    SELECT
      id,
      ST_Point(longitude, latitude) AS point
    FROM
      uber
  ) points
  GROUP BY
    id
) trips JOIN (
  -- The locations table stores the shape in its location column.
  SELECT ST_Polygon(location) AS sfo_coordinates
  FROM locations
  WHERE locations.name = "SFO"
) sfosub
WHERE
  ST_Intersects(sfosub.sfo_coordinates, trips.trip);
Demo
Counting Number of Trips to SFO in Sandbox
Counting It Up
• 80 / 25000 Uber trips went to SFO (0.32%)
• SFO Rideshare Product, maybe not a great idea.
Conclusion
• Spatial Framework for Hadoop makes geo analytics simple
with Hadoop and Hive.
• Hive 0.11 makes it simple to slice and dice datasets with
powerful analytics like windowing.
• Open source, extend and change to fit your needs.
Try It For Yourself
• Spatial Framework for Hadoop
–esri.github.io/gis-tools-for-hadoop
• UDFs, extra data and Hive queries
–github.com/cartershanklin/hive-spatial-uber
– (For the collect_array UDAF, queries and extra data)
–github.com/cartershanklin/spatial-framework-for-hadoop
– (For the extra ST_LineString constructor)
• Main Dataset
–infochimps.com/datasets/uber-anonymized-gps-logs
• Hortonworks Sandbox
–The easiest way to learn Hadoop.
–hortonworks.com/sandbox
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

How To Analyze Geolocation Data with Hive and Hadoop

  • 1. Spatial Analytics with Hive
      Hive Meetup – July 24, 2013
      @cshanklin
      Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  • 2. Why Spatial Analytics?
      • Amount of spatial data has exploded due to mobile device ubiquity and more reliance on sensors.
      • Proliferation of consumer-oriented mapping products brings spatial analytics to the mainstream.
  • 3. An Interesting Dataset
      • GPS data collected from Uber trips.
      • Anonymized, maintains days/times but not dates.
      • Obtained from InfoChimps.
  • 4. Data Sample
      ID  Date    Time      Latitude   Longitude
      1   1/7/07  10:54:50  37.782551  -122.445368
      1   1/7/07  10:54:54  37.782745  -122.444586
      1   1/7/07  10:54:58  37.782842  -122.443688
      1   1/7/07  10:55:02  37.782919  -122.442815
      1   1/7/07  10:55:06  37.782992  -122.442112
      1   1/7/07  10:55:10  37.7831    -122.441461
      1   1/7/07  10:55:14  37.783206  -122.440829
      1   1/7/07  10:55:18  37.783273  -122.440324
      Overall: 1.1M distinct readings, 25,000 distinct trips.
  • 5. Meanwhile, At Uber Headquarters…
  • 6. Questions Uber Might Ask:
      • What do trips tend to look like?
      • How can we reduce wait time and make more trips?
      • Are there new products we should introduce?
  • 7. Answering The Questions
      • Why Use SQL?
          – Well understood by analysts.
          – Huge ecosystem, access Hive from any of 20+ BI tools.
      • Why Hive?
          – Supports advanced SQL analytics like windowing functions.
          – Java based, makes it easy for 3rd parties to add extensions.
      • Last Reason
          – This is the Hive meetup. Were you expecting ABAP?
  • 8. Getting a feel for the trips.
  • 9. Duration
      • To get the duration all we need to do is:
          – Subtract the first timestamp from the last timestamp.
          – Do it per trip ID (1-25000).
      • OK, how do we do it with SQL?
  • 10. Getting First Or Last Values In A Partition
      -- Get the last observation from each trip ID.
      -- Standard approach on any SQL system that supports windowing.
      SELECT * FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY uber.id ORDER BY uber.dt DESC) as rn
        FROM uber
      ) sub1
      WHERE rn = 1;
  • 11. And Hive Supports Windowing Now (0.11+)
      Name         Purpose
      CUME_DIST    Fraction of rows in the partition with values less than or equal to (or greater than or equal to, if ORDER BY DESC) the current row's value.
      DENSE_RANK   The dense rank of the row within the partition. If any rows "tie" or have the same value, they receive the same rank. DENSE_RANK does not have gaps in the ranks, in contrast to RANK.
      FIRST_VALUE  The value in the first row within the partition.
      LAST_VALUE   Surprisingly, not the opposite of FIRST_VALUE (if you want that, just change your sort order). LAST_VALUE is tricky, look it up.
      LAG          Value from a prior row in the partition.
      LEAD         Value from a subsequent row in the partition.
      NTILE        Divides rows in a partition into N groups.
      ROW_NUMBER   The row number of the row within the partition.
      RANK         The rank of the row within the partition. This differs from ROW_NUMBER in that ties receive the same value.
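The LAST_VALUE caveat in the table above is most likely about the default window frame. This is a minimal sketch of that behaviour, run against SQLite (3.25 or later) as a stand-in for Hive; the table contents are illustrative, and it assumes Hive applies the same SQL-standard default frame (from the start of the partition up to the current row) when ORDER BY is given without an explicit frame.

```python
import sqlite3

# SQLite is only a stand-in for Hive here; the three rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uber (id INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO uber VALUES (?, ?)",
    [(1, "10:54:50"), (1, "10:54:54"), (1, "10:54:58")],
)

rows = conn.execute(
    """
    SELECT dt,
           -- Default frame: ends at the current row, so this echoes dt.
           LAST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) AS lv_default,
           -- Explicit frame over the whole partition: the true last value.
           LAST_VALUE(dt) OVER (
               PARTITION BY id ORDER BY dt
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS lv_whole_partition
    FROM uber
    ORDER BY dt
    """
).fetchall()

for row in rows:
    print(row)
```

With the default frame each row simply sees itself as the last value, which is why reversing the sort order and using FIRST_VALUE or ROW_NUMBER is often the simpler route.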
  • 12. Compute Trip Durations
      -- Subtract the first timestamp from the last timestamp.
      -- Use FIRST_VALUE and ROW_NUMBER to help compare first and last timestamps.
      SELECT id, (unix_timestamp(dt) - unix_timestamp(fv)) as trip_duration
      FROM (
        SELECT id, dt, fv
        FROM (
          SELECT id, dt,
                 FIRST_VALUE(dt) OVER (PARTITION BY id ORDER BY dt) as fv,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) as lastrk
          FROM uber
        ) sub1
        WHERE lastrk = 1
      ) sub2;
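The same first-versus-last logic can be sketched outside Hive. This is a minimal Python illustration, not the query above: it uses a few of the sample readings for trip 1 plus a hypothetical trip 2, and it assumes the date column is month/day/year. Because only a handful of rows are used, the numbers will not match the full-dataset output.

```python
from datetime import datetime

# (trip id, timestamp) readings. Trip 1 rows come from the data sample;
# trip 2 is hypothetical and only there to show the per-trip grouping.
readings = [
    (1, "1/7/07 10:54:50"),
    (1, "1/7/07 10:55:18"),
    (1, "1/7/07 10:54:54"),
    (2, "1/7/07 11:00:00"),
    (2, "1/7/07 11:02:28"),
]


def trip_durations(rows):
    """Return {trip_id: seconds between the first and last reading}."""
    first, last = {}, {}
    for trip_id, text in rows:
        # Assumes month/day/year; the slides do not state the date order.
        ts = datetime.strptime(text, "%m/%d/%y %H:%M:%S")
        first[trip_id] = min(first.get(trip_id, ts), ts)
        last[trip_id] = max(last.get(trip_id, ts), ts)
    return {t: int((last[t] - first[t]).total_seconds()) for t in first}


durations = trip_durations(readings)
print(durations)
```

Tracking the minimum and maximum per trip plays the role of FIRST_VALUE and the `lastrk = 1` filter; input order does not matter.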
  • 13. Trip Duration SQL Output
      id  trip_duration
      1   128
      2   148
      3   150
      4   336
      5   400
      6   168
      7   142
      8   558
      9   312
      10  208
      ... (25,000 total trips)
      trip_duration is in seconds.
  • 14. Duration Was Easy, What About Distance?
      • All we have is GPS readings.
      • If we draw a line through the GPS readings, its length estimates the trip distance.
      • GPS readings are 4s apart, so estimates should be close.
      [Figure: actual route, GPS signal, and estimated route]
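To make the "draw a line through the readings" idea concrete, here is a minimal Python sketch that sums the distance between consecutive readings. It uses the haversine formula on a spherical Earth, which is a simpler approximation than a WGS84 ellipsoid calculation; the function names and the radius constant are choices made for this sketch.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def path_length_m(points):
    """Sum of segment lengths along a list of (lat, lon) readings, in order."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


# The eight readings from the data sample, as (latitude, longitude).
sample = [
    (37.782551, -122.445368), (37.782745, -122.444586),
    (37.782842, -122.443688), (37.782919, -122.442815),
    (37.782992, -122.442112), (37.7831, -122.441461),
    (37.783206, -122.440829), (37.783273, -122.440324),
]

print(round(path_length_m(sample), 1), "meters")
```

The readings must be in time order before summing, otherwise the path zigzags and the estimate inflates.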
  • 15. Enter GIS Tools for Hadoop
      • esri.github.io/gis-tools-for-hadoop
      • Works with Hive and Map-Reduce
      • Syntax similar to other spatial systems like PostGIS
      • Open Source
  • 16. Spatial Framework for Hadoop Functions
      Name                    Purpose
      ST_LineString           Create a line from coordinates supplied in a string.
      ST_Polygon              Create a polygon.
      ST_SetSRID              Set Spatial Reference ID. SRID 4326 corresponds to WGS84.
      ST_GeodesicLengthWGS84  Compute length of a line in meters assuming points use the World Geodetic System 1984. GPS uses the WGS84 coordinate system.
      ST_Length               Compute Cartesian length.
      ST_Contains             Determine if one spatial object contains another spatial object.
      ST_Intersects           Determine if two spatial objects intersect.
      ST_AsText               Return a text representation of a spatial object, suitable for storing in a Hive string column. Objects can also be saved in binary columns with no conversion.
      82 total spatial functions provided by Spatial Framework for Hadoop.
  • 17. ST_LineString: Make a line.
      • 2 Constructors
          – ST_LineString(1, 1, 2, 2, 3, 3); (simple constructor)
          – ST_LineString('linestring(1 1, 2 2, 3 3)'); (WKT, or Well-Known Text, constructor)
      • Neither approach is very convenient for this dataset.
      • Since SF4H (Spatial Framework for Hadoop) is open source, I added a new constructor:
          – ST_LineString([Array of ST_Point Objects]);
  • 18. collect_array: Custom UDAF turns columns to arrays
      ID  Date    Time      Latitude   Longitude
      1   1/7/07  10:54:50  37.782551  -122.445368
      1   1/7/07  10:54:54  37.782745  -122.444586
      1   1/7/07  10:54:58  37.782842  -122.443688
      1   1/7/07  10:55:02  37.782919  -122.442815
      1   1/7/07  10:55:06  37.782992  -122.442112
      1   1/7/07  10:55:10  37.7831    -122.441461
      1   1/7/07  10:55:14  37.783206  -122.440829
      1   1/7/07  10:55:18  37.783273  -122.440324
      > SELECT id, collect_array(latitude) FROM uber GROUP BY id;
      (1, [ 37.782551, 37.782745, 37.782842, 37.782919, 37.782992 ... ])
      ...
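What the aggregate does can be mimicked in a few lines of Python. This is only an analogy for the grouping step, not the UDAF's implementation; the second trip ID and its value are hypothetical. The sketch keeps values in input order, whereas the slides do not say whether the UDAF guarantees row order within a group.

```python
from collections import defaultdict

# (trip id, latitude) rows; trip 2 is hypothetical.
rows = [(1, 37.782551), (1, 37.782745), (2, 37.7901), (1, 37.782842)]


def collect_array(rows):
    """Group values by key, like GROUP BY id with an array-building aggregate."""
    grouped = defaultdict(list)
    for trip_id, value in rows:
        grouped[trip_id].append(value)
    return dict(grouped)


arrays = collect_array(rows)
print(arrays)
```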
  • 19. Computing Trip Lengths Now Trivial
      -- Compute the trip lengths.
      -- Our coordinates conform to WGS84, use that to compute distances.
      -- ST_SetSRID(_, 4326) marks the object as conforming to WGS84.
      -- Group by trip ID.
      SELECT id,
             ST_GeodesicLengthWGS84(
               ST_SetSRID(
                 ST_LineString(collect_array(point)), 4326)) as length
      FROM (
        SELECT id, ST_Point(longitude, latitude) as point
        FROM uber
      ) sub
      GROUP BY id;
      Callouts:
      • Inner query: generate an ST_Point for each row.
      • Outer query: group the points, turn them into arrays and make a line out of them.
  • 20. Demo: Computing Trip Distances in Hortonworks Sandbox
  • 21. Visualizing Trip Times and Durations
  • 22. Time For a New Product?
      • How Likely is Demand for an SFO Rideshare?
      • How many trips even go to SFO?
  • 23. ST_Intersects
      • Determines if two shapes intersect.
      [Figure: two examples, labeled "Yes" and "Not So Much"]
  • 24. What Trips Go To SFO?
      • Approach:
          – Draw a polygon around SFO drop-off area.
          – Using the ST_LineStrings, see which trips intersect with this polygon.
  • 25. SFO Drop-Off Area
      • Inserted into table locations (name string, location string) for easy joining against other shapes.
      • Data estimated using Google Maps.
      Name  Location
      SFO   ST_Polygon(37.616543, -122.392291, 37.613297, -122.392119, 37.616458, -122.389115, 37.613552, -122.389051)
  • 26. Computing the Intersection
      SELECT count(id)
      FROM (
        SELECT id, ST_LineString(collect_array(point)) as trip
        FROM (
          SELECT id, ST_Point(longitude, latitude) AS point
          FROM uber
        ) points
        GROUP BY id
      ) trips
      JOIN (
        SELECT ST_Polygon(definition) as sfo_coordinates
        FROM locations
        WHERE locations.name = "SFO"
      ) sfosub
      WHERE ST_Intersects(sfosub.sfo_coordinates, trips.trip);
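For intuition about what a line-versus-polygon intersection test involves, here is a minimal planar sketch in Python. It is not how Spatial Framework for Hadoop implements ST_Intersects; it assumes a simple polygon whose vertices are listed in ring order as (x, y) pairs, and the square and trips below are hypothetical.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _in_box(a, b, c):
    """True if c lies inside the bounding box of segment ab."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, q1, q2):
    """True if segment p1p2 and segment q1q2 share at least one point."""
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint lying on the other segment.
    return ((o1 == 0 and _in_box(p1, p2, q1)) or (o2 == 0 and _in_box(p1, p2, q2))
            or (o3 == 0 and _in_box(q1, q2, p1)) or (o4 == 0 and _in_box(q1, q2, p2)))


def point_in_polygon(pt, polygon):
    """Ray-casting test; points exactly on the boundary are not handled specially."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def line_intersects_polygon(line, polygon):
    """True if any vertex of the line is inside, or any segment crosses an edge."""
    if any(point_in_polygon(p, polygon) for p in line):
        return True
    edges = list(zip(polygon, polygon[1:] + polygon[:1]))
    return any(segments_intersect(a, b, c, d)
               for a, b in zip(line, line[1:]) for c, d in edges)


# Hypothetical drop-off area and trips.
area = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(line_intersects_polygon([(-1, 2), (5, 2)], area))   # passes through
print(line_intersects_polygon([(5, 5), (6, 6)], area))    # stays outside
```

Checking both vertices and edge crossings matters: a trip can pass straight through the area between two readings without any reading landing inside it.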
  • 27. Demo: Counting Number of Trips to SFO in Sandbox
  • 28. Counting It Up
      • 80 of 25,000 Uber trips went to SFO (0.32%).
      • SFO Rideshare Product, maybe not a great idea.
  • 29. Conclusion
      • Spatial Framework for Hadoop makes geo analytics simple with Hadoop and Hive.
      • Hive 0.11 makes it simple to slice and dice datasets with powerful analytics like windowing.
      • Open source, extend and change to fit your needs.
  • 30. Try It For Yourself
      • Spatial Framework for Hadoop
          – esri.github.io/gis-tools-for-hadoop
      • UDFs, extra data and Hive queries
          – github.com/cartershanklin/hive-spatial-uber (for the collect_array UDAF, queries and extra data)
          – github.com/cartershanklin/spatial-framework-for-hadoop (for the extra ST_LineString constructor)
      • Main Dataset
          – infochimps.com/datasets/uber-anonymized-gps-logs
      • Hortonworks Sandbox
          – The easiest way to learn Hadoop.
          – hortonworks.com/sandbox

Editor's Notes

  1. If you spotted the error in this slide… we’re hiring.