MapMyCab
Preetika Kulshrestha
Insight Data Engineering, Feb 2015
Motivation
• Tool for data scientists and cab dispatchers to analyze (by time of day or day of week):
  • cab occupancy
  • miles travelled
  • pickups and drop-offs
• An app for city dwellers to view the real-time status of unoccupied cabs in a given area
Demo
Pipeline
[Pipeline diagram. Components shown: Cab Data, Message Broker, Real-Time Streaming, HDFS, HBase, UI, MrJob. Dataset size: 11 million rows.]
Data Aggregation
Input fields: CabID, Lat, Long, Occ, Timestamp
Aggregate metrics (per cab), computed with MrJob: year, month, day, hour, avocc, pickup, drop off
• Drop-off event: occupancy changes from 1 to 0
• Pickup event: occupancy changes from 0 to 1
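The event rules above can be illustrated with a small sketch. This is not the MrJob job from the talk; it is a plain-Python approximation that assumes the timestamp is a Unix time in seconds (interpreted as UTC) and that occupancy is 0 or 1. The function name `hourly_metrics` and the output field names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_metrics(records):
    """Aggregate (cab_id, occ, unix_ts) records into per-cab, per-hour metrics.

    Returns {(cab_id, year, month, day, hour): {"avocc", "pickup", "dropoff"}}.
    Assumes unix_ts is in seconds and occ is 0 or 1 (assumptions, not from the slides).
    """
    # Group readings by cab so transitions are only compared within one cab.
    by_cab = defaultdict(list)
    for cab_id, occ, ts in records:
        by_cab[cab_id].append((ts, occ))

    sums = defaultdict(lambda: {"occ_sum": 0, "n": 0, "pickup": 0, "dropoff": 0})
    for cab_id, points in by_cab.items():
        points.sort()  # order each cab's readings by time
        prev_occ = None
        for ts, occ in points:
            t = datetime.fromtimestamp(ts, tz=timezone.utc)
            bucket = sums[(cab_id, t.year, t.month, t.day, t.hour)]
            bucket["occ_sum"] += occ
            bucket["n"] += 1
            if prev_occ == 0 and occ == 1:
                bucket["pickup"] += 1   # occupancy changed from 0 to 1
            elif prev_occ == 1 and occ == 0:
                bucket["dropoff"] += 1  # occupancy changed from 1 to 0
            prev_occ = occ

    return {
        key: {
            "avocc": b["occ_sum"] / b["n"],
            "pickup": b["pickup"],
            "dropoff": b["dropoff"],
        }
        for key, b in sums.items()
    }
```

In a MapReduce setting the grouping by cab would presumably be the map/shuffle step and the transition counting the reduce step; here both happen in memory for clarity.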
Computing Trip Durations and Shift Times
• Used a windowing function in Hive to calculate idle times
• The maximum idle time in a day points to a potential shift
• 1 million trips
[Table: idle/shift time (hours), with columns tripId, hour, idle (s), idle (h)]
[Chart: Occupancy Profile, occ (%) by hour of day (0 to 23), with the potential shift time marked]
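The idle-time idea on this slide can be sketched outside Hive as follows. The slides do not say which window function was used; this plain-Python version computes what a lag-style window over each cab's trips, ordered by start time, would give. The input shape `(cab_id, start_ts, end_ts)` and the function names are assumptions made for illustration, and the caller is assumed to pass one day's trips.

```python
from collections import defaultdict


def idle_times(trips):
    """Compute gaps between consecutive trips of the same cab.

    trips: iterable of (cab_id, start_ts, end_ts), timestamps in seconds.
    Returns {cab_id: [(idle_start_ts, idle_seconds), ...]}.
    """
    by_cab = defaultdict(list)
    for cab_id, start, end in trips:
        by_cab[cab_id].append((start, end))

    gaps = {}
    for cab_id, cab_trips in by_cab.items():
        cab_trips.sort()  # order by trip start, like ORDER BY in a window
        gaps[cab_id] = [
            (prev_end, next_start - prev_end)
            for (_, prev_end), (next_start, _) in zip(cab_trips, cab_trips[1:])
        ]
    return gaps


def longest_idle(gaps):
    """Return {cab_id: (idle_start_ts, idle_seconds)} for the longest gap per cab."""
    return {cab: max(g, key=lambda x: x[1]) for cab, g in gaps.items() if g}
```

The longest gap per cab is the candidate for a shift boundary, matching the "maximum idle time in a day" heuristic on the slide.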
Tables
• Hourly data organized by day of week
• Aggregate metrics stored in the same table for fast retrieval
Hourly Aggregates by Day of Week (table layout)
Row key y_m_dow (day of week): e.g. 2008_01_Mon
Column c:0 (hour 0 attributes): pickups, dropoffs, avg_occ, avg_dist
Columns c:1 to c:23 (hr 1 to hr 23): ..
Column c:Totals: sum(pickups), sum(drop offs), avg(occ), avg(dist)
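A rough sketch of how one such row could be assembled is shown below. It uses plain dictionaries rather than an HBase client, and the helper names `row_key` and `build_row` are hypothetical. The totals column here averages the hourly averages without weighting, which is a simplification; the slides do not specify how avg(occ) and avg(dist) were computed.

```python
from datetime import date

DOW = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def row_key(year, month, day):
    """Build a y_m_dow key such as '2008_01_Mon' from a calendar date."""
    return f"{year}_{month:02d}_{DOW[date(year, month, day).weekday()]}"


def build_row(hourly):
    """Lay out one row as column -> attributes.

    hourly: non-empty {hour: {"pickups", "dropoffs", "avg_occ", "avg_dist"}}.
    Returns columns 'c:<hour>' plus a 'c:Totals' column in the same row.
    """
    row = {f"c:{h}": dict(v) for h, v in hourly.items()}
    n = len(hourly)
    row["c:Totals"] = {
        "pickups": sum(v["pickups"] for v in hourly.values()),
        "dropoffs": sum(v["dropoffs"] for v in hourly.values()),
        # Unweighted mean of hourly averages (a simplifying assumption).
        "avg_occ": sum(v["avg_occ"] for v in hourly.values()) / n,
        "avg_dist": sum(v["avg_dist"] for v in hourly.values()) / n,
    }
    return row
```

Keeping the totals in the same row as the hourly columns means a single row read returns both the hourly breakdown and the aggregates.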
Takeaways
• HBase row-level atomicity can be leveraged for transactional operations
• A keyed producer in Kafka ensures in-order delivery of messages (by key)
• Starting with simple operations for tool integration, then adding complexity incrementally, streamlines the development process
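The point about keyed producers can be illustrated with a toy model. This is not Kafka client code: it simulates the idea that messages with the same key are routed to the same partition, and that each partition is an append-only log, so per-key order is preserved. CRC32 is used as a stand-in hash and is not Kafka's actual partitioner; `partition_for` and `send_all` are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def send_all(messages, num_partitions):
    """Append (key, value) messages to per-partition logs.

    Returns {partition: [(key, value), ...]} in arrival order.
    """
    log = defaultdict(list)
    for key, value in messages:
        log[partition_for(key, num_partitions)].append((key, value))
    return log


if __name__ == "__main__":
    # Interleaved updates from two cabs, keyed by a hypothetical cab ID.
    msgs = [("cab1", 1), ("cab2", 1), ("cab1", 2), ("cab2", 2), ("cab1", 3)]
    for partition, entries in sorted(send_all(msgs, 4).items()):
        print(partition, entries)
```

Using the cab ID as the key would keep each cab's position updates in order for a consumer reading that partition, while updates from different cabs may interleave freely.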
About Me
• Previous Life - Senior Energy Analyst
(EnerNOC Inc.).
• M.S. Electrical Engineering - North Carolina
State University (focus on robotics, control
systems and smart grid).
• https://github.com/PreetikaKuls
• preetika.kulshrestha@gmail.com
Batch Views
