MapMyCab Presentation

  1. MapMyCab. Preetika Kulshrestha, Insight Data Engineering, Feb 2015
  2. Motivation
       • Tool for data scientists and cab dispatchers to analyze (by time of day or day of week):
           • cab occupancy
           • miles travelled
           • pickups and drop-offs
       • An app for city dwellers to view the real-time status of unoccupied cabs in a given area
  3. Demo
  4. Pipeline (diagram labels): Cab Data, Message Broker, Real-Time Streaming, HDFS, HBase, UI, MrJob. 11 million rows.
  5. Data Aggregation
       • Input columns: CabID, Lat, Long, Occ, Timestamp
       • MrJob output, Aggregate Metrics (per cab): year, month, day, hour, avocc, pickup, drop off
       • Drop-off event: occupancy changes from 1 to 0
       • Pickup event: occupancy changes from 0 to 1
  6. Computing Trip Durations and Shift Times
       • Used a windowing function in Hive to calculate idle times
       • The maximum idle time in a day points to a potential shift
       • 1 million trips
       • Table columns: tripId, hour, idle (s), idle (h)
       • [Figure: Occupancy Profile, occ (%) by hour of day, with the potential shift time marked; idle/shift time in hours]
  7. Tables
       • Hourly data organized by day of week
       • Aggregate metrics stored in the same table for fast retrieval
       • Table layout (Hourly Aggregates by Day of Week):
           • Row key: y_m_dow (day of week), e.g. 2008_01_Mon
           • Columns c:0, c:1, c:2, c:3, c:4, …, c:23: one per hour (hour 0 to hr 23), each holding the attributes pickups, dropoffs, avg_occ, avg_dist
           • Column c:Totals: sum(pickups), sum(drop offs), avg(occ), avg(dist)
  8. Takeaways
       • HBase row-level atomicity can be leveraged for transactional operations
       • A keyed producer in Kafka assures in-order delivery of messages (by key)
       • Starting with simple operations for tool integration, then adding complexity incrementally, streamlines the development process
  9. About Me
       • Previous life: Senior Energy Analyst (EnerNOC Inc.)
       • M.S. Electrical Engineering, North Carolina State University (focus on robotics, control systems and smart grid)
       • https://github.com/PreetikaKuls
       • preetika.kulshrestha@gmail.com
  10. Batch Views
  11. Batch Views
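
The event rules on slide 5 (pickup when occupancy goes 0 to 1, drop-off when it goes 1 to 0) and the idle-time idea on slide 6 can be illustrated with a small sketch. This is plain Python, not the MrJob or Hive code from the talk, which the slides do not show. The function names, the `(occ, unix_ts)` record shape, the sample pings, and the definition of idle time as the gap between a drop-off and the next pickup are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def detect_events(pings):
    """Return (kind, ts) events for ONE cab from (occ, unix_ts) pings.

    Assumed rule, taken from slide 5: 0 -> 1 is a pickup, 1 -> 0 is a drop-off.
    """
    events = []
    prev_occ = None
    for occ, ts in sorted(pings, key=lambda p: p[1]):  # order by timestamp
        if prev_occ == 0 and occ == 1:
            events.append(("pickup", ts))
        elif prev_occ == 1 and occ == 0:
            events.append(("dropoff", ts))
        prev_occ = occ
    return events


def hourly_counts(events):
    """Count events per (year, month, day, hour, kind), using UTC time."""
    counts = Counter()
    for kind, ts in events:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[(t.year, t.month, t.day, t.hour, kind)] += 1
    return counts


def idle_gaps(events):
    """Seconds between each drop-off and the next pickup (assumed idle time)."""
    gaps = []
    last_drop = None
    for kind, ts in events:
        if kind == "dropoff":
            last_drop = ts
        elif kind == "pickup" and last_drop is not None:
            gaps.append(ts - last_drop)
            last_drop = None
    return gaps


# Hypothetical pings for a single cab: (occupancy flag, unix timestamp).
pings = [(0, 0), (1, 600), (1, 1200), (0, 1800), (0, 2400), (1, 9000), (0, 9600)]
events = detect_events(pings)
counts = hourly_counts(events)
gaps = idle_gaps(events)
longest_idle = max(gaps) if gaps else None  # candidate shift gap for the day
```

In a windowed SQL query the same comparison would typically be written by looking at the previous row per cab; the loop above keeps `prev_occ` and `last_drop` by hand to play that role.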
