2. Motivation
• A tool for data scientists and cab dispatchers to analyze, by
time of day or day of week:
  • cab occupancy
  • miles travelled
  • pickups and drop-offs
• An app for city dwellers to view the real-time status of
unoccupied cabs in a given area
5. Data Aggregation
Input records: CabID, Lat, Long, Occ, Timestamp
Processing: MrJob
Output, aggregate metrics (per cab): year, month, day, hour, avocc, pickup, drop off
• Drop-off event: occupancy changes from 1 to 0
• Pickup event: occupancy changes from 0 to 1
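
The event rules above can be sketched in plain Python. This is only an illustration of the logic, not the MrJob job used in the project: the function name hourly_metrics, the tuple layout, the output key names, and the assumption that timestamps are Unix seconds interpreted as UTC are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_metrics(records):
    """Aggregate one cab's records into hourly metrics.

    records: iterable of (cab_id, lat, lon, occ, unix_ts) tuples for a
    single cab (assumed layout). Returns a dict keyed by
    (year, month, day, hour).
    """
    # Transitions only make sense in time order, so sort by timestamp first.
    ordered = sorted(records, key=lambda r: r[4])
    buckets = defaultdict(lambda: {"occ_sum": 0, "n": 0, "pickup": 0, "dropoff": 0})
    prev_occ = None
    for _cab, _lat, _lon, occ, ts in ordered:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        bucket = buckets[(t.year, t.month, t.day, t.hour)]
        bucket["occ_sum"] += occ
        bucket["n"] += 1
        if prev_occ == 0 and occ == 1:
            bucket["pickup"] += 1    # occupancy 0 -> 1
        elif prev_occ == 1 and occ == 0:
            bucket["dropoff"] += 1   # occupancy 1 -> 0
        prev_occ = occ
    return {
        key: {
            "avg_occ": b["occ_sum"] / b["n"],
            "pickup": b["pickup"],
            "dropoff": b["dropoff"],
        }
        for key, b in buckets.items()
    }
```

Each event is credited to the hour of the record on which the change is observed. In a MapReduce setting the same loop would sit in a reducer keyed by cab ID.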
6. Computing Trip Durations and Shift Times
• Used a windowing function in Hive to calculate idle times
• The maximum idle time in a day points to a potential shift
• 1 million trips
[Table: tripId, hour, idle (s), idle (h); idle/shift time in hours]
[Figure: Occupancy Profile, occ (%) by hour of day (0 to 23), with the potential shift time marked]
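
The idle-time calculation can be illustrated outside Hive with a small Python sketch. The slides do not show the query, so this assumes a LAG-style lookup: each trip's idle time is its start time minus the end time of the same cab's previous trip. The function names, the tuple layout, and the use of UTC day boundaries are hypothetical.

```python
from itertools import groupby


def idle_times(trips):
    """trips: iterable of (cab_id, start_ts, end_ts) in Unix seconds.

    Returns (cab_id, start_ts, idle_seconds) for every trip that has a
    predecessor for the same cab.
    """
    rows = []
    ordered = sorted(trips, key=lambda t: (t[0], t[1]))
    for cab_id, group in groupby(ordered, key=lambda t: t[0]):
        prev_end = None  # plays the role of LAG(end_ts) within the cab's window
        for _cab, start, end in group:
            if prev_end is not None:
                rows.append((cab_id, start, start - prev_end))
            prev_end = end
    return rows


def longest_idle_per_day(idle_rows):
    """Keep the largest idle gap per (cab_id, day) as a shift candidate."""
    best = {}
    for cab_id, start, idle in idle_rows:
        key = (cab_id, start // 86400)  # day index, assuming UTC days
        if key not in best or idle > best[key][1]:
            best[key] = (start, idle)
    return best
```

The longest gap in a day is only a candidate for a shift break; a real analysis would likely apply a minimum-duration threshold as well.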
7. Tables
• Hourly data organized by day of week
• Aggregate metrics stored in the same table for fast retrieval
y_m_dow      c:0                 c:1   c:2   c:3   c:4   …   c:23   c:Totals
Day of Week  Hour 0 Attributes   hr 1  hr 2  hr 3  hr 4  …   hr 23  ..
2008_01_Mon  pickups, dropoffs,  ..    ..    ..    ..    ..  ..     sum(pickups), sum(drop offs),
             avg_occ, avg_dist                                      avg(occ), avg(dist)
Hourly Aggregates by Day of Week
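
A minimal sketch of how one such row could be assembled before writing it to HBase. The row key format and the c:0 … c:23 and c:Totals qualifiers follow the table above; everything else is an assumption, including the helper names, the JSON encoding of cell values, and the use of an unweighted mean of hourly averages for the totals.

```python
import json
from datetime import date

DOW = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def row_key(d):
    """Build a y_m_dow row key such as 2008_01_Mon from a date."""
    return f"{d.year}_{d.month:02d}_{DOW[d.weekday()]}"


def build_row(hourly):
    """hourly: {hour: {"pickups", "dropoffs", "avg_occ", "avg_dist"}}.

    Returns {column qualifier: JSON string}, one column per hour plus
    c:Totals, so a single row read returns both detail and aggregates.
    """
    if not hourly:
        raise ValueError("need at least one hour of data")
    cols = {f"c:{h}": json.dumps(m, sort_keys=True) for h, m in hourly.items()}
    n = len(hourly)
    totals = {
        "pickups": sum(m["pickups"] for m in hourly.values()),
        "dropoffs": sum(m["dropoffs"] for m in hourly.values()),
        # Unweighted mean of the hourly averages (an assumption).
        "avg_occ": sum(m["avg_occ"] for m in hourly.values()) / n,
        "avg_dist": sum(m["avg_dist"] for m in hourly.values()) / n,
    }
    cols["c:Totals"] = json.dumps(totals, sort_keys=True)
    return cols
```

For example, row_key(date(2008, 1, 7)) gives "2008_01_Mon", since 7 January 2008 was a Monday.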
8. Takeaways
• HBase row-level atomicity can be leveraged for transactional operations
• A keyed producer in Kafka ensures in-order delivery of messages (by key)
• Starting with simple operations for tool integration, then adding
complexity incrementally, streamlines the development process
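
The ordering guarantee for keyed messages comes from partitioning: messages with the same key go to the same partition, and each partition is an ordered log. The sketch below simulates that idea in plain Python without a Kafka client. The function names are hypothetical, and zlib.crc32 is a stand-in chosen for simplicity; Kafka's default Java partitioner hashes the key bytes with murmur2 instead.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition deterministically (crc32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions=4):
    """Append (key, value) messages to per-partition lists in arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions
```

Using the cab ID as the key would keep every cab's updates in send order within its partition, even when updates from different cabs interleave. The guarantee holds only while the partition count stays fixed.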
9. About Me
• Previous Life - Senior Energy Analyst (EnerNOC Inc.).
• M.S. Electrical Engineering - North Carolina State University
(focus on robotics, control systems and smart grid).
• https://github.com/PreetikaKuls
• preetika.kulshrestha@gmail.com