Insight Data Engineering, Feb 2015
• Tool for data scientists and cab dispatchers to analyze (by
time of day or day of week):
• cab occupancy
• miles travelled
• pickups and drop-offs
• An app for city dwellers to view real-time cab status for
unoccupied cabs in a given area
11 million rows
CabID Lat Long Occ Timestamp
Aggregate Metrics (per cab)
year month day hour avocc pickup drop off
• Drop-off event: occupancy change from 1 to 0
• Pickup event: occupancy change from 0 to 1
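The event definitions above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the project: the function names `detect_events` and `hourly_counts`, the tuple layout (mirroring the CabID / Lat / Long / Occ / Timestamp columns), the Unix-second timestamps, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def detect_events(records):
    """Yield (cab_id, timestamp, event) for each occupancy transition.

    records: iterable of (cab_id, lat, lon, occ, timestamp) tuples, with
    occ in {0, 1} and timestamp in Unix seconds (assumed layout).
    """
    last_occ = {}  # most recent occupancy flag seen for each cab
    # Order rows per cab by time so consecutive rows can be compared.
    for cab_id, _lat, _lon, occ, ts in sorted(records, key=lambda r: (r[0], r[4])):
        prev = last_occ.get(cab_id)
        if prev == 0 and occ == 1:
            yield cab_id, ts, "pickup"
        elif prev == 1 and occ == 0:
            yield cab_id, ts, "dropoff"
        last_occ[cab_id] = occ


def hourly_counts(events):
    """Count pickups and drop-offs per (cab, year, month, day, hour)."""
    counts = defaultdict(lambda: {"pickup": 0, "dropoff": 0})
    for cab_id, ts, event in events:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        counts[(cab_id, t.year, t.month, t.day, t.hour)][event] += 1
    return dict(counts)


# Made-up sample rows: one cab, one reading per minute.
sample = [
    ("cab1", 0.0, 0.0, 0, 0),
    ("cab1", 0.0, 0.0, 1, 60),
    ("cab1", 0.0, 0.0, 1, 120),
    ("cab1", 0.0, 0.0, 0, 180),
]
events = list(detect_events(sample))
```

Here `events` holds one pickup (at 60 s) and one drop-off (at 180 s), and `hourly_counts(events)` groups them under the key for hour 0 of that day.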
Computing Trip Durations and Shift Times
• Used a windowing
function in Hive to
calculate idle times
• Maximum idle time
in a day points to a
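As a rough sketch of the windowing idea, the Python below mimics what a LAG-style window function does: within each cab's rows, ordered by time, every row is compared with the previous one. It assumes that idle time means the gap between a drop-off and the next pickup, which the slides do not spell out; the function names, the event tuples, and the sample values are likewise hypothetical and this is not the Hive query used in the project.

```python
from itertools import groupby
from operator import itemgetter


def idle_times(events):
    """Return (cab_id, idle_start, idle_seconds) for each drop-off that is
    followed by a pickup of the same cab.

    events: iterable of (cab_id, timestamp, event), event in
    {"pickup", "dropoff"}, timestamp in Unix seconds (assumed layout).
    """
    out = []
    ordered = sorted(events, key=itemgetter(0, 1))
    # groupby plays the role of PARTITION BY cab; the sort is the ORDER BY.
    for cab_id, rows in groupby(ordered, key=itemgetter(0)):
        prev = None  # previous row in this partition (the LAG value)
        for _cab, ts, event in rows:
            if prev is not None and prev[1] == "dropoff" and event == "pickup":
                out.append((cab_id, prev[0], ts - prev[0]))
            prev = (ts, event)
    return out


def max_idle_per_day(idles):
    """Keep the longest idle period per (cab, day); day = Unix day number."""
    best = {}
    for cab_id, start, seconds in idles:
        key = (cab_id, start // 86400)
        if key not in best or seconds > best[key][1]:
            best[key] = (start, seconds)
    return best


# Made-up events for one cab.
sample_events = [
    ("cab1", 0, "pickup"),
    ("cab1", 600, "dropoff"),
    ("cab1", 900, "pickup"),
    ("cab1", 1500, "dropoff"),
    ("cab1", 5100, "pickup"),
]
idles = idle_times(sample_events)
```

With this sample, the cab has two idle periods (300 s and 3600 s), and `max_idle_per_day(idles)` keeps the longer one for that day.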
• 1 million trips
tripId hour idle (s) idle (h)
(Chart axis: hour of day, 0 to 23)
• Hourly data organized by Day of Week
• Aggregate metrics stored in the same table for fast retrieval
y_m_dow       c:0                 c:1    c:2    c:3    c:4    …   c:23    c:Totals
Day of Week   Hour 0 Attributes   hr 1   hr 2   hr 3   hr 4   …   hr 23   ..
..            ..                  ..     ..     ..     ..
Hourly Aggregates by Day of Week
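One way to picture this layout is as a single wide row per year, month, and day of week, with one column per hour plus a totals column. The sketch below builds such a row as a plain Python dict and uses no HBase client. The row-key format produced by `row_key` is a guess at what `y_m_dow` stands for, and both helper names are hypothetical.

```python
def row_key(year, month, dow):
    """Build a y_m_dow style row key (format assumed, e.g. '2008_05_6')."""
    return f"{year}_{month:02d}_{dow}"


def build_row(hourly):
    """Map 24 hourly values to columns c:0 .. c:23 plus c:Totals."""
    if len(hourly) != 24:
        raise ValueError("expected one value per hour of the day")
    row = {f"c:{hour}": value for hour, value in enumerate(hourly)}
    # Keeping the total in the same row means one read returns everything.
    row["c:Totals"] = sum(hourly)
    return row


# Made-up hourly values, just to show the shape of a row.
key = row_key(2008, 5, 6)
row = build_row(list(range(24)))
```

Storing the hourly values and their total under one row key is what lets a single row lookup serve a whole day-of-week profile.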
• HBase row level atomicity can be leveraged for
• Keyed producer in Kafka ensures in-order delivery
of messages that share a key (ordering holds per partition)
• Simple operations for tool integration, followed by
incremental complexity streamlines the
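The keyed-producer point can be illustrated with a small simulation that does not use Kafka itself: if every message with a given key is routed to the same partition, and each partition is an append-only log, then messages for that key are read back in the order they were sent. The hash used here (CRC32) is only a stand-in for Kafka's own partitioner, and the function names and sample messages are assumptions for the example.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from the key (CRC32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def send_all(messages, num_partitions):
    """Append each (key, value) message to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def values_for(key, partitions):
    """Read back, in log order, the values stored for one key."""
    log = partitions[partition_for(key, len(partitions))]
    return [value for k, value in log if k == key]


# Made-up messages: interleaved updates for two cabs.
messages = [("cab1", 1), ("cab2", 1), ("cab1", 2), ("cab2", 2), ("cab1", 3)]
partitions = send_all(messages, 3)
```

Reading `values_for("cab1", partitions)` returns the cab's updates in send order, even though updates for other cabs were interleaved. Order across different keys is not preserved by this scheme.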
• Previous Life - Senior Energy Analyst
• M.S. Electrical Engineering - North Carolina
State University (focus on robotics, control
systems and smart grid).