MapMyCab
Preetika Kulshrestha
Motivation
• Tool for data scientists and cab dispatchers to analyze, by time of day or day of week:
  • cab occupancy
  • miles travelled
  • pickups and drop-offs
• An app for city dwellers to view the real-time status of unoccupied cabs in a given area
Demo
Pipeline
Components shown in the diagram: Script, Message Broker, Real-Time Streaming, HDFS, MrJob, HBase, UI
Data Flow
Input: CabID | Lat | Long | Occ | Timestamp
Processing: MrJob
Output: yyyy_m_day | AvgOcc | Pickups | Drops | Miles
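The reduction from raw GPS records to daily metrics on the Data Flow slide can be sketched in plain Python. This is a minimal illustration, not the deck's actual MrJob code. It assumes that a pickup is an occupancy change from 0 to 1 between consecutive fixes of the same cab, that a drop-off is the reverse, that miles are great-circle (haversine) distances between consecutive fixes, and that timestamps are Unix seconds in UTC. The names `haversine_miles` and `daily_metrics` are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def daily_metrics(records):
    """records: iterable of (cab_id, lat, lon, occ, unix_ts), occ is 0 or 1.

    Returns {"yyyy_m_day": {"avg_occ", "pickups", "drops", "miles"}}.
    """
    by_cab = defaultdict(list)
    for rec in records:
        by_cab[rec[0]].append(rec)

    acc = defaultdict(lambda: {"n": 0, "occ": 0, "pickups": 0, "drops": 0, "miles": 0.0})
    for recs in by_cab.values():
        recs.sort(key=lambda r: r[4])  # order each cab's fixes by time
        prev = None
        for _, lat, lon, occ, ts in recs:
            d = datetime.fromtimestamp(ts, tz=timezone.utc)
            day = f"{d.year}_{d.month}_{d.day}"  # assumed key layout
            t = acc[day]
            t["n"] += 1
            t["occ"] += occ
            if prev is not None:
                plat, plon, pocc = prev
                # Distance is credited to the day of the later fix.
                t["miles"] += haversine_miles(plat, plon, lat, lon)
                if pocc == 0 and occ == 1:
                    t["pickups"] += 1
                elif pocc == 1 and occ == 0:
                    t["drops"] += 1
            prev = (lat, lon, occ)

    return {
        day: {
            "avg_occ": t["occ"] / t["n"],
            "pickups": t["pickups"],
            "drops": t["drops"],
            "miles": t["miles"],
        }
        for day, t in acc.items()
    }
```

In a MapReduce setting the grouping by cab would typically be the shuffle step, with the per-cab loop running in the reducer.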
Tables
• Hourly data organized by Day of Week
• Aggregate metrics stored in the same table for fast retrieval
y_m_dow | c:0 | c:1 | c:2 | c:3 | c:4 | … | c:23 | c:Totals
Day of Week | Hour 0 attributes | hr 1 | hr 2 | hr 3 | hr 4 | … | hr 23 | ..
2008_01_Mon | <pickups, dropoffs, avg_occ, avg_dist> | .. | .. | .. | .. | .. | .. | <sum(pickups), sum(dropoffs), avg(occ), avg(dist)>
2008_01_Tue | <pickups, dropoffs, avg_occ, avg_dist> | .. | .. | .. | .. | .. | .. | <sum(pickups), sum(dropoffs), avg(occ), avg(dist)>
2008_01_Wed | <pickups, dropoffs, avg_occ, avg_dist> | .. | .. | .. | .. | .. | .. | <sum(pickups), sum(dropoffs), avg(occ), avg(dist)>
Hourly Aggregates by Day of Week
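A short sketch of how a timestamp could map onto this table layout: the row key combines year, month and day of week, the column qualifier is the hour, and the Totals cell is folded from the hourly cells. The helper names (`row_key`, `column`, `totals`) are hypothetical, timestamps are assumed to be Unix seconds in UTC, and the averages in Totals are taken as unweighted means of the hourly averages, which is an assumption the slide does not confirm.

```python
from datetime import datetime, timezone

DOW = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def row_key(ts):
    """Row key in the y_m_dow form, e.g. 2008_01_Mon."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{d.year}_{d.month:02d}_{DOW[d.weekday()]}"


def column(ts):
    """Column qualifier for the hour of day, e.g. c:14."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"c:{d.hour}"


def totals(hourly):
    """Fold hourly cells into the c:Totals cell.

    hourly: {"c:<hour>": (pickups, dropoffs, avg_occ, avg_dist)}
    Averages are unweighted means of the hourly averages (assumption).
    """
    cells = list(hourly.values())
    n = len(cells)
    return (
        sum(c[0] for c in cells),
        sum(c[1] for c in cells),
        sum(c[2] for c in cells) / n,
        sum(c[3] for c in cells) / n,
    )
```

Keeping the Totals cell in the same row means one row read returns both the hourly breakdown and the aggregate, which matches the "fast retrieval" point above.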
API and Lessons Learned
• Need to safeguard against corrupt data
• Workflow is very important when connecting
different tools
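One way to guard against corrupt data is to validate each record before it enters the pipeline. The sketch below assumes a whitespace-separated line in the order latitude, longitude, occupancy, timestamp (the field list from the Data slide); the delimiter, the integer encoding of occupancy and timestamp, and the function name `parse_record` are assumptions made for illustration.

```python
def parse_record(line):
    """Parse 'lat lon occ ts' and return a tuple, or None if the line is corrupt."""
    parts = line.split()
    if len(parts) != 4:
        return None  # wrong number of fields
    try:
        lat = float(parts[0])
        lon = float(parts[1])
        occ = int(parts[2])
        ts = int(parts[3])
    except ValueError:
        return None  # non-numeric field
    # Range checks also reject NaN, because comparisons with NaN are False.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    if occ not in (0, 1):
        return None
    if ts <= 0:
        return None
    return lat, lon, occ, ts
```

Returning None rather than raising lets a streaming consumer or a mapper skip bad lines and keep going.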
About Me
• Previous Life - Senior Energy
Analyst (EnerNOC Inc.).
• M.S. Electrical Engineering -
North Carolina State University
(focus on robotics, control
systems and smart grid).
• https://github.com/PreetikaKuls
• preetika.kulshrestha@gmail.com
Pipeline
Components shown in the diagram: Python Script, Message Broker, Real-Time Streaming, HDFS, MrJob, Hive, HBase, UI
Record formats annotated on the diagram:
• uid, lat, long, timestamp, occ
• y_m_dow_h, pickups, drops, dist, occ
• y_m_dow, hour(pickups, drops, dist, occ)
Data
Item | SF Cabs
Description | GPS coordinates of approx. 500 SF cabs collected over 30 days
Format | [latitude (float), longitude (float), occupancy (boolean), time (timestamp)]
Size | ~500 MB
Throughput | 50-100 messages/sec (500 cabs, 5-10 min granularity)
Master Data Set
Schema 1: row key Time, one column per CabID holding Lat | Long | Occupancy
  Use: retrieve all data for a given time frame where latitude and longitude fall within a specific range
Schema 2: row key CabID, one column per Timestamp holding Lat | Long | Occupancy
  Use: analyze data based on timestamp
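The time-keyed layout and the time-frame plus bounding-box query can be sketched with an in-memory dictionary standing in for the store. This is an illustration under assumptions, not the deck's HBase code: `build_time_index` and `query` are hypothetical names, and records are assumed to be (cab_id, lat, lon, occ, unix_ts) tuples.

```python
from collections import defaultdict


def build_time_index(records):
    """Index records as {timestamp: {cab_id: (lat, lon, occ)}}."""
    idx = defaultdict(dict)
    for cab_id, lat, lon, occ, ts in records:
        idx[ts][cab_id] = (lat, lon, occ)
    return idx


def query(idx, t_start, t_end, lat_range, lon_range):
    """Return (ts, cab_id, lat, lon, occ) for fixes inside the time frame and box."""
    hits = []
    for ts in sorted(idx):
        if not (t_start <= ts <= t_end):
            continue
        for cab_id, (lat, lon, occ) in idx[ts].items():
            if lat_range[0] <= lat <= lat_range[1] and lon_range[0] <= lon <= lon_range[1]:
                hits.append((ts, cab_id, lat, lon, occ))
    return hits
```

With time as the row key, the time-frame filter becomes a contiguous range scan in a sorted store, and the latitude/longitude filter is applied to the rows that scan returns.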
Batch Processing Result
Features and Example Queries
Features
• A system that uses crowdsourcing to automatically generate parking spot information for streets
• Parking information overlaid on Google Maps
Queries
• Does West Middlefield Road allow for street parking?
• Can I park on this street for more than 2 hours?
• Which nearby streets might have better parking availability?
