© 2014 Silicon Valley Data Science LLC
All Rights Reserved.
svds.com @SVDataScience
Railroad Modeling at Hadoop Scale

  • When a train is detected, the information is sent to HBase and to the Event Detector.

    The camera has a network connection, so we can drop images via wget to the local server.
    Label wget
  • Add API setup
  • Railroad Modeling at Hadoop Scale

    1. Railroad Modeling at Hadoop Scale
       Hadoop Summit, 3 June 2014, San Jose, CA
       John Akred (@BigDataAnalysis), Tatsiana Maskalevich (@notrockstar)
       © 2014 Silicon Valley Data Science LLC. All Rights Reserved. www.svds.com @SVDataScience
    2. Why is a data science & engineering consulting company building its own Caltrain app?
    3.
    4. • Commuter rail between San Francisco and San Mateo and Santa Clara counties, ~30 stations
       • 118 passenger cars
       • 60% are at least 30 years old
       • 2014 weekday ridership is 52,019 people
       • On-time performance is about 92%
       • No reliable real-time status information
       • API outage between April 5th and June 2nd
    5. HOW DO WE KNOW IF THE TRAIN IS LATE?
       • Direct observation
         – We can hear the train horn
         – We can see the train when it goes by
       • Purpose-built systems
         – We can use Caltrain APIs (when working)
       • Other signals
         – We can check Twitter for delay info or rider comments
    6. SVDS Approach
       • Take advantage of the available signals
       • Use historical data to make direct and latent observations more useful
       • Provide a service that gives users valuable planning and riding features
       • Don’t let the perfect be the enemy of the good
    7. DATA RESILIENCY
       • Stovepipe: One-to-one relationship from data source to product
       • Hard Failure: If the data source is broken, so is the app
       • Multi-sourced: Redundancy of overlapping data sources makes your products more resilient
       • Graceful Degradation: If a data source breaks, there is a backup and your app continues to function
       Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh.
       Diagram labels: Products, Data Sources, Broken Data Sources, Data Services
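The graceful-degradation idea on slide 7 can be illustrated with a minimal sketch. It is not the SVDS data service: the slide speaks of probabilistic integration of overlapping sources, while this example only does ordered fallback. The `Source` type, the function names, and the sample values are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical source type: returns a delay estimate in minutes,
# or None when the source has nothing to report.
Source = Callable[[], Optional[float]]


def first_available(sources: Sequence[Source]) -> Optional[float]:
    """Return the first usable reading from an ordered list of overlapping sources."""
    for source in sources:
        try:
            reading = source()
        except Exception:
            # A broken source must not take the product down; try the next one.
            continue
        if reading is not None:
            return reading
    return None  # every source failed: the caller decides how to degrade


def broken_api() -> Optional[float]:
    # Stands in for a data source that is currently down (assumption).
    raise ConnectionError("API outage")


def audio_estimate() -> Optional[float]:
    # Made-up value standing in for an estimate derived from another signal.
    return 4.0


delay = first_available([broken_api, audio_estimate])
print(delay)
```

A stovepipe product corresponds to calling `broken_api()` directly, in which case the exception reaches the user.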
    8. Source Signals: Audio, Image, Text, API
       Diagram labels: Variety, Volume, Velocity
    9. Audio Capture and Ingest
       • Microphone connected to Raspberry Pi: mic -> preamp -> analog-to-digital converter -> USB
       • PyAudio running on the Raspberry Pi serializes audio as an array of 2-byte integers
       • Sound data + metadata -> Flume on AWS via flumelogger
       • We use FFT + decision trees to detect trains and classify them as express or local based on the whistle sound
       Diagram labels: Raspberry Pi, Raw Audio Agent (twice)
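As an illustration of "an array of 2-byte integers" on slide 9, the sketch below packs samples as signed 16-bit values next to a small metadata header. It does not use PyAudio or flumelogger; the function names, the metadata fields, and the little-endian byte order are assumptions made for the example.

```python
import json
import struct


def serialize_chunk(samples, sample_rate=44100):
    """Pack samples as little-endian signed 16-bit integers plus a JSON header."""
    payload = struct.pack("<%dh" % len(samples), *samples)
    # Hypothetical metadata fields; the slides do not list the real ones.
    header = json.dumps({"sample_rate": sample_rate, "n_samples": len(samples)})
    return header.encode("utf-8"), payload


def deserialize_chunk(payload):
    """Inverse of the payload packing: bytes back to a list of integers."""
    n_samples = len(payload) // 2  # 2 bytes per sample
    return list(struct.unpack("<%dh" % n_samples, payload))


header, payload = serialize_chunk([0, 1000, -1000, 32767])
print(len(payload), deserialize_chunk(payload))
```

Each sample costs exactly two bytes, so one second of mono audio at the assumed 44100 Hz rate is 88200 bytes before metadata.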
    10. Image Capture and Ingest
        • wget pulls images from the camera’s built-in server 2-3 times a second and saves them on a local server/NAS
        • Flume pushes the image data to our EC2 servers
        • OpenCV (Python) is used to detect trains in images
        Diagram labels: Raw Image Agent (twice), Local Server
    11. Text Capture and Ingest: Twitter
        • Capturing all the tweets with keyword ‘Caltrain’ via the Twitter API
        • Flume agent sends tweets to an Apache Storm topology for processing
        • Tweets are parsed and written to HDFS and HBase
        • Event detection is based on the baseline number of tweets per hour and keywords
        Diagram labels: Raw Image Agent, Twitter APIs
    12. Caltrain API Data Capturing
        • Real-time departure times are available via the 511.org developer APIs
        • A Python script collects data once a minute from the 511.org APIs and stores it in HDFS as sequence files using the WebHDFS APIs
        • A Python script collects data from the Caltrain site that includes the run #
        • The API didn’t function from April 5th until June 2nd, 2014
        Diagram labels: scraper.py, 511.org APIs, Caltrain Webpage, data_collector_api.py
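For the WebHDFS step on slide 12, here is a hedged sketch of how a collector might build the request URL for a file upload. The WebHDFS REST API creates a file with an HTTP PUT to `/webhdfs/v1/<path>?op=CREATE`, which normally answers with a redirect to a datanode that receives the bytes. The host, port, user, and directory layout below are assumptions, and the sketch neither sends a request nor produces the sequence-file format.

```python
from datetime import datetime
from urllib.parse import quote, urlencode


def snapshot_path(base_dir, when):
    """Hypothetical layout: one file per minute under a dated directory."""
    return "%s/%s.json" % (base_dir.rstrip("/"), when.strftime("%Y/%m/%d/%H%M"))


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the first-step URL of a WebHDFS CREATE (to be sent as an HTTP PUT)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(hdfs_path), query)


# All values below are placeholders for illustration.
path = snapshot_path("/data/caltrain", datetime(2014, 6, 3, 8, 15))
url = webhdfs_create_url("namenode.example.com", 50070, path, "collector")
print(url)
```

Running this once a minute from a scheduler would give one HDFS file per poll; batching several polls per file is an equally plausible design that the slides do not rule out.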
    13. Combining the Signals
        Diagram labels: Audio Signal Detection, Image Recognition, Text Analysis, STATE of complex system
    14. Architecture diagram labels:
        • External Data Sources: Twitter API, Microphone on Raspberry Pi, Web Camera, Caltrain API
        • Agents: Twitter Agent, Sound Agent, Image Agent, Caltrain Agent
        • Spouts: Twitter Spout, Sound Spout, Image Spout, Caltrain Spout
        • Processing: Tweet Parser, Tweets Counter, Sounds Classifier, Train Detector, Schedule Integrator, Event Detector
        • Outputs: HDFS Writer, HBase Writer, Alerts, Transmit to APP
        • Other labels: Data Platform, Analytics, Dev, MapReduce, Event Storage
    15. Data Science: Audio
        Batch:
        • Apply FFT to audio data to identify the train based on the train whistle’s fundamental frequencies
        • Decision tree trained to classify trains into local or express based on minimum and maximum fundamental frequencies (Doppler effect)
        Real-Time:
        • Execute local / express classifier
        • Send data to the Event Detector for APP alerts
        • Store results in HBase
        • Apply FFT to audio signal
        • Extract min and max fundamental frequencies
        Figure: Histogram of Whistle Frequencies Over a Period of Time (axes: Frequency, Hz; Frequency Counts)
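The "apply FFT, extract min and max frequencies" steps on slide 15 can be sketched with a small pure-Python FFT. This is an illustration under assumptions, not the speakers' code: it treats the strongest spectral peak of each window as the frequency of interest (a real whistle's fundamental is not always the strongest peak), and the sample rate, window size, and synthetic tones are made up.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest non-DC bin in the first half of the spectrum."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[1:half]]  # skip the DC bin
    peak_bin = magnitudes.index(max(magnitudes)) + 1
    return peak_bin * sample_rate / len(samples)


def min_max_frequency(signal, sample_rate, window=1024):
    """Dominant frequency per non-overlapping window, reduced to (min, max)."""
    peaks = [
        dominant_frequency(signal[i:i + window], sample_rate)
        for i in range(0, len(signal) - window + 1, window)
    ]
    return min(peaks), max(peaks)


SAMPLE_RATE = 8192  # assumed rate, chosen so that each bin is exactly 8 Hz wide


def tone(freq_hz, n):
    """Synthetic sine tone standing in for recorded whistle audio."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]


# A tone that drops in pitch, loosely imitating a Doppler shift.
signal = tone(440, 1024) + tone(400, 1024)
low, high = min_max_frequency(signal, SAMPLE_RATE)
print(low, high)
```

The resulting `(low, high)` pair is the kind of two-feature input that the decision tree on the slide is described as consuming.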
    16. Data Science: Image
        Real-Time:
        • ORB algorithm (OpenCV) is used to detect the train in the image
        • Sends results to the Event Detector to identify the train and compare to the schedule
        • Event Detector updates the APP with the train’s status and alerts if late
        Figure: Number of Key-Points That Are the Same in Two Consecutive Images (axes: Time (Sec); Number of Matching Points)
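The figure on slide 16 tracks how many key points match between two consecutive images. ORB describes key points with binary descriptors, which are usually compared by Hamming distance, so the counting step can be sketched without OpenCV as below. The 8-bit descriptors, the distance threshold, and the function names are made up for illustration; real ORB descriptors are 256 bits and would come from the library.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary descriptors."""
    return bin(a ^ b).count("1")


def count_matches(desc_a, desc_b, max_distance=2):
    """Count cross-checked brute-force matches between two descriptor lists.

    A pair counts when each descriptor is the other's nearest neighbour
    and their Hamming distance is at most max_distance (assumed threshold).
    """
    if not desc_a or not desc_b:
        return 0
    matches = 0
    for i, a in enumerate(desc_a):
        j = min(range(len(desc_b)), key=lambda idx: hamming(a, desc_b[idx]))
        back = min(range(len(desc_a)), key=lambda idx: hamming(desc_a[idx], desc_b[j]))
        if back == i and hamming(a, desc_b[j]) <= max_distance:
            matches += 1
    return matches


# Toy descriptors for two consecutive frames (hypothetical values).
frame1 = [0b10110010, 0b01001101, 0b11110000]
frame2 = [0b10110011, 0b01001101, 0b00001111]
print(count_matches(frame1, frame2))
```

Plotting this count over time gives a series like the one on the slide; how a train shows up in that series is not stated in the transcript, so no detection threshold is assumed here.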
    17. Data Science: Text
        Batch:
        • Update baseline tweet frequencies for each hour as additional historical data is collected
        • Store model parameters in HBase
        Real-Time:
        • Count tweets as they stream through the topology
        • Alert based on frequency deviations from the baseline
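The batch and real-time halves of slide 17 can be sketched as an hourly baseline plus a deviation check. The slide does not say how a deviation is measured; the rule below (count above mean plus `k` standard deviations) is an assumption, as are the function names and the sample history.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history):
    """Batch step: (hour_of_day, tweet_count) pairs -> {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    return {hour: (mean(counts), pstdev(counts)) for hour, counts in by_hour.items()}


def is_anomalous(baseline, hour, count, k=3.0):
    """Real-time step: flag a count far above the baseline for that hour."""
    avg, spread = baseline[hour]
    return count > avg + k * spread  # assumed rule, not taken from the slides


# Made-up history: tweet counts observed during the 8:00 and 9:00 hours.
history = [(8, 10), (8, 12), (8, 14), (9, 5), (9, 7)]
baseline = hourly_baseline(history)
print(is_anomalous(baseline, 8, 40), is_anomalous(baseline, 8, 13))
```

Keeping a separate baseline per hour prevents the normal rush-hour burst of tweets from being flagged as an event.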
    18. Baseline Calculation
        Figure label: Baseline
    19. Future Work
        • Detect direction of train in image processing
        • Use natural language processing on Twitter data for the event detector
        • Continue evaluation of analytical frameworks for model computation
        • Add observation posts
        • Release Caltrain Rider Application
    20. COMING SOON: CALTRAIN RIDER APP
        • Find out what train to catch using our ‘Ride Now’ view
        • Select a train and see when it should reach each stop in a trip detail view
        • For more info: www.svds.com/trains
    21. Questions
        Yes, We’re Hiring: www.svds.com/join-us
    22. THANK YOU
        John @BigDataAnalysis, Tatsiana @notrockstar