END-TO-END
MACHINE LEARNING PIPELINE
Single machine, small data, ML hero
More Data + Bigger Models + More Computation
= Better Results in Machine Learning
https://blog.openai.com/ai-and-compute
Single machine, big data, ML hero
Cluster, big data, ML hero
Single machine        Data center
1 user                Many users
Megabyte of data      Petabyte of data
Local filesystem      Distributed filesystem
Exclusive use         Resource sharing, scheduling, queueing, resource isolation
Scale up              Scale out
pip install ...       Automating deployment
-                     Operations, monitoring, ...
Development cycle for autonomous vehicles
1 Collect sensor data
2 Model engineering
3 Autonomous driving
Diagram labels: Data Logger, Big Data, Data Center, Trained Model, Control Unit
Sensors: Udacity Lincoln MKZ
Camera     3x Blackfly GigE Camera, 20 Hz
Lidar      Velodyne HDL-32E, 9.5 Hz
IMU        Xsens, 400 Hz
GPS        2x fixed, 1 Hz
CAN bus    1.1 kHz
Robot Operating System
Data       3 GB per minute
https://github.com/udacity/self-driving-car
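As a rough worked example of what 3 GB per minute means for storage, the sketch below scales that rate to longer recording sessions. The 8-hour drive and the fleet of 10 cars are illustrative assumptions, not figures from the slides, and decimal units (1 TB = 1000 GB) are assumed.

```python
GB_PER_MINUTE = 3  # recording rate stated on the slide


def recorded_gb(minutes: float, cars: int = 1) -> float:
    """Return the data volume in GB for `cars` vehicles recording for `minutes`."""
    return GB_PER_MINUTE * minutes * cars


per_hour_gb = recorded_gb(60)                        # one car, one hour
per_day_tb = recorded_gb(8 * 60) / 1000              # one car, assumed 8-hour drive
fleet_day_tb = recorded_gb(8 * 60, cars=10) / 1000   # assumed fleet of 10 cars

print(f"{per_hour_gb:.0f} GB per hour")
print(f"{per_day_tb:.2f} TB per 8-hour drive")
print(f"{fleet_day_tb:.1f} TB per day for 10 cars")
```

Even under these modest assumptions a single car fills more than a terabyte per drive, which is why the data moves from a local filesystem to a distributed one.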
Lidar Velodyne HDL-32E
ROS bag data structure
https://github.com/valtech/ros_hadoop
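The record layout can be sketched in a few lines of Python. This is a minimal sketch assuming the ROS bag format version 2.0: a magic line, then records made of a length-prefixed header of `name=value` fields followed by length-prefixed data, with all lengths as little-endian 32-bit integers. The helper names (`parse_header`, `read_records`, `make_record`) are hypothetical, and the buffer is synthetic rather than a complete bag file. Real bags also nest message records inside (optionally compressed) chunk records, which this sketch does not unpack.

```python
import struct

MAGIC = b"#ROSBAG V2.0\n"
OP_NAMES = {0x02: "message data", 0x03: "bag header", 0x04: "index data",
            0x05: "chunk", 0x06: "chunk info", 0x07: "connection"}


def parse_header(raw: bytes) -> dict:
    """Split a record header into its name=value fields (values stay as bytes)."""
    fields, pos = {}, 0
    while pos < len(raw):
        (field_len,) = struct.unpack_from("<I", raw, pos)
        pos += 4
        # Split on the first '=' only, because binary values may contain '='.
        name, _, value = raw[pos:pos + field_len].partition(b"=")
        fields[name.decode("ascii")] = value
        pos += field_len
    return fields


def read_records(buf: bytes):
    """Yield (header_fields, data) for every top-level record after the magic line."""
    if not buf.startswith(MAGIC):
        raise ValueError("not a ROS bag 2.0 buffer")
    pos = len(MAGIC)
    while pos < len(buf):
        (header_len,) = struct.unpack_from("<I", buf, pos)
        header = parse_header(buf[pos + 4:pos + 4 + header_len])
        pos += 4 + header_len
        (data_len,) = struct.unpack_from("<I", buf, pos)
        yield header, buf[pos + 4:pos + 4 + data_len]
        pos += 4 + data_len


def make_record(fields: dict, data: bytes) -> bytes:
    """Build one record; used here only to create a synthetic example."""
    header = b"".join(
        struct.pack("<I", len(k) + 1 + len(v)) + k.encode("ascii") + b"=" + v
        for k, v in fields.items())
    return struct.pack("<I", len(header)) + header + struct.pack("<I", len(data)) + data


# Synthetic two-record buffer (not a complete, valid bag file).
example = (MAGIC
           + make_record({"op": b"\x07", "topic": b"/imu"}, b"")
           + make_record({"op": b"\x02", "conn": struct.pack("<I", 0)}, b"payload"))

for header, data in read_records(example):
    print(OP_NAMES[header["op"][0]], len(data))
```

Because every record carries its own lengths, a reader can skip from record to record without decoding the payloads, which is what makes splitting a large bag across workers practical.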
Robot Operating System
+ Popular open source robotics framework
+ Reliable distributed architecture
+ Wide use in the robotics research community
+ Huge selection of “off-the-shelf” software packages for hardware/algorithms/etc.
+ Used by Bosch, BMW, KUKA, Google, Siemens, etc.
https://roscon.ros.org/2015/presentations/ROSCon-Automated-Driving.pdf
Carla
Pipeline stages: Ingest data, Data Preprocessing, Feature Engineering, Model Training, Model Validation, Simulation, Model Deployment
Data sets: Training data, Test data
Outputs: Reports, Results
Loops: Train Test Loop, Model Feedback Loop
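The train/test loop at the center of this pipeline can be sketched in plain Python. This is a minimal sketch under stated assumptions: the data is synthetic, the 80/20 split, learning rate, and epoch count are arbitrary choices, and a one-variable linear model stands in for whatever model is actually trained.

```python
import random

random.seed(0)

# Synthetic stand-in for ingested, preprocessed data: y = 3x + 1 plus noise.
xs = [random.uniform(-1, 1) for _ in range(200)]
samples = [(x, 3.0 * x + 1.0 + random.gauss(0, 0.1)) for x in xs]
random.shuffle(samples)
train, test = samples[:160], samples[160:]  # assumed 80/20 split


def mse(w, b, data):
    """Mean squared error of the linear model y = w * x + b on `data`."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_epoch(w, b, data, lr=0.1):
    """One step of full-batch gradient descent on the training data."""
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - lr * grad_w, b - lr * grad_b


w, b = 0.0, 0.0
history = []
for epoch in range(200):                 # train/test loop
    w, b = train_epoch(w, b, train)      # model training on training data
    history.append(mse(w, b, test))      # model validation on held-out test data

print(f"w={w:.2f} b={b:.2f} test MSE={history[-1]:.4f}")
```

The test data never enters `train_epoch`, so the recorded error reflects how the model behaves on data it has not seen.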
Train and evaluate machine learning models at scale
Single machine Data center
How to run more experiments faster and in parallel?
How to share and reproduce research?
How to go from research to real products?
Distributed Machine Learning
Axes: Data Size, Model Size (from single machine to data center)
Model parallelism: training very large models
Data parallelism: speeds up the training
Exploring several model architectures, hyperparameter optimization, training several independent models
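Data parallelism can be illustrated with a small sketch: the data is split into shards, each worker computes a gradient on its own shard, and the averaged gradient updates one shared model. Everything here is an assumption for illustration: the toy data, the one-parameter model, the learning rate, and the use of threads as stand-ins for worker machines (Python threads do not actually speed up this computation).

```python
from concurrent.futures import ThreadPoolExecutor


def gradient(w, shard):
    """Mean-squared-error gradient of the model y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr=0.01):
    """Each worker computes a gradient on its shard; the results are averaged."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        grads = list(pool.map(lambda shard: gradient(w, shard), shards))
    return w - lr * sum(grads) / len(grads)


data = [(float(x), 2.0 * x) for x in range(1, 9)]   # toy data, true w = 2
shards = [data[i::4] for i in range(4)]             # 4 equal-sized shards

w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards)
print(round(w, 3))
```

With equal-sized shards the average of the shard gradients equals the gradient over the full data set, so the sharded update matches a single-machine update while each worker only touches a fraction of the data.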
Compute Workload for Training and Evaluation
Axes: I/O workload, Compute workload (single machine vs. data center)
I/O Workload for Simulation and Testing
Axes: I/O workload, Compute workload (single machine vs. data center)
Flux – Open Machine Learning Stack
Layers: Training & Test data; Compute + Network + Storage; ML Development & Catalog & REST API; Deploy model
Users: ML heroes
Workloads: Feature Engineering, Training, Evaluation, Re-Simulation, Testing
Components shown: CaffeOnSpark
Catalog labels: Sample, Model, Prediction, Batch, Regression, Cluster, Dataset, Correlation, Centroid, Anomaly, Test, Scores
+ Mainly open source
+ No vendor lock-in
+ Scale-out architecture
+ Multi-user support
+ Resource management
+ Job scheduling
+ Speed-up training
+ Speed-up simulation
https://github.com/flux-project/flux
Feature Engineering
+ Hadoop InputFormat and Record Reader for Rosbag
+ Process Rosbag with Spark, Yarn, MapReduce, Hadoop Streaming API, …
+ Spark RDDs are cached and optimized for analysis
Diagram labels: ROS bag, Record Reader, ROS msg, RDD, DataFrame, DataSet, SQL, Spark APIs, NumPy, Advanced Analytics, Processing Engine, Computer, Network, Storage
Hadoop InputFormat for ROS bags
https://github.com/valtech/ros_hadoop
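The kind of analysis this enables can be sketched without a cluster. The example below is plain Python, not Spark or ros_hadoop code: hypothetical (topic, timestamp) tuples stand in for the records a record reader would emit, and a map step followed by a reduce-by-key step counts messages per topic and estimates a message rate. In Spark the same two steps would run as RDD transformations across partitions.

```python
# Hypothetical messages as (topic, timestamp in seconds); the values are made up.
messages = [("/imu", 0.0000), ("/imu", 0.0025), ("/camera", 0.00),
            ("/imu", 0.0050), ("/camera", 0.05), ("/gps", 0.0)]


def map_phase(records):
    """Emit (key, value) pairs: topic -> (count, first_ts, last_ts)."""
    for topic, ts in records:
        yield topic, (1, ts, ts)


def reduce_by_key(pairs):
    """Merge the values of each key, similar in spirit to a reduceByKey step."""
    merged = {}
    for key, (n, lo, hi) in pairs:
        if key in merged:
            m, mlo, mhi = merged[key]
            merged[key] = (m + n, min(mlo, lo), max(mhi, hi))
        else:
            merged[key] = (n, lo, hi)
    return merged


stats = reduce_by_key(map_phase(messages))
for topic, (n, lo, hi) in sorted(stats.items()):
    # A rate needs at least two distinct timestamps; otherwise report NaN.
    rate = (n - 1) / (hi - lo) if hi > lo else float("nan")
    print(f"{topic}: {n} messages, ~{rate:.0f} Hz")
```

The merge function is associative, so partial results computed on separate partitions can be combined in any order, which is the property a distributed reduce relies on.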
Training & Evaluation
+ TensorFlow ROSRecordDataset
+ Protocol Buffers to serialize records
+ Saves time because data conversion is not needed
+ Saves storage because data duplication is not needed
Diagram labels: ROS bag, ROS Dataset, ROS msg, Training Engine, Machine Learning, Computer, Network, Storage
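The benefit of reading records directly, rather than first converting a bag into an intermediate training format, can be illustrated with a lazy iterator. The sketch below is plain Python and does not use the TensorFlow or ROSRecordDataset API; `read_messages` is a hypothetical stand-in for a reader that yields decoded messages one at a time.

```python
from itertools import islice


def read_messages(n):
    """Hypothetical reader: lazily yields (features, label) pairs."""
    for i in range(n):
        yield [float(i), float(i) * 0.5], i % 2


def batches(records, batch_size):
    """Group a record stream into lists of at most `batch_size` items."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


for batch in batches(read_messages(10), batch_size=4):
    features = [f for f, _ in batch]
    labels = [y for _, y in batch]
    print(len(features), labels)
```

Only one batch is held in memory at a time and no converted copy of the data is written, which mirrors the time and storage savings listed on the slide.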
Re-Simulation & Testing
+ Use Spark for preprocessing, transformation, cleansing, aggregation, and time window selection before publishing to ROS topics
+ Use the Re-Simulation framework of choice to subscribe to the ROS topics
Diagram labels: ROS bag, Engine, publish, ROS topic, core, subscribe, Re-Simulation with framework of choice, Computer, Network, Storage
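The publish/subscribe flow can be sketched as follows. This is a minimal in-memory illustration, not the ROS or Spark API: the `Core` class, the `select_window` helper, and the recorded messages are all hypothetical. It shows a time window being selected from recorded data and replayed to a subscriber that represents the re-simulation framework.

```python
from collections import defaultdict


class Core:
    """Tiny in-memory stand-in for a publish/subscribe core (not the ROS API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


def select_window(messages, start, end):
    """Keep messages with start <= timestamp < end, ordered by time."""
    return sorted((m for m in messages if start <= m[1] < end), key=lambda m: m[1])


# Hypothetical recorded messages: (topic, timestamp in seconds, payload).
recorded = [("/camera", 0.00, "frame0"), ("/imu", 0.01, "imu0"),
            ("/camera", 0.05, "frame1"), ("/imu", 0.11, "imu1")]

core = Core()
received = []
core.subscribe("/camera", received.append)   # the re-simulation side

for topic, ts, payload in select_window(recorded, 0.0, 0.1):
    core.publish(topic, payload)             # the replay side

print(received)
```

The publisher does not need to know which framework consumes the data; any consumer that subscribes to the topic receives the replayed window.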
Flux
Open Machine Learning Stack
Apache License 2.0
http://flux-project.org
