Presentation by Tim Park at ScaleConf 2015 on how to ingest and process data in the Internet of Things and turn it into context and insights, plus a discussion of how we deploy the system using Docker, CoreOS, and Deis.
12. Avalanche!
122,000 cars in their fleet.
35 interesting data points a second during operation
45 minutes a day per car across their fleet.
94,500 data points per car per day.
11.58 billion data points per day (133,984 per second).
4.2 trillion data points per year.
Make sure the telemetry you collect has business impact.
Even then, there is still a ton of data coming your way.
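The avalanche numbers above follow directly from the fleet figures; here is a sketch of the arithmetic (using the 122,500-car fleet figure from the speaker notes, which matches the per-second total):

```javascript
// The arithmetic behind the "avalanche" numbers.
const fleet = 122500;                 // cars in the fleet
const pointsPerSecond = 35;           // interesting data points/sec during operation
const drivingSecondsPerDay = 45 * 60; // 45 minutes of driving a day per car

const perCarPerDay = pointsPerSecond * drivingSecondsPerDay;   // 94,500
const perFleetPerDay = perCarPerDay * fleet;                   // ~11.58 billion
const perSecond = Math.round(perFleetPerDay / 86400);          // ~133,984
const perYear = perFleetPerDay * 365;                          // ~4.2 trillion
console.log(perCarPerDay, perFleetPerDay, perSecond, perYear);
```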
25. Logical Deployment View
Diagram: a Frontdoor Service (API surface consolidation) fronting three backing services: the Ingestion Service, the Device Registry Service, and the Consumption Service (REST, auth, query, provisioning). Devices and applications connect over WebSockets, REST, and MQTT. Each service is its own Node.js service boundary.
26. Deployment Sizing
We send up roughly 2 messages a second during operation of the vehicle.
37% peak concurrent usage of car fleet is typical.
Design goal of monitoring car fleet of 30,000 vehicles.
Deployment designed to be capable of 60% usage of fleet.
Need to be able to support 60k messages/sec load from car fleet.
Each ingestion instance can handle roughly 1000 messages/second
60 ingestion instances
20 frontdoor instances
10 consumption instances
5 device registry instances
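The ingestion instance count above follows from the stated load target; a minimal sketch, treating the slide's 60k messages/sec as the design load:

```javascript
// Ingestion tier sizing from the slide's stated figures.
const targetMessagesPerSec = 60000; // design load from the car fleet
const perInstanceCapacity = 1000;   // messages/sec one ingestion instance handles
const ingestionInstances = Math.ceil(targetMessagesPerSec / perInstanceCapacity);
console.log(ingestionInstances); // 60
```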
28. Diagram: VMs vs. Containers. In the VM stack, each app (App A, App A’, App B) sits on its own bins/libs and Guest OS, on top of a Hypervisor, the Host OS, and the Server Hardware. In the container stack, the apps and their bins/libs run under Docker directly on the Host OS and Server Hardware, with no per-app Guest OS.
30. CoreOS
Stripped down Linux distro
Optimized to run containers
Autoupdating
Systemd / Fleetctl
etcd
“Warehouse scale computing”
Thinking in cores and containers instead of VMs
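Fleet schedules containers cluster-wide using systemd-style unit files. A minimal sketch of what a unit for one of these services might look like (service name, port, and image are assumptions, not the talk's actual units):

```ini
[Unit]
Description=Telemetry ingestion container
After=docker.service
Requires=docker.service

[Service]
# Clean up any stale container, then run the service under Docker.
ExecStartPre=-/usr/bin/docker kill ingestion
ExecStartPre=-/usr/bin/docker rm ingestion
ExecStart=/usr/bin/docker run --name ingestion -p 3000:3000 example/ingestion
ExecStop=/usr/bin/docker stop ingestion

[X-Fleet]
# Never schedule two ingestion units on the same host.
Conflicts=ingestion@*.service
```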
34. Deis Deployment Internals
Diagram: a cluster of CoreOS hosts, each running a mix of Ingestion (I), Frontdoor (F), Registry (R), and Consumption (C) containers, fronted by the Deis Request Router.
35. Summary
Context is the next user interface
Collecting context is very high scale by its nature
Need to use new approaches to store this data.
Need to use new approaches to process this data.
Need to use new approaches to learn from this data.
Today I’m going to talk about a transition we are making in the industry
For decades, this has been our conception of a computer
A screen
an input device
Computers have morphed products that we knew and loved like the phone…
… into computers
… with a screen
… and a touchscreen
There are a lot of directions the industry is heading in an attempt to take this touchscreen-based computing approach deeper
Watches for example
This is the Microsoft Band
But I believe the next big thing is computing coming to the ordinary things in our life and making them better.
The car is one of these ordinary things.
While it has a screen prominently displayed in each car
In the car, the Screen is really an auxiliary user interface.
Primary user interface: steering wheel, the gas pedal, and the brakes
Inflection away from screens and to a more contextual user interface
Working with a number of car manufacturers at Microsoft on scenarios like predictive maintenance, route prediction, and risk scoring.
Going to talk about the infrastructure we are building to collect and process this context and be able to do something like route prediction.
The screen is really a luxury user interface in the car. The primary user interface is the steering wheel, the gas pedal, and the brakes.
The inflection we are seeing is that in this model, applications do not collect input via a touchscreen; instead they collect and process context about what is happening in the real world, like our use of the steering wheel and brakes, as the inputs to applications that produce insights and value for drivers.
We’ve been working with a number of car manufacturers at Microsoft to use this interface to connect cars such that they can provide services to their drivers like drive session scoring, predictive maintenance, and fleet management.
In this talk I’m going to walk through what the architecture looks like for doing X. While I am going to talk about this in terms of the Internet of Things, this pattern is broadly applicable to scenarios in which you need to ingest and process large-scale telemetry data. I’ll also use Microsoft Azure infrastructure to discuss this, but again, there are similar infrastructure services at other cloud providers.
So let’s get started with looking at what a car really is, which is a rolling distributed computer.
It has dozens of computer systems onboard that communicate with each other over an internal bus called the CAN bus.
CAN bus has been in every car since 1987
Wanted to use same diagnostic tool in car repair shops
Same port they “scan” when there is a check engine light
The great news is that car manufacturers have included a port to this CAN bus, with a standard hardware interface, in nearly every car since 1987.
This was driven by the desire to be able to use the same diagnostic tool by car repair shops.
This is the same port that they use the diagnostic tool on when the check engine light turns on in your car.
This is a picture of an OBD-II adapter that we custom built to collect this CAN bus data and relay it to the cloud over a mobile network.
We relay this telemetry using messaging.
You can think about this like Twitter for Devices.
Each message has a type, timestamp, and a body of data.
I’m only showing location data here but there is a wide range of engine data, car occupancy, and other data that we can relay as well.
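A message in this model might look like the following (a hypothetical shape for illustration; the field names are assumptions, not the actual wire format):

```javascript
// Hypothetical telemetry message: a type, a timestamp, and a body of data.
const message = {
  type: "location",
  timestamp: "2015-03-27T09:14:02Z",
  body: { latitude: -33.92, longitude: 18.42, headingDeg: 74 }
};

// A minimal shape check an ingestion endpoint might perform.
function isValidMessage(msg) {
  return typeof msg.type === "string" &&
         !Number.isNaN(Date.parse(msg.timestamp)) &&
         typeof msg.body === "object" && msg.body !== null;
}
console.log(isValidMessage(message)); // true
```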
Car relays over 350 types of data 60 times a second
Need to think through the data you collect
A company like Avis has over 122,000 cars in their fleet
In fact, the bus in the car relays over 350 types of data to us over 60 times a second.
However, one of the things you need to balance when designing any telemetry system with a large fleet is the usefulness of any piece of telemetry data for the business outcomes you are hoping to achieve.
If we think about a rental fleet like the one a company such as Avis operates, with over 122,500 cars…
This is the high level architecture we are using to collect and process this data from clients.
Have a set of protocol adapters to land telemetry from clients
The incoming requests are auth / authz with the help of Device Registry
The data is then landed in a set of storage systems
And then fed into data pipeline where we transform it and learn from it.
All this telemetry hits an architecture that looks like the following
Depending on the client, we have a set of endpoints that we call Protocol Adapters that land the telemetry from the clients.
We authenticate and authorize the client on the connection using a separate system that we call the Device Registry, which provisions and maintains identity for all of these connecting clients.
If the client is authorized, we then land this telemetry in a set of backend storage systems that I’ll discuss more later. The architecture we’ve built is flexible enough to enable us to plug in a set of these storage providers.
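In outline, a protocol adapter's handling of an incoming message might look like this sketch (the registry check and storage call are stand-ins for illustration, not the actual service code):

```javascript
// Stand-ins for the Device Registry and a pluggable backing store.
const deviceRegistry = {
  isAuthorized: token => token === "valid-device-token"
};
const storage = { rows: [], land: msg => storage.rows.push(msg) };

// Land telemetry only for clients the Device Registry authorizes.
function handleIngest(token, message) {
  if (!deviceRegistry.isAuthorized(token)) {
    return { status: 401 };   // unknown or unauthorized device
  }
  storage.land(message);      // hand off to the storage provider
  return { status: 201 };
}
```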
One of the scenarios we are working on is route prediction
Given the start of a route -> where does it end
Lots of value in being able to help the user with traffic or predict where a car will end up
Here is one visualization of that
One of the scenarios we are working on is being able to predict the endpoint of a driving session based on its starting point and the first few turns of the driving session.
This has a lot of value in terms of business scenarios like hiring cars by the hour.
This is a visualization of that data. The dots are the centroids of all of the driving sessions. When you click on them, you can see each individual driving session.
All of the telemetry that we ingested in the previous slide eventually lands in Azure’s Table Storage.
Table Storage in Azure is a lightly structured store that essentially has a single index consisting of a partition key and a row key.
It is dirt cheap and highly scalable.
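One common key scheme for telemetry in a partition-keyed store like this (an assumption for illustration, not the talk's actual schema) partitions by device and sorts rows reverse-chronologically:

```javascript
// Partition by device so each device's telemetry lands in one partition;
// use a reverse-chronological row key so the newest rows sort first.
const MAX_TICKS = 9999999999999; // wide enough for a millisecond epoch value

function tableKeys(deviceId, timestampMs) {
  return {
    partitionKey: deviceId,
    rowKey: String(MAX_TICKS - timestampMs).padStart(13, "0")
  };
}

const newer = tableKeys("car-42", Date.parse("2015-03-27T10:00:00Z"));
const older = tableKeys("car-42", Date.parse("2015-03-27T09:00:00Z"));
// Lexicographically, the newer row key sorts before the older one.
console.log(newer.rowKey < older.rowKey); // true
```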
Sometimes you can use the raw telemetry directly in applications but often, and in our case, we have two separate data pipelines that operate on this raw telemetry.
The first is HDInsight, which is our hosted distribution of Hadoop.
We largely use this to clean up, summarize, or build derivative data using the bulk raw telemetry that we collect with the platform which we push back into Table Storage.
Hadoop operates on a batch model which is efficient from an operational perspective but also has latency that is too high for some applications.
Because of this, we are also using Apache Storm to process some of the data in real time.
The second half of our data pipeline is classifying and predicting outcomes based on this data.
We are using Azure ML, which is our machine learning service in Azure that enables you to train models that can predict or classify data.
We have a processor for drive data that derives certain risk factors from the raw telemetry data and feeds that into Azure ML for scoring.
That classification is then pushed back around to Table Storage for storage again.
This classification data and the summarized driving session data is then exposed through an API to the browser-based applications we’ve built to visualize all this data, like we saw a couple of slides ago.
I thought I would talk briefly about the streaming data pipeline in more detail since the streaming model embodied by Apache Storm is not as well known as the batch style embodied by Hadoop.
In a streaming architecture, we are processing events in real time and typically emitting more events.
In the connected car case here, we are taking a raw location stream and producing both a cleansed location stream and a stream of identified driving sessions.
Location data is inevitably pretty dirty and you will almost always have to cleanse it to remove outliers that are almost certainly GPS error like the one highlighted in red above.
Apache Storm is a very powerful system for doing realtime processing on data streams.
It came out of processing very high scale streams of tweets and click data at Twitter.
Apache Storm works off the concept of spouts and bolts.
A Spout is a source of data
And a bolt is something that processes that data.
In the case we described on the previous slide, we have two spouts: one that emits the location stream from the car and one that emits car data related to seat occupancy, which many car models can report.
We also have two bolts.
The first does filtering on the data stream to remove outliers and clean it up.
We have set up our Storm topology such that this stream is both emitted and also connected to a second bolt that does session identification.
This bolt uses seat occupancy as a second source of data along with the location stream to better split driving into sessions, which is emitted.
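The filtering that first bolt performs might look like the following plain-JavaScript sketch (the logic only, not the actual Storm bolt API; the 200 km/h threshold is an assumption):

```javascript
// Great-circle distance between two {lat, lon} fixes, in km.
function haversineKm(a, b) {
  const R = 6371, rad = d => d * Math.PI / 180;
  const dLat = rad(b.lat - a.lat), dLon = rad(b.lon - a.lon);
  const h = Math.pow(Math.sin(dLat / 2), 2) +
            Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) *
            Math.pow(Math.sin(dLon / 2), 2);
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Drop any fix implying an implausible speed from the last accepted fix;
// such jumps are almost certainly GPS error.
function filterOutliers(fixes, maxKph = 200) {
  const out = [];
  for (const fix of fixes) {
    const prev = out[out.length - 1];
    if (!prev) { out.push(fix); continue; }
    const hours = (fix.t - prev.t) / 3.6e6; // ms → hours
    if (hours > 0 && haversineKm(prev, fix) / hours <= maxKph) out.push(fix);
  }
  return out;
}
```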
The final part of our data pipeline is the machine learning that we are doing on the data that we are collecting.
I’m going to talk through how we are building driving risk profiles for users as part of this talk.
We’re using supervised learning for our algorithm, which means that we have a set of labeled training data for driving that we have classified as good or bad.
We split this data into training data and test data, train the model using an algorithm, and then score it based on the test data.
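The split-train-score loop can be sketched in a few lines (a toy illustration, not Azure ML's actual mechanics; the shuffle and model here are placeholders):

```javascript
// Deterministically shuffle rows (seeded LCG), then split into train/test.
function trainTestSplit(rows, testFraction = 0.2, seed = 42) {
  let s = seed;
  const rand = () => (s = (s * 1103515245 + 12345) % 2147483648) / 2147483648;
  const shuffled = rows.slice().sort(() => rand() - 0.5);
  const cut = Math.floor(rows.length * (1 - testFraction));
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}

// Score a trained model (a label-predicting function) against held-out data.
function accuracy(model, test) {
  return test.filter(r => model(r.features) === r.label).length / test.length;
}
```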
We used Azure’s machine learning to do that.
This is the Azure ML studio that enables you to connect to your training data, split it into training and validation data sets, and specify the model you are going to use.
We’ve trained a model that can fairly accurately classify segments of a drive according to their risk.
Car companies are interested in integrating this into cars to both coach users to better driving and as a fleet management function for professional drivers.
So that is the architecture we are using to ingest and process this data.
I thought I would next talk about how we are deploying the services themselves.
We have broken the service down into three smaller services: Ingestion, Registry, and Consumption.
These operate as their own independent Node.js services but are aggregated into one API surface by a fourth service we call the Frontdoor.
This Frontdoor also provides common services across the backing services, like checking authentication on incoming requests after the initial authentication against the Device Registry and providing an endpoint map to clients.
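The consolidation role can be sketched like this (path prefixes and service names are assumptions for illustration):

```javascript
// Map each incoming path prefix to the backing service that owns it.
const endpointMap = {
  "/ingest": "ingestion",
  "/devices": "device-registry",
  "/data": "consumption"
};

function routeRequest(path) {
  const prefix = Object.keys(endpointMap).find(p => path.startsWith(p));
  return prefix ? endpointMap[prefix] : null; // null → frontdoor returns 404
}
```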
Talk through sizing
So overall, we end up with an infrastructure of roughly 100 servers
We could have taken the traditional route of using Chef or Puppet to configure these servers.
But instead we’ve made the decision to use Docker extensively for deploying our API services.
For those of you that haven’t run across Docker before, it essentially enables you to capture the essential aspects of an application and run it in a container.
Containers differ from virtual machines in that they only capture the application
They do not bake in the OS but instead this OS is shared across the containers.
This makes these containers millisecond fast to start up and efficient but this sharing of the Host OS reduces the isolation and security of each of these containers.
Part of the genius of Docker is the declarative nature of specifying what is part of the container.
This is the Dockerfile for one of the NGINX containers we use.
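The Dockerfile itself was shown as an image; a minimal reconstruction of what an NGINX Dockerfile like it might contain (base image tag and paths are assumptions):

```dockerfile
# Hypothetical reconstruction, not the actual Dockerfile from the talk.
FROM nginx
# Bake our configuration and static assets into the image.
COPY nginx.conf /etc/nginx/nginx.conf
COPY public /usr/share/nginx/html
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
```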
Starting a single or even a small set of Docker containers is easy, but it quickly becomes an operational nightmare when you scale up to larger systems.
CoreOS provides a Linux distro that is optimized for running containers
It is a stripped down linux distro with no package manager. The package manager is Docker which comes preinstalled.
It also autoupdates itself: CoreOS keeps itself up to date using the same update technology as Chromium.
CoreOS also builds on systemd, the init system used across Linux, to provide a cluster-level scheduling system called fleet, driven with fleetctl.
It also provides a distributed configuration system called etcd that enables you to centralize configuration information.
And so CoreOS provides the backbone for running large scale deployments
But you still face the complexity of building your own containers for deployments
And routing incoming requests to the right containers running your application
This distributed config system acts as the backbone for deployments based on CoreOS because you can share config amongst the containers running on the set of hosts and use it as a way to communicate state between the containers.
This provides some of the infrastructure that you need to do larger scale Docker based deployments.
But you still face complexities around building the necessary containers for your applications and routing requests into the right containers for a given request externally.
So we are using another open source project called Deis as well
Deis builds upon Docker and CoreOS to provide a more friendly workflow for developers.
Deis provides a ‘git push’ style workflow for your application + configuration management
You ‘git push’ your application to Deis and Deis uses what’s called a Buildpack to sense what sort of application you are pushing.
Deis then uses this Buildpack to build a Docker container with your application, publishes it into a private Docker Registry, and then places that container on a number of hosts in your CoreOS cluster.
The configuration for each of these applications is applied to the environment within the container and the container is started.
The Deis Router then handles routing incoming requests for a particular application to the containers in the cluster that are executing it.
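The day-to-day workflow looks roughly like this (the app name and config key are made up; these are Deis v1-era commands and require a running Deis cluster):

```shell
deis create ingestion                     # register a new app with Deis
git push deis master                      # buildpack detects the app, builds a container
deis config:set STORAGE_ACCOUNT=example   # applied to the container environment
deis scale web=60                         # spread 60 containers across the cluster
```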
This diagram shows how a physical deployment using Deis works in more detail.
Deis uses the qualified name of each of those services to route requests internally to the right application.
In our connected car API scenario, we have four different backing services behind our API.
All requests are fielded by the Deis request router, where SSL, if you’ve configured it, is terminated.
The request router knows how to connect requests to containers that are running them on the CoreOS hosts in the cluster.
These containers are spread across the hosts by Deis automatically to provide load balancing and fault tolerance.
Although the Deis project is still young, we have been super happy with the workflow it provides and how easy it makes it to deploy and scale our API services.
We are moving to a future where many of the user interfaces for the things in our lives are provided via the context they throw off.
Collecting all of this context is very high scale by its very nature.
We need to use new approaches, like utilizing your cloud provider’s high scale storage, to store this data effectively and efficiently.
We need to use new approaches, like Apache Storm and Hadoop, to process this data.
And we need to use new approaches, like Azure Machine Learning or your cloud provider’s equivalent, to learn and classify this data.
Finally, Microsoft has open sourced the common pieces of the infrastructure that I described today as a framework called Nitrogen.
If you tackle a project in the Internet of Things, we’d love to have you use and contribute to the project.
You can find out more about it at nitrogen.io