Good morning, I’m here to talk about Data at Scale and the problems associated with receiving, processing and presenting data at scale in a connected world; in particular I’m going to use a case study of Electric Vehicle telematics from my own experiences of an extremely challenging, data intensive project. I’m going to be talking about the problem of data at scale both in terms of server resources and in terms of application design – because you need to be able to push data into your solution quickly, but you also need to process and export it quickly too.
I’m Michael Peacock, the web systems developer on the telemetry team for Smith Electric Vehicles. We are a core team of three: myself, the only web developer on the project; a systems administrator who looks after our server infrastructure (and it’s much of his work that I’ll be taking credit for today!); and a project manager. A very small team for a large amount of data – I’ll tell you exactly how much soon.
Smith are the world’s largest manufacturer of all-electric commercial vehicles. The company was founded over 90 years ago to build electric delivery vehicles, both battery based and cable based. In 2009 the company opened its doors in the US, and at the start of last year the US operation bought out the European company, which brings us to where we are today.
When most people think of electric vehicles they tend to think of either hybrid vehicles or the likes of the Nissan Leaf or the Chevy Volt. When it comes to commercial electric vehicles, they think of the electric buggies in airports or, for any British members of the audience today, milk floats. However, we develop a different type of commercial vehicle:
Large, fully electric, commercial delivery vehicles. Ranging from flatbed vehicles with military applications...
Through to home delivery, depot delivery, utilities and school buses.
These are 16 and a half thousand to 26 thousand pound delivery vehicles, capable of supporting up to a 16 thousand pound payload, with a top speed of 80km/h.
As I’m sure you can appreciate, Electric Vehicles are a relatively new, and continually evolving technology. As a result the technologies are constantly being evaluated and improved.
We use EV data to look at the performance analysis and metrics of the vehicle, to see how far, how fast and how efficiently the vehicles travelled; to prove the technology through research; to ensure that the driver training which helps drivers move from diesel vehicles to electric vehicles has been successful; to diagnose issues and help with service intervals and warranty issues; and of course to help continually improve the vehicles. The only way to prove the technology is to look at a large sample of data. The only way to look at specific routes, vehicles, performance, driver and service issues is to ensure the data is available on every vehicle. This means capturing a lot of data on the vehicles. It is not only a matter of capturing the data: we must also store it effectively, process it, display it and export it.
We need to display real time information quickly to our users, so they can look and see in real time how a vehicle is functioning, where it is, where it is going and if it has raised any fault codes.
Our users need to be able to look at detailed vehicle performance data over time, to see how much use a customer is getting from a vehicle or a fleet of vehicles, how well the vehicles are being driven, and pull out various other performance and driver style metrics.
We currently have around 500 vehicles in service with telemetry installed and active. Each telemetry enabled vehicle collects between two and two and a half thousand data points on a per second basis while in drive mode, or on a per minute basis in charge mode. Telemetry is now a standard function of our vehicles, so going forward every new Smith vehicle will have it – which means we have a lot more data to process. As a result, our MySQL solution currently processes 1.5 billion inserts on a daily basis, with a constant minimum of 4000 inserts per second.
As with many vehicles, we make use of a Controller Area Network to allow vehicle components to communicate with one another. The purpose behind a CANBus is to allow various nodes to communicate with one another without the need for a central host. Each component packages up a message which describes various aspects of its operation and sends these down the bus for the rest of the vehicle to pick up on. Because there isn’t a central host keeping an eye on things, the components don’t know who wants to know what when, so they constantly broadcast their information as messages onto the bus, hundreds of times per second.
Obviously in a vehicle application, broadcasting and acting on data hundreds of times per second is essential. A driver wouldn’t want a delay when they apply the brake. From a monitoring and telematics perspective, this is somewhat excessive. As such, we only sample the CANBus once per second. Not all of the buses on the vehicle contain information which is relevant to performance, component analysis and diagnostics, and so they can be discounted.
Mention about module level data on battery pods
The problem that I am talking to you about today arises from the fact that we are becoming a more connected world. More and more devices are capturing data in larger and larger quantities, at greater frequencies. As passenger electric vehicles become more popular, a larger and larger charging infrastructure develops with billing and usage data being collected; utilities companies collect data to monitor and evaluate their infrastructure, deal with issues, monitor electricity generation and route water supplies from reservoirs based on demand. Some homes now have smart meters which report billing information directly to the supplier, or which monitor and track energy consumption so that home owners can save energy by turning appliances off. The connected world already gives us huge data problems. We are sure to get lots more over the years.
When the project was first conceived, it had a single aim: to capture a small set of vehicle performance metrics for a small number of vehicles. Consequently, the initial system design simply had the vehicles connecting directly to a single server for the information to be processed. This, however, caused two problems for us.
The first of these issues was our availability. If our systems were down, vehicles couldn’t connect to us and deliver their data, and the data would be lost.
The other problem is the capacity of our servers. With a large amount of data coming in, and a large number of collection devices giving us this data, we could find ourselves vulnerable to a Distributed Denial of Service attack that we ourselves authorised. This would lead to us being unable to process some or all data, some data being lost, and potentially, downtime. As more and more vehicles are used more and more regularly, our servers will run the risk of catching fire!
One option when faced with problems like this is, of course, standard cloud based infrastructure. With the likes of Amazon’s EC2, more machines could be powered on when demand was high, and different availability zones can help in the event of machine downtime or network problems.
With other cloud based services, we were able to put a layer of middleware between our data collection devices and our enterprise infrastructure and internal systems, allowing us to use the Cloud to Connect us.
This cloud based middleware, for us, was a Message Queue.
By using a dedicated message queuing infrastructure, our application can cope with downtime: when our service is down, messages are held in the message queue until we are back online. Just because our application is online doesn’t mean we can process incoming data, so the queue acts as an elastic buffer between our computing power and our data streams. Since we can cope with downtime and capacity problems, we can perform maintenance on our system as and when we need to.
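The elastic behaviour of the queue can be sketched in a few lines of Python. This is only an illustration using the standard library's in-process `queue.Queue` as a stand-in for a hosted message queue; the `drain` function and its `budget` parameter are hypothetical names, not part of the Smith system.

```python
import queue


def drain(q, handle, budget):
    """Process at most `budget` messages, leaving the rest safely queued.

    Returns the number of messages actually handled.
    """
    done = 0
    while done < budget:
        try:
            message = q.get_nowait()
        except queue.Empty:
            break  # nothing left to process right now
        handle(message)
        done += 1
    return done


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(5):  # messages arrive while the consumer is busy or offline
        q.put(i)
    processed = []
    drain(q, processed.append, budget=3)
    print(processed, "still queued:", q.qsize())
```

The consumer decides how much it can take on each pass; anything beyond that simply waits in the queue instead of being lost.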
While cloud based infrastructure and services often deliver massive reliability boosts, they can, and sometimes do, fail.
The solution there was to ensure the remote collection devices themselves have a small buffer within them. Be careful: you don’t want to send all of the buffered data back in one go once you can connect to the service again!
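One way to picture a device-side buffer that does not flood the service on reconnect is the sketch below. It assumes a fixed-size buffer that drops the oldest samples when full and is emptied in small batches; both of those policies, and the `DeviceBuffer` name, are assumptions made for illustration rather than details from the talk.

```python
from collections import deque


class DeviceBuffer:
    """Bounded on-device buffer; when full, the oldest sample is discarded."""

    def __init__(self, capacity, max_batch):
        self.samples = deque(maxlen=capacity)
        self.max_batch = max_batch

    def record(self, sample):
        self.samples.append(sample)

    def next_batch(self):
        """Remove and return at most `max_batch` of the oldest samples."""
        batch = []
        while self.samples and len(batch) < self.max_batch:
            batch.append(self.samples.popleft())
        return batch


if __name__ == "__main__":
    buf = DeviceBuffer(capacity=5, max_batch=2)
    for i in range(7):  # connection is down while seven samples are taken
        buf.record(i)
    while True:
        batch = buf.next_batch()  # on reconnect, send a little at a time
        if not batch:
            break
        print("sending", batch)
```

Sending one small batch per connection cycle spreads the backlog out, so a fleet of devices reconnecting at once does not arrive as a single spike.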
If you remember back to one of the earlier slides, I talked about how the vehicle components constantly tell us what they are up to, and that we sample that on a per second basis. Do we always care? With data which is reported at a low resolution, which doesn’t change as frequently as the sampling occurs – do we really need it? Clearly the answer is no. If a battery is 100% full, and remains that way for 5 minutes during drive, we don’t need to know that for every second of those 5 minutes the battery has 100% state of charge. We can simply log when it went down to 99%, and if we look up data within those five minutes we can assume that the previous known value applies. Of course, this does bring problems of its own into the mix. What if the vehicle was turned off, and its shut-down sequence interrupted? We don’t know that its drive status is now off, nor do we know that its battery is now doing nothing. Should we assume it continues to draw charge? No. We need to make some assumptions. In effect, we put our own sampling in place, where we sample the data from the vehicle on a per minute basis, unless the data changes. If it changes, we sample; if it doesn’t change, we don’t sample until at least 60 seconds have passed.
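The sampling rule just described (log on change, otherwise at most once every 60 seconds, and read back the previous known value) can be sketched as follows. The function names and the `(timestamp, value)` tuple layout are hypothetical; timestamps are assumed to be seconds and to arrive in order.

```python
import bisect


def compress(samples, heartbeat=60):
    """Keep a sample when its value changed, or when `heartbeat` seconds
    have passed since the last sample that was kept."""
    kept = []
    for ts, value in samples:
        if not kept:
            kept.append((ts, value))
            continue
        last_ts, last_value = kept[-1]
        if value != last_value or ts - last_ts >= heartbeat:
            kept.append((ts, value))
    return kept


def value_at(kept, ts):
    """Return the previous known value at time `ts`, or None if there is none."""
    times = [t for t, _ in kept]
    index = bisect.bisect_right(times, ts) - 1
    return kept[index][1] if index >= 0 else None


if __name__ == "__main__":
    # Two minutes at 100% state of charge, then a drop to 99%.
    raw = [(t, 100) for t in range(120)] + [(120, 99)]
    kept = compress(raw)
    print(len(raw), "samples reduced to", len(kept))
    print("state of charge at t=90:", value_at(kept, 90))
```

The heartbeat is what limits the damage from an interrupted shut-down: if nothing has been logged for more than a minute, the last value should no longer be assumed to hold.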
For example, the energy used relies on the current and the voltage values of the battery. The distance travelled and speeds (for top and average speeds) rely on the motor speed, the gearbox and other vehicle metrics.
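As a worked example of the energy calculation, the sketch below assumes one `(voltage, current)` sample per second, takes power as voltage multiplied by current, and sums it over time. The fixed sample interval, the sign convention for current and the function name are all assumptions for illustration.

```python
def energy_kwh(samples, interval_s=1.0):
    """Approximate energy from (volts, amps) samples taken every `interval_s` seconds.

    Each sample contributes volts * amps * interval_s joules;
    one kilowatt hour is 3,600,000 joules.
    """
    joules = sum(volts * amps * interval_s for volts, amps in samples)
    return joules / 3_600_000


if __name__ == "__main__":
    # One hour of driving at a steady 300 V and 10 A is 3 kW for 1 h.
    one_hour = [(300.0, 10.0)] * 3600
    print(energy_kwh(one_hour), "kWh")
```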
For any non-essential server tasks, we try to outsource these to the cloud. Services such as postmarkapp allow you to outsource your email delivery, so you don’t need to worry that your servers are under pressure dealing with critical data when you also need to send hundreds of reports based on the contents of that data. Get someone else to do the work. We make use of a vast number of cloud based services to do our work for us, including maps, email delivery and even phone integration.
With SQL based database systems, each data type available for a field uses a set amount of storage space. A good example is integers. MySQL offers a range of different integer field types, each able to store a different range of values; the greater the range, the more storage space the field uses, even if the values you actually store would fit in the next type down. If you know the data in a particular field is always within a specific range, use the data type with the smallest size which supports the range you require. When you need to store data at scale, an over eager datatype can cost you dearly. Similarly, make sure the data type is optimised for the work you are doing on it. When it comes to ints, floats, doubles and decimals, some are more suited than others to arithmetic work because of the part of the CPU they use.
Our vehicle live screen lists the current status of 27 different pieces of information on a given vehicle. Due to the way the data is stored together and managed, to pull out this information would require a separate query per data point, or a query with lots of subqueries. The page is refreshed every thirty seconds.
We run a large number of daily exports and reports, where we look at the data held in a number of shards, and pull that out. If we postpone this processing for maintenance work, or to deal with a built up message queue, we find we have a large queue of data to be exported. The trick is to export the data differently. Normally, we would pull data for a single vehicle for a single day, do some processing, build a report, then move on to the next. When catching up on a number of days’ worth of reports, it is faster to do each of those days for a single vehicle, then move on to the next, because the data from the indexes is often still held in memory.
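To make the integer example concrete, the sketch below picks the smallest MySQL integer type for a known value range, using MySQL's documented storage sizes (TINYINT 1 byte, SMALLINT 2, MEDIUMINT 3, INT 4, BIGINT 8). The helper name is hypothetical.

```python
# (type name, storage size in bytes) for MySQL integer types, smallest first.
INT_TYPES = [
    ("TINYINT", 1),
    ("SMALLINT", 2),
    ("MEDIUMINT", 3),
    ("INT", 4),
    ("BIGINT", 8),
]


def smallest_int_type(lowest, highest):
    """Return (type, bytes) for the smallest type covering lowest..highest."""
    for name, size in INT_TYPES:
        bits = size * 8
        if lowest >= 0 and highest <= 2 ** bits - 1:
            return name + " UNSIGNED", size
        if lowest >= -(2 ** (bits - 1)) and highest <= 2 ** (bits - 1) - 1:
            return name, size
    raise ValueError("range does not fit in a MySQL integer type")


if __name__ == "__main__":
    print(smallest_int_type(0, 100))    # e.g. a state of charge percentage
    print(smallest_int_type(-40, 150))  # e.g. a temperature in whole degrees
```

At 1.5 billion inserts per day, and assuming one such value per inserted row, every byte saved per row is roughly 1.5 GB less to store each day, before indexes are counted.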
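The two orderings can be compared with a small sketch. It assumes each vehicle lives in its own shard, so every change of vehicle between consecutive jobs means touching a different set of indexes; the function names are hypothetical.

```python
def normal_order(vehicles, days):
    """Day by day: every vehicle for day 1, then every vehicle for day 2, ..."""
    return [(vehicle, day) for day in days for vehicle in vehicles]


def catch_up_order(vehicles, days):
    """Vehicle by vehicle: all outstanding days for one vehicle, then the next."""
    return [(vehicle, day) for vehicle in vehicles for day in days]


def shard_switches(jobs):
    """Count how often consecutive jobs move to a different vehicle's shard."""
    return sum(1 for prev, cur in zip(jobs, jobs[1:]) if prev[0] != cur[0])


if __name__ == "__main__":
    vehicles = ["A", "B"]
    days = ["Mon", "Tue", "Wed"]
    print("day by day:", shard_switches(normal_order(vehicles, days)))
    print("vehicle by vehicle:", shard_switches(catch_up_order(vehicles, days)))
```

Both orderings produce the same set of reports; the catch-up ordering just keeps consecutive jobs on the same shard for as long as possible.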
Data at Scale - Michael Peacock, Cloud Connect 2012
Data at Scale: Data problems and solutions with the connected world

Michael Peacock
• Web Systems Developer
• Telemetry Team
• Smith Electric Vehicles
• Lead Developer
• Occasional conference speaker
• Technical Author

• World’s largest manufacturer of all electric commercial vehicles
• Founded in 1920
• US facility opened 2009
• US buyout in 2011

Electric Vehicles
• 16,500 – 26,000 lbs gross vehicle weight
• Commercial Electric Delivery Trucks
• 7,121 – 16,663 lbs payload
• 50 – 240km
• Top Speed 80km/h

Electric Vehicles
• New, continually evolving, technology
• Viability evidence required
• Government research

EV Data
• Performance analysis and metrics
• Proving the technology: Government research
• Evaluating driver training conversions
• Diagnostics, Service and Warranty Issues
• Continuous Improvement

Current Status
• ~500 telemetry enabled vehicles
• Telemetry is now fitted as standard in our vehicles
• Our MySQL solution processes:
  – 1.5 billion inserts per day
  – Constant minimum of 4000 inserts per second

CANBus and Telemetry
• Sample the buses: once per second
• Only sample buses with useful performance and diagnostic information on them

Vehicle Data
• Drive train information:
  – Motor speed
  – Pedal positions
  – Temperatures
  – Fault Codes
• Battery information:
  – Current, Voltage & Power
  – Capacity
  – Temperatures

Connected World: The Problem
• Connected infrastructure
  – EV Charging stations
  – Utilities
• Home based telemetry
  – Smart Meters
  – Smart Homes

Our problem
• Hundreds of connected devices, each with numerous sensors giving us 2,500 pieces of data per second per vehicle
• Broadcast time we can’t plan for
• Vehicles rolling off the production line
• New requirements for more data

Option: Cloud Infrastructure
• Cloud based infrastructure gives:
  – More capacity
  – More failover
  – Higher availability

Cloud Infrastructure: Problem
• Huge volumes of data inserts into a MySQL solution: sub-optimal on virtualised environments
• Existing enterprise hardware investment
• Security and legal issues for us storing the data off-site

Cloud based infrastructure
• Use a Message Queue to ensure data is only processed when you have the resources to process it

SAN
• Backbone to most cloud-based systems
• Powers our MySQL solution
• Supports:
  – Huge volumes of data
  – Lots of processing
  – Fast connection to your servers
  – Backups and snapshots

SAN Tips
• When dealing with data on a huge scale every aspect of your application and infrastructure needs to be optimised; this includes your SAN, something which is commonly overlooked.
• http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html

Speed: Stream Batch
• Streams of continuously flowing data can be difficult to process
• Turn the stream into small, quick batches
• MySQL: LOAD DATA INFILE
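A minimal sketch of turning a stream into batches for LOAD DATA INFILE follows. It writes each batch to a tab-separated file and builds the statement text without executing it. The table name, the column names `recorded_at`, `signal_id` and `value`, and the batch size are hypothetical, and the values are assumed to contain no tabs or newlines.

```python
import csv
import os
import tempfile


def batches(stream, size):
    """Group a continuous stream into lists of at most `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


def write_batch(rows, directory):
    """Write one batch as a tab-separated file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".tsv", dir=directory)
    with os.fdopen(fd, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return path


def load_statement(path, table):
    """Build the bulk-load statement for one batch file (not executed here)."""
    return (
        f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
        "(recorded_at, signal_id, value)"
    )


if __name__ == "__main__":
    stream = ((1329210000 + i, 17, 99.5) for i in range(5))
    with tempfile.TemporaryDirectory() as directory:
        for rows in batches(stream, size=2):
            print(load_statement(write_batch(rows, directory), "vehicle_data"))
```

One bulk load per batch replaces many single-row INSERT statements, which is the point of the slide; how large a batch should be is something to measure rather than guess.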
Shard 1: Hardware
• As the amount of data increased, we hit a huge performance problem. This was solved by sharding at a hardware level.
• Each data collection device was given its own database, which could be on any number of separate machines, with a single database acting as a registry
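The registry idea can be sketched as a lookup from device to the host and database holding its data. In the talk the registry is itself a database; the in-memory dictionary, the `vehicle_<id>` naming scheme and the host names below are stand-ins invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    host: str
    database: str


class Registry:
    """Maps each data collection device to the shard that stores its data."""

    def __init__(self):
        self._shards = {}

    def register(self, device_id, host):
        shard = Shard(host=host, database=f"vehicle_{device_id}")
        self._shards[device_id] = shard
        return shard

    def locate(self, device_id):
        """Return the shard for a device; raises KeyError if it is unknown."""
        return self._shards[device_id]


if __name__ == "__main__":
    registry = Registry()
    registry.register("ev042", host="db3.internal")
    print(registry.locate("ev042"))
```

Because every query starts with a registry lookup, a device's database can be moved to another machine by changing one registry entry.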
Rationalisation & Extrapolation
• Remember the CANBus
  – Always telling us information, which we sample every second?
  – Do we always need that?
• Extrapolate and assume

Getting information from data
• Vehicle performance information involves:
  – Looking at 20 – 30 data points for each second of a vehicle’s operation in a day
  – Analysing the data
  – Performing calculations, which vary depending on certain data points
• Getting this data was slow
  – How far did Customer A’s fleet travel last week?

Regular processing
• Instead of processing data on demand, process it regularly
• Nightly scheduled task to evaluate performance information

Regular Processing: Problems
• You need to pull the data out faster than before!

Shard 2: Tables
• All our data has a timestamp associated with it
• Looking up data for a particular day was slow. Very slow.
• We sharded the data again, this time with a table per week within a vehicle’s specific database

Sharding: Fallbacks and logic
• What about data before you implemented sharding?
• Which table do I need to look at?
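Both questions on this slide come down to a small piece of routing logic. The sketch below assumes weekly tables named after the ISO year and week, and a single legacy table for anything recorded before a cutover date; the table names and the cutover date are hypothetical.

```python
from datetime import date, timedelta


def table_for(day, cutover):
    """Return the table holding data for `day` within a vehicle's database."""
    if day < cutover:
        return "data_legacy"  # data from before sharding was introduced
    iso = day.isocalendar()   # (ISO year, ISO week, ISO weekday)
    return f"data_{iso[0]}_w{iso[1]:02d}"


def tables_for_range(start, end, cutover):
    """List, in order and without repeats, the tables covering start..end."""
    names = []
    day = start
    while day <= end:
        name = table_for(day, cutover)
        if name not in names:
            names.append(name)
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    cutover = date(2011, 6, 1)
    print(table_for(date(2012, 2, 14), cutover))
    print(tables_for_range(date(2012, 2, 12), date(2012, 2, 14), cutover))
```

Using the ISO year rather than the calendar year matters at year boundaries: 1 January 2012 falls in ISO week 52 of 2011, so it belongs in the same table as the days just before it.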
Aggregation
• With data segregated on a per vehicle and per week basis, lookups were much faster
• Performance calculations could be scheduled nightly, with a single record recorded for each vehicle for each day in a central database
• Allows for easy aggregation:
  – How far did my fleet travel last week?
  – How much energy did they use last month?

Backups and Archives
• SAN backups and snapshots
• With date based sharding:
  – Dump a table
  – Copy it elsewhere
  – Drop it / Flush it (if archiving)

Outsource to the cloud
• Why waste resources doing things that cloud based services do better (where legal, security and privacy reasons allow)?
• Maps
• Email delivery
• Even phone integration

Data Type Optimization
• When prototyping a system and designing a database schema, it’s easy to be sloppy with your data types, and fields
• DON’T BE
• Use as little storage space as you can
  – Ensure the data type uses as little as you can
  – Use only the fields you need

Sharding: An excuse
• Sharding was a large project for us, and involved extensive re-architecting of the system.
• We had to make changes to every query we have in our code
• Gave us an excuse to:
  – Optimise the queries
  – Optimise the indexes

Query Optimization
• Run every query through EXPLAIN EXTENDED
• Check it hits the indexes
• Remove functions like CURDATE from queries, to ensure query cache is hit
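The CURDATE point can be illustrated by computing the date in the application and sending it as a literal, so that the query contains no non-deterministic function. The table and column names here are hypothetical, and this applies to MySQL versions that still have a query cache (it was removed in MySQL 8.0).

```python
from datetime import date


def daily_summary_query(today):
    """Build the query with the date as a literal instead of CURDATE().

    `today` must be a datetime.date, so the interpolated text is always
    a plain YYYY-MM-DD string.
    """
    return (
        "SELECT vehicle_id, distance_km FROM daily_performance "
        f"WHERE day = '{today.isoformat()}'"
    )


if __name__ == "__main__":
    print(daily_summary_query(date.today()))
```

Every request made on the same day now sends identical query text, which is what a text-keyed cache needs in order to return a stored result.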
Index Optimization
• Keep it small
• From our legacy days of one database on one server, we had a column that told us which vehicle the data related to
  – This was still there... as part of an index... despite the fact the application hadn’t required it for months

Live data
• Original database design dictated:
  – Each type of data point required a separate query, sub-query or join to obtain
• Collection device and processing service dictated:
  – GPS Co-ordinates can be up to 6 separate data points, including: Longitude; Latitude; Altitude; Speed; Number of Satellites used to get location; Direction

Dashboards: Caching
• Don’t query if you don’t have to
• Cache what you can; access direct
• With message queuing it’s possible to route messages to two or more places: one to be processed and another to display the latest information directly
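Routing each message to two places might look like the sketch below, where one consumer stands in for the normal processing pipeline and the other keeps only the latest value per signal for the live screen. The message layout `(vehicle, signal, timestamp, value)` and all names are assumptions for illustration.

```python
class LatestValues:
    """Keeps only the most recent value of each signal for each vehicle."""

    def __init__(self):
        self._latest = {}

    def update(self, message):
        vehicle, signal, ts, value = message
        current = self._latest.get((vehicle, signal))
        if current is None or ts >= current[0]:  # ignore out-of-order messages
            self._latest[(vehicle, signal)] = (ts, value)

    def snapshot(self, vehicle):
        """Everything the live screen needs for one vehicle, with no SQL."""
        return {
            signal: value
            for (veh, signal), (_, value) in self._latest.items()
            if veh == vehicle
        }


def fan_out(message, consumers):
    """Deliver one message to every consumer."""
    for consumer in consumers:
        consumer(message)


if __name__ == "__main__":
    to_process = []  # stands in for the normal processing pipeline
    live = LatestValues()
    for msg in [("ev1", "soc", 10, 99), ("ev1", "speed", 10, 42)]:
        fan_out(msg, [to_process.append, live.update])
    print(live.snapshot("ev1"))
```

The live screen then reads one small structure per vehicle instead of running a query per data point on every refresh.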
Exporting data: Group
• Where possible group exports and reports together by the same shard/table/index

Code considerations
• Race conditions
• Number of concurrent requests – group them

Application Quality
• When dealing with lots of data, quickly, you need to ensure:
  – You process it correctly
  – You can act fast if there is a bug
  – You can act fast when refactoring

Deployment
• When dealing with a stream of data, rolling out new code can mean pausing the processing work that is done
• Put deployment measures in place to make a deployment switch over instantaneous

Technical Tips
• Measure your application’s performance, data throughput and so on
  – A data at scale problem itself
• Use as much RAM on your servers as is safe to do so
  – We give MySQL 80% of the 100 – 140GB on each DB server

What do we have now?
• Now we have a fast, stable, reliable system
• Pulling in millions of messages from a queue per day
• Decoding those messages into 1.5 billion data points per day
• Inserting 1.5 billion data points into MySQL per day
• Performance data generated, and grant authority reports exported daily
• More sleep on a night than we used to