In this talk, Josep draws on his experience of building a data platform based on Cassandra and Spark to serve the UK's foremost player in the connected homes market: bringing streams of data online, productionising data science algorithms on Spark, and delivering outputs via APIs or Kafka messages.
Josep will explore the ups and downs of bringing all this together and share what he's learned from 12 months of Cassandra and Spark development and operations.
British Gas Connected Homes: Data Engineering
1. Data Engineering
At British Gas Connected Homes
1Josep Casals | @jcasals | CassandraSummit, 2015 Santa Clara CA
2. British Gas / Connected Homes
• British Gas is a 200-year-old company
• Connected Homes is BG’s IoT “startup”
• Leader in the UK’s connected home market
3. Data Sources
• Gas and electricity meter readings
• Thermostat temperature data
• Connected boiler data
• Real time energy consumption data
• Introducing motion sensors, window
and door sensors, etc.
4. Meter Data
• Millions of gas and electricity
customers
• ~600k smart meters
• Readings every 30 minutes from
smart meters
5. Machine Learning
applied to Meter Data
• Energy disaggregation
• Similar homes comparison
• Smart meters used in indirect
algorithms for non-smart customers
6. Connected Thermostats
• > 200k Connected Thermostats
• Temperature data time series
7. Connected Boilers
• Proactive maintenance
• Failure detection
8. In Home Displays
in a mobile App
• Data every 10 seconds
• Still needs an access device
connected to the router
• Allows real time mobile alerts
10. Our Engineering process
• Two points of friction at the
intersection between teams
• Sharing datasets is problematic
• Development infrastructure too different from
real environments
• New technologies too hard to
deploy
• Time to production > 6 months
11. Solution #1:
Data Ops
• Data oriented DevOps instead of
service oriented DevOps:
• Stateful instead of stateless
• Jobs instead of config
• Resource management instead of
resource partitioning
12. Solution #1: Data Ops
• Ansible and Docker:
1. Smooth transition from development to testing to production
2. Blue/green deployments
3. Swarm/Mesos + Docker = better use of infrastructure
• Time to production down to < 2 months :-|
13. Future Solution #2: Data Science Environment
• Ideally Data Science models should be plug and play
• Python and R dataframes in Spark are promising but data scientists don’t
feel the need for Spark
• Data scientists prefer to work with relational DBs
• We need to find a way to make production datasets available to them
14. Future Solution #2:
Data Science Environment
• Possible solutions we are investigating
are:
• Automated exports into a data
science relational DB
• Spark SQL server
• Automatically generated environment
images
• Objective is to reduce implementation
time for new features to < 1 month
15. Use Case
High Consumption Alerts
• The red dot on top is what we want
to detect
• The green bottom dots are the
baseline plus the fridge
16. High Consumption Alerts
Data Ingest
• Very high volume of messages (every
10 seconds)
• Kafka partitions help us cope with
volume
• (experimental) we’re trying Samza for
quick sliding-window type
transformations
• We often miss reads, so the Samza job
also does basic interpolation
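
The talk does not say how the interpolation is done. As a minimal sketch, assuming readings arrive on a fixed 10-second grid and a missed read shows up as a gap (None), linear interpolation between the nearest known readings could look like this. The function name fill_missing is hypothetical, and the sketch is plain Python rather than a Samza job.

```python
from typing import List, Optional


def fill_missing(readings: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate None gaps that sit between two known readings.

    Assumes one slot per 10-second interval. Gaps at the very start or end
    are left as None because there is nothing to interpolate between.
    """
    filled = list(readings)
    last_known = None  # index of the most recent known reading
    for i, value in enumerate(readings):
        if value is None:
            continue
        if last_known is not None and i - last_known > 1:
            start = readings[last_known]
            step = (value - start) / (i - last_known)
            for j in range(last_known + 1, i):
                filled[j] = start + step * (j - last_known)
        last_known = i
    return filled


# Two missed reads between 100 W and 130 W are filled with 110 W and 120 W.
print(fill_missing([100.0, None, None, 130.0]))
```

In a streaming job the same logic would run over a short sliding window, so that a gap is only filled once the next real reading has arrived.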
17. High Consumption Alerts
Spark Streaming with Cassandra
• Real time data comes from Kafka
• Cassandra stores historical usage
information
• A Spark Streaming job combines
both and applies a machine
learning algorithm to generate high
usage alerts
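
The slides do not describe the machine learning algorithm itself. The sketch below only illustrates the shape of the job: live readings (as they would arrive from Kafka) are joined with per-home history (as it would be read from Cassandra) and compared against a baseline. The simple mean-plus-k-standard-deviations rule, the function name high_usage_alerts and the parameter k are all assumptions made for illustration, and plain Python stands in for Spark Streaming.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def high_usage_alerts(
    live: List[Tuple[str, float]],
    history: Dict[str, List[float]],
    k: float = 3.0,
) -> List[str]:
    """Return the ids of homes whose live reading exceeds their baseline.

    live    : (home_id, watts) pairs, standing in for a Kafka micro-batch
    history : home_id -> past readings, standing in for a Cassandra lookup
    k       : how many standard deviations above the mean counts as "high"
    """
    alerts = []
    for home_id, watts in live:
        past = history.get(home_id)
        if not past or len(past) < 2:
            continue  # not enough history to build a baseline
        threshold = mean(past) + k * pstdev(past)
        if watts > threshold:
            alerts.append(home_id)
    return alerts


history = {"a": [100, 110, 90, 100], "b": [200, 210, 190, 200]}
live = [("a", 500.0), ("b", 205.0)]
print(high_usage_alerts(live, history))  # only home "a" is far above its baseline
```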
18. High Consumption Alerts
Overall Architecture
• Getting the partitions right is very
important for scalability
• Spark-Cassandra connector keeps
C* partitions
• It’s important to match Kafka
partitioning to CassandraRDD
partitioning
19. High Consumption alerts | Main Spark loop
20. Data Partitioning
• Data systems like Cassandra or
Kafka scale by partitioning data
• Given enough partitions, any
technology can work
• We need a simple hashing algorithm
that works the same in many
languages and across technologies
21. Cassandra data modelling with buckets
• Using a hashing function that is uniform and deterministic we can cope with
time series data for any number of customers
• One of our preferred strategies is to use buckets
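
The slide does not show the table layout, so the following is one possible reading of the bucket strategy rather than the schema used in the talk. It assumes a string customer id, a fixed bucket count, and a partition key made of (bucket, day). CRC32 is used here only as a stand-in for a uniform, deterministic hash that is available in most languages. NUM_BUCKETS, bucket_for and partition_key are hypothetical names.

```python
import zlib
from datetime import datetime, timezone
from typing import Tuple

NUM_BUCKETS = 64  # hypothetical; would be sized to the cluster and data volume


def bucket_for(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a customer id to a bucket in [0, num_buckets) deterministically."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_buckets


def partition_key(customer_id: str, ts: datetime) -> Tuple[int, str]:
    """Build an assumed (bucket, day) partition key for a time series row."""
    return (bucket_for(customer_id), ts.strftime("%Y-%m-%d"))


reading_time = datetime(2015, 9, 23, 10, 30, tzinfo=timezone.utc)
print(partition_key("home-42", reading_time))
```

With a layout like this, the number of partitions per day is bounded by the bucket count instead of growing with the number of customers, and a batch job can read one bucket at a time.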
22. h(k) = ⌊m * frac(kA)⌋
• Multiplicative hashing is our preferred simple partitioning algorithm
• m= Number of partitions
• A = (√5−1)/2 ≈ 0.6180339887... (the inverse of the golden ratio)
• Online example: jsfiddle.net/joscas/yfp72fq5
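
The formula above can be sketched in a few lines. This version assumes non-negative integer keys and double-precision floats; the function name multiplicative_hash is hypothetical and the code is not taken from the linked fiddle.

```python
import math

A = (math.sqrt(5) - 1) / 2  # ≈ 0.6180339887


def multiplicative_hash(k: int, m: int) -> int:
    """h(k) = floor(m * frac(k * A)) for a non-negative integer key k.

    m is the number of partitions; the result is in [0, m).
    """
    frac = (k * A) % 1.0  # fractional part of k * A
    # Guard against floating-point rounding pushing the product up to m.
    return min(int(m * frac), m - 1)


print([multiplicative_hash(k, 10) for k in range(1, 6)])  # [6, 2, 8, 4, 0]
```

Because the function only needs a multiplication, a fractional part and a floor, it is easy to reproduce in other languages. For very large keys, k * A loses precision in its fractional part, so keys may need to be reduced to a bounded range first.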
23. Summary
• Increase in productivity with portable environments (Ansible, Docker,
Mesos)
• Getting partitions straight is essential
• Using a simple common hashing algorithm across technologies and
languages is very helpful
24. Summary
• Streaming technologies are rapidly evolving
• Spark Streaming is complex but has many advantages (Spark’s excellent
integration with Cassandra, Spark’s ML libraries, etc.)
• Kafka ticks a lot of boxes for large scale distributed real time data
systems