British Gas Connected Homes: Data Engineering

Data Engineering
At British Gas Connected Homes
Josep Casals | @jcasals | Cassandra Summit 2015, Santa Clara CA

British Gas / Connected Homes
• British Gas is a 200-year-old company
• Connected Homes is BG’s IoT “startup”
• Leader in the UK’s connected home market

Data Sources
• Gas and electricity meter readings
• Thermostat temperature data
• Connected boiler data
• Real time energy consumption data
• Introducing motion sensors, window
and door sensors, etc.

Meter Data
• Millions of gas and electricity
customers
• ~600k smart meters
• Readings every 30 minutes from
smart meters

Machine Learning
applied to Meter Data
• Energy disaggregation
• Similar homes comparison
• Smart meters used in indirect
algorithms for non-smart customers

Connected Thermostats
• > 200k Connected Thermostats
• Temperature data time series

Connected Boilers
• Proactive maintenance
• Failure detection

In Home Displays
in a mobile App
• Data every 10 seconds
• Still needs an access device
connected to the router
• Allows real time mobile alerts

Technologies we use
Technologies we are trying

Our Engineering process
• Two points of friction at the
intersection between teams
• Sharing datasets is problematic
• Real infrastructure too different from
real environments
• New technologies too hard to
deploy
• Time to production > 6 months

Solution #1:
Data Ops
• Data oriented DevOps instead of
service oriented DevOps:
• Stateful instead of stateless
• Jobs instead of config
• Resource management instead of
resource partitioning

Solution #1: Data Ops
• Ansible and Docker:
1. Smooth transition from development and testing to production
2. Blue / green deployments
3. Swarm / Mesos + Docker = better use of infrastructure
• Time to production down to < 2 months :-|

Future Solution #2: Data Science Environment
• Ideally Data Science models should be plug and play
• Python and R dataframes in Spark are promising but data scientists don’t
feel the need for Spark
• Data scientists prefer to work with relational DBs
• We need to find a way to make production datasets available to them

Future Solution #2:
Data Science Environment
• Possible solutions we are investigating
are:
• Automated exports into a data
science relational DB
• Spark SQL server
• Automatically generated environment
images
• Objective is to reduce implementation
time for new features to < 1 month

Use Case
High Consumption Alerts
• The red dot on top is what we want
to detect
• The green bottom dots are the
baseline plus the fridge

Data Ingest
• Very high volume of messages (every
10 seconds)
• Kafka partitions help us cope with
volume
• (experimental) we’re trying Samza for
quick sliding-window type
transformations
• We often miss reads, so the Samza job
also does basic interpolation
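
The interpolation step mentioned above could look roughly like the following. This is a plain Python sketch, not Samza code. The slides say only "basic interpolation", so the choice of linear interpolation, the function name `fill_missing`, and the (timestamp in seconds, watts) tuple layout are all assumptions made for illustration.

```python
def fill_missing(readings, step=10):
    """Fill gaps in a sorted list of (timestamp_seconds, watts) readings.

    Assumes timestamps fall on a regular grid of `step` seconds and that
    missing values can be estimated by linear interpolation between the
    two neighbouring reads (an assumption, not taken from the slides).
    """
    if not readings:
        return []
    filled = [readings[0]]
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        gap = (t1 - t0) // step  # number of steps between the two reads
        for i in range(1, gap):
            # Interpolated value at the i-th missing slot.
            filled.append((t0 + i * step, v0 + (v1 - v0) * i / gap))
        filled.append((t1, v1))
    return filled


if __name__ == "__main__":
    # Two reads 30 seconds apart: the 10 s and 20 s slots are missing.
    print(fill_missing([(0, 100.0), (30, 130.0)]))
```

In a streaming job the same logic would run per customer over a short window of recent messages rather than over a complete list.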

Spark Streaming with Cassandra
• Real time data comes from Kafka
• Cassandra stores historical usage
information
• A Spark Streaming job combines
both and applies a machine
learning algorithm to generate high
usage alerts
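
To make the idea of combining historical and real time data concrete, here is a minimal Python sketch. The slides do not say which machine learning algorithm is used, so this substitutes a simple statistical threshold (historical mean plus `k` standard deviations) as a stand-in. The function name, the parameter `k`, and the sample values are hypothetical, and no Spark or Cassandra API is involved.

```python
from statistics import mean, stdev


def is_high_usage(history_watts, recent_watts, k=3.0):
    """Return True if recent consumption is unusually high.

    history_watts: past readings for this customer (e.g. loaded from storage).
    recent_watts:  readings from the current real time window.
    k:             hypothetical sensitivity parameter (number of standard deviations).
    """
    if len(history_watts) < 2 or not recent_watts:
        return False  # not enough data to decide
    threshold = mean(history_watts) + k * stdev(history_watts)
    return mean(recent_watts) > threshold


if __name__ == "__main__":
    history = [200, 210, 190, 205, 195]  # made-up baseline readings
    print(is_high_usage(history, [2000, 2100]))
    print(is_high_usage(history, [200, 205]))
```

In the architecture described above, `history_watts` would correspond to what Cassandra stores and `recent_watts` to what arrives from Kafka.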

Overall Architecture
• Getting the partitions right is very
important for scalability
• The Spark-Cassandra connector preserves
Cassandra (C*) partitioning
• It’s important to match Kafka
partitioning to CassandraRDD
partitioning

High Consumption alerts | Main Spark loop

Data Partitioning
• Data systems like Cassandra or
Kafka scale by partitioning data
• Given enough partitions, any
technology can work
• We need a simple hashing algorithm
that works the same in many
languages and across technologies

Cassandra data modelling with buckets
• Using a hashing function that is uniform and deterministic we can cope
with time series data for any number of customers
• One of our preferred strategies is to use buckets
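
A bucketing scheme of this kind might be sketched as follows in Python. The slides do not show the actual table layout, so the (bucket, day) partition key, the bucket count, and the function names are hypothetical. `zlib.crc32` is used here only as a readily available deterministic hash; any uniform, deterministic function would serve.

```python
import zlib
from datetime import datetime, timezone

NUM_BUCKETS = 16  # hypothetical bucket count, chosen for illustration


def bucket_for(customer_id, num_buckets=NUM_BUCKETS):
    """Map a customer ID string to a bucket, deterministically."""
    # crc32 is a stand-in hash: stable across runs and widely available.
    return zlib.crc32(customer_id.encode("utf-8")) % num_buckets


def partition_key(customer_id, ts, num_buckets=NUM_BUCKETS):
    """Hypothetical partition key: (bucket, day)."""
    return (bucket_for(customer_id, num_buckets), ts.strftime("%Y-%m-%d"))


if __name__ == "__main__":
    ts = datetime(2015, 9, 23, 14, 30, tzinfo=timezone.utc)
    keys = {partition_key(f"customer-{i}", ts) for i in range(10_000)}
    print(len(keys))  # at most NUM_BUCKETS partitions for that day
```

With a key like this, the number of partitions per day stays bounded by the bucket count however many customers are added, while each partition grows with the number of customers assigned to its bucket.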

h(k) = ⌊m * frac(kA)⌋
• Multiplicative hashing is our preferred simple partitioning algorithm
• m = number of partitions
• A = (√5−1)/2 ≈ 0.6180339887... (the golden ratio conjugate, 1/φ)
• Online example: jsfiddle.net/joscas/yfp72fq5
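
The formula h(k) = ⌊m * frac(kA)⌋ translates almost directly into code. The Python sketch below assumes non-negative integer keys (for example numeric customer IDs). The function name and the clamp on the result are additions made here for illustration and are not taken from the slides or the linked example.

```python
import math

# A = (sqrt(5) - 1) / 2, approximately 0.6180339887
A = (math.sqrt(5) - 1) / 2


def multiplicative_hash(k, m):
    """Return h(k) = floor(m * frac(k * A)), a partition index in [0, m)."""
    frac = (k * A) % 1.0  # fractional part of k * A
    # The clamp guards against a floating-point result rounding up to m.
    return min(m - 1, math.floor(m * frac))


if __name__ == "__main__":
    print([multiplicative_hash(k, 10) for k in range(1, 4)])  # [6, 2, 8]
```

Since the computation uses only double-precision multiplication, a remainder, and a floor, the same keys should map to the same partitions in other languages that use IEEE 754 doubles, such as JavaScript. Very large keys lose precision in the fractional part, so this is best suited to keys well below 2^53.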

Summary
• Increase in productivity with portable environments (Ansible, Docker,
Mesos)
• Getting partitions straight is essential
• Using a simple common hashing algorithm across technologies and
languages is very helpful

Summary
• Streaming technologies are rapidly evolving
• Spark streaming is complex but with many advantages (Spark’s excellent
integration with Cassandra, Spark’s ML libraries, etc.)
• Kafka ticks a lot of boxes for large scale distributed real time data
systems

Thank you
josep.casals@bgch.co.uk
@jcasals
