What does HiveHome do?
Provides a range of different sensors that all work together
to build a smart and connected home.
How is Big Data generated at Connected Home?
…more devices to be released
How is it accessible?
Avro messages published through
contracted Kafka topics
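Each reading arrives as an Avro record whose schema is enforced against the topic's contract. A hypothetical schema for a temperature reading (field and record names here are illustrative, not the actual contract) might look like:

```json
{
  "type": "record",
  "name": "TemperatureReading",
  "namespace": "com.example.connectedhome",
  "fields": [
    {"name": "deviceId", "type": "string"},
    {"name": "houseId", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "internalTemperature", "type": "double"}
  ]
}
```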
Some numbers?
4+ Billion messages from input topics
to the Data Platform (increasing by 1000s every day)
Is it useful?
Many CH & BG services are built solely
on Big Data projects.
* design a microservices architecture
that won’t wake you up at 03:00 for a simple restart
* not duplicate stuff (code or configs)
We spend a significant share of our time as plumbers… let’s make our lives easy
* be resilient to failures,
especially when dealing with stateful applications
* communicate/collaborate with data scientists
mathematicians != engineers
Processing 50K msgs/s from IoT devices, you learn to:
Try to:
* Decouple applications
* Stick to single responsibility principle
* Make apps portable
* Make apps immutable
* Make testing portable and easy
Docker & Kubernetes
Police E & L with the Schema Registry
Microservices in real time pipelines
Average the internal temperatures per
house per 30 minutes and persist to ES
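The job above boils down to a keyed, tumbling-window average. A minimal Python sketch of that logic (the real job runs as a streaming application; the house IDs, timestamps, and field layout here are illustrative):

```python
from collections import defaultdict

WINDOW_MS = 30 * 60 * 1000  # 30-minute tumbling windows


def window_start(ts_ms: int) -> int:
    """Align a timestamp to the start of its 30-minute window."""
    return ts_ms - (ts_ms % WINDOW_MS)


def average_per_house(readings):
    """readings: iterable of (house_id, ts_ms, temperature).
    Returns {(house_id, window_start): mean temperature}."""
    sums = defaultdict(lambda: [0.0, 0])
    for house, ts, temp in readings:
        key = (house, window_start(ts))
        sums[key][0] += temp
        sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}


readings = [
    ("house-1", 0, 20.0),
    ("house-1", 10 * 60 * 1000, 22.0),   # same 30-min window
    ("house-1", 31 * 60 * 1000, 18.0),   # next window
]
print(average_per_house(readings))
# → {('house-1', 0): 21.0, ('house-1', 1800000): 18.0}
```

Persisting each `(house, window) -> average` record to ES would then be a pure load step.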
* We only support/monitor one app!
* All in one place and you don’t have to
remember git repos etc..
* Job has 2 responsibilities
* Hard to test
* If we want to persist to Cassandra we
need to reprocess the messages
* We cannot reuse the app
* 1 responsibility per app
* Easy to replace the load job to ES with
a Cassandra job
* Easy to replay data
* We CAN generalise/reuse the L stage
* We need to support/monitor 2 apps :(
Microservices-based approach
We went through what our infrastructure looks like.
Let’s see what we deploy in that infrastructure
We used to write a lot of Spark apps for E & L operations
> internalTemperature to ElasticSearch
> internalTemperature to Cassandra
> motionDetected to ElasticSearch
> deviceSignal to Cassandra
But we replaced our spark jobs because…
We ended up with:
* Duplicated code all over the place
* Too many github repos. Hard to keep them
in your head
* Too much time to provision a small cluster
to test the app
* A lot of resources (£££) were wasted
because of the master/driver dependencies
The goal was to define the E & L stages *once* as
a generic re-usable component that handles:
Serialization / de-serialization
Partitioning / Scalability
Fault tolerance / fail-over
Schema Registry integration.
Kafka Connect to the rescue
* Suitable for EL operations (no T here)
* No driver/master/worker notions
* No dependency on ZooKeeper
* Uses the well-tested Kafka consumers/producers
* Configurable via a REST API
But by default you still need to write some code for every application
to handle its domain-specific transformations.
KCQL is a SQL-like syntax allowing streamlined configuration of Kafka Sink Connectors
* INSERT INTO transactionIndex SELECT * FROM transactionTopic
* INSERT INTO motionIndex SELECT motionDatetime AS motionAt FROM motionTopic
KCQL (Kafka Connect Query Language)
Available operations: rename fields, ignore fields, repartition messages, and many more
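In practice the KCQL statement sits inside an ordinary Kafka Connect config, submitted to the Connect REST API (`POST /connectors`). A sketch of an Elasticsearch sink configured this way (the connector class and the KCQL property key vary by connector version, so treat both as placeholders and check your connector's docs):

```json
{
  "name": "motion-to-elastic",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.elastic7.ElasticSinkConnector",
    "tasks.max": "2",
    "topics": "motionTopic",
    "connect.elastic.kcql": "INSERT INTO motionIndex SELECT motionDatetime AS motionAt FROM motionTopic"
  }
}
```

Swapping ES for Cassandra then means changing the connector class and the KCQL target, not rewriting any code.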
Monitor your KC apps…
* JMX metrics and logs from the app (JMX metrics provide detailed granularity on the state of the KC app)
* Kafka Connect UI (logs and configs for each KC app available with 1 click - https://github.com/landoop)
E & L stages are now solid and well defined,
with minimal duplication and high reusability.
T needs some polishing. Time to rethink
our T stage.
What was the problem …
Spark is great but not always the best option:
-> has the notion of micro-batches
-> handling state is not optimal
-> you need shared storage to store checkpoints and state
-> you need a cluster with master, driver & workers
From Spark to Kafka Streams …
Kafka Streams is great because:
-> is cluster- and framework-free
-> uses kafka to store the state
-> exposes the state via an API
-> has no notion of micro-batches
-> No need for zookeeper
So we re-wrote one of our heavy CPU jobs in Kafka Streams
-> Again: No need to worry about where to store checkpoints. Everything is stored in kafka.
-> No need for a cluster. Just execute `java -jar app.jar`
-> Less scripting !
-> We needed to do some funny stuff to make it work with Scala :(
And now we have:
-> Up to 50% fewer resources used in some cases. Better CPU/memory utilisation across instances.
-> Easier auto scaling. Just start more instances of your app and kafka streams will scale automatically.
-> Happier devops because they worry about the infrastructure and not the frameworks on top of that.
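The "just start more instances" point follows from how Kafka distributes work: parallelism is bounded by the number of input partitions, and partitions (tasks) are spread across the instances in a group. A rough Python illustration of that round-robin spreading (the real assignment is done by Kafka's group protocol, not by your code):

```python
def assign_tasks(partitions: int, instances: list[str]) -> dict[str, list[int]]:
    """Round-robin partition (task) assignment, roughly what a
    consumer-group rebalance achieves."""
    assignment = {i: [] for i in instances}
    for p in range(partitions):
        assignment[instances[p % len(instances)]].append(p)
    return assignment


# One instance owns every task...
print(assign_tasks(6, ["app-1"]))
# ...start two more, and the work spreads with no external scheduler.
print(assign_tasks(6, ["app-1", "app-2", "app-3"]))
```

Scaling past the partition count buys nothing, so topics are sized with headroom up front.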
And since the state is exposed
through an API we now know
what happens internally inside the app at any given time
Until now we described the engineering part of the Data Platform.
Let’s see who uses the data from our platform.
Data Science @ HiveHome
Some of the projects:
-> Energy Breakdown
Distribute the energy usage into categories (lighting, cooking, etc.) just by knowing the total hourly
consumed energy (patent pending)
-> Heating Failure Alert
Try to identify if a boiler is not working properly, knowing only the internal temperature of a house
-> as much data as possible
-> as soon as possible
-> as accessible as possible
Data Science @ Connected Home
What do scientists need?
Data Science @ Connected Home
How to work with data scientists
* Be proactive. Have the data ready in advance.
* Keep the data in a flexible datastore, e.g. Elasticsearch rather than Cassandra.
* Side by side development during each iteration of a model. (Scientists do not unit test!)
* Jupyter/Zeppelin notebooks. Easily run and scale a model across your clusters.
So what did we actually learn?
(except from all the cool stuff we can add to our CVs)
* Decouple everything.
* When you start copying code and configs -> tools down and rethink your application setup.
* Try new technologies. The initial learning curve will pay off later.
* Work closely with data scientists so they develop a mindset similar to an engineer’s.