SlideShare a Scribd company logo
STREAMING 4 BILLIONS MESSAGES
LESSONS LEARNED
Angelos Petheriotis (@apetheriotis)
HiveHome
What is HiveHome doing ?
Provides a range of different sensors that all work together
to build a smart and connected home.
How is Big Data generated at Connected Home
…more devices to be released
How is it accessible ?
Avro messages through
Kafka Contracted topics
Some numbers ?
4+ Billion messages from input topics
to the Data Platform (increasing by 1000s every day)
Is it useful ?
Many of CH & BG services are based solely
on Big Data projects.
* design a micro services architecture

that won’t wake you up at 03:00 for a simple restart
* not duplicate stuff (code or configs) 

significant % of our time we are plumbers ..let’s make our lives easy
* be resilient to failures. 

Especially when dealing with stageful applications
* communicate/collaborate with data scientists 

mathematicians != engineers
Processing 50K msgs/s from IoT Devices you learn to:
Try to :
* Decouple applications
* Stick to single responsibility principle
* Make apps portable
* Make apps immutable
* Make testing portable and easy
Docker & Kubernetes
GoCD (CI/CD)
Decouple ETL
Police EL with Schema Registry
Microservices in real time pipelines
Average the internal temperatures per
house per 30 minutes and persist to ES
Pros
* We only support/monitor one app !
* All in one place and you don’t have to
remember git repos etc..


Cons
* Job has 2 responsibilities
* Hard to test
* If we want to persist to Cassandra we
need to reprocess the messages
* We cannot reuse the app
Monolithic approach
T+ L
Pros
* 1 responsibility per app
* Easy to replace the load job to ES with
a Cassandra job
* Easy to replay data
* We CAN generalise/reuse the L stage
Cons
* We need to support/monitor 2 apps :(
Microsevices based approach
T E
E C*
ES
We went through of how our infrastructure looks like.
Let’s see what we deploy in that infrastructure
We used to write a lot of Spark apps for E & L operations
> internalTemperature to ElasticSearch
> internalTemperature Cassandra
> motionDetected to ElasticSearch
> deviceSignal to Cassandra
> …
But we replaced our spark jobs because…
We ended up with:
* Duplicated code all over the place for
simple tasks
* Too many github repos. Hard to keep them
in your head
* Too much time to provision a small cluster
to test the app
* Many resources £££ were wasted
because of the master/driver dependencies
of spark
The goal was to define the E & L stages *once* as
a generic re-usable component that handles:
Offset Management
Serialization / de-serialization
Partitioning / Scalability
Fault tolerance / fail-over
Schema Registry integration.
Kafka Connect to the rescue
Kafka Connect
* Suitable for EL operations (no T here)
* No driver/master/worker notations
* No dependency on zookeeper
* Uses the well tested kafka consumer/producers
* Configurable by a rest API
But by default you need to write some code for every application
for the specific domain transformations.
KCQL is a SQL like syntax allowing streamlined configuration of Kafka Sink Connectors
Examples:
* INSERT INTO transactionIndex SELECT * FROM transcationTopic
* INSERT INTO motionIndex SELECT motionDatetime As motionAt FROM motionTopic
KCQL (Kafka Connect Query Language)
Available operations: rename,ignore fields, reparation messages any many more


(https://github.com/datamountaineer/kafka-connect-query-language)
How it looks like today …
Monitor your KC apps…
* JMX Metrics and logs from the APP (Jmx metrics provide detailed granularity of the state of the KC app)
* Kafka Connect UI (logs and configs for each KC app available with 1 click - https://github.com/landoop)
E & L stages are now solid, well defined
with minimal duplication and highly reusable.
T needs some polishing. Time to re-think of

our T stage.
What was the problem …
Spark is great but not always the best option:
-> has the notation of micro batches
-> handling state is not optimal
-> you need shared storage to store checkpoints and state
-> you need a cluster with master, driver & workers
SPARK :(
From Spark to Kafka Streams …
Kafka Streams is great because:
-> is cluster and framework free
-> uses kafka to store the state
-> exposes the state via an API
-> has no notation of micro batches
-> KTables
-> No need for zookeeper
So we re-wrote one of our heavy CPU jobs in Kafka Streams
Results:
-> Again: No need to worry about where to store checkpoints. Everything is stored in kafka.
-> No need for a cluster. Just execute `java -jar app.jar`
-> Less scripting !
-> We needed to do funny stuff to make it work with scala :(
And now we have:
-> 50% less resources were used in some cases. Better CPU/Memory utilisation across instances.
-> Easier auto scaling. Just start more instances of your app and kafka streams will scale automatically.
-> Happier devops because they worry about the infrastructure and not the frameworks on top of that.
And since the state is exposed 

through an API we now know
what happens internally inside the app at any given time
Until now we described the engineering part of the Data
Platform team.
Let’s see who uses the data from our platform.
Data Science @ HiveHome
Some of the projects:

-> Energy Breakdown
Distribute the energy usage into categories (lighting, cooking etc) just by knowing the total hourly
consumed energy (patent pending)
-> Heating Failure Alert
Try to identify if a boiler is not working properly, knowing only the internal temperature of a house
-> as much data as possible
-> as soon as possible
-> as accessible as possible
Data Science @ Connected Home
what do scientists need ?
Data Science @ Connected Home
how to work with data scientists
* Be proactive. Have the data ready in advance.
* Keep the data in an flexible datastore. I.e. Elastic Search and not Cassandra.
* Side by side development during each iteration of a model. (Scientists do not unit test!)
* Jupyter/Zeppelin notebooks. Easily run and scale a model across your clusters.
So what we actually learned
(except from all the cool stuff we can add to our CVs)
* Decouple everything.
* When you start copying code and configs -> tools down and re-think of your applications setup.
* Try new technologies. The initial learning curve will compensate you later.
* Work tightly with data scientists so they develop similar mindset to an engineer.
Streaming 4 billion Messages per day. Lessons Learned.

More Related Content

What's hot

How to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryHow to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQuery
Dan Sullivan, Ph.D.
 
2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL
Yu Ishikawa
 
Migrating a multi tenant app to Azure (war biopic)
Migrating a multi tenant app to Azure (war biopic)Migrating a multi tenant app to Azure (war biopic)
Migrating a multi tenant app to Azure (war biopic)
★ Akshay Surve
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Chris Schalk
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
Databricks
 
Introducing the Hub for Data Orchestration
Introducing the Hub for Data OrchestrationIntroducing the Hub for Data Orchestration
Introducing the Hub for Data Orchestration
Alluxio, Inc.
 
Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache Airflow
Gary Stafford
 
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest ProblemsUnified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Databricks
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
Databricks
 
Yahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile PlatformYahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile Platform
DataWorks Summit/Hadoop Summit
 
Stargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data APIStargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data API
Data Con LA
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
DoiT International
 
TDC2016SP - Trilha BigData
TDC2016SP - Trilha BigDataTDC2016SP - Trilha BigData
TDC2016SP - Trilha BigData
tdc-globalcode
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
Databricks
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at Scale
Databricks
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Spark Summit
 
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
Databricks
 
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Spark Summit
 
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration StoryDeep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Alluxio, Inc.
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 

What's hot (20)

How to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryHow to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQuery
 
2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL
 
Migrating a multi tenant app to Azure (war biopic)
Migrating a multi tenant app to Azure (war biopic)Migrating a multi tenant app to Azure (war biopic)
Migrating a multi tenant app to Azure (war biopic)
 
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQueryIntro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Introducing the Hub for Data Orchestration
Introducing the Hub for Data OrchestrationIntroducing the Hub for Data Orchestration
Introducing the Hub for Data Orchestration
 
Building Data Lakes with Apache Airflow
Building Data Lakes with Apache AirflowBuilding Data Lakes with Apache Airflow
Building Data Lakes with Apache Airflow
 
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest ProblemsUnified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 
Yahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile PlatformYahoo's Next Generation User Profile Platform
Yahoo's Next Generation User Profile Platform
 
Stargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data APIStargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data API
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
 
TDC2016SP - Trilha BigData
TDC2016SP - Trilha BigDataTDC2016SP - Trilha BigData
TDC2016SP - Trilha BigData
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at Scale
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, ...
 
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
 
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration StoryDeep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration Story
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 

Similar to Streaming 4 billion Messages per day. Lessons Learned.

Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
Nicola Ferraro
 
[Draft] Fast Prototyping with DPDK and eBPF in Containernet
[Draft] Fast Prototyping with DPDK and eBPF in Containernet[Draft] Fast Prototyping with DPDK and eBPF in Containernet
[Draft] Fast Prototyping with DPDK and eBPF in Containernet
Andrew Wang
 
Deep Dive into Futures and the Parallel Programming Library
Deep Dive into Futures and the Parallel Programming LibraryDeep Dive into Futures and the Parallel Programming Library
Deep Dive into Futures and the Parallel Programming Library
Jim McKeeth
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and Future
Keiichiro Ono
 
Using eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster HealthUsing eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster Health
ScyllaDB
 
Kubernetes Java Operator
Kubernetes Java OperatorKubernetes Java Operator
Kubernetes Java Operator
Anthony Dahanne
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...
Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...
Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...
Wojciech Barczyński
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with Docker
Docker, Inc.
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
Nordic APIs
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Landon Robinson
 
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
NETWAYS
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
GetInData
 
Spring, Functions, Serverless and You
Spring, Functions, Serverless and YouSpring, Functions, Serverless and You
Spring, Functions, Serverless and You
VMware Tanzu
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
Lightbend
 
All the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and KubernetesAll the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
DevOps.com
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Databricks
 
Why do we even have Kubernetes?
Why do we even have Kubernetes?Why do we even have Kubernetes?
Why do we even have Kubernetes?
Sean Walberg
 

Similar to Streaming 4 billion Messages per day. Lessons Learned. (20)

Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
[Draft] Fast Prototyping with DPDK and eBPF in Containernet
[Draft] Fast Prototyping with DPDK and eBPF in Containernet[Draft] Fast Prototyping with DPDK and eBPF in Containernet
[Draft] Fast Prototyping with DPDK and eBPF in Containernet
 
Deep Dive into Futures and the Parallel Programming Library
Deep Dive into Futures and the Parallel Programming LibraryDeep Dive into Futures and the Parallel Programming Library
Deep Dive into Futures and the Parallel Programming Library
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and Future
 
Using eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster HealthUsing eBPF to Measure the k8s Cluster Health
Using eBPF to Measure the k8s Cluster Health
 
Kubernetes Java Operator
Kubernetes Java OperatorKubernetes Java Operator
Kubernetes Java Operator
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...
Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...
Effective Kubernetes - Is Kubernetes the new Linux? Is the new Application Se...
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with Docker
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Spring, Functions, Serverless and You
Spring, Functions, Serverless and YouSpring, Functions, Serverless and You
Spring, Functions, Serverless and You
 
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
How to build streaming data pipelines with Akka Streams, Flink, and Spark usi...
 
All the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and KubernetesAll the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Why do we even have Kubernetes?
Why do we even have Kubernetes?Why do we even have Kubernetes?
Why do we even have Kubernetes?
 

Recently uploaded

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Streaming 4 billion Messages per day. Lessons Learned.

  • 1. STREAMING 4 BILLIONS MESSAGES LESSONS LEARNED Angelos Petheriotis (@apetheriotis) HiveHome
  • 2. What is HiveHome doing ? Provides a range of different sensors that all work together to build a smart and connected home.
  • 3. How is Big Data generated at Connected Home …more devices to be released How is it accessible ? Avro messages through Kafka Contracted topics Some numbers ? 4+ Billion messages from input topics to the Data Platform (increasing by 1000s every day) Is it useful ? Many of CH & BG services are based solely on Big Data projects.
  • 4. * design a micro services architecture
 that won’t wake you up at 03:00 for a simple restart * not duplicate stuff (code or configs) 
 significant % of our time we are plumbers ..let’s make our lives easy * be resilient to failures. 
 Especially when dealing with stageful applications * communicate/collaborate with data scientists 
 mathematicians != engineers Processing 50K msgs/s from IoT Devices you learn to:
  • 5. Try to : * Decouple applications * Stick to single responsibility principle * Make apps portable * Make apps immutable * Make testing portable and easy Docker & Kubernetes GoCD (CI/CD) Decouple ETL Police EL with Schema Registry Microservices in real time pipelines
  • 6. Average the internal temperatures per house per 30 minutes and persist to ES Pros * We only support/monitor one app ! * All in one place and you don’t have to remember git repos etc.. 
 Cons * Job has 2 responsibilities * Hard to test * If we want to persist to Cassandra we need to reprocess the messages * We cannot reuse the app Monolithic approach T+ L
  • 7. Pros * 1 responsibility per app * Easy to replace the load job to ES with a Cassandra job * Easy to replay data * We CAN generalise/reuse the L stage Cons * We need to support/monitor 2 apps :( Microsevices based approach T E E C* ES
  • 8.
  • 9. We went through of how our infrastructure looks like. Let’s see what we deploy in that infrastructure
  • 10. We used to write a lot of Spark apps for E & L operations > internalTemperature to ElasticSearch > internalTemperature Cassandra > motionDetected to ElasticSearch > deviceSignal to Cassandra > …
  • 11. But we replaced our spark jobs because… We ended up with: * Duplicated code all over the place for simple tasks * Too many github repos. Hard to keep them in your head * Too much time to provision a small cluster to test the app * Many resources £££ were wasted because of the master/driver dependencies of spark
  • 12. The goal was to define the E & L stages *once* as a generic re-usable component that handles: Offset Management Serialization / de-serialization Partitioning / Scalability Fault tolerance / fail-over Schema Registry integration.
  • 13. Kafka Connect to the rescue Kafka Connect * Suitable for EL operations (no T here) * No driver/master/worker notations * No dependency on zookeeper * Uses the well tested kafka consumer/producers * Configurable by a rest API
  • 14. But by default you need to write some code for every application for the specific domain transformations.
  • 15. KCQL is a SQL like syntax allowing streamlined configuration of Kafka Sink Connectors Examples: * INSERT INTO transactionIndex SELECT * FROM transcationTopic * INSERT INTO motionIndex SELECT motionDatetime As motionAt FROM motionTopic KCQL (Kafka Connect Query Language) Available operations: rename,ignore fields, reparation messages any many more 
 (https://github.com/datamountaineer/kafka-connect-query-language)
  • 16. How it looks like today …
  • 17. Monitor your KC apps… * JMX Metrics and logs from the APP (Jmx metrics provide detailed granularity of the state of the KC app) * Kafka Connect UI (logs and configs for each KC app available with 1 click - https://github.com/landoop)
  • 18. E & L stages are now solid, well defined with minimal duplication and highly reusable. T needs some polishing. Time to re-think of
 our T stage.
  • 19. What was the problem … Spark is great but not always the best option: -> has the notation of micro batches -> handling state is not optimal -> you need shared storage to store checkpoints and state -> you need a cluster with master, driver & workers SPARK :(
  • 20. From Spark to Kafka Streams … Kafka Streams is great because: -> is cluster and framework free -> uses kafka to store the state -> exposes the state via an API -> has no notation of micro batches -> KTables -> No need for zookeeper
  • 21. So we re-wrote one of our heavy CPU jobs in Kafka Streams Results: -> Again: No need to worry about where to store checkpoints. Everything is stored in kafka. -> No need for a cluster. Just execute `java -jar app.jar` -> Less scripting ! -> We needed to do funny stuff to make it work with scala :(
  • 22. And now we have: -> 50% less resources were used in some cases. Better CPU/Memory utilisation across instances. -> Easier auto scaling. Just start more instances of your app and kafka streams will scale automatically. -> Happier devops because they worry about the infrastructure and not the frameworks on top of that. And since the state is exposed 
 through an API we now know what happens internally inside the app at any given time
  • 23. Until now we described the engineering part of the Data Platform team. Let’s see who uses the data from our platform.
  • 24. Data Science @ HiveHome Some of the projects:
 -> Energy Breakdown Distribute the energy usage into categories (lighting, cooking etc) just by knowing the total hourly consumed energy (patent pending) -> Heating Failure Alert Try to identify if a boiler is not working properly, knowing only the internal temperature of a house
  • 25. -> as much data as possible -> as soon as possible -> as accessible as possible Data Science @ Connected Home what do scientists need ?
  • 26. Data Science @ Connected Home how to work with data scientists * Be proactive. Have the data ready in advance. * Keep the data in an flexible datastore. I.e. Elastic Search and not Cassandra. * Side by side development during each iteration of a model. (Scientists do not unit test!) * Jupyter/Zeppelin notebooks. Easily run and scale a model across your clusters.
  • 27. So what we actually learned (except from all the cool stuff we can add to our CVs) * Decouple everything. * When you start copying code and configs -> tools down and re-think of your applications setup. * Try new technologies. The initial learning curve will compensate you later. * Work tightly with data scientists so they develop similar mindset to an engineer.