Streaming 4 billion Messages per day. Lessons Learned.

Lessons learned from designing a pipeline that handles billions of messages per day.



  1. 1. STREAMING 4 BILLION MESSAGES: LESSONS LEARNED Angelos Petheriotis (@apetheriotis) HiveHome
  2. 2. What does HiveHome do? It provides a range of different sensors that all work together to build a smart and connected home.
  3. 3. How is Big Data generated at Connected Home? …more devices to be released. How is it accessible? Avro messages through Kafka, on contracted topics. Some numbers? 4+ billion messages from the input topics to the Data Platform (increasing by 1000s every day). Is it useful? Many CH & BG services are based solely on Big Data projects.
  4. 4. Processing 50K msgs/s from IoT devices you learn to: * design a microservices architecture that won’t wake you up at 03:00 for a simple restart * not duplicate stuff (code or configs): a significant % of our time we are plumbers, so let’s make our lives easy * be resilient to failures, especially when dealing with stateful applications * communicate/collaborate with data scientists (mathematicians != engineers)
  5. 5. Microservices in real-time pipelines. Try to: * Decouple applications * Stick to the single responsibility principle * Make apps portable * Make apps immutable * Make testing portable and easy. Tooling: Docker & Kubernetes, GoCD (CI/CD). Decouple ETL; police EL with the Schema Registry.
  6. 6. Monolithic approach (T + L): average the internal temperatures per house per 30 minutes and persist to ES. Pros: * We only support/monitor one app! * All in one place, and you don’t have to remember git repos etc. Cons: * The job has 2 responsibilities * Hard to test * If we want to persist to Cassandra we need to reprocess the messages * We cannot reuse the app
  7. 7. Microservices-based approach (a T stage feeding separate load stages into C* and ES). Pros: * 1 responsibility per app * Easy to replace the load job to ES with a Cassandra job * Easy to replay data * We CAN generalise/reuse the L stage. Cons: * We need to support/monitor 2 apps :(
  8. 8. We went through what our infrastructure looks like. Let’s see what we deploy on that infrastructure.
  9. 9. We used to write a lot of Spark apps for E & L operations: > internalTemperature to Elasticsearch > internalTemperature to Cassandra > motionDetected to Elasticsearch > deviceSignal to Cassandra > …
  10. 10. But we replaced our Spark jobs because we ended up with: * Duplicated code all over the place for simple tasks * Too many GitHub repos, hard to keep them in your head * Too much time to provision a small cluster to test the app * A lot of resources (£££) wasted because of the master/driver dependencies of Spark
  11. 11. The goal was to define the E & L stages *once* as a generic re-usable component that handles: offset management, serialization/deserialization, partitioning/scalability, fault tolerance/fail-over, and Schema Registry integration.
  12. 12. Kafka Connect to the rescue. Kafka Connect: * Suitable for EL operations (no T here) * No driver/master/worker notion * No dependency on ZooKeeper * Uses the well-tested Kafka consumers/producers * Configurable via a REST API
  13. 13. But by default you still need to write some code per application for its specific domain transformations.
  14. 14. KCQL (Kafka Connect Query Language) is a SQL-like syntax allowing streamlined configuration of Kafka sink connectors. Examples: * INSERT INTO transactionIndex SELECT * FROM transactionTopic * INSERT INTO motionIndex SELECT motionDatetime AS motionAt FROM motionTopic. Available operations: rename fields, ignore fields, repartition messages and many more (https://github.com/datamountaineer/kafka-connect-query-language)
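
To make “configuration instead of code” concrete, here is a minimal sketch (not HiveHome’s actual setup) of registering a sink connector through the Kafka Connect REST API, with the field mapping expressed as KCQL. The connector class placeholder and the `connect.elasticsearch.kcql` property name are assumptions; the exact keys depend on the connector and its version, so check the connector’s documentation.

```scala
// A sketch only: POST a connector config (with a KCQL mapping) to the Kafka Connect REST API.
// Requires Java 11+ for java.net.http. The connector class and KCQL property name are assumptions.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RegisterMotionSink extends App {
  val config =
    """{
      |  "name": "motion-to-es",
      |  "config": {
      |    "connector.class": "<your Elasticsearch sink connector class>",
      |    "tasks.max": "2",
      |    "topics": "motionTopic",
      |    "connect.elasticsearch.kcql": "INSERT INTO motionIndex SELECT motionDatetime AS motionAt FROM motionTopic"
      |  }
      |}""".stripMargin

  val request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(config))
    .build()

  val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()} ${response.body()}")   // 201 Created when the connector is accepted
}
```

With this style, everything about an E or L stage lives in that JSON payload rather than in a per-application jar that has to be built, tested and deployed.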
  15. 15. What it looks like today…
  16. 16. Monitor your KC apps… * JMX metrics and logs from the app (JMX metrics provide detailed granularity on the state of the KC app) * Kafka Connect UI (logs and configs for each KC app available with 1 click - https://github.com/landoop)
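
As a rough illustration of the JMX bullet, a small utility like the sketch below can list a Connect worker’s MBeans remotely, assuming the worker JVM was started with remote JMX enabled. The port 9999 and the `kafka.connect:*` ObjectName pattern are assumptions; the exact MBean names vary by Connect version.

```scala
// A sketch only: list the Kafka Connect JMX MBeans exposed by a worker with remote JMX enabled.
import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}
import scala.collection.JavaConverters._

object ConnectMetrics extends App {
  // Assumed JMX endpoint; matches a worker started with com.sun.management.jmxremote.port=9999.
  val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi")
  val connector = JMXConnectorFactory.connect(url)
  try {
    val mbeans = connector.getMBeanServerConnection
    // Print every MBean under the kafka.connect domain (connector, task and worker metrics).
    mbeans.queryNames(new ObjectName("kafka.connect:*"), null).asScala.foreach(println)
  } finally connector.close()
}
```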
  17. 17. E & L stages are now solid, well defined, with minimal duplication and highly reusable. T needs some polishing. Time to re-think our T stage.
  18. 18. What was the problem… Spark is great but not always the best option: -> it has the notion of micro-batches -> handling state is not optimal -> you need shared storage for checkpoints and state -> you need a cluster with master, driver & workers. Spark :(
  19. 19. From Spark to Kafka Streams… Kafka Streams is great because: -> it is cluster- and framework-free -> it uses Kafka to store the state -> it exposes the state via an API -> it has no notion of micro-batches -> KTables -> no need for ZooKeeper
  20. 20. So we re-wrote one of our CPU-heavy jobs in Kafka Streams. Results: -> Again: no need to worry about where to store checkpoints, everything is stored in Kafka. -> No need for a cluster, just execute `java -jar app.jar` -> Less scripting! -> We needed to do funny stuff to make it work with Scala :(
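
For a feel of what such a rewrite looks like (including the “funny stuff” needed to drive the Java DSL from Scala before a Scala wrapper existed), here is a minimal sketch of the slide-6 job: averaging internal temperatures per house over 30-minute windows. It targets the Kafka Streams 2.x API; the topic names, the string-encoded (sum, count) aggregate and the store name are assumptions for illustration, not HiveHome’s actual code.

```scala
// A sketch only: per-house 30-minute temperature averages in Kafka Streams (2.x API), from Scala.
import java.time.Duration
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream._
import org.apache.kafka.streams.state.WindowStore

object TemperatureAverages extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "internal-temperature-averages")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Double().getClass)

  val builder = new StreamsBuilder()
  // Input: houseId -> temperature reading; in reality this would be an Avro record on a contracted topic.
  val readings: KStream[String, java.lang.Double] =
    builder.stream[String, java.lang.Double]("internalTemperature")

  // The "funny stuff": the Java DSL wants its own functional interfaces, so we spell them out.
  val init = new Initializer[String] { override def apply(): String = "0.0,0" } // "sum,count"
  val sumAndCount = new Aggregator[String, java.lang.Double, String] {
    override def apply(houseId: String, temp: java.lang.Double, acc: String): String = {
      val Array(sum, count) = acc.split(",")
      s"${sum.toDouble + temp},${count.toLong + 1}"
    }
  }

  // Keep a (sum, count) per house per 30-minute window in a named, queryable store backed by Kafka.
  val aggregates: KTable[Windowed[String], String] = readings
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(30)))
    .aggregate(init, sumAndCount,
      Materialized.as[String, String, WindowStore[Bytes, Array[Byte]]]("internal-temp-averages")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.String()))

  // Re-key to "houseId@windowStart", turn (sum, count) into the average, and write it out
  // so a Kafka Connect sink can load it into ES or C*.
  aggregates
    .toStream(new KeyValueMapper[Windowed[String], String, String] {
      override def apply(key: Windowed[String], acc: String): String =
        s"${key.key()}@${key.window().start()}"
    })
    .mapValues(new ValueMapper[String, java.lang.Double] {
      override def apply(acc: String): java.lang.Double = {
        val Array(sum, count) = acc.split(",")
        sum.toDouble / count.toLong
      }
    })
    .to("internalTemperatureAvg", Produced.`with`(Serdes.String(), Serdes.Double()))

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()                                   // just `java -jar app.jar`, no cluster needed
  sys.addShutdownHook(streams.close())
}
```

The window store is given an explicit name so it can be queried from outside the topology, which is what the next slide relies on.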
  21. 21. And now we have: -> 50% fewer resources used in some cases, with better CPU/memory utilisation across instances. -> Easier auto-scaling: just start more instances of your app and Kafka Streams will scale automatically. -> Happier devops, because they worry about the infrastructure and not the frameworks on top of it. And since the state is exposed through an API, we now know what happens inside the app at any given time.
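
The “state exposed through an API” point is Kafka Streams’ interactive queries. A minimal sketch, assuming the named `internal-temp-averages` window store from the sketch above, could look like this (the 2.x `store(name, type)` signature is used; newer versions take `StoreQueryParameters` instead):

```scala
// A sketch only: query the windowed (sum, count) aggregates held by a running Kafka Streams app.
import java.time.Instant
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.state.QueryableStoreTypes

object InteractiveQueries {
  def printRecentAverages(streams: KafkaStreams, houseId: String): Unit = {
    // Store name chosen in the aggregation sketch above (an assumption, not a fixed convention).
    val store = streams.store("internal-temp-averages",
      QueryableStoreTypes.windowStore[String, String]())

    // All 30-minute windows for this house from the last 24 hours.
    val now  = Instant.now()
    val iter = store.fetch(houseId, now.minusSeconds(24 * 3600), now)
    try {
      while (iter.hasNext) {
        val kv = iter.next()                 // key = window start (epoch millis), value = "sum,count"
        val Array(sum, count) = kv.value.split(",")
        println(s"$houseId window=${kv.key} avg=${sum.toDouble / count.toLong}")
      }
    } finally iter.close()
  }
}
```

Wrapping a function like this behind a thin HTTP endpoint is what lets us see what happens inside the app at any given time.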
  22. 22. So far we have described the engineering side of the Data Platform team. Let’s see who uses the data from our platform.
  23. 23. Data Science @ HiveHome. Some of the projects: -> Energy Breakdown: distribute the energy usage into categories (lighting, cooking etc.) just by knowing the total hourly consumed energy (patent pending) -> Heating Failure Alert: try to identify if a boiler is not working properly, knowing only the internal temperature of a house
  24. 24. Data Science @ Connected Home: what do scientists need? -> as much data as possible -> as soon as possible -> as accessible as possible
  25. 25. Data Science @ Connected Home: how to work with data scientists. * Be proactive: have the data ready in advance. * Keep the data in a flexible datastore, e.g. Elasticsearch rather than Cassandra. * Side-by-side development during each iteration of a model (scientists do not unit test!). * Jupyter/Zeppelin notebooks: easily run and scale a model across your clusters.
  26. 26. So what did we actually learn (apart from all the cool stuff we can add to our CVs)? * Decouple everything. * When you start copying code and configs -> tools down and re-think your application setup. * Try new technologies: the initial learning curve will pay off later. * Work closely with data scientists so they develop a mindset similar to an engineer’s.
