Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark Summit EU talk by Bas Geerdink

715 views

Published on

Lambda Architecture with Spark in the IoT

Published in: Data & Analytics
  • Be the first to comment

Spark Summit EU talk by Bas Geerdink

  1. 1. Lambda Architecture in the IoT Fast Data Analytics with Spark Streaming and MLlib Bas Geerdink ING
  2. 2. Bas Geerdink • Chapter Lead in Analytics area at ING • Master degree in Artificial Intelligence and Informatics • Spark Certified Developer • @bgeerdink • https://www.linkedin.com/in/geerdink Who am I?
  3. 3. The Internet of …
  4. 4. • Data – Streaming data from more sources • Use cases – Combining data streams • Technology – Fast processing and scalability • Challenges – Security: encrypt the sensors/network/server What’s new in the IoT?
  5. 5. What do we want in the IoT? • FAST data: process stream of events (sensory data) • BIG data: process files/tables in batches (static data) • ONE analytics engine • ONE querying/visualization tool • Scalable environment • Reactive software
  6. 6. Lambda Architecture Source: Nathan Marz (2013)
  7. 7. Gamma/Kappa/Omega/… Architecture • Lambda Criticism: – Too complex – Two code bases – Two data stores • Alternatives: – Treat a stream as a mini-batch – Treat a batch as a stream of records – Combine batch and stream within one system
  8. 8. Big Data Reference Architecture Source: Geerdink (2013)
  9. 9. Use case: Smart Parking Recommend the best car park when driving to Amsterdam  Show the car park with the highest score, determined by • Batch (every x minutes), data from car parks • Stream (every x seconds), GPS data of cars
  10. 10. Stream Process • Get car events (GPS data) • Filter events (business rules) • Store events • For each car park in the neighborhood: – Get feature set (location, capacity, usage, …) – Combine event with previous events (running sum) – Update feature set – Predict score (machine learning) and update database
  11. 11. Batch Process • Get car park data • Clean up (remove old car data) • For each car park in the data set: – Get recent set of car data (running sum) – Update feature set – Predict score (machine learning) and update database
  12. 12. Lambda Architecture GPS updates Message Broker Analytics Engine File System Capacity updates Data Science Environment Machine learning models Business Rules Environment Scores Batch Stream 1. Get features 2. Apply rules 3. Predict score Business rules Features Selection API Front-end Data Scientist User
  13. 13. Lambda Architecture GPS updates Message Broker Analytics Engine File System Capacity updates Data Science Environment Machine learning models Business Rules Environment Scores Batch Stream 1. Get features 2. Apply rules 3. Predict score Business rules Features Selection API Front-end Data Scientist User
  14. 14. Lambda Architecture GPS updates Message Broker Analytics Engine File System Capacity updates Data Science Environment Machine learning models Business Rules Environment Scores Batch Stream 1. Get features 2. Apply rules 3. Predict score Business rules Features Selection API Front-end Data Scientist User
  15. 15. Event Processing with Kafka
  16. 16. Lambda Architecture GPS updates Message Broker Analytics Engine File System Capacity updates Data Science Environment Machine learning models Business Rules Environment Scores Batch Stream 1. Get features 2. Apply rules 3. Predict score Business rules Features Selection API Front-end Data Scientist User
  17. 17. Stream: Data Preparation
  18. 18. Lambda Architecture GPS updates Message Broker Analytics Engine File System Capacity updates Data Science Environment Machine learning models Business Rules Environment Scores Batch Stream 1. Get features 2. Apply rules 3. Predict score Business rules Features Selection API Front-end Data Scientist User
  19. 19. Stream: Create Recommendation
  20. 20. Lambda Architecture GPS updates Message Broker Analytics Engine File System Capacity updates Data Science Environment Machine learning models Business Rules Environment Scores Batch Stream 1. Get features 2. Apply rules 3. Predict score Business rules Features Selection API Front-end Data Scientist User
  21. 21. Batch Processing
  22. 22. Lambda Architecture GPS updates Message Broker Analytics Engine File System Capacity updates Data Science Environment Machine learning models Business Rules Environment Scores Batch Stream 1. Get features 2. Apply rules 3. Predict score Business rules Features Selection API Front-end Data Scientist User
  23. 23. Data Science with Zeppelin
  24. 24. Summary • The IoT requires new architecture principles: in-memory, distributed, reactive and scalable are the new normal • Lambda/Kappa/Omega reference architectures are designed for combining batch and streaming data flows • Open source tooling such as Kafka, Cassandra, and Spark adhere to these principles and design patterns
  25. 25. THANK YOU! ING is hiring  #SparkSummit

×