Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Webinar - Big Data: Let's SMACK - Jorg Schad

852 views

Published on

For many use cases such as fraud detection or reacting on sensor data the response times of traditional batch processing are simply to slow. In order to be able to react to such events close to real-time, we need to go beyond the classical batch processing and utilize stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. But these systems are not sufficient by itself. One common example for such fast data pipelines is the SMACK stack using Apache Spark, Mesos, Kafka, Akka, Cassandra, Kafka.

Published in: Technology
  • Be the first to comment

Webinar - Big Data: Let's SMACK - Jorg Schad

  1. 1. Big Data: Let's SMACK @joerg_schad
  2. 2. You can come and see me at Codemotion Milan 2017! Talk: NO ONE PUTS Java IN THE CONTAINER November 10th, 15:10-15.50 http://milan2017.codemotionworld.com/
  3. 3. © 2017 Mesosphere, Inc. All Rights Reserved. 3 Jörg Schad Software Engineer @Mesosphere @joerg_schad @joerg.mesosphere
  4. 4. © 2017 Mesosphere, Inc. All Rights Reserved. 4 MapReduce is crunching Data Ancient Times...
  5. 5. © 2016 Mesosphere, Inc. All Rights Reserved. 5 But then business demanded FAST DATA We need to turn faster! Today...
  6. 6. Batch Event ProcessingMicro-Batch Days Hours Minutes Seconds Microseconds Solves problems using predictive and prescriptive analyticsReports what has happened using descriptive analytics Predictive User InterfaceReal-time Pricing and Routing Real-time AdvertisingBilling, Chargeback Product recommendations (Fast) Data Processing
  7. 7. Fast Data Pipeline EVENTS Ubiquitous data streams from connected devices INGEST STOREANALYZE ACT Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications
  8. 8. The SMACK Stack EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications Apache Mesos Sensors Devices Clients
  9. 9. © 2017 Mesosphere, Inc. All Rights Reserved. 9 Datacenter
  10. 10. NAIVE APPROACH Typical Datacenter
 siloed, over-provisioned servers,
 low utilization Industry Average
 12-15% utilization mySQL microservice Cassandra Spark/Hadoop Kafka
  11. 11. © 2017 Mesosphere, Inc. All Rights Reserved. 11
  12. 12. MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS Typical Datacenter
 siloed, over-provisioned servers,
 low utilization Mesos/ DC/OS
 automated schedulers, workload multiplexing onto the same machines mySQL microservice Cassandra Spark/Hadoop Kafka
  13. 13. Apache Mesos • A top-level Apache project • A cluster resource negotiator • Scalable to 10,000s of nodes • Fault-tolerant, battle-tested • An SDK for distributed apps • Native Docker support
  14. 14. MESOS: FUNDAMENTAL ARCHITECTURE Mesos Master Mesos Master Mesos Master Mesos AgentMesos Agent Service Cassandra Executor Cassandra Task Cassandra Scheduler Container Scheduler Spark Scheduler Spark Executor Spark
 Task Mesos AgentMesos Agent Service Docker Executor Docker
 Task Spark Executor Spark
 Task Two-level Scheduling 1. Agents advertise resources to Master 2. Master offers resources to Framework 3. Framework rejects / uses resources 4. Agent reports task status to Master
  15. 15. © 2017 Mesosphere, Inc. All Rights Reserved. 15
  16. 16. Datacenter Operating System (DC/OS) Distributed Systems Kernel (Mesos) DC/OS ENABLES MODERN DISTRIBUTED APPS Big Data + Analytics EnginesMicroservices (in containers) Streaming Batch Machine Learning Analytics Functions & Logic Search Time Series SQL / NoSQL Databases Modern App Components Any Infrastructure (Physical, Virtual, Cloud)
  17. 17. The SMACK Stack EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications Apache Mesos Sensors Devices Clients
  18. 18. The SMACK Stack EVENTS Ubiquitous data streams from connected devices INGEST STOREANALYZE ACT Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications
  19. 19. The SMACK Stack EVENTS Ubiquitous data streams from connected devices INGEST STOREANALYZE ACT Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications
  20. 20. DATA PROCESSING AT HYPERSCALE EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications DC/OS Sensors Devices Clients
  21. 21. MESSAGE QUEUES Apache Kafka ØMQ, RabbitMQ, Disque (Redis-based), etc. fluentd, Logstash, Flume Akka streams cloud-only: AWS SQS Google Cloud Pub/Sub
  22. 22. APACHE KAFKA High-throughput, distributed, persistent publish-subscribe messaging system Originates from LinkedIn Typically used as buffer/de-coupling layer in online stream processing
  23. 23. fluentd
  24. 24. © 2016 Mesosphere, Inc. All Rights Reserved. 24 ● Scalability ! Message Type ! Log vs … ! Delivery Guarantees/Message durability ! Routing Capabilities ! Failover ! Community ! Mesos Support ;-) HOW TO CHOOSE?
  25. 25. DELIVERY GUARANTEES At most once—Messages may be lost but are never redelivered. At least once—Messages are never lost but may be redelivered. Exactly once—this is what people actually want, each message is delivered once and only once. Murphy’s Law of Distributed Systems: 
 Anything that can go wrong, will go wrong … partially!
  26. 26. Routing Simple Pipes Routing
  27. 27. DATA PROCESSING AT HYPERSCALE EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications DC/OS Sensors Devices Clients
  28. 28. STREAM PROCESSING • Apache Storm • Apache Spark • Apache Samza • Apache Flink • Apache Apex • Concord • cloud-only: AWS Kinesis,
 Google Cloud Dataflow
  29. 29. © 2016 Mesosphere, Inc. All Rights Reserved. 29 APACHE SPARK
  30. 30. APACHE SPARK (STREAMING) Typical Use: distributed, large-scale data processing; micro-batching 
 Why Spark Streaming? • Micro-batching creates very low latency, which can be faster • Well defined role means it fits in well with other pieces of the pipeline
  31. 31. © 2016 Mesosphere, Inc. All Rights Reserved. 31 ! Execution Model ! Native Streaming vs Microbatch ! Fault Tolerance Granularity ! Per record, per batch ! Delivery Guarantees ! API ! SQL ! Spark ! Performance…. ! Realtime ≠ Realtime ! Community ! Mesos Support ;-) HOW TO CHOOSE?
  32. 32. EXECUTION MODEL Micro-Batching Native Streaming
  33. 33. FAULT TOLERANCE Checkpoint per “Batch”Ack-Per-Record Checkpoint per Batch
  34. 34. DELIVERY GUARANTEES “Exactly once”At least Once
  35. 35. DATA PROCESSING AT HYPERSCALE EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications DC/OS Sensors Devices Clients
  36. 36. Datastores
  37. 37. Data Model Key-Value GraphRelational Document ! Schema ! SQL ! Foreign Keys/Joins ! OLTP/ OLAP ! Simple ! Scalable ! Cache FilesTime-Series ! Complex relations ! Social Graph ! Recommen dation ! Fraud detections ! Schema- Less ! Semi- structured queries ! Product catalogue ! Session data
  38. 38. Demo Time Generator Display 1. Financial data created by generator 2. Written to Kafka topics 3. Kafka Topics consumed by Flink 4. Flink pipeline operates on Kafka data 5. Results written back into Kafka stream (another topic) 6. Results displayed
  39. 39. © 2017 Mesosphere, Inc. All Rights Reserved. 39 Keep it running!
  40. 40. SERVICE OPERATIONS ● Configuration Updates (ex: Scaling, re-configuration) ● Binary Upgrades ● Cluster Maintenance (ex: Backup, Restore, Restart) ● Monitor progress of operations ● Debug any runtime blockages
  41. 41. CODEMOTION MILAN November 10/11th,2017 http://milan2017.codemotionworld.com/ See you at

×