Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad

872 views

Published on

There are an ever increasing number of use cases, like online fraud detection, for which the response times of traditional batch processing are too slow. In order to be able to react to such events in close to real-time, you need to go beyond classical batch processing and utilize stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. These systems, however, are not sufficient on their own. For an efficient and fault-tolerant setup, you also need a message queue and storage system. One common example for setting up a fast data pipeline is the SMACK stack. SMACK stands for Spark (Streaming) – the stream processing system Mesos – the cluster orchestrator Akka – the system for providing custom actors for reacting upon the analyses Cassandra – the storage system Kafka – the message queue Setting up this kind of pipeline in a scalable, efficient and fault-tolerant manner is not trivial. First, this workshop will discuss the different components in the SMACK stack. Then, participants will get hands-on experience in setting up and maintaining data pipelines.

Published in: Data & Analytics

Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad

  1. 1. Jörg Schad, Mesosphere SMACK STACK AND BEYOND BUILDING FAST DATA PIPELINES @dcos @joerg_schad
  2. 2. © 2017 Mesosphere, Inc. All Rights Reserved. 2 Jörg Schad Software Engineer @Mesosphere @joerg_schad @joerg.mesosphere
  3. 3. © 2017 Mesosphere, Inc. All Rights Reserved. 3 MapReduce is crunching Data Ancient Times...
  4. 4. © 2016 Mesosphere, Inc. All Rights Reserved. 4 But then business demanded FAST DATA We need to turn faster! Today...
  5. 5. Batch Event ProcessingMicro-Batch Days Hours Minutes Seconds Microseconds Solves problems using predictive and prescriptive analyticsReports what has happened using descriptive analytics Predictive User InterfaceReal-time Pricing and Routing Real-time AdvertisingBilling, Chargeback Product recommendations Data Processing 5
  6. 6. 6 Fast Data Pipelines Data Ingestion Request/Response Devices Client Sensors Message Queue/Bus Microservices Distributed Storage Analytics (Streaming)
  7. 7. 7 Fast Data Pipelines Data Ingestion Request/Response Devices Client Sensors Message Queue/Bus Microservices Distributed Storage Analytics (Streaming)
  8. 8. 8 EVENTS Ubiquitous data streams from connected devices INGEST STOREANALYZE ACT Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications Fast Data Pipelines
  9. 9. SMACK Stack 9 EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications Apache Mesos Sensors Devices Clients
  10. 10. SMACK Stack 10 EVENTS Ubiquitous data streams from connected devices INGEST STOREANALYZE ACT Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications
  11. 11. Message Queues 11 EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications DC/OS Sensors Devices Clients
  12. 12. MESSAGE QUEUES Apache Kafka ØMQ, RabbitMQ, Disque (Redis-based), etc. fluentd, Logstash, Flume Akka streams cloud-only: AWS SQS Google Cloud Pub/Sub 12
  13. 13. APACHE KAFKA High-throughput, distributed, persistent publish-subscribe messaging system Originates from LinkedIn Typically used as buffer/de-coupling layer in online stream processing 13
  14. 14. fluentd 14
  15. 15. © 2017 Mesosphere, Inc. All Rights Reserved. 15 ● Scalability ! Message Type ! Log vs … ! Delivery Guarantees/Message durability ! Routing Capabilities ! Failover ! Community ! Mesos Support ;-) HOW TO CHOOSE?
  16. 16. DELIVERY GUARANTEES At most once—Messages may be lost but are never redelivered. At least once—Messages are never lost but may be redelivered. Exactly once—this is what people actually want, each message is delivered once and only once. 16 Murphy’s Law of Distributed Systems: 
 Anything that can go wrong, will go wrong … partially!
  17. 17. Routing 17 Simple Pipes Routing
  18. 18. Stream Processing 18 EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications DC/OS Sensors Devices Clients
  19. 19. STREAM PROCESSING • Apache Storm • Apache Spark • Apache Samza • Apache Flink • Apache Apex • Concord • cloud-only: AWS Kinesis,
 Google Cloud Dataflow 19
  20. 20. © 2016 Mesosphere, Inc. All Rights Reserved. 20 APACHE SPARK
  21. 21. APACHE SPARK (STREAMING 2.0) Typical Use: distributed, large-scale data processing; micro-batching 
 Why Spark Streaming? • Micro-batching creates very low latency, which can be faster • Well defined role means it fits in well with other pieces of the pipeline 21
  22. 22. © 2016 Mesosphere, Inc. All Rights Reserved. 22 ! Execution Model ! Native Streaming vs Microbatch ! Fault Tolerance Granularity ! Per record, per batch ! Delivery Guarantees ! API ! SQL ! Spark ! Performance…. ! Realtime ≠ Realtime ! Community ! Mesos Support ;-) HOW TO CHOOSE?
  23. 23. EXECUTION MODEL Micro-Batching 23 Native Streaming
  24. 24. FAULT TOLERANCE Checkpoint per “Batch” 24 Ack-Per-Record Checkpoint per Batch
  25. 25. DELIVERY GUARANTEES “Exactly once” 25 At least Once
  26. 26. Storage 26 EVENTS Ubiquitous data streams from connected devices INGEST Apache Kafka STORE Apache Spark ANALYZE Apache Cassandra ACT Akka Ingest millions of events per second Distributed & highly scalable database Real-time and batch process data Visualize data and build data driven applications DC/OS Sensors Devices Clients
  27. 27. Datastores 27
  28. 28. Data Model 28 GraphRelational Document ! Schema ! SQL ! Foreign Keys/Joins ! OLTP/ OLAP ! Simple ! Scalable ! Cache FilesTime-Series ! Complex relations ! Social Graph ! Recommen dation ! Fraud detections ! Schema- Less ! Semi- structured queries ! Product catalogue ! Session data Key-Value
  29. 29. © 2017 Mesosphere, Inc. All Rights Reserved. 29 Datacenter
  30. 30. NAIVE APPROACH 30 Typical Datacenter
 siloed, over-provisioned servers,
 low utilization Industry Average
 12-15% utilization mySQL microservice Cassandra Spark/Hadoop Kafka
  31. 31. © 2017 Mesosphere, Inc. All Rights Reserved. 31
  32. 32. MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS 32 Typical Datacenter
 siloed, over-provisioned servers,
 low utilization Mesos/ DC/OS
 automated schedulers, workload multiplexing onto the same machines mySQL microservice Cassandra Spark/Hadoop Kafka
  33. 33. Apache Mesos • A top-level Apache project • A cluster resource negotiator • Scalable to 10,000s of nodes • Fault-tolerant, battle-tested • An SDK for distributed apps • Native Docker support 33
  34. 34. MESOS: FUNDAMENTAL ARCHITECTURE 34 Mesos Master Mesos Master Mesos Master Mesos AgentMesos Agent Service Cassandra Executor Cassandra Task Cassandra Scheduler Container Scheduler Spark Scheduler Spark Executor Spark
 Task Mesos AgentMesos Agent Service Docker Executor Docker
 Task Spark Executor Spark
 Task Two-level Scheduling 1. Agents advertise resources to Master 2. Master offers resources to Framework 3. Framework rejects / uses resources 4. Agent reports task status to Master
  35. 35. Challenges 35 • Mesos is just the kernel • Need for OS: • Scheduler • Monitoring • Security • CLI • Package Repository • …
  36. 36. © 2017 Mesosphere, Inc. All Rights Reserved. 36 Operating Systems
  37. 37. © 2017 Mesosphere, Inc. All Rights Reserved. 37
  38. 38. DC/OS 38 ! Datacenter-wide services to power your apps ! Turnkey installation and lifecycle management DC/OS Universe DC/OS Any Infrastructure ! Container operations & big data operations ! Security, fault tolerance & high availability ! Open Source (ASL2.0) ! Based on Apache Mesos ! Production proven at scale ! Requires only a modern linux distro 
 (windows coming soon) ! Hybrid Datacenter Datacenter Operating System (DC/OS) Distributed Systems Kernel (Mesos) Big Data + Analytics EnginesMicroservices ( containers) Streaming Batch Machine Learning Analytics Search Time Series SQL / NoSQL Databases Modern App Components Any Infrastructure (Physical, Virtual, Cloud)
  39. 39. Developing Distributed Services 39 • Failures (Task, Node, Network,…) • Zero Downtime Upgrades • Persistence • Multiple Frameworks • Service Discovery • Metrics • ….
  40. 40. © 2017 Mesosphere, Inc. All Rights Reserved. 40 Operating Distributed Services
  41. 41. Distributed Services: Challenges 41 ● As simple as Docker Compose ● Don’t need to write any Java code ● Don’t need to be an app expert ● Need to be an app expert ● Need to write a little Java code ● Don’t want to understand DC/OS ● Can’t use the default scheduler ● Need to write a lot of Java code ● Willing to understand DC/OS Custom Jobs & Strategies Build Your Own Scheduler Defaults
  42. 42. Demo Time 42 Generator Display 1. Financial data created by generator 2. Written to Kafka topics 3. Kafka Topics consumed by Spark or Flink 4. Results written back into Kafka stream (another topic) 7. Results displayed
  43. 43. © 2017 Mesosphere, Inc. All Rights Reserved. 43 Keep it running!
  44. 44. SERVICE OPERATIONS 44 ● Configuration Updates (ex: Scaling, re-configuration) ● Binary Upgrades ● Cluster Maintenance (ex: Backup, Restore, Restart) ● Monitor progress of operations ● Debug any runtime blockages
  45. 45. Questions? @dcos @joerg_schad dcos.io Demo

×