@wip
Building Reactive Fast Data & the Data Lake
with Akka, Kafka, Spark
AJUG, Jan 17, 2017
Todd Fritz
Cox Automotive, Inc.
Akka, who uses it?
55Source: “Akka Actor Introduction”, Gene
Akka
56Source: “Introducing Akka”, Jonas Bonér
• Vision
• Simpler: Concurrency, Scalability, Fault-Tolerance
• With a sing...
Akka: Perfect for the Cloud
57Source: “Introducing Akka”, Jonas Bonér
• Elastic and dynamic
• Fault tolerant & self healin...
Akka 101
58Source: “Introducing Akka”, Jonas Bonér
• Akka’s unit of code is called an Actor
• The vehicle to create concur...
Akka: How to use Actors?
59Source: “Introducing Akka”, Jonas Bonér
• Alternative to:
• A thread
• An object instance or co...
Akka: What is it?
60Source: http://bit.ly/hewitt-on-actors
• Carl Hewitt’s definition
• Fundamental unit of computation th...
Akka: Core Actor Operations
61Source: “Introducing Akka”, Jonas Bonér
0. Define
1. Create
2. Send
3. Become
4. Supervise
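The five operations listed above can be sketched in a few lines of plain Python. This is a toy illustration only, not Akka's API: the `Actor` and `Greeter` classes, the `on_failure` hook, and the `_STOP` marker are hypothetical names, and a thread plus a queue stands in for Akka's dispatcher and mailbox.

```python
import queue
import threading

_STOP = object()  # hypothetical marker used to shut the toy actor down


class Actor:
    """Toy actor: private state, a mailbox, one message handled at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._behavior = self.receive            # current message handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()                     # 1. Create: start listening

    def send(self, message):
        """2. Send: enqueue and return immediately (fire-and-forget)."""
        self._mailbox.put(message)

    def become(self, behavior):
        """3. Become: swap the handler used for subsequent messages."""
        self._behavior = behavior

    def receive(self, message):                  # 0. Define: override in subclasses
        raise NotImplementedError

    def on_failure(self, error, message):
        """4. Supervise (stand-in): reset to the initial behavior on failure."""
        self._behavior = self.receive

    def stop(self):
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                return
            try:
                self._behavior(message)
            except Exception as error:
                self.on_failure(error, message)


class Greeter(Actor):
    def __init__(self):
        self.log = []                            # state touched only by the actor thread
        super().__init__()

    def receive(self, message):
        if message == "mute":
            self.become(self.muted)
        else:
            self.log.append(f"hello {message}")

    def muted(self, message):
        if message == "unmute":
            self.become(self.receive)
```

Sending `"ann"`, `"mute"`, `"bob"`, `"unmute"`, `"cy"` to a `Greeter` and then calling `stop()` leaves greetings for `ann` and `cy` only, because the actor changed its own behavior between messages. In Akka itself, supervision is handled by a parent actor rather than a method on the failing actor.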
Akka: Define Operation
62Source: “Introducing Akka”, Jonas Bonér
0. Define
• Define the message (class) the actor should r...
Akka: Create Operation
63Source: “Introducing Akka”, Jonas Bonér
1. Create
• Yes, creates new actor. From ActorSystem, the...
Akka: Send Operation
64Source: “Introducing Akka”, Jonas Bonér
2. Send
• Sends a message to an Actor
• Asynch and on-block...
Akka: Performance
65Source: “Introducing Akka”, Jonas Bonér
+50 million messages per second !!!
Akka: Remote Deployment
66Source: “Introducing Akka”, Jonas Bonér
Akka: Become Operation
67Source: “Introducing Akka”, Jonas Bonér
3. Become
• Dynamically redefine Actor’s behavior
• React...
Akka: Become Operation – Why?
68Source: “Introducing Akka”, Jonas Bonér
Why do this?
• A busy actor can become an Actor Po...
Akka: Failure Management
69Source: “Introducing Akka”, Jonas Bonér
In Java/C/C+ etc.
• Single thread of control
• If that ...
Akka: Supervise Operation
70Source: “Introducing Akka”, Jonas Bonér
4. Supervise
• Manage another Actor’s failures (or the...
Akka Streaming
71Source: Akka Documentation
• Akka Streams API is completely decoupled from Reactive Streams interfaces
• ...
Akka Streaming
72Source: Akka Documentation
• Immutable building blocks / blueprints enabled for libraries, include
• Sour...
Akka HTTP
73Source: Akka Documentation
• Built atop Akka Streams
• Can expose an incoming connection in form of a Source i...
• Akka Streams Integration
• https://github.com/akka/alpakka
• http://developer.lightbend.com/docs/alpakka/current/
• Adds...
The Data Lake for Analytics, App Dev
“Lake Wobegon, the little town that time forgot
and the decades cannot improve.”
- Ga...
The Data Lake
76
• Note: This section will be expanded for DevNexus.
The Data Lake
77
• The Data Lake is where a copy of much of the data from source systems “ends
up”, via Fast Data, etc.
• E...
The Dark Side of the Data Lake
78Source: “Gartner Says Beware of the Data Lake Fallacy”
• Many companies tend to vacuum da...
More to Come…
79Source: “Gartner Says Beware of the Data Lake Fallacy”
• At DevNexus 2017
• Thurs, Feb 23 @ 2:30pm
• http:...
Presentation Improvements for DevNexus
“Build something 100 people love,
not something 1 million people kind of like.”
- B...
• More diagrams, fewer words
• Refine, refine, refine
• Mix in coding examples
• Improve contrast between older architectu...
Questions?
“I'm sorry, if you were right, I'd agree with you.”
- Robin Williams
82
• The Reactive Manifesto (http://www.reactivemanifesto.org/)
• Chaos Monkey? Use Linear Fault Driven Testing instead.
• ht...
Ideas
84

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark


In this session, we will discuss:
* reactive architecture tenets
* distributed “fast data” streams
* application and analytics focused Data Lake

Enterprise level concerns and the importance of holistic governance, operational management, and a Metadata Lake will be conceptually investigated. The next level of detail will be to explore what a prospective architecture looks like at scale with Terabytes of ingestion per day, how scale puts pressure on an architecture, and how to be successful without losing data in a mission critical system via resilient, self-healing, scalable technologies. DevOps and application architecture concerns will be first-class themes throughout.

Reactive principles and technology will be the second act of this talk. Kafka. Akka. Spark. Various streaming technologies (Kafka Streams, Akka Streams, Spark Streaming) will be reviewed to identify what they are best suited for. The fast data pipeline discussion will center around Kafka, Akka, and Apache Flink (Lightbend Fast Data platform). We’ll also walk through an exciting addition to the Akka family, Alpakka, which is a Camel equivalent for Enterprise Integration Patterns.

The final act will be to dive into the Data Lake, from both an analytics and application development perspective. Technologies used to explain concepts will include Amazon and Hadoop. A Data Lake may service multiple analytics consumers with various “views” (and access levels) of data. It may also be a participant of various applications, perhaps by acting as a centralized source for reference data or common middleware (in turn feeding the analytics aspect). The concept of the Metadata Lake to apply structure, meaning and purpose will be an over-arching success factor for a Data Lake. The difference between the Data Lake and Metadata Lake is conceptually similar to a Halocline… Various technologies (Iglu/Snowplow and more) will be discussed from a feature standpoint to flesh out the technology capabilities needed for Data Lake governance.

Published in: Software

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

  1. 1. @wip Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark AJUG, Jan 17, 2017 Todd Fritz Cox Automotive, Inc.
  2. 2. This presentation is a draft of what will be presented, next month, at DevNexus. “Sneak peek” Reactions and questions may influence evolution of content Disclaimer 2
  3. 3. 3
  4. 4. tafritz@outlook.com Todd.Fritz@coxautoinc.com www.linkedin.com/in/tfritz http://www.slideshare.net/ToddFritz https://github.com/todd-fritz 4
  5. 5. License: CC BY-SA 3.0 5 • Senior Solutions Architect @ Cox Automotive, Inc. • Strategic Data Services • The opinions contained herein may not represent my employer, but I believe they should. • Background is building platforms, middleware, MoM, EIP, EDA, etc • DevOps mentality • Exposed to many environments, technologies, people • Life-long learner and always curious • Novice bass player • Scuba diver About Me
  6. 6. DevNexus 2015 http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015 Great Wide Open - Atlanta (April 3, 2014) http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud AJUG (April 15, 2014) http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-group Video - https://vimeo.com/94556976 Previous Presentations 6
  7. 7. • Forward • The Briefest History • Background: Reactive Systems, Patterns, Implementations • The Enterprise • Fast Data • The Data Lake for Analytics, App Dev • Presentation Improvements Planned for DevNexus • Questions • Resources Agenda 7
  8. 8. Forward “Our greatest glory is not in never failing, but in rising every time we fall.” -Confucius 8
  9. 9. • Why is Reactive Important? • Reactive Systems and Programming != Reactive Management • Reactive underpins every use case, every business capability, every product feature • Tendency for companies to survey market and select products to match perceived business need • Importance of vision, governance, tenancy, entitlements, security • Both Process and Technology • Use Case: Build a system that can scale thousands to millions of users, handling millions to billions of messages • Real time data, for application components, middleware, data processing, analytics • A journey of innovation and successive refinement Onward 9
  10. 10. The Briefest History History is the sum total of things that could have been avoided. - Konrad Adenauer 10
  11. 11. • “Reactive” is not new • Underlying principles go back almost 50 years, to the days of punch cards • Erlang • Built to scale, handle extremely high volume • Extensive use in Telco for decades; many billions of messages • Actors • Bedrock is messaging • We’ve been using this technique for decades, via many technologies • Improvements over time around component isolation, decoupling to benefit scalability and concurrency The Briefest History 11
  12. 12. Reactive Systems, Patterns, Implementations “People who think they know everything are a great annoyance to those of us who do.” - Isaac Asimov 12
  13. 13. • The Reactive Manifesto • http://www.reactivemanifesto.org • Many organizations independently building software to new, and similar, patterns • Increasing pressures to simplify, scale, innovate and improve customer experience • Increasing proliferation and interoperability of system environments and connected devices • More data with contextual use cases • Yesterday’s architectures just don’t cut it • Need more flexible, resilient, robust systems • Solution through evolution • Spawn of Actors (Akka, Erlang) • Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns Reactive Systems 13
  14. 14. “…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of thinking about systems architecture and design in a distributed environment where implementation techniques, tooling, and design patterns are components of a larger whole.”* “A Reactive System is based on an architectural style that allows … multiple individual services to coalesce as a single unit and react to its surroundings while remaining aware of each … scale up/down, load balance and even ...” * (proactive steps) Components may qualify as reactive, but when combined, does not guarantee a Reactive System What is Reactive? A Set of Design Principles 14* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  15. 15. 1. Reactive Systems (architecture, design) 2. Reactive Programming (declarative event-based) 3. Functional Reactive Programming (FRP) NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava, Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07-22-essence-of-frp.html Reactive Begets* 15* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  16. 16. • Reactive Programming != Functional Reactive Programming • Subset of Asynch Programming • New information drives logic flow vs. control flow driven by thread-of-execution • Avoids resource contention (Amdahl’s Law) that impedes scalability. • Decompose into multiple steps that are asynch and nonblocking • Combine into a composed workflow • Reactive Systems very rarely block • Reactive API libraries are either declarative (functional composition, combinators) or callback-based (attached to events, executed during dataflow chain) with stream-based operators (windowing, triggers, etc) • Reactive programming is event-driven • Reactive systems are message driven • Wait? What? More about this distinction in a few slides. Reactive Programming* 16* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
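The "decompose into asynch, nonblocking steps, then combine into a composed workflow" idea on this slide might look like the following in Python's `asyncio`. It is a minimal sketch: the function names, the price table, and the tax rate are made up for illustration, and `asyncio.sleep` stands in for real non-blocking I/O.

```python
import asyncio


async def fetch_price(sku: str) -> float:
    await asyncio.sleep(0.01)          # stands in for a non-blocking I/O call
    return {"a1": 10.0, "b2": 4.5}[sku]


async def apply_tax(price: float, rate: float = 0.1) -> float:
    await asyncio.sleep(0)             # yield to the event loop instead of blocking it
    return round(price * (1 + rate), 2)


async def quote(sku: str) -> float:
    # Compose the two steps; the arrival of each result drives the next step.
    return await apply_tax(await fetch_price(sku))


async def quote_all(skus: list[str]) -> list[float]:
    # Independent workflows run concurrently on a single thread.
    return await asyncio.gather(*(quote(sku) for sku in skus))
```

Calling `asyncio.run(quote_all(["a1", "b2"]))` runs both quotes concurrently without dedicating a thread to either one.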
  17. 17. • Reactive Programming is related to Dataflow Programming • Both emphasize flow of data vs. flow of control • Examples • Futures / Promises • (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured pipelines connecting sources and destinations. • Dataflow Variables – single assignment variables (AKA a cell in Excel) whereby a value change can trigger dependent functions to produce new values (state) • Technologies that do this, include • Akka Streams • RxJava • Vert.x • Reactive Streams Specification • The standard for interoperability amongst Reactive Programming libraries on the JVM • “…an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.” About Dataflow Programming* 17* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
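Back pressure, as described on this slide, can be illustrated with a bounded queue: a fast source is suspended whenever a slow sink falls behind. The sketch below uses Python's `asyncio.Queue` purely as an illustration; it does not implement the Reactive Streams interfaces, and `source`, `sink`, and `run_pipeline` are hypothetical names.

```python
import asyncio


async def source(out: asyncio.Queue, items):
    for item in items:
        await out.put(item)            # suspends while the queue is full: back pressure
    await out.put(None)                # end-of-stream marker


async def sink(inp: asyncio.Queue, results: list, depths: list):
    while True:
        depths.append(inp.qsize())     # record how much is buffered right now
        item = await inp.get()
        if item is None:
            return
        await asyncio.sleep(0.001)     # a deliberately slow consumer
        results.append(item * 2)


async def run_pipeline(items, capacity: int = 2):
    buffer = asyncio.Queue(maxsize=capacity)
    results, depths = [], []
    await asyncio.gather(source(buffer, items), sink(buffer, results, depths))
    return results, max(depths)
```

However many items the source has, the buffer never holds more than `capacity` of them, so memory use stays bounded and the producer's pace is set by the consumer.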
  18. 18. • Notable benefits • Increased (efficient) utilization of compute resources (incl. multi-core) • Increased performance via serialization reduction • Amdahl’s Law • Neil Gunther’s Universal Scalability Law. To quantify the effects of contention and coordination in concurrent, distributed systems. This explains how the cost of coherency in a system can lead to negative results, as new resources are added to the system. • Productivity. Reactive libraries handle complexities such as dealing with asynch, nonblocking compute, IO, coordination between components. • Great for creating components that are composed to workflows, that are back-pressured, scalable, high-performance Why Reactive Programming* 18* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
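The two laws named on this slide are short enough to write down. The sketch below uses their standard forms; the parameter values in the usage note are illustrative assumptions, not measurements from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: S(N) = 1 / ((1 - p) + p / N)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def usl_capacity(workers: int, contention: float, coherency: float) -> float:
    """Universal Scalability Law: C(N) = N / (1 + a(N - 1) + b * N * (N - 1)).

    `contention` (a) models time spent queueing on shared resources;
    `coherency` (b) models the cost of keeping workers consistent with each other.
    """
    n = workers
    return n / (1.0 + contention * (n - 1) + coherency * n * (n - 1))
```

With 95% of the work parallelizable, `amdahl_speedup` stays below 20 no matter how many workers are added. With an assumed contention of 0.05 and coherency of 0.001, `usl_capacity` is lower at 64 workers than at 32, which is the "negative results as new resources are added" effect the slide mentions.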
  19. 19. • Reactive Programming - event-driven • Computation via dataflow chains • Events are not directed; “addressable event sources” • Events are mere facts that can be observed • Emitted by changes in state (state machine) • Listeners attach to event sources, which in turn, react to them • Emitted locally • Reactive Systems - message-driven • Basis of communication across network/components; prefer asynch • Sender/Receiver are decoupled • Focus on resilience and elasticity via communication/coordination inherent to distributed systems • Long-lived, addressable components • Waits for messages to be sent, then reacts to them • Messages are “directed” • A message has a clear destination; “addressable recipient” Event-Driven vs. Message-Driven* 19* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
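One way to see the distinction is to put an event source and an addressable recipient side by side. In this hypothetical sketch, `Thermostat` emits undirected events to whoever subscribed, while `Mailbox` receives messages sent to its address. The last lines wrap each event in a message for a known recipient. All names are invented for illustration.

```python
from collections import deque


class Thermostat:
    """Event source: emits facts to whoever happens to be listening."""

    def __init__(self):
        self._listeners = []
        self._temperature = None

    def subscribe(self, listener):
        self._listeners.append(listener)

    def set_temperature(self, value):
        if value != self._temperature:           # a state change emits an event
            self._temperature = value
            for listener in self._listeners:
                listener(("temperature_changed", value))


class Mailbox:
    """Addressable recipient: messages are sent to it and wait until handled."""

    def __init__(self, address):
        self.address = address
        self._pending = deque()

    def deliver(self, message):
        self._pending.append(message)

    def drain(self):
        handled = list(self._pending)
        self._pending.clear()
        return handled


# Carry each local event inside a message directed at a known address.
hvac = Mailbox("hvac-controller")
sensor = Thermostat()
sensor.subscribe(lambda event: hvac.deliver({"to": hvac.address, "event": event}))
```

The thermostat never names its listeners, whereas every message names its destination. Setting the same temperature twice emits only one event, because an event records a change in state.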
  20. 20. • Common pattern is to use Messaging as means to communicate Events across network/components • Events within Messages • Examples • AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub) • Pros • Abstraction and simplicity • Cons • Lose some control • Messaging forces developers to deal with complex realities of distributed programming • Failure detection • Message delivery contracts (dupes, retry, ordering) • Consistency guarantees • Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc) Event-Driven & Message-Driven* 20* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  21. 21. Reactive Systems: Characteristics* 21 Responsive • Low latency • Consistent & predictable • Foundation of Usability • Essential for Utility • Expose problems quickly • Happier Customers Resilient • Responsive through failure • Resilient (H/A) through replication • Failure isolated to Component (bulkheads) • Delegate recovery to external Component (supervisor hierarchies) • Client does not handle failures Elastic • Responsive as workload varies • Devoid of bottlenecks or hot spots • Perf metrics drive predictive / reactive autonomic scaling • Favor commodity infrastructure Message Driven • Asynchronous • Loose Coupling • Isolation • Location Transparency • Non Blocking • Recipients can passivate • Message Passing • Flow Control • Exception Management • Elasticity • Back Pressure * Source: The Reactive Manifesto
  22. 22. Reactive Systems: Patterns 22 Architecture Pattern: Single Component • Component does one thing, fully and well • Single Responsibility Principle • Max cohesion, min coupling Architecture Pattern: Let-it-Crash • Prefer a full component restart to complex internal error handling • Design for failure • Leads to more reliable components • Avoids hard to find/fix errors • Failure is unavoidable Implementation Patterns Circuit Breaker • Protect services by breaking connections during failures • From EE • Protects clients from timeouts • Allows time for service to recover Source: https://www.lightbend.com/blog/architect-reactive-design-patterns Implementation Patterns Saga • Divide long-lived, distributed transactions into quick local ones with compensating actions for recovery. • Compensating txns run during saga rollback • Concurrent sagas can see intermediate state • Sagas need to be persistent to recover from hardware failures. Save points.
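The Circuit Breaker pattern on this slide can be sketched as a small wrapper around a call. This is a simplified, single-threaded illustration. The class name, the thresholds, and the injectable `clock` are assumptions made for the example, not the API of any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the service while the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after       # seconds to leave the service alone
        self._clock = clock
        self._failures = 0
        self._opened_at = None               # None means the circuit is closed

    def call(self, operation, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; service is recovering")
            self._opened_at = None           # half-open: allow one trial call
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        self._failures = 0                   # a success closes the circuit
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than waiting on a timeout, and the struggling service receives no traffic until `reset_after` seconds have passed.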
  23. 23. • Still need to use the brain and learn • Not going to be able to build reactive systems with just paper certifications • No substitute for experience! • The “new” is based on evolution (old techniques and patterns) • Paradigms and patterns not isolated to components or technology Sound Complicated? 23Source: The Reactive Manifesto
  24. 24. The Enterprise “Even if you are on the right track, you’ll get run over if you just sit there.” - Will Rogers 24
  25. 25. 25
  26. 26. • We all work for 1..n companies that have a size, complexity and age • Young companies (e.g. start-ups) • Less legacy overhead • More able to adopt newer technology; attempting to innovate, disrupt, find a niche • Less able to leverage expensive, enterprise-class solutions; value through IP or a unique product • More likely to build cloud native (various reasons) • Staff typically has larger sphere of influence • Sometimes filled with Unicorns • Mid-size companies • May have legacy overhead • Complexity if growth through acquisition vs. growth through product evolution • Strategic use of Enterprise products • May be cloud native, or hybrid • Increased division of labor, defined roles, paying customers to keep happy The Enterprise 26
  27. 27. • Generalizing large companies • Probably where the terms “technical debt” and “culture debt” came from • Certainly has legacy overhead, very complex environments • Fear of change, less room to fail – backed by valid business reasons • Likely has complexity due to both acquisition and product evolution • Purchase a company, adding a “different” tech stack • Money to fully absorb? Risk? Or purchase to evolve existing lines of business? • Use of Enterprise products, likely prefers “supported” flavors of open source • A blend of on-premises, hybrid, cloud • More politics and cats to herd • Slower to see value from new technology • Matrixed division of labor, defined roles, paying customers to keep happy • Transformative change is more difficult; planning, budgeting, process, management structure The Enterprise 27
  28. 28. • An Enterprise has a variety of software applications • Customer facing, revenue generating • “Supporting” applications for operation, development, maintenance activities • Datamarts used for analytics • Data typically moved from system of record, into one or more centralized data “hubs” • OLAP / Data Warehouses • Data Lake, common use of Hadoop • Older application architectures focused on the application, not on enterprise interoperability • Drop in ETL, ESB, SOA, iPaaS to do so ($$$) • Perhaps refactor service layers into more modern middleware • Enterprises becoming ”real time” • The old, batch-oriented, application silos are too complex and just can’t do this well Breaking Down the Enterprise 28
  29. 29. • (Not a complete list) • Products (and supporting software) operate in Real Time • Opens door to do things to increase customer satisfaction/retention • Save money, become more agile (speed to market) – productivity enhancer • Less reliant on batch or pull (streaming batch to interop legacy) • Data flows to systems that need it, where it is acted on in real time • More powerful data processing • Easier to manage; reduced TCO • Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service) • Leverage cloud solutions, auto scaling, resiliency, high availability • Analysts get to use latest tools, oftentimes in the cloud • Enables autonomic automation How the Enterprise Benefits from Reactive 29
  30. 30. • Friends don’t let their friend-analysts do map-reduce • Analysts are not highly technical (some do code in R) • Larger companies may have communities of SAS users • They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL against data. • Can’t expect people to switch careers or skill sets to accommodate new technology. And Again, those Analysts 30
  31. 31. 31Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  32. 32. Fast Data “If you are in a spaceship that is traveling at the speed of light, and you turn on the headlights, does anything happen?” - Steven Wright 32
  33. 33. • Big Data → Fast Data • Constant stream of inbound data, at incredible rates • The old way is to store it, then analyze • Say hello to Map Reduce! • Hadoop did reinvent how to process petabytes of data on commodity hardware • Why not analyze and act on data as it is received? • Using (proven) technology built to scale to massive volume? • Akka, Kafka, Spark, Flink • Use case reminder: millions of concurrent actors handling billions of messages, in real-time • Fast Data means acting on data in real-time, and sending it to destinations • Data Lake • Other systems Fast Data 33* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  34. 34. Time is Gold 34* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
  35. 35. • Fast Data is the obvious future • Many of us have been using these techs for years • The easy part • Build new systems using Reactive patterns, architectures and technology • You start ups have it easy… • The ability of a company to adopt disruptive architectures and patterns is inversely related to its size • In a real world where SDLC costs money • How to reconcile with legacy architectures and implementations? • How to interoperate with disparate systems with varying capabilities? • Does it make sense to refactor “what works”? • Are Frankenstein / Stove-piped solutions the norm, or (partially) a matter of perspective, when viewed in the rear-facing mirror of innovation? • What is a cost efficient approach to adopt and adapt to Reactive? Reality Check 35* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  36. 36. • Enough background, time to talk tech • For this discussion, the reference platform is Lightbend’s Fast Data platform • (A few deviations mentioned) • Core techs will be • Kafka - open source or Confluent • Spark - open source or Databricks (a nice managed service in AWS) • Akka (and Akka-HTTP) (open source or Lightbend stack) • Alpakka Now, the Fun Stuff 36* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  37. 37. 37 Source: Lightbend, Inc.
  38. 38. • JMS -> AMQP –> Kafka • Streaming platform • Process messages when provided • Fault-tolerant storage • Pub/Sub capability • Common use cases • Data pipelines (messaging) between a source and destination • Real-time action / transformation • Metrics, weblogs, stream processing, event sourcing • A commit log • Spread across a cluster • Record streams stored in topics • Each record has a key, value, timestamp • Each topic has offsets and a retention policy Kafka 101 38
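The "commit log" bullets on this slide (records with key, value and timestamp, offsets, and a retention policy) can be modeled in memory to show why consumers are able to replay. This is a toy, single-partition sketch with hypothetical names (`Record`, `TopicLog`). It is not Kafka client code and ignores clustering, replication, and persistence.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    timestamp: float


@dataclass
class TopicLog:
    """One topic with a single partition, held in memory."""
    retention_seconds: float
    _records: list = field(default_factory=list)
    _base_offset: int = 0                 # offset of the oldest retained record

    def append(self, key, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._records.append(Record(key, value, ts))
        return self._base_offset + len(self._records) - 1   # the new record's offset

    def read(self, offset, max_records=100):
        """Reading never removes data, so a consumer can replay from an old offset."""
        start = max(offset - self._base_offset, 0)
        return self._records[start:start + max_records]

    def enforce_retention(self, now):
        """Drop expired records from the head of the log; offsets are never reused."""
        dropped = 0
        while (dropped < len(self._records)
               and now - self._records[dropped].timestamp > self.retention_seconds):
            dropped += 1
        del self._records[:dropped]
        self._base_offset += dropped
```

Because a read is just a slice starting at an offset, two consumers can read the same records independently, and either can rewind by asking for an earlier offset, as long as retention has not yet removed those records.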
  39. 39. Kafka Performance 39Source: “Introduction to Kafka”, Ducas Francis 30k/s 1.8M/min 108M/hr 2.7B/day
  40. 40. • Producer API • To publish a record to Kafka • Consumer API • To subscribe to a topic(s) • Consumer groups • Handle records pushed to topics • Streams API • Stream processing • Consume from Stream An, do processing, publish to Stream Bn • E.g. aggregation, joining • Connector API • To build custom Producers/Consumers • Purpose built integration components Kafka APIs 40
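Two ideas from this slide lend themselves to a small illustration: a consumer group shares the work of a topic, and stream processing consumes one stream, aggregates, and publishes another. The sketch below is plain Python with hypothetical function names. It does not use a Kafka client, and real clients choose among several partition assignment strategies, of which this round-robin is only a simplification.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Round-robin: every partition gets exactly one owner within the group."""
    owners = sorted(consumers)
    assignment = {consumer: [] for consumer in owners}
    for index, partition in enumerate(sorted(partitions)):
        assignment[owners[index % len(owners)]].append(partition)
    return assignment


def running_count_by_key(stream_a):
    """Consume (key, value) records from stream A; yield (key, count) for stream B."""
    counts = defaultdict(int)
    for key, _value in stream_a:
        counts[key] += 1
        yield key, counts[key]            # emit an updated aggregate per input record
```

Adding a consumer to the group spreads the same partitions over more owners, which is how a group scales out. The generator shows the shape of an aggregation, where each input record may produce an updated output record.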
  41. 41. • > Messaging systems (RabbitMQ, AMQ, etc) • Just don’t scale well and become complex • Need to use other abstractions for batching • Lacks replay ability (reset offset, etc) • > Log forwarders, e.g. Scribe or Flume • Push architecture • High performance • Scales well • Sensitive to business logic in endpoints, because needs to push data fast • Assumes data is pushed to large sink (e.g. Hadoop) • Oh, and then queries later. So much for real-time. • Supports Polyglot • Python, Go, .NET, Node.js, C/C++, etc • Robust ecosystem • https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem Why Use Kafka? 41
  42. 42. • “…a fast and general engine for large-scale data processing.” • Hadoop / YARN • Map-Reduce – mapper, reducer, disk IO, queue, fetch resource • Great for parallel file processing of large files • Synchronization barrier during persistence • Spark • In-memory data processing • Interactive/iterative data query • Better supports more complex, interactive (real-time) apps • 100x faster than Hadoop MR (in memory), 10x faster on disk • Microbatching Spark: What is it? 42Source: spark.apache.org
  43. 43. • Combine SQL, streaming, complex analytics • SQL • Dataframes • MLlib • GraphX • Spark Streaming Spark 43Source: spark.apache.org
  44. 44. • Run it • Standalone • Hadoop • Mesos • Cloud • Access data • HDFS • Cassandra • Hbase • S3 • Hive • Tachyon, and more • Write code • Scala • Java • Python, Clojure, R • Interactive query shell (notebooks) Spark Execution Modes 44Source: spark.apache.org
  45. 45. • Slow due to replication, serialization, filesystem IO • Inefficient use cases: • Iterative algorithms (ML, Graphs, Network analysis) • Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries) Spark: Hadoop MR 45Source: “Spark Overview”, Lisa Hua
  46. 46. Spark: Hadoop & Spark (in Hadoop) 46Source: “Spark Overview”, Lisa Hua • Spark has additional features, such as interop with S3 storage
  47. 47. Spark: A Clustered Application 47Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  48. 48. Spark: Execution Terminology 48 Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Job – a set of tasks to be executed as a result of an action • Stage – a set of tasks in a job that can be run in parallel • Task – an individual unit of work sent to a single executor
  49. 49. Spark: SQL 49Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Spark SQL is a module for structured data querying • Supports basic SQL and HiveQL • Can act as distributed query engine via JDBC/ODBC, or CLI
  50. 50. Spark: DataFrames, Datasets, RDDs 50 Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • A DataFrame is a distributed collection of data organized into named columns • Analogous to a relational table, or data frame in R/Python (with richer optimizations) • Dataset was added in 1.6 to provide the benefits of RDDs (strong typing, lambda functions) together with Spark SQL’s optimized execution engine • Build Datasets from JVM objects and then manipulate them with functional transformations
  51. 51. • Scalable, high-throughput, fault-tolerant processing of real-time streams and use cases. • Microbatched. Think of it in terms of EDA (e.g. Esper): it works on “windows” of events rather than handling a single message/event at a time. Spark Streaming 51 Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
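Microbatching and windowing can be illustrated without Spark. The sketch below is plain Python under simplifying assumptions: events carry a timestamp in seconds, batches are fixed-length time buckets, intervals with no events are simply omitted, and a window is the last few batches. The function names and sample events are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length buckets."""
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // batch_interval)].append(value)
    # Only intervals that received at least one event appear in the result.
    return [buckets[i] for i in sorted(buckets)]

def sliding_window(batches, window_length):
    """After each batch, yield the values of the most recent `window_length` batches."""
    for end in range(1, len(batches) + 1):
        start = max(0, end - window_length)
        yield [value for batch in batches[start:end] for value in batch]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]  # (seconds, payload)
batches = micro_batches(events, batch_interval=1.0)
windows = list(sliding_window(batches, window_length=2))
print(batches)
print(windows)
```

The processing logic never sees a single event in isolation; it sees a batch, or a window made of several batches.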
  52. 52. • Can merge inbound data with historical data • Write code in Scala, Python, Java, etc • Lower level access via DStream, obtained via StreamingContext. • Create RDDs from the DStream • Two primary metrics to monitor and tune: • Processing time (per batch) • Scheduling delay (processed upon arrival?) • Use Kryo for serialization Spark Streaming 52 Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
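The relationship between the two metrics can be worked through with a small simulation. This is a simplified model, assuming batches are processed strictly one at a time and batch i becomes ready at the end of its interval; `scheduling_delays` is a hypothetical helper, not a Spark API.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before its processing starts."""
    delays = []
    free_at = 0.0  # time at which the processor is next idle
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval   # batch i is complete at this time
        start = max(ready_at, free_at)        # wait if the previous batch is still running
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays

# Processing time below the batch interval: every batch starts on arrival.
keeping_up = scheduling_delays(1.0, [0.5, 0.5, 0.5])

# Processing time above the batch interval: the delay grows with every batch.
falling_behind = scheduling_delays(1.0, [1.5, 1.5, 1.5])
print(keeping_up, falling_behind)
```

In this model a steadily growing scheduling delay is the signal that processing time per batch exceeds the batch interval.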
  53. 53. Spark Streaming - comparison with other techs 53Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  54. 54. Building a Spark Application 54 Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Scala or Java, compiled to JAR (in turn uploaded to worker nodes) • Running a Spark app is as easy as pie (submit options can expand the experience): $ spark-submit myAwesomePythonScript.py theFileURL $ spark-submit --class SkyNetInScala skynet2017.jar theFileURL • Spark uses log4j (beware) • YARN can aggregate worker logs
  55. 55. Akka, who uses it? 55Source: “Akka Actor Introduction”, Gene
  56. 56. Akka 56 Source: “Introducing Akka”, Jonas Bonér • Vision • Simpler: Concurrency, Scalability, Fault-Tolerance • With a single unified • Programming Model • Managed Runtime • Open Source Distribution • Manage System Overload (backpressure) • Scale up & Scale out • Program to a higher level • No more shared state, state visibility, threads, locks, concurrent collections, thread notifications • Low-level concurrency is built into the plumbing; it becomes a simple workflow where you just deal with messages • Increases CPU utilization, lowers latency, high throughput, scalable! • Superior, proven model to detect and recover from errors
  57. 57. Akka: Perfect for the Cloud 57Source: “Introducing Akka”, Jonas Bonér • Elastic and dynamic • Fault tolerant & self healing (autonomic) • Adaptive load-balancing, cluster rebalancing & Actor migration • Build loosely coupled systems that can dynamically adapt at runtime
  58. 58. Akka 101 58Source: “Introducing Akka”, Jonas Bonér • Akka’s unit of code is called an Actor • The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric • Encapsulates code like servlets or session beans; policy decisions separated from biz logic • Actors have been around since 1973, and if you’ve ever used a phone, that software helped make it work. 9 nines of uptime! • Think of an actor as a VM in the cloud (it isn’t, but) • Encapsulated, decoupled • Managing own memory and behavior • Communicates asynchronously with non-blocking messages • Elastic – grow/shrink on demand • Hot deploy, change runtime behavior
  59. 59. Akka: How to use Actors? 59Source: “Introducing Akka”, Jonas Bonér • Alternative to: • A thread • An object instance or component • Callback or listener • Singleton or service • Router, load-balancer or pool • Session bean or MDB • Out of process service • A Finite State Machine (FSM)
  60. 60. Akka: What is it? 60Source: http://bit.ly/hewitt-on-actors • Carl Hewitt’s definition • Fundamental unit of computation that embodies: • Processing • Storage • Communication • 3 Axioms – When an actor receives a message it can: • Create new Actors • Send messages to Actors it knows • Designate how it should handle the next message it receives
  61. 61. Akka: Core Actor Operations 61Source: “Introducing Akka”, Jonas Bonér 0. Define 1. Create 2. Send 3. Become 4. Supervise
  62. 62. Akka: Define Operation 62Source: “Introducing Akka”, Jonas Bonér 0. Define • Define the message (class) the actor should respond to, and Actor class
  63. 63. Akka: Create Operation 63 Source: “Introducing Akka”, Jonas Bonér 1. Create • Yes, creates a new actor. Created from the ActorSystem, which returns an ActorRef. • Lightweight: roughly 2.6 million actors per GB of RAM • Strong encapsulation of: state/behavior (indistinguishable), message queue
  64. 64. Akka: Send Operation 64 Source: “Introducing Akka”, Jonas Bonér 2. Send • Sends a message to an Actor • Asynch and non-blocking (fire & forget) • Everything is Reactive • Actor is passivated until receiving a message, which triggers it to awaken • Messages are energy • Everything is asynch and lockless
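The fire-and-forget send can be sketched with a mailbox and a worker thread. This is a toy model in plain Python, not Akka: the `Actor` base class, `tell`, and `stop` are hypothetical names, and unlike Akka (where idle actors hold no thread) this sketch dedicates one blocked thread per actor for simplicity.

```python
import queue
import threading

class Actor:
    """Toy actor: a mailbox drained by one worker thread, one message at a time."""
    _STOP = object()

    def __init__(self):
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Fire and forget: enqueue the message and return immediately."""
        self.mailbox.put(message)

    def stop(self):
        """Stop after the messages already queued have been handled, then wait."""
        self.mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()  # idle until a message arrives
            if message is self._STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError

class Counter(Actor):
    def __init__(self):
        self.total = 0  # state is only touched by the actor's own thread
        super().__init__()

    def receive(self, message):
        self.total += message

counter = Counter()
for n in (1, 2, 3):
    counter.tell(n)   # the sender never waits for processing
counter.stop()
print(counter.total)
```

No lock is needed around `total` because messages are processed sequentially by a single thread; the mailbox is the only shared structure.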
  65. 65. Akka: Performance 65Source: “Introducing Akka”, Jonas Bonér +50 million messages per second !!!
  66. 66. Akka: Remote Deployment 66Source: “Introducing Akka”, Jonas Bonér
  67. 67. Akka: Become Operation 67 Source: “Introducing Akka”, Jonas Bonér 3. Become • Dynamically redefine Actor’s behavior • Reactively triggered by receipt of a message • Will now react differently to messages it receives • Behaviors are stacked – can be pushed and popped… (Think of it as the object changing its type – interface, protocol, implementation)
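The stacked-behavior idea can be shown with a small state machine. This is a plain Python sketch, not the Akka API: the `Switch` class and its `become`/`unbecome` methods are hypothetical and only mirror the push/pop semantics described on the slide.

```python
class Switch:
    """Toy actor whose message handler can be replaced at runtime."""
    def __init__(self):
        self.behaviors = [self.off]  # top of the stack handles the next message
        self.log = []

    def receive(self, message):
        self.behaviors[-1](message)

    def become(self, behavior):
        """Push a new behavior; it handles all following messages."""
        self.behaviors.append(behavior)

    def unbecome(self):
        """Pop back to the previous behavior (the initial one is never removed)."""
        if len(self.behaviors) > 1:
            self.behaviors.pop()

    def off(self, message):
        if message == "toggle":
            self.log.append("turning on")
            self.become(self.on)

    def on(self, message):
        if message == "toggle":
            self.log.append("turning off")
            self.unbecome()

switch = Switch()
for _ in range(3):
    switch.receive("toggle")  # same message, different reaction each time
print(switch.log)
```

The same `"toggle"` message produces different reactions depending on the current behavior, which is the core of implementing an FSM this way.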
  68. 68. Akka: Become Operation – Why? 68Source: “Introducing Akka”, Jonas Bonér Why do this? • A busy actor can become an Actor Pool or Router! • Implement FSM (Finite State Machine) • Implement graceful degradation • Spawn empty workers that can ”Become” whatever the Master desires • Very useful. Limited only by your imagination.
  69. 69. Akka: Failure Management 69 Source: “Introducing Akka”, Jonas Bonér In Java/C/C++ etc. • Single thread of control • If that thread blows up you are screwed • Only option is to do explicit error handling within your single thread • Errors are isolated within the thread; other threads have no clue • Results in tons of defensive code, scattered throughout the codebase and entangled in your business logic
  70. 70. Akka: Supervise Operation 70Source: “Introducing Akka”, Jonas Bonér 4. Supervise • Manage another Actor’s failures (or the person sitting next to you) • Let actors monitor (supervise) each other for failure, then respond • Notification sent to Actor’s supervisor if failure occurs. • Neat separation of processing from error handling
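Separating processing from error handling can be sketched as a parent that owns the failure policy for its child. This is a synchronous toy in plain Python, not Akka supervision: `Worker`, `Supervisor`, and the restart-on-any-exception policy are assumptions made for illustration.

```python
class Worker:
    """Toy child actor: processing logic only, no error handling."""
    def __init__(self):
        self.processed = []

    def receive(self, message):
        if message < 0:
            raise ValueError("negative input")
        self.processed.append(message)

class Supervisor:
    """Toy parent: learns about the child's failure and decides how to respond."""
    def __init__(self, child_factory):
        self.child_factory = child_factory
        self.child = child_factory()
        self.restarts = 0

    def deliver(self, message):
        try:
            self.child.receive(message)
        except Exception:
            # Restart policy: replace the failed child with a fresh instance.
            # The failed child's in-memory state is discarded.
            self.child = self.child_factory()
            self.restarts += 1

supervisor = Supervisor(Worker)
for message in [1, 2, -1, 3]:
    supervisor.deliver(message)
print(supervisor.restarts, supervisor.child.processed)
```

The worker contains no defensive code at all; changing the policy (for example, to stop or escalate instead of restart) would touch only the supervisor.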
  71. 71. Akka Streaming 71 Source: Akka Documentation • Akka Streams API is completely decoupled from Reactive Streams interfaces • Reactive Streams interfaces are an implementation detail of how stream data is passed between processing stages • Akka Streams is interoperable with any conformant Reactive Streams implementation • Principles • All features explicit in the API • Compositionality; combined pieces retain the function of each part • Model of the domain of distributed bounded stream processing • Reactive Streams -> JDK9 Flow APIs
  72. 72. Akka Streaming 72 Source: Akka Documentation • Immutable building blocks / blueprints enabled for libraries, including: • Source - something with exactly one output stream • Sink - something with exactly one input stream • Flow - something with exactly one input and one output stream • BidiFlow - something with exactly two input streams and two output streams that conceptually behave like two Flows of opposite direction • Graph - a packaged stream processing topology that exposes a certain set of input and output ports, characterized by an object of type Shape. • Built-in backpressure capability • No stage can push downstream unless it received a pull beforehand • Difference between error and failure • Error is accessible within the stream as a data element (signaled via onNext) • Failure means the stream itself has collapsed (signaled via onError). • Want failure to propagate faster than data (essential, to deal with backpressure) • Data elements emitted before a failure can still be lost if the onError overtakes them • Recovery element acts as a bulkhead to confine a stream collapse to a given region of the stream topology, to isolate the outside from the impact of the collapsed region (e.g. buffered elements)
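The rule that no stage pushes downstream without a prior pull can be illustrated with Python generators, which are demand-driven by construction. This is an analogy only, not Akka Streams: the `source`, `flow`, and `sink` functions are hypothetical, and the pull here is a synchronous `next()` call, whereas Akka Streams signals demand asynchronously.

```python
produced = []  # records every element the source was actually asked to produce

def source():
    """Emit integers on demand; does no work until an element is pulled."""
    n = 0
    while True:
        produced.append(n)
        yield n
        n += 1

def flow(upstream):
    """Transform each element pulled from upstream."""
    for element in upstream:
        yield element * 10

def sink(upstream, demand):
    """Pull exactly `demand` elements and then stop pulling."""
    return [next(upstream) for _ in range(demand)]

# The source is infinite, but only three elements are ever produced,
# because the sink only signalled demand for three.
result = sink(flow(source()), demand=3)
print(result, produced)
```

A slow or finished consumer therefore bounds how much the producer does, with no unbounded buffer in between.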
  73. 73. Akka HTTP 73Source: Akka Documentation • Built atop Akka Streams • Can expose an incoming connection in form of a Source instance • To start listening on network with Akka HTTP, create a Route and bind it to a port (similar syntax to Spray). • Backpressure on source? Akka HTTP stops consuming data from network; eventually leads to 0 TCP window – applying backpressure to sending party itself (e.g. a sensor) • Rules • Libraries shall provide their users with reusable pieces, i.e. expose factories that return graphs, allowing full compositionality • Avoid destruction of compositionality. • Express functionality of a library such that materialization can be done by user outside of library’s control. • Libraries may optionally and additionally provide facilities that consume and materialize graphs • Allows a library to provide convenience “sugar” for use cases
  74. 74. • Akka Streams Integration • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ • Adds interesting capabilities to Akka Streams • Modern alternative to Apache Camel (EIP implementation) • Camel en ze Akka • Community driven, focused on connectors to external libraries, integration patterns and conversions. • ”A call to arms” • https://github.com/akka/alpakka/releases/tag/v0.1 Alpakka 101 74
  75. 75. The Data Lake for Analytics, App Dev “Lake Wobegon, the little town that time forgot and the decades cannot improve.” - Garrison Keillor 75
  76. 76. The Data Lake 76 • Note: This section will be expanded for DevNexus.
  77. 77. The Data Lake 77 • The Data Lake is where a copy of much of the data from source systems “ends up”, via Fast Data, etc. • Easily accessible, massive repository of data built on commodity hardware (or Cloud). • Data is not stored in a way that is optimized for data analysis (S3) • Data Lake retains all attributes • Beware the Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117 • Let’s combine all this data to drive increased information sharing and usage, while reducing cost through consolidation / tech simplification. • Does it really work? • Has the ideal of Enterprise-wide data management been realized? • Deriving value from data is still in the hands of the business end user (enter: Fast Data Platforms)
  78. 78. The Dark Side of the Data Lake 78 Source: “Gartner Says Beware of the Data Lake Fallacy” • Many companies tend to vacuum data into Hadoop for later use • Many companies use overlapping tools within the same ecosystem that do not interoperate • Data lakes ignore how/why data is used, governed, defined and secured. • Does this sound like a good solution? • Data Lake solves the old problem of siloing data. Great, so now it’s all commingled. • Federated query? AWS Athena? Why move data if not necessary? • Inability to quantitatively measure data quality • Accepts any data without governance or oversight • Accepts any data without metadata (description) • Inability to share the lineage of findings by other analysts who have already found value in the same data • Security, access control (and tracking of both) • Data Ownership, Entitlements? • Tenancy? • Regard for regulatory controls, compliance issues? • What to do?
  79. 79. More to Come… 79Source: “Gartner Says Beware of the Data Lake Fallacy” • At DevNexus 2017 • Thurs, Feb 23 @ 2:30pm • http://devnexus.com/s/devnexus2017/presentations/17212
  80. 80. Presentation Improvements for DevNexus “Build something 100 people love, not something 1 million people kind of like.” - Brian Chesky 80
  81. 81. • More diagrams, fewer words • Refine, refine, refine • Mix in coding examples • Improve contrast between older architectures and reactive, for the Enterprise • Content that contrasts different streaming options (Akka, Spark, Kafka) • Add specific performance details • Incorporate additional, interesting content (incl. Data Lake related) Planned Improvements 81
  82. 82. Questions? “I'm sorry, if you were right, I'd agree with you.” - Robin Williams 82
  83. 83. • The Reactive Manifesto (http://www.reactivemanifesto.org/) • Chaos Monkey? Use Linear Fault Driven Testing instead. • https://www.lightbend.com/blog/architect-reactive-design-patterns • http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html • https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka • https://kafka.apache.org • http://www.slideshare.net/ducasf/introduction-to-kafka • http://www.slideshare.net/SparkSummit/grace-huang-49762421 • http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ • https://github.com/akka/alpakka/releases/tag/v0.1 • http://www.slideshare.net/LisaHua/spark-overview-37479609 • http://spark.apache.org/ • https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/ • http://www.slideshare.net/gene7299/akka-actor-presentation • http://www.slideshare.net/jboner/introducing-akka • http://bit.ly/hewitt-on-actors • http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html • https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/ Resources 83
  84. 84. Ideas 84
