Advertisement
Advertisement

More Related Content

Slideshows for you(20)

Advertisement

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

  1. @wip Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark AJUG, Jan 17, 2017 Todd Fritz Cox Automotive, Inc.
  2. This presentation is a draft of what will be presented, next month, at DevNexus. “Sneak peek” Reactions and questions may influence evolution of content Disclaimer 2
  3. 3
  4. tafritz@outlook.com Todd.Fritz@coxautoinc.com www.linkedin.com/in/tfritz http://www.slideshare.net/ToddFritz https://github.com/todd-fritz 4
  5. License: CC BY-SA 3.0 5 • Senior Solutions Architect @ Cox Automotive, Inc. • Strategic Data Services • The opinions contained herein may not represent my employer, but I believe they should. • Background is building platforms, middleware, MoM, EIP, EDA, etc • DevOps mentality • Exposed to many environments, technologies, people • Life-long learner and always curious • Novice bass player • Scuba diver About Me
  6. DevNexus 2015 http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015 Great Wide Open - Atlanta (April 3, 2014) http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud AJUG (April 15, 2014) http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-group Video - https://vimeo.com/94556976 Previous Presentations 6
  7. • Forward • The Briefest History • Background: Reactive Systems, Patterns, Implementations • The Enterprise • Fast Data • The Data Lake for Analytics, App Dev • Presentation Improvements Planned for DevNexus • Questions • Resources Agenda 7
  8. Forward “Our greatest glory is not in never failing, but in rising every time we fall.” -Confucius 8
  9. • Why is Reactive Important? • Reactive Systems and Programming != Reactive Management • Reactive underpins every use case, every business capability, every product feature • Tendency for companies to survey market and select products to match perceived business need • Importance of vision, governance, tenancy, entitlements, security • Both Process and Technology • Use Case: Build a system that can scale thousands to millions of users, handling millions to billions of messages • Real time data, for application components, middleware, data processing, analytics • A journey of innovation and successive refinement Onward 9
  10. The Briefest History History is the sum total of things that could have been avoided. - Konrad Adenauer 10
  11. • “Reactive” is not new • Underlying principles go back almost 50 years, to the days of punch cards • Erlang • Built to scale, handle extremely high volume • Extensive use in Telco for decades; many billions of messages • Actors • Bedrock is messaging • We’ve been using this technique for decades, via many technologies • Improvements over time around component isolation, decoupling to benefit scalability and concurrency The Briefest History 11
  12. Reactive Systems, Patterns, Implementations “People who think they know everything are a great annoyance to those of us who do.” - Isaac Asimov 12
  13. • The Reactive Manifesto • http://www.reactivemanifesto.org • Many organizations independently building software to new, and similar, patterns • Increasing pressures to simplify, scale, innovate and improve customer experience • Increasing proliferation and interoperability of system environments and connected devices • More data with contextual use cases • Yesterday’s architectures just don’t cut it • Need more flexible, resilient, robust systems • Solution through evolution • Spawn of Actors (Akka, Erlang) • Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns Reactive Systems 13
  14. “…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of thinking about systems architecture and design in a distributed environment where implementation techniques, tooling, and design patterns are components of a larger whole.”* “A Reactive System is based on an architectural style that allows … multiple individual services to coalesce as a single unit and react to its surroundings while remaining aware of each … scale up/down, load balance and even ...” * (proactive steps) Components may qualify as reactive, but when combined, does not guarantee a Reactive System What is Reactive? A Set of Design Principles 14* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  15. 1. Reactive Systems (architecture, design) 2. Reactive Programming (declarative event-based) 3. Functional Reactive Programming (FRP) NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava, Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07- 22-essence-of-frp.html Reactive Begets* 15* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  16. • Reactive Programming != Functional Reactive Programming • Subset of Asynch Programming • New information drives logic flow vs. control flow driven by thread-of-execution • Avoids resource contention (Amdahl’s Law) that impedes scalability. • Decompose into multiple steps that are asynch and nonblocking • Combine into a composed workflow • Reactive Systems very rarely block • Reactive API libraries are either declarative (functional composition, combinators) or callback-based (attached to events, executed during dataflow chain) with stream-based operators (windowing, triggers, etc) • Reactive programming is event-driven • Reactive systems are message driven • Wait? What? More about this distinction in a few slides. Reactive Programming* 16* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  17. • Reactive Programming is related to Dataflow Programming • Both emphasize flow of data vs. flow of control • Examples • Futures / Promises • (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured pipelines connecting sources and destinations. • Dataflow Variables – single assignment variables (AKA a cell in Excel) whereby a value change can trigger dependent functions to produce new values (state) • Technologies that do this, include • Akka Streams • RxJava • Vert.X • Reactive Streams Specification • The standard for interoperability amongst Reactive Programming libraries on the JVM • “…an initiative to provide a standard for asynchronous stream processing with non- blocking back pressure.” About Dataflow Programming* 17* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  18. • Notable benefits • Increased (efficient) utilization of compute resources (incl. multi-core) • Increased performance via serialization reduction • Amdahl’s Law • Neil Günter’s Universal Scalability Law. To quantify the effects of contention and coordination in concurrent, distributed systems. This explains how the cost of coherency in a system can lead to negative results, as new resources are added to the system. • Productivity. Reactive libraries handle complexities such as dealing with asynch, nonblocking compute, IO, coordination between components. • Great for creating components that are composed to workflows, that are back- pressured, scalable, high-performance Why Reactive Programming* 18* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  19. • Reactive Programming - event-driven • Computation via dataflow chains • Events are not directed; “addressable event sources” • Events are mere facts that can be observed • Emitted by changes in state (state machine) • Listeners attach to even sources, which in turn, react to them • Emitted locally • Reactive Systems - message-driven • Basis of communication across network/components; prefer asynch • Sender/Receiver are decoupled • Focus on resilience and elasticity via communication/communication inherent to distributed systems • Long-lived, addressable components • Waits for messages to be sent, then reacts to them • Messages are ”directed” • A message has a clear destination; “addressable recipient” Event-Driven vs. Message-Driven* 19* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  20. • Common pattern is to us Messaging as means to communicate Events across network/components • Events within Messages • Examples • AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub) • Pros • Abstraction and simplicity • Cons • Lose some control • Messaging forces developers to deal with complex realities of distributed programming • Failure detection • Message delivery contracts (dupes, retry, ordering) • Consistency guarantees • Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC, etc) Event-Driven & Message-Driven* 20* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  21. Reactive Systems: Characteristics* 21 Responsive • Low latency • Consistent & predictable • Foundation of Usability • Essential for Utility • Expose problems quickly • Happier Customers Resilient • Responsive through failure • Resilient (H/A) through replication • Failure isolated to Component (bulkheads) • Delegate recovery to external Component (supervisor hierarchies) • Client does not handle failures Elastic • Responsive as workload varies • Devoid of bottlenecks or hot spots • Perf metrics drive predictive / reactive autonomic scaling • Favor commodity infrastructure Message Driven • Asynchronous • Loose Coupling • Isolation • Location Transparency • Non Blocking • Recipients can passivate • Message Passing • Flow Control • Exception Management • Elasticity • Back Pressure * Source: The Reactive Manifesto
  22. Reactive Systems: Patterns 22 Architecture Pattern: Single Component • Component does one thing, fully and well • Single Responsibility Principle • Max cohension, min coupling Architecture Pattern: Let-it-Crash • Prefer a full component restart to complex internal error handling • Design for failure • Leads to more reliable components • Avoids hard to find/fix errors • Failure is unavoidable Implementation Patterns Circuit Breaker • Protect services by breaking connections during failures • From EE • Protects clients from timeouts • Allows time for service to recover Source: https://www.lightbend.com/blog/architect-reactive-design-patterns Implementation Patterns Saga • Divide long-lived, distributed transactions into quick local ones with compensating actions for recovery. • Compensating txns run during saga rollback • Concurrent sagas can see intermediate state • Sags need to be persistent to recover from hardware failures. Save points.
  23. • Still need to use the brain and learn • Not going to be able to build reactive systems with just paper certifications • No substitute for experience! • The “new” is based on evolution (old techniques and patterns) • Paradigms and patterns not isolated to components or technology Sound Complicated? 23Source: The Reactive Manifesto
  24. The Enterprise “Even if you are on the right track, you’ll get run over if you just sit there.” - Will Rogers 24
  25. 25
  26. • We all work for 1..n companies that have a size, complexity and age • Young companies (e.g. start-ups) • Less legacy overhead • More able to adopt newer technology; attempting to innovate, disrupt, find a niche • Less able to leverage expensive, enterprise-class solutions; value through IP or a unique product • More likely to build cloud native (various reasons) • Staff typically has larger sphere of influence • Sometimes filled with Unicorns • Mid-size companies • May have legacy overhead • Complexity if growth through acquisition vs. growth through product evolution • Strategic use of of Enterprise products • May be cloud native, or hybrid • Increased division of labor, defined roles, paying customers to keep happy The Enterprise 26
  27. • Generalizing large companies • Probably where the terms “technical debt” and “culture debt” came from • Certainly has legacy overhead, very complex environments • Fear of change, less room to fail – backed by valid business reasons • Likely has complexity due to both acquisition and product evolution • Purchase a company, adding a ”different” tech stack • Money to fully absorb? Risk? Or purchase to evolve existing lines of business? • Use of of Enterprise products, likely prefers “supported” flavors of open source • A blend of on-premises, hybrid, cloud • More politics and cats to herd • Slower to see value from new technology • Matrixed division of labor, defined roles, paying customers to keep happy • Transformative change is more difficult; planning, budgeting, process, management structure The Enterprise 27
  28. • An Enterprise has a variety of software applications • Customer facing, revenue generating • “Supporting” applications for operation, development, maintenance activities • Datamarts used for analytics • Data typically moved from system of record, into one or more centralized data “hubs” • OLAP / Data Warehouses • Data Lake, common use of Hadoop • Older application architectures focused on the application, not on enterprise interoperability • Drop in ETL, ESB, SOA, iPaaS to do so ($$$) • Perhaps refactor service layers into more modern middleware • Enterprises becoming ”real time” • The old, batch-oriented, application silos are too complex and just can’t do this well Breaking Down the Enterprise 28
  29. • (Not a complete list) • Products (and supporting software) operates in Real Time • Opens door to do things to increase customer satisfaction/retention • Save money, become more agile (speed to market) – productivity enhancer • Less reliant on batch or pull (streaming batch to interop legacy) • Data flows to system that need it, where it is acted on in real time • More powerful data processing • Easier to manage; reduced TCO • Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service) • Leverage cloud solutions, auto scaling, resiliency, high availability • Analysts get to use latest tools, oftentimes in the cloud • Enables autonomic automation How the Enterprise Benefits from Reactive 29
  30. • Friends don’t let their friend-analysts do map-reduce • Analysts are not highly technical (some do code in R) • Larger companies may have communities of SAS users • They just want to connect a friendly BI tool (e.g. Tableau) to run their SQL against data. • Can’t expect people to switch careers or skill sets to accommodate new technology. And Again, those Analysts 30
  31. 31Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  32. Fast Data “If you are in a spaceship that is traveling at the speed of light, and you turn on the headlights, does anything happen?” - Steven Wright 32
  33. • Big Data  Fast Data • Constant stream of inbound data, at incredible rates • The old way is to store it, then analyze • Say hello to Map Reduce! • Hadoop did reinvent how to process petabytes of data on commodity hardware • Why not analyze and act on data as it is received? • Using (proven) technology built to scale to massive volume? • Akka, Kafka, Spark, Flink • Use case reminder: millions of concurrent actors handling billions of messages, in real-time • Fast Data means acting on data in real-time, and sending it to destinations • Data Lake • Other systems Fast Data 33* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  34. Time is Gold 34* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)
  35. • Fast Data is the obvious future • Many of us have been using these techs for years • The easy part • Build new systems using Reactive patterns, architectures and technology • You start ups have it easy… • The ability of a company to adopt disruptive architectures and patterns is inversely related to its size • In a real world where SDLC costs money • How to reconcile with legacy architectures and implementations? • How to interoperate with disparate systems with varying capabilities? • Does it make sense to refactor “what works”? • Are Frankenstein / Stove-piped solutions the norm, or (partially) a matter of perspective, when viewed in the rear-facing mirror of innovation? • What is a cost efficient approach to adopt and adapt to Reactive? Reality Check 35* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  36. • Enough background, time to talk tech • For this discussion, the reference platform is Lightbend’s Fast Data platform • (A few deviations mentioned) • Core techs will be • Kafka - open source or Confluent • Spark - open source or Databricks (a nice managed service in AWS) • Akka (and Akka-HTTP) (open source or Lightbend stack) • Alpakka Now, the Fun Stuff 36* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang
  37. 37 Source: Lightbend, Inc.
  38. • JMS -> AMQP –> Kafka • Streaming platform • Process messages when provided • Fault-tolerant storage • Pub/Sub capability • Common use cases • Data pipelines (messaging) between a source and destination • Real-time action / transformation • Metrics, weblogs, stream processing, event sourcing • A commit log • Spread across a cluster • Record streams stored in topics • Each record has a key, value, timestamp • Each topic has offsets and a retention policy Kafka 101 38
  39. Kafka Performance 39Source: “Introduction to Kafka”, Ducas Francis 30k/s 1.8M/min 108M/hr 2.7B/day
  40. • Producer API • To publish a record to Kafka • Consumer API • To subscribe to a topic(s) • Consumer groups • Handle records pushed to topics • Streams API • Stream processing • Consume from Stream An, do processing, publish to Stream Bn • E.g. aggregation, joining • Connector API • To build custom Producers/Consumers • Purpose built integration components Kafka APIs 40
  41. • > Messaging systems (RabbitMQ, AMQ, etc) • Just don’t scale well and become complex • Need to use other abstractions for batching • Lacks replay ability (reset offset, etc) • > Log forwarders, e.g. Scribe or Flume • Push architecture • High performance • Scales well • Sensitive to business logic in endpoints, because needs to push data fast • Assumes data is pushed to large sink (e.g. Hadoop) • Oh, and then queries later. So much for real-time. • Supports Polyglot • Python, Go, .NET, Node.js, C/C++, etc • Robust ecosystem • https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem Why Use Kafka? 41
  42. • “…a fast and general engine for large-scale data processing.” • Hadoop / YARN • Map-Reduce – mapper, reducer, disk IO, queue, fetch resource • Great for parallel file processing of large files • Synchronization barrier during persistence • Spark • In-memory data processing • Interactive/iterative data query • Better supports more complex, interactive (real-time) apps • 100x faster than Hadoop MR (in memory), 10x faster on disk • Microbatching Spark: What is it? 42Source: spark.apache.org
  43. • Combine SQL, streaming, complex analytics • SQL • Dataframes • MLlib • GraphX • Spark Streaming Spark 43Source: spark.apache.org
  44. • Run it • Standalone • Hadoop • Mesos • Cloud • Access data • HDFS • Cassandra • Hbase • S3 • Hive • Tachyon, and more • Write code • Scala • Java • Python, Clojure, R • Interactive query shell (notebooks) Spark Execution Modes 44Source: spark.apache.org
  45. • Slow due to replication, serialization, filesystem IO • Inefficient use cases: • Iterative algorithms (ML, Graphs, Network analysis) • Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries) Spark: Hadoop MR 45Source: “Spark Overview”, Lisa Hua
  46. Spark: Hadoop & Spark (in Hadoop) 46Source: “Spark Overview”, Lisa Hua • Spark has additional features, such as interop with S3 storage
  47. Spark: A Clustered Application 47Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  48. Spark: Execution Terminology 48Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Job – a set of tasks to be executed as a result of an action • Stage – a set of tasks in a job that can be run in parallel • Task – a individual unit of work sent to a single executor
  49. Spark: SQL 49Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Spark SQL is a module for structured data querying • Supports basic SQL and HiveQL • Can act as distributed query engine via JDBC/ODBC, or CLI
  50. Spark: Dataframe, Datasets, RDDs 50Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Dataframe is a distributed assembly of data into named columns • Analogous to a relational table, or data frame in R/Python (with richer optimizations) • Dataset was added in 1.6 to provide benefit of RDDs and Spark SQL’s execution engine • Build datasets from JVM objects and then manipulate with functional transformations
  51. • Scalable, high-throughput, fault tolerant processing of real time streams and use cases. • Microbatched. Think of in terms of EDA (e.g. Esper), for “windowing” (etc) vs. handling a single message/event. Spark Streaming 51Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  52. • Can merge inbound data with historical data • Write code in Scala, Python, Java, etc • Lower level access via DStream, obtained via StreamingContext. • Create RDD’s from the DStream • Two primary metrics to monitor and tune: • Processing time (per batch) • Scheduling delay (processed upon arrival?) • Use Kyro for serialization Spark Streaming 52Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  53. Spark Streaming - comparison with other techs 53Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam
  54. Building a Spark Application 54Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam • Scala or Java, compiled to JAR (in turn uploaded to worker nodes) • Running a spark app is as easy as pie (submit options can expand the experience): $ spark-submit myAwesomePythonScript.py theFileURL $ spark-submit –class SkyNetInScala skynet2017.jar theFileURL • Spark uses log4j (beware) • YARN can aggregate worker logs
  55. Akka, who uses it? 55Source: “Akka Actor Introduction”, Gene
  56. Akka 56Source: “Introducing Akka”, Jonas Bonér • Vision • Simpler: Concurrency, Scalability, Fault-Tolerance • With a single unified • Programming Model • Managed Runtime • Open Source Distribution • Manage System Overload (backpressure) • Scale up & Scale out • Program to a higher level • No more shared state, state visibility, threads, locks, concurrent collections, thread notifications • Low level concurrency built into the plumbing; it becomes simple workflow just deal with messages • Increases CPU utilization, lowers latency, high throughput, scalable! • Superior, Proven model to detect and recover from errors
  57. Akka: Perfect for the Cloud 57Source: “Introducing Akka”, Jonas Bonér • Elastic and dynamic • Fault tolerant & self healing (autonomic) • Adaptive load-balancing, cluster rebalancing & Actor migration • Build loosely coupled systems that can dynamically adapt at runtime
  58. Akka 101 58Source: “Introducing Akka”, Jonas Bonér • Akka’s unit of code is called an Actor • The vehicle to create concurrent, scalable, fault-tolerant apps atop the fabric • Encapsulates code like servlets or session beans; policy decisions separated from biz logic • Actors have been around since 1973, and if you’ve ever used a phone, that software helped make it work. 9 nines of uptime! • Think of an actor as a VM in the cloud (it isn’t, but) • Encapsulated, decoupled • Managing own memory and behavior • Communicates asynchronously with non-blocking messages • Elastic – grow/shrink on demand • Hot deploy, change runtime behavior
  59. Akka: How to use Actors? 59Source: “Introducing Akka”, Jonas Bonér • Alternative to: • A thread • An object instance or component • Callback or listener • Singleton or service • Router, load-balancer or pool • Session bean or MDB • Out of process service • A Finite State Machine (FSM)
  60. Akka: What is it? 60Source: http://bit.ly/hewitt-on-actors • Carl Hewitt’s definition • Fundamental unit of computation that embodies: • Processing • Storage • Communication • 3 Axioms – When an actor receives a message it can: • Create new Actors • Send messages to Actors it knows • Designate how it should handle the next message it receives
  61. Akka: Core Actor Operations 61Source: “Introducing Akka”, Jonas Bonér 0. Define 1. Create 2. Send 3. Become 4. Supervise
  62. Akka: Define Operation 62Source: “Introducing Akka”, Jonas Bonér 0. Define • Define the message (class) the actor should respond to, and Actor class
  63. Akka: Create Operation 63Source: “Introducing Akka”, Jonas Bonér 1. Create • Yes, creates new actor. From ActorSystem, then ActorRef. • Lightweight, 2.6M per Gb RAM • Strong encapsulation of: state/behavior (indistinguishable), message queue
  64. Akka: Send Operation 64Source: “Introducing Akka”, Jonas Bonér 2. Send • Sends a message to an Actor • Asynch and on-blocking (fire & forget) • Everything is Reactive • Actor is passivated until receiving a message, which triggers it to awaken • Messages are energy • Everything is asynch and lockless
  65. Akka: Performance 65Source: “Introducing Akka”, Jonas Bonér +50 million messages per second !!!
  66. Akka: Remote Deployment 66Source: “Introducing Akka”, Jonas Bonér
  67. Akka: Become Operation 67Source: “Introducing Akka”, Jonas Bonér 3. Become • Dynamically redefine Actor’s behavior • Reactively triggered by receipt of a message • Will not react differently to messages it receives • Behaviors are stacked – can by pushed and popped… (Think in terms of the object changed it’s type – interface, protocol, implementation)
  68. Akka: Become Operation – Why? 68Source: “Introducing Akka”, Jonas Bonér Why do this? • A busy actor can become an Actor Pool or Router! • Implement FSM (Finite State Machine) • Implement graceful degradation • Spawn empty workers that can ”Become” whatever the Master desires • Very useful. Limited only by your imagination.
  69. Akka: Failure Management 69Source: “Introducing Akka”, Jonas Bonér In Java/C/C+ etc. • Single thread of control • If that thread blows up you are screwed • Only option to do explicit error handling within your single thread • Errors isolated within thread; other threads have no clue • Results in tons of defensive code, scattered throughout the codebase and entangled in your business logic
  70. Akka: Supervise Operation 70Source: “Introducing Akka”, Jonas Bonér 4. Supervise • Manage another Actor’s failures (or the person sitting next to you) • Let actors monitor (supervise) each other for failure, then respond • Notification sent to Actor’s supervisor if failure occurs. • Neat separation of processing from error handling
  71. Akka Streaming 71Source: Akka Documentation • Akka Streams API is completely decoupled from Reactive Streams interfaces • Implementation details to pass stream data between processing stages • Akka Streams is interoperable with any conformant Reactive Streams i • Principles • All features explicit in the API • Compositionality; combined pieces retain the function of each part • Model of domain of distributed bounded stream processing • Reactive Streams -> JDK9 Flow APIs
  72. Akka Streaming 72Source: Akka Documentation • Immutable building blocks / blueprints enabled for libraries, include • Source - something with exactly one output stream • Sink - something with exactly one input stream • Flow - something with exactly one input and one output stream • BidiFlow - something with exactly two input streams and two output streams that conceptually behave like two Flows of opposite direction • Graph - a packaged stream processing topology that exposes a certain set of input and output ports, characterized by an object of type Shape. • Built in backpressure capability • No stage can push downstream unless it received a pull beforehand • Difference between error and failure • Error is accessible within the stream as a data element (signaled via onNext) • Failure means the stream itself has collapsed (signaled via onError). • Want failure to propagate faster than data (essential, to deal with backpressure) • Data elements emitted before a failure can still be lost of the onError overtakes them • Recovery element acts as bulkhead to confine a stream collapse to a given region of stream topology, to isolate outside from impact of collapsed region (e.g. buffered elements)
  73. Akka HTTP 73Source: Akka Documentation • Built atop Akka Streams • Can expose an incoming connection in form of a Source instance • To start listening on network with Akka HTTP, create a Route and bind it to a port (similar syntax to Spray). • Backpressure on source? Akka HTTP stops consuming data from network; eventually leads to 0 TCP window – applying backpressure to sending party itself (e.g. a sensor) • Rules • Libraries shall provide their users with reusable pieces, i.e. expose factories that return graphs, allowing full compositionality • Avoid destruction of compositionality. • Express functionality of a library such that materialization can be done by user outside of library’s control. • Libraries may optionally and additionally provide facilities that consume and materialize graphs • Allows a library to provide convenience “sugar” for use cases
  74. • Akka Streams Integration • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ • Adds interesting capabilities to Akka Streams • Modern alternative to Apache Camel (EIP implementation) • Camel en ze Akka • Community driven, focused on connectors to external libraries, integration patterns and conversions. • ”A call to arms” • https://github.com/akka/alpakka/releases/tag/v0.1 Alpakka 101 74
  75. The Data Lake for Analytics, App Dev “Lake Wobegon, the little town that time forgot and the decades cannot improve.” - Garrison Keillor 75
  76. The Data Lake 76 • Note: This section will be expanded for DevNexus.
  77. The Data Lake 77 • The Data Lake is where a copy of much of the data from source sytems “ends up”, via Fast Data, etc. • Easily accessible, massive repository of data built on commodity hardware (or Cloud). • Data is not stored in a way that is optimized for data analysis (S3) • Data Lake retains all attributes • Beware the Data Lake fallacy: http://www.gartner.com/newsroom/id/2809117 • Let’s combine all this data to drive increased information sharing, usage, while reducing cost through consolidation / tech simplication. • Does it really work? • Has the ideal of Enterprise-wide data management been realized? • Deriving value from data still in hands of business end user (enter: Fast Data Platforms)
  78. The Dark Side of the Data Lake 78Source: “Gartner Says Beware of the Data Lake Fallacy” • Many companies tend to vacuum data into a Hadoop for later use • Many companies use overlapping tools within the same ecosystem, that do not interoperate • Data lakes ignore how/why data is used, governed, defined and secured. • Does this sound like a good solution? • Data Lake solves old problem of siloing data. Great, so now it’s all in comingled. • Federated query? AWS Athena? Why move data if not necessary? • Inability to quantitatively measure data quality • Accepts any data without governance or oversight • Accepts any data without metadata (description) • Inability to share lineage of findings by other analysis to share found value • Security, access control (and tracking of both) • Data Ownership, Entitlements? • Tenancy? • Regard for regulatory controls, compliance issues? • What to do?
  79. More to Come… 79Source: “Gartner Says Beware of the Data Lake Fallacy” • At DevNexus 2017 • Thurs, Feb 23 @ 2:30pm • http://devnexus.com/s/devnexus2017/presentations/17212
  80. Presentation Improvements for DevNexus “Build something 100 people love, not something 1 million people kind of like.” - Brian Chesky 80
  81. • More diagrams, fewer words • Refine, refine, refine • Mix in coding examples • Improve contrast between older architectures and reactive, for the Enterprise • Content that contrasts different streaming options (Akka, Spark, Kafka) • Add specific performance details • Incorporate additional, interesting content (incl. Data Lake related) Planned Improvements 81
  82. Questions? “I'm sorry, if you were right, I'd agree with you.” - Robin Williams 82
  83. • The Reactive Manifesto (http://www.reactivemanifesto.org/) • Chaos Monkey? Use Linear Fault Driven Testing instead. • https://www.lightbend.com/blog/architect-reactive-design-patterns • http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html • https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka • https://kafka.apache.org • http://www.slideshare.net/ducasf/introduction-to-kafka • http://www.slideshare.net/SparkSummit/grace-huang-49762421 • http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms • https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/ • https://github.com/akka/alpakka/releases/tag/v0.1 • http://www.slideshare.net/LisaHua/spark-overview-37479609 • http://spark.apache.org/ • https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/ • http://www.slideshare.net/gene7299/akka-actor-presentation • http://www.slideshare.net/jboner/introducing-akka • http://bit.ly/hewitt-on-actors • http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html • https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/ Resources 83
  84. Ideas 84

Editor's Notes

  1. Mi
  2. Mi
  3. Mi
  4. Mi
  5. Mi
  6. Mi
  7. Mi
  8. Mi
  9. Mi
  10. Mi
Advertisement