
Building real time data-driven products

This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.

Presented in 2016 by Øyvind Løkling (Schibsted Products & Technology); joint work with Lars Albertsson (independent consultant).


  1. 1. Building Real-time Data-driven Products Øyvind Løkling & Lars Albertsson Version 1.3 - 2016.10.12
  2. 2. Øyvind Løkling Staff Software Engineer Schibsted Product & Technology Oslo - Stockholm - London - Barcelona - Krakow
  3. 3. Lars Albertsson Independent Consultant
  4. 4. .. and more
  5. 5. Presentation goals ● Spark your interest in building data-driven products. ● Give an overview of components and how they relate. ● Suggest technologies and approaches that can be used in practice. Agenda: ● Event Collection ● The Unified Log ● Stream Processing ● Serving Results ● Schemas
  6. 6. Data-driven products • Services and applications primarily driven by capturing and making sense of data • Health trackers • Recommendations • Analytics
  7. 7. Data-driven products • Hosted services need to • Handle large volumes of data • Clean and structure data • Serve individual users fast
  8. 8. Big Data, Fast Data, Smart Data • Accelerating data volumes and speeds • Internet of Things • A/B testing and experiments • Personalised products
  9. 9. Big Data, Fast Data, Smart Data • A need to make sense of data and act on it fast • Faster development cycle • Data-driven organisations • Data-driven products
  10. 10. Time scales - what parts become obsolete? Credits: Confluent, Netflix
  11. 11. The Log • Other common names • Commit log • Journal
  12. 12. Jay Kreps - I <3 Logs The state machine replication principle: If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.
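The state machine replication principle quoted above can be sketched in a few lines: two deterministic processes fed the same inputs in the same order end in the same state. The `Counter` machine and event names here are invented for illustration.

```python
class Counter:
    """A trivial deterministic state machine: it sums amounts per key."""
    def __init__(self):
        self.state = {}

    def apply(self, event):
        key, amount = event
        self.state[key] = self.state.get(key, 0) + amount

log = [("clicks", 1), ("clicks", 2), ("views", 5)]  # the shared, ordered log

replica_a, replica_b = Counter(), Counter()
for event in log:            # same inputs, in the same order...
    replica_a.apply(event)
    replica_b.apply(event)

# ...so both replicas end in the same state.
assert replica_a.state == replica_b.state == {"clicks": 3, "views": 5}
```

This is why a replicated log is enough to keep many consumers consistent: the log fixes the input order, and determinism does the rest.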
  13. 13. The Unified Log • Simple idea: all of the organization's data, available in one unified log that is • Unified: the one source of truth • Append-only: data items are immutable • Ordered: addressable by offset, unique per partition • Fast and scalable: able to handle thousands of messages per second • Distributed, robust and fault-tolerant
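The properties listed above can be made concrete with a toy in-memory log: append-only, ordered, and addressable by an offset that is unique per partition. This is purely illustrative; a real unified log (e.g. Kafka) is distributed and replicated.

```python
class Log:
    """A toy partitioned, append-only log. Records are immutable once written."""
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, value):
        # Records with the same key land in the same partition,
        # so they stay ordered relative to each other.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset) is the address

    def read(self, partition, offset):
        return self.partitions[partition][offset]

log = Log()
p, o = log.append("user-42", b"page_view")
assert log.read(p, o) == b"page_view"
```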
  14. 14. Kafka • “Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.” But… :-)
  16. 16. © 2016 Apache Software Foundation
  18. 18. A streaming data product's anatomy [Diagram: ingress → unified log (pub/sub topics) → stream processing (jobs, pipelines) → egress to services, databases, export and visualisation]
  19. 19. Architectural patterns • The Unified Log, and the Lambda and Kappa architectures
  20. 20. Lambda Architecture Example: Recommendation Engine @ Schibsted
  21. 21. Kappa Architecture A software architecture pattern… where the canonical data store is an append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving. A Lambda architecture system with batch processing removed.
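The Kappa definition above can be sketched directly: the append-only log is the canonical store, and a serving view is (re)built by streaming the log through a computation. All names here are made up for illustration.

```python
# Canonical, immutable event log: the source of truth.
log = [
    {"user": "a", "action": "view"},
    {"user": "a", "action": "view"},
    {"user": "b", "action": "view"},
]

def build_view(events):
    """The 'computational system': folds events into an auxiliary serving store."""
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + 1
    return view

serving_store = build_view(log)          # initial materialisation
assert serving_store == {"a": 2, "b": 1}

# Reprocessing (e.g. after fixing a bug in build_view) is just replaying the log:
assert build_view(log) == serving_store
```

No separate batch layer is needed: the replay path and the live path are the same code, which is the "Lambda with batch processing removed" claim in practice.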
  22. 22. Kappa architecture
  23. 23. Lambda vs Kappa ● Lambda ○ Leverages existing batch infrastructure ○ Cognitive overhead of maintaining two approaches in parallel ● Kappa ○ Is real-time processing inherently approximate, less powerful, and more lossy than batch processing? True? ○ Simpler model
  24. 24. Cold Storage • Source of truth for replay in case of failure • Available for ad-hoc batch querying (Spark) • Wishlist: fast writes, reliable, cheap • Cloud storage - S3 (with Glacier) • Traditional - HDFS, SAN • Consider what read performance you need for a) error recovery, b) bootstrapping new deployments
  25. 25. Capturing the data
  26. 26. Event collection [Diagram: services hand off events to the unified log (Kafka event bus with history); connectors (Secor, Camus) land events in cluster storage (HDFS, NFS, S3, Google CS, C*)] • Service: reliable, simple, write-available • Immediate handoff to append-only replicated log; once in the log, events eventually arrive in storage • Unified log: immutable events, append-only, source of truth
  27. 27. Event collection - guarantees [Diagram: services hand off to replicated topics in the unified log; events are safe once replicated] • Non-critical data: asynchronous fire-and-forget handoff • Critical data: synchronous, replicated, with acknowledgement
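The two handoff modes above can be simulated in-memory. With a real Kafka producer the same distinction is made through configuration and call style (e.g. the `acks` setting, and whether the caller waits on the send result); the classes below are hypothetical stand-ins, not a Kafka API.

```python
class Topic:
    """Stand-in for a replicated topic in the unified log."""
    def __init__(self):
        self.records = []

    def replicate_and_ack(self, record):
        self.records.append(record)   # pretend replication succeeded
        return True                   # acknowledgement

topic = Topic()

def send_fire_and_forget(record):
    """Non-critical data: hand off and move on; occasional loss is tolerated."""
    topic.replicate_and_ack(record)   # result deliberately ignored

def send_acknowledged(record):
    """Critical data: block until the log confirms replication."""
    if not topic.replicate_and_ack(record):
        raise IOError("handoff failed; caller must retry")

send_fire_and_forget({"type": "heartbeat"})
send_acknowledged({"type": "payment", "amount": 99})
assert len(topic.records) == 2
```

The trade-off is latency and back-pressure on the sender versus durability: once the acknowledged handoff returns, the event is safe even if the service crashes.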
  28. 28. Event collection • Source types • Firehose APIs • Mobile apps and websites • Internet of Things / embedded sensors • Event sourcing from existing systems
  29. 29. Event collection • Considerations • Can you control the data flow, and ask the sender to wait? • Can clients be expected to have their logic updated? • Can you afford to lose some data, make tradeoffs, and still solve your task?
  30. 30. Stream Processing
  31. 31. Pipeline graph • Parallelised jobs read and write to Kafka topics • Jobs are completely decoupled • Downstream jobs do not impact upstream • Usually an acyclic graph (Illustration from the Apache Samza docs, Concepts)
  32. 32. Stream processing components ● Building blocks ○ Aggregate ■ Calculate time windows ■ Aggregate state (database/in memory) ○ Filter ■ Slim down stream ■ Privacy, Security concerns ○ Join ■ Enrich by joining with datasets (geoip) ○ Transform ■ Bring data into same “shape”, schema
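The four building blocks above can be sketched as plain functions over an event stream. The geoip lookup table and field names are invented for illustration; in a real job each stage would read from and write to a topic.

```python
GEOIP = {"1.2.3.4": "NO", "5.6.7.8": "SE"}   # hypothetical enrichment dataset

events = [
    {"user": "a", "ip": "1.2.3.4", "ms": 120, "internal": False},
    {"user": "b", "ip": "5.6.7.8", "ms": 80,  "internal": True},
    {"user": "c", "ip": "1.2.3.4", "ms": 200, "internal": False},
]

# Filter: slim down the stream (here, drop internal traffic).
stream = (e for e in events if not e["internal"])

# Join: enrich each event with an external dataset (geoip).
stream = ({**e, "country": GEOIP.get(e["ip"], "??")} for e in stream)

# Transform: bring data into the same "shape" / schema.
stream = ({"user": e["user"], "country": e["country"], "latency_ms": e["ms"]}
          for e in stream)

# Aggregate: fold one time window's events into state (per-country counts).
window = {}
for e in stream:
    window[e["country"]] = window.get(e["country"], 0) + 1

assert window == {"NO": 2}
```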
  33. 33. Stream Processing Platforms • Spark Streaming • Ideal if you are already using Spark; same model • Bridges the gap between data science / data engineering, batch and stream • Kafka Streams • Library - new, positions itself as a lightweight alternative • Tightly coupled to Kafka • Others ○ Storm, Flink, Samza, Google Dataflow, AWS Lambda
  34. 34. Schemas
  35. 35. Schemas • You always have a schema • Schema on write • Requires upfront schema design before data can be received • Synchronised deployment of the whole pipeline • Schema on read • Allows data to be captured as-is • Suitable for a "data lake" • Often requires cleaning and transformation downstream to bring datasets into consistency
  36. 36. Schema on read or write? [Diagram: services and databases upstream, export to business intelligence downstream; change agility important on the service side, production stability important on the export / business intelligence side]
  37. 37. Schemas • Options in streaming applications • Schema bundled with every record • Schema registry + id in record • Schema formats • JSON Schema • Avro
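The "schema registry + id in record" option can be sketched as follows: instead of bundling the full schema with every record, each record carries a small id and readers look the schema up once. The registry contents here are invented; a real registry (e.g. Confluent's) is a service reached over HTTP and typically stores Avro schemas.

```python
# Hypothetical in-process registry: id -> schema.
REGISTRY = {
    1: {"type": "record", "name": "PageView",
        "fields": [{"name": "user", "type": "string"}]},
}

def encode(schema_id, payload):
    """Writer side: ship a compact id, not the whole schema."""
    return {"schema_id": schema_id, "payload": payload}

def decode(record):
    """Reader side: one registry lookup, shared across millions of records."""
    schema = REGISTRY[record["schema_id"]]
    return schema["name"], record["payload"]

record = encode(1, {"user": "a"})
name, payload = decode(record)
assert name == "PageView" and payload == {"user": "a"}
```

The payoff is smaller records and a single place to manage compatibility rules, at the cost of a runtime dependency on the registry.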
  38. 38. Evolving Schemas • Declare the schema version even if there is no guarantee; it captures the intention of the source • Be prepared for bad and non-validating data • Decide on a strategy for bringing schema versions into alignment • Maintain an upgrade path through transforms • What are the needs of the consumer? • Data exploration vs stable services
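"Maintain an upgrade path through transforms" can look like the sketch below: consumers upgrade each record to the latest schema version before processing, and non-validating data is quarantined rather than crashing the job. Field names and versions are hypothetical.

```python
def upgrade_v1_to_v2(record):
    # Hypothetical change: v2 split "name" into first/last.
    first, _, last = record["name"].partition(" ")
    return {"version": 2, "first": first, "last": last}

UPGRADES = {1: upgrade_v1_to_v2}   # version -> transform to the next version
LATEST = 2

def to_latest(record):
    """Chain transforms until the record is at the latest schema version."""
    while record.get("version", 0) < LATEST:
        v = record.get("version", 0)
        if v not in UPGRADES:
            raise ValueError("no upgrade path from version %r" % v)
        record = UPGRADES[v](record)
    return record

good, quarantine = [], []
for r in [{"version": 1, "name": "Ada Lovelace"},
          {"oops": True}]:               # declares no version at all
    try:
        good.append(to_latest(r))
    except (ValueError, KeyError):
        quarantine.append(r)             # set aside, not dropped silently

assert good == [{"version": 2, "first": "Ada", "last": "Lovelace"}]
assert len(quarantine) == 1
```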
  39. 39. Results
  40. 40. Serving Results ● As streams ○ Internal Consumer ○ External Consumer bridges ■ ex. REST post to external ingest endpoint ● As Views ○ Serving index, NoSQL ○ SQL / cubes for BI
  41. 41. Reactive Streams • [...] an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure. [...] aimed at runtime environments (JVM and JavaScript) as well as network protocols. • The scope [...] is to find a minimal set of interfaces, methods and protocols that will describe the necessary operations and entities to achieve this goal. • “Glue” between libraries. Reactive Kafka -> Akka Stream -> RxJava
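The non-blocking back pressure idea above boils down to demand signalling: the subscriber tells the publisher how many items it can take, so a fast producer cannot flood a slow consumer. The toy below mimics that request(n) flow control in plain Python; it borrows the Publisher/Subscriber vocabulary but is not the real Reactive Streams API.

```python
class RangePublisher:
    """Holds data and emits it only when demand is signalled."""
    def __init__(self, n):
        self.data = list(range(n))
        self.pos = 0

    def request(self, n):
        # Never emit more than the subscriber asked for.
        batch = self.data[self.pos:self.pos + n]
        self.pos += len(batch)
        return batch

class SlowSubscriber:
    """Pulls in small batches matching its capacity."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.received = []

    def drain(self, publisher):
        while True:
            batch = publisher.request(self.capacity)   # the demand signal
            if not batch:
                return
            self.received.extend(batch)

pub = RangePublisher(5)
sub = SlowSubscriber(capacity=2)
sub.drain(pub)
assert sub.received == [0, 1, 2, 3, 4]
```

The real specification runs this exchange asynchronously, which is what lets it glue libraries such as Reactive Kafka, Akka Streams and RxJava together without unbounded buffering.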
  43. 43. Thank you!
  44. 44. Schemas • You always have a schema • Even if you are “Schemaless” • Build tooling and workflows for handling schema changes