
Flink Forward Berlin 2017: Stephan Ewen, Flavio Junqueira - Connecting Apache Flink and the Pravega Stream Store



Pravega is a storage substrate that we have designed and implemented from the ground up to accommodate the requirements of modern data analytics applications. The architecture of Pravega is such that applications use it to store stream data permanently while still being able to process this data with low latency. Storing stream data permanently is highly desirable because it enables applications to process such data either in near real-time or months later, in the same way and through the same interface. Writes to Pravega can be transactional, so that an application can make data visible for reading atomically. This feature is important for guaranteeing exactly-once semantics when writing data into Pravega. As the volume of writes on a stream can grow and shrink over time, Pravega allows the capacity of a stream to adapt by scaling streams automatically. The auto-scaling feature lets the number of segments of a stream increase and decrease according to load, where a segment is the unit of parallelism of a Pravega stream. Pravega by itself does not provide any capability to process data, however; it requires an engine with advanced data analytics features, such as Apache Flink, to process its streams. To connect Pravega and Flink, we have been working with the community to implement a source and a sink. The source uses Pravega checkpoints to save the state of incoming streams and guarantees exactly-once semantics, while not requiring a sticky assignment of source workers to segments. Pravega checkpoints integrate and align well with Flink checkpoints, making this a unique approach compared to existing sources. As scaling is a key feature of Pravega, we also envision scaling signals flowing out of Pravega to the source workers to indicate that a job needs to scale up or down, a feature currently under development and [...]



  1. 1. Pravega: Rethinking storage for streams Stephan Ewen, Data Artisans Flavio Junqueira, Pravega Flink Forward – Berlin 2017
  2. 2. Outline • Intro to Pravega • Flink + Pravega
  3. 3. Pravega http://pravega.io
  4. 4. Streams
  5. 5. • Process bits • Store bits • Transmit bits • Example: "Flavio Junqueira, lives in Barcelona" shown as the bit stream encoding "FlavioJunqueira,Barcelona"
  6. 6. A sequence of events over time, each shown as a stream of bits: • Flavio Junqueira is in Barcelona, Sep. 10 – noon • Flavio Junqueira is in Berlin, Sep. 11 – noon • Flavio Junqueira is back in Barcelona, Sep. 15 – noon. Observations: • Order matters • Correlation between events • Causality maybe?
  7. 7. Device readings over time, each shown as a stream of bits (CPU usage: 10%, 50%, 60%; Temperature: 20C) • Number of devices potentially much larger • Volume per device potentially much higher
  8. 8. Processing data streams
  9. 9. Processing streams (diagram: several streams of bytes feeding a stream processing pipeline)
  10. 10. A typical architecture: Source → Messaging substrate → Stream data processor → Messaging substrate → Stream data processor (one or more stages). The messaging substrate: • Ingests and buffers data • Decouples the source from the engine processing the data
  11. 11. A typical architecture (continued): the messaging substrate ingests and buffers data and decouples the source from the processing engine, but it has limitations: • Data stored only temporarily • Not able to store an unbounded amount of stream data
  12. 12. A typical architecture (continued): limitations • Data stored only temporarily • Not able to store an unbounded amount of stream data • A separate bulk data store is needed for historical reads
  13. 13. The target of Pravega is a stream store able to: • Store stream data permanently • Preserve order • Accommodate unbounded streams
  14. 14. Streams in Pravega
  15. 15. Pravega and Streams (diagram: a writer appends bytes to a Pravega stream; a reader reads bytes from it)
  16. 16. What about parallelism?
  17. 17. Pravega and Streams (diagram: multiple parallel byte sequences appended to and read from Pravega)
  18. 18. Segments in Pravega (diagram: appends and reads go to segments) • Segments are sequences of bytes
  19. 19. Segments in Pravega • Segments are sequences of bytes • Use routing keys to determine the segment
  20. 20. Segments can be sealed
  21. 21. Segments in Pravega • Once sealed, a segment can't be appended to any longer
  22. 22. Segments in Pravega • Pravega is primarily a segment store • Segments are either sealed or open
  23. 23. How is sealing segments useful?
  24. 24. Segments in Pravega • Segments compose to form a stream • Each segment can live in a different server • A stream is not limited to the capacity of a single server
  25. 25. Segments in Pravega (diagram: sealed and open segments composed to form a stream)
  26. 26. Some useful ways to compose segments
  27. 27. Scaling a stream • Say input load has increased and more parallelism is needed • 1: The stream has one segment • 2: Seal the current segment and create new ones
  28. 28. Routing key space 0.0–1.0 over time (diagram): two splits (at keys 0.5 and 0.75, at t0 and t1) and a later merge (at t2) produce segments 1 through 6
  29. 29. Routing key space 0.0–1.0 over time (same diagram): key ranges are not statically assigned to segments
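To make the routing-key idea concrete, here is a small sketch, with purely illustrative types (not the Pravega client API), of how a routing key can be hashed into the key space [0.0, 1.0) and mapped to whichever open segment currently owns that range:

      import java.util.List;

      // Illustrative types only, not the Pravega client API.
      class KeyRange {
          final double low, high;      // key range [low, high) owned by one open segment
          final String segmentId;
          KeyRange(double low, double high, String segmentId) {
              this.low = low; this.high = high; this.segmentId = segmentId;
          }
      }

      class RoutingSketch {
          // The set of currently open segments and the key ranges they own.
          // After a split or merge, the controller publishes a new list.
          private volatile List<KeyRange> openSegments;

          RoutingSketch(List<KeyRange> openSegments) { this.openSegments = openSegments; }

          // Hash the routing key into [0.0, 1.0) and pick the segment owning that point.
          String segmentFor(String routingKey) {
              double point = (routingKey.hashCode() & 0x7fffffff) / (Integer.MAX_VALUE + 1.0);
              for (KeyRange r : openSegments) {
                  if (point >= r.low && point < r.high) {
                      return r.segmentId;
                  }
              }
              throw new IllegalStateException("open segments must cover [0.0, 1.0)");
          }

          // Auto-scaling changes the mapping, e.g. the segment owning [0.5, 1.0) is sealed
          // and replaced by two successors owning [0.5, 0.75) and [0.75, 1.0).
          void onScale(List<KeyRange> newOpenSegments) { this.openSegments = newOpenSegments; }
      }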
  30. 30. Source: Virtual cluster - Nautilus Platform
  31. 31. Source: Virtual cluster - Nautilus Platform (chart showing scale up and scale down)
  32. 32. Transactions (stream with two segments, s1 and s2): • 1: Stream has two segments • 2: Begin txn: txn segments are created • 3: Write to txn • 4: Write to txn • 5: Upon commit: seal the txn segments • 6: Merge each txn segment into its stream segment
  33. 33. Transactions (abort path): • 1: Stream has two segments • 2: Begin txn: txn segments are created • 3: Write to txn • 4: Write to txn • 5: Upon abort: eliminate the txn segments
  34. 34. Wait, how are segments manipulated?
  35. 35. Controller • Control plane (the segment store is the data plane) • A few of the controller tasks: • Stream lifecycle: create, delete, scale • Txn management: create, commit/abort
  36. 36. API – Writer and Reader
  37. 37. Events • Internally, Pravega is all about bytes • Current API focused on events • Some encapsulation of application bytes • Serializer interface:
      public interface Serializer<T> {
          /**
           * Serializes the given event.
           *
           * @param value The event to be serialized.
           * @return The serialized form of the event.
           * NOTE: buffers returned should not exceed {@link #MAX_EVENT_SIZE}.
           */
          ByteBuffer serialize(T value);

          /**
           * Deserializes the given ByteBuffer into an event.
           *
           * @param serializedValue An event that has been previously serialized.
           * @return The event object.
           */
          T deserialize(ByteBuffer serializedValue);
      }
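As an example of the Serializer interface above, a minimal UTF-8 String serializer might look as follows (a sketch of one possible implementation):

      import java.nio.ByteBuffer;
      import java.nio.charset.StandardCharsets;

      // A minimal Serializer<String> that encodes each event as UTF-8 bytes.
      public class Utf8StringSerializer implements Serializer<String> {

          @Override
          public ByteBuffer serialize(String value) {
              // The returned buffer must stay below MAX_EVENT_SIZE.
              return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
          }

          @Override
          public String deserialize(ByteBuffer serializedValue) {
              // Decode the remaining bytes of the buffer back into a String.
              return StandardCharsets.UTF_8.decode(serializedValue).toString();
          }
      }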
  38. 38. EventStreamWriter API
      String scope = "myScope";
      WriterConfig config = new WriterConfig();
      String streamName = "myStream";
      ClientFactory factory = ClientFactory.withScope(scope, new URI("//demo.pravega.io:3333"));
      EventStreamWriter<String> writer = factory.createEventWriter(streamName, serializer, config);
      while (!worldEnd) {
          /* E.g., getNewRecord() reads the next line in a file */
          String record = getNewRecord();
          String key = extractKey(record);
          String event = extractEvent(record);
          writer.writeEvent(key, event);
      }
  39. 39. Using the EventStreamReader API
      String scope = "myScope";
      String myReaderId = "myId";
      String myReadGroup = "myGroup";
      ReaderConfig config = new ReaderConfig(props);
      String streamName = "myStream";
      long timeout = 1000; // ms to wait for the next event
      ClientFactory factory = ClientFactory.withScope(scope, new URI("//127.0.0.1:3333"));
      EventStreamReader<String> reader = factory.createEventReader(myReaderId, myReadGroup, serializer, config);
      while (!worldEnd) {
          EventRead<String> event = reader.readNextEvent(timeout);
          // The application consumes the event and is supposed to persist the Position
          // object to guarantee exactly-once.
          consumeEvent(event.getEvent(), event.getPosition());
      }
      reader.close();
  40. 40. Transactions
      Txn txn = writer.beginTxn();
      txn.writeEvent(getKey("Pravega"), "Pravega");
      txn.writeEvent(getKey("is"), "is");
      txn.writeEvent(getKey("invading"), "invading");
      txn.commit();
  41. 41. Pravega semantics
  42. 42. The write path: Event Stream Writer → Controller (1: locate segment) → Segment store (2: append bytes) → Apache BookKeeper (3) and long-term storage (4)
      • Apache BookKeeper: synchronous write, temporarily stored, truncated once flushed to the next storage tier, optimized for low-latency writes
      • Long-term storage: asynchronous write, permanently stored, options: HDFS, NFS, Extended S3, high read/write throughput
  43. 43. Guarantees on the write path • Order: the writer appends in application order • Duplicates: writer IDs map to the last appended data on the segment store; the writer does not persist its ID to tolerate a crash • Txns: atomicity at the stream level; if anything goes wrong with the writes, either abort or let the txn time out
  44. 44. Avoiding duplicates – Reconnect (writer ID = 10): • 1: Writer appends; segment store records writer 10 → event 1 and acks • 2: Writer appends again; store records 10 → 2 • 3: Writer reconnects and sets up writer ID 10; store still has 10 → 2 • 4: Store acks based on 10 → 2, so the retried append is not duplicated
  45. 45. Avoiding duplicates – Transactional writes (writer ID = 10): • 1: beginTxn via the controller; ack • 2: Write to the txn; segment store records 10 → 1 and acks • 3: No more writes arrive (store still has 10 → 1) • 4: The controller eventually times out and aborts the txn
  46. 46. Avoiding duplicates – Transactional writes (continued, new writer ID = 26): • 5: beginTxn via the controller; ack • 6: Write to the new txn; segment store records 26 → 1 and acks
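A rough sketch of the idea behind the reconnect flow above: the segment store keeps, per writer ID, the number of the last event it appended, and uses it to ignore retransmissions. The types and names below are illustrative only, not the actual Pravega server code:

      import java.util.HashMap;
      import java.util.Map;

      // Illustrative only: per-segment bookkeeping of "writer ID -> last appended event number".
      class SegmentAppendState {
          private final Map<Long, Long> lastEventPerWriter = new HashMap<>();

          // On setup/reconnect, tell the writer how far its appends already got.
          synchronized long setupAppend(long writerId) {
              return lastEventPerWriter.getOrDefault(writerId, 0L);
          }

          // Conditional append: only strictly increasing event numbers per writer are written.
          // A retransmitted event (number <= last) is acked to the writer but not appended again.
          synchronized boolean append(long writerId, long eventNumber, byte[] payload) {
              long last = lastEventPerWriter.getOrDefault(writerId, 0L);
              if (eventNumber <= last) {
                  return false;              // duplicate: drop it, but still ack
              }
              // ... write payload to the segment (omitted) ...
              lastEventPerWriter.put(writerId, eventNumber);
              return true;
          }
      }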
  47. 47. The read path: Event Stream Reader → Controller (1: locate segment) → Segment store (2: read bytes, 4: bytes returned) • Apache BookKeeper: used for recovery alone, not used to serve reads • Segment store: bytes read from memory; if not present, pull the data from Tier 2 long-term storage (3)
  48. 48. Reader groups • Group of event readers • Read events from a set of streams • Load distributed across readers of the group • Segments • A given reader reads from a set of segments • Coordination of segment assignment done via a state synchronizer • State synchronizer • General facility for synchronizing state across processes • Uses a revisioned Pravega segment • Advanced topic: not explained further in this talk
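The state synchronizer mentioned above coordinates reader-group state through optimistic, revision-based updates to a shared (revisioned) segment. A minimal sketch of that update loop, with hypothetical types rather than the actual StateSynchronizer API:

      import java.util.function.UnaryOperator;

      // Hypothetical revisioned store: a conditional update succeeds only if the caller
      // read the latest revision; otherwise it must re-read and retry.
      interface RevisionedState<S> {
          long currentRevision();
          S read(long revision);
          boolean compareAndUpdate(long expectedRevision, S newState);   // false on conflict
      }

      class SynchronizerSketch<S> {
          private final RevisionedState<S> shared;

          SynchronizerSketch(RevisionedState<S> shared) { this.shared = shared; }

          // Apply an update function until it lands on top of the latest revision.
          S update(UnaryOperator<S> updateFn) {
              while (true) {
                  long revision = shared.currentRevision();
                  S next = updateFn.apply(shared.read(revision));
                  if (shared.compareAndUpdate(revision, next)) {
                      return next;           // our change is now part of the shared state
                  }
                  // Conflict: another process updated the state first; re-read and retry.
              }
          }
      }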
  49. 49. Reader groups + Scaling: • 1: Two readers share segments 1 and 2 • 2: Scale up! The stream now also has segments 3 and 4
  50. 50. Reader groups + Scaling (continued): • 3: A reader hits the end of a segment, gets the successors from the controller, and updates the reader group state • 4: Readers move on to segments 2, 3, and 4 • 5: Resulting assignment: one reader owns {3}, the other {2, 4}
  51. 51. Checkpoint object • Maps segments to corresponding offsets • Opaque to the application
  52. 52. Getting a checkpoint: • 1: Two readers reading from segments 1 and 2 • 2: Initiate checkpoint C1 • 3: Checkpoint markers (C) flow through the segments to the readers • 4: Result is C1: <checkpoint object> mapping Segment 1: 1 and Segment 2: 1
  53. 53. Resetting to a checkpoint: • 1: Reset to checkpoint C1 • 2: Reinitialize the readers • 3: Readers resume from the checkpointed positions; the segment assignment can be different from last time
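In code, the checkpoint flow of slides 52-53 roughly corresponds to initiating a checkpoint on the reader group and later resetting the group to it. The sketch below assumes the ReaderGroup methods initiateCheckpoint and resetReadersToCheckpoint from the Pravega client of this era; the exact names may differ in other releases:

      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.ScheduledExecutorService;

      // Sketch only: ReaderGroup and Checkpoint are Pravega client types; the method names
      // initiateCheckpoint and resetReadersToCheckpoint are assumed from this era's API.
      public class CheckpointSketch {

          // Take a checkpoint of a running reader group and return the opaque Checkpoint object.
          static Checkpoint takeCheckpoint(ReaderGroup group, ScheduledExecutorService executor) {
              // Checkpoint markers flow to all readers; the future completes once every
              // reader has reached the checkpoint.
              CompletableFuture<Checkpoint> future = group.initiateCheckpoint("C1", executor);
              return future.join();   // maps each segment to an offset; opaque to the application
          }

          // Later: rewind the whole group to the checkpoint; segment assignment may differ.
          static void rewind(ReaderGroup group, Checkpoint checkpoint) {
              group.resetReadersToCheckpoint(checkpoint);
          }
      }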
  54. 54. Wrap up
  55. 55. Take-away messages • Pravega is all about • Unbounded stream data • Permanently stored • Elasticity for streams • Scaling producers and consumers independently • Under active development • Looking at first use cases
  56. 56. Ongoing work • Performance tuning • Scaling support • Event-time support • Geo-distribution • Security • … and much more http://pravega.io E-mail: fpj@apache.org Twitter: @fpjunqueira Join the community!
  57. 57. 57 Pravega
  58. 58. Streaming Storage and Compute (diagram): Data / Event Producers (Applications, Sensors, APIs) → Pravega (Streaming Storage / PubSub / Messaging) → Consuming Applications and Streaming Applications
  59. 59. Pravega Streaming Pipelines (diagram): Applications, Sensors, and APIs feed a chain of Pravega streams (Pravega → Pravega → Pravega)
  60. 60. 60 Pravega Flink reading from Pravega
  61. 61. Reading via ReaderGroup (diagram: segments A–G assigned to readers running in Flink Task Managers, coordinated by the Pravega Controller & Reader Group) • Readers do not choose their own segments • The ReaderGroup automatically assigns and re-balances segments • Leaving the ReaderGroup in charge is key to automatic scaling
  62. 62. Reading via ReaderGroup • At points in time, some segments may not be assigned to any reader • Example: new segments, re-balancing segments, … (diagram: seg E being re-assigned)
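For a concrete picture of the reading side, here is a minimal sketch of a Flink SourceFunction wrapping the EventStreamReader from slide 39. It is an illustration only, not the actual FlinkPravegaReader, which additionally handles reader-group membership, checkpoints, and re-balancing:

      import org.apache.flink.streaming.api.functions.source.SourceFunction;

      // Illustration only: each parallel subtask joins the same Pravega reader group, and
      // the ReaderGroup decides which segments this reader serves at any point in time.
      public class PravegaSourceSketch implements SourceFunction<String> {

          private volatile boolean running = true;
          private final long timeoutMillis = 1000;

          @Override
          public void run(SourceContext<String> ctx) throws Exception {
              EventStreamReader<String> reader = createReader();   // constructed as on slide 39
              while (running) {
                  EventRead<String> event = reader.readNextEvent(timeoutMillis);
                  if (event.getEvent() != null) {
                      // Emit under the checkpoint lock so positions and output stay consistent.
                      synchronized (ctx.getCheckpointLock()) {
                          ctx.collect(event.getEvent());
                      }
                  }
              }
              reader.close();
          }

          @Override
          public void cancel() {
              running = false;
          }

          private EventStreamReader<String> createReader() {
              // Placeholder: build the reader as shown on slide 39.
              throw new UnsupportedOperationException("see slide 39 for reader construction");
          }
      }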
  63. 63. Flink + Pravega Checkpoints (diagram): the Job Manager's Checkpoint Coordinator, Task Managers reading segments A–F from Pravega (seg G being assigned), the readers' read positions, and a checkpoint store (HDFS, S3, NFS, …)
  64. 64. Flink + Pravega Checkpoints: • (1) Trigger checkpoint in Pravega • (2a) Segment checkpoint data returned • (2b) Inject barrier into the streams • (3) Operator snapshots • (4) Store checkpoint metadata in the checkpoint store (HDFS, S3, NFS, …)
  65. 65. 65 Flink writing to Pravega Pravega
  66. 66. The FlinkPravegaWriter • Regular Flink SinkFunction • No partitioner, but a "routing key" • Remember: No partitions in Pravega • Just dynamically created segments • Same key always goes to the same segment • Order of elements guaranteed per key! 66 Flink Application Pravega Nodes seg 2 seg 1 seg 3 seg 4
  67. 67. Exactly-once via Transactions • Similar to a distributed 2-phase commit • Coordinated by asynchronous checkpoints, no voting delays • Basic algorithm: • Between checkpoints: Produce into transaction • On operator snapshot: Flush local transaction (vote-to-commit) • On checkpoint complete: Commit transactions • On recovery: check and commit any pending transactions 67
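A sketch of how the four steps above map onto Flink's checkpoint hooks (CheckpointedFunction and CheckpointListener). This illustrates the algorithm only and is not the actual FlinkPravegaWriter; TxnWriter and Txn are hypothetical stand-ins for a transactional Pravega writer:

      import org.apache.flink.runtime.state.CheckpointListener;
      import org.apache.flink.runtime.state.FunctionInitializationContext;
      import org.apache.flink.runtime.state.FunctionSnapshotContext;
      import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
      import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

      import java.util.ArrayDeque;
      import java.util.Deque;

      // Sketch of the algorithm on this slide, not the actual FlinkPravegaWriter.
      public class TransactionalSinkSketch<T> extends RichSinkFunction<T>
              implements CheckpointedFunction, CheckpointListener {

          // Hypothetical transactional writer interface standing in for the Pravega client.
          interface Txn<E> { void write(E event); void flush(); void commit(); void abort(); }
          interface TxnWriter<E> { Txn<E> beginTxn(); }

          private final TxnWriter<T> writer;
          private transient Txn<T> currentTxn;                            // txn for the current checkpoint period
          private final Deque<Txn<T>> pendingCommit = new ArrayDeque<>(); // flushed, awaiting global commit

          public TransactionalSinkSketch(TxnWriter<T> writer) { this.writer = writer; }

          @Override
          public void initializeState(FunctionInitializationContext context) {
              // On recovery: check and commit any transactions that were flushed (voted)
              // but not yet confirmed committed before the failure (handles omitted here).
              currentTxn = writer.beginTxn();
          }

          @Override
          public void invoke(T value) {
              // Between checkpoints: produce into the open transaction.
              currentTxn.write(value);
          }

          @Override
          public void snapshotState(FunctionSnapshotContext context) {
              // On operator snapshot: flush the local transaction (the "vote to commit"),
              // remember it, and open a new transaction for the next checkpoint period.
              currentTxn.flush();
              pendingCommit.addLast(currentTxn);
              currentTxn = writer.beginTxn();
          }

          @Override
          public void notifyCheckpointComplete(long checkpointId) {
              // On checkpoint complete: commit all transactions flushed so far.
              while (!pendingCommit.isEmpty()) {
                  pendingCommit.pollFirst().commit();
              }
          }
      }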
  68. 68. Exactly-once via Transactions (timeline diagram: checkpoints chk-1 and chk-2; transactions TXN-1, TXN-2, TXN-3; local votes ✔chk-1 and ✔chk-2 followed by ✔ global commits into the Pravega Stream)
  69. 69. Transaction fails after local snapshot (timeline diagram: ✔chk-1 vote; only one ✔ global commit into the Pravega Stream)
  70. 70. Transaction fails before commit… (timeline diagram: ✔chk-1 vote; two ✔ global commits into the Pravega Stream)
  71. 71. … commit on recovery (timeline diagram: the TXN-2 handle is recovered from checkpoint chk-2 and committed ✔ global; processing continues with TXN-3 and chk-3)
  72. 72. 72 Looking ahead…
  73. 73. Looking ahead… • Automatic Scaling: Flink follows Pravega's scaling (at least the first stage) • High Availability through Pravega: use synchronizers instead of ZooKeeper (leader election, distributed atomic counters, …)
  74. 74. Questions? http://pravega.io http://github.com/pravega/pravega http://github.com/pravega/flink-connectors E-mail: sewen@apache.org, fpj@apache.org Twitter: @StephanEwen, @fpjunqueira
