Advertisement
Advertisement

More Related Content

Slideshows for you(20)

Similar to Introducing BinarySortedMultiMap - A new Flink state primitive to boost your application performance(20)

Advertisement
Advertisement

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your application performance

  1. Introducing BinarySortedState A new Flink state primitive to boost your application performance Nico Kruber – Software Engineer —— David Anderson – Community Engineering – Flink Forward 22
  2. About me Open source ● Apache Flink contributor/committer since 2016 ○ Focus on network stack, usability, and performance Career ● PhD in Computer Science at HU Berlin / Zuse Institute Berlin ● Software engineer -> Solutions architect -> Head of Solutions Architecture @ DataArtisans/Ververica (acquired by Alibaba) ● Engineering @ Immerok About Immerok ● Building a fully managed Apache Flink cloud service for powering real-time systems at any scale ○ immerok.com 2
  3. Agenda ● Motivation ● BinarySortedState ● Results & Future Work
  4. Motivation
  5. Use Case: Stream Sort 20 15 11 30 35 27 21 40 41 35 1 2 5 7 6 9 10 11 13 14
  6. Use Case: Stream Sort - Code void processElement(Long event, /*...*/) { TimerService timerSvc = ctx.timerService(); long ts = ctx.timestamp(); if (!isLate(ts, timerSvc)) { List<Long> listAtTs = events.get(ts); if (listAtTs == null) { listAtTs = new ArrayList<>(); } listAtTs.add(event); events.put(ts, listAtTs); timerSvc.registerEventTimeTimer(ts); } } MapState<Long, List<Long>> events; MapStateDescriptor<Long, List<Long>> desc = new MapStateDescriptor<>( "Events", Types.LONG, Types.LIST(Types.LONG)); events = getRuntimeContext().getMapState(desc); void onTimer(long ts, /*...*/) { events.get(ts).forEach(out::collect); events.remove(ts); }
  7. List<Long> listAtTs = events.get(ts); if (listAtTs == null) { listAtTs = new ArrayList<>(); } listAtTs.add(event); events.put(ts, listAtTs); Use Case: Stream Sort - What’s Happening Underneath State (RocksDB) De-/Serializer full list as byte[] search memtable + sst files for one entry lookup deserialized (Java) list
  8. List<Long> listAtTs = events.get(ts); if (listAtTs == null) { listAtTs = new ArrayList<>(); } listAtTs.add(event); events.put(ts, listAtTs); Use Case: Stream Sort - What’s Happening Underneath State (RocksDB) De-/Serializer full Java list serialized list as byte[] add new entry to memtable (leave old one for compaction)
  9. Use Case: Stream Sort - Alternative Solutions ● Using MapState<Long, Event> instead of MapState<Long, List<Event>>? ○ Cannot handle multiple events per timestamp ● Using Window API? ○ Efficient event storage per timestamp ○ No really well-matching window types: sliding, tumbling, and session windows ● Using HashMapStateBackend? ○ No de-/serialization overhead ○ state limited by available memory ○ no incremental checkpoints ● Using ListState<Event> and filtering in onTimer()? ○ Reduced overhead in processElement() vs. more to do in onTimer()
  10. {ts: 5, code: GBP, rate: 1.20} {ts: 10, code: USD, rate: 1.00} {ts: 19, code: USD, rate: 1.02} rates {ts: 15, code: USD, amount: 1.00} {ts: 10, code: GBP, amount: 2.00} {ts: 25, code: USD, amount: 3.00} transactions Use Case: Event-Time Stream Join {ts1: 15, ts2: 10, amount: 1.00} {ts1: 10, ts2: 5, amount: 2.40} {ts1: 25, ts2: 19, amount: 3.06} SELECT t.ts AS ts1, r.ts AS ts2, t.amount * r.rate AS amount FROM transactions AS t LEFT JOIN rates FOR SYSTEM_TIME AS OF t.ts AS r ON t.code = r.code;
  11. void onTimer(long ts, /*...*/) { TreeSet<Entry<Long, Double>> rates = ratesInRange(ts, rate.entries()); Double myRate = getLast(rates); transactions.get(ts).forEach( tx -> out.collect( new Joined(myRate, tx))); deleteAllButLast(rates); transactions.remove(ts); } Use Case: Event-Time Stream Join - Code void processElement1(Transaction value,/*...*/) { TimerService timerSvc = ctx.timerService(); long ts = ctx.timestamp(); if (!isLate(ts, timerSvc)) { addTransaction(ts, value); timerSvc.registerEventTimeTimer(ts); } } // similar for processElement2() MapState<Long, List<Transaction>> transactions; MapState<Long, Double> rate;
  12. void onTimer(long ts, /*...*/) { TreeSet<Entry<Long, Double>> rates = ratesInRange(ts, rate.entries()); Double myRate = getLast(rates); transactions.get(ts).forEach( tx -> out.collect( new Joined(myRate, tx))); deleteAllButLast(rates); transactions.remove(ts); } With RocksDB: ● always fetching all rates’ (key+value) bytes ⚠ ● (also need to fit into memory ⚠) ● deserialize all keys keys during iteration ⚠ ○ not deserializing values (at least) ✓ Use Case: Event-Time Stream Join - Code void processElement1(Transaction value,/*...*/) { TimerService timerSvc = ctx.timerService(); long ts = ctx.timestamp(); if (!isLate(ts, timerSvc)) { addTransaction(ts, value); timerSvc.registerEventTimeTimer(ts); } } // similar for processElement2() Similar to stream sort, with RocksDB: ● always fetch/write full lists ⚠ ● always de-/serialize full list ⚠ ● additional stress on RocksDB compaction ⚠ MapState<Long, List<Transaction>> transactions; MapState<Long, Double> rate;
  13. BinarySortedState
  14. BinarySortedState - History “Temporal state” Hackathon project (David + Nico + Seth) ● Main primitives: getAt(), getAtOrBefore(), getAtOrAfter(), add(), addAll(), update() 2020 Nov 2021 April 2022 started as a Hackathon project (David + Nico) on custom windowing with process functions Created FLIP-220 and discussed on dev@flink.apache.org ● Extended scope further to allow arbitrary user keys (not just timestamps) ○ Identified further use cases in SQL operators, e.g. min/max with retractions ● Clarified serializer requirements ● Extend proposed API to offer range read and clear operations ● …
  15. ● A new keyed-state primitive, built on top of ListState ● Efficiently add to list of values for a user-provided key ● Efficiently iterate user-keys in a well-defined sort order, with native state-backend support, especially RocksDB ● Efficient operations for time-based functions (windowing, sorting, event-time joins, custom, ...) ● Operations on subset of the state, based on user-key ranges ● Portable between state backends (RocksDB, HashMap) BinarySortedState - Goals
  16. BinarySortedState - API (subject to change!) ● Point-operations: ○ valuesAt(key) ○ add(key, value) ○ put(key, values) ○ clearEntryAt(key) ● Lookups: ○ firstEntry(), firstEntry(fromKey) ○ lastEntry(), lastEntry(UK toKey) ● Cleanup: ○ clear() ● Range-operations: ○ readRange(fromKey, toKey, inclusiveToKey) ○ readRangeUntil(toKey, inclusiveToKey) ○ readRangeFrom(fromKey) ● Range-deletes: ○ clearEntryAt(key) ○ clearRange( fromKey, toKey, inclusiveToKey) ○ clearRangeUntil(toKey, inclusiveToKey) ○ clearRangeFrom(fromKey) BinarySortedState<UK, UV>
  17. ● RocksDB is a key-value store writing into MemTables → flushing into SST files ● SST files are sorted by the key in lexicographical binary order (byte-wise unsigned comparison) BinarySortedState - How does it work with RocksDB?! ● RocksDB offers Prefix Seek and SeekForPrev ● RocksDBMapState.RocksDBMapIterator provides efficient iteration via: ○ Fetching up to 128 RocksDB entries at once ○ RocksDBMapEntry with lazy key/value deserialization
  18. BinarySortedState - LexicographicalTypeSerializer ● “Just” need to provide serializers that are compatible with RocksDB’s sort order ● Based on lexicographical binary order as defined by byte-wise unsigned comparison ● Compatible serializers extend LexicographicalTypeSerializer public abstract class LexicographicalTypeSerializer<T> extends TypeSerializer<T> { public Optional<Comparator<T>> findComparator() { return Optional.empty(); } }
  19. Stream Sort w/out BinarySortedState (1) private BinarySortedState<Long, Long> events; BinarySortedStateDescriptor<Long, Long> desc = new BinarySortedStateDescriptor<>( "Events", LexicographicLongSerializer.INSTANCE, Types.LONG); events = getRuntimeContext() .getBinarySortedState(desc); private MapState<Long, List<Long>> events; MapStateDescriptor<Long, List<Long>> desc = new MapStateDescriptor<>( "Events", Types.LONG, Types.LIST(Types.LONG)); events = getRuntimeContext() .getMapState(desc); public void onTimer(long ts, /*...*/) { events.valuesAt(ts).forEach(out::collect); events.clearEntryAt(ts); } public void onTimer(long ts, /*...*/) { events.get(ts).forEach(out::collect); events.remove(ts); }
  20. Stream Sort w/out BinarySortedState (2) public void processElement(/*...*/) { // ... events.add(ts, event); // ... } public void processElement(/*...*/) { // ... List<Long> listAtTs = events.get(ts); if (listAtTs == null) { listAtTs = new ArrayList<>(); } listAtTs.add(event); events.put(ts, listAtTs); // ... }
  21. events.add(ts, event); Stream Sort with BinarySortedState - What’s Happening Underneath State (RocksDB) De-/Serializer add new merge op to memtable Serialized entry as byte[] new list entry
  22. Stream Sort with BinarySortedState - What’s Happening Underneath State (RocksDB) De-/Serializer full list as byte[] Lookup deserialized (Java) list search memtable + sst files for all entries events.valuesAt(ts).forEach(out::collect);
  23. Mark k/v as deleted (removal during compaction) Stream Sort with BinarySortedState - What’s Happening Underneath State (RocksDB) De-/Serializer events.clearEntryAt(ts); Delete
  24. Event-Time Stream Join w/out BinarySortedState (1) private BinarySortedState<Long, Transaction> transactions; private BinarySortedState<Long, Double> rate; private MapState<Long, List<Transaction>> transactions; private MapState<Long, Double> rate; public void processElement1(/*...*/) { // ... if (!isLate(ts, timerSvc)) { // append to BinarySortedState: transactions.add(ts, value); timerSvc.registerEventTimeTimer(ts); } } // similar for processElement2() public void processElement1(/*...*/) { // ... if (!isLate(ts, timerSvc)) { // replace list in MapState: addTransaction(ts, value); timerSvc.registerEventTimeTimer(ts); } } // similar for processElement2()
  25. Event-Time Stream Join w/out BinarySortedState (2) public void onTimer(long ts, /*...*/) { Entry<Long, Iterable<Double>> myRate = rate.lastEntry(ts); Double rateVal = Optional .ofNullable(myRate) .map(e -> e.getValue().iterator().next()) .orElse(null); transactions.valuesAt(ts).forEach( tx -> out.collect( new Joined(rateVal, tx))); if (myRate != null) { rate.clearRangeUntil(myRate.getKey(), false); } transactions.clearEntryAt(ts); } public void onTimer(long ts, /*...*/) { TreeSet<Entry<Long, Double>> rates = ratesInRange(ts, rate.entries()); Double myRate = Optional .ofNullable(rates.pollLast()) .map(Entry::getValue) .orElse(null); transactions.get(ts).forEach( tx -> out.collect( new Joined(myRate, tx))); rates.forEach(this::removeRate); transactions.remove(ts); }
  26. Stream Sort with BinarySortedState - Optimized ● Idea: ○ Increase efficiency by processing all events between watermarks ● Challenge: ○ Registering a timer for the next watermark will fire too often ➔ Solution: ○ Register timer for the first unprocessed event ○ When the timer fires: ■ Process all events until the current watermark (not the timer timestamp!) ● events.readRangeUntil(currentWatermark, true)
  27. Event-Time Stream Join with BinarySortedState - Optimized ● Idea: ○ Increase efficiency by processing all events between watermarks ● Challenge: ○ Registering a timer for the next watermark will fire too often ➔ Solution: ○ Same as Stream Sort: Timer for first unprocessed event, processing until watermark, but: ○ When the timer fires: ■ Iterate both, transactions and rate (in the appropriate time range) in event-time order
  28. Results & Future Work
  29. +50% Stream Sort
  30. +130% Optimized Stream Sort
  31. +60% Stream Join
  32. ● (Custom) stream sorting ● Time-based (custom) joins ● Code with range-reads or bulk-deletes ● Custom window implementations ● Min/Max with retractions ● … ● Basically everything maintaining a MapState<?, List<?>> or requiring range operations BinarySortedState - Who will benefit?
  33. What’s left to do? ● Iron out last bits and pieces + tests ○ Start voting thread for FLIP-220 on dev@flink.apache.org ○ Create a PR and get it merged ● Expected to land in Flink 1.17 (as experimental feature) ● Port Table/SQL/DataStream operators to improve efficiency: ○ TemporalRowTimeJoinOperator (PoC already done for validating the API ✓) ○ RowTimeSortOperator ○ IntervalJoinOperator ○ CepOperator ○ … ● Provide more LexicographicalTypeSerializers
  34. Get ready to ROK!!! Nico Kruber linkedin.com/in/nico-kruber nico@immerok.com
Advertisement