1
Hadoop Made Fast
Why Virtual Reality Needed Stream Processing to Survive
Greg Fodor, Co-founder, AltspaceVR
Gehrig Kunz, Technical Product Marketing, Confluent
2
Streaming in Action Series
You are here!
August 16th
Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
Watch on Confluent.io
3
A look at today
A Streaming Platform is Hadoop Made Fast
● Hadoop was a good idea, but it has its flaws
● How a streaming platform can look like Hadoop
● Companies are using a streaming platform
Stream Processing with Kafka for Virtual Reality
● An example of Kafka with VR
● Challenges VR has that require stream processing
● Examples where it helps
● Why stream processing with Kafka makes sense
4
Interest in Hadoop
5
Good idea, Hadoop is
● Get all the datas
● Perform analysis, explore data
● Perfect for understanding your business
6
But today is different
Star Wars is good, again.
And the apps we build require constant data.
7
Bringing it to today
With Hadoop you wanted to:
Get all the datas
Explore historical data
Understand your business

git commit -m "Today you want to":
Get all the datas
Process data as it arrives
Power your business
8
What this looks like in practice
9
What this looks like in practice
1. Ingest a stream of data.
2. Process and act on it as it arrives.
3. Power your business.
10
Kafka’s Streams API
● Kafka’s Streams API: a lightweight library for performing stream processing
• Aggregations, sessions, windowing, joins, et al.
● Build apps, not clusters
(Diagram: the library runs client-side, outside the Kafka brokers!)
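To make “build apps, not clusters” concrete, here is a minimal Java sketch of a Streams app. The avatar-moves topic, the application id, and the local broker address are assumptions, and it is written against the current (3.x) Streams API, whose names differ slightly from the 0.10-era API current at the time of this talk. The windowed count illustrates the aggregation/windowing bullets.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class MovesPerUser {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "moves-per-user"); // hypothetical app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    StreamsBuilder builder = new StreamsBuilder();
    // "avatar-moves" is a hypothetical topic: key = user id, value = a move event.
    KStream<String, String> moves =
        builder.stream("avatar-moves", Consumed.with(Serdes.String(), Serdes.String()));

    // Windowed aggregation: count each user's move events per 10-second window.
    moves.groupByKey()
         .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(10)))
         .count();

    // The whole topology runs inside this ordinary JVM process: no cluster to submit to.
    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```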
11
Build scalable, fault-tolerant apps
12
Build today’s apps quicker
13
Kafka, stream processing for developers
Deploy apps – not clusters – that are:
● Real-time
● Elastic
● Fault-tolerant
So teams can be more efficient and give users a better, new experience.
14
Kafka, stream processing for developers
Deploy apps – not clusters – that are:
● Real-time
● Elastic
● Fault-tolerant
So teams can be more efficient and give users a better, new experience.
Psst, Greg. Virtual reality, anyone?
15
The best shared VR platform
https://altvr.com/kafka
16
Use cases
17
VR Mirroring + Capture
18
(Diagram: the “real” Reggie in a “VIP” room.)
19
(Diagram: the “real” Reggie in the “VIP” room, with “mirrored” Reggies in Rooms 1–4.)
20
Use cases for capture/replay
21–25
(Screenshots of capture/replay use cases.)
Kafka’s Streams API
26
Kafka’s Streams API
Stream processing: it’s not just for analytics!
27
Kafka’s Streams API
• Independent capacity
• Arbitrary transformations
• Flexible and simple ops
28
Kafka’s Streams API
• Build cohesive, re-usable topologies
• Design for extensibility
• Apply patterns + avoid pitfalls
29
Job #1: Game Streams
30
Game Streams
Create a logical stream across Photon servers
• Real-time netdata transformation
• Routing between Photon servers
• Stateful, due to Photon protocol
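As a rough illustration of the shape such a job could take (not AltspaceVR’s actual code), here is a hedged Java sketch: the photon-events topic, the per-server output topic names, and both helper methods are assumptions. It shows the transform-then-route pattern; the real stateful Photon protocol handling is elided.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class GameStreamsSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "job_game_streams"); // hypothetical
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    StreamsBuilder builder = new StreamsBuilder();
    // Hypothetical source topic of per-server game events, keyed by game
    // stream id so one stream's events stay on one partition.
    KStream<String, byte[]> events =
        builder.stream("photon-events", Consumed.with(Serdes.String(), Serdes.ByteArray()));

    // Transform + route: rewrite the payload for the target room, then pick
    // the destination topic per record (one topic per Photon server, say).
    events
        .mapValues(GameStreamsSketch::rewriteForTargetRoom)
        .to((key, value, ctx) -> "photon-server-" + targetServerFor(key),
            Produced.with(Serdes.String(), Serdes.ByteArray()));

    new KafkaStreams(builder.build(), props).start();
  }

  // Placeholder: the real job rewrites Photon protocol frames (actor ids, room ids, ...).
  static byte[] rewriteForTargetRoom(byte[] frame) { return frame; }

  // Placeholder routing decision, driven by mirroring config in the real job.
  static String targetServerFor(String gameStreamId) { return "1"; }
}
```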
31
“Mirror User A to room R2”
32
6 months later: “Capture User A”
33
Job #2: Playbacks
34
Playbacks
Replays captured data
• Load capture data (Kafka/S3)
• Timed emission
• Checkpointing, looping, filtering
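Timed emission maps naturally onto a Streams punctuator. Below is a hedged Processor API sketch (the class and the tick interval are mine, not AltspaceVR’s code): buffered capture records are re-emitted once their offset from the first captured frame has elapsed on the wall clock. Checkpointing, looping, and filtering are elided.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

// Re-emits captured frames on their original cadence: buffer incoming capture
// records, and let a wall-clock punctuator forward each one once enough
// playback time has elapsed relative to the first frame.
public class TimedEmitter implements Processor<String, byte[], String, byte[]> {
  private final Deque<Record<String, byte[]>> buffer = new ArrayDeque<>();
  private long captureStart = -1;   // timestamp of the first captured frame
  private long playbackStart = -1;  // wall-clock time playback began

  @Override
  public void init(ProcessorContext<String, byte[]> context) {
    // Every 50 ms (hypothetical tick), check whether buffered frames are due.
    context.schedule(Duration.ofMillis(50), PunctuationType.WALL_CLOCK_TIME, now -> {
      while (!buffer.isEmpty()) {
        Record<String, byte[]> next = buffer.peekFirst();
        long due = playbackStart + (next.timestamp() - captureStart);
        if (now < due) break;                 // next frame is not due yet
        context.forward(buffer.pollFirst());  // emit on schedule
      }
    });
  }

  @Override
  public void process(Record<String, byte[]> record) {
    if (captureStart < 0) {
      captureStart = record.timestamp();
      playbackStart = System.currentTimeMillis();
    }
    buffer.addLast(record);  // hold until the punctuator decides it is due
  }
}
```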
35
“Playback capture to room R2”
36
“Mirror User A to room R2”
37
Kafka’s Streams API
• Build cohesive, re-usable topologies
• Design for extensibility
• Apply patterns + avoid pitfalls
GameStreams job allows:
• User capture/mirroring
• Interactable object capture/mirroring
• VoIP, avatar transforms, and VR emoji payloads
• Entire room capture/mirroring
38
Kafka’s Streams API
• Build cohesive, re-usable topologies
• Design for extensibility
• Apply patterns + avoid pitfalls
GameStreams job allows:
• Design names and record types generically
• Build in mechanisms for parameterization + control
• Use Avro and the Schema Registry (serde sketch below)
• Job code is not throwaway! Build accordingly
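For the Avro + Schema Registry bullet, a minimal serde wiring sketch. PhotonInstantiation stands in for any Avro-generated record class (hypothetical here, echoing the job_playbacks-photon_instantiations topic named later), and the registry address is assumed.

```java
import java.util.Map;
import org.apache.kafka.common.serialization.Serde;
import io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde;

public class AvroSerdes {
  // PhotonInstantiation is a hypothetical Avro-generated record class; any
  // SpecificRecord type works the same way.
  static Serde<PhotonInstantiation> photonInstantiationSerde() {
    SpecificAvroSerde<PhotonInstantiation> serde = new SpecificAvroSerde<>();
    serde.configure(
        Map.of("schema.registry.url", "http://localhost:8081"), // assumed registry address
        /* isKey = */ false);
    return serde;
  }
}
```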
39
Patterns + Pitfalls
40
Patterns + Pitfalls
41
Config KTables
• Drive job behavior via OLTP state
• In our case, users interact with Rails API to control mirroring + captures
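A hedged sketch of the pattern: mirroring config rows flow from the OLTP database (via Kafka Connect, say) into a compacted topic, are materialized as a KTable, and gate the event stream. All topic names here are hypothetical, loosely following the naming convention described later.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ConfigKTableSketch {
  static Topology build() {
    StreamsBuilder builder = new StreamsBuilder();

    // Config KTable, fed from the OLTP database via Connect; key = game stream id.
    KTable<String, String> mirrorConfigs =
        builder.table("oltp_db-mirror_config-game_stream_id",
                      Consumed.with(Serdes.String(), Serdes.String()));

    KStream<String, byte[]> events =
        builder.stream("photon-events", Consumed.with(Serdes.String(), Serdes.ByteArray()));

    // Behavior is driven by OLTP state: only events whose stream currently has
    // a mirroring config row survive the (inner) join and get forwarded.
    events.join(mirrorConfigs, (frame, config) -> frame)
          .to("photon-mirrored-events", Produced.with(Serdes.String(), Serdes.ByteArray()));

    return builder.build();
  }
}
```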
42
KIP-99 Global Tables
https://cwiki.apache.org/confluence/display/KAFKA/KIP-99%3A+Add+Global+Tables+to+Kafka+Streams
43
Prefer declarative OLTP table state
Database table state should describe “how the world should be”, not “steps to perform”.
The job’s duty is to make the world look like the desired one.
“A stream should exist from playback A to room B”, not
“Right now, create a stream from playback A to room B”.
Straightforward to test + verify: does the desired world match reality?
Easier to reason about in failure cases (see the sketch below).
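In code, the declarative style boils down to a reconciliation loop. A hypothetical sketch (the types and helpers are mine): diff the declared world against the observed one and converge. Re-running it after a crash is safe because the diff is idempotent.

```java
import java.util.Set;

// Rows declare the desired world; the job repeatedly diffs that against
// reality and converges on it.
public class StreamReconciler {
  record DesiredStream(String playbackId, String roomId) {}

  void reconcile(Set<DesiredStream> desired, Set<DesiredStream> actual) {
    for (DesiredStream s : desired)
      if (!actual.contains(s)) create(s);     // declared but missing: create it
    for (DesiredStream s : actual)
      if (!desired.contains(s)) tearDown(s);  // exists but no longer declared: remove it
  }

  void create(DesiredStream s)   { /* start the mirroring/playback stream */ }
  void tearDown(DesiredStream s) { /* stop it */ }
}
```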
44
Keep consistent topic naming
Kafka Streams jobs involve a lot of source + intermediate topics
We prefer:
[<data source>|<job application id>]-<avro record type>[_<specifier>]-<partition key>
Ex:
oltp_db-user-user_id
job_playbacks-photon_instantiations-game_stream_id
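A tiny helper keeps the convention mechanical rather than ad hoc; this sketch just encodes the pattern above and reproduces both examples.

```java
// Encodes: [<data source>|<job application id>]-<avro record type>[_<specifier>]-<partition key>
public class TopicNames {
  static String topic(String source, String recordType, String specifier, String partitionKey) {
    String middle = specifier == null ? recordType : recordType + "_" + specifier;
    return source + "-" + middle + "-" + partitionKey;
  }

  public static void main(String[] args) {
    System.out.println(topic("oltp_db", "user", null, "user_id"));
    // -> oltp_db-user-user_id
    System.out.println(topic("job_playbacks", "photon_instantiations", null, "game_stream_id"));
    // -> job_playbacks-photon_instantiations-game_stream_id
  }
}
```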
45
RocksDB range scans
Did you know that RocksDB stores keys lexicographically sorted?
Kafka Streams exposes range() queries on persistent state stores!
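With the default String serde, keys serialize to UTF-8 bytes, so RocksDB’s byte order is plain lexicographic order. A hedged interactive-query sketch; the store name "tasks-store" is an assumption.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class RangeScan {
  // Scan the local instances of a persistent store between two keys.
  static void printRange(KafkaStreams streams, String from, String to) {
    ReadOnlyKeyValueStore<String, String> store = streams.store(
        StoreQueryParameters.fromNameAndType("tasks-store",
            QueryableStoreTypes.keyValueStore()));
    try (KeyValueIterator<String, String> it = store.range(from, to)) {
      while (it.hasNext()) {
        KeyValue<String, String> kv = it.next();
        System.out.println(kv.key + " -> " + kv.value);
      }
    }
  }
}
```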
46
Example: Scheduled tasks
Keys in “tasks” topic are a composite key of <timestamp, id>
Allows range queries for upcoming tasks (local to partition, obviously)
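One hedged way to build such a composite key: zero-pad the timestamp so that lexicographic byte order equals numeric time order, then range-scan up to “now”. The format and sentinel are illustrative, not the talk’s exact encoding.

```java
public class TaskKeys {
  // Fixed-width timestamp keeps lexicographic order == numeric (time) order.
  static String key(long timestampMs, String taskId) {
    return String.format("%020d|%s", timestampMs, taskId);
  }

  public static void main(String[] args) {
    String from = key(0L, "");                              // earliest possible task
    String to = key(System.currentTimeMillis(), "\uffff");  // crude upper-bound sentinel
    System.out.println(from + " .. " + to);
    // store.range(from, to) then yields every task whose time has arrived,
    // local to this partition.
  }
}
```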
47
Dark staging jobs
Eventually you will need to deploy a staging version of a job into prod for integration testing while the known-good version is serving users.
Ensure you bake in the necessary degree of freedom! (Duplicate topics, application ids, etc.)
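A hedged sketch of baking in that freedom: derive the application id (and hence the consumer group and state directory) plus the sink topics from an environment switch, so the dark job never collides with the production one. All names are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class JobEnv {
  // env = "prod" or "staging"
  static Properties props(String env) {
    Properties p = new Properties();
    // Distinct application id => distinct consumer group, state dir, internal topics.
    p.put(StreamsConfig.APPLICATION_ID_CONFIG, "job_game_streams-" + env);
    p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    return p;
  }

  static String sinkTopic(String env, String base) {
    // Prod writes the real topic; the dark job writes a shadow copy.
    return env.equals("prod") ? base : base + "-" + env;
  }
}
```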
48
Patterns + Pitfalls
49
KTable rematerialization
Cold nodes read the *entire* transaction log for each KTable on startup. (Of course!)
Not something you’re likely to experience except during a failure.
You could be in for a surprise!
Easy to force a rematerialization to test: stop the job, remove the state dir from the job work directory, restart.
(But you should probably check your xlog topic sizes first.)
In our case, AWS EBS I/O throttling left us unable to bring a fresh node up!
Ensure the topic xlog doesn’t grow unbounded:
- Delete dead keys explicitly and set proper compaction policies on xlog topics
- Or set up topic retention policies if data can be purged after a time duration
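For the compaction bullet, a hedged AdminClient sketch that creates a compacted topic, so tombstoned keys are eventually purged and the log a cold node must replay stays bounded. The topic name, partition/replica counts, and dirty-ratio tuning are assumptions.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    try (Admin admin = Admin.create(props)) {
      NewTopic topic =
          new NewTopic("job_game_streams-mirror_config-game_stream_id", 8, (short) 3)
              .configs(Map.of(
                  // Compaction keeps only the latest value per key and drops tombstones.
                  TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                  TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.1")); // assumed tuning
      admin.createTopics(Set.of(topic)).all().get();
    }
  }
}
```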
50
Reset switches + flushing
Sometimes KTable topics or entries need to be forcibly rematerialized/flushed/read from the beginning.
For example: KTable topic exists before first job run. Or, something broke.
Handy to build in mechanisms to:
- Reset consumer offsets to zero
- For OLTP/Connect-backed KTable data, force a no-op update to database record(s) to flush
- In Rails, ActiveRecord#flush
May be less necessary in newer versions of Kafka Streams (e.g., due to KAFKA-4114 + bug fixes)
Handy consumer group offset resetter routine (pass in the job’s Properties):
https://gist.github.com/gfodor/a4f5e4721e959766e75e4c901bf42890
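In the same spirit as that gist, a hedged sketch of the reset switch: run it while the job is stopped, join the job’s consumer group (Kafka Streams uses the application id as its group id), seek everything to the beginning, and commit. The class and method are illustrative, not the gist’s exact code.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.streams.StreamsConfig;

public class OffsetResetter {
  // Reset the job's committed offsets on one topic to zero. Run only while the
  // job is stopped, so this consumer is handed every partition of the group.
  static void resetToZero(Properties jobProps, String topic) {
    Properties p = new Properties();
    p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
          jobProps.getProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG));
    // Kafka Streams uses the application id as its consumer group id.
    p.put(ConsumerConfig.GROUP_ID_CONFIG,
          jobProps.getProperty(StreamsConfig.APPLICATION_ID_CONFIG));
    p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(p)) {
      consumer.subscribe(List.of(topic));
      consumer.poll(Duration.ofSeconds(5));            // join the group, receive assignment
      consumer.seekToBeginning(consumer.assignment());
      for (TopicPartition tp : consumer.assignment())
        consumer.position(tp);                         // force the lazy seek to resolve
      consumer.commitSync();                           // commit position 0 for each partition
    }
  }
}
```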
51
Streaming for VR
Kafka Streams has been amazing for us.
So far we’ve shown jobs for:
• VR Mirror/Capture/Playback
• Presence
• Scheduled tasks
We are also using it for:
• Real-time game telemetry ETL
• VR Capture archival to S3
• Real-time push messaging
52
From batch to real-time
● A streaming platform provides concepts similar to Hadoop’s
● Streaming platform is right for today’s applications
○ Distributed storage, Stream processing, Publish/Subscribe model
53
A streaming platform can be ‘Hadoop Made Fast’
● Use Kafka as a ‘source of truth’
● Process data as it arrives
● Power real-time experiences (like VR)
54
Streaming in Action Series
You are here
August 16th
Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
Watch on Confluent.io
55
Download Confluent Open Source
Join the Confluent Slack community
Check out Kafka Summit!
August 28th in San Francisco
Thanks!
