Lessons: Porting a Streaming Pipeline from Scala to Rust
2023 Scale by the Bay
Evan Chan
Principal Engineer - Conviva
http://velvia.github.io/presentations/2023-conviva-scala-to-rust
1 / 38
Conviva
2 / 38
Massive Real-time Streaming Analytics
5 trillion events processed per day
800-2000GB/hour (not peak!!)
Started with custom Java code
went through Spark Streaming and Flink iterations
Most backend data components in production are written in Scala
Today: 420 pods running custom Akka Streams processors
3 / 38
Data World is Going Native and Rust
Going native: Python, end of Moore's Law, cloud compute
Safe, fast, and high-level abstractions
Functional data patterns - map, fold, pattern matching, etc.
Static dispatch and no allocations by default
PyO3 - Rust is the best way to write native Python extensions
JVM | Rust projects
Spark, Hive | DataFusion, Ballista, Amadeus
Flink | Arroyo, RisingWave, Materialize
Kafka/KSQL | Fluvio
ElasticSearch / Lucene | Toshi, MeiliDB
Cassandra, HBase | Skytable, Sled, Sanakirja...
Neo4J | TerminusDB, IndraDB
4 / 38
About our Architecture
graph LR;
  SAE(Streaming Data Pipeline)
  Sensors --> Gateways
  Gateways --> Kafka
  Kafka --> SAE
  SAE --> DB[(Metrics Database)]
  DB --> Dashboards
5 / 38
What We Are Porting to Rust
graph LR;
  classDef highlighted fill:#99f,stroke:#333,stroke-width:4px
  SAE(Streaming Data Pipeline)
  Sensors:::highlighted --> Gateways:::highlighted
  Gateways --> Kafka
  Kafka --> SAE:::highlighted
  SAE --> DB[(Metrics Database)]
  DB --> Dashboards
graph LR;
  Notes1(Sensors: consolidate fragmented code base)
  Notes2(Gateway: Improve on JVM and Go)
  Notes3(Pipeline: Improve efficiency, New operator architecture)
  Notes1 ~~~ Notes2
  Notes2 ~~~ Notes3
6 / 38
Our Journey to Rust
gantt
  title From Hackathon to Multiple Teams
  dateFormat YYYY-MM
  axisFormat %y-%b
  section Data Pipeline
  Hackathon :Small Kafka ingestion project, 2022-11, 30d
  Scala prototype :2023-02, 6w
  Initial Rust Port : small team, 2023-04, 45d
  Bring on more people :2023-07, 8w
  20-25 people 4 teams :2023-11, 1w
  section Gateway
  Go port :2023-07, 6w
  Rust port :2023-09, 4w
“I like that if it compiles, I know it will work, so it gives confidence.”
7 / 38
Promising Rust Hackathon
graph LR;
  Kafka --> RustDeser(Rust Deserializer)
  RustDeser --> RA(Rust Actors - Lightweight Processing)
Measurement | Improvement over Scala/Akka
Throughput (CPU) | 2.6x more
Memory used | 12x less
Mostly I/O-bound lightweight deserialization and processing workload
Found out Actix does not work well with Tokio
8 / 38
Performance Results - Gateway
9 / 38
Key Lessons or Questions
What matters for a Rust port?
The 4 P's?
People | How do we bring developers onboard?
Performance | How do I get performance? Data structures? Static dispatch?
Patterns | What coding patterns port well from Scala? Async?
Project | How do I build? Tooling, IDEs?
10 / 38
People
How do we bring developers onboard?
11 / 38
A Phased Rust Bringup
We ported our main data pipeline in two phases:
Phase | Team | Rust Expertise | Work
First | 3-5, very senior | 1-2 with significant Rust | Port core project components
Second | 10-15, mixed, distributed | Most with zero Rust | Smaller, broken down tasks
Have organized list of learning resources
2-3 weeks to learn Rust and come up to speed
12 / 38
Overcoming Challenges
Difficulties:
- Lifetimes
- Compiler errors
- Porting previous patterns
- Ownership and async
- etc.
How we helped:
- Good docs
- Start with tests
- ChatGPT!
- Rust Book
- Office hours
- Lots of detailed reviews
- Split project into async and sync cores
13 / 38
Performance
Data structures, static dispatch, etc.
"I enjoy the fact that the default route is performant. It makes you write
performant code, and if you go out the way, it becomes explicit (e.g., with dyn,
Boxed, or clone etc). "
14 / 38
Porting from Scala: Huge Performance Win
graph LR;
  classDef highlighted fill:#99f,stroke:#333,stroke-width:4px
  SAE(Streaming Data Pipeline)
  Sensors --> Gateways
  Gateways --> Kafka
  Kafka --> SAE:::highlighted
  SAE --> DB[(Metrics Database)]
  DB --> Dashboards
CPU-bound, programmable, heavy data processing
Neither the Rust nor the Scala version is productionized or optimized
Same architecture and same input/outputs
Scala version was not designed for speed, lots of objects
Rust: we chose static dispatch and minimizing allocations
Type of comparison | Improvement over Scala
Throughput, end to end | 22x
Throughput, single-threaded microbenchmark | >= 40x
15 / 38
Building a Flexible Data Pipeline
graph LR;
  RawEvents(Raw Events)
  RawEvents -->| List of numbers | Extract1
  RawEvents --> Extract2
  Extract1 --> DoSomeMath
  Extract2 --> TransformSomeFields
  DoSomeMath --> Filter1
  TransformSomeFields --> Filter1
  Filter1 --> MoreProcessing
An interpreter passes time-ordered data between the operators of a flexible DAG.
Span1
Start time: 1000
End time: 1100
Events: ["start", "click"]
Span2
Start time: 1100
End time: 1300
Events: ["ad_load"]
16 / 38
Data Structures: Scala vs Rust
Scala: Object Graph on Heap
graph TB;
  classDef default font-size:24px
  ArraySpan["`Array[Span]`"]
  TL(Timeline - Seq) --> ArraySpan
  ArraySpan --> Span1["`Span(start, end, Payload)`"]
  ArraySpan --> Span2["`Span(start, end, Payload)`"]
  Span1 --> EventsAtSpanEnd("`Events(Seq[A])`")
  EventsAtSpanEnd --> ArrayEvent["`Array[A]`"]
Rust: mostly stack based / 0 alloc:
flowchart TB;
  subgraph Timeline
    subgraph OutputSpans
      subgraph Span1
        subgraph Events
          EvA ~~~ EvB
        end
        TimeInterval ~~~ Events
      end
      subgraph Span2
        Time2 ~~~ Events2
      end
      Span1 ~~~ Span2
    end
    DataType ~~~ OutputSpans
  end
17 / 38
Rust: Using Enums and Avoiding Boxing
pub enum Timeline {
    EventNumber(OutputSpans<EventsAtEnd<f64>>),
    EventBoolean(OutputSpans<EventsAtEnd<bool>>),
    EventString(OutputSpans<EventsAtEnd<DataString>>),
}

// Up to 2 spans are stored inline before spilling to the heap.
type OutputSpans<V> = SmallVec<[Span<V>; 2]>;

pub struct Span<SV: SpanValue> {
    pub time: TimeInterval,
    pub value: SV,
}

// Up to 1 event is stored inline (the common case).
pub struct EventsAtEnd<V>(SmallVec<[V; 1]>);
In the above, the Timeline enum can fit entirely on the stack and avoid all boxing and allocations, if:
- The number of spans is very small, below the limit set in code
- The number of events in each span is very small (1 in this case, which is the common case)
- The base type is a primitive, or a string which is below a certain length
18 / 38
Avoiding Allocations using SmallVec and SmallString
SmallVec is something like this:
pub enum SmallVec<T, const N: usize> {
    Stack([T; N]),
    Heap(Vec<T>),
}
The enum can hold up to N items inline in an array with no allocations, but
switches to the Heap variant if the number of items exceeds N.
There are various crates for small strings and other data structures.
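The inline-then-spill idea can be sketched with the standard library alone. The `TinyVec` type below is a hypothetical toy written for this illustration, not the real `smallvec` crate. It adds a `len` field, which the simplified enum above leaves out, and it assumes `T: Copy + Default` to keep the code short.

```rust
/// Toy small-vector: holds up to N items inline, then spills to a Vec.
/// Hypothetical illustration only; the real `smallvec` crate is far more complete.
pub enum TinyVec<T, const N: usize> {
    /// Inline storage plus the number of slots currently in use.
    Stack { items: [T; N], len: usize },
    /// Heap storage, used once more than N items have been pushed.
    Heap(Vec<T>),
}

impl<T: Copy + Default, const N: usize> TinyVec<T, N> {
    pub fn new() -> Self {
        TinyVec::Stack { items: [T::default(); N], len: 0 }
    }

    pub fn push(&mut self, value: T) {
        match self {
            TinyVec::Stack { items, len } => {
                if *len < N {
                    // Still room inline: no allocation happens here.
                    items[*len] = value;
                    *len += 1;
                } else {
                    // Inline buffer is full: copy to the heap and switch variants.
                    let mut spilled = Vec::with_capacity(N + 1);
                    spilled.extend_from_slice(&items[..]);
                    spilled.push(value);
                    *self = TinyVec::Heap(spilled);
                }
            }
            TinyVec::Heap(v) => v.push(value),
        }
    }

    pub fn as_slice(&self) -> &[T] {
        match self {
            TinyVec::Stack { items, len } => &items[..*len],
            TinyVec::Heap(v) => v.as_slice(),
        }
    }

    /// True while no heap allocation has been made.
    pub fn is_inline(&self) -> bool {
        matches!(self, TinyVec::Stack { .. })
    }
}
```

With `N = 2`, the first two pushes stay inline and the third moves the contents to a `Vec`, which matches the "small number of spans" condition described for `OutputSpans`.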
19 / 38
Static vs Dynamic Dispatch
Often one will need to work with many different structs that implement a Trait -- for us, different operator implementations supporting different types. Static dispatch and inlined code are much faster.
1. Monomorphisation using generics
   fn execute_op<O: Operator>(op: O) -> Result<...>
   - Compiler creates a new instance of execute_op for every different O
   - Only works when you know in advance what Operator to pass in
2. Use Enums and enum_dispatch
   fn execute_op(op: OperatorEnum) -> Result<...>
3. Dynamic dispatch
   fn execute_op(op: Box<dyn Operator>) -> Result<...>
   fn execute_op(op: &dyn Operator) -> Result<...> (avoids allocation)
4. Function wrapping
   - Embedding functions in a generic struct
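The first three options can be compared side by side in a small sketch. The `Operator` trait, its `apply` method, and the `Double` and `AddOne` operators are assumptions made up for this example; the real pipeline's operator trait is not shown in the slides.

```rust
/// Hypothetical operator trait, standing in for the pipeline's real one.
pub trait Operator {
    fn apply(&self, input: f64) -> f64;
}

pub struct Double;
pub struct AddOne;

impl Operator for Double {
    fn apply(&self, input: f64) -> f64 {
        input * 2.0
    }
}

impl Operator for AddOne {
    fn apply(&self, input: f64) -> f64 {
        input + 1.0
    }
}

// 1. Monomorphisation: one compiled copy per concrete O, so calls can be inlined.
pub fn execute_generic<O: Operator>(op: &O, x: f64) -> f64 {
    op.apply(x)
}

// 2. Enum dispatch: a closed set of operators, selected with a match.
pub enum OperatorEnum {
    Double(Double),
    AddOne(AddOne),
}

pub fn execute_enum(op: &OperatorEnum, x: f64) -> f64 {
    match op {
        OperatorEnum::Double(o) => o.apply(x),
        OperatorEnum::AddOne(o) => o.apply(x),
    }
}

// 3. Dynamic dispatch: the call goes through a vtable and accepts any Operator.
pub fn execute_dyn(op: &dyn Operator, x: f64) -> f64 {
    op.apply(x)
}
```

The generic and enum versions need the set of operators to be known at compile time. The `dyn` version is the one that works when operators are only chosen at runtime, at the cost of an indirect call.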
20 / 38
enum_dispatch
Suppose you have
trait KnobControl {
    fn set_position(&mut self, value: f64);
    fn get_value(&self) -> f64;
}
struct LinearKnob {
    position: f64,
}
struct LogarithmicKnob {
    position: f64,
}
impl KnobControl for LinearKnob...
enum_dispatch lets you do this:
#[enum_dispatch]
trait KnobControl {
    //...
}
21 / 38
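To make the idea concrete without depending on the crate, here is a hand-written sketch of roughly the kind of forwarding code such a macro saves you from writing. The `Knob` enum, its variant names, and the logarithmic formula are assumptions for illustration only.

```rust
trait KnobControl {
    fn set_position(&mut self, value: f64);
    fn get_value(&self) -> f64;
}

struct LinearKnob {
    position: f64,
}

struct LogarithmicKnob {
    position: f64,
}

impl KnobControl for LinearKnob {
    fn set_position(&mut self, value: f64) {
        self.position = value;
    }
    fn get_value(&self) -> f64 {
        self.position
    }
}

impl KnobControl for LogarithmicKnob {
    fn set_position(&mut self, value: f64) {
        self.position = value;
    }
    // Assumed behaviour for this sketch: natural log of (position + 1).
    fn get_value(&self) -> f64 {
        (self.position + 1.0).ln()
    }
}

/// One variant per implementor (hypothetical name).
enum Knob {
    Linear(LinearKnob),
    Logarithmic(LogarithmicKnob),
}

// The enum implements the trait by forwarding each call with a match,
// so callers get static dispatch and no Box<dyn KnobControl>.
impl KnobControl for Knob {
    fn set_position(&mut self, value: f64) {
        match self {
            Knob::Linear(k) => k.set_position(value),
            Knob::Logarithmic(k) => k.set_position(value),
        }
    }
    fn get_value(&self) -> f64 {
        match self {
            Knob::Linear(k) => k.get_value(),
            Knob::Logarithmic(k) => k.get_value(),
        }
    }
}
```

A `Vec<Knob>` can then hold both kinds of knob inline, and every method call compiles to a match rather than a vtable lookup.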
Function wrapping
Static function wrapping - no generics
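pub struct OperatorWrapper {
    name: String,
    // Plain function pointer: cannot capture any state.
    func: fn(input: &Data) -> Data,
}
Need a generic - but accepts closures
pub struct OperatorWrapper<F>
where
    // Fn bounds take parameter types only, without parameter names.
    F: Fn(&Data) -> Data,
{
    name: String,
    func: F,
}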
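A usage sketch of both wrappers follows. The slides do not define `Data`, so the `Data` newtype here is an assumption, and the two wrappers are renamed `FnOperator` and `ClosureOperator` (hypothetical names) so they can coexist in one file. Note that the `Fn` bound lists only parameter types, `Fn(&Data) -> Data`.

```rust
/// Hypothetical payload type; the slides do not define `Data`.
#[derive(Debug, Clone, PartialEq)]
pub struct Data(pub Vec<f64>);

/// Variant 1: plain function pointer, no generics, cannot capture state.
pub struct FnOperator {
    pub name: String,
    pub func: fn(&Data) -> Data,
}

/// Variant 2: generic over F, so closures that capture state are accepted.
pub struct ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    pub name: String,
    pub func: F,
}

/// A free function usable with either wrapper.
pub fn double(input: &Data) -> Data {
    Data(input.0.iter().map(|x| x * 2.0).collect())
}

/// Builds an operator whose closure captures `factor`.
pub fn scale_by(factor: f64) -> ClosureOperator<impl Fn(&Data) -> Data> {
    ClosureOperator {
        name: format!("scale_by_{}", factor),
        func: move |input: &Data| Data(input.0.iter().map(|x| x * factor).collect()),
    }
}
```

Calling the wrapped function needs parentheses around the field access, for example `(op.func)(&data)`. The generic variant gives each closure its own concrete type, so two `ClosureOperator`s built from different closures cannot be stored in the same `Vec` without boxing.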
22 / 38
Patterns
Async, Type Classes, etc.
23 / 38
Rust Async: Different Paradigms
"Async: It is well designed... Yes, it is still pretty complicated piece of code, but
the logic or the framework is easier to grasp compared to other languages."
Having to use Arc: Data Structures are not Thread-safe by default!
Scala | Rust
Futures | futures, async functions
?? | async-await
Actors(Akka) | Actix, Bastion, etc.
Async streams | Tokio streams
Reactive (Akka streams, Monix, ZIO) | reactive_rs, rxRust, etc.
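The point about needing Arc can be shown with a small, standard-library-only sketch. The `count_in_parallel` function is a made-up example, not code from the pipeline: shared mutable state has to be wrapped explicitly, here in `Arc<Mutex<_>>`, before several threads may touch it.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts events from several worker threads into one shared total.
pub fn count_in_parallel(workers: usize, events_per_worker: u64) -> u64 {
    // Arc gives shared ownership across threads; Mutex makes mutation safe.
    let total = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Each thread gets its own handle to the same counter.
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..events_per_worker {
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let result = *total.lock().unwrap();
    result
}
```

Passing a bare `&mut u64` into the spawned closures would be rejected at compile time, which is the "not thread-safe by default" behaviour the slide refers to.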
24 / 38
Replacing Akka: Actors in Rust
Actix threading model doesn't mix well with Tokio
We moved to tiny-tokio-actor, then wrote our own
pub struct AnomalyActor {}

#[async_trait]
impl ChannelActor<Anomaly, AnomalyActorError> for AnomalyActor {
    async fn handle(
        &mut self,
        msg: Anomaly,
        ctx: &mut ActorContext<Anomaly>,
    ) -> Result<(), Report<AnomalyActorError>> {
        use Anomaly::*;
        match msg {
            QuantityOverflowAnomaly {
                ctx: _, ts: _, qual: _,
                qty: _, cnt: _, data: _,
            } => {}
            PoisonPill => {
                ctx.stop();
            }
        }
        Ok(())
    }
}
25 / 38
Other Patterns to Learn
Old Pattern | New Pattern
No inheritance | Use composition! Compose data structures; compose small Traits
No exceptions | Use Result and ?
Data structures are not Thread safe | Learn to use Arc etc.
Returning Iterators | Don't return things that borrow other things. This makes life difficult.
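As one illustration of the "Use Result and ?" row, the sketch below parses a hypothetical "start,end" record. The `ParseError` type and `parse_interval` function are invented for this example. Each `?` returns early with the error, converting it through `From` where needed, which replaces the try/catch style used with exceptions.

```rust
use std::num::ParseFloatError;

/// Error type for the hypothetical record parser below.
#[derive(Debug)]
pub enum ParseError {
    MissingField(&'static str),
    BadNumber(ParseFloatError),
}

// Lets `?` convert a ParseFloatError into ParseError automatically.
impl From<ParseFloatError> for ParseError {
    fn from(e: ParseFloatError) -> Self {
        ParseError::BadNumber(e)
    }
}

/// Parses a "start,end" record into a (start, end) pair.
pub fn parse_interval(line: &str) -> Result<(f64, f64), ParseError> {
    let mut parts = line.split(',');
    let start = parts
        .next()
        .ok_or(ParseError::MissingField("start"))?
        .trim()
        .parse::<f64>()?;
    let end = parts
        .next()
        .ok_or(ParseError::MissingField("end"))?
        .trim()
        .parse::<f64>()?;
    Ok((start, end))
}
```

The caller decides what to do with the `Err` case, and the compiler warns if a `Result` is silently dropped.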
26 / 38
Type Classes
In Rust, type classes (Traits) are smaller and more compositional.
pub trait Inhale {
    fn sniff(&self);
}
You can implement new Traits for existing types, and have different impl's for
different types.
impl Inhale for String {
    fn sniff(&self) {
        println!("I sniffed {}", self);
    }
}
// Only implemented for specific N subtypes of MyStruct
impl<N: Numeric> Inhale for MyStruct<N> {
    fn sniff(&self) {
        ....
    }
}
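A self-contained version of the same idea is sketched below. The slides do not define `Numeric` or `MyStruct`, so both are assumptions here, and `sniff` returns a `String` instead of printing so that the result can be checked.

```rust
pub trait Inhale {
    fn sniff(&self) -> String;
}

// A new trait implemented for an existing standard-library type.
impl Inhale for String {
    fn sniff(&self) -> String {
        format!("I sniffed {}", self)
    }
}

/// Hypothetical marker trait standing in for the slide's `Numeric` bound.
pub trait Numeric: Copy + std::fmt::Display {}
impl Numeric for i64 {}
impl Numeric for f64 {}

/// Hypothetical generic container.
pub struct MyStruct<N> {
    pub value: N,
}

// Only MyStruct<N> with a Numeric N gets this impl; MyStruct<String> does not.
impl<N: Numeric> Inhale for MyStruct<N> {
    fn sniff(&self) -> String {
        format!("I sniffed the number {}", self.value)
    }
}
```

Calling `sniff` on a `MyStruct<String>` fails to compile because `String` does not implement `Numeric`, so the bound acts as a compile-time filter on which instantiations get the behaviour.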
27 / 38
Project
Build, IDE, Tooling
28 / 38
"Cargo is the best build tool ever"
Almost no dependency conflicts due to multiple dep versioning
Configuration by convention - common directory/file layouts for example
Really simple .toml - no need for XML, functional Scala, etc.
Rarely need code to build anything, even for large projects
[package]
name = "telemetry-subscribers"
version = "0.3.0"
license = "Apache-2.0"
description = "Library for common telemetry and observability functionality"
[dependencies]
console-subscriber = { version = "0.1.6", optional = true }
crossterm = "0.25.0"
once_cell = "1.13.0"
opentelemetry = { version = "0.18.0", features = ["rt-tokio"], optional = true }
29 / 38
IDEs, CI, and Tooling
IDEs/Editors | VSCode, RustRover (IntelliJ), vim/emacs/etc with Rust Analyzer
Code Coverage | VSCode inline, grcov/lcov, Tarpaulin (Linux only)
Slow build times | Caching: cargo-chef, rust-cache
Slow test times | cargo-nextest
Property Testing | proptest
Benchmarking | Criterion
https://blog.logrocket.com/optimizing-ci-cd-pipelines-rust-projects/
VSCode's "LiveShare" feature for distributed pair programming is TOP NOTCH.
30 / 38
Rust Resources and Projects
https://github.com/velvia/links/blob/main/rust.md - this is my list of Rust
projects and learning resources
https://github.com/rust-unofficial/awesome-rust
https://www.arewelearningyet.com - ML focused
31 / 38
What do we miss from Scala?
More mature libraries - in some cases: HDFS, etc.
Good streaming libraries - like Monix, Akka Streams etc.
I guess all of Akka
"Less misleading compiler messages"
Rust error messages read better from the CLI, IMO (not an IDE)
32 / 38
Takeaways
It's a long journey but Rust is worth it.
Structuring a project for successful onramp is really important
Think about data structure design early on
Allow plenty of time to ramp up on Rust patterns, tools
We are hiring across multiple roles/levels!
33 / 38
https://velvia.github.io/about
https://github.com/velvia
@evanfchan
IG: @platypus.arts
Thank You Very Much!
34 / 38
Extra slides
35 / 38
Data World is Going Native (from JVM)
The rise of Python and Data Science
Led to AnyScale, Dask, and many other Python-oriented data
frameworks
Rise of newer, developer-friendly native languages (Go, Swift, Rust, etc.)
Migration from Hadoop/HDFS to more cloud-based data architectures
Apache Arrow and other data interchange formats
Hardware architecture trends - end of Moore's Law, rise of GPUs etc
36 / 38
Why We Went with our Own Actors
1. Initial Hackathon prototype used Actix
   - Actix has its own event-loop / threading model, using Arbiters
   - Difficult to co-exist with Tokio and configure both
2. Moved to tiny-tokio-actor
   - Really thin layer on top of Tokio
   - 25% improvement over rdkafka + Tokio + Actix
3. Ultimately wrote our own, 100-line mini Actor framework
   - tiny-tokio-actor required messages to be Clone so we could not, for example, send OneShot channels for other actors to reply
   - Wanted ActorRef<MessageType> instead of ActorRef<ActorType, MessageType>
   - supports tell() and ask() semantics
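The tell/ask idea can be sketched in a few lines. This is not the framework described above: it uses standard-library threads and channels instead of Tokio so that it stays self-contained, and `CounterMsg`, `spawn_counter`, and `ask_total` are hypothetical names. It shows an `ActorRef` typed only by its message, and an "ask" implemented by sending a reply channel inside the message.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Messages for a hypothetical counter actor.
pub enum CounterMsg {
    /// Fire-and-forget ("tell").
    Add(u64),
    /// Request/response ("ask"): carries the channel to reply on.
    Get(Sender<u64>),
    /// Stops the actor, like the PoisonPill message in the slides.
    PoisonPill,
}

/// Handle typed only by message, in the spirit of ActorRef<MessageType>.
pub struct ActorRef<M> {
    tx: Sender<M>,
}

impl<M> ActorRef<M> {
    pub fn tell(&self, msg: M) {
        // Send errors are ignored: the actor may already have stopped.
        let _ = self.tx.send(msg);
    }
}

impl ActorRef<CounterMsg> {
    /// Sends a Get and blocks for the reply; None if the actor has stopped.
    pub fn ask_total(&self) -> Option<u64> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tell(CounterMsg::Get(reply_tx));
        reply_rx.recv().ok()
    }
}

pub fn spawn_counter() -> (ActorRef<CounterMsg>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<CounterMsg>();
    let handle = thread::spawn(move || {
        // The actor owns its state and handles one message at a time.
        let mut total = 0u64;
        for msg in rx {
            match msg {
                CounterMsg::Add(n) => total += n,
                CounterMsg::Get(reply) => {
                    let _ = reply.send(total);
                }
                CounterMsg::PoisonPill => break,
            }
        }
    });
    (ActorRef { tx }, handle)
}
```

An async version would swap the thread for a spawned task and the channels for their async counterparts, but the shape of the message enum and the handle stays the same.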
37 / 38
Scala: Object Graphs and Any
class Timeline extends BufferedIterator[Span[Payload]]

final case class Span[+A](start: Timestamp, end: Timestamp, payload: A) {
  def mapPayload[B](f: A => B): Span[B] = copy(payload = f(payload))
}

type Event[+A] = Span[EventsAtSpanEnd[A]]

@newtype final case class EventsAtSpanEnd[+A](events: Iterable[A])

- BufferedIterator must be on the heap
- Each Span Payload is also boxed and on the heap, even for numbers
- To be dynamically interpretable, we need BufferedIterator[Span[Any]] in many places :(
- Yes, specialization is possible, at the cost of complexity
38 / 38

More Related Content

What's hot

Apache Pulsar First Overview
Apache PulsarFirst OverviewApache PulsarFirst Overview
Apache Pulsar First OverviewRicardo Paiva
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningKai Wähner
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsLightbend
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registryconfluent
 
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDeploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDatabricks
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyRikin Tanna
 
K8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals TrainingK8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals TrainingPiotr Perzyna
 
Reactive Applications with Apache Pulsar and Spring Boot
Reactive Applications with Apache Pulsar and Spring BootReactive Applications with Apache Pulsar and Spring Boot
Reactive Applications with Apache Pulsar and Spring BootVMware Tanzu
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformconfluent
 
Scylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScyllaDB
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 

What's hot (20)

Apache Pulsar First Overview
Apache PulsarFirst OverviewApache PulsarFirst Overview
Apache Pulsar First Overview
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep Learning
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Kafka basics
Kafka basicsKafka basics
Kafka basics
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
 
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDeploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/Livy
 
K8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals TrainingK8s in 3h - Kubernetes Fundamentals Training
K8s in 3h - Kubernetes Fundamentals Training
 
Reactive Applications with Apache Pulsar and Spring Boot
Reactive Applications with Apache Pulsar and Spring BootReactive Applications with Apache Pulsar and Spring Boot
Reactive Applications with Apache Pulsar and Spring Boot
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Scylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with RaftScylla Summit 2022: Making Schema Changes Safe with Raft
Scylla Summit 2022: Making Schema Changes Safe with Raft
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 

Similar to Porting a Streaming Pipeline from Scala to Rust

End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Actor model in .NET - Akka.NET
Actor model in .NET - Akka.NETActor model in .NET - Akka.NET
Actor model in .NET - Akka.NETKonrad Dusza
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
 
Server side JavaScript: going all the way
Server side JavaScript: going all the wayServer side JavaScript: going all the way
Server side JavaScript: going all the wayOleg Podsechin
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Thomas Weise
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleSri Ambati
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayPhil Estes
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinaloscon2007
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinaloscon2007
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streamingphanleson
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterKernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterAnne Nicolas
 
Introduction to Real Time Java
Introduction to Real Time JavaIntroduction to Real Time Java
Introduction to Real Time JavaDeniz Oguz
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 

Similar to Porting a Streaming Pipeline from Scala to Rust (20)

End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Actor model in .NET - Akka.NET
Actor model in .NET - Akka.NETActor model in .NET - Akka.NET
Actor model in .NET - Akka.NET
 
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Server side JavaScript: going all the way
Server side JavaScript: going all the wayServer side JavaScript: going all the way
Server side JavaScript: going all the way
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Postgres clusters
Postgres clustersPostgres clusters
Postgres clusters
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt Dowle
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinal
 
Os Reindersfinal
Os ReindersfinalOs Reindersfinal
Os Reindersfinal
 
Learning spark ch10 - Spark Streaming
Learning spark ch10 - Spark StreamingLearning spark ch10 - Spark Streaming
Learning spark ch10 - Spark Streaming
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverterKernel Recipes 2014 - NDIV: a low overhead network traffic diverter
Kernel Recipes 2014 - NDIV: a low overhead network traffic diverter
 
Introduction to Real Time Java
Introduction to Real Time JavaIntroduction to Real Time Java
Introduction to Real Time Java
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 

More from Evan Chan

Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkEvan Chan
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureEvan Chan
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 

Porting a Streaming Pipeline from Scala to Rust

  • 1. Lessons: Porting a Streaming Pipeline from Scala to Rust
    2023 Scale by the Bay
    Evan Chan
    Principal Engineer - Conviva
    http://velvia.github.io/presentations/2023-conviva-scala-to-rust
  • 3. Massive Real-time Streaming Analytics
    5 trillion events processed per day
    800-2000GB/hour (not peak!!)
    Started with custom Java code, went through Spark Streaming and Flink iterations
    Most backend data components in production are written in Scala
    Today: 420 pods running custom Akka Streams processors
  • 4. Data World is Going Native and Rust
    Going native: Python, end of Moore's Law, cloud compute
    Safe, fast, and high-level abstractions
    Functional data patterns - map, fold, pattern matching, etc.
    Static dispatch and no allocations by default
    PyO3 - Rust is the best way to write native Python extensions
    JVM                      Rust projects
    Spark, Hive              DataFusion, Ballista, Amadeus
    Flink                    Arroyo, RisingWave, Materialize
    Kafka/KSQL               Fluvio
    ElasticSearch / Lucene   Toshi, MeiliDB
    Cassandra, HBase         Skytable, Sled, Sanakirja...
    Neo4J                    TerminusDB, IndraDB
  • 5. About our Architecture
    [Diagram: Sensors --> Gateways --> Kafka --> Streaming Data Pipeline --> Metrics Database --> Dashboards]
  • 6. What We Are Porting to Rust
    [Diagram: Sensors --> Gateways --> Kafka --> Streaming Data Pipeline --> Metrics Database --> Dashboards, with Sensors, Gateways, and the Streaming Data Pipeline highlighted]
    Sensors: consolidate fragmented code base
    Gateway: improve on JVM and Go
    Pipeline: improve efficiency, new operator architecture
  • 7. Our Journey to Rust
    [Gantt chart: From Hackathon to Multiple Teams]
    Data Pipeline
      Hackathon (small Kafka ingestion project): 2022-11, 30 days
      Scala prototype: 2023-02, 6 weeks
      Initial Rust port (small team): 2023-04, 45 days
      Bring on more people: 2023-07, 8 weeks
      20-25 people, 4 teams: 2023-11, 1 week
    Gateway
      Go port: 2023-07, 6 weeks
      Rust port: 2023-09, 4 weeks
    “I like that if it compiles, I know it will work, so it gives confidence.”
  • 8. Promising Rust Hackathon
    [Diagram: Kafka --> Rust Deserializer --> Rust Actors - Lightweight Processing]
    Measurement        Improvement over Scala/Akka
    Throughput (CPU)   2.6x more
    Memory used        12x less
    Mostly I/O-bound lightweight deserialization and processing workload
    Found out Actix does not work well with Tokio
  • 9. Performance Results - Gateway
  • 10. Key Lessons or Questions
    What matters for a Rust port? The 4 P's
    People        How do we bring developers onboard?
    Performance   How do I get performance? Data structures? Static dispatch?
    Patterns      What coding patterns port well from Scala? Async?
    Project       How do I build? Tooling, IDEs?
  • 11. People
    How do we bring developers onboard?
  • 12. A Phased Rust Bringup
    We ported our main data pipeline in two phases:
    Phase    Team                        Rust Expertise              Work
    First    3-5, very senior            1-2 with significant Rust   Port core project components
    Second   10-15, mixed, distributed   Most with zero Rust         Smaller, broken down tasks
    Have organized list of learning resources
    2-3 weeks to learn Rust and come up to speed
  • 13. Overcoming Challenges
    Difficulties:
      Lifetimes
      Compiler errors
      Porting previous patterns
      Ownership and async
      etc.
    How we helped:
      Good docs
      Start with tests
      ChatGPT!
      Rust Book
      Office hours
      Lots of detailed reviews
      Split project into async and sync cores
  • 14. Performance
    Data structures, static dispatch, etc.
    "I enjoy the fact that the default route is performant. It makes you write performant code, and if you go out the way, it becomes explicit (e.g., with dyn, Boxed, or clone etc)."
  • 15. Porting from Scala: Huge Performance Win
    [Diagram: Sensors --> Gateways --> Kafka --> Streaming Data Pipeline --> Metrics Database --> Dashboards, with the Streaming Data Pipeline highlighted]
    CPU-bound, programmable, heavy data processing
    Neither the Rust nor the Scala version is productionized or optimized
    Same architecture and same input/outputs
    Scala version was not designed for speed, lots of objects
    Rust: we chose static dispatch and minimizing allocations
    Type of comparison                           Improvement over Scala
    Throughput, end to end                       22x
    Throughput, single-threaded microbenchmark   >= 40x
  • 16. Building a Flexible Data Pipeline
    [Diagram: Raw Events --> Extract1 (edge labeled "List of numbers") --> DoSomeMath --> Filter1; Raw Events --> Extract2 --> TransformSomeFields --> Filter1; Filter1 --> MoreProcessing]
    An interpreter passes time-ordered data between the operators of a flexible DAG.
    Span1   Start time: 1000   End time: 1100   Events: ["start", "click"]
    Span2   Start time: 1100   End time: 1300   Events: ["ad_load"]
  • 17. Data Structures: Scala vs Rust
    Scala: Object Graph on Heap
    [Diagram: Timeline (Seq) --> Array[Span]; Array[Span] --> two Span(start, end, Payload) objects; Span --> Events(Seq[A]) --> Array[A]]
    Rust: mostly stack based / 0 alloc
    [Diagram: nested boxes. Timeline contains DataType and OutputSpans; OutputSpans contains Span1 and Span2; Span1 contains TimeInterval and Events (EvA, EvB); Span2 contains Time2 and Events2]
  • 18. Rust: Using Enums and Avoiding Boxing
        pub enum Timeline {
            EventNumber(OutputSpans<EventsAtEnd<f64>>),
            EventBoolean(OutputSpans<EventsAtEnd<bool>>),
            EventString(OutputSpans<EventsAtEnd<DataString>>),
        }
        type OutputSpans<V> = SmallVec<[Span<V>; 2]>;
        pub struct Span<SV: SpanValue> {
            pub time: TimeInterval,
            pub value: SV,
        }
        pub struct EventsAtEnd<V>(SmallVec<[V; 1]>);
    In the above, the Timeline enum can fit entirely on the stack and avoid all boxing and allocations, if:
      The number of spans is very small, below the limit set in code
      The number of events in each span is very small (1 in this case, which is the common case)
      The base type is a primitive, or a string which is below a certain length
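The enum-per-value-type idea on this slide can be sketched with the standard library alone. This is a simplified illustration, not the talk's code: `Vec` stands in for `SmallVec`, each span holds a single value instead of `EventsAtEnd`, and `TimeInterval`, `type_name`, and `covered_time` are hypothetical names. The point is that one `match` replaces the `Span[Any]` casts the Scala version needed.

```rust
// Hypothetical, simplified types. Vec stands in for SmallVec so that this
// compiles with std only; it therefore does allocate, unlike the slide's version.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TimeInterval {
    pub start: u64,
    pub end: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Span<V> {
    pub time: TimeInterval,
    pub value: V,
}

// One variant per payload type: the value type is known statically inside
// each match arm, so no `Any` or downcasting is needed.
#[derive(Debug, Clone, PartialEq)]
pub enum Timeline {
    EventNumber(Vec<Span<f64>>),
    EventBoolean(Vec<Span<bool>>),
    EventString(Vec<Span<String>>),
}

// Generic helper that works for any payload type.
fn covered<V>(spans: &[Span<V>]) -> u64 {
    spans.iter().map(|s| s.time.end - s.time.start).sum()
}

impl Timeline {
    pub fn type_name(&self) -> &'static str {
        match self {
            Timeline::EventNumber(_) => "number",
            Timeline::EventBoolean(_) => "boolean",
            Timeline::EventString(_) => "string",
        }
    }

    // Total time covered by all spans, whatever the payload type is.
    pub fn covered_time(&self) -> u64 {
        match self {
            Timeline::EventNumber(s) => covered(s),
            Timeline::EventBoolean(s) => covered(s),
            Timeline::EventString(s) => covered(s),
        }
    }
}

fn main() {
    let timeline = Timeline::EventNumber(vec![
        Span { time: TimeInterval { start: 1000, end: 1100 }, value: 1.0 },
        Span { time: TimeInterval { start: 1100, end: 1300 }, value: 2.0 },
    ]);
    println!("{} timeline covers {} time units", timeline.type_name(), timeline.covered_time());
}
```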
  • 19. Avoiding Allocations using SmallVec and SmallString
    SmallVec is something like this:
        pub enum SmallVec<T, const N: usize> {
            Stack([T; N]),
            Heap(Vec<T>),
        }
    The enum can hold up to N items inline in an array with no allocations, but switches to the Heap variant if the number of items exceeds N.
    There are various crates for small strings and other data structures.
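To make the inline-then-spill behavior concrete, here is a minimal std-only sketch. It is not how the real `smallvec` crate is implemented (that crate uses unsafe code and a more compact layout); `TinyVec` and its methods are hypothetical names, and the `Copy + Default` bound is a simplifying assumption that lets the inline array be initialized safely.

```rust
// Illustrative only: a safe, simplified stand-in for the SmallVec idea.
#[derive(Debug)]
pub enum TinyVec<T: Copy + Default, const N: usize> {
    // Up to N items stored inline; `len` counts the slots in use.
    Stack { items: [T; N], len: usize },
    // Spilled to a heap-allocated Vec once more than N items were pushed.
    Heap(Vec<T>),
}

impl<T: Copy + Default, const N: usize> TinyVec<T, N> {
    pub fn new() -> Self {
        TinyVec::Stack { items: [T::default(); N], len: 0 }
    }

    pub fn push(&mut self, value: T) {
        let spilled = match self {
            TinyVec::Stack { items, len } => {
                if *len < N {
                    // Still room inline: no allocation happens here.
                    items[*len] = value;
                    *len += 1;
                    return;
                }
                // Inline storage is full: copy everything to the heap.
                let mut v = Vec::with_capacity(N + 1);
                v.extend_from_slice(&items[..]);
                v.push(value);
                v
            }
            TinyVec::Heap(v) => {
                v.push(value);
                return;
            }
        };
        *self = TinyVec::Heap(spilled);
    }

    pub fn as_slice(&self) -> &[T] {
        match self {
            TinyVec::Stack { items, len } => &items[..*len],
            TinyVec::Heap(v) => v.as_slice(),
        }
    }

    pub fn is_inline(&self) -> bool {
        matches!(self, TinyVec::Stack { .. })
    }
}

fn main() {
    let mut v: TinyVec<f64, 2> = TinyVec::new();
    v.push(1.0);
    v.push(2.0);
    println!("inline = {}, items = {:?}", v.is_inline(), v.as_slice());
    v.push(3.0); // exceeds N = 2, so the data moves to the Heap variant
    println!("inline = {}, items = {:?}", v.is_inline(), v.as_slice());
}
```

The trade-off is that every `TinyVec<T, N>` occupies space for N items even when empty, so N is worth keeping close to the common case (the slides use 1 and 2).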
  • 20. Static vs Dynamic Dispatch
    Often one will need to work with many different structs that implement a Trait -- for us, different operator implementations supporting different types. Static dispatch and inlined code are much faster.
    1. Monomorphisation using generics
         fn execute_op<O: Operator>(op: O) -> Result<...>
       Compiler creates a new instance of execute_op for every different O
       Only works when you know in advance what Operator to pass in
    2. Use Enums and enum_dispatch
         fn execute_op(op: OperatorEnum) -> Result<...>
    3. Dynamic dispatch
         fn execute_op(op: Box<dyn Operator>) -> Result<...>
         fn execute_op(op: &dyn Operator) -> Result<...> (avoids allocation)
    4. Function wrapping
       Embedding functions in a generic struct
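Options 1 and 3 can be compared side by side in a small sketch. The `Operator` trait, its `apply` method, and the `Double` and `AddOne` operators are hypothetical, invented here only to show the two call styles; the talk's real operator trait is not shown in the slides.

```rust
// Hypothetical operator trait, used only to illustrate the dispatch styles.
trait Operator {
    fn apply(&self, x: f64) -> f64;
}

struct Double;
struct AddOne;

impl Operator for Double {
    fn apply(&self, x: f64) -> f64 {
        x * 2.0
    }
}

impl Operator for AddOne {
    fn apply(&self, x: f64) -> f64 {
        x + 1.0
    }
}

// Option 1, static dispatch: the compiler generates one copy of this function
// per concrete O, so the call to `apply` can be inlined.
fn execute_static<O: Operator>(op: &O, x: f64) -> f64 {
    op.apply(x)
}

// Option 3, dynamic dispatch: a single copy of the function, with `apply`
// looked up through a vtable at runtime. Borrowing avoids a Box allocation.
fn execute_dyn(op: &dyn Operator, x: f64) -> f64 {
    op.apply(x)
}

fn main() {
    // Static: the operator types are fixed at compile time.
    let a = execute_static(&AddOne, execute_static(&Double, 3.0));

    // Dynamic: a pipeline whose operators could be chosen at runtime.
    let pipeline: Vec<Box<dyn Operator>> = vec![Box::new(Double), Box::new(AddOne)];
    let b = pipeline.iter().fold(3.0, |acc, op| execute_dyn(op.as_ref(), acc));

    println!("static = {}, dynamic = {}", a, b);
}
```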
  • 21. enum_dispatch
    Suppose you have
        trait KnobControl {
            fn set_position(&mut self, value: f64);
            fn get_value(&self) -> f64;
        }
        struct LinearKnob {
            position: f64,
        }
        struct LogarithmicKnob {
            position: f64,
        }
        impl KnobControl for LinearKnob...
    enum_dispatch lets you do this:
        #[enum_dispatch]
        trait KnobControl {
            //...
        }
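For readers who have not used the crate, the following hand-written sketch shows roughly the kind of code that enum-based dispatch boils down to: an enum with one variant per implementation and a trait impl that forwards through a `match`. It uses no external crate; the `Knob` enum name and the method bodies are assumptions made for illustration, and the exact code the macro generates may differ.

```rust
trait KnobControl {
    fn set_position(&mut self, value: f64);
    fn get_value(&self) -> f64;
}

struct LinearKnob {
    position: f64,
}

struct LogarithmicKnob {
    position: f64,
}

impl KnobControl for LinearKnob {
    fn set_position(&mut self, value: f64) {
        self.position = value;
    }
    fn get_value(&self) -> f64 {
        self.position
    }
}

impl KnobControl for LogarithmicKnob {
    fn set_position(&mut self, value: f64) {
        self.position = value;
    }
    fn get_value(&self) -> f64 {
        (self.position + 1.0).log2()
    }
}

// Hand-written stand-in for what the macro automates: one variant per type.
enum Knob {
    Linear(LinearKnob),
    Logarithmic(LogarithmicKnob),
}

// Forwarding impl: each call is a match plus a statically dispatched call,
// with no vtable and no Box.
impl KnobControl for Knob {
    fn set_position(&mut self, value: f64) {
        match self {
            Knob::Linear(k) => k.set_position(value),
            Knob::Logarithmic(k) => k.set_position(value),
        }
    }
    fn get_value(&self) -> f64 {
        match self {
            Knob::Linear(k) => k.get_value(),
            Knob::Logarithmic(k) => k.get_value(),
        }
    }
}

fn main() {
    let mut knobs = vec![
        Knob::Linear(LinearKnob { position: 0.0 }),
        Knob::Logarithmic(LogarithmicKnob { position: 0.0 }),
    ];
    for knob in knobs.iter_mut() {
        knob.set_position(3.0);
        println!("value = {}", knob.get_value());
    }
}
```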
  • 22. Function wrapping
    Static function wrapping - no generics
        pub struct OperatorWrapper {
            name: String,
            func: fn(input: &Data) -> Data,
        }
    Need a generic - but accepts closures
        pub struct OperatorWrapper<F>
        where
            F: Fn(&Data) -> Data,
        {
            name: String,
            func: F,
        }
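The difference between the two wrappers shows up when the function needs to capture state. In this sketch `Data` is a hypothetical one-field payload type, and the wrappers are renamed `FnOperator` and `ClosureOperator` so both can live in one file; the `new` and `run` helpers are also assumptions added for the example.

```rust
// Hypothetical payload type for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Data(pub f64);

// Plain function pointer: no generics, but it cannot capture any environment.
pub struct FnOperator {
    pub name: String,
    pub func: fn(&Data) -> Data,
}

impl FnOperator {
    pub fn new(name: &str, func: fn(&Data) -> Data) -> Self {
        FnOperator { name: name.to_string(), func }
    }
    pub fn run(&self, input: &Data) -> Data {
        (self.func)(input)
    }
}

// Generic over F: accepts closures that capture state, and each distinct
// closure type gets its own statically dispatched copy of `run`.
pub struct ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    pub name: String,
    pub func: F,
}

impl<F> ClosureOperator<F>
where
    F: Fn(&Data) -> Data,
{
    pub fn new(name: &str, func: F) -> Self {
        ClosureOperator { name: name.to_string(), func }
    }
    pub fn run(&self, input: &Data) -> Data {
        (self.func)(input)
    }
}

fn double(input: &Data) -> Data {
    Data(input.0 * 2.0)
}

fn main() {
    let doubler = FnOperator::new("double", double);
    let offset = 10.0;
    // This closure captures `offset`, which a bare `fn` pointer could not do.
    let shifter = ClosureOperator::new("add_offset", move |d: &Data| Data(d.0 + offset));

    println!("{} -> {:?}", doubler.name, doubler.run(&Data(1.5)));
    println!("{} -> {:?}", shifter.name, shifter.run(&Data(1.5)));
}
```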
  • 24. Rust Async: Different Paradigms
    "Async: It is well designed... Yes, it is still pretty complicated piece of code, but the logic or the framework is easier to grasp compared to other languages."
    Having to use Arc: Data Structures are not Thread-safe by default!
    Scala                                 Rust
    Futures                               futures, async functions
    ??                                    async-await
    Actors (Akka)                         Actix, Bastion, etc.
    Async streams                         Tokio streams
    Reactive (Akka streams, Monix, ZIO)   reactive_rs, rxRust, etc.
  • 25. Replacing Akka: Actors in Rust
    Actix threading model doesn't mix well with Tokio
    We moved to tiny-tokio-actor, then wrote our own
        pub struct AnomalyActor {}
        #[async_trait]
        impl ChannelActor<Anomaly, AnomalyActorError> for AnomalyActor {
            async fn handle(
                &mut self,
                msg: Anomaly,
                ctx: &mut ActorContext<Anomaly>,
            ) -> Result<(), Report<AnomalyActorError>> {
                use Anomaly::*;
                match msg {
                    QuantityOverflowAnomaly {
                        ctx: _,
                        ts: _,
                        qual: _,
                        qty: _,
                        cnt: _,
                        data: _,
                    } => {}
                    PoisonPill => {
                        ctx.stop();
                    }
                }
                Ok(())
            }
        }
  • 26. Other Patterns to Learn
    Old Pattern                           New Pattern
    No inheritance                        Use composition! Compose data structures; compose small Traits
    No exceptions                         Use Result and ?
    Data structures are not Thread safe   Learn to use Arc etc.
    Returning Iterators                   Don't return things that borrow other things. This makes life difficult.
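Two of the rows above, "Use Result and ?" and "Learn to use Arc", can be sketched with the standard library. The `ParseError` type and the `parse_metric`, `sum_metrics`, and `parallel_count` functions are hypothetical names invented for this example.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Errors are ordinary values instead of exceptions.
#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    NotANumber(String),
}

fn parse_metric(raw: &str) -> Result<f64, ParseError> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        return Err(ParseError::Empty);
    }
    trimmed
        .parse::<f64>()
        .map_err(|_| ParseError::NotANumber(trimmed.to_string()))
}

// `?` returns early with the error, much like a rethrow, but the possibility
// of failure is visible in the function signature.
fn sum_metrics(raws: &[&str]) -> Result<f64, ParseError> {
    let mut total = 0.0;
    for raw in raws {
        total += parse_metric(raw)?;
    }
    Ok(total)
}

// Sharing mutable state across threads must be explicit: Arc provides shared
// ownership and Mutex provides exclusive access.
fn parallel_count(workers: usize) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                *counter.lock().unwrap() += 1;
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("{:?}", sum_metrics(&["1.5", "2.5"]));
    println!("{:?}", sum_metrics(&["1.5", "abc"]));
    println!("{}", parallel_count(4));
}
```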
  • 27. Type Classes
    In Rust, type classes (Traits) are smaller and more compositional.
        pub trait Inhale {
            fn sniff(&self);
        }
    You can implement new Traits for existing types, and have different impl's for different types.
        impl Inhale for String {
            fn sniff(&self) {
                println!("I sniffed {}", self);
            }
        }
        // Only implemented for specific N subtypes of MyStruct
        impl<N: Numeric> Inhale for MyStruct<N> {
            fn sniff(&self) {
                ....
            }
        }
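A complete version of the idea might look like the sketch below. It departs from the slide in one way, purely so the result is easy to check: `sniff` returns a `String` instead of printing. The `Numeric` marker trait, the `MyStruct` field, and the elided method body are all filled in with assumptions.

```rust
use std::fmt::Display;

pub trait Inhale {
    // Returns the message rather than printing it (a change from the slide).
    fn sniff(&self) -> String;
}

// A new trait implemented for an existing standard-library type.
impl Inhale for String {
    fn sniff(&self) -> String {
        format!("I sniffed {}", self)
    }
}

// Hypothetical marker trait: only the types we opt in count as Numeric.
pub trait Numeric: Display {}
impl Numeric for i64 {}
impl Numeric for f64 {}

pub struct MyStruct<N> {
    pub value: N,
}

// Only implemented when N is Numeric; MyStruct<String> has no `sniff`.
impl<N: Numeric> Inhale for MyStruct<N> {
    fn sniff(&self) -> String {
        format!("I sniffed the number {}", self.value)
    }
}

fn main() {
    println!("{}", "coffee".to_string().sniff());
    println!("{}", MyStruct { value: 42_i64 }.sniff());
    // MyStruct { value: "tea" }.sniff(); // would not compile: &str is not Numeric
}
```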
  • 29. "Cargo is the best build tool ever"
    Almost no dependency conflicts due to multiple dep versioning
    Configuration by convention - common directory/file layouts for example
    Really simple .toml - no need for XML, functional Scala, etc.
    Rarely need code to build anything, even for large projects
        [package]
        name = "telemetry-subscribers"
        version = "0.3.0"
        license = "Apache-2.0"
        description = "Library for common telemetry and observability functionality"
        [dependencies]
        console-subscriber = { version = "0.1.6", optional = true }
        crossterm = "0.25.0"
        once_cell = "1.13.0"
        opentelemetry = { version = "0.18.0", features = ["rt-tokio"], optional = true }
  • 30. IDEs, CI, and Tooling
    IDEs/Editors       VSCode, RustRover (IntelliJ), vim/emacs/etc with Rust Analyzer
    Code Coverage      VSCode inline, grcov/lcov, Tarpaulin (Linux only)
    Slow build times   Caching: cargo-chef, rust-cache
    Slow test times    cargo-nextest
    Property Testing   proptest
    Benchmarking       Criterion
    https://blog.logrocket.com/optimizing-ci-cd-pipelines-rust-projects/
    VSCode's "LiveShare" feature for distributed pair programming is TOP NOTCH.
  • 31. Rust Resources and Projects
    https://github.com/velvia/links/blob/main/rust.md - this is my list of Rust projects and learning resources
    https://github.com/rust-unofficial/awesome-rust
    https://www.arewelearningyet.com - ML focused
  • 32. What do we miss from Scala?
    More mature libraries - in some cases: HDFS, etc.
    Good streaming libraries - like Monix, Akka Streams etc.
    I guess all of Akka
    "Less misleading compiler messages"
    Rust error messages read better from the CLI, IMO (not an IDE)
  • 33. Takeaways
    It's a long journey but Rust is worth it.
    Structuring a project for successful onramp is really important
    Think about data structure design early on
    Allow plenty of time to ramp up on Rust patterns, tools
    We are hiring across multiple roles/levels!
  • 36. Data World is Going Native (from JVM)
    The rise of Python and Data Science
      Led to AnyScale, Dask, and many other Python-oriented data frameworks
    Rise of newer, developer-friendly native languages (Go, Swift, Rust, etc.)
    Migration from Hadoop/HDFS to more cloud-based data architectures
    Apache Arrow and other data interchange formats
    Hardware architecture trends - end of Moore's Law, rise of GPUs etc
  • 37. Why We Went with our Own Actors
    1. Initial Hackathon prototype used Actix
       Actix has its own event-loop / threading model, using Arbiters
       Difficult to co-exist with Tokio and configure both
    2. Moved to tiny-tokio-actor
       Really thin layer on top of Tokio
       25% improvement over rdkafka + Tokio + Actix
    3. Ultimately wrote our own, 100-line mini Actor framework
       tiny-tokio-actor required messages to be Clone so we could not, for example, send OneShot channels for other actors to reply
       Wanted ActorRef<MessageType> instead of ActorRef<ActorType, MessageType>
       Supports tell() and ask() semantics
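The shape of such a mini actor framework can be sketched in a few dozen lines. This is not the talk's framework: to stay self-contained it uses std threads and `std::sync::mpsc` channels where the real one is built on Tokio, and `CounterMsg`, `ActorRef`, and `spawn_counter` are hypothetical names. It does show the three properties listed above: messages need not be `Clone`, the reference is typed by message only, and both `tell` and `ask` are supported.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

// Messages are not Clone: `Get` carries a reply channel, used once.
enum CounterMsg {
    Add(u64),
    Get(Sender<u64>),
    PoisonPill,
}

// Typed by message only, i.e. ActorRef<MessageType>.
struct ActorRef<M> {
    tx: Sender<M>,
}

impl<M> ActorRef<M> {
    // Fire-and-forget send. Returns false if the actor has already stopped.
    fn tell(&self, msg: M) -> bool {
        self.tx.send(msg).is_ok()
    }

    // Request-reply: build a message around a fresh reply channel, then block
    // until the actor answers. Returns None if the actor is gone.
    fn ask<R>(&self, make_msg: impl FnOnce(Sender<R>) -> M) -> Option<R> {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx.send(make_msg(reply_tx)).ok()?;
        reply_rx.recv().ok()
    }
}

// Spawns the actor loop on its own thread. The loop owns its state, so no
// locks are needed; the thread returns the final total when it stops.
fn spawn_counter() -> (ActorRef<CounterMsg>, JoinHandle<u64>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut total = 0u64;
        for msg in rx {
            match msg {
                CounterMsg::Add(n) => total += n,
                CounterMsg::Get(reply) => {
                    let _ = reply.send(total);
                }
                CounterMsg::PoisonPill => break,
            }
        }
        total
    });
    (ActorRef { tx }, handle)
}

fn main() {
    let (counter, handle) = spawn_counter();
    counter.tell(CounterMsg::Add(2));
    counter.tell(CounterMsg::Add(3));
    println!("current total: {:?}", counter.ask(CounterMsg::Get));
    counter.tell(CounterMsg::PoisonPill);
    println!("final total: {}", handle.join().unwrap());
}
```

Because a single sender's messages arrive in order, the `Get` reply always reflects the two preceding `Add` messages.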
  • 38. Scala: Object Graphs and Any
        class Timeline extends BufferedIterator[Span[Payload]]
        final case class Span[+A](start: Timestamp, end: Timestamp, payload: A) {
          def mapPayload[B](f: A => B): Span[B] = copy(payload = f(payload))
        }
        type Event[+A] = Span[EventsAtSpanEnd[A]]
        @newtype final case class EventsAtSpanEnd[+A](events: Iterable[A])
    BufferedIterator must be on the heap
    Each Span Payload is also boxed and on the heap, even for numbers
    To be dynamically interpretable, we need BufferedIterator[Span[Any]] in many places :(
    Yes, specialization is possible, at the cost of complexity