MANCHESTER LONDON NEW YORK
Petr Zapletal @petr_zapletal
@spark_summit
@cakesolutions
Distributed Real-Time Stream Processing:
Why and How
Agenda
● Motivation
● Stream Processing
● Available Frameworks
● Systems Comparison
● Recommendations
The Data Deluge
● New Sources and New Use Cases
● 8 Zettabytes (1 ZB = 1 trillion GB) created in 2015
● Stream Processing to the Rescue
Distributed Stream Processing
● Continuous processing, aggregation and analysis of unbounded data
Points of Interest
➜ Runtime and Programming Model
➜ Primitives
➜ State Management
➜ Message Delivery Guarantees
➜ Fault Tolerance & Low Overhead Recovery
➜ Latency, Throughput & Scalability
➜ Maturity and Adoption Level
➜ Ease of Development and Operability
Runtime and Programming Model
● Most important trait of a stream processing system
● Defines expressiveness, possible operations and their limitations
● Therefore defines a system's capabilities and its use cases
Native Streaming
[Figure: records flow from a Source Operator through Processing Operators to a Sink Operator; records processed one at a time]
Micro-batching
[Figure: a Receiver groups records into micro-batches that flow through Processing Operators to a Sink Operator; records processed in short batches]
Native Streaming
● Records are processed as they arrive
Pros
⟹ Expressiveness
⟹ Low-latency
⟹ Stateful operations
Cons
⟹ Fault-tolerance is expensive
⟹ Harder load-balancing
Micro-batching
● Splits incoming stream into small batches
Pros
⟹ High-throughput
⟹ Easier fault tolerance
⟹ Simpler load-balancing
Cons
⟹ Higher latency, depends on batch interval
⟹ Limited expressivity
⟹ Harder stateful operations
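To make the contrast between the two runtime models concrete, here is a minimal plain-Java sketch. It uses no streaming framework, and the class and method names are hypothetical. `processNative` invokes the operator once per record, while `processMicroBatch` buffers records and hands whole batches to the operator. Batches are cut by size here only to keep the example deterministic; a micro-batching engine would normally cut them by batch interval.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: contrasts per-record processing with micro-batching.
class ProcessingModels {

    // Native streaming: the operator is invoked once per record, as it arrives.
    static <T> int processNative(List<T> records, Consumer<T> operator) {
        int invocations = 0;
        for (T record : records) {
            operator.accept(record);
            invocations++;
        }
        return invocations;
    }

    // Micro-batching: records are buffered and the operator sees whole batches.
    // Assumption: batches are cut by size; real engines cut them by time interval.
    static <T> int processMicroBatch(List<T> records, int batchSize,
                                     Consumer<List<T>> operator) {
        int invocations = 0;
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            operator.accept(new ArrayList<>(records.subList(start, end)));
            invocations++;
        }
        return invocations;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e");
        int perRecord = processNative(records, r -> System.out.println("record " + r));
        int perBatch = processMicroBatch(records, 2, b -> System.out.println("batch " + b));
        System.out.println(perRecord + " operator calls vs " + perBatch);
    }
}
```

In this sketch a record in a micro-batch is not processed until its batch is complete, which is where the extra latency comes from; the smaller number of operator calls is what helps throughput.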
Programming Model
Compositional
⟹ Provides basic building blocks as operators or sources
⟹ Custom component definition
⟹ Manual topology definition & optimization
⟹ Advanced functionality often missing
Declarative
⟹ High-level API
⟹ Operators as higher-order functions
⟹ Abstract data types
⟹ Advanced operations like state management or windowing supported out of the box
⟹ Advanced optimizers
Apache Streaming Landscape
Storm
● Pioneer in large-scale stream processing
Trident
● Higher-level micro-batching system built atop Storm
Spark Streaming
● Unified batch and stream processing over a batch runtime
[Figure: input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data]
Samza
● Builds heavily on Kafka’s log-based philosophy
[Figure: Task 1, Task 2, Task 3]
Kafka Streams
● Simple streaming library on top of Apache Kafka
Apex
● Processes massive amounts of real-time events natively in Hadoop
[Figure: a DAG of Operators connected by Streams of Tuples, ending in an Output]
Flink
● Native streaming & high-level API
[Figure: stream data (Kafka, RabbitMQ, ...) and batch data (HDFS, JDBC, ...) as inputs]
Counting Words
Input:
Spark Summit 2017
Apache Apache Spark
Storm Apache Trident
Flink Streaming Samza
Scala 2017 Streaming
Output:
(Apache, 3)
(Streaming, 2)
(2017, 2)
(Spark, 2)
(Storm, 1)
(Trident, 1)
(Flink, 1)
(Samza, 1)
(Scala, 1)
(Summit, 1)
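As a framework-free baseline for the task above, the following plain-Java sketch computes the same counts from the same five input lines. The class `WordCountReference` and its `count` method are hypothetical helpers, not part of the deck or of any framework. It processes a bounded list, so it only illustrates the computation that the streaming examples perform continuously.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, framework-free reference for the word-count task.
class WordCountReference {

    // Splits each line on single spaces and counts occurrences per word.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                // merge inserts 1 for a new word, otherwise adds 1 to the old count
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Spark Summit 2017",
                "Apache Apache Spark",
                "Storm Apache Trident",
                "Flink Streaming Samza",
                "Scala 2017 Streaming");
        count(lines).forEach((word, n) -> System.out.println("(" + word + ", " + n + ")"));
    }
}
```

The pairs are printed in first-seen order, so the order differs from the slide while the counts are the same.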
Storm
// Topology: a sentence spout feeds a splitter bolt, which feeds a counting bolt.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
// fieldsGrouping routes every occurrence of a word to the same WordCount task.
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
...
// Inside the WordCount bolt: counts are kept in a plain in-memory map per task.
Map<String, Integer> counts = new HashMap<String, Integer>();
public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}
Trident
public static StormTopology buildTopology(LocalDRPC drpc) {
    FixedBatchSpout spout = ...
    TridentTopology topology = new TridentTopology();
    // Split sentences into words, group by word, and keep a running count
    // per word in Trident-managed state (in memory here).
    TridentState wordCounts = topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(),
                             new Count(), new Fields("count"));
    ...
}
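Spark Streaming
val conf = new SparkConf().setAppName("wordcount")
// Batch interval of one second: each micro-batch covers one second of input.
val ssc = new StreamingContext(conf, Seconds(1))
val text = ...
// Counts are computed per micro-batch, not accumulated across batches.
val counts = text.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()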
Samza
class WordCountTask extends StreamTask {
  override def process(envelope: IncomingMessageEnvelope,
      collector: MessageCollector, coordinator: TaskCoordinator): Unit = {
    val text = envelope.getMessage.asInstanceOf[String]
    // Count the words of this single message; nothing is kept across messages.
    val counts = text.split(" ")
      .foldLeft(Map.empty[String, Int]) {
        (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
      }
    collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"),
      counts))
  }
}
Kafka Streams
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
KStream<String, String> textLines = builder.stream(stringSerde, stringSerde,
    "streams-file-input");
KStream<String, Long> wordCounts = textLines
    // Split each line on runs of non-word characters.
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    // Re-key by word so that equal words are counted together.
    .groupBy((key, value) -> value)
    .count("Counts")
    .toStream();
wordCounts.to(stringSerde, longSerde, "streams-wordcount-output");
Apex
val input = dag.addOperator("input", new LineReader)
val parser = dag.addOperator("parser", new Parser)
val out = dag.addOperator("console", new ConsoleOutputOperator)
dag.addStream[String]("lines", input.out, parser.in)
// `counter` (the counting operator) is not defined on the slide.
dag.addStream[String]("words", parser.out, counter.data)
// Splits each incoming line into words and emits them one by one.
class Parser extends BaseOperator {
  @transient
  val out = new DefaultOutputPort[String]()
  // process is a method of the input port, so it is overridden on the port itself.
  @transient
  val in = new DefaultInputPort[String]() {
    override def process(t: String): Unit = {
      for (w <- t.split(" ")) out.emit(w)
    }
  }
}
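Flink
// Streaming variant: the slide used the batch ExecutionEnvironment with groupBy,
// where print() already triggers execution and the later execute() call fails.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.fromElements(...)
val counts = text.flatMap ( _.split(" ") )
  .map ( (_, 1) )
  .keyBy(0)   // partition the stream by word
  .sum(1)     // running count per word
counts.print()
env.execute("wordcount")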
Summary
                  Storm          Trident              Spark Streaming          Samza               Kafka Streams       Apex                Flink
Streaming Model   Native         Micro-batching       Micro-batching           Native              Native              Hybrid              Native
API               Compositional  Compositional        Declarative              Compositional       Declarative*        Compositional*      Declarative
Guarantees        At-least-once  Exactly-once         Exactly-once*            At-least-once       At-least-once       Exactly-once        Exactly-once
Fault Tolerance   Record ACKs    Record ACKs          RDD based Checkpointing  Log-based           Log-based           Checkpointing       Checkpointing
State Management  Not built-in   Dedicated Operators  Dedicated DStream        Stateful Operators  Stateful Operators  Stateful Operators  Stateful Operators
Latency           Very Low       Medium               Medium                   Low                 Low                 Very Low            Low
Throughput        Low            Medium               High                     High                High                High                High
Maturity          High           High                 High                     Medium              Low                 Medium              Medium
General Guidelines
As always, it depends
General Guidelines
➜ Evaluate particular application needs
➜ Programming model
➜ Available delivery guarantees
➜ Almost all non-trivial jobs have state
➜ Fast recovery is critical
Recommendations [Storm & Trident]
● Good fit for small and fast tasks
● Very low (tens of milliseconds) latency
● State & fault tolerance degrade performance significantly
● Potential upgrade to Heron
○ Keeps the API; according to Twitter, better in every single way
○ Open-sourced recently
Recommendations [Spark Streaming]
● Spark Ecosystem
● Data Exploration
● Latency is not critical
● Micro-batching limitations
Recommendations [Samza]
● Kafka is a cornerstone of your architecture
● Application requires large states
● Exactly-once delivery is not needed
Recommendations [Kafka Streams]
● Functionality similar to Samza, with nicer APIs
● Operated by Kafka itself
● Gentle learning curve
● At-least-once delivery
● May not support more advanced functionality
Recommendations [Apex]
● Prefer compositional approach
● Hadoop
● Great performance
● Dynamic DAG changes
Recommendations [Flink]
● Conceptually great, fits most use cases
● Take advantage of batch processing capabilities
● Need functionality that is hard to implement in micro-batch
● Enough courage to use an emerging project
Dataflow and Apache Beam
[Figure: the Dataflow Model & SDKs define one pipeline with multiple modes (stream processing, batch processing) that runs on many runtimes, local or cloud: Apache Flink, Apache Spark, Direct Pipeline and Google Cloud Dataflow]
Questions
@petr_zapletal @cakesolutions
347 708 1518
enquiries@cakesolutions.net
We are hiring
http://www.cakesolutions.net/careers
Backup Slides
Processing Architecture Evolution
● Batch Pipeline [figure labels: HDFS, Serving DB, Query]
● Lambda Architecture [figure labels: All your data, Batch Layer, Stream Layer, Serving Layer, Oozie, Query]
● Standalone Stream Processing [figure labels: Stream Processing]
● Kappa Architecture [figure labels: Stream Processing, Query]
Streaming Applications
ETL Operations
● Transformations, joining or filtering of incoming data
Windowing
● Trends in a bounded interval, like tweets or sales
Machine Learning
● Clustering, trend fitting, classification
Pattern Recognition
● Fraud detection, signal triggering, anomaly detection
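Windowing, one of the applications listed above, can be sketched in a few lines of plain Java. The example assumes tumbling (fixed-size, non-overlapping) windows over event timestamps in milliseconds; the class and method names are hypothetical, and real frameworks additionally handle out-of-order events and window eviction, which this sketch ignores.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of tumbling (fixed, non-overlapping) windows.
class TumblingWindowCount {

    // Counts events per window; each window is identified by its start timestamp.
    static Map<Long, Integer> countPerWindow(long[] eventTimesMillis, long windowSizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMillis) {
            // Round the timestamp down to the start of its window
            // (assumes non-negative timestamps).
            long windowStart = t - (t % windowSizeMillis);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] tweetTimes = {100, 900, 1000, 1500, 2999, 3000};
        // prints {0=2, 1000=2, 2000=1, 3000=1}
        System.out.println(countPerWindow(tweetTimes, 1000));
    }
}
```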
Fault Tolerance
Fault tolerance in streaming systems is inherently harder than in batch
Managing State
f: (input, state) => (output, state’)
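The signature above can be read as a pure transition function that a runtime applies record by record, carrying the state forward. The following plain-Java sketch illustrates that reading with a running sum as the example `f`; the names `StatefulOperatorSketch`, `Result`, `runningSum` and `run` are hypothetical and not taken from any framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of f: (input, state) => (output, state').
class StatefulOperatorSketch {

    // Pair of the emitted output and the state carried to the next record.
    static final class Result {
        final long output;
        final long nextState;

        Result(long output, long nextState) {
            this.output = output;
            this.nextState = nextState;
        }
    }

    // Example f: a running sum. The output is the sum so far, which is also state'.
    static Result runningSum(long input, long state) {
        long sum = state + input;
        return new Result(sum, sum);
    }

    // Drives f over a stream, threading the state from one record to the next.
    static List<Long> run(List<Long> inputs, long initialState) {
        List<Long> outputs = new ArrayList<>();
        long state = initialState;
        for (long input : inputs) {
            Result r = runningSum(input, state);
            outputs.add(r.output);
            state = r.nextState;
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1L, 2L, 3L), 0L)); // prints [1, 3, 6]
    }
}
```

In this reading, the `state` variable in `run` is the part a framework would have to persist (for example by checkpointing) in order to recover after a failure.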
Performance
Hard to design an unbiased test; lots of variables
Performance
➜ Latency vs. Throughput
➜ Costs of Delivery Guarantees, Fault-tolerance & State Management
➜ Tuning
➜ Network operations, Data locality & Serialization
Project Maturity
When picking a framework, you should always consider its maturity
Project Maturity [Storm & Trident]
For a long time the de facto industry standard
Project Maturity [Spark Streaming]
The most trending Scala repository these days and one of the engines behind Scala’s popularity
Project Maturity [Samza]
Used by LinkedIn and by tens of other companies
Project Maturity [Kafka Streams]
???
Project Maturity [Apex]
Graduated very recently, adopted by a couple of corporate clients already
Project Maturity [Flink]
Still an emerging project, but we can see its first production deployments

More Related Content

What's hot

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
Petr Zapletal
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
Matthias Niehoff
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Don Drake
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
Romi Kuntsman
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Spark Summit
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Databricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
Provectus
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
Databricks
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
Spark Summit
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
Databricks
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Spark Summit
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks
 
Realtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingRealtime Reporting using Spark Streaming
Realtime Reporting using Spark Streaming
Santosh Sahoo
 

What's hot (20)

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Spark etl
Spark etlSpark etl
Spark etl
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Realtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingRealtime Reporting using Spark Streaming
Realtime Reporting using Spark Streaming
 

Viewers also liked

IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...
Spark Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Spark Summit
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark Summit
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Spark Summit
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...

Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk by Petr Zapletal

  • 3. Agenda ● Motivation ● Stream Processing ● Available Frameworks ● Systems Comparison ● Recommendations
  • 4. The Data Deluge ● New Sources and New Use Cases ● 8 Zettabytes (1 ZB = 1 trillion GB) created in 2015 ● Stream Processing to the Rescue
  • 5. Distributed Stream Processing ● Continuous processing, aggregation and analysis of unbounded data
  • 13. Points of Interest ➜ Runtime and Programming Model ➜ Primitives ➜ State Management ➜ Message Delivery Guarantees ➜ Fault Tolerance & Low Overhead Recovery ➜ Latency, Throughput & Scalability ➜ Maturity and Adoption Level ➜ Ease of Development and Operability
  • 14. Runtime and Programming Model ● The most important trait of a stream processing system ● Defines its expressiveness, possible operations and limitations ● Therefore defines the system's capabilities and use cases
  • 16. Micro-batching ● Records are processed in short batches [Figure: Receiver, micro-batches of records, Processing Operators, Sink Operator]
  • 17. Native Streaming ● Records are processed as they arrive Pros ⟹ Expressiveness ⟹ Low-latency ⟹ Stateful operations Cons ⟹ Fault-tolerance is expensive ⟹ Load-balancing
  • 18. Micro-batching ● Splits incoming stream into small batches Pros ⟹ High throughput ⟹ Easier fault tolerance ⟹ Simpler load-balancing Cons ⟹ Higher latency, depends on batch interval ⟹ Limited expressivity ⟹ Harder stateful operations
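
The difference between the two runtime models can be sketched in plain Java, without any of the frameworks discussed in the talk. This is only an illustration under stated assumptions: the `Event` type, the `toMicroBatches` helper and the one-second interval are hypothetical names and values chosen for the example, not part of any framework API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MicroBatchSketch {
    // Hypothetical record type: a payload plus its arrival time in milliseconds.
    static final class Event {
        final long arrivalMs;
        final String payload;

        Event(long arrivalMs, String payload) {
            this.arrivalMs = arrivalMs;
            this.payload = payload;
        }
    }

    // Groups events into micro-batches keyed by the start time of their batch interval.
    static Map<Long, List<String>> toMicroBatches(List<Event> events, long batchIntervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchStart = (e.arrivalMs / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(100, "a"), new Event(900, "b"),
            new Event(1200, "c"), new Event(2500, "d"));

        // Native streaming: every record is handled as soon as it arrives.
        for (Event e : events) {
            System.out.println("native: " + e.payload);
        }

        // Micro-batching: records are held until their batch interval closes.
        System.out.println("batches: " + toMicroBatches(events, 1000));
    }
}
```

In this sketch a record can wait up to one full batch interval before it is processed, which is why latency in the micro-batching model depends on the batch interval.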
  • 19. Programming Model ● Compositional ⟹ Provides basic building blocks such as operators or sources ⟹ Custom component definition ⟹ Manual topology definition & optimization ⟹ Advanced functionality often missing ● Declarative ⟹ High-level API ⟹ Operators as higher-order functions ⟹ Abstract data types ⟹ Advanced operations like state management or windowing supported out of the box ⟹ Advanced optimizers
  • 21. Storm ● Pioneer in large-scale stream processing
  • 22. Trident ● Higher-level micro-batching system built atop Storm
  • 23. Spark Streaming ● Unified batch and stream processing over a batch runtime [Figure: input data stream, Spark Streaming, batches of input data, Spark Engine, batches of processed data]
  • 24. Samza ● Builds heavily on Kafka's log-based philosophy [Figure: Task 1, Task 2, Task 3]
  • 25. Kafka Streams ● Simple streaming library on top of Apache Kafka
  • 26. Apex ● Processes massive amounts of real-time events natively in Hadoop [Figure: operators connected by streams of tuples, ending in an output operator]
  • 27. Flink ● Native streaming & high-level API [Figure: stream data (Kafka, RabbitMQ, ...) and batch data (HDFS, JDBC, ...)]
  • 28. Counting Words ● Input: Spark Summit 2017 Apache Apache Spark Storm Apache Trident Flink Streaming Samza Scala 2017 Streaming ● Output: (Apache, 3) (Streaming, 2) (2017, 2) (Spark, 2) (Storm, 1) (Trident, 1) (Flink, 1) (Samza, 1) (Scala, 1) (Summit, 1)
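
As a framework-free baseline for the examples that follow, the same word count can be sketched in plain Java over a fixed input string. This is not how any of the streaming systems compute it; the `WordCountSketch` class and `countWords` method are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountSketch {
    // Splits the text on whitespace and counts occurrences of each word.
    static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String input = "Spark Summit 2017 Apache Apache Spark Storm Apache "
            + "Trident Flink Streaming Samza Scala 2017 Streaming";
        // Prints one (word, count) pair per line, sorted by word.
        countWords(input).forEach((word, count) ->
            System.out.println("(" + word + ", " + count + ")"));
    }
}
```

The streaming versions differ mainly in that the input is unbounded, so the counts are state that has to be kept, updated and recovered by the framework.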
  • 29. Storm
      // Topology: sentence spout -> split bolt -> count bolt
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("spout", new RandomSentenceSpout(), 5);
      builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
      builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
      ...
      // Counting bolt: state is kept in a plain in-memory map
      Map<String, Integer> counts = new HashMap<String, Integer>();

      public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
      }
  • 32. Trident
      public static StormTopology buildTopology(LocalDRPC drpc) {
        FixedBatchSpout spout = ...
        TridentTopology topology = new TridentTopology();
        // Split sentences into words, group by word, keep counts as managed state
        TridentState wordCounts = topology.newStream("spout1", spout)
          .each(new Fields("sentence"), new Split(), new Fields("word"))
          .groupBy(new Fields("word"))
          .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        ...
      }
  • 35. Spark Streaming
      val conf = new SparkConf().setAppName("wordcount")
      // Batch interval of one second
      val ssc = new StreamingContext(conf, Seconds(1))
      val text = ...
      val counts = text.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.print()
      ssc.start()
      ssc.awaitTermination()
  • 39. Samza
      class WordCountTask extends StreamTask {
        override def process(envelope: IncomingMessageEnvelope,
                             collector: MessageCollector,
                             coordinator: TaskCoordinator) {
          val text = envelope.getMessage.asInstanceOf[String]
          // Count the words of a single message
          val counts = text.split(" ")
            .foldLeft(Map.empty[String, Int]) {
              (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
            }
          collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts))
        }
      }
  • 42. Kafka Streams
      final Serde<String> stringSerde = Serdes.String();
      final Serde<Long> longSerde = Serdes.Long();
      KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "streams-file-input");
      KStream<String, Long> wordCounts = textLines
        // Split each line on non-word characters
        .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
        .groupBy((key, value) -> value)
        .count("Counts")
        .toStream();
      wordCounts.to(stringSerde, longSerde, "streams-wordcount-output");
  • 44. Apex
      val input = dag.addOperator("input", new LineReader)
      val parser = dag.addOperator("parser", new Parser)
      val out = dag.addOperator("console", new ConsoleOutputOperator)
      dag.addStream[String]("lines", input.out, parser.in)
      // The counter operator is not defined on the slide
      dag.addStream[String]("words", parser.out, counter.data)

      class Parser extends BaseOperator {
        @transient val out = new DefaultOutputPort[String]()
        // Input port: emits every word of an incoming line
        @transient val in = new DefaultInputPort[String]() {
          override def process(t: String): Unit = {
            for (w <- t.split(" ")) out.emit(w)
          }
        }
      }
  • 47. Flink
      val env = ExecutionEnvironment.getExecutionEnvironment
      val text = env.fromElements(...)
      val counts = text.flatMap ( _.split(" ") )
        .map ( (_, 1) )
        .groupBy(0)
        .sum(1)
      counts.print()
      env.execute("wordcount")
  • 49. Summary [Table: framework column headers were logos and did not survive extraction; values are listed in extracted order]
      Streaming Model: Native | Micro-batching | Native | Native
      API: Compositional | Declarative | Compositional | Declarative
      Guarantees: At-least-once | Exactly-once | At-least-once | Exactly-once
      Fault Tolerance: Record ACKs | RDD-based Checkpointing | Log-based | Checkpointing
      State Management: Not built-in | Dedicated Operators | Stateful Operators | Stateful Operators
      Latency: Very Low | Medium | Low | Low
      Throughput: Low | Medium | High | High
      Maturity: High | High | Medium | Medium
      Additional column (Trident): Micro-batching, Exactly-once*, Dedicated DStream, Medium, High
      Additional column: Hybrid, Compositional*, Exactly-once, Checkpointing, Stateful Operators, Very Low, High, Medium
      Additional column: Native, Declarative*, At-least-once, Low, Log-based, Stateful Operators, Low, High
  • 55. General Guidelines ➜ Evaluate particular application needs ➜ Programming model ➜ Available delivery guarantees ➜ Almost all non-trivial jobs have state ➜ Fast recovery is critical
  • 56. Recommendations [Storm & Trident] ● Good fit for small and fast tasks ● Very low (tens of milliseconds) latency ● State & fault tolerance degrade performance significantly ● Potential upgrade to Heron ○ Keeps the API; according to Twitter, better in every single way ○ Open-sourced recently
  • 57. Recommendations [Spark Streaming] ● Spark Ecosystem ● Data Exploration ● Latency is not critical ● Micro-batching limitations
  • 58. Recommendations [Samza] ● Kafka is a cornerstone of your architecture ● Application requires large states ● Exactly-once delivery is not needed
  • 59. Recommendations [Kafka Streams] ● Functionality similar to Samza, with nicer APIs ● Operated by Kafka itself ● Gentle learning curve ● At-least-once delivery ● May not support more advanced functionality
  • 60. Recommendations [Apex] ● Prefer compositional approach ● Hadoop ● Great performance ● Dynamic DAG changes
  • 61. Recommendations [Flink] ● Conceptually great, fits most use cases ● Take advantage of batch processing capabilities ● Need functionality that is hard to implement in micro-batch ● Enough courage to use an emerging project
  • 62. Dataflow and Apache Beam ● One pipeline, many runtimes ● Multiple modes: stream processing, batch processing ● Dataflow Model & SDKs [Figure: runtimes Apache Flink and Apache Spark (local or cloud), Direct Pipeline (local), Google Cloud Dataflow (cloud)]
  • 64. MANCHESTER LONDON NEW YORK @petr_zapletal @cakesolutions 347 708 1518 enquiries@cakesolutions.net We are hiring http://www.cakesolutions.net/careers
  • 65. References ● http://storm.apache.org/ ● http://spark.apache.org/streaming/ ● http://samza.apache.org/ ● https://apex.apache.org/ ● https://flink.apache.org/ ● http://beam.incubator.apache.org/ ● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 ● http://data-artisans.com/blog/ ● http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple ● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at ● http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 ● http://stackoverflow.com/questions/29111549/where-do-apache-samza-and-apache-storm-differ-in-their-use-cases ● https://www.dropbox.com/s/1s8pnjwgkkvik4v/GearPump%20Edge%20Topologies%20and%20Deployment.pptx?dl=0 ● ...
  • 66. List of Figures ● https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html ● http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014 ● https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png ● http://data-artisans.com/wp-content/uploads/2015/08/microbatching.png ● http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job ● http://data-artisans.com/wp-content/uploads/2015/08/streambarrier.png ● https://cwiki.apache.org/confluence/display/FLINK/Stateful+Stream+Processing ● https://raw.githubusercontent.com/tweise/apex-samples/kafka-count-jdbc/exactly-once/docs/images/image00.png ● https://4.bp.blogspot.com/-RlLeDymI_mU/Vp-1cb3AxNI/AAAAAAAACSQ/5TphliHJA4w/s1600/dataflow%2BASF.png
  • 68. Processing Architecture Evolution: Batch Pipeline [Figure labels: HDFS, Serving DB, Query]
  • 69. Processing Architecture Evolution: Lambda Architecture [Figure labels: All your data, Batch Layer, Stream Layer, Serving Layer, Oozie, Query]
  • 70. Processing Architecture Evolution: Standalone Stream Processing [Figure labels: Stream Processing]
  • 71. Processing Architecture Evolution: Kappa Architecture [Figure labels: Stream Processing, Query]
  • 72. Streaming Applications: ETL Operations ● Transformations, joining or filtering of incoming data
  • 73. Streaming Applications: Windowing ● Trends in a bounded interval, like tweets or sales
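
One way to picture windowing is a sliding count of events over the most recent interval. The sketch below is plain Java and makes two simplifying assumptions that real frameworks do not: events arrive in non-decreasing timestamp order, and all timestamps of the window fit in memory. `SlidingWindowCounter` and its `add` method are hypothetical names for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SlidingWindowCounter {
    private final long windowMs;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowCounter(long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        this.windowMs = windowMs;
    }

    // Records an event at nowMs and returns how many events fall into
    // the window (nowMs - windowMs, nowMs].
    int add(long nowMs) {
        timestamps.addLast(nowMs);
        // Evict events that have slid out of the window. The event just added
        // is always inside the window, so the deque never becomes empty here.
        while (timestamps.peekFirst() <= nowMs - windowMs) {
            timestamps.removeFirst();
        }
        return timestamps.size();
    }

    public static void main(String[] args) {
        SlidingWindowCounter tweetsPerSecond = new SlidingWindowCounter(1000);
        for (long t : new long[] {0, 500, 1000, 1600}) {
            System.out.println("t=" + t + " count=" + tweetsPerSecond.add(t));
        }
    }
}
```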
  • 74. Streaming Applications Machine Learning ● Clustering, Trend fitting, Classification
  • 75. Streaming Applications Pattern Recognition ● Fraud detection, Signal triggering, Anomaly detection
  • 76. Fault Tolerance ● Fault tolerance in streaming systems is inherently harder than in batch
  • 77. Managing State f: (input, state) => (output, state’)
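
The signature f: (input, state) => (output, state') can be sketched in plain Java as a function that returns the output together with the next state. The names `StatefulFn`, `Step`, `COUNT` and `run` are hypothetical and exist only for this illustration; no framework API is implied.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StatefulOperatorSketch {
    // Pairs the output with the next state, mirroring (output, state').
    static final class Step<O, S> {
        final O output;
        final S state;

        Step(O output, S state) {
            this.output = output;
            this.state = state;
        }
    }

    // f: (input, state) => (output, state')
    interface StatefulFn<I, S, O> {
        Step<O, S> apply(I input, S state);
    }

    // Running count per word. Copies the map instead of mutating it, so the
    // previous state stays intact (convenient for snapshots, costly for large state).
    static final StatefulFn<String, Map<String, Integer>, Integer> COUNT = (word, state) -> {
        Map<String, Integer> next = new HashMap<>(state);
        int count = next.merge(word, 1, Integer::sum);
        return new Step<>(count, next);
    };

    // Threads the state through a sequence of inputs and collects the outputs.
    static <I, S, O> List<O> run(StatefulFn<I, S, O> f, S initial, List<I> inputs) {
        List<O> outputs = new ArrayList<>();
        S state = initial;
        for (I input : inputs) {
            Step<O, S> step = f.apply(input, state);
            outputs.add(step.output);
            state = step.state;
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(run(COUNT, new HashMap<>(), List.of("a", "b", "a")));
    }
}
```

Making the state an explicit value is what allows a system to checkpoint it and restore it after a failure, which ties state management to the fault tolerance mechanisms compared earlier.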
  • 78. Performance ● Hard to design an unbiased test; lots of variables
  • 82. Performance ➜ Latency vs. Throughput ➜ Costs of Delivery Guarantees, Fault-tolerance & State Management ➜ Tuning ➜ Network operations, Data locality & Serialization
  • 83. Project Maturity ● When picking a framework, you should always consider its maturity
  • 84. Project Maturity [Storm & Trident] ● For a long time the de facto industry standard
  • 85. Project Maturity [Spark Streaming] The most trending Scala repository these days and one of the engines behind Scala’s popularity
  • 86. Project Maturity [Samza] Used by LinkedIn and also by tens of other companies
  • 87. Project Maturity [Kafka Streams] ???
  • 88. Project Maturity [Apex] Graduated very recently, adopted by a couple of corporate clients already
  • 89. Project Maturity [Flink] Still an emerging project, but we can see its first production deployments