SPARK
Alexey Diomin, diominay@gmail.com
Intro
Basic
• RDD
• DAG
RDD
• Resilient Distributed Dataset
• SchemaRDD
DAG
Mythology
• Spark is not MapReduce
• Run programs up to 100x faster than MapReduce in memory, or 10x faster on disk
• InMemory processing
• Spark Streaming is real-time streaming
• Lightning-fast cluster computing
MapReduce
Not MapReduce
Spark
• Run programs up to 100x faster than Hadoop MapReduce* in memory, or 10x faster on disk
*Hadoop without Tez
http://spark.apache.org/
InMemory
• The MapReduce and Spark shuffles use a “pull” model. Every map task writes out data to local disk, and then the reduce tasks make remote requests to fetch that data.
• http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
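The pull model quoted above can be sketched as a toy word count. This is not Spark or Hadoop code: the class `PullShuffleSketch`, its method names, and the file naming scheme are all hypothetical, and reading a local temp file stands in for the remote fetch. It assumes words contain no line breaks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a pull-based shuffle; hypothetical names, not Spark or Hadoop code. */
final class PullShuffleSketch {

    /** Map side: partition words by hash and write one local file per reducer. */
    static List<Path> runMapTask(Path dir, int mapId, List<String> words, int numReducers) {
        try {
            List<List<String>> buckets = new ArrayList<>();
            for (int r = 0; r < numReducers; r++) {
                buckets.add(new ArrayList<>());
            }
            for (String w : words) {
                buckets.get(Math.floorMod(w.hashCode(), numReducers)).add(w);
            }
            List<Path> files = new ArrayList<>();
            for (int r = 0; r < numReducers; r++) {
                Path f = dir.resolve("map" + mapId + "-reduce" + r + ".txt");
                Files.write(f, buckets.get(r)); // map output always lands on local disk
                files.add(f);
            }
            return files;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reduce side: pull this reducer's file from every map task and count words. */
    static Map<String, Integer> runReduceTask(List<List<Path>> mapOutputs, int reducerId) {
        try {
            Map<String, Integer> counts = new TreeMap<>();
            for (List<Path> oneMap : mapOutputs) {
                // Stands in for a remote fetch request to the node that ran the map task.
                for (String w : Files.readAllLines(oneMap.get(reducerId))) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
            return counts;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs every map task, then every reduce task, and merges the per-reducer counts. */
    static Map<String, Integer> wordCount(List<List<String>> inputSplits, int numReducers) {
        try {
            Path dir = Files.createTempDirectory("shuffle-sketch");
            List<List<Path>> mapOutputs = new ArrayList<>();
            for (int m = 0; m < inputSplits.size(); m++) {
                mapOutputs.add(runMapTask(dir, m, inputSplits.get(m), numReducers));
            }
            Map<String, Integer> total = new TreeMap<>();
            for (int r = 0; r < numReducers; r++) {
                // Each word hashes to exactly one reducer, so the key sets are disjoint.
                total.putAll(runReduceTask(mapOutputs, r));
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wordCount(
                Arrays.asList(Arrays.asList("a", "b", "a"), Arrays.asList("b", "c")), 2));
    }
}
```

Even in this toy version every intermediate record goes through a file before a reducer sees it, which is the sense in which the shuffle step is not in-memory.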
Spark Streaming
• RDD
• DAG
Spark Streaming
Receiver.store(...)
Google Cloud Dataflow
• One of the most compelling aspects of Cloud Dataflow is its approach to one of the most difficult problems facing data engineers: how to develop pipeline logic that can execute in both batch and streaming contexts.
• http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/
Lightning-fast cluster computing
Spark
• Logging
• Pipeline
• Indexes
• Job progress
• Effective Memory
• Network
Example
Staged (batch) execution
Pipelined execution
Indexes
• Netflix
• https://github.com/amplab/spark-indexedrdd
Job Progress
• Accumulators
• Broadcast
Memory
• val value = task.run(taskId, attemptNumber)
• val valueBytes = resultSer.serialize(value)
• val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
• val serializedDirectResult = ser.serialize(directResult)
• Default JavaSerializer
public synchronized byte[] toByteArray() {
    return Arrays.copyOf(buf, count);
}
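One way to read the snippets above: the task result is serialized once, wrapped, and serialized again, and each pass through a `ByteArrayOutputStream` ends in `toByteArray()`, which copies the buffer. The sketch below imitates that sequence with plain Java serialization. `ResultCopySketch` and `WrappedResult` are hypothetical stand-ins, not Spark classes, and the sizes it reports are for Java serialization only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Toy model of double serialization; hypothetical names, not Spark code. */
final class ResultCopySketch {

    /** Hypothetical wrapper that carries the already-serialized value. */
    static final class WrappedResult implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] valueBytes;

        WrappedResult(byte[] valueBytes) {
            this.valueBytes = valueBytes;
        }
    }

    /** Java-serializes an object; toByteArray() returns a fresh copy of the stream's buffer. */
    static byte[] serialize(Object o) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Returns {serialized value size, serialized wrapper size} for an n-byte payload. */
    static int[] sizes(int n) {
        byte[] value = new byte[n];                                // the task's result
        byte[] valueBytes = serialize(value);                      // first full copy
        byte[] wrapped = serialize(new WrappedResult(valueBytes)); // second full copy
        return new int[] { valueBytes.length, wrapped.length };
    }

    public static void main(String[] args) {
        int[] s = sizes(1 << 20);
        System.out.println("value=" + s[0] + " bytes, wrapped=" + s[1] + " bytes");
    }
}
```

While `sizes` runs, the payload, its serialized form, and the serialized wrapper are all live at once, in addition to each stream's internal buffer, so peak memory is several times the size of the result.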
Network
• Problems with firewalls, NAT, multiple IP addresses, etc.
SQL
• Shark (dead)
• Spark SQL
• Hive on Spark
SparkR
• Unstable API
• Minimal docs
• RStudio Server
Links
• Spark
• http://spark.apache.org/
• Flink
• http://flink.apache.org/
• Tez
• http://tez.apache.org/

“Why Spark Is Not Nearly That Good”
