"Why Spark Is by No Means That Good"

SPARK
Alexey Diomin, diominay@gmail.com

RDD
 Resilient Distributed Dataset
 SchemaRDD

Mythology
 Spark is not MapReduce

Mythology
 Run programs up to 100x faster than
MapReduce in memory, or 10x faster on disk

Mythology
 InMemory processing

Mythology
 Spark Streaming is real-time streaming
 Lightning-fast cluster computing

Spark
 Run programs up to 100x faster than Hadoop
MapReduce* in memory, or 10x faster on disk
*Hadoop without Tez

InMemory
 The MapReduce and Spark shuffles use a “pull”
model. Every map task writes out data to local
disk, and then the reduce tasks make remote
requests to fetch that data
 http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

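The "pull" model quoted above can be illustrated without Spark at all. The following is a minimal plain-Java sketch, not Spark's shuffle code: the class PullShuffleSketch, its methods, and the file naming scheme are all hypothetical. It only shows the shape of the data flow: every map task writes one file per reduce partition to local disk, and every reduce task then reads its file from each map task.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a pull-style shuffle; not taken from Spark or Hadoop.
class PullShuffleSketch {

    // Creates the directory that stands in for a worker's local disk.
    static Path newLocalDir() {
        try {
            return Files.createTempDirectory("shuffle-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Map side: partition the words by hash and write one file per reducer.
    static void mapTask(int mapId, List<String> words, int numReducers, Path dir) {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String w : words) {
            buckets.get(Math.floorMod(w.hashCode(), numReducers)).add(w);
        }
        try {
            for (int r = 0; r < numReducers; r++) {
                Files.write(dir.resolve("map" + mapId + "-reduce" + r), buckets.get(r));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reduce side: pull this reducer's file from every map task, then count words.
    static Map<String, Integer> reduceTask(int reduceId, int numMaps, Path dir) {
        Map<String, Integer> counts = new TreeMap<>();
        try {
            for (int m = 0; m < numMaps; m++) {
                for (String w : Files.readAllLines(dir.resolve("map" + m + "-reduce" + reduceId))) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

Even in this toy version, all intermediate data passes through the file system between the two stages, which is the point the quote makes about "in-memory" processing.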
Spark Streaming
 RDD
 DAG

Spark Streaming
Receiver.store(...)
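Receiver.store(...) only hands a record to a buffer; the buffered records are later cut into batches at a fixed interval and each batch is processed as an ordinary RDD job. A minimal plain-Java sketch of that micro-batching idea follows. It does not use Spark: MicroBatcher, store and cutBatch are hypothetical names, and time is passed in explicitly so the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a receiver plus batch timer; not Spark code.
class MicroBatcher {
    private final long batchIntervalMs;
    private List<String> buffer = new ArrayList<>();
    private long batchStartMs;

    MicroBatcher(long batchIntervalMs, long startMs) {
        this.batchIntervalMs = batchIntervalMs;
        this.batchStartMs = startMs;
    }

    // Analogous in spirit to Receiver.store(...): the record is buffered, not processed.
    synchronized void store(String record) {
        buffer.add(record);
    }

    // Called by a timer. Returns the finished batch once the interval has elapsed,
    // or null while the current batch is still being filled.
    synchronized List<String> cutBatch(long nowMs) {
        if (nowMs - batchStartMs < batchIntervalMs) {
            return null;
        }
        List<String> batch = buffer;
        buffer = new ArrayList<>();
        batchStartMs = nowMs;
        return batch;
    }
}
```

In this sketch a record stored just after a batch starts is not visible to any computation until the interval ends, so latency is bounded below by the batch interval rather than by per-record processing time.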

Google Cloud Dataflow
 One of the most compelling aspects of Cloud
Dataflow is its approach to one of the most
difficult problems facing data engineers: how to
develop pipeline logic that can execute in both
batch and streaming contexts.
 http://blog.cloudera.com/blog/2015/01/new-in-cloudera-labs-google-cloud-dataflow-on-apache-spark/

Lightning-fast cluster computing

Spark
 Logging
 Pipeline
 Indexes
 Job progress
 Effective Memory
 Network

Indexes
 Netflix
 https://github.com/amplab/spark-indexedrdd

Job Progress
 Accumulators
 Broadcast

Memory
 val value = task.run(taskId, attemptNumber)

Memory
 val valueBytes = resultSer.serialize(value)

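Memory
 val directResult = new DirectTaskResult(valueBytes, accumUpdates, task.metrics.orNull)
 val serializedDirectResult = ser.serialize(directResult)
 Default JavaSerializer: the bytes are copied once more by java.io.ByteArrayOutputStream
public synchronized byte[] toByteArray() {
    // Returns a fresh copy of the internal buffer, not the buffer itself
    return Arrays.copyOf(buf, count);
}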
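
The chain on the Memory slides (task value, then valueBytes, then the serialized DirectTaskResult) means the same payload exists several times over before it leaves the executor. The plain-Java sketch below reproduces only that pattern with standard Java serialization. It is not Spark code: ResultCopiesSketch, Envelope, javaSerialize and footprint are hypothetical names, and Envelope is just a stand-in for DirectTaskResult.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical illustration of repeated serialization; not taken from Spark.
class ResultCopiesSketch {

    // Stand-in for DirectTaskResult: wraps bytes that are already serialized.
    static class Envelope implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] valueBytes;

        Envelope(byte[] valueBytes) {
            this.valueBytes = valueBytes;
        }
    }

    // Serializes with standard Java serialization into a ByteArrayOutputStream.
    static byte[] javaSerialize(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // toByteArray() copies the stream's internal buffer into a new array.
        return bos.toByteArray();
    }

    // Returns the sizes of the three arrays that are all reachable at the same time.
    static long[] footprint(int payloadSize) {
        byte[] value = new byte[payloadSize];                        // the task result
        byte[] valueBytes = javaSerialize(value);                    // first serialization
        byte[] serializedResult = javaSerialize(new Envelope(valueBytes)); // second one
        return new long[] { value.length, valueBytes.length, serializedResult.length };
    }
}
```

For a payload of N bytes, each of the three arrays is at least N bytes long, and the internal buffers of the two streams come on top of that until they are garbage collected.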

Network
 Problems with firewalls, NAT, multiple IP addresses, etc.

SQL
 Shark (dead)
 Spark SQL
 Hive on Spark

SparkR
 Unstable API
 Minimal documentation
 RStudio Server

Links
 Spark
 http://spark.apache.org/
 Flink
 http://flink.apache.org/
 Tez
 http://tez.apache.org/
