Early on, a colleague of ours sent us this exception… this is truncated.
This talk is going to be about these kinds of errors you sometimes get when running…
This is probably the most common failure you're going to see. First, in this case, the punchline is going to be that the problem is your fault. But second, what does all this other stuff mean, and why is Spark telling me this in this way?
Let's start with an example program in Spark.
The sum() call launches a job
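To make "the sum() call launches a job" concrete, here is a toy Python sketch of Spark's lazy evaluation (this is not real Spark code; `ToyRDD`, `parallelize`, and the closure-based laziness are illustrative assumptions): transformations like map() only record what to do, and nothing runs until an action like sum().

```python
# Toy model of lazy evaluation: transformations build up a recipe,
# the action is what actually triggers computation across partitions.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute          # () -> list of partitions

    @staticmethod
    def parallelize(data, num_partitions=2):
        parts = [data[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(lambda: parts)

    def map(self, fn):
        parent = self._compute           # capture parent lazily
        return ToyRDD(lambda: [[fn(x) for x in p] for p in parent()])

    def sum(self):
        # Action: this is the point where the "job" actually runs.
        return sum(x for part in self._compute() for x in part)

numbers = ToyRDD.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)   # nothing has executed yet
total = doubled.sum()                    # the action launches the "job"
# total == 20
```

The design point this models: because map() is lazy, Spark can see the whole chain of transformations before scheduling any work.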
A chunk of data somewhere
Could be on Hadoop File System (HDFS)
Could be cached in Spark
Defines the degree of parallelism
Describes a way of generating input and output partitions
Immutable – very important!
RDDs can depend on other RDDs
Most have single parent
Joins have multiple parents
Lineage over replication for fault tolerance
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
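The "lineage over replication" idea above can be sketched in a few lines of toy Python (an assumed simplification, not Spark internals; `LineageRDD` and its fields are made up for illustration): each RDD remembers its parent and how to derive a partition from the parent's, so a lost partition is rebuilt by recomputation instead of being read from a replica.

```python
# Toy sketch of lineage-based fault tolerance: because RDDs are
# immutable, replaying the lineage always reproduces the same partition.
class LineageRDD:
    def __init__(self, parent=None, fn=None, source=None):
        self.parent, self.fn, self.source = parent, fn, source

    def compute(self, i):
        # Rebuild partition i by walking the lineage back to the source.
        if self.source is not None:
            return self.source[i]
        return [self.fn(x) for x in self.parent.compute(i)]

base = LineageRDD(source=[[1, 2], [3, 4]])          # 2 input partitions
squared = LineageRDD(parent=base, fn=lambda x: x * x)

# Pretend partition 1 of `squared` was lost with its executor:
# no replicated copy is needed, just recompute from the parent.
assert squared.compute(1) == [9, 16]
```

Immutability is what makes this safe: if partitions could mutate, recomputing from lineage might not give back the data that was lost.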
Narrow
map
filter
Wide
join
groupByKey
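The narrow/wide split above can be illustrated with a toy Python sketch (not Spark code; the function names and the two-reducer hash partitioning are illustrative assumptions): a narrow operation like map reads exactly one parent partition, while a wide operation like groupByKey must pull records from every parent partition.

```python
# Narrow dependency: each output partition depends on one input partition.
def narrow_map(partitions, fn):
    return [[fn(x) for x in part] for part in partitions]

# Wide dependency: every input partition can contribute records to
# every output partition -- this is what forces a shuffle.
def wide_group_by_key(partitions, num_reducers=2):
    out = [{} for _ in range(num_reducers)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_reducers].setdefault(key, []).append(value)
    return out

pairs = [[("a", 1), ("b", 2)], [("a", 3)]]
grouped = wide_group_by_key(pairs)
# All values for "a" land in a single reducer, no matter which
# input partition they started in.
```

This is why wide dependencies are stage boundaries: the downstream side cannot start until every upstream partition has produced its output.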
More details:
A job is a DAG of stages
The scheduler creates a set of tasks per stage
Partitions assigned to a task
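The job/stage/task relationship above can be sketched as a toy planner (a hedged simplification, not the real DAGScheduler; `plan_job` and the `ops` encoding are made up for illustration): stage boundaries fall at wide dependencies, and within each stage there is one task per partition.

```python
# Toy model: split a linear chain of operations into stages at wide
# (shuffle) dependencies, then emit one task per (stage, partition).
def plan_job(ops, num_partitions):
    stages, current = [], []
    for name, kind in ops:
        if kind == "wide" and current:
            stages.append(current)       # cut the stage at the shuffle
            current = []
        current.append(name)
    stages.append(current)
    tasks = [(s, p) for s in range(len(stages))
                    for p in range(num_partitions)]
    return stages, tasks

ops = [("map", "narrow"), ("filter", "narrow"),
       ("groupByKey", "wide"), ("map", "narrow")]
stages, tasks = plan_job(ops, num_partitions=3)
# stages == [['map', 'filter'], ['groupByKey', 'map']]
# 2 stages x 3 partitions = 6 tasks
```

Note how the narrow map and filter get pipelined into a single stage, while groupByKey starts a new one.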
Talk about the relationship between stages and tasks.
So with all this information in hand, we can come back and interpret this error.
We tried to run a job, but it failed because one of its stages failed. Why did that stage fail? Because one of its tasks failed. Tasks will be retried.
Mercifully, Spark gives us the exception that caused the most recent failure.
Let's review the general Spark architecture.
A driver
Where the DAG scheduler lives
Drives the show
Single point of failure
Executors
Communicate with the driver
Run the tasks created by the driver
Think of an executor as a ThreadPoolExecutor in Java
Pluggable cluster managers
YARN, Mesos, standalone
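The notes compare an executor to Java's ThreadPoolExecutor; a toy Python analogue of the same shape (illustrative only, using the stdlib pool rather than anything Spark-specific) is the "driver" submitting per-partition tasks to a pool of worker slots and collecting the results.

```python
from concurrent.futures import ThreadPoolExecutor

# A task processes one partition's worth of data and returns a result
# to the "driver".
def run_task(partition):
    return sum(partition)

partitions = [[1, 2], [3, 4], [5, 6]]

# The pool plays the executor's role: a fixed set of slots that
# run whatever tasks the driver hands them.
with ThreadPoolExecutor(max_workers=2) as executor_slots:
    results = list(executor_slots.map(run_task, partitions))
# results == [3, 7, 11]
```

As in Spark, there are more tasks than worker slots here, so tasks queue up and run as slots free up.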
This shows up in the YARN NodeManager logs
mention that this is what happens with a groupByKey or reduceByKey
show blocks being deserialized into Java objects and placed into map
show spill
with fewer tasks, more of these blocks have to go to the same reducer, and more stuff needs to be held in this map
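The reduce-side picture in these notes can be sketched as toy Python (an assumed simplification of shuffle fetch, not Spark's actual ExternalAppendOnlyMap; `reduce_side_merge` and the record-count "memory budget" are illustrative): fetched blocks are merged into an in-memory map, and when the map outgrows its budget it spills.

```python
# Toy model of the reduce side of a shuffle: merge fetched blocks into
# a map keyed by record key; spill the map when it gets too big.
def reduce_side_merge(blocks, memory_budget=4):
    in_memory, spills = {}, []
    for block in blocks:                    # each block: list of (key, value)
        for key, value in block:
            in_memory.setdefault(key, []).append(value)
            if sum(len(v) for v in in_memory.values()) > memory_budget:
                spills.append(in_memory)    # "spill to disk" and start fresh
                in_memory = {}
    return in_memory, spills

blocks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5), ("c", 6)]]
final_map, spills = reduce_side_merge(blocks, memory_budget=4)
# With fewer reduce tasks, more blocks land on the same reducer, the map
# fills faster, and the reducer either spills more often or runs out
# of memory -- which is the failure mode the notes describe.
```

This is the mechanical reason bumping the number of reduce-side partitions often fixes shuffle OOMs: each task's map holds less.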