Scaling Big Data with Hadoop and Mesos

Bernardo Gomez Palacio
Software Engineer at Guavus Inc

Mesos and Data Analysis
Yes, you don't need Hadoop to start using Mesos and
Spark.

Now, If You...
- Need to store large files? By default, each HDFS block is 128 MB.
- Write data mainly as new files or by appending to existing ones?

Convinced you want to jump on the
Hadoop bandwagon?
Read:
Sammer, Eric. "Hadoop Operations." Sebastopol, CA: O'Reilly, 2012. Print.

Distributions
Apache Bigtop, CDH, HDP, MapR

Assuming You Already Have Mesos
- Mesosphere packages
  - https://mesosphere.io/downloads/
- From source
  - https://github.com/apache/mesos

Hadoop MRV1 in Mesos
https://github.com/mesos/hadoop

Hadoop MRV1 in Mesos
- Requires Hadoop MRV1.
- Officially works with CDH5 MRV1.
- Apache Hadoop 0.22, 0.23 and 1+.
- Apache Hadoop 2+ doesn't come with MRV1!

- Requires a JobTracker.
- By default it uses org.apache.hadoop.mapred.JobQueueTaskScheduler.
- You can change it, e.g. to ...mapred.FairScheduler.

- Requires a TaskTracker.
- That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.
- And not org.apache.hadoop.mapred.TaskTracker.java.

How Does Hadoop MRV1 Run in
Mesos?

How Does Hadoop MRV1 in Mesos Work?
1. The framework's Mesos scheduler creates the JobTracker as part of the driver.
2. The JobTracker uses org.apache.hadoop.mapred.MesosScheduler to launch tasks.

Mesos Hadoop Task Scheduling
- mapred.mesos.slot.cpus (1)
- mapred.mesos.slot.disk (1024 MB)
- mapred.mesos.slot.mem (1024 MB)
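
To make the slot sizes above concrete, here is a small Scala sketch of the arithmetic: how many slots of the default size would fit in a single resource offer. The `SlotSize` and `Offer` types and the `slotsThatFit` helper are hypothetical names invented for this example. They are not part of the Hadoop on Mesos framework, whose actual scheduling logic may differ.

```scala
// Hypothetical model: not the framework's API, only the arithmetic.
final case class SlotSize(cpus: Double, memMb: Long, diskMb: Long)
final case class Offer(cpus: Double, memMb: Long, diskMb: Long)

object SlotMath {
  // Defaults taken from the slide: 1 CPU, 1024 MB disk, 1024 MB memory per slot.
  val defaultSlot: SlotSize = SlotSize(cpus = 1.0, memMb = 1024L, diskMb = 1024L)

  // The scarcest resource bounds the number of whole slots that fit.
  def slotsThatFit(offer: Offer, slot: SlotSize = defaultSlot): Int = {
    val byCpu  = math.floor(offer.cpus / slot.cpus).toLong
    val byMem  = offer.memMb / slot.memMb
    val byDisk = offer.diskMb / slot.diskMb
    math.max(0L, List(byCpu, byMem, byDisk).min).toInt
  }
}
```

For example, an offer of 4 CPUs, 6144 MB of memory and 10240 MB of disk fits four default slots, because CPU runs out first.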

Additional Mesos parameters
- mapred.mesos.checkpoint (false)
- mapred.mesos.role (*)

Thoughts
What about Hadoop 2.4?
Namenode HA?
MRV2 and YARN?

Personal Preference
- Use Hadoop 2.4.0 or above.
- NameNode HA through the Quorum Journal Manager.
- Move to Spark if possible.

Example of a Mesos Data Analysis
Stack
1. HDFS stores files.
2. Use the Spark CLI to test ideas.
3. Use Spark Submit for jobs.
4. Use Chronos or Oozie to schedule workflows.

Spark On Mesos
https://spark.apache.org/docs/latest/img/cluster-overview.png

Know that Each Spark Application
1. Has its own driver process.
2. Has its own RDDs.
3. Has its own cache.

Spark Schedulers on Mesos
Fine Grained
Coarse Grained

Spark Fine Grained Scheduling
- Enabled by default.
- Each Spark task runs as a separate Mesos task.
- Has an overhead in launching each task.

Spark Coarse Grained Scheduling
- Uses only one long-running Spark task on each Mesos slave.
- Dynamically schedules its own “mini-tasks”, using Akka.
- Lower startup overhead.
- Reserves the cluster resources for the complete duration of the application.

Beware of...
- Greedy scheduling (Coarse Grained)
- Overcommitting and deadlocks (Fine Grained)

Using Spark
Understand Parametrization and Usage
- spark.app.name
- spark.executor.memory
- spark.serializer
- spark.local.dir
- ...

Use Spark Submit
Avoid parametrizing the Spark Context in your code as
much as possible.
Leverage the spark-submit arguments, properties files
as well as environment variables to configure your
application.
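
One way to follow this advice is to keep settings in a properties file and hand it to spark-submit with `--properties-file`. The sketch below renders such a file from a Scala `Map`. The `renderProperties` helper and every value shown are assumptions made for illustration. The property names are the ones listed earlier in the deck, plus `spark.mesos.coarse`, which selects the coarse-grained Mesos mode.

```scala
object SparkProps {
  // Render one "key value" line per setting, sorted for stable output.
  def renderProperties(settings: Map[String, String]): String =
    settings.toSeq.sortBy(_._1).map { case (k, v) => s"$k $v" }.mkString("\n")

  // Illustrative values only; adjust them for your own cluster.
  val example: Map[String, String] = Map(
    "spark.app.name"        -> "example-job",
    "spark.executor.memory" -> "2g",
    "spark.serializer"      -> "org.apache.spark.serializer.KryoSerializer",
    "spark.local.dir"       -> "/tmp/spark",
    "spark.mesos.coarse"    -> "true"
  )
}
```

Writing the rendered text to a file and passing that file to spark-submit keeps the application code free of hard-coded settings, so the same jar can run unchanged against different clusters.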

Using Spark
Accept That Tuning is a
Science & an Art

Understand and Tune Your Applications
- Know your working set.
- Understand Spark partitioning and block management.
- Define your Spark workflow and where to cache/persist.
- If you cache, you will serialize; use Kryo.

Example Spark API PairRDDFunctions
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)]

PairRDDFunctions.combineByKey
- Combines the elements for each key using a custom set of aggregation functions.
- Turns an RDD[(K, V)] into an RDD[(K, C)].

PairRDDFunctions.combineByKey
- createCombiner: turns a V into a C.
- mergeValue: merges a V into a C.
- mergeCombiners: combines two C's into a single one.
The partitioner defaults to HashPartitioner.
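
To see how the three functions cooperate, the following sketch models combineByKey on plain Scala collections, with no Spark dependency. Each inner `Seq` stands in for one partition. `LocalCombine.combineByKey` is a hypothetical helper written for this example, not Spark's implementation; it only mirrors the order in which the functions are applied.

```scala
object LocalCombine {
  // Each inner Seq plays the role of one partition of an RDD[(K, V)].
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: combine values within each partition.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val c = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v) // key already seen here
          case None           => createCombiner(v)       // first value for this key
        }
        acc.updated(k, c)
      }
    }
    // Step 2: merge the per-partition combiners for each key.
    perPartition.foldLeft(Map.empty[K, C]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Example: per-key average via a (sum, count) combiner.
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](partitions)(
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, n)) => k -> sum / n }
}
```

The (sum, count) pair is the C type here: it differs from the Double values being combined, which is the case combineByKey is designed for.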

Example Spark API PairRDDFunctions
self: RDD[(K, V)]
def aggregateByKey[U: ClassTag](zeroValue: U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U
): RDD[(K, U)]
Uses the default partitioner.
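
The same kind of local model helps to read the aggregateByKey signature. In the sketch below, which uses plain Scala collections rather than Spark, every key starts from `zeroValue` inside each partition, `seqOp` folds values in, and `combOp` merges the partial results. `LocalAggregate` and `distinctValues` are hypothetical names for this illustration only.

```scala
object LocalAggregate {
  // Each inner Seq plays the role of one partition of an RDD[(K, V)].
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, every key starts from zeroValue.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Across partitions, combOp merges the partial results per key.
    perPartition.foldLeft(Map.empty[K, U]) { (merged, m) =>
      m.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  // Example: the set of distinct values per key (U = Set[Int], V = Int).
  def distinctValues(partitions: Seq[Seq[(String, Int)]]): Map[String, Set[Int]] =
    aggregateByKey[String, Int, Set[Int]](partitions)(Set.empty[Int])(
      (s, v) => s + v,
      (a, b) => a ++ b)
}
```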

Tune your Data
- For each data source, understand its optimal block size.
- Leverage Avro as the serialization format.
- Leverage Parquet as the storage format.
- Try to keep your Avro & Parquet schemas flat.

Each Application
- Instrument the code.
- Measure input size in number of records and byte size.
- Measure output size in the same way.
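
A minimal sketch of such instrumentation, assuming text records and plain Scala iterators: the wrapper below counts records and bytes as they stream through. `IoStats` and `Instrument.measured` are hypothetical names, and measuring byte size as the UTF-8 length of each record is an assumption. In a Spark job the same idea is usually expressed with accumulators.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical counters for one input or output stream of text records.
final class IoStats {
  var records: Long = 0L
  var bytes: Long = 0L
}

object Instrument {
  // Wraps an iterator so that every record passing through is counted.
  // Byte size is taken as the UTF-8 length of the record (an assumption);
  // use whatever measure matches your storage format.
  def measured(it: Iterator[String], stats: IoStats): Iterator[String] =
    it.map { rec =>
      stats.records += 1
      stats.bytes += rec.getBytes(StandardCharsets.UTF_8).length
      rec
    }
}
```

The counters only advance as the wrapped iterator is consumed, so read them after the stage has finished. Using one `IoStats` for input and another for output gives the two measurements listed above.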

Standardize
- The JDK & JRE version across your cluster.
- The Spark version across your cluster.
- The libraries that will be added to the JVM classpath by default.
- A packaging strategy for your application, e.g. an uber jar.

Some Differences with YARN
- Execution: cluster vs. client modes.
- Isolation: process vs. cgroups.
- Docker support? LXC templates?
- Deployment complexity?

References
1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014.
2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014.

References
1. Sammer, Eric. Hadoop Operations. Sebastopol, CA: O'Reilly, 2012. Print.
2. "Spark Configuration." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
3. "Tuning Spark." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.

References
1. Ryza, Sandy. "Managing Multiple Resources in Hadoop 2 with YARN." Cloudera Developer Blog. Cloudera, 2 Dec. 2013. Web. 24 July 2014.
