Scaling Big Data with
Hadoop And Mesos
Bernardo Gomez Palacio
Software Engineer at Guavus Inc
Beyond Buzz Words
Mesos and Data Analysis
Yes, you don't need Hadoop to start using Mesos and
Spark.
Now, If You...
- Need to store large files? By default each HDFS block is 128 MB.
- Is data written mainly as new files or by appending to existing ones?
Convinced you want to jump into the
Hadoop bandwagon?
Read
Sammer, Eric. "Hadoop Operations." Sebastopol, CA:
O'Reilly, 2012. Print.
Welcome to the Jungle
Version Hell
Distributions
Apache Bigtop, CDH, HDP, MapR
Hadoop
HDFS
MRV1
MRV2
Assuming You Already Have Mesos
- Mesosphere packages: https://mesosphere.io/downloads/
- From source: https://github.com/apache/mesos
Hadoop MRV1 in Mesos
https://github.com/mesos/hadoop
Hadoop MRV1 in Mesos
- Requires Hadoop MRV1.
- Officially works with CDH5 MRV1.
- Apache Hadoop 0.22, 0.23 and 1+.
- Apache Hadoop 2+ doesn't come with MRV1!
Hadoop MRV1 in Mesos
- Requires a JobTracker.
- By default uses the org.apache.hadoop.mapred.JobQueueTaskScheduler.
- You can change it, e.g. to ...mapred.FairScheduler.
Hadoop MRV1 in Mesos
- Requires a TaskTracker.
- That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.
- And not org.apache.hadoop.mapred.TaskTracker.java.
How Hadoop MRV1 Runs In
Mesos?
How Does Hadoop MRV1 in Mesos Work?
1. The framework's Mesos scheduler creates the JobTracker as part of the driver.
2. The JobTracker uses org.apache.hadoop.mapred.MesosScheduler to launch tasks.
Mesos Hadoop Task Scheduling
- mapred.mesos.slot.cpus (1)
- mapred.mesos.slot.disk (1024 MB)
- mapred.mesos.slot.mem (1024 MB)
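To make the slot parameters concrete, here is a minimal sketch of how many slots of the default size would fit into one Mesos resource offer. It is a simplified model and not the hadoop-on-mesos implementation: the `Resources` and `SlotMath` names are hypothetical, and the sketch ignores whatever the TaskTracker itself needs on top of its slots.

```scala
// Simplified model of slot sizing; NOT the code used by the Mesos Hadoop framework.
// Resources and SlotMath are hypothetical names introduced for illustration.
final case class Resources(cpus: Double, memMb: Double, diskMb: Double)

object SlotMath {
  // Defaults shown on the slide for mapred.mesos.slot.{cpus,mem,disk}.
  val defaultSlot: Resources = Resources(cpus = 1.0, memMb = 1024.0, diskMb = 1024.0)

  // Whole slots that fit in an offer: the scarcest resource sets the limit.
  def slotsFor(offer: Resources, perSlot: Resources = defaultSlot): Int = {
    val byCpu  = offer.cpus / perSlot.cpus
    val byMem  = offer.memMb / perSlot.memMb
    val byDisk = offer.diskMb / perSlot.diskMb
    math.floor(Seq(byCpu, byMem, byDisk).min).toInt
  }
}
```

Under this model an offer of 8 CPUs but only 3072 MB of memory yields three slots, which is why the three slot settings are best tuned together.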
Additional Mesos parameters
- mapred.mesos.checkpoint (false)
- mapred.mesos.role (*)
Thoughts
What about Hadoop 2.4?
Namenode HA?
MRV2 and YARN?
Personal Preference
- Use Hadoop 2.4.0 or above.
- NameNode HA through the Quorum Journal Manager.
- Move to Spark if possible.
Example of a Mesos Data Analysis
Stack
1. HDFS stores files.
2. Use the Spark CLI to test ideas.
3. Use Spark Submit for jobs.
4. Use Chronos or Oozie to schedule workflows.
Spark On Mesos
Spark On Mesos
https://spark.apache.org/docs/latest/img/cluster-overview.png
Know that Each Spark Application
1. Has its own driving process.
2. Has its own RDDs.
3. Has its own cache.
Spark Schedulers on Mesos
Fine Grained
Coarse Grained
Spark Fine Grained Scheduling
- Enabled by default.
- Each Spark task runs as a separate Mesos task.
- Adds overhead when launching each task.
Spark Coarse Grained Scheduling
- Uses only one long-running Spark task on each Mesos slave.
- Dynamically schedules its own “mini-tasks”, using Akka.
- Lower startup overhead.
- Reserves the cluster resources for the complete duration of the application.
Beware of...
- Greedy scheduling (coarse grained)
- Overcommitting and deadlocks (fine grained)
Using Spark
Understand Parametrization and Usage
- spark.app.name
- spark.executor.memory
- spark.serializer
- spark.local.dir
- ...
Use Spark Submit
Avoid parametrizing the Spark Context in your code as
much as possible.
Leverage the spark-submit arguments, properties files
as well as environment variables to configure your
application.
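One reason to keep settings out of the code: as far as the Spark configuration documentation describes it, values set directly on the SparkConf take precedence over spark-submit flags, which in turn take precedence over the properties file. The sketch below models that ordering with plain maps; it does not call Spark, and `ConfPrecedence` and its example values are hypothetical.

```scala
// Plain-Scala model of configuration precedence; it does not use the Spark API.
// ConfPrecedence and the example values are hypothetical.
object ConfPrecedence {
  type Conf = Map[String, String]

  // Later maps override earlier ones:
  // properties file < spark-submit flags < values hard-coded in the application.
  def effective(propertiesFile: Conf, submitFlags: Conf, setInCode: Conf): Conf =
    propertiesFile ++ submitFlags ++ setInCode

  // Illustrative values only.
  val fromFile: Conf = Map(
    "spark.serializer"      -> "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.memory" -> "2g")
  val fromFlags: Conf = Map("spark.executor.memory" -> "4g")
  val hardCoded: Conf = Map("spark.executor.memory" -> "1g")
}
```

With nothing hard-coded, the operator's flag wins over the file. Once the application hard-codes spark.executor.memory, neither the flag nor the file can change it without a rebuild.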
Using Spark
Accept That Tuning is a
Science & an Art
Understand and Tune Your Applications
- Know your working set.
- Understand Spark partitioning and block management.
- Define your Spark workflow and where to cache/persist.
- If you cache you will serialize, so use Kryo.
Example Spark API PairRDDFunctions
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)]
PairRDDFunctions.combineByKey
- Combines the elements for each key using a custom set of aggregation functions.
- Turns an RDD[(K, V)] into an RDD[(K, C)].
PairRDDFunctions.combineByKey
- createCombiner: turns a V into a C.
- mergeValue: merges a V into a C.
- mergeCombiners: combines two C's into a single one.
The partitioner defaults to HashPartitioner.
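The roles of the three functions are easier to see in a small local model. The sketch below imitates the semantics of combineByKey on plain Scala collections, with a sequence of sequences standing in for the partitions of an RDD[(K, V)]. It is not Spark's implementation and does not use the Spark API; `LocalCombine` and `averages` are hypothetical names.

```scala
// Local model of combineByKey semantics on plain collections; NOT Spark's implementation.
object LocalCombine {
  // partitions stands in for an RDD[(K, V)] split into partitions.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: inside each partition, build one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v)))
      }
    }
    // Step 2: merge the per-partition combiners of each key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(prev => mergeCombiners(prev, c)).getOrElse(c))
      }
    }
  }

  // Per-key average: the combiner C is a (sum, count) pair.
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](partitions)(
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, n)) => k -> sum / n }
}
```

The average is a typical case where C differs from V: a single Double cannot be merged correctly across partitions, but a (sum, count) pair can.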
Example Spark API PairRDDFunctions
self: RDD[(K, V)]
def aggregateByKey[U: ClassTag](zeroValue: U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U
): RDD[(K, U)]
Uses the default partitioner.
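A similar local model helps with aggregateByKey. In the sketch below each key starts from zeroValue inside every partition, seqOp folds values in, and combOp merges the partial results across partitions. It runs on plain Scala collections and is not Spark's implementation; `LocalAggregate` and `distinctValues` are hypothetical names.

```scala
// Local model of aggregateByKey semantics on plain collections; NOT Spark's implementation.
object LocalAggregate {
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, each key starts from zeroValue and folds in its values.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        k -> kvs.map(_._2).foldLeft(zeroValue)(seqOp)
      }
    }
    // Across partitions, combine the partial results of each key.
    perPartition.flatten.groupBy(_._1).map { case (k, kus) =>
      k -> kus.map(_._2).reduce(combOp)
    }
  }

  // Distinct values per key: U is a Set[Int] while V is an Int.
  def distinctValues(partitions: Seq[Seq[(String, Int)]]): Map[String, Set[Int]] =
    aggregateByKey[String, Int, Set[Int]](partitions)(Set.empty[Int])(_ + _, _ ++ _)
}
```

Because the zero value is applied once per key in every partition where that key appears, it should be neutral for seqOp (an empty set, 0 for a sum), otherwise it would be counted several times.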
Understand your Data
Tune your Data
- Per data source, understand its optimal block size.
- Leverage Avro as the serialization format.
- Leverage Parquet as the storage format.
- Try to keep your Avro & Parquet schemas flat.
Suggestions
Each Application
- Instrument the code.
- Measure input size in number of records and in bytes.
- Measure output size in the same way.
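As a rough illustration of the measurements above, the sketch below counts records and UTF-8 bytes for a collection of text records. It is plain Scala with hypothetical names (`SizeStats`, `Measure`); in a real Spark job the same counters would more likely be collected per partition, for example with accumulators.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper for illustration: record count and byte size of text records.
final case class SizeStats(records: Long, bytes: Long)

object Measure {
  // Counts each element as one record and sums the UTF-8 encoded length.
  def sizeOf(lines: Iterable[String]): SizeStats =
    lines.foldLeft(SizeStats(0L, 0L)) { (stats, line) =>
      SizeStats(stats.records + 1, stats.bytes + line.getBytes(UTF_8).length)
    }
}
```

Applying the same function to a job's input and output gives comparable numbers for both sides.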
Standardize
- The JDK & JRE version across your cluster.
- The Spark version across your cluster.
- The libraries that will be added to the JVM classpath by default.
- A packaging strategy for your application, e.g. an uber jar.
About YARN and Spark
Some Differences with YARN
- Execution: cluster vs client modes.
- Isolation: process vs cgroups.
- Docker support? LXC templates?
- Deployment complexity?
Wrapping Up
Some Ideas..
References
1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014.
2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014.
References
1. Sammer, Eric. Hadoop Operations. Sebastopol, CA: O'Reilly, 2012. Print.
2. "Spark Configuration." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
3. "Tuning Spark." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
References
1. Ryza, Sandy. "Managing Multiple Resources in Hadoop 2 with YARN." Cloudera Developer Blog. Cloudera, 2 Dec. 2013. Web. 24 July 2014.
Thank you! ✌

6th International Conference on Machine Learning & Applications (CMLA 2024)
 
Fundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptxFundamentals of Induction Motor Drives.pptx
Fundamentals of Induction Motor Drives.pptx
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
digital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdfdigital fundamental by Thomas L.floydl.pdf
digital fundamental by Thomas L.floydl.pdf
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
Online aptitude test management system project report.pdf
Online aptitude test management system project report.pdfOnline aptitude test management system project report.pdf
Online aptitude test management system project report.pdf
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
PPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testingPPT on GRP pipes manufacturing and testing
PPT on GRP pipes manufacturing and testing
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Ethernet Routing and switching chapter 1.ppt
Ethernet Routing and switching chapter 1.pptEthernet Routing and switching chapter 1.ppt
Ethernet Routing and switching chapter 1.ppt
 

Scaling Big Data with Hadoop and Mesos

  • 1. Scaling Big Data with Hadoop And Mesos
  • 2. Bernardo Gomez Palacio Software Engineer at Guavus Inc
  • 4. Mesos and Data Analysis: Yes, you don't need Hadoop to start using Mesos and Spark.
  • 5. Now, If You...
      • Need to store large files? By default each HDFS block is 128MB.
      • Write data mainly as new files or by appending to existing ones?
  • 6. Convinced you want to jump on the Hadoop bandwagon? Read: Sammer, Eric. "Hadoop Operations." Sebastopol, CA: O'Reilly, 2012. Print.
  • 7. Welcome to the Jungle
  • 11. Assuming You Already Have Mesos
      • Mesosphere packages: https://mesosphere.io/downloads/
      • From source: https://github.com/apache/mesos
  • 12. Hadoop MRV1 in Mesos: https://github.com/mesos/hadoop
  • 13. Hadoop MRV1 in Mesos
      • Requires Hadoop MRV1.
      • Officially works with CDH5 MRV1.
      • Apache Hadoop 0.22, 0.23 and 1+.
      • Apache Hadoop 2+ doesn't come with MRV1!
  • 14. Hadoop MRV1 in Mesos
      • Requires a JobTracker.
      • By default uses the org.apache.hadoop.mapred.JobQueueTaskScheduler.
      • You can change it, e.g. to ...mapred.FairScheduler.
  • 15. Hadoop MRV1 in Mesos
      • Requires a TaskTracker.
      • That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.
      • And not org.apache.hadoop.mapred.TaskTracker.java.
  • 16. How Hadoop MRV1 Runs In Mesos?
  • 17. How Hadoop MRV1 in Mesos works?
      1. The framework's Mesos Scheduler creates the JobTracker as part of the driver.
      2. The JobTracker will use org.apache.hadoop.mapred.MesosScheduler to launch tasks.
  • 18. Mesos Hadoop Task Scheduling
      • mapred.mesos.slot.cpus (1)
      • mapred.mesos.slot.disk (1024MB)
      • mapred.mesos.slot.mem (1024MB)
  • 19. Additional Mesos parameters
      • mapred.mesos.checkpoint (false)
      • mapred.mesos.role (*)
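To make the slot parameters concrete, here is a minimal sketch of how many slots of the size given on the slide would fit into a single Mesos resource offer. It assumes the parenthesized values are the defaults and that a slot is limited by the scarcest of CPU, memory and disk. The `Resources` type and `slotsThatFit` function are hypothetical names for this illustration, not part of the hadoop-on-mesos code, and the sketch ignores any overhead the framework may reserve for the TaskTracker process itself.

```scala
object SlotSizing {
  // Hypothetical container for the resources in one Mesos offer (or one slot).
  final case class Resources(cpus: Double, memMb: Double, diskMb: Double)

  // Values from the slide: mapred.mesos.slot.cpus=1, slot.mem=1024MB, slot.disk=1024MB.
  val defaultSlot: Resources = Resources(cpus = 1.0, memMb = 1024.0, diskMb = 1024.0)

  // Number of whole slots that fit in an offer: bounded by the scarcest resource.
  def slotsThatFit(offer: Resources, slot: Resources = defaultSlot): Int = {
    require(slot.cpus > 0 && slot.memMb > 0 && slot.diskMb > 0, "slot sizes must be positive")
    val byCpu  = math.floor(offer.cpus / slot.cpus)
    val byMem  = math.floor(offer.memMb / slot.memMb)
    val byDisk = math.floor(offer.diskMb / slot.diskMb)
    math.max(0, List(byCpu, byMem, byDisk).min.toInt)
  }
}
```

For an offer of 4 CPUs, 3072MB of memory and 10240MB of disk, memory is the limiting resource, so `SlotSizing.slotsThatFit(SlotSizing.Resources(4.0, 3072.0, 10240.0))` yields 3.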
  • 20. Thoughts: What about Hadoop 2.4? NameNode HA? MRV2 and YARN?
  • 21. Personal Preference
      • Use Hadoop 2.4.0 or above.
      • NameNode HA through the Quorum Journal Manager.
      • Move to Spark if possible.
  • 22. Example of a Mesos Data Analysis Stack
      1. HDFS stores files.
      2. Use the Spark CLI to test ideas.
      3. Use Spark Submit for jobs.
      4. Use Chronos or Oozie to schedule workflows.
  • 25. Know that Each Spark Application
      1. Has its own driver process.
      2. Has its own RDDs.
      3. Has its own cache.
  • 26. Spark Schedulers on Mesos: Fine Grained and Coarse Grained
  • 27. Spark Fine Grained Scheduling
      • Enabled by default.
      • Each Spark task runs as a separate Mesos task.
      • Has an overhead in launching each task.
  • 28. Spark Coarse Grained Scheduling
      • Uses only one long-running Spark task on each Mesos slave.
      • Dynamically schedules its own “mini-tasks”, using Akka.
      • Lower startup overhead.
      • Reserves the cluster resources for the complete duration of the application.
  • 29. Beware of...
      • Greedy scheduling (Coarse Grained).
      • Overcommitting and deadlocks (Fine Grained).
  • 30. Using Spark: Understand Parametrization and Usage
      • spark.app.name
      • spark.executor.memory
      • spark.serializer
      • spark.local.dir
      • ....
  • 31. Use Spark Submit: Avoid parametrizing the Spark Context in your code as much as possible. Leverage the spark-submit arguments, properties files, and environment variables to configure your application.
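One way to read the advice on slide 31 is that application code should look settings up rather than hard-code them. The sketch below assumes the settings reach the JVM as a plain key/value map (for instance via `sys.props`); `ExternalConfig`, `sparkSettings` and `setting` are hypothetical helper names, not Spark APIs.

```scala
object ExternalConfig {
  // Keep only the spark.* keys from a generic property map.
  def sparkSettings(props: Map[String, String]): Map[String, String] =
    props.filter { case (key, _) => key.startsWith("spark.") }

  // Read one setting with a fallback, so the cluster-specific value stays outside the code.
  def setting(props: Map[String, String], key: String, fallback: String): String =
    props.getOrElse(key, fallback)
}
```

A caller might use `ExternalConfig.sparkSettings(sys.props.toMap)` to see which of the properties from slide 30 were supplied at launch time.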
  • 32. Using Spark: Accept That Tuning is a Science & an Art
  • 33. Understand and Tune Your Applications
      • Know your working set.
      • Understand Spark partitioning and block management.
      • Define your Spark workflow and where to cache/persist.
      • If you cache you will serialize; use Kryo.
  • 34. Example Spark API: PairRDDFunctions
      def combineByKey[C](
          createCombiner: V => C,
          mergeValue: (C, V) => C,
          mergeCombiners: (C, C) => C,
          numPartitions: Int): RDD[(K, C)]
  • 35. PairRDDFunctions.combineByKey
      • Combines the elements for each key using a custom set of aggregation functions.
      • Turns an RDD[(K, V)] into an RDD[(K, C)].
  • 36. PairRDDFunctions.combineByKey
      • createCombiner: turns a V into a C.
      • mergeValue: merges a V into a C.
      • mergeCombiners: combines two C's into a single one.
      • partitioner defaults to HashPartitioner.
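The roles of the three functions can be illustrated without a cluster. The following is a plain-collections model of the semantics described on slides 34 to 36, not Spark's implementation: each inner `Seq` stands in for one partition, and the object and value names are made up for this sketch.

```scala
object CombineByKeySketch {
  // Plain-collections model of combineByKey; each inner Seq stands in for one partition.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: within each partition, build one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v) // key already seen in this partition
          case None           => createCombiner(v)       // first value for this key here
        }
        acc.updated(k, combined)
      }
    }
    // Step 2: merge the per-partition combiners of each key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Usage: per-key (sum, count), from which an average can be derived.
  val data: Seq[Seq[(String, Int)]] =
    Seq(Seq("a" -> 1, "b" -> 2, "a" -> 3), Seq("a" -> 5, "b" -> 4))

  val sumCount: Map[String, (Int, Int)] = combineByKey(data)(
    (v: Int) => (v, 1),
    (c: (Int, Int), v: Int) => (c._1 + v, c._2 + 1),
    (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2))
}
```

Here `V` is `Int` and `C` is a `(sum, count)` pair, so `sumCount` holds `(9, 3)` for key "a" and `(6, 2)` for key "b".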
  • 37. Example Spark API: PairRDDFunctions (self: RDD[(K, V)])
      def aggregateByKey[U: ClassTag](zeroValue: U)(
          seqOp: (U, V) => U,
          combOp: (U, U) => U): RDD[(K, U)]
      Uses the default partitioner.
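As with combineByKey, the shape of aggregateByKey can be modelled on plain collections. This is only a sketch of the semantics implied by the signature on slide 37 (a zero value, a within-partition `seqOp`, a cross-partition `combOp`), with each inner `Seq` standing in for a partition; it is not Spark's implementation, and the names are made up for the example.

```scala
object AggregateByKeySketch {
  // Plain-collections model of aggregateByKey; each inner Seq stands in for one partition.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, every key starts from zeroValue and folds its values in with seqOp.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Across partitions, the partial results of each key are merged with combOp.
    perPartition.foldLeft(Map.empty[K, U]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, u)) =>
        a.updated(k, a.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  // Usage: collect the distinct values seen for each key.
  val data: Seq[Seq[(String, Int)]] =
    Seq(Seq("a" -> 1, "b" -> 2, "a" -> 1), Seq("a" -> 5, "b" -> 2))

  val distinctValues: Map[String, Set[Int]] =
    aggregateByKey(data)(Set.empty[Int])(
      (seen: Set[Int], v: Int) => seen + v,
      (x: Set[Int], y: Set[Int]) => x ++ y)
}
```

The result type `U` (`Set[Int]`) differs from the value type `V` (`Int`), which is the main difference from a plain reduce over values.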
  • 39. Tune your Data
      • Per data source, understand its optimal block size.
      • Leverage Avro as the serialization format.
      • Leverage Parquet as the storage format.
      • Try to keep your Avro & Parquet schemas flat.
  • 41. Each Application
      • Instrument the code.
      • Measure input size in number of records and byte size.
      • Measure output size in the same way.
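A minimal sketch of the measurement suggested on slide 41, assuming records are text lines and byte size means their UTF-8 encoded length. `IoMetrics`, `SizeStats` and `measure` are hypothetical names; in a real Spark job the same counts would more likely be gathered with accumulators than with a local iterator.

```scala
import java.nio.charset.StandardCharsets

object IoMetrics {
  // Record count and total byte size of a data set.
  final case class SizeStats(records: Long, bytes: Long)

  // Walk the records once, counting them and summing their UTF-8 encoded sizes.
  def measure(lines: Iterator[String]): SizeStats =
    lines.foldLeft(SizeStats(0L, 0L)) { (stats, line) =>
      SizeStats(stats.records + 1, stats.bytes + line.getBytes(StandardCharsets.UTF_8).length)
    }
}
```

Applying the same function to both the input and the output of a job gives the two measurements the slide asks for in comparable units.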
  • 42. Standardize
      • The JDK & JRE version across your cluster.
      • The Spark version across your cluster.
      • The libraries that will be added to the JVM classpath by default.
      • A packaging strategy for your application, e.g. an uber jar.
  • 43. About YARN and Spark
  • 44. Some Differences with YARN
      • Execution: cluster vs client modes.
      • Isolation: process vs cgroups.
      • Docker support? LXC templates?
      • Deployment complexity?
  • 47. References
      1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014.
      2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014.
  • 48. References
      1. Sammer, Eric. Hadoop Operations. Sebastopol, CA: O'Reilly, 2012. Print.
      2. "Spark Configuration." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
      3. "Tuning Spark." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.
  • 49. References
      1. Ryza, Sandy. "Managing Multiple Resources in Hadoop 2 with YARN." Cloudera Developer Blog. Cloudera, 2 Dec. 2013. Web. 24 July 2014.