
Apache Spark on Hadoop YARN Resource Manager

How to configure Spark on an Apache Hadoop environment, and why you might want that over Spark's standalone cluster manager.

The slides also include a Docker-based demo so you can play with Hadoop and Spark on your own laptop. See the demo code and further documentation here - https://github.com/haridas/hadoop-env



  1. Apache Spark on YARN - Haridas N
  2. Agenda
     ● What's Apache Spark
     ● How Spark differs from the standard MapReduce framework
     ● Spark on YARN
     ● Scala / Python Spark APIs
       ○ RDD, DataFrame, DataSet
       ○ Serialisation techniques
     ● Workshop:
       ○ Set up Spark and configure it against our Hadoop cluster
       ○ Submit some jobs
  3. Images needed for this session
     docker pull haridasn/spark-2.4.0
     docker pull haridasn/hadoop-2.8.5
     docker pull haridasn/hadoop-cli
     Tutorial link: https://github.com/haridas/hadoop-env/blob/master/tutorials/spark-on-hadoop.adoc
  4. MapReduce framework
     ● Lots of disk operations.
     ● Well suited for batch processing over huge datasets.
     ● The APIs are fairly low level, which pushes more work onto the application side.
     ● IO overhead is high.
  5. Spark
     ● Does map-reduce in an optimal way by using RAM as the main storage.
     ● Optimised memory operations.
     ● DAG-based computation.
     ● Lazy evaluation and optimisation of the execution plan.
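
A minimal PySpark sketch of the lazy-evaluation point above: the transformations only build the DAG, and nothing executes until an action is called. The input path is hypothetical; any text file on HDFS would do.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
    sc = spark.sparkContext

    # Transformations only describe the DAG; nothing runs here yet.
    lines = sc.textFile("hdfs:///tmp/sample.txt")        # hypothetical input path
    words = lines.flatMap(lambda line: line.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    counts.cache()                # keep the result in RAM for reuse
    print(counts.take(10))        # first action triggers execution of the whole DAG
    print(counts.count())         # second action reads from the cached RDD
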
  6. Spark running modes
     ● Local mode
       ○ Both driver and executor run in a single JVM. Good for local testing.
     ● Client mode
       ○ Jobs are submitted to the Spark cluster, but the driver program runs on the client machine.
       ○ Easy way to watch the program status.
     ● Cluster mode
       ○ The driver program also runs on the cluster (on one worker).
       ○ The client that submitted the job doesn't need to stay around.
       ○ Production-grade jobs.
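
In practice the mode is normally picked at submit time (spark-submit --master yarn --deploy-mode client|cluster). A rough sketch of the equivalent session configuration, assuming a reachable YARN cluster; cluster mode for Python jobs is really selected through spark-submit, so the config below is illustrative only.

    from pyspark.sql import SparkSession

    # "local[*]" would run driver and executors in one JVM (local mode);
    # "yarn" hands resource management to the YARN ResourceManager.
    spark = (SparkSession.builder
             .appName("deploy-mode-demo")
             .master("yarn")
             .config("spark.submit.deployMode", "client")   # driver stays on the client machine
             .getOrCreate())

    print(spark.sparkContext.master)   # "yarn"
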
  7. Spark APIs
     ● RDD
     ● DataSet
     ● DataFrame
     ● Transformer
     ● Serializers
  8. RDD vs DataFrame vs DataSet
     ● RDD is the low-level representation of data; it uses Java serialisation by default.
     ● Immutable.
     ● DataFrame and DataSet are typed RDDs.
     ● Type information enables further low-level optimisation (network transfer, serialisation).
     ● Columnar-storage friendly.
     ● See the Apache Parquet and Apache Arrow projects.
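
PySpark exposes the RDD and DataFrame APIs (the typed Dataset API is Scala/Java only), so a rough sketch of the difference looks like this; the column names are made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
    sc = spark.sparkContext

    # RDD: opaque records, serialised row by row (pickle on the Python side).
    rdd = sc.parallelize([("alice", 34), ("bob", 29)])

    # DataFrame: the same data plus a schema, so Spark can optimise the plan
    # and use a compact columnar representation internally.
    df = rdd.toDF(["name", "age"])
    df.printSchema()
    df.filter(df.age > 30).show()

    # You can always drop back to the untyped view.
    print(df.rdd.take(2))
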
  9. Speed-up techniques
     ● Serialisation
     ● Better memory representation
     ● Columnar databases
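
One common serialisation knob is switching the JVM-side serialiser to Kryo; a minimal sketch, assuming the defaults are fine otherwise (in PySpark the Python objects themselves are still pickled, Kryo applies to shuffled JVM data):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kryo-demo")
             # Kryo is usually faster and more compact than Java serialisation.
             .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
             .config("spark.kryoserializer.buffer.max", "128m")
             .getOrCreate())

    print(spark.conf.get("spark.serializer"))
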
  10. HDFS vs object store
     ● Neither is a POSIX-compliant file system.
     ● HDFS is closer to POSIX, but some APIs aren't standard.
     ● Object stores (S3, Swift etc.) are purely key-value blob stores.
     ● Hadoop provides interfaces that mimic the HDFS API on top of blob stores.
     ● Spark and similar compute engines use this FileSystem abstraction to talk to cloud storage.
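
Because of that FileSystem abstraction, the same DataFrame read API can target either store; only the URI scheme changes. A sketch, assuming the hadoop-aws (s3a) connector is on the classpath; the bucket name, paths and credentials are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fs-abstraction-demo")
             # Placeholder credentials for the s3a connector.
             .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
             .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
             .getOrCreate())

    # The URI scheme selects the Hadoop FileSystem implementation;
    # the Spark API on top stays identical.
    hdfs_df = spark.read.text("hdfs:///data/logs/")          # HDFS
    s3_df = spark.read.text("s3a://my-bucket/data/logs/")    # object store via s3a
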
  11. Columnar storage
     ● How columnar storage changes big-data performance at scale.
     ● Columnar databases, file storage formats, columnar in-memory representation.
     ● The Parquet and Arrow projects.
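
A small sketch of both ideas with PySpark 2.4: Parquet for columnar files on disk, and Arrow for columnar transfer between the JVM and Python (the Arrow config key shown is the Spark 2.4 name; pandas and pyarrow must be installed). The paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("columnar-demo")
             # Arrow-based columnar data transfer between JVM and Python (Spark 2.4 key).
             .config("spark.sql.execution.arrow.enabled", "true")
             .getOrCreate())

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

    # Parquet stores data column-wise, so reads can skip columns they don't need.
    df.write.mode("overwrite").parquet("/tmp/people.parquet")
    names = spark.read.parquet("/tmp/people.parquet").select("name")
    names.show()

    # toPandas() moves data as Arrow record batches instead of pickled rows.
    pdf = names.toPandas()
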
  12. Write Spark jobs in Python
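
As a sketch of what such a script can look like, here is a minimal self-contained PySpark job; the file name and the input/output paths are hypothetical, and it would be submitted with something like spark-submit --master yarn --deploy-mode cluster word_lengths.py.

    # word_lengths.py -- hypothetical example job
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("word-lengths").getOrCreate()

        lines = spark.read.text("hdfs:///tmp/sample.txt")             # hypothetical input
        words = lines.select(F.explode(F.split("value", r"\s+")).alias("word"))
        stats = (words.groupBy(F.length("word").alias("length"))
                      .count()
                      .orderBy("length"))

        stats.write.mode("overwrite").csv("hdfs:///tmp/word_lengths")  # hypothetical output
        spark.stop()
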
  13. Real workload - quick demo
     ● Document pre-processing.
     ● Jobs are in Python, using the nltk and spacy NLP libraries.
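
The actual demo jobs live in the linked hadoop-env repo; as an illustrative sketch of the same idea, an nltk tokeniser can be wrapped in a UDF like below. This assumes nltk and its 'punkt' model are installed on every executor, and the sample documents are made up.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType
    import nltk

    spark = SparkSession.builder.appName("doc-preprocess-demo").getOrCreate()

    docs = spark.createDataFrame(
        [("d1", "Spark runs NLP jobs too."), ("d2", "Tokenise, lowercase, filter.")],
        ["doc_id", "text"],
    )

    @F.udf(returnType=ArrayType(StringType()))
    def tokenize(text):
        # Needs the nltk 'punkt' model available on every executor.
        return [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

    docs.withColumn("tokens", tokenize("text")).show(truncate=False)
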
  14. Workshop
  15. Thank you - Haridas N
