
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv


Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.



  1. 1. Analytics with Spark on EMR Jonathan Fritz Sr. Product Manager, AWS
  2. 2. Spark moves at interactive speed [diagram: a job DAG with map, filter, join, and groupBy transformations split into stages; boxes mark RDDs and cached partitions] • Massively parallel • Uses DAGs instead of map-reduce for execution • Minimizes I/O by storing data in RDDs in memory • Partitioning-aware to avoid network-intensive shuffle
  3. 3. Spark components to match your use case
  4. 4. Spark speaks your language
  5. 5. Use DataFrames to easily interact with data • Distributed collection of data organized in columns • An extension of the existing RDD API • Optimized for query execution
  6. 6. Easily create DataFrames from many formats [diagram: DataFrames built from RDDs and a variety of file formats; additional libraries for Spark SQL Data Sources are available]
  7. 7. Load data with the Spark SQL Data Sources API [code slide; additional libraries are available]
  8. 8. Sample DataFrame manipulations
  9. 9. Use DataFrames for machine learning • Spark ML libraries (replacing MLlib) use DataFrames as input/output for models • Create ML pipelines with a variety of distributed algorithms
  10. 10. Create DataFrames on streaming data • Access data in a Spark Streaming DStream • Create a SQLContext on the SparkContext used by the Spark Streaming application for ad hoc queries • Incorporate DataFrames in a Spark Streaming application • Checkpoint streaming jobs
  11. 11. Spark Pipeline
  12. 12. Use R to interact with DataFrames • SparkR package for using R to manipulate DataFrames • Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156) • Comparable performance to Python and Scala DataFrames
  13. 13. Spark SQL • Seamlessly mix SQL with Spark programs • Uniform data access • Hive compatibility – run Hive queries without modifications using HiveContext • Connect through JDBC/ODBC using the Spark ThriftServer (coming soon natively in EMR)
  14. 14. Spark architecture
  15. 15. • SparkContext runs as a library in your program, one instance per Spark app • Cluster managers: Standalone, Mesos, or YARN • Accesses storage via the Hadoop InputFormat API, and can use S3 with EMRFS, HBase, HDFS, and more [diagram: your application's SparkContext and local threads talk to a cluster manager, which runs Spark executors on workers over HDFS or other storage]
  16. 16. Amazon EMR runs Spark on YARN • Dynamically share and centrally configure the same pool of cluster resources across engines • Schedulers for categorizing, isolating, and prioritizing workloads • Choose the number of executors to use, or allow YARN to choose (dynamic allocation) • Kerberos authentication [diagram: YARN cluster resource management over S3/HDFS storage, running batch MapReduce and in-memory Spark applications — Pig, Hive, Cascading, Spark Streaming, Spark SQL]
  17. 17. RDDs (and now DataFrames) and Fault Tolerance RDDs track the transformations used to build them (their lineage) to recompute lost data. E.g.: messages = textFile(...).filter(lambda s: "ERROR" in s).map(lambda s: s.split('\t')[2]) — lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = contains(...)) → MappedRDD (func = split(…))
  18. 18. Caching RDDs can boost performance. Load error messages from a log into memory, then interactively search for patterns: lines = spark.textFile("hdfs://...") # base RDD; errors = lines.filter(lambda s: s.startswith("ERROR")) # transformed RDD; messages = errors.map(lambda s: s.split('\t')[2]); messages.cache(); messages.filter(lambda s: "foo" in s).count() # action; messages.filter(lambda s: "bar" in s).count() [diagram: the driver ships tasks to three workers, each caching its block of messages and returning results] Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
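The pipeline on this slide can be sketched in plain Python (no Spark required) to show what the filter and map steps compute; the sample log lines below are hypothetical stand-ins for the slide's HDFS file:

```python
# Plain-Python analog of the slide's RDD pipeline (hypothetical log lines).
log_lines = [
    "INFO\tstartup\tok",
    "ERROR\tdisk\tfull",
    "INFO\trequest\tserved",
    "ERROR\tnetwork\ttimeout",
]

# filter(lambda s: s.startswith("ERROR")) then map(lambda s: s.split('\t')[2])
errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split('\t')[2] for s in errors]
print(messages)  # the third tab-separated field of each ERROR line
```

In Spark the same two transformations are lazy and only run when an action such as count() is called, and messages.cache() keeps the result in executor memory for the follow-up queries.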
  19. 19. RDD Persistence • Caching or persisting a dataset in memory • Methods: cache(), persist() • Small RDD → MEMORY_ONLY • Big RDD → MEMORY_ONLY_SER (CPU intensive) • Don’t spill to disk • Use replicated storage for faster recovery
  20. 20. Inside Spark Executor on YARN Max container size on node: the YARN container controls the max sum of memory used by the container — yarn.nodemanager.resource.memory-mb (default: 116 GB; config file: yarn-site.xml)
  21. 21. Inside Spark Executor on YARN Max container size on node: the executor container is the executor space where the Spark executor runs
  22. 22. Inside Spark Executor on YARN Max container size on node: executor memory overhead — off-heap memory (VM overheads, interned strings, etc.): spark.yarn.executor.memoryOverhead = executorMemory * 0.10 (config file: spark-defaults.conf)
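The overhead formula can be checked with a little arithmetic. Note that Spark also applies a 384 MB floor to the default overhead (the slide shows only the 10% factor), and the 10 GB executor size below is purely an example:

```python
# spark.yarn.executor.memoryOverhead defaults to max(executorMemory * 0.10, 384 MB)
MEMORY_OVERHEAD_FACTOR = 0.10
MEMORY_OVERHEAD_MIN_MB = 384

def default_memory_overhead_mb(executor_memory_mb):
    """Default off-heap overhead YARN adds on top of spark.executor.memory."""
    return max(int(executor_memory_mb * MEMORY_OVERHEAD_FACTOR), MEMORY_OVERHEAD_MIN_MB)

# Example: a 10 GB executor needs an 11 GB YARN container.
executor_mb = 10 * 1024
overhead_mb = default_memory_overhead_mb(executor_mb)
container_mb = executor_mb + overhead_mb
print(overhead_mb, container_mb)  # 1024 11264
```

The sum of spark.executor.memory and the overhead must fit under yarn.nodemanager.resource.memory-mb, or YARN will not schedule the container.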
  23. 23. Inside Spark Executor on YARN Max container size on node: Spark executor memory — amount of memory to use per executor process: spark.executor.memory (config file: spark-defaults.conf)
  24. 24. Inside Spark Executor on YARN Max container size on node: shuffle memory fraction (pre-Spark 1.6) — spark.shuffle.memoryFraction, default: 0.2
  25. 25. Inside Spark Executor on YARN Max container size on node: storage memory fraction (pre-Spark 1.6) — spark.storage.memoryFraction, default: 0.6
  26. 26. Inside Spark Executor on YARN Max container size on node: in Spark 1.6+, Spark automatically balances the amount of memory used for execution and cached data (default fraction: 0.6)
  27. 27. Dynamic Allocation on YARN Scaling up on executors • Request when you want the job to complete faster • Idle resources on cluster • Exponential increase in executors over time • New default in EMR 4.4 (coming soon!)
  28. 28. Dynamic allocation setup (property → value): spark.dynamicAllocation.enabled → true; spark.shuffle.service.enabled → true; optional: spark.dynamicAllocation.minExecutors → 5; spark.dynamicAllocation.maxExecutors → 17; spark.dynamicAllocation.initialExecutors → 0; spark.dynamicAllocation.executorIdleTimeout → 60s; spark.dynamicAllocation.schedulerBacklogTimeout → 5s; spark.dynamicAllocation.sustainedSchedulerBacklogTimeout → 5s
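On EMR 4.x, these properties can be supplied at cluster creation as a configuration object. A sketch, assuming the spark-defaults classification; the executor counts mirror the table above and are illustrative, not recommendations:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.shuffle.service.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "5",
      "spark.dynamicAllocation.maxExecutors": "17"
    }
  }
]
```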
  29. 29. Compress your input data set • Always compress Data Files on Amazon S3 • Reduces storage cost • Reduces bandwidth between Amazon S3 and Amazon EMR, which can speed up bandwidth constrained jobs
  30. 30. Compressions Compression types: some are fast but offer less space reduction, some are space-efficient but slower, and some are splittable while others are not. Algorithm / % space remaining / encoding speed / decoding speed: GZIP — 13%, 21 MB/s, 118 MB/s; LZO — 20%, 135 MB/s, 410 MB/s; Snappy — 22%, 172 MB/s, 409 MB/s
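Using the "% space remaining" figures from the table above, the storage (and S3 bandwidth) saving is easy to estimate; the 1 TB input size here is hypothetical:

```python
# "% space remaining" after compression, from the slide's table.
space_remaining = {"GZIP": 0.13, "LZO": 0.20, "Snappy": 0.22}

input_gb = 1000  # hypothetical 1 TB raw dataset
compressed_gb = {algo: input_gb * frac for algo, frac in space_remaining.items()}
print(compressed_gb)  # GZIP stores the least; Snappy trades space for speed
```

GZIP wins on storage cost, but LZO and Snappy decode several times faster, which matters for CPU-bound jobs reading the data repeatedly.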
  31. 31. Data Serialization • Data is serialized when cached or shuffled (default: Java serializer) • Kryo serialization (up to 10x faster than Java serialization) • Does not support all Serializable types • Register classes in advance • Usage: set in SparkConf — conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
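To enable Kryo without touching application code, the serializer can also be set in spark-defaults.conf; the class name registered below is hypothetical:

```properties
# spark-defaults.conf — use Kryo instead of the default Java serializer
spark.serializer              org.apache.spark.serializer.KryoSerializer
# Optionally pre-register classes (comma-separated, fully qualified; example name is illustrative)
spark.kryo.classesToRegister  com.example.LogRecord
```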
  32. 32. Running Spark on Amazon EMR
  33. 33. Focus on deriving insights from your data instead of manually configuring clusters • Easy to install and configure Spark • Secured • Spark submit, Oozie, or the Zeppelin UI • Quickly add and remove capacity • Hourly, reserved, or EC2 Spot pricing • Use S3 to decouple compute and storage
  34. 34. Launch the latest Spark version Spark 1.6.0 is the current version on EMR, with a <3-week cadence behind the latest open source release
  35. 35. Create a fully configured cluster in minutes via the AWS Management Console, the AWS Command Line Interface (CLI), or an AWS SDK directly with the Amazon EMR API
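With the CLI, launching a Spark cluster is a single command. A sketch of the shape of the call; the cluster name, key pair, instance type, and count are example values you would replace:

```shell
aws emr create-cluster --name "Spark cluster" \
  --release-label emr-4.3.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=myKey \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles
```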
  36. 36. Or easily change your settings
  37. 37. Many storage layers to choose from with Amazon EMR • Amazon DynamoDB (EMR-DynamoDB connector) • Amazon RDS (JDBC Data Source w/ Spark SQL) • Amazon Kinesis (streaming data connectors) • Elasticsearch (Elasticsearch connector) • Amazon Redshift (Spark-Redshift connector) • Amazon S3 (EMR File System, EMRFS)
  38. 38. Decouple compute and storage by using S3 as your data layer S3 is designed for 11 9’s of durability and is massively scalable [diagram: multiple Amazon EMR clusters (HDFS and EC2 instance memory) reading from and writing to shared Amazon S3]
  39. 39. Easy to run your Spark workloads Submit a Spark application with the Amazon EMR Step API, or SSH to the master node and use Spark Submit, Oozie, or Zeppelin
  40. 40. Secured Spark clusters • Encryption at rest: HDFS transparent encryption (AES-256); local disk encryption for temporary files using LUKS; EMRFS support for Amazon S3 client-side and server-side encryption • Encryption in flight: secure communication with SSL from S3 to EC2 (cluster nodes); HDFS blocks encrypted in transit when using HDFS encryption; SASL encryption for Spark shuffle • Permissions: IAM roles, Kerberos, and IAM users • Access: VPC and security groups • Auditing: AWS CloudTrail
  41. 41. Customer use cases
  42. 42. Some of our customers running Spark on EMR
  43. 43. Integration Pattern – ETL with Spark [diagram: read unstructured data from Amazon S3 into HDFS on Amazon EMR, extract/load from HDFS, and write structured output data back to S3]
  44. 44. Integration Pattern – Tumbling Window Reporting [diagram: streaming input from Amazon Kinesis into Amazon EMR (HDFS), tumbling/fixed-window aggregation, periodic output to Amazon Redshift via COPY from EMR, or checkpoint to S3 and use the Lambda loader app]
  45. 45. Zeppelin demo
  46. 46. Jonathan Fritz Sr. Product Manager