Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ

Robert Hryniewicz


  1. Robert Hryniewicz, Data Evangelist (@RobHryniewicz): Hands-on Intro to Spark & Zeppelin Crash Course
  2. The "Big Data" Problem
     - Problem: a single machine cannot process or even store all the data!
     - Solution: distribute the data over large clusters
     - Difficulties: How do you split work across machines? Moving data over the network is expensive, so data and network locality must be considered. How do you deal with failures? How do you deal with slow nodes?
  3. Spark Background
  4. Access Rates: there is at least an order of magnitude difference between memory speed (fast) and hard drive / network speed (slow)
  5. What is Spark?
     - An Apache open source project, originally developed at AMPLab (University of California, Berkeley)
     - A data processing engine focused on in-memory distributed computing use cases
     - APIs in Scala, Python, Java and R
  6. Spark Ecosystem: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX
  7. Why Spark?
     - Elegant developer APIs: a single environment for data munging and machine learning (ML)
     - In-memory computation model: fast, and effective for iterative computations and ML
     - Machine learning: implementations of distributed ML algorithms and a Pipeline API (Spark ML)
  8. History of Hadoop & Spark
  9. Apache Spark Basics
  10. Spark Context: what is it?
     - The main entry point for Spark functionality
     - Represents a connection to a Spark cluster
     - Represented as sc in your code (see the sketch below)
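A minimal sketch of working with the Spark Context. In the spark-shell and in Zeppelin, sc is already created for you; the explicit SparkConf/SparkContext setup below is only needed in a standalone application, and the app name and master URL are illustrative placeholders.

      import org.apache.spark.{SparkConf, SparkContext}

      // Only needed in a standalone application; spark-shell and Zeppelin provide sc automatically
      val conf = new SparkConf()
        .setAppName("IntroToSpark")
        .setMaster("local[2]")          // run locally with 2 threads; on a cluster this could be e.g. "yarn-client"
      val sc = new SparkContext(conf)

      // sc is the entry point for Spark functionality
      val data = sc.parallelize(1 to 100)
      println(data.count())             // 100

      sc.stop()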
  11. RDD: Resilient Distributed Dataset
     - The primary abstraction in Spark: an immutable collection of objects (records, elements) that can be operated on in parallel
     - Distributed: a collection of elements partitioned across the nodes in a cluster; each RDD is composed of one or more partitions; the user can control the number of partitions; more partitions => more parallelism
     - Resilient: recovers from node failures; an RDD keeps its lineage information, so it can be recreated from its parent RDDs
     - Created by starting with a file in the Hadoop Distributed File System (HDFS) or an existing collection in the driver program
     - May be persisted in memory for efficient reuse across parallel operations (caching); see the sketch below
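A minimal sketch of creating and reusing RDDs, assuming the sc from the previous sketch; the HDFS path and the partition count are placeholders.

      // From an existing collection in the driver program, with an explicit number of partitions
      val numbers = sc.parallelize(1 to 1000000, numSlices = 4)
      println(numbers.partitions.length)   // 4 -- more partitions => more parallelism

      // From a file in HDFS (or any Hadoop-supported file system)
      val lines = sc.textFile("hdfs:///data/events.txt")

      // Persist in memory for efficient reuse across parallel operations
      val evens = numbers.filter(_ % 2 == 0).cache()
      println(evens.count())               // the first action computes and caches the RDD
      println(evens.sum())                 // later actions reuse the cached partitions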
  12. RDD: Resilient Distributed Dataset (diagram: two RDDs, one with four partitions and one with three, distributed across the cluster nodes)
  13. Spark SQL
  14. Spark SQL Overview
     - A Spark module for structured data processing (e.g. DB tables, JSON files)
     - Three ways to manipulate data: the DataFrames API, SQL queries, and the Datasets API
     - The same execution engine is used for all three
     - The Spark SQL interfaces provide more information about both the structure of the data and the computation being performed than the basic Spark RDD API
  15. DataFrames
     - Conceptually equivalent to a table in a relational DB or a data frame in R/Python
     - API available in Scala, Java, Python, and R
     - Richer optimizations (significantly faster than RDDs)
     - A distributed collection of data organized into named columns
     - Underneath is an RDD
  16. DataFrames: created from various sources. Data is described as a DataFrame with rows, columns and a schema (with an RDD underneath).
     - DataFrames from Hive: reading and writing Hive tables, including ORC
     - DataFrames from files: built-in support for JSON, JDBC, ORC, Parquet, HDFS; external plug-ins for CSV, HBase, Avro
     - DataFrames from existing RDDs: with the toDF() function (see the sketch below)
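A minimal sketch of creating DataFrames from a few of these sources, assuming the sc from the earlier sketches; the file paths, column names and sample rows are placeholders.

      import org.apache.spark.sql.SQLContext

      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._                    // enables rdd.toDF()

      // From files (built-in sources); the schema is inferred or read from the file
      val peopleJson     = sqlContext.read.json("hdfs:///data/people.json")
      val flightsParquet = sqlContext.read.parquet("hdfs:///data/flights.parquet")

      // From an existing RDD, with toDF()
      val peopleRdd = sc.parallelize(Seq(("A Turing", 54), ("H Smith", 48)))
      val peopleDf  = peopleRdd.toDF("name", "age")

      peopleDf.printSchema()
      peopleDf.show()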
  17. SQLContext and HiveContext
     - SQLContext: the entry point into all functionality in Spark SQL; all you need is a SparkContext: val sqlContext = new SQLContext(sc)
     - HiveContext: a superset of the functionality provided by the basic SQLContext; reads data from Hive tables and gives access to Hive functions (UDFs): val hc = new HiveContext(sc)
     - Use HiveContext when your data resides in Hive (see the sketch below)
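A short sketch of the two entry points side by side (Spark 1.x); the database and table names are placeholders, and the HiveContext part assumes a cluster with Hive configured.

      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.hive.HiveContext

      val sqlContext = new SQLContext(sc)      // basic entry point, only needs a SparkContext

      val hc = new HiveContext(sc)             // use when your data resides in Hive
      val orders = hc.sql("SELECT * FROM sales_db.orders LIMIT 10")   // reads a Hive table
      orders.show()

      // Hive functions (UDFs) are available through the HiveContext
      hc.sql("SELECT order_id, upper(customer_name) FROM sales_db.orders").show()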
  18. Spark SQL Examples
  19. DataFrame Example: reading data from a table

      val df = sqlContext.table("flightsTbl")
      df.select("Origin", "Dest", "DepDelay").show(5)

      +------+----+--------+
      |Origin|Dest|DepDelay|
      +------+----+--------+
      |   IAD| TPA|       8|
      |   IAD| TPA|      19|
      |   IND| BWI|       8|
      |   IND| BWI|      -4|
      |   IND| BWI|      34|
      +------+----+--------+
  20. DataFrame Example: using the DataFrame API to filter data (show delays of more than 15 min)

      df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

      +------+----+--------+
      |Origin|Dest|DepDelay|
      +------+----+--------+
      |   IAD| TPA|      19|
      |   IND| BWI|      34|
      |   IND| JAX|      25|
      |   IND| LAS|      67|
      |   IND| MCO|      94|
      +------+----+--------+
  21. SQL Example: using SQL to query and filter data (again, delays of more than 15 min)

      // Register Temporary Table
      df.registerTempTable("flights")

      // Use SQL to Query Dataset
      sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

      +------+----+--------+
      |Origin|Dest|DepDelay|
      +------+----+--------+
      |   IAD| TPA|      19|
      |   IND| BWI|      34|
      |   IND| JAX|      25|
      |   IND| LAS|      67|
      |   IND| MCO|      94|
      +------+----+--------+
  22. RDD vs. DataFrame
  23. RDDs vs. DataFrames
     - RDD: lower-level API (more control); lots of existing code and users; compile-time type safety
     - DataFrame: higher-level API (faster development); faster sorting, hashing, and serialization; more opportunities for automatic optimization; lower memory pressure
  24. DataFrames are Intuitive: find the average age by department (RDD example vs. the equivalent DataFrame example; see the sketch below)

      dept  name      age
      Bio   H Smith    48
      CS    A Turing   54
      Bio   B Jones    43
      Phys  E Witten   61
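The original slide showed the two code versions as images; the sketch below reconstructs the comparison under the assumption that the data matches the table above, reusing the sc and sqlContext.implicits._ from the earlier sketches.

      val data = Seq(
        ("Bio",  "H Smith",  48),
        ("CS",   "A Turing", 54),
        ("Bio",  "B Jones",  43),
        ("Phys", "E Witten", 61))

      // RDD version: pair each record by dept, carry (sum, count), then divide
      val avgByDeptRdd = sc.parallelize(data)
        .map { case (dept, _, age) => (dept, (age, 1)) }
        .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
        .mapValues { case (sum, count) => sum.toDouble / count }
      avgByDeptRdd.collect().foreach(println)

      // DataFrame version: shorter, and Catalyst can optimize it
      val people = sc.parallelize(data).toDF("dept", "name", "age")
      people.groupBy("dept").avg("age").show()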
  25. Spark SQL Optimizations
     - Spark SQL uses an underlying optimization engine (Catalyst); Catalyst can perform intelligent optimizations because it understands the schema
     - Spark SQL does not materialize all the columns (as an RDD does), only the ones that are needed
  26. Catalyst: the Spark SQL optimizer
     - A query or DataFrame operation is modeled as a tree
     - A logical plan is created and optimized
     - Various physical plans are created and the best one is chosen
     - Code generation and execution (the plans can be inspected with explain, as sketched below)
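A quick way to look at Catalyst's work, assuming the flights df and sqlContext from the earlier examples; explain(true) prints the logical plans and the chosen physical plan.

      val delayed = df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15)
      delayed.explain(true)    // parsed, analyzed and optimized logical plans, plus the physical plan

      // SQL queries go through the same optimizer
      sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15").explain(true)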
  27. Spark Streaming
  28. Spark Streaming Overview
     - An extension of the Spark Core API
     - Stream processing of live data streams
     - Scalable, high-throughput, fault-tolerant (see the sketch below)
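A minimal Spark Streaming word-count sketch; the host, port and batch interval are placeholders (the socket could be fed with nc -lk 9999, for example).

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(5))          // 5-second micro-batches

      val lines  = ssc.socketTextStream("localhost", 9999)
      val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      counts.print()

      ssc.start()                // start receiving and processing data
      ssc.awaitTermination()     // run until the stream is stopped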
  29. Spark Streaming (diagram)
  30. Spark Streaming: Window Operations
     - Apply transformations over a sliding window of data, e.g. a rolling average (see the sketch below)
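A hedged sketch of a rolling average over a sliding window, reusing the ssc and imports from the previous sketch and assuming a stream of "sensorId,value" lines; the 30-second window and 10-second slide are illustrative.

      val raw = ssc.socketTextStream("localhost", 9999)

      val readings = raw.map { line =>
        val Array(id, value) = line.split(",")
        (id, value.toDouble)
      }

      // For each sensor, average the readings from the last 30 seconds, recomputed every 10 seconds
      val rollingAvg = readings
        .groupByKeyAndWindow(Seconds(30), Seconds(10))
        .mapValues(values => values.sum / values.size)

      rollingAvg.print()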
  31. Apache Zeppelin & HDP Sandbox
  32. Apache Zeppelin: a modern web-based data science studio
     - Data exploration and discovery
     - Visualization
     - Deeply integrated with Spark and Hadoop
     - Pluggable interpreters
     - Multiple languages in one notebook: R, Python, Scala (see the sketch below)
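A sketch of how one Zeppelin note mixes languages through interpreter directives; each block below would be its own notebook paragraph, and the table name is a placeholder.

      // Paragraph 1 -- Scala, via the %spark interpreter
      // %spark
      val flightsDf = sqlContext.table("flightsTbl")
      flightsDf.registerTempTable("flights")

      // Paragraph 2 -- SQL, via the %sql interpreter; the result renders as a table or built-in chart
      // %sql
      // SELECT Origin, avg(DepDelay) AS avgDelay FROM flights GROUP BY Origin

      // Paragraph 3 -- Python, via the %pyspark interpreter
      // %pyspark
      // sqlContext.table("flights").count()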
  33. - 35. (screenshot slides, no text content)
  36. What's not included with Spark? (diagram) Spark itself provides the core engine, the Scala/Java/Python APIs, and libraries such as MLlib (machine learning), Spark SQL* and Spark Streaming*, but not resource management or storage.
  37. HDP Sandbox: what's included in the Sandbox?
     - Zeppelin
     - The latest Hortonworks Data Platform (HDP), including Spark, YARN (resource management), HDFS (distributed storage layer), and many more components
  38. Access patterns enabled by YARN (the "Data Operating System" on top of HDFS)
     - Batch: needs to happen, but with no timeframe limitations
     - Interactive: needs to happen at human time
     - Real-time: needs to happen at machine execution time
  39. Why Spark on YARN?
     - Utilize existing HDP cluster infrastructure
     - Resource management: share Spark workloads with other workloads like Pig, Hive, etc.
     - Scheduling and queues
     - (diagram: the Spark driver runs in the client, while the Spark Application Master and the Spark executors each run in YARN containers that execute the tasks)
  40. Why HDFS? Fault-tolerant distributed storage
     - Divides files into big blocks and distributes 3 copies randomly across the cluster
     - Data locality for processing
     - Not just storage, but computation
     - (diagram: a logical file split into 4 blocks, each replicated 3 times across the cluster)
  41. There's more to HDP (Hortonworks Data Platform 2.4.x)
     - Data management: HDFS (Hadoop Distributed File System) with YARN as the Data Operating System
     - Data access: batch (MapReduce), script (Pig), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), stream (Storm), search (Solr), in-memory and other ISV engines, with Tez and Slider underneath
     - Governance & integration: data lifecycle & governance (Falcon, Atlas); data workflow (Sqoop, Flume, Kafka, NFS, WebHDFS)
     - Security: administration, authentication, authorization, auditing, data protection (Ranger, Knox, Atlas, HDFS encryption)
     - Operations: provisioning, managing & monitoring (Ambari, Cloudbreak, Zookeeper); scheduling (Oozie)
     - Deployment choice: Linux, Windows, on-premise, cloud
  42. HDP 2.5 TP
  43. - 44. (screenshot slides, no text content)
  45. View User Sessions
  46. Hortonworks Community Connection
  47. Hortonworks Community Connection: read access for everyone; join to participate and be recognized
     - A full Q&A platform (like StackOverflow)
     - Knowledge base articles
     - Code samples and repositories
  48. Community Engagement: participate now at community.hortonworks.com
     - 7,500+ registered users
     - 15,000+ answers
     - 20,000+ technical assets
     - One website!
  49. Lab Preview
  50. Link to Tutorial with Lab Instructions: http://tinyurl.com/hwx-intro-to-spark
  51. Robert Hryniewicz, @RobHryniewicz. Thanks!
