
Spark SQL | Apache Spark


In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives such as Hadoop, Spark, and Storm. Spark, however, is unique in providing batch as well as streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.


Spark SQL | Apache Spark

  1. Apache Spark | Spark SQL (view Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training)
  2. Objectives. At the end of this module, you will be able to understand:
     - Introduction to Spark
     - Spark architecture
     - What is an RDD
     - Demo on creating an RDD and running a sample example
     - Spark SQL
  3. What is Spark? Apache Spark is an open-source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.
     - Developed at UC Berkeley
     - Written in Scala, a functional programming language that runs in a JVM
     - Generalizes the MapReduce framework
  4. Why Spark?
     - Speed: run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
     - Ease of use: supports several languages for developing Spark applications
     - Generality: combines SQL, streaming, and complex analytics in one platform
     - Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud
  5. MapReduce Limitations
     - MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms (machine learning, etc.); see the sketch below
     - To run complicated jobs, you would have to string together a series of MapReduce jobs and execute them in sequence; each of those jobs is high-latency, and none can start until the previous job has finished completely
     - The job output data between each step has to be stored in the distributed file system before the next step can begin
     - Hadoop requires the integration of several tools for different Big Data use cases (like Mahout for machine learning and Storm for streaming data processing)
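For instance, an iterative job in Spark can load its working set once, cache it in memory, and reuse it on every pass, where chained MapReduce jobs would re-read it from disk each time. A minimal Scala sketch of the idea (the input path and threshold are hypothetical; `sc` is the SparkContext provided by the Spark shell):

```scala
// Load the working set once and keep it in memory across passes.
// The path below is a hypothetical example.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.toDouble)
  .cache()

// Multi-pass computation: every iteration reuses the cached RDD
// instead of launching a fresh disk-to-disk MapReduce job.
var threshold = 100.0
for (i <- 1 to 10) {
  val above = points.filter(_ > threshold).count()
  println(s"Pass $i: $above points above $threshold")
  threshold *= 0.9
}
```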
  6. Spark Features
     - Spark takes MapReduce to the next level with less expensive shuffles in data processing, plus capabilities like in-memory data storage
     - An advanced DAG execution engine that supports cyclic data flow and in-memory computing
     - Designed to be an execution engine that works both in-memory and on-disk
     - Lazy evaluation of Big Data queries, which helps optimize the overall data processing workflow (see the sketch below)
     - Concise and consistent APIs in Scala, Java, and Python
     - An interactive shell for Scala and Python (not yet available for Java)
     - High-level APIs for developing applications (Scala, Java, Python, Clojure, R)
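To make the lazy-evaluation point concrete: transformations only record lineage, and no computation happens until an action forces it. A minimal sketch in the Scala shell (the file name is hypothetical):

```scala
// Nothing executes here: map and filter are lazy transformations
// that only record the computation in the RDD's lineage.
val lines    = sc.textFile("input.txt")   // hypothetical file
val lengths  = lines.map(_.length)
val longOnes = lengths.filter(_ > 80)

// Only this action triggers the whole pipeline, letting Spark
// optimize the complete workflow before running it.
println(longOnes.count())
```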
  7. Spark Architecture: components of the Spark stack
     - Spark Core
     - Spark Streaming
     - Spark SQL
     - BlinkDB
     - MLlib
     - GraphX
     - SparkR
  8. Spark Architecture (continued): the same component stack runs on top of
     - Cluster management: native Spark cluster (standalone), YARN, Mesos
     - Distributed storage: HDFS, Cassandra, S3, HBase
  9. Spark Advantages
     - Ease of development: easier APIs; Python, Scala, Java
     - In-memory performance: RDDs; DAGs
     - Combine workflows: unified processing; Shark, MLlib, Streaming, GraphX
  10. Hadoop Advantages
     - Unlimited scale: multiple data sources, multiple applications, multiple users
     - Wide range of applications: files, databases, semi-structured data
     - Enterprise platform: reliability, multi-tenancy, security
  11. Spark + Hadoop: operational applications augmented by in-memory performance. Hadoop contributes unlimited scale, a wide range of applications, and an enterprise platform; Spark adds ease of development, combined workflows, and in-memory performance.
  12. Resilient Distributed Datasets. RDDs are the fundamental unit of data in Spark:
     - Resilient: if data in memory is lost, it can be recreated
     - Distributed: stored in memory across the cluster
     - Dataset: initial data can come from a file or be created programmatically
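Both creation paths named on the slide can be tried directly in the Spark shell; a minimal sketch (the HDFS path is hypothetical):

```scala
// Dataset from a file: one RDD element per line of input.
val fromFile = sc.textFile("hdfs:///user/edureka/input.txt")  // hypothetical path

// Dataset created programmatically from a local collection.
val fromCollection = sc.parallelize(1 to 1000)

// Both are RDDs, split into partitions spread across the cluster.
println(fromCollection.partitions.length)
```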
  13. Resilient Distributed Datasets (continued)
     - The core concept of the Spark framework
     - RDDs can store any type of data: primitive types (Integer, Character, Boolean, etc.) and files (text files, SequenceFiles, etc.)
     - RDDs are fault tolerant
     - RDDs are immutable
  14. RDDs support two types of operations (see the sketch below):
     - Transformation: transformations don't return a single value; they return a new RDD. Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
     - Action: an action evaluates the lineage and returns a value to the driver. Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
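A minimal Scala sketch showing both kinds of operations on a small, made-up dataset:

```scala
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations: each returns a new RDD; nothing runs yet.
val doubled = numbers.map(_ * 2)     // lineage: (2, 4, 6, 8, 10)
val big     = doubled.filter(_ > 4)  // lineage: (6, 8, 10)

// Actions: evaluate the lineage and return values to the driver.
println(big.count())                  // 3
println(big.reduce(_ + _))            // 24
println(big.collect().mkString(", ")) // 6, 8, 10
```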
  15. Spark SQL
     - Spark SQL allows relational queries through Spark
     - The backbone for all these operations is the SchemaRDD
     - SchemaRDDs are made of row objects along with metadata information
     - SchemaRDDs are equivalent to RDBMS tables
     - They can be constructed from existing RDDs, JSON datasets, Parquet files, or HiveQL queries against the data stored in Apache Hive
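A minimal sketch of building and querying a SchemaRDD with the Spark 1.x SQLContext API that these slides describe (the JSON file is hypothetical):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Construct a SchemaRDD from a JSON dataset (hypothetical file);
// the schema is inferred from the JSON records.
val people = sqlContext.jsonFile("people.json")

// Register it as a table and run a relational query against it.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```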
  16. Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Scala and Java. The Shark project is now completely closed: where we used Shark earlier, we now use Spark SQL.
     - Shark: development ending, transitioning to Spark SQL
     - Spark SQL: a new SQL engine designed from the ground up for Spark
     - Hive on Spark: helps existing Hive users migrate to Spark
  17. Efficient In-Memory Storage
     - Simply caching Hive records as Java objects is inefficient due to high per-object overhead
     - Instead, Spark SQL employs column-oriented storage using arrays of primitive types
     - Example with rows (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4):
       - Row storage keeps each record together: [1, john, 4.1], [2, mike, 3.5], [3, sally, 6.4]
       - Column storage keeps each column together: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
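A toy Scala sketch of the difference (illustrative only, not Spark SQL's actual internals):

```scala
// Row-oriented: one Java object per record, paying per-object overhead.
case class Record(id: Int, name: String, score: Double)
val rows = Array(
  Record(1, "john", 4.1),
  Record(2, "mike", 3.5),
  Record(3, "sally", 6.4)
)

// Column-oriented: one array per column; the numeric columns are
// plain primitive arrays with no per-record object overhead.
val ids    = Array(1, 2, 3)
val names  = Array("john", "mike", "sally")
val scores = Array(4.1, 3.5, 6.4)

// Scanning a single column touches only its array.
println(scores.sum / scores.length)
```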
  18. Demo on Spark RDDs
  19. Course Features
     - LIVE online class
     - Class recording in LMS
     - 24/7 post-class support
     - Module-wise quizzes
     - Project work
     - Verifiable certificate
  20. Questions
  21. Course Topics
     - Module 1 » Introduction to Scala
     - Module 2 » Scala Essentials
     - Module 3 » Traits and OOPs in Scala
     - Module 4 » Functional Programming in Scala
     - Module 5 » Introduction to Big Data and Spark
     - Module 6 » Spark Baby Steps
     - Module 7 » Playing with RDDs
     - Module 8 » Spark with SQL: When Spark Meets Hive
  22. www.edureka.co/apache-spark-scala-training
