CHAPTER 01: INTRODUCTION TO DATA ANALYSIS WITH SPARK
Learning Spark
by Holden Karau et al.
Overview: Introduction to Data Analysis with Spark
 What Is Apache Spark?
 A Unified Stack
 Spark Core
 Spark SQL
 Spark Streaming
 MLlib
 GraphX
 Cluster Managers
 Who Uses Spark, and for What?
 Data Science Tasks
 Data Processing Applications
 A Brief History of Spark
 Spark Versions and Releases
 Storage Layers for Spark
1.1 What Is Apache Spark?
 Apache Spark is a cluster computing platform
 Spark extends the MapReduce model to support
 Different computations
 batch applications,
 iterative algorithms,
 interactive queries,
 and streaming
 Run computations in memory
 Highly Accessible
 simple APIs in Python, Java, Scala, and SQL
 rich built-in libraries; runs on Hadoop clusters and reads Hadoop data sources
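The MapReduce model that Spark extends can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on a single machine; it does not use Spark or Hadoop, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases into one batch job."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input standing in for lines of a distributed file.
lines = ["spark extends mapreduce", "spark runs in memory"]
counts = word_count(lines)
print(counts["spark"])  # prints 2
```

In a real cluster each phase runs in parallel across machines. Classic MapReduce writes intermediate results to disk between jobs, which is the cost that in-memory computation is meant to avoid for iterative and interactive workloads.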
edX and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
1.2 A Unified Stack
1.2.1 A Unified Stack: Core, SQL, Streaming
 Spark Core
 Task Scheduling
 Memory management
 Fault recovery
 Storage system interaction
 API that defines Resilient Distributed Datasets (RDDs)
 Spark SQL
 Provides a SQL interface to Spark
 Allows programmatic data manipulation to be mixed with SQL
 Spark Streaming
 Enables processing of live streams of data, e.g. web logs
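The RDD abstraction and the fault recovery listed under Spark Core can be pictured with a toy, single-machine class. This is a minimal sketch, not Spark's API: `ToyRDD` and its internals are hypothetical, and only the method names `map`, `filter`, and `collect` are borrowed from the RDD vocabulary. It records transformations as a lineage and runs them only when `collect` is called, so a lost result can be rebuilt by replaying the recorded steps.

```python
class ToyRDD:
    """A hypothetical, single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that re-reads the base data
        self._lineage = tuple(lineage)   # ordered (kind, function) steps

    def map(self, fn):
        """Record a map step; nothing is computed yet."""
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        """Record a filter step; nothing is computed yet."""
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        """Replay the recorded steps against the base data and return a list."""
        data = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


numbers = ToyRDD(lambda: range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
print(result)  # prints [4, 16]
```

Because each `ToyRDD` is immutable and carries its full lineage, calling `collect` a second time recomputes the same result from the source. That is the idea behind recovering lost data by recomputation rather than replication.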
1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers
 MLlib
 Contains common machine learning (ML) modules
 Classification, Regression, Clustering, Collaborative Filtering
 Model evaluation, Data Import, Lower-level ML primitives
 GraphX
 Extends the Spark RDD API, as Spark SQL and Spark Streaming do
 Contains graph algorithms
 Cluster Managers
 Hadoop YARN, Apache Mesos
 Default: Standalone scheduler
1.3 Who Uses Spark, and for What?
 General-purpose framework for cluster computing
 Data Scientists
 Engineers
 Data Scientists
 Analyze and Model data
 SQL, statistics, and predictive modeling (ML) using Python or R
 Use cases: interactive shells for Python and Scala, Spark SQL, and MLlib, with support for calling out to external MATLAB or R programs
 Engineers
 Data Processing Applications
 Principles of software engineering (encapsulation, OOP, interface design)
1.4 A Brief History of Spark
 2009: started as a research project in the UC Berkeley RAD Lab, which later became the AMPLab
 Motivation: Hadoop MapReduce was inefficient for iterative and interactive computing jobs
 Designed from the start to be fast for interactive queries and iterative algorithms
 In-memory storage
 Efficient fault recovery
 10-20x faster than MapReduce for certain jobs
 Early Adopters
 Spark PoweredBy page
 Spark Meetups
 Spark Summit
 2011
 Berkeley Data Analytics Stack (BDAS)
1.5 Spark Versions and Releases
 May 2014 Spark 1.0.0
 April 2015 Spark 1.3.1
 Spark Documentation
1.6 Storage Layers for Spark
 Spark can create distributed datasets from any file stored in
 HDFS
 Other storage systems supported by the Hadoop APIs
 Local filesystem
 Amazon S3
 Cassandra
 Hive
 HBase, etc.
 Supported formats
 Text files
 SequenceFiles
 Avro
 Parquet
 Any other Hadoop InputFormat
Learn More about Apache Spark
