Spark Workshop
Big Data Week - Málaga
Who are we?
Partners
Who are we?
Juan Pedro Moreno
Scala Software Engineer at 47Degrees
@juanpedromoreno
Fran Pérez
Scala Software Engineer at 47Degrees
@FPerezP
Workshop repo: https://github.com/47deg/spark-workshop
Roadmap
• Intro to Big Data and Spark
• Spark Architecture
• Resilient Distributed Datasets (RDDs)
• Transformations and Actions on Data using RDDs
• Overview Spark SQL and DataFrames
• Overview Spark Streaming
• Spark Architecture and Cluster Deployment
Apache Spark Overview
• Fast and general engine for large-scale data processing
• Speed
• Ease of Use
• Generality
• Runs Everywhere
https://github.com/apache/spark
http://spark.apache.org
Spark Architecture
[Stack diagram: language bindings (Scala, Java, Python, R) on top of the Spark libraries (Spark SQL, Spark Streaming, MLlib, GraphX), which build on the DataFrames API and RDD API provided by Spark Core; underneath, data sources such as Hadoop HDFS, Cassandra, JSON, MySQL, …]
Spark Core Concepts
[Cluster diagram: the Driver Program's SparkContext connects to a Cluster Manager (Hadoop YARN, Standalone, or Apache Mesos), which launches Executors on Worker Nodes; each Executor keeps a Cache and runs Tasks]
Spark Core Concepts
• SparkContext: Main entry point for Spark functionality. A SparkContext
represents the connection to a Spark cluster.
• Executor: A process launched for an application on a worker node.
Each application has its own executors.
• Jobs: A parallel computation consisting of one or multiple stages that
gets spawned in response to a Spark action.
• Stages: The smaller sets of tasks that each job is divided into.
• Tasks: A unit of work that will be sent to one executor.
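A minimal sketch of how these pieces come together in code (the app name and the local master URL are placeholders; on a real cluster the master would point at YARN, Mesos, or a standalone master):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder configuration: "local[4]" runs 4 worker threads in-process.
    val conf = new SparkConf()
      .setAppName("spark-workshop")
      .setMaster("local[4]")

    // The SparkContext is the connection to the cluster; each application owns one.
    val sc = new SparkContext(conf)

    // ... create and process RDDs here ...

    sc.stop()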
Resilient Distributed Datasets
• Immutable.
• Partitioned collection.
• Operates in parallel.
• Customizable.
RDDs - Partitions
• A partition is one of the chunks that an RDD is split into; each chunk is sent to a node
• The more partitions we have, the more parallelism we get
• Each partition is a candidate to be spread out to a different worker node (see the sketch after the example below)
[Example: an RDD of log records (Error/Warn/Info, ts, msg…) split across 4 partitions]
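A small sketch, assuming a SparkContext named sc, of choosing the number of partitions when an RDD is created:

    // Distribute a local collection of log records across 4 partitions.
    val logs = Seq("Error, ts, msg1", "Warn, ts, msg2", "Info, ts, msg8", "Error, ts, msg3")
    val rdd = sc.parallelize(logs, numSlices = 4)

    rdd.partitions.size      // 4: one task per partition when an action runs
    rdd.repartition(8)       // reshuffle into 8 partitions for more parallelism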
RDDs - Partitions
[Diagram: an RDD with 8 partitions (P1–P8) spread across 4 Worker Nodes, two partitions per Executor]
RDDs - Operations
Transformations
• Lazy operations: they don’t compute a value, they return a new RDD that describes the computation.
Actions
• Non-lazy operations: they apply an operation to an RDD and return a
value to the driver or write data to an external storage system.
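A minimal sketch of the difference, again assuming a SparkContext named sc:

    val numbers = sc.parallelize(1 to 10)

    // Transformation: nothing runs yet, we only get a description of a new RDD.
    val doubled = numbers.map(_ * 2)

    // Action: triggers the computation and returns a value to the driver.
    val total = doubled.reduce(_ + _)   // 110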
RDDs - Transformations
A set of some of the most popular Spark transformations:
• map
• flatMap
• filter
• groupByKey
• reduceByKey
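A short word-count-style sketch combining several of these, assuming a SparkContext named sc and a hypothetical input file logs.txt:

    val lines = sc.textFile("logs.txt")               // hypothetical input path

    val counts = lines
      .flatMap(line => line.split(" "))               // split each line into words
      .filter(word => word.nonEmpty)                  // drop empty tokens
      .map(word => (word, 1))                         // pair each word with a count of 1
      .reduceByKey(_ + _)                             // sum the counts per word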
RDDs - Actions
A set of some of the most popular Spark actions:
• reduce
• collect
• foreach
• saveAsTextFile
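Continuing the previous sketch, a few actions that actually run the computation (the output path is a placeholder):

    counts.foreach(println)                           // runs on the executors; output goes to their logs
    val local = counts.collect()                      // brings all results back to the driver as an Array
    val totalWords = counts.map(_._2).reduce(_ + _)   // total number of word occurrences
    counts.saveAsTextFile("word-counts-output")       // writes one part file per partition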
Transformations and Actions
Visual mnemonics make them easier to remember.
Thanks to Jeffrey Thompson:
• http://data-frack.blogspot.com.es/2015/01/visual-mnemonics-for-pyspark-api.html
• https://github.com/jkthompson/pyspark-pictures
• http://nbviewer.ipython.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb
Practice - Part 1 && Part 2
https://github.com/andypetrella/spark-notebook
http://spark-notebook.io
Overview Spark SQL and DataFrames
• Works with structured and semi-structured data
• DataFrames simplify working with structured data
• Read/write structured data such as JSON, Hive tables, Parquet, etc.
• SQL inside your Spark application
• Better performance and a more powerful operations API
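A small sketch of both styles (Spark 1.x API), assuming a SparkContext named sc and a hypothetical people.json file:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load semi-structured JSON into a DataFrame; the schema is inferred.
    val people = sqlContext.read.json("people.json")  // hypothetical input path

    // DataFrame API
    people.filter(people("age") > 21).select("name").show()

    // Plain SQL inside the application
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()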
Practice - Part 3
https://github.com/andypetrella/spark-notebook
http://spark-notebook.io
Overview Spark Streaming
• Streaming applications
• DStreams (Discretized Streams)
• A continuous series of RDDs, grouped into batches (a minimal sketch follows the diagram below)
[Diagram: input sources (Kafka, Flume, HDFS, Kinesis, Twitter) feed Receivers; Spark Streaming turns the stream into batches of input data that Spark Core processes and writes out to HDFS/S3, a database, or a dashboard]
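A minimal DStream sketch, assuming a SparkContext named sc and a socket text source (hypothetical host and port) as a stand-in for Kafka or Flume:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Group incoming data into 10-second batches; each batch becomes one RDD.
    val ssc = new StreamingContext(sc, Seconds(10))

    // Count words per batch from a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the streaming job is stopped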
Practice - Part 4
https://github.com/andypetrella/spark-notebook
http://spark-notebook.io
Resources
• Official docs - http://spark.apache.org/docs/latest
• Learning Spark - http://shop.oreilly.com/product/0636920028512.do
• Databricks Spark Knowledge Base - https://goo.gl/wMy7Se
• Community packages for Spark - http://spark-packages.org/
• Apache Spark YouTube channel - https://goo.gl/8d7tGu
• API through pictures - https://goo.gl/JMDeqJ
• 47 Degrees Blog - http://www.47deg.com/blog/tags/spark
• Spark Notebook - https://github.com/andypetrella/spark-notebook
Thanks!
47deg.com
Q&A
