Apache Spark 101
Demi Ben-Ari
About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College of Tel-Aviv-Yaffo
● Co-Founder @
○ “Big Things” Big Data Community
○ Google Developer Group (GDG) Cloud
In the Past:
● Sr. Data Engineer @ Windward
● Software Team Leader & Senior Java Software Engineer
● Missile defense and Alert System - “Ofek” – IAF
Introduction to Big Data
What are Big Data Systems?
● Applications involving the “3 Vs”
○ Volume - Gigabytes, Terabytes, Petabytes...
○ Velocity - Sensor Data, Logs, Data Streams, Financial Transactions, Geo Locations...
○ Variety - Different data format ingestion
● Some define it by the “7 Vs”
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value
● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure
What is Big Data (IMHO)?
● Systems involving the “3 Vs”:
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)
Why Not Relational Data?
● Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
● But at very high cost
○ Big Data table joins - billions of rows or more - require massive overhead
○ Sharding tables across systems is complex and fragile
● Modern applications have different priorities
○ The need for speed and availability outweighs consistency
○ Racks of commodity servers trump massive high-end systems
○ Real world need for transactional guarantees is limited
What strategies help manage Big Data?
● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
What is the NoSQL landscape?
● 4 broad classes of non-relational databases (http://db-engines.com/en/ranking)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide Column Store (Column Family): keys mapped to sets of variable numbers of typed columns
● Three key factors to help understand the subject
○ Consistency: do you get identical results, regardless which node is queried?
○ Availability: can the cluster respond to very high read and write volumes?
○ Partition tolerance: is a cluster still available when part of it is down?
What is the CAP theorem?
● In distributed systems, consistency, availability and partition tolerance
exist in a mutually dependent relationship. Pick any two.
[CAP triangle diagram]
○ Consistency + Availability (CA): MySQL, PostgreSQL, Greenplum, Vertica, Neo4J
○ Consistency + Partition tolerance (CP): HBase, MongoDB, Redis, BigTable, BerkeleyDB
○ Availability + Partition tolerance (AP): Cassandra, DynamoDB, Riak, CouchDB, Voldemort
(Database types represented: RDBMS, Graph, Key-Value, Wide Column)
Characteristics of Hadoop
● A system to process very large amounts of unstructured and complex
data at the desired speed
● A system to run on a large number of machines that don’t share any
memory or disk
● A system to run on a cluster of machines that can be put together at
relatively low cost and maintained easily
Hadoop Principle
● “Move the computation to where the data is”
● Key Concepts of Hadoop
○ Flexibility - a single repository for storing and analyzing any kind of data, not bounded by schema
○ Scalability - a scale-out architecture divides the workload across multiple nodes using a flexible distributed file system
○ Low cost - deployed on commodity hardware and an open source platform
○ Fault tolerance - keeps working even if node(s) go down
Hadoop Core Components
● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller blocks in a fail-safe manner
● MapReduce - Programming framework
○ Takes a query over a dataset, divides it, and runs it in parallel on multiple nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce JobTracker’s responsibilities into
■ ResourceManager (Global)
■ ApplicationMaster (Per application)
MapReduce via WordCount
Map/Reduce model and locality of data
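As a rough illustration of the model (before any Hadoop API), here is a minimal sketch of word count expressed with plain Scala collections; the input lines are invented, and no cluster is involved:

// A plain-Scala sketch of the Map/Reduce phases - not Hadoop API code.
val lines = List("to be or not to be", "to do is to be")

// Map phase: each line is processed independently (in Hadoop this runs
// on the node that already stores the data block - locality).
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle phase: intermediate (word, 1) pairs are grouped by key.
val shuffled = mapped.groupBy(_._1)

// Reduce phase: the values of each key are aggregated.
val reduced = shuffled.map { case (word, ones) => (word, ones.size) }
// reduced == Map(to -> 4, be -> 3, or -> 1, not -> 1, do -> 1, is -> 1)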
Hadoop Ecosystem
Hadoop Core
HDFS
MapReduce /
YARN
Hadoop Common
Hadoop Applications
Hive Pig HBase Oozie Zookeeper Sqoop Spark
Hadoop (+Spark) Distributions
Elastic MapReduce (Amazon EMR), Dataproc (Google Cloud)
Summary - When to choose Hadoop?
● Large volumes of data to store and process
● Semi-Structured or Unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful
Spark Introduction
What is Spark?
● Apache Spark is a general-purpose, cluster
computing framework
● Spark does computation in memory & on disk
● Apache Spark has low level and high level APIs
Spark Philosophy
● Make life easy and productive for data scientists
● Well-documented, expressive APIs
● Powerful domain specific libraries
● Easy integration with storage systems… and caching to avoid data
movement
● Predictable releases, stable APIs
● A stable release every 3 months
Spark Contributors
https://github.com/apache/spark/
https://www.openhub.net/p/apache-spark
About Spark Project
● Spark was founded at UC Berkeley and the
main contributor is “Databricks”.
● Interactive shell Spark in Scala and Python
(spark-shell, pyspark)
● Currently stable in version 2.1
Spark Petabyte Sort
Unified Tools Platform
Spark Languages
Spark Word Count example - Spark Shell
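A minimal version of the word count you could type into spark-shell might look like this; the input path is a placeholder, and `sc` is the SparkContext the shell provides:

// Hypothetical input path - any text file on HDFS or the local FS works.
val lines = sc.textFile("hdfs:///data/input.txt")
val counts = lines
  .flatMap(line => line.split(" ")) // split each line into words
  .map(word => (word, 1))           // pair each word with a count of 1
  .reduceByKey(_ + _)               // sum the counts per word (shuffle)
counts.take(10).foreach(println)    // action - triggers the computation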
RDD - Resilient Distributed Dataset
● “…collection of elements partitioned across the nodes of the cluster that
can be operated on in parallel…”
○ http://spark.apache.org/docs/latest/programming-guide.html#overview
● RDD - Resilient Distributed Dataset
○ Collection similar to a List / Array (Abstraction)
○ It’s actually an Interface (Behind the scenes it’s distributed over the cluster)
● DAG - Directed Acyclic Graph
● RDDs are immutable!
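A quick spark-shell sketch of the List/Array analogy (`sc` is the shell's SparkContext; the numbers are arbitrary):

val data = List(1, 2, 3, 4, 5)  // an ordinary local collection
val rdd = sc.parallelize(data)  // now partitioned across the cluster
rdd.map(_ * 2).collect()        // Array(2, 4, 6, 8, 10)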
RDD - Resilient Distributed Dataset
● Transformations are lazily evaluated
○ map
○ filter
○ …..
● Actions - trigger the DAG computation
○ collect
○ count
○ reduce
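A minimal sketch of the laziness just described (values are arbitrary):

val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0) // transformation - nothing runs yet
val doubled = evens.map(_ * 2)      // still nothing - only the DAG grows
val total = doubled.count()         // action - the whole DAG executes now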
What’s really an RDD???
Spark Mechanics
[Diagram: the Driver hosts the Spark Context and distributes tasks to Workers; each Worker runs an Executor, and each Executor runs multiple Tasks.]
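A sketch of how the driver side is wired up; the master URL and resource settings are placeholders that depend on your cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical cluster settings - adjust to your environment.
val conf = new SparkConf()
  .setAppName("spark-101")
  .setMaster("spark://master-host:7077") // where the driver connects
  .set("spark.executor.memory", "4g")    // memory per executor
  .set("spark.cores.max", "12")          // total cores across executors
val sc = new SparkContext(conf) // the Spark Context lives in the driver;
                                // executors on the workers run the tasks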
Wide and Narrow Transformations
● Narrow dependency: each partition of the parent RDD is used by at
most one partition of the child RDD. This means the task can be executed
locally and we don’t have to shuffle. (E.g.: map, flatMap, filter, sample)
● Wide dependency: multiple child partitions may depend on one partition
of the parent RDD. This means we have to shuffle data unless the
parents are hash-partitioned. (E.g.: sortByKey, reduceByKey, groupByKey,
cogroup, join, cartesian)
● You can read a good blog post about it.
Basic Terms - Wide and Narrow Transformations
Narrow Dependencies: filter, map, union...
Wide (Shuffle) Dependencies: groupByKey, join...
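You can see the narrow/wide distinction in an RDD's lineage; in this sketch (placeholder input path) the shuffle shows up as a stage boundary in `toDebugString`:

val pairs = sc.textFile("hdfs:///data/input.txt") // placeholder path
  .map(line => (line.length, line))    // narrow - no shuffle
  .filter { case (len, _) => len > 0 } // narrow - no shuffle
val grouped = pairs.groupByKey()       // wide - forces a shuffle
println(grouped.toDebugString) // the ShuffledRDD marks the stage boundary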
List of Transformations and Actions
Spark High Level APIs
Spark Packages
● http://spark-packages.org/
● All of the possible connectors (all of the open-sourced ones)
● If you want to publish something for users, this is the place
Alluxio (formerly Tachyon)
● Alluxio - In memory storage system
● GitHub
SparkSQL
DataFrame
● Main programming abstraction in SparkSQL
● Distributed collection of data organized into named
columns
● Similar to a table in a relational database
● Has schema, rows and rich API
● http://spark.apache.org/docs/latest/sql-programming-guide.html
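A minimal DataFrame sketch for Spark 2.x; the tiny dataset is invented for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-101").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
people.printSchema()              // the schema is carried along
people.filter($"age" > 30).show() // rich, column-based API
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show() // plain SQL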
Spark Streaming
Lambda Architecture Pattern
● Used for Lambda architecture
implementation
○ Batch Layer
○ Speed Layer
○ Serving Layer
[Diagram: Input flows into both the Batch Layer and the Speed Layer; both feed the Serving Layer, which the Data Consumer queries.]
Discretized Streams
[Diagram: Data input → Spark Streaming → batches of X seconds → Spark → Data output]
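A minimal DStream sketch; the host, port and 10-second batch interval are placeholders (`nc -lk 9999` can feed the socket for testing), and `sc` is an existing SparkContext:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10)) // batches of 10 seconds
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()         // runs once per micro-batch
ssc.start()            // start receiving and processing
ssc.awaitTermination()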
MLlib
GraphX
Notebooks
Apache Zeppelin
● https://zeppelin.incubator.apache.org/
● A web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents
with SQL, Scala and more.
Zeppelin Demo
IPython Notebook Spark
● http://jupyter.org/
● http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
● You can connect an IPython notebook to a Spark cluster and run interactive
queries in Python.
Conclusion
● If you’ve got a choice, keep it simple and not distributed.
● Spark is a great framework for distributed collections
◦ Fully functional API
◦ Can perform imperative actions
● With all of this compute power comes a lot of operational overhead.
● Control your work and data distribution via partitions (see the sketch below).
◦ (No more threads :) )
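A sketch of explicit partition control; the path is a placeholder and the numbers are illustrative:

val rdd = sc.textFile("hdfs:///data/input.txt") // placeholder path
println(rdd.getNumPartitions)     // the parallelism you got by default
val wider = rdd.repartition(200)  // full shuffle into 200 partitions
val narrower = wider.coalesce(50) // shrink without a full shuffle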
Questions?
● LinkedIn
● Twitter: @demibenari
● Blog:
http://progexc.blogspot.com/
● demi.benari@gmail.com
● “Big Things” Community
Meetup, YouTube, Facebook,
Twitter
● GDG Cloud