Apache Spark 101 - Demi Ben-Ari - Panorays

Apache Spark 101
Demi Ben-Ari - VP R&D @ Panorays

About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
! Google Developer Expert
! Co-Founder of Communities:
○ “Big Things” - Big Data, Data Science, DevOps
○ Google Developer Group Cloud
○ Ofek Alumni Association
In the Past:
! Sr. Data Engineer - Windward
! Team Leader & Sr. Java Software Engineer, 
Missile defence and Alert System - “Ofek” – IAF

Automate the Security Management of Third Parties
Capture the  
Hacker’s View
Get Realtime  
Ratings
Comply with  
Regulations

What is Big Data (IMHO)?
! Systems involving the “3 Vs”: 
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)

What strategies help manage Big Data?
! Distribute data across nodes
○ Replication
! Relax consistency requirements
! Relax schema requirements
! Optimize data to suit actual needs

Why Not Relational Data - NoSQL???
! Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
! But at very high cost
○ Big Data table joins - bilions of rows or more - require massive overhead
○ Sharding tables across systems is complex and fragile
! Modern applications have different priorities
○ Needs for speed and availability come over consistency
○ Commodity servers racks trump massive high-end systems
○ Real world need for transactional guarantees is limited

What is the NoSQL landscape?
! 4 broad classes of non-relational databases (http://db-engines.com/en/ranking)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide column Store (Column Family): keys mapped to sets of n-numbers of typed columns
! Three key factors to help understand the subject
○ Consistency: do you get identical results, regardless which node is queried?
○ Availability: can the cluster respond to very high read and write volumes?
○ Partition tolerance: is a cluster still available when part of it is down?

What is the CAP theorem?
! In distributed systems, consistency, availability and partition tolerance exist in
a manually dependant relationship, Pick any two.
Availability
Partition toleranceConsistency
MySQL, PostgreSQL, 
Greenplum, Vertica,  
Neo4J
Cassandra,
DynamoDB, Riak,
CouchDB, Voldemort
HBase, MongoDB, Redis, BigTable, BerkeleyDB
Graph 
Key-Value 
Wide Column
RDBMS

! “A system to move the computation, where the data is”
! Key Concepts of Hadoop
Flexibility
A single repo for storing and
analyzing any kind of data not
bounded by schema
Scalability
Scale-out architecture divides
workload across multiple
nodes using flexible distributed
file system
Low cost
Deployed on commodity
hardware & open source
platform
Fault Tolerant
Continue working even if
node(s) go
Hadoop Principals

Hadoop Core Components
! HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system to store data in smaller blocks in a fail safe manner
! MapReduce - Programming framework
○ Has the ability to take a query over a dataset, divide it and run in in parallel on multiple nodes
! Yarn - (Yet Another Resource Negotiator) MRv2
○ Splitting a MapReduce Job Tracker’s info
■ Resource Manager (Global)
■ Application Manager (Per application)

Map/Reduce model and locality of data

Hadoop Ecosystem
Hadoop Core
HDFS
MapReduce / 
YARN
Hadoop Common
Hadoop Applications
Hive Pig HBase Oozie Zookeeper Sqoop Spark

Hadoop (+Spark) Distributions
Elastic MapReduce DataProc

Summary - When to choose Big Data technologies?
! Large volumes of data to store and process
! Semi-Structured or Unstructured data
! Data is not well categorized
! Data contains a lot of redundancy
! Data arrives in streams or large batches
! Complex batch jobs arriving in parallel
! You don’t know how the data might be useful

What is spark?
! Apache Spark is a general-purpose, cluster
computing framework
! Spark does computation In Memory & on Disk
! Apache Spark has low level and high level APIs

Spark Philosophy
! Make life easy and productive for data scientists
! Well documented, expressive API’s
! Powerful domain specific libraries
! Easy integration with storage systems… and caching to avoid data movement
! Predictable releases, stable API’s
! Stable release each 3 months

Spark as Open Source
https://github.com/apache/spark/
https://www.openhub.net/p/apache-spark

About Spark Project
● Spark was founded at UC Berkeley and the main contributor is “Databricks”
● Interactive shell Spark in Scala and Python (spark-shell, pyspark)
● Currently stable in version 2.2.1 (01.12.2017)

Spark Petabyte Sort
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

So Bottom Line… 
What’s Spark???

United Tools Platform - Single Framework
Batch
InteractiveStreaming
Single Framework

Scala & Spark (Architecture)
Scala REPL Scala Compiler
Spark Runtime
Scala Runtime
JVM
File System  
(eg. HDFS,
Cassandra, S3..)
Cluster Manager  
(eg. Yarn, Mesos)

What kind of DSL is Apache Spark
! Centered around Collections
! Immutable data sets equipped with functional transformations
! These are exactly the Scala collection operations
map 
flatMap  
filter
...
reduce 
fold  
aggregate
...
union 
intersection
...

Spark Word Count example - Spark Shell

RDD - Resilient Distributed Dataset
! … Collection of elements partitioned across the nodes of the cluster that can
be operated on it in parallel…
○ http://spark.apache.org/docs/latest/programming-guide.html#overview
! RDD - Resilient Distributed Dataset
○ Collection similar to a List / Array (Abstraction)
○ It’s actually an Interface (Behind the scenes it’s distributed over the cluster)
! DAG - Directed Acyclic Graph
! Are Immutable!!!

RDD - Resilient Distributed Dataset
! Transformations are Lazy evaluated
○ map
○ filter
○ …..
! Actions - Triggers DAG computation
○ collect
○ count
○ reduce

Spark Mechanics
Driver
Spark Context
Worker WorkerWorker
Executor Executor Executor
Task Task
Task
Task Task Task Task Task Task

Wide and Narrow Transformations
! Narrow dependency: each partition of the parent RDD is used by at most
one partition of the child RDD. This means the task can be executed locally
and we don’t have to shuffle. (Eg: map, flatMap, Filter, sample)
! Wide dependency: multiple child partitions may depend on one partition of
the parent RDD. This means we have to shuffle data unless the parents are
hash-partitioned (Eg: sortByKey, reduceByKey, groupByKey, cogroupByKey,
join, cartesian)
! You can read a good blog post about it.

Basic Terms - Wide and Narrow Transformations
Narrow Dependencies: Wide (Shuffle) Dependencies:
filter, 
map...
union
groupByKey
join...

List of Transformations and Actions

Spark Packages
● http://spark-packages.org/
● All of the possible connectors (All of the open sourced
ones)
● If you want to post anything for users, It’s here

DataFrame
● Main programming abstraction in SparkSQL
● Distributed collection of data organized into named
columns
● Similar to a table in a relational database
● Has schema, rows and rich API
● http://spark.apache.org/docs/latest/sql-programming-guide.html

Discretized Streams
Spark Streaming
Spark
Data Input
Data Output
Batches of X seconds

Apache Zeppelin
● https://zeppelin.incubator.apache.org/
● A web-based notebook that enables interactive data analytics.  
You can make beautiful data-driven, interactive and collaborative documents
with SQL, Scala and more.

IPython Notebook Spark
! http://jupyter.org/
! http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-
apache-spark/
! Can connect the IPython notebook to a Spark cluster and run interactive
queries in Python.

Conclusion
! If you’ve got a choice, keep it simple and not Distributed.
! Spark is a great framework for distributed collections
◦ Fully functional API
◦ Can perform imperative actions
! With all of this compute power, comes a lot of operational overhead.
! Control your work and data distribution via partitions.
◦ (No more threads :) )

! LinkedIn
! Twitter: @demibenari
! Blog: http://
progexc.blogspot.com/
! demi.benari@gmail.com
! “Big Things” Community
Meetup, YouTube, Facebook, Twitter
! GDG Cloud

Apache Spark 101 - Demi Ben-Ari - Panorays

Apache Spark 101 - Demi Ben-Ari - Panorays

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Apache Spark 101 - Demi Ben-Ari - Panorays

Similar to Apache Spark 101 - Demi Ben-Ari - Panorays (20)

More from Demi Ben-Ari

More from Demi Ben-Ari (20)

Recently uploaded

Recently uploaded (20)

Apache Spark 101 - Demi Ben-Ari - Panorays