Apache Spark 101
Demi Ben-Ari
About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College of Tel-Aviv-Yaffo
● Co-Founder @
○ “Big Things” Big Data Community
○ Google Developer Group (GDG) Cloud
In the Past:
● Sr. Data Engineer @ Windward
● Software Team Leader & Senior Java Software Engineer
● Missile defense and Alert System - “Ofek” – IAF
Introduction to Big Data
What are Big Data Systems?
● Applications involving the “3 Vs”
○ Volume - Gigabytes, Terabytes, Petabytes...
○ Velocity - Sensor Data, Logs, Data Streams, Financial Transactions, Geo Locations...
○ Variety - Different data format ingestion
● Some define it by the “7 Vs”
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value
● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure
What is Big Data (IMHO)?
● Systems involving the “3 Vs”:
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)
Why Not Relational Data?
● Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
● But at very high cost
○ Big Data table joins - billions of rows or more - require massive overhead
○ Sharding tables across systems is complex and fragile
● Modern applications have different priorities
○ The need for speed and availability outweighs consistency
○ Racks of commodity servers trump massive high-end systems
○ Real world need for transactional guarantees is limited
What strategies help manage Big Data?
● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
What is the NoSQL landscape?
● 4 broad classes of non-relational databases (http://db-engines.com/en/ranking)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide Column Store (Column Family): keys mapped to sets of variable numbers of typed columns
● Three key factors to help understand the subject
○ Consistency: do you get identical results, regardless which node is queried?
○ Availability: can the cluster respond to very high read and write volumes?
○ Partition tolerance: is a cluster still available when part of it is down?
What is the CAP theorem?
● In distributed systems, consistency, availability and partition tolerance
exist in a mutually dependent relationship. Pick any two.
[CAP triangle diagram]
○ Consistency + Availability (CA): MySQL, PostgreSQL, Greenplum, Vertica, Neo4J
○ Consistency + Partition tolerance (CP): HBase, MongoDB, Redis, BigTable, BerkeleyDB
○ Availability + Partition tolerance (AP): Cassandra, DynamoDB, Riak, CouchDB, Voldemort
(Database types represented: RDBMS, Graph, Key-Value, Wide Column)
Characteristics of Hadoop
● A system to process very large amounts of unstructured and complex
data at the desired speed
● A system to run on a large number of machines that don’t share any
memory or disk
● A system to run on a cluster of machines that can be put together at
relatively low cost and maintained easily
Hadoop Principle
● “Move the computation to where the data is”
● Key Concepts of Hadoop
○ Flexibility - a single repository for storing and analyzing any kind of data, not bounded by schema
○ Scalability - a scale-out architecture divides the workload across multiple nodes using a flexible distributed file system
○ Low cost - deployed on commodity hardware and an open source platform
○ Fault tolerance - keeps working even if node(s) go down
Hadoop Core Components
● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller blocks in a fail-safe manner
● MapReduce - Programming framework
○ Takes a query over a dataset, divides it, and runs it in parallel on multiple nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce JobTracker’s responsibilities into
■ ResourceManager (Global)
■ ApplicationMaster (Per application)
MapReduce via WordCount
Map/Reduce model and locality of data
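As a rough illustration of the model (before any Hadoop API), here is a minimal sketch of word count expressed with plain Scala collections; the input lines are invented, and no cluster is involved:

// A plain-Scala sketch of the Map/Reduce phases - not Hadoop API code.
val lines = List("to be or not to be", "to do is to be")

// Map phase: each line is processed independently (in Hadoop this runs
// on the node that already stores the data block - locality).
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle phase: intermediate (word, 1) pairs are grouped by key.
val shuffled = mapped.groupBy(_._1)

// Reduce phase: the values of each key are aggregated.
val reduced = shuffled.map { case (word, ones) => (word, ones.size) }
// reduced == Map(to -> 4, be -> 3, or -> 1, not -> 1, do -> 1, is -> 1)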
Hadoop Ecosystem
Hadoop Core
HDFS
MapReduce /
YARN
Hadoop Common
Hadoop Applications
Hive Pig HBase Oozie Zookeeper Sqoop Spark
Hadoop (+Spark) Distributions
Elastic MapReduce (Amazon EMR), Dataproc (Google Cloud)
Summary - When to choose Hadoop?
● Large volumes of data to store and process
● Semi-Structured or Unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful
Spark Introduction
What is Spark?
● Apache Spark is a general-purpose, cluster
computing framework
● Spark does computation in memory & on disk
● Apache Spark has low level and high level APIs
Spark Philosophy
● Make life easy and productive for data scientists
● Well-documented, expressive APIs
● Powerful domain specific libraries
● Easy integration with storage systems… and caching to avoid data
movement
● Predictable releases, stable APIs
● A stable release every 3 months
Spark Contributors
https://github.com/apache/spark/
https://www.openhub.net/p/apache-spark
About Spark Project
● Spark was founded at UC Berkeley and the
main contributor is “Databricks”.
● Interactive shell Spark in Scala and Python
(spark-shell, pyspark)
● Currently stable in version 2.1
Spark Petabyte Sort
Unified Tools Platform
Spark Languages
Spark Word Count example - Spark Shell
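A minimal version of the word count you could type into spark-shell might look like this; the input path is a placeholder, and `sc` is the SparkContext the shell provides:

// Hypothetical input path - any text file on HDFS or the local FS works.
val lines = sc.textFile("hdfs:///data/input.txt")
val counts = lines
  .flatMap(line => line.split(" ")) // split each line into words
  .map(word => (word, 1))           // pair each word with a count of 1
  .reduceByKey(_ + _)               // sum the counts per word (shuffle)
counts.take(10).foreach(println)    // action - triggers the computation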
RDD - Resilient Distributed Dataset
● “…collection of elements partitioned across the nodes of the cluster that
can be operated on in parallel…”
○ http://spark.apache.org/docs/latest/programming-guide.html#overview
● RDD - Resilient Distributed Dataset
○ Collection similar to a List / Array (Abstraction)
○ It’s actually an Interface (Behind the scenes it’s distributed over the cluster)
● DAG - Directed Acyclic Graph
● RDDs are immutable!
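A quick spark-shell sketch of the List/Array analogy (`sc` is the shell's SparkContext; the numbers are arbitrary):

val data = List(1, 2, 3, 4, 5)  // an ordinary local collection
val rdd = sc.parallelize(data)  // now partitioned across the cluster
rdd.map(_ * 2).collect()        // Array(2, 4, 6, 8, 10)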
RDD - Resilient Distributed Dataset
● Transformations are lazily evaluated
○ map
○ filter
○ …..
● Actions - trigger the DAG computation
○ collect
○ count
○ reduce
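A minimal sketch of the laziness just described (values are arbitrary):

val nums = sc.parallelize(1 to 1000000)
val evens = nums.filter(_ % 2 == 0) // transformation - nothing runs yet
val doubled = evens.map(_ * 2)      // still nothing - only the DAG grows
val total = doubled.count()         // action - the whole DAG executes now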
What’s really an RDD???
Spark Mechanics
[Diagram: the Driver hosts the Spark Context and distributes tasks to Workers; each Worker runs an Executor, and each Executor runs multiple Tasks.]
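A sketch of how the driver side is wired up; the master URL and resource settings are placeholders that depend on your cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical cluster settings - adjust to your environment.
val conf = new SparkConf()
  .setAppName("spark-101")
  .setMaster("spark://master-host:7077") // where the driver connects
  .set("spark.executor.memory", "4g")    // memory per executor
  .set("spark.cores.max", "12")          // total cores across executors
val sc = new SparkContext(conf) // the Spark Context lives in the driver;
                                // executors on the workers run the tasks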
Wide and Narrow Transformations
● Narrow dependency: each partition of the parent RDD is used by at
most one partition of the child RDD. This means the task can be executed
locally and we don’t have to shuffle. (E.g.: map, flatMap, filter, sample)
● Wide dependency: multiple child partitions may depend on one partition
of the parent RDD. This means we have to shuffle data unless the
parents are hash-partitioned. (E.g.: sortByKey, reduceByKey, groupByKey,
cogroup, join, cartesian)
● You can read a good blog post about it.
Basic Terms - Wide and Narrow Transformations
Narrow Dependencies: filter, map, union...
Wide (Shuffle) Dependencies: groupByKey, join...
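You can see the narrow/wide distinction in an RDD's lineage; in this sketch (placeholder input path) the shuffle shows up as a stage boundary in `toDebugString`:

val pairs = sc.textFile("hdfs:///data/input.txt") // placeholder path
  .map(line => (line.length, line))    // narrow - no shuffle
  .filter { case (len, _) => len > 0 } // narrow - no shuffle
val grouped = pairs.groupByKey()       // wide - forces a shuffle
println(grouped.toDebugString) // the ShuffledRDD marks the stage boundary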
List of Transformations and Actions
Spark High Level APIs
Spark Packages
● http://spark-packages.org/
● All of the possible connectors (all of the open-sourced ones)
● If you want to publish something for users, this is the place
Alluxio (formerly Tachyon)
● Alluxio - In memory storage system
● GitHub
SparkSQL
DataFrame
● Main programming abstraction in SparkSQL
● Distributed collection of data organized into named
columns
● Similar to a table in a relational database
● Has schema, rows and rich API
● http://spark.apache.org/docs/latest/sql-programming-guide.html
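A minimal DataFrame sketch for Spark 2.x; the tiny dataset is invented for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-101").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
people.printSchema()              // the schema is carried along
people.filter($"age" > 30).show() // rich, column-based API
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show() // plain SQL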
Spark Streaming
Lambda Architecture Pattern
● Used for Lambda architecture
implementation
○ Batch Layer
○ Speed Layer
○ Serving Layer
[Diagram: Input flows into both the Batch Layer and the Speed Layer; both feed the Serving Layer, which the Data Consumer queries.]
Discretized Streams
[Diagram: Data input → Spark Streaming → batches of X seconds → Spark → Data output]
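A minimal DStream sketch; the host, port and 10-second batch interval are placeholders (`nc -lk 9999` can feed the socket for testing), and `sc` is an existing SparkContext:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10)) // batches of 10 seconds
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()         // runs once per micro-batch
ssc.start()            // start receiving and processing
ssc.awaitTermination()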
MLlib
GraphX
Notebooks
Apache Zeppelin
● https://zeppelin.incubator.apache.org/
● A web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents
with SQL, Scala and more.
Zeppelin Demo
IPython Notebook Spark
● http://jupyter.org/
● http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
● You can connect an IPython notebook to a Spark cluster and run interactive
queries in Python.
Conclusion
● If you’ve got a choice, keep it simple and not distributed.
● Spark is a great framework for distributed collections
◦ Fully functional API
◦ Can perform imperative actions
● With all of this compute power comes a lot of operational overhead.
● Control your work and data distribution via partitions (see the sketch below).
◦ (No more threads :) )
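A sketch of explicit partition control; the path is a placeholder and the numbers are illustrative:

val rdd = sc.textFile("hdfs:///data/input.txt") // placeholder path
println(rdd.getNumPartitions)     // the parallelism you got by default
val wider = rdd.repartition(200)  // full shuffle into 200 partitions
val narrower = wider.coalesce(50) // shrink without a full shuffle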
Questions?
● LinkedIn
● Twitter: @demibenari
● Blog:
http://progexc.blogspot.com/
● demi.benari@gmail.com
● “Big Things” Community
Meetup, YouTube, Facebook,
Twitter
● GDG Cloud