Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni

Demi Ben-Ari
Ofek Alumni Meetup
24/12/2015

About me
Demi Ben-Ari
Senior Software Engineer at Windward Ltd.
B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
Co-Founder “Big Things” Big Data Community
In the Past:
Software Team Leader & Senior Java Software Engineer,
Missile defense and Alert System - “Ofek” – IAF
• Interested in almost every kind of technology – A True Geek

Agenda
 What is Spark?
 Spark Infrastructure and Basics
 Spark Features and Suite
◦ Spark-Shell Live Demo
◦ Cassandra & Spark
 Development with Spark
 Conclusion

What is Spark?
Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop
Efficient
 General execution graphs
 In-memory storage
Usable
 Rich APIs in Java, Scala, Python
 Interactive shell

What is Spark?
 Apache Spark is a general-purpose, cluster
computing framework
 Spark does computation In Memory & on
Disk
 Apache Spark has low level and high level
APIs

Spark Philosophy
 Make life easy and productive for data
scientists
 Well-documented, expressive APIs
 Powerful domain specific libraries
 Easy integration with storage systems
 … and caching to avoid data movement
 Predictable releases, stable APIs
 A stable release every 3 months

Spark Contributors
 Highly active open source community
(09/2015)
◦ https://github.com/apache/spark/
 https://www.openhub.net/p/apache-spark

About Spark project
 Spark originated at UC Berkeley; the main
contributor is Databricks.
 Interactive Spark shells in Scala and Python
◦ (spark-shell, pyspark)
 Current stable version: 1.6

Multi Language API Support
 Scala
 Java
 Python
 Clojure
 R

Unified Tools Platform
 Spark SQL
 GraphX
 MLlib (Machine Learning)
 Spark Streaming
 Data Frames
 SparkR
 Spark Core

Basic Terms
 Cluster
 Driver (Master)
 Executors (Slaves)
 Spark Context
 RDD – Resilient Distributed Dataset

Driver and Spark Context
 Spark Context is your “handle” to the Spark
cluster.
 The driver program contains the main
method.
 You use your Spark Context to access your
cluster.
◦ Configure the connection to the cluster
◦ It lets you create RDDs.
 The variable named sc (for the Spark
Context) is already defined in your Driver in
the Spark Shell.

What’s an RDD?
 Resilient Distributed Datasets
◦ Fault tolerant
◦ Parallel data structure
◦ Distributed on the nodes in the cluster
◦ Immutable!!!
◦ Can persist intermediate results in memory
◦ Transformations are operators and are lazily
evaluated

Resilient Distributed Datasets

RDD Persistence and partitioning
 Users control which RDDs are kept for
reuse (in memory and on disk)
◦ Persist, Cache, Unpersist
 Users can control how an RDD is
partitioned across machines
 Only the lost partitions of an RDD
need to be recomputed upon failure.
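
The recovery rule above can be illustrated with a small toy model. This is a plain-Python sketch, not Spark's API: the class `ToyPartitionedDataset` and its methods are hypothetical names invented for the illustration, and the "failure" is simulated by dropping one cached partition. The point it tries to show is that, given the source data and the recorded transformation (the lineage), only the missing partition has to be computed again.

```python
from typing import Callable, Dict, List


class ToyPartitionedDataset:
    """Toy stand-in for an RDD: source partitions plus one recorded transformation."""

    def __init__(self, source_partitions: List[List[int]], fn: Callable[[int], int]):
        self.source_partitions = source_partitions  # durable input, one list per partition
        self.fn = fn                                # lineage: how each element is derived
        self.cache: Dict[int, List[int]] = {}       # computed partitions held "in memory"
        self.compute_calls: List[int] = []          # log of partitions that were (re)computed

    def persist(self) -> None:
        """Compute and cache every partition."""
        for index in range(len(self.source_partitions)):
            self._compute(index)

    def lose_partition(self, index: int) -> None:
        """Simulate a failure that drops one cached partition."""
        self.cache.pop(index, None)

    def collect(self) -> List[int]:
        """Return all elements, recomputing only partitions missing from the cache."""
        result: List[int] = []
        for index in range(len(self.source_partitions)):
            if index not in self.cache:
                self._compute(index)
            result.extend(self.cache[index])
        return result

    def _compute(self, index: int) -> None:
        self.compute_calls.append(index)
        self.cache[index] = [self.fn(x) for x in self.source_partitions[index]]


dataset = ToyPartitionedDataset([[1, 2], [3, 4], [5, 6]], lambda x: x * 10)
dataset.persist()              # computes partitions 0, 1 and 2
dataset.lose_partition(1)      # simulated failure: partition 1 is gone
recovered = dataset.collect()  # recomputes partition 1 only
```

After the run, `dataset.compute_calls` records partition 1 twice and the others once, which is the behaviour the slide describes.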

Spark execution engine
 Spark uses lazy evaluation
◦ Runs the code only when it encounters an
action operation
 There is no need to design and write a
single complex map-reduce job.
◦ In Spark we can write smaller, more
manageable operations
◦ Spark will group operations together

Spark execution engine
 Serializes your code to the executors
◦ Can choose your serialization method
(Java serialization, Kryo)
 In Java - functions are specified as
objects that implement one of Spark’s
Function interfaces.
◦ Can use the same method of
implementation in Scala and Python as
well.

Persistence layers for Spark
 Distributed system
◦ Hadoop (HDFS)
◦ Local file system
◦ Amazon S3
◦ Cassandra
◦ Hive
◦ HBase
 File formats
◦ Text file
 CSV, TSV, Plain Text
◦ Sequence File
◦ Avro
◦ Parquet

Spark Core Features
 Distributed In memory Computation
 Stand alone and Local Capabilities
 History server for Spark UI
 Resource management Integration
 Unified job submission tool

History Server
 Can be run on all Spark deployments,
◦ Stand Alone, YARN, Mesos
 Integrates both with YARN and Mesos
 In Yarn / Mesos, run history server as
a daemon.

Job Submission Tool
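 ./bin/spark-submit \
--class my.main.Class \
--name myAppName \
--master local[4] \
<app-jar>
◦ For a cluster, use --master spark://some-cluster instead of local[4]
◦ Options go before <app-jar>; anything after it is passed to the application as arguments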

Spark Shell
 YouTube – Word Count Example
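
The demo video itself is not reproduced here. As a rough stand-in, the following plain-Python sketch follows the usual shape of a word count: split lines into words, pair each word with 1, then sum the values per word. In the Spark shell the same three steps are typically expressed with the RDD operations `flatMap`, `map` and `reduceByKey`; the helper function names and the sample input below are assumptions made for this illustration, and the code does not use Spark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def flat_map_words(lines: Iterable[str]) -> List[str]:
    """Step 1 (flatMap-like): split every line into lowercase words."""
    return [word for line in lines for word in line.lower().split()]


def map_to_pairs(words: Iterable[str]) -> List[Tuple[str, int]]:
    """Step 2 (map-like): emit a (word, 1) pair for each word."""
    return [(word, 1) for word in words]


def reduce_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Step 3 (reduceByKey-like): sum the counts for each distinct word."""
    totals: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical sample input standing in for a text file.
lines = ["to be or not to be", "to do is to be"]
counts = reduce_by_key(map_to_pairs(flat_map_words(lines)))
```

On a real cluster the reduce step is where data moves between machines, since all pairs for the same word must end up in the same place.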

Cassandra & Spark
 Cassandra cluster
◦ Bare metal vs. On the cloud
 DSE – DataStax Enterprise
◦ Cassandra & Spark in each node
 Vs
◦ Separate Cassandra and Spark clusters

Where do I start from?!
 Download Spark as a package
◦ Run it in “local” mode (no need for a real
cluster)
◦ “spark-ec2” scripts to spin up a Standalone
mode cluster
◦ Amazon Elastic MapReduce (EMR)
 YARN vs. Mesos vs. Standalone

Running Environments
 Development – Testing – Production
◦ Don’t you need more?
◦ Be as flexible as you can
 Cluster Utilization
◦ Unified Cluster for all environments
 Vs.
◦ Cluster per Environment
 (Cluster per Data Center)
 Configuration
◦ Local Files vs. Distributed

Saving and Maintaining the Data
 Local File System – Not effective in a distributed
environment
 HDFS
◦ Might be very Expensive
◦ Locality Rules – Spark + HDFS node + Same machine
 S3
◦ High latency and pretty slow but low costs
 Cassandra
◦ Rigid data model
◦ Very fast and depends on the Volume of the data can be

DevOps – Keep It Simple, Stupid
 Linux
◦ Bash scripts
◦ Crontab
 Automation via Jenkins
 Continuous Deployment – with every GIT push
[Diagram: environments Dev Testing, Live Staging, Production; deployment labels Daily, Manual, Automatic, Automatic]

Build Automation
 Maven
◦ Sonatype Nexus artifact management
 -
◦ Deploy and Script generation scripts
◦ Per Environment Testing
◦ Data Validation
◦ Scheduled Tasks

Testing
[Diagram: environments Dev Testing, Live Staging, Production]

Summary
[Diagram: Cluster with environments Dev Testing, Live Staging, Production; ELK]

Data Flow
[Diagram: External Data Sources → Analytics Layers → Data Output]

Conclusion
 Spark is a popular and very powerful
distributed in-memory computation
framework
 Broadly used and has lots of contributors
 A leading tool in the new world of petabytes
of unexplored data

Thanks,
Resources and Contact
 Demi Ben-Ari
◦ LinkedIn
◦ Twitter: @demibenari
◦ Blog: http://progexc.blogspot.com/
◦ Email: demi.benari@gmail.com
◦ “Big Things” Community
 Meetup, YouTube, Facebook, Twitter
