Demi Ben Ari - Apache Spark 101 - First Steps into distributed computing:
The world has changed: one huge server won't do the job any more, and the ability to scale out will be your savior. Apache Spark is a fast, general engine for big data processing, with streaming, SQL, machine learning and graph processing. This talk shows the basics of Apache Spark and distributed computing.
Demi is a Software engineer, Entrepreneur and an International Tech Speaker.
Demi has over 10 years of experience building various systems, from near-real-time applications to Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community and Google Developer Group Cloud.
Big Data Expert, but interested in all kinds of technologies, from front-end to backend, whatever moves data around.
2. About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
! Google Developer Expert
! Co-Founder of Communities:
○ “Big Things” - Big Data, Data Science, DevOps
○ Google Developer Group Cloud
○ Ofek Alumni Association
In the Past:
! Sr. Data Engineer - Windward
! Team Leader & Sr. Java Software Engineer,
Missile defence and Alert System - “Ofek” – IAF
5. What is Big Data (IMHO)?
! Systems involving the “3 Vs”:
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (How diverse?)
6. What strategies help manage Big Data?
! Distribute data across nodes
○ Replication
! Relax consistency requirements
! Relax schema requirements
! Optimize data to suit actual needs
7. Why Not Relational Data - NoSQL???
! Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
! But at very high cost
○ Big Data table joins - billions of rows or more - require massive overhead
○ Sharding tables across systems is complex and fragile
! Modern applications have different priorities
○ The need for speed and availability comes before consistency
○ Racks of commodity servers trump massive high-end systems
○ The real-world need for transactional guarantees is limited
8. What is the NoSQL landscape?
! 4 broad classes of non-relational databases (http://db-engines.com/en/ranking)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide Column Store (Column Family): keys mapped to sets of any number of typed columns
! Three key factors to help understand the subject
○ Consistency: do you get identical results, regardless which node is queried?
○ Availability: can the cluster respond to very high read and write volumes?
○ Partition tolerance: is a cluster still available when part of it is down?
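The Key-Value and Document models above can be sketched in a few lines of plain Python (a toy illustration, not any particular database's API; the keys and documents are made up):

```python
import json

# Key-Value model: keys map to arbitrary, opaque values of any type.
# The store can only look things up by exact key.
kv_store = {
    "user:42:name": "Demi",
    "user:42:visits": 17,
}

# Document model: whole (JSON) documents, queryable in whole or in part.
doc = json.loads('{"user": {"name": "Demi", "tags": ["spark", "bigdata"]}}')

visits = kv_store["user:42:visits"]   # lookup by key only
first_tag = doc["user"]["tags"][0]    # query inside the document
print(visits, first_tag)
```

The difference in query power is the point: a key-value store answers "what is the value for this key?", while a document store can reach into the structure of each record.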
9. What is the CAP theorem?
! In distributed systems, consistency, availability and partition tolerance exist in
a mutually dependent relationship: pick any two.
! CAP triangle (diagram): each corner of the triangle pairs two of the guarantees
○ Consistency + Availability - RDBMS (MySQL, PostgreSQL, Greenplum, Vertica), Graph (Neo4J)
○ Availability + Partition tolerance - Key-Value (Cassandra, DynamoDB, Riak, CouchDB, Voldemort)
○ Consistency + Partition tolerance - Wide Column (HBase, MongoDB, Redis, BigTable, BerkeleyDB)
10. Hadoop Principles
! “A system to move the computation to where the data is”
! Key concepts of Hadoop:
○ Flexibility - a single repository for storing and analyzing any kind of data, not bounded by schema
○ Scalability - a scale-out architecture divides the workload across multiple nodes using a flexible distributed file system
○ Low cost - deployed on commodity hardware and an open source platform
○ Fault tolerance - keeps working even if node(s) go down
11. Hadoop Core Components
! HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system to store data in smaller blocks in a fail safe manner
! MapReduce - Programming framework
○ Has the ability to take a query over a dataset, divide it, and run it in parallel on multiple nodes
! YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce JobTracker’s responsibilities into:
■ Resource Manager (global)
■ Application Master (per application)
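The MapReduce model the slide describes can be sketched as a toy word count in plain Python (illustrative only, not the Hadoop API; on a real cluster the map and reduce phases run in parallel on different nodes):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: each input split is turned into (word, 1) pairs, locally.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: pairs are grouped by key across the cluster.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each group of values is folded into one result per key.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big spark", "spark streaming"]
pairs = chain.from_iterable(map_phase(l) for l in lines)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 1, 'spark': 2, 'streaming': 1}
```

The framework's job (and YARN's, in MRv2) is to schedule these phases across nodes and re-run failed tasks; the user only writes the map and reduce functions.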
17. Summary - When to choose Big Data technologies?
! Large volumes of data to store and process
! Semi-Structured or Unstructured data
! Data is not well categorized
! Data contains a lot of redundancy
! Data arrives in streams or large batches
! Complex batch jobs arriving in parallel
! You don’t know how the data might be useful
19. What is Spark?
! Apache Spark is a general-purpose cluster computing framework
! Spark does computation In Memory & on Disk
! Apache Spark has low level and high level APIs
20. Spark Philosophy
! Make life easy and productive for data scientists
! Well documented, expressive APIs
! Powerful domain specific libraries
! Easy integration with storage systems… and caching to avoid data movement
! Predictable releases, stable APIs
! Stable release every 3 months
22. Spark as Open Source
https://github.com/apache/spark/
https://www.openhub.net/p/apache-spark
23. About Spark Project
● Spark was founded at UC Berkeley and the main contributor is “Databricks”
● Interactive shell Spark in Scala and Python (spark-shell, pyspark)
● Currently stable in version 2.2.1 (01.12.2017)
30. What kind of DSL is Apache Spark?
! Centered around Collections
! Immutable data sets equipped with functional transformations
! These are exactly the Scala collection operations
○ Transformations: map, flatMap, filter, ...
○ Aggregations: reduce, fold, aggregate, ...
○ Set operations: union, intersection, ...
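Since these are exactly the functional collection operations, they can be demonstrated on a plain Python list (Python stand-ins for the Scala/Spark names; no Spark involved):

```python
from functools import reduce

nums = [1, 2, 3, 4]

# map / flatMap / filter - functional transformations
squares = list(map(lambda x: x * x, nums))        # map: [1, 4, 9, 16]
flat = [y for x in nums for y in (x, -x)]          # flatMap: each element expands to several
evens = list(filter(lambda x: x % 2 == 0, nums))   # filter: [2, 4]

# reduce / fold - aggregations
total = reduce(lambda a, b: a + b, nums)           # reduce: 10
folded = reduce(lambda a, b: a + b, nums, 100)     # fold with an initial value: 110

# union / intersection - set operations
union = set(nums) | {3, 4, 5}                      # {1, 2, 3, 4, 5}
inter = set(nums) & {3, 4, 5}                      # {3, 4}
```

In Spark the same operations apply, only the collection is partitioned across a cluster instead of living in one process.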
32. RDD - Resilient Distributed Dataset
! “… collection of elements partitioned across the nodes of the cluster that can
be operated on in parallel …”
○ http://spark.apache.org/docs/latest/programming-guide.html#overview
! RDD - Resilient Distributed Dataset
○ Collection similar to a List / Array (Abstraction)
○ It’s actually an Interface (Behind the scenes it’s distributed over the cluster)
! DAG - Directed Acyclic Graph
! Are Immutable!!!
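A toy model of why an RDD is really an interface: each transformation just records a new immutable node in a lineage graph (the DAG), and nothing is computed until an action runs. This is a sketch of the idea only, not the Spark API - the class and method names are made up:

```python
# Toy model of lazy RDD lineage (illustrative, not Spark code).
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, f):
        # Records the step as a new immutable node; runs nothing yet.
        return ToyRDD(parent=self, fn=lambda xs: [f(x) for x in xs])

    def filter(self, p):
        return ToyRDD(parent=self, fn=lambda xs: [x for x in xs if p(x)])

    def collect(self):
        # The "action": walk the lineage back to the source and compute.
        if self.parent is None:
            return list(self.data)
        return self.fn(self.parent.collect())

rdd = ToyRDD(data=[1, 2, 3, 4])
result = rdd.map(lambda x: x * 10).filter(lambda x: x > 15).collect()
print(result)  # [20, 30, 40]
```

Because each node only remembers its parent and its function, Spark can recompute any lost partition from the lineage - that is the "Resilient" in RDD.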
36. Wide and Narrow Transformations
! Narrow dependency: each partition of the parent RDD is used by at most
one partition of the child RDD. This means the task can be executed locally
and we don’t have to shuffle. (E.g.: map, flatMap, filter, sample)
! Wide dependency: multiple child partitions may depend on one partition of
the parent RDD. This means we have to shuffle data unless the parents are
hash-partitioned. (E.g.: sortByKey, reduceByKey, groupByKey, cogroup,
join, cartesian)
! You can read a good blog post about it.
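The difference can be simulated on partitioned lists in plain Python (a toy illustration, not the Spark API):

```python
from collections import defaultdict

# Data split into two "partitions", as it would be across two nodes.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]

# Narrow (e.g. map): each output partition depends on exactly one input
# partition, so every node can work locally - no shuffle needed.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide (e.g. groupByKey): each output key needs records from ALL input
# partitions, so the data must first be shuffled across the network.
grouped = defaultdict(list)
for part in partitions:
    for k, v in part:
        grouped[k].append(v)
print(dict(grouped))  # {'a': [1, 3], 'b': [2, 4]}
```

Note how key "a" appears in both partitions: no single node holds all of its values, which is exactly why wide transformations force a shuffle.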
37. Basic Terms - Wide and Narrow Transformations
○ Narrow dependencies: filter, map, union, ...
○ Wide (shuffle) dependencies: groupByKey, join, ...
42. DataFrame
● Main programming abstraction in SparkSQL
● Distributed collection of data organized into named columns
● Similar to a table in a relational database
● Has schema, rows and rich API
● http://spark.apache.org/docs/latest/sql-programming-guide.html
50. IPython Notebook Spark
! http://jupyter.org/
! http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
! Can connect the IPython notebook to a Spark cluster and run interactive queries in Python.
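One common way to wire this up, assuming Spark and Jupyter are already installed (these are the standard pyspark environment variables; the master URL is illustrative):

```shell
# Tell the pyspark launcher to use Jupyter Notebook as its driver Python.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

# Start the shell - the notebook opens with a SparkContext available as `sc`.
pyspark --master "local[*]"
```

Pointing `--master` at a cluster URL instead of `local[*]` gives the notebook the same interactive experience against a real cluster.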
51. Conclusion
! If you’ve got a choice, keep it simple and not distributed.
! Spark is a great framework for distributed collections
◦ Fully functional API
◦ Can perform imperative actions
! With all of this compute power comes a lot of operational overhead.
! Control your work and data distribution via partitions.
◦ (No more threads :) )