Apache Spark
Lightning-fast cluster computing
Solves the same problems as Hadoop MapReduce
Hadoop issues
- Difficult to maintain / install
- Slow due to replication & disk storage
- Needs separate integrations for different tools (machine learning, stream processing)
- "Spending more time learning processing data tool than processing data"
Hadoop / Spark comparison
10x faster
Spark Architecture
Cluster
● Standalone
● Apache Mesos
● Hadoop YARN (2.0)
Spark is agnostic to the underlying cluster manager
Which one should I choose?
Standalone - simulation / repl
YARN / Mesos - run Spark alongside other applications / use the richer
resource scheduling capabilities
YARN - Resource manager / node manager
MESOS - Mesos master / mesos agent
YARN - will likely be preinstalled in many Hadoop distributions.
In all cases - it is best to run Spark on the same nodes as HDFS for fast access to
storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most
Hadoop distributions already install YARN and HDFS together.
RDD - Resilient Distributed Dataset
[Diagram: an RDD of raw log records (errors, warnings, info messages) split into 4 partitions
and spread across three workers in the cluster; aim for 2-4 partitions per CPU in your cluster.]
RDD - Resilient Distributed Dataset
Parallelized Collections
Created by calling JavaSparkContext’s parallelize method on an existing collection;
the resulting distributed dataset (distData) can be operated on in parallel
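A minimal sketch of a parallelized collection, assuming a SparkContext named `sc` is already available (as in the spark-shell REPL):

```scala
// Assumes `sc` is an existing SparkContext (spark-shell provides one).
val data = Seq(1, 2, 3, 4, 5)

// parallelize copies the local collection into a distributed dataset;
// distData can now be operated on in parallel across the cluster.
val distData = sc.parallelize(data)

// Actions such as reduce trigger the parallel computation.
val sum = distData.reduce(_ + _)  // 15
```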
RDD - Resilient Distributed Dataset
External Datasets - RDDs can be created from any storage source supported by Hadoop
(local files, HDFS, Cassandra, HBase, Amazon S3, ...)
[Diagram: a logLines RDD (input base RDD) of raw log records; applying the filter(fx)
transformation yields an errorsRDD containing only the Error records.]
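A sketch of creating a base RDD from an external dataset and filtering it, assuming `sc` is an existing SparkContext (the HDFS path is illustrative):

```scala
// Base RDD created from an external dataset (one element per line).
val logLines = sc.textFile("hdfs:///logs/app.log")

// filter is a lazy transformation: nothing is read or computed yet,
// Spark only records the lineage logLines -> errorsRDD.
val errorsRDD = logLines.filter(line => line.contains("Error"))
```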
RDD in action
[Diagram: errorsRDD; the coalesce(2) transformation produces cleanedRDD; the collect()
action materializes the RDD, executing the DAG (Directed Acyclic Graph) of transformations.]
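The lineage above can be sketched as follows, assuming `sc` is an existing SparkContext and the path is illustrative — transformations are lazy, and only the action executes the DAG:

```scala
val errorsRDD = sc.textFile("hdfs:///logs/app.log")
  .filter(_.contains("Error"))

// coalesce(2) is another lazy transformation: it shrinks the RDD to 2 partitions.
val cleanedRDD = errorsRDD.coalesce(2)

// collect() is an action: only now does Spark build and execute the DAG of
// transformations and ship the materialized results back to the driver.
val errors: Array[String] = cleanedRDD.collect()
```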
[Diagram: logLinesRDD -> errorsRDD -> cleanedRDD -> collect(); a second filter(fx) on
errorsRDD produces errorMsg1RDD, which feeds the count() and saveToCassandra() actions.]
cache
[Diagram: the same lineage with errorsRDD cached; both the cleanedRDD/collect() branch and
the errorMsg1RDD branch (count(), saveToCassandra()) reuse the cached data instead of
recomputing from logLinesRDD.]
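A sketch of how caching avoids recomputation, assuming `sc` is an existing SparkContext and the path is illustrative (saveToCassandra comes from the third-party Spark Cassandra connector, so it is omitted here):

```scala
val logLinesRDD = sc.textFile("hdfs:///logs/app.log")

// Mark errorsRDD for caching; nothing is computed yet.
val errorsRDD = logLinesRDD.filter(_.contains("Error")).cache()

// First action: computes errorsRDD from logLinesRDD and caches it in memory.
val total = errorsRDD.count()

// Second action: reuses the cached partitions instead of re-reading the file.
val errorMsg1RDD = errorsRDD.filter(_.contains("msg1"))
val msg1Count = errorMsg1RDD.count()
```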
RDD - Resilient Distributed Dataset
RDD Persistence
persist() ->
MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala),
MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY
cache() -> default (StorageLevel.MEMORY_ONLY)
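A short sketch of the two APIs, assuming `sc` is an existing SparkContext:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
rdd.cache()

// persist() lets you pick the storage level explicitly, e.g. spill
// partitions to disk when the dataset does not fit in memory:
val other = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)
```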
RDD - Resilient Distributed Dataset
Which Storage Level to Choose
MEMORY_ONLY -> MEMORY_ONLY_SER -> DISK_ONLY
(prefer the leftmost level that fits; fall back only when the data does not fit in memory
or recomputation is expensive)
RDD - Resilient Distributed Dataset
Removing data
- LRU (least-recently-used)
- RDD.unpersist() method.
Lifecycle of a Spark program
1) Create some input RDD from external data
2) Lazily transform them (filter(), map())
3) Ask Spark to cache() RDDs that need to be reused
4) Launch actions (count(), reduce()) to kick off parallel computation
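The four steps can be sketched as a single small program, assuming `sc` is an existing SparkContext and the input path is illustrative:

```scala
// 1) Create an input RDD from external data.
val lines = sc.textFile("hdfs:///logs/app.log")

// 2) Lazily transform it.
val errors  = lines.filter(_.contains("Error"))
val lengths = errors.map(_.length)

// 3) Ask Spark to cache() RDDs that will be reused.
errors.cache()

// 4) Launch actions to kick off the parallel computation.
val howMany    = errors.count()
val totalChars = lengths.reduce(_ + _)
```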
Example job
Dependency types
Scheduling Process
Spark SQL
DataFrames can be created from different data sources such as:
- Existing RDDs
- Structured data files
- JSON datasets
- Hive tables
- External databases
SQLContext - the basic entry point
HiveContext - a superset of SQLContext that adds support for HiveQL
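A sketch of creating a DataFrame from a JSON dataset and querying it with SQL, using the Spark 1.x SQLContext API; `sc` is an existing SparkContext and the path is illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Create a DataFrame from a JSON dataset (one JSON object per line).
val people = sqlContext.read.json("hdfs:///data/people.json")

// Register it as a temporary table and query it with SQL.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
```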
Spark streaming
Streaming data: user activity on websites, monitoring data, server logs, and other
event data
The stream is split into micro-batches at a pre-defined interval
(N seconds), and each batch is treated as an RDD
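A sketch of the micro-batch model, assuming `sc` is an existing SparkContext and the host/port of the text source are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batches arrive at a pre-defined interval of 10 seconds;
// each batch is handed to the application as an RDD.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

// DStream transformations mirror RDD transformations.
val errors = lines.filter(_.contains("Error"))
errors.print()

ssc.start()
ssc.awaitTermination()
```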
Other Spark libraries
- MLlib (machine learning)
- Spark Streaming (Streaming)
- GraphX (distributed graph processing)
- Third party projects
(https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects)
Monitoring
- Web interface
- REST API
- JMX
Security
Authentication via a shared secret
- YARN: spark.authenticate to true / automatically handle generation and
distribution of shared secret
- OTHERS: spark.authenticate.secret for each node
Web UI - Java servlet filters (spark.ui.filters)
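A sketch of the corresponding spark-defaults.conf entries; the secret value and filter class name are placeholders:

```
# Enable authentication via a shared secret (YARN generates and
# distributes the secret automatically when this is set).
spark.authenticate        true

# On non-YARN deployments, the same secret must be set on every node.
spark.authenticate.secret <your-shared-secret>

# Optional: protect the Web UI with a Java servlet filter class.
spark.ui.filters          com.example.MyAuthFilter
```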
References
http://spark.apache.org/docs/latest/cluster-overview.html
https://www.youtube.com/watch?v=PFK6gsnlV5E
https://www.youtube.com/watch?v=49Hr5xZyTEA
Thanks! Questions?
jefersonm@gmail.com
@jefersonm
jefersonm
jefersonm
jefmachado