Apache Spark
Lightning-fast cluster computing
Solves the same problems as Hadoop MapReduce
Hadoop issues
- Difficult to maintain / install
- Slow due to replication & disk storage
- Needs separate integrations for different tools (machine learning, stream processing)
- "Spending more time learning processing data tool than processing data"
Hadoop / Spark comparison
10x faster
Spark Architecture
Cluster
● Standalone
● Apache Mesos
● Hadoop YARN (2.0)
Spark is agnostic to the underlying cluster manager
Which one should I choose?
Standalone - simulation / repl
YARN / Mesos - run Spark alongside other applications / use the richer
resource scheduling capabilities
YARN - Resource manager / node manager
MESOS - Mesos master / mesos agent
YARN - will likely be preinstalled in many Hadoop distributions.
In all cases - it is best to run Spark on the same nodes as HDFS for fast access to
storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most
Hadoop distributions already install YARN and HDFS together.
RDD - Resilient Distributed Dataset
[Diagram: an RDD of raw log records (errors, warnings, info messages) split into 4 partitions
and spread across three workers in the cluster; aim for 2-4 partitions per CPU in your cluster.]
RDD - Resilient Distributed Dataset
Parallelized Collections
Created by calling JavaSparkContext’s parallelize method on an existing collection;
the resulting distributed dataset (distData) can be operated on in parallel
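A minimal sketch of a parallelized collection, assuming a SparkContext named `sc` is already available (as in the spark-shell REPL):

```scala
// Assumes `sc` is an existing SparkContext (spark-shell provides one).
val data = Seq(1, 2, 3, 4, 5)

// parallelize copies the local collection into a distributed dataset;
// distData can now be operated on in parallel across the cluster.
val distData = sc.parallelize(data)

// Actions such as reduce trigger the parallel computation.
val sum = distData.reduce(_ + _)  // 15
```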
RDD - Resilient Distributed Dataset
External Datasets - RDDs can be created from any storage source supported by Hadoop
(local files, HDFS, Cassandra, HBase, Amazon S3, ...)
[Diagram: a logLines RDD (input base RDD) of raw log records; applying the filter(fx)
transformation yields an errorsRDD containing only the Error records.]
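A sketch of creating a base RDD from an external dataset and filtering it, assuming `sc` is an existing SparkContext (the HDFS path is illustrative):

```scala
// Base RDD created from an external dataset (one element per line).
val logLines = sc.textFile("hdfs:///logs/app.log")

// filter is a lazy transformation: nothing is read or computed yet,
// Spark only records the lineage logLines -> errorsRDD.
val errorsRDD = logLines.filter(line => line.contains("Error"))
```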
RDD in action
[Diagram: errorsRDD; the coalesce(2) transformation produces cleanedRDD; the collect()
action materializes the RDD, executing the DAG (Directed Acyclic Graph) of transformations.]
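The lineage above can be sketched as follows, assuming `sc` is an existing SparkContext and the path is illustrative — transformations are lazy, and only the action executes the DAG:

```scala
val errorsRDD = sc.textFile("hdfs:///logs/app.log")
  .filter(_.contains("Error"))

// coalesce(2) is another lazy transformation: it shrinks the RDD to 2 partitions.
val cleanedRDD = errorsRDD.coalesce(2)

// collect() is an action: only now does Spark build and execute the DAG of
// transformations and ship the materialized results back to the driver.
val errors: Array[String] = cleanedRDD.collect()
```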
[Diagram: logLinesRDD -> errorsRDD -> cleanedRDD -> collect(); a second filter(fx) on
errorsRDD produces errorMsg1RDD, which feeds the count() and saveToCassandra() actions.]
cache
[Diagram: the same lineage with errorsRDD cached; both the cleanedRDD/collect() branch and
the errorMsg1RDD branch (count(), saveToCassandra()) reuse the cached data instead of
recomputing from logLinesRDD.]
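A sketch of how caching avoids recomputation, assuming `sc` is an existing SparkContext and the path is illustrative (saveToCassandra comes from the third-party Spark Cassandra connector, so it is omitted here):

```scala
val logLinesRDD = sc.textFile("hdfs:///logs/app.log")

// Mark errorsRDD for caching; nothing is computed yet.
val errorsRDD = logLinesRDD.filter(_.contains("Error")).cache()

// First action: computes errorsRDD from logLinesRDD and caches it in memory.
val total = errorsRDD.count()

// Second action: reuses the cached partitions instead of re-reading the file.
val errorMsg1RDD = errorsRDD.filter(_.contains("msg1"))
val msg1Count = errorMsg1RDD.count()
```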
RDD - Resilient Distributed Dataset
RDD Persistence
persist() ->
MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (Java and Scala),
MEMORY_AND_DISK_SER (Java and Scala), DISK_ONLY
cache() -> default (StorageLevel.MEMORY_ONLY)
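A short sketch of the two APIs, assuming `sc` is an existing SparkContext:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
rdd.cache()

// persist() lets you pick the storage level explicitly, e.g. spill
// partitions to disk when the dataset does not fit in memory:
val other = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK)
```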
RDD - Resilient Distributed Dataset
Which Storage Level to Choose
MEMORY_ONLY -> MEMORY_ONLY_SER -> DISK_ONLY
(prefer the leftmost level that fits; fall back only when the data does not fit in memory
or recomputation is expensive)
RDD - Resilient Distributed Dataset
Removing data
- LRU (least-recently-used)
- RDD.unpersist() method.
Lifecycle of a Spark program
1) Create some input RDD from external data
2) Lazily transform them (filter(), map())
3) Ask Spark to cache() RDDs that need to be reused
4) Launch actions (count(), reduce()) to kick off parallel computation
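The four steps can be sketched as a single small program, assuming `sc` is an existing SparkContext and the input path is illustrative:

```scala
// 1) Create an input RDD from external data.
val lines = sc.textFile("hdfs:///logs/app.log")

// 2) Lazily transform it.
val errors  = lines.filter(_.contains("Error"))
val lengths = errors.map(_.length)

// 3) Ask Spark to cache() RDDs that will be reused.
errors.cache()

// 4) Launch actions to kick off the parallel computation.
val howMany    = errors.count()
val totalChars = lengths.reduce(_ + _)
```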
Example job
Dependency types
Scheduling Process
Spark SQL
DataFrames can be created from different data sources such as:
- Existing RDDs
- Structured data files
- JSON datasets
- Hive tables
- External databases
SQLContext - the basic entry point
HiveContext - a superset of SQLContext that adds support for HiveQL
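A sketch of creating a DataFrame from a JSON dataset and querying it with SQL, using the Spark 1.x SQLContext API; `sc` is an existing SparkContext and the path is illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Create a DataFrame from a JSON dataset (one JSON object per line).
val people = sqlContext.read.json("hdfs:///data/people.json")

// Register it as a temporary table and query it with SQL.
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
```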
Spark streaming
Streaming data: user activity on websites, monitoring data, server logs, and other
event data
The stream is split into micro-batches at a pre-defined interval
(N seconds), and each batch is treated as an RDD
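A sketch of the micro-batch model, assuming `sc` is an existing SparkContext and the host/port of the text source are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batches arrive at a pre-defined interval of 10 seconds;
// each batch is handed to the application as an RDD.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

// DStream transformations mirror RDD transformations.
val errors = lines.filter(_.contains("Error"))
errors.print()

ssc.start()
ssc.awaitTermination()
```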
Other Spark libraries
- MLlib (machine learning)
- Spark Streaming (Streaming)
- GraphX (distributed graph processing)
- Third party projects
(https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects)
Monitoring
- Web interface
- REST API
- JMX
Security
Authentication via a shared secret
- YARN: spark.authenticate to true / automatically handle generation and
distribution of shared secret
- OTHERS: spark.authenticate.secret for each node
Web UI - Java servlet filters (spark.ui.filters)
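A sketch of the corresponding spark-defaults.conf entries; the secret value and filter class name are placeholders:

```
# Enable authentication via a shared secret (YARN generates and
# distributes the secret automatically when this is set).
spark.authenticate        true

# On non-YARN deployments, the same secret must be set on every node.
spark.authenticate.secret <your-shared-secret>

# Optional: protect the Web UI with a Java servlet filter class.
spark.ui.filters          com.example.MyAuthFilter
```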
References
http://spark.apache.org/docs/latest/cluster-overview.html
https://www.youtube.com/watch?v=PFK6gsnlV5E
https://www.youtube.com/watch?v=49Hr5xZyTEA
Thanks! Questions?
jefersonm@gmail.com
@jefersonm
jefersonm
jefersonm
jefmachado