1
SPARK
INSTRUCTOR:
DR. SHIYONG LU
BY:
SRINATH REDDY KOTU
GRADUATE STUDENT
2
Data Processing Goals
Low latency (interactive) queries on
historical data: enable faster decisions
E.g., identify why a...
3
The Need for Unification (1/2)
Today’s state-of-art analytics stack
Batch stack
(e.g., Hadoop)
Input
Splitter
Streaming ...
4
Data Processing Stack
Data Processing Layer
Resource Management Layer
Storage Layer
5
Hadoop Stack
Data Processing Layer
Resource Management Layer
Storage Layer
…
Hadoop MR
Hive Pig
HBase Storm
Hadoop Yarn
...
6
BDAS Stack
Data Processing Layer
Resource Management Layer
Storage Layer
Mesos
Spark
Spark
Streaming Shark SQL
BlinkDB
G...
7
How do BDAS & Hadoop fit together?
Mesos Mesos
Spark
Spark
Streaming Shark SQL
BlinkDB
GraphX
MLlib
MLBase
HDFS, S3, …
T...
8
Apache Mesos (cluster manager)
Enable multiple frameworks to share same
cluster resources (e.g., Hadoop, Storm, Spark)
T...
9
Apache Spark
Distributed Execution Engine
Fault-tolerant, efficient in-memory storage (RDDs)
Powerful programming model ...
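As a rough illustration of how an in-memory dataset can be fault-tolerant without replication, the sketch below mimics the lineage idea behind RDDs in plain Python. It does not use Spark: `ToyRDD`, `parallelize`, and `drop_cache` are hypothetical names invented for this example, and everything runs on a single machine.

```python
from functools import reduce


class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, num_partitions, compute):
        # compute(i) rebuilds partition i from its parent (the lineage).
        self.num_partitions = num_partitions
        self._compute = compute
        self._cache = {}  # partition index -> list held "in memory"

    def partition(self, i):
        # Serve from memory if present; otherwise recompute from lineage.
        if i not in self._cache:
            self._cache[i] = list(self._compute(i))
        return self._cache[i]

    def map(self, f):
        return ToyRDD(self.num_partitions,
                      lambda i: (f(x) for x in self.partition(i)))

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: (x for x in self.partition(i) if pred(x)))

    def reduce(self, f):
        items = [x for i in range(self.num_partitions)
                 for x in self.partition(i)]
        return reduce(f, items)

    def drop_cache(self, i):
        # Simulate losing the in-memory copy of one partition.
        self._cache.pop(i, None)


def parallelize(data, num_partitions):
    # Split a local list into round-robin partitions.
    parts = [data[i::num_partitions] for i in range(num_partitions)]
    return ToyRDD(num_partitions, lambda i: parts[i])


squares = parallelize(list(range(10)), 3).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
total_before = evens.reduce(lambda a, b: a + b)

evens.drop_cache(1)    # "lose" one partition held in memory
squares.drop_cache(1)  # and the parent partition it was derived from
total_after = evens.reduce(lambda a, b: a + b)  # rebuilt from lineage
```

Only the lost partition is recomputed; the other partitions are still served from the cache, which is the property that makes lineage cheaper than keeping replicas.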
10
Spark Streaming
Large scale streaming computation
Implement streaming as a sequence of <1s jobs
Fault tolerant
Handle s...
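The "sequence of small jobs" idea can be sketched in plain Python: cut the stream into short time intervals and run an ordinary batch job on each one. This is only an illustration under simplifying assumptions, not Spark Streaming code; `micro_batches` and `count_words` are hypothetical helpers and the timestamps are synthetic.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time intervals."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    # Return the batches in time order, one per non-empty interval.
    return [batches[k] for k in sorted(batches)]


def count_words(batch):
    """An ordinary batch job: word counts over one small batch."""
    return Counter(word for line in batch for word in line.split())


# Synthetic events: (seconds since start, text line).
events = [
    (0.1, "spark streaming"),
    (0.4, "spark"),
    (1.2, "storm"),
    (1.9, "spark storm"),
    (2.5, "streaming"),
]

running = Counter()
per_batch = []
for batch in micro_batches(events, interval=1.0):
    result = count_words(batch)  # the same code a batch job would run
    per_batch.append(result)
    running.update(result)       # state carried across batches
```

Because each interval is processed by the same function a batch job would use, batch and streaming logic can share code, which is the unification the slide points at.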
11
Shark
Hive over Spark: full support for HQL and UDFs
Up to 100x when input is in memory
Up to 5-10x when input is on di...
12
BlinkDB
Trade between query performance and accuracy
using sampling
Why?
In-memory processing doesn’t guarantee interac...
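The trade between query performance and accuracy can be illustrated with a small sampling experiment. The sketch below estimates a mean from a uniform random sample and reports an approximate 95% error bound; a larger sample costs more work but gives a tighter bound. It is not BlinkDB code: `approx_mean` is a hypothetical helper, the data is synthetic, and uniform sampling is used for simplicity rather than as a description of BlinkDB's sampling strategy.

```python
import math
import random


def approx_mean(values, sample_size, rng, z=1.96):
    """Estimate the mean from a uniform sample, with a ~95% error bound."""
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    # Unbiased sample variance, then the standard error of the mean.
    var = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    half_width = z * math.sqrt(var / sample_size)
    return mean, half_width


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Synthetic "table column": 200,000 response times in milliseconds.
latencies = [rng.gauss(120.0, 30.0) for _ in range(200_000)]

exact = sum(latencies) / len(latencies)      # full scan
small = approx_mean(latencies, 1_000, rng)   # cheaper, wider bound
large = approx_mean(latencies, 20_000, rng)  # costlier, tighter bound
```

Each result is a `(mean, half_width)` pair, so a caller can decide whether the bound is tight enough or whether a larger sample is worth the extra time.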
13
GraphX
Combine data-parallel and graph-parallel
computations
Provide powerful abstractions:
PowerGraph, Pregel implemen...
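To give a feel for why a Pregel-style abstraction can be expressed compactly, here is a toy single-machine version of the superstep loop, used to compute single-source shortest paths. It is a sketch under stated assumptions, not the GraphX API: the `pregel` function, its parameter names, and the example graph are all invented for illustration.

```python
INF = float("inf")


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg,
           max_iters=50):
    """Toy single-machine Pregel loop (not the GraphX API)."""
    # Superstep 0: every vertex processes the initial message.
    state = {v: vprog(v, val, initial_msg) for v, val in vertices.items()}
    active = set(state)
    for _ in range(max_iters):
        inbox = {}
        for src, dst, attr in edges:
            if src not in active:
                continue  # only vertices updated last round send messages
            for target, msg in send_msg(src, state[src], dst, state[dst], attr):
                inbox[target] = (merge_msg(inbox[target], msg)
                                 if target in inbox else msg)
        if not inbox:
            break  # no messages in flight: the computation has converged
        for v, msg in inbox.items():
            state[v] = vprog(v, state[v], msg)
        active = set(inbox)
    return state


def shortest_paths(edges, vertex_ids, source):
    vertices = {v: (0.0 if v == source else INF) for v in vertex_ids}
    return pregel(
        vertices, edges, INF,
        vprog=lambda v, dist, msg: min(dist, msg),
        send_msg=lambda s, sd, d, dd, w: [(d, sd + w)] if sd + w < dd else [],
        merge_msg=min,
    )


# Directed, weighted edges: (source, destination, weight).
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0), ("c", "d", 1.0)]
distances = shortest_paths(edges, ["a", "b", "c", "d", "e"], "a")
```

The algorithm-specific part is just the three small functions passed to `pregel`; the loop itself is generic and could be reused for other vertex programs.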
14
MLlib and MLbase
MLlib: high quality library for ML algorithms
MLbase: make ML accessible to non-experts
Declarative in...
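One way to read the declarative `classify(data)` idea is that the caller names the task and the system chooses the algorithm. The sketch below is a hypothetical, heavily simplified illustration of that idea in plain Python, not MLbase code: it tries two toy candidates (`majority` and `nearest_neighbor`, both invented here), scores them on a held-out split, and returns the winner refit on all the data.

```python
import random


def majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top


def nearest_neighbor(train):
    """1-nearest-neighbour on squared Euclidean distance."""
    def predict(x):
        closest = min(train,
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        return closest[1]
    return predict


def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)


def classify(data, candidates, holdout=0.25, seed=0):
    """Hypothetical declarative entry point: the caller names no algorithm."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - holdout))
    train, valid = rows[:cut], rows[cut:]
    scored = [(accuracy(fit(train), valid), name, fit)
              for name, fit in candidates.items()]
    best_score, best_name, best_fit = max(scored, key=lambda t: t[0])
    return best_name, best_score, best_fit(rows)  # refit winner on all data


# Synthetic data: two well-separated 2-D clusters labelled "a" and "b".
rng = random.Random(1)
data = [((rng.gauss(0, 0.5), rng.gauss(0, 0.5)), "a") for _ in range(40)]
data += [((rng.gauss(5, 0.5), rng.gauss(5, 0.5)), "b") for _ in range(40)]

candidates = {"majority": majority, "nearest_neighbor": nearest_neighbor}
best_name, best_score, model = classify(data, candidates)
```

Adding an algorithm only means adding an entry to `candidates`, which loosely mirrors the goal of letting developers add and test new algorithms easily.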
15
Tachyon
In-memory, fault-tolerant storage system
Flexible API, including HDFS API
Allow multiple frameworks (including ...
16
Thank You
Published in: Software, Technology, Business
  • So what does this mean? We want low response time on historical data, since the faster we can make a decision, the better. We want the ability to query live data, since decisions based on real-time data are better than decisions based on stale data. Finally, we want to perform sophisticated processing on massive data since, in principle, processing more data leads to better decisions.
  • Spark

    1. SPARK
         Instructor: Dr. Shiyong Lu
         By: Srinath Reddy Kotu, Graduate Student
    2. Data Processing Goals
         • Low latency (interactive) queries on historical data: enable faster decisions
           E.g., identify why a site is slow and fix it
         • Low latency queries on live data (streaming): enable decisions on real-time data
           E.g., detect and block worms in real time (a worm may infect 1 million hosts in 1.3 sec)
         • Sophisticated data processing: enable “better” decisions
           E.g., anomaly detection, trend analysis
    3. The Need for Unification (1/2)
         Today’s state-of-the-art analytics stack
         Diagram labels: Input, Splitter, Batch stack (e.g., Hadoop), Streaming stack (e.g., Storm), Interactive queries (e.g., HBase, Impala, SQL), Real-Time Analytics, Ad-Hoc queries on historical data, Interactive queries on historical data
         Challenges:
         • Need to maintain three separate stacks
         • Expensive and complex
         • Hard to compute consistent metrics across stacks
         • Hard and slow to share data across stacks
    4. Data Processing Stack
         • Data Processing Layer
         • Resource Management Layer
         • Storage Layer
    5. Hadoop Stack
         Layers: Data Processing Layer, Resource Management Layer, Storage Layer
         Diagram labels: …, Hadoop MR, Hive, Pig, HBase, Storm, Hadoop Yarn, HDFS, S3, …
    6. BDAS Stack
         Layers: Data Processing Layer, Resource Management Layer, Storage Layer
         Diagram labels: Mesos, Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase, HDFS, S3, …, Tachyon
    7. How do BDAS & Hadoop fit together?
         Diagram labels: Mesos, Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase, HDFS, S3, …, Tachyon; Hadoop Yarn, Spark Streaming, Shark SQL, GraphX, ML library, BlinkDB, MLbase, Spark, Hadoop MR, Hive, Pig, HBase, Storm
    8. Apache Mesos (cluster manager)
         • Enable multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, Spark)
         • Twitter’s large-scale deployment: 6,000+ servers, 500+ engineers running jobs on Mesos
         • Mesosphere: startup to commercialize Mesos
    9. Apache Spark
         • Distributed Execution Engine
         • Fault-tolerant, efficient in-memory storage (RDDs)
         • Powerful programming model and APIs (Scala, Python, Java)
         • Fast: up to 100x faster than Hadoop
         • Easy to use: 5-10x less code than Hadoop
         • General: support interactive & iterative apps
    10. Spark Streaming
         • Large-scale streaming computation
         • Implement streaming as a sequence of <1s jobs
         • Fault tolerant
         • Handle stragglers
         • Ensure exactly-once semantics
         • Integrated with Spark: unifies batch, interactive, and streaming computations
    11. Shark
         • Hive over Spark: full support for HQL and UDFs
         • Up to 100x faster when input is in memory
         • Up to 5-10x faster when input is on disk
         • Running on hundreds of nodes at Yahoo!
    12. BlinkDB
         • Trade between query performance and accuracy using sampling
         • Why? In-memory processing doesn’t guarantee interactive processing
           E.g., tens of seconds just to scan 512 GB of RAM!
           The gap between memory capacity and transfer rate is increasing
    13. GraphX
         • Combine data-parallel and graph-parallel computations
         • Provide powerful abstractions: PowerGraph, Pregel implemented in less than 20 LOC!
         • Leverage Spark’s fault tolerance
    14. MLlib and MLbase
         • MLlib: high quality library for ML algorithms
         • MLbase: make ML accessible to non-experts
         • Declarative interface: allow users to say what they want
           E.g., classify(data)
         • Automatically pick best algorithm for given data, time
         • Allow developers to easily add and test new algorithms
    15. Tachyon
         • In-memory, fault-tolerant storage system
         • Flexible API, including HDFS API
         • Allow multiple frameworks (including Hadoop) to share in-memory data
    16. Thank You
