Lightning Fast Big Data Analytics using Apache Spark
----------------------------------------------------------------------------------

Hadoop gives you a great (actually revolutionary) mechanism for storing large datasets in a highly fault-tolerant and highly available storage system (HDFS), and the ability to process these mammoth datasets using its massively parallel & distributed processing framework (Map Reduce). It was built for batch processing, where analysts and programmers submit series of jobs to crunch very large structured/unstructured datasets and then wait for the results before performing further analysis. But one of the few things Hadoop is criticized for is its speed and its lack of interactivity (mainly because its user base has grown tremendously, and people always demand more, especially when it comes to speed).
Spark is an open source system that can run on top of your existing HDFS and can provide in-memory analytics up to 100x faster (almost interactive) than Map Reduce.
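
As a taste of where that speedup comes from, here is a hedged sketch (assuming a Spark shell with a SparkContext named sc; the HDFS path is hypothetical) of caching a working set once and re-querying it from memory:

    val logs = sc.textFile("hdfs:///data/pagecounts")
    val en = logs.filter(_.split(" ")(1) == "en") // lazy transformation
    en.cache()     // keep the filtered set in cluster memory
    en.count()     // first action: reads HDFS, fills the cache
    en.count()     // re-query: served from memory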

Topics that will be covered:

Quick Introduction to Hadoop & its Limitations
Introduction to Spark
Spark Architecture
Programming model of Spark
Demo
Spark Use Cases


Speaker Notes:
  • Solutions Architect at GlobalLogic. Has been working for the last 10 years on large databases, data warehouses, ETLs, data mining, and for around 2-3 years now on Big Data analytics, machine learning & distributed systems. GlobalLogic is a 6000+ headcount company in full product life-cycle services and one of the fastest-growing R&D services firms; it provides advisory, professional services, engineering and support services to 250+ customers globally. Will speak about an in-memory cluster-computing framework that can really nitrogen-boost your existing Hadoop-based Big Data setup for analytics.
  • Quickly touch upon Hadoop: what it does, HDFS, Map Reduce, and some of its limitations. Introduce Spark and one of the tools built on top of Spark, called Shark (the SQL interface to Spark). A little bit on Spark's architecture and its basic programming model. Showcase a demo of Spark's and Shark's functionality. Will speak a bit about the future of Spark, where it's heading, and about some of its existing customers and contributors.
  • Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. "Large" basically means 10-100 GB and above. It is the driving force behind the big data industry's growth. Provides 2 basic components: HDFS (a large-scale storage system) and Map Reduce (a distributed cluster-computing framework). A typical Hadoop setup comprises: a cluster of a particular Hadoop distribution; tools like Hive, Pig and Mahout running on top of Hadoop (internally processing HDFS data using Map Reduce jobs); and a set of tools for importing/exporting data into HDFS from/to external systems like RDBMS or server logs.
  • One of the reasons Map Reduce is criticized is its restricted programming framework: MapReduce tasks must be written as acyclic dataflow programs (a stateless mapper followed by a stateless reducer, executed by a batch job scheduler), so repeated querying of datasets becomes difficult and iterative algorithms are hard to write. And after each Map-Reduce iteration, data has to be persisted to disk before the next iteration can proceed.
  • SparkContext: represents the connection to a Spark cluster and provides the entry point for interacting with Spark and distributing our jobs. Driver program: the process running the main() function of the application and creating the SparkContext. Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). Worker node: any node that can run application code in the cluster. Executor: a process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them; each application has its own executors. Task: a unit of work that will be sent to one executor. Job: a parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
  • Resilient Distributed Datasets (RDDs) are the distributed memory abstraction that lets programmers perform in-memory parallel computations on large clusters, and that too in a highly fault-tolerant manner. This is the main concept around which the whole Spark framework revolves. Currently there are 2 types of RDDs (see the creation sketch after these notes): Parallelized collections, created by calling the parallelize method on an existing Scala collection; the developer can specify the number of slices to cut the dataset into, ideally 2-3 slices per CPU. Hadoop datasets, created from any file stored on HDFS or other storage systems supported by Hadoop (S3, HBase etc.) using SparkContext's textFile method; the default in this case is 1 slice per file block.
  • Transformations, like map, take an RDD as input, pass and process each element through a function, and return a new transformed RDD as output. By default, each transformed RDD is recomputed each time you run an action on it, unless you specify the RDD to be cached in memory, in which case Spark will try to keep the elements around the cluster for faster access. RDDs can be persisted on disk as well. Caching is the key tool for iterative algorithms. Using persist, one can specify the storage level for persisting an RDD; cache is just shorthand for the default storage level, which is MEMORY_ONLY (see the persistence sketch after these notes). MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed (this is the default level). MEMORY_AND_DISK: store the RDD as deserialized Java objects in the JVM; if the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition); generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. DISK_ONLY: store the RDD partitions only on disk. MEMORY_ONLY_2, MEMORY_AND_DISK_2 etc.: same as the levels above, but replicate each partition on two cluster nodes. Which storage level is best? A few things to consider: try to keep as much in memory as possible; try not to spill to disk unless your computed datasets are expensive to recompute; use replication only if you want fault tolerance.
  • PageRank is an algorithm used by Google Search to rank websites in its search engine results. PageRank was named after Larry Page, one of the founders of Google. It is a way of measuring the importance of website pages. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is; the underlying assumption is that more important websites are likely to receive more links from other websites.
  • Spark Streaming: for stream processing. Continuously executes various parallel operations on an input stream of data: the system receives continuous data and divides it into batches, and each batch is treated and processed as an RDD (see the streaming sketch after these notes). GraphX: a distributed graph system, designed to efficiently execute graph algorithms using Spark's parallel and in-memory computation framework. MLbase: the goal of MLbase is to make distributed machine learning easy. BlinkDB: an approximate query engine that allows a trade-off between accuracy and response time, making very large datasets highly interactive; it is in the process of being deployed at Facebook, and AMPLab have demonstrated how complex queries on 17 TB of data (running on a 100-node cluster) can be completed in less than 2 seconds! You specify queries with a time bound, e.g. Select avg(SessionTime) from tblSession where UserGender='MALE' within 2 SECONDS.
  • Interpreter: it's actually the Scala command line (interpreter), modified for Spark. Hadoop I/O: for reading/writing from HDFS. Standalone: custom resource manager. Operators: map, join, group by etc. on RDDs. Networking: replication, caching, graph. Block manager: a very simple key-value store used as a cache. Broadcaster: sending/receiving events, heartbeats etc.
  • Used by the majority of Fortune 50 companies.
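
The RDD creation and persistence notes above, as a hedged sketch; a live SparkContext named sc is assumed, and the paths are hypothetical:

    import org.apache.spark.storage.StorageLevel

    // Two ways to create an RDD (note on RDD types above)
    val coll = sc.parallelize(1 to 1000000, 8)         // parallelized collection, 8 slices
    val text = sc.textFile("hdfs:///wiki/pagecounts")  // Hadoop dataset, 1 slice per block

    // Persistence (note on storage levels above)
    coll.cache()                                    // shorthand for persist(MEMORY_ONLY)
    text.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized; spills to disk if needed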
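And the micro-batch model from the Spark Streaming note, as a minimal sketch assuming the classic DStream API; the socket source and batch interval are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))      // each 10s batch becomes an RDD
    val lines = ssc.socketTextStream("localhost", 9999)  // continuous input stream
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()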

Transcript:

1. Lightning Fast Big Data Analytics using Apache Spark. Manish Gupta, Solutions Architect – Product Engineering and Development. 30th Jan 2014, Delhi.
2. Agenda Of The Talk: Hadoop – A Quick Introduction; An Introduction To Spark & Shark; Spark – Architecture & Programming Model; Example & Demo; Spark Current Users & Roadmap.
3. Agenda Of The Talk (repeated; section divider).
4. What is Hadoop? It's open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable and flexible way (HDFS). It also provides a programming model/framework for processing these large datasets in a massively parallel, fault-tolerant and data-location-aware fashion (MR). (Diagram: Input fans out to Map tasks, which feed Reduce tasks, which produce Output.)
5. Limitations of Map Reduce. (Diagram: each iteration reads its input from HDFS and writes its output back to HDFS before the next iteration can start.)
   • Slow due to replication, serialization, and disk IO
   • Inefficient for: iterative algorithms (machine learning, graphs & network analysis) and interactive data mining (R, Excel, ad-hoc reporting, searching)
6. Approach: Leverage Memory?
   • Memory bus >> disk & SSDs
   • Many datasets fit into memory: 1 TB = 1 billion records @ 1 KB
   • Memory capacity also follows Moore's Law: a single 8 GB stick of RAM is about $80 right now; in 2021 you'd be able to buy a single 64 GB stick for the same price.
7. Agenda Of The Talk (repeated; section divider).
8. Agenda Of The Talk (repeated; section divider).
9. Spark: "A big data analytics cluster-computing framework written in Scala."
   • Open source; originally developed in the AMPLab at UC Berkeley.
   • Provides in-memory analytics, faster than Hadoop/Hive (up to 100x).
   • Designed for running iterative algorithms & interactive analytics.
   • Highly compatible with Hadoop's storage APIs; can run on your existing Hadoop cluster setup.
   • Developers can write driver programs using multiple programming languages.
10. Spark. (Architecture diagram: the Spark Driver (Master) talks to a Cluster Manager; Spark Workers each hold a cache and run alongside HDFS Datanodes holding data blocks.)
11. Spark. (Diagram: the Hadoop pattern again: iter. 1 reads the input from HDFS and writes to HDFS; iter. 2 reads from HDFS and writes to HDFS; and so on.)
12. Spark. (Diagram: one HDFS read of the input, then iter. 1, iter. 2, ... run on cached data.) Not tied to the 2-stage Map Reduce paradigm: 1. Extract a working set. 2. Cache it. 3. Query it repeatedly. (Chart: logistic regression in Hadoop and Spark.)
13. Spark. A simple analytical operation:
    1) pagecount = spark.textFile("/wiki/pagecounts")
       pagecount.count()
       Equivalent SQL: Select count(*) from pagecounts
    2) englishPages = pagecount.filter(_.split(" ")(1) == "en")
       englishPages.cache()
       englishPages.count()
       englishTuples = englishPages.map(line => line.split(" "))
       englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
       englishKeyValues.reduceByKey(_+_, 1).collect
       Equivalent SQL: Select Col1, sum(Col4) from pagecounts Where Col2 = "en" Group by Col1
14. Shark
    • HIVE on SPARK = SHARK
    • A large-scale data warehouse system, just like Apache Hive.
    • Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
    • Built on top of Spark (thus a faster execution engine).
    • Provision for creating in-memory materialized tables (cached tables).
    • Cached tables use columnar storage instead of row storage:
      Row storage:    (1, ABC, 4.1), (2, XYZ, 3.5), (3, PPP, 6.4)
      Column storage: (1, 2, 3), (ABC, XYZ, PPP), (4.1, 3.5, 6.4)
15. Shark. (Diagram: Hive architecture: Client (CLI, JDBC), Driver (Meta store, SQL Parser, Query Optimizer, Physical Plan Execution), executing via Map Reduce on HDFS.)
16. Shark. (Diagram: Shark architecture: Client (CLI, JDBC), Driver (Meta store, SQL Parser, Query Optimizer, Cache Mgr., Physical Plan Execution), executing via Spark on HDFS.)
17. Agenda Of The Talk (repeated; section divider).
18. Agenda Of The Talk (repeated; section divider).
19. Spark Programming Model. The user (developer) writes the Driver Program:
       sc = new SparkContext
       rDD = sc.textfile("hdfs://…")
       rDD.filter(…)
       rDD.Cache
       rDD.Count
       rDD.map
    (Diagram: the SparkContext talks to the Cluster Manager; Worker Nodes run Executors with caches, executing Tasks against HDFS Datanodes.)
20. Spark Programming Model. The driver program (as on the previous slide) writes to an RDD (Resilient Distributed Dataset):
    • Immutable data structure
    • In-memory (explicitly)
    • Fault tolerant
    • Parallel data structure
    • Controlled partitioning to optimize data placement
    • Can be manipulated using a rich set of operators.
    A minimal driver sketch follows.
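
A minimal sketch of this driver pattern, assuming Spark's classic RDD API; the app name, HDFS path, and filter string are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("driver-sketch")
        val sc = new SparkContext(conf)              // entry point to the cluster
        val rdd = sc.textFile("hdfs:///data/input")  // RDD backed by HDFS blocks
        val errors = rdd.filter(_.contains("ERROR")) // transformation (lazy)
        errors.cache()                               // mark for in-memory reuse
        println(errors.count())                      // action: triggers execution
        sc.stop()
      }
    }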
21. RDD. Programming interface: the programmer can perform 3 types of operations (a sketch follows):
    • Transformations: create a new dataset from an existing one. Lazy in nature: they are executed only when some action is performed. Examples: map(func), filter(func), distinct().
    • Actions: return a value to the driver program, or export data to a storage system, after performing a computation. Examples: count(), reduce(func), collect(), take().
    • Persistence: for caching datasets in memory for future operations; option to store on disk or RAM or mixed (storage level). Examples: persist(), cache().
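
A hedged sketch of the three operation types, assuming a live SparkContext named sc; the data is made up for illustration:

    val nums = sc.parallelize(Seq(1, 2, 2, 3, 4))

    // Transformations: lazy, each returns a new RDD
    val doubled  = nums.map(_ * 2)
    val evens    = nums.filter(_ % 2 == 0)
    val distinct = nums.distinct()

    // Persistence: keep the computed partitions around for reuse
    distinct.cache() // shorthand for persist(StorageLevel.MEMORY_ONLY)

    // Actions: trigger execution and return values to the driver
    val n     = distinct.count()
    val total = distinct.reduce(_ + _)
    val local = distinct.collect()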
22. Spark. How Spark Works:
    • RDD: a parallel collection with partitions.
    • User applications create RDDs, transform them, and run actions.
    • This results in a DAG (Directed Acyclic Graph) of operators.
    • The DAG is compiled into stages; each stage is executed as a series of tasks (one task per partition).
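
One way to inspect this DAG, assuming the pagecounts pipeline built up on the following slides: RDD.toDebugString prints the operator lineage, indented at the shuffle boundaries where stages split:

    val pairs = sc.textFile("/wiki/pagecounts")
      .map(line => line.split("\t"))
      .map(r => (r(0), r(1).toInt))
      .reduceByKey(_ + _)
    println(pairs.toDebugString) // lineage: textFile -> map -> map -> reduceByKey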
23. Spark. Example:
       sc.textFile("/wiki/pagecounts")
    Lineage so far: textFile : RDD[String]
24. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
    Lineage so far: textFile -> map : RDD[String] -> RDD[List[String]]
25. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
         .map(R => (R(0), R(1).toInt))
    Lineage so far: textFile -> map -> map : ... -> RDD[(String, Int)]
26. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
         .map(R => (R(0), R(1).toInt))
         .reduceByKey(_+_, 3)
    Lineage so far: textFile -> map -> map -> reduceByKey : ... -> RDD[(String, Int)]
27. Spark. Example:
       sc.textFile("/wiki/pagecounts")
         .map(line => line.split("\t"))
         .map(R => (R(0), R(1).toInt))
         .reduceByKey(_+_, 3)
         .collect()
    Lineage so far: textFile -> map -> map -> reduceByKey -> collect : ... -> Array[(String, Int)]
28. Spark. Execution Plan: textFile -> map -> map -> reduceByKey -> collect. The logical plan above gets compiled by the DAG scheduler into a plan comprising stages, as follows...
29. Spark. Execution Plan: the same operator graph split into Stage 1 (textFile, map, map) and Stage 2 (reduceByKey, collect). Stages are sequences of RDDs that don't have a shuffle in between.
30. Spark.
    Stage 1: 1. Read HDFS split. 2. Apply both the maps. 3. Start partial reduce. 4. Write shuffle data.
    Stage 2: 1. Read shuffle data. 2. Final reduce. 3. Send result to driver program.
31. Spark. Stage Execution (diagram: Stage 1 fans out into parallel tasks):
    • Create a task for each partition in the new RDD
    • Serialize the task
    • Schedule and ship tasks to slaves
    And all this happens internally (you don't need to do anything).
32. Spark. Task Execution: the task is the fundamental unit of execution in Spark. Over its lifetime a task fetches input (from HDFS or an RDD), executes, and writes output (to HDFS, an RDD, or intermediate shuffle output).
33. Spark. Spark Executor (Slaves). (Diagram: each executor core runs tasks back to back, each task going through Fetch Input, Execute Task, Write Output; cores 1-3 shown working in parallel.)
34. Spark. Summary of Components:
    • Task: the fundamental unit of execution in Spark
    • Stage: a set of tasks that run in parallel
    • DAG: logical graph of RDD operations
    • RDD: parallel dataset with partitions
35. Agenda Of The Talk (repeated; section divider).
36. Agenda Of The Talk (repeated; section divider).
37. Example & Demo. Cluster Details:
    • 6 m1.xlarge EC2 nodes
    • 1 machine is the master node; 5 are worker machines
    • 64-bit, 4 vCPU, 15 GB RAM each
38. Example & Demo. Dataset: Wiki Page View Stats; 20 GB of webpage view counts; 3 days' worth of data. Format: <date_time> <project_code> <page_title> <num_hits> <page_size>.
    Base RDD of all wiki pages:
       val allPages = sc.textFile("/wiki/pagecounts")
       allPages.take(10).foreach(println)
       allPages.count()
    Transformed RDD of all English pages (cached):
       val englishPages = allPages.filter(_.split(" ")(1) == "en")
       englishPages.cache()
       englishPages.count()
       englishPages.count()
39. Example & Demo. Dataset: Wiki Page View Stats (as above).
    Select date, sum(pageviews) from pagecounts group by date:
       englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(3).toInt)).reduceByKey(_+_, 1).collect.foreach(println)
    Select date, count(distinct pageURL) from pagecounts group by date:
       englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(2))).distinct().countByKey().foreach(println)
    Select distinct(datetime) from pagecounts order by datetime:
       englishPages.map(line => line.split(" ")).map(line => (line(0), 1)).distinct().sortByKey().collect().foreach(println)
40. Example & Demo. Dataset: Network Datasets (directed and bi-directed graphs):
    • One small Facebook social network: 127 nodes (friends), 1668 edges (friendships); bi-directed graph.
    • Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks); directed graph.
41. Example & Demo. Page Rank Calculation:
    • Estimates node importance.
    • Each directed link from A -> B is a vote for B from A.
    • The more links to a page, the more important the page is.
    • When a page with higher PR points to something, its vote weighs more.
    Algorithm:
    1. Start each page at a rank of 1.
    2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors.
    3. Set each page's rank to 0.15 + 0.85 × contribs.
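
Step 3, written out as the standard damped PageRank update the slide describes (a reconstruction in LaTeX notation, not taken from the slides):

    r_{t+1}(p) = 0.15 + 0.85 \sum_{q \in \mathrm{in}(p)} \frac{r_t(q)}{|\mathrm{out}(q)|}

where in(p) is the set of pages linking to p and out(q) the set of pages q links to.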
42. Example & Demo. Scala Code:
       var iters = 100
       val lines = sc.textFile("/dataset/google/edges.csv", 1)
       val links = lines.map { s =>
         val parts = s.split("\t")
         (parts(0), parts(1))
       }.distinct().groupByKey().cache()
       var ranks = links.mapValues(v => 1.0)
       for (i <- 1 to iters) {
         val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
           val size = urls.size
           urls.map(url => (url, rank / size))
         }
         ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
       }
       val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
       output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
43. (Result: 2 seconds.)
44. (Result: 38 seconds.) Top 20 pages by rank:
       Page Rank      Page URL
       761.1985177    google
       455.7028756    google/about.html
       259.6052388    google/privacy.html
       192.7257649    google/jobs/
       144.0349154    google/support
       134.1566312    google/terms_of_service.html
       130.3546324    google/intl/en/about.html
       123.4014613    google/imghp
       120.0661165    google/accounts/Login
       118.6884515    google/intl/en/options/
       112.2309539    google/preferences
       108.8375347    google/sitemap.html
       106.9724799    google/press/
       105.822426     google/language_tools
       105.1554798    google/support/toolbar/
       99.97741309    google/maps
       97.90651416    google/advanced_search
       90.7910291     google/intl/en/services/
       90.70522689    google/intl/en/ads/
       87.4353413     google/adsense/
45. Agenda Of The Talk (repeated; section divider).
46. Spark Current Users & Roadmap. Source: Apache - Powered By Spark.
47. Roadmap.
48. Conclusion:
    • Because of in-memory processing, computations are very fast; developers can write iterative algorithms without writing out a result set after each pass through the data.
    • Suitable for scenarios where sufficient memory is available in your cluster.
    • It provides an integrated framework for advanced analytics like graph processing, stream processing, machine learning etc., which simplifies integration.
    • Its community is expanding and development is happening very aggressively.
    • It's comparatively newer than Hadoop, with only a few users so far.
49. Thank You. Speaker: MANISH GUPTA. Email: manish.gupta@globallogic.com. Organized by UNICOM Trainings & Seminars Pvt. Ltd., contact@unicomlearning.com.
50. Backup Slides.
51. Spark Internal Components. (Diagram: Spark core surrounded by Operators, Scheduler, Block manager, Networking, Accumulators, Interpreter, Broadcast, Hadoop I/O, Mesos backend, Standalone backend.)
52. In-Memory. But what if I run out of memory? (Chart: iteration time in seconds vs. % of working set in memory: cache disabled = 68.8 s; 25% = 58.1 s; 50% = 40.7 s; 75% = 29.7 s; fully cached = 11.5 s.)
53. Benchmarks.
    • AMPLab performed a quantitative and qualitative comparison of 4 systems: Hive, Impala, Redshift and Shark.
    • Done on the Common Crawl Corpus dataset: 81 TB, consisting of 3 tables (page rankings, user visits, documents).
    • Data was partitioned so that each node had 25 GB of user visits, 1 GB of rankings, and 30 GB of web crawl (documents).
    Source: https://amplab.cs.berkeley.edu/benchmark/#
54. Benchmarks.
55. Benchmarks: Hardware Configuration.
56. Benchmarks:
    • Redshift outperforms for on-disk data.
    • Shark and Impala outperform Hive by 3-4x.
    • For larger result sets, Shark outperforms Impala.
57. Benchmarks:
    • Redshift columnar storage outperforms every time.
    • Shark in-memory is 2nd best in all cases.
58. Benchmarks:
    • Redshift's bigger cluster has an advantage.
    • Shark and Impala are competitive with each other.
59. Benchmarks:
    • Impala & Redshift don't have UDFs.
    • Shark outperforms Hive.
60. Roadmap.
61. Spark: in the last 6 months of 2013.