
Lightning Fast Big Data Analytics using Apache Spark



Quick Introduction of Hadoop & its Limitations
Introduction to Spark
Spark Architecture
Programming model of Spark
Demo
Spark Use Cases


Lightning Fast Big Data Analytics using Apache Spark

  1. Lightning Fast Big Data Analytics using Apache Spark. Manish Gupta, Solutions Architect – Product Engineering and Development. 30th Jan 2014, Delhi. www.unicomlearning.com | www.bigdatainnovation.org
  2. Agenda of the talk: Hadoop – A Quick Introduction; An Introduction to Spark & Shark; Spark – Architecture & Programming Model; Example & Demo; Spark Current Users & Roadmap
  4. What is Hadoop? Open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable, and flexible way (HDFS). It also provides a programming model/framework for processing these large datasets in a massively parallel, fault-tolerant, and data-location-aware fashion (MapReduce). [Diagram: Input -> Map -> Reduce -> Output]
  5. Limitations of MapReduce: [Diagram: each iteration reads its input from HDFS and writes its output back to HDFS] Slow due to replication, serialization, and disk I/O. Inefficient for: iterative algorithms (machine learning, graph & network analysis) and interactive data mining (R, Excel, ad-hoc reporting, searching).
  6. Approach: leverage memory? Memory bus >> disk & SSDs. Many datasets fit into memory: 1 TB = 1 billion records @ 1 KB. Memory capacity also follows Moore's Law: a single 8 GB stick of RAM is about $80 right now; by 2021 you'd be able to buy a 64 GB stick for the same price.
  7. Agenda of the talk: Hadoop – A Quick Introduction; An Introduction to Spark & Shark; Spark – Architecture & Programming Model; Example & Demo; Spark Current Users & Roadmap
  9. Spark: "A big data analytics cluster-computing framework written in Scala." Open sourced; originally developed in the AMPLab at UC Berkeley. Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x). Designed for running iterative algorithms & interactive analytics. Highly compatible with Hadoop's storage APIs: it can run on your existing Hadoop cluster setup. Developers can write driver programs using multiple programming languages.
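     As a hedged illustration, a minimal standalone driver program in Scala (the class name, master URL, and input path are placeholders, assuming a Spark release of roughly this era):

        import org.apache.spark.SparkContext
        import org.apache.spark.SparkContext._   // pair-RDD implicits

        object MyDriver {
          def main(args: Array[String]) {
            // "local[2]" runs Spark in-process with two threads; a cluster
            // master URL (e.g. spark://host:7077) would be used in production.
            val sc = new SparkContext("local[2]", "MyDriver")
            val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
            println("Line count: " + lines.count())
            sc.stop()
          }
        }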
  10. [Diagram: the Spark Driver (master) coordinates with a Cluster Manager; Spark Workers, co-located with HDFS Datanodes, each hold a cache and process local blocks]
  11. [Diagram: in Hadoop, every iteration reads its input from HDFS and writes its output back to HDFS]
  12. [Diagram: in Spark, the input is read from HDFS once and later iterations run against memory] Spark is not tied to the two-stage MapReduce paradigm: 1. Extract a working set 2. Cache it 3. Query it repeatedly (a sketch follows). [Chart: logistic regression running time in Hadoop vs. Spark]
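     A minimal sketch of that three-step pattern (the path and filter strings are placeholders): only the first action touches HDFS; every later query scans the cached working set in memory:

        // 1. Extract a working set ...
        val events = sc.textFile("hdfs://namenode:9000/logs/events")
        // 2. ... cache it ...
        val errors = events.filter(_.contains("ERROR")).cache()
        // 3. ... and query it repeatedly.
        val total    = errors.count()                                // first action: reads HDFS, fills the cache
        val timeouts = errors.filter(_.contains("timeout")).count()  // served from memory
        val oom      = errors.filter(_.contains("OutOfMemory")).count()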
  13. A simple analytical operation, with the equivalent SQL shown alongside:

      1) Select count(*) from pagecounts
         val pagecount = spark.textFile("/wiki/pagecounts")
         pagecount.count()

      2) Select Col1, sum(Col4) from pagecounts where Col2 = "en" group by Col1
         val englishPages = pagecount.filter(_.split(" ")(1) == "en")
         englishPages.cache()
         englishPages.count()
         val englishTuples = englishPages.map(line => line.split(" "))
         val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
         englishKeyValues.reduceByKey(_+_, 1).collect
  14. Shark: Hive on Spark = Shark. A large-scale data warehouse system, just like Apache Hive. Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs). Built on top of Spark (thus a faster execution engine). Supports in-memory materialized tables (cached tables), and cached tables use columnar storage instead of row storage:

      Row storage:      (1, ABC, 4.1)  (2, XYZ, 3.5)  (3, PPP, 6.4)
      Column storage:   (1, 2, 3)  (ABC, XYZ, PPP)  (4.1, 3.5, 6.4)
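     A toy sketch (plain Scala, not Shark's actual implementation) of why the columnar layout helps queries that touch a single column: the scan reads one compact array instead of walking every field of every row:

        // Row storage: one object per record; a scan touches every field.
        case class Row(id: Int, name: String, score: Double)
        val rows = Array(Row(1, "ABC", 4.1), Row(2, "XYZ", 3.5), Row(3, "PPP", 6.4))
        val rowSum = rows.map(_.score).sum

        // Column storage: one array per column; a scan reads only what it needs.
        val ids    = Array(1, 2, 3)
        val names  = Array("ABC", "XYZ", "PPP")
        val scores = Array(4.1, 3.5, 6.4)
        val colSum = scores.sum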
  15. [Diagram: Hive architecture: a client (CLI, JDBC) talks to the Driver, which contains the SQL Parser, Query Optimizer, and Physical Plan Execution, backed by the Metastore; execution runs as MapReduce over HDFS]
  16. [Diagram: Shark architecture: a client (CLI, JDBC) talks to the Driver, which contains the SQL Parser, Query Optimizer, Cache Manager, and Physical Plan Execution, backed by the Metastore; execution runs on Spark over HDFS]
  17. Agenda of the talk: Hadoop – A Quick Introduction; An Introduction to Spark & Shark; Spark – Architecture & Programming Model; Example & Demo; Spark Current Users & Roadmap
  19. Spark Programming Model: [Diagram: the user (developer) writes a driver program (sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache(); rdd.count(); rdd.map(…)); the SparkContext talks to the Cluster Manager, which schedules Executors on Worker Nodes; each Executor runs Tasks and holds a Cache, reading from co-located HDFS Datanodes]
  20. Spark Programming Model: RDD (Resilient Distributed Dataset): an immutable, explicitly in-memory, fault-tolerant, parallel data structure with controlled partitioning to optimize data placement, which can be manipulated using a rich set of operators.
  21. RDD Programming Interface: the programmer can perform three types of operations (a short sketch exercising all three follows):

      Transformations: create a new dataset from an existing one. Lazy in nature: they are executed only when some action is performed. Examples: map(func), filter(func), distinct().

      Actions: return a value to the driver program, or export data to a storage system, after performing a computation. Examples: count(), reduce(func), collect(), take().

      Persistence: cache datasets in memory for future operations, with the option to store on disk, in RAM, or mixed (Storage Level). Examples: persist(), cache().
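     A minimal sketch, assuming the spark-shell (where sc and the pair-RDD implicits are already in scope) and a placeholder input path:

        val lines = sc.textFile("hdfs://namenode:9000/data/words.txt")

        // Transformations are lazy: these only extend the lineage graph.
        val words     = lines.flatMap(_.split(" "))
        val longWords = words.filter(_.length > 3).distinct()

        // Persistence: keep the transformed dataset in memory across actions.
        longWords.cache()

        // Actions trigger the actual computation.
        println(longWords.count())
        longWords.take(5).foreach(println)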
  22. How Spark works: an RDD is a parallel collection with partitions. The user application creates RDDs, transforms them, and runs actions; this results in a DAG (Directed Acyclic Graph) of operators. The DAG is compiled into stages, and each stage is executed as a series of tasks (one task per partition).
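     To see the DAG behind a result for yourself, RDD.toDebugString prints the lineage of operators that the scheduler will compile into stages (a small sketch reusing the shell context from above):

        val counts = sc.textFile("hdfs://namenode:9000/data/words.txt")
                       .flatMap(_.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)

        // Prints the chain of RDDs/operators the scheduler will turn into stages.
        println(counts.toDebugString)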
  23. Example: sc.textFile("/wiki/pagecounts")   // operator: textFile; result: RDD[String]
  24. Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t"))   // operators: textFile, map; result: RDD[Array[String]]
  25. Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(R => (R(0), R(1).toInt))   // operators: textFile, map, map; result: RDD[(String, Int)]
  26. Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(R => (R(0), R(1).toInt)).reduceByKey(_+_, 3)   // adds reduceByKey; result: RDD[(String, Int)]
  27. Example: sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(R => (R(0), R(1).toInt)).reduceByKey(_+_, 3).collect()   // full chain: textFile -> map -> map -> reduceByKey -> collect; result: Array[(String, Int)]
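     If the aggregated result were too large to return to the driver, a save action would typically replace collect(); a hedged variant of the same chain (the output path is a placeholder):

        sc.textFile("/wiki/pagecounts")
          .map(line => line.split("\t"))
          .map(R => (R(0), R(1).toInt))
          .reduceByKey(_ + _, 3)
          .saveAsTextFile("hdfs://namenode:9000/output/pagecount-sums")  // writes one part-file per partition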
  28. Execution plan: [Diagram: textFile -> map -> map -> reduceByKey -> collect] The logical plan above is compiled by the DAG scheduler into a physical plan made up of stages, as follows…
  29. Execution plan: [Diagram: the same chain split into Stage 1 (textFile, map, map, the partial reduceByKey) and Stage 2 (the final reduceByKey, collect)] Stages are sequences of RDDs that don't have a shuffle in between.
  30. Stage 1: 1. Read the HDFS split 2. Apply both maps 3. Start the partial reduce 4. Write shuffle data. Stage 2: 1. Read shuffle data 2. Final reduce 3. Send the result to the driver program.
  31. Stage execution: [Diagram: Stage 1 fans out into one task per partition: Task 1, Task 2, …] Spark creates a task for each partition in the new RDD, serializes the tasks, and schedules and ships them to the slaves. All of this happens internally (you don't need to do anything).
  32. Task execution: a task is the fundamental unit of execution in Spark. [Diagram: over time, each task fetches its input (from HDFS or an RDD), executes, and writes its output (to HDFS, an RDD, or intermediate shuffle output)]
  33. [Diagram: a Spark Executor (slave) runs tasks on several cores; on each core the fetch-input, execute-task, and write-output phases of successive tasks are pipelined]
  34. Summary of components: Task: the fundamental unit of execution in Spark. Stage: a set of tasks that run in parallel. DAG: the logical graph of RDD operations. RDD: a parallel dataset with partitions.
  35. Agenda of the talk: Hadoop – A Quick Introduction; An Introduction to Spark & Shark; Spark – Architecture & Programming Model; Example & Demo; Spark Current Users & Roadmap
  37. Example & Demo. Cluster details: 6 m1.xlarge EC2 nodes (1 master node, 5 worker nodes); each 64-bit with 4 vCPUs and 15 GB RAM.
  38. Dataset: Wiki page view stats; 20 GB of webpage view counts; 3 days' worth of data. Record format: <date_time> <project_code> <page_title> <num_hits> <page_size>

      Base RDD of all wiki pages:
        val allPages = sc.textFile("/wiki/pagecounts")
        allPages.take(10).foreach(println)
        allPages.count()

      Transformed RDD of all English pages (cached):
        val englishPages = allPages.filter(_.split(" ")(1) == "en")
        englishPages.cache()
        englishPages.count()    // first count reads HDFS and fills the cache
        englishPages.count()    // second count is served from memory
  39. Same dataset, with each query shown as SQL and the equivalent RDD code:

      Select date, sum(pageviews) from pagecounts group by date
        englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(3).toInt)).reduceByKey(_+_, 1).collect.foreach(println)

      Select date, count(distinct pageURL) from pagecounts group by date
        englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(2))).distinct().countByKey().foreach(println)

      Select distinct(datetime) from pagecounts order by datetime
        englishPages.map(line => line.split(" ")).map(line => (line(0), 1)).distinct().sortByKey().collect().foreach(println)
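     One more query in the same spirit, as a hedged sketch ("top 10 English pages by total hits"): key by page title, sum, then flip the pair so sortByKey can order by count:

        englishPages.map(line => line.split(" "))
                    .map(line => (line(2), line(3).toInt))    // (page_title, num_hits)
                    .reduceByKey(_ + _)
                    .map(pair => (pair._2, pair._1))          // flip to (total_hits, page_title)
                    .sortByKey(false)                         // descending
                    .take(10)
                    .foreach(println)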
  40. Dataset: network datasets (directed and bi-directed graphs). A small Facebook social network: 127 nodes (friends), 1668 edges (friendships), bi-directed. Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks), directed.
  41. PageRank calculation: estimates node importance. Each directed link from A -> B is a vote for B from A; the more links a page receives, the more important it is; and when a page with a higher PageRank points to something, its vote weighs more. Algorithm: 1. Start each page at a rank of 1. 2. On each iteration, have page p contribute (rank of p) / (number of neighbors of p) to its neighbors. 3. Set each page's rank to 0.15 + 0.85 × contribs.
  42. Scala code:

        var iters = 100
        val lines = sc.textFile("/dataset/google/edges.csv", 1)
        val links = lines.map { s =>
          val parts = s.split("\t")
          (parts(0), parts(1))
        }.distinct().groupByKey().cache()
        var ranks = links.mapValues(v => 1.0)
        for (i <- 1 to iters) {
          val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map(url => (url, rank / size))
          }
          ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
        }
        // Sort pages by rank (descending) and print the top 20.
        val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
        output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
  43. 2 seconds
  44. 38 seconds. Top 20 pages by PageRank:

      Page Rank      Page URL
      761.1985177    google
      455.7028756    google/about.html
      259.6052388    google/privacy.html
      192.7257649    google/jobs/
      144.0349154    google/support
      134.1566312    google/terms_of_service.html
      130.3546324    google/intl/en/about.html
      123.4014613    google/imghp
      120.0661165    google/accounts/Login
      118.6884515    google/intl/en/options/
      112.2309539    google/preferences
      108.8375347    google/sitemap.html
      106.9724799    google/press/
      105.822426     google/language_tools
      105.1554798    google/support/toolbar/
      99.97741309    google/maps
      97.90651416    google/advanced_search
      90.7910291     google/intl/en/services/
      90.70522689    google/intl/en/ads/
      87.4353413     google/adsense/
  45. Agenda of the talk: Hadoop – A Quick Introduction; An Introduction to Spark & Shark; Spark – Architecture & Programming Model; Example & Demo; Spark Current Users & Roadmap
  46. Spark Current Users & Roadmap. Source: Apache - Powered By Spark
  47. Roadmap
  48. Conclusion: Because of in-memory processing, computations are very fast; developers can write iterative algorithms without writing out a result set after each pass through the data. Suitable for scenarios where sufficient memory is available in your cluster. Provides an integrated framework for advanced analytics such as graph processing, stream processing, and machine learning, which simplifies integration. Its community is expanding and development is happening very aggressively. It is comparatively newer than Hadoop and still has only a few users.
  49. Thank you. Speaker: Manish Gupta. Email: manish.gupta@globallogic.com. Organized by UNICOM Trainings & Seminars Pvt. Ltd. (contact@unicomlearning.com). www.unicomlearning.com | www.bigdatainnovation.org
  50. Backup Slides
  51. Spark internal components: Spark core (operators, scheduler, block manager, networking, accumulators, broadcast), interpreter, Hadoop I/O, Mesos backend, standalone backend.
  52. In-memory: but what if I run out of memory? Performance degrades gracefully instead of failing:

      % of working set in memory    Iteration time (s)
      Cache disabled                68.8
      25%                           58.1
      50%                           40.7
      75%                           29.7
      Fully cached                  11.5
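     The knob behind this chart is the storage level: a hedged sketch (StorageLevel is part of Spark's public API; englishPages is the RDD from the earlier demo) that lets partitions which don't fit in RAM spill to disk instead of being recomputed:

        import org.apache.spark.storage.StorageLevel

        // The default cache() is MEMORY_ONLY; this keeps what fits in memory
        // and spills the remaining partitions to local disk.
        englishPages.persist(StorageLevel.MEMORY_AND_DISK)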
  53. Benchmarks: the AMPLab performed quantitative and qualitative comparisons of 4 systems: Hive, Impala, Redshift, and Shark. Done on the Common Crawl corpus dataset (81 TB), which consists of 3 tables: page rankings, user visits, and documents. Data was partitioned so that each node had 25 GB of user visits, 1 GB of rankings, and 30 GB of web crawl (documents). Source: https://amplab.cs.berkeley.edu/benchmark/#
  54. Benchmarks
  55. Benchmarks: hardware configuration
  56. Benchmarks: Redshift outperforms the others for on-disk data; Shark and Impala outperform Hive by 3-4x; for larger result sets, Shark outperforms Impala.
  57. Benchmarks: Redshift's columnar storage outperforms every time; Shark in-memory is second best in all cases.
  58. Benchmarks: Redshift's bigger cluster has an advantage; Shark and Impala are competitive with each other.
  59. Benchmarks: Impala & Redshift don't support UDFs; Shark outperforms Hive.
  60. Roadmap
  61. Spark in the last 6 months of 2013
