
Introduction to Apache Spark


An introduction to Apache Spark, presented at HadoopCon Taiwan 2015.
If there are any copyright issues, please let me know. Thanks a lot.



  1. Introduction to Apache Spark. Hubert 范姜 (@hubert), HadoopCon Taiwan 2015, Sep. 19, 2015, Taipei. Photo from http://quotesgram.com/spark-quotes/
  2. Who are we?
     • 亦思科技
     • Located in the Hsinchu Science Park
     • Our main customers have historically been the park's major manufacturers
     • 2010.07: Approved to enter the Hsinchu Science Park on an investment plan for developing cloud computing software tools
     • 2011: Began industry-academia collaboration with Prof. Yeh-Ching Chung of the NTHU Computer Science department
     • One of the few specialist companies invited to the IEEE CloudCom international cloud computing conference
     • One of the few IT vendors with hands-on experience helping customers build Hadoop systems
     • 2012.01: JackHare (ANSI SQL JDBC Driver)
     • 2012.11: HareDB HBase Client
     • 2013.08: Hare (high-speed query in HBase)
     • 2013.12: Science Park Innovative Product Award
     • 2014.12: Information Month Innovation Gold Award
  3. HareDB Architecture. (Diagram: HareDB Core sits on Hadoop, HBase, Hive, and Spark; components include an HBase Client, an HDFS Client, Solr Cloud indexing, security via Kerberos and Sentry, a RESTful service, JDBC/ODBC access, and a cluster monitor.)
  4. WHAT IS SPARK?
  5. What is Apache Spark?
     • An open source cluster computing framework.
     • In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100x faster for certain applications.
  6. Databricks
     • Founded in late 2013
     • By the creators of Apache Spark
     • Original team from the UC Berkeley AMPLab (Algorithms, Machines, People)
     • Contributed more than 75% of the code added to Spark in 2014
  7. World Record. From Spark Summit 2015, Matei Zaharia
  8. Spark is hot! From Spark Summit 2015, Matei Zaharia
  9. WILL SPARK REPLACE HADOOP?
  10. From Spark Summit 2015, Mike Olson (Cloudera), http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
  11. From http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/ and Spark Summit 2015, Arun C. Murthy
  12. From Spark Summit 2015, Anil Gadre (MapR), http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
  13. http://www.slideshare.net/SparkSummit/intro-to-spark-development
  14. http://www.slideshare.net/SparkSummit/intro-to-spark-development
  15. http://www.slideshare.net/SparkSummit/intro-to-spark-development
  16. Spark Software Stack. (Diagram, bottom to top:)
     • Resource Virtualization: Mesos, Hadoop YARN
     • Storage: HDFS, S3
     • Processing Engine: Spark Core
     • Access and Interfaces: Spark Streaming, Spark SQL, Spark R, GraphX, MLlib, Splash, ML Pipelines, BlinkDB
  17. VS
  18. 10x ~ 100x
  19. Physical Bottleneck. (Diagram of throughput figures: 300 MB/s, 600 MB/s, and 10 GB/s for the storage tiers, versus the network: 1 Gb/s = 125 MB/s between nodes in the same rack, 0.1 Gb/s = 12.5 MB/s to nodes in another rack.) Two takeaways: 1. keep data in memory; 2. keep data local.
  20. Spark Execution Flow. (Diagram of the numbered execution steps.) http://www.codeproject.com/Articles/1023037/Introduction-to-Apache-Spark
  21. RDD
  22. What is RDD?
     “The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”
     http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
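To make the quoted definition concrete, here is a minimal PySpark sketch (not from the deck), assuming an existing SparkContext `sc` such as the one the interactive shell provides:

    # A minimal sketch of caching and lineage, assuming an existing SparkContext `sc`.
    numbers = sc.parallelize(range(1, 1001), 4)    # an RDD split into 4 partitions
    squares = numbers.map(lambda x: x * x)         # lineage recorded: numbers -> map -> squares
    squares.cache()                                # keep partitions in memory once computed
    print(squares.count())                         # 1000; the first action materializes the RDD
    print(squares.take(3))                         # [1, 4, 9]; served from the cached partitions
    # If a cached partition is lost, Spark rebuilds just that partition from its lineage.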
  23. Interactive Shell (Scala & Python only)
  24. Read From TextFile
     # Read a local text file in Python
     linesRDD = sc.textFile("/path/to/README.md")
     // Read a local text file in Scala
     val linesRDD = sc.textFile("/path/to/README.md")
     // Read a local text file in Java
     JavaRDD<String> lines = sc.textFile("/path/to/README.md");
  25. Where is RDD? (Diagram: items item-1 through item-25 spread across RDD partitions living on several executor/worker JVMs.) More partitions = more parallelism. http://www.slideshare.net/SparkSummit/intro-to-spark-development
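A small illustrative snippet (assumed, not from the deck) for inspecting and raising the partition count; the file path is a placeholder:

    lines = sc.textFile("/path/to/README.md", minPartitions=8)  # ask for at least 8 partitions
    print(lines.getNumPartitions())    # each partition is processed by one task
    more = lines.repartition(16)       # more partitions = more tasks can run in parallel
    print(more.getNumPartitions())     # 16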
  26. logLinesRDD (the input/base RDD) is narrowed with .filter() to errorsRDD. (Diagram: partitions holding Error, Warn, and Info log records; the filter keeps only the Error records.) http://www.slideshare.net/SparkSummit/intro-to-spark-development
  27. errorsRDD is shrunk with .coalesce(2) into cleanedRDD, and .collect() returns the Error records to the Driver. http://www.slideshare.net/SparkSummit/intro-to-spark-development
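The two slides above correspond roughly to the following PySpark sketch; the log path and the exact filter predicate are assumptions for illustration:

    logLinesRDD = sc.textFile("/path/to/log.txt")    # input/base RDD; path assumed
    errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))
    cleanedRDD = errorsRDD.coalesce(2)               # shrink to 2 partitions without a shuffle
    result = cleanedRDD.collect()                    # action: ship all rows back to the driver

collect() pulls every remaining record into the driver, so it is only safe when the result is small.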
  28-33. (Diagram sequence: calling .collect() makes the driver execute the job; Spark walks the lineage logLinesRDD → .filter() → errorsRDD → .coalesce(2, shuffle=False) → cleanedRDD, runs the work on the executors, and ships the resulting data back to the driver.) http://www.slideshare.net/SparkSummit/intro-to-spark-development
  34-35. (Diagram: cleanedRDD is filtered again with .filter() into errorMsg1RDD, keeping only the msg1 Error records; actions such as .collect(), .saveToCassandra(), and .count() are then launched, the count here being 5.) http://www.slideshare.net/SparkSummit/intro-to-spark-development
  36. Lifecycle of a Spark program
     1. Create input RDDs from external data, or parallelize a collection in your driver program.
     2. Lazily transform them to define new RDDs using transformations like filter() or map().
     3. Ask Spark to cache() any intermediate RDDs that will need to be reused.
     4. Launch actions such as count() and collect() to kick off a parallel computation, which Spark then optimizes and executes.
     http://www.slideshare.net/SparkSummit/intro-to-spark-development
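A compact sketch of all four steps in PySpark (paths and predicates are placeholders):

    lines = sc.textFile("/path/to/log.txt")          # 1. create an input RDD
    errors = lines.filter(lambda l: "Error" in l)    # 2. lazy transformation: nothing runs yet
    errors.cache()                                   # 3. mark the intermediate RDD for reuse
    print(errors.count())                            # 4. the first action triggers computation
    print(errors.take(10))                           # a second action reuses the cached data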
  37. Transformations: map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), cartesian(), pipe(), coalesce(), repartition(), partitionBy(), ... http://www.slideshare.net/SparkSummit/intro-to-spark-development
  38. Actions: reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), ... http://www.slideshare.net/SparkSummit/intro-to-spark-development
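Transformations only describe a new RDD; nothing executes until an action runs. A tiny illustration (assumed example, not from the deck):

    rdd = sc.parallelize([1, 2, 3, 4, 5])
    doubled = rdd.map(lambda x: x * 2)               # transformation: builds lineage only
    evens = doubled.filter(lambda x: x % 4 == 0)     # still lazy, still no job launched
    print(evens.collect())                           # action: the chain runs now -> [4, 8]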
  39. SPARK SQL
  40. What is Spark SQL
     sqlCtx = HiveContext(sc)
     results = sqlCtx.sql("SELECT * FROM people")
     names = results.map(lambda p: p.name)
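For the query above to work, a `people` table must exist. A hedged sketch of one way to provide it in Spark 1.4+, using a placeholder JSON file and the then-current registerTempTable() API:

    from pyspark.sql import SQLContext

    sqlCtx = SQLContext(sc)
    # people.json is a placeholder: one JSON object per line, e.g. {"name": "Ann", "age": 30}
    df = sqlCtx.read.json("/path/to/people.json")
    df.registerTempTable("people")                   # expose the DataFrame to SQL by name
    adults = sqlCtx.sql("SELECT name FROM people WHERE age >= 18")
    print(adults.collect())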
  41. NEW FEATURES IN 1.4 AND 1.5
  42. DataFrames http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
  43. DataFrames http://www.slideshare.net/SparkSummit/reynold-xin
  44. DataFrames http://www.slideshare.net/SparkSummit/reynold-xin
  45. DataFrames http://www.slideshare.net/SparkSummit/reynold-xin
  46. DataFrames http://www.slideshare.net/SparkSummit/reynold-xin
  47. DataFrames http://www.slideshare.net/SparkSummit/reynold-xin
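The DataFrames slides here are screenshots; as a rough sketch of the 1.4-era DataFrame API, reusing the `sqlCtx` from above (file path and column names are hypothetical):

    df = sqlCtx.read.json("/path/to/people.json")         # schema inferred from the JSON records
    df.printSchema()
    df.filter(df.age > 21).groupBy("age").count().show()  # expression-based, optimizable query plan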
  48. Spark R http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
  49. Spark R
  50. Spark R
  51. Machine Learning Pipelines http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
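A minimal sketch of the spark.ml Pipelines API as of Spark 1.4, closely following the pattern in the official docs; `training` and `test` are assumed DataFrames with `text` (and, for training, `label`) columns:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])  # chain the stages into one estimator
    model = pipeline.fit(training)         # fits/transforms each stage in order
    predictions = model.transform(test)    # replays the same stages on new data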
  52. External Data Sources http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
  53. External Data Sources http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
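The data sources API reads and writes any registered format through one interface. An illustrative sketch (paths assumed; the spark-csv format string is the community package's documented name):

    df = sqlCtx.read.format("json").load("/path/to/input.json")   # read via a named source
    df.write.format("parquet").save("/path/to/output.parquet")    # write via another source
    # Third-party sources plug into the same API, e.g. the spark-csv package:
    # csv = sqlCtx.read.format("com.databricks.spark.csv").load("/path/to/file.csv")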
  54. Tungsten http://www.slideshare.net/SparkSummit/reynold-xin
  55. Tungsten http://www.slideshare.net/SparkSummit/reynold-xin
  56. Tungsten http://www.slideshare.net/SparkSummit/reynold-xin
  57. Tungsten http://www.slideshare.net/SparkSummit/reynold-xin
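Tungsten is an engine-level change rather than a user-facing API. In Spark 1.5 it is enabled by default; the flag below is the 1.5-era configuration name (version-specific, treat as an assumption):

    from pyspark import SparkConf, SparkContext

    # spark.sql.tungsten.enabled is already true by default in 1.5; shown here only
    # to illustrate how the feature is toggled.
    conf = SparkConf().set("spark.sql.tungsten.enabled", "true")
    sc = SparkContext(conf=conf)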
  58. All New Spark http://www.slideshare.net/SparkSummit/reynold-xin
  59. Spark 1.5
     • Where 1.4 centered on new user-facing APIs, a large part of Spark 1.5 focuses on under-the-hood changes to improve Spark's performance, usability, and operational stability.
     • Spark 1.5 delivers the first phase of Project Tungsten.
     Reference: https://databricks.com/blog/2015/08/18/spark-1-5-preview-now-available-in-databricks.html
  60. Thank you
