
Apache Spark in Action


Thailand Big Data Challenge #2/2016: Apache Spark in Action, June 2016



  1. Thailand Big Data Challenge #2/2016: Apache Spark in Action, 18-19 June 2016, Dr. Thanachart Numnonda, IMC Institute, thanachart@imcinstitute.com
  2. Thanachart Numnonda, thanachart@imcinstitute.com, June 2016, Apache Spark in Action. Outline ● Launch an Azure instance ● Install Docker on Ubuntu ● Pull Cloudera QuickStart into Docker ● HDFS ● Spark ● Spark SQL ● Spark Streaming
  3. Cloudera VM: This lab will use an Azure virtual server to install Cloudera. However, you can also use the Cloudera QuickStart VM, which can be downloaded from: http://www.cloudera.com/content/www/en-us/downloads.html
  4. Hadoop Ecosystem
  5. Spark Streaming
  6. Hands-On: Launch a virtual server on Microsoft Azure (Note: you can skip this section if you use your own computer or another cloud service)
  7. Sign up for Visual Studio Dev Essentials to get free Azure credit
  8. Sign in to the Azure Portal
  9. (screenshot)
  10. Virtual Server: This lab will use an Azure virtual server to install the Cloudera QuickStart Docker image, with the following configuration: Ubuntu Server 14.04 LTS, DS3_V2 Standard (4 cores, 14 GB memory, 28 GB SSD)
  11. Select New => Virtual Machines => Virtual Machines
  12. On the Basics page, enter: ● a name for the VM ● a username for the admin user ● the authentication type, set to password ● a password ● a resource group name
  13. (screenshot)
  14. Choose DS3_v2 Standard
  15. (screenshot)
  16. (screenshot)
  17. (screenshot)
  18. Set the inbound port for Hue (8888)
  19. (screenshot)
  20. (screenshot)
  21. (screenshot)
  22. Get the IP address
  23. Connect to the instance from Mac/Linux: ssh -i ~/.ssh/id_rsa imcinstitute@104.210.146.182
  24. Connect to the instance from Windows using PuTTY
  25. Hands-On: Installing Cloudera QuickStart in a Docker Container
  26. Installation Steps ● Update the OS ● Install Docker ● Pull Cloudera QuickStart ● Run Cloudera QuickStart ● Run Cloudera Manager
  27. Update the OS (Ubuntu) ● Command: sudo apt-get update
  28. Docker Installation ● Command: sudo apt-get install docker.io
  29. Pull Cloudera QuickStart ● Command: sudo docker pull cloudera/quickstart:latest
  30. Show Docker images ● Command: sudo docker images
  31. Run Cloudera QuickStart ● Command: sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i [OPTIONS] [IMAGE] /usr/bin/docker-quickstart ● Example: sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart
  32. Docker commands: ● docker images ● docker ps ● docker attach <id> ● docker kill <id> ● Exit from the container: exit (exits and kills the running container) or Ctrl-P, Ctrl-Q (detaches without killing the running container)
  33. Log in to Hue: http://104.210.146.182:8888
  34. (screenshot)
  35. Hands-On: Importing/Exporting Data to HDFS
  36. HDFS ● Default storage for the Hadoop cluster ● Data is distributed and replicated over multiple machines ● Designed to handle very large files with streaming data access patterns ● NameNode/DataNode ● Master/slave architecture (1 master, n slaves) ● Designed for large blocks (64 MB by default, but configurable) spread across all the nodes
  37. HDFS Architecture (Source: Hadoop, Shashwat Shriparv)
  38. Data Replication in HDFS (Source: Hadoop, Shashwat Shriparv)
  39. How does HDFS work? (Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi)
  40. How does HDFS work? (continued)
  41. How does HDFS work? (continued)
  42. How does HDFS work? (continued)
  43. How does HDFS work? (continued)
  44. Review files in HDFS using the File Browser
  45. Create new directories named input and output
  46. (screenshot)
  47. Upload a local file to HDFS
  48. (screenshot)
  49. Hands-On: Connect to a master node via SSH
  50. SSH login to a master node
  51. Hadoop syntax for HDFS
  52. Install wget ● Command: yum install wget
  53. Download an example text file. Make your own directory on the master node to avoid mixing with others: $ mkdir guest1 $ cd guest1 $ wget https://s3.amazonaws.com/imcbucket/input/pg2600.txt
  54. Upload data to Hadoop: $ hadoop fs -ls /user/cloudera/input $ hadoop fs -rm /user/cloudera/input/* $ hadoop fs -put pg2600.txt /user/cloudera/input/ $ hadoop fs -ls /user/cloudera/input (Note: if you are logged in as ubuntu, you need to use sudo to switch user to hdfs)
  55. Lecture: Understanding Spark
  56. Introduction ● A fast and general engine for large-scale data processing ● An open source big data processing framework built around speed, ease of use, and sophisticated analytics ● Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and 10 times faster even when running on disk
  57. What is Spark? ● A framework for distributed processing ● In-memory, fault-tolerant data structures ● Flexible APIs in Scala, Java, Python, SQL, and R ● Open source
  58. Why Spark? ● Handles petabytes of data ● Significantly faster than MapReduce ● Simple and intuitive APIs ● General framework: runs anywhere, handles (almost) any I/O, interoperable libraries for specific use cases
  59. (Source: Jump Start into Apache Spark and Databricks)
  60. Spark: History ● Started at the AMPLab, UC Berkeley ● Created by Matei Zaharia (PhD thesis) ● Maintained by the Apache Software Foundation ● Commercial support by Databricks
  61. (screenshot)
  62. Spark Platform
  63. (screenshot)
  64. Spark Platform (Source: MapR Academy)
  65. (Source: MapR Academy)
  66. (Source: TRAINING Intro to Apache Spark, Brian Clapper)
  67. (Source: Jump Start into Apache Spark and Databricks)
  68. (Source: Jump Start into Apache Spark and Databricks)
  69. (Source: Jump Start into Apache Spark and Databricks)
  70. (Source: Jump Start into Apache Spark and Databricks)
  71. What is an RDD? ● Resilient: if the data in memory (or on a node) is lost, it can be recreated ● Distributed: data is chunked into partitions and stored in memory across the cluster ● Dataset: initial data can come from a table or be created programmatically
  72. RDD ● Fault tolerant ● Immutable ● Three methods for creating an RDD: parallelizing an existing collection, referencing a dataset, or transforming an existing RDD ● Types of files supported: text files, SequenceFiles, Hadoop InputFormat
  73. RDD Creation: hdfsData = sc.textFile("hdfs://data.txt") (Source: PySpark: A brain-friendly introduction)
  74. RDD: Operations ● Transformations: lazy (not computed immediately) ● Actions: the transformed RDD is recomputed each time an action runs on it (by default)
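The lazy-transformation idea can be mimicked in plain Python with generators: nothing is computed until a terminal operation (the "action") consumes the pipeline. This is only an analogy to build intuition, not Spark code:

```python
# Lazy "transformations": generator expressions chain a pipeline
# without touching the data yet.
data = range(1, 6)                           # stand-in for an RDD
squared = (x * x for x in data)              # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)   # like .filter(...)

# The "action" finally forces evaluation, pulling every element
# through the whole pipeline in one pass.
result = sum(evens)                          # like a reduce-style action
print(result)                                # 4 + 16 = 20
```

As in Spark, nothing runs when `squared` and `evens` are defined; only the final `sum` triggers the computation.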
  75. Directed Acyclic Graph (DAG)
  76. (screenshot)
  77. What happens when an action is executed (Source: Spark Fundamentals I, Big Data University)
  78. What happens when an action is executed (continued)
  79. What happens when an action is executed (continued)
  80. What happens when an action is executed (continued)
  81. What happens when an action is executed (continued)
  82. What happens when an action is executed (continued)
  83. Spark: Transformations
  84. Spark: Transformations (continued)
  85. Single-RDD Transformations (Source: PySpark: A brain-friendly introduction)
  86. Multiple-RDD Transformations (Source: PySpark: A brain-friendly introduction)
  87. Pair-RDD Transformations (Source: PySpark: A brain-friendly introduction)
  88. Spark: Actions
  89. Spark: Actions (continued)
  90. Spark: Persistence
  91. Accumulators ● Similar to a MapReduce "Counter" ● A global variable to track metrics about your Spark program for debugging ● Reasoning: executors do not communicate with each other ● Values are sent back to the driver
  92. Broadcast Variables ● Similar to the MapReduce "Distributed Cache" ● Sends read-only values to worker nodes ● Great for lookup tables, dictionaries, etc.
  93. Hands-On: Spark Programming
  94. Functional tools in Python ● map ● filter ● reduce ● lambda ● itertools (chain gives flatMap-like behavior)
  95. map >>> a = [1,2,3] >>> def add1(x): return x+1 >>> map(add1, a) Result: [2,3,4]
  96. filter >>> a = [1,2,3,4] >>> def isOdd(x): return x%2==1 >>> filter(isOdd, a) Result: [1,3]
  97. reduce >>> a = [1,2,3,4] >>> def add(x,y): return x+y >>> reduce(add, a) Result: 10
  98. lambda >>> (lambda x: x + 1)(3) Result: 4 >>> map((lambda x: x + 1), [1,2,3]) Result: [2,3,4]
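The interactive examples above show Python 2 behavior. Under Python 3, `map` and `filter` return lazy iterators (wrap them in `list(...)` to see the values) and `reduce` has moved to `functools`. A quick sketch of the same three calls:

```python
from functools import reduce  # reduce is no longer a builtin in Python 3

a = [1, 2, 3, 4]
print(list(map(lambda x: x + 1, a)))          # [2, 3, 4, 5]
print(list(filter(lambda x: x % 2 == 1, a)))  # [1, 3]
print(reduce(lambda x, y: x + y, a))          # 10
```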
  99. Exercises
  100. Start spark-shell: $ spark-shell
  101. Testing the SparkContext: scala> sc
  102. Spark Program in Scala: WordCount scala> val file = sc.textFile("hdfs:///user/cloudera/input/pg2600.txt") scala> val wc = file.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) scala> wc.saveAsTextFile("hdfs:///user/cloudera/output/wordcountScala")
  103. (screenshot)
  104. WordCount output
  105. Spark Program in Python: WordCount $ pyspark >>> from operator import add >>> file = sc.textFile("hdfs:///user/cloudera/input/pg2600.txt") >>> wc = file.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add) >>> wc.saveAsTextFile("hdfs:///user/cloudera/output/wordcountPython")
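To see what the flatMap/map/reduceByKey chain computes without a cluster, the same word count can be sketched in plain Python over a small in-memory "file" (the sample lines below are invented for illustration):

```python
from collections import defaultdict

lines = ["to be or not to be", "to be is to do"]  # stand-in for the text file

# flatMap: split every line into words, flattening into one stream
words = [w for line in lines for w in line.split(' ')]

# map + reduceByKey: pair each word with 1, then sum the 1s per key
counts = defaultdict(int)
for w in words:
    counts[w] += 1

print(counts['to'])  # 4
print(counts['be'])  # 3
```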
  106. Project: Flight
  107. Flight Details Data: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
  108. Flight Details Data: http://stat-computing.org/dataexpo/2009/the-data.html
  109. Data Description
  110. Snapshot of the Dataset
  111. FiveThirtyEight: http://projects.fivethirtyeight.com/flights/
  112. Spark Program: Upload Flight Delay Data to HDFS: $ wget https://s3.amazonaws.com/imcbucket/data/flights/2008.csv $ hadoop fs -put 2008.csv /user/cloudera/input
  113. Spark Program: Navigating Flight Delay Data >>> airline = sc.textFile("hdfs:///user/cloudera/input/2008.csv") >>> airline.take(2)
  114. Spark Program: Preparing Data >>> header_line = airline.first() >>> header_list = header_line.split(',') >>> airline_no_header = airline.filter(lambda row: row != header_line) >>> airline_no_header.first() >>> def make_row(row): ... row_list = row.split(',') ... d = dict(zip(header_list, row_list)) ... return d ... >>> airline_rows = airline_no_header.map(make_row) >>> airline_rows.take(5)
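The make_row helper is plain Python, so its effect can be checked without Spark. The abbreviated header and sample row below are invented stand-ins, not real lines from 2008.csv:

```python
header_list = "Year,Month,UniqueCarrier,ArrDelay".split(',')  # abbreviated header

def make_row(row):
    # Zip the header names against the row's fields to get a dict per record
    row_list = row.split(',')
    return dict(zip(header_list, row_list))

rec = make_row("2008,1,WN,-14")
print(rec['UniqueCarrier'])  # WN
print(rec['ArrDelay'])       # -14
```

Each CSV line thus becomes a dictionary keyed by column name, which is what the later slides index into with row['UniqueCarrier'] and row['ArrDelay'].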
  115. Spark Program: Define the convert_float function >>> def convert_float(value): ... try: ... x = float(value) ... return x ... except ValueError: ... return 0 ... >>>
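A quick sanity check of the helper above in plain Python: the flight data marks missing delays with non-numeric strings such as 'NA', which float() rejects, so the function maps them to 0:

```python
def convert_float(value):
    # Return the numeric value, or 0 when the field is not a number (e.g. 'NA')
    try:
        return float(value)
    except ValueError:
        return 0

print(convert_float('12.0'))  # 12.0
print(convert_float('NA'))    # 0
```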
  116. Spark Program: Finding the best/worst airlines >>> carrier_rdd = airline_rows.map(lambda row: (row['UniqueCarrier'], convert_float(row['ArrDelay']))) >>> carrier_rdd.take(2)
  117. Spark Program: Finding the best/worst airlines (continued) >>> mean_delays_dest = carrier_rdd.groupByKey().mapValues(lambda delays: sum(delays.data)/len(delays.data)) >>> mean_delays_dest.sortBy(lambda t: t[1], ascending=True).take(10) >>> mean_delays_dest.sortBy(lambda t: t[1], ascending=False).take(10)
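The groupByKey/mapValues step reduces each carrier's delays to their mean. The same computation sketched in plain Python, on a few invented (carrier, delay) pairs:

```python
from collections import defaultdict

pairs = [('WN', 10.0), ('WN', 20.0), ('AA', 5.0), ('AA', -5.0)]  # toy data

# groupByKey: collect all delays per carrier
groups = defaultdict(list)
for carrier, delay in pairs:
    groups[carrier].append(delay)

# mapValues: mean of each carrier's group
mean_delays = {c: sum(d) / len(d) for c, d in groups.items()}

# sortBy the delay value, ascending = best airlines first
print(sorted(mean_delays.items(), key=lambda t: t[1]))
```

Sorting ascending surfaces the carriers with the lowest mean arrival delay (the "best"); descending surfaces the worst.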
  118. Spark SQL
  119. DataFrame ● A distributed collection of rows organized into named columns ● An abstraction for selecting, filtering, aggregating, and plotting structured data ● Previously => SchemaRDD
  120. Spark SQL ● Create and run Spark programs faster: write less code, read less data, let the optimizer do the hard work
  121. (Source: Jump Start into Apache Spark and Databricks)
  122. Spark SQL
  123. Preparing a Large Dataset: http://grouplens.org/datasets/movielens/
  124. MovieLens Dataset 1) wget http://files.grouplens.org/datasets/movielens/ml-100k.zip 2) yum install unzip 3) unzip ml-100k.zip 4) more ml-100k/u.user
  125. Moving the dataset to HDFS 1) cd ml-100k 2) hadoop fs -mkdir /user/cloudera/movielens 3) hadoop fs -put u.user /user/cloudera/movielens 4) hadoop fs -ls /user/cloudera/movielens
  126. Spark SQL with MovieLens (after uploading the data to HDFS) $ pyspark --packages com.databricks:spark-csv_2.10:1.2.0 >>> df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('hdfs:///user/cloudera/movielens/u.user') >>> df.registerTempTable('user') >>> sqlContext.sql("SELECT * FROM user").collect()
  127. (screenshot)
  128. Spark Streaming
  129. Stream Processing Architecture (Source: MapR Academy)
  130. Spark Streaming Architecture (Source: MapR Academy)
  131. Processing Spark DStreams (Source: MapR Academy)
  132. Use Case: Time Series Data (Source: MapR Academy)
  133. Use Case (Source: http://www.insightdataengineering.com/)
  134. Start spark-shell with extra memory
  135. WordCount using Spark Streaming scala> :paste import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.storage.StorageLevel import StorageLevel._ import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ val ssc = new StreamingContext(sc, Seconds(2)) val lines = ssc.socketTextStream("localhost", 8585, MEMORY_ONLY) val wordsFlatMap = lines.flatMap(_.split(" ")) val wordsMap = wordsFlatMap.map(w => (w, 1)) val wordCount = wordsMap.reduceByKey((a, b) => a + b) wordCount.print ssc.start
  136. Run the netcat server in another window
  137. (screenshot)
  138. Hadoop + Spark (Source: MapR Academy)
  139. Challenge: Dataset
  140. NYC's Taxi Trip Data: http://www.andresmh.com/nyctaxitrips/
  141. NYC Taxi: A Day in the Life: http://nyctaxi.herokuapp.com/
  142. Recommended Books
  143. (screenshot)
  144. Thank you! www.imcinstitute.com www.facebook.com/imcinstitute
