Intro to Apache Spark by Marco Vasquez

Marco Vasquez | Data Scientist | MapR Technologies

Presentation given at the Houston Hadoop Meetup
Presentation Transcript

  • Slide 1: Introduction to Spark
  • Slide 2: Introduction
  • Slide 3: Introduction (brief description)
    – Marco Vasquez | Data Scientist | MapR Technologies
    – Industry experience includes research (bioinformatics, machine learning, computer vision), software engineering, and security
    – I work on the professional services team; we work with MapR customers to solve their big data business problems
    – What is MapR Professional Services? A team of data scientists and engineers who solve complex business problems. Some services offered:
      – Use Case Discovery (data analysis, modeling, getting insights from data)
      – Solution Design (developing a solution around those insights)
      – Strategy Recommendations (big data corporate initiatives)
  • Slide 4: About this talk (we will cover three topics)
    1. Briefly review data science and how Spark helps me do my job
    2. Introduce Spark internals
    3. Walk through an example use case: machine learning on a public data set (RITA)
    – Questions from the audience are welcome; several MapR team members are present who can expand on the MapR platform in general. Let's make this interactive.
  • Slide 5: Spark and Data Science
  • Slide 6: Introduction to Data Science (what is data science?)
    – There are many definitions, but I like "insights from data that result in an action that generates value"; it is not enough to just get insights
    – At the core of data science are data pre-processing, building predictive models, and working with the business to identify use cases
    – Tools commonly used are R, MATLAB, or C/C++
    – What about Spark?
  • Slide 7: Spark can be useful in Data Science
    – Spark allows for quick analysis and model development
    – It provides access to the full data set, avoiding the need to subsample as is often the case with R
    – Spark supports streaming, which can be used to build real-time models on full data sets
    – On the MapR platform it can integrate with Hadoop to build better models that combine historical and real-time data
    – It can be used as the platform for a production solution, unlike R or MATLAB, where a different system has to be used in production
  • Slide 8: Spark
  • Slide 9: Spark (what is Spark?)
    – Spark is a distributed, in-memory computational framework
    – It aims to provide a single platform for real-time applications (streaming), complex analysis (machine learning), interactive queries (shell), and batch processing (Hadoop integration)
    – Specialized modules run on top of Spark: SQL (Shark), Streaming, graph processing (GraphX), and machine learning (MLlib)
    – Spark introduces an abstract common data format, the RDD, used for efficient data sharing across parallel computations
    – Spark supports the MapReduce programming model (note: not the same as Hadoop MapReduce)
  • Slide 10: Spark Platform (diagram: Spark components and Hadoop integration). Shark SQL, Spark Streaming, GraphX, and MLlib sit on the Spark execution engine and RDD layer; Hive, Pig, and Mahout run alongside on Hadoop; storage and resource management come from HDFS with the YARN resource manager or Mesos.
  • Slide 11: Spark General Flow (diagram): Files → RDD → transformations → RDD' → action → Value
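    – A minimal Scala sketch of this flow (assuming an existing SparkContext sc; the file name errors.log is hypothetical):
        val lines  = sc.textFile("errors.log")            // Files -> RDD
        val errors = lines.filter(_.contains("ERROR"))    // transformation: RDD -> RDD'
        val count  = errors.count()                       // action: RDD' -> Value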
  • Slide 12: Spark (what are Spark's features?)
    – Several ways to interact with Spark: the interactive shell (Scala, Python) and programs written in Java, Scala, and Python
    – Works by applying transformations and actions to collections of records called RDDs
    – In-memory and fast
  • Slide 13: Clean API (write programs in terms of transformations on distributed datasets)
    – Resilient Distributed Datasets: collections of objects spread across a cluster, stored in RAM or on disk
    – Built through parallel transformations; automatically rebuilt on failure
    – Operations: transformations (e.g. map, filter, groupBy) and actions (e.g. count, collect, save)
  • Slide 14: Expressive API
    – Transformations: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduceByKey, groupByKey, cogroup, cross, zip, sample, partitionBy, mapWith, pipe, ...
    – Actions: reduce, count, fold, take, first, save, ...
  • Slide 15: User-Driven Roadmap
    – Language support: improved Python support, SparkR, Java 8, integrated schema and SQL support in Spark's APIs
    – Better ML: sparse data support, model evaluation framework, performance testing
  • Slide 16: Basic Transformations
        > nums = sc.parallelize([1, 2, 3])
        # Pass each element through a function
        > squares = nums.map(lambda x: x * x)          # => {1, 4, 9}
        # Keep elements passing a predicate
        > even = squares.filter(lambda x: x % 2 == 0)  # => {4}
        # Map each element to zero or more others
        > nums.flatMap(lambda x: range(x))             # => {0, 0, 1, 0, 1, 2}
    – range(x) returns a sequence of numbers 0, 1, ..., x-1
  • Slide 17: Basic Actions
        > nums = sc.parallelize([1, 2, 3])
        # Retrieve RDD contents as a local collection
        > nums.collect()  # => [1, 2, 3]
        # Return first K elements
        > nums.take(2)    # => [1, 2]
        # Count number of elements
        > nums.count()    # => 3
        # Merge elements with an associative function
        > nums.reduce(lambda x, y: x + y)  # => 6
        # Write elements to a text file
        > nums.saveAsTextFile("hdfs://file.txt")
  • Slide 18: Working with Key-Value Pairs. Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.
        Python: pair = (a, b)
                pair[0]  # => a
                pair[1]  # => b
        Scala:  val pair = (a, b)
                pair._1  // => a
                pair._2  // => b
        Java:   Tuple2 pair = new Tuple2(a, b);
                pair._1  // => a
                pair._2  // => b
  • Slide 19: Some Key-Value Operations
        > pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
        > pets.reduceByKey(lambda x, y: x + y)  # => {(cat, 3), (dog, 1)}
        > pets.groupByKey()  # => {(cat, [1, 2]), (dog, [1])}
        > pets.sortByKey()   # => {(cat, 1), (cat, 2), (dog, 1)}
    – reduceByKey also automatically implements combiners on the map side (see the sketch below)
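    – A hedged Scala sketch of why that map-side combine matters: reduceByKey pre-aggregates values on each partition before the shuffle, while groupByKey ships every (key, value) pair across the network first. Both produce {(cat, 3), (dog, 1)}, but reduceByKey moves far less data on large inputs:
        val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
        val viaReduce = pets.reduceByKey(_ + _)              // combines locally, then shuffles
        val viaGroup  = pets.groupByKey().mapValues(_.sum)   // shuffles everything, then sums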
  • Slide 20: Example: Word Count
        > lines = sc.textFile("hamlet.txt")
        > counts = lines.flatMap(lambda line: line.split(" "))
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda x, y: x + y)
    – Flow: "to be or", "not to be" → ["to", "be", "or", "not", "to", "be"] → [(to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)] → [(be, 2), (not, 1), (or, 1), (to, 2)]
  • Slide 21: Other Key-Value Operations
        > visits = sc.parallelize([("index.html", "1.2.3.4"),
                                   ("about.html", "3.4.5.6"),
                                   ("index.html", "1.3.3.1")])
        > pageNames = sc.parallelize([("index.html", "Home"),
                                      ("about.html", "About")])
        > visits.join(pageNames)
        # ("index.html", ("1.2.3.4", "Home"))
        # ("index.html", ("1.3.3.1", "Home"))
        # ("about.html", ("3.4.5.6", "About"))
        > visits.cogroup(pageNames)
        # ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
        # ("about.html", (["3.4.5.6"], ["About"]))
  • Slide 22: Spark Internals
  • Slide 23: Spark application
    – Driver program: the program that creates a SparkContext
    – Executors: worker processes that execute tasks and store data
  • Slide 24: Types of Applications
    – Long-lived/shared applications: Shark, Spark Streaming, Job Server (Ooyala); these may do multi-user scheduling within the allocation from the cluster manager
    – Short-lived applications: standalone apps, shell sessions
  • Slide 25: SparkContext
    – Main entry point to Spark functionality
    – Available in the shell as the variable sc
    – In standalone programs, you make your own (a sketch follows; details later)
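    – A minimal sketch of creating a SparkContext yourself (Spark 1.0 Scala; the app name and master are placeholders):
        import org.apache.spark.{SparkConf, SparkContext}

        val conf = new SparkConf()
          .setAppName("MyApp")      // hypothetical app name
          .setMaster("local[2]")    // or omit and pass --master to spark-submit
        val sc = new SparkContext(conf)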
  • Slide 26: Resilient Distributed Datasets (what about RDDs?)
    – An RDD is a read-only, partitioned collection of records
    – Since RDDs are read-only, mutable state is represented by a succession of RDDs
    – Users can control persistence and partitioning (see the sketch below)
    – Transformations and actions are applied to RDDs
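    – A short Scala sketch of controlling persistence and partitioning (hypothetical data; assumes a SparkContext sc):
        import org.apache.spark.HashPartitioner
        import org.apache.spark.storage.StorageLevel

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        val partitioned = pairs.partitionBy(new HashPartitioner(8))  // control partitioning
        partitioned.persist(StorageLevel.MEMORY_ONLY)                // control persistence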
  • Slide 27: Creating RDDs
        # Turn a Python collection into an RDD
        > sc.parallelize([1, 2, 3])
        # Load text file from local FS, HDFS, or S3
        > sc.textFile("file.txt")
        > sc.textFile("directory/*.txt")
        > sc.textFile("hdfs://namenode:9000/path/file")
        # Use an existing Hadoop InputFormat (Java/Scala only)
        > sc.hadoopFile(keyClass, valClass, inputFmt, conf)
  • Slide 28: Cluster manager. The cluster manager grants executors to a Spark application.
  • Slide 29: Driver program. The driver program decides when to launch tasks and on which executor; it needs full network connectivity to the workers.
  • Slide 30: Spark Development
  • Slide 31: Spark Programming (getting started with Spark and Scala)
    – Use IntelliJ with the Scala plugin installed to build jar files
    – Use SBT as the build tool (integrating Scala with Gradle is possible but difficult)
    – Write Scala code
    – Run 'sbt package' to generate the jar file
    – Run the code using 'spark-submit'
    – Use spark-shell for quick prototyping
  • Slide 32: Spark Development Environment (using IntelliJ and Scala)
    – Install the Scala plugin for IntelliJ
    – Create a Scala project using the 'with SBT' build option
    – Add the following lines to build.sbt to pull in the Spark dependencies:
        scalaVersion := "2.10.4"

        libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

        libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.0.0"

        libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.0" % "test"

        resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
  • Slide 33: Deploy and Run (using sbt and spark-submit)
    – Run 'sbt package' to generate the jar file
    – Submit to the Spark engine with:
        spark-submit --class com.ps.ml.RitaML --master local[4] rita_2.10-1.0.jar
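    – For completeness, a minimal sketch of what the class named above might look like (illustrative only; the actual RitaML logic appears on the RITA ML slides):
        package com.ps.ml

        import org.apache.spark.{SparkConf, SparkContext}

        object RitaML {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("RitaML")
            val sc = new SparkContext(conf)
            // ... load data, pre-process, train the model (see the following slides) ...
            sc.stop()
          }
        }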
  • Slide 34: Linear Regression using Spark
  • Slide 35: Linear Regression using Spark. Goal: build a model that predicts flight arrival delays.
    – Use linear regression with the following predictors: actual elapsed time, air time, departure delay, distance, taxi in, taxi out
    – Steps: import the data, pre-process the data, build the model
  • Slide 36: RITA Data (sample data). The slide shows roughly 30 sample rows from the 2008 file with these columns: Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay. Example row: 2008 1 3 4 2003 1955 2211 2225 WN 335 N712SW 128 150 116 -14 8 IAD TPA 810 4 8 0 0 NA NA NA NA NA
  • Slide 37: RITA ML - Initialize
        // Spark context classes (needed for the setup below)
        import org.apache.spark.SparkConf
        import org.apache.spark.SparkContext

        // Import the machine learning library
        import org.apache.spark.mllib.regression.LinearRegressionWithSGD
        import org.apache.spark.mllib.regression.LabeledPoint
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.classification.SVMWithSGD
        import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
        import org.apache.spark.mllib.recommendation.ALS
        import org.apache.spark.mllib.recommendation.Rating

        // Regex helper class: lets us pattern-match strings with r"..." interpolators
        implicit class Regex(sc: StringContext) {
          def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
        }

        // Set up the Spark context
        val conf = new SparkConf().setAppName("RitaML")
        val sc = new SparkContext(conf)
  • Slide 38: RITA ML - Data Processing
        // Import the file and convert it to an RDD
        val rita08 = sc.textFile("maprfs:/user/ubuntu/input/2008.csv")

        // Remove the header from the RDD (keep only lines whose first field is numeric)
        val rita08_nh = rita08.filter(x => x.split(',')(0) match {
          case r"\d+" => true
          case _ => false
        })

        // Assign names to field indexes
        val actual_elapsed_time = 11
        val airtime = 13
        val arrdelay = 14
        val depdelay = 15
        val distance = 18
        val taxiin = 19
        val taxiout = 20
  • Slide 39: RITA ML - Data Processing (continued)
        // Helper: is the field "NA"?
        def isna(s: String): Boolean = s match {
          case "NA" => true
          case _ => false
        }

        // Get the fields of interest and filter out rows containing NAs
        val rita08_nh_ftd = rita08_nh.map(x => x.split(','))
          .map(x => (x(arrdelay), (x(actual_elapsed_time), x(airtime), x(depdelay),
                                   x(distance), x(taxiin), x(taxiout))))
          .filter(x => !isna(x._1) && !isna(x._2._1) && !isna(x._2._2) && !isna(x._2._3) &&
                       !isna(x._2._4) && !isna(x._2._5) && !isna(x._2._6))

        // Convert the Strings to LabeledPoint: (response variable, Vector(predictors))
        val rita08_training_data = rita08_nh_ftd.map(x =>
          LabeledPoint(x._1.toDouble,
            Vectors.dense(Array(x._2._1.toDouble, x._2._2.toDouble, x._2._3.toDouble,
                                x._2._4.toDouble, x._2._5.toDouble, x._2._6.toDouble))))
  • Slide 40: RITA ML – Train Model
        val numIterations = 20

        // Train the linear regression model
        val mymodel = LinearRegressionWithSGD.train(rita08_training_data, numIterations)

        // Get the actual labels and the model's predictions
        val valuesAndPreds = rita08_training_data.map { point =>
          val prediction = mymodel.predict(point.features)
          (point.label, prediction)
        }

        // Compute the mean squared error on the training set
        val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
        println("training Mean Squared Error = " + MSE)
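    – Once trained, the model can score new flights. A hedged sketch using the feature values from the first RITA sample row (actual elapsed time, air time, departure delay, distance, taxi in, taxi out):
        val newFlight = Vectors.dense(Array(128.0, 116.0, 8.0, 810.0, 4.0, 8.0))
        val predictedArrDelay = mymodel.predict(newFlight)
        println("predicted arrival delay = " + predictedArrDelay)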
  • Slide 41: RITA ML – Analysis. Things to try:
    – Remove predictors
    – Add new predictors
    – Increase the number of iterations to improve gradient descent
    – Run again to determine whether the MSE decreases
    – Iterate this process until you have an acceptable MSE; a strength of Spark is that this loop can be done quickly (a sketch follows)
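    – A hedged sketch of that loop (the iteration counts are arbitrary): retrain with increasing SGD iterations and watch whether the training MSE decreases:
        for (numIter <- Seq(20, 50, 100)) {
          val m = LinearRegressionWithSGD.train(rita08_training_data, numIter)
          val mse = rita08_training_data
            .map(p => math.pow(p.label - m.predict(p.features), 2))
            .mean()
          println("iterations = " + numIter + ", training MSE = " + mse)
        }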
  • Slide 42: Q&A. Engage with us! @mapr | maprtech | yourname@mapr.com | MapR | maprtech | mapr-technologies