Intro to Apache Spark by Marco Vasquez

http://bit.ly/1BTaXZP – This presentation was given by Marco Vasquez, Data Scientist at MapR, at the Houston Hadoop Meetup

Published in: Technology, Education

Transcript of "Intro to Apache Spark by Marco Vasquez"

1. Introduction to Spark
2. Introduction
3. Introduction (Brief description)
   • Marco Vasquez | Data Scientist | MapR Technologies
     – Industry experience includes research {bioinformatics, machine learning, computer vision}, software engineering, and security
     – I work on the professional services team; we work with MapR customers to solve their big data business problems
   • What is MapR Professional Services?
     – A team of data scientists and engineers that works on solving complex business problems; some of the services offered:
     – Use Case Discovery (data analysis, modeling, getting insights from data)
     – Solution Design (developing a solution around those insights)
     – Strategy Recommendations (big data corporate initiatives)
4. About this talk (We will cover three topics)
   1. Briefly review Data Science and how Spark helps me do my job
   2. Introduce Spark internals
   3. Provide an example use case: machine learning on a public data set (RITA)
   • Questions from the audience are welcome; several MapR team members are present who can expand on the MapR platform in general. Make this interactive.
5. Spark and Data Science
6. Introduction to Data Science (What is Data Science?)
   • Many definitions exist, but I like “Insights from data that result in an action that generates value.” It is not enough to just get insights.
   • At the core of Data Science is
     – Data pre-processing, building predictive models, and working with the business to identify use cases
   • Tools commonly used are R, Matlab, or C/C++
   • What about Spark?
7. Spark can be useful in Data Science
   • Spark allows for quick analysis and model development
   • It provides access to the full data set, avoiding the need to subsample as is often the case with R
   • Spark supports streaming, which can be used for building real-time models on full data sets
   • Using the MapR platform, Spark can integrate with Hadoop to build better models that combine historical and real-time data
   • It can be used as the platform for building a production solution, unlike R or Matlab, where another system has to be used in production
8. Spark
9. Spark (What is Spark?)
   • Spark is a distributed, in-memory computational framework
   • It aims to provide a single platform that can be used for real-time applications (streaming), complex analysis (machine learning), interactive queries (shell), and batch processing (Hadoop integration)
   • It has specialized modules that run on top of Spark
     – SQL (Shark), Streaming, Graph Processing (GraphX), Machine Learning (MLlib)
   • Spark introduces an abstract common data format that is used for efficient data sharing across parallel computation: the RDD
   • Spark supports the Map/Reduce programming model. Note: this is not the same as Hadoop MapReduce
10. Spark Platform (Spark components and Hadoop integration)
   [Architecture diagram: Shark (SQL), Spark Streaming, MLlib, and GraphX run on the Spark execution engine (RDDs); Spark sits alongside Hive, Pig, and Mahout on Hadoop, with data in HDFS and resource management by YARN or Mesos]
11. Spark General Flow
   [Flow diagram: Files → RDD → (transformations) → RDD’ → (action) → Value]
12. Spark (What are Spark features?)
   • Supports several ways to interact with Spark
     – Spark interactive shell {Scala, Python}
     – Programming in Java, Scala, and Python
   • Works by applying transformations and actions to collections of records called RDDs (a minimal sketch follows below)
   • In-memory and fast
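   A minimal PySpark sketch of that transform-then-act flow (hamlet.txt is an illustrative file name; sc is the SparkContext the shell provides):

      lines = sc.textFile("hamlet.txt")            # RDD of input lines
      longs = lines.filter(lambda l: len(l) > 40)  # transformation: lazy, just builds a new RDD
      n = longs.count()                            # action: triggers the actual computation

   Transformations such as filter are lazy; nothing executes until an action such as count asks for a value.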
13. Clean API (Write programs in terms of transformations on distributed datasets)
   • Resilient Distributed Datasets
     – Collections of objects spread across a cluster, stored in RAM or on disk (see the caching sketch below)
     – Built through parallel transformations
     – Automatically rebuilt on failure
   • Operations
     – Transformations (e.g. map, filter, groupBy)
     – Actions (e.g. count, collect, save)
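   Where an RDD lives can be controlled explicitly; a short sketch, assuming a Spark 1.x-era PySpark API:

      from pyspark import StorageLevel
      lines = sc.textFile("hamlet.txt")
      lines.persist(StorageLevel.MEMORY_AND_DISK)  # keep in RAM, spill to disk when memory is tight
      # lines.cache() is shorthand for persist() at the default MEMORY_ONLY level

   If an executor is lost, cached partitions are rebuilt from the RDD's lineage rather than restored from replicas.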
14. Expressive API
   • Transformations: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduceByKey, groupByKey, cogroup, cross, zip, sample, partitionBy, mapWith, pipe, ...
   • Actions: reduce, count, fold, take, first, save, ...
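   A taste of a few of these in PySpark (illustrative values; note that cross appears as cartesian in the Python API):

      a = sc.parallelize([1, 2, 3])
      b = sc.parallelize([3, 4, 5])
      a.union(b).collect()           # => [1, 2, 3, 3, 4, 5]
      a.cartesian(b).count()         # => 9 (every pairing of a with b)
      a.sample(False, 0.5, 42)       # transformation: ~50% random subset, no replacement
      a.fold(0, lambda x, y: x + y)  # action => 6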
15. User-Driven Roadmap
   • Language support
     – Improved Python support
     – SparkR
     – Java 8
     – Integrated Schema and SQL support in Spark’s APIs
   • Better ML
     – Sparse data support
     – Model evaluation framework
     – Performance testing
16. Basic Transformations
   > nums = sc.parallelize([1, 2, 3])
   # Pass each element through a function
   > squares = nums.map(lambda x: x*x)  # => {1, 4, 9}
   # Keep elements passing a predicate
   > even = squares.filter(lambda x: x % 2 == 0)  # => {4}
   # Map each element to zero or more others
   > nums.flatMap(lambda x: range(x))  # => {0, 0, 1, 0, 1, 2}
   # range(x) is a sequence of numbers 0, 1, …, x-1
17. Basic Actions
   > nums = sc.parallelize([1, 2, 3])
   # Retrieve RDD contents as a local collection
   > nums.collect()  # => [1, 2, 3]
   # Return first K elements
   > nums.take(2)  # => [1, 2]
   # Count number of elements
   > nums.count()  # => 3
   # Merge elements with an associative function
   > nums.reduce(lambda x, y: x + y)  # => 6
   # Write elements to a text file
   > nums.saveAsTextFile("hdfs://file.txt")
18. Working with Key-Value Pairs
   Spark’s “distributed reduce” transformations operate on RDDs of key-value pairs.
   Python:  pair = (a, b)
            pair[0]  # => a
            pair[1]  # => b
   Scala:   val pair = (a, b)
            pair._1  // => a
            pair._2  // => b
   Java:    Tuple2 pair = new Tuple2(a, b);
            pair._1()  // => a
            pair._2()  // => b
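   Building a pair RDD in PySpark and pulling it apart (an illustrative sketch):

      pairs = sc.parallelize([("a", 1), ("b", 2)])
      pairs.keys().collect()                       # => ['a', 'b']
      pairs.values().collect()                     # => [1, 2]
      pairs.mapValues(lambda v: v * 10).collect()  # => [('a', 10), ('b', 20)]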
19. Some Key-Value Operations
   > pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
   > pets.reduceByKey(lambda x, y: x + y)  # => {(cat, 3), (dog, 1)}
   > pets.groupByKey()  # => {(cat, [1, 2]), (dog, [1])}
   > pets.sortByKey()  # => {(cat, 1), (cat, 2), (dog, 1)}
   reduceByKey also automatically implements combiners on the map side
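   That map-side combining is why reduceByKey is usually preferred over groupByKey followed by a local reduce; both lines below compute the same per-key totals, but the first pre-aggregates inside each partition before shuffling:

      pets.reduceByKey(lambda x, y: x + y)  # combines (cat, 1) and (cat, 2) locally before the shuffle
      pets.groupByKey().mapValues(sum)      # ships every individual value across the network first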
20. Example: Word Count
   > lines = sc.textFile("hamlet.txt")
   > counts = lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda x, y: x + y)
   Dataflow: “to be or” / “not to be”
     → “to” “be” “or” “not” “to” “be”
     → (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1)
     → (be, 2) (not, 1) (or, 1) (to, 2)
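   Nothing has executed at this point; counts only materializes when an action runs, e.g. (the output path is illustrative):

      counts.take(5)                       # inspect a few (word, count) pairs
      counts.saveAsTextFile("counts_out")  # or write the full result out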
21. Other Key-Value Operations
   > visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                               ("about.html", "3.4.5.6"),
                               ("index.html", "1.3.3.1") ])
   > pageNames = sc.parallelize([ ("index.html", "Home"),
                                  ("about.html", "About") ])
   > visits.join(pageNames)
   # ("index.html", ("1.2.3.4", "Home"))
   # ("index.html", ("1.3.3.1", "Home"))
   # ("about.html", ("3.4.5.6", "About"))
   > visits.cogroup(pageNames)
   # ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
   # ("about.html", (["3.4.5.6"], ["About"]))
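   The outer joins listed earlier behave the same way but keep unmatched keys; a hedged sketch adding a page with no recorded name:

      extra = visits.union(sc.parallelize([("faq.html", "9.9.9.9")]))
      extra.leftOuterJoin(pageNames)
      # includes ("faq.html", ("9.9.9.9", None)) -- the missing right side becomes None
      # plus the three matched records shown above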
22. Spark Internals
23. Spark application
   • Driver program
     – Java program that creates a SparkContext
   • Executors
     – Worker processes that execute tasks and store data
24. Types of Applications
   • Long-lived/shared applications
     – Shark
     – Spark Streaming
     – Job Server (Ooyala)
   • Short-lived applications
     – Standalone apps
     – Shell sessions
   • May do multi-user scheduling within an allocation from the cluster manager
25. SparkContext
   • Main entry point to Spark functionality
   • Available in the shell as the variable sc
   • In standalone programs, you make your own (a minimal sketch follows; see the Scala example later for details)
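   A minimal standalone PySpark skeleton, assuming a Spark 1.x-style API (the app name and master setting are illustrative):

      from pyspark import SparkConf, SparkContext

      conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
      sc = SparkContext(conf=conf)
      print(sc.parallelize([1, 2, 3]).count())  # => 3
      sc.stop()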
26. Resilient Distributed Datasets (What about RDDs?)
   • An RDD is a read-only, partitioned collection of records
   • Since RDDs are read-only, mutable state is represented by a sequence of RDDs
   • Users can control persistence and partitioning (see the sketch below)
   • Transformations or actions are applied to RDDs
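   Controlling partitioning from PySpark (the partition counts are illustrative):

      nums = sc.parallelize(range(1000), 8)                  # ask for 8 partitions up front
      pairs = nums.map(lambda x: (x % 4, x)).partitionBy(4)  # hash-partition the pair RDD into 4

   Co-partitioning two pair RDDs this way lets a later join avoid shuffling both sides.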
27. Creating RDDs
   # Turn a Python collection into an RDD
   > sc.parallelize([1, 2, 3])
   # Load text file from local FS, HDFS, or S3
   > sc.textFile("file.txt")
   > sc.textFile("directory/*.txt")
   > sc.textFile("hdfs://namenode:9000/path/file")
   # Use existing Hadoop InputFormat (Java/Scala only)
   > sc.hadoopFile(keyClass, valClass, inputFmt, conf)
28. Cluster manager
   • The cluster manager grants executors to a Spark application
29. Driver program
   • The driver program decides when to launch tasks and on which executor
   • It needs full network connectivity to the workers
30. Spark Development
31. Spark Programming (Getting started with Spark and Scala)
   • Use IntelliJ with the Scala plugin installed to build jar files
   • Use SBT as the build tool; it is possible to integrate Scala with Gradle, but difficult
   • Write Scala code
   • Run ‘sbt package’ to generate the jar file
   • Run the code using ‘spark-submit’
   • Use spark-shell for quick prototyping
32. Spark Development Environment (Using IntelliJ and Scala)
   • Install the Scala plugin for IntelliJ
   • Create a Scala project and use the ‘with SBT build’ option
   • Add the following lines to the build.sbt file to pull in the Spark dependencies:

     scalaVersion := "2.10.4"
     libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
     libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.0.0"
     libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.0" % "test"
     resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
33. Deploy and Run (Using sbt and spark-submit)
   • Run ‘sbt package’ to generate the jar file
   • Submit to the Spark engine using the following:
     – spark-submit --class com.ps.ml.RitaML --master local[4] rita_2.10-1.0.jar
34. Linear Regression using Spark
35. Linear Regression using Spark (Goal: build a model that predicts flight arrival delays)
   • Use linear regression with the following predictors:
     – actual elapsed time, air time, departure delay, distance, taxi in, taxi out
   • Steps:
     – Import data
     – Data pre-processing
     – Build model
36. RITA Data (Sample data)
   Columns: Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay
   Sample rows (the slide shows roughly 30 records from the 2008 file; missing values appear as NA):
   2008 1 3 4 2003 1955 2211 2225 WN 335 N712SW 128 150 116 -14 8 IAD TPA 810 4 8 0 0 NA NA NA NA NA
   2008 1 3 4 754 735 1002 1000 WN 3231 N772SW 128 145 113 2 19 IAD TPA 810 5 10 0 0 NA NA NA NA NA
   2008 1 3 4 628 620 804 750 WN 448 N428WN 96 90 76 14 8 IND BWI 515 3 17 0 0 NA NA NA NA NA
   … (the remaining sample rows follow the same format)
37. RITA ML – Initialize
   // Import Spark and the machine learning library
   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.mllib.regression.LinearRegressionWithSGD
   import org.apache.spark.mllib.regression.LabeledPoint
   import org.apache.spark.mllib.linalg.Vectors
   import org.apache.spark.mllib.classification.SVMWithSGD
   import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
   import org.apache.spark.mllib.recommendation.ALS
   import org.apache.spark.mllib.recommendation.Rating

   // Regex helper class: lets us write regex pattern literals such as r"\d+"
   implicit class Regex(sc: StringContext) {
     def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
   }

   // Set up the Spark context
   val conf = new SparkConf().setAppName("RitaML")
   val sc = new SparkContext(conf)
38. RITA ML – Data Processing
   // Import the file and convert it to an RDD
   val rita08 = sc.textFile("maprfs:/user/ubuntu/input/2008.csv")

   // Remove the header row from the RDD (keep lines whose first field is numeric)
   val rita08_nh = rita08.filter(x => x.split(',')(0) match {
     case r"\d+" => true
     case _ => false
   })

   // Assign names to field indexes
   val actual_elapsed_time = 11
   val airtime = 13
   val arrdelay = 14
   val depdelay = 15
   val distance = 18
   val taxiin = 19
   val taxiout = 20
39. RITA ML – Data Processing
   def isna(s: String): Boolean = { s match {
     case "NA" => true
     case _ => false
   } }

   // Get the fields of interest and filter out NAs
   val rita08_nh_ftd = rita08_nh.map(x => x.split(',')).map(x =>
     (x(arrdelay), (x(actual_elapsed_time), x(airtime), x(depdelay),
      x(distance), x(taxiin), x(taxiout)))).filter(x =>
     !isna(x._1) && !isna(x._2._1) && !isna(x._2._2) && !isna(x._2._3) &&
     !isna(x._2._4) && !isna(x._2._5) && !isna(x._2._6))

   // Convert the Strings to LabeledPoint: (response variable, Vector(predictors))
   val rita08_training_data = rita08_nh_ftd.map(x =>
     LabeledPoint(x._1.toDouble, Vectors.dense(Array(x._2._1.toDouble,
       x._2._2.toDouble, x._2._3.toDouble, x._2._4.toDouble,
       x._2._5.toDouble, x._2._6.toDouble))))
40. RITA ML – Train Model
   val numIterations = 20

   // Train the LR model
   val mymodel = LinearRegressionWithSGD.train(rita08_training_data, numIterations)

   // Get the actual values and the predicted values for the training features
   val valuesAndPreds = rita08_training_data.map { point =>
     val prediction = mymodel.predict(point.features); (point.label, prediction)
   }

   // Get the Mean Squared Error
   val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
   println("training Mean Squared Error = " + MSE)
41. RITA ML – Analysis
   • Things to try:
     – Remove predictors
     – Add new predictors
     – Increase the number of iterations to improve gradient descent
     – Run again to determine whether the MSE decreases
   • Iterate this process until you have an acceptable MSE (that is the strength of Spark: this can be done quickly)
42. Q&A
   Engage with us!
   @mapr  maprtech  yourname@mapr.com
   MapR  maprtech  mapr-technologies
