Intro to Apache Spark: Fast cluster computing engine for Hadoop
Intro to Scala: Object-oriented and functional language for the Java Virtual Machine
ACM SIGKDD, 7/9/2014
Roger Huang
Lead System Architect
rohuang@visa.com
rog4096@yahoo.com
@BigDataWrangler
2Intro to Spark: Intro to Scala | 7/9/2014
About me: Roger Huang
• Visa
– Digital & Mobile Products Architecture, Strategic Projects & Infrastructure
– Search infrastructure
– Customer segmentation
– Logging Framework
– Splunk on Hadoop (Hunk)
– Real-time monitoring
– Data
• PayPal
– Java Infrastructure
Different perspectives on an elephant Scala
Outline
• Spark
– Hadoop ecosystem
• Scala
– Background
• Why Scala?
– For the computer scientist
– For the Java / OO programmer
– For the Spark developer
– For the Big Data developer
– For the Big Data scientist / mathematician
– For the system architect
Spark in the Hadoop ecosystem
Spark Ecosystem of Software Projects
• Spark [Ognen]
– APIs: Scala, Python [Robert], Java
• “SQL”
– Shark (Hive + Spark) [Roger]
– SparkSQL (alpha)
• Machine Learning Library (MLlib) [Omar]
– Clustering
– Classification
• binary classification
• Linear regression
– recommendations
• Spark Streaming [Chance]
• GraphX [Srini]
• …
Resilient Distributed Dataset
• Fault-tolerant collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel
• Data sources for RDDs
– Parallelized collections
• From Scala collections
– Hadoop datasets
• From HDFS or any Hadoop-supported storage system (HBase, Amazon S3, …)
• Text files, SequenceFile, any Hadoop InputFormat
• Two types of operations
– Transformation
• takes an existing dataset and creates a new one
– Action
• runs a computation on a dataset and returns a value to the driver program
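The transformation/action split can be tried without a cluster. In Spark, transformations are lazy and only an action triggers computation. The sketch below is an analogy using a plain Scala collection view as a stand-in for an RDD, not Spark's own API; the object and value names are made up for illustration.

```scala
object LazyVsEager {
  // Counts how many times the mapping function actually runs.
  var evaluations = 0

  val numbers = List(1, 2, 3, 4)

  // "Transformation": describes a new dataset lazily; nothing is computed yet.
  val doubled = numbers.view.map { n => evaluations += 1; n * 2 }

  // Snapshot taken before any "action" has forced the view.
  val evaluationsBeforeAction = evaluations

  // "Action": forces the computation and returns a plain value to the caller.
  val total = doubled.sum

  def main(args: Array[String]): Unit =
    println(s"evaluations before action = $evaluationsBeforeAction, total = $total")
}
```

On a real RDD the same shape would be a map(func) followed by reduce(func) or collect(), both of which appear in the operation lists in this deck.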
(Some) RDD Operations
• Transformations
– map(func)
– filter(func)
– flatMap(func)
– mapPartitions(func)
– mapPartitionsWithIndex(func)
– sample(withReplacement, fraction, seed)
– union(otherDataset)
– distinct()
– groupByKey()
– reduceByKey(func)
– sortByKey()
– join(otherDataset)
– cogroup(otherDataset)
– cartesian(otherDataset)
• Actions
– reduce(func)
– collect()
– count()
– first()
– take(n)
– takeSample(withReplacement, num, seed)
– saveAsTextFile(path)
– saveAsSequenceFile(path)
– countByKey()
– foreach(func)
– …
Scala background
• Scalable, Object oriented, functional language
– Version 2.11 (4/2014)
• Runs on the Java Virtual Machine
• Designed by Martin Odersky, who also worked on
– javac
– Java generics
• http://scala-lang.org/, REPL
• http://www.scala-lang.org/api/current
• http://scala-ide.org/
• http://www.scala-sbt.org/, Simple build tool
• Who’s using Scala?
– Twitter, LinkedIn, …
• Powered by Scala
– Apache Spark, Apache Kafka, Akka,…
Scala for the computer scientist:
functional programming (FP)
• Math functions, e.g., f(x) = y
– A function has a single responsibility
– A function has no side effects
– A function is referentially transparent
• A function outputs the same value for the same inputs.
• Functional programming
– Expresses computation as the evaluation and composition of mathematical functions
– Avoids side effects and mutable state
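A minimal sketch of referential transparency, assuming two hypothetical functions (discount and discountWithCounter are not from the talk): the pure one returns the same value for the same inputs, while the impure one depends on hidden mutable state.

```scala
object PurityDemo {
  // Pure: the result depends only on the arguments.
  def discount(amount: Int, percent: Int): Int = amount * percent / 100

  // Impure: the result depends on, and changes, hidden mutable state.
  private var calls = 0
  def discountWithCounter(amount: Int): Int = {
    calls += 1
    amount * calls / 100
  }

  // Same inputs, same output: a call can be replaced by its value.
  val pureSame = discount(200, 10) == discount(200, 10)

  // Same input, different outputs (2, then 4): not referentially transparent.
  val impureSame = discountWithCounter(200) == discountWithCounter(200)

  def main(args: Array[String]): Unit =
    println(s"pureSame = $pureSame, impureSame = $impureSame")
}
```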
Why functional programming?
• Multi core processors
• Concurrency
– Computation as a series of independent data transformations
– Parallel data transformations without side effects
• Referential transparency
Scala for the computer scientist:
functional programming
• Functions
– Lambda, closure
• For-comprehensions
• Type inference
• Pattern matching
• Higher order functions
– map, flatMap, foldLeft
• And more …
FP: functions
• Anonymous function
– Function without a name
– lambda function
• Example
scala> List(100, 200, 300) map { _ * 10 / 100 }
res0: List[Int] = List(10, 20, 30)
• Closure (Wikipedia)
– Closure = A function, together with a referencing environment – a
table storing a reference to each of the non-local variables of that
function.
– A closure allows a function to access those non-local variables
even when invoked outside its immediate lexical scope.
FP: functions
• applyPercentage is an example of a closure
scala> var percentage = 10
percentage: Int = 10
scala> val applyPercentage = (amount: Int) => amount * percentage / 100
applyPercentage: Int => Int = <function1>
scala> percentage = 20
percentage: Int = 20
scala> List(100, 200, 300) map applyPercentage
res1: List[Int] = List(20, 40, 60)
FP: functions
• Anonymous function
• Closure
FP: Higher order functions
scala> :load Person.scala
Loading Person.scala...
defined class Person
scala> val jd = new Person("John", "Doe", 17)
jd: Person = Person@372a6e85
scala> val rh = new Person("Roger", "Huang", 34)
rh: Person = Person@611c4041
scala> val people = Array(jd, rh)
people: Array[Person] = Array(Person@372a6e85, Person@611c4041)
scala> val (minors, adults) = people partition (_.age < 18)
minors: Array[Person] = Array(Person@372a6e85)
adults: Array[Person] = Array(Person@611c4041)
scala>
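The deck does not show Person.scala itself. The sketch below is one plausible definition, assumed purely so the session above can be reproduced; the field names firstName and lastName are guesses, and only age is implied by the partition call.

```scala
// Hypothetical contents of Person.scala; the original file is not shown in the deck.
class Person(val firstName: String, val lastName: String, val age: Int)

object PartitionDemo {
  val jd = new Person("John", "Doe", 17)
  val rh = new Person("Roger", "Huang", 34)
  val people = Array(jd, rh)

  // partition is a higher-order function: it takes the predicate as an argument
  // and returns two collections, those that match and those that do not.
  val (minors, adults) = people.partition(_.age < 18)

  def main(args: Array[String]): Unit =
    println(s"minors: ${minors.map(_.firstName).toList}, adults: ${adults.map(_.firstName).toList}")
}
```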
FP: Higher order functions
• A higher-order function (HOF)
– takes a function as an argument, or
– returns a function
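Both halves of that definition can be shown in a few lines. The names twice and percentOf below are invented for this sketch and do not come from the talk.

```scala
object HigherOrderDemo {
  // Takes a function as an argument: applies f two times.
  def twice(f: Int => Int, x: Int): Int = f(f(x))

  // Returns a function: builds a percentage calculator for a fixed rate.
  def percentOf(percent: Int): Int => Int = amount => amount * percent / 100

  val tenPercent: Int => Int = percentOf(10)
  val results = List(100, 200, 300).map(tenPercent) // List(10, 20, 30)
  val plusTwo = twice(_ + 1, 5)                     // 7

  def main(args: Array[String]): Unit =
    println(s"results = $results, plusTwo = $plusTwo")
}
```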
FP: Higher order functions: map
• Creates a new collection from an existing collection by applying
a function
• Anonymous function
scala> List(1, 2, 3 ) map { (x: Int) => x + 1 }
res0: List[Int] = List(2, 3, 4)
• Function literal with placeholder syntax
scala> List(1, 2, 3) map { _ + 1 }
res1: List[Int] = List(2, 3, 4)
• Passing an existing function
scala> def addOne(num: Int) = num + 1
addOne: (num: Int)Int
scala> List(1, 2, 3) map addOne
res2: List[Int] = List(2, 3, 4)
FP: Higher order functions: flatMap
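The slide content for flatMap did not survive in this transcript, so the following is only a small illustration of the operation, with made-up input strings: map keeps one result per input element, while flatMap applies the function and then flattens one level.

```scala
object FlatMapDemo {
  val lines = List("spark on hadoop", "scala on jvm")

  // map: one output element per input element, giving a list of lists.
  val nested: List[List[String]] = lines.map(_.split(" ").toList)

  // flatMap: same function, but the inner lists are concatenated.
  val words: List[String] = lines.flatMap(_.split(" ").toList)

  def main(args: Array[String]): Unit = {
    println(nested)
    println(words)
  }
}
```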
FP: for-comprehension
• Syntax
– for ( <generator> | <guard> ) [yield] <expression>
• Types
– Imperative form. Does not return a value.
scala> val aList = List(1, 2, 3)
aList: List[Int] = List(1, 2, 3)
scala> val bList = List(4, 5, 6)
bList: List[Int] = List(4, 5, 6)
scala> for { a <- aList; if (a < 2); b <- bList; if (b < 7) } println( a + b )
5
6
7
FP: for-comprehension
• Syntax
– for ( <generator> | <guard> ) [yield] <expression>
• Types
– Functional form (a.k.a. sequence comprehension). Returns/yields a value.
scala> for { a <- aList; b <- bList} yield a + b
res0: List[Int] = List(5, 6, 7, 6, 7, 8, 7, 8, 9)
scala> res0.take(1)
res1: List[Int] = List(5)
scala> for { a <- aList; if (a < 2); b <- bList } yield a + b
res2: List[Int] = List(5, 6, 7)
scala>
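A for-comprehension with yield is syntactic sugar over the higher-order functions seen earlier. The sketch below writes the last expression both ways; the chain of withFilter, flatMap, and map is roughly what the compiler produces, shown here by hand.

```scala
object ForDesugarDemo {
  val aList = List(1, 2, 3)
  val bList = List(4, 5, 6)

  // The for-comprehension from the session above.
  val viaFor = for { a <- aList; if a < 2; b <- bList } yield a + b

  // Guard becomes withFilter, outer generator becomes flatMap, last generator becomes map.
  val viaMethods =
    aList.withFilter(a => a < 2).flatMap(a => bList.map(b => a + b))

  def main(args: Array[String]): Unit =
    println(s"viaFor = $viaFor, viaMethods = $viaMethods")
}
```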
FP: foldLeft
scala> val numbers = 1.to(10)
numbers: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> def add(a: Int, b: Int): Int = { a + b }
add: (a: Int, b: Int)Int
scala> numbers.foldLeft(0) { add }
res0: Int = 55
scala> numbers.foldLeft(0) { (acc, b) => acc + b }
res1: Int = 55
FP: find the last item in an array
scala> val ns = Array(20, 40, 60)
ns: Array[Int] = Array(20, 40, 60)
scala> ns.foldLeft(ns.head) { (acc, b) => b }
res0: Int = 60
FP: reverse an array w/ foldLeft
scala> val ns = Array(20, 40, 60)
ns: Array[Int] = Array(20, 40, 60)
scala> ns.foldLeft(Array[Int]()) { (acc, b) => b +: acc }
res1: Array[Int] = Array(60, 40, 20)
Scala for the Java / OO developer:
• Interoperable w/ Java
• Case classes
• Mixins with traits
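As a rough illustration of mixins, the sketch below stacks a trait onto an existing class at instantiation time. Greeter, Shouter, and this Person class are invented for the example and are not taken from the talk.

```scala
object MixinDemo {
  // A trait can carry both abstract members and concrete behavior.
  trait Greeter {
    def name: String
    def greet: String = s"Hello, $name"
  }

  // A second trait refines the behavior of the first.
  trait Shouter extends Greeter {
    override def greet: String = super.greet.toUpperCase
  }

  class Person(val name: String) extends Greeter

  val plain = new Person("Roger").greet
  // Mix Shouter into a single instance without defining a new named class.
  val loud = (new Person("Roger") with Shouter).greet

  def main(args: Array[String]): Unit = {
    println(plain)
    println(loud)
  }
}
```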
Scala for the Java / OO developer:
• case class
– Implements equals(), hashCode(), toString()
– Can be used in Pattern Matching
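A short sketch of both points, reusing the Person idea from the earlier session but redefined here as a case class (an assumption; the talk's own Person was a regular class). The describe function is hypothetical.

```scala
object CaseClassDemo {
  case class Person(firstName: String, lastName: String, age: Int)

  val a = Person("John", "Doe", 17)
  val b = Person("John", "Doe", 17)

  val sameValue = a == b   // generated equals() compares field values
  val text = a.toString    // generated toString()

  // Pattern matching destructures the case class and can add guards.
  def describe(p: Person): String = p match {
    case Person(first, _, age) if age < 18 => s"$first is a minor"
    case Person(first, _, _)               => s"$first is an adult"
  }

  def main(args: Array[String]): Unit = {
    println(text)
    println(describe(a))
    println(describe(Person("Roger", "Huang", 34)))
  }
}
```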
Scala for the Java / OO developer:
• http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html
• map
– <R> Stream<R> map(Function<? super T,? extends R> mapper)
– Returns a stream consisting of the results of applying the given function to the elements of this stream. This is an intermediate operation.
• flatMap
– <R> Stream<R> flatMap(Function<? super T,? extends Stream<? extends R>> mapper)
– Returns a stream consisting of the results of replacing each element of this stream with the contents of a mapped stream produced by applying the provided mapping function to each element. Each mapped stream is closed after its contents have been placed into this stream. (If a mapped stream is null, an empty stream is used instead.) This is an intermediate operation.
Scala for the Spark developer
• Resilient Distributed Dataset (RDD)
• The basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel. The RDD class contains the basic operations available on all RDDs, such as map, filter, and persist.
• http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
Scala for the Big Data developer
• Spark
– Programming API in Scala
– Implemented in Scala
• Scalding
– Scala DSL on top of Cascading
– data processing API and processing query planner used for
defining, sharing, and executing data-processing workflows
– Abstractions: tuples, pipes, source/sink taps
• Algebird
• Summingbird
– Library that lets you write MapReduce programs that look like
native Scala or Java collection transformations
– Execute them on a number of well-known distributed MapReduce
platforms, including Storm and Scalding.
Scala for the Big Data scientist / mathematician
• Monoid
– If you want to “attach” operations such as +, -, *, / or <= to data
objects (e.g., Bloom filters), then you want to provide monoid forms
of those data objects
– Consists of
• A set of objects
• Binary operation that satisfies the monoid axioms
• Monad
– If you want to create a data processing pipeline that transforms the
state of a data object
– composition
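One way to make these two ideas concrete, as a sketch only: the Monoid trait below is a hand-rolled illustration (libraries such as Algebird define their own, richer versions), and Option stands in for a monad because its flatMap chains steps that may fail. All names here are assumptions for the example.

```scala
object MonoidDemo {
  // A monoid: an associative combine operation plus an identity element.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  val setUnion: Monoid[Set[String]] = new Monoid[Set[String]] {
    def empty: Set[String] = Set.empty
    def combine(x: Set[String], y: Set[String]): Set[String] = x ++ y
  }

  // Written once, works for any data type that has a monoid.
  def combineAll[A](items: Seq[A], m: Monoid[A]): A =
    items.foldLeft(m.empty)(m.combine)

  val total = combineAll(Seq(1, 2, 3, 4), intAddition)
  val seen = combineAll(Seq(Set("a"), Set("b"), Set("a", "c")), setUnion)

  // Monadic composition: each step runs only if the previous one produced a value.
  def parse(s: String): Option[Int] = scala.util.Try(s.toInt).toOption
  def half(n: Int): Option[Int] = if (n % 2 == 0) Some(n / 2) else None

  val ok = parse("84").flatMap(half)
  val failed = parse("not a number").flatMap(half)

  def main(args: Array[String]): Unit =
    println(s"total = $total, seen = $seen, ok = $ok, failed = $failed")
}
```

Because combine is associative, partial results can be computed on separate partitions and merged in any grouping, which is why monoids fit distributed aggregation well.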
Scala for the system architect
• Concurrency
• Problem:
– Threads
– Shared mutable state
– Locks
• Solution:
– message passing concurrency w/ Actors
– Future, Promise
• Abstractions
– Actor
• an object that processes a message
• encapsulates state (state not shared)
– ActorRef
– Message, usually sent asynchronously
– Mailbox
– ActorSystem
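Actors need the Akka library, but the Future and Promise items in the solution list can be tried with the standard library alone. The sketch below is a minimal illustration with invented names; it blocks with Await only so the result can be printed, which production code would normally avoid.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureDemo {
  // Two independent computations, composed without locks or shared mutable state.
  def sumAndProduct(): Int = {
    val left: Future[Int] = Future { (1 to 100).sum }
    val right: Future[Int] = Future { (1 to 10).product }
    val combined: Future[Int] = for { l <- left; r <- right } yield l + r
    Await.result(combined, 5.seconds)
  }

  // A Promise is the write side of a Future: it is completed exactly once.
  def completedByHand(): Int = {
    val promise = Promise[Int]()
    promise.success(42)
    Await.result(promise.future, 1.second)
  }

  def main(args: Array[String]): Unit = {
    println(sumAndProduct())
    println(completedByHand())
  }
}
```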
Scala for the system architect: Akka
• Fault tolerance
– Supervision
– Strategies
• Resume, restart, stop, escalate, …
• Scale out: remote actors
– Via configuration
Scala for the system architect
• Parallel collections
scala> import scala.collection.parallel.immutable._
import scala.collection.parallel.immutable._
scala> ParVector(10, 20, 30, 40, 50, 60, 70, 80, 90).map { x =>
     |   println(Thread.currentThread.getName); x / 2 }
ForkJoinPool-1-worker-13
ForkJoinPool-1-worker-1
ForkJoinPool-1-worker-1
ForkJoinPool-1-worker-9
ForkJoinPool-1-worker-11
ForkJoinPool-1-worker-5
ForkJoinPool-1-worker-3
ForkJoinPool-1-worker-15
ForkJoinPool-1-worker-7
res0: scala.collection.parallel.immutable.ParVector[Int] = ParVector(5, 10, 15, 20, 25, 30, 35, 40, 45)
Sequential collections
Parallel collections
References
• http://scala-lang.org/
• Scala in Action, Nilanjan Raychaudhuri
• Grokking Functional Programming, Aslam Khan
• Michael Noll

More Related Content

What's hot

Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RaySpark Summit
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Scala Days San Francisco
Scala Days San FranciscoScala Days San Francisco
Scala Days San FranciscoMartin Odersky
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myuiMakoto Yui
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondDataWorks Summit
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit
 

What's hot (20)

Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Spark etl
Spark etlSpark etl
Spark etl
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Scala Days San Francisco
Scala Days San FranciscoScala Days San Francisco
Scala Days San Francisco
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Dots20161029 myui
Dots20161029 myuiDots20161029 myui
Dots20161029 myui
 
Road to Analytics
Road to AnalyticsRoad to Analytics
Road to Analytics
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 

Viewers also liked

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkBTI360
 
கீதையின் செயல்திறன் இரகசியம்
கீதையின் செயல்திறன் இரகசியம்கீதையின் செயல்திறன் இரகசியம்
கீதையின் செயல்திறன் இரகசியம்N Ganeshan
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopSlim Baltagi
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & ScalaEdureka!
 
Data Modeling for Microservices with Cassandra and Spark
Data Modeling for Microservices with Cassandra and SparkData Modeling for Microservices with Cassandra and Spark
Data Modeling for Microservices with Cassandra and SparkJeffrey Carpenter
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Learn Togaf 9.1 in 100 slides!
Learn Togaf 9.1 in 100 slides!Learn Togaf 9.1 in 100 slides!
Learn Togaf 9.1 in 100 slides!Sam Mandebvu
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Enterprise Architecture for Dummies - TOGAF 9 enterprise architecture overview
Enterprise Architecture for Dummies - TOGAF 9 enterprise architecture overviewEnterprise Architecture for Dummies - TOGAF 9 enterprise architecture overview
Enterprise Architecture for Dummies - TOGAF 9 enterprise architecture overviewWinton Winton
 
Understanding and Applying The Open Group Architecture Framework (TOGAF)
Understanding and Applying The Open Group Architecture Framework (TOGAF)Understanding and Applying The Open Group Architecture Framework (TOGAF)
Understanding and Applying The Open Group Architecture Framework (TOGAF)Nathaniel Palmer
 
Introduction to Enterprise Architecture and TOGAF 9.1
Introduction to Enterprise Architecture and TOGAF 9.1Introduction to Enterprise Architecture and TOGAF 9.1
Introduction to Enterprise Architecture and TOGAF 9.1iasaglobal
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 

Viewers also liked (18)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
கீதையின் செயல்திறன் இரகசியம்
கீதையின் செயல்திறன் இரகசியம்கீதையின் செயல்திறன் இரகசியம்
கீதையின் செயல்திறன் இரகசியம்
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise Hadoop
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
 
Data Modeling for Microservices with Cassandra and Spark
Data Modeling for Microservices with Cassandra and SparkData Modeling for Microservices with Cassandra and Spark
Data Modeling for Microservices with Cassandra and Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Learn Togaf 9.1 in 100 slides!
Learn Togaf 9.1 in 100 slides!Learn Togaf 9.1 in 100 slides!
Learn Togaf 9.1 in 100 slides!
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Enterprise Architecture for Dummies - TOGAF 9 enterprise architecture overview
Enterprise Architecture for Dummies - TOGAF 9 enterprise architecture overviewEnterprise Architecture for Dummies - TOGAF 9 enterprise architecture overview
Enterprise Architecture for Dummies - TOGAF 9 enterprise architecture overview
 
Understanding and Applying The Open Group Architecture Framework (TOGAF)
Understanding and Applying The Open Group Architecture Framework (TOGAF)Understanding and Applying The Open Group Architecture Framework (TOGAF)
Understanding and Applying The Open Group Architecture Framework (TOGAF)
 
Introduction to Enterprise Architecture and TOGAF 9.1
Introduction to Enterprise Architecture and TOGAF 9.1Introduction to Enterprise Architecture and TOGAF 9.1
Introduction to Enterprise Architecture and TOGAF 9.1
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 

Similar to Intro to Apache Spark and Scala for Big Data

Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)Spark Summit
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Intro to Spark and Spark SQL




Intro to Apache Spark and Scala for Big Data

  • 1. Intro to Apache Spark: Fast cluster computing engine for Hadoop Intro to Scala: Object-oriented and functional language for the Java Virtual Machine ACM SIGKDD, 7/9/2014 Roger Huang Lead System Architect rohuang@visa.com rog4096@yahoo.com @BigDataWrangler
  • 2. 2Intro to Spark: Intro to Scala | 7/9/2014 About me: Roger Huang • Visa – Digital & Mobile Products Architecture, Strategic Projects & infrastructure – Search infrastructure – Customer segmentation – Logging Framework – Splunk on Hadoop (Hunk) – Real-time monitoring – Data • PayPal – Java Infrastructure
  • 3. Different perspectives on an elephant Scala
  • 4. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 5. Spark in the Hadoop ecosystem
  • 6. Spark Ecosystem of Software Projects • Spark [Ognen] – APIs: Scala, Python [Robert], Java • “SQL” – Shark (Hive + Spark) [Roger] – SparkSQL (alpha) • Machine Learning Library (MLlib) [Omar] – Clustering – Classification • Binary classification • Linear regression – Recommendations • Spark Streaming [Chance] • GraphX [Srini] • …
  • 7. Resilient Distributed Dataset • Fault-tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel • Data sources for RDDs – Parallelized collections • From Scala collections – Hadoop datasets • From HDFS, any Hadoop-supported storage system (HBase, Amazon S3, …) • Text files, SequenceFile, any Hadoop InputFormat • Two types of operations – Transformation • takes an existing dataset and creates a new one – Action • takes a dataset, runs a computation, and returns a value to the driver program
  • 8. (Some) RDD Operations • Transformations – map(func) – filter(func) – flatMap(func) – mapPartitions(func) – mapPartitionsWithIndex(func) – sample(withReplacement, fraction, seed) – union(otherDataset) – distinct() – groupByKey() – reduceByKey(func) – sortByKey() – join(otherDataset) – cogroup(otherDataset) – cartesian(otherDataset) • Actions – reduce(func) – collect() – count() – first() – take(n) – takeSample(withReplacement, num, seed) – saveAsTextFile(path) – saveAsSequenceFile(path) – countByKey() – foreach(func) – …
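The transformation/action split on slides 7 and 8 can be sketched without a cluster. The block below is only a local analogue that uses plain Scala collections, not Spark: `LocalWordCount` and its sample data are made up for illustration, and `groupBy` plus a sum stands in for what `reduceByKey` does on a pair RDD.

```scala
object LocalWordCount {
  // Plain Scala collections, NOT Spark: a local stand-in for an RDD of lines.
  val lines: List[String] = List("spark scala", "scala akka", "scala")

  // "Transformations": each step derives a new collection from an existing one.
  val words: List[String] = lines.flatMap(_.split(" ").toList)
  val pairs: List[(String, Int)] = words.map(word => (word, 1))

  // Local stand-in for reduceByKey(_ + _): group by key, then sum the values.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }

  // "Action": collapse the dataset into a single value.
  val totalWords: Int = pairs.map(_._2).reduce(_ + _)
}
```

One difference worth keeping in mind: these local collections are evaluated eagerly, whereas Spark only runs a chain of transformations once an action asks for a result.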
  • 9. Scala background • Scalable, object-oriented, functional language – Version 2.11 (4/2014) • Runs on the Java Virtual Machine • Martin Odersky – javac – Java generics • http://scala-lang.org/, REPL • http://www.scala-lang.org/api/current • http://scala-ide.org/ • http://www.scala-sbt.org/, Simple build tool • Who’s using Scala? – Twitter, LinkedIn, … • Powered by Scala – Apache Spark, Apache Kafka, Akka, …
  • 10. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Hadoop/Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 11. Scala for the computer scientist: functional programming (FP)
  • 12. Scala for the computer scientist: functional programming (FP) • Math functions, e.g., f(x) = y – A function has a single responsibility – A function has no side effects – A function is referentially transparent • A function outputs the same value for the same inputs. • Functional programming – Expresses computation as the evaluation and composition of mathematical functions – Avoids side effects and mutable state
  • 13. Why functional programming? • Multi-core processors • Concurrency – Computation as a series of independent data transformations – Parallel data transformations without side effects • Referential transparency
  • 14. Scala for the computer scientist: functional programming • Functions – Lambda, closure • For-comprehensions • Type inference • Pattern matching • Higher order functions – map, flatMap, foldLeft • And more …
  • 15. FP: functions
    • Anonymous function
      – Function without a name
      – Lambda function
    • Example
      scala> List(100, 200, 300) map { _ * 10/100}
      res0: List[Int] = List(10, 20, 30)
    • Closure (Wikipedia)
      – Closure = a function together with a referencing environment: a table storing a reference to each of the non-local variables of that function.
      – A closure allows a function to access those non-local variables even when invoked outside its immediate lexical scope.
  • 16. FP: functions
    • applyPercentage is an example of a closure
      scala> var percentage = 10
      percentage: Int = 10
      scala> val applyPercentage = (amount: Int) => amount * percentage / 100
      applyPercentage: Int => Int = <function1>
      scala> percentage = 20
      percentage: Int = 20
      scala> List (100, 200, 300) map applyPercentage
      res1: List[Int] = List(20, 40, 60)
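The session on slide 16 shows that a closure refers to the variable itself, not to a snapshot of its value. A small sketch of the contrast follows; `ClosureDemo` and the helper `percentageOf` are hypothetical names introduced here, not part of the talk.

```scala
object ClosureDemo {
  // Hypothetical helper: returns a function that closes over a fixed parameter.
  def percentageOf(rate: Int): Int => Int = amount => amount * rate / 100

  def run(): (List[Int], List[Int]) = {
    var percentage = 10
    // Closes over the variable `percentage`, so later assignments are visible.
    val live = (amount: Int) => amount * percentage / 100
    // Captures the value 10 at call time through the parameter `rate`.
    val fixed = percentageOf(percentage)
    percentage = 20
    (List(100, 200, 300).map(live), List(100, 200, 300).map(fixed))
  }
}
```

`ClosureDemo.run()` should give `List(20, 40, 60)` for the live closure and `List(10, 20, 30)` for the fixed one, which is one reason functional style tends to prefer closing over `val`s.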
  • 17. FP: functions • Anonymous function • Closure
  • 18. FP: Higher order functions
      scala> :load Person.scala
      Loading Person.scala...
      defined class Person
      scala> val jd = new Person("John", "Doe", 17)
      jd: Person = Person@372a6e85
      scala> val rh = new Person("Roger", "Huang", 34)
      rh: Person = Person@611c4041
      scala> val people = Array(jd, rh)
      people: Array[Person] = Array(Person@372a6e85, Person@611c4041)
      scala> val (minors, adults) = people partition (_.age < 18)
      minors: Array[Person] = Array(Person@372a6e85)
      adults: Array[Person] = Array(Person@611c4041)
  • 19. FP: Higher order functions • HOF – Takes a function as an argument – Returns a function
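Both halves of the definition on slide 19 fit in a few lines. This is a minimal sketch; `HigherOrderDemo`, `twice`, and `adder` are illustrative names, not taken from the slides.

```scala
object HigherOrderDemo {
  // Takes a function as an argument and returns a new function that applies it twice.
  def twice(f: Int => Int): Int => Int = x => f(f(x))

  // Returns a function: the result closes over the parameter n.
  def adder(n: Int): Int => Int = x => x + n

  val addTwo: Int => Int = twice(adder(1))

  // map is itself a higher order function: it takes addTwo as its argument.
  val result: List[Int] = List(1, 2, 3).map(addTwo)
}
```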
  • 20. FP: Higher order functions: map
    • Creates a new collection from an existing collection by applying a function
    • Anonymous function
      scala> List(1, 2, 3 ) map { (x: Int) => x + 1 }
      res0: List[Int] = List(2, 3, 4)
    • Function literal with placeholder syntax
      scala> List(1, 2, 3) map { _ + 1 }
      res1: List[Int] = List(2, 3, 4)
    • Passing an existing function
      scala> def addOne(num: Int) = num + 1
      addOne: (num: Int)Int
      scala> List(1, 2, 3) map addOne
      res2: List[Int] = List(2, 3, 4)
  • 21. FP: Higher order functions: map
  • 22. FP: Higher order functions: flatMap
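The flatMap slide survives only as a title in this transcript, so here is a small sketch of how it differs from map. `FlatMapDemo` and its sample strings are assumptions made for illustration.

```scala
object FlatMapDemo {
  val lines: List[String] = List("to be", "or not")

  // map keeps one output element per input element, so the result is nested.
  val nested: List[List[String]] = lines.map(_.split(" ").toList)

  // flatMap applies the same function, then flattens one level of nesting.
  val flat: List[String] = lines.flatMap(_.split(" ").toList)
}
```

Here `flat` equals `nested.flatten`: flatMap behaves like map followed by flatten.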
  • 23. FP: for-comprehension
    • Syntax
      – for ( <generator> | <guard> ) <expression> [yield] <expression>
    • Types
      – Imperative form. Does not return a value.
      scala> val aList = List(1, 2, 3)
      aList: List[Int] = List(1, 2, 3)
      scala> val bList = List(4, 5, 6)
      bList: List[Int] = List(4, 5, 6)
      scala> for { a <- aList; if (a < 2); b <- bList; if (b < 7) } println( a + b )
      5
      6
      7
  • 24. FP: for-comprehension
    • Syntax
      – for ( <generator> | <guard> ) <expression> [yield] <expression>
    • Types
      – Functional form (a.k.a. sequence comprehension). Returns/yields a value.
      scala> for { a <- aList; b <- bList} yield a + b
      res0: List[Int] = List(5, 6, 7, 6, 7, 8, 7, 8, 9)
      scala> res0.take(1)
      res1: List[Int] = List(5)
      scala> for { a <- aList; if (a < 2); b <- bList } yield a + b
      res2: List[Int] = List(5, 6, 7)
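A for-comprehension with `yield` is syntactic sugar for calls to `withFilter`, `flatMap`, and `map`. The sketch below writes the slide's last example both ways; `ForDesugarDemo` is a name chosen here for illustration.

```scala
object ForDesugarDemo {
  val aList: List[Int] = List(1, 2, 3)
  val bList: List[Int] = List(4, 5, 6)

  // Functional form, as on the slide.
  val sugared: List[Int] =
    for { a <- aList; if a < 2; b <- bList } yield a + b

  // Roughly what the compiler generates: guard -> withFilter,
  // outer generator -> flatMap, innermost generator -> map.
  val desugared: List[Int] =
    aList.withFilter(a => a < 2).flatMap(a => bList.map(b => a + b))
}
```

This is also why the same syntax works for any type that defines those methods, not only for collections.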
  • 25. FP: for-comprehension
  • 26. FP: foldLeft
      scala> val numbers = 1.to(10)
      numbers: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
      scala> def add( a:Int, b:Int ): Int = { a + b }
      add: (a: Int, b: Int)Int
      scala> numbers.foldLeft(0){ add }
      res0: Int = 55
      scala> numbers.foldLeft(0){ (acc, b) => acc + b }
      res1: Int = 55
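To see how foldLeft threads the accumulator from left to right, `scanLeft` can be used to expose every intermediate value. This is a minimal sketch on a shorter range than the slide's; `FoldTraceDemo` is an illustrative name.

```scala
object FoldTraceDemo {
  val numbers: Range.Inclusive = 1 to 5

  // foldLeft starts from the seed 0 and combines left to right:
  // ((((0 + 1) + 2) + 3) + 4) + 5
  val total: Int = numbers.foldLeft(0)(_ + _)

  // scanLeft runs the same computation but keeps each accumulator value,
  // beginning with the seed.
  val steps: List[Int] = numbers.scanLeft(0)(_ + _).toList
}
```

The last element of `steps` is always the foldLeft result.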
  • 27. FP: foldLeft
  • 28. FP: find the last item in an array
      scala> val ns = Array(20, 40, 60)
      ns: Array[Int] = Array(20, 40, 60)
      scala> ns.foldLeft(ns.head) {(acc, b) => b}
      res0: Int = 60
  • 29. FP: reverse an array w/ foldLeft
      scala> val ns = Array(20, 40, 60)
      ns: Array[Int] = Array(20, 40, 60)
      scala> ns.foldLeft( Array[Int]() ) { (acc, b) => b +: acc}
      res1: Array[Int] = Array(60, 40, 20)
  • 30. FP: reverse an array w/ foldLeft
  • 31. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 32. Scala for the Java / OO developer: • Interoperable w/ Java • Case classes • Mixins with traits
  • 33. Scala for the Java / OO developer: • case class – Implements equals(), hashCode(), toString() – Can be used in pattern matching
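A short sketch of what the case class bullets mean in practice. The `Person` fields mirror the constructor calls on slide 18, but the field names and the helper `describe` are assumptions, since the talk's Person.scala is not shown.

```scala
object CaseClassDemo {
  // The compiler generates equals, hashCode, toString, and an extractor.
  case class Person(first: String, last: String, age: Int)

  // Pattern matching destructures the case class; the guard checks the age.
  def describe(p: Person): String = p match {
    case Person(first, _, age) if age < 18 => s"$first is a minor"
    case Person(first, _, _)               => s"$first is an adult"
  }
}
```

With a case class, a value prints as `Person(John,Doe,17)` rather than the `Person@372a6e85` form seen in the slide 18 session, and two instances with equal fields compare equal.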
  • 34. Scala for the Java / OO developer: • http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html • map – <R> Stream<R> map(Function<? super T,? extends R> mapper) Returns a stream consisting of the results of applying the given function to the elements of this stream. This is an intermediate operation. • flatMap – <R> Stream<R> flatMap(Function<? super T,? extends Stream<? extends R>> mapper) Returns a stream consisting of the results of replacing each element of this stream with the contents of a mapped stream produced by applying the provided mapping function to each element. Each mapped stream is closed after its contents have been placed into this stream. (If a mapped stream is null an empty stream is used, instead.) This is an intermediate operation.
  • 35. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 36. Scala for the Spark developer • ResilientDistributedDataset (RDD) • A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. • http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
  • 37. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 38. Scala for the Big Data developer • Spark – Programming API in Scala – Implemented in Scala • Scalding – Scala DSL on top of Cascading – Data processing API and processing query planner used for defining, sharing, and executing data-processing workflows – Abstractions: tuples, pipes, source/sink taps • Algebird • Summingbird – Library that lets you write MapReduce programs that look like native Scala or Java collection transformations – Execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.
  • 39. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Hadoop/Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 40. Scala for the Big Data scientist / mathematician • Monoid – If you want to “attach” operations such as +, -, *, / or <= to data objects (e.g., Bloom filters), then you want to provide monoid forms of those data objects – Consists of • A set of objects • A binary operation that satisfies the monoid axioms • Monad – If you want to create a data processing pipeline that transforms the state of a data object – Composition
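The monoid axioms mentioned on slide 40 are associativity of the binary operation and the existence of an identity element. A hand-rolled sketch under those assumptions follows; the `Monoid` trait here is written for illustration and is not Algebird's API.

```scala
object MonoidDemo {
  // A monoid: a set A, an associative binary operation, and an identity element.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  val intSum: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  def setUnion[T]: Monoid[Set[T]] = new Monoid[Set[T]] {
    def zero: Set[T] = Set.empty[T]
    def plus(x: Set[T], y: Set[T]): Set[T] = x union y
  }

  // Any monoid can be folded the same way; associativity is what allows
  // partial results computed on separate partitions to be merged later.
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A = xs.foldLeft(m.zero)(m.plus)
}
```

For example, `combineAll(Seq(1, 2, 3))(intSum)` sums the integers, and the same function merges sets when given `setUnion`.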
  • 41. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Hadoop/Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 42. Scala for the system architect • Concurrency • Problem: – Threads – Shared mutable state – Locks • Solution: – Message-passing concurrency w/ Actors – Future, Promise • Abstractions – Actor • an object that processes a message • encapsulates state (state not shared) – ActorRef – Message, usually sent asynchronously – Mailbox – ActorSystem
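Future and Promise from slide 42 are available in the Scala standard library without Akka. The sketch below composes the two without locks or shared mutable state; `FutureDemo`, the values, and the 5-second timeout are arbitrary choices for illustration.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object FutureDemo {
  def total(): Int = {
    // A Future is a read-only handle to a value computed asynchronously.
    val computed: Future[Int] = Future { 20 + 1 }

    // A Promise is the write side: it is completed at most once,
    // and its future observes the result.
    val promise: Promise[Int] = Promise[Int]()
    promise.success(21)

    // Futures compose with a for-comprehension instead of explicit locking.
    val sum: Future[Int] = for {
      x <- computed
      y <- promise.future
    } yield x + y

    // Blocking is only for the demo; production code would normally chain callbacks.
    Await.result(sum, 5.seconds)
  }
}
```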
  • 43. Scala for the system architect: Akka • Fault tolerance – Supervision – Strategies • Resume, restart, stop, escalate, … • Scale out: remote actors – Via configuration
  • 44. Scala for the system architect
    • Parallel collections
      scala> import scala.collection.parallel.immutable._
      import scala.collection.parallel.immutable._
      scala> ParVector(10, 20, 30, 40, 50, 60, 70, 80, 90).map { x =>
           |   println(Thread.currentThread.getName); x / 2 }
      ForkJoinPool-1-worker-13
      ForkJoinPool-1-worker-1
      ForkJoinPool-1-worker-1
      ForkJoinPool-1-worker-9
      ForkJoinPool-1-worker-11
      ForkJoinPool-1-worker-5
      ForkJoinPool-1-worker-3
      ForkJoinPool-1-worker-15
      ForkJoinPool-1-worker-7
      res0: scala.collection.parallel.immutable.ParVector[Int] = ParVector(5, 10, 15, 20, 25, 30, 35, 40, 45)
  • 45. Sequential collections
  • 46. Parallel collections
  • 47. Outline • Spark – Hadoop ecosystem • Scala – Background • Why Scala? – For the computer scientist – For the Java / OO programmer – For the Spark developer – For the Big Data developer – For the Big Data scientist / mathematician – For the system architect
  • 48. Different perspectives on an elephant Scala
  • 49. Spark in the Hadoop ecosystem
  • 50. References • http://scala-lang.org/ • Scala in Action, Nilanjan Raychaudhuri • Grokking Functional Programming, Aslam Khan • Michael Noll
  • 51. Intro to Apache Spark: Fast cluster computing engine for Hadoop Intro to Scala: Object-oriented and functional language for the Java Virtual Machine ACM SIGKDD, 7/9/2014 Roger Huang Lead System Architect Digital & Mobile Products Architecture rohuang@visa.com rog4096@yahoo.com
