Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East talk by Jeff Smith and Rohan Aletty

Something really exciting and largely unnoticed is going on in the Spark ecosystem. As data scientists and engineers learn Spark, they’re actually all implicitly learning a much older, more general topic: typed functional programming. While Spark itself was built on an accumulation of powerful computer science concepts from functional programming and other areas, developers are often encountering these ideas in the context of Spark for the first time. It turns out that Spark makes an excellent platform for learning concepts like immutability, higher-order and anonymous functions, laziness, and monadic operators.

This talk will discuss how Spark can be used as a teaching tool, to build skills in areas like typed functional programming. We’ll explore a skill-building curriculum that can be used with a data scientist or engineer who only has experience in imperative, dynamically-typed languages like Python. This curriculum introduces the core concepts of functional programming and type theory, while providing learners the opportunity to immediately apply their skills at massive scale, using the power of Spark’s painless scalability and resilience.

Based on the experience of building machine learning teams at x.ai and other data-centric startups, this curriculum is the foundation of building poly-skilled, highly autonomous team members who can build scalable intelligent systems. We’ll work from foundational concepts of Scala and functional programming towards a fully implemented machine learning pipeline, all using Spark and MLlib. Unique new features of Spark like Datasets and Structured Streaming will be particularly useful in this effort. Using this approach, teams can help members in all roles learn how to use sophisticated programming techniques that ensure correctness at scale. With these skills in their toolbox, data scientists and engineers often find that building powerful machine learning systems is intuitive, easy, and even fun.

Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East talk by Jeff Smith and Rohan Aletty

  1. Spark as the Gateway Drug to Typed Functional Programming (Jeff Smith, Rohan Aletty, x.ai)
  2. Real World AI • Scale is increasing • Complexity is increasing • Human brain size is constant
  3. System Complexity (architecture diagram): Data Ingest, Annotation, Routing, Response Generation; backed by Annotation Services, Models, and a Knowledge Base
  4. Problem Complexity
  5. Complex Intelligence
  6. Datanauts
  7. Tools
  8. Scala • Bleeding edge • Real world
  9. Spark • Incredibly powerful • Easy to use
  10. Typed Functional Programming • Powerful abstractions • Tough learning curve
  11. Functions
  12. Methods • Collection of statements • Might have side effects • On an object
  13. Methods

      public class Dataset {
          private List<Double> observations;
          private Double average;

          public Dataset(List<Double> inputData) {
              observations = inputData;
          }
      }

  14. Methods

      public class Dataset {
          public double getAverage() {
              Double runningSum = 0.0;
              for (Double observation : observations) {
                  runningSum += observation;
              }
              average = runningSum / observations.size();
              return average;
          }
      }

  15. Methods

      public class Dataset {
          public void setObservations(List<Double> inputData) {
              observations = inputData;
          }
      }

  16. Methods

      public class Dataset {
          private List<Double> observations;
          private Double average;

          public Dataset(List<Double> inputData) {
              observations = inputData;
          }

          public double getAverage() {
              Double runningSum = 0.0;
              for (Double observation : observations) {
                  runningSum += observation;
              }
              average = runningSum / observations.size();
              return average;
          }

          public void setObservations(List<Double> inputData) {
              observations = inputData;
          }
      }
  17. Functions • Collection of expressions • Returns a value • Are objects (in Scala) • Can be in-lined
  18. Functions in Scala

      val inputData = List(1.0, 2.0, 3.0)
  19. Functions in Scala

      def average(observations: List[Double]): Double =
          observations.sum / observations.size

      average(inputData)
  20. Functions in Scala

      def add(x: Double, y: Double) = {
          x + y
      }

      val sum = inputData.foldLeft(0.0)(add)
      val average = sum / inputData.size

  21. Functions in Scala

      val sum = inputData.foldLeft(0.0)(add)
      val average = sum / inputData.size

      inputData.foldLeft(0.0)(_ + _) / inputData.size

  22. Functions in Spark

      inputData.foldLeft(0.0)(_ + _) / inputData.size

      val observations = sc.parallelize(inputData)
      observations.fold(0.0)(_ + _) / observations.count()
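For comparison (a sketch using the stock RDD API, not shown on the slide), an RDD[Double] also picks up a built-in mean() through DoubleRDDFunctions, which computes the same average in one call:

      observations.mean()  // same result as fold(0.0)(_ + _) / observations.count()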
  23. Immutability
  24. Mutation • Changing an object
  25. Mutation

      visits = {"Church": 2, "Backus": 1, "McCarthy": 4}
      old_value = visits["Backus"]
      visits["Backus"] = old_value + 1

  26. Immutability • Never changing objects
  27. Immutability in Scala

      val visits = Map("Church" -> 2, "Backus" -> 1, "McCarthy" -> 4)
      val updatedVisits = visits.updated("Backus", 2)
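A quick follow-up sketch (not on the slide) showing that updated leaves the original Map untouched and returns a new one:

      visits("Backus")        // still 1: the original Map is unchanged
      updatedVisits("Backus") // 2: updated produced a fresh Map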
  28. Immutability in Spark

      val manyVisits = sc.parallelize(visits.toSeq)
      val additionalVisit = sc.parallelize(Seq(("Backus", 1)))
      val updatedVisits = manyVisits.union(additionalVisit)
        .aggregateByKey(0)(_ + _, _ + _)
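To inspect the merged counts (a sketch using the stock collect action; pair ordering in the result is not guaranteed):

      updatedVisits.collect()
      // e.g. Array(("Church", 2), ("McCarthy", 4), ("Backus", 2))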
  29. Recap
  30. Concepts • Higher-order functions • Anonymous functions • Purity of functions
  31. Concepts • Currying • Referential transparency • Closures • Resilient Distributed Datasets
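A minimal sketch tying several of these recap concepts together (the names addN and incBy3 are hypothetical, for illustration only):

      // Currying-style definition: a function that returns a function
      def addN(n: Int): Int => Int = x => x + n

      // Closure: incBy3 captures n = 3 from its defining scope
      val incBy3 = addN(3)
      incBy3(4)  // 7

      // Higher-order + anonymous function: map takes a function value
      List(1.0, 2.0, 3.0).map(_ * 2)  // List(2.0, 4.0, 6.0)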
  32. Lazy Evaluation
  33. Functional Programming — Lazy Evaluation • Delaying evaluation of an expression until its value is needed • Two major advantages of lazy evaluation • Deferring computation lets the program evaluate only what is necessary • The evaluation scheme can be reorganized to be more efficient
  34. Spark — Lazy Evaluation • All transformations are lazy • Each transformation is simply added to the Spark computation DAG • Example DAGs
  35. Spark — Lazy Evaluation

      val rdd1 = sc.parallelize(...)
      val rdd2 = rdd1.map(...)
      val rdd3 = rdd1.map(...)
      val rdd4 = rdd1.map(...)

      rdd3.take(5)
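Before the take action runs, nothing has been computed; the three maps only record lineage. A quick way to see this (a sketch using the stock toDebugString method on RDDs, not shown on the slide):

      println(rdd3.toDebugString)  // prints rdd3's recorded lineage
      // take(5) then evaluates only this lineage; rdd2 and rdd4 are never computed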
  36. Spark — Learning Laziness • Advantage 1 (deferred computation): only the parts of the DAG that an action actually needs are evaluated • Advantage 2 (optimized evaluation scheme): pipelining within Spark stages makes execution more efficient
  37. Types
  38. Functional Programming — Type Systems • Mechanism for defining algebraic data types (ADTs), which are useful for program structure • i.e. “let’s group this data together and brand it a new type” • Compile-time guarantees of program correctness • e.g. “no, you cannot add Foo to Bar”
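A minimal sketch of that compile-time guarantee, using hypothetical Foo and Bar types:

      case class Foo(value: Int)
      case class Bar(value: Int)

      // Foo(1) + Bar(2)  // rejected at compile time: no such operation is defined
      val total = Foo(1).value + Bar(2).value  // the explicit, type-checked route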
  39. Spark — Types • RDDs (typed), Datasets (typed), DataFrames (untyped) • Types provide strong schema enforcement on a dataset, preventing unexpected behavior
  40. Spark — Types

      case class Person(name: String, age: Int)
      val peopleDS = spark.read.json(path).as[Person]
      // typed grouping uses groupByKey, not the untyped column-based groupBy
      val ageGroupedDS = peopleDS.groupByKey(_.age)
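As a follow-up (an assumption about the stock Dataset API, not shown on the slide), the grouped result is a KeyValueGroupedDataset, which offers typed aggregations such as count:

      val countsByAge = ageGroupedDS.count()  // Dataset[(Int, Long)]: one row per age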
  41. Spark — Learning Types • Spark through Scala also introduces pattern matching • ADTs serve as both product types and union types • Makes code easier to reason about • Gives us compile-time safety
  42. Spark — Learning Types

      trait Person { def name: String }
      case class Student(name: String, grade: String) extends Person
      case class Professional(name: String, job: String) extends Person

      val personRDD: RDD[Person] = sc.parallelize(…)

      // working with both union and product types
      val mappedRDD: RDD[String] = personRDD.map {
        case Student(name, grade) => grade
        case Professional(name, job) => job
      }
  43. Spark — Learning Types

      val rdd1: RDD[Person] = sc.parallelize(...)
      val rdd2: RDD[String] = rdd1.map("name: " + _)      // Compilation error!
      val rdd3: RDD[String] = rdd1.map("name: " + _.name) // It works!
  44. Monads
  45. Functional Programming — Monads • In category theory: “a monad in X is just a monoid in the category of endofunctors of X” • In functional programming, refers to a container that can: • Inject a value into the container • Perform operations on values, returning a container with new values • Flatten nested containers into a single container
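All three operations are visible on Scala's stock Option type (a sketch, not from the slides):

      val boxed  = Option(42)                 // inject a value into the container
      val bumped = boxed.map(_ + 1)           // operate on the contained value(s)
      val flat   = Option(Option(1)).flatten  // flatten nested containers into one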
  46. Scala — Monads!

      trait Monad[M[_]] {
        // constructs a Monad instance from the given value, e.g. List(1)
        def apply[T](v: T): M[T]

        // effectively lets you transform values within a Monad
        def bind[T, U](m: M[T])(fn: (T) => M[U]): M[U]
      }
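A minimal sketch of an instance of this trait for Option (assuming the corrected M[_] signature above; optionMonad is a hypothetical name):

      val optionMonad = new Monad[Option] {
        def apply[T](v: T): Option[T] = Some(v)
        def bind[T, U](m: Option[T])(fn: T => Option[U]): Option[U] = m.flatMap(fn)
      }

      optionMonad.bind(optionMonad(2))(x => Some(x * 2))  // Some(4)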
  47. Scala — Monads! • Many monads in Scala • List, Set, Option, etc. • Powerful line of thinking • Helps code comprehension • Reduces error-checking logic (pattern matching!) • Can build further transformations: map(), filter(), foreach(), etc.
  48. Spark — Learning Monads? • We have many “computation builders”: RDDs, Datasets, DataFrames • Containers on which transformations can be applied • Similar to monads, though not identical • No unit function to wrap constituent values • Cannot lift arbitrary types into the flatMap function unconstrained
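RDD's flatMap, for instance, plays the role of monadic bind within those constraints (a sketch, not from the slides):

      val lines  = sc.parallelize(Seq("typed functional programming"))
      val tokens = lines.flatMap(line => line.split(" "))  // bind-like: element => collection
      tokens.collect()  // Array("typed", "functional", "programming")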
  49. For Later
  50. Conclusions • Spark introduces all types of devs to Scala • Scala helps people learn typed functional programming • Typed functional programming improves Spark development
  51. x.ai @xdotai hello@human.x.ai New York, New York
  52. Use the code ctwsparks17 for 40% off!
  53. Thank You
