Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Reactive Feature Generation
with Spark and MLlib
Jeff Smith
x.ai
x.ai is a personal assistant
who schedules meetings for you
M A N N I N G
Jeff Smith
REACTIVE MACHINE LEARNING
Reactive Machine Learning
Machine Learning Systems
Traits of Reactive Systems
Responsive
Resilient
Elastic
Message-Driven
Reactive Strategies
Machine Learning Data
Machine Learning Data
Machine Learning Systems
INTRODUCING FEATURES
Microblogging Data
Squawks
Squawkers
Super
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
FEATURE TRANSFORMS
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Extract Transform
Features
trait FeatureType[V] {
val name: String
}
trait Feature[V] extends FeatureType[V] {
val value: V
}
Features
case class IntFeature(name: String, value: Int)
extends Feature[Int]
case class BooleanFeature(name: String, valu...
Named Transforms
def binarize(feature: IntFeature, threshold: Double) = {
BooleanFeature("binarized-" + feature.name,
feat...
Non-Trivial Transforms
def categorize(thresholds: List[Double]) = {
(rawFeature: DoubleFeature) => {
IntFeature("categoriz...
Standardizing Naming
trait Named {
def name(inputFeature: Feature[Any]) : String = {
inputFeature.name + "-" +
Thread.curr...
Lineages
"cleaned-normalized-categorized-interactions"
interactions
categorized
normalized
cleaned
PIPELINES
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Multi-Stage Generation
Raw Data ExtractedExtract Transform Features
Pipeline Composition
def extract(rawSquawks: RDD[JsonDocument]):
RDD[IntFeature] = {
???
}
def transform(inputFeatures: RD...
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Pipelines
Don’t orchestrate
when you can compose
Pipeline Failure
Raw Data FeaturesFeature Generation Pipeline
Raw Data FeaturesFeature Generation Pipeline
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Supervision
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Reactive DB Drivers Cluster Managers Feature V...
Reactive Database Drivers
Couchbase
MongoDB
Cassandra
Reactive Database Drivers
val rawSquawks: RDD[JsonDocument] = sc.couchbaseView(
ViewQuery.from("squawks", "by_squawk_id"))...
Cluster Managers
Spark Standalone
Mesos
YARN
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Reactive DB Drivers Cluster Managers Feature V...
FEATURE COLLECTIONS
Original Features
object SquawkLength extends FeatureType[Int]
object Super extends LabelType[Boolean]
val originalFeature...
Basic Features
object PastSquawks extends FeatureType[Int]
val basicFeatures = originalFeatures + PastSquawks
More Features
object MobileSquawker extends FeatureType[Boolean]
val moreFeatures = basicFeatures + MobileSquawker
Feature Collections
case class FeatureCollection(id: Int,
createdAt: DateTime,
features: Set[_ <: FeatureType[_]],
label: ...
Feature Collections
val earlierCollection = FeatureCollection(101,
earlier,
basicFeatures,
label)
val latestCollection = F...
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Fallback Collections
val FallbackCollection = FeatureCollection(404,
beginningOfTime,
originalFeatures,
label)
Fallback Collections
def validCollection(collections: RDD[FeatureCollection],
invalidFeatures: Set[FeatureType[_]]) = {
va...
VALIDATING FEATURES
Supervising Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Reactive DB Drivers Cluster Managers Feature V...
Predicting Super Squawkers
Training Instances
val instances = Seq(
(123, Vectors.dense(0.2, 0.3, 16.2, 1.1), 0.0),
(456, Vectors.dense(0.1, 1.3, 11.3...
Selection Setup
val featuresName = "features"
val labelName = "isSuper"
val instancesDF = sqlContext.createDataFrame(insta...
Feature Selection
val selector = new ChiSqSelector()
.setNumTopFeatures(K)
.setFeaturesCol(featuresName)
.setLabelCol(labe...
Feature Selection
val selector = new ChiSqSelector()
.setNumTopFeatures(K)
.setFeaturesCol(featuresName)
.setLabelCol(labe...
Back to RDDs
val labeledPoints = sc.parallelize(instances.map({
case (id, features, label) =>
LabeledPoint(label = label, ...
Validating Features
def validateSelection(labeledPoints: RDD[LabeledPoint],
topK: Int, cutoff: Double) = {
val pValues = S...
Persisting Validations
case class ValidatedFeatureCollection(id: Int,
createdAt: DateTime,
features: Set[_ <: FeatureType[...
SUMMARY
Machine Learning Systems
Traits of Reactive Systems
Reactive Strategies
Machine Learning Data
Feature Generation
Raw Data FeaturesFeature Generation Pipeline
Reactive Feature Generation
with Spark and MLlib
reactivemachinelearning.com
@jeffksmithjr
@xdotai
Upcoming SlideShare
Loading in …5
×

Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

2,232 views

Published on

Spark Summit East talk

Published in: Data & Analytics
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

  1. 1. Reactive Feature Generation with Spark and MLlib Jeff Smith x.ai
  2. 2. x.ai is a personal assistant who schedules meetings for you
  3. 3. M A N N I N G Jeff Smith
  4. 4. REACTIVE MACHINE LEARNING
  5. 5. Reactive Machine Learning
  6. 6. Machine Learning Systems
  7. 7. Traits of Reactive Systems
  8. 8. Responsive
  9. 9. Resilient
  10. 10. Elastic
  11. 11. Message-Driven
  12. 12. Reactive Strategies
  13. 13. Machine Learning Data
  14. 14. Machine Learning Data
  15. 15. Machine Learning Systems
  16. 16. INTRODUCING FEATURES
  17. 17. Microblogging Data Squawks Squawkers Super
  18. 18. Feature Generation Raw Data FeaturesFeature Generation Pipeline
  19. 19. FEATURE TRANSFORMS
  20. 20. Feature Generation Raw Data FeaturesFeature Generation Pipeline Extract Transform
  21. 21. Features trait FeatureType[V] { val name: String } trait Feature[V] extends FeatureType[V] { val value: V }
  22. 22. Features case class IntFeature(name: String, value: Int) extends Feature[Int] case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]
  23. 23. Named Transforms def binarize(feature: IntFeature, threshold: Double) = { BooleanFeature("binarized-" + feature.name, feature.value > threshold) }
  24. 24. Non-Trivial Transforms def categorize(thresholds: List[Double]) = { (rawFeature: DoubleFeature) => { IntFeature("categorized-" + rawFeature.name, thresholds.sorted .zipWithIndex .find { case (threshold, i) => rawFeature.value < threshold }.getOrElse((None, -1)) ._2) } }
  25. 25. Standardizing Naming trait Named { def name(inputFeature: Feature[Any]) : String = { inputFeature.name + "-" + Thread.currentThread.getStackTrace()(3).getMethodName } } object Binarizer extends Named { def binarize(feature: IntFeature, threshold: Double) = { BooleanFeature(name(feature), feature.value > threshold) } }
  26. 26. Lineages "cleaned-normalized-categorized-interactions" interactions categorized normalized cleaned
  27. 27. PIPELINES
  28. 28. Feature Generation Raw Data FeaturesFeature Generation Pipeline
  29. 29. Multi-Stage Generation Raw Data ExtractedExtract Transform Features
  30. 30. Pipeline Composition def extract(rawSquawks: RDD[JsonDocument]): RDD[IntFeature] = { ??? } def transform(inputFeatures: RDD[Feature[Int]]): RDD[BooleanFeature] = { ??? } val trainableFeatures = transform(extract(rawSquawks))
  31. 31. Feature Generation Raw Data FeaturesFeature Generation Pipeline
  32. 32. Pipelines Don’t orchestrate when you can compose
  33. 33. Pipeline Failure Raw Data FeaturesFeature Generation Pipeline Raw Data FeaturesFeature Generation Pipeline
  34. 34. Supervising Feature Generation Raw Data FeaturesFeature Generation Pipeline Supervision
  35. 35. Supervising Feature Generation Raw Data FeaturesFeature Generation Pipeline Reactive DB Drivers Cluster Managers Feature Validation
  36. 36. Reactive Database Drivers Couchbase MongoDB Cassandra
  37. 37. Reactive Database Drivers val rawSquawks: RDD[JsonDocument] = sc.couchbaseView( ViewQuery.from("squawks", "by_squawk_id")) .map(_.id) .couchbaseGet[JsonDocument]()
  38. 38. Cluster Managers Spark Standalone Mesos YARN
  39. 39. Supervising Feature Generation Raw Data FeaturesFeature Generation Pipeline Reactive DB Drivers Cluster Managers Feature Validation
  40. 40. FEATURE COLLECTIONS
  41. 41. Original Features object SquawkLength extends FeatureType[Int] object Super extends LabelType[Boolean] val originalFeatures: Set[FeatureType] = Set(SquawkLength) val label = Super
  42. 42. Basic Features object PastSquawks extends FeatureType[Int] val basicFeatures = originalFeatures + PastSquawks
  43. 43. More Features object MobileSquawker extends FeatureType[Boolean] val moreFeatures = basicFeatures + MobileSquawker
  44. 44. Feature Collections case class FeatureCollection(id: Int, createdAt: DateTime, features: Set[_ <: FeatureType[_]], label: LabelType[_])
  45. 45. Feature Collections val earlierCollection = FeatureCollection(101, earlier, basicFeatures, label) val latestCollection = FeatureCollection(202, now, moreFeatures, label) val featureCollections = sc.parallelize( Seq(earlierCollection, latestCollection))
  46. 46. Feature Generation Raw Data FeaturesFeature Generation Pipeline
  47. 47. Fallback Collections val FallbackCollection = FeatureCollection(404, beginningOfTime, originalFeatures, label)
  48. 48. Fallback Collections def validCollection(collections: RDD[FeatureCollection], invalidFeatures: Set[FeatureType[_]]) = { val validCollections = collections.filter( fc => !fc.features .exists(invalidFeatures.contains)) .sortBy(collection => collection.id) if (validCollections.count() > 0) { validCollections.first() } else FallbackCollection }
  49. 49. VALIDATING FEATURES
  50. 50. Supervising Feature Generation Raw Data FeaturesFeature Generation Pipeline Reactive DB Drivers Cluster Managers Feature Validation
  51. 51. Predicting Super Squawkers
  52. 52. Training Instances val instances = Seq( (123, Vectors.dense(0.2, 0.3, 16.2, 1.1), 0.0), (456, Vectors.dense(0.1, 1.3, 11.3, 1.2), 1.0), (789, Vectors.dense(1.2, 0.8, 14.5, 0.5), 0.0) )
  53. 53. Selection Setup val featuresName = "features" val labelName = "isSuper" val instancesDF = sqlContext.createDataFrame(instances) .toDF("id", featuresName, labelName) val K = 2
  54. 54. Feature Selection val selector = new ChiSqSelector() .setNumTopFeatures(K) .setFeaturesCol(featuresName) .setLabelCol(labelName) .setOutputCol("selectedFeatures")
  55. 55. Feature Selection val selector = new ChiSqSelector() .setNumTopFeatures(K) .setFeaturesCol(featuresName) .setLabelCol(labelName) .setOutputCol("selectedFeatures") val selectedFeatures = selector.fit(instancesDF) .transform(instancesDF)
  56. 56. Back to RDDs val labeledPoints = sc.parallelize(instances.map({ case (id, features, label) => LabeledPoint(label = label, features = features) }))
  57. 57. Validating Features def validateSelection(labeledPoints: RDD[LabeledPoint], topK: Int, cutoff: Double) = { val pValues = Statistics.chiSqTest(labeledPoints) .map(result => result.pValue) .sorted pValues(topK) < cutoff }
  58. 58. Persisting Validations case class ValidatedFeatureCollection(id: Int, createdAt: DateTime, features: Set[_ <: FeatureType[_]], label: LabelType[_], passedValidation: Boolean, cutoff: Double)
  59. 59. SUMMARY
  60. 60. Machine Learning Systems
  61. 61. Traits of Reactive Systems
  62. 62. Reactive Strategies
  63. 63. Machine Learning Data
  64. 64. Feature Generation Raw Data FeaturesFeature Generation Pipeline
  65. 65. Reactive Feature Generation with Spark and MLlib reactivemachinelearning.com @jeffksmithjr @xdotai

×