A presentation at the Montreal Apache Spark Meetup about automatic feature generation and some parallelism challenges related to Amdahl's Law and MLlib's implementation of LASSO regression.
5. Examples
● Spam detection: presence or absence of certain email
headers, the email structure, the frequency of specific
terms, etc.
● Time-series analysis: lags of a variable, differentials of a
variable, polynomial forms, logarithms, etc.
● Computer vision: pixel color histograms, edges, etc.
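The time-series bullet above can be sketched in a few lines of plain Scala — a minimal illustration of generating lag, difference, and log features from a raw series (all names here are illustrative, not from the talk):

```scala
// Minimal sketch: deriving time-series features (lags, first differences,
// log transforms) from a raw series. Early positions where the feature is
// undefined are represented as None.
object TimeSeriesFeatures {
  // Lag by k steps: feature(t) = x(t - k)
  def lag(xs: Vector[Double], k: Int): Vector[Option[Double]] =
    xs.indices.map(i => if (i >= k) Some(xs(i - k)) else None).toVector

  // First difference: x(t) - x(t - 1)
  def diff(xs: Vector[Double]): Vector[Option[Double]] =
    xs.indices.map(i => if (i >= 1) Some(xs(i) - xs(i - 1)) else None).toVector

  // Log transform (only defined for positive values)
  def log(xs: Vector[Double]): Vector[Option[Double]] =
    xs.map(x => if (x > 0) Some(math.log(x)) else None)
}
```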
6. Feature generation
● This step is generally the most expensive/difficult
in the machine learning "pipeline"
● There's no science to it; it's a black art (not a
deductive process, but an inductive one)
● Relies on a combination of domain knowledge
and trial and error.
8. The discovery process
● For the moment let's abstract out the domain
knowledge half of the equation.
● It boils down to a search/optimization problem:
find optimum multi-dimensional point in the
parameter space.
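As a toy illustration of "feature discovery as search", here is a sketch that enumerates candidate transformations of a raw variable and keeps the one that best fits the target (scored by the squared error of a one-parameter least-squares fit; the candidate set and names are illustrative assumptions):

```scala
// Sketch: feature discovery as a search/optimization problem over a space
// of candidate transformations.
object FeatureSearch {
  val candidates: Map[String, Double => Double] = Map(
    "identity" -> ((x: Double) => x),
    "square"   -> ((x: Double) => x * x),
    "sqrt"     -> ((x: Double) => math.sqrt(x))
  )

  // Squared error of the best scalar beta fitting y ≈ beta * f(x)
  private def score(fx: Seq[Double], y: Seq[Double]): Double = {
    val beta = fx.zip(y).map { case (a, b) => a * b }.sum / fx.map(a => a * a).sum
    fx.zip(y).map { case (a, b) => math.pow(b - beta * a, 2) }.sum
  }

  // Exhaustive search: evaluate every candidate, keep the lowest error
  def bestTransform(x: Seq[Double], y: Seq[Double]): String =
    candidates.minBy { case (_, f) => score(x.map(f), y) }._1
}
```

In practice the candidate space is vastly larger (compositions, interactions, lags), which is exactly why it can never be searched exhaustively.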
10. Theory vs Practice: a disclaimer
● In theory, the parameter space is infinite-
dimensional (there's an infinity of
transformations you can apply to the raw input
variables!)
● In practice, it's finite, but you can still never claim
to have searched the entire space of possible
solutions.
● Bottom line: utility. Is automation a useful,
time-saving tool for data scientists?
11. Domain knowledge
● Domain knowledge is a significant optimization
that humans use, and one that is difficult for a
machine to exploit.
● Semantic Web and knowledge ontologies:
● User-specified domain knowledge
– Variable type (quantitative, categorical, text,
entity, etc.)
– Time series or cross-sectional dataset
– Concept inheritance
– Known relationships / features of significance
● Access to public repositories of d.k.
15. My encounter with Amdahl's law
● Ran my feature discovery algorithm on 100
million rows of data, and put it on the cloud to
make full use of parallelization.
● Quickly discovered that my algorithm didn't scale
beyond 8 cores. Adding cores didn't improve
performance.
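That plateau is exactly what Amdahl's law predicts: if only a fraction p of the work parallelizes, speedup on n cores is capped at 1 / ((1 − p) + p / n). A minimal sketch (the 90% figure below is an illustrative assumption, not a measurement from the talk):

```scala
// Amdahl's law: best possible speedup when fraction p of the work is
// parallel and runs on n cores; the serial fraction (1 - p) caps it.
object Amdahl {
  def speedup(p: Double, n: Int): Double = 1.0 / ((1.0 - p) + p / n)
}

// With 90% of the work parallel, 8 cores already give most of the benefit,
// and no number of cores can ever exceed 1 / (1 - 0.9) = 10x.
```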
18. Cause: low-level parallelism
● A lot more overhead, shuffling, etc.
● Control flow often implies collecting
intermediate results to branch on them
(.count(), .reduce(), .sum(), .first(), etc.) – hence
non-parallel bottlenecks
● Every line of code not in a Spark closure is a
non-parallel bottleneck! (compare with the
mapPartitions() approach of high-level
parallelism)
19. LASSOWithSGD
● LASSOWithSGD: SGD is Stochastic Gradient
Descent (an optimization algorithm). The
per-iteration aggregation of the cost/gradient is
the non-parallelizable section.
● Gradient Descent is an iterative algorithm that
finds the point of minimum model error in the
space of parameter values.
21. LASSOWithSGD
● Calculating the gradients of the points can be done in parallel.
● But you have to reduce to a local sum, which implies serializing,
transferring to the driver, and deserializing on the driver. Then you
have to update the weights, serialize them, send them back to the
executors, and move on to the next step.
● This is what happens, as a general rule, with low-level parallelism:
you have to aggregate results locally to proceed to the next
iteration. This is the non-parallelizable section of the code.
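The per-iteration pattern described above can be simulated with plain Scala collections standing in for Spark (illustrative, single machine — not MLlib's actual code). Each iteration maps gradients per "partition" in a parallelizable step, then reduces to one sum on the "driver", where the weight is updated sequentially; that sequential tail is the Amdahl bottleneck:

```scala
// Sketch of gradient descent for 1-D least squares: minimize
// sum((y - w * x)^2); the gradient wrt w is -2 * x * (y - w * x).
object SgdPattern {
  def fit(partitions: Seq[Seq[(Double, Double)]], steps: Int, lr: Double): Double = {
    var w = 0.0
    for (_ <- 1 to steps) {
      // parallelizable: per-partition partial gradients (a map step)
      val partials = partitions.map(_.map { case (x, y) => -2.0 * x * (y - w * x) }.sum)
      // non-parallelizable: reduce to the driver, update the weight,
      // then broadcast the new weight before the next iteration
      val grad = partials.sum / partitions.map(_.size).sum
      w -= lr * grad
    }
    w
  }
}
```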
22. Solution: High-level parallelism
● Switched my machine learning libs to traditional
single-threaded ones (used SMILE)
● Instead I'm putting everything in a
mapPartitions() closure.
● The data is broadcast to all executors.
● No shuffle, only one local aggregation at the end
of the analysis. Highly parallelizable.
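The layout can be sketched without Spark, with a plain .map standing in for mapPartitions() (illustrative names; a real version would broadcast the dataset and use SMILE's fitters inside the closure). Each "partition" holds a slice of the candidate-feature space, a complete single-threaded model is fit inside the closure, and the only aggregation is one reduce at the very end:

```scala
// Sketch of high-level parallelism: independent full fits per partition,
// one final aggregation, no per-iteration synchronization.
object HighLevelParallelism {
  // Closed-form 1-D least squares through the origin: w = sum(x*y)/sum(x*x);
  // returns the residual squared error of feature f on the data
  def fitError(data: Seq[(Double, Double)], f: Double => Double): Double = {
    val fx = data.map { case (x, _) => f(x) }
    val y  = data.map(_._2)
    val w  = fx.zip(y).map { case (a, b) => a * b }.sum / fx.map(a => a * a).sum
    fx.zip(y).map { case (a, b) => math.pow(b - w * a, 2) }.sum
  }

  def bestFeature(data: Seq[(Double, Double)],
                  partitions: Seq[Seq[(String, Double => Double)]]): String = {
    // In Spark this would be mapPartitions() over the candidate space,
    // with `data` broadcast; plain .map stands in for it here
    val perPartitionBest = partitions.map { part =>
      part.map { case (name, f) => (name, fitError(data, f)) }.minBy(_._2)
    }
    // The only aggregation: one local reduce at the end
    perPartitionBest.minBy(_._2)._1
  }
}
```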
24. Conclusions
● High-level parallelism limitation: the dataset must
fit inside an executor's memory! (so not Big
Data: a few gigabytes at most!)
● MLlib is for real Big Data (terabytes+).
● Even if your feature space is Big Data, you're
probably better off without MLlib.