Hunter Kelly
All the Topics on the Interwebs
Perhaps this?
Or maybe this?
embassy wikileaks assange
german merkel cables
snowden speigel spying
❖ What are we actually doing?
➢ Mining web pages for insights
❖ How?
➢ Using Machine Learning to do heavy lifting
■ Use Classifiers to filter/bucket the
■ Build Topic Models to try to discover
concepts related to words
❖ Getting Data
➢ Common Crawl
❖ Manipulating Data
➢ Spark
➢ Sparkling
■ RDDs
■ DataFrames
❖ Data Science
➢ MLLib
➢ Classification - Random Forests™
➢ LDA (Latent Dirichlet Allocation)
Common Crawl
❖ DMOZ
➢ “The largest human edited directory of the Web”
➢ Useful when you think of it in terms of
“free crowdsourced labeled data”
➢ Fairly ancient, borderline decrepit
➢ Crowdsourced is a double edged sword
❖ Common Crawl (CC)
➢ “an open repository of web crawl data
that can be accessed and analyzed by anyone”
➢ Monthly crawls
➢ Readily accessible index
➢ Tons of free data - raw, links, plain text
❖ How to use them together!
➢ Use DMOZ to get samples of positive and
negative “seed links”
➢ Lookup and expand your “seed links”
using CC index
➢ Fetch your data with little/no fuss using
CC index information
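To make the "little/no fuss" fetch concrete: a CC index record typically points at an archive file together with a byte offset and length, so a single page can be pulled with one HTTP range request. The following is a minimal sketch, assuming the index record has already been parsed into a map with `:filename`, `:offset` and `:length` keys (the key names and sample values are illustrative and not taken from the talk).

```clojure
(defn range-header
  "Builds the HTTP Range header value for one archive record.
  HTTP byte ranges are inclusive, hence the dec on the end position."
  [{:keys [offset length]}]
  (str "bytes=" offset "-" (dec (+ offset length))))

;; Hypothetical index record; real records carry more fields.
(def sample-record
  {:filename "example/path/to/archive.warc.gz"
   :offset 1000
   :length 250})

(range-header sample-record)
;; => "bytes=1000-1249"
```

The header value would be sent along with a GET for the archive file, so only the bytes of that one record are transferred.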
Spark & Sparkling
❖ Apache Spark
➢ The “next big thing”
➢ Or arguably the “current” big thing
❖ Sparkling
➢ Clojure bindings to Spark
➢ Great Presentation (highly recommended)
➢ RDDs
➢ DataFrames
❖ RDDs
➢ Resilient Distributed Datasets
➢ Easy to think of them as partitioned (or
sharded) seqs
➢ Transformations (map, filter, etc) are lazy
➢ Operations (count, collect, reduce, etc)
cause evaluation
➢ Very familiar paradigms for Clojure
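The lazy-transformation versus evaluating-operation split has a close analogue in plain Clojure sequences. The sketch below uses only `clojure.core` (no Spark), and the function name `lazy-demo` is made up for illustration: `map` plays the role of a transformation and `reduce` the role of an operation that forces evaluation.

```clojure
(defn lazy-demo
  "Counts how many elements have been computed before and after
  a reducing step forces the lazy sequence."
  []
  (let [evaluated (atom 0)
        ;; "Transformation": only describes the work, nothing runs yet.
        squares   (map (fn [x] (swap! evaluated inc) (* x x)) (range 5))
        before    @evaluated
        ;; "Operation": reduce walks the sequence and forces evaluation.
        total     (reduce + squares)
        after     @evaluated]
    {:before before :total total :after after}))

(lazy-demo)
;; => {:before 0, :total 30, :after 5}
```

RDDs behave similarly in spirit, except that the work is partitioned across executors rather than realized in one JVM.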
(require '[clojure.core.reducers :as r])

;; generate-multiples is a helper defined elsewhere in the talk's code.
(defn sieve-prime-multiples [n primes numbers]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples (->> primes
                             (r/mapcat #(generate-multiples % n (odd? %)))
                             (into #{}))
        candidates (->> numbers
                        (r/remove prime-multiples))
        new-primes (->> candidates
                        (r/filter #(< % upto))
                        (into []))
        remaining (->> candidates
                       (r/remove (set new-primes)))]
    [new-primes remaining]))
Clojure using Reducers
(require '[sparkling.core :as spark])

;; Same sieve step as the Reducers version, with RDDs in place of collections.
(defn sieve-prime-multiples [ctx n primes numbers-rdd]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples-rdd (->> (spark/parallelize ctx primes)
                                 (spark/flat-map
                                  #(generate-multiples % n (odd? %))))
        candidates-rdd (spark/cache (.subtract numbers-rdd
                                               prime-multiples-rdd))
        new-primes-rdd (->> candidates-rdd
                            (spark/filter #(< % upto))
                            (spark/cache))
        new-primes (vec (sort (spark/collect new-primes-rdd)))
        remaining-rdd (.subtract candidates-rdd new-primes-rdd)]
    (.unpersist candidates-rdd false)
    (.unpersist new-primes-rdd false)
    [new-primes remaining-rdd]))
Clojure using Spark
❖ A Historical Tangent
➢ “Those who cannot remember the past
are condemned to repeat it.”
➢ ~15 years ago, everything was running
MySQL, Oracle, etc.
➢ ~7 years ago everyone abandoning
➢ Now looping back to SQL - Spark SQL,
Google F1, etc.
❖ DataFrames
➢ DataFrames are the new hotness
➢ It’s how Python and R can now achieve
similar speeds
➢ The Catalyst execution engine can plan
intelligently - behind the scenes,
generates source code, heavy use of
Scala macros, optimize away
boxing/unboxing calls, etc.
➢ Focus is clearly on DataFrames and
upcoming DataSets
❖ DataFrames (cont)
➢ Great in Scala, not so much via JVM
➢ Heavy use of Scala magic like implicits,
➢ Working with DataFrames from Clojure
can be… less than pleasant
➢ Scala folks really like their static, declared
➢ Going to get worse with DataSets
(def FEATURE-TYPE [[:feature DataTypes/IntegerType]])
(def FEATURE-SCHEMA (types->schema FEATURE-TYPE))

;; Wraps each feature index in a Row, builds a one-column DataFrame
;; and registers it as a temp table.
(defn create-feature-table
  [sql-ctx table-name features]
  (let [ctx (.sparkContext sql-ctx)
        features-rdd (->> (spark/parallelize (JavaSparkContext. ctx)
                                             (seq features))
                          (spark/map (fn [i] (RowFactory/create
                                              (to-array [i])))))
        features-df (.createDataFrame sql-ctx features-rdd
                                      FEATURE-SCHEMA)]
    (.registerTempTable features-df table-name)))
Creating a single column DataFrame
(let [query-df (-> bow-df
                   (.select "word" (into-array ["index"])))]
  (reduce (fn [[bow rbow] row]
            [(assoc bow (.getString row 0)
                        (.getInt row 1))
             (assoc rbow (.getInt row 1)
                         (.getString row 0))])
          [{} {}] (.collectAsList query-df))))
(-> bow-df
(.join features-df (.equalTo ind-col
(.col features-df
(.select (into-array [(.col bow-df "*")
(.orderBy (into-array [feature-index-col])))
Machine Learning
Elevator Pitch
❖ Machine Learning Key Points
➢ Uses statistical methods on large
amounts of data to hopefully gain insights
➢ Uses vectors of numbers extracted (by
you) from your data - “feature vectors”
➢ Classification puts things into buckets, e.g.
“fashion related website” vs. “everything else”
➢ Topic modeling - way of finding patterns in
a bunch of documents - a “corpus”
❖ MLLib
➢ Spark’s Machine Learning (ML) library
➢ “Its goal is to make practical machine
learning scalable and easy”
➢ Divides into two packages:
■ spark.mllib - built on top of RDDs
■ spark.ml - built on top of DataFrames
❖ MLLib (cont)
➢ All the basics - Vectors, Sparse Vectors,
LabeledPoints, etc.
➢ A good variety of algorithms, all designed
for running in parallel
➢ Well documented
➢ Large community
MLLib gives us this...
But we want this!
❖ Example - Metrics
➢ BinaryClassificationMetrics has some
useful things, but not basic things
➢ Have to use MulticlassMetrics for some of
the most wanted metrics, even on a
binary classifier
➢ Neither actually gives you the count of
items by label - but
BinaryClassificationMetrics logs it to INFO
➢ End up iterating your data 3 (!) times to
get all desired metrics
Computing metrics
(defn metrics [rdd model]
(let [pl (->> rdd
(spark/map (fn [point]
(let [y (.label point) x (.features point)]
(spark/tuple (.predict model x) y))))
multi-metrics (MulticlassMetrics. (.rdd pl))
metrics (BinaryClassificationMetrics. (.rdd pl))
r {:area-under-pr (.areaUnderPR metrics)
:f-measure (.fMeasure multi-metrics 1.0) ;; Others elided
:label-counts (->> rdd
(fn [point] (spark/tuple (.label point) 1)))
(.unpersist pl false)
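For comparison, here is what the "wanted" metrics look like when computed in a single pass. This is a plain-Clojure sketch over in-memory `[prediction label]` pairs, not the MLLib API; the function name `binary-metrics` and the convention that `1.0` marks the positive class are assumptions made for illustration.

```clojure
(defn binary-metrics
  "One pass over [prediction label] pairs, where 1.0 is the positive class.
  Returns precision, recall, F-measure and the item count per label."
  [pairs]
  (let [{:keys [tp fp fn-count label-counts]}
        (reduce (fn [acc [pred label]]
                  (cond-> (update-in acc [:label-counts label] (fnil inc 0))
                    (and (= pred 1.0) (= label 1.0)) (update :tp inc)
                    (and (= pred 1.0) (= label 0.0)) (update :fp inc)
                    (and (= pred 0.0) (= label 1.0)) (update :fn-count inc)))
                {:tp 0 :fp 0 :fn-count 0 :label-counts {}}
                pairs)
        ;; Guard against empty denominators when a class never shows up.
        precision (if (pos? (+ tp fp)) (/ tp (double (+ tp fp))) 0.0)
        recall    (if (pos? (+ tp fn-count)) (/ tp (double (+ tp fn-count))) 0.0)
        f1        (if (pos? (+ precision recall))
                    (/ (* 2 precision recall) (+ precision recall))
                    0.0)]
    {:precision precision
     :recall recall
     :f-measure f1
     :label-counts label-counts}))

(binary-metrics [[1.0 1.0] [1.0 1.0] [1.0 0.0] [0.0 1.0] [0.0 0.0]])
```

On an RDD the same accumulation could in principle be expressed as one aggregate step, which is why iterating the data three times feels wasteful.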
❖ Examples - Eye on the prize?
➢ HashingTF - oh boy
■ Lose all access to original word
■ Uses gigantic Array instead of a
➢ ChiSqSelector - used to select top N
■ but how do we determine N? Can’t ask
■ End up grubbing around in the source
to find it uses Statistics/chiSqTest
Computing Chi-Square Test
(let [sql-ctx (spark-util/make-sql-context ctx)
      labels-features-df (spark-util/maybe-sample-df options
                           (spark-util/load-table sql-ctx "features" input))
      ;; Pull the term-count LabeledPoint out of each loaded record.
      labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet
                               labels-features-df true)
                              (spark/map
                               (fn [m] (get-in m
                                               [:labeled-points :term-count]))))
      [bow rbow] (bow/load-bow-maps-from-table sql-ctx
                   (spark-util/load-table sql-ctx "bow" bow-input))
      ;; One chi-square test result per feature (word index).
      chi-sq-arr (Statistics/chiSqTest labeled-points-rdd)]
  (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))]
    (log/info "Feature:" ind (rbow ind) "tst:" tst)))
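To give a feel for the number each test reports, the textbook Pearson chi-squared statistic for a single word against a binary label can be worked by hand from a 2x2 table. This is a sketch of the standard formula rather than MLLib's implementation; `a` and `b` are the counts of documents containing the word with positive and negative labels, `c` and `d` the counts of documents without the word, and the function name is made up.

```clojure
(defn chi-sq-2x2
  "Pearson chi-squared statistic for a 2x2 contingency table:
  N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."
  [a b c d]
  (let [n     (+ a b c d)
        diff  (- (* a d) (* b c))
        denom (* (+ a b) (+ c d) (+ a c) (+ b d))]
    ;; A zero row or column total means the word carries no signal.
    (if (zero? denom)
      0.0
      (/ (* n diff diff) (double denom)))))

;; Word present: 30 positive, 10 negative; absent: 20 positive, 40 negative.
(chi-sq-2x2 30 10 20 40)
```

A larger statistic means the word's presence is more strongly associated with the label, which is the ranking a top-N selector relies on.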
Classification w/
Random Forests
❖ Classification
➢ Using lots of data to tell things apart
➢ Can put stuff into two buckets (or
“classes”) - Binary Classifier
➢ Or into many buckets - Multi-class
➢ Lots of different techniques
➢ Supervised learning - each sample needs:
■ “features” - a vector of numeric data
■ “label” - a label specifying its class
❖ The Bag of Words
➢ We started with very basic word cleansing
- lowercase, remove non letters/digits, 3
char min length, drop things that are just numbers
➢ Managed to make it this far in talk without
having to use word count!
➢ But ultimately most Data Science/ML
tasks involving text end up heavily
dependent on word count
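The cleansing rules listed above are simple enough to sketch in a few lines. The version below is a guess at what such a step could look like, not the talk's actual `clean-word-seq`; it assumes ASCII-only text and uses the hypothetical name `clean-word-seq-sketch`.

```clojure
(require '[clojure.string :as str])

(defn clean-word-seq-sketch
  "Lowercases text, splits on anything that is not a letter or digit,
  drops tokens shorter than 3 characters and tokens that are only digits."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (remove #(< (count %) 3))
       (remove #(re-matches #"\d+" %))))

(clean-word-seq-sketch "The 2015 Clojure/conj: it's 100% FUN, go2 it!")
;; => ("the" "clojure" "conj" "fun" "go2")
```

Note that nothing here removes stopwords such as "the"; that matches the bare-baseline approach described in the talk.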
❖ The Bag of Words (cont)
➢ Ended up with too many words (1.3M)
even on sample
➢ We were working on a bare baseline, so no
stopword removal or stemming, following
KISS principle
➢ We did require words to occur on >= 5 distinct
sites (not documents), which reduced size to
460k words
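(require '[sparkling.destructuring :as s-de])

(defn create-bow-site-occurance [json-lines-rdd]
  (->> json-lines-rdd
       ;; (site, set of cleaned words) per document
       (spark/map-to-pair
        (fn [m] (spark/tuple (site (:url m))
                             (set (clean-word-seq (:raw_text m))))))
       ;; one word set per site
       (spark/reduce-by-key union)
       ;; each site contributes a count of 1 per word
       (spark/flat-map-to-pair
        (s-de/key-value-fn
         (fn [site words] (map spark/tuple words (repeat 1)))))
       (spark/reduce-by-key +)
       (spark/filter
        (s-de/key-value-fn
         (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))))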
Bag of Words
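The site-occurrence filter is easier to follow on a small in-memory example. The sketch below mirrors the shape of the pipeline above using only `clojure.core`; the input format (maps with `:site` and `:words`) and the function name are assumptions made for illustration.

```clojure
(defn words-on-min-sites
  "Returns the set of words that occur on at least min-sites distinct sites.
  docs is a seq of maps like {:site \"a.com\" :words [\"dress\" ...]}."
  [docs min-sites]
  (->> docs
       ;; site -> set of every word seen anywhere on that site
       (reduce (fn [acc {:keys [site words]}]
                 (update acc site (fnil into #{}) words))
               {})
       vals
       ;; each site now contributes each of its words exactly once
       (mapcat seq)
       frequencies
       (filter (fn [[_ site-count]] (>= site-count min-sites)))
       (map first)
       set))

(words-on-min-sites
 [{:site "a.com" :words ["dress" "shoes"]}
  {:site "a.com" :words ["dress"]}
  {:site "b.com" :words ["dress" "code"]}]
 2)
;; => #{"dress"}
```

Counting per site rather than per document keeps one prolific site from pushing its own vocabulary over the threshold.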
❖ Random Forests™
➢ Ensemble of Decision Trees
➢ Uses “bootstrapping” for selection of
feature set and training set
➢ Not “Deep Learning” but extremely easy
to use and very effective
➢ “Any sufficiently advanced technology is
indistinguishable from magic.”
➢ Able to get pretty decent results!
F-measure 0.86
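The "bootstrapping" idea is simply sampling with replacement: each tree sees a resampled copy of the training set that is the same size as the original. A minimal sketch in plain Clojure follows; it is not how MLLib implements it, and the function name and the use of `java.util.Random` for seeding are illustrative choices.

```clojure
(defn bootstrap-sample
  "Draws (count coll) items from coll uniformly at random, with replacement.
  The same seed always yields the same sample."
  [coll seed]
  (let [v   (vec coll)
        n   (count v)
        rng (java.util.Random. (long seed))]
    ;; vec forces the side-effecting lazy seq while rng is in scope.
    (vec (repeatedly n #(nth v (.nextInt ^java.util.Random rng (int n)))))))

(bootstrap-sample [:a :b :c :d :e] 42)
```

Because draws are with replacement, some items usually appear more than once and others not at all, which is what gives each tree a slightly different view of the data.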
Train the Random Forest from LabeledPoints
(defn train-random-forest [num-trees max-depth max-bins seed
                           labeled-points-rdd]
  (let [p {:num-classes 2, :categorical-feature-info {},
           :feature-subset-strategy "auto", :impurity "gini",
           :max-depth max-depth, :max-bins max-bins}]
    ;; Argument order follows MLLib's RandomForest/trainClassifier.
    (RandomForest/trainClassifier labeled-points-rdd
                                  (:num-classes p)
                                  (:categorical-feature-info p)
                                  num-trees
                                  (:feature-subset-strategy p)
                                  (:impurity p)
                                  (:max-depth p)
                                  (:max-bins p)
                                  seed)))
Prepare to train/test RandomForest
(defn load-and-train-random-forest [rdd num-trees max-depth max-bins
                                    seed & [sample-fraction]]
  (let [;; Optionally down-sample before labeling.
        sampled-rdd (if sample-fraction
                      (spark/sample false sample-fraction seed rdd)
                      rdd)
        labeled-rdd (->> sampled-rdd
                         (spark/map #(labeled-point lf/fashion? %)))
        ;; 90% training, 10% test.
        [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed)
        cached-train (spark/cache train)
        cached-test (spark/cache test)
        model (train-random-forest num-trees max-depth max-bins seed
                                   cached-train)]
    [cached-train cached-test model]))
Topic Modelling
with LDA
❖ LDA - Latent Dirichlet Allocation
➢ Topic Model which infers topics from text
➢ Topics -> cluster centers, docs -> rows
➢ Features are vectors of word counts (Bag
of Words)
➢ Unsupervised Learning technique (but
you do supply the topic count)
❖ LDA (cont)
➢ Quite tetchy to run at large scale
➢ OutOfMemory error on executors
➢ Job aborted due to stage failure: Serialized task 4341:0 was
365752339 bytes, which exceeds max allowed: spark.akka.frameSize
(134217728 bytes) - reserved (204800 bytes). Consider increasing
spark.akka.frameSize or using broadcast variables for large values.
➢ WTF?
➢ BTW, do not ever change “spark.akka.frameSize”
❖ LDA (moar cont)
➢ Finally able to get a trained model after
reducing BoW to more manageable size
~11k down from ~160k
➢ Trained on ~100k documents, roughly
even split between fashion/non-fashion
➢ These models for demonstration
purposes, moar fanciness planned
Train an LDA Model
(defn train-lda-model [num-topics seed features-fn maps-rdd]
  (let [;; (doc-number, feature vector) pairs, cached because they are reused.
        rdd (->> maps-rdd
                 (spark/map (fn [{:keys [doc-number] :as m}]
                              (spark/tuple doc-number (features-fn m))))
                 (spark/cache))
        corpus-size (spark/count rdd)
        mbf (mini-batch-fraction-batch-size corpus-size 5000)
        ;; Enough iterations to cover roughly the whole corpus once.
        max-iters (int (Math/ceil (/ mbf)))
        optimizer (doto (OnlineLDAOptimizer.)
                    (.setMiniBatchFraction (min 1.0 mbf)))
        model (-> (doto (LDA.) (.setOptimizer optimizer) (.setK num-topics)
                    (.setSeed seed) (.setMaxIterations max-iters))
                  (.run (.rdd rdd)))]
    (.unpersist rdd false)
    model))
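The mini-batch arithmetic in the training code can be checked on its own. The sketch below assumes that `mini-batch-fraction-batch-size` returns the batch size divided by the corpus size; that helper is not shown in the slides, so this definition and the names used here are guesses for illustration only.

```clojure
(defn mini-batch-fraction
  "Hypothetical stand-in for the talk's helper: the fraction of the
  corpus that one mini-batch of batch-size documents represents."
  [corpus-size batch-size]
  (/ (double batch-size) corpus-size))

(defn lda-iteration-plan
  "Mirrors the arithmetic in train-lda-model: cap the fraction at 1.0
  and run enough iterations to cover roughly one pass over the corpus."
  [corpus-size batch-size]
  (let [mbf (mini-batch-fraction corpus-size batch-size)]
    {:mini-batch-fraction (min 1.0 mbf)
     :max-iterations (int (Math/ceil (/ mbf)))}))

(lda-iteration-plan 20000 5000)
;; => {:mini-batch-fraction 0.25, :max-iterations 4}
```

With a corpus smaller than the batch size, the fraction is capped at 1.0 and a single iteration results, so the whole corpus is processed in one batch.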
So what’s
the point?
❖ So what did we do?
➢ We took pre-scraped, “pre-labeled” data
➢ Used Clojure and Spark/Sparkling to
munge the data
➢ Used state of the art ML tools to analyze
the data
➢ Explored for insights
❖ So what can YOU do?
➢ This will work for almost ANY domain
➢ There’s a lot of interesting information
even at this stage
➢ There’s a ton of interesting directions this
can go
■ Run classifier over all of CC data
■ Build domain-specific LDA models
➢ Do cool things and have fun doing it!
Hunter Kelly

More Related Content

What's hot

SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
Namgee Lee
Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)
Kai Chan
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Kai Chan
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Automatically generating-json-from-java-objects-java-objects268
Automatically generating-json-from-java-objects-java-objects268Automatically generating-json-from-java-objects-java-objects268
Automatically generating-json-from-java-objects-java-objects268Ramamohan Chokkam
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
Javier Arrieta
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Vitaly Gordon
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Kai Chan
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David Szakallas
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerde
Zheng Shao
Search Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrSearch Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and Solr
Kai Chan
Dex Technical Seminar (April 2011)
Dex Technical Seminar (April 2011)Dex Technical Seminar (April 2011)
Dex Technical Seminar (April 2011)
Sergio Gomez Villamor
Avro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAvro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSON
Alexandre Victoor
A Little SPARQL in your Analytics
A Little SPARQL in your AnalyticsA Little SPARQL in your Analytics
A Little SPARQL in your Analytics
Dr. Neil Brittliff
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014
Konrad Malawski
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Data Con LA
Apache avro and overview hadoop tools
Apache avro and overview hadoop toolsApache avro and overview hadoop tools
Apache avro and overview hadoop tools
alireza alikhani

What's hot (20)

SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)Java Performance Tips (So Code Camp San Diego 2014)
Java Performance Tips (So Code Camp San Diego 2014)
J s-o-n-120219575328402-3
J s-o-n-120219575328402-3J s-o-n-120219575328402-3
J s-o-n-120219575328402-3
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Automatically generating-json-from-java-objects-java-objects268
Automatically generating-json-from-java-objects-java-objects268Automatically generating-json-from-java-objects-java-objects268
Automatically generating-json-from-java-objects-java-objects268
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedInScalable and Flexible Machine Learning With Scala @ LinkedIn
Scalable and Flexible Machine Learning With Scala @ LinkedIn
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David Szakallas
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerde
Search Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrSearch Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and Solr
Dex Technical Seminar (April 2011)
Dex Technical Seminar (April 2011)Dex Technical Seminar (April 2011)
Dex Technical Seminar (April 2011)
Avro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAvro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSON
A Little SPARQL in your Analytics
A Little SPARQL in your AnalyticsA Little SPARQL in your Analytics
A Little SPARQL in your Analytics
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014Scalding - the not-so-basics @ ScalaDays 2014
Scalding - the not-so-basics @ ScalaDays 2014
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Apache avro and overview hadoop tools
Apache avro and overview hadoop toolsApache avro and overview hadoop tools
Apache avro and overview hadoop tools

Viewers also liked

Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three MonthsZalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Technology
How We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards ScalabilityHow We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards Scalability
Zalando Technology
Powering Radical Agility with Docker
Powering Radical Agility with Docker Powering Radical Agility with Docker
Powering Radical Agility with Docker
Zalando Technology
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Zalando Technology
Building a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & SlickBuilding a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & Slick
Zalando Technology
The Glamorous Toolkit: Towards a novel live IDE
The Glamorous Toolkit: Towards a novel live IDEThe Glamorous Toolkit: Towards a novel live IDE
The Glamorous Toolkit: Towards a novel live IDE
Data driven community management June 2015
Data driven community management June 2015Data driven community management June 2015
Data driven community management June 2015
Conor Duke
Radical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and MicroservicesRadical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and Microservices
Zalando Technology
Auto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando TeamAuto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando Team
Zalando Technology
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices
Zalando Technology
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland KuhnReactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Zalando Technology
Camunda BPM at Zalando: Order Processing at scale
Camunda BPM at Zalando: Order Processing at scaleCamunda BPM at Zalando: Order Processing at scale
Camunda BPM at Zalando: Order Processing at scale
camunda services GmbH
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando Patroni
Zalando Technology
Order Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community DayOrder Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community Day
Zalando Technology
Radical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the CloudRadical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the Cloud
Zalando Technology
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
Shuyo Nakatani

Viewers also liked (18)

Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three MonthsZalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three Months
How We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards ScalabilityHow We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards Scalability
Powering Radical Agility with Docker
Powering Radical Agility with Docker Powering Radical Agility with Docker
Powering Radical Agility with Docker
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Building a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & SlickBuilding a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & Slick
The Glamorous Toolkit: Towards a novel live IDE
The Glamorous Toolkit: Towards a novel live IDEThe Glamorous Toolkit: Towards a novel live IDE
The Glamorous Toolkit: Towards a novel live IDE
Data driven community management June 2015
Data driven community management June 2015Data driven community management June 2015
Data driven community management June 2015
Radical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and MicroservicesRadical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and Microservices
Auto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando TeamAuto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando Team
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland KuhnReactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Camunda BPM at Zalando: Order Processing at scale
Camunda BPM at Zalando: Order Processing at scaleCamunda BPM at Zalando: Order Processing at scale
Camunda BPM at Zalando: Order Processing at scale
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando Patroni
Order Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community DayOrder Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community Day
Radical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the CloudRadical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the Cloud
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
[Kim+ ICML2012] Dirichlet Process with Mixed Random Measures : A Nonparametri...
BCG Matrix
BCG MatrixBCG Matrix
BCG Matrix

Similar to Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
Baishampayan Ghose
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
Michael Spector
Introduction to R
Introduction to RIntroduction to R
Introduction to Ragnonchik
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
Hugo Gävert
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Matt Stubbs
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
Brian O'Neill
Designing a database like an archaeologist
Designing a database like an archaeologistDesigning a database like an archaeologist
Designing a database like an archaeologist
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
appaji intelhunt
Workshop on command line tools - day 2
Workshop on command line tools - day 2Workshop on command line tools - day 2
Workshop on command line tools - day 2
Leandro Lima
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016
Mark Smith
Securerank ping-opendns
Securerank ping-opendnsSecurerank ping-opendns
Securerank ping-opendns
Ping Yan
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin



Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk

  • 1. Hunter Kelly @retnuh All the Topics on the Interwebs
  • 5. embassy wikileaks assange german merkel cables snowden speigel spying
  • 7. ❖ What are we actually doing? ➢ Mining web pages for insights ❖ How? ➢ Using Machine Learning to do heavy lifting ■ Use Classifiers to filter/bucket the data ■ Build Topic Models to try to discover concepts related to words
  • 8. ❖ Getting Data ➢ DMOZ ➢ Common Crawl ❖ Manipulating Data ➢ Spark ➢ Sparkling ■ RDDs ■ DataFrames ❖ Data Science ➢ MLLib ➢ Classification - Random Forests™ ➢ LDA (Latent Dirichlet Allocation)
  • 10. ❖ DMOZ ➢ “The largest human edited directory of the web” ➢ Useful when you think of it in terms of “free crowdsourced labeled data” ➢ Fairly ancient, borderline decrepit ➢ Crowdsourced is a double edged sword
  • 11. ❖ Common Crawl (CC) ➢ “an open repository of web crawl data that can be accessed and analyzed by anyone.” ➢ Monthly crawls ➢ Readily accessible index ➢ Tons of free data - raw, links, plain text formats
  • 12. ❖ How to use them together! ➢ Use DMOZ to get samples of positive and negative “seed links” ➢ Look up and expand your “seed links” using CC index ➢ Fetch your data with little/no fuss using CC index information
  • 14. ❖ Apache Spark ➢ The “next big thing” ➢ Or arguably the “current” big thing ❖ Sparkling ➢ Clojure bindings to Spark ➢ Great Presentation (highly recommended) ➢ RDDs ➢ DataFrames
  • 15. RDDs
  • 16. ❖ RDDs ➢ Resilient Distributed Datasets ➢ Easy to think of them as partitioned (or sharded) seqs ➢ Transformations (map, filter, etc) are lazy ➢ Operations (count, collect, reduce, etc) cause evaluation ➢ Very familiar paradigms for Clojure programmers
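The lazy-transformation / eager-operation split can be tried locally with ordinary Clojure seqs. This is only an analogy, not Spark code: `noisy-inc` and the `evaluated` counter are names made up for this sketch so that the moment of evaluation becomes visible.

```clojure
;; Plain Clojure seqs as a local analogy for RDD laziness (no Spark needed).
;; `evaluated` and `noisy-inc` are hypothetical names for this sketch.
(def evaluated (atom 0))

(defn noisy-inc
  "Increments x and records that it was called."
  [x]
  (swap! evaluated inc)
  (inc x))

;; "Transformations": building the pipeline does no work yet,
;; because map and filter both return lazy seqs.
(def pipeline
  (->> (range 10)
       (map noisy-inc)
       (filter even?)))

(def calls-before @evaluated)   ; still 0: nothing has been realized

;; "Operation": reduce forces the whole pipeline to be evaluated.
(def total (reduce + pipeline))

(def calls-after @evaluated)    ; 10: every element went through noisy-inc
```

On an RDD the same shape applies, with the difference that the work is split across partitions and only runs when an operation such as count or collect is called.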
  • 18. (defn sieve-prime-multiples [n primes numbers] (let [max-prime (last primes) upto (* max-prime max-prime) prime-multiples (->> primes (r/mapcat #(generate-multiples % n (odd? %))) (into #{})) candidates (->> numbers (r/remove prime-multiples)) new-primes (->> candidates (r/filter #(< % upto)) r/foldcat sort (into [])) remaining (->> candidates (r/remove (set new-primes)) r/foldcat)] [new-primes remaining])) Clojure using Reducers
  • 19. (defn sieve-prime-multiples [ctx n primes numbers-rdd] (let [max-prime (last primes) upto (* max-prime max-prime) prime-multiples-rdd (->> (spark/parallelize ctx primes) (spark/flat-map #(generate-multiples % n (odd? %)))) candidates-rdd (spark/cache (.subtract numbers-rdd prime-multiples-rdd)) new-primes-rdd (->> candidates-rdd (spark/filter #(< % upto)) spark/cache) new-primes (vec (sort (spark/collect new-primes-rdd))) remaining-rdd (.subtract candidates-rdd new-primes-rdd)] (.unpersist candidates-rdd false) (.unpersist new-primes-rdd false) [new-primes remaining-rdd])) Clojure using Spark
  • 20. ❖ A Historical Tangent ➢ “Those who cannot remember the past are condemned to repeat it.” ➢ ~15 years ago, everything was running MySQL, Oracle, etc. ➢ ~7 years ago everyone was abandoning SQL+RDBMS for NoSQL ➢ Now looping back to SQL - Spark SQL, Google F1, etc.
  • 22. ❖ DataFrames ➢ DataFrames are the new hotness ➢ It’s how Python and R can now achieve similar speeds ➢ The Catalyst execution engine can plan intelligently - behind the scenes, generates source code, heavy use of Scala macros, optimize away boxing/unboxing calls, etc. ➢ Focus is clearly on DataFrames and upcoming DataSets
  • 23. ❖ DataFrames (cont) ➢ Great in Scala, not so much via JVM interop ➢ Heavy use of Scala magic like implicits, etc. ➢ Working with DataFrames from Clojure can be… less than pleasant ➢ Scala folks really like their static, declared types ➢ Going to get worse with DataSets
  • 24. (def FEATURE-TYPE [[:feature DataTypes/IntegerType]]) (def FEATURE-SCHEMA (types->schema FEATURE-TYPE)) (defn create-feature-table [sql-ctx table-name features] (let [ctx (.sparkContext sql-ctx) features-rdd (->> (spark/parallelize (JavaSparkContext. ctx) (seq features)) (spark/map (fn [i] (RowFactory/create (to-array [i]))))) features-df (.createDataFrame sql-ctx features-rdd FEATURE-SCHEMA)] (.registerTempTable features-df table-name) features-df)) Creating a single column DataFrame
  • 25. (let [query-df (-> bow-df (.select "word" (into-array ["index"])))] (reduce (fn [[bow rbow] row] [(assoc bow (.getString row 0) (.getInt row 1)) (assoc rbow (.getInt row 1) (.getString row 0))]) [{} {}] (.collectAsList query-df))))
  • 26. (-> bow-df (.join features-df (.equalTo ind-col (.col features-df "feature"))) (.select (into-array [(.col bow-df "*") feature-index-col])) (.orderBy (into-array [feature-index-col])))
  • 28. ❖ Machine Learning Key Points ➢ Uses statistical methods on large amounts of data to hopefully gain insights ➢ Uses vectors of numbers extracted (by you) from your data - “feature vectors” ➢ Classification puts things into buckets, i.e. “fashion related website” vs. “everything else” ➢ Topic modeling - way of finding patterns in a bunch of documents - a “corpus”
  • 29. MLLib
  • 30. ❖ MLLib ➢ Spark’s Machine Learning (ML) library ➢ “Its goal is to make practical machine learning scalable and easy” ➢ Divides into two packages: ■ spark.mllib - built on top of RDDs ■ spark.ml - built on top of DataFrames
  • 31. ❖ MLLib (cont) ➢ All the basics - Vectors, Sparse Vectors, LabeledPoints, etc. ➢ A good variety of algorithms, all designed for running in parallel ➢ Well documented ➢ Large community
  • 32. MLLib gives us this...
  • 33. But we want this!
  • 34. ❖ Example - Metrics ➢ BinaryClassificationMetrics has some useful things, but not basic things ➢ Have to use MulticlassMetrics for some of the most wanted metrics, even on a binary classifier ➢ Neither actually gives you the count of items by label - but BinaryClassificationMetrics logs it to INFO ➢ End up iterating your data 3 (!) times to get all desired metrics
  • 35. Computing metrics (defn metrics [rdd model] (let [pl (->> rdd (spark/map (fn [point] (let [y (.label point) x (.features point)] (spark/tuple (.predict model x) y)))) spark/cache) multi-metrics (MulticlassMetrics. (.rdd pl)) metrics (BinaryClassificationMetrics. (.rdd pl)) r {:area-under-pr (.areaUnderPR metrics) :f-measure (.fMeasure multi-metrics 1.0) ;; Others elided :label-counts (->> rdd (spark/map-to-pair (fn [point] (spark/tuple (.label point) 1))) spark/count-by-key)}] (.unpersist pl false) r))
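For a binary classifier, precision, recall, F-measure and the per-label counts can all be derived from one confusion tally, so a single pass over the (prediction, label) pairs is enough in principle. The sketch below works on an in-memory seq rather than an RDD. The function names and the `:true-pos` style keys are assumptions for illustration and are not part of MLLib.

```clojure
;; Single-pass binary metrics over [prediction label] pairs.
;; Assumes both values are 1.0 (positive) or 0.0 (negative).
;; All names here are hypothetical, not MLLib API.
(defn confusion-counts
  "Tallies [prediction label] pairs into a confusion map in one pass."
  [pairs]
  (reduce (fn [acc [pred label]]
            (update acc
                    (cond
                      (and (== pred 1.0) (== label 1.0)) :true-pos
                      (and (== pred 1.0) (== label 0.0)) :false-pos
                      (and (== pred 0.0) (== label 1.0)) :false-neg
                      :else                              :true-neg)
                    inc))
          {:true-pos 0 :false-pos 0 :false-neg 0 :true-neg 0}
          pairs))

(defn binary-metrics
  "Derives precision, recall, F-measure and label counts from the tally."
  [{:keys [true-pos false-pos false-neg true-neg]}]
  (let [ratio     (fn [num denom] (if (pos? denom) (/ num (double denom)) 0.0))
        precision (ratio true-pos (+ true-pos false-pos))
        recall    (ratio true-pos (+ true-pos false-neg))
        f-measure (ratio (* 2 precision recall) (+ precision recall))]
    {:precision    precision
     :recall       recall
     :f-measure    f-measure
     ;; Count of items by true label, the number neither metrics class returns.
     :label-counts {1.0 (+ true-pos false-neg)
                    0.0 (+ true-neg false-pos)}}))

(def sample-pairs
  [[1.0 1.0] [1.0 1.0] [1.0 0.0] [0.0 1.0] [0.0 0.0] [0.0 0.0]])

(def sample-metrics (binary-metrics (confusion-counts sample-pairs)))
```

Area under the PR curve needs the raw scores at several thresholds, so it cannot be recovered from this tally and would still need its own computation.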
  • 36. ❖ Examples - Eye on the prize? ➢ HashingTF - oh boy ■ Lose all access to original word ■ Uses gigantic Array instead of a HashMap ➢ ChiSqSelector - used to select top N features ■ but how do we determine N? Can’t ask ■ End up grubbing around in the source to find it uses Statistics/chiSqTest
  • 37. Computing Chi-Square Test (let [sql-ctx (spark-util/make-sql-context ctx) labels-features-df (spark-util/maybe-sample-df options (spark-util/load-table sql-ctx "features" input)) labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet labels-features-df true) (spark/map (fn [m] (get-in m [:labeled-points :term-count])))) [bow rbow] (bow/load-bow-maps-from-table sql-ctx (spark-util/load-table sql-ctx "bow" bow-input)) chi-sq-arr (Statistics/chiSqTest labeled-points-rdd)] (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))] (log/info "Feature:" ind (rbow ind) "tst:" tst)))
  • 39. ❖ Classification ➢ Using lots of data to tell things apart ➢ Can put stuff into two buckets (or “classes”) - Binary Classifier ➢ Or into many buckets - Multi-class Classifier ➢ Lots of different techniques ➢ Supervised learning - each sample needs: ■ “features” - a vector of numeric data ■ “label” - a label specifying its class
  • 40. ❖ The Bag of Words ➢ We started with very basic word cleansing - lowercase, remove non letters/digits, 3 char min length, drop things that are just numbers ➢ Managed to make it this far in talk without having to use word count! ➢ But ultimately most Data Science/ML tasks involving text end up heavily dependent on word count
  • 41. ❖ The Bag of Words (cont) ➢ Ended up with too many words (1.3M) even on sample ➢ We were working on a bare baseline, so no stopword removal or stemming, following KISS principle ➢ We did say must occur on >= 5 distinct sites (not documents), reduced size to 460k words
  • 42. (defn create-bow-site-occurance [json-lines-rdd] (->> json-lines-rdd (spark/map-to-pair (fn [m] (spark/tuple (site (:url m)) (set (clean-word-seq (:raw_text m)))))) (spark/reduce-by-key union) (spark/flat-map-to-pair (s-de/key-value-fn (fn [site words] (map spark/tuple words (repeat 1))))) (spark/reduce-by-key +) (spark/filter (s-de/key-value-fn (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT)))) spark/sort-by-key)) Bag of Words
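The slide code calls `clean-word-seq` without showing it. A minimal sketch of what such a function could look like, following the cleansing rules listed on the earlier slide, is given below together with an in-memory version of the per-site counting. Both bodies are guesses for illustration: they assume ASCII text, split on non-alphanumeric characters instead of stripping them, and read a ready-made `:site` key instead of deriving the site from `:url`.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the talk's clean-word-seq (not shown on the slides).
;; Assumption: ASCII only, and non letters/digits act as word separators.
(defn clean-word-seq
  "Lowercases text, splits on non-alphanumerics, keeps words of 3+ chars,
  and drops tokens that are only digits."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (filter #(>= (count %) 3))
       (remove #(re-matches #"[0-9]+" %))))

;; In-memory analogue of the Spark pipeline: count distinct sites per word.
(defn site-occurrence-counts
  "docs is a seq of {:site ... :raw_text ...} maps. Returns a sorted map of
  word -> number of distinct sites, keeping words seen on >= min-sites sites."
  [docs min-sites]
  (->> docs
       ;; site -> set of words seen anywhere on that site
       (reduce (fn [acc {:keys [site raw_text]}]
                 (update acc site (fnil into #{}) (clean-word-seq raw_text)))
               {})
       vals
       (mapcat seq)          ; each word appears once per site
       frequencies           ; word -> distinct site count
       (filter (fn [[_ c]] (>= c min-sites)))
       (into (sorted-map))))

(def sample-docs
  [{:site "a.com" :raw_text "Red shoes and red dresses"}
   {:site "a.com" :raw_text "More red shoes"}
   {:site "b.com" :raw_text "Blue shoes 2015"}])

(def sample-bow (site-occurrence-counts sample-docs 2))
```

Counting per site rather than per document keeps a single site that repeats a word on every page from inflating that word's count, which is the point of the ">= 5 distinct sites" rule.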
  • 43. ❖ Random Forests™ ➢ Ensemble of Decision Trees ➢ Uses “bootstrapping” for selection of feature set and training set ➢ Not “Deep Learning” but extremely easy to use and very effective ➢ “Any sufficiently advanced technology is indistinguishable from magic.” ➢ Able to get pretty decent results! F-measure 0.86
  • 44. Train the Random Forest from LabeledPoints (defn train-random-forest [num-trees max-depth max-bins seed labeled-points-rdd] (let [p {:num-classes 2, :categorical-feature-info {}, :feature-subset-strategy "auto", :impurity "gini", :max-depth max-depth, :max-bins max-bins}] (RandomForest/trainClassifier labeled-points-rdd (:num-classes p) (:categorical-feature-info p) num-trees (:feature-subset-strategy p) (:impurity p) (:max-depth p) (:max-bins p) seed)))
  • 45. Prepare to train/test RandomForest (defn load-and-train-random-forest [rdd num-trees max-depth max-bins seed & [sample-fraction]] (let [sampled-rdd (if sample-fraction (spark/sample false sample-fraction seed rdd) rdd) labeled-rdd (->> sampled-rdd (spark/map #(labeled-point lf/fashion? %))) [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed) cached-train (spark/cache train) cached-test (spark/cache test) model (train-random-forest num-trees max-depth max-bins seed cached-train)] [cached-train cached-test model]))
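Two of the ideas behind a Random Forest, bootstrapped training sets and voting across an ensemble, are small enough to sketch in plain Clojure. This is not how MLLib implements `RandomForest/trainClassifier`. It leaves out tree induction and per-tree feature subsetting entirely, and models each "tree" as an ordinary function from a feature vector to a label. All names are hypothetical.

```clojure
;; Illustrative only: bootstrap sampling and majority voting.
;; A "tree" here is just any fn from a feature vector to a label.
(defn bootstrap-sample
  "Draws (count xs) items from xs with replacement, using a seeded RNG
  so the same seed always gives the same sample."
  [xs seed]
  (let [v   (vec xs)
        rng (java.util.Random. seed)]
    (vec (repeatedly (count v) #(nth v (.nextInt rng (count v)))))))

(defn majority-vote
  "Asks every tree for a prediction and returns the most common label."
  [trees features]
  (->> trees
       (map (fn [tree] (tree features)))
       frequencies
       (apply max-key val)
       key))

;; Three hand-written decision stumps standing in for trained trees.
(def toy-trees
  [(fn [x] (if (> (nth x 0) 5) 1.0 0.0))
   (fn [x] (if (> (nth x 1) 2) 1.0 0.0))
   (constantly 0.0)])

(def vote-a (majority-vote toy-trees [6 3]))   ; votes: 1.0 1.0 0.0
(def vote-b (majority-vote toy-trees [1 3]))   ; votes: 0.0 1.0 0.0
```

In a real forest each tree would be trained on its own bootstrap sample, which is why the individual trees disagree and the vote is more stable than any single tree.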
  • 47. ❖ LDA - Latent Dirichlet Allocation ➢ Topic Model which infers topics from text corpus ➢ Topics -> cluster centers, docs -> rows ➢ Features are vectors of word counts (Bag of Words) ➢ Unsupervised Learning technique (but you do supply the topic count)
  • 48. ❖ LDA (cont) ➢ Quite tetchy to run at large scale ➢ OutOfMemory error on executors ➢ Job aborted due to stage failure: Serialized task 4341:0 was 365752339 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values. ➢ WTF? ➢ BTW, do not ever change “spark.akka.frameSize”...
  • 50. ❖ LDA (moar cont) ➢ Finally able to get a trained model after reducing BoW to more manageable size ~11k down from ~160k ➢ Trained on ~100k documents, roughly even split between fashion/non-fashion ➢ These models are for demonstration purposes, moar fanciness planned
  • 51. Train an LDA Model (defn train-lda-model [num-topics seed features-fn maps-rdd] (let [rdd (->> maps-rdd (spark/map (fn [{:keys [doc-number] :as m}] (spark/tuple doc-number (features-fn m)))) spark/cache) corpus-size (spark/count rdd) mbf (mini-batch-fraction-batch-size corpus-size 5000) max-iters (int (Math/ceil (/ mbf))) optimizer (doto (OnlineLDAOptimizer.) (.setMiniBatchFraction (min 1.0 mbf))) model (-> (doto (LDA.) (.setOptimizer optimizer) (.setK num-topics) (.setSeed seed) (.setMaxIterations max-iters)) (.run (.rdd rdd)))] (.unpersist rdd false) model))
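The training code relies on a helper, `mini-batch-fraction-batch-size`, that the slides do not show. One plausible reading is that it turns a target batch size into the fraction of the corpus sampled per iteration, and that the iteration count is then chosen so that roughly one full pass is made over the corpus. The sketch below follows that reading. The function names and bodies are assumptions, not the talk's actual code.

```clojure
;; Hypothetical reconstruction of the mini-batch arithmetic.
(defn batch-fraction
  "Fraction of the corpus to sample per iteration so that each mini-batch
  holds roughly batch-size documents. Capped at 1.0 for small corpora."
  [corpus-size batch-size]
  (min 1.0 (/ (double batch-size) corpus-size)))

(defn iterations-for-one-pass
  "Number of mini-batches needed to cover the corpus about once."
  [fraction]
  (int (Math/ceil (/ 1.0 fraction))))

;; 40,000 documents in batches of about 5,000.
(def example-fraction (batch-fraction 40000 5000))
(def example-iters (iterations-for-one-pass example-fraction))

;; A corpus smaller than the batch size is processed whole in one iteration.
(def small-fraction (batch-fraction 1000 5000))
(def small-iters (iterations-for-one-pass small-fraction))
```

Because the online optimizer samples each mini-batch at random, "one pass" is an expectation and not a guarantee that every document is seen exactly once.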
  • 52. Demo!
  • 54. ❖ So what did we do? ➢ We took pre-scraped, “pre-labeled” data ➢ Used Clojure and Spark/Sparkling to munge the data ➢ Used state of the art ML tools to analyze the data ➢ Explored for insights
  • 55. ❖ So what can YOU do? ➢ This will work for almost ANY domain ➢ There’s a lot of interesting information even at this stage ➢ There’s a ton of interesting directions this can go ■ Run classifier over all of CC data ■ Build domain-specific LDA models ➢ Do cool things and have fun doing it!