Hunter Kelly
@retnuh
All the Topics on the Interwebs
manning
Perhaps this?
Or maybe this?
embassy wikileaks assange
german merkel cables
snowden speigel spying
Wut?
❖ What are we actually doing?
➢ Mining web pages for insights
❖ How?
➢ Using Machine Learning to do the heavy
lifting
■ Use Classifiers to filter/bucket the
data
■ Build Topic Models to try to discover
concepts related to words
❖ Getting Data
➢ DMOZ
➢ Common Crawl
❖ Manipulating Data
➢ Spark
➢ Sparkling
■ RDDs
■ DataFrames
❖ Data Science
➢ MLLib
➢ Classification - Random Forests™
➢ LDA (Latent Dirichlet Allocation)
DMOZ
Common Crawl
❖ DMOZ
➢ “The largest human edited directory of the
web”
➢ Useful when you think of it in terms of
“free crowdsourced labeled data”
➢ Fairly ancient, borderline decrepit
➢ Crowdsourced is a double edged sword
❖ Common Crawl (CC)
➢ “an open repository of web crawl data
that can be accessed and analyzed by
anyone.”
➢ Monthly crawls
➢ Readily accessible index
➢ Tons of free data - raw, links, plain text
formats
❖ How to use them together!
➢ Use DMOZ to get samples of positive and
negative “seed links”
➢ Look up and expand your “seed links”
using the CC index
➢ Fetch your data with little/no fuss using
CC index information
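The lookup and fetch steps above can be sketched in plain Clojure. This is a rough illustration, not the code used in the talk: it assumes the public Common Crawl index server at index.commoncrawl.org with its JSON output, and that each index record carries `offset` and `length` fields as strings. The collection name `CC-MAIN-2015-40` and the helper names `index-query-url` and `range-header` are picked for illustration only.

```clojure
(import '(java.net URLEncoder))

;; Assumed base URL of the public Common Crawl index server.
(def index-server "https://index.commoncrawl.org")

(defn index-query-url
  "Builds an index query URL for one seed link pattern, e.g. \"example.com/*\"."
  [collection url-pattern]
  (str index-server "/" collection "-index"
       "?url=" (URLEncoder/encode url-pattern "UTF-8")
       "&output=json"))

(defn range-header
  "Turns an index record (offset and length given as strings) into the value
   of an HTTP Range header covering exactly that record in the archive file."
  [{:keys [offset length]}]
  (let [start (Long/parseLong (str offset))
        end   (+ start (Long/parseLong (str length)) -1)]
    (str "bytes=" start "-" end)))
```

With a record in hand, the page can be fetched by requesting the record's archive file with that Range header, so only the bytes for the one page are downloaded.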
Spark &
Sparkling
❖ Apache Spark
➢ The “next big thing”
➢ Or arguably the “current” big thing
❖ Sparkling
➢ Clojure bindings to Spark
➢ Great Presentation (highly recommended)
➢ RDDs
➢ DataFrames
RDDs
❖ RDDs
➢ Resilient Distributed Datasets
➢ Easy to think of them as partitioned (or
sharded) seqs
➢ Transformations (map, filter, etc) are lazy
➢ Actions (count, collect, reduce, etc.)
cause evaluation
➢ Very familiar paradigms for Clojure
programmers
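The lazy/eager split is the same one Clojure programmers already know from lazy seqs. The snippet below is only an analogy in plain Clojure (no Spark involved); `calls` and `slow-square` are made-up names used to make the evaluation visible.

```clojure
;; Counts how many times the mapping function actually runs.
(def calls (atom 0))

(defn slow-square [x]
  (swap! calls inc)
  (* x x))

;; "Transformation": builds a lazy seq, nothing is computed yet.
(def squares (map slow-square (range 10)))
(def calls-before @calls)

;; "Action": reducing forces every element to be computed.
(def total (reduce + squares))
(def calls-after @calls)
```

Here `calls-before` is 0 and `calls-after` is 10: the work happens only when a result is demanded, much as an RDD pipeline runs only when an action is called.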
;; r is clojure.core.reducers; generate-multiples is defined elsewhere
(defn sieve-prime-multiples [n primes numbers]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples (->> primes
                             (r/mapcat #(generate-multiples % n (odd? %)))
                             (into #{}))
        candidates (->> numbers
                        (r/remove prime-multiples))
        new-primes (->> candidates
                        (r/filter #(< % upto))
                        r/foldcat
                        sort
                        (into []))
        remaining (->> candidates
                       (r/remove (set new-primes))
                       r/foldcat)]
    [new-primes remaining]))
Clojure using Reducers
(defn sieve-prime-multiples [ctx n primes numbers-rdd]
  (let [max-prime (last primes)
        upto (* max-prime max-prime)
        prime-multiples-rdd (->> (spark/parallelize ctx primes)
                                 (spark/flat-map
                                  #(generate-multiples % n (odd? %))))
        candidates-rdd (spark/cache (.subtract numbers-rdd
                                               prime-multiples-rdd))
        new-primes-rdd (->> candidates-rdd
                            (spark/filter #(< % upto))
                            spark/cache)
        ;; collect forces evaluation and brings the new primes to the driver
        new-primes (vec (sort (spark/collect new-primes-rdd)))
        remaining-rdd (.subtract candidates-rdd new-primes-rdd)]
    (.unpersist candidates-rdd false)
    (.unpersist new-primes-rdd false)
    [new-primes remaining-rdd]))
Clojure using Spark
❖ A Historical Tangent
➢ “Those who cannot remember the past
are condemned to repeat it.”
➢ ~15 years ago, everything was running
MySQL, Oracle, etc.
➢ ~7 years ago, everyone was abandoning
SQL+RDBMS for NoSQL
➢ Now looping back to SQL - Spark SQL,
Google F1, etc.
DataFrames
❖ DataFrames
➢ DataFrames are the new hotness
➢ It’s how Python and R can now achieve
similar speeds
➢ The Catalyst execution engine can plan
intelligently - behind the scenes, it
generates source code, makes heavy use of
Scala macros, optimizes away
boxing/unboxing calls, etc.
➢ Focus is clearly on DataFrames and
upcoming DataSets
❖ DataFrames (cont)
➢ Great in Scala, not so much via JVM
interop
➢ Heavy use of Scala magic like implicits,
etc.
➢ Working with DataFrames from Clojure
can be… less than pleasant
➢ Scala folks really like their static, declared
types
➢ Going to get worse with DataSets
(def FEATURE-TYPE [[:feature DataTypes/IntegerType]])
(def FEATURE-SCHEMA (types->schema FEATURE-TYPE))

(defn create-feature-table
  [sql-ctx table-name features]
  (let [ctx (.sparkContext sql-ctx)
        features-rdd (->> (spark/parallelize (JavaSparkContext. ctx)
                                             (seq features))
                          (spark/map (fn [i] (RowFactory/create
                                              (to-array [i])))))
        features-df (.createDataFrame sql-ctx features-rdd
                                      FEATURE-SCHEMA)]
    (.registerTempTable features-df table-name)
    features-df))
Creating a single column DataFrame
;; Build word->index (bow) and index->word (rbow) maps from the rows
(let [query-df (-> bow-df
                   (.select "word" (into-array ["index"])))]
  (reduce (fn [[bow rbow] row]
            [(assoc bow (.getString row 0)
                        (.getInt row 1))
             (assoc rbow (.getInt row 1)
                         (.getString row 0))])
          [{} {}] (.collectAsList query-df)))
(-> bow-df
    (.join features-df (.equalTo ind-col
                                 (.col features-df
                                       "feature")))
    (.select (into-array [(.col bow-df "*")
                          feature-index-col]))
    (.orderBy (into-array [feature-index-col])))
Machine Learning
Elevator Pitch
❖ Machine Learning Key Points
➢ Uses statistical methods on large
amounts of data to hopefully gain insights
➢ Uses vectors of numbers extracted (by
you) from your data - “feature vectors”
➢ Classification puts things into buckets, i.e.
“fashion related website” vs. “everything
else”
➢ Topic modeling - way of finding patterns in
a bunch of documents - a “corpus”
MLLib
❖ MLLib
➢ Spark’s Machine Learning (ML) library
➢ “Its goal is to make practical machine
learning scalable and easy”
➢ Divides into two packages:
■ spark.mllib - built on top of RDDs
■ spark.ml - built on top of DataFrames
❖ MLLib (cont)
➢ All the basics - Vectors, Sparse Vectors,
LabeledPoints, etc.
➢ A good variety of algorithms, all designed
for running in parallel
➢ Well documented
➢ Large community
MLLib gives us this...
But we want this!
❖ Example - Metrics
➢ BinaryClassificationMetrics has some
useful things, but not basic things
➢ Have to use MulticlassMetrics for some of
the most wanted metrics, even on a
binary classifier
➢ Neither actually gives you the count of
items by label - but
BinaryClassificationMetrics logs it to INFO
➢ End up iterating your data 3 (!) times to
get all desired metrics
Computing metrics
(defn metrics [rdd model]
  (let [pl (->> rdd
                (spark/map (fn [point]
                             (let [y (.label point) x (.features point)]
                               (spark/tuple (.predict model x) y))))
                spark/cache)
        multi-metrics (MulticlassMetrics. (.rdd pl))
        metrics (BinaryClassificationMetrics. (.rdd pl))
        r {:area-under-pr (.areaUnderPR metrics)
           :f-measure (.fMeasure multi-metrics 1.0) ;; Others elided
           :label-counts (->> rdd
                              (spark/map-to-pair
                               (fn [point] (spark/tuple (.label point) 1)))
                              spark/count-by-key)}]
    (.unpersist pl false)
    r))
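For comparison, here is what computing the basic numbers in a single traversal might look like on an in-memory collection of (prediction, label) pairs. This is a plain-Clojure sketch, not MLlib code; `binary-metrics` is a hypothetical helper, and it assumes labels and predictions are the doubles 1.0 (positive) and 0.0 (negative).

```clojure
(defn binary-metrics
  "Takes a seq of [predicted actual] pairs (1.0 = positive, 0.0 = negative)
   and returns precision, recall, F-measure and the item count per label."
  [pairs]
  (let [counts (frequencies (map vec pairs))        ; one traversal of the data
        n      (fn [predicted actual] (get counts [predicted actual] 0))
        tp     (n 1.0 1.0)
        fp     (n 1.0 0.0)
        fneg   (n 0.0 1.0)
        tn     (n 0.0 0.0)
        ratio  (fn [a b] (if (zero? b) 0.0 (/ (double a) b)))
        precision (ratio tp (+ tp fp))
        recall    (ratio tp (+ tp fneg))]
    {:precision    precision
     :recall       recall
     :f-measure    (ratio (* 2 precision recall) (+ precision recall))
     :label-counts {1.0 (+ tp fneg), 0.0 (+ fp tn)}}))
```

Everything here derives from the four confusion-matrix counts, which is why a single pass is enough once they are available.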
❖ Examples - Eye on the prize?
➢ HashingTF - oh boy
■ Lose all access to original word
■ Uses gigantic Array instead of a
HashMap
➢ ChiSqSelector - used to select top N
features
■ but how do we determine N? Can’t ask
■ End up grubbing around in the source
to find it uses Statistics/chiSqTest
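Why hashing loses the original word can be seen in a small plain-Clojure sketch of the general hashing trick. This is not Spark's HashingTF implementation; `hashing-tf` and `build-bow` are hypothetical names, and Clojure's own `hash` stands in for whatever hash function a real library uses.

```clojure
(defn hashing-tf
  "Term-frequency vector via the hashing trick: each word lands in bucket
   (hash mod num-features). The index cannot be mapped back to a word."
  [num-features words]
  (reduce (fn [v w] (update v (mod (hash w) num-features) inc))
          (vec (repeat num-features 0))
          words))

(defn build-bow
  "Explicit vocabulary instead: returns [word->index index->word], so any
   feature index can still be traced back to its word."
  [vocabulary]
  (let [bow (zipmap (sort vocabulary) (range))]
    [bow (zipmap (vals bow) (keys bow))]))
```

The explicit pair of maps costs memory for the vocabulary, but it is what makes it possible to report a feature index together with the word behind it.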
Computing Chi-Square Test
(let [sql-ctx (spark-util/make-sql-context ctx)
      labels-features-df (spark-util/maybe-sample-df options
                           (spark-util/load-table sql-ctx "features" input))
      labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet
                               labels-features-df true)
                              (spark/map
                               (fn [m] (get-in m
                                               [:labeled-points :term-count]))))
      [bow rbow] (bow/load-bow-maps-from-table sql-ctx
                   (spark-util/load-table sql-ctx "bow" bow-input))
      chi-sq-arr (Statistics/chiSqTest labeled-points-rdd)]
  (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))]
    (log/info "Feature:" ind (rbow ind) "tst:" tst)))
Classification w/
Random Forests
❖ Classification
➢ Using lots of data to tell things apart
➢ Can put stuff into two buckets (or
“classes”) - Binary Classifier
➢ Or into many buckets - Multi-class
Classifier
➢ Lots of different techniques
➢ Supervised learning - each sample needs:
■ “features” - a vector of numeric data
■ “label” - a label specifying its class
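As a concrete picture of one training sample, the sketch below turns a document's words into a dense count vector plus a label. It is plain Clojure for illustration only; `doc->example` and the tiny vocabulary in the usage note are hypothetical, and a real pipeline would use sparse vectors and MLlib's LabeledPoint.

```clojure
(defn doc->example
  "Builds one supervised-learning sample from a document's words.
   vocab-index maps each known word to its position in the feature vector;
   words outside the vocabulary are ignored."
  [vocab-index label words]
  (let [counts   (frequencies (filter #(contains? vocab-index %) words))
        features (reduce-kv (fn [v word n]
                              (assoc v (vocab-index word) (double n)))
                            (vec (repeat (count vocab-index) 0.0))
                            counts)]
    {:label label :features features}))
```

For example, with the vocabulary `{"dress" 0, "shoes" 1, "engine" 2}`, the words `["dress" "shoes" "dress" "sale"]` and label 1.0 give features `[2.0 1.0 0.0]`.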
❖ The Bag of Words
➢ We started with very basic word cleansing
- lowercase, remove non letters/digits, 3
char min length, drop things that are just numbers
➢ Managed to make it this far in the talk without
having to use word count!
➢ But ultimately most Data Science/ML
tasks involving text end up heavily
dependent on word count
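The cleansing rules listed above could be written roughly as follows. The talk's own `clean-word-seq` is not shown, so this is a guess at its shape rather than the actual implementation; in particular, treating every non-ASCII character as a separator is an assumption of this sketch.

```clojure
(require '[clojure.string :as str])

(defn clean-word-seq
  "Lowercases the text, splits on anything that is not an ASCII letter or
   digit, keeps tokens of at least 3 characters, and drops pure numbers."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (remove #(< (count %) 3))
       (remove #(re-matches #"[0-9]+" %))))
```

For instance, `(clean-word-seq "The Quick-brown fox, 2015 and 3D tv!")` keeps "the", "quick", "brown", "fox" and "and", while "2015", "3d" and "tv" are dropped.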
❖ The Bag of Words (cont)
➢ Ended up with too many words (1.3M)
even on sample
➢ We were working on a bare baseline, so no
stopword removal or stemming, following
the KISS principle
➢ We did require that a word occur on >= 5
distinct sites (not documents), which reduced
the size to 460k words
(defn create-bow-site-occurance [json-lines-rdd]
  (->> json-lines-rdd
       ;; one (site, set-of-words) pair per document
       (spark/map-to-pair
        (fn [m] (spark/tuple (site (:url m))
                             (set (clean-word-seq (:raw_text m))))))
       ;; merge the word sets of all documents from the same site
       (spark/reduce-by-key union)
       (spark/flat-map-to-pair
        (s-de/key-value-fn
         (fn [site words] (map spark/tuple words (repeat 1)))))
       ;; number of distinct sites each word occurs on
       (spark/reduce-by-key +)
       (spark/filter
        (s-de/key-value-fn
         (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))
       spark/sort-by-key))
Bag of Words
❖ Random Forests™
➢ Ensemble of Decision Trees
➢ Uses “bootstrapping” for selection of
feature set and training set
➢ Not “Deep Learning” but extremely easy
to use and very effective
➢ “Any sufficiently advanced technology is
indistinguishable from magic.”
➢ Able to get pretty decent results! F-measure 0.86
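The two ideas behind the ensemble, resampling the training data and letting the trees vote, fit in a few lines of plain Clojure. This is a toy sketch and not how MLlib trains a forest: `bootstrap-sample` and `majority-vote` are hypothetical helpers, and a "tree" is just any function from a sample to a class.

```clojure
(import '(java.util Random))

(defn bootstrap-sample
  "Draws (count xs) items from xs uniformly at random, with replacement."
  [^Random rng xs]
  (let [v (vec xs)
        n (count v)]
    (vec (repeatedly n #(nth v (.nextInt rng n))))))

(defn majority-vote
  "Asks every tree to classify x and returns the most common answer."
  [trees x]
  (->> trees
       (map #(% x))
       frequencies
       (apply max-key val)
       key))
```

Each tree would be trained on its own bootstrap sample, and the forest's prediction is whatever class most trees agree on, for example `(majority-vote [tree-a tree-b tree-c] sample)`.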
Train the Random Forest from LabeledPoints
(defn train-random-forest [num-trees max-depth max-bins seed
                           labeled-points-rdd]
  (let [p {:num-classes 2, :categorical-feature-info {},
           :feature-subset-strategy "auto", :impurity "gini",
           :max-depth max-depth, :max-bins max-bins}]
    (RandomForest/trainClassifier labeled-points-rdd
                                  (:num-classes p)
                                  (:categorical-feature-info p)
                                  num-trees
                                  (:feature-subset-strategy p)
                                  (:impurity p)
                                  (:max-depth p)
                                  (:max-bins p)
                                  seed)))
Prepare to train/test RandomForest
(defn load-and-train-random-forest [rdd num-trees max-depth max-bins
                                    seed & [sample-fraction]]
  (let [sampled-rdd (if sample-fraction
                      (spark/sample false sample-fraction seed rdd)
                      rdd)
        labeled-rdd (->> sampled-rdd
                         (spark/map #(labeled-point lf/fashion? %)))
        ;; 90% of the labeled data for training, 10% held out for testing
        [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed)
        cached-train (spark/cache train)
        cached-test (spark/cache test)
        model (train-random-forest num-trees max-depth max-bins seed
                                   cached-train)]
    [cached-train cached-test model]))
Topic Modelling
with LDA
❖ LDA - Latent Dirichlet Allocation
➢ Topic Model which infers topics from text
corpus
➢ Topics -> cluster centers, docs -> rows
➢ Features are vectors of word counts (Bag
of Words)
➢ Unsupervised Learning technique (but
you do supply the topic count)
❖ LDA (cont)
➢ Quite tetchy to run at large scale
➢ OutOfMemory error on executors
➢ Job aborted due to stage failure: Serialized task 4341:0 was
365752339 bytes, which exceeds max allowed: spark.akka.frameSize
(134217728 bytes) - reserved (204800 bytes). Consider increasing
spark.akka.frameSize or using broadcast variables for large values.
➢ WTF?
➢ BTW, do not ever change “spark.akka.frameSize”...
❖ LDA (moar cont)
➢ Finally able to get a trained model after
reducing BoW to more manageable size
~11k down from ~160k
➢ Trained on ~100k documents, roughly
even split between fashion/non-fashion
➢ These models are for demonstration
purposes, moar fanciness planned
Train an LDA Model
(defn train-lda-model [num-topics seed features-fn maps-rdd]
  (let [rdd (->> maps-rdd
                 (spark/map (fn [{:keys [doc-number] :as m}]
                              (spark/tuple doc-number (features-fn m))))
                 spark/cache)
        corpus-size (spark/count rdd)
        mbf (mini-batch-fraction-batch-size corpus-size 5000)
        max-iters (int (Math/ceil (/ mbf)))
        optimizer (doto (OnlineLDAOptimizer.)
                    (.setMiniBatchFraction (min 1.0 mbf)))
        model (-> (doto (LDA.) (.setOptimizer optimizer) (.setK num-topics)
                    (.setSeed seed) (.setMaxIterations max-iters))
                  (.run (.rdd rdd)))]
    (.unpersist rdd false)
    model))
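The helper `mini-batch-fraction-batch-size` is not shown on the slide. One plausible reading, sketched below in plain Clojure, is that it returns the fraction of the corpus that makes up one batch of the requested size; running 1/fraction iterations then samples roughly one corpus worth of documents overall. Both function names here are hypothetical and the interpretation is an assumption.

```clojure
(defn mini-batch-fraction
  "Fraction of the corpus sampled per iteration so that a batch holds about
   batch-size documents. May exceed 1.0 for tiny corpora, so callers clamp it."
  [corpus-size batch-size]
  (/ (double batch-size) corpus-size))

(defn iterations-for-one-pass
  "Number of iterations needed to sample about one corpus worth of documents."
  [fraction]
  (int (Math/ceil (/ 1.0 (min 1.0 fraction)))))
```

With 80,000 documents and batches of 5,000 this gives a fraction of 0.0625 and 16 iterations; a corpus smaller than one batch is processed in a single iteration.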
Demo!
So what’s
the point?
❖ So what did we do?
➢ We took pre-scraped, “pre-labeled” data
➢ Used Clojure and Spark/Sparkling to
munge the data
➢ Used state of the art ML tools to analyze
the data
➢ Explored for insights
❖ So what can YOU do?
➢ This will work for almost ANY domain
➢ There’s a lot of interesting information
even at this stage
➢ There’s a ton of interesting directions this
can go
■ Run classifier over all of CC data
■ Build domain-specific LDA models
➢ Do cool things and have fun doing it!
Hunter Kelly
@retnuh
https://github.com/retnuh

Similar to Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
Pune Clojure Course Outline
Pune Clojure Course OutlinePune Clojure Course Outline
Pune Clojure Course Outline
Baishampayan Ghose
 
Apache Spark Workshop
Apache Spark WorkshopApache Spark Workshop
Apache Spark Workshop
Michael Spector
 
Introduction to R
Introduction to RIntroduction to R
Introduction to Ragnonchik
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
Hugo Gävert
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Matt Stubbs
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
Brian O'Neill
 
Designing a database like an archaeologist
Designing a database like an archaeologistDesigning a database like an archaeologist
Designing a database like an archaeologist
yoavrubin
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
appaji intelhunt
 
Workshop on command line tools - day 2
Workshop on command line tools - day 2Workshop on command line tools - day 2
Workshop on command line tools - day 2
Leandro Lima
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016
Mark Smith
 
Securerank ping-opendns
Securerank ping-opendnsSecurerank ping-opendns
Securerank ping-opendns
Ping Yan
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 

Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk

  • 1. Hunter Kelly @retnuh All the Topics on the Interwebs
  • 5. embassy wikileaks assange german merkel cables snowden speigel spying
  • 7. ❖ What are we actually doing? ➢ Mining web pages for insights ❖ How? ➢ Using Machine Learning to do heavy lifting ■ Use Classifiers to filter/bucket the data ■ Build Topic Models to try to discover concepts related to words
  • 8. ❖ Getting Data ➢ DMOZ ➢ Common Crawl ❖ Manipulating Data ➢ Spark ➢ Sparkling ■ RDDs ■ DataFrames ❖ Data Science ➢ MLLib ➢ Classification - Random Forests™ ➢ LDA (Latent Dirichlet Allocation)
  • 10. ❖ DMOZ ➢ “The largest human edited directory of the web” ➢ Useful when you think of it in terms of “free crowdsourced labeled data” ➢ Fairly ancient, borderline decrepit ➢ Crowdsourced is a double edged sword
  • 11. ❖ Common Crawl (CC) ➢ “an open repository of web crawl data that can be accessed and analyzed by anyone.” ➢ Monthly crawls ➢ Readily accessible index ➢ Tons of free data - raw, links, plain text formats
  • 12. ❖ How to use them together! ➢ Use DMOZ to get samples of positive and negative “seed links” ➢ Look up and expand your “seed links” using the CC index ➢ Fetch your data with little/no fuss using CC index information
  • 14. ❖ Apache Spark ➢ The “next big thing” ➢ Or arguably the “current” big thing ❖ Sparkling ➢ Clojure bindings to Spark ➢ Great Presentation (highly recommended) ➢ RDDs ➢ DataFrames
  • 15. RDDs
  • 16. ❖ RDDs ➢ Resilient Distributed Datasets ➢ Easy to think of them as partitioned (or sharded) seqs ➢ Transformations (map, filter, etc) are lazy ➢ Operations (count, collect, reduce, etc) cause evaluation ➢ Very familiar paradigms for Clojure programmers
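The lazy/eager split on the RDD slide has a close analogue in plain Clojure lazy seqs. The sketch below involves no Spark at all; `counted-inc` and the `calls` counter are made up for illustration, and the only point is that building the pipeline runs nothing until an eager operation such as `reduce` asks for a result.

```clojure
;; Plain Clojure analogy only: no Spark here.
;; `calls` counts how many times the mapping function actually ran.
(def calls (atom 0))

(defn counted-inc [x]
  (swap! calls inc)
  (inc x))

;; "Transformations": map and filter only describe the work.
(def pipeline
  (->> (range 10)
       (map counted-inc)
       (filter even?)))

(def calls-before @calls)        ; 0, nothing has been evaluated yet

;; "Operation": reduce forces evaluation of the whole pipeline.
(def total (reduce + pipeline))  ; 2 + 4 + 6 + 8 + 10 = 30

(def calls-after @calls)         ; 10, one call per input element
```

On an RDD the same shape holds: `spark/map` and `spark/filter` build up the plan, while `spark/count`, `spark/collect`, or `spark/reduce` trigger the job.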
  • 18. Clojure using Reducers
    (defn sieve-prime-multiples [n primes numbers]
      (let [max-prime (last primes)
            upto (* max-prime max-prime)
            prime-multiples (->> primes
                                 (r/mapcat #(generate-multiples % n (odd? %)))
                                 (into #{}))
            candidates (->> numbers
                            (r/remove prime-multiples))
            new-primes (->> candidates
                            (r/filter #(< % upto))
                            r/foldcat
                            sort
                            (into []))
            remaining (->> candidates
                           (r/remove (set new-primes))
                           r/foldcat)]
        [new-primes remaining]))
  • 19. Clojure using Spark
    (defn sieve-prime-multiples [ctx n primes numbers-rdd]
      (let [max-prime (last primes)
            upto (* max-prime max-prime)
            prime-multiples-rdd (->> (spark/parallelize ctx primes)
                                     (spark/flat-map #(generate-multiples % n (odd? %))))
            candidates-rdd (spark/cache (.subtract numbers-rdd prime-multiples-rdd))
            new-primes-rdd (->> candidates-rdd
                                (spark/filter #(< % upto))
                                spark/cache)
            ;; bound as new-primes so it matches the value returned below
            new-primes (vec (sort (spark/collect new-primes-rdd)))
            remaining-rdd (.subtract candidates-rdd new-primes-rdd)]
        (.unpersist candidates-rdd false)
        (.unpersist new-primes-rdd false)
        [new-primes remaining-rdd]))
  • 20. ❖ A Historical Tangent ➢ “Those who cannot remember the past are condemned to repeat it.” ➢ ~15 years ago, everything was running MySQL, Oracle, etc. ➢ ~7 years ago, everyone was abandoning SQL+RDBMS for NoSQL ➢ Now looping back to SQL - Spark SQL, Google F1, etc.
  • 22. ❖ DataFrames ➢ DataFrames are the new hotness ➢ It’s how Python and R can now achieve similar speeds ➢ The Catalyst execution engine can plan intelligently - behind the scenes, generates source code, heavy use of Scala macros, optimize away boxing/unboxing calls, etc. ➢ Focus is clearly on DataFrames and upcoming DataSets
  • 23. ❖ DataFrames (cont) ➢ Great in Scala, not so much via JVM interop ➢ Heavy use of Scala magic like implicits, etc. ➢ Working with DataFrames from Clojure can be… less than pleasant ➢ Scala folks really like their static, declared types ➢ Going to get worse with DataSets
  • 24. Creating a single column DataFrame
    (def FEATURE-TYPE [[:feature DataTypes/IntegerType]])
    (def FEATURE-SCHEMA (types->schema FEATURE-TYPE))
    (defn create-feature-table [sql-ctx table-name features]
      (let [ctx (.sparkContext sql-ctx)
            features-rdd (->> (spark/parallelize (JavaSparkContext. ctx) (seq features))
                              (spark/map (fn [i] (RowFactory/create (to-array [i])))))
            features-df (.createDataFrame sql-ctx features-rdd FEATURE-SCHEMA)]
        (.registerTempTable features-df table-name)
        features-df))
  • 25. (let [query-df (-> bow-df (.select "word" (into-array ["index"])))] (reduce (fn [[bow rbow] row] [(assoc bow (.getString row 0) (.getInt row 1)) (assoc rbow (.getInt row 1) (.getString row 0))]) [{} {}] (.collectAsList query-df))))
  • 26. (-> bow-df (.join features-df (.equalTo ind-col (.col features-df "feature"))) (.select (into-array [(.col bow-df "*") feature-index-col])) (.orderBy (into-array [feature-index-col])))
  • 28. ❖ Machine Learning Key Points ➢ Uses statistical methods on large amounts of data to hopefully gain insights ➢ Uses vectors of numbers extracted (by you) from your data - “feature vectors” ➢ Classification puts things into buckets, i.e. “fashion related website” vs. “everything else” ➢ Topic modeling - way of finding patterns in a bunch of documents - a “corpus”
  • 29. MLLib
  • 30. ❖ MLLib ➢ Spark’s Machine Learning (ML) library ➢ “Its goal is to make practical machine learning scalable and easy” ➢ Divides into two packages: ■ spark.mllib - built on top of RDDs ■ spark.ml - built on top of DataFrames
  • 31. ❖ MLLib (cont) ➢ All the basics - Vectors, Sparse Vectors, LabeledPoints, etc. ➢ A good variety of algorithms, all designed for running in parallel ➢ Well documented ➢ Large community
  • 32. MLLib gives us this...
  • 33. But we want this!
  • 34. ❖ Example - Metrics ➢ BinaryClassificationMetrics has some useful things, but not the basic ones ➢ Have to use MulticlassMetrics for some of the most wanted metrics, even on a binary classifier ➢ Neither actually gives you the count of items by label - but BinaryClassificationMetrics logs it to INFO ➢ End up iterating over your data 3 (!) times to get all desired metrics
  • 35. Computing metrics
    (defn metrics [rdd model]
      (let [pl (->> rdd
                    (spark/map (fn [point]
                                 (let [y (.label point)
                                       x (.features point)]
                                   (spark/tuple (.predict model x) y))))
                    spark/cache)
            multi-metrics (MulticlassMetrics. (.rdd pl))
            metrics (BinaryClassificationMetrics. (.rdd pl))
            r {:area-under-pr (.areaUnderPR metrics)
               :f-measure (.fMeasure multi-metrics 1.0)
               ;; Others elided
               :label-counts (->> rdd
                                  (spark/map-to-pair (fn [point] (spark/tuple (.label point) 1)))
                                  spark/count-by-key)}]
        (.unpersist pl false)
        r))
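For comparison, the basic numbers wanted here (counts by label, precision, recall, F-measure) all fall out of one pass over `[prediction label]` pairs. The following is a minimal plain-Clojure sketch, not MLLib code and not the talk's implementation: `confusion-counts` and `summarize` are hypothetical names, the positive class is assumed to be `1.0`, and empty classes (division by zero) are not handled.

```clojure
(defn confusion-counts
  "Single pass over [prediction label] pairs; 1.0 is assumed to be the positive class."
  [pairs]
  (reduce (fn [acc [pred label]]
            (let [k (cond
                      (and (== pred 1.0) (== label 1.0)) :true-pos
                      (== pred 1.0)                      :false-pos
                      (== label 1.0)                     :false-neg
                      :else                              :true-neg)]
              (update acc k inc)))
          {:true-pos 0 :false-pos 0 :false-neg 0 :true-neg 0}
          pairs))

(defn summarize
  "Derives the headline metrics from the four confusion counts."
  [{:keys [true-pos false-pos false-neg true-neg]}]
  (let [precision (/ true-pos (double (+ true-pos false-pos)))
        recall    (/ true-pos (double (+ true-pos false-neg)))]
    {:precision    precision
     :recall       recall
     :f-measure    (/ (* 2 precision recall) (+ precision recall))
     :label-counts {1.0 (+ true-pos false-neg)
                    0.0 (+ true-neg false-pos)}}))

;; Tiny made-up sample: 2 true positives, 1 false positive,
;; 1 false negative, 1 true negative.
(def example-pairs [[1.0 1.0] [1.0 0.0] [0.0 1.0] [0.0 0.0] [1.0 1.0]])
(def example-counts (confusion-counts example-pairs))
(def example-summary (summarize example-counts))
```

On a real RDD the same fold could presumably be expressed as an aggregate over the cached predictions, which would avoid the extra passes; that variant is not shown in the slides.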
  • 36. ❖ Examples - Eye on the prize? ➢ HashingTF - oh boy ■ Lose all access to the original word ■ Uses a gigantic Array instead of a HashMap ➢ ChiSqSelector - used to select top N features ■ but how do we determine N? Can’t ask ■ End up grubbing around in the source to find that it uses Statistics/chiSqTest
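Why the original word is lost becomes clear from a sketch of the hashing trick itself. This is a plain-Clojure illustration under stated assumptions: it uses Clojure's own `hash`, not whatever hash function HashingTF uses internally, and `hashed-index` and `hashing-tf` are hypothetical names.

```clojure
(defn hashed-index
  "Maps a word to a feature index in [0, num-features).
  Uses Clojure's hash as a stand-in; the mapping is one-way."
  [num-features word]
  (mod (hash word) num-features))

(defn hashing-tf
  "Term frequencies keyed by hashed index instead of by word."
  [num-features words]
  (frequencies (map #(hashed-index num-features %) words)))

;; With a single feature slot every word collides, which makes the
;; information loss obvious: the result cannot say which words occurred.
(def collapsed (hashing-tf 1 ["shoe" "dress" "shoe"]))   ; {0 3}

;; With more slots collisions are rarer, but the indices still cannot
;; be mapped back to words without keeping a separate dictionary.
(def spread (hashing-tf 1000 ["shoe" "dress" "shoe"]))
```

Keeping an explicit word-to-index map (the `bow` / `rbow` pair used elsewhere in the deck) is what makes it possible to print the word next to each feature.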
  • 37. Computing Chi-Square Test
    (let [sql-ctx (spark-util/make-sql-context ctx)
          labels-features-df (spark-util/maybe-sample-df
                              options
                              (spark-util/load-table sql-ctx "features" input))
          labeled-points-rdd (->> (lf/load-labels-and-features-from-parquet labels-features-df true)
                                  (spark/map (fn [m] (get-in m [:labeled-points :term-count]))))
          [bow rbow] (bow/load-bow-maps-from-table
                      sql-ctx
                      (spark-util/load-table sql-ctx "bow" bow-input))
          chi-sq-arr (Statistics/chiSqTest labeled-points-rdd)]
      (doseq [[ind tst] (map-indexed vector (seq chi-sq-arr))]
        (log/info "Feature:" ind (rbow ind) "tst:" tst)))
  • 39. ❖ Classification ➢ Using lots of data to tell things apart ➢ Can put stuff into two buckets (or “classes”) - Binary Classifier ➢ Or into many buckets - Multi-class Classifier ➢ Lots of different techniques ➢ Supervised learning - each sample needs: ■ “features” - a vector of numeric data ■ “label” - a label specifying its class
  • 40. ❖ The Bag of Words ➢ We started with very basic word cleansing - lowercase, remove non letters/digits, 3 char min length, drop things that are just numbers ➢ Managed to make it this far in the talk without having to use word count! ➢ But ultimately most Data Science/ML tasks involving text end up heavily dependent on word count
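The cleansing rules listed above fit in a few lines of Clojure. The sketch below is a guess at their shape, not the talk's actual `clean-word-seq`: `clean-words` is a hypothetical name, "remove non letters/digits" is interpreted as splitting on any run of such characters, and only ASCII letters are treated as letters.

```clojure
(require '[clojure.string :as str])

(defn clean-words
  "Lowercases text, splits on anything that is not an ASCII letter or digit,
  keeps tokens of at least 3 characters, and drops tokens that are only digits."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (filter #(>= (count %) 3))
       (remove #(re-matches #"[0-9]+" %))))

(def cleaned
  (clean-words "The 2015 Spring-Summer look: 3 new XL bags, no2!"))
;; => ("the" "spring" "summer" "look" "new" "bags" "no2")
```

Note that "the" survives: as the next slide says, no stopword removal or stemming was applied at this baseline stage.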
  • 41. ❖ The Bag of Words (cont) ➢ Ended up with too many words (1.3M) even on a sample ➢ Were working on a bare baseline, so no stopword removal or stemming, following the KISS principle ➢ We did require that a word occur on >= 5 distinct sites (not documents), which reduced the size to 460k words
  • 42. Bag of Words
    (defn create-bow-site-occurance [json-lines-rdd]
      (->> json-lines-rdd
           (spark/map-to-pair
            (fn [m] (spark/tuple (site (:url m))
                                 (set (clean-word-seq (:raw_text m))))))
           (spark/reduce-by-key union)
           (spark/flat-map-to-pair
            (s-de/key-value-fn
             (fn [site words] (map spark/tuple words (repeat 1)))))
           (spark/reduce-by-key +)
           (spark/filter
            (s-de/key-value-fn
             (fn [w c] (>= c MIN-SITE-OCCURANCE-COUNT))))
           spark/sort-by-key))
  • 43. ❖ Random Forests™ ➢ Ensemble of Decision Trees ➢ Uses “bootstrapping” for selection of feature set and training set ➢ Not “Deep Learning” but extremely easy to use and very effective ➢ “Any sufficiently advanced technology is indistinguishable from magic.” ➢ Able to get pretty decent results! F-measure 0.86
  • 44. Train the Random Forest from LabeledPoints
    (defn train-random-forest [num-trees max-depth max-bins seed labeled-points-rdd]
      (let [p {:num-classes 2,
               :categorical-feature-info {},
               :feature-subset-strategy "auto",
               :impurity "gini",
               :max-depth max-depth,
               :max-bins max-bins}]
        (RandomForest/trainClassifier labeled-points-rdd
                                      (:num-classes p)
                                      (:categorical-feature-info p)
                                      num-trees
                                      (:feature-subset-strategy p)
                                      (:impurity p)
                                      (:max-depth p)
                                      (:max-bins p)
                                      seed)))
  • 45. Prepare to train/test RandomForest
    (defn load-and-train-random-forest
      [rdd num-trees max-depth max-bins seed & [sample-fraction]]
      (let [sampled-rdd (if sample-fraction
                          (spark/sample false sample-fraction seed rdd)
                          rdd)
            labeled-rdd (->> sampled-rdd
                             (spark/map #(labeled-point lf/fashion? %)))
            [train test] (.randomSplit labeled-rdd (double-array [0.9 0.1]) seed)
            cached-train (spark/cache train)
            cached-test (spark/cache test)
            model (train-random-forest num-trees max-depth max-bins seed cached-train)]
        [cached-train cached-test model]))
  • 47. ❖ LDA - Latent Dirichlet Allocation ➢ Topic Model which infers topics from text corpus ➢ Topics -> cluster centers, docs -> rows ➢ Features are vectors of word counts (Bag of Words) ➢ Unsupervised Learning technique (but you do supply the topic count)
  • 48. ❖ LDA (cont) ➢ Quite tetchy to run at large scale ➢ OutOfMemory error on executors ➢ Job aborted due to stage failure: Serialized task 4341:0 was 365752339 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values. ➢ WTF? ➢ BTW, do not ever change “spark.akka.frameSize”...
  • 50. ❖ LDA (moar cont) ➢ Finally able to get a trained model after reducing BoW to a more manageable size: ~11k, down from ~160k ➢ Trained on ~100k documents, roughly even split between fashion/non-fashion ➢ These models are for demonstration purposes, moar fanciness planned
  • 51. Train an LDA Model
    (defn train-lda-model [num-topics seed features-fn maps-rdd]
      (let [rdd (->> maps-rdd
                     (spark/map (fn [{:keys [doc-number] :as m}]
                                  (spark/tuple doc-number (features-fn m))))
                     spark/cache)
            corpus-size (spark/count rdd)
            mbf (mini-batch-fraction-batch-size corpus-size 5000)
            max-iters (int (Math/ceil (/ mbf)))
            optimizer (doto (OnlineLDAOptimizer.)
                        (.setMiniBatchFraction (min 1.0 mbf)))
            model (-> (doto (LDA.)
                        (.setOptimizer optimizer)
                        (.setK num-topics)
                        (.setSeed seed)
                        (.setMaxIterations max-iters))
                      (.run (.rdd rdd)))]
        (.unpersist rdd false)
        model))
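The helper `mini-batch-fraction-batch-size` is not shown in the slides. One plausible reading, and it is only an assumption, is that it turns a target batch size into the fraction of the corpus sampled per iteration, so that `(/ mbf)` iterations amount to roughly one pass over the data. The names `mini-batch-fraction` and `iterations-for-one-pass` below are hypothetical.

```clojure
(defn mini-batch-fraction
  "Fraction of the corpus to sample per iteration so that each mini-batch
  holds roughly batch-size documents. Capped at 1.0 for small corpora."
  [corpus-size batch-size]
  (min 1.0 (/ (double batch-size) corpus-size)))

(defn iterations-for-one-pass
  "Number of mini-batch iterations needed to see about one corpus' worth of documents."
  [fraction]
  (int (Math/ceil (/ 1.0 fraction))))

;; 80,000 documents in batches of 5,000: sample 1/16 of the corpus per step.
(def example-fraction (mini-batch-fraction 80000 5000))            ; 0.0625
(def example-iterations (iterations-for-one-pass example-fraction)) ; 16

;; A corpus smaller than the batch size degenerates to full-batch updates.
(def small-fraction (mini-batch-fraction 1000 5000))                ; 1.0
```

Under this reading, the ~100k-document corpus from the previous slide with a batch size of 5000 would give a fraction of about 0.05.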
  • 52. Demo!
  • 54. ❖ So what did we do? ➢ We took pre-scraped, “pre-labeled” data ➢ Used Clojure and Spark/Sparkling to munge the data ➢ Used state of the art ML tools to analyze the data ➢ Explored for insights
  • 55. ❖ So what can YOU do? ➢ This will work for almost ANY domain ➢ There’s a lot of interesting information even at this stage ➢ There’s a ton of interesting directions this can go ■ Run classifier over all of CC data ■ Build domain-specific LDA models ➢ Do cool things and have fun doing it!