© Hortonworks Inc. 2011–2016. All Rights Reserved
Revolutionize Text Mining with Spark and Zeppelin
April 2017
Yanbo Liang
Apache Spark committer
Software engineer @ Hortonworks

Agenda
Text mining workflow on Big Data
Text mining with Spark and MLlib
Spark and Zeppelin as the platform

Text Mining: Practical Applications
• Text classification
  – Spam filtering
  – Fraud detection
• Text clustering
• Sentiment analysis
• Entity extraction
• Recommendations
• Automatic labeling
• Contextual advertising

Traditional Text Mining
• Commercial software
  – IBM SPSS, RapidMiner, SAS
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R

Text Mining on Big Data
[Diagram: Data Scientists, Software engineers]

Why Apache Spark MLlib
• Scalable machine learning algorithms on top of Spark
  – Alternating Least Squares on Spotify data
    • 50+ million users x 30+ million songs, 50 billion ratings
    • For rank 10 with 10 iterations, ~1 hour running time
• Workflow utilities
  – ML pipeline
  – Model import/export
  – Cross-validation

Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
[Diagram labels: Data Science, Software engineering]

Load data
Text                    Label
I bought the game…      4
Do NOT bother try…      1
This shirt is awesome…  5
never got it. Seller…   1
I ordered this to…      3
[Diagram: Dataset, Feature engineering, Model training, Model evaluation]

Extract features
Text                    Label  Words                 Features
I bought the game…      4      “i”, “bought”, …      [1, 0, 3, 9, …]
Do NOT bother try…      1      “do”, “not”, …        [0, 0, 11, 0, …]
This shirt is awesome…  5      “this”, “shirt”, …    [0, 2, 3, 1, …]
never got it. Seller…   1      “never”, “got”, …     [1, 2, 0, 0, …]
I ordered this to…      3      “i”, “ordered”, …     [1, 0, 0, 3, …]
[Diagram: Dataset, Feature engineering, Model training, Model evaluation]
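
The Words and Features columns can be illustrated with a minimal plain-Python sketch. It is not the MLlib API: `tokenize`, `build_vocabulary`, and `count_vector` are hypothetical helpers, and the two sample reviews are made up for illustration. In MLlib the corresponding stages would be a tokenizer followed by a vectorizer such as CountVectorizer.

```python
import re


def tokenize(text):
    """Lowercase the text and split on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def build_vocabulary(token_lists):
    """Give each distinct token an index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def count_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    vector = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:  # terms outside the vocabulary are ignored
            vector[vocab[tok]] += 1
    return vector


# Hypothetical sample reviews, not taken from the slides.
docs = ["This shirt is awesome", "Never got it, never again"]
words = [tokenize(doc) for doc in docs]
vocab = build_vocabulary(words)
features = [count_vector(tokens, vocab) for tokens in words]
```

Each document becomes a fixed-length vector whose positions follow the vocabulary order, which is the shape a downstream model expects.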

Fit a model
Text                    Label  Words                 Features          Probability  Prediction
I bought the game…      4      “i”, “bought”, …      [1, 0, 3, 9, …]   0.84
Do NOT bother try…      1      “do”, “not”, …        [0, 0, 11, 0, …]  0.62
This shirt is awesome…  5      “this”, “shirt”, …    [0, 2, 3, 1, …]   0.95
never got it. Seller…   1      “never”, “got”, …     [1, 2, 0, 0, …]   0.71
I ordered this to…      3      “i”, “ordered”, …     [1, 0, 0, 3, …]   0.74
[Diagram: Dataset, Feature engineering, Model training, Model evaluation]

Evaluate
Text                    Label  Words                 Features          Probability  Prediction
I bought the game…      4      “i”, “bought”, …      [1, 0, 3, 9, …]   0.84
Do NOT bother try…      1      “do”, “not”, …        [0, 0, 11, 0, …]  0.62
This shirt is awesome…  5      “this”, “shirt”, …    [0, 2, 3, 1, …]   0.95
never got it. Seller…   1      “never”, “got”, …     [1, 2, 0, 0, …]   0.71
I ordered this to…      3      “i”, “ordered”, …     [1, 0, 0, 3, …]   0.74
[Diagram: Dataset, Feature engineering, Model training, Model evaluation]

Key abstractions of the Spark ML pipeline
• Transformer
  – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
• Estimator
  – ML algorithms for training models (e.g., NaiveBayes).
• Evaluator
  – These evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).
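
How the three abstractions relate can be sketched in plain Python. This is an illustration of the contract only, not Spark code: the classes `Tokenizer`, `MajorityLabel`, `MajorityLabelModel`, and `AccuracyEvaluator` are hypothetical stand-ins, rows are plain dicts instead of a DataFrame, and the sample data is made up.

```python
from collections import Counter


class Tokenizer:
    """Transformer: maps a dataset to a new dataset with an extra column."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class MajorityLabelModel:
    """A fitted model is itself a Transformer: it appends predictions."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]


class MajorityLabel:
    """Estimator: fit() learns from the data and returns a model."""

    def fit(self, rows):
        counts = Counter(row["label"] for row in rows)
        return MajorityLabelModel(counts.most_common(1)[0][0])


class AccuracyEvaluator:
    """Evaluator: reduces a scored dataset to a single metric."""

    def evaluate(self, rows):
        hits = sum(row["prediction"] == row["label"] for row in rows)
        return hits / len(rows)


# Hypothetical toy dataset.
rows = [
    {"text": "Great game", "label": 1},
    {"text": "Never arrived", "label": 0},
    {"text": "Awesome shirt", "label": 1},
]
tokenized = Tokenizer().transform(rows)       # Transformer
model = MajorityLabel().fit(tokenized)        # Estimator -> model
scored = model.transform(tokenized)           # model acts as a Transformer
accuracy = AccuracyEvaluator().evaluate(scored)
```

The point of the split is that everything producing a new column shares one `transform` interface, so trained models and feature transformers can be chained interchangeably.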

Spark’s Text Mining algorithms
• LDA for topic modeling
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates the importance of words in a document with respect to the corpus
• And much more
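
To make the HashingTF and IDF bullet concrete, here is a small plain-Python sketch of the arithmetic. It makes several assumptions: `hashing_tf`, `idf_weights`, and `tf_idf` are hypothetical helpers, CRC32 is swapped in as the hash function for reproducibility (Spark uses a different hash), and the smoothed formula log((m + 1) / (df + 1)) follows the IDF definition given in the MLlib documentation.

```python
import math
import zlib


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length count vector by hashing each token.

    No vocabulary is stored; distinct tokens may collide in one bucket.
    """
    vector = [0] * num_features
    for tok in tokens:
        bucket = zlib.crc32(tok.encode("utf-8")) % num_features
        vector[bucket] += 1
    return vector


def idf_weights(tf_vectors):
    """Per-feature inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    n = len(tf_vectors[0])
    doc_freq = [sum(1 for vec in tf_vectors if vec[j] > 0) for j in range(n)]
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]


def tf_idf(tf_vectors):
    """Rescale term frequencies so that corpus-wide common terms count less."""
    weights = idf_weights(tf_vectors)
    return [[tf * w for tf, w in zip(vec, weights)] for vec in tf_vectors]


# Hypothetical tokenized corpus.
corpus = [["shirt", "awesome"], ["game", "awesome", "awesome"]]
tf_vectors = [hashing_tf(tokens) for tokens in corpus]
weighted = tf_idf(tf_vectors)
```

A term that occurs in every document gets weight log(1) = 0, so only terms that distinguish documents from each other contribute to the final vector.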

MLlib Text Mining Pipeline - classification
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer
HashingTF
IDF
StringIndexer
NaiveBayes
LogisticRegression
SVM
MLP
text classification
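
The way such stages are chained can be sketched in plain Python. This is roughly the idea behind a pipeline's fit step, not MLlib's implementation: `Pipeline`, `PipelineModel`, `SimpleTokenizer`, `StopWordFilter`, and `VocabCounter` are hypothetical stand-ins defined here, rows are plain dicts, and the sample texts are made up.

```python
class SimpleTokenizer:
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class StopWordFilter:
    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def transform(self, rows):
        return [
            dict(r, words=[w for w in r["words"] if w not in self.stop_words])
            for r in rows
        ]


class VocabCounterModel:
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            dict(r, features=[r["words"].count(t) for t in self.vocab])
            for r in rows
        ]


class VocabCounter:
    """Estimator: learns a sorted vocabulary from the training rows."""

    def fit(self, rows):
        return VocabCounterModel(sorted({w for r in rows for w in r["words"]}))


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: train it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


# Hypothetical training and scoring data.
training_rows = [{"text": "The shirt is awesome"}, {"text": "The game is fun"}]
pipeline = Pipeline([SimpleTokenizer(), StopWordFilter({"the", "is"}), VocabCounter()])
model = pipeline.fit(training_rows)
scored = model.transform([{"text": "Awesome awesome game"}])
```

Fitting returns a model that holds only transformers, so the same object can score new text without access to the training data.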

MLlib Text Mining Pipeline - topic model
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer
HashingTF
IDF
LDA
topic model

MLlib Text Mining Pipeline - recommendation
Dataset
RegexTokenizer
Word2Vec
recommendation

MLlib Text Mining Pipeline
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer
HashingTF
IDF
StringIndexer
NaiveBayes
LogisticRegression
SVM
MLP
LDA
Word2Vec
text classification
topic model
recommendation

Demo
• load the file contents and the categories
• extract feature vectors suitable for machine learning
• train a linear model to perform categorization
• use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
https://github.com/yanboliang/dataworks-munich-2017
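
The grid search step can be sketched in plain Python as an exhaustive search over parameter combinations, each scored by k-fold cross-validation. This is not the demo's code and not the MLlib API (where ParamGridBuilder and CrossValidator play these roles): `param_grid`, `k_fold`, `cross_validate`, and the toy `fit_threshold` "model" are hypothetical, and the data is made up.

```python
from itertools import product


def param_grid(options):
    """Expand {"name": [values]} into a list of parameter dicts."""
    keys = sorted(options)
    return [dict(zip(keys, vals)) for vals in product(*(options[k] for k in keys))]


def k_fold(rows, k):
    """Yield (train, held_out) splits; every k-th row forms one fold."""
    for i in range(k):
        held_out = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, held_out


def cross_validate(rows, options, fit, k=3):
    """Return the parameter dict with the best mean held-out accuracy."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(options):
        scores = []
        for train, held_out in k_fold(rows, k):
            predict = fit(train, **params)
            hits = sum(predict(x) == y for x, y in held_out)
            scores.append(hits / len(held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit_threshold(train, threshold):
    """Toy stand-in for training: ignores `train`, predicts 1 at or above threshold."""
    return lambda x: int(x >= threshold)


# Hypothetical (feature, label) pairs.
rows = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]
best_params, best_score = cross_validate(rows, {"threshold": [2, 4, 6]}, fit_threshold)
```

Because feature extraction settings and classifier settings live in the same grid, one search can tune both at once, which is what the last demo step describes.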

Customizing ML Pipelines
• MLlib 2.1 includes:
  – 30+ feature transformers (Tokenizer, Word2Vec, …)
  – 25+ models (for classification, regression, clustering, …)
  – Model tuning & evaluation
• But some applications require customized Transformers & Models

Options for customization
• Existing use cases:
  – spark-corenlp
  – spark-vlbfgs
• Extend abstractions
  – Transformer
  – Estimator & Model
  – Evaluator

Spark virtual environment
[Diagram: Data Scientist A, Data Scientist B; labels: Python 2.7 (x5), Python 3.5 (x3)]

Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
[Diagram labels: Data Science, Software engineering]
Duplicated and error-prone

ML persistence
• Prototype (Python/R)
• Create Pipeline
• Load Pipeline (Java/Scala)
  – Model.load("s3n://…")
• Deploy in production
[Diagram labels: Data Science, Software engineering]
Persist model or Pipeline:
model.save("s3n://…")
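
The save-then-load handoff can be sketched in plain Python. This only illustrates the idea of persisting a fitted model so another process can reload it instead of re-implementing it. It is not the MLlib persistence format: `VocabModel` is a hypothetical class, JSON is an assumed storage format, and a temporary local file stands in for shared storage.

```python
import json
import os
import tempfile


class VocabModel:
    """A fitted model whose entire state is its vocabulary."""

    def __init__(self, vocab):
        self.vocab = list(vocab)

    def transform(self, tokens):
        return [tokens.count(term) for term in self.vocab]

    def save(self, path):
        # Persist only the learned state, in a language-neutral format.
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"vocab": self.vocab}, handle)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as handle:
            return cls(json.load(handle)["vocab"])


# "Data science" side: fit and save. "Engineering" side: load and score.
model = VocabModel(["awesome", "game", "shirt"])
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "model.json")  # hypothetical location
    model.save(path)
    restored = VocabModel.load(path)

sample = ["awesome", "shirt", "awesome"]
same_output = restored.transform(sample) == model.transform(sample)
```

Because the stored state is language-neutral, the loading side does not have to be written in the language used for prototyping, which is what removes the duplicated re-implementation step.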

Data scientists work with software engineers
Data Scientists: explore data, create pipeline, find best params, save model
Software engineers: load model, deploy in production, scoring on batch/streaming data
