Revolutionize Text Mining with Spark and Zeppelin

© Hortonworks Inc. 2011–2016. All Rights Reserved
April 2017
Yanbo Liang
Apache Spark committer
Software engineer @ Hortonworks

Agenda
• Text mining workflow on Big Data
• Text mining with Spark and MLlib
• Spark and Zeppelin as the platform

Text Mining: Practical Applications
• Text classification
  – Spam filtering
  – Fraud detection
• Text clustering
• Sentiment analysis
• Entity extraction
• Recommendations
• Automatic labeling
• Contextual advertising

Traditional Text Mining
• Commercial software
  – IBM SPSS, RapidMiner, SAS
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R

Text Mining on Big Data
(Diagram labels: Data scientists, Software engineers)

Why Apache Spark MLlib
• Scalable machine learning algorithms on top of Spark
  – Alternating Least Squares on Spotify data
    • 50+ million users x 30+ million songs, 50 billion ratings
    • For rank 10 with 10 iterations, ~1 hour running time
• Workflow utilities
  – ML pipeline
  – Model import/export
  – Cross-validation

Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
(Diagram labels: Data science, Software engineering)

Load data
Text                     Label
I bought the game…       4
Do NOT bother try…       1
This shirt is awesome…   5
never got it. Seller…    1
I ordered this to…       3
Dataset → Feature engineering → Model training → Model evaluation

Extract features
Text                     Label  Words                 Features
I bought the game…       4      “i”, “bought”, …      [1, 0, 3, 9, …]
Do NOT bother try…       1      “do”, “not”, …        [0, 0, 11, 0, …]
This shirt is awesome…   5      “this”, “shirt”, …    [0, 2, 3, 1, …]
never got it. Seller…    1      “never”, “got”, …     [1, 2, 0, 0, …]
I ordered this to…       3      “i”, “ordered”, …     [1, 0, 0, 3, …]
Dataset → Feature engineering → Model training → Model evaluation
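The Words and Features columns can be approximated in plain Python. This is a minimal sketch, assuming lowercase regex tokenization and a vocabulary built from the corpus itself. It illustrates the idea behind tokenizing and count vectorizing and is not Spark's implementation. The sample texts are the visible prefixes of the truncated slide rows, and all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending count, then alphabetically, so indices are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(ordered)}


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    vec = [0] * len(vocabulary)
    for tok in tokens:
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1
    return vec


# Visible prefixes of the slide rows (the originals are truncated).
texts = ["I bought the game", "Do NOT bother", "This shirt is awesome"]
words = [tokenize(t) for t in texts]
vocab = build_vocabulary(words)
features = [count_vector(w, vocab) for w in words]
```

Each row of `features` has one column per vocabulary entry, which is the shape a downstream classifier expects.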

Fit a model
Text                     Label  Words                 Features           Probability  Prediction
I bought the game…       4      “i”, “bought”, …      [1, 0, 3, 9, …]    0.84
Do NOT bother try…       1      “do”, “not”, …        [0, 0, 11, 0, …]   0.62
This shirt is awesome…   5      “this”, “shirt”, …    [0, 2, 3, 1, …]    0.95
never got it. Seller…    1      “never”, “got”, …     [1, 2, 0, 0, …]    0.71
I ordered this to…       3      “i”, “ordered”, …     [1, 0, 0, 3, …]    0.74
Dataset → Feature engineering → Model training → Model evaluation

Evaluate
Text                     Label  Words                 Features           Probability  Prediction
I bought the game…       4      “i”, “bought”, …      [1, 0, 3, 9, …]    0.84
Do NOT bother try…       1      “do”, “not”, …        [0, 0, 11, 0, …]   0.62
This shirt is awesome…   5      “this”, “shirt”, …    [0, 2, 3, 1, …]    0.95
never got it. Seller…    1      “never”, “got”, …     [1, 2, 0, 0, …]    0.71
I ordered this to…       3      “i”, “ordered”, …     [1, 0, 0, 3, …]    0.74
Dataset → Feature engineering → Model training → Model evaluation

Key abstractions of the Spark ML pipeline
• Transformer
  – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
• Estimator
  – ML algorithms for training models (e.g., NaiveBayes).
• Evaluator
  – Evaluates predictions and computes metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).
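The division of labor between the three abstractions can be shown with a small plain-Python sketch. The method names `transform`, `fit` and `evaluate` mirror the ones Spark ML uses, but every class here is hypothetical, the rows are toy data, and a list of dicts stands in for a DataFrame.

```python
from collections import Counter


class LowercaseTokenizer:
    """Transformer: adds a 'words' column derived from 'text'."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class MajorityLabelModel:
    """Trained model, itself a Transformer: adds a 'prediction' column."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(r, prediction=self.label) for r in rows]


class MajorityLabelEstimator:
    """Estimator: fit() learns from data and returns a model."""

    def fit(self, rows):
        most_common = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return MajorityLabelModel(most_common)


class AccuracyEvaluator:
    """Evaluator: reduces predictions to a single metric."""

    def evaluate(self, rows):
        correct = sum(1 for r in rows if r["prediction"] == r["label"])
        return correct / len(rows)


# Toy rows, invented for illustration only.
rows = [
    {"text": "Great game", "label": 1},
    {"text": "Never arrived", "label": 0},
    {"text": "Awesome shirt", "label": 1},
]
tokenized = LowercaseTokenizer().transform(rows)
model = MajorityLabelEstimator().fit(tokenized)
predictions = model.transform(tokenized)
accuracy = AccuracyEvaluator().evaluate(predictions)
```

The point of the design is that a fitted model is just another Transformer, so trained and untrained stages can be chained in the same pipeline.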

Spark’s Text Mining algorithms
• LDA: topic modeling
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer: turns documents into vectors based on word counts
• HashingTF and IDF: weight the words of a document by their importance with respect to the corpus
• And much more
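To make the HashingTF and IDF step concrete, here is a minimal plain-Python sketch of the hashing trick followed by a smoothed inverse document frequency. It is an illustration only: `zlib.crc32` is a stand-in for whatever hash the library uses, the 16-column width is arbitrary, and the documents are toy data.

```python
import math
import zlib


def hashing_tf(tokens, num_features=16):
    """Term-frequency vector using the hashing trick (no vocabulary needed)."""
    vec = [0] * num_features
    for tok in tokens:
        # crc32 is a stand-in hash that is stable across runs.
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[index] += 1
    return vec


def idf_weights(tf_vectors):
    """Smoothed inverse document frequency per column: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    n_cols = len(tf_vectors[0])
    weights = []
    for col in range(n_cols):
        # df = number of documents with a non-zero count in this column.
        df = sum(1 for vec in tf_vectors if vec[col] > 0)
        weights.append(math.log((n_docs + 1) / (df + 1)))
    return weights


# Toy corpus, invented for illustration only.
docs = [["spark", "text", "mining"], ["spark", "pipeline"], ["spark", "zeppelin"]]
tf = [hashing_tf(d) for d in docs]
idf = idf_weights(tf)
tfidf = [[count * w for count, w in zip(vec, idf)] for vec in tf]
```

A word that appears in every document, such as "spark" here, gets an IDF weight of zero, so it contributes nothing to the final vectors. Distinct words can collide in the same column, which is the price of skipping the vocabulary pass.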

MLlib Text Mining Pipeline - classification
Dataset → RegexTokenizer → StopWordsRemover → CountVectorizer / HashingTF → IDF → StringIndexer → NaiveBayes / LogisticRegression / SVM / MLP → text classification

MLlib Text Mining Pipeline - topic model
Dataset → RegexTokenizer → StopWordsRemover → CountVectorizer / HashingTF → IDF → LDA → topic model

MLlib Text Mining Pipeline - recommendation
Dataset → RegexTokenizer → Word2Vec → recommendation

MLlib Text Mining Pipeline
Feature stages: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer
Models: NaiveBayes, LogisticRegression, SVM, MLP, LDA, Word2Vec
Tasks: text classification, topic model, recommendation

Demo
• load the file contents and the categories
• extract feature vectors suitable for machine learning
• train a linear model to perform categorization
• use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
https://github.com/yanboliang/dataworks-munich-2017
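The grid search step in the list above can be sketched in plain Python. This is a minimal illustration, not the demo's code: `toy_score` is a hypothetical stand-in for "train the pipeline with these parameters and return a validation metric", and the parameter names and values are made up. In Spark ML the same role is played by ParamGridBuilder together with CrossValidator.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical metric that peaks at num_features=1000, reg_param=0.1."""
    return (
        1.0
        - abs(params["num_features"] - 1000) / 1000
        - abs(params["reg_param"] - 0.1)
    )


# One feature-extraction parameter and one classifier parameter (both assumed).
grid = {"num_features": [100, 1000, 10000], "reg_param": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
```

Searching feature extraction and classifier settings in one grid matters because the best classifier setting can depend on how the features were built.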

Customizing ML Pipelines
• MLlib 2.1 includes:
  – 30+ feature transformers (Tokenizer, Word2Vec, …)
  – 25+ models (for classification, regression, clustering, …)
  – Model tuning & evaluation
• But some applications require customized:
  – Transformers & Models

Options for customization
• Existing use cases:
  – spark-corenlp
  – spark-vlbfgs
• Extend abstractions
  – Transformer
  – Estimator & Model
  – Evaluator

Spark virtual environment
(Diagram: Data Scientist A and Data Scientist B, with Python 2.7 and Python 3.5 environments)

Text Mining workflow
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Duplicated and error-prone

ML persistence
• Create Pipeline
  – Persist model or Pipeline: model.save("s3n://…")
• Load Pipeline (Java/Scala)
  – Model.load("s3n://…")
• Deploy in production
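The save-then-load handoff can be illustrated with a plain-Python sketch. Everything here is an assumption for illustration: `KeywordModel` is a hypothetical toy model, JSON is a stand-in storage format, and a temporary local file replaces the shared storage path on the slide. It does not reflect Spark's on-disk format.

```python
import json
import os
import tempfile


class KeywordModel:
    """Hypothetical toy model: flags text containing any stored keyword."""

    def __init__(self, keywords):
        self.keywords = sorted(keywords)

    def predict(self, text):
        words = set(text.lower().split())
        return int(any(k in words for k in self.keywords))

    def save(self, path):
        """Write the learned state in a language-neutral format."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"keywords": self.keywords}, f)

    @classmethod
    def load(cls, path):
        """Rebuild the model from saved state, with no retraining."""
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f)["keywords"])


# "Data science" side: fit (here, hard-coded) and persist.
path = os.path.join(tempfile.mkdtemp(), "model.json")
KeywordModel(["refund", "broken"]).save(path)

# "Production" side: load the same artifact and score.
restored = KeywordModel.load(path)
```

Because only the saved state crosses the boundary, the production side never re-implements the training logic, which is what removes the duplicated, error-prone step.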

Data scientists work with software engineers
Data scientists
• Explore data
• Create pipeline
• Find best params
• Save model
Software engineers
• Load model
• Deploy in production
• Scoring on batch/streaming data
