Download free for 30 days
Sign in
Upload
Language (EN)
Support
Business
Mobile
Social Media
Marketing
Technology
Art & Photos
Career
Design
Education
Presentations & Public Speaking
Government & Nonprofit
Healthcare
Internet
Law
Leadership & Management
Automotive
Engineering
Software
Recruiting & HR
Retail
Sales
Services
Science
Small Business & Entrepreneurship
Food
Environment
Economy & Finance
Data & Analytics
Investor Relations
Sports
Spiritual
News & Politics
Travel
Self Improvement
Real Estate
Entertainment & Humor
Health & Medicine
Devices & Hardware
Lifestyle
Change Language
Language
English
Español
Português
Français
Deutsche
Cancel
Save
EN
Uploaded by
MapR Technologies
1,483 views
Document Classification with Apache Spark
Joe Blue presentation
Technology
◦
Read more
0
Save
Share
Embed
Embed presentation
Download
Downloaded 32 times
1
/ 30
2
/ 30
3
/ 30
4
/ 30
5
/ 30
6
/ 30
7
/ 30
8
/ 30
9
/ 30
10
/ 30
11
/ 30
12
/ 30
13
/ 30
14
/ 30
15
/ 30
16
/ 30
17
/ 30
18
/ 30
19
/ 30
20
/ 30
21
/ 30
22
/ 30
23
/ 30
24
/ 30
25
/ 30
26
/ 30
27
/ 30
28
/ 30
29
/ 30
30
/ 30
More Related Content
PDF
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
by
Spark Summit
PDF
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
by
Natalino Busa
PPTX
Converging your data landscape
by
MapR Technologies
PPTX
ML Workshop 2: Machine Learning Model Comparison & Evaluation
by
MapR Technologies
PPTX
Self-Service Data Science for Leveraging ML & AI on All of Your Data
by
MapR Technologies
PPTX
Enabling Real-Time Business with Change Data Capture
by
MapR Technologies
PPTX
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
by
MapR Technologies
PPTX
ML Workshop 1: A New Architecture for Machine Learning Logistics
by
MapR Technologies
Automated Machine Learning Using Spark Mllib to Improve Customer Experience-(...
by
Spark Summit
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
by
Natalino Busa
Converging your data landscape
by
MapR Technologies
ML Workshop 2: Machine Learning Model Comparison & Evaluation
by
MapR Technologies
Self-Service Data Science for Leveraging ML & AI on All of Your Data
by
MapR Technologies
Enabling Real-Time Business with Change Data Capture
by
MapR Technologies
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
by
MapR Technologies
ML Workshop 1: A New Architecture for Machine Learning Logistics
by
MapR Technologies
More from MapR Technologies
PPTX
Machine Learning Success: The Key to Easier Model Management
by
MapR Technologies
PPTX
Data Warehouse Modernization: Accelerating Time-To-Action
by
MapR Technologies
PDF
Live Tutorial – Streaming Real-Time Events Using Apache APIs
by
MapR Technologies
PPTX
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
by
MapR Technologies
PDF
Live Machine Learning Tutorial: Churn Prediction
by
MapR Technologies
PDF
An Introduction to the MapR Converged Data Platform
by
MapR Technologies
PPTX
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
by
MapR Technologies
PPTX
Best Practices for Data Convergence in Healthcare
by
MapR Technologies
PPTX
Geo-Distributed Big Data and Analytics
by
MapR Technologies
PPTX
MapR Product Update - Spring 2017
by
MapR Technologies
PPTX
3 Benefits of Multi-Temperature Data Management for Data Analytics
by
MapR Technologies
PPTX
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
by
MapR Technologies
PPTX
MapR and Cisco Make IT Better
by
MapR Technologies
PPTX
Evolving from RDBMS to NoSQL + SQL
by
MapR Technologies
PPTX
Evolving Beyond the Data Lake: A Story of Wind and Rain
by
MapR Technologies
PDF
Open Source Innovations in the MapR Ecosystem Pack 2.0
by
MapR Technologies
PPTX
How Spark is Enabling the New Wave of Converged Cloud Applications
by
MapR Technologies
PDF
MapR 5.2: Getting More Value from the MapR Converged Data Platform
by
MapR Technologies
PPTX
MapR on Azure: Getting Value from Big Data in the Cloud -
by
MapR Technologies
PDF
Handling the Extremes: Scaling and Streaming in Finance
by
MapR Technologies
Machine Learning Success: The Key to Easier Model Management
by
MapR Technologies
Data Warehouse Modernization: Accelerating Time-To-Action
by
MapR Technologies
Live Tutorial – Streaming Real-Time Events Using Apache APIs
by
MapR Technologies
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
by
MapR Technologies
Live Machine Learning Tutorial: Churn Prediction
by
MapR Technologies
An Introduction to the MapR Converged Data Platform
by
MapR Technologies
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
by
MapR Technologies
Best Practices for Data Convergence in Healthcare
by
MapR Technologies
Geo-Distributed Big Data and Analytics
by
MapR Technologies
MapR Product Update - Spring 2017
by
MapR Technologies
3 Benefits of Multi-Temperature Data Management for Data Analytics
by
MapR Technologies
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
by
MapR Technologies
MapR and Cisco Make IT Better
by
MapR Technologies
Evolving from RDBMS to NoSQL + SQL
by
MapR Technologies
Evolving Beyond the Data Lake: A Story of Wind and Rain
by
MapR Technologies
Open Source Innovations in the MapR Ecosystem Pack 2.0
by
MapR Technologies
How Spark is Enabling the New Wave of Converged Cloud Applications
by
MapR Technologies
MapR 5.2: Getting More Value from the MapR Converged Data Platform
by
MapR Technologies
MapR on Azure: Getting Value from Big Data in the Cloud -
by
MapR Technologies
Handling the Extremes: Scaling and Streaming in Finance
by
MapR Technologies
Recently uploaded
PDF
Towards a Vibrant AI Hardware Accelerator Ecosystem, invited talk at the 4th ...
by
Michiaki Tatsubori
PDF
Rustici Software: eLearning standards in the age of AI
by
Rustici Software
PDF
February 2026 Patch Tuesday hosted by Chris Goettl and Todd Schell
by
Ivanti
PDF
How to Install XRDP on CentOS 10 .pdf
by
Green Webpage
PDF
GTM-and-Sales-Plan for a cyber security product
by
Ashish Jangir
PPTX
Critique Prompting: The Secret to High-Quality AI Outputs
by
Semantic SEO BD
PDF
Constraint Collapse and Fidelity Decay in Scaled Language Models
by
Semantic Fidelity Lab | A. Jacobs
PDF
Digital Twin in IBM for Accelerated Discovery of Climate & Sustainability, K...
by
Michiaki Tatsubori
PDF
AI Vector Search Best Practices Multicloud Feb 2026
by
Sandesh Rao
PDF
shayk.online - Anonymous chat with Sinatra and WebSockets
by
Eleanor McHugh
PDF
TravelTech Paris 2025 | The EU Digital Identity Wallet and it’s impact on Tra...
by
apidays
PDF
The Manifesto for AI-enabled Governance !
by
Yannis Charalabidis
PDF
NTG -Company Profile_Software_Vendor_Company_in_MENA
by
Mustafa Kuğu
PDF
Founder & Tech Lead | Web Development & Digital Growth Consultant | Helping B...
by
Shyamal Das
PDF
Logical Optimal Actions – Towards Knowledge-based Reinforcement Learning with...
by
Michiaki Tatsubori
PPTX
Workflow and decision Automation with Flowable
by
chakib ch
PDF
Antminer S23Hyd-580_Manual-V1.1-oneminers.pdf
by
FranzhLawrenceBataan1
PDF
From Artificial Intelligence to African Intelligence: Transforming Research &...
by
Moses Kemibaro
PPTX
Introducing VisualSim 2610 The Next Leap in System Level Modeling
by
Deepak Shankar
PDF
TrustArc Webinar - From Trends to Action: Fitting AI Governance into Privacy Ops
by
TrustArc
Towards a Vibrant AI Hardware Accelerator Ecosystem, invited talk at the 4th ...
by
Michiaki Tatsubori
Rustici Software: eLearning standards in the age of AI
by
Rustici Software
February 2026 Patch Tuesday hosted by Chris Goettl and Todd Schell
by
Ivanti
How to Install XRDP on CentOS 10 .pdf
by
Green Webpage
GTM-and-Sales-Plan for a cyber security product
by
Ashish Jangir
Critique Prompting: The Secret to High-Quality AI Outputs
by
Semantic SEO BD
Constraint Collapse and Fidelity Decay in Scaled Language Models
by
Semantic Fidelity Lab | A. Jacobs
Digital Twin in IBM for Accelerated Discovery of Climate & Sustainability, K...
by
Michiaki Tatsubori
AI Vector Search Best Practices Multicloud Feb 2026
by
Sandesh Rao
shayk.online - Anonymous chat with Sinatra and WebSockets
by
Eleanor McHugh
TravelTech Paris 2025 | The EU Digital Identity Wallet and it’s impact on Tra...
by
apidays
The Manifesto for AI-enabled Governance !
by
Yannis Charalabidis
NTG -Company Profile_Software_Vendor_Company_in_MENA
by
Mustafa Kuğu
Founder & Tech Lead | Web Development & Digital Growth Consultant | Helping B...
by
Shyamal Das
Logical Optimal Actions – Towards Knowledge-based Reinforcement Learning with...
by
Michiaki Tatsubori
Workflow and decision Automation with Flowable
by
chakib ch
Antminer S23Hyd-580_Manual-V1.1-oneminers.pdf
by
FranzhLawrenceBataan1
From Artificial Intelligence to African Intelligence: Transforming Research &...
by
Moses Kemibaro
Introducing VisualSim 2610 The Next Leap in System Level Modeling
by
Deepak Shankar
TrustArc Webinar - From Trends to Action: Fitting AI Governance into Privacy Ops
by
TrustArc
Document Classification with Apache Spark
1.
® © 2015 MapR
Technologies 1 ® © 2015 MapR Technologies Document Classification with Apache Spark Joseph Blue, @joebluems August 19, 2015
2.
® © 2015 MapR
Technologies 2 Background survey q Data Science q Supervised vs. Unsupervised Scenarios q Classification Algorithms: Naïve Bayes, Linear, Decision Trees, etc. q Model metrics: KS, AuROC, etc. q Boosting, Stacking, Bagging, etc. q TF-IDF Feature Extraction q Apache Spark: RDD, DAG, Scala shell, MLlib q Applying Machine Learning to Business Problems
3.
® © 2015 MapR
Technologies 3 Big Data Solution Workflow EXTRACT OUTLIER REMOVAL FEATURE CREATION MODELING Import data into Hadoop and transform into format appropriate for solution ENSEMBLE Extract from the raw data inputs that the ML algorithms will use for pattern detection Identify and remove / adjust records that negatively affect the ability to achieve good performance Application of the appropriate machine learning algorithms, such as Naïve Bayes, Linear, Tree- Based methods Combining multiple models to increase the performance of the ultimate solution
4.
® © 2015 MapR
Technologies 4 Resources • GITHUB – code & instructions • Kaggle – data science competitions, code, message boards • Spark MLLib https://github.com/joebluems/Mockingbird https://www.kaggle.com/competitions http://spark.apache.org/docs/1.3.0/mllib-guide.html
5.
® © 2015 MapR
Technologies 5© 2014 MapR Technologies ® Part 1: Working with Text
6.
® © 2015 MapR
Technologies 6 Raw “documents” corpus • Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. • Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. • But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.
7.
® © 2015 MapR
Technologies 7 Tokenized* documents • ArrayBuffer(four, score, seven, year, ago, our, father, brought, forth, contin, new, nation, conceiv, liberti, dedic, proposit, all, men, creat, equal) • ArrayBuffer(now, we, engag, great, civil, war, test, whether, nation, ani, nation, so, conceiv, so, dedic, can, long, endur, we, met, great, battl, field, war, we, have, come, dedic, portion, field, final, rest, place, those, who, here, gave, live, nation, might, live, altogeth, fit, proper, we, should, do) • ArrayBuffer(larger, sens, we, can, dedic, we, can, consecr, we, can, hallow, ground, brave,men, live, dead, who,struggl, here, have, consecr, far, abov, our, poor, power, add, detract, world, littl, note, nor, long, rememb, what, we, sai, here, can, never, forget, what, did, here, us, live, rather, dedic, here, unfinish, work, which, who, fought, here, have, thu, far, so, nobli, advanc, rather, us, here, dedic, great, task, remain, befor, us, from, honor, dead, we, take, increas, devot, caus, which, gave, last, full, measur, devot, we, here, highli, resolv, dead, shall, have, di, vain, nation, under, god, shall, have, new, birth, freedom, govern, peopl, peopl, peopl, shall, perish, from, earth) *Using Apache Lucene’s Standard English Analyzer
8.
® © 2015 MapR
Technologies 8 TF* Vectors – Total Frequency (i.e. word counts) • (1000, [17,63,94,197,234,335,412,437,445,521,530,556,588,673,799,893 ,937,960,990], [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0, 1.0,1.0,1.0]) • (1000, [17,21,22,37,63,92,167,211,240,256,270,272,393,395,445,449, 460,472,480,498,535,612,676,688,694,706,724,732,790,909,916,939,960, 965,996], [1.0,2.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0, 1.0,1.0,2.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0]) • (1000,[21,63,92,131,143,147,196,205,208,240,250,256,265,268,296,312,326,340, 341,367,378,391,399,400,412,417,441,445,449,455,464,483,494,503,515,524,526, 539,551,575,602,612,627,641,645,656,676,694,721,742,757,767,780,786,790,802, 807,817,818,844,852,920,946,951,960,983,985,990], [1.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,4.0,1.0,4.0,1.0,4.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,1.0,1.0, 2.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0, 1.0,2.0,1.0,4.0,1.0,1.0,1.0,2.0,6.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,8.0,1.0,1.0,1.0]) *Size of Hash = 1,000 (any token will be hashed to an integer 0-999)
9.
® © 2015 MapR
Technologies 9 TF-IDF* Vectors – Word weights • (1000,[17,63,94,197,234,335,412,437,445,521,530,556,588,673,799,893,937,960,990], [0.287,0.0,0.0,0.0,0.0,0.0,0.287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.287]) • (1000,[17,21,22,37,63,92,167,211,240,256,270,272,393,395,445,449,460,472,480, 498,535,612,676,688,694,706,724,732,790,909,916,939,960,965,996], [0.287,0.575,0.0,0. 0,0.0,0.575,0.0,0.0,0.287,0.287,0.0,0.0,0.0,0.0,0.0,0.287,0.0,0.0,0.0,0.0,0.0,0.287,0.575, 0.0,0.287,0.0,0.0,0.0,1.15,0.0,0.0,0.0,0.0,0.0,0.0]) • (1000,[21,63,92,131,143,147,196,205,208,240,250,256,265,268,296,312,326,340, 341,367,378,391,399,400,412,417,441,445,449,455,464,483,494,503,515,524,526, 539,551,575,602,612,627,641,645,656,676,694,721,742,757,767,780,786,790,802, 807,817,818,844,852,920,946,951,960,983,985,990], [0.287,0.0,0.575,0.0,0.0,0.0,0.0,0.0,0.0,1.150,0.0,1.150,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.287,0.0,0.0,0.0,0.287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.287,0.0,0.0,0.0,0.0,0.287,0.575,0.0,0.0,0.0,0.0,0.0,0.0,1.726,0.0,0.0,0.0,0.0,0.0,0.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.287]) *Minimum Document Frequency = 2 (all other tokens have TF-IDF = 0)
10.
® © 2015 MapR
Technologies 10 Transforming Text into Numeric Features 1. Use Lucene Analyzer to tokenize documents 2. Hash the tokens into sparse vectors with TF 1. Control the vector size 2. Smaller vectors require less memory; Larger vectors have fewer collisions 3. Note: hashing is one-way (cannot convert back to tokens) 3. Build an IDF dictionary from the training vectors 1. Can limit size by including minimum document frequency 2. √(TFw)*ln[(# docs +1) /(doc freqw+1)] 4. Transform the TF vectors into TF-IDF Vectors
11.
® © 2015 MapR
Technologies 11 Spark Shell commands to transform text import statements … object Stemmer { …} val getty = sc.textFile(”gettys.txt") val stemmed = getty.map{x=> Stemmer.tokenize(x)} getty.collect().foreach(println) stemmed.collect().foreach(println) val tf = new HashingTF(1000) //size impacts memory and collisions val tfdocs = stemmed tfdocs.collect().foreach(println) val idfModel = new IDF(minDocFreq = 2).fit(tfdocs) val idfDocs = idfModel.transform(tfdocs) idfDocs.collect().foreach(println)
12.
® © 2015 MapR
Technologies 12© 2014 MapR Technologies ® STEP 2: Model Building
13.
® © 2015 MapR
Technologies 13 when,he,nearli,thirteen,brother,jem,got,hi,arm,badli broken,elbow,when,heal,jem,fear,never,be,abl,plai footbal,were,assuag,he,seldom,self,consciou,about,hi,injuri hi,left,arm,somewhat,shorter,than,hi,right,when,he stood,walk,back,hi,hand,right,angl,hi,bodi,hi thumb,parallel,hi,thigh,he,couldn’t,have,care,less,so long,he,could,pass,punt,when,enough,year,had,gone enabl,us,look,back,them,sometim,discuss,event,lead,hi accid,maintain,ewel,start,all,jem,who,four,year,senior said,start,long,befor,he,said,began,summer,dill,came us,when,dill,first,gave,us,idea,make,boo,radlei come,out,said,he,want,take,broad,view,thing,realli began,andrew,jackson,gener,jackson,hadn’t,run,creek,up,creek simon,finch,would,never,have,paddl,up,alabama,where,would he,hadn’t,were,far,too,old,settl,argument,fist,fight so,consult,atticu,father,said,were,both,right,be,southern sourc,shame,some,member,famili,had,record,ancestor,either, battl,hast,all,had,simon,finch,fur,trap,apothecari,from cornwal,whose,pieti,exceed,onli,hi,stingi,england,simon,irrit persecut,those,who,call,themselv,methodist,hand,more,liber, simon,call,himself,methodist,he,work,hi,wai,across,atlant philadelphia,thenc,jamaica,thenc,mobil,up,saint,stephen,mind, weslei,strictur,us,mani,word,bui,sell,simon,made,pile practic,medicin,pursuit,he,unhappi,lest,he,tempt,do,what he,knew,glori,god,put,gold,costli,apparel,so,simon *Note: removed first person terms such as I,our, my, we, etc. A classification problem… sinc,atlanta,she,had,look,out,dine,car,window,delight almost,physic,over,her,breakfast,coffe,she,watch,last,georgia hill,reced,red,earth,appear,tin,roof,hous,set,middl swept,yard,yard,inevit,verbena,grew,surround,whitewash,tire, grin,when,she,saw,her,first,tv,antenna,atop,unpaint negro,hous,multipli,her,joi,rose,jean,louis,finch,alwai made,journei,air,she,decid,go,train,from,new,york maycomb,junction,her,fifth,annual,trip,home,on,thing,she had,life,scare,out,her,last,time,she,plane,pilot elect,fly,through,tornado,anoth,thing,fly,home,meant,her father,rise,three,morn,drive,hundr,mile,meet,her,mobil do,full,dai,work,afterward,he,seventi,two,now,longer fair,she,glad,she,had,decid,go,train,train,had chang,sinc,her,childhood,novelti,experi,amus,her,fat,geni porter,materi,when,she,press,button,wall,her,bid,stainless steel,washbasin,pop,out,anoth,wall,john,on,could,prop on,feet,she,resolv,intimid,sever,messag,stencil,around,her compart,roomett,call,when,she,went,bed,night,befor,she succeed,fold,herself,up,wall,becaus,she,had,ignor,injunct pull,lever,down,over,bracket,situat,remedi,porter,her,embarrass her,habit,sleep,onli,pajama,top,luckili,he,happen,patrol corridor,when,trap,snap,shut,her,i’ll,get,you,out miss,he,call,answer,her,pound,from,within,pleas,she said,just,tell,me,how,get,out,can,do,back turn,he,said,did,when,she,awok,morn,train,switch chug,atlanta,yard,obedi,anoth,sign,her,compart,she,stai
14.
® © 2015 MapR
Technologies 14 Create the training and evaluation sets val mock = sc.textFile(”mock.tokens") val watch = sc.textFile(”watch.tokens”) /// convert data to numeric features with TF val tf = new HashingTF(10000) val mockData = mock.map { line => var target = "1" LabeledPoint(target.toDouble, tf.transform(line.split(","))) } val watchData = watch.map { line => var target = "0" LabeledPoint(target.toDouble, tf.transform(line.split(","))) } /// build IDF model and transform data into modeling sets val data = mockData.union(watchData) val splits = data.randomSplit(Array(0.7, 0.3)) // prepare train and test sets val trainDocs = splits(0).map{ x=>x.features} val idfModel = new IDF(minDocFreq = 3).fit(trainDocs) // build on training data only val train = splits(0).map{ point=> LabeledPoint(point.label,idfModel.transform(point.features)) } val test = splits(1).map{ point=> LabeledPoint(point.label,idfModel.transform(point.features)) }
15.
® © 2015 MapR
Technologies 15 Naïve Bayes Algorithm Classification: p(Cj |T1,T2,...) = 1 Z * p(Cj )* p(Ti |Cj ) i=1 n ∏ prior * likelihoodsignore Tokens P(Ti|C1) P(Ti|C2) … P(Ti|Ck) T1 0.004 0.001 0.010 T2 0.002 0.008 0.021 … … Tn 0.014 0.002 0.003 Likelihood calculations: c l a s s e s (aka d a t a s o u r c e s)
16.
® © 2015 MapR
Technologies 16 Run the Naïve Bayes Algorithm in Spark Shell import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel} val nbmodel = NaiveBayes.train(train, lambda = 1.0) val bayesTrain = train.map(p => (nbmodel.predict(p.features), p.label)) val bayesTest = test.map(p => (nbmodel.predict(p.features), p.label)) println("Training accuracy", bayesTrain.filter(x => x._1 == x._2).count() / bayesTrain.count().toDouble) println("Test accuracy ", bayesTest.filter(x => x._1 == x._2).count() / bayesTest.count().toDouble) // print confusion matrix println("Predict:mock,label:mock -> ",bayesTest.filter(x => x._1 == 1.0 & x._2==1.0).count()) println("Predict:watch,label:watch -> ",bayesTest.filter(x => x._1 == 0.0 & x._2==0.0).count()) println("Predict:mock,label:watch -> ",bayesTest.filter(x => x._1 == 1.0 & x._2==0.0).count()) println("Predict:watch,label:mock -> ",bayesTest.filter(x => x._1 == 0.0 & x._2==1.0).count())
17.
® © 2015 MapR
Technologies 17 -0.1 0.1 0.3 0.5 0.7 0.9 1.1 0 0.2 0.4 0.6 0.8 1 1.2 IdxofHeight/Age Idx of Weight/Education No Diabetes Diabetes Decision Tree for Classification X1<0.7 X1>0.7 X2<0.5 X2>0.5 X2<0.8 X2>0.8 X1<0.8 X1>0.8
18.
® © 2015 MapR
Technologies 18 Random Forest • Aggregate estimates from many independent trees • Key parameters – number of trees – maximum tree depth • Notes – Randomness in training and feature subsets – more trees decreases over- fitting – trees are built in parallel – depth is generally larger than GBT 0.24 0.56 0.54 0.66 0.37 0.54 0.82 0.25 0.34 0.29 0.04 0.18 estimate = avg(Ti) = 0.40
19.
® © 2015 MapR
Technologies 19 Train a Random Forest Model in Spark import org.apache.spark.mllib.tree.RandomForest import org.apache.spark.mllib.tree.model.RandomForestModel val categoricalFeaturesInfo = Map[Int, Int]() val numClasses = 2 val featureSubsetStrategy = "auto”. val impurity = "variance” // tells Spark we want regression, not classification val maxDepth = 10 val maxBins = 32 val numTrees = 50 val modelRF = RandomForest.trainRegressor(train, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins) val trainScores = train.map { point => val prediction = modelRF.predict(point.features) (prediction, point.label) } val testScores = test.map { point => val prediction = modelRF.predict(point.features) (prediction, point.label) }
20.
® © 2015 MapR
Technologies 20 Gradient Boosted Trees • Successive trees attempt to minimize error by focusing on large residuals • Key parameters – number of trees – maximum tree depth • Notes – more trees increases over- fitting – trees are not built in parallel – depth is generally smaller than Random Forest + + + + + + + + … 0.24 +0.02=0.26 -0.07=0.19 +0.10=0.29 +0.11=0.40 +0.02=0.42 -0.06=0.36 +0.04=0.40
21.
® © 2015 MapR
Technologies 21 Train a GBTree Model in Spark import org.apache.spark.mllib.tree.GradientBoostedTrees import org.apache.spark.mllib.tree.configuration.BoostingStrategy import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel val boostingStrategy = BoostingStrategy.defaultParams("Regression") // squared-error loss boostingStrategy.numIterations = 50 boostingStrategy.treeStrategy.maxDepth = 5 boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]() val modelGB = GradientBoostedTrees.train(train, boostingStrategy) val trainScores = train.map { point => val prediction = modelGB.predict(point.features) (prediction, point.label) } val testScores = test.map { point => val prediction = modelGB.predict(point.features) (prediction, point.label) }
22.
® © 2015 MapR
Technologies 22 Looking at the ROC 0% 20% 40% 60% 80% 100% 0% 20% 40% 60% 80% 100% MockingbirdLinesAboveThreshold Watchmen Lines Above Threshold ROC for Test Data gradient boosted trees baseline random forest AuROCRF= 0.884 AuROCGB= 0.867 KSRF = 0.62 KSGB = 0.60 1. Order results by descending score (i.e. threshold), optionally put into ordered bins 2. Calculate cumulative percent of targets 3. Calculate cumulative percent of non-targets
23.
® © 2015 MapR
Technologies 23© 2014 MapR Technologies ® STEP 3: Value from Operational Constraints
24.
® © 2015 MapR
Technologies 24 From ROC to Value • Start with highest scores • Assign $ values – Target identification – Operational cost • Display profit @ each threshold 0 2 4 6 8 0% 20% 40% 60% 80% 100% WeeklyProfit(inthousands) Portion of Total Population Examined Profit Curve Based on Random Forest Score net profit at threshold Example: Identify target: $950 Operational cost: $800
25.
® © 2015 MapR
Technologies 25 Find my presentation and other related resources here: http://events.mapr.com/AtlantaHUG155 (you can find this link in the event’s page at meetup.com) Today’s Presentation Whiteboard & demo videos Free On-Demand Training Free eBooks Free Hadoop Sandbox And more…
26.
® © 2015 MapR
Technologies 26 Q&A @mapr maprtech sales@mapr.com Engage with us! MapR maprtech mapr-technologies
27.
® © 2015 MapR
Technologies 27© 2014 MapR Technologies ® Appendix
28.
® © 2015 MapR
Technologies 28 Glossary • RDD (Resilient Distributed Dataset) – in Apache Spark, an immutable, partitioned collection of elements that can be operated on in parallel • Supervised Modeling – family of modeling algorithms which use labeled data and thus have an error to minimize. Contrast with unsupervised methods • Overtraining – Extra performance on the training set that is due to memorization of the training data rather than learning the true patterns • ROC – Receiver Operating Characteristic. Measure performance by plotting cumulative false positives vs cumulative true positives over score bins • KS – Kolmogorov-Smirnov coefficient = maximum separation of the ROC with baseline, usually reported * 100.
29.
® © 2015 MapR
Technologies 29 “Stemmer” (Tokenizer) object import org.apache.lucene.analysis.en.EnglishAnalyzer import org.apache.lucene.analysis.tokenattributes.CharTermAttribute import scala.collection.mutable.ArrayBuffer object Stemmer { def tokenize(content:String):Seq[String]={ val analyzer=new EnglishAnalyzer() val tokenStream=analyzer.tokenStream("contents", content) val term=tokenStream.addAttribute(classOf[CharTermAttribute]) tokenStream.reset() var result = ArrayBuffer.empty[String] while(tokenStream.incrementToken()) { val termValue = term.toString if (!(termValue matches ".*[d.].*")) { result += term.toString } } tokenStream.end() tokenStream.close() result } } Launch the shell with this command: ./bin/spark-shell --packages "org.apache.lucene:lucene- analyzers-common:5.1.0” This code originally appeared here: https://chimpler.wordpress.com/2014/06/11/classifiying- documents-using-naive-bayes-on-apache-spark-mllib/
30.
® © 2015 MapR
Technologies 30 Code to produce and write an ROC //// create RDD’s of predictions and labels val trainScores = train.map { point => val prediction = modelGB.predict(point.features) (prediction, point.label) } val testScores = test.map { point => val prediction = modelGB.predict(point.features) (prediction, point.label) } //// generate ROC’s and write to file – will produce error if destination exists val metricsTrain = new BinaryClassificationMetrics(trainScores,100) val metricsTest = new BinaryClassificationMetrics(testScores,100) val trainroc= metricsTrain.roc() val testroc= metricsTest.roc() trainroc.saveAsTextFile(”./ROC/gbtrain") testroc.saveAsTextFile(”./ROC/gbtest")
Download