Mahout in Action
         Part 2


    Yasmine M. Gaber
       4 April 2013
Agenda

    Part 2: Clustering

    Part 3: Classification
Clustering

    An algorithm


    A notion of both similarity and dissimilarity


    A stopping condition
Measuring the similarity of items

    Euclidean Distance
Creating the input

    Preprocess the data

    Use that data to create vectors

    Save the vectors in SequenceFile format as input for the
    algorithm
Using Mahout clustering

    The SequenceFile containing the input
    vectors.

    The SequenceFile containing the initial cluster
    centers.

    The similarity measure to be used.

    The convergenceThreshold.

    The number of iterations to be done.

    The Vector implementation used in the input
    files.
Distance measures

    Euclidean distance measure




    Squared Euclidean distance measure


    Manhattan distance measure
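
    The three measures listed on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not Mahout's Java DistanceMeasure classes; the function names and the sample points are assumptions made for the example.

```python
import math

def squared_euclidean(a, b):
    """Sum of squared per-dimension differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance: square root of the squared Euclidean value."""
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical sample points
p, q = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
```

    The squared variant skips the square root, so it exaggerates large distances relative to small ones while preserving their order.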
Distance measures

    Cosine distance measure




    Tanimoto distance measure
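
    A hedged sketch of the two measures on this slide, using their common textbook definitions: cosine distance as one minus the cosine of the angle between the vectors, and Tanimoto distance as one minus the dot product divided by the sum of squared lengths less the dot product. The function names and sample vectors are assumptions, and both functions assume non-zero vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores vector length."""
    return 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def tanimoto_distance(a, b):
    """1 - (a.b) / (|a|^2 + |b|^2 - a.b); sensitive to angle and length."""
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

# Hypothetical sample vectors sharing one of their two non-zero dimensions
a, b = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
print(cosine_distance(a, b))    # 0.5
print(tanimoto_distance(a, b))  # about 0.667
```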
Playing Around
Representing data
Representing text documents as vectors

    Vector Space Model (VSM)

    TF-IDF




    N-gram collocations
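
    To make the TF-IDF idea concrete, here is a small sketch using the textbook weighting tf * log(N / df). This is an assumption for illustration: Mahout's seq2sparse may use a different variant of the formula, and the tiny corpus below is invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Hypothetical three-document corpus
docs = [
    "the bank raised the rate".split(),
    "the dollar fell".split(),
    "the bank said the dollar rose".split(),
]
vectors = tfidf(docs)
print(vectors[0]["the"])   # 0.0, because "the" occurs in every document
print(vectors[0]["rate"])  # log(3), because "rate" occurs in one document only
```

    Terms that appear everywhere get a weight of zero, which is how TF-IDF damps common words without an explicit stop-word list.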
Generating vectors from documents

    $ bin/mahout seqdirectory -c UTF-8 -i
    examples/reuters-extracted/ -o reuters-seqfiles


    $ bin/mahout seq2sparse -i reuters-seqfiles/ -o
    reuters-vectors -ow
Improving quality of vectors using normalization

    P-norm




    $ bin/mahout seq2sparse -i reuters-seqfiles/
    -o reuters-normalized-bigram -ow
    -a org.apache.lucene.analysis.WhitespaceAnalyzer
    -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2
    -ml 50 -seq -n 2
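
    P-norm normalization divides every component of a vector by the vector's p-norm, so that long and short documents become comparable. The sketch below is a plain-Python illustration with assumed function names; the -n 2 option in the command above appears to select p = 2.

```python
def p_norm(v, p):
    """(sum of |x|^p) ^ (1/p)"""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def normalize(v, p=2):
    """Scale v so that its p-norm is 1; a zero vector is returned unchanged."""
    norm = p_norm(v, p)
    return [x / norm for x in v] if norm else list(v)

# Hypothetical sample vector
v = [3.0, 4.0]
print(normalize(v, 2))  # [0.6, 0.8]
print(normalize(v, 1))  # components sum to 1
```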
Clustering Categories

    Exclusive clustering

    Overlapping clustering

    Hierarchical clustering

    Probabilistic clustering
Clustering Approaches


    Fixed number of centers


    Bottom-up approach


    Top-down approach
Clustering algorithms

    K-means clustering


    Fuzzy k-means clustering


    Dirichlet clustering
k-means clustering algorithm
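
    As a rough sketch of the algorithm this slide names: k-means alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points, stopping when the centers barely move or an iteration limit is reached. The code is an in-memory illustration, not Mahout's MapReduce implementation; the parameter names max_iter and threshold are assumptions that loosely mirror an iteration limit and a convergence threshold.

```python
import math

def kmeans(points, centers, max_iter=20, threshold=0.01):
    centers = [list(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        new_centers = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
        # Stop once no center moved farther than the threshold
        shift = max(math.dist(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:
            break
    return centers, clusters

# Hypothetical data: two well-separated groups and two initial centers
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
centers, clusters = kmeans(points, centers=[[1, 1], [8, 8]])
print(centers)
```

    The returned clusters are the assignments made in the last iteration that ran. The result depends on the initial centers, which is why the input to clustering includes a set of initial cluster centers.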
Running k-means clustering

    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/
    -c reuters-initial-clusters -o reuters-kmeans-clusters
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
    -cd 1.0 -k 20 -x 20 -cl

    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/
    -c reuters-initial-clusters -o reuters-kmeans-clusters
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure
    -cd 0.1 -k 20 -x 20 -cl

    $ bin/mahout clusterdump -dt sequencefile -d
Fuzzy k-means clustering

    Instead of the exclusive clustering in k-means,
    fuzzy k-means tries to generate overlapping
    clusters from the data set.


    Also known as fuzzy c-means algorithm.
Running fuzzy k-means clustering

    $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/
    -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters
    -cd 1.0 -k 21 -m 2 -ow -x 10
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

    Fuzziness factor
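
    The fuzziness factor controls how much clusters overlap. A hedged sketch using the standard fuzzy c-means membership formula, u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), where d_i is the distance from a point to center i and m > 1 is the fuzziness factor. The function name and the sample distances are assumptions, and a point lying exactly on a center (distance 0) would need separate handling.

```python
def memberships(distances, m=2.0):
    """distances: distance from one point to each cluster center (all > 0).
    Returns the point's degree of membership in each cluster; they sum to 1."""
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

# Hypothetical point that is 1.0 away from one center and 3.0 from another
print(memberships([1.0, 3.0], m=2.0))  # about [0.9, 0.1]
print(memberships([1.0, 3.0], m=3.0))  # about [0.75, 0.25], i.e. fuzzier
```

    As m approaches 1 the memberships approach the hard assignments of k-means; larger values spread each point more evenly across clusters.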
Dirichlet clustering

    model-based clustering algorithm
Running Dirichlet clustering

    $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors
    -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0
    -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution
    -mp org.apache.mahout.math.SequentialAccessSparseVector
Evaluating and improving clustering quality

    Inspecting clustering output

    Evaluating the quality of clustering

    Improving clustering quality
Inspecting clustering output

    $ bin/mahout clusterdump -s kmeans-output/clusters-19/
    -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10


    Top Terms:
           said   => 11.60126582278481
           bank   => 5.943037974683544
           dollar =>
Analyzing clustering output

    Distance measure and feature selection

    Inter-cluster and intra-cluster distances

    Mixed and overlapping clusters
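
    One way to put numbers on inter-cluster and intra-cluster distances is sketched below: the average pairwise distance between members of a cluster (smaller means tighter) and the average pairwise distance between cluster centroids (larger means better separated). These particular averages, the function names, and the sample data are assumptions; other definitions are also in use. Each cluster needs at least two points and there must be at least two clusters.

```python
import math
from itertools import combinations

def centroid(points):
    return [sum(dim) / len(points) for dim in zip(*points)]

def intra_cluster_distance(points):
    """Average pairwise distance between the members of one cluster."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def inter_cluster_distance(clusters):
    """Average pairwise distance between the centroids of all clusters."""
    centers = [centroid(c) for c in clusters]
    pairs = list(combinations(centers, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clustering result: two tight, well-separated clusters
clusters = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[10.0, 0.0], [10.0, 2.0]],
]
print([intra_cluster_distance(c) for c in clusters])  # [2.0, 2.0]
print(inter_cluster_distance(clusters))               # 10.0
```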
Improving clustering quality

    Improving document vector generation

    Writing a custom distance measure
Real-world applications of clustering

    Clustering like-minded people on Twitter


    Suggesting tags for an artist on Last.fm using
    clustering


    Creating a related-posts feature for a website
Classification

    Classification is a process of using specific
    information (input) to choose a single selection
    (output) from a short list of predetermined
    potential responses.

    Applications of classification, e.g. spam
    filtering
Why use Mahout for classification?
How classification works
Classification

    Training versus test versus production

    Predictor variables versus target variable

    Records, fields, and values
Types of values for predictor variables

    Continuous

    Categorical

    Word-like

    Text-like
Classification workflow

    Training the model


    Evaluating the model


    Using the model in production
Stage 1: training the classification model
Stage 2: evaluating the classification model
Stage 3: using the model in production
Stage 1: training the classification model

    Define Categories for the Target Variable

    Collect Historical Data

    Define Predictor Variables

    Select a Learning Algorithm to Train the Model

    Use Learning Algorithm to Train the Model
Extracting features to build a Mahout classifier
Preprocessing raw data into classifiable data
Converting classifiable data into vectors

    Use one Vector cell per word, category, or
    continuous value

    Represent Vectors implicitly as bags of words

    Use feature hashing
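
    Feature hashing maps each feature name straight to a vector index with a hash function, so no dictionary of words has to be built or stored. The sketch below is a generic illustration, not Mahout's encoder classes; the function name, the use of MD5, and the vector size are assumptions. Different words can collide on the same cell, which is the price paid for a fixed-size vector.

```python
import hashlib

def hashed_vector(tokens, size=16):
    """Count tokens into a fixed-size vector, choosing each cell by hashing."""
    vec = [0.0] * size
    for tok in tokens:
        # A stable hash (unlike Python's salted built-in hash) keeps the
        # token-to-index mapping identical across runs and machines.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vec[index] += 1.0
    return vec

# Hypothetical message tokens
v = hashed_vector("free money free offer".split(), size=8)
print(v)
```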
Classifying the 20 newsgroups data set
Choosing an algorithm
The classifier evaluation API

    Percent correct

    Confusion matrix

    Entropy matrix

    AUC

    Log likelihood
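
    Several of these metrics are simple to compute by hand, which helps when reading an evaluator's output. The sketch below covers percent correct, a confusion matrix, average log likelihood, and AUC in plain Python; it does not use Mahout's evaluation classes, it omits the entropy matrix, and the labels, scores, and function names are invented for the example.

```python
import math
from collections import Counter

def percent_correct(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

def confusion_matrix(actual, predicted):
    """Counts keyed by (actual label, predicted label)."""
    return Counter(zip(actual, predicted))

def average_log_likelihood(actual, probabilities):
    """probabilities[i] maps each label to the model's probability for row i.
    Values closer to 0 are better; a perfect model scores 0."""
    return sum(math.log(p[a]) for a, p in zip(actual, probabilities)) / len(actual)

def auc(actual, scores, positive):
    """Chance that a random positive row outscores a random negative row."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test set: true labels, predictions, and "spam" scores
actual = ["spam", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham"]
scores = [0.9, 0.4, 0.3, 0.6]
probabilities = [{"spam": s, "ham": 1.0 - s} for s in scores]

print(percent_correct(actual, predicted))        # 75.0
print(confusion_matrix(actual, predicted))
print(average_log_likelihood(actual, probabilities))
print(auc(actual, scores, positive="spam"))      # 0.75
```

    Percent correct and the confusion matrix only look at the chosen label, while log likelihood and AUC also reward a model for being confident when it is right.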
When classifiers go bad

    Target leaks

    Broken feature extraction
Tuning the problem

    Remove Fluff Variables

    Add New Variables, Interactions, and Derived
    Values
Tuning the classifier

    Try Alternative Algorithms

    Tune the Learning Algorithm
Thank You



               Contact at:
Email: Yasmine.Gaber@espace.com.eg
Twitter: Twitter.com/yasmine_mohamed

More Related Content

What's hot

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - RecommendationCataldo Musto
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Mahout classification presentation
Mahout classification presentationMahout classification presentation
Mahout classification presentationNaoki Nakatani
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用James Chen
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to MahoutUri Lavi
 
Apache Mahout
Apache MahoutApache Mahout
Apache MahoutAjit Koti
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahouttanuvir
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache MahoutAman Adhikari
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 

What's hot (20)

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Mahout classification presentation
Mahout classification presentationMahout classification presentation
Mahout classification presentation
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Apache mahout
Apache mahoutApache mahout
Apache mahout
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
 
Mahout
MahoutMahout
Mahout
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 

Viewers also liked

Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Modelspetitegeek
 
Understanding Mahout classification documentation
Understanding Mahout  classification documentationUnderstanding Mahout  classification documentation
Understanding Mahout classification documentationNaveen Kumar
 
Expectation Maximization | Statistics
Expectation Maximization | StatisticsExpectation Maximization | Statistics
Expectation Maximization | StatisticsTransweb Global Inc
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modeljins0618
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Clustering, k means algorithm
Clustering, k means algorithmClustering, k means algorithm
Clustering, k means algorithmJunyoung Park
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsVarad Meru
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)Jee Vang, Ph.D.
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 

Viewers also liked (14)

Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
 
Understanding Mahout classification documentation
Understanding Mahout  classification documentationUnderstanding Mahout  classification documentation
Understanding Mahout classification documentation
 
Expectation Maximization | Statistics
Expectation Maximization | StatisticsExpectation Maximization | Statistics
Expectation Maximization | Statistics
 
Clustering
ClusteringClustering
Clustering
 
Clustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture modelClustering:k-means, expect-maximization and gaussian mixture model
Clustering:k-means, expect-maximization and gaussian mixture model
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Clustering, k means algorithm
Clustering, k means algorithmClustering, k means algorithm
Clustering, k means algorithm
 
K-Means, its Variants and its Applications
K-Means, its Variants and its ApplicationsK-Means, its Variants and its Applications
K-Means, its Variants and its Applications
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 

Similar to Mahout part2

PPT file
PPT filePPT file
PPT filebutest
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Barga Data Science lecture 8
Barga Data Science lecture 8Barga Data Science lecture 8
Barga Data Science lecture 8Roger Barga
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataDatamining Tools
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataDataminingTools Inc
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Yao Yao
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-cardYanchang Zhao
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner:  Data Mining And Rapid MinerRapidMiner:  Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidmining Content
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerDataminingTools Inc
 
Optimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature setOptimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersUniversity of Huddersfield
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmIRJET Journal
 

Similar to Mahout part2 (20)

PPT file
PPT filePPT file
PPT file
 
R refcard-data-mining
R refcard-data-miningR refcard-data-mining
R refcard-data-mining
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Barga Data Science lecture 8
Barga Data Science lecture 8Barga Data Science lecture 8
Barga Data Science lecture 8
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...Lab 2: Classification and Regression Prediction Models, training and testing ...
Lab 2: Classification and Regression Prediction Models, training and testing ...
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-card
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner:  Data Mining And Rapid MinerRapidMiner:  Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
Optimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature setOptimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Two methods for optimising cognitive model parameters
Two methods for optimising cognitive model parametersTwo methods for optimising cognitive model parameters
Two methods for optimising cognitive model parameters
 
BioWeka
BioWekaBioWeka
BioWeka
 
My8clst
My8clstMy8clst
My8clst
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means Algorithm
 

More from Yasmine Gaber (8)

Capistrano
CapistranoCapistrano
Capistrano
 
Ionic
IonicIonic
Ionic
 
Dyna trace
Dyna traceDyna trace
Dyna trace
 
Mahout part1
Mahout part1Mahout part1
Mahout part1
 
Ibn Sina
Ibn SinaIbn Sina
Ibn Sina
 
Home Bowling
Home BowlingHome Bowling
Home Bowling
 
Oauth2.0
Oauth2.0Oauth2.0
Oauth2.0
 
Why_do i_hate_shopping
Why_do i_hate_shoppingWhy_do i_hate_shopping
Why_do i_hate_shopping
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Mahout part2

  • 1. Mahout in Action, Part 2. Yasmine M. Gaber, 4 April 2013
  • 2. Agenda
      - Part 2: Clustering
      - Part 3: Classification
  • 3. Clustering
      - An algorithm
      - A notion of both similarity and dissimilarity
      - A stopping condition
  • 4. Measuring the similarity of items
      - Euclidean distance
  • 5. Creating the input
      - Preprocess the data
      - Use that data to create vectors
      - Save the vectors in SequenceFile format as input for the algorithm
  • 6. Using Mahout clustering
      - The SequenceFile containing the input vectors
      - The SequenceFile containing the initial cluster centers
      - The similarity measure to be used
      - The convergenceThreshold
      - The number of iterations to be done
      - The Vector implementation used in the input files
  • 8. Distance measures
      - Euclidean distance measure
      - Squared Euclidean distance measure
      - Manhattan distance measure
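
The three measures on this slide differ only in how they combine the per-dimension differences between two vectors. A minimal Python sketch follows; plain lists stand in for Mahout's Vector classes, and the function names are illustrative rather than Mahout's API.

```python
import math

def squared_euclidean(a, b):
    # Sum of squared per-dimension differences (no square root).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance: square root of the squared Euclidean distance.
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # Sum of absolute per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = [1.0, 2.0], [4.0, 6.0]
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q))
```

Squared Euclidean distance ranks neighbours in the same order as Euclidean distance but penalises large gaps more heavily, which is one reason thresholds tuned for one measure do not carry over to the other.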
  • 9. Distance measures
      - Cosine distance measure
      - Tanimoto distance measure
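
Both measures on this slide are built from dot products. The sketch below assumes the usual definitions (cosine distance as one minus cosine similarity, Tanimoto distance as one minus the extended Jaccard coefficient) and non-zero input vectors; the function names are illustrative, not Mahout's API.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle); depends only on direction, not on vector length.
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def tanimoto_distance(a, b):
    # 1 - (a.b) / (|a|^2 + |b|^2 - a.b); sensitive to both angle and length.
    ab = dot(a, b)
    return 1.0 - ab / (dot(a, a) + dot(b, b) - ab)

short, long_ = [1.0, 1.0], [2.0, 2.0]
print(cosine_distance(short, long_), tanimoto_distance(short, long_))
```

The two example vectors point in the same direction, so their cosine distance is effectively zero, while the Tanimoto distance stays positive because their lengths differ.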
  • 12. Representing text documents as vectors
      - Vector Space Model (VSM)
      - TF-IDF
      - N-gram collocations
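
To make the TF-IDF idea concrete, here is a small Python sketch using one common weighting, raw term frequency times log(N / document frequency). This is an assumption for illustration; the exact weighting Mahout applies may differ, and the sample documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["bank", "rate", "bank"], ["bank", "dollar"], ["oil", "price"]]
weights = tfidf(docs)
print(weights[0])
```

A term that occurs in every document gets weight zero under this formula, which is how TF-IDF suppresses words that carry no discriminating information.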
  • 13. Generating vectors from documents
      - $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
      - $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
  • 14. Improving quality of vectors using normalization
      - P-norm
      - $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
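
P-norm normalization divides every component of a vector by the vector's p-norm, so long and short documents become comparable. The sketch below is a plain-Python illustration with hypothetical function names; the `-n 2` option in the command above appears to select the 2-norm.

```python
def p_norm(vector, p):
    # (sum |x|^p)^(1/p); p=1 is the Manhattan norm, p=2 the Euclidean norm.
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)

def normalize(vector, p=2):
    norm = p_norm(vector, p)
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [x / norm for x in vector] if norm else list(vector)

print(normalize([3.0, 4.0], p=2))
print(normalize([1.0, 3.0], p=1))
```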
  • 15. Clustering Categories
      - Exclusive clustering
      - Overlapping clustering
      - Hierarchical clustering
      - Probabilistic clustering
  • 16. Clustering Approaches
      - Fixed number of centers
      - Bottom-up approach
      - Top-down approach
  • 17. Clustering algorithms
      - K-means clustering
      - Fuzzy k-means clustering
      - Dirichlet clustering
  • 20. Running k-means clustering
      - $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl
      - $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl
      - $ bin/mahout clusterdump -dt sequencefile -d
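
The roles of the initial centers, the distance measure, the convergence threshold, and the iteration limit are easier to see in a small in-memory version. The following is a sketch of the standard k-means loop in Python, not Mahout's MapReduce implementation; all names and the sample points are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centers, distance=euclidean, convergence_delta=0.01, max_iter=20):
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
        moved = max(distance(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        # Stop once no center moved more than the threshold.
        if moved <= convergence_delta:
            break
    return centers, clusters

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers, clusters = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
print(centers)
```

Swapping the `distance` argument changes the shape of the clusters, which is why the convergence threshold in the two commands above differs between the squared Euclidean and cosine runs.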
  • 21. Fuzzy k-means clustering
      - Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set.
      - Also known as the fuzzy c-means algorithm.
  • 22. Running fuzzy k-means clustering
  • 23. Running fuzzy k-means clustering
      - $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
      - Fuzziness factor (the -m option)
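
The fuzziness factor controls how much clusters overlap. In the standard fuzzy c-means formulation, a point's degree of membership in cluster i is 1 / sum over k of (d_i / d_k)^(2 / (m - 1)), where d_i is its distance to center i. The Python sketch below assumes that formulation with m > 1; the function name is illustrative, not Mahout's API.

```python
def memberships(distances, m=2.0):
    """Membership degrees of one point, given its distances to every center."""
    # A point sitting exactly on a center belongs fully to that center.
    if any(d == 0 for d in distances):
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

print(memberships([1.0, 3.0], m=2.0))
print(memberships([1.0, 3.0], m=5.0))
```

As m approaches 1 the memberships approach the hard assignments of k-means; larger values of m spread each point more evenly across clusters.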
  • 24. Dirichlet clustering
      - Model-based clustering algorithm
  • 25. Running Dirichlet clustering
      - $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector
  • 26. Evaluating and improving clustering quality
      - Inspecting clustering output
      - Evaluating the quality of clustering
      - Improving clustering quality
  • 27. Inspecting clustering output
      - $ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10
      - Top Terms: said => 11.60126582278481 bank => 5.943037974683544 dollar =>
  • 28. Analyzing clustering output
      - Distance measure and feature selection
      - Inter-cluster and intra-cluster distances
      - Mixed and overlapping clusters
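
A good clustering tends to have small intra-cluster distances and large inter-cluster distances. The sketch below shows one simple way to compute both in Python, assuming mean pairwise distance within a cluster and centroid-to-centroid distance between clusters; other definitions exist, and the names are illustrative.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    return [sum(dim) / len(points) for dim in zip(*points)]

def intra_cluster_distance(points):
    # Mean distance over all pairs of points inside one cluster.
    pairs = list(combinations(points, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def inter_cluster_distance(cluster_a, cluster_b):
    # Distance between the centroids of two clusters.
    return euclidean(centroid(cluster_a), centroid(cluster_b))

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(10.0, 0.0), (10.0, 2.0)]
print(intra_cluster_distance(a), inter_cluster_distance(a, b))
```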
  • 29. Improving clustering quality
      - Improving document vector generation
      - Writing a custom distance measure
  • 30. Real-world applications of clustering
      - Clustering like-minded people on Twitter
      - Suggesting tags for an artist on Last.fm using clustering
      - Creating a related-posts feature for a website
  • 31. Classification
      - Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses.
      - Applications of classification, e.g. spam filtering
  • 32. Why use Mahout for classification?
  • 34. Classification
      - Training versus test versus production
      - Predictor variables versus target variable
      - Records, fields, and values
  • 35. Types of values for predictor variables
      - Continuous
      - Categorical
      - Word-like
      - Text-like
  • 36. Classification workflow
      - Training the model
      - Evaluating the model
      - Using the model in production
  • 37. Stages of the workflow
      - Stage 1: training the classification model
      - Stage 2: evaluating the classification model
      - Stage 3: using the model in production
  • 38. Stage 1: training the classification model
      - Define categories for the target variable
      - Collect historical data
      - Define predictor variables
      - Select a learning algorithm to train the model
      - Use the learning algorithm to train the model
  • 39. Extracting features to build a Mahout classifier
  • 40. Preprocessing raw data into classifiable data
  • 41. Converting classifiable data into vectors
      - Use one Vector cell per word, category, or continuous value
      - Represent Vectors implicitly as bags of words
      - Use feature hashing
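
Feature hashing avoids building a dictionary by hashing each token straight to a vector index. The Python sketch below shows the idea with a fixed-size vector and an MD5-based index; the hash choice, vector size, and function name are assumptions for illustration and do not reflect Mahout's encoder classes.

```python
import hashlib

def hashed_vector(tokens, size=16):
    """Encode tokens into a fixed-size count vector without a dictionary."""
    vector = [0.0] * size
    for token in tokens:
        # A stable hash keeps the token-to-index mapping identical across runs.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vector[index] += 1.0
    return vector

doc = ["cheap", "pills", "cheap"]
print(hashed_vector(doc))
```

Different tokens can collide on the same index, so the vector size trades memory against accuracy; the benefit is that unseen words at prediction time still map to a valid cell.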
  • 42. Classifying the 20 newsgroups data set
  • 44. The classifier evaluation API
      - Percent correct
      - Confusion matrix
      - Entropy matrix
      - AUC
      - Log likelihood
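
Three of these metrics are simple enough to compute by hand, which helps when reading an evaluation report. The Python sketch below is illustrative and is not Mahout's evaluation API; the labels and scores are made-up sample data, and AUC is computed as the fraction of positive/negative pairs that the scores rank correctly.

```python
from collections import Counter

def percent_correct(actual, predicted):
    return 100.0 * sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_matrix(actual, predicted):
    # Maps (actual label, predicted label) to the number of records.
    return Counter(zip(actual, predicted))

def auc(labels, scores):
    # Probability that a random positive outscores a random negative (ties count half).
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

actual = ["spam", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham"]
matrix = confusion_matrix(actual, predicted)
print(percent_correct(actual, predicted), dict(matrix))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Percent correct hides which class the errors fall in; the confusion matrix exposes that, and AUC evaluates the ranking of scores independently of any decision threshold.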
  • 45. When classifiers go bad
      - Target leaks
      - Broken feature extraction
  • 46. Tuning the problem
      - Remove fluff variables
      - Add new variables, interactions, and derived values
  • 47. Tuning the classifier
      - Try alternative algorithms
      - Tune the learning algorithm
  • 48. Thank You
      - Email: Yasmine.Gaber@espace.com.eg
      - Twitter: Twitter.com/yasmine_mohamed