Efficient Text Categorization Using a Massively Parallel Machine Learning Model


                                       LU Bao-Liang
                       Department of Computer Science and Engineering
                               Shanghai Jiao Tong University
                                     blu@cs.sjtu.edu.cn



Abstract: In this talk, we present our recent research progress on categorizing Japanese text documents
with a massively parallel machine learning model. With this model, we can classify large-scale
document sets using advanced computing resources such as cluster systems and grid computing
systems. We perform experiments on an IBM P690 server to classify the Yomiuri Newspaper Corpus,
which includes over two million text documents in 75 different categories. We compare various
feature extraction methods and several popular pattern classifiers, such as k-NN and support vector
machines, on the Yomiuri Newspaper Corpus. The simulation results show the effectiveness of the
proposed massively parallel machine learning model for text categorization.
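
As a rough illustration of the kind of pipeline the abstract refers to (feature extraction followed by a k-NN classifier, with the work spread over several workers), here is a minimal sketch. It is not the massively parallel model proposed in the talk, which the abstract does not specify. The term-frequency features, the cosine similarity, the thread pool, the toy English corpus, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tf_vector(text):
    """Term-frequency features: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_predict(query, train, k=3):
    """Return the majority label among the k training vectors closest to query."""
    q = tf_vector(query)
    ranked = sorted(train, key=lambda item: cosine(q, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classify_all(queries, labelled_docs, k=3, workers=4):
    """Classify each query independently; queries are distributed over a thread pool.

    Assumption: simple data parallelism over queries, chosen only to show that
    the per-document work is independent. It is not the talk's parallel model.
    """
    train = [(tf_vector(text), label) for text, label in labelled_docs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: knn_predict(q, train, k), queries))


# Hypothetical toy corpus; the Yomiuri Newspaper Corpus itself is not used here.
TOY_CORPUS = [
    ("the team won the baseball game", "sports"),
    ("the pitcher threw a fast ball", "sports"),
    ("the striker scored in the soccer game", "sports"),
    ("parliament passed the budget bill", "politics"),
    ("the minister proposed a tax bill", "politics"),
    ("voters elected a new parliament", "politics"),
]

if __name__ == "__main__":
    print(classify_all(["the baseball pitcher won", "a new tax budget"], TOY_CORPUS))
```

Because each query is classified independently of the others, the same structure could in principle be spread over processes or cluster nodes instead of threads. Japanese text would also need a proper word segmenter in place of the whitespace split assumed here.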
