SlideShare a Scribd company logo
1 of 5
Steps involved in Preprocessing :

1.Tokenization :

●
    Tokenization : The process of breaking a stream of text into words


●
    Removal of Punctuation marks and numbers


●
    Replacing ‘n’ by Spaces


●
    Splitting the string by space as a delimiter


●
    Tokens
Graphical view of steps in Tokenization :



                  Removal of       Replacing n   Using
Stream of text.                                   spaces as
                  punctuation      by Spaces
                  marks                           delimiter




                                                   Tokens
                                                   (words)
2. Removal of stop words :


 ●
     Passing the list of Tokens.


 ●
     Removing the unnecessary words like the, an, so, after, all, etc (stop words).


 ●
     Output : A list of meaningful words.
3.Stemming :

  ●
      Stemming : The process for reducing inflected words to their stem, base or root form.
      For example : Stemming algorithm reduces “fishing", "fished", "fish", and "fisher"
             to the root word "fish“.


  ●
      Stemmer used : Porter Stemmer Algorithm.


  ●
      Removing ‘–ee’,’ –ed’, ‘-ing’, ‘-ence’, ‘-er’, etc. & adding ‘y’, ‘I’ as required.


  ●
      Doesn’t give accurate roots .
      Example : stem(flying) =fli
         stem(fly)=fli
  ●
      Same roots for all inflected forms – serves our purpose
4. Vocabulary creation :-


●
    Vocabulary : Generally, vocabulary is the set of words.


●
    Vocabulary = Union of words from all files.


●
    For each document : Converting list obtained after stemming into Set &
      taking union.


●
    Processed further for Tf-idf evaluation.

More Related Content

Viewers also liked

Viewers also liked (12)

Scale Invariant feature transform
Scale Invariant feature transformScale Invariant feature transform
Scale Invariant feature transform
 
SIFT
SIFTSIFT
SIFT
 
Face recognition
Face recognitionFace recognition
Face recognition
 
fMRI preprocessing steps (in SPM8)
fMRI preprocessing steps (in SPM8)fMRI preprocessing steps (in SPM8)
fMRI preprocessing steps (in SPM8)
 
Face recogntion Using PCA Algorithm
Face recogntion Using PCA Algorithm Face recogntion Using PCA Algorithm
Face recogntion Using PCA Algorithm
 
face recognition system using LBP
face recognition system using LBPface recognition system using LBP
face recognition system using LBP
 
03 data preprocessing
03 data preprocessing03 data preprocessing
03 data preprocessing
 
PCA Based Face Recognition System
PCA Based Face Recognition SystemPCA Based Face Recognition System
PCA Based Face Recognition System
 
Local binary pattern
Local binary patternLocal binary pattern
Local binary pattern
 
Face Recognition using PCA-Principal Component Analysis using MATLAB
Face Recognition using PCA-Principal Component Analysis using MATLABFace Recognition using PCA-Principal Component Analysis using MATLAB
Face Recognition using PCA-Principal Component Analysis using MATLAB
 
face recognition system using LBP
face recognition system using LBPface recognition system using LBP
face recognition system using LBP
 
Image pre processing
Image pre processingImage pre processing
Image pre processing
 

Similar to Preprocessing

Similar to Preprocessing (20)

Lexical1
Lexical1Lexical1
Lexical1
 
Search pitb
Search pitbSearch pitb
Search pitb
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdf
 
ANSI C REFERENCE CARD
ANSI C REFERENCE CARDANSI C REFERENCE CARD
ANSI C REFERENCE CARD
 
Cs3430 lecture 16
Cs3430 lecture 16Cs3430 lecture 16
Cs3430 lecture 16
 
Pcd(Mca)
Pcd(Mca)Pcd(Mca)
Pcd(Mca)
 
PCD ?s(MCA)
PCD ?s(MCA)PCD ?s(MCA)
PCD ?s(MCA)
 
Principles of Compiler Design
Principles of Compiler DesignPrinciples of Compiler Design
Principles of Compiler Design
 
Pcd(Mca)
Pcd(Mca)Pcd(Mca)
Pcd(Mca)
 
Compiler Design Material 2
Compiler Design Material 2Compiler Design Material 2
Compiler Design Material 2
 
Syntax
SyntaxSyntax
Syntax
 
Chap 1 and 2
Chap 1 and 2Chap 1 and 2
Chap 1 and 2
 
Unit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxUnit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptx
 
Escape Sequences and Variables
Escape Sequences and VariablesEscape Sequences and Variables
Escape Sequences and Variables
 
String Manipulation in Python
String Manipulation in PythonString Manipulation in Python
String Manipulation in Python
 
C language by Dr. D. R. Gholkar
C language by Dr. D. R. GholkarC language by Dr. D. R. Gholkar
C language by Dr. D. R. Gholkar
 
Programming basics
Programming basicsProgramming basics
Programming basics
 
MIPS Architecture
MIPS ArchitectureMIPS Architecture
MIPS Architecture
 
Lexical Analysis - Compiler design
Lexical Analysis - Compiler design Lexical Analysis - Compiler design
Lexical Analysis - Compiler design
 
Unitiv 111206005201-phpapp01
Unitiv 111206005201-phpapp01Unitiv 111206005201-phpapp01
Unitiv 111206005201-phpapp01
 

Recently uploaded

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Preprocessing

  • 1. Steps involved in Preprocessing : 1.Tokenization : ● Tokenization : The process of breaking a stream of text into words ● Removal of Punctuation marks and numbers ● Replacing ‘n’ by Spaces ● Splitting the string by space as a delimiter ● Tokens
  • 2. Graphical view of steps in Tokenization : Removal of Replacing n Using Stream of text. spaces as punctuation by Spaces marks delimiter Tokens (words)
  • 3. 2. Removal of stop words : ● Passing the list of Tokens. ● Removing the unnecessary words like the, an, so, after, all, etc (stop words). ● Output : A list of meaningful words.
  • 4. 3.Stemming : ● Stemming : The process for reducing inflected words to their stem, base or root form. For example : Stemming algorithm reduces “fishing", "fished", "fish", and "fisher" to the root word "fish“. ● Stemmer used : Porter Stemmer Algorithm. ● Removing ‘–ee’,’ –ed’, ‘-ing’, ‘-ence’, ‘-er’, etc. & adding ‘y’, ‘I’ as required. ● Doesn’t give accurate roots . Example : stem(flying) =fli stem(fly)=fli ● Same roots for all inflected forms – serves our purpose
  • 5. 4. Vocabulary creation :- ● Vocabulary : Generally, vocabulary is the set of words. ● Vocabulary = Union of words from all files. ● For each document : Converting list obtained after stemming into Set & taking union. ● Processed further for Tf-idf evaluation.