Using Machine learning and R
Finding Order in the Chaos
Harshad Saykhedkar
Source of text and applications
● Emails : Spam detection
● Product descriptions / reviews : Sentiment analysis, recommendation
● Blogs / informational content : Content recommendations
● Web pages / news articles : Topic identification, trending topics
● Tweets / comments / social content : Sentiment analysis, named entity recognition
(Text mining) is a wonderful world. Let's go exploring...!
Itinerary
● R you ready ?
● Prep camp
● The wandering traveller
● The seeker
R you ready ?
Packing our bags : Checks
● Starting R
● Loading required packages
● Check sessionInfo( )
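A minimal sketch of these checks at the R console. The slides do not name the packages the session loads, so the `required` vector below is a placeholder holding two packages that ship with R; swap in whichever text-mining packages you actually use.

```r
# Placeholder package list: the deck does not say which packages it loads.
required <- c("stats", "utils")

# Try to attach each package and record TRUE/FALSE instead of stopping on error.
loaded <- vapply(required, function(pkg) {
  suppressPackageStartupMessages(
    require(pkg, character.only = TRUE, quietly = TRUE)
  )
}, logical(1))
print(loaded)

# sessionInfo() reports the R version, platform, and attached packages.
info <- sessionInfo()
print(info$R.version$version.string)
```

A `FALSE` entry in `loaded` means that package still needs `install.packages()`.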
Packing our bags : Datatypes
● Atomic vectors
● Lists
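A short sketch of the two datatypes, using made-up values. The variable names are illustrative only and do not come from the slides.

```r
# An atomic vector holds elements of a single type.
words  <- c("great", "location", "fabulous", "staff")
counts <- c(2L, 1L, 1L, 3L)

is.atomic(words)   # TRUE
typeof(counts)     # "integer"

# Mixing types coerces every element to the most general type.
mixed <- c(1, "a", TRUE)
typeof(mixed)      # "character"

# A list can hold elements of different types, including whole vectors.
review <- list(text   = "Great location, fabulous staff",
               stars  = 5L,
               tokens = words)
str(review)

# $ or [[ ]] extracts one element; [ ] returns a shorter list.
review$stars             # 5
class(review["stars"])   # "list"
```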
"Let's try our hands"
Packing our bags : Functions
● Expressions which are evaluated
● Can be passed around
● Definitions can be nested
Details not covered : Argument matching, call by value, environments and lexical scoping, promises, etc.
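The three bullet points above can be sketched in a few lines. The function names `count_words`, `make_scaler`, and `twice` are invented for this illustration.

```r
# A function body is an expression; its last evaluated value is returned.
count_words <- function(text) {
  length(strsplit(text, "\\s+")[[1]])
}

# Functions are values, so they can be passed to other functions.
reviews <- c("terrible service", "great location fabulous staff")
word_counts <- sapply(reviews, count_words, USE.NAMES = FALSE)
print(word_counts)   # 2 4

# Definitions can be nested: make_scaler() defines and returns an inner function.
make_scaler <- function(factor) {
  function(x) x * factor
}
twice <- make_scaler(2)
print(twice(21))     # 42
```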
Prep Camp
Prep camp : Sentiment Analysis
● Bag of words model
● Simple aggregated score
' terrible service & disorganised '
' OK - some good some bad '
' Great location, fabulous staff '
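A minimal sketch of a bag-of-words score for the three example reviews. The `positive` and `negative` word lists are tiny, hypothetical lexicons chosen for this illustration; the slides do not say which lexicon was used.

```r
# Hypothetical opinion lexicons; a real run would use a much larger word list.
positive <- c("good", "great", "fabulous")
negative <- c("bad", "terrible", "disorganised")

# Bag of words: lowercase, split on anything that is not a letter, ignore order.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[tokens != ""]
}

# Simple aggregated score: (number of positive words) - (number of negative words).
score_sentiment <- function(text) {
  tokens <- tokenize(text)
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

reviews <- c(" terrible service & disorganised ",
             " OK - some good some bad ",
             " Great location, fabulous staff ")
scores <- vapply(reviews, score_sentiment, integer(1), USE.NAMES = FALSE)
print(scores)   # -2  0  2
```

Every matched word counts as +1 or -1, so word order, negation, and intensity are all ignored.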
Prep camp : Improvements
● Part of speech ambiguity
● Further exploration ?
● Equal weightage model
● Double negations ?
The Wandering Traveller
Wandering traveller : Unsupervised Learning
Entity as point in space
Can define distance
How to derive this model for text ?
(Figure axes: Feature 1, Feature 2)
Wandering traveller : Vector Space Model
Word, Phrase, Theme
Comments, Blogs, Tweets
Wandering traveller : TfIdf and other details
" But how to measure the importance of a word for a doc ? "
● Binary : Is the 'word' in the 'doc' ?
● Tf : # times the word in the 'doc' ?
● TfIdf : Penalize the obvious!
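The three weightings can be compared on a toy count matrix. The counts are made up, and the idf formula used here, log(N / document frequency), is one common variant; the slides do not say which variant was used.

```r
# Toy counts (rows = docs, columns = words); the numbers are made up.
tf <- rbind(d1 = c(the = 3, washer = 2, camera = 0),
            d2 = c(the = 2, washer = 1, camera = 0),
            d3 = c(the = 4, washer = 0, camera = 3))

# Binary: is the word in the doc?
binary <- (tf > 0) * 1

# Tf is the raw count matrix itself.

# TfIdf: scale each word by how rare it is across documents.
n_docs   <- nrow(tf)
doc_freq <- colSums(tf > 0)          # number of docs containing each word
idf      <- log(n_docs / doc_freq)   # assumed variant: unsmoothed natural log
tfidf    <- sweep(tf, 2, idf, `*`)   # multiply column j by idf[j]
print(round(tfidf, 3))
```

"the" occurs in every document, so its idf is log(1) = 0 and its weight vanishes, which is the "penalize the obvious" effect.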
Wandering traveller : Hierarchical Clustering
● Define distance measure
● Keep Merging based on similarity
(Figure labels: Washing Machine, Washer, Dryer, Camera)
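A minimal sketch of the two steps with base R's `dist()` and `hclust()`. The feature vectors for the four products are invented for illustration, and Euclidean distance with average linkage is an assumption, not necessarily the choice made in the talk.

```r
# Made-up feature vectors; the columns are hypothetical word weights.
items <- rbind(
  "Washing Machine" = c(wash = 1.0, dry = 0.2, photo = 0.0),
  "Washer"          = c(wash = 0.9, dry = 0.3, photo = 0.0),
  "Dryer"           = c(wash = 0.3, dry = 1.0, photo = 0.0),
  "Camera"          = c(wash = 0.0, dry = 0.0, photo = 1.0)
)

d  <- dist(items, method = "euclidean")   # define a distance measure
hc <- hclust(d, method = "average")       # keep merging the closest clusters

# Cut the tree into two groups; plot(hc) would draw the dendrogram.
groups <- cutree(hc, k = 2)
print(groups)
```

With these numbers the three laundry products end up in one group and the camera in the other.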
Wandering traveller : Improvements
● Stemming, lemmatization
● Latent semantic analysis
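A rough sketch of latent semantic analysis using base R's `svd()`. The term-document counts are made up so that "phone" and "touchscreen" never share a document but both appear with "battery". Keeping k = 2 singular values is an arbitrary choice for this toy example. Stemming is not shown here.

```r
# Made-up term-document counts (rows = terms, columns = four documents).
tdm <- rbind(phone       = c(1, 0, 0, 0),
             touchscreen = c(0, 1, 0, 0),
             battery     = c(1, 1, 0, 0),
             camera      = c(0, 0, 1, 1),
             lens        = c(0, 0, 1, 1))

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# In the raw space the two terms share no document.
raw_sim <- cosine(tdm["phone", ], tdm["touchscreen", ])

# LSA: keep only the k largest singular values and compare terms there.
k <- 2
s <- svd(tdm)
term_coords <- s$u[, 1:k] %*% diag(s$d[1:k])
rownames(term_coords) <- rownames(tdm)

lsa_sim <- cosine(term_coords["phone", ], term_coords["touchscreen", ])
print(c(raw = raw_sim, lsa = lsa_sim))
```

The raw similarity is zero, while in the reduced space the two terms land close together because they share a context word.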
"Cameras" Vs "Camera"
"Phone" "Touch Screen"
The Seeker
Seeker : Supervised Learning
● Labels given with features
● Find rule, classify unobserved case
(Figure axes: Feature 1, Feature 2)
Seeker : Naive Bayes Classifier
● Independence of features
● Train the model on training set
● Test accuracy on a holdout sample
            Predicted 0   Predicted 1
Actual 0    F(0, 0)       F(0, 1)
Actual 1    F(1, 0)       F(1, 1)
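The train, test, and confusion-matrix steps can be sketched in base R. This is a hand-rolled multinomial naive Bayes with add-one smoothing, written for illustration only; the talk may have used a package instead. The messages, labels, and helper names (`train_nb`, `predict_nb`) are all made up.

```r
# Made-up labelled messages (1 = spam, 0 = not spam).
train <- data.frame(
  text  = c("win cash now", "cheap cash offer", "win cheap prize",
            "meeting at noon", "lunch at noon", "project meeting notes"),
  label = c(1, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)
holdout <- data.frame(
  text  = c("win cash prize", "notes for lunch meeting"),
  label = c(1, 0),
  stringsAsFactors = FALSE
)

tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")[[1]]
vocab <- unique(unlist(lapply(train$text, tokenize)))

# Train: class priors and per-class word probabilities with add-one smoothing.
train_nb <- function(data) {
  lapply(split(data, data$label), function(d) {
    counts <- table(factor(unlist(lapply(d$text, tokenize)), levels = vocab))
    probs  <- (as.numeric(counts) + 1) / (sum(counts) + length(vocab))
    list(prior = nrow(d) / nrow(data), word_prob = setNames(probs, vocab))
  })
}

# Predict: add log probabilities, treating words as independent given the class.
predict_nb <- function(model, text) {
  tokens <- tokenize(text)
  tokens <- tokens[tokens %in% vocab]   # words unseen in training are ignored
  scores <- vapply(model, function(m) {
    log(m$prior) + sum(log(m$word_prob[tokens]))
  }, numeric(1))
  as.numeric(names(which.max(scores)))
}

model <- train_nb(train)
predicted <- vapply(holdout$text, function(txt) predict_nb(model, txt),
                    numeric(1), USE.NAMES = FALSE)

# Confusion matrix on the holdout sample: rows = actual, columns = predicted.
confusion <- table(Actual    = factor(holdout$label, levels = 0:1),
                   Predicted = factor(predicted, levels = 0:1))
print(confusion)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

The diagonal cells F(0, 0) and F(1, 1) count correct predictions, so accuracy is their sum divided by the total number of holdout cases.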
Learnings
● How to clean up and preprocess data in text form ?
● How to model the data ?
● How to cluster the data ?
● How to classify the data ?
Questions ?
About me
"Avid R learner, trying to apply a bunch of these techniques to the digital ads world"
Contact : harshad.saykhedkar@sokrati.com

Machine learning applications on text data
