1

A News Classifier Using Evernote	
Atsushi KOIKE
Sokendai
Feb. 18, 2014
2

Background	
•  Evernote
•  One of my favorite applications
•  I can clip interesting news and technical articles using Evernote

I will show a web-based application utilizing Evernote
3

Brief Overview	
•  News Classifier
•  Divides news articles into 4 categories
•  Tech
•  General technical news
•  My Study
•  IT news relevant to me, such as programming, development, and computer architectures
•  Biz
•  Business news
•  Life
•  Other news, such as politics, economics, sports, and music

•  Displays high-score articles

•  Training data
•  Notes clipped into my Evernote
•  The notes in my Evernote are basically divided into the above 4 folders
4

Naïve Bayes Classifier	
•  It divides documents into categories

\[ \arg\max_{cat} P(cat \mid doc) \]
•  My application also returns \( \max_{cat} P(cat \mid doc) \) for scoring
•  The classifier calculates \( P(cat \mid doc) \) using Bayes' theorem
•  Given doc, it returns

\[ P(cat \mid doc) = \frac{P(cat)\, P(doc \mid cat)}{\sum_{cat'} P(cat')\, P(doc \mid cat')} \]

•  \( P(cat) \) and \( P(doc \mid cat) \) are calculated using the multinomial model
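
The Bayes' theorem step on this slide can be sketched in Ruby as follows. This is a minimal illustration, not the author's code: the `posterior` helper and all numbers are hypothetical, and the likelihoods are simply given rather than computed from a model.

```ruby
# Hypothetical helper: turns priors P(cat) and likelihoods P(doc | cat)
# into posteriors P(cat | doc) using Bayes' theorem.
def posterior(priors, likelihoods)
  # Numerator for each category: P(cat) * P(doc | cat)
  joint = priors.to_h { |cat, p| [cat, p * likelihoods.fetch(cat)] }
  # Denominator: the same product summed over all categories
  evidence = joint.values.sum
  joint.to_h { |cat, j| [cat, j / evidence] }
end

priors      = { "Tech" => 0.5,  "Biz" => 0.5 }     # made-up values
likelihoods = { "Tech" => 0.02, "Biz" => 0.005 }   # made-up values

post = posterior(priors, likelihoods)

# arg max gives the predicted category, max gives the score.
best_cat, score = post.max_by { |_cat, p| p }
puts "#{best_cat}: #{score.round(3)}"
```

The denominator is the same for every category, so it does not change which category wins; it only matters because the application also reports the normalized value as a score.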
5

(Simplified) Multinomial Model	
•  Document is defined as a sequence of n words

\[ doc := (w_1, w_2, \ldots, w_n) \]

•  Suppose documents are generated by repeatedly picking one word

\[ P(doc \mid cat) = P(w_1 \mid cat)\, P(w_2 \mid cat) \cdots P(w_n \mid cat) \]
[Figure: words W1, W2, and W3 distributed across Category 1 and Category 2]
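
A small Ruby sketch of the per-word product follows. It sums logarithms instead of multiplying probabilities, which is a common implementation choice to avoid floating-point underflow on long documents; the slides do not say whether the author did this. The `log_likelihood` helper and the probabilities are hypothetical.

```ruby
# Hypothetical word probabilities P(w | cat) for one category (made-up values).
word_probs = { "w1" => 0.5, "w2" => 0.25, "w3" => 0.25 }

# log P(doc | cat) = sum of log P(w_i | cat) over the words of the document.
def log_likelihood(doc, word_probs)
  doc.sum { |w| Math.log(word_probs.fetch(w)) }
end

doc = %w[w1 w2 w1]
ll = log_likelihood(doc, word_probs)

# Exponentiating recovers the plain product 0.5 * 0.25 * 0.5 (up to rounding).
puts Math.exp(ll)
```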
6

Estimating model parameters P(cat), P(w | cat)
•  Simple method:

\[ P(cat) = \frac{N_c}{N}, \qquad P(w \mid cat) = \frac{n_{w,c}}{n_c} \]

\( N_c \) : Number of docs in category c
\( N \) : Number of docs in total
\( n_{w,c} \) : Number of occurrences of word w in documents in category c
\( n_c \) : Total number of words in documents in category c
•  Disadvantage: If \( n_{w,c} = 0 \), then \( P(w \mid cat) = 0 \), and so

\[ P(doc \mid cat) = P(w_1 \mid cat)\, P(w_2 \mid cat) \cdots P(w_n \mid cat) = 0 \]

•  Improved method: smoothing
•  Consider a prior distribution for P(cat), P(w | cat): Dirichlet distribution (α = 2)

\[ \Rightarrow \quad P(cat) = \frac{N_c + 1}{N + |cat|}, \qquad P(w \mid cat) = \frac{n_{w,c} + 1}{n_c + |W|} \]

\( |cat| \) : Number of categories
\( |W| \) : Number of words in the vocabulary
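
The smoothed estimates can be sketched in Ruby as below. This is an illustrative sketch under assumptions, not the author's implementation: the `train` function, its input format (an array of `[category, words]` pairs), and the toy documents are all hypothetical.

```ruby
# Minimal sketch of parameter estimation with add-one smoothing.
# docs: array of [category, array_of_words] pairs (hypothetical input format).
def train(docs)
  doc_counts  = Hash.new(0)                              # N_c
  word_counts = Hash.new { |h, c| h[c] = Hash.new(0) }   # n_{w,c}
  vocab = {}

  docs.each do |cat, words|
    doc_counts[cat] += 1
    words.each do |w|
      word_counts[cat][w] += 1
      vocab[w] = true
    end
  end

  n_docs = docs.size          # N
  n_cats = doc_counts.size    # number of categories
  v_size = vocab.size         # vocabulary size

  # Smoothed prior: (N_c + 1) / (N + number of categories)
  prior = doc_counts.to_h { |cat, n| [cat, (n + 1.0) / (n_docs + n_cats)] }

  # Smoothed P(w | cat): (n_{w,c} + 1) / (n_c + vocabulary size).
  # A lambda is returned so that unseen words also get a non-zero probability.
  word_prob = lambda do |w, cat|
    total = word_counts[cat].values.sum                  # n_c
    (word_counts[cat][w] + 1.0) / (total + v_size)
  end

  [prior, word_prob]
end

docs = [
  ["Tech", %w[cpu gpu cpu]],
  ["Biz",  %w[market stock]],
  ["Tech", %w[gpu compiler]],
]
prior, word_prob = train(docs)

puts prior["Tech"]                    # (2 + 1) / (3 + 2)
puts word_prob.call("cpu", "Tech")    # (2 + 1) / (5 + 5)
puts word_prob.call("market", "Tech") # (0 + 1) / (5 + 5), non-zero thanks to smoothing
```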
7

Software Architecture	
[Figure: data-flow diagram. Components: Evernote, Evernote Crawler, Articles Collector, Bayes Classifier, Display. Data: training data, model parameters, URL list, notes.json, classifier.json, articles.json, contents.json. Operations: Train, Classify.]
8

Crawling Evernote	
•  I used “Evernote Ruby API” to crawl Evernote

Category (Notebook), Title, Content (XML) → to_json

{ "cat": "Tech",
  "title": "「スマホの9割はiPhone」「有料スタンプは買わない」――女子高生起業家に聞くスマホ事情",
  "content": "・・・"
}

I used the “title” and the “content” as the training document
9

Training	
•  Outline
1.  Parse XML content using the Nokogiri library
2.  Divide sentences into words using the MeCab library
3.  Calculate the model parameters
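
The first two steps of the outline can be illustrated with a dependency-free Ruby sketch. It deliberately uses stand-ins: a regex replaces Nokogiri for stripping markup, and a simple alphanumeric scan replaces MeCab for tokenization, so it only works on space-delimited text and is not suitable for Japanese. The helper names and the sample note are hypothetical.

```ruby
# Stand-in for step 1: strip markup with a regex.
# (The slides use Nokogiri; a regex is only adequate for this toy input.)
def extract_text(xml)
  xml.gsub(/<[^>]+>/, " ")
end

# Stand-in for step 2: split into lowercase alphanumeric tokens.
# (The slides use MeCab, which is needed for Japanese text without spaces.)
def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Input to step 3: per-word counts for one document.
def word_counts(xml)
  tokenize(extract_text(xml)).each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
end

# Hypothetical note content in an XML-like format.
note = "<en-note><div>Ruby is fun.</div><div>Ruby is fast.</div></en-note>"
counts = word_counts(note)
p counts
```

These counts are what the parameter estimation step would accumulate per category.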
10

Crawling web	
•  News sites
•  IT Media
•  Yahoo! News
•  Impress Watch
•  TechCrunch
•  Gizmodo
•  Nikkan Sports
•  Nikkei BP
•  I downloaded RSS files using Ruby and saved them as JSON files
{
"title": "IE 10に未解決の脆弱性、悪用攻撃の発生でIE 11に更新を",
"link": "http://rss.rssad.jp/rss/artclk/VlF3IIxoZHoi/・・・",
"desc": "・・・"
},
I used the “title” and the “description” for classification
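
One way to turn an RSS feed into records of this shape is sketched below using Ruby's `rss` and `json` libraries. This is an assumption about the approach, since the slides do not show the collector code. The feed is an inline, made-up sample so the sketch runs offline; a real collector would fetch the XML from each site's feed URL.

```ruby
require "rss"
require "json"

# Inline sample feed (hypothetical content) so the sketch runs offline.
xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example feed</title>
      <link>http://example.com/</link>
      <description>Hypothetical feed</description>
      <item>
        <title>First article</title>
        <link>http://example.com/1</link>
        <description>Summary of the first article</description>
      </item>
    </channel>
  </rss>
XML

feed = RSS::Parser.parse(xml, false)   # false: skip strict validation

# Keep only the fields shown on the slide: title, link, and description.
articles = feed.items.map do |item|
  { "title" => item.title, "link" => item.link, "desc" => item.description }
end

json = JSON.pretty_generate(articles)
puts json
```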
11

Application	
•  Web-based
•  Ruby + Sinatra + jQuery Mobile
•  Articles are sorted by \( P(cat \mid doc) \)

•  The application displays the top 5 articles for each category
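
The ranking step can be sketched in plain Ruby as follows. The record layout (`"cat"` and `"score"` keys), the `top_articles` helper, and the sample data are hypothetical; the Sinatra route and the jQuery Mobile view that would render the result are not shown.

```ruby
# Groups classified articles by category and keeps the n highest-scoring ones,
# where each score is the posterior P(cat | doc) assigned by the classifier.
def top_articles(articles, n = 5)
  articles
    .group_by { |a| a["cat"] }
    .transform_values { |list| list.sort_by { |a| -a["score"] }.first(n) }
end

# Made-up classification results.
articles = [
  { "title" => "T1", "cat" => "Tech", "score" => 0.91 },
  { "title" => "T2", "cat" => "Tech", "score" => 0.99 },
  { "title" => "B1", "cat" => "Biz",  "score" => 0.75 },
  { "title" => "T3", "cat" => "Tech", "score" => 0.60 },
]

top = top_articles(articles, 2)
top.each do |cat, list|
  puts "#{cat}: #{list.map { |a| a['title'] }.join(', ')}"
end
```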
12

Discussion	
•  Most of the displayed articles matched my interests
•  Some articles were classified into the wrong categories
•  Possible reasons
•  No stemming
•  Few stop words
•  Is it appropriate to use \( P(cat \mid doc) \) as the score?
