1

A News Classifier Using Evernote	
Atsushi KOIKE
Sokendai
Feb. 18, 2014
2

Background	
•  Evernote
•  One of my favorite applications
•  I can clip interesting news and technical articles using Evernote

I will show a web-based application utilizing Evernote
3

Brief Overview	
•  News Classifier
•  Divides news articles into 4 categories
•  Tech
•  General technical news
•  My Study
•  IT news relevant to me, such as programming, development, and computer architectures
•  Biz
•  Business news
•  Life
•  Other news, such as politics, economics, sports, and music

•  Displays high-score articles

•  Training data
•  Notes clipped into my Evernote
•  The notes in my Evernote are basically divided into the above 4 folders
4

Naïve Bayes Classifier	
•  It divides documents into categories

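•  Given doc, it returns

$$\arg\max_{\mathrm{cat}} P(\mathrm{cat} \mid \mathrm{doc})$$

•  My application also returns $\max_{\mathrm{cat}} P(\mathrm{cat} \mid \mathrm{doc})$ for scoring
•  The classifier calculates $P(\mathrm{cat} \mid \mathrm{doc})$ using Bayes' theorem

$$P(\mathrm{cat} \mid \mathrm{doc}) = \frac{P(\mathrm{cat})\, P(\mathrm{doc} \mid \mathrm{cat})}{\sum_{\mathrm{cat}'} P(\mathrm{cat}')\, P(\mathrm{doc} \mid \mathrm{cat}')}$$

•  $P(\mathrm{cat})$ and $P(\mathrm{doc} \mid \mathrm{cat})$ are calculated using the multinomial model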
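
The Bayes' theorem step can be sketched in a few lines of Ruby. This is a minimal illustration, not the code behind the slides: the method names `posteriors` and `classify` are hypothetical, and the priors and likelihoods are made-up numbers.

```ruby
# Hypothetical helper: turn per-category priors and likelihoods into posteriors.
# priors:      { "Tech" => P(cat), ... }
# likelihoods: { "Tech" => P(doc | cat), ... }
def posteriors(priors, likelihoods)
  joint = priors.to_h { |cat, prior| [cat, prior * likelihoods.fetch(cat)] }
  evidence = joint.values.sum # denominator: sum over all categories
  joint.transform_values { |p| p / evidence }
end

# Return [best category, its posterior], i.e. arg max and max of P(cat | doc).
def classify(priors, likelihoods)
  posteriors(priors, likelihoods).max_by { |_cat, p| p }
end

# Made-up numbers, for illustration only.
priors      = { "Tech" => 0.5, "Life" => 0.5 }
likelihoods = { "Tech" => 0.03, "Life" => 0.01 }
best_cat, score = classify(priors, likelihoods)
puts "#{best_cat}: #{score.round(2)}"
```

The same call yields both the predicted category and the score used later for ranking.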
5

(Simplified) Multinomial Model	
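•  A document is defined as a sequence of n words

$$\mathrm{doc} := (w_1, w_2, \ldots, w_n)$$

•  Suppose a document is generated by repeatedly picking one word at a time from its category, each pick independent of the others

$$P(\mathrm{doc} \mid \mathrm{cat}) = P(w_1 \mid \mathrm{cat})\, P(w_2 \mid \mathrm{cat}) \cdots P(w_n \mid \mathrm{cat})$$

[Figure: Category 1 and Category 2, each drawn as a collection of the words W1, W2, W3]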
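
As a rough sketch of the product formula, the Ruby below multiplies per-word probabilities for one category. The word probabilities are made-up numbers and the method names are hypothetical. The log-space variant is my own addition (the slides do not mention it); it is a common way to avoid floating-point underflow on long documents.

```ruby
# Hypothetical word probabilities P(w | cat) for one category (made-up numbers).
WORD_PROBS = { "w1" => 0.5, "w2" => 0.25, "w3" => 0.25 }.freeze

# P(doc | cat) as a plain product of per-word probabilities.
def doc_likelihood(words, word_probs)
  words.reduce(1.0) { |acc, w| acc * word_probs.fetch(w) }
end

# Same quantity in log space (assumption: not part of the original slides).
def doc_log_likelihood(words, word_probs)
  words.sum { |w| Math.log(word_probs.fetch(w)) }
end

doc = %w[w1 w2 w1]
puts doc_likelihood(doc, WORD_PROBS)
puts doc_log_likelihood(doc, WORD_PROBS)
```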
6

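Estimating model parameters $P(\mathrm{cat})$, $P(w \mid \mathrm{cat})$
•  Simple method (c denotes the category cat):

$$P(\mathrm{cat}) = \frac{N_c}{N}, \qquad P(w \mid \mathrm{cat}) = \frac{n_{w,c}}{n_c}$$

$N_c$: number of docs in category c
$N$: total number of docs
$n_{w,c}$: number of occurrences of word w in the documents of category c
$n_c$: total number of words in the documents of category c
•  Disadvantage: if $n_{w,c} = 0$, then $P(w \mid \mathrm{cat}) = 0$, and for any document containing w

$$P(\mathrm{doc} \mid \mathrm{cat}) = P(w_1 \mid \mathrm{cat})\, P(w_2 \mid \mathrm{cat}) \cdots P(w_n \mid \mathrm{cat}) = 0$$

•  Improved method: smoothing
•  Consider a prior distribution for $P(\mathrm{cat})$, $P(w \mid \mathrm{cat})$: Dirichlet distribution (α = 2)

$$\Rightarrow \quad P(\mathrm{cat}) = \frac{N_c + 1}{N + |\mathrm{cat}|}, \qquad P(w \mid \mathrm{cat}) = \frac{n_{w,c} + 1}{n_c + |W|}$$

$|\mathrm{cat}|$: number of categories
$|W|$: vocabulary size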
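
The smoothed estimates can be sketched as follows. This is a toy illustration under assumptions: the training data is invented, and the names `train`, `prior`, and `word_prob` are hypothetical rather than taken from the application.

```ruby
# Training docs: array of [category, [words...]] pairs.
def train(docs)
  doc_counts  = Hash.new(0)                            # N_c
  word_counts = Hash.new { |h, c| h[c] = Hash.new(0) } # n_{w,c}
  vocab = {}
  docs.each do |cat, words|
    doc_counts[cat] += 1
    words.each do |w|
      word_counts[cat][w] += 1
      vocab[w] = true
    end
  end
  { doc_counts: doc_counts, word_counts: word_counts,
    vocab_size: vocab.size, total_docs: docs.size }
end

# Smoothed P(cat) = (N_c + 1) / (N + |cat|)
def prior(model, cat)
  (model[:doc_counts][cat] + 1.0) / (model[:total_docs] + model[:doc_counts].size)
end

# Smoothed P(w | cat) = (n_{w,c} + 1) / (n_c + |W|)
def word_prob(model, word, cat)
  counts = model[:word_counts][cat]
  (counts[word] + 1.0) / (counts.values.sum + model[:vocab_size])
end

# Toy data, made up for illustration.
model = train([["Tech", %w[cpu gpu cpu]], ["Life", %w[music sports]]])
puts prior(model, "Tech")
puts word_prob(model, "music", "Tech") # non-zero although "music" never occurs in Tech
```

With smoothing, a word unseen in a category no longer forces the whole document likelihood to zero.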
7

Software Architecture	
[Figure: data-flow diagram. Components: Evernote Crawler, Train, Bayes Classifier, Articles Collector, Classify, Display. Data: Evernote (training data), notes.json, classifier.json (model parameters), URL list, articles.json, contents.json]
8

Crawling Evernote	
•  I used “Evernote Ruby API” to crawl Evernote

•  Each note's category (notebook), title, and content (XML) are serialized with to_json:

{ "cat": "Tech",
"title": "「スマホの9割はiPhone」「有料スタンプは買わない」――女子高生起業家に聞くスマホ事情",
"content": "・・・"
}

•  I used the “title” and the “content” as the training document
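
A minimal sketch of the serialization step, using only Ruby's standard `json` library. It does not call the Evernote API: the `Note` struct and its `training_text` method are hypothetical stand-ins for whatever the crawler returns, and the sample values are invented.

```ruby
require "json"

# Hypothetical in-memory representation of one clipped note; the field
# names mirror the JSON keys "cat", "title", and "content".
Note = Struct.new(:cat, :title, :content) do
  def to_json(*args)
    { cat: cat, title: title, content: content }.to_json(*args)
  end

  # Title and content together form one training document.
  def training_text
    "#{title} #{content}"
  end
end

note = Note.new("Tech", "Example title", "<p>Example body</p>")
puts [note].to_json
```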
9

Training	
•  Outline
1.  Parse XML content using the Nokogiri library
2.  Divide sentences into words using the MeCab library
3.  Calculate the model parameters
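
The three steps can be approximated with a dependency-free sketch. Note the assumptions: a regular expression stands in for Nokogiri and a simple alphanumeric scan stands in for MeCab, so this only works for space-delimited text and is not a robust XML parser. All method names are hypothetical.

```ruby
# Stand-in for step 1 (the slides use Nokogiri): drop anything tag-like.
def strip_markup(xml)
  xml.gsub(/<[^>]*>/, " ")
end

# Stand-in for step 2 (the slides use MeCab): scan runs of letters and digits.
# Japanese text needs a real morphological analyzer instead.
def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Input to step 3: per-category word counts n_{w,c}.
def count_words(notes)
  counts = Hash.new { |h, cat| h[cat] = Hash.new(0) }
  notes.each do |note|
    text = "#{note['title']} #{strip_markup(note['content'])}"
    tokenize(text).each { |w| counts[note["cat"]][w] += 1 }
  end
  counts
end

# Made-up note, for illustration only.
notes = [{ "cat" => "Tech", "title" => "New CPU", "content" => "<p>Fast CPU</p>" }]
p count_words(notes)
```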
10

Crawling web	
•  News sites
•  ITmedia
•  Yahoo! News
•  Impress Watch
•  TechCrunch
•  Gizmodo
•  Nikkan Sports
•  Nikkei BP
•  I downloaded RSS files using Ruby and saved them as JSON files

{
"title": "IE 10に未解決の脆弱性、悪用攻撃の発生でIE 11に更新を",
"link": "http://rss.rssad.jp/rss/artclk/VlF3IIxoZHoi/・・・",
"desc": "・・・"
},

•  I used the “title” and the “description” for classification
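
A small sketch of how saved entries might be turned into classifier input, assuming each file holds an array of objects with the `title`, `link`, and `desc` keys. The sample entries, the example.com links, and the method name `classification_texts` are hypothetical.

```ruby
require "json"

# Hypothetical contents of one saved file; real entries would come from RSS.
SAVED = <<~EOS
  [
    { "title": "Example headline", "link": "http://example.com/1", "desc": "Short summary" },
    { "title": "Another headline", "link": "http://example.com/2", "desc": "" }
  ]
EOS

# Join title and description into the text that is passed to the classifier.
def classification_texts(json_string)
  JSON.parse(json_string).map do |article|
    { link: article["link"],
      text: [article["title"], article["desc"]].compact.join(" ").strip }
  end
end

classification_texts(SAVED).each { |a| puts "#{a[:link]} -> #{a[:text]}" }
```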
11

Application	
•  Web-based
•  Ruby + Sinatra + jQuery mobile
•  Articles are sorted by $P(\mathrm{cat} \mid \mathrm{doc})$

•  The application displays the top 5 articles for each category
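
The ranking step could look roughly like this. It is a sketch under assumptions: each article is a hash with hypothetical `:title`, `:cat`, and `:score` keys, where the score is the posterior for the predicted category, and the values are made up.

```ruby
# Group scored articles by category and keep the highest-scoring ones.
def top_articles(scored, limit = 5)
  scored.group_by { |a| a[:cat] }
        .transform_values { |arts| arts.sort_by { |a| -a[:score] }.first(limit) }
end

# Made-up scores, for illustration only.
scored = [
  { title: "A", cat: "Tech", score: 0.91 },
  { title: "B", cat: "Tech", score: 0.99 },
  { title: "C", cat: "Life", score: 0.70 }
]
top_articles(scored).each do |cat, arts|
  puts "#{cat}: #{arts.map { |a| a[:title] }.join(', ')}"
end
```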
12

Discussion	
•  Most of the displayed articles matched my interests
•  Some articles were classified into the wrong categories
•  Possible reasons
•  No stemming
•  Few stop words
•  Is it appropriate to use $P(\mathrm{cat} \mid \mathrm{doc})$ as the score?
