ToxicDocs
Gautam Shine
ToxicDocs
Background
• Columbia and CPR have 4 million
newly declassified legal documents
• Of interest to journalists, historians,
attorneys, public health officials
• How can this repository be made
more accessible and useful?
• More generally, what can we do to get
insights out of a document dump?
ToxicDocs
I aim to:
1. Categorize documents into types
such as memo, ad, scientific study
2. Provide retrieval of documents
similar to a given one
3. Infer missing attributes (such as the
year) by parsing text
4. Visualize trends for topics over time
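
Goal 2 can be illustrated with a minimal sketch. It assumes bag-of-words count vectors compared by cosine similarity; the slides do not say which similarity measure was used, and the function names and sample texts below are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_id, docs, k=3):
    """Return the ids of the k documents most similar to docs[query_id]."""
    vecs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    query = vecs[query_id]
    scored = [(cosine(query, v), d) for d, v in vecs.items() if d != query_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical sample documents, not taken from the archive.
docs = {
    "memo_1": "Memo regarding vinyl chloride exposure limits at the plant",
    "memo_2": "Memo on exposure limits for vinyl chloride workers",
    "ad_1": "New floor tiles available in twelve colors",
}
nearest = most_similar("memo_1", docs, k=1)
```

Here `nearest` contains `"memo_2"`, which shares five tokens with the query, while the advertisement shares none.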
The Actual Data
DIVISION OF SAFETY ANIj . n -IENE
The Industrial Commission of Ohio 700
W. Third Avenue, Columbus, Ohio 43212
.e~2e70 ~ .e~.sslle
IlIdullri,1 COOt",lIlio.
M. IIOlLAWO nlSE ChI"m,.
mm l. tHOOU Go'ftrttt
""i,,,.DIYII;'" of
wllty ",41
101m t. wl.mEt MtIlMr
lHOI1W0,,,',I1f.11G,,.,U,,,U",GHU
~
~ ~L.yI0Il" P. SHEOIAN
OCR Quality
• Out-of-Vocabulary (OOV) text, i.e. meaningless
gibberish, is limited in most documents
• Most documents have >60% readable text
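
One way to estimate the readable share of a document is to count tokens found in a reference vocabulary. The sketch below assumes that definition; the slides do not state how readability was measured, and the tiny vocabulary is hypothetical.

```python
import re


def readable_fraction(text, vocabulary):
    """Fraction of alphabetic tokens that appear in the vocabulary."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in vocabulary)
    return known / len(tokens)


# Hypothetical vocabulary; a real one would be a full word list.
vocab = {"the", "industrial", "commission", "of", "ohio", "division", "safety"}

clean = "The Industrial Commission of Ohio"
noisy = "IlIdullri COOt lIlio Commission of Ohio"

clean_score = readable_fraction(clean, vocab)  # every token is in vocab
noisy_score = readable_fraction(noisy, vocab)  # three of six tokens are in vocab
```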
Classification Task
• 50k documents
• 1k randomly sampled and labeled
• 15 unbalanced classes
• Documents can be:
• subtle
• handwritten
• illustrations
• not in English
• blank (?)
Classification Algorithm
Model
• Linear kernel one-v-all SVM (liblinear)
• Squared hinge loss + L2 regularization
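
For reference, a linear SVM with squared hinge loss and L2 regularization is usually written as the objective below. The symbols (w for the weight vector, C for the penalty parameter, x_i and y_i for the feature vectors and their ±1 labels) are notation introduced here, not taken from the slides. In a one-v-all setup, one weight vector w_k would be trained per class and the class with the largest score chosen.

```latex
\min_{w} \; \frac{1}{2}\, w^{\top} w
  \;+\; C \sum_{i=1}^{n} \max\!\bigl(0,\; 1 - y_i\, w^{\top} x_i \bigr)^{2},
\qquad
\hat{y}(x) \;=\; \arg\max_{k} \; w_k^{\top} x
```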
Performance
• Accuracy: 62% (top 3: 88%), mean F1: 0.57
• Baseline: 15% using regular expressions
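
The two metrics can be computed as in the sketch below. It assumes "mean F1" means the unweighted (macro) average of per-class F1 scores, which the slides do not confirm; the labels and rankings are hypothetical.

```python
def top_k_accuracy(y_true, ranked_preds, k=3):
    """Share of documents whose true label is among the k best-ranked guesses."""
    hits = sum(1 for y, ranked in zip(y_true, ranked_preds) if y in ranked[:k])
    return hits / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels and per-document class rankings (best guess first).
y_true = ["memo", "ad", "study", "memo"]
ranked = [
    ["memo", "ad", "study"],
    ["memo", "ad", "study"],
    ["memo", "study", "ad"],
    ["ad", "study", "memo"],
]
top1 = [r[0] for r in ranked]

acc_top1 = top_k_accuracy(y_true, ranked, k=1)
acc_top2 = top_k_accuracy(y_true, ranked, k=2)
f1 = macro_f1(y_true, top1)
```

With unbalanced classes, a macro average weights rare classes as heavily as common ones, which may explain why the reported mean F1 sits below the accuracy.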
Features
• Sublinear TF-IDF on n-gram features (56%)
• +6% from document length, NER, and
semi-supervised label propagation
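
Sublinear TF-IDF replaces the raw term count tf with 1 + log(tf), so repeated terms add weight more slowly. The sketch below is one common variant (unsmoothed IDF, unigrams plus bigrams, L2-normalized vectors); the exact settings used in the project are not given in the slides, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """All word n-grams of length 1..n_max, joined with spaces."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def sublinear_tfidf(docs, n_max=2):
    """Return one L2-normalized {term: weight} dict per document."""
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Sublinear term frequency times inverse document frequency.
        vec = {t: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[t])
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


# Hypothetical two-document corpus.
vecs = sublinear_tfidf(["vinyl vinyl vinyl chloride", "vinyl memo"])
```

In this example "vinyl" occurs in both documents and therefore gets weight zero, while the bigram "vinyl vinyl" (count 2) weighs 1 + log 2 times as much as "chloride" (count 1).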
Visual Features
• Visual structure contains independent information
• 1-layer feed-forward network gets an accuracy of 15%
• Recent advances allow OCR with probabilistic
models for language using recurrent networks
[Figure: page image shown at 1000 x 1000, 100 x 100, and 10 x 10]
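
Assuming the three sizes are resolutions of the same page image, reducing resolution by block averaging could look like the sketch below. The slides do not describe the preprocessing that was actually used; the function names and the 4 x 4 sample page are hypothetical.

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2-D grayscale image."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def flatten(image):
    """Turn a 2-D image into a flat feature vector for a classifier."""
    return [pixel for row in image for pixel in row]


# Hypothetical 4 x 4 "page" with a checkerboard of dark and light blocks.
page = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
small = downsample(page, 2)
features = flatten(small)
```

A coarse grid of this kind keeps only the layout of dark and light regions, which is the visual structure the slide refers to.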
Putting it to Use
• A search for “vinyl” (H2C=CHCl) reveals the following trend
in the composition of documents over time
• Time (x) is inferred and type (y) is predicted
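
The two axes can be sketched as follows, assuming the year is taken to be the most frequent four-digit year in the text (ties go to the earliest) and that each document already carries a predicted type. The slides say the year is inferred by parsing text but not how; the function names and sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Four-digit years from 1900 to 2099; longer digit runs such as ZIP codes do not match.
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")


def infer_year(text):
    """Most frequent four-digit year in the text, or None if there is none."""
    years = YEAR_RE.findall(text)
    if not years:
        return None
    counts = Counter(years)
    best = max(counts.values())
    return int(min(y for y, n in counts.items() if n == best))


def composition_by_year(docs, query):
    """Count predicted types per inferred year for documents matching the query."""
    table = defaultdict(Counter)
    for text, doc_type in docs:
        if query.lower() not in text.lower():
            continue
        year = infer_year(text)
        if year is not None:
            table[year][doc_type] += 1
    return {year: dict(types) for year, types in sorted(table.items())}


# Hypothetical (text, predicted type) pairs.
docs = [
    ("Memo, January 1974: vinyl chloride exposure among workers", "memo"),
    ("Vinyl flooring advertisement, 1965", "ad"),
    ("Study of vinyl chloride in animals, 1971. Revised 1971.", "scientific study"),
    ("Memo dated 1974 on asbestos", "memo"),
    ("Vinyl chloride memo with no date", "memo"),
]
trend = composition_by_year(docs, "vinyl")
```

Documents without a recoverable year are dropped here; another option would be to keep them in a separate "unknown" bucket.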
Evolution of a Crisis
• chemicalindustryarchives.org/dirtysecrets/
• “By 1971 the industry knew without doubt that vinyl
chloride caused cancer in animals.”
Evolution of a Crisis
• “In January 1974, B.F. Goodrich announced the presence of
a rare liver cancer, angiosarcoma, in its polyvinyl chloride
workers at its Louisville plant.”
Evolution of a Crisis
• “In May of 1974, the Occupational Safety and Health
Administration (OSHA) proposed a maximum exposure level
for vinyl chloride at a no detectable level”
toxicdocs.org
• Launching some time after the election
• This work will be scaled and integrated
• github.com/GautamShine/toxic-docs
About me:
• Ph.D. candidate at Stanford in electrical engineering
• Interned at an AI startup (ML/NLP) and twice at Intel (supercomputing)
• Outdoor enthusiast (climbing, diving) and backpacker (30+ countries)
Semi-supervised Learning
• General idea: make use of unlabeled data
• For SVMs, distance from separating hyperplane
serves as prediction confidence
• Procedure used:
1. h_1 = argmax f(X_train)
2. ŷ_unlabeled = h_1(X_unlabeled)
3. X_absorbed = {x ∈ X_unlabeled | h_1(x) > C}
4. h_2 = argmax f(X_train + X_absorbed)
5. ŷ_test = h_2(X_test)
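
The procedure can be sketched in a few lines. To stay dependency-free, this sketch swaps the SVM for a stand-in linear classifier (the hyperplane bisecting the two class centroids) and handles only two classes; the threshold plays the role of C in step 3. All function names and data points are hypothetical.

```python
import math


def fit_linear(X, y):
    """Stand-in for the SVM: unit-normal hyperplane bisecting the class centroids.

    Assumes both classes are present and their centroids differ.
    """
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wj * mj for wj, mj in zip(w, mid))
    return w, b


def signed_distance(model, x):
    """Signed distance from x to the hyperplane; its size is the confidence."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def self_train(X_train, y_train, X_unlabeled, threshold):
    h1 = fit_linear(X_train, y_train)                        # step 1
    scores = [signed_distance(h1, x) for x in X_unlabeled]   # step 2
    absorbed = [(x, 1 if s > 0 else -1)                      # step 3
                for x, s in zip(X_unlabeled, scores) if abs(s) > threshold]
    X_aug = X_train + [x for x, _ in absorbed]
    y_aug = y_train + [label for _, label in absorbed]
    h2 = fit_linear(X_aug, y_aug)                            # step 4
    return h2, absorbed


# Hypothetical 2-D data: two labeled points, three unlabeled ones.
X_train, y_train = [[2.0, 0.0], [-2.0, 0.0]], [1, -1]
X_unlabeled = [[3.0, 1.0], [0.1, 5.0], [-4.0, -1.0]]
h2, absorbed = self_train(X_train, y_train, X_unlabeled, threshold=1.0)
prediction = 1 if signed_distance(h2, [1.0, 0.0]) > 0 else -1  # step 5
```

The point at distance 0.1 from the first hyperplane is left out because its confidence is below the threshold; only the two confident points are absorbed with their predicted labels.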
Kernels for Text
• n-gram space is high-d and sparse
• Most points lie in a low-d subspace
• Data is often trivially separable
• Projection to higher dimensions doesn’t
reduce bias much
• But pays a price with higher variance
• Big speed difference in NLP problems
• Linear kernel is O(p) from SGD on hinge loss
• Non-linear kernels are O(p^2) from
coordinate ascent on the Lagrange dual
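
The linear-kernel cost can be seen in a minimal sketch of stochastic subgradient descent on the L2-regularized hinge loss: each update touches every coordinate of one example once, so it costs O(p) for p features. The learning rate, regularization strength, epoch count, and data below are hypothetical choices, and plain hinge loss is used as on this slide rather than the squared hinge of the main model.

```python
import random


def sgd_hinge(X, y, epochs=20, lr=0.1, lam=0.01, seed=0):
    """Train a linear classifier (no bias term) by SGD on regularized hinge loss."""
    rng = random.Random(seed)
    p = len(X[0])
    w = [0.0] * p
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # One dot product: O(p).
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Subgradient of lam/2 * ||w||^2 + max(0, 1 - margin): O(p).
            for j in range(p):
                loss_grad = -y[i] * X[i][j] if margin < 1 else 0.0
                w[j] -= lr * (lam * w[j] + loss_grad)
    return w


# Hypothetical linearly separable data with labels +1 / -1.
X = [[1.0, 0.0], [2.0, 0.5], [-1.0, 0.0], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w = sgd_hinge(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in X]
```

With sparse n-gram vectors the same update only needs to visit the non-zero features of each document, which is part of why linear models train quickly on text.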
