Boilerpipe Integration & Improvement
Allan Huang @ esobi Inc.
Published in: Technology, Education

  1. Boilerpipe Integration & Improvement. Allan Huang @ esobi Inc.
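  2. Known Issues
     • The extracted article body is empty
     • The extracted article body is garbled
     • Special characters are garbled
     • The main body of the article is missing
     • Content unrelated to the article is included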
  3. Integration
     • Required parameters
         • The URL, or the full HTML
         • The href for the <base> tag
     • Optional parameters
         • Extractor: the Boilerpipe algorithm to use
         • Output Mode: HTML Extraction, HTML Highlighting, Plain Text, JSON
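  4. Improvement
     • Strengthened detection and handling of HTTP and HTML encodings
     • Added support for HTTP response decompression algorithms
     • Inserted a <base> tag so that images with relative paths display correctly
     • Upgraded to the latest Boilerpipe and the related nekohtml library
     • Test results
         • 150 news articles in total: 66 Traditional Chinese, 80 English, 2 Simplified Chinese
         • The current success rate is 94%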
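  5. Failure Cases
     • Only the HTML title is captured and the article body is not
         • 2 articles: 中時電子報 (China Times) and a Facebook timeline photo
     • The main body of the article is missing
         • 2 articles: the UrCosme beauty forum and 青年日報 (Youth Daily News)
     • JavaScript code or HTML escape characters are captured
         • 2 articles: 香港成報 (Sing Pao, Hong Kong) and The Wall Street Journal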
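  6. Solved Cases
     • The article body was frequently garbled
         • 2 articles, both from the headline news (焦點新聞) of 中時電子報
         • Cause: the full HTML could not be downloaded
     • Solution
         • Avoid Java PushbackStream. Download the full HTML in one pass, then sample the HTML string to determine the encoding of the full HTML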
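  7. Solved Cases
     • CJK special characters were garbled
         • 宏碁 R7 筆電 「星際爭霸戰」款限量出擊
         • 朱镕基退休前后“判若两人” 非常注重晚节
         • Cause: the character set that Java uses under the declared name lacks these special characters
     • Solution
         • Traditional Chinese: Big5-HKSCS in place of Big5
         • Simplified Chinese: GB18030 in place of GB2312
         • Japanese: Windows-31J in place of Shift_JIS
         • Korean: no case found yet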
  8. Algorithm Comparison
     • Structure retainment
     • Inner content cleaning
     • Implementation
     • Language dependency
     • Source parameter
     • Additional features and remarks
  9. Algorithm Comparison (table)
     • Boilerpipe
         Structure retainment: plain text only
         Inner content cleaning: uses a classifier to determine whether or not the atomic text block holds useful content
         Implementation: open source Java library
         Language dependency: should be language independent, since the text block classifier observes language-independent text features
         Source parameter: you can fetch documents yourself or use built-in utilities to fetch them for you
         Remarks: implements many extractors with different classification rules trained on different datasets
     • Alchemy API
         Structure retainment: text only (has an option to include relevant hyperlinks)
         Inner content cleaning: n/a
         Implementation: commercial web API
         Language dependency: observation: returns an error for non-English content, e.g. the document contains "unsupported text
         Source parameter: include the whole document in the POST request or provide a URL
         Remarks: extra API call to extract the title
     • Diffbot
         Structure retainment: plain text or HTML
         Inner content cleaning: an option to remove inline ads
         Implementation: web API (private beta)
         Language dependency: n/a
         Source parameter: via provided URL (does the fetching for you)
         Remarks: extracts relevant media, title, tags, XPath descriptor for wrappers, comments and comment count, article summary
     • Readability
         Structure retainment: retains original structure
         Inner content cleaning: uses hardcoded heuristics to extract content divided by ads
         Implementation: open source JavaScript bookmarklet
         Language dependency: language independent, but relies on language-dependent regular expressions to match id and class labels
         Source parameter: via browser
     • Goose
         Structure retainment: plain text
         Inner content cleaning: n/a
         Implementation: open source Java library
         Language dependency: language independent, but relies on language-dependent regular expressions to match id and class labels
         Source parameter: URL only (my fork enables you to fetch the document by yourself)
         Remarks: uses hardcoded heuristics to search for related images and embedded media
     • Extractiv
         Structure retainment: depends on the chosen output format, e.g. the XML format breaks the content
         Inner content cleaning: n/a
         Implementation: commercial web API
         Language dependency: n/a
         Source parameter: include the whole document in the POST request or provide a URL
         Remarks: capable of enriching the extracted text with semantic entities and relationships
     • Repustate API
         Structure retainment: plain text
         Inner content cleaning: n/a
         Implementation: commercial web API
         Language dependency: n/a
         Source parameter: URL only
     • Webstemmer
         Structure retainment: plain text
         Inner content cleaning: n/a
         Implementation: open source Python library
         Language dependency: language independent
         Source parameter: first runs a crawler to obtain seed pages, then learns layout patterns that are later put to work for extraction
         Remarks: the only piece of software on this list that requires a cluster of similar documents obtained by crawling
     • NCleaner (paper)
         Structure retainment: plain text
         Inner content cleaning: uses character-level n-grams to detect content text blocks
         Implementation: open source Perl library
         Language dependency: depends on the training language
         Source parameter: arbitrary HTML document
         Remarks: reliant on the lynx browser for converting HTML to structured plain text
  10. Reference
     • Evaluating Text Extraction Algorithms
     • List of resources: Article text extraction from HTML documents
     • Feature-wise Comparison of HTML Article Text Extractors
     • Overview: Extracting article text from HTML documents
     • Readability for Java - Snacktory
  11. Conclusion
     • Next step…
         • The article body extracted by Boilerpipe does not include image information
         • A cache mechanism, keyed by URL, for the full HTML or the extracted body
     • Q&A
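
Slide 4 lists support for HTTP response decompression but does not show how it was done. The following is a minimal sketch using only the JDK, assuming the response body has already been read into a byte array and that Java 9 or later is available (for InputStream.readAllBytes). The class name ResponseDecompressor and the method decode are hypothetical names chosen for illustration, not part of Boilerpipe.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: undoes the compression named in a Content-Encoding header.
final class ResponseDecompressor {
    private ResponseDecompressor() {}

    /** Returns the decompressed body; unknown or absent encodings return the input unchanged. */
    static byte[] decode(byte[] body, String contentEncoding) throws IOException {
        String enc = contentEncoding == null
                ? ""
                : contentEncoding.trim().toLowerCase(Locale.ROOT);
        switch (enc) {
            case "gzip":
            case "x-gzip":
                try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
                    return in.readAllBytes();
                }
            case "deflate":
                // Assumes the zlib-wrapped form; raw deflate streams are not handled here.
                try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(body))) {
                    return in.readAllBytes();
                }
            default:
                return body;
        }
    }
}
```

The decoded bytes would then go through encoding detection before being handed to the extractor.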
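
Slide 4 also mentions inserting a <base> tag so that images with relative paths resolve correctly. One way this could look is sketched below with a simple regex-based approach. It assumes the HTML is already decoded to a String; BaseTagInserter and insertBase are hypothetical names, and a real integration might do this on a parsed DOM instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: adds <base href="..."> right after the opening <head> tag.
final class BaseTagInserter {
    private static final Pattern HAS_BASE =
            Pattern.compile("<base\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    private BaseTagInserter() {}

    static String insertBase(String html, String baseHref) {
        // Leave the document alone if it already declares a base URL.
        if (HAS_BASE.matcher(html).find()) {
            return html;
        }
        String escaped = baseHref.replace("&", "&amp;").replace("\"", "&quot;");
        String tag = "<base href=\"" + escaped + "\">";
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            return html.substring(0, head.end()) + tag + html.substring(head.end());
        }
        // Simplification: with no <head>, prepend the tag and rely on lenient parsing.
        return tag + html;
    }
}
```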
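
The fix on slide 6 is to download the full HTML first and only then sample it to decide the encoding. A sketch of that idea follows. The lookup order (HTTP Content-Type header first, then a charset declaration in the first 4096 bytes, then UTF-8) and the sample size are assumptions made for this example, and CharsetSniffer and detect are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses the declared charset of a fully downloaded page.
final class CharsetSniffer {
    // Matches both <meta charset="..."> and content="text/html; charset=...".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final int SAMPLE_BYTES = 4096; // assumed sample size

    private CharsetSniffer() {}

    static String detect(byte[] fullHtml, String contentTypeHeader) {
        if (contentTypeHeader != null) {
            Matcher header = CHARSET.matcher(contentTypeHeader);
            if (header.find()) {
                return header.group(1);
            }
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
        int length = Math.min(fullHtml.length, SAMPLE_BYTES);
        String sample = new String(fullHtml, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = CHARSET.matcher(sample);
        return meta.find() ? meta.group(1) : "UTF-8";
    }
}
```

Because the whole page is already in memory, the sample can be re-read freely and no pushback stream is needed.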
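
The charset substitutions on slide 7 can be expressed as a small lookup table. The sketch below covers only the three replacements named on the slide and passes every other charset name through unchanged. CjkCharsetUpgrade and upgrade are hypothetical names, and Map.of assumes Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: swaps a declared CJK charset for the broader variant from slide 7.
final class CjkCharsetUpgrade {
    private static final Map<String, String> REPLACEMENT = Map.of(
            "big5", "Big5-HKSCS",        // Traditional Chinese
            "gb2312", "GB18030",         // Simplified Chinese
            "shift_jis", "windows-31j"); // Japanese

    private CjkCharsetUpgrade() {}

    /** Returns the replacement charset name, or the declared name if none is listed. */
    static String upgrade(String declared) {
        String key = declared.trim().toLowerCase(Locale.ROOT);
        return REPLACEMENT.getOrDefault(key, declared);
    }
}
```

A caller would typically decode with new String(bytes, Charset.forName(CjkCharsetUpgrade.upgrade(name))). Charset.forName throws UnsupportedCharsetException when the runtime does not ship the requested charset, so a fallback to the declared name may be worth adding.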
