Java Web Scraping


Published in: Technology, Education


  1. 1. Java Web Scraping Sumant Kumar Raja
  2. 2. Agenda <ul><li>What is web scraping? </li></ul><ul><li>Stages in web scraping </li></ul><ul><li>Useful APIs for web scraping </li></ul><ul><li>Limitations of the above APIs </li></ul><ul><li>Defining I18n and L10n </li></ul><ul><li>I18n and L10n checkpoints </li></ul><ul><li>Before we end </li></ul><ul><li>And finally {} </li></ul>
  3. 3. What is web scraping? <ul><li>Web scraping describes any of various means of extracting content from a website over HTTP in order to transform that content into another format suitable for use in another context. </li></ul><ul><li>Also called web harvesting </li></ul>
  4. 4. Stages in web scraping <ul><li>Connect: connect to the remote site over HTTP or FTP </li></ul><ul><li>Extract and process: extract information from the website and process it into a useful data format </li></ul><ul><li>Save: save the data in the desired format </li></ul><ul><li>The process stage consists of: </li></ul><ul><ul><li>Filter: keep only the useful data from the source </li></ul></ul><ul><ul><li>Format: convert the data to the format the user requires </li></ul></ul>
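
The stages above can be sketched in plain Java. This is only a minimal illustration, not the deck's own implementation: it assumes Java 11+ (for `java.net.http`), it uses a crude regular expression where a real scraper would use an HTML parser, and the class name `ScrapePipeline`, the choice of `<li>` elements as the "useful data", and the quoted one-column CSV output are all hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pipeline: connect -> filter -> format -> save.
final class ScrapePipeline {
    // Assumption: the data of interest lives in <li> elements.
    private static final Pattern LIST_ITEM =
            Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Connect: fetch the page body over HTTP.
    static String connect(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Filter: keep only the contents of the list items.
    static List<String> filter(String html) {
        List<String> items = new ArrayList<>();
        Matcher matcher = LIST_ITEM.matcher(html);
        while (matcher.find()) {
            items.add(matcher.group(1));
        }
        return items;
    }

    // Format: strip nested tags, trim, drop blanks, and quote each value for CSV.
    static List<String> format(List<String> items) {
        List<String> rows = new ArrayList<>();
        for (String item : items) {
            String text = item.replaceAll("<[^>]+>", "").trim();
            if (!text.isEmpty()) {
                rows.add("\"" + text.replace("\"", "\"\"") + "\"");
            }
        }
        return rows;
    }

    // Save: write every row to the target file in a single call.
    static void save(List<String> rows, Path target) throws IOException {
        Files.write(target, rows, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // args[0] = URL to scrape, args[1] = output file (both supplied by the caller).
        String html = connect(args[0]);
        save(format(filter(html)), Path.of(args[1]));
    }
}
```

Keeping each stage in its own method means the filter and format steps can be exercised on a saved HTML string without touching the network.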
  5. 5. Useful APIs for web scraping <ul><li>Connection </li></ul><ul><ul><li>HTTP: commons-httpclient-3.1.jar </li></ul></ul><ul><ul><li>FTP: commons-net-1.4.1.jar </li></ul></ul><ul><li>Extract and process </li></ul><ul><ul><li>PDF: PDFBox-0.7.3.jar, FontBox-0.1.0.jar </li></ul></ul><ul><ul><li>HTML: jericho-html-2.6.jar </li></ul></ul><ul><ul><li>CSV: javacsv.jar </li></ul></ul><ul><ul><li>Excel: poi-3.1-FINAL-20080629.jar, jxl.jar </li></ul></ul><ul><li>Save </li></ul><ul><ul><li>CSV: javacsv.jar </li></ul></ul><ul><li>Logging </li></ul><ul><ul><li>slf4j-api-1.5.2.jar, slf4j-log4j12-1.5.2.jar </li></ul></ul><ul><li>Code quality </li></ul><ul><ul><li>pmd-4.2.3.jar </li></ul></ul>
  6. 6. Limitations of the above APIs <ul><li>Apache POI cannot extract data from older versions of Excel files. Use JExcel (jxl.jar) in place of POI for those. </li></ul><ul><li>PDFBox and FontBox failed to process PDFs that use certain encodings. </li></ul>
  7. 7. Defining I18n and L10n <ul><li>I18n stands for Internationalization </li></ul><ul><ul><li>The process of converting locale-dependent data into locale-independent data </li></ul></ul><ul><ul><li>Example: the date string 12-Mar-2008 can be saved as a date object. A date object is locale independent. </li></ul></ul><ul><li>L10n stands for Localization </li></ul><ul><ul><li>The process of converting data from one locale to another, or from a locale-independent format to a locale-dependent format. </li></ul></ul><ul><ul><li>Example: converting $1000 in the US locale to the equivalent amount in pounds for the UK locale. </li></ul></ul>
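
The two directions described above might look like this in Java. It is a sketch under stated assumptions: the scraped date is assumed to use the English pattern `dd-MMM-yyyy`, the class and method names are hypothetical, and the exchange rate is a caller-supplied placeholder, since the slide does not say where a rate would come from.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical helper contrasting I18n (string -> neutral object)
// with L10n (neutral object -> locale-specific string).
final class I18nL10nDemo {
    // Assumption: scraped dates look like "12-Mar-2008" with English month names.
    private static final DateTimeFormatter SOURCE_FORMAT =
            DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);

    // I18n: turn the locale-dependent string into a locale-independent LocalDate.
    static LocalDate internationalize(String scrapedDate) {
        return LocalDate.parse(scrapedDate, SOURCE_FORMAT);
    }

    // L10n: render the locale-independent date for a target locale.
    static String localize(LocalDate date, Locale target) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                .withLocale(target)
                .format(date);
    }

    // L10n for money: convert with a caller-supplied rate, then format for the target locale.
    static String localizeAmount(double amount, double exchangeRate, Locale target) {
        return NumberFormat.getCurrencyInstance(target).format(amount * exchangeRate);
    }

    public static void main(String[] args) {
        LocalDate date = internationalize("12-Mar-2008");
        System.out.println(localize(date, Locale.FRANCE));
        // 0.5 is a made-up rate used purely for illustration.
        System.out.println(localizeAmount(1000.0, 0.5, Locale.UK));
    }
}
```

Note that formatting alone only changes the currency symbol and separators; the value itself changes only through the explicit rate.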
  8. 8. I18n and L10n checkpoints <ul><li>Take care of the following points while scraping data from various locales </li></ul><ul><ul><li>Convert the number format. </li></ul></ul><ul><ul><ul><li>Example: the number written as 100.000,95 in the Dutch locale is 100000.95. </li></ul></ul></ul><ul><ul><li>Convert the date format. </li></ul></ul><ul><ul><ul><li>The date should be converted from one format to another. For example, a date in the French locale should be parsed into a date object. </li></ul></ul></ul><ul><ul><li>Convert units of measure (length, area, weight, etc.) </li></ul></ul>
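
Each checkpoint above maps onto a small conversion step. The following is an illustrative sketch only: the class name, the French date pattern `d MMMM yyyy`, and the choice of miles to kilometres as the unit conversion are assumptions, not something the slides specify.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helpers for the three checkpoints: numbers, dates, units.
final class LocaleCheckpoints {
    // Assumption: French dates are scraped in the form "12 mars 2008".
    private static final DateTimeFormatter FRENCH_DATE =
            DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.FRENCH);

    // Number format: parse text using the separators of the source locale.
    static double parseNumber(String text, Locale source) {
        try {
            return NumberFormat.getNumberInstance(source).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number in " + source + ": " + text, e);
        }
    }

    // Date format: parse a French date string into a locale-independent LocalDate.
    static LocalDate parseFrenchDate(String text) {
        return LocalDate.parse(text, FRENCH_DATE);
    }

    // Unit of measure: one international mile is exactly 1.609344 km.
    static double milesToKilometres(double miles) {
        return miles * 1.609344;
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl-NL");
        System.out.println(parseNumber("100.000,95", dutch));
        System.out.println(parseFrenchDate("12 mars 2008"));
        System.out.println(milesToKilometres(26.2));
    }
}
```

One caveat worth knowing: `NumberFormat.parse` is lenient and stops at the first character it cannot interpret, so scraped values may need validation before parsing.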
  9. 9. Before we end <ul><li>Economize on internet round trips </li></ul><ul><ul><li>Fetching data over HTTP/FTP is costly. </li></ul></ul><ul><ul><li>Make the minimum number of round trips needed to get the data from the internet. </li></ul></ul><ul><li>Write all the data in one go, because writing to disk is costly. </li></ul>
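
Both pieces of advice can be illustrated with a small sketch: an in-memory cache so that each URL is fetched at most once, and a single bulk write instead of one write per record. The class name `FrugalScraper` and the pluggable `fetcher` function are hypothetical; a real fetcher would perform the HTTP or FTP call.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: caches fetched pages and writes results in one batch.
final class FrugalScraper {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> fetcher;
    private int roundTrips = 0;

    // The fetcher is whatever actually goes to the network (assumed, supplied by the caller).
    FrugalScraper(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Fetch a URL only the first time it is requested; later calls hit the cache.
    String get(String url) {
        return cache.computeIfAbsent(url, u -> {
            roundTrips++;
            return fetcher.apply(u);
        });
    }

    int roundTrips() {
        return roundTrips;
    }

    // Write all rows with one call rather than opening the file per record.
    static void writeAll(Path target, List<String> rows) {
        try {
            Files.write(target, rows, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cache here lives only for the lifetime of the object and is not thread-safe; a long-running or concurrent scraper would need an eviction policy and a concurrent map.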
  10. 10. And finally {} <ul><li>Some ready-to-use web scraping software </li></ul>
  11. 11. Thank You