Java Web Scraping
Sumant Kumar Raja
Agenda
  • What is web scraping?
  • Stages in web scraping
  • Useful APIs for web scraping
  • Limitations using above APIs
  • Defining I18n and L10n
  • I18n and L10n checkpoints
  • Before we end
  • And finally {}
What is web scraping?
Web scraping describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context. It is also called harvesting.
Stages in web scraping
  • Connect: connect to the remote site over HTTP or FTP.
  • Extract and Process: extract information from the website and process it into a useful data format.
  • Save: save the data in the desired format.
The Process stage consists of:
  • Filter: filter the useful data out of the source.
  • Format: convert the data to the format required by the user.
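The stages above can be sketched as one small Java pipeline. This is only an illustration under stated assumptions: the class name ScrapeStages, the sample page, and the "price" markup are hypothetical, the Connect stage is stubbed with a fixed string so the sketch runs offline, and a regular expression stands in for a real HTML parser such as the Jericho library.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScrapeStages {

    // Connect: stubbed with a fixed, made-up page so the sketch runs offline.
    // A real scraper would fetch this text over HTTP or FTP.
    static String connect() {
        return "<ul><li class=\"price\">Tea: 3.50</li>"
             + "<li class=\"ad\">Buy now!</li>"
             + "<li class=\"price\">Coffee: 4.25</li></ul>";
    }

    // Extract + Filter: keep only the list items marked as prices.
    static List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = Pattern.compile("<li class=\"price\">([^<]*)</li>").matcher(html);
        while (m.find()) {
            items.add(m.group(1));
        }
        return items;
    }

    // Format: turn "Tea: 3.50" into the CSV row "Tea,3.50".
    static String format(String item) {
        String[] parts = item.split(":\\s*", 2);
        return parts[0].trim() + "," + parts[1].trim();
    }

    // Save: write every formatted row to the given writer.
    static void save(List<String> items, Writer out) {
        try {
            for (String item : items) {
                out.write(format(item));
                out.write("\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        save(extract(connect()), out);
        System.out.print(out);
    }
}
```

Keeping each stage in its own method means the stubbed connect() can later be replaced by a real HTTP or FTP fetch without touching the filter, format, or save steps.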
Useful APIs for web scraping

Process              Type   API
Logging              NA     slf4j-api-1.5.2.jar, slf4j-log4j12-1.5.2.jar
Connection           FTP    commons-net-1.4.1.jar
Connection           HTTP   commons-httpclient-3.1.jar
Extract and process  PDF    FontBox-0.1.0.jar, PDFBox-0.7.3.jar
Extract and process  HTML   jericho-html-2.6.jar
Extract and process  CSV    javacsv.jar
Extract and process  Excel  jxl.jar, poi-3.1-FINAL-20080629.jar
Code quality                pmd-4.2.3.jar
Save                 CSV    javacsv.jar
Limitations using above APIs
  • Apache POI does not support extraction from older versions of Excel files. Use JExcel in place of POI.
  • PDFBox and FontBox failed to process PDFs that use certain encodings.
Defining I18n and L10n
  • I18n stands for Internationalization: the process of converting locale-dependent data into locale-independent data.
    Example: the date string 12-Mar-2008 can be saved as a date object. A date object is locale independent.
  • L10n stands for Localization: the process of converting data from one locale to another, or from a locale-independent format to a locale-dependent format.
    Example: converting the currency amount $1000 in the US locale to the equivalent amount in pounds for the UK locale.
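The date example can be sketched with the standard java.time API. This is a minimal sketch, assuming a Java 8 or later runtime; the class and method names (DateI18n, parseEnglish, formatFor) and the output pattern "d MMMM yyyy" are chosen for illustration and are not from the slides.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class DateI18n {

    // I18n direction: locale-dependent text -> locale-independent date object.
    static LocalDate parseEnglish(String text) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);
        return LocalDate.parse(text, fmt);
    }

    // L10n direction: date object -> text rendered for a chosen locale.
    static String formatFor(LocalDate date, Locale locale) {
        return date.format(DateTimeFormatter.ofPattern("d MMMM yyyy", locale));
    }

    public static void main(String[] args) {
        LocalDate date = parseEnglish("12-Mar-2008");
        System.out.println(date);                            // the stored, locale-independent value
        System.out.println(formatFor(date, Locale.ENGLISH)); // rendered for an English reader
        System.out.println(formatFor(date, Locale.FRENCH));  // rendered for a French reader
    }
}
```

Passing the locale explicitly to the formatter, rather than relying on the machine's default locale, keeps the result the same wherever the scraper runs.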
I18n and L10n checkpoints
Take care of the following points while scraping data from various locales:
  • Convert the number format. Example: in the Dutch locale, the number 100.000,95 is 100000.95.
  • Convert the date format. The date should be converted from one format to another; i.e., a date in the French locale should be converted into a date object.
  • Convert units of measure (length, area, weight, etc.).
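The three checkpoints can be sketched with the JDK's own locale support. This is a hedged illustration, not a complete conversion layer: the class and method names are hypothetical, the French date pattern "d MMMM yyyy" is an assumed input format, and the unit example covers only inches to centimetres (1 inch = 2.54 cm).

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class LocaleCheckpoints {

    // Number format: parse text using the grouping and decimal symbols of the locale.
    static double parseNumber(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number in " + locale + ": " + text, e);
        }
    }

    // Date format: parse a French date such as "12 mars 2008" into a date object.
    static LocalDate parseFrenchDate(String text) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.FRENCH));
    }

    // Unit of measure: convert a length in inches to centimetres.
    static double inchesToCentimetres(double inches) {
        return inches * 2.54;
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl-NL");
        System.out.println(parseNumber("100.000,95", dutch));
        System.out.println(parseFrenchDate("12 mars 2008"));
        System.out.println(inchesToCentimetres(10));
    }
}
```

Note that NumberFormat.parse stops at the first character it cannot interpret, so scraped text should be trimmed of labels and currency symbols before it is parsed.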
Before we end
Economize the internet round trip
  • Fetching data over HTTP/FTP is costly. Make the minimum number of round trips needed to get the data from the internet.
  • Write all data at the same time, as writing to disk is costly.
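Both points can be sketched in a few lines of Java. This is one possible approach under stated assumptions, not the method used in the talk: the class name RoundTripSaver is hypothetical, the fetcher function stands in for whatever HTTP or FTP client is in use, pages are cached in memory so a repeated URL costs no second round trip, and rows are written through a BufferedWriter so they reach the target in one batch.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class RoundTripSaver {

    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> fetcher; // hypothetical stand-in for an HTTP/FTP fetch
    private int roundTrips = 0;

    RoundTripSaver(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Return the page for a URL, going to the network only on the first request.
    String get(String url) {
        String page = cache.get(url);
        if (page == null) {
            page = fetcher.apply(url);
            roundTrips++;
            cache.put(url, page);
        }
        return page;
    }

    int roundTrips() {
        return roundTrips;
    }

    // Write all rows through one buffer and flush once at the end.
    static void saveAll(List<String> rows, Writer target) {
        try {
            BufferedWriter out = new BufferedWriter(target);
            for (String row : rows) {
                out.write(row);
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would construct it with a real fetch function, call get() for each URL it needs, collect the extracted rows in a list, and pass that list to saveAll() once with a file-backed writer.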
And finally {}
Some ready-to-use web scraping software: http://en.wikipedia.org/wiki/Web-scraping_software_comparison
Thank You
