Java Web Scraping


Java Web Scraping Presentation Transcript

  • 1. Java Web Scraping Sumant Kumar Raja
  • 2. Agenda
    • What is web scraping?
    • Stages in web scraping
    • Useful APIs for web scraping
    • Limitations of the above APIs
    • Defining I18n and L10n
    • I18n and L10n checkpoints
    • Before we end
    • And finally {}
  • 3. What is web scraping?
    • Web scraping describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context.
    • Also called web harvesting.
  • 4. Stages in web scraping
    • Connect: connect to the remote site over HTTP or FTP.
    • Extract and process: extract information from the website and process it into a useful data format.
    • Save: save the data in the desired format.
    • The process stage consists of:
      • Filter: filter the useful data out of the source.
      • Format: convert the data into the format required by the user.
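The stages above can be sketched with the Java standard library alone. This is a minimal illustration, not the presenter's code: the class name `ScrapePipeline`, its method names, the regex-based cell filter, and the UTF-8 assumption in `connect` are all hypothetical choices made for the example. A real scraper would normally use an HTML parser such as Jericho instead of a regular expression.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the connect / filter / format / save stages. */
final class ScrapePipeline {

    // Matches the text inside <td>...</td>; an assumption that the data sits in table cells.
    private static final Pattern CELL =
            Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Connect: fetch the page body over HTTP (assumes the page is UTF-8 encoded). */
    static String connect(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Filter: keep only the trimmed text of each table cell. */
    static List<String> filter(String html) {
        List<String> cells = new ArrayList<>();
        Matcher matcher = CELL.matcher(html);
        while (matcher.find()) {
            cells.add(matcher.group(1).trim());
        }
        return cells;
    }

    /** Format: join the filtered values into one CSV line, quoting every field. */
    static String format(List<String> cells) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            // Double any embedded quote so the field stays valid CSV.
            line.append('"').append(cells.get(i).replace("\"", "\"\"")).append('"');
        }
        return line.toString();
    }

    /** Save: write the formatted line to any Writer (file, in-memory buffer, ...). */
    static void save(String csvLine, Writer out) throws IOException {
        out.write(csvLine);
        out.write(System.lineSeparator());
    }
}
```

A caller would chain the stages, for example `save(format(filter(connect(url))), writer)`. Keeping each stage a separate method makes it easy to swap the regex filter for a proper parser later.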
  • 5. Useful APIs for web scraping
    • Connection
      • HTTP: commons-httpclient-3.1.jar
      • FTP: commons-net-1.4.1.jar
    • Extract and process
      • Excel: jxl.jar, poi-3.1-FINAL-20080629.jar
      • CSV: javacsv.jar
      • HTML: jericho-html-2.6.jar
      • PDF: FontBox-0.1.0.jar, PDFBox-0.7.3.jar
    • Save
      • CSV: javacsv.jar
    • Logging
      • slf4j-api-1.5.2.jar
      • slf4j-log4j12-1.5.2.jar
    • Code quality
      • pmd-4.2.3.jar
  • 6. Limitations using above APIs
    • Apache POI does not support extraction from older versions of Excel files. Use JExcel in place of POI for those.
    • PDFBox and FontBox fail to process PDFs that use certain encodings.
  • 7. Defining I18n and L10n
    • I18n stands for Internationalization
      • The process of converting locale-dependent data into locale-independent data.
      • Example: the date string 12-Mar-2008 can be parsed and stored as a Date object, which is locale independent.
    • L10n stands for Localization
      • The process of converting data from one locale to another, or from a locale-independent format to a locale-dependent format.
      • Example: converting the amount $1000 in the US locale to the equivalent amount in pounds for the UK locale.
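The date example above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's code: it uses `java.time.LocalDate` rather than the `java.util.Date` object the slide mentions, because `LocalDate` carries no time zone, and the class and method names are hypothetical. The currency example is not covered, since it would also need an exchange rate.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical sketch: I18n parses locale-dependent text, L10n formats it again. */
final class DateI18nExample {

    // Pattern for strings such as "12-Mar-2008"; month abbreviations are read as US English.
    private static final DateTimeFormatter SOURCE_FORMAT =
            DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.US);

    /** I18n: locale-dependent text -> locale-independent date value. */
    static LocalDate toLocaleIndependent(String text) {
        return LocalDate.parse(text, SOURCE_FORMAT);
    }

    /** L10n: locale-independent date value -> text written for the target locale. */
    static String toLocaleDependent(LocalDate date, Locale target) {
        return DateTimeFormatter.ofPattern("d MMMM yyyy", target).format(date);
    }
}
```

For instance, `toLocaleIndependent("12-Mar-2008")` yields the date 2008-03-12, and passing that value to `toLocaleDependent` with `Locale.UK` yields `12 March 2008`. Storing the parsed value and formatting only at output time keeps the scraped data independent of the source site's locale.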
  • 8. I18n and L10n checkpoints
    • Take care of the following points while scraping data from various locales:
      • Convert the number format.
        • Example: the number 100.000,95 in the Dutch locale is 100000.95.
      • Convert the date format.
        • Dates should be converted from the source format into a common one. For example, a date in the French locale should be converted into a Date object.
      • Convert units of measure (length, area, weight, etc.).
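The three checkpoints above might look like this in Java. It is a minimal sketch, not the presenter's code: the class `LocaleCheckpoints`, its method names, and the inch-to-centimetre conversion are hypothetical examples, and the date is parsed into a `java.time.LocalDate` rather than a `java.util.Date`.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helpers for normalizing scraped numbers, dates and units. */
final class LocaleCheckpoints {

    private static final double CENTIMETRES_PER_INCH = 2.54;

    /** Number format: parse text using the source locale's grouping and decimal symbols. */
    static BigDecimal parseNumber(String text, Locale source) throws ParseException {
        DecimalFormat format =
                new DecimalFormat("#,##0.###", DecimalFormatSymbols.getInstance(source));
        // Ask for BigDecimal results so the parsed value is not rounded to a double.
        format.setParseBigDecimal(true);
        return (BigDecimal) format.parse(text);
    }

    /** Date format: parse text using the month names of the source locale. */
    static LocalDate parseDate(String text, String pattern, Locale source) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern, source));
    }

    /** Unit of measure: convert a length in inches to centimetres. */
    static double inchesToCentimetres(double inches) {
        return inches * CENTIMETRES_PER_INCH;
    }
}
```

With the Dutch locale (`Locale.forLanguageTag("nl-NL")`), `parseNumber("100.000,95", ...)` gives the value 100000.95, and `parseDate("12 mars 2008", "d MMMM yyyy", Locale.FRENCH)` gives the date 2008-03-12. Note that `DecimalFormat.parse(String)` stops at the first character it cannot read, so stricter validation may be needed for untrusted input.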
  • 9. Before we end
    • Economize on Internet round trips
      • Fetching data over HTTP or FTP is costly.
      • Make the minimum number of round trips needed to get the data from the Internet.
    • Write all data at once, because writing to disk is costly.
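One way to follow both pieces of advice is to cache fetched pages in memory and to collect rows before writing them in a single buffered pass. The sketch below is only an illustration: `FrugalScraper` and its methods are hypothetical names, and the `fetcher` function stands in for whatever HTTP or FTP client performs the real request.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: fetch each URL once and write all rows in one pass. */
final class FrugalScraper {

    // Assumption: the fetcher performs the real (costly) network request.
    private final Function<String, String> fetcher;
    private final Map<String, String> cache = new HashMap<>();
    private final List<String> pendingRows = new ArrayList<>();

    FrugalScraper(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the page body, calling the fetcher only the first time a URL is seen. */
    String fetch(String url) {
        return cache.computeIfAbsent(url, fetcher);
    }

    /** Collects a row in memory instead of writing it immediately. */
    void addRow(String row) {
        pendingRows.add(row);
    }

    /** Writes every collected row through one buffer, then clears the pending list. */
    void flushTo(Writer out) throws IOException {
        BufferedWriter buffered = new BufferedWriter(out);
        for (String row : pendingRows) {
            buffered.write(row);
            buffered.newLine();
        }
        buffered.flush();
        pendingRows.clear();
    }
}
```

The in-memory cache trades memory for fewer round trips, so it suits runs where the same page is needed more than once. For very large result sets, flushing in batches rather than once at the end may be the better compromise.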
  • 10. And finally {}
    • Some ready-to-use web scraping software
  • 11. Thank You