Java Web Scraping

Java Web Scraping Presentation Transcript

  • 1. Java Web Scraping (Sumant Kumar Raja)
  • 2. Agenda
    • What is web scraping?
    • Stages in web scraping
    • Useful APIs for web scraping
    • Limitations of the above APIs
    • Defining I18n and L10n
    • I18n and L10n checkpoints
    • Before we end
    • And finally {}
  • 3. What is web scraping?
    • Web scraping describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context.
    • Also called web harvesting.
  • 4. Stages in web scraping
    • Connect: connect to the remote site over HTTP or FTP.
    • Extract and process: extract information from the website and process it into a useful data format.
    • Save: save the data in the desired format.
    • The process stage consists of:
      • Filter: filter the useful data out of the source.
      • Format: convert the data into the format required by the user.
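The stages above can be sketched as a chain of small methods. This is a minimal illustration rather than the presenter's code: the class name `ScrapePipeline`, the sample markup, and the regex-based extraction are all assumptions made so the sketch runs offline. A real scraper would fetch the page over HTTP or FTP and use an HTML parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScrapePipeline {

    // Connect: a fixed string stands in for the remote response (assumption),
    // so the example needs no network access.
    static String connect() {
        return "<ul><li> apple </li><li></li><li>Banana</li></ul>";
    }

    // Extract: collect the text of every <li> element.
    // A regex is enough for this toy input; real pages need an HTML parser.
    static List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = Pattern.compile("<li>(.*?)</li>").matcher(html);
        while (m.find()) {
            items.add(m.group(1));
        }
        return items;
    }

    // Filter: keep only entries that contain some text.
    static List<String> filter(List<String> raw) {
        List<String> kept = new ArrayList<>();
        for (String s : raw) {
            if (!s.trim().isEmpty()) {
                kept.add(s);
            }
        }
        return kept;
    }

    // Format: normalise each entry (trim whitespace, lower-case).
    static List<String> format(List<String> items) {
        List<String> out = new ArrayList<>();
        for (String s : items) {
            out.add(s.trim().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Save: write one entry per line to the target file.
    static void save(List<String> rows, Path target) throws IOException {
        Files.write(target, rows, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = format(filter(extract(connect())));
        Path out = Files.createTempFile("scrape", ".txt");
        save(rows, out);
        System.out.println(rows + " saved to " + out);
    }
}
```

Keeping each stage as a separate method makes it easy to swap the connect or save step without touching the filtering and formatting logic.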
  • 5. Useful APIs for web scraping
    • Connection
      • HTTP: commons-httpclient-3.1.jar
      • FTP: commons-net-1.4.1.jar
    • Extract and process
      • PDF: PDFBox-0.7.3.jar, FontBox-0.1.0.jar
      • HTML: jericho-html-2.6.jar
      • CSV: javacsv.jar
      • Excel: poi-3.1-FINAL-20080629.jar, jxl.jar
    • Save
      • CSV: javacsv.jar
    • Code quality: pmd-4.2.3.jar
    • Logging: slf4j-api-1.5.2.jar, slf4j-log4j12-1.5.2.jar
  • 6. Limitations using above APIs
    • Apache POI does not support extraction from older versions of Excel files; use JExcel (jxl.jar) in place of POI for those.
    • PDFBox and FontBox failed to process PDFs with certain encodings.
  • 7. Defining I18n and L10n
    • I18n stands for Internationalization
      • The process of converting locale-dependent data into locale-independent data.
      • Example: the date string 12-Mar-2008 can be saved as a date object, which is locale independent.
    • L10n stands for Localization
      • The process of converting data from one locale to another, or from a locale-independent format to a locale-dependent format.
      • Example: converting the amount $1000 in the US locale to the equivalent amount in pounds for the UK locale.
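The two definitions above can be illustrated with the standard `java.time` and `java.text` classes. This is a sketch under stated assumptions: the class name `I18nL10nDemo` is made up, and the exchange rate passed to `usdToUkText` is a hypothetical value supplied by the caller, since the slides give no rate.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class I18nL10nDemo {

    // I18n: turn locale-dependent text such as "12-Mar-2008" into a
    // locale-independent LocalDate.
    static LocalDate toDate(String text) {
        DateTimeFormatter english = DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);
        return LocalDate.parse(text, english);
    }

    // L10n: render a locale-independent date with the month name of the
    // target locale. The fixed pattern is a simplification; a full
    // localisation would also adapt the field order.
    static String toLocalText(LocalDate date, Locale target) {
        return DateTimeFormatter.ofPattern("d MMMM yyyy", target).format(date);
    }

    // L10n: convert a US dollar amount with a caller-supplied (hypothetical)
    // exchange rate and format the result as UK currency text.
    static String usdToUkText(BigDecimal usd, BigDecimal usdToGbpRate) {
        BigDecimal pounds = usd.multiply(usdToGbpRate);
        return NumberFormat.getCurrencyInstance(Locale.UK).format(pounds);
    }

    public static void main(String[] args) {
        LocalDate date = toDate("12-Mar-2008");
        System.out.println(date);                              // ISO form: 2008-03-12
        System.out.println(toLocalText(date, Locale.FRENCH));
        // 0.5 is a placeholder rate used only for illustration.
        System.out.println(usdToUkText(new BigDecimal("1000"), new BigDecimal("0.5")));
    }
}
```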
  • 8. I18n and L10n checkpoints
    • Take care of the following points while scraping data from various locales:
      • Convert the number format.
        • Example: the number 100.000,95 in the Dutch locale is 100000.95.
      • Convert the date format.
        • Dates should be converted from the source format into a common one, i.e. a date in the French locale should be converted into a date object.
      • Convert units of measure (length, area, weight, etc.).
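The checkpoints above might look like this in code. It is a sketch only: the class and method names are invented for illustration, the French input is assumed to use the "d MMMM yyyy" layout, and inches to centimetres is picked as one arbitrary unit conversion.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class LocaleCheckpoints {

    // Number format: parse text using the grouping and decimal separators
    // of the given locale and return an exact BigDecimal.
    static BigDecimal parseNumber(String text, Locale locale) {
        DecimalFormat fmt = new DecimalFormat("#,##0.###", DecimalFormatSymbols.getInstance(locale));
        fmt.setParseBigDecimal(true);
        ParsePosition pos = new ParsePosition(0);
        Number parsed = fmt.parse(text, pos);
        // Reject input that was not consumed completely.
        if (parsed == null || pos.getIndex() != text.length()) {
            throw new IllegalArgumentException("Not a number in " + locale + ": " + text);
        }
        return (BigDecimal) parsed;
    }

    // Date format: parse a date written with month names of the given locale
    // (assumed layout: day, full month name, year).
    static LocalDate parseDate(String text, Locale locale) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern("d MMMM yyyy", locale));
    }

    // Unit of measure: one inch is exactly 2.54 centimetres.
    static BigDecimal inchesToCentimetres(BigDecimal inches) {
        return inches.multiply(new BigDecimal("2.54"));
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl-NL");
        System.out.println(parseNumber("100.000,95", dutch));
        System.out.println(parseDate("12 mars 2008", Locale.FRENCH));
        System.out.println(inchesToCentimetres(new BigDecimal("10")));
    }
}
```

Parsing into `BigDecimal` and `LocalDate` keeps the stored values locale independent, so they can later be formatted for any target locale.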
  • 9. Before we end
    • Economize on internet round trips
      • Fetching data over HTTP/FTP is costly.
      • Make the minimum number of round trips needed to get the data from the internet.
    • Write all data at once, since writing to disk is costly.
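One way to follow both pieces of advice is to cache fetched pages and to buffer rows in memory until a single write. The sketch below is an assumption-laden illustration, not the presenter's implementation: `FrugalScraper` is a made-up name, and the `fetcher` function is a stand-in for whatever HTTP or FTP client actually performs the round trip.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class FrugalScraper {
    private final Function<String, String> fetcher;        // performs the real round trip
    private final Map<String, String> cache = new HashMap<>();
    private final List<String> pending = new ArrayList<>(); // rows waiting to be written
    private int roundTrips = 0;

    FrugalScraper(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Fetch a URL at most once; repeated requests are served from the cache.
    String fetch(String url) {
        return cache.computeIfAbsent(url, u -> {
            roundTrips++;
            return fetcher.apply(u);
        });
    }

    // Collect a row in memory instead of writing it immediately.
    void collect(String row) {
        pending.add(row);
    }

    // Write every collected row in one operation, then clear the buffer.
    void flush(Path target) throws IOException {
        Files.write(target, pending, StandardCharsets.UTF_8);
        pending.clear();
    }

    int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) throws IOException {
        // The lambda fakes a network call so the example runs offline.
        FrugalScraper scraper = new FrugalScraper(url -> "body of " + url);
        scraper.collect(scraper.fetch("http://example.com/a"));
        scraper.collect(scraper.fetch("http://example.com/a")); // cache hit, no round trip
        scraper.collect(scraper.fetch("http://example.com/b"));
        scraper.flush(Files.createTempFile("rows", ".txt"));
        System.out.println("round trips: " + scraper.roundTrips());
    }
}
```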
  • 10. And finally {}
    • Some ready-to-use web scraping software:
    • http://en.wikipedia.org/wiki/Web-scraping_software_comparison
  • 11. Thank You