Java Web Scraping

    Java Web Scraping - Presentation Transcript

    1. Java Web Scraping Sumant Kumar Raja
    2. Agenda
      • What is web scraping?
      • Stages in web scraping
      • Useful APIs for web scraping
      • Limitations of the above APIs
      • Defining I18n and L10n
      • I18n and L10n checkpoints
      • Before we end
      • And finally {}
    3. What is web scraping?
      • Web scraping describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context.
      • Also called harvesting
    4. Stages in web scraping
      • Connect: connect to the remote site over HTTP or FTP
      • Extract and process: extract information from the website and process it into a useful data format
      • Save: save the data in the desired format
      • The process stage consists of
        • Filter: filter the useful data out of the source
        • Format: convert the data to the format required by the user
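
      The stages above can be sketched with nothing but the JDK. This is an illustrative outline, not the presenter's code: it assumes Java 11+, uses java.net.http.HttpClient in place of commons-httpclient, and uses a regular expression over <td> cells as a stand-in for a real HTML parser such as Jericho. The class and method names (ScrapeStages, connect, filter, format, save) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class ScrapeStages {

    // Connect: fetch the page body over HTTP (JDK client, an assumption;
    // the slides list commons-httpclient for this stage).
    static String connect(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Filter: keep only the text inside <td> cells (assumed to be the useful data).
    static List<String> filter(String html) {
        Pattern cell = Pattern.compile("<td[^>]*>(.*?)</td>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        List<String> cells = new ArrayList<>();
        Matcher m = cell.matcher(html);
        while (m.find()) {
            cells.add(m.group(1).trim());
        }
        return cells;
    }

    // Format: turn the cells into one CSV record, quoting fields that need it.
    static String format(List<String> cells) {
        return cells.stream()
                .map(c -> c.contains(",") || c.contains("\"")
                        ? "\"" + c.replace("\"", "\"\"") + "\""
                        : c)
                .collect(Collectors.joining(","));
    }

    // Save: write all records to the target file in one call.
    static void save(Path target, List<String> records) {
        try {
            Files.write(target, records, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // An inline page stands in for connect(...) so the sketch runs offline.
        String html = "<table><tr><td>Widget</td><td>1,200</td></tr></table>";
        String record = format(filter(html));
        save(Files.createTempFile("scrape", ".csv"), List.of(record));
        System.out.println(record); // prints Widget,"1,200"
    }
}
```

      Keeping each stage in its own method means the connect step can be swapped (HTTP or FTP) without touching the filter, format, or save steps.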
    5. Useful APIs for web scraping
      (grouped by process; each entry is type: API)
      • Connection
        • HTTP: commons-httpclient-3.1.jar
        • FTP: commons-net-1.4.1.jar
      • Extract and process
        • PDF: FontBox-0.1.0.jar, PDFBox-0.7.3.jar
        • HTML: jericho-html-2.6.jar
        • CSV: javacsv.jar
        • Excel: jxl.jar, poi-3.1-FINAL-20080629.jar
      • Save
        • CSV: javacsv.jar
      • Code quality
        • pmd-4.2.3.jar
      • Logging
        • NA: slf4j-api-1.5.2.jar, slf4j-log4j12-1.5.2.jar
    6. Limitations using above APIs
      • Apache POI does not support extracting data from older versions of Excel files. Use JExcel (jxl.jar) instead of POI for those.
      • PDFBox and FontBox failed to process PDFs with certain encodings.
    7. Defining I18n and L10n
      • I18n stands for Internationalization
        • The process of converting locale-dependent data into locale-independent data
        • Example: the date string 12-Mar-2008 can be saved as a date object. A date object is locale independent.
      • L10n stands for Localization
        • The process of converting data from one locale to another, or from a locale-independent format to a locale-dependent format.
        • Example: converting the currency amount $1000 in the US locale to the equivalent amount in pounds in the UK locale.
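
      The date example above can be sketched as follows. This is a minimal illustration under assumptions: it uses java.time.LocalDate (Java 8+) as the locale-independent "date object", and the class and method names (DateLocaleDemo, toLocaleIndependent, toLocaleDependent) are hypothetical. The currency example is left out because it would need an exchange rate that the slides do not give.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

final class DateLocaleDemo {

    // I18n in the slides' sense: locale-dependent text -> locale-independent value.
    static LocalDate toLocaleIndependent(String text) {
        DateTimeFormatter english = DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);
        return LocalDate.parse(text, english);
    }

    // L10n in the slides' sense: locale-independent value -> text for a target locale.
    static String toLocaleDependent(LocalDate date, Locale target) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                .withLocale(target)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = toLocaleIndependent("12-Mar-2008");
        System.out.println(date); // ISO form: 2008-03-12
        // The exact localized wording depends on the JDK's locale data.
        System.out.println(toLocaleDependent(date, Locale.UK));
        System.out.println(toLocaleDependent(date, Locale.FRANCE));
    }
}
```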
    8. I18n and L10n checkpoints
      • Take care of the following points while scraping data from various locales
        • Convert the number format.
          • Example: the number 100.000,95 in the Dutch locale is 100000.95.
        • Convert the date format.
          • The date should be converted from one format to another, i.e. a date in the French locale should be converted into a date object.
        • Convert units of measure (length, area, weight, etc.)
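
      The three checkpoints above might look like this in code. It is a sketch under assumptions, not the presenter's implementation: it relies on the JDK's own locale data (java.text.NumberFormat and java.time), the French input "12 mars 2008" and the inch-to-centimetre conversion are made-up examples, and the names (LocaleCheckpoints, parseNumber, parseFrenchDate, inchesToCentimetres) are hypothetical.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class LocaleCheckpoints {

    static final Locale DUTCH = Locale.forLanguageTag("nl-NL");

    // Number format: read the text using the source locale's grouping and
    // decimal separators. Note that NumberFormat.parse stops at the first
    // character it cannot read, so trailing junk is silently ignored.
    static double parseNumber(String text, Locale source) {
        try {
            return NumberFormat.getNumberInstance(source).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number in " + source + ": " + text, e);
        }
    }

    // Date format: French text such as "12 mars 2008" -> locale-independent LocalDate.
    static LocalDate parseFrenchDate(String text) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.FRENCH));
    }

    // Unit of measure: one inch is defined as exactly 2.54 cm.
    static double inchesToCentimetres(double inches) {
        return inches * 2.54;
    }

    public static void main(String[] args) {
        System.out.println(parseNumber("100.000,95", DUTCH)); // 100000.95
        System.out.println(parseFrenchDate("12 mars 2008"));  // 2008-03-12
        System.out.println(inchesToCentimetres(1.0));         // 2.54
    }
}
```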
    9. Before we end
      • Economize on internet round trips
        • Fetching data over HTTP/FTP is costly.
        • Make the minimum number of round trips needed to get the data from the internet.
      • Write all the data at the same time, as writing to disk is costly.
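
      One possible way to apply both tips is to cache each fetched page in memory and to buffer output rows until a single write at the end. The sketch below is illustrative only: the fetcher is a hypothetical function from URL to page body (a real one would make the HTTP or FTP call), and the names (FrugalScraper, page, addRow, flush, roundTrips) are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class FrugalScraper {

    private final Function<String, String> fetcher; // hypothetical: URL -> page body
    private final Map<String, String> cache = new HashMap<>();
    private final List<String> pendingRows = new ArrayList<>();
    private int roundTrips = 0;

    FrugalScraper(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Fetch a page at most once; repeated requests are served from memory.
    String page(String url) {
        return cache.computeIfAbsent(url, u -> {
            roundTrips++;
            return fetcher.apply(u);
        });
    }

    // Buffer a row in memory instead of writing it straight away.
    void addRow(String row) {
        pendingRows.add(row);
    }

    // Write every buffered row in one call and return how many were written.
    int flush(Path target) {
        try {
            Files.write(target, pendingRows, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int written = pendingRows.size();
        pendingRows.clear();
        return written;
    }

    int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) throws IOException {
        // A fake fetcher keeps the sketch runnable without network access.
        FrugalScraper scraper = new FrugalScraper(url -> "<html>" + url + "</html>");
        scraper.page("http://example.com/a");
        scraper.page("http://example.com/a"); // served from the cache
        scraper.addRow("a,1");
        scraper.addRow("b,2");
        int written = scraper.flush(Files.createTempFile("rows", ".csv"));
        System.out.println(scraper.roundTrips() + " round trip, " + written + " rows written");
    }
}
```

      Buffering everything in memory only suits data sets that fit in memory; for larger runs, flushing in sizeable batches keeps the same spirit.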
    10. And finally {}
      • Some ready-to-use web scraping software:
      • http://en.wikipedia.org/wiki/Web-scraping_software_comparison
    11. Thank You

    Sumant Raja, 11 months ago

    © All Rights Reserved
