Web scraping describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context.
It is also called harvesting.
Stages in web scraping
Connect : connect to the remote site over HTTP or FTP
Extract and process : extract information from the website and process it into a useful data format
Save : save the data in the desired format
The process stage consists of:
Filter : filter the useful data out of the source
Format : convert the data into the format required by the user
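The filter, format and save steps above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name ScrapePipeline, the "name: value" input lines and the CSV output are hypothetical, and the input list stands in for content already fetched in the connect stage.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the Filter -> Format -> Save stages; all names are illustrative.
final class ScrapePipeline {

    // Filter: keep only lines that carry data (assumed here to be "name: value" lines).
    static List<String> filter(List<String> rawLines) {
        List<String> useful = new ArrayList<>();
        for (String line : rawLines) {
            if (line.contains(":")) {
                useful.add(line.trim());
            }
        }
        return useful;
    }

    // Format: turn "name: value" into a simple "name,value" row (no CSV quoting is attempted).
    static List<String> format(List<String> usefulLines) {
        List<String> rows = new ArrayList<>();
        for (String line : usefulLines) {
            int idx = line.indexOf(':');
            String name = line.substring(0, idx).trim();
            String value = line.substring(idx + 1).trim();
            rows.add(name + "," + value);
        }
        return rows;
    }

    // Save: write every row to the target, one row per line.
    static void save(List<String> rows, Writer out) {
        try {
            for (String row : rows) {
                out.write(row);
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for content fetched in the connect stage.
        List<String> raw = Arrays.asList("  Price: 100 ", "advertisement", "Name: Widget");
        StringWriter out = new StringWriter();
        save(format(filter(raw)), out);
        System.out.print(out);
    }
}
```

Keeping each stage as a separate method makes it possible to swap the source (HTML, PDF, Excel) or the target (CSV, Excel) without touching the other stages.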
Useful APIs for web scraping
The APIs are grouped by process, with the type of content each one handles in parentheses.
Connection : commons-httpclient-3.1.jar (HTTP), commons-net-1.4.1.jar (FTP)
Extract and process : jericho-html-2.6.jar (HTML); FontBox-0.1.0.jar and PDFBox-0.7.3.jar (PDF); jxl.jar and poi-3.1-FINAL-20080629.jar (Excel); javacsv.jar (CSV)
Save : javacsv.jar (CSV)
Logging : slf4j-api-1.5.2.jar, slf4j-log4j12-1.5.2.jar (type NA)
Code quality : pmd-4.2.3.jar
Limitations of the above APIs
Apache POI does not support extraction from older versions of Excel files. Use JExcel (jxl.jar) in place of POI for those files.
PDFBox and FontBox failed to process PDFs that use certain encodings.
Defining I18n and L10n
I18n stands for Internationalization (an I and an n with 18 letters between them).
It is the process of converting locale-dependent data into locale-independent data.
Example: the date string 12-Mar-2008 can be saved as a date object. A date object is locale independent.
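As a minimal sketch of this example, the string 12-Mar-2008 can be parsed into a java.time.LocalDate. The class name DateInternationalizer is hypothetical, and the sketch assumes the month abbreviation is English.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: turns a locale-dependent date string into a locale-independent object.
final class DateInternationalizer {

    // Pattern for strings such as "12-Mar-2008"; month names are assumed to be English.
    private static final DateTimeFormatter SOURCE_FORMAT =
            DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);

    // Throws DateTimeParseException if the text does not match the pattern.
    static LocalDate toDate(String text) {
        return LocalDate.parse(text, SOURCE_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(toDate("12-Mar-2008")); // prints 2008-03-12
    }
}
```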
L10n stands for Localization (an L and an n with 10 letters between them).
It is the process of converting data from one locale to another, or from a locale-independent format to a locale-dependent format.
Example: converting the amount $1000 in the US locale to the equivalent amount in pounds for the UK locale.
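The currency example could look roughly like the sketch below. The class name CurrencyLocalizer is hypothetical and the exchange rate of 0.50 is a made-up figure for illustration only; a real scraper would need a current rate from a trusted source.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical helper: converts an amount and renders it for a target locale.
final class CurrencyLocalizer {

    // Multiply by the exchange rate and round to two decimal places.
    static BigDecimal convert(BigDecimal amount, BigDecimal rate) {
        return amount.multiply(rate).setScale(2, RoundingMode.HALF_EVEN);
    }

    // Render the amount with the currency symbol and separators of the given locale.
    static String display(BigDecimal amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        BigDecimal usd = new BigDecimal("1000");
        BigDecimal rate = new BigDecimal("0.50"); // made-up USD to GBP rate, not real data
        BigDecimal gbp = convert(usd, rate);
        System.out.println(display(usd, Locale.US));
        System.out.println(display(gbp, Locale.UK));
    }
}
```

The conversion (a change of value) and the display (a change of representation) are kept as separate methods, since only the second one depends on the locale.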
I18n and L10n checkpoints
Take care of the following points while scraping data from various locales:
Convert the number format.
Example: the number 100.000,95 in the Dutch locale is 100000.95.
Convert the date format.
The date should be converted from one format to another. For example, a date in the French locale should be converted into a date object.
Convert units of measure (length, area, weight, etc.).
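The three checkpoints above can be sketched with the standard Java library. The class name LocaleCheckpoints, the sample inputs (a Dutch-formatted number such as 100.000,95 and a French date such as 12 mars 2008) and the choice of miles to kilometres are assumptions made for illustration.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helpers for the number, date and unit checkpoints.
final class LocaleCheckpoints {

    // One international mile is exactly 1.609344 km.
    static final double KM_PER_MILE = 1.609344;

    // Number checkpoint: parse text using the separators of the source locale.
    static double parseNumber(String text, Locale locale) {
        ParsePosition pos = new ParsePosition(0);
        Number value = NumberFormat.getInstance(locale).parse(text, pos);
        // Reject input that could not be parsed completely.
        if (value == null || pos.getIndex() != text.length()) {
            throw new IllegalArgumentException("Not a number in " + locale + ": " + text);
        }
        return value.doubleValue();
    }

    // Date checkpoint: parse text using the month names of the source locale.
    static LocalDate parseDate(String text, String pattern, Locale locale) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern, locale));
    }

    // Unit checkpoint: convert a length from miles to kilometres.
    static double milesToKilometres(double miles) {
        return miles * KM_PER_MILE;
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl-NL");
        System.out.println(parseNumber("100.000,95", dutch));
        System.out.println(parseDate("12 mars 2008", "d MMMM yyyy", Locale.FRENCH));
        System.out.println(milesToKilometres(10));
    }
}
```

Passing the source locale explicitly, rather than relying on the default locale of the machine, keeps the result the same wherever the scraper runs.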
Before we end
Economize on internet round trips
Fetching data over HTTP or FTP is costly.
Make the minimum number of round trips needed to get the data from the internet.
Write all the data at the same time, because writing to disk is also costly.
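One possible way to follow this advice is to cache each fetched page in memory and to write the collected rows in a single buffered pass. The sketch below is an assumption-laden illustration: CachingFetcher is a hypothetical class, and the remoteFetch function is a placeholder for whatever HTTP or FTP call the scraper really makes.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: fetch each URL at most once and write the output in one pass.
final class CachingFetcher {

    private final Function<String, String> remoteFetch; // placeholder for the real HTTP/FTP call
    private final Map<String, String> cache = new HashMap<>();
    private int roundTrips = 0;

    CachingFetcher(Function<String, String> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    // Return the cached body if present; otherwise make one round trip and remember the result.
    String fetch(String url) {
        String body = cache.get(url);
        if (body == null) {
            roundTrips++;
            body = remoteFetch.apply(url);
            cache.put(url, body);
        }
        return body;
    }

    int roundTrips() {
        return roundTrips;
    }

    // Write all rows through one buffered writer instead of issuing one write per row.
    static void writeAll(List<String> rows, Writer target) {
        try {
            BufferedWriter out = new BufferedWriter(target);
            for (String row : rows) {
                out.write(row);
                out.write('\n');
            }
            out.flush(); // the caller keeps ownership of the target and closes it
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, asking for the same URL twice costs a single round trip, and the disk is touched only when the buffer is flushed.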