Deep Information and Extraction Tool
Thomas Martinuzzo, Jr. Eng.
What is DIET?
DIET is an information extraction and manipulation tool
DIET can extract information from the DEEP web by understanding
Surface Web: 20 billion pages indexed by search engines
Deep Web: over 600 billion pages
"The 60 largest Deep Web sources contain 84 billion pages of content. That's about 750 terabytes of information, sufficient by themselves to exceed the size of the surface Web by 40 times." (Brightplanet.com)
Picture from Maxumowners.org
DIET Features & Benefits
Uses artificial intelligence to build wrappers automatically
Little to no user intervention
Users can easily extract and manipulate information
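To make the term "wrapper" concrete, here is a minimal hand-written sketch in Python. It is not DIET's method: DIET is described as building wrappers automatically, whereas this rule is written by hand. The HTML snippet, the ROW pattern and the wrap function are all hypothetical, chosen only to show how a wrapper maps one site's markup to structured records.

```python
import re

# Hypothetical listing markup. Real pages differ from site to site,
# which is why each site needs its own wrapper.
PAGE = """
<ul>
  <li class="car"><b>Acura MDX</b> <i>2005</i> <span>$31,500</span></li>
  <li class="car"><b>Honda Pilot</b> <i>2004</i> <span>$27,800</span></li>
</ul>
"""

# One extraction rule for this (assumed) page layout.
ROW = re.compile(
    r'<li class="car"><b>(?P<name>[^<]+)</b>\s*'
    r'<i>(?P<year>\d{4})</i>\s*'
    r'<span>\$(?P<price>[\d,]+)</span></li>'
)

def wrap(html):
    """Turn every matching row of the page into a record (a dict)."""
    records = []
    for match in ROW.finditer(html):
        records.append({
            "name": match.group("name").strip(),
            "year": int(match.group("year")),
            "price": int(match.group("price").replace(",", "")),
        })
    return records

records = wrap(PAGE)
```

A rule like this breaks as soon as the page layout changes, which is the maintenance cost that automatic wrapper generation is meant to remove.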
Car website:
Characteristics: List of cars by name with description, date, price, picture … Over 100 pages of data!
Problem: No local search engine.
But … I am looking for an Acura MDX 2005 or something like it!
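Once the car listings are available as records, the missing search becomes a simple filter. The sketch below assumes the extraction step has already produced the records. The Car fields, the sample values and the search function are hypothetical illustrations, not DIET's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Car:
    name: str
    year: int
    price: float

# Hypothetical records standing in for the output of an extraction pass.
cars = [
    Car("Acura MDX", 2005, 31500.0),
    Car("Acura MDX", 2003, 24900.0),
    Car("Honda Pilot", 2005, 27800.0),
    Car("Acura TL", 2005, 29900.0),
]

def search(records, keyword, year=None, year_tolerance=0):
    """Return records whose name contains keyword, optionally near a given year."""
    keyword = keyword.lower()
    hits = []
    for car in records:
        if keyword not in car.name.lower():
            continue
        if year is not None and abs(car.year - year) > year_tolerance:
            continue
        hits.append(car)
    return hits

# The exact request, then a looser "something like that" query.
exact = search(cars, "acura mdx", year=2005)
similar = search(cars, "acura", year=2005, year_tolerance=1)
```

The looser query widens the keyword and allows a one-year tolerance, which is one simple way to read "something like that".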
Job website:
Characteristics: List of jobs by title with a short description, salary, city. Over 800 jobs. Local search engine. Sort capabilities.
Problem: We can only see 10 jobs per page. Unable to search by salary range. Unable to sort by city.
But … I want to see all jobs over $75,000 on a single page and save it for later reference.
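The job scenario can be sketched the same way: filter by salary, sort by city, and save everything as one document. The sketch assumes the jobs have already been extracted. The Job fields, the sample data and both function names are hypothetical, and CSV is just one convenient format for saving the result.

```python
import csv
import io
from dataclasses import dataclass, asdict

@dataclass
class Job:
    title: str
    salary: int
    city: str

# Hypothetical records standing in for the 800+ extracted jobs.
jobs = [
    Job("Java Developer", 82000, "Montreal"),
    Job("Web Designer", 55000, "Quebec"),
    Job("Architect", 95000, "Laval"),
    Job("DBA", 75000, "Montreal"),
    Job("Project Manager", 90000, "Montreal"),
]

def jobs_over(records, min_salary):
    """Keep jobs paying strictly more than min_salary, sorted by city, then by salary (highest first)."""
    selected = [job for job in records if job.salary > min_salary]
    return sorted(selected, key=lambda job: (job.city, -job.salary))

def to_csv(records):
    """Serialize all records into a single CSV document, with no pagination."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "salary", "city"])
    writer.writeheader()
    for job in records:
        writer.writerow(asdict(job))
    return buffer.getvalue()

well_paid = jobs_over(jobs, 75000)
saved = to_csv(well_paid)
```

Writing the saved string to a file keeps the result for later, independently of the site's 10-jobs-per-page limit.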
DIET Core Web Services
Accessible only to certified clients
DIET Web Application
User and service managers
Web based application (JSP/Servlet/JavaServer Faces/JavaBean)
Based on Java EE 5/GlassFish/MySQL technology
List of new technologies grouped by domain
Simple search engine available
We want to extract and then manipulate all available technologies