Information Mediation: Integrating Information from Multiple ...
Transcript

  • 1. Information Mediation: Integrating Information from Multiple Information Sources
    • Naveen Ashish, Amit P. Sheth
    • Department of Computer Science and Large Scale Distributed Information Systems Lab, University of Georgia, Athens
  • 2. What is an Information Agent/Mediator?
    • A software system that provides integrated and structured query access to multiple distributed information sources
    • Sources may be databases of various kinds or Web sources
    • Sources are autonomously created and heterogeneous
    • Accessible via a network
    • Mediator provides the illusion of a single information source
  • 3. Information Agents aka Mediators
    • Example: Restaurant and Theatre Info on the Web
    • [Figure: Ariadne Mediator with Map Servers, Geocoders, Movies, Zagat, and Health Ratings sources]
  • 4. Why the Interest in Building Such Systems?
    • [Figure: MEDIATOR with Oracle, Legacy System, IBM DB2, Sybase, and Object-Oriented DB sources]
  • 5. Mediators on the Web
    • [Figure with components: MEDIATOR, Wrapper, DB1, DB2]
  • 6. Organization of Remainder of Talk
    • Introduction
      • Information Agents, System Architecture
    • Research Issues
      • Information Modeling
      • Query Planning
      • Semi-automatic Wrapper Generation
      • Performance Optimization by Materialization
      • Resolving Inconsistencies
    • Industry Products for Data Extraction and Integration
    • Start-up Ventures
  • 7. Representative Systems (Research Projects)
    • SIMS/Ariadne University of Southern California/ISI
    • TSIMMIS Stanford
    • Information Manifold AT&T Research
    • Garlic IBM Almaden
    • Tukwila University of Washington
    • InfoSleuth MCC
    • DISCO University of Maryland/INRIA
    • HERMES University of Maryland
    • InfoMaster Stanford
    • InfoQuilt University of Georgia
  • 8. Information Modeling
    • Multiple, heterogeneous, autonomously created information sources
    • User sees an integrated (global) view
      • Queries a “mediated schema”
    • A uniform model for all sources
      • Must be (at least) expressive enough to model the most complex information source
    • Each source provides a set of relations or classes
      • Translation into the uniform model is done by a wrapper at each source
    • Integration
      • Global as view, Local as view
  • 9. Global as View
    • For each relation (class) in mediated schema we specify how to obtain its tuples from the sources
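    [Figure: mediated RESTAURANT relation defined over the sources ZAGAT, FODORS and DOH Ratings, with a GEOCODER source. Attribute labels in the figure: FODORS (Name, Phone #, Reviews); ZAGAT (Name, Address, Telephone); Ratings (Name, Rating); (Name, Phonenumber); GEOCODER (Address, Lat, Lon)]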
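
As a rough illustration of the global-as-view idea on this slide, the mediated RESTAURANT relation can be written as a function over the source relations. This is a minimal Python sketch, not the mechanism of any system named in the talk. The function name `restaurant_view`, the join on restaurant name, and the review and rating values are assumptions made for illustration.

```python
# Hypothetical source relations. Review and rating values are invented.
ZAGAT = [
    {"name": "Chinois", "address": "2720 Main St", "telephone": "310-777-9876"},
    {"name": "Peking Star", "address": "1 Broad St", "telephone": "213-999-7676"},
]
FODORS = [
    {"name": "Chinois", "phone": "310-777-9876", "reviews": "Excellent"},
]
DOH_RATINGS = [
    {"name": "Chinois", "rating": "A"},
    {"name": "Peking Star", "rating": "B"},
]


def restaurant_view(zagat, fodors, doh):
    """Global-as-view: build mediated RESTAURANT tuples from the sources.

    Assumes restaurants can be joined on their name. A source that does not
    list a restaurant contributes None for its attributes.
    """
    reviews = {row["name"]: row["reviews"] for row in fodors}
    ratings = {row["name"]: row["rating"] for row in doh}
    return [
        {
            "name": z["name"],
            "address": z["address"],
            "telephone": z["telephone"],
            "review": reviews.get(z["name"]),
            "rating": ratings.get(z["name"]),
        }
        for z in zagat
    ]


RESTAURANT = restaurant_view(ZAGAT, FODORS, DOH_RATINGS)
```

A query against the mediated schema is then answered by unfolding this definition, which is what makes query processing comparatively simple under global as view.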
  • 10. Heterogeneity Resolution
    • Sources may use different models
      • OO, relational, legacy, ...
      • May be Web sources
      • Wrapper “exports” contents in a uniform model
    • Structural and schematic differences
      • (name, address) vs. (name, street, city, state, zip)
    • Semantic
      • (name, phonenumber) vs. (name, telephone)
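
The structural and semantic differences listed above are the kind of thing a wrapper can hide when it exports a uniform model. The following Python sketch shows one possible normalization step. The target form (name, address, telephone) and the function name `normalize` are assumptions chosen for illustration.

```python
def normalize(record):
    """Map a source record onto a uniform (name, address, telephone) form."""
    out = {"name": record["name"]}
    # Semantic difference: two attribute names for the same concept.
    out["telephone"] = record.get("telephone", record.get("phonenumber"))
    # Structural difference: one address field vs. separate components.
    if "address" in record:
        out["address"] = record["address"]
    else:
        parts = [record.get(key) for key in ("street", "city", "state", "zip")]
        out["address"] = ", ".join(p for p in parts if p)
    return out
```

For example, a record with `street`, `city`, `state`, `zip` and `phonenumber` fields and a record with `address` and `telephone` fields both come out with the same three attributes.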
  • 11. Global as View: Models
    • KR-based models (SIMS, Ariadne, ...)
      • LOOM, CLASSIC
    • OO, based on ODMG (DISCO, Garlic, ...)
    interface Restaurant {
      attribute string name;
      attribute string address;
      attribute string cuisine;
      attribute string review;
    }
    extent restaurant0 of Restaurant wrapper w0 repository r0
      map ((zagts0=restaurant0) (name=n) (address=a) (cuisine=c))
  • 12. Local as View
    • For every information source S describe it in terms of relations in the mediated schema
    v1(name, address, cuisine, rating) :- Restaurant(name, address, cuisine, rating) ^ city = “Santa Monica”
    v2(name, foodrating) :- Restaurant(name, address, cuisine, rating) ...
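
Under local as view, the mediator has to work out which source descriptions are relevant to a query. The Python sketch below is a heavily simplified stand-in for that step, not a full algorithm for answering queries using views. The `SourceView` class and `relevant_sources` function are hypothetical, and v2 is modeled with no constraints only because its definition is elided on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class SourceView:
    name: str
    exports: frozenset                       # attributes the source returns
    constraints: dict = field(default_factory=dict)  # e.g. {"city": "Santa Monica"}


def relevant_sources(views, wanted, query_constraints):
    """Return names of views that export all wanted attributes and whose
    own constraints do not contradict the query's constraints."""
    result = []
    for view in views:
        if not wanted <= view.exports:
            continue
        contradicts = any(
            attr in view.constraints and view.constraints[attr] != value
            for attr, value in query_constraints.items()
        )
        if not contradicts:
            result.append(view.name)
    return result


V1 = SourceView("v1", frozenset({"name", "address", "cuisine", "rating"}),
                {"city": "Santa Monica"})
V2 = SourceView("v2", frozenset({"name", "foodrating"}))  # constraints unknown
```

With these descriptions, a query for names and addresses in Santa Monica selects only v1, while the same query for a different city selects no source.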
  • 13. Query Planning and Optimization
    • Mediator must generate an information gathering plan
    • Constraints on execution
      • Binding patterns ....
    • Optimization of query plans
    • Current areas of work
      • Optimization
      • Approximate answers (incomplete sources)
      • Query planning for other sources such as simulations, computer programs etc.
      • Query execution engines
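
Binding patterns constrain the order in which sources can be accessed: a source that requires an input can only be called once some earlier step has produced that value. The sketch below orders source accesses greedily under such constraints. The source names and their input and output attributes are assumptions for illustration, and the greedy strategy ignores plan cost entirely.

```python
def order_by_binding_patterns(sources, initially_bound):
    """Order source accesses so each source's required inputs are bound.

    `sources` maps a source name to (required_inputs, produced_outputs).
    Raises ValueError when no remaining source can be executed.
    """
    bound = set(initially_bound)
    remaining = dict(sources)
    plan = []
    while remaining:
        ready = [name for name, (needs, _) in remaining.items() if needs <= bound]
        if not ready:
            raise ValueError(f"cannot bind inputs for: {sorted(remaining)}")
        name = ready[0]
        _, gives = remaining.pop(name)
        bound |= gives      # outputs become available to later sources
        plan.append(name)
    return plan


# Hypothetical binding patterns for three sources.
SOURCES = {
    "geocoder": ({"address"}, {"lat", "lon"}),
    "zagat": ({"cuisine"}, {"name", "address"}),
    "map_server": ({"lat", "lon"}, {"map"}),
}

PLAN = order_by_binding_patterns(SOURCES, {"cuisine"})
```

Starting from a bound cuisine, the restaurant source has to run first, then the geocoder, then the map server. An optimizer would choose among several such valid orderings by estimated cost.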
  • 14. Query Plans and Plan Quality
    • [Figure: Low-Quality Plan vs. High-Quality Plan]
  • 15. Accessing Sources via Wrappers
    SELECT address, tel FROM Restaurant WHERE cuisine = “chinese”
    Chinois, 2720 Main St, 310-777-9876
    Peking Star, 1 Broad St, 213-999-7676
    .....
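
A wrapper of this kind can be thought of as two steps: extract tuples from the page, then apply the selection and projection the mediator asked for. The Python sketch below illustrates that with a made-up page layout. The `PAGE` markup, the field order, and the "Red Lobster" row are assumptions for illustration, and a real wrapper would fetch the page over the network.

```python
import re

# Stand-in for a fetched Web page (layout and third row are invented).
PAGE = """
<li>Chinois | chinese | 2720 Main St | 310-777-9876</li>
<li>Red Lobster | seafood | 5 Ocean Ave | 310-555-0100</li>
<li>Peking Star | chinese | 1 Broad St | 213-999-7676</li>
"""

ROW = re.compile(r"<li>(.*?)</li>")
FIELDS = ("name", "cuisine", "address", "tel")


def extract_rows(page):
    """Turn each list item into a dict keyed by FIELDS."""
    rows = []
    for match in ROW.finditer(page):
        values = [v.strip() for v in match.group(1).split("|")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def select(page, columns, where):
    """Project `columns` from rows that satisfy every equality in `where`."""
    return [
        tuple(row[c] for c in columns)
        for row in extract_rows(page)
        if all(row[k] == v for k, v in where.items())
    ]


RESULT = select(PAGE, ("address", "tel"), {"cuisine": "chinese"})
```

Here `RESULT` holds the address and telephone of the two Chinese restaurants, which corresponds to the query on the slide.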
  • 16. Semi-Automatic Wrapper Generation
    • Need wrappers for several sites
      • Building wrappers by hand is tedious and time consuming
    • Approaches to automating the process
      • Exploit format information (structure, HTML, etc.)
      • Template based approaches
      • Machine learning techniques
    • XML
      • <name> Peking Star </name>
      • <address> 1 Broad Street, Los Angeles </address>
      • <phone>31-822-1511 </phone>
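
When a source already publishes XML like the fragment above, extraction reduces to reading tagged fields. A minimal sketch with Python's standard `xml.etree.ElementTree` follows. The enclosing `<restaurant>` element is an assumption, since the slide shows only the three inner elements.

```python
import xml.etree.ElementTree as ET

# The <restaurant> wrapper element is assumed; the slide shows only its children.
DOC = """
<restaurant>
  <name> Peking Star </name>
  <address> 1 Broad Street, Los Angeles </address>
  <phone>31-822-1511 </phone>
</restaurant>
"""


def parse_restaurant(xml_text):
    """Return a dict mapping each child tag to its whitespace-trimmed text."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


RECORD = parse_restaurant(DOC)
```

No site-specific extraction rules are needed here, which is the appeal of XML for wrapper construction.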
  • 17. Wrappers .... Work in Progress
    • Database wrappers
    • Variety of techniques for Web wrappers
    • “Upmarking”
      • To XML
    • Building “Web-bases”
    • Other Artificial Intelligence techniques
      • Natural Language Processing
      • IR
      • Classifiers
  • 18. Performance Issue
    • Query processing time is typically very high
    • Despite the mediator generating efficient query plans
    • Cost of fetching data and pages from remote sources dominates
      • Have to typically fetch a large number of Web pages
      • The Web sources are not designed for database-like query access
      • The Web sources can be slow
    • Further improve performance by materializing data at the mediator side.
  • 19. Store and Materialize Data Locally
    • [Figure: MEDIATOR with a Wrapped Web Source (SLOW) and Materialized Data (FAST)]
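
The idea of answering from local data when possible, and going to the slow wrapped source otherwise, can be sketched as follows. The `Mediator` class, its method names, and the per-class granularity are assumptions for illustration. The sketch also ignores refreshing stale data.

```python
class Mediator:
    """Answer queries from materialized data when available (hypothetical)."""

    def __init__(self, fetch_remote, materialized_classes):
        self._fetch_remote = fetch_remote          # slow call to a wrapped source
        self._materialized = set(materialized_classes)
        self._store = {}                           # locally materialized rows
        self.remote_calls = 0

    def query(self, class_name):
        if class_name in self._store:
            return self._store[class_name]         # fast path
        self.remote_calls += 1
        rows = self._fetch_remote(class_name)      # slow path
        if class_name in self._materialized:
            self._store[class_name] = rows
        return rows
```

Only classes chosen for materialization are kept locally, so repeated queries on other classes still go to the remote source each time.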
  • 20. Selective Materialization
    • Why not simply materialize all the data in all the Web sources being integrated and have a really fast mediator?
      • Will not scale, amount of space needed may be too much
      • Web sources can get updated
        • Cost of keeping data consistent can get prohibitive
      • We are building a mediator, not a data warehouse !
    • Approach then is to selectively materialize data
    • How do we automatically identify the portion of data most useful to materialize ?
  • 21. Selecting Data to Materialize
    • Distribution of User Queries (Identify frequently accessed classes)
    • Structure of Sources (Prefetch data to speed up expensive queries)
    • Updates (Have to consider maintenance cost)
    • [Figure: these three factors feed SELECTING CLASSES, which yields the Classes of Data to Materialize]
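
One way to combine the three factors on this slide is a simple benefit estimate per class: time saved on queries minus time spent keeping the copy fresh, selected greedily under a space budget. This is a sketch under invented assumptions, not the selection method from the talk. The statistics, the `benefit` formula, and the class names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassStats:
    name: str
    query_freq: float    # queries per day touching this class
    remote_cost: float   # seconds to fetch the class from its Web source
    update_rate: float   # source updates per day
    refresh_cost: float  # seconds to refresh the local copy after an update
    size: int            # local space needed, in arbitrary units


def benefit(c):
    """Seconds saved per day by materializing, net of maintenance."""
    return c.query_freq * c.remote_cost - c.update_rate * c.refresh_cost


def choose_classes(stats, space_budget):
    """Greedily pick classes with the best positive benefit per unit of space."""
    chosen, used = [], 0
    ranked = sorted(stats, key=lambda c: benefit(c) / c.size, reverse=True)
    for c in ranked:
        if benefit(c) > 0 and used + c.size <= space_budget:
            chosen.append(c.name)
            used += c.size
    return chosen


# Invented statistics for three classes of data.
STATS = [
    ClassStats("chinese_restaurants", 50, 4.0, 1, 10.0, 100),
    ClassStats("movie_showtimes", 80, 3.0, 48, 6.0, 200),
    ClassStats("health_ratings", 5, 2.0, 0.1, 5.0, 50),
]
```

In this example the frequently updated showtimes class is never selected, because its maintenance cost outweighs the query time it would save.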
  • 22. Inconsistency Resolution
    • Same object in different formats
      • “United States” and “US”
      • “Red Lobster” and “The Red Lobster”
      • “John Smith”, “Smith, J.”, “J. Smith”, “Dr. John Smith” ...
    • Has appeared in other database and IR contexts
    • Solutions
      • Mapping tables
        • For finite domains (such as cities, countries, companies …)
        • Simply maintain an enumerated list of possible formats for each object
        • (New York, N.Y., NYC, New York City, Big Apple)
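
A mapping table of this kind is easy to picture as a lookup from every known variant to one canonical form. The Python sketch below uses the variants listed on the slide. The choice of canonical names, the case-insensitive matching, and the function name `canonical_form` are assumptions.

```python
# Canonical form -> enumerated list of known formats for the same object.
MAPPING_TABLE = {
    "New York": ["New York", "N.Y.", "NYC", "New York City", "Big Apple"],
    "United States": ["United States", "US"],
}

# Invert the table once so each variant points at its canonical form.
LOOKUP = {
    variant.lower(): canon
    for canon, variants in MAPPING_TABLE.items()
    for variant in variants
}


def canonical_form(value):
    """Return the canonical form of `value`, or None if it is not listed."""
    return LOOKUP.get(value.strip().lower())
```

Values that are not in the table come back as `None`, which is where the approach stops working for open-ended domains.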
  • 23. Mapping Functions
    • Mapping functions
      • When domain is not finite (person names)
      • Domain specific mapping transformations
        • Stemming common words (Inc., Corp., The etc.)
        • Matching full word and abbreviation
        • Match 2 formats with a score
    • Current work
      • Learning mapping functions from example matches
      • IR based approaches
      • Building “metabases”
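
The transformations listed above can be combined into a small scoring function. The sketch below drops common words, lets a one-letter initial match a full word, and otherwise falls back to a generic string similarity from Python's `difflib`. The word list, the sorting of tokens to ignore word order, and the scoring rule are assumptions for illustration, not a method from the talk.

```python
import re
from difflib import SequenceMatcher

# Common words to ignore (assumed list, extended with "dr" for titles).
COMMON_WORDS = {"inc", "corp", "the", "dr"}


def tokens(text):
    """Lowercase alphabetic words with common words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in COMMON_WORDS]


def word_match(a, b):
    """Exact match, or a one-letter initial matching a full word."""
    return (
        a == b
        or (len(a) == 1 and b.startswith(a))
        or (len(b) == 1 and a.startswith(b))
    )


def match_score(x, y):
    """Score in [0, 1] for how likely two strings name the same object."""
    tx, ty = sorted(tokens(x)), sorted(tokens(y))
    if tx and len(tx) == len(ty) and all(word_match(a, b) for a, b in zip(tx, ty)):
        return 1.0
    return SequenceMatcher(None, " ".join(tx), " ".join(ty)).ratio()
```

With these rules “Red Lobster” and “The Red Lobster” score 1.0, as do “John Smith” and “Smith, J.”. Unrelated names get a lower score that a threshold can reject.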
  • 24. Mediator Prototypes and Software
    • Software and tools from mediator research projects
    • What may be available.
      • Mediator kernels (integration engines)
      • Data modeling tools, Description Logic systems
      • Wrapper and extractor toolkits and software
      • Plenty of papers !
    • Ariadne, USC/ISI, http://www.isi.edu/ariadne
    • TSIMMIS, Stanford, http://www-db.stanford.edu/tsimmis/
    • MIX, UCSD, http://feast.ucsd.edu/Projects/MIX/
    • InfoSleuth, MCC, http://www.mcc.com/projects/infosleuth/
    • DISCO, U Maryland, http://www.umiacs.umd.edu/labs/CLIP/im.html
    • Garlic, IBM Almaden, http://www.almaden.ibm.com/cs/garlic.html
    • Tukwila, U Washington, http://data.cs.washington.edu/integration/tukwila/
  • 25. Applications of Mediators
    • Heterogeneous and Distributed Database Integration
      • Legacy systems integration
    • Web Sources Integration
    • Data Integration for E-commerce
      • Integrating product catalogs, multiple vendors
    • Data Warehousing
      • For populating data warehouses
    • Bioinformatics
    • Information Management Environments
      • Digital Libraries
      • Healthcare Information Systems
  • 26. Industry Products (IBM DB2 DataJoiner)
    • IBM DB2 DataJoiner
    • http://www-4.ibm.com/software/data/datajoiner/
    • Enterprise data integration middleware
    • DataJoiner functionality now incorporated in IBM DB2 UDB
    • http://www-4.ibm.com/software/data/db2/udb/about.html
    • Native support for popular relational data sources
      • DB2, Informix, SQL Server, Sybase, Teradata and others
    • Supports non-relational data sources
    • Support for Web data
    • Available on a variety of platforms and operating systems
  • 27. Start-up ventures: Junglee Corp
    • Website: www.amazon.com (Acquired)
    • Researcher Founders : Rajaraman, Gupta, Harinarayan, Mathur
    • Products and Services :
      • Tools for data extraction and integration
      • Building warehouse from multiple Web sources
        • Integrating apartment listings from multiple sources
        • Integrating job postings from multiple online job sources
    • Market focus : Online shopping
    • Current Status: Acquired by Amazon
    • Similar ventures : Netbots Inc. (www.excite.com) Acquired by Excite
  • 28. Cohera
    • Website: www.cohera.com
    • Researcher Founders : Stonebraker, Hellerstein
    • Products and Services:
      • Cohera E-Catalog System
      • Integrates product data from multiple sellers and product catalogs
      • Set of software servers and tools for building and running live “e-catalogs”
    • Market(s) Targeted : E-Commerce
    • Customers: E-Commerce communities - ThomasNet, Trapezo, LiveListings, FoodService.Com
    • Current Status : Founded October 1997, Privately Held
    • Similar ventures : Enosys Markets Inc. (www.enosysmarkets.com), Mergent Inc. (www.mergent.com)
  • 29. Nimble Technology
    • Website: www.nimble.com
    • Researcher Founders : Levy, Weld
    • Products and Services :
      • Nimble Data Integration Suite
      • XML-based integration approach
      • Current focus on multiple information sources integration
      • Tools for data extraction and Data Integration Engine
    • Market focus : CRM, Business Intelligence, B2B, Portals
    • Current Status : Founded June 1999, Privately Held
  • 30. WhizBang! Labs
    • Website: www.whizbanglabs.com
    • Researcher Founders : Quass, Geddes, Mitchell
    • Products and Services:
      • Technology for building “Webbases” - databases created by extracting data from Web pages
      • Topic specific
      • Topic specific crawler for retrieving pages
      • Tools for extracting data from Web pages, cleaning data and loading into database
    • Market focus : Content providing portals
    • Current Status : Founded March 1999, Privately held
    • Similar ventures : Fetch Technologies (www.fetch.com)
  • 31. Bioinformatics: A Data Integration Grand Challenge
    • Mapping of Human Genetic Code complete
      • New, revolutionary, computational approach to drug discovery
    • Huge amounts of genetic, chemical and biological data being generated at an exponential rate in biotech/pharma R&D
      • Complex structures, maps, sequence data etc.
    • Drug discovery scientists need integrated access to this data
      • Look for patterns across data sources
    • Need to integrate data from multiple labs
    • Lab procedures (and thus the data) keep changing
    • A good amount of genomic data is free text
    • DiscoveryLink: State of the art Life Sciences data integration middleware from IBM
    • http://www-4.ibm.com/software/webservers/lifesciences/discovery.html
  • 32. Conclusion
    • Information mediation
    • Issues in building such systems
    • Research projects
    • Industry products
    • Start-up ventures
    • Applicable to wide areas such as E-commerce, database and legacy systems integration, Web source extraction, content management, portals, digital libraries, bioinformatics.
