G. Mecca, V. Crescenzi, and P. Merialdo. Roadrunner: Towards ...

G. Mecca, V. Crescenzi, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. Proceedings of the 27th Very Large Databases Conference, Rome, Italy, pages 109-118, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html
G. Ianni, Intelligent Anticipated Exploration of Web Sites (2001) http://citeseer.nj.nec.com/ianni01intelligent.html
Robert Baumgartner, Supervised Wrapper Generation with Lixto. The VLDB Journal 2001 http://citeseer.nj.nec.com/baumgartner01supervised.html or via the Lixto Downloads page: http://www.dbai.tuwien.ac.at/proj/lixto/download.html
Robert Baumgartner, Sergio Flesca, Georg Gottlob, Declarative Information Extraction, Web Crawling and Recursive Wrapping with Lixto, Proc LPNMR'01, 6th International Conference on Logic Programming and Nonmonotonic Reasoning, 2001. (LNCS )
Robert Baumgartner, Sergio Flesca, Georg Gottlob, Visual Web Information Extraction with Lixto. The VLDB Journal 2001 http://citeseer.nj.nec.com/baumgartner01visual.html
Lixto web site http://www.dbai.tuwien.ac.at/proj/lixto/
William Cohen, Lee Jensen, A Structured Wrapper Induction System for Extracting Information from Semi-Structured Documents, Proc IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (2001) http://citeseer.nj.nec.com/cohen01structured.html
Fabio Ciravegna, Daniela Petrelli, User Involvement in Adaptive Information Extraction: Position Paper, Proc IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (2001) http://www.smi.ucd.ie/ATEM2001/proceedings/ciravegna-position-atem2001.pdf
Ralph Grishman, Adaptive Information Extraction and Sublanguage Analysis, Proc IJCAI-2001 Workshop on Adaptive Text Extraction and Mining (2001) http://www.smi.ucd.ie/ATEM2001/proceedings/grishman-position-atem2001.pdf
Shian-Hua Lin Academia Sinica 128 Academia Road Sec. 2... Discovering Informative Content Blocks from Web http://citeseer.nj.nec.com/530062.html http://kp05.iis.sinica.edu.tw/shlin/paper/kdd-ShianHuaLin.pdf
M. Brian Blake The MITRE Corporation Center for Advanced Aviation System... An Autonomous Decentralized Architecture for Distributed Data Management and Dissemination http://citeseer.nj.nec.com/461926.html
M. Brian Blake, Patricia Liguori, ISADS, An Automated Client-Driven Approach to Data Extraction using an Autonomous Decentralized Architecture (2001) http://citeseer.nj.nec.com/blake01automated.html
Yuan Jiang, Using Heuristic Approaches to Detect Record Boundaries in Semistructured Web Documents http://students.cs.byu.edu/~jiang/thesis/
Line Eikvil, Information Extraction from World Wide Web A Survey (1999). Norwegian Computing Center http://citeseer.nj.nec.com/eikvil99information.html
Brad Adelberg, NoDoSE - A tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents, SIGMOD Conference 1998 http://students.cs.byu.edu/~jiang/thesis/nodose.ps
Ion Muslea, Steve Minton, Craig Knoblock, Wrapper Induction for Semistructured, Web-based Information Sources (1998)
Sergey Brin, Extracting Patterns and Relations from the World Wide Web, WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98 http://citeseer.nj.nec.com/brin98extracting.html
Naveen Ashish, Craig A. Knoblock, Semi-automatic Wrapper Generation for Internet Information Sources, Proc 2nd IFCIS Conference on Cooperative Information Systems (CoopIS '97), pp 160-169, June 1997. http://citeseer.nj.nec.com/28908.html
Naveen Ashish and Craig Knoblock, Wrapper Generation for Semi-structured Internet Sources, ACM SIGMOD Workshop on Management of Semi-structured Data, 1997, Tucson, Arizona. http://ic.arc.nasa.gov/~ashish/sig.ps
Naveen Ashish, Craig A. Knoblock, Wrapper Generation for Semi-structured Internet Sources, SIGMOD Record, Vol. 26, No. 4, December 1997. (Invited Paper) http://ic.arc.nasa.gov/~ashish/sig.ps
Naveen Ashish's Home Page http://ic.arc.nasa.gov/~ashish/
Craig Knoblock, Steve Minton, Jose-Luis Ambite, Naveen Ashish, Pragnesh Modi, Ion Muslea, Andrew Philpot and Sheila Tejada, Modeling Web Sources for Information Integration, Proc. AAAI '98, 15th National Conference on Artificial Intelligence, July 1998. http://ic.arc.nasa.gov/~ashish/aaai98.ps
Venkatesh Ganti, Mong-Li Lee, Raghu Ramakrishnan: ICICLES: Self-Tuning Samples for Approximate Query Answering. VLDB 2000: 176-187 http://www.acm.org/sigmod/vldb/conf/2000/P176.pdf
Kushmerick N. (1997). Wrapper Induction for Information Extraction. Ph.D. Dissertation, University of Washington. Technical Report UW-CSE-97-11-04. http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-phd.ps.gz
Nicholas Kushmerick, Bernd Thomas, Adaptive information extraction: Core technologies for information agents (2002). In Intelligent Information Agents R&D in Europe: An AgentLink perspective. Springer. http://citeseer.nj.nec.com/kushmerick02adaptive.html
Bernd Thomas papers etc web site: http://www.uni-koblenz.de/~bthomas/MIA_HTML/
Nicholas Kushmerick, Finite-state approaches to Web information extraction (2002) http://citeseer.nj.nec.com/kushmerick02finitestate.html
Nicholas Kushmerick, Gleaning Answers From the Web http://citeseer.nj.nec.com/504568.html http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-maftkb-ss02.pdf
Nicholas Kushmerick, Wrapper Verification. World Wide Web Journal, 3(2) pp 79-94. http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-wwwj2000.ps.gz
The Niagara Internet Query System (Wisconsin). List of papers by Niagara group members, 1999-2002. http://www.cs.wisc.edu/niagara/Publications.html
Building XML Statistics for the Hidden Web. Ashraf Aboulnaga and Jeffrey F. Naughton. VLDB Conference 2002 http://www.cs.wisc.edu/niagara/papers/vldb02xmlolstat.pdf
Form-Based Proxy Caching for Database-Backed Web Sites. Qiong Luo and Jeffrey F. Naughton. VLDB Conference 2001 http://www.cs.wisc.edu/niagara/papers/formProxyFull.pdf
NiagaraCQ: A Scalable Continuous Query System for Internet Databases. Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang. Proc. SIGMOD 2000, p379-390. http://www.cs.wisc.edu/niagara/papers/NiagaraCQ.pdf
B.T. Messmer, H. Bunke, Subgraph Isomorphism in Polynomial Time (1995) Technical Report IAM 95-003, University of Bern, Institute of Computer Science and Applied Mathematics, Bern, Switzerland. http://citeseer.nj.nec.com/messmer95subgraph.html
David Eppstein, Subgraph Isomorphism in Planar Graphs and Related Problems (1999) Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, pages 632--640, 1995. http://citeseer.nj.nec.com/eppstein95subgraph.html Revised later for a Journal Article in 1999 (That is the printed version you have).
Gio Wiederhold, Mediators in the Architecture of Future Information Systems, Readings in Agents, 1992 http://citeseer.nj.nec.com/wiederhold92mediators.html
Anthony Tomasic, Louiqa Raschid, P. Valduriez, Scaling Access to Heterogeneous Data Sources with DISCO. Knowledge and Data Engineering, 1998 http://citeseer.nj.nec.com/tomasic98scaling.html
A. Tomasic, L. Raschid, and P. Valduriez, Scaling Heterogeneous Databases and the Design of Disco (1996) Issn apport de recherche Institut National De Recherche En Informatique Et En... Proc. International Conference on Distributed Computing Systems, ICDCS, 1996. http://citeseer.nj.nec.com/tomasic96scaling.html
Olga Kapitskaia, Anthony Tomasic, Patrick Valduriez, Dealing with Discrepancies in Wrapper Functionality. Proc. 13eme Journees Bases de Donnees Avancees, BDA, 1997 http://citeseer.nj.nec.com/rd/97012851%2C145295%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/299/ftp:zSzzSzftp.inria.frzSzINRIAzSzpublicationzSzRRzSzRR-3138.pdf/kapitskaia97dealing.pdf
Anthony Tomasic, Rémy Amouroux, Philippe Bonnet, The Distributed Information Search Component (DisCo) and the World Wide Web http://citeseer.nj.nec.com/186003.html
P. Ipeirotis and L. Gravano, Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection, Proceedings of the 28th International Conference on Very Large Databases (VLDB 2002), 2002 http://qprober.cs.columbia.edu/publications/vldb2002.pdf
P. Ipeirotis, L. Gravano, and M. Sahami, Probe, Count, and Classify: Categorizing Hidden-Web Databases, Proceedings of the 2001 ACM SIGMOD International Conference On Management of Data, 2001. http://qprober.cs.columbia.edu/publications/sigmod2001.pdf
L. Gravano, P. Ipeirotis, and M. Sahami, Query- vs. Crawling-based Classification of Searchable Web Databases, IEEE Data Engineering Bulletin, vol. 25, no. 1, March 2002. http://qprober.cs.columbia.edu/publications/deb-mar2002.pdf
Andrea Calì, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, On the Role of Integrity Constraints in Data Integration, Data Engineering Bulletin, 25(3) September 2002. http://www.research.microsoft.com/research/db/debull/A02sept/l-article.ps
Rachel A. Pottinger, Philip A. Bernstein, Creating a Mediated Schema Based on Initial Correspondences, Data Engineering Bulletin, 25(3) September 2002. (Special Issue on Integration Management) http://www.research.microsoft.com/research/db/debull/A02sept/po-article.ps
V. Crescenzi, G. Mecca and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. Proceedings of the 27th Very Large Databases Conference, Rome, Italy, pages 109-118, 2001. http://citeseer.nj.nec.com/crescenzi01roadrunner.html
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo, Automatic Web Information Extraction in the RoadRunner System. International Workshop on Data Semantics in Web Information Systems (DASWIS-2001) in conjunction with 20th International Conference on Conceptual Modeling (ER 2001)
R. Kosala, J. Van den Bussche, M. Bruynooghe and H. Blockeel. Information Extraction in Structured Documents using Tree Automata Induction. To appear in Principles of Data Mining and Knowledge Discovery, Proceedings of the 6th International Conference (PKDD-2002). Preliminary version http://citeseer.nj.nec.com/506574.html
R. Kosala and H. Blockeel, Web mining research: A survey. ACM SIGKDD Explorations, 2(1) pp 1-15, 2000, Special issue on "Internet Data Mining". SIGKDD Explorations: Newsletter of the ACM Special Interest Group on Knowledge Discovery & Data Mining. http://citeseer.nj.nec.com/kosala00web.html
Boris Chidlovskii, Jon Ragetli and Maarten de Rijke. Wrapper Generation via Grammar Induction. Proc. ECML 2000, 11th European Conf on Machine Learning, 2000. (LNAI 1810) pp 96-108. http://home-4.12move.nl/~sh364624/docs/chidlovskii.pdf
Boris Chidlovskii, Jon Ragetli, Maarten de Rijke, Automatic Wrapper Generation for Web Search Engines, Proc. 1st Intl Conf on Web-Age Information Management, 2000. (LNCS 1846) pp 399-410. http://home-4.12move.nl/~sh364624/docs/waimk.pdf
Jon Ragetli's publications page: http://home-4.12move.nl/~sh364624/publicaties.html
Wolfgang May, Rainer Himmeröder, Georg Lausen, Bertram Ludäscher, A Unified Framework for Wrapping, Mediating and Restructuring Information from the Web.
Naveen Ashish, PhD Thesis, March 2000. Optimizing Information Mediators by Selectively Materialising Data. (supervised by Craig Knoblock).
Naveen Ashish, Craig Knoblock. Wrapper Generation for Semi-structured Internet Sources, Workshop on Management of Semistructured Data, Ventana Canyon Resort, Tucson, Arizona.
Gestalts Project: Networked Databases. Raghu Ramakrishnan, University of Wisconsin. Web page Oct '98: http://www.cs.wisc.edu/~raghu/gestalts/
The COD Project, University of Wisconsin. (Semantic database integration.) Raghu Ramakrishnan. http://www.cs.wisc.edu/~cod/
Eric Brill, Jimmy Lin, Michele Banko, Susan T. Dumais, Andrew Y. Ng: Data-Intensive Question Answering. (TREC 2001) http://trec.nist.gov/pubs/trec10/papers/Trec2001Notebook.AskMSRFinal.pdf
G. Ianni, Intelligent Anticipated Exploration of Web Sites (2001) http://citeseer.nj.nec.com/ianni01intelligent.html
William Cohen, Lee Jensen, A Structured Wrapper Induction System for Extracting Information from Semi-Structured Documents (2001) http://citeseer.nj.nec.com/cohen01structured.html
M. Brian Blake, Patricia Liguori, ISADS, An Automated Client-Driven Approach to Data Extraction using an Autonomous Decentralized Architecture (2001) http://citeseer.nj.nec.com/blake01automated.html
M. Brian Blake The MITRE Corporation Center for Advanced Aviation System... An Autonomous Decentralized Architecture for Distributed Data Management and Dissemination http://citeseer.nj.nec.com/461926.html
B.T. Messmer, H. Bunke, Subgraph Isomorphism in Polynomial Time (1995) Technical Report IAM 95-003, University of Bern, Institute of Computer Science and Applied Mathematics, Bern, Switzerland. http://citeseer.nj.nec.com/messmer95subgraph.html
David Eppstein, Subgraph Isomorphism in Planar Graphs and Related Problems (1999) Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, pages 632--640, 1995. http://citeseer.nj.nec.com/eppstein95subgraph.html Revised later for a Journal Article in 1999
Home page of Ismail Khalil Ibrahim: http://www.tk.uni-linz.ac.at/~ismail/
Stephane Bressan - National University of Singapore: http://www.comp.nus.edu.sg/~steph/
Wee Hyong Tok and Stéphane Bressan, dbRouter - A Scaleable and Distributed Query Optimization and Processing Framework, Proc. 13th International Conference on Database and Expert Systems Applications, DEXA 2002, p. 658-668. (LNCS 2453). http://link.springer.de/link/service/series/0558/papers/2453/24530658.pdf
Stéphane Bressan, Cheng Goh, Semantic Integration of Disparate Information Sources over the Internet using Constraint Propagation. http://context.mit.edu/~steph/cp97/cp97.html
Stéphane Bressan and Cheng Hian Goh, Semantic integration of disparate information sources over the internet using constraint propagation. In Workshop on Constraint Reasoning on the Internet, 1997.
S. Bressan, C. H. Goh, T. Lee, S. Madnick, and M. Siegel. A Procedure for Mediation of Queries to Sources in Disparate Contexts. Proc. of the International Logic Programming Symposium, pp. 213-227, Port Jefferson, N.Y., October 12-17, 1997.
S. Bressan and C. H. Goh, Semantic Integration of Disparate Information Sources over the Internet using Constraint Propagation. Workshop on Constraint Reasoning on the Internet at CP-97, 1997.
S. Bressan, C. H. Goh, K. Fynn, M. Jakobisiak, K. Hussein, H. Kon, T. Lee, S. Madnick, T. Pena, J. Qu, A. Shum, and M. Siegel. The Context Interchange Mediator Prototype. Proc. ACM SIGMOD Conference, 1997.
S. Bressan, K. Fynn, C. H. Goh, S. Madnick, T. Pena, and M. Siegel. Overview of Prolog Implementation of the Context Interchange Mediator. Proc. Intl Conference on Practical Applications of Prolog, pp. 83-93, 1997.
C. H. Goh, S. Bressan, S. Madnick, and M. Siegel. Context Interchange: New Features and Formalisms for the Intelligent Integration of Information. Sloan School of Management Working Paper, January 1997. Submitted for publication.
COntext INterchange: List of Publications http://context.mit.edu/~coin/publications/
COntext INterchange: List of Publications
• Information Integration with Attribution Support for Corporate Profiles, Lee, T., Chams, M., Nado, R., Madnick, S., and Siegel, M., ACM Conference on Information and Knowledge Management, 1999
• Context Mediation on Wall Street, Moulton, A., Madnick, S. E., and Siegel, M. D., CoopIS98
• Answering Queries in Context, Bressan, S. and Goh, C., International Conference on Flexible Query Answering, 1998 (LNAI)
• Source Attribution for Querying Against Semi-structured Documents, Lee, T., Bressan, S., and Madnick, S., Workshop on Web Information and Data Management, ACM Conference on Information and Knowledge Management, 1998
• Semantic Integration of Disparate Information Sources over the Internet Using Constraints, Bressan, S. and Goh, C., Constraint Programming Workshop on Constraints and the Internet, 1997
• Extraction and Integration of Data from Semi-structured Documents into Business Applications, Bressan, S. and Bonnet, Ph, Conference on the Industrial Applications of Prolog, 1997
• Multimodal Integration of Disparate Information Sources with Attribution, Lee, T. and Bressan, S., Entity Relationship Workshop on Information Retrieval and Conceptual Modelling, 1997
• Context Mediation: New Features and Formalisms for the Intelligent Integration of Information, Goh, C. and Bressan, S. and Madnick, S. and Siegel, M., Sloan Working Paper 3941, 1997
• A Procedure for the Context Mediation of Queries to Disparate Sources, Goh, C. and Bressan, S. and Lee, T. and Madnick, S. and Siegel, M., International Logic Programming Symposium, 1997
• Information Brokering on the World Wide Web, Bressan, S. and Lee, T., WebNet World Conference 1997
• The COntext INterchange Mediator Prototype, Bressan, S. and Fynn, K. and Goh, C. and Jakobisiak, M. and Hussein, K. and Kon, H. and Lee, T. and Madnick, S. and Pena, T. and Qu, J. and Shum, A. and Siegel, M., ACM SIGMOD International Conference on Management of Data, 1997
• Overview of the Prolog Implementation of the COntext INterchange Prototype, Bressan, S. and Fynn, K. and Goh, C. and Madnick, S. and Pena, T. and Siegel, M., Fifth International Conference on Practical Applications of Prolog, 1997
• PENNY: A Programming Language and Compiler for the Context Interchange Project, Pena, T., MIT Master Thesis, Electrical Engineering and Computer Science, 1997
• A Planner/Optimizer/Executioner for Context Mediated Queries, Fynn, K., MIT Master Thesis, Electrical Engineering and Computer Science, 1997
• Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Systems, Goh, C., Ph.D. Thesis, MIT Sloan School of Management, 1996
-------------------
Frank Dignum - Utrecht University http://www.cs.uu.nl/people/dignum/
I. K. Ibrahim, W. Winiwarter, S. Bressan. Semantic Query Transformation for the Intelligent Integration of Information Sources over the Web. Proc. of the International Workshop on Information Integration on the Web, Rio de Janeiro, Brazil, April 2001. http://www.ifs.univie.ac.at/~ww/wiiw01.ps http://citeseer.nj.nec.com/487085.html
I. K. Ibrahim, V. Dignum, W. Winiwarter, E. Weippl. Logic Based Approach to Semantic Query Transformation for Knowledge Management Applications. Proc. of the International Conference on Knowledge Management, Berlin, Springer, 2002. http://www.ifs.univie.ac.at/~ww/iknow02.pdf
I. K. Ibrahim, W. Winiwarter, S. Bressan. Rewriting Rules for Semantic Query Transformation in E-Commerce Applications. Proc. of the 9th IFIP 2.6 Working Conference on Database Semantics, Dordrecht, Kluwer Academic Publishers, 2001. http://www.ifs.univie.ac.at/~ww/ds9.ps (This file doesn't print properly. Try the following pdf file from citeseer instead: http://citeseer.nj.nec.com/rd/97012851%2C469573%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/24013/http:zSzzSzwww.ifs.univie.ac.atzSz%7EwwzSzds9.pdf/rewriting-rules-for-semantic.pdf)
I.K. Ibrahim's PhD thesis 2001: Semantic Query Transformation for the Intelligent Integration of Information, Gadjah Mada University. (Is not available on-line and is written in Indonesian, he tells me. He says he is translating it for publication as a book in 2003.) Supervised by: Stephane Bressan - National University of Singapore; Frank Dignum - Utrecht University.
A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183--236, 1990. This document is not available on-line. Amit P. Sheth’s home page is: http://lsdis.cs.uga.edu/~amit/
Steven Prestwich, Stéphane Bressan, A SAT Approach to Query Optimization in Mediator Systems, Fifth International Symposium on the Theory and Applications of Satisfiability Testing (SAT 2002), May 6-9, 2002, Cincinnati, Ohio, USA http://citeseer.nj.nec.com/prestwich02sat.html conference page: http://gauss.ececs.uc.edu/Conferences/SAT2002/sat2002list.html
H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and J. Widom, "The TSIMMIS Approach to Mediation: Data Models and Languages", Journal of Intelligent Information Systems, 2 (1997) 117-132.
Bressan, S. and Bonnet, Ph. Extraction and Integration of Data from Semi-structured Documents into Business Applications, Conference on the Industrial Applications of Prolog, 1997 http://context.mit.edu/~coin/publications/inap97/inap97.ps
Shuchi Patel, Amit Sheth, Planning And Optimizing Semantic Information Requests Using Domain Modeling And Resource Characteristics (2001) 26 page Technical Report. http://citeseer.nj.nec.com/rd/97012851%2C454763%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/22883/http:zSzzSzlsdis.cs.uga.eduzSzlibzSzdownloadzSz126-TR-Shuchi.pdf/patel01planning.pdf
Ajay Hemnani and Stephane Bressan, Information Extraction - Tree Alignment Approach to Pattern Discovery in Web Documents, Proc Database and Expert Systems Applications, 13th International Conference, DEXA 2002, (LNCS 2453) http://link.springer.de/link/service/series/0558/papers/2453/24530789.pdf
Ajay Hemnani and Stephane Bressan, Extracting Information from Semi-structured Web Documents, Proc. OOIS 2002, Advances in Object-Oriented Information Systems, 2002. LNCS 2426, pp. 166-175 http://link.springer.de/link/service/series/0558/papers/2426/24260166.pdf
Yannis Papakonstantinou’s home page: http://www.db.ucsd.edu/people/yannis.htm (contains an annotated bibliography)
Yannis Papakonstantinou, Data Integration: The Need, the Challenges and the Approaches. Plenary talk given at the International Symposium on Information Systems and Engineering, July 2002. Provides a classification of data integration techniques and challenges in the XML virtual view-based approach. http://www.db.ucsd.edu/people/yannis/ISE2002_files/frame.htm 59 slides. Web presentation.
V. Vassalos, Y. Papakonstantinou: Describing and Using Query Capabilities of Heterogeneous Sources. VLDB'97, 1997 http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1997-44&format=ps&compression=
Yannis Papakonstantinou, Ashish Gupta, Laura Haas, Capabilities-Based Query Rewriting in Mediator Systems (1998) .. 42 pages. Proceedings of 4th International Conference on Parallel and Distributed Information Systems http://citeseer.nj.nec.com/rd/4641068%2C322185%2C1%2C0.25%2CDownload/http%3AqSqqSqciteseer.nj.nec.comqSqcacheqSqpapersqSqcsqSq15133qSqhttp%3AzSzzSzwww.db.ucsd.eduzSzpublicationszSzdapd.pdf/papakonstantinou98capabilitiesbased.pdf
Yannis Papakonstantinou, Ashish Gupta, Laura Haas, Capabilities-Based Query Rewriting in Mediator Systems (another version) (1996) .. 29 pages http://citeseer.nj.nec.com/rd/24323359%2C3393%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/6182/http:zSzzSzwww-db.stanford.eduzSzpubzSzpapakonstantinouzSz1995zSzcbr-extended.pdf/papakonstantinou96capabilitiesbased.pdf
Michael R. Genesereth, Arthur M. Keller, et al., Infomaster: An Information Integration System (1997) http://citeseer.nj.nec.com/cache/papers/cs/15863/http:zSzzSzmas.cs.umass.eduzSz~aseltinezSz791SzSzgenesereth.infomaster.pdf/genesereth97infomaster.pdf
Michalis Petropoulos Home page: http://www-cse.ucsd.edu/~mpetropo/
    • Michalis Petropoulos, Y. Papakonstantinou and V. Vassalos, QURSED: QUerying and Reporting SEmistructured Data, ACM SIGMOD Conference, 2002. http://www.db.ucsd.edu/People/michalis/pubs/sigmod02.pdf Michalis Petropoulos, and V. Hristidis, Semantic Caching of XML Databases , Fifth International Workshop on the Web and Databases (WebDB), 2002. http://www.db.ucsd.edu/People/michalis/pubs/webdb02.pdf Relational Wrapper for Navigation-Driven Lazy Mediator Talk given at San Diego Supercomputer Center as part of the MIX project in August 99. http://www.db.ucsd.edu/People/michalis/presentations/rdbwrapper.pdf Languages for Semistructured Data Talk given at the Fall 99 Advanced Database Topics Seminar (CSE 291) of the CSE Department of UCSD. http://www.db.ucsd.edu/People/michalis/presentations/qlsemi.pdf Jean-Robert Gruser, Louiqa Raschid , Vladimir Zadorozhny , Tao Zhan, Learning response time for Web Sources using query feedback and application in query optimization, The VLDB Journal, 9(1) 18-37 http://link.springer.de/link/service/journals/00778/papers/0009001/00090018.pdf#xml=http://athene.em.springer.de/ search97cgi/s97_cgi?action=view&VdkVgwKey=%2Fjour%2Fjour%2F00778%2Fpapers %2F0009001%2F00090018.pdf&doctype=xml&collection=springer02&queryZIP=%28%22wrapper%22%29AND %28%22web%22%29 Jaeyoung Yang, Jungsun Kim, Kyoung-Goo Doh, and Joongmin Choi, Wrapper Generation by Using XML-Based Domain Knowledge for Intelligent Information Extraction. Springer Lecture Notes in Computer Science, LNAI 2417, pp 472--??, 2002. http://link.springer.de/link/service/series/0558/papers/2417/24170472.pdf Jaeyoung Yang, Eunseok Lee and Joongmin Choi, A Shopping Agent That Automatically Constructs Wrappers for Semi-structured Online Vendors. Lecture Notes in Computer Science, Vol. 1983, p. 368-??, 2000. 
http://link.springer-ny.com/link/service/series/0558/papers/1983/19830368.pdf Jaeyoung Yang, Heekuck Oh, Kyung-Goo Doh and Joongmin Choi, A Knowledge-Based Information Extraction System for Semi-structured Labeled Documents. Lecture Notes in Computer Science, Vol. 2412, p. 105-??, 2002. http://link.springer-ny.com/link/service/series/0558/papers/2412/24120105.pdf Gunter Grieser , Klaus P. Jantke , Steffen Lange , Bernd Thomas, A Unifying Approach to HTML Wrapper Representation and Learning, LNCS1967, pp 50- http://link.springer.de/link/service/series/0558/papers/1967/19670050.pdf#xml=http://athene.em.springer.de/search9 7cgi/s97_cgi?action=view&VdkVgwKey=%2Fjour%2Fseries%2F0558%2Fpapers %2F1967%2F19670050.pdf&doctype=xml&collection=springer02&queryZIP=%28%22wrapper%22%29AND %28%22web%22%29 The TSIMMIS Web Site: http://www-db.stanford.edu/tsimmis/ TSIMMIS Publications http://www-db.stanford.edu/tsimmis/publications.html Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, Ramana Yerneni, Marcus Breunig, Vasilis Vassalos, "Template-Based Wrappers in the TSIMMIS System". In Proceedings of the Twenty-Sixth SIGMOD International Conference on Management of Data, Tucson, Arizona, May 12-15, 1997, pp 532-535. ftp://www-db.stanford.edu/pub/papers/wrapper-demo.ps http://citeseer.nj.nec.com/hammer97templatebased.html Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, J. Ullman. "A Query Translation Scheme for Rapid Implementation of Wrappers". In International Conference on Deductive and Object-Oriented Databases, 1995. ftp://www-db.stanford.edu/pub/papers/querytran.ps
    • also available as PDF file from citeseer J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. "Extracting Semistructured Information from the Web". In Proceedings of the Workshop on Management of Semistructured Data. Tucson, Arizona, May 1997. ftp://www-db.stanford.edu/pub/papers/extract.ps Chen Li, Ramana Yerneni, Vasilis Vassalos, Hector Garcia-Molina, Yannis Papakonstantinou, Jeffrey Ullman, Murty Valiveti. "Capability Based Mediation in TSIMMIS". SIGMOD 98 Demo, Seattle, June 1998. http://www-db.stanford.edu/pub/papers/cap.ps V. Vassalos , Y. Papakonstantinou. "Describing and Using Query Capabilities of Heterogeneous Sources". In VLDB Conference, Athens, Greece, August 1997. ftp://www-db.stanford.edu/pub/papers/query-cap-ext.ps Useful classified links to papers etc available on the web, and others: Post-Modern Database Systems: Databases Meet the Web http://db.cs.berkeley.edu/postmodern/ Data Integration at the University of Washington http://data.cs.washington.edu/integration/ Data Management and Intelligent Internet Systems http://www.cs.washington.edu/research/irdb.intro.html Gunter Grieser, Klaus P. Jantke, Steffen Lange, Bernd Thomas, A Unifying Approach to HTML Wrapper Representation and Learning, DS 2000, Kyoto, Japan, 4.-6.12.2000 http://www.dfki.de/~lexikon/Publikationen/Files/GJLT-DS-2000.pdf Gunter Grieser,Steffen Lange, Learning Approaches to Wrapper Induction FLAIRS 2001, 21-23 May 2001, Key West, FL http://www.dfki.de/~lexikon/Publikationen/Files/FLAIRS-2001-GL.pdf Steffen Lange, Gunter Grieser, Klaus P. Jantke, Extending Elementary Formal Systems Algorithmic Learning Theory, 12th International Conference, ALT 2001 LNAI 2225, pp. 332 - 347, 2001. http://www.dfki.de/~lexikon/Publikationen/Files/ALT-2001-LGJ.ps Manuel Álvarez, Alberto Pan, Juan Raposo, Fidel Cacheda, Ángel Viña: FINDER: A Mediator System for Structured and Semi-Structured Data Integration. 
847-851 Juan Raposo, Alberto Pan, Manuel Álvarez, Justo Hidalgo, Ángel Viña: The Wargo System: Semi-Automatic Wrapper Generation in Presence of Complex Data Access Modes. DEXA Workshops 2002: 313-320 Alberto Pan, Juan Raposo, Manuel Álvarez, Justo Hidalgo, Ángel Viña: Semi-Automatic Wrapper Generation for Commercial Web Sources. Engineering Information Systems in the Internet Context 2002: 265-283 Alberto Pan, Paula Montoto, Anastasio Molano, Manuel Álvarez, Juan Raposo, Vicente Orjales, Ángel Viña: Mediator Systems in E-Commerce Applications. WECWIS 2002: 228-235 Proceedings of the First International Workshop on Web Document Analysis (WDA2001) Seattle, Washington, USA. September 8, 2001 (in association with ICDAR'01) http://www.csc.liv.ac.uk/~wda2001/ A. Rahman and H. Alam, Content extraction from HTML documents, Proceedings of the First International Workshop on Web Document Analysis (WDA2001), Seattle, Washington, USA. September 8, 2001 (in association with ICDAR'01) http://www.csc.liv.ac.uk/~wda2001/Papers/11_rahman_wda2001.pdf V. Lakshmi, A.-H. Tan and C.-L. Tan, Web structure analysis for information mining Proceedings of the First International Workshop on Web Document Analysis (WDA2001) , Seattle, Washington, USA. September 8, 2001 (in association with ICDAR'01)
    • http://www.csc.liv.ac.uk/~wda2001/Papers/18_lakshmi_wda2001.pdf Yuan Jiang, Record-Boundary Discovery In Web Documents, MSc Dissertation, 1998. http://citeseer.nj.nec.com/rd/97012851%2C294151%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/compre ss/0/papers/cs/14590/http:zSzzSzosm7.cs.byu.eduzSzdegzSzpaperszSzSJ.Thesis.ps.gz/jiang99recordboundary.ps David W. Embley, Y. S. Jiang, Yiu-Kai Ng: Record-Boundary Discovery in Web Documents. SIGMOD Conference 1999: 467-478 Publications of Embley's Data Extraction Group at Brigham Young University: http://www.deg.byu.edu/ David W. Embley, Cui Tao, and Stephen W. Liddle, Automating the Extraction of Data from HTML Tables with Unknown Structure, submitted, May 2003. (29 pages) http://www.deg.byu.edu/papers/dke2003etl.pdf Stephen W. Liddle, Kimball A. Hewett, and David W. Embley, An Integrated Ontology Development Environment for Data Extraction, submitted, April 2003. http://www.deg.byu.edu/papers/ista2003.pdf Tim Chartrand, Ontology-Based Extraction of RDF Data from the World Wide Web, Masters Thesis, March 2003. http://www.deg.byu.edu/papers/tim_thesis.pdf Li Xu and D.W. Embley, Combining the Best of Global-as-View and Local-as-View for Data Integration, submitted. http://www.deg.byu.edu/papers/PODS.integration.pdf S.W. Liddle, D.W. Embley, D.T. Scott, and S.H. Yau, Extracting Data Behind Web Forms, Proceedings of the Workshop on Conceptual Modeling Approaches for e-Business, Finland, October, 2002. http://www.deg.byu.edu/papers/vldb02.pdf Sai Ho (Tony) Yau, Automating the Extraction of Data Behind Web Forms, Masters Thesis, December 2001. http://www.deg.byu.edu/papers/TonyYauThesis.doc On the Automatic Extraction of Data from the Hidden Web by S.W. Liddle, S.H. Yau, and D.W. Embley, Proceedings of the International Workshop on Data Semantics in Web Information Systems (DASWIS-2001), Yokohama, Japan, 27-30 November 2001. (181K .pdf) http://www.deg.byu.edu/papers/daswis01.pdf D.W. Embley and L. 
Xu, Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents, WebDB'00 Proceedings http://www.deg.byu.edu/papers/WebDB00.ps D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.-K. Ng, R.D. Smith, Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages (1999), Data & Knowledge Engineering, 31(3) 227-251. http://citeseer.nj.nec.com/rd/15353230%2C389588%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/compress/0/papers/cs/18642/http:zSzzSzwww.deg.byu.eduzSzpaperszSzdke99.ps.gz/embley99conceptualmodelbased.ps This 30-page paper uses 'Stanford Certainty Theory' for which they reference the following book: Luger, G.F., Stubblefield, W.A.: Artificial Intelligence: Structures and Strategies for Complex Problem Solving, Third Edition. Addison Wesley Longman, Inc., (1998) Lakshmi Vijjappu, Ah-Hwee Tan, and Chew-Lim Tan. Web Structure Analysis for Information Mining. In proceedings, ICDAR'01 Workshop on Web Document Analysis, Seattle, September 10-13, 2001. http://textmining.krdl.org.sg/people/ahhwee/papers/web_analysis_wda01.pdf Jiang T, Wang L and Zhang K, "Alignment of trees - an alternative to tree edit", Theoretical Computer Science, Vol. 143, No. 1, 1995, pp. 137-148 L. Wang, T. Jiang and D. Gusfield, A more efficient approximation scheme for tree alignment, SIAM J. Comput. 30(1), 283-299. 2000. 17 pages long. http://www.cs.cityu.edu.hk/~lwang/research/siamj00.pdf
    • Chia-Hui Chang and Shao-Chen Lui, IEPAD: Information Extraction based on Pattern Discovery, Proc 10th International Conference on World Wide Web, WWW10, May 2-5, 2001. http://www10.org/cdrom/papers/223/ This is a web document (in html) which prints as 11 pages. Wen-Tau Yih MSc Thesis: Template-based Information Extraction from Tree-structured HTML Documents. National Taiwan University (1997). 98 pages. http://citeseer.nj.nec.com/rd/97012851%2C36105%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/2073/http:zSzzSzmobile.csie.ntu.edu.twzSz%7Er4526048zSzDocumentszSzthesis.pdf/yih97templatebased.pdf Jane Yung-Jen Hsu, Wen-tau Yih, Template-Based Information Mining from HTML Documents, Proc AAAI-97, pp 256-262. (1997) 7 pages. http://citeseer.nj.nec.com/rd/38223961%2C242103%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/11305/http:zSzzSzhugo.csie.ntu.edu.twzSz%7EwtyihzSzDocumentszSzTbIE.pdf/hsu97templatebased.pdf Chun-Nan Hsu, Ming-Tzung Dung, Generating Finite-State Transducers For Semi-Structured Data Extraction From The Web, Information Systems 23(8) 521-536, 1998. (18 pages). http://citeseer.nj.nec.com/rd/97012851%2C127191%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/5709/http:zSzzSzwww.iis.sinica.edu.twzSz%7EchunnanzSzDOWNLOADSzSzjis2.pdf/hsu98generating.pdf Adaptive Internet Intelligent Agents Research Group (SoftMealy wrapper/extractor) http://chunnan.iis.sinica.edu.tw/software.html IEEE Data Engineering Bulletin .. online papers: http://www.research.microsoft.com/research/db/debull/issues-list.htm Serge Abiteboul, Issues in Monitoring Web Data, Proc. 13th International Conference on Database and Expert Systems Applications, DEXA 2002, pp1-8. (LNCS 2453) Serge Abiteboul: Querying Semi-Structured Data. 
ICDT, 1997 http://dbpubs.stanford.edu:8090/pub/1996-19 Jason McHugh, Serge Abiteboul, Roy Goldman, Dallan Quass, Jennifer Widom, Lore: A Database Management System for Semistructured Data (1997), SIGMOD Record 26(3) 1997, pp54-66. http://citeseer.nj.nec.com/mchugh97lore.html Fifth International Workshop on the Web and Databases (WebDB 2002) Madison, Wisconsin - June 6-7, 2002. Links to papers: http://feast.ucsd.edu/webdb2002/papers.html Links to previous WebDB workshops: http://feast.ucsd.edu/webdb2002/previous.html J. Cowie, Y. Wilks, Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker. (2000) http://www.dcs.shef.ac.uk/~yorick/papers/infoext.pdf Other IE papers by Yorick Wilks at Sheffield: http://www.dcs.shef.ac.uk/~yorick/papers.html Report on Discussion Group III: Web Content Extraction and Mining, from the First International Workshop on Web Document Analysis (WDA 2001) http://www.csc.liv.ac.uk/~wda2001/Discussions/Klink_Hurst/Klink_Hurst.html Tao Guan and Kam Fai Wong, KPS --- a Web Information Mining Algorithm, Proc WWW8 Conference May 1999.
    • http://www8.org/w8-papers/4a-search-mining/kps/kps.html William W. Cohen's Papers: Rule Learning http://www-2.cs.cmu.edu/~wcohen/pubs-r.html More papers by William W. Cohen http://www-2.cs.cmu.edu/~wcohen/pubs-s.html William Cohen, Matthew Hurst & Lee S. Jensen, A Flexible Learning System for Wrapping Tables and Lists in HTML Documents (HTML), in WWW-2002 (2002). http://www2002.org/CDROM/refereed/355/ William Cohen, David McAllester, and Henry Kautz, Hardening Soft Information Sources (Postscript), in KDD-2000 (2000). http://www-2.cs.cmu.edu/~wcohen/postscript/kdd-2000.ps William W. Cohen and Wei Fan, Learning Page-Independent Heuristics for Extracting Data from Web Pages, Proc The Eighth International World Wide Web Conference, May 11-14, 1999 http://www8.org/w8-papers/5a-search-query/learning/ WWW8 Conference Refereed Papers: http://www8.org/fullpaper.html Chia-Hui Chang, Shao-Chen Lui, and Yen-Chin Wu, Applying Pattern Mining to Web Information Extraction, Proc. PAKDD 2001, Knowledge Discovery and Data Mining, 5th Pacific-Asia Conference, Hong Kong, April 2001. Lecture Notes in Computer Science 2035, pp 4-16. LNAI 2035. http://link.springer.de/link/service/series/0558/papers/2035/20350004.pdf Fuchun Peng, Models for Information Extraction http://citeseer.nj.nec.com/489954.html Daniela Florescu, Alon Levy, and Alberto Mendelzon, Database Techniques for the World-Wide Web: A Survey. SIGMOD Record, 27(3):59-74 (1998). http://citeseer.nj.nec.com/cache/papers/cs/1996/ftp:zSzzSzftp.db.toronto.eduzSzpubzSzpaperszSzsigrec.pdf/florescu98database.pdf Web version: http://oopsla.snu.ac.kr/xweet/seminar/990728-jmjeong/www.html Ion Muslea, Steve Minton, Craig Knoblock, A hierarchical approach to wrapper induction, Proceedings of the Third International Conference on Autonomous Agents (Agents'99), Seattle, 1999. http://citeseer.nj.nec.com/muslea99hierarchical.html I. Muslea. Extraction Patterns for Information Extraction Tasks: A Survey. 
In Proceedings of Workshop on Machine Learning and Information Extraction (AAAI-99) pp. 1-6, Orlando, Florida, 1999 http://blondie.cs.byu.edu/CS652/muslea99extraction.pdf Nicholas Kushmerick, Gleaning the Web, IEEE Intelligent Systems, Vol. 14, No. 2, March/April 1999, pp. 20-22 http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-ieeeis99.pdf Wensheng Wu, Clement Yu, Weiyi Meng, King-Lup Yu, Text Database Selection for Longer Queries, 24 pages (this concerns meta-search engine implementation). 2002. http://citeseer.nj.nec.com/455651.html Andreas Eberhart, Survey of RDF data on the Web (2002). http://citeseer.nj.nec.com/eberhart02survey.html Andreas Eberhart, SmartGuide: An Intelligent Information System basing on Semantic Web Standards. http://www.i-u.de/schools/eberhart/icai2002.pdf Michael K. Bergman, The Deep Web: Surfacing Hidden Value, Journal of Electronic Publishing 2001 vol 7. This White Paper is a version of the one on the BrightPlanet site. http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/deepwebwhitepaper.pdf
    • S. Lawrence and CL Giles, Accessibility of information on the Web, Nature, Vol. 400, pp. 107-109, 1999. http://wwwmetrics.com/ D. Buttler and L. Liu and C. Pu, A Fully Automated Object Extraction System for the World Wide Web. Proc. Intl. Conf. on Distributed Computing Systems, 2001. pp 361 - 371. http://citeseer.nj.nec.com/rd/67081539%2C427047%2C1%2C0.25%2CDownload/http%3AqSqqSqwww.cc.gatech.eduqSqprojectsqSqinfosphereqSqpapersqSqfinal-icdcs01.ps David Buttler, Terence Critchlow, Using Meta-Data to Automatically Wrap Bioinformatics Sources (.pdf) In ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) Workshop on Objects, XML, and Databases, October 2001. http://www.cc.gatech.edu/~buttler/DAML/OOPSLA_01.pdf Ling Liu, David Buttler, Terence Critchlow, Wei Han, Henrique Paques, Calton Pu, Daniel Rocco. BioZoom: Exploiting Source-Capability Information for Integrated Access to Multiple Bioinformatics Data Sources. In Proc. of 3rd IEEE Symposium on Bioinformatics and Bioengineering, 2003. http://disl.cc.gatech.edu/SDM/papers/bibe03.pdf David Buttler's home page with papers links: http://www.cc.gatech.edu/~buttler/ Kazem Taghva, Allen Condit, and Julie Borsack. Autotag: A tool for creating structured document collections from printed materials, Proc. Electronic Publishing, Artistic Imaging, and Digital Typography Conference, EP'98 and RIDT'98, St Malo, France, April 1998, pages 420-431. LNCS 1375. http://www.isri.unlv.edu/publications/isripub/Taghva98b.pdf See also links on page: http://www.isri.unlv.edu/publications/isri-conf.php Kazem Taghva, Allen Condit, and Julie Borsack, An evaluation of an automatic markup system. Proc. IS&T/SPIE 1995 Intl. Symp. 
on Electronic Imaging Science and Technology, San Jose, CA, February 1995 http://www.isri.unlv.edu/publications/isripub/Taghva95a.ps Theodore W Hong, Keith L Clark, Using Grammatical Inference to Automate Information Extraction from the Web, Proc PKDD 2001, 5th European Conference on Principles of Data Mining and Knowledge Discovery, 2001 pp 216-226. (LNCS 2168) http://www.springerlink.com/app/home/content.asp?wasp=ecxxykvutncq491hfj6u&referrer=contribution&format=2&page=1 William W. Cohen's Papers: Text Categorization http://www-2.cs.cmu.edu/~wcohen/pubs-t.html A list of papers with links, including the following: William Cohen, Improving A Page Classifier with Anchor Extraction and Link Analysis (PDF), in NIPS-2002 (2002). http://www-2.cs.cmu.edu/~wcohen/postscript/nips-2002.pdf Un Yong Nahm and Raymond J. Mooney, Text Mining with Information Extraction. Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, pp. 60-67, Stanford, CA, March 2002. http://www.cs.utexas.edu/users/ml/papers/discotex-aaaisymp-02.pdf Mattis Neiling, Markus Schaal, Martin Schumann, WrapIt: Automated Integration of Web Databases with Extensional Overlaps. 2nd Intl. Workshop of the Working Group "Web and Databases" of the German Informatics Society (GI) (Workshop WebDB 2002), Erfurt, Thuringia, Germany, October 9-10, 2002, pp 184-198. (LNCS 2593) http://citeseer.nj.nec.com/552045.html http://link.springer.de/link/service/series/0558/papers/2593/25930184.pdf Andreas Rauber, Oliver Witvoet, Andreas Aschenbrenner, Robert Bruckner, Putting the World Wide Web into a Data Warehouse: A DWH-based Approach to Web Analysis http://citeseer.nj.nec.com/546144.html
    • Arnaud Sahuguet and Fabien Azavant, Building intelligent Web applications using lightweight wrappers, Data & Knowledge Engineering, 36(3), March 2001, pp 283-316. Available via link on the W4F project's web page: http://db.cis.upenn.edu/Research/w4f.html Sonia Bergamaschi, Silvana Castano, Maurizio Vincini and Domenico Beneventano, Semantic integration of heterogeneous information sources, Data & Knowledge Engineering, 36(3), March 2001, pp 215-249. Matthias Klusch, Information agent technology for the Internet: A survey, Data & Knowledge Engineering, 36(3), March 2001, pp 337-372. http://www.dfki.de/%7Eklusch/papers/iat-dke-2000.zip Matthias Klusch Homepage (he edits the LNCS annual conf: International Workshop Series on Cooperative Information Agents) http://www.dfki.de/~klusch/ Paolo Atzeni, Giansalvatore Mecca, Paolo Merialdo, Semistructured and Structured Data in the Web: Going Back and Forth, Proc ACM SIGMOD Workshop on Management of Semistructured Data 1997, pp 1-8. http://citeseer.nj.nec.com/atzeni97semistructured.html Robert B. Doorenbos, Oren Etzioni, Daniel S. Weld, A Scalable Comparison-Shopping Agent for the World-Wide Web, Proceedings of the First International Conference on Autonomous Agents (Agents'97), pp 39-48, 1997. http://citeseer.nj.nec.com/doorenbos97scalable.html .. this link is to a 20-page technical report version of the ten-page conference paper. Daniela Florescu, Alon Levy, and Alberto Mendelzon, Database Techniques for the World-Wide Web: A Survey, SIGMOD Record 27(3), pp 59-74, 1998. http://citeseer.nj.nec.com/florescu98database.html web pages version of the above paper: http://oopsla.snu.ac.kr/xweet/seminar/990728-jmjeong/www.html Vladislav Shkapenyuk, Torsten Suel, Design and Implementation of a High-Performance Distributed Web Crawler, Proc 18th Intl Conf on Data Engineering (ICDE’02), pp 357–368, 2002. http://citeseer.nj.nec.com/shkapenyuk02design.html GERY, Mathias. CHEVALLET, Jean-Pierre. 
"Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages" International Workshop on Web Dynamics (London, UK, January 2001). 10 pages http://www.dcs.bbk.ac.uk/webDyn/webDynPapers/gery.ps SODERLAND, Stephen, Learning to Extract Text-based Information from the World Wide Web, Proc KDD'97, pp 251-254, 1997. http://www-nlp.cs.umass.edu/pubs/Soderland-KDD97.ps Stephen Soderland, Learning Information Extraction Rules for Semi-structured and Free Text (1999) Machine Learning 34(1-3) pp 233-272, 1999. http://citeseer.nj.nec.com/soderland99learning.html A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WEB MINING on-line list of links: http://www.dcc.uchile.cl/~ljaramil/investigacion/wmining.html Mary Elaine Califf's Web Site http://www.acs.ilstu.edu/faculty/mecalif/calif.htm URLS of data sources: RISE: Repository of Online Information Sources Used in Information Extraction Tasks http://www.isi.edu/info-agents/RISE/
    • OKRA .. Service has been discontinued. http://okra.ucr.edu/ BigBook http://www.bigbook.com/ Internet Address Finder http://www.iaf.net/ Quote Server http://www.secapl.com/cgi-bin/qs JOBS (a newsgroup) news:misc.jobs.offered L.A. Weekly Home Page http://www.laweekly.com/ L.A. Weekly Restaurants Guide http://www.laweekly.com/restaurants/search.html ZAGAT's Guide to Los Angeles Restaurants http://www.zagat.com/ Seattle Times Rentals http://classifieds.nwsource.com/classified/ also Jobs, Autos, RealEstate .. etc Seminar Announcements .. no URL provided by RISE repository: http://www-2.cs.cmu.edu/~dayne/SeminarAnnouncements/__Source__.html WHIRL: A Set of 111 Sources used by William Cohen in the WHIRL project http://www.isi.edu/info-agents/RISE/Original_WHIRL/__Source__.html The 34 WIEN sources, including OKRA, BigBook, Internet Address Finder, and Quote Server: URL no longer valid. The Road Runner Project: Towards Automatic Data Extraction from Large Web Sites http://www.dia.uniroma3.it/db/roadRunner/experiments.html includes a list of data sources and the experimental results for them. It includes the source pages they used, so it is very valuable for comparison experiments: amazon.com The most popular e-commerce Web site buy.com A popular e-commerce Web site wine.com An e-commerce Web site dedicated to wines uefa.com The official Web site of the European Football (Soccer) Association majorleaguebaseball.com The official Web site of Major League Baseball barnesandnoble.com A popular e-commerce Web site nba.com The Official NBA Web Site rpmfind.net A site hosting Linux RPM software packages. Data Sources used by Boris Chidlovskii, Jon Ragetli and Maarten de Rijke: Library of Congress
    • http://www.lcweb.loc.gov IICM http://www.iicm.edu CS Bibliography (Karlsruhe) http://liinwww.ira.uka.de/bibliography/index.html CS Bibliography (Trier) http://www.informatik.uni-trier.de/~ley/db/index.html ftpSearch .. their Belgian url is no longer available ftpSearch.de (various countries are still available, eg ftpsearch.lt) says use file search on alltheweb instead: http://www.alltheweb.com/ A list of on-line Bibliographies: http://zeeb.library.cmu.edu/bySubject/CS+ECE/bibs.html The COD Project, University of Wisconsin. (Semantic database integration.) Raghu Ramakrishnan, Coral Deductive DB, etc. http://www.cs.wisc.edu/~cod/ Eric Brill, Jimmy Lin, Michele Banko, Susan T. Dumais, Andrew Y. Ng: Data-Intensive Question Answering. (TREC 2001) http://trec.nist.gov/pubs/trec10/papers/Trec2001Notebook.AskMSRFinal.pdf Gestalts Project: Networked Databases Raghu Ramakrishnan, University of Wisconsin. Web page Oct '98: http://www.cs.wisc.edu/~raghu/gestalts/