Whither Digital Libraries? The case of a "billion-dollar ...

329 views

Published on

0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
329
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
4
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide
  • "Whither digital libraries? The case of a "billion- dollar" business" Abstract     For the last ten years, digital libraries (DL) research has followed a path of technology innovation that is likely to bring significant benefits to the knowledge society in the coming decade.  To achieve this next step, sustained investment in R&D is essential.  This talk discusses a model where DL would become a successful business by developing information products and services that make public e-contents more accessible and useful.  The premise is that, like other information technologies, DL's success as a business is both necessary and desirable.for it to be a viable field of study.  Some of the research and policy issues related to this model will be discussed. An e-government example will be presented in the context of making government a business partner as well as an investor in research.
  • Points to make: There is a similar challenge on the side of Information limitation: the gap between information production and information consumption. This and the next few slides dramatizes this challenge. The challenge is in many ways the beginning and crux of all the major challenges we face today.
  • Continuation of last slide. Point to make: How do we become a smarter information consumer? One issue is Digital Literacy – the digital equivalent of “Literacy” of past focusing on the ability to read and write. In the digital age, beyond R/W, one must be able to search information and make the judgment about the information you receive. Are the search engines helping us to achieve “digital literary”?
  • Additional explanations: Studies have shown that only a smaller fraction of the web pages totality are being covered by the search engines’ indexing schemes. Few search engines are even making any progress to cover more, let alone better.
  • Points to make: Indexing for search is not the only problem in getting at sharable information or knowledge. The simultaneous and selective use of multiple languages is both a strength and a barrier for any user. To truly sharable human endeavors and experiences, crossing the language barrier is an important issue for Knowledge Community research.
  • Unique Aspects of Government Information Services . While many of the R&D ingredients for digital government are shared with those in digital libraries, there are differences in their execution, delivery and application. A 1997 NSF report outlined the unique aspects of information services towards the Digital Government of the 21 st century:   Security, privacy and integrity as prime architectural and design criteria . This includes authentication, audit, and journaling or replication. Government, unlike companies, cannot trade dollars for security, privacy exposure. Scale. This includes considerations of size of databases, interagency access, and distributed systems network. Government, unlike companies, does not have a core business; it is in every business. Citizen as a customer. This includes considerations of equality of access, service level approaching commercial practice, virtual agencies, 24x7 operation, quality of data, multiple language support, and freedom of information. Government must be open to everyone all the time. Government as a customer for Information Technology. This includes considerations of procurement from multiple vendors, and meeting stringent bidding requirements. Government must foster cooperation, innovation, and competitiveness. Unique systems and applications. This includes law enforcement, environmental protection agencies, patent and trademark office, emergency management, and regulatory agency. Government is the only information provider of unique services.
  • Making the public (government) information accessible and useful poses two major challenges to the research community: The first is integrating large, dispersed collections of data compiled by different people at different times and for different purposes. The second is overcoming the limitations of the browser paradigm to disseminate complex information derived from these multiple sources. The example below illustrate attempts to address these problem.
  • The Energy Data Collection (EDC) project, began in 1999 at the USC/ISI, is a pilot of the National Science Foundation’s DG program. Working with the representatives of federal and state agencies and other organizations, the EDC is building a system for disseminating energy data from the Bureau of Labor Statistics, the Census Bureau, the Department of Energy, and the California Energy Commission. The EDC system’s architecture includes the following major components: an integrated ontology, a users interface, a query processor, a domain model, and the data sources. The ontology includes a high-level general concept taxonomy and links to the domain model, which describes the contents of new databases and extends the ontology as needed. The interface facilitates construction of queries, which may involve ontology browsing and other interaction methods.
  • This figure shows a fragment of an EDC domain model that describes time-series data about different gasoline products. Each concept models a class of database contents. The user frames queries in terms of these concepts and their variants. Time series data include dimensions (in red) such as product, property measured (price, volume), location and unit of measure. The dimensions may also provide metadata that describes the series.
  • This figure shows the relationship between EDC ontology and domain models. New concepts derived from glossaries (in red), by Columbia University’s Automatic Lexical Knowledge Base (ALKB), extended the Sensus ontology (black nodes); then it was linked to the domain model of the actual gasoline data (nodes in blue), by the domain-specific ontologies using SIMS – Single Interface to Multiple Sources. In turn, the domain models were linked to actual databases. A central problem with linking agency-specific domain models is determining where a given term belongs in such a large ontology. The EDC research project has uncovered a variety of heuristics that help with the identification and alignment process: a five-step procedure: heuristics that make initial cross-ontology alignment suggestions a function for integrating these suggestions alignment validation criteria and heuristics a repeated integration cycle an evaluation metric Key point to make: 1. linking information search research to databases, artificial intelligence, heuristics. 2. Importance of metadata that allowed data models and ontologies to be easily extended and properly linked.
  • Whither Digital Libraries? The case of a "billion-dollar ...

    1. 1. Whither Digital Libraries? The case of a “billion-dollar” business Yi-Tzuu Chien School of Library and Information Science University of Tsukuba [email_address] October 31, 2002
    2. 2. Outline <ul><li>Need for a business model </li></ul><ul><li>Vision of digital libraries: then and now </li></ul><ul><li>Making e-contents accessible, useful and profitable: Reversing the steps of research-to-applications paradigm </li></ul><ul><li>An example in digital government: Turning government into a business partner and research investor </li></ul><ul><li>Connections to the Knowledge Society </li></ul>
    3. 3. Digital Library (Circa 1994) <ul><li>Vision – then and now </li></ul><ul><li>A digital network of knowledge systems - connecting computing, information, and people resources </li></ul><ul><li>A set of enabling technologies - for creating, distributing, and using knowledge in human-centered multimedia, multi-modal environments </li></ul><ul><li>New information services - in networked education, commerce, health care, transportation, government, and others, beyond those provided by traditional libraries and information sources </li></ul><ul><li>Ubiquitous, public, and personal – open 24 hours and is accessible where the network is </li></ul>
    4. 4. DL Roadblocks <ul><li>How much information? Production outpaces consumption </li></ul><ul><li>Research focuses on technological innovation, not on user needs </li></ul><ul><li>Lack of a business model and incentives for making public e-contents accessible </li></ul><ul><li>Commercial success in non-public domains (music, games, etc.) overshadows real DL applications in public sector </li></ul><ul><li>Slow government actions in last decade, but the landscape is changing. </li></ul>
    5. 5. Information Glut World production of data: 1999 estimates <ul><li>Magnetic 1,693,000 terabytes </li></ul><ul><ul><li>PC disk drives, departmental servers, camcorder tape, enterprise servers </li></ul></ul><ul><li>Film 427,000 </li></ul><ul><ul><li>Photograph, X-rays, cinema </li></ul></ul><ul><li>Paper 240 </li></ul><ul><ul><li>Office documents, newspapers, periodicals, books </li></ul></ul><ul><li>Optical 80 </li></ul><ul><ul><li>Music CDs, DVDs, Data CDs </li></ul></ul><ul><li>Grand Total ~ 2,120,000 terabytes </li></ul><ul><li>Source: Lyman and Varian, UC Berkeley </li></ul>
    6. 6. Information Consumption <ul><li>Total time American households spend reading, watching TV or listening to music: </li></ul><ul><li>1992: 3,324 hours </li></ul><ul><li>2000: 3,380 hours </li></ul><ul><li>Bits consumed: 3,344,783 megabytes </li></ul><ul><li> or ~ 3 Terabytes </li></ul><ul><li>(Bits created: ~2,120,000 Terabytes) </li></ul><ul><li>Source: Lyman and Varian, UC Berkeley </li></ul>
    7. 7. Search Information on the Internet Source: Global Reach GG: Goggle FAST: Fast Search AV: Alta Vista INK: Inktomi NL: Northern Light
    8. 8. Source: Global Reach Sharing Information on the Internet
    9. 9. Where is the e-Content Business? Source: U.S. Department of Commerce Report “ Digital Economy 2002”
    10. 10. U.S. Information Technology Producing Industries Gross Domestic Income 2000, $Millions <ul><li>Computing Hardware 251,655 </li></ul><ul><li>Software and Services 245,656 </li></ul><ul><li>Communications (hw&services) 299,256 </li></ul><ul><li>______________________________________ </li></ul><ul><li>Total IT-producing Industries 796,567 </li></ul><ul><li>Total National GDI 10,003,400 </li></ul><ul><li>IT share of economy 8.0% </li></ul>
    11. 11. Source: U.S. Commerce Report “ Digital Economy 2002” Information Retrieval Services
    12. 12. Source: U.S. Department of Commerce Report “ Digital Economy 2002”
    13. 13. Making e-Content a Business an European model <ul><li>Focus of Activity </li></ul><ul><ul><li>improving access to and expending use of public sector information </li></ul></ul><ul><ul><li>enhancing content production in a multilingual and multicultural environment </li></ul></ul><ul><ul><li>increasing dynamism of the digital content market </li></ul></ul><ul><li>An ambitious, multi-year r&d program designed to take the lead in e-content business worldwide </li></ul><ul><ul><li>research grants, demonstration projects, forging private-public partnerships, building tools and infrastructure, seeking new market spaces </li></ul></ul><ul><li>Addresses several of the DL roadblocks </li></ul>
    14. 14. Accessing Public e-Content Beyond the walls of libraries <ul><li>Thematic areas of e-Content </li></ul><ul><ul><li>traditional arts, cultural heritage, archives, museums, libraries </li></ul></ul><ul><ul><li>legal, administrative, and institutional data </li></ul></ul><ul><ul><li>financial, economic, and commerce data </li></ul></ul><ul><ul><li>entertainment, tourism, traffic/transportation information </li></ul></ul><ul><ul><li>geographic, agricultural, and environmental data </li></ul></ul><ul><ul><li>location-based services at the regional or national levels (education, health, crisis management, etc.) </li></ul></ul><ul><ul><li>data relating to health, safety, and consumer protection including emergency services </li></ul></ul><ul><ul><li>scientific and technical information (e.g., research publications, patents, data banks, standards, experimental testbeds, sharable software) </li></ul></ul><ul><li>Infrastructures for e-Content </li></ul><ul><ul><li>Collections, platforms, networks, organizations, standards, middleware services, etc. </li></ul></ul>
    15. 15. Enhancing e-Content Production: across institutional, cultural, national borders <ul><li>Thematic areas </li></ul><ul><ul><li>developing new strategies, partnerships, and solutions for designing and producing e-contents and services </li></ul></ul><ul><ul><li>focusing on e-contents and their multilingual and multicultural interfaces and the associated user/customer services </li></ul></ul><ul><ul><li>leveraging local, national, and global resources and expertise </li></ul></ul><ul><li>Three content communities as stakeholders </li></ul><ul><ul><li>“ commercial” content community (in place) </li></ul></ul><ul><ul><li>“ corporate” content community (private and public sector, e.g, local or federal government) </li></ul></ul><ul><ul><li>“ public” content community, including public-private partnerships for a wider deployment of public e-contents </li></ul></ul><ul><li>Localization and internationalization at the same time </li></ul>
    16. 16. Increasing Dynamism of the e-Content Business <ul><li>Bridging the gap between the e-Content business and the capital market </li></ul><ul><ul><li>Providing different channels to increase access to capital resources by various players </li></ul></ul><ul><ul><li>Making players aware of available business and tools services </li></ul></ul><ul><ul><li>Addressing the intellectual property rights and rights trading between e-Content players </li></ul></ul><ul><li>e-Europe may be more ambitious, but e-Japan may get there first </li></ul><ul><ul><li>i-mode: Successful business model for private e-Content </li></ul></ul><ul><ul><li>Advantageous IPR policy (www.wtec.org/pdf/dio.pdf) </li></ul></ul>
    17. 17. Source: e-Japan program, Office of the Prime Minister of Japan
    18. 18. Source: e-Japan program, Office of the Prime Minister of Japan
    19. 19. Source: e-Japan program, Office of the Prime Minister of Japan
    20. 20. Digital Government (DG) An example of applying Digital Libraries technology <ul><li>Components of Investment </li></ul><ul><ul><li>Vision: The PITAC report www.ccic.gov/pubs/pitac/index.html </li></ul></ul><ul><ul><li>Research: Linkage to DLI programs; DG Research initiative by NSF www.cise.nsf.gov/eia/dg </li></ul></ul><ul><ul><li>Implementation: All government levels, led by the Federal agencies www.firstgov.gov/ </li></ul></ul><ul><li>Dimensions of System Design </li></ul><ul><ul><li>Architectural relationship they have with their clients </li></ul></ul><ul><ul><li>Types of services they can provide to their clients </li></ul></ul>
    21. 21. Unique Aspects of Government Information Services <ul><li>Security, privacy, and integrity as prime architectural and design criteria </li></ul><ul><li>Scale & scope: Instead of a core business, government is in every business </li></ul><ul><li>All citizens and organizations as its equal customers </li></ul><ul><li>Government as a huge customer for information technology: leverage and limitations </li></ul><ul><li>Diversity of systems and applications </li></ul>
    22. 22. Service Levels for a Digital Government System Level Key functions and uses e-Contents and management First (low) Provide one-way communication for displaying information about a given agency or aspect of government Usually fixed type, limited to a single domain, one medium, simple data structure Second Provide simple two-way communication capabilities, usually for uncomplicated types of data collection such as registering comments Similar to level 1, but may need more complex data structure and management Third Facilitate complex transactions that may involve interagency workflows and legally binding procedures. Examples are health and welfare services Usually involves multiple databases and ontologies; need collaboration and coordination among agencies and with private sector, e.g., service providers Fourth (high) Integrate a wide range of services across a whole government administration and possibly several governments, domestic and international. Examples are crisis management and immigration & custom services. Usually requires a hierarchy of ontologies and database structures; extensive coordination and collaboration among agencies; partnerships w/ private sector in content development and management
    23. 23.   Research Areas for Digital Government Initiative Topical Areas Research Description Illustrative Examples Intelligent Information Integration Shared ontologies; metadata; sw tools Mediation of multimedia data; Collaboration tools Content searching for government data; Information systems for crisis management Very Large-scale Data Acquisition and Management Technologies to acquire, integrate, view, and assure the integrity of geographic, biological, environmental, and economic data and metadata Access to linked statistical data sources in the 70+ agencies; A master U.S. data center for Crisis and emergency management Advanced Analytics for Large Data Collections Infrastructure to broadcast range of data analysis techniques; Visualization of large and complex data sets Data mining facilities and computing services for citizens; Information-on-demand services for emergency management Electronic Transaction and e-Commerce Techniques Common transaction media between government and citizens; Data integrity and authentication techniques; Migration strategies from batch transaction to online systems Electronic services delivered via WWW; Distributed kiosks at public sites for any-time transaction; Demonstrate capability of public key technology in multiple domains Information Services for ordinary Citizens/Customers Enhanced human-computer interactions, visualization and presentation technologies Kiosk-based access for multiple services; Universal access for citizens with varied physical capabilities Applications of IT to Law, Regulation, and other Mission Domains Research on information, store, access, and management specific to mission agencies Archiving, record keeping, and preservation; Systems in support of law enforcement and regulatory process with citizen inputs Information Services for Large-scale Government R&D Projects Engineering software and other computing services for large national projects in dedicated missions or across agencies NASA launch monitoring and control; Bureau of Census integrated data services; Information services linking Social Security Administration and Health Services
    24. 24. The Energy Data Collection (EDC) Project: System Architecture Source: NSF DG Pilot Project at USC/ISI
    25. 25. Fragment of an EDC domain model Source: NSF DG Pilot project at USC/ISI
    26. 26. EDC Ontology and Domain Models Source: NSF Pilot DG project at USC/ISI
    27. 27. Some References for the U.S. DG Initiative <ul><li>“ Transforming Access to government through information technology”, PITAC report to the President, Sep. 2000; http://www.ccic.gov/pubs/index.html/ </li></ul><ul><li>“ Information Technology Research, Innovation, and E-Government”, National Research Council publication, 2002; http://www7.nationalacademies.org/ </li></ul><ul><li>NSF Digital Government research initiative; http://www.nsf.gov/eia/ </li></ul><ul><li>Special Issue on Digital Government, IEEE Computer, Feb. 2001; http://computer.org/ </li></ul>
    28. 28. DL Cross-cutting Issues <ul><li>Architectural levels </li></ul><ul><ul><li>Applications, User services, Domain Knowledge Management, Collection Management, Data Handling, Storage </li></ul></ul><ul><li>Distributed Repositories </li></ul><ul><ul><li>standards, tools, scalability, sustainability </li></ul></ul><ul><li>Integration and Interoperability </li></ul><ul><ul><li>local, regional, global collections </li></ul></ul><ul><ul><li>data, access, service levels </li></ul></ul><ul><ul><li>Core business is DL middleware </li></ul></ul>
    29. 29. Creating the Core Business <ul><li>Metadata providing information about the unlimited resources on the Web (e.g., the W3C semantic web activity, the Dublin Core Initiative, Resource Framework, etc.) </li></ul><ul><li>Automated processing of Web information by software agents, including new concepts of search engines ( next Google?) </li></ul><ul><li>Facilitating applications that require open and public rather than constrained and proprietary contents </li></ul><ul><li>Internetworking between applications: e.g., merging contents from multiple applications to create new information </li></ul><ul><li>Usability a top priority: to do for the applications contents what the Web has done for hypertext: to allow contents to be processed outside the environment in which they were created at the Internet scale </li></ul>
    30. 30. The Anatomy of a Large-Scale Hypertextual Web Search Engine: Google Sergey Brin and Lawrence Page {sergey, page}@cs.stanford.edu Computer Science Department, Stanford University, Stanford, CA 94305 Abstract        In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/        To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.        Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.   Keywords : World Wide Web, Search Engines, Information Retrieval, PageRank, Google
    31. 31. Google, Inc.: from university research to business <ul><li>1994: DLI-1 initiative began; Stanford U Consortium funded for its Infobus project </li></ul><ul><li>1995: Grad students Larry Page and Sergey Brin developed a search technology called “BackRub” </li></ul><ul><li>1997: Research paper by Brin and Page, “The anatomy of a search engine Google”, published </li></ul><ul><li>1998: Page and Brin launched Google, Inc.; Search engine answered 10,000 queries per day </li></ul><ul><li>2002: www.google.com/corporate/facts.html. </li></ul><ul><ul><li>Answers more than 150 million queries daily </li></ul></ul><ul><ul><li>Searches more than 2 billion web pages </li></ul></ul><ul><ul><li>Has 55+ million unique users per month </li></ul></ul><ul><ul><li>Global reach: More than 50 percent of traffic is from outside the US; search covers some 80 languages </li></ul></ul>
    32. 32. Why a business model? Adding a DL entry to the innovation pipeline Source: NRC report on IT Research and Innovation
    33. 33. DL Middleware Milestones for a New Entry in the R&D pipeline <ul><li>1990 1995 2000 2005 2010 </li></ul>Initial Basic Research in DB, IR, HCI DLI phase1, other Nat'l DL projects, testbeds & digital collections DLI phase 2, e-Gov, e-Content, e-Japan, metadata, middleware University Research Industry Research New focus on public e-Contents, multilingual cross-culture interfaces Products $1B dollar Business Search engines, e-Content standards, tools, DL middleware, innovations tailored to uses, business models Google, i-mode, multilingual software, standards (Dublin core, OAI-PMH, RDF) Links to: Broadband last mile, WWW, Speech and language technology, Portable communications, Data mining, Relational databases ?
    34. 34. Digital Libraries Vision Re-visited 2002 <ul><li>A digital network of knowledge systems - connecting computing, information, and people resources </li></ul><ul><li>A set of enabling technologies - for creating, distributing, and using knowledge in human-centered multimedia, multi-modal environments </li></ul><ul><li>New information services - in networked education, commerce, health care, transportation, government, and others, beyond those provided by traditional libraries and information sources </li></ul><ul><li>Ubiquitous, public, and personal – open 24 hours and is accessible where the network is </li></ul><ul><li>Sustainable (technologically, socially, and economically) at the Internet scale </li></ul>

    ×