Irish Digital Libraries Summit

Irish Digital Libraries Summit Digital Libraries at the eve of the Next Generation Internet Sebastian Ryszard Kruk, Mary Burke, Stefan Decker http://wiki.corrib.deri.ie/index.php/SemDL/IrishDLSummit

Looking into the Future of Irish Digital Libraries

Why do we care? John teaches biology over the Internet, using digital libraries and modern technologies (wikis, blogs). How to deliver the material just in time? How to pre-assess students? How to automate most of the process?

Goals Present current solutions that bring digital libraries to the Next Generation Internet

Goals Gather opinions, requirements and future plans of Irish libraries

Goals Lay the groundwork for a funding application for a national digital libraries initiative under the EU FP7 Digital Libraries theme

Schedule
Semantic Digital Libraries
10:00-10:30 Get together, Welcome (Sebastian Kruk, Mary Burke)
10:30-11:00 Ontologies for Digital Libraries (Maciej Dąbrowski)
11:00-11:30 Building a Semantic Digital Library (Tomasz Woroniecki)
11:30-11:50 Coffee break
Future of Digital Libraries
11:50-12:00 Introduction to the session (Sebastian Kruk)
12:00-12:30 IBM Ontological Network Miner and its applications to semantic social networks (Alexander Troussov)
12:30-13:00 BRICKS Project (Predrag Knezevic)
13:00-14:00 Lunch break

Schedule
Digital Libraries in Ireland
14:00-14:30 The Irish Virtual Research Library and Archive Project – an infrastructure for humanities research (John McDonough)
14:30-15:00 OJAX: A Web 2.0 Search User Interface (Judith Wusteman)
15:00-15:30 With a Little Help from My Friends: Social Semantic Search and Browsing (Sebastian Kruk, Adam Gzella)
15:30-15:45 Coffee break
15:45-16:45 Discussion panel: Do we need Semantic Web and Web 2.0 technologies in Digital Libraries? (Mary Burke)
16:45-17:00 Wrap-up, Conclusions

Ontologies for Digital Libraries MarcOnt Initiative Maciej Dąbrowski Digital Enterprise Research Institute National University of Ireland, Galway maciej.dabrowski@deri.org

Outline Real-life and Semantic Web Semantic Web and Ontologies MarcOnt Ontology MarcOnt Tools Conclusions

Real-life problems Heterogeneous systems Identified problems: Interoperability, Format translation Multiple data formats in DL: How to support them? How to translate between them? Who should create mappings?

Real-life problems – users’ expectations Searching: Effective and Accurate: We want correct and fast answers!! Intuitive and Simple: Asking questions should be easy. Meaning: Jaguar – a car or an animal? Reasoning: Give me articles written by students of X in Galway. Identified problems: Intuitive interface for asking complex queries

Real-life problems - summary Digital Libraries should provide: Interoperability Support for many formats Complex search features Intuitive interfaces

The Semantic Web – A Brief Introduction Current Web vs. Semantic Web? An extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. [Tim Berners-Lee] Current Web was designed for humans, and there is little information usable for machines Was the Web meant to be more? Objects with well defined attributes as opposed to untyped hyperlinks between Internet resources A network of relationships amongst named objects, yielding unified information management tasks What do you mean by “Semantic”? the semantics of something is the meaning of something Semantic Web is able to describe things in a way that computers can understand

Semantic Web vs. Current Web Current Web Semantic Web

The Semantic Web – What is RDF? Describing things on the Semantic Web RDF (Resource Description Framework): a data format for describing information and resources, the fundamental data model for the Semantic Web Using RDF, we can describe relationships between things like: A is a part of B or Y is a member of Z and their properties (size, weight, age, price …) in a machine-understandable format RDF graph-based model delivers straightforward machine processing Putting information into RDF files makes it possible for “scutters” or RDF crawlers to search, discover, pick up, collect, analyse and process information from the Web

The Semantic Web – What is RDF? A simple RDF example Statement: “Stefan Decker is the creator of the resource (web page) http://www.stefandecker.org” Structure: Resource (subject): http://www.stefandecker.org Property (predicate): http://purl.org/dc/elements/1.1/creator Value (object): “Stefan Decker” Directed graph: http://www.stefandecker.org --dc:creator--> Stefan Decker
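
The subject, predicate and object of this statement can be written down as a plain tuple. The following is a minimal sketch in Python, assuming nothing beyond the standard library; the `describe` helper and the variable names are illustrative and do not belong to any RDF toolkit.

```python
# A minimal sketch: one RDF statement as a (subject, predicate, object) tuple.
# Plain Python only; describe() is an illustrative helper, not an RDF API.
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

statement = (
    "http://www.stefandecker.org",  # resource (subject)
    DC + "creator",                 # property (predicate)
    "Stefan Decker",                # value (object), a plain literal
)


def describe(triple):
    """Render a triple as one N-Triples-like line for reading."""
    subject, predicate, obj = triple
    return f'<{subject}> <{predicate}> "{obj}" .'


print(describe(statement))
```

A real application would normally rely on an RDF library rather than bare tuples; the tuple form is used here only to make the three roles visible.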

The Semantic Web – How can RDF help us? RDF lets us: identify objects; establish relationships; express a new relationship: just add a new RDF statement; integrate information from different sources: copy all the RDF data together. RDF allows many points of view
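
The two operations named on this slide, adding a statement and copying RDF data together, can be sketched with Python sets of triples. This is an illustration under stated assumptions: the two source sets, the `dc:language` and `ex:format` statements and the `objects` helper are made up for the example.

```python
# Sketch: RDF data modelled as sets of (subject, predicate, object) tuples.
# All statements except dc:creator are hypothetical sample data.
PAGE = "http://www.stefandecker.org"

source_a = {
    (PAGE, "dc:creator", "Stefan Decker"),
}
source_b = {
    (PAGE, "dc:creator", "Stefan Decker"),   # same statement, second source
    (PAGE, "dc:language", "en"),             # hypothetical statement
}

# Integrating sources: copy all statements together.
# Set union keeps each distinct statement once.
merged = source_a | source_b

# Expressing a new relationship: just add one more statement.
merged.add((PAGE, "ex:format", "text/html"))  # hypothetical property


def objects(graph, subject, predicate):
    """Return all values recorded for a subject/predicate pair."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(merged, PAGE, "dc:creator"))
```

No schema change is needed for the new `ex:format` statement, which is the point the slide makes about RDF allowing many points of view.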

Ontologies What is an Ontology? “An ontology is a specification of a conceptualization.” Tom Gruber, 1993 Ontologies are social contracts: Agreed, explicit semantics; Understandable to outsiders; (Often) derived in a community process Ontology markup and representation languages: RDF and RDF Schema, OWL Other: DAML+OIL, EER, UML, Topic Maps, MOF, XML Schemas

Components of ontologies Concepts Book Article Author Properties hasPages hasTitle Constraints Cardinality is at least 1 Maximum value is 200 Axioms Planes can fly People can’t fly Relationships Is a Part of

Ontologies - half-time conclusions: advantages Data is not only human readable, it is now also machine readable Machines can perform much more complex tasks (e.g. reasoning) Capturing the meaning of concepts is possible A new look at data storage systems (there are no data structures!!)

Use case scenario: regular systems Structured resources: Author, Title, Date Data storage allows: Author, Title Additional information cannot be stored!!

Ontology development process Many approaches Different life cycles Continuous process Involves community of users Requires tools for collaboration Tools for ontology development are necessary

MarcOnt Initiative Motivation: Build a bibliographic ontology for the Jerome Digital Library MarcOnt Initiative goals: Deliver a set of tools for collaborative ontology development Collaboration Tools for domain experts Enable mediation between formats (MMS)

MarcOnt Ontology Central point of MarcOnt Initiative Translation and mediation format Continuous collaborative ontology improvement Knowledge from the domain experts Community influence and evaluation

MarcOnt Ontology Goals: Capture concepts from the legacy bibliographic formats: MARC21, BibTeX, Dublin Core, Lattes, ... Create a uniform bibliographic description format for digital libraries. Enable the use of Semantic Web technologies (e.g. reasoning) to improve capabilities of digital libraries Improve interoperability

Format Translation Scenario (via Dublin Core) Source record: Author: John Smith; Date of birth: 1956-10-15; Date of death: 2004-09-10 The other three records shown: Author: John Smith; Date of birth: ??; Date of death: ??

Format Translation Scenario (RDF Storage alongside Dublin Core) Four records shown in full: Author: John Smith; Date of birth: 1956-10-15; Date of death: 2004-09-10 Two records shown with missing dates: Author: John Smith; Date of birth: ??; Date of death: ??
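
One way to read this scenario is that a record passed through a format with fewer fields loses whatever that format cannot hold, while a richer intermediate store keeps it. The sketch below illustrates that reading only; the field sets, the `rdf_store` and `dublin_core` labels and the `translate` function are assumptions made for the example, not the MarcOnt mediation services.

```python
# Illustrative field sets; real formats define many more elements.
FORMAT_FIELDS = {
    "rdf_store": {"author", "date_of_birth", "date_of_death"},
    "dublin_core": {"author"},  # assumption: no slot for the author's life dates
}


def translate(record, target_format):
    """Copy a record into the target format; fields it cannot hold become '??'."""
    allowed = FORMAT_FIELDS[target_format]
    return {key: (value if key in allowed else "??") for key, value in record.items()}


source = {
    "author": "John Smith",
    "date_of_birth": "1956-10-15",
    "date_of_death": "2004-09-10",
}

# Round trip through the poorer format: the dates are gone for good.
via_dublin_core = translate(translate(source, "dublin_core"), "rdf_store")

# Round trip through the richer store: nothing is lost.
via_rdf_store = translate(translate(source, "rdf_store"), "rdf_store")

print(via_dublin_core)
print(via_rdf_store)
```

The design point is that information lost in one translation step cannot be recovered by later steps, so the mediation format needs to be at least as expressive as every format it connects.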

MarcOnt Mediation Services Format translation Interoperability MarcOnt Mediation Services RDF Translator

MarcOnt Ontology in JeromeDL Improvement of searching capabilities Natural Language Processing (NLP) Templates Show me all publications written by students of Decker.

MarcOnt Portal Collaborative ontology development. Portal provides: Suggestions Annotations Versioning Ontology editor

MarcOnt Portal On-line ontology editing Visualization of ontologies

MarcOnt Portal Comparing versions of ontologies

MarcOnt Initiative Roadmap Lattes – CV platform used in Brazil Release of MarcOnt draft ontology Digital Rights Management Sharing issues MarcOntX agent – automatic integration of concepts from Digital Libraries

MarcOnt Initiative summary MarcOnt Initiative goals: Create a framework for collaborative ontology development Provide domain experts with tools to share their knowledge Offer tools for data mediation between different data formats Develop MarcOnt bibliographic ontology Create a community of users (domain experts)

Conclusions Ontologies: can improve the most important function of digital libraries – searching for information; facilitate interoperability; capture much more information (metadata) than existing systems; represent an agreement among people (domain experts); need tools for collaborative development and a community of users; are the future of Digital Libraries?

Tomasz Woroniecki [email_address] JeromeDL Building a Semantic Digital Library

Outline of the presentation Introduction to Semantic Digital Libraries Overview of JeromeDL Architecture of JeromeDL Working with JeromeDL Demo

Social Semantic Digital Library A library stores and provides access to resources (books) Qualified staff updates catalogues and helps users

Social Semantic Digital Library Machine-readable resources Full-text index improves searching Easy access Availability

Social Semantic Digital Library Resources are accessible by machines, not only with machines Metadata is rich and extensible Searching reflects meaning of terms RDF is a standard for representing information Not just resources but also knowledge is shared

Social Semantic Digital Library Involves the community in sharing knowledge Utilizes the social network in searching Allows for comments, blogs, shared bookmarks Easy tagging

Evolution of Libraries Social Semantic Digital Library: Involves the community in sharing knowledge Semantic Digital Library: Accessible by machines, not only with machines Digital Library: Online, easy searching with a full-text index Library: Organized collection

Semantic Digital Library Semantic digital libraries integrate information based on different metadata, e.g.: resources, user profiles, bookmarks, taxonomies provide interoperability with other systems (not only digital libraries) deliver more robust, user friendly and adaptable search and browsing interfaces empowered by semantics

JeromeDL - Motivations Support for different kinds of bibliographic metadata, like DublinCore, BibTeX and MARC21, at the same time. Making use of existing rich sources of bibliographic descriptions (like MARC21) created by humans. Supporting users and communities: users have control over their profile information; community-aware profiles are integrated with bibliographic descriptions; support for community generated knowledge Delivering communication between instances: P2P mode for searching and user authentication; Hierarchical mode for browsing

JeromeDL – Social Semantic Digital Library JeromeDL fulfills requirements of: Librarians precise annotations rich metadata Researchers easy publishing searching related topics Average users efficient search and browsing online collaboration

Using JeromeDL Uploading a resource provide title, abstract, author etc. provide structure of the resource (e.g., chapters) choose domains of the subject choose keywords for the resource set additional properties upload digital parts of the resource

Using JeromeDL An administrator either approves or rejects a published resource

JeromeDL for a regular user Browsing resources by type, author, keyword, domain Downloading the resource and its bibliographic description in various formats Subscribing to RSS feeds Searching simple, advanced, distributed, semantic

Summary An easy solution for putting resources online A community around your repository Support for many languages Integration with Bibster and OpenSearch protocols Visit www.jeromedl.org

Irish Digital Libraries Summit Digital Libraries at the eve of the Next Generation Internet Future of Digital Libraries http://wiki.corrib.deri.ie/index.php/SemDL/IrishDLSummit

Building the Future The Future Internet, semantic or social, or both, will not emerge on its own; we need to build it

Building the Future Digital libraries are an important part of the Internet

Building the Future Libraries should continue to drive the changes, not only follow them

Building the Future OnNeM - IBM Ontological Network Miner and its applications to semantic social networks BRICKS Project – Building Resources for Integrated Cultural Knowledge Services

IBM CAS Dublin / LanguageWare group Ontological Network Miner and its applications to models of social networks and semantics Alexander Troussov, Mikhail Sogrin, John Judge

Agenda Ontological Network Miner tool (project Galaxy) As generic tool to perform elements of soft clustering and fuzzy inference on semantic networks Applications of Galaxy to ontology-based semantic analysis of texts Semantic tagging, term disambiguation based on the global context Galaxy applications to folksonomies Community detection/Expertise location, … Applications to unified models of semantic social networks Research cooperation

Ontological Network Miner (Galaxy) A generic tool to perform elements of soft clustering and fuzzy inference on semantic networks Ongoing project based on the work we have done for the EU 6th Framework integrated project Nepomuk

Applications to metadata generation

Applications to metadata generation Currently the semantic web relies on semantic annotation mostly done manually by humans Working in the EU 6th Framework project Nepomuk (which aims to build a social semantic desktop), we at IBM Dublin developed a tool for automating metadata creation: Automatic ontology-based conceptual tagging (central concepts of the text with respect to the given lexico-semantic resource): a text which mentions Mulhuddart, Lansdowne, Clontarf is probably about Dublin/Ireland/Europe/Earth; this fact can be inferred from geographical relations like Mulhuddart “is-part-of” Dublin Disambiguation of terms, based on the ontological knowledge from the corresponding resource (Jaguar – a car or an animal? Jaguar, car, animal, pet, …) …
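
The geographical inference on this slide can be illustrated with a small "is-part-of" table. The sketch below is a crisp lookup written for illustration, not the IBM tool, which the slides describe as soft clustering and fuzzy inference; the `PART_OF` table simply encodes the slide's example chain, and `ancestors` and `focus` are hypothetical helper names.

```python
# The slide's example chain, encoded as child -> parent "is-part-of" links.
PART_OF = {
    "Mulhuddart": "Dublin",
    "Lansdowne": "Dublin",
    "Clontarf": "Dublin",
    "Dublin": "Ireland",
    "Ireland": "Europe",
    "Europe": "Earth",
}


def ancestors(concept):
    """Return the chain of enclosing concepts, nearest first."""
    chain = []
    while concept in PART_OF:
        concept = PART_OF[concept]
        chain.append(concept)
    return chain


def focus(mentions):
    """Return the most specific concept that contains every mention."""
    if not mentions:
        return None
    chains = [ancestors(m) for m in mentions]
    common = set(chains[0]).intersection(*chains[1:])
    # Chains are ordered nearest first, so the first shared entry is the
    # most specific shared concept.
    for concept in chains[0]:
        if concept in common:
            return concept
    return None


print(focus(["Mulhuddart", "Lansdowne", "Clontarf"]))
```

A text mentioning all three places is tagged with the nearest shared concept; the broader concepts remain available through `ancestors`.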

Automatic tagging based on concept mentions NETWORK OF CONCEPTS TEXT Mapping of term mentions to concepts . Finding “focus” concept Mention Mention Mention Mention

DEMO (Lotusphere 2007) Run eclipse.exe Open lotusphere_demo.config.xml located in subfolder data Have a look at the underlying personal information management ontology (people, organisations, projects) Open text: email1.anno Text is processed on the fly, terms are disambiguated, central concepts are shown in the upper-right window Why US? Because most found concepts are people, and during disambiguation it was established that the most likely referents of (ambiguous) names are located in the US Let us remove the first line with two names The text now has fewer names. Instead of people, other (abstract) concepts now play a more prominent role. Because of this (after a small delay caused by Eclipse, not by the performance of our system) US disappears as the top concept

What is Ontological Network Miner? The text analytics demo shown before has applications to: Context dependent smart tags; Metadata generation Although text processing is a complex process involving mapping from text to concepts and usage of empirics specific to certain properties of the discourse, at the heart of the processing is clustering on the graph of concepts, which was shown by the animation when the wide orange area becomes smaller after “magical” shrinking This clustering is provided by IBM Dublin Ontological Network Miner, codenamed OnNeM in the Nepomuk project

What exactly does Ontological Network Miner do? One algorithm (a blend of soft clustering & fuzzy inference) Depending on the parameters, this algorithm provides: “Generalisation” of the model: output has fewer nodes compared to the input “Expansion” of the model, which might be used for query expansion: the query “nutrition”+“science” is expanded into a properly ranked list: nutritionist, dietologist, nutritional, scientific, .. Our customers and partners can tune the algorithm for specific tasks using intuitively clear parameters.

Tuning Galaxy Galaxy utilises a data-driven algorithm and, more importantly, tuning can be done by a domain specialist (not necessarily a researcher or software developer). Tuning means “telling” Galaxy what properties of the underlying semantic network are relevant to a particular task: For example, in application to geotagging the user might specify that Galaxy favour geographical locations with bigger populations and, in addition, favour popular resorts Using WordNet – specify that Galaxy must favour hypernymy-hyponymy relations and disfavour meronymy-holonymy relations Researchers (IBMers and CAS scientists) also have the opportunity to work with us on “fine-tuning” the algorithm, for example to improve usage of graph metrics such as in-/out-degree of nodes

Applications to folksonomy systems (Del.icio.us, IBM’s Dogear, …)

Folksonomies as ontological networks: People, Documents, Tags, Instances of tagging

Why a “generic” ontological network miner is needed: Objects of interest might be wired into one unified model of lexicon, semantic and social networks For example, the network depicted on the previous slide can be augmented with new entities and new relations One can add relations between participants, or add new people into consideration Semantic relations between tags might be added manually, or generated automatically based on morphological similarity of words, proximity in WordNet, etc. Keywords and other metainformation about documents might be wired into the network Tags in folksonomies are created by humans. Keywords (preexisting in documents or extracted by text processing) and their relations to documents and tags might be added to augment folksonomies. Dogear can recommend tags, in a style accepted in the community, for a new document which nobody has tagged yet

Why a “generic” engine like Galaxy is needed: (cont) Unified model of lexicon, semantic and social networks gives more context to make the right decisions in Community Detection, Community Structure Analysis, Metadata Sharing & Recommendations, etc. However, the data network becomes quite intricate and irregular, and only generic, scalable and high-performing ontological network miners (like Galaxy) are up to the job Galaxy is a generic technique, which can efficiently work on huge networks with complex topology Most tasks on MeSH and WordNet are done in 200 ms Galaxy has native potential for an explanatory module: “This person might help you to understand this document because he frequently used tags popular for this document”

OnNeM can handle networks like this: People, Documents, Tags, Instances of tagging New people and additional relations between them Relations between tags: semantic proximity, misspellings, translations, WordNet, … Relations between documents: … New objects: e.g. keywords from texts might be related with documents and tags

Applications to Semantic Social Networks & Knowledge Exchange

What problems can Galaxy address? Galaxy could be used to uniformly address many problems in Semantic Social Networks & Knowledge Exchange: Tag recommendation in folksonomies; Community detection; Centrality problem in social network analysis; Expertise location… How? Galaxy is a generic technique which takes as input a function on nodes of a semantic network and transforms this input into another function according to the parameters. To simplify explanations, instead of the input/output functions, we’ll talk about the input set of nodes and the ranked output set of nodes To create a solution for a particular task: A set of input nodes must be chosen; Parameters of the algorithm must be established; The output set must be interpreted according to the task
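
The idea of turning one function on the nodes of a network into another can be illustrated with a toy example. The sketch below uses plain spreading activation as a stand-in; it is not the Galaxy algorithm, whose details the slides do not give. The tiny folksonomy graph, the `decay` and `steps` parameters and the function names are all assumptions made for the illustration.

```python
# Hypothetical folksonomy: documents, tags and people as nodes,
# with adjacency lists as edges.
GRAPH = {
    "doc1": ["rdf", "ontology"],
    "doc2": ["rdf"],
    "rdf": ["doc1", "doc2", "alice"],
    "ontology": ["doc1", "bob"],
    "alice": ["rdf"],
    "bob": ["ontology"],
}


def spread(graph, activation, decay=0.5, steps=2):
    """Map an input function on nodes (node -> weight) to a new one.

    In each step every active node passes a decayed share of its weight
    to its neighbours, split evenly among them.
    """
    current = dict(activation)
    for _ in range(steps):
        updated = dict(current)
        for node, weight in current.items():
            neighbours = graph.get(node, [])
            for neighbour in neighbours:
                share = decay * weight / len(neighbours)
                updated[neighbour] = updated.get(neighbour, 0.0) + share
        current = updated
    return current


def ranked(function_on_nodes):
    """Interpret the output: nodes by descending weight, ties by name."""
    return sorted(function_on_nodes, key=lambda n: (-function_on_nodes[n], n))


# The three steps from the slide: choose input nodes, set parameters,
# interpret the ranked output.
result = spread(GRAPH, {"doc1": 1.0}, decay=0.5, steps=2)
print(ranked(result))
```

Reading only the people in the ranked output gives a crude form of expertise location for `doc1`; reading only the tags gives a crude tag recommendation. The same transformation serves both tasks, and only the interpretation of the output changes.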

IBM social software “the company is serious about dominating social networking for the enterprise” Cooking Up a Social Networking Storm With IBM Labs, March 30, 2007 IBM Social Software Dogear: Dogear is a social-tagging service for resources such as public URLs, company-internal URLs, and other company internal documents (e.g., Wiki pages, Domino documents, etc.) Bluepages+1 is an enhanced version of the IBM online employee directory. Among its enhancements is the ability for one person to apply a tag directly to another person’s directory page. Blog Central: Blog Central is an internal blogging service, open to any employee. The Blog Central data structures provide for a separate list of tags for each blog and for each entry within each blog. Activities: Activities is a web-based version of ActivityExplorer, an activity-centric collaboration service in which teams may create collections of diverse objects in a tree-like structure consisting of a root “activity” and its daughter components.

Our research plans to exploit Galaxy: We are investigating a wide range of applications in Community Detection, Community Structure Analysis, Metadata Sharing & Recommendations Enhanced with Social Reputation Mechanisms Based on our understanding of potential IBM needs, our commitments for European research projects, and our vision of the potential of Galaxy, we are looking forward to the creation of the following functionalities: Community Support Given a peer: Search for its neighbors within a community Given the entire collection: Identify trends and threads (e.g., tags becoming popular, etc.) Metadata Sharing & Recommendations Given a file with some attached metadata: Recommend additional annotations Recommend similar files Given one or more tags and/or keywords: Locate peers with expertise in the described areas

Research collaboration Create semantic social networks of your interest … in the format which can be used by Galaxy (simple XML format) Design a scenario and work with us on tuning parameters of Galaxy for the tasks in your scenario … Contacts Alexander Troussov, CAS Chief Scientist, [email_address] Marie Wallace, LanguageWare manager, [email_address] Brian O’Donovan, CAS Program Director, [email_address] IBM CAS Dublin https://www.ibm.com/ibm/cas/sites/dublin/ LanguageWare http://www.ibm.com/software/globalization/topics/languageware/index.jsp NEPOMUK http://nepomuk.semanticdesktop.org/

BRICKS Project Predrag Knežević Fraunhofer IPSI Institute Darmstadt, Germany [email_address]

What is BRICKS? A software infrastructure for building digital library networks: Transparent access to distributed resources; Multilinguality; Easy installation & maintenance A set of end-user applications: Network & content management; Web 2.0 Tagging/Annotations; Domain specific applications A business model: Open Source, Platform Independent; Low cost infrastructure; User communities → sustainability

Sustainability User Communities Open Source Applications User App. Build on top of the foundation User Services can become Foundation services Foundation/Infrastructure Decentralized Storage Content&Metadata Mngt. Semantic Retrieval Security/DRM BRICKS

BRICKS Architecture A decentralized P2P network: Avoid central coordination; Highly scalable, increased reliability; Minimized maintenance costs Each P2P Node is a set of SOA components: Web Service Interface; Platform Independent; Flexible Composition Components for: Storing, accessing and protecting digital objects; (Semantic) search & browsing; P2P communication

Features Application development in any language with a good Web-service support Metadata Support for various schemas Indexed both locally and published in decentralized index as well Annotations Support for various media types (text, images, audio, video) Various supported types (text, audio, video, spatial, temporal) Content Can be stored outside of BNode Internally content can be managed in various binary and structured (XML) formats Organized into collections Location transparent for applications Search Simple, advanced, ontology-based Cross-language support Addresses all available content

Collection Manager Single access point for all content and metadata related operations (local and remote) Physical Collection Similar to folder/directory hierarchy in a file system Bound to a single BNode Each digital content object belongs to exactly one collection Logical Collection Virtual folder for organizing content items independent of their physical location Links to content items from various physical collections on different BNodes A content item might belong to many of them Stored Query similar to database views
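
The difference between physical and logical collections can be sketched as a small data model. The classes below are hypothetical and written only to illustrate the two rules on this slide (an object lives in exactly one physical collection on one BNode; a logical collection merely links to items wherever they are). They are not the BRICKS API.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalCollection:
    name: str
    bnode: str                                # bound to a single BNode
    items: set = field(default_factory=set)


@dataclass
class LogicalCollection:
    name: str
    links: set = field(default_factory=set)   # (bnode, item_id) pairs


class CollectionManager:
    """Hypothetical single access point for content operations."""

    def __init__(self):
        self._owner = {}  # item_id -> the one physical collection holding it

    def store(self, item_id, collection):
        """Place an item in a physical collection; a second home is refused."""
        if item_id in self._owner:
            raise ValueError(f"{item_id} already belongs to a physical collection")
        collection.items.add(item_id)
        self._owner[item_id] = collection

    def link(self, item_id, logical):
        """Add a location-independent reference to a logical collection."""
        physical = self._owner[item_id]
        logical.links.add((physical.bnode, item_id))


manager = CollectionManager()
coins = PhysicalCollection("coins", bnode="bnode-a")
maps = PhysicalCollection("maps", bnode="bnode-b")
manager.store("coin-1", coins)
manager.store("map-7", maps)

exhibition = LogicalCollection("exhibition")
manager.link("coin-1", exhibition)
manager.link("map-7", exhibition)
print(sorted(exhibition.links))
```

The logical collection stores references only, so the same item can appear in many logical collections without being copied between BNodes.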

Content Manager Two ways to handle Content in BRICKS stored locally at site of a member party, accessed via URL stored within BRICKS Based on Java Content Repository (JCR) Provide a meta-content model Re-use of existing content models Use standard models

Metadata Manager Metadata descriptions → RDF: Suitable for any application scenario; Express relationships between objects; React to changes without changing the model Schema definitions → OWL: No fixed schema; Extensible (e.g. Application Profiles); Semantic concepts instead of schematic structures SPARQL: Metadata queries over ontology concepts; Queries for graph patterns
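
What a query for graph patterns amounts to can be shown with a few lines of Python. The sketch below is a toy matcher over a list of triples, written for illustration; it is neither a SPARQL engine nor the BRICKS Metadata Manager. The sample triples, the `creator` and `studentOf` property names and the `?variable` convention are assumptions.

```python
# Hypothetical metadata as (subject, predicate, object) triples.
TRIPLES = [
    ("paper1", "creator", "anna"),
    ("paper2", "creator", "brian"),
    ("anna", "studentOf", "decker"),
    ("brian", "studentOf", "someone-else"),
]


def match(pattern, triple, binding):
    """Extend a variable binding so the pattern equals the triple, or fail."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(triples, patterns):
    """Return every binding that satisfies all patterns (a joined graph pattern)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# "Publications written by students of Decker" as a two-pattern graph query.
results = query(TRIPLES, [
    ("?pub", "creator", "?person"),
    ("?person", "studentOf", "decker"),
])
print(results)
```

The shared variable `?person` is what joins the two patterns; a real SPARQL query expresses the same join declaratively.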

Annotation Management Rich model Supported fragment types: “Text fragment”, “Time fragment”, “Rectangle”, “Circle”, “Point”, “Polygon” and “Polyline” Supported annotation types: “Structured Annotation”, “Association”, “Text annotation” and “Symbol Annotation” Annotation type “association” supports n:m relations Support of versioning Annotation of complete objects and of fragment of objects Supports annotation of multiple objects 13/03/2007

Security Manager Transparently invoked by the Framework any service call is checked Context-aware policies based on RBAC (via XACML rules), supporting Roles, Groups, at DLObject level Permission declaration through Javadoc @tags Federated identity is managed through an adapted version of OpenSAML Reputation-based Trust calculation integrated Web-based GUI for Security configuration

Digital Rights Management DRM Component Support for licenses based on MPEG-21 REL license declaration standard Generic API for the integration of commercial DRM systems Watermarking Open-source watermarking tool for images other tools can be integrated BRICKS Store web application for commercial content Creative Commons support for other content in BRICKS

Application: BRICKS Workspace What does it demonstrate? a web application (thin client) accessing BRICKS Foundation services Web 2.0 image annotations Reference application Primary customers? general end-users (citizens) application developers Technology Struts based interface to the BCH Live demo at http://saturn.researchstudio.at:8090/workspace

Application: BRICKS Desktop What does it demonstrate? a rich client application accessing BRICKS Foundation services direct access to the BCHN Primary customers? expert end-users (researchers, educators) application developers Technology Eclipse based rich client interface Download at http://develop.bricksfactory.org/projects/desktop

Application: Annotation Tool What does it demonstrate? Tool which allows end-users to annotate images Creation of annotation threads Supervised Annotations Primary customers? end-users Institutions with large image collections Technology Web Application

Application: Online Exhibition Authoring Tool What does it demonstrate? Creating and publishing online exhibitions using content that is available in the BRICKS network Primary customers? expert end-users (curators) Technology Web Application Live demo at http://livingmemory.researchstudio.at/

Application: Archeological Finds Identifier What does it demonstrate? a web application for comparing found objects (e.g. ancient coins) with objects from reference collections Application of complex domain ontology (CIDOC-CRM) Map visualization of GIS-Metadata Primary customers? Museum curators, archaeologists, students, amateurs Technology Struts based interface Live Demo at http://finds.brickscommunity.org:8091/findsidentifier/index.do

BRICKS Demo Store What does it demonstrate? Purchasing digital goods License maintenance and proofing Primary customers? Content providers Technology Based on OFBiz Live demo at http://brstore.metaware.it:9080/ecommerce/control/main

References BRICKS Community Web Site ( http://www.brickscommunity.org ) BNode Release Downloads ( http://foundation.bricksfactory.org ) BRICKSforge ( http://develop.bricksfactory.org ) BRICKS Developer Community ( http://dev.brickscommunity.org )

Irish Digital Libraries Summit Digital Libraries at the eve of the Next Generation Internet Digital Libraries in Ireland http://wiki.corrib.deri.ie/index.php/SemDL/IrishDLSummit

Building the Future IVRLA - The Irish Virtual Research Library and Archive Project - an infrastructure for humanities research. OJAX – A Web 2.0 Search User Interface S3B - With a Little Help from My Friends: Social Semantic Search and Browsing

The Irish Virtual Research Library and Archive Project - an infrastructure for humanities research.

Outline Quick Overview Digitisation Processes Repository Development Content Models IVRLA Deployment Observations

IVRLA Positioning PRTLI funded project Component of UCD Humanities Institute of Ireland and based in UCD Library Supporting research through offering access to digitised content from participating primary source repositories Direct research into digitisation and digital repositories Developing and promoting added value tools and services

IVRLA Deliverables Body of digitised content Functioning repository prototype with scalable infrastructure Comprehensive report including regulatory and financial issues Body of corporate knowledge & expertise Centre of excellence Proof of concept

Digital content repositories should… Support the creation and publication of new forms of “information units”; Integrate with the processes (e.g., workflows) of research, collaboration, and scholarly communication; Enable knowledge integration: capture semantic and factual relationships among information entities; Promote information re-use and contextualization; Facilitate collaborative activity and capture information that is created as a byproduct of it; Capture and maintain the complex structural, semantic, provenance, and administrative relationships among digital resources* * Sandy Payette, Sydney 2006.

Digitisation and Cataloguing Processes

Image based Digitisation Components Apple PowerMac G5 running Kodak oXYgen Scan Kodak IQSmart 2 Adobe Photoshop CS2

Audio Digitisation Components Quadriga system Lake People ADC and DAC Revox 1/4 inch tape player

Files and Formats
Scanned Material (text and images): TIFF (PM), JPEG (CW), DjVu (CW), JPEG (TN)
Time Based Material
Audio: BWF (PM), MP3 / MP4 (CW)
Video: Linear Digital (PM), mov, wmv? (CW)

Workflows TIFFs have metadata embedded TIFFs are backed up to LTO Photoshop macros used to watermark, create JPEG and TIFFs DVDs created and stored Additional derivatives created for resource discovery and access

Data Storage 3 high quality ‘Preservation Master’ copies 2 DVD-ROM - working 1 LTO - deep archive Copies stored in geographically disparate locations Estimate that IVRLA will require 6-8TB for all preservation master storage. Scans ~ 80MB Audio ~ 800MB/hr Online requirement is significantly less
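
The figures on this slide give a sense of scale that a short calculation makes concrete. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and treats the 6-8TB estimate as a single pool filled with only one kind of material at a time; both are assumptions, and the slide does not say how the estimate splits between scans and audio.

```python
MB_PER_TB = 1_000_000      # assumption: decimal units
SCAN_MB = 80               # approximate size of one scan (from the slide)
AUDIO_MB_PER_HOUR = 800    # approximate size of one hour of audio (from the slide)


def capacity(total_tb):
    """How much of a single material type fits in the given storage."""
    total_mb = total_tb * MB_PER_TB
    return {
        "scans_only": total_mb // SCAN_MB,
        "audio_hours_only": total_mb // AUDIO_MB_PER_HOUR,
    }


for tb in (6, 8):
    print(tb, "TB:", capacity(tb))
```

Under these assumptions the estimate corresponds to roughly 75,000 to 100,000 scans, or 7,500 to 10,000 hours of audio, if the storage held only one kind of material.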

Metadata and Database 2 stage cataloguing database MODS - descriptive metadata METS - structural and transmission metadata EAD - archival context and structure MIX - technical metadata for images MADS - descriptive metadata authority files

Collection Model Libraries use OPAC for searching Archives use Finding Aids for browsing Hybrid model to enable searching and browsing of complex hierarchical digital collections Model facilitates top down and bottom up approaches EAD provides context and structure MODS provides precision and accuracy Create EAD template for each ‘collection’ Catalogue to the appropriate level

Repository Architecture Articulation

Open Source Repository Systems. Growing area of development. Several options available: DSpace, EPrints, Fedora. IVRLA required a solution which offers: suitability for a wide range of data types; support for collection structures and complex objects; scalability (prototype into service); future-proof architecture; long-term digital preservation.

IVRLA Preservation Requirements Audit trails and datastream versioning Persistent Identifiers Checksum creation and validation Whole object versioning OAIS compliance TDR compliance
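The “checksum creation and validation” requirement above can be illustrated with a minimal sketch. It uses Python's standard `hashlib`; the choice of SHA-256 and the function names are assumptions for illustration, not a description of how IVRLA or Fedora computes checksums.

```python
import hashlib


def create_checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of a datastream (SHA-256 is an assumed default)."""
    return hashlib.new(algorithm, data).hexdigest()


def validate_checksum(data: bytes, expected: str, algorithm: str = "sha256") -> bool:
    """Recompute the digest and compare it with the stored value."""
    return create_checksum(data, algorithm) == expected.lower()


if __name__ == "__main__":
    master = b"bytes of a preservation master file"  # hypothetical content
    stored = create_checksum(master)                 # recorded at ingest
    print(validate_checksum(master, stored))         # unchanged copy validates
    print(validate_checksum(master + b"!", stored))  # altered copy does not
```

In practice the digest would be recorded at ingest alongside the object and recomputed periodically, so that silent corruption of a stored copy shows up as a mismatch.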

IVRLA interface requirements. Evidence: provenance, authenticity, integrity, context, persistence, sustainability. Granularity: directed to page, clip, part .. Security, authentication and authorisation infrastructure. Conversation/Participation: informal, collaborative. Personalisation and customisation. Recommendation services (S/CSI). Social searching and annotation (S/CSI - S/ILS). Add value, links, connections…

Fedora Content Models. A definition for a “type” of object (e.g., article, book, image, learning object) that describes the internal composition of a group of similar Fedora objects: Data Type, Structure, Services. Data Type defines payloads and metadata. Structure defines relationships between objects. Services define actions or disseminators for the content.

RoadMap. Initial research and demo. Develop utilities: sipMaker and mixMaker. Articulate collection model. Develop Virtual Library and Archive 1.0: browse, search, view, cite, tag, ingest. Trial and deployment of subsets. Develop Virtual Library and Archive 2.0: user management; personalisation, customisation; recommendation services; annotation and tagging; research space; virtual collections.

Usership. Research based. Context heavy: accuracy, integrity and authenticity. Technically literate with Internet-age expectations (the Google effect). Accurate citation and source acknowledgement using persistent identifiers.

Repository Challenges. Architecture is not an ‘out of box’ solution. Resources required to articulate and develop interface layer(s). Metadata management is complex. Tension between popular delivery formats and archival preservation formats. Challenge of anticipating all user environments in content modeling. Improved automation is necessary for ingest and validation. Digitisation is the main bottleneck. Sustainability: prototype developed into a service. Human resources are central to technology projects. Developing and training data curators with multidisciplinary skill sets.

Observations and Conclusion. 5-year project timeline requires an iterative process. New advances in computing science will influence developments: eScience, eHumanities, Web 2.0. IVRLA positions the archival source, with all context and structure, as central to the digital deployment. Define and build core sources which can be interrogated and integrated with dynamic services. Standards-based interoperability is key to ensuring future accessibility and sustainability. New repository models suggest and support user-created metadata such as social bookmarking and annotating.

Further Information www.ucd.ie/ivrla [email_address]

OJAX: Web 2.0 Federated search Judith Wusteman April 2007

Overview Introducing OJAX OJAX Demo Related research

Web 2.0 Technologies and Standards used in OJAX: AJAX, REST, JSON, Atom, OAI-PMH, OpenSearch, Open API, StAX, Apache Lucene

Auto-completion Auto-search Dynamic archive list

OpenSearch Enables search engines to describe their search syntax to browsers Describes standards for search results syntax Based on RSS and Atom
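An OpenSearch description document advertises a URL template in which placeholders such as `{searchTerms}` are replaced by the client, and a trailing `?` (as in `{startPage?}`) marks a parameter as optional. The sketch below fills such a template; the `example.org` endpoint and the helper name `fill_template` are hypothetical and are not taken from OJAX.

```python
import re
from urllib.parse import quote

# Matches {name} and {name?}; the "?" suffix marks an optional parameter.
_PARAM = re.compile(r"\{([A-Za-z0-9_:]+)(\?)?\}")


def fill_template(template: str, **values: object) -> str:
    """Substitute URL-encoded values into an OpenSearch-style URL template."""

    def substitute(match: re.Match) -> str:
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise ValueError(f"missing required parameter: {name}")

    return _PARAM.sub(substitute, template)


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    template = "http://example.org/search?q={searchTerms}&page={startPage?}&format=atom"
    print(fill_template(template, searchTerms="digital libraries"))
    # http://example.org/search?q=digital%20libraries&page=&format=atom
```

Because the template is published by the search engine itself, a client needs no engine-specific code to build a query URL.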

Science Foundation Ireland: OJAX++: a next generation collaborative research tool To investigate how concepts from the Social Web can be applied to the research environment in order to facilitate dynamic collaboration and the sharing of ideas among researchers.

PhD starting September 2007, in collaboration with UCD School of Computer Science and Informatics. Requirements: Honours degree (preferably first class or 2.1) in Computer Science or a related field, or equivalent technical expertise. Preferred experience: Web technology; JavaScript; AJAX; one of Java, Ruby or Python. http://www.ucd.ie/wusteman [email_address]

Advantages of OJAX. Developed in Ireland. Can be adapted to suit. Already in Beta version. Available for download. Well received. Responds to new user expectations generated by Web 2.0. Rich, dynamic user experience. Intuitive interface. Integration, interoperability and reuse. Open source. Standards-compliant, including OpenSearch, OAI-PMH, StAX and Apache Lucene.

With a Little Help from My Friends Social Semantic Search and Browsing Sebastian Ryszard Kruk, Adam Gzella Digital Enterprise Research Institute National University of Ireland, Galway sebastian.kruk@deri.org, adam.gzella@deri.org http://s3b.corrib.org/

Take away message. We search in different ways for different things. Keyword search is not enough. We create knowledge by sharing our (search) experience.

Outline Motivation How do people search Search and Browsing lifecycle Applying semantics and making use of social networks: Keyword-based search Collaborative Faceted Navigation Collaborative Filtering Conclusions - Putting it all together

How do people search? Different user goals: Resource Seeking - the user wants to find a specific resource (e.g. lyrics of a song, a program to download, a map service etc.) Navigational - the user is searching for a specific web site whose URL s/he forgot Informational - the user is looking for information about a topic s/he is interested in Rose and Levinson: Understanding user goals in web search (2004)

Search and browsing lifecycle. Why? Information can be useful; information can be garbage. How? (Search and browsing actions) [REUSE] keyword-based search (resource seeking); [REDUCE] faceted navigation (navigational); [RECYCLE] collaborative filtering (informational). Can this process be improved with Semantic Web and Social Networking technologies?

Query refinement in keyword-based search. Why is simple full-text search not enough? Too many results (low precision). One needs to specify the exact keyword (low recall). How to distinguish between Python and python? (high fall-out). How? Disambiguation through a context. Query context. Short-term context: user’s goal, location, time. Long-term context: user’s interest, search engine specific.
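The three measures named on this slide have standard definitions: precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that were retrieved, and fall-out is the share of non-relevant documents that were retrieved anyway. A minimal sketch, with made-up document identifiers:

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Return (precision, recall, fall_out) for one query.

    retrieved / relevant are collections of document ids; collection_size is
    the total number of documents in the searched collection.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    false_positives = len(retrieved - relevant)
    non_relevant = collection_size - len(relevant)

    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    fall_out = false_positives / non_relevant if non_relevant else 0.0
    return precision, recall, fall_out


if __name__ == "__main__":
    # Illustrative data: 10 documents, 4 relevant, 5 retrieved (2 of them relevant).
    print(retrieval_metrics(retrieved={1, 2, 5, 6, 7},
                            relevant={1, 2, 3, 4},
                            collection_size=10))
    # (0.4, 0.5, 0.5)
```

In the Python/python case, results about the snake returned for a programming query count as false positives, which lowers precision and raises fall-out.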

Query refinement in keyword-based search. How? Query refinement: spreading activation, types mapping, pruning. Acquiring the context information: previous searches of the user; semantically annotated user’s bookmarks; community profile. And? (Manual query refinement) “Tell me why” button and the transcript of the refinement process. Continue to faceted navigation.
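The activation-spreading and pruning steps listed above can be sketched generically as follows. This is a textbook-style illustration over a small weighted concept graph, not the algorithm used in the authors' system; the graph, the `decay` and `threshold` parameters, and the function name are all assumptions.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from seed concepts to related concepts.

    graph: {node: {neighbour: edge_weight}}; seeds: {node: initial_activation}.
    Activation passed along an edge is energy * weight * decay; contributions
    below `threshold` are pruned and do not propagate further.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                if passed < threshold:
                    continue  # pruning: too weak to be worth following
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


if __name__ == "__main__":
    # Hypothetical concept graph around the query term "python".
    graph = {
        "python": {"programming language": 1.0, "snake": 1.0},
        "programming language": {"java": 0.5},
    }
    print(spread_activation(graph, {"python": 1.0}))
    # {'python': 1.0, 'programming language': 0.5, 'snake': 0.5, 'java': 0.125}
```

Context information such as the user's bookmarks could be modelled by giving some seeds or edges a higher weight, so that one sense of an ambiguous keyword accumulates more activation than the other.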

Collaborative Faceted Navigation. Why? The search does not end with a (long) list of results. The results are not a list (!) but a graph. We lose context with linear navigation. A need for a unified notion (UI, services) of filter/narrow and browse/expand services. Share browsing experience: navigate collaboratively. How (Services)? Defines REST access to services and their composition. Basic services: access, search, filter, similar, browse, combine. Meta services: RDF serialization, subscription channels, service ID generation. Context services: manage contexts, manage service calls/compositions in the context, list contexts. Statistics services: properties, values, tokens.

Collaborative Faceted Navigation. How (User interface)? Hexagons to capture the notion of a non-linear browsing history. Selecting values from a list, tag cloud or TagsTreeMap™. Context zoomable interface: list (graph) of results; browse from current results; navigate between service calls; navigate between contexts (with a given call).

Social Semantic Collaborative Filtering. Why? The bottom line of acquiring knowledge: informal communication (“word of mouth”). How? Everyone classifies (filters) the information in bookmark folders (user-oriented taxonomy). Peers share (collaborate over) the information (community-driven taxonomy). Result? Knowledge “flows” from the expert through the social network to the user. The system amasses a lot of information on the user/community profile (context).

Social Semantic Collaborative Filtering. Problems? The horizon of a social network (2-3 degrees of separation). How to handle fine-grained information (blogs, wikis, etc.)? Solutions? Inference engine to suggest knowledge from the outskirts of the social network. Support for SIOC (Semantically Interlinked Online Communities) metadata: blogs, wikis, fora, … SIOC browser in SSCF. Annotations and evaluations of “local” resources.
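The “horizon” problem can be made concrete with a breadth-first search that stops at a fixed number of degrees of separation. The friendship graph and names below are invented for illustration, and the sketch says nothing about how the inference engine mentioned above reaches beyond the horizon.

```python
from collections import deque


def within_horizon(friends, start, max_degree=2):
    """Return {person: degree} for everyone within max_degree hops of start.

    friends: {person: iterable of direct friends}.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_degree:
            continue  # the horizon: do not look past this person
        for friend in friends.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]
    return distance


if __name__ == "__main__":
    # Hypothetical social network: john - anna - bob - carol.
    friends = {
        "john": ["anna"],
        "anna": ["john", "bob"],
        "bob": ["anna", "carol"],
        "carol": ["bob"],
    }
    print(within_horizon(friends, "john", max_degree=2))
    # {'anna': 1, 'bob': 2}
```

Here carol's bookmarks lie three hops away and stay invisible to john, which is the kind of knowledge on the outskirts of the network that the slide proposes to surface by inference.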

Putting it all together: user profile (recent actions); refine search results; filter, record, annotate, and share results and actions; re-call shared actions; user profile (user’s interests); filter, record, annotate, and share results

Do we need Semantic Web and Web 2.0 technologies in Digital Libraries? Irish Digital Libraries Summit Digital Libraries at the eve of the Next Generation Internet http://wiki.corrib.deri.ie/index.php/SemDL/IrishDLSummit

Irish Digital Libraries Summit Digital Libraries at the eve of the Next Generation Internet Conclusions http://wiki.corrib.deri.ie/index.php/SemDL/IrishDLSummit
