The most effective strategy for finding files is to carefully arrange them into folders. This strategy breaks down for teams, where organizational schemes often differ between members, and it breaks down when information is copied and reused, because versions become harder to track. As storage grows and costs decline, the incentive to carefully archive old versions of files diminishes. It is therefore important to explore new and improved search tools. The most common approach is keyword search, though recalling effective keywords can be challenging, especially as repositories grow and information flows across projects. A less common alternative is to use provenance: information about the creation, use, and sharing of documents and their context, including collaborators. This paper presents a limited user study showing that provenance data is useful and desirable in search, and that an interface based on a graphical sketchpad is not only feasible but efficient.
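The provenance cues described above (who created a file, who it was shared with, which earlier version it derives from) can be pictured as simple per-file records queried by attribute. The following is a minimal sketch of that idea; the record fields and function names are hypothetical illustrations, not taken from the Leyline implementation:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical provenance record: field names are illustrative only.
@dataclass
class ProvenanceRecord:
    path: str
    created_by: str
    shared_with: list = field(default_factory=list)
    derived_from: Optional[str] = None  # earlier version this file was copied from

def search_by_collaborator(records, person):
    """Return paths of files created by or shared with a given collaborator."""
    return [r.path for r in records
            if r.created_by == person or person in r.shared_with]

records = [
    ProvenanceRecord("report_v2.doc", "alice", ["bob"], "report_v1.doc"),
    ProvenanceRecord("budget.xls", "carol"),
]
print(search_by_collaborator(records, "bob"))  # → ['report_v2.doc']
```

A query like "the document Bob shared with me" maps directly onto such a record filter, which is the kind of recall cue keyword search cannot express.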
Leyline: A provenance-based desktop search
1. The Leyline: A Comparative
Approach To Designing a Graphical
Provenance-Based Search UI
Soroush Ghorashi, Carlos Jensen
Oregon State University
HICSS 2013
2. What is the problem?
Computers are increasingly “black holes” for information
— Storage abundant and cheap, no incentives to delete or archive
— Collaboration and sharing are growing
— Information increasingly flowing across devices
More information available, harder to (re)find anything
Manual folder navigation [Barreau and Nardi 1995; Teevan et al. 2004; Bergman et al. 2008]
— Collaborators use conflicting naming schemes
— Overlapping projects introduce uncertainty
Keyword search
— Larger repositories and information reuse lead to long lists of hits for common keywords
— Multiple copies and drafts of files
6. Solution?
What about: “Leveraging provenance to enrich file search”
— Provenance: the history of a document’s ownership and transformations, as well as its sources and derivatives
[Figure: example provenance graph — an email “RE: presentation draft” with attachments, data.html as a copy/paste source, and presentation.ppt saved as presentation-v2.ppt]
— Track provenance events: make them available in search queries, use them in results
— Allow for fundamentally different types of queries
— People remember related documents [Gonçalves 2004; Blanc-Brude 2007]
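The event-tracking idea above can be sketched as a minimal data model. All names here are hypothetical — the slides do not specify Leyline’s internal representation — but the event types mirror those reported in the study (CopyPaste, SaveAs, AttachmentAdd, …):

```python
from collections import defaultdict

# Hypothetical provenance event log: (source, event_type, target) triples,
# recreating the example figure from the slide.
EVENTS = [
    ("data.html", "CopyPaste", "presentation.ppt"),
    ("presentation.ppt", "SaveAs", "presentation-v2.ppt"),
    ("presentation-v2.ppt", "AttachmentAdd", "RE: presentation draft"),
]

def build_provenance_graph(events):
    """Build an adjacency map: source -> [(event_type, target), ...]."""
    graph = defaultdict(list)
    for source, event_type, target in events:
        graph[source].append((event_type, target))
    return graph

graph = build_provenance_graph(EVENTS)
# Files reachable from data.html form its provenance network; a search
# tool can then answer queries like "the file I saved-as from this one".
```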
7. Research Goals
— Phase 1: Analyze information reuse, information
flow, and provenance events in a real-world settings
— Phase 2: Investigate the effectiveness of
provenance cues in desktop search
— Phase 3: Develop and evaluate provenance-based
search tools (if appropriate)
8. Phase 1: Study Real-World Work Practices (2008/2010)
3-month user study at Intel Corporation
— Logging subjects’ activities on their computers
— Data cleaned for personal and sensitive information
— Recorded provenance and information-access events
— Participants
— 17 information workers, 43 workdays on average
— 9 observation sessions
— Exit interview
— Findings
— 126,620 unique resources
— 7,448 resources per subject (min: 3,211; max: 17,570; σ: 3,326)
File use per person-day: Web* 89.9; Email 73.7; Word 4.4; Excel 2.5; PowerPoint 2.1; Text 0.4; PDF 0.2; Total 173.2
Provenance events recorded: CopyPaste 63%; SaveAs 15%; MoveFile 6%; FileRename 5%; DownloadFile 3%; AttachmentAdd 3%; AttachmentSave 3%; UploadFile 2%
C. Jensen et al., "The life and times of files and information: a study of desktop provenance." In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI '10), Atlanta, GA, April 10–15, 2010. ACM, New York, NY, pp. 767–776.
9. Phase 1 contd.
Provenance networks are more common than we expected!
— 521 significant graphs (3+ nodes)
— Average 5.8 resources per graph
— 53.7% of files related to at least one other file in their own network
10. Phase 1 contd.
Half of subjects remembered more about their documents after seeing a provenance graph.
“It looks like it comes from the IAP tool, and all the green boxes are my Excel spreadsheets that I exported to. The word documents are probably what I copied the Excel data to, probably for email.”
“I recall uploading those to the SharePoint site!”
“Oh, I see what’s going on. I tend to open a spreadsheet and sometimes I’ll have more than one open at the same time…”
“2.4 might have been embedded in a doc, so I had to copy it out from there.”
“Yeah, that’s what I did, I turned it into Excel… I saved it, and then I changed the name because I wanted to make sure it was distinguished from other files I have with the same name for a different group.”
“Looks like I copied and pasted from the website into a doc… It’s kind of complicated what I did here. I took 2.2, copied and pasted info into an Excel spreadsheet. And then yeah, there’s number 7, a spreadsheet as well.”
11. Can We Use Provenance More
Directly?
Textual query in most traditional
keyword search tools
12. Can We Use Provenance More
Directly?
Textual query in most traditional
keyword search tools
What about drawing queries?
13. Phase 2: Provenance in Search?
Is it Appropriate?
Can provenance be used effectively in search?
— How complex a query do we need to find a file?
— List of all unique walks in provenance graphs
— Find longest repeating strings for each subject
— Worst case unique query: Longest repeating string + 1
— With/without provenance event type to examine impact
Outlook--AS--Word--CP--PowerPoint--SA--PowerPoint--CP--PowerPoint
(AS = AttachmentSave, CP = CopyPaste, SA = SaveAs)
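The worst-case analysis described above can be sketched in code: enumerate every walk through the provenance graphs as a token sequence, find the longest contiguous run that repeats anywhere, and add one token. This is an illustration of the method, not the study’s actual tooling, and the function names are mine:

```python
def longest_repeated_run(walks):
    """Length (in tokens) of the longest contiguous token run that occurs
    in two or more distinct places across the given walks."""
    # Brute force over all sub-runs; fine for small illustrative data.
    seen = {}
    best = 0
    for wi, walk in enumerate(walks):
        for i in range(len(walk)):
            for j in range(i + 1, len(walk) + 1):
                run = tuple(walk[i:j])
                first = seen.setdefault(run, (wi, i))
                if first != (wi, i):  # same run already seen elsewhere
                    best = max(best, len(run))
    return best

walks = [
    ["Outlook", "AS", "Word", "CP", "PowerPoint", "SA", "PowerPoint"],
    ["Outlook", "AS", "Word", "CP", "Excel"],
]
# The longest repeated run is Outlook-AS-Word-CP (4 tokens), so a query
# one token longer is guaranteed to match a unique walk.
worst_case = longest_repeated_run(walks) + 1  # -> 5
```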
14. Phase 2 contd.
— Maximum query length for a repository of ~7500
items:
— Considering the type of provenance events
— 3 to 9, median 4
— Without considering the type of provenance events
— 3 to 10, median 4.5
Provenance events like copy/paste and versioning are too
common to add value!
— Provenance search grows linearly
— 1 node per 200 links
Provenance can be used to narrow search space quickly.
15. Tool Analysis
Categorizing tools that are using provenance-like data to enhance search
— Provenance Types
— Provenance Monitoring
— Provenance Use
— UI Approach
— Evaluation
16. Tool Analysis contd.
Feldspar
— Provenance types: file meta-data, keyword, static relations between resources
— Provenance monitoring: extracting relations from Google Desktop’s database using its API
— Provenance use: query formulation, search process (real-time results updating)
— UI approach: flow-chart like, list view of results
— Evaluation: canned data, limited within-subjects user study
Quill
— Provenance types: meta-data such as author, storage place, date, physical place tag (home, work, etc.)
— Provenance monitoring: built-in system monitor recording meta-data about the user’s documents, email attachments, web pages, applications and calendar
— Provenance use: query formulation, search process (real-time results updating)
— UI approach: narrative-based, list of resource thumbnails
— Evaluation: multiple user studies
SIS
— Provenance types: file meta-data (such as kind, date, author, email attributes)
— Provenance monitoring: Microsoft Desktop Search database, fuzzy matching (“car” and “cars” are the same), fielded search (author is “john doe”)
— Provenance use: query formulation, search process, results presentation
— UI approach: text input with selectable filters, list view of results with a preview and meta-data
— Evaluation: longitudinal study using real data on subjects’ PCs (234 people), 6 weeks
Phlat
— Provenance types: file meta-data (such as kind, date, author, email attributes); contextual cues such as user-defined tags
— Provenance monitoring: Microsoft Desktop Search database, extra meta-data as tags (labeling system)
— Provenance use: query formulation, search process, results presentation
— UI approach: text input with selectable filters, list view of results with a preview and meta-data
— Evaluation: longitudinal study using real data on subjects’ PCs (225 people), 8 months
YouPivot
— Provenance types: environmental factors as contextual cues, user-defined marks
— Provenance monitoring: integrated system monitor recording contextual cues and their occurrences
— Provenance use: query formulation, search process
— UI approach: textual input and selectable filters, list view of results
— Evaluation: canned data, limited within-subjects user study
17. Tool Analysis
Feldspar
— Feldspar – Chau et al. 2008
— Desktop search
— Uses associations between files and resources
— extracted from Google Desktop database
— Keyword and meta-data search
— Flowchart-like user interface
— Real-time results, fast
— Evaluated with canned data
— Within subject study
18. Tool Analysis
Stuff I’ve Seen, Phlat
— Stuff I’ve Seen (SIS) – Dumais et al. 2003; Phlat – Cutrell et al. 2006
— Similar to Windows Desktop Search
— Keyword and meta-data search
— Ranks the results using contextual cues
— Textual input
— List view of results with snippet and meta-data
— Unified labeling (Phlat)
— Longitudinal study
19. Tool Analysis
YouPivot
— YouPivot – Hailpern et al. 2011
— Search web browsing history
— Internal system monitor
— Uses keyword for search and contextual cues to filter the results
— Timeline view for user activities
— Textual input, list view of results
— TimeMarks to filter the results
— Evaluated with canned data
— Within subject study
20. Phase 3: Design Goals
— Use dynamic relations
between files
— Integration with keyword
search
— Graphical UI
— Allowing all kinds of
graphical queries
— Internal system monitor
— Result exploration
21. Phase 3: System Requirements
— Provenance + Keyword search
— Streamline query composition
using a drag-drop graphical
sketchpad
— Allow for flexible exploration
and discovery
— Integration with Windows
Explorer to allow exploration of
workflow and information
provenance
22. Phase 3 contd.
Exact pattern matching is NP-complete!
(subgraph isomorphism problem)
— Introducing * links
— Partial matching
— Easier to solve
— Better matches user recall
— Use the G-Ray algorithm [Tong et al. 2007]
— Best-effort matching
— Fast, scalable, flexible and forgiving
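The best-effort idea can be illustrated with a toy matcher. G-Ray itself avoids exhaustive search by ranking candidates with random-walk-with-restart proximity; the brute-force sketch below (names and structure are mine) only shows the partial-matching principle — score each assignment of query nodes to data nodes by how many query edges it preserves, and return the best even when no exact match exists:

```python
from itertools import permutations

def best_effort_match(query_edges, data_edges, data_nodes):
    """Toy partial matcher: try every assignment of query nodes to data
    nodes and keep the one preserving the most query edges. Real systems
    (e.g. G-Ray) avoid this exhaustive enumeration."""
    query_nodes = sorted({n for edge in query_edges for n in edge})
    data_edge_set = set(data_edges)
    best_map, best_score = None, -1
    for assignment in permutations(data_nodes, len(query_nodes)):
        mapping = dict(zip(query_nodes, assignment))
        score = sum((mapping[a], mapping[b]) in data_edge_set
                    for a, b in query_edges)
        if score > best_score:
            best_map, best_score = mapping, score
    return best_map, best_score

# Query sketch: A --> B --> C; the data graph is missing one edge.
query = [("A", "B"), ("B", "C")]
data = [("x", "y")]  # only one of the two query edges can be matched
mapping, score = best_effort_match(query, data, ["x", "y", "z"])
# A partial match (score 1 of 2) is still returned rather than nothing,
# which is forgiving of imperfect user recall.
```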
26. Phase 3: Preliminary Evaluation
Is the UI approach reasonable?
— User Study
— Used file repository modeled after those found at Intel
— Participant selection
— Questionnaire to examine knowledge of search tools
— Graduate students
— Interactive tutorial
— 9 Experiment tasks
“Find the Word document you created using information copy/pasted from an email, a web page, and an Excel document. Find the emails that have this Word document as an attachment.”
— Tasks ordered randomly
— Think aloud protocol
— 4 minutes for each task
— Exit interview about their experience
S. Ghorashi, C. Jensen, "Leyline: provenance-based search using a graphical sketchpad." In Proceedings of the 6th Symposium on Human-Computer Interaction and Information Retrieval (HCIR '12). ACM, New York, NY, USA, Article 2, 10 pages.
27. Phase 3: Preliminary Evaluation
contd.
— Average completion time: 106 seconds
— Simple tasks (72 seconds – 93 seconds)
— Hard tasks (126 seconds – 155 seconds)
— Query complexity (#nodes & #edges)
— Average of 2.8 nodes and 2 edges
— System scales well (Completion time vs. Complexity)
— Observations
— Importance of target document
— Working on one resource or relation at a time
— Saw marked learning effect
— Interviews
— Overall likability rating: 4.2 out of 5 (σ = 0.4)
— Wanted Leyline in real life
— No one complained about effort/time requirement
— Areas for improvement
— Query composition history panel
— Customization options
— Support more resource types
28. Conclusion
— Provenance events are very common in real-world
settings, and potentially helpful in search
— Provenance alone can quickly and effectively identify
unique files/resources (assuming perfect recall)
— A graphical sketchpad is a viable UI for query
composition
— Isn’t going to replace keyword search, but valuable addition
— Users quickly learned how to use our system, and
wanted the tool
29. What about the future?
— Incorporate the feedback and lessons learned into a new
prototype
— Expand feature set to include:
— Auto-completion and suggestion features to speed up the
search process
— Support a broader set of files and resources
— Possibly support other computer platforms
— Prepare for longitudinal study
— How do people adapt and use the Leyline?
— How does the Leyline scale in a large database?
— Does the Leyline change exploration?
— Does the Leyline work in collaborative environments?
30. Thank you
— Thanks to Intel for early funding and subjects!
— For more information:
— Soroush Ghorashi
— (ghorashi@eecs.oregonstate.edu)
— Carlos Jensen
— (cjensen@eecs.oregonstate.edu)