The document discusses semi-structured data processing and personal information search. It describes research at Rutgers University on developing search tools for semi-structured data that unify content and structure. The research aims to let queries contain both structural and content components and to return results even when queries are incomplete. It proposes approaches such as defining a unified data model, query relaxations that approximate the original query, and scoring frameworks to rank unified search results.
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
The document discusses techniques for layout and composition in publications such as posters. It explains the rule of thirds, which involves placing important elements in specific areas of an image. The Z layout follows the eye's natural reading pattern. Vanishing point places the focal point where lines seem to recede into the distance, drawing the eye to that location. These techniques can be used to create aesthetically pleasing and effective visuals that guide the viewer's attention.
70% of all SAP on Linux customers rely on SUSE Linux
Reduce your SAP infrastructure TCO by up to 80%
Intel's Enterprise Computing Platform is pulling ahead of UNIX
How to get your SAP landscapes to SUSE Linux on Intel: SAP Consulting by Texperts
This document describes the complete health and beauty services offered by a spa or wellness center. They provide high quality facial and body treatments using the most advanced products and latest technology, including deep exfoliation facials, firming treatments using cavitation and radiofrequency, therapeutic massages, hair services, and manicures/pedicures. The goal is to provide rejuvenating and beautifying services for clients' skin, body, hair, hands and feet.
15sec is a mobile app and website that allows users to take or select a photo, add effects, and share it so that it receives at least 15 seconds of fame by being displayed on the 15sec site and Facebook page. The photo is then removed from the services but can be shared further from the Facebook folder. Currently supported effects include filters, frames, and drawing tools. The app is available on Android, and the website and social media pages provide information on the service and where it is located.
Presentation at the INSPIRE Workshop "Concrete steps to implement INSPIRE: synergies between the public and the private sector" - Florence, 24th June 2013
LR HEALTH&BEAUTY SYSTEMS product catalog 2014_2t575ae
www.LRLiderTime.blogspot.ru Skype marinair2011
We invite you to work with us!
A large German direct-sales company, LR HEALTH&BEAUTY SYSTEMS, is recruiting managers to promote the company and its health and beauty products in Russia, Ukraine, and Kazakhstan. Training is free for the company's managers. Selection criteria: ability to learn, good communication skills, integrity, and a proactive attitude. You can start working with us from age 18; education and gender do not matter. Company employees can obtain the full product range at the purchase price. Those who meet the company's sales requirements receive further training in Moscow as well as training abroad. Cooperation with sole proprietors and legal entities is possible. Contact phone: 89136910033
The document discusses the history and development of artificial intelligence over the past 70 years. It outlines some of the key milestones in AI research from the early work in the 1950s to modern advances in machine learning using neural networks. While progress has been made, fully general human-level artificial intelligence remains an ongoing challenge being explored by researchers around the world.
This document outlines details for a final English workshop for the Mechanics II (Night) course taught by Professor Luis Martines on October 20, 2012 at CEFIT's San Mateo campus. The workshop will include students Edison Rene Henao and Gustavo Ruiz.
Coca-Cola is the world's largest beverage company, founded in 1886 in Atlanta, Georgia, by John S. Pemberton and Asa Candler. It operates in over 200 countries and sells over 3,000 beverage products. Coca-Cola employs about 146,200 people worldwide and has a current capital of $1.81 trillion with total annual revenue of $48.01 billion. Some of Coca-Cola's most popular brands include Coca-Cola, Diet Coke, Fanta, and Sprite.
This document summarizes a study on the impact of demonstration plots and contact farmers on the adoption of sustainable land management (SLM) practices in Mozambique. The study found that assigning female contact farmers (FCFs) who conducted demonstration plots increased both women's and men's knowledge of SLM practices as well as women's adoption of these practices. While male contact farmers (MCFs) appeared to have no significant effect, there was some evidence that male farmers learned from other male peers. The study is being followed up to further explore factors like labor constraints and the roles and relationships between contact farmers.
The passage discusses Saul's disobedience to God's command to utterly destroy the Amalekites. When Samuel confronts Saul about sparing some of the livestock, Saul tries to justify his actions by saying the people spared them to sacrifice to God. Samuel tells Saul that obedience is better than sacrifice, and that rebelling against God's word is as serious as idolatry. Because Saul disobeyed, God rejects him as king over Israel.
Panel Discussion – Grooming Data Scientists for Today and for Tomorrow | HPCC Systems
In this session, we will explore the talent gap for data scientists including the potential causes and what academia and the private sector are doing to develop the necessary talent. Will the skills which are in such explosive demand today still be in demand in the future? This panel of professors and practitioners will engage in a conversation about the talent issues facing companies across the country and around the world and what they are doing about it.
The document discusses various considerations for planning and structuring a training course, including determining objectives, number of participants, venue requirements, and seating arrangements. It provides recommendations for each, describing advantages and disadvantages of different options. For example, it suggests limiting participants to 20 and assessing distractions at the venue. For seating, it analyzes formats like rows of chairs, a U-shape, banquet style, conference tables, and circles of chairs. The goal is to choose arrangements conducive to participation and that facilitate discussions, group work, and trainer movement in the room.
This document provides tips for scientists on how to spread their discoveries to the general public through journalists and media outlets. It advises focusing on practical consequences and everyday life impacts to attract journalists' interest. It also suggests using catchy opening lines and focusing on the intended audience when reaching out to general news publications versus scientific magazines. The document emphasizes making the message social through groups like "Dibattito scienza" on Facebook and considering both the effectiveness and ethics of experiments involving animals.
The document provides examples of completing sentences in the past tense and choosing the correct verb form or pronoun for different sentences. It includes exercises transforming sentences to the simple past tense, using simple past or past continuous tenses to complete sentences, and choosing between objective and subjective pronouns to complete sentences. The exercises focus on practicing different English verb tenses and pronouns.
This poem expresses love and care for the addressee, referring to them as a jewel, a crystal, and the speaker's heartbeat and total wisdom. The speaker feels shame and sorrow at being unable to protect the addressee from harm or pain, and lives in others' shadows. The speaker wants to know how to protect the addressee from enemies and is astonished by the addressee's power and resilience as they continue to shine like a star.
Enjoy Up to 50% Discounts on all computer training courses | CMS Computer
CMS Computer Training Center is offering up to 50% discounts on all IT training courses during a festival. With 18 years of experience, CMS has trained thousands of students in technical, soft, and job-essential skills and placed them in top firms. CMS provides training and certification in programs like CATIA, Creo, Oracle DBA, Java, Microsoft technologies and more. Courses are offered in regular, weekend, fast track and virtual formats. Students will receive certificates upon completion and discounts on international exams.
The checklist for preparing your Exchange 2010 infrastructure for Exchange 20... | Eyal Doron
The checklist for preparing your Exchange 2010 infrastructure for Exchange 2013 coexistence |10#23
http://o365info.com/the-checklist-for-preparing-your-exchange-2010-infrastructure-for-exchange-2013-coexistence/
A short preparation checklist for the project of: Exchange 2013/2010 coexistence environment, in which we review some of the components and infrastructure that we will need to prepare.
Eyal Doron | o365info.com
The document presents two protagonists, Bovina and Taurus, which are garments designed for enjoying nature. Both garments are black and white, weigh 200 grams, and are breathable and durable. They were designed for the 2008-2009 summer season by Daniel Moreno Casas and Oscar Martinez Amelibia.
The document discusses using contextual information from personal data sources to improve information search and retrieval. It describes how people naturally remember past data based on contextual clues such as location, time, and the other people involved. A personal data assistant could index and integrate content and metadata from various sources to enable contextual searches. Challenges include developing unified data models and tools to discover and leverage both explicit and implicit contextual information from personal information sources.
The document discusses the emerging role of social media and web technologies in data sharing and provenance management. It outlines barriers to data sharing such as privacy, competition and lack of incentives. Provenance tracking is presented as a key way to enable data sharing by providing metadata on the origin, transformation and usage of data. The W7 provenance model and example provenance graphs are provided. Lessons from social media include taking a dashboard approach to provenance exploration and using crowdsourcing to facilitate sharing through wikis, blogs and forums. The conclusion states that good provenance management can help remove barriers to data sharing.
Andrew Treloar, overview of ACEAS Data Workflow, ACEAS Grand 2014 | aceas13tern
This document summarizes challenges and issues around data management and synthesis projects discussed at a workshop. Key challenges identified include a lack of metadata, limited availability of relevant open data, difficulties identifying and acquiring the right data at the appropriate spatial and temporal scales, data mismatches between available data and research questions, and reluctance from some data owners to share data. Possible actions discussed to help address these challenges include encouraging standardization, concentrating on large long-term studies, providing tools to incentivize data sharing, and changing norms around data sharing within disciplines.
This document provides an overview of data management basics for graduate students. It discusses why managing data is important, including requirements from funders and for responsible research. It then covers topics like organizing data through file naming, versioning, backup and storage strategies, and post-project activities. Resources for developing data management plans and tools are also listed. The overall message is that planning is key to prevent data loss and enable efficient and ethical research.
It's 2015. Do You Know Where Your Data Are? | Patricia Hswe
This document summarizes a presentation on research data management. It discusses definitions of research data and why data should be shared. It provides tips for best practices in file naming, description standards, formats and storage. Tools, resources and services for research data management from Penn State and beyond are presented, including ScholarSphere and DMPTool. The importance of having an online presence and sharing research is discussed.
Semantic Similarity and Selection of Resources Published According to Linked ... | Riccardo Albertoni
This position paper discusses the potential of exploiting linked data best practices to provide metadata documenting domain-specific resources created through verbose acquisition-processing pipelines. It argues that resource selection, namely the process of choosing a set of resources suitable for a given analysis or design purpose, must be supported by a deep comparison of their metadata. The semantic similarity proposed in our previous works is discussed for this purpose, and the main issues in scaling it up to the web of data are introduced. The issues discussed reach beyond the re-engineering of our similarity measure, since they largely apply to any tool that exploits information made available as linked data. A research plan and an exploratory phase addressing these issues are described, noting the lessons we have learnt so far.
Research Data Management: What is it and why is the Library & Archives Servic... | GarethKnight
This document summarizes research data management and the library and archives service's involvement. It defines research data, explains why data needs to be managed, and outlines the key drivers for data management and publication. It then describes the library and archives service's knowledge of data management, the research data management support service being established, and the guidance, training, and tools being developed to help researchers with data management.
Introduction to Data Management Powerpoint | ichanismo
This document provides an introduction to data management and databases. It defines key terms like data, information, and knowledge. It explains that data is stored in databases and files, but neither stores information until the data is processed. It also describes the basic components of a database management system including the database itself, access engine, and utilities. The document outlines the history of database systems from file management to hierarchical, network, and relational models. It explains some advantages and disadvantages of database processing.
Join Objectivity, Inc.’s VP of Product Management, Brian Clark, in a discussion of the latest trends in Big Data Analytics, defining what Big Data is and understanding how to maximize your existing architectures by using NOSQL technologies to improve functionality and provide real-time results. There will be a focus on relationship analytics as well as an introduction to NOSQL data stores and object and graph databases, such as the architecture behind Objectivity/DB and InfiniteGraph.
2013 02 data portal science group update -v smith | Vince Smith
The document discusses plans to create a new data portal at data.nhm.ac.uk to address issues with finding, accessing, citing, and integrating research data and collection data from the Natural History Museum. It will provide a central access point, allow for integrated search and browse of datasets, and enable users to download, export, and analyze data. The portal will follow an open by default approach and be populated by museum staff. Development will occur over three years with initial focus on discovery of research datasets and collections data, followed by improved visualization and citation of data.
Presentation of "Reusing Linguistic Resources: Tasks and Goals for a Linked Data Approach", March 9, DGfS 34, Frankfurt Germany.
Find the paper at: http://www.springerlink.com/content/k535323272457913
ESI Supplemental Webinar 2 - DataONE presentation slides | DuraSpace
This document provides an overview of a webinar on DataONE, a project that aims to provide tools and approaches for supporting the data life cycle. The webinar covered three key challenges in data management: preservation and planning, discovery, and innovation. It discussed how DataONE is working to address these challenges through its coordinated network of member nodes that allow for data preservation, sharing and discovery. The webinar also demonstrated some of DataONE's tools like the DMPTool for data management planning and the Investigator Toolkit for data analysis and visualization.
On demand access to Big Data through Semantic Technologies | Peter Haase
The document discusses enabling on-demand access to big data through semantic technologies. It describes how semantic technologies like Linked Data and ontologies can be used to virtually integrate and provide access to large, heterogeneous datasets across different data silos. The key points are that semantic technologies allow for big data to be accessed and analyzed on-demand in a self-service manner through a "Linked Data as a Service" approach, providing scalable end user access to big data.
This document provides information and recommendations for preventing data loss through proper storage, organization, and backup of research files. It discusses developing a consistent file naming convention and folder structure for projects. The document also recommends storing multiple copies of important files in different locations and using version control software to track changes over time. Activities are included to help attendees evaluate their current practices and develop improved plans for organizing, backing up, and locking important versions of their data and files.
Towards a Simple, Standards-Compliant, and Generic Phylogenetic Database | Hilmar Lapp
Jane's lab uses a freely available open source software package called PhyloDOM to create a phylogenetic database of their molecular data resolving the phylogeny of endangered frog species, allowing their results to be easily shared and integrated by other researchers through web interfaces, data aggregators, and visualization tools that take advantage of standardized metadata.
Automated Target Definition Using Existing Evidence and Outlier Analysis | George Ang
This document discusses using outlier analysis techniques to help define targets for digital forensic investigations. It proposes using existing evidence to generate search suggestions and defines targets based on file attributes like time, name, and content similarities. The techniques were tested on honeypot data and identified hidden files. Future work includes reducing human error rates and combining outlier analysis with other forensic techniques.
Talk about Exploring the Semantic Web, and particularly Linked Data, and the Rhizomer approach. Presented August 14th 2012 at the SRI AIC Seminar Series, Menlo Park, CA
RDAP13 Mark Leggott: Stewarding research data using the Islandora framework | ASIS&T
Mark Leggott, University of PEI/DiscoveryGarden
Islandora: Stewarding research data using the Islandora framework
Mark Leggott, Thornton Staples and Kathleen Van Ekris
Panel: Global scientific data infrastructure
Research Data Access & Preservation Summit 2013
Baltimore, MD April 4, 2013 #rdap13
Personal Information Management Systems - EDBT/ICDT'15 Tutorial - Amélie Marian
The document discusses the challenges of personal information management systems (PIMS) in the past and potential solutions. It notes that personal data used to be stored in fragmented and disconnected ways across devices, applications and services, making it difficult for users to organize, search and control their data. Early PIMS projects from the late 1990s and 2000s tried to address these issues by developing new models and tools for organizing personal data based on concepts like time, tasks, semantics and social networks. However, personal data remains fragmented across many different systems today. The document proposes that a unified PIMS that centrally manages all of a user's information could help overcome these challenges by giving users more control and freedom over their personal data.
Personal Information Search and Discovery - Amélie Marian
The document discusses the challenges of personal information management in the digital age. It describes how personal data is fragmented across many different devices and systems. Effective personal information management systems are needed to integrate this diverse personal data and support tasks like search, recollecting memories, and knowledge discovery. The author proposes a context-aware personal information management system called the Digital Self Project, which uses a w5h data model to index personal data by context dimensions like what, who, where, when, why and how. Preliminary results show the w5h search approach improves search accuracy over traditional text search.
Personalizing Forum Search using Multidimensional Random Walks - Amélie Marian
This document describes a method for improving forum search through personalization. It proposes using multidimensional random walks to compute user similarities based on multiple dimensions like co-participation, interactions, topics and profiles. The approach builds a multidimensional heterogeneous graph and executes random walks with weights based on egocentric relations. Key results show this method predicts similar users to answer future questions better than 6 baselines, and can enhance keyword search by re-ranking results based on contributor authority scores from multiple relations.
Corroborating Facts from Affirmative Statements - Amélie Marian
This document discusses approaches for corroborating information from sources even when the information provided is mostly consistent and does not directly contradict. It proposes assigning multiple trust scores to each source to account for sources having different accuracy levels for different facts. An algorithm is presented that incrementally evaluates facts by selecting groups of facts based on entropy and updating source trust scores. Experiments on real restaurant data demonstrate the approach outperforms existing techniques in precision, recall, and accuracy.
The document discusses searching web forums and summarizing forum content at different granularity levels. It proposes a hierarchical model to represent forum structure with threads, posts, and sentences at different levels. It also describes algorithms like OAKS to generate optimal non-overlapping result sets by maximizing quality scores across levels. Evaluation shows the mixed-granularity approach outperforms methods using only posts in terms of perceived relevance of results for queries. The document also discusses enhancing search using authorship information through multi-dimensional random walks to compute author scores.
1. 1
Searching Data with Substance
and Style
Amélie Marian
Rutgers University
http://www.cs.rutgers.edu/~amelie
2. 2
Semi-structured Data Processing
• Large amount of data online and in personal
devices
▫ Structure (style)
▫ Text content (substance)
▫ Different sources (soul)
▫ Finding the data we need can be difficult
Amélie Marian - Rutgers University
3. 3
Semi-structured Data Processing at Rutgers
SPIDR Lab
• Personal Information Search
▫ Semi-structured data
▫ Need for high-quality search tools
• Structuring of User Web Posts
▫ Large amount of user-generated data untapped
▫ Text has inherent structure
▫ Use of text to guide search and data analysis
• Data Corroboration
▫ Conflicting sources of data
▫ Need to identify true facts
4. 4
Joint work with:
Wei Wang
Christopher Peery
Thu Nguyen
Computer Science, Rutgers University
5. 5
Personal Information Search
Web: search for relevant documents
Personal data: search for specific documents
Information that can be used for personal information search
• Content (keywords)
• Metadata (file size, modification time, etc.)
• Structure
▫ Directory (external)
▫ File structure (internal): XML, LaTeX tags, Picture tags, etc.
▫ Partially known
6. 6
PIMS Project Description
Publications: EDBT’08, ICDE’08 (demo), DEB’09, EDBT’11, TKDE (accepted)
• Data and query models that unify content and structure
• Scoring framework to rank unified search results
• Query processing algorithms and index structures to
score and rank answers efficiently
• Evaluation of the quality and efficiency of the unified
scoring
NSF CAREER Award July 2009-2014
7. 7
Target file: Halloween party pictures taken at home where someone
wears a witch costume
Separate Structure and Content
File boundary
Directory: //Home
Keywords: Halloween, witch
8. 8
Current Search Tools
Current search tools (e.g., web, desktop, GDS) mostly rely on ranking and filtering.
▫ Ranking on content keywords
▫ Filtering on additional conditions (e.g., metadata, structure)
Example: find a jpg file saved in directory /Desktop/Pictures/Home that contains the words “Halloween witch”
This approach is often insufficient.
▫ Filtering forces a binary decision. GIF files and files under directory /Archive/Pictures/Home are not returned.
▫ Structure and content are strictly separated. Files under directory /Pictures/Halloween are not returned.
9. 9
Unified Approach
Goal: Unify structure and content
▫ Develop a unified view of directory and file structure
▫ Allow for a single query to contain both structure and
content components and to be answered at once
▫ Return results even if queries are incomplete or contain
mistakes
Approach:
▫ Define a unified data model by ignoring file boundaries
▫ Define a unified query model
▫ Define relaxations to approximate unified queries
▫ Define relevance score for unified queries
10. 10
Unified Structure and Content
Target file: Halloween party pictures taken at home where someone
wears a witch costume
//Home[.//“Halloween” and .//“witch”]
[Figure: unified tree. Below root, the Home node has “Halloween” and “witch” as descendants; the file boundary is ignored.]
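The unified query above can be read as: find a node named Home, anywhere in the unified tree, that has both “Halloween” and “witch” somewhere beneath it, whether those terms are directory names, file names, or file content. A minimal Python sketch of that reading follows. The `Node` class, the `matches` helper, the case-insensitive comparison, and the sample tree are all illustrative assumptions, not the authors' data structures or matching algorithm.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One node of a unified tree: a directory, a file, or a content term."""
    label: str
    children: List["Node"] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node strictly below `node`."""
    for child in node.children:
        yield child
        yield from descendants(child)


def matches(root: Node, anchor: str, terms: List[str]) -> List[Node]:
    """Return nodes labelled `anchor` that have every term below them.

    Mirrors //anchor[.//"t1" and .//"t2"]. Labels are compared
    case-insensitively (an assumption of this sketch).
    """
    hits = []
    for node in [root, *descendants(root)]:
        if node.label.lower() != anchor.lower():
            continue
        below = {d.label.lower() for d in descendants(node)}
        if all(t.lower() in below for t in terms):
            hits.append(node)
    return hits


# Hypothetical tree: the file boundary is ignored, so the tags stored inside
# the picture file are ordinary descendants of the Home directory.
tree = Node("root", [
    Node("Pictures", [
        Node("Home", [
            Node("IMG_1391.gif", [Node("Halloween"), Node("witch")]),
        ]),
        Node("Work", [Node("slides.pdf", [Node("Halloween")])]),
    ]),
])

result = matches(tree, "Home", ["Halloween", "witch"])
print([n.label for n in result])
```

Because directories, files, and content terms are all plain nodes here, the same query would still match if “Halloween” were a subdirectory name rather than a tag inside the file.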
11. 11
From Query to Answers
[Figure: processing pipeline. The user query goes through relaxation, which produces a DAG of relaxed queries; matching finds the matches (answers) for the relaxed queries; scoring produces the ranked answers (TA algorithm).]
13. 13
DAG Representation
p – Pictures, h – Home
IDF score
▫ Function of how many files match the query
▫ DAG stores IDF scoring information
DAG nodes (relaxed queries):
/p/h (exact match)
//p/h, /p//h, /(p/h)
1 - /p/h//*
2 - //p/h//*
3 - //(p/h)
//p//h
//n
//p//*, //h//*
//* (match all)
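One family of relaxations visible in the DAG above replaces a parent-child step (`/`) with an ancestor-descendant step (`//`). The sketch below enumerates only that family for a simple path query and orders the results from most exact to most relaxed. It is an assumption-laden illustration: the function name `axis_relaxations` is hypothetical, and the other relaxations shown on the slide (wildcards such as `//*` and permutations such as `/(p/h)`) are deliberately left out.

```python
from itertools import product
from typing import List, Tuple


def axis_relaxations(steps: List[str]) -> List[Tuple[int, str]]:
    """Enumerate queries obtained by relaxing '/' to '//' on each step.

    `steps` holds the node labels of an exact path query, e.g. ["p", "h"]
    for /p/h. Returns (number_of_relaxed_axes, query) pairs, most exact first.
    """
    out = []
    for axes in product(["/", "//"], repeat=len(steps)):
        query = "".join(axis + label for axis, label in zip(axes, steps))
        out.append((axes.count("//"), query))
    return sorted(out)


for level, query in axis_relaxations(["p", "h"]):
    print(level, query)
```

Each relaxed query matches a superset of the files matched by the queries it was derived from, which is what lets a DAG of relaxations carry scoring information from exact to approximate matches.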
14. 14
Query Evaluation
• Top-k query processing
▫ Branch-and-bound approach
• Lazy evaluation of the relaxed DAG structure
▫ DAG is query dependent and has to be generated at runtime
▫ We developed two algorithms to speed up query evaluation
DAGJump skips unnecessary parts of the DAG (sorted accesses)
RandomDAG zooms in on the relevant part of the DAG (random accesses)
• Matching of answers using dedicated data structures
We extended PathStack (Bruno et al. ICDE’02) to support permutations (NIPathstack)
15. 15
Traditional Content TF∙IDF Scoring
• Consider files as “bag of terms”
• TF (Term Frequency)
▫ A file that mentions a query term more often is more relevant
▫ TF could be normalized by file length
• IDF (Inverse Document Frequency)
▫ Terms that appear in too many files have little differentiation
power in determining relevance
• TF∙IDF Scoring
▫ Aggregate TF and IDF scores across all query terms
$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t}$$
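The bag-of-terms scoring described above can be sketched in a few lines of Python. This is a generic illustration, not the scoring used by any particular tool: the length-normalized TF and the `log(N / df)` form of IDF are one common choice among several, and the file names and terms are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(query: List[str], files: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each file as the sum over query terms of tf(t, d) * idf(t)."""
    n_files = len(files)
    # Document frequency: number of files that contain each query term.
    df = {t: sum(1 for terms in files.values() if t in terms) for t in query}
    scores = {}
    for name, terms in files.items():
        counts = Counter(terms)
        total = 0.0
        for t in query:
            if df[t] == 0 or not terms:
                continue  # term appears nowhere, or the file is empty
            tf = counts[t] / len(terms)      # TF normalized by file length
            idf = math.log(n_files / df[t])  # rarer terms weigh more
            total += tf * idf
        scores[name] = total
    return scores


# Hypothetical files, each reduced to a bag of terms.
files = {
    "a.txt": ["halloween", "witch", "party"],
    "b.txt": ["halloween", "pumpkin"],
    "c.txt": ["budget", "report"],
}
scores = tf_idf_scores(["halloween", "witch"], files)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Note that directory names and file structure play no role here, which is the limitation the unified scoring on the following slides addresses.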
16. 16
Unified IDF Score
For a unified data tree T, a path query PQ, and a file F, we define:
• IDF Score
$$\mathrm{score}_{idf}(PQ) = \frac{\log \dfrac{N}{|\mathit{matches}(T, PQ)|}}{\log N}$$
where N is the total number of files, and matches(T, PQ) is the set of files that match PQ in T.
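As a quick check of how this score behaves, the sketch below evaluates it at the two extremes. The function name `idf_score` is an assumption for illustration; the file count 95,172 is the data set size reported in the experimental setup.

```python
import math


def idf_score(total_files: int, matching_files: int) -> float:
    """Normalized IDF: log(N / |matches|) / log(N), in the range [0, 1]."""
    if total_files <= 1 or not 0 < matching_files <= total_files:
        raise ValueError("need N > 1 and 0 < matches <= N")
    return math.log(total_files / matching_files) / math.log(total_files)


N = 95172  # number of files in the data set described in the setup
print(idf_score(N, 1))    # a query matched by a single file scores 1.0
print(idf_score(N, N))    # a query matched by every file scores 0.0
print(idf_score(N, 100))  # anything in between falls strictly inside (0, 1)
```

Dividing by log N keeps the score in [0, 1], so an exact match that only one file satisfies gets the maximum IDF, and the match-all query `//*` gets zero.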
17. 17
TF Score
Path query: //a//{b}
File F: tree with nodes /, a, c, b, d; content nodes “” and “b e f b f”
▫ Structure: matchstruct = 1, nodesstruct = 4, normalized 1/4 = 0.25
▫ Content: matchcontent = 2, nodescontent = 5, normalized 2/5 = 0.4
TF score = ∑ f(x) = f(0.25) + f(0.4)
[Plot: f(x) versus x, both axes from 0 to 1]
1
f ( x) log( 1 x) x , n
n = 2, 3, ...; n affects the relative impact of TF on unified scores
18. 18
Unified Score
Aggregate IDF and TF scores across all relaxed queries
Relaxed query        idf    tf     tf*idf
/a/b (exact match)   1.0    0.15   0.15
//a/b                0.8    0.25   0.2
/a//b                0.8    0.1    0.08
...                  ...    ...    ...
Unified score (sum of tf*idf over all relaxed queries): 0.875
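The aggregation on this slide is a sum of per-query products. The sketch below reproduces it for the three relaxed queries whose values are shown; the function name `unified_score` is an assumption. The slide's total of 0.875 also includes relaxed queries elided as "...", so only the partial sum can be recomputed here.

```python
from typing import Iterable, Tuple


def unified_score(per_query: Iterable[Tuple[float, float]]) -> float:
    """Sum idf * tf over the relaxed queries that a file matches."""
    return sum(idf * tf for idf, tf in per_query)


# (idf, tf) pairs for /a/b, //a/b and /a//b, as listed on the slide.
shown = [(1.0, 0.15), (0.8, 0.25), (0.8, 0.1)]
partial = unified_score(shown)
print(round(partial, 2))  # contribution of the three queries shown
```

A file that matches the exact query also matches its relaxations, so exact matches accumulate score from every level of the DAG.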
19. 19
Experimental Setup
• Platform
PC with a 64-bit hyper-threaded 2.8 GHz Intel Xeon processor, 2 GB memory, a 10K RPM 70 GB SCSI disk, Linux 2.6.16 kernel, Sun Java 1.5.0 JVM.
• Data Set
▫ Files and directories from the environment of a graduate student (15 GB)
▫ 95,172 files (documents 59%, emails 34%) in 7,788 directories. Average directory depth is 6.3, with a maximum of 12.
▫ 57M nodes in the unified data tree, of which 49M (86%) are leaf content nodes
20. 20
Relevance Comparison
• Use Lucene as a comparison basis
• Content-only
Use the standard Lucene content indexing and
search
• Content:Dir
Create two Lucene indexes: content terms, and
terms from the directory pathnames (treated as a
small file)
• Content+Dir
Augment content index with directory path terms
21. 21
Case Study
▫ Search for a witch costume picture taken at home on Halloween
Target: IMG_1391.gif (tagged with “witch” and “Halloween”)
U = unified, C = Content-only, C:D = Content:Dir
Query type | Query condition | Comment | Rank
U | //home[.//”witch” and .//”halloween”] | Accurate condition | 1
U | //halloween/witch/”home” | Structure / content switched | 1
C | {witch, halloween} | Accurate condition | 20
C:D | {witch, halloween} : {home} | Accurate condition | 1
C:D | {witch, home} : {halloween} | Structure / content switched | 245-252
23. 23
Query Processing Performance
[Plot: percentage (0% to 100%) versus query processing time (0 to 10 sec), with one curve for U and one for C:D]
24. 24
Personal Information Search
Contributions
• A multi-dimensional search framework that supports fuzzy query conditions
• Scoring techniques for fuzzy query conditions against a unified view of structure and content
Improves search accuracy over content-based methods by leveraging both structure and content information, as well as relationships between the terms
Shows improvements over existing techniques (GDS, TopX)
• Index structures and optimizations to efficiently process multi-dimensional and unified queries
Significantly reduces the overall query processing time
• Future work directions:
User studies, twig matching, result granularity, context
25. Joint work with:
Gayatree Ganu
Computer Science, Rutgers University
Noémie Elhadad
Biomedical Informatics, Columbia University
User Review Structure Analysis Project – URSA
Patient Emotion and stRucture SEarch USer interface - PERSEUS
26. 26
URSA: User Review Structure Analysis
Project Description WebDB’09
• Aim:
Better understanding of user reviews
Better search and access of user reviews
• Tasks:
Structure Identification and Analysis
Text and Structure Search
Similarity Search in Social Networks
Google Research Award – April 2008
27. 27
Online Reviewing Systems:
Citysearch
Data in Reviews
• Structured metadata
• Textual review body
Sentiment information
Information on product specific
features
Users are inconvenienced
because:
• Large number of reviews
available
• Hard to find relevant reviews
• Vague or undefined
information needs
28. 28
Data Description
• Restaurant reviews extracted from
Citysearch, New York
(http://newyork.citysearch.com)
• The corpus contains:
▫ 5531 restaurants
- associated structured information (name, location, cuisine type)
- a set of reviews
▫ 52264 reviews, of which 1359 are editorial reviews
- structured information (star rating, username, date)
- unstructured text (title, body, pros, cons)
▫ 32284 distinct users
- Distinct username information
• Dataset accessible at
http://www.research.rutgers.edu/~gganu/datasets/
29. 29
Structure Identification
• Classification of review sentences with topic
and sentiment information
Sentence topics: Food, Price, Service, Ambience, Anecdotes, Miscellaneous
Sentence sentiment: Positive, Negative, Neutral, Conflict
30. 30
Text Based Recommendation
System: Evaluation Setting
• For evaluation, we separated three non-
overlapping test sets of about 260 reviews:
▫ Test A and B : Users who have reviewed at least two
restaurants (so that training set has at least one
review)
▫ Test C : Users with at least 5 reviews
• For measuring accuracy of prediction we use the
Root Mean Square Error (RMSE)
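RMSE, the accuracy measure named above, is the square root of the mean squared difference between predicted and actual ratings. A minimal sketch follows; the function name and the three sample ratings are made up for illustration and are not taken from the test sets.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root Mean Square Error between two equally long rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical star ratings: squared errors are 1, 0 and 4.
error = rmse([4, 3, 5], [5, 3, 3])
print(round(error, 3))
```

Because errors are squared before averaging, one prediction that is off by two stars costs as much as four predictions that are each off by one.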
31. 31
Text-Based Recommendation System:
Steps
• Text-derived rating score
▫ Regression-based rating
• Goals
1. Predicting the metadata star rating
2. Predicting the text-derived score
• Only predicts the score, not the content of the reviews
• Lower standard deviations: lower RMSE
• Prediction Strategies
▫ Average-based prediction
▫ Personalized prediction
32. 32
Regression-based Text Rating
• Use text of reviews to generate a rating
• Different categories and sentiment should have different
importance in the rating
Method
• We use multivariate quadratic regression
• Each normalized sentence type [(category, sentiment)] is
a variable in the regression
• Dependent variable is metadata star-rating
• Used training sets to learn the weights for each sentence
type; weights are used in computing text-based score
33. Regression-based Text Rating
• Regression Constant: 3.68
• Regression Weights (First order variables)
Topic           Positive  Negative  Neutral  Conflict
Food            2.62      -2.65     -0.08    -0.69
Price           0.39      -2.12     -1.27    0.93
Service         0.85      -4.25     -1.83    0.36
Ambience        0.75      -0.27     0.16     0.21
Anecdotes       0.95      -1.75     0.06     -0.19
Miscellaneous   1.30      -2.62     -0.30    0.36
• Regression Weights (Second order variables)
Topic           Positive  Negative  Neutral  Conflict
Food            -1.99     2.05      -0.14    0.67
Price           -0.27     2.04      2.17     -1.01
Service         -0.52     3.15      1.76     0.34
Ambience        -0.44     0.81      -0.28    -0.61
Anecdotes       -0.40     2.03      -0.03    -0.20
Miscellaneous   -0.65     2.38      0.50     -0.10
Food, and Negative Price and Service, appear to be most important
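To see how these weights could turn a review into a rating, the sketch below plugs them into a quadratic of the form constant + Σ w1·x + Σ w2·x². This form is an assumption: the slides give first-order and second-order weights per sentence type but do not spell out the exact model (for instance, whether cross terms exist), and the interpretation of x as the fraction of a review's sentences of each (topic, sentiment) type is also assumed. The function name `text_rating` is hypothetical.

```python
from typing import Dict, Tuple

CONSTANT = 3.68
SENTIMENTS = ("positive", "negative", "neutral", "conflict")

# Weights copied from the slide, ordered as in SENTIMENTS.
FIRST_ORDER = {
    "food": (2.62, -2.65, -0.08, -0.69),
    "price": (0.39, -2.12, -1.27, 0.93),
    "service": (0.85, -4.25, -1.83, 0.36),
    "ambience": (0.75, -0.27, 0.16, 0.21),
    "anecdotes": (0.95, -1.75, 0.06, -0.19),
    "miscellaneous": (1.30, -2.62, -0.30, 0.36),
}
SECOND_ORDER = {
    "food": (-1.99, 2.05, -0.14, 0.67),
    "price": (-0.27, 2.04, 2.17, -1.01),
    "service": (-0.52, 3.15, 1.76, 0.34),
    "ambience": (-0.44, 0.81, -0.28, -0.61),
    "anecdotes": (-0.40, 2.03, -0.03, -0.20),
    "miscellaneous": (-0.65, 2.38, 0.5, -0.10),
}


def text_rating(fractions: Dict[Tuple[str, str], float]) -> float:
    """Assumed model: constant + sum(w1 * x + w2 * x^2) over sentence types.

    `fractions` maps (topic, sentiment) to the share of the review's
    sentences of that type (an assumed normalization).
    """
    score = CONSTANT
    for (topic, sentiment), x in fractions.items():
        i = SENTIMENTS.index(sentiment)
        score += FIRST_ORDER[topic][i] * x + SECOND_ORDER[topic][i] * x * x
    return score


all_positive_food = text_rating({("food", "positive"): 1.0})
all_negative_service = text_rating({("service", "negative"): 1.0})
print(round(all_positive_food, 2), round(all_negative_service, 2))
```

Under these assumptions, a review made only of positive food sentences scores above the constant and one made only of negative service sentences scores below it, matching the sign pattern of the first-order weights.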
34. Regression-Based Text Rating: Baseline Case
Restaurant Average-based Prediction
• Prediction using average rating given to a restaurant by all users (we also tried user-average and combined)
• RMSE:
Predicting star ratings              Test A   Test B   Test C
Using star rating                    1.127    1.267    1.126
Using sentiment-based text rating    1.126    1.224    1.046

Predicting sentiment text rating     Test A   Test B   Test C
Using star rating                    0.703    0.718    0.758
Using sentiment-based text rating    0.545    0.557    0.514
Predicting using text does better than the popularly used star rating
35. 35
Clustering-based strategies for
recommendations
• KNN based on a clustering over star ratings
▫ Little improvement over baseline
▫ Does not take into account the textual information
▫ Sparse data
▫ Cold start problem
▫ Hard clustering not appropriate
• Soft clustering
▫ Partitions objects into clusters,
▫ Each user has a membership probability to each
cluster
36. Information Bottleneck Method
• Foundations in Rate Distortion Theory
• Allows choosing tradeoff between
▫ Compression (number of clusters T)
▫ Quality estimated through the average distortion
between cluster points and cluster centroid (β
parameter)
• Shown to work well with sparse datasets
N. Slonim, SIGIR 2002
37. 37
Leveraging text content for
personalized predictions
• Use the sentence types (categories, sentiments)
within the reviews as features
• Users clustered based on the type of information
in their reviews
• Predictions are made using membership
probabilities of clusters to find neighbors
39. 39
Example: Soft-clustering Prediction
User rating (star or text); * marks the rating to predict:
        Restaurant1  Restaurant2  Restaurant3
User1   4            -            -
User2   2            5            4
User3   4            *            3
User4   5            2            -
User5   -            -            1

Cluster membership probabilities:
        Cluster1  Cluster2  Cluster3
User1   0.040     0.057     0.903
User2   0.396     0.202     0.402
User3   0.380     0.502     0.118
User4   0.576     0.015     0.409
User5   0.006     0.990     0.004

• For each cluster, we compute the cluster contribution for the test restaurant
• Weighted average of ratings given to the restaurant
Contribution(c2, r2) = 4.793, Contribution(c3, r2) = 3.487
• We compute the final prediction based on the cluster contributions for the test restaurant and the test user’s membership probabilities
Final prediction = 4.042
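The numbers in this example can be recomputed with a short script. The sketch assumes that a cluster's contribution is the membership-weighted average of the ratings other users gave the test restaurant, and that the final prediction weights each contribution by the test user's membership probability. That reading reproduces the values quoted on the slide, but the function names and data layout are illustrative assumptions.

```python
from typing import Dict, Tuple

# Ratings and membership probabilities copied from the slide.
ratings: Dict[str, Dict[str, int]] = {
    "User1": {"R1": 4},
    "User2": {"R1": 2, "R2": 5, "R3": 4},
    "User3": {"R1": 4, "R3": 3},  # R2 is the rating to predict
    "User4": {"R1": 5, "R2": 2},
    "User5": {"R3": 1},
}
membership: Dict[str, Tuple[float, float, float]] = {
    "User1": (0.040, 0.057, 0.903),
    "User2": (0.396, 0.202, 0.402),
    "User3": (0.380, 0.502, 0.118),
    "User4": (0.576, 0.015, 0.409),
    "User5": (0.006, 0.990, 0.004),
}


def cluster_contribution(cluster: int, restaurant: str, test_user: str) -> float:
    """Membership-weighted average rating of `restaurant` within one cluster."""
    num = den = 0.0
    for user, rated in ratings.items():
        if user == test_user or restaurant not in rated:
            continue
        weight = membership[user][cluster]
        num += weight * rated[restaurant]
        den += weight
    if den == 0:
        raise ValueError("no other user has rated this restaurant")
    return num / den


def predict(test_user: str, restaurant: str) -> float:
    """Combine cluster contributions using the test user's memberships."""
    return sum(
        membership[test_user][c] * cluster_contribution(c, restaurant, test_user)
        for c in range(3)
    )


c2 = cluster_contribution(1, "R2", "User3")  # Cluster2 (zero-based index 1)
c3 = cluster_contribution(2, "R2", "User3")  # Cluster3
prediction = predict("User3", "R2")
print(round(c2, 3), round(c3, 3), round(prediction, 3))
```

User3 belongs mostly to Cluster2, where the user who rated Restaurant2 highly (User2) carries far more weight than the one who rated it low (User4), which pulls the prediction toward the higher rating.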
40. iIB Algorithm
• Experimented with different values of β and T; used β=20, T=100.
RMSE and percentage improvement over baseline:
Predicting star ratings              Test A          Test B          Test C
Using star rating                    1.103 (2.13%)   1.242 (1.74%)   1.106 (1.78%)
Using sentiment-based text rating    1.113 (1.15%)   1.211 (1.06%)   1.046 (0%)

Predicting sentiment text rating     Test A          Test B          Test C
Using star rating                    0.692 (1.56%)   0.704 (1.95%)   0.742 (2.11%)
Using sentiment-based text rating    0.544 (0.18%)   0.549 (1.44%)   0.514 (0%)
• Using text features for clustering always improves results for the traditional goal of predicting star ratings
• Even small improvements in RMSE are useful (Netflix, precision in top-k)
41. 41
URSA: Qualitative Predictions
• Predict sentiment towards each topic
• Cluster users along each dimension separately
• Use threshold to classify sentiment (actual and
predicted)
[Plot: prediction accuracy for positive ambience, shown in bands from 0%-20% to 80%-100%, for θact settings A-0 through A-1 in steps of 0.1]
42. 42
PERSEUS Project Description
Patient Emotion and StRucture SEarch USer
Interface
▫ Large amount of patient-produced data
• Difficult to search and understand
• Patients need help finding information
• Health professionals could learn from the data
▫ Analyze and Search patient forums, mailing lists and blogs
• Topical information
• Specific Language
• Time sensitive
• Emotionally charged
Google Research Award – April 2010
NSF CDI Type I – October 2010-2013
43. 43
PERSEUS Project Description
▫ Automatically add structure to free-text
• Use of context information
• “hair loss” side effect or symptom
• Approximate structure
▫ Use structure to guide search
• Need for high recall, but good precision
• Find users with similar experiences
• Various results granularities
• Thread vs. sentence
• Context dependent
• Needs to take approximation into account
44. 44
Structuring and Searching Web Content
Contributions
• Leveraged automatically generated structure to improve
predictions
▫ Around 2% RMSE improvements
▫ Used inferred structure to group users using soft clustering
techniques
• Qualitative predictions
▫ High Accuracy
• Future directions
▫ Extension to healthcare domains
▫ Use of inferred structure to guide search
▫ Use user clusters in search
▫ Adapt to various result granularities
▫ Take classification inaccuracies into account
45. 45
Joint work with:
Minji Wu Computer Science, Rutgers University
Collaborators:
Serge Abiteboul, Alban Galland INRIA
Pierre Senellart Telecom ParisTech
Magda Procopiuc, Divesh Srivastava AT&T Research Labs
Laure Berti-Equille IRD
46. 46
Motivations
• Information from web sources is unreliable
▫ Erroneous
▫ Misleading
▫ Biased
▫ Outdated
• Users need to check web sites to confirm the
information
▫ Data corroboration
Minji Wu - Rutgers University
47. 47
Example: What is the gas mileage of my
Honda Civic?
Query: “honda civic 2007
gas mileage” on MSN
Search
• Is the top hit, the honda.com site, unbiased?
• Is the autoweb.com web
site trustworthy?
• Are all these values
referring to the correct
year of the model?
Users may check several web
sites to get an answer
48. 48
Example: Identifying good business
listings
• NYC restaurant information from 6 sources
▫ Yellowpages
▫ Menupages
▫ Yelp
▫ Foursquare
▫ OpenTable
▫ Mechanical Turk (check streetview)
Which listings are correct?
49. 49
Data Corroboration Project Description
Publications: WebDB’07, WSDM’10, IS’11, DEB’11
Trustworthy sources report true facts
True facts come from trustworthy sources
• Sources have different
▫ Coverage
▫ Domain
▫ Dependencies
▫ Overlap
Conflict resolution with maximum coverage
Microsoft Live Labs Search Award – May 2006
50. 50
Top-k Join: Project Description
Publications: CleanDB’06, PVLDB’10
Integrate and aggregate information from several sources
(“minji”, “vldb10”, 0.2)
(“minji”, “amélie”, 1.0)
(“amélie”, “vldb10”, 0.5)
(“amélie”, “SIN”, 0.3)
(“minji”, “SIN”, 0.1)
(“SIN”, “vldb10”, 0.9)
51. 51
Data Corroboration
Contributions
• Probabilistic model for corroboration
▫ Fact uncertainty
▫ Source trustworthiness
▫ Source coverage
▫ Conflict between sources
• Fixpoint techniques to compute truth values of facts and
source quality estimates
• Top-k query algorithms for computing corroborated answers
• Open Issues:
▫ Functional dependencies
▫ Time
▫ Social network
▫ Uncertain data
▫ Source dependence
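The circular definition behind corroboration (trustworthy sources report true facts, and true facts come from trustworthy sources) is what a fixpoint computation resolves. The sketch below is a deliberately simple, generic iteration and not the probabilistic model from the cited papers: a value's belief is the share of trust among sources answering the same question, and a source's trust is the average belief of the values it asserts. The function name, the update rules, the initial trust of 0.5, and the three sources are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

Claims = Dict[str, Dict[str, str]]  # source -> {question: asserted value}


def corroborate(claims: Claims, tol: float = 1e-9, max_iter: int = 100):
    """Alternate between fact beliefs and source trust until they stabilize."""
    trust = {source: 0.5 for source in claims}  # assumed uniform prior
    belief: Dict[Tuple[str, str], float] = {}
    for _ in range(max_iter):
        votes = defaultdict(float)   # (question, value) -> summed trust
        totals = defaultdict(float)  # question -> summed trust
        for source, answers in claims.items():
            for question, value in answers.items():
                votes[(question, value)] += trust[source]
                totals[question] += trust[source]
        # Belief in a value: its share of the trust voting on that question.
        belief = {qv: w / totals[qv[0]] for qv, w in votes.items()}
        # Trust in a source: average belief of the values it asserts.
        new_trust = {
            s: sum(belief[(q, v)] for q, v in a.items()) / len(a)
            for s, a in claims.items()
        }
        delta = max(abs(new_trust[s] - trust[s]) for s in claims)
        trust = new_trust
        if delta < tol:
            break
    return belief, trust


# Hypothetical sources: B agrees with the majority on both questions.
claims = {
    "A": {"q1": "x", "q2": "y"},
    "B": {"q1": "x", "q2": "z"},
    "C": {"q1": "w", "q2": "z"},
}
belief, trust = corroborate(claims)
print({k: round(v, 3) for k, v in belief.items()})
print({k: round(v, 3) for k, v in trust.items()})
```

In this toy run the source that sides with the majority twice ends up more trusted, which in turn raises the belief in the values it reports. Handling the open issues listed above (source dependence, time, uncertain data) would require a richer model than this.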
52. 52
Conclusions
• New Challenges in web data management
▫ Semi-structured data
PIMS
User reviews
▫ Multiple sources of data
Conflicting information
Low quality data providers (Web 2.0)
• SPIDR lab at Rutgers focuses on helping users
identify useful data in the wealth of information
available