Introduction to the FP7 CODE project @ BDBC

•

0 likes•876 views

The FP7 CODE project will be presented at the Big Data Benchmarking Community call. Here, a high-level overview shall introduce CODEs vision and show the progress after 6-months.

Technology

Big Data Benchmarking Community Call

CODE
Commercial Empowered Linked Open Data
Ecosystem in Research

presented by Florian Stegmaier
University of Passau
2012-10-04

Some basic facts...
• Budget: 2,4M €, funded by the European Commission
• Started in May 2012 with a runtime of 2 years

2

Current situation

• Data is being produced in an immense rate:
• Evaluation campaigns (e.g., CLEF campaign)
• Benchmarking communities (e.g., TPC)
• Researchers (e.g., proceedings, journals or slides)
Most data remains unstructured and
sophisticated access methods are missing!

3

Why is this a problem?!
• From a data perspective...
• ...what is the quality of data?
• ...how to deal with missing values?
• From a user perspective...
• ...how can i compare this data? baseline?
• ...are there contradicting facts?
The semantics of documents must be unleashed
to make them accessible and processible!
4

Step 1: Analyze data
• Analysis of documents has to ﬁnd:
• Structural elements (TOC, images, etc.)
• Extract facts and numerical measures
• Disambiguate facts (from „string“ to „object“)
• Automatic annotation is defective or not complete
• Crowdsourced annotation of documents
• Marketplace offers revenue for expert knowledge
6

Step 2: Lift and extend data
• Extracted and disambiguated data will be lifted into the
Linked Data cloud
• Interlink with already existent data of the cloud
• Enrich data with provenance information (increase
quality estimations)
• Perform OLAP queries on data cubes (e.g., time series)
Enriched and aggregated data is exposed
as Linked Data endpoint.
7

Step 3: Interact with data
• Query wizard will focus on:
• Excel based interaction possibilities
• On the ﬂy creation of statistical analyses on a
federated dataset
• Marketplace encourages users to interact with data

Non-IT (but maybe domain) experts are able to create
visual analytics as well as create new data cubes.

8

...does all this actually work?

Current analysis of PDFs
is able to discover basic
table of contents,
reading direction, as well
as speciﬁc objects.

9

...how are the users involved?

One possible way to
engage users in
annotating data is the
Mendeley Desktop.
(early stage)

10

...what about lifting data?

Basic tripliﬁcation
chain established to
lift table based data
into a Semantic Web
compatible data
cube.

11

...how can i find data?

The ﬁrst prototype of
the query wizard is able
to show and interact with
retrieved data in a Excel-
like manner.

12

...and the marketplace?

The data can be exposed
in several ways, just like
in mind maps to help things
getting structured.
(example shows biggerplate)

13

Thank you for your
attention! http://www.code-research.eu/
https://www.facebook.com/CODEresearchEU
#CODEresearchEU (Twitter)

Thanks to Michael Granitzer, Christin Seifert, Kai Schlegel and Sebastian Bayerl for supporting me with input, ﬁgures and slide templates and last but
not least our consortium for the prototype screenshots ;)

What's hot

EuropeanaTech 2018: A distributed network of digital heritage information

Enno Meijers

A distributed network of digital heritage information - Semantics Amsterdam

Enno Meijers

KnowEscape workshop, OKCon 2013

Stefan Dietze

2013 DataCite Summer Meeting - Elsevier's program to support research data (H...

datacite

A distributed network of digital heritage information by Enno Meijers - Europ...

Europeana

Enterprise Search engines are strong in locating existing documents and information in an organization while Wikis are designed to capture new information in a lightweight and collaborative fashion. Conversely, Wikis are rather bad in locating information (especially from external documents) while Enterprise Search does not address the provision of new information and socializing around information needs. Therefore we argue that both systems focus on specific parts of the organizational information process, which should indeed be combined in order to improve enterprise information exchange. We discuss “Woogle” as a concept to integrate Enterprise Search into Wikis and describe its reference implementation “Woogle4MediaWiki”.

Woogle -- On Why and How to Marry Wikis with Enterprise Search

Hans-Joerg Happel

Managing Metadata for Science and Technology Studies: the RISIS case

Rinke Hoekstra

University of Northumbria Research

Kevin Ashley

Probabilistic indexing for archival holdings - possibilities and limits

Université Libre de Bruxelles

Building a Public Research Center for the HathiTrust Digital Library

Robert H. McDonald

EUDAT and PRACE joined forces to help research communities gain access to high quality managed e-Infrastructures whose resources can be connected together to enable cross-utilization use cases and make them accessible without any technical barrier. The capability to couple data and compute resources together is considered one of the key factors to accelerate scientific innovation and advance research frontiers. The goal of this session was to present the EUDAT services, the results of the collaboration activity achieved so far and delivers a hands-on on how to write a Data Management Plan or DMP. The DMP is a useful instrument for researchers to reflect on and communicate about the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards. Visit: https://www.eudat.eu/eudat-summer-school

Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)

EUDAT

Marjan will give an overview of the role of data archives in ensuring the safe stewardship and preservation of data over time. She will explain what it means to be a Trustworthy Digital Repository and the associated policies and processes that need to be in place to ensure data provenance and authenticity. This session will link to Monday’s exploration of the re3data.org portal Visit: https://www.eudat.eu/eudat-summer-school

Long-term data curation, aka data preservation - EUDAT Summer School (Marjan ...

EUDAT

Library and data lecture for inf21306

Hugo Besemer

Trusted data and services from the GDML

Olga Caprotti

Towards GeneratingPolicy-compliant Datasets (poster)

Christophe Debruyne

Recommendation to the EU Hearing on Access to and Preservation of Scientific ...

EDINA, University of Edinburgh

Research Data MANTRA Demo

EDINA, University of Edinburgh

20070919 Bkt Padua Esf Dfg Workshop Intro

Deutsche Forschungsgemeinschaft (DFG) - German Research Foundation

Towards the Discovery of Person-Level Data (SemStats, ISWC 2013) [2013.10]

Dr.-Ing. Thomas Hartmann

Yann will introduce the notion of data life cycles (DLCs) as an overarching framework for the workshop. This presentation will explain the key activities and roles identified by EUDAT and undertaken by researchers and data service providers in the process of creating, analysing, managing, sharing and archiving research data. It will highlight how the EUDAT service suite addresses this data lifecycle to support researchers with their key data requirements. He will then present the current research work undertaken in EUDAT to model community specific DLCs, the relation with the concept of provenance and the prototype services being currently developed to bridge the identified gaps in DLC coverage. Visit https://eudat.eu/eudat-summer-school

The Data Lifecycle - EUDAT Summer School (Yann Le Franc)

EUDAT

What's hot (20)

EuropeanaTech 2018: A distributed network of digital heritage information

A distributed network of digital heritage information - Semantics Amsterdam

KnowEscape workshop, OKCon 2013

2013 DataCite Summer Meeting - Elsevier's program to support research data (H...

A distributed network of digital heritage information by Enno Meijers - Europ...

Woogle -- On Why and How to Marry Wikis with Enterprise Search

Managing Metadata for Science and Technology Studies: the RISIS case

University of Northumbria Research

Probabilistic indexing for archival holdings - possibilities and limits

Building a Public Research Center for the HathiTrust Digital Library

Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)

Long-term data curation, aka data preservation - EUDAT Summer School (Marjan ...

Library and data lecture for inf21306

Trusted data and services from the GDML

Towards GeneratingPolicy-compliant Datasets (poster)

Recommendation to the EU Hearing on Access to and Preservation of Scientific ...

Research Data MANTRA Demo

20070919 Bkt Padua Esf Dfg Workshop Intro

Towards the Discovery of Person-Level Data (SemStats, ISWC 2013) [2013.10]

The Data Lifecycle - EUDAT Summer School (Yann Le Franc)

Similar to Introduction to the FP7 CODE project @ BDBC

Shifting Scientific Practice - ORCID 2015

Kaitlin Thaney

Shifting Scientific Practice (K. Thaney)

ORCID, Inc

DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...

Mihai Criveti

00-01 DSnDA.pdf

SugumarSarDurai

Data Science - An emerging Stream of Science with its Spreading Reach & Impact

Dr. Sunil Kr. Pandey

Connecting the Dots: Linking Digitized Collections Across Metadata Silos

OCLC

INF2190_W1_2016_public

Attila Barta

Lecture_1_Intro.pdf

paijitk

Data science presentation

MSDEVMTL

Data Science at Scale - The DevOps Approach

Mihai Criveti

BigData Visualization and Usecase@TDGA-Stelligence-11july2019-share

stelligence

Continuum Analytics and Python

Travis Oliphant

Elag workshop sessie 1 en 2 v10

Jeroen Rombouts

The CSO Open Data Experience

Dublinked .

Cloud Programming Models: eScience, Big Data, etc.

Alexandru Iosup

INDIAN STATISTICAL INSTITUTE Documentation Research & Training Centre 8th Mile, Mysore Road, RVCE Post Bangalore-560 059 DRTC Seminar- 5 2014 Data Literacy ABSTRACT In our increasingly data-driven society, data literacy is an important civic skill which we should be developing in our society. Data is slowly but steadily forcing their way into the societies. Data literacy may seem less technical than either Computer Science or any other fields. Still we need to envisage a wide variety of tools for accessing, converting and manipulating data. These require to understand relational databases (like MS Access), data manipulation techniques, statistical software tools (like Minitab, SPSS, STATA and MS Excel) and data representation software tools (like MS PowerPoint and MS Excel). This seminar includes an introduction on data literacy, its inter-relationship with information literacy and statistical literacy. It also includes various steps for working with data followed by short demonstration of data analysis techniques by using the software STATA11. Speaker: Jayanta Kr. Nayek Date:29 .10.2014. Time: 2 p.m. Venue: DRTC, ISI Bangalore. All are cordially invited. Seminar Coordinator Biswanath Dutta

Data literacy

Jayanta Nayek

Behind the scenes of data science

Loïc Lejoly

Data Science in 2016: Moving Up

Paco Nathan

The term 'Data Science' was first described in scientific literature about 15 years ago. It started to become a major trend in industry about 7 years ago. O'Reilly Media surveys the industry extensively each year. In addition we get a good birds-eye view of industry trends through our conference programs and publications, working closely with some of the best practitioners in Data Science. By now, the field has evolved far beyond its origins eclipsing an earlier generation of Business Intelligence and Data Warehousing approaches. Data Science is moving up, into the business verticals and government spheres of influence where it has true global impact. This talk considers Data Science trends from the past three years in particular. What is emerging? Which parts are evolving? Which seem cluttered and poised for consolidation or other change? Session presented at Big Data Spain 2015 Conference 15th Oct 2015 Kinépolis Madrid http://www.bigdataspain.org Event promoted by: http://www.paradigmatecnologico.com Abstract: http://www.bigdataspain.org/program/thu/slot-2.html

Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015

Big Data Spain

Self Study Business Approach to DS_01022022.docx

Shanmugasundaram M

Similar to Introduction to the FP7 CODE project @ BDBC (20)

Shifting Scientific Practice - ORCID 2015

Shifting Scientific Practice (K. Thaney)

DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...

00-01 DSnDA.pdf

Data Science - An emerging Stream of Science with its Spreading Reach & Impact

Connecting the Dots: Linking Digitized Collections Across Metadata Silos

INF2190_W1_2016_public

Lecture_1_Intro.pdf

Data science presentation

Data Science at Scale - The DevOps Approach

BigData Visualization and Usecase@TDGA-Stelligence-11july2019-share

Continuum Analytics and Python

Elag workshop sessie 1 en 2 v10

The CSO Open Data Experience

Cloud Programming Models: eScience, Big Data, etc.

Data literacy

Behind the scenes of data science

Data Science in 2016: Moving Up

Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015

Self Study Business Approach to DS_01022022.docx

More from Florian Stegmaier

Ansätze für gemeinschaftliches Filtering

Florian Stegmaier

Fortschritte im Bereich Collaborative Filtering

Florian Stegmaier

Realtime  Distributed Analysis  of Datastreams

Florian Stegmaier

Effiziente Verarbeitung von großen Datenmengen

Florian Stegmaier

Trust-based recommender systems

Florian Stegmaier

Trust und Interest Similarity und deren Anwendung für Empfehlungssysteme

Florian Stegmaier

Musikempfehlungssysteme

Florian Stegmaier

Robustheit in Empfehlungssystemen

Florian Stegmaier

Linked Open Data als Basis für Empfehlungssysteme

Florian Stegmaier

Entscheidungshilfe: Recommender System

Florian Stegmaier

Funktionsweise und Ansätze von inhaltsbasiertem Filtern

Florian Stegmaier

Context Basierte Personalisierungsansätze

Florian Stegmaier

Evaluierung von Empfehlungssystemen

Florian Stegmaier

Effiziente Verarbeitung von grossen Datenmengen

Florian Stegmaier

Derzeitig basiert der diagnostische Prozess eines Krankheitsverlaufes in Krankenhäusern auf einer manuellen Beurteilung von Patientendaten zu unterschiedlichen Zeiten und unterschiedlichen Modalitäten (z. B. CT-Aufnahmen vs. MRT). Diese Aufnahmen werden in sehr großen Datenarchiven (Picture Archiving and Communication System, PACS) gespeichert, wohingegen einzelne Datensätze aufgrund von fehlenden aussagekräftigen semantischen Annotationen nur bedingt effizient angefragt werden können. In dieser Präsentation wird ein generischer Ansatz vorgestellt, um die heterogenen Kliniksysteme durch moderne, semantisch aussagekräftige Technologien zu verbinden und uniform anfragbar zu machen. Durch einen uniformen Zugriff bezüglich Speicherungsform und Anfrageparadigma wird auf diese heterogene Datenlandschaft eine hochwertige semantische Diagnoseunterstützung ermöglicht.

Generische Datenintegration zur semantischen Diagnoseunterstützung im Projekt...

Florian Stegmaier

Nowadays multimedia data is produced and consumed at an ever increasing rate. Similarly to this trend, diverse storage approaches for multimedia data have been introduced. These observations lead to the fact that distributed and heterogeneous multimedia repositories exist whereas an unified and easy access to the stored multimedia data is not given. This paper presents an architecture, named AIR, that offers the aformentioned retrieval possibilites. To ensure interoperability, AIR makes use of recently issued standards, namely the MPEG Query Format (MPQF) (multimedia query language) and the JPSearch transformation rules (metadata interoperability).

AIR: Architecture for Interoperable Retrieval on Distributed and Heterogeneou...

Florian Stegmaier

More from Florian Stegmaier (16)

Ansätze für gemeinschaftliches Filtering

Fortschritte im Bereich Collaborative Filtering

Realtime  Distributed Analysis  of Datastreams

Effiziente Verarbeitung von großen Datenmengen

Trust-based recommender systems

Trust und Interest Similarity und deren Anwendung für Empfehlungssysteme

Musikempfehlungssysteme

Robustheit in Empfehlungssystemen

Linked Open Data als Basis für Empfehlungssysteme

Entscheidungshilfe: Recommender System

Funktionsweise und Ansätze von inhaltsbasiertem Filtern

Context Basierte Personalisierungsansätze

Evaluierung von Empfehlungssystemen

Effiziente Verarbeitung von grossen Datenmengen

Generische Datenintegration zur semantischen Diagnoseunterstützung im Projekt...

AIR: Architecture for Interoperable Retrieval on Distributed and Heterogeneou...

Recently uploaded

The 7 Things I Know About Cyber Security After 25 Years | April 2024

Rafal Los

Real Time Object Detection Using Open CV

Khem

Histor y of HAM Radio presentation slide

vu2urc

Imagine a world where information flows as swiftly as thought itself, making decision-making as fluid as the data driving it. Every moment is critical, and the right tools can significantly boost your organization’s performance. The power of real-time data automation through FME can turn this vision into reality. Aimed at professionals eager to leverage real-time data for enhanced decision-making and efficiency, this webinar will cover the essentials of real-time data and its significance. We’ll explore: FME’s role in real-time event processing, from data intake and analysis to transformation and reporting An overview of leveraging streams vs. automations FME’s impact across various industries highlighted by real-life case studies Live demonstrations on setting up FME workflows for real-time data Practical advice on getting started, best practices, and tips for effective implementation Join us to enhance your skills in real-time data automation with FME, and take your operational capabilities to the next level.

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

Safe Software

Boost Fertility New Invention Ups Success Rates.pdf

sudhanshuwaghmare1

[2024]Digital Global Overview Report 2024 Meltwater.pdf

hans926745

Enterprise Knowledge’s Urmi Majumder, Principal Data Architecture Consultant, and Fernando Aguilar Islas, Senior Data Science Consultant, presented "Driving Behavioral Change for Information Management through Data-Driven Green Strategy" on March 27, 2024 at Enterprise Data World (EDW) in Orlando, Florida. In this presentation, Urmi and Fernando discussed a case study describing how the information management division in a large supply chain organization drove user behavior change through awareness of the carbon footprint of their duplicated and near-duplicated content, identified via advanced data analytics. Check out their presentation to gain valuable perspectives on utilizing data-driven strategies to influence positive behavioral shifts and support sustainability initiatives within your organization. In this session, participants gained answers to the following questions: - What is a Green Information Management (IM) Strategy, and why should you have one? - How can Artificial Intelligence (AI) and Machine Learning (ML) support your Green IM Strategy through content deduplication? - How can an organization use insights into their data to influence employee behavior for IM? - How can you reap additional benefits from content reduction that go beyond Green IM?

Driving Behavioral Change for Information Management through Data-Driven Gree...

Enterprise Knowledge

Tata AIG General Insurance Company - Insurer Innovation Award 2024

The Digital Insurer

The presentation explores the development and application of artificial intelligence (AI) from its inception to its current status in the modern world. The term "artificial intelligence" was first coined by John McCarthy in 1956 to describe efforts to develop computer programs capable of performing tasks that typically require human intelligence. This concept was first introduced at a conference held at Dartmouth College, where programs demonstrated capabilities such as playing chess, proving theorems, and interpreting texts. In the early stages, Alan Turing contributed to the field by defining intelligence as the ability of a being to respond to certain questions intelligently, proposing what is now known as the Turing Test to evaluate the presence of intelligent behavior in machines. As the decades progressed, AI evolved significantly. The 1980s focused on machine learning, teaching computers to learn from data, leading to the development of models that could improve their performance based on their experiences. The 1990s and 2000s saw further advances in algorithms and computational power, which allowed for more sophisticated data analysis techniques, including data mining. By the 2010s, the proliferation of big data and the refinement of deep learning techniques enabled AI to become mainstream. Notable milestones included the success of Google's AlphaGo and advancements in autonomous vehicles by companies like Tesla and Waymo. A major theme of the presentation is the application of generative AI, which has been used for tasks such as natural language text generation, translation, and question answering. Generative AI uses large datasets to train models that can then produce new, coherent pieces of text or other media. The presentation also discusses the ethical implications and the need for regulation in AI, highlighting issues such as privacy, bias, and the potential for misuse. These concerns have prompted calls for comprehensive regulations to ensure the safe and equitable use of AI technologies. Artificial intelligence has also played a significant role in healthcare, particularly highlighted during the COVID-19 pandemic, where it was used in drug discovery, vaccine development, and analyzing the spread of the virus. The capabilities of AI in healthcare are vast, ranging from medical diagnostics to personalized medicine, demonstrating the technology's potential to revolutionize fields beyond just technical or consumer applications. In conclusion, AI continues to be a rapidly evolving field with significant implications for various aspects of society. The development from theoretical concepts to real-world applications illustrates both the potential benefits and the challenges that come with integrating advanced technologies into everyday life. The ongoing discussion about AI ethics and regulation underscores the importance of managing these technologies responsibly to maximize their their benefits while minimizing potential harms.

Artificial Intelligence: Facts and Myths

Joaquim Jorge

GenAI Risks & Security Meetup 01052024.pdf

lior mazor

Effective data discovery is crucial for maintaining compliance and mitigating risks in today's rapidly evolving privacy landscape. However, traditional manual approaches often struggle to keep pace with the growing volume and complexity of data. Join us for an insightful webinar where industry leaders from TrustArc and Privya will share their expertise on leveraging AI-powered solutions to revolutionize data discovery. You'll learn how to: - Effortlessly maintain a comprehensive, up-to-date data inventory - Harness code scanning insights to gain complete visibility into data flows leveraging the advantages of code scanning over DB scanning - Simplify compliance by leveraging Privya's integration with TrustArc - Implement proven strategies to mitigate third-party risks Our panel of experts will discuss real-world case studies and share practical strategies for overcoming common data discovery challenges. They'll also explore the latest trends and innovations in AI-driven data management, and how these technologies can help organizations stay ahead of the curve in an ever-changing privacy landscape.

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc

AWS Community Day CPH - Three problems of Terraform

Andrey Devyatkin

In this session, we will delve into strategic approaches for optimizing knowledge management within Microsoft 365, amidst the evolving landscape of Copilot. From leveraging automatic metadata classification and permission governance with SharePoint Premium, to unlocking Viva Engage for the cultivation of knowledge and communities, you will gain actionable insights to bolster your organization's knowledge-sharing initiatives. In this session, we will also explore how to facilitate solutions to enable your employees to find answers and expertise within Microsoft 365. You will leave equipped with practical techniques and a deeper understanding of how there is more to effective knowledge management than just enabling Copilot, but building actual solutions to prepare the knowledge that Copilot and your employees can use.

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Drew Madelung

Finology Group – Insurtech Innovation Award 2024

The Digital Insurer

Axa Assurance Maroc - Insurer Innovation Award 2024

The Digital Insurer

With more memory available, system performance of three Dell devices increased, which can translate to a better user experience Conclusion When your system has plenty of RAM to meet your needs, you can efficiently access the applications and data you need to finish projects and to-do lists without sacrificing time and focus. Our test results show that with more memory available, three Dell PCs delivered better performance and took less time to complete the Procyon Office Productivity benchmark. These advantages translate to users being able to complete workflows more quickly and multitask more easily. Whether you need the mobility of the Latitude 5440, the creative capabilities of the Precision 3470, or the high performance of the OptiPlex Tower Plus 7010, configuring your system with more RAM can help keep processes running smoothly, enabling you to do more without compromising performance.

Boost PC performance: How more available memory can improve productivity

Principled Technologies

Automating Google Workspace (GWS) & more with Apps Script

wesley chun

🐬 The future of MySQL is Postgres 🐘

RTylerCroy

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Martijn de Jong

Scaling API-first – The story of a global engineering organization

Radu Cotescu

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024

Real Time Object Detection Using Open CV

Histor y of HAM Radio presentation slide

From Event to Action: Accelerate Your Decision Making with Real-Time Automation

Boost Fertility New Invention Ups Success Rates.pdf

[2024]Digital Global Overview Report 2024 Meltwater.pdf

Driving Behavioral Change for Information Management through Data-Driven Gree...

Tata AIG General Insurance Company - Insurer Innovation Award 2024

Artificial Intelligence: Facts and Myths

GenAI Risks & Security Meetup 01052024.pdf

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

AWS Community Day CPH - Three problems of Terraform

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Finology Group – Insurtech Innovation Award 2024

Axa Assurance Maroc - Insurer Innovation Award 2024

Boost PC performance: How more available memory can improve productivity

Automating Google Workspace (GWS) & more with Apps Script

🐬 The future of MySQL is Postgres 🐘

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Scaling API-first – The story of a global engineering organization

Introduction to the FP7 CODE project @ BDBC

1. Big Data Benchmarking Community Call CODE Commercial Empowered Linked Open Data Ecosystem in Research presented by Florian Stegmaier University of Passau 2012-10-04

2. Some basic facts... • Budget: 2,4M €, funded by the European Commission • Started in May 2012 with a runtime of 2 years 2

3. Current situation • Data is being produced in an immense rate: • Evaluation campaigns (e.g., CLEF campaign) • Benchmarking communities (e.g., TPC) • Researchers (e.g., proceedings, journals or slides) Most data remains unstructured and sophisticated access methods are missing! 3

4. Why is this a problem?! • From a data perspective... • ...what is the quality of data? • ...how to deal with missing values? • From a user perspective... • ...how can i compare this data? baseline? • ...are there contradicting facts? The semantics of documents must be unleashed to make them accessible and processible! 4

5. The long way to knowledge... 5

6. Step 1: Analyze data • Analysis of documents has to ﬁnd: • Structural elements (TOC, images, etc.) • Extract facts and numerical measures • Disambiguate facts (from „string“ to „object“) • Automatic annotation is defective or not complete • Crowdsourced annotation of documents • Marketplace offers revenue for expert knowledge 6

7. Step 2: Lift and extend data • Extracted and disambiguated data will be lifted into the Linked Data cloud • Interlink with already existent data of the cloud • Enrich data with provenance information (increase quality estimations) • Perform OLAP queries on data cubes (e.g., time series) Enriched and aggregated data is exposed as Linked Data endpoint. 7

8. Step 3: Interact with data • Query wizard will focus on: • Excel based interaction possibilities • On the ﬂy creation of statistical analyses on a federated dataset • Marketplace encourages users to interact with data Non-IT (but maybe domain) experts are able to create visual analytics as well as create new data cubes. 8

9. ...does all this actually work? Current analysis of PDFs is able to discover basic table of contents, reading direction, as well as speciﬁc objects. 9

10. ...how are the users involved? One possible way to engage users in annotating data is the Mendeley Desktop. (early stage) 10

11. ...what about lifting data? Basic tripliﬁcation chain established to lift table based data into a Semantic Web compatible data cube. 11

12. ...how can i find data? The ﬁrst prototype of the query wizard is able to show and interact with retrieved data in a Excel- like manner. 12

13. ...and the marketplace? The data can be exposed in several ways, just like in mind maps to help things getting structured. (example shows biggerplate) 13

14. Thank you for your attention! http://www.code-research.eu/ https://www.facebook.com/CODEresearchEU #CODEresearchEU (Twitter) Thanks to Michael Granitzer, Christin Seifert, Kai Schlegel and Sebastian Bayerl for supporting me with input, ﬁgures and slide templates and last but not least our consortium for the prototype screenshots ;)

Introduction to the FP7 CODE project @ BDBC

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Introduction to the FP7 CODE project @ BDBC

Similar to Introduction to the FP7 CODE project @ BDBC (20)

More from Florian Stegmaier

More from Florian Stegmaier (16)

Recently uploaded

Recently uploaded (20)

Introduction to the FP7 CODE project @ BDBC