Mapping a human brain generates petabytes of gene listings and the corresponding locations of those genes throughout the brain. Because of the size of this dataset, a prototype Semantic Web application was created with the ability to link new datasets from similar fields of research and present the resulting models to an online community. The application presents a large set of gene-to-location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain.
In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, and the underlying technologies behind the application, and we will demonstrate the preliminary use cases of the project.
2. Today we are discussing…
• What is the use case and who requested it?
• How do you import and normalize thousands of RDF triples worth of gene data?
• How do we enrich the normalized gene data with parallel research data sets?
• Creating instance pages without knowing exactly what will be displayed on them.
• Demonstration of the initial use cases
• Question and answer session
3. Why?
• Prototype: How do we assemble the data mine and refine the authoring tools? How do we expand this to the research community?
• How do we expand ownership of the data to research professionals?
• How do we build systems in a way that research professionals can author and link the data?
• How do we publish these new relationships to the wider research community?
4. What is the Allen Institute for Brain Science?
• Launched in 2003 with seed funding from founder and philanthropist Paul G. Allen.
• Serving the scientific community is at the center of our mission to accelerate progress toward understanding the brain and neurological systems.
• The Allen Institute's multidisciplinary staff includes neuroscientists, molecular biologists, informaticists, and engineers.
"The Allen Institute for Brain Science is an independent 501(c)(3) nonprofit medical research organization dedicated to accelerating the understanding of how the human brain works."
5. Human Brain Map
• Open, public online access
• A detailed, interactive three-dimensional anatomic atlas of the "normal" human brain
• Data from multiple human brains
• Genomic analysis of every brain structure, providing a quantitative inventory of which genes are turned on where
• High-resolution atlases of key brain structures, pinpointing where selected genes are expressed down to the cellular level
• Navigation and analysis tools for accessing and mining the data
6. Biological Linked Data Map
• Open, public online access
• Data from multiple RDF data stores
• Complete import pipeline using the LDIF framework
• Outlines of each imported instance, embedding inline wiki properties and providing views of imported properties from original RDF datasets
• Charting tools that 'pivot' SPARQL queries, providing several views of each query
• Navigation and composition tools for accessing and mining the data
7. Where did we get the data?
• KEGG: Kyoto Encyclopedia of Genes and Genomes
  "KEGG GENES is a collection of gene catalogs for all complete genomes generated from publicly available resources, mostly NCBI RefSeq."
• Diseasome
  "The Diseasome website is a disease/disorder relationships explorer and a sample of an innovative map-oriented scientific work. Built by a team of researchers and engineers, it uses the Human Disease Network data set."
• DrugBank
  "The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information."
• SIDER
  "SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts."
8. New ontology map for import
• Genes (DrugBank: 4,772, KEGG: 2,482, SIDER: 924)
• Diseases (Diseasome: 4,213, KEGG: 459)
• Effects
• Drugs (DrugBank: 4,553, Diseasome: 3,919, KEGG: 9,841, SIDER: 1,737)
• Pathways (KEGG: 28,442)
We chose to intentionally simplify the ontology due to disagreements between researchers about entity relationships and subclasses.
9. Importing and mapping the Linked Data
• R2R
  • 32,900 instances were converted to the wiki ontology.
  • 583,746 properties mapped.
  • Pathways were ignored for the wiki ontology import, but are available within the triple store KEGG Pathway graph.
• SIEVE
  • 20,849 instances available in the wiki ontology after SILK normalization.
  • Instance merging affected drugs, genes, and diseases across datasets.
• Triple Store SPARQL Update
[Pipeline diagram: download to networked/local storage; the R2R mapping engine maps entities to the new ontology for import to the wiki; the Sieve mapping engine normalizes entities across data sources; the resulting data lands in the triple store, available via SPARQL queries.]
10. Importing and mapping the Linked Data
(Repeat of the slide 9 pipeline, with the LDIF processing steps highlighted in red.)
12. Linked Data challenges
• Data sources that overlap in content may:
  • Use a wide range of different RDF vocabularies
  • Use different identifiers for the same real-world entity
  • Provide conflicting values for the same properties
• Implications
  • Queries become hand-crafted for a specific RDF data set, no different than using a proprietary API.
  • Individual, improvised and manual merging techniques for data sets.
• Integrating public datasets with internal databases poses the same problems
13. Linked Data Integration Framework
• LDIF normalizes the Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance.
1. Collect data: managed download and update
2. Translate data into a single, target vocabulary
3. Resolve identifier aliases into local target URIs
4. Cleanse data and resolve conflicting values
5. Output to local file system or triple store
14. LDIF Pipeline: 1. Collect data
Supported data formats:
• RDF files (multiple formats)
• SPARQL endpoints
• Crawling Linked Data
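These input channels map directly onto standard RDF tooling. As a rough, minimal sketch of the collect step in Python (LDIF itself drives imports from declarative job configurations; the file name and endpoint URL below are placeholders):

```python
# Rough sketch of the "collect data" step: gather triples from an RDF dump
# and a SPARQL endpoint into one local graph. File name and endpoint URL
# are placeholders; LDIF performs this step via its own scheduled imports.
from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, RDFXML

collected = Graph()

# RDF dump files (any serialization rdflib understands)
collected.parse("drugbank_dump.nt", format="nt")

# A SPARQL endpoint, harvested with a CONSTRUCT query
endpoint = SPARQLWrapper("http://example.org/sparql")
endpoint.setQuery("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10000")
endpoint.setReturnFormat(RDFXML)
collected += endpoint.query().convert()  # convert() yields an rdflib Graph here

collected.serialize("collected.nt", format="nt")
print(f"Collected {len(collected)} triples")
```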
15. LDIF Pipeline: 2. Translate data
Sources use a wide range of different RDF vocabularies.
[Diagram: dbpedia-owl:City, schema:Place, location:City and fb:location.citytown feed into R2R, which translates them into the single target vocabulary.]
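R2R mappings are declarative, but the effect of a simple class mapping can be mimicked with a SPARQL CONSTRUCT over the collected graph. A minimal sketch, assuming the hypothetical neuro: namespace stands in for the target wiki ontology:

```python
# Mimic a simple R2R-style class mapping with SPARQL CONSTRUCT: several
# source classes are rewritten to one class in the target vocabulary.
# dbo: is the dbpedia-owl namespace from the slide; neuro: is a
# hypothetical stand-in for the project's wiki ontology.
from rdflib import Graph

source = Graph()
source.parse("collected.nt", format="nt")

result = source.query("""
    PREFIX dbo:    <http://dbpedia.org/ontology/>
    PREFIX schema: <http://schema.org/>
    PREFIX neuro:  <http://example.org/neuro/>
    CONSTRUCT { ?s a neuro:City }
    WHERE {
        { ?s a dbo:City } UNION { ?s a schema:Place }
    }
""")

result.graph.serialize("translated.nt", format="nt")
```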
16. LDIF Pipeline: 3. Resolve identities
Sources use different identifiers for the same entity.
[Diagram: London, England; London, MA, USA; London, TN, USA; London, TX, USA; and London feed into SILK, which resolves London = London, England.]
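SILK specifies linkage rules declaratively and scales them across datasets; as a toy illustration of what the resolve step produces, here is label-based matching in plain Python that emits owl:sameAs links (the similarity threshold and input file are arbitrary):

```python
# Toy identity resolution: mark resources with near-identical labels as the
# same entity via owl:sameAs. SILK does this declaratively, with blocking
# and richer comparison metrics; the threshold and file names are arbitrary.
from difflib import SequenceMatcher
from rdflib import Graph
from rdflib.namespace import OWL, RDFS

g = Graph()
g.parse("translated.nt", format="nt")

labelled = [(s, str(o).lower()) for s, o in g.subject_objects(RDFS.label)]

links = Graph()
for i, (s1, l1) in enumerate(labelled):
    for s2, l2 in labelled[i + 1:]:
        if s1 != s2 and SequenceMatcher(None, l1, l2).ratio() > 0.95:
            links.add((s1, OWL.sameAs, s2))

links.serialize("sameas_links.nt", format="nt")
```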
17. LDIF Pipeline: 4. Cleanse data
Sources provide different values for the same property.
[Diagram: "London, England has a population of 8.174M people" and "London, England has a population of 9.2M people" feed into SILK, which outputs rdfs:population: 8.174M.]
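Conflicting values are resolved by a fusion policy. Below is a minimal sketch of one such policy ("prefer the more trusted source") with made-up source rankings; in LDIF these policies are expressed declaratively, per property, in the Sieve component:

```python
# Minimal "prefer the more trusted source" fusion policy for one property.
# The source ranking is invented for illustration; LDIF's Sieve component
# expresses policies like this as declarative configuration per property.
TRUST = {"dbpedia": 3, "freebase": 2, "geonames": 1}

def fuse(candidates):
    """candidates: list of (source, value) pairs for a single entity/property."""
    return max(candidates, key=lambda c: TRUST.get(c[0], 0))[1]

population = [("freebase", 9_200_000), ("dbpedia", 8_174_000)]
print(fuse(population))  # keeps 8174000, the value from the higher-ranked source
```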
18. LDIF Pipeline: 5. Output data
Supported output formats:
• N-Quads
• N-Triples
• SPARQL Update Stream
Provenance tracking using Named Graphs
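The named-graph provenance idea can be pictured with rdflib's Dataset: each source's triples go into their own graph, and N-Quads output keeps the graph name as the fourth term. The graph URIs and example triples below are placeholders:

```python
# Provenance via named graphs: triples from each source are stored in a
# separate graph, so N-Quads output records where every statement came from.
# URIs and example triples are placeholders.
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")
ds = Dataset()

drugbank = ds.graph(URIRef("http://example.org/graphs/drugbank"))
drugbank.add((EX.Propofol, RDFS.label, Literal("Propofol")))

sider = ds.graph(URIRef("http://example.org/graphs/sider"))
sider.add((EX.Propofol, EX.sideEffect, EX.Hypotension))

print(ds.serialize(format="nquads"))
```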
23. Semantic MediaWiki
Semantic MediaWiki is a full-fledged framework, in conjunction with many spinoff extensions, that can turn a wiki into a powerful and flexible knowledge management system. All data created within SMW can easily be published via the Semantic Web, allowing other systems to use this data seamlessly.
24. Four initial templates for each instance by category
1. Custom infobox within outline template
   • Visible inline properties
2. Outline template providing instance information
3. Widget template displaying dynamic charts or third party services
   • Donut charts and AIBS gene feed
4. Broad table SPARQL queries showing instance relationships
5. Hidden inline properties for other extensions
25. Creating instance wiki pages
• The Triple Store now contained tens of thousands of recognized category instances. Creating the pages requires a bot:
1. Fetch the RDF dumps from an active D2R server
2. Use regex to fetch the rdf:label property that was mapped by R2R as an instance name
3. Open a category-specific text file of wiki markup (a page of template includes)
4. Contact Neurowiki and request a new page from the list of names with the category content
[Diagram: 1.0 RDF data download creates the list of page names; 2.0 a sanitize script creates a CSV of category page names; 3.0 the wiki-markup text for the page instance is read; 4.0 the MediaWiki page is created through the MediaWiki Gateway rb framework's REST interface, producing the Neurowiki instance page.]
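The deck's bot drives page creation through the Ruby mediawiki-gateway library; the same loop can be sketched in Python against the standard MediaWiki edit API using mwclient. The host, credentials, and template names below are placeholders:

```python
# Sketch of the page-creation bot: for every name in a category's page list,
# create a wiki page that just transcludes that category's template stack.
# The wiki host, credentials, and template names are placeholders; the
# original project used Ruby's mediawiki-gateway for this step.
import mwclient

site = mwclient.Site("neurowiki.example.org", path="/w/")
site.login("BotAccount", "bot-password")

drug_markup = "{{Drug infobox}}\n{{Drug outline}}\n{{Drug widgets}}\n{{Drug relations}}"

with open("drug_page_names.txt") as names:
    for name in (line.strip() for line in names if line.strip()):
        page = site.pages[name]
        if not page.exists:
            page.save(drug_markup, summary="Bot: create drug instance page")
```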
28. How are base entities like Calcium represented?
1. The wiki page and corresponding template components are rendered.
2. Relations are pulled from the normalized data store of linked data.
3. The JavaScript components are populated via a data feed.
[Diagram: drug search → 1.0 wiki page (aggregate page of components) → 2.0 Calcium relations from the Neurobase data stores → 3.0 selected widget for display.]
29. How are base entities like Calcium represented?
• Because so many organisms contain calcium, the mappings to affected species were never created, to conserve space in the data store.
[Chart: Drug and Disease Class Ratios of Calcium. Inner circle: drugs by affected species; outer circle: disease ratios by class.]
30. What are the dangers of Propofol?
1. Propofol DrugBank relations are rendered in corresponding JavaScript components.
2. The Diseasome disease relations show classes of illness Propofol affects.
3. An aggregate of SIDER side effects is rendered in relation to Propofol and disease classes.
[Diagram: drug search → Neurobase data stores → 1.0 Propofol relations → 2.0 aggregate components with Propofol disease relations → 3.0 Propofol side effects.]
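Widgets like these are ultimately fed by SPARQL queries against the normalized store. A hedged sketch of such a query follows; the endpoint URL and the neuro: predicates are hypothetical, since the prototype's actual vocabulary is not shown on the slides:

```python
# Sketch of the kind of query a side-effect widget might run for Propofol.
# Endpoint URL and neuro: predicates are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://neurobase.example.org/sparql")
sparql.setQuery("""
    PREFIX neuro: <http://example.org/neuro/>
    SELECT ?effectLabel WHERE {
        ?drug   neuro:label      "Propofol" ;
                neuro:sideEffect ?effect .
        ?effect neuro:label      ?effectLabel .
    }
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["effectLabel"]["value"])
```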
34. Which drugs are used in Chemotherapy?
1. Diseasome disease relations normalized by LDIF.
2. DrugBank and AIBS relations to genes affected by both the disease and drug.
3. SIDER side effects related to the gene, disease, and drug.
4. DrugBank drug glossary definition specifying various forms of Cancer treatment.
[Diagram: disease search → Neurobase data stores → 1.0 disease relations → aggregate components → 2.0 gene-drug relations → 3.0 drug side effects → 4.0 drug info box.]
36. Which drugs are used in Chemotherapy?
Drug and Disease Class Ratios of AR
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
37. Which drugs are used in Chemotherapy?
Drug and Side Effect Ratios of AR
Inner Circle: Drugs by Affected Species, Outer Circle: Side Effect Ratios of Drugs
39. Which drugs are used in Chemotherapy?
Drug and Disease Class Ratios of Nilutamide
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
40. Which drugs are used in Chemotherapy?
Drug and Disease Class Ratios of Bicalutamide
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
42. Expanding the Prototype
• Semantic MediaWiki query construction
  • Could this be done in SPARQL?
• Authoring SILK / R2R mappings for the LDIF pipeline
  • Extremely difficult, and the editors are not intuitive
• How do you get data owners to fuse the sets and create the data store themselves?
  • Tested with the Aura Wiki prototype
• Expand authoring provenance
  • How do we ensure new data / links come from an authoritative source?
43. Today we discussed…
• The Allen Institute for Brain Science (AIBS)
• Four similar research data sets to interlink with the AIBS data set
• An import pipeline named the Linked Data Integration Framework (LDIF)
• The interlinking process for 5 concurrent research data sets (AIBS, DrugBank, Diseasome, KEGG, SIDER)
• A prototype neurobiology authoring platform
• Creating instance pages to display the new connections
• Demonstration of the initial use cases
Hello, my name is William Smith, and today we will be talking about a project near and dear to my heart. I served as project manager for a prototype application, worked closely with two German teams, and we were the first customer for several of the tools used to assemble this application. I was also the chief integration point into Vulcan, so I am well aware of the technologies, code bases, and data sets that went into assembling this project…
So what are we discussing today? First and foremost, this was a project for an internal organization at Vulcan involved in mapping the human brain. This, of course, generates petabytes of data and millions of triples worth of gene mappings, but we took a smaller slice of a couple hundred thousand genes for the initial prototype. There were also several parallel research programs generating data in a format we could use, and a conference of industry professionals was held to find the interlinking pieces of these datasets. Finally, I'm going to walk through the data pipeline, the application itself, and a set of our original use cases.
Why? Well, a core problem in neurobiology, and in most sciences for that matter, is the inability to share and author sets of data across projects by industry professionals. This leaves an odd gap where people with computer science degrees are linking data they don't fully understand, while the people who understand the data don't have the ability to add the interlinks for greater vision into the data. With this problem known, our original prototype soon expanded into how we get these tools into the hands of the research community, and that in itself created three core questions: ownership, authorship, and publishing provenance of the newly linked data.
The organization that chartered this project, and provided the original data sets, is the Allen Institute for Brain Science, or AIBS. When you hear me accidentally say AIBS, I'm referring to this organization. It was launched in 2003 by Paul G. Allen and has the explicit focus of mapping the human brain to accelerate our understanding of the brain and neurological systems. Furthermore, the institute is a 501(c)(3) nonprofit medical research organization employing hundreds of neuroscientists, molecular biologists, informaticists, and engineers within the Seattle area.
And this is the Institute's core product… or several screenshots of the core product. Here we have gene heat maps, some location data, where it all sits location-wise in the human brain. As odd as those screen captures are, they are accessed by thousands of researchers daily, and this is considered a major success. It's open; the public can go to this site right now and browse the catalog. There are currently three human brains fully mapped with a fourth in progress. Each of these donors has generated genomic analysis of brain structure and has created a thorough catalog of genes with respect to location. While the captions are small, they are part of a much larger suite of atlas navigation tools with several components, e.g. heat maps pinpointing genes expressed down to the cellular level. And most importantly, for our purposes, they generate terabytes of data with industry-wide IDs we can link to other sources!
And here's our prototype in screenshots. No page is hand-typed, no graph is hand-entered: four static templates pulling data from our normalized mine create all these pretty pictures and full pages of text. There are over 30 thousand of these pages. We will be discussing the first two points in depth, RDF and the LDIF pipeline. The charting tools use SPARQL, which we will not be discussing in depth; however, I have a hidden slide with the details should somebody be really malicious and want to ask about SPARQL queries. Finally, our navigation closely resembles a common MediaWiki installation, which everybody who has been on the internet in the last ten years is familiar with. Editing, on the other hand, is very different, and currently only bots create and maintain the pages.
Which brings us to these parallel tracks of research data I keep mentioning. To choose these sets, we had a conference of industry researchers and data professionals go through the hundreds of biology mines looking for useful projects that closely relate to genes found in the human brain. The four prototype sets chosen were <read slides>.
Our original cross-section of data found these connections. It isn't the full dump, but with roughly 15 thousand gene connections, plenty of pages produced relevant links and were filled with interesting data points. <read numbers> And to the right we have our simplified ontology. Looks incredible, right? Hey, they can't all be winners, and don't blame me, blame Protégé. This was generated with basic one-to-one relations and domain-range logic where applicable. <joke about line colors> The simplification exists in part because nobody who does anything in neuroscience agrees with anyone else who does the same thing. We could, however, get them to agree that, in some gray-area way, these things are related at the domain-range level… so that generates that, and it looks far worse if I try to spread the boxes out any other way.
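To make that domain-range logic concrete, here is a minimal sketch in Python with rdflib of how one such relation could be declared. The namespace, the treats property, and the class names are hypothetical illustrations, not the actual terms from the prototype's ontology.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

# Hypothetical namespace standing in for the prototype's ontology.
EX = Namespace("http://example.org/brainwiki/ontology#")

g = Graph()
g.bind("ex", EX)

# Two classes and one object property with basic domain-range logic,
# the kind of 1-1 relation the simplified Protege graph showed.
g.add((EX.Drug, RDF.type, OWL.Class))
g.add((EX.Disease, RDF.type, OWL.Class))

g.add((EX.treats, RDF.type, OWL.ObjectProperty))
g.add((EX.treats, RDFS.domain, EX.Drug))    # subjects of ex:treats are drugs
g.add((EX.treats, RDFS.range, EX.Disease))  # objects of ex:treats are diseases

print(g.serialize(format="turtle"))
```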
Which brings us to the pretty graph I hate… because it makes unifying things into that ugly Protégé graph look easy. It's not, but it does give a good overall view of what we were able to convert directly into the wiki: 32,900 instances turned directly into pages, with over 500 thousand properties across the set. Even more important, after the "same as" connections were made we had 20 thousand fully populated pages, and those are the pages with connections across the datasets. That brings up an important point: if I imported all of the gene data I would end up with a huge wiki by page count, but the better part of those pages would be nothing more than a page title and empty templates. Hence the importance of finding these connections and only tracking the useful data points, i.e. pages with more than a title. On the right we have the simplified process, which I will go into in more detail very soon. <read right graph>
And those parts that just turned red, <read red parts>, are the process we will be discussing in a section I like to call: Linked Data Integration Framework.
Created over the last 4 years.
Created by the Free University of Berlin, the same team that helped build the prototype.
We were their first customer.
Still active; the last update was in late 2013.
Two main components: R2R and SILK.
And this is why I don't like the oversimplification of that process chart. There are plenty of difficult computer science problems here, and none of them are cut and dried. Assuming we can find overlapping data sources, you then have to unify vocabularies, the predicate of the triple. Once this is done and you can agree on what the name of an entity is, you will still have data sets where the same entity goes by a range of names and IDs. Finally, once you've located the same entities, there is no guarantee the normalized vocabularies will be referencing the same value. Without the normalization pipeline, LDIF, this creates queries that are siloed to a specific data set, basically creating an API per source… and that's fine for companies like Facebook and Google but terrible for independent research. The last point is less of a problem for us, because we decided long ago this was a philanthropic prototype with 501(c)(3) data, but it is something to consider when working with, say, national security data.
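Here is a tiny illustration of those three problems, sketched with Python and rdflib. The namespaces, predicates, and identifiers are stand-ins I made up for illustration; the real datasets used their own URIs, and LDIF's R2R and SILK handle the mapping and linking at scale rather than hand-written triples like these.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDFS

# Hypothetical namespaces standing in for two of the source datasets.
AIBS = Namespace("http://example.org/aibs/")
DRUGBANK = Namespace("http://example.org/drugbank/")
TARGET = Namespace("http://example.org/brainwiki/vocab/")

g = Graph()

# Problem 1: different vocabularies. One source says "geneSymbol",
# the other says "hgncSymbol" for the same kind of statement.
g.add((AIBS["gene/1017"], AIBS.geneSymbol, Literal("CDK2")))
g.add((DRUGBANK["target/BE0000048"], DRUGBANK.hgncSymbol, Literal("CDK2")))

# Problem 2: different identifiers. Both URIs above denote the same gene,
# so after identity resolution we assert an owl:sameAs link between them.
g.add((AIBS["gene/1017"], OWL.sameAs, DRUGBANK["target/BE0000048"]))

# Problem 3: predicate-level mapping. Record that both local predicates
# mean the same thing as one normalized target predicate.
for local in (AIBS.geneSymbol, DRUGBANK.hgncSymbol):
    g.add((local, RDFS.subPropertyOf, TARGET.symbol))

print(g.serialize(format="turtle"))
```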
Lucky for us, as customer number one of the LDIF framework, we get to test all of the steps in normalization and either hope for the best or fix them ourselves! If this works right we will… <read steps>
And here's the LDIF architecture. All this stuff on the bottom represents the 5 data sets; the arrows between them don't really apply yet, because they didn't link up that well before LDIF, and everything flows into the pipeline. After processing and re-releasing, the arrows do apply, and we load it all into our own public triple store for use in the application.
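Once the fused data is in the triple store, the cross-dataset questions the silos could not answer become a single query. A minimal sketch with rdflib follows; the predicate names and the file name are my assumptions, not the prototype's actual schema or deployment, which queried the public store directly.

```python
from rdflib import Graph

# Load the fused output of the pipeline; in the real deployment this would
# be a query against the public triple store rather than a local file.
g = Graph()
g.parse("fused_output.ttl", format="turtle")  # hypothetical file name

# One query walks gene -> disease -> drug -> side effect across what used
# to be separate, siloed datasets.
query = """
PREFIX vocab: <http://example.org/brainwiki/vocab/>
SELECT ?gene ?disease ?drug ?effect WHERE {
    ?gene vocab:associatedWithDisease ?disease .
    ?drug vocab:treats                ?disease .
    ?drug vocab:hasSideEffect         ?effect .
}
LIMIT 10
"""

for gene, disease, drug, effect in g.query(query):
    print(gene, disease, drug, effect)
```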
And here’s your application.
A few notes on Pubby, which powered this first version:
Created 5 years ago.
Used by DBpedia.
Also built by the Free University of Berlin.
No search; you have to follow links.
Not a very modern viewing experience.
No expression of the data via links.
Less than helpful – FINE.
Well, I am in this business to please the consumer, and my consumer understands common web architectures, even if they don't know they do, so let's try an installation of Semantic MediaWiki. Invented roughly 5 years ago, it's a series of plugins that run on MediaWiki, which was created by the good folks who built Wikipedia! Millions of people see it every day while researching homework they don't feel like doing, while sloppily referencing college term papers, or, in my opinion, while using one of the most accurate and comprehensive encyclopedias humanity has produced to date. Even better, we can display the semantic properties of our normalized data inline! <show arrows> <can you expand> Of course I can.
I'm going to build you four base templates, one per category: Gene, Drug, Disease, and Side Effect. These templates will have the base information displaying our semantic properties. <run through wireframe>
This created a problem, namely: how do I create 30,000 pages without getting fired for entering data by hand over the course of two years? As it turns out, a lot of what you see on Wikipedia isn't actually input or maintained by humans. The gene pages all have very complex infoboxes tracking IDs, regions, and a variety of known properties mined from other sources. The pieces of code that do this mining and page creation are called wiki-bots. We wrote a wiki-bot for each page type to create our 30,000 pages, and this is the creation pipeline those bots utilized.
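For a rough sense of what one of those wiki-bots looks like, here is a minimal sketch using Pywikibot. The page title scheme, the infobox template name, and its fields are hypothetical stand-ins, since the actual bot code lived behind the Vulcan firewall; the Semantic MediaWiki annotations (inline properties such as [[located in::...]]) would sit inside the real templates.

```python
import pywikibot

# Hypothetical record pulled from the normalized triple store.
gene = {"symbol": "CDK2", "region": "Hippocampus", "entrez_id": "1017"}

# Connect to the target wiki; Pywikibot assumes the site and family
# are configured in user-config.py.
site = pywikibot.Site()

# Each page is just a call to a pre-built template; the template renders
# the infobox and exposes the semantic properties inline.
page = pywikibot.Page(site, f"Gene:{gene['symbol']}")
page.text = (
    "{{Gene infobox\n"
    f"| symbol = {gene['symbol']}\n"
    f"| region = {gene['region']}\n"
    f"| entrez_id = {gene['entrez_id']}\n"
    "}}\n"
)
page.save(summary="Bot: creating gene page from normalized dataset")
```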
I'll be running through three core use cases we used to test the project and explaining how the pages and graphs were generated. All of the graphs relating to the genes, diseases, drugs, and side effects in the next few slides are generated from the wiki. However, it's far easier to view the wiki when you have access behind the Vulcan firewall… so I had to rely on screenshots for this portion.
Calcium is the difficult use case: it occurs in all creatures and has lots of connections to other entities, but we don't want to create all of those pages.
The next one had its 15 minutes of fame 5 years ago: a powerful sedative used in anesthesiology. You should not use it as a sleep aid; it was listed as the cause of death of a popular musician.
Finally, we head over to DrugBank and search for an obscure drug page… Bicalutamide. It's an oral antiandrogen used in the treatment of prostate cancer that acts on the androgen receptor, thus validating our links across the data. It's an example of how a not-so-simple correlation of data can give researchers deeper vision by merging sets and presenting the interlinks.
Aura wiki: it was used to test crowd-sourcing of data authoring for a proto-AI.