This document provides an introduction to bio-ontologies and the semantic web. It discusses what ontologies are and how they are used in the bio domain through initiatives like the OBO Foundry. It introduces key semantic web technologies like RDF, IRIs, Turtle (TTL), SPARQL, and OWL that are used to represent ontologies and linked data. It provides examples of representing ontologies and data in the Turtle syntax and using semantic web standards like RDFS for defining classes, properties and basic inference.
This document provides an introduction to bio-ontologies and the semantic web. It discusses what ontologies are and how they are used in the bio domain through initiatives like the OBO Foundry. Key ontologies like the Gene Ontology are described. The document then introduces semantic web technologies like RDF, URIs, triples, and ontology languages like RDFS and OWL. It provides examples of representing data and metadata in these formats. Finally, it discusses storing and querying RDF data through SPARQL.
This document provides information about biological databases including:
- Different types of biological databases such as relational, object-oriented, hierarchical, and hybrid systems.
- Common uses of biological databases including annotation searches, homology searches, pattern searches, predictions, and comparisons.
- Examples of database entries in common formats like GenBank, EMBL, and SwissProt that show the layout and key fields.
This document provides an introduction to bio-ontologies and the semantic web. It discusses what ontologies are and how they are used in the bio domain through initiatives like the OBO Foundry. It introduces key semantic web technologies like RDF, URIs, Turtle syntax, and SPARQL query language. It provides examples of ontologies like the Gene Ontology and how ontologies can be represented and queried using these semantic web standards.
This document provides an overview of flat file databases and biological relational databases. It discusses flat file databases like RefSeq that store sequence data in plain text files. It describes common file formats like Genbank and EMBL. It also discusses the Trace Archive and how trace files are processed into consensus sequences using Phred and Phrap. Finally, it briefly introduces biological relational databases and references resources like Swiss-Prot and TrEMBL.
This document discusses the Biological Databases project being conducted by a group of students. The project involves using the video game Minecraft to visualize protein structures retrieved from the Protein Data Bank (PDB). Python scripts are used to import PDB data files and place blocks in Minecraft to represent atoms, with different block colors used to distinguish atom types. SPARQL queries are also employed to search the RDF version of the PDB for protein entries. The goal is to build 3D protein models inside Minecraft for educational and visualization purposes.
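The data-import step described above can be sketched roughly as follows. This is a minimal illustration, not the students' actual script: the ATOM column positions follow the published PDB format, but the block ids, the element-to-block mapping and the scale factor are made-up assumptions, and the actual Minecraft placement call (e.g. via the mcpi library) is omitted.

```python
# Illustrative sketch: parse ATOM records from a PDB file and map each
# atom's element to a block id. Block ids, mapping and SCALE are assumptions.

BLOCK_FOR_ELEMENT = {"C": 49, "N": 22, "O": 152, "S": 41}  # illustrative ids
SCALE = 2  # blocks per angstrom (assumption)

def parse_atom_line(line):
    """Extract (element, x, y, z) from one PDB ATOM record (fixed columns)."""
    x = float(line[30:38])          # columns 31-38
    y = float(line[38:46])          # columns 39-46
    z = float(line[46:54])          # columns 47-54
    element = line[76:78].strip()   # columns 77-78
    return element, x, y, z

def atom_blocks(pdb_lines):
    """Yield (block_id, bx, by, bz) placements for each ATOM record."""
    for line in pdb_lines:
        if line.startswith("ATOM"):
            element, x, y, z = parse_atom_line(line)
            block = BLOCK_FOR_ELEMENT.get(element, 1)  # default block
            yield block, int(x * SCALE), int(y * SCALE), int(z * SCALE)
```

A real script would feed each yielded placement to the game's set-block API.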
This document discusses biological databases and bioinformatics. It begins with an overview of bioinformatics as an interdisciplinary field combining biology, computer science, and information technology. It then discusses different types of biological databases, including those focused on sequences, pathways, protein structures, and gene expression. The document outlines some common uses of biological databases, including searching for annotations, identifying similar sequences through homology, searching for patterns, and making predictions. It also briefly discusses comparing data across databases. The summary provides a high-level overview of the key topics and uses of biological databases covered in the document.
The document discusses database searching algorithms like FASTA and BLAST. It explains the mathematical concepts behind BLAST, such as using Erdős-Rényi theory to model random sequence alignments and calculate the expected length of the longest random match. It also describes the Karlin-Altschul equation used in BLAST to calculate the statistical significance of matches as the expected number of alignments (E), based on the size of the search space and the alignment score. The document provides details on the parameters and scoring approaches used in database searching algorithms.
... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
The document describes a machine learning approach used by Polbase to classify scientific papers as either related or unrelated to DNA polymerases. It discusses three approaches to defining a classification rule, including using text searches, subject matter experts, or statistical modeling. The proposed system uses a machine learning classifier with components like an XML data feed from PubMed, data management in a PostgreSQL database, and a modeling stage to classify papers. The goal is to automatically discover new relevant papers to expand Polbase's reference repository.
This document provides an overview of bioinformatics and biological databases. It discusses how bioinformatics draws from fields like biology, computer science, statistics, and machine learning. Biological databases are important resources for bioinformatics that can be searched and analyzed to answer questions, find similar sequences, locate patterns, and make predictions. The document also outlines common uses of biological databases, such as annotation searches, homology searches, pattern searches, and predictive analyses.
What is the fuss about triple stores? Will triple stores eventually replace relational databases? This talk looks at the big picture, explains the technology and tries to look at the road ahead.
Bernhard Haslhofer is a postdoc researcher at Cornell University studying linked data, user-contributed data, and data interoperability. He discusses Linked (Open) Data, which uses URIs and RDF to publish and link structured data on the web. The key principles are using URIs to identify things, providing useful information about those URIs when dereferenced, and including links to other URIs. Enabling technologies include URIs, RDF, RDFS/OWL for vocabularies, SPARQL for querying, and best practices for publishing vocabularies and data. Useful tools are also presented.
The document provides an overview of using Python for bioinformatics, discussing what Python is, why it is useful for bioinformatics, how to set up Python in integrated development environments like Eclipse with PyDev, how to share code using Git and GitHub, and includes examples of Hello World and bioinformatics programs in Python. It introduces Python and argues it is well-suited for bioinformatics due to its extensive standard libraries, ease of use, and wide adoption in science. The document demonstrates how to install Python, set up an IDE, create and run simple Python programs, and use version control with Git and GitHub to collaborate on projects.
This document provides an overview of the Resource Description Framework (RDF). It begins with background information on RDF including URIs, URLs, IRIs and QNames. It then describes the RDF data model, noting that RDF is a schema-less data model featuring unambiguous identifiers and named relations between pairs of resources. It also explains that RDF graphs are sets of triples consisting of a subject, predicate and object. The document also covers RDF syntax using Turtle and literals, as well as modeling with RDF. It concludes with a brief overview of common RDF tools including Jena.
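The triple model that deck describes can be illustrated with a toy example. This is a plain-Python sketch, not real RDF tooling: the graph is just a set of (subject, predicate, object) tuples with made-up identifiers, and an actual application would use a library such as rdflib, or the Jena toolkit the deck mentions.

```python
# Toy illustration of the RDF data model: a graph is a set of
# (subject, predicate, object) triples. Identifiers here are invented.

graph = {
    ("ex:GO_0008270", "rdfs:label", "zinc ion binding"),
    ("ex:P12345", "ex:hasFunction", "ex:GO_0008270"),
}

def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard,
    much like a variable in a SPARQL basic graph pattern."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]
```

For example, `match(graph, p="rdfs:label")` finds every labelled resource, regardless of subject.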
The document discusses various database concepts including normalization, which is used to design optimal relation schemas by removing redundant data. It also covers transaction processing, which involves executing logical database operations as transactions to maintain data integrity. Database systems use techniques like logging and concurrency control to prevent transaction anomalies and ensure failures can be recovered from.
The document discusses views and materialized views in data warehousing and decision support systems. It covers three main points:
1) OLAP queries typically involve aggregate queries, so precomputation is essential for fast response times. Materialized views allow precomputing aggregates across multiple dimensions.
2) Warehouses can be thought of as collections of asynchronously replicated tables and periodically maintained views, renewing interest in efficient view maintenance.
3) Materialized views store the results of views in the database for fast access like a cache, but they require maintenance as underlying tables change. Incremental maintenance algorithms are ideal to efficiently update materialized views.
Spark Meetup London: Share and analyse genomic data at scale with Spark, ADAM... (Andy Petrella)
Genomics and health data are among today's hot topics, demanding heavy computation and, in particular, machine learning. This work supports science with significant societal impact and helps deliver better outcomes, which is why Apache Spark and its ADAM library are a must-have.
This talk will be twofold.
First, we'll show how Apache Spark, MLlib and ADAM can be plugged together to extract information from even very large and wide genomics datasets. Everything will be packed into examples from the Spark Notebook, showing how bio-scientists can work interactively with such a system.
Second, we'll explain how these methodologies, and even the datasets themselves, can be shared at very large scale between remote entities such as hospitals or laboratories, using microservices that leverage Apache Spark, ADAM, Play Framework 2, Avro and Tachyon.
ADAM is a scalable genome analysis platform that uses a column-oriented file format called Parquet to efficiently store and access large genomic datasets across distributed systems. It provides APIs and tools for transforming, analyzing, and querying genomic data in a scalable way using Apache Spark. Some key goals of ADAM include enabling efficient processing of genomes using clusters/clouds, providing a data format for parallel data access, and enhancing data semantics to allow more flexible access patterns.
OpenFlyData aims to integrate biological data sources using Semantic Web technologies. It creates reusable data sources and query services by mapping existing gene expression databases like FlyBase and BDGP to RDF. This allows for cross-database searches using SPARQL. Performance challenges include loading large datasets and case-insensitive text searches, but the system provides benefits like a uniform data model and ability to ask unanticipated queries across integrated sources.
The document describes cross-language information retrieval (CLIR) and summarizes an English-Chinese information retrieval system called ECIRS. ECIRS allows users to input queries in English and retrieves relevant Chinese documents through translation. It includes dictionaries, document indexes, and a Chinese search engine. Screenshots show the user interface where a user can enter an English keyword, view its Chinese translation, and see search results in Chinese.
Presentation from Strata-Hadoop 2015 (http://strataconf.com/big-data-conference-ny-2015/public/schedule/speaker/197575) -- a brief introduction to genomics followed by an overview of approaches to bioinformatics coding using Spark. Pretty high-level.
The document discusses scaling web data at low cost. It begins by presenting Javier D. Fernández and providing context about his work in semantic web, open data, big data management, and databases. It then discusses techniques for compressing and querying large RDF datasets at low cost using binary RDF formats like HDT. Examples of applications using these techniques include compressing and sharing datasets, fast SPARQL querying, and embedding systems. It also discusses efforts to enable web-scale querying through projects like LOD-a-lot that integrate billions of triples for federated querying.
This document discusses cross-lingual information retrieval. It presents approaches for translating queries from other languages to the document language, including using online machine translation systems and developing a statistical machine translation system. It describes experiments on reranking translations to select the one most effective for retrieval and on adapting the reranking model to new languages. Results show the reranking approach improves over baselines and online translation systems. The document also explores document translation and query expansion techniques.
This document discusses verifying the integrity constraints of the Portuguese WordNet (OpenWordnet-PT) against the ontology for encoding wordnets. It was the first attempt to check correctness and improve the linguistic data by correcting errors found. Various types of errors were discovered, including datatype errors, domain and range errors, and structural errors. Explanations provided by reasoning tools helped identify and fix issues, improving the overall quality and accuracy of the OpenWordnet-PT resource.
This document discusses cross-language information retrieval (CLIR). It presents the goals of allowing users to query for domain-specific information in their native language and presenting relevant search results in the target language. It describes the key components of CLIR including bilingual corpus extraction from multiple sources, corpus indexing, querying and string matching. Preliminary evaluation results of sample queries are provided, along with conclusions that machine translation based CLIR is often more useful than the proposed method and that future work could focus on automated evaluation and fuzzy matching.
This document discusses cross-language information retrieval (CLIR). It defines CLIR as retrieving information written in a language different from the user's query language. It describes approaches to CLIR such as dictionary-based query translation and pseudo-relevance feedback. Dictionary-based query translation uses bilingual dictionaries but requires disambiguation due to ambiguity. Pseudo-relevance feedback assumes top documents are relevant and selects terms from them to expand the query. The document also discusses using parallel corpora to estimate cross-lingual relevance models and evaluate CLIR using conferences like TREC and CLEF.
Slides presented at the Spark Summit East 2015 (http://spark-summit.org/east). Video should be available through their site, at some point in the future.
(Some of these slides were adapted from an earlier talk "Why is Bioinformatics a Good Fit for Spark?", given to a Spark meetup audience.)
This document provides an introduction to relational database management systems (RDBMS) through a series of slides. It covers topics such as installing MySQL, connecting to databases, using SQL commands to retrieve and manipulate data, and designing databases. The slides introduce fundamental RDBMS concepts like tables, rows, columns, keys, and relationships. It also demonstrates how to use the MySQL command line interface to issue queries and explore database structure. Examples are provided for common SQL statements like SELECT, CREATE, INSERT and more.
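The common SQL statements that deck covers can be tried without a MySQL install. Here is a minimal sketch using Python's built-in sqlite3 module in place of MySQL; the table and column names are purely illustrative.

```python
# Hedged sketch of basic SQL (CREATE, INSERT, SELECT) via sqlite3,
# standing in for the MySQL command line. Schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define a table with a primary key
cur.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT, chrom TEXT)")

# INSERT: add rows with parameter placeholders
cur.executemany("INSERT INTO gene (symbol, chrom) VALUES (?, ?)",
                [("TP53", "17"), ("BRCA1", "17"), ("CFTR", "7")])

# SELECT: retrieve rows matching a condition
cur.execute("SELECT symbol FROM gene WHERE chrom = ? ORDER BY symbol", ("17",))
rows = [r[0] for r in cur.fetchall()]
```

The same statements work essentially unchanged at a MySQL prompt, apart from the `?` placeholder style.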
This document provides an overview of the Lab for Bioinformatics and Computational Genomics at a university. It notes that the lab has over 100 people from diverse backgrounds, including engineers, scientists, technicians, geneticists and clinicians. The lab's work involves hardware/software engineering, mathematics, molecular biology and the analysis of biological data through computing. Bioinformatics is defined as the application of information technology to biological data, covering tasks like sequence analysis, molecular modeling, phylogeny analysis, medical applications and more. The document then discusses some of the promises and applications of genomics and bioinformatics in fields like medicine, agriculture and animal health.
This document provides an overview of the PEAR DB abstraction layer. It allows for portable database programming in PHP by providing a common API that works across different database backends like MySQL, PostgreSQL, Oracle, etc. It handles tasks like prepared statements, transactions, error handling, and outputting query results in a standardized way. PEAR DB aims to simplify database programming and make applications less dependent on the underlying database system.
This document provides an overview of biological databases and SQL. It discusses different data levels in biological research like primary data, derived data, and interpreted data. It also summarizes some popular biological databases like Ensembl, ArrayExpress, and PharmGKB and whether they support direct SQL querying. The document then provides definitions for key database concepts like database, table, record, and query. It also describes different data types in SQL like numeric, string, date/time types and large object types. It discusses keys, integrity rules, and referential integrity in database design.
The document announces a conference on mHealth to be held March 18-20 in Brussels. The conference will address challenges in 6 domains: 1) mHealth performance metrics, 2) bringing healthcare home, 3) delighting chronic patients, 4) tackling over/under consumption of therapy, 5) keeping Belgium's competitive advantage in clinical trials, and 6) personalized prevention. Several companies and solutions will be discussed that focus on bringing healthcare home, delighting patients, avoiding over/under consumption of therapy, and personalized prevention. Contact information is provided for further details on the conference.
This document provides an overview of Python for bioinformatics. It discusses what Python is, why it is useful for bioinformatics, and how to get started with Python. It covers installing Python on the Athena system, using IDEs like Eclipse and PyDev, code sharing with Git and GitHub, basic Python concepts like strings, control structures, and data types like lists and dictionaries. It also provides examples of bioinformatics tasks that can be done in Python like calculating Pi using random numbers.
- Dynamic programming is used to find the optimal alignment between two protein sequences by recursively computing sub-alignments and storing them in a lookup table.
- The example shows calculating the alignment score between a zinc-finger core sequence and a viral sequence fragment by filling a table and tracking the cumulative scores.
- Filling the table from left to right and top to bottom allows reconstructing the highest scoring alignment between the two sequences.
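The table-filling procedure described above can be sketched in Python. This is a minimal global-alignment scorer in the Needleman-Wunsch style; the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not the zinc-finger scores used in the original example:

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences via dynamic programming."""
    n, m = len(a), len(b)
    # table[i][j] = best score for aligning a[:i] with b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap          # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        table[0][j] = j * gap          # gaps aligned against b[:j]
    # Fill left to right, top to bottom, reusing stored sub-alignments
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + s,   # (mis)match
                              table[i - 1][j] + gap,     # gap in b
                              table[i][j - 1] + gap)     # gap in a
    return table[n][m]

print(align_score("GATTACA", "GATTACA"))  # 7: seven matches
```

Tracing back from the bottom-right cell (not shown here) would reconstruct the highest-scoring alignment itself.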
Here are the steps to translate the DNA sequence to its reverse complement using a dictionary and string translation:
1. Define a dictionary that maps DNA nucleotides to their complement (A->T, C->G, etc.)
2. Use the maketrans() string method to generate a translation table
3. Use the translate() string method to translate the sequence
4. Reverse the translated sequence using slicing
Putting it together:
1. complements = {"A":"T", "T":"A", "C":"G", "G":"C"}
2. table = str.maketrans("ACGT", "TGCA")
3. translated = sequence.translate(table)
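The steps above combine into a short runnable function; the dictionary in step 1 documents the base pairings, while `str.maketrans` in step 2 builds the actual translation table:

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    # Steps 1-2: translation table mapping each base to its complement
    table = str.maketrans("ACGT", "TGCA")
    # Step 3: translate every base at once
    translated = sequence.translate(table)
    # Step 4: reverse with slicing
    return translated[::-1]

print(reverse_complement("ATGC"))  # GCAT
```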
This document provides information about bioinformatics resources including databases of nucleotide and protein sequences. It discusses flat file databases like GenBank that store sequence data in plain text files and relational databases that improve data organization. Examples of popular biological databases are described, such as GenBank, EMBL, and DDBJ for nucleotide sequences and Swiss-Prot and TrEMBL for protein sequences. The document also covers sequence file formats, web tools for querying databases, and trace files used in sequence assembly.
The document discusses various topics related to drug discovery through bioinformatics and computational approaches. It begins by discussing comparative genomics and using knowledge about model organisms to identify similar biological areas and pathways in other species. It also discusses topics like high-throughput screening of large libraries, the definitions of targets, hits and leads in drug discovery, and approaches like using RNAi and phenotypic screening in model organisms. Finally, it discusses computational methods that can be used throughout the drug discovery process, including for target identification and validation, virtual screening, assessing drug-likeness of compounds, and describing compounds using structural and physicochemical descriptors.
This document provides an overview of sequence alignment and scoring matrices. It defines key terms like identity, homology, orthologous, and paralogous genes. It discusses different types of scoring matrices including unitary matrices that score matches as 1 and mismatches as 0, and transition/transversion matrices that account for the higher likelihood of transitional mutations in nucleic acids. The document emphasizes that scoring matrices represent underlying evolutionary models and influence sequence analysis outcomes.
This document provides an overview of biological databases and SQL. It discusses different types of data in biological research, including primary data and derived data. It lists several major biological databases and whether they support direct SQL querying. It also shows an example 3-tier model for biological databases. The rationale for learning SQL to query biological databases is described. The document then provides definitions and explanations of key SQL concepts like tables, records, queries, data types, keys, relationships, and normalization. It also covers creating tables, integrity constraints, authorization, and privileges in SQL.
Galaxy dna-seq-variant calling presentation and practical, Gent, April 2016 – Prof. Wim Van Criekinge
This document provides an overview of variant analysis from next-generation sequencing data. It begins with introductions to the CCA-Drylab@VUmc, TraIT, and Galaxy projects. The focus of the lecture is explained to be variant analysis from NGS data using interactive demos in Galaxy. Background is provided on Illumina sequencing technology and properties of sequencing reads. Key steps in variant analysis are outlined, including quality control and read mapping, variant calling and annotation using tools like FastQC, BWA, FreeBayes, and SnpEff. Formats for storing sequencing data and variants are also introduced, such as FASTQ, SAM/BAM, and VCF.
The document discusses various topics in bioinformatics including:
1) Control structures, lists, dictionaries, and regular expressions in Python.
2) Parsing Swiss-Prot files and extracting amino acid frequencies using Biopython.
3) Functions for working with biological sequences like transcription, translation, and translating between different genetic codes using the Biopython module.
The document discusses reading and writing files in Python. It provides examples of opening files for reading, writing, and appending. It demonstrates how to read an entire file, individual lines, and loop through lines. It also shows how to write strings to files and close files once writing is complete. Additional topics covered include a template for reading files line by line and examples of counting lines, words, and characters in a file.
This document provides an overview of phylogenetic methodologies. It defines key phylogenetic terms like clade, internal node, and outgroups. It discusses different species concepts and how phylogenetic trees illustrate evolutionary relationships. It also covers popular phylogenetic methodologies like distance methods, maximum parsimony, and maximum likelihood. Distance methods calculate pairwise distances and cluster sequences into trees. UPGMA averages these distances while neighbor joining finds the shortest branches. The document highlights the use of phylogenetic analysis across various fields.
This document provides an overview of GitHub as a hosted Git service and introduces some basic Python concepts including control structures, lists, dictionaries, regular expressions, and BioPython. It demonstrates how to install Biopython and parse sequence data from Swiss-Prot using Biopython modules. It also includes example questions for analyzing sequence data from Swiss-Prot.
This document discusses various topics relating to protein structure and bioinformatics. It begins with an overview of protein structure and why understanding protein structure is important. It then discusses the different levels of protein structure from primary to quaternary structure. Methods for determining protein structure like X-ray crystallography and NMR are mentioned. Databases for storing protein structures like the Protein Data Bank are also summarized. The document touches on topics like protein folding, domains, membrane protein topology, and secondary structure prediction methods.
This document discusses biological databases and bioinformatics. It begins by listing various related fields including biology, computer science, bioinformatics, statistics, and machine learning. It then describes different types of searches that can be performed in biological databases, including annotation searches, homology searches, pattern searches, and predictions. Finally, it mentions that databases can be used for comparisons, such as gene families and phylogenetic trees.
This query will not return any results as written. The pattern in the WHERE clause contains two triples, but the second triple has a syntax error: the property between ?x and ?email is missing. A valid property such as email would need to be specified:

SELECT ?name WHERE {
  ?x name ?name .
  ?x email ?email
}

This query selects and returns the ?name of every resource ?x that has both a name and an email property.
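In full SPARQL, properties must be written as IRIs or prefixed names rather than bare words. Assuming the data used the FOAF vocabulary (an assumption for illustration), the corrected query might read:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?email .
}
```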
W3C Tutorial on Semantic Web and Linked Data at WWW 2013 – Fabien Gandon
The document provides an introduction to Semantic Web and Linked Data. It discusses key concepts such as RDF, which represents data as subject-predicate-object triples that can be connected to form a graph. RDF has several syntaxes including XML, Turtle, and JSON. Properties in RDF triples can link to other resources or contain literal values. Types are identified with URIs and vocabularies are extensible. The goal of Linked Data is to publish structured data on the web and link it to other data to form a global data web.
The document discusses using RDFS and OWL reasoning to integrate heterogeneous linked data by addressing issues like terminology and naming heterogeneity. It presents an approach using a subset of OWL 2 RL rules to reason over a billion triple corpus in a scalable way, handling the TBox separately from the ABox to avoid quadratic inferences. It also describes augmenting the reasoning with annotations to track trustworthiness and using this to filter inferences, detect inconsistencies and perform a light repair of the data. Consolidation is discussed as rewriting URIs to canonical identifiers based on owl:sameAs relations. Performance results show the different techniques taking between 1-20 hours to run over the corpus distributed across 9 machines.
The document discusses semantic technologies including ontologies, RDF, RDFS and OWL. It provides examples of using these technologies to semantically annotate web pages and objects. Key concepts covered include using URIs to identify resources, semantic annotations with properties and values, and extending vocabularies with RDFS and OWL constructs like classes, properties, and restrictions. The goal is to enable more intelligent search by understanding relationships between resources.
The document discusses lessons learned in transforming metadata from XML formats to RDF. It describes how libraries and cultural heritage institutions are working to express existing metadata standards like MODS and PBCore in RDF to take advantage of capabilities like linked data. Challenges include mapping XML schemas to RDF ontologies and ensuring RDF can meet identified use cases. Examples are provided of institutions that have transformed metadata to RDF to share across systems or publish as linked open data.
This document discusses knowledge representation and management technologies for extended minds. It covers various aspects of knowledge representation including expressiveness versus computability and how the choice of representation limits what can be captured. Desired properties of knowledge representation systems include coverage, understandability, consistency, efficiency and ease of modification. The document then reviews historical attempts at knowledge representation and discusses current approaches like the semantic web, ontologies, topic maps and open source tools.
This document discusses knowledge representation and semantic web technologies for representing knowledge. It covers the history of knowledge representation from the 1970s to today, including expert systems, Cyc, computational linguistics, KR programming languages, XML, and the semantic web. It describes the semantic web approach of representing web content as machine-readable data using languages like RDF, OWL, and vocabularies. It also discusses open-source tools and services for publishing and working with semantic web data.
FHIR can be represented in RDF format. Resources are serialized as directed graphs using URIs, properties, and values. FHIR defines a metadata vocabulary for use in RDF, and a FHIR resource catalog provides the URIs for standard FHIR resources and properties. Shape expressions (ShEx) schemas validate FHIR RDF according to resource definitions. Together, these components allow FHIR data to be queried and manipulated using RDF techniques while maintaining compatibility with the JSON format. Tools exist for converting between FHIR JSON and RDF formats.
The document provides an overview of linked data fundamentals, including key concepts like URIs, RDF, ontologies, and the semantic web. It discusses aspects of linked data such as using HTTP URIs to identify resources, representing data as subject-predicate-object triples, and connecting related resources through links. It also covers RDF serialization formats, ontologies like RDFS and OWL, and notable linked open data sources.
The document provides an introduction to RDF (Resource Description Framework). It discusses that RDF is a framework for describing resources using statements with a subject, predicate, and object. RDF identifies resources with URIs and describes resources and their properties and property values. An example RDF document is provided that describes CDs with properties like artist, country, and price.
The International Federation of Library Associations and Institutions (IFLA) is responsible for the development and maintenance of International Standard Bibliographic Description (ISBD), UNIMARC, and the "Functional Requirements" family for bibliographic records (FRBR), authority data (FRAD), and subject authority data (FRSAD). ISBD underpins the MARC family of formats used by libraries world-wide for many millions of catalog records, while FRBR is a relatively new model optimized for users and the digital environment. These metadata models, schemas, and content rules are now being expressed in the Resource Description Framework language for use in the Semantic Web.
This webinar provides a general update on the work being undertaken. It describes the development of an Application Profile for ISBD to specify the sequence, repeatability, and mandatory status of its elements. It discusses issues involved in deriving linked data from legacy catalogue records based on monolithic and multi-part schemas following ISBD and FRBR, such as the duplication which arises from copy cataloging and FRBRization. The webinar provides practical examples of deriving high-quality linked data from the vast numbers of records created by libraries, and demonstrates how a shift of focus from records to linked-data triples can provide more efficient and effective user-centered resource discovery services.
This document provides an overview of semantic web technologies for publishing data. It introduces the semantic web and describes semantic web languages like RDF, RDF Schema, and OWL. These languages allow modeling data as graphs and defining ontologies to provide unambiguous meaning to information. The document discusses using these languages to publish structured data on the web in ways that enable semantic annotation, integration, and reasoning across interconnected data sources.
A system with a natural language interface that transforms a user's natural-language question into a SPARQL query.
Related papers can be found at https://sites.google.com/site/fadhlinams81/publication
This document discusses the Web Ontology Language (OWL). It begins by providing motivation for OWL, noting limitations of RDF and RDF Schema in areas like expressiveness. It then outlines the technical solution of OWL, including its design goals of being shareable, changing over time, ensuring interoperability, and balancing expressiveness with complexity. Finally, it introduces the three dialects of OWL - OWL Lite, OWL DL, and OWL Full - and their different levels of expressiveness and reasoning capabilities.
The document discusses leveraging library authority control and controlled vocabularies on the semantic web. It describes converting existing metadata like Library of Congress Subject Headings (LCSH) into semantic web standards like SKOS to make the data accessible and linkable on the web. This would allow libraries to publish and share authority and classification data using common web technologies, enabling new applications and discovery across systems.
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud – Ontotext
This webinar will break the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small scale projects. We will show you how to build Semantic Search & Analytics proof of concepts by using managed services in the Cloud.
This document discusses the Semantic Web and Linked Data. It provides an overview of key Semantic Web technologies like RDF, URIs, and SPARQL. It also describes several popular Linked Data datasets including DBpedia, Freebase, Geonames, and government open data. Finally, it discusses the Yahoo BOSS search API and WebScope data for building search applications.
Usage of Linked Data: Introduction and Application Scenarios – EUCLID project
This presentation introduces the main principles of Linked Data, the underlying technologies and background standards. It provides basic knowledge for how data can be published over the Web, how it can be queried, and what are the possible use cases and benefits. As an example, we use the development of a music portal (based on the MusicBrainz dataset), which facilitates access to a wide range of information and multimedia resources relating to music.
Bio ontologies and semantic technologies
1. Introduction to Bio Ontologies and The Semantic Web
M. Devisscher
Biological Databases
2. Overview
• Bio ontologies
• Semantic technologies
• Practical sessions:
– Protégé and a bio database
– DIY SPARQL endpoint
3. Introduction
• Ontologies: what are ontologies?
• Ontologies in the bio domain: OBO Foundry
• Ontologies in the semantic web
• OBO
• RDF, IRI, TTL, SPARQL, OWL
4. What is an ontology?
• Ontology = a specification of a conceptualization (Gruber 1993)
• In practice: controlled vocabularies
– Disambiguation (e.g. Bank, Running)
– Language/species independence
• Very useful in biology – complex hierarchies of terms
5. Ontologies in the bio Domain
• OBO Foundry – Open Biological and Biomedical Ontologies
• Common principles
• List of ontologies at http://www.obofoundry.org
• OBO is also a data format: .obo
6. SideTrack – The Gene Ontology
• The mother of bio-ontologies: the GO
– Oldest bio-ontology
– Many practical applications:
• Cross-species studies
• Term abundance studies
• GO is an OBO ontology
8. SideTrack – The Gene Ontology
• Relationships between terms:
– Subsumption: is_a
– Partonomic: part_of
• These relations are transitive
• Terms form a DAG (directed, acyclic graph)
• Some information can be inferred
14. Semantic Technologies
• W3C: a set of specifications
http://www.w3.org/standards/semanticweb/
• A mature toolset
– Dedicated data formats
– Storage
– Query language
15. Semantic Technologies
• Basic data element = a Triple
– A mini sentence
– Contains three Terms:
• Subject Predicate Object
16. Semantic Technologies
• Representation of triples
– Basic data format: RDF/XML
– All data expressed in RDF (Resource Description Framework)
– Several compatible syntaxes: TTL (Terse Triple Language) is the most human-readable
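As a sketch of what the TTL syntax looks like, here is a single triple with a prefix declaration; the b4x: namespace is taken from this deck's own later examples and is illustrative, not a real vocabulary:

```turtle
# One triple in Turtle (TTL): Subject Predicate Object, terminated by a dot.
# The b4x: namespace is illustrative, reused from the deck's own examples.
@prefix b4x: <http://bioinformatics.be/terms#> .

b4x:martijn b4x:has_favorite_beer b4x:karmeliet .
```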
22. IRI’s and Literals
• Terms can be either IRI’s, Literals or blank nodes
• IRI = Internationalized Resource Identifier
• Unique id – a virtual URI
– Example: http://bioinformatics.be/terms#martijn
– There is no requirement for resolving
– Now: Open Data initiatives: please do use resolvable URI’s – http://linkeddata.org
– Unique identifiers can be registered on http://identifiers.org
23. Introduction
• Literals: can be typed, allowed types from the XSD namespace:
– E.g. “This is a string example”^^xsd:string
– E.g. “5”^^xsd:integer
• IRI’s are used for entities and attributes
• Literals are used for attribute values that aren’t entities
30. Graphs
• Triples are building blocks of Graphs
• Combining sets of triples allows the construction of arbitrarily complex graphs
b4x:martijn b4x:has_favorite_beer b4x:karmeliet .
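A small sketch of how several triples combine into one graph, using the b4x: and foaf: namespaces from the deck's other examples (the individual facts are illustrative):

```turtle
@prefix b4x:  <http://bioinformatics.be/terms#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Three triples sharing terms form a connected graph:
b4x:martijn a foaf:Person .
b4x:martijn b4x:has_favorite_beer b4x:karmeliet .
b4x:karmeliet a b4x:Trappist .
```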
31. Add meaning!
• Reuse terms from existing, well-defined vocabularies – ontologies (foaf, dc, go, so)
• Describe new terms = Ontologies
• Contain
– A crisp human definition
– Some machine-readable facts
32. Metadata
• Ontologies are also described in RDF
– RDFS: RDF Schema
– OWL: Web Ontology Language
– Also expressed in RDF
• For clarity, file extension can be .rdfs or .owl
35. RDFS: Example
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
b4x:karmeliet a b4x:Trappist .
b4x:Beer a rdfs:Class .
b4x:Trappist a rdfs:Class .
b4x:Trappist rdfs:subClassOf b4x:Beer .
b4x:has_favorite_beer a rdf:Property ;
  rdfs:domain foaf:Person ;
  rdfs:range b4x:Beer .
b4x:Beer rdfs:subClassOf b4x:Drink .
36. Analogy
• RDF = database = data
• RDFS/OWL = schema = metadata
• Both are described in RDF, but have a different scope
37. Semantic Technologies
• Inference
– Enhance dataset using knowledge from metadata (e.g. rdfs, owl)
• Types of inference engines
– RDFS inference
• RDFS entailment regime
– OWL inference
• Under active research
• Engines exist for specific subsets of OWL (OWL-DL)
40. RDFS: Inference
b4x:kevin b4x:has_favorite_beer b4x:stella .
Inferred triples:
b4x:kevin a foaf:Person . [from domain]
b4x:stella a b4x:Beer . [from range]
b4x:stella a b4x:Drink . [from subClassOf]
41. DuckTyping
• Watch out with inference! Example: you want to express that people can have lengths
b4x:length a rdf:Property ;
  rdfs:domain foaf:Person ;
  rdfs:range xsd:integer .
42. DuckTyping
• Problem:
ex:VW_Transporter b4x:length “600”^^xsd:integer .
• Would infer that VW_Transporter is a Person!
• This is called DuckTyping:
If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck
43. Task
• Find a solution: express in rdfs that people can have lengths
44. Task
• Find a solution: express in rdfs that people can have lengths
b4x:havingLength a rdfs:Class .
b4x:length a rdf:Property ;
  rdfs:domain b4x:havingLength ;
  rdfs:range xsd:integer .
foaf:Person rdfs:subClassOf b4x:havingLength .
45. Storing RDF
• As an RDF file for download
• In a Triplestore
– Database optimised for storing triples
– Examples: BlazeGraph, Fuseki, Sesame
46. Semantic Technologies
• Querying over RDF data: SPARQL
• Cool features:
– Distributed querying = actual distribution of data and computing resources
– SPARQL/Update: modify data
• SPARQL endpoints: SPARQL over HTTP
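A sketch of what distributed querying can look like with a SPARQL 1.1 federated query; the remote endpoint URL below is a placeholder, not a real service:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?label WHERE {
  # matched against the local dataset
  ?class a rdfs:Class .
  # matched against a remote SPARQL endpoint (placeholder URL)
  SERVICE <http://example.org/sparql> {
    ?class rdfs:label ?label .
  }
}
```

SPARQL/Update uses the same syntax family, e.g. `INSERT DATA { … }` sent to an endpoint's update interface.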
47. SPARQL Query Syntax
• First example:
SELECT ?subject ?predicate ?object WHERE {
  ?subject ?predicate ?object .
}
(Generally not a good idea, as it will pull down the whole dataset)
The SELECT clause binds variables; the WHERE clause does the graph matching.
51. SPARQL Query Syntax
• Find all classes:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?label WHERE {
  ?class a rdfs:Class .
  ?class rdfs:label ?label .
}
(This will only retrieve classes that have a label)
52. SPARQL Query Syntax
• Find all classes:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?label WHERE {
  ?class a rdfs:Class .
  OPTIONAL {
    ?class rdfs:label ?label .
  }
}
53. SPARQL Query Syntax
• Find all classes that contain “duck” in the label:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?label WHERE {
  ?class a rdfs:Class .
  ?class rdfs:label ?label .
  FILTER( CONTAINS( str(?label), "duck" ) )
}
54. SPARQL Query Syntax
• Make it case-insensitive:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?label WHERE {
  ?class a rdfs:Class .
  ?class rdfs:label ?label .
  FILTER( CONTAINS( UCASE(str(?label)), "DUCK" ) )
}
55. SPARQL Query Syntax
• Search in a specific graph:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?label
FROM <http://example.org/animals>
WHERE {
  ?class a rdfs:Class .
  ?class rdfs:label ?label .
  FILTER( CONTAINS( UCASE(str(?label)), "DUCK" ) )
}
56. SPARQL Query Syntax
• Search in a specific graph:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?label WHERE {
  GRAPH <http://example.org/animals> {
    ?class a rdfs:Class .
    ?class rdfs:label ?label .
    FILTER( CONTAINS( UCASE(str(?label)), "DUCK" ) )
  }
}
57. SPARQL Query Syntax
• Can also search for graphs:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?g WHERE {
  GRAPH ?g {
    ?class a rdfs:Class .
    ?class rdfs:label ?label .
    FILTER( CONTAINS( UCASE(str(?label)), "DUCK" ) )
  }
}
58. Summary: Querying RDF data
RDF Data + RDFS/OWL → Inference Engine → Inferred RDF Data → SPARQL Endpoint
59. Take home Summary
• Basic data element = a Triple
– A mini sentence
– Contains three Terms: Subject Predicate Object
• Example:
<http://xmpl/entities#martijn> <http://xmpl/relations#has_favorite_beer> <http://xmpl/entities#karmeliet> .
64. Interoperability between OBO and Semantic Technologies
• Originated from two separate academic worlds
• Computing applications of OBO: mainly consistency checking and overrepresentation analysis
• Semantic Technologies: much broader toolset
• Interoperability?
– Direct offering in both formats
– Automated mapping
65. Where to find ontologies
• OBO Foundry
• Bioportal; NCBO
• Biogateway
• Bio2RDF
66. Where to find RDF data
• Google for SPARQL endpoint
• => e.g. EBI databases
• Non-biological: DBpedia
67. How about Tim Berners-Lee’s vision?
• We’re not there yet, but for bio data we’re getting quite close
– The explicitome
– Crowd sourcing
– Nanopublications
79. Running SPARQL
• From a web interface
• Using HTTP
– HTTP GET
– HTTP POST: for larger query strings
– Headers determine response type (JSON, XML, HTML)
http://…/sparql?default-graph-uri=<http://graphName>&query=URLENCODEDQUERYSTRING
92. WikiPathways
• Links pathways with genes, terms from Pathway, Cell line and Disease ontology, PubMed references
• Models individual Interactions
• Can be downloaded as RDF
• Has an experimental SPARQL endpoint
93. Exercise
• Define a query to find pathways linked to the TNFalpha gene
97. Exercise
• Try this, or another query
– Using web interface
– Using HTTP GET
• Define a simple describe
• Use a web tool to URLEncode the query
• Submit query as a URL parameter
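For the “simple describe” step, one minimal form looks like this; the IRI is a placeholder, so substitute a resource that actually exists in the endpoint you are querying:

```sparql
# DESCRIBE asks the endpoint for all triples it holds about a resource.
DESCRIBE <http://example.org/resource/TNF>
```

URL-encoded and passed as the query parameter, this becomes e.g. query=DESCRIBE%20%3Chttp%3A%2F%2Fexample.org%2Fresource%2FTNF%3E.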