Unit 2.4: Bioinformatics and Databases
Objectives: At the end of this unit, students will
-have been introduced to some basic concepts and considerations
in bioinformatics and computational biology
-know what a relational database is
-understand why databases are useful for dealing with large
amounts of data
-have been introduced to some of the major online biological
databases and their features
-have gained experience in extracting data from online
biological databases
Reading:
Stein, L.D. 2003. Integrating biological databases. Nat Rev
Genet 4: 337-345.
Assignments:
Read the excerpts from Current Protocols in
Bioinformatics on Entrez and the UCSC Browser.
Follow along with the examples in Protocol 1 of each
section.
“Genomic research makes it possible to look at biological
phenomena on a scale not previously possible: all genes in a
genome, all transcripts in a cell, all metabolic processes in a
tissue. One feature that all of these approaches share is the
production of massive quantities of data. GenBank, for example,
now accommodates >10^10 nucleotides of nucleic acid sequence
data and continues to more than double in size every year. New
technologies for assaying gene expression patterns, protein
structure, protein-protein interactions, etc., will provide even
more data. How to handle these data, make sense of them, and
render them accessible to biologists working on a wide variety of
problems is the challenge facing bioinformatics—an emerging
field that seeks to integrate computer science with applications
derived from molecular biology. We are swimming in a rapidly
rising sea of data. . . how do we keep from drowning?”
—Roos (2001). Science. 291:1260
Bioinformatics is one solution to this problem—a way of coping
with large data sets and making sense of genomic-scale data. As
with most approaches, it is important to have a sense of what
is and is not possible to achieve using bioinformatics
approaches.
Learn to know the difference. Bioinformatics is:
• sometimes a time-saver: you can automate common and/or
repetitive tasks, and parse large files
• sometimes essential: how else would you analyze results from a
25,000-gene microarray experiment?
• sometimes not helpful/not useful/unimportant: it can be easier
and more straightforward to do a simple wet-lab experiment than
to devise an elaborate computational approach
• sometimes not possible: computers can’t do everything!
It’s also important to have an understanding of the underlying concepts
and algorithms in bioinformatics, just as it’s important to understand the
basic concepts and chemical basis of molecular biology, or genetics, or
biochemistry, if you’re going to do wet-lab experiments.
“Many biologists are comfortable using algorithms like BLAST or
GenScan without really understanding how the underlying algorithm
works. . . . BLAST solves a particular problem only approximately and it
has certain systematic weaknesses. . . . Users that do not know how
BLAST works might misapply the algorithm or misinterpret the results it
returns.” [Pevzner (2004). Bioinformatics 20(14): 2159-2161.]
A historical perspective
• The 1960s: the birth of bioinformatics
– High-level computer languages
– Protein sequence data
– Academic access to computers
• Margaret Oakley Dayhoff
– First protein database
– First program for sequence assembly
[Image: IBM 7090 computer]
Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper
Saddle River, New Jersey 07458
By way of comparison…
IBM 7090 computer: 32 Kbytes RAM, 2.18 µs cycle time, $2,900,000 in 1960
20” Apple iMac: 1 GB RAM, 2.4 GHz, $1199 in 2008
Solving problems in computer
science
• Necessary parameters for assessing the
difficulty of a computer science problem
– Algorithmic complexity
• Is the problem theoretically solvable?
• If so, what is the most efficient solution?
– Current state of computer technology
• Memory
• CPU speed
• Cost
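As a rough illustration of why algorithmic complexity matters, the sketch below counts the comparisons made by two ways of finding an item in a sorted list. The function names and the list of one million integers are illustrative assumptions, not part of the original slides.

```python
def linear_search_steps(items, target):
    """Count comparisons when scanning a list from the start."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Count comparisons when repeatedly halving a sorted list."""
    lo, hi, steps = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


# Hypothetical data: one million sorted identifiers, searching for the last one.
ids = list(range(1_000_000))
linear = linear_search_steps(ids, 999_999)  # one comparison per element
binary = binary_search_steps(ids, 999_999)  # at most about log2(n) + 1 comparisons
```

Both functions solve the same problem, but the linear scan needs a million comparisons in this case while the binary search needs at most 20. Faster hardware helps both equally; the choice of algorithm is what changes the scale.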
Algorithms
• An algorithm is a sequence of instructions that
one must perform in order to solve a well-
formulated problem
• First you must identify exactly what the problem
is!
• A problem describes a class of computational
tasks. A problem instance is one particular input
from that class
• In general, you should design your algorithms to
work for any instance of a problem (although
there are cases in which this is not possible)
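The difference between a problem and a problem instance can be sketched in a few lines of Python. The problem chosen here (computing the GC content of a DNA sequence) and the name `gc_content` are illustrative assumptions, not taken from the slides.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    The problem is "compute the GC content of any DNA string";
    each call supplies one problem instance.
    """
    seq = sequence.upper()
    if not seq:
        # A case the algorithm cannot handle: the answer is undefined.
        raise ValueError("empty sequence: GC content is undefined")
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)


# Three instances of the same problem, solved by the same algorithm.
instances = ["ATGC", "GGGCCC", "attatt"]
results = [gc_content(s) for s in instances]
```

The function works for any non-empty sequence rather than for one particular input, and it states the case it rejects explicitly instead of returning a misleading value.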
Computer technology: memory, CPU speed, cost
• Dramatic improvements on yearly basis
• We do a lot of our work using desktop Macs out of the box
- 2 quad core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for
~$3000
- 2 quad core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for
~$6000
• CPU speed vs. memory: which is more important?
- for protein structure, might need many calculations but limited
memory
- for genome searches, might have few calculations but huge amounts
to store in memory
• Reading from memory is several orders of magnitude faster
than reading from disk
Databases
• What is a database?
– A collection of related data elements
• tables
• columns (fields)
• rows (records)
– Records retrieved using a query language
– Database technology is well established
Tables (entities)
•basic elements of information to track, e.g., gene, organism,
sequence, citation
Columns (fields)
•attributes of tables, e.g., for a citation table: title, journal,
volume, author
Rows (records)
•actual data
•whereas fields describe what data is stored, the rows of a table
are where the actual data is stored
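A minimal sketch of these three terms, using Python's built-in `sqlite3` module and an in-memory database. The table layout follows the citation example above; the single row is the Stein (2003) reading from this unit, and the column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The table (entity) and its columns (fields).
conn.execute(
    "CREATE TABLE citation (title TEXT, journal TEXT, volume INTEGER, author TEXT)"
)

# One row (record): the actual data.
conn.execute(
    "INSERT INTO citation VALUES (?, ?, ?, ?)",
    ("Integrating biological databases", "Nat Rev Genet", 4, "Stein, L.D."),
)

# The fields describe what is stored; the rows hold what is stored.
columns = [info[1] for info in conn.execute("PRAGMA table_info(citation)")]
rows = conn.execute("SELECT * FROM citation").fetchall()
```

Here `columns` holds the field names and `rows` holds one tuple per record.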
Databases
A very simple form of (non-electronic) database is a filing
cabinet. In the filing cabinet, you can store many different records
(sheets of paper), each containing multiple data elements.
Example: a filing cabinet of invoices
•the filing cabinet is a table
•the columns are the fields of data on the individual
invoices (customer, product, price, quantity)
•the rows (records) are the individual invoices
The biggest problem with a filing cabinet is that you can only
store your data one way (e.g., in alphabetical order of the
customer’s last name), and there’s no good way of searching your
files based on any other criteria (say, by product ordered).
Databases
A flat-file database—a spreadsheet—is the electronic
analogue to the filing cabinet:
This is more easily searchable than a paper file cabinet, but
is still very unwieldy, especially for large amounts of data.
Databases
Suppose you now want to be able to send an advertisement to
every customer who bought the Acme Snow Machine. You could
add a column to your table that includes the address for each
customer, but this is very inefficient—you will keep repeating
information for customers (like Elmer) who make multiple
purchases. Plus, as the number of rows and columns grows,
searching a flat file becomes more and more time consuming.
Also, it is difficult to construct complex queries (e.g., customers
who bought the Snow Machine and who like opera or live in the
Southwest desert).
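The two weaknesses of a flat file described above (repeated information and slow, scan-based searching) can be seen in a small Python sketch. Apart from Elmer and the Acme Snow Machine, all names, addresses, products, and prices below are hypothetical.

```python
# A flat file: one row per purchase, with the address repeated each time.
flat_file = [
    # customer, address, product, price, qty (hypothetical values)
    ("Elmer", "12 Forest Lane", "Acme Snow Machine", 500.0, 1),
    ("Elmer", "12 Forest Lane", "Acme Rocket Skates", 75.0, 2),
    ("Wile E.", "Southwest desert", "Acme Snow Machine", 500.0, 1),
]


def customers_who_bought(rows, product):
    """Linear scan: every row is inspected, so cost grows with file size."""
    return sorted(
        {(name, address) for name, address, prod, _, _ in rows if prod == product}
    )


buyers = customers_who_bought(flat_file, "Acme Snow Machine")

# Number of rows that repeat a customer/address pair already stored elsewhere.
repeated_addresses = len(flat_file) - len({(r[0], r[1]) for r in flat_file})
```

Even in this three-row file, Elmer's address is stored twice, and answering any question means reading every row.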
Relational Databases
The solution is the relational database. A relational database contains multiple
tables and defines the relationships between them. Thus you might also have a
customer table and a product table, like this:
Relational Databases
Relationships can be built between tables and fields:
database “schema”
Relational Databases
Now only three items need to be filled in for an invoice: a customer, a
product, and a quantity. The price and total fields can be filled in
automatically: price from a product_table “lookup” and total by “calculation”
(price * qty).
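The "lookup" and "calculation" steps can be sketched with Python's built-in `sqlite3` module. The table and column names follow the text (`product_table`, `invoice`, `price`, `qty`); the exact schema and the price are assumptions, since the slide's schema figure is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_table (name TEXT PRIMARY KEY, price REAL);
CREATE TABLE invoice (
    customer TEXT,
    product  TEXT REFERENCES product_table(name),
    qty      INTEGER
);
INSERT INTO product_table VALUES ('Acme Snow Machine', 500.0);
-- Only three items are entered for the invoice: customer, product, quantity.
INSERT INTO invoice VALUES ('Elmer', 'Acme Snow Machine', 2);
""")

# price comes from a lookup in product_table; total is calculated as price * qty.
row = conn.execute("""
SELECT invoice.customer, invoice.product, invoice.qty,
       product_table.price,
       product_table.price * invoice.qty AS total
FROM invoice
JOIN product_table ON invoice.product = product_table.name
""").fetchone()
```

Because the price is stored once in `product_table`, changing it there updates every invoice total computed this way.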
Relational Databases
Now we can send our advertisement to every customer who bought the Acme
Snow Machine by getting their addresses from customer_table.
To do this, we use Structured Query Language (SQL):
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme Snow Machine'
AND invoice.customer = customer_table.name;
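To try the query above, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The schema and sample rows are assumptions made for illustration; only Elmer and the Acme Snow Machine come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (customer TEXT, product TEXT, qty INTEGER);
INSERT INTO customer_table VALUES ('Elmer', '12 Forest Lane', 'likes opera');
INSERT INTO customer_table VALUES ('Daffy', '3 Pond Road', '');
INSERT INTO invoice VALUES ('Elmer', 'Acme Snow Machine', 1);
INSERT INTO invoice VALUES ('Daffy', 'Acme Rocket Skates', 2);
""")

# Same query as on the slide, with the product supplied as a bound parameter.
query = """
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = ?
  AND invoice.customer = customer_table.name
"""
mailing_list = conn.execute(query, ("Acme Snow Machine",)).fetchall()
```

Note that SQL string literals take straight single quotes; the curly quotes that presentation software inserts would be rejected by a database.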
Relational Databases
We can also make our complex query
“customers who bought the Snow Machine and who like opera or live in the
Southwest desert”:
SELECT customer_table.name
FROM customer_table, invoice
WHERE invoice.product = 'Acme Snow Machine'
AND invoice.customer = customer_table.name
AND (customer_table.notes LIKE '%opera%' OR
customer_table.address = 'Southwest desert');
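The complex query can be checked the same way. In this sketch (Python's built-in `sqlite3`, in-memory database), all customers other than Elmer, and all addresses and notes, are hypothetical test data chosen so that each condition is exercised; `DISTINCT` and `ORDER BY` are additions to make the result predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (customer TEXT, product TEXT, qty INTEGER);
INSERT INTO customer_table VALUES ('Elmer', '12 Forest Lane', 'likes opera');
INSERT INTO customer_table VALUES ('Wile E.', 'Southwest desert', '');
INSERT INTO customer_table VALUES ('Daffy', '3 Pond Road', '');
INSERT INTO customer_table VALUES ('Porky', '8 Farm Street', 'opera fan');
INSERT INTO invoice VALUES ('Elmer', 'Acme Snow Machine', 1);
INSERT INTO invoice VALUES ('Wile E.', 'Acme Snow Machine', 1);
INSERT INTO invoice VALUES ('Daffy', 'Acme Snow Machine', 1);
INSERT INTO invoice VALUES ('Porky', 'Acme Rocket Skates', 2);
""")

# The parentheses matter: without them, AND binds more tightly than OR and
# every desert resident would match regardless of what they bought.
query = """
SELECT DISTINCT customer_table.name
FROM customer_table, invoice
WHERE invoice.product = 'Acme Snow Machine'
  AND invoice.customer = customer_table.name
  AND (customer_table.notes LIKE '%opera%'
       OR customer_table.address = 'Southwest desert')
ORDER BY customer_table.name
"""
names = [row[0] for row in conn.execute(query)]
```

Daffy bought the machine but meets neither extra condition, and Porky likes opera but bought something else, so neither is returned.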
Online Databases
When you query an online database, your query is translated
into SQL, the database is interrogated, and the answer displayed
on your web browser.
• Your computer and browser (the “client”)
• Software on the “server” that receives and translates the instructions you enter into your browser
• The database itself
Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).
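The three parts of this picture can be sketched in Python. This is a simplified stand-in, not how any particular biological database is implemented: the `gene` table, the function names, and the form field `organism` are all hypothetical, and an in-memory SQLite database plays the role of the database server.

```python
import sqlite3


def build_database():
    """The database itself (here, a tiny in-memory example)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE gene (symbol TEXT, organism TEXT);
    INSERT INTO gene VALUES ('BRCA1', 'Homo sapiens');
    INSERT INTO gene VALUES ('Brca1', 'Mus musculus');
    """)
    return conn


def handle_request(conn, form_fields):
    """Server side: translate browser form fields into SQL and run it.

    The user's text is passed as a bound parameter (the ? placeholder)
    rather than pasted into the SQL string.
    """
    sql = "SELECT symbol, organism FROM gene WHERE organism = ? ORDER BY symbol"
    return conn.execute(sql, (form_fields["organism"],)).fetchall()


# Client side: what the browser would send after the user fills in a form.
conn = build_database()
answer = handle_request(conn, {"organism": "Homo sapiens"})
```

The user never writes SQL; the server software builds the query from the form and returns the rows for display.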
Biological Databases
•Over 1000 biological databases
•Vary in size, quality, coverage, level of interest
•Many of the major ones covered in the annual
Database Issue of Nucleic Acids Research
•What makes a good database?
•comprehensiveness
•accuracy
•is up-to-date
•good interface
•batch search/download
•API (web services, DAS, etc.)
“The Ten Commandments When Using
Servers”
•Remember the server, the database, and the program version used
•Write down sequence identification numbers
•Write down the program parameters
•Save your internet results the right way
(use screenshots or PDFs if necessary)
•Databases are not like good wine
(use up-to-date builds)
•Use local installs when it becomes necessary
Source: Bioinformatics for Dummies
“Ten Important Bioinformatics Databases”
Database     URL                    Content
GenBank      www.ncbi.nlm.nih.gov   nucleotide sequences
Ensembl      www.ensembl.org        human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov   literature references
NR           www.ncbi.nlm.nih.gov   protein sequences
SWISS-PROT   www.expasy.ch          protein sequences
InterPro     www.ebi.ac.uk          protein domains
OMIM         www.ncbi.nlm.nih.gov   genetic diseases
Enzymes      www.chem.qmul.ac.uk    enzymes
PDB          www.rcsb.org/pdb/      protein structures
KEGG         www.genome.ad.jp       metabolic pathways
Source: Bioinformatics for Dummies
NCBI (National Center for Biotechnology
Information)
• over 30 databases including
GenBank, PubMed, OMIM, and
GEO
• Access all NCBI resources via
Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
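Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds a search URL for the `esearch` endpoint and does not contact the server; the helper name `esearch_url` and the search term are illustrative assumptions, and NCBI's current usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for an Entrez database (hypothetical helper).

    db     -- Entrez database name, e.g. "pubmed"
    term   -- the search text, as typed into the Entrez search box
    retmax -- maximum number of record identifiers to return
    """
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = esearch_url("pubmed", "integrating biological databases", retmax=5)
```

Recording the exact URL used for a search is one way to follow the advice above about writing down the server, database, and program parameters.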
GenBank® is the NIH genetic
sequence database, an annotated
collection of all publicly available
DNA sequences. There are
approximately 65,369,091,950
bases in 61,132,599 sequence
records in the traditional GenBank
divisions and 80,369,977,826
bases in 17,960,667 sequence
records in the WGS division as of
August 2006.
www.ncbi.nlm.nih.gov/GenBank
The Reference Sequence (RefSeq) database is
a non-redundant collection of richly annotated
DNA, RNA, and protein sequences from diverse
taxa. Each RefSeq represents a single, naturally
occurring molecule from one organism. The goal
is to provide a comprehensive, standard dataset
that represents sequence information for a
species. It should be noted, though, that RefSeq
has been built using data from public archival
databases only.
RefSeq biological sequences (also known as
RefSeqs) are derived from GenBank records
but differ in that each RefSeq is a synthesis of
information, not an archived unit of primary
research data. Similar to a review article in the
literature, a RefSeq represents the consolidation
of information by a particular group at a
particular time.
Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
The MOD squad
•Most model organism communities have established organism-
specific Model Organism Databases (MODs)
•Many of these databases have different schemas and implementations,
although there is movement toward harmonizing many features via the
Generic Model Organism Database project.
The MOD squad
SGD: yeast (www.yeastgenome.org)
WormBase: C. elegans (www.wormbase.org)
FlyBase: Drosophila (flybase.bio.indiana.edu)
ZFIN: zebrafish (zfin.org)
and many others (Xenopus, Dictyostelium,
Arabidopsis…)
The MOD squad: what about Homo sapiens?
There is no true “model organism” database for humans.
The two main sources of genome information that have
evolved are the UCSC Genome Browser and Ensembl.
Ensembl www.ensembl.org
UCSC genome.ucsc.edu
UCSC Browser [screenshots]
Ensembl [screenshots]
Protein Data Bank (PDB) [screenshots; chart with series labeled “total” and “yearly”]

More Related Content

Similar to Bioinformatics&Databases.ppt

Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesmustafa sarac
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchGreg Landrum
 
DataIntensiveComputing.pdf
DataIntensiveComputing.pdfDataIntensiveComputing.pdf
DataIntensiveComputing.pdfBrahmam8
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to databaseSuleman Memon
 
Chapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdfChapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdfTamiratDejene1
 
Chapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdfChapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdfTamiratDejene1
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and boltsNBER
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And DesignLijo Stalin
 
Open Source Database Management Software available on the Net
Open Source Database Management Software available on the NetOpen Source Database Management Software available on the Net
Open Source Database Management Software available on the NetDlis Mu
 
Design and implementation of Clinical Databases using openEHR
Design and implementation of Clinical Databases using openEHRDesign and implementation of Clinical Databases using openEHR
Design and implementation of Clinical Databases using openEHRPablo Pazos
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)Huibert Aalbers
 
Lec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemsLec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemssamiullahamjad06
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
 

Similar to Bioinformatics&Databases.ppt (20)

Memory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challengesMemory efficient java tutorial practices and challenges
Memory efficient java tutorial practices and challenges
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical research
 
DataIntensiveComputing.pdf
DataIntensiveComputing.pdfDataIntensiveComputing.pdf
DataIntensiveComputing.pdf
 
Computing 7
Computing 7Computing 7
Computing 7
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to database
 
Chapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdfChapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdf
 
Chapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdfChapter – 1 Intro to DBS.pdf
Chapter – 1 Intro to DBS.pdf
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and bolts
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And Design
 
Database, Lecture-1.ppt
Database, Lecture-1.pptDatabase, Lecture-1.ppt
Database, Lecture-1.ppt
 
Open Source Database Management Software available on the Net
Open Source Database Management Software available on the NetOpen Source Database Management Software available on the Net
Open Source Database Management Software available on the Net
 
Database management system
Database management systemDatabase management system
Database management system
 
Design and implementation of Clinical Databases using openEHR
Design and implementation of Clinical Databases using openEHRDesign and implementation of Clinical Databases using openEHR
Design and implementation of Clinical Databases using openEHR
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)
 
Presentation1
Presentation1Presentation1
Presentation1
 
Lec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemsLec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systems
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 

More from BlackHunt1

Plant breeding - The past, the present and the future.pptx
Plant breeding - The past, the present and the future.pptxPlant breeding - The past, the present and the future.pptx
Plant breeding - The past, the present and the future.pptxBlackHunt1
 
topic_14_-_genetic_technology.ppt
topic_14_-_genetic_technology.ppttopic_14_-_genetic_technology.ppt
topic_14_-_genetic_technology.pptBlackHunt1
 
Pierce5e_ch21_lecturePPT.ppt
Pierce5e_ch21_lecturePPT.pptPierce5e_ch21_lecturePPT.ppt
Pierce5e_ch21_lecturePPT.pptBlackHunt1
 
Lezione 17- Epigenetics.ppt
Lezione 17- Epigenetics.pptLezione 17- Epigenetics.ppt
Lezione 17- Epigenetics.pptBlackHunt1
 
4_4_lambda_decisions.ppt
4_4_lambda_decisions.ppt4_4_lambda_decisions.ppt
4_4_lambda_decisions.pptBlackHunt1
 
DNA replication_BTL.pptx
DNA replication_BTL.pptxDNA replication_BTL.pptx
DNA replication_BTL.pptxBlackHunt1
 
Gene_Expression.pptx
Gene_Expression.pptxGene_Expression.pptx
Gene_Expression.pptxBlackHunt1
 
_chapter 3.ppt_.ppt
_chapter 3.ppt_.ppt_chapter 3.ppt_.ppt
_chapter 3.ppt_.pptBlackHunt1
 
Presentation A - Using Restriction Enzymes.pptx
Presentation A - Using Restriction Enzymes.pptxPresentation A - Using Restriction Enzymes.pptx
Presentation A - Using Restriction Enzymes.pptxBlackHunt1
 
Recombinant-DNA-Technology.pdf
Recombinant-DNA-Technology.pdfRecombinant-DNA-Technology.pdf
Recombinant-DNA-Technology.pdfBlackHunt1
 

More from BlackHunt1 (13)

Plant breeding - The past, the present and the future.pptx
Plant breeding - The past, the present and the future.pptxPlant breeding - The past, the present and the future.pptx
Plant breeding - The past, the present and the future.pptx
 
topic_14_-_genetic_technology.ppt
topic_14_-_genetic_technology.ppttopic_14_-_genetic_technology.ppt
topic_14_-_genetic_technology.ppt
 
Pierce5e_ch21_lecturePPT.ppt
Pierce5e_ch21_lecturePPT.pptPierce5e_ch21_lecturePPT.ppt
Pierce5e_ch21_lecturePPT.ppt
 
Lezione 17- Epigenetics.ppt
Lezione 17- Epigenetics.pptLezione 17- Epigenetics.ppt
Lezione 17- Epigenetics.ppt
 
slides1.ppt
slides1.pptslides1.ppt
slides1.ppt
 
45931.ppt
45931.ppt45931.ppt
45931.ppt
 
4_4_lambda_decisions.ppt
4_4_lambda_decisions.ppt4_4_lambda_decisions.ppt
4_4_lambda_decisions.ppt
 
DNA replication_BTL.pptx
DNA replication_BTL.pptxDNA replication_BTL.pptx
DNA replication_BTL.pptx
 
Gene_Expression.pptx
Gene_Expression.pptxGene_Expression.pptx
Gene_Expression.pptx
 
_chapter 3.ppt_.ppt
_chapter 3.ppt_.ppt_chapter 3.ppt_.ppt
_chapter 3.ppt_.ppt
 
Databases.ppt
Databases.pptDatabases.ppt
Databases.ppt
 
Presentation A - Using Restriction Enzymes.pptx
Presentation A - Using Restriction Enzymes.pptxPresentation A - Using Restriction Enzymes.pptx
Presentation A - Using Restriction Enzymes.pptx
 
Recombinant-DNA-Technology.pdf
Recombinant-DNA-Technology.pdfRecombinant-DNA-Technology.pdf
Recombinant-DNA-Technology.pdf
 

Recently uploaded

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 

Recently uploaded (20)

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 

Bioinformatics&Databases.ppt

  • 1. Unit 2.4: Bioinformatics and Databases Objectives: At the end of this unit, students will -have been introduced to ome basic concepts and considerations in bioinformatics and computational biology -know what a relational database is -understand why databases are useful for dealing with large amounts of data -have been introduced to some of the major online biological databases and their features -have gained experience in extracting data from online biological databases Reading: Stein, L.D. 2003. Integrating biological databases. Nat Rev Genet 4: 337-345.
  • 2. Assignments: Read the excerpts from Current Protocols in Bioinformatics on Entrez and the UCSC Browser. Follow along with the examples in Protocol 1 of each section.
  • 3. “Genomic research makes it possible to look at biological phenomena on a scale not previously possible: all genes in a genome, all transcripts in a cell, all metabolic processes in a tissue. One feature that all of these approaches share is the production of massive quantities of data. GenBank, for example, now accommodates >1010 nucleotides of nucleic acid sequence data and continues to more than double in size every year. New technologies for assaying gene expression patterns, protein structure, protein-protein interactions, etc., will provide even more data. How to handle these data, make sense of them, and render them accessible to biologists working on a wide variety of problems is the challenge facing bioinformatics—an emerging field that seeks to integrate computer science with applications derived from molecular biology. We are swimming in a rapidly rising sea of data. . . how do we keep from drowning?” —Roos (2001). Science. 291:1260
  • 4. Bioinformatics is one solution to this problem—a way of coping with large data sets and making sense of genomic-scale data. But like with most approaches, it is important to have a sense of what types of things are possible or not possible to achieve using bioinformatics approaches. Learn to know the difference—Bioinformatics is: • sometimes a time-saver: you can automate common and/or repetative tasks, and parse large files • sometimes essential: how else would you analyze results from a 25,000 gene microarray experiment • sometimes not helpful/not useful/unimportant: it can be easier and more straightforward to do a simple wet-lab experiment than to devise an elaborate computational approach • sometimes not possible: computers can’t do everything!
  • 5. It’s also important to have an understanding of the underlying concepts and algorithms in bioinformatics, just as it’s important to understand the basic concepts and chemical basis of molecular biology, or genetics, or biochemistry, if you’re going to do wet-lab experiments. “Many biologists are comfortable using algorithms like BLAST or GenScan without really understanding how the underlying algorithm works. . . . BLAST solves a particular problem only approximately and it has certain systematic weaknesses. . . . Users that do not know how BLAST works might misapply the algorithm or misinterpret the results it returns.” [Pevzner (2004). Bioinformatics 20(14): 2159-2161.]
  • 6. A historical perspective
      • The 1960s: the birth of bioinformatics
          – High-level computer languages
          – Protein sequence data
          – Academic access to computers
      • Margaret Oakley Dayhoff
          – First protein database
          – First program for sequence assembly
      [Image: IBM 7090 computer]
      Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
  • 7. By way of comparison…
      • IBM 7090 computer: 32 K words (36-bit) of core memory, 2.18 µs memory cycle time (roughly 0.46 MHz), $2,900,000 in 1960
      • 20” Apple iMac: 1 GB RAM, 2.4 GHz, $1199 in 2008
  • 8. Solving problems in computer science
      • Necessary parameters for assessing the difficulty of a computer science problem
          – Algorithmic complexity
              • Is the problem theoretically solvable?
              • If so, what is the most efficient solution?
          – Current state of computer technology
              • Memory
              • CPU speed
              • Cost
      Benfey and Protopapas, "Genomics" © 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
  • 9. Algorithms
      • An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
      • First you must identify exactly what the problem is!
      • A problem describes a class of computational tasks. A problem instance is one particular input from that class
      • In general, you should design your algorithms to work for any instance of a problem (although there are cases in which this is not possible)
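To make the distinction between a problem and a problem instance concrete, here is a minimal sketch. The problem chosen ("compute the GC fraction of a DNA sequence"), the function name `gc_content`, and the sample sequences are all illustrative assumptions, not part of the original slides.

```python
def gc_content(sequence: str) -> float:
    """Algorithm for the problem 'what fraction of a DNA sequence is G or C?'.

    It is written to work for any instance (any non-empty sequence),
    not just for one particular input.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq)


# Three problem instances (hypothetical sequences) handled by the same algorithm.
instances = ["ATGC", "GGGCCC", "ataTAT"]
results = {seq: gc_content(seq) for seq in instances}
```

The empty sequence is a case where the problem itself is not well formulated, so the sketch rejects it rather than guessing an answer.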
  • 10. Computer technology: memory, CPU speed, cost
      • Dramatic improvements on a yearly basis
      • We do a lot of our work using desktop Macs out of the box
          - two quad-core 2.8 GHz processors, 500 GB disk space, 4 GB RAM for ~$3000
          - two quad-core 3.0 GHz processors, 2.5 TB disk space, 8 GB RAM for ~$6000
      • CPU speed vs. memory: which is more important?
          - for protein structure, might need many calculations but limited memory
          - for genome searches, might have few calculations but huge amounts to store in memory
      • Reading from memory is several orders of magnitude faster than reading from disk
  • 11. Databases
      • What is a database?
          – A collection of related data elements
              • tables
              • columns (fields)
              • rows (records)
          – Records retrieved using a query language
          – Database technology is well established
  • 12. Databases
      Tables (entities)
      • basic elements of information to track, e.g., gene, organism, sequence, citation
      Columns (fields)
      • attributes of tables, e.g., for a citation table: title, journal, volume, author
      Rows (records)
      • the actual data: whereas fields describe what data is stored, the rows of a table are where the data itself is stored
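The table, column, and row vocabulary can be tried out in a few lines. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the table name `citation` and the column types are assumptions, and the single row is the Stein (2003) reading listed for this unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The table is the entity; its columns (fields) describe what is stored.
conn.execute("""
    CREATE TABLE citation (
        title   TEXT,
        journal TEXT,
        volume  INTEGER,
        author  TEXT
    )
""")

# Each row (record) holds the actual data for one citation.
conn.execute(
    "INSERT INTO citation (title, journal, volume, author) VALUES (?, ?, ?, ?)",
    ("Integrating biological databases", "Nat Rev Genet", 4, "Stein, L.D."),
)

cursor = conn.execute("SELECT * FROM citation")
field_names = [column[0] for column in cursor.description]  # the columns
records = cursor.fetchall()                                 # the rows
```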
  • 13. Databases
      A very simple form of (non-electronic) database is a filing cabinet. In the filing cabinet, you can store many different records (sheets of paper), each containing multiple data elements.
      Example: a filing cabinet of invoices
      • the filing cabinet is a table
      • the columns are the fields of data on the individual invoices (customer, product, price, quantity)
      • the rows (records) are the individual invoices
      The biggest problem with a filing cabinet is that you can only store your data one way (e.g., in alphabetical order of the customer’s last name), and there’s no good way of searching your files based on any other criterion (say, by product ordered).
  • 14. Databases
      A flat-file database (a spreadsheet) is the electronic analogue of the filing cabinet: the spreadsheet is the table, its columns are the invoice fields (customer, product, price, quantity), and its rows are the individual invoices. It is more easily searchable than a paper filing cabinet, but it is still very unwieldy, especially for large amounts of data.
  • 15. Databases Suppose you now want to be able to send an advertisement to every customer who bought the Acme Snow Machine. You could add a column to your table that includes the address for each customer, but this is very inefficient: you will keep repeating information for customers (like Elmer) who make multiple purchases. Plus, as the number of rows and columns grows, searching a flat file becomes more and more time-consuming. It is also difficult to construct complex queries (e.g., customers who bought the Snow Machine and who like opera or live in the Southwest desert).
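The two flat-file problems described here, repeated information and row-by-row searching, can be seen in a small sketch. Only Elmer and the Acme Snow Machine come from the slides; the second customer, the second product, the addresses, and the prices are hypothetical.

```python
from collections import Counter

# A flat file: one row per invoice, with the customer's address repeated on every row.
flat_file = [
    {"customer": "Elmer", "address": "1 Hunting Lodge Rd",
     "product": "Acme Snow Machine", "price": 100.0, "qty": 1},
    {"customer": "Elmer", "address": "1 Hunting Lodge Rd",
     "product": "Acme Anvil", "price": 25.0, "qty": 2},
    {"customer": "Wile", "address": "Southwest desert",
     "product": "Acme Snow Machine", "price": 100.0, "qty": 1},
]

# Searching means scanning every row and testing it.
snow_machine_addresses = sorted(
    {row["address"] for row in flat_file if row["product"] == "Acme Snow Machine"}
)

# Redundancy: the address is stored once per purchase, not once per customer.
address_copies = Counter(row["customer"] for row in flat_file)
```

If Elmer moves, every one of his rows has to be updated, which is the kind of inconsistency a relational design avoids.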
  • 16. Relational Databases The solution is the relational database. A relational database contains multiple tables and defines the relationships between them. Thus, in addition to the invoice table, you might also have a customer table and a product table.
  • 17. Relational Databases Relationships can be built between tables and fields; together, the table definitions and their relationships form the database “schema”.
  • 18. Relational Databases Now only three items need to be filled in for an invoice: a customer, a product, and a quantity. The price and total fields can be filled in automatically: price from a product_table “lookup” and total by “calculation” (price * qty).
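The "lookup" and "calculation" can be sketched with Python's built-in `sqlite3` module. The table names `customer_table`, `product_table`, and `invoice` follow the slides; the column names, the `id` key, the sample address, and the price are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT);
    CREATE TABLE product_table  (name TEXT PRIMARY KEY, price REAL);
    CREATE TABLE invoice (
        id       INTEGER PRIMARY KEY,
        customer TEXT REFERENCES customer_table(name),
        product  TEXT REFERENCES product_table(name),
        qty      INTEGER
    );
    INSERT INTO customer_table VALUES ('Elmer', '1 Hunting Lodge Rd');
    INSERT INTO product_table  VALUES ('Acme Snow Machine', 100.0);
    -- Only three items are entered for the invoice: customer, product, quantity.
    INSERT INTO invoice (customer, product, qty)
        VALUES ('Elmer', 'Acme Snow Machine', 3);
""")

# price is looked up in product_table; total is calculated as price * qty.
invoice_lines = conn.execute("""
    SELECT invoice.customer, invoice.product, invoice.qty,
           product_table.price,
           product_table.price * invoice.qty AS total
    FROM invoice
    JOIN product_table ON invoice.product = product_table.name
""").fetchall()
```

Because the price lives only in `product_table`, changing it there changes every total computed from it; whether that is desirable for historical invoices is a design decision this sketch does not address.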
  • 19. Relational Databases Now we can send our advertisement to every customer who bought the Acme Snow Machine by getting their addresses from customer_table. To do this, we use Structured Query Language (SQL):
      SELECT customer_table.name, customer_table.address
      FROM customer_table, invoice
      WHERE invoice.product = 'Acme Snow Machine'
      AND invoice.customer = customer_table.name
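A runnable version of this mailing-list query, as a sketch using Python's `sqlite3` module. The sample customers, addresses, and invoices are hypothetical; `DISTINCT` and `ORDER BY` are additions not on the slide, included so that a customer with several Snow Machine invoices appears once and the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT);
    CREATE TABLE invoice (customer TEXT, product TEXT, qty INTEGER);
    INSERT INTO customer_table VALUES
        ('Elmer', '1 Hunting Lodge Rd'),
        ('Wile',  'Southwest desert'),
        ('Daffy', 'Duck Pond');
    INSERT INTO invoice VALUES
        ('Elmer', 'Acme Snow Machine', 1),
        ('Elmer', 'Acme Snow Machine', 2),
        ('Wile',  'Acme Snow Machine', 1),
        ('Daffy', 'Acme Anvil', 1);
""")

query = """
    SELECT DISTINCT customer_table.name, customer_table.address
    FROM customer_table, invoice
    WHERE invoice.product = ?
      AND invoice.customer = customer_table.name
    ORDER BY customer_table.name
"""
# The product name is passed as a bound parameter rather than pasted into the SQL.
mailing_list = conn.execute(query, ("Acme Snow Machine",)).fetchall()
```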
  • 20. Relational Databases We can also make our complex query, “customers who bought the Snow Machine and who like opera or live in the Southwest desert”:
      SELECT customer_table.name
      FROM customer_table, invoice
      WHERE invoice.product = 'Acme Snow Machine'
      AND invoice.customer = customer_table.name
      AND (customer_table.notes LIKE '%opera%' OR customer_table.address = 'Southwest desert')
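The complex query can be checked against a small data set. This sketch again uses Python's `sqlite3` module; the `notes` column appears in the slide's query, while the sample customers and their notes are hypothetical and chosen so that each condition excludes someone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
    CREATE TABLE invoice (customer TEXT, product TEXT, qty INTEGER);
    INSERT INTO customer_table VALUES
        ('Elmer', '1 Hunting Lodge Rd', 'likes opera'),
        ('Wile',  'Southwest desert',   ''),
        ('Daffy', 'Duck Pond',          'likes golf'),
        ('Bugs',  'Rabbit Hole',        'opera fan');
    INSERT INTO invoice VALUES
        ('Elmer', 'Acme Snow Machine', 1),
        ('Wile',  'Acme Snow Machine', 1),
        ('Daffy', 'Acme Snow Machine', 1),
        ('Bugs',  'Acme Anvil', 1);
""")

rows = conn.execute("""
    SELECT DISTINCT customer_table.name
    FROM customer_table, invoice
    WHERE invoice.product = 'Acme Snow Machine'
      AND invoice.customer = customer_table.name
      AND (customer_table.notes LIKE '%opera%'
           OR customer_table.address = 'Southwest desert')
    ORDER BY customer_table.name
""").fetchall()
matching_customers = [name for (name,) in rows]
```

Daffy bought the Snow Machine but meets neither of the other conditions, and Bugs likes opera but never bought the Snow Machine, so only Elmer and Wile are returned. The parentheses around the `OR` matter: without them, `AND` binds more tightly and the query would also return any desert resident paired with any invoice.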
  • 21. Online Databases When you query an online database, your query is translated into SQL, the database is interrogated, and the answer is displayed in your web browser. Three components are involved:
      • Your computer and browser (the “client”)
      • Software to receive and translate the instructions you enter into your browser (on the “server”)
      • The database itself
      Image source: David Lane and Hugh E. Williams. Web Database Applications with PHP & MySQL. O’Reilly (2002).
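The server-side step, turning what the user typed into SQL and turning the answer back into a web page, might look roughly like the sketch below. It is not how any particular biological database is implemented: the function names `build_database` and `handle_request`, the `product` form field, and the sample data are all hypothetical, and real sites use a web framework rather than a bare function.

```python
import sqlite3
from html import escape


def build_database() -> sqlite3.Connection:
    """Stand-in for the database tier (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product_table (name TEXT, price REAL);
        INSERT INTO product_table VALUES
            ('Acme Snow Machine', 100.0),
            ('Acme Anvil', 25.0);
    """)
    return conn


def handle_request(conn: sqlite3.Connection, form: dict) -> str:
    """Stand-in for the server tier: form fields in, HTML out."""
    term = form.get("product", "")
    # The user's text is bound as a parameter, never pasted into the SQL string.
    rows = conn.execute(
        "SELECT name, price FROM product_table WHERE name LIKE ? ORDER BY name",
        (f"%{term}%",),
    ).fetchall()
    items = "".join(f"<li>{escape(name)}: {price:.2f}</li>" for name, price in rows)
    return f"<ul>{items}</ul>"


conn = build_database()
# What the client (browser) would send, and the page it would get back.
page = handle_request(conn, {"product": "snow"})
```

Binding the search term as a parameter means that text such as `'; DROP TABLE product_table; --` is treated as a search string that matches nothing, not as SQL.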
  • 22. Biological Databases
      • Over 1000 biological databases
      • Vary in size, quality, coverage, level of interest
      • Many of the major ones are covered in the annual Database Issue of Nucleic Acids Research
      • What makes a good database?
          • comprehensiveness
          • accuracy
          • being up to date
          • good interface
          • batch search/download
          • API (web services, DAS, etc.)
  • 23. “The Ten Commandments When Using Servers”
      • Remember the server, the database, and the program version used
      • Write down sequence identification numbers
      • Write down the program parameters
      • Save your internet results the right way (use screenshots or PDFs if necessary)
      • Databases are not like good wine (use up-to-date builds)
      • Use local installs when it becomes necessary
      Source: Bioinformatics for Dummies
  • 24. “Ten Important Bioinformatics Databases”
      • GenBank (www.ncbi.nlm.nih.gov): nucleotide sequences
      • Ensembl (www.ensembl.org): human/mouse genome (and others)
      • PubMed (www.ncbi.nlm.nih.gov): literature references
      • NR (www.ncbi.nlm.nih.gov): protein sequences
      • SWISS-PROT (www.expasy.ch): protein sequences
      • InterPro (www.ebi.ac.uk): protein domains
      • OMIM (www.ncbi.nlm.nih.gov): genetic diseases
      • Enzymes (www.chem.qmul.ac.uk): enzymes
      • PDB (www.rcsb.org/pdb/): protein structures
      • KEGG (www.genome.ad.jp): metabolic pathways
      Source: Bioinformatics for Dummies
  • 25. NCBI (National Center for Biotechnology Information) • over 30 databases including GenBank, PubMed, OMIM, and GEO • Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)
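Besides the Entrez web pages, NCBI resources can be queried from a program. The slides do not cover this, so the following is only a sketch: it assumes NCBI's E-utilities `esearch` endpoint with its `db`, `term`, and `retmax` parameters, and it only builds the request URL without contacting the server. The helper name `esearch_url` and the example search term are illustrative.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here; check NCBI's current documentation).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database.

    db     -- Entrez database name, e.g. "pubmed"
    term   -- search expression, written as in the Entrez search box
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: look for the Stein (2003) reading in PubMed.
url = esearch_url("pubmed", "Stein LD[Author] AND 2003[pdat]", retmax=5)
```

Keeping the exact URL (database, term, and parameters) alongside your results is one way to record which server and query produced them.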
  • 38. GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. There are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division as of August 2006. www.ncbi.nlm.nih.gov/GenBank
  • 40. The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only. RefSeq biological sequences (also known as RefSeqs) are derived from GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time.
  • 42. Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
  • 45. The MOD squad
      • Most model organism communities have established organism-specific Model Organism Databases (MODs)
      • Many of these databases have different schemas and implementations, although there is movement toward harmonizing many features via the Generic Model Organism Database project.
  • 46. The MOD squad
      • SGD: yeast (www.yeastgenome.org)
      • WormBase: C. elegans (www.wormbase.org)
      • FlyBase: Drosophila (flybase.bio.indiana.edu)
      • ZFIN: zebrafish (zfin.org)
      • and many others (Xenopus, Dictyostelium, Arabidopsis…)
  • 47. The MOD squad: what about Homo sapiens? There is no true “model organism” database for humans. The two main sources of genome information that have evolved are the UCSC Genome Browser and Ensembl.
      • Ensembl: www.ensembl.org
      • UCSC Genome Browser: genome.ucsc.edu
  • 54. Protein Data Bank (PDB) [Chart with two data series: total and yearly]