Guided Tour of Bioinformatics Databases
1. A Guided SQL Tour of
Bioinformatics Databases
Yannick Pouliot, PhD
Bioresearch Informationist
lanebioresearch@stanford.edu
Lane Medical Library & Knowledge Management Center
2/28/2007
http://lane.stanford.edu
2. Content
Very abbreviated review of the relational principle
Some of the technology required to connect to a
remote database
Walk-through of the database schema for Ensembl
Walk-through of the database schema for
BioWarehouse
Hands-on querying
Hands-on querying
Resources
Details on connecting to a remote database
3. So Why Are We Here?
5. Relational Database Terms
Database: Collection of tables and the relationships between them
Table
Collection of records that share a common fundamental characteristic
E.g., patients and locations can each be stored in their own table
Record
Basic unit of information in a relational database
E.g., 1 record per person
A record is composed of columns (“fields”)
Query
Set of instructions to a database “engine” to retrieve, sort and format the returned data
E.g., “find me all patients in my database”
6. Main Relational Database “Engines”
Filemaker
MS Access
MS SQL Server
MySQL
Oracle
PostgreSQL
Sybase
7. Structure of Relational DB Tables
Data values
live in rows
8. Understanding the Relational Principle: A
Simple Database
“join”
Every patient gets ONE record in the Patients table
Every visit gets ONE record in the Visits table
Rows in different tables can be related to one another using a shared key (identifier)
There can be multiple visit records for a given patient
There can be multiple tissue records for a given patient
9. The Relational Principle at Work
Related records can be found using a shared
key
Example: Patients.ID = Visits.PatientID
(Patients = table name; ID = primary key)
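The shared-key idea can be tried on a toy database. The sketch below uses Python's built-in sqlite3 module rather than the engines discussed in class. The Patients and Visits table names and the ID/PatientID key come from the slide; the other columns and all rows are made up for illustration.

```python
import sqlite3

# In-memory toy database. Only the table names and the ID / PatientID
# key come from the slide; other columns and all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (ID INTEGER PRIMARY KEY, LastName TEXT);
    CREATE TABLE Visits (
        VisitID   INTEGER PRIMARY KEY,
        PatientID INTEGER REFERENCES Patients(ID),
        VisitDate TEXT
    );
    INSERT INTO Patients VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO Visits VALUES
        (10, 1, '2007-01-05'),
        (11, 1, '2007-02-12'),
        (12, 2, '2007-02-20');
""")

# Related records are matched through the shared key.
rows = conn.execute("""
    SELECT Patients.LastName, Visits.VisitDate
    FROM Patients, Visits
    WHERE Patients.ID = Visits.PatientID
    ORDER BY Patients.LastName, Visits.VisitDate
""").fetchall()

for last_name, visit_date in rows:
    print(last_name, visit_date)
```

Each patient has one row in Patients, but Smith appears twice in the result because two Visits rows carry PatientID = 1.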
10. SQL Querying…With What?
Query browsers used here:
MySQL Query Browser
WinSQL
Other query browsers exist but are more sophisticated
Often more expensive or more complex
Example: PL/SQL Developer, from Allround Automations
11. Example: Network Querying of Ensembl
Database Using MySQL Query Browser
What happens when you query a remote database?
DEMO
Of note:
May take some time
Big database, lots of data to return from far away…
Easy to write queries with voluminous output
May have to kill the query…
Setting up ODBC: not discussed here, but cheat sheet instructions are in
handout. Location will also be mailed
12. The Database Schema: Your
Roadmap For Querying
The schema describes all tables and all fields
Used to determine how to inter-relate tables to
retrieve the desired data
Very important:
Must understand schema for accurate querying
Wrong understanding = wrong results
13. Introducing The SQL Select Statement
Good news: This is the only SQL
statement you need to understand for
querying
SELECT LastName, FirstName
FROM Patients
14. Basic Syntax of Select Statement
SELECT field_name
FROM table
[WHERE condition]
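[ ] = optional
Example:
Select LastName,FirstName
From Patients
Where Alive = 'Y';
Note: SQL keywords are not case sensitive; whether table names and string comparisons are case sensitive depends on the database engine and its settings
Query statements are written into a tool such as MS Query or MySQL Query Browser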
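To try the basic syntax without a query browser, here is a minimal sketch using Python's built-in sqlite3 module. The Patients table and its LastName, FirstName and Alive columns follow the slide's example; the rows are invented.

```python
import sqlite3

# Toy Patients table; column names follow the slide, rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (LastName TEXT, FirstName TEXT, Alive TEXT);
    INSERT INTO Patients VALUES
        ('Smith', 'Ann', 'Y'),
        ('Jones', 'Bob', 'N'),
        ('Lee',   'Cy',  'Y');
""")

# SELECT without WHERE returns every row.
all_names = conn.execute(
    "SELECT LastName, FirstName FROM Patients"
).fetchall()

# The optional WHERE clause keeps only rows matching the condition.
alive = conn.execute(
    "SELECT LastName, FirstName FROM Patients WHERE Alive = 'Y'"
).fetchall()

print(len(all_names), "patients in total")
print(sorted(alive))
```

Note the straight single quotes around 'Y': curly quotes pasted from a slide or word processor are typically rejected by the database.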
Handout: p2
15. SELECT – (Some) Details
17. Schemas We’ll Look At…
Remember: Schemas…
describe all tables and all fields
used to determine how to inter-relate tables to
retrieve the desired data
Our schemas today:
Ensembl
BioWarehouse
18. Ensembl
Produced by Sanger Institute
Collection of genome databases for many different
organisms
Free, open source
Web querying: http://www.ensembl.org/
FAQ: What is Ensembl?
All PubMed references pertaining to Ensembl and written
by the Ensembl group
19. Exploring the Ensembl Schema
Ensembl CORE schema documentation
First place to go to answer: “what does this table
store?”
Problem: no graphical representation of overall
schema
Relationships harder to appreciate
Use Catalog function and go from there…
22. Querying Ensembl
Ensembl
runs on the MySQL
database engine
We’ll use WinSQL
MySQL Query Browser can also
be used, as well as lots of other
querying tools
23. Before Proceeding: A Word of Caution
Easy to write queries that…
Retrieve nonsense
Never complete
Scotty to Captain Kirk: “We’re going in circles, and at warp 6 we’re going mighty fast…”
Understanding schema is only way to prevent this
Tips:
Use “count” to determine the number of rows in table
BEFORE returning large datasets
Remember: the more tables are joined, the slower the
query
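The “count first” tip might look like this in practice. This is a sketch on a toy table using Python's built-in sqlite3 module; the table is named marker after the Ensembl table used later, but its contents are invented. The LIMIT clause is an addition not mentioned on the slide; it works in MySQL and SQLite, but other engines use different syntax.

```python
import sqlite3

# Toy table with 5000 invented rows standing in for a large remote table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marker (marker_id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO marker VALUES (?)",
    [(i,) for i in range(1, 5001)],
)

# Step 1: ask how many rows there are BEFORE fetching any of them.
(row_count,) = conn.execute("SELECT COUNT(*) FROM marker").fetchone()
print("rows in table:", row_count)

# Step 2: if the table is large, preview a handful of rows only.
preview = conn.execute(
    "SELECT marker_id FROM marker ORDER BY marker_id LIMIT 5"
).fetchall()
print(preview)
```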
24. Demo Queries… To Get You
Started
Query 1: return number of genes stored in
Ensembl Human
Query 2: return number of transcripts
produced by genes stored in Ensembl
Human
Demonstrates JOINing of tables
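The shape of the two demo queries can be sketched on a toy database with Python's built-in sqlite3 module. The sketch assumes tables named gene and transcript linked by a gene_id column, as a simplified stand-in for the Ensembl core schema; check the schema documentation before running the same SQL against the real database. All rows are invented.

```python
import sqlite3

# Simplified stand-in for two Ensembl core tables (assumed names:
# gene, transcript, gene_id). All rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY);
    CREATE TABLE transcript (
        transcript_id INTEGER PRIMARY KEY,
        gene_id       INTEGER REFERENCES gene(gene_id)
    );
    INSERT INTO gene VALUES (1), (2), (3);
    INSERT INTO transcript VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# Query 1: number of genes (single table, no join needed).
(gene_count,) = conn.execute("SELECT COUNT(*) FROM gene").fetchone()

# Query 2: number of transcripts produced by genes (JOIN on gene_id).
(transcript_count,) = conn.execute("""
    SELECT COUNT(*)
    FROM gene
    INNER JOIN transcript ON gene.gene_id = transcript.gene_id
""").fetchone()

print("genes:", gene_count)
print("transcripts:", transcript_count)
```

Gene 3 has no transcripts in the toy data, so it contributes nothing to the second count: an inner join only returns rows that match on both sides.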
25. Exercises
Together:
1. the number of genes stored in Ensembl Human
2. the number of transcripts produced by genes stored in
Ensembl Human
(10 min)
On your own:
3. the types of analyses that Ensembl provides
4. the number of types of markers
5. the number of markers per chromosome for all chromosomes
6. Extra points: the minimum and maximum marker distances for
markers on chromosome 19
(20 min)
26. SELECT Statement: A Refresher
SELECT [DISTINCT] select_list
FROM table_list
[WHERE conditions]
[START WITH] [CONNECT BY]
[GROUP BY group_by_list]
[HAVING search_conditions]
[ORDER BY order_list [ASC | DESC] ]
“Modifiers” of select list:
DISTINCT
COUNT
SUM
MIN
MAX
Also:
ORDER BY
LIKE (used in WHERE clause)
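Several of these clauses can be seen together in a small sketch using Python's built-in sqlite3 module. The marker_map_location table and its marker_id, chromosome_name and position columns are taken from the chromosome 19 query in the Editor's Notes; storing position as an integer is an assumption, and all rows are invented.

```python
import sqlite3

# Toy version of marker_map_location; integer positions are an
# assumption and every row is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE marker_map_location (
        marker_id INTEGER, chromosome_name TEXT, position INTEGER
    );
    INSERT INTO marker_map_location VALUES
        (1, '1', 100), (2, '1', 250),
        (3, '19', 40), (4, '19', 900), (5, '19', 500),
        (6, 'X', 10);
""")

# DISTINCT: each chromosome name once.
chromosomes = conn.execute("""
    SELECT DISTINCT chromosome_name
    FROM marker_map_location
    ORDER BY chromosome_name
""").fetchall()

# COUNT + GROUP BY + HAVING + ORDER BY: markers per chromosome,
# keeping only chromosomes with at least 2 markers.
per_chromosome = conn.execute("""
    SELECT chromosome_name, COUNT(*) AS n
    FROM marker_map_location
    GROUP BY chromosome_name
    HAVING COUNT(*) >= 2
    ORDER BY n DESC
""").fetchall()

# MIN / MAX with a WHERE condition.
position_range = conn.execute("""
    SELECT MIN(position), MAX(position)
    FROM marker_map_location
    WHERE chromosome_name = '19'
""").fetchone()

# LIKE: pattern match inside the WHERE clause ('%' = any characters).
(starts_with_1,) = conn.execute("""
    SELECT COUNT(*) FROM marker_map_location
    WHERE chromosome_name LIKE '1%'
""").fetchone()

print(chromosomes, per_chromosome, position_range, starts_with_1)
```

The LIKE pattern '1%' matches both '1' and '19', a reminder that pattern matching on chromosome names can return more than intended.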
27. Example Of A Biologically-Useful
Query: All Markers on Chromosome 1
28. Now We’re Talking: Returning
Results into Your Favorite
Tool
SQL query results returned to…
MS Excel
… using Data/Import External Data/New Database Query
Details: Excel Advanced Report Development, Zapawa 2005 (in Lane catalog)
Spotfire
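When a direct database connection from Excel is not set up, one alternative route is to save query results as a CSV file, which Excel can open. This is not the Data/Import External Data method named on the slide, only a sketch of a different approach using Python's built-in sqlite3 and csv modules. The table and rows are invented.

```python
import csv
import io
import sqlite3

# Toy table with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patients (LastName TEXT, FirstName TEXT);
    INSERT INTO Patients VALUES ('Smith', 'Ann'), ('Jones', 'Bob');
""")

cursor = conn.execute(
    "SELECT LastName, FirstName FROM Patients ORDER BY LastName"
)

# Column names of the result set become the CSV header row.
header = [column[0] for column in cursor.description]

# Write to an in-memory buffer; swap in open("results.csv", "w", newline="")
# to produce a file (the file name is only an example).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)

csv_text = buffer.getvalue()
print(csv_text)
```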
29. Next:
BioWarehouse
Produced by SRI International
Integrates genome, biochemical reaction, pathway and other databases from many different organisms
Free, open source
Accessing PublicHouse
FAQ
Schema
All PubMed references pertaining to BioWarehouse and written by
the BioWarehouse group
30. Conceptual Views of the
BioWarehouse Database
32. Querying BioWarehouse
We’ll query using MySQL Query Browser
Caveats:
Lots of datasets supported by BioWarehouse…
… but some critical ones are missing from PublicHouse due to licensing requirements, e.g.,
MetaCyc
UniProt
Also: Need to request account to query
Anonymous user not supported
Resource: MySQL v5 Reference Manual
33. BioWarehouse Demo Queries
…to get you started
Query 1: What are the datasets available in
PublicHouse?
Query 2: How many pathways are there for
the EcoCyc dataset?
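The two demo questions could be phrased roughly as follows. This sketch runs on a toy database with Python's built-in sqlite3 module. Pathway.DataSetWID appears in the example query later in this deck; the DataSet table with WID and Name columns is an assumption about the BioWarehouse schema, so check the schema documentation first. All rows are invented.

```python
import sqlite3

# Toy stand-in for PublicHouse. The DataSet table (WID, Name) is an
# assumed layout; dataset numbers and pathway names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Pathway (
        WID INTEGER PRIMARY KEY, Name TEXT, DataSetWID INTEGER
    );
    INSERT INTO DataSet VALUES (1, 'EcoCyc'), (2, 'SAUR158878Cyc');
    INSERT INTO Pathway VALUES
        (100, 'toy pathway A', 1),
        (101, 'toy pathway B', 1),
        (102, 'toy pathway C', 2);
""")

# Query 1: which datasets are available?
datasets = conn.execute(
    "SELECT WID, Name FROM DataSet ORDER BY Name"
).fetchall()

# Query 2: how many pathways belong to the EcoCyc dataset?
(ecocyc_pathways,) = conn.execute("""
    SELECT COUNT(*)
    FROM Pathway P
    INNER JOIN DataSet S ON P.DataSetWID = S.WID
    WHERE S.Name = 'EcoCyc'
""").fetchone()

print(datasets)
print("EcoCyc pathways:", ecocyc_pathways)
```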
34. Example Biologically Meaningful Query of BioWarehouse:
For a Given Pathway, Return the Proteins Involved
and Their Molecular Weights
SELECT D.Name AS PathwayName,
       J.WID AS ProteinWID,
       J.Name AS ProteinName,
       J.MolecularWeightCalc AS MolecularWeightCalc
FROM Pathway D, PathwayReaction F, Reaction G,
     EnzymaticReaction H, Protein J
WHERE D.WID = F.PathwayWID
  AND F.ReactionWID = G.WID
  AND G.WID = H.ReactionWID
  AND H.ProteinWID = J.WID
  AND D.DataSetWID = 19
  AND D.Name LIKE "%lipopolysaccharide%"
ORDER BY ProteinName
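The query above lists its five tables with commas and puts the join conditions in the WHERE clause. The same joins can also be written with explicit INNER JOIN … ON clauses, which keeps join conditions separate from filters. The sketch below runs both forms on a toy database with Python's built-in sqlite3 module and checks that they agree. Table and column names are those used in the query; all rows, including protein names and weights, are invented. The pattern is in single quotes, the standard SQL form for strings.

```python
import sqlite3

# Toy tables using the names from the slide's query; all rows invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pathway (WID INTEGER PRIMARY KEY, Name TEXT, DataSetWID INTEGER);
    CREATE TABLE Reaction (WID INTEGER PRIMARY KEY);
    CREATE TABLE PathwayReaction (PathwayWID INTEGER, ReactionWID INTEGER);
    CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT, MolecularWeightCalc REAL);
    CREATE TABLE EnzymaticReaction (ReactionWID INTEGER, ProteinWID INTEGER);

    INSERT INTO Pathway VALUES
        (1, 'toy lipopolysaccharide pathway', 19),
        (2, 'toy other pathway', 19);
    INSERT INTO Reaction VALUES (100), (200);
    INSERT INTO PathwayReaction VALUES (1, 100), (2, 200);
    INSERT INTO Protein VALUES
        (1000, 'ProtA', 10.0), (1001, 'ProtB', 20.0), (1002, 'ProtC', 30.0);
    INSERT INTO EnzymaticReaction VALUES (100, 1001), (100, 1000), (200, 1002);
""")

select_list = """
    SELECT D.Name AS PathwayName, J.WID AS ProteinWID,
           J.Name AS ProteinName, J.MolecularWeightCalc AS MolecularWeightCalc
"""

# Form 1: comma-separated tables, join conditions in WHERE (as on the slide).
comma_style = select_list + """
    FROM Pathway D, PathwayReaction F, Reaction G, EnzymaticReaction H, Protein J
    WHERE D.WID = F.PathwayWID AND F.ReactionWID = G.WID
      AND G.WID = H.ReactionWID AND H.ProteinWID = J.WID
      AND D.DataSetWID = 19
      AND D.Name LIKE '%lipopolysaccharide%'
    ORDER BY ProteinName
"""

# Form 2: explicit INNER JOIN ... ON; WHERE holds only the filters.
join_style = select_list + """
    FROM Pathway D
    INNER JOIN PathwayReaction F ON D.WID = F.PathwayWID
    INNER JOIN Reaction G ON F.ReactionWID = G.WID
    INNER JOIN EnzymaticReaction H ON G.WID = H.ReactionWID
    INNER JOIN Protein J ON H.ProteinWID = J.WID
    WHERE D.DataSetWID = 19
      AND D.Name LIKE '%lipopolysaccharide%'
    ORDER BY ProteinName
"""

comma_rows = conn.execute(comma_style).fetchall()
join_rows = conn.execute(join_style).fetchall()
print(comma_rows == join_rows)
for row in join_rows:
    print(row)
```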
35. Exercises
Together:
1. How many datasets are there in PublicHouse?
2. What is the number of genes in S. aureus
(SAUR158878Cyc)?
(10 min)
On your own:
3. List the coding region start and ends for all genes that
code for proteins in the SAUR158878Cyc dataset
4. How many biochemical reactions are there in each
pathway (of any type) in the EcoCyc (=E. coli) dataset?
(20 min)
36. In Summary…
Knowing the db schema is essential
SELECT statement all you need to know
Remote databases good for exploring a schema at
low cost
No installation…
But:
Performance can be poor
Restrictions on data set
Better to install locally if “real work” to be performed
Remember: SQL gives you the power to return results
directly into your favorite tool!
37. Don’t Forget The
Class Evaluation
40. Setting Up Data Source Names
Steps
1. Make sure you have the requisite driver (next slide)
2. Create a Data Source Name (Windows only)
3. Write your query
4. Get the results back into Excel!
See Lane videorecorded class Managing
Experiment Data Using Excel and Friends:
Digging Out from Under the Avalanche for lots
more details.
41. Step 1: Getting Drivers
Essential for SQL Querying
A driver is a piece of software that lets your
operating system talk to a database
Installed drivers visible in ODBC manager
Each database engine (Oracle, MySQL, etc)
requires its own driver
“data connectivity” tool
Generally must be installed by user
Drivers are needed by Data Source Name
tool and querying programs
Require (simple) installation
42. MySQL Driver: Needed to Query
MySQL Databases
Windows: Download MySQL
Connector/ODBC 3.51 here
Must be installed for direct querying using
e.g. Excel
Not necessary if you are using the MySQL Query
Browser
43. Oracle Driver: Needed to Query
Oracle Databases
Installing “client” software will also install
driver
Windows: Download 10g Client here
Mac: Download 10g Client here
Free Oracle user account required to
download
Must be installed if you are querying
using MS Query or any other query
browser involving Oracle
44. Step 2: Creating a Data Source Name
A Data Source Name (DSN) tells programs
on your PC where and how to query a
database
Populating the fields:
Data Source Name: Unique name of your choice
Description: anything
Server: exactly as given by the database provider
Port number: as specified by database provider
Defaults: MySQL: 3306; Oracle: 1521; MS Access: N/A
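The same pieces of information (driver, server, port) can also be given to a program as a single ODBC connection string instead of a saved DSN. The sketch below only assembles such a string and does not connect to anything. The keyword names (DRIVER, SERVER, PORT, DATABASE, UID, PWD) and the driver name "MySQL ODBC 3.51 Driver" are assumptions based on common MySQL Connector/ODBC usage; other drivers, such as Oracle's, use different keywords. The server, database and user values are hypothetical placeholders.

```python
MYSQL_DEFAULT_PORT = 3306  # default MySQL port, as listed on the slide


def build_mysql_odbc_string(server, database, user, password="",
                            port=MYSQL_DEFAULT_PORT,
                            driver="MySQL ODBC 3.51 Driver"):
    """Assemble an ODBC connection string from DSN-style fields.

    Keyword names and the default driver name are assumptions; check
    the name shown in your ODBC manager and the driver documentation.
    """
    fields = [
        ("DRIVER", "{" + driver + "}"),
        ("SERVER", server),        # exactly as given by the database provider
        ("PORT", str(port)),       # as specified by the database provider
        ("DATABASE", database),
        ("UID", user),
        ("PWD", password),
    ]
    return ";".join(f"{key}={value}" for key, value in fields)


# Hypothetical placeholder values, not a real server or account.
conn_str = build_mysql_odbc_string(
    server="db.example.org",
    database="example_db",
    user="example_user",
)
print(conn_str)
```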
45. Resources – SQL
eBook: Beginning SQL
eBook: Learning SQL
46. Lots More Resources From Lane
48. How To Get Accounts for Direct
SQL Querying
Direct Querying of Selected Bioinformatics Databases
Database | How? | DB Engine
BioWarehouse | http://biowarehouse.ai.sri.com/ : get account for access to PublicHouse (publicly accessible installation of BioWarehouse; see http://biowarehouse.ai.sri.com/PublicHouseOverview.html) | MySQL
Ensembl | http://www.ensembl.org/info/data/download.html | MySQL
Mouse Genome Database | Mail mgi-help@informatics.jax.org to ask for an account | Sybase
49. Example Querying with MySQL Query
Browser
Free
MySQL only
Facilitates writing of a SQL query (graphical)
Get it at http://www.mysql.com/products/tools/querybrowser/
[Screenshot labels: Execute statement; Query statement; Table descriptions]
Editor's Notes
select marker.marker_id, marker_map_location.chromosome_name, marker_map_location.position, map.map_name
from ((marker marker INNER JOIN marker_map_location marker_map_location ON marker.marker_id = marker_map_location.marker_id) INNER JOIN map map ON marker_map_location.map_id = map.map_id)
where (marker_map_location.chromosome_name = '19')
SELECT D.Name as PathwayName,J.WID as ProteinWID, J.Name as ProteinName, J.MolecularWeightCalc as MolecularWeightCalc
FROM Pathway D,PathwayReaction F, Reaction G, EnzymaticReaction H, Protein J where D.WID = F.PathwayWID and F.ReactionWID = G.WID
and G.WID = H.ReactionWID and H.ProteinWID = J.WID and D.DataSetWID=19
and D.Name like "%lipopolysaccharide%"
order by ProteinName