Mining Single Nucleotide Polymorphisms from public sequence databases. Gary Barker IACR Long Ashton
What are Single Nucleotide Polymorphisms (SNPs)?
ATGGTAA G CCTGAG C TGACTTAGCGT-AT
ATGGTAA A CCTGAG T TGACTTAGCGTCAT
       snp      snp         indel
SNPs result from replication errors and DNA damage.
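The column-by-column comparison shown in the alignment above can be sketched in Python. This is only an illustration, assuming the two sequences are already aligned to the same length and that a gap is written as "-"; the function name `classify_differences` is hypothetical and not part of any tool mentioned in these slides.

```python
def classify_differences(seq_a: str, seq_b: str) -> list[tuple[int, str, str, str]]:
    """Compare two aligned sequences column by column.

    Returns (1-based column, base in seq_a, base in seq_b, kind), where kind
    is "snp" for a base substitution and "indel" when one sequence has a gap.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    differences = []
    for column, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a == b:
            continue  # identical column, nothing to report
        kind = "indel" if "-" in (a, b) else "snp"
        differences.append((column, a, b, kind))
    return differences


# The two sequences from the slide, with the highlighting spaces removed.
top = "ATGGTAAGCCTGAGCTGACTTAGCGT-AT"
bottom = "ATGGTAAACCTGAGTTGACTTAGCGTCAT"

for diff in classify_differences(top, bottom):
    print(diff)
```

With the slide's sequences this reports two substitutions (columns 8 and 15) and one gap column (27), matching the snp / snp / indel labels.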
Why are these polymorphisms useful? It’s sometimes possible to correlate a SNP with a particular trait. This is known as association genetics.
Genotype all individuals in both populations for thousands of SNPs.
Disease-resistant population: ATG A TTATAG
Disease-susceptible population: ATG T TTATAG
Resistant people all have an ‘A’ at position 4 in gene X, while susceptible people have a ‘T’.
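The gene X example can be turned into a toy Python sketch. It only looks for the clear-cut case on the slide, where the two populations share no allele at a position; a real association study would apply a statistical test across many individuals and SNPs. The genotype lists and function names below are hypothetical, and all sequences are assumed to have the same length.

```python
from collections import Counter


def allele_counts(sequences, position):
    """Count the bases seen at a 1-based position across a group of sequences."""
    return Counter(seq[position - 1] for seq in sequences)


def perfectly_associated_positions(resistant, susceptible):
    """Return 1-based positions where the two groups share no allele."""
    hits = []
    for position in range(1, len(resistant[0]) + 1):
        res = allele_counts(resistant, position)
        sus = allele_counts(susceptible, position)
        if not set(res) & set(sus):  # no base in common between the groups
            hits.append((position, dict(res), dict(sus)))
    return hits


# Hypothetical genotypes for gene X, modelled on the slide's example.
resistant = ["ATGATTATAG", "ATGATTATAG", "ATGATTATAG"]
susceptible = ["ATGTTTATAG", "ATGTTTATAG", "ATGTTTATAG"]

print(perfectly_associated_positions(resistant, susceptible))
```

For these made-up genotypes the only position returned is 4, where every resistant sequence carries ‘A’ and every susceptible sequence carries ‘T’.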
To use SNPs, you first have to find them. Poorly studied organisms: sequence many ‘loci’ (different places in the genome) for many individuals. Many well-studied organisms: the required data is already present in public sequence databases; it just needs to be processed.
Number of ESTs* in EMBL database. *ESTs (expressed sequence tags) are single-pass (often partial) gene sequences.
Mining SNPs from EST sequences in the database
AutoSNP (a Perl script) can find likely SNPs in data sets downloaded from public databases.
1) Marks up only those polymorphisms where each allele is supported by at least two independent sequences. This filters out most sequencing errors.
2) Adds further confidence scores based on co-segregation.
3) Writes the results to HTML reports.
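The redundancy rule in step 1 and a co-segregation check in the spirit of step 2 can be sketched in Python. This is not AutoSNP's code (AutoSNP is a Perl script); it is a minimal illustration that assumes the reads are already aligned to equal length. The function names, the `min_support` parameter, the example reads, and the simple "same split of reads" score are all assumptions made for this sketch.

```python
from collections import Counter


def candidate_snps(alignment, min_support=2):
    """Return {1-based column: {base: count}} for columns where at least two
    different bases are each seen in min_support or more sequences."""
    snps = {}
    for column in range(len(alignment[0])):
        counts = Counter(seq[column] for seq in alignment if seq[column] in "ACGT")
        supported = {base: n for base, n in counts.items() if n >= min_support}
        if len(supported) >= 2:
            snps[column + 1] = supported
    return snps


def cosegregation_counts(alignment, snp_columns):
    """For each candidate column, count the other candidate columns that split
    the sequences into exactly the same groups (an assumed, simplified score)."""
    def partition(column):
        groups = {}
        for index, seq in enumerate(alignment):
            groups.setdefault(seq[column - 1], set()).add(index)
        return frozenset(frozenset(members) for members in groups.values())

    partitions = {column: partition(column) for column in snp_columns}
    return {
        column: sum(
            1 for other in snp_columns
            if other != column and partitions[other] == partitions[column]
        )
        for column in snp_columns
    }


# Hypothetical aligned reads. The lone 'C' at column 4 of the last read is
# seen only once, so it is treated as a likely sequencing error.
reads = [
    "ATGGTAAGCCTGAGCTGA",
    "ATGGTAAGCCTGAGCTGA",
    "ATGGTAAACCTGAGTTGA",
    "ATGGTAAACCTGAGTTGA",
    "ATGCTAAGCCTGAGCTGA",
]

snps = candidate_snps(reads)
print(snps)
print(cosegregation_counts(reads, sorted(snps)))
```

In this made-up alignment, columns 8 and 15 pass the two-sequences-per-allele rule and divide the reads into the same two groups, so each supports the other; column 4 is filtered out.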
Accessing AutoSNP results 1) Search by accession number:
Accessing AutoSNP results 2) Search with a query sequence
Current AutoSNP approach:
Cluster sequences (d2cluster) -> Align and find SNPs (cap3) -> MySQL database of Accession # / SNP report #, e.g.
gi|11117503 | snip_1.htm
gi|12217138 | snip_2.htm
Query with Accession -> MySQL database -> Links to existing SNP reports
Sequence query -> Blast client -> Matching Accessions -> Links to existing SNP reports
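The accession-to-report lookup at the centre of this approach can be sketched in Python. The slide names MySQL; the sketch swaps in the standard-library `sqlite3` module with an in-memory database so that it runs without a server. The table name `snp_reports`, its column names, and the accession `gi|00000000` are hypothetical; only the two accession / report pairs come from the slide.

```python
import sqlite3


def build_report_index(rows):
    """Create an in-memory table mapping accession numbers to SNP report files."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snp_reports (accession TEXT PRIMARY KEY, report TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO snp_reports VALUES (?, ?)", rows)
    conn.commit()
    return conn


def reports_for(conn, accessions):
    """Return {accession: report} for the accessions that have a stored report."""
    found = {}
    for accession in accessions:
        row = conn.execute(
            "SELECT report FROM snp_reports WHERE accession = ?", (accession,)
        ).fetchone()
        if row is not None:
            found[accession] = row[0]
    return found


conn = build_report_index([
    ("gi|11117503", "snip_1.htm"),
    ("gi|12217138", "snip_2.htm"),
])

# Direct query by accession number.
print(reports_for(conn, ["gi|11117503"]))

# Accessions as they might come back from a sequence search; the second one
# is made up and has no report, so it is silently skipped.
print(reports_for(conn, ["gi|12217138", "gi|00000000"]))
```

Both query routes on the slide end in the same lookup: a direct accession query supplies one key, while a sequence query would first produce a list of matching accessions and then look each of them up.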
Desirable:
Client supplied query sequence (ATAGCGTACG……) -> Blast search (data direct from EBI?) -> Build contigs of results -> Detect eSNPs -> Client gets SNP report(s) (html) for all sequences matching query
Resources: data and processing power (large); processing power (medium); processing power (small)
< 10 seconds
Conclusions SNPs (single nucleotide polymorphisms) are abundant and useful genetic markers. Software exists to mine them from public data sets, but this doesn’t work in real time. GRID technology could help to deliver up-to-date alignments to users for any query sequence with putative SNPs marked up. Related useful features would include bootstrapped trees for each alignment, generated on the fly.