Data Mining for Bioinformatics
Data Mining for Bioinformatics Presentation Transcript

  • 1. Data Mining for Bioinformatics Craig A. Struble, Ph.D. Marquette University [email_address]
  • 2. Overview
    • Survey of KDD for Bioinformatics
      • KDD overview
      • Bioinformatics data
      • Survey of KDD steps
    • Case Study: miRNA Project
      • Identifying the problem
      • Data collection with Perl
      • Selection/cleansing
      • Future work…
    • Next Time
  • 3. Knowledge Discovery in Databases
    • [Diagram of the KDD process. Labels: Data; Cleaning, Integration; Data Warehouse; Selection, Transformation; Prepared data; Data Mining; Patterns; Evaluation, Visualization; Knowledge; Knowledge Base]
  • 4. Bioinformatics Data
    • DNA Sequences
    • Genes
      • Location, introns, exons, function, etc.
    • Gene products
      • RNA, Proteins
    • Pathways
      • Signaling, metabolic, genomic, etc.
  • 5. Bioinformatics Data
    • Experimental
      • Gene expression, knockouts, etc.
    • Literature
      • Diseases, viruses, bacteria
      • Organisms
      • Textbooks
    • Expert knowledge
      • Unpublished
      • Insights
      • Etc.
  • 6. KDD for Bioinformatics
    • [Diagram of the KDD process applied to bioinformatics. Labels: Genomic, Literature, Experimental; Data Warehouse; Prepared data; Data; Normalization, Curation, Validation, Etc.; Clustering, SVMs, ILP, Classification, Etc.; Patterns; Evaluation, Visualization; Knowledge; Expert Knowledge; Sampling, Expressed Genes, Homologs, Etc.; Often not explicitly implemented]
  • 7. Data Collection and Cleansing
    • Perl scripts (BioPerl)
    • From literature
      • Read a paper and enter the information
      • Supplemental data for papers
    • Public databases
      • GenBank
      • Stanford Microarray Database
      • SWISS-Prot
      • Etc.
  • 8. Data Cleansing
    • Remove invalid, redundant, or otherwise useless data
    • Extrapolate missing data values
    • Data formatting/transformation
      • Binning, normalization, scaling, etc.
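The normalization and binning steps named above can be sketched in a few lines of Perl. This is only an illustration, not code from the project: the sub names `min_max_scale` and `equal_width_bins` and the intensity values are made up for the example, and min-max scaling with equal-width bins is just one of several possible choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max);

# Min-max scaling: map each value linearly onto the range [0, 1].
sub min_max_scale {
    my @values = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    # Constant input: avoid dividing by zero, return all zeros.
    return (0) x @values if $hi == $lo;
    return map { ($_ - $lo) / ($hi - $lo) } @values;
}

# Equal-width binning: assign each value to one of $bins intervals,
# numbered 0 .. $bins - 1.
sub equal_width_bins {
    my ($bins, @values) = @_;
    my @binned;
    for my $scaled (min_max_scale(@values)) {
        my $bin = int($scaled * $bins);
        $bin = $bins - 1 if $bin == $bins;    # the maximum goes in the last bin
        push @binned, $bin;
    }
    return @binned;
}

# Hypothetical expression intensities, for illustration only.
my @expression = (120, 480, 1035, 2210, 3000);
my @scaled     = min_max_scale(@expression);
my @binned     = equal_width_bins(3, @expression);

printf "%6d  %.3f  bin %d\n", $expression[$_], $scaled[$_], $binned[$_]
    for 0 .. $#expression;
```

Real microarray data usually needs a more careful normalization (for example a log transform first); the point here is only the shape of the transformation step.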
  • 9. Data Selection
    • Database queries for specific genes, organisms, sequences, etc.
    • Statistical analysis (microarray)
    • Random sampling
    • Etc.
  • 10. Data Mining Techniques
    • Statistical
      • Principal Component Analysis
      • ANOVA
      • Outlier analysis
      • Discrimination
      • Some clustering techniques (K-Means)
  • 11. Data Mining Techniques
    • Machine Learning
      • Neural Networks
      • Support Vector Machines
      • Decision Trees
      • Inductive Logic Programming
      • Fuzzy Logic
      • Rough Sets
      • Bayesian Belief Networks
  • 12. Data Mining Techniques
    • More Techniques
      • Clustering
      • Self Organizing Maps
      • Hidden Markov Models
      • Maximum Likelihood Estimators
      • Association Rules
  • 13. Kinds of Techniques
    • Unsupervised
      • Technique makes no assumption about a priori knowledge
      • Useful when not much known
    • Supervised
      • Attach class labels to data items
      • Identify (or learn about) properties that distinguish classes
  • 14. Kinds of Techniques
    • Unsupervised
      • Clustering
      • SOMs
    • Supervised
      • Support Vector Machines
      • Neural Networks
      • Bayesian Belief Networks
      • HMMs
  • 15. Kinds of Techniques
    • Supervised techniques require training
      • Data split into training and test sets
      • Many kinds of validation
        • N-way cross validation
        • Leave one out testing
        • Etc…
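A minimal sketch of how the training/test splits for N-way cross validation might be generated. The sub name `k_fold_splits` is hypothetical, and items are assigned to folds by index only; in practice the items would normally be shuffled first (for example with `List::Util::shuffle`). Leave-one-out testing falls out as the special case where the number of folds equals the number of items.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split item indices 0 .. $n - 1 into $k folds.
# Each fold serves as the test set exactly once; the rest is training data.
sub k_fold_splits {
    my ($n, $k) = @_;
    my @splits;
    for my $fold (0 .. $k - 1) {
        my (@train, @test);
        for my $i (0 .. $n - 1) {
            if ($i % $k == $fold) { push @test,  $i }
            else                  { push @train, $i }
        }
        push @splits, { train => \@train, test => \@test };
    }
    return @splits;
}

# 3-way cross validation over 6 items.
my @splits = k_fold_splits(6, 3);
for my $split (@splits) {
    print "train: @{ $split->{train} }  test: @{ $split->{test} }\n";
}

# Leave-one-out: as many folds as items.
my @leave_one_out = k_fold_splits(6, 6);
```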
  • 16. Visualization of Results
    • Graphs/Charts
    • Rules
      • If expression of X < 1035, then tissue is cancerous
    • Largely dependent on the technique used
  • 17. Case Study: miRNA Project
    • Started Jan, 2002
    • Participants
      • Dr. Craig Struble
      • Dr. Stephen Munroe
      • Dr. John Simms
      • Parthav Jailwala
      • Peigang Li
    • http://bistro.mscs.mu.edu/miRNA
  • 18. Case Study: miRNA Project
    • Lee, R. C. & Ambros, V. An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862-864 (2001).
    • Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. Identification of novel genes coding for small expressed RNAs. Science 294, 853-858 (2001).
    • Hutvágner, G. et al. A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293, 834-838 (2001).
    • N.C. Lau, Lee P. Lim, Earl G. Weinstein, David P. Bartel. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294, 858-86 (2001).
  • 19. Research Questions
    • Can we identify features of existing miRNAs that can be used to predict the existence of other miRNA genes?
    • Which mRNA (messenger RNA) are targeted by miRNAs?
    • What other family-wide behavioral and structural questions can be answered about miRNAs?
  • 20. Current Implementation
    • [Diagram of the current pipeline. Labels: Genbank; Perl Script (three times); miRNA library; BLAST Reports; Homolog library; Multiple Sequence Alignment; Data warehouse; Data Selection/Cleansing; Initial mining and cleansing]
  • 21. Perl
    • Practical Extraction and Report Language
    • Language of choice for many bioinformaticians
    • Excellent support for parsing/transforming data
    • http://www.perl.com
  • 22. Data Collection with Perl E.G. Using Entrez
  • 23. Data Collection with Perl Construct a URL to search and access information in Entrez
  • 24. Data Collection with Perl
    • Use LWP module
      • Makes network connections easy
    • Use BioPerl ( http://www.bioperl.org )
      • Perl modules/objects for handling bioinformatics data
      • Handles connections to databases
  • 25. Sample Perl Script
        #!/usr/local/bin/perl
        #
        # Simple Entrez Query in Perl
        # Craig A. Struble
        #

        # For internet requests and protocols
        use LWP;

        # A user agent for testing
        my $ua = LWP::UserAgent->new;
        $ua->agent('miRNA/0.1 ');

        # URL base for Entrez search
        my $NCBI_ENTREZ = 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?';
  • 26. Script (cont.)
        # Building up the URL for the Entrez Search
        my $search_URL = $NCBI_ENTREZ    # URL Base
            . 'cmd=Search'               # Command
            . '&db=nucleotide'           # Database
            . '&dispmax=100'             # Max results
            . '&term=miRNA'              # Search term
            . '&doptcmdl=FASTA';         # Result format

        # Make an HTTP GET request for an Entrez search
        my $req = HTTP::Request->new(GET => $search_URL);
        $req->push_header(Connection => 'Keep-Alive');

        # Get the response
        my $res = $ua->request($req);
  • 27. Script (cont.)
        # Check the response. If it's OK, print out the content
        if ($res->is_success) {
            print $res->content;
        } else {
            print $res->error_as_HTML;
            exit 1;
        }
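The script concatenates the query string by hand, which works for a one-word term like miRNA but breaks once a term contains spaces or brackets. Below is a hedged sketch of building the same kind of URL with percent-encoding, using core Perl only. The helper names `uri_escape_simple` and `build_query_url` are hypothetical, the search term is just an illustration, and the sketch assumes ASCII input; the CPAN modules URI and URI::Escape do this more thoroughly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent-encode every character except the RFC 3986 "unreserved" set.
# Assumes ASCII input; non-ASCII text would need UTF-8 encoding first.
sub uri_escape_simple {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Build "base?key=value&key=value" from an ordered list of key/value pairs.
sub build_query_url {
    my ($base, @pairs) = @_;
    my @parts;
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        push @parts, uri_escape_simple($key) . '=' . uri_escape_simple($value);
    }
    return $base . join('&', @parts);
}

# Same parameters as the slides, with a multi-word term for illustration.
my $url = build_query_url(
    'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?',
    cmd      => 'Search',
    db       => 'nucleotide',
    dispmax  => 100,
    term     => 'miRNA AND Homo sapiens[Organism]',
    doptcmdl => 'FASTA',
);
print "$url\n";
```

The pairs are passed as a list rather than a hash so that the parameter order in the URL stays the same as in the call.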
  • 28. Sample Result
        <input name="showndispmax" type="hidden" value="100"><input name="page" type="hidden" value="0"></table></td></tr>
        </table><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><input name="uid" type="checkbox" value="17646034"><b>1: </b>AJ421749. Homo sapiens micr...[gi:17646034]</td>
        <td align="right"><SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&cmd=Display&dopt=nucleotide_pubmed&from_uid=17646034">PubMed, </a></SPAN>
        <SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&cmd=Display&dopt=nucleotide_taxonomy&from_uid=17646034">Taxonomy</a></SPAN>
        </td>
        </tr></table></dt></dl><pre>>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
        TTCACAGTGGCTAAGTTCCGCT
        </pre><dl><dt><table cellpadding="0" cellspacing="0" width="100%"><tr><td><input name="uid" type="checkbox" value="17646061"><b>2: </b>AJ421776. Drosophila melano...[gi:17646061]</td>
  • 29. Parsing Result
    • Result is big, ugly HTML file
    • Need to take out data in <pre> tags
    • Fortunately, Perl can come to the rescue!
  • 30. Parsing Result with Perl
        #!/usr/local/bin/perl
        # Use an HTML parser
        use HTML::TreeBuilder;

        # Extract FASTA entries from each file on the command line
        foreach my $file_name (@ARGV) {
            # Build an HTML parse tree
            my $tree = HTML::TreeBuilder->new;
            $tree->parse_file($file_name);

            # FASTA entries are in PRE tags
            my @entries = $tree->find_by_tag_name('pre');

            # Print out each entry
            foreach my $entry (@entries) {
                my @children = $entry->content_list;
                print $children[0] . "\n";    # first child is text content
            }

            # Free the parse tree before moving on to the next file
            $tree->delete;
        }
  • 31. Processed Results
        >gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
        TTCACAGTGGCTAAGTTCCGCT
        >gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
        TCAGTCTTTTTCTCTCTCCTA
        >gi|17646060|emb|AJ421775.1|DME421775 Drosophila melanogaster microRNA miR-13b-2
        TATCACAGCCATTTTGACGAGT
        >gi|17646059|emb|AJ421774.1|DME421774 Drosophila melanogaster microRNA miR-13b-1
        TATCACAGCCATTTTGACGAGT
        >gi|17646058|emb|AJ421773.1|DME421773 Drosophila melanogaster microRNA miR-13a
        TATCACAGCCATTTTGATGAGT
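Once the FASTA entries are extracted, a later cleansing step typically needs them as records rather than raw text. The sketch below reads FASTA text into a list of hashes with core Perl only. The sub name `parse_fasta` and the record keys are assumptions made for this example; in the project itself, BioPerl's Bio::SeqIO (used later in the deck) would be the more natural choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse FASTA text into a list of records:
#   { id => ..., description => ..., sequence => ... }
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*$/;                 # skip blank lines
        if ($line =~ /^>(\S+)\s*(.*)$/) {
            # Header line: the ID runs up to the first whitespace.
            push @records, { id => $1, description => $2, sequence => '' };
        } elsif (@records) {
            # Sequence line: strip whitespace and append to the current record.
            $line =~ s/\s+//g;
            $records[-1]{sequence} .= $line;
        }
    }
    return @records;
}

# Two entries taken from the processed results.
my $fasta = <<'END';
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
END

my @records = parse_fasta($fasta);
for my $record (@records) {
    printf "%s\t%d nt\n", $record->{id}, length $record->{sequence};
}
```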
  • 32. Getting BLAST Reports
    • Can automate getting BLAST reports with Perl
    • URL format documentation is available at
      • http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
    • Perl code not displayed
  • 33. Parsing BLAST Reports
    • Use BioPerl Bio::Tools::BPlite
    • Find high scoring pairs that contain surrounding sequence
      • BLAST also reports original sequence hits
    • Extract out matching sequence with up and downstream surrounding sequence
  • 34. Perl Script
        #!/usr/local/bin/perl
        #
        # Create homolog database from BLAST reports
        # Author: Craig A. Struble

        # Various BioPerl modules to use
        use Bio::Tools::BPlite;
        use Bio::DB::GenBank;
        use Bio::SeqIO;
        use Bio::Seq;
  • 35. Script (cont.)
        ###############################################################################
        # Function: rev_comp
        # Description: Calculates the reverse complement of a DNA sequence.
        ###############################################################################
        sub rev_comp {
            my @seqs;
            # Work on copies so the caller's sequences are not modified
            foreach my $seq (@_) {
                my $rc = reverse $seq;
                $rc =~ tr/AaCcTtGg/TtGgAaCc/;
                push @seqs, $rc;
            }
            # wantarray checks whether we were called in list context
            return wantarray ? @seqs : $seqs[0];
        }
  • 36. Script (cont.)
        ###############################################################################
        # Function: around_seq
        # Description: Returns the upstream and downstream sequence around an HSP
        # Parameters: hsp - the high scoring pair
        #             seq - the sequence of reference
        #             upstream - number of basepairs upstream
        #             downstream - number of basepairs downstream
        ###############################################################################
        sub around_seq {
            my ($hsp, $seq, $upstream, $downstream) = @_;
            # Code deleted due to space
            return $subseq;
        }
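The body of around_seq is not shown in the slides, so the following is not the author's code. It is a rough sketch of the general idea (take a hit's coordinates and widen them by a number of flanking bases, clamped to the ends of the sequence) on a plain string rather than BioPerl objects. The name `flanking_window` and its parameters are hypothetical, coordinates are assumed to be 1-based and inclusive, and strand handling is ignored: for a minus-strand hit, the upstream and downstream amounts would have to be swapped and the result reverse complemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the hit plus flanking bases.
#   $sequence    plain sequence string
#   $start/$end  1-based, inclusive coordinates of the hit
#   $upstream    bases to include before the hit
#   $downstream  bases to include after the hit
sub flanking_window {
    my ($sequence, $start, $end, $upstream, $downstream) = @_;

    # Accept coordinates in either order.
    ($start, $end) = ($end, $start) if $start > $end;

    # Widen the window, clamping to the ends of the sequence.
    my $from = $start - $upstream;
    $from = 1 if $from < 1;
    my $to = $end + $downstream;
    $to = length($sequence) if $to > length($sequence);

    # substr uses 0-based offsets.
    return substr($sequence, $from - 1, $to - $from + 1);
}

my $sequence = 'AAAACCCCGGGGTTTT';

# Hit at 7..10 with 3 bases upstream and 2 downstream.
my $window = flanking_window($sequence, 7, 10, 3, 2);
print "$window\n";

# A hit near the start: the window is clamped at position 1.
my $clamped = flanking_window($sequence, 2, 3, 5, 1);
print "$clamped\n";
```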
  • 37. Script (cont.)
        # Open the BLAST report
        open(my $blast_fh, '<', $ARGV[0]) or die "open failed: $!";
        my $report = Bio::Tools::BPlite->new(-fh => $blast_fh);
        my $gb = Bio::DB::GenBank->new;

        # Open output file
        my $out = Bio::SeqIO->new('-file'   => ">$ARGV[1]",
                                  '-format' => 'fasta');

        # Amount up and downstream to get
        my $upstream   = $ARGV[2];
        my $downstream = $ARGV[3];
  • 38. Script (cont.)
        while (my $sbjct = $report->nextSbjct) {
            # Split the subject name on "|" or a space to get the accession
            my ($db, $accv, $acc, $rest) = split /\|| /, $sbjct->name;
            my $seq = $gb->get_Seq_by_acc($acc);
            print $seq->accession_number . "\n";

            while (my $hsp = $sbjct->nextHSP) {
                my $seqstr = around_seq($hsp, $seq, $upstream, $downstream);
                my $subseq = Bio::Seq->new(
                    '-seq'              => $seqstr,
                    '-accession_number' => $seq->accession_number,
                    '-display_id'       => $seq->accession_number . "_"
                                           . $hsp->subject->start . ".."
                                           . $hsp->subject->end . "_"
                                           . $hsp->subject->strand
                );
                $out->write_seq($subseq);
            }
        }
  • 39. Results
        >AC084471_10966..10987_-1
        TCCCCCTTGGTCCCTTCTCATATACCATACTACATTTCTTTCAAAACTAACCGGGATTTT
        TCAGGGGATTGCAGGATGATGGCTCTACACTGGGGTACGGTGAGGTAGTAGGTTGTATAG
        TTTAGAATATTACTCTCGGTGAACTATGCAAGTTTCTACCTCACCGAATACCAGGTTCTC
        AACTGCATCGTGTCAATTACTCTCAAACGACGGACACCTTCA
        >AF274345_1763..1784_1
        CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
        CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
        TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
        CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
        >Z70203_12425..12446_-1
        CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
        CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
        TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
        CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
  • 40. Multiple Sequence Alignment
    • Currently using clustalw/clustalx
    • Eventually generate web pages with sequence alignments
    • Investigate conserved regions of the surrounding sequence
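One simple way to look for conserved regions is to scan an alignment column by column and mark the columns where every sequence agrees, much like the line of asterisks under a CLUSTAL alignment. The sketch below is an assumption about how that could be done, not part of the project: the sub name `conservation_line` is hypothetical, and the two miR-13 sequences from the processed results are treated as already aligned because they have the same length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a string with '*' under each fully conserved, non-gap column
# of an alignment and ' ' under every other column.
sub conservation_line {
    my @aligned = @_;
    my $length = length $aligned[0];
    for my $seq (@aligned) {
        die "aligned sequences must have equal length\n"
            unless length($seq) == $length;
    }

    my $line = '';
    for my $col (0 .. $length - 1) {
        # Collect the distinct characters seen in this column.
        my %seen;
        $seen{ uc(substr($_, $col, 1)) } = 1 for @aligned;
        my @chars = keys %seen;
        $line .= (@chars == 1 && $chars[0] ne '-') ? '*' : ' ';
    }
    return $line;
}

# miR-13b-1 and miR-13a, assumed here to be aligned without gaps.
my @aligned = (
    'TATCACAGCCATTTTGACGAGT',
    'TATCACAGCCATTTTGATGAGT',
);
my $marks = conservation_line(@aligned);
print "$_\n" for @aligned, $marks;
```

Runs of asterisks in the output point to stretches worth a closer look; a real analysis would also weigh partially conserved columns rather than only exact matches.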
  • 41. Multiple Sequence Alignment
  • 42. Future Work
    • Process homolog library with RNA fold prediction software (mFold)
    • Collect together fold structure information and other information
    • Transform into logical representation for ILP analysis
    • Store data in a database (Postgres)
  • 43. Next Time
    • Applications of
      • Clustering
      • Neural Networks
      • Support Vector Machines
      • Etc.
    • Available tools to use, etc.