Use of NCBI Databases in qPCR Assay Design


Published in: Science, Technology

  • Sequences can be pasted in, or an accession number can be entered to download the sequence. This is NOT limited to RefSeq sequences; the tool will find any accession number.
  • Note that PrimerQuest does not look for introns, and it treats lower- and upper-case letters the same way. Clicking the orange buttons is preferred if possible. Default amplicon size: 200-1000 bp for general PCR, 75-150 bp for qPCR.
  • The default for “set Design parameters for….” is General PCR. Click the appropriate assay button first, because the page reloads with all defaults and removes any changes you made. The first thing to try is simply increasing the number of assays returned and seeing whether one matches the parameters of your design.
  • Design across junction(s) allows you to specify intron junctions or to focus design on a specific location
  • Just some examples of using parameters to restrict design to specific regions
  • This is a place to discuss particularly GC- or AT-rich sequences. For AT-rich sequences, the maximum oligo length also needs to be increased.
  • The first link is the URL for the NCBI homepage. The first link to the right under Popular Resources is the link for the BLAST website. The BLAST program is directly accessible by using the link on the bottom of the slide.
  • There are two different types of alignments: local and global. As the name implies, BLAST is a local alignment tool. A global alignment tries to align every letter of the query sequence with a sequence in the database. A local alignment tries to find the best match without aligning the entire sequence; instead, BLAST breaks the query into smaller parts and aligns these smaller sequences. This approach is a heuristic algorithm, meaning one that uses knowledge about the problem to create a faster method of solving it. This reduces the search time but will not always retrieve every sequence that has high identity. BLAST can also align two sequences to each other, but it is generally not the best method for this type of alignment. BLAST can restrict a search with an Entrez query, for example aligning a query only to a specific gene across all species in the database.
  • This section has the most popular BLAST tools that give users the most flexibility.
  • The concept of splitting the query into words is essential to understanding how BLAST works. The sequence above is 35 bases long. The box positioned over the first 7 letters marks the first 7-letter word. The next 7-letter word is the red box shifted one base to the right. Each subsequent word is shifted one more base to the right until the end of the sequence is reached.
  • Homology indicates that proteins are related, which usually implies similar function, although this is not required. An example of homologous proteins is human actin and mouse actin. Identity is the percent of bases (or amino acids) in aligned sequences that are exactly the same. Similarity is the percent of aligned amino acids that are either identical or have conserved properties. Hits are the sequences returned by BLAST that aligned to the query. Homology is not the same as identity or similarity! These terms are frequently used improperly.
  • The BLAST raw score is converted to a bit score for each alignment using parameters based on statistics described in the Karlin et al. paper. The details of the conversion are beyond the scope of this presentation. The Max bit score is the score of the alignment with the highest bit score for a hit. For example, a primer may align in several places within an mRNA sequence, each with a different percent identity. The match with the highest identity will have the max bit score, while the total score for that transcript will be the sum of all the matches. Blastn: For a given database, the bit score depends only on the alignment, the length of the sequence, and the length of the database. Technically, there are scoring matrices in blastn, but these are not frequently changed. A high bit score does not indicate that the query is unique. A database composed entirely of the same 30 nt repeat will give the same bit score for a perfect match to that 30 nt sequence as a database that contains the sequence only once.
  • This of course only works with genomic targets, but it is another chance to show some variations on BLAST that can give clearer results. The Database dropdown menu includes the first two radio button options at the top. Users can select the database from the dropdown menu and then click on the question mark to find out more about the sources of the data in this database, the molecule type, the update date, and the number of sequences. The Nucleotide collection (nr/nt) is the most common choice for nucleotide information; it contains information from GenBank as well as from the European (EMBL) and Japanese (DDBJ) databases. This database also includes information from the Protein Data Bank (PDB), which is a repository for solved biological structures (mostly proteins and nucleic acids). Nr stands for non-redundant; however, this database is full of redundancies. Reference RNA sequences are all RNA sequences that are curated by NCBI and deposited in RefSeq. RefSeq attempts to be non-redundant and more thoroughly annotated than the Nucleotide database. RefSeq genomic sequences is the DNA equivalent of the RNA RefSeq. 16S ribosomal RNA sequences is a new database that allows users to search just the 16S database of Bacteria and Archaea. The nucleotide collection does not contain these ribosomal RNAs.
  • Note that you can also increase the “expect threshold” to 1000
  • Always good to give a sense of what the BLAST alignments are like. The thin line indicates the two matches are on the same sequence contig. If the bar is short at the 3’ end of either sequence it is likely not a concern
  • The order that sequences are listed seems a bit random assuming they all have the same score.
  • Two checks here: 1. does match extend full length of primers? 2. what is the distance between primers? For genomic matches like this a large distance may indicate an intron
  • How to use the graphic interface
  • When looking at other off-target matches it may be a pseudogene. This one is non-transcribed and so likely not a problem for an expression assay
  • Some intron sequences hang around for a while so it’s not always clear if a pseudogene in an intron is a problem. Always best to avoid if possible but inform customer if necessary
  • SNP analysis on BLAST can be temperamental but it is the simplest analysis so we’ll keep that here
  • Remember to hover to get pop-up window then click on the rs number
  • An example of unhelpful frequency data for a SNP.
  • Examples of different frequencies and discussion of relative danger
  • The delta G value represents the likelihood an oligo will form a specific structure. The more negative the value, the more spontaneously the oligo will form this structure and remain in this conformation. Therefore, the more negative the value, the stronger the structure. For most analyses, we recommend values less negative than -9 kcal/mole. However, this value is relatively conservative, especially for longer oligos. PCR and qPCR reactions can often perform with delta G’s of up to -12 kcal/mole.
  • Use of NCBI Databases in qPCR Assay Design
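
The word-splitting and identity definitions in the notes above can be illustrated with a minimal Python sketch. The function names `split_into_words` and `percent_identity` are hypothetical and chosen only for this illustration; the query is the 35 nt example sequence from the slides, and the ACGT/ACGG pair is the identity example from the definitions slide. This only enumerates words and counts exact matches in an ungapped, equal-length comparison. It is not how BLAST itself scores or extends alignments.

```python
def split_into_words(query: str, word_size: int = 7) -> list[str]:
    """Return every overlapping word, shifting one base to the right each time."""
    if word_size <= 0:
        raise ValueError("word_size must be positive")
    # A query of length L yields L - word_size + 1 words (none if it is too short).
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def percent_identity(a: str, b: str) -> float:
    """Percent of positions that match exactly (ungapped, equal-length sequences)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


query = "CGATCGGGCATCACACAAAGTTATGTAGTAGAAAT"  # 35 nt example from the slides
words = split_into_words(query)
print(len(query), len(words), words[0], words[1])
print(percent_identity("ACGT", "ACGG"))
```

For a 35 nt query and a word size of 7, the list holds 35 - 7 + 1 = 29 words, which agrees with the count given on the slide.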

    1. 1. Integrated DNA Technologies Use of NCBI Databases in qPCR Assay Design Elisabeth Wagner, PhD Scientific Applications Specialist
    2. 2. 1 Session Outcomes  You will:  Learn which NCBI tools are useful for designing qPCR assays  Become proficient using tools for qPCR design in the IDT SciTools® suite  Navigate the features and tools available on the NCBI website  Obtain sequence information for your gene of interest  Perform a BLAST search for assay specificity  Search for SNPs  Understand how to proceed with a basic qPCR design
    3. 3. 2 qPCR Design Covers A Lot of Ground There are many uses for quantitative PCR. For some examples:  Gene expression  Copy number variation  Genotyping  Multi-species analysis  Splice variant specific (or common) expression We will address the general considerations for design in this session, and cover more specific examples later this afternoon.
    4. 4. 3 SciTools® Overview   Several Tools are available in the IDT SciTools® suite to assist with qPCR design  1. RealTime PCR Tool  2. PrimerQuest® Tool  3. OligoAnalyzer® Tool  4. PrimeTime® Predesigned qPCR Assay Database
    5. 5. 4 NCBI Databases Overview:  1. Obtain sequence information for your gene of interest-  NCBI Nucleotide or Gene  2. Perform a BLAST search for assay specificity  NCBI BLAST  3. Search for SNPs  NCBI dbSNP NCBI enables you to access all of this information necessary for design in one location.
    6. 6. 5 Using NCBI Databases for Custom qPCR Assay Design
    7. 7. NCBI Overview (National Center for Biotechnology Information)  Founded in 1988 as part of the United States National Library of Medicine  Houses a series of databases relevant to biotechnology and biomedicine  Curates GenBank, a database of over 1x10^12 bp of DNA sequences  Gene database, which integrates gene-specific information from numerous species  dbSNP, which is a database of reported Single Nucleotide Polymorphisms (SNPs)  Contains the BLAST sequence similarity search program  Maintains PubMed, a journal database for biomedical literature  Much, much more information! 6
    8. 8. NCBI Database Search: Sequence Information for qPCR Assay Design 7
    9. 9. NCBI Sequence Files Files:  Can be entered by anyone  May or may not be checked for accuracy  May contain contaminated sequence (plasmid or other)  May contain annotation errors Accession numbers:  Letters at the beginning indicate the type of file  Nucleotide sequences start with 1 or 2 letters: 8
    10. 10. The RefSeq Database  non-redundant  explicitly linked nucleotide and protein sequences  ongoing curation by NCBI staff and collaborators, with reviewed records indicated  includes data validation and format consistency  distinct accession numbers  all accessions include an underscore '_' character  Different versions are tracked 9
    11. 11. RefSeq Accession Numbers  mRNAs and Proteins  NM_123456 Curated mRNA  NP_123456 Curated Protein  NR_123456 Curated non-coding RNA  XM_123456 Predicted mRNA  XP_123456 Predicted Protein  XR_123456 Predicted non-coding RNA  Gene Records  NG_123456 Reference Genomic Sequence  Chromosome  NC_123455 Microbial replicons, organelle genomes, human chromosomes  AC_123455 Alternate assemblies  Assemblies  NT_123456 Contig  NW_123456 WGS Supercontig 10
    12. 12. Accessing Sequence Information in NCBI 11 NCBI
    13. 13. NCBI Gene Database Information: Gene Search 12
    14. 14. Sequence Data Searches Using Nucleotide  Sequence Files  mRNA and genomic  Transcript variants 13
    15. 15. GenBank Information 14
    16. 16. Data Retrieval: Graphics View 15
    17. 17. Data Retrieval: FASTA Sequence Format 16
    18. 18. 17 Using PrimerQuest® Tool for Custom qPCR Designs
    19. 19. 18 PrimerQuest® Tool for Generating Custom qPCR Designs Highly customizable tool
    20. 20. 19 You Can Use NCBI Accession Number or FASTA Sequence
    21. 21. 20 Once Sequence Entered, 3 Defaults Become Available Often you will need to adjust the parameters of the tool to meet experimental design requirements
    22. 22. 21 PrimerQuest® Tool Assay Output
    23. 23. 22 Parameter Changes Depend on the Assay Required Before changing anything, make sure you have selected the correct assay. Sometimes you simply need to increase the number of designs returned. It is unlikely that you will need to change these parameters.
    24. 24. 23 Directing the Design to a Specific Region Target a particular “junction”
    25. 25. 24 Examples Excluded region 260-280  Excluded region-probe 260-280  Target region 260-280
    26. 26. 25 Changing Primer/Probe Parameters If the target is particularly biased (AT or GC rich), you may need to change primer/probe parameters (i.e. length)
    27. 27. 26 Once Initial Design Completed, Back to NCBI Use NCBI tools to:  Check whether assay is specific (BLAST)  Ensure there are no SNPs to worry about (dbSNP) Use IDT OligoAnalyzer® Tool to:  Check primers (and probe) for secondary structure and dimer formation
    28. 28. 27 Using NCBI BLAST to Check for Primer Specificity
    29. 29. 28 What is BLAST?—Getting to BLAST   Or
    30. 30. 29 What is BLAST (Basic Local Alignment Search Tool)?  BLAST stands for Basic Local Alignment Search Tool and is provided by the National Center for Biotechnology Information (NCBI)  Aligns a user-defined query (sequence) to a wide variety of databases  Can translate the query or the database to align sequences  Can align 2 or more sequences together  Uses a heuristic algorithm to create alignments very quickly  Breaks sequences into “words” and searches the database for matches  Reassembles these matches based on the criteria entered
    31. 31. 30 What is BLAST?—Basic BLAST
    32. 32. 31 How BLAST Works—Words  BLAST divides the query sequence into subsets called “words”, which the algorithm uses to perform the alignment  Example (35 nt sequence): CGATCGGGCATCACACAAAGTTATGTAGTAGAAAT  All possible words that can be generated from the sequence are used for the alignment  The maximum number of 7-letter words for this sequence is 29 (35 - 7 + 1)
    33. 33. 32 Overview—Definitions  Hit: A sequence to which the query is aligned and which is returned in the BLAST results  Identity: the extent of exact matches between 2 sequences (e.g., ACGT and ACGG have 75% identity)  Similarity = Positives (in BLAST scoring)
    34. 34. 33 How BLAST Works—Scores  The BLAST raw score is converted to a bit score for each alignment using parameters based on statistics described in Karlin and Altschul (1990)  A high score does not necessarily indicate that the query is unique  The score depends only on the alignment, the length of the sequence, and the length of the database  E-value is the expected number of alignments with an equivalent or better score that would occur by chance  Calculated using the Max bit score and the length of the query and database  Tells you the relative strength of the alignment  Shorter sequences have higher E-values because the probability of finding that sequence by chance is higher  A low E-value does not mean you have a unique match!
    35. 35. 34 BLAST Assessment for qPCR Primers  Go to the BLAST server:  Enter primer sequences separated by 7+ N’s
    36. 36. 35 Select the Correct Database “Others” is the most general but contains a lot of sequences. If possible use Human or Mouse specific databases For species with completed genome projects, consider using “NCBI Genomes” to limit BLAST results
    37. 37. 36 Change the parameters of the BLAST scoring Select less rigorous algorithm Change Word size to “7”
    38. 38. 37 Looking at the Results The Graphic Summary can immediately give you a sense of what the overall results are Hover over each result in the graphic to identify the sequence name
    39. 39. 38 Then Look at Results List Look at E-value and Query Coverage. Look for jumps in either/both. Looks like assay is specific to a single gene by transcript. Ignore the “alternate” chromosome assemblies.
    40. 40. 39 Investigate details of alignment Check distance between primer binding if looking at mRNA Open Graphics result in a new tab/window
    41. 41. 40 BLAST Shows Primer Aligned to Sequence Zoom out with “-” sign You can grab within window and drag sequence side to side
    42. 42. 41 The Target Gene is on Chromosome 6 This looks promising with primers on different exons.
    43. 43. 42 But We Had Other Chromosomal Hits…… “real” transcript Pseudogene— doesn’t look transcribed Primers (red bar indicates mismatch)
    44. 44. 43 And Another One…… Another pseudogene. But what’s this? Intron of a transcribed gene. So potentially in RNA samples. Recommend avoiding if possible
    45. 45. 44 Using NCBI to Check for SNPs
    46. 46. 45 While Assessing BLAST Results, Also Assess for SNPs
    47. 47. 46 Investigate SNPs in Primer Binding Sites
    48. 48. 47 Assessing SNP Data Tells you it’s a single base substitution Indicates alternate forms (here recorded on opposite strand) Indicates allele frequency if known Sometimes more frequency data at bottom of page
    49. 49. 48 SNP Data Roughly Divided by Risk Trusted source Very low frequency No data, likely not going to be problematic Significant risk. Look to redesign if possible
    50. 50. 49 Using OligoAnalyzer® Tool to Check Primers and Probes
    51. 51. 50 Checking Primers with OligoAnalyzer® Tool  PrimerQuest® design tools give you the “best” assays for the region specified  They check for self- and hetero-dimers, but this is only part of the scoring system used  An assay may be “better” even with dimer issues if it scores well on other parameters  Go to the OligoAnalyzer Tool  Perform self-dimer checks for primers and probe  Perform heterodimer checks on all primer/probe combinations (especially important to include all combinations when multiplexing)  Check hairpin structures  Look for stability of < -9 kcal/mol  Or multiple hairpins forming with < -4 kcal/mol
    52. 52. 51 Assessing Dimer Data Looks stable (< -9 kcal/mol), but this is not “dangerous”: avoid if possible, but OK. Looks stable (< -9 kcal/mol) but not extendable: not a problem. Doesn’t look stable (> -9 kcal/mol), but there is a danger of extension and exponential amplification!
    53. 53. 52 Assessing Hairpin Structures  Based on UNAfold predictions
    54. 54. IDT PrimeTime® Predesigned qPCR Database 53
    55. 55. 54 Primer and Probe Design Criteria for PrimeTime® Assays  Primers  equal Tm (60–63 °C)  15–30 bases in length  no runs of 4 or more Gs  amplicon size 50–150 bp (max 400 bp)  Probe  Probe length no longer than 30–35 bases  Tm value 4–10 °C higher than primers  no runs of 4 or more consecutive Gs  G+C content 30–80%  no G at the 5′ end
    56. 56. 55 PrimeTime Results
    57. 57. 56  Questions?
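
Several of the sequence-only design criteria from the PrimeTime® slide, and the "primers separated by 7+ N's" BLAST query format, can be checked with a short script. The following is a minimal Python sketch under stated assumptions: the function names (`check_primer`, `check_probe`, `blast_query`) and the example sequences are hypothetical, the probe length limit is read as a hard maximum of 35 bases, and the G+C range is applied to the probe only, as the slide lists it. Tm, dimer, and hairpin checks are deliberately left out because they depend on thermodynamic models not described here. This is not the scoring used by PrimerQuest® or any other IDT tool.

```python
import re


def gc_percent(seq: str) -> float:
    """Percent of bases that are G or C."""
    s = seq.upper()
    return 100.0 * sum(base in "GC" for base in s) / len(s)


def has_g_run(seq: str, run: int = 4) -> bool:
    """True if the sequence contains `run` or more consecutive Gs."""
    return re.search("G{%d,}" % run, seq.upper()) is not None


def check_primer(seq: str) -> list[str]:
    """Return a list of violated primer criteria (empty list means none found)."""
    problems = []
    if not 15 <= len(seq) <= 30:
        problems.append("length outside 15-30 bases")
    if has_g_run(seq):
        problems.append("run of 4 or more Gs")
    return problems


def check_probe(seq: str, max_len: int = 35) -> list[str]:
    """Return a list of violated probe criteria; max_len = 35 is an assumption."""
    problems = []
    if len(seq) > max_len:
        problems.append("longer than %d bases" % max_len)
    if has_g_run(seq):
        problems.append("run of 4 or more Gs")
    if not 30.0 <= gc_percent(seq) <= 80.0:
        problems.append("G+C content outside 30-80%")
    if seq.upper().startswith("G"):
        problems.append("G at the 5' end")
    return problems


def blast_query(forward: str, reverse: str, spacer: int = 7) -> str:
    """Join two primers with a run of Ns so both can be searched in one query."""
    if spacer < 7:
        raise ValueError("use at least 7 Ns between the primers")
    return forward + "N" * spacer + reverse


# Hypothetical sequences, for illustration only.
print(check_primer("AGCTGACCTGAAGGCTCTGA"))
print(check_primer("TTGGGGACATCGTACCAGTC"))
print(check_probe("GACCTGAAGCTGTTCATCGAGGCC"))
print(blast_query("AGCTGACCTGAAGGCTCTGA", "TTGCCGACATCGTACCAGTC"))
```

An empty list only means that none of these simple checks failed. The BLAST specificity review, the SNP check, and the OligoAnalyzer® review described in the slides are still needed.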