
AI 바이오 (4일차).pdf


Day 4 of a bioinformatics training course, covering DNA and RNA sequence analysis.
Includes sequence alignment (pairwise and multiple).

AI 바이오 (4일차).pdf

  1. 1. AI-Bio Convergence Specialist Course, August to October 2022. 윤형기 (hky@openwith.net). Day 4
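  2. 2. Schedule (topic: details) • Day 1 – Greetings and course introduction: greetings; survey of participants and their goals – Medical/bio overview (technology/industry): medical/bio technology and industry trends – Foundation technology (1-1) Python and analysis packages: analysis tools (1) (Python, SciPy, NumPy/pandas) • Day 2 – Foundation technology (1-2) R and statistical analysis: analysis tools (2) (R and statistics) – Applied biostatistics (1): bioinformatics and ANOVA, multivariate analysis, etc.; genome analysis • Day 3 – Applied biostatistics (2): meta-analysis – Genome analysis (Omics) (1): genome analysis; transcriptome analysis • Day 4 – Genome analysis (Omics) (2): epigenome analysis; proteome analysis – Next-generation sequencing: GenBank and NCBI data; VCF data analysis; NGS data processing, etc. • Day 5 – Foundation technology (3) Machine learning (1): modeling methodology (model concepts and cross-validation); supervised learning algorithms (linear models, classification) – Foundation technology (3) Machine learning (2): unsupervised learning algorithms (clustering, association analysis, etc.) • Day 6 – Supervised learning applied to bioinformatics: predictive models for medical data; linear models and classification of healthcare data – Unsupervised learning applied to bioinformatics: association analysis of clinical data; comorbidity analysis • Course parts: Understanding the medical/bio domain; Healthcare datasets and biostatistics; Bio data and machine learning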
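  3. 3. Schedule, continued (topic: details) • Day 7 – Foundation technology (4) Deep learning (1): neural network training and deep learning models – Foundation technology (3) Deep learning (2): TensorFlow; PyTorch • Day 8 – Deep learning applied to bioinformatics: healthcare simulation with Bi-LSTM; skin disease identification with deep learning – Ontologies applied to bioinformatics: the Semantic Web and ontologies; bioinformatics applications of ontologies • Day 9 – Foundation technology (3) Image processing: overview of image processing and computer vision – Medical image analysis (1): segmentation; image registration • Day 10 – Medical image analysis (2): electrocardiogram (ECG); rendering and surface models; MRI • Day 11 – Foundation technology (4) Bioinformatics and computational chemistry: overview of computational chemistry – Drug discovery (1): target identification; reagent and assay development; ADME (absorption, distribution, metabolism, excretion); toxicology and machine learning applications • Day 12 – Foundation technology (5) GAN: GAN (Generative Adversarial Networks) and VAE – Drug discovery and GAN: recommending drug candidates with generative models – Summary: wrap-up • Course parts: Medical image analysis; Drug analysis and drug design; Bio data and deep learning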
  4. 4. Genome analysis
  5. 5. Main topics in bioinformatics • Sequence alignment – Pairwise sequence alignment – Database similarity searching – Multiple sequence alignment – Profiles and HMMs – Protein motif and domain prediction • Gene and promoter prediction – Gene prediction – Promoter and regulatory element prediction • Molecular phylogenetics – Phylogenetics basics – Phylogenetic tree construction methods and programs • Structural bioinformatics – Protein structure visualization, comparison, and classification – Protein structure prediction (secondary, tertiary) – RNA structure prediction • Genomics and proteomics – Genome mapping, assembly, and comparison – Functional genomics – Proteomics • Genome rearrangements • Motif finding • Gene expression analysis
  6. 6. Sequence alignment
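  7. 7. Supplement: the genetic code • 1. Overview – The set of rules that determines which amino acid each codon encodes • 2. Codon – The unit of genetic information in RNA that specifies an amino acid of a protein – RNA bases: uracil, guanine, cytosine, adenine – A codon consists of 3 bases, so in theory 4×4×4 = 64 distinct codons are possible. • 3. Types – 3.1. Start codon • 5'-AUG-3' (some bacteria use variant start codons). • Specifies methionine (Met) in eukaryotes and N-formylmethionine (fMet) in prokaryotes. • Also signals the mRNA to bind the ribosome and begin translation. – 3.2. Stop codon (nonsense codon) • A codon that signals the end of translation; there are three: UAA, UAG, and UGA. • Stop codons have no corresponding tRNA; instead a protein called a release factor binds, and when translation reaches a stop codon the two ribosomal subunits separate and translation terminates. – 3.3. Anticodon • The base sequence of a specific segment of the tRNA chain that pairs with the codon on the mRNA.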
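The codon rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name `translate` and the example mRNA string are assumptions made up for the sketch, and the table is the standard genetic code written in the conventional U, C, A, G order.

```python
from itertools import product

BASES = "UCAG"  # conventional ordering used to lay out the standard codon table
# Amino acids of the standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build all 4 x 4 x 4 = 64 codons and pair each with its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon (hypothetical helper)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon: translation terminates
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))          # number of codons
    print(sorted(STOP_CODONS))       # the three stop codons
    print(translate("GGAUGGCCUGGUAAGC"))
```

The example string contains the reading frame AUG GCC UGG UAA, so translation starts at the methionine codon and ends at UAA.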
  8. 8. Pairwise Sequence Alignment • Background • Sequence homology vs. sequence similarity • Sequence similarity vs. sequence identity • Methods – Global alignment and local alignment – Alignment algorithms – Dot matrix method – Dynamic programming method • Gap penalties • Dynamic programming for global alignment • Dynamic programming for local alignment • Scoring matrices – Amino acid scoring matrices – PAM matrices – BLOSUM matrices – Comparison between PAM and BLOSUM • Statistical significance of sequence alignment
  9. 9. • Goal • Sequence comparison → find common character patterns and residue-to-residue correspondences • Background: evolution • DNA and proteins are products of evolution – The degree of sequence conservation in the alignment reveals the evolutionary relatedness of different sequences, whereas the variation between sequences reflects the changes that have occurred during evolution in the form of substitutions, insertions, and deletions. • Sequence alignment – can be used as a basis for predicting the structure and function of uncharacterized sequences. – provides inference about the relatedness of the two sequences under study.
  10. 10. Sequence Homology vs. Similarity • (…) – Distinguishing the terms • Homologous relationship (sharing homology) – an inference or conclusion about a common ancestral relationship, drawn from sequence similarity comparison when the two sequences share a high enough degree of similarity. (qualitative) • Sequence similarity – is a direct result of observation from the sequence alignment. – % of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity. (quantitative) – The issue is the level of sequence similarity required • Nucleotide sequences consist of only 4 characters → two unrelated sequences have at least a 25% chance of being identical at any position. • Protein sequences have 20 possible amino acid residues → two unrelated sequences can match at about 5% of the residues by random chance.
  11. 11. – However, % identity values provide only tentative guidance for homology identification. Figure: the three zones of protein sequence alignments. (Source: modified from Rost 1999.)
  12. 12. Sequence Similarity vs. Sequence Identity • (…) • For nucleotide sequences the two terms mean practically the same thing • For protein sequences they must be distinguished – Sequence identity = % of matches of the same amino acid residues between two aligned sequences. – Sequence similarity = % of aligned residues that have similar physicochemical characteristics and can be more readily substituted for each other. – Two ways to calculate sequence similarity and identity – One uses the overall lengths of both sequences – The other normalizes by the length of the shorter sequence.
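To make the two normalizations concrete, here is a small Python sketch. It is illustrative only: the function name `alignment_stats`, the residue groupings in `SIMILAR_GROUPS`, and the toy alignment are assumptions for this example, and the "both lengths" variant is taken to be 2 × count / (length A + length B), which is one common way to use both sequence lengths.

```python
# Hypothetical grouping of residues with similar physicochemical properties;
# real tools derive similarity from a substitution matrix instead.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def alignment_stats(aligned_a: str, aligned_b: str) -> dict:
    """Percent identity and similarity for two gapped, equal-length aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns count as neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1

    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    shorter = min(len_a, len_b)
    return {
        # normalized by the length of the shorter sequence
        "identity_shorter": 100 * identical / shorter,
        "similarity_shorter": 100 * similar / shorter,
        # normalized using the lengths of both sequences
        "identity_both": 100 * 2 * identical / (len_a + len_b),
        "similarity_both": 100 * 2 * similar / (len_a + len_b),
    }


if __name__ == "__main__":
    for name, value in alignment_stats("MKT-LLV", "MRTALIV").items():
        print(f"{name}: {value:.1f}%")
```

In the toy alignment, 4 columns are identical and 2 more are similar (K/R and L/I), so the two normalizations give visibly different percentages for the same alignment.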
  13. 13. Methods • Global alignment and local alignment • Global alignment – compares the sequences from beginning to end » is more applicable for aligning two closely related sequences of roughly the same length. » For divergent sequences and sequences of variable lengths, this method may not generate optimal results because it fails to recognize highly similar local regions between the two sequences. • Local alignment – finds only the local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the rest of the sequences – The two sequences to be aligned can be of different lengths
  14. 14. Example of pairwise sequence comparison
  15. 15. • Alignment algorithms – Dot matrix method (= dot plot method) – Dynamic programming method • Gap penalties • Dynamic programming for global alignment • Dynamic programming for local alignment – Word method
  16. 16. – Dot matrix method. Figure: example of sequence comparison with a dot plot. Lines linking the dots in diagonals indicate sequence alignment. Diagonal lines above or below the main diagonal represent internal repeats of either sequence.
  17. 17. • Problem when comparing large sequences with the dot matrix method – high noise level. » In most dot plots, dots are plotted all over the graph, obscuring identification of the true alignment. This is particularly acute for DNA sequences because there are only 4 possible characters in DNA, so each residue has a 1-in-4 chance of matching a residue in another sequence. » To reduce noise, instead of using a single residue to scan for similarity, a filtering technique has to be applied, which uses a “window” of fixed length covering a stretch of residue pairs.
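The window filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `dot_plot` and `render` and the `min_matches` stringency parameter are made up for this example and do not come from the slides or from any particular dot plot program.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1, min_matches=None) -> set:
    """Return the (i, j) window start positions that pass the match filter.

    With window=1 every identical residue pair gives a dot (noisy).
    With a larger window, a dot is kept only if at least `min_matches`
    of the `window` residue pairs along the diagonal are identical.
    """
    if min_matches is None:
        min_matches = window  # default: require a perfect match in the window
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a: str, seq_b: str, dots: set) -> str:
    """Draw the dot plot as text, seq_a down the side and seq_b across the top."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    s = "GATTACA"
    print(render(s, s, dot_plot(s, s, window=1)))  # noisy: many off-diagonal dots
    print()
    print(render(s, s, dot_plot(s, s, window=3)))  # filtered: only the main diagonal
```

Comparing the sequence with itself shows the effect: the single-residue plot has dots wherever any A, T, G, or C coincide, while the three-residue window leaves only the main diagonal.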
  18. 18. • Self-comparison as a variation of the dot plot method. – The main diagonal shows the perfect match of each residue with itself → used to identify internal repeat elements – If repeats are present, short parallel lines are observed above and below the main diagonal. » Self-complementarity of DNA sequences (also called inverted repeats) can also be identified using a dot plot. » In this case, a DNA sequence is compared with its reverse-complemented sequence. – Parallel diagonals represent the inverted repeats.
  19. 19. – Advantages » easy identification of the regions of greatest similarity. – Disadvantages » it is often up to the user to construct a full alignment with insertions and deletions by linking nearby diagonals. » it lacks statistical rigor in assessing the quality of the alignment. » it is restricted to pairwise alignment; it is difficult for the method to scale up to multiple alignment.
  20. 20. – Dynamic Programming Method • (…) – convert a dot matrix into a scoring matrix to account for matches and mismatches between sequences. By searching for the set of highest scores in this matrix, the best alignment can be accurately obtained. – construct a 2-D matrix. » The residue matching is according to a particular scoring matrix. The scores are calculated one row at a time. This starts with the first row of one sequence, which is used to scan through the entire length of the other sequence, followed by scanning of the second row. The matching scores are calculated.
  21. 21. • Gap penalties – Gaps are introduced to represent insertions and deletions. – There is a cost difference between opening a gap and extending an existing gap. » It is easier to extend a gap that has already been started, so gap opening should carry a much higher penalty than gap extension → if insertions and deletions occur at all, several adjacent residues are likely to have been inserted or deleted together. » These differential gap penalties are called affine gap penalties. » Strategy: use preset gap penalty values for introducing and extending gaps. » The total gap penalty (W) is a linear function of gap length: W = γ + δ × (k − 1), where γ = gap opening penalty, δ = gap extension penalty, k = length of the gap. » A constant gap penalty (the same cost regardless of gap length) is less realistic.
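As a quick numeric check of the affine scheme, the sketch below assumes the usual form W = γ + δ × (k − 1) with the slide's symbols (γ gap opening, δ gap extension, k gap length). The function names and the penalty values 10 and 1 are illustrative assumptions, not values given in the slides.

```python
def affine_gap_penalty(k: int, gap_open: float, gap_extend: float) -> float:
    """Total penalty W for one gap of length k under an affine scheme."""
    if k <= 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * (k - 1)


def constant_gap_penalty(k: int, penalty: float) -> float:
    """Constant scheme: the same penalty regardless of gap length."""
    return penalty if k > 0 else 0.0


if __name__ == "__main__":
    gamma, delta = 10.0, 1.0  # illustrative values only
    for k in (1, 2, 4, 8):
        print(k, affine_gap_penalty(k, gamma, delta))
    # One gap of length 4 costs less than two separate gaps of length 2,
    # which is how the scheme favors adjacent residues being inserted
    # or deleted together.
    print(affine_gap_penalty(4, gamma, delta), 2 * affine_gap_penalty(2, gamma, delta))
```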
  22. 22. • Dynamic programming (DP) for global alignment (Needleman–Wunsch algorithm) – An optimal alignment is obtained over the entire lengths of the two sequences. – Drawback: risk of missing the best local similarity → only suitable for aligning two closely related sequences of roughly the same length. (For divergent sequences or sequences with different domain structures, the approach does not produce an optimal alignment.) • DP for local alignment (Smith–Waterman algorithm) – Identifies regions of local sequence similarity
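The difference between the two dynamic programming variants can be seen in a compact Python sketch. To stay short it makes simplifying assumptions that are not from the slides: a fixed match/mismatch score instead of a substitution matrix, a linear gap penalty instead of an affine one, and a score-only local version. The function names are hypothetical.

```python
def needleman_wunsch(s: str, t: str, match=1, mismatch=-1, gap=-2):
    """Global alignment: returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap  # leading gaps are penalized in a global alignment
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))


def smith_waterman_score(s: str, t: str, match=1, mismatch=-1, gap=-2) -> int:
    """Local alignment: best score over all pairs of subsequences."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere (no negative prefixes).
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(smith_waterman_score("AAACGTAAA", "CCCACGTCCC"))
```

The only structural differences are the boundary initialization and the floor at zero: the global version pays for every end gap, while the local version reports the best-scoring region and ignores the rest.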
  23. 23. Scoring matrices • (…) = substitution matrices • A scoring matrix is derived from statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences. – For nucleotide sequences, a positive value or high score is given for a match and a negative value or low score for a mismatch. – Assumption: the frequencies of mutation are equal for all bases. This assumption is unrealistic, however. • Scoring matrices for amino acids are more complicated – They must reflect the physicochemical properties of amino acid residues, as well as the likelihood of certain residues being substituted among true homologous sequences. – Certain amino acids with similar physicochemical properties can be more easily substituted than those without similar characteristics. Substitutions among similar residues are likely to preserve the essential functional and structural features. However, substitutions between residues of different physicochemical properties are more likely to disrupt structure and function.
  24. 24. • Amino acid scoring matrices – 20 × 20 matrices that reflect the likelihood of residue substitutions • 2 types of amino acid substitution matrices – (i) based on the interchangeability of the genetic code or on the physicochemical features of amino acids → less accurate – (ii) derived from empirical studies of amino acid substitutions » based on surveys of actual amino acid substitutions among related proteins. » PAM and BLOSUM matrices are derived from actual alignments of highly similar sequences. By analyzing the probabilities of amino acid substitutions in these alignments, a scoring system can be developed by giving a high score to a more likely substitution and a low score to a rare substitution.
  25. 25. • PAM matrices (Dayhoff PAM matrices) • PAM = point accepted mutation • Table: correspondence of PAM numbers with observed amino acid mutational rates
  26. 26. • BLOSUM matrices • The BLOSUM (blocks amino acid substitution matrices) series – Motivation: in PAM matrix construction, the only direct observation of residue substitutions is in PAM1, based on a relatively small set of extremely closely related sequences. Sequence alignment statistics for more divergent sequences are not available. – In contrast, all BLOSUM matrices are derived from direct observation of every possible amino acid substitution in multiple sequence alignments. • Instead of using an extrapolation function, the BLOSUM numbers are the actual % identity values of the sequences selected for construction of the matrices.
  27. 27. PAM250 amino acid substitution matrix. Residues are grouped according to physicochemical similarities.
  28. 28. BLOSUM62 amino acid substitution matrix.
  29. 29. • Comparison of PAM and BLOSUM • Main differences – PAM matrices, except PAM1, are derived from an evolutionary model – BLOSUM matrices consist entirely of direct observations. » BLOSUM matrices are derived entirely from local sequence alignments of conserved sequence blocks, » whereas the PAM1 matrix is based on the global alignment of full-length sequences composed of both conserved and variable regions. → BLOSUM matrices are more advantageous for searching databases and finding conserved domains in proteins. • Results of several empirical comparisons – BLOSUM matrices outperform PAM matrices in accuracy of local alignment, largely because BLOSUM matrices are derived from a much larger and more representative dataset than the one used to derive the PAM matrices. → BLOSUM matrices are more reliable. – Revised matrices have since been devised, e.g., the Gonnet matrices and the Jones–Taylor–Thornton matrices, which are particularly robust in phylogenetic tree construction.
  30. 30. Figure: Gumbel extreme value distribution of alignment scores.
  31. 31. Statistical significance of sequence alignment • Concept • A statistical test to find true evidence of homology – Test procedure • The P-value resulting from the test – P < 10^-100 indicates an exact match between the two sequences. – 10^-100 < P < 10^-50 → a nearly identical match. – 10^-50 < P < 10^-5 → sequences with clear homology. – 10^-5 < P < 10^-1 → possible distant homologs. – P > 10^-1 → the two sequences may be only randomly related. – However, truly related protein sequences may sometimes lack statistical significance at the sequence level owing to fast divergence rates. Their evolutionary relationships can nonetheless be revealed at the three-dimensional structural level.
  32. 32. Database similarity searching • Requirements for database searching • Heuristic searching • Basic Local Alignment Search Tool (BLAST) – Variants – Statistical significance – Low complexity regions – BLAST output format • FASTA – Statistical significance • Comparison of FASTA and BLAST • Searching with the Smith–Waterman method
  33. 33. General • Database searching • Uses pairwise alignment to retrieve biological sequences from databases based on similarity. – The query is compared pairwise with every individual sequence in a database; database similarity searching is pairwise alignment on a large scale. – However, dynamic programming (DP) is slow and impractical in most cases. Special search methods are needed to speed up the computation. • Requirements for database searching • Sensitivity: the ability to find as many true positives (correct hits) as possible • Specificity: the ability to exclude false positives (incorrect hits) • Speed – Types of algorithms • Exhaustive type – examines all mathematical combinations (e.g., DP) • Heuristic type – finds an empirical or near-optimal solution using rules of thumb
  34. 34. Heuristic searching • (…) – BLAST – FASTA – word method • Both BLAST and FASTA use a heuristic “word method” for fast pairwise sequence alignment.
  35. 35. Basic Local Alignment Search Tool (BLAST) • Goal – Find high-scoring ungapped segments; segments scoring above a given threshold indicate pairwise similarity beyond random chance. Figure: example of alignment scoring with the BLOSUM62 matrix
  36. 36. • Variants – BLASTN – BLASTP – BLASTX – TBLASTX
  37. 37. • Statistical significance – The larger the database, the more alignments with unrelated sequences occur. → A new parameter is needed that takes into account the total number of sequence alignments conducted, which is proportional to the size of the database. • In BLAST searches, the E-value (expectation value) – is the number of alignments with an equal or better score that are expected to arise by random chance in a database search of that size; the lower the E-value, the less likely the match is due to chance. – The E-value is related to the P-value used to assess the significance of a single pairwise alignment. Because BLAST compares a query sequence against all database sequences, the E-value is determined by: [formula on slide] – (ex) … • A bit score – measures sequence similarity independent of query sequence length and database size, and is normalized from the raw pairwise alignment score
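The relationship between raw score, bit score, and E-value can be sketched with the Karlin–Altschul statistics commonly cited for BLAST. This is not the formula from the slide, which is not preserved in the transcript; it is the standard textbook form E = K·m·n·e^(−λS), with bit score S' = (λS − ln K) / ln 2. The function names and the values of λ, K, query length, and database size below are illustrative assumptions.

```python
import math


def bit_score(raw_score: float, lam: float, K: float) -> float:
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int, lam: float, K: float) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits: float, query_len: int, db_len: int) -> float:
    """Same quantity computed from the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value_from_e(e: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e)  # 1 - exp(-e), computed stably for small e


if __name__ == "__main__":
    lam, K = 0.267, 0.041        # illustrative scoring-system parameters
    m, n = 300, 50_000_000       # illustrative query length and database size
    for raw in (40, 80, 120):
        bits = bit_score(raw, lam, K)
        print(raw, round(bits, 1), e_value(raw, m, n, lam, K))
```

Because the E-value scales with the search space m × n, the same raw score becomes less significant as the database grows, which is the point made on the slide.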
  38. 38. • Low complexity regions (LCRs) • Both protein and DNA sequences may contain regions of highly repetitive residues, such as short segments of repeats, or segments that are overrepresented by a small number of residues. – LCRs are rather prevalent in database sequences; about 15% of the total protein sequences in public databases. → They cause spurious database matches and lead to artificially high alignment scores with unrelated sequences. • To avoid high similarity scores caused by matching LCRs, filter out the problematic regions in both query and database sequences to improve the signal-to-noise ratio (= masking). • 2 types of masking: hard and soft. • SEG detects and masks repetitive elements before database searches are executed. – SEG has been integrated into the web-based BLAST program. • BLAST output format
  39. 39. FASTA • (…) • The first database similarity search tool • Finds matches for a short stretch of identical residues of length k (the “hashing” approach) – These strings of residues (= ktuples or ktups) are equivalent to words in BLAST, but are normally shorter than words. Typically, a ktup is composed of two residues for protein sequences and six residues for DNA sequences. • Similar to BLAST, FASTA has a number of subprograms.
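Hard and soft masking can be illustrated with a toy filter. The sketch below is not the SEG algorithm; it simply flags windows whose Shannon entropy falls below a threshold, which is one simple way to measure compositional complexity. The function names, window size, threshold, and example sequence are all assumptions chosen for the illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Compositional entropy of a window, in bits per residue."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2,
                        soft: bool = True) -> str:
    """Mask every residue covered by at least one low-entropy window.

    Soft masking writes masked residues in lowercase (sequence is kept);
    hard masking replaces them with 'X' (sequence is lost).
    """
    masked = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            for k in range(i, i + window):
                masked[k] = True
    out = []
    for residue, is_masked in zip(seq, masked):
        if not is_masked:
            out.append(residue.upper())
        elif soft:
            out.append(residue.lower())
        else:
            out.append("X")
    return "".join(out)


if __name__ == "__main__":
    seq = "MKTVYIAEGR" + "Q" * 12 + "DNRLVWHGFS"  # made-up sequence with a poly-Q run
    print(mask_low_complexity(seq, soft=True))
    print(mask_low_complexity(seq, soft=False))
```

Because any window that is mostly Q falls below the threshold, a few residues flanking the run are masked as well; real filters tune this behavior more carefully.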
  40. 40. Procedure of ktup identification using the hashing strategy by FASTA. Identical offset values between residues of the two sequences allow the formation of ktups.
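The hashing step in this figure can be sketched as a lookup table of ktup positions plus a count of offsets. This is a simplified illustration of the idea, not FASTA's actual implementation; the function names and the two example sequences are made up for the sketch.

```python
from collections import Counter, defaultdict


def index_ktups(seq: str, k: int) -> dict:
    """Hash table mapping each ktup to the list of positions where it occurs."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query: str, subject: str, k: int) -> Counter:
    """Count identical ktups per offset (query position minus subject position).

    Hits that share the same offset lie on the same diagonal of a dot plot,
    so a high count marks a candidate ungapped alignment.
    """
    table = index_ktups(subject, k)
    offsets = Counter()
    for q_pos in range(len(query) - k + 1):
        for s_pos in table.get(query[q_pos:q_pos + k], ()):
            offsets[q_pos - s_pos] += 1
    return offsets


if __name__ == "__main__":
    query = "MKVLAAGIV"
    subject = "GGMKVLAAGIVPP"
    hits = diagonal_hits(query, subject, k=2)  # k=2 is the typical protein ktup size
    print(hits.most_common(3))
```

Looking up each query ktup in the table takes constant time on average, which is why this step is much faster than filling a full dynamic programming matrix.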
  41. 41. Steps of the FASTA alignment procedure. In step 1 (left ), all possible ungapped alignments are found between two sequences with the hashing method. In step 2 (middle), the alignments are scored according to a particular scoring matrix. Only the ten best alignments are selected. In step 3 (right ), the alignments in the same diagonal are selected and joined to form a single gapped alignment, which is optimized using the dynamic programming approach.
  42. 42. • Statistical significance • FASTA also uses E-values and bit scores. – These are essentially the same as in BLAST, but the FASTA output provides one more statistical parameter, the Z-score. » Because most of the alignments with the query sequence are with unrelated sequences, the higher the Z-score of a reported match, the further it lies from the mean of the score distribution and hence the more significant the match. » For a Z-score > 15, the match can be considered extremely significant, with certainty of a homologous relationship. » If Z is in the range of 5 to 15, the sequence pair can be described as highly probable homologs. » If Z < 5, the relationship is described as less certain.
  43. 43. Comparison of FASTA and BLAST • (…) • BLAST and FASTA perform equally well in regular database searching. • Differences (notably in the seeding step) – BLAST uses a substitution matrix to find matching words » Use of low-complexity masking in BLAST → higher specificity than FASTA, because potential false positives are reduced. » BLAST sometimes gives multiple best-scoring alignments from the same sequence. – FASTA identifies identical matching words using the hashing procedure. » By default, FASTA scans smaller window sizes. → More sensitive results than BLAST, with a better coverage rate for homologs. However, it is usually slower than BLAST. » FASTA returns only one final alignment.
  44. 44. Multiple Sequence Alignment • Scoring function • Exhaustive algorithms • Heuristic algorithms – Progressive alignment method – Drawbacks and solutions – Iterative alignment – Block-based alignment • Practical issues – Protein-coding DNA sequences – Editing – Format conversion
  45. 45. • Concept • Generation of multiple matching sequence pairs → convert numerous pairwise alignments into a single alignment → arrange the sequences so that evolutionarily equivalent positions across all sequences are matched. • Advantages – reveals more biological information than pairwise alignments can. – applications in designing degenerate PCR primers based on multiple related sequences. • DP vs. heuristic – The amount of computing time and memory that DP requires increases exponentially as the number of sequences increases. In practice, heuristic approaches are most often used.
  46. 46. Scoring function • (…) • Multiple sequence alignment arranges sequences so that a maximum number of residues from each sequence are matched up according to a particular scoring function. » The scoring function is the sum of pairs (SP) = the sum of the scores of all possible pairs of sequences in a multiple alignment, based on a particular scoring matrix. – In calculating SP scores, each column is scored by summing the scores for all possible pairwise matches, mismatches, and gap costs. The score of the entire alignment is the sum of all the column scores. – The purpose of most multiple sequence alignment algorithms is to achieve the maximum SP score.
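The Z-score rule of thumb above can be expressed directly in code. This is a minimal sketch, assuming the Z-score is the number of standard deviations by which a match's score exceeds the mean score of alignments with unrelated sequences; the function names and the background scores in the example are made up for illustration.

```python
import statistics


def z_score(score: float, background_scores: list) -> float:
    """Standard deviations between a score and the mean of the background scores."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    return (score - mean) / sd


def interpret(z: float) -> str:
    """Apply the thresholds quoted on the slide."""
    if z > 15:
        return "extremely significant (homologous)"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"


if __name__ == "__main__":
    background = [8, 10, 12]  # toy scores against unrelated sequences
    for score in (14, 30, 50):
        z = z_score(score, background)
        print(score, z, interpret(z))
```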
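A sum-of-pairs score can be computed column by column, as described above. The sketch below makes simplifying assumptions that are not from the slides: a fixed match/mismatch score in place of a substitution matrix, a flat cost for a residue paired with a gap, and a score of zero for a gap paired with a gap. The function name `sp_score` is hypothetical.

```python
from itertools import combinations


def sp_score(alignment: list, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum-of-pairs score of a multiple alignment given as equal-length gapped strings."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    total = 0
    for column in zip(*alignment):             # score one column at a time
        for a, b in combinations(column, 2):   # every pair of sequences in the column
            if a == "-" and b == "-":
                continue                       # gap/gap pairs contribute nothing here
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    msa = ["ACG",
           "A-G",
           "ATG"]
    # Columns score 3, -5, and 3 under the default parameters.
    print(sp_score(msa))
```

With N sequences each column contributes N(N − 1)/2 pair scores, so the cost of evaluating one alignment grows quadratically with the number of sequences.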
  47. 47. Exhaustive Algorithms
  48. 48. Heuristic Algorithms • (3 categories) – Progressive Alignment Method – Iterative Alignment – Block-Based Alignment • Progressive Alignment Method – Drawbacks and Solutions Schematic of a typical progressive alignment procedure (e.g., Clustal). Angled wavy lines represent consensus sequences for sequence pairs A/B and C/D. Curved wavy lines represent a consensus for A/B/C/D.
  49. 49. Conversion of a sequence alignment into a graphical profile in the POA (partial order alignment) algorithm. Identical residues in the alignment are condensed as nodes in the partial order graph.
  50. 50. • Iterative Alignment • Block-Based Alignment Schematic of iterative alignment procedure for PRRN, which involves two sets of iterations.
  51. 51. Hands-on exercise (1): Python
  52. 52. • Source
  53. 53. Hands-on exercise (2): R
  54. 54. • Source
