1. AI-Bio Convergence Specialist Course
2022-8~10
윤형기 (hky@openwith.net)
Day 4
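2. Topic Details
Day 1 Welcome and course introduction
Greetings
Survey of participants and their goals for the course
Medical/bio overview (technology/industry) Medical/bio technology and industry trends
Foundation technology (1-1) Python and analysis packages Analysis tools (1) (Python, SciPy, NumPy/pandas)
Day 2 Foundation technology (1-2) R and statistical analysis Analysis tools (2) (R and statistics)
Applied biostatistics (1) Bioinformatics data and ANOVA, multivariate analysis, etc.
Genome analysis
Day 3 Applied biostatistics (2) Meta-analysis
Genome analysis (Omics) (1)
Genome analysis
Transcriptome analysis
Day 4 Genome analysis (Omics) (2)
Epigenome analysis
Proteome analysis
Next-generation sequencing
GenBank and NCBI data
VCF data analysis, NGS data processing, etc.
Day 5 Foundation technology (3) Machine learning (1)
Modeling methodology (model concepts and cross-validation)
Supervised learning algorithms (linear models, classification)
Foundation technology (3) Machine learning (2) Unsupervised learning algorithms (clustering, association analysis, etc.)
Day 6 Supervised learning and bioinformatics applications
Predictive models for medical data
Linear models and classification of healthcare data
Unsupervised learning and bioinformatics applications
Association analysis of clinical data
Comorbidity analysis
Understanding the medical/bio domain
Healthcare datasets and biostatistics
Bio data and machine learning
Schedule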
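3. Topic Details
Day 7 Foundation technology (4) Deep learning (1) Neural network training and deep learning models
Foundation technology (3) Deep learning (2)
TensorFlow
PyTorch
Day 8 Deep learning and bioinformatics applications
Healthcare simulation using Bi-LSTM
Skin disease identification using deep learning
Ontologies and bioinformatics applications
Semantic Web and ontologies
Bioinformatics applications of ontologies
Day 9 Foundation technology (3) Image processing Overview of image processing and computer vision
Medical image analysis (1)
Segmentation
Image registration
Day 10 Medical image analysis (2)
Electrocardiogram (ECG)
Rendering and surface models
MRI
Day 11 Foundation technology (4) Bioinformatics and computational chemistry Overview of computational chemistry
Drug discovery (1)
Target identification
Reagent and assay development
ADME (absorption, distribution, metabolism, excretion)
Toxicology and machine learning applications
Day 12 Foundation technology (5) GAN GANs (Generative Adversarial Networks) and VAEs
Drug discovery and GANs Recommending drug candidates with generative models
Wrap-up Wrap-up and review
Medical image analysis
Drug analysis and drug design
Bio data and deep learning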
5. Major Topics in Bioinformatics
• Sequence alignment
– Pairwise Sequence Alignment
– Database similarity searching
– Multiple Sequence Alignment
– Profiles and HMMs
– Protein Motifs and Domain Prediction
• Gene and promoter prediction
– Gene prediction
– Promoter and Regulatory Element Prediction
• Molecular Phylogenetics
– Phylogenetics Basics
– Phylogenetic Tree Construction Methods and Programs
• Structural Bioinformatics
– Protein structure visualization, comparison, and classification
– Protein structure prediction (secondary, tertiary)
– RNA structure prediction
• Genomics and Proteomics
– Genome mapping, assembly, and comparison
– Functional genomics
– Proteomics
• Genome rearrangements
• Motif finding
• Gene expression analysis
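7. Supplement: The Genetic Code
• 1. Overview
– The set of rules that determines which amino acid each codon encodes.
• 2. Codon
– The unit of genetic information in RNA that specifies an amino acid of a protein
– RNA bases: uracil (U), guanine (G), cytosine (C), adenine (A)
– Each codon consists of 3 bases, so in theory 4×4×4 = 64 distinct codons can be specified.
• 3. Types
– 3.1. Start codon
• 5'-AUG-3' (some bacteria use variant start codons).
• Specifies methionine (Met) in eukaryotes and
N-formylmethionine (fMet) in prokaryotes.
• It also signals the ribosome bound to the mRNA to begin translation.
– 3.2. Stop codon (nonsense codon)
• A codon that signals the end of translation; there are three: UAA, UAG, and UGA.
• No tRNA corresponds to a stop codon. Instead, a protein called a release factor binds; when
translation reaches a stop codon, the two ribosomal subunits separate and translation terminates.
– 3.3. Anticodon
• The base sequence in a specific region of the tRNA chain that pairs with the complementary codon on the mRNA.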
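The codon rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and an mRNA string over U, C, A, G; the names `CODON_TABLE` and `translate` are made up for this example, and variant start codons are ignored.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# 4 x 4 x 4 = 64 codons, each mapped to an amino acid or a stop signal.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # UAA, UAG, or UGA
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("GGAUGGCCUAAUU")` starts at the AUG, reads GCC, and stops at UAA.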
8. Pairwise Sequence Alignment
• Background
• Sequence Homology vs. Sequence Similarity
• Sequence Similarity vs. Sequence Identity
• Methods
– Global Alignment and Local Alignment
– Alignment Algorithms
– Dot Matrix Method
– Dynamic Programming Method
• Gap Penalties
• Dynamic Programming for Global Alignment
• Dynamic Programming for Local Alignment
• Scoring matrices
– Amino acid scoring matrices
– PAM matrices
– BLOSUM matrices
– Comparison between PAM and BLOSUM
• Statistical significance of sequence alignment
9. • Goal
• Sequence comparison
Find “common character patterns” and residue–residue correspondences
• Background: evolution
• DNA and proteins are products of evolution
– The degree of sequence conservation in the alignment reveals
evolutionary relatedness of different sequences, whereas the
variation between sequences reflects the changes that have occurred
during evolution in the form of substitutions, insertions, and
deletions.
• sequence alignment
– can be used as basis for prediction of structure and function of
uncharacterized sequences.
– provides inference for the relatedness of two sequences under study.
10. Sequence Homology vs. Similarity
• (…)
– Distinguishing the terms
• Homologous relationship or share homology.
– an inference or a conclusion about a common ancestral relationship
drawn from sequence similarity comparison when the two sequences
share a high enough degree of similarity. (qualitative)
• Sequence similarity
– is a direct result of observation from the sequence alignment.
– % of aligned residues that are similar in physicochemical properties
such as size, charge, and hydrophobicity. (quantitative)
– The question is what level of sequence similarity is enough to infer homology
• Nucleotide sequences consist of only 4 characters → unrelated
sequences have at least a 25% chance of being identical at any aligned position.
• Protein sequences have 20 possible amino acid residues → two
unrelated sequences can match at about 5% of the residues by random
chance.
11. – However, % identity values provide only tentative guidance for homology
identification
3 zones of protein sequence alignments. (Source: Modified from Rost 1999).
12. Sequence Similarity vs. Sequence Identity
• (…)
• For nucleotide sequences, the two terms mean essentially the same thing
• For protein sequences, they must be distinguished
– sequence identity = % of matches of the same amino acid residues
between two aligned sequences.
– Similarity = % of aligned residues that have similar physicochemical
characteristics and can be more readily substituted for each other.
– Two ways to calculate sequence similarity and identity
– One uses the overall sequence lengths of both sequences
– The other normalizes by the length of the shorter sequence.
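The two normalizations can be compared with a short sketch. It assumes the inputs are two already-aligned strings of equal length that use `-` for gaps; the function name `percent_identity` and the exact formulas (2 × matches over the summed lengths, and matches over the shorter length) are one common reading of the two approaches, not necessarily the slide's exact definitions.

```python
def percent_identity(aln_a: str, aln_b: str) -> tuple[float, float]:
    """Return (% identity over both lengths, % identity over the shorter sequence)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both rows hold the same residue (gaps never count).
    matches = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")

    # Ungapped lengths of the original sequences.
    len_a = len(aln_a.replace("-", ""))
    len_b = len(aln_b.replace("-", ""))

    by_both = 2 * matches / (len_a + len_b) * 100
    by_shorter = matches / min(len_a, len_b) * 100
    return by_both, by_shorter
```

Normalizing by the shorter sequence gives the higher value whenever the two lengths differ, which is worth keeping in mind when comparing reported identities.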
13. Methods
• Global Alignment and Local Alignment
• Global Alignment
– Compares the two sequences from beginning to end
» is more applicable for aligning two closely related sequences of
roughly the same length.
» For divergent sequences and sequences of variable lengths, this
method may not be able to generate optimal results because it
fails to recognize highly similar local regions between the two
sequences.
• Local alignment
– only finds local regions with the highest level of similarity between
the two sequences and aligns these regions without regard for the
alignment of the rest of the sequence regions
– Two sequences to be aligned can be of different lengths
15. • Alignment algorithms
– Dot Matrix Method (= dot plot method)
– Dynamic Programming Method
• Gap Penalties
• Dynamic Programming for Global Alignment
• Dynamic Programming for Local Alignment
– Word method
16. – Dot Matrix Method
Example of sequence comparison using a dot plot. Lines linking the dots in diagonals indicate
sequence alignment. Diagonal lines above or below the main diagonal
represent internal repeats of either sequence.
17. • Problem when comparing large sequences using dot matrix
method
– high noise level.
» In most dot plots, dots are plotted all over the graph, obscuring
identification of the true alignment. The problem is particularly acute for DNA
sequences because DNA has only 4 possible characters, so each
residue has a 1-in-4 chance of matching a residue in
another sequence.
» To reduce noise, instead of using a single residue to scan for
similarity, a filtering technique has to be applied, which uses a
“window” of fixed length covering a stretch of residue pairs.
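The window filter described above can be sketched as follows. This is a minimal illustration, not a production dot-plot tool; `dot_plot`, `window`, and `min_matches` are names chosen for this example, and a dot is placed at (i, j) only when the two windows starting there agree in at least `min_matches` positions.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1, min_matches=None) -> set:
    """Return the set of (i, j) positions whose windows match well enough."""
    if min_matches is None:
        min_matches = window  # by default require a perfect window match

    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residue pairs inside the two windows.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots
```

With `window=1` every single-residue match becomes a dot, which is the noisy case; raising `window` keeps only dots that are supported by their neighbors along the diagonal.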
18. • Self-comparison, a variation of the dot plot method, identifies
internal repeat elements.
– The main diagonal represents the perfect match of each residue with itself.
– If repeats are present, short parallel lines are observed above and
below the main diagonal.
» Self complementarity of DNA sequences (also called inverted
repeats) can also be identified using a dot plot.
» In this case, a DNA sequence is compared with its reverse-
complemented sequence.
– Parallel diagonals represent the inverted repeats.
19. – Advantages
» easy identification of the regions of greatest similarity.
– Disadvantages
» it is often up to the user to construct a full alignment with
insertions and deletions by linking nearby diagonals.
» it lacks statistical rigor in assessing the quality of the alignment.
» it is also restricted to pairwise alignment. It is difficult for the
method to scale up to multiple alignment.
20. – Dynamic Programming Method
• (…)
– convert a dot matrix into a scoring matrix to account for matches
and mismatches between sequences. By searching for the set of
highest scores in this matrix, the best alignment can be accurately
obtained.
– construct a 2-D matrix.
» The residue matching is according to a particular scoring matrix.
The scores are calculated one row at a time. This starts with the
first row of one sequence, which is used to scan through the
entire length of the other sequence, followed by scanning of
the second row. The matching scores are calculated.
22. • Gap Penalties
– Apply gaps that represent insertions and deletions.
– cost difference between opening a gap and extending an existing
gap.
» It is easier to extend a gap that has already been started. Gap
opening should therefore carry a much higher penalty than gap extension: when insertions and
deletions occur, several adjacent residues are likely to have
been inserted or deleted together.
» These differential gap penalties are called affine gap penalties.
» Strategy: use preset gap penalty values for introducing and
extending gaps.
» The total gap penalty (W) is a linear function of gap length: W = γ + δ × (k − 1)
γ = gap opening penalty,
δ = gap extension penalty,
k = length of the gap.
» A constant gap penalty, which charges the same cost for every gap position, is less realistic.
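A small sketch makes the difference between the two schemes concrete. It assumes the common affine form W = γ + δ × (k − 1); the default values 10 and 1 are illustrative only, and the function names are made up for this example.

```python
def affine_gap_penalty(k: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Penalty for one gap of length k: opening cost plus (k - 1) extensions."""
    if k <= 0:
        return 0.0
    return gap_open + gap_extend * (k - 1)


def constant_gap_penalty(k: int, per_position: float = 10.0) -> float:
    """Same cost for every gap position, whether opening or extending."""
    return per_position * max(k, 0)
```

Under the affine scheme one gap of length 5 costs 14, while five separate gaps of length 1 cost 50, so the alignment is pushed toward fewer, longer gaps.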
23. • DP for Global Alignment (Needleman–Wunsch algorithm)
– an optimal alignment is obtained over the entire lengths of the two
sequences.
– Drawback: risk of missing the best local similarity → only suitable
for aligning two closely related sequences of roughly the same
length. (For divergent sequences or sequences with different domain
structures, the approach does not produce an optimal alignment.)
• DP for Local Alignment (Smith–Waterman algorithm)
– identification of regional sequence similarity
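The contrast between the two dynamic programming variants can be sketched as below. This is a simplified illustration that returns only the optimal score, assuming a flat match/mismatch scheme and a linear gap cost rather than a substitution matrix with affine gaps; the function names and default scores are choices made for this example.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score over the full lengths of a and b."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score: cells are floored at 0 so a path can restart."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The only structural differences are the initialization and the floor at zero: the global version must pay for every residue, while the local version reports the best-scoring region wherever it occurs.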
24. Scoring Matrices
• (…) = a substitution matrix
• is derived from statistical analysis of residue substitution data
from sets of reliable alignments of highly related sequences.
• Scoring matrices for nucleotide sequences are relatively simple
– A positive value or high score is given for a match and a negative
value or low score for a mismatch.
– Assumption: the frequencies of mutation are equal for all bases.
However, this assumption is unrealistic.
• Scoring matrices for amino acids are more complicated
– They must reflect the physicochemical properties of amino acid residues, as well as
the likelihood of certain residues being substituted among true
homologous sequences.
– Certain amino acids with similar physicochemical properties can be
more easily substituted than those without similar characteristics.
Substitutions among similar residues are likely to preserve the
essential functional and structural features. However, substitutions
between residues of different physicochemical properties are more
likely to cause disruptions to the structure and function.
26. • Amino acid scoring matrices
– 20 x 20 matrices that reflect the likelihood of residue substitutions
• 2 types of amino acid substitution matrices.
– (i) based on interchangeability of the genetic code or amino acid
properties,
» relies on the genetic code or the physicochemical features of
amino acids rather than on observed substitutions → less accurate
– (ii) derived from empirical studies of amino acid substitutions.
» surveys of actual amino acid substitutions among related
proteins.
» PAM and BLOSUM matrices derived from actual alignments of
highly similar sequences. By analyzing the probabilities of
amino acid substitutions in these alignments, a scoring system
can be developed by giving a high score for a more likely
substitution and a low score for a rare substitution.
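One common way to turn observed substitution probabilities into scores is a log-odds ratio, sketched below. The formula (scaled log2 of the observed pair frequency over the frequency expected by chance) is the usual textbook form and is an assumption here, not something stated on the slide; the function name and the example frequencies in the note are hypothetical.

```python
import math


def log_odds_score(pair_freq: float, freq_i: float, freq_j: float,
                   scale: float = 2.0) -> int:
    """Score = scale * log2(observed pair frequency / expected by chance).

    pair_freq : how often residues i and j are aligned in trusted alignments
    freq_i, freq_j : background frequencies of the two residues
    scale : 2.0 gives half-bit units (an illustrative default)
    """
    expected = freq_i * freq_j
    return round(scale * math.log2(pair_freq / expected))
```

With hypothetical background frequencies of 0.1 each, a pair seen four times more often than chance scores positive, a pair seen at the chance rate scores 0, and a pair seen four times less often scores negative.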
27. • PAM matrices (Dayhoff PAM matrices)
• point accepted mutation
Correspondence of PAM Numbers with Observed
Amino Acid Mutational Rates
28. • BLOSUM matrices
• The blocks amino acid substitution matrices (BLOSUM) series
– Motivation: in PAM matrix construction, the only direct observation of
residue substitutions is in PAM1, based on a relatively small set of
extremely closely related sequences. Sequence alignment statistics
for more divergent sequences are not available.
– All BLOSUM matrices are derived from direct observation of every possible amino
acid substitution in multiple sequence alignments.
• Instead of using an extrapolation function, the BLOSUM numbers are the actual % identity
values of the sequences selected for construction of the matrices.
29. PAM250 amino acid substitution matrix. Residues are
grouped according to physicochemical similarities.
31. • Comparison of PAM and BLOSUM
• Main differences
– PAM matrices, except PAM1, are derived from an evolutionary model
– BLOSUM matrices consist entirely of direct observations.
» BLOSUM matrices are entirely derived from local sequence
alignments of conserved sequence blocks,
» whereas the PAM1 matrix is based on the global alignment of full-length
sequences composed of both conserved and variable regions. →
BLOSUM matrices are more advantageous in searching databases and
finding conserved domains in proteins.
• Results of several empirical comparisons
– BLOSUM matrices outperform the PAM matrices in terms of accuracy of
local alignment, largely because BLOSUM matrices are derived from a
much larger and more representative dataset than the one used to derive
the PAM matrices. → BLOSUM matrices are more reliable.
– Revised matrices have also been devised, e.g., the Gonnet matrices and the Jones–Taylor–Thornton
matrices, which are particularly robust in phylogenetic tree construction.
33. Statistical Significance of Sequence Alignment
• Concept
• A statistical test to find true evidence of homology
– Test procedure
• Interpreting the P-value resulting from the test
– P-value < 10^-100 → an exact match between the two sequences.
– 10^-100 < P-value < 10^-50 → a nearly identical match.
– 10^-50 < P-value < 10^-5 → sequences having clear homology.
– 10^-5 < P-value < 10^-1 → possible distant homologs.
– P-value > 10^-1 → the two sequences may be randomly related.
– However, sometimes truly related protein sequences may lack the
statistical significance at the sequence level owing to fast divergence
rates. Their evolutionary relationships can nonetheless be revealed at
the three-dimensional structural level.
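The P-value ranges on this slide can be written as a small lookup. This is only a restatement of the rule-of-thumb thresholds as code; the function name is made up, and assigning each boundary value to the less significant category is an assumption, since the slide leaves the boundaries open.

```python
def interpret_p_value(p: float) -> str:
    """Map an alignment P-value to the rule-of-thumb categories on the slide."""
    if p < 1e-100:
        return "exact match"
    if p < 1e-50:
        return "nearly identical match"
    if p < 1e-5:
        return "clear homology"
    if p < 1e-1:
        return "possible distant homologs"
    return "may be randomly related"
```

As the slide notes, a weak category here does not rule out homology: fast-diverging proteins may only show their relationship at the structural level.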
34. Database Similarity Searching
• Requirements for database searching
• Heuristic searching
• Basic Local Alignment Search Tool (BLAST)
– Variants
– Statistical Significance
– Low Complexity Regions
– BLAST Output Format
• FASTA
– Statistical significance
• Comparison of FASTA and BLAST
• Database searching with the Smith–Waterman method
35. General Overview
• Database searching
• Uses pairwise alignment to retrieve biological sequences from DBs based on
similarity.
– A query is compared pairwise with every individual sequence in a
database; database similarity searching is pairwise alignment on a large
scale.
– However, DP is slow and impractical to use in most cases. Special search
methods are needed to speed up the computational process.
• Requirements for database searching
• Sensitivity: the ability to find as many correct hits (“true positives”) as possible
• Specificity: the ability to exclude incorrect hits (“false positives”)
• Speed
– Types of algorithms
• Exhaustive type: examines all mathematical combinations (e.g., DP)
• Heuristic type: finds an empirical or near-optimal solution using rules of
thumb
36. Heuristic Searching
• (…)
– BLAST
– FASTA
– word method
• Both BLAST and FASTA use a heuristic “word method” for fast
pairwise sequence alignment.
37. Basic Local Alignment Search Tool (BLAST)
• Goal
– Find high-scoring ungapped segments; segments scoring
above a given threshold indicate pairwise similarity beyond random
chance.
Example of alignment scoring with the BLOSUM62 matrix
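The idea of finding a high-scoring ungapped segment from a word hit can be sketched as follows. This is a simplified seed-and-extend illustration, not BLAST itself: it seeds on exact word matches only, scores with a flat match/mismatch scheme instead of BLOSUM62, and stops an extension once the score falls more than `x_drop` below its best value. All function names and parameters are hypothetical.

```python
def word_hits(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where identical k-letter words occur."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return [(i, j)
            for j in range(len(subject) - k + 1)
            for i in index.get(subject[j:j + k], [])]


def _extend(a: str, b: str, match: int, mismatch: int, x_drop: int) -> int:
    """Best score gained by walking a and b in parallel without gaps."""
    score = best = 0
    for x, y in zip(a, b):
        score += match if x == y else mismatch
        best = max(best, score)
        if best - score > x_drop:  # extension is no longer promising
            break
    return best


def best_ungapped_segment(query: str, subject: str, k: int = 3, match: int = 1,
                          mismatch: int = -1, x_drop: int = 2) -> int:
    """Score of the best ungapped segment grown from any exact word hit."""
    top = 0
    for i, j in word_hits(query, subject, k):
        right = _extend(query[i + k:], subject[j + k:], match, mismatch, x_drop)
        left = _extend(query[:i][::-1], subject[:j][::-1], match, mismatch, x_drop)
        top = max(top, k * match + left + right)
    return top
```

A segment would then be reported only if its score exceeds a chosen threshold, which is where the statistical assessment comes in.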
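39. • Statistical significance
– The larger the DB, the more alignments with unrelated sequences occur by chance.
→ A new parameter is needed that takes into account the total number of sequence
alignments conducted, which is proportional to the size of the database.
• In BLAST searches, this is the E-value (expectation value)
– It indicates how likely it is that the alignments resulting from a DB
search are caused by random chance; more precisely, it is the expected number of chance alignments scoring at least as well.
– The E-value is related to the P-value used to assess the significance of a single
pairwise alignment. BLAST compares a query sequence against all
database sequences, and so the E-value is determined by: E = m × n × P,
where m is the total number of residues in the database, n is the number of residues in the query, and P is the probability that an alignment is the result of random chance.
– (ex) …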
• A bit score
– Measures sequence similarity independent of query sequence length
and DB size and is normalized based on the raw pairwise alignment
score
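The relationship between raw score, bit score, and E-value can be sketched with the standard Karlin–Altschul forms, S' = (λS − ln K) / ln 2 and E = m × n × 2^(−S'). These formulas are not given on the slide and are brought in here as an assumption; λ and K depend on the scoring system and must be supplied by the caller, and the function names are made up.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits with at least this bit score: m * n * 2^-S'."""
    return query_len * db_len * 2.0 ** (-bits)
```

Because the bit score already absorbs λ and K, it can be compared across searches, while the E-value still grows linearly with the query length and the database size.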
40. • Low Complexity Regions (LCRs)
• For both protein and DNA sequences, there may be regions that
contain highly repetitive residues, such as short segments of
repeats, or segments that are overrepresented by a small number
of residues.
– LCRs are rather prevalent in DB sequences; they make up about 15% of the total
protein sequences in public databases. → They cause spurious DB matches and
lead to artificially high alignment scores with unrelated sequences.
• To avoid high similarity scores that arise merely from matching
LCRs, filter out the problematic regions in both query and DB
sequences to improve the signal-to-noise ratio (= masking).
• 2 types of masking: hard and soft.
• SEG detects and masks repetitive elements before DB
searches are executed.
– SEG has been integrated into the web-based BLAST program.
• BLAST Output Format
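Returning to the masking of low complexity regions, the idea can be sketched with a sliding-window Shannon entropy filter. This is not the SEG algorithm; it is a simplified stand-in, and the window size, the entropy cutoff, and the function names are all assumptions chosen for illustration. Soft masking is shown as lowercasing; hard masking would replace the residues with a placeholder such as X or N instead.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def soft_mask(seq: str, window: int = 12, max_entropy: float = 1.5) -> str:
    """Lowercase every residue covered by a low-entropy window."""
    upper = seq.upper()
    masked = list(upper)
    for i in range(len(upper) - window + 1):
        if shannon_entropy(upper[i:i + window]) <= max_entropy:
            for j in range(i, i + window):
                masked[j] = masked[j].lower()
    return "".join(masked)
```

A search program that honors soft masking would skip the lowercase residues when seeding but could still extend alignments through them.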
42. FASTA
• (…)
• The first database similarity search tool
• Finds matches for short stretches of identical residues of
length k (a “hashing” approach).
– These strings of residues (ktuples or ktups) are equivalent to words in
BLAST, but are normally shorter than words. Typically, a ktup is
composed of two residues for protein sequences and six residues for
DNA sequences.
• Similar to BLAST, FASTA has a number of subprograms.
43. Procedure of ktup identification using the hashing strategy by FASTA. Identical
offset values between residues of the two sequences allow the formation of ktups.
44. Steps of the FASTA alignment procedure. In step 1 (left ), all possible ungapped
alignments are found between two sequences with the hashing method. In step 2
(middle), the alignments are scored according to a particular scoring matrix. Only
the ten best alignments are selected. In step 3 (right ), the alignments in the same
diagonal are selected and joined to form a single gapped alignment, which is
optimized using the dynamic programming approach.
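The hashing step in the two captions above can be sketched as a lookup table plus a count of offsets. This covers only the first step (finding ktups that share a diagonal); the rescoring, joining, and dynamic programming steps are left out, and the function name `ktup_offsets` is made up for this example.

```python
from collections import Counter, defaultdict


def ktup_offsets(query: str, subject: str, ktup: int = 2) -> Counter:
    """Count identical ktups per offset (query position minus subject position)."""
    # Hash table: each ktup in the query -> list of positions where it occurs.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Scan the subject and record the offset of every identical ktup.
    offsets = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            offsets[i - j] += 1
    return offsets
```

Ktups with the same offset lie on the same diagonal of the dot matrix, so the most frequent offsets point to the candidate ungapped alignments.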
45. • Statistical significance
• FASTA also uses E-values and bit scores.
– essentially the same as in BLAST, but the FASTA output provides one
more statistical parameter, the Z-score.
» Because most of the alignments with the query sequence are
with unrelated sequences, the higher the Z-score for a reported
match, the further away from the mean of the score distribution,
hence, the more significant the match.
» For a Z-score > 15, the match can be considered extremely
significant, with certainty of a homologous relationship.
» If Z is in the range of 5 to 15, the sequence pair can be
described as highly probable homologs.
» If Z < 5, their relationship is described as less certain.
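The Z-score logic can be sketched as the distance of a match score from the mean of a background score distribution, in units of standard deviation. How the background scores are obtained is an assumption here (for example, scores against unrelated sequences), and the function names are made up; the thresholds are the ones listed on the slide.

```python
from statistics import mean, stdev


def z_score(score: float, background_scores: list) -> float:
    """Standard deviations between a score and the mean of the background scores."""
    return (score - mean(background_scores)) / stdev(background_scores)


def interpret_z(z: float) -> str:
    """Apply the rule-of-thumb Z-score thresholds from the slide."""
    if z > 15:
        return "extremely significant"
    if z >= 5:
        return "highly probable homologs"
    return "less certain"
```

Since most database sequences are unrelated to the query, the background distribution is dominated by chance scores, which is what makes a large Z informative.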
46. Comparison of FASTA and BLAST
• (…)
• BLAST and FASTA perform equally well in regular DB searching.
• Differences (notably in the seeding step)
– BLAST uses a substitution matrix to find matching words
» use of low-complexity masking in BLAST → higher specificity
than FASTA because potential false positives are reduced.
» BLAST sometimes gives multiple best-scoring alignments from
the same sequence.
– FASTA identifies identical matching words using the hashing procedure.
» By default, FASTA scans smaller window sizes. → more sensitive
results than BLAST, with a better coverage rate for homologs.
However, it is usually slower than BLAST.
» FASTA returns only one final alignment.
47. Multiple Sequence Alignment
• Scoring function
• Exhaustive Algorithms
• Heuristic Algorithms
– Progressive Alignment Method
– Drawbacks and Solutions
– Iterative Alignment
– Block-Based Alignment
• Considerations
– Protein-Coding DNA Sequences
– Editing
– Format Conversion
48. • Concept
• generation of multiple matching sequence pairs → convert
numerous pairwise alignments into a single alignment → arrange
sequences in such a way that evolutionarily equivalent positions
across all sequences are matched.
• Advantages
– reveals more biological information than pairwise alignments can.
– applications in designing degenerate PCR primers based on multiple
related sequences.
• DP vs. Heuristic
– the amount of computing time and memory DP requires increases
exponentially as the number of sequences increases. In practice,
heuristic approaches are most often used.
49. Scoring Function
• (…)
• The goal of MSA is to arrange sequences so that a maximum number of
residues from each sequence are matched up according to a
particular scoring function.
» The scoring function is the sum of pairs (SP): the sum of the scores of all possible pairs of sequences in
a multiple alignment, based on a particular scoring matrix.
– In calculating SP scores, each column is scored by summing the
scores for all possible pairwise matches, mismatches and gap costs.
The score of the entire alignment is the sum of all of column scores.
– The purpose of most multiple sequence alignment algorithms is to
achieve maximum SP scores.
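The column-by-column SP calculation can be sketched as follows. It assumes a flat match/mismatch/gap scheme instead of a real substitution matrix and scores a gap aligned with a gap as 0; both are simplifications chosen for illustration, and the function name is made up.

```python
from itertools import combinations


def sum_of_pairs(alignment: list, match: int = 1,
                 mismatch: int = -1, gap: int = -2) -> int:
    """Sum the pairwise scores of every column of a multiple alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    total = 0
    for column in zip(*alignment):
        # Score every possible pair of residues in this column.
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap against gap: scored 0 (assumption)
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

For the three-row alignment `["ACG", "A-G", "ATG"]` the outer columns each contribute three matches and the middle column contributes two gap costs and one mismatch.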
51. Heuristic Algorithms
• (3 categories)
– Progressive Alignment Method
– Iterative Alignment
– Block-Based Alignment
• Progressive Alignment Method
– Drawbacks and Solutions
Schematic of a typical progressive alignment procedure (e.g., Clustal).
Angled wavy lines represent consensus sequences for sequence pairs A/B
and C/D. Curved wavy lines represent a consensus for A/B/C/D.
53. Conversion of a sequence alignment into a graphical profile in
the Poa algorithm. Identical residues in the alignment are
condensed as nodes in the partial order graph.
54. • Iterative Alignment
• Block-Based Alignment
Schematic of iterative alignment procedure for PRRN, which
involves two sets of iterations.