Algorithmic Information Theory and Computational Biology

I present cutting-edge concepts and tools drawn from algorithmic information theory (AIT) for new-generation genetic sequencing, network biology and bioinformatics in general. AIT is the most advanced mathematical theory of information, formally characterising the concepts of simplicity, randomness and structure and the differences between them. Measures from AIT can equip computational medicine and systems biology to deal with big data through sophisticated analytics and a powerful new framework for understanding.

Published in: Education

  1. Algorithmic Information Theory and Computational Biology. Hector Zenil, Unit of Computational Medicine, Karolinska Institutet, Sweden. Hector Zenil: AIT Tools for Biology and Medicine
  2. Complex Adaptive Systems (CAS)
  3. Complexity is hard to quantify in biology: mapping quantitative stimuli to qualitative behaviour
  4. Information Theory in Biology: sequence alignment, pattern recognition, sequence logos, binding site detection, motif detection, consensus sequences, biological significance [based on Claude Shannon's Information Theory, 1948]
  5. Algorithmic Information Theory
     Which sequence looks more random?
     (a) AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     (b) AGGTCGTGAAGTGCGATGGCCTTACGTAGC
     (c) GCGCGCGCGCGCGCGCGCGCGCGCGCGC
     Classical probability theory vs. Kolmogorov complexity.
     Definition: $K_U(s) = \min\{|p| : U(p) = s\}$ (1)
     Compressibility: a sequence s is c-compressible if some program p with U(p) = s satisfies $|p| + c \le |s|$. A sequence is random if $K(s) \approx |s|$.
     [Kolmogorov (1965); Chaitin (1966)]
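
The compressibility criterion on this slide can be tried out with a general-purpose compressor. The sketch below is only an illustration under stated assumptions: it uses `zlib` as a stand-in for the universal machine U, so the compressed length is an upper bound on K up to a constant, not K itself. The three sequences are lengthened analogues of (a), (b) and (c), because compressor header overhead dominates at around 30 symbols. The name `compressed_bits` and the length `N` are choices made for this example.

```python
import random
import zlib


def compressed_bits(seq: str) -> int:
    """Length in bits of the zlib-compressed sequence (an upper bound on K, up to a constant)."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


N = 3000  # assumed length; long enough for the compressor overhead to be negligible

seq_a = "A" * N                       # analogue of (a): one repeated symbol
seq_c = "GC" * (N // 2)               # analogue of (c): a period-2 pattern
rng = random.Random(0)                # fixed seed so the example is reproducible
seq_b = "".join(rng.choice("ACGT") for _ in range(N))  # analogue of (b): pseudo-random

for name, seq in [("a", seq_a), ("b", seq_b), ("c", seq_c)]:
    print(name, len(seq) * 8, "raw bits ->", compressed_bits(seq), "compressed bits")
```

The regular sequences shrink to a small fraction of their raw size, while the pseudo-random one stays close to the roughly 2 bits per symbol that a four-letter alphabet needs.
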
  6. Examples
     Example 1: Sequences like (a) have low algorithmic complexity because they allow a short description, for example "20 times A". However long (a) grows, a description of the form "k times A" grows only by about $\log_2(k)$ bits.
     Example 2: The sequence (b) is algorithmically random because it does not seem to allow a description (much) shorter than (b) itself.
     For a sequence such as (a), a proof of non-randomness only requires exhibiting a short program. Compressibility is therefore a sufficient test of non-randomness.
  7. Example of an evaluation of K
     The sequence (c) GCGCGC...GC is not algorithmically random (it has low K complexity) because it can be produced by the following program (take G=0 and C=1):
     Program A(i):
       1: n := 0
       2: Print n mod 2
       3: n := n+1
       4: If n=i Goto 6
       5: Goto 2
       6: End
     The length of A (in bits) is an upper bound of K(GCGCGC...GC).
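
Program A can be transcribed almost line for line. The following is a minimal sketch, assuming the encoding G=0 and C=1 given on the slide. The function name `program_a` and the `alphabet` parameter are introduced here for illustration. The `for` loop replaces the Goto structure and behaves the same for every i ≥ 1.

```python
def program_a(i: int, alphabet: str = "GC") -> str:
    """Emit i symbols, alternating alphabet[0] and alphabet[1] (n mod 2 selects the symbol)."""
    out = []
    for n in range(i):          # n := 0; ...; n := n+1; stop when n = i
        out.append(alphabet[n % 2])  # "Print n mod 2", mapped to G (0) or C (1)
    return "".join(out)


print(program_a(28))
```

The program text stays the same size whatever `i` is; only the encoding of `i` grows, which is why the length of A bounds K from above.
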
  8. The ultimate measure of pattern detection and optimal prediction
     Kolmogorov and Chaitin, Schnorr, and Martin-Löf independently provided 3 different approaches to randomness (compression, predictability and typicality). They proved (for infinite sequences):
     incompressibility ⇐⇒ unpredictability ⇐⇒ typicality
     When this happens in mathematics, a concept (here, randomness) has been objectively captured.
     This is why prediction in biology is hard. AIT tells us that no effective statistical test will succeed in recognising all patterns and no computable technique can fully predict all outcomes. The problem is deeply connected to computability and algorithmic information theory.
     [Solomonoff (1964); Kolmogorov (1965); Chaitin (1969)]
  9. Information distances and similarity metrics
     Measures waiting to be introduced in bioinformatics:
     Information Distance: $ID(x, y) = \max\{K(x|y), K(y|x)\}$
     Universal Similarity Metric: $USM(x, y) = \max\{K(x|y), K(y|x)\} / \max\{K(x), K(y)\}$
     Normalised Information Distance and NCD: $NCD(x, y) = (K(xy) - \min\{K(x), K(y)\}) / \max\{K(x), K(y)\}$
     Normalized Compression Measure (NCM): $NC(s) = K(s)/|s|$ (asymptotic behaviour)
     Bennett's Logical Depth: $LD_d(s) = \min\{t(p) : (|p| - |p^*| < d) \text{ and } (U(p) = s)\}$ (for an example application see Zenil, Complexity 2011)
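
In practice K in the NCD formula is replaced by the output length of a real compressor. The sketch below assumes `zlib` for that role; the helper names `c` and `ncd` are chosen for this example, and the test sequences are synthetic, not biological data.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes; stands in for K in the NCD formula."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(0)  # fixed seed for reproducibility
periodic = b"GC" * 500
scrambled = bytes(rng.choice(b"ACGT") for _ in range(1000))

print("periodic vs itself:   ", round(ncd(periodic, periodic), 3))
print("periodic vs scrambled:", round(ncd(periodic, scrambled), 3))
```

Values near 0 indicate that one sequence adds almost no information given the other; values near 1 indicate that the two share little compressible structure. With a real compressor the result can fall slightly outside [0, 1].
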
  10. Non-systematic but successful attempts in biology
      GenCompress is a compression algorithm for DNA sequences: $d(x, y) = 1 - (K(x) - K(x|y))/K(xy)$
      NCD has been applied to genetic similarity.
      AIT looks at the genome as information, not as data (letters). Counting: traditional Shannon-entropy style sequencing. Interpreting: AIT.
      The full power of the theory hasn't yet been unleashed.
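
The contrast between counting and interpreting can be made concrete with first-order Shannon entropy, which depends only on symbol frequencies. This is a small sketch; the function name `shannon_entropy` is introduced here, and the example strings are toy sequences.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str) -> float:
    """First-order Shannon entropy in bits per symbol, from symbol counts only."""
    n = len(seq)
    return sum((k / n) * log2(n / k) for k in Counter(seq).values())


alternating = "GC" * 15          # GCGCGC...: perfectly regular
blocked = "G" * 15 + "C" * 15    # same symbol counts, different arrangement

print(shannon_entropy("A" * 33))
print(shannon_entropy(alternating))
print(shannon_entropy(blocked))
```

The alternating and blocked strings get the same entropy because the measure only counts letters. An algorithmic measure instead asks how short a program producing the exact arrangement can be.
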
  11. To be or not to be... Borel's "Infinite Monkey" theorem
      [Figure labels: Input; 1, 0, 1024, π, Syntax error, √2, ∞, CH3, ∞, "To be or not to be, that is the question."]
  12. Algorithmic probability
  13. Producing π
      This C-language code produces the first 1000 digits of π (Gjerrit Meinsma):
        long k=4e3,p,a[337],q,t=1e3;
        main(j){for(;a[j=q=0]+=2,--k;)
        for(p=1+2*k;j<337;q=a[j]*k+q%p*t,a[j++]=q/p)
        k!=j>2?:printf("%.3d",a[j-2]%t+q/p/t);}
      Producing non-random sequences: if an object has low Kolmogorov complexity then it has a short description and a greater probability of being produced by a random program. The less random a string, the more likely it is to be produced by a short program.
  14. Biological Big Data Analysis
      The information bottleneck: small data matters. Local measurements of information content are a good indication of the global information content of an object.
      Evidence: BDM image classification. Compression works at large scales, looking for long regularities, while BDM is very local. Yet both yield astonishingly similar results for these object sizes.
  15. Complementary methods for different sequence lengths
      The methods to approximate K coexist and complement each other for different sequence lengths.
      Method                       | short strings (< 100 bits) | long strings (> 100 bits) | scalability
      Lossless compression method  | ×                          | √                         | √
      Coding Theorem method        | √                          | ×                         | ×
      Block Decomposition method   | √                          | √                         | √
      [Zenil, Soler, Delahaye, Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility (2012)]
  16. Coding Theorem method and lossless compression
      The transition between one method and the other. What is complex for the Coding Theorem method is less compressible.
      [Soler, Zenil, Delahaye, Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures (2012)]
  17. Online Algorithmic Complexity Calculator
      Provides: Shannon's entropy, lossless compression (Deflate) values, Kolmogorov complexity approximations and relative frequency order (algorithmic probability). A Mathematica API and an R module. Datasets available online at the Dataverse Network. Basic data analysis tool for short sequence comparison.
      [http://www.complexitycalculator.com]
  18. Online Algorithmic Complexity Calculator 2 [http://www.complexitycalculator.com]
  19. Simulation of natural systems with complex symbolic systems
      An elementary cellular automaton (ECA) is defined by a local function $f : \{0, 1\}^3 \to \{0, 1\}$. f maps the state of a cell and its two immediate neighbours (range = 1) to a new cell state: $f_t : r_{-1}, r_0, r_{+1} \to r_0$. Cells are updated synchronously according to f over all cells in a row.
      [Wolfram, (1994)]
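
The local update rule can be sketched in a few lines. The version below assumes Wolfram's standard rule numbering (the neighbourhood, read as a 3-bit number, indexes a bit of the rule number) and periodic boundary conditions, which the slide does not specify. The names `eca_step` and `evolve` are introduced for this example.

```python
def eca_step(cells: list[int], rule: int) -> list[int]:
    """Apply one synchronous ECA update to a row of 0/1 cells (periodic boundary assumed)."""
    n = len(cells)
    new_row = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right  # neighbourhood as a number 0..7
        new_row.append((rule >> index) & 1)          # that bit of the rule number is f's output
    return new_row


def evolve(rule: int, row: list[int], steps: int) -> list[list[int]]:
    """Return the initial row followed by `steps` successive updates."""
    rows = [row]
    for _ in range(steps):
        rows.append(eca_step(rows[-1], rule))
    return rows


start = [0] * 20 + [1] + [0] * 20
for r in evolve(110, start, 15):
    print("".join("#" if cell else "." for cell in r))
```
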
  20. Behavioural classes of CA
      Wolfram's classes of behaviour:
      Class I: Systems evolve into a stable state.
      Class II: Systems evolve into a periodic (e.g. fractal) state.
      Class III: Systems evolve into random-looking states.
      Class IV: Systems evolve into localised complex structures, e.g. Rule 110 or the Game of Life.
      [Wolfram, (1994)]
  21. Block Decomposition method (BDM)
      The Block Decomposition method uses the Coding Theorem method. Formally, we will say that an object c has complexity:
      $K\log_{m,2D_{d \times d}}(c) = \sum_{(r_u, n_u) \in c_{d \times d}} (n_u - 1) \log_2(K_{m,2D}(r_u)) + K_{m,2D}(r_u)$ (2)
      where $c_{d \times d}$ represents the set with elements $(r_u, n_u)$, obtained from decomposing the object into blocks of d × d with boundary conditions. In each $(r_u, n_u)$ pair, $r_u$ is one such square and $n_u$ its multiplicity.
      [H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
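
The decomposition and the sum in equation (2) can be sketched for one-dimensional sequences. This is a rough illustration under strong assumptions. It handles strings rather than d × d arrays. It drops any trailing partial block, which is one possible boundary condition. Most importantly, it substitutes a zlib-based placeholder for the Coding Theorem values $K_{m,2D}(r_u)$, which in the actual method come from precomputed tables. The names `stand_in_complexity` and `bdm` are hypothetical.

```python
import zlib
from collections import Counter
from math import log2


def stand_in_complexity(block: str) -> float:
    """PLACEHOLDER for K_m(r_u): zlib-compressed size in bits. NOT a Coding Theorem value."""
    return 8.0 * len(zlib.compress(block.encode("ascii"), 9))


def bdm(sequence: str, block_size: int, block_complexity=stand_in_complexity) -> float:
    """Sum over distinct blocks r_u with multiplicity n_u of (n_u - 1) * log2(K(r_u)) + K(r_u)."""
    # Non-overlapping blocks; a trailing partial block is dropped (assumed boundary condition).
    blocks = [sequence[i:i + block_size]
              for i in range(0, len(sequence) - block_size + 1, block_size)]
    total = 0.0
    for block, multiplicity in Counter(blocks).items():
        k = block_complexity(block)
        total += (multiplicity - 1) * log2(k) + k
    return total


repetitive = "0101" * 8
varied = "0000" "0001" "0010" "0011" "0100" "0101" "0110" "0111"
print(bdm(repetitive, 4), bdm(varied, 4))
```

Repeated blocks contribute only a logarithmic term each, so an object made of a few block types scores far lower than one whose blocks are all different. Passing a lookup into real Coding Theorem tables as `block_complexity` would replace the placeholder.
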
  22. Classification of ECA by BDM versus lossless compression
      Compressors have limitations (small sequences, time complexity).
      Applications to machine learning.
      Problems of classification and clustering.
      BDM is computationally efficient (runs in $O(n^d)$ time, hence linear (d = 1) time for sequences).
      [H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
  23. Asymptotic behaviour of complex systems [Zenil, Complex Systems (2010)]
  24. Rule space of 3-symbol 1D CA [Zenil, Complex Systems (2011)]
  25. Phase transition detection
      Definition: $c^t_n = \frac{|C(M_t(i_1)) - C(M_t(i_2))| + \dots + |C(M_t(i_{n-1})) - C(M_t(i_n))|}{t(n-1)}$
      [Zenil, Complex Systems (2011)]
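
A rough numerical reading of this definition follows, taking M to be an elementary cellular automaton and C to be a compressed length. Several choices here are assumptions made only for the sketch: `zlib` as the compressor C, periodic boundaries, a fixed row `width`, and initial conditions $i_k$ given by the binary expansion of k. The source does not state how the $i_k$ are enumerated. All function names are introduced for this example.

```python
import zlib


def eca_step(cells, rule):
    """One synchronous ECA update with periodic boundary (assumed)."""
    n = len(cells)
    return [(rule >> ((cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n])) & 1
            for i in range(n)]


def compressed_size(rule, init, t):
    """C(M_t(i)): zlib-compressed size in bytes of the evolution of `rule` from `init` over t steps."""
    rows, row = [init], init
    for _ in range(t):
        row = eca_step(row, rule)
        rows.append(row)
    text = "".join("".join(map(str, r)) for r in rows)
    return len(zlib.compress(text.encode("ascii"), 9))


def initial_condition(k, width):
    """Assumed enumeration: the k-th initial condition is k written in binary, zero-padded."""
    return [int(bit) for bit in format(k, f"0{width}b")]


def transition_coefficient(rule, n, t, width=64):
    """c^t_n: summed absolute differences of consecutive compressed sizes, divided by t(n-1)."""
    sizes = [compressed_size(rule, initial_condition(k, width), t) for k in range(1, n + 1)]
    return sum(abs(a - b) for a, b in zip(sizes, sizes[1:])) / (t * (n - 1))


for rule in (4, 30, 110):
    print(rule, transition_coefficient(rule, n=10, t=40))
```

The coefficient is large when small changes in the initial condition cause large changes in the compressed size of the evolution, and small when the system's behaviour is insensitive to its input.
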
  26. A measure of programmability
      $C^t_n(M) = \frac{\partial f(c^t_n)}{\partial t}$ (3)
      [Zenil, Complex Systems (2011)]
  27. Examples
      Figure: ECA Rule 4 has a low $C^t_n$ for randomly chosen n and t (it doesn't react much to external stimuli). $\lim_{n,t \to \infty} C^t_n(R4) = 0$
      [H. Zenil, Philosophy & Technology, (2013)]
  28. Examples (cont.)
      Figure: ECA Rule 110 has a large $C^t_n$ value for sensible choices of t and n, which is consistent with the fact that it has been proven capable of universal computation (for particular semi-periodic initial configurations). $\lim_{n,t \to \infty} C^t_n(R110) = 1$
  29. Classification of graphs [Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]
  30. Characterisation of complex networks
      Complex networks built with preferential-attachment algorithms preserve properties (connectedness, robustness) that are invariant under network size, and do so at a low cost, unlike random networks, which are costly in the number of links.
      [Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]
  31. Biological case study: Programmable Porphyrin molecules
      Much is known about the dynamics of these molecules, so one can perform Monte Carlo simulations based on these mathematical models and establish a correspondence between Wang tiles and simple molecules. [joint work with ICOS, U. of Nottingham]
      [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
  32. Quantitative dynamics of living systems
      Aggregations with similar Kolmogorov complexity cluster in similar configurations.
      [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
  33. Mapping output behaviour to external stimuli: parameter discovery
      Parameter Space P → Target Space T
      Target space T: set a configuration from P that triggers the desired behaviour in T.
      To investigate: reduction of the parameter space; characterisation of the target space.
      [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
  34. Robustness and pervasiveness
      Concentration changes preserving behaviour: output parameters that have the highest impact can be tested in silico before experiments in materio.
      [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
  35. Orthogonality
      Specific concentrations producing certain behaviour using the mathematical model to be tested against empirical data.
  36. Highlights and goals
      Ultimate goal (a few years' time): an information-theoretical toolbox for systems and synthetic biology [Complex3D Proteins Database (graph representation) & Z Chen et al. Lung cancer pathways in response to treatments.]
      Pushing boundaries. A cutting-edge mathematical approach. Tools from Complexity theory.
  37. New Generation Sequence data analysis
      Heavily driven by:
      Explosion of experimental data
      Difficulties in data interpretation
      New paradigms for knowledge extraction
      Data mining the behaviour of natural systems
      Towards an AIT tool-kit for systems biology, a functional library of programmable biological modules with an SBML interface.
  38. J.-P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences, in Cristian Calude (ed.), Complexity and Randomness: From Leibniz to Chaitin, World Scientific, 2007.
      J.-P. Delahaye and H. Zenil, Numerical Evaluation of the Complexity of Short Strings, Applied Mathematics and Computation, 2011.
      H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility, arXiv:1212.6745 [cs.CC]
      F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures, Numerical Algorithms (in 2nd revision)
      F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Calculating Kolmogorov Complexity from the Frequency Output Distributions of Small Turing Machines, arXiv:1211.1302 [cs.IT]
      H. Zenil, Compression-based Investigation of the Dynamical Properties of Cellular Automata and Other Systems, Complex Systems, Vol. 19, No. 1, pages 1-28, 2010.
  39. H. Zenil and J.A.R. Marshall, Some Aspects of Computation Essential to Evolution and Life, Ubiquity, 2012.
      H. Zenil, What is Nature-like Computation? A Behavioural Approach and a Notion of Programmability, Philosophy & Technology (special issue on History and Philosophy of Computing), 2013.
      H. Zenil, On the Dynamic Qualitative Behavior of Universal Computation, Complex Systems, vol. 20, No. 3, pp. 265-278, 2012.
      H. Zenil, A Turing Test-Inspired Approach to Natural Computation, in G. Primiero and L. De Mol (eds.), Turing in Context II (Brussels, 10-12 October 2012), Historical and Contemporary Research in Logic, Computing Machinery and Artificial Intelligence, Proceedings published by the Royal Flemish Academy of Belgium for Science and Arts, 2013.
      G.J. Chaitin, A Theory of Program Size Formally Identical to Information Theory, J. Assoc. Comput. Mach. 22, 329-340, 1975.
      A. N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission, 1(1):1–7, 1965.
  40. L. Levin, Laws of information conservation (non-growth) and aspects of the foundation of probability theory, Problems of Information Transmission, 10(3):206–210, 1974.
      M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 3rd ed., 2008.
      R.J. Solomonoff, A formal theory of inductive inference: Parts 1 and 2, Information and Control, 7:1–22 and 224–254, 1964.
      S. Wolfram, A New Kind of Science, Wolfram Media, 2002.
