
# Algorithmic Information Theory and Computational Biology

I present cutting-edge concepts and tools drawn from algorithmic information theory (AIT) for new-generation genetic sequencing, network biology and bioinformatics in general. AIT is the most advanced mathematical theory of information, formally characterising the concepts of simplicity, randomness and structure and the differences between them. Measures from AIT will empower computational medicine and systems biology to deal with big data and sophisticated analytics, and will provide a powerful new framework for understanding.

### Algorithmic Information Theory and Computational Biology

1. Algorithmic Information Theory and Computational Biology. Hector Zenil, Unit of Computational Medicine, Karolinska Institutet, Sweden. Slide footer: Hector Zenil, AIT Tools for Biology and Medicine.
2. Complex Adaptive Systems (CAS)
3. Complexity is hard to quantify in biology. Mapping quantitative stimuli to qualitative behaviour.
4. Information Theory in Biology: sequence alignment, pattern recognition, sequence logos, binding site detection, motif detection, consensus sequences, biological significance. [Based on Claude Shannon’s Information Theory, 1948]
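5. Algorithmic Information Theory. Which sequence looks more random? (a) AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA (b) AGGTCGTGAAGTGCGATGGCCTTACGTAGC (c) GCGCGCGCGCGCGCGCGCGCGCGCGCGC. Classical probability theory vs. Kolmogorov complexity: classical probability assigns every sequence of a given length the same probability, so it cannot tell these cases apart. Definition: $K_U(s) = \min\{|p| : U(p) = s\}$ (1), where $p$ is a program and $U$ a universal Turing machine, so $K_U(s)$ is the length of the shortest program that outputs $s$. Compressibility: a sequence $s$ is $c$-compressible if some program $p$ producing it satisfies $|p| + c \le |s|$; such a sequence has low Kolmogorov complexity. A sequence is random if $K(s) \approx |s|$. [Kolmogorov (1965); Chaitin (1966)]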
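A minimal sketch of the compressibility idea, assuming a general-purpose lossless compressor (Python's `zlib`, which implements Deflate) as a stand-in for the shortest program. The compressed size is only an upper bound on K, up to a constant, and for strings this short the compressor's fixed overhead is a large part of the result. The helper name `compressed_bits` is hypothetical.

```python
import zlib

def compressed_bits(sequence):
    """Size in bits of the zlib-compressed sequence: a rough upper bound on K."""
    return 8 * len(zlib.compress(sequence.encode("ascii"), 9))

# The three sequences from the slide
sequences = {
    "a": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "b": "AGGTCGTGAAGTGCGATGGCCTTACGTAGC",
    "c": "GCGCGCGCGCGCGCGCGCGCGCGCGCGC",
}

for label, seq in sequences.items():
    # Compare the raw size (8 bits per character) with the compressed size
    print(label, 8 * len(seq), compressed_bits(seq))
```

The regular sequences (a) and (c) should compress to fewer bits than the irregular sequence (b), in line with the intuition that (b) looks more random.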
6. Examples. Example 1: Sequences like (a) have low algorithmic complexity because they allow a short description, for example “20 times A”. No matter how long (a) grows, the description “k times A” grows only by about $\log_2(k)$ bits. Example 2: The sequence (b) is algorithmically random because it does not seem to allow a description (much) shorter than (b) itself. A proof of non-randomness requires exhibiting a short program, as for sequence (a). Compressibility is therefore a sufficient test of non-randomness.
7. Example of an evaluation of K. The sequence (c) GCGCGC...GC is not algorithmically random (it has low K complexity) because it can be produced by the following program (take G=0 and C=1):

       Program A(i):
       1: n := 0
       2: Print n mod 2
       3: n := n + 1
       4: If n = i Goto 6
       5: Goto 2
       6: End

   The length of A (in bits) is an upper bound of K(GCGCGC...GC).
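Program A can be sketched in Python as follows. This is an illustrative translation, not the author's code; the function name `program_a` is hypothetical, and it returns the symbols G and C directly instead of printing 0 and 1. It assumes i ≥ 1, as the original loop does.

```python
def program_a(i):
    """Return the first i symbols of GCGC..., using G = 0 and C = 1."""
    symbols = "GC"
    # n mod 2 alternates 0, 1, 0, 1, ... exactly as in steps 2 and 3 of Program A
    return "".join(symbols[n % 2] for n in range(i))

print(program_a(28))
```

The program text stays the same size however large `i` becomes; only the input `i` grows, and it needs about log2(i) bits.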
8. The ultimate measure of pattern detection and optimal prediction. Kolmogorov and Chaitin, Schnorr, and Martin-Löf independently provided 3 different approaches to randomness (compression, predictability and typicality). They proved (for infinite sequences): incompressibility ⇐⇒ unpredictability ⇐⇒ typicality. When this happens in mathematics, a concept (here, randomness) has been objectively captured. This is why prediction in biology is hard: AIT tells us that no effective statistical test will succeed in recognising all patterns and no computable technique can fully predict all outcomes. The problem is deeply connected to computability and algorithmic information theory. [Solomonoff (1964); Kolmogorov (1965); Chaitin (1969)]
9. Information distances and similarity metrics. Measures waiting to be introduced in bioinformatics:
   - Information Distance: $ID(x, y) = \max\{K(x|y), K(y|x)\}$
   - Universal Similarity Metric: $USM(x, y) = \max\{K(x|y), K(y|x)\} / \max\{K(x), K(y)\}$
   - Normalised Information Distance and NCD: $NCD(x, y) = (K(xy) - \min\{K(x), K(y)\}) / \max\{K(x), K(y)\}$
   - Normalized Compression Measure (NCM): $NC(s) = K(s)/|s|$ (asymptotic behaviour)
   - Bennett’s Logical Depth: $LD_d(s) = \min\{t(p) : (|p| - |p^*| < d) \text{ and } (U(p) = s)\}$ (for an example application, see Zenil, Complexity 2011)
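A rough sketch of how the NCD formula is used in practice, assuming the uncomputable K is replaced by the output length of a real compressor (here `zlib`). The function names `c` and `ncd` and the test sequences are hypothetical choices for illustration.

```python
import random
import zlib

def c(data):
    """Compressed length in bytes, standing in for K."""
    return len(zlib.compress(data, 9))

def ncd(x, y):
    """Normalised compression distance between two byte strings."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# A periodic sequence and a pseudo-random one over the DNA alphabet
periodic = b"GC" * 200
rng = random.Random(0)  # fixed seed so the example is reproducible
scrambled = bytes(rng.choice(b"ACGT") for _ in range(400))

print(ncd(periodic, periodic))   # near 0: an object is close to itself
print(ncd(periodic, scrambled))  # near 1: little shared information
```

Because real compressors are imperfect, the values are approximate and can slightly exceed 1.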
10. Non-systematic but successful attempts in biology. GenCompress is a compression algorithm for DNA sequences, with distance $d(x, y) = 1 - (K(x) - K(x|y))/K(xy)$. NCD has been applied to genetic similarity. AIT looks at the genome as information, not as data (letters). Counting: traditional Shannon-entropy style sequencing. Interpreting: AIT. The full power of the theory hasn’t yet been unleashed.
11. To be or not to be... Borel’s “Infinite Monkey” theorem. [Figure labels: Input; 1, 0, 1024, π, Syntax error, √2, ∞, CH3, ∞; “To be or not to be, that is the question.”]
12. Algorithmic probability
13. Producing π. This C-language code produces the first 1000 digits of π (Gjerrit Meinsma):

        long k=4e3,p,a[337],q,t=1e3;
        main(j){for(;a[j=q=0]+=2,--k;)
        for(p=1+2*k;j<337;q=a[j]*k+q%p*t,a[j++]=q/p)
        k!=j>2?:printf("%.3d",a[j-2]%t+q/p/t);}

    Producing non-random sequences: if an object has low Kolmogorov complexity then it has a short description and a greater probability of being produced by a random program. The less random a string, the more likely it is to be produced by a short program.
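The link between short descriptions and high production probability can be stated with the standard definitions of algorithmic probability and the Coding Theorem (Solomonoff; Levin), written here for a prefix-free universal machine $U$. The symbol $m(s)$ is introduced for this note and does not appear on the slide.

```latex
m(s) = \sum_{p \,:\, U(p) = s} 2^{-|p|},
\qquad
K(s) = -\log_2 m(s) + O(1)
```

Each program $p$ that outputs $s$ contributes $2^{-|p|}$, so short programs dominate the sum; a string with a short program therefore has high $m(s)$ and low $K(s)$.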
14. Biological Big Data Analysis. The information bottleneck. Small Data matters: local measurements of information content are a good indication of the global information content of an object. Evidence: BDM image classification. Compression works at large scales, looking for long regularities, while BDM is very local. Yet both yield astonishingly similar results for these object sizes.
15. Complementary methods for different sequence lengths. The methods to approximate K coexist and complement each other for different sequence lengths.

    | Method | Short strings (< 100 bits) | Long strings (> 100 bits) | Scalability |
    | --- | --- | --- | --- |
    | Lossless compression method | × | √ | √ |
    | Coding Theorem method | √ | × | × |
    | Block Decomposition method | √ | √ | √ |

    [Zenil, Soler, Delahaye, Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility (2012)]
16. Coding Theorem method and lossless compression. The transition between one method and the other: what is complex for the Coding Theorem method is less compressible. [Soler, Zenil, Delahaye, Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures (2012)]
17. Online Algorithmic Complexity Calculator. Provides: Shannon’s entropy, lossless compression (Deflate) values, Kolmogorov complexity approximations and relative frequency order (algorithmic probability). A Mathematica API and an R module. Datasets available online at the Dataverse Network. Basic data analysis tool for short sequence comparison. [http://www.complexitycalculator.com]
18. Online Algorithmic Complexity Calculator 2 [http://www.complexitycalculator.com]
19. Simulation of natural systems w/complex symbolic systems. An elementary cellular automaton (ECA) is defined by a local function $f : \{0, 1\}^3 \to \{0, 1\}$. $f$ maps the state of a cell and its two immediate neighbours (range = 1) to a new cell state: $f_t : (r_{-1}, r_0, r_{+1}) \to r_0$. Cells are updated synchronously according to $f$ over all cells in a row. [Wolfram, (1994)]
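The ECA update rule can be sketched as below, using Wolfram's rule numbering, in which bit k of the rule number gives the new state for the neighbourhood whose three cells spell k in binary. The function names and the periodic (wrap-around) boundary are assumptions made for this illustration.

```python
def eca_step(row, rule):
    """Apply one synchronous update of an elementary cellular automaton."""
    width = len(row)
    new_row = []
    for i in range(width):
        left = row[(i - 1) % width]    # periodic boundary (assumption)
        centre = row[i]
        right = row[(i + 1) % width]
        index = 4 * left + 2 * centre + right  # neighbourhood as a number 0..7
        new_row.append((rule >> index) & 1)    # look up that bit of the rule
    return new_row

def evolve(row, rule, steps):
    """Return the list of rows from the initial row through `steps` updates."""
    rows = [list(row)]
    for _ in range(steps):
        rows.append(eca_step(rows[-1], rule))
    return rows

# Rule 110 from a single black cell, drawn as text
start = [0] * 30 + [1]
for r in evolve(start, 110, 15):
    print("".join("#" if cell else "." for cell in r))
```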
20. Behavioural classes of CA. Wolfram’s classes of behaviour: Class I: systems evolve into a stable state. Class II: systems evolve into a periodic (e.g. fractal) state. Class III: systems evolve into random-looking states. Class IV: systems evolve into localised complex structures, e.g. Rule 110 or the Game of Life. [Wolfram, (1994)]
21. Block Decomposition method (BDM). The Block Decomposition method uses the Coding Theorem method. Formally, we will say that an object $c$ has complexity:

    $$K^{d \times d}_{\log m, 2D}(c) = \sum_{(r_u, n_u) \in c_{d \times d}} (n_u - 1) \log_2(K_{m,2D}(r_u)) + K_{m,2D}(r_u) \quad (2)$$

    where $c_{d \times d}$ represents the set with elements $(r_u, n_u)$, obtained from decomposing the object into blocks of $d \times d$ with boundary conditions. In each $(r_u, n_u)$ pair, $r_u$ is one such square and $n_u$ its multiplicity. [H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
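A one-dimensional sketch of the block decomposition sum, following the formula as stated on the slide. The function name `bdm`, the handling of a trailing partial block, and above all the values in `TOY_CTM` are assumptions: real Coding Theorem method values come from precomputed tables, and the numbers below are placeholders chosen only to make the arithmetic easy to follow.

```python
from collections import Counter
from math import log2

def bdm(sequence, block_size, ctm):
    """Sum (n_u - 1) * log2(K(r_u)) + K(r_u) over the distinct blocks r_u."""
    # Non-overlapping blocks; a trailing partial block is dropped (assumption)
    blocks = [sequence[i:i + block_size]
              for i in range(0, len(sequence) - block_size + 1, block_size)]
    counts = Counter(blocks)  # maps each block r_u to its multiplicity n_u
    return sum((n - 1) * log2(ctm[block]) + ctm[block]
               for block, n in counts.items())

# Hypothetical placeholder complexities for 2-bit blocks, NOT real CTM data
TOY_CTM = {"00": 4.0, "01": 8.0, "10": 8.0, "11": 4.0}

print(bdm("00000000", 2, TOY_CTM))  # one block repeated four times
print(bdm("00011011", 2, TOY_CTM))  # four different blocks, each seen once
```

With these placeholder values, the repeated-block string scores lower than the string made of four distinct blocks, because repeats add only a logarithmic term.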
22. Classification of ECA by BDM versus lossless compression. Compressors have limitations (small sequences, time complexity). Applications to machine learning: problems of classification and clustering. BDM is computationally efficient: it runs in $O(n^d)$ time, hence linear time ($d = 1$) for sequences. [H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, (2012)]
23. Asymptotic behaviour of complex systems [Zenil, Complex Systems (2010)]
24. Rule space of 3-symbol 1D CA [Zenil, Complex Systems (2011)]
25. Phase transition detection. Definition:

    $$c_n^t = \frac{|C(M_t(i_1)) - C(M_t(i_2))| + \dots + |C(M_t(i_{n-1})) - C(M_t(i_n))|}{t(n-1)}$$

    [Zenil, Complex Systems (2011)]
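A sketch of how this coefficient could be computed for an elementary cellular automaton, reading $C(M_t(i_j))$ as the compressed size of the evolution of system $M$ run for $t$ steps from initial condition $i_j$. The use of `zlib` as the compressor, the periodic boundary, the enumeration of initial conditions as binary expansions of 1..n, and all function names are assumptions for illustration; the cited paper may make different choices.

```python
import zlib

def eca_step(row, rule):
    """One synchronous ECA update with periodic boundaries (assumption)."""
    width = len(row)
    return [(rule >> (4 * row[(i - 1) % width] + 2 * row[i]
                      + row[(i + 1) % width])) & 1
            for i in range(width)]

def compressed_length(initial_row, rule, steps):
    """C(M_t(i)): compressed size in bytes of `steps` updates from `initial_row`."""
    rows = [list(initial_row)]
    for _ in range(steps):
        rows.append(eca_step(rows[-1], rule))
    text = "\n".join("".join(str(cell) for cell in r) for r in rows)
    return len(zlib.compress(text.encode("ascii"), 9))

def transition_coefficient(lengths, steps):
    """Sum of |C_j - C_(j+1)| over consecutive inputs, divided by t(n - 1)."""
    n = len(lengths)
    if n < 2 or steps < 1:
        raise ValueError("need at least two initial conditions and one step")
    total = sum(abs(a - b) for a, b in zip(lengths, lengths[1:]))
    return total / (steps * (n - 1))

def coefficient_for_rule(rule, width=32, n=16, steps=64):
    # Initial conditions i_1..i_n: binary expansions of 1..n (hypothetical ordering)
    initials = [[int(bit) for bit in format(j, f"0{width}b")]
                for j in range(1, n + 1)]
    lengths = [compressed_length(row, rule, steps) for row in initials]
    return transition_coefficient(lengths, steps)

for rule in (4, 110):
    print(rule, coefficient_for_rule(rule))
```

A system whose compressed evolutions barely change across initial conditions gets a coefficient near zero; one that reacts strongly to its input gets a larger value.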
26. A measure of programmability: $C_n^t(M) = \dfrac{\partial f(c_n^t)}{\partial t}$ (3) [Zenil, Complex Systems (2011)]
27. Examples. Figure: ECA Rule 4 has a low $C_n^t$ for randomly chosen $n$ and $t$ (it doesn’t react much to external stimuli). $\lim_{n,t \to \infty} C_n^t(R4) = 0$ [H. Zenil, Philosophy & Technology, (2013)]
28. Examples (cont.). Figure: ECA R110 has a large coefficient $C_n^t$ for sensible choices of $t$ and $n$, which is compatible with the fact that it has been proven capable of universal computation (for particular semi-periodic initial configurations). $\lim_{n,t \to \infty} C_n^t(R110) = 1$
29. Classification of graphs [Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]
30. Characterisation of complex networks. Complex networks built with preferential attachment algorithms preserve properties that are invariant under network size (connectedness, robustness) at a low cost, unlike random networks, which are costly in the number of links. [Zenil, Soler, Dingle, Graph Automorphism Estimation and Complex Network Topological Characterization by Algorithmic Randomness]
31. Biological case study: programmable porphyrin molecules. Much is known about the dynamics of these molecules, so one can perform Monte Carlo simulations based on these mathematical models and establish a correspondence between Wang tiles and simple molecules. [Joint work with ICOS, U. of Nottingham] [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
32. Quantitative dynamics of living systems. Aggregations with similar Kolmogorov complexity cluster in similar configurations. [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
33. Mapping output behaviour to external stimuli: parameter discovery. Parameter space P → target space T. Target space T: set a configuration from P that triggers the desired behaviour in T. To investigate: reduction of the parameter space; characterisation of the target space. [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
34. Robustness and pervasiveness. Concentration changes preserving behaviour: output parameters that have the highest impact can be tested in silico before experiments in materio. [G. Terrazas, H. Zenil and N. Krasnogor, Exploring Programmable Self-Assembly in Non DNA-based Molecular Computing]
35. Orthogonality. Specific concentrations producing certain behaviour, using the mathematical model, to be tested against empirical data.
36. Highlights and goals. Ultimate goal (a few years’ time): an information-theoretical toolbox for systems and synthetic biology [Complex3D Proteins Database (graph representation) & Z Chen et al. Lung cancer pathways in response to treatments.]. Pushing boundaries: a cutting-edge mathematical approach. Tools from complexity theory.
37. New Generation Sequence data analysis. Heavily driven by: explosion of experimental data; difficulties in data interpretation; new paradigms for knowledge extraction; data mining the behaviour of natural systems. Towards an AIT tool-kit for systems biology: a functional library of programmable biological modules with an SBML interface.
38. References:
    - J.-P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences, in Cristian Calude (ed.), Complexity and Randomness: From Leibniz to Chaitin, World Scientific, 2007.
    - J.-P. Delahaye and H. Zenil, Numerical Evaluation of the Complexity of Short Strings, Applied Mathematics and Computation, 2011.
    - H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility, arXiv:1212.6745 [cs.CC].
    - F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures, Numerical Algorithms (in 2nd revision).
    - F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit, Calculating Kolmogorov Complexity from the Frequency Output Distributions of Small Turing Machines, arXiv:1211.1302 [cs.IT].
    - H. Zenil, Compression-based Investigation of the Dynamical Properties of Cellular Automata and Other Systems, Complex Systems, vol. 19, no. 1, pp. 1-28, 2010.
39. References (cont.):
    - H. Zenil and J.A.R. Marshall, Some Aspects of Computation Essential to Evolution and Life, Ubiquity, 2012.
    - H. Zenil, What is Nature-like Computation? A Behavioural Approach and a Notion of Programmability, Philosophy & Technology (special issue on History and Philosophy of Computing), 2013.
    - H. Zenil, On the Dynamic Qualitative Behavior of Universal Computation, Complex Systems, vol. 20, no. 3, pp. 265-278, 2012.
    - H. Zenil, A Turing Test-Inspired Approach to Natural Computation, in G. Primiero and L. De Mol (eds.), Turing in Context II (Brussels, 10-12 October 2012), Historical and Contemporary Research in Logic, Computing Machinery and Artificial Intelligence, Proceedings published by the Royal Flemish Academy of Belgium for Science and Arts, 2013.
    - G.J. Chaitin, A Theory of Program Size Formally Identical to Information Theory, J. Assoc. Comput. Mach. 22, 329-340, 1975.
    - A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission, 1(1):1–7, 1965.
40. References (cont.):
    - L. Levin, Laws of information conservation (non-growth) and aspects of the foundation of probability theory, Problems of Information Transmission, 10(3):206–210, 1974.
    - M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 3rd ed., 2008.
    - R.J. Solomonoff, A formal theory of inductive inference: Parts 1 and 2, Information and Control, 7:1–22 and 224–254, 1964.
    - S. Wolfram, A New Kind of Science, Wolfram Media, 2002.