An introduction to design, with an emphasis on applying its principles to scientific illustrations and posters. The opening lecture of the course "Non-destructive Design", given at the Summer School in Molecular and Theoretical Biology in Pushchino. (dynastybioschool.wordpress.com)
2. Disclaimer
• This lecture mercilessly borrows
• Ideas and examples from Robin Williams's book “The Non-designer’s Design Book”*
• Posters by participants of the school “Modern Biology & the Future of Biotechnologies”, 2013 and 2014**
*Copyright has not been respected. The lecturer is very ashamed…
**These people knew what they were getting into. They submitted their posters for critique at the school. For some of them, this helped make the posters better.
3. An excellent plan
1. Why do we need design?
• How much algebra must one learn to achieve harmony?
2. Four principles of design
• Contrast
• Repetition
• Alignment
• Proximity
3. Examples
4. Design is not for beauty
• … the important part must stand out and the unimportant must be subdued …
• Jan Tschichold, 1935
17. Repetition
• Repeat design elements. Repetition creates structure and is calming
• What can be repeated?
• Color
• Font
• Line thickness
• Sizes (of fonts, columns, images)
19. Modelling Leaf Shape Evolution with Gaussian Processes
N. A. Raharinirina, L. Rusaitis, H. Jackson, N. S. Jones, J. W. J. Anderson, M. Tsiantis, M. Cartolano and J. Hein‡
Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG, United Kingdom
‡ hein@stats.ox.ac.uk .
Motivation
Leaf shapes display a tremendous variation over their evolution, which makes them an attractive system to study. Our focus of investigation is to find some ways of quantifying this leaf shape diversity and to infer the existing phylogenetic trees from sample leaf data. Although there are many techniques available already for phylogenetic inference, in our implementation, we will take the edges of the leaves as a 2-D function, and assume that they come from a phylogenetic Gaussian process.
Varying the topology of the phylogeny that we assume the leaves come from, we intend to be able to select the correct one simply by maximum likelihood methods.
Representing Leaf Shapes
[Figure panels: a) Olimarabidopsis pumila b) Arabidopsis neglecta]
We quantify the leaves by taking a 2-D representation of them, and finding the distances from the vein to the edge of the leaf, as well as using the gradient or just the very tip of the leaf to compare the effectiveness of each different data type.
Gaussian Process Regression Model
We infer a Gaussian Process on our leaf data and find the mean and the covariance function of the GP.
Firstly, we analyse one leaf shape GP regression, and get covariance in space only:
\[ k(x, x'; l) = e^{-\frac{(x - x')^2}{2 l^2}} + \sigma^2 \delta(x - x'). \]
Then, to do a phylogenetic inference, we introduce a covariance in evolutionary time t for the leaves u and v:
\[ k(x_u, x'_v; l, t = (t_1, t_2)) = e^{-(t_1 + t_2)} e^{-\frac{(x_u - x'_v)^2}{2 l^2}} + \sigma^2 \delta(u - v)\, \delta(x_u - x'_v). \]
Maximizing the likelihood over (l, t) we find the most likely phylogeny:
\[ p(y \mid X, (l, t)) = \frac{1}{(2\pi)^{n/2} |K_y|^{1/2}} e^{-\frac{1}{2} (y - \mu)^T K_y^{-1} (y - \mu)}. \]
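The single-leaf part of this model can be sketched in a few lines of Python. This is only a minimal illustration, not the poster authors' code: it assumes NumPy, a zero mean, a white-noise variance `noise_var` added on the diagonal, and a simple grid search over candidate length scales; the function names are hypothetical.

```python
import numpy as np

def space_kernel(x, length_scale, noise_var=1e-2):
    """Squared-exponential covariance between positions x, plus white noise.

    noise_var is an assumed white-noise variance added on the diagonal.
    """
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * length_scale**2)) + noise_var * np.eye(len(x))

def log_likelihood(y, K, mean=0.0):
    """Log of the multivariate normal density N(y | mean, K)."""
    resid = y - mean
    chol = np.linalg.cholesky(K)                                    # K = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, resid))   # K^{-1} resid
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))                   # log |K|
    n = len(y)
    return -0.5 * resid @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

def best_length_scale(x, y, candidates):
    """Grid search: return the candidate length scale with the highest likelihood."""
    return max(candidates, key=lambda l: log_likelihood(y, space_kernel(x, l)))

# Toy usage: simulate one "leaf outline" from the GP and score a few length scales.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 40)
y = rng.multivariate_normal(np.zeros(len(x)), space_kernel(x, 1.0))
print(best_length_scale(x, y, [0.25, 0.5, 1.0, 2.0]))
```

Working with the log-likelihood and a Cholesky factor avoids forming the inverse and determinant of the covariance matrix directly, which is the usual numerically safer route. Extending the sketch to several leaves would mean multiplying the spatial term by the evolutionary-time factor for each pair of leaves.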
Inference on Simulated Data
Simulating 'leaves' from a GP for which we know all the relevant parameters, we can see how well we are able to recover them using our inference procedure. Most simulated data sets we tried this on gave reasonable results, and the estimate of the time between leaves was not overly sensitive to incorrect lengthscales.
[Figure: a) Proportion of correct trees vs. length scale; comparison with UPGMA (red). b) Estimate of total time of tree vs. length scale; allowed evolutionary time (red is true total time).]
The benchmark we are trying to beat is the proportion of times that the correct phylogeny was inferred using the UPGMA method, using a distance matrix given by the sum of squared distances between points on the leaves.
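The benchmark described above can be sketched as follows. This is a minimal pure-Python illustration of UPGMA (average-linkage clustering) on sum-of-squared-differences distances, not the authors' implementation; the function names and the toy profiles are hypothetical, and each leaf is assumed to be a list of measurements taken at matching positions.

```python
from itertools import combinations

def sum_sq_distance(a, b):
    """Sum of squared differences between matching points on two leaves."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def upgma(profiles):
    """Build a UPGMA tree from {name: [measurements]}; returns nested tuples."""
    clusters = {name: (name, 1) for name in profiles}        # key -> (tree, size)
    dist = {frozenset(pair): sum_sq_distance(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(profiles, 2)}
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        new = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            dist[frozenset((new, c))] = (
                n_a * dist[frozenset((a, c))] + n_b * dist[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[new] = ((tree_a, tree_b), n_a + n_b)
    tree, _ = next(iter(clusters.values()))
    return tree

# Toy usage with four made-up leaf profiles.
toy = {"A": [0, 0], "B": [0, 1], "C": [5, 5], "D": [5, 6]}
print(upgma(toy))
```

The same clustering is available as average linkage in SciPy (`scipy.cluster.hierarchy.linkage` with `method='average'`); the hand-written version is shown only to make the merging rule explicit.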
Simulating 4 leaves with a total of 15 possible phylogenies, we took 100 datasets. The proportion of phylogenies selected correctly by UPGMA was 0.385. At the correct lengthscale (l = 1), the proportion selected correctly by the Gaussian process inference was 0.53. So we can say with confidence that Gaussian Process regression performs better than UPGMA when we get the covariance structure correct. As our lengthscale guess gets further from the truth, though, the performance of the GP inference degrades substantially.
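The figure of 15 possible phylogenies for 4 leaves matches the number of rooted, labelled binary tree topologies, (2n - 3)!!. The short check below assumes that this is the counting convention used on the poster; the function name is hypothetical.

```python
def count_rooted_binary_trees(n_leaves):
    """Number of rooted, labelled binary topologies: (2n - 3)!! for n >= 2."""
    count = 1
    for k in range(3, 2 * n_leaves - 2, 2):   # 3 * 5 * ... * (2n - 3)
        count *= k
    return count

n_trees = count_rooted_binary_trees(4)
print(n_trees)          # number of candidate phylogenies for 4 leaves
print(1.0 / n_trees)    # proportion correct expected from random guessing
```

Under this convention, picking a tree at random would be right about 1 time in 15, so both reported proportions (0.385 and 0.53) are well above chance.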
Results on Real Leaf Data
[Figure panels: a) Original b) Polar Form c) Consensus d) Gradient e) Tip of the Leaf]
The previous analysis on the simulated data showed that it is possible to use the GP to infer phylogenies. Encouraged by this, we used a general squared exponential space covariance and a simple exponential covariance in time on a real sample of 5 leaves in the Arabidopsis family.
In the figures above, we present the maximum likelihood surfaces of all the different data type representations we used for the leaf shape. The green point is the maximum likelihood of the true phylogeny, so we can straightforwardly quantify the strength of our predictions.
The normal space covariance seems to give us reasonably good results for some data sets and very poor predictions for others. Therefore, the model is highly sensitive to the type of leaf shape representation.
[Figure: Tree Comparison. True tree (left) against our most likely inferred tree (right). Both trees span Olimarabidopsis pumila, Arabidopsis halleri, Arabidopsis lyrata, Arabidopsis neglecta and Arabidopsis thaliana. Panel label: "The tree has log likelihood = 67.75".]
Further Results and Extensions
Another area of interest was to investigate the consequences of assuming a non-homogeneous space covariance, by increasing correlation between special points on the leaves. We chose these points as the turning points in the leaf outlines by analysing the gradient, and we changed the space covariance matrix accordingly. We observed a drastic improvement in the prediction for some data types, particularly for the original, gradient and tip representations. Thus, provided we can find the right covariance structure that represents the leaf shape, we can make much better predictions.
[Figure panels: a) A significant improvement for true tree likelihood in original leaf representation b) Modified space-covariance matrix for leaves halleri, thaliana, pumila, neglecta]
The project points to many other areas still left to investigate, from studying these modified covariance matrices and hyperparameter sensitivity in more depth to experimenting with 2-dimensional regression models and other representations of the leaf shapes. Gaussian Process regression proves to be a powerful method worthy of more investigation.
References
[1] Nick S. Jones and John Moriarty (2010). Evolutionary Inference for Functional Data: Using Gaussian Processes on Phylogenies to Study Shape Evolution.
[2] C. E. Rasmussen and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press.
Acknowledgements
This work was carried out as part of the Oxford Summer School in Computational Biology, 2011, in conjunction with the Department of Plant Sciences, and with support from the Department of Zoology. Funding was provided by J. Hein’s PRA. We especially thank J. W. J. Anderson, N. S. Jones and J. Hein for guidance, and everyone at Plant Sciences who made this project possible.
34. The last one…
TSS mapping and transcript repertoire
TSS position in relation to a gene is key to its function.
Promoter motif prediction
Transcription Start Site Map Of Soy Symbiont Bradyrhizobium japonicum Based On dRNA-seq
1 Moscow Institute of Physics and Technology, Dolgoprudny, Russia; 2 A.A. Kharkevich Institute for Information Transmission Problems, Moscow, Russia; 3 M.V. Lomonosov Moscow State University, Moscow, Russia; 4 Massachusetts Institute of Technology, Boston, USA; 5 Institute of Microbiology and Molecular Biology, Justus-Liebig-Universität Gießen, Gießen, Germany
chuklina.jelena@gmail.com
Jelena Chuklina1,2, Nikolay Lyubimov3, Maxim Imakaev4, Elena Evguenieva-Hackenberg5 and Mikhail S. Gelfand2,3
[Figure labels: A sub T; G sub C]
Outline
• Perform a new round of machine learning with an updated training set
• Update gTSS and 5’-aTSS classification
• Compare dRNA-seq data with expression array and proteome data
Acknowledgments
• Julia Hahn and Sebastian Thalmann for experimental validation of transcription start sites and promoter motifs
• Iakov Davydov and Aleksandr Chuklin for much advice on program development
• Cynthia Sharma, Konrad Förstner, Jörg Vogel for sequencing and read mapping
• Gabriella Pessi and Hans-Martin Fischer for nodule RNA
Summary
1. We detected 17574 peaks; after machine learning, 10071 were left as TSS.
2. We detected 3979 RpoD promoters and 485 RpoN motifs; 159 TSSes have both.
3. After re-annotation, 73 ncTSS and 682 iTSSes were re-classified as gTSSes.
Abstract
dRNA-seq was designed for selective sequencing of native transcripts originating from transcription start sites (TSS). Here we present TSSF (Transcription Start Site Finder), a software package that allows comprehensive analysis of the bacterial transcriptomic landscape. A TSS map makes it possible to assess the repertoire of small non-coding RNAs, investigate promoter motifs and improve gene annotation.
In this study we use TSSF to compare the transcriptome of the soy symbiont Bradyrhizobium japonicum in liquid cultures and in bacteroids populating root nodules.
Re-annotation
TSS detection. Machine learning
The (+) library is RNA selected for primary transcripts; the (-) library is all RNA, including processed transcripts (Fig. 1). All peaks matching in the (+) and (-) libraries were treated as candidate TSS and were subjected to automated machine learning. Expert-assessed peaks served as the training set (Fig. 2 and Table 1). Machine learning was performed separately for free-living bacteria (FR) and nodules (NO). To compute support vectors, the following parameters were selected:
i. Height of (+) and (-) peak (Fig. 3)
ii. Ratio of (+) and (-) peak
iii. Average expression in a 30 b.p. radius
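The three parameters listed above could be computed per candidate peak roughly as follows. This is a hedged sketch rather than the TSSF implementation: the function name, the pseudocount in the ratio, and the choice to average the (+) library coverage inside the window are all assumptions made for illustration.

```python
def peak_features(plus_cov, minus_cov, pos, radius=30, pseudocount=1.0):
    """Feature vector for a candidate TSS at index `pos`.

    plus_cov / minus_cov: per-base read coverage of the (+) and (-) libraries.
    pseudocount: assumed smoothing term that avoids division by zero.
    """
    lo = max(0, pos - radius)
    hi = min(len(plus_cov), pos + radius + 1)
    window = plus_cov[lo:hi]                      # assumption: (+) library coverage
    height_plus = plus_cov[pos]
    height_minus = minus_cov[pos]
    return {
        "height_plus": height_plus,
        "height_minus": height_minus,
        "ratio": height_plus / (height_minus + pseudocount),
        "mean_expression": sum(window) / len(window),
    }

# Toy usage: a single sharp peak at position 50.
plus = [0] * 100
minus = [0] * 100
plus[50], minus[50] = 90, 9
print(peak_features(plus, minus, 50))
```

Vectors of this kind, labelled with the expert assessment, are what a support vector classifier would be trained on, separately for the FR and NO data.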
Fig. 3. Peak detection: RNA-seq read coverage (blue), salience function (green), peaks (red)
Fig. 5. Best-scoring patterns were used to construct a positional weight matrix (PWM). PWM threshold determination (upper): the score distribution density of normal upstream sequences is skewed towards higher scores compared to random sequences. Resulting logos (lower) of RpoD (σ70) and RpoN (σ54).
[Plot: RpoN, score distribution density; x-axis: totalScore, y-axis: density; series: normal, random.]
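The PWM step behind these score distributions can be illustrated with a small sketch. It assumes a log-odds PWM with a pseudocount and a uniform background of 0.25 per base, which is a simplification for a GC-rich genome; the function names and the toy sites are hypothetical and are not taken from the poster.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Log-odds positional weight matrix from equal-length aligned sites."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in alphabet
        })
    return pwm

def score(pwm, seq):
    """Sum of per-position log-odds for a sequence as long as the PWM."""
    return sum(col[base] for col, base in zip(pwm, seq))

def best_window_score(pwm, seq):
    """Highest PWM score over all windows of the upstream sequence."""
    width = len(pwm)
    return max(score(pwm, seq[i:i + width]) for i in range(len(seq) - width + 1))

# Toy usage with made-up aligned sites.
pwm = build_pwm(["TATAAT", "TATAAT", "TATGAT", "TACAAT"])
print(best_window_score(pwm, "GGCTATAATGCC"))
```

Scoring real upstream sequences and shuffled or random sequences with the same matrix gives the two densities being compared; a threshold would then be chosen where the two separate.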
TSSes overexpressed in nodules
[Figure labels: substitution, extension and shift of box1 and box2]
                   | ISGA vs old | RAST vs old | RAST vs ISGA
matching CDSes     | 4749        | 4669        | 7690
matching genes     | 4796        |             |
re-annotated start | 3050        | 2941        | 898
new genes          | 1351        | 1105        | 556
discarded          | 525         | 707         | 127
      | old  | ISGA | RAST
genes | 8373 | 9197 |
CDS   | 8317 | 9144 | 8715
sRNA length assessment
A typical transcript starts with a TSS and ends with a terminator. We used 3 publicly available tools (ARNold, TransTermHP, WebGesterDB) for rho-independent terminator prediction. Only ARNold predicts terminators independently of the annotated gene end, so we used it to assess sRNA length.
Only 247 TSSes were matched with terminators; their length was usually 40-200 nt, rarely more than 400 nt.
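Matching a TSS to a predicted terminator to estimate transcript length might look like the sketch below. It is an illustration only: it handles a single strand in increasing coordinates, pairs each TSS with the nearest downstream terminator, and uses a hypothetical `max_length` cutoff that is not stated on the poster.

```python
from bisect import bisect_left

def match_tss_to_terminators(tss_positions, terminator_positions, max_length=1000):
    """Map each TSS to a transcript length using the nearest downstream terminator.

    Positions are forward-strand coordinates; max_length is an assumed cutoff
    beyond which a TSS is left unmatched.
    """
    terminators = sorted(terminator_positions)
    lengths = {}
    for tss in tss_positions:
        i = bisect_left(terminators, tss)          # first terminator at or after the TSS
        if i < len(terminators):
            length = terminators[i] - tss + 1      # inclusive coordinates
            if length <= max_length:
                lengths[tss] = length
    return lengths

# Toy usage with made-up coordinates.
print(match_tss_to_terminators([100, 500, 2000], [180, 5000]))
```

A real analysis would treat the two strands separately and would also need to decide what to do when several TSSes share one terminator.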
See also: poster by Julia Hahn
Fig. 2. Expert assessment of candidate TSS for training set.
Fig. 1. dRNA-seq data. (+) library: red, (-) library: blue.
Table 1. Training set: manually assessed regions 0-130 kb and 1681..1920 kb (symbiotic island) of the genome
Fig. 9. Start-codon re-annotation: change in protein lengths after re-annotation with RAST and ISGA. There is a clear skew towards ORFs that became shorter, for both ISGA and RAST. This leads to re-classification of iTSS as gTSS
5’-untranslated region length
Fig. 7. While most 5’-UTRs have a typical length of 20-40 nt, there is a considerable number of leaderless transcripts, which seems to be a common property of bacteria
Fig. 8. Re-annotation of RegR (blr0904): now TSS №1 precedes the start codon. Old annotation is grey, new is cyan. P1, P2, P3 are predicted promoters.
Table 2. Number of genes (CDS) predicted by different AGEs
Table 3. Different B. japonicum USDA 110 annotations
Anti-sense transcript mapping
Most TSS classified as gTSS and aTSS belong to 5’-UTRs and often do not intersect the corresponding anti-sense transcripts; they are thus gTSS/oTSS, transcribed divergently (as the dashed arrow above). Overlap between the various aTSS types is due to overlap of annotated genes.
Protein-coding genes:
• 4084 protein-coding genes have TSS
• Maximal number of TSS per gene is 4
• 873 protein-coding genes have more than one TSS
Anti-sense RNAs:
• 4013 genes have anti-sense TSS (2056 of them are expressed)
Internal TSSes:
• 4167 genes have iTSSes (2368 of them are expressed)
gTSS = gene TSS
iTSS = internal TSS
oTSS = orphan TSS
aTSS_5, aTSS_i, aTSS_3 = anti-sense
Fig. 6. Distribution of the different TSS types (= transcript types). The abundance of iTSS may be due to: 1) operon-intrinsic promoters; 2) RNA cleavage products misclassified as TSS. For aTSS misclassification analysis, see below.
[Figure label: 1340 oTSS]
TSS mapping allows for correction of annotation errors, especially re-annotation of start codons. We applied the automated genome annotation engines (AGE) RAST and ISGA to improve the Bradyrhizobium japonicum USDA 110 annotation.
If the sequence upstream of a TSS candidate is enriched with promoter motifs, the promoters support the candidate as a true TSS.
Usually promoters possess:
1. Conserved two-box sequences
2. Conserved distance to TSS
3. Conserved distance between boxes
We scanned 60 nt sequences upstream of each predicted TSS (or the subset of TSS upregulated in nodules) to find overrepresented 6-nt motifs. We allowed a 1-2 nt shift of boxes from the ideal distance, a 1-2 nt extension of the distance between boxes and 1-2 substitutions in each box, penalizing for each.
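The penalized two-box search described above can be sketched as follows. This is a simplified illustration, not the poster's algorithm: the unit penalties, the symmetric shift, the geometry parameters (`spacer`, `dist_to_tss`) and the example boxes (the textbook E. coli σ70 consensus TTGACA / TATAAT) are all assumptions chosen for the demo.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_two_box_match(upstream, box1, box2, spacer, dist_to_tss,
                       max_shift=2, max_ext=2, max_subs=2,
                       shift_pen=1.0, ext_pen=1.0, sub_pen=1.0):
    """Lowest-penalty placement of box1 ... box2 in a sequence ending at the TSS.

    Returns (penalty, box1_start, box2_start), or None if nothing fits.
    """
    n = len(upstream)
    best = None
    for shift in range(-max_shift, max_shift + 1):
        # box2 ends dist_to_tss + shift nt before the TSS (the end of `upstream`).
        box2_start = n - dist_to_tss - shift - len(box2)
        for ext in range(max_ext + 1):
            box1_start = box2_start - spacer - ext - len(box1)
            if box1_start < 0 or box2_start + len(box2) > n:
                continue
            subs1 = hamming(upstream[box1_start:box1_start + len(box1)], box1)
            subs2 = hamming(upstream[box2_start:box2_start + len(box2)], box2)
            if subs1 > max_subs or subs2 > max_subs:
                continue
            penalty = shift_pen * abs(shift) + ext_pen * ext + sub_pen * (subs1 + subs2)
            candidate = (penalty, box1_start, box2_start)
            if best is None or candidate < best:
                best = candidate
    return best

# Toy usage: a 60 nt upstream region with one substitution and a 1 nt longer spacer.
seq = "C" * 23 + "TTGACA" + "G" * 18 + "TATCAT" + "C" * 7
print(best_two_box_match(seq, "TTGACA", "TATAAT", spacer=17, dist_to_tss=7))
```

Collecting the best-scoring placements over all upstream regions gives the set of patterns from which a weight matrix can then be built.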
Fig. 4. Correlated positions are most concentrated around the -35 and -10 nt regions. Illustration is based on the 5000 best patterns
[Plot: RpoD, score distribution density; x-axis: totalScore, y-axis: density; series: normal, random.]
[Figure labels: P3, P2, P1; ATG, old; TTG, new; RegR, bll0904]