2. 2
ELIS – Multimedia Lab
I. CV
   I. Electromagnetics for fusion
   II. Bioinformatics: iADHoRe & BLSSpeller
   III. Hadoop@Telenet
II. @MMLab
III. BLSSpeller
   I. Motivation
      I. Genetics
      II. Motif Discovery
   II. Algorithm
   III. Validation & Future work
Outline
10. 10
• Parallel algorithm to uncover sequence motifs in DNA sequences of closely related species
Current Research: BLSSpeller
11. 11
@Telenet: Data Warehousing in a Hadoop production environment
14. 14
[Figure labels: SPARQL; public medical data; aggregates]
• Ontoforce: federated query engine for biomedical datasets:
• Virtuoso is currently the bottleneck
• Federated querying
• Alternative architectures
• Some datasets >> Virtuoso
Project work
15. 15
You …
Are …
The …
Greatest …
=> more powerpoints
MMLab Sales…
16. 16
BLSSPELLER: MOTIVATION
Exhaustive comparative discovery of conserved cis-regulatory elements
17. 17
• DNA stores the information to build proteins…
• Proteins are generated in a two-step process: transcription & translation
• Update: there should be more arrows!
One slide on genetics
18. 18
• RNA polymerase: DNA to RNA
• Recruited by transcription factors which bind to the gene promoter => where are the sites?
Transcription factor binding sites
19. 19
• Original approach:
- Search for motifs in the promoter sequences of sets of coregulated genes
- But: Coexpression IS NOT EQUAL TO Coregulation
• Phylogenetic approach
- Compare promoter sequences of genes which are homologous between similar species
- Approach: use genome alignments
Which data!?
24. 24
• Comparative motif finding
• Exhaustive algorithm => no heuristics
• Word-based motif model
• Alignment-free (Alignment-based as a bonus)
How are we different?
28. 28
Depth-first exhaustive motif enumeration for the 15-character IUPAC alphabet
GST can be used for Branch & Bound purposes: search space reduction
Motif discovery with GSTs
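The depth-first enumeration with branch & bound can be sketched in a few lines. This is a minimal illustration, not the BLSSpeller implementation: it tracks occurrence positions in plain lists instead of a generalized suffix tree, and it uses "occurs in at least `min_seqs` sequences" as a stand-in conservation criterion. The names `enumerate_motifs`, `match_positions` and `min_seqs` are assumptions made for this example.

```python
from typing import Dict, List

# The 15-letter IUPAC nucleotide alphabet: each symbol stands for a set of bases.
IUPAC: Dict[str, str] = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_positions(seq: str, positions: List[int], depth: int, symbol: str) -> List[int]:
    """Keep the start positions whose base at offset `depth` is covered by `symbol`."""
    allowed = IUPAC[symbol]
    return [p for p in positions if p + depth < len(seq) and seq[p + depth] in allowed]


def enumerate_motifs(seqs: List[str], length: int, min_seqs: int) -> List[str]:
    """Return every IUPAC word of `length` that occurs in at least `min_seqs` sequences."""
    found: List[str] = []

    def extend(prefix: str, occ: List[List[int]]) -> None:
        # occ[i] holds the start positions in seqs[i] where `prefix` matches.
        support = sum(1 for positions in occ if positions)
        if support < min_seqs:
            return  # bound: extending a prefix can never increase its support
        if len(prefix) == length:
            found.append(prefix)
            return
        for symbol in IUPAC:  # branch: try all 15 symbols at the next position
            narrowed = [
                match_positions(seq, positions, len(prefix), symbol)
                for seq, positions in zip(seqs, occ)
            ]
            extend(prefix + symbol, narrowed)

    extend("", [list(range(len(seq))) for seq in seqs])
    return found
```

Calling `enumerate_motifs(["ACGT", "ACGA", "TTTT"], 3, 2)` explores at most 15^3 words, and prefixes such as "TT" are cut off as soon as their support drops below 2. A suffix tree plays the same role as the position lists here, but shares the work between sequences with common substrings.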
36. 36
• Extending BLSSpeller: treat it as a reverse index!
• Query engine
• OCR data
• Frontend
• Clustering & Visualizations
• Alternative paths (decide after Speller review)
• Scaling out SPARQL / Graph algorithms (Ontoforce)
• Stream analytics and querying (IoT)
• Medical images (Wesley)
New Research
Master's thesis: solving Maxwell's equations in a fusion reactor
An RF antenna is used to pump energy into the plasma, a fully ionized gas that needs 10^7 K to ignite and produce energy (like a fire)
Clean energy: D, T in => He + n out
Scattering on a cold plasma cylinder: used to validate the algorithm, since an analytical solution could be derived for this case
Followed Jan Fostier to IBCN: i-ADHoRe (iterative Automatic Detection of Homologous Regions)
GHM (gene homology matrix): compares two chromosomes, which may also come from the same species or even be the same chromosome
Collinear regions = regions where gene order and content are conserved
Bottom-up clustering algorithm using a non-uniform distance (shortest along diagonals)
Alignments are used as new ‘chromosomes’ for higher sensitivity (right panel)
Visualisation suite, mainly used to understand the algorithm (green boxes: good p-value; red boxes: bad p-value; dots; blue: confidence interval for the regression line)
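To make the GHM idea concrete, here is a rough sketch, not the i-ADHoRe algorithm itself. Chromosomes are given as lists of gene-family labels, a "dot" marks a pair of genes from the same family, and dots are chained greedily along the diagonal with a small gap tolerance. The function names, `max_gap` and `min_size` are assumptions for illustration; the real tool also handles inverted regions and tests each cluster statistically, which is left out here.

```python
from typing import List, Set, Tuple

Dot = Tuple[int, int]


def homology_dots(chrom_a: List[str], chrom_b: List[str]) -> Set[Dot]:
    """Gene homology matrix as a set of dots (i, j): gene i on A and gene j on B share a family."""
    return {
        (i, j)
        for i, fam_a in enumerate(chrom_a)
        for j, fam_b in enumerate(chrom_b)
        if fam_a == fam_b
    }


def diagonal_runs(dots: Set[Dot], max_gap: int = 2, min_size: int = 3) -> List[List[Dot]]:
    """Greedily chain dots that follow each other along the diagonal direction."""
    unused = set(dots)
    runs: List[List[Dot]] = []
    for start in sorted(dots):
        if start not in unused:
            continue
        run = [start]
        unused.remove(start)
        while True:
            i, j = run[-1]
            candidates = [
                (i + di, j + dj)
                for di in range(1, max_gap + 1)
                for dj in range(1, max_gap + 1)
                if (i + di, j + dj) in unused
            ]
            if not candidates:
                break
            # Prefer the closest dot; on ties, prefer the one nearest to the diagonal.
            nxt = min(
                candidates,
                key=lambda d: ((d[0] - i) + (d[1] - j), abs((d[0] - i) - (d[1] - j))),
            )
            run.append(nxt)
            unused.remove(nxt)
        if len(run) >= min_size:
            runs.append(run)
    return runs
```

With `chrom_a = ["f1", "f2", "f3", "f4", "f9"]` and `chrom_b = ["f7", "f1", "f2", "f8", "f3", "f4"]`, the four shared families form one collinear run even though `f8` interrupts the diagonal on the second chromosome.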
“I just want to make things fast” Joachim VH
Bad news: we were able to process the full Ensembl dataset of genomes in approximately 8 hours on a single node => no need for 4.0
Why compare 50 genomes? We managed to find a multiplicon spanning all species: the HOX cluster (= gold standard), known to relate to the body plan of an organism
After that, on my own: started new research based on a book I read about index structures: finding DNA motifs with an exhaustive algorithm, which forced us to look at parallelisation and big computers
Left in March: Telenet, data retention project => identifying people at the request of the police
Met Erik on a train; he convinced me to resign from my well-paid job
Big Data course: Visualisation, Big Data management (NoSQL, Hadoop, architectures, semantics, Spark, IoT streaming), Analytics (deep learning, scalable algorithms)
Project with Ontoforce: design of a federated query engine. Status: fixing the federated engine of MMLab; open question: federated querying connected to DISQOVER, not feasible
So many options are possible, since Virtuoso is only used as a data adapter
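For reference, one standard way to express a federated query is the SPARQL 1.1 `SERVICE` keyword. The sketch below only builds such a query string; it says nothing about how the MMLab engine or the Ontoforce product work internally. The helper name `federated_query` and the endpoint URLs in the usage note are hypothetical.

```python
from typing import List


def federated_query(endpoints: List[str], pattern: str, variables: List[str], limit: int = 10) -> str:
    """Build a SPARQL 1.1 query that evaluates `pattern` at every endpoint and unions the results."""
    # One SERVICE block per remote endpoint, each wrapped in its own group for UNION.
    union = "\n  UNION\n".join(
        f"  {{ SERVICE <{url}> {{ {pattern} }} }}" for url in endpoints
    )
    return f"SELECT {' '.join(variables)} WHERE {{\n{union}\n}} LIMIT {limit}"
```

For example, `federated_query(["https://example.org/a/sparql", "https://example.org/b/sparql"], "?drug ?p ?o", ["?drug", "?p", "?o"])` yields a query with two `SERVICE` blocks joined by `UNION`. The endpoint that receives this query then acts mostly as a dispatcher, which matches the remark that the store itself is only a data adapter.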
I am often not here, and I often wonder what I am actually doing, but I think you should call it sales. We already have strong connections with IoT and Data Science; on the right, a number of companies we introduced ourselves to (companies that want to do, or are already doing, big data)
More arrows: many different types of RNA, often an end product; proteins are in a dynamic equilibrium; proteins bind back to DNA to trigger the creation of new proteins
If one of the arrows breaks: Cancer
Protein networks are often drawn = pathway analysis
Transcription: molecular machine, recruited by transcription factors (=proteins)
Where are the sites: in vitro versus in vivo is not the same!! (ChIP-seq)
At first: compare promoters of coregulated genes
Next: compare promoters of genes which are related (or, mostly, whole-genome alignments, WGA)
IUPAC motif model is the most exact motif model
Word-based algorithms (based on MM), exhaustive ones, … are better than randomized algorithms, but overall most of them are not very good
Current approaches work mainly with alignments, but these biologically identified binding sites are misaligned!!
Set of 17724 gene families for which we have the promoter sequences (each time >=1 gene from each species)