GraphClust is an alignment-free method for clustering RNA sequences based on their secondary structures. It represents RNA secondary structures as labeled graphs and uses a graph kernel called NSPDK to compute similarities between graphs. GraphClust clusters RNAs in linear time by using MinHash to identify candidate neighborhoods for clustering, then refines the clusters with sequence-structure alignment tools. It was able to cluster hundreds of thousands of RNA sequences in under a month, while traditional alignment-based methods would take hundreds of computer years for the same task.
1. GraphClust: alignment-free structural clustering of
local RNA secondary structures.
Yakovlev Pavel
St-Petersburg Academic University
March 16, 2013
Yakovlev Pavel Graph Clust 1/33
2. Abstract
Non-protein coding RNAs (ncRNAs)
have become a key research focus in
molecular biology, fundamentally
challenging established paradigms and
rapidly transforming our perception of
genome complexity. There is mounting
evidence that the eukaryotic genome is
pervasively transcribed and that ncRNAs
function at various levels, regulating gene
expression and cell biology.
3. Motivation
Clustering algorithms are applied to store, compare and predict
ncRNAs, because:
in contrast to protein-coding genes, ncRNAs belong to a
diverse array of classes with vastly different structures,
functions, and evolutionary patterns
ncRNAs can be divided into RNA families according to
inherent functional, structural, or compositional similarities
5. Motivation
Part II
Answer:
They are awful and do not work.
Why?
Multiple Sequence Alignment (MSA) is an NP-complete
problem.
Approximation algorithms have very high computational
complexity too (dynamic programming):
Sankoff Algorithm (1985) - O(n^6) time and O(n^4) space for two sequences of length n
Carrillo-Lipman Algorithm (1989)
Thus, tools based on these algorithms are limited to
relatively small datasets.
LocARNA (2007)
FoldAlign (2007)
6. Smart quote
Gorodkin: ‘Even using substantially more sophisticated techniques,
genome-scale ncRNA analyses often consume tens to hundreds of
computer years. These high computational costs are one reason
why ncRNA gene finding is still in its infancy.’ [1]
[1] Gorodkin, J. et al. (2010) De novo prediction of structured RNAs from
genomic sequences. Trends Biotechnol, 28, 9–19.
7. Solutions
Postulate:
ncRNAs can be also grouped by RNA classes whose members, in
contrast with RNA families, have no discernible homology at the
sequence level, but still share common structural and functional
properties.
Two main classes of clustering approaches:
1 The first class uses simplifications in the representation of structures.
These methods depend heavily on the correctness of the structures.
Computational predictions of secondary structures are known to
be error prone.
2 The second class uses sequence information as prior knowledge to speed
up the computation.
Sequences are first clustered by sequence alignment.
Family assignments of structured RNAs obtained from
sequence alignments at pairwise sequence identities below 60%
are often wrong.
8. GraphClust
Alignment-free solution
GraphClust (University of Freiburg, Germany):
compares and clusters RNAs according to both sequence
and structure
alignment-free
linear-time
scales to datasets of hundreds of thousands of sequences
extends a fast graph kernel technique designed for
chemoinformatics
ready-to-use pipeline, available for free on request
9. GraphClust
Approach
Approach pipeline:
1. Sample a small number of probable, but sufficiently different,
structures for each RNA sequence.
2. Encode each structure as a labeled graph that preserves all
information about the nucleotide type and the bond type. The set of
structures of a sequence is represented as one graph with several
disconnected components.
3. Compute the similarity between the representative graphs
using a [chemoinformatics] graph kernel.
4. Evaluate the neighborhood of each graph and select as candidate
clusters those neighborhoods that contain very similar elements.
5. Dataset scanning: assign the remaining sequences to clusters
and remove them from the dataset.
6. Reiterate the procedure.
10. GraphClust
Questions
Unclear so far:
How are structures encoded as graphs?
How can graphs or graph substructures be compared?
Why do we not need a quadratic number of graph comparisons?
11. Answer 1
Graph encoding for structures
Minimum free energy-based secondary structure prediction has been
shown to be error prone (Gardner et al., 2005). So, we want to use a
representation of the entire ensemble of low-energy conformations.
Problem: Resorting to a complete enumeration of near-optimal
structures would yield a tremendously large number of structures.
Solution: We opt for abstract shape analysis methods.
12. Answer 1
Graph encoding for structures
We consider subsequences of our ncRNA sequence, taken as
overlapping windows.
The procedure depends on two parameters:
W - window sizes (30 and 150 nt in tests)
O - overlap parameter for window start positions
Then we take the l most representative structures of each
subsequence and encode them as disconnected components.
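The windowing step can be sketched as follows. This is only an illustration, not GraphClust's own code: the function name `windows` is hypothetical, and reading O as a shift between window starts, expressed as a fraction of the window size, is an assumption.

```python
def windows(seq, sizes=(30, 150), shift_fraction=0.2):
    """Return (start, size, subsequence) tuples for overlapping windows.

    Assumption: consecutive window starts are shift_fraction * size apart.
    """
    result = []
    for size in sizes:
        step = max(1, int(size * shift_fraction))
        if len(seq) <= size:
            # Sequence not longer than the window: keep it as a single window.
            result.append((0, len(seq), seq))
            continue
        for start in range(0, len(seq) - size + 1, step):
            result.append((start, size, seq[start:start + size]))
    return result


# An 80 nt toy sequence cut into 30 nt windows whose starts are 6 nt apart.
example = windows("ACGU" * 20, sizes=(30,), shift_fraction=0.2)
```

Each window would then be folded separately, and the l structures per window become the disconnected components of the sequence's graph.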
13. Answer 2
Graph comparisons
But how to compare these graphs?
Graph similarity is the central problem for all learning tasks such as
clustering and classification on graphs.
Subgraph isomorphism is an NP-complete problem.
14. Answer 2
Graph kernel
Graph kernels:
Compare substructures of graphs that are computable in
polynomial time
Examples: walks, paths, cyclic patterns, trees
Good graph kernel criteria:
Expressive
Efficient to compute
Positive definite
Applicable to a wide range of graphs
15. Answer 2
Graph kernel
NSPDK - Neighborhood Subgraph Pairwise Distance Kernel (Costa
and Grave, 2010)
Key features:
very fast kernel
suitable for large datasets
works with sparse graphs with discrete vertex and edge labels
works with special pairs of subgraphs, called «neighborhood»
subgraphs
an instance of a decomposition kernel (Haussler, 1999)
16. Answer 2
Decomposition kernel
Definitions:
G(V, E) is a graph and r ≥ 0. Then $N_r^v(G)$ is the neighborhood
subgraph of G rooted in v and induced by the set of vertices within
distance not exceeding r from v.
$R_{r,d}$ is the neighborhood-pair relation: it indicates that the distance
between the roots of two neighborhood subgraphs of radius r is exactly d.
Decomposition kernel:
$$k_{r,d}(G, G') = \sum_{A,B \in R^{-1}_{r,d}(G)} \; \sum_{A',B' \in R^{-1}_{r,d}(G')} \mathbf{1}(A \cong A') \, \mathbf{1}(B \cong B')$$
$R^{-1}_{r,d}(G)$ denotes all possible pairs of neighborhood subgraphs of radius r
in G whose roots are at distance d.
17. Answer 2
NSPDK
Then, NSPDK (non-normalized form) is:
$$K(G, G') = \sum_{r} \sum_{d} k_{r,d}(G, G')$$
For efficiency reasons, the "upper bound" form is used:
$$K_{r^*,d^*}(G, G') = \sum_{r=0}^{r^*} \sum_{d=0}^{d^*} k_{r,d}(G, G')$$
Normalization of each decomposition kernel:
$$\hat{k}_{r,d}(G, G') = \frac{k_{r,d}(G, G')}{\sqrt{k_{r,d}(G, G)\, k_{r,d}(G', G')}}$$
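Because the decomposition kernel counts matching pairs of neighborhood subgraphs, it can be evaluated as a dot product of sparse count vectors. The sketch below illustrates this under assumptions: each graph is given as a hypothetical dictionary mapping (r, d) to a `Counter` of feature counts, where a feature stands for one pair of neighborhood-subgraph codes, and the function names are not taken from any NSPDK implementation.

```python
import math
from collections import Counter


def k_rd(feats_g, feats_h):
    """Dot product of two sparse count vectors (Counter objects)."""
    return sum(count * feats_h[f] for f, count in feats_g.items())


def k_rd_normalized(feats_g, feats_h):
    """Normalized kernel: k(G, G') / sqrt(k(G, G) * k(G', G'))."""
    denom = math.sqrt(k_rd(feats_g, feats_g) * k_rd(feats_h, feats_h))
    return k_rd(feats_g, feats_h) / denom if denom else 0.0


def nspdk(per_rd_g, per_rd_h):
    """Sum the normalized kernels over the (r, d) keys both graphs have."""
    shared = per_rd_g.keys() & per_rd_h.keys()
    return sum(k_rd_normalized(per_rd_g[rd], per_rd_h[rd]) for rd in shared)


# Toy feature counts for a single (r, d) combination.
g = Counter({("a", "b"): 2, ("a", "c"): 1})
h = Counter({("a", "b"): 1, ("c", "c"): 3})
value = k_rd_normalized(g, h)
```

With this normalization, a graph compared with itself scores 1 for every (r, d) combination in which it has at least one feature.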
18. Answer 2
NSPDK
A full isomorphism test is computationally expensive.
Reducing the task:
1. String encoding: uses a technique based on the distance information
between pairs of vertices and can be computed in O(|V|).
2. Hashing procedure: encodes the string as an integer code (Costa and
Grave, 2010).
3. Isomorphism test: an equality test between the integer codes.
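The three steps can be illustrated with a simplified encoding. This sketch is not the encoding of Costa and Grave: it describes a neighborhood only by root distances and labels, so isomorphic neighborhoods always get the same code, but some non-isomorphic ones may collide. The graph format (`adj` maps an integer vertex to a list of `(neighbor, edge_label)` pairs, `labels` maps a vertex to its label) and all function names are assumptions.

```python
import zlib
from collections import Counter, deque


def bfs_distances(adj, root, limit):
    """Distances from root to every vertex at most `limit` edges away."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == limit:
            continue
        for u, _ in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def neighborhood_code(adj, labels, root, r):
    """Integer code of the radius-r neighborhood subgraph rooted at root."""
    dist = bfs_distances(adj, root, r)
    vertices = sorted((dist[v], labels[v]) for v in dist)
    edges = sorted(
        tuple(sorted([(dist[v], labels[v]), (dist[u], labels[u])])) + (lab,)
        for v in dist for u, lab in adj[v] if u in dist and v < u
    )
    text = repr((vertices, edges))      # step 1: string encoding
    return zlib.crc32(text.encode())    # step 2: string -> integer code


def nspdk_features(adj, labels, r_max, d_max):
    """Count (r, d, code, code) features over all vertex pairs within d_max."""
    codes = {(v, r): neighborhood_code(adj, labels, v, r)
             for v in adj for r in range(r_max + 1)}
    feats = Counter()
    for v in adj:
        for u, d in bfs_distances(adj, v, d_max).items():
            if u < v:
                continue  # count each unordered pair of roots once
            for r in range(r_max + 1):
                pair = tuple(sorted((codes[(v, r)], codes[(u, r)])))
                feats[(r, d) + pair] += 1  # step 3: equal codes share a key
    return feats
```

Two graphs that differ only in vertex numbering produce identical feature counters, which is the property the equality test on integer codes relies on.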
19. Answer 3
Explicit feature representation
NSPDK limits the number of features to $O(r^* d^* |V|^2)$ pairs of
neighborhood subgraphs (one feature for each pair of vertices times
each possible combination of values for the radius and the distance).
$r^* \in [0; 5]$, $d^* \in [0; 10]$
For sparse graphs, the number of vertices reachable within a
fixed small distance is typically small ⇒ the dependency on the vertex
set size can be more tightly approximated by $O(|V|)$.
Conclusion: each graph is mapped into a sparse vector in $\mathbb{R}^m$, a
high-dimensional feature space. The number of non-zero features in
this space is linear in $|V|$.
20. Alignment-free clustering
MinHash
Some definitions:
$P$ - set of instances [graphs]
$x, z, p$ - instances
$x_j$ - element (feature) of an instance
$N_k(x)$ - the set of the k closest instances to x.
So, how to cluster graphs using NSPDK? Comparing all possible pairs of
graphs is too expensive, so we compare only similar graphs.
The k-neighborhood density is a good measure for clustering:
$$D_k(x) = \frac{1}{k} \sum_{z \in N_k(x)} K_{r^*,d^*}(x, z)$$
To make this as fast as possible, we want to compute $N_k(x)$ very quickly,
ideally in constant time.
The MinHash algorithm helps us with this task.
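The density definition can be illustrated with a brute-force sketch. It finds the k closest instances by comparing x with every candidate, which is exactly the quadratic cost that MinHash is meant to avoid, so it only shows what is being computed. The Jaccard function is a stand-in for the NSPDK kernel, and both function names are hypothetical.

```python
def jaccard(a, b):
    """Toy similarity used here in place of the NSPDK kernel."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def density(x, candidates, kernel, k):
    """Mean kernel value between x and its k most similar candidates."""
    sims = sorted((kernel(x, z) for z in candidates if z is not x),
                  reverse=True)
    return sum(sims[:k]) / k


instances = [frozenset("abcd"), frozenset("abce"),
             frozenset("abcf"), frozenset("xyz")]
dense = density(instances[0], instances, jaccard, k=2)
sparse = density(instances[3], instances, jaccard, k=2)
```

An instance surrounded by similar instances gets a high density and is a good seed for a candidate cluster, while an isolated instance gets a density close to zero.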
21. Alignment-free clustering
MinHash
MinHash algorithm:
1. Binarize all instances $P = \{p_1, \ldots, p_n\}$: $\mathbb{R}^m \to \{0, 1\}^m$.
2. Choose random hash functions $f_i : \mathbb{N} \to \mathbb{N}$ such that
$\forall x_j \neq x_k:\ f_i(x_j) \neq f_i(x_k)$ and $P(f_i(x_j) \leq f_i(x_k)) = 1/2$.
3. Derive the hash functions $h_i(x) = \arg\min_{x_j \in x} f_i(x_j)$ [1]. They return the first
feature indicator under a random permutation of the feature order.
Useful facts:
$P(h_i(x) = h_i(z)) = \frac{|x \cap z|}{|x \cup z|} = s(x, z)$ - the Jaccard similarity, i.e. the fraction of
features common to x and z over the total number of non-null features of x
and z.
The estimation error is $\gamma \approx 1/\sqrt{N}$, where N is the number of hash functions.
4. Build the tuple $\langle h_1(x), \ldots, h_N(x) \rangle$.
5. For each hash value $h^* = h_i(x)$, collect the set of instances
$Z_i(h^*) = \{z \in P : h_i(z) = h^*\}$.
6. $Z = \{Z_i\}_{i=1}^{N}$ is the approximate neighborhood, which can be used to find $N_k(x)$.
These steps can be performed in constant time (N is fixed, and the size of Z is independent of
the dataset size |P|).
[1] These hash functions are called min-hash functions (Broder, 1997).
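Steps 4 to 6 amount to an inverted index from hash values to instances. The sketch below is one possible illustration, with several assumptions: instances are sets of non-negative integer feature ids smaller than the prime p, the random maps are linear functions modulo p (injective on that range, but only an approximation of a true random permutation), and all names are hypothetical.

```python
import random
from collections import defaultdict

P_MOD = 2_147_483_647  # a Mersenne prime; feature ids are assumed smaller


def make_hash_functions(n, seed=0):
    """n random injective maps f(x) = (a * x + b) mod p, as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P_MOD), rng.randrange(P_MOD)) for _ in range(n)]


def signature(features, funcs):
    """Tuple of minimum hash values, one per hash function (step 4)."""
    return tuple(min((a * x + b) % P_MOD for x in features) for a, b in funcs)


def build_index(instances, funcs):
    """One bucket table per hash function: hash value -> instance names (step 5)."""
    index = [defaultdict(list) for _ in funcs]
    for name, feats in instances.items():
        for i, h in enumerate(signature(feats, funcs)):
            index[i][h].append(name)
    return index


def approximate_neighbors(name, instances, funcs, index):
    """Union of the buckets that the query instance falls into (step 6)."""
    found = set()
    for i, h in enumerate(signature(instances[name], funcs)):
        found.update(index[i].get(h, ()))
    found.discard(name)
    return found
```

A query only touches N buckets, so its cost does not grow with the number of indexed instances as long as the buckets stay small. The exact k closest instances can then be picked from this much smaller candidate set.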
22. MinHash
Demo Time
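import functools
import random

def generic_hash(s, seed):
    # Polynomial rolling hash of the string s, truncated to 32 bits.
    result = 1
    for char in s:
        result = (seed * result + ord(char)) & 0xFFFFFFFF
    return result

def hashgen(size):
    # Draw a random multiplier, so each call yields a different hash function.
    seed = random.randrange(size) + 32
    return functools.partial(generic_hash, seed=seed)

def signature(items, functions):
    # MinHash signature: the minimum hash value of the set under each function.
    return [min(map(f, items)) for f in functions]

def similarity(sig_a, sig_b, num_funcs):
    # The fraction of matching positions estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_funcs

def minhash(max_error):
    # About 1 / max_error^2 hash functions are needed for the requested error.
    num_funcs = int(1 / max_error ** 2)
    # A wide seed range keeps the generated functions distinct.
    functions = [hashgen(2 ** 31) for _ in range(num_funcs)]
    return [functools.partial(signature, functions=functions),
            functools.partial(similarity, num_funcs=num_funcs)]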
23. GraphClust pipeline
Phase 1
Phase 1. Preprocessing
Get RNA sequences from RNA-seq or computational methods and
mask all repeats to avoid clusters made of genomic repeats.
Split long sequences into smaller fragments (windows) [1].
Near-identical sequences are removed using blastclust [2].
[1] To enable detection of small signals.
[2] The program begins with pairwise matches and places a sequence in a
cluster if the sequence matches at least one sequence already in the cluster.
The Megablast algorithm is used.
24. GraphClust pipeline
Phase 2
Phase 2. Structure determination
Predict a set of structures for each window with the RNAshapes tool.
A 150 nt sequence is processed into a set of disconnected graphs with
≈ 2500 vertices in total.
O = 20%; W = 30, 150; l = 3
25. GraphClust pipeline
Phase 3
Phase 3. Parallel encoding
Encode each set of structures into a sparse vector by computing the features of $K_{r^*,d^*}$.
A 150 nt sequence is processed into a sparse vector with ≈ 8000
features.
r∗ = 2; d∗ = 4
26. GraphClust pipeline
Phase 4
Phase 4. Candidate cluster
$c_i$ - candidate clusters
The $c_i$ are sorted by decreasing density: $\forall i < j,\ D(c_i) > D(c_j)$
Building the union of candidate clusters:
$$C_j = \bigcup_{i=1}^{j} c_i$$
where a candidate $c_k$ is greedily discarded if $|C_j \cap c_k| > \text{threshold}$.
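The greedy selection can be sketched as follows. This is an illustration under assumptions: candidates are given as hypothetical `(density, members)` pairs, and the threshold is read as an absolute number of shared members.

```python
def select_candidates(candidates, max_overlap):
    """Greedily keep dense candidate clusters that barely overlap earlier ones.

    candidates: list of (density, set_of_members) pairs.
    max_overlap: assumed to be an absolute count of shared members.
    """
    accepted = []
    covered = set()  # union of all accepted candidates so far
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    for _, members in ranked:
        if len(covered & members) > max_overlap:
            continue  # overlaps too much with denser candidates: discard
        accepted.append(members)
        covered |= members
    return accepted


picked = select_candidates(
    [(0.9, {1, 2, 3}), (0.8, {2, 3, 4}), (0.7, {5, 6}), (0.95, {7, 8})],
    max_overlap=1,
)
```

Processing candidates in order of decreasing density means that, of two heavily overlapping candidates, the denser one is the one that survives.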
27. GraphClust pipeline
Phase 5
Phase 5. Parallel cluster refinement
Candidate clusters contain sequences that are similar under the NSPDK
measure.
Each cluster is refined using the sequence-structure alignment tool
LocARNA (Will et al., 2007). We keep only the top-ranked RNAs in
each cluster.
28. GraphClust pipeline
Phase 6
Phase 6. Parallel cluster model
The selected top-ranked RNA candidates are realigned with the tool
LocARNA-P (Will et al., 2012). This way, we identify a highly trusted
alignment region and estimate the borders of the common local
motif.
Finally, we create a covariance model (CM) by applying the INFERNAL
tool (Nawrocki et al., 2009) to the identified subsequence. Every
cluster member gets its CM bitscore [1].
[1] A log-scaled version of the score.
29. GraphClust pipeline
Phase 7
Phase 7. Model scanning
The covariance model of each cluster is used to find additional cluster
members in the full residual dataset of sequences by CM bitscore or
E-value [1].
Some sequences can be assigned to multiple clusters.
[1] In the authors' terms, the number of instances with a score equal to or
better than that of the current instance.
30. GraphClust pipeline
Phase 8
Phase 8. Iteration and removal
All cluster members found in the previous phase are removed from
the dataset. A new iteration starts from Phase 4.
The termination conditions are:
predetermined max number of iterations
time limit
empty dataset
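The iteration with its three termination conditions can be sketched as a simple loop. The callable `find_clusters` is hypothetical and stands for Phases 4 to 7, and the extra stop when an iteration finds nothing is an assumption that is not stated on the slide.

```python
import time


def iterate(dataset, find_clusters, max_iterations=10, time_limit_s=3600.0):
    """Repeat clustering on the shrinking dataset until a stop condition holds."""
    remaining = set(dataset)
    found = []
    start = time.monotonic()
    for _ in range(max_iterations):          # predetermined max number of iterations
        if not remaining:                    # empty dataset
            break
        if time.monotonic() - start > time_limit_s:  # time limit
            break
        clusters = find_clusters(remaining)  # hypothetical: Phases 4 to 7
        if not clusters:
            break  # assumption: stop early when nothing new is found
        found.extend(clusters)
        for cluster in clusters:
            remaining -= cluster             # remove cluster members
    return found, remaining


def toy_find_clusters(remaining):
    """Toy stand-in: group the two smallest remaining items into one cluster."""
    return [set(sorted(remaining)[:2])]


clusters, leftover = iterate(range(5), toy_find_clusters)
```

Removing the members of every reported cluster is what lets later iterations find smaller or less dense clusters in the residual dataset.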
31. GraphClust pipeline
Phase 9
Phase 9. Postprocessing
Merge clusters
For every pair of clusters, we compute the relative overlap and
merge them if it exceeds 50%.
In-cluster postprocessing
Sequences in a cluster are finally ranked by their CM bitscore.
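The merging step can be sketched as follows. The slide does not define "relative overlap", so this sketch assumes it is the number of shared members divided by the size of the smaller cluster. The function name is hypothetical and clusters are assumed to be non-empty.

```python
def merge_clusters(clusters, threshold=0.5):
    """Merge pairs of clusters until no pair overlaps by more than threshold.

    Assumption: relative overlap = |A & B| / min(|A|, |B|).
    """
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if len(a & b) / min(len(a), len(b)) > threshold:
                    clusters[i] = a | b   # keep the union in place of a
                    del clusters[j]       # and drop b
                    merged = True
                    break
            if merged:
                break  # restart the scan, since the cluster list changed
    return clusters


result = merge_clusters([{1, 2, 3, 4}, {3, 4, 5}, {9, 10}])
```

The scan restarts after every merge because a merged cluster may now overlap with a cluster that neither of its parts overlapped with enough before.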
32. Results
Species    Type                     Method      Input    Time  Clusters
Benchmark
Bacteria   Small ncRNAs             Misc          363    6.8h        39
Human      Predicted RNA elements   EvoFam        699    0.3h        37
Misc       Small ncRNAs             Rfam        3 900     36h       130
De-novo discovery
Fugu       LincRNAs                 RNA-seq     5 877   10.3h        99
Fugu       Predicted RNA elements   RNAz       11 287   13.3h        97
Fruit fly  Predicted RNA elements   RNAz       17 765   20.4h        95
Human      LincRNAs                 RNA-seq    31 418    3.6d        95
Human      Predicted RNA elements   EvoFold    37 258    5.7d       117
Human      3’UTRs                   RefSeq    118 514   12.8d       106
Total                                         227 081   25.7d       815
Clustering 3901 sequences using LocARNA takes ∼ 370 days.
33. Q&A
Thank you for your attention!
Q&A