Fauteux Seeder Bosc2009

The Seeder algorithm first discovers short motif seeds, extends these seeds into full-length position weight matrices, and then iteratively refines the matrices to find overrepresented motifs.


Seeder: Perl Modules for Cis-regulatory Motif Discovery

Bioinformatics Open Source Conference
June 28, 2009, Stockholm

François Fauteux
Department of Plant Science
McGill University
Macdonald campus

Introduction

• Precise control of where, when and at which level transcription occurs

• Synthetic promoter engineering

M. Venter, Trends Plant Sci 12, 118 (2007).

DNA Motif Discovery

• Searching for imperfect copies of an unknown pattern

• Sequence-driven approaches: not guaranteed to yield a global optimum

• Enumerative approaches: computationally expensive

• Convergence towards low-complexity motifs

D. GuhaThakurta, Nucleic Acids Res 34, 3585 (2006).
W. W. Wasserman, A. Sandelin, Nat Rev Genet 5, 276 (2004).

Seeder Algorithm: Input

• Set B={B1,...,Bm} of background sequences

• Set P={P1,...,Pn} of positive sequences

• Length k of the motif seed

• Length l of the full motif to discover

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

Seeder::Background

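• Enumerate all words of length k over the alphabet {A, C, G, T}

• Substring minimal distance (SMD): the smallest Hamming distance (HD) between a word w and any |w|-length substring of a sequence s

• The SMDs between word w and the background sequences yield a probability distribution g_w(y)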

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).
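The SMD and the background distribution described above can be sketched in a few lines of Perl. This is a minimal brute-force illustration, not the code of Seeder::Background: the function names `hamming`, `smd` and `background_distribution` are hypothetical, only the forward strand is scanned, and g_w(y) is assumed to be the plain empirical frequency of each SMD value over the background set.

```perl
use strict;
use warnings;

# Hamming distance (HD) between two equal-length strings.
sub hamming {
    my ($x, $y) = @_;
    my $d = 0;
    for my $i (0 .. length($x) - 1) {
        $d++ if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $d;
}

# Substring minimal distance (SMD): smallest HD between word $w and any
# length($w) substring of sequence $s. If $s is shorter than $w, this sketch
# returns length($w) (a convention chosen here, not taken from the slides).
sub smd {
    my ($w, $s) = @_;
    my $k    = length $w;
    my $best = $k;
    for my $i (0 .. length($s) - $k) {
        my $d = hamming($w, substr($s, $i, $k));
        $best = $d if $d < $best;
    }
    return $best;
}

# Empirical distribution g_w(y): fraction of background sequences whose SMD
# to $w equals y, for y = 0 .. length($w). Returns an array reference.
sub background_distribution {
    my ($w, @background) = @_;
    my @g = (0) x (length($w) + 1);
    $g[ smd($w, $_) ]++ for @background;
    my $m = scalar @background;
    $_ /= $m for @g;
    return \@g;
}

# Toy background set (made up for illustration).
my @background = qw(TTACGTT AAAAAA ACTACT TTTTTT);
my $g = background_distribution('ACG', @background);
print join(' ', @$g), "\n";
```

Scanning every substring costs time proportional to the sequence length for each word, which is the cost the Seeder::Index slides address.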

Seeder::Finder

• The sum S(w) of SMDs between w and the positive sequences yields a p-value

• The closest match to word w* (min. q-value) found in each positive sequence yields the seed position weight matrix (PWM)

• The matrix is extended to the full motif width, and the sites maximizing the score against the extended weight matrix are selected

• A PWM is built from those sites, and the process is iterated

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).
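The slides do not spell out how the p-value of S(w) is obtained. One plausible reading, assumed here, is that the SMDs in the n positive sequences are treated as independent draws from the background distribution g_w(y), so that the null distribution of their sum is the n-fold convolution of g_w, and the p-value is the probability of a sum at least as small as the observed S(w). The sketch below illustrates that assumption; `convolve` and `sum_pvalue` are hypothetical names and the numbers are made up.

```perl
use strict;
use warnings;
use List::Util qw(sum min);

# Discrete convolution of two probability vectors (array references).
sub convolve {
    my ($p, $q) = @_;
    my @r = (0) x (@$p + @$q - 1);
    for my $i (0 .. $#$p) {
        for my $j (0 .. $#$q) {
            $r[ $i + $j ] += $p->[$i] * $q->[$j];
        }
    }
    return \@r;
}

# Lower-tail p-value: probability that the sum of $n independent SMDs,
# each distributed as $g (array reference, index = SMD value), is <= $s_w.
sub sum_pvalue {
    my ($g, $n, $s_w) = @_;
    my $dist = [1];                          # distribution of an empty sum
    $dist = convolve($dist, $g) for 1 .. $n;
    my $last = min($s_w, $#$dist);
    return sum(@{$dist}[0 .. $last]);
}

# Hypothetical background distribution for a 3-mer and observed SMDs
# in five positive sequences.
my $g    = [0.1, 0.3, 0.4, 0.2];
my @smds = (0, 1, 0, 1, 0);
my $s_w  = sum(@smds);
printf "S(w) = %d, p-value = %.3g\n", $s_w, sum_pvalue($g, scalar @smds, $s_w);
```

A small S(w) relative to this null distribution means that w has unusually close matches across the positive set, which is what makes it a candidate seed.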

Seeder::Index

• List of indices corresponding to words of increasing HD

• Efficient lookup of minimally distant subsequence

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).
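The idea of looking up the minimally distant subsequence through a list ordered by increasing HD can be illustrated as follows. This is a simplified sketch, not the data structure written by Seeder::Index: it builds one ordered word list per query word, stores words rather than integer indices, and the names `all_words`, `hd_index` and `indexed_smd` are hypothetical.

```perl
use strict;
use warnings;

my @ALPHABET = qw(A C G T);

# Hamming distance (HD) between two equal-length strings.
sub hamming {
    my ($x, $y) = @_;
    my $d = 0;
    for my $i (0 .. length($x) - 1) {
        $d++ if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $d;
}

# All 4**$k words of length $k over the DNA alphabet.
sub all_words {
    my ($k) = @_;
    my @words = ('');
    for my $round (1 .. $k) {
        my @next;
        for my $prefix (@words) {
            push @next, $prefix . $_ for @ALPHABET;
        }
        @words = @next;
    }
    return @words;
}

# Words of length($w) sorted by increasing HD to $w (ties alphabetical).
sub hd_index {
    my ($w) = @_;
    my @words = all_words(length $w);
    return [ sort { hamming($w, $a) <=> hamming($w, $b) or $a cmp $b } @words ];
}

# SMD lookup: record which words occur in $s, then walk the ordered list and
# stop at the first word that is present. Its HD to $w is the SMD.
sub indexed_smd {
    my ($index, $w, $s) = @_;
    my $k = length $w;
    my %present;
    $present{ substr($s, $_, $k) } = 1 for 0 .. length($s) - $k;
    for my $candidate (@$index) {
        return hamming($w, $candidate) if $present{$candidate};
    }
    return;    # no usable substring (sequence too short or non-ACGT only)
}

my $index = hd_index('ACG');
print indexed_smd($index, 'ACG', 'TTACTTT'), "\n";
```

Once the set of words present in a sequence is known, the same set can be reused for every query word, so the walk through the ordered list replaces a full rescan of the sequence.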

Seeder Algorithm: Usage

#!/usr/bin/perl

use strict;
use warnings;

use Seeder::Index;
use Seeder::Finder;
use Seeder::Background;

# Step 1: generate the index file for seeds of width 6.
my $index = Seeder::Index->new(
    seed_width => "6",
    out_file   => "6.index",
);
$index->get_index;

# Step 2: process the background sequences (seqs.fasta) using the index.
my $background = Seeder::Background->new(
    seed_width    => "6",
    strand        => "revcom",
    hd_index_file => "6.index",
    seq_file      => "seqs.fasta",
    out_file      => "seqs.bkgd",
);
$background->get_background;

# Step 3: search the positive sequences (prom.fasta) for one motif of width 12.
my $finder = Seeder::Finder->new(
    seed_width    => "6",
    strand        => "revcom",
    motif_width   => "12",
    n_motif       => "1",
    hd_index_file => "6.index",
    seq_file      => "prom.fasta",
    bkgd_file     => "seqs.bkgd",
    out_file      => "prom.finder",
);
$finder->find_motifs;

Benchmark Against Popular Tools

• Binding site sequences from the Transfac database

G. K. Sandve, O. Abul, V. Walseng, F. Drablos, BMC Bioinformatics 8, 193 (2007).

F. Fauteux, M. Blanchette, M. V. Stromvik, Bioinformatics 24, 2303 (2008).

SSP Promoter Motifs

F. Fauteux, M. V. Stromvik, submitted.

Acknowledgements

Supervisor
Dr Martina Strömvik

Advisory committee
Dr Mathieu Blanchette
Dr Pierre Dutilleul

http://seeder.agrenv.mcgill.ca
