The computational requirements of next-generation sequencing are placing huge demands on IT organisations.
Building compute clusters is now a well-understood and relatively straightforward problem. However, NGS applications require large amounts of storage and high IO rates.
This talk details our approach to providing storage for next-gen sequencing applications.
Talk given at BIO-IT World Europe, 2009.
Next-generation sequencing: Data management (Guy Coates)
Next-generation sequencing is producing vast amounts of data. Providing storage and compute is only half the battle. Researchers and IT staff need to be able to "manage" data, in order to stay productive.
Talk given at BIO-IT World, Europe 2010.
Next generation genomics: Petascale data in the life sciences (Guy Coates)
Keynote presentation at OGF 28.
The year 2000 saw the release of "The" human genome, the product of the combined sequencing effort of the whole planet. In 2010, single institutions are sequencing thousands of genomes a year, producing petabytes of data. Furthermore, many of the large-scale sequencing projects are based around international collaboration and consortia. The talk will explore how Grid and Cloud technologies are being used to share genomics data around the planet, revolutionizing life science research.
In this presentation from the DDN User Meeting at SC13, Tim Cutts from The Sanger Institute describes how the institute wrangles genomics data with DDN storage.
Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...) (Amazon Web Services)
Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ... (Bonnie Hurwitz)
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to microbes. Overview of work underway to add applications and computational analysis pipelines to iPlant for metagenomics and microbial ecology.
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ... (Spark Summit)
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant, for example the dataset from 1000 Genomes project with genomes of 2504 individuals includes nearly 85M genomic variants with raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables, and either fail or are inefficient in the WGAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for highly dimensional datasets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and related challenges, present RandomForestHD with its design principles and implementation details with regards to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
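The "wide data" problem described above can be made concrete with a short Python sketch. It uses the common sqrt(p) feature-sampling heuristic for random forests; this is a standard default, not necessarily the rule RandomForestHD itself uses, and the function name is illustrative:

```python
import math

def features_per_split(p, rule="sqrt"):
    """Number of candidate features a random forest considers at each split
    (sqrt(p) is a common default for classification)."""
    return int(math.sqrt(p)) if rule == "sqrt" else max(1, p // 3)

# Dimensions quoted in the abstract: ~85M variants vs 2,504 individuals.
p_variants = 85_000_000
n_samples = 2504

mtry = features_per_split(p_variants)
print(mtry)              # 9219 candidate variants examined at every split
print(mtry > n_samples)  # True: more candidates per split than samples in total
```

Even the per-split sample of features exceeds the number of individuals, which is why implementations tuned for tall-and-narrow data struggle here.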
Whitepaper: CHI: Hadoop's Rise in Life Sciences (EMC)
Genomics' large, semi-structured, file-based data is ideally suited to a Hadoop Distributed File System. The EMC Isilon OneFS file system features connectivity to the Hadoop Distributed File System (HDFS) that makes the Hadoop storage scale-out and truly distributed. An example from the "CrossBow" project is explored.
White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info... (EMC)
This white paper explains how the Renaissance Computing Institute (RENCI) of the University of North Carolina uses EMC Isilon scale-out NAS storage, Intel processor and system technology, and iRODS-based data management to tackle Big Data processing, Hadoop-based analytics, security and privacy challenges in research and clinical genomics.
Hadoop for Bioinformatics: Building a Scalable Variant Store (Uri Laserson)
Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.
VariantSpark: applying Spark-based machine learning methods to genomic inform... (Denis C. Bauer)
Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. Here we introduce VariantSpark, which utilizes Hadoop/Spark along with its machine learning library, MLlib, providing the means of parallelisation for population-scale bioinformatics tasks. VariantSpark is the interface to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than the Spark-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits of speed, resource consumption and scalability enable VariantSpark to open up the use of advanced, efficient machine learning algorithms to genomic data.
The package is written in Scala and available at https://github.com/BauerLab/VariantSpark.
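To illustrate the kind of population clustering the abstract describes, here is a self-contained pure-Python toy on a simulated genotype matrix. This is only a sketch of the idea; VariantSpark itself runs k-means via Spark MLlib on real VCF data at vastly larger scale:

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(points, iters=10):
    """Two-cluster k-means with deterministic initialisation; illustration only."""
    centers = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sqdist(p, centers[0]) <= sqdist(p, centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

rng = random.Random(42)
# Two simulated populations with very different allele frequencies at 50 sites.
pop_a = [[float(rng.random() < 0.05) for _ in range(50)] for _ in range(20)]
pop_b = [[float(rng.random() < 0.95) for _ in range(50)] for _ in range(20)]

labels = kmeans2(pop_a + pop_b)
print(labels[:20].count(0), labels[20:].count(1))  # 20 20: populations separate cleanly
```

The same principle (individuals cluster by shared variant profiles) is what recovers population structure from the 1000 Genomes-scale data mentioned above.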
" Use of genomics for understanding and improving adaptation to climate chang...ExternalEvents
" Use of genomics for understanding and improving
adaptation to climate change in forest trees " presentation by Sally Aitken, University of British Columbia, Vancouver, Canada
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going (Health Catalyst)
Health system leaders have questions about big data: When will I need it? How should I prepare? What’s the best way to use it? It’s important to separate the hype of big data from the reality. Where big data stands in healthcare today is a far cry from where it will be in the future. Right now, the best use cases are in academic- or research-focused healthcare institutions. Most healthcare organizations are still tackling issues with their transactional databases and learning how to use those databases effectively. But soon—once the issues of expertise and security have been addressed—big data will play a huge role in care management, predictive analytics, prescriptive analytics, and genomics for everyday patients. The transition to big data will be easier if health systems adopt a late-binding approach to the data now.
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science (Ari Berman)
We’ll explore current and future considerations in advanced computing architectures that empower the conversion of data into knowledge. Life sciences produce the largest amount of data of all major science domains, making analytics and scientific computing cornerstones of modern research programs and methodologies. We’ll highlight the remarkable biomedical discoveries that are emerging through combined efforts, and discuss where and how the right infrastructure can catalyze the advancement of human knowledge. On-premises architectures as well as cloud, hybrid, and exotic architectures will all be discussed. It’s likely that all life science researchers will require advanced computing to perform their research within the next year. However, there has been less focus on advanced computing infrastructures across the industry due to the increased availability of public cloud infrastructure and anything-as-a-service models.
Presentation to the Department of Biology at the University of Windsor, Windsor, Ontario. A description and update of activities related to the International Cancer Genome Consortium (ICGC).
Workshop: finding and accessing data - Lunteren, April 18 2016 (Fiona Nielsen)
Workshop presentation on finding and accessing human genomics data for research.
Including statistics of publicly available data sources and tips on how to save time in your workflow of data access.
Presented at BioSB2016, pre-conference PhD retreat for young researchers in bioinformatics and systems biology at Congrescentrum De Werelt in Lunteren. #BioSB2016 #BioSB16
Link to event:
http://www.youngcb.nl/events/biosb-phd-retreat-2016/
Read more about my work:
http://DNAdigest.org
http://repositive.io
https://uk.linkedin.com/in/fionanielsen
Genomics: Big Data Leading to Big Opportunities (Hannes Smárason)
The accumulation of genomic data is a worldwide phenomenon. Cloud-based platforms such as WuXi NextCODE’s Exchange are essential to address the fundamental big data challenge of genomics.
Genome sharing projects around the world - Nijmegen, Oct 29 2015 (Fiona Nielsen)
Genome sharing projects across the world
Did you ever wonder what happened to the exponential increase in genome sequencing data? It is out there around the world and a lot of it is consented for research use. This means that if you just know where to find the data, you can potentially analyse gigabytes of data to power your research.
In this talk Fiona will present community genome initiatives, the genome sharing projects across the world, how you can benefit from this wealth of data in your work, and how you can boost your academic career by sharing and collaboration.
by Fiona Nielsen, Founder and CEO of DNAdigest and Repositive
With a background in software development Fiona pursued her career in bioinformatics research at Radboud University Nijmegen. Now a scientist-turned-entrepreneur Fiona founded DNAdigest and its social enterprise spin-out Repositive Ltd. Both the charity and company focus on efficient and ethical sharing of genetics data for research to accelerate diagnostics and cures for genetic diseases.
Life Technologies' Journey to the Cloud (ENT208) | AWS re:Invent 2013 (Amazon Web Services)
Life Technologies initially planned to build out its own data center infrastructure, but when a cost analysis revealed that by using Amazon Web Services the company would save $325,000 in hardware alone for a single new initiative, the company decided to use AWS instead. Within 6 months of adopting AWS, Life Technologies launched their Digital Hub platform in production, which now undergirds Life Technologies' entire instrumentation product suite. This immediately began to decrease their time-to-market and enhance their customers' user experience. In this session, we provide an overview of our path to the AWS cloud, with particular focus on the evaluation criteria used to make a cloud vendor decision. We also discuss the lessons learned since going into production.
FDA NGS and Big Data Conference September 2014 (Warren Kibbe)
Presentation for the FDA NGS and Big Data Conference September 2014, held on the NIH campus. Covers NCI initiatives, including the Cancer Genomics Data Commons, the NCI Cloud Pilots, and big data issues for cancer.
Beating Bugs with Big Data: Harnessing HPC to Realize the Potential of Genomi... (Tom Connor)
Introducing the HPC challenges associated with developing a set of clinical microbial genomics services in the NHS in Wales. Demonstrating the potential of these technologies, and the impact they are already having for the patients of the Welsh NHS.
MedChemica BigData: What Is That All About? (Al Dossetter)
A light look at the world of BigData for the lay person - a look at a couple of examples and what we do in MedChemica to speed up drug discovery. First presented at Macclesfield SciBar, and then Knutsford SciBar.
Using research software in a production environment (Morgan Taschuk)
The exponential increase in data has caused an analysis bottleneck: the effort needed to manage the data and develop complex analysis pipelines is greater than the effort of collecting it. I discuss some of the major techniques we used to turn our research pipelines into a production system able to analyze diverse datasets with minimal failures. I highlight the importance of valid metadata, the adaptation of research software, and surrounding infrastructure including workflow systems.
This presentation focuses on the networking requirements for using open source to treat diseases through cell-based analysis at the molecular level. Transporting this knowledge across devices and centers requires a whole new structure and networking. Terabits per second with high availability and guaranteed delivery are required to meet the needs. Shared knowledge is critical for real-time analysis. This talk will discuss data flows, open networking, and databases that are all open source and have been optimized for this problem.
Healthcare is changing rapidly. It is clear that humans need mechanisms to automate some parts of data processing and help humans in decision making. This talk will concentrate on how to improve the machine understanding of unstructured data.
Life sciences big data use cases
1. Big data and Life Sciences
Guy Coates
Wellcome Trust Sanger Institute
gmpc@sanger.ac.uk
2. The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus, Cambridge, UK.
Large scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• Large scale sequencing with an impact on human and animal health.
Data is freely available.
• Websites, ftp, direct database access, programmatic APIs.
• Some restrictions for potentially identifiable data.
My team:
• Scientific computing systems architects.
4. Economic Trends:
Cost of sequencing halves every 12 months.
• Wrong side of Moore's Law.
The Human Genome Project:
• 13 years.
• 23 labs.
• $500 million.
A human genome today:
• 3 days.
• 1 machine.
• $8,000.
Trend will continue:
• $1,000 genome is probable within 2 years.
• Informatics not included.
5. The scary graph
• Peak yearly capillary sequencing: 30 Gbase.
• Current weekly sequencing: 7-10 Tbases.
• Data doubling time: 4-6 months.
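The gap between those growth rates is the whole problem, and a little arithmetic makes it concrete. A sketch: the 4-month doubling time comes from the slide; the 18-month capacity doubling is the usual Moore's-law rule of thumb, assumed here for comparison.

```python
# Compare sequencing output growth (doubling every ~4 months) with
# Moore's-law-style capacity growth (doubling every ~18 months).

def doublings(months: float, doubling_time_months: float) -> float:
    """Growth factor after `months`, given a doubling time."""
    return 2 ** (months / doubling_time_months)

years = 3
months = years * 12

data_growth = doublings(months, 4)      # sequencing output
capacity_growth = doublings(months, 18) # compute/storage capacity

print(f"After {years} years: data x{data_growth:.0f}, "
      f"capacity x{capacity_growth:.1f}, "
      f"shortfall x{data_growth / capacity_growth:.0f}")
```

On those assumptions, three years turns a balanced system into one short by two orders of magnitude, which is why buying hardware alone cannot keep up.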
7. Pbytes!
Sequencing data flow.
Sequencer → Processing/QC → Comparative analysis → Archive → Internet
Volumes at each stage: Raw data (10 TB) → Sequence (500 GB) → Alignments (200 GB) → Variation data (1 GB) → Feature (3 MB).
Structured data (databases) vs unstructured data (flat files).
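The volumes on the slide imply a huge reduction at each pipeline stage; a quick sketch of the stage-to-stage reduction factors (sizes taken directly from the slide):

```python
# Data volumes at each sequencing pipeline stage (from the slide),
# and the reduction factor from one stage to the next.
GB = 1.0
TB = 1000 * GB
MB = GB / 1000

stages = [
    ("raw data",       10 * TB),
    ("sequence",       500 * GB),
    ("alignments",     200 * GB),
    ("variation data", 1 * GB),
    ("feature",        3 * MB),
]

for (name_a, size_a), (name_b, size_b) in zip(stages, stages[1:]):
    print(f"{name_a} -> {name_b}: {size_a / size_b:.1f}x smaller")
```

The early stages dominate storage cost, which is why raw data is the first thing a data-management policy decides to throw away.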
8. A Sequencing Centre Today
CPU
• Generic x86_64 cluster (16,000 cores).
Storage
• ~1 TB per day per sequencer.
• 15 PB disk (Lustre + NFS).
Metadata driven data management
• Only keep our important files.
• Catalogue them, so we can find them!
• Keep the number of copies we want, and no more.
• (iRODS, in-house LIMS).
A solved problem; we know how to do this.
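The metadata-driven approach can be sketched in a few lines. This is a toy stand-in for what iRODS and the in-house LIMS actually do: the schema, paths, and field names below are illustrative assumptions, not the real system.

```python
import sqlite3

# Toy metadata catalogue: register files with searchable attributes and
# a desired replica count, so "catalogue them, so we can find them" and
# "keep the number of copies we want" become simple queries.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE files (
    path TEXT PRIMARY KEY, study TEXT, kind TEXT,
    replicas_wanted INTEGER)""")

db.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [("/seq/run1/lane1.bam",   "1000genomes", "alignment", 2),
     ("/seq/run1/lane1.fastq", "1000genomes", "raw",       0),  # raw: drop after QC
     ("/seq/run2/lane3.bam",   "cancer",      "alignment", 2)])

# Find everything belonging to a study.
rows = db.execute(
    "SELECT path FROM files WHERE study = ? ORDER BY path",
    ("1000genomes",)).fetchall()
print([r[0] for r in rows])

# Files we no longer need to keep.
disposable = db.execute(
    "SELECT path FROM files WHERE replicas_wanted = 0").fetchall()
print([r[0] for r in disposable])
```

The point is that deletion and replication become policy queries against the catalogue rather than ad-hoc decisions made per file.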
11. Proper Big Data
We want to compute across all the data.
• Sequencing data (of course).
• Patient records, treatment and outcomes.
Why?
• Cancer: tie in genetics, patient outcomes and treatments.
• Pharma: high failure rate due to genetic factors in drug response.
• Infectious disease epidemiology.
• Rare genetic diseases.
Many genetic effects are small
• Million member cohorts to get good signal:noise.
12. Translation: Genomics of drug sensitivity in Cancer
BRAF inhibitors in malignant melanoma:
• Molecular diagnostic: BRAF mutation positive ✔
• Pre-treatment vs 15 weeks of BRAF inhibitor treatment.
• 70% response rate vs 10% for standard chemotherapy.
Slide from Mathew Garnet (CGP).
13. Current Data Archives
EBI ERA / NCBI SRA store results of all sequencing experiments.
• Public data availability: A Good Thing (tm).
• 1.6 Pbases.
Problems
• Archives are "dark": you can put data in, but you can't do anything with it.
• In order to analyse the data, you need to download it all (100s of Tbytes).
Situation replicated at local institute level too.
• eg How does CRI get hold of their data currently held at Sanger?
14. The Vision
Global Alliance for sharing genomic and clinical data.
• 70 research institutes & hospitals (including Sanger, Broad, EBI, BGI, Cancer Research UK).
Million cancer genome warehouse.
• (UC Berkeley)
15. To the Cloud!
(Diagram: Institute A and Institute B, each with their own data and analysis pipelines, move both into a shared cloud holding the pooled data.)
17. Code & Algorithms
Bioinformatics code:
• Integer not FP heavy.
• Single threaded.
• Large memory footprints.
• Interpreted languages.
Not a good fit for future computing architectures.
Expensive to run on public clouds.
• Memory footprint leads to unused cores.
Out of scope for a data talk, but still an important point.
18. Architectural differences
(Diagram: a global file system serving many CPUs over a fast network with static nodes, VS an object store serving CPUs over a slow network with dynamic nodes.)
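The two models differ mainly in the interface they present to the pipeline. A minimal sketch of the contrast; the toy object-store class below is invented for illustration and is not any particular product's API:

```python
import os
import tempfile

# POSIX model: byte-addressable files in a shared namespace;
# partial in-place updates are allowed.
path = os.path.join(tempfile.gettempdir(), "posix_demo.txt")
with open(path, "w") as f:
    f.write("hello world")
with open(path, "r+") as f:
    f.seek(6)
    f.write("slide")        # rewrite 5 bytes in place: fine under POSIX

class ToyObjectStore:
    """Object model: whole-object PUT/GET only, no partial update."""
    def __init__(self):
        self._objects = {}
    def put(self, key, data):
        self._objects[key] = data   # replaces the entire object
    def get(self, key):
        return self._objects[key]

store = ToyObjectStore()
store.put("demo.txt", b"hello world")
store.put("demo.txt", b"hello slide")  # changing one byte = re-upload all
print(open(path).read(), store.get("demo.txt"))
```

Pipelines written against the POSIX model (seek, append, rewrite) do not port directly to an object store, which is one reason "just move it to the cloud" is harder than it sounds.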
19. Whose Cloud?
A VM is just a VM, right?
• Clouds are supposed to be programmable.
• Nobody wants to re-write a pipeline when they move clouds.
Storage:
• Posix: (Lustre/GPFS/EMC)?
• Object:
• Low level: AWS S3, OpenStack Swift, Ceph/RADOS.
• High level: data management layer (eg iRODS)?
Cloud Interoperability?
• Do we need more standards?!
Pragmatic approach:
• First person to make one that actually works, wins.
20. Moving data
Data still has to get from our instruments to the Cloud.
Good news:
• Lots of products out there for wide area data movement.
Bad news:
• We are currently using all of them(!)
Network bandwidth still a problem.
• Research institutes have fast data networks.
• What about your GP's surgery?
Tools: UDT / UDR, rsync / ssh, genetorrent.
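The bandwidth point is easy to quantify. A rough sketch: the ~1 TB/day figure is from the earlier slide; the link speeds are illustrative assumptions, and the calculation ignores protocol overhead and contention.

```python
# Time to move one day of sequencer output over links of various speeds.

def transfer_hours(bytes_to_move: float, bits_per_second: float) -> float:
    """Idealised transfer time in hours (no overhead, full link rate)."""
    return bytes_to_move * 8 / bits_per_second / 3600

one_day_output = 1e12  # ~1 TB per day per sequencer

for name, bps in [("10 Gbit/s research link", 10e9),
                  ("1 Gbit/s campus link",     1e9),
                  ("20 Mbit/s surgery ADSL",  20e6)]:
    print(f"{name}: {transfer_hours(one_day_output, bps):.1f} hours")
```

On a research backbone the daily output moves in minutes; over a typical GP surgery link the same transfer takes days, so the network, not the tool, is the limiting factor.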
21. Identity Access
Unlikely that data archives are going to allow anonymous access.
• Who are you?
Federated identity providers.
• Is everyone signed up to the same federation?
• Does it include the right mix of cross-national co-operation?
• Does your favourite bit of software support federated IDs?
Janet Moonshot
22. The LAW
Legal
• Theory: anonymised data can be stored and accessed without jumping through hoops.
• Practice: risk of re-identification, which becomes easier the more data you have.
• Medical records are hard to anonymise and still be useful.
Ethical
• Medical consent process adds more restrictions above data-protection law.
• Limits data use & access even if anonymised.
Controlled data access?
• No ad-hoc analysis.
• Access via restricted API only ("trusted intermediary model").
Policy development ongoing.
• Cross-jurisdiction for added fun.
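The "trusted intermediary model" can be sketched as an API that answers aggregate questions but never releases an individual record. A toy illustration: the class name, threshold, and cohort data below are invented for the example.

```python
# Toy "trusted intermediary": callers may ask aggregate questions about
# a cohort, but can never fetch an individual's record.
class TrustedIntermediary:
    MIN_CARRIERS = 5  # suppress small counts: re-identification risk

    def __init__(self, records):
        self._records = records  # private; no raw-record accessor exists

    def variant_frequency(self, variant):
        """Fraction of the cohort carrying `variant`, or None if the
        carrier count is too small to report safely."""
        carriers = sum(1 for r in self._records if variant in r["variants"])
        if 0 < carriers < self.MIN_CARRIERS:
            return None
        return carriers / len(self._records)

# Synthetic 100-person cohort; every 10th person carries the variant.
cohort = [{"id": i, "variants": {"BRAF_V600E"} if i % 10 == 0 else set()}
          for i in range(100)]
api = TrustedIntermediary(cohort)
print(api.variant_frequency("BRAF_V600E"))
print(api.variant_frequency("RARE_X"))
```

Researchers get the statistic they need (frequency) while the API refuses both row-level access and small-count answers that would make re-identification easy.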
23. Summary
We know where we want to get to.
• No shortage of vision.
There are lots of interesting tools and technologies out there.
• Getting them to work coherently together will be a challenge.
• Prototyping efforts are underway.
• Need to leverage expertise and experience in other fields.
Not simply technical issues:
• Significant policy issues need to be worked out.
• We have to bring the public along.
24. Acknowledgements
ISG:
• James Beal
• Helen Brimmer
• Pete Clapham
Global Alliance whitepaper:
http://www.sanger.ac.uk/about/press/assets/130605-white-paper.pdf
Million Cancer Genome Warehouse whitepaper:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-211.html
Editor's Notes
Sequencing is the start of most analysis. People = unmanaged data: data in the wrong place, duplicated, nobody can find anything. Inc systems: backups/security. Capacity planning?