The document discusses ADAM, a new framework for scalable genomic data analysis. It aims to make genomic pipelines horizontally scalable by using a columnar data format and in-memory computing. This avoids disk I/O bottlenecks. The framework represents genomic data as schemas and stores data in Parquet for efficient column-based access. It has been shown to reduce genome analysis pipeline times from 100 hours to 1 hour by enabling analysis on large datasets in parallel across many nodes.
DNA sequencing is producing a wave of data which will change the way that drugs are developed, patients diagnosed, and our understanding of human biology. To fulfill this promise, however, the tools for interpretation and analysis must scale to match the quantity and diversity of "big data genomics."
ADAM is an open-source genomics processing engine, built using Spark, Apache Avro, and Parquet. This talk will discuss some of the advantages that the Spark platform brings to genomics, the benefits of using technologies like Parquet in conjunction with Spark, and the challenges of adapting new technologies for existing tools in bioinformatics.
These are slides for a talk given at the Apache Spark Meetup in Boston on October 20, 2014.
3. The Sequencing Abstraction
It was the best of times, it was the worst of times…
• Humans have 46 chromosomes and each
chromosome looks like a long string
• We get randomly distributed substrings, and want
to reassemble original, whole string
the worst of
It was the
the best of
worst of times
times, it was
best of times
was the worst
Metaphor borrowed from Michael Schatz
4. Genomics = Big Data
• Sequencing run produces >100 GB of raw data
• Want to process 1,000’s of samples at once to
improve statistical power
• Current pipelines take about a week to run and are
not horizontally scalable
6. What’s our goal?
• Human genome is 3.3B letters long, but our reads
are only 50-250 letters long
• Sequence of the average human genome is known
• Insight: Each human genome only differs at 1 in
1000 positions, so we can align short reads to
average genome, and compute diff
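The "align, then diff" insight above can be sketched in a few lines of Python. This is a toy illustration only: the function names, the 20-letter reference, and the reads are invented for the example, and the brute-force ungapped scan stands in for the indexed, gap-aware aligners that real pipelines use.

```python
def align_read(reference: str, read: str) -> tuple[int, int]:
    """Return (offset, mismatches) for the best ungapped placement of a read.

    Brute force: try every offset and count mismatching letters.
    Ties are resolved in favor of the leftmost offset.
    """
    best_offset, best_mismatches = 0, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


def diff_against_reference(reference: str, reads: list[str]) -> dict[int, str]:
    """Map reference position -> differing letter seen in the aligned reads."""
    differences = {}
    for read in reads:
        offset, _ = align_read(reference, read)
        for i, base in enumerate(read):
            if reference[offset + i] != base:
                differences[offset + i] = base
    return differences


# Hypothetical data: the sample differs from the reference at position 13 (T -> G).
reference = "ACGTTGCAAGGCTTAACCGT"
sample_reads = ["ACGTTGCA", "AGGCTGAA", "TGAACCGT"]

differences = diff_against_reference(reference, sample_reads)
print(differences)
```

Only the positions that differ are reported, which is why the output stays small even when the reference is billions of letters long.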
7. Align Reads
It was the best of times, it was the worst of times…
best of times
was the worst
It was the
the best of
times, it was
the worst of
worst of times
8. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
best of times
was the worst
times, it was
the worst of
worst of times
9. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
best of times
was the worst
times, it was
the worst of
worst of times
10. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
best of times
was the worst
times, it was
the worst of
worst of times
11. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
best of times
was the worst
worst of times
12. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
best of times
worst of times
was the worst
13. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
best of times
worst of times
was the worst
14. Align Reads
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
15. Assemble Reads
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
16. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was
the worst of
worst of times
best of times
was the worst
17. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was
was the worst
the worst of
worst of times
18. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was
the worst
the worst of
worst of times
19. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was the worst
of
worst of times
20. Assemble Reads
It was the best of times, it was the worst of times…
It was the best of times, it was the worst of times
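The align-and-assemble animation above can be reproduced with the slides' own sentence. The sketch below is an assumption-laden toy: `assemble` is a hypothetical helper that places each read by exact string match and then merges the placed reads position by position. Real reads contain errors and repeats, so real tools cannot rely on exact matching.

```python
def assemble(reference: str, reads: list[str]) -> str:
    """Place each read on the reference by exact match, then merge the reads.

    Positions not covered by any read are shown as '-'.
    """
    consensus = {}
    for read in reads:
        offset = reference.find(read)  # leftmost exact match, or -1
        if offset < 0:
            continue  # toy behavior: silently skip reads that do not match
        for i, letter in enumerate(read):
            consensus[offset + i] = letter
    length = max(consensus) + 1 if consensus else 0
    return "".join(consensus.get(i, "-") for i in range(length))


reference = "It was the best of times, it was the worst of times"
reads = [
    "the worst of", "It was the", "the best of", "worst of times",
    "times, it was", "best of times", "was the worst",
]

assembled = assemble(reference, reads)
print(assembled)
```

The reads arrive in arbitrary order; the order does not matter because each read is placed independently before the merge.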
21. Overall Pipeline Structure
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
22. Overall Pipeline Structure
End to end pipeline takes ~120 hours
The stages take ~100 hours; ADAM works here
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
24. Key Observations
• Current genomics pipelines are I/O limited
• Most genomics algorithms can be formulated as
either data-parallel or graph-parallel computation
• Genomics is heavy on iteration/pipelining, data
access pattern is write once, read many times
• High coverage, whole genome (>220 GB) will
become main dataset for human genetics
25. ADAM Principles
• Use schema as “narrow waist”
• Columnar data representation +
in-memory computing eliminates
disk bandwidth bottleneck
• Minimize data movement: send
code to data
Stack (layer: example)
Application: Transformations
Presentation: Enriched Models
Evidence Access: MapReduce/DBMS
Schema: Data Models
Materialized Data: Columnar Storage
Data Distribution: Parallel FS/Sharding
Physical Storage: Disk
26. Data Independence
• Many current genomics systems require data to be
stored and processed in sorted order
• This is an abstraction inversion!
• Narrow waist at schema forces processing to be
abstract from data, data to be abstract from disk
• Do tricks at the processing level (fast coordinate-system
joins) to give necessary programming
abstractions
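One way to picture a coordinate-system join that does not depend on sorted input is a binned hash join: both sides are bucketed by (contig, bin), and only regions that share a bucket are tested for overlap. The sketch below is an illustration under that assumption, not ADAM's implementation; `Region`, `region_join`, and the bin size are hypothetical names and values chosen for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Region(NamedTuple):
    contig: str
    start: int  # inclusive
    end: int    # exclusive, assumed to be greater than start
    name: str


def overlaps(a: Region, b: Region) -> bool:
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def bins(region: Region, bin_size: int) -> range:
    """Indices of every fixed-width bin that the region touches."""
    return range(region.start // bin_size, (region.end - 1) // bin_size + 1)


def region_join(left, right, bin_size=1000):
    """Return the set of (left name, right name) pairs whose regions overlap."""
    index = defaultdict(list)
    for r in right:
        for b in bins(r, bin_size):
            index[(r.contig, b)].append(r)
    pairs = set()  # a set, because a pair can meet in more than one bin
    for l in left:
        for b in bins(l, bin_size):
            for r in index.get((l.contig, b), []):
                if overlaps(l, r):
                    pairs.add((l.name, r.name))
    return pairs


# Deliberately unsorted input: the join result does not depend on order.
reads = [
    Region("chr2", 50, 150, "r3"),
    Region("chr1", 1990, 2090, "r2"),
    Region("chr1", 100, 200, "r1"),
]
genes = [Region("chr1", 150, 2000, "geneA"), Region("chr2", 500, 900, "geneB")]

joined = region_join(reads, genes)
print(sorted(joined))
```

Because the bucketing is keyed rather than ordered, the same idea maps naturally onto a partitioned, data-parallel engine.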
27. Data Format
• Genomics algorithms frequently
access global metadata
• Schema is fully denormalized,
allows O(1) access to metadata
• Make all fields nullable to allow for
arbitrary column projections
• Avro enables literate
programming
record AlignmentRecord {
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, int } mapq = null;
union { null, string } readName = null;
union { null, string } sequence = null;
union { null, string } mateReference = null;
union { null, long } mateAlignmentStart = null;
union { null, string } cigar = null;
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { int, null } basesTrimmedFromStart = 0;
union { int, null } basesTrimmedFromEnd = 0;
union { boolean, null } readPaired = false;
union { boolean, null } properPair = false;
union { boolean, null } readMapped = false;
union { boolean, null } mateMapped = false;
union { boolean, null } firstOfPair = false;
union { boolean, null } secondOfPair = false;
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { boolean, null } readNegativeStrand = false;
union { boolean, null } mateNegativeStrand = false;
union { boolean, null } primaryAlignment = false;
union { boolean, null } secondaryAlignment = false;
union { boolean, null } supplementaryAlignment = false;
union { null, string } mismatchingPositions = null;
union { null, string } origQual = null;
union { null, string } attributes = null;
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;
union { null, Contig } mateContig = null;
}
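To see why nullable fields make arbitrary column projections possible, consider a Python mirror of a handful of the fields above. This is a hypothetical sketch: the dataclass copies only six field names from the schema, and `project` is an invented helper, not part of Avro or ADAM. Any column that is not read simply stays at its default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignmentRecord:
    # Subset of the schema's fields; every field has a default, so a record
    # can be built from any combination of columns.
    readName: Optional[str] = None
    sequence: Optional[str] = None
    start: Optional[int] = None
    mapq: Optional[int] = None
    readMapped: Optional[bool] = False
    duplicateRead: Optional[bool] = False


def project(columns: dict[str, list], wanted: list[str]) -> list[AlignmentRecord]:
    """Build records from only the requested columns; other fields keep defaults."""
    row_count = len(next(iter(columns.values())))
    return [
        AlignmentRecord(**{name: columns[name][i] for name in wanted})
        for i in range(row_count)
    ]


# Hypothetical column-oriented data for three reads.
columns = {
    "readName": ["read1", "read2", "read3"],
    "sequence": ["ACGT", "GGCA", "TTAG"],
    "start": [100, 250, None],
    "mapq": [60, 30, None],
    "readMapped": [True, True, False],
    "duplicateRead": [False, True, False],
}

# A flagstat-style count needs only the two boolean columns.
records = project(columns, ["readMapped", "duplicateRead"])
mapped = sum(1 for r in records if r.readMapped)
duplicates = sum(1 for r in records if r.duplicateRead)
print(mapped, duplicates)
```

The large `sequence` column is never touched, which is the effect a columnar store exploits.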
28. Parquet
• ASF Incubator project, based on
Google Dremel
• http://www.parquet.io
• High performance columnar
store with support for projections
and push-down predicates
• 3 layers of parallelism:
• File/row group
• Column chunk
• Page
Image from Parquet format definition: https://github.com/Parquet/parquet-format
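Projections and push-down predicates can be simulated without any Parquet library. The sketch below is a pure-Python stand-in and does not use the Parquet API: `RowGroup` and `scan` are hypothetical names, and the min/max statistics are computed on demand here, whereas a real columnar file keeps them in metadata so a reader can skip data without touching it.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    columns: dict[str, list]  # column name -> values of that column chunk

    def stats(self, column: str) -> tuple:
        """(min, max) of one column chunk; stands in for stored metadata."""
        chunk = self.columns[column]
        return min(chunk), max(chunk)


def scan(row_groups, projection, column, low, high):
    """Return (rows, row groups read) for rows with low <= row[column] < high.

    Row groups whose statistics cannot satisfy the predicate are skipped,
    and only the projected columns are materialized.
    """
    rows, groups_read = [], 0
    for group in row_groups:
        group_min, group_max = group.stats(column)
        if group_max < low or group_min >= high:
            continue  # predicate pushdown: skip the whole row group
        groups_read += 1
        for i, value in enumerate(group.columns[column]):
            if low <= value < high:
                rows.append(tuple(group.columns[name][i] for name in projection))
    return rows, groups_read


# Hypothetical data: three row groups of three reads each.
row_groups = [
    RowGroup({"start": [100, 250, 900], "mapq": [60, 30, 60],
              "sequence": ["ACGT", "GGCA", "TTAG"]}),
    RowGroup({"start": [1500, 1800, 2100], "mapq": [0, 60, 45],
              "sequence": ["CCGA", "ATAT", "GGGC"]}),
    RowGroup({"start": [5000, 5200, 7000], "mapq": [60, 60, 20],
              "sequence": ["TGCA", "AACC", "GTGT"]}),
]

rows, groups_read = scan(row_groups, ["start", "mapq"], "start", 1000, 2000)
print(rows, groups_read)
```

Two of the three row groups are skipped from their statistics alone, and the `sequence` column is never read.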
29. Access to Remote Data
• For genomics, we often have a really huge dataset
which we only want to analyze part of
• This dataset might be stored in S3/equivalent
object store
• Minimize data movement by allowing Parquet to
support predicate pushdown/projections into S3
• Work is in progress, found at
https://github.com/bigdatagenomics/adam/tree/multi-loader
30. Performance
• Reduced pipeline time
from 100 hrs to ~1hr
• Linear speedup through
128 nodes, when
processing 234GB of data
• For flagstat, columnar
projection leads to a 5x
speedup
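A back-of-the-envelope check ties these numbers to the pipeline totals quoted earlier in the deck (~120 hours end to end, ~100 hours in the stages ADAM targets). The calculation assumes that the "100 hrs to ~1hr" figure refers to exactly those stages and that the remaining ~20 hours are unchanged; neither assumption is stated on the slides.

```python
def end_to_end_hours(total: float, stage_before: float, stage_after: float) -> float:
    """New end-to-end time when only one part of the pipeline gets faster."""
    return total - stage_before + stage_after


total_hours = 120.0        # end-to-end pipeline, from the deck
stage_before_hours = 100.0  # stages ADAM works on, from the deck
stage_after_hours = 1.0     # reported time after the speedup

new_total = end_to_end_hours(total_hours, stage_before_hours, stage_after_hours)
stage_speedup = stage_before_hours / stage_after_hours
overall_speedup = total_hours / new_total

print(f"new end-to-end time: {new_total:.0f} h")
print(f"stage speedup: {stage_speedup:.0f}x, overall speedup: {overall_speedup:.1f}x")
```

Under these assumptions a 100x speedup of the accelerated stages yields roughly a 5.7x end-to-end improvement, since the untouched ~20 hours then dominate the run time.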
31. ADAM Status
• Apache 2 licensed OSS
• 25 contributors across 10 institutions
• Pushing for production 1.0 release towards end of year
• Working with GA4GH to use concepts from ADAM to
improve broader genomics data management techniques
32. Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos
Kozanitis, Dave Patterson, Anthony Joseph
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael
Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
• The Broad Institute: Chris Hartl
• Cloudera: Uri Laserson
• Microsoft Research: Jeremy Elson, Ravi Pandya
• And other open source contributors, including Michael Heuer, Neil
Ferguson, Andy Petrella, Xavier Tordoir!