The completion of the project gave us an unprecedented amount of data and insights into the human genetic code.
“The basic next-generation sequencing process involves fragmenting DNA/RNA into multiple pieces, adding adapters, sequencing the libraries, and reassembling them to form a genomic sequence. In principle, the concept is similar to capillary electrophoresis. The critical difference is that NGS sequences millions of fragments in a massively parallel fashion, improving speed and accuracy while reducing the cost of sequencing”
By sequencing the genome and looking for variants, we can accelerate drug discovery, identify mutations that cause disease, and practice personalized medicine.
Any IT team responsible for supporting next-generation sequencing within their organization must address two primary challenges. The first challenge is enabling rapid processing and analysis of data. Ideally, analysis should not become a bottleneck that leads to a significant backlog of unanalyzed or unprocessed raw data. In other words, the time to process, analyze, and manage data should keep up with the rate at which data is produced by the sequencer. The second challenge is efficiently managing petabytes of data. The desire to improve efficiency is driven by government mandates or requirements to share and distribute research data, business policy, limited or shrinking data center space, and the overall cost of maintaining data long-term. Both challenges are likely to emerge as common themes as you qualify an opportunity.
One major challenge in NGS is the rate of analysis. This slide depicts ways analysis throughput can be increased. We know that many life science companies leverage these technologies, but it is equally important that storage doesn't become a bottleneck.
CALL OUT ISILON + PARABRICKS + GPU = 1000 WGS/Week
Confirm the process for genomic data (see the sketch after this list):
Step 1: Genomic sequencer
Step 2: HPC processing and preparation of the data the sequencer produces
Step 3: Store data as VCF (Variant Call Format) files and compare those files against a known control DNA sequence
Step 4: Identify millions of points of interest to investigate further
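To make these steps concrete, here is a minimal sketch of a typical secondary-analysis pipeline, assuming common open-source tools (bwa, samtools, bcftools) and illustrative file paths; a customer's actual pipeline may use GATK, Parabricks, or other tooling instead.

# Minimal secondary-analysis sketch: align raw reads, sort, and call variants.
# Tool choices and all paths are illustrative assumptions, not a mandated stack.
import subprocess

REFERENCE = "ref/GRCh38.fa"                            # reference genome (assumed path)
READS = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]   # Step 1: raw sequencer output

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 2: align and sort on the HPC side.
run(["bwa", "mem", "-t", "16", REFERENCE, *READS, "-o", "sample.sam"])
run(["samtools", "sort", "-@", "8", "-o", "sample.sorted.bam", "sample.sam"])
run(["samtools", "index", "sample.sorted.bam"])

# Step 3: call variants against the reference and store the result as a VCF.
run(["bcftools", "mpileup", "-f", REFERENCE, "sample.sorted.bam", "-O", "u", "-o", "sample.bcf"])
run(["bcftools", "call", "-mv", "-O", "v", "-o", "sample.vcf", "sample.bcf"])

# Step 4: sample.vcf now holds the candidate variants (points of interest)
# for downstream investigation.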
When evaluating a life science opportunity for PowerScale or ECS, be sure to qualify it against the backdrop of the data lifecycle. It is a simple three-phase lifecycle: Generation, Analysis, and Archive. When qualifying a PowerScale or ECS opportunity, focus on understanding how data is used and how the IT environment may change for each phase of the lifecycle.

Data generation catches all the primary or raw data generated from the NGS (or other genomic) instrumentation and prepares it for downstream analysis. Keep in mind that third parties like collaborators or DNA sequencing service providers can be data generation sources too.

During the analysis phase, performance is especially important. Analysis can be further separated into two stages: the first stage often involves serving data to an HPC environment, while later stages combine next-generation sequencing data with other data types using modern data analytics techniques.

The final phase is archive. Archive capacity requirements will vary with the organization: capacity will depend on the organization's size and type, the types of data access, the frequency of access, retention periods, and intended use (for example, research or clinical use).

For a complete picture, it's highly recommended that you include IT and end-user representatives as you qualify the opportunity. It's not uncommon to discover that the IT team does not have a clear understanding of their end users' workflows, analysis, or data management needs. Using the data lifecycle will help you quickly uncover who, what, how, where, and when storage will be used. Including IT and end users will reveal how storage can impact analysis, data management strategies, and challenges. The input collected from your customer will put you in a better position to recommend a PowerScale, ECS, or hybrid configuration.
There is a lot of potential in unstructured data – from home directories to IoT sensors to video files to analytics-filled data lakes – to be able to:
understand business results | anticipate what’s coming | and act quickly on risk and opportunity
Every business is becoming data-driven, or it risks being outsmarted.
Businesses are taking steps to harness this data so they can:
drive innovation
get to market faster
create differentiation
We are now unlocking the power of OneFS to bring software innovations to market faster and to provide more flexibility in use cases, expanding beyond the traditional datacenter. Our customers will benefit from our engineers focusing on OneFS software features while the PowerEdge team focuses on delivering bleeding-edge hardware.
And this is just the beginning of this new journey we are taking with PowerScale.
PowerScale is a new unstructured data storage family based on the new PowerScale OneFS 9.0, which includes new PowerScale-branded 1U nodes, co-existence with existing Isilon clusters, and upgraded capabilities for our cloud offerings.
It offers simplicity at any scale, handles any data anywhere, and finds insights within your infrastructure and your data.
Simplicity at Any Scale: The core strength of OneFS is a future-proof design that allows any new node to merge into an existing cluster in 60 seconds. Once a new node is connected, it is auto-discovered, and data is auto-balanced across every node in the cluster to ensure performance is evenly distributed. It is truly a future-proof design, and we are bringing it forward with powerful new capabilities.
Any Data, Anywhere: To handle any data, we offer flexible file and object access with support for 8 protocols, including S3 access for cloud-native development. Our software-defined approach allows us to run OneFS in more places, from the datacenter to the cloud, and we now support smaller customers and edge locations in a way we've never been able to before. Our new PowerEdge-based all-flash and NVMe nodes provide incredible power in a compact, competitively priced product. No matter the location, the system provides the same great experience and remains efficient, secure, and protected.
Intelligent Insights: We've expanded our software choices for our customers with free tools that help them understand their data. CloudIQ delivers detailed infrastructure insights and storage-level health monitoring across your on-premises environment, while DataIQ is a tool for discovering, understanding, and acting on the data you have – to provide data insights. Many customers don't know much about all the data in their infrastructure; they need a tool that gives them a better "DataIQ."
Together, it's a complete solution for unlocking the potential within your data.
Simplicity at Scale:
The core strength behind Isilon's success was the OneFS file system. With this release we are bringing the best of OneFS forward and delivering new capabilities, including inline dedupe and compression, support for new Ansible workflows, and integration with popular infrastructure frameworks such as Kubernetes and OpenShift.
It is a highly scalable file system that can now start smaller than ever – at 11TB of usable capacity – and scale to very large capacities.
Our No Node Left Behind philosophy is still with us, so you can swap new PowerScale nodes into existing Isilon clusters in 60 seconds and decommission old nodes – with no downtime. Everything is auto-balanced, and the cluster can sustain the loss of multiple nodes at the same time – without downtime.
DevOps Ready: Programmable infrastructure and automation are hot topics these days, and we've got new Ansible workflow support plus support for leading management and container orchestration frameworks, such as Kubernetes and OpenShift, to help customers streamline application development and reduce deployment timeframes.
Kubernetes integrations
At the heart of Simplicity at Any Scale is the migration-free design that allows new nodes to plug into clusters in 60 seconds. We can start as small as 11TB and grow to massive scale in the petabytes with the same ease of use. We've enhanced our efficiency and automation capabilities here.
Any scale: Terabytes to petabytes and millions of file operations
No Node Left Behind: Add nodes in 60 seconds - with no downtime
Auto-balance: Scale-out architecture ensures no hot spots
Resilient: Sustain multi-node failures with no data loss
This is the new PowerScale Family. It spans from edge to core to cloud and includes existing Isilon nodes as well as new PowerScale branded nodes.
We offer all-flash, hybrid, and archive nodes – to offer the right balance between price, performance, and capacity.
They can work together in the same cluster, as we maintain our No Node Left Behind compatibility.
To be clear, PowerScale nodes can join existing Isilon nodes in the same OneFS 9.0 cluster.
We have also extended PowerScale OneFS into the cloud with our partnerships with AWS, Azure, and Google. Last month, we announced a native cloud offer for the Google Cloud Platform. This allows our customers to leverage the cloud in situations where they don't necessarily want to spin up a new site with new hardware.
Next we will talk about how FLEXIBLE the system is.
We can handle virtually any unstructured data type and access method, with support for 8 protocols: NFS, SMB, HDFS, REST, HTTP, NDMP, FTP, and new S3 support.
This flexibility allows any user to get to the data they need in order to create, share, collaborate, and develop using an incredibly powerful, multi-lingual data platform.
The introduction of S3 support enables customers to run modern applications that rely on object storage – perhaps a mobile app built around video clips shared in a certain repository, or a fitness tracker for a school or a system of schools. The possibilities are endless.
Example: You have an existing dataset on Isilon. Upgrade to OneFS 9, and you can use S3 support to give developers an easy way to access your NFS files.
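As a minimal sketch of that example, here is how a developer might read those same files over S3 with boto3. The endpoint URL, port, credentials, and bucket name are placeholder assumptions; OneFS maps buckets you define onto existing directories.

# Read an existing file share through the new S3 protocol using boto3.
# Endpoint, credentials, and bucket name below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://powerscale.example.com:9021",  # assumed cluster S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# List objects in a bucket that maps to an existing NFS/SMB directory.
for obj in s3.list_objects_v2(Bucket="genomics-data").get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one object; the same file remains accessible over NFS.
s3.download_file("genomics-data", "results/sample.vcf", "/tmp/sample.vcf")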
Now Intelligent Insights
Get insights about your infrastructure and your data with CloudIQ and DataIQ. CloudIQ makes it easy to determine the health of your systems across your datacenter.
DataIQ makes it easy for anyone to find and understand data across your PowerScale - and your entire file cloud.
Once locations are indexed, it becomes simple for anyone to find and share files at very high speed. This can accelerate time to insight and truly help your business make decisions faster.
DataIQ allows life science customers to discover where their data is, gain insight into their data, and act on their findings. Many customers are unaware of exactly where their data lives, and data that should be archived often sits on higher-performance storage. With DataIQ, IT or researchers can discover "cold data" and move it to the appropriate tier. This allows for "point and use" file storage with the ability to "right click and archive." Since users can use DataIQ to gain insight and forecast, workflows can be set up to automatically send data where it needs to go (example: sequence done, compress, send to archive). Researchers can quickly recall study information with the ability to search across their storage for files.
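As a purely illustrative sketch of that "sequence done, compress, send to archive" idea – the paths, file pattern, and 90-day threshold are assumptions, and this is not DataIQ's API – the workflow looks like this in plain Python:

# Cold-data archive pass: compress and move files untouched for 90+ days.
# Paths and threshold are assumed; this mimics the workflow DataIQ enables.
import gzip
import shutil
import time
from pathlib import Path

HOT_TIER = Path("/ifs/hot/sequencing")      # assumed performance-tier path
ARCHIVE = Path("/ifs/archive/sequencing")   # assumed archive-tier path
COLD_AFTER = 90 * 24 * 3600                 # 90 days, in seconds

ARCHIVE.mkdir(parents=True, exist_ok=True)
now = time.time()
for src in HOT_TIER.rglob("*.fastq"):
    if now - src.stat().st_atime < COLD_AFTER:
        continue                            # still warm; leave it in place
    dest = ARCHIVE / (src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)       # compress on the way to archive
    src.unlink()                            # free the performance tier
    print(f"archived {src} -> {dest}")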
With DataIQ's ability to showcase cost savings by leveraging the appropriate storage tier, IT can easily create quotas (example: "you're using X amount of storage").
Data management: built as a plug-in
Many customers found Isilon too expensive for everything and wanted to move cold data to a cheaper archive, but they weren't sure what they had and lacked the data management tools to find out.
Point and use any file storage: fast indexing, a searchable catalog, and directory views reveal movable or unused data so it can be moved to archive.
DNA service provider: uses DataIQ in production to save money across storage; IT gets a better view of the data and more tools to manage where it should be, automating data management.
The UI is exposed to end users, allowing for self-service; "cold data" is moved to a lower archive tier, and teams can start to plan and forecast where data needs to go and be.
Talk point: "right click" to send to archive.
Data imaging: catalog building.
Create workflows: sequence done, compress, send to archive, with identifiers for data sets.
Value add to all stakeholders
Tactical
Ability to show cost savings, with billing according to group usage; Isilon features hard and soft quotas ("you're using X amount of storage").
High speed search across file systems / storage repositories
- Environments often consist of NetApp, Isilon, Quantum, GPFS, and archive storage systems
Single Pane of Glass Data Management
- High-end knowledge workers must always be able to find and act on data without a service request from IT
100% Self-Service for Researchers / Producers / Design Managers / Engineers / etc
- Allow business users to manage their own cost and workflow
- Handle access, visibility, and control in a single system
Highly Available and Highly Scalable (Petabytes of data / billions of files)
- The customer wanted to handle both clinical and research data within a single system.
The initial archive was based on a tape library and SGI's DMF… they later swapped that for ECS.
Challenges
A scientific archive with a single pane of glass
Self-service for researchers, producers, and engineers to lower reliance on IT
Access, visibility, and control in a single system
DataIQ
A single UI to view all clinical and research data
A self-service archive
Fast search across billions of files
Archive data to reduce tier 1 storage costs
Our customers have been looking for an end-to-end solution, and this is how it comes together.
PowerScale technology gives you the ability to innovate faster and unlock the potential of your data.
Here is an example life science architecture that supports next-generation sequencing and other genomics workflows. Viewing the architecture from left to right, you can layer the data lifecycle over it. On the far left, next-generation sequencing instruments generate data and transmit that raw data over the CIFS/SMB protocol to an Isilon cluster. PowerScale is at the center of the architecture, as it bridges data generation and analysis.

Once the data is on the Isilon cluster, users may access it using a Windows, macOS, or Linux client, then submit a job to the HPC cluster over NFS to process the raw data. Alternatively, a data scientist might combine the next-generation sequencing data with clinical data over the HDFS protocol to perform an interactive analysis in a Spark environment.

Moving to the right side of the architecture, data enters the archiving phase. Depending on the habits and practices of your life science customer, raw data and results may be replicated via SyncIQ to a PowerScale DR or archive cluster, or a data mover like EcsSync might move the data to the ECS object store, where it can be accessed by collaborators.
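To picture the Spark path in that architecture, here is a minimal PySpark sketch of joining sequencing-derived variant data, read over HDFS from the cluster, with clinical data. The hostname, port, paths, and column names are placeholder assumptions.

# Interactive analysis over HDFS: join variant data with clinical records.
# Endpoint, paths, and column names below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ngs-clinical-join").getOrCreate()

# Read from the cluster's HDFS access zone (assumed endpoint and layout).
variants = spark.read.parquet("hdfs://isilon.example.com:8020/ifs/ngs/variants")
clinical = spark.read.csv(
    "hdfs://isilon.example.com:8020/ifs/clinical/patients.csv",
    header=True, inferSchema=True,
)

# Join on a shared sample identifier and summarize variants per diagnosis.
joined = variants.join(clinical, on="sample_id")
joined.groupBy("diagnosis").count().show()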
Cloud Storage Services with Microsoft Azure provides a higher-bandwidth (up to 100Gbps) and lower-latency (as low as 1.2ms) connection to the cloud using ExpressRoute Local. This solution allows for the right combination of storage and compute in the cloud for data-intensive, high-I/O-throughput workloads that require high compute performance on a periodic and/or unpredictable basis. With no outbound data traffic costs, this solution enables workloads that require a lot of temporary writes to storage to cost-effectively take advantage of Azure's application services. This is ideal for verticals such as Life Sciences and Media and Entertainment, giving users the best of both worlds – reliable, cost-effective Dell EMC storage performance at scale and the scalable compute performance of Microsoft Azure.
USE CASES:
Life Sciences:
Genome analysis is one of the key use cases for life sciences. The raw data generated by a genomic sequencer for the complete genome of a single human is approximately 100GB. This dictates a requirement for a massively scalable file system to which capacity and performance can be added. Genome alignment and sorting, which are both part of the secondary analysis stage, are the most compute- and storage-demanding steps and can require network throughput of 10Gb/s or even 100Gb/s. Dell EMC and Azure testing has demonstrated that the performance of Isilon scales out linearly to match the I/O demands of an increasing number of Azure VMs supporting the genome alignment stage. The 100Gb/s ExpressRoute Local connection between Isilon and Azure enables both the compute performance in Azure and the storage performance in Isilon to scale up to process real-world genome analysis.
Large research facilities processing hundreds of thousands of genomes per year generate petabytes of very large file data (typically 500GB per file set) to be stored, and they have a demand for computing power that is bursty by nature – a perfect application for on-demand, easily scalable cloud computing. In addition, since genomic processing is, at its core, a pattern-matching application, a large part of the analysis workflow involves writes to temporary files on the Isilon storage.
Focused On Life Science Organizations Since 2008
Used by 400+ Organizations For NGS, HPC And Research Archive Workloads
Installed At:
8 Of The Top 10 Global Pharmaceutical Companies
40% Of The Top 100 North American Academic Medical Centers
11 NIH Research Centers
37% Of Sequencing Sites Worldwide