Research and technology explosion in scale-out storage



A view of the directions storage is taking in science and technology from Ryan Sayre, technical strategist in the office of the CTO for EMC Isilon, using examples from recent work in life science genomics and other industries that take advantage of the combination of extreme computing (HPC) and big data. As presented at the Bull-sponsored Science & Innovation 2013 conference in Westminster.

Published in: Technology
  • High Performance Computing has influenced and changed the way we manage our scientific endeavours in the UK and beyond. The evolution of how we use scale-out compute infrastructure also affects the way we store data. The traditional islands of data storage used in previous eras cannot scale to meet the current challenges of bioinformatics, complex scientific simulations, and technical innovation. Scaling out is the only way to manage the size of the problems being solved today. UK case studies in research and technology, and related opportunities, will be discussed.
  • Note to Presenter: View in Slide Show mode for animation. We hear a lot about Big Data, but sometimes the definition isn’t clear. Here is a useful definition from Wikipedia: Big Data is data that challenges the capabilities of a system to capture, manage, and process it within a tolerable elapsed time. In the context of today’s presentation, the two key attributes we’ll be discussing are the volume of the data and its composition. In terms of “volume,” we’ll focus on the multi-terabyte to multi-petabyte range. For “composition,” we’ll focus primarily on unstructured, file-based data. In this context, Big Data includes audio, video, graphics, images, and enterprise file data sets such as office files, home directories, VMDKs, and large-scale file archives. Isilon supports all kinds of unstructured and file-based data.
  • Source for 88 million outpatients. If ¼ of the patients opted for genomic analysis due to a possible genetic factor in their health, this would amount to over an exabyte of storage. To put that into perspective, that’s about 10 days’ worth of the data that all of the servers at Google process daily, extrapolating for data growth from (
  • Here’s an example of one of these next-generation sequencing machines. It’s beautiful, and it can output a lot of useful data for scientists to sift through and find meaning in.
  • These are prime examples of data-intensive industries where Isilon storage systems have been proven to deliver significant customer benefits: Medical Imaging; Gene Sequencing; Seismic Exploration in the Oil & Gas industry; Video & Graphics (Media & Entertainment); Satellite Images; Product Development. Companies in these industries have been at the leading edge because large-scale files and unstructured data (Big Data) have caused these firms to adopt innovative storage approaches and embrace Isilon.
  • Legacy scale-up file systems and volume sizes are inadequate. This leads to multiple file systems and hundreds of volumes, which increases management overhead, lowers capacity efficiency, and adds complexity.
  • Here we see the dilemma of scale-out versus scale-up in graphic form. Scalability: scale-up achieves capacity growth only, with limited performance options. In contrast, scale-out provides both performance and capacity scalability. Performance: with scale-up, we see a true degradation of performance and capacity at scale. In contrast, scale-out has true linear predictability.
  • Isilon scale-out NAS is an ideal storage platform for consolidation of your application data. Note to Presenter: Click now in Slide Show mode for animation. We’ll go into these capabilities in more detail later, but here is a summary of a number of important innovations from Isilon:
    – Isilon storage is easy to scale and can support over 20PB of data in a single Isilon cluster.
    – Unlike traditional storage alternatives, Isilon storage performance increases linearly with growth in storage capacity.
    – Isilon storage is highly efficient; you can achieve over 80 percent storage utilization with Isilon’s scale-out NAS solutions.
    – Isilon’s storage systems are highly resilient and can maintain 100 percent data availability, even with multiple component failures (including disk drives or entire nodes).
    – Isilon provides a comprehensive portfolio of data protection and management software to help you get the full value of your Isilon storage systems.
    – And with Isilon, you never need to migrate data again.
  • Note to Presenter: Click now in Slide Show mode for animation. This slide shows how Isilon SmartPools software can help you optimize storage resources with automated tiering. SmartPools is integrated with the Isilon OneFS operating system to allow a single point of management, with a single scalable file system that offers multiple tiers of performance, depending on the data. The automated, policy-based data movement is transparent to the users, and no application changes are required.
  • Note to Presenter: View in Slide Show mode for animation. Isilon storage systems are highly resilient and provide unmatched data protection and availability. Isilon uses the proven Reed-Solomon erasure encoding algorithm rather than RAID to provide a level of data protection that goes far beyond traditional storage systems. Here is an example of the flexibility and types of data protection that are standard in an Isilon cluster:
    – With N+1 protection, data is 100 percent available even if a single drive or node fails. This is similar to RAID 5 in conventional storage. (Note to Presenter: Click now in Slide Show mode for animation.)
    – N+2 protection allows two components to fail within the system, similar to RAID 6.
    – With N+3 or N+4 protection, three or four components can fail while the data stays 100 percent available.
    Isilon FlexProtect is the foundation for data resiliency and availability in Isilon storage solutions. Legacy “scale-up” systems still depend on traditional data protection. They typically use traditional RAID, which consumes 30 to 50 percent of the available disk capacity. The time to rebuild a RAID group after a drive failure continues to increase with drive capacity, and a two-disk failure can cause data loss.
    Isilon’s industry-leading data protection provides 100 percent accessibility to data with one-, two-, three-, or four-node failures in a pool. Protection levels can be set at the file, directory, or file system level, so all data can be treated independently, meeting SLAs based on the application or type of data.
    Because of the distributed yet symmetric nature of the cluster, all nodes participate in restoring the portions of files from a failed drive. As the cluster grows, rebuilds become faster and more efficient, which makes adopting larger-capacity drives very simple: the larger the storage system, the faster a replaced drive is rebuilt. Drives in Isilon solutions are hot pluggable and hot swappable, with no downtime.
  • With Isilon, you can streamline your storage infrastructure by consolidating large-scale file and unstructured data assets, eliminating silos of storage.
    – Platform REST API: Isilon solutions incorporate a platform REST (representational state transfer) API to provide you and third-party ISVs with a robust control interface to the Isilon OneFS operating system for further automation, orchestration, and provisioning of your Isilon storage cluster.
    – VMware integration: Isilon storage solutions readily integrate with your VMware environment and incorporate VMware VAAI and VASA APIs to simplify storage management in your virtualized IT environment.
    – Multi-protocol support: Isilon scale-out NAS includes integrated support for a wide range of industry-standard protocols, including NFS, SMB, HTTP, FTP, and native Hadoop HDFS, to simplify your business analytics initiatives, simplify and consolidate workflows, increase flexibility, and get more value from your enterprise applications and data.
    These levels of interoperability help you leverage your large data assets more flexibly with a broad range of applications and workloads, and across a diverse IT infrastructure environment.
  • Isilon storage systems are extremely easy to use. This “simple to manage” approach translates into significant cost savings for you. A recent IDC white paper details Isilon’s cost advantages for enterprise environments. As shown in the graphic on the left, IDC investigated the relative amount of time needed by IT professionals to perform a wide range of data and storage management functions (listed on the y axis) for Isilon as well as for traditional storage systems. Isilon storage is easier to manage and requires less time. The study showed that with Isilon scale-out NAS, enterprises were able to increase IT productivity by 48 percent and thereby reduce OpEx (operating expenditures).
  • The IDC study also found that as a result of Isilon storage systems’ unmatched efficiency (over 80 percent storage utilization), organizations were able to reduce CapEx (capital expenditures) significantly. With the reduced CapEx and the increase in IT productivity, enterprise customers were able to reduce their overall storage costs by 40 percent with Isilon scale-out NAS, compared to traditional storage systems.
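The N+1 through N+4 protection levels described in the speaker notes above can be made concrete with a small sketch. This is not OneFS code: the notes say Isilon uses Reed-Solomon erasure coding, whereas the example below substitutes plain XOR parity, which only covers the single-failure (N+1, RAID 5-like) case. The stripe width of 8 data units and all function names are assumptions chosen for illustration.

```python
from functools import reduce


def protection_overhead(data_units: int, parity_units: int) -> float:
    """Fraction of raw capacity consumed by protection in an N+M stripe."""
    return parity_units / (data_units + parity_units)


def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single lost block from the surviving blocks and the parity."""
    return xor_parity(surviving + [parity])


if __name__ == "__main__":
    # Hypothetical stripe width of 8 data units; not an Isilon default.
    for m in (1, 2, 3, 4):
        share = protection_overhead(8, m)
        print(f"8+{m}: {share:.1%} of raw capacity used for protection")

    data = [b"ACGT", b"TTGA", b"CCAG"]  # three data blocks on three nodes
    parity = xor_parity(data)            # one parity block on a fourth node
    lost = data.pop(1)                   # simulate a single node failure
    assert rebuild_missing(data, parity) == lost
```

Under the assumed stripe width, protection takes roughly 11 percent of raw capacity at 8+1 and roughly 33 percent at 8+4, which is the kind of trade-off the notes contrast with the 30 to 50 percent they attribute to traditional RAID.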

    1. Research and Technology Explosion in the Scale-Out Storage Era: Exploring the new frontier of perpetual data growth and how it will affect us. Ryan Sayre, Technical Strategist, EMEA, EMC ISD Office of the CTO. June 2013. © Copyright 2013 EMC Corporation. All rights reserved.
    2. What Is Big Data? “Data that challenges the capabilities of a system to capture, manage, and process it within an acceptable elapsed time” (Wikipedia)
    3. The Big Data Challenge: By 2013, 80% of all storage capacity sold will be for file-based data. [Chart: exabytes (0 to 90) by year, 2009 to 2014. File based: 61.8% CAGR; block based: 23.7% CAGR. Labels: Media & Entertainment, Design & Simulation, Healthcare, Bioinformatics, Data Analytics, File Shares & Archives.] Source: “Scale Out Storage in the Content Driven Enterprise: Unleashing the Value of Information Assets,” IDC White Paper (2010 Enterprise Disk Storage Consumption Model), June 2011
    4. Genomics Size. [Diagram comparing EMR, Radiology, and Genomics; labels include “* 1000”, “Volume 50GB”, and “*finished data”.] 88 million outpatient visits to NHS hospitals in 2010/2011. Sources: Dr. Halamka, BIDMC; S. Joshi, internal research; HIMSS; internal EMC data
    5. [No text on this slide]
    6. Bioinformatics: A “data tsunami”
        • Already a cliché in 2006:
          – “Data Deluge”, “Data Tsunami” …
        • What changed starting in 2007: terabyte-scale laboratory instruments
          – “Next Generation” DNA sequencers
          – Confocal microscopy & live cell imaging
          – Other imaging (fMRI, CT, ultrasound, etc.)
        • 2010: Faster adoption of next-generation sequencing
        • 2013: Scale-Out Storage is the only way to keep surviving!
    7. Vast quantities of data
        • Terabyte-scale issues have traditionally been “lab” or “workgroup” problems
        • Individual researchers & lab instruments can generate terabyte volumes of data per experiment
          – Average of 40TB storage for each Solexa instrument
          – A recent “100TB Single-namespace” project was for a lab with a single 454 instrument
    8. Sequencing throughput over time (data from one vendor’s platform). [Chart: gigabases of sequence per run, axis 0 to 20; annotated “15 x”.]
    9. Throughput Outpacing Moore’s Law
        • 1000 Genomes Project
          – Could generate 90 Tbase of raw data (@ 30x coverage)
        • International Cancer Genome Consortium
          – 50,000+ samples could generate 5,000 Tbase of raw data
        [Chart: kb/day, axis 1 to 1,000,000 in powers of ten, from 1996 to today; labels “kb/Day” and “CPU”.]
    10. Sequencer capacity is growing enormously
        • Dependent infrastructure has become a significant and critical factor
        • Home-grown storage and compute resources are capable of supporting data reduction and alignment
        • Specialized HPC and storage architectures are required to meet aggregate throughput and processing demands
        • Current HPC architectures can be resource prohibitive at the quantity required to manage data output
        [Chart: G per instrument, axis 0 to 80, over time.]
    11. Broad Institute Sequencing Data
    12. Big Data Apps Need Big Data Storage. Data-intensive HPC workflows: Medical Imaging, Gene Sequencing, Seismic Exploration, Media & Entertainment, Product Development, Satellite Images
    13. Big Data Archive Challenge
        • Relentless Data Growth
        • Primary Storage Overloaded with Unstructured Files
          – Constant upgrade requirements
        • Performance Issues
          – Hinders regulatory responses and e-discovery applications
        • Storage Islands
          – Many Systems or 2-way clusters and Points of Management
          – Numerous File Systems/Volumes
    14. My own Big Data Growth Story…
        • Started out at 1 Terabyte of shared storage in 2004
          – Image Processing and Visualisation
          – Quickly grew to 5 Terabytes within 5 months
          – Was worrying about storage every day, needed a way out!
          – Transitioned to Scale-Out, scaled to 300 TB within 3 years
        • Current organisation is over 2 Petabytes of storage
          – No dedicated storage administrator
          – I/O patterns are managed by policy and tier now
    15. UK Case Study: Life Sciences Institute
        • Bioinformatics organisation needing not only to store but also to cross-reference multiple genome types to create a mega database of genomic structural variants across all species
        • Share across multiple organisations across the UK and into greater Europe
        • Need to grow to 20 Petabytes and beyond
    16. UK Case Study: Engineering Design Automation
        • Performance requirements of over 1 million operations a second to simulate complex electrical pathways
        • Time to market required more rapid simulations to advance the technology roadmap
        • Multiple protocols across Windows and Linux systems
        • Growing for both performance and capacity (PBs)
    17. The Scale-Out / Scale-Up Dilemma: Scale-out (Isilon OneFS) versus Scale-up (other storage platforms)
        • Scalability
          – Scale-out: performance, capacity, or both
          – Scale-up: capacity only, limited performance options
        • Performance
          – Scale-out: true linear predictability
          – Scale-up: degradation of performance & capacity at scale
    18. What does this look like?
    19. Isilon Scale-Out NAS Architecture. [Diagram labels: Client/Application Layer (CIFS, NFS, FTP, HTTP, HDFS for Hadoop); Ethernet Layer; Servers; Intra-cluster Communication Layer; OneFS Operating Environment; Single FS/Volume.]
    20. Isilon Scale-Out Innovation: single storage pool for application consolidation
        • Simple to scale
          – Manage 20+ PB like a 1TB drive
        • Predictable performance
          – Grows linearly
        • Efficient and easy to operate
          – Maximize utilization to 80%+
          – Automate tiering
        • Highly resilient
          – Survives multiple failures
        • Enterprise proven
          – Management and protection tools that customers expect
        • No data migrations
    21. Largest and Most Scalable File System: more scalable than traditional storage systems
        • OneFS scales from 18 TB to more than 20 PB in a single file system, single volume
        • Under 60 seconds to scale with no downtime
        • World’s fastest performance and capacity scaling
        • Over 100 GB/s of throughput
    22. Gain New Levels of Efficiency: AutoBalance. Automated data balancing across nodes reduces costs, complexity, and risks for scaling storage
        • AutoBalance automatically moves content to new storage nodes while the system is online and in production
        • Eliminates “hot spots”
        • Enables unmatched storage capacity utilization of more than 80%
        [Diagram: Isilon AutoBalance; nodes labelled EMPTY and FULL become BALANCED.]
    23. Optimize Resources with Automated Tiering: Isilon SmartPools
        • Single point of management
          – Single file system/single volume
          – Multiple performance tiers
        • Automatic data movement
          – Policy-based tiering management
          – Transparent reallocation
          – NO application changes
        • Optimize storage resources
          – Automatically match storage resources with data requirements
          – Eliminate data migration
        [Diagram: files across tiers at reduced cost/TB; S-Series: Performance; X-Series: Collaboration; NL-Series: Active archives.]
    24. Unmatched Data Protection and Availability: highly resilient, clustered architecture
        • With N+1 protection, data is 100% available even if a single drive or node fails
        • With N+2, N+3, and N+4 protection, data is 100% available if multiple drives or nodes fail
        • And with Isilon, the more nodes in the cluster, the faster the drive rebuild time
        [Diagram: nodes labelled 100% and FAILED.]
    25. Interoperability for Operational Flexibility
        • Platform REST API
          – Simplify management and integration
          – Third-party application integration
        • VMware integration
          – VAAI: vStorage APIs for array integration
          – VASA: vSphere APIs for storage awareness
          – Virtual Server writeable clones
        • Multi-protocol support
          – Integrated support for industry-standard protocols
          – Native HDFS support
    26. The Cost Advantage of Scale-Out: ease of use and management simplicity. IDC: Isilon improves IT productivity by 48%, reduces OPEX*. [Chart: FTE hours per TB in use (0.0 to 2.0), Isilon versus Traditional, for: storage allocation; storage provisioning; managing capacity; managing backup; space reclamation; adding new applications; uploading or re-loading data.] * Source: “Quantifying the Business Benefits of Scale-Out NAS Solutions,” IDC White Paper, November 2011
    27. The Cost Advantage of Scale-Out: reduces Big Data storage costs by 40%. [Chart: average annual cost per TB in use ($0 to $2,500), Traditional versus Isilon, split into OPEX, IT Staff, and CAPEX.] Source: “Quantifying the Business Benefits of Scale-Out NAS Solutions,” IDC White Paper, November 2011
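The genomics figures on the “Genomics Size” and “Throughput Outpacing Moore’s Law” slides can be sanity-checked with back-of-envelope arithmetic. The sketch below makes several assumptions that are not stated in the deck: that the 50GB label is the finished data volume per genome, that the 1000 Genomes Project covers about 1,000 samples, and that a human genome is about 3 gigabases long. The one-in-four uptake comes from the speaker notes. Decimal units are used throughout, and the function names are illustrative only.

```python
GB = 10**9        # bytes in a gigabyte (decimal)
EB = 10**18       # bytes in an exabyte (decimal)
TBASE = 10**12    # bases in a terabase

HUMAN_GENOME_BASES = 3e9  # assumed genome length, not taken from the deck


def genomic_storage_bytes(visits: int, uptake: float, bytes_per_genome: int) -> float:
    """Storage needed if a share of patient visits leads to one stored genome each."""
    return visits * uptake * bytes_per_genome


def raw_bases(samples: int, coverage: float) -> float:
    """Raw sequence output for a number of human samples at a given coverage."""
    return samples * HUMAN_GENOME_BASES * coverage


def implied_coverage(total_bases: float, samples: int) -> float:
    """Coverage implied by a total raw output spread over a number of samples."""
    return total_bases / (samples * HUMAN_GENOME_BASES)


if __name__ == "__main__":
    nhs = genomic_storage_bytes(88_000_000, 0.25, 50 * GB)
    print(f"NHS scenario: {nhs / EB:.2f} EB")

    kg = raw_bases(1_000, 30)
    print(f"1000 Genomes at 30x: {kg / TBASE:.0f} Tbase")

    icgc = implied_coverage(5_000 * TBASE, 50_000)
    print(f"ICGC implied coverage: {icgc:.1f}x")
```

Under these assumptions the NHS scenario comes to about 1.1 EB, consistent with the notes’ “over an exabyte”, and 1,000 samples at 30x coverage give the 90 Tbase quoted on the slide. The ICGC figure of 5,000 Tbase over 50,000 samples would imply coverage of roughly 33x.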