Are you drowning in unstructured data? Can a private-cloud, object-based storage system be the solution? Join our live webinar "5 Reasons Object Storage Can Solve Unstructured Data Problems" and learn how Object Storage can:
1. Reduce Data Storage Costs
2. Increase Data Retention Capabilities
3. Improve Data Protection and Security
4. Provide "Cloud" Flexibility
5. Be easy to implement and operate
Presentation to the ARROW repositories day, Brisbane, 2008, on suggestions for improving the rate of capture of documents in institutional repositories
Digitization Basics for Archives and Special Collections – Part 2: Store and ...WiLS
Catherine Phan, Metadata Librarian, University of Wisconsin Digital Collections Center
Jesse Henderson, Digital Services Librarian, University of Wisconsin Digital Collections Center
Steven Dast, Digital Asset Librarian, University of Wisconsin Digital Collections Center
This is the second part of a two-part, full-day workshop introducing the core elements of creating digital collections of historic photographs, documents and other archival materials. Part 2 focuses on sharing your digitized materials with the world and steps you can take to ensure that they'll remain usable and accessible into the future. We'll define metadata, explain why it's important, and consider approaches to creating descriptive metadata for discovery of historical resources. We'll examine the issue of digital preservation, including practical steps you can take to preserve your digital content with limited resources. And we'll think about digitization as a path to community engagement, including reaching out to your community for content and promoting your digital collections to your users.
Data management for Quantitative Biology -Basics and challenges in biomedical...QBiC_Tue
This lecture was presented on April 23, 2015 as the second lecture within the series "Data management for Quantitative Biology" at the University of Tübingen in Germany.
2013 ICEEFP Fisheries Information Management System (FIMS) at Petascale_Mark ...Christa Woodley
Since 2004 the Juvenile Salmonid Acoustic Telemetry System (JSATS) cabled array and autonomous node systems have been deployed in the Columbia River Basin to provide survival estimates and understand fish passage. Autonomous nodes provide presence/absence while cabled arrays provide 3D fish position estimates. Cabled array deployments consist of over 100 acquisition systems continually collecting data through the juvenile salmonid migration season. Raw data volumes are approaching petabytes. Real-time software processing reduces the raw data by decoding signals from acoustic micro transmitters (AMTs) surgically implanted in juvenile salmonids. Given the distance between and number of systems, cellular modems notify a central monitoring system of potential system issues. Project management receives system alerts in an effort to proactively fix faulty equipment. System downtime and fish detections are coordinated with dam operations data, run-at-large estimates, environmental measurements, and fish condition data. Fish condition helps estimate the run of the river and is collected throughout the season. This data includes photographing each fish used in the study. In 2012, approximately 65,000 photographs were taken. Images are archived and used for reporting to management agencies. We present a fisheries information management system for large studies that can facilitate future spatiotemporal metadata analysis to support management of hydropower systems.
> Utilization of Big Data in different R&D stages (clinical research, drug discovery, genomics, translational medicine)
> Large scale and holistic compound scale activity prediction
> Pooled data from multiple clinical trials in SDTM format
> Implication of Big Data in the early phase of drug development
> Leveraging large amounts of data to advance patient-centric treatment
> Pooling data from multiple sources to tailor treatments
> Understanding the real value of big data
> Case studies from various therapeutic areas
> Data security
Wireless sensor networks, clustering, Energy efficient protocols, Particles S...IJMIT JOURNAL
A wireless sensor network (WSN) is composed of a large number of small nodes with limited functionality. The most important issue in this type of network is its energy constraints. Considerable research has been done in this area, and clustering is one of the most effective solutions. The goal of clustering is to divide the network into sections, each of which has a cluster head (CH); cluster heads are responsible for collecting data from their members, aggregating it, and transmitting it to the base station. In this paper, we introduce a new approach for clustering sensor networks based on the Particle Swarm Optimization (PSO) algorithm using an optimized fitness function, which aims to extend network lifetime. The parameters used in this algorithm are residual energy density, distance from the base station, and intra-cluster distance from the cluster head. Simulation results show that the proposed method is more effective than protocols such as LEACH, CHEF and PSO-MV in terms of network lifetime and energy consumption.
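The abstract names the fitness inputs (residual energy, distance to the base station, intra-cluster distance) but not the exact formula, so the following is only a minimal Python sketch of a weighted-sum fitness that a PSO could minimize when choosing cluster heads; the weights, helper names and data layout are illustrative assumptions, not the paper's method.

```python
import math

def distance(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def cluster_fitness(heads, nodes, base_station, w1=0.4, w2=0.3, w3=0.3):
    """Illustrative weighted-sum fitness for a candidate set of cluster heads.

    Lower is better: penalises low residual energy at the heads, long
    head-to-base-station distances, and large intra-cluster distances.
    `nodes` maps node id -> {"pos": (x, y), "energy": float}.
    The weights w1..w3 are placeholders, not values from the paper.
    """
    # Energy term: prefer heads that still hold a large share of the energy.
    total_energy = sum(n["energy"] for n in nodes.values())
    head_energy = sum(nodes[h]["energy"] for h in heads)
    energy_term = total_energy / head_energy if head_energy > 0 else float("inf")

    # Base-station term: average head-to-sink distance.
    bs_term = sum(distance(nodes[h]["pos"], base_station) for h in heads) / len(heads)

    # Intra-cluster term: each member joins its nearest head.
    intra = 0.0
    for nid, n in nodes.items():
        if nid not in heads:
            intra += min(distance(n["pos"], nodes[h]["pos"]) for h in heads)
    intra_term = intra / max(len(nodes) - len(heads), 1)

    return w1 * energy_term + w2 * bs_term + w3 * intra_term
```

A PSO run would then search over candidate head sets and keep the set with the lowest fitness as that round's cluster heads.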
Scaling Security Workflows in Government AgenciesAvere Systems
For most federal agencies dealing with increased security threats, limiting machine-data collection is not an option. But faced with finite IT budgets, few agencies can continue to absorb the high costs of scaling high-end network attached storage (NAS) or moving to and expanding a block-based storage footprint. During this webcast, you’ll learn about more cost-effective solutions to support large-scale machine-data ingestion and fast data access for security analytics.
You’ll learn about:
- The common challenges organizations face when scaling security workflows
- Why a high-performance cache works to solve these issues
- How to integrate cloud into processing and storage for additional scalability and efficiencies
Object Storage promises many things - unlimited scalability, both in terms of capacity and file count, low-cost but highly redundant capacity, and excellent connectivity to legacy NAS. But despite these promises, object storage has not caught on in the enterprise like it has in the cloud. It seems that, for the enterprise, object storage just isn't a good fit. The problem is that most object storage systems' starting capacity is too large. And while connectivity to legacy NAS systems is available, seamless integration is not. Can object storage be sized so that it is a better fit for the enterprise?
Using the Cloud to Create an All-Flash Data Center – Avere Systems
For years vendors have been trying to drive down the cost of flash so that the all-flash data center can become reality. The problem is that even the rapidly declining price of flash storage can't keep pace with the rapidly declining price of hard disk. As a result, data that does not need to be on flash storage has to be stored on something less expensive. But does that less expensive storage need to be another hard disk array, or could it be stored in the cloud?
Join Storage Switzerland's founder George Crump and Avere Systems CEO Ron Bianchini for this interactive webinar, "Using the Cloud to Create an All-Flash Data Center."
Unstructured data is growing at a staggering rate. It is breaking traditional storage and IT budgets and burying IT professionals under a mountain of operational challenges. Listen as Cloudian and Storage Switzerland discuss, in a panel-style conversation, the seven key reasons why organizations can dramatically lower storage infrastructure costs by deploying a hardware-agnostic object storage solution instead of sticking with legacy NAS.
Coping Strategies for the Death of Unlimited StorageGlobus
Presented at GlobusWorld 2022 by a set of panelists moderated by Bob Flynn from Internet2. Panelists offer their perspectives on migrating between cloud storage providers.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
Most organizations making an investment in NetApp Filers count on the system to store user data and host virtual machine datastores from an environment like VMware. In addition, these organizations want their NetApp systems to do more and be the repository for the next wave of unstructured data: data generated by machines. NetApp systems are bursting at the seams, so these organizations are trying to decide what to do next.
Webinar: Getting Beyond Flash 101 - Flash 102 Selecting the Right Flash ArrayStorage Switzerland
Join Storage Switzerland and Data Direct Networks (DDN) for this on-demand webinar, "Getting Beyond Flash 101 - Flash 102: Selecting the Right Flash Array". We discuss the different types of flash storage and compare them, explain why vendors want to replace your SAN instead of enhancing it, and cover what you can do to not only protect your current storage investments but also prepare a path to the future.
Three Steps to Modern Media Asset Management with Active ArchiveAvere Systems
From the Dec. 3, 2015 webinar
Digital media assets are extremely valuable, providing organizations a collection of tools to reach and engage. Making this growing collection of large files manageable over time can challenge even the most seasoned IT professionals.
In this webinar, we'll look at the workflow in place at large global broadcasters, then discuss the best practices identified while building a modern media asset management system with active archive support. Operations directors, program managers, systems architects, and media technology professionals will learn:
- How the right tools can help bring order to "big data" assets, with methodologies for implementation that save media professionals valuable time
- How private cloud archive reduces cost while improving accessibility and security
- How to create a high performance active archive that masks network latency between media asset management software and cloud archives
This presentation covers a number of best practices for managing research data. The main topics include: file naming and organization conventions, data documentation, and data storage and backups.
Webinar: NAS vs. Object Storage: 10 Reasons Why Object Storage Will WinStorage Switzerland
Join Storage Switzerland's Founder George Crump and Caringo's VP of Products Tony Barbagallo for this on demand webinar. They compare NAS and object storage and provide 10 reasons they think object storage will be the file server of the future for the enterprise.
Storage Switzerland’s founder and lead analyst, George Crump and Cloudian’s Chief Marketing Officer, Paul Turner, describe the benefits of object and cloud storage but also describe how they can work together to solve your data problem once and for all. In addition, they cover specific next steps to begin implementing a hybrid cloud storage solution in your data center.
Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.
Alluxio Community Office Hour
Apr 7, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speaker: Bin Fan
Alluxio (alluxio.io) is an open-source data orchestration system that provides a single namespace federating multiple external distributed storage systems. It is critical for Alluxio to be able to store and serve the metadata of all files and directories from all mounted external storage both at scale and at speed.
This talk shares our design, implementation, and optimization of the Alluxio metadata service (master node) to address these scalability challenges. In particular, we focus on how to apply and combine techniques including tiered metadata storage (based on the off-heap KV store RocksDB), a fine-grained file system inode tree locking scheme, an embedded replicated state machine (based on Raft), and the exploration and performance tuning of RPC frameworks (Thrift vs. gRPC). As a result of combining the above techniques, Alluxio 2.0 is able to store at least 1 billion files with a significantly reduced memory requirement, serving 3,000 workers and 30,000 clients concurrently.
In this Office Hour, we will go over:
- Metadata storage challenges
- How to combine different open source technologies as building blocks
- The design, implementation, and optimization of the Alluxio metadata service
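As a rough illustration of the tiered-metadata idea described above (and not Alluxio's actual API), the sketch below keeps hot inodes in a bounded in-memory LRU cache and spills everything else to a key-value tier; a plain dict stands in for RocksDB, and all class and method names are hypothetical.

```python
from collections import OrderedDict

class TieredInodeStore:
    """Conceptual sketch of tiered metadata storage (not Alluxio's real API).

    Hot inodes live in a bounded in-memory LRU cache; the rest are kept in an
    on-disk key-value store (RocksDB in Alluxio 2.x, a plain dict here), so
    the full namespace does not have to fit on the JVM heap.
    """

    def __init__(self, cache_capacity=1024):
        self.cache = OrderedDict()          # inode_id -> inode dict (hot tier)
        self.kv_store = {}                  # stand-in for the RocksDB tier
        self.cache_capacity = cache_capacity

    def put(self, inode_id, inode):
        self.kv_store[inode_id] = inode     # the durable tier is authoritative
        self._cache_put(inode_id, inode)

    def get(self, inode_id):
        if inode_id in self.cache:
            self.cache.move_to_end(inode_id)    # refresh LRU position
            return self.cache[inode_id]
        inode = self.kv_store.get(inode_id)     # cache miss: read the cold tier
        if inode is not None:
            self._cache_put(inode_id, inode)
        return inode

    def _cache_put(self, inode_id, inode):
        self.cache[inode_id] = inode
        self.cache.move_to_end(inode_id)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)      # evict least recently used
```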
2010 AIRI Petabyte Challenge - View From The Trenches
2. Housekeeping Notes
Some Acknowledgements
Fair warning #1
• I speak fast and travel with a large slide deck
Fair warning #2
• Unrepentant PowerPoint fiddler
• Latest slides (as delivered) will be posted at http://blog.bioteam.net
4. BioTeam Inc.
Independent consulting shop: vendor/technology agnostic
Staffed by:
• Scientists forced to learn High Performance IT to conduct research
Our specialty:
Bridging the gap between Science & IT
5. A brief note on our client base
Very few of our customers are in this room
• With a few cool exceptions, of course
The fact that you are here today speaks volumes
• Chances are:
You have forward-looking research IT roadmaps
You have dedicated research IT staff
You have dedicated storage gurus
You have research datacenter(s)
With a few notable exceptions, many of our customers do not have the level of expertise, experience and resources that an AIRI-affiliated lab would have
This tends to bias what I say and think
13. #1 - Unchecked Enterprise Architects
Scientist: “My work is priceless, I must be able to access it at all times”
Storage Guru: “Hmmm…you want H/A, huh?”
System delivered:
• 40TB Enterprise FC SAN
• Asynchronous replication to remote DR site
• Can’t scale, can’t do NFS easily
• $500K/year in maintenance costs
14. #1 - Unchecked Enterprise Architects
Lessons learned
Corporate storage architects may not fully understand the needs of HPC and research informatics users
End-users may not be precise with terms:
• “Extremely reliable” means “no data loss”, not 99.999% uptime at a cost of millions
When true costs are explained:
• Many research users will trade a small amount of uptime or availability for more capacity or capabilities
15. #2 - Unchecked User Requirements
Scientist: “I do bioinformatics, I am rate limited by the speed of file IO operations. Faster disk means faster science.”
System delivered:
• Budget blown on top tier ‘Cadillac’ system
• Fast everything
Outcome:
• System fills to capacity in 9 months
16. #2 - Unchecked User Requirements
Lessons learned
• End-users demand the world
• Necessary to really talk to them and understand their work, needs and priorities
You will often find
• The people demanding the “fastest” storage don’t have actual metrics to present
• Many groups will happily trade some level of performance in exchange for a huge win in capacity or capability
17. #3 - D.I.Y Cluster/Parallel File systems
Common source of storage unhappiness
Root cause:
• Not enough pre-sales time spent on design and engineering
System as built:
• Not enough metadata controllers
• Poor configuration of key components
End result:
• Poor performance or availability
18. #3 - D.I.Y Cluster/Parallel File systems
Lessons learned:
Software-based parallel or clustered file systems are non-trivial to correctly implement
Essential to involve experts in the initial design phase
• Even if using ‘open source’ version …
Commercial support is essential
• And I say this as an open source zealot …
20. Data Awareness
First principles:
• Understand science changes faster than IT
• Understand the data you will produce
• Understand the data you will keep
• Understand how the data will move
Second principles:
• One research type or many?
• One instrument type or many?
• One lab/core or many?
21. Data You Produce
Important to understand data sizes and types throughout the organization
• 24x7 core facility with known protocols?
• Wide open “discovery research” efforts?
• Mixture of both?
Where it matters:
• Big files or small files?
• File types & access patterns?
• Hundreds, thousands or millions of files?
• Does it compress well?
• Does it deduplicate well?
• Where does the data have to move?
22. Data You Will Keep
Instruments producing terabytes/run are the norm, not the exception
Data triage is real and here to stay
• Triage is the norm, not the exception these days
• I think the days of “unlimited storage” are likely over
What bizarre things are downstream researchers doing with the data?
Must decide what data types are kept
• And for how long …
23. Data You Will Keep
Raw data ⇒ Result data
• Can involve 100x reduction in some cases
Result data ⇒ Downstream derived data
• Often overlooked and trend-wise the fastest growing area
• Researchers have individual preferences for files, formats and meta-data
• Collaborators have their own differences & requirements
• The same data can be sliced and diced in many ways when used by different groups
24. Data Movement
Facts
• Data captured does not stay with the instrument
• Often moving to multiple locations (and offsite)
• Terabyte volumes of data could be involved
• Multi-terabyte data transit across networks is rarely trivial no matter how advanced the IT organization
• Campus network upgrade efforts may or may not extend all the way to the benchtop …
25. Data Movement - Personal Story
One of my favorite ‘09 consulting projects …
• Move 20TB scientific data out of Amazon S3 storage cloud
What we experienced:
• Significant human effort to swap/transport disks
• Wrote custom DB and scripts to verify all files each time they moved
Avg. 22MB/sec download from internet
Avg. 60MB/sec server to portable SATA array
Avg. 11MB/sec portable SATA to portable NAS array
• At 11MB/sec, moving 20TB is a matter of weeks
• Forgot to account for MD5 checksum calculation times
Result:
• Lesson learned: data movement & handling took 5x longer than data acquisition
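A quick back-of-the-envelope check of those rates (a sketch in Python; the 20TB figure and MB/sec rates come from the slide, while the decimal-unit convention is an assumption):

```python
# Rough transfer-time check for the numbers on this slide.
TB = 10**12  # assuming decimal terabytes; binary TiB would add roughly 10%

def transfer_days(total_bytes, mb_per_sec):
    """Days needed to move total_bytes at a sustained MB/sec rate."""
    seconds = total_bytes / (mb_per_sec * 10**6)
    return seconds / 86400

for label, rate in [("internet download", 22), ("server to SATA", 60), ("SATA to NAS", 11)]:
    print(f"20 TB at {rate} MB/sec ({label}): {transfer_days(20 * TB, rate):.1f} days")

# 20 TB at 11 MB/sec works out to roughly 21 days of continuous copying,
# before adding the MD5 checksum passes over the same data.
```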
27. Storage Landscape
Storage is a commodity
Cheap storage is easy
Big storage getting easier every day
Big, cheap & SAFE is much harder …
Traditional backup methods may no longer apply
• Or even be possible …
28. Storage Landscape
Still see extreme price ranges
• Raw cost of 1,000 terabytes (1PB): $125,000 to $5,000,000 USD
Poor product choices exist in all price ranges
29. Poor Choice Examples
On the low end:
• Use of RAID 5 (unacceptable since 2008)
• Too many hardware shortcuts result in unacceptable reliability trade-offs
30. Poor Choice Examples
And with high end products:
• Feature bias towards corporate computing, not research computing - pay for many things you won’t be using
• Unacceptable hidden limitations (size or speed)
• Personal example:
$800,000 70TB (raw) Enterprise NAS Product
… can’t create a NFS volume larger than 10TB
… can’t dedupe volumes larger than 3-4 TB
31. One slide on RAID 5
I was a RAID 5 bigot for many years
• Perfect for life science due to our heavy read bias
• Small write penalty for parity operation no big deal
RAID 5 is no longer acceptable
• Mostly due to drive sizes (1TB+), array sizes and rebuild time
• In the time it takes to rebuild an array after a disk failure there is a non-trivial chance that a 2nd failure will occur, resulting in total data loss
Today:
• Only consider products that offer RAID 6 or other “double parity” protection methods
• Even RAID 6 is a stopgap measure …
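To make the rebuild-window risk concrete, here is a rough sketch of the standard unrecoverable-read-error (URE) estimate; the URE rate, drive size and drive count are illustrative assumptions, not figures from the deck.

```python
# Rough chance of hitting an unrecoverable read error (URE) while rebuilding
# a RAID 5 array, which aborts the rebuild and loses the array.
# URE rate, drive size and drive count are assumptions for illustration only.

URE_PER_BIT = 1e-14   # common consumer-drive spec: ~1 error per 1e14 bits read
DRIVE_TB = 1.0        # 1TB drives, as mentioned on the slide
DATA_DRIVES = 7       # e.g. an 8-drive RAID 5: 7 drives must be read in full

bits_to_read = DATA_DRIVES * DRIVE_TB * 1e12 * 8
p_clean_rebuild = (1 - URE_PER_BIT) ** bits_to_read
print(f"Chance of at least one URE during rebuild: {1 - p_clean_rebuild:.0%}")
# With these assumptions the rebuild has a bit over a 40% chance of tripping
# over a URE, before even counting the risk of a whole second drive failing.
```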
35. Single Namespace Matters
Non-scalable storage islands add complexity
Also add “data drift”
Example:
• Volume “Caspian” hosted on server “Odin”
• “Odin” replaced by “Thor”
• “Caspian” migrated to “Asgard”
• Relocated to “/massive/”
Resulted in file paths that look like this:
/massive/Asgard/Caspian/blastdb
/massive/Asgard/old_stuff/Caspian/blastdb
/massive/Asgard/can-be-deleted/do-not-delete…
36. User Expectation Management
End users still have no clue about the true costs of keeping data accessible & available
“I can get a terabyte from Costco for $220!” (Aug 08)
“I can get a terabyte from Costco for $160!” (Oct 08)
“I can get a terabyte from Costco for $124!” (April 09)
“I can get a terabyte from NewEgg for $84!” (Feb 10)
IT needs to be involved in setting expectations and educating on the true cost of keeping data online & accessible
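A hypothetical walk from the bare-drive price to a fully-burdened figure makes the gap plain; every number below is an assumption chosen only for illustration, not a real quote.

```python
# Illustrative only: every figure below is an assumption.
# The point is that the bare-drive price is a small slice of the true cost
# of keeping a terabyte online, protected and administered.

raw_drive_per_tb = 84.0           # the "NewEgg terabyte" from the slide (Feb 10)
raid_overhead = 1.25              # RAID 6 parity + hot spares (assumed)
replica_copies = 2                # production + one backup/DR copy (assumed)
power_cooling_per_tb_year = 30.0  # power, cooling, rack space (assumed)
admin_per_tb_year = 50.0          # share of storage-admin time (assumed)
service_years = 3                 # useful life of the hardware (assumed)

hardware = raw_drive_per_tb * raid_overhead * replica_copies
recurring = (power_cooling_per_tb_year + admin_per_tb_year) * replica_copies
cost_per_tb_year = hardware / service_years + recurring
print(f"~${cost_per_tb_year:.0f} per usable TB per year vs ${raw_drive_per_tb:.0f} for the bare drive")
```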
37. Storage Trends
In 2008 …
• First 100TB single-namespace project
• First Petabyte+ storage project
• 4x increase in “technical storage audit” work
• First time witnessing 10+TB catastrophic data loss
• First time witnessing job dismissals due to data loss
• Data Triage discussions are spreading well beyond cost-sensitive industry organizations
38. Storage Trends
In 2009 …
• More of the same
• 100TB not a big deal any more
• Even smaller organizations are talking about (or deploying) petascale storage
Witnessed spectacular failures of Tier 1 storage vendors:
• $6M 1.1PB system currently imploding under a faulty design
• $800K NAS product that can’t supply a volume larger than 10TB
• Even less with dedupe enabled
39. Going into 2010 …
Peta-scale is no longer scary
A few years ago 1PB+ was somewhat risky and involved significant engineering, experimentation and crossed fingers
• Especially single-namespace
Today 1PB is not a big deal
• Many vendors, proven architectures
• Now it’s a capital expenditure, not a risky technology leap
40. Going into 2010 …
Biggest Trend
• Significant rise in storage requirements for post-instrument downstream experiments and mashups
• The decrease in instrument-generated data flows may be entirely offset by increased consumption from users working downstream on many different efforts & workflows
… this type of usage is harder to model & predict
42. Why I drank the kool-aid
I am known to be rude and cynical when talking about over-hyped “trends” and lame co-option attempts by marketing folk
• Wide-area Grid computing is an example from dot com days
• “Private Clouds” - another example of marketing fluff masking nothing of actual useful value in 2010
I am also a vocal cheerleader for things that help me solve real customer-facing problems
• Cloud storage might actually do this …
43. Cloud Storage: Use Case 1
Amazon AWS “downloader pays” model is extremely compelling
Potentially a solution for organizations required to make large datasets available to collaborators or the public
• Costs of local hosting, management & public bandwidth can be a significant resource drain
• Cloud-resident data sets where the downloader offsets or shares in the distribution cost feel like a good match
44. Cloud Storage: Use Case 2
Archive, deep or cold storage pool
Imagine this scenario:
• Your 1PB storage resource can’t be backed up via traditional methods
• Replication is the answer
• However just to be safe you decide you need:
Production system local to campus
Backup copy at Metro-distance colo
Last resort copy at WAN-distance colo
• Now you have 3PB to manage across three different facilities
Non-trivial human, facility, financial and operational burden costs …
45. Cloud Storage: Use Case 2
James Hamilton has blogged some interesting figures
• Site: http://perspectives.mvdirona.com
• Cold storage geographically replicated 4x can be achieved at scale for $0.80/GB/year (and falling quickly)
• With an honest accounting of all your facility, operational and human costs, can you really approach this figure?
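For scale, a hedged comparison against the three-copy scenario from the previous slide; the $0.80/GB/year figure is the one quoted above, while the self-managed cost per GB/year is purely an assumed placeholder.

```python
# How the $0.80/GB/year figure quoted above stacks up against the "3PB across
# three facilities" scenario from the previous slide. The in-house cost per
# GB/year is an assumption for comparison, not a measured number.

PB_IN_GB = 10**6
primary_pb = 1                 # one production petabyte
total_copies = 3               # production + metro colo + WAN colo

cloud_per_gb_year = 0.80       # Hamilton's geo-replicated cold-storage figure
inhouse_per_gb_year = 1.50     # assumed all-in cost of running your own copy

cloud_cost = primary_pb * PB_IN_GB * cloud_per_gb_year             # replication already included
inhouse_cost = total_copies * primary_pb * PB_IN_GB * inhouse_per_gb_year
print(f"cloud cold tier:  ${cloud_cost:,.0f}/year")
print(f"self-managed 3x:  ${inhouse_cost:,.0f}/year")
```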
46. Cloud Storage: Use Case 2
Google, Amazon, Microsoft, etc. all operate at efficiency scales that few can match
• Cutting-edge containerized data-centers with incredible PUE values
• Fast private national and trans-national optical networks
• Rumors of “1 human per XX,000 servers” automation efficiency, etc.
• Dozens or hundreds of datacenters and exabytes of spinning platters
• My hypothesis: Not a single person in this room can come anywhere close to the IT operating efficiencies that these internet-scale companies operate at every day
Someone is going to eventually make a compelling service/product offering that leverages this …
47. Cloud Storage: Use Case 2
Cheap storage is easy, we all can do this
Geographically replicated, efficiently managed cheap storage is not very easy (or not cheap)
When the price is right …
I see cloud storage as being a useful archive or deep storage tier
• Probably a 1-way transit
• Data only comes “back” if a disaster occurs
• Data mining & re-analysis done in-situ with local ‘cloud’ server resources if needed
48. Final Thoughts
• Yes the “data deluge” problem is real
• Many of us have peta-scale storage issues today
• “Data Deluge” & “Tsunami” are apt terms
• But:
• The problem does not feel as scary as it once did
• Many groups have successfully deployed diverse types of peta-scale storage systems - best practice info is becoming available
• Chemistry, reagent cost, data movement & human factors are natural bottlenecks
• Data Triage is an accepted practice, no longer heresy
• Data-reduction starting to happen within instruments
• Customers starting to trust instrument vendor software more
• We see large & small labs dealing successfully with these issues
• Many ways to tackle IT requirements