Data Domain Architecture
White Paper

Dedupe-Centric Storage

Hugo Patterson, Chief Architect, Data Domain

Abstract

Deduplication is enabling a new generation of data storage systems. While traditional storage vendors have tried to dismiss deduplication as a feature, it is proving to be a true storage fundamental. Faced with the choice of completely rebuilding their storage system, or simply bolting on a copy-on-write snapshot, most vendors choose the easy path that will get a check-box snapshot feature to market soonest.

There are a vast number of requirements that make it very challenging to build a storage system that actually delivers the full potential of deduplication. Deduplication is hard to implement in a way that runs fast with low overhead across a full range of protocols and application environments. When deduplication is done well, people can create, share, access, protect, and manage their data in new and easier ways, and IT administrators aren't struggling to keep up.

Dedupe-centric storage is engineered from the start to address these challenges. This paper gives historical perspective and a technological context to this revolution in storage management.
Deduplication: The Post-Snapshot Revolution in Storage

The revolution to come with deduplication will eclipse the snapshot revolution of the early 1990s. Since at least the 1980s, systems had been saving versions of individual files or snapshots of whole disk volumes or file systems. Technologies existed for creating snapshots reasonably efficiently. But, in most production systems, snapshots were expensive, slow, or limited in number and so had limited use and deployment.

About 15 years ago, Network Appliance (NetApp) leveraged the existing technologies and added some of their own innovations to build a file system that could create and present multiple snapshots of the entire system with little performance impact and without making a complete new copy of all the data for every snapshot. It did not use deduplication to find and eliminate redundant data, but at least it didn't duplicate unchanged data just to create a new logical view of the same data in a new snapshot. Even this was a significant advance. By not requiring a complete copy for each snapshot, it became much cheaper to store snapshots. And, by not moving the old version of data preserved in a snapshot just to store a new version of data, there was no performance impact to creating snapshots. The efficiency of their approach meant there was no reason not to create snapshots and keep a bunch around. Users responded by routinely creating several snapshots a day and often preserving some snapshots for several days. The relative abundance of such snapshots revolutionized data protection. Users could browse snapshots and restore earlier versions of files to recover from mistakes immediately, all by themselves. Instead of being shut down to run backups, databases could be quiesced briefly to create a consistent snapshot and then freed to run again while backups proceeded in the background. Such consistent snapshots could also serve as starting points for database recovery in the event of a database corruption, thereby avoiding the need to go to tape for recovery.
Building a Better Snapshot

Despite the many benefits and commercial success of space- and performance-efficient snapshots, it was over a decade before the first competitors built comparable technology. Most competitors' snapshots still do not compare. Many added a snapshot feature that delivers space efficiency, but few matched the simplicity, performance, scalability, and ultimately the utility of NetApp's design. Why?

The answer is that creating and managing the multiple virtual views of a file system captured in snapshots is challenging. By far the easiest and most widely adopted approach is to read the old version of data and copy it to a safe place before writing new data. This is known as copy-on-write and works especially well for block storage systems. Unfortunately, the copy operation imposes a severe performance penalty because of the extra read and write operations it requires. But doing something more sophisticated and writing new data to a new location without the extra read and write requires all new kinds of data structures and completely changes the model for how a storage system works. Further, block-level snapshots by themselves do not provide access to the file versions captured in the snapshots; you need a file system that integrates those versions into the name space.

Faced with the choice of completely rebuilding their storage system, or simply bolting on a copy-on-write snapshot, most vendors choose the easy path that will get a check-box snapshot feature to market soonest. Very few are willing to invest the years and dollars to start over for what they view as just a single feature. By not investing, their snapshots were doomed to be second rate. They don't cross the hurdle that makes snapshots easy, cheap, and most useful. They don't deliver on the snapshot revolution.

Deduplication as a Point Solution

On its surface, deduplication is a simple concept: find and eliminate redundant copies of data. But computing systems are complex, supporting many different kinds of applications. And data sets are vast, putting a premium on scalability. Ideally, a deduplication technology would be able to find and eliminate redundant data no matter which application writes it, even if different applications have written the same data. Deduplication should be effective and scalable so that it can find a small segment of matching data no matter where it is stored in a large system. Deduplication also needs to be efficient, or the memory and computing overhead of finding duplicates in a large system could negate any benefit. Finally, deduplication needs to be simple and automatic, or the management burden of scheduling, tuning, or generally managing the deduplication process could again negate any benefits of the data reduction. All of these requirements make it very challenging to build a storage system that actually delivers the full potential of deduplication.

Early efforts at deduplication did not attempt a comprehensive solution. One example is email systems that store only one copy of an attachment or a message even when it is sent to many recipients. Such email blasts are a common way for people to share their work, and early email implementations created a separate copy of the email and attachment for every recipient. The first innovation is to store just a single copy on the server and maintain multiple references to it. More sophisticated systems might detect when the same attachment is forwarded to further recipients and create additional references to the same data instead of storing a new copy. Nevertheless, when users download attachments to their desktops, the corporate environment still ends up with many copies to store, manage, and protect.
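To make the single-instancing idea concrete, here is a minimal sketch of an attachment store that keeps one copy per unique content and counts references. The class and method names are illustrative inventions for this sketch, not any real mail server's API:

```python
import hashlib

class AttachmentStore:
    """Toy single-instance store: one copy per unique content, plus refcounts."""

    def __init__(self):
        self.blobs = {}      # content hash -> attachment bytes, stored once
        self.refcounts = {}  # content hash -> number of mailbox references

    def deliver(self, attachment: bytes) -> str:
        """File an attachment for one recipient; duplicates add only a reference."""
        digest = hashlib.sha256(attachment).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = attachment           # first copy: store the bytes
        self.refcounts[digest] = self.refcounts.get(digest, 0) + 1
        return digest                                 # mailboxes keep only this handle

store = AttachmentStore()
report = b"Q3 results..." * 1000
handles = [store.deliver(report) for _ in range(50)]  # one blast, 50 recipients
assert len(store.blobs) == 1                          # one physical copy
assert store.refcounts[handles[0]] == 50              # fifty references to it
```

Even here, the sharing is confined to the server: as the paper notes, once recipients download the attachment, copies proliferate again.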
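Similarly, the cost of the copy-on-write approach described under "Building a Better Snapshot" is easy to see in a toy block volume. This is a sketch of the general technique, not any particular vendor's snapshot design:

```python
class CowVolume:
    """Toy copy-on-write volume: overwriting a snapshotted block costs extra I/O."""

    def __init__(self, nblocks: int):
        self.blocks = [b"\x00"] * nblocks   # live block contents
        self.snapshot = {}                  # block number -> preserved old version
        self.io_ops = 0                     # counts simulated disk operations

    def take_snapshot(self):
        self.snapshot = {}                  # start preserving pre-snapshot data

    def write(self, blockno: int, data: bytes):
        if blockno not in self.snapshot:
            old = self.blocks[blockno]      # extra READ of the old version
            self.io_ops += 1
            self.snapshot[blockno] = old    # extra WRITE to the snapshot area
            self.io_ops += 1
        self.blocks[blockno] = data         # the write the application asked for
        self.io_ops += 1

vol = CowVolume(1024)
vol.take_snapshot()
for i in range(100):
    vol.write(i, b"new data")
print(vol.io_ops)  # 300: every first overwrite costs three I/Os instead of one
```

Writing new data to a new location instead (redirect-on-write) avoids the extra read and write, but, as the paper argues, that demands entirely different data structures and a file system that can surface the preserved versions.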
At the end of the day, an email system that eliminates duplicate email attachments is a point solution. It helps that one application, but it doesn't keep many additional duplicates, even of the very same attachments, from proliferating through the environment.

The power of an efficient snapshot mechanism in a storage system is its generality. Databases could all build their own snapshot mechanism, but when the storage provides them they don't have to. They, and any other application, can benefit from snapshots for efficient, and independent, data protection. NetApp was dedicated to building efficient snapshots so all the applications didn't have to.

Dedupe as a Storage Fundamental

Data Domain is dedicated to building deduplication storage systems with a deduplication engine powerful enough to become a platform that general applications can leverage. That engine can find duplicates in data written by many different applications at every stage of the lifecycle of the file, can scale to store many hundreds of terabytes of data and find duplicates wherever they exist, can reduce data by a factor of 20 or more, can do so at high speed without huge memory or disk resources, and does so automatically while taking snapshots, and so requires a minimum of administrator attention.

Such an engine is only possible with an unconventional system architecture that breaks with traditional thinking. Here are some important features needed to build such an engine.

Variable Length Duplicates

Not all data changes are exactly 4KB in size and 4KB aligned. Sometimes people just replace one word with a longer word. Such a simple small replacement shifts all the rest of the data by some small amount. To a deduplication engine built on fixed-size blocks, the entire rest of the file would seem to be new unique data even though it really contains lots of duplicates. Further, that file may exist elsewhere in the system, say in an email folder, at a random offset in a larger file. Again, to an engine organized around fixed blocks, the data would all seem to be new. Conventional storage systems, whether NAS or SAN, store fixed-size blocks. Such a system, attempting to bolt on deduplication as an afterthought, will only be able to look for identical fixed-size blocks. Clearly, such an approach will never be as effective, comprehensive, and general purpose as a system that can handle small replacements and recognize and eliminate duplicate data no matter where in a file it appears. (A sketch of variable-length segmentation appears after this feature list.)

Local Compression

Deduplication across all the data stored in a system is necessary, but it should be complemented with local compression, which typically reduces data size by another factor of 2. This may not be as significant a factor as deduplication itself, but no system that takes data reduction seriously can afford to ignore it.

Format Agnostic

Data comes in many formats generated by many different applications. But embedded in those different formats is often the same duplicate data. A document may appear as an individual file generated by a word processor, or in an email saved in a folder by an email reader, or in the database of the email server, or embedded in a backup image, or squirreled away by an email archive application. A deduplication system that relies on parsing the formats generated by all these different applications can never be a general-purpose storage platform. There are too many such formats, and they change too quickly for a storage vendor to support them all. Even if they could, such an approach would end up handcuffing application writers trying to innovate on top of the platform. Until their new format is supported, they'd gain no benefit from the platform. They would be better off sticking with the same old formats and not creating anything new. Storage platforms should unleash creativity, not squelch it. Thus, the deduplication engine must be data agnostic and find and eliminate duplicates in data no matter how it is packaged and stored to the system.

Multi-Protocol

There are many standard protocols in use in storage systems today, from NFS and CIFS to blocks and VTL. For maximum flexibility, storage systems should support all these protocols, since different protocols are needed for different applications at the same time. User home directories may be in NAS. The Exchange server may need to run on blocks. And backups may prefer VTL. Over the course of its lifecycle, the same data may be stored with all of these protocols. A presentation that starts in a home directory may be emailed to a colleague and stored in blocks by the email server, then archived in NAS by an email archive application, and backed up from the home directory, the email server, and the archive application to VTL. Deduplication should be able to find and eliminate redundant data no matter how it is stored.
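Here is the promised sketch of variable-length segmentation: content-defined chunking with a rolling hash, plus a local-compression step applied to each unique segment. Every parameter (base, window, mask, average segment size) is an arbitrary choice for illustration, not Data Domain's algorithm, and the random test data is incompressible, so the zlib call only shows where compression sits in the pipeline:

```python
import hashlib
import random
import zlib

B = 257                      # rolling-hash base (illustrative)
WINDOW = 48                  # sliding-window length in bytes (illustrative)
MASK = 0x1FFF                # cut when low 13 bits are set: ~8 KiB average segment
MOD = 1 << 32
B_POW = pow(B, WINDOW, MOD)  # B**WINDOW, used to expel the byte leaving the window

def segments(data: bytes):
    """Yield variable-length segments cut at content-defined boundaries.

    Boundaries depend only on the last WINDOW bytes, so inserting a few bytes
    disturbs at most the segment containing the edit; later cuts realign.
    """
    start, h = 0, 0
    for i in range(len(data)):
        h = (h * B + data[i]) % MOD
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * B_POW) % MOD   # slide the window forward
        if i + 1 - start >= WINDOW and (h & MASK) == MASK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

class DedupeStore:
    """Store each unique segment once, with local compression on top."""

    def __init__(self):
        self.index = {}  # fingerprint -> compressed unique segment

    def write(self, data: bytes) -> list:
        recipe = []      # ordered fingerprints needed to rebuild this file
        for seg in segments(data):
            fp = hashlib.sha256(seg).digest()
            if fp not in self.index:
                self.index[fp] = zlib.compress(seg)  # local compression step
            recipe.append(fp)
        return recipe

random.seed(0)
doc = random.randbytes(200_000)
store = DedupeStore()
store.write(doc)
before = len(store.index)
store.write(doc[:100_000] + b"EDIT" + doc[100_000:])   # 4-byte insertion mid-file
print(len(store.index) - before)  # typically 1: only the edited segment is new
```

A fixed-block engine presented with the same 4-byte insertion would see every block after offset 100,000 as new; the content-defined cuts resynchronize within a single segment.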
CPU-Centric vs. Disk-Intensive Algorithm

Over the last two decades, CPU performance has increased 2,000,000 times [1]. In that time, disk performance has only increased 11 times [1]. Today, CPU performance is taking another leap with every doubling of the number of cores in a chip. Clearly, algorithms developed today for deduplication should leverage the growth in CPU performance instead of being tied to disk performance. Some systems rely on a disk access to find every piece of duplicate data. In some systems, the disk access is to look up a segment fingerprint. Other systems can't find duplicate data except by reading the old data to compare it to the new. In either case, the rate at which duplicate data can be found and eliminated is bounded by the speed of these disk accesses. To go faster, such systems need to add more disks. But the whole idea of deduplication is to reduce storage, not grow it. The Data Domain SISL™ (Stream-Informed Segment Layout) scaling architecture does not have to rely on reading lots of data from disk to find duplicates; it organizes the segment fingerprints on disk in such a way that only a small number of accesses are needed to find thousands of duplicates. With SISL, back-end disks only need to deliver a few megabytes of data to deduplicate hundreds of megabytes of incoming data.
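The difference between a disk access per fingerprint and locality-aware lookup can be sketched in a few lines. The "locale" grouping below is a hypothetical stand-in for fingerprints laid out together on disk; it illustrates the idea the SISL description points at and is not Data Domain's implementation:

```python
class FingerprintIndex:
    """Toy index: one disk read fetches a whole locale of fingerprints."""

    def __init__(self, on_disk_index: dict, locales: dict):
        self.on_disk_index = on_disk_index  # fingerprint -> locale id
        self.locales = locales              # locale id -> fingerprints written together
        self.cache = set()                  # fingerprints currently in RAM
        self.disk_reads = 0

    def is_duplicate(self, fp) -> bool:
        if fp in self.cache:                # fast path: no disk access at all
            return True
        # (In a disk-bound design, this lookup alone would be one seek per segment.)
        locale_id = self.on_disk_index.get(fp)
        if locale_id is None:
            return False                    # genuinely new segment
        self.disk_reads += 1                # one read pulls in the whole locale
        self.cache |= self.locales[locale_id]
        return True

# A backup stream tends to re-present segments in the order they were written:
locales = {0: {f"fp{i}" for i in range(1000)}}
index = FingerprintIndex({f"fp{i}": 0 for i in range(1000)}, locales)
hits = sum(index.is_duplicate(f"fp{i}") for i in range(1000))
print(hits, index.disk_reads)  # 1000 duplicates found with a single disk read
```

Because consecutive segments of a backup stream were usually written together before, one index read amortizes across thousands of lookups, decoupling deduplication throughput from disk speed.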
Deduplicated Replication

Data is protected from disaster only when a copy of it is safely at a remote location. Replication has long been used for high-value data, but without deduplication, replication is too expensive for the other 90% of the data. Deduplication should happen immediately inline and its benefits applied to replication in real time, so the lag till data is safely off site is as small as possible. Only systems designed for deduplication can run fast enough to deduplicate and replicate data right away. Systems which have bolted on deduplication as merely a feature at best impose unnecessary delays in replication and at worst don't deliver the benefits of deduplicated replication at all.
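What deduplicated replication buys can be sketched with a toy protocol in which the destination is modeled as a dict of fingerprints. Real systems exchange fingerprints over a network link; none of this is Data Domain's actual protocol:

```python
import hashlib

def replicate(segments: list, destination: dict) -> int:
    """Send a stream of segments to `destination`; return bytes put on the wire."""
    bytes_sent = 0
    for seg in segments:
        fp = hashlib.sha256(seg).digest()
        bytes_sent += len(fp)           # the fingerprint always crosses the wire
        if fp not in destination:       # destination already has it? skip the data
            destination[fp] = seg       # only unique segments are transferred
            bytes_sent += len(seg)
    return bytes_sent

remote = {}
backup = [hashlib.sha256(str(i).encode()).digest() * 256 for i in range(1000)]  # ~8 MB
print(replicate(backup, remote))  # first run: all segments unique, ~8.2 MB sent
print(replicate(backup, remote))  # repeat run: fingerprints only, ~32 KB sent
```

Because only the post-dedupe bytes travel, replicating that other 90% of the data over an existing network becomes affordable, and inline deduplication lets the transfer begin as data arrives rather than after a post-process pass.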
Deduplicated Snapshots

Snapshots are very helpful for capturing different versions of data, and deduplication can store all those versions much more compactly. Both are fundamental to simplifying data management. Yet many systems that are cobbled together as a set of features either can't create snapshots efficiently, or they can't deduplicate snapshots, so users need to be careful when they create snapshots so as not to lock duplicates in. Such restrictions and limitations complicate data management, not simplify it.

Conclusion

Deduplication has leverage across the storage infrastructure for reducing data, improving data protection and, in general, simplifying data management. But, deduplication is hard to implement in a way that runs fast with low overhead across a full range of protocols and application environments. Storage system vendors who treat deduplication merely as a feature will check off a box on a feature list, but are likely to fall short of delivering the benefits deduplication promises.

www.datadomain.com

[1] Seagate Technology Paper, Economies of Capacity and Speed, May 2004