This White Paper provides an introduction to the EMC Isilon scale-out data lake as the key enabler to store, manage, and protect unstructured data for traditional and emerging workloads.
White Paper: EMC Isilon OneFS Operating System EMC
This white paper provides an introduction to the EMC Isilon OneFS operating system, the foundation of the Isilon scale-out storage platform. The paper includes an overview of the architecture of OneFS and describes the benefits of a scale-out storage platform.
White Paper: EMC Isilon OneFS — A Technical Overview EMC
This white paper provides an in-depth look at the major components of the Isilon OneFS operating and file system and details the hardware, software, distributed architecture, and various data protection mechanisms.
White Paper: Best Practices for Data Replication with EMC Isilon SyncIQ EMC
This White Paper provides a detailed overview of the key features and benefits of EMC Isilon SynclQ software and describes how SyncIQ enables enterprises to flexibly manage and automate data replication between two Isilon clusters. This paper also describes best practices and use cases to maximize the benefits of cluster-to-cluster replication.
The worlds of Big Data and Enterprise IT are converging, creating new challenges and opportunities for businesses across all industry segments. Join us for this recorded webcast and hear from EMC Isilon’s John Mallory, CTO Americas West, speak about how EMC Isilon scale-out storage solutions can help you manage data more effectively and accelerate your business transformation.
White Paper: EMC Isilon OneFS Operating System EMC
This white paper provides an introduction to the EMC Isilon OneFS operating system, the foundation of the Isilon scale-out storage platform. The paper includes an overview of the architecture of OneFS and describes the benefits of a scale-out storage platform.
White Paper: EMC Isilon OneFS — A Technical Overview EMC
This white paper provides an in-depth look at the major components of the Isilon OneFS operating and file system and details the hardware, software, distributed architecture, and various data protection mechanisms.
White Paper: Best Practices for Data Replication with EMC Isilon SyncIQ EMC
This White Paper provides a detailed overview of the key features and benefits of EMC Isilon SynclQ software and describes how SyncIQ enables enterprises to flexibly manage and automate data replication between two Isilon clusters. This paper also describes best practices and use cases to maximize the benefits of cluster-to-cluster replication.
The worlds of Big Data and Enterprise IT are converging, creating new challenges and opportunities for businesses across all industry segments. Join us for this recorded webcast and hear from EMC Isilon’s John Mallory, CTO Americas West, speak about how EMC Isilon scale-out storage solutions can help you manage data more effectively and accelerate your business transformation.
The session gives an introduction to Big Data.
Starts by giving a generally accepted definition of the term Big Data. Then explores why Big Data is important in the current business scenario . The topic ends with enumeration of the technologies used to analyze Big Data like Map Reduce, NoSql etc
5-7 key questions (non-generic) which would be covered in the webinar
(i) What is Big Data
(ii) Why is Big Data important in the current business scenario
(iii) How can an organization effectively use Big Data
(iv) What are the important technologies used to analyze Big Data?
(v) What are MapReduce/Hadoop/HDFS/NoSQL technologies?
EMC collaborates with colleges and universities worldwide to help prepare students for successful careers in a transforming IT industry. The Academic Alliance program offers unique ‘open’ curriculum-based education on technology topics such as cloud computing, big data analytics, and information storage and management. All courseware and faculty training are offered at no cost to qualifying higher education institutions.
The courses focus on technology concepts and principles applicable to any vendor environment, enabling students to develop highly marketable knowledge and skills required in today’s evolving IT industry.
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
CloudBoost is a cloud-enabling solution from EMC
Facilitates secure, automatic, efficient data transfer to private and public clouds for Long-Term Retention (LTR) of backups. Seamlessly extends existing data protection solutions to elastic, resilient, scale-out cloud storage
EMC Isilon Best Practices for Hadoop Data StorageEMC
This paper describes the best practices for setting up and managing the HDFS service on an EMC Isilon cluster to optimize data storage for Hadoop analytics. This paper covers OneFS 7.2 or later.
MT32 How relational (SQL) and unstructured data (Hadoop) learned to get alongDell EMC World
Big data can drive insight for all companies, large and small. With diverse data sources, healthcare companies can merge patient records with diagnostics, education organizations can better measure and monitor student performance from a variety of inputs, and in the public sector, disparate data can be used to improve operations.
In this session, we will take a whirlwind tour of industry examples with a variety of solutions from SQL to APS, as well as Hadoop, Microsoft Cortana and Power BI. And learn firsthand how North Texas Tollway Authority leverages big data to improve the customer experience on the roadways.
The session gives an introduction to Big Data.
Starts by giving a generally accepted definition of the term Big Data. Then explores why Big Data is important in the current business scenario . The topic ends with enumeration of the technologies used to analyze Big Data like Map Reduce, NoSql etc
5-7 key questions (non-generic) which would be covered in the webinar
(i) What is Big Data
(ii) Why is Big Data important in the current business scenario
(iii) How can an organization effectively use Big Data
(iv) What are the important technologies used to analyze Big Data?
(v) What are MapReduce/Hadoop/HDFS/NoSQL technologies?
EMC collaborates with colleges and universities worldwide to help prepare students for successful careers in a transforming IT industry. The Academic Alliance program offers unique ‘open’ curriculum-based education on technology topics such as cloud computing, big data analytics, and information storage and management. All courseware and faculty training are offered at no cost to qualifying higher education institutions.
The courses focus on technology concepts and principles applicable to any vendor environment, enabling students to develop highly marketable knowledge and skills required in today’s evolving IT industry.
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
CloudBoost is a cloud-enabling solution from EMC
Facilitates secure, automatic, efficient data transfer to private and public clouds for Long-Term Retention (LTR) of backups. Seamlessly extends existing data protection solutions to elastic, resilient, scale-out cloud storage
EMC Isilon Best Practices for Hadoop Data StorageEMC
This paper describes the best practices for setting up and managing the HDFS service on an EMC Isilon cluster to optimize data storage for Hadoop analytics. This paper covers OneFS 7.2 or later.
MT32 How relational (SQL) and unstructured data (Hadoop) learned to get alongDell EMC World
Big data can drive insight for all companies, large and small. With diverse data sources, healthcare companies can merge patient records with diagnostics, education organizations can better measure and monitor student performance from a variety of inputs, and in the public sector, disparate data can be used to improve operations.
In this session, we will take a whirlwind tour of industry examples with a variety of solutions from SQL to APS, as well as Hadoop, Microsoft Cortana and Power BI. And learn firsthand how North Texas Tollway Authority leverages big data to improve the customer experience on the roadways.
In this session, we will describe the key elements of a Dell EMC Isilon Data Lake and its key advantages including reduced IT costs, simplified management, increased operational flexibility and in-place data analytics. Dell EMC products to be featured include Isilon, ECS and Virtustream.
In this session, you will learn about the Dell EMC Isilon Data Lake and its advantages including:
• Data consolidation and increased efficiency to lower capital costs
• Streamlined management to reduce operating costs
• Improved operational flexibility and scalability to meet growing storage requirements
• Simple integration with a choice of public or private cloud storage providers
• Powerful in-place data analytics that accelerate time to insight while eliminating the need for a separate analytics storage infrastructure.
You will also hear how this solution can be easily extended to include data from remote and branch office locations with an efficient software defined storage solution.
MT46 Virtualization Integration with UnityDell EMC World
This session focuses on how virtualized application environments using Dell EMC Unity storage solutions are more agile, more efficient in their use of IT resources and lower costs. With over 90+ integration points with the major virtualization stacks (VMware, Microsoft and OpenStack) Unity is the ideal storage solution for your use. Our discussion will include how storage services are represented and policy based storage provisioning methods used.
MT47 Modernize infrastructure for a modern data centerDell EMC World
Today's businesses need speed, efficiency and agility to deliver services back to their stakeholders, all at an affordable price. In the Modern Data Center, Flash, along with Scale-out, software-defined solutions, help to automate a modern infrastructure, the foundation of the modern data center. This session will show you how Dell EMC's industry leading storage portfolio can transform your company's infrastructure and drive your success. In addition, learn how to protect your modern data center with Dell EMC’s comprehensive data protection portfolio.
Follow us at @DellEMCStorage
Learn more about Dell EMC All-Flash Solutions at DellEMC.com/All-flash.
MT155 Analytics and Cloud Native Apps – Your Business Game ChangerDell EMC World
Data analytics provides the opportunity for organizations to digitally transform their business and outperform their rivals. However, many organizations struggle with deriving value from data insights and operationalizing them back into the business. Pairing data analytics with cloud native apps development offers the potential to take the customer experience and your business to new levels. Cloud native apps are built on the concept of a continuous delivery of micro-services, allowing updates to be made quickly and often, for constant refinement of the customer experience. In this session, we will discuss how we can help you radically transform your organization into an insights-driven business.
Everything in IT is accelerating exponentially. Moore’s Law continues to hold true, as technology capabilities advance 10X every 5 years. Fast forward 15 years from today and you can expect to see it advance another 1000X. The implication will create a dramatically different era of IT. The Internet-of-Everything is quickly leading us down the path to IT-enabled businesses and economies.
There’s another profound shift happening: IT will move from supporting the business, to becoming the business.
For IT this presents a dual challenge: accelerate digital transformation to support the requirements of new cloud-native applications, while supporting the traditional applications that run today’s business. IT must be an expert and thought leader in both distinct architectural and operational paradigms.
To see the 3 tenets of the clearest path forward to transform IT, see David Goulden’s article: http://reflectionsblog.emc.com/dell-emc-world-2016-a-look-back/
See the session recording at http://dellemcworld.com/live/library/dell-emc-world-keynote-david-goulden-1
This talk will address new architectures emerging for large scale streaming analytics. Some based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK) and other newer streaming analytics platforms and frameworks using Apache Flink or GearPump. Popular architecture like Lambda separate layers of computation and delivery and require many technologies which have overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (e.g. ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
A. Jesse Jiryu Davis and Samantha Ritter are driver developers at MongoDB. At Open Source Bridge 2015, we describe how driver specs are tested, and how we write tests in the data-description language YAML to prove that all drivers conform to specs.
More info about this talk:
http://emptysqua.re/blog/more-info-about-cat-herds-crook/
This white paper introduces the EMC Isilon scale-out data lake as the key enabler to store, manage, and protect unstructured data for traditional and emerging workloads.
A New Frontier in Securing Sensitive Information – Taneja Group, April 2007LindaWatson19
Sensitive Information is increasingly landing in the hands of malicious individuals either through external breaches or insider thefts from employees or contractors. This paper looks into what it takes to secure sensitive information and thwart and potential breaches.
Data lakes are central repositories that store large volumes of structured, unstructured, and semi-structured data. They are ideal for machine learning use cases and support SQL-based access and programmatic distributed data processing frameworks. Data lakes can store data in the same format as its source systems or transform it before storing it. They support native streaming and are best suited for storing raw data without an intended use case. Data quality and governance practices are crucial to avoid a data swamp. Data lakes enable end-users to leverage insights for improved business performance and enable advanced analytics.
EMC Isilon: A Scalable Storage Platform for Big DataEMC
This white paper provides insights into EMC Isilon's shared storage approach, covering a wide range of desired characteristics including increased efficiency and reduced total cost.
Solix Cloud – Managing Data Growth with Database Archiving and Application Re...LindaWatson19
Mission-critical ERP and CRM applications are the lifeblood of any business. This paper examines how Solix Cloud Database Archiving and Application Retirement Solutions enable enterprises to achieve their ILM goals while reducing complexity and offering superior performance.
In the Age of Unstructured Data, Enterprise-Class Unified Storage Gives IT a ...Hitachi Vantara
Your enterprises can no longer be the realm of monolithic block-centric storage systems. Unstructured data drives the adoption of unified storage systems that support multiple protocols. Mobile smartphones and tablets accelerate the spread of file-based applications and data. Is your enterprise ready to join the mobile revolution? In this paper, see how Hitachi with our next generation of unified enterprise storage will help you move into the next era of business agility and efficiency.
Enterprise Archiving with Apache Hadoop Featuring the 2015 Gartner Magic Quad...LindaWatson19
Read how Solix leverages the Apache Hadoop big data platform to provide low cost, bulk data storage for Enterprise Archiving. The Solix Big Data Suite provides a unified archive for both structured and unstructured data and provides an Information Lifecycle Management (ILM) continuum to reduce costs, ensure enterprise applications are operating at peak performance and manage governance, risk and compliance.
Data storage and networking are no exceptions. Development is moving fast, and tipping points have already tipped: the cloud, nextgen networks, Internet of Things (IoT), innovative le systems, NVMe SSD. These technologies are active today in enterprise data centers and in the public clouds that serve them.
Enterprise Storage Solutions for Overcoming Big Data and Analytics ChallengesINFINIDAT
Big Data and analytics workloads represent a new frontier for organizations. Data is being collected from sources that did not exist 10 years ago. Mobile phone data, machine-generated data, and website interaction data are all being collected and analyzed. In addition, as IT budgets are already being pressured down, Big Data footprints are getting larger and posing a huge storage challenge.
This paper provides information on the issues that Big Data applications pose for storage systems and how choosing the correct storage infrastructure can streamline and consolidate Big Data and analytics applications without breaking the bank.
InfiniBox bridges the gap between high performance and high capacity for Big Data applications. InfiniBox allows an organization implementing Big Data and Analytics projects to truly attain its business goals: cost reduction, continual and deep capacity scaling, and simple and effective management — and without any compromises in performance or reliability. All of this to effectively and efficiently support Big Data applications at a disruptive price point.
Learn more at www.infinidat.com.
BIG DATA IN CLOUD COMPUTING REVIEW AND OPPORTUNITIESijcsit
Big Data is used in decision making process to gain useful insights hidden in the data for business and engineering. At the same time it presents challenges in processing, cloud computing has helped in advancement of big data by providing computational, networking and storage capacity. This paper presents the review, opportunities and challenges of transforming big data using cloud computing resources.
Big Data is used in decision making process to gain useful insights hidden in the data for business and engineering. At the same time it presents challenges in processing, cloud computing has helped in advancement of big data by providing computational, networking and storage capacity. This paper presents the review, opportunities and challenges of transforming big data using cloud computing resources.
Top 10 guidelines for deploying modern data architecture for the data driven ...LindaWatson19
Enterprises are facing a new revolution, powered by the rapid adoption of data analytics with modern technologies like machine learning and artificial intelligence (A).
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
With EMC XtremIO all-flash array, improve
1) your competitive agility with real-time analytics & development
2) your infrastructure agility with elastic provisioning for performance & capacity
3) your TCO with 50% lower capex and opex and double the storage lifecycle.
• Citrix & EMC XtremIO: Better Together
• XtremIO Design Fundamentals for VDI
• Citrix XenDesktop & XtremIO
-- Image Management & Storage
-- Demonstrations
-- XtremIO XenDesktop Integration
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
Explore findings from the EMC Forum IT Study and learn how cloud computing, social, mobile, and big data megatrends are shaping IT as a business driver globally.
Reference architecture with MIRANTIS OPENSTACK PLATFORM.The changes that are going on in IT with disruptions from technology, business and culture and so IT to solve the issues has to change from moving from traditional models to broker provider model.
Force Cyber Criminals to Shop Elsewhere
Learn the value of having an Identity Management and Governance solution and how retailers today are benefiting by strengthening their defenses and bolstering their Identity Management capabilities.
Container-based technology has experienced a recent revival and is becoming adopted at an explosive rate. For those that are new to the conversation, containers offer a way to virtualize an operating system. This virtualization isolates processes, providing limited visibility and resource utilization to each, such that the processes appear to be running on separate machines. In short, allowing more applications to run on a single machine. Here is a brief timeline of key moments in container history.
This white paper provides an overview of EMC's data protection solutions for the data lake - an active repository to manage varied and complex Big Data workloads
This infographic highlights key stats and messages from the analyst report from J.Gold Associates that addresses the growing economic impact of mobile cybercrime and fraud.
This white paper describes how an intelligence-driven governance, risk management, and compliance (GRC) model can create an efficient, collaborative enterprise GRC strategy across IT, Finance, Operations, and Legal areas.
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
This white paper discusses the results of a CIO UK survey on a“Trust Paradox,” defined as employees and business partners being both the weakest link in an organization’s security as well as trusted agents in achieving the company’s goals.
2014 Cybercrime Roundup: The Year of the POS BreachEMC
This RSA fraud report summarizes cybercrime in 2014 and includes the number of phishing attacks globally, top hosting countries for phishing attacks, the financial impact of global fraud losses, and a monthly highlight.
2014 Cybercrime Roundup: The Year of the POS Breach
Scale-Out Data Lake with EMC Isilon
1. EMC WHITE PAPER
THE EMC ISILON SCALE-OUT
DATA LAKE
Key capabilities
ABSTRACT
This white paper provides an introduction to the EMC Isilon scale-out data lake as the
key enabler to store, manage, and protect unstructured data for traditional and
emerging workloads. Business decision makers and architects can leverage the
information provided here to make key strategy and implementation decisions for
their storage infrastructure.
July 2014
3. TABLE OF CONTENTS
EXECUTIVE SUMMARY 4
Audience 4
Terminology 4
OVERVIEW 4
DATA FLOW 5
Ingest 6
Storage 6
Analysis 6
Application: Surface and Act 6
BUSINESS CHALLENGES 6
Silos or islands of data 7
Inefficiencies across the system 7
Security and compliance 7
Inflexible architecture for innovation 7
DATA LAKE 8
ISILON SCALE-OUT DATA LAKE 8
Multiple access methods 9
Cost efficiency 9
Reduce risks 10
Protect and secure data assets 10
Faster time to insights 11
STORAGE AND DATA SERVICES 12
CONSUMPTION MODELS 12
CONCLUSION 13
REFERENCES 13
ABOUT EMC 13
4. 4
EXECUTIVE SUMMARY
Data is growing rapidly as the numbers of people using digital devices interact with systems and networks grow across the globe. A
Majority of this data growth is in the unstructured form—and as organizations are experiencing, contains valuable insights that could
be used to improve business results. The technology to capture, store and analyze this data is maturing rapidly, as enterprises are
looking for ways to effectively handle the data growth. By some industry estimates, over 80 percent of the new storage capacity
deployed in organizations around the world will be for unstructured data.
EMC®
Isilon®
scale-out network-attached storage (NAS) provides a simple, scalable, and efficient platform to store massive amounts
of unstructured data and enable various applications to create a scalable and accessible data repository without the overhead
associated with traditional storage systems. Isilon enables organizations to build a scale-out data lake where they can store their
current data, and scale capacity, performance, or protection as their business data grows in the future. The scale-out data lake helps
lower storage costs by efficient storage utilization, eliminate islands or silos of storage, and lower storage management costs of
migration, security and protection.
EMC Isilon OneFS®
, the intelligence behind Isilon systems, is the industry’s leading scale-out NAS operating system, which is known
for its massive capacity, operational simplicity, extreme performance, and unmatched storage utilization. With proven enterprise-
grade protection and security capabilities, Isilon is an ideal platform to meet a wide variety of storage needs.
AUDIENCE
This white paper is intended for business decision makers, IT managers, architects, and implementers. By leveraging the Isilon
scale-out data lake—a key enabler for storing and managing massive quantities of unstructured data—enterprises can build their
storage strategies and implement their infrastructure to maximize the return of their IT investments.
TERMINOLOGY
The acronyms used in this paper are summarized in the Table 1.
Table 1. Acronyms used in this paper
OVERVIEW
According to an IDC “The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things” study
sponsored by EMC, the digital universe will grow from 4.4 zettabytes (1 trillion gigabytes) in 2013 to 44 zettabytes in 2020, doubling
in size every two years. Although enterprises create only 30 percent of this data (1.5 ZB in 2013), they come in contact with 85
percent (2.3 ZB in 2013) of it, and have some liability associated with the data. IDC further estimates that 22 percent of the data is
a candidate for analysis and contains valuable information that organizations can use to make critical decisions. Figure 1 shows this
trend for enterprise data.
ACRONYM DESCRIPTION
CIFS Common Internet File System
DAS direct-attached storage
HDFS Hadoop Distributed File System
LAN local area network
NAS network-attached storage
NFS network file system
SAN storage area network
SMB Server Message Block
5. 5
Figure 1. Enterprise data1
Not all of this data is stored by the enterprises in the datacenter for cost, liability or process reasons.
Isilon scale-out NAS provides key capabilities to build a data lake that physically de-couple compute from storage without affecting
the seamless nature of the data flow outlined below. You can leverage a data lake to get the most value from your data. Data from a
variety of sources can be converged using native protocols, protected, secured and exposed to analytics systems that drive value,
organizations can plan, implement and manage loosely coupled but seamless storage and applications.
DATA FLOW
Organizational data typically follows a linear data flow starting with various sources both consumer and corporate; ingested into a
store, analyzed and surfaced for actions that augment value creation for an organization; as shown in Figure 2. The five distinct
stages are typically broken down into three broad categories of data, analytics and application for easy reference from a holistic point
of view.
|
Figure 2. Data flow
1
EMC Digital Universe Study—with Research and Analysis by IDC, http://www.emc.com/leadership/digital-universe/index.htm
6. 6
In the following section we detail each of the stages in the data flow process as we understand the implications of each of these
phases of the data flow. The choices at each of these stages have a profound impact in deriving value of the organizations data and
ultimately the data store.
INGEST
Data Ingestion is the process of obtaining and processing data for later use by storing in the most appropriate system for the
application. An effective ingestion methodology validates the data, prioritizes the sources and with reasonable speed and efficiency
commits data to storage. The velocity, volume and variety tax the capture speeds, throughput and efficiency of the ingest systems
and therefore the store as discussed in the following section. The ingest process can act as the starting point of a storage silo if the
organization planners do not look at the dataset holistically.
High velocity data- characterized by a continuous feed of data requires a specialized ingestion mechanism that typically captures the
continuous stream and commits the records in batches of varying sizes to storage for further processing. Capture strategies come in
two flavors, commit records to storage directly or use higher speed buffers as an intermediate step before storing to a more
persistent downstream system. Examples of high velocity data include clickstreams, video surveillance feeds etc.
The volume of data is not only a factor of the velocity but also the size of the data set generated in batches of non-streaming origins.
High volume data typically requires time to move from one point in the system to another depending on the network speed and
places burdens on both the network and storage. Optimization for volume as in large files can affect velocity and vice versa. High
definition video and audio are the typical examples of high volume data in the form of files- separate from streaming.
Variety places translation challenges at the data, file and protocol levels as disparate organic systems may need varying levels of
translation for data to be of value for the downstream processes. A combination of CRM, LOB, social media, web and mobile
information on customers is a good example of variety of data.
STORAGE
Storing data is typically dictated by the type of storage strategy namely block or file, the flow mechanism and the application. Over
the years, storage has evolved into an optimization problem between storage costs and performance. As the volume and variety of
data grows, the cost to reliably store, manage, secure, and protect it grows as well, particularly for data that is subject to compliance
and regulatory mandates like personal identifiable information (PII), financial transactions, and medical records. Adding the
availability component adds cost pressures at the expense of performance and vice versa.
The segregation started by the choice of ingestion system is further exacerbated by storage bringing about true silos or islands of
information catering to various application requirements of real-time, interactive or batch processing. As silos permeate the IT
infrastructure, hot spots arise in systems of heavier use, while capacity goes unused elsewhere. Downstream applications,
compliance, security and data policies can contribute to create further silos within silos- giving rise to a very complex system.
ANALYSIS
Data analysis technologies load, process, surface, secure, and manage data in ways that enable organizations to mine value from
their data. Traditional data analysis systems are expensive, and extending them beyond their critical purposes can place a heavy
burden on IT resources and costs. Integrating data from disparate sources adds complexity and management overhead that can be a
deterrent to most organizations looking to derive value from their data.
APPLICATION: SURFACE AND ACT
Post analysis, results and insights have to be surfaced for actions like e-discovery, post mortem analysis, business process
improvements, decision making or a host of other applications. Traditional systems use traditional protocols and access mechanisms
while new and emerging systems are redefining access requirements to data already stored within an organization. A system is not
complete unless it caters to the range of requirements placed by traditional and next-generation workloads, systems and processes.
BUSINESS CHALLENGES
As organizations face large datasets, data and application growth, they are observing a tremendous increase in costs both CAPEX &
OPEX; limitations in the ability to realize the value of the data; and protection and compliance lapses to name a few. These
challenges can be attributed to a combination of factors throughout the data flow process but fall in the following broad categories:
7. 7
SILOS OR ISLANDS OF DATA
The ingestion strategy employed for real-time, interactive or batch applications originate a silo that diverges as the data flows
downstream. For example, if an organization choses to setup a combination of SAN- buffer stream of data and NAS -persistent
storage for a real-time application like customer targeting; while elsewhere, CRM analysis happens on a combination of DAS and
cloud. A third silo consisting of archived data uses a combination of DAS and NAS to run batch analytics- a typical scenario in
organizations today.
In the first silo, the compliance requirement for handling of PII data can pressure admins to choose between applying policy to the
entire silo or carve out a protected zone if the technology so supports. This adds complexity to the system without the inclusion of
access control for financial data to meet SEC requirements; or HIPPA for medical data. If the design of the system does not support
protections, organizations risk going out of compliance with large fines at a minimum and liability of data leaks causing massive
lawsuits on the other extreme.
Figure 3. Silos or islands of data
Also, the silos of data adds cost pressures due to inefficiencies addressed in the following section, locking of insights within silos at
the expense of business and management and protection inconsistencies. In many organizations, applying learnings from a batch
processing system to a real-time system may involve lengthy change management processes- creating friction across departments
or adding barriers to value generation.
INEFFICIENCIES ACROSS THE SYSTEM
If one dataset of say 10 terabytes is used for three different analyses by three systems, you will have a minimum of three copies
requiring three times the storage capacity. If this dataset grows by a 30% a year the scaling requirement grows by 90% annually-
demonstrating inefficiencies experienced by organizations at the most basic level. If one silo has lower utilization than the other,
hotspots arise in the system while capacity goes unutilized elsewhere. Inefficient utilization of the budget can arise as a result of low
value data residing on high cost, high performance storage.
Organizations are faced with management, datacenter footprint, power and cooling inefficiencies due to silos and hotspots in addition
to reconfigurations, migrations and complicated maintenance activities.
SECURITY AND COMPLIANCE
Organizations with silos are faced with duplicate and inconsistent application of policies, security measures and governance policies,
whereas organizations with shared infrastructure face access control violations. Working around these is typically time-consuming,
painful and diverts resources away from standard procedures as issues arise. Securing data against leaks or destruction- both
accidental and malicious; presents a separate set of challenges. These challenges are compounded in regulated sectors like finance,
healthcare and government.
INFLEXIBLE ARCHITECTURE FOR INNOVATION
More users than ever before are mobile or geographically dispersed with access to a larger dataset as a basic requirement to perform
their duties effectively. Traditional systems operated within the confines of Corporate IT through a regulated and measured
interconnections with the external world. With the growth of cloud, mobility and devices; the capabilities of traditional systems are
being challenged in new ways. Providing access while enforcing policies consistently across a wide range of approved and
8. 8
unapproved devices and platforms is a growing challenge for today’s storage administrators. Traditional systems, silos and
implementation strategies add rigidity as protection and security layers are added around the application or sources of information.
DATA LAKE
The data lake represents a paradigm shift from the linear data flow model. As data and the insights gleaned from it increase in value,
enterprise-wide consolidated storage is transformed into a hub around which the ingestion and consumption systems work (see
Figure 4). This enables enterprises to bring analytics to data and avoid expensive costs of multiple systems, storage and time for
ingestion and analysis.
Figure 4. Scale out data lake
By eliminating a number of parallel linear data flows, enterprises can consolidate vast amounts of their data into a single store—a
data lake—through a native and simple ingestion process. This data can be secured and analysis performed, insights surfaced, and
actions taken in an iterative manner as the organization and technology matures. Enterprises can thus eliminate the cost of having
silos or islands of information spread across their enterprises.
The scale-out data lake further enhances this paradigm by providing scaling capabilities in terms of capacity, performance, security,
and protection. The key characteristics of a scale-out data lake are that it:
Accepts data from a variety of sources like file shares, archives, web applications, devices, and the cloud, in both streaming and
batch processes
Enables access to this data for a variety of uses from conventional purposes to next-gen mobile, analytics, and cloud
applications
Secures and safeguards data with the appropriate level of protection from highly critical data like medical records, financial
transactions, credit card data, and PII to website logs and temporary data that might not require any security
Scales to meet the demands of future consolidation and growth as technology evolves and new possibilities emerge for applying
data to gain competitive advantage in the marketplace
Provides a tiering ability that enables organizations to manage their costs without setting up specialized infrastructures for cost
optimization
Maintains simplicity, even at the petabyte scale
ISILON SCALE-OUT DATA LAKE
Isilon enhances the data lake concept by enriching your storage with improved cost efficiencies, reduced risks, data protection,
security, compliance & governance while enabling you to get to insights faster. You can reduce the risks of your big data project
implementation, operational expenses and try out pilot projects on real business data before investing in a solution that meets your
9. 9
exact business needs. Isilon is based on a fully distributed architecture that consists of modular hardware nodes arranged in a
cluster. As nodes are added, the file system expands dynamically scaling out capacity and performance without adding corresponding
administrative overhead.
Multiple access methods
Isilon natively supports multiple protocols like SMB, NFS, File Transfer Protocol (FTP), and Hypertext Transfer Protocol (HTTP) for
traditional workloads, and HDFS for emerging workloads like Hadoop analytics. By the very nature, a shared storage with multiple
access methods eliminates storage silos bringing efficiencies and consistency to your IT infrastructure. This enables batch, real-time
or interactive applications and systems to store and access data from one shared storage pool without the need for any migrations,
loading, or conversion. Accessing data for read or write purposes are achieved at the protocol level. This implies that data can be
created by any of the myriad systems, ingested into the data lake using a natively supported protocol such as SMB for Windows or
Mac, and accessed or modified seamlessly using another protocol like NFS, FTP or HDFS.
Multiprotocol support enables the storage infrastructure to provide access to applications that leverage third platform protocols like
HTTP or HDFS, which drive emerging workloads. Adding future protocol support, data access mechanisms, or interfaces can be easily
achieved to scale the interoperability of the data lake. Without a data lake, interoperability would require an expensive and time-
consuming sequence of operations on data across multiple silos, or even costly and inefficient data duplication.
Figure 5. Isilon Scale-out data lake
Cost efficiency
You can achieve great cost efficiencies by investing in the storage capacity and capability required today- smaller than traditional or
Hadoop requirements; scale in smaller steps proportionately to the data growth; tiering data according to value and; simplified
management.
Isilon scale-out NAS can scale from 18 terabytes to 20 petabytes in a single cluster so you do not have to overprovision your storage
infrastructure. With over 80% utilization, you can keep your CAPEX in check with a smaller footprint. SmartDedupe provides the
ability to reduce the physical data footprint by locating and sharing common elements across files with minimal impact to
performance during writes or concurrent reads. You will observe a reduction in storage expansion costs, typically in the range of 30%
with smaller storage capacity, power and cooling and of course less rack space requirement.
10. 10
Isilon enables you to tier the data lake to further drive cost efficiencies by optimizing performance, throughput, and density. Using
policy-based tiering, you can reduce the cost of storing your data based on the inherent value and utility of the data. Tiering using
EMC Isilon S-Series, X-Series, and NL-Series nodes is shown in Figure 6. High-value, readily needed data can be kept at a high-
performance tier, whereas low-value, infrequently used data can be moved to a more cost-effective active archive without having
any impact on your application or analytics infrastructure.
Figure 6 . Tiering using Isilon storage nodes
Under the covers, an Isilon cluster provides automatic storage balancing and deduplication that not only enhances storage efficiency
but also enables storage administrators to weather hardware outages better.
Reduce risks
Any large and growing data infrastructure risks can be classified as
Ability to successfully implement an initial solution
Ability and flexibility to scale solution as the environment changes
Protect against current and future data loss
Protect against data theft
These risks are further amplified as the dependency between the ingestion, storage and analytics system remains strong. With a
loosely coupled system you will be able to build mitigation strategies. A data lake based on Isilon is able to provide mitigation for all
these components. By decoupling storage from compute and using multiple access methods, you can deploy a storage system
capable of ingesting the data you need for your solution effortlessly. By using protection and security features outlined in the
following sections, you can ensure your data is safe and resilient to external forces.
Isilon is the only scale-out NAS to support multiple distributions of Hadoop through native HDFS implementation. This allows you the
flexibility to analyze, surface and act on data using the best tool for your application and scale storage compute or both as your
needs change.
Protect and secure data assets
Isilon storage systems are highly resilient and provide unmatched data protection and availability. Isilon uses the proven Reed-
Solomon erasure encoding algorithm rather than RAID to provide a level of data protection that goes far beyond traditional storage
systems. With N+1 protection, data is 100 percent available even if a single drive or a complete node fails, which is comparable to
RAID 5 in conventional storage. You can also deploy Isilon for N+ 2 protections, which allows two components to fail within the
system, similar to RAID 6; N+3 or N+ 4 protections, where three or four components can fail, keeping the data 100 percent
available.
11. 11
Isilon is able to recover from hardware failures faster than traditional systems as recovery entails rebuilding lost data as opposed to
the entire disk. In a data Lake, this can have a huge impact on the operation as storage is shared across systems with varying levels
of performance demands. A truly distributed storage like Isilon can orchestrate all the nodes to participate in the restoration or
recovery of data from the outages on a dedicated backend infiniband network speeding up recovery without impacting front end
performance.
As organizations view data as an asset, securing data for inherent value is even more important than meeting regulatory compliance
and corporate governance requirements. Isilon helps organization address security needs by providing robust and flexible security
features outlined below as described in the figure below.
Figure 7. Authentication zones
Secure role separation enables roles-based access control (RBAC), where a clear separation between storage administration and
file system access is enforced.
Authentication zones that serve as secure, isolated storage pools for insulating departments within the organization from access
to data that they are not supposed to see. For example: legal, financial, PII and employee data can be seen by only authorized
employees in the authentication zone.
Write once-read many (WORM) protection is achieved by using SmartLock software, which prevents accidental or malicious
alterations or deletion of data- a key requirement for governance and compliance.
Data at rest encryptions through the use of Self-Encrypting drives enables organizations to ensure physical loss of hardware
does not equate to data leak.
Faster time to insights
By utilizing a shared infrastructure, you can consolidate data from multiple islands into one single storage system based on Isilon.
This is made possible through he multiple access mechanisms at the protocol level where data can be ingested into the Isilon store
from a wide variety of sources and surfaced for use by others. This would eliminate the need for data migration or extraction-
translation-loading (ETL) operations typical with any data analytics solution, saving you precious time and resources. You can then
12. 12
run Hadoop analytics with your dataset in-place. As depicted in the figure below. By using a large loosely coupled shared scalable
storage on a typical dataset of around 100 terabytes, you can save over 24 hours of data moving time on a 10 GBps network; time
that you can use to actually generate insights from your data.
Isilon is the only scale-out NAS that works with multiple distributions of Hadoop from a variety of vendors. This enables you to try
out tools from all of these vendors at the same time if necessary to find the best solution to meet your business requirements.
The data lake is the key enabler for driving business value into customer environments through the multiprotocol, multi-access,
tiered, single namespace, protected and scalable data repository. Leveraging the scale-out data lake, customers can consolidate
multiple disparate islands of storage into a single cohesive and unified data repository that is easier to manage and more cost-
effective.
Figure 8. Faster insights
STORAGE AND DATA SERVICES
A scale-out data lake leverages the industry leading enterprise-grade storage and data services that extend the business value of
your data. Isilon Infrastructure Software provides powerful storage management software that helps protect your data assets,
control costs, and optimize the storage resources and system performance of your scale-out data lake.
The data management capabilities include EMC Isilon InsightIQ®
, MobileIQ, SmartDedupe, SmartPools®
, and SmartQuotas™, which
together help you improve the ROI of your data lake infrastructure. For more information on these data management services,
review the Isilon OneFS whitepaper available here.
The scale-out data lake is further strengthened by proven data protection capabilities provided by EMC Isilon SmartConnect™,
SmartLock®
, SnapshotIQ™, and SyncIQ®
. For more information on these data protection services, review the Isilon data availability
whitepaper available here.
CONSUMPTION MODELS
Isilon provides a number of consumption models that enables you to choose a strategy that is best for your business. The simplest
and most common strategy is an appliance that is a preinstalled combination of hardware and software. You can choose a converged
infrastructure solution in conjunction with EMC Vblock®
. Or, you may choose a cloud-based utility infrastructure-as-a-service; in
which you pay for the service based on usage. The key is that you have choices in the storage purchasing model and flexibility to fit
your business procurement needs.
13. 13
CONCLUSION
Given that unstructured data will be doubling every two years, enterprises need higher efficiencies, simplicity and protection as they
scale capacity and capabilities. A scale-out data lake is a key capability that provides simplicity, agility, and efficiency to store and
manage unstructured data. Starting with a scale-out data lake, organizations can
1. Invest in the infrastructure they need today to get started today
2. Realize the value of their data, store, process, and analyze it- in the most cost effective manner; and
3. Grow capabilities as needs grow in the future.
This enables organizations to store everything, analyze anything and build a solution with the best ROI.
By de-coupling storage from analysis and application, organizations gain flexibility to choose between a larger number of strategies
to deploy solutions without risking data loss, cost overruns and dataset leaks. The data lake based on Isilon offers organizations this
along with a capability to simplify the IT infrastructure, tier, secure and protect data efficiently; and get to insights faster.
As a large dataset is the reason big data exists, organizations can start with the data. Understand the value locked in their large and
growing datasets by running pilots; not worry about ingesting or surfacing and; pilot applications or emerging technologies- to make
better informed strategic decisions.
REFERENCES
EMC Digital Universe Study—with Research and Analysis by IDC (http://www.emc.com/leadership/digital-universe/index.htm)
EMC Isilon OneFS Operating System http://www.emc.com/collateral/hardware/white-papers/h8202-isilon-onefs-wp.pdf
High Availability and Data Protection with EMC Isilon scale-out NAS http://www.emc.com/collateral/hardware/white-papers/h10588-
isilon-data-availability-protection-wp.pdf
ABOUT EMC
EMC Corporation is a global leader in enabling businesses and service providers to transform their operations and deliver IT as a
service. Fundamental to this transformation is cloud computing. Through innovative products and services, EMC accelerates the
journey to cloud computing, helping IT departments to store, manage, protect, and analyze their most valuable asset—information—
in a more agile, trusted, and cost-efficient way. Additional information about EMC can be found at www.EMC.com.