Use Distributed Filesystem as a Storage Tier

Fabrizio Manfred Furuholmen

Agenda

• Introduction
   • Next Generation Data Center
   • Distributed File system

• Distributed File system
   • OpenAFS
   • GlusterFS
   • HDFS
   • Ceph

• Case Studies

• Conclusion

16/02/2012

Class Exam

• What do you know about DFS?
• How can you create petabyte-scale storage?
• How can you build a centralized system log?
• How can you allocate space for your users or systems when you have thousands of them?
• How can you retrieve data from everywhere?

Introduction

Next Generation Data Center: the “FABRIC”

Key categories:
• Continuous data protection and disaster recovery
• File and block data migration across heterogeneous environments
• Server and storage virtualization
• Encryption for data in flight and at rest

In other words: cloud data center

Introduction

Storage Tier in the “FABRIC”
• High performance
• Scalability
• Simplified management
• Security
• High availability

Solutions
• Storage Area Network
• Network Attached Storage
• Distributed file system

Introduction

What is a distributed file system?

“A distributed file system takes advantage of the interconnected nature of the network by storing files on more than one computer in the network and making them accessible to all of them.”

Introduction

What do you expect from a distributed file system?

• Uniform access: global file name support
• Security: global authentication and authorization
• Reliability: elimination of every single point of failure
• Availability: administrators can perform routine maintenance while the file server is in operation, without disrupting users’ routines
• Scalability: handle terabytes of data
• Standards conformance: some IEEE POSIX file system semantics
• Performance: high performance

Part II

Implementations
How many DFSs do you know?

OpenAFS: introduction

OpenAFS is the open source implementation of the Andrew File System (AFS) from IBM.

Key ideas:
• Make clients do work whenever possible.
• Cache whenever possible.
• Exploit file usage properties. Understand them. One-third of Unix files are temporary.
• Minimize system-wide knowledge and change. Do not hardwire locations.
• Trust the fewest possible entities. Do not trust workstations.
• Batch if possible to group operations.

OpenAFS: design


OpenAFS: components

Cell
• A cell is a collection of file servers and workstations
• The directories under /afs are cells, forming a single unique tree
• A file server contains volumes

Volumes
• Volumes are "containers", or sets of related files and directories
• Volumes have a size limit
• Three types: rw (read-write), ro (read-only), backup

Mount Point Directory
• Access to a volume is provided through a mount point
• A mount point looks just like a static directory

[Diagram labels: Server A, Server C, Server A+B]
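
As a rough illustration of how a mount point hides the location of a volume, here is a minimal Python sketch. It is not OpenAFS code: the cell name `example.org`, the volume names, the server names and the `resolve` helper are all hypothetical, chosen only to show a path being mapped to a volume and then to the servers holding it.

```python
# Hypothetical data: mount points map directories to volumes,
# and a location table maps volumes to the servers that hold them.
MOUNT_POINTS = {
    "/afs/example.org/home/alice": "user.alice",
    "/afs/example.org/software": "software",
}
VOLUME_LOCATIONS = {
    "user.alice": ["server-a"],
    "software": ["server-a", "server-b"],
}


def resolve(path):
    """Return (volume, servers) for the longest mount point that prefixes path."""
    matches = [
        mp for mp in MOUNT_POINTS
        if path == mp or path.startswith(mp + "/")
    ]
    if not matches:
        raise FileNotFoundError(path)
    volume = MOUNT_POINTS[max(matches, key=len)]
    return volume, VOLUME_LOCATIONS[volume]
```

The caller only ever uses the path; moving a volume to another server would change the location table, not the path, which is the point of "do not hardwire locations".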


OpenAFS: performances

[Figure: write and read benchmark charts for OpenAFS and OpenAFS OSD (2 servers), plotted against block size in KB.]

OpenAFS: features

• Uniform name space: same path on all workstations
• Security: based on Kerberos (krb4/krb5), extended ACLs, traffic encryption
• Reliability: read-only replication, HA database, read/write replicas in the OSD version
• Availability: maintenance tasks without stopping the service
• Scalability: server aggregation
• Administration: delegation of administration
• Performance: client-side, disk-based persistent cache; high ratio of clients per server

OpenAFS: who uses it?

Morgan Stanley IT
• Internal usage
• Storage: 450 TB (ro) + 15 TB (rw)
• Clients: 22,000

Pictage, Inc
• Online picture album
• Storage: 265 TB (planned growth to 425 TB in twelve months)
• Volumes: 800,000
• Files: 200,000,000

Embian
• Internet shared folder
• Storage: 500 TB
• Servers: 200 storage servers
• 300 app servers

RZH
• Internal usage: 210 TB

OpenAFS: good for ...

Good
• Wide area networks
• Heterogeneous systems
• Read-mostly workloads (reads > writes)
• Large numbers of clients/systems
• Direct usage by end users
• Federation

Bad
• Locking
• Databases
• Unicode
• Large files
• Some limitations on ..

GlusterFS

“Gluster can manage data in a single global namespace on commodity hardware..”

Keys:
• Lower storage cost: open source software runs on commodity hardware
• Scalability: scales linearly to hundreds of petabytes
• Performance: no metadata server means no bottlenecks
• High availability: data mirroring and real-time self-healing
• Virtual storage for virtual servers: simplifies storage and keeps VMs always on
• Simplicity: complete web-based management suite

GlusterFS: design


GlusterFS: components

Volume
• The volume is the basic element for data export
• Volumes can be stacked for extension

Capabilities
• Specific options (features) can be enabled for each volume (cache, prefetch, etc.)
• Custom extensions are simple to create through the API

Services
• Access to a volume is provided through services such as TCP, Unix socket, and InfiniBand

Example volume file:

    volume posix1
      type storage/posix
      option directory /home/export1
    end-volume

    volume brick1
      type features/posix-locks
      option mandatory
      subvolumes posix1
    end-volume

    volume server
      type protocol/server
      option transport-type tcp
      option transport.socket.listen-port 6996
      subvolumes brick1
      option auth.addr.brick1.allow *
    end-volume
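
The way each volume names its subvolumes (posix1 under brick1) can be pictured as a stack of layers that each add one feature and pass the rest down. The Python sketch below only illustrates that stacking idea; it is not GlusterFS code, and the `PosixStore` and `LockLayer` classes and their methods are hypothetical.

```python
class PosixStore:
    """Bottom layer: keeps file contents in a dict (stands in for storage/posix)."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class LockLayer:
    """Middle layer: adds simple write locks on top of any subvolume."""

    def __init__(self, subvolume):
        self.subvolume = subvolume
        self.locked = set()

    def lock(self, path):
        self.locked.add(path)

    def unlock(self, path):
        self.locked.discard(path)

    def write(self, path, data):
        # Refuse the write if the path is locked, otherwise pass it down.
        if path in self.locked:
            raise PermissionError(f"{path} is locked")
        self.subvolume.write(path, data)

    def read(self, path):
        return self.subvolume.read(path)


# Stack the layers in the same order as the volume file: posix1 under brick1.
posix1 = PosixStore()
brick1 = LockLayer(posix1)
brick1.write("/a.txt", b"hello")
```

Because every layer exposes the same read/write interface, another layer (a cache, for instance) could be inserted between the two without changing either of them.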


Gluster: components


Gluster: performance


Gluster: characteristics

• Uniform name space: same path on all workstations
• Reliability: RAID-1 style replication, asynchronous replication for disaster recovery
• Availability: no system downtime for maintenance (better in the next release)
• Scalability: truly linear scalability
• Administration: self-healing, centralized logging and reporting, appliance version
• Performance: files striped across dozens of storage blocks, automatic load balancing, per-volume I/O tuning

Gluster: who uses it?

• Avail TVN (USA): 400 TB for video on demand, video storage
• Fido Film (Sweden): visual FX and animation studio
• University of Minnesota (USA): 142 TB, supercomputing
• Partners Healthcare (USA): 336 TB, integrated health system
• Origo (Switzerland): open source software development and collaboration platform

Gluster: good for ...

Good
• Large amounts of data
• Access with different protocols
• Direct access from applications (API layer)
• Disaster recovery (better in the next release)
• SAN replacement, VM storage

Bad
• Runs in user space
• Low granularity in security settings
• High volumes of operations on the same file

Implementations

Old way
• Metadata and data in the same place
• Single stream per file

New way
• Multiple streams: parallel channels through which data can flow
• Files are striped across a set of nodes to facilitate parallel access
• OSD: separation of file metadata management (MDS) from the storage of file data
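
Striping a file across nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions: the "nodes" are plain lists, chunks are dealt out round-robin, and the function names `stripe` and `reassemble` are made up for the example rather than taken from any of the file systems discussed here.

```python
from __future__ import annotations


def stripe(data: bytes, num_nodes: int, stripe_size: int) -> list[list[bytes]]:
    """Split data into fixed-size chunks and deal them round-robin to nodes."""
    nodes: list[list[bytes]] = [[] for _ in range(num_nodes)]
    for index, offset in enumerate(range(0, len(data), stripe_size)):
        nodes[index % num_nodes].append(data[offset:offset + stripe_size])
    return nodes


def reassemble(nodes: list[list[bytes]]) -> bytes:
    """Read the chunks back in round-robin order to rebuild the file."""
    chunks = []
    longest = max((len(node) for node in nodes), default=0)
    for row in range(longest):
        for node in nodes:
            if row < len(node):
                chunks.append(node[row])
    return b"".join(chunks)
```

Each node holds only every n-th chunk, so a real client could fetch the chunks of one file from several nodes at the same time instead of pulling a single stream from one server.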


HDFS: Hadoop

HDFS is part of the Apache Hadoop project, which develops open-source software for reliable, scalable, distributed computing.

Hadoop was inspired by Google’s MapReduce and Google File System.

HDFS: Google File System

“Design of a file system for a different environment where assumptions of a general purpose file system do not hold; interesting to see how new assumptions lead to a different type of system…”

Key ideas:
• Component failures are the norm.
• Huge files (not just the occasional file).
• Append rather than overwrite is typical.
• Co-design of application and file system API (specialization). For example, consistency can be relaxed.

HDFS: MapReduce

“Moving Computation is Cheaper than Moving Data”

Map
• The input is split and mapped into key-value pairs

Combine
• For efficiency, the combiner works directly on the output of the map operation

Reduce
• The files are then merged, sorted and reduced
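
The three stages can be sketched with a single-process word count in Python. This is only an illustration of the data flow, not the Hadoop API: the function names and the two sample input splits are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(split: str):
    """Map: turn one input split into (key, value) pairs."""
    return [(word.lower(), 1) for word in split.split()]


def combine(pairs):
    """Combine: pre-aggregate the output of a single map task."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_phase(all_pairs):
    """Reduce: merge the pairs from every map task, sort by key, and sum."""
    totals = defaultdict(int)
    for key, value in all_pairs:
        totals[key] += value
    return dict(sorted(totals.items()))


# Two sample splits; in a cluster each would be mapped on the node that stores it.
splits = ["the quick fox", "the lazy dog the end"]
combined = [pair for split in splits for pair in combine(map_phase(split))]
word_counts = reduce_phase(combined)
```

The combiner matters because it shrinks the map output before anything crosses the network: the second split sends one ("the", 2) pair instead of two ("the", 1) pairs.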


HDFS: goals

• Scalable: can reliably store and process petabytes.
• Economical: distributes the data and processing across clusters of commonly available computers.
• Efficient: can process data in parallel on the nodes where the data is located.
• Reliable: automatically maintains multiple copies of data and automatically redeploys computing tasks after failures.

HDFS: components

NameNode
• An HDFS cluster consists of a single NameNode
• It is a master server that manages the file system namespace and regulates access to files by clients

DataNodes
• DataNodes manage the storage attached to the nodes they run on
• They apply the map part of MapReduce

Blocks
• A file is split into one or more blocks, and these blocks are stored in a set of DataNodes
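
The split between NameNode (namespace and block map) and DataNodes (block storage) can be sketched as follows. This is a toy model, not HDFS code: the `place_blocks` function, the round-robin replica policy, the node names and the sizes are all assumptions chosen for the example.

```python
def place_blocks(file_size, block_size, datanodes, replication):
    """Return {block_index: [datanode, ...]} for one file.

    Replicas are assigned round-robin; this is an illustrative policy,
    not the placement policy HDFS itself uses.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    num_blocks = max(1, -(-file_size // block_size))  # ceiling division
    block_map = {}
    for block in range(num_blocks):
        block_map[block] = [
            datanodes[(block + copy) % len(datanodes)] for copy in range(replication)
        ]
    return block_map


# The NameNode role: hold the namespace and the block map, not the data.
namespace = {
    "/logs/app.log": place_blocks(300, 128, ["dn1", "dn2", "dn3", "dn4"], 3),
}
```

A client would ask the NameNode-like `namespace` where the blocks of a path live and then read each block directly from one of the listed DataNodes.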


HDFS: features

• Uniform name space: same path on all workstations
• Reliability: rw replication, re-balancing, copies in different locations
• Availability: hot deploy
• Scalability: server aggregation
• Administration: HOD (Hadoop On Demand)
• Performance: “grid” computation, parallel transfer

HDFS: who uses it?

Major players:
Yahoo!
A9.com
AOL
Booz Allen Hamilton
EHarmony
Facebook
Freebase
Fox Interactive Media
IBM
ImageShack
ISI
Joost
Last.fm
LinkedIn
Metaweb
Meebo
Ning
Powerset (now part of Microsoft)
Proteus Technologies
The New York Times
Rackspace
Veoh
Twitter
…

HDFS: good for ...

Good
• Task distribution (basic grid infrastructure)
• Distribution of content (high-throughput data access)
• Archiving
• Heterogeneous environments

Bad
• Not a general-purpose file system
• Not POSIX compliant
• Low granularity in security settings
• Java

Ceph

“Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory – usage scenarios that bring typical enterprise storage systems to their knees.”

Keys:
• Seamless scaling: the file system can be seamlessly expanded by simply adding storage nodes (OSDs). However, unlike most existing file systems, Ceph proactively migrates data onto new devices in order to maintain a balanced distribution of data.
• Strong reliability and fast recovery: all data is replicated across multiple OSDs. If any OSD fails, data is automatically re-replicated to other devices.
• Adaptive MDS: the Ceph metadata server (MDS) is designed to dynamically adapt its behavior to the current workload.

Ceph: design

• Client
• Metadata Cluster
• Object Storage Cluster (OSD)

Ceph: features

Dynamic Distributed Metadata

• Metadata Storage
• Dynamic Subtree Partitioning
• Traffic Control

Reliable Autonomic Distributed Object
Storage

• Data Distribution
• Replication
• Data Safety
• Failure Detection
• Recovery and Cluster Updates


Ceph: features

Pseudo-random data distribution function (CRUSH)

Reliable object storage service (RADOS)

Extent and B-tree based Object File System (EBOFS; today btrfs)
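
To show what a pseudo-random data distribution function buys you, here is a minimal Python sketch. It does not implement CRUSH: it uses rendezvous (highest-random-weight) hashing as a simpler stand-in, and the `place_object` function and OSD names are hypothetical. What it shares with the idea above is that any client can compute an object's location from its name and the list of OSDs, without asking a central table.

```python
import hashlib


def place_object(object_name, osds, replicas):
    """Pick `replicas` OSDs for an object by ranking OSDs on a per-object hash."""

    def score(osd):
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()
        return int(digest, 16)

    # The highest-scoring OSDs for this object hold its replicas.
    return sorted(osds, key=score, reverse=True)[:replicas]
```

With this scheme, removing an OSD that does not hold a given object leaves that object's placement unchanged, so only the data on the failed or removed device has to move.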


Ceph: features

Splay Replication
• Only after it has been safely committed to disk is a final commit
notification sent to the client.


Ceph: good for …

Good
• Scientific applications, high-throughput data access
• Heavy read/write operations
• It is the most advanced distributed file system

Bad
• Young (Linux 2.6.34)
• Linux only
• Complex

Others

• Lustre
• PVFS
• MooseFS
• Cloudstore (kosmos)
• PNFS
• XtreemFS
• Tahoe-LAFS
• …

Search Wikipedia..

Part III

Case Studies


Class Exam

• What can DFS do for you?
• How can you create petabyte-scale storage?
• How can you build a centralized system log?
• How can you allocate space for your users or systems when you have thousands of them?
• How can you retrieve data from everywhere?

File sharing

Problem
• Share documents across a wide area network
• Share home folders across different terminal servers

Solution
• OpenAFS
• Samba

Results
• Single ID, Kerberos/LDAP
• Single file system

Usage
• 800 users
• 15 branch offices
• File sharing, /home directories

Web Service

Problem

• Big Storage on a little budget

Solution

• Gluster

Results

• High Availability data storage
• Low price

Usage

• 100 TB image archive
• Multimedia content for web site


Internet Disk: myS3

Problems
• Data from everywhere
• Disaster recovery

Solution
• myS3
• Hadoop / OpenAFS

Results
• High availability
• Access through the HTTP protocol (REST interface)
• Disaster recovery

Usage
• User backups
• Application backend
• 200 users
• 6 TB

Log concentrator

Problem

• Log concentrator

Solution

• Hadoop cluster
• Syslog-NG

Results

• High availability
• Fast search
• “Storage without limits”

Usage

• Security audit and access control


Private cloud

Problems

• Low cost VM storage
• VM self provisioning

Solution

• GlusterFS
• openAFS
• Custom provisioning

Results

• Auto provisioning
• Low cost
• Flexible solution

Usage

• Development env
• Production env

Conclusion: problems

Do you have enough bandwidth?
• Failure: for 10 PB of storage, you will have an average of 22 consumer-grade SATA drives failing per day.
• Read/write time: each 2 TB drive takes, in the best case, approximately 24,390 seconds to be read and written over the network.
• Data replication: the number of disk drives, plus difference.
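
The read/write figure can be checked with a short calculation. The slides do not state the throughput behind it; 24,390 seconds for a 2 TB drive is consistent with a sustained rate of about 82 MB/s, so that rate is used below as an inferred assumption, with decimal units (1 TB = 10^12 bytes).

```python
def transfer_seconds(capacity_bytes, throughput_bytes_per_s):
    """Best-case time to stream a whole drive at a sustained throughput."""
    return capacity_bytes / throughput_bytes_per_s


TWO_TB = 2 * 10**12          # 2 TB drive, decimal units
ASSUMED_RATE = 82 * 10**6    # 82 MB/s, inferred from the 24,390 s figure

seconds = transfer_seconds(TWO_TB, ASSUMED_RATE)
hours = seconds / 3600
```

Under that assumption a single drive takes a little under seven hours to copy, which is why rebuilding after a failure has to be budgeted as network bandwidth and not only as disk capacity.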


Conclusion

Environment analysis
• There is no truly generic DFS
• Moving 800 TB between different solutions is not simple

Dimension
• Start with the right size
• The number of servers depends on the speed needed and the number of clients
• Network for replication

Divide the system into classes of service
• Different disk types
• Different computer types

System management
• Monitoring tools
• System/software deployment tools

Conclusion: next step


Links

• OpenAFS: www.openafs.org
• Gluster: www.gluster.org
• Hadoop: hadoop.apache.org
• Ceph: ceph.newdream.n
• www.beolink.org
• Isabel Drost et
• Publication
• Mailing list

I look forward to meeting you…

XVII European AFS meeting 2010
PILSEN - CZECH REPUBLIC
September 13-15

Who should attend:
• Everyone interested in deploying a globally accessible file system
• Everyone interested in learning more about real-world usage of Kerberos authentication in single-realm and federated single sign-on environments
• Everyone who wants to share their knowledge and experience with other members of the AFS and Kerberos communities
• Everyone who wants to find out the latest developments affecting AFS and Kerberos

More Info: http://afs2010.civ.zcu.cz/
