The document discusses the architecture of the next-generation parallel file system OrangeFS. OrangeFS distributes file data and metadata across multiple file servers and storage devices. It supports simultaneous access by multiple clients. Recent additions to OrangeFS include a scalable metadata operation protocol, support for SSD metadata storage, a Windows client, direct client access interface, client caching, WebDAV integration, and an S3 interface.
Architecture of the Upcoming OrangeFS v3 Distributed Parallel File System (All Things Open)
OrangeFS is a parallel file system that provides distributed, shared-nothing metadata and data storage across multiple servers. It allows for high performance parallel I/O and a unified namespace. The document discusses OrangeFS's architecture, performance advantages, and areas for future improvement including enhanced availability, security, integrity checking, and administration. Upcoming versions will feature distributed primary object replication, geographic file replication, capability-based security, parallel background jobs for maintenance and verification, and improved metadata and scaling performance.
HDFS is a distributed file system designed for storing very large data files across commodity servers or clusters. It works on a master-slave architecture with one namenode (master) and multiple datanodes (slaves). The namenode manages the file system metadata and regulates client access, while datanodes store and retrieve block data from their local file systems. Files are divided into large blocks which are replicated across datanodes for fault tolerance. The namenode monitors datanodes and replicates blocks if their replication drops below a threshold.
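The re-replication behavior described above can be sketched in a few lines. This is a hypothetical illustration (not Hadoop source code) of the namenode's check, assuming a simple block-to-datanode map and a set of datanodes still sending heartbeats:

```python
# Toy sketch of the namenode's re-replication check: blocks whose live
# replica count falls below the target factor are queued for copying.
REPLICATION_FACTOR = 3  # assumed default; configurable in real HDFS

def under_replicated(block_map, live_datanodes, target=REPLICATION_FACTOR):
    """Return {block: replicas_needed} for blocks below the target."""
    needed = {}
    for block, holders in block_map.items():
        live = [dn for dn in holders if dn in live_datanodes]
        if len(live) < target:
            needed[block] = target - len(live)
    return needed

# Example: datanode "dn3" has stopped sending heartbeats.
block_map = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn3", "dn4"}}
live = {"dn1", "dn2", "dn4"}
print(under_replicated(block_map, live))  # {'blk_1': 1, 'blk_2': 1}
```

Each affected block would then be scheduled for copying from a surviving replica to another live datanode.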
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It is designed for very large numbers of files (over 100 million) and is optimized for batch processing huge datasets across large clusters (over 10,000 nodes). HDFS stores multiple replicas of data blocks on different nodes to handle failures. It provides high aggregate bandwidth and allows computations to move to where data resides.
This document discusses file systems and file management. It begins by defining key file concepts like file attributes and operations. It then covers topics like access methods, directory structures, file sharing, protection, and file system implementation details. The objectives are to explain file system functions, describe interfaces, discuss design tradeoffs for components like access methods and directories, and explore file system protection.
Beyond Mission Critical: Virtualizing Big Data and Hadoop (Chiou-Nan Chen)
Virtualizing big data platforms like Hadoop provides organizations with agility, elasticity, and operational simplicity. It allows clusters to be quickly provisioned on demand, workloads to be independently scaled, and mixed workloads to be consolidated on shared infrastructure. This reduces costs while improving resource utilization for emerging big data use cases across many industries.
HDFS is a distributed file system designed for storing very large data sets reliably and efficiently across commodity hardware. It has three main components - the NameNode, Secondary NameNode, and DataNodes. The NameNode manages the file system namespace and regulates access to files. DataNodes store and retrieve blocks when requested by clients. HDFS provides reliable storage through replication of blocks across DataNodes and detects hardware failures to ensure data is not lost. It is highly scalable, fault-tolerant, and suitable for applications processing large datasets.
Hadoop Distributed File System (HDFS) is a distributed file system that stores large datasets across commodity hardware. It is highly fault tolerant, provides high throughput, and is suitable for applications with large datasets. HDFS uses a master/slave architecture where a NameNode manages the file system namespace and DataNodes store data blocks. The NameNode ensures data replication across DataNodes for reliability. HDFS is optimized for batch processing workloads where computations are moved to nodes storing data blocks.
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015 (NoSQLmatters)
There are many frameworks that can offer real time processing on top of Hadoop. This talk will show the usage of Pivotal HAWQ and how easy it is to use SQL for querying your Hadoop data. Come and see the power and ease of use that can help you get the most from the Hadoop ecosystem.
IBM SONAS and the Cloud Storage Taxonomy (Tony Pearson)
This document discusses IBM's Scale Out Network Attached Storage (SONAS) solution. SONAS provides a global namespace and can scale to support large amounts of unstructured file data across various cloud environments. The document outlines how SONAS utilizes IBM's General Parallel File System (GPFS) and provides features such as high performance, data replication, backups, antivirus integration, and information lifecycle management through migration to tape storage.
This document outlines and compares two NameNode high availability (HA) solutions for HDFS: AvatarNode used by Facebook and BackupNode used by Yahoo. AvatarNode provides a complete hot standby with fast failover times of seconds by using an active-passive pair and ZooKeeper for coordination. BackupNode has limitations including slower restart times of 25+ minutes and supporting only two-machine failures. While it provides hot standby for the namespace, block reports are sent only to the active NameNode, making it a semi-hot standby solution. The document also briefly mentions other experimental HA solutions for HDFS.
The 3.0 release of the Maginatics Cloud Storage Platform (MCSP) includes great improvements in Data Protection, Multi-tier Caching and APIs, as well as other significant new features that make Maginatics the ideal choice for enterprise businesses with demanding storage requirements.
Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis (Sameer Tiwari)
There is a plethora of storage solutions for big data, each having its own pros and cons. The objective of this talk is to delve deeper into specific classes of storage types like Distributed File Systems, in-memory Key Value Stores, Big Table Stores and provide insights on how to choose the right storage solution for a specific class of problems. For instance, running large analytic workloads, iterative machine learning algorithms, and real time analytics.
The talk will cover HDFS and HBase, with a brief introduction to Redis.
Red Hat Storage - Introduction to GlusterFS (GlusterFS)
Red Hat Storage introduces GlusterFS, an open source scale-out file system. GlusterFS provides scalable, affordable storage using commodity hardware. It allows linearly scaling performance and capacity by adding servers. GlusterFS has a global namespace and supports various protocols, enabling flexible deployment across private and public clouds. Many enterprises rely on GlusterFS for applications, virtual machines, Hadoop, and hybrid cloud solutions.
Maginatics @ SDC 2013: Architecting An Enterprise Storage Platform Using Obje... (Maginatics)
How did Maginatics build a strongly consistent and secure distributed file system? Niraj Tolia, Chief Architect at Maginatics, gave this presentation on the design of MagFS at the Storage Developer Conference on September 16, 2013.
For more information about MagFS—The File System for the Cloud, visit maginatics.com or contact us directly at info@maginatics.com.
The Hadoop Distributed File System (HDFS) has a master/slave architecture with a single NameNode that manages the file system namespace and regulates client access, and multiple DataNodes that store and retrieve blocks of data files. The NameNode maintains metadata and a map of blocks to files, while DataNodes store blocks and report their locations. Blocks are replicated across DataNodes for fault tolerance following a configurable replication factor. The system uses rack awareness and preferential selection of local replicas to optimize performance and bandwidth utilization.
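The rack-aware placement described above can be illustrated with a small sketch. This is an assumed simplification (names invented, not Hadoop code) of HDFS's default policy: first replica on the writer's node, second on a different rack, third on the second's rack but a different node; it assumes every rack holds at least two nodes:

```python
import random

# Simplified model of HDFS's default rack-aware replica placement.
def place_replicas(writer, racks):
    """racks: {rack_name: [nodes]}; returns three chosen nodes."""
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    first = writer                                     # replica 1: local node
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second = random.choice(racks[remote_rack])         # replica 2: remote rack
    third = random.choice(                             # replica 3: same remote
        [n for n in racks[remote_rack] if n != second])  # rack, different node
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", racks)
# e.g. ['n1', 'n3', 'n4']: one rack failure can cost at most two replicas
```

Keeping two replicas on one remote rack trades a little failure independence for less cross-rack write traffic, which is the bandwidth optimization the summary mentions.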
This document discusses directory write leases in MagFS, a globally distributed file system. It introduces the concept of directory write leases, which allow clients to cache and execute namespace-modifying operations locally to improve performance over high-latency networks. Evaluation results show that directory write leases enable workloads to complete much faster with increasing network latency compared to synchronous approaches.
CBlocks - POSIX compliant file systems for HDFS (DataWorks Summit)
With YARN running Docker containers, it is possible to run applications that are not HDFS aware inside these containers. It is hard to customize these applications since most of them assume a POSIX file system with rewrite capabilities. In this talk, we will dive into how we created a block storage layer, how it is being tested internally, and the storage containers that make it all possible.
The storage container framework was developed as part of Ozone (HDFS-7240). This talk will also explore the current state of Ozone along with CBlocks, covering the architecture of storage containers, how replication is handled, scaling to millions of volumes, and I/O performance optimizations.
This document summarizes new file system and storage features in Red Hat Enterprise Linux (RHEL) 6 and 7. It discusses enhancements to logical volume management (LVM) such as thin provisioning and snapshots. It also covers expanded file system options like XFS, improvements to NFS including parallel NFS, and general performance enhancements.
This document discusses designing and building an inexpensive distributed file system (DFS). It begins with an overview of why DFS systems are used and their advantages over centralized storage, such as lower costs and better scalability. It then provides details on openAFS, an open-source DFS, including its main elements, implementation, and usage. The document also introduces new approaches using object-based storage and distributed systems like Hadoop.
Comparison between MongoDB and Cassandra using YCSB (sonalighai)
Performed YCSB benchmark tests to compare the performance of MongoDB and Cassandra across different workloads at a million operation counts, and generated a report discussing the resulting insights.
Interactive Hadoop via Flash and Memory (Chris Nauroth)
Enterprises are using Hadoop for interactive real-time data processing via projects such as the Stinger Initiative. We describe two new HDFS features – Centralized Cache Management and Heterogeneous Storage – that allow applications to effectively use low latency storage media such as Solid State Disks and RAM. In the first part of this talk, we discuss Centralized Cache Management to coordinate caching important datasets and place tasks for memory locality. HDFS deployments today rely on the OS buffer cache to keep data in RAM for faster access. However, the user has no direct control over what data is held in RAM or how long it's going to stay there. Centralized Cache Management allows users to specify which data to lock into RAM. Next, we describe Heterogeneous Storage support for applications to choose storage media based on their performance and durability requirements. Perhaps the most interesting of the newer storage media are Solid State Drives, which provide improved random IO performance over spinning disks. We also discuss memory as a storage tier, which can be useful for temporary files and intermediate data for latency sensitive real-time applications. In the last part of the talk we describe how administrators can use quota mechanism extensions to manage fair distribution of scarce storage resources across users and applications.
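The pool-plus-quota idea behind centralized cache management can be modeled briefly. This is a toy model with invented names (not the HDFS API): users add cache directives to a pool, and the pool's byte quota bounds how much data it may pin in datanode memory:

```python
# Toy model of cache pools with byte quotas, loosely mirroring the
# administrator-facing concepts described in the talk summary.
class CachePool:
    def __init__(self, name, limit_bytes):
        self.name, self.limit, self.used = name, limit_bytes, 0
        self.directives = []

    def add_directive(self, path, size_bytes):
        """Pin a path into cache, enforcing the pool's quota."""
        if self.used + size_bytes > self.limit:
            raise ValueError(f"pool {self.name}: quota exceeded")
        self.directives.append(path)
        self.used += size_bytes

pool = CachePool("analytics", limit_bytes=4 * 2**30)   # 4 GiB quota
pool.add_directive("/warehouse/dim_tables", 3 * 2**30)  # fits
try:
    pool.add_directive("/warehouse/fact_sample", 2 * 2**30)  # would exceed
except ValueError as e:
    print(e)  # pool analytics: quota exceeded
```

The quota check is what lets administrators divide scarce RAM fairly across users, the point made at the end of the summary.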
DAOS (Distributed Application Object Storage) is a high-performance storage architecture and software stack that delivers scalable object storage capabilities. It uses Intel Optane memory and NVMe SSDs to provide high IOPS, bandwidth, and low latency storage. DAOS supports various data models and interfaces like POSIX, HDF5, Spark, and Python. It allows applications to access storage with library calls instead of system calls for high performance.
Gluster Webinar: Introduction to GlusterFS (GlusterFS)
GlusterFS is an open source, scale-out network filesystem. It runs on commodity hardware and allows indefinite growth in capacity and performance by simply adding server nodes. Key benefits include flexibility to deploy on any hardware, linearly scalable performance, and superior storage economics compared to traditional storage solutions. GlusterFS uses a distributed hashing technique instead of a metadata server to provide high availability and reliability.
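The metadata-server-free design mentioned above rests on hashing. This is a minimal sketch of the idea, not GlusterFS's actual algorithm (real GlusterFS assigns per-directory hash ranges to bricks); here a digest of the file path alone selects the brick, so every client locates a file without any central lookup:

```python
import hashlib

# Locate a file by hashing its path over the brick list: no metadata
# server is consulted, and all clients agree on the placement.
def locate(path, bricks):
    digest = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return bricks[digest % len(bricks)]

bricks = ["server1:/brick", "server2:/brick", "server3:/brick"]
brick = locate("/home/alice/report.txt", bricks)
# every client computes the same brick for the same path
```

Removing the metadata server removes both a bottleneck and a single point of failure, which is the availability argument the summary makes.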
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB (Athiq Ahamed)
This document provides a summary of a presentation that benchmarked the performance of three popular NoSQL databases: Apache Cassandra, Apache HBase, and MongoDB. It describes the architectures and data models of each database. Benchmark tests were run using the Yahoo Cloud Serving Benchmark and found that Apache Cassandra consistently outperformed the other databases across different workloads in terms of load time, read and write performance, and latency. The presentation emphasizes the importance of benchmarks for evaluating NoSQL database performance and choosing the right database based on application requirements.
HBase and HDFS: Understanding FileSystem Usage in HBase (enissoz)
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS when tuning IO performance.
This document outlines the agenda for a training on Oracle RDBMS 12c new features. The training will cover 6 chapters: introduction, multitenant architecture, upgrade features, Flex Cluster, Global Data Service, and an overview of RDBMS features. The agenda provides a high-level overview of topics to be discussed in each chapter, including multitenant architecture concepts, upgrade options and tools, Flex Cluster configurations, Global Data Service components, and new features such as temporary undo and multiple indexes on the same columns.
Ceph is a distributed file system that provides excellent performance, reliability and scalability for IaaS platforms like OpenStack. It uses an object-based storage model with dynamic distributed metadata management and reliable replication to store data across multiple servers. While CephFS for POSIX file access is still maturing, Ceph block storage via RBD is stable and commonly used in IaaS to provide block-level volumes to VMs from images stored in Ceph.
This document discusses ORM cache hierarchies and distributed caching. It describes two levels of caching - a session level cache and a query/entry cache level. It addresses consistency problems with caching and discusses solutions like using distributed locks or JTA transaction managers. It also proposes executing queries directly in the entity cache to avoid invalidation issues, and discusses writing data behind asynchronously to caches instead of synchronously to databases.
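The write-behind approach mentioned at the end of that summary can be sketched compactly. This is a hedged illustration with invented names: writes update the cache synchronously and are flushed to the database asynchronously from a queue, trading durability lag for lower write latency:

```python
import queue
import threading

# Minimal write-behind cache: puts are acknowledged after the cache
# update; the database write happens later on a background thread.
class WriteBehindCache:
    def __init__(self, db):
        self.cache, self.db = {}, db
        self.pending = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def put(self, key, value):
        self.cache[key] = value          # synchronous cache update
        self.pending.put((key, value))   # database write deferred

    def _flusher(self):
        while True:
            key, value = self.pending.get()
            self.db[key] = value         # the slow, durable write
            self.pending.task_done()

db = {}
c = WriteBehindCache(db)
c.put("user:1", {"name": "Ada"})
c.pending.join()  # demo only: wait until the async flush lands
```

A crash between `put` and the flush loses the write, which is exactly the consistency trade-off the document weighs against synchronous database writes.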
This document discusses containerization and the Docker ecosystem. It begins by describing the challenges of managing different software stacks across multiple environments. It then introduces Docker as a solution that packages applications into standardized units called containers that are portable and can run anywhere. The rest of the document covers key aspects of the Docker ecosystem like orchestration tools like Kubernetes and Docker Swarm, networking solutions like Flannel and Weave, storage solutions, and security considerations. It aims to provide an overview of the container landscape and components.
Spectrum Scale Unified File and Object with WAN Caching (Sandeep Patil)
This document provides an overview of IBM Spectrum Scale's Active File Management (AFM) capabilities and use cases. AFM uses a home-and-cache model to cache data from a home site at local clusters for low-latency access. It expands GPFS' global namespace across geographical distances and provides automated namespace management. The document discusses AFM caching basics, global sharing, use cases like content distribution and disaster recovery. It also provides details on Spectrum Scale's protocol support, unified file and object access, using AFM with object storage, and configuration.
Software Defined Analytics with File and Object Access Plus Geographically Di...Trishali Nayar
Introduction to Spectrum Scale Active File Management (AFM)
and its use cases. Spectrum Scale Protocols - Unified File & Object Access (UFO) Feature Details
AFM + Object : Unique Wan Caching for Object Store
Windows Server 2012 introduces new storage technologies like Storage Spaces and SMB 3.0 that can replace traditional SANs. These technologies provide high performance storage with easier administration and lower costs when used together. They enable virtualized storage through storage pools and spaces, storage resilience through hardware redundancy, and optimization of storage utilization.
Ceph Day London 2014 - The current state of CephFS development Ceph Community
The document discusses recent developments in CephFS. It provides an overview of CephFS architecture including components like clients, servers, storage and data placement. The focus is on improving resilience and making CephFS production-ready with features like online filesystem checking, journal resilience tools, client management and online diagnostics. The goal is to handle failures and diagnose problems in a distributed filesystem environment.
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFSUSE Italy
In questa sessione HPE e SUSE illustrano con casi reali come HPE Data Management Framework e SUSE Enterprise Storage permettano di risolvere i problemi di gestione della crescita esponenziale dei dati realizzando un’architettura software-defined flessibile, scalabile ed economica. (Alberto Galli, HPE Italia e SUSE)
Service Fabric is an open-source distributed systems platform from Microsoft for packaging, deploying and managing distributed applications and services at scale. Azure Service Fabric Mesh is a new fully-managed platform that allows developing and running microservices applications without having to manage infrastructure. Key features of Service Fabric Mesh include serverless infrastructure, lifecycle management, intelligent traffic routing, and health monitoring. It allows building applications using any programming language or framework that can run in containers.
VMworld 2015: Advanced SQL Server on vSphereVMworld
Microsoft SQL Server is one of the most widely deployed “apps” in the market today and is used as the database layer for a myriad of applications, ranging from departmental content repositories to large enterprise OLTP systems. Typical SQL Server workloads are somewhat trivial to virtualize; however, business critical SQL Servers require careful planning to satisfy performance, high availability, and disaster recovery requirements. It is the design of these business critical databases that will be the focus of this breakout session. You will learn how build high-performance SQL Server virtual machines through proper resource allocation, database file management, and use of all-flash storage like XtremIO. You will also learn how to protect these critical systems using a combination of SQL Server and vSphere high availability features. For example, did you know you can vMotion shared-disk Windows Failover Cluster nodes? You can in vSphere 6! Finally, you will learn techniques for rapid deployment, backup, and recovery of SQL Server virtual machines using an all-flash array.
Revolutionary Storage for Modern Databases, Applications and Infrastrcturesabnees
Sanjay Sabnis presented on next generation storage solutions for modern big data applications. He discussed how NVMe storage provides significantly higher performance than SATA, with speeds over 6x faster for reads and over 40x faster for writes. Pavilion Data offers an all-NVMe rack scale storage array that provides 120GB/s of throughput with DAS-level latency. This solution can meet the performance and scalability demands of big data workloads like MongoDB, Splunk, and containerized applications.
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld
The document discusses the future of software-defined storage in 3 years. It predicts that storage media will continue to advance with higher capacities and lower latencies using technologies like 3D NAND and NVDIMMs. Networking and interconnects like NVMe over Fabrics will allow disaggregated storage resources to be pooled and shared across servers. Software-defined storage platforms will evolve to provide common services for distributed data platforms beyond just block storage, with advanced data placement and policy controls to optimize different workloads.
The document summarizes Novell's roadmap for Open Enterprise Server 2 (OES2), including upcoming support pack 3 (SP3). SP3 will include enhancements to Domain Services for Windows, CIFS, QuickFinder, and iFolder. It also discusses the "Remote Office Appliance" which will help centrally manage remote sites. Long term, Novell is focusing on simplification, interoperability, and the "Ponderosa" vision of decoupling workloads and deploying appliances for the cloud or on-premise.
The document provides an introduction to NVMe over Fabrics, including:
- What NVMe over Fabrics is and its advantages like end-to-end NVMe semantics and low latency remote storage.
- How NVMe is being expanded to support message-based operations over various fabrics like RDMA, Fibre Channel, and Ethernet.
- Examples of how NVMe over Fabrics is being implemented in data center architectures and storage solutions.
This document provides an overview of a NoSQL Night event presented by Clarence J M Tauro from Couchbase. The presentation introduces NoSQL databases and discusses some of their advantages over relational databases, including scalability, availability, and partition tolerance. It covers key concepts like the CAP theorem and BASE properties. The document also provides details about Couchbase, a popular document-oriented NoSQL database, including its architecture, data model using JSON documents, and basic operations. Finally, it advertises Couchbase training courses for getting started and administration.
This document summarizes new features in .NET Framework 4.5, including improvements to WeakReferences, streams, ReadOnlyDictionary, compression, and large objects. It describes enhancements to server GC, asynchronous programming, the Task Parallel Library, ASP.NET, Entity Framework, WCF, WPF, and more. The .NET 4.5 update focuses on performance improvements, support for asynchronous code and parallel operations, and enabling modern app development patterns.
TechTalk: Connext DDS 5.2 - Faster and Easier Development of Industrial Internet Systems and Applications
Watch on-demand: https://youtu.be/j1G0MHC0Vwc
Gs08 modernize your data platform with sql technologies wash dcBob Ward
The document discusses the challenges of modern data platforms including disparate systems, multiple tools, high costs, and siloed insights. It introduces the Microsoft Data Platform as a way to manage all data in a scalable and secure way, gain insights across data without movement, utilize existing skills and investments, and provide consistent experiences on-premises, in the cloud, and hybrid environments. Key elements of the Microsoft Data Platform include SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Data Lake, and Analytics Platform System.
Big Data Architecture Workshop - Vahid Amiridatastack
Big Data Architecture Workshop
This slide is about big data tools, thecnologies and layers that can be used in enterprise solutions.
TopHPC Conference
2019
The document discusses accelerating Apache Hadoop through high-performance networking and I/O technologies. It describes how technologies like InfiniBand, RoCE, SSDs, and NVMe can benefit big data applications by alleviating bottlenecks. It outlines projects from the High-Performance Big Data project that implement RDMA for Hadoop, Spark, HBase and Memcached to improve performance. Evaluation results demonstrate significant acceleration of HDFS, MapReduce, and other workloads through the high-performance designs.
Similar to Architecture of a Next-Generation Parallel File System (20)
Breaking Free from Proprietary Gravitational PullGreat Wide Open
This document provides an overview and agenda for a presentation about breaking free from proprietary software and embracing open source. The presentation covers the business and legal considerations for open sourcing existing software projects, including ownership models, licensing strategies, and governance approaches. It also addresses how to structure R&D, sales, and support organizations to be successful with open source and how to build and invest in developer and user communities. The goal is to help companies chart a course to transition existing proprietary software to open source models and practices.
You Don't Know Node: Quick Intro to 6 Core FeaturesGreat Wide Open
This document provides an introduction to Node.js and discusses its core features including:
- Node.js is asynchronous and event-driven, allowing it to handle multiple requests simultaneously without blocking.
- It uses a single thread model with non-blocking I/O, utilizing an event loop to process tasks in parallel.
- Common data types like streams and buffers are used to handle binary data and large files efficiently without blocking the thread.
Andy Watson, an employee of Ionic Security, gave a presentation on properly using cryptography in applications. The presentation covered topics such as random number generation, hashing, salting passwords, key derivation functions, symmetric encryption algorithms and common mistakes made with cryptography. The goal was to help people avoid vulnerabilities like unsalted hashes, hardcoded keys, weak random number generation and improper encryption modes.
Lightning Talk - Getting Students Involved In Open SourceGreat Wide Open
Lightning Talks are presented by Opensource.com
Chris Aniszczyk
Executive Director (interim)
Cloud Native Computing Foundation
Great Wide Open 2016
Atlanta, GA
March 17th, 2016
The document discusses test automation using Selenium and provides guidance on best practices. It covers topics like test design approaches, automation-friendly test techniques, special test cases for things like data and graphics, and perspectives on test automation. The document also discusses test frameworks, libraries and patterns commonly used with Selenium. It provides examples of keyword-driven and behavior-driven test automation using domain-specific languages.
The document discusses how constraints can cultivate growth. It suggests 5 ways that constraints can help: 1) use fewer resources, 2) create regulations, 3) remove distractions, 4) self-organize, and 5) stretch your comfort zone. Constraints shape problems and provide clear challenges to overcome, helping to make decisions, improve experiences, increase productivity, work together, and grow and learn.
The document discusses best practices for running MySQL on Linux, covering choices for Linux distributions, hardware recommendations including using solid state drives, OS configuration such as tuning the filesystem and IO scheduler, and MySQL installation and configuration options. It provides guidance on topics like virtualization, networking, and MySQL variants to help ensure successful and high performance deployment of MySQL on Linux.
This document discusses search interfaces and principles. It begins with an introduction to the presenter and then covers topics like how search engines work, principles of good search design, and common front-end search patterns. Specific concepts discussed include indexing text, query analysis, scoring and ranking documents, filtering results, aggregations, autocomplete, highlighting search terms, and loading more results. The overall message is that search provides a powerful and flexible way to return relevant content to users.
This document provides an overview of open source software. It defines open source as software that is freely available with its source code and allows others to use, modify, and distribute the software. It discusses the main open source licenses like permissive, weak copyleft, and strong copyleft licenses. It also covers the different types of open source community governance models like walled gardens, benevolent dictators, and meritocracies. Finally, it provides tips for building open source communities through email lists, consensus, positivity, and sharing.
This document discusses principles of antifragile design. It emphasizes designing for diversity among users by understanding different mindsets and contexts through user research and data. It stresses iterating quickly based on feedback, sharing work publicly in early stages, and embracing uncertainty. Well-designed systems can evolve and adapt to users' changing needs over time by deciding on defaults instead of excessive options and customization.
This document discusses using Elasticsearch for SQL users. It covers search queries, data modeling, and architecture approaches. The agenda includes search queries, data modeling, and architecture. A live demo shows searching a single field, multiple fields, and phrases. Data modeling discusses analyzing or not analyzing fields. Relationships can be modeled through application joins, data denormalization, nested objects, or parent-child documents. Architecture approaches include using triggers, asynchronous replication, and forked writes from applications with or without Logstash.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
4. What is OrangeFS?
• OrangeFS is a next-generation parallel file system
• Based on PVFS
• Distributes file data across multiple file servers, leveraging any block-level file system
• Distributes metadata across 1 to all storage servers
• Supports simultaneous access by multiple clients, including Windows clients using the PVFS protocol directly
• Works with standard kernel releases and does not require custom kernel patches
• Easy to install and maintain
5. Why a Parallel File System?
HPC – Data Intensive (Parallel PVFS Protocol)
• Large datasets
• Checkpointing
• Visualization
• Video
• BigData
Unstructured Data Silos – Interfaces to Match Problems
• Unify dispersed file systems
• Simplify storage leveling
§ Multidimensional arrays
§ Typed data
§ Portable formats
6. Original PVFS Design Goals
§ Scalable
  § Configurable file striping
  § Non-contiguous I/O patterns
  § Eliminates bottlenecks in the I/O path
  § Needs no locks for metadata ops
  § Needs no locks for non-conflicting applications
§ Usability
  § Very easy to install; small VFS kernel driver
  § Modular design for disk, network, etc.
  § Easy to extend
-> Hundreds of research projects have used it, including dissertations, theses, etc.
7. OrangeFS Philosophy
• Focus on a broader set of applications
• Customer & community focused (>300-member-strong community & growing)
• Open source
• Commercially viable
• Enable research
9. System Architecture
• OrangeFS servers manage objects
• Objects map to a specific server
• Objects store data or metadata
• Request protocol specifies operations on one or more objects
• OrangeFS object implementation:
  • DB for indexing key/value data
  • Local block file system for the data stream of bytes
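The object model above can be sketched in a few lines: each object handle falls inside a range owned by exactly one server, so locating an object is a range lookup. This is a minimal illustration only; the class, method names, and the specific ranges are hypothetical, not OrangeFS's actual data structures or handle layout.

```python
# Sketch: mapping object handles to the server that owns them.
# Ranges and names are illustrative, not OrangeFS's real layout.
from bisect import bisect_right

class HandleMap:
    def __init__(self, ranges):
        # ranges: sorted, non-overlapping (first_handle, last_handle, server)
        self.ranges = sorted(ranges)

    def server_for(self, handle):
        # Find the last range whose start is <= handle, then check its end.
        starts = [r[0] for r in self.ranges]
        i = bisect_right(starts, handle) - 1
        first, last, server = self.ranges[i]
        if not (first <= handle <= last):
            raise KeyError("handle %d not in any range" % handle)
        return server

hmap = HandleMap([
    (1, 1000, "server0"),      # objects 1..1000 live on server0
    (1001, 2000, "server1"),
    (2001, 3000, "server2"),
])
print(hmap.server_for(1500))   # server1
```

Because every object (data or metadata) resolves to one server this way, a request naming several objects can be decomposed into per-server operations.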
11. Project Timeline
• 1994-2004: PVFS – design and development at CU (Dr. Ligon) + ANL (CU graduates)
• 2004-2010: PVFS2 – primary maintenance & development by ANL (CU graduates) + community
• 2007-2010: New PVFS branch – improved MD, stability, server-side operations, newer kernels, testing; services initially offered by Omnibond
• SC10 (fall 2010): OrangeFS announced with the community; now the mainline of future development as of 2.8.4; new development focused on a broader set of problems
• SC11 (fall 2011): 2.8.5 + Win – Windows client, stability, replicate on immutable
• Spring 2012: 2.8.6 + Webpack – performance improvements, Direct Lib + Cache, stability, WebDAV, S3
• Winter 2013: 2.8.7 + Webpack – performance improvements, stability
• Spring 2014: 2.8.8 + Webpack – performance improvements, stability, shared mmap, multi TCP/IP server homing, Hadoop MapReduce, user lib fixes, new spec file for RPMs + DKMS; available in the AWS Marketplace
• Summer 2014: 2.9.0 – distributed directory MD, capability-based security
• 2015: OrangeFS 3.0 – replicated MD and file data, 128-bit UUIDs for file handles, parallel background processes, web-based management UI, self-healing processes, data balancing
13. Server-to-Server Communications (2.8.5)
• Traditional metadata operation: a create request causes the client to communicate with all servers – O(p)
• Scalable metadata operation: a create request communicates with a single server, which in turn communicates with the other servers using a tree-based protocol – O(log p)
(Diagram: in both cases the app talks to the client middleware, which reaches the servers over the network.)
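A rough model of the savings the tree-based protocol buys: if every server that has received the request forwards it to one new server per round, the number of informed servers doubles each round. The functions below count communication rounds under that simple doubling assumption; they illustrate the asymptotics, not OrangeFS's actual request protocol.

```python
# Why tree-based metadata creates scale as O(log p) instead of O(p).
import math

def rounds_direct(p):
    # Client contacts all p servers itself: O(p) messages from one node.
    return p

def rounds_tree(p):
    # Informed servers double each round: ceil(log2(p)) rounds suffice.
    return math.ceil(math.log2(p)) if p > 1 else 0

for p in (4, 64, 1024):
    print(p, rounds_direct(p), rounds_tree(p))
```

At 1024 servers the direct scheme needs 1024 client messages while the tree finishes in 10 forwarding rounds, which is why the create path moved server-side in 2.8.5.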
14. Recent Additions (2.8.5)
• SSD metadata storage
• Replicate on immutable (file based)
• Windows client: supports 32/64-bit Windows Server 2008, R2, Vista, and 7
15. Direct Access Interface (2.8.6)
• Implements:
  • POSIX system calls
  • Stdio library calls
• Parallel extensions:
  • Noncontiguous I/O
  • Non-blocking I/O
  • MPI-IO library
• Found more boundary conditions; fixed in the upcoming 2.8.7
(Diagram: an app linked against the direct lib bypasses the kernel client core, reaching the servers over IB or TCP.)
16. Client Caching (2.8.6)
• The Direct Interface enables multi-process coherent client caching for a single client
(Diagram: client application → direct interface → client cache → file systems.)
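To make the role of the client cache concrete, here is a minimal sketch of a block cache shared by the processes of one client, in the spirit of what the Direct Interface's cache provides for a single node. The block granularity, LRU eviction, and write-through policy are illustrative assumptions, not the OrangeFS implementation.

```python
# Minimal sketch of a client-side block cache (illustrative only).
from collections import OrderedDict

class ClientBlockCache:
    def __init__(self, capacity=4):
        self.blocks = OrderedDict()     # (path, block_no) -> bytes, LRU order
        self.capacity = capacity

    def read(self, path, block_no, fetch):
        key = (path, block_no)
        if key in self.blocks:
            self.blocks.move_to_end(key)        # cache hit: no server trip
            return self.blocks[key]
        data = fetch(path, block_no)            # miss: fetch from the servers
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict least recently used
        return data

    def write(self, path, block_no, data, store):
        store(path, block_no, data)             # write-through: servers stay authoritative
        self.blocks[(path, block_no)] = data    # then refresh the cached copy

# Usage with a stand-in fetch function:
cache = ClientBlockCache()
cache.read("/data/file", 0, lambda p, b: b"block-0")
```

Because all processes on the client share one cache, repeated reads of the same block by different processes are served locally while writes remain visible to everyone, which is the coherence property the slide highlights.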
17. WebDAV (2.8.6 webpack)
• Supports the DAV protocol; tested with the Litmus DAV test suite
• Supports DAV cooperative locking in metadata
(Diagram: DAV clients → Apache → OrangeFS via the PVFS protocol.)
18. S3 (2.8.6 webpack)
• Tested using the s3cmd client
• Files accessible via the other access methods
• Containers are directories
• Accounting pieces not implemented
(Diagram: S3 clients → Apache → OrangeFS via the PVFS protocol.)
19. Summary – Recently Added to OrangeFS
• In 2.8.3:
  • Server-to-server communication
  • SSD metadata storage
  • Replicate on immutable
• 2.8.4, 2.8.5 (fixes, support for newer kernels):
  • Windows client
• 2.8.6 – performance, fixes, IB updates:
  • Direct access libraries (initial release)
  • Preload library for applications, including optional client cache
  • Webpack: WebDAV (with file locking), S3
20. OrangeFS on the AWS Marketplace
Available on the Amazon AWS Marketplace, brought to you by Omnibond.
(Diagram: OrangeFS instances – a unified high-performance file system – backed by EBS volumes, with DynamoDB.)
22. Hadoop JNI Interface (2.8.8)
• OrangeFS Java Native Interface
• Extension of the Hadoop FileSystem class -> JNI
• Buffering
• Distribution
• Fast PVFS protocol for remote configuration
23. Additional Items (2.8.8)
• Updated user lib
• Shared mmap support in the kernel module
• Support for kernels up to 3.11
• Multi-homing servers over IP
  • Clients can access a server over multiple interfaces (say, clients on IPoIB + clients on IPoEthernet + clients on IPoMX)
• Enterprise installers (coming shortly):
  • Client (with DKMS for the kernel module)
  • Server
  • Devel
25. Scaling Tests
16 storage servers with 2 LVM’d 5+1 RAID sets were tested with up to 32 clients, with read performance reaching nearly 12 GB/s and write performance reaching nearly 8 GB/s.
26. MapReduce over OrangeFS
• 8 Dell R720 servers connected with 10 Gb/s Ethernet
• The remote case adds an additional 8 identical servers and does all OrangeFS work remotely; only local work is done on the compute node (the traditional HPC model)
• *25% improvement with OrangeFS running remotely
27. MapReduce over OrangeFS
• 8 Dell R720 servers connected with 10 Gb/s Ethernet
• Remote clients are R720s with single SAS disks for local data (vs. 12-disk arrays in the previous test)
31. Distributed Directory Metadata (2.9.0)
• State management based on Giga+ (Garth Gibson, CMU)
• Improves access times for directories with a very large number of entries
[Diagram: directory entries DirEnt1–DirEnt6 spread across Server0–Server3 by extensible hashing]
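The extensible-hashing placement above can be sketched in a few lines of Python. This is a conceptual illustration only, not OrangeFS code: the server list, the SHA-1 hash, and the fixed depth are assumptions (in Giga+ the depth grows as the directory splits).

```python
# Sketch of Giga+-style extensible hashing for distributed directory
# entries. Server names and the hash depth are illustrative.
import hashlib

SERVERS = ["Server0", "Server1", "Server2", "Server3"]

def server_for_entry(name: str, depth: int) -> str:
    """Map a directory entry to a server using the low `depth` bits
    of its hash (the bucket count doubles each time the depth grows)."""
    h = int.from_bytes(hashlib.sha1(name.encode()).digest()[:8], "big")
    index = h & ((1 << depth) - 1)        # low-order bits select a bucket
    return SERVERS[index % len(SERVERS)]  # bucket -> server

# With depth 2 the entries spread across all four servers:
placement = {e: server_for_entry(e, 2)
             for e in ["DirEnt1", "DirEnt2", "DirEnt3", "DirEnt4"]}
```

Because placement is computed from the entry name alone, any client can locate an entry's server without a central lookup, which is what makes very large directories scale.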
32. [Diagram: a client presents a cert or credential, obtains a signed capability, and performs I/O using the signed capability; signatures backed by OpenSSL PKI]
• 3 Security Modes
  • Basic – OrangeFS/PVFS classic mode
  • Key-Based – keys are used to authorize clients for use with the FS
  • User Certificate Based with LDAP – user certs are used for access to the file system and are generated based on LDAP uid/gid info
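The signed-capability flow can be sketched as issue-then-verify. This is a minimal conceptual sketch: OrangeFS signs capabilities with OpenSSL PKI, but a plain HMAC stands in here so the example stays standard-library-only, and all field names and the key are made up.

```python
# Conceptual sketch of capability-based security: the server issues a
# signed capability, and I/O is honored only if the signature verifies.
# HMAC stands in for the real PKI signature; names are illustrative.
import hashlib
import hmac
import json

SERVER_KEY = b"demo-signing-key"  # stand-in for a real private key

def issue_capability(uid: int, handle: str, ops: list) -> dict:
    cap = {"uid": uid, "handle": handle, "ops": ops}
    blob = json.dumps(cap, sort_keys=True).encode()
    cap["sig"] = hmac.new(SERVER_KEY, blob, hashlib.sha256).hexdigest()
    return cap

def verify_capability(cap: dict) -> bool:
    body = {k: v for k, v in cap.items() if k != "sig"}
    blob = json.dumps(body, sort_keys=True).encode()
    good = hmac.new(SERVER_KEY, blob, hashlib.sha256).hexdigest()
    return hmac.compare_digest(cap["sig"], good)

cap = issue_capability(1000, "oid-1234", ["read", "write"])
assert verify_capability(cap)
cap["ops"].append("admin")        # tampering invalidates the signature
assert not verify_capability(cap)
```

The point of the design is that I/O servers can check the signature locally, without a round trip to whoever authenticated the client.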
34. Replication / Redundancy (OrangeFS 3.0)
• Redundant Metadata
  • Seamless recovery after a failure
  • Redundant objects from root directory down
  • Configurable
• Redundant Data
  • Update mode (real time, on close, on immutable, none)
  • Configurable number of replicas
  • Real-time “forked flow” work shows little overhead
• Replicate on Close
• Replicate to external (like LTFS)
  • Looking at supporting an HSM option to external (no local replica)
• Emphasis on continuous operation
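The update modes above amount to a policy about which copies receive a live write. A conceptual Python sketch, not the OrangeFS implementation; the enum and function names are illustrative.

```python
# Sketch of the configurable replication update modes
# (real time, on close, on immutable, none).
from enum import Enum

class UpdateMode(Enum):
    REAL_TIME = "real time"        # "forked flow": writes fan out to all copies
    ON_CLOSE = "on close"          # replicas sync when the file is closed
    ON_IMMUTABLE = "on immutable"  # replicate once the file is marked immutable
    NONE = "none"                  # no data replication

def targets_for_write(mode: UpdateMode, replicas: list) -> list:
    """Which copies receive a live write under each mode."""
    if mode is UpdateMode.REAL_TIME:
        return replicas            # forked flow: primary plus every replica
    return replicas[:1]            # otherwise only the primary; others sync later
```

The "little overhead" claim for real time makes sense under this model: the fork happens inside the data flow, so the client still issues one write.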
35. Handles -> UUIDs (OrangeFS 3.0)
• An OID (object identifier) is a 128-bit UUID that is unique to the data-space
• An SID (server identifier) is a 128-bit UUID that is unique to each server
• No more than one copy of a given data-space can exist on any server
• The (OID, SID) tuple is unique within the file system
• (OID, SID1), (OID, SID2), (OID, SID3) are copies of the same object on different servers
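The OID/SID scheme maps directly onto Python's `uuid` module; a conceptual sketch, not OrangeFS code.

```python
# Sketch of OrangeFS 3.0 handles as UUIDs: each data-space has a 128-bit
# OID, each server a 128-bit SID, and the (OID, SID) tuple names one copy.
import uuid

oid = uuid.uuid4()                       # object identifier (per data-space)
sids = [uuid.uuid4() for _ in range(3)]  # server identifiers (per server)

# The same object replicated on three different servers:
copies = {(oid, sid) for sid in sids}
assert len(copies) == 3                  # (OID, SID) tuples are unique

# "No more than one copy per server": a set keyed by (OID, SID)
# naturally rejects a duplicate placement on the same server.
copies.add((oid, sids[0]))
assert len(copies) == 3
```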
36. Server Location / SID Mgt (OrangeFS 3.0)
• In an Exascale environment with the potential for thousands of I/O servers, it will no longer be feasible for each server to know about all other servers
• Server Discovery
  • Servers will know a subset of their neighbors at startup (or neighbors may be cached from previous startups) – similar to DNS domains
  • Servers will learn about unknown servers on an as-needed basis and cache them – similar to DNS query mechanisms (root servers, authoritative domain servers)
• SID Cache: an in-memory DB to store server attributes
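The DNS-like discovery above might look like this in miniature. The class and the `resolve_via_neighbors` callback are hypothetical, not an OrangeFS API; the point is the resolve-on-miss-then-cache behavior.

```python
# Sketch of DNS-style lazy server discovery: each server starts with a
# subset of neighbors and resolves unknown SIDs on demand, caching results.
class SIDCache:
    def __init__(self, seed_neighbors: dict):
        self._cache = dict(seed_neighbors)   # sid -> server attributes

    def lookup(self, sid, resolve_via_neighbors):
        if sid not in self._cache:           # unknown: query neighbors,
            self._cache[sid] = resolve_via_neighbors(sid)
        return self._cache[sid]              # like a caching DNS resolver

cache = SIDCache({"sid-a": {"addr": "10.0.0.1"}})
info = cache.lookup("sid-b", lambda sid: {"addr": "10.0.0.2"})
assert info["addr"] == "10.0.0.2"
# Second lookup is served from cache; the resolver is never called again
# (the divide-by-zero resolver would raise if it were invoked):
assert cache.lookup("sid-b", lambda sid: 1 / 0)["addr"] == "10.0.0.2"
```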
37. Policy Based Location (OrangeFS 3.0)
• User-defined attributes for servers and clients
  • Stored in the SID cache
• Policy is used for data location, replication location, and multi-tenant support
• Completely flexible
  • Rack
  • Row
  • App
  • Region
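Policy-based placement can be sketched as attribute matching over the SID cache. All server names and attributes below are made up for illustration.

```python
# Sketch of policy-based data location: servers carry user-defined
# attributes (rack, row, region, ...) from the SID cache, and a policy
# is a set of attribute constraints to match.
servers = [
    {"sid": "s1", "rack": "r1", "region": "us-east"},
    {"sid": "s2", "rack": "r2", "region": "us-east"},
    {"sid": "s3", "rack": "r1", "region": "eu-west"},
]

def select_servers(policy: dict) -> list:
    """Return the servers matching every attribute in the policy."""
    return [s for s in servers
            if all(s.get(k) == v for k, v in policy.items())]

# Place data or replicas only in us-east, e.g. for tenant isolation:
assert [s["sid"] for s in select_servers({"region": "us-east"})] == ["s1", "s2"]
```

The same match could drive replica placement (e.g. "one copy per rack") or multi-tenant partitioning by tagging servers per tenant.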
38. Background Parallel Processing Infrastructure (3.0)
• Modular infrastructure to easily build background parallel processes for the file system
• Used for:
  • Gathering stats for monitoring
  • Usage calculation (can be leveraged for directory space restrictions, chargebacks)
  • Background safe FSCK processing (can mark bad items in MD)
  • Background checksum comparisons
  • Etc…
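The idea above can be caricatured as a pluggable task fanned out over a worker pool. In OrangeFS these jobs run in parallel across the file system's servers, so the thread pool here is only a stand-in, and the task and object names are illustrative.

```python
# Sketch of a modular background parallel job: any task (stats
# gathering, usage calculation, checksum comparison) is applied to a
# set of objects in parallel.
from multiprocessing.pool import ThreadPool

def usage_of(obj):
    """One unit of background work, e.g. per-owner usage accounting."""
    return obj["owner"], obj["bytes"]

def run_background_job(task, objects, workers=4):
    """Fan a pluggable task out across a worker pool."""
    with ThreadPool(workers) as pool:
        return pool.map(task, objects)

objs = [{"owner": "u1", "bytes": 10}, {"owner": "u2", "bytes": 20}]
assert dict(run_background_job(usage_of, objs, workers=2)) == {"u1": 10, "u2": 20}
```

Swapping `usage_of` for a checksum comparison or an FSCK check is what makes the infrastructure "modular": only the task changes, not the parallel machinery.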
40. Data Migration / Mgt (OrangeFS 3.x)
• Built on redundancy & DBG processes
• Migrate objects between servers
  • De-populate a server going out of service
  • Populate a newly activated server (HW lifecycle)
  • Moving computation to data
  • Hierarchical storage
• Use existing metadata services
• Possible - Directory Hierarchy Cloning
  • Copy on Write (Dev, QA, Prod environments with high % data overlap)
42. Attribute Based Metadata Search (OrangeFS 3.x)
• Client tags files with keys/values
• Keys/values indexed on metadata servers
• Clients query for files based on keys/values
• Returns file handles, with options for filename and path
[Diagram: key/value parallel query to metadata servers, then data file access]
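The tag/index/query cycle above can be sketched with an inverted index. The in-memory dictionary stands in for the metadata servers' real index, and the handle and key names are made up.

```python
# Sketch of attribute-based metadata search: clients tag file handles
# with key/value pairs, the index maps each pair to handles, and a
# query intersects the matching sets.
from collections import defaultdict

index = defaultdict(set)                 # (key, value) -> file handles

def tag(handle: str, **kv):
    """Client tags a file handle with key/value pairs."""
    for k, v in kv.items():
        index[(k, v)].add(handle)

def query(**kv) -> set:
    """Handles matching every key/value pair (intersection of postings)."""
    postings = [index[(k, v)] for k, v in kv.items()]
    return set.intersection(*postings) if postings else set()

tag("handle-1", project="apollo", type="log")
tag("handle-2", project="apollo", type="data")
assert query(project="apollo", type="log") == {"handle-1"}
```

Since the query returns handles rather than paths, resolving a filename or full path is an optional extra step, matching the slide above.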
44. Extend Capability-Based Security
• Enable certificate-level access (in process)
• Federated access capable
• Can be integrated with rules-based access control
  • Department x in company y can share with Department q in company z
  • Rules and roles establish the relationship
  • Each company manages its own control of who is in the company and in the department
45. SDN - OpenFlow
• Working with the OpenFlow research team at CU
• OpenFlow separates the control plane from delivery, giving the ability to control the network with software
• Looking at bandwidth optimization leveraging OpenFlow and OrangeFS
46. ParalleX
• ParalleX is a new parallel execution model
• Key components are:
  • Asynchronous Global Address Space (AGAS)
  • Threads
  • Parcels (message-driven instead of message-passing)
  • Locality
  • Percolation
  • Synchronization primitives
• High Performance ParalleX (HPX): a library implementation written in C++
47. PXFS
• Parallel I/O for ParalleX, based on PVFS
  • Common themes with OrangeFS Next
• Primary objective: unification of ParalleX and storage name spaces
  • Integration of AGAS and storage metadata subsystems
  • Persistent object model
• Extends ParalleX with a number of I/O concepts
  • Replication
  • Metadata
• Extending I/O with ParalleX concepts
  • Moving work to data
  • Local synchronization
• Effort with LSU, Clemson, and Indiana U.
  • Walt Ligon, Thomas Sterling
49. Johns Hopkins OrangeFS Selection
• JHU - HLTCOE selected OrangeFS
  • After evaluating: Ceph, GlusterFS, Lustre, and OrangeFS
“Leveraging OrangeFS for the parallel filesystem, the system as a whole is capable of delivering 30GB/s write, 46GB/s read, and between 37,260-237,180 IOPS of performance. The variation in IOPS performance is dependent on the file size and number of bytes written per commit as documented in the Test Results section.”*
“The final system design represents a 2,775% increase in read performance and a 1,763-11,759% increase in IOPS”*
* http://hltcoe.jhu.edu/uploads/publications/papers/14662_slides.pdf
50. Learning More
• www.orangefs.org web site
  • Releases
  • Documentation
  • Wiki
• pvfs2-users@beowulf-underground.org
  • Support for users
• pvfs2-developers@beowulf-underground.org
  • Support for developers
51. Support & Development Services
• www.orangefs.com & www.omnibond.com
• Professional support & development team
• Buy into the project
52. Omnibond Info - Solution Areas (Enterprise & Personal)
• Identity Manager Drivers & Sentinel Connectors
• Intelligent Transportation Solutions (Computer Vision)
• Parallel Scale-Out Storage Software
• Social Media Interaction System