© Copyright 2014 EMC Corporation. All rights reserved.
Hadoop Analytics on Isilon Deep Dive
Claudio Fahey, Steve Hubbell
Isilon Scale-Out NAS Architecture
[Diagram: layered architecture. Layers: Client/Application Layer, Ethernet Layer (Gig-E, 10 Gig-E), OneFS Operating Environment (single FS/volume), Intra-cluster Communication Layer. Network protocols: CIFS, NFS, FTP, HTTP, HDFS for Hadoop, REST for Object.]
EMC Isilon HDFS Interface
• Isilon supports the HDFS interfaces for the DataNode and NameNode to host data and metadata
• Underlying file system is OneFS
• As simple as pointing the HDFS clients to the DNS name of the Isilon cluster!
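On the client side, pointing at the cluster usually comes down to a single entry in core-site.xml. The sketch below generates such a fragment. The hostname `isilon.example.com` is a placeholder, and the property name `fs.defaultFS` (Hadoop 2.x; older releases use `fs.default.name`) and port 8020 are assumptions to verify against your distribution and cluster settings.

```python
import xml.etree.ElementTree as ET


def core_site(namenode_host: str, port: int = 8020) -> str:
    """Build a minimal core-site.xml that points HDFS clients at one DNS name."""
    conf = ET.Element("configuration")
    prop = ET.SubElement(conf, "property")
    # Assumed property name for Hadoop 2.x clients.
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{namenode_host}:{port}"
    return ET.tostring(conf, encoding="unicode")


# Placeholder hostname; substitute the DNS name of your own cluster.
xml_text = core_site("isilon.example.com")
print(xml_text)
```

Because every client resolves the same DNS name, no per-node NameNode address has to be listed in the client configuration.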
Hadoop Architecture – Traditional DAS
[Diagram: two racks, each with a Rack Ethernet Switch and compute nodes whose local SATA storage holds both shuffle data and HDFS (Shuffle+HDFS). The rack switches connect to a Core Ethernet Switch. Link labels: 1 Gbps, 1+ Gbps.]
• The ratio of compute to disk space/performance is fixed.
• Non-local HDFS I/O (20-90% of HDFS I/O) will go through Ethernet.
• Local disk usage is shared between shuffle I/O (60% of all I/O during terasort) and HDFS I/O.
Hadoop Architecture – Isilon for HDFS
[Diagram: two racks, each with a Rack Ethernet Switch, compute nodes with local SATA shuffle storage, and Isilon HDFS nodes. The Isilon nodes are joined through an Isilon InfiniBand Switch (IB); the rack switches connect to a Core Ethernet Switch. Link labels: 10 Gbps; 1+ Gbps (used only for MR copy phase).]
• The number of compute and Isilon nodes can be adjusted independently to achieve the optimal ratio of compute and I/O bandwidth.
• HDFS I/O ALWAYS comes through a rack-local Isilon node, which collects data blocks from all other Isilon nodes across the InfiniBand fabric.
• Shuffle I/O (65% of all I/O during terasort) remains on local storage. This can be flash for optimal performance.
Hadoop Architecture – Isilon for HDFS+Shuffle
[Diagram: two racks, each with a Rack Ethernet Switch, compute nodes, Isilon HDFS nodes, and an Isilon Shuffle node. The Isilon nodes are joined through an Isilon InfiniBand Switch (IB); the rack switches connect to a Core Ethernet Switch. Link labels: 10 Gbps; 1+ Gbps (used only for MR copy phase).]
• The number of compute and Isilon nodes can be adjusted independently to achieve the optimal ratio of compute and I/O bandwidth.
• HDFS I/O ALWAYS comes through a rack-local Isilon node, which collects data blocks from all other Isilon nodes across the InfiniBand fabric.
• Shuffle I/O is also on an Isilon cluster. It can be a standalone S200 cluster or tier with one node per rack. This will support the high stream count needed for optimal merge sort operations.
MapReduce I/O Breakdown (DAS)
[Figure: share of total I/O by stage. Stage labels, in order: HDFS Read, Temp Write, Temp R/W, Temp Read, Temp Write, Temp Read, HDFS Write, HDFS Write. Shares, in order: 10%, 10%, 10%, 10%, 20%, 10%, 10%, 20%.]
Based on 1:1 transformation job and one on-disk merge sort pass.
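MapReduce I/O Breakdown (Isilon)
[Figure: share of total I/O by stage. Stage labels, in order: HDFS Read, Local Write, Local R/W, Local Read, Local Write, Local Read, HDFS Write. Shares, in order: 12.5%, 12.5%, 12.5%, 12.5%, 25%, 12.5%, 12.5%. One segment is marked N/A.]
Based on 1:1 transformation job and one on-disk merge sort pass.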
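One way to read the two breakdowns together: the Isilon shares look like the DAS shares with the final 20% HDFS Write segment removed and the rest rescaled to 100%. The sketch below checks that arithmetic. Treating the removed segment as replica traffic is an assumption made here for illustration, not something the slides state.

```python
# Shares of total I/O on the DAS slide, in the order they appear (percent).
das_shares = [10, 10, 10, 10, 20, 10, 10, 20]

# Assumption: the last 20% segment (the second "HDFS Write") has no
# counterpart on the Isilon slide, where one segment is marked N/A.
remaining = das_shares[:-1]
total = sum(remaining)  # 80

# Rescale the remaining segments so they sum to 100%.
isilon_shares = [share * 100 / total for share in remaining]
print(isilon_shares)
```

The rescaled values match the 12.5% and 25% figures shown on the Isilon slide.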
Isilon Performance with Hadoop
[Diagram: Compute Nodes and Isilon Nodes]
• Isilon is a scale-out system, like Hadoop
• HDFS on Isilon functions as a parallel file system
• Each compute node performs I/O on every Isilon node in the rack
• I/O bandwidth and storage capacity can be increased linearly simply by adding Isilon nodes
• Compute can be increased or decreased on the fly and can easily be virtualized
• With a mesh network that is faster than the disks, data locality is irrelevant
Disk Locality
• Traditional Hadoop was designed for SLOW star networks (1 Gbps).
• The only way to effectively deal with slow networks was to strive to keep all I/O local to the server. This is called disk locality.
• Disk locality is lost under several common situations:
– All nodes with a replica of the block are running the maximum number of tasks. This is very common for busy clusters!
– Input files are compressed with a non-splittable codec such as gzip.
– “Analysis of Hadoop jobs from Facebook underscores the difficulty in attaining disk-locality: overall, only 34% of tasks run on the same node that has the input data.” (reference: Disk-Locality in Datacenter Computing Considered Irrelevant, Ganesh Ananthanarayanan, University of California, Berkeley)
• Disk locality provides very low latency I/O; however, this latency has very little effect on batch operations such as MapReduce.
Disk Locality (cont.)
• Today, a non-blocking 10 Gbps switch (up to 2500 MB/sec full duplex) can provide more bandwidth than a typical disk subsystem with 8 disks (600 – 1200 MB/sec).
• We are no longer constrained to maintain data locality in order to provide adequate I/O bandwidth.
• This gives us much more flexibility in designing a cost-effective and feature-rich Hadoop architecture.
• Isilon provides rack-locality, not disk-locality.
• Amazon Elastic MapReduce offers users a choice between S3 storage and traditional HDFS (which is destroyed when the cluster terminates). When using S3, data locality is lost.
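The bandwidth comparison above can be reproduced with a few lines of arithmetic. This is a rough sketch: it uses decimal units and ignores protocol overhead, and the per-disk throughput of 75-150 MB/sec is an assumption chosen so that 8 disks give the 600-1200 MB/sec range quoted on the slide.

```python
def gbps_to_mb_per_sec(gbps: float) -> float:
    """Convert a line rate in gigabits/s to megabytes/s (decimal, no overhead)."""
    return gbps * 1000 / 8


one_way = gbps_to_mb_per_sec(10)  # one direction of a 10 Gbps port
full_duplex = 2 * one_way         # both directions combined

# Assumed sequential throughput per disk, in MB/sec.
disks = 8
disk_low, disk_high = disks * 75, disks * 150

print(one_way, full_duplex, (disk_low, disk_high))
```

Even in a single direction, the 10 Gbps port carries slightly more than the upper end of the assumed 8-disk range.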
Isilon N+M Data Protection
• For every N data blocks,
write M (+1 to +4)
parity blocks calculated
with Reed-Solomon.
• File data and protection
is striped across nodes,
allowing a single file to
use spindles & cache of
up to 20 nodes.
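A quick way to see what N+M protection costs in capacity is to compare the usable fraction N/(N+M) with plain mirroring. In the sketch below, only 10+2 comes from this deck; the other layouts are illustrative values.

```python
def usable_fraction(n_data: int, m_parity: int) -> float:
    """Fraction of raw capacity holding data under N+M protection."""
    return n_data / (n_data + m_parity)


def mirrored_fraction(copies: int) -> float:
    """Fraction of raw capacity holding data when every block is stored `copies` times."""
    return 1 / copies


# Illustrative layouts; 10+2 is the one used later in this deck.
for n, m in [(4, 1), (8, 2), (10, 2), (16, 4)]:
    print(f"{n}+{m}: {usable_fraction(n, m):.1%} usable")
print(f"3x mirroring: {mirrored_fraction(3):.1%} usable")
```

For 10+2 the usable fraction is about 83%, against about 33% for three full copies.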
File Striping: Nodes and Drives
• Do we really need the same protection against node and drive failures?
• A node has 12, 24, or 36 drives.
• +M:1 protection protects against M drive failures and one node failure
[Diagram: Node 1, Node 2, Node 3, Node 4]
File Striping: 10+2:1 illustrated
• N+M terms: 10+2 (10 data units, 2 parity units) per stripe
• Can survive two disk failures or one node failure
[Diagram: one stripe laid out across Node 1 through Node 6, with data units D0-D9 and parity units P0-P1. Labels: 128KB, 256KB, Stripe Unit.]
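To see why a 10+2 stripe on six nodes tolerates one node failure, count how many stripe units each node holds. The sketch below uses simple round-robin placement, which is an assumption for illustration and not a description of how OneFS actually chooses drives.

```python
def place_stripe(n_data: int, m_parity: int, nodes: int) -> dict:
    """Spread one stripe's data and parity units across nodes, round-robin."""
    units = [f"D{i}" for i in range(n_data)] + [f"P{i}" for i in range(m_parity)]
    layout = {node: [] for node in range(1, nodes + 1)}
    for i, unit in enumerate(units):
        layout[1 + i % nodes].append(unit)  # assumed placement rule
    return layout


m_parity = 2
layout = place_stripe(10, m_parity, 6)
worst_node_loss = max(len(units) for units in layout.values())

print(layout)
print("units lost if one node fails:", worst_node_loss)
```

With 12 units on 6 nodes, each node holds 2 units, so losing a node removes 2 units, which is exactly what 2 parity units can rebuild.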
HDFS Implemented As A NAS Protocol
• OneFS runs a daemon that speaks NameNode and DataNode natively
[Diagram: a DFSClient on a Hadoop node talks to the OneFS Clustered FileSystem, in which every OneFS node runs both NameNode and DataNode. Steps: 1) Request(“/file”), 2) Response (block locations), 3) GetBlock(block).]
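The three-step exchange can be mimicked with a toy model in which every node answers both kinds of request against one shared file system. This is not the HDFS wire protocol: the class, its method names, and the round-robin choice of block locations are all hypothetical and exist only to illustrate the flow.

```python
from itertools import cycle


class OneFSNode:
    """Toy node that plays both the NameNode and the DataNode role."""

    def __init__(self, name, shared_fs):
        self.name = name
        self.fs = shared_fs  # every node sees the same file system

    def get_block_locations(self, path, all_nodes):
        # NameNode role: tell the client which node to ask for each block.
        picker = cycle(all_nodes)
        return [(idx, next(picker).name) for idx in range(len(self.fs[path]))]

    def get_block(self, path, idx):
        # DataNode role: return the contents of one block.
        return self.fs[path][idx]


shared_fs = {"/file": [b"block0", b"block1", b"block2"]}
nodes = [OneFSNode(f"node{i}", shared_fs) for i in range(1, 5)]
by_name = {node.name: node for node in nodes}

# 1) Request("/file") to any node, 2) receive block locations,
# 3) GetBlock from each node named in the response.
locations = nodes[0].get_block_locations("/file", nodes)
data = b"".join(by_name[name].get_block("/file", idx) for idx, name in locations)
print(locations, data)
```

Because all nodes share one file system in this model, any node can serve any block, which is what lets the location response spread reads across the cluster.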
Hadoop Security – Standard Authentication
• Standard authentication in Hadoop is intended for an environment where Hadoop users are expected to be honest, although they may make mistakes such as accidentally deleting somebody else’s file.
• Hadoop clients simply pass the name of the logged-in user to the Hadoop service (Job Tracker, Name Node, etc.).
• Passwords are not validated by any Hadoop services.
Hadoop Security – Kerberos Authentication
• Before using any Hadoop service, a user must authenticate with a Kerberos server (with their password) to obtain a Kerberos ticket.
• The Kerberos ticket must be passed to the Hadoop service.
• Active Directory can be used directly or an MIT Kerberos server can be configured to pass authentication requests to an Active Directory server.
Hadoop Security – Standard Permissions
• Standard Hadoop only provides basic Unix-type permissions
– Each file or directory is assigned an owner and a group.
– Read and/or write permissions can be assigned to the owner, the group, and everyone else.
– What do you do when you need to assign read access to group A, group B, and group C?
– What do you do when you need to assign read access to group A and read+write access to group B?
– How do you maintain permissions when files are copied from Windows NTFS shares?
Hadoop Security – ACLs with Isilon
• Hadoop on Isilon provides full ACLs for NFS, SMB, and HDFS
– Each file and directory has an Access Control List (ACL) consisting of one or more Access Control Entries (ACE).
– Each ACE assigns a set of permissions (read, write, delete, etc.) to a specific security identifier (user or group).
– In addition to the usual Allow ACEs which grant permissions to users and are additive, there are Deny ACEs which remove permissions and override any Allow ACEs.
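The Allow/Deny rules described above can be modelled in a few lines. This is a deliberately simplified sketch: the `ACE` class, the group names, and the permission names are hypothetical, and real ACL evaluation also depends on details such as ACE order and inheritance that are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ACE:
    kind: str         # "allow" or "deny"
    identity: str     # user or group name
    perms: frozenset  # e.g. frozenset({"read", "write"})


def effective_perms(acl, identities):
    """Union of matching Allow ACEs, minus anything a matching Deny ACE removes."""
    allowed, denied = set(), set()
    for ace in acl:
        if ace.identity in identities:
            (allowed if ace.kind == "allow" else denied).update(ace.perms)
    return allowed - denied


# Read for group A, read+write for group B, and no write for contractors.
acl = [
    ACE("allow", "groupA", frozenset({"read"})),
    ACE("allow", "groupB", frozenset({"read", "write"})),
    ACE("deny", "contractors", frozenset({"write"})),
]

print(sorted(effective_perms(acl, {"alice", "groupA"})))
print(sorted(effective_perms(acl, {"bob", "groupB"})))
print(sorted(effective_perms(acl, {"carol", "groupB", "contractors"})))
```

The same ACL covers the cases that a single owner/group/other permission set cannot express, such as different rights for two groups on one file.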
Recommended Directory Structure for Hadoop
• Change Hadoop root to /ifs/hadoop
– isi hdfs --root-path=/ifs/hadoop
• Hadoop system directories will be in /ifs/hadoop
– user, tmp, yarn, mapred, var, hbase, …
• To expose data in other directories to HDFS, create soft links.
– ln -s /ifs/data /ifs/hadoop/data
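The effect of the soft link can be tried out on any POSIX machine. In the sketch below, temporary directories stand in for /ifs and the file name `events.csv` is made up; the point is only that a link placed under the Hadoop root makes an outside directory reachable through that root.

```python
import os
import tempfile

# Temporary directories stand in for /ifs on a real cluster.
ifs = tempfile.mkdtemp()
data_dir = os.path.join(ifs, "data")
hadoop_root = os.path.join(ifs, "hadoop")
os.makedirs(data_dir)
os.makedirs(hadoop_root)

# A sample file that lives outside the Hadoop root.
with open(os.path.join(data_dir, "events.csv"), "w") as f:
    f.write("id,value\n1,42\n")

# Equivalent of: ln -s /ifs/data /ifs/hadoop/data
link = os.path.join(hadoop_root, "data")
os.symlink(data_dir, link)

print(os.listdir(link))
print(os.path.islink(link))
```

The data is not copied; the path under the Hadoop root resolves to the original directory.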
Isilon Access Zones for HDFS
• A single Isilon cluster can be split into multiple Access Zones.
Each Access Zone can have its own:
– HDFS root directory
– SmartConnect Zone (IP Addresses) for Name Nodes
– Active Directory users, groups, and authentication
– LDAP users and groups
– Kerberos authentication
• Uses:
– Multi-tenancy
– Multiple Hadoop compute clusters using the same Isilon cluster
– High-security zones
(Coming Soon - July 2014)
Efficiency and flexibility
The Isilon Advantage for Hadoop
• No data ingest necessary
• Eliminate 3x mirroring
• Over 80% storage utilization
• SmartDedupe to further reduce storage needs by up to 30%
• Scale compute and data independently
• Multi-protocol access
• Simultaneous multi-distribution support
• Ability to leverage VMware vSphere Big Data Extensions to reduce datacenter footprint, power, space, and cooling
Data protection and security
The Isilon Advantage for Hadoop
• Highly resilient architecture
• Robust data protection options (DR, snapshots, etc.)
• Clustered Name Node
• SEC 17a-4 compliant WORM
• Kerberos authentication
• Hadoop multi-tenancy
How Do I Start Using Hadoop?
EMC Isilon Hadoop Starter Kit (HSK)
• Visit https://community.emc.com/docs/DOC-26892
• Watch the demo video
• Follow the instructions to deploy Hadoop to your existing Isilon and VMware infrastructure in about an hour
• There are customized HSKs for Apache, Pivotal, Cloudera, and Hortonworks
Be Sure to See…
• Isilon’s Keynote by Bill Richter, President, Isilon: Wed 2:30 PM
• Breakout Session: Bringing the Power of Cloudera’s Enterprise Data Hub Edition to Isilon: Tue 3:00 PM and Wed 11:30 AM
• Hands-on Lab: Deploying Hadoop with EMC Isilon and VMware
• Isilon Booth (#123): See our Hadoop videos!
EMC Isilon Sessions
Session ID | Title | Type | Day | Start Time | End Time | Room
isilon.01 | Isilon: Scale-Out NAS Overview & Future Directions | Lecture | Mon | 08:30 AM | 09:30 AM | Marcello 4403
isilon.01 | Isilon: Scale-Out NAS Overview & Future Directions | Lecture | Wed | 10:00 AM | 11:00 AM | Murano 3205
isilon.02 | Isilon: MobileIQ & Syncplicity On Isilon | Lecture | Wed | 10:00 AM | 11:00 AM | Marcello 4403
isilon.02 | Isilon: MobileIQ & Syncplicity On Isilon | Lecture | Mon | 08:30 AM | 09:30 AM | Murano 3203
isilon.03 | Isilon: Hadoop Analytics + Isilon Scale-Out Storage | Lecture | Wed | 04:00 PM | 05:00 PM | Delfino 4001 A
isilon.03 | Isilon: Hadoop Analytics + Isilon Scale-Out Storage | Lecture | Tue | 08:30 AM | 09:30 AM | San Polo 3405
isilon.04 | Isilon: Advanced Troubleshooting Of Isilon Clusters | Lecture with Demonstration | Mon | 03:00 PM | 04:00 PM | Lido 3003
isilon.04 | Isilon: Advanced Troubleshooting Of Isilon Clusters | Lecture with Demonstration | Thu | 10:00 AM | 11:00 AM | Lando 4203
isilon.05 | Isilon: Scale-Out NAS Solutions In Surveillance | Lecture | Thu | 01:00 PM | 02:00 PM | San Polo 3405
isilon.05 | Isilon: Scale-Out NAS Solutions In Surveillance | Lecture | Tue | 03:00 PM | 04:00 PM | Murano 3201 A
isilon.06 | Isilon: Scale-Out NAS Solutions In Healthcare & Life Sciences | Lecture | Wed | 04:00 PM | 05:00 PM | San Polo 3401 A
isilon.06 | Isilon: Scale-Out NAS Solutions In Healthcare & Life Sciences | Lecture | Mon | 12:00 PM | 01:00 PM | San Polo 3403
isilon.08 | Isilon: OneFS - Scale-Out NAS Architecture Overview | Lecture | Thu | 08:30 AM | 09:30 AM | Lando 4205
isilon.08 | Isilon: OneFS - Scale-Out NAS Architecture Overview | Lecture | Tue | 01:30 PM | 02:30 PM | Murano 3205
isilon.12 | Isilon: Insider's Peek Under The Covers At Striping, Data Structures & More | Lecture | Thu | 11:30 AM | 12:30 PM | Murano 3203
isilon.12 | Isilon: Insider's Peek Under The Covers At Striping, Data Structures & More | Lecture | Tue | 01:30 PM | 02:30 PM | Delfino 4001 A
isilon.14 | Isilon: Highest Performance Scale-Out NAS | Lecture | Wed | 08:30 AM | 09:30 AM | Delfino 4005
isilon.14 | Isilon: Highest Performance Scale-Out NAS | Lecture | Mon | 03:00 PM | 04:00 PM | San Polo 3403
isilon.15 | Isilon: InsightIQ - Get Better Visibility Into Your Isilon Cluster | Lecture | Mon | 12:00 PM | 01:00 PM | Lando 4205
isilon.15 | Isilon: InsightIQ - Get Better Visibility Into Your Isilon Cluster | Lecture | Thu | 11:30 AM | 12:30 PM | Toscana 3602
isilon.16 | Isilon: Hadoop Analytics On Isilon Deep Dive | Lecture | Wed | 08:30 AM | 09:30 AM | Murano 3205
isilon.16 | Isilon: Hadoop Analytics On Isilon Deep Dive | Lecture | Mon | 01:30 PM | 02:30 PM | Murano 3205
isilon.19 | Isilon: Maximizing Performance From Your Isilon Clusters | Lecture with Demonstration | Thu | 01:00 PM | 02:00 PM | Lido 3003
isilon.19 | Isilon: Maximizing Performance From Your Isilon Clusters | Lecture with Demonstration | Tue | 12:00 PM | 01:00 PM | Delfino 4003
isilon.20 | Isilon: Hadoop Analytics With Cloudera & Isilon Scale-Out NAS | Lecture | Wed | 11:30 AM | 12:30 PM | Lido 3005
isilon.20 | Isilon: Hadoop Analytics With Cloudera & Isilon Scale-Out NAS | Lecture | Tue | 03:00 PM | 04:00 PM | Palazzo C