HDFS is the foundation storage layer for Big Data. This presentation gives an introduction to HDFS aimed at DBAs coming from a more traditional RDBMS background.
We will examine the architecture of HDFS and show how it enables the reliability, scalability and performance attributes of HDFS. We will examine HA for HDFS and data integrity. Finally we will highlight some configuration properties that are beneficial when running HDFS.
BIG DATA is all the rage
Almost as popular as Cloud
This is where we are dealing with datasets in the hundreds of TBs to petabytes
And using 100s to 1000s of CPUs in parallel to process this data
Aggregating the power of many servers as a single resource
The idea of this presentation is to show how concepts you are familiar with (in ASM)
Carry over to the world of HDFS
As DBAs & Systems folks you are in prime position to manage this coming wave
My name is Jason Arneil
Been in IT for around 18 years, both as an Oracle DBA and a System Administrator
The last 4 ½ years exclusively worked on the Exadata platform.
I’m really just dipping my toes in the Big Data world – it is all the rage though!
Blogged a bit in the past
You can find me on twitter
Became an Oracle ACE a couple of years ago
Now work in the Accenture Enkitec Group
I was quite struck when I saw this quote last month.
To me that smells of opportunity
Exponential data growth is a well known phenomenon
Many exabytes stored every day worldwide
This creates a storage problem
This does help us store more data
We now have 8TB as a fairly standard enterprise HDD size
Transfer speed at the very best is roughly 200MB/s
So to be able to run analysis on 10s or 100s of TBs of data
In a reasonable time frame
You are going to need LOTS of drives – 100s or 1000s of drives
(Scanning 100TB at 200MB/s would take nearly 6 days on a single drive; spread across 1,000 drives it drops to under 10 minutes)
The more concurrency you have the more drives you will need
More drives leads to more drive failure
So we need a mechanism to protect our data from drive failure
Storing redundant copies of data actually leads to even more drives being used
And more drive failures
Cloud storage company Backblaze have over 50,000 drives in their datacenters
They publish drive reliability stats from this real world situation – in a proper air-conditioned DC
While drive failure varies with age, their average failure rate was going on 5%
Source: https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/
We have to have some way of protecting our Data
Hardware RAID is an expensive solution – particularly at 100s of TB
Doesn’t provide data locality for analysing the data
Transfer of huge quantities of data to servers would be a massive bottleneck
Analysis of huge amounts of data more efficient if executed near the data it is operating on
Hadoop Distributed File System (HDFS™)
Hadoop is an open source project from the Apache Software Foundation
Has had a reasonable amount of time to develop, evolve and mature – but filesystems generally have a long (multi-decade) lifespan
Has its roots at Google – though the elephant logo is from a toy owned by the son of Doug Cutting, a Yahoo! engineer
Note the Distributed part – a filesystem that manages storage across a range of machines
That is the storage of those individual machines is presented as an aggregate
You can think of various layers in the hadoop world
With storage as the base layer
Followed by a method of allocating resources and scheduling tasks across the cluster – YARN (Yet Another Resource Negotiator)
Then various applications used for data analysis that can take advantage of these
Hadoop scales computation, storage and I/O bandwidth
ASM has its genesis all the way back in 1996 – the initial problem that led to it was related to video streaming!
Took 7 years from initial idea to released product
ASM is a clustered filesystem – not a distributed filesystem
Design goal was to be able to stripe data across 1000’s of disks
It would also be fault tolerant
HDFS is designed to be portable from one platform to another
It is designed to run on commodity hardware
A key goal is linear scalability both on data size and compute resources:
Doubling the number of nodes should halve the processing time on the same volume of data
Likewise doubling the data volume and the number of nodes should result in constant processing time
Essentially it uses a divide and conquer approach
You can buy it from Oracle: it runs on the Big Data Appliance
“Very large” here means files that are hundreds of megabytes, gigabytes, or terabytes in size
Petabyte-sized clusters are not unheard of
makes it easy to store large files: optimises sequential reading of data over latency
It’s likely on HDFS that analysis will read a large percentage of the entire dataset – very different from typical RDBMS usage
Reading most of the dataset efficiently is more important than the latency of reading first record
HDFS applications need a write-once-read-many access model for files.
A file once created, written, and closed need not be changed except for appends and truncates
Can append to a file, but cannot update at arbitrary point
HDFS not designed for low latency access
Lots of small files does not scale well on HDFS
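To make the write-once-read-many model concrete: appending is exposed through the Java API, but there is no call to modify bytes in place. A minimal sketch, assuming a placeholder NameNode URI (hdfs://namenode:8020) and an existing example file:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs://namenode:8020 is a placeholder for your cluster's NameNode URI
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // /data/events.log is a placeholder path; the file must already exist
        try (FSDataOutputStream out = fs.append(new Path("/data/events.log"))) {
            out.writeBytes("new record\n"); // adding to the end is fine...
        }
        // ...but there is no API to overwrite bytes at an arbitrary offset
    }
}
```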
metadata is critical to the operation of a filesystem
Essentially you can’t access the files stored without the metadata
When using an Oracle Database with ASM
We have an ASM instance in addition to the database instance
This ASM instance has a small portion of the RDBMS code
It is ASM that manages the metadata for the datafiles
Note metadata is stored (and protected) with the data in the diskgroups
ASM architecture up to 12c looks like this
Database on every node, ASM instance on every node
All accessing the same underlying drives where the data is
There is only 1 type of node and all nodes are identical
When it comes to the world of HDFS we have 2 types of nodes
Namenode: - Minimum of 1, mostly 2 for redundancy – we’ll come on to that
Namenode manages the filesystem namespace
Maintains filesystem tree and metadata for all files and directories
Regulates access to files by clients
The other type of node we have is the datanode
Many Datanodes in cluster – these are where the data is stored and where the computations and analysis are executed
These are all just standard servers – likely to be spread across multiple racks in the datacenter because you have so many of them
Datanodes responsible for serving read/write operations from HDFS clients
Datanodes also perform block creation/deletion and replication upon instruction from the Namenode
But how different is it really from the 12c Flex ASM architecture?
Here you no longer have ASM instances running on all nodes, and DB instances can run on nodes that don’t have ASM instances
Think of ASM as the “namenodes” – managing the metadata, with the databases being clients of the ASM instances
Analogy even better if you think about exadata
Where the storage is on standard servers running Linux, and where even some computation normally done at the database is offloaded to the storage
Hadoop is extending this idea all the way – all computation done where the storage resides
NameNode is critical in HDFS
Metadata is stored persistently on disk on the namenode in 2 files:
The namespace image (the namespace being the hierarchy of files and directories) and the edit log
The metadata is decoupled from the data
The Namenode also knows on which datanodes all the blocks for a given file reside –
remember the same block will exist on multiple datanodes
Block locations are not stored permanently on the namenode – this info is reconstructed from the periodic block reports the datanodes provide
This is stored in memory, and with many files this can become the limiting factor for scalability
We can federate the namespace – so multiple namenodes each manage a portion of the filesystem – but this is NOT HA
DataNodes send heartbeats every 3 secs – no heartbeat in 10 mins and the node is presumed dead, and the namenode schedules re-replication of the lost replicas
Durability of namespace maintained by write-ahead journal and checkpoints
Journal transactions persisted into edit log before replying to client
This records every change that occurs to file system metadata
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
Checkpoints periodically written to image file
Block locations discovered from DataNodes via block reports – these are NOT persisted on NameNode
This can lead to slow startup times of namenode
Creating a diskgroup in ASM implicitly creates a filesystem
Size is not specified and data is spread evenly across all disks
A new HDFS installation needs to be formatted (with hdfs namenode -format)
The formatting process creates an empty filesystem by creating the storage directories
and the initial versions of the namenode’s persistent data structures
Datanodes are not involved in the initial formatting process as the namenode manages the filesystem metadata
You don’t need to say how large a filesystem to create, as capacity is determined by the number of members of the cluster
So filesystem size can be increased with additional cluster members long after creation
A disk has a block size, typically 512 bytes, or 4K on modern drives – the minimum amount of data that can be read/written
A filesystem data block is a multiple of the disk block size, typically a few KB in size
In ASM files are written as a collection of extents
Extents are multiples of Allocation Units, typically going from 1MB up to 64MB but can be set higher
HDFS as a filesystem has concept of a block - 128MB by default but it is configurable
Files in HDFS broken into block sized chunks stored as independent units
File smaller than a full block DOES NOT occupy a full block of space
Reason for such a large block size is to minimise seek costs
Having a block abstraction enables a file to span multiple disks
Nothing to require all blocks from the same file to be on same drive
Blocks are fixed size, which simplifies metadata management – metadata doesn’t need to be stored with the blocks
Easy to calculate how many blocks can fit on a disk
Block concept also useful when it comes to replication and fault tolerance
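The block size is exposed as the dfs.blocksize configuration property and can even be overridden per file. A minimal Java sketch, with placeholder URI and path (this is normally set in hdfs-site.xml rather than in code):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default block size: 128MB here
        conf.setLong("dfs.blocksize", 128 * 1024 * 1024L);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Block size can also be set per file at creation time:
        // 4KB write buffer, 3 replicas, 256MB blocks for this file only
        try (FSDataOutputStream out = fs.create(
                new Path("/data/big.dat"), true, 4096, (short) 3, 256 * 1024 * 1024L)) {
            out.writeBytes("payload");
        }
    }
}
```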
A Client accesses the filesystem on behalf of a User by communicating with namenode and datanodes
The client can present a POSIX-like filesystem to the user –
user code does not need to know about namenodes/datanodes to function
HDFS interaction is mediated through a Java API
Can interact with the filesystem via an HTTP REST API – but slower than the Java API
Also a C library
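For example, reading a file through the Java FileSystem API looks like ordinary stream handling; the namenode/datanode conversation is hidden behind open(). A minimal sketch, placeholder URI and path again:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // open() asks the NameNode for metadata; the actual bytes
        // come straight from the DataNodes holding the blocks
        try (FSDataInputStream in = fs.open(new Path("/data/events.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```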
Datanode is workhorse of HDFS
Store and retrieve blocks when told by clients or namenode
Report back periodically to namenode with lists of blocks they are storing
Filesystem cannot function without NAMENODE
If the namenode were destroyed, all files on the filesystem would be lost as
there would be no way of reconstructing the files from the blocks on the datanodes
Vital to ensure resilience of namenode
There are a number of different options for ensuring NameNode resilience
ASM way ahead in terms of resilience
With non-flex ASM if we lose an ASM instance we only lose the DBs on that node
Other nodes keep working – definite advantage of cluster technology
And it’s even better with Flex ASM: if we lose an ASM instance all databases can carry on processing.
ASM Instance can also relocate if node fails
What we need to protect is the edit log and the image file
Hadoop can be configured to ensure the namenode writes persistent metadata to multiple filesystems
These are synchronous and atomic writes
The usual choice is to write to local disk and an NFS mount
The active namenode writes updates both locally and to the NFS share
The standby namenode also has access to the NFS share
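In configuration terms this is the dfs.namenode.name.dir property, which takes a comma-separated list of directories that are all written synchronously. It normally lives in hdfs-site.xml; shown here programmatically as a sketch, with placeholder paths:

```java
import org.apache.hadoop.conf.Configuration;

public class NameDirConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The NameNode writes its fsimage and edit log synchronously to
        // every directory listed. Pairing a local disk with an NFS mount
        // keeps a copy of the metadata off the NameNode host.
        conf.set("dfs.namenode.name.dir",
                 "/data/1/dfs/nn,/mnt/nfs/dfs/nn"); // paths are placeholders
        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}
```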
Even with a secondary namenode, it won’t be able to service requests until:
1 The namespace image is loaded into memory
2 The edit log is replayed
3 It has received enough block reports from datanodes
On a decent sized cluster this could be 30 mins! – not really high availability
One step up the availability ladder is to
run a secondary namenode
Does NOT act as a namenode – does not serve requests
Job is to merge namespace image with edit log
A copy of this merged namespace image can be used if the primary namenode fails
Note this lags the primary, so some data can be lost
This is still not high availability
And causes problem for routine maintenance and planned downtime
The previous options do not provide high availability of the filesystem
HA can be accomplished with a pair of namenodes in Active-Standby configuration
Standby can take over from Active node without significant delay
Namenodes MUST have highly available shared storage
Datanodes MUST send block reports to both nodes – Remember block mappings stored in MEMORY on namenode
Clients must be enabled to handle namenode failover
If active namenode fails
Standby can take over quickly because it has latest state available in memory
Both edit log and block mappings
Can use NFS filer or a Quorum Journal Manager (QJM)
QJM is recommended choice
QJM is a dedicated HDFS implementation
Solely designed for purpose of providing HA for edit log
QJM runs a group of journal nodes
Each edit must be written to a majority of these
Transition managed by a failover controller
The default implementation uses ZooKeeper to ensure only 1 namenode is active
Each namenode runs a heartbeat process
Can’t have active-active as we don’t have a cluster filesystem – can’t have multiple nodes writing to same file
Previously active namenode can be fenced – can use STONITH
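A sketch of the main HA-related properties, normally set in hdfs-site.xml but shown here programmatically to keep everything in Java; the "mycluster" nameservice and all host names/ports are placeholders:

```java
import org.apache.hadoop.conf.Configuration;

public class HaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Logical name for the NameNode pair
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn-host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn-host2:8020");
        // QJM: every edit must be written to a majority of these journal nodes
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
        // Clients use this proxy class to find whichever NameNode is active
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // ZooKeeper-based automatic failover, with ssh fencing of the old active
        conf.set("dfs.ha.automatic-failover.enabled", "true");
        conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
        conf.set("dfs.ha.fencing.methods", "sshfence");
    }
}
```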
HDFS has a permission model for files and directories that is very POSIX-like
3 types of perms: r, w, x
x is ignored for a file (no concept of executing a file) but is needed for directory access
Each file has an owner, group and mode (the mode is the perms for owner, group, and others)
Note by default Hadoop runs with security disabled
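Permissions can be manipulated through the Java API as well as the shell. A minimal sketch setting mode 750 on a directory (placeholder URI and path):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        // rwx for the owner, r-x for the group, nothing for others (i.e. 750)
        fs.setPermission(new Path("/data"),
                new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
    }
}
```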
Placement of replicas is critical to HDFS reliability and performance
The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization
HDFS’s placement policy is to put one replica on one node in the local rack,
another on a node in a different rack, and the last on a different node in the same rack as the previous
This policy cuts the inter-rack write traffic which generally improves write performance
The chance of rack failure is far less than that of node failure – so doesn’t reduce data availability
does reduce the aggregate network bandwidth used when writing data since a block is placed in only two unique racks rather than three
As long as you have an even chance of starting with a node in each rack, the data will be evenly distributed across all racks
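Placement itself is automatic, but the replication factor is tunable, cluster-wide or per file. A minimal sketch (placeholder URI and path):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication sets the default replica count (3 is the usual default)
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Replication can also be changed per file after the fact; the
        // namenode then schedules the extra copies (placement stays automatic)
        fs.setReplication(new Path("/data/hot.dat"), (short) 5);
    }
}
```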
I/O from the database goes direct to the disks – it does not go via ASM
To read a block client requests the list of replica locations from NameNode
For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
Client caches replica locations
Datanode Locations sorted by proximity to client
Data is read from the datanodes
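You can see this block-to-datanode mapping from client code. A sketch using getFileBlockLocations (placeholder URI and path):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/big.dat"));
        // Ask the NameNode which DataNodes hold each block of the file
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset %d, length %d, replicas on %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```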
A client request to create a file does not reach the NameNode immediately
The HDFS client caches the file data in a temp local file; application writes are transparently redirected to this temp local file
Once the local file accumulates data worth over one HDFS block size, the client contacts the NameNode
The NameNode inserts the file name into the file system hierarchy and allocates a data block for it
The client flushes the block of data from the local temporary file to the first DataNode in small portions
First Datanode sends the portions to the second datanode
Second datanode sends to third
Data is pipelined from one DataNode to the next.
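From the client’s perspective none of this pipelining is visible; a write is just a stream. A minimal sketch (placeholder URI and path):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        // Buffering, block allocation and the DataNode-to-DataNode
        // replication pipeline all happen behind this ordinary stream write
        try (FSDataOutputStream out = fs.create(new Path("/data/new.dat"))) {
            for (int i = 0; i < 1000; i++) {
                out.writeBytes("record " + i + "\n");
            }
        }
    }
}
```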
Datanodes tell the namenode which blocks they have via block reports
HDFS & ASM BOTH work best when blocks of a file are spread evenly across all disks
This gives best I/O performance
In ASM if new disks are added (or dropped)
A rebalance can ensure the data is evenly spread across all the disks
The balancer program (run with hdfs balancer) is a Hadoop daemon that redistributes blocks
Moves blocks from overutilized datanodes to underutilized ones
Still adheres to block replica placement policies
The cluster is deemed to be balanced when the utilization of every datanode (ratio of used space on the node to total capacity of the node)
differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster)
by no more than a given threshold percentage
Only one balancer operation may run on the cluster at one time
Balancer designed to run in background
Limits bandwidth used to move blocks around
Explain bit rot – the slow, silent corruption of bits on the physical storage media
As organisations store more data the possibility of silent disk corruptions grows
Can set the CONTENT.CHECK attribute on a diskgroup to ensure a rebalance will perform this logical content checking
“Each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode.
This is to guard against corruption due to ‘bit rot’ in the physical storage media”
“Because HDFS stores replicas of blocks, it can “heal” corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica”
HDFS checksums all data written to it, and by default when reading data
A separate checksum is created for every 512 bytes (by default)
The CRC is 4 bytes long, so less than 1% storage overhead
When clients read data from datanodes the checksum is verified
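The checksum chunk size is the dfs.bytes-per-checksum property, and a client can opt out of verification, e.g. to salvage a corrupt file. A sketch (placeholder URI and paths):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.bytes-per-checksum controls the chunk each CRC covers (default 512)
        conf.setInt("dfs.bytes-per-checksum", 512);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Verification is on by default; switch it off for this client to
        // copy out a file you want to inspect even though it is corrupt
        fs.setVerifyChecksum(false);
        fs.copyToLocalFile(new Path("/data/suspect.dat"), new Path("/tmp/suspect.dat"));
    }
}
```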
One thing about Hadoop, in addition to all the wacky names for things,
Is that the pace of change is phenomenal in comparison to the old school RDBMS world
I wanted to show a couple of snazzy things that are coming with the next HDFS release
It’s pretty inefficient space wise having to store 3 copies of the same data
Just to guarantee protection for your data
Erasure Coding is a way of encoding data so that the original data can be recovered from just a subset of the original
It sounds awfully similar to RAID 5/6 but with parity stored with the data, not on a separate device
Should consume way less space than triple mirroring with similar failure rates (e.g. a Reed-Solomon (6,3) policy stores 6 data blocks plus 3 parity blocks, a 1.5x overhead versus 3x for triple replication, while tolerating the loss of any 3 blocks)
However this will trade CPU cycles for space gains
A single DataNode manages multiple disks.
During normal write operation, disks will be filled up evenly. However, adding or replacing disks can lead to significant skew within a DataNode
This situation is not handled by the existing HDFS balancer, which concerns itself with inter-datanode skew – that is, BETWEEN different data nodes,
not intra-datanode skew – i.e. between disks within a data node! (Hadoop 3 adds an hdfs diskbalancer tool for this)
With Hadoop 3 you can increase the availability of your cluster by having an increased number of namenodes
It might even be the case that the Big Data world evolves so fast that HDFS begins to be superseded
A new kid on the storage block is KUDU
Which takes the best of HDFS’s sequential performance along with low-latency random access
You may have heard enough by now
As DBAs and Systems Folks HDFS is likely to feature in your organisations
And we are likely to be the folks managing that infrastructure
So best to be prepared!
Put a link and a book recommendation
Questions?
“DEFLATE is a compression algorithm whose standard implementation is zlib.”
Gzip is normally used to produce DEFLATE-format files (the gzip format is DEFLATE with extra headers and a footer)
The concept of splittable is very important –
a splittable format allows you to seek to any point in the stream and start reading from there
A non-splittable file format will have to have all its blocks processed by the same process – rather than by distributed processes