Hadoop Archive and Tiering

To successfully archive and tier data in Hadoop, you must understand data heat, age, size, and usage. FactorData HDFSplus provides this visibility and enables automation and simplicity. The result is reduced infrastructure, better performance, and better planning for existing HDFS Hadoop clusters.

Slide 1: Hadoop Archive and Tiering (title slide)
Slide 2: Reasons for Storage Tiering with Hadoop
• A single tier leads to a large imbalance of compute and storage resources
• More applications create varying workloads
• A large percentage of data is cold in most cases
• More recently ingested data can be better balanced
• Fewer nodes per GB with archive nodes
• Lower infrastructure costs
[Diagram: archive node example. An existing-tier node (medium compute, medium capacity) next to a cold-tier node (low compute, high-density capacity, 4x less per GB), both registered with the name nodes.]
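The cold-tier node in the diagram maps onto a capability stock HDFS (2.6 and later) already exposes: an ARCHIVE storage type plus built-in storage policies such as HOT, WARM, and COLD. A minimal sketch of tagging a directory for the cold tier, assuming Hadoop 2.7+ (where FileSystem#setStoragePolicy is public) and a hypothetical namenode address and path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetColdPolicy {
    public static void main(String[] args) throws Exception {
        // Hypothetical namenode endpoint; adjust to your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // COLD keeps every replica on ARCHIVE storage; WARM keeps one replica on DISK.
        fs.setStoragePolicy(new Path("/data/cold"), "COLD");

        // The policy only tags the path; blocks are physically relocated when the
        // HDFS mover runs (e.g. `hdfs mover -p /data/cold`), typically in a maintenance window.
        fs.close();
    }
}
```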
Slide 3
• Over 65% less hardware
• 60% fewer nodes (software licensing)
• Significant performance improvement
• Immediate ROI for cloud and private infrastructures
[Diagram: a 10 PB single-tier HDFS cluster (data nodes, 100% of disk) compared with a 10 PB tiered cluster (archive data nodes holding 80% of disk, data nodes holding 20%) using 4x fewer nodes.]
"The price per GB of the ARCHIVE tier is 4x less" - eBay Hadoop Engineering Blog
Slide 4
• Access frequency of data is the most important metric for effective tiering.
• Age is the easiest to determine. CAUTION: some data is long-term active, so age cannot be the only criterion.
• Zero-byte and small files should be treated differently when tiering Hadoop.
• Large cold files should have priority for archiving.
• Knowing how long data remains accessed after ingest enables better capacity planning for your tiers.
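These heat and age signals can be read directly from HDFS file metadata. A rough sketch of a heat report, assuming the namenode records access times (dfs.namenode.accesstime.precision left at its default one-hour granularity, not set to 0) and a hypothetical /data root:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HeatReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        long now = System.currentTimeMillis();

        // Recursively walk the namespace; only metadata is fetched from the namenode.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/data"), true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            long ageDays  = (now - f.getModificationTime()) / 86_400_000L;
            long idleDays = (now - f.getAccessTime())       / 86_400_000L;
            long len = f.getLen();

            if (len < (1L << 20)) {
                // Zero-byte and small files: better handled by compaction than block tiering.
                System.out.println("SMALL " + f.getPath());
            } else if (idleDays > 90) {
                // Large, cold files get priority for the ARCHIVE tier.
                System.out.println("COLD age=" + ageDays + "d idle=" + idleDays + "d " + f.getPath());
            }
        }
        fs.close();
    }
}
```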
Slide 5: Tier Hadoop HDFS by Heat, Age, Size & Activity in Three Easy Steps
01 / Install without changes to the cluster: installed on a server or VM outside your existing Hadoop cluster, without inserting any proprietary technology on the cluster or in the data path.
02 / Visualize & report: report data usage (heat), small files, user activity, replication, and HDFS tier utilization.
03 / Automate optimization: automatically archive, promote, or change the replication factor of data based on usage patterns and user-defined rules; customize rules and queries to properly utilize infrastructure and plan better for future scale.
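Step 03 mentions changing the replication factor based on usage; in plain HDFS that is a single namenode call per file. A minimal sketch, with a hypothetical file path and an illustrative policy decision (dropping a cold file from 3 replicas to 2):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DemoteReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // Reduce the replication factor of an infrequently accessed file from 3 to 2;
        // the namenode schedules removal of the excess replicas in the background.
        boolean accepted = fs.setReplication(new Path("/data/cold/events-2014.parquet"), (short) 2);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}
```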
Slide 6: HDFSplus
1. Query list based on size, heat, activity, and age
2. Apply storage policy based on the custom query
3. Files are optimized during the normal balancing window

Custom query example:
• Move all files 120 days old and not accessed for 90 days to ARCHIVE…

Automated tiering:
• FactorData creates a data list based on the query
• Automated runs can be limited by maximum files or capacity
• FactorData tracks completion of each run
• Data can be excluded from a run according to path, size, and application
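The custom query above ("120 days old and not accessed for 90 days") can be expressed directly against HDFS metadata: filter by modification and access time, tag the matches with an archive policy, and let the mover relocate blocks in the next balancing window. A sketch of that rule, with the thresholds taken from the slide and everything else (paths, run cap) hypothetical:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ArchiveRule {
    static final long DAY_MS = 86_400_000L;

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        long now = System.currentTimeMillis();
        int tagged = 0, maxFiles = 10_000;   // cap each automated run

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/data"), true);
        while (it.hasNext() && tagged < maxFiles) {
            LocatedFileStatus f = it.next();
            boolean oldEnough  = now - f.getModificationTime() > 120 * DAY_MS;
            boolean idleEnough = now - f.getAccessTime()       >  90 * DAY_MS;
            if (oldEnough && idleEnough) {
                // COLD places every replica on ARCHIVE storage; blocks move when
                // `hdfs mover` runs during the normal balancing window.
                fs.setStoragePolicy(f.getPath(), "COLD");
                tagged++;
            }
        }
        System.out.println("files tagged for archive: " + tagged);
        fs.close();
    }
}
```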
Slide 7
• Completely out of the data path: FactorData HDFSplus sits outside the Hadoop cluster and collects only metadata from it.
• No software to install on the existing Hadoop cluster: because HDFSplus leverages only existing Hadoop APIs and features, there is nothing to install on the cluster.
• Highly scalable solution in a small footprint: HDFS visibility and automation for thousands of Hadoop nodes from a single node, VM, or server.
[Diagram: HDFSplus on a VM or physical machine (32 GB RAM, 4 CPUs or vCPUs, 500 GB free disk) communicating with the namenodes through the existing Hadoop API.]
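The "metadata only, from outside the cluster" design is consistent with how HDFS works: the namespace and usage statistics are served by the namenode over its standard RPC/HTTP interfaces, so an external VM never needs to read file contents from datanodes. A small sketch of such a probe, using only the stock FileSystem API and a hypothetical namenode address:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteMetadataProbe {
    public static void main(String[] args) throws Exception {
        // Runs on a machine outside the cluster: only namenode connectivity is required,
        // no agents on the datanodes and no file contents are read.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        ContentSummary cs = fs.getContentSummary(new Path("/"));
        System.out.println("files:            " + cs.getFileCount());
        System.out.println("directories:      " + cs.getDirectoryCount());
        System.out.println("bytes (logical):  " + cs.getLength());
        System.out.println("bytes (with rep): " + cs.getSpaceConsumed());
        fs.close();
    }
}
```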
Slide 8: Simplify and Automate Archive and Tiering in Hadoop Today
• Move less frequently accessed data to storage-dense nodes for better utilization
• Lower software licensing costs
• Free resources on existing namenodes and datanodes

Questions HDFSplus helps answer:
• How can we get more performance out of our existing Hadoop cluster?
• How can we move data not accessed for 90 days to archive nodes?
• How can we better plan for future scale with real Hadoop storage metrics?

Result: better performance, lower hardware costs, lower software costs. Plus: FactorData HDFSplus provides the storage visibility needed to answer these questions and more.