Analyzing Small Files in HDFS Cluster
Presenters: Rohit Jangid, Raman Goyal
HDFS Analysis for Small Files
Outline
▪ What are small files and their problems?
▪ Small Files Analysis
  ▪ Architecture
  ▪ FsImage Processing and Aggregation
  ▪ Implementation and Tool
▪ Dashboards and Results
  ▪ Dashboards
  ▪ Results
▪ Future Work
▪ Conclusions
Expedia’s HDFS Cluster
HDFS Doesn’t Like Lots of Small Files…
Problem?
INEFFICIENT DATA ACCESS PATTERN
MAKES JOBS SLOW....
Trivial Solution?
Solution?
Compaction
BUT WHERE...?
SMALL FILES ANALYSIS
ARCHITECTURE
HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR) → Attributed Files and Directory Information → Aggregated Files and Directory Information → Storage → Dashboard
FsIMAGE PROCESSING
▪ Processed a 20 GB FsImage in ~20 Minutes
▪ Custom OIV Interpreter for Reduced Memory Usage
▪ FsImage Fetched from the NameNode and Run Through the OIV-to-LSR Interpreter
HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR)
Attribution and Aggregation
Attributes Found Directly
▪ Owner Name
▪ Group Name
▪ Size of File
▪ Replication Factor
▪ Number of Direct File Objects
▪ Last Modified Date
▪ Level of File
▪ Is File or Is Directory?
Aggregated Attributes (If Directory)
▪ Number of Small File Objects
▪ Number of Namespace Objects
▪ Smallest, Largest, Avg File Size
▪ Difference in Size Since Last Run
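The directly found attributes can be illustrated with a minimal sketch that splits one interpreted FsImage row. It assumes the row follows the lsr-style layout (permissions, replication, owner, group, size, date, time, path); the `Entry` class and `parse_lsr_row` function are hypothetical names for illustration, not part of the tool described in these slides.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Entry:
    path: str
    owner: str
    group: str
    size: int          # bytes; directories report 0
    replication: int   # 0 when the row shows "-" (directories)
    modified: datetime
    level: int         # depth below the root; "/" itself is level 0
    is_dir: bool


def parse_lsr_row(row: str) -> Entry:
    """Split one lsr-style row into its directly available attributes."""
    # Assumed layout: perms replication owner group size date time path.
    # maxsplit=7 keeps any spaces inside the path intact.
    perms, repl, owner, group, size, date, time, path = (
        row.rstrip("\n").split(None, 7)
    )
    return Entry(
        path=path,
        owner=owner,
        group=group,
        size=int(size),
        replication=0 if repl == "-" else int(repl),
        modified=datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M"),
        level=0 if path == "/" else path.rstrip("/").count("/"),
        is_dir=perms.startswith("d"),
    )
```

Because each row is handled on its own, a large interpreted FsImage can be streamed line by line rather than loaded into memory at once.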
Attribution and Aggregation
▪ Generate Small Files / Total Files Metrics
▪ Roll-up Attributes to Parent Directories
▪ Custom UDFs and Pig Scripts
▪ Stored in HDFS
▪ Using Sqoop
Attributed Files and Directory Information → Aggregated Files and Directory Information → Storage
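The roll-up step can be sketched in a few lines. The slides describe custom UDFs and Pig scripts; the Python below is only an illustrative stand-in for the same idea, not the authors' code. The 10 MB threshold, the `DirStats` class, and the `roll_up` function are assumptions made for this example.

```python
import posixpath
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

SMALL_FILE_BYTES = 10 * 1024 * 1024  # assumed "small file" threshold


@dataclass
class DirStats:
    files: int = 0
    small_files: int = 0
    total_bytes: int = 0
    smallest: Optional[int] = None
    largest: int = 0

    def add(self, size: int) -> None:
        self.files += 1
        self.total_bytes += size
        if size < SMALL_FILE_BYTES:
            self.small_files += 1
        self.smallest = size if self.smallest is None else min(self.smallest, size)
        self.largest = max(self.largest, size)

    @property
    def avg_size(self) -> float:
        return self.total_bytes / self.files if self.files else 0.0

    @property
    def small_ratio(self) -> float:
        # Small Files / Total Files metric for this directory subtree.
        return self.small_files / self.files if self.files else 0.0


def roll_up(files: Iterable[Tuple[str, int]]) -> Dict[str, DirStats]:
    """Credit every file to each ancestor directory, up to the root."""
    stats: Dict[str, DirStats] = defaultdict(DirStats)
    for path, size in files:
        parent = posixpath.dirname(path)
        while True:
            stats[parent].add(size)
            if parent == "/":
                break
            parent = posixpath.dirname(parent)
    return dict(stats)
```

A short usage example: `roll_up([("/a/b/x", 5 * 1024 * 1024), ("/a/z", 1024)])` yields entries for `/a/b`, `/a`, and `/`, each carrying the counts and size statistics of everything beneath it.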
STORAGE AND REPORTING
▪ Relational Database and REST API Dashboards
▪ Different Dashboards Showing User Level and Overall Level
Storage → REST API → Dashboard
Powered by Cyclotron: http://cyclotron.io
Implementation and Tool
▪ Download and Interpret: FsImage from the HDFS NameNode, Using OIV Interpreter
▪ Files and Directories Attributed: By Splitting FsImage Rows
▪ Small File & Directory Information: At Directory Level, Statistics like Smallest File Calculated
▪ Storage, REST API and Dashboards: Can Easily Add New Clusters in Tool
DASHBOARDS AND RESULTS
Dashboards Information
▪ 3 possible bucketing models
  ▪ For file size less than 10 MB
  ▪ For file size between 10 MB and 70 MB
  ▪ For file size between 70 MB and ~100 MB
▪ Goes up to all levels in HDFS
▪ Distribution of owners of small files
▪ Top 10 directories to be investigated for deletion, re-partition, compaction
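The three size ranges can be expressed as a small lookup, sketched below. The exact boundary handling (lower bound inclusive, upper bound exclusive), the use of exactly 100 MB for the "~100 MB" limit, and the names `BUCKETS`, `bucket_for`, and `bucket_counts` are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, Optional

MB = 1024 * 1024

# (label, exclusive upper bound in bytes); the 100 MB cap is an assumption
# standing in for the "~100 MB" limit on the slide.
BUCKETS = [
    ("< 10 MB", 10 * MB),
    ("10-70 MB", 70 * MB),
    ("70-100 MB", 100 * MB),
]


def bucket_for(size_bytes: int) -> Optional[str]:
    """Return the bucket label for a file, or None if it is not small."""
    for label, upper in BUCKETS:
        if size_bytes < upper:
            return label
    return None


def bucket_counts(sizes: Iterable[int]) -> Counter:
    """Count how many files fall into each small-file bucket."""
    return Counter(b for b in map(bucket_for, sizes) if b is not None)
```

Keeping the thresholds in one list makes it straightforward to try a different bucketing model without touching the counting logic.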
Overall Dashboard containing all Information
Distribution of Owners of Small Files
Sample Directories Containing Small Files
Top 10: Files vs Small Files
Daily Small Files per Directory
Future Work
1. Real-time analysis with alerting
   ▪ The tool doesn’t have real-time analysis yet.
   ▪ The cluster has 200+ million namespace objects, which we get as a memory dump from the Hadoop server.
   ▪ Translating and attributing each directory and file is a time-consuming process.
2. Developing a customisable compaction utility
Conclusions
Contact: EDWPMonitoring@expedia.com