
HDFS Analysis for Small Files


Published in: Technology

  1. Analyzing Small Files in HDFS Cluster (HDFS Analysis for Small Files). Presenters: Rohit Jangid and Raman Goyal.
  2. Outline ▪ What are small files and their problems? ▪ Small Files Analysis ▪ Architecture ▪ FsImage Processing and Aggregation ▪ Implementation and Tool ▪ Dashboards and Results (Dashboards, Results) ▪ Future Work ▪ Conclusions
  3. Expedia’s HDFS Cluster
  4. Problem? HDFS doesn’t like lots of small files…
  5. INEFFICIENT DATA ACCESS PATTERN
  6. MAKES JOBS SLOW...
  7. Trivial Solution?
  8. Solution? Compaction
  9. BUT WHERE...?
  10. SMALL FILES ANALYSIS
  11. ARCHITECTURE: HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR) → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard
  12. FsIMAGE PROCESSING: HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR). Fetched from NameNode. OIV to LSR interpreter. Custom OIV interpreter for reduced memory usage. Processed a 20 GB FsImage in ~20 minutes.
  13. ARCHITECTURE: HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR) → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard
  14. Attribution and Aggregation. Attributes found directly: owner name; group name; size of file; replication factor; number of direct file objects; last modified date; level of file; is file or is directory? Aggregated attributes (if directory): number of small file objects; number of namespace objects; smallest, largest, and average file size; difference in size since last run.
  15. Attribution and Aggregation: Attributed Files and Directory information → Aggregated Files and Directory information → Storage. Generate small files / total files metrics. Roll up attributes to parent directories. Custom UDFs and Pig scripts. Using Sqoop. Stored in HDFS.
  16. ARCHITECTURE: HDFS Cluster → Raw FsImage → Interpreted FsImage (LSR) → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard
  17. STORAGE AND REPORTING: Storage and Dashboard. Relational database and REST API. Different dashboards showing user level and overall level. REST API. Powered by Cyclotron: http://cyclotron.io
  18. Implementation and Tool. Files and directories attributed. Small file and directory information. Download and interpret. HDFS NameNode. At directory level. Statistics like smallest file calculated. Using OIV interpreter. By splitting FsImage rows. Storage, REST API and dashboards. Can easily add new clusters in tool.
  19. DASHBOARDS AND RESULTS
  20. Dashboards Information. 3 possible bucketing models: file size less than 10 MB; file size between 10 MB and 70 MB; file size between 70 MB and ~100 MB. Goes up to all levels in HDFS. Distribution of owners of small files. Top 10 directories to be investigated for deletion, re-partition, compaction.
  21. Overall Dashboard containing all Information
  22. Distribution of Owners of Small Files
  23. Sample Directories Containing Small Files
  24. Top 10: Files vs Small Files
  25. Daily Small Files per Directory
  26. Future Work. 1: Doesn’t have real-time analysis with alerting! The cluster has 200+ million namespace objects that we get as a memory dump from the Hadoop server. Translating and attributing each directory and file is a time-consuming process. 2: Developing a customisable compaction utility.
  27. Conclusions. EDWPMonitoring@expedia.com
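
The attribution and roll-up steps on slides 14, 15 and 20 can be illustrated with a small sketch. This is not the presenters' implementation (the slides mention custom UDFs and Pig scripts, which are not shown); it is a minimal Python illustration under stated assumptions. The input format (plain `(path, size_in_bytes)` pairs), the names `bucket_for`, `DirStats`, `ancestors` and `roll_up`, the exclusive bucket bounds, and reading "~100 MB" as exactly 100 MB are all hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple
import posixpath

MB = 1024 * 1024

# Upper bounds for the three buckets named on the dashboard slide.
# Assumptions: each bound is exclusive, and "~100 MB" is taken as exactly 100 MB.
BUCKETS = [("lt_10mb", 10 * MB), ("10_to_70mb", 70 * MB), ("70_to_100mb", 100 * MB)]


def bucket_for(size: int) -> Optional[str]:
    """Return the bucket name for a file size, or None if the file is not small."""
    for name, upper in BUCKETS:
        if size < upper:
            return name
    return None


@dataclass
class DirStats:
    """Aggregated attributes for one directory (hypothetical structure)."""
    total_files: int = 0
    small_files: int = 0
    total_bytes: int = 0
    smallest: Optional[int] = None
    largest: int = 0

    def add(self, size: int) -> None:
        """Fold one file's size into this directory's aggregates."""
        self.total_files += 1
        self.total_bytes += size
        if bucket_for(size) is not None:
            self.small_files += 1
        self.smallest = size if self.smallest is None else min(self.smallest, size)
        self.largest = max(self.largest, size)

    @property
    def avg_size(self) -> float:
        return self.total_bytes / self.total_files if self.total_files else 0.0


def ancestors(path: str) -> Iterator[str]:
    """Yield every parent directory of a path, from the nearest up to the root."""
    parent = posixpath.dirname(path)
    while parent:
        yield parent
        if parent == "/":
            break
        parent = posixpath.dirname(parent)


def roll_up(files: Iterable[Tuple[str, int]]) -> Dict[str, DirStats]:
    """Roll each file's size up into all of its parent directories."""
    stats: Dict[str, DirStats] = defaultdict(DirStats)
    for path, size in files:
        for directory in ancestors(path):
            stats[directory].add(size)
    return stats
```

Calling `roll_up([("/data/a/f1", 1 * MB), ("/data/a/f2", 50 * MB)])` returns one `DirStats` entry each for `/data/a`, `/data` and `/`, so the small-files / total-files ratio can be read at any directory level. Walking every ancestor for every file is simple but repeats work; at the 200+ million object scale mentioned on the Future Work slide, a distributed job would be the more realistic fit.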
