Indic threads pune12-comparing hadoop data storage
The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012.


Published in: Technology



  • 1. Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig). Rakesh Jadhav, SAS
  • 2. Agenda
      • Hadoop Ecosystem
      • HDFS
      • HBase
      • Hive
      • Pig
  • 3. Hadoop Ecosystem
  • 4. Hadoop Ecosystem Components
      • HDFS: Hadoop Distributed File System
      • MapReduce: Hadoop distributed programming paradigm
      • HBase: Hadoop column-oriented database for random-access read/write of smaller data
      • Hive: Hadoop petabyte-scalable data warehousing infrastructure
      • Pig: Hadoop data flow/analysis infrastructure
      • Zookeeper: Hadoop coordination and configuration service infrastructure
      • Chukwa: Hadoop monitoring service
      • Avro: Hadoop data serialization/deserialization infrastructure
      • Mahout: Hadoop scalable machine learning library
  • 5. HDFS (Data Storage): Design Features
      • Failure is the norm, not the exception
      • Designed for large datasets rather than small ones
      • Designed for batch processing rather than interactive use
      • Supports write-once, read-many access
      • Provides interfaces to move processing closer to the data
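  • 6. HDFS: Application Areas and Limitations
      • Application areas
          • Large log processing
          • Web search indexing
      • Limitations
          • Small files problem (many small files are stored inefficiently)
          • Single point of failure (the NameNode)
          • No random access
          • No support for updating data in place (files are write-once)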
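The small files limitation can be made concrete with a rough estimate. The sketch below is only illustrative: the figure of about 150 bytes of NameNode memory per file or block object is a commonly quoted rule of thumb, and the 128 MiB block size is an assumption; neither number comes from the slides.

```python
# Rough NameNode memory estimate for the same data stored as small vs. large files.
# ASSUMPTION: about 150 bytes of NameNode memory per file or block object
# (a commonly quoted rule of thumb, not a figure from the slides).
BYTES_PER_OBJECT = 150
# ASSUMPTION: 128 MiB block size (configurable in a real cluster).
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)  # one file object plus its block objects
    return objects * BYTES_PER_OBJECT


ONE_MIB = 1024 ** 2
ONE_GIB = 1024 ** 3
ONE_TIB = 1024 ** 4

# The same 1 TiB of data, stored two ways.
small = namenode_bytes(num_files=ONE_TIB // ONE_MIB, file_size=ONE_MIB)
large = namenode_bytes(num_files=ONE_TIB // ONE_GIB, file_size=ONE_GIB)

print(f"1 MiB files: {small / 1e6:.1f} MB of NameNode memory")
print(f"1 GiB files: {large / 1e6:.1f} MB of NameNode memory")
```

Under these assumptions the metadata cost grows with the number of files and blocks rather than with the number of bytes stored, which is why a large count of small files is a problem.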
  • 7. HBase (Data Storage): Design Features
      • Key-value store (like a map)
      • Semi-structured data
      • Column family, timestamp
      • Key = RowKey.ColumnFamily.ColumnName.TimeStamp
      • De-normalized data
      • Faster data retrieval using column families
      • Static column families, dynamic columns
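The key structure on this slide can be sketched as a nested map. This is a toy in-memory model, not the HBase client API; the class name `VersionedStore` and the family name `personal` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class VersionedStore:
    """Toy model of (row key, column family, column, timestamp) -> value."""

    def __init__(self, families):
        # Column families are fixed when the table is created (static).
        self.families = set(families)
        # (row, family, column) -> {timestamp: value}
        self.cells = defaultdict(dict)

    def put(self, row, family, column, timestamp, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a family are created on first write (dynamic).
        self.cells[(row, family, column)][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        versions = self.cells.get((row, family, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions.get(timestamp)


# "personal" is a hypothetical family name; timestamps 1 and 2 stand in for T1 and T2.
store = VersionedStore(families=["personal"])
store.put("1", "personal", "age", 1, 25)
store.put("1", "personal", "age", 2, 35)

latest_age = store.get("1", "personal", "age")
older_age = store.get("1", "personal", "age", timestamp=1)
```

Here `latest_age` is the value written at the highest timestamp and `older_age` is the earlier version, which mirrors how each cell keeps multiple timestamped values.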
  • 8. RDBMS v/s HBase: Example
      RDBMS
      ID | Name | Age | Birth-Place | Marital Status | Location | Weight | Employer
      1  | Sam  | 35  | Mumbai      | Married        | Pune     | 76     | XYZ
      2  | Bob  | 56  | Chicago     | Married        | New York | 79     | PQR

      HBase
      Row Key | Personal Information (Column Family) | Other Information (Column Family)
      1 | Name: T1=Sam; Age: T2=35, T1=25; Birth-Place: T1=Mumbai; Marital Status: T2=Married, T1=Unmarried | Weight: T2=76, T1=65; Location: T2=Pune, T1=Mumbai; Employer: T1=XYZ
      2 | … | …
  • 9. HBase: Application Areas
      • Applications that need to store, access, and search data by key
      • Applications that need fast random access and updates to scalable structured data
      • Applications that need a flexible table schema
      • Applications that need range-search capabilities supported by key ordering
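The last point, range search supported by key ordering, can be illustrated with a small sketch. This is a toy model of the idea using a sorted key list, not HBase code; `SortedKeyTable` and the `user#NNNN` key format are hypothetical.

```python
import bisect


class SortedKeyTable:
    """Toy table that keeps row keys sorted so a range scan is a contiguous slice."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> row data

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)  # insert while preserving order
        self._rows[key] = row

    def scan(self, start, stop):
        """Return (key, row) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedKeyTable()
# Hypothetical zero-padded keys so that string order matches numeric order.
for n in (3, 1, 4, 2, 5):
    table.put(f"user#{n:04d}", {"n": n})

result = table.scan("user#0002", "user#0004")
scanned_keys = [key for key, _ in result]
```

Because the keys are stored in order, the scan only needs two binary searches and a slice. The key design matters: zero-padding is used here so that lexicographic order agrees with the intended numeric order.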
  • 10. HBase: Limitations
      • Expensive full-row reads
      • No secondary keys
      • No SQL support
      • Not efficient for big cell values
  • 11. Hive (Data Access): Design Features
      • Scalable data warehouse on top of Hadoop, developed by Facebook
      • SQL-like query language, HiveQL
      • Limited JDBC support
      • Support for rich data types
      • Ability to insert custom map-reduce jobs
  • 12. Hive: Application Areas
      • Ad hoc analysis of huge structured datasets with no low-latency requirement
      • Log processing
      • Text mining
      • Document indexing
      • Customer-facing business intelligence (e.g. Google Analytics)
      • Predictive modeling, hypothesis testing
  • 13. Hive: Limitations
      • No support for updating data
      • Only bulk load support
      • Not efficient for small data
  • 14. Hive: Example
      • CREATE TABLE employee (id bigint, name string, age int…) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
      • LOAD DATA LOCAL INPATH '/sas/employee.txt' OVERWRITE INTO TABLE employee;
      • INSERT OVERWRITE TABLE oldest_employee SELECT * FROM employee SORT BY age DESC LIMIT 100;
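As a rough illustration of what the last statement is meant to produce (the oldest employees by age, capped at a limit), here is a plain Python sketch. It only mirrors the intent of the query on an in-memory list and says nothing about how Hive executes it. Sam and Bob reuse the values from slide 8; Ann is a hypothetical extra row.

```python
import heapq

# In-memory stand-in for the employee table.
employees = [
    {"id": 1, "name": "Sam", "age": 35},
    {"id": 2, "name": "Bob", "age": 56},
    {"id": 3, "name": "Ann", "age": 41},  # hypothetical row for illustration
]


def oldest(rows, limit):
    """Return up to `limit` rows with the highest age, oldest first."""
    return heapq.nlargest(limit, rows, key=lambda row: row["age"])


# Equivalent in spirit to: SELECT * FROM employee ... age DESC LIMIT 2
oldest_employee = oldest(employees, 2)
names = [row["name"] for row in oldest_employee]
```

With this sample data, `names` holds the two oldest employees in descending order of age.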
  • 15. Pig (Data Access)
      • Pig Latin: a high-level data flow language
      • Client-side library; no server-side deployment needed
      • Batch processing of large unstructured data
      • Procedural language
      • Runtime schema creation, checkpointing, support for splits in the pipeline
      • Custom code support
      • Rich data types
      • Support for joins
  • 16. Pig: Application Areas
      • Extract, Transform, Load (ETL)
      • Unstructured data analysis
  • 17. Pig: Limitations
      • Not efficient for processing small datasets
  • 18. Pig: Example
      Load employee data from a text file, filter it by age and joining year, group it by joining year, and find the maximum age in each group.
      records = LOAD 'sas/input/files/employee.txt' AS (joiningYear:int, employeeId:int, age:int);
      filtered_records = FILTER records BY age > 30 AND (joiningYear >= 2000 AND joiningYear <= 2012);
      grouped_records = GROUP filtered_records BY joiningYear;
      max_age = FOREACH grouped_records GENERATE group, MAX(filtered_records.age);
      DUMP max_age;
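The same load, filter, group, and aggregate flow can be traced step by step in plain Python. This is a sketch on hypothetical in-memory data, assuming the year filter is meant as the range 2000 to 2012; it mirrors the shape of the Pig script, not how Pig runs it.

```python
from collections import defaultdict

# Hypothetical sample rows: (joiningYear, employeeId, age)
records = [
    (1998, 1, 52),
    (2005, 2, 28),
    (2005, 3, 41),
    (2005, 4, 36),
    (2010, 5, 33),
]

# FILTER: keep employees older than 30 who joined between 2000 and 2012.
filtered_records = [
    (year, emp_id, age)
    for (year, emp_id, age) in records
    if age > 30 and 2000 <= year <= 2012
]

# GROUP: collect the filtered rows by joining year.
grouped_records = defaultdict(list)
for row in filtered_records:
    grouped_records[row[0]].append(row)

# FOREACH ... GENERATE group, MAX(age): one result per joining year.
max_age = {year: max(age for (_, _, age) in rows) for year, rows in grouped_records.items()}

print(max_age)
```

Each intermediate name matches a relation in the Pig script, which makes it easy to see that every step consumes the output of the previous one.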
  • 19. Conclusion
      Organizations should:
      • Revisit their data strategy
      • Evaluate the Hadoop ecosystem
      • Build economical, scalable solutions for Big Data problems
  • 21. Thank You