Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig)
Rakesh Jadhav, SAS
Indic threads pune12-comparing hadoop data storage

The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012. http://pune12.indicthreads.com/

Published in: Technology

Transcript of "Indic threads pune12-comparing hadoop data storage"

  1. Comparing Hadoop Data Storage (HDFS, HBase, Hive and Pig)
     Rakesh Jadhav, SAS
  2. Agenda
       • Hadoop Ecosystem
       • HDFS
       • HBase
       • Hive
       • Pig
  3. Hadoop Ecosystem
  4. Hadoop Ecosystem Components
       • HDFS: Hadoop Distributed File System
       • MapReduce: Hadoop distributed programming paradigm
       • HBase: Hadoop column-oriented database for random-access read/write of smaller data
       • Hive: Hadoop petabyte-scalable data warehousing infrastructure
       • Pig: Hadoop data flow/analysis infrastructure
       • Zookeeper: Hadoop co-ordination and configuration service infrastructure
       • Chukwa: Hadoop monitoring service
       • Avro: Hadoop data serialization/deserialization infrastructure
       • Mahout: Hadoop scalable machine learning library
  5. HDFS (Data Storage)
     Design Features
       • Failure is the norm
       • Designed for large datasets rather than small ones
       • Designed for batch processing rather than interactive use
       • Supports write-once, read-many access
       • Provides interfaces to move processing closer to the data
  6. HDFS
     Application Areas
       • Large log processing
       • Web search indexing
     Limitations
       • Small-file problem (inefficient for many small files)
       • Single point of failure
       • No random access
       • No in-place write (update) support
  7. HBase (Data Storage)
     Design Features
       • Key-value store (like a map)
       • Semi-structured data
       • Column family, timestamp
       • Key = RowKey.ColumnFamily.ColumnName.TimeStamp
       • De-normalized data
       • Faster data retrieval using column families
       • Static column families, dynamic columns
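The key structure on this slide can be pictured with a small in-memory model. The following is only a sketch, not the HBase client API: the class `VersionedTable`, its methods, and the family name "personal" are hypothetical names chosen for illustration, and the sample values mirror the age versions shown for row 1 in the RDBMS v/s HBase example.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of HBase-style cell addressing (hypothetical, not the HBase API)."""

    def __init__(self):
        # (row_key, column_family, column_name) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row_key, family, column, timestamp, value):
        # Each write adds a new version; older versions are kept.
        self._cells[(row_key, family, column)][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        versions = self._cells.get((row_key, family, column))
        if not versions:
            return None
        if timestamp is None:
            # No timestamp given: return the most recent version.
            timestamp = max(versions)
        return versions.get(timestamp)


table = VersionedTable()
table.put("1", "personal", "age", 1, 25)   # version T1
table.put("1", "personal", "age", 2, 35)   # version T2

latest_age = table.get("1", "personal", "age")
first_age = table.get("1", "personal", "age", timestamp=1)
missing = table.get("2", "personal", "age")
```

The point of the sketch is that a value is located by the full four-part key, so a new column name can appear in any row without a schema change, while the column family is fixed up front.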
  8. RDBMS v/s HBase: Example
     RDBMS
       ID  Name  Age  Birth-Place  Marital Status  Location  Weight  Employer
       1   Sam   35   Mumbai       Married         Pune      76      XYZ
       2   Bob   56   Chicago      Married         New York  79      PQR
     HBase (columns grouped into two column families, "Personal Information" and "Other Information"; T1 and T2 are timestamps)
       Row Key 1
         • Name: T1 = Sam
         • Age: T2 = 35, T1 = 25
         • Birth-Place: T1 = Mumbai
         • Marital Status: T2 = Married, T1 = Unmarried
         • Weight: T2 = 76, T1 = 65
         • Location: T2 = Pune, T1 = Mumbai
         • Employer: T1 = XYZ
       Row Key 2
         • …
  9. HBase: Application Areas
       • Applications that need to store, access, or search data by key
       • Applications that need fast random access and updates to scalable structured data
       • Applications needing a flexible table schema
       • Applications needing range-search capabilities supported by key ordering
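Range search by key ordering can be illustrated without HBase at all. The sketch below assumes row keys are kept in sorted order and that a scan takes an inclusive start key and an exclusive stop key; the function `range_scan` and the `user#`/`order#` key prefixes are hypothetical, and Python string ordering stands in for byte ordering (the two agree for ASCII keys).

```python
import bisect


def range_scan(sorted_keys, start_key, stop_key):
    """Return every key k with start_key <= k < stop_key from a sorted list."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


# Hypothetical row keys; sorting them mimics an ordered key space.
row_keys = sorted([
    "user#0042",
    "user#0007",
    "order#0001",
    "user#0100",
    "order#0002",
])

# All keys with the "user#" prefix: "~" sorts after the digits used in the keys.
users = range_scan(row_keys, "user#", "user#~")
```

Because the keys are ordered, the scan touches only the contiguous slice it needs, which is why row key design matters so much for this kind of store.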
  10. HBase: Limitations
       • Expensive full-row reads
       • No secondary keys
       • No SQL support
       • Not efficient for big cell values
  11. Hive (Data Access)
      Design Features
       • Scalable data warehouse on top of Hadoop, developed by Facebook
       • SQL-like query language, HiveQL
       • Limited JDBC support
       • Support for rich data types
       • Ability to insert custom map-reduce jobs
  12. Hive: Application Areas
       • Ad hoc analysis of huge structured datasets with no low-latency requirement
       • Log processing
       • Text mining
       • Document indexing
       • Customer-facing business intelligence (e.g. Google Analytics)
       • Predictive modeling, hypothesis testing
  13. Hive: Limitations
       • No support for updating data
       • Only bulk load support
       • Not efficient for small data
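  14. Hive: Example
       -- Define a table backed by a tab-delimited text file.
       CREATE TABLE employee (id BIGINT, name STRING, age INT…)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         STORED AS TEXTFILE;

       -- Bulk-load a local file, replacing any existing table contents.
       LOAD DATA LOCAL INPATH '/sas/employee.txt'
         OVERWRITE INTO TABLE employee;

       -- Write the 100 oldest employees into another table.
       INSERT OVERWRITE TABLE oldest_employee
         SELECT * FROM employee SORT BY age DESC LIMIT 100;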
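To make the intent of the Hive example concrete, here is a rough plain-Python equivalent of "load tab-delimited rows, keep the oldest N". It is a sketch under assumptions: the slide elides part of the schema, so only three columns (id, name, age) are assumed; `RAW`, `parse_employees` and `oldest` are hypothetical names; the row for "Ann" is made up for illustration, and the limit is 2 rather than 100.

```python
import heapq

# Hypothetical tab-delimited input with three assumed columns: id, name, age.
RAW = "1\tSam\t35\n2\tBob\t56\n3\tAnn\t41\n"


def parse_employees(text):
    """Split each line on tabs, as a tab-delimited text table would be read."""
    rows = []
    for line in text.splitlines():
        emp_id, name, age = line.split("\t")
        rows.append({"id": int(emp_id), "name": name, "age": int(age)})
    return rows


def oldest(rows, limit):
    """Return up to `limit` rows with the largest age, oldest first."""
    return heapq.nlargest(limit, rows, key=lambda row: row["age"])


employees = parse_employees(RAW)
oldest_employee = oldest(employees, 2)
oldest_names = [row["name"] for row in oldest_employee]
```

On a single machine this is trivial; Hive's value is that the same declarative query runs as batch jobs over files far too large for one process.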
  15. Pig (Data Access)
       • Pig Latin: a high-level data flow language
       • Client-side library; no server-side deployment needed
       • Batch processing of large unstructured data
       • Procedural language
       • Runtime schema creation, checkpointing, support for pipeline splits
       • Custom code support
       • Rich data types
       • Support for joins
  16. Pig: Application Areas
       • Extract, Transform, Load (ETL)
       • Unstructured data analysis
  17. Pig: Limitations
       • Not efficient for processing small datasets
  18. Pig: Example
      Load employee data from a text file, filter it by age and joining year, then group it by joining year.
       -- Load the records and declare a schema at read time.
       records = LOAD 'sas/input/files/employee.txt'
                 AS (joiningYear:int, employeeId:int, age:int);
       -- Keep employees older than 30 who joined between 2000 and 2012.
       filtered_records = FILTER records BY age > 30 AND (joiningYear >= 2000 AND joiningYear <= 2012);
       -- Group by joining year and report the maximum age in each group.
       grouped_records = GROUP filtered_records BY joiningYear;
       max_age = FOREACH grouped_records GENERATE group, MAX(filtered_records.age);
       DUMP max_age;
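The data flow in the Pig example (load, filter, group, aggregate) can be traced step by step in plain Python. This is a sketch, not Pig: the sample tuples and the function name `max_age_by_joining_year` are hypothetical, and the year condition is read as a range from 2000 to 2012, which is an assumption about the slide's intent.

```python
from collections import defaultdict

# Hypothetical records: (joining_year, employee_id, age)
records = [
    (1998, 101, 45),
    (2005, 102, 28),
    (2005, 103, 52),
    (2005, 104, 33),
    (2010, 105, 39),
]


def max_age_by_joining_year(rows, min_age=30, first_year=2000, last_year=2012):
    # FILTER: keep rows above the age threshold and inside the year range.
    filtered = [
        row for row in rows
        if row[2] > min_age and first_year <= row[0] <= last_year
    ]
    # GROUP: collect ages under each joining year.
    grouped = defaultdict(list)
    for joining_year, _employee_id, age in filtered:
        grouped[joining_year].append(age)
    # FOREACH ... GENERATE group, MAX(...): one result per group.
    return {year: max(ages) for year, ages in grouped.items()}


max_age = max_age_by_joining_year(records)
```

Each Python step corresponds to one Pig Latin statement, which is what "procedural data flow language" means in practice: the script names every intermediate relation instead of nesting everything in a single query.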
  19. Conclusion
      Organizations should:
       • Revisit their data strategy
       • Evaluate the Hadoop ecosystem
       • Build economical, scalable solutions for Big Data problems
  20. References
       • Hadoop: The Definitive Guide, by Tom White
       • http://hadoop.apache.org/
       • http://developer.yahoo.com/hadoop/tutorial/
       • http://www-01.ibm.com/software/data/infosphere/hadoop/
       • http://www.information-management.com/blogs/
       • http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
  21. Thank You