3. Motivations
Huge data volumes
• Total data volume: several PB per system
• Daily data volume: several TB per system
• Longer retention periods: several months
• Big potential: 200% increase in some areas
Multiple application areas
• BOSS, BI, NMS, Internet, ...
Data warehouse requirements
• Scalable
• Highly available
• Reliable
• SQL support
• Fast index query
• Affordable
• Multiple application support
• Sensitive data
• CRUD support
• Statistics & reporting
Approach: traditional application model + data integration + application solution
4. Hadoop: Raw Techniques
HDFS: distributed file system with fault tolerance
MapReduce: parallel programming environment over HDFS
Similar to the situation of POSIX API + local FS
High-level toolkits have been initiated:
• Yahoo: Pig / Pig Latin
• Business.com: CloudBase (Hadoop + JDBC)
• China Mobile: BC-PDM
• Facebook: Hive (SQL)
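The MapReduce model that these toolkits compile down to can be illustrated without Hadoop itself: map emits key/value pairs, the framework groups them by key, and reduce folds each group. A minimal plain-Java sketch (names here are illustrative, not Hadoop's API):

```java
import java.util.*;

// Minimal illustration of the MapReduce programming model (no Hadoop
// dependency): map emits (word, 1) pairs, the "shuffle" groups by key,
// and reduce sums each group. Hadoop runs the same phases in parallel
// across HDFS blocks on many nodes.
public class WordCount {
    // Map phase: one input line -> zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle + reduce: group pairs by key, then sum each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            result.put(g.getKey(), sum);
        }
        return result;
    }
}
```

This is exactly the seam the high-level toolkits exploit: they translate a query language into chains of such map/reduce pairs.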
5. Hive: A Petabyte-Scale Data Warehouse
Features:
• Schema support
• Pluggable storage engine I/F
• SQL-to-MapReduce translation
• xDBC driver
• Tools: HQL console
• Admin: HWI (Hive Web Interface)
Usage scenarios:
• Reporting
• Ad hoc analysis
• Machine learning
• Others
  • Log analysis
  • Trend detection
Facebook runs huge clusters: >1000 nodes
Source: ICDE 2010 / Facebook
6. HBase: Structured Storage of Sparse Data for Hadoop
Features:
• Column families
• ACID
• Optimized R/W
• BigTable I/F + BU
• Tools: HBase shell
• Admin: Jetty-based web UI
Usage scenarios:
• Social services
• MapReduce analysis
• Content repository (wiki, RSS)
• Near-realtime reporting & analytics
• Storing web pages
• … replacing SQL systems
Source: ApacheCon 2009 / HBase
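The "sparse" structured storage HBase provides can be pictured as a nested sorted map: row key → column family → qualifier → value, where absent cells simply take no space. A minimal plain-Java sketch of that data model (illustration only, not the HBase API):

```java
import java.util.*;

// Sketch of the BigTable/HBase data model: a sparse, sorted,
// multi-dimensional map. Only cells actually written consume storage,
// so different rows may carry completely different column sets.
public class SparseTable {
    // rowKey -> columnFamily -> qualifier -> value
    private final NavigableMap<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    public void put(String row, String family, String qualifier, String value) {
        rows.computeIfAbsent(row, r -> new HashMap<>())
            .computeIfAbsent(family, f -> new HashMap<>())
            .put(qualifier, value);
    }

    public String get(String row, String family, String qualifier) {
        Map<String, Map<String, String>> r = rows.get(row);
        if (r == null) return null;
        Map<String, String> f = r.get(family);
        return f == null ? null : f.get(qualifier);
    }

    // Rows are kept sorted by key, which is what makes range scans cheap.
    public SortedMap<String, Map<String, Map<String, String>>> scan(String start, String stop) {
        return rows.subMap(start, true, stop, false);
    }
}
```

The sorted outer map is the essential property: row-key ordering is what HBase's region splits and range scans rely on.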
7. HugeTable: Application-Oriented Structured Data Storage System
HugeTable addresses the missing blocks:
• Index store & query optimizer
• Client I/Fs
• Access control list
• HFile w/ index
• Insert, Update and Delete
• CF store
• Tools and web-based administration (data, config, FM, log, perf)
Build solutions for telco applications:
• Network Management System – NMS
• Value-Added System – VAS
• Business Intelligence – BI
• Other areas
8. A Brief History of HugeTable
HT-p1 (2008):
1. HBase-based
2. Partial xDBC/SQL support
3. Integration of HBase with ZK before official release
4. Secondary index
5. Schema support
6. ACL support
7. SQL console
HT-p2 (2009):
1. Connect Hive with HBase
2. Global indexing
3. Support HFile, CF in Hive
4. Secondary index
5. Multiple DB support
6. ACL support
7. MR & Scan I/F
8. Loader tools, HT-Client
9. Admin portal
10. JDBC remote console
HT-p3 (2010):
1. Move to higher versions of Hive, Hadoop and HBase
2. New storage engine
3. Fruitful external I/Fs
4. Application solution
5. Many other improvements
10. HBase as HugeTable Index Store
DDL: Create Index, Drop Index
DML: Select … using index xxx; Select … where idxcol
Flow:
• Meta Data: find the index for a statement
• Query Engine: find index, read index, check index
• Load Service / HT Loader: write the index during load
• The index data itself is stored in HBase
11. Index Store Implementation
• Primary index: index into the data file
• Secondary index: index into the primary index
• Exact match and range scan
• Integrated with Hive QL and other modules
Benchmark (20 nodes, 1 TB/node):

                     Hive                     HT-p1                        HT-p2
Memory consumption   No additional cost       8 GB/node·TB                 2 GB/node·TB
Load speed           20 MB/s·node (no index)  2.5 MB/s·node (primary idx)  >5 MB/s·node (primary idx)
Index query          N/A                      <10 sec                      <10 sec
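The two index levels described above can be sketched as sorted maps: the primary index maps a row key to its position in the data file, and a secondary index maps an attribute value back to row keys, so a secondary lookup resolves through the primary index. Sorted maps give both exact match and range scan. A plain-Java illustration of the scheme (not HugeTable's implementation, which keeps these indexes in HBase):

```java
import java.util.*;

// Sketch of a two-level index: the primary index points into the data
// file, and the secondary index points into the primary index.
public class IndexStore {
    private final List<String> dataFile = new ArrayList<>();                     // records by offset
    private final NavigableMap<String, Integer> primary = new TreeMap<>();       // key -> offset
    private final NavigableMap<String, List<String>> secondary = new TreeMap<>(); // attr -> keys

    public void load(String key, String attr, String record) {
        dataFile.add(record);
        primary.put(key, dataFile.size() - 1);
        secondary.computeIfAbsent(attr, a -> new ArrayList<>()).add(key);
    }

    // Exact match through the primary index.
    public String getByKey(String key) {
        Integer off = primary.get(key);
        return off == null ? null : dataFile.get(off);
    }

    // Exact match through a secondary index: attr -> keys -> primary -> data.
    public List<String> getByAttr(String attr) {
        List<String> out = new ArrayList<>();
        for (String key : secondary.getOrDefault(attr, Collections.<String>emptyList()))
            out.add(getByKey(key));
        return out;
    }

    // Range scan over the sorted primary index.
    public List<String> scan(String fromKey, String toKey) {
        List<String> out = new ArrayList<>();
        for (int off : primary.subMap(fromKey, true, toKey, false).values())
            out.add(dataFile.get(off));
        return out;
    }
}
```

The memory figures in the table make sense in this light: HT-p2's leaner index representation halves the per-TB footprint versus HT-p1 while doubling load speed.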
12. HugeTable IUD Support
Goal: support Insert, Update and Delete on application data.
Flow:
• IUD statement: the Query Engine finds the IUD table via Meta Data and writes the IUD data to an IUD table in HBase
• Select: reads merge the base HT data (HDFS) with the IUD data (HBase)
• Offline Merger: periodically folds the IUD table back into the HDFS data
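The read path implied by this design is merge-on-read: base data stays immutable in HDFS, IUD statements land in a delta table, a select consults the delta first, and the offline merger periodically compacts deltas into a new base. A minimal sketch under those assumptions (class and method names are illustrative):

```java
import java.util.*;

// Sketch of merge-on-read over an immutable base: updates and deletes go
// to a delta ("IUD") table; reads overlay the delta on the base; an
// offline merger compacts the delta back into the base file.
public class IudTable {
    private Map<String, String> base = new HashMap<>();        // stand-in for immutable HDFS data
    private final Map<String, String> delta = new HashMap<>(); // stand-in for the HBase IUD table
    private static final String TOMBSTONE = null;              // delete marker

    public void insertOrUpdate(String key, String value) { delta.put(key, value); }
    public void delete(String key) { delta.put(key, TOMBSTONE); }

    // Select: the delta wins over the base; a tombstone hides the base record.
    public String get(String key) {
        if (delta.containsKey(key)) return delta.get(key);
        return base.get(key);
    }

    // Offline merger: fold the delta into a fresh base and clear it.
    public void merge() {
        Map<String, String> next = new HashMap<>(base);
        for (Map.Entry<String, String> e : delta.entrySet()) {
            if (e.getValue() == TOMBSTONE) next.remove(e.getKey());
            else next.put(e.getKey(), e.getValue());
        }
        base = next;
        delta.clear();
    }

    int deltaSize() { return delta.size(); }
}
```

The design choice is the usual one for append-only file systems: HDFS files cannot be updated in place, so mutations accumulate in a random-access store (HBase) until a batch rewrite is worthwhile.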
13. HugeTable Access Control
Goal: support multiple users from multiple applications, w/o mutual trust.
Database privileges:
1. Meta data: Index, Create, Drop
2. User data: IUD
User access levels:
1. System administrator
2. User manager
3. User
Flow: DDL/DML and Loader/Portal operations pass through Grant/Revoke and privilege checks in the ACL module, which guards the meta data.
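The grant/revoke flow above amounts to an ACL check interposed between the DDL/DML layer and the metadata: every statement asks whether the user holds the needed privilege on the object. A minimal sketch (illustrative names, not HugeTable's code):

```java
import java.util.*;

// Sketch of an ACL module: privileges are granted per (user, object),
// and every DDL/DML call checks here before touching meta or user data.
public class AclModule {
    public enum Privilege { INDEX, CREATE, DROP, INSERT, UPDATE, DELETE }

    // "user@object" -> granted privileges
    private final Map<String, Set<Privilege>> grants = new HashMap<>();

    public void grant(String user, String object, Privilege p) {
        grants.computeIfAbsent(user + "@" + object,
                k -> EnumSet.noneOf(Privilege.class)).add(p);
    }

    public void revoke(String user, String object, Privilege p) {
        Set<Privilege> s = grants.get(user + "@" + object);
        if (s != null) s.remove(p);
    }

    public boolean check(String user, String object, Privilege p) {
        return grants.getOrDefault(user + "@" + object,
                EnumSet.noneOf(Privilege.class)).contains(p);
    }
}
```

Keeping the check in one module matches the slide's "w/o mutual trust" requirement: no application path reaches the data without passing through it.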
14. Administration Portal
Goal: unified HugeTable management point, decreasing management effort.
• Data management: DB/TBL/IDX
• User management: add/delete/modify
• Monitor & FM: log/alert/service
• Configuration: deploy/setup
15. HugeTable Application API
Various kinds of applications can choose among three APIs:
JDBC/SQL API:
• Migration of traditional database applications
• For SQL developers
• Batch processing & interactive
MapReduce API:
• Compatible with Hadoop MR API
• For data analysis, e.g. data mining
• Works with HT records
• Access control
BigTable API:
• BigTable/HBase-style API
• For NoSQL applications, on HFile2 format
• Range scan, key-value access
• Access control

BigTable API scan example:

    Table table = new Table("gdr", "admin", "admin");
    String[] families = new String[] {"default"};
    String[] partitions = new String[] {"dt=20100317"};
    int limit = 10;
    TableScannerInterface tsi = table.getScanner(
        new byte[0], new byte[] {Byte.MAX_VALUE}, families, partitions);
    for (int i = 0; i < limit; ++i) {
        GroupValue gv = tsi.next();
        for (String family : families) {
            System.out.println(family + " = " + Bytes.toString(gv.getByteValue(family)));
        }
    }

MapReduce API signatures:

    public void map(LongWritable key, HugeRecord value,
        OutputCollector<HugeRecordRowKey, HugeRecord> output, Reporter reporter);

    public void reduce(HugeRecordRowKey key, Iterator<HugeRecord> values,
        OutputCollector<HugeRecordRowKey, HugeRecord> output, Reporter reporter);
16. HugeTable-Based Telco Application Solutions
Heavy requirements, e.g.:
• Batch processing
• Complex data analysis
• Interactive query on CDRs
• Statistics and reporting
Architecture:
• Data sources feed a Data Aggregator, which loads into the HugeTable cluster
• HugeTable cluster: mass data store, batch processing, statistics, plus data warehouse and data mining toolkits
• A database handles interactive simple queries; HugeTable handles complex analysis and interactive complex queries
• Telco applications consume reporting on top
17. Future works
Column Storage Engine
File Format
Compression
Local Index
Global Index
Query Optimization
Join Optimization: index
Load Optimization
Parallel Load
Application Solution