Horizon for Big Data

Yet Another Data Base or Data Warehouse System.

Published in: Technology

Transcript

  • 1. Horizon for Big Data (Nov 2012, Schubert Zhang)
  • 2. Horizon
      • Yet another database or data warehouse system.
          – Users like a data warehouse in the style of a familiar DBMS (MySQL, PostgreSQL, Oracle, etc.).
          – Users are familiar with SQL.
          – But it is built for the Big Data era.
      • Stores and serves Big Data.
          – The 4 Vs of Big Data (Volume, Variety, Velocity, Value)
          – Mass storage.
          – High scalability.
          – High performance.
      • SQL + (Horizon stuff) + Bigtable => Horizon
  • 3. Background
      • NoSQL, Bigtable, etc. are just the "(distributed) storage engine layer" of a DBMS.
      • Users like SQL and are used to working with their data through it.
      • [Diagram, excerpted from 《MySQL性能调优与架构设计》 (MySQL Performance Tuning and Architecture Design): MySQL Server consists of a SQL Engine Layer and a Storage Engine Layer (MyISAM, InnoDB, etc.).]
  • 4. Horizon vs. MySQL
      • Horizon: Distributed SQL Engine + Distributed Storage Engine (NoSQL, HBase)
      • MySQL Server: SQL Engine Layer + Storage Engine Layer (MyISAM, InnoDB, etc.)
  • 5. Requirements and Target
      • Pure HBase is not easy to use for ordinary application developers, especially DB application developers.
          – HBase's data model is too low-level and too flexible: table/key/column/value, CF, etc. are all defined by the application layer. Users must know how to design their data model in HBase.
          – HBase's client API
              • is somewhat complicated and still evolving.
              • is Java only.
          – In both Hadoop and HBase, client and server code are mixed together, with too many dependent Java libraries. That makes application developers very unhappy.
      • Requirements
          – Big Data -> scale
          – Big Data is usually time-series and aging.
          – Users like SQL
          – Fast queries
  • 6. Motivation and Idea
      • HBase as the storage engine layer.
          – Pure HBase's data model is very low-level and very flexible: table/key/column/value, CF, etc. are all defined by the application layer.
          – To serve as a database, we need to customize HBase's data model.
          – HBase plays the same role in Horizon that InnoDB plays in MySQL, and HBase is distributed and scalable.
      • Why not develop a new "SQL and query engine layer"?
          – Too difficult and costly; we have neither the team nor the time.
          – Seek help from open source: H2, HSQLDB, Derby, MySQL, PostgreSQL, etc.
  • 7. H2 + HBase => Horizon
      • H2 Database Engine
          – http://www.h2database.com/html/main.html
          – Written in Java (easy to integrate with HBase and Hadoop)
          – Very fast, open source, JDBC API
          – Small footprint: around 1 MB jar file size
          – Browser-based console application
          – …
          – Risk of the H2 license
              • Modifications to the H2 source code must be published!
              • Nobody is allowed to rename H2, modify it a little, and sell it as a database engine without telling the customers it is in fact H2.
      • HBase
          – Becoming mature…
  • 8. Horizon Architecture
      • Core stack, listed from the storage layer up:
          – DFS (Hadoop HDFS)
          – Bigtable (HBase)
          – Data Model (data organization, indexing, partitioning, encoding, compression, ...)
          – Data Warehouse Utilities / Tools (SpeedLoader, SpeedScan, Data LifeCycle, ...)
          – SQL Engine (standard, familiar, low learning curve, ...)
          – JDBC and ODBC
      • Other components in the diagram: REST API, Management, Connectors
      • Integrating into the Hadoop ecosystem: MapReduce, Hive, Pig, Oozie, HCatalog, ...
  • 9. Deployment (this is just a draft logical diagram)
      • Clients (JDBC driver), several shown
      • Load balancer (LVS?)
      • SQL Engine Servers, several shown
      • Data Model
      • HBase
      • Tools: LifeCycle (on/offline), SpeedLoad
      • Hive/Pig MapReduce Connector (with SpeedScan)
      • Hive/Pig MapReduce analysis
      • ...
  • 10. Feature List
      • Data model in HBase <HugeTable and enhanced>
      • Schema data persistence and management <H2 enhanced>
      • Indexing (primary, secondary, full-text?...) <HugeTable and enhanced>
          – Consistency?
      • De-normalization (storing…) <HugeTable and enhanced>
      • Partitioning (especially for time-series and aging data) <HugeTable and enhanced>
      • ACID? <future and future>
      • SQL grammar changes <H2 enhanced>
      • Tools
          – Data loading tool <new>
          – Batch loading, dump/online?, etc… <new>
          – Performance benchmark tools <new, like old benchmark>
      • System management <new>
          – Web-based, status/performance monitoring, configuration, etc.
      • Integration into Hadoop (for MapReduce, Hive, Pig) <new>
      • …
  • 11. Feature Set and Roadmap [Figure: roadmap timeline covering 2013-Q1 and 2013-Q2~3]
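  • 12. Parallel Analysis
      • At first, use MapReduce, Hive/Pig, etc. on the periphery, outside Horizon.
      • Gradually build a parallel engine inside Horizon.
          – With reference to Dremel, MapReduce, etc.
          – With reference to parallel databases.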
  • 13. SpeedLoad and SpeedScan
      • Data loading time in seconds, loading about 1 TB of data (bar chart; values in the order shown on the slide):
          – LocalDisk to HDFS: 2619
          – HDFS to Horizon (SpeedLoad): 2481
          – HDFS to Horizon with one SI (SpeedLoad): 3084
          – HDFS to HBase (API put): 12457
      • Full-table-scan analysis time in seconds (bar chart), for the query
          select city_id, sum(down_vol + up_vol) as total_vol from waplog where client_addr = '10.221.31.228' group by city_id order by total_vol limit 100;
          – HDFS Files Scan: 497
          – Horizon SpeedScan: 375
          – HBase Scan: 1,275
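The ratios implied by the two charts on slide 13 can be worked out directly. This is a minimal sketch that assumes the chart values map to the bar labels in the order they appear on the slide, and that "about 1 TB" means 10^12 bytes (the slide does not give the exact size). The helper names are made up for this illustration.

```python
# Timings in seconds, read off slide 13 in the order listed (assumed mapping).
load_seconds = {
    "LocalDisk to HDFS": 2619,
    "HDFS to Horizon (SpeedLoad)": 2481,
    "HDFS to Horizon with one SI (SpeedLoad)": 3084,
    "HDFS to HBase (API put)": 12457,
}
scan_seconds = {
    "HDFS Files Scan": 497,
    "Horizon SpeedScan": 375,
    "HBase Scan": 1275,
}

ASSUMED_BYTES = 10**12  # "about 1 TB"; the exact size is an assumption


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def throughput_mb_per_s(total_bytes, seconds):
    """Average throughput in MB/s (1 MB = 10**6 bytes)."""
    return total_bytes / 1e6 / seconds


load_speedup = speedup(load_seconds["HDFS to HBase (API put)"],
                       load_seconds["HDFS to Horizon (SpeedLoad)"])
load_speedup_si = speedup(load_seconds["HDFS to HBase (API put)"],
                          load_seconds["HDFS to Horizon with one SI (SpeedLoad)"])
scan_speedup = speedup(scan_seconds["HBase Scan"],
                       scan_seconds["Horizon SpeedScan"])
speedload_mb_s = throughput_mb_per_s(ASSUMED_BYTES,
                                     load_seconds["HDFS to Horizon (SpeedLoad)"])

print(f"SpeedLoad vs. API put:             {load_speedup:.2f}x")
print(f"SpeedLoad with one SI vs. API put: {load_speedup_si:.2f}x")
print(f"SpeedScan vs. HBase Scan:          {scan_speedup:.2f}x")
print(f"SpeedLoad throughput:              {speedload_mb_s:.0f} MB/s")
```

Under these assumptions SpeedLoad is roughly five times faster than loading through the HBase put API, and maintaining one secondary index lowers that to roughly four times.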
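  • 14. Data LifeCycle
      • Understanding Big Data better
          – Big data is often time-sensitive and frequently takes the form of time-series data.
          – Big data has different access patterns:
              • Hot data is not large and is generally recent data: a short set of temporal data that tends to be volatile.
              • Old data is generally cold: an ever-growing set of data that rarely gets accessed.
          – Real-time queries, together with real-time insert/update/delete (IUD).
          – Bulk, high-speed data loading.
          – Bulk, high-speed data scanning and analysis.
      • [Diagram: new data enters through SQL/API and ages from Young to Old under a dynamic LifeCycle. Newer data takes more resources, older data takes fewer. The data access logic (query and analysis/MapReduce) spans both.]
      • HBase's existing design and usage patterns are not well suited to a big data warehouse:
          – Narrow bore: its interfaces and APIs suit random real-time updates and queries, but not high-volume, high-speed reads and writes.
          – Low density: the amount of data a single node can store and process does not match today's storage-dense servers (tens of TB).
      • Solution:
          – Understand HBase's internals and storage organization, and implement wide-bore data loading and scanning interfaces and tools (SpeedLoad and SpeedScan).
          – Reduce the system overhead caused by random real-time reads and writes, and provide high-density warehouse storage.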
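One way to picture the Young/Old split on slide 14 is to label time-range partitions by age. The slides do not describe how Horizon's LifeCycle tool actually decides this, so the function name, the 90-day window, and the partition names below are all hypothetical.

```python
from datetime import date, timedelta


def classify_partitions(upper_bounds, today, young_window_days=90):
    """Label each partition 'young' or 'old' (hypothetical policy).

    upper_bounds maps a partition name to its exclusive upper bound.
    A partition is 'young' if it can still hold rows at or after the cutoff.
    """
    cutoff = today - timedelta(days=young_window_days)
    return {
        name: ("young" if upper > cutoff else "old")
        for name, upper in upper_bounds.items()
    }


# Hypothetical monthly partitions; each value is the exclusive upper bound.
partitions = {
    "p2012_08": date(2012, 9, 1),
    "p2012_09": date(2012, 10, 1),
    "p2012_10": date(2012, 11, 1),
    "p2012_11": date(2012, 12, 1),
    "p2012_12": date(2013, 1, 1),
}

tiers = classify_partitions(partitions, today=date(2012, 12, 28))
for name, tier in tiers.items():
    print(name, tier)
```

A real policy would probably also weigh access frequency, but an age cutoff matches the slide's observation that hot data is generally recent data.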
  • 15. Standardized SQL: DDL
      CREATE TABLE IF NOT EXISTS WAP_SP (
          MSISDN VARCHAR(11),
          TIME_STAMP TIMESTAMP,
          COLLECT_DT INT,
          DATE_CD INT,
          SP_DOMAIN VARCHAR(20),
          SP_GROUP_NAM VARCHAR(12),
          USER_AGENT VARCHAR(20),
          SOFT_VERSION VARCHAR(16),
          HTTP_STATUS_CD INT,
          HTTP_STATUS_SUBCODE VARCHAR(8),
          CITY_ID BIGINT,
          STREAM_VOL BIGINT,
          ACC_CNT BIGINT,
          UP_VOL BIGINT,
          DOWN_VOL BIGINT,
          USERBRAND_CD INT,
          SUM_GW_AVGDELAY_DUR BIGINT,
          SUM_SP_AVGDELAY_DUR BIGINT,
          PV_ACC_CNT BIGINT,
          GW_DELAY_ACC_CNT BIGINT,
          SP_DELAY_ACC_CNT BIGINT,
          USER_AGENT_ORIGIN VARCHAR(16),
          SEQ_ID BIGINT,
          PRIMARY KEY(MSISDN, TIME_STAMP, SEQ_ID)
      )
      INDEX IDX_UA(USER_AGENT, TIME_STAMP) STORING (MSISDN) WHERE CITY_ID='GuangZhou'
      PARTITION BY RANGE(TIME_STAMP) (
          PARTITION p0 VALUES LESS THAN(TIMESTAMP '2012-01'),
          PARTITION p1 VALUES LESS THAN(TIMESTAMP '2012-02'),
          PARTITION p2 VALUES LESS THAN(TIMESTAMP '2012-03'),
          …
          PARTITION p11 VALUES LESS THAN(TIMESTAMP '2012-12')
      )
      • If you are familiar with MySQL or Oracle, the learning cost is low.
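The PARTITION BY RANGE clause on slide 15 lets an engine route a row to one partition and skip partitions that a time-range query cannot touch. The sketch below shows the general technique using binary search over the exclusive upper bounds. It is not Horizon's implementation: the three bounds are illustrative, and reading TIMESTAMP '2012-01' as midnight on 1 January 2012 is an assumption.

```python
from bisect import bisect_right
from datetime import datetime

# Exclusive upper bounds ("VALUES LESS THAN"), sorted ascending (illustrative).
BOUNDS = [datetime(2012, 1, 1), datetime(2012, 2, 1), datetime(2012, 3, 1)]
NAMES = ["p0", "p1", "p2"]


def partition_for(ts):
    """Return the partition that stores a row with timestamp ts."""
    # bisect_right sends a value equal to a bound to the next partition,
    # which matches the strict "less than" semantics.
    i = bisect_right(BOUNDS, ts)
    if i == len(BOUNDS):
        raise ValueError(f"{ts} is beyond the last partition bound")
    return NAMES[i]


def partitions_for_range(lo, hi):
    """Return the partitions that can hold rows with lo <= ts <= hi."""
    if lo > hi:
        return []
    first = bisect_right(BOUNDS, lo)
    last = min(bisect_right(BOUNDS, hi), len(NAMES) - 1)
    return NAMES[first:last + 1]


print(partition_for(datetime(2011, 12, 31)))
print(partition_for(datetime(2012, 1, 1)))
print(partitions_for_range(datetime(2012, 1, 15), datetime(2012, 2, 15)))
```

A query restricted to a time window then only scans the partitions returned by `partitions_for_range`, which fits the slides' point that big data is usually time-series and aging.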
  • 16. Standardized SQL: DML
      INSERT INTO tab1 VALUES (1, 2, true, 5, 5555555, 5.5, 55.555, 'it is 5', 's');
      UPDATE tab1 SET NAME = 'Schubert', VOL = 100 WHERE ID = 2;
      • Index-based real-time queries (elapsed time shown after each statement):
          SELECT * FROM WAP_SP LIMIT 10;  -- 0.094s
          SELECT COUNT(*) AS COUNT FROM WAP_SP WHERE msisdn='F6391198026377639AF99908E5000000' AND TIME_STAMP >=201212280000 AND TIME_STAMP <=201212289999;  -- 0.120s
          SELECT * FROM WAP_SP WHERE MSISDN='F6391198026377639AF99908E5000000' AND TIME_STAMP >=201212282051 AND TIME_STAMP <=2014212282051 ORDER BY DOWNSTREAM_VOL LIMIT 10;  -- 0.178s
          SELECT SUM(DOWN_VOL) FROM WAP_SP WHERE USER_AGENT='MAUI_WAP_Browser' AND TIME_STAMP >=201212280000 AND TIME_STAMP <=201212289999;  -- 0.294s
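The last query on slide 16 filters on USER_AGENT and a TIME_STAMP range, the same columns as the IDX_UA index in the DDL. A toy in-memory model shows why such a query can be answered with a short range scan instead of a full-table scan. It is only an illustration of the idea: the index entries, row identifiers, and function name are invented, and the slides do not say how Horizon lays out its indexes in HBase.

```python
from bisect import bisect_left

# Toy secondary index: sorted (USER_AGENT, TIME_STAMP, row_id) entries.
# All values are invented; timestamps use the integer form from the slide.
index = sorted([
    ("MAUI_WAP_Browser", 201212280010, "row-a"),
    ("MAUI_WAP_Browser", 201212290001, "row-b"),
    ("Opera", 201212280500, "row-c"),
    ("MAUI_WAP_Browser", 201212281200, "row-d"),
])


def index_range_scan(entries, user_agent, ts_from, ts_to):
    """Return row ids with the given user agent and ts_from <= ts <= ts_to."""
    # A shorter tuple sorts before any longer tuple sharing its prefix, so
    # (user_agent, ts_from) lands on the first entry with ts >= ts_from.
    start = bisect_left(entries, (user_agent, ts_from))
    # Timestamps are integers, so ts_to + 1 is the first excluded value.
    stop = bisect_left(entries, (user_agent, ts_to + 1))
    return [row_id for _, _, row_id in entries[start:stop]]


hits = index_range_scan(index, "MAUI_WAP_Browser", 201212280000, 201212289999)
print(hits)
```

Because the entries are sorted by (USER_AGENT, TIME_STAMP), the matching rows are contiguous, so the scan reads only the entries it returns.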
  • 17. Analysis Integrated with Hive
      • [Diagram components: Hive, Horizon, H2 Cluster, HBase Cluster, HDFS Cluster, MapReduce Cluster, Hive HQL. Integration points: Meta Schema, InputFormat, OutputFormat, SerDe, etc.]
  • 18. Analysis Integrated with Pig
      • [Diagram components: Pig, Horizon, H2 Cluster, HBase Cluster, HDFS Cluster, MapReduce Cluster, Pig Script. Integration points: Meta Schema, InputFormat, OutputFormat, SerDe, etc.]
  • 19. Standardized JDBC
      • Any standard JDBC SQL client can connect to Horizon.
      • URL: jdbc:horizon://hostname:port/public
  • 20. Reading
      • Database Theory …
      • SQL …
      • H2 Database Engine
      • Google Bigtable
      • Google Percolator
      • Google Megastore
      • Google F1
      • MySQL, PostgreSQL
      • Parallel Database Theory …
      • Dremel, MapReduce
  • 21. Thank You!
