001 hbase introduction
Published in: Technology

Transcript

  • 1. HBase Introduction Scott Miao 2012/06/25
  • 2. Agenda
      • Course credit
      • One common website story
      • Why RDB not affordable?
      • Big data
      • Why use noSQL?
      • HBase introduction
      • Hands-on
      • noSQL architecture common practices
      • Case study
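  • 3. The story of a website (1/3)
      • An RDBMS is the obvious choice for the persistence tier
          • It handles transactions (ACID), enforces integrity constraints, offers the standard SQL language, and even provides stored procedures
      • The first time your user count keeps growing...
          • Use a cluster of AP (application) servers that share one DB server
      • The second time your user count keeps growing...
          • Split the DB server into a master-slave architecture
          • Read data from the slave servers
          • Write data to the master server
      Source: HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1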
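  • 4. The story of a website (2/3)
      • The third time your user count keeps growing...
          • For the read bottleneck
              • Add a cache between the server program and the DB, for example Memcached (an in-memory DB)
              • But the server program's cache and the DB can easily become inconsistent
          • For the write bottleneck
              • Upgrade the DB server's hardware (CPU, memory, disk, etc.; vertical scaling)
              • Don't forget: the slave servers have to be upgraded along with it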
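The cache inconsistency mentioned on this slide can be illustrated with a minimal sketch. It assumes a cache-aside read path, and plain Python dicts stand in for Memcached and the database; the function names are hypothetical and do not come from the slides.

```python
# Toy cache-aside read path. Plain dicts stand in for Memcached and the DB;
# every name here is illustrative, not a real client API.
db = {"user:1": "Alice"}      # stand-in for the RDBMS
cache = {}                    # stand-in for Memcached


def read(key):
    """Serve from the cache when possible, otherwise load from the DB and cache it."""
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value


def write_without_invalidation(key, value):
    """Update only the DB: the cached copy silently goes stale."""
    db[key] = value


def write_with_invalidation(key, value):
    """Update the DB, then drop the cached copy so the next read reloads it."""
    db[key] = value
    cache.pop(key, None)


first_read = read("user:1")                  # cache miss, loads from the DB
write_without_invalidation("user:1", "Bob")
stale_read = read("user:1")                  # served from the cache, DB has moved on
write_with_invalidation("user:1", "Carol")
fresh_read = read("user:1")                  # cache was dropped, reloads from the DB
```

Invalidating on write narrows the window for stale reads, but in a real deployment with concurrent writers it does not remove it entirely.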
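  • 5. The story of a website (3/3)
      • The fourth time your user count keeps growing...
          • Use database sharding
              • Move from vertical scaling to horizontal scaling
          • The management nightmare begins
              • An RDBMS is inherently ill-suited to distributed storage (ACID, fixed schema)
              • The DBA has to define a set of sharding rules
              • When one DB server goes down or runs out of storage, resharding has to be done by hand
              • Resharding means readjusting the sharding rules and then copying and migrating data with heavy I/O, while either keeping the site in service or limiting the outage to a fixed time window
          • This is usually a last resort taken after the fact, and one of only a few available options
              • Who knew my website would become this popular?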
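To make the cost of resharding concrete, here is a minimal sketch that assumes the simplest possible sharding rule: a stable hash of the key modulo the number of DB servers. The slides do not say which rule was used, so the rule, the key format, and the shard counts are all assumptions for illustration.

```python
# Toy modulo-based sharding rule; shard counts and key names are illustrative only.
import zlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (CRC32) modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys if shard_for(k, old_count) != shard_for(k, new_count)]


keys = [f"user:{i}" for i in range(10_000)]
moved = moved_keys(keys, old_count=4, new_count=5)
moved_fraction = len(moved) / len(keys)
print(f"{moved_fraction:.0%} of keys must be copied to a different server")
```

With a plain modulo rule, a key stays put only when its hash leaves the same remainder for both shard counts, so adding a single server relocates most of the data. That is the heavy-I/O copy and migration step the slide describes.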
  • 6. Why RDB not affordable? (1/6)
      • Bottleneck of relational DBs
          • The 90s vs. recent years (Web 2.0)
      • Memcached + MySQL
          • Mitigates read stress effectively, but not write stress
      • MySQL cluster solutions
          • Master/Slave
              • Not affordable for high-concurrency scenarios
          • Vertical partitioning
          • Vertical/horizontal partitioning (database sharding)
              • Complex
              • Hard to scale out and to adapt to changing requirements
              • Low availability
      • Some kinds of simple but very large data cause this condition
      Source: http://www.infoq.com/cn/news/2011/01/nosql-why
  • 7. Why RDB not affordable? (2/6): A general HA system architecture design
      Source: Software Project Quality, Part 4: Overall Design, an Architecture Design Case - http://takeshi-experience.blogspot.tw/2012/04/blog-post.html
  • 8. Why RDB not affordable? (3/6): Master/Slave
  • 9. Why RDB not affordable? (4/6): Vertical Partitioning
  • 10. Why RDB not affordable? (5/6): Master/Slave + Vertical Partitioning
  • 11. Why RDB not affordable? (6/6): Vertical/Horizontal Partitioning
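  • 12. Big data
      • More data was generated in the past 3 years than in the previous 40,000 years!
      • Walmart holds 167 times as much data as the US Library of Congress!
      • eBay's analytics platform processes up to 100 PB of data per day! (about 100,000,000 GB)
      • As of 2010, the world's stored electronic data amounted to 1.2 ZB! (1,200,000 PB)
      • IDC predicts that by 2020 the world's stored electronic data will have grown 44-fold over the 2009 level, reaching 35 trillion GB!
          • 35,000,000,000,000 gigabytes
      Source: Architect, October issue - http://www.infoq.com/cn/minibooks/architect-oct-10-2011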
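  • 13. Trend Micro's problem
      • Each person visits about 20 to 60 HTML pages per day
      • Each HTML page contains about 15 to 30 URIs
      • Each URI object is about 10 to 150 KB
      • For one million users
          • 1 million × 20 = 20 million HTML pages
          • 20 million HTML pages × 15 = 300 million URIs
          • 300 million URI objects × 10 KB = 3,000,000,000 KB (3 TB)
      • The figures above cover the Taiwan region alone
      • Trend Micro is a global company
          • So the daily data volume is in the tens of TB
      Source: Trend Micro's Journey of Cloud Discovery - http://findbook.tw/book/9789866126185/basic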
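The estimate on this slide can be checked with a few lines of arithmetic. The sketch below uses the low end of each range, as the slide does, and adds the high end for comparison. It assumes decimal units (1 TB = 10^9 KB); the function name is illustrative.

```python
def daily_volume_tb(users, pages_per_user, uris_per_page, kb_per_uri):
    """Return (pages, uris, terabytes) for one day, using decimal units."""
    pages = users * pages_per_user
    uris = pages * uris_per_page
    total_kb = uris * kb_per_uri
    return pages, uris, total_kb / 1000**3   # 1 TB = 10**9 KB


# Low end of every range on the slide: 20 pages, 15 URIs per page, 10 KB per URI.
low_pages, low_uris, low_tb = daily_volume_tb(1_000_000, 20, 15, 10)

# High end of every range: 60 pages, 30 URIs per page, 150 KB per URI.
high_pages, high_uris, high_tb = daily_volume_tb(1_000_000, 60, 30, 150)

print(f"low:  {low_pages:,} pages, {low_uris:,} URIs, {low_tb:.0f} TB")
print(f"high: {high_pages:,} pages, {high_uris:,} URIs, {high_tb:.0f} TB")
```

The low-end figure matches the 3 TB on the slide; the high-end figure shows how quickly the volume grows if every range is taken at its maximum.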
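  • 14. The new favorite of the big data era: noSQL
      • Not only SQL
      • Started in 2009
      • Characteristics
          • Does not use the relational data model
          • Distributed storage by design
          • Easy to scale horizontally
          • Open source
          • Easy to extend
          • Simple API operations (CRUD, usually without SQL support)
          • CAP (as opposed to ACID)
              • Eventual consistency, availability, partition tolerance
          • Stores massive and heterogeneous data
      Source: http://nosql-database.org/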
  • 15. Why use noSQL?
      • Easy to scale out
          • Unlike an RDB, there are no relationships between records, so the data is easy to spread across machines
      • High performance even with big data
          • Table-level cache (RDB) vs. record-level cache (noSQL)
      • Elastic data model
          • Schema vs. schema-less/dynamic schema
      • High availability
          • Easy to add new machines (nodes) without any performance impact
  • 16. Comparison between RDB and noSQL
      Given a really huge amount of data:
      Aspect             | RDB                    | noSQL
      Performance        | Getting lower          | Sustained, as with a small data set
      Scalability        | Mainly for scale up    | Mainly for scale out
      Reliability        | ACID                   | CAP
      Availability       | Hard to maintain SLA   | Easy to maintain SLA
      Security           | Robust                 | Depends
      Economics          | High-end machines      | Commodity machines
      Data model         | Relational, fixed schema | Depends, but more likely simple and schema-less
      Maturity           | Very mature            | Not mature, various products
      Commercial support | Global companies       | Small start-ups
      OLAP/BI            | Mature                 | Immature
      Human resources    | Easy to find           | Hard to find
  • 17. noSQL basic categories
      Source: iTcloud: The New Cloud Era - http://www.ithome.com.tw/002/cloud/cloud.html
  • 18. Introduction to Apache HBase
      • A top-level project of the ASF
      • Belongs to the key-value category of noSQL DBs
      • Derived from Google's paper
          • Bigtable: A Distributed Storage System for Structured Data
              • a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers
              • a sparse, distributed, persistent multi-dimensional sorted map
      Source: HBase: The Definitive Guide - http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100/ref=sr_1_1?ie=UTF8&qid=1339060175&sr=8-1
  • 19. Apache HBase Concepts: Column-Oriented (1/2)
      Source: http://ofps.oreilly.com/titles/9781449396107/intro.html
  • 20. Apache HBase Concepts: Column-Oriented (2/2)
      • a sparse, distributed, persistent multi-dimensional sorted map
          • which is indexed by row key, column key (column family + qualifier), and a timestamp
      • Column families
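One way to read "a sparse, multi-dimensional sorted map indexed by row key, column key, and timestamp" is as nested dictionaries. The sketch below is a toy in-memory model of that idea only; it is not the HBase client API, and the class and method names are made up for illustration.

```python
# Toy in-memory model of the map described above:
# (row key, "family:qualifier", timestamp) -> value.
from collections import defaultdict


class ToyTable:
    def __init__(self, families):
        self.families = set(families)
        # row key -> column key -> {timestamp: value}; absent cells take no space
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        """Store one versioned cell; the column family must already exist."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column][ts] = value

    def get(self, row, column, ts=None):
        """Return the newest value, or the newest one at or before ts."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def scan(self):
        """Yield (row, column, newest value) in sorted row-key order."""
        for row in sorted(self.rows):
            for column in sorted(self.rows[row]):
                yield row, column, self.get(row, column)


table = ToyTable(["info", "stats"])
table.put("row-b", "info:name", "Bob", ts=1)
table.put("row-a", "info:name", "Alice", ts=1)
table.put("row-a", "info:name", "Alicia", ts=2)   # an update is just a newer version

latest = table.get("row-a", "info:name")
older = table.get("row-a", "info:name", ts=1)
cells = list(table.scan())
```

Column families are fixed when the table is created, while qualifiers inside a family can be added freely per row, which is why the toy `put` checks only the family name.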
  • 21. Apache HBase Concepts: Architecture
      Source: http://ofps.oreilly.com/titles/9781449396107/architecture.html
  • 22. Hands-on (1/3): Use your VM (virtual machine) to install tm-puppet
      • Please refer to the SPN Dev HBase training program again
      • Install git on your PC
      • Install tm-puppet on your VM
  • 23. Hands-on (2/3): Use HBase shell
      • Basic operations
          • help, list, scan
      • Create
          • A table 'MY_FIRST_TABLE'
          • Two column families 'FAM_1', 'FAM_2'
          • Ex.
              • create 't1', {NAME => 'f1'}, {NAME => 'f2'}
              • create 't1', 'f1', 'f2'
      • Put two records (columns)
          • Ex. put 't1', 'r1', 'c1', 'value'
      • Update a record (column) (this is also a put)
      • Delete a record (column)
          • delete 't1', 'r1', 'c1'
  • 24. Hands-on (3/3): Requirements
      • Put a screenshot of your successfully installed tm-puppet into git
          • Run the following commands
              • jps
              • ifconfig
          • Take a screenshot
          • Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-001.jpg
      • Put a screenshot of your HBase shell records into git
          • Run the following commands
              • scan 'MY_TEST_TABLE'
              • ifconfig
          • Take a screenshot
          • Path: ${git_home}/hbase-training/001/hands-on/${your_name}/hands-on-002.jpg
      • Commit and push to git
  • 25. noSQL architecture practices (1/8): Use noSQL as complement
      • Use noSQL as a mirror (implemented by code)
          • The RDB is still the primary store, and noSQL acts as a mirror
      Source: NoSQL Architecture Practices (1): NoSQL as a Complement - http://www.infoq.com/cn/news/2011/02/nosql-architecture-practice
  • 26. noSQL architecture practices (2/8): Use noSQL as complement
      // PSEUDO CODE for noSQL as a mirror
      // We want to store the data object
      bool status = false;
      DB.startTransaction();                // start transaction
      id = DB.Insert(data);                 // write data object to RDB
      if (id > 0) {
          status = NoSQL.Add(id, data);     // write data object to noSQL by id
      }
      if (id > 0 && status == true) {
          DB.commit();                      // commit transaction
      } else {
          DB.rollback();                    // failed, roll back transaction
      }
  • 27. noSQL architecture practices (3/8): Use noSQL as complement
      • Use noSQL as a mirror (implemented by synchronization)
  • 28. noSQL architecture practices (4/8): Use noSQL as complement
      • Combine RDB & noSQL
  • 29. noSQL architecture practices (5/8): Use noSQL as complement
      // PSEUDO CODE for RDB & noSQL combination
      // We want to store the data object
      data.title = "title";
      data.name = "name";
      data.time = "2009-12-01 10:10:01";
      data.from = "1";
      bool status = false;
      DB.startTransaction();    // start transaction
      // write into RDB; data.from is a value needed by search criteria
      id = DB.Insert("INSERT INTO table (from) VALUES(data.from)");
      if (id > 0) {
          // write data object to noSQL by id
          status = NoSQL.Add(id, data);
      }
      if (id > 0 && status == true) {
          DB.commit();          // commit transaction
      } else {
          DB.rollback();        // failed, roll back transaction
      }
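The combination pattern in the pseudocode can be tried out with a small runnable sketch. It rests on stand-ins that are not in the slides: Python's built-in sqlite3 plays the RDB, a dict plays the noSQL store, and `nosql_add`, `store`, and the `items` table are hypothetical names.

```python
import json
import sqlite3

# In-memory stand-ins: sqlite3 plays the RDB, a dict plays the noSQL store.
rdb = sqlite3.connect(":memory:")
rdb.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, "from" TEXT)')
nosql = {}


def nosql_add(item_id, data):
    """Hypothetical noSQL write; returns True on success."""
    nosql[item_id] = json.dumps(data)
    return True


def store(data):
    """Write the searchable field to the RDB and the full object to the noSQL store."""
    try:
        # Only the column needed for search criteria goes into the RDB.
        cursor = rdb.execute('INSERT INTO items ("from") VALUES (?)', (data["from"],))
        item_id = cursor.lastrowid
        if not nosql_add(item_id, data):
            raise RuntimeError("noSQL write failed")
        rdb.commit()
        return item_id
    except Exception:
        rdb.rollback()   # undo the RDB insert if either write failed
        return None


item_id = store({"title": "title", "name": "name",
                 "time": "2009-12-01 10:10:01", "from": "1"})
rdb_rows = rdb.execute('SELECT id, "from" FROM items').fetchall()
full_object = json.loads(nosql[item_id])
```

As in the pseudocode, the RDB transaction only protects the RDB side: if the commit itself failed after the noSQL write succeeded, the noSQL store would keep an orphaned object, so a real system would need a cleanup or retry step.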
  • 30. noSQL architecture practices (6/8): Use noSQL as complement
      • Benefits of combining RDB & noSQL
          • Reduces RDB I/O and saves RDB storage space, because only the key values are kept there
          • Increases the RDB table-level cache hit rate: the cache is refreshed only when key values (PK, FK, values used in search criteria) are updated
          • Makes synchronization more efficient in an RDB Master/Slave architecture
          • Makes RDB backup and recovery more efficient
          • Improves the scalability and performance of the whole system
  • 31. noSQL architecture practices (7/8): Use noSQL as master
      • Use noSQL only
          • Mainly for systems with simple query requirements
          • But some noSQL products can handle more complex queries
              • MongoDB, Tokyo Cabinet, etc.
      Source: NoSQL Architecture Practices (2): NoSQL as the Master - http://www.infoq.com/cn/news/2011/03/nosql-architecture-practice-2
  • 32. noSQL architecture practices (8/8): Use noSQL as master
      • Use noSQL as the major data source
      • APs write data only into noSQL
      • The data is then synchronized from noSQL to other data stores, depending on what each application needs
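A minimal sketch of the "write to noSQL, then synchronize to other stores" flow follows. It assumes a simple in-process change log that a synchronizer drains; the slides do not specify the mechanism, and every store and function name here is a plain-Python stand-in rather than a real product's API.

```python
# Toy version of "applications write only to noSQL, other stores are fed from it".
from collections import deque

nosql = {}                 # primary store: applications write only here
change_log = deque()       # ordered record of writes waiting to be synchronized
search_index = {}          # downstream store 1: word -> set of keys
report_counts = {}         # downstream store 2: author -> number of posts


def write(key, doc):
    """Application write path: touch only the primary store and log the change."""
    nosql[key] = doc
    change_log.append(key)


def synchronize():
    """Drain the change log and update each downstream store for its own purpose."""
    while change_log:
        key = change_log.popleft()
        doc = nosql[key]
        for word in doc["text"].lower().split():
            search_index.setdefault(word, set()).add(key)
        report_counts[doc["author"]] = report_counts.get(doc["author"], 0) + 1


write("post:1", {"author": "scott", "text": "HBase introduction"})
write("post:2", {"author": "scott", "text": "noSQL introduction"})
synchronize()
```

The toy synchronizer treats every logged change as a new document, so rewriting an existing key would be counted twice; a real pipeline would have to handle updates and deletes explicitly.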
  • 33. Case Study (1/4): Facebook's Real-time Message System
      • Uses HBase to store 135+ billion messages a month
          • HBase was chosen over competitors such as Cassandra and sharded MySQL
      • Data patterns
          • A short set of temporal data that tends to be volatile
          • An ever-growing set of data that rarely gets accessed
      Source: Facebook's New Real-time Messaging System: HBase to Store 135+ Billion Messages a Month - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
  • 34. Case Study (2/4): Facebook's Real-time Message System
      • Some key aspects of their system:
          • HBase
              • Has a simpler consistency model than Cassandra.
              • Very good scalability and performance for their data patterns.
              • Most feature-rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc.
          • HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing.
          • Facebook's operations teams have a lot of experience with HDFS, because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system.
  • 35. Case Study (3/4): Facebook's Real-time Message System
      • Haystack is used to store attachments.
      • A custom application server was written from scratch to service the massive inflow of messages from many different sources.
      • A user discovery service was written on top of ZooKeeper.
      • Infrastructure services are accessed for email account verification, friend relationships, privacy decisions, and delivery decisions.
      • In keeping with their "small teams doing amazing things" approach, 20 new infrastructure services are being released by 15 engineers in one year.
      • Facebook is not going to standardize on a single database platform; they will use separate platforms for separate tasks.
  • 36. Case Study (4/4): Alibaba China Site architecture
      Source: http://www.infoq.com/cn/presentations/hl-alibaba-cn-architecture-design-practice
  • 37.
  • 38. Data access pattern as the key for noSQL
      • Data structure
          • Structured
          • Semi-structured
          • Unstructured
          • Size
      • How many and how often writes/reads occur (proportion)
      • Data writing
          • Transaction
      • Data reading
          • Random access
          • Sequential access
          • Relationship
  • 39. Q&A