Storage Systems for big data - HDFS, HBase, and intro to KV Store - Redis

There is a plethora of storage solutions for big data, each with its own pros and cons. The objective of this talk is to delve deeper into specific classes of storage, such as Distributed File Systems, in-memory Key Value Stores, and Big Table Stores, and to provide insights on how to choose the right storage solution for a specific class of problems, for instance running large analytic workloads, iterative machine learning algorithms, and real-time analytics.

The talk will cover HDFS, HBase, and a brief introduction to Redis.


Transcript

  • 1. Storage Systems for Big Data Sameer Tiwari Hadoop Storage Architect, Pivotal Inc. stiwari@gopivotal.com, @sameertech
  • 3. Storage Hierarchy
    ● In-memory KV store: extremely fast access (Redis)
    ● Large indexed tables: fast random access, consistent (HBase)
    ● Large distributed storage: high aggregate throughput (HDFS)
    ● General purpose FS: POSIX filesystem (*nix)
    ● The diagram also shows other KV store(s)
  • 5. Hadoop Distributed File System (HDFS)
    ● History
      ○ Based on the Google File System paper (2003)
      ○ Built at Yahoo by a small team
    ● Goals
      ○ Tolerance to hardware failure
      ○ Sequential access as opposed to random access
      ○ High aggregate throughput for large data sets
      ○ “Write Once Read Many” paradigm
  • 6-9. HDFS - Key Components [diagram, built up over four slides]
    ● Components: a NameNode, Client1 (FileA), Client2 (FileB), and DataNodes 1-4 spread across Rack 1 and Rack 2
    ● Metadata (NN OPs): File.create() goes to the NameNode, which keeps per-file metadata (e.g. size, owner...) and block locations
      ○ FileA: AB1 on D1, D3, D4; AB2 on D1, D3, D4
      ○ FileB: BB1 on D1, D2, D4
    ● Data blocks (DN OPs): File.write() sends the blocks (AB1, AB2, BB1) to the DataNodes
    ● Replication pipelining: each block is forwarded from DataNode to DataNode until all its replicas are written
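The block locations above (AB1 on D1, D3, D4) fit HDFS's default rack-aware placement as described in the HDFS architecture guide: first replica on the writer's node, second on a node in a different rack, third on a different node in that same remote rack. The sketch below assumes DataNodes 1-2 sit in Rack 1 and 3-4 in Rack 2 (the diagram does not make the split explicit), a hypothetical 200 MB file with a 128 MB block size, and deterministic node choice instead of random; the class and method names are hypothetical, not Hadoop's placement-policy classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: HDFS-style block splitting and default rack-aware replica placement.
class ReplicaPlacementSketch {

    // Blocks needed for a file: ceil(fileSize / blockSize).
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Replica 1 on the writer's node, replica 2 on a node in another rack,
    // replica 3 on a different node in that same remote rack.
    // Nodes are taken in listing order; real HDFS picks randomly among eligible nodes.
    static List<String> placeReplicas(String writer, Map<String, List<String>> racks) {
        String writerRack = rackOf(writer, racks);
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            if (rack.getKey().equals(writerRack) || rack.getValue().size() < 2) {
                continue; // need a remote rack that can hold two replicas
            }
            List<String> replicas = new ArrayList<>();
            replicas.add(writer);
            replicas.add(rack.getValue().get(0));
            replicas.add(rack.getValue().get(1));
            return replicas;
        }
        throw new IllegalStateException("no remote rack with two DataNodes");
    }

    static String rackOf(String node, Map<String, List<String>> racks) {
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            if (rack.getValue().contains(node)) {
                return rack.getKey();
            }
        }
        throw new IllegalArgumentException("unknown DataNode " + node);
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("/rack1", List.of("D1", "D2")); // assumed rack membership
        racks.put("/rack2", List.of("D3", "D4"));
        long mb = 1024L * 1024L;
        System.out.println(blockCount(200 * mb, 128 * mb)); // 2 blocks, like AB1 and AB2
        System.out.println(placeReplicas("D1", racks));      // [D1, D3, D4], like AB1
        System.out.println(placeReplicas("D4", racks));      // [D4, D1, D2], the nodes listed for BB1
    }
}
```

Putting two of the three replicas in one remote rack limits cross-rack write traffic while still surviving the loss of a whole rack.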
  • 10-13. HDFS - Communication [diagram, built up over four slides]
    ● Client -> NameNode: HDFS Client API, RPC: ClientProtocol
    ● Client -> DataNode: HDFS Client API, streaming data transfer (DataTransferProtocol)
      ○ Non-RPC, streaming
      ○ Heavy buffering
    ● DataNode -> NameNode: RPC: DataNodeProtocol
      ○ DN registration: at init time
      ○ Heartbeat: stats about activity and capacity (every few secs)
      ○ Block report: list of blocks (hourly)
      ○ Block received: triggered by a client upload
    ● DataNode -> DataNode: replication pipelining, streaming
  • 14. HDFS - NameNode 1 of 4
    ● Heart of HDFS; typically lots of memory (~128 GB)
    ● Hosts two important tables
      ○ The HDFS namespace: File->Block mapping
        ■ Persisted for backup
      ○ The block map: Block->DataNode mapping
        ■ Not persisted
        ■ Rebuilt from block reports
    ● HDFS is a journaled file system
      ○ Maintains a WAL called the edit log
      ○ The edit log is merged into the fsimage at a preset log size
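The journaling scheme above can be sketched as a toy model: each namespace mutation is appended to the edit log (the WAL) before it is applied in memory, and once the log reaches a preset size it is merged into the fsimage and truncated; recovery loads the image and replays the remaining edits. Assumptions: the namespace is reduced to a file-to-blocks map, "log size" is counted in entries, and every name is hypothetical rather than Hadoop's own edit-log or image classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy journaled namespace: edit log (WAL) plus periodic checkpoint into an image.
class JournaledNamespace {
    private final Map<String, List<String>> fsimage = new HashMap<>();   // last checkpoint
    private final Map<String, List<String>> namespace = new HashMap<>(); // live in-memory state
    private final List<String[]> editLog = new ArrayList<>();           // edits since checkpoint
    private final int checkpointThreshold;

    JournaledNamespace(int checkpointThreshold) {
        this.checkpointThreshold = checkpointThreshold;
    }

    // Log first, then apply: a logged edit survives a crash of the in-memory state.
    void addBlock(String file, String blockId) {
        editLog.add(new String[] {file, blockId});
        apply(namespace, file, blockId);
        if (editLog.size() >= checkpointThreshold) {
            checkpoint();
        }
    }

    // Merge the edit log into the image, then start a fresh log.
    void checkpoint() {
        for (String[] edit : editLog) {
            apply(fsimage, edit[0], edit[1]);
        }
        editLog.clear();
    }

    // Recovery: load the image, then replay the edits logged after it.
    Map<String, List<String>> recover() {
        Map<String, List<String>> restored = new HashMap<>();
        fsimage.forEach((file, blocks) -> restored.put(file, new ArrayList<>(blocks)));
        for (String[] edit : editLog) {
            apply(restored, edit[0], edit[1]);
        }
        return restored;
    }

    Map<String, List<String>> live() { return namespace; }

    int pendingEdits() { return editLog.size(); }

    private static void apply(Map<String, List<String>> ns, String file, String blockId) {
        ns.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId);
    }
}
```

Checkpointing bounds how many edits must be replayed at startup; the image alone is never enough, which is why the log is always written first.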
  • 15. HDFS - NameNode 2 of 4
    ● Can take on 3 roles
    ● Regular mode: hosts the HDFS namespace
    ● Backup mode: Secondary NN
      ○ Downloads the fsimage regularly
      ○ Merges changes into the namespace image
      ○ The name is a misnomer; it is more of a checkpointing server
    ● Safemode: at startup time
      ○ A read-only mode
      ○ Collects block reports from active DNs
  • 16. HDFS - NameNode 3 of 4: HA using Quorum Journal Manager (Hadoop 2.0+)
    [Diagram: clients, a ZooKeeper (ZK) cluster, an Active NN and a Standby NN, a set of Journal Nodes, and DataNodes]
  • 17. HDFS - NameNode 4 of 4
    ● Replication monitor: fixes over/under-replicated blocks
      ○ Replica states: corrupt, current, out-of-date, under construction
    ● Lease management: during file creation
      ○ Ensures a single writer (multiple readers are OK)
      ○ Synchronously checks the active lease
      ○ Asynchronously checks the entire tree of leases
    ● Heartbeat monitor: collects DN stats and marks a DN as down if no heartbeat is received for ~10 mins
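The heartbeat monitor's dead-node rule is easy to sketch. As we understand Hadoop 2's DatanodeManager, the ~10 minute figure comes from 2 × the heartbeat recheck interval (default 5 minutes) + 10 × the heartbeat interval (default 3 seconds), i.e. 10.5 minutes; treat those defaults as assumptions to verify for your version. The class below is hypothetical and only models the timeout check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical NameNode-side heartbeat monitor: a DataNode is considered dead once
// its last heartbeat is older than the expiry interval.
class HeartbeatMonitor {
    // Expiry = 2 * recheck interval + 10 heartbeat intervals (all converted to ms).
    static long expireIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
    }

    private final long expireMs;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitor(long expireMs) { this.expireMs = expireMs; }

    // Record a heartbeat from a DataNode at time nowMs.
    void heartbeat(String dataNode, long nowMs) { lastHeartbeat.put(dataNode, nowMs); }

    // DataNodes whose last heartbeat is too old, in sorted order.
    List<String> deadNodes(long nowMs) {
        TreeSet<String> dead = new TreeSet<>();
        lastHeartbeat.forEach((dn, lastMs) -> {
            if (nowMs - lastMs > expireMs) dead.add(dn);
        });
        return new ArrayList<>(dead);
    }
}
```

A long expiry avoids re-replicating every block of a DataNode that is merely slow or briefly partitioned.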
  • 18. HDFS - DataNode
    ● Typical machine: ~4 TB x 12 disks, JBOD
    ● Has no knowledge of HDFS files; only knows about blocks
    ● Serves 2 types of requests
      ○ NN requests for block create/delete/replicate
      ○ Block R/W requests from clients
    ● Maintains only one table
      ○ Block -> real bytes on the local FS
      ○ Stored locally and not backed up
      ○ The DN can rebuild this table by scanning its local dirs
  • 19. HDFS - DataNode
    ● Creates a checksum file for each block
    ● Runs blockScanner() to find corrupt blocks
    ● DataNode to NameNode communication
      ○ Init: registration
      ○ Sends a heartbeat to the NN every few secs
      ○ Block completion: blockReceived()
      ○ Lets the NN respond with block commands
      ○ Sends a full block report every hour
  • 20. HDFS - Typical Deployment
    [Diagram: a master switch connects aggregator switches 1, 2, 3, ...; each aggregator switch connects the top-of-rack (TOR) switches of racks 1 to N, with 10-20 racks per aggregator]
  • 21. HDFS - Limitations
    ● The NN holds the namespace in a single Java process
    ● 64 GB heap == ~250 million files + blocks
      ○ Federation partially solves the problem
      ○ Moving the namespace to a KV store is one solution
    ● Enterprise features are slowly being added
      ○ Snapshots
      ○ NFS access
      ○ Geo replication
      ○ Erasure coding to reduce 3X copies to ~1.3X
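The capacity figures above are back-of-the-envelope arithmetic. A 64 GB heap holding ~250 million files + blocks works out to roughly 275 bytes of heap per namespace object (derived from the slide's numbers, not a measured constant). Separately, the 3X-to-~1.3X storage figure matches the overhead of a layout with k data blocks plus m parity blocks, (k + m) / k: with a hypothetical k = 10, m = 3 that is 1.3X, while triple replication is the degenerate case k = 1, m = 2. The helper names are ours.

```java
// Back-of-the-envelope NameNode heap and storage-overhead arithmetic (hypothetical helpers).
class HdfsCapacityMath {
    static final long GIB = 1024L * 1024L * 1024L;

    // Average heap bytes available per namespace object (file or block).
    static double heapBytesPerObject(long heapBytes, long objects) {
        return (double) heapBytes / objects;
    }

    // Raw bytes stored per byte of user data with k data blocks and m redundant blocks.
    static double storageOverhead(int dataBlocks, int redundantBlocks) {
        return (double) (dataBlocks + redundantBlocks) / dataBlocks;
    }

    public static void main(String[] args) {
        // prints "275 bytes/object"
        System.out.printf("%.0f bytes/object%n", heapBytesPerObject(64 * GIB, 250_000_000L));
        // prints "replication: 3.0X, 10+3 coding: 1.3X"
        System.out.printf("replication: %.1fX, 10+3 coding: %.1fX%n",
                storageOverhead(1, 2), storageOverhead(10, 3));
    }
}
```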
  • 22. HDFS - Advanced Concepts
    ● Support for fadvise readahead and drop-behind
    ● HDFS takes advantage of multiple disks
      ○ Individual disk failures do not cause DN failures
      ○ Spills are parallelized
    ● Replica and task placement
      ○ Done by DNSToSwitchMapping.resolve()
      ○ User-supplied rack topology
      ○ IP address -> rack id mapping
      ○ net.topology.* settings in core-site.xml
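Rack resolution maps each DataNode address to a rack id, and placement decisions are then made on those ids. The sketch mimics the shape of DNSToSwitchMapping.resolve() (a list of names in, a list of rack paths out) using a hypothetical static IP-prefix table instead of the topology script that the net.topology.* settings usually point to; it does not implement Hadoop's interface. Unknown hosts fall back to /default-rack, the rack name Hadoop uses when no mapping is configured.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical static rack mapping with the same shape as DNSToSwitchMapping.resolve().
class PrefixRackMapping {
    static final String DEFAULT_RACK = "/default-rack";
    private final Map<String, String> prefixToRack = new LinkedHashMap<>();

    PrefixRackMapping addPrefix(String ipPrefix, String rack) {
        prefixToRack.put(ipPrefix, rack);
        return this;
    }

    // One rack path per input name; prefixes are tried in insertion order, first match wins.
    List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String name : names) {
            String rack = DEFAULT_RACK;
            for (Map.Entry<String, String> e : prefixToRack.entrySet()) {
                if (name.startsWith(e.getKey())) {
                    rack = e.getValue();
                    break;
                }
            }
            racks.add(rack);
        }
        return racks;
    }
}
```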
  • 23. HDFS - Advanced Concepts
    ● A couple of tools for performance monitoring
      ○ Ganglia for HDFS
      ○ Nagios for the general health of the machine
  • 26. HBase
    ● History
      ○ Based on Google’s Bigtable (2006)
      ○ Built at Powerset (later acquired by Microsoft)
      ○ Facebook and Yahoo use it extensively (~1000 machines)
    ● Goals
      ○ Random R/W access
      ○ Tables with billions of rows X millions of columns
      ○ Often referred to as a “NoSQL” data store
      ○ High-speed ingest rate. FB == ~1 billion msgs + chat per day
      ○ Good consistency model
  • 27. HBase - Key Components
    [Diagram: a client and a ZooKeeper (ZK) cluster; masters (active and backup): HMaster, JobTracker, NameNode; slaves (many): HRegionServer, TaskTracker, DataNode]
  • 28. HBase - Data Model
    ● Section 2 of the Google Bigtable paper says: "A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."
    ● Let’s break that down over the next few slides...
  • 29. HBase - Data Model
    ● Data is stored in tables
    ● Tables have rows and columns
    ● That's where the similarity to relational tables ends
      ○ Columns are grouped into column families
    ● Rows are stored in sorted (increasing) order
      ○ This implies there is only one primary key: the row key
    ● Rows can be sparsely populated
      ○ Variable-length rows are common
    ● The same row can be updated multiple times
      ○ Each update is stored as a new version
  • 30. HBase - Data Model: Conceptual View
    ● Row key: byte array, sorted by byte order
    ● Versions: timemillis() timestamps
    ● Column => column family:qualifier; values are byte arrays
    ● Column family "contents" has a single column; column family "anchor" has two columns

      Row Key         Time Stamp   ColumnFamily contents          ColumnFamily anchor
      "com.cnn.www"   t9                                          anchor:cnnsi.com = "CNN"
      "com.cnn.www"   t8                                          anchor:my.look.ca = "CNN.com"
      "com.cnn.www"   t5           contents:html = "<html>..."
      "com.cnn.www"   t3           contents:html = "<html>..."
  • 31. HBase - Data Model: Physical View

      Row Key         Time Stamp   ColumnFamily contents
      "com.cnn.www"   t5           contents:html = "<html>..."
      "com.cnn.www"   t3           contents:html = "<html>..."

      Row Key         Time Stamp   ColumnFamily anchor
      "com.cnn.www"   t9           anchor:cnnsi.com = "CNN"
      "com.cnn.www"   t8           anchor:my.look.ca = "CNN.com"
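The "sparse, distributed, persistent multidimensional sorted map" can be modelled in memory as nested sorted maps: row key -> column (family:qualifier) -> timestamp (newest first) -> value. This toy sketch loads the com.cnn.www example, uses small integers as stand-ins for t3-t9, and stores Strings instead of byte arrays for readability; it is not HBase's storage format. Selecting the columns of one family reproduces the physical view, where each family is stored on its own.

```java
import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy Bigtable-style map: row -> column -> timestamp (descending) -> value.
class SortedMapModel {
    final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    void put(String row, String column, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.reverseOrder()))
            .put(ts, value);
    }

    // Latest version of a cell, or null if absent (like a default get()).
    String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) return null;
        return cols.get(column).firstEntry().getValue();
    }

    // Physical view: only the columns belonging to one family, e.g. "anchor".
    NavigableMap<String, NavigableMap<Long, String>> family(String row, String family) {
        NavigableMap<String, NavigableMap<Long, String>> out = new TreeMap<>();
        for (Map.Entry<String, NavigableMap<Long, String>> e : rows.getOrDefault(row, new TreeMap<>()).entrySet()) {
            if (e.getKey().startsWith(family + ":")) out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMapModel t = new SortedMapModel();
        t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN");
        t.put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com");
        t.put("com.cnn.www", "contents:html", 5, "<html>...");
        t.put("com.cnn.www", "contents:html", 3, "<html>...");
        System.out.println(t.getLatest("com.cnn.www", "anchor:cnnsi.com")); // CNN
        System.out.println(t.family("com.cnn.www", "contents"));
    }
}
```

Empty cells cost nothing here: a missing column is simply an absent map key, which is what "sparse" means in the quote.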
  • 32. HBase - Table Objects
    [Diagram: a logical table with rows R1-R40 is split into shards (regions), e.g. Region1: R1-R10, Region2: R11-R20; each region server hosts ~200 regions; each region is shown with an HLog/WAL, a MemStore, and HFiles, whose data lives in HDFS blocks]
  • 33. HBase - Data Model Operations
    ○ The HTable class offers 4 operations: get, put, delete, and scan
    ○ The first 3 have a single or batch mode available

      // Scan example
      public static final byte[] CF1 = Bytes.toBytes("empData1");
      public static final byte[] ATTR1 = Bytes.toBytes("empId");

      HTable htable = ...; // create an instance of HTable (constructor arguments elided)
      Scan scan = new Scan();
      scan.addColumn(CF1, ATTR1);
      scan.setStartRow(Bytes.toBytes("200")); // inclusive
      scan.setStopRow(Bytes.toBytes("500"));  // exclusive
      ResultScanner rs = htable.getScanner(scan);
      try {
        for (Result r = rs.next(); r != null; r = rs.next()) {
          // do something with it...
        }
      } finally {
        rs.close();
      }
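Because row keys are compared as raw bytes, the scan above covers startRow (inclusive) up to stopRow (exclusive) in lexicographic byte order, not numeric order: "2000" and "3" fall inside ["200", "500"), while "1000" and "500" do not. The sketch reproduces that range logic locally with a TreeMap and an unsigned-byte comparator; it models the semantics only and does not use the HBase client.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Local model of an HBase-style row range scan over byte-ordered keys.
class ByteOrderScan {
    // Lexicographic comparison of unsigned bytes; a shorter prefix sorts first.
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    };

    // Row keys in [startRow, stopRow), in sorted byte order.
    static List<String> scan(List<String> rowKeys, String startRow, String stopRow) {
        TreeMap<byte[], String> table = new TreeMap<>(BYTE_ORDER);
        for (String k : rowKeys) table.put(k.getBytes(StandardCharsets.UTF_8), k);
        byte[] start = startRow.getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRow.getBytes(StandardCharsets.UTF_8);
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        List<String> keys = List.of("99", "500", "45", "3", "2000", "200", "1000");
        System.out.println(scan(keys, "200", "500")); // [200, 2000, 3, 45]
    }
}
```

If numeric ranges are what the application means, a common workaround is a fixed-width row key encoding (e.g. zero-padded numbers) so byte order and numeric order agree.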
  • 34. HBase - Data Versioning
    ○ By default, put() uses the current time as the timestamp, but you can override it
    ○ By default, get() returns the latest version, but you can ask for any: Get.setMaxVersions() or Get.setTimeRange()
    ○ All data model operations return data in sorted order: Row:CF:Col:Version (newest version first)
    ○ Delete flavors: delete col+ver, delete col, delete col family, delete row
    ○ Deletes work by creating tombstone markers
    ○ LIMITATIONS:
      ■ delete() masks a put() until a major compaction takes place
      ■ Major compactions can change get() results
    ○ All operations are ATOMIC within a row
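The delete/put interaction listed under LIMITATIONS is easier to see in a toy model of a single column: puts add timestamped versions, a column delete only writes a tombstone that masks every version at or below its timestamp, and a major compaction is what physically removes masked versions and the tombstone. This is a simplified sketch with hypothetical names; real HBase has several delete-marker types and per-family version limits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of one HBase column: versioned puts plus a column tombstone.
class VersionedColumn {
    private final TreeMap<Long, String> versions = new TreeMap<>(Collections.reverseOrder()); // newest first
    private long tombstone = Long.MIN_VALUE; // masks versions with ts <= tombstone

    void put(long ts, String value) { versions.put(ts, value); }

    // Delete all versions up to ts: nothing is removed yet, a marker is recorded.
    void deleteColumn(long ts) { tombstone = Math.max(tombstone, ts); }

    // Up to maxVersions visible values, newest first (compare Get.setMaxVersions()).
    List<String> get(int maxVersions) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            if (out.size() == maxVersions) break;
            if (e.getKey() > tombstone) out.add(e.getValue());
        }
        return out;
    }

    // Major compaction: drop masked versions, then drop the tombstone itself.
    void majorCompact() {
        versions.keySet().removeIf(ts -> ts <= tombstone);
        tombstone = Long.MIN_VALUE;
    }
}
```

For example, put(10, "a"), deleteColumn(20), then put(15, "b") leaves get(1) empty even though "b" was written last; after majorCompact(), a new put(15, "d") is visible again.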
  • 35. HBase - Read Path [diagram]
    ○ -ROOT- table: keeps track of the .META. table (rows: .META.,region,key: regionInfo, Server)
    ○ .META. table: all regions in the system, never splits (rows: table, startKey, id:: regionInfo, Server)
    ○ Client asks ZK: Where is -ROOT-? A: RegionServer1
    ○ Client asks RegionServer1: Where is .META.? A: RegionServer2
    ○ Client sends HTable.get() to the region server holding the row (Region Server2 in the diagram), which reads its MemStore and HFiles (HFile-1, HFile-2) and returns the row
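Finding the region that holds a row is a floor lookup: .META. rows are sorted by region start key, so the owning region is the one with the greatest start key not after the row key. A minimal sketch, assuming a single table, ASCII string keys (for which Java's String order matches byte order), and a hypothetical region-to-server layout. Clients cache these locations, so the ZK, -ROOT- and .META. hops are not repeated on every get().

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical .META.-style index for one table: region start key -> hosting region server.
class RegionLocator {
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    // The first region of a table starts at the empty key "".
    void addRegion(String startKey, String server) { startKeyToServer.put(startKey, server); }

    // The owning region has the greatest start key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> e = startKeyToServer.floorEntry(rowKey);
        if (e == null) throw new IllegalStateException("no region covers " + rowKey);
        return e.getValue();
    }
}
```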
  • 36. HBase - Write Path [diagram]
    ○ -ROOT- table keeps track of the .META. table; .META. lists all regions in the system and never splits
    ○ Client asks ZK: Where is -ROOT-? A: RegionServer1; then asks RegionServer1: Where is .META.? A: RegionServer2
    ○ Client sends HTable.put() to Region Server2, which appends the edit to the HLog/WAL, writes it to the MemStore, and returns a code
    ○ The MemStore is flushed offline to HDFS blocks
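The write path (WAL first, then MemStore, with the MemStore flushed to immutable sorted files) and the matching read order (MemStore first, then files from newest to oldest) fit in a few lines. This toy model flushes by entry count rather than MemStore size in bytes, keeps the "HFiles" as in-memory sorted maps, and uses hypothetical names throughout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy region: WAL + MemStore + flushed "HFiles" (oldest first in the list).
class ToyRegion {
    final List<String[]> wal = new ArrayList<>();
    final List<TreeMap<String, String>> hfiles = new ArrayList<>();
    private TreeMap<String, String> memStore = new TreeMap<>();
    private final int flushThreshold;

    ToyRegion(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void put(String row, String value) {
        wal.add(new String[] {row, value}); // 1. durable log first
        memStore.put(row, value);           // 2. then the sorted in-memory store
        if (memStore.size() >= flushThreshold) flush();
    }

    // Write the MemStore out as a new immutable sorted file and start a fresh one.
    void flush() {
        hfiles.add(memStore);
        memStore = new TreeMap<>();
    }

    // Read path: MemStore first, then HFiles from newest to oldest.
    String get(String row) {
        if (memStore.containsKey(row)) return memStore.get(row);
        for (int i = hfiles.size() - 1; i >= 0; i--) {
            String v = hfiles.get(i).get(row);
            if (v != null) return v;
        }
        return null;
    }

    // Crash recovery: rebuild a lost MemStore by replaying the WAL in order.
    // (Real HBase skips edits already persisted by a flush; this toy replays everything.)
    void recoverMemStore() {
        memStore = new TreeMap<>();
        for (String[] edit : wal) memStore.put(edit[0], edit[1]);
    }
}
```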
  • 37. HBase - Shell
    ○ Table metadata: e.g. create/alter/drop/describe table
    ○ Table data: e.g. put/scan/delete/count row(s)
    ○ Admin: e.g. flush/rebalance/compact regions, split tables
    ○ Replication tools: e.g. add/enable/list/start/stop replication
    ○ Security: e.g. grant/revoke/list user permissions
    ○ Shell interaction example:

      hbase(main):001:0> create 'myTable', 'myColFam1'
      0 row(s) in 3.8890 seconds
      hbase(main):002:0> put 'myTable', 'row-1', 'myColFam1:col1', 'value-1'
      0 row(s) in 0.1840 seconds
      hbase(main):003:0> scan 'myTable'
      ROW      COLUMN+CELL
      row-1    column=myColFam1:col1, timestamp=1457381922312, value=value-1
      1 row(s) in 0.1160 seconds
      hbase(main):004:0>
  • 38. HBase - Advanced Topics
    ○ Bulk loading
    ○ Cluster replication
    ○ Merging and splitting of regions
    ○ Predicate pushdown using server-side filters
    ○ Bloom filters
    ○ Co-processors
    ○ Snapshots
    ○ Performance tuning
  • 39. HBase - What it's not
    ○ HBase is not for everyone
    ○ Has no support for
      ■ SQL
      ■ Joins
      ■ Secondary indexes
      ■ Transactions across rows
      ■ JDBC driver
    ○ Works well with large deployments
    ○ Requires good working knowledge of the Hadoop ecosystem
  • 40. HBase - What it's good at
    ● Strongly consistent reads/writes
    ● Automatic sharding
    ● Automatic RegionServer failover
    ● MapReduce support, using HBase as both source and sink
    ● Works on top of HDFS
    ● Java client API and a REST/Thrift API
    ● Block cache and Bloom filter support
    ● Web UI and JMX support for operational management
  • 43. Redis
    ● Redis is an open source, in-memory key-value store with disk persistence
    ● Originally written at LLOOGG by Salvatore Sanfilippo, ~2009
    ● Written in ANSI C; works on most Linux systems
    ● No external dependencies
    ● Very small: ~1MB memory per instance
    ● Values can be data structures: String, List, Hash, Set, Sorted Set
    ● Compressed in-memory representation of data
    ● Clients are available in many languages: C, C#, Clojure, Scala, Lua...
  • 44. Redis Key Components
    [Diagram: one single-threaded Redis server per CPU (CPU 1, CPU 2, ..., CPU N), each with highly optimized memory storage and a highly optimized network layer, sharing the machine's memory and network]
  • 46-48. Redis Network Layer [diagram: Client <-> TCP <-> Server, with requests 1, 2, 3, 4…10000 and a response queue]
    ● Typical request/response system
      ○ For 10K requests, 20K network calls
      ○ If each call takes 1ms, 20 secs are lost
    ● Use batching, called pipelining
      ○ Send one response for 10K requests
      ○ Saves 10 seconds for 10K calls
    ● Bypasses heavyweight OS socket-layer abstractions
      ○ Uses low-level epoll(), kqueue(), select() calls
    ● Low overhead of waiting threads
    ● Allows handling close to 10K concurrent clients
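Pipelining works because a client can write many commands back-to-back before reading any reply. The sketch below only builds the wire bytes: it encodes each command in RESP, Redis's request format (an array of bulk strings), and concatenates a batch into one buffer that could go out in a single socket write. It also redoes the slide's arithmetic under the slide's own model (1 ms per network call, one response for the whole batch). Class and method names are hypothetical; no Redis client library is involved.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: RESP encoding for a pipelined batch of Redis commands.
class RespPipeline {
    // One command as a RESP array: "*<argc>\r\n", then "$<len>\r\n<arg>\r\n" per argument.
    static byte[] encode(String... args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeAscii(out, "*" + args.length + "\r\n");
        for (String arg : args) {
            byte[] bytes = arg.getBytes(StandardCharsets.UTF_8);
            writeAscii(out, "$" + bytes.length + "\r\n"); // length in bytes, not chars
            out.write(bytes, 0, bytes.length);
            writeAscii(out, "\r\n");
        }
        return out.toByteArray();
    }

    // A pipeline is the concatenation of encoded commands, sent in one write.
    static byte[] pipeline(List<String[]> commands) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String[] cmd : commands) {
            byte[] b = encode(cmd);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Slide's model: a request and its response are each one network call.
    static long unpipelinedMs(long requests, long callMs) { return 2 * requests * callMs; }

    // Slide's model with pipelining: the requests plus one batched response.
    static long pipelinedMs(long requests, long callMs) { return (requests + 1) * callMs; }

    private static void writeAscii(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        out.write(b, 0, b.length);
    }
}
```

With 10,000 requests at 1 ms per call this gives 20,000 ms versus 10,001 ms, the ~10 seconds saved on the slide.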
  • 49. Redis Memory Optimizations
    ● Integer encoding for small values
    ● Small hashes are converted to arrays
      ○ Leverages CPU caching
    ● A 32-bit build can be used when possible (smaller pointers, but at most 4 GB of memory)
    ● Leads to 5X to 10X memory savings
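The "small hashes are converted to arrays" idea can be sketched as a map that keeps fields and values in one flat array (contiguous, no per-entry objects, linear search) and switches to a real hash table once it grows past a threshold. Redis makes the threshold configurable (hash-max-ziplist-entries and hash-max-ziplist-value in the versions of this talk's era); here it is just a constructor parameter. Real Redis encodings are byte-packed, which this sketch does not attempt, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical compact hash: flat [field0, value0, field1, value1, ...] array
// until it holds more than maxCompactEntries, then a regular hash table.
class CompactHash {
    private final int maxCompactEntries;
    private String[] flat = new String[0];
    private Map<String, String> table; // null while in compact form

    CompactHash(int maxCompactEntries) { this.maxCompactEntries = maxCompactEntries; }

    boolean isCompact() { return table == null; }

    void put(String field, String value) {
        if (table != null) { table.put(field, value); return; }
        for (int i = 0; i < flat.length; i += 2) { // linear scan: cheap for small sizes
            if (flat[i].equals(field)) { flat[i + 1] = value; return; }
        }
        String[] grown = Arrays.copyOf(flat, flat.length + 2);
        grown[flat.length] = field;
        grown[flat.length + 1] = value;
        flat = grown;
        if (flat.length / 2 > maxCompactEntries) convert();
    }

    String get(String field) {
        if (table != null) return table.get(field);
        for (int i = 0; i < flat.length; i += 2) {
            if (flat[i].equals(field)) return flat[i + 1];
        }
        return null;
    }

    // In this sketch the conversion is one-way: the hash never shrinks back.
    private void convert() {
        table = new HashMap<>();
        for (int i = 0; i < flat.length; i += 2) table.put(flat[i], flat[i + 1]);
        flat = null;
    }
}
```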
  • 50. Redis Enterprise Features
    [Diagram: a client talks to two shards; each shard is a cluster (Cluster 1, Cluster 2) with a Redis master that replicates asynchronously to two slaves (Slave1, Slave2)]
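In a two-shard layout like this, the client decides which master owns a key. One stable way to do that, borrowed from the Redis Cluster specification, is CRC16 (XMODEM variant) of the key modulo 16384 hash slots, with slot ranges assigned to shards. The sketch assumes an even split of slots across shards and ignores hash tags; only the slot function follows the spec, and the shard assignment and names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Client-side shard selection using the Redis Cluster key-slot function.
class SlotSharding {
    static final int SLOTS = 16384;

    // CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, MSB first, no final XOR.
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = (crc & 0x8000) != 0 ? ((crc << 1) ^ 0x1021) & 0xFFFF : (crc << 1) & 0xFFFF;
            }
        }
        return crc;
    }

    static int slot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOTS;
    }

    // Assumed layout: shard i owns a contiguous, equal-sized range of slots.
    static int shardFor(String key, int shards) {
        return slot(key) * shards / SLOTS;
    }
}
```

Hashing into a fixed number of slots, rather than directly into the number of shards, lets a cluster move whole slots between masters without rehashing every key.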
  • 51. Redis Wrap-Up
    ● Super-fast in-memory KV store
    ● Provides a CLI
    ● Typical apps will require client-side coding
    ● Spills to disk for large data sets, with reduced performance
    ● Upcoming “cluster” feature will keep 3 copies for HA
  • 53. Questions?