
Kudu: New Hadoop Storage for Fast Analytics on Fast Data

Todd Lipcon's talk from the NYC HUG meeting on 9/28/2015


  1. Todd Lipcon, on behalf of the Kudu team. Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
  2. The conference for and by data scientists, from startup to enterprise: wrangleconf.com. Public registration is now open! Who: featuring data scientists from Salesforce, Uber, Pinterest, and more. When: Thursday, October 22, 2015. Where: Broadway Studios, San Francisco
  3. Kudu: Storage for Fast Analytics on Fast Data
     • New updating column store for Hadoop
     • Apache-licensed open source
     • Beta now available
  4. Motivation and Goals: Why build Kudu?
  5. Motivating Questions
     • Are there user problems that we can't address because of gaps in Hadoop ecosystem storage technologies?
     • Are we positioned to take advantage of advancements in the hardware landscape?
  6. Current Storage Landscape in Hadoop
     • HDFS excels at: efficiently scanning large amounts of data; accumulating data with high throughput
     • HBase excels at: efficiently finding and writing individual rows; making data mutable
     • Gaps exist when these properties are needed simultaneously
  7. Changing Hardware Landscape
     • Spinning disk -> solid state storage
       • NAND flash: up to 450k read / 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at a price of less than $3/GB and dropping
       • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
     • RAM is cheaper and more abundant: 64 -> 128 -> 256 GB over the last few years
     • Takeaway 1: the next bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind
     • Takeaway 2: column stores are feasible for random access
  8. Kudu Design Goals
     • High throughput for big scans (columnar storage and replication). Goal: within 2x of Parquet
     • Low latency for short accesses (primary key indexes and quorum replication). Goal: 1 ms read/write on SSD
     • Database-like semantics (initially single-row ACID)
     • Relational data model
       • SQL query
       • "NoSQL"-style scan/insert/update (Java client)
  9. Kudu Usage
     • Table has a SQL-like schema
       • Finite number of columns (unlike HBase/Cassandra)
       • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
       • Some subset of columns makes up a possibly-composite primary key
       • Fast ALTER TABLE
     • Java and C++ "NoSQL"-style APIs: Insert(), Update(), Delete(), Scan() (see the client sketch after this item)
     • Integrations with MapReduce, Spark, and Impala; more to come!
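To make the "NoSQL"-style API concrete, here is a minimal sketch using the Apache Kudu Java client. It assumes the modern org.apache.kudu.client package name (the beta shipped as org.kududb.client); the master address, table name, and columns are hypothetical:

```java
import org.apache.kudu.client.*;

public class KuduInsertExample {
  public static void main(String[] args) throws KuduException {
    // Connect to the cluster via the master address (hypothetical host).
    KuduClient client =
        new KuduClient.KuduClientBuilder("master-host:7051").build();
    try {
      KuduTable table = client.openTable("metrics");  // hypothetical table
      KuduSession session = client.newSession();

      // Insert one row using the "NoSQL"-style API.
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addString("host", "host1.example.com");
      row.addString("metric", "cpu_load");
      row.addLong("timestamp", System.currentTimeMillis());
      row.addDouble("value", 0.75);
      session.apply(insert);
      session.flush();
    } finally {
      client.shutdown();
    }
  }
}
```

Update() and Delete() follow the same pattern via table.newUpdate() and table.newDelete(), keyed by the primary key columns.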
  10. Use cases and architectures
  11. Kudu Use Cases. Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes:
     • Time series. Examples: streaming market data; fraud detection & prevention; risk monitoring. Workload: inserts, updates, scans, lookups
     • Machine data analytics. Examples: network threat detection. Workload: inserts, scans, lookups
     • Online reporting. Examples: ODS. Workload: inserts, updates, scans, lookups
  12. Real-Time Analytics in Hadoop Today. Fraud detection in the real world = storage complexity. Considerations:
     • How do I handle failure during this process?
     • How often do I reorganize incoming data into a format appropriate for reporting?
     • When reporting, how do I see data that has not yet been reorganized?
     • How do I ensure that important jobs aren't interrupted by maintenance?
     (Diagram: incoming data arrives from a messaging system into HBase. Once enough data has accumulated, the HBase data is reorganized into a Parquet file: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file. Reporting requests go to Impala on HDFS, spanning the new partition, the most recent partition, and historic data.)
  13. Real-Time Analytics in Hadoop with Kudu. Improvements:
     • One system to operate
     • No cron jobs or background processes
     • Handle late arrivals or data corrections with ease
     • New data available immediately for analytics or operations
     (Diagram: incoming data from a messaging system lands directly in Kudu, which stores historical and real-time data together and serves reporting requests.)
  14. How it works: replication and distribution
  15. Tables and Tablets
     • A table is horizontally partitioned into tablets, using range or hash partitioning, e.g. PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS (see the table-creation sketch after this item)
     • Each tablet has N replicas (3 or 5), with Raft consensus
       • Reads allowed from any replica, plus leader-driven writes with low MTTR
     • Tablet servers host the tablets and store data on local disks (no HDFS)
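The same partitioning can be expressed through the Java client at table-creation time. A minimal sketch, again assuming the modern org.apache.kudu.client package; the schema mirrors the slide's (host, metric, timestamp) example:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateMetricsTable {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("master-host:7051").build();
    try {
      // Composite primary key: (host, metric, timestamp).
      List<ColumnSchema> cols = Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("timestamp", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build());
      Schema schema = new Schema(cols);

      // Hash-partition on timestamp into 100 tablets, 3 replicas each.
      CreateTableOptions opts = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("timestamp"), 100)
          .setNumReplicas(3);
      client.createTable("metrics", schema, opts);
    } finally {
      client.shutdown();
    }
  }
}
```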
  16. Metadata
     • Replicated master*
       • Acts as a tablet directory (the "META" table)
       • Acts as a catalog (table schemas, etc.)
       • Acts as a load balancer (tracks tablet server liveness, re-replicates under-replicated tablets)
     • Caches all metadata in RAM for high performance
       • 80-node load test, GetTableLocations RPC performance: 99th percentile 68us, 99.99th percentile 657us, at <2% peak CPU usage
     • Clients are configured with the master addresses, ask the master for tablet locations as needed, and cache them
  18. Raft consensus. (Diagram of the write path across tablet servers A (LEADER), B and C (FOLLOWERs), each with a WAL: 1. the client sends a Write() RPC to the leader; 2a. the leader sends UpdateConsensus() RPCs to the followers while 2b. writing its local WAL; 3. each follower writes its WAL and 4. replies success; 5. once the leader has achieved a majority, 6. it returns Success! to the client.) A toy model of this majority-ack write path follows.
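To make the ack counting concrete, here is a toy model of the leader's side of the exchange. Every name here (RaftLeader, Follower, updateConsensus, the timeout) is illustrative; this is not Kudu's consensus implementation:

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Toy model of the leader-driven write path from the slide: the leader
 * appends to its own WAL, fans out UpdateConsensus() to followers, and
 * acks the client once a majority (counting itself) is durable.
 */
class RaftLeader {
  interface Follower { boolean updateConsensus(byte[] walEntry); }

  private final List<Follower> followers;
  private final ExecutorService pool = Executors.newCachedThreadPool();

  RaftLeader(List<Follower> followers) { this.followers = followers; }

  boolean write(byte[] walEntry) throws InterruptedException {
    appendToLocalWal(walEntry);            // step 2b: the leader's WAL is one ack
    int clusterSize = followers.size() + 1;
    int majority = clusterSize / 2 + 1;
    CountDownLatch acksNeeded = new CountDownLatch(majority - 1);
    for (Follower f : followers) {
      pool.submit(() -> {                  // steps 2a-4: replicate in parallel
        if (f.updateConsensus(walEntry)) acksNeeded.countDown();
      });
    }
    // Steps 5-6: success once a majority is durable. A real system would
    // retry and trigger re-election on timeout rather than simply fail.
    return acksNeeded.await(5, TimeUnit.SECONDS);
  }

  private void appendToLocalWal(byte[] entry) { /* fsync to the local log */ }
}
```

Note how a transient follower failure (next slide) costs nothing here: the latch only needs a majority, not every replica.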
  19. Fault tolerance
     • Transient FOLLOWER failure:
       • The leader can still achieve a majority
       • Restart the follower TS within 5 min and it will rejoin transparently
     • Transient LEADER failure:
       • Followers expect to hear a heartbeat from their leader every 1.5 seconds
       • 3 missed heartbeats: leader election! A new LEADER is elected from the remaining nodes within a few seconds
       • Restart within 5 min and it rejoins as a FOLLOWER
     • N replicas handle (N-1)/2 failures
  20. Fault tolerance (2)
     • Permanent failure:
       • The leader notices that a follower has been dead for 5 minutes and evicts it
       • The master selects a new replica
       • The leader copies the data over to the new one, which joins as a new FOLLOWER
  21. How it works: storage engine internals
  22. Tablet design
     • Inserts are buffered in an in-memory store (like HBase's memstore), then flushed to disk in a columnar layout similar to Apache Parquet
     • Updates use MVCC: updates are tagged with a timestamp, not applied in place
       • Allows "SELECT ... AS OF <timestamp>" queries and consistent cross-tablet scans (see the sketch after this item)
     • Near-optimal read path for "current time" scans: no per-row branches, fast vectorized decoding and predicate evaluation
     • Performance degrades with the number of recent updates to the rows being scanned
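A self-contained sketch of the timestamp-tagged update idea, independent of Kudu's actual data structures: writes never overwrite in place, and a reader at snapshot time T sees the newest version tagged at or before T.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative MVCC store: per-key version history ordered by timestamp. */
class MvccStore<K, V> {
  // Per key: timestamp -> value. (Per-key TreeMap access is not
  // synchronized here; this is a single-threaded illustration.)
  private final Map<K, NavigableMap<Long, V>> versions = new ConcurrentHashMap<>();

  void update(K key, long timestamp, V value) {
    // Never in-place: each update adds a new timestamped version.
    versions.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
  }

  /** "SELECT ... AS OF snapshotTs": newest version at or before the snapshot. */
  V readAsOf(K key, long snapshotTs) {
    NavigableMap<Long, V> history = versions.get(key);
    if (history == null) return null;
    Map.Entry<Long, V> e = history.floorEntry(snapshotTs);
    return e == null ? null : e.getValue();
  }
}
```

The "performance degrades with recent updates" bullet falls out of this picture: the more versions a reader must skip past, the more work per row.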
  23. LSM vs Kudu
     • LSM = Log-Structured Merge (Cassandra, HBase, etc.)
       • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
       • Reads perform an on-the-fly merge of all on-disk HFiles
     • Kudu
       • Shares some traits (memstores, compactions), but is more complex
       • Slower writes in exchange for faster reads (especially scans)
  24. LSM insert path. (Diagram: an INSERT of row r1 (col c1 = "blah", col c2 = "1") lands in the MemStore, which later flushes to HFile 1.)
  25. LSM insert path (continued). (Diagram: a second insert passes through the MemStore and flushes to HFile 2 (row r2: c1 = "blah2", c2 = "2"), alongside the existing HFile 1 still holding row r1.)
  26. LSM update path. (Diagram: an UPDATE of row r2's c1 to "newval" simply goes to the MemStore; HFile 1 and HFile 2 are untouched.) Note: all updates are "fully decoupled" from reads; a random-write workload is transformed into fully sequential I/O.
  27. LSM read path. (Diagram: a read merges the MemStore, HFile 1, and HFile 2 based on string row keys to produce R1: c1=blah, c2=2 and R2: c1=newval, c2=5, ...) This is CPU-intensive: row keys must always be read and compared, any given row may exist across multiple HFiles so a merge is always required, and the more HFiles to merge, the slower the read. (A sketch of this merge follows.)
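A simplified sketch of the merge the slide describes: every sorted run (the MemStore plus each HFile) contributes rows keyed by string, and newer runs win for duplicate keys. The data mirrors the diagram; the merge strategy here is deliberately naive.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustrative LSM read path: merge all sorted runs by row key. */
class LsmMergeRead {
  /** Each run maps rowKey -> value, sorted by key; runs.get(0) is newest. */
  static SortedMap<String, String> mergeRuns(List<SortedMap<String, String>> runs) {
    SortedMap<String, String> result = new TreeMap<>();
    // Apply oldest-to-newest so newer runs overwrite older values.
    // Every key is read and compared: this is the CPU cost on the slide.
    for (int i = runs.size() - 1; i >= 0; i--) {
      result.putAll(runs.get(i));
    }
    return result;
  }

  public static void main(String[] args) {
    SortedMap<String, String> memstore = new TreeMap<>(Map.of("r2", "newval"));
    SortedMap<String, String> hfile2 = new TreeMap<>(Map.of("r2", "v2"));
    SortedMap<String, String> hfile1 = new TreeMap<>(Map.of("r1", "blah"));
    System.out.println(mergeRuns(List.of(memstore, hfile2, hfile1)));
    // {r1=blah, r2=newval} -- r2's newest version wins
  }
}
```

The work grows with the number of runs, which is exactly why "the more HFiles to merge, the slower it reads."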
  28. Kudu storage: inserts and flushes. (Diagram: INSERT("todd", "$1000", "engineer") goes to the MemRowSet, which flushes to DiskRowSet 1 with columnar name/pay/role data.)
  29. Kudu storage: inserts and flushes (continued). (Diagram: a later INSERT("doug", "$1B", "Hadoop man") passes through the MemRowSet and flushes to DiskRowSet 2; each flush produces a new DiskRowSet.)
  30. Kudu storage: updates. (Diagram: each DiskRowSet has its own DeltaMemStore ("Delta MS") that accumulates updates against that rowset's base data.)
  31. Kudu storage: updates (continued). (Diagram: for UPDATE SET pay="$1M" WHERE name="todd", each DiskRowSet's bloom filter is checked to see whether it might hold the row. DiskRowSet 2's bloom says no; DiskRowSet 1's says maybe, so its key column is searched to find the row's offset, rowid = 150, and the delta "150: col 1=$1M" is recorded in that rowset's DeltaMemStore.) A sketch of this bloom-gated lookup follows.
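Here is an illustrative version of that lookup, using Guava's BloomFilter as a stand-in for Kudu's per-DiskRowSet blooms; the DiskRowSet class and its fields are hypothetical, not Kudu's internal types.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

/** Illustrative bloom-gated update path. */
class BloomGatedUpdate {
  static class DiskRowSet {
    final List<String> sortedKeys;            // stand-in for the on-disk key column
    final BloomFilter<String> bloom;

    DiskRowSet(List<String> sortedKeys) {
      this.sortedKeys = sortedKeys;
      this.bloom = BloomFilter.create(
          Funnels.stringFunnel(StandardCharsets.UTF_8), sortedKeys.size());
      sortedKeys.forEach(bloom::put);
    }

    /** Returns the row's ordinal offset (rowid), or -1 if absent. */
    int findRowId(String key) {
      if (!bloom.mightContain(key)) return -1;  // "Bloom says: no!" -- skip the I/O
      int idx = Collections.binarySearch(sortedKeys, key);  // search the key column
      return idx >= 0 ? idx : -1;               // negative: bloom false positive
    }
  }

  public static void main(String[] args) {
    DiskRowSet drs1 = new DiskRowSet(Arrays.asList("alice", "todd", "zach"));
    DiskRowSet drs2 = new DiskRowSet(Arrays.asList("bob", "mary"));
    for (DiskRowSet drs : Arrays.asList(drs2, drs1)) {
      int rowid = drs.findRowId("todd");
      if (rowid >= 0) {
        // Record the delta in this rowset's DeltaMemStore, keyed by rowid.
        System.out.println("delta: rowid " + rowid + " -> pay=$1M");
      }
    }
  }
}
```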
  32. Kudu storage: read path. (Diagram: rows are read one DiskRowSet at a time, applying that rowset's deltas, e.g. "150: pay=$1M", as they are scanned.) Any given row lives in exactly one DiskRowSet, so there is no cross-DiskRowSet merge. Updates are merged in by ordinal offset within the DiskRowSet: array indexing, no string compares. (A sketch of delta application by rowid follows.)
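A sketch of why offset-based merging is cheap: the base column is a plain array and deltas address rows by index, so applying them needs no key comparisons at all. Illustrative only, not Kudu's internal code.

```java
import java.util.Arrays;
import java.util.Map;

/** Illustrative delta application by ordinal offset (rowid). */
class DeltaApplyByRowId {
  /** Returns the "pay" column with deltas applied; the base stays immutable. */
  static String[] scanPayColumn(String[] basePay, Map<Integer, String> deltas) {
    String[] out = basePay.clone();
    for (Map.Entry<Integer, String> d : deltas.entrySet()) {
      out[d.getKey()] = d.getValue();   // pure array indexing, no string compares
    }
    return out;
  }

  public static void main(String[] args) {
    // Tiny example: the slide's delta would be "rowid 150 -> pay=$1M";
    // here we apply a delta at rowid 0 of a three-row column.
    String[] basePay = {"$1000", "$900", "$1100"};
    System.out.println(Arrays.toString(scanPayColumn(basePay, Map.of(0, "$1M"))));
    // [$1M, $900, $1100]
  }
}
```

Contrast this with the LSM read sketch above, where every row requires a string-key comparison during the merge.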
  33. Kudu storage: delta flushes. (Diagram: a DeltaMemStore flushes to an on-disk REDO DeltaFile, e.g. "0: pay=foo".) A REDO delta indicates how to transform the columnar base data into a later version.
  34. Kudu storage: major delta compaction. (Diagram: many REDO deltas accumulate against a DiskRowSet, meaning lots of delta-application work on reads. A major delta compaction merges updates into the base data for columns with a high update percentage, producing UNDO deltas for reading older versions. A column with few updates doesn't need to be rewritten: its deltas are maintained in a new DeltaFile of unmerged REDO deltas.)
  35. Kudu storage: RowSet compactions. Compaction reorganizes rows to avoid rowsets with overlapping key ranges. (Diagram: three 32MB DiskRowSets with overlapping ranges, DRS 1 [alice..zach], DRS 2 [bob..zeke], DRS 3 [carl..zoe], are rewritten into three 32MB DiskRowSets with disjoint ranges: DRS 4 [alice, bob, carl, joe], DRS 5 [jon, julie, linda, mary], DRS 6 [omar, zach, zeke, zoe].)
  36. Kudu storage: compaction policy
     • Solves an optimization problem (a knapsack problem): minimize the "height" of rowsets for the average key lookup, which bounds the number of seeks for a write or random read
     • Restricts the total I/O of any compaction to a budget (128MB), so there are no long compactions, ever
     • No "minor" vs. "major" distinction
     • Always be compacting or flushing, on low-I/O-priority maintenance threads
     (A greedy sketch of budgeted selection follows.)
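A toy sketch of budgeted selection in the spirit of the slide: treat candidate rowsets as knapsack items and greedily pick the ones that reduce the most overlap "height" per byte of I/O without exceeding the budget. Kudu's real policy is more sophisticated; the fields and scoring here are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative budgeted compaction picker (greedy knapsack heuristic). */
class CompactionPicker {
  static class RowSet {
    final long sizeBytes;
    final double overlapScore;   // hypothetical: how much this rowset overlaps peers
    RowSet(long sizeBytes, double overlapScore) {
      this.sizeBytes = sizeBytes;
      this.overlapScore = overlapScore;
    }
  }

  static final long IO_BUDGET = 128L * 1024 * 1024;   // 128MB per compaction

  static List<RowSet> pick(List<RowSet> candidates) {
    List<RowSet> sorted = new ArrayList<>(candidates);
    // Best benefit-per-cost first.
    sorted.sort(Comparator.comparingDouble(
        (RowSet r) -> r.overlapScore / r.sizeBytes).reversed());
    List<RowSet> chosen = new ArrayList<>();
    long budgetLeft = IO_BUDGET;
    for (RowSet r : sorted) {
      if (r.sizeBytes <= budgetLeft) {
        chosen.add(r);
        budgetLeft -= r.sizeBytes;
      }
    }
    return chosen;   // small, bounded compactions -- never a long one
  }
}
```

Capping each round at the budget is what guarantees "no long compactions, ever": maintenance runs constantly in small, cheap steps instead of occasional huge ones.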
  37. Kudu trade-offs
     • Random updates will be slower
       • The HBase model allows random updates without incurring a disk seek
       • Kudu requires a key lookup before an update and a bloom lookup before an insert
     • Single-row reads may be slower
       • The columnar design is optimized for scans
       • Future: may introduce "column groups" for applications where single-row access is more important
       • Especially slow at reading a row that has had many recent updates (e.g. the YCSB "zipfian" workload)
  38. Benchmarks
  39. TPC-H (analytics benchmark)
     • 75 tablet servers + 1 master cluster
     • 12 (spinning) disks each, enough RAM to fit the dataset
     • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
     • TPC-H scale factor 100 (100GB)
     • Example query:
       SELECT n_name, sum(l_extendedprice * (1 - l_discount)) AS revenue
       FROM customer, orders, lineitem, supplier, nation, region
       WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
         AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
         AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
         AND r_name = 'ASIA'
         AND o_orderdate >= date '1994-01-01'
         AND o_orderdate < date '1995-01-01'
       GROUP BY n_name
       ORDER BY revenue DESC;
  40. Results: Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data; Parquet is likely to outperform Kudu for HDD-resident data (larger I/O requests).
  41. What about Apache Phoenix?
     • 10-node cluster (9 workers, 1 master)
     • HBase 1.0, Phoenix 4.3
     • TPC-H LINEITEM table only (6B rows)
     Results, in seconds (reconstructed from the slide's log-scale chart, series in legend order):
       Operation            Phoenix    Kudu    Parquet
       Load                 2152       1918    155
       TPCH Q1              219        13.2    9.3
       COUNT(*)             76         1.7     1.4
       COUNT(*) WHERE...    131        0.7     1.5
       single-row lookup    0.04       0.15    1.37
  42. What about NoSQL-style random access? (YCSB)
     • YCSB 0.5.0-snapshot
     • 10-node cluster (9 workers, 1 master)
     • HBase 1.0
     • 100M rows, 10M ops
  43. What Kudu is not
  44. Kudu is…
     • NOT a SQL database: "BYO SQL"
     • NOT a filesystem: data must have tabular structure
     • NOT a replacement for HBase or HDFS: Cloudera continues to invest in those systems, and many use cases remain where they're more appropriate
     • NOT an in-memory database: very fast for memory-sized workloads, but can operate on larger data too!
  45. Getting started
  46. Getting started as a user
     • http://getkudu.io
     • kudu-user@googlegroups.com
     • Quickstart VM: the easiest way to get started, with Impala and Kudu in an easy-to-install VM
     • CSD and parcels, for installation on a Cloudera Manager-managed cluster
  47. Getting started as a developer
     • http://github.com/cloudera/kudu: all commits go here first
     • Public gerrit: http://gerrit.cloudera.org, where all code reviews happen
     • Public JIRA: http://issues.cloudera.org, including bugs going back to 2013. Come see our dirty laundry!
     • kudu-dev@googlegroups.com
     • Apache 2.0-licensed open source
     • Contributions are welcome and encouraged!
  48. Demo? (If we have time, and the internet gods are willing.)
  49. http://getkudu.io/ @getkudu
