
Introducing Kudu


Presented to the NYC Big Data Warehousing meetup in December 2015.

Published in: Data & Analytics


  1. Introducing Kudu
     Jeremy Beard | Senior Solutions Architect
     December 2015
     © Cloudera, Inc. All rights reserved.
  2. Presenter
       • Jeremy Beard
       • Senior Solutions Architect at Cloudera
       • Three years in big data
       • Six years in data warehousing
  3. Current storage landscape in Hadoop
     HDFS excels at:
       • Efficiently scanning large amounts of data
       • Accumulating data with high throughput
     HBase excels at:
       • Efficiently finding and writing individual rows
       • Making data mutable
     Gaps exist when these properties are needed simultaneously.
  4. Changing hardware landscape
       • Spinning disk -> solid state storage
           • NAND flash: up to 450k read and 250k write IOPS, about 2 GB/sec read and 1.5 GB/sec write throughput, at less than $3/GB and dropping
           • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
       • RAM is cheaper and more abundant:
           • 64 -> 128 -> 256 GB over the last few years
       • Takeaway 1: The next bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind
       • Takeaway 2: Column stores are feasible for random access
  5. Kudu: Storage for Fast Analytics on Fast Data
       • New updating column store for Hadoop
       • Simplifies the architecture for building analytic applications on changing data
       • Designed for fast analytic performance
       • Natively integrated with Hadoop
       • Donated as an incubating project at the Apache Software Foundation
       • Beta now available
     [Diagram: Hadoop stack. Integrate: structured (Sqoop), unstructured (Kafka, Flume). Store: filesystem (HDFS), relational (Kudu), NoSQL (HBase). Process, analyze, serve: batch (Spark, Hive, Pig, MapReduce), stream (Spark), SQL (Impala), search (Solr), SDK (Kite). Unified services: resource management (YARN), security (Sentry, RecordService).]
  6. Kudu design goals
       • High throughput for big scans (columnar storage and replication)
           • Goal: within 2x of Parquet
       • Low latency for short accesses (primary key indexes and quorum design)
           • Goal: 1 ms read/write on SSD
       • Database-like semantics (initially single-row ACID)
       • Relational data model
       • SQL query
       • "NoSQL"-style scan/insert/update (Java client)
  7. Kudu basic design
       • Apache-licensed open source software
       • Structured data model
       • Basic construct: tables
           • Tables broken down into tablets (roughly equivalent to partitions)
       • Architecture supports geographically disparate, active/active systems
           • Not the initial design goal
  8. What Kudu is not
       • Not a SQL interface
           • Just the storage layer
           • "BYO SQL"
       • Not a file system
           • Data must have tabular structure
       • Not an application that runs on HDFS
           • An alternative, native Hadoop storage engine
       • Not a replacement for HDFS or HBase
           • Select the right storage for the right use case
           • Cloudera will continue to support and invest in all three
  9. Kudu data model
       • Tables have an RDBMS-like schema
           • Finite number of columns (unlike HBase/Cassandra)
           • Types: BOOL, INT8/16/32/64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
       • Some subset of columns makes up a primary key
           • Fast random reads/writes by primary key
           • No secondary indexes (yet)
       • Columnar layout on disk
           • Lazy materialization
           • Encoding and compression options
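To illustrate what lazy materialization over a columnar layout means, here is a minimal Python sketch. It is not Kudu code: the `scan` function, the in-memory dict-of-lists layout, and the column names are all assumptions made for illustration. The idea shown is that the predicate column is read first, and the projected columns are touched only for the rows that matched.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical in-memory column store: one list per column, aligned by row index.
ColumnStore = Dict[str, List[Any]]


def scan(columns: ColumnStore,
         predicate_col: str,
         predicate: Callable[[Any], bool],
         projected: Sequence[str]) -> List[Tuple[Any, ...]]:
    """Evaluate the predicate on one column, then materialize only matching rows."""
    # Step 1: read only the predicate column and collect matching row indexes.
    matching = [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]
    # Step 2: fetch values from the projected columns for those indexes only.
    return [tuple(columns[c][i] for c in projected) for i in matching]


metrics: ColumnStore = {
    "host": ["web-01", "web-02", "db-01", "web-01"],
    "metric": ["cpu", "cpu", "cpu", "mem"],
    "value": [0.91, 0.12, 0.77, 0.40],
}

busy = scan(metrics, "value", lambda v: v > 0.5, ["host", "metric"])
print(busy)  # [('web-01', 'cpu'), ('db-01', 'cpu')]
```

A row-oriented layout would have to read every field of every row to answer the same query; here the `host` and `metric` columns are read for two rows only.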
  10. Table partitioning
       • Hash bucketing
           • Distribute records by hash of the partition column(s)
           • N buckets lead to N tablets
       • Range partitioning
           • Distribute records by ranges of the partition column(s)
           • N split keys lead to N + 1 tablets
       • Can be a mix for different columns of the primary key
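The two partitioning schemes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kudu's implementation: the function names are hypothetical, CRC32 stands in for whatever hash function Kudu actually uses, and the `host` and `ts` columns are invented for the example.

```python
import bisect
import zlib
from typing import Sequence, Tuple


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a partition column value to one of num_buckets buckets."""
    # CRC32 is a deterministic stand-in here, not Kudu's own hash function.
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def range_index(value: int, split_keys: Sequence[int]) -> int:
    """Return the index of the range that value falls into.

    split_keys must be sorted. A value equal to a split key is placed
    in the range to the right of that key.
    """
    return bisect.bisect_right(split_keys, value)


def tablet_for(host: str, ts: int, num_buckets: int,
               split_keys: Sequence[int]) -> Tuple[int, int]:
    """Mixed scheme: hash one key column, range-partition another."""
    return hash_bucket(host, num_buckets), range_index(ts, split_keys)


splits = [1000, 2000, 3000]
print(tablet_for("web-01", 1500, 4, splits))
```

In this sketch three split keys yield four ranges, because values below the first split key and values at or above the last one each need a range of their own. Combining four hash buckets with four ranges gives sixteen possible (bucket, range) pairs.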
  11. Consistency model
       • Consistency and replication enforced by Raft consensus (similar to Paxos)
           • Replication by operation, not data
       • Single-row transactions now
           • Multi-row transactions later
       • Geo-distributed replicas will be possible under strict time synchronization
           • Techniques drawn from Google Spanner and others
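"Replication by operation" with a majority quorum can be illustrated with a toy Python model. This is deliberately not Raft: it omits leader election, terms, and log repair, and the `Replica` class and `replicate` function are hypothetical names. It only shows that the operation itself (not the resulting data) is shipped to each replica, and that a write counts as committed once a majority has logged it.

```python
from typing import Any, Dict, List, Tuple

Op = Tuple[str, str, Any]  # (kind, primary key, value)


class Replica:
    """Toy replica holding an operation log and the rows derived from it."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.log: List[Op] = []
        self.rows: Dict[str, Any] = {}

    def append(self, op: Op) -> bool:
        """Log the operation; an unreachable replica cannot acknowledge."""
        if not self.up:
            return False
        self.log.append(op)
        return True

    def apply(self, op: Op) -> None:
        """Replay the operation locally to update this replica's rows."""
        kind, key, value = op
        if kind == "upsert":
            self.rows[key] = value
        elif kind == "delete":
            self.rows.pop(key, None)


def replicate(replicas: List[Replica], op: Op) -> bool:
    """Send op to every replica; commit only if a majority logged it."""
    acked = [r for r in replicas if r.append(op)]
    committed = len(acked) > len(replicas) // 2
    if committed:
        for r in acked:
            r.apply(op)
    # A real protocol would also clean up uncommitted log entries;
    # this sketch leaves them in place.
    return committed


cluster = [Replica("a"), Replica("b"), Replica("c", up=False)]
print(replicate(cluster, ("upsert", "row-1", 42)))  # True: 2 of 3 acknowledged
```

With two of three replicas down, the same call returns `False` and no replica applies the operation.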
  12. Kudu interfaces
       • NoSQL-style APIs
           • Insert(), Update(), Delete(), Scan()
           • Java and C++ now
           • Python soon
       • Integrations with MapReduce, Spark, and Impala
       • No direct access to underlying Kudu tablet files
       • Beta does not have authentication, authorization, or encryption
  13. Impala integration
       • Opens up Kudu to JDBC/ODBC clients
       • Intuitive way to get data into Kudu
           • INSERT INTO kudu SELECT * FROM csv;
       • Additional commands
           • UPDATE
           • DELETE
           • Efficient INSERT VALUES
       • Runs on the Kudu C++ client
  14. Performance characteristics
       • Very CPU efficient
           • Written in modern C++, uses specialized CPU instructions and JIT compilation
       • Latency mostly driven by storage hardware capabilities
           • Expect sub-millisecond response on SSDs and upcoming technologies
       • No garbage collection allows a very large memory footprint with no pauses
       • Bloom filters reduce the need for many disk accesses
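As a sketch of why Bloom filters save disk accesses: a filter kept in memory can answer "definitely not present" for a key without touching storage, at the cost of occasional false positives. The Python class below is a generic textbook Bloom filter, not Kudu's implementation; the class name, the default sizes, and the use of SHA-256 with double hashing are all assumptions for illustration.

```python
import hashlib
from typing import List


class BloomFilter:
    """Generic Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes) -> List[int]:
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means the key was never added; True means it may have been."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


bf = BloomFilter()
for pk in (b"row-1", b"row-2", b"row-3"):
    bf.add(pk)

# Only keys that pass the filter would need a lookup on disk.
print(bf.might_contain(b"row-2"))  # True
```

A lookup for a key that returns `False` can skip the disk read entirely; a `True` still has to be confirmed against the stored data.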
  15. Operating Kudu
       • Easiest through Cloudera Manager integration
           • Separate parcel for now
       • Kudu is always compacting
           • No minor vs. major compaction
           • No compaction latency spikes
       • Web UI is full of metrics and logs
  16. Cluster layout
       • One or multiple masters
           • Only one in current beta
           • Low CPU and memory impact
       • One tablet server per worker node
           • Can share disks with HDFS
           • One SSD per worker node just for the Kudu WAL can speed up writes
       • No dependencies on other Hadoop ecosystem components
           • But interfacing components like Impala or Spark do have them
  17. Real-time analytics in Hadoop today
     Merging in new data = storage complexity
     Downsides:
       • Multiple storage layers
       • Latest data is hidden
       • Files are messy
       • Complex to do updates without breaking running queries
     [Diagram: Incoming Data (Messaging System) -> New Partition (HBase) -> "Have we accumulated enough data?" -> Reorganize HBase file into Parquet -> Wait for running operations to complete -> Define new Impala partition referencing the newly written Parquet file. Most Recent Partition and Historic Data are Parquet files. Reporting Request goes to HDFS + Impala.]
  18. Real-time analytics in Hadoop with Kudu
     Improvements:
       • One system to operate
       • No schedules or background processes
       • Handle late arrivals or data corrections with ease
       • New data available immediately for analytics or operations
     [Diagram: Incoming Data (Messaging System) -> Historical and Real-time Data in Kudu + Impala <- Reporting Request]
  19. Kudu for data warehousing
       • Near real time data visibility
           • BI tools can display events that happened seconds earlier
       • Excellent for star schemas
           • Fast scans of deep fact tables
           • Efficient wide fact tables
           • Simplified updates of slowly changing dimensions
  20. Near real time data warehousing on Kudu
     [Diagram labels: Files, RDBMS, Streams; Kafka; Flume; Spark Streaming; Simple; Complex; Kudu; Impala; Hue; BI tools; User]
  21. Resources
       • Join the community
       • Download the beta
       • Read the whitepaper
  22. Thank you