APACHE KUDU
WELCOME TO THE DEMO SESSION ON WORKING WITH APACHE KUDU
Part of defining NDX Strategic Architecture
Version: 0.2 – Status: draft – Date: 9/28/2016
Author: Ravi Kumar Itha & ZTV Team, Reviewers: Manjunatha Prabhu, Felix Shulman, Garry Steedman, Mara Preotescu
Participating teams: Nielsen, Kogentix, and Cloudera
Agenda
 Kudu – Overview
 Kudu – High level
 Design Goals of Kudu
 Kudu – Architecture
 Kudu – Tablet Storage
 Kudu – Hadoop Integration
 Kudu Implementation in Buffer Load and Raw Data Load
 Inserting the data
 Reading the data
 Deleting
 Dropping a table
Kudu at a high level
• It is an open source storage engine that supports low-latency random access together with efficient analytical access patterns.
• It distributes data using horizontal partitioning and replicates each partition, providing low mean-time-to-recovery and low tail latencies
• It is designed within the context of the Hadoop ecosystem and supports integration with Cloudera Impala, Apache Spark, and MapReduce.
Feature Description
Tables and Schemas • Kudu is a storage system for tables of structured data.
• A Kudu cluster may have any number of tables.
• Each table has a well-defined schema consisting of a finite number of columns.
Unlike most relational databases • Kudu does not currently offer secondary indexes or uniqueness constraints other than the primary key.
• Currently, Kudu requires that every table have a primary key defined, though a future version is expected to add automatic generation of surrogate keys.
Write operations • Insert, Update, Upsert, and Delete
Read operations • Kudu offers a Scan operation to retrieve data from a table. On a scan, any number of predicates can be provided to filter the results.
• In addition to applying predicates, the user may specify a projection (the subset of columns to retrieve) for a scan; see the sketch after this table.
API • Kudu provides APIs for callers to determine the mapping of data ranges to particular servers, to aid distributed execution frameworks such as Spark, MapReduce, or Impala.
Consistency Model • Snapshot consistency
• External consistency
Timestamps • Kudu does not allow the user to manually set the timestamp of a write operation.
• It does allow the user to specify a timestamp for a read operation, which enables point-in-time queries in the past.
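To make the scan API concrete, here is a minimal Scala sketch of a scan with a projection and a predicate, using the Kudu Java client. The master address, table, and column names (metrics, host, value) are assumptions for illustration.

import scala.collection.JavaConverters._
import org.apache.kudu.client.{KuduClient, KuduPredicate}
import org.apache.kudu.client.KuduPredicate.ComparisonOp

// Connect to the cluster (master address assumed for this sketch)
val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
val table = client.openTable("metrics") // hypothetical table
val schema = table.getSchema

// Project only the needed columns; the predicate is evaluated server-side
val scanner = client.newScannerBuilder(table)
  .setProjectedColumnNames(Seq("host", "value").asJava)
  .addPredicate(KuduPredicate.newComparisonPredicate(
    schema.getColumn("value"), ComparisonOp.GREATER, 100L))
  .build()

while (scanner.hasMoreRows) {
  val results = scanner.nextRows()
  while (results.hasNext) {
    val row = results.next()
    println(s"${row.getString("host")} -> ${row.getLong("value")}")
  }
}
client.shutdown()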
Design Goals of Kudu
 Strong performance for both scan and random access to help customers simplify complex hybrid architectures
 High CPU efficiency in order to maximize the return on investment that our customers are making in modern processors
 High IO efficiency in order to leverage modern persistent storage
 The ability to update data in place, to avoid extraneous processing and data movement
 The ability to support active-active replicated clusters that span multiple data centers in geographically distant locations
Kudu – Architecture
Feature Description
Cluster Roles • Kudu relies on a single Master server, responsible for metadata
• An arbitrary number of Tablet Servers, responsible for data
Partitioning • Tables in Kudu are horizontally partitioned. Like BigTable, Kudu calls these horizontal partitions tablets.
• Any row maps to exactly one tablet based on the value of its primary key, which keeps random access operations within a single tablet
• For large tables, the recommendation is 10-100 tablets per machine. Each tablet can be tens of gigabytes.
• Kudu supports a flexible array of partitioning schemes
• A partition schema is made up of zero or more hash-partitioning rules followed by an optional range-partitioning rule (see the sketch after this table):
 A hash-partitioning rule consists of a subset of the primary key columns and a number of buckets
 A range-partitioning rule consists of an ordered subset of the primary key columns
Replication • Kudu replicates all of its table data across multiple machines, typically 3 or 5
The Kudu Master • Acts as a catalog manager
• Acts as a cluster coordinator
• Acts as a tablet directory
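As a concrete illustration of combining hash and range partitioning, here is a minimal Scala sketch using CreateTableOptions from the Kudu Java client. The schema and column names (nc_periodid, ac_nshopid) follow the create-table example later in this deck; the bucket count and range bounds are assumptions.

import scala.collection.JavaConverters._
import org.apache.kudu.client.CreateTableOptions

// "schema" is assumed to be the Schema built in the create-table example below,
// with nc_periodid (INT32) and ac_nshopid (STRING) as the primary key.
val cto = new CreateTableOptions()

// Hash-partitioning rule: a subset of the primary key columns plus a bucket count
cto.addHashPartitions(Seq("ac_nshopid").asJava, 4)

// Range-partitioning rule: an ordered subset of the primary key columns
cto.setRangePartitionColumns(Seq("nc_periodid").asJava)

// One explicit range partition covering [201101, 201601)
val lower = schema.newPartialRow()
lower.addInt("nc_periodid", 201101)
val upper = schema.newPartialRow()
upper.addInt("nc_periodid", 201601)
cto.addRangePartition(lower, upper)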
Kudu – Architecture (diagram)
Kudu – Tablet Storage
Feature Description
Objectives behind the design • Fast columnar scans
• Low-latency random updates
• Consistency of performance
RowSets • Tablets in Kudu are themselves subdivided into smaller units called RowSets
• Two types of RowSets: MemRowSets and DiskRowSets
• MemRowSets – RowSets that exist in memory
• DiskRowSets – RowSets that exist in a combination of disk and memory
Other features that make Kudu perform well in reads, writes, and data management • Kudu implements the processes below efficiently, using techniques such as immutable B-tree indexes, LRU (least recently used) page caches, Bloom filters, MVCC (multi-version concurrency control), and column encodings:
• INSERT path
• Read path
• Lazy materialization
• Delta compaction
• RowSet compaction
• Scheduling maintenance
Kudu – Hadoop Integration
Feature Description
MapReduce and Spark • Bindings allow MapReduce jobs to read input from or write output to Kudu tables
• A small glue layer binds Kudu tables to higher-level Spark concepts such as DataFrames and Spark SQL tables (see the sketch after this table)
• The integration has native support for several key features:
 Locality
 Columnar projection
 Predicate pushdown support
Impala • Kudu is also deeply integrated with Cloudera Impala
• SQL operations on Kudu tables are supported through this Impala integration
• The Impala integration includes several key features:
 Locality
 Predicate pushdown support
 DDL extensions
 DML extensions
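To illustrate the Spark glue layer, here is a minimal sketch (Spark 1.6-era API, matching the sqlContext usage later in this deck) that loads a Kudu table as a DataFrame and binds it to Spark SQL. The master address and table name are assumptions.

import org.apache.kudu.spark.kudu._

// Load a Kudu table as a DataFrame; projections and supported predicates
// issued against it are pushed down to the Kudu tablet servers.
val df = sqlContext.read
  .options(Map(
    "kudu.master" -> "kudu-master:7051", // assumed master address
    "kudu.table"  -> "kudu_table"))      // assumed table name
  .kudu

// Bind the DataFrame to a Spark SQL table and query it
df.registerTempTable("kudu_table_view")
sqlContext.sql("SELECT COUNT(*) FROM kudu_table_view").show()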
WHAT DATA TYPES DOES KUDU SUPPORT?
• Boolean
• 8-bit signed integer
• 16-bit signed integer
• 32-bit signed integer
• 64-bit signed integer
• Timestamp
• 32-bit floating point
• 64-bit floating point
• String
• Binary
ENCODING TYPES
Column Type • Encodings
• Integer, Timestamp • plain, bitshuffle, run length
• Float • plain, bitshuffle
• Bool • plain, dictionary, run length
• String, Binary • plain, prefix, dictionary
Bitshuffle-encoded values are additionally compressed with LZ4.
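These types map to the Type enum in the Kudu client, and an encoding can be chosen per column when building the schema. A minimal sketch follows; the column names are hypothetical, and in recent clients the timestamp type is named UNIXTIME_MICROS.

import org.apache.kudu.Type
import org.apache.kudu.ColumnSchema.{ColumnSchemaBuilder, CompressionAlgorithm, Encoding}

// A 64-bit integer column with bitshuffle encoding
val metricCol = new ColumnSchemaBuilder("metric", Type.INT64)
  .encoding(Encoding.BIT_SHUFFLE)
  .build()

// A string column with dictionary encoding and explicit LZ4 block compression
val tagCol = new ColumnSchemaBuilder("tag", Type.STRING)
  .encoding(Encoding.DICT_ENCODING)
  .compressionAlgorithm(CompressionAlgorithm.LZ4)
  .build()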
HOW TO CREATE A TABLE IN KUDU?
• You can use Impala.
• You can use Scala (via the Kudu client API).
CREATING A TABLE USING IMPALA
CREATE TABLE <table_name> (columns)
PRIMARY KEY (c1, c2)
DISTRIBUTE BY RANGE (column) -- range bounds: (2011) to (2016)
SPLIT ROWS ((2012), (2013), (2014), (2015));
CREATING A TABLE USING SCALA
• Create a KuduClient and build the schema, then create the table:
import java.util.ArrayList
import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
val kuduClient = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
val columnList = new ArrayList[ColumnSchema]()
columnList.add(new ColumnSchemaBuilder("nc_periodid", Type.INT32).key(true).build())
columnList.add(new ColumnSchemaBuilder("ac_nshopid", Type.STRING).key(true).build())
columnList.add(new ColumnSchemaBuilder("ac_lbatchtype", Type.STRING).key(false).build())
val schema = new Schema(columnList)
val cto = new CreateTableOptions()
val distributionList = new ArrayList[String]()
distributionList.add("nc_periodid")
cto.addHashPartitions(distributionList, numberOfBuckets)
cto.setNumReplicas(3) // replication is set on the options, not on the created table
kuduClient.createTable(tableName, schema, cto)
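For the "inserting the data" step via the plain client (outside Spark), here is a minimal sketch that inserts one row into the table created above; the literal values are made up for illustration.

// Insert a single row through the Kudu client API
val kuduTable = kuduClient.openTable(tableName)
val session = kuduClient.newSession()
val insert = kuduTable.newInsert()
val row = insert.getRow
row.addInt("nc_periodid", 201609)        // assumed sample value
row.addString("ac_nshopid", "shop-001")  // assumed sample value
row.addString("ac_lbatchtype", "buffer") // assumed sample value
session.apply(insert)
session.close() // flushes any pending operations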
OPERATING ON A TABLE USING SCALA
• kuduContext.insertRows(DF, table)
• kuduContext.upsertRows(DF, table)
• kuduContext.updateRows(DF, table)
• kuduContext.tableExists(table)
• kuduContext.deleteTable(table)
• kuduContext.deleteRows(DF, table)
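A minimal sketch tying these calls together (kudu-spark 1.x-era API; the master address, table name, and DataFrame contents are assumptions):

import org.apache.kudu.spark.kudu.KuduContext

// A KuduContext is created from the Kudu master address(es)
val kuduContext = new KuduContext("kudu-master:7051")

// DF is any DataFrame whose schema matches the Kudu table
val DF = sqlContext.createDataFrame(Seq(
  (201609, "shop-001", "buffer")
)).toDF("nc_periodid", "ac_nshopid", "ac_lbatchtype")

val table = "kudu_table" // assumed table name
if (kuduContext.tableExists(table)) {
  kuduContext.upsertRows(DF, table) // insert-or-update by primary key
}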
READING DATA INTO A DF FROM A TABLE
• import org.apache.kudu.spark.kudu._ (brings the .kudu reader into scope)
• val df = sqlContext.read.options(Map("kudu.master" -> "kudu.master:7051",
"kudu.table" -> "kudu_table")).kudu.where(condition)
 Currently, column filters and BETWEEN conditions are supported.
Kudu demo