SlideShare a Scribd company logo
1© Cloudera, Inc. All rights reserved.
Apache Kudu Webinar Series
Technical Deep Dive
David Alves| Apache Kudu PMC | Cloudera
2© Cloudera, Inc. All rights reserved.
Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can
dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and
Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will
discuss how to make it a reality.
Part 4: Technical Deep-Dive into Apache Kudu
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu
PMC Member.
3© Cloudera, Inc. All rights reserved.
Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic
applications
• Designed for next-generation hardware for faster analytic
performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use
case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
STRUCTURED
Sqoop
UNSTRUCTURED
Kafka, Flume
PROCESS, ANALYZE, SERVE
UNIFIED SERVICES
RESOURCE MANAGEMENT
YARN
SECURITY
Sentry, RecordService
STORE
INTEGRATE
BATCH
Spark, Hive, Pig
MapReduce
STREAM
Spark
SQL
Impala
SEARCH
Solr
OTHER
Kite
NoSQL
HBase
OTHER
Object Store
FILESYSTEM
HDFS
RELATIONAL
Kudu
4© Cloudera, Inc. All rights reserved.
HDFS
Fast Scans, Analytics
and Processing of
Stored Data
Fast On-Line
Updates &
Data Serving
Arbitrary Storage
(Active Archive)
Fast Analytics
(on fast-changing or
frequently-updated data)
Filling the Analytic Gap
Unchanging
Fast Changing
Frequent Updates
HBase
Append-Only
Real-Time
Kudu Kudu fills the Gap
Modern analytic
applications often
require complex data
flow & difficult
integration work to
move data between
HBase & HDFS
Analytic
Gap
Pace of Analysis
PaceofData
5© Cloudera, Inc. All rights reserved.
Better Together
Kudu Benefits from Integration with the Apache Ecosystem
Spark – Stream Processing for Kudu
• Open standard for real-time stream processing
• Effective for automating decision processes and machine
learning
• Use Cases include: Time Series Data & Machine Data
Analytics
Impala – High-Performance BI & SQL for Kudu
• Open standard for interactive SQL queries
• Powers analytic database workloads with flexibility, scale, and
open architecture
• Use Cases include: Time Series Data & Online Reporting
6© Cloudera, Inc. All rights reserved.
Apache Kudu Community
7© Cloudera, Inc. All rights reserved.
Why Kudu?
Use Cases and Motivation
8© Cloudera, Inc. All rights reserved.
Why Kudu?
A simultaneous combination of sequential and random reads and writes
Can you insert time series data in
real time? How long does it take to
prepare it for analysis? Can you get
results and act fast enough to
change outcomes?
Can you handle large volumes of
machine-generated data? Do you
have the tools to identify problems
or threats? Can your system do
machine learning?
How fast can you add data to your
data store? Are you trading off the
ability to do broad analytics for the
ability to make updates? Are you
retaining only part of your data?
Time Series Data Machine Data Analytics Online Reporting
9© Cloudera, Inc. All rights reserved.
Next generation hardware
Cheaper and faster every year.
Persistent memory (3D XPoint™)
Kudu can take advantage of SSD
and NVM using Intel’s NVM Library.
RAM is cheaper and bigger every
day.
Kudu runs smoothly with huge
RAM. Written in C++ to avoid GC
issues.
Modern CPUs are adding cores and
SIMD width, not GHz.
Kudu takes advantage of SIMD
instructions and concurrent data
structures.
Solid-state Storage Cheaper, Bigger Memory Efficiency on Modern CPUs
10© Cloudera, Inc. All rights reserved.
Apache Kudu: Scalable and fast tabular storage
Scalable
• Tested up to 275 nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Tabular
• Represents data in structured tables like a relational database
• Individual record-level access to 100+ billion row tables
11© Cloudera, Inc. All rights reserved.
Deep Dive:
Replication And Fault Tolerance
12© Cloudera, Inc. All rights reserved.
Metadata
• Replicated master
• Acts as a tablet directory
• Acts as a catalog (which tables exist, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them
13© Cloudera, Inc. All rights reserved.
Client
Hey Master! Where is the row for
‘tlipcon’ in table “T”?
It’s part of tablet 2, which is on servers {Z,Y,X}.
BTW, here’s info on other tablets you might
care about: T1, T2, T3, …
UPDATE tlipcon
SET col=foo
Meta Cache
T1: …
T2: …
T3: …
14© Cloudera, Inc. All rights reserved.
Raft consensus
TS A
Tablet 1
(LEADER)
Client
TS B
Tablet 1
(FOLLOWER)
TS C
Tablet 1
(FOLLOWER)
WAL
WALWAL
2b. Leader writes local WAL
1a. Client->Leader: Write() RPC
2a. Leader->Followers:
UpdateConsensus() RPC
3. Follower: write WAL
4. Follower->Leader: success
3. Follower: write WAL
5. Leader has achieved majority
6. Leader->Client: Success!
15© Cloudera, Inc. All rights reserved.
Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear a heartbeat from their leader every 1.5 seconds
• 3 missed heartbeats: leader election!
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
16© Cloudera, Inc. All rights reserved.
Fault tolerance (2)
• Permanent failure (Tablet Copy):
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica(in PRE_VOTER state)
• Leader copies the data over to the new one, which joins as a new FOLLOWER
• New replica is assigned VOTER state
• Cluster change_config lets you add/remove tablet servers to a tablet’s
configuration(only supported from CLI tool)
17© Cloudera, Inc. All rights reserved.
Deep Dive:
Columnar Storage
18© Cloudera, Inc. All rights reserved.
Columnar Storage
• Improved Scan performance
• Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading
unnecessary data from other columns
• Efficient encodings can dramatically improve compression ratios, which reduces effective IO load
• Typed, homogenous data plays well to modern processor strengths (vectorization, pipelining)
• At the cost of random access performance
• single row access requires a number of seeks proportional to the number of columns
• BUT, random access is becoming cheaper (Cheap RAM, SSDs, NVRAM)
19© Cloudera, Inc. All rights reserved.
Row Storage
{23059873, newsycbot, 1442865158, Visual exp…}
{22309487, RideImpala, 1442828307, Introducing …}
…
Tweet_id, user_name, created_at, text
Scans have to read all the data, no encodings
20© Cloudera, Inc. All rights reserved.
Columnar storage
{25059873,
22309487,
23059861,
23010982}
Tweet_id
{newsycbot,
RideImpala,
fastly,
llvmorg}
User_name
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
21© Cloudera, Inc. All rights reserved.
Columnar storage
{25059873,
22309487,
23059861,
23010982}
Tweet_id
{newsycbot,
RideImpala,
fastly,
llvmorg}
User_name
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’;
Only read 1 column
1GB 2GB 1GB 200GB
…AND door open to big IO gains with compression and encoding
22© Cloudera, Inc. All rights reserved.
Available Encodings:
• Dictionary (Strings, Binary)
• Bitshuffle (Numeric)
• RLE (Numeric, Bool)
Available Compression:
• Snappy
• LZ4
• ZLib
Columnar Storage – Other Encodings/Compression
Kudu 1.3 ships with “good” defaults for most cases
23© Cloudera, Inc. All rights reserved.
Deep Dive:
Write and Read Paths
24© Cloudera, Inc. All rights reserved.
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)
25© Cloudera, Inc. All rights reserved.
LSM Insert Path
MemStore
INSERT
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
flush
26© Cloudera, Inc. All rights reserved.
LSM Insert Path
MemStore
INSERT
Row=r1 col=c1 val=“blah2”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“blah2”
Row=r2 col=c2 val=“2”
flush
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
27© Cloudera, Inc. All rights reserved.
LSM Update path
MemStore
UPDATE
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Row=r2 col=c1 val=“newval”
Note: all updates are “fully
decoupled” from reads. Random-
write workload is transformed to
fully sequential!
28© Cloudera, Inc. All rights reserved.
LSM Read path
MemStore
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Row=r2 col=c1 val=“newval”
Merge based on string row
keys
R1: c1=blah c2=2
R2: c1=newval c2=5
….
CPU intensive!
Must always read
rowkeys
Any given row may exist across
multiple HFiles: must always
merge!
The more HFiles to merge, the
slower it reads
29© Cloudera, Inc. All rights reserved.
Kudu storage – Inserts and Flushes
MemRowSet
INSERT(“todd”, “$1000”,”engineer”)
name pay role
DiskRowSet 1
flush
30© Cloudera, Inc. All rights reserved.
Kudu storage – Inserts and Flushes
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
INSERT(“doug”, “$1B”, “Hadoop man”)
flush
31© Cloudera, Inc. All rights reserved.
Kudu storage - Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
Each DiskRowSet has its own
DeltaMemStore to
accumulate updates
base data
base data
32© Cloudera, Inc. All rights reserved.
Kudu storage - Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
UPDATE set pay=“$1M”
WHERE name=“todd”
Is the row in DiskRowSet 2?
(check bloom filters)
Is the row in DiskRowSet 1?
(check bloom filters)
Bloom says: no!
Bloom says: maybe!
Search key column to find
offset: rowid = 150
150: col 1=$1M
base data
33© Cloudera, Inc. All rights reserved.
Kudu storage – Read path
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
150: pay=$1M
Read rows in DiskRowSet 2
Then, read rows in
DiskRowSet 1
Any row is only in exactly one
DiskRowSet– no need to merge cross-
DRS!
Updates are merged based on ordinal
offset within DRS: array indexing, no
string compares
base data
base data
34© Cloudera, Inc. All rights reserved.
Kudu storage – Delta flushes
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
0: pay=fooREDO DeltaFile
Flush
A REDO delta indicates how to
transform between the ‘base data’
(columnar) and a later version
base data
base data
35© Cloudera, Inc. All rights reserved.
Kudu storage – Major delta compaction
name pay role
DiskRowSet(pre-compaction)
Delta MS
REDO DeltaFile REDO DeltaFile REDO DeltaFile
Many deltas accumulate: lots of delta
application work on reads
name pay role
DiskRowSet(post-compaction)
Delta MS
Unmerged REDO
deltasUNDO deltas
If a column has few updates, doesn’t need to be
re-written: those deltas maintained in new
DeltaFile
Merge updates for columns with high update
percentage
base data
36© Cloudera, Inc. All rights reserved.
Kudu storage – RowSet Compactions
DRS 1 (32MB)
[PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB)
[PK=bob], [PK=jon], [PK=mary] [PK=zeke]
DRS 3 (32MB)
[PK=carl], [PK=julie], [PK=omar] [PK=zoe]
DRS 4 (32MB) DRS 5 (32MB) DRS 6 (32MB)
[alice, bob, carl, joe] [jon, julie, linda,
mary]
[omar, zach, zeke, zoe]
Reorganize rows to avoid rowsets
with overlapping key ranges
Writes for “chris” have to perform
bloom lookups on all 3 RS
37© Cloudera, Inc. All rights reserved.
Kudu Storage - Compactions
• Main Idea: Always be compacting!
• Compactions run continuously to prevent IO storms
• ”Budgeted” RS compactions: What is the best way to spend X MBs IO?
• Physical/Logical decoupling: different replicas run compactions at different times
38© Cloudera, Inc. All rights reserved.
Deep Dive
Partitioning
39© Cloudera, Inc. All rights reserved.
Partitioning
• Kudu has flexible policies for distributing data among partitions
• Hash partitioning is built in, and can be combined with range partitioning
• Keys are ordered within a partition
• Key order matters
• Partitioning key order dictates distribution
• Primary key order affects how much data is read for scans
40© Cloudera, Inc. All rights reserved.
Primary Key selection
• Example - Time series data:
• ”time” - timestamp
• ”series” - {region, server, metric}
(us-east.appserver01.loadavg, 2016-05-
09T15:14:00Z)
(us-east.appserver01.loadavg, 2016-05-
09T15:15:00Z)
(us-west.dbserver03.rss, 2016-05-
09T15:14:30Z)
(us-west.dbserver03.rss, 2016-05-
09T15:14:30Z)
(2016-05-09T15:14:00Z, us-
east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
(2016-05-09T15:15:00Z, us-
east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
(series, time) (time, series)
SELECT * WHERE series = ‘us-east.appserver01.loadavg’;
41© Cloudera, Inc. All rights reserved.
Partitioning — By Time Range (inserts)
All Inserts go to Latest PartitionBig scans (across large time intervals)
can be parallelized across many partitions
42© Cloudera, Inc. All rights reserved.
Partitioning — By Series Range
Inserts are spread among all partitionsScans are over a single partition
43© Cloudera, Inc. All rights reserved.
Partitioning — By Series Range
Partitions can become unbalanced,
resulting in hot spotting
44© Cloudera, Inc. All rights reserved.
Partitioning — By Series Hash (inserts)
Inserts are spread among all partitionsScans are over a single partition
45© Cloudera, Inc. All rights reserved.
Partitioning — By Series Hash
Partitions grow overtime, eventually
becoming too big for a single server
46© Cloudera, Inc. All rights reserved.
Partitioning — By Series Hash + Time Range
Inserts are spread among all partitions
in the latest time range
Big scans (across large time intervals)
can be parallelized across partitions
47© Cloudera, Inc. All rights reserved.
Next Steps
Get Started with
Kudu & Cloudera
Start Contributing
to Kudu
• www.cloudera.com/downloads
• https://blog.cloudera.com/?s=kudu
http://kudu.apache.org/
48© Cloudera, Inc. All rights reserved.
Thank you
David Alves – Apache Kudu PMC
@dribeiroalves

More Related Content

What's hot

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
confluent
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
Cloudera, Inc.
 
EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트:: AWS Summit Online Ko...
EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트::  AWS Summit Online Ko...EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트::  AWS Summit Online Ko...
EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트:: AWS Summit Online Ko...
Amazon Web Services Korea
 
Amazon Aurora: Under the Hood
Amazon Aurora: Under the HoodAmazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
Amazon Web Services
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
David Groozman
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
Andriy Zabavskyy
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Databricks
 

What's hot (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트:: AWS Summit Online Ko...
EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트::  AWS Summit Online Ko...EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트::  AWS Summit Online Ko...
EMR 플랫폼 기반의 Spark 워크로드 실행 최적화 방안 - 정세웅, AWS 솔루션즈 아키텍트:: AWS Summit Online Ko...
 
Amazon Aurora: Under the Hood
Amazon Aurora: Under the HoodAmazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 

Viewers also liked

Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Cloudera, Inc.
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
Cloudera, Inc.
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Cloudera, Inc.
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester Webinar
Cloudera, Inc.
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

Cloudera, Inc.
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Cloudera, Inc.
 
Apache kudu
Apache kuduApache kudu
Apache kudu
Asim Jalis
 
Top 5 IoT Use Cases
Top 5 IoT Use CasesTop 5 IoT Use Cases
Top 5 IoT Use Cases
Cloudera, Inc.
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Cloudera, Inc.
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
نهاد مبارك
 
Kite SDK: Working with Datasets
Kite SDK: Working with DatasetsKite SDK: Working with Datasets
Kite SDK: Working with Datasets
Cloudera, Inc.
 
Oozie @ Riot Games
Oozie @ Riot GamesOozie @ Riot Games
Oozie @ Riot Games
Matt Goeke
 
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarnApache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
Shuya Tsukamoto
 
Oozie meetup - HA
Oozie meetup - HAOozie meetup - HA
Oozie meetup - HA
Mona Chitnis
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
David Kaiser
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria
 
The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)
Cloudera, Inc.
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera, Inc.
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
Yahoo Developer Network
 
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Cloudera, Inc.
 

Viewers also liked (20)

Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester Webinar
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Top 5 IoT Use Cases
Top 5 IoT Use CasesTop 5 IoT Use Cases
Top 5 IoT Use Cases
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Kite SDK: Working with Datasets
Kite SDK: Working with DatasetsKite SDK: Working with Datasets
Kite SDK: Working with Datasets
 
Oozie @ Riot Games
Oozie @ Riot GamesOozie @ Riot Games
Oozie @ Riot Games
 
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarnApache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
 
Oozie meetup - HA
Oozie meetup - HAOozie meetup - HA
Oozie meetup - HA
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
 
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
 

Similar to Apache Kudu: Technical Deep Dive



cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
Rakuten Group, Inc.
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
Cloudera Japan
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
jdcryans
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
Felicia Haggarty
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
DataStax Academy
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
michaelguia
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
Grant Henke
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
LarryZaman
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 

Similar to Apache Kudu: Technical Deep Dive

 (20)

cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User StoreAzure + DataStax Enterprise (DSE) Powers Office365 Per User Store
Azure + DataStax Enterprise (DSE) Powers Office365 Per User Store
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Decoupling Decisions with Apache Kafka
Decoupling Decisions with Apache KafkaDecoupling Decisions with Apache Kafka
Decoupling Decisions with Apache Kafka
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 

Recently uploaded (20)

Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 

Apache Kudu: Technical Deep Dive



  • 1. 1© Cloudera, Inc. All rights reserved. Apache Kudu Webinar Series Technical Deep Dive David Alves| Apache Kudu PMC | Cloudera
  • 2. 2© Cloudera, Inc. All rights reserved. Kudu Webinar Series Part 1: Lambda Architectures – Simplified by Apache Kudu A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can dramatically simplify real-time analytics. Part 2: Extending the Capabilities of Operational and Analytical Databases An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and Analytical databases can handle. Part 3: Data-in-Motion: Unlock the Value of Real-Time Data Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will discuss how to make it a reality. Part 4: Technical Deep-Dive into Apache Kudu An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu PMC Member.
  • 3. 3© Cloudera, Inc. All rights reserved. Updateable Analytic Storage Simple real-time analytics and updates with Apache Kudu Kudu: Storage for fast analytics on fast data • Simplified architecture for building real-time analytic applications • Designed for next-generation hardware for faster analytic performance across frameworks • Native Hadoop storage engine Flexibility for the right tools for the right use case in one platform • Only analytic database for big data with Kudu + Impala • Simple real-time applications with Kudu + Spark Use cases • Time series data • Machine data analytics • Online reporting STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService STORE INTEGRATE BATCH Spark, Hive, Pig MapReduce STREAM Spark SQL Impala SEARCH Solr OTHER Kite NoSQL HBase OTHER Object Store FILESYSTEM HDFS RELATIONAL Kudu
  • 4. 4© Cloudera, Inc. All rights reserved. HDFS Fast Scans, Analytics and Processing of Stored Data Fast On-Line Updates & Data Serving Arbitrary Storage (Active Archive) Fast Analytics (on fast-changing or frequently-updated data) Filling the Analytic Gap Unchanging Fast Changing Frequent Updates HBase Append-Only Real-Time Kudu Kudu fills the Gap Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS Analytic Gap Pace of Analysis PaceofData
  • 5. 5© Cloudera, Inc. All rights reserved. Better Together Kudu Benefits from Integration with the Apache Ecosystem Spark – Stream Processing for Kudu • Open standard for real-time stream processing • Effective for automating decision processes and machine learning • Use Cases include: Time Series Data & Machine Data Analytics Impala – High-Performance BI & SQL for Kudu • Open standard for interactive SQL queries • Powers analytic database workloads with flexibility, scale, and open architecture • Use Cases include: Time Series Data & Online Reporting
  • 6. 6© Cloudera, Inc. All rights reserved. Apache Kudu Community
  • 7. 7© Cloudera, Inc. All rights reserved. Why Kudu? Use Cases and Motivation
  • 8. 8© Cloudera, Inc. All rights reserved. Why Kudu? A simultaneous combination of sequential and random reads and writes Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes? Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning? How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data? Time Series Data Machine Data Analytics Online Reporting
  • 9. 9© Cloudera, Inc. All rights reserved. Next generation hardware Cheaper and faster every year. Persistent memory (3D XPoint™) Kudu can take advantage of SSD and NVM using Intel’s NVM Library. RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues. Modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures. Solid-state Storage Cheaper, Bigger Memory Efficiency on Modern CPUs
  • 10. 10© Cloudera, Inc. All rights reserved. Apache Kudu: Scalable and fast tabular storage Scalable • Tested up to 275 nodes (~3PB cluster) • Designed to scale to 1000s of nodes and tens of PBs Fast • Millions of read/write operations per second across cluster • Multiple GB/second read throughput per node Tabular • Represents data in structured tables like a relational database • Individual record-level access to 100+ billion row tables
  • 11. 11© Cloudera, Inc. All rights reserved. Deep Dive: Replication And Fault Tolerance
  • 12. 12© Cloudera, Inc. All rights reserved. Metadata • Replicated master • Acts as a tablet directory • Acts as a catalog (which tables exist, etc) • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets) • Caches all metadata in RAM for high performance • Client configured with master addresses • Asks master for tablet locations as needed and caches them
  • 13. 13© Cloudera, Inc. All rights reserved. Client Hey Master! Where is the row for ‘tlipcon’ in table “T”? It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, … UPDATE tlipcon SET col=foo Meta Cache T1: … T2: … T3: …
  • 14. 14© Cloudera, Inc. All rights reserved. Raft consensus TS A Tablet 1 (LEADER) Client TS B Tablet 1 (FOLLOWER) TS C Tablet 1 (FOLLOWER) WAL WALWAL 2b. Leader writes local WAL 1a. Client->Leader: Write() RPC 2a. Leader->Followers: UpdateConsensus() RPC 3. Follower: write WAL 4. Follower->Leader: success 3. Follower: write WAL 5. Leader has achieved majority 6. Leader->Client: Success!
  • 15. 15© Cloudera, Inc. All rights reserved. Fault tolerance • Transient FOLLOWER failure: • Leader can still achieve majority • Restart follower TS within 5 min and it will rejoin transparently • Transient LEADER failure: • Followers expect to hear a heartbeat from their leader every 1.5 seconds • 3 missed heartbeats: leader election! • New LEADER is elected from remaining nodes within a few seconds • Restart within 5 min and it rejoins as a FOLLOWER • N replicas handle (N-1)/2 failures
  • 16. 16© Cloudera, Inc. All rights reserved. Fault tolerance (2) • Permanent failure (Tablet Copy): • Leader notices that a follower has been dead for 5 minutes • Evicts that follower • Master selects a new replica(in PRE_VOTER state) • Leader copies the data over to the new one, which joins as a new FOLLOWER • New replica is assigned VOTER state • Cluster change_config lets you add/remove tablet servers to a tablet’s configuration(only supported from CLI tool)
  • 17. 17© Cloudera, Inc. All rights reserved. Deep Dive: Columnar Storage
  • 18. 18© Cloudera, Inc. All rights reserved. Columnar Storage • Improved Scan performance • Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading unnecessary data from other columns • Efficient encodings can dramatically improve compression ratios, which reduces effective IO load • Typed, homogenous data plays well to modern processor strengths (vectorization, pipelining) • At the cost of random access performance • single row access requires a number of seeks proportional to the number of columns • BUT, random access is becoming cheaper (Cheap RAM, SSDs, NVRAM)
  • 19. 19© Cloudera, Inc. All rights reserved. Row Storage {23059873, newsycbot, 1442865158, Visual exp…} {22309487, RideImpala, 1442828307, Introducing …} … Tweet_id, user_name, created_at, text Scans have to read all the data, no encodings
  • 20. 20© Cloudera, Inc. All rights reserved. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text
  • 21. 21© Cloudera, Inc. All rights reserved. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’; Only read 1 column 1GB 2GB 1GB 200GB …AND door open to big IO gains with compression and encoding
  • 22. 22© Cloudera, Inc. All rights reserved. Available Encodings: • Dictionary (Strings, Binary) • Bitshuffle (Numeric) • RLE (Numeric, Bool) Available Compression: • Snappy • LZ4 • ZLib Columnar Storage – Other Encodings/Compression Kudu 1.3 ships with “good” defaults for most cases
  • 23. 23© Cloudera, Inc. All rights reserved. Deep Dive: Write and Read Paths
  • 24. 24© Cloudera, Inc. All rights reserved. LSM vs Kudu • LSM – Log Structured Merge (Cassandra, HBase, etc) • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable) • Reads perform an on-the-fly merge of all on-disk HFiles • Kudu • Shares some traits (memstores, compactions) • More complex. • Slower writes in exchange for faster reads (especially scans)
  • 25. 25© Cloudera, Inc. All rights reserved. LSM Insert Path MemStore INSERT Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” flush
  • 26. 26© Cloudera, Inc. All rights reserved. LSM Insert Path MemStore INSERT Row=r1 col=c1 val=“blah2” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“blah2” Row=r2 col=c2 val=“2” flush HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1”
  • 27. 27© Cloudera, Inc. All rights reserved. LSM Update path MemStore UPDATE HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Note: all updates are “fully decoupled” from reads. Random- write workload is transformed to fully sequential!
  • 28. 28© Cloudera, Inc. All rights reserved. LSM Read path MemStore HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Merge based on string row keys R1: c1=blah c2=2 R2: c1=newval c2=5 …. CPU intensive! Must always read rowkeys Any given row may exist across multiple HFiles: must always merge! The more HFiles to merge, the slower it reads
  • 29. 29© Cloudera, Inc. All rights reserved. Kudu storage – Inserts and Flushes MemRowSet INSERT(“todd”, “$1000”,”engineer”) name pay role DiskRowSet 1 flush
  • 30. 30© Cloudera, Inc. All rights reserved. Kudu storage – Inserts and Flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 INSERT(“doug”, “$1B”, “Hadoop man”) flush
  • 31. 31© Cloudera, Inc. All rights reserved. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS Each DiskRowSet has its own DeltaMemStore to accumulate updates base data base data
  • 32. 32© Cloudera, Inc. All rights reserved. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS UPDATE set pay=“$1M” WHERE name=“todd” Is the row in DiskRowSet 2? (check bloom filters) Is the row in DiskRowSet 1? (check bloom filters) Bloom says: no! Bloom says: maybe! Search key column to find offset: rowid = 150 150: col 1=$1M base data
  • 33. 33© Cloudera, Inc. All rights reserved. Kudu storage – Read path MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 150: pay=$1M Read rows in DiskRowSet 2 Then, read rows in DiskRowSet 1 Any row is only in exactly one DiskRowSet– no need to merge cross- DRS! Updates are merged based on ordinal offset within DRS: array indexing, no string compares base data base data
  • 34. 34© Cloudera, Inc. All rights reserved. Kudu storage – Delta flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 0: pay=fooREDO DeltaFile Flush A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version base data base data
  • 35. 35© Cloudera, Inc. All rights reserved. Kudu storage – Major delta compaction name pay role DiskRowSet(pre-compaction) Delta MS REDO DeltaFile REDO DeltaFile REDO DeltaFile Many deltas accumulate: lots of delta application work on reads name pay role DiskRowSet(post-compaction) Delta MS Unmerged REDO deltasUNDO deltas If a column has few updates, doesn’t need to be re-written: those deltas maintained in new DeltaFile Merge updates for columns with high update percentage base data
  • 36. 36© Cloudera, Inc. All rights reserved. Kudu storage – RowSet Compactions DRS 1 (32MB) [PK=alice], [PK=joe], [PK=linda], [PK=zach] DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary] [PK=zeke] DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar] [PK=zoe] DRS 4 (32MB) DRS 5 (32MB) DRS 6 (32MB) [alice, bob, carl, joe] [jon, julie, linda, mary] [omar, zach, zeke, zoe] Reorganize rows to avoid rowsets with overlapping key ranges Writes for “chris” have to perform bloom lookups on all 3 RS
  • 37. 37© Cloudera, Inc. All rights reserved. Kudu Storage - Compactions • Main Idea: Always be compacting! • Compactions run continuously to prevent IO storms • ”Budgeted” RS compactions: What is the best way to spend X MBs IO? • Physical/Logical decoupling: different replicas run compactions at different times
  • 38. 38© Cloudera, Inc. All rights reserved. Deep Dive Partitioning
  • 39. 39© Cloudera, Inc. All rights reserved. Partitioning • Kudu has flexible policies for distributing data among partitions • Hash partitioning is built in, and can be combined with range partitioning • Keys are ordered within a partition • Key order matters • Partitioning key order dictates distribution • Primary key order affects how much data is read for scans
  • 40. 40© Cloudera, Inc. All rights reserved. Primary Key selection • Example - Time series data: • ”time” - timestamp • ”series” - {region, server, metric} (us-east.appserver01.loadavg, 2016-05- 09T15:14:00Z) (us-east.appserver01.loadavg, 2016-05- 09T15:15:00Z) (us-west.dbserver03.rss, 2016-05- 09T15:14:30Z) (us-west.dbserver03.rss, 2016-05- 09T15:14:30Z) (2016-05-09T15:14:00Z, us- east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) (2016-05-09T15:15:00Z, us- east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) (series, time) (time, series) SELECT * WHERE series = ‘us-east.appserver01.loadavg’;
  • 41. 41© Cloudera, Inc. All rights reserved. Partitioning — By Time Range (inserts) All Inserts go to Latest PartitionBig scans (across large time intervals) can be parallelized across many partitions
  • 42. 42© Cloudera, Inc. All rights reserved. Partitioning — By Series Range Inserts are spread among all partitionsScans are over a single partition
  • 43. 43© Cloudera, Inc. All rights reserved. Partitioning — By Series Range Partitions can become unbalanced, resulting in hot spotting
  • 44. 44© Cloudera, Inc. All rights reserved. Partitioning — By Series Hash (inserts) Inserts are spread among all partitionsScans are over a single partition
  • 45. 45© Cloudera, Inc. All rights reserved. Partitioning — By Series Hash Partitions grow overtime, eventually becoming too big for a single server
  • 46. 46© Cloudera, Inc. All rights reserved. Partitioning — By Series Hash + Time Range Inserts are spread among all partitions in the latest time range Big scans (across large time intervals) can be parallelized across partitions
  • 47. 47© Cloudera, Inc. All rights reserved. Next Steps Get Started with Kudu & Cloudera Start Contributing to Kudu • www.cloudera.com/downloads • https://blog.cloudera.com/?s=kudu http://kudu.apache.org/
  • 48. 48© Cloudera, Inc. All rights reserved. Thank you David Alves – Apache Kudu PMC @dribeiroalves