Apache Kudu - Updatable Analytical Storage #rakutentech

© Cloudera, Inc. All rights reserved.
Apache Kudu
Updatable Analytical Storage for Modern Data Platform
Sho Shimauchi | Sales Engineer | Cloudera

Who Am I?
Sho Shimauchi
Sales Engineer / Technical Evangelist
Joined Cloudera in 2011
The First Employee in Cloudera APJ
Email: sho@cloudera.com
Twitter: @shiumachi

•  Founded in 2008
•  1600+ Clouderans
•  Machine learning and analytics platform
•  Shared data experience
•  Cloud-native and cloud-differentiated
•  Open-source innovation and efficiency

Rakuten Card + Cloudera
Rakuten Card replaced its mainframe with
Cloudera Enterprise in 2017
Apache Spark improved the performance
of the batch processes by more than 2x
Please join Cloudera World Tokyo
2017 to see Kobayashi-san’s Keynote!
www.clouderaworldtokyo.com

Why Kudu?
Use Cases and Motivation

The modern platform for machine learning and analytics optimized for the cloud
[Diagram: Cloudera Enterprise]
Extensible Services / Core Services: Data Engineering, Operational Database, Analytic Database, Data Science, New Offerings
Data Catalog
Ingest & Replication, Security, Governance, Workload Management
Storage Services: Amazon S3, Microsoft ADLS, HDFS, Kudu

Filling the Analytic Gap
[Diagram: storage engines plotted by Pace of Data against Pace of Analysis]
Axis labels: Unchanging, Fast Changing / Frequent Updates, Append-Only, Real-Time
HDFS: Fast Scans, Analytics and Processing of Stored Data
HBase: Fast On-Line Updates & Data Serving
Kudu: Kudu fills the Gap
Other labels: Analytic Gap, Arbitrary Storage (Active Archive), Fast Analytics (on fast-changing or frequently-updated data)
Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS

Apache Kudu: Scalable and fast structured storage
Scalable
•  Tested up to 300+ nodes (petabyte-scale cluster)
•  Designed to scale to 1000s of nodes and tens of PBs
Fast
•  Multiple GB/second read throughput per node
•  Millions of read/write operations per second across cluster
Tabular
•  Represents data in structured tables like a relational database
•  Strict schema, finite column count, no BLOBs
•  Individual record-level access to 100+ billion row tables

Apache Kudu Community

Why Kudu?
Time Series Data
Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
Machine Data Analytics
Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
Online Reporting
How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?

Next generation hardware
Solid-state Storage
Cheaper and faster every year.
Persistent memory (3D XPoint™)
Kudu can take advantage of SSD and NVM using Intel’s NVM Library.
Cheaper, Bigger Memory
RAM is cheaper and bigger every day.
Kudu runs smoothly with huge RAM.
Written in C++ to avoid GC issues.
Efficiency on Modern CPUs
Modern CPUs are adding cores and SIMD width, not GHz.
Kudu takes advantage of SIMD instructions and concurrent data structures.

How it Works
Replication And Fault Tolerance

Tables, tablets, and tablet servers
• Each table is horizontally partitioned into tablets
• Range or hash partitioning
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5) with Raft consensus
• Automatic fault tolerance
• MTTR (mean time to repair): ~5 seconds
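
As a rough illustration of the hash partitioning bullet above, the sketch below maps a row's hashed column to one of 100 buckets. It uses `zlib.crc32` purely as a stand-in hash; the hash function Kudu actually uses is not described in these slides, and the name `bucket_for` and the sample rows are hypothetical.

```python
import zlib

NUM_BUCKETS = 100  # mirrors "INTO 100 BUCKETS" in the example above


def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a hashed column value to a bucket (tablet) index.

    crc32 is only a stand-in; it is not claimed to be the hash Kudu uses.
    """
    digest = zlib.crc32(str(timestamp).encode("utf-8"))
    return digest % num_buckets


# Hypothetical rows shaped like the (host, metric, timestamp) primary key.
rows = [("host1", "cpu", 1442865158), ("host1", "cpu", 1442865159)]
for host, metric, ts in rows:
    print(host, metric, ts, "-> bucket", bucket_for(ts))
```

The point of hashing is that neighbouring timestamps tend to land in different buckets, so inserts of recent data spread across tablets instead of all hitting one.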

Metadata
Replicated master
Acts as a tablet directory
Acts as a catalog (which tables exist, etc)
Acts as a load balancer (tracks tablet server (TS) liveness, re-replicates under-replicated tablets)
Caches all metadata in RAM for high performance
Client configured with master addresses
Asks master for tablet locations as needed and caches them

Client: Hey Master! Where is the row for ‘tlipcon’ in table “T”?
Master: It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …
Client (to the tablet server): UPDATE tlipcon SET col=foo
Meta Cache (on the client)
T1: …
T2: …
T3: …
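
The lookup-then-cache exchange above can be sketched roughly as follows. This is a toy model, not the Kudu client API: the `Master` and `Client` classes, the key ranges, and the server names for T1 and T3 are all made up for illustration.

```python
import bisect


class Master:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, tablets):
        # tablets: list of (start_key, tablet_name, servers), sorted by start_key
        self.tablets = tablets
        self.lookups = 0

    def tablet_locations(self, table):
        # Returns locations for every tablet of the (single) table in this toy.
        self.lookups += 1
        return list(self.tablets)


class Client:
    def __init__(self, master):
        self.master = master
        self.meta_cache = {}  # table name -> cached tablet locations

    def locate(self, table, row_key):
        if table not in self.meta_cache:  # cache miss: ask the master once
            self.meta_cache[table] = self.master.tablet_locations(table)
        tablets = self.meta_cache[table]
        starts = [start for start, _, _ in tablets]
        idx = bisect.bisect_right(starts, row_key) - 1  # last range starting <= key
        _, name, servers = tablets[idx]
        return name, servers


master = Master([
    ("", "T1", ["X", "Y", "W"]),
    ("m", "T2", ["Z", "Y", "X"]),
    ("u", "T3", ["W", "Z", "Y"]),
])
client = Client(master)
print(client.locate("T", "tlipcon"))  # served by T2
print(client.locate("T", "alice"))    # answered from the meta cache
print("master lookups:", master.lookups)
```

After the first call the client answers location questions from its own cache, which is why the master does not sit on the data path for every write.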

Raft consensus
[Diagram: Client; TS A holds Tablet 1 (LEADER); TS B and TS C hold Tablet 1 (FOLLOWER); each replica has its own WAL]
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
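
The majority rule in the write path above can be sketched as below. This is a deliberately simplified model: it ignores terms, leader election, and log matching, and the function `replicate_write` and its arguments are hypothetical names, not Kudu or Raft library calls.

```python
def replicate_write(op, leader_wal, follower_wals, reachable):
    """Return True once a majority of replicas have logged `op`.

    leader_wal / follower_wals are plain lists standing in for write-ahead logs;
    `reachable` says whether each follower answers the UpdateConsensus() RPC.
    """
    leader_wal.append(op)          # leader writes its local WAL
    acks = 1                       # the leader counts toward the majority
    for wal, up in zip(follower_wals, reachable):
        if up:                     # follower writes its WAL and reports success
            wal.append(op)
            acks += 1
    total = 1 + len(follower_wals)
    return acks > total // 2       # majority achieved?


leader, f1, f2 = [], [], []
print(replicate_write("UPDATE tlipcon SET col=foo", leader, [f1, f2], [True, False]))
print(replicate_write("UPDATE tlipcon SET col=bar", leader, [f1, f2], [False, False]))
```

With three replicas, one unreachable follower does not block the write, while losing both followers does.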

How it Works
Columnar Storage

Row Storage
Scans have to read all the data, no encodings
Tweet_id, user_name, created_at, text
{23059873, newsycbot, 1442865158, Visual exp…}
{22309487, RideImpala, 1442828307, Introducing …}
…

Columnar Storage
Tweet_id: {25059873, 22309487, 23059861, 23010982}
User_name: {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}

Columnar Storage
SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
Tweet_id (1GB): {25059873, 22309487, 23059861, 23010982}
User_name (2GB): {newsycbot, RideImpala, fastly, llvmorg} <- Only read 1 column
Created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
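
A minimal sketch of why the query above only needs one column: with each column stored separately, the count can be computed from `user_name` alone. The dictionary-of-lists layout and the function name `count_where_equals` are illustrative assumptions, not Kudu's on-disk format.

```python
# Each column is stored on its own, as in the slide.
columns = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
    "text": ["Visual exp…", "Introducing ..", "Missing July…", "LLVM 3.7…."],
}


def count_where_equals(columns, column, value):
    """Scan a single column; the other columns are never touched."""
    return sum(1 for v in columns[column] if v == value)


# SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
print(count_where_equals(columns, "user_name", "newsycbot"))
```

In the slide's sizing, that means reading the 2GB `user_name` column rather than the whole table, most of which is the 200GB `text` column.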

Columnar Compression
Created_at: {1442825158, 1442826100, 1442827994, 1442828527}
Created_at    Diff(created_at)
1442825158    n/a
1442826100    942
1442827994    1894
1442828527    533
64 bits each  11 bits each
Many columns can compress to a few bits per row!
Especially:
Timestamps
Time series values
Low-cardinality strings
Massive space savings and throughput increase!
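
The Diff(created_at) column above is plain delta encoding. A small sketch, using the slide's four timestamps, shows the deltas and how many bits the largest one needs. The helper names are hypothetical, and a real encoder would also bit-pack the deltas, which is omitted here.

```python
def delta_encode(values):
    """Keep the first value and store each later value as a difference."""
    return values[0], [b - a for a, b in zip(values, values[1:])]


def delta_decode(first, deltas):
    """Rebuild the original values by running a cumulative sum."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


created_at = [1442825158, 1442826100, 1442827994, 1442828527]
first, deltas = delta_encode(created_at)
bits_needed = max(d.bit_length() for d in deltas)

print(deltas)       # [942, 1894, 533]
print(bits_needed)  # 11
assert delta_decode(first, deltas) == created_at
```

The largest delta, 1894, fits in 11 bits, which is where the slide's "11 bits each" versus "64 bits each" comparison comes from.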

How it Works
Write and Read Paths

LSM vs Kudu
LSM – Log Structured Merge (Cassandra, HBase, etc)
Inserts and updates all go to an in-memory map (MemStore) and later
flush to on-disk files (SSTable, HFile)
Reads perform an on-the-fly merge of all on-disk HFiles
Kudu
Shares some traits (memstores, compactions)
More complex.
Slower writes in exchange for faster reads (especially scans)

LSM Insert Path
MemStore
INSERT
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
HFile 1
flush

LSM Insert Path
MemStore
INSERT
Row=r1 col=c1 val=“blah2”
HFile 2
Row=r2 col=c1 val=“blah2”
flush
HFile 1
Row=r1 col=c1 val=“blah”

LSM Update path
MemStore
UPDATE
HFile 1
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c1 val=“newval”
Note: all updates are “fully
decoupled” from reads.
Random-write workload is
transformed to fully sequential!

LSM Read path
MemStore
HFile 1
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c1 val=“newval”
Merge based on string
row keys
R1: c1=blah c2=2
R2: c1=newval c2=5
….
CPU intensive!
Must always read
rowkeys
Any given row may exist
across multiple HFiles: must
always merge!
The more HFiles to merge, the
slower it reads
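
The LSM insert, update, and read paths above can be condensed into a toy model. This is a sketch only: `TinyLSM`, its flush threshold, and the dictionary-based "HFiles" are illustrative assumptions and do not reflect how HBase or Cassandra are implemented.

```python
class TinyLSM:
    """Toy log-structured merge store: one MemStore plus immutable flushed files."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}   # (row, col) -> value, held in memory
        self.hfiles = []     # flushed files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, row, col, val):
        # Inserts and updates are the same operation: write to the MemStore.
        self.memstore[(row, col)] = val
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.hfiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def scan(self):
        # Reads must merge every file plus the MemStore; newer entries win.
        merged = {}
        for store in self.hfiles + [self.memstore]:
            merged.update(store)
        return sorted(merged.items())


db = TinyLSM()
db.put("r1", "c1", "blah")
db.put("r1", "c2", "1")       # second entry triggers a flush to "HFile 1"
db.put("r1", "c1", "blah2")   # update lands in the MemStore; HFile 1 is untouched
print(len(db.hfiles), db.scan())
```

Writes never look at existing files, which keeps them sequential, but every scan pays for the merge, and the cost grows with the number of flushed files.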

Kudu storage – Inserts and Flushes
MemRowSet
INSERT(“todd”, “$1000”, “engineer”)
name pay role
DiskRowSet 1
flush
Separate files for each column
base data
Latest version of data

Kudu storage – Inserts and Flushes
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
INSERT(“doug”, “$1B”, “Hadoop man”)
flush
base data
base data

Kudu storage - Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
base data
base data
In memory: MemRowSet, DeltaMemStore
On disk: DiskRowSet base data

Kudu storage - Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
UPDATE set pay=“$1M” WHERE name=“todd”
Is the row in DiskRowSet 2? (check bloom filters)
Bloom says: no!
Is the row in DiskRowSet 1? (check bloom filters)
Bloom says: maybe!
Search key column to find offset: rowid = 150
150: col 1=$1M
base data
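
The update routing above (ask each DiskRowSet's bloom filter, then search the key column, then record the change in the DeltaMemStore) can be sketched as follows. The `BloomFilter` and `DiskRowSet` classes, their sizes, and the SHA-256 based hashing are illustrative assumptions, not Kudu's implementation.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DiskRowSet:
    def __init__(self, keys):
        self.keys = sorted(keys)      # key column of the base data
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)
        self.delta_mem_store = {}     # rowid -> {column: new value}

    def update(self, key, column, value):
        if not self.bloom.might_contain(key):  # "Bloom says: no!"
            return False
        if key not in self.keys:               # "maybe" was a false positive
            return False
        rowid = self.keys.index(key)           # offset within the key column
        self.delta_mem_store.setdefault(rowid, {})[column] = value
        return True


drs1, drs2 = DiskRowSet(["todd"]), DiskRowSet(["doug"])
updated = any(rs.update("todd", "pay", "$1M") for rs in (drs2, drs1))
print(updated, drs1.delta_mem_store, drs2.delta_mem_store)
```

The base data is not rewritten; the update lives as a delta keyed by row offset until a later flush or compaction.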

Kudu storage – Delta flushes
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
0: pay=foo
REDO DeltaFile
Flush
A REDO delta indicates how to
transform between the ‘base
data’ (columnar) and a later
version
base data
base data

Kudu storage – Minor delta compaction
name pay role
DiskRowSet(pre-compaction)
Delta MS
REDO DeltaFile REDO DeltaFile REDO DeltaFile
REDO DeltaFile
base data

Kudu storage – Major delta compaction
name pay role
DiskRowSet
Delta MS
Unmerged
REDO DeltaFile
base data
pay
Compaction can be performed
only on frequently updated columns
UNDO Records
UNDO stores previous versions
of data

Kudu storage – RowSet Compactions
Before compaction:
DRS 1 (32MB), Range: A-Z: [PK=alice], [PK=iris], [PK=linda], [PK=zach]
DRS 2 (32MB), Range: A-Z: [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
DRS 3 (32MB), Range: A-Z: [PK=carl], [PK=julie], [PK=omar], [PK=zoe]
Writes for “chris” have to perform bloom lookups on all 3 RS
Reorganize rows to avoid rowsets with overlapping key ranges
After compaction:
DRS 4 (32MB), Range: A-I: [alice, bob, carl, iris] (“chris” is in this range!)
DRS 5 (32MB), Range: J-M: [jon, julie, linda, mary]
DRS 6 (32MB), Range: O-Z: [omar, zach, zeke, zoe]
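
The effect of this compaction can be reproduced with the slide's own keys. In the sketch below, rowsets are plain sorted lists and the output is split by row count; splitting by row count rather than by size in MB is a simplifying assumption, and the function names are hypothetical.

```python
def overlapping(rowsets, key):
    """Rowsets whose [min key, max key] range could contain `key`."""
    return [rs for rs in rowsets if rs[0] <= key <= rs[-1]]


def compact(rowsets, rows_per_set):
    """Merge all rows in key order and cut them into non-overlapping rowsets."""
    merged = sorted(k for rs in rowsets for k in rs)
    return [merged[i:i + rows_per_set] for i in range(0, len(merged), rows_per_set)]


before = [
    ["alice", "iris", "linda", "zach"],
    ["bob", "jon", "mary", "zeke"],
    ["carl", "julie", "omar", "zoe"],
]
after = compact(before, rows_per_set=4)

print(after)
print(len(overlapping(before, "chris")), len(overlapping(after, "chris")))  # 3 1
```

Before compaction a write for "chris" has to consult all three rowsets; afterwards only the first one can contain it.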

Kudu Storage - Compactions
Main Idea: Always be compacting!
Compactions run continuously to prevent IO storms
“Budgeted” RS compactions: What is the best way to spend X MBs IO?
Physical/Logical decoupling: different replicas run compactions at different times

Kudu storage – Read path
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
150: pay=$1M
base data
base data
Just need to read this DiskRowSet!

Kudu storage – Time Travel Read
name pay role
DiskRowSet
Delta MS
base data
pay
UNDO Records
T=0: a query starts and reads “pay” in another DiskRowSet
T=10: major delta compaction happens!
Base file is updated, and UNDO is created
T=20: the query starts to read “pay” in this DiskRowSet,
but reads the version as of T=0 from the UNDO Records
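
A rough model of how UNDO records let a long-running query keep its snapshot: the base data holds the latest compacted value, UNDO records roll it backward, and REDO records roll it forward. The function `read_at`, the tuple layout, and the update at T=5 are assumptions made for this sketch; the slides do not specify Kudu's record format.

```python
def read_at(base_value, base_time, undo, redo, snapshot_time):
    """Return a cell's value as of `snapshot_time`.

    undo: list of (time, previous_value), the value that held before the change at `time`
    redo: list of (time, new_value), changes made after the base was written
    """
    value = base_value
    if snapshot_time < base_time:
        # Roll back, newest change first, every change made after the snapshot.
        for t, previous in sorted(undo, reverse=True):
            if t > snapshot_time:
                value = previous
    else:
        # Roll forward every change made up to the snapshot.
        for t, new in sorted(redo):
            if t <= snapshot_time:
                value = new
    return value


# Assumed history: pay was "$1000", updated to "$1M" at T=5,
# then a major delta compaction at T=10 folded the update into the base data.
base, undo, redo = "$1M", [(5, "$1000")], []

print(read_at(base, 10, undo, redo, snapshot_time=0))   # query that started at T=0
print(read_at(base, 10, undo, redo, snapshot_time=20))  # query that started at T=20
```

The query that began at T=0 still sees "$1000" even though it reaches this DiskRowSet after the compaction, while a newer query sees "$1M".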

Takeaways

Getting Started
On the web: https://www.cloudera.com/documentation/kudu/latest.html,
https://www.cloudera.com/downloads.html, https://blog.cloudera.com/?s=Kudu,
kudu.apache.org
•  Apache project user mailing list: user@kudu.apache.org
•  Quickstart VM
•  Easiest way to get started
•  Impala and Kudu in an easy-to-install VM
•  CSD and Parcels
•  For installation on a Cloudera Manager-managed cluster
Training classes available: https://www.cloudera.com/more/training.html

Cloudera World Tokyo 2017
Tue, Nov 7, 2017
ANA Intercontinental Hotel
Estimated Attendees #: 1000
E-1: Apache Kudu on Analytical Data Platform
Register Now!
www.clouderaworldtokyo.com

Thank you
sho@cloudera.com
