SlideShare a Scribd company logo
1© Cloudera, Inc. All rights reserved.
Apache Kudu
Updatable Analytical Storage for Modern Data Platform
Sho Shimauchi | Sales Engineer | Cloudera
2© Cloudera, Inc. All rights reserved.
Who Am I?
Sho Shimauchi
Sales Engineer / Technical Evangelist
Joined Cloudera in 2011
The First Employee in Cloudera APJ
Email: sho@cloudera.com
Twitter: @shiumachi
3© Cloudera, Inc. All rights reserved.
•  Founded in 2008
•  1600+ Clouderans
•  Machine learning and analytics platform
•  Shared data experience
•  Cloud-native and cloud-differentiated
•  Open-source innovation and efficiency
4© Cloudera, Inc. All rights reserved.
Rakuten Card replaced Mainframe to
Cloudera Enterprise in 2017
Apache Spark improved performance
of the batch processes >2x
Please join Cloudera World Tokyo
2017 to see Kobayashi-san’s Keynote!
www.clouderaworldtokyo.com
Rakuten Card + Cloudera
5© Cloudera, Inc. All rights reserved.
Why Kudu?
Use Cases and Motivation
6© Cloudera, Inc. All rights reserved. 6
The modern platform for machine learning and analytics optimized for the cloud
EXTENSIBLE
SERVICES
CORE
SERVICES DATA
ENGINEERING
OPERATIONAL
DATABASE
ANALYTIC
DATABASE
DATA CATALOG
INGEST &
REPLICATION
SECURITY GOVERNANCE
WORKLOAD
MANAGEMENT
DATA
SCIENCE
NEW
OFFERINGS
Cloudera Enterprise
Amazon S3 Microsoft ADLS HDFS KUDU
STORAGE
SERVICES
7© Cloudera, Inc. All rights reserved.
HDFS
Fast Scans, Analytics
and Processing of
Stored Data
Fast On-Line
Updates &
Data Serving
Arbitrary Storage
(Active Archive)
Fast Analytics
(on fast-changing or
frequently-updated data)
Unchanging
Fast Changing
Frequent Updates
HBase
Append-Only
Real-Time
Kudu Kudu fills the Gap
Modern analytic applications
often require complex data
flow & difficult integration
work to move data between
HBase & HDFS
Analytic
Gap
Pace of Analysis
PaceofData
Filling the Analytic Gap
8© Cloudera, Inc. All rights reserved.
Apache Kudu: Scalable and fast structured storage
Scalable
•  Tested up to 300+ nodes (PBs cluster)
•  Designed to scale to 1000s of nodes and tens of PBs
Fast
•  Multiple GB/second read throughput per node
•  Millions of read/write operations per second across cluster
Tabular
•  Represents data in structured tables like a relational database
•  Strict schema, finite column count, no BLOBs
•  Individual record-level access to 100+ billion row tables
9© Cloudera, Inc. All rights reserved.
Apache Kudu Community
10© Cloudera, Inc. All rights reserved.
Can you insert time series data in
real time? How long does it take to
prepare it for analysis? Can you get
results and act fast enough to
change outcomes?
Can you handle large volumes of
machine-generated data? Do you
have the tools to identify problems or
threats? Can your system do
machine learning?
How fast can you add data to your
data store? Are you trading off the
ability to do broad analytics for the
ability to make updates? Are you
retaining only part of your data?
Time Series Data Machine Data Analytics Online Reporting
Why Kudu?
11© Cloudera, Inc. All rights reserved.
Cheaper and faster every year.
Persistent memory (3D XPoint™)
Kudu can take advantage of SSD
and NVM using Intel’s NVM Library.
RAM is cheaper and bigger every
day.
Kudu runs smoothly with huge RAM.
Written in C++ to avoid GC issues.
Modern CPUs are adding cores and
SIMD width, not GHz.
Kudu takes advantage of SIMD
instructions and concurrent data
structures.
Next generation hardware
Solid-state Storage Cheaper, Bigger Memory Efficiency on Modern CPUs
12© Cloudera, Inc. All rights reserved.
How it Works
Replication And Fault Tolerance
13© Cloudera, Inc. All rights reserved.
Tables, tablets, and tablet servers
• Each table is horizontally partitioned into tablets
• Range or hash partitioning
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE
BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5) with Raft consensus
• Automatic fault tolerance
• MTTR (mean time to repair): ~5 seconds
14© Cloudera, Inc. All rights reserved.
Metadata
Replicated master
Acts as a tablet directory
Acts as a catalog (which tables exist, etc)
Acts as a load balancer (tracks TS liveness, re-replicates under-
replicated tablets)
Caches all metadata in RAM for high performance
Client configured with master addresses
Asks master for tablet locations as needed and caches them
15© Cloudera, Inc. All rights reserved.
Client
Hey Master! Where is the row for ‘tlipcon’
in table “T”?
It’s part of tablet 2, which is on servers {Z,Y,X}.
BTW, here’s info on other tablets you might care
about: T1, T2, T3, …
UPDATE tlipcon
SET col=foo
Meta Cache
T1: …
T2: …
T3: …
16© Cloudera, Inc. All rights reserved.
Raft consensus
TS A
Tablet 1
(LEADER)
Client
TS B
Tablet 1
(FOLLOWER)
TS C
Tablet 1
(FOLLOWER)
WAL
WALWAL
2b. Leader writes local WAL
1a. Client->Leader: Write() RPC
2a. Leader->Followers:
UpdateConsensus() RPC
3. Follower: write WAL
4. Follower->Leader: success
3. Follower: write WAL
5. Leader has achieved majority
6. Leader->Client: Success!
17© Cloudera, Inc. All rights reserved.
How it Works
Columnar Storage
18© Cloudera, Inc. All rights reserved.
Row Storage
Scans have to read all the data, no encodings
{23059873, newsycbot, 1442865158, Visual exp…}
{22309487, RideImpala, 1442828307, Introducing …}
…
Tweet_id, user_name, created_at, text
19© Cloudera, Inc. All rights reserved.
{25059873,
22309487,
23059861,
23010982}
Tweet_id
{newsycbot,
RideImpala,
fastly, llvmorg}
User_name
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
Columnar Storage
20© Cloudera, Inc. All rights reserved.
SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’;
{25059873,
22309487,
23059861,
23010982}
Tweet_id
1GB
{newsycbot,
RideImpala,
fastly, llvmorg}
User_name
Only read 1 column
2GB
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
1GB
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
200GB
Columnar Storage
21© Cloudera, Inc. All rights reserved.
{1442825158,
1442826100,
1442827994,
1442828527}
Created_at
Created_at Diff(created_at)
1442825158 n/a
1442826100 942
1442827994 1894	
1442828527 533
64 bits each 11 bits each
Columnar Compression
Many columns can compress to
a few bits per row!
Especially:
Timestamps
Time series values
Low-cardinality strings
Massive space savings and
throughput increase!
22© Cloudera, Inc. All rights reserved.
How it Works
Write and Read Paths
23© Cloudera, Inc. All rights reserved.
LSM vs Kudu
LSM – Log Structured Merge (Cassandra, HBase, etc)
Inserts and updates all go to an in-memory map (MemStore) and later
flush to on-disk files (SSTable, HFile)
Reads perform an on-the-fly merge of all on-disk HFiles
Kudu
Shares some traits (memstores, compactions)
More complex.
Slower writes in exchange for faster reads (especially scans)
24© Cloudera, Inc. All rights reserved.
LSM Insert Path
MemStore
INSERT
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
flush
25© Cloudera, Inc. All rights reserved.
LSM Insert Path
MemStore
INSERT
Row=r1 col=c1 val=“blah2”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“blah2”
Row=r2 col=c2 val=“2”
flush
HFile 1Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
26© Cloudera, Inc. All rights reserved.
LSM Update path
MemStore
UPDATE
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Row=r2 col=c1 val=“newval”
Note: all updates are “fully
decoupled” from reads.
Random-write workload is
transformed to fully sequential!
27© Cloudera, Inc. All rights reserved.
LSM Read path
MemStore
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Row=r2 col=c1 val=“newval”
Merge based on string
row keys
R1: c1=blah c2=2
R2: c1=newval c2=5
….
CPU intensive!
Must always read
rowkeys
Any given row may exist
across multiple HFiles: must
always merge!
The more HFiles to merge, the
slower it reads
28© Cloudera, Inc. All rights reserved.
Kudu storage – Inserts and Flushes
MemRowSet
INSERT(“todd”, “$1000”,”engineer”)
name pay role
DiskRowSet 1
flush
Multiple files for each columns
base data
Latest version of data
29© Cloudera, Inc. All rights reserved.
Kudu storage – Inserts and Flushes
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
INSERT(“doug”, “$1B”, “Hadoop man”)
flush
base data
base data
30© Cloudera, Inc. All rights reserved.
Kudu storage - Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
base data
base data
On MemoryOn Disk
On Memory
31© Cloudera, Inc. All rights reserved.
Kudu storage - Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
UPDATE set pay=“$1M”
WHERE name=“todd”
Is the row in DiskRowSet 2?
(check bloom filters)
Is the row in DiskRowSet 1?
(check bloom filters)
Bloom says: no!
Bloom says: maybe!
Search key column to find
offset: rowid = 150
150: col
1=$1M
base data
32© Cloudera, Inc. All rights reserved.
Kudu storage – Delta flushes
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
0: pay=fooREDO DeltaFile
Flush
A REDO delta indicates how to
transform between the ‘base
data’ (columnar) and a later
version
base data
base data
33© Cloudera, Inc. All rights reserved.
Kudu storage – Minor delta compaction
name pay role
DiskRowSet(pre-compaction)
Delta MS
REDO DeltaFile REDO DeltaFile REDO DeltaFile
REDO DeltaFile
base data
34© Cloudera, Inc. All rights reserved.
Kudu storage – Major delta compaction
name pay role
DiskRowSet
Delta MS
REDO DeltaFile REDO DeltaFile REDO DeltaFile
Unmerged
REDO DeltaFile
base data
pay
Compaction can be performed
only on high-frequent column
UNDO Records
UNDO stores previous versions
of data
35© Cloudera, Inc. All rights reserved.
Kudu storage – RowSet Compactions
DRS 1 (32MB)
[PK=alice], [PK=iris], [PK=linda], [PK=zach]
DRS 2 (32MB)
[PK=bob], [PK=jon], [PK=mary] [PK=zeke]
DRS 3 (32MB)
[PK=carl], [PK=julie], [PK=omar] [PK=zoe]
DRS 4 (32MB) DRS 5 (32MB) DRS 6 (32MB)
[alice, bob, carl, iris] [jon, julie, linda, mary] [omar, zach, zeke, zoe]
Writes for “chris” have to perform
bloom lookups on all 3 RS
Range: A-Z
Range: A-Z
Range: A-Z
Range: A-I Range: J-M Range: O-Z
Reorganize rows to avoid rowsets
with overlapping key ranges
“chris” is in this range!
36© Cloudera, Inc. All rights reserved.
Kudu Storage - Compactions
Main Idea: Always be compacting!
Compactions run continuously to prevent IO storms
”Budgeted” RS compactions: What is the best way to spend X MBs IO?
Physical/Logical decoupling: different replicas run compactions at different
times
37© Cloudera, Inc. All rights reserved.
Kudu storage – Read path
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
DeltaMemStore
DeltaMemStore
150: pay=$1M
base data
base data
Just need to read this DiskRowSet!
38© Cloudera, Inc. All rights reserved.
Kudu storage – Time Travel Read
name pay role
DiskRowSet
Delta MS
REDO DeltaFile REDO DeltaFile REDO DeltaFile
base data
pay
UNDO Records
T=0: a query starts to read “pay” in other DiskRowSet
T=10: major delta compaction happened!
Base file is updated, and UNDO is created
T=20: the query starts to read “pay” in this DiskRowSet,
but read the version of T=0 from UNDO Records
39© Cloudera, Inc. All rights reserved.
Takeaways
40© Cloudera, Inc. All rights reserved.
Getting Started
On the web: https://www.cloudera.com/documentation/kudu/latest.html,
https://www.cloudera.com/downloads.html, https://blog.cloudera.com/?s=Kudu,
kudu.apache.org
•  Apache project user mailing list: user@kudu.apache.org
•  Quickstart VM
•  Easiest way to get started
•  Impala and Kudu in an easy-to-install VM
•  CSD and Parcels
•  For installation on a Cloudera Manager-managed cluster
Training classes available: https://www.cloudera.com/more/training.html
41© Cloudera, Inc. All rights reserved.
Nov 7, 2017 Tue
ANA Intercontinental Hotel
Estimated Attendees #: 1000
E-1: Apache Kudu on Analytical Data
Platform
Register Now!
www.clouderaworldtokyo.com
Cloudera World Tokyo 2017
42© Cloudera, Inc. All rights reserved.
Thank	you	
sho@cloudera.com

More Related Content

What's hot

Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
نهاد مبارك
 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017
Cloudera Japan
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
Cloudera, Inc.
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
Rakuten Group, Inc.
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
Cloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
Cloudera, Inc.
 
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Cloudera, Inc.
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesDataWorks Summit
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
markgrover
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
Amazon Web Services
 
Maintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopMaintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoop
Kai Sasaki
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
Cloudera, Inc.
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


Cloudera, Inc.
 
What Can HPC on AWS Do?
What Can HPC on AWS Do?What Can HPC on AWS Do?
What Can HPC on AWS Do?
inside-BigData.com
 

What's hot (20)

Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
 
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual Machines
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
Maintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoopMaintainable cloud architecture_of_hadoop
Maintainable cloud architecture_of_hadoop
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
What Can HPC on AWS Do?
What Can HPC on AWS Do?What Can HPC on AWS Do?
What Can HPC on AWS Do?
 

Viewers also liked

#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計
Cloudera Japan
 
「新製品 Kudu 及び RecordServiceの概要」 #cwt2015
「新製品 Kudu 及び RecordServiceの概要」 #cwt2015「新製品 Kudu 及び RecordServiceの概要」 #cwt2015
「新製品 Kudu 及び RecordServiceの概要」 #cwt2015
Cloudera Japan
 
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
Cloudera Japan
 
基礎から学ぶ超並列SQLエンジンImpala #cwt2015
基礎から学ぶ超並列SQLエンジンImpala #cwt2015基礎から学ぶ超並列SQLエンジンImpala #cwt2015
基礎から学ぶ超並列SQLエンジンImpala #cwt2015
Cloudera Japan
 
Gunosyデータマイニング研究会 #118 これからの強化学習
Gunosyデータマイニング研究会 #118 これからの強化学習Gunosyデータマイニング研究会 #118 これからの強化学習
Gunosyデータマイニング研究会 #118 これからの強化学習
圭輔 大曽根
 
機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 
機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 
機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 
Seiji Takahashi
 
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LTあなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
Hiroaki Kudo
 
A/B Testing at Pinterest: Building a Culture of Experimentation
A/B Testing at Pinterest: Building a Culture of Experimentation A/B Testing at Pinterest: Building a Culture of Experimentation
A/B Testing at Pinterest: Building a Culture of Experimentation
WrangleConf
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
Kentaro Yoshida
 
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
Hiroaki Kudo
 
Gunosy における AWS 上での自然言語処理・機械学習の活用事例
Gunosy における AWS 上での自然言語処理・機械学習の活用事例Gunosy における AWS 上での自然言語処理・機械学習の活用事例
Gunosy における AWS 上での自然言語処理・機械学習の活用事例
圭輔 大曽根
 
記事分類における教師データおよびモデルの管理
記事分類における教師データおよびモデルの管理記事分類における教師データおよびモデルの管理
記事分類における教師データおよびモデルの管理
圭輔 大曽根
 
論文紹介@ Gunosyデータマイニング研究会 #97
論文紹介@ Gunosyデータマイニング研究会 #97論文紹介@ Gunosyデータマイニング研究会 #97
論文紹介@ Gunosyデータマイニング研究会 #97
圭輔 大曽根
 
マイクロサービスとABテスト
マイクロサービスとABテストマイクロサービスとABテスト
マイクロサービスとABテスト
圭輔 大曽根
 
WebDB Forum 2016 gunosy
WebDB Forum 2016 gunosyWebDB Forum 2016 gunosy
WebDB Forum 2016 gunosy
Hiroaki Kudo
 
Apache Kuduを使った分析システムの裏側
Apache Kuduを使った分析システムの裏側Apache Kuduを使った分析システムの裏側
Apache Kuduを使った分析システムの裏側
Cloudera Japan
 
いまさら聞けない機械学習の評価指標
いまさら聞けない機械学習の評価指標いまさら聞けない機械学習の評価指標
いまさら聞けない機械学習の評価指標
圭輔 大曽根
 

Viewers also liked (17)

#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計#cwt2016 Apache Kudu 構成とテーブル設計
#cwt2016 Apache Kudu 構成とテーブル設計
 
「新製品 Kudu 及び RecordServiceの概要」 #cwt2015
「新製品 Kudu 及び RecordServiceの概要」 #cwt2015「新製品 Kudu 及び RecordServiceの概要」 #cwt2015
「新製品 Kudu 及び RecordServiceの概要」 #cwt2015
 
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
Apache Kuduは何がそんなに「速い」DBなのか? #dbts2017
 
基礎から学ぶ超並列SQLエンジンImpala #cwt2015
基礎から学ぶ超並列SQLエンジンImpala #cwt2015基礎から学ぶ超並列SQLエンジンImpala #cwt2015
基礎から学ぶ超並列SQLエンジンImpala #cwt2015
 
Gunosyデータマイニング研究会 #118 これからの強化学習
Gunosyデータマイニング研究会 #118 これからの強化学習Gunosyデータマイニング研究会 #118 これからの強化学習
Gunosyデータマイニング研究会 #118 これからの強化学習
 
機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 
機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 
機械学習で大事なことをミニGunosyをつくって学んだ╭( ・ㅂ・)و ̑̑ 
 
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LTあなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
あなただけにそっと教える弊社の分析事情 #data analyst meetup tokyo vol.1 LT
 
A/B Testing at Pinterest: Building a Culture of Experimentation
A/B Testing at Pinterest: Building a Culture of Experimentation A/B Testing at Pinterest: Building a Culture of Experimentation
A/B Testing at Pinterest: Building a Culture of Experimentation
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
 
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
“確率的最適化”を読む前に知っておくといいかもしれない関数解析のこと
 
Gunosy における AWS 上での自然言語処理・機械学習の活用事例
Gunosy における AWS 上での自然言語処理・機械学習の活用事例Gunosy における AWS 上での自然言語処理・機械学習の活用事例
Gunosy における AWS 上での自然言語処理・機械学習の活用事例
 
記事分類における教師データおよびモデルの管理
記事分類における教師データおよびモデルの管理記事分類における教師データおよびモデルの管理
記事分類における教師データおよびモデルの管理
 
論文紹介@ Gunosyデータマイニング研究会 #97
論文紹介@ Gunosyデータマイニング研究会 #97論文紹介@ Gunosyデータマイニング研究会 #97
論文紹介@ Gunosyデータマイニング研究会 #97
 
マイクロサービスとABテスト
マイクロサービスとABテストマイクロサービスとABテスト
マイクロサービスとABテスト
 
WebDB Forum 2016 gunosy
WebDB Forum 2016 gunosyWebDB Forum 2016 gunosy
WebDB Forum 2016 gunosy
 
Apache Kuduを使った分析システムの裏側
Apache Kuduを使った分析システムの裏側Apache Kuduを使った分析システムの裏側
Apache Kuduを使った分析システムの裏側
 
いまさら聞けない機械学習の評価指標
いまさら聞けない機械学習の評価指標いまさら聞けない機械学習の評価指標
いまさら聞けない機械学習の評価指標
 

Similar to Apache Kudu - Updatable Analytical Storage #rakutentech

Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
michaelguia
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
Felicia Haggarty
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Dataconomy Media
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Spark Summit
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
jdcryans
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
Cloudera, Inc.
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Alluxio, Inc.
 
Fast Analytics
Fast Analytics Fast Analytics
How to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machinesHow to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machines
SolarWinds
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
Kathleen Ting
 

Similar to Apache Kudu - Updatable Analytical Storage #rakutentech (20)

Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
 
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and StorageAccelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
 
Fast Analytics
Fast Analytics Fast Analytics
Fast Analytics
 
How to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machinesHow to deploy SQL Server on an Microsoft Azure virtual machines
How to deploy SQL Server on an Microsoft Azure virtual machines
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 

More from Cloudera Japan

Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Cloudera Japan
 
機械学習の定番プラットフォームSparkの紹介
機械学習の定番プラットフォームSparkの紹介機械学習の定番プラットフォームSparkの紹介
機械学習の定番プラットフォームSparkの紹介
Cloudera Japan
 
HDFS Supportaiblity Improvements
HDFS Supportaiblity ImprovementsHDFS Supportaiblity Improvements
HDFS Supportaiblity Improvements
Cloudera Japan
 
Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018
Cloudera Japan
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Cloudera Japan
 
HBase Across the World #LINE_DM
HBase Across the World #LINE_DMHBase Across the World #LINE_DM
HBase Across the World #LINE_DM
Cloudera Japan
 
Cloudera in the Cloud #CWT2017
Cloudera in the Cloud #CWT2017Cloudera in the Cloud #CWT2017
Cloudera in the Cloud #CWT2017
Cloudera Japan
 
先行事例から学ぶ IoT / ビッグデータの始め方
先行事例から学ぶ IoT / ビッグデータの始め方先行事例から学ぶ IoT / ビッグデータの始め方
先行事例から学ぶ IoT / ビッグデータの始め方
Cloudera Japan
 
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
Cloudera Japan
 
Hue 4.0 / Hue Meetup Tokyo #huejp
Hue 4.0 / Hue Meetup Tokyo #huejpHue 4.0 / Hue Meetup Tokyo #huejp
Hue 4.0 / Hue Meetup Tokyo #huejp
Cloudera Japan
 
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadedaCloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
Cloudera Japan
 
Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016
Cloudera Japan
 
大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016
Cloudera Japan
 
#cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング
#cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング #cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング
#cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング
Cloudera Japan
 
Ibis: すごい pandas ⼤規模データ分析もらっくらく #summerDS
Ibis: すごい pandas ⼤規模データ分析もらっくらく #summerDSIbis: すごい pandas ⼤規模データ分析もらっくらく #summerDS
Ibis: すごい pandas ⼤規模データ分析もらっくらく #summerDS
Cloudera Japan
 
クラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakalt
クラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakaltクラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakalt
クラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakalt
Cloudera Japan
 
MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015
MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015
MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015
Cloudera Japan
 
PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015
PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015
PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015
Cloudera Japan
 
基調講演: 「データエコシステムへの挑戦」 #cwt2015
基調講演: 「データエコシステムへの挑戦」 #cwt2015基調講演: 「データエコシステムへの挑戦」 #cwt2015
基調講演: 「データエコシステムへの挑戦」 #cwt2015
Cloudera Japan
 
基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015
基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015
基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015
Cloudera Japan
 

More from Cloudera Japan (20)

Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
 
機械学習の定番プラットフォームSparkの紹介
機械学習の定番プラットフォームSparkの紹介機械学習の定番プラットフォームSparkの紹介
機械学習の定番プラットフォームSparkの紹介
 
HDFS Supportaiblity Improvements
HDFS Supportaiblity ImprovementsHDFS Supportaiblity Improvements
HDFS Supportaiblity Improvements
 
Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理
 
HBase Across the World #LINE_DM
HBase Across the World #LINE_DMHBase Across the World #LINE_DM
HBase Across the World #LINE_DM
 
Cloudera in the Cloud #CWT2017
Cloudera in the Cloud #CWT2017Cloudera in the Cloud #CWT2017
Cloudera in the Cloud #CWT2017
 
先行事例から学ぶ IoT / ビッグデータの始め方
先行事例から学ぶ IoT / ビッグデータの始め方先行事例から学ぶ IoT / ビッグデータの始め方
先行事例から学ぶ IoT / ビッグデータの始め方
 
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
Clouderaが提供するエンタープライズ向け運用、データ管理ツールの使い方 #CW2017
 
Hue 4.0 / Hue Meetup Tokyo #huejp
Hue 4.0 / Hue Meetup Tokyo #huejpHue 4.0 / Hue Meetup Tokyo #huejp
Hue 4.0 / Hue Meetup Tokyo #huejp
 
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadedaCloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
Cloudera Data Science WorkbenchとPySparkで 好きなPythonライブラリを 分散で使う #cadeda
 
Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016Cloud Native Hadoop #cwt2016
Cloud Native Hadoop #cwt2016
 
大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016大規模データに対するデータサイエンスの進め方 #CWT2016
大規模データに対するデータサイエンスの進め方 #CWT2016
 
#cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング
#cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング #cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング
#cwt2016 Cloudera Managerを用いた Hadoop のトラブルシューティング
 
Ibis: すごい pandas ⼤規模データ分析もらっくらく #summerDS
Ibis: すごい pandas ⼤規模データ分析もらっくらく #summerDSIbis: すごい pandas ⼤規模データ分析もらっくらく #summerDS
Ibis: すごい pandas ⼤規模データ分析もらっくらく #summerDS
 
クラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakalt
クラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakaltクラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakalt
クラウド上でHadoopを構築できる Cloudera Director 2.0 の紹介 #dogenzakalt
 
MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015
MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015
MapReduceを置き換えるSpark 〜HadoopとSparkの統合〜 #cwt2015
 
PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015
PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015
PCIコンプライアンスに向けたビジネス指針〜MasterCardの事例〜 #cwt2015
 
基調講演: 「データエコシステムへの挑戦」 #cwt2015
基調講演: 「データエコシステムへの挑戦」 #cwt2015基調講演: 「データエコシステムへの挑戦」 #cwt2015
基調講演: 「データエコシステムへの挑戦」 #cwt2015
 
基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015
基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015
基調講演:「ビッグデータのセキュリティとガバナンス要件」 #cwt2015
 

Recently uploaded

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 

Recently uploaded (20)

Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

Apache Kudu - Updatable Analytical Storage #rakutentech

  • 1. 1© Cloudera, Inc. All rights reserved. Apache Kudu Updatable Analytical Storage for Modern Data Platform Sho Shimauchi | Sales Engineer | Cloudera
  • 2. 2© Cloudera, Inc. All rights reserved. Who Am I? Sho Shimauchi Sales Engineer / Technical Evangelist Joined Cloudera in 2011 The First Employee in Cloudera APJ Email: sho@cloudera.com Twitter: @shiumachi
  • 3. 3© Cloudera, Inc. All rights reserved. •  Founded in 2008 •  1600+ Clouderans •  Machine learning and analytics platform •  Shared data experience •  Cloud-native and cloud-differentiated •  Open-source innovation and efficiency
  • 4. 4© Cloudera, Inc. All rights reserved. Rakuten Card replaced Mainframe to Cloudera Enterprise in 2017 Apache Spark improved performance of the batch processes >2x Please join Cloudera World Tokyo 2017 to see Kobayashi-san’s Keynote! www.clouderaworldtokyo.com Rakuten Card + Cloudera
  • 5. 5© Cloudera, Inc. All rights reserved. Why Kudu? Use Cases and Motivation
  • 6. 6© Cloudera, Inc. All rights reserved. 6 The modern platform for machine learning and analytics optimized for the cloud EXTENSIBLE SERVICES CORE SERVICES DATA ENGINEERING OPERATIONAL DATABASE ANALYTIC DATABASE DATA CATALOG INGEST & REPLICATION SECURITY GOVERNANCE WORKLOAD MANAGEMENT DATA SCIENCE NEW OFFERINGS Cloudera Enterprise Amazon S3 Microsoft ADLS HDFS KUDU STORAGE SERVICES
  • 7. 7© Cloudera, Inc. All rights reserved. HDFS Fast Scans, Analytics and Processing of Stored Data Fast On-Line Updates & Data Serving Arbitrary Storage (Active Archive) Fast Analytics (on fast-changing or frequently-updated data) Unchanging Fast Changing Frequent Updates HBase Append-Only Real-Time Kudu Kudu fills the Gap Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS Analytic Gap Pace of Analysis PaceofData Filling the Analytic Gap
  • 8. 8© Cloudera, Inc. All rights reserved. Apache Kudu: Scalable and fast structured storage Scalable •  Tested up to 300+ nodes (PBs cluster) •  Designed to scale to 1000s of nodes and tens of PBs Fast •  Multiple GB/second read throughput per node •  Millions of read/write operations per second across cluster Tabular •  Represents data in structured tables like a relational database •  Strict schema, finite column count, no BLOBs •  Individual record-level access to 100+ billion row tables
  • 9. 9© Cloudera, Inc. All rights reserved. Apache Kudu Community
  • 10. 10© Cloudera, Inc. All rights reserved. Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes? Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning? How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data? Time Series Data Machine Data Analytics Online Reporting Why Kudu?
  • 11. 11© Cloudera, Inc. All rights reserved. Cheaper and faster every year. Persistent memory (3D XPoint™) Kudu can take advantage of SSD and NVM using Intel’s NVM Library. RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues. Modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures. Next generation hardware Solid-state Storage Cheaper, Bigger Memory Efficiency on Modern CPUs
  • 12. 12© Cloudera, Inc. All rights reserved. How it Works Replication And Fault Tolerance
  • 13. 13© Cloudera, Inc. All rights reserved. Tables, tablets, and tablet servers • Each table is horizontally partitioned into tablets • Range or hash partitioning • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS • Each tablet has N replicas (3 or 5) with Raft consensus • Automatic fault tolerance • MTTR (mean time to repair): ~5 seconds
  • 14. 14© Cloudera, Inc. All rights reserved. Metadata Replicated master Acts as a tablet directory Acts as a catalog (which tables exist, etc) Acts as a load balancer (tracks TS liveness, re-replicates under- replicated tablets) Caches all metadata in RAM for high performance Client configured with master addresses Asks master for tablet locations as needed and caches them
  • 15. 15© Cloudera, Inc. All rights reserved. Client Hey Master! Where is the row for ‘tlipcon’ in table “T”? It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, … UPDATE tlipcon SET col=foo Meta Cache T1: … T2: … T3: …
  • 16. 16© Cloudera, Inc. All rights reserved. Raft consensus TS A Tablet 1 (LEADER) Client TS B Tablet 1 (FOLLOWER) TS C Tablet 1 (FOLLOWER) WAL WALWAL 2b. Leader writes local WAL 1a. Client->Leader: Write() RPC 2a. Leader->Followers: UpdateConsensus() RPC 3. Follower: write WAL 4. Follower->Leader: success 3. Follower: write WAL 5. Leader has achieved majority 6. Leader->Client: Success!
  • 17. 17© Cloudera, Inc. All rights reserved. How it Works Columnar Storage
  • 18. 18© Cloudera, Inc. All rights reserved. Row Storage Scans have to read all the data, no encodings {23059873, newsycbot, 1442865158, Visual exp…} {22309487, RideImpala, 1442828307, Introducing …} … Tweet_id, user_name, created_at, text
  • 19. 19© Cloudera, Inc. All rights reserved. {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text Columnar Storage
  • 20. 20© Cloudera, Inc. All rights reserved. SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’; {25059873, 22309487, 23059861, 23010982} Tweet_id 1GB {newsycbot, RideImpala, fastly, llvmorg} User_name Only read 1 column 2GB {1442865158, 1442828307, 1442865156, 1442865155} Created_at 1GB {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text 200GB Columnar Storage
  • 21. 21© Cloudera, Inc. All rights reserved. {1442825158, 1442826100, 1442827994, 1442828527} Created_at Created_at Diff(created_at) 1442825158 n/a 1442826100 942 1442827994 1894 1442828527 533 64 bits each 11 bits each Columnar Compression Many columns can compress to a few bits per row! Especially: Timestamps Time series values Low-cardinality strings Massive space savings and throughput increase!
  • 22. 22© Cloudera, Inc. All rights reserved. How it Works Write and Read Paths
  • 23. 23© Cloudera, Inc. All rights reserved. LSM vs Kudu LSM – Log Structured Merge (Cassandra, HBase, etc) Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (SSTable, HFile) Reads perform an on-the-fly merge of all on-disk HFiles Kudu Shares some traits (memstores, compactions) More complex. Slower writes in exchange for faster reads (especially scans)
  • 24. 24© Cloudera, Inc. All rights reserved. LSM Insert Path MemStore INSERT Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” flush
  • 25. 25© Cloudera, Inc. All rights reserved. LSM Insert Path MemStore INSERT Row=r1 col=c1 val=“blah2” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“blah2” Row=r2 col=c2 val=“2” flush HFile 1Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1”
  • 26. 26© Cloudera, Inc. All rights reserved. LSM Update path MemStore UPDATE HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
  • 27. 27© Cloudera, Inc. All rights reserved. LSM Read path MemStore HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Merge based on string row keys R1: c1=blah c2=2 R2: c1=newval c2=5 …. CPU intensive! Must always read rowkeys Any given row may exist across multiple HFiles: must always merge! The more HFiles to merge, the slower it reads
  • 28. 28© Cloudera, Inc. All rights reserved. Kudu storage – Inserts and Flushes MemRowSet INSERT(“todd”, “$1000”,”engineer”) name pay role DiskRowSet 1 flush Multiple files for each columns base data Latest version of data
  • 29. 29© Cloudera, Inc. All rights reserved. Kudu storage – Inserts and Flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 INSERT(“doug”, “$1B”, “Hadoop man”) flush base data base data
  • 30. 30© Cloudera, Inc. All rights reserved. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 DeltaMemStore DeltaMemStore base data base data On MemoryOn Disk On Memory
  • 31. 31© Cloudera, Inc. All rights reserved. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 DeltaMemStore DeltaMemStore UPDATE set pay=“$1M” WHERE name=“todd” Is the row in DiskRowSet 2? (check bloom filters) Is the row in DiskRowSet 1? (check bloom filters) Bloom says: no! Bloom says: maybe! Search key column to find offset: rowid = 150 150: col 1=$1M base data
  • 32. 32© Cloudera, Inc. All rights reserved. Kudu storage – Delta flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 DeltaMemStore DeltaMemStore 0: pay=fooREDO DeltaFile Flush A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version base data base data
  • 33. 33© Cloudera, Inc. All rights reserved. Kudu storage – Minor delta compaction name pay role DiskRowSet(pre-compaction) Delta MS REDO DeltaFile REDO DeltaFile REDO DeltaFile REDO DeltaFile base data
  • 34. 34© Cloudera, Inc. All rights reserved. Kudu storage – Major delta compaction name pay role DiskRowSet Delta MS REDO DeltaFile REDO DeltaFile REDO DeltaFile Unmerged REDO DeltaFile base data pay Compaction can be performed only on high-frequent column UNDO Records UNDO stores previous versions of data
  • 35. 35© Cloudera, Inc. All rights reserved. Kudu storage – RowSet Compactions DRS 1 (32MB) [PK=alice], [PK=iris], [PK=linda], [PK=zach] DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary] [PK=zeke] DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar] [PK=zoe] DRS 4 (32MB) DRS 5 (32MB) DRS 6 (32MB) [alice, bob, carl, iris] [jon, julie, linda, mary] [omar, zach, zeke, zoe] Writes for “chris” have to perform bloom lookups on all 3 RS Range: A-Z Range: A-Z Range: A-Z Range: A-I Range: J-M Range: O-Z Reorganize rows to avoid rowsets with overlapping key ranges “chris” is in this range!
  • 36. 36© Cloudera, Inc. All rights reserved. Kudu Storage - Compactions Main Idea: Always be compacting! Compactions run continuously to prevent IO storms ”Budgeted” RS compactions: What is the best way to spend X MBs IO? Physical/Logical decoupling: different replicas run compactions at different times
  • 37. 37© Cloudera, Inc. All rights reserved. Kudu storage – Read path MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 DeltaMemStore DeltaMemStore 150: pay=$1M base data base data Just need to read this DiskRowSet!
  • 38. 38© Cloudera, Inc. All rights reserved. Kudu storage – Time Travel Read name pay role DiskRowSet Delta MS REDO DeltaFile REDO DeltaFile REDO DeltaFile base data pay UNDO Records T=0: a query starts to read “pay” in other DiskRowSet T=10: major delta compaction happened! Base file is updated, and UNDO is created T=20: the query starts to read “pay” in this DiskRowSet, but read the version of T=0 from UNDO Records
  • 39. 39© Cloudera, Inc. All rights reserved. Takeaways
  • 40. 40© Cloudera, Inc. All rights reserved. Getting Started On the web: https://www.cloudera.com/documentation/kudu/latest.html, https://www.cloudera.com/downloads.html, https://blog.cloudera.com/?s=Kudu, kudu.apache.org •  Apache project user mailing list: user@kudu.apache.org •  Quickstart VM •  Easiest way to get started •  Impala and Kudu in an easy-to-install VM •  CSD and Parcels •  For installation on a Cloudera Manager-managed cluster Training classes available: https://www.cloudera.com/more/training.html
  • 41. 41© Cloudera, Inc. All rights reserved. Nov 7, 2017 Tue ANA Intercontinental Hotel Estimated Attendees #: 1000 E-1: Apache Kudu on Analytical Data Platform Register Now! www.clouderaworldtokyo.com Cloudera World Tokyo 2017
  • 42. 42© Cloudera, Inc. All rights reserved. Thank you sho@cloudera.com