Introduction to Apache Kudu
Shravan (Sean) Pabba
Philadelphia Hadoop Meetup Talk - April 26th 2017
1.
© Cloudera, Inc. All rights reserved.
Intro to Apache Kudu
Hadoop storage for fast analytics on fast data
Shravan (Sean) Pabba | Systems Engineer, Cloudera | @skpabba
2.
Apache Kudu
Storage for fast (low latency) analytics on fast (high throughput) data
• Simplifies the architecture for building analytic applications on changing data
• Optimized for fast analytic performance
• Natively integrated with the Hadoop ecosystem of components
[Diagram: Hadoop platform stack. Data integration and storage: filesystem (HDFS), NoSQL (HBase), columnar store (Kudu), ingest (Sqoop, Flume, Kafka). Security: Sentry. Resource management: YARN. Unified data services: batch (Spark, Hive, Pig), stream (Spark), SQL (Impala), search (Solr), model (Spark), online (HBase), grouped under data engineering, data discovery & analytics, and data apps.]
3.
Why Kudu?
4.
Previous Hadoop storage landscape
HDFS (GFS) excels at:
• Batch ingest only (e.g. hourly)
• Efficiently scanning large amounts of data (analytics)
HBase (BigTable) excels at:
• Efficiently finding and writing individual rows
• Making data mutable
Gaps exist when these properties are needed simultaneously
5.
Kudu design goals
• High throughput for big scans
  • Goal: Within 2x of Parquet
• Low latency for short accesses
  • Goal: 1ms read/write on SSD
• Database-like semantics
  • Initially, single-row atomicity
• Relational data model
  • SQL queries should be natural and easy
  • Include NoSQL-style scan, insert, and update APIs
6.
Changing hardware landscape
• Spinning disk -> solid state storage
  • NAND Flash: up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • Intel Optane/3D XPoint memory (1000x faster than Flash, cheaper than RAM)
• RAM is cheaper and more abundant:
  • 64->128->256GB over last few years
• Takeaway: The next performance bottleneck is CPU, and current storage systems weren't designed with CPU efficiency in mind
7.
Apache Kudu: Scalable and fast structured storage
Scalable
• Tested up to 400+ nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Tables
• Represents data in structured tables like a normal database
• Individual record-level access to 100+ billion row tables
8.
Storing records in Kudu tables
• A Kudu table has a SQL-like schema
  • And a finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java, Python, and C++ NoSQL-style APIs
  • Insert(), Update(), Delete(), Scan()
• SQL via integrations with Impala and Spark
• Community work in progress / experimental: Drill, Hive
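To make the data model above concrete, here is a minimal in-memory sketch of a table with a fixed schema and a composite primary key, exposing insert, update, delete and scan operations. It is not the Kudu client API: the class ToyTable and all of its method names are hypothetical, and the real clients talk to masters and tablet servers rather than a Python dict.

```python
from typing import Callable, Dict, Iterator, Optional, Sequence, Tuple

Row = Dict[str, object]


class ToyTable:
    """Hypothetical in-memory stand-in for a fixed-schema table with a primary key."""

    def __init__(self, columns: Sequence[str], key_columns: Sequence[str]) -> None:
        missing = set(key_columns) - set(columns)
        if missing:
            raise ValueError(f"key columns not in schema: {sorted(missing)}")
        self.columns = tuple(columns)
        self.key_columns = tuple(key_columns)
        self._rows: Dict[Tuple[object, ...], Row] = {}

    def _key(self, row: Row) -> Tuple[object, ...]:
        return tuple(row[c] for c in self.key_columns)

    def insert(self, row: Row) -> None:
        # The schema is finite and fixed: every row supplies exactly these columns.
        if set(row) != set(self.columns):
            raise ValueError("row must supply exactly the schema's columns")
        key = self._key(row)
        if key in self._rows:
            raise KeyError(f"primary key already exists: {key}")
        self._rows[key] = dict(row)

    def update(self, row: Row) -> None:
        # The row carries the full key plus whichever columns should change.
        unknown = set(row) - set(self.columns)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self._rows[self._key(row)].update(row)  # KeyError if the key is absent

    def delete(self, key: Sequence[object]) -> None:
        del self._rows[tuple(key)]

    def scan(self,
             predicate: Optional[Callable[[Row], bool]] = None,
             projection: Optional[Sequence[str]] = None) -> Iterator[Row]:
        # Rows come back in primary-key order, restricted to the projected columns.
        columns = tuple(projection) if projection else self.columns
        for key in sorted(self._rows):
            row = self._rows[key]
            if predicate is None or predicate(row):
                yield {c: row[c] for c in columns}


metrics = ToyTable(("host", "metric", "timestamp", "value"),
                   ("host", "metric", "timestamp"))
metrics.insert({"host": "a", "metric": "cpu", "timestamp": 1, "value": 0.5})
metrics.insert({"host": "a", "metric": "cpu", "timestamp": 2, "value": 0.7})
metrics.update({"host": "a", "metric": "cpu", "timestamp": 2, "value": 0.9})
metrics.delete(("a", "cpu", 1))
rows = list(metrics.scan(projection=("timestamp", "value")))
```

The point of the sketch is the contract, not the storage: a row is addressed by its full primary key, and a scan can carry both a predicate and a column projection.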
9.
Use cases
10.
Kudu use cases
Kudu is best for use cases requiring:
• Simultaneous combination of sequential and random reads and writes
• Minimal to zero data latencies
Time series
• Examples: Streaming market data; fraud detection & prevention; network monitoring
• Workload: Inserts, updates, scans, lookups
Online reporting / data warehousing
• Example: Operational data store (ODS)
• Workload: Inserts, updates, scans, lookups
11.
"Traditional" real-time analytics in Hadoop
Fraud detection in the real world = storage complexity
Considerations:
• How do I handle failure during this process?
• How often do I reorganize incoming streaming data into a format appropriate for reporting?
• When reporting, how do I see data that has not yet been reorganized?
• How do I ensure that important jobs aren't interrupted by maintenance?
[Diagram labels: Kafka; HBase (most recent partition); Parquet file (new partition); historical data; storage in HDFS; reporting request. Decision step: "Have we accumulated enough data?" If so, reorganize the HBase file into Parquet: wait for running operations to complete, then define a new Impala partition referencing the newly written Parquet file.]
12.
Real-time analytics in Hadoop with Kudu
Improvements:
• One system to operate
• No cron jobs or background processes
• Handle late arrivals or data corrections with ease
• New data available immediately for analytics or operations
[Diagram labels: incoming data (e.g. Kafka); storage in Kudu; historical and real-time data; reporting request.]
13.
Large Cable Company - Old Architecture
Source: https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56113
14.
Challenges
• Keeping data current meant rebuilding entire datasets or partitions by regenerating compressed CSV files and loading them into HDFS, which took several hours or days.
• Rebuild operations consumed cluster capacity, limiting availability to other teams in a shared cluster.
• No way to update a single row in the dataset without recreating the table or using a slower, more complicated integration with HBase.
Source: https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56113
15.
Large Cable Company - New Architecture
• Stores Tune Events in Kudu. Any data fixes are made directly in Kudu.
• Stores Metadata directly in Kudu. Any data fixes are made directly in Kudu.
• Spark Streaming updates Kudu on a real-time basis to support quick analytics.
• Spark Job reads the raw events, sessionizes them, and updates Kudu.
• BI tools like Zoomdata work directly with Impala or Kudu to enable analytics.
Source: https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56113
16.
Large Cable Company - New Architecture
Source: https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56113
17.
Kudu+Impala vs MPP DWH
Commonalities
✓ Fast analytic queries via SQL, including most commonly used modern features
✓ Ability to insert, update, and delete data
Differences
✓ Faster streaming inserts
✓ Improved Hadoop integration
  • JOIN between HDFS + Kudu tables, run on same cluster
  • Spark, Flume, other integrations
✗ Slower batch inserts
✗ No transactional data loading, multi-row transactions, or indexing
18.
How it works
Replication and fault tolerance
19.
Tables, tablets, tablet servers and masters
• Each table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5) with Raft consensus
  • Automatic fault tolerance
  • MTTR: ~5 seconds
• Tablet servers host tablets on local disk drives
• Master services metadata operations
  • Create/drop tables and tablets
  • Locate tablets
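The DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS clause above can be pictured as a function from a key column value to a bucket number, with each bucket backing one tablet. The sketch below assumes CRC32 as the hash purely for illustration; Kudu's internal hash function and key encoding are different, and hash_bucket is a hypothetical name.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 100  # matches "INTO 100 BUCKETS" on the slide


def hash_bucket(value: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a column value to a bucket. CRC32 is an assumed stand-in hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Route 10,000 consecutive timestamps and count the rows landing in each bucket.
timestamps = range(1442825158, 1442825158 + 10_000)
rows_per_bucket = Counter(hash_bucket(ts) for ts in timestamps)
```

Because consecutive timestamps hash to unrelated buckets, a stream of inserts with increasing timestamps is spread across tablets instead of all landing on the newest one.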
20.
How it works
Columnar storage
21.
Columnar storage
Tweet_id: {25059873, 22309487, 23059861, 23010982}
User_name: {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
22.
Columnar storage
Tweet_id (1GB): {25059873, 22309487, 23059861, 23010982}
User_name (2GB): {newsycbot, RideImpala, fastly, llvmorg}
Created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155}
text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
Only read 1 column
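A small sketch of why the query above only needs one column: when each column is stored as its own array, a filter on user_name never touches tweet_id, created_at or text. The values are the ones on the slide; the dict-of-lists layout and the function name count_where_equals are illustrative assumptions, not Kudu's on-disk format.

```python
from typing import Dict, List, Tuple

# One list per column, as in the slide's column-wise picture.
tweets_by_column: Dict[str, List[object]] = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
    "text": ["Visual exp…", "Introducing ..", "Missing July…", "LLVM 3.7…."],
}


def count_where_equals(columns: Dict[str, List[object]],
                       column_name: str,
                       wanted: object) -> Tuple[int, List[str]]:
    """Count matching rows and report which columns had to be read."""
    columns_read = [column_name]  # only the filtered column is ever accessed
    matches = sum(1 for value in columns[column_name] if value == wanted)
    return matches, columns_read


matches, columns_read = count_where_equals(tweets_by_column, "user_name", "newsycbot")
```

With the sizes shown on the slide, that is the difference between scanning the 2GB user_name column and scanning all 204GB of a row-oriented layout.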
23.
Columnar compression
Created_at: {1442825158, 1442826100, 1442827994, 1442828527}

Created_at      Diff(created_at)
1442825158      n/a
1442826100      942
1442827994      1894
1442828527      533
64 bits each    11 bits each

• Many columns can compress to a few bits per row!
• Especially:
  • Timestamps
  • Time series values
  • Low-cardinality strings
• Massive space savings and throughput increase!
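The arithmetic behind the table above can be checked with a short sketch: store the first timestamp in full, then only the differences between neighbours, and count how many bits the largest difference needs. The function names are hypothetical, and this is plain delta encoding for illustration rather than a description of Kudu's actual encoders.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, List[int]]:
    """Return the first value and the successive differences."""
    deltas = [later - earlier for earlier, later in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first: int, deltas: List[int]) -> List[int]:
    """Rebuild the original values by summing the differences back up."""
    values = [first]
    for delta in deltas:
        values.append(values[-1] + delta)
    return values


def bits_needed(deltas: List[int]) -> int:
    """Bits required for the largest delta (assumes non-negative deltas)."""
    return max(delta.bit_length() for delta in deltas)


created_at = [1442825158, 1442826100, 1442827994, 1442828527]
first, deltas = delta_encode(created_at)
width = bits_needed(deltas)
```

Here deltas is [942, 1894, 533], and the largest of them fits in 11 bits, which is where the slide's "64 bits each" versus "11 bits each" comparison comes from.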
24.
Representing time series in Kudu
25.
What is time series?
Data that can be usefully partitioned and queried based on time
Examples:
• Web user activity data (view and click data, tweets, likes)
• Machine metrics (CPU utilization, free memory, requests/sec)
• Patient data (blood pressure readings, weight changes over time)
• Financial data (stock transactions, price fluctuations)
26.
Kudu & time series data
Real-time data ingestion + fast scans = ideal platform for storing and querying time series data
• Support for many column encodings and compression schemes
  • Encodings: plain, dictionary, bitshuffle, run length, prefix
  • Compression: LZ4, gzip, bzip2
• Kudu supports a flexible range of partitioning schemes
  • Partition by time range, hash, or both
• Parallelizable scans
• Scale-out storage system
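Two of the encodings named above, dictionary and run length, are easy to sketch on a low-cardinality string column such as a host name. The code below is a conceptual illustration with hypothetical function names; it says nothing about how Kudu lays these encodings out on disk.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with a small integer code; return (dictionary, codes)."""
    code_of: Dict[str, int] = {}
    codes = [code_of.setdefault(value, len(code_of)) for value in values]
    # dicts preserve insertion order, so position i holds the string for code i
    return list(code_of), codes


def run_length_encode(codes: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs: List[List[int]] = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [(code, length) for code, length in runs]


def decode(dictionary: List[str], runs: List[Tuple[int, int]]) -> List[str]:
    """Invert both steps to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


hosts = ["web01", "web01", "web01", "web02", "web02", "web01"]
dictionary, codes = dictionary_encode(hosts)
runs = run_length_encode(codes)
```

Six strings become a two-entry dictionary plus three (code, length) pairs, which is the kind of saving the deck attributes to low-cardinality and repetitive columns.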
27.
Partitioning by time range + series hash
28.
Partitioning by time range + series hash (inserts)
Inserts are spread among all partitions of the time range
29.
Partitioning by time range + series hash (scans)
Big scans (across time intervals) can be parallelized across partitions
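The combined scheme on the last three slides can be sketched as a two-level lookup: the timestamp picks a range partition and a hash of the series name picks a bucket inside it. The month boundaries, the bucket count, the CRC32 hash and both function names are assumptions made for this illustration; Kudu's own hashing differs.

```python
import zlib
from bisect import bisect_right
from typing import List, Tuple

# Assumed monthly range bounds (Unix seconds, UTC): 2017-01-01 .. 2017-04-01.
RANGE_BOUNDS = [1483228800, 1485907200, 1488326400, 1491004800]
NUM_BUCKETS = 4


def partition_for(series: str, timestamp: int) -> Tuple[int, int]:
    """Return (range_index, hash_bucket) for one inserted row."""
    range_index = bisect_right(RANGE_BOUNDS, timestamp) - 1
    if not 0 <= range_index < len(RANGE_BOUNDS) - 1:
        raise ValueError("timestamp is outside every range partition")
    bucket = zlib.crc32(series.encode("utf-8")) % NUM_BUCKETS  # stand-in hash
    return range_index, bucket


def partitions_for_scan(start: int, end: int) -> List[Tuple[int, int]]:
    """Partitions touched by a scan over [start, end) with no series filter."""
    first = max(bisect_right(RANGE_BOUNDS, start) - 1, 0)
    last = min(bisect_right(RANGE_BOUNDS, end - 1) - 1, len(RANGE_BOUNDS) - 2)
    return [(r, b) for r in range(first, last + 1) for b in range(NUM_BUCKETS)]


# Mid-January to mid-February spans two range partitions of four buckets each.
scan_targets = partitions_for_scan(1484438400, 1487116800)
```

Inserts for a given moment fan out over the buckets of one range, while a scan across a time interval yields a list of independent partitions that can be read in parallel.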
30.
Dynamic partition management
• Allows for dropping and adding partitions on live tables
• Efficiently remove ranges of (typically old) data using an admin tool
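The idea of dropping and adding range partitions on a live table can be sketched as a sliding window: add a partition for the newest interval, then drop the oldest one so its rows disappear in a single metadata-level step rather than through row-by-row deletes. RangePartitionedTable and its methods are hypothetical names for this toy model, not a Kudu API.

```python
from typing import Dict, List, Tuple


class RangePartitionedTable:
    """Toy model: rows live inside half-open [lower, upper) range partitions."""

    def __init__(self) -> None:
        self._partitions: Dict[Tuple[int, int], List[Tuple[int, float]]] = {}

    def add_range_partition(self, lower: int, upper: int) -> None:
        if lower >= upper:
            raise ValueError("lower bound must be below upper bound")
        for existing_lower, existing_upper in self._partitions:
            if lower < existing_upper and existing_lower < upper:
                raise ValueError("new partition overlaps an existing one")
        self._partitions[(lower, upper)] = []

    def drop_range_partition(self, lower: int, upper: int) -> int:
        """Remove a whole partition at once; return how many rows went with it."""
        return len(self._partitions.pop((lower, upper)))

    def insert(self, timestamp: int, value: float) -> None:
        for (lower, upper), rows in self._partitions.items():
            if lower <= timestamp < upper:
                rows.append((timestamp, value))
                return
        raise ValueError("no range partition covers this timestamp")

    def row_count(self) -> int:
        return sum(len(rows) for rows in self._partitions.values())


table = RangePartitionedTable()
table.add_range_partition(0, 100)
table.add_range_partition(100, 200)
table.insert(10, 1.0)
table.insert(20, 2.0)
table.insert(150, 3.0)
table.add_range_partition(200, 300)          # make room for new data
dropped = table.drop_range_partition(0, 100)  # retire the oldest range
```

After the window slides, the two rows from the oldest range are gone and the table still accepts writes for the newly added range.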
31.
Integrations
32.
Impala integration
• CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM …
• INSERT / UPDATE / DELETE
• Optimizations: predicate pushdown, scan locality, scan parallelism
• More optimizations on the way
• Not an Impala user? Community working on other integrations (Hive, Drill, Presto, etc)
33.
Spark DataSource integration

    // Import the Kudu data source
    // (older pre-1.0 releases used org.kududb.spark.kudu._)
    import org.apache.kudu.spark.kudu._

    val kuduDataFrame = sqlContext.read.options(
      Map("kudu.master" -> "master.address.example.com",
          "kudu.table" -> "my_table_name")).kudu

    // Then query using the Spark API, or register a temporary table and use Spark SQL.
    // The filter is passed as a SQL expression string so it is evaluated per row.
    kuduDataFrame.select("id").filter("id >= 5").show()
    // (prints the selection to the console)

    // Register kuduDataFrame as a temporary table for spark-sql
    kuduDataFrame.registerTempTable("kudu_table")

    // Select from the dataframe
    sqlContext.sql("select id from kudu_table where id >= 5").show()
    // (prints the sql results to the console)
34.
MapReduce integration
• Multi-framework cluster (MR + HDFS + Kudu on the same disks)
• KuduTableInputFormat / KuduTableOutputFormat
  • Support for pushing down predicates, column projections, etc.
• Lots of Kudu integration / correctness testing done via MapReduce
35.
Flume integration
• Basic Flume sink, similar to the Flume HBaseSink
• Write a simple EventProducer plugin to transform from your event format to Kudu Insert objects
• Then deploy with a Flume config file like the following (Flume property keys use the plural "sinks"; older pre-1.0 Kudu releases used the org.kududb package):

    agent.sinks.kudu.type = org.apache.kudu.flume.sink.KuduSink
    agent.sinks.kudu.masterAddresses = kudu01.example.com
    agent.sinks.kudu.tableName = my-table
    agent.sinks.kudu.producer = MyEventProducer
36.
Performance
37.
TPC-H (analytics benchmark)
• 75 server cluster
  • 12 (spinning) disks each, enough RAM to fit dataset
• TPC-H Scale Factor 100 (100GB)
• Example SQL query (via Impala):

    SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
    FROM customer, orders, lineitem, supplier, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_suppkey = s_suppkey
      AND c_nationkey = s_nationkey
      AND s_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
      AND o_orderdate >= date '1994-01-01'
      AND o_orderdate < '1995-01-01'
    GROUP BY n_name
    ORDER BY revenue desc;
38.
TPC-H results: Kudu vs Parquet
• Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
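For readers unfamiliar with the "geometric mean" qualifier above, the sketch below shows how a geometric-mean speedup is computed from per-query runtimes. The runtimes are invented placeholders chosen to make the arithmetic easy to follow; they are not measurements from this talk and do not reproduce the 31% figure.

```python
import math
from typing import List


def geometric_mean(values: List[float]) -> float:
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-query runtimes in seconds (placeholders, NOT benchmark results).
parquet_seconds = [4.0, 9.0, 2.0]
kudu_seconds = [2.0, 9.0, 4.0]

# A ratio above 1.0 means the Kudu run was faster for that query.
speedups = [p / k for p, k in zip(parquet_seconds, kudu_seconds)]

geo_mean_speedup = geometric_mean(speedups)
arith_mean_speedup = sum(speedups) / len(speedups)
```

In this made-up example one query is twice as fast and another twice as slow, so the geometric mean reports no net speedup while the arithmetic mean would overstate it; that symmetry is why benchmark summaries usually quote the geometric mean.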
39.
TPC-H results: Kudu vs other NoSQL storage
Apache Phoenix: OLTP SQL engine built on HBase
• 10 node cluster (9 workers, 1 master)
• TPC-H LINEITEM table only (6B rows)
40.
What about NoSQL-style random access? (YCSB)
• YCSB 0.5.0-snapshot
• 10 node cluster (9 workers, 1 master)
• 100M row data set
• 10M operations each workload
41.
Getting started with Kudu
42.
Getting started as a user
• On the web: kudu.apache.org
• User mailing list: user@kudu.apache.org
• Slack chat channel (see web site)
• Quickstart VM
  • Easiest way to get started
  • Impala and Kudu in an easy-to-install VM
• CSD and Parcels
  • For installation on a Cloudera Manager-managed cluster
43.
Getting started as a developer
• Source code: github.com/apache/kudu
  • All commits go here first
• Code reviews: gerrit.cloudera.org
  • All code reviews are public
• Public JIRA: issues.apache.org/jira/browse/KUDU
  • Includes bugs going back to 2013
• Developer mailing list: dev@kudu.apache.org
• Apache 2.0 license open source and an ASF project
• Contributions welcome and encouraged!
44.
Project status
• First open source beta released in September 2015.
• Kudu 1.0.0 released in September 2016.
• Kudu 1.3.1 was released last week.
  • Kerberos authentication, TLS encryption, and coarse-grained (cluster-level) authorization
• Many production customers
  • Users testing up to 400+ nodes so far.
• Kudu is a top-level project (TLP) at the Apache Software Foundation
  • Community-driven open source process.
45.
Apache Kudu Community
46.
kudu.apache.org
@ApacheKudu