SlideShare a Scribd company logo
1© Cloudera, Inc. All rights reserved.
Apache Kudu & Apache Spark SQL
for Fast Analytics on Fast Data
Mike Percy
Software Engineer at Cloudera
Apache Kudu PMC member
2© Cloudera, Inc. All rights reserved.
Kudu Overview
3© Cloudera, Inc. All rights reserved.
HDFS
Fast Scans, Analytics
and Processing of
Static Data
Fast On-Line
Updates &
Data Serving
Arbitrary Storage
(Active Archive)
Fast Analytics
(on fast-changing or
frequently-updated data)
Traditional Hadoop Storage Leaves a Gap
Use cases that fall between HDFS and HBase were difficult to manage
Unchanging
Fast Changing
Frequent Updates
HBase
Append-Only
Real-Time
Complex Hybrid
Architectures
Analytic
Gap
Pace of Analysis
PaceofData
4© Cloudera, Inc. All rights reserved.
HDFS
Fast Scans, Analytics
and Processing of
Static Data
Fast On-Line
Updates &
Data Serving
Arbitrary Storage
(Active Archive)
Fast Analytics
(on fast-changing or
frequently-updated data)
Kudu: Fast Analytics on Fast-Changing Data
New storage engine enables new Hadoop use cases
Unchanging
Fast Changing
Frequent Updates
HBase
Append-Only
Real-Time
Kudu Kudu fills the Gap
Modern analytic
applications often
require complex data
flow & difficult
integration work to
move data between
HBase & HDFS
Analytic
Gap
Pace of Analysis
PaceofData
5© Cloudera, Inc. All rights reserved.
Apache Kudu: Scalable and fast tabular storage
Tabular
• Represents data in structured tables like a relational database
• Strict schema, finite column count, no BLOBs
• Individual record-level access to 100+ billion row tables
Scalable
• Tested up to 275 nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
6© Cloudera, Inc. All rights reserved.
Storing records in Kudu tables
• A Kudu table has a SQL-like schema
• And a finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY,
TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java, Python, and C++ NoSQL-style APIs
• Insert(), Update(), Delete(), Scan()
• SQL via integrations with Spark and Impala
• Community work in progress / experimental: Drill, Hive
7© Cloudera, Inc. All rights reserved.
Kudu SQL access
- Kudu itself is just storage and native “NoSQL” APIs
- SQL access via integrations with Spark, Impala, etc.
8© Cloudera, Inc. All rights reserved.
Kudu “NoSQL” APIs - Writes
KuduTable table = client.openTable(“my_table”);
KuduSession session = client.newSession();
Insert ins = table.newInsert();
ins.getRow().addString(“host”, “foo.example.com”);
ins.getRow().addString(“metric”, “load-avg.1sec”);
ins.getRow().addDouble(“value”, 0.05);
session.apply(ins);
session.flush();
9© Cloudera, Inc. All rights reserved.
Kudu “NoSQL” APIs - Reads
KuduScanner scanner = client.newScannerBuilder(table)
.setProjectedColumnNames(List.of(“value”))
.build();
while (scanner.hasMoreRows()) {
RowResultIterator batch = scanner.nextRows();
while (batch.hasNext()) {
RowResult result = batch.next();
System.out.println(result.getDouble(“value”));
}
}
10© Cloudera, Inc. All rights reserved.
Kudu “NoSQL” APIs - Predicates
KuduScanner scanner = client.newScannerBuilder(table)
.addPredicate(KuduPredicate.newComparisonPredicate(
table.getSchema().getColumn(“timestamp”),
ComparisonOp.GREATER,
System.currentTimeMillis() / 1000 + 60))
.build();
Note: Kudu can evaluate simple predicates, but no aggregations, complex
expressions, UDFs, etc.
11© Cloudera, Inc. All rights reserved.
Tables and tablets
• Each table is horizontally partitioned into tablets
• Range or hash partitioning
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY
HASH(timestamp) INTO 100 BUCKETS
• Translation: bucketNumber = hashCode(row[‘timestamp’]) % 100
• Each tablet has N replicas (3 or 5), kept consistent with Raft consensus
• Tablet servers host tablets on local disk drives
12© Cloudera, Inc. All rights reserved.
Fault tolerance
• Operations replicated using Raft consensus
• Strict quorum algorithm. See Raft paper for details
• Transient failures:
• Follower failure: Leader can still achieve majority
• Leader failure: automatic leader election (~5 seconds)
• Restart dead TS within 5 min and it will rejoin transparently
• Permanent failures
• After 5 minutes, automatically creates a new follower replica and copies data
• N replicas can tolerate up to (N-1)/2 failures
13© Cloudera, Inc. All rights reserved.
Metadata
• Replicated master
• Acts as a tablet directory
• Acts as a catalog (which tables exist, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them
14© Cloudera, Inc. All rights reserved.
Integrations
Kudu is designed for integrating with higher-level compute frameworks
Integrations exist for:
• Spark
• Impala
• MapReduce
• Flume
• Drill
15© Cloudera, Inc. All rights reserved.
What Kudu brings to Spark
• Parquet-like scan performance with 0-delay inserts and updates
• Push down predicate filters for fast & efficient scans
• Primary key indexing for fast point lookups (compared to Parquet)
16© Cloudera, Inc. All rights reserved.
Spark DataSource
17© Cloudera, Inc. All rights reserved.
Spark DataFrame/DataSource integration
// spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0
// Import kudu datasource
import org.kududb.spark.kudu._
val kuduDataFrame = sqlContext.read.options(
Map("kudu.master" -> "master1,master2,master3",
"kudu.table" -> "my_table_name")).kudu
// Then query using Spark data frame API
kuduDataFrame.select("id").filter("id" >= 5).show()
// (prints the selection to the console)
// Or register kuduDataFrame as a table and use Spark SQL
kuduDataFrame.registerTempTable("my_table")
sqlContext.sql("select id from my_table where id >= 5").show()
// (prints the sql results to the console)
18© Cloudera, Inc. All rights reserved.
Quick demo
19© Cloudera, Inc. All rights reserved.
Writing from Spark
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")
// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
new CreateTableOptions().setNumReplicas(1))
// Insert, delete, upsert, or update data
kuduContext.insertRows(df, "test_table")
kuduContext.deleteRows(sqlContext.sql("select id from kudu_table where id >= 5"),
"kudu_table")
kuduContext.upsertRows(df, "test_table")
kuduContext.updateRows(df.select(“id”, $”count” + 1, "test_table")
20© Cloudera, Inc. All rights reserved.
Spark DataSource optimizations
Column projection and predicate pushdown
- Only read the referenced columns
- Convert ‘WHERE’ clauses into Kudu predicates
- Kudu predicates automatically convert to primary key scans, etc
scala> sqlContext.sql("select avg(value) from metrics where host =
'e1103.halxg.cloudera.com'").explain
== Physical Plan ==
TungstenAggregate(key=[], functions=[(avg(value#3),mode=Final,isDistinct=false)], output=[_c0#94])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(avg(value#3),mode=Partial,isDistinct=false)],
output=[sum#98,count#99L])
+- Project [value#3]
+- Scan org.apache.kudu.spark.kudu.KuduRelation@e13cc49[value#3]
PushedFilters: [EqualTo(host,e1103.halxg.cloudera.com)]
21© Cloudera, Inc. All rights reserved.
Spark DataSource optimizations
Partition pruning
scala> df.where("host like 'foo%'").rdd.partitions.length
res1: Int = 20
scala> df.where("host = 'foo'").rdd.partitions.length
res2: Int = 1
22© Cloudera, Inc. All rights reserved.
Use cases
23© Cloudera, Inc. All rights reserved.
Kudu use cases
Kudu is best for use cases requiring:
• Simultaneous combination of sequential and random reads and writes
• Minimal to zero data latencies
Time series
• Examples: Streaming market data; fraud detection & prevention; network monitoring
• Workload: Inserts, updates, scans, lookups
Online reporting / data warehousing
• Example: Operational Data Store (ODS)
• Workload: Inserts, updates, scans, lookups
24© Cloudera, Inc. All rights reserved.
Xiaomi use case
• World’s 4th
largest smart-phone maker (most popular in China)
• Gather important RPC tracing events from mobile app and backend service.
• Service monitoring & troubleshooting tool.
High write throughput
• >20 Billion records/day and growing
Query latest data and quick response
• Identify and resolve issues quickly
Can search for individual records
• Easy for troubleshooting
25© Cloudera, Inc. All rights reserved.
Xiaomi big data analytics pipeline
Before Kudu
Long pipeline
• High data latency
(approx 1 hour – 1 day)
• Data conversion pains
No ordering
• Log arrival (storage) order is
not exactly logical order
• Must read 2 – 3 days of data
to get all of the data points
for a single day
26© Cloudera, Inc. All rights reserved.
Xiaomi big data analytics pipeline
Simplified with Kafka and Kudu
ETL pipeline
• 0 – 10s data latency
• Apps that need to avoid
backpressure or need ETL
Direct pipeline (no latency)
• Apps that don’t require ETL or
backpressure handling
OLAP scan
Side table lookup
Result store
27© Cloudera, Inc. All rights reserved.
Real-time analytics in Hadoop with Kudu
Improvements:
• One system to operate
• No cron jobs or background
processes
• Handle late arrivals or data
corrections with ease
• New data available
immediately for analytics or
operations
Historical and Real-time
Data
Incoming data
(e.g. Kafka)
Reporting
Request
Storage in Kudu
28© Cloudera, Inc. All rights reserved.
Kudu+Impala vs MPP DWH
Commonalities
✓ Fast analytic queries via SQL, including most commonly used modern features
✓ Ability to insert, update, and delete data
Differences
✓ Faster streaming inserts
✓ Improved Hadoop integration
• JOIN between HDFS + Kudu tables, run on same cluster
• Spark, Flume, other integrations
✗ Slower batch inserts
✗ No transactional data loading, multi-row transactions, or indexing
29© Cloudera, Inc. All rights reserved.
Columnar storage
{25059873,
22309487,
23059861,
23010982}
Tweet_id
{newsycbot,
RideImpala,
fastly,
llvmorg}
User_name
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
30© Cloudera, Inc. All rights reserved.
Columnar storage
{25059873,
22309487,
23059861,
23010982}
Tweet_id
{newsycbot,
RideImpala,
fastly,
llvmorg}
User_name
{1442865158,
1442828307,
1442865156,
1442865155}
Created_at
{Visual exp…,
Introducing ..,
Missing July…,
LLVM 3.7….}
text
SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’;
Only read 1 column
1GB 2GB 1GB 200GB
31© Cloudera, Inc. All rights reserved.
Columnar compression
• Many columns can compress to a few bits per row!
• Especially:
• Timestamps
• Time series values
• Low-cardinality strings
• Massive space savings and throughput increase!
32© Cloudera, Inc. All rights reserved.
Kudu Roadmap
33© Cloudera, Inc. All rights reserved.
Open Source “Roadmaps”?
- Kudu is an open source ASF project
- ASF governance means there is no guaranteed roadmap
- Whatever people contribute is the roadmap!
- But I can speak to what my team will be focusing on
- Disclaimer: quality-first mantra = fuzzy timeline commitments
34© Cloudera, Inc. All rights reserved.
Security Roadmap
1) Kerberos authentication
a) Client-server mutual authentication
a) Server-server mutual authentication
b) Execution framework-server authentication (delegation tokens)
2) Extra-coarse-grained authorization
a) Likely a cluster-wide “allowed user list”
3) Group/role mapping
a) LDAP/Unix/etc
4) Data exposure hardening
a) e.g. ensure that web UIs dont leak data
5) Fine-grained authorization
a) Table/database/column level
35© Cloudera, Inc. All rights reserved.
Operability
1) Stability
a) Continued stress testing, fault injection, etc
b) Faster and safer recovery after failures
2) Recovery tools
a) Repair from minority (eg if two hosts explode simultaneously)
b) Replace from empty (eg if three hosts explode simultaneously)
c) Repair file system state after power outage
3) Easier problem diagnosis
a) Client “timeout” errors
b) Easier performance issue diagnosis
36© Cloudera, Inc. All rights reserved.
Performance and scale
- Read performance
- Dynamic predicates (aka runtime filters)
- Spark statistics
- Additional filter pushdown (e.g. “IN (...)”, “LIKE”)
- I/O scheduling from spinning disks
- Write performance
- Improved bulk load capability
- Scalability
- Users planning to run 400 node clusters
- Rack-aware placement
37© Cloudera, Inc. All rights reserved.
Client improvements roadmap
- Python
- Full feature parity with C++ and Java
- Even more pythonic
- Integrations: Pandas, PySpark
- All clients:
- More API documentation, tutorials, examples
- Better error messages/exceptions
- Full support for snapshot consistency level
38© Cloudera, Inc. All rights reserved.
Performance
- Significant improvements to compaction throughput
- Default configurations tuned for much less write amplification
- Reduced lock contention on RPC system, block cache
- 2-3x improvement for selective filters on dictionary-encoded columns
- Other speedups:
- Startup time 2-3x better (more work coming)
- First scan following restart ~2x faster
- More compact and compressed internal index storage
39© Cloudera, Inc. All rights reserved.
Performance (YCSB)
Single node micro-benchmark, 500M record insert, 16 client threads, each record ~110 bytes
Runs Load, followed by ‘C’ (10min), followed by ‘A’ (10min)
175Kop/s 422Kop/s 680 op/s
12K op/s
6K op/s 18K op/s
40© Cloudera, Inc. All rights reserved.
TPC-H (analytics benchmark)
• 75 server cluster
• 12 (spinning) disks each, enough RAM to fit dataset
• TPC-H Scale Factor 100 (100GB)
• Example query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders,
lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND
n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND
o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc;
41© Cloudera, Inc. All rights reserved.
• Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
42© Cloudera, Inc. All rights reserved.
Versus other NoSQL storage
• Apache Phoenix: OLTP SQL engine built on HBase
• 10 node cluster (9 worker, 1 master)
• TPC-H LINEITEM table only (6B rows)
43© Cloudera, Inc. All rights reserved.
Joining the growing community
44© Cloudera, Inc. All rights reserved.
Apache Kudu Community
45© Cloudera, Inc. All rights reserved.
Getting started as a user
• On the web: kudu.apache.org
• User mailing list: user@kudu.apache.org
• Public Slack chat channel (see web site for the link)
• Quickstart VM
• Easiest way to get started
• Impala and Kudu in an easy-to-install VM
• CSD and Parcels
• For installation on a Cloudera Manager-managed cluster
46© Cloudera, Inc. All rights reserved.
Getting started as a developer
• Source code: github.com/apache/kudu - all commits go here first
• Code reviews: gerrit.cloudera.org - all code reviews are public
• Developer mailing list: dev@kudu.apache.org
• Public JIRA: issues.apache.org/jira/browse/KUDU - includes bug history since 2013
Contributions are welcome and strongly encouraged!
47© Cloudera, Inc. All rights reserved.
kudu.apache.org
@mike_percy | @ApacheKudu

More Related Content

What's hot

SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit
 
Apache kudu
Apache kuduApache kudu
Apache kudu
Asim Jalis
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
DataWorks Summit
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
Supriya Sahay
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
NiFi Best Practices for the Enterprise
NiFi Best Practices for the EnterpriseNiFi Best Practices for the Enterprise
NiFi Best Practices for the Enterprise
Gregory Keys
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
David Groozman
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
emreakis
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
DataWorks Summit
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
leanderlee2
 

What's hot (20)

SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
NiFi Best Practices for the Enterprise
NiFi Best Practices for the EnterpriseNiFi Best Practices for the Enterprise
NiFi Best Practices for the Enterprise
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
 

Viewers also liked

Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
Spark Summit
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Don Drake
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
DataWorks Summit/Hadoop Summit
 
Spark + HBase
Spark + HBase Spark + HBase
Kudu demo
Kudu demoKudu demo
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architecture
ArangoDB Database
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
AiougVizagChapter
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesAlfredo Abate
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
NAYATech
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg Schad
Spark Summit
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
Bryan Yang
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
Kristian Alexander
 
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
Getting value from IoT, Integration and Data Analytics
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
Krishna Sankar
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
Krishna Sankar
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
scalaconfjp
 

Viewers also liked (20)

Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Spark + HBase
Spark + HBase Spark + HBase
Spark + HBase
 
Kudu demo
Kudu demoKudu demo
Kudu demo
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
Microservice-based software architecture
Microservice-based software architectureMicroservice-based software architecture
Microservice-based software architecture
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg Schad
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Spark etl
Spark etlSpark etl
Spark etl
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
 

Similar to Spark Summit EU talk by Mike Percy

Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
michaelguia
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
Felicia Haggarty
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
Cloudera, Inc.
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
SolidQ
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
AboutYouGmbH
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
Cloudera, Inc.
 

Similar to Spark Summit EU talk by Mike Percy (20)

Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator Optimizer
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 

Recently uploaded (20)

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 

Spark Summit EU talk by Mike Percy

  • 1. 1© Cloudera, Inc. All rights reserved. Apache Kudu & Apache Spark SQL for Fast Analytics on Fast Data Mike Percy Software Engineer at Cloudera Apache Kudu PMC member
  • 2. 2© Cloudera, Inc. All rights reserved. Kudu Overview
  • 3. 3© Cloudera, Inc. All rights reserved. HDFS Fast Scans, Analytics and Processing of Static Data Fast On-Line Updates & Data Serving Arbitrary Storage (Active Archive) Fast Analytics (on fast-changing or frequently-updated data) Traditional Hadoop Storage Leaves a Gap Use cases that fall between HDFS and HBase were difficult to manage Unchanging Fast Changing Frequent Updates HBase Append-Only Real-Time Complex Hybrid Architectures Analytic Gap Pace of Analysis PaceofData
  • 4. 4© Cloudera, Inc. All rights reserved. HDFS Fast Scans, Analytics and Processing of Static Data Fast On-Line Updates & Data Serving Arbitrary Storage (Active Archive) Fast Analytics (on fast-changing or frequently-updated data) Kudu: Fast Analytics on Fast-Changing Data New storage engine enables new Hadoop use cases Unchanging Fast Changing Frequent Updates HBase Append-Only Real-Time Kudu Kudu fills the Gap Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS Analytic Gap Pace of Analysis PaceofData
  • 5. 5© Cloudera, Inc. All rights reserved. Apache Kudu: Scalable and fast tabular storage Tabular • Represents data in structured tables like a relational database • Strict schema, finite column count, no BLOBs • Individual record-level access to 100+ billion row tables Scalable • Tested up to 275 nodes (~3PB cluster) • Designed to scale to 1000s of nodes and tens of PBs Fast • Millions of read/write operations per second across cluster • Multiple GB/second read throughput per node
  • 6. 6© Cloudera, Inc. All rights reserved. Storing records in Kudu tables • A Kudu table has a SQL-like schema • And a finite number of columns (unlike HBase/Cassandra) • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP • Some subset of columns makes up a possibly-composite primary key • Fast ALTER TABLE • Java, Python, and C++ NoSQL-style APIs • Insert(), Update(), Delete(), Scan() • SQL via integrations with Spark and Impala • Community work in progress / experimental: Drill, Hive
  • 7. 7© Cloudera, Inc. All rights reserved. Kudu SQL access - Kudu itself is just storage and native “NoSQL” APIs - SQL access via integrations with Spark, Impala, etc.
  • 8. 8© Cloudera, Inc. All rights reserved. Kudu “NoSQL” APIs - Writes KuduTable table = client.openTable(“my_table”); KuduSession session = client.newSession(); Insert ins = table.newInsert(); ins.getRow().addString(“host”, “foo.example.com”); ins.getRow().addString(“metric”, “load-avg.1sec”); ins.getRow().addDouble(“value”, 0.05); session.apply(ins); session.flush();
  • 9. 9© Cloudera, Inc. All rights reserved. Kudu “NoSQL” APIs - Reads KuduScanner scanner = client.newScannerBuilder(table) .setProjectedColumnNames(List.of(“value”)) .build(); while (scanner.hasMoreRows()) { RowResultIterator batch = scanner.nextRows(); while (batch.hasNext()) { RowResult result = batch.next(); System.out.println(result.getDouble(“value”)); } }
  • 10. 10© Cloudera, Inc. All rights reserved. Kudu “NoSQL” APIs - Predicates KuduScanner scanner = client.newScannerBuilder(table) .addPredicate(KuduPredicate.newComparisonPredicate( table.getSchema().getColumn(“timestamp”), ComparisonOp.GREATER, System.currentTimeMillis() / 1000 + 60)) .build(); Note: Kudu can evaluate simple predicates, but no aggregations, complex expressions, UDFs, etc.
  • 11. 11© Cloudera, Inc. All rights reserved. Tables and tablets • Each table is horizontally partitioned into tablets • Range or hash partitioning • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS • Translation: bucketNumber = hashCode(row[‘timestamp’]) % 100 • Each tablet has N replicas (3 or 5), kept consistent with Raft consensus • Tablet servers host tablets on local disk drives
  • 12. 12© Cloudera, Inc. All rights reserved. Fault tolerance • Operations replicated using Raft consensus • Strict quorum algorithm. See Raft paper for details • Transient failures: • Follower failure: Leader can still achieve majority • Leader failure: automatic leader election (~5 seconds) • Restart dead TS within 5 min and it will rejoin transparently • Permanent failures • After 5 minutes, automatically creates a new follower replica and copies data • N replicas can tolerate up to (N-1)/2 failures
  • 13. 13© Cloudera, Inc. All rights reserved. Metadata • Replicated master • Acts as a tablet directory • Acts as a catalog (which tables exist, etc) • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets) • Caches all metadata in RAM for high performance • Client configured with master addresses • Asks master for tablet locations as needed and caches them
  • 14. 14© Cloudera, Inc. All rights reserved. Integrations Kudu is designed for integrating with higher-level compute frameworks Integrations exist for: • Spark • Impala • MapReduce • Flume • Drill
  • 15. 15© Cloudera, Inc. All rights reserved. What Kudu brings to Spark • Parquet-like scan performance with 0-delay inserts and updates • Push down predicate filters for fast & efficient scans • Primary key indexing for fast point lookups (compared to Parquet)
  • 16. 16© Cloudera, Inc. All rights reserved. Spark DataSource
  • 17. 17© Cloudera, Inc. All rights reserved. Spark DataFrame/DataSource integration // spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0 // Import kudu datasource import org.kududb.spark.kudu._ val kuduDataFrame = sqlContext.read.options( Map("kudu.master" -> "master1,master2,master3", "kudu.table" -> "my_table_name")).kudu // Then query using Spark data frame API kuduDataFrame.select("id").filter("id" >= 5).show() // (prints the selection to the console) // Or register kuduDataFrame as a table and use Spark SQL kuduDataFrame.registerTempTable("my_table") sqlContext.sql("select id from my_table where id >= 5").show() // (prints the sql results to the console)
  • 18. 18© Cloudera, Inc. All rights reserved. Quick demo
  • 19. 19© Cloudera, Inc. All rights reserved. Writing from Spark // Use KuduContext to create, delete, or write to Kudu tables val kuduContext = new KuduContext("kudu.master:7051") // Create a new Kudu table from a dataframe schema // NB: No rows from the dataframe are inserted into the table kuduContext.createTable("test_table", df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1)) // Insert, delete, upsert, or update data kuduContext.insertRows(df, "test_table") kuduContext.deleteRows(sqlContext.sql("select id from kudu_table where id >= 5"), "kudu_table") kuduContext.upsertRows(df, "test_table") kuduContext.updateRows(df.select(“id”, $”count” + 1, "test_table")
  • 20. 20© Cloudera, Inc. All rights reserved. Spark DataSource optimizations Column projection and predicate pushdown - Only read the referenced columns - Convert ‘WHERE’ clauses into Kudu predicates - Kudu predicates automatically convert to primary key scans, etc scala> sqlContext.sql("select avg(value) from metrics where host = 'e1103.halxg.cloudera.com'").explain == Physical Plan == TungstenAggregate(key=[], functions=[(avg(value#3),mode=Final,isDistinct=false)], output=[_c0#94]) +- TungstenExchange SinglePartition, None +- TungstenAggregate(key=[], functions=[(avg(value#3),mode=Partial,isDistinct=false)], output=[sum#98,count#99L]) +- Project [value#3] +- Scan org.apache.kudu.spark.kudu.KuduRelation@e13cc49[value#3] PushedFilters: [EqualTo(host,e1103.halxg.cloudera.com)]
  • 21. 21© Cloudera, Inc. All rights reserved. Spark DataSource optimizations Partition pruning scala> df.where("host like 'foo%'").rdd.partitions.length res1: Int = 20 scala> df.where("host = 'foo'").rdd.partitions.length res2: Int = 1
  • 22. 22© Cloudera, Inc. All rights reserved. Use cases
  • 23. 23© Cloudera, Inc. All rights reserved. Kudu use cases Kudu is best for use cases requiring: • Simultaneous combination of sequential and random reads and writes • Minimal to zero data latencies Time series • Examples: Streaming market data; fraud detection & prevention; network monitoring • Workload: Inserts, updates, scans, lookups Online reporting / data warehousing • Example: Operational Data Store (ODS) • Workload: Inserts, updates, scans, lookups
  • 24. 24© Cloudera, Inc. All rights reserved. Xiaomi use case • World’s 4th largest smart-phone maker (most popular in China) • Gather important RPC tracing events from mobile app and backend service. • Service monitoring & troubleshooting tool. High write throughput • >20 Billion records/day and growing Query latest data and quick response • Identify and resolve issues quickly Can search for individual records • Easy for troubleshooting
  • 25. 25© Cloudera, Inc. All rights reserved. Xiaomi big data analytics pipeline Before Kudu Long pipeline • High data latency (approx 1 hour – 1 day) • Data conversion pains No ordering • Log arrival (storage) order is not exactly logical order • Must read 2 – 3 days of data to get all of the data points for a single day
  • 26. 26© Cloudera, Inc. All rights reserved. Xiaomi big data analytics pipeline Simplified with Kafka and Kudu ETL pipeline • 0 – 10s data latency • Apps that need to avoid backpressure or need ETL Direct pipeline (no latency) • Apps that don’t require ETL or backpressure handling OLAP scan Side table lookup Result store
  • 27. 27© Cloudera, Inc. All rights reserved. Real-time analytics in Hadoop with Kudu Improvements: • One system to operate • No cron jobs or background processes • Handle late arrivals or data corrections with ease • New data available immediately for analytics or operations Historical and Real-time Data Incoming data (e.g. Kafka) Reporting Request Storage in Kudu
  • 28. 28© Cloudera, Inc. All rights reserved. Kudu+Impala vs MPP DWH Commonalities ✓ Fast analytic queries via SQL, including most commonly used modern features ✓ Ability to insert, update, and delete data Differences ✓ Faster streaming inserts ✓ Improved Hadoop integration • JOIN between HDFS + Kudu tables, run on same cluster • Spark, Flume, other integrations ✗ Slower batch inserts ✗ No transactional data loading, multi-row transactions, or indexing
  • 29. 29© Cloudera, Inc. All rights reserved. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text
  • 30. 30© Cloudera, Inc. All rights reserved. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’; Only read 1 column 1GB 2GB 1GB 200GB
  • 31. 31© Cloudera, Inc. All rights reserved. Columnar compression • Many columns can compress to a few bits per row! • Especially: • Timestamps • Time series values • Low-cardinality strings • Massive space savings and throughput increase!
  • 32. 32© Cloudera, Inc. All rights reserved. Kudu Roadmap
  • 33. 33© Cloudera, Inc. All rights reserved. Open Source “Roadmaps”? - Kudu is an open source ASF project - ASF governance means there is no guaranteed roadmap - Whatever people contribute is the roadmap! - But I can speak to what my team will be focusing on - Disclaimer: quality-first mantra = fuzzy timeline commitments
  • 34. 34© Cloudera, Inc. All rights reserved. Security Roadmap 1) Kerberos authentication a) Client-server mutual authentication a) Server-server mutual authentication b) Execution framework-server authentication (delegation tokens) 2) Extra-coarse-grained authorization a) Likely a cluster-wide “allowed user list” 3) Group/role mapping a) LDAP/Unix/etc 4) Data exposure hardening a) e.g. ensure that web UIs dont leak data 5) Fine-grained authorization a) Table/database/column level
  • 35. 35© Cloudera, Inc. All rights reserved. Operability 1) Stability a) Continued stress testing, fault injection, etc b) Faster and safer recovery after failures 2) Recovery tools a) Repair from minority (eg if two hosts explode simultaneously) b) Replace from empty (eg if three hosts explode simultaneously) c) Repair file system state after power outage 3) Easier problem diagnosis a) Client “timeout” errors b) Easier performance issue diagnosis
  • 36. 36© Cloudera, Inc. All rights reserved. Performance and scale - Read performance - Dynamic predicates (aka runtime filters) - Spark statistics - Additional filter pushdown (e.g. “IN (...)”, “LIKE”) - I/O scheduling from spinning disks - Write performance - Improved bulk load capability - Scalability - Users planning to run 400 node clusters - Rack-aware placement
  • 37. 37© Cloudera, Inc. All rights reserved. Client improvements roadmap - Python - Full feature parity with C++ and Java - Even more pythonic - Integrations: Pandas, PySpark - All clients: - More API documentation, tutorials, examples - Better error messages/exceptions - Full support for snapshot consistency level
  • 38. 38© Cloudera, Inc. All rights reserved. Performance - Significant improvements to compaction throughput - Default configurations tuned for much less write amplification - Reduced lock contention on RPC system, block cache - 2-3x improvement for selective filters on dictionary-encoded columns - Other speedups: - Startup time 2-3x better (more work coming) - First scan following restart ~2x faster - More compact and compressed internal index storage
  • 39. 39© Cloudera, Inc. All rights reserved. Performance (YCSB) Single node micro-benchmark, 500M record insert, 16 client threads, each record ~110 bytes Runs Load, followed by ‘C’ (10min), followed by ‘A’ (10min) 175Kop/s 422Kop/s 680 op/s 12K op/s 6K op/s 18K op/s
  • 40. 40© Cloudera, Inc. All rights reserved. TPC-H (analytics benchmark) • 75 server cluster • 12 (spinning) disks each, enough RAM to fit dataset • TPC-H Scale Factor 100 (100GB) • Example query: • SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc;
  • 41. 41© Cloudera, Inc. All rights reserved. • Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
  • 42. 42© Cloudera, Inc. All rights reserved. Versus other NoSQL storage • Apache Phoenix: OLTP SQL engine built on HBase • 10 node cluster (9 worker, 1 master) • TPC-H LINEITEM table only (6B rows)
  • 43. 43© Cloudera, Inc. All rights reserved. Joining the growing community
  • 44. 44© Cloudera, Inc. All rights reserved. Apache Kudu Community
  • 45. 45© Cloudera, Inc. All rights reserved. Getting started as a user • On the web: kudu.apache.org • User mailing list: user@kudu.apache.org • Public Slack chat channel (see web site for the link) • Quickstart VM • Easiest way to get started • Impala and Kudu in an easy-to-install VM • CSD and Parcels • For installation on a Cloudera Manager-managed cluster
  • 46. 46© Cloudera, Inc. All rights reserved. Getting started as a developer • Source code: github.com/apache/kudu - all commits go here first • Code reviews: gerrit.cloudera.org - all code reviews are public • Developer mailing list: dev@kudu.apache.org • Public JIRA: issues.apache.org/jira/browse/KUDU - includes bug history since 2013 Contributions are welcome and strongly encouraged!
  • 47. 47© Cloudera, Inc. All rights reserved. kudu.apache.org @mike_percy | @ApacheKudu