FasT Analytics
Alex Pongpech
What the heck is it?
• Fast Analytics (FA) is about delivering analytics at decision-making speeds.
10/29/2018 2
https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
Quickly need to know
Why oh why?
• Life is a period of continuous time (until it ends); seriously, life cannot wait for you to make your decision! ("I will take my money to another service provider if I have to wait too long.")
• The clock is ticking and the information is flowing.
https://targetdatacorp.com/customer-data/
That’s Every Minute, Y’all!
https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
Why oh WHY?
• What if your life depends on it?
• Drug Discovery
• Precision Medicine
• Point of Care/Patient 360
• Insurance Fraud
Quick decisions based on your personal data
might save your life!
How?
• By processing high-velocity, high-volume Big Data in real time through the use of an Enterprise Service Bus (ESB), decision-makers gain an immediate understanding of new trends and customer/market shifts as they occur.
http://www.sovtex.ru/en/enterprise-service-bus-esb/
Real-time Analytics
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
BUT!
Kudu vs HDFS and HBase
With Kudu
• Apache Hive was one of the first SQL-like query interfaces developed over distributed data on top of Hadoop. Hive converts queries to Hadoop MapReduce jobs.
• Apache Impala uses its own parallel processing architecture on top of HDFS instead of MapReduce jobs. Unlike Hive, Impala never translates its SQL queries into MapReduce jobs but executes them natively. Kudu and Impala are best used together.
• Apache Spark is a cluster computing technology. It is not strictly dependent on Hadoop because it has its own cluster management. However, Spark is usually deployed on top of Hadoop, which takes care of distributed data storage. Spark SQL is a Spark component on top of Spark Core that provides a way of querying and persisting structured and semi-structured data.
Apache KUDU
Kudu
• Kudus are a kind of antelope
• Found in eastern and southern Africa
THIS IS NOT Apache Kudu
This is Apache Kudu
Motivation
• Reducing architectural complexity
• Performance (for table-based operations)
• Reliability across globally-distributed data centers
What is and is not
• Apache Kudu is an open source columnar storage engine. It
promises low latency random access and efficient execution of
analytical queries.
What is and is not
• Apache Kudu is not really a SQL interface for Hadoop but a very well
optimized columnar database designed to fit in with the Hadoop
ecosystem. It has been integrated to work with Impala, MapReduce and
Spark, and additional framework integrations are expected. The idea is
that it can provide very fast scan performance.
• Apache Kudu is a “storage engine” or perhaps a “database” project that is
delivered upon a Non-HDFS based filesystem. This underlying storage
format could be considered to be competitive with file formats like
parquet.
• Note that Kudu is NOT compatible with HDFS and is NOT truly
complementary to HDFS. It runs on a completely separate filesystem
from Hadoop, which enables Kudu to update data, something HDFS
cannot do.
Basic Design
• From a user perspective, Kudu is a storage system for
tables of structured data where
• Tables have a well-defined schema consisting of a predefined
number of typed columns.
• Each table has a primary key composed of one or more of its
columns.
• The primary key enforces a uniqueness constraint (no two rows
can share the same key) and acts as an index for efficient updates
and deletes.
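As a rough illustration (not Kudu code, and the column names and data are invented for the example), a table whose primary key enforces uniqueness and indexes updates and deletes can be sketched with a plain dict keyed by the primary key:

```python
class SimpleTable:
    """Toy model of a Kudu-style table: a fixed schema plus a
    primary-key index that makes updates and deletes cheap lookups."""

    def __init__(self, schema, key_column):
        self.schema = schema          # predefined, typed columns
        self.key_column = key_column  # primary key column name
        self.rows = {}                # primary key -> row dict

    def upsert(self, row):
        key = row[self.key_column]
        # The key acts as a uniqueness constraint: a second insert
        # with the same key overwrites rather than duplicating.
        self.rows[key] = row

    def delete(self, key):
        # The key index locates the row directly, no scan needed.
        self.rows.pop(key, None)

# Hypothetical schema and data, for illustration only.
t = SimpleTable(schema={'id': int, 'metric': float}, key_column='id')
t.upsert({'id': 1, 'metric': 0.5})
t.upsert({'id': 1, 'metric': 0.9})  # same key: update, not a new row
assert len(t.rows) == 1 and t.rows[1]['metric'] == 0.9
```

This only mimics the uniqueness and indexing behavior described above; real Kudu tablets store the data in columnar form on disk.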
Basic Design
• Kudu tables are composed of
• a series of logical subsets of data, similar to partitions in
relational database systems, called Tablets.
• Kudu provides
• data durability and protection against hardware failure by
replicating these Tablets to multiple commodity hardware nodes
using the Raft consensus algorithm.
• Tablets are typically tens of gigabytes, and an individual node
typically holds 10-100 Tablets.
Kudu Tables and Tablets
Basic Design
• Tablet A tablet is a contiguous segment of a table, similar to a partition in
other data storage engines or relational databases. A given tablet is
replicated on multiple tablet servers, and at any given point in time, one
of these replicas is considered the leader tablet. Any replica can service
reads, and writes require consensus among the set of tablet servers
serving the tablet.
• Tablet Server A tablet server stores and serves tablets to clients. For a
given tablet, one tablet server acts as a leader, and the others act as
follower replicas of that tablet. Only leaders service write requests, while
leaders or followers each service read requests. Leaders are elected using
Raft Consensus Algorithm. One tablet server can serve multiple tablets,
and one tablet can be served by multiple tablet servers.
KUDU architecture
KUDU tablets
Basic Design
• Kudu has a master process responsible for managing the metadata that
describes the logical structure of the data stored in Tablet Servers (the
catalog), acting as a coordinator when recovering from hardware failure,
and keeping track of which tablet servers are responsible for hosting
replicas of each Tablet.
• Multiple standby master servers can be defined to provide high
availability. In Kudu, many responsibilities typically associated with
master processes can be delineated to the Tablet Servers due to Kudu’s
implementation of Raft consensus, and the architecture provides a path to
partitioning the master’s duties across multiple machines in the future.
• We do not anticipate that Kudu’s master process will become the bottleneck to
overall cluster performance; in tests on a 250-node cluster, the server hosting
the master process has been nowhere near saturation.
Basic Design
• Master The master keeps track of all the tablets, tablet servers, the
Catalog Table, and other metadata related to the cluster.
• At a given point in time, there can only be one acting master (the leader).
If the current leader disappears, a new master is elected using Raft
Consensus Algorithm.
• The master also coordinates metadata operations for clients. For example,
when creating a new table, the client internally sends the request to the
master. The master writes the metadata for the new table into the catalog
table, and coordinates the process of creating tablets on the tablet servers.
All the master’s data is stored in a tablet, which can be replicated to all
the other candidate masters. Tablet servers heartbeat to the master at a set
interval (the default is once per second).
Basic Design
• Raft Consensus Algorithm Kudu uses the Raft consensus
algorithm as a means to guarantee fault-tolerance and consistency,
both for regular tablets and for master data. Through Raft, multiple
replicas of a tablet elect a leader, which is responsible for accepting
and replicating writes to follower replicas. Once a write is persisted
in a majority of replicas it is acknowledged to the client. A given
group of N replicas (usually 3 or 5) is able to accept writes with at
most (N - 1)/2 faulty replicas.
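The majority rule above reduces to one line of arithmetic: a group of N replicas keeps accepting writes as long as a strict majority survives, i.e. it tolerates at most (N - 1)/2 faults.

```python
def max_faulty_replicas(n_replicas: int) -> int:
    """A Raft group of N replicas acknowledges a write once a majority
    has persisted it, so it tolerates at most (N - 1) // 2 faults."""
    return (n_replicas - 1) // 2

# The usual Kudu deployments of 3 or 5 replicas:
assert max_faulty_replicas(3) == 1
assert max_faulty_replicas(5) == 2
```

This is why 3 and 5 are the typical replica counts: even numbers add storage cost without improving fault tolerance (4 replicas tolerate the same single fault as 3).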
Basic Design
• Data stored in Kudu is updateable through the use of a variation of log-
structured storage in which updates, inserts, and deletes are temporarily
buffered in memory before being merged into persistent columnar
storage.
• Kudu protects against spikes in query latency generally associated with
such architectures through constantly performing small maintenance
operations such as compactions so that large maintenance operations are
never necessary.
• Data Compression Because a given column contains only one type of
data, pattern-based compression can be orders of magnitude more
efficient than compressing mixed data types, which are used in row-
based solutions. Combined with the efficiencies of reading data from
columns, compression allows you to fulfill your query while reading even
fewer blocks from disk.
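The compression claim can be demonstrated with zlib on synthetic data (the column names and values here are invented for the demo): compressing a column's values together exploits the long runs of similar data that row-wise interleaving breaks up.

```python
import zlib

# Synthetic table: a monotonically increasing id column and a
# low-cardinality category column, 10,000 rows.
ids = [str(i) for i in range(10_000)]
cats = ['sensor_a' if i % 2 == 0 else 'sensor_b' for i in range(10_000)]

# Row-oriented layout interleaves the two value types per row...
row_bytes = '\n'.join(f'{i},{c}' for i, c in zip(ids, cats)).encode()
# ...while a columnar layout keeps each column's values together.
col_bytes = ('\n'.join(ids) + '\n' + '\n'.join(cats)).encode()

row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)

# Same information, but the columnar layout compresses better:
# the repeated category values form long, exploitable runs.
assert len(col_compressed) < len(row_compressed)
```

zlib here merely stands in for the pattern-based encodings (dictionary, run-length, bit-packing) that columnar engines actually apply per column type.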
Basic Design
• Catalog Table The catalog table is the central location for metadata
of Kudu. It stores information about tables and tablets. The catalog
table may not be read or written directly. Instead, it is accessible
only via metadata operations exposed in the client API. The catalog
table stores two categories of metadata:
• Tables table schemas, locations, and states
• Tablets the list of existing tablets, which tablet servers have replicas of each
tablet, the tablet’s current state, and start and end keys.
Basic Design
• Logical Replication Kudu replicates operations, not on-disk data. This is
referred to as logical replication, as opposed to physical replication.
• This has several advantages: Although inserts and updates do transmit
data over the network, deletes do not need to move any data. The delete
operation is sent to each tablet server, which performs the delete locally.
Physical operations, such as compaction, do not need to transmit the data
over the network in Kudu.
• This is different from storage systems that use HDFS, where the blocks
need to be transmitted over the network to fulfill the required number of
replicas. Tablets do not need to perform compactions at the same time or
on the same schedule, or otherwise remain in sync on the physical storage
layer. This decreases the chances of all tablet servers experiencing high
latency at the same time, due to compactions or heavy write loads.
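A toy simulation of logical replication (all names and data invented): each replica receives the operation and applies it locally, so a delete moves no row data over the "network", yet every replica converges to the same state.

```python
def apply_op(replica: dict, op: tuple) -> None:
    """Apply one logical operation (an upsert or a delete) to a replica.
    Only the operation crosses the wire, never on-disk blocks."""
    kind, key, *payload = op
    if kind == 'upsert':
        replica[key] = payload[0]
    elif kind == 'delete':
        replica.pop(key, None)  # performed locally on each replica

# Three replicas of one tablet, plus a log of operations to replicate.
replicas = [{}, {}, {}]
ops = [('upsert', 1, 'row-a'), ('upsert', 2, 'row-b'), ('delete', 1)]
for op in ops:            # the leader replicates operations...
    for r in replicas:    # ...and every replica applies them itself
        apply_op(r, op)

# All replicas converge without any block transfer between them.
assert all(r == {2: 'row-b'} for r in replicas)
```

Contrast this with physical replication, where the delete would require rewriting and re-shipping the affected storage blocks to every replica.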
Basic Design
• Kudu provides direct APIs, in both C++ and Java, that allow for
point and batch retrieval of rows, writes, deletes, schema changes,
and more. In addition, Kudu is designed to integrate with and
improve existing Hadoop ecosystem tools. With Kudu’s beta
release, integrations with Impala, MapReduce, and Apache Spark
are available. Over time we plan on making Kudu a supported
storage option for most or all of the Hadoop ecosystem tools.
Why not?
• Not a Good Fit for Transactional Workloads (Analytic use-cases
almost exclusively use a subset of the columns in the queried table
and generally aggregate values over a broad range of rows. This
access pattern is greatly accelerated by column oriented data.
Operational use-cases are more likely to access most or all of the
columns in a row, and might be more appropriately served by row
oriented storage. A column oriented storage format was chosen for
Kudu because it’s primarily targeted at analytic use-cases.)
• Small Kudu tables load almost as fast as HDFS tables. However,
as size increases, load times become double those of HDFS, with
the largest table (Lineitem) taking up to 4 times the HDFS load
time.
Why not?
• Only one index is supported: This can be a problem if you have a
lot of diversified queries that aggregate data by different variables
(by timestamp, user, vehicle, etc.). The primary key cannot be
changed after table creation.
• Does not redistribute replicas of tablets automatically: if you add a
new node to your cluster, for example, Kudu will not redistribute
the tablets so that the cluster nodes are balanced. You need to either
recreate all your existing tables or move the tablets manually to
balance your cluster, which can be tedious work.
Why not
• Does not support Sqoop: If you want to migrate your SQL warehouse
tables to Kudu, you first need to sqoop them to HDFS, and then use a tool
like Apache Spark to migrate the data to Kudu.
• Dependent on Impala for querying: Impala uses MPP (massively parallel
processing) to perform queries, which basically means that it uses all
daemons to fetch and compute the data it needs, and stores results in
memory. This is great if you need to perform a query that doesn’t take too
long (has few computations or doesn’t move much data). If, however,
you have a daily or monthly ETL with complex queries or massive inserts
that keep daemons working for hours, a single daemon failure means the
query stops and must be recomputed from the very beginning. This is a
nightmare for meeting SLAs, a nightmare that does not happen with
Apache Hive, which uses MapReduce to perform queries and thus saves
intermediate results to disk, making it much more reliable for these
scenarios. Kudu’s current engine for querying is solely Impala, which
may cause issues for these use cases.
Why not
• Impala tables created with Kudu data cannot be truncated
• Cannot add partitions dynamically unless they are ranged: At table
creation time you have to specify how to partition your table (divide
it into tablets), and you have three options to do so: hash partitioning,
range partitioning, or a combination of the two. The problem is that in
a production scenario your tables will keep growing in volume, and
eventually you will need to add more tablets to keep your performance
up. This cannot be done if you use hash partitioning. Similarly, data
consolidation is impossible in Kudu unless you create a new, separate
table and perform a full insert into it, which may take some time.
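The hash-partitioning limitation above can be sketched in a few lines (crc32 stands in for Kudu's internal hash function, and the row keys are invented): each key maps to one of a fixed set of buckets, and growing the bucket count would re-home most keys, which is why hash partitions cannot be added after table creation.

```python
import zlib

def hash_bucket(key: str, num_buckets: int) -> int:
    """Assign a row key to one of a fixed number of hash buckets
    (crc32 is a stand-in for Kudu's internal hash)."""
    return zlib.crc32(key.encode()) % num_buckets

# 1,000 hypothetical row keys spread over 4 declared buckets.
buckets = [hash_bucket(f'user-{i}', 4) for i in range(1000)]
assert set(buckets) <= {0, 1, 2, 3}

# Changing the bucket count re-homes a large share of existing keys,
# so "just add a bucket" is not an in-place operation.
moved = sum(hash_bucket(f'user-{i}', 4) != hash_bucket(f'user-{i}', 5)
            for i in range(1000))
assert moved > 0
```

Range partitions, by contrast, can be added or dropped at the ends of the key space without touching existing tablets, which is why only ranged partitions can be added dynamically.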
Why (Performance comparison of different file formats and storage engines in the Hadoop ecosystem) [6]
• Storage efficiency – with Parquet or Kudu and Snappy
compression the total volume of the data can be reduced by a factor
10 comparing to uncompressed simple serialization format.
• Data ingestion speed – all tested file-based solutions provide faster
ingestion rates (between 2x and 10x) than specialized storage
engines or MapFiles (sorted sequences).
• Random data access time – using HBase or Kudu, typical random
data lookup speed is below 500 ms. With smart HDFS namespace
partitioning, Parquet could deliver random lookups on the level of a
second, but consumes more resources.
Why (Performance comparison of different file formats and storage engines in the Hadoop ecosystem) [6]
• Data analytics – with Parquet or Kudu it is possible to perform fast
and scalable (typically more than 300k records per second per CPU
core) data aggregation, filtering and reporting.
• Support of in-place data mutation – HBase and Kudu can modify
records (schema and values) in place, which is not possible with
data stored directly in HDFS files.
Why KUDU
• There are really no good alternative storage engines to Kudu in the
Hadoop ecosystem that achieve great analytical query performance
and, at the same time, allow you to change data in near real-time.
• The Kudu documentation states that Kudu’s intent is to complement
HDFS and HBase, not to replace them; but for many use cases and
smaller data sets, all you might need is Kudu and Impala with
Spark.
Why KUDU
• A good case for Kudu is the ever-popular Data Lake architecture.
It is not enough these days to build a batch-oriented Data Lake,
updated a few times a day. Many modern analytical projects
(predictive alerts, anomaly detection, real-time dashboards etc.)
rely on data, streamed in near real-time from various source
systems.
• If the requirement is for storage that performs as well as HDFS
for analytical queries, with the additional flexibility of faster
random access and RDBMS features such as
updates/deletes/inserts, then Kudu is a strong shortlist
candidate.
Write once, read many Y’all
Kudu aims
• Fast processing of OLAP workloads.
• Integration with MapReduce, Spark and other Hadoop ecosystem
components.
• Tight integration with Apache Impala, making it a good, mutable
alternative to using HDFS with Apache Parquet.
• Strong but flexible consistency model, allowing you to choose
consistency requirements on a per-request basis, including the
option for strict-serializable consistency.
Kudu aims
• Strong performance for running sequential and random workloads
simultaneously.
• Easy to administer and manage with Cloudera Manager.
Kudu aims
• High availability. Tablet Servers and Masters use the Raft
Consensus Algorithm, which ensures that as long as more than half
the total number of replicas is available, the tablet is available for
reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5
replicas are available, the tablet is available.
• Reads can be serviced by read-only follower tablets, even in the
event of a leader tablet failure.
• Structured data model.
A few examples of applications for which Kudu is a
great solution are:
• Reporting applications where newly-arrived data needs to be
immediately available for end users
• Time-series applications that must simultaneously support:
• queries across large amounts of historic data
• granular queries about an individual entity that must return very
quickly
• Applications that use predictive models to make real-time
decisions with periodic refreshes of the predictive model based on
all historic data
Streaming Input with Near Real Time Availability
• A common challenge in data analysis is one where new data arrives
rapidly and constantly, and the same data needs to be available in
near real time for reads, scans, and updates. Kudu offers the
powerful combination of fast inserts and updates with efficient
columnar scans to enable real-time analytics use cases on a single
storage layer.
Time-series application with widely varying access
patterns
• A time-series schema is one in which data points are organized and
keyed according to the time at which they occurred. This can be
useful for investigating the performance of metrics over time or
attempting to predict future behavior based on past data.
• For instance, time-series customer data might be used both to store
purchase click-stream history and to predict future purchases, or
for use by a customer support representative.
• While these different types of analysis are occurring, inserts and
mutations may also be occurring individually and in bulk, and
become available immediately to read workloads. Kudu can handle
all of these access patterns simultaneously in a scalable and
efficient manner.
Time-series application with widely varying access
patterns
• Kudu is a good fit for time-series workloads for several reasons. With
Kudu’s support for hash-based partitioning, combined with its native
support for compound row keys, it is simple to set up a table spread
across many servers without the risk of "hotspotting" that is commonly
observed when range partitioning is used. Kudu’s columnar storage
engine is also beneficial in this context, because many time-series
workloads read only a few columns, as opposed to the whole row.
• In the past, you might have needed to use multiple data stores to handle
different data access patterns. This practice adds complexity to your
application and operations, and duplicates your data, doubling (or worse)
the amount of storage required. Kudu can handle all of these access
patterns natively and efficiently, without the need to off-load work to
other data stores.
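The hotspotting claim above can be simulated with synthetic timestamps (all numbers, bounds, and key names invented for the demo): range partitioning on time funnels every new write into the latest tablet, while hashing a compound (series, time) key spreads the same writes across tablets.

```python
import zlib
from collections import Counter

NUM_TABLETS = 4
# 1,000 fresh writes, all from "now" (synthetic epoch-milliseconds).
timestamps = [1_540_000_000_000 + i for i in range(1000)]

# Range partitioning on time: the newest range owns every new write.
range_bounds = [1_000_000_000_000, 1_250_000_000_000, 1_500_000_000_000]
def range_tablet(ts):
    return sum(ts >= b for b in range_bounds)  # index of owning range

# Hash partitioning on a compound (series_id, time) key spreads writes.
def hash_tablet(series_id, ts):
    return zlib.crc32(f'{series_id}:{ts}'.encode()) % NUM_TABLETS

range_load = Counter(range_tablet(ts) for ts in timestamps)
hash_load = Counter(hash_tablet(ts % 50, ts) for ts in timestamps)

assert range_load == {3: 1000}  # one hot tablet absorbs every write
assert len(hash_load) > 1       # hashing spreads writes over tablets
```

This is the trade-off the slide describes: hashing removes the write hotspot, at the cost of losing the ability to add new time ranges as partitions later.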
Predictive Modeling
• Data scientists often develop predictive learning models from large sets
of data.
• The model and the data may need to be updated or modified often as the
learning takes place or as the situation being modeled changes.
• In addition, the scientist may want to change one or more factors in the
model to see what happens over time. Updating a large set of data stored
in files in HDFS is resource-intensive, as each file needs to be completely
rewritten.
• In Kudu, updates happen in near real time. The scientist can tweak the
value, re-run the query, and refresh the graph in seconds or minutes,
rather than hours or days. In addition, batch or incremental algorithms
can be run across the data at any time, with near-real-time results.
Combining Data In Kudu With Legacy Systems
• Companies generate data from multiple sources and store it in a
variety of systems and formats. For instance, some of your data
may be stored in Kudu, some in a traditional RDBMS, and some in
files in HDFS. You can access and query all of these sources and
formats using Impala, without the need to change your legacy
systems.
Real-time Analytics with Kudu and Spark
Real-time Analytics with Kudu
Working with Impala
Build a Prediction Engine using Spark, Kudu, and Impala
http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/
References
1. https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
2. https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
3. https://mapr.com/blog/much-ado-about-kudu/
4. https://blog.clairvoyantsoft.com/guide-to-using-apache-kudu-and-performance-comparison-with-hdfs-453c4b26554f
5. https://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
6. https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
7. https://kudu.apache.org/docs/quickstart.html
8. https://kudu.apache.org/docs/index.html
9. http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/

 

More from Worapol Alex Pongpech, PhD (9)

Blockchain based Customer Relation System
Blockchain based Customer Relation SystemBlockchain based Customer Relation System
Blockchain based Customer Relation System
 
Fast analytics kudu to druid
Fast analytics  kudu to druidFast analytics  kudu to druid
Fast analytics kudu to druid
 
Apache Kafka
Apache Kafka Apache Kafka
Apache Kafka
 
Building business intuition from data
Building business intuition from dataBuilding business intuition from data
Building business intuition from data
 
10 basic terms so you can talk to data engineer
10 basic terms so you can  talk to data engineer10 basic terms so you can  talk to data engineer
10 basic terms so you can talk to data engineer
 
Why are we using kubernetes
Why are we using kubernetesWhy are we using kubernetes
Why are we using kubernetes
 
Airflow 4 manager
Airflow 4 managerAirflow 4 manager
Airflow 4 manager
 
Dark data
Dark dataDark data
Dark data
 
In15orlesss hadoop
In15orlesss hadoopIn15orlesss hadoop
In15orlesss hadoop
 

Recently uploaded

On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 

Recently uploaded (20)

On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 

Fast Analytics

  • 2. What the heck is it? • Fast Analytics (FA) is about delivering analytics at decision-making speeds, on data you quickly need to know. 10/29/2018 2 https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
  • 3. Why oh why? • Life is a period of continuous time (until it ends), and life seriously cannot wait for you to make your decision! ("I will take my money to another service provider if I have to wait too long.") • The clock is ticking and the information is flowing. https://targetdatacorp.com/customer-data/
  • 4. That’s Every Minute Y’all! https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
  • 5. Why oh WHY? • What if your life depends on it? • Drug Discovery • Precision Medicine • Point of Care/Patient 360 • Insurance Fraud • Quick decisions based on your personal data might save your life!
  • 6. How? • By processing high-velocity, high-volume Big Data in real time through the use of an Enterprise Service Bus (ESB), enabling decision-makers to gain immediate understanding of new trends and customer/market shifts as they occur. http://www.sovtex.ru/en/enterprise-service-bus-esb/
  • 7. Real-time Analytics • HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
  • 9. Kudu vs HDFS and HBase
  • 10. With Kudu • Apache Hive was one of the first SQL-like query interfaces developed over distributed data on top of Hadoop. Hive converts queries to Hadoop MapReduce jobs. • Apache Impala uses its own parallel-processing architecture on top of HDFS instead of MapReduce jobs; Kudu and Impala are best used together. Unlike Hive, Impala never translates its SQL queries into MapReduce jobs but executes them natively. • Apache Spark is a cluster computing technology. It is not strictly dependent on Hadoop because it has its own cluster management; however, Spark is usually implemented on top of Hadoop, which takes care of distributed data storage. Spark SQL is a Spark component on top of Spark Core that provides a way of querying and persisting structured and semi-structured data.
  • 12. Kudu • Kudus are a kind of antelope • Found in eastern and southern Africa • THIS IS NOT Apache Kudu
  • 13. This is Apache Kudu
  • 14. Motivation • Reducing architectural complexity • Performance (for table-based operations) • Reliability across globally-distributed data centers
  • 15. What is and is not • Apache Kudu is an open source columnar storage engine. It promises low latency random access and efficient execution of analytical queries.
  • 16. What is and is not • Apache Kudu is not really a SQL interface for Hadoop but a very well-optimized columnar database designed to fit in with the Hadoop ecosystem. It has been integrated to work with Impala, MapReduce and Spark, and additional framework integrations are expected. The idea is that it can provide very fast scan performance. • Apache Kudu is a "storage engine", or perhaps a "database" project, that is delivered upon a non-HDFS filesystem. Its underlying storage format can be considered competitive with file formats like Parquet. • Note that Kudu is NOT compatible with HDFS and is NOT truly complementary to it: Kudu runs on a completely separate filesystem from Hadoop, which is what enables it to update data, very much unlike HDFS.
  • 17. Basic Design • From a user perspective, Kudu is a storage system for tables of structured data where • Tables have a well-defined schema consisting of a predefined number of typed columns. • Each table has a primary key composed of one or more of its columns. • The primary key enforces a uniqueness constraint (no two rows can share the same key) and acts as an index for efficient updates and deletes.
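The primary-key behaviour described on this slide can be sketched in a few lines. This is a toy model in plain Python, not the real Kudu client API; the table, column names, and methods are all illustrative.

```python
# Toy model of Kudu's table abstraction -- NOT the real Kudu API.
# A table has a typed schema and a primary key over some of its columns;
# the key both enforces uniqueness and indexes updates and deletes.
class ToyTable:
    def __init__(self, schema, key_columns):
        self.schema = schema            # e.g. {"host": str, "ts": int, "cpu": float}
        self.key_columns = key_columns  # subset of schema, e.g. ("host", "ts")
        self.rows = {}                  # primary-key tuple -> row dict

    def _key(self, row):
        return tuple(row[c] for c in self.key_columns)

    def insert(self, row):
        key = self._key(row)
        if key in self.rows:            # uniqueness constraint on the key
            raise ValueError("duplicate primary key: %r" % (key,))
        self.rows[key] = dict(row)

    def update(self, row):
        # key lookup makes updates cheap -- no scan required
        self.rows[self._key(row)].update(row)

    def delete(self, *key_values):
        del self.rows[tuple(key_values)]

metrics = ToyTable({"host": str, "ts": int, "cpu": float}, ("host", "ts"))
metrics.insert({"host": "a", "ts": 1, "cpu": 0.5})
metrics.insert({"host": "a", "ts": 2, "cpu": 0.7})
metrics.update({"host": "a", "ts": 1, "cpu": 0.9})
metrics.delete("a", 2)
```

Real Kudu tables behave analogously: an insert with an already-present key is rejected, while updates and deletes locate their target row through the key.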
  • 18. Basic Design • Kudu tables are composed of • a series of logical subsets of data, similar to partitions in relational database systems, called Tablets. • Kudu provides • data durability and protection against hardware failure by replicating these Tablets to multiple commodity hardware nodes using the Raft consensus algorithm. • Tablets are typically tens of gigabytes, and an individual node typically holds 10-100 Tablets.
  • 19. Kudu Table and Tablets
  • 20. Basic Design • Tablet A tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any given point in time, one of these replicas is considered the leader tablet. Any replica can service reads, and writes require consensus among the set of tablet servers serving the tablet. • Tablet Server A tablet server stores and serves tablets to clients. For a given tablet, one tablet server acts as a leader, and the others act as follower replicas of that tablet. Only leaders service write requests, while leaders or followers each service read requests. Leaders are elected using Raft Consensus Algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers.
  • 23. Basic Design • Kudu has a master process responsible for managing the metadata that describes the logical structure of the data stored in Tablet Servers (the catalog), acting as a coordinator when recovering from hardware failure, and keeping track of which tablet servers are responsible for hosting replicas of each Tablet. • Multiple standby master servers can be defined to provide high availability. In Kudu, many responsibilities typically associated with master processes can be delegated to the Tablet Servers due to Kudu’s implementation of Raft consensus, and the architecture provides a path to partitioning the master’s duties across multiple machines in the future. • We do not anticipate that Kudu’s master process will become the bottleneck to overall cluster performance; in tests on a 250-node cluster the server hosting the master process has been nowhere near saturation.
  • 24. Basic Design • Master The master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster. • At a given point in time, there can only be one acting master (the leader). If the current leader disappears, a new master is elected using Raft Consensus Algorithm. • The master also coordinates metadata operations for clients. For example, when creating a new table, the client internally sends the request to the master. The master writes the metadata for the new table into the catalog table, and coordinates the process of creating tablets on the tablet servers. All the master’s data is stored in a tablet, which can be replicated to all the other candidate masters. Tablet servers heartbeat to the master at a set interval (the default is once per second).
  • 25. Basic Design • Raft Consensus Algorithm Kudu uses the Raft consensus algorithm as a means to guarantee fault-tolerance and consistency, both for regular tablets and for master data. Through Raft, multiple replicas of a tablet elect a leader, which is responsible for accepting and replicating writes to follower replicas. Once a write is persisted in a majority of replicas it is acknowledged to the client. A given group of N replicas (usually 3 or 5) is able to accept writes with at most (N - 1)/2 faulty replicas.
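The replica counts quoted on this slide follow from simple majority arithmetic; a quick sketch:

```python
# Raft write rule: a write is acknowledged once a majority of the N
# replicas has persisted it, so a group survives (N - 1) // 2 faults.
def majority(n):
    return n // 2 + 1

def max_faulty(n):
    return (n - 1) // 2

for n in (3, 5):
    print("replicas=%d  majority=%d  tolerates %d faulty" %
          (n, majority(n), max_faulty(n)))
# replicas=3  majority=2  tolerates 1 faulty
# replicas=5  majority=3  tolerates 2 faulty
```

This is why Kudu tablets are typically deployed with 3 or 5 replicas: an even count adds storage cost without raising the number of tolerable faults.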
  • 26. Basic Design • Data stored in Kudu is updateable through the use of a variation of log-structured storage in which updates, inserts, and deletes are temporarily buffered in memory before being merged into persistent columnar storage. • Kudu protects against spikes in query latency generally associated with such architectures by constantly performing small maintenance operations such as compactions, so that large maintenance operations are never necessary. • Data Compression Because a given column contains only one type of data, pattern-based compression can be orders of magnitude more efficient than compressing mixed data types, as used in row-based solutions. Combined with the efficiencies of reading data from columns, compression allows you to fulfill your query while reading even fewer blocks from disk.
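A minimal illustration of why single-type columns compress so well. This is plain-Python run-length encoding; Kudu's actual column encodings are more sophisticated, so treat this only as the intuition behind the slide.

```python
# Run-length encoding a single-type column: long runs of repeated values
# collapse to (value, count) pairs. Row-oriented data interleaves types,
# so such runs rarely occur there.
def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1            # extend the current run
        else:
            runs.append([v, 1])         # start a new run
    return [(v, n) for v, n in runs]

status_column = ["OK"] * 1000 + ["ERROR"] * 3 + ["OK"] * 500
encoded = run_length_encode(status_column)
print(encoded)  # 1503 cells collapse to just 3 runs
```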
  • 27. Basic Design • Catalog Table The catalog table is the central location for metadata of Kudu. It stores information about tables and tablets. The catalog table may not be read or written directly. Instead, it is accessible only via metadata operations exposed in the client API. The catalog table stores two categories of metadata: • Tables table schemas, locations, and states • Tablets the list of existing tablets, which tablet servers have replicas of each tablet, the tablet’s current state, and start and end keys.
  • 28. Basic Design • Logical Replication Kudu replicates operations, not on-disk data. This is referred to as logical replication, as opposed to physical replication. • This has several advantages: Although inserts and updates do transmit data over the network, deletes do not need to move any data. The delete operation is sent to each tablet server, which performs the delete locally. Physical operations, such as compaction, do not need to transmit the data over the network in Kudu. • This is different from storage systems that use HDFS, where the blocks need to be transmitted over the network to fulfill the required number of replicas. Tablets do not need to perform compactions at the same time or on the same schedule, or otherwise remain in sync on the physical storage layer. This decreases the chances of all tablet servers experiencing high latency at the same time, due to compactions or heavy write loads.
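The contrast with physical replication can be sketched with a toy operation log. This is illustrative Python, not Kudu internals: followers replay small operations, and a delete ships only a key, never row data or disk blocks.

```python
# Logical replication: the leader appends operations to a log and each
# replica replays them locally. A delete entry carries only the key.
class Replica:
    def __init__(self):
        self.rows = {}

    def apply(self, op):
        kind, key, payload = op
        if kind == "insert":
            self.rows[key] = payload
        elif kind == "delete":
            self.rows.pop(key, None)    # no data moves, just a local delete

op_log = [("insert", 1, {"v": 10}),
          ("insert", 2, {"v": 20}),
          ("delete", 1, None)]          # a few bytes regardless of row size

replicas = [Replica() for _ in range(3)]
for op in op_log:                       # leader ships ops, not disk blocks
    for r in replicas:
        r.apply(op)
```

Because every replica applies the same log, compaction of the resulting on-disk data can happen independently on each node, on its own schedule.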
  • 29. Basic Design • Kudu provides direct APIs, in both C++ and Java, that allow for point and batch retrieval of rows, writes, deletes, schema changes, and more. In addition, Kudu is designed to integrate with and improve existing Hadoop ecosystem tools. With Kudu’s beta release, integrations with Impala, MapReduce, and Apache Spark are available. Over time we plan on making Kudu a supported storage option for most or all of the Hadoop ecosystem tools.
  • 30. Why not? • Not a good fit for transactional workloads. (Analytic use-cases almost exclusively use a subset of the columns in the queried table and generally aggregate values over a broad range of rows. This access pattern is greatly accelerated by column-oriented data. Operational use-cases are more likely to access most or all of the columns in a row, and might be more appropriately served by row-oriented storage. A column-oriented storage format was chosen for Kudu because it is primarily targeted at analytic use-cases.) • Small Kudu tables load almost as fast as HDFS tables. However, as the size increases, load times become double those of HDFS, with the largest table, Lineitem, taking up to 4 times the load time.
  • 31. Why not? • Only one index is supported: this can be a problem if you have many diversified queries that aggregate data by different variables (timestamp, user, vehicle, etc.). The primary key cannot be changed after table creation. • Does not redistribute replicas of tablets automatically: if you add a new node to your cluster, for example, Kudu will not redistribute the tablets so that the cluster nodes are balanced. You need to either recreate all your existing tables or move the tablets manually to balance your cluster, which can be tedious work.
  • 32. Why not • Does not support Sqoop: if you want to migrate your SQL warehouse tables to Kudu, you first need to Sqoop them to HDFS, and then use a tool like Apache Spark to migrate the data to Kudu. • Dependent on Impala for querying: Impala uses MPP (massively parallel processing) to perform queries, which basically means that it uses all daemons to fetch and compute the data it needs, and stores results in memory. This is great if you need to perform a query that does not take too long (few computations, little data movement). If, however, you have a daily or monthly ETL with complex queries or massive inserts that keep daemons working for hours, then on a daemon failure the query stops and needs to be recomputed from the very beginning. This is a nightmare for meeting SLAs, a nightmare that does not happen with Apache Hive, which uses MapReduce to perform queries and thus saves intermediate results to disk, making it much more reliable for these types of scenarios. Kudu’s current engine for querying is solely Impala, which may cause some issues for these use cases.
  • 33. Why not • Impala tables created with Kudu data cannot be truncated. • Cannot add partitions dynamically unless they are ranged: at the time of table creation you have to specify how you want to partition your table (divide it into tablets), and you have three options to do so: hash partitioning, range partitioning, or a combination of the two. The problem is that in a production scenario your tables will keep increasing in volume and eventually you will need to add more tablets to keep your performance up. This cannot be done if you use hash partitioning. Similarly, data consolidation is impossible in Kudu unless you create a new, separate table and perform a full insert into it, which may take some time.
  • 34. Why (Performance comparison of different file formats and storage engines in the Hadoop ecosystem) [6] • Storage efficiency: with Parquet or Kudu and Snappy compression, the total volume of the data can be reduced by a factor of 10 compared to an uncompressed simple serialization format. • Data ingestion speed: all tested file-based solutions provide faster ingestion rates (between 2x and 10x) than specialized storage engines or MapFiles (sorted sequence). • Random data access time: using HBase or Kudu, typical random data lookup speed is below 500 ms. With smart HDFS namespace partitioning, Parquet could deliver random lookup on the level of a second, but it consumes more resources.
  • 35. Why (Performance comparison of different file formats and storage engines in the Hadoop ecosystem) [6] • Data analytics: with Parquet or Kudu it is possible to perform fast and scalable (typically more than 300k records per second per CPU core) data aggregation, filtering and reporting. • Support of in-place data mutation: HBase and Kudu can modify records (schema and values) in place, which is not possible with data stored directly in HDFS files.
  • 36. Why KUDU • There are really no good alternative storage engines to Kudu in the Hadoop ecosystem that achieve great analytical query performance and, at the same time, allow you to change data in near real time. • The Kudu documentation states that Kudu’s intent is to complement HDFS and HBase, not to replace them, but for many use cases and smaller data sets, all you might need is Kudu and Impala with Spark.
  • 37. Why KUDU • A good case for Kudu is the ever-popular Data Lake architecture. It is not enough these days to build a batch-oriented Data Lake updated a few times a day. Many modern analytical projects (predictive alerts, anomaly detection, real-time dashboards, etc.) rely on data streamed in near real time from various source systems. • If the requirement is for storage that performs as well as HDFS for analytical queries, with the additional flexibility of faster random access and RDBMS features such as updates/deletes/inserts, then Kudu should be on the shortlist. • Write once, read many, y’all
  • 38. Kudu aims • Fast processing of OLAP workloads. • Integration with MapReduce, Spark and other Hadoop ecosystem components. • Tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet. • Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict-serializable consistency.
  • 39. Kudu aims • Strong performance for running sequential and random workloads simultaneously. • Easy to administer and manage with Cloudera Manager.
  • 40. Kudu aims • High availability. Tablet Servers and Masters use the Raft Consensus Algorithm, which ensures that as long as more than half the total number of replicas is available, the tablet is available for reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5 replicas are available, the tablet is available. • Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure. • Structured data model.
  • 41. A few examples of applications for which Kudu is a great solution are: • Reporting applications where newly-arrived data needs to be immediately available for end users • Time-series applications that must simultaneously support: • queries across large amounts of historic data • granular queries about an individual entity that must return very quickly • Applications that use predictive models to make real-time decisions with periodic refreshes of the predictive model based on all historic data
  • 42. Streaming Input with Near-Real-Time Availability • A common challenge in data analysis is one where new data arrives rapidly and constantly, and the same data needs to be available in near real time for reads, scans, and updates. Kudu offers the powerful combination of fast inserts and updates with efficient columnar scans to enable real-time analytics use cases on a single storage layer.
  • 43. Time-series application with widely varying access patterns • A time-series schema is one in which data points are organized and keyed according to the time at which they occurred. This can be useful for investigating the performance of metrics over time or attempting to predict future behavior based on past data. • For instance, time-series customer data might be used both to store purchase click-stream history and to predict future purchases, or for use by a customer support representative. • While these different types of analysis are occurring, inserts and mutations may also be occurring individually and in bulk, and become available immediately to read workloads. Kudu can handle all of these access patterns simultaneously in a scalable and efficient manner.
  • 44. Time-series application with widely varying access patterns • Kudu is a good fit for time-series workloads for several reasons. With Kudu’s support for hash-based partitioning, combined with its native support for compound row keys, it is simple to set up a table spread across many servers without the risk of "hotspotting" that is commonly observed when range partitioning is used. Kudu’s columnar storage engine is also beneficial in this context, because many time-series workloads read only a few columns, as opposed to the whole row. • In the past, you might have needed to use multiple data stores to handle different data access patterns. This practice adds complexity to your application and operations, and duplicates your data, doubling (or worse) the amount of storage required. Kudu can handle all of these access patterns natively and efficiently, without the need to off-load work to other data stores.
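The hotspotting point can be sketched with a toy partitioner. This is illustrative Python: the tablet count, the key columns, and the use of CRC32 are assumptions made for the demo, not Kudu's real partitioning code.

```python
# Hash partitioning on the leading columns of a compound key (host, metric)
# spreads time-series writes across tablets. Range partitioning on the
# timestamp alone would send every newly arriving row to the newest range
# (a single hot tablet).
import zlib

NUM_TABLETS = 4  # illustrative; a real table would choose this at creation

def tablet_for(host, metric):
    # The timestamp stays in the compound key for ordering within a tablet,
    # but it does not drive tablet placement.
    return zlib.crc32(("%s|%s" % (host, metric)).encode()) % NUM_TABLETS

counts = [0] * NUM_TABLETS
for i in range(1000):                   # 1000 "current" data points, 8 hosts
    host = "host-%d" % (i % 8)
    counts[tablet_for(host, "cpu")] += 1
print(counts)
```

Every write lands on the tablet owned by its (host, metric) hash, so a burst of "now" data is shared by the whole cluster instead of piling onto one server.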
  • 45. Predictive Modeling • Data scientists often develop predictive models from large sets of data. • The model and the data may need to be updated or modified often as learning takes place or as the situation being modeled changes. • In addition, the scientist may want to change one or more factors in the model to see what happens over time. Updating a large set of data stored in files in HDFS is resource-intensive, as each file needs to be completely rewritten. • In Kudu, updates happen in near real time. The scientist can tweak a value, re-run the query, and refresh the graph in seconds or minutes, rather than hours or days. In addition, batch or incremental algorithms can be run across the data at any time, with near-real-time results.
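The contrast between rewriting HDFS files and Kudu-style in-place updates can be made concrete by counting the rows written for a single-record change; both classes below are illustrative stand-ins, not real storage engines:

```python
# Sketch of update cost: an immutable file (HDFS-style) must be
# rewritten entirely to change one row, while a mutable keyed store
# (Kudu-style) updates a single row in place. Counters track total
# rows physically "written".

class ImmutableFile:
    def __init__(self, rows):
        self.rows, self.rows_written = list(rows), 0
    def update(self, key, value):
        # rewrite the whole file to carry the one changed row
        self.rows = [(k, value if k == key else v) for k, v in self.rows]
        self.rows_written += len(self.rows)

class MutableStore:
    def __init__(self, rows):
        self.rows, self.rows_written = dict(rows), 0
    def update(self, key, value):
        self.rows[key] = value      # in-place update of one row
        self.rows_written += 1

data = [(i, 0.0) for i in range(100_000)]
f, s = ImmutableFile(data), MutableStore(data)
f.update(42, 1.0)
s.update(42, 1.0)
print(f.rows_written, s.rows_written)   # 100000 vs 1
```

The 100,000-to-1 gap is why "tweak a value and re-run" turns into seconds rather than hours once updates no longer force whole-file rewrites.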
  • 46. Combining Data in Kudu with Legacy Systems • Companies generate data from multiple sources and store it in a variety of systems and formats. For instance, some of your data may be stored in Kudu, some in a traditional RDBMS, and some in files in HDFS. You can access and query all of these sources and formats using Impala, without changing your legacy systems.
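Conceptually, this is what a federated Impala query does: one statement joins rows regardless of which engine stores them. The sketch below fakes the two sources as in-memory lists and hand-rolls the join in Python, purely to show the shape of the result; the table names and fields are invented:

```python
# Stand-in for a Kudu table of customers
kudu_customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
# Stand-in for order records stored as files on HDFS
hdfs_orders = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 1, "total": 12.5},
    {"customer_id": 2, "total": 99.0},
]

def join_totals(customers, orders):
    """Roughly: SELECT c.name, SUM(o.total) FROM customers c
       JOIN orders o ON o.customer_id = c.id GROUP BY c.name"""
    by_id = {c["id"]: c["name"] for c in customers}
    totals = {}
    for o in orders:
        name = by_id[o["customer_id"]]
        totals[name] = totals.get(name, 0.0) + o["total"]
    return totals

print(join_totals(kudu_customers, hdfs_orders))
```

In Impala the same query is plain SQL over two registered tables; the query layer hides which storage engine holds each side of the join.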
  • 47. Real-time Analytics with Kudu and Spark
  • 48. Real-time Analytics with Kudu
  • 50. Build a Prediction Engine Using Spark, Kudu, and Impala • http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/
  • 51. References
  1. https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
  2. https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
  3. https://mapr.com/blog/much-ado-about-kudu/
  4. https://blog.clairvoyantsoft.com/guide-to-using-apache-kudu-and-performance-comparison-with-hdfs-453c4b26554f
  5. https://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
  6. https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
  7. https://kudu.apache.org/docs/quickstart.html
  8. https://kudu.apache.org/docs/index.html
  9. http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/

Editor's Notes

  2. Businesses generate and collect huge volumes of complex data from a variety of sources: customer data, market data, supply-chain information, operational data, financial information, social media feeds, sensor data, and much more. To understand and act on these constant streams of information, it is vital for companies to bring their data sources together and give decision-makers the ability to analyze them quickly.
  3. What did you have for lunch? etc
  4. Impala/Parquet is really good at aggregating large data sets quickly (billions of rows and terabytes of data; OLAP stuff), and HBase is really good at handling a ton of small concurrent transactions (basically the mechanism for doing "OLTP" on Hadoop). The trade-off is that Impala sucks at OLTP workloads and HBase sucks at OLAP workloads.
  5. impalad = the Impala daemon