2. What the heck is it?
• Fast Analytics (FA) is about delivering analytics at decision-making
speed.
10/29/2018 2
https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
You need to know, quickly
3. Why oh why?
• Life is continuous in time (until it ends); seriously, life
cannot wait for you to make your decision! (I will take my money
to another service provider if I have to wait too long.)
• The clock is ticking and the information is flowing
https://targetdatacorp.com/customer-data/
5. Why oh WHY?
• What if your life depends on it?
• Drug Discovery
• Precision Medicine
• Point of Care/Patient 360
• Insurance Fraud
Quick decisions based on your personal data
might save your life!
6. How?
• By processing high-velocity, high-volume Big Data in real time
through the use of an Enterprise Service Bus (ESB), enabling
decision-makers to gain immediate understanding of new trends
and customer/market shifts as they occur.
http://www.sovtex.ru/en/enterprise-service-bus-esb/
7. Real time Analytics
HBase is an open-source, non-relational, distributed database modeled
after Google's Bigtable and written in Java.
10. With Kudu
• Apache Hive was one of the first SQL-like query interfaces developed
over distributed data on top of Hadoop. Hive converts queries to Hadoop
MapReduce jobs.
• Apache Impala uses its own parallel processing architecture on top of
HDFS instead of MapReduce jobs. Kudu and Impala are best used
together. Unlike Hive, Impala never translates its SQL queries into
MapReduce jobs; it executes them natively.
• Apache Spark is a cluster computing technology. It is not strictly
dependent on Hadoop because it has its own cluster management.
However, Spark is usually deployed on top of Hadoop, which takes
care of distributed data storage. Spark SQL is a Spark component on top
of Spark Core that provides a way of querying and persisting structured
and semi-structured data.
14. Motivation
• Reducing architectural complexity
• Performance (for table-based operations)
• Reliability across globally-distributed data centers
15. What is and is not
• Apache Kudu is an open source columnar storage engine. It
promises low latency random access and efficient execution of
analytical queries.
16. What is and is not
• Apache Kudu is not really a SQL interface for Hadoop but a very well
optimized columnar database designed to fit in with the Hadoop
ecosystem. It has been integrated to work with Impala, MapReduce and
Spark, and additional framework integrations are expected. The idea is
that it can provide very fast scan performance.
• Apache Kudu is a “storage engine” or perhaps a “database” project that is
delivered on a non-HDFS-based filesystem. This underlying storage
format could be considered competitive with file formats like Parquet.
• Note that Kudu is NOT compatible with HDFS and NOT truly
complementary to it. It runs on a completely separate filesystem
from Hadoop, which enables Kudu to update data, something very much
unlike HDFS.
17. Basic Design
• From a user perspective, Kudu is a storage system for
tables of structured data where
• Tables have a well-defined schema consisting of a predefined
number of typed columns.
• Each table has a primary key composed of one or more of its
columns.
• The primary key enforces a uniqueness constraint (no two rows
can share the same key) and acts as an index for efficient updates
and deletes.
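The schema and primary-key rules above can be sketched as a toy table in plain Python. This is a hypothetical illustration of the concepts, not the Kudu client API; all class and column names are made up:

```python
# Toy model of a Kudu-style table: a typed schema plus a primary-key index.
# Illustration only, not the Kudu client API.

class ToyTable:
    def __init__(self, schema, key_columns):
        self.schema = schema            # column name -> type (predefined, typed columns)
        self.key_columns = key_columns  # primary key: one or more schema columns
        self.rows = {}                  # key tuple -> row dict (the "index")

    def _key(self, row):
        return tuple(row[c] for c in self.key_columns)

    def insert(self, row):
        key = self._key(row)
        if key in self.rows:            # uniqueness constraint
            raise ValueError("duplicate primary key: %r" % (key,))
        self.rows[key] = row

    def update(self, row):
        # The key acts as an index: the row is located directly, no scan.
        self.rows[self._key(row)] = row

    def delete(self, key):
        del self.rows[key]

t = ToyTable({"host": str, "metric": str, "value": float}, ["host", "metric"])
t.insert({"host": "a", "metric": "cpu", "value": 0.5})
t.update({"host": "a", "metric": "cpu", "value": 0.9})
```

The key-to-row dictionary is what makes updates and deletes efficient: the row is found directly by key rather than by scanning the table.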
18. Basic Design
• Kudu tables are composed of
• a series of logical subsets of data, similar to partitions in
relational database systems, called Tablets.
• Kudu provides
• data durability and protection against hardware failure by
replicating these Tablets to multiple commodity hardware nodes
using the Raft consensus algorithm.
• Tablets are typically tens of gigabytes, and an individual node
typically holds 10-100 Tablets.
20. Basic Design
• Tablet A tablet is a contiguous segment of a table, similar to a partition in
other data storage engines or relational databases. A given tablet is
replicated on multiple tablet servers, and at any given point in time, one
of these replicas is considered the leader tablet. Any replica can service
reads, and writes require consensus among the set of tablet servers
serving the tablet.
• Tablet Server A tablet server stores and serves tablets to clients. For a
given tablet, one tablet server acts as a leader, and the others act as
follower replicas of that tablet. Only leaders service write requests, while
leaders or followers each service read requests. Leaders are elected using
Raft Consensus Algorithm. One tablet server can serve multiple tablets,
and one tablet can be served by multiple tablet servers.
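A toy model of this read/write routing, assuming the leader is already elected (names are hypothetical; in real Kudu the leader is chosen with Raft and routing is handled by the client library):

```python
import random

# Toy model of one tablet's replica set: only the leader accepts writes,
# while any replica (leader or follower) can serve reads.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class TabletReplicaSet:
    def __init__(self, server_names):
        self.replicas = [Replica(n) for n in server_names]
        self.leader = self.replicas[0]  # stand-in for Raft leader election

    def write(self, key, value):
        # Writes go through the leader, which replicates them to followers.
        for replica in self.replicas:
            replica.data[key] = value   # leader applies, followers replay

    def read(self, key):
        # Reads may hit the leader or any follower.
        return random.choice(self.replicas).data[key]

ts = TabletReplicaSet(["ts1", "ts2", "ts3"])
ts.write("row1", "v1")
```

After the write, every replica holds the row, so a read landing on any of the three servers returns the same value.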
23. Basic Design
• Kudu has a master process responsible for managing the metadata that
describes the logical structure of the data stored in Tablet Servers (the
catalog), acting as a coordinator when recovering from hardware failure,
and keeping track of which tablet servers are responsible for hosting
replicas of each Tablet.
• Multiple standby master servers can be defined to provide high
availability. In Kudu, many responsibilities typically associated with
master processes can be delegated to the Tablet Servers due to Kudu’s
implementation of Raft consensus, and the architecture provides a path to
partitioning the master’s duties across multiple machines in the future.
• We do not anticipate that Kudu’s master process will become the bottleneck to
overall cluster performance; in tests on a 250-node cluster, the server hosting
the master process has been nowhere near saturation.
24. Basic Design
• Master The master keeps track of all the tablets, tablet servers, the
Catalog Table, and other metadata related to the cluster.
• At a given point in time, there can only be one acting master (the leader).
If the current leader disappears, a new master is elected using Raft
Consensus Algorithm.
• The master also coordinates metadata operations for clients. For example,
when creating a new table, the client internally sends the request to the
master. The master writes the metadata for the new table into the catalog
table, and coordinates the process of creating tablets on the tablet servers.
All the master’s data is stored in a tablet, which can be replicated to all
the other candidate masters. Tablet servers heartbeat to the master at a set
interval (the default is once per second).
25. Basic Design
• Raft Consensus Algorithm Kudu uses the Raft consensus
algorithm as a means to guarantee fault-tolerance and consistency,
both for regular tablets and for master data. Through Raft, multiple
replicas of a tablet elect a leader, which is responsible for accepting
and replicating writes to follower replicas. Once a write is persisted
in a majority of replicas it is acknowledged to the client. A given
group of N replicas (usually 3 or 5) is able to accept writes with at
most (N - 1)/2 faulty replicas.
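The quorum arithmetic from the slide can be written down directly:

```python
# A Raft group of N replicas keeps accepting writes as long as a majority
# survives, i.e. with at most (N - 1) // 2 faulty replicas.

def max_faulty(n_replicas):
    return (n_replicas - 1) // 2

for n in (3, 5, 7):
    print("%d replicas tolerate %d failure(s)" % (n, max_faulty(n)))
```

This is why 3 and 5 are the common replication factors: 3 replicas survive 1 failure, 5 survive 2, while an even count like 4 tolerates no more failures than 3 does.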
26. Basic Design
• Data stored in Kudu is updateable through the use of a variation of log-
structured storage in which updates, inserts, and deletes are temporarily
buffered in memory before being merged into persistent columnar
storage.
• Kudu protects against spikes in query latency generally associated with
such architectures through constantly performing small maintenance
operations such as compactions so that large maintenance operations are
never necessary.
• Data Compression Because a given column contains only one type of
data, pattern-based compression can be orders of magnitude more
efficient than compressing mixed data types, which are used in row-
based solutions. Combined with the efficiencies of reading data from
columns, compression allows you to fulfill your query while reading even
fewer blocks from disk.
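A rough, self-contained illustration of the compression claim, with zlib standing in for Kudu's codecs. The data and the ratios are illustrative only, not Kudu benchmarks:

```python
import random
import zlib

# A single-type, low-cardinality column is extremely regular, so
# pattern-based compression almost eliminates it. High-entropy bytes
# (a crude stand-in for interleaved mixed-type row data) barely shrink.

column = ("active," * 10_000).encode()   # one column, one repeated value
ratio_col = len(zlib.compress(column)) / len(column)

random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(70_000))
ratio_row = len(zlib.compress(noisy)) / len(noisy)

print("low-cardinality column compresses to %.1f%% of raw" % (100 * ratio_col))
print("high-entropy bytes compress to %.1f%% of raw" % (100 * ratio_row))
```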
27. Basic Design
• Catalog Table The catalog table is the central location for metadata
of Kudu. It stores information about tables and tablets. The catalog
table may not be read or written directly. Instead, it is accessible
only via metadata operations exposed in the client API. The catalog
table stores two categories of metadata:
• Tables table schemas, locations, and states
• Tablets the list of existing tablets, which tablet servers have replicas of each
tablet, the tablet’s current state, and start and end keys.
28. Basic Design
• Logical Replication Kudu replicates operations, not on-disk data. This is
referred to as logical replication, as opposed to physical replication.
• This has several advantages: Although inserts and updates do transmit
data over the network, deletes do not need to move any data. The delete
operation is sent to each tablet server, which performs the delete locally.
Physical operations, such as compaction, do not need to transmit the data
over the network in Kudu.
• This is different from storage systems that use HDFS, where the blocks
need to be transmitted over the network to fulfill the required number of
replicas. Tablets do not need to perform compactions at the same time or
on the same schedule, or otherwise remain in sync on the physical storage
layer. This decreases the chances of all tablet servers experiencing high
latency at the same time, due to compactions or heavy write loads.
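A toy contrast that shows why logical replication of a delete ships almost no data: the operation is tiny, while the row data it removes is large. Illustrative only; real Kudu replicates Raft operations, not Python tuples:

```python
# Three replicas of a tablet, each holding two 1 KiB rows.
replicas = [{"row1": b"x" * 1024, "row2": b"y" * 1024} for _ in range(3)]

op = ("delete", "row1")                  # the only thing sent over the wire
bytes_shipped = len(repr(op).encode())

for replica in replicas:                 # each server applies the op locally
    del replica[op[1]]

print("shipped %d bytes to delete 1 KiB rows on 3 replicas" % bytes_shipped)
```

Under physical (block-level) replication, keeping the replicas in sync would instead mean re-transmitting the rewritten data blocks themselves.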
29. Basic Design
• Kudu provides direct APIs, in both C++ and Java, that allow for
point and batch retrieval of rows, writes, deletes, schema changes,
and more. In addition, Kudu is designed to integrate with and
improve existing Hadoop ecosystem tools. With Kudu’s beta
release, integrations with Impala, MapReduce, and Apache Spark
are available. Over time we plan on making Kudu a supported
storage option for most or all of the Hadoop ecosystem tools.
30. Why not?
• Not a Good Fit for Transactional Workloads (Analytic use-cases
almost exclusively use a subset of the columns in the queried table
and generally aggregate values over a broad range of rows. This
access pattern is greatly accelerated by column oriented data.
Operational use-cases are more likely to access most or all of the
columns in a row, and might be more appropriately served by row
oriented storage. A column oriented storage format was chosen for
Kudu because it’s primarily targeted at analytic use-cases.)
• Small Kudu tables load almost as fast as HDFS tables.
However, as size increases, load times become double
those of HDFS, with the largest table (Lineitem) taking up to 4
times the load time.
31. Why not?
• Only one index is supported: this can be a problem if you have a
lot of diversified queries that aggregate data by different variables
(by timestamp, user, vehicle, etc.). The primary key cannot be
changed after table creation.
• Does not redistribute replicas of tablets automatically: if you add a
new node to your cluster, for example, Kudu will not redistribute
the tablets so that the cluster nodes are balanced. You need to either
recreate all your existing tables, or move the tablets manually to
rebalance your cluster, which might be tedious work.
32. Why not
• Does not support Sqoop: if you want to migrate your SQL warehouse
tables to Kudu, you first need to Sqoop them to HDFS, and then use a tool
like Apache Spark to migrate the data to Kudu.
• Dependent on Impala for querying: Impala uses MPP (massively parallel
processing) to perform queries, which basically means that it uses all
daemons to fetch and compute the data it needs, and stores results in
memory. This is great if you need to perform a query that doesn’t take too
long (has few computations or doesn’t move that much data). If however
you have a daily or monthly ETL in which you have complex queries or
massive inserts that demand daemons to be working for hours, and you
have a daemon failure, the query stops and needs to be recomputed from
the very beginning. This is a nightmare for meeting SLAs, a nightmare
that does not happen with Apache Hive, which uses MapReduce to
perform queries and thus saves intermediate results to disk, making it
much more reliable for these types of scenarios. Kudu’s current engine for
querying is solely Impala, which may cause some issues for these use
cases.
33. Why not
• Impala tables created with Kudu data cannot be truncated
• Cannot add partitions dynamically unless they are ranged: At the
time of table creation you have to specify how you want to
partition your table (divide it in tablets) and you have three options
to do so: hash partitioning,range partitioning or a combination of
the two. The problem is that in a production scenario, your tables
will keep on increasing in volume and eventually you will need to
add more tablets to keep your performance up. This cannot be
done if you use hash partitioning. Similarly, data consolidation is
impossible using Kudu, unless you create a new, separate table and
perform a full insert to it, which may take some time.
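The two partitioning schemes can be sketched as follows (plain Python, not Kudu's implementation; `crc32` stands in for Kudu's hash function, and the split points are hypothetical):

```python
import zlib

def hash_partition(key, n_buckets):
    # The bucket count is fixed at table-creation time; it cannot grow later.
    return zlib.crc32(key.encode()) % n_buckets

def range_partition(key, split_points):
    # Returns the index of the range tablet the key falls into. New range
    # partitions can be added later simply by adding split points.
    for i, split in enumerate(split_points):
        if key < split:
            return i
    return len(split_points)
```

This asymmetry is exactly the limitation described above: a range-partitioned table can grow new tablets by adding split points, while a hash-partitioned one would need its bucket count changed, which requires a new table.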
34. Why ( Performance comparison of different file formats
and storage engines in the Hadoop ecosystem) 6
• Storage efficiency – with Parquet or Kudu and Snappy
compression, the total volume of the data can be reduced by a
factor of 10 compared to an uncompressed simple serialization
format.
• Data ingestion speed – all tested file-based solutions provide a
faster ingestion rate (between 2x and 10x) than specialized storage
engines or MapFiles (sorted sequence files).
• Random data access time – using HBase or Kudu, typical random
data lookup speed is below 500 ms. With smart HDFS namespace
partitioning, Parquet can deliver random lookups on the level of a
second, but consumes more resources.
35. Why ( Performance comparison of different file formats
and storage engines in the Hadoop ecosystem) 6
• Data analytics – with Parquet or Kudu it is possible to perform fast
and scalable (typically more than 300k records per second per CPU
core) data aggregation, filtering and reporting.
• Support of in-place data mutation – HBase and Kudu can modify
records (schema and values) in place, whereas this is not possible with
data stored directly in HDFS files.
36. Why KUDU
• There are really no good alternative storage engines to Kudu in the
Hadoop ecosystem that achieve great analytical query performance
and, at the same time, allow you to change data in near real-time.
• Kudu documentation states that Kudu's intent is to complement
HDFS and HBase, not to replace them, but for many use cases and
smaller data sets, all you might need is Kudu and Impala with
Spark.
37. Why KUDU
• A good case for Kudu is the ever-popular Data Lake architecture.
It is not enough these days to build a batch-oriented Data Lake,
updated a few times a day. Many modern analytical projects
(predictive alerts, anomaly detection, real-time dashboards etc.)
rely on data, streamed in near real-time from various source
systems.
• If the requirement is for storage that performs as well as HDFS
for analytical queries, with the additional flexibility of faster
random access and RDBMS features such as
Updates/Deletes/Inserts, then Kudu could be considered a
potential shortlist candidate.
Write once, read many Y’all
38. Kudu aims
• Fast processing of OLAP workloads.
• Integration with MapReduce, Spark and other Hadoop ecosystem
components.
• Tight integration with Apache Impala, making it a good, mutable
alternative to using HDFS with Apache Parquet.
• Strong but flexible consistency model, allowing you to choose
consistency requirements on a per-request basis, including the
option for strict-serializable consistency.
39. Kudu aims
• Strong performance for running sequential and random workloads
simultaneously.
• Easy to administer and manage with Cloudera Manager.
• High availability. Tablet Servers and Masters use the Raft
Consensus Algorithm, which ensures that as long as more than half
the total number of replicas is available, the tablet is available for
reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5
replicas are available, the tablet is available.
40. Kudu aims
• High availability. Tablet Servers and Masters use the Raft
Consensus Algorithm, which ensures that as long as more than half
the total number of replicas is available, the tablet is available for
reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5
replicas are available, the tablet is available.
• Reads can be serviced by read-only follower tablets, even in the
event of a leader tablet failure.
• Structured data model.
41. A few examples of applications for which Kudu is a
great solution are:
• Reporting applications where newly-arrived data needs to be
immediately available for end users
• Time-series applications that must simultaneously support:
• queries across large amounts of historic data
• granular queries about an individual entity that must return very
quickly
• Applications that use predictive models to make real-time
decisions with periodic refreshes of the predictive model based on
all historic data
42. Streaming Input with Near Real Time Availability
• A common challenge in data analysis is one where new data arrives
rapidly and constantly, and the same data needs to be available in
near real time for reads, scans, and updates. Kudu offers the
powerful combination of fast inserts and updates with efficient
columnar scans to enable real-time analytics use cases on a single
storage layer.
43. Time-series application with widely varying access
patterns
• A time-series schema is one in which data points are organized and
keyed according to the time at which they occurred. This can be
useful for investigating the performance of metrics over time or
attempting to predict future behavior based on past data.
• For instance, time-series customer data might be used both to store
purchase click-stream history and to predict future purchases, or
for use by a customer support representative.
• While these different types of analysis are occurring, inserts and
mutations may also be occurring individually and in bulk, and
become available immediately to read workloads. Kudu can handle
all of these access patterns simultaneously in a scalable and
efficient manner.
44. Time-series application with widely varying access
patterns
• Kudu is a good fit for time-series workloads for several reasons. With
Kudu’s support for hash-based partitioning, combined with its native
support for compound row keys, it is simple to set up a table spread
across many servers without the risk of "hotspotting" that is commonly
observed when range partitioning is used. Kudu’s columnar storage
engine is also beneficial in this context, because many time-series
workloads read only a few columns, as opposed to the whole row.
• In the past, you might have needed to use multiple data stores to handle
different data access patterns. This practice adds complexity to your
application and operations, and duplicates your data, doubling (or worse)
the amount of storage required. Kudu can handle all of these access
patterns natively and efficiently, without the need to off-load work to
other data stores.
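The hotspotting contrast above can be sketched with a small simulation. The split points, bucket counts, and key names are hypothetical, and `crc32` stands in for Kudu's hash function:

```python
import zlib
from collections import Counter

timestamps = range(1_000_000, 1_000_100)   # 100 new, ever-increasing timestamps

# Range partitioning on the timestamp alone: every new write lands in the
# last range tablet (the hotspot).
splits = [250_000, 500_000, 750_000]
def range_tablet(ts):
    return sum(ts >= s for s in splits)
range_load = Counter(range_tablet(ts) for ts in timestamps)

# Hash partitioning on the host component of a compound (host, timestamp)
# key spreads the same writes across buckets.
hosts = ["host-%d" % (i % 8) for i in range(100)]
def hash_tablet(host, n_buckets=4):
    return zlib.crc32(host.encode()) % n_buckets
hash_load = Counter(hash_tablet(h) for h in hosts)

print("range-partitioned load per tablet:", dict(range_load))
print("hash-partitioned load per bucket: ", dict(hash_load))
```

Every one of the 100 time-ordered writes hits the final range tablet, while hashing the host key distributes the same workload across servers.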
45. Predictive Modeling
• Data scientists often develop predictive learning models from large sets
of data.
• The model and the data may need to be updated or modified often as the
learning takes place or as the situation being modeled changes.
• In addition, the scientist may want to change one or more factors in the
model to see what happens over time. Updating a large set of data stored
in files in HDFS is resource-intensive, as each file needs to be completely
rewritten.
• In Kudu, updates happen in near real time. The scientist can tweak the
value, re-run the query, and refresh the graph in seconds or minutes,
rather than hours or days. In addition, batch or incremental algorithms
can be run across the data at any time, with near-real-time results.
46. Combining Data In Kudu With Legacy Systems
• Companies generate data from multiple sources and store it in a
variety of systems and formats. For instance, some of your data
may be stored in Kudu, some in a traditional RDBMS, and some in
files in HDFS. You can access and query all of these sources and
formats using Impala, without the need to change your legacy
systems.
50. Build a Prediction Engine using Spark, Kudu, and
Impala
http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/
Businesses generate and collect huge volumes of complex data from a variety of sources: customer data, market data, supply-chain information, operational data, financial information, social media feeds, sensor data, and so much more.
In order to understand and act on these constant streams of information, it’s vital for companies to have the ability to collect data sources together and provide decision-makers the ability to analyze it quickly.
What did you have for lunch? etc
Impala/Parquet is really good at aggregating large data sets quickly (billions of rows and terabytes of data, OLAP stuff),
and HBase is really good at handling a ton of small concurrent transactions (basically the mechanism for doing “OLTP” on Hadoop).
The tradeoff is that Impala sucks at OLTP workloads and HBase sucks at OLAP workloads.