Fast Analytics: Kudu to Druid
Alex Pongpech, 2021
What the heck is it?
• Fast Analytics (FA) is about delivering analytics at decision-making speed.
• You need to know quickly.
1/24/2021 2
https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
Why oh why?
• Life is continuous in time (until it ends); it will not wait for you to make
your decision! (I will take my money to another service provider if I have to
wait too long.)
• The clock is ticking and the information is flowing.
https://targetdatacorp.com/customer-data/
That's every minute, y'all!
https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
Why oh WHY?
• What if your life depends on it?
• Drug Discovery
• Precision Medicine
• Point of Care/Patient 360
• Insurance Fraud
Quick decisions based on your personal data might save your life!
How?
• By processing high-velocity, high-volume Big Data in real time through an
Enterprise Service Bus (ESB), enabling decision-makers to gain immediate
understanding of new trends and customer/market shifts as they occur.
http://www.sovtex.ru/en/enterprise-service-bus-esb/
Building an Architecture for Fast Data
1. Ingest the data feed
2. Make decisions on each event in the feed
3. Provide visibility into fast-moving data with real-time analytics
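The three steps above can be sketched in a few lines. This is a toy, in-memory illustration only; the event fields, threshold, and function names are all hypothetical, and a real deployment would sit on a message bus or ESB rather than a Python list.

```python
from collections import deque
from statistics import mean

def ingest(feed):
    """Step 1: yield events from the incoming feed."""
    yield from feed

def decide(event, threshold=100.0):
    """Step 2: make a decision on each event, e.g. flag large amounts."""
    return "FLAG" if event["amount"] > threshold else "OK"

def run(feed, window=3):
    """Step 3: emit decisions plus a rolling average for visibility."""
    recent = deque(maxlen=window)
    out = []
    for event in ingest(feed):
        recent.append(event["amount"])
        out.append((event["id"], decide(event), mean(recent)))
    return out

feed = [{"id": 1, "amount": 40.0},
        {"id": 2, "amount": 250.0},
        {"id": 3, "amount": 10.0}]
print(run(feed))  # [(1, 'OK', 40.0), (2, 'FLAG', 145.0), (3, 'OK', 100.0)]
```

The point is the shape of the pipeline: decisions happen per event as it arrives, while the rolling window gives immediate visibility into the stream.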
BUT!
What about the storage???
Real-time Analytics with HBase
HBase is an open-source, non-relational, distributed database modeled after
Google's Bigtable and written in Java.
Kudu vs HDFS and HBase
Apache Kudu
Kudu
• Kudus are a kind of antelope
• Found in eastern and southern Africa
THIS IS NOT Apache Kudu
This is Apache Kudu
Motivation
• Reducing architectural complexity
• Performance (for table-based operations)
• Reliability across globally-distributed data centers
What Kudu is and is not
• Apache Kudu is an open source columnar storage engine. It
promises low latency random access and efficient execution of
analytical queries.
What Kudu is and is not
• Apache Kudu is not really a SQL interface for Hadoop but a well-optimized
columnar database designed to fit into the Hadoop ecosystem. It has been
integrated to work with Impala, MapReduce, and Spark, and additional
framework integrations are expected. The idea is that it can provide very
fast scan performance.
• Apache Kudu is a "storage engine" or perhaps a "database" project that is
delivered on a non-HDFS filesystem. Its underlying storage format could be
considered competitive with file formats like Parquet.
• Note that Kudu does not store its data in HDFS and is not truly
complementary to HDFS. It manages its own storage, separate from Hadoop's
filesystem, which lets Kudu update data in place, something HDFS's
append-only model cannot do.
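The update-in-place distinction is worth making concrete. Below is a deliberately tiny, hypothetical in-memory contrast: an HDFS-style append-only log can only accumulate new records, while a Kudu-style keyed table can overwrite a row by primary key.

```python
class AppendOnlyLog:
    """HDFS-style: records can only be appended, never rewritten."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

class KeyedTable:
    """Kudu-style: one row per primary key, updatable in place."""
    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        self.rows[key] = value  # insert or overwrite the row for this key

log = AppendOnlyLog()
log.append(("user1", "v1"))
log.append(("user1", "v2"))   # old version still sits in the log

table = KeyedTable()
table.upsert("user1", "v1")
table.upsert("user1", "v2")   # overwrites; exactly one row per key

print(len(log.records), len(table.rows))  # 2 1
```

With the log, readers must reconcile multiple versions downstream; with the keyed table, the latest value is simply there, which is what makes Kudu-style storage attractive for mutable analytical data.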
Life in the Fast Lane: Hello, Druid
Not playing well with Hadoop?? Bye bye, Kudu
This is not Apache Druid
This is Apache Druid
What is Druid?
• Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics ("OLAP" queries) on large data sets. Druid is most
often used as a database for powering use cases where real-time ingest, fast
query performance, and high uptime are important.
• As such, Druid is commonly used for powering GUIs of analytical
applications, or as a backend for highly concurrent APIs that need fast
aggregations. Druid works best with event-oriented data.
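Druid accepts SQL over HTTP: you POST a JSON body of the form `{"query": ...}` to the `/druid/v2/sql` endpoint. The sketch below only builds and prints such a payload; the table name `web_events` is a made-up example, and sending it would require a running Druid (the quickstart's router listens on port 8888).

```python
import json

def druid_sql_payload(sql, context=None):
    """Build the JSON body for a POST to Druid's /druid/v2/sql endpoint."""
    body = {"query": sql}
    if context:
        body["context"] = context   # optional per-query settings, e.g. timeout
    return json.dumps(body)

payload = druid_sql_payload(
    "SELECT channel, COUNT(*) AS views "
    "FROM web_events "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
    "GROUP BY channel",
    context={"timeout": 30000},
)
print(payload)

# To actually run it against a local quickstart, something like:
#   requests.post("http://localhost:8888/druid/v2/sql",
#                 data=payload,
#                 headers={"Content-Type": "application/json"})
```

Note the `__time` column: every Druid table has one, and filtering on it is what lets Druid prune partitions by time range.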
Apache Druid
1. Columnar storage format. Druid uses column-oriented storage, meaning it
only needs to load the exact columns needed for a particular query. This
gives a huge speed boost to queries that hit only a few columns. In addition,
each column is stored optimized for its particular data type, which supports
fast scans and aggregations.
2. Scalable distributed system. Druid is typically deployed in clusters of
tens to hundreds of servers, and can offer ingest rates of millions of
records/sec, retention of trillions of records, and query latencies of
sub-second to a few seconds.
3. Massively parallel processing. Druid can process a query in parallel
across the entire cluster.
4. Real-time or batch ingestion. Druid can ingest data either in real time
(ingested data is immediately available for querying) or in batches.
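The columnar-storage point is easy to demonstrate in miniature. This toy layout (hypothetical, in plain Python lists) keeps one array per column, so an aggregation over `views` reads a single contiguous array and never touches the other columns:

```python
rows = [("2021-01-24", "home", 3),
        ("2021-01-24", "search", 7),
        ("2021-01-25", "home", 2)]

# Column store: one array per column, instead of keeping each record whole.
columns = {
    "date":  [r[0] for r in rows],
    "page":  [r[1] for r in rows],
    "views": [r[2] for r in rows],
}

# SUM(views) scans only the "views" array; "date" and "page" are never read.
total_views = sum(columns["views"])
print(total_views)  # 12
```

A row store would have to walk every full record to compute the same sum; that difference, multiplied across billions of rows and type-specific compression per column, is where the speed boost comes from.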
Apache Druid
5. Self-healing, self-balancing, easy to operate. As an operator, to scale
the cluster out or in, simply add or remove servers and the cluster will
rebalance itself automatically, in the background, without any downtime. If
any Druid servers fail, the system will automatically route around the
damage until those servers can be replaced. Druid is designed to run 24/7
with no need for planned downtime for any reason, including configuration
changes and software updates.
6. Cloud-native, fault-tolerant architecture that won't lose data. Once
Druid has ingested your data, a copy is stored safely in deep storage
(typically cloud storage, HDFS, or a shared filesystem). Your data can be
recovered from deep storage even if every single Druid server fails. For
more limited failures affecting just a few Druid servers, replication
ensures that queries are still possible while the system recovers.
7. Indexes for quick filtering. Druid uses Roaring or CONCISE compressed
bitmap indexes to power fast filtering and searching across multiple
columns.
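The "indexes for quick filtering" idea can be sketched with plain integers standing in for bitmaps (real Druid uses compressed Roaring/CONCISE bitmaps; the column values below are made up). Each distinct value maps to a bitmask of the rows containing it, so a multi-column filter becomes a bitwise AND:

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to a bitmask of the rows containing it."""
    index = defaultdict(int)
    for row, v in enumerate(values):
        index[v] |= 1 << row        # set bit `row` for value v
    return index

country = ["US", "DE", "US", "US"]
device  = ["mobile", "mobile", "desktop", "mobile"]

country_idx = build_bitmap_index(country)
device_idx  = build_bitmap_index(device)

# WHERE country = 'US' AND device = 'mobile'  ->  AND the two bitmaps.
hits = country_idx["US"] & device_idx["mobile"]
matching_rows = [r for r in range(len(country)) if hits >> r & 1]
print(matching_rows)  # [0, 3]
```

The filter never scans the raw column data: it intersects two (compressed, in the real system) bitmaps, which is what makes highly selective filters across several columns cheap.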
Apache Druid
8. Time-based partitioning. Druid first partitions data by time, and can
additionally partition on other fields. This means time-based queries only
access the partitions that match the time range of the query, which leads to
significant performance improvements for time-based data.
9. Approximate algorithms. Druid includes algorithms for approximate
count-distinct, approximate ranking, and computation of approximate
histograms and quantiles. These algorithms offer bounded memory usage and
are often substantially faster than exact computations. For situations where
accuracy is more important than speed, Druid also offers exact
count-distinct and exact ranking.
10. Automatic summarization at ingest time. Druid optionally supports data
summarization at ingestion time. This summarization partially pre-aggregates
your data, and can lead to big cost savings and performance boosts.
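Ingest-time summarization (Druid calls it rollup) is straightforward to illustrate: events sharing the same truncated timestamp and dimension values collapse into one pre-aggregated row. The events and hour-granularity choice below are a hypothetical minimal example:

```python
from collections import defaultdict

events = [
    {"ts": "2021-01-24T10:01", "page": "home",   "views": 1},
    {"ts": "2021-01-24T10:07", "page": "home",   "views": 1},
    {"ts": "2021-01-24T10:20", "page": "search", "views": 1},
    {"ts": "2021-01-24T11:02", "page": "home",   "views": 1},
]

def rollup(events):
    """Collapse events into one row per (hour bucket, page), summing views."""
    agg = defaultdict(int)
    for e in events:
        hour = e["ts"][:13]          # truncate "2021-01-24T10:01" -> hour bucket
        agg[(hour, e["page"])] += e["views"]
    return dict(agg)

summary = rollup(events)
print(summary)
```

Four raw events become three stored rows here; at real scale the reduction is often orders of magnitude, which is where the cost and performance wins come from. The trade-off is that individual raw events are no longer queryable.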
Druid Architecture
Who wants to use Apache Druid?
• A fast, modern analytics database
• Druid is designed for workflows where fast ad-hoc analytics, instant data
visibility, or support for high concurrency is important. As such, Druid is
often used to power UIs where an interactive, consistent user experience is
desired.
• Easy integration with your existing data pipelines
• Druid streams data from message buses such as Kafka and Amazon Kinesis, and
batch-loads files from data lakes such as HDFS and Amazon S3. Druid supports
most popular file formats for structured and semi-structured data.
• Fast, consistent queries at high concurrency
• Druid has been benchmarked to greatly outperform legacy solutions. Druid
combines novel storage ideas, indexing structures, and both exact and
approximate queries to return most results in under a second.
Who is using Apache Druid?
• Broad applicability
• Druid unlocks new types of queries and workflows for clickstream, APM,
supply chain, network telemetry, digital marketing, risk/fraud, and many
other types of data. Druid is purpose-built for rapid, ad-hoc queries on
both real-time and historical data.
• Deploy in public, private, and hybrid clouds
• Druid can be deployed in any *NIX environment on commodity hardware, both
in the cloud and on premises. Deploying Druid is easy: scaling up and down
is as simple as adding and removing Druid services.
So who is using Druid?
@Uber
@Nielsen
@Airbnb
@Shopee
@Netflix
@Spotify
@SuperAwesome
Getting Started
• Quickstart
• This quickstart gets you started with Apache Druid and introduces you to
some of its basic features. Following these steps, you will install Druid
and load sample data using its native batch ingestion feature.
• Before starting, you may want to read the general Druid overview and
ingestion overview, as the tutorials refer to concepts discussed on those
pages.
• Requirements
• You can follow these steps on a relatively small machine, such as a laptop
with around 4 CPUs and 16 GB of RAM.
• Druid comes with several startup configuration profiles for a range of
machine sizes. The micro-quickstart configuration profile shown here is
suitable for evaluating Druid. If you want to try out Druid's performance or
scaling capabilities, you'll need a larger machine and configuration profile.
• The configuration profiles included with Druid range from the even smaller
nano-quickstart configuration (1 CPU, 4 GB RAM) to the x-large configuration
(64 CPUs, 512 GB RAM). For more information, see Single server deployment.
• https://druid.apache.org/docs/latest/tutorials/index.html
References
1. https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
2. https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
3. https://mapr.com/blog/much-ado-about-kudu/
4. https://blog.clairvoyantsoft.com/guide-to-using-apache-kudu-and-performance-comparison-with-hdfs-453c4b26554f
5. https://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
6. https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
7. https://kudu.apache.org/docs/quickstart.html
8. https://kudu.apache.org/docs/index.html
9. http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/
10. https://netflixtechblog.com/how-netflix-uses-druid-for-real-time-insights-to-ensure-a-high-quality-experience-19e1e8568d06
11. https://open.spotify.com/episode/0zu8cEUrm0b7e41l3jWjOK
12. https://druid.apache.org/
