Fast Analytics: Kudu to Druid
Alex Pongpech, 2021
What the heck is it?
• Fast Analytics (FA) is about delivering analytics at decision-making speed.
• You need to know quickly.
1/24/2021 2
https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
Why oh why?
• Life is continuous in time (until it ends); it will not wait for you to make
your decision! (I will take my money to another service provider if I have to
wait too long.)
• The clock is ticking and the information is flowing.
https://targetdatacorp.com/customer-data/
That's every minute, y'all!
https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
Why oh WHY?
• What if your life depends on it?
• Drug Discovery
• Precision Medicine
• Point of Care/Patient 360
• Insurance Fraud
Quick decisions based on your personal data might save your life!
How?
• By processing high-velocity, high-volume Big Data in real time through an
Enterprise Service Bus (ESB), enabling decision-makers to gain immediate
understanding of new trends and customer/market shifts as they occur.
http://www.sovtex.ru/en/enterprise-service-bus-esb/
Building an Architecture for Fast Data
1. Ingest the data feed
2. Make decisions on each event in the feed
3. Provide visibility into fast-moving data with real-time analytics
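The three steps above can be sketched in a few lines. This is a toy, in-memory illustration only; the event fields, threshold, and function names are all hypothetical, and a real deployment would sit on a message bus or ESB rather than a Python list.

```python
from collections import deque
from statistics import mean

def ingest(feed):
    """Step 1: yield events from the incoming feed."""
    yield from feed

def decide(event, threshold=100.0):
    """Step 2: make a decision on each event, e.g. flag large amounts."""
    return "FLAG" if event["amount"] > threshold else "OK"

def run(feed, window=3):
    """Step 3: emit decisions plus a rolling average for visibility."""
    recent = deque(maxlen=window)
    out = []
    for event in ingest(feed):
        recent.append(event["amount"])
        out.append((event["id"], decide(event), mean(recent)))
    return out

feed = [{"id": 1, "amount": 40.0},
        {"id": 2, "amount": 250.0},
        {"id": 3, "amount": 10.0}]
print(run(feed))  # [(1, 'OK', 40.0), (2, 'FLAG', 145.0), (3, 'OK', 100.0)]
```

The point is the shape of the pipeline: decisions happen per event as it arrives, while the rolling window gives immediate visibility into the stream.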
BUT!
What about the storage???
Real-time Analytics with HBase
HBase is an open-source, non-relational, distributed database modeled after
Google's Bigtable and written in Java.
Kudu vs HDFS and HBase
Apache Kudu
Kudu
• Kudus are a kind of antelope
• Found in eastern and southern Africa
THIS IS NOT Apache Kudu
This is Apache Kudu
Motivation
• Reducing architectural complexity
• Performance (for table-based operations)
• Reliability across globally-distributed data centers
What Kudu is and is not
• Apache Kudu is an open source columnar storage engine. It
promises low latency random access and efficient execution of
analytical queries.
What Kudu is and is not
• Apache Kudu is not really a SQL interface for Hadoop but a well-optimized
columnar database designed to fit into the Hadoop ecosystem. It has been
integrated to work with Impala, MapReduce, and Spark, and additional
framework integrations are expected. The idea is that it can provide very
fast scan performance.
• Apache Kudu is a "storage engine" or perhaps a "database" project that is
delivered on a non-HDFS filesystem. Its underlying storage format could be
considered competitive with file formats like Parquet.
• Note that Kudu does not store its data in HDFS and is not truly
complementary to HDFS. It manages its own storage, separate from Hadoop's
filesystem, which lets Kudu update data in place, something HDFS's
append-only model cannot do.
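The update-in-place distinction is worth making concrete. Below is a deliberately tiny, hypothetical in-memory contrast: an HDFS-style append-only log can only accumulate new records, while a Kudu-style keyed table can overwrite a row by primary key.

```python
class AppendOnlyLog:
    """HDFS-style: records can only be appended, never rewritten."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

class KeyedTable:
    """Kudu-style: one row per primary key, updatable in place."""
    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        self.rows[key] = value  # insert or overwrite the row for this key

log = AppendOnlyLog()
log.append(("user1", "v1"))
log.append(("user1", "v2"))   # old version still sits in the log

table = KeyedTable()
table.upsert("user1", "v1")
table.upsert("user1", "v2")   # overwrites; exactly one row per key

print(len(log.records), len(table.rows))  # 2 1
```

With the log, readers must reconcile multiple versions downstream; with the keyed table, the latest value is simply there, which is what makes Kudu-style storage attractive for mutable analytical data.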
Life in the Fast Lane: Hello, Druid
Not playing well with Hadoop?? Bye bye, Kudu
This is not Apache Druid
This is Apache Druid
What is Druid?
• Apache Druid is a real-time analytics database designed for fast
slice-and-dice analytics ("OLAP" queries) on large data sets. Druid is most
often used as a database for powering use cases where real-time ingest, fast
query performance, and high uptime are important.
• As such, Druid is commonly used for powering GUIs of analytical
applications, or as a backend for highly concurrent APIs that need fast
aggregations. Druid works best with event-oriented data.
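Druid accepts SQL over HTTP: you POST a JSON body of the form `{"query": ...}` to the `/druid/v2/sql` endpoint. The sketch below only builds and prints such a payload; the table name `web_events` is a made-up example, and sending it would require a running Druid (the quickstart's router listens on port 8888).

```python
import json

def druid_sql_payload(sql, context=None):
    """Build the JSON body for a POST to Druid's /druid/v2/sql endpoint."""
    body = {"query": sql}
    if context:
        body["context"] = context   # optional per-query settings, e.g. timeout
    return json.dumps(body)

payload = druid_sql_payload(
    "SELECT channel, COUNT(*) AS views "
    "FROM web_events "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
    "GROUP BY channel",
    context={"timeout": 30000},
)
print(payload)

# To actually run it against a local quickstart, something like:
#   requests.post("http://localhost:8888/druid/v2/sql",
#                 data=payload,
#                 headers={"Content-Type": "application/json"})
```

Note the `__time` column: every Druid table has one, and filtering on it is what lets Druid prune partitions by time range.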
Apache Druid
1. Columnar storage format. Druid uses column-oriented storage, meaning it
only needs to load the exact columns needed for a particular query. This
gives a huge speed boost to queries that hit only a few columns. In addition,
each column is stored optimized for its particular data type, which supports
fast scans and aggregations.
2. Scalable distributed system. Druid is typically deployed in clusters of
tens to hundreds of servers, and can offer ingest rates of millions of
records/sec, retention of trillions of records, and query latencies of
sub-second to a few seconds.
3. Massively parallel processing. Druid can process a query in parallel
across the entire cluster.
4. Real-time or batch ingestion. Druid can ingest data either in real time
(ingested data is immediately available for querying) or in batches.
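The columnar-storage point is easy to demonstrate in miniature. This toy layout (hypothetical, in plain Python lists) keeps one array per column, so an aggregation over `views` reads a single contiguous array and never touches the other columns:

```python
rows = [("2021-01-24", "home", 3),
        ("2021-01-24", "search", 7),
        ("2021-01-25", "home", 2)]

# Column store: one array per column, instead of keeping each record whole.
columns = {
    "date":  [r[0] for r in rows],
    "page":  [r[1] for r in rows],
    "views": [r[2] for r in rows],
}

# SUM(views) scans only the "views" array; "date" and "page" are never read.
total_views = sum(columns["views"])
print(total_views)  # 12
```

A row store would have to walk every full record to compute the same sum; that difference, multiplied across billions of rows and type-specific compression per column, is where the speed boost comes from.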
Apache Druid
5. Self-healing, self-balancing, easy to operate. As an operator, to scale
the cluster out or in, simply add or remove servers and the cluster will
rebalance itself automatically, in the background, without any downtime. If
any Druid servers fail, the system will automatically route around the
damage until those servers can be replaced. Druid is designed to run 24/7
with no need for planned downtime for any reason, including configuration
changes and software updates.
6. Cloud-native, fault-tolerant architecture that won't lose data. Once
Druid has ingested your data, a copy is stored safely in deep storage
(typically cloud storage, HDFS, or a shared filesystem). Your data can be
recovered from deep storage even if every single Druid server fails. For
more limited failures affecting just a few Druid servers, replication
ensures that queries are still possible while the system recovers.
7. Indexes for quick filtering. Druid uses Roaring or CONCISE compressed
bitmap indexes to power fast filtering and searching across multiple
columns.
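The "indexes for quick filtering" idea can be sketched with plain integers standing in for bitmaps (real Druid uses compressed Roaring/CONCISE bitmaps; the column values below are made up). Each distinct value maps to a bitmask of the rows containing it, so a multi-column filter becomes a bitwise AND:

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to a bitmask of the rows containing it."""
    index = defaultdict(int)
    for row, v in enumerate(values):
        index[v] |= 1 << row        # set bit `row` for value v
    return index

country = ["US", "DE", "US", "US"]
device  = ["mobile", "mobile", "desktop", "mobile"]

country_idx = build_bitmap_index(country)
device_idx  = build_bitmap_index(device)

# WHERE country = 'US' AND device = 'mobile'  ->  AND the two bitmaps.
hits = country_idx["US"] & device_idx["mobile"]
matching_rows = [r for r in range(len(country)) if hits >> r & 1]
print(matching_rows)  # [0, 3]
```

The filter never scans the raw column data: it intersects two (compressed, in the real system) bitmaps, which is what makes highly selective filters across several columns cheap.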
Apache Druid
8. Time-based partitioning. Druid first partitions data by time, and can
additionally partition on other fields. This means time-based queries only
access the partitions that match the time range of the query, which leads to
significant performance improvements for time-based data.
9. Approximate algorithms. Druid includes algorithms for approximate
count-distinct, approximate ranking, and computation of approximate
histograms and quantiles. These algorithms offer bounded memory usage and
are often substantially faster than exact computations. For situations where
accuracy is more important than speed, Druid also offers exact
count-distinct and exact ranking.
10. Automatic summarization at ingest time. Druid optionally supports data
summarization at ingestion time. This summarization partially pre-aggregates
your data, and can lead to big cost savings and performance boosts.
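Ingest-time summarization (Druid calls it rollup) is straightforward to illustrate: events sharing the same truncated timestamp and dimension values collapse into one pre-aggregated row. The events and hour-granularity choice below are a hypothetical minimal example:

```python
from collections import defaultdict

events = [
    {"ts": "2021-01-24T10:01", "page": "home",   "views": 1},
    {"ts": "2021-01-24T10:07", "page": "home",   "views": 1},
    {"ts": "2021-01-24T10:20", "page": "search", "views": 1},
    {"ts": "2021-01-24T11:02", "page": "home",   "views": 1},
]

def rollup(events):
    """Collapse events into one row per (hour bucket, page), summing views."""
    agg = defaultdict(int)
    for e in events:
        hour = e["ts"][:13]          # truncate "2021-01-24T10:01" -> hour bucket
        agg[(hour, e["page"])] += e["views"]
    return dict(agg)

summary = rollup(events)
print(summary)
```

Four raw events become three stored rows here; at real scale the reduction is often orders of magnitude, which is where the cost and performance wins come from. The trade-off is that individual raw events are no longer queryable.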
Druid Architecture
Who wants to use Apache Druid?
• A fast, modern analytics database
• Druid is designed for workflows where fast ad-hoc analytics, instant data
visibility, or support for high concurrency is important. As such, Druid is
often used to power UIs where an interactive, consistent user experience is
desired.
• Easy integration with your existing data pipelines
• Druid streams data from message buses such as Kafka and Amazon Kinesis, and
batch-loads files from data lakes such as HDFS and Amazon S3. Druid supports
most popular file formats for structured and semi-structured data.
• Fast, consistent queries at high concurrency
• Druid has been benchmarked to greatly outperform legacy solutions. Druid
combines novel storage ideas, indexing structures, and both exact and
approximate queries to return most results in under a second.
Who is using Apache Druid?
• Broad applicability
• Druid unlocks new types of queries and workflows for clickstream, APM,
supply chain, network telemetry, digital marketing, risk/fraud, and many
other types of data. Druid is purpose-built for rapid, ad-hoc queries on
both real-time and historical data.
• Deploy in public, private, and hybrid clouds
• Druid can be deployed in any *NIX environment on commodity hardware, both
in the cloud and on premises. Deploying Druid is easy: scaling up and down
is as simple as adding and removing Druid services.
So who is using Druid?
@Uber
@Nielsen
@Airbnb
@Shopee
@Netflix
@Spotify
@SuperAwesome
Getting Started
• Quickstart
• This quickstart gets you started with Apache Druid and introduces you to
some of its basic features. Following these steps, you will install Druid
and load sample data using its native batch ingestion feature.
• Before starting, you may want to read the general Druid overview and
ingestion overview, as the tutorials refer to concepts discussed on those
pages.
• Requirements
• You can follow these steps on a relatively small machine, such as a laptop
with around 4 CPUs and 16 GB of RAM.
• Druid comes with several startup configuration profiles for a range of
machine sizes. The micro-quickstart configuration profile shown here is
suitable for evaluating Druid. If you want to try out Druid's performance or
scaling capabilities, you'll need a larger machine and configuration profile.
• The configuration profiles included with Druid range from the even smaller
nano-quickstart configuration (1 CPU, 4 GB RAM) to the x-large configuration
(64 CPUs, 512 GB RAM). For more information, see Single server deployment.
• https://druid.apache.org/docs/latest/tutorials/index.html
References
1. https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
2. https://techstartups.com/2018/05/21/how-much-data-do-we-create-every-day-infographic/
3. https://mapr.com/blog/much-ado-about-kudu/
4. https://blog.clairvoyantsoft.com/guide-to-using-apache-kudu-and-performance-comparison-with-hdfs-453c4b26554f
5. https://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
6. https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines
7. https://kudu.apache.org/docs/quickstart.html
8. https://kudu.apache.org/docs/index.html
9. http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/
10. https://netflixtechblog.com/how-netflix-uses-druid-for-real-time-insights-to-ensure-a-high-quality-experience-19e1e8568d06
11. https://open.spotify.com/episode/0zu8cEUrm0b7e41l3jWjOK
12. https://druid.apache.org/
