Fast Analytics (FA) uses an Enterprise Service Bus (ESB) to process high volumes of big data in real time, enabling decision makers to understand new trends and shifts as they occur. FA delivers analytics at decision-making speeds through technologies like Apache Kudu, which provides low-latency random access and efficient analytical queries on columnar data. Kudu uses a log-structured storage approach and the Raft consensus algorithm to replicate data across nodes for reliability and high availability.
2. What the heck is it?
• Fast Analytics (FA) is about delivering analytics at decision-making speeds on fast data!
Quickly need to know.
https://www.tibco.com/blog/2015/03/27/how-analytics-facilitates-fast-data/
3. Why oh why?
• Life is a period of continuous time (until it ends); seriously, life cannot wait for you to make your decision! ("I will take my money to another service provider if I have to wait too long.")
• The clock is ticking and the information is flowing.
https://targetdatacorp.com/customer-data/
5. Why oh WHY?
• What if your life depends on it?
• Drug Discovery
• Precision Medicine
• Point of Care / Patient 360
• Insurance Fraud
Quick decisions based on your personal data might save your life!
6. How?
• By processing high-velocity, high-volume Big Data in real time through the use of an Enterprise Service Bus (ESB), enabling decision-makers to gain immediate understanding of new trends and customer/market shifts as they occur.
http://www.sovtex.ru/en/enterprise-service-bus-esb/
7. Real-time Analytics
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
10. With Kudu
• Apache Hive was one of the first SQL-like query interfaces developed over distributed data on top of Hadoop. Hive converts queries to Hadoop MapReduce jobs.
• Apache Impala uses its own parallel processing architecture on top of HDFS instead of MapReduce jobs. Kudu and Impala are best used together. Unlike Hive, Impala never translates its SQL queries into MapReduce jobs; it executes them natively.
• Apache Spark is a cluster computing technology. It is not strictly dependent on Hadoop because it has its own cluster management. However, Spark is usually implemented on top of Hadoop, which takes care of distributed data storage. Spark SQL is a Spark component on top of Spark Core that provides a way of querying and persisting structured and semi-structured data.
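To ground the Spark path, here is a minimal sketch of querying a Kudu table from Spark SQL via the kudu-spark data source. The master address and table name are hypothetical; recent kudu-spark versions register the short format name "kudu", while older ones require the full source class name.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KuduSparkRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("kudu-spark-sketch")
            .getOrCreate();

        // Load a Kudu table as a DataFrame; no MapReduce involved.
        Dataset<Row> users = spark.read()
            .format("kudu")
            .option("kudu.master", "kudu-master:7051") // hypothetical address
            .option("kudu.table", "users")             // hypothetical table
            .load();

        // Query structured data with Spark SQL.
        users.createOrReplaceTempView("users");
        spark.sql("SELECT COUNT(*) FROM users").show();
    }
}
```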
14. Motivation
• Reducing architectural complexity
• Performance (for table-based operations)
• Reliability across globally-distributed data centers
15. What is and is not
• Apache Kudu is an open source columnar storage engine. It promises low-latency random access and efficient execution of analytical queries.
16. What is and is not
• Apache Kudu is not really a SQL interface for Hadoop but a very well optimized columnar database designed to fit in with the Hadoop ecosystem. It has been integrated to work with Impala, MapReduce, and Spark, and additional framework integrations are expected. The idea is that it can provide very fast scan performance.
• Apache Kudu is a "storage engine", or perhaps a "database" project, that is delivered on a non-HDFS filesystem. This underlying storage format could be considered competitive with file formats like Parquet.
• Note that Kudu is not compatible with HDFS and is not truly complementary to HDFS. It runs on a completely separate filesystem from Hadoop, which enables Kudu to update data, very much unlike HDFS.
17. Basic Design
• From a user perspective, Kudu is a storage system for tables of structured data, where:
• Tables have a well-defined schema consisting of a predefined number of typed columns.
• Each table has a primary key composed of one or more of its columns.
• The primary key enforces a uniqueness constraint (no two rows can share the same key) and acts as an index for efficient updates and deletes.
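To make the schema and primary-key ideas concrete, here is a minimal sketch using Kudu's Java client. The 'users' table, master address, column names, and bucket count are illustrative assumptions; note that Kudu requires a partitioning scheme at creation time, which later slides discuss.

```java
import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateUsersTable {
    public static void main(String[] args) throws Exception {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            // Well-defined schema: a fixed set of typed columns,
            // with "id" marked as the (single-column) primary key.
            Schema schema = new Schema(Arrays.asList(
                new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING).build()));

            CreateTableOptions options = new CreateTableOptions()
                .addHashPartitions(Arrays.asList("id"), 4) // 4 tablets via hashing
                .setNumReplicas(3);                        // Raft replication factor

            client.createTable("users", schema, options);
        }
    }
}
```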
18. Basic Design
• Kudu tables are composed of a series of logical subsets of data, similar to partitions in relational database systems, called Tablets.
• Kudu provides data durability and protection against hardware failure by replicating these Tablets to multiple commodity hardware nodes using the Raft consensus algorithm.
• Tablets are typically tens of gigabytes, and an individual node typically holds 10-100 Tablets.
20. Basic Design
• Tablet: a tablet is a contiguous segment of a table, similar to a partition in other data storage engines or relational databases. A given tablet is replicated on multiple tablet servers, and at any given point in time, one of these replicas is considered the leader tablet. Any replica can service reads, and writes require consensus among the set of tablet servers serving the tablet.
• Tablet Server: a tablet server stores and serves tablets to clients. For a given tablet, one tablet server acts as a leader, and the others act as follower replicas of that tablet. Only leaders service write requests, while leaders or followers each service read requests. Leaders are elected using the Raft consensus algorithm. One tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers.
23. Basic Design
• Kudu has a master process responsible for managing the metadata that describes the logical structure of the data stored in tablet servers (the catalog), acting as a coordinator when recovering from hardware failure, and keeping track of which tablet servers are responsible for hosting replicas of each tablet.
• Multiple standby master servers can be defined to provide high availability. In Kudu, many responsibilities typically associated with master processes can be delegated to the tablet servers due to Kudu's implementation of Raft consensus, and the architecture provides a path to partitioning the master's duties across multiple machines in the future.
• We do not anticipate that Kudu's master process will become the bottleneck to overall cluster performance; in tests on a 250-node cluster, the server hosting the master process has been nowhere near saturation.
24. Basic Design
• Master: the master keeps track of all the tablets, tablet servers, the Catalog Table, and other metadata related to the cluster.
• At a given point in time, there can only be one acting master (the leader). If the current leader disappears, a new master is elected using the Raft consensus algorithm.
• The master also coordinates metadata operations for clients. For example, when creating a new table, the client internally sends the request to the master. The master writes the metadata for the new table into the catalog table and coordinates the process of creating tablets on the tablet servers. All the master's data is stored in a tablet, which can be replicated to all the other candidate masters. Tablet servers heartbeat to the master at a set interval (the default is once per second).
25. Basic Design
• Raft Consensus Algorithm: Kudu uses the Raft consensus algorithm as a means to guarantee fault tolerance and consistency, both for regular tablets and for master data. Through Raft, multiple replicas of a tablet elect a leader, which is responsible for accepting and replicating writes to follower replicas. Once a write is persisted in a majority of replicas, it is acknowledged to the client. A given group of N replicas (usually 3 or 5) is able to accept writes with at most (N - 1)/2 faulty replicas.
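Spelling that bound out: a write needs a majority, so a group of N replicas tolerates

\[
f_{\max} = \left\lfloor \frac{N-1}{2} \right\rfloor, \qquad N = 3 \Rightarrow f_{\max} = 1, \qquad N = 5 \Rightarrow f_{\max} = 2.
\]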
26. Basic Design
• Data stored in Kudu is updateable through the use of a variation of log-structured storage, in which updates, inserts, and deletes are temporarily buffered in memory before being merged into persistent columnar storage.
• Kudu protects against the spikes in query latency generally associated with such architectures by constantly performing small maintenance operations, such as compactions, so that large maintenance operations are never necessary.
• Data Compression: because a given column contains only one type of data, pattern-based compression can be orders of magnitude more efficient than compressing the mixed data types used in row-based solutions. Combined with the efficiencies of reading data from columns, compression allows you to fulfill your query while reading even fewer blocks from disk.
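Per-column encoding and compression can be chosen when the schema is defined; a hedged sketch with the Java client follows. Column names and codec pairings are illustrative, and note that bitshuffle output is already compressed internally, so it is shown without extra block compression.

```java
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Type;

public class ColumnCodecs {
    public static void main(String[] args) {
        // Low-cardinality strings: dictionary encoding plus LZ4 block compression.
        ColumnSchema host = new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING)
            .encoding(ColumnSchema.Encoding.DICT_ENCODING)
            .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.LZ4)
            .build();

        // Numeric values: bitshuffle exploits the single-type, pattern-rich layout.
        ColumnSchema value = new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE)
            .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
            .build();

        System.out.println(host + "\n" + value);
    }
}
```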
27. Basic Design
• Catalog Table: the catalog table is the central location for Kudu's metadata. It stores information about tables and tablets. The catalog table may not be read or written directly; instead, it is accessible only via metadata operations exposed in the client API. The catalog table stores two categories of metadata:
• Tables: table schemas, locations, and states
• Tablets: the list of existing tablets, which tablet servers have replicas of each tablet, the tablet's current state, and start and end keys.
28. Basic Design
• Logical Replication: Kudu replicates operations, not on-disk data. This is referred to as logical replication, as opposed to physical replication.
• This has several advantages: although inserts and updates do transmit data over the network, deletes do not need to move any data. The delete operation is sent to each tablet server, which performs the delete locally. Physical operations, such as compaction, do not need to transmit data over the network in Kudu.
• This is different from storage systems that use HDFS, where blocks need to be transmitted over the network to fulfill the required number of replicas. Tablets do not need to perform compactions at the same time or on the same schedule, or otherwise remain in sync on the physical storage layer. This decreases the chances of all tablet servers experiencing high latency at the same time due to compactions or heavy write loads.
29. Basic Design
• Kudu provides direct APIs, in both C++ and Java, that allow for point and batch retrieval of rows, writes, deletes, schema changes, and more. In addition, Kudu is designed to integrate with and improve existing Hadoop ecosystem tools. With Kudu's beta release, integrations with Impala, MapReduce, and Apache Spark are available. Over time we plan on making Kudu a supported storage option for most or all of the Hadoop ecosystem tools.
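A hedged sketch of the Java API's write and scan paths, reusing the hypothetical 'users' table from the earlier sketch (error handling elided):

```java
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.RowResultIterator;

public class WriteAndScan {
    public static void main(String[] args) throws Exception {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            KuduTable table = client.openTable("users");

            // Point write: operations are applied through a session.
            KuduSession session = client.newSession();
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            row.addLong("id", 1L);
            row.addString("name", "ada");
            session.apply(insert);
            session.close(); // flushes any buffered operations

            // Batch retrieval: scan the table back.
            KuduScanner scanner = client.newScannerBuilder(table).build();
            while (scanner.hasMoreRows()) {
                RowResultIterator results = scanner.nextRows();
                while (results.hasNext()) {
                    System.out.println(results.next().rowToString());
                }
            }
        }
    }
}
```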
30. Why not?
• Not a good fit for transactional workloads. (Analytic use cases almost exclusively use a subset of the columns in the queried table and generally aggregate values over a broad range of rows. This access pattern is greatly accelerated by column-oriented data. Operational use cases are more likely to access most or all of the columns in a row, and might be more appropriately served by row-oriented storage. A column-oriented storage format was chosen for Kudu because it is primarily targeted at analytic use cases.)
• Small Kudu tables get loaded almost as fast as HDFS tables. However, as the size increases, we do see load times becoming double those of HDFS, with the largest table, Lineitem, taking up to 4 times the load time.
31. Why not?
• Only one index is supported: this can be a problem if you have a lot of diversified queries that aggregate data by different variables (by timestamp, user, vehicle, etc.). The primary key cannot be changed after table creation.
• Does not redistribute replicas of tablets automatically: if you add a new node to your cluster, for example, Kudu will not redistribute the tablets so that the cluster nodes are balanced. You need to either recreate all your existing tables or move the tablets manually to balance your cluster, which might be tedious work.
32. Why not
• Does not support Sqoop: if you want to migrate your SQL warehouse tables to Kudu, you first need to Sqoop them to HDFS, and then use a tool like Apache Spark to migrate the data to Kudu.
• Dependent on Impala for querying: Impala uses MPP (massively parallel processing) to perform queries, which basically means that it uses all daemons to fetch and compute the data it needs, and stores results in memory. This is great if you need to perform a query that doesn't take too long (has few computations or doesn't move much data). If, however, you have a daily or monthly ETL with complex queries or massive inserts that demand daemons to be working for hours, then a single daemon failure stops the query, and it needs to be recomputed from the very beginning. This is a nightmare for meeting SLAs, a nightmare that does not happen with Apache Hive, which uses MapReduce to perform queries and thus saves intermediate results to disk, making it much more reliable for these types of scenarios. Kudu's current engine for querying is solely Impala, which may cause some issues for these use cases.
33. Why not
• Impala tables created with Kudu data cannot be truncated.
• Cannot add partitions dynamically unless they are ranged: at table-creation time you have to specify how you want to partition your table (divide it into tablets), and you have three options for doing so: hash partitioning, range partitioning, or a combination of the two. The problem is that in a production scenario your tables will keep increasing in volume, and eventually you will need to add more tablets to keep your performance up. This cannot be done if you use hash partitioning. Similarly, data consolidation is impossible in Kudu unless you create a new, separate table and perform a full insert into it, which may take some time.
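For tables that do use range partitioning, partitions can be added after creation; a hedged sketch with the Java client, assuming a hypothetical table range-partitioned on a 'ts' timestamp column:

```java
import org.apache.kudu.Schema;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

public class AddRangePartition {
    public static void main(String[] args) throws Exception {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            Schema schema = client.openTable("metrics_by_time").getSchema();

            // New range [Nov 2018, Dec 2018) expressed in epoch microseconds.
            PartialRow lower = schema.newPartialRow();
            lower.addLong("ts", 1_541_030_400L * 1_000_000L);
            PartialRow upper = schema.newPartialRow();
            upper.addLong("ts", 1_543_622_400L * 1_000_000L);

            client.alterTable("metrics_by_time",
                new AlterTableOptions().addRangePartition(lower, upper));
        }
    }
}
```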
34. Why (Performance comparison of different file formats and storage engines in the Hadoop ecosystem) [6]
• Storage efficiency: with Parquet or Kudu and Snappy compression, the total volume of the data can be reduced by a factor of 10 compared to an uncompressed simple serialization format.
• Data ingestion speed: all tested file-based solutions provide a faster ingestion rate (between 2x and 10x) than specialized storage engines or MapFiles (sorted sequences).
• Random data access time: using HBase or Kudu, typical random data lookup speed is below 500 ms. With smart HDFS namespace partitioning, Parquet can deliver random lookups on the level of a second but consumes more resources.
35. Why (Performance comparison of different file formats and storage engines in the Hadoop ecosystem) [6]
• Data analytics: with Parquet or Kudu it is possible to perform fast and scalable data aggregation, filtering, and reporting (typically more than 300k records per second per CPU core).
• Support for in-place data mutation: HBase and Kudu can modify records (schema and values) in place, which is not possible with data stored directly in HDFS files.
36. Why KUDU
• There are really no good alternative storage engines to Kudu in the Hadoop ecosystem that achieve great analytical query performance and, at the same time, allow you to change data in near real time.
• Kudu's documentation states that Kudu's intent is to complement HDFS and HBase, not to replace them, but for many use cases and smaller data sets, all you might need is Kudu and Impala with Spark.
37. Why KUDU
• A good case for Kudu is the ever-popular Data Lake architecture. It is not enough these days to build a batch-oriented Data Lake updated a few times a day. Many modern analytical projects (predictive alerts, anomaly detection, real-time dashboards, etc.) rely on data streamed in near real time from various source systems.
• If the requirement is for storage that performs as well as HDFS for analytical queries, with the additional flexibility of faster random access and RDBMS features such as updates/deletes/inserts, then Kudu could be considered a potential shortlist candidate.
Write once, read many... y'all
38. Kudu aims
• Fast processing of OLAP workloads.
• Integration with MapReduce, Spark, and other Hadoop ecosystem components.
• Tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet.
• Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict-serializable consistency.
39. Kudu aims
• Strong performance for running sequential and random workloads simultaneously.
• Easy to administer and manage with Cloudera Manager.
• High availability. Tablet Servers and Masters use the Raft Consensus Algorithm, which ensures that as long as more than half the total number of replicas is available, the tablet is available for reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5 replicas are available, the tablet is available.
40. Kudu aims
• Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure.
• Structured data model.
41. A few examples of applications for which Kudu is a great solution:
• Reporting applications where newly arrived data needs to be immediately available for end users
• Time-series applications that must simultaneously support:
  • queries across large amounts of historic data
  • granular queries about an individual entity that must return very quickly
• Applications that use predictive models to make real-time decisions, with periodic refreshes of the predictive model based on all historic data
42. Streaming Input with Near Real Time Availability
• A common challenge in data analysis is one where new data arrives rapidly and constantly, and the same data needs to be available in near real time for reads, scans, and updates. Kudu offers the powerful combination of fast inserts and updates with efficient columnar scans to enable real-time analytics use cases on a single storage layer.
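On the write side, the Java client's upsert (insert-or-update in a single operation) fits this rapid-arrival pattern; a hedged sketch against the hypothetical 'users' table, with background flushing for constant streams:

```java
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;
import org.apache.kudu.client.Upsert;

public class StreamingUpsert {
    public static void main(String[] args) throws Exception {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            KuduTable table = client.openTable("users");
            KuduSession session = client.newSession();
            // Buffer writes and flush in the background: suits constant arrivals.
            session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

            // Upsert inserts the row if key 1 is new, updates it otherwise,
            // so the latest value is readable in near real time.
            Upsert upsert = table.newUpsert();
            PartialRow row = upsert.getRow();
            row.addLong("id", 1L);
            row.addString("name", "ada lovelace");
            session.apply(upsert);

            session.close(); // flush anything still buffered
        }
    }
}
```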
43. Time-series application with widely varying access patterns
• A time-series schema is one in which data points are organized and keyed according to the time at which they occurred. This can be useful for investigating the performance of metrics over time or attempting to predict future behavior based on past data.
• For instance, time-series customer data might be used both to store purchase click-stream history and to predict future purchases, or for use by a customer support representative.
• While these different types of analysis are occurring, inserts and mutations may also be occurring individually and in bulk, and become available immediately to read workloads. Kudu can handle all of these access patterns simultaneously in a scalable and efficient manner.
44. Time-series application with widely varying access patterns
• Kudu is a good fit for time-series workloads for several reasons. With Kudu's support for hash-based partitioning, combined with its native support for compound row keys, it is simple to set up a table spread across many servers without the risk of "hotspotting" that is commonly observed when range partitioning is used. Kudu's columnar storage engine is also beneficial in this context, because many time-series workloads read only a few columns, as opposed to the whole row.
• In the past, you might have needed to use multiple data stores to handle different data access patterns. This practice adds complexity to your application and operations, and duplicates your data, doubling (or worse) the amount of storage required. Kudu can handle all of these access patterns natively and efficiently, without the need to off-load work to other data stores.
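A hedged sketch of the anti-hotspotting layout this slide describes: hash-partition on the series identifier, range-partition on time, with a compound (host, ts) primary key. The table name, column names, and bucket count are illustrative assumptions.

```java
import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class TimeSeriesTable {
    public static void main(String[] args) throws Exception {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
            // Compound primary key (host, ts): unique per series and instant.
            Schema schema = new Schema(Arrays.asList(
                new ColumnSchema.ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("ts", Type.UNIXTIME_MICROS).key(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));

            // Hash on host spreads concurrent writers across tablets (no hotspot);
            // range on ts keeps time-window scans confined to a few tablets.
            CreateTableOptions options = new CreateTableOptions()
                .addHashPartitions(Arrays.asList("host"), 8)
                .setRangePartitionColumns(Arrays.asList("ts"));

            client.createTable("metrics_ts", schema, options);
        }
    }
}
```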
45. Predictive Modeling
• Data scientists often develop predictive learning models from large sets of data.
• The model and the data may need to be updated or modified often as the learning takes place or as the situation being modeled changes.
• In addition, the scientist may want to change one or more factors in the model to see what happens over time. Updating a large set of data stored in files in HDFS is resource-intensive, as each file needs to be completely rewritten.
• In Kudu, updates happen in near real time. The scientist can tweak the value, re-run the query, and refresh the graph in seconds or minutes, rather than hours or days. In addition, batch or incremental algorithms can be run across the data at any time, with near-real-time results.
46. Combining Data in Kudu with Legacy Systems
• Companies generate data from multiple sources and store it in a variety of systems and formats. For instance, some of your data may be stored in Kudu, some in a traditional RDBMS, and some in files in HDFS. You can access and query all of these sources and formats using Impala, without the need to change your legacy systems.
50. Build a Prediction Engine using Spark, Kudu, and Impala
http://blog.cloudera.com/blog/2016/05/how-to-build-a-prediction-engine-using-spark-kudu-and-impala/
Businesses generate and collect huge volumes of complex data from a variety of sources: customer data, market data, supply-chain information, operational data, financial information, social media feeds, sensor data, and so much more.
In order to understand and act on these constant streams of information, it's vital for companies to be able to bring data sources together and give decision-makers the ability to analyze them quickly.
What did you have for lunch? etc
Impala/Parquet is really good at aggregating large data sets quickly (billions of rows and terabytes of data, OLAP stuff), and HBase is really good at handling a ton of small concurrent transactions (basically the mechanism for doing "OLTP" on Hadoop). The tradeoff is that Impala sucks at OLTP workloads and HBase sucks at OLAP workloads.