This document discusses key capabilities needed for real-time analytics. It notes that real-time data, combined with historical data, provides important context for decision making, and that building data pipelines with fewer systems and steps leads to greater scalability and reliability. The document outlines requirements for real-time analytics such as ingesting streaming data, powering analytic applications, delivering massive capacity, and guaranteeing performance, and emphasizes that the right architecture can incorporate multiple data sources and workloads.
2. Today’s Discussion
We’re awash in real-time data
Real-time data, combined with historical data, provides the most context for decision making
Building data pipelines with fewer systems and steps leads to greater scalability and reliability
CONFIDENTIAL
4. Real-Time Reality
Everything is trackable
Everything is shareable, often inadvertently
Consumer expectations demand real-time
5. Real-Time Reality of Yesterday’s Data Systems
No ability to easily capture real-time feeds
Too many disparate silos
Poor data cleanliness
Difficult data access (tooling, obscure languages)
Unpredictable performance and resource consumption
6. Real-Time Needs
Ingest on-the-fly data
• Natively from apps, Kafka/Spark, ETL tools, high speed loaders
Write groundbreaking analytic applications
• Custom dashboards, reporting
Deliver massive capacity
• With minimal node count
Guarantee performance
• Across thousands of users with reserved resources
Provide universal accessibility with ANSI SQL
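The "ingest on-the-fly data" need above amounts to buffering incoming events and flushing them to the warehouse in batches rather than row by row. A minimal sketch of that micro-batching pattern, using Python's built-in sqlite3 as a stand-in for an ANSI-SQL warehouse (the `readings` table and event shape are illustrative, not from the source):

```python
import sqlite3

def ingest_events(conn, events, batch_size=3):
    """Buffer incoming events and flush them as batched INSERTs.

    `events` is any iterable of (sensor_id, value) tuples -- a stand-in
    for records arriving from an app, a Kafka consumer, or a loader.
    """
    buf = []
    for event in events:
        buf.append(event)
        if len(buf) >= batch_size:
            conn.executemany("INSERT INTO readings VALUES (?, ?)", buf)
            buf.clear()
    if buf:  # flush the final partial batch
        conn.executemany("INSERT INTO readings VALUES (?, ?)", buf)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, value REAL)")
ingest_events(conn, [("s1", 1.0), ("s2", 2.5), ("s1", 3.2), ("s3", 0.7)])
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)  # 4
```

Batching is what makes the "100,000s of rows per second" ingest rates mentioned later plausible: the per-statement overhead is amortized across many rows.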
9. Real-Time Is Only Part of the Picture
An important moment, always fleeting
Challenging to incorporate context
A small view of the stream compared to the broad view over time
10. Incorporating Historical Data for Context
Business value lies in the right amount of history
• Hospitality
• Measure across annual visits
• Consumer goods
• Seasonal analytics
Both examples benefit from being able to incorporate real-time data
• Real-time offers to hospitality guests
• More efficient inventory management
12. Identifying The Right Capabilities
Ingest and data loading
• Direct from apps, Kafka/Spark, Change Data Capture from OLTP systems, ETL, YB Load
Data store scale and expansion
• Capacity, number of concurrent users, mixed workloads
Data accessibility
• Interactive applications, Ad Hoc SQL, Business critical reporting
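Change Data Capture, listed above as an ingest method, means replaying a stream of insert/update/delete events from an OLTP source onto an analytic copy. A minimal sketch of applying such a stream, with the event shape and `orders` table invented for illustration (real CDC tools emit their own formats):

```python
def apply_changes(table, changes):
    """Apply ordered CDC events to `table` (primary key -> row dict).

    Each event is a dict like {"op": "insert", "key": ..., "row": {...}};
    this shape is a hypothetical stand-in for a real CDC feed.
    """
    for ev in changes:
        if ev["op"] == "insert":
            table[ev["key"]] = ev["row"]
        elif ev["op"] == "update":
            table[ev["key"]].update(ev["row"])  # merge only changed columns
        elif ev["op"] == "delete":
            table.pop(ev["key"], None)
    return table

orders = {}
apply_changes(orders, [
    {"op": "insert", "key": 1, "row": {"status": "new", "total": 99.0}},
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"status": "new", "total": 5.0}},
    {"op": "delete", "key": 2},
])
print(orders)  # {1: {'status': 'shipped', 'total': 99.0}}
```

Because events are applied in order, the analytic copy converges to the same state as the OLTP source without bulk re-extraction.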
13. Evolution of data pipeline architectures
Enterprise Data Warehouse model
• Consolidate one or multiple application data sets into a data warehouse
Desire to capture all Internet data led to adoption of a data lake
• However, MapReduce was challenging
SQL-as-a-Layer provides some relief
• But SQL on a file system IS NOT a data warehouse
14. Further evolution of data pipelines
Data Lake: data science; high value data moves to the EDW
EDW: serves a large number of enterprise analytics users
15. Incoming Data
Modern architecture for real-time analytics:
Structured and semi-structured data flows to the Enterprise Data Warehouse, serving 1000s of users (BI analysts, data engineers)
Unstructured data flows to the Data Lake for data science; high value data moves to the EDW
16. Real-Time Architecture Data Warehouse Attributes
Real-time feeds: ingest IoT or OLTP data, capturing 100,000s of rows per second
Interactive applications: serve short queries in under 100 milliseconds
Periodic bulk loads: capture terabytes of data, petabytes over time
Powerful analytics: respond to complex BI queries in just a few seconds
Load and transform: use existing ETL tools, including intensive push-down ELT
Business critical reporting: workload management for prioritized responses
PostgreSQL compatible
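The workload-management attribute above (sub-100-ms interactive queries coexisting with multi-second BI queries) can be sketched as priority scheduling: short interactive queries are dequeued ahead of long reports when both are waiting. The workload classes, priority values, and queries below are illustrative, not product behavior:

```python
import heapq
import itertools

# Lower number = higher priority; classes are hypothetical examples.
PRIORITY = {"interactive": 0, "reporting": 1, "batch": 2}

counter = itertools.count()  # tie-breaker preserves arrival order
queue = []

def submit(sql, workload):
    """Enqueue a query under its workload class."""
    heapq.heappush(queue, (PRIORITY[workload], next(counter), sql))

def next_query():
    """Dequeue the highest-priority waiting query."""
    return heapq.heappop(queue)[2]

submit("SELECT * FROM big_fact JOIN dims ...", "batch")
submit("SELECT balance FROM accounts WHERE id = 7", "interactive")
submit("SELECT region, SUM(sales) FROM fact GROUP BY region", "reporting")

print(next_query())  # the interactive query runs first
```

Reserving resources per workload class, as the earlier "guarantee performance" bullet describes, extends this idea from ordering to admission control.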
17. The Yellowbrick Data Warehouse
MPP scale-out architecture: start small, grow compute and storage
Modular purpose-built appliance; all-flash data warehouse
Capacity from tens of terabytes to petabytes
18. Yellowbrick deployments across hybrid cloud
Yellowbrick Data Warehouse
Enabling analytics anywhere
Today: on-premises data centers, private cloud, colocation, edge
2019: cloud, hybrid cloud
19. The Yellowbrick Impact: 6 full racks > 1 appliance (6 rack units)
3x-100x performance improvement
20. Real-World Use Cases
Risk analytics
• Fraud detection for e-commerce
Consumer financing
• Tracking loyalty points and
impact on balance sheet
Hospitality
• Real-time offers
22. Common Event Streams
Sources ideal for real-time applications and analytics:
Business Applications: customer orders, airline reservations, insurance claims, bank transactions, telco CDRs
Digital Information: clickstreams, social computing, customer call logs, news and weather feeds, IT and network logs, market data, email
Internet of Things: RFID, telemetry, SCADA, geolocation, machine logs
23. Getting ready for real-time analytics
Business applications: OLTP databases
Enterprise digital information: available via existing ETL procedures
Big data: clickstreams, IoT, machine logs
Consolidate multiple data integration patterns into fewer systems
24. Gartner on Data Integration Styles
Real-time analytics' popularity dwarfs its practice
Ideal solutions will handle multiple ingestion methods
For many workflows, the further "up the stream" you can grab the data, the better
Source: Gartner
Editor's Notes
https://twitter.com/jer_s/status/1113667343480045569
Jeremy Schneider (@jer_s), retweeted by PostgreSQL:
The relational model was invented to make it easier to build good apps. When people consider non-relational data stores they sometimes overlook the benefits of a relational approach. Platforms with things like consistency & transactions make better applications with simpler code.