ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Hadoop, Cloud Storage, Excel?

Platforming Your Data for
Success
Presented by: William McKnight
President, McKnight Consulting Group
williammcknight
www.mcknightcg.com
(214) 514-1444

William McKnight
President, McKnight Consulting Group
• Frequent keynote speaker and trainer internationally
• Consulted to many Global 1000 companies
• Hundreds of articles, blogs, white papers, field tests, etc.
in publication
• Focused on delivering business value and solving business
problems utilizing proven, streamlined approaches to
information management
• Former Database Engineer, Fortune 50 Information
Technology executive and Ernst & Young Entrepreneur of
the Year Finalist
• Owner/consultant: Data strategy and implementation
consulting firm
• 25+ years of information management and data
experience

McKnight Consulting Group Offerings
Strategy
• Trusted Advisor
• Action Plans
• Roadmaps
• Tool Selections
• Program Management
Training
• Classes
• Workshops
Implementation
• Data/Data Warehousing/Business Intelligence/Analytics
• Master Data Management
• Governance/Quality
• Big Data

• 1990’s: Just Give Me Some Data and Fast!
• 2000’s: Give Me Good Data But Do It Efficiently!
• 2010’s+: Give Me All Data Fast & Effectively!
All Data!

AI Data
• Call center recordings and chat logs
• Streaming sensor data, historical maintenance records and
search logs
• Customer account data and purchase history
• Email response metrics
• Product catalogs and data sheets
• Public references
• YouTube video content audio tracks
• User website behaviors
• Sentiment analysis, user-generated content, social graph data,
and other external data sources

[Chart: Increasing Probability that Platform Selection Leads to Success. Categories shown: Best Category and Top Tool Picked; Best Category Picked; Top 2 Category Picked; Same Ol’ Platform. Values shown: 80%, 70%, 60%, 50%.]

What is it?
• Operational Database
– Operational Real-Time
– Operational Big Data
• Operational Data Hub
• Master Data Management
• A Data Warehouse
• A Data Mart
– Dependent
– Independent
• A Data Lake
• Analytic Application
– Analytic Big Data Application
• Archive Storage
• A Staging Area

4 Major Decisions
• Decision #1: The Data Store Type
– The largest factor for distinguishing between databases and file-based scale-out system utilization is the data profile. The latter is best for
data that fits the loose label of 'unstructured' (or semi-structured) data, while more traditional data -- and smaller volumes of all data -- still
belong in a relational database.
• Decision #2: Data Store Placement
– You must also decide where to place your data store -- on-premises or in the cloud (and which cloud). In the past, the only clear choice for
most organizations was on-premises data. However, the costs of scale are gnawing away at the notion that this remains the best approach
for a data platform. For more on why databases are moving to the cloud, please read this article.
• Decision #3: The Workload Architecture
– You must keep in mind the distinction between operational or analytical workloads. Short transactional requests and more complex (often
longer) analytics requests demand different architectures. Analytics databases, though quite diverse, are the preferred platforms for the
analytics workload.
• Decision #4: The Node Architecture
– Choose node classes from general purpose to premium, and storage from HDD to flash to all-memory. Weigh volume types, read/write and cache options,
and levels of IOPS and management, and balance CPU and storage on the nodes.
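As a rough illustration of Decisions #1 and #3, the sketch below encodes them as a small rule-of-thumb function. The names (`Workload`, `suggest_platform`) and the 50 TB cutoff are hypothetical, chosen only for illustration; they are not thresholds from this deck.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    structured: bool   # True for traditional, relational-style data
    volume_tb: float   # approximate data volume in terabytes
    analytical: bool   # True for analytics, False for short transactions


# Hypothetical cutoff for "smaller volumes"; tune it for your environment.
LARGE_VOLUME_TB = 50.0


def suggest_platform(w: Workload) -> str:
    """Return a coarse platform category for a workload (illustrative only)."""
    # Decision #1: unstructured or semi-structured data at large volume
    # points toward a file-based scale-out system.
    if not w.structured and w.volume_tb >= LARGE_VOLUME_TB:
        return "file-based scale-out system"
    # Decision #3: data kept in a relational database splits by workload.
    if w.analytical:
        return "analytic relational database"
    return "operational relational database"
```

A real selection would also weigh Decisions #2 and #4 (placement and node architecture), which this sketch leaves out.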

Data Warehouses, Data Marts,
Data Lakes, Big Data

Data Warehousing
• Data Warehouses (still) have a lower
total cost of ownership than data
marts
• A data warehouse is a SHARED
platform
– Build once, use many
– Access at Data Warehouse
– Access by creating a mart off the DW
• Still A LOT cheaper than building from scratch
“… a subject-oriented, integrated, non-volatile, time-variant collection of data, organized to support management needs.” -- Bill Inmon

On Relational
• Consistency
• Transactions
• Partitioning
• Arrays
• Inheritance
• UNION
• Columnar
• Storage Fluidity
• Custom data types
• Built in graph capabilities
• Caching

The Analytic Data Ecosystem
[Diagram. Components shown: Data Lake, DW (data warehouse), and two DMs (data marts).]

Data Warehouses Have Flavors
● The Customer Experience Transformation Data Warehouse focuses on
customer attributes and touchpoints to improve the value of
customers.
● The Asset Maximization with IoT Data Warehouse deals with the high
volume of edge data tracking the physical assets of the organization.
● The Operational Extension Data Warehouse supports company
operations directly with real-time analytics.
● The Risk Management Data Warehouse supports the ever-growing
compliance and reporting requirements and corporate risk.
● The Finance Modernization Data Warehouse handles the voluminous
financial reporting and ensures the bottom line is considered in every
aspect of the business.
● The Product Innovation Data Warehouse delivers all product-related
information into the decisions of the product life cycle.

Required for Modern Analytics
• In-database analytics
• In-memory capabilities
• Columnar orientation
• Modern programming languages
• New data types

Managed Hadoop Instances
• Managed Hadoop instances/clusters have local
storage, i.e., HDFS and Hive on the physical drives
mounted to the instances themselves
• Managed Hadoop technologies access their cloud
vendor’s respective cloud storage, viz.:
– Amazon EMR accesses S3
– Dataproc accesses Google Cloud Storage
– HDInsight (HDI) accesses Azure Data Lake Storage Gen2
• Local storage is used by the Managed Hadoop
platform for housekeeping

Data Lakes with Analytic Access Pricing
• Pair a lake with an analytical engine that charges
only by what you use
• If you have a ton of data that can sit in cold storage
and only needs to be accessed or analyzed
occasionally, store it in Amazon S3/Azure Blob
Storage/Google Cloud Storage
– Use a database (on-premises or in the cloud) that can
create external tables that point at the storage
– Analysts can query directly against it, or draw down a
subset for some deeper/intensive analysis
– The GB/month storage fee plus data transfer/egress
fees will be much cheaper than leaving it in a data
warehouse
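The cost argument above is simple arithmetic, sketched below. Every rate in the usage note is a hypothetical placeholder, not a quote from any vendor; substitute the current prices for your region and platform before drawing conclusions.

```python
def monthly_lake_cost(stored_gb, egress_gb, storage_rate, egress_rate):
    """Cold-storage cost: GB/month storage fee plus data transfer/egress fees."""
    return stored_gb * storage_rate + egress_gb * egress_rate


def monthly_warehouse_cost(stored_gb, warehouse_rate):
    """Cost of leaving the same data in a data warehouse (storage only)."""
    return stored_gb * warehouse_rate


def lake_is_cheaper(stored_gb, egress_gb, storage_rate, egress_rate,
                    warehouse_rate):
    """True when the lake (storage plus egress) undercuts the warehouse."""
    lake = monthly_lake_cost(stored_gb, egress_gb, storage_rate, egress_rate)
    return lake < monthly_warehouse_cost(stored_gb, warehouse_rate)
```

For example, with made-up rates of 0.01 per GB-month for cold storage, 0.09 per GB of egress, and 0.04 per GB-month in the warehouse, 100,000 GB stored and 2,000 GB drawn down per month comes to roughly 1,180 for the lake versus 4,000 for the warehouse. The comparison flips if analysts pull most of the data every month, so egress volume is the number to watch.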

Analytics Reference Architecture
[Diagram. Components shown: logs (apps, web, devices); user tracking; operational metrics; sensors; transactional/context data; OLTP/ODS; files; raw data topics (JSON, Avro); processed data topics; stream processing; ETL, or EL with T in Spark; batch and low-latency paths; offload data; reach-through or ETL/ELT; applications; in-database analytics; data warehouse.]

Notes on the Data Warehouse of the Future
• A separate compute and storage architecture is more achievable
• Compute resources (MapReduce, Hive, Spark, etc.) can be taken down,
scaled up or out, or interchanged without data movement
• Storage can be centralized, but compute can be distributed
• Major players have mechanisms to ensure consistency and achieve ACID-like
compliance
• Remote data replication ensures redundancy and recovery
• Most query execution time is processing, not data transport, so if
cloud compute and storage are in the same cloud vendor region,
performance is hardly impacted

Disruption Vectors
• Robustness of SQL
• Built-in optimization
• On-the-fly elasticity
• Dynamic Environment Adaptation
• Separation of compute from storage
• Support for diverse data

Cloud Analytic Databases in the Enterprise
• Can be used for test/dev or prod; disaster recovery; bursting
• CAPEX accounting
• The cloud now offers attractive options with better
economics, such as pay-as-you-go which is easier to justify
and budget, better logistics (streamlined administration and
management), and better scale (elasticity and the ability to
expand a cluster within minutes).
• While on-premises-first development brings a robust
database to the table, not all functions are always part of the
cloud solution and not all of the organizations behind them
have made the transition to cloud.
• Data gravity in the cloud.

Performance
• Managed cloud databases are the winner for
performance
• Querying cloud storage directly is inefficient, and
bringing subsets of data down for on-premises
processing takes time and incurs egress fees
• Performance testing on Hadoop engines like Hive,
Spark, and Impala has shown improvements in
performance, but they still lag significantly behind
the performance and power of a solid relational
cloud database/data warehouse

Administration
• Managed cloud databases win this category too.
• Many of the latest and greatest fully-managed cloud
database platforms are streamlining and subsuming
much of the DBA work these days. Things like indexes,
constraints, partitioning, and other DBA-level
performance tuning are fading away.
• Second is cloud storage, because of its very simple
architecture.
• Last place in Administration is Hadoop. You will still need
expertise to help diagnose why Spark executors fail or
Hive throws an exception or why troublesome queries
never finish.

However… Why Big Data Technologies for Big
Data
• New Data Types
• Schemaless
• Relaxed ACID
• Faster, Less Expensive Provisioning
• Programmer Freedoms
• Fault-Tolerant Redundancy
• Scale Out (to Webscale)
• Automatic Sharding
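To make the "Automatic Sharding" item concrete, here is a minimal sketch of hash-based shard assignment. The function names are hypothetical, and real platforms often use more elaborate schemes (range partitioning, consistent hashing) so that adding a shard does not reshuffle most keys.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard number using a stable hash."""
    # A cryptographic digest is used only because it is stable across
    # processes, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard each one is assigned to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards
```

Because the assignment depends only on the key and the shard count, any node can compute where a record lives without consulting a central directory.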

Data Lake
Data Scientist Workbench and Data Warehouse Staging
[Diagram. Components shown: OLTP systems (ERP, CRM, supply chain, MDM, …); stream or batch updates; DI; data lake; data scientists; data warehouse; data mart; real-time, event-driven apps.]

HDFS vs Cloud Storage
• Cloud Storage is more scalable and persistent
• Cloud Storage is backed up and supports
compression, lowering the cost of big data
• HDFS has better query performance
• Cloud Storage has object size and single PUT
limits that need workarounds
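A common workaround for single-PUT limits is to upload a large object in parts. The sketch below only plans the byte ranges; it performs no upload and calls no vendor SDK. The `max_parts` default is a placeholder assumption, since actual part-count and part-size limits vary by vendor and change over time.

```python
def plan_parts(total_bytes, part_bytes, max_parts=10_000):
    """Split an object into (offset, length) ranges for a multipart upload."""
    if total_bytes <= 0 or part_bytes <= 0:
        raise ValueError("sizes must be positive")
    num_parts = -(-total_bytes // part_bytes)  # ceiling division
    if num_parts > max_parts:
        raise ValueError("part size too small for this object")
    # Every part is part_bytes long except possibly the last one.
    return [
        (offset, min(part_bytes, total_bytes - offset))
        for offset in range(0, total_bytes, part_bytes)
    ]
```

Each range can then be uploaded independently and retried on its own, which is the practical benefit of splitting beyond simply fitting under the limit.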

Leveraging Cloud Storage for Data Lakes
• A separate compute and storage architecture is more achievable
• Compute resources (MapReduce, Hive, Spark, etc.) can be taken
down, scaled up or out, or interchanged without data movement
• Storage can be centralized, but compute can be distributed
• Major players have mechanisms to ensure consistency and achieve
ACID-like compliance for remote data changes
• Some vendors also have remote data replication to ensure
redundancy and recovery
• Most query execution time is processing, not data
transport, so if cloud compute and storage are in the same cloud
vendor region, performance is hardly impacted

How to Identify a Graph Workload
• Workload is identified by “network, hierarchy,
tree, ancestry, structure” words
• You are planning to use the relational
performance tricks
• Your queries will be about pathing
• You are limiting queries by their complexity
• A quick POC with a graph database impresses
• You are looking for “non-obvious” patterns in
the data
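A "pathing" query asks how two entities are connected. The sketch below shows the idea with a breadth-first search over an in-memory adjacency list. It is plain Python for illustration, not a graph database API, and the graph in the usage note is made up.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest list of nodes from start to goal, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None
```

With `graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}`, calling `shortest_path(graph, "a", "e")` returns `["a", "b", "d", "e"]`. In a relational schema the same question usually needs a recursive query or a chain of self-joins whose depth is not known in advance, which is one sign of a graph workload.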

Graph Databases
[Diagram: a graph with two vertices labeled "Bridge vertex".]

GPU Databases
• A GPU Database performs at least some
operations using the GPU
• Uses SQL
• Uses each GPU's local memory store,
which is used as a data cache that
operates many times faster than the CPU
cache or main memory itself

Operlytical Databases
• Combination of row-based storage for transactions and
column-based storage for analytics
• Can process both orders and machine learning
models simultaneously with fast performance
and reduced complexity
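The row-versus-column distinction can be sketched with plain Python structures. The sample orders and function names below are hypothetical; a real engine keeps both layouts in sync internally rather than converting on demand.

```python
# Row layout: each record is stored together, which suits transactions.
orders = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": 80.0},
    {"order_id": 3, "customer": "acme", "amount": 45.5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def lookup_order(rows, order_id):
    """Transactional access: fetch one whole record by key."""
    for row in rows:
        if row["order_id"] == order_id:
            return row
    return None


def total_amount(columns):
    """Analytic access: scan a single column and ignore the rest."""
    return sum(columns["amount"])
```

The point read touches every field of one order, while the aggregate reads only the `amount` column, which is why the two workloads favor different layouts.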

Decentralized, decoupled, distributed
architectures
• Data Infrastructure as a Platform with complete domain
mastery as nodes
• Enterprise master data management
• Solving the federation challenge; nobody has done it yet,
someone will and it could be really big
• Moving away from conventional integration and its
technical debt and effort
• Containerization, microservice databases, and
embedded databases as part of the analytics
environment
• Integration speed uptake and maturity eliminating
redundant data stores
• Unification of batch and streaming and tools
