Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not stood still while Hadoop emerged. The Hadoop era generated a great deal of interest and confusion, but is Hadoop still relevant now that organizations are eagerly deploying cloud storage? We’ll discuss which platforms to use for which data. This is a critical decision: a bad fit can mean two to five times the work effort.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions against this backdrop.
This webinar will distinguish these analytic deployment options and help you platform your data for success in 2020 and beyond.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Hadoop, Cloud Storage, Excel?
1. Platforming Your Data for Success
Presented by: William McKnight
President, McKnight Consulting Group
williammcknight
www.mcknightcg.com
(214) 514-1444
2. William McKnight
President, McKnight Consulting Group
• Frequent keynote speaker and trainer internationally
• Consulted to many Global 1000 companies
• Hundreds of articles, blogs, white papers, field tests, etc.
in publication
• Focused on delivering business value and solving business
problems utilizing proven, streamlined approaches to
information management
• Former Database Engineer, Fortune 50 Information
Technology executive and Ernst & Young Entrepreneur of
the Year Finalist
• Owner/consultant: Data strategy and implementation
consulting firm
• 25+ years of information management and data
experience
3. McKnight Consulting Group Offerings
• Strategy
– Trusted Advisor
– Action Plans
– Roadmaps
– Tool Selections
– Program Management
• Training
– Classes
– Workshops
• Implementation
– Data/Data Warehousing/Business Intelligence/Analytics
– Master Data Management
– Governance/Quality
– Big Data
6. AI Data
• Call center recordings and chat logs
• Streaming sensor data, historical maintenance records and
search logs
• Customer account data and purchase history
• Email response metrics
• Product catalogs and data sheets
• Public references
• YouTube video content audio tracks
• User website behaviors
• Sentiment analysis, user-generated content, social graph data,
and other external data sources
8. Best Category and Top Tool Picked
[Figure: chart of the increasing probability that platform selection leads to success, on a scale marked from 50% to 80%, comparing three cases: Same Ol’ Platform, Top 2 Category Picked, and Best Category Picked.]
9. What is it?
• Operational Database
– Operational Real-Time
– Operational Big Data
• Operational Data Hub
• Master Data Management
• A Data Warehouse
• A Data Mart
– Dependent
– Independent
• A Data Lake
• Analytic Application
– Analytic Big Data Application
• Archive Storage
• A Staging Area
10. 4 Major Decisions
• Decision #1: The Data Store Type
– The largest factor for distinguishing between databases and file-based scale-out system utilization is the data profile. The latter is best for
data that fits the loose label of 'unstructured' (or semi-structured) data, while more traditional data -- and smaller volumes of all data -- still
belong in a relational database.
• Decision #2: Data Store Placement
– You must also decide where to place your data store -- on-premises or in the cloud (and which cloud). In the past, the only clear choice for
most organizations was on-premises data. However, the costs of scale are gnawing away at the notion that this remains the best approach
for a data platform. For more on why databases are moving to the cloud, please read this article.
• Decision #3: The Workload Architecture
– You must keep in mind the distinction between operational and analytical workloads. Short transactional requests and more complex (often
longer) analytics requests demand different architectures. Analytics databases, though quite diverse, are the preferred platforms for the
analytics workload.
• Decision #4: The Node Architecture
– Node options range from general purpose to premium, and storage from HDD to flash to all-memory. Consider volume types, read/write
characteristics, and cache options. Balance CPU and storage on the nodes, and choose the levels of IOPS and of management you need.
12. Data Warehousing
• Data Warehouses (still) have a lower
total cost of ownership than data
marts
• A data warehouse is a SHARED
platform
– Build once, use many
– Access at Data Warehouse
– Access by creating a mart off the DW
• Still A LOT cheaper than building from scratch
“… a subject-oriented, integrated, non-volatile, time-variant collection of
data, organized to support management needs.” - Bill Inmon
13. On Relational
• Consistency
• Transactions
• Partitioning
• Arrays
• Inheritance
• UNION
• Columnar
• Storage Fluidity
• Custom data types
• Built in graph capabilities
• Caching
15. Data Warehouses Have Flavors
● The Customer Experience Transformation Data Warehouse focuses on
customer attributes and touchpoints to improve the value of
customers.
● The Asset Maximization with IoT Data Warehouse deals with the high
volume of edge data tracking the physical assets of the organization.
● The Operational Extension Data Warehouse supports company
operations directly with real-time analytics.
● The Risk Management Data Warehouse supports the ever-growing
compliance and reporting requirements and corporate risk.
● The Finance Modernization Data Warehouse handles the voluminous
financial reporting and ensures the bottom line is considered in every
aspect of the business.
● The Product Innovation Data Warehouse delivers all product-related
information into the decisions of the product life cycle.
16. Required for Modern Analytics
• In-database analytics
• In-memory capabilities
• Columnar orientation
• Modern programming languages
• New data types
18. Object Storage Instances
• Object Storage instances/clusters have local
storage, i.e., on the physical drives mounted to the
instances themselves; this is where HDFS and Hive reside
• Object Storage technologies access their cloud
vendor’s respective cloud storage, viz.:
– Amazon EMR accesses S3
– Dataproc accesses Google Cloud Storage
– Azure HDInsight (HDI) accesses Azure Data Lake Storage Gen2
• Local storage is used by the Object Storage
platform for housekeeping
19. Data Lakes with Analytic Access Pricing
• Pair a lake with an analytical engine that charges
only for what you use
• If you have a large amount of data that can sit in cold storage
and only needs to be accessed or analyzed
occasionally, store it in Amazon S3/Azure Blob
Storage/Google Cloud Storage
– Use a database (on-premises or in the cloud) that can
create external tables that point at the storage
– Analysts can query directly against it, or draw down a
subset for deeper or more intensive analysis
– The GB/month storage fee plus data transfer/egress
fees will be much cheaper than leaving it in a data
warehouse
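The cost argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not vendor pricing: every rate below is a hypothetical placeholder, and the function names are invented for illustration. Substitute your own vendor's published rates before drawing conclusions.

```python
# All rates are hypothetical placeholders (USD); they are NOT real vendor prices.
LAKE_STORAGE_PER_GB_MONTH = 0.01       # assumed cold object storage rate
EGRESS_PER_GB = 0.09                   # assumed data transfer/egress rate
WAREHOUSE_STORAGE_PER_GB_MONTH = 0.10  # assumed warehouse storage rate


def lake_monthly_cost(stored_gb: float, egress_gb: float) -> float:
    """GB/month storage fee plus egress fee for data kept in cloud storage."""
    return stored_gb * LAKE_STORAGE_PER_GB_MONTH + egress_gb * EGRESS_PER_GB


def warehouse_monthly_cost(stored_gb: float) -> float:
    """Storage fee for keeping the same data in the data warehouse."""
    return stored_gb * WAREHOUSE_STORAGE_PER_GB_MONTH


def break_even_egress_gb(stored_gb: float) -> float:
    """Monthly egress volume at which the lake stops being cheaper."""
    saving = stored_gb * (WAREHOUSE_STORAGE_PER_GB_MONTH - LAKE_STORAGE_PER_GB_MONTH)
    return saving / EGRESS_PER_GB


if __name__ == "__main__":
    stored, pulled = 100_000, 2_000  # GB stored, GB drawn down per month
    print(f"lake:       {lake_monthly_cost(stored, pulled):,.2f}")
    print(f"warehouse:  {warehouse_monthly_cost(stored):,.2f}")
    print(f"break-even: {break_even_egress_gb(stored):,.0f} GB egress/month")
```

Under these assumed rates the lake stays cheaper as long as only a small share of the data is drawn down each month. If analysts pull most of the data every month, the egress fees erode the saving.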
20. Analytics Reference Architecture
[Figure: analytics reference architecture diagram. Recoverable labels: Logs (Apps, Web, Devices); User tracking; Operational Metrics; Sensors; Transactional/Context Data; OLTP/ODS; Files; Raw Data Topics (JSON, AVRO); Processed Data Topics; ETL or EL with T in Spark; Batch; Low Latency; Stream Processing; Reach through or ETL/ELT; Offload data; In-database analytics; Applications; Data Warehouse.]
21. Notes on the Data Warehouse of the Future
• A separate compute and storage architecture is now more achievable
• Compute resources (Map/Reduce, Hive, Spark, etc.) can be taken down,
scaled up or out, or interchanged without data movement
• Storage can be centralized, but compute can be distributed
• Major players have mechanisms to ensure consistency and achieve ACID-like
compliance
• Remote data replication ensures redundancy and recovery
• Most of the query execution time is processing, not data transport, so if
cloud compute and storage are in the same cloud vendor region,
performance is hardly impacted
23. Disruption Vectors
• Robustness of SQL
• Built-in optimization
• On-the-fly elasticity
• Dynamic Environment Adaption
• Separation of compute from storage
• Support for diverse data
24. Cloud Analytic Databases in the Enterprise
• Can be used for test/dev or prod; disaster recovery; bursting
• CAPEX accounting
• The cloud now offers attractive options with better
economics, such as pay-as-you-go which is easier to justify
and budget, better logistics (streamlined administration and
management), and better scale (elasticity and the ability to
expand a cluster within minutes).
• While on-premises-first development brings a robust
database to the table, not all functions are always part of the
cloud solution and not all of the organizations behind them
have made the transition to cloud.
• Data gravity in the cloud.
25. Performance
• Managed cloud databases are the winner for
performance
• Querying cloud storage directly is inefficient, and
bringing subsets of data down for on-premises
processing takes time and incurs egress fees
• Performance testing on Hadoop engines like Hive,
Spark, and Impala has shown improvements in
performance, but they still lag significantly behind
the performance and power of a solid relational
cloud database/data warehouse
26. Administration
• Managed cloud databases win this category too.
• Many of the latest and greatest fully-managed cloud
database platforms are streamlining and subsuming
much of the DBA work these days. Things like indexes,
constraints, partitioning, and other DBA-level
performance tuning are fading away.
• Second is cloud storage, because of its very simple
architecture.
• Last place in Administration is Hadoop. You will still need
expertise to help diagnose why Spark executors fail or
Hive throws an exception or why troublesome queries
never finish.
27. However… Why Big Data Technologies for Big Data
• New Data Types
• Schemaless
• Relaxed ACID
• Faster, Less Expensive Provisioning
• Programmer Freedoms
• Fault-Tolerant Redundancy
• Scale Out (to Webscale)
• Automatic Sharding
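To illustrate what "automatic sharding" means in the list above, here is a minimal sketch of hash-based shard assignment. It is an illustration under stated assumptions, not how any particular product does it: the function names `shard_for` and `distribute` are hypothetical, and real systems often use more elaborate schemes such as consistent hashing or range partitioning so that adding a node moves less data.

```python
import hashlib
from typing import Dict, Iterable, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard number using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    string hash is randomized per Python process.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Group keys by the shard they would be stored on."""
    shards: Dict[int, List[str]] = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    customer_ids = [f"customer-{n}" for n in range(10)]  # made-up sample keys
    for shard, members in distribute(customer_ids, 3).items():
        print(shard, members)
```

The application never chooses a node. The key alone determines placement, which is what lets the platform scale out without manual partitioning work.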
28. Data Lake
Data Scientist Workbench and Data Warehouse Staging
[Figure: data lake architecture diagram. Recoverable labels: OLTP Systems (ERP, CRM, Supply Chain, MDM, …); Stream or Batch Updates; DI; Data Lake; Data Scientists; Data Warehouse; Data Mart; Real-Time, Event-Driven Apps.]
29. HDFS vs Cloud Storage
• Cloud Storage is more scalable and persistent
• Cloud Storage is backed up and supports
compression, lowering the cost of storing big data
• HDFS has better query performance
• Cloud Storage has object size and single PUT
limits that need workarounds
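One common workaround for a single-PUT limit is to upload a large object in parts. The sketch below only plans the byte ranges and performs no upload. It is a sketch under assumptions: the limit and part size are hypothetical parameters (actual limits vary by vendor and change over time, so check your vendor's documentation), and `needs_multipart` and `plan_parts` are illustrative names, not any SDK's API.

```python
from typing import List, Tuple


def needs_multipart(total_bytes: int, single_put_limit_bytes: int) -> bool:
    """True when the object is too large for one PUT request."""
    return total_bytes > single_put_limit_bytes


def plan_parts(total_bytes: int, part_bytes: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the object in order."""
    if total_bytes < 0 or part_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and part_bytes > 0")
    parts: List[Tuple[int, int]] = []
    offset = 0
    while offset < total_bytes:
        length = min(part_bytes, total_bytes - offset)  # last part may be short
        parts.append((offset, length))
        offset += length
    return parts


if __name__ == "__main__":
    GIB = 1024 ** 3
    assumed_single_put_limit = 5 * GIB  # hypothetical limit, for illustration
    object_size = 12 * GIB
    if needs_multipart(object_size, assumed_single_put_limit):
        for number, (offset, length) in enumerate(plan_parts(object_size, 4 * GIB), 1):
            print(f"part {number}: offset={offset} length={length}")
```

Each planned range would then be sent as its own request and the parts combined on the storage side, using whatever multipart mechanism the vendor provides.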
30. Leveraging Cloud Storage for Data Lakes
• A separate compute and storage architecture is now more achievable
• Compute resources (Map/Reduce, Hive, Spark, etc.) can be taken
down, scaled up or out, or interchanged without data movement
• Storage can be centralized, but compute can be distributed
• Major players have mechanisms to ensure consistency and achieve
ACID-like compliance for remote data changes
• Some vendors also have remote data replication to ensure
redundancy and recovery
• Most of the query execution time is processing, not data
transport, so if cloud compute and storage are in the same cloud
vendor region, performance is hardly impacted
32. How to Identify a Graph Workload
• Workload is identified by “network, hierarchy,
tree, ancestry, structure” words
• You are planning to use the relational
performance tricks
• Your queries will be about pathing
• You are limiting queries by their complexity
• A quick POC with a graph database impresses
• You are looking for “non-obvious” patterns in
the data
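A "pathing" query, as mentioned in the list above, asks how two entities are connected rather than what their attributes are. The following is a minimal sketch of one such query, a shortest path found by breadth-first search over an in-memory adjacency list. The graph data and names are made up for illustration; a graph database would answer this kind of question natively, whereas a relational schema would typically need recursive joins of unknown depth.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search: return the fewest-hops path, or None if unreachable."""
    if start == goal:
        return [start]
    previous: Dict[str, Optional[str]] = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


if __name__ == "__main__":
    # Hypothetical "knows" network; edges are directed.
    knows = {
        "ann": ["bob", "cy"],
        "bob": ["dee"],
        "cy": ["dee"],
        "dee": ["eve"],
    }
    print(shortest_path(knows, "ann", "eve"))
```

If most of a workload's queries look like this one, with the number of hops not known in advance, that is a strong hint the workload is a graph workload.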
35. GPU Databases
• A GPU Database performs at least some
operations using the GPU
• Uses SQL
• Uses each GPU's local memory store,
which is used as a data cache that
operates many times faster than the CPU
cache or main memory itself
36. Operlytical Databases
• Combine row-based storage for transactions with
column-based storage for analytics
• Can process both orders and machine learning
models simultaneously with fast performance
and reduced complexity
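The row-versus-column distinction above can be shown with plain data structures. This is a minimal sketch, not a description of how any operlytical product stores data: the sample orders and the function names are invented for illustration. A row layout keeps each order together, which suits fetching or updating one transaction. A column layout keeps each attribute together, which suits scanning one attribute across many orders.

```python
from typing import Any, Dict, List, Optional

# Row-based layout: one record per order (made-up sample data).
orders_rows: List[Dict[str, Any]] = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": 75.5},
    {"order_id": 3, "customer": "acme", "amount": 300.0},
]


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-based records into a column-based layout."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def lookup_order(rows: List[Dict[str, Any]], order_id: int) -> Optional[Dict[str, Any]]:
    """Transaction-style access: fetch one whole order from the row layout."""
    for row in rows:
        if row["order_id"] == order_id:
            return row
    return None


def total_amount(columns: Dict[str, List[Any]]) -> float:
    """Analytics-style access: scan a single column, ignoring the others."""
    return sum(columns["amount"])


if __name__ == "__main__":
    orders_columns = to_columns(orders_rows)
    print(lookup_order(orders_rows, 2))
    print(total_amount(orders_columns))
```

A platform that maintains both layouts over the same data can serve the order lookup and the aggregate without copying data to a separate analytic system.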
37. Decentralized, decoupled, distributed architectures
• Data Infrastructure as a Platform with complete domain
mastery as nodes
• Enterprise master data management
• Solving the federation challenge; nobody has done it yet,
someone will and it could be really big
• Moving away from conventional integration and its
technical debt and effort
• Containerization, microservice databases, and
embedded databases as part of the analytics
environment
• Integration speed uptake and maturity eliminating
redundant data stores
• Unification of batch and streaming and tools