Architecting
Modern Data Platforms
by ankitrathi.com
Content
• Data Architecture Principles
• Data Lake Basics
• High Level Architecture
• Data Characteristics
• Putting It All Together
• Product-Driven Data Architecture
• Reference Architecture
Data Architecture Principles
• Adhere to ADDA (Accessibility, Definition, Decoupling, Agility)
• Design for RSM (Reliability, Scalability, Maintainability)
• Use Right Tools
• Cloud Native/Agnostic
• Be Cost Conscious
Adhere to ADDA
• Accessibility: easily accessible data for the business
• Definition: data catalog for simplified data discovery
• Decoupling: decoupled layers for flexibility
• Agility: agile enough to cater to evolving business requirements
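As a rough illustration of the "Definition" point, a data catalog can be thought of as a searchable registry of dataset descriptions. The sketch below is a toy in-memory version; the class names, fields and sample datasets are all hypothetical and not taken from the deck or from any catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record (all fields are illustrative assumptions)."""
    name: str
    description: str
    owner: str
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """Toy in-memory catalog; a real one would persist and index its entries."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Later registrations with the same name overwrite earlier ones.
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Return names of datasets whose name, description or tags mention the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)
        )


catalog = DataCatalog()
catalog.register(DatasetEntry("orders", "Customer order transactions", "sales", {"records"}))
catalog.register(DatasetEntry("clickstream", "Web and mobile app events", "marketing", {"events"}))
print(catalog.search("events"))
```

Discovery here is a simple substring match; the point is only that users find data by description rather than by knowing where it is stored.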
Design for RSM
• Reliability: works correctly, fault-tolerant
• Scalability: adapts to growth
• Maintainability: remains easy to maintain
Use Right Tools
• Data Structure: structured, semi-structured, unstructured
• Latency: low, medium, high
• Throughput: high, medium, low
• Access Pattern: key-value, search, transactions
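One way to read this slide is as a decision table: the access pattern and latency need narrow down the kind of store. The function below is a minimal sketch of that idea. The mapping is an assumption made for illustration (the deck names the dimensions and the store families, not the rules), and `choose_store` is a hypothetical helper.

```python
def choose_store(access_pattern: str, latency: str) -> str:
    """Suggest a store family for an access pattern and latency need.

    The rules below are illustrative assumptions, not a mapping from the deck.
    """
    pattern = access_pattern.lower()
    if pattern == "transactions":
        # Multi-row consistency is the usual reason to pick a relational store.
        return "SQL"
    if pattern == "search":
        return "Search index"
    if pattern == "key-value":
        # Assumed rule: only low-latency lookups justify an in-memory cache.
        return "In-memory cache" if latency.lower() == "low" else "NoSQL"
    raise ValueError(f"unknown access pattern: {access_pattern!r}")


print(choose_store("key-value", "low"))
print(choose_store("transactions", "medium"))
```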
Cloud Native/Agnostic
Cloud Native
• Pros: better performance; better efficiency; lower costs (generic services)
• Cons: vendor lock-in; higher costs (specific services)
Cloud Agnostic
• Pros: flexibility; minimal vendor lock-in; standard performance
• Cons: underutilization of vendor capabilities; solution can become complex; performance, logging and monitoring can take a hit
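A common way to keep lock-in low is to have application code depend on a small interface rather than on a vendor SDK. The sketch below shows the idea under stated assumptions: `ObjectStore`, `InMemoryObjectStore` and `archive_report` are hypothetical names, and the in-memory class stands in for whatever vendor-specific adapter a real platform would plug in.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Minimal storage interface the application depends on (hypothetical)."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryObjectStore:
    """Stand-in backend; a vendor-specific adapter would offer the same two methods."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_report(store: ObjectStore, name: str, body: str) -> str:
    """Application code only sees the interface, never the vendor."""
    key = f"reports/{name}.txt"
    store.put(key, body.encode("utf-8"))
    return key


store = InMemoryObjectStore()
key = archive_report(store, "daily", "all pipelines green")
print(key, store.get(key))
```

The trade-off matches the lists above: the narrow interface buys portability, but anything a vendor offers beyond `put` and `get` goes unused unless the interface grows.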
Be Cost Conscious
• Efficient consumption of services
• Select cost-conscious options
• Enforce policies and controls
Data Lake
• Data Lake Definition
• An architectural approach
• Massive heterogeneous data stored centrally
• Available to a diverse group of users
• To be categorized, processed, analyzed & consumed
• Data Lake Characteristics
• Structured, semi-structured & unstructured data
• Scaled out as required
• Diverse set of storage, analytics and ML/AI tools
• Designed for low-cost storage and analytics
High-Level Architecture
[Diagram: Data passes through the Ingest, Store, Process/Analyse and Serve stages to become Actionable Insights. Latency, throughput and cost apply across the stages.]
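The four stages can be sketched as four small functions chained together. This is a toy, single-process illustration only: the function bodies, the JSON event format and the list used as a "store" are assumptions chosen to keep the example runnable, not part of the deck.

```python
import json
from collections import Counter


def ingest(raw_lines: list[str]) -> list[dict]:
    """Ingest: parse incoming JSON lines into records, skipping blank lines."""
    return [json.loads(line) for line in raw_lines if line.strip()]


def store(records: list[dict], storage: list[dict]) -> list[dict]:
    """Store: append records to a durable location (a plain list stands in here)."""
    storage.extend(records)
    return storage


def process(storage: list[dict]) -> Counter:
    """Process/Analyse: count events per type."""
    return Counter(record["type"] for record in storage)


def serve(summary: Counter) -> list[str]:
    """Serve: expose the result in a consumable form."""
    return [f"{event_type}: {count}" for event_type, count in sorted(summary.items())]


raw = ['{"type": "click"}', '{"type": "view"}', "", '{"type": "click"}']
lake: list[dict] = []
insights = serve(process(store(ingest(raw), lake)))
print(insights)
```

Keeping each stage behind its own function mirrors the decoupling principle: any one stage can be swapped without touching the others.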
Ingest
Source          | Data Type        | Data
Web/Mobile Apps | Records          | Transactions
Databases       | Records          | Transactions
Logging         | Search documents | Files
Logging         | Log files        | Files
Messaging       | Messages         | Events
IoT             | Data Streams     | Events
Data Characteristics
             | Hot       | Warm    | Cold
Volume       | MB-GB     | GB-PB   | PB-EB
Item Size    | B-KB      | KB-MB   | KB-TB
Latency      | ms        | ms, sec | min, hrs
Durability   | Low-high  | High    | Very high
Request Rate | Very high | High    | Low
Cost/GB      | $$-$      | $-¢¢    | ¢¢-¢
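The latency row suggests a simple way to assign a temperature: ask how long a consumer can wait for the data. The sketch below turns that into code. The cutoffs (100 ms and 60 s) are assumptions picked to match the "ms", "ms, sec" and "min, hrs" bands loosely; the deck gives no exact thresholds, and `classify_temperature` is a hypothetical helper.

```python
def classify_temperature(max_latency_seconds: float) -> str:
    """Map a tolerated read latency to hot, warm or cold.

    Cutoffs are illustrative assumptions, not values from the deck.
    """
    if max_latency_seconds < 0:
        raise ValueError("latency cannot be negative")
    if max_latency_seconds < 0.1:   # roughly the millisecond band
        return "hot"
    if max_latency_seconds < 60:    # up to seconds
        return "warm"
    return "cold"                   # minutes to hours


for tolerated in (0.005, 2, 3600):
    print(tolerated, classify_temperature(tolerated))
```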
Data Characteristics
• Type of Data Structures
• Fixed Schema
• Schema Free
• Key-Value
• Type of Access Patterns
• Key-Value
• Simple relations (1:N, M:N)
• Multi-table joins, transactions
• Faceting, Search
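Storage
[Diagram: In-memory, File Storage, NoSQL and SQL stores placed along a hot, warm, cold data axis and a structure axis (high to low). From hot to cold data, request rate and cost per GB go from high to low, while latency and data volume go from low to high.]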
Analytics Types
• Message/Stream Analysis
• Interactive Analysis
• Batch Analysis
• Machine Learning/AI
ETL Processing
[Diagram: ETL sits between the Store and Process/Analyse stages.]
Serve
• Applications & APIs
• Analysis & Visualization
• Notebooks
• IDEs
Putting It All Together
• Ingest: Web Apps, Mobile Apps, Data Centers, Logging, Messaging, Devices, Sensors
• Data types: Records, Documents, Files, Messages, Streams
• Store: Cache, NoSQL, SQL, ElasticSearch, Object Storage, SQS, Streams
• ETL: between Store and Process/Analyse
• Process/Analyse: ML/AI, Interactive, Batch, Message/Stream
• Serve: APIs, Analysis, Visualization, Notebooks, IDE
• Across all stages: Security & Governance, Data Catalog
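Read in the order they are listed, the data types and stores on this slide suggest a routing table from data type to candidate store. The pairing below is an assumed reading of the diagram, not something the deck states explicitly, and `candidate_stores` is a hypothetical helper.

```python
# Assumed pairing of data types to stores, based on the order in which
# they are listed on the slide. Treat it as a guess, not the author's design.
STORE_FOR_DATA_TYPE: dict[str, list[str]] = {
    "records": ["Cache", "NoSQL", "SQL"],
    "documents": ["ElasticSearch"],
    "files": ["Object Storage"],
    "messages": ["SQS"],
    "streams": ["Streams"],
}


def candidate_stores(data_type: str) -> list[str]:
    """Return the stores an ingested item of this type could be routed to."""
    try:
        return STORE_FOR_DATA_TYPE[data_type.lower()]
    except KeyError:
        raise ValueError(f"unknown data type: {data_type!r}") from None


print(candidate_stores("Files"))
print(candidate_stores("records"))
```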
Product-Driven Data Architecture
Reference: https://martinfowler.com/articles/data-monolith-to-mesh.html
Reference Architecture - Azure
Reference: https://docs.microsoft.com/en-us/azure/architecture/example-scenario/dataplate2e/data-platform-end-to-end
Reference Architecture - AWS
Reference: https://docs.aws.amazon.com/solutions/latest/data-lake-solution/architecture.html
Reference Architecture - GCP
Reference: https://cloud.google.com/solutions/big-data
Questions…?
Thank You
ankitrathi.com
