Modern Open Data Platform: Cool Open Source Tools
Crafting your Dream Stack with the Open Data Platform Playbook
Rahul Xavier Singh, Anant Corporation
Data Engineer’s Lunch / Anant Webinar, 11/07/2022
Agenda
Playbook
Design
Framework
Approach
ETL / Reverse ETL
Customer Data Platforms
Components
DataOps
We help platform owners reach beyond their potential to serve a global customer base that demands Everything, Now.
We design with our Playbook, build with our Framework, and manage platforms with our Approach so our clients Think & Grow Big.
Customer Success
Challenge
Business
Platform
Playbook
Framework
Approach
Technology
Management
Solutions
[Data] Services Catalog
Fully Managed Service
Subscriptions
We offer Professional Services to engineer Solutions and offer Managed Services to clients where it makes sense, after an Assessment.
Modern Technology is Disconnected
https://chiefmartec.com/2020/04/marketing-technology-landscape-2020-martech-5000/
Businesses want to:
- Create value
- Get the customer
- Deliver the value
- Get paid
Most Users Just Want / Need to …
FIND
DISCOVER
FILTER
ANALYZE
VISUALIZE
MEASURE
ACT
USE
SHARE
Business / Platform Dream
Enterprise Consciousness:
- People
- Processes
- Information
- Systems
Connected / Synchronized.
Business has been chasing this dream for a while. As technologies improve, this becomes more accessible.
Image Source: Digital Business Technology Platforms, Gartner 2016
Going Beyond “Reactive Manifesto” / 12 Factor
References: https://12factor.net/, https://www.reactivemanifesto.org/
- Current Business Information is available to People in the swiftest way possible within the bounds of reasonable costs.
- Business Information is generally available to the enterprise, siloed only by security and governance.
- Data platforms make use of appropriate resources for hot vs. cold, raw vs. enhanced data.
- Data platforms are always available and redundant, always trying to achieve an RPO/RTO of zero.
Project Information
Client Service Information
Corporate Guides
Collaborative Documents
Assets & Files
Corporate Assets
Unified User Experience
Challenges of Managing Data Platforms in a Growing Enterprise
Optimized Core enabled Business Modularity
This process needs to be done in sequence. Otherwise we end up having to redo the work.
Business Silos
Standardized Platform
Optimized Core
Business Modularity
Phases of Business Modularity
Generic Data Platform Operations
Modern Open Data Platform
Design
Contexts
Responsibilities
Approach
Framework
Tools
So Many Different “Modern Stacks”?
Lots of “reference” architectures are available. They tend not to think about the speed layer since they are focusing on batch. What about SPEED?
How do you choose from the landscape?
Lots and lots of components in the Data & AI Landscape. Which ones are the right ones for your business?
Playbook for Modern Open Data Platform
Platform Design Evaluate Framework
Cloud
- Public
- Private
- Hybrid
Data
- Data:Object
- Data:Stream
- Data:Table
- Data:Index
- Processor:Batch
- Processor:Stream
DataOps
- ETL/ELT/EtLT
- Reverse ETL
- Orchestration
DevOps
- Infrastructure as Code
- Systems Automation
- Application CICD
Architecture (Design)
- Cloud
- Data
- DevOps
- DataOps
Engineering
- Configuration
- Scripting
- Programming
Operation
- Setup / Deploy
- Monitoring/Alerts
- Administration
User Experience
- No-Code/Low Code Apps/Form Builders
- Automatic API Generator/Platform
- Customer App/API Framework
Execute Approach
Discovery (Inventory)
- People
- Process
- Information (Objects)
- Systems (Apps)
Modern Enterprise Canvas
Workflow Approval
Customer Acquisition
Customer Payment
Customer Information
Customer Information
Customer Information
Business Information
Billing Information
Zoho App Creator
Unbounce
Zoho CRM
Stripe
Zapier
Contexts
- People
- Process
- Information
- Systems
Responsibility Areas
- Products & Services
- Sales & Marketing
- Operations & Infrastructure
- Research & Development
- Finance & Accounting
- Leadership & Management
Modern Enterprise Canvas
Contexts
- People
- Process
- Information
- Systems
Responsibility Areas
- Customer
- Users
- Business
- Product Owners
- Engineering
- Developers
- Operations
- Administrators
Framework
Framework
Distributed
Realtime
Extendable / Open
Automated
Monitored / Managed
Public Cloud Native - Amazon
Public Cloud Native - Microsoft
Public Cloud Native - Google
Cool Tools: Optimizing Distributed Data with Cloud vs. Open Core with Open Source Tools
Open Core Distributed Data Platforms
To create globally distributed, real-time platforms, you need to build on distributed real-time technologies. Here are some. Which ones should you choose?
Open Core
Data Modernization / Automation / Integration
In addition to vastly scalable tools, there are also modern innovations that can help teams automate and maximize human capital by making data platform management easier.
Framework Components
● Major Components
  ○ Persistent Queues (RAM/BUS)
  ○ Queue Processing & Compute (CPU)
  ○ Persistent Storage (DISK/RAM)
  ○ Reporting Engine (Display)
  ○ Orchestration Framework (Motherboard)
  ○ Scheduler (Operating System)
● Strategies
  ○ Cloud Native on Google
  ○ Self-Managed Open Source
  ○ Self-Managed Commercial Source
  ○ Managed Commercial Source
Customers want options, so we decided to create a Framework that can scale with whatever Infrastructure and Software strategy they want to use.
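The computer analogy in the component list can be made concrete with a toy, in-memory sketch. It is only an illustration: the class name `MiniPlatform`, the function `run_once`, and the `customer` / `amount` fields are hypothetical and not part of the Framework itself. In a real platform each part would be a separate system (for example a queue such as Kafka and a store such as Cassandra).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MiniPlatform:
    """Toy stand-in for the major components; all names are hypothetical."""

    # Persistent Queue (RAM/BUS): events wait here until they are processed.
    queue: deque = field(default_factory=deque)
    # Persistent Storage (DISK/RAM): running totals keyed by customer.
    storage: dict = field(default_factory=dict)

    def ingest(self, event: dict) -> None:
        """Put one event on the queue."""
        self.queue.append(event)

    def process(self) -> int:
        """Queue Processing & Compute (CPU): drain the queue into storage."""
        processed = 0
        while self.queue:
            event = self.queue.popleft()
            key = event["customer"]
            self.storage[key] = self.storage.get(key, 0) + event["amount"]
            processed += 1
        return processed

    def report(self) -> list:
        """Reporting Engine (Display): totals sorted by customer."""
        return sorted(self.storage.items())


def run_once(platform: MiniPlatform, events: list) -> list:
    """Orchestration Framework (Motherboard): wire ingest -> process -> report.

    A Scheduler (Operating System) would decide when this function runs.
    """
    for event in events:
        platform.ingest(event)
    platform.process()
    return platform.report()
```

Because each responsibility sits behind its own method, any one of them could be swapped for a cloud-native, open source, or commercial implementation without changing the others, which is the point of the Strategies list.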
Framework
Approach
Approach
Setup
Training
Administration
Configuration
Knowledge
Approach
Sample STACK Outline
Framework
Platform Components
Resources
Platform
Setup
Training
Administration
Configuration
Knowledge
● Components
  ○ Infrastructure
    ■ Source / Git
    ■ GitHub
    ■ GitLab
    ■ Cloud / Public
    ■ AWS
    ■ Azure
    ■ GCP
    ■ DO
    ■ Orchestration
    ■ Terraform
    ■ Terraform / Atlantis
    ■ Configuration
    ■ Ansible
    ■ Ansible / AWX / Semaphore
  ○ Compute
    ■ DataStax / Spark
    ■ DataStax / Livy
    ■ Databricks
  ○ Data / Open Core
    ■ DataStax Enterprise
    ■ Cassandra
    ■ Search / Solr
    ■ Graph
    ■ Confluent Platform
  ○ Data / Cloud
    ■ DataStax / Astra
    ■ Confluent Cloud
  ○ Data / Open Source
    ■ Cassandra
    ■ Kafka
    ■ Elassandra
    ■ YugaByte
    ■ Scylla
    ■ Pulsar
  ○ Application
    ■ Airflow
    ■ Airbyte
    ■ Kafka Streams
    ■ Jupyter
    ■ Redash
    ■ Metabase
    ■ Superset
    ■ Zeppelin
Use Case: Standard Data Fabric
How Distributed Data Helps Drive Enterprise Consciousness
XDCR: Cross datacenter replication is the ultimate data fabric. Resilience, performance, availability, and scale. Made widely available by Cassandra and Couchbase.
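As a rough sketch of what cross datacenter replication looks like in practice: in Cassandra, replication across datacenters is declared per keyspace with `NetworkTopologyStrategy` and a replica count for each datacenter. The helper below only builds the CQL statement as a string. The function name `keyspace_cql`, the keyspace name, and the datacenter names are assumptions for illustration; real datacenter names must match the ones your cluster reports.

```python
def keyspace_cql(name: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement that replicates to several datacenters.

    `replicas_per_dc` maps a datacenter name to its replica count, for
    example {"us_east": 3, "eu_west": 3}. These names are hypothetical.
    """
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    # Sort for a stable, reviewable statement regardless of dict order.
    options = ", ".join(
        f"'{dc}': {count}" for dc, count in sorted(replicas_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', {options}}};"
    )


if __name__ == "__main__":
    print(keyspace_cql("fabric", {"us_east": 3, "eu_west": 3}))
```

Every table created in such a keyspace is then kept in each listed datacenter, which is what lets separate workloads read from their own datacenter.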
Modern Open Data Platform + Cool Database = Data Fabric
One cluster, many workloads. With any other “Data Warehouse”, this would be problematic. With Cassandra, this is a core feature.
How YugabyteDB allows us to go further…
All the benefits of XDCR and…
- More Data Density at High Speed
- YCQL Queries to support Non-Relational / C* CQL-like queries
- YSQL Queries to support Relational / SQL Queries
- Transactions/Consistency
- …
Let’s Get Data into a Database - Easier Today
Open Source:
- Airbyte and RudderStack make ETL easier and are open source.
- Kafka Connect / Pulsar IO can convert ETL into Streaming ETL.
SaaS/PaaS:
- SaaS like Stitch/HevoData
- Supported versions of Airbyte/RudderStack
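The difference between ETL and Streaming ETL can be sketched in a few lines. This is a tool-agnostic illustration, not how Airbyte, Kafka Connect, or Pulsar IO are configured: the function names, the `email` field, and the callables passed in are all hypothetical. The same transform is applied in both cases; only the unit of work changes from "the whole extract" to "one record as it arrives".

```python
from typing import Callable, Iterable


def transform(record: dict) -> dict:
    """Hypothetical transform: normalize the email field."""
    return {**record, "email": record["email"].strip().lower()}


def batch_etl(extract: Callable[[], list], load: Callable[[list], None]) -> int:
    """Batch ETL: pull everything, transform it, then load it in one call."""
    rows = [transform(r) for r in extract()]
    load(rows)
    return len(rows)


def streaming_etl(source: Iterable[dict], load_one: Callable[[dict], None]) -> int:
    """Streaming ETL: transform and load each record as it arrives."""
    count = 0
    for record in source:
        load_one(transform(record))
        count += 1
    return count
```

In the streaming version `source` can be any iterable, so an unbounded consumer loop fits the same shape as a finite list used in a test.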
Once It’s There, Serve It, Do More Processing
Open Source:
- Flink / Spark / Kafka Streams can be used to save Analytics / ML processed data.
- Hasura can help serve data as GraphQL; PostgREST can expose REST APIs.
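To give a feel for serving data over REST, here is a small sketch that builds a PostgREST-style request URL. PostgREST exposes each table as a path and expresses filters as `column=operator.value` query parameters, with `select` and `limit` controlling columns and row count. The helper name `postgrest_url`, the base URL, and the table and column names are assumptions for illustration only.

```python
from urllib.parse import urlencode


def postgrest_url(base_url, table, filters=None, select=None, limit=None):
    """Build a PostgREST-style URL for reading rows from a table.

    `filters` maps a column name to an (operator, value) pair,
    for example {"plan": ("eq", "pro")}.
    """
    params = []
    if select:
        params.append(("select", ",".join(select)))
    for column, (op, value) in (filters or {}).items():
        params.append((column, f"{op}.{value}"))
    if limit is not None:
        params.append(("limit", str(limit)))
    # Keep commas readable in the select list; other characters are escaped.
    query = urlencode(params, safe=",")
    return f"{base_url.rstrip('/')}/{table}" + (f"?{query}" if query else "")


if __name__ == "__main__":
    print(postgrest_url(
        "http://localhost:3000",          # hypothetical PostgREST endpoint
        "customers",                       # hypothetical table
        filters={"plan": ("eq", "pro")},
        select=["id", "email"],
        limit=10,
    ))
```

The resulting URL can be fetched with any HTTP client, which is what makes an auto-generated API convenient for no-code and low-code front ends.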
Let’s send it back via Reverse ETL!
Open Source:
- Grouparoo, Airbyte, and RudderStack are free. Others are paid.
- You can always use Kafka Connect / Pulsar IO to send data back also.
Reverse ETL is the process of copying data from a warehouse into business applications like CRM, analytics, and marketing automation software. You perform this process by using a reverse ETL tool that integrates with your data source and your business SaaS tools.
- Segment Blog, https://segment.com/blog/reverse-etl/
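The definition of Reverse ETL can be sketched as a short loop: read modeled rows from the warehouse, shape each one into the payload a business application expects, and push it. In this sketch an in-memory SQLite database stands in for the warehouse, and `push` stands in for a CRM API client. The table name `customer_metrics`, its columns, and the payload shape are all hypothetical.

```python
import sqlite3


def reverse_etl(conn, push):
    """Copy modeled rows from the warehouse into a business application.

    `conn` is a DB-API connection to the warehouse (SQLite here as a
    stand-in); `push` is any callable that delivers one payload, such as
    a wrapper around a CRM's HTTP API.
    """
    rows = conn.execute(
        "SELECT email, lifetime_value FROM customer_metrics ORDER BY email"
    ).fetchall()
    synced = 0
    for email, lifetime_value in rows:
        # Shape each row into the payload the (hypothetical) CRM expects.
        push({"email": email, "properties": {"lifetime_value": lifetime_value}})
        synced += 1
    return synced


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE customer_metrics (email TEXT, lifetime_value REAL)"
    )
    warehouse.execute("INSERT INTO customer_metrics VALUES ('a@example.com', 10.5)")
    reverse_etl(warehouse, print)
```

Passing `push` in as a parameter keeps the sync logic testable without network access; dedicated tools add the scheduling, retries, and change tracking this sketch leaves out.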
Let’s put it all together now - ONE DATA FABRIC
Cassandra isn’t the only database that does XDCR and can enable multiple workloads.
Yugabyte also offers a PostgreSQL-compatible layer.
Key Takeaways for Open Data Platforms
Don’t reinvent the wheel.
Prioritize DevOps / DataOps
Document the STACK
Identify the Objectives
- Identify the objectives so that you know what success looks like.
- DevOps / DataOps combined with a true agile approach allows you to iterate on your platform quickly.
- Put the data into a distributed data store that supports SQL/CQL, and possibly archive historical data into Parquet/Iceberg.
- Get the data out to your Systems using “Reverse ETL” tools.
Use open tools that are well supported.
Thank you and Dream Big.
Hire us
- Design Workshops
- Innovation Sprints
- Service Catalog
Anant.us
- Read our Playbook
- Join our Mailing List
- Read up on Data Platforms
- Watch our Videos
- Download Examples

Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms


Editor's Notes

  • #8–#11, #18–#20, #38–#46 What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  • #15, #25–#27, #31 Challenge: Currently the components are broken up into different vendors and parts. Similar to building a computer every time for every client.