Airbyte @ Airflow Summit - The new modern data stack


The document introduces the modern data stack of Airbyte, Airflow, and dbt. It discusses how ELT addresses issues with traditional ETL processes by separating extraction, loading, and transformation. Extraction and loading involve general-purpose routines to pull and push raw data, while transformation uses business logic specific to the organization. The stack is presented as an open solution that allows composing best-of-breed tools for each part of the data pipeline. Airbyte provides data integration, dbt enables data transformation with SQL, and Airflow handles scheduling. The demo shows how these tools can be combined to build a flexible, autonomous, and future-proof modern data stack.

The new modern
data stack
Powered by Airbyte, Airflow, dbt

Airbyte
Open Data Integration (ELT)
3,000 users
1,900 Slack members
3.5k GitHub stars
Hello!
I am Michel Tricot
co-founder & CEO of Airbyte
MichelTricot
michel-tricot
/in/micheltricot

Extract - Load - Transform
A new paradigm for modern data teams
1

Extract
Source-specific routines to
pull selected data from an
external system.
Transform
Business logic specific to
your organization to serve
an analytics or operational
use case.
Load
Destination specific
routines to push data
where it is going to be
consumed.

ETL doesn’t work in today’s world
Inflexible
● Friction when
changing an existing
pipeline.
● Hard to add new data.
● Most issues force data
to be re-extracted.
Lack of Autonomy
● Warehouses made data
consumers more autonomous.
● Changes require engineering
involvement.
Complex
● Custom DSL.
● Forces adoption of a
data stack.
● Addresses 70% of the
needs; 30% is still built
and maintained
in-house.

Extract
General-purpose routines
to pull selected data from a
source.
Load
General-purpose routines
to push raw data where it is
going to be consumed.
Transform
Business logic specific to
your organization to serve
an analytics or operational
use case with SQL / dbt / ...
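
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the destination warehouse, and the records, table name (`raw_orders`), view name (`revenue_by_customer`), and function names are all hypothetical. It is not how Airbyte or dbt are implemented.

```python
import sqlite3

# Hypothetical source records, standing in for what a source connector would pull.
SOURCE_ORDERS = [
    (1, "acme", 1250, "paid"),
    (2, "acme", 300, "refunded"),
    (3, "globex", 990, "paid"),
]


def extract():
    """Extract: pull the selected records from the source, unchanged."""
    return list(SOURCE_ORDERS)


def load(conn, records):
    """Load: push raw records to the destination with no business logic applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", records)


def transform(conn):
    """Transform: organization-specific logic, written in SQL, run in the destination."""
    conn.execute("DROP VIEW IF EXISTS revenue_by_customer")
    conn.execute(
        """
        CREATE VIEW revenue_by_customer AS
        SELECT customer, SUM(amount_cents) AS revenue_cents
        FROM raw_orders
        WHERE status = 'paid'
        GROUP BY customer
        """
    )


def run_elt():
    """Run the three steps in order and return the open destination connection."""
    conn = sqlite3.connect(":memory:")
    load(conn, extract())
    transform(conn)
    return conn
```

Because `raw_orders` keeps every record, including the refunded one, changing the business rule only means editing and re-running `transform`; nothing has to be extracted again.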

ELT fixes the ETL-related issues
Flexibility
● All the data available on
the destination.
● Data consumers are
free to use what they
need for the insights
they want.
Autonomy
● Data consumers can leverage
SQL queries to transform the
data the way they want.
● No need to involve the
engineering team.
Future proof
● Issues during
transformation don’t
prevent access to the
data.
● Easy to update
transformation
schemas.

The open modern-stack
Compose with best of breed
2

All companies have custom data needs
We are moving away from
horizontal, end-to-end solutions
The world of data is changing
● Budget
● Security
● Privacy
● Types
● Skills
● Scale
● Turnaround
● Compliance
● ...

Select the best building blocks
● Extract & Load
● Transformation
● BI / Visualization
● Quality
● Observability
● SQL
● Data warehouse
● Infrastructure

Airbyte-dbt-Airflow
To rule them all
3

● Open-Source data integration platform
● 80 high-quality data connectors (targeting 200 by EOY)
● Connector Development Kit to build your own data connector
● Active Slack community
● Super fun mascot: Octavia Squidington III
Airbyte: Solving Data Integration

An open data platform for modern teams
[Architecture diagram. Labels: Source, Destination, Data Exchange Protocol, Connectors, Connector Development Kit, Incremental updates, Platform, Scheduler, Config & State, API, UI, CLI, Monitoring]
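
The "Incremental updates" and "Config & State" labels suggest that a sync remembers how far it got and only moves newer records on the next run. The sketch below illustrates that general idea with a cursor. It is an assumption-laden toy, not Airbyte's protocol or state format: `SyncState`, `incremental_sync`, and the `updated_at` field are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SyncState:
    """Cursor saved between runs (hypothetical shape, not Airbyte's state format)."""
    cursor: int = 0


# Hypothetical source rows with a monotonically increasing update marker.
SOURCE_ROWS = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 105},
    {"id": 3, "updated_at": 110},
]


def incremental_sync(source_rows, state, destination):
    """Copy only rows newer than the saved cursor, then advance the cursor.

    Returns the number of rows copied in this run.
    """
    new_rows = [row for row in source_rows if row["updated_at"] > state.cursor]
    destination.extend(new_rows)
    if new_rows:
        state.cursor = max(row["updated_at"] for row in new_rows)
    return len(new_rows)
```

A first run copies everything; a second run with no source changes copies nothing, because the saved cursor already covers every row.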

● Open-Source data transformation application
○ Transform SQL to tables, views
○ Cascades transformations
● Supports most SQL Databases & Warehouses
● Brings software engineering best practices to Analytics Engineers
○ Version control
○ Testing
○ Packaging
dbt: Solving Data Transformation
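
To make "Transform SQL to tables, views" and "Cascades transformations" concrete, here is a small Python sketch that mimics the idea with SQLite. It does not use dbt or its API: the `MODELS` dictionary, the model names, and the helper functions are hypothetical. In dbt itself, each model is a SQL file and the build order is inferred from `ref()` calls, whereas here the order is simply the order of the dictionary.

```python
import sqlite3

# Hypothetical models: name -> (materialization, SELECT statement).
# Listed in dependency order: customer_revenue selects from stg_orders.
MODELS = {
    "stg_orders": (
        "view",
        "SELECT id, customer, amount_cents FROM raw_orders WHERE status = 'paid'",
    ),
    "customer_revenue": (
        "table",
        "SELECT customer, SUM(amount_cents) AS revenue_cents "
        "FROM stg_orders GROUP BY customer",
    ),
}


def make_warehouse():
    """Create an in-memory database holding already-loaded raw data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders "
        "(id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [(1, "acme", 1250, "paid"), (2, "acme", 300, "refunded"), (3, "globex", 990, "paid")],
    )
    return conn


def build_models(conn):
    """Materialize each model as a view or a table, in order."""
    for name, (materialization, select_sql) in MODELS.items():
        kind = materialization.upper()
        conn.execute(f"DROP {kind} IF EXISTS {name}")
        conn.execute(f"CREATE {kind} {name} AS {select_sql}")


def not_null_failures(conn, model, column):
    """Count rows where the column is NULL, in the spirit of dbt's not_null test."""
    query = f"SELECT COUNT(*) FROM {model} WHERE {column} IS NULL"
    (count,) = conn.execute(query).fetchone()
    return count
```

Keeping the models as plain SQL text is what makes the version control, testing, and packaging practices listed above applicable to analytics work.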

Airflow: Solving Scheduling
Do I really need to tell you what Airflow is?
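
For readers who do want a reminder, the core job of a scheduler in this stack is to run tasks in dependency order: sync the data first, then transform it, then test it. The sketch below shows only that ordering idea using Python's standard library. It is not Airflow's API, and the task names (`airbyte_sync`, `dbt_run`, `dbt_test`) are hypothetical placeholders.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "airbyte_sync": set(),
    "dbt_run": {"airbyte_sync"},
    "dbt_test": {"dbt_run"},
}


def run_pipeline(dependencies, actions):
    """Run each task's action only after all of its dependencies have run.

    Returns the order in which tasks were executed.
    """
    order = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()
        order.append(task)
    return order
```

A real scheduler adds what this sketch leaves out, such as time-based triggers, retries, and monitoring.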

Reverse ETL
Data Warehouse
Extract Load Transform Activate
...
BI/Visualization
...

Demo
All source code & configurations available on GitHub

Any questions?
MichelTricot
slack.airbyte.io (@Michel)
airbytehq/airbyte
Thanks!


91mobiles

Search and Society: Reimagining Information Access for Radical Futures

Bhaskar Mitra

The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.

Essentials of Automations: Optimizing FME Workflows with Parameters

Safe Software

Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place. Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects. Here’s what you’ll gain: - Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows. - Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy. - Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency. - Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity. We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic. Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Thierry Lestable

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

FIDO Alliance

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...

Jeffrey Haguewood

Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows. We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases. This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams. Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.

DevOps and Testing slides at DASA Connect

Kari Kakkonen

When stars align: studies in data quality, knowledge graphs, and machine lear...

Elena Simperl

The Future of Platform Engineering

Jemma Hussein Allen

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...

Product School

How world-class product teams are winning in the AI era by CEO and Founder, P...

Product School

Assuring Contact Center Experiences for Your Customers With ThousandEyes

ThousandEyes

PHP Frameworks: I want to break free (IPC Berlin 2024)

Ralf Eggert

In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development. This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.

Mission to Decommission: Importance of Decommissioning Products to Increase E...

Product School

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality

To Graph or Not to Graph Knowledge Graph Architectures and LLMs

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf

The Art of the Pitch: WordPress Relationships and Sales

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf

Search and Society: Reimagining Information Access for Radical Futures

Essentials of Automations: Optimizing FME Workflows with Parameters

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...

DevOps and Testing slides at DASA Connect

When stars align: studies in data quality, knowledge graphs, and machine lear...

The Future of Platform Engineering

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...

How world-class product teams are winning in the AI era by CEO and Founder, P...

Assuring Contact Center Experiences for Your Customers With ThousandEyes

PHP Frameworks: I want to break free (IPC Berlin 2024)

Mission to Decommission: Importance of Decommissioning Products to Increase E...

Airbyte @ Airflow Summit - The new modern data stack

1. The new modern data stack Powered by Airbyte, Airflow, dbt

2. Airbyte: Open Data Integration (ELT). 3,000 users, 1,900 Slack members, 3.5k GitHub stars. Hello! I am Michel Tricot, co-founder & CEO of Airbyte. MichelTricot, michel-tricot, /in/micheltricot

3. Section 1: Extract - Load - Transform. A new paradigm for modern data teams

4. Extract: Source-specific routines to pull selected data from an external system.
Transform: Business logic specific to your organization to serve an analytics or operational use case.
Load: Destination-specific routines to push data where it is going to be consumed.

5. ETL doesn’t work in today’s world
Inflexible
● Friction when changing an existing pipeline.
● Hard to add new data.
● Most issues force data to be re-extracted.
Lack of Autonomy
● Warehouses made data consumers more autonomous.
● Changes require engineering involvement.
Complex
● Custom DSL.
● Force adoption of a data stack.
● Address 70% of the needs, 30% still built and maintained in-house.

6. Extract: General-purpose routines to pull selected data from a source.
Load: General-purpose routines to push raw data where it is going to be consumed.
Transform: Business logic specific to your organization to serve an analytics or operational use case with SQL / dbt / ...
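
The split described on this slide can be sketched in a few lines. This is a minimal illustration, not Airbyte or dbt code: it uses Python's standard `sqlite3` module as a stand-in warehouse, and the names `SOURCE_ROWS`, `raw_orders`, and `revenue_by_customer` are hypothetical. Extract and load move records untouched; the only business logic sits in the SQL view.

```python
import sqlite3

# Hypothetical source records; a real source would be an API or a database.
SOURCE_ROWS = [
    {"id": 1, "customer": "acme", "amount_cents": 1250},
    {"id": 2, "customer": "acme", "amount_cents": 750},
    {"id": 3, "customer": "globex", "amount_cents": 3000},
]


def extract():
    """General-purpose extract: yield records as-is, with no business logic."""
    yield from SOURCE_ROWS


def load(conn, rows):
    """General-purpose load: push raw records into the destination untouched."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:id, :customer, :amount_cents)",
        list(rows),
    )


def transform(conn):
    """Business logic lives in SQL and runs inside the destination."""
    conn.execute("DROP VIEW IF EXISTS revenue_by_customer")
    conn.execute(
        """
        CREATE VIEW revenue_by_customer AS
        SELECT customer, SUM(amount_cents) / 100.0 AS revenue
        FROM raw_orders
        GROUP BY customer
        """
    )


def run_elt():
    conn = sqlite3.connect(":memory:")
    load(conn, extract())
    transform(conn)
    return conn.execute(
        "SELECT customer, revenue FROM revenue_by_customer ORDER BY customer"
    ).fetchall()
```

Because the raw table stays in the destination, `transform` can be rewritten and rerun without calling `extract` again.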

7. ELT fixes the ETL-related issues
Flexibility
● All the data available on the destination.
● Data consumers are free to use what they need for the insights they want.
Autonomy
● Data consumers can leverage SQL queries to transform the data the way they want.
● No need to involve the engineering team.
Future proof
● Issues during transformation don’t prevent access to the data.
● Easy to update transformation schemas.

8. Section 2: The open modern-stack. Compose with best of breed

9. All companies have custom data needs. We are moving away from horizontal, end-to-end solutions.
The world of data is changing
● Budget
● Security
● Privacy
● Types
● Skills
● Scale
● Turnaround
● Compliance
● ...

10. Select the best building blocks. [Diagram. Labels: Extract & Load, Transformation, BI / Visualization, Quality, Observability, SQL, Data warehouse, Infrastructure]

11. Composed with the best glue

12. Section 3: Airbyte-dbt-Airflow. To rule them all

13. Airbyte: Solving Data Integration
● Open-Source data integration platform
● 80 high-quality data connectors (targeting 200 by EOY)
● Connector Development Kit to build your own data connector
● Active Slack community
● Super fun mascot: Octavia Squidington III

14. An open data platform for modern teams. [Architecture diagram. Labels: Sources, Destinations, Data Exchange Protocol, Scheduler, Config & State, API, UI, CLI, Connectors, Platform, Monitoring, Incremental updates, Connector Development Kit]
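
The diagram lists "Config & State" and "Incremental updates" without showing how they relate. One common way to connect them is a saved cursor per stream, sketched below. This is an assumption for illustration only, not Airbyte's protocol or API: `SyncState`, `Destination`, the `updated_at` cursor field, and `sync` are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SyncState:
    """Hypothetical per-stream state: the highest cursor value already synced."""
    cursor: int = 0


@dataclass
class Destination:
    """Hypothetical destination that simply accumulates rows."""
    rows: list = field(default_factory=list)


def read_incremental(source_rows, state):
    """Return only records whose cursor field is newer than the saved state."""
    return [r for r in source_rows if r["updated_at"] > state.cursor]


def sync(source_rows, destination, state):
    """Copy new records to the destination and advance the cursor."""
    new_rows = read_incremental(source_rows, state)
    destination.rows.extend(new_rows)
    if new_rows:
        # Advance the high-water mark only after the rows are written.
        state.cursor = max(r["updated_at"] for r in new_rows)
    return len(new_rows)
```

Running `sync` twice against an unchanged source copies nothing the second time, which is the point of keeping state between runs.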

15. dbt: Solving Data Transformation
● Open-Source data transformation application
○ Transform SQL to tables, views
○ Cascades transformations
● Supports most SQL Databases & Warehouses
● Brings software engineering best practices to Analytics Engineers
○ Version control
○ Testing
○ Packaging
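
"Transform SQL to views" and "cascades transformations" can be illustrated with a rough sketch: each model is a SELECT statement, and models are built in dependency order so that one can read from another. This is not dbt's implementation. It assumes Python 3.9+ for `graphlib`, uses `sqlite3` as a stand-in warehouse, and the model names and the `MODELS` layout are hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical models: name -> (upstream model names, SELECT statement).
MODELS = {
    "top_customers": (
        ("customer_revenue",),
        "SELECT customer FROM customer_revenue WHERE revenue >= 25",
    ),
    "customer_revenue": (
        ("stg_orders",),
        "SELECT customer, SUM(amount) AS revenue FROM stg_orders GROUP BY customer",
    ),
    "stg_orders": (
        (),
        "SELECT id, customer, amount_cents / 100.0 AS amount FROM raw_orders",
    ),
}


def build_models(conn, models):
    """Create one view per model, upstream models first."""
    graph = {name: set(deps) for name, (deps, _sql) in models.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        conn.execute(f"DROP VIEW IF EXISTS {name}")
        conn.execute(f"CREATE VIEW {name} AS {models[name][1]}")
    return order


def demo():
    """Load a few raw rows, build the models, and query the last one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(1, "acme", 1250), (2, "acme", 750), (3, "globex", 3000)],
    )
    order = build_models(conn, MODELS)
    return order, conn.execute("SELECT customer FROM top_customers").fetchall()
```

The models are deliberately declared out of order in `MODELS`; the topological sort is what puts `stg_orders` first.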

16. Airflow: Solving Scheduling. Do I really need to tell you what Airflow is?
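
For readers who do need a reminder, the core idea of an orchestrator is to run tasks in dependency order and stop downstream work when an upstream step fails. The sketch below imitates that behaviour in plain Python. It is not Airflow's API: `run_pipeline` and the task names are hypothetical, and the state strings are borrowed from Airflow's task state names only as labels.

```python
def run_pipeline(tasks, upstream):
    """Run tasks in listed order; skip a task when an upstream task did not succeed.

    tasks: dict of task name -> zero-argument callable, listed in a valid run order.
    upstream: dict of task name -> list of task names it depends on.
    """
    states = {}
    for name, fn in tasks.items():
        deps = upstream.get(name, [])
        if any(states.get(dep) != "success" for dep in deps):
            # An upstream task failed or was skipped, so do not run this one.
            states[name] = "upstream_failed"
            continue
        try:
            fn()
            states[name] = "success"
        except Exception:
            states[name] = "failed"
    return states


def sync_sources():
    """Stand-in for an extract-and-load step."""


def run_models():
    """Stand-in for a transformation step that fails."""
    raise RuntimeError("model failed to build")


def refresh_dashboards():
    """Stand-in for a downstream consumer."""
```

Calling `run_pipeline` with `sync_sources`, `run_models`, and `refresh_dashboards` chained in that order marks the first as succeeded, the second as failed, and the third as never run. In a real deployment each step would be an Airflow task and the scheduler would handle ordering, retries, and timing.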

17. [Pipeline diagram. Labels: Reverse ETL, Data Warehouse, Extract, Load, Transform, Activate, ..., BI/Visualization, ...]

18. Demo. All source code & configurations available on GitHub

19. Any questions? MichelTricot slack.airbyte.io(@Michel) airbytehq/airbyte Thanks! 20
