Incrementally Streaming RDBMS Data to Your Data Lake Automagically


Incrementally streaming RDBMS data to your data lake automagically: using Apache NiFi to load Oracle data into Apache Hive, Apache Kudu, Apache Impala, and Apache HDFS.


APACHECON @HOME
Sept. 29th – Oct. 1st 2020

Incrementally Streaming RDBMS Data

5
Future of Data - Princeton
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...

6
Speakers
John Kuchmek
Senior Solutions Engineer

7
Speakers
Tim Spann
Principal DataFlow Field Engineer
@PaasDev
DZone Zone Leader and Big Data MVB
Princeton NJ Future of Data Meetup
https://github.com/tspannhw
https://www.datainmotion.dev/

8
JDBC Database to Apache Kudu / JDBC Database to HDFS and Hive

9
Trillions of Messages to SQL Databases and Data Warehouses
PROCESS DELIVER

10
https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_15.html

11
Reference Architecture
Files to RDBMS

‘QueryRecord’ Processor
https://medium.com/@abdelkrim.hadjidj/democratizing-nifi-record-processors-with-automatic-schemas-inference-4f2b2794c427
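
To illustrate the idea behind QueryRecord (running SQL over the records inside a flow file), here is a minimal Python sketch. It is not NiFi code: NiFi evaluates the query with its own SQL engine against a table named FLOWFILE, while this sketch loads records into an in-memory SQLite table of the same name. The helper `query_records`, the field names, and the query are all hypothetical.

```python
import sqlite3

def query_records(records, sql):
    """Run `sql` over a list of dict records exposed as a table named FLOWFILE.

    Assumption: every record has the same keys as the first one.
    """
    if not records:
        return []
    columns = list(records[0].keys())
    conn = sqlite3.connect(":memory:")
    try:
        # Untyped columns, so SQLite stores each value with its original type.
        col_defs = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f"CREATE TABLE FLOWFILE ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO FLOWFILE VALUES ({placeholders})",
            [tuple(r[c] for c in columns) for r in records],
        )
        cur = conn.execute(sql)
        names = [d[0] for d in cur.description]
        return [dict(zip(names, row)) for row in cur]
    finally:
        conn.close()

# Hypothetical records, as a record reader might produce them.
records = [
    {"id": 1, "status": "OPEN", "amount": 10.5},
    {"id": 2, "status": "CLOSED", "amount": 3.0},
    {"id": 3, "status": "OPEN", "amount": 7.0},
]
open_rows = query_records(
    records,
    "SELECT id, amount FROM FLOWFILE WHERE status = 'OPEN' ORDER BY id",
)
```

In NiFi the same filtering would be configured as a SQL property on the processor, with a record reader and writer handling the format conversion.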

ADVANCED XML PROCESSING
https://www.datainmotion.dev/2019/03/advanced-xml-processing-with-apache.html
https://pierrevillard.com/2018/06/28/nifi-1-7-xml-reader-writer-and-forkrecord-processor/

• Example
• Flat files on an FTP server named by date
• Downloads file
• HTTP REST API endpoint
• Invokes API and downloads data
• Legacy/Remote DB
• Performs SQL queries

15
DBCP Connection Pool to remote SQL Server
ExecuteSQLRecord processor
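
As a rough analogue of this slide, the sketch below runs a SELECT through a database connection and writes each result row as a JSON record, which is roughly what ExecuteSQLRecord does with its configured connection pool and record writer. It is an illustration only: SQLite stands in for the remote SQL Server, and the function name, table, and columns are hypothetical.

```python
import json
import sqlite3

def execute_sql_record(conn, sql, params=()):
    """Run a SELECT and return the result set as newline-delimited JSON records."""
    cur = conn.execute(sql, params)
    names = [d[0] for d in cur.description]
    return "\n".join(json.dumps(dict(zip(names, row))) for row in cur)

# Stand-in for the remote database behind the connection pool (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

output = execute_sql_record(conn, "SELECT id, name FROM customers ORDER BY id")
```

Swapping the JSON step for another serializer mirrors choosing a different record writer in the flow.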

16
https://www.datainmotion.dev/2019/03/implementing-streaming-use-case-from.html

INGEST RDBMS TABLES
https://community.cloudera.com/t5/Community-Articles/Incrementally-Streaming-RDBMS-Data-to-Your-Hadoop-DataLake/ta-p/247927
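
The "incremental" part of the title can be sketched as a high-water-mark query: remember the largest value seen in a monotonically increasing column and fetch only the rows above it on the next run. NiFi processors such as QueryDatabaseTable keep this state through a maximum-value column setting. The deck does not show its configuration, so the code below is a generic sketch under that assumption, with SQLite standing in for the source database and a hypothetical `orders` table.

```python
import sqlite3

def fetch_new_rows(conn, table, max_value_column, last_seen):
    """Return rows whose max_value_column exceeds last_seen, plus the new mark.

    Identifiers are interpolated directly, so `table` and `max_value_column`
    must come from trusted configuration, never from user input.
    """
    sql = (
        f'SELECT * FROM "{table}" '
        f'WHERE "{max_value_column}" > ? '
        f'ORDER BY "{max_value_column}"'
    )
    cur = conn.execute(sql, (last_seen,))
    names = [d[0] for d in cur.description]
    rows = [dict(zip(names, r)) for r in cur]
    # Keep the old mark when nothing new arrived.
    new_mark = rows[-1][max_value_column] if rows else last_seen
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "disk"), (2, "cable")])

first_batch, mark1 = fetch_new_rows(conn, "orders", "order_id", 0)

conn.execute("INSERT INTO orders VALUES (3, 'switch')")
second_batch, mark2 = fetch_new_rows(conn, "orders", "order_id", mark1)

# A third poll with no new rows returns nothing and leaves the mark unchanged.
third_batch, mark3 = fetch_new_rows(conn, "orders", "order_id", mark2)
```

This approach only captures inserts (and updates, if the tracked column is a last-modified timestamp); deleted rows are invisible to it.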

https://dzone.com/articles/lets-build-a-simple-ingest-to-cloud-data-warehouse
