In this talk, Dremio Developer Advocate, Alex Merced, discusses strategies for migrating your existing data over to Apache Iceberg. He'll go over the following:
How to Migrate Hive, Delta Lake, JSON, and CSV sources to Apache Iceberg
Pros and Cons of an In-place or Shadow Migration
Migrating between Apache Iceberg catalogs Hive/Glue -- Arctic/Nessie
Tomer Shiran est le fondateur et chef de produit (CPO) de Dremio. Tomer était le 4e employé et vice-président produit de MapR, un pionnier de l'analyse du Big Data. Il a également occupé de nombreux postes de gestion de produits et d'ingénierie chez IBM Research et Microsoft, et a fondé plusieurs sites Web qui ont servi des millions d'utilisateurs. Il est titulaire d'un Master en génie informatique de l'Université Carnegie Mellon et d'un Bachelor of Science en informatique du Technion - Israel Institute of Technology.
Le Modern Data Stack meetup est ravi d'accueillir Tomer Shiran. Depuis Apache Drill, Apache Arrow maintenant Apache Iceberg, il ancre avec ses équipes des choix pour Dremio avec une vision de la plateforme de données “ouverte” basée sur des technologies open source. En plus, de ces valeurs qui évitent le verrouillage de clients dans des formats propriétaires, il a aussi le souci des coûts qu’engendrent de telles plateformes. Il sait aussi proposer un certain nombre de fonctionnalités qui transforment la gestion de données grâce à des initiatives telles Nessie qui ouvre la route du Data As Code et du transactionnel multi-processus.
Le Modern Data Stack Meetup laisse “carte blanche” à Tomer Shiran afin qu’il nous partage son expérience et sa vision quant à l’Open Data Lakehouse.
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Hige Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what is APache Hudi and its architectural design, and then deep dive to improving data operations by providing features such as data versioning, time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources, capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, cloud migration or even applying stream processing, among others.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
Tomer Shiran est le fondateur et chef de produit (CPO) de Dremio. Tomer était le 4e employé et vice-président produit de MapR, un pionnier de l'analyse du Big Data. Il a également occupé de nombreux postes de gestion de produits et d'ingénierie chez IBM Research et Microsoft, et a fondé plusieurs sites Web qui ont servi des millions d'utilisateurs. Il est titulaire d'un Master en génie informatique de l'Université Carnegie Mellon et d'un Bachelor of Science en informatique du Technion - Israel Institute of Technology.
Le Modern Data Stack meetup est ravi d'accueillir Tomer Shiran. Depuis Apache Drill, Apache Arrow maintenant Apache Iceberg, il ancre avec ses équipes des choix pour Dremio avec une vision de la plateforme de données “ouverte” basée sur des technologies open source. En plus, de ces valeurs qui évitent le verrouillage de clients dans des formats propriétaires, il a aussi le souci des coûts qu’engendrent de telles plateformes. Il sait aussi proposer un certain nombre de fonctionnalités qui transforment la gestion de données grâce à des initiatives telles Nessie qui ouvre la route du Data As Code et du transactionnel multi-processus.
Le Modern Data Stack Meetup laisse “carte blanche” à Tomer Shiran afin qu’il nous partage son expérience et sa vision quant à l’Open Data Lakehouse.
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Hige Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what is APache Hudi and its architectural design, and then deep dive to improving data operations by providing features such as data versioning, time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources, capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, cloud migration or even applying stream processing, among others.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. Strict latency requirements to process old and recently generated events made this architecture popular. The key downside to this architecture is the development and operational overhead of managing two different systems.
There have been attempts to unify batch and streaming into a single system in the past. Organizations have not been that successful though in those attempts. But, with the advent of Delta Lake, we are seeing lot of engineers adopting a simple continuous data flow model to process data as it arrives. We call this architecture, The Delta Architecture.
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an alternative database platform that works with Hive and Spark.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is a open-source storage layer that brings ACID transactions to Apache Spark and big data workloads Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiples ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline that solves your business needs in the most resource efficient manner.
In this talk, I am going examine a number common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem can automatically much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...Databricks
The reality of most large scale data deployments includes storage decoupled from computation, pipelines operating directly on files and metadata services with no locking mechanisms or transaction tracking. For this reason attempts at achieving transactional behavior, snapshot isolation, safe schema evolution or performant support for CRUD operations has always been marred with tradeoffs.
This talk will focus on technical aspects, practical capabilities and the potential future of three table formats that have emerged in recent years as solutions to the issues mentioned above – ACID ORC (in Hive 3.x), Iceberg and Delta Lake. To provide a richer context, a comparison between traditional databases and big data tools as well as an overview of the reasons for the current state of affairs will be included.
After the talk, the audience is expected to have a clear understanding of the current development trends in large scale table formats, on the conceptual and practical level. This should allow the attendees to make better informed assessments about which approaches to data warehousing, metadata management and data pipelining they should adapt in their organizations.
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
We have covered the need for CDC and the benefits of building a CDC pipeline. We will compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in the production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail and our contributions to the open-source community.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache SparkTM, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
Enabling a Data Mesh Architecture with Data VirtualizationDenodo
Watch full webinar here: https://bit.ly/3rwWhyv
The Data Mesh architectural design was first proposed in 2019 by Zhamak Dehghani, principal technology consultant at Thoughtworks, a technology company that is closely associated with the development of distributed agile methodology. A data mesh is a distributed, de-centralized data infrastructure in which multiple autonomous domains manage and expose their own data, called “data products,” to the rest of the organization.
Organizations leverage data mesh architecture when they experience shortcomings in highly centralized architectures, such as the lack domain-specific expertise in data teams, the inflexibility of centralized data repositories in meeting the specific needs of different departments within large organizations, and the slow nature of centralized data infrastructures in provisioning data and responding to changes.
In this session, Pablo Alvarez, Global Director of Product Management at Denodo, explains how data virtualization is your best bet for implementing an effective data mesh architecture.
You will learn:
- How data mesh architecture not only enables better performance and agility, but also self-service data access
- The requirements for “data products” in the data mesh world, and how data virtualization supports them
- How data virtualization enables domains in a data mesh to be truly autonomous
- Why a data lake is not automatically a data mesh
- How to implement a simple, functional data mesh architecture using data virtualization
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots.
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker
Ryan Blue, Software Engineer, Netflix
How to Take Advantage of an Enterprise Data Warehouse in the CloudDenodo
Watch full webinar here: [https://buff.ly/2CIOtys]
As organizations collect increasing amounts of diverse data, integrating that data for analytics becomes more difficult. Technology that scales poorly and fails to support semi-structured data fails to meet the ever-increasing demands of today’s enterprise. In short, companies everywhere can’t consolidate their data into a single location for analytics.
In this Denodo DataFest 2018 session we’ll cover:
Bypassing the mandate of a single enterprise data warehouse
Modern data sharing to easily connect different data types located in multiple repositories for deeper analytics
How cloud data warehouses can scale both storage and compute, independently and elastically, to meet variable workloads
Presentation by Harsha Kapre, Snowflake
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...Altinity Ltd
OSA Con 2022: Apache Iceberg: An Architectural Look Under the Covers
Alex Merced - Dremio
The data lakehouse is one of the most exciting trends in the data space promising to merge the best aspects of data lakes and data warehouses without either of their problems. Open source tech is making this promise a reality and in this talk Dremio Developer Advocate, Alex Merced, explores these technologies.
In this talk Alex Merced will cover:
- What is a Data Lakehouse?
- Why open matters in preserving the promise of lakehouses (better costs, vendor freedom, data freedom)
- What are technologies that enable lakehouses like Apache Iceberg, Apache Parquet, Apache Arrow and Project Nessie
Redefining tables online without surprisesNelson Calero
The Oracle database includes several features to allow moving data online, ie: without preventing users to access it when it is being moved (DML operation are not blocked).
One of those features is to change a table definition, using the package DBMS_REDEFINITION.
While moving a table is an online operation since version 12.2, redefinition is still needed for some changes. Also is needed in older versions.
In this session best practices will be shown based on experience of using it with big tablespaces, with examples covering all the steps needed to use DBMS_REDEFINITION under different scenarios, including the problems you can find, how to resolve them and how this process is different in version 11.2 and 12.
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. Strict latency requirements to process old and recently generated events made this architecture popular. The key downside to this architecture is the development and operational overhead of managing two different systems.
There have been attempts to unify batch and streaming into a single system in the past. Organizations have not been that successful though in those attempts. But, with the advent of Delta Lake, we are seeing lot of engineers adopting a simple continuous data flow model to process data as it arrives. We call this architecture, The Delta Architecture.
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an alternative database platform that works with Hive and Spark.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
Flink Forward San Francisco 2022.
With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi.
by
Ethan Guo & Kyle Weller
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is a open-source storage layer that brings ACID transactions to Apache Spark and big data workloads Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiples ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline that solves your business needs in the most resource efficient manner.
In this talk, I am going examine a number common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem can automatically much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...Databricks
The reality of most large scale data deployments includes storage decoupled from computation, pipelines operating directly on files and metadata services with no locking mechanisms or transaction tracking. For this reason attempts at achieving transactional behavior, snapshot isolation, safe schema evolution or performant support for CRUD operations has always been marred with tradeoffs.
This talk will focus on technical aspects, practical capabilities and the potential future of three table formats that have emerged in recent years as solutions to the issues mentioned above – ACID ORC (in Hive 3.x), Iceberg and Delta Lake. To provide a richer context, a comparison between traditional databases and big data tools as well as an overview of the reasons for the current state of affairs will be included.
After the talk, the audience is expected to have a clear understanding of the current development trends in large scale table formats, on the conceptual and practical level. This should allow the attendees to make better informed assessments about which approaches to data warehousing, metadata management and data pipelining they should adapt in their organizations.
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
We have covered the need for CDC and the benefits of building a CDC pipeline. We will compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in the production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry and Debezium in detail and our contributions to the open-source community.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache SparkTM, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
Enabling a Data Mesh Architecture with Data VirtualizationDenodo
Watch full webinar here: https://bit.ly/3rwWhyv
The Data Mesh architectural design was first proposed in 2019 by Zhamak Dehghani, principal technology consultant at Thoughtworks, a technology company that is closely associated with the development of distributed agile methodology. A data mesh is a distributed, de-centralized data infrastructure in which multiple autonomous domains manage and expose their own data, called “data products,” to the rest of the organization.
Organizations leverage data mesh architecture when they experience shortcomings in highly centralized architectures, such as the lack domain-specific expertise in data teams, the inflexibility of centralized data repositories in meeting the specific needs of different departments within large organizations, and the slow nature of centralized data infrastructures in provisioning data and responding to changes.
In this session, Pablo Alvarez, Global Director of Product Management at Denodo, explains how data virtualization is your best bet for implementing an effective data mesh architecture.
You will learn:
- How data mesh architecture not only enables better performance and agility, but also self-service data access
- The requirements for “data products” in the data mesh world, and how data virtualization supports them
- How data virtualization enables domains in a data mesh to be truly autonomous
- Why a data lake is not automatically a data mesh
- How to implement a simple, functional data mesh architecture using data virtualization
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots.
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker
Ryan Blue, Software Engineer, Netflix
How to Take Advantage of an Enterprise Data Warehouse in the CloudDenodo
Watch full webinar here: [https://buff.ly/2CIOtys]
As organizations collect increasing amounts of diverse data, integrating that data for analytics becomes more difficult. Technology that scales poorly and fails to support semi-structured data fails to meet the ever-increasing demands of today’s enterprise. In short, companies everywhere can’t consolidate their data into a single location for analytics.
In this Denodo DataFest 2018 session we’ll cover:
Bypassing the mandate of a single enterprise data warehouse
Modern data sharing to easily connect different data types located in multiple repositories for deeper analytics
How cloud data warehouses can scale both storage and compute, independently and elastically, to meet variable workloads
Presentation by Harsha Kapre, Snowflake
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...Altinity Ltd
OSA Con 2022: Apache Iceberg: An Architectural Look Under the Covers
Alex Merced - Dremio
The data lakehouse is one of the most exciting trends in the data space promising to merge the best aspects of data lakes and data warehouses without either of their problems. Open source tech is making this promise a reality and in this talk Dremio Developer Advocate, Alex Merced, explores these technologies.
In this talk Alex Merced will cover:
- What is a Data Lakehouse?
- Why open matters in preserving the promise of lakehouses (better costs, vendor freedom, data freedom)
- What are technologies that enable lakehouses like Apache Iceberg, Apache Parquet, Apache Arrow and Project Nessie
Redefining tables online without surprisesNelson Calero
The Oracle database includes several features to allow moving data online, ie: without preventing users to access it when it is being moved (DML operation are not blocked).
One of those features is to change a table definition, using the package DBMS_REDEFINITION.
While moving a table is an online operation since version 12.2, redefinition is still needed for some changes. Also is needed in older versions.
In this session best practices will be shown based on experience of using it with big tablespaces, with examples covering all the steps needed to use DBMS_REDEFINITION under different scenarios, including the problems you can find, how to resolve them and how this process is different in version 11.2 and 12.
SF Big Analytics 2020-07-28
Anecdotal history of Data Lake and various popular implementation framework. Why certain tradeoff was made to solve the problems, such as cloud storage, incremental processing, streaming and batch unification, mutable table, ...
Metail at Cambridge AWS User Group Main Meetup #3Gareth Rogers
An overview of Metail's data pipeline architecture with particular emphasis on our event collection and batch processing layer, based on Hadoop. A brief overview of our use of business intelligence and data science sections is also given.
Migrating minimal databases with minimal downtime to AWS RDS, Amazon Redshift and Amazon Aurora
Migration of databases to same and different engines and from on premise to cloud
Schema conversion from Oracle and SQL Server to MySQL and Aurora
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Similar to Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg (20)
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137Anant Corporation
Discussion of LLM fine-tuning with an overview of fine-tuning types and datasets: specifically we will talk about the method that we used to turn an existing collection of Cassandra information into a set of instructions and responses that we can use for fine tuning.
What's AGI? How is it different from an Agent or an AI Assistant? If you're looking to understand how AI Agents/AGI can help your company, check this out.
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotAnant Corporation
In this meetup, we will introduce the concepts of Real Time Analytics, why it is important, the evolution of Analytics, and how companies such as LinkedIn, Stripe, Uber and more are using Real Time analytics to grow their audience and improve usability by using Apache Pinot. What is Apache Pinot? Followed by Demo and Q&A.
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...Anant Corporation
Series: Using AI / ChatGPT at Work - GPT Automation
Are you a small business owner or web developer interested in leveraging the power of GPT (Generative Pretrained Transformer) technology to enhance your business processes? If so, Join us for a series of events focused on using GPT in business. Whether you're a small business owner or a web developer, you'll learn how to leverage GPT to improve your workflow and provide better services to your customers.
GPT Automation: What it is and How it Works
How Time-Saving GPT Automation Can Improve Your Business
Cost-Effective GPT Automation: How it Can Save Your Business Money
Using GPT Automation for Customer Service: Benefits and Best Practices
The Power of GPT Automation for Content Creation
Data Analysis Made Easy with GPT Automation
Top GPT-3 Automation Tools for Businesses
The Ethical Considerations of GPT Automation
Overcoming Bias in GPT Automation: Best Practices
The Future of GPT Automation: Trends and Predictions
Since we focus on "no code" here, we'll explore the tools that are already out there such as ChatGPT plugins for Chrome, OpenAI GPT API, low-code/no-code platforms like Make/Integromat and Zapier, existing apps like Jasper/Rytr, and ecosystem tools like Everyprompt. We'll also discuss the resources available for those interested in learning more about GPT, including other people’s prompts.
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAnant Corporation
Episode 3: The LLM / GPT / AI Prompt / Data Engineer Roadmap
In this episode, we'll discuss the history, fundamentals, and the different flavors of LLMs available, beyond GPT/ChatGPT. This is a dry run of a session that will be on a LLM Bootcamp ( Fill out the survey on the link if you are interested in an in-person vs. virtual session)
Intro / Fundamentals of LLM
LLM Foundations
History of LLMs
Tuning, Training, or "In Context Learning" with LLMs
What is "Prompt Engineering"
Case for Augmenting LLMs
Series: Using AI / ChatGPT at Work - GPT Automation
Are you a small business owner or web developer interested in leveraging the power of GPT (Generative Pretrained Transformer) technology to enhance your business processes? If so, Join us for a series of events focused on using GPT in business. Whether you're a small business owner or a web developer, you'll learn how to leverage GPT to improve your workflow and provide better services to your customers.
GPT Automation: What it is and How it Works
How Time-Saving GPT Automation Can Improve Your Business
Cost-Effective GPT Automation: How it Can Save Your Business Money
Using GPT Automation for Customer Service: Benefits and Best Practices
The Power of GPT Automation for Content Creation
Data Analysis Made Easy with GPT Automation
Top GPT-3 Automation Tools for Businesses
The Ethical Considerations of GPT Automation
Overcoming Bias in GPT Automation: Best Practices
The Future of GPT Automation: Trends and Predictions
Since we focus on "no code" here, we'll explore the tools that are already out there such as ChatGPT plugins for Chrome, OpenAI GPT API, low-code/no-code platforms like Make/Integromat and Zapier, existing apps like Jasper/Rytr, and ecosystem tools like Everyprompt. We'll also discuss the resources available for those interested in learning more about GPT, including other people’s prompts.
In Apache Cassandra Lunch #131: YugabyteDB Developer Tools, we discussed third party developer tools that are compatible with YugabyteDB. We talked about using Yugabyte Developer Tools for data visualization and schema management. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST.
Developer tools play a critical role in simplifying and streamlining database development and management. They allow developers and administrators to be more productive, reducing the time and effort required to create and maintain database schemas, write SQL queries, test database performance, and enable collaboration. Developer tools also make it possible to track changes over time, improving the ability to manage the entire development lifecycle.
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapAnant Corporation
In this episode we'll discuss the different flavors of prompt engineering in the LLM/GPT space. According to your skill level you should be able to pick up at any of the following:
Leveling up with GPT
1: Use ChatGPT / GPT Powered Apps
2: Become a Prompt Engineer on ChatGPT/GPT
3: Use GPT API with NoCode Automation, App Builders
4: Create Workflows to Automate Tasks with NoCode
5: Use GPT API with Code, make your own APIs
6: Create Workflows to Automate Tasks with Code
7: Use GPT API with your Data / a Framework
8: Use GPT API with your Data / a Framework to Make your own APIs
9: Create Workflows to Automate Tasks with your Data /a Framework
10: Use Another LLM API other than GPT (Cohere, HuggingFace)
11: Use open source LLM models on your computer
12: Finetune / Build your own models
Series: Using AI / ChatGPT at Work - GPT Automation
Are you a small business owner or web developer interested in leveraging the power of GPT (Generative Pretrained Transformer) technology to enhance your business processes?
If so, Join us for a series of events focused on using GPT in business. Whether you're a small business owner or a web developer, you'll learn how to leverage GPT to improve your workflow and provide better services to your customers.
In Data Engineer’s Lunch #89: Machine Learning Orchestration with Airflow, we discussed using Apache Airflow to manage and schedule machine learning tasks. By following the best practices of ML Ops, teams can streamline their ML workflows and build scalable, efficient, and accurate models that deliver real-world business value. Properly implemented ML Ops can help organizations stay ahead of the curve and achieve their goals in the fast-paced world of machine learning. Apache Airflow is an open-source tool for scheduling and automating workflows. Airflow allows you to define workflows in Python, with tasks defined as Python functions that can include Operators for all sorts of external tools. This makes it easy to automate repeated processes and define dependencies between tasks, creating directed-acyclic-graphs of tasks that can be scheduled using cron syntax or frequency tasks. Airflow also features a user-friendly UI for monitoring task progress and viewing logs, giving you greater control over your data pipeline.
Cassandra Lunch 130: Recap of Cassandra Forward TalksAnant Corporation
If you didn't attend, you don't want to miss a much shorter synopsis of what was covered and get some thoughts from us as to why they are important. We'll talk about the main topics of the event.
1. ACID transactions on Cassandra by Aaron Ploetz, Datastax
2. Apache Flink with Apache Cassandra at Satyajit Thadeswar, Netflix
3. Durable Execution built on Apache Cassandra by Loren Sands-Ramshaw, Temporal
4. Switching from Mongo to Cassandra with Mongoose & new Stargate JSON API, Valeri Karpov
5. Cloud Native and Realtime AI/ML with Patrick Mcfadin and Davor Boncaci, Datastax
Data Engineer's Lunch 90: Migrating SQL Data with ArcionAnant Corporation
In Data Engineer's Lunch 90, Eric Ramseur teaches our audience how to use Arcion.
From best practices to real-world examples, this talk will provide you with the knowledge and insights you need to ensure a successful migration of your SQL data. So whether you're new to data migration or looking to improve your existing process, join us and discover how Arcion can help you achieve your goals.
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Anant Corporation
In Data Engineer's Lunch 89, Obioma Anomnachi will discuss how to manage and schedule Machine Learning operations via Airflow. Learn how you can write complete end-to-end pipelines starting with retrieving raw data to serving ML predictions to end-users, entirely in Airflow.
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Anant Corporation
As the demand for real-time data processing continues to grow, so too do the challenges associated with building production-ready applications that can handle large volumes of data and handle it quickly. In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes. Using telemetry data collected from a fitness app, we’ll demonstrate how we used a combination of Apache Kafka and Python-based microservices running on Kubernetes to build a pipeline for processing and analyzing this data in real-time. We'll also discuss how we used machine learning techniques to build a model for detecting collisions and how we implemented notifications to alert family members of a crash. Our ultimate goal is to help you navigate the challenges that come with building data-intensive, real-time applications that use ML models. By showcasing a real-world example, we aim to provide practical solutions and insights that you can apply to your own projects.
Key takeaways:
An understanding of the common challenges faced when building real-time applications at scale
Strategies for using Apache Kafka and Python-based microservices to process and analyze data in real-time
Tips for implementing machine learning models in a real-time application
Best practices for responding to and handling critical events in a real-time application
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
In Apache Cassandra Lunch #121: Migrating to Azure Managed Instance for Apache Cassandra, we discussed different methods for migrating data from existing Cassandra instances to Azure hosted options.
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsAnant Corporation
In this lunch, Johnny will show us how easy it is to start monitoring your Cassandra cluster in minutes. He will explain the various aspects and features of Cassandra that need to be monitored, how to do it, and most importantly why! Approaches for backups and Cassandra repairs will be discussed and explored in detail.
Learn how AxonOps significantly reduces the complexity and overhead when looking after Cassandra and ensures your Cassandra cluster is reliable and resilient.
Experienced developer, DevOps, architect, and AxonOps co-founder, Johnny Miller, has worked with a wide variety of companies – from small start-ups to large enterprises. He has been working with Cassandra for many years and has a deep understanding of the challenges facing modern companies looking to adopt Apache Cassandra.
In Apache Cassandra Lunch #119, Rahul Singh will cover a refresher on GUI desktop/web tools for users that want to get their hands dirty with Cassandra but don't want to deal with CQLSH to do simple queries. Some of the tools are web-based and others are installed on your desktop. Since the beginning days of Cassandra, a lot has changed and there are many options for command-line-haters to use Cassandra.
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessAnant Corporation
In Data Engineer's Lunch #60, Rahul Singh, CEO here at Anant, will discuss modern data processing/pipeline approaches.
Want to learn about modern data engineering patterns & practices for global data platforms? A high-level overview of different types, frameworks, and workflows in data processing and pipeline design.
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsAnant Corporation
During this lunch, we’ll review open-source reverse ETL tools to uncover how to send data back to SaaS systems.
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
#data #dataengineering #datagovernance
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
2. 25127
What will we
learn?
- Iceberg’s Structure
- What Catalog to use?
- The migrate procedure
- The add_files procedure
- Migrate with CTAS Statements
- Summary
3. 25127
Apache Iceberg Apache Iceberg’s approach is to define the table through three layers of
metadata. These categories are:
- metadata files that define the table
- manifest lists that define a snapshot of the table, with a list of
manifests that make up the snapshot and metadata about their data
- manifests is a list of data files along with metadata on those data
files for file pruning
4. 25127
Iceberg table format
● Overview of the components
● Summary of the read path (SELECT) #
metadata layer
data layer
5. 25127
Iceberg Catalog
● A store that houses the current
metadata pointer for Iceberg tables
● Must support atomic operations for
updating the current metadata pointer
(e.g. HDFS, HMS, Nessie)
table1’s current metadata pointer
● Mapping of table name to the location
of current metadata file
Iceberg components: Catalog
1
2
metadata layer
data layer
6. 25127
Metadata file - stores metadata about a
table at a certain point in time
{
"table-uuid" : "<uuid>",
"location" : "/path/to/table/dir",
"schema": {...},
"partition-spec": [ {<partition-details>}, ...],
"current-snapshot-id": <snapshot-id>,
"snapshots": [ {
"snapshot-id": <snapshot-id>
"manifest-list": "/path/to/manifest/list.avro"
}, ...],
...
}
Iceberg components: Metadata File
3
5
4
metadata layer
data layer
7. 25127
Iceberg components: Manifest List
Manifest list file - a list of manifest files
{
"manifest-path" : "/path/to/manifest/file.avro",
"added-snapshot-id": <snapshot-id>,
"partition-spec-id": <partition-spec-id>,
"partitions": [ {partition-info}, ...],
...
}
{
"manifest-path" : "/path/to/manifest/file2.avro",
"added-snapshot-id": <snapshot-id>,
"partition-spec-id": <partition-spec-id>,
"partitions": [ {partition-info}, ...],
...
}
6
metadata layer
data layer
8. 25127
Manifest file - a list of data files, along with
details and stats about each data file
{
"data-file": {
"file-path": "/path/to/data/file.parquet",
"file-format": "PARQUET",
"partition":
{"<part-field>":{"<data-type>":<value>}},
"record-count": <num-records>,
"null-value-counts": [{
"column-index": "1", "value": 4
}, ...],
"lower-bounds": [{
"column-index": "1", "value": "aaa"
}, ...],
"upper-bounds": [{
"column-index": "1", "value": "eee"
}, ...],
}
...
}
{
...
}
Iceberg components: Manifest file
7
8
metadata layer
data layer
10. 25127
10
Type of Catalog Pros Cons
Project Nessie - Git Like Functionality
- Cloud Managed Service (Arctic)
- Support from engines beyond
Spark & Dremio
- If not using arctic must deploy
and maintain Nessie server
Hive Metastore - Can use existing Hive Metastore - You have to deploy and maintain
a hive metastore
AWS Glue - Interop with AWS Services - Support outside of AWS, Spark
and Dremio
What can be used an Iceberg Catalog
Catalogs help tracks Iceberg tables and provide locking mechanisms for ACID Guarantees.
Thing to keep in mind is that while many engines may support Iceberg tables they may not
support connections to all catalogs.
- HDFS, JDBC and REST catalogs also available
- With 0.14 there will be a registerTable method to migrate tables between catalogs
without losing snapshot history
12. 25127
Existing Parquet files that are part of a Hive Table
Apache Iceberg has a Call procedure called migrate which can be used to create a new Iceberg table from an
the existing Hive table. This uses the existing data files in the table (must be parquet). This REPLACES the
Hive table with an Iceberg table, so the Hive table will no longer exist after this operation:
CALL catalog_name.system.migrate('db.sample')
To test the results of migrate before running migrate you can use the snapshot call procedure to create
temporary iceberg table based on a Hive table without replacing the Hive table. (temporary tables have
limitations).
CALL catalog_name.system.snapshot('db.sample', 'db.snap')
12
13. 25127
Parquet files not part of a hive table
If you have paquet files representing a dataset that are not yet part of an Iceberg table, you can use the
add_files procedure to add those datafiles to that of an existing table with a matching schema.
CALL spark_catalog.system.add_files(
table => 'db.tbl',
source_table => '`parquet`.`path/to/table`'
)
13
14. 25127
Migration by Restating the Data
You can use CREATE TABLE…AS (CTAS) statements to create new Iceberg tables from pre-existing tables.
This will create new data files but allows you to modify partitioning, schema and do validations at the time of
migration if choose to. Since the data is being rewritten/restated, this will be more time consuming that
migrate or add_files. (This can be done with Spark or Dremio)
CREATE TABLE prod.db.sample
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), truncate(last_name, 2))
AS SELECT ...
14
15. 25127
When to use which approach?
15
Approach
Are Data Files
Rewritten
Can I update Schema
and Partition while
migrating?
When to use?
migrate No No
Moving existing hive table to a
NEW iceberg table
add_files No No
Moving files from existing table
in parquet to an existing
Iceberg table
CTAS Yes Yes
Moving existing table
anywhere to NEW iceberg
table with schema or
partitioning changes
registerTable
(to be released with 0.14)
No No
Moving Iceberg table from one
catalog to another.
16. 25127
Best Practices
1. As the process begins, the new Iceberg table is not yet created or in sync with the source. User-facing
read and write operations remain operating on the source table.
2. The table is created but not fully in sync. Read operations are applied on the source table and write
operations are applied to the source and new table.
3. Once the new table is in sync, you can switch to read operations on the new table. Until you are certain
the migration is successful, continue to apply write operations to the source and new table.
4. When everything is tested, synced, and working properly, you can apply all read and write operations to
the new Iceberg table and retire the source table
16