The days of the relational database being a one-stop shop for all of your persistence needs are over. Although NoSQL databases address some issues that relational databases cannot, the opposite is true as well. The relational database offers an unparalleled feature set and rock-solid stability. One cannot overstate the importance of using the right tool for the job, and for some jobs, one tool is not enough. This talk covers the strengths and weaknesses of both relational and NoSQL databases, the benefits and challenges of polyglot persistence, and examples of polyglot persistence in the wild.
These slides were presented at WindyCityDB 2010.
Polyglot Persistence - Two Great Tastes That Taste Great Together
1. Polyglot Persistence
Two Great Tastes
That Taste Great Together!
John Wood
john_p_wood@yahoo.com
@johnpwood
2. About Me
● Software Developer at Interactive Mediums
● Primarily work on a web application that allows our customers to engage and interact with their customers
● Writing code for about 15 years
● Tinkering with NoSQL for about 1.5 years
● Have a NoSQL solution that has been running in production for a year
14. The RDBMS Is No Longer The Default Choice
● Can be very difficult to scale horizontally
● Schemas can be difficult to maintain and migrate
● For some applications, the data integrity features of the RDBMS are an unnecessary overhead
● Data constraints and JOINs can be expensive at runtime
16. NoSQL Databases Have Stepped Up To Address These Issues
● Schema-less
● Little to no data integrity enforcement
● Self-contained data
● Eventually consistent
● Easy to scale horizontally to add processing power and storage
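To make the schema-less point concrete, here is a minimal sketch using the CouchRest gem that appears later in the deck. The database name and document fields are invented for illustration, and it assumes a CouchDB instance running locally; the point is that two documents with completely different shapes can live in the same database with no migration required.

require 'couchrest'

# Create (or open) a database; assumes CouchDB at the default local port.
db = CouchRest.database!("http://127.0.0.1:5984/polyglot_demo")

# No schema to declare: each document carries its own structure.
db.save_doc("type" => "user", "name" => "John Wood", "twitter" => "@johnpwood")
db.save_doc("type" => "contest_entry", "entry_number" => 42, "answers" => %w[a b c])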
18. But The RDBMS Is Far From Dead
● Incredibly mature, and battle tested
● Immediate and constant consistency
● Integrity of data is enforced
● Efficient use of storage space if data is normalized properly
● Supported by everyone and everything (tools, frameworks, libraries, etc.)
● Incredibly flexible and powerful query language
● Help is plentiful and easy to find
28. “Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand.” - Scott Leberknight, October 2008
http://www.nearinfinity.com/blogs/scott_leberknight/polyglot_persistence.html
53. class User < ActiveRecord::Base
end

class ContestEntry < CouchRest::ExtendedDocument
  property :entry_number
end
54. class User < ActiveRecord::Base
  def contest_entries
    ContestEntry.entries_for_user(self.id)
  end
end

class ContestEntry < CouchRest::ExtendedDocument
  property :entry_number
  property :user_id

  def self.entries_for_user(user_id)
    # Execute your view to fetch the contest entries
  end

  def user
    User.find_by_id(user_id)
  end
end
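The body of entries_for_user is left as a comment on the slide. As one hedged way to fill it in, ExtendedDocument's view_by macro declares a CouchDB view keyed on user_id and generates a by_user_id query method; the view name and this wiring are an assumption for illustration, not part of the original slides.

class ContestEntry < CouchRest::ExtendedDocument
  property :entry_number
  property :user_id

  # Declares a CouchDB view emitting each entry under its user_id key
  # (assumed here; the deck does not show the view definition).
  view_by :user_id

  def self.entries_for_user(user_id)
    # Query the generated view for this user's entries.
    by_user_id(:key => user_id)
  end
end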
58. ● Primary MySQL database with a backup
● A few very large tables, containing 5M – 30M rows each, and growing quickly
● Increasing query execution time
● Some pages on the web app were timing out
● Increasing database migration time
● Rigid schema of the RDBMS was preventing some planned features from moving forward
59. ● Brought in a consultant to help us optimize our MySQL setup
● Optimized slow queries
● Added some indexes
● Offloaded some work to the backup database
● Considered the use of summary tables for statistics
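As a sketch of what "added some indexes" looks like in a Rails application of that vintage, an ActiveRecord migration such as the one below would do it. The table and column names are hypothetical, not taken from the slides.

class AddContestEntryIndexes < ActiveRecord::Migration
  def self.up
    # Index the columns that the slow, frequently-run queries filter on.
    add_index :contest_entries, :user_id
    add_index :contest_entries, :created_at
  end

  def self.down
    remove_index :contest_entries, :created_at
    remove_index :contest_entries, :user_id
  end
end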
61. ● Migrated old data from large tables to CouchDB
● Using CouchDB views to aggregate summary data
● Data is imported and views are updated nightly
● Queries for statistics now very fast
● Using Lucene (via couchdb-lucene) for full-text searching
● Taking full advantage of CouchDB's schema-less nature in several new application features
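A CouchDB view is a map/reduce pair, which is what makes the nightly summary aggregation cheap to query. Below is a hedged sketch using ExtendedDocument's raw map/reduce form of view_by; the per-day entry count is an invented example, not the deck's actual view, and the exact option names are an assumption.

class ContestEntry < CouchRest::ExtendedDocument
  # Map each entry to (day, 1); reduce with CouchDB's built-in sum helper.
  view_by :day,
    :map => "function(doc) {
               if (doc['couchrest-type'] == 'ContestEntry') {
                 emit(doc.created_at.substr(0, 10), 1);
               }
             }",
    :reduce => "function(keys, values, rereduce) { return sum(values); }"
end

# :group => true returns one summed row per day key.
daily_counts = ContestEntry.by_day(:reduce => true, :group => true)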
63. ● CouchDB databases and views can be very large on disk
● Some queries could not be substituted with CouchDB views
● Indexing tens of millions of documents for full-text search with Lucene takes weeks
● Development takes longer, as the map/reduce model requires additional thought and planning
● Changing/upgrading views in production not straightforward
http://www.couch.io/migrating-to-couchdb
67. ● Vertically and horizontally partitioned MySQL
● Several layers of aggressive caching, all application managed
● Schema changes impossible, resulting in the use of bitfields and piggyback tables
● Hardware intensive
● Error prone
● Hitting MySQL limits
● Already eventually consistent
69. ● Migrating from MySQL to Cassandra as their main online data store
● Hadoop/HBase used for people search feature
● FlockDB used to manage the social graph
● Hadoop for analytics
● “As with all NoSQL systems, strengths in different situations” - Kevin Weil, Analytics Lead, Twitter
http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
70. ● Increased availability
● The ability to support new features
● The ability to analyze their massive amount of data in a reasonable amount of time
http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010