Cloud Data Warehousing presentation by Rogier Werschkull, including tips, best practices and a BigQuery, Redshift, Snowflake and Azure SQL DWH comparison, delivered at #BIDASUMMIT
Presentation on "Cloud Data Warehousing: What, Why and How?" by Rogier Werschkull (RogerData), at the BI & Data Analytics Summit on June 13th, 2019 in Diegem (Belgium)
1. Cloud Data Warehousing
What - Why - How & Compare
Rogier Werschkull, RogerData
@rwerschkull
nl.linkedin.com/in/rogierwerschkull
2. The story begins: before…
3. ▪ The system was reaching its storage limit
▪ And extending it was very expensive
▪ All kinds of interesting performance challenges started appearing…
▪ Loading windows, projections…
▪ New use cases started to appear:
▪ “We need near real-time for LiveOps”
▪ Move to mobile gaming: potential unplannable growth
▪ The system became a bit unstable…
▪ Poorly distributed data
▪ Out of memory / failing nodes
Until…
4. Not a Vertica problem!…
5. “How to fix this and prepare for the future?”
6. Choices…
▪ Upgrade Vertica
▪ Huge initial investment.
• But what to choose when your data growth is unpredictable?
• And what about the changing use cases?
▪ Store less data
▪ Try selling that to the business…
▪ And what about the data growth being unpredictable?
▪ Switch technology…
8. The takeaway: move your DWH to the Cloud…
1. When you want a system that is simpler to set up, configure, maintain and operate
▪ Where a lot of the DBA / maintenance work you are still doing now ‘comes for free’
2. When your current DWH is maxed out and difficult or expensive to expand
▪ When you need potentially endless storage and compute scaling (more real-time being one possibility)
▪ When you require better workload separation
3. When you want your costs to scale linearly with usage
9. The takeaway: when maybe not?
1. When everything mentioned on the previous slide does not apply!
2. When you have a lot of data (10+ terabytes) and it is not in the cloud already (ingress)
3. When your data amounts are small (<1TB) or in the multiple-petabyte range (effect on costs)
4. When you have ultra-sensitive data and don’t trust the measures taken by the cloud providers to prevent this ‘from going wrong’
10. So which of these 14 to pick from then? (Forrester Wave for Cloud Data Warehouses, Q4-2018)
11. The takeaway: start with either…
OR
14. But we’ll just do this by building a BiG data lake, right?
Photo credit: Lake Public Domain, http://www.writeups.org/star-trek-brent-spiner-data/
18. “Every single company I've worked at and talked to has the same problem without a single exception so far: poor data quality... Either there's incomplete data, missing data, duplicative data.”
Ruslan Belkin, former VP of Engineering @ Twitter and Salesforce
22. I don’t believe in Cloud Data warehousing!
23. (as the answer to all your Data warehousing woes)
24. “The primary purpose of a data warehouse is to transform data from an application state into an integrated corporate state”
Bill Inmon, the father of data warehousing
25. This is what we still SHOULD want to build: a DWH that is Subject-Oriented, Integrated, Time-Variant and Non-Volatile
26. Build a Data warehouse!
31. ‘Virtual by design’?
• Focus on the transformation logic
• Not on storing / updating / deleting data structures
• Simplifies backfilling / changing the DWH (see the sketch below)
Photo credit: Public Domain
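A minimal sketch of what ‘virtual by design’ can look like in practice: the integration logic lives in a view definition instead of a persisted, updated table, so backfilling or changing the DWH means replacing one definition. The BigQuery client is used only as an illustration here; the dataset, table and column names are hypothetical, not taken from the slides.

```python
# Hypothetical 'virtual by design' layer: the logic lives in a view, not in stored data.
from google.cloud import bigquery

client = bigquery.Client()

# Replacing the view re-applies the transformation to all history at once,
# which is what makes backfills and logic changes cheap.
client.query("""
CREATE OR REPLACE VIEW analytics.v_customer_orders AS
SELECT c.customer_id,
       c.country,
       o.order_id,
       o.order_ts,
       o.amount_eur
FROM   raw.customers AS c
JOIN   raw.orders    AS o USING (customer_id)
""").result()
```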
32. Modern DWH (this is what we did in BigQuery!): Data & History, Tagging & Search, address basic DQ issues, address complex DQ issues, integrate data into meaningful and useful structures.
33. I do believe in Cloud Data warehouse technology!
34. I do believe in Cloud-based analytical databases
38. ▪ Built on the Google Dremel execution engine (Apache Drill is an open-source implementation of the same concept)
▪ Available since October 2011
▪ Cloud native: born in the cloud
▪ Key unique feature:
▪ The only full-on DWaaS: no nodes, no CPU, no RAM, nothing to configure
BigQuery:
39. ▪ Based on PostgreSQL 8.0.2, rebuilt as a cloud-based MPP database
▪ Available since October 2012
▪ Cloud DWH based on legacy technology
▪ Key unique features:
▪ The most implementations (largest installed base)
▪ Best SQL support
Redshift:
40. ▪ New kid on the block, started by ex-Oracle employees
▪ Cloud native: born in the cloud
▪ Available since October 2014
▪ Key unique features:
▪ The only cloud-agnostic DWH: AWS, Azure and Google (early 2020)
▪ No-downtime auto-scaling
▪ Metadata-based data cloning (clone your production to DTA with no extra storage; see the sketch below)
Snowflake:
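As an illustration of the metadata-based cloning mentioned above, here is a minimal sketch using the Snowflake Python connector; the connection parameters and database names are placeholders, not part of the original slides.

```python
# Sketch: clone production into a DTA database. Only metadata is copied;
# storage is shared until data is modified (zero-copy clone).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical connection parameters
    user="my_user",
    password="my_password",
    role="SYSADMIN",
)
cur = conn.cursor()
cur.execute("CREATE DATABASE dta_db CLONE prod_db")
cur.close()
conn.close()
```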
41. ▪ Based on SQL Server Parallel Data Warehouse (PDW)
▪ Cloud DWH based on legacy technology
▪ Available since 2015 (gen1) and May 2018 (gen2)
▪ Key unique features:
▪ Getting stronger quickly since the gen2 release
▪ Vast supporting ecosystem of GUIs and ETL tools
Azure SQL Data Warehouse:
42. So why then, and what is different? Comparing the top 3 benefits…
Photo credit: Public Domain
43. 3- Costs: low entry point / pay-for-use
Photo by Joel Filipe on Unsplash
44. Comparing Cost features…
Feature | Redshift | Azure SQL DWH | BigQuery | Snowflake
Fixed start / licence costs? | NO | NO | NO | NO
Separation of storage and compute | NO | YES | YES | YES
Costs easy to calculate? | Bit of work | Bit of work | YES | Bit of work
Good predictability? | YES | YES | NO (on demand), YES (flat rate) | Depends on auto-scaling
Storage costs / 1TB / month (USD) | 374* | 149 | 20/10 uncompressed | 23
CPU / usage costs | Depends | Depends | 5 per TB read, uncompressed | Depends
* For current-gen dense storage ds2.8xlarge cluster running all year
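To make the on-demand numbers above concrete, here is a rough back-of-the-envelope estimate using the $5 per TB read and $20 per TB-month active-storage figures from the table; the data and query volumes are made up for the example, not from the slides.

```python
# Illustrative BigQuery on-demand estimate; volumes are assumptions for the example.
tb_scanned_per_day = 2                       # assumed daily query volume in TB
storage_tb = 10                              # assumed active storage in TB
query_cost = tb_scanned_per_day * 30 * 5     # $5 per TB read
storage_cost = storage_tb * 20               # $20 per TB-month (active storage)
print(query_cost + storage_cost)             # ~$500 per month in this example
```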
45. ▪ Strong:
▪ All start at zero cost, with no fixed licence fee
▪ All employ a pay-for-use model
▪ Snowflake has the cheapest storage (even more so with metadata-based cloning)
▪ Weak:
▪ Redshift doesn’t separate storage and compute
▪ BigQuery’s DWaaS model charges CPU costs based on the amount of data queried: this can get out of control when you don’t set limits (on-demand pricing only; see the cost-cap sketch below)
▪ BigQuery’s limited end-user data caching can lead to rising costs, depending on the usage pattern (a solution is in development; on-demand pricing only)
Costs: solution strong / weak points
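One way to keep BigQuery on-demand costs from getting out of control is to cap the bytes billed per query. A minimal sketch with the official Python client; the query text and the 100 GB cap are only examples.

```python
# Sketch: fail any query that would bill more than ~100 GB on-demand.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=100 * 1024**3)

job = client.query(
    "SELECT order_id, amount_eur FROM analytics.orders",  # hypothetical table
    job_config=job_config,
)
rows = job.result()  # the job errors out instead of billing beyond the cap
```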
46. • A cloud DWH will not always be cheaper than on-prem!
• Costs change from CAPEX to OPEX
• Requires a different operating model
• Costs can be unpredictable, which can be seen as a problem
• And remember the TCO!
Costs: don’t forget…
47. 2- (Almost) Infinite scaling
Photo by Joel Filipe on Unsplash
48. Comparing Scale and Speed features…
Feature | Redshift | Azure SQL DWH | BigQuery | Snowflake
Type | Based on legacy | Based on legacy | Cloud native | Cloud native
Max storage size | 2PB | Gen1: 240TB; Gen2: 240TB row, unlimited columnstore | Unlimited | Unlimited
Storage resizing | COMPLEX | Doable | N.A. | N.A.
Dynamic node resizing | Doable | Doable | N.A. | EASY
Concurrency resizing | COMPLEX | Doable | Default 50, then contact Google | EASY
No-downtime auto-scaling | NO | NO | N.A. | YES
Hibernate compute | NO | YES | N.A. | YES
Data caching | Hot data SSD cache + exact query | Hot data SSD cache + exact query | Only exact query (+ in development) | Hot data SSD cache + exact query
49. ▪ Strong:
▪ BigQuery and Snowflake have unlimited storage
▪ BigQuery on-demand is always very powerful with 2,000 slots; scaling is not relevant here!
▪ Snowflake has the best cluster and concurrency scaling options (see the sketch below)
▪ Weak:
▪ Redshift is complex to resize and scale, and cannot hibernate compute
▪ BigQuery’s DWaaS nature and limited caching options almost always incur 2-3 seconds of query startup time
Scaling: strong / weak points
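A minimal sketch of the Snowflake-side concurrency scaling mentioned above: a multi-cluster warehouse that adds clusters under load and suspends itself when idle. The warehouse name and sizing are illustrative, and multi-cluster warehouses assume a Snowflake edition that supports them.

```python
# Sketch: a multi-cluster warehouse scales out for concurrency with no downtime.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"  # placeholders
)
cur = conn.cursor()
cur.execute("""
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4        -- extra clusters start only when queries queue
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 300           -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
""")
cur.close()
conn.close()
```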
50. • This could be ‘your last DWH migration’: choose wisely!
• The power of this technology is an enabler for the modern data warehousing methodology: virtualize!
• Infinite scale also increases the risk of:
• Infinite costs
• An infinite data mess
So take your Data Management (even more!) seriously!
Scaling: don’t forget…
51. Fivetran benchmark, 10 Sept 2018
• 99 TPC-DS queries
• Run only once
• Calculated with the system being idle 82% of the time
• Factor 10 difference in size of cluster and dataset
• No usage of:
• Partitioning
• Sort keys
• Clustering
SOURCE: https://fivetran.com/blog/warehouse-benchmark
54. 1- Ease of deployment, development and maintenance
Photo credit: Public Domain
55. Comparing deployment, development and maintenance
Feature | Redshift | Azure SQL DWH | BigQuery | Snowflake
Setup process | COMPLEX | AVERAGE | N.A. | EASY
Managing data on cluster | COMPLEX | AVERAGE | N.A. | EASY
Separation of storage and compute | NO | YES | YES | YES
Time travel (auto-backup) | 8 hours + configurable | 8 hours + user-defined restore points | YES, 7 days, 2 after delete | YES, 1 to 90 days + fail-safe
Metadata-only data cloning | NO | NO | NO | YES
SQL DDL support | EXCELLENT | GOOD | LIMITED | GOOD
SQL DML support | EXCELLENT | OK | GOOD | OK*
Stored procedure support | GOOD | GOOD | NO | GOOD
UDF support | GOOD | GOOD | AVERAGE: not centrally | GOOD
Materialized view support | NO | YES (preview) | NO | YES (limited)
PK/FK support | As metadata | NO | NO | As metadata
Quality GUI / SQL interface | GOOD | GOOD, but no web UI | OK, web | GOOD, web
JSON parsing capabilities | AVERAGE | In preview | OK | GOOD
ETL dev / scheduling | GOOD: AWS Glue, coding | GOOD: Data Factory, SSIS, coding | GOOD: Cloud Data Fusion, scheduled queries, Cloud Composer | OK: coding or external ETL tool in AWS / Azure
* Analytical functions are not fully mature yet: https://medium.com/@jthandy/how-compatible-are-redshift-and-snowflakes-sql-syntaxes-c2103a43ae84c
56. ▪ All have integrated replication and backup
▪ BigQuery has no configuration / maintenance work at all
▪ Snowflake has just enough simple configurability
▪ BigQuery and Snowflake support time travel (see the sketch below)
▪ Snowflake has metadata-based database cloning
▪ Redshift has the best SQL support
▪ Azure SQL Data Warehouse has the best supporting ecosystem of GUIs and ETL tools
Using: strong points
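For the time-travel point above, a minimal sketch of what the queries look like on both platforms; the table name and the one-hour offset are illustrative only.

```python
# Sketch: query a table as it was one hour ago.

# BigQuery time travel (up to 7 days back):
bq_sql = """
SELECT *
FROM analytics.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

# Snowflake time travel (1 to 90 days, depending on retention settings):
sf_sql = "SELECT * FROM analytics.orders AT(OFFSET => -3600)"
```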
57. ▪ Redshift and Azure SQL Data Warehouse require you to choose distribution keys and to create / update statistics (see the sketch below)
▪ Redshift requires DBA work to reclaim space when deleting data
▪ BigQuery’s SQL DDL support is limited
▪ BigQuery has no stored procedures or materialized views, and only limited UDF support
▪ No one has proper primary or foreign key support!
Using: weak points
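To illustrate the first point above, a minimal Redshift-flavoured sketch of the distribution-key and statistics work that stays with you; the table design, column choices and connection details are hypothetical, not a recommendation from the slides.

```python
# Sketch: on Redshift you still choose distribution and sort keys yourself,
# and you keep planner statistics fresh with ANALYZE.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dwh", user="etl", password="my_password")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS analytics.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    amount_eur  DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that join on customer_id
SORTKEY (order_ts)      -- zone maps prune scans on date ranges
""")
cur.execute("ANALYZE analytics.orders")  # refresh statistics for the planner
cur.close()
conn.close()
```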
58. • The data-storage work that remains: thinking about your partitioning and clustering (data-sorting) strategy (see the sketch below)
• You still need to use a good data warehousing methodology!
• The basic skills and competences needed don’t change!
Using: don’t forget…
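A minimal, BigQuery-flavoured sketch of that remaining design work; the table, column names and the chosen partition/cluster keys are illustrative assumptions, not taken from the slides.

```python
# Sketch: the physical-design decision that stays with you is how to partition
# and cluster (sort) the data.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE TABLE IF NOT EXISTS analytics.events (
  event_ts    TIMESTAMP,
  customer_id STRING,
  event_type  STRING,
  payload     STRING
)
PARTITION BY DATE(event_ts)         -- date filters prune whole partitions
CLUSTER BY customer_id, event_type  -- co-locates rows for selective scans
""").result()
```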