This document summarizes an IBM Cloud Day 2021 presentation on IBM Cloud Data Lakes. It describes the architecture of IBM Cloud Data Lakes, including data skipping capabilities, serverless analytics, and metadata management. It then discusses an example COVID-19 data lake built on IBM Cloud to provide trusted data to analytics applications. Key aspects include landing, preparation, and integration zones; serverless pipelines for data ingestion and transformation; and a data mart for querying and reporting.
IBM Cloud Day January 2021 - A well architected data lake (Torsten Steinbach)
- The document discusses an IBM Cloud Day 2021 event focused on well-architected data lakes. It provides an overview of two sessions on data lake architecture and building a cloud native data lake on IBM Cloud.
- It also summarizes the key capabilities organizations need from a data lake, including visualizing data, flexibility/accessibility, governance, and gaining insights. Cloud data lakes can address these needs for various roles.
Serverless Cloud Data Lake with Spark for Serving Weather Data
1) The document discusses using a serverless architecture with IBM Cloud services like SQL Query powered by Spark, Cloud Object Storage, and Cloud Functions to build a cost-effective cloud data lake for serving historical weather data on demand.
2) It describes how data skipping techniques and geospatial indexes in SQL Query can accelerate queries by an order of magnitude by pruning irrelevant data (a query sketch follows this list).
3) The new serverless solution provides unlimited storage, global coverage, and supports large queries for machine learning and analytics at an order of magnitude lower cost than the previous implementation.
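The deck stays at the architecture level; as a rough illustration of the query pattern it describes, here is a hedged sketch using the ibmcloudsql Python client. The bucket URLs, column names, and credentials are all placeholders, not the talk's actual setup.

```python
# Hypothetical sketch: querying weather observations in Cloud Object Storage
# with IBM Cloud SQL Query via its Python client (pip install ibmcloudsql).
import ibmcloudsql

sql_client = ibmcloudsql.SQLQuery(
    api_key="<IBM_CLOUD_API_KEY>",              # placeholder credentials
    instance_crn="<SQL_QUERY_INSTANCE_CRN>",
    target_cos_url="cos://us-geo/my-results-bucket/",
)

# Data skipping indexes keep min/max metadata per object, so the engine can
# skip objects that cannot match the lat/lon and timestamp predicates; only
# the relevant Parquet objects are scanned (and billed).
query = """
SELECT ts, temperature, lat, lon
FROM cos://us-geo/weather-history/observations/ STORED AS PARQUET
WHERE lat BETWEEN 40.0 AND 41.0
  AND lon BETWEEN -74.5 AND -73.5
  AND ts BETWEEN CAST('2020-01-01' AS TIMESTAMP)
             AND CAST('2020-01-31' AS TIMESTAMP)
"""
result_df = sql_client.run_sql(query)  # returns a pandas DataFrame
print(result_df.head())
```

Because pruning happens before any data is read, predicates like these are where the order-of-magnitude cost and speed savings come from.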
IBM Cloud Native Day April 2021: Serverless Data Lake (Torsten Steinbach)
- The document discusses serverless data analytics using IBM's cloud services, including a serverless data lake built on cloud object storage, serverless SQL queries using Spark, and serverless data processing functions.
- It provides an example of a COVID-19 data lake built on IBM Cloud that collects and integrates data from various sources, prepares and transforms the data, and makes it available for analytics and dashboards through serverless SQL queries.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) for curating and processing massive amounts of data, developing, training, and deploying models on that data, and managing the whole workflow throughout a project. It suits teams comfortable with Apache Spark, since it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Spark Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, a workflow scheduler, real-time workspace collaboration, and performance improvements over traditional Apache Spark.
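As a flavor of that combined workflow, here is a minimal, self-contained PySpark sketch (not Databricks-specific; paths and columns are invented) that uses Spark SQL for curation and MLlib for training in one job.

```python
# Minimal sketch: Spark SQL for data prep plus MLlib for model training.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("curate-and-train").getOrCreate()

# Curate raw data with Spark SQL (path and columns are placeholders).
spark.read.parquet("/mnt/raw/events/").createOrReplaceTempView("events")
training = spark.sql("""
    SELECT feature_a, feature_b, label
    FROM events
    WHERE label IS NOT NULL
""")

# Train a model with MLlib on the curated data.
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"],
                            outputCol="features")
model = LinearRegression(labelCol="label").fit(assembler.transform(training))
print(model.coefficients)
```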
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake (a short sketch of this pattern follows the speaker credit).
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
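For readers unfamiliar with the pattern, a hedged sketch follows. It assumes a hypothetical Glue Data Catalog database, IAM role, and cluster endpoint, and uses psycopg2 as the client since Redshift speaks the PostgreSQL wire protocol.

```python
# Hedged sketch of the lake house pattern: expose S3 data to Redshift via
# Spectrum, then join it with a warehouse table in one query.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dev", user="awsuser", password="<PASSWORD>")
conn.autocommit = True
cur = conn.cursor()

# An external schema maps Glue Data Catalog tables over open-format S3 files.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
    FROM DATA CATALOG DATABASE 'sales_lake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
""")

# One query spans the warehouse (dim_customer) and the S3 data lake (orders).
cur.execute("""
    SELECT c.segment, COUNT(*) AS orders
    FROM lake.orders o
    JOIN public.dim_customer c ON c.customer_id = o.customer_id
    GROUP BY c.segment
""")
print(cur.fetchall())
conn.close()
```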
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics (Informatica)
This presentation is geared toward enterprise architects and senior IT leaders looking to drive more value from their data by learning about cloud data lake management.
As businesses focus on leveraging big data to drive digital transformation, technology leaders are struggling to keep pace with the high volume of data coming in at high speed and rapidly evolving technologies. What's needed is an approach that helps you turn petabytes into profit.
Cloud data lakes and cloud data warehouses have emerged as a popular architectural pattern to support next-generation analytics. Informatica's comprehensive AI-driven cloud data lake management solution natively ingests, streams, integrates, cleanses, governs, protects and processes big data workloads in multi-cloud environments.
Please leave any questions or comments below.
At wetter.com we build analytical B2B data products and heavily use Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles, such as architecture, application logic, and user experience. We will look at how security, cluster configuration, resource consumption, and workflows changed by using Databricks clusters, as well as how Delta tables simplified our application logic and data operations.
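This is not wetter.com's code, but a generic sketch of the simplification the talk refers to: with Delta, an incremental batch becomes a single atomic MERGE instead of hand-rolled partition rewrites. Paths, join keys, and the Spark/Delta setup are assumptions.

```python
# Illustrative Delta Lake upsert; assumes a Spark session with Delta Lake
# configured (e.g., on Databricks). All paths and column names are invented.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-upsert").getOrCreate()
updates = spark.read.parquet("s3://incoming/weather-updates/")

target = DeltaTable.forPath(spark, "s3://lake/weather-delta/")
(target.alias("t")
       .merge(updates.alias("u"),
              "t.station_id = u.station_id AND t.day = u.day")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())  # ACID: readers never see a half-applied batch
```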
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
In this session we will delve into the world of Azure Databricks and analyze why it is becoming a fundamental tool for data scientists and data engineers in conjunction with Azure services.
Spark is fast becoming a critical part of Customer Solutions on Azure. Databricks on Microsoft Azure provides a first-class experience for building and running Spark applications. The Microsoft Azure CAT team engaged with many early adopter customers helping them build their solutions on Azure Databricks.
In this session, we begin by reviewing typical workload patterns and integration with other Azure services like Azure Storage, Azure Data Lake, IoT / Event Hubs, SQL DW, and Power BI. Most importantly, we will share real-world tips and learnings that you can take and apply in your data engineering and data science workloads.
McGraw-Hill Optimizes Analytics Workloads with Databricks (Amazon Web Services)
Using Databricks, McGraw-Hill securely transformed itself from a collection of data silos with limited access to data and minimal collaboration to an organization with democratized access to data and machine learning. This ultimately enables its data teams to rapidly identify usage patterns predicting student performance, so they can make timely enhancements to the software that proactively guide at-risk students through the course material.
Join our webinar to learn:
- How a cloud-based unified analytics platform can help your company perform analytics faster, at lower cost.
- How to mitigate challenges presented by data silos so data science teams can collaborate effectively.
- How to implement data analytics infrastructure to put models into production quickly.
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... (Databricks)
Columbia is a data-driven enterprise, integrating data from all line-of-business-systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
Apache Spark is a fast and general engine for large-scale data processing. It was created at UC Berkeley and is now a dominant framework in big data. Spark can run programs over 100x faster than Hadoop MapReduce in memory, or more than 10x faster on disk. It supports Scala, Java, Python, and R. Databricks provides a Spark platform on Azure that is optimized for performance and integrates tightly with other Azure services. Key benefits of Databricks on Azure include security, ease of use, data access, high performance, and the ability to solve complex analytics problems.
Amazon QuickSight is a business intelligence service that allows users to connect to data sources, create interactive dashboards, and securely share them across organizations. It offers auto-scaling, high availability, integration with AWS services, and pay-per-use pricing starting at $5/month for readers. QuickSight provides machine learning capabilities like anomaly detection and forecasting. It also allows embedding dashboards in applications. Customers like Capital One, Comcast, and the NFL use QuickSight for self-service analytics, embedded analytics, and delivering insights to large numbers of users through its reader role and usage-based pricing.
Running cost effective big data workloads with Azure Synapse and Azure Data L... (Michael Rys)
The presentation discusses how to migrate expensive open source big data workloads to Azure and leverage the latest compute and storage innovations within Azure Synapse with Azure Data Lake Storage to develop powerful and cost-effective analytics solutions. It shows how you can bring your .NET expertise to bear with .NET for Apache Spark, and how the shared metadata experience in Synapse makes it easy to create a table in Spark and query it from T-SQL.
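The session demonstrates this with .NET for Apache Spark; the same shared-metadata handshake is sketched here in PySpark for brevity, with invented storage paths and table names.

```python
# Hedged sketch of Synapse shared metadata: a table saved from Spark becomes
# queryable from a serverless SQL pool without any copy step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided in a Synapse notebook

df = spark.read.parquet("abfss://data@mylake.dfs.core.windows.net/raw/trips/")

spark.sql("CREATE DATABASE IF NOT EXISTS tripsdb")
# Saving as a table registers it in the Synapse shared metadata store...
df.write.mode("overwrite").saveAsTable("tripsdb.trips")

# ...so a serverless SQL pool can immediately query the same data from T-SQL:
#   SELECT COUNT(*) FROM tripsdb.dbo.trips;
```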
Streaming Real-time Data to Azure Data Lake Storage Gen 2 (Carole Gunst)
Check out this presentation to learn the basics of using Attunity Replicate to stream real-time data to Azure Data Lake Storage Gen2 for analytics projects.
Tarun Poladi is a data engineer with over 7 years of experience in business intelligence and data engineering. He has extensive experience with Microsoft BI tools like SSIS, SSRS, and Power BI. He is proficient in SQL, Python, and R. Tarun has worked on various cloud platforms including AWS, Azure, and Databricks. His experience includes data modeling, ETL processes, building reports and dashboards, and developing analytics solutions. He holds certifications in Azure Data Engineering and Microsoft BI Reporting.
Building Modern Data Platform with Microsoft Azure (Dmitry Anoshin)
This document provides an overview of building a modern cloud analytics solution using Microsoft Azure. It discusses the role of analytics, a history of cloud computing, and a data warehouse modernization project. Key challenges covered include lack of notifications, logging, self-service BI, and integrating streaming data. The document proposes solutions to these challenges using Azure services like Data Factory, Kafka, Databricks, and SQL Data Warehouse. It also discusses alternative implementations using tools like Matillion ETL and Snowflake.
Ai & Data Analytics 2018 - Azure Databricks for data scientistAlberto Diaz Martin
This document summarizes a presentation given by Alberto Diaz Martin on Azure Databricks for data scientists. The presentation covered how Databricks can be used for infrastructure management, data exploration and visualization at scale, reducing time to value through model iterations and integrating various ML tools. It also discussed challenges for data scientists and how Databricks addresses them through features like notebooks, frameworks, and optimized infrastructure for deep learning. Demo sections showed EDA, ML pipelines, model export, and deep learning modeling capabilities in Databricks.
Modern DW Architecture
- The document discusses modern data warehouse architectures using Azure cloud services like Azure Data Lake, Azure Databricks, and Azure Synapse. It covers storage options like ADLS Gen 1 and Gen 2 and data processing tools like Databricks and Synapse. It highlights how to optimize architectures for cost and performance using features like auto-scaling, shutdown, and lifecycle management policies. Finally, it provides a demo of a sample end-to-end data pipeline.
Building the Data Lake with Azure Data Factory and Data Lake Analytics (Khalid Salama)
In essence, a data lake is a commodity distributed file system that acts as a repository to hold raw data file extracts of all the enterprise source systems, so that it can serve the data management and analytics needs of the business. A data lake system provides the means to ingest data, perform scalable big data processing, and serve information, in addition to managing, monitoring, and securing the environment. In these slides, we discuss building data lakes using Azure Data Factory and Data Lake Analytics. We delve into the architecture of the data lake and explore its various components. We also describe the various data ingestion scenarios and considerations. We introduce the Azure Data Lake Store, then discuss how to build an Azure Data Factory pipeline to ingest data into the lake. After that, we move into big data processing using Data Lake Analytics, and we delve into U-SQL.
These are the slides for my talk "An intro to Azure Data Lake" at Azure Lowlands 2019. The session was held on Friday January 25th from 14:20 - 15:05 in room Santander.
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat... (Microsoft Tech Community)
In this session you will learn how to develop data pipelines in Azure Data Factory and build a cloud-based analytical solution, adopting modern data warehouse approaches with Azure SQL Data Warehouse and implementing incremental ETL orchestration at scale. With the multiple sources and types of data available in an enterprise today, Azure Data Factory enables full integration of data and direct storage in Azure SQL Data Warehouse for the powerful, high-performance query workloads that drive a majority of enterprise applications and business intelligence applications.
Dustin Vannoy presented on using Delta Lake with Azure Databricks. He began with an introduction to Spark and Databricks, demonstrating how to set up a workspace. He then discussed limitations of plain Spark on data lakes, including the lack of ACID guarantees and the small-files problem. Delta Lake addresses these issues with a transaction log for ACID transactions, schema enforcement, automatic file compaction, and time travel for querying past versions of data. The presentation included demos of Delta Lake capabilities like schema validation, merging, and querying past versions.
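A small sketch of two of those capabilities, under the assumption of a Spark session with Delta Lake configured and a throwaway table path:

```python
# Hedged demo of Delta Lake schema enforcement and time travel.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

# Write version 0 of a table (placeholder path, single "id" column).
spark.range(5).write.format("delta").save("/tmp/lake/events")

# Schema enforcement: a mismatched append is rejected, not silently written.
bad = spark.createDataFrame([("oops",)], ["not_in_schema"])
try:
    bad.write.format("delta").mode("append").save("/tmp/lake/events")
except AnalysisException as err:
    print("append rejected:", err)

# Time travel: read the table as it was at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/events")
v0.show()
```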
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...) (Michael Rys)
SQLBits 2020 presentation on how you can build solutions based on the modern data warehouse pattern with Azure Synapse Spark and SQL including demos of Azure Synapse.
Serverless SQL provides a serverless analytics platform that allows users to analyze data stored in object storage without having to manage infrastructure. Key features include seamless elasticity, pay-per-query consumption, and the ability to analyze data directly in object storage without having to move it. The platform includes serverless storage, data ingest, data transformation, analytics, and automation capabilities. It aims to create a sharing economy for analytics by allowing various users like developers, data engineers, and analysts flexible access to data and analytics.
IBM's Cloud-based Data Lake for Analytics and AI presentation covered:
1) IBM's cloud data lake provides serverless architecture, low barriers to entry, and pay-as-you-go pricing for analytics on data stored in cloud object storage.
2) The data lake offers SQL-based data exploration, transformation, and analytics capabilities as well as industry-leading optimizations for time series and geospatial data.
3) Security features include customer-controlled encryption keys and options to hide SQL queries and keys from IBM.
A talk shared at a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
This document discusses building a data lake on AWS. It describes using Amazon S3 for storage, Amazon Kinesis for streaming data, and AWS Lambda to populate metadata indexes in DynamoDB and search indexes. It covers using IAM for access control, AWS STS for temporary credentials, and API Gateway and Elastic Beanstalk for interfaces. The data lake provides a foundation for storing and analyzing structured, semi-structured, and unstructured data at scale from various sources in a cost-effective and secure manner.
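A hedged sketch of the metadata-index piece: a Lambda handler, subscribed to S3 object-created events, that writes one DynamoDB item per new object so the lake stays searchable. The table name and key schema are illustrative, not from the document.

```python
# Hypothetical Lambda handler: index new S3 objects into a DynamoDB catalog.
import boto3

dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("data-lake-catalog")  # placeholder table name

def handler(event, context):
    # S3 event notifications deliver one or more records per invocation.
    for record in event["Records"]:
        s3_info = record["s3"]
        catalog.put_item(Item={
            "bucket": s3_info["bucket"]["name"],   # assumed partition key
            "key": s3_info["object"]["key"],       # assumed sort key
            "size_bytes": s3_info["object"]["size"],
            "event_time": record["eventTime"],
        })
    return {"indexed": len(event["Records"])}
```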
The document summarizes IBM's cloud data lake and SQL query services. It discusses how these services allow users to ingest, store, and analyze large amounts of data in the cloud. Key points include that IBM's cloud data lake provides a fully managed data lake service with serverless consumption and fully elastic scaling. It also discusses how IBM SQL Query allows users to analyze data stored in cloud object storage using SQL, and supports various data formats and analytics use cases including log analysis, time series analysis, and spatial queries.
For people who are starting to build a cloud service, it's really important to know how to design it to scale with future workload growth. In this session, we will introduce how to design a scalable cloud service, including an introduction to AWS services and best practices.
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data. (An Athena sketch follows the speaker list.)
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
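As a concrete taste of the Athena part, here is a hedged boto3 sketch (database, table, and results bucket are placeholders) that starts a query over S3 data, polls for completion, and prints the rows.

```python
# Hedged sketch: query S3 data in place with Amazon Athena.
import time
import boto3

athena = boto3.client("athena")

job = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page",
    QueryExecutionContext={"Database": "lake_db"},           # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = job["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:  # the first row is the column headers
        print([col.get("VarCharValue") for col in row["Data"]])
```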
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
AWS March 2016 Webinar Series - Building Your Data Lake on AWS (Amazon Web Services)
Uncovering new, valuable insights from big data requires organizations to collect, store, and analyze increasing volumes of data from multiple, often disparate sources at disparate points in time. This makes it difficult to handle big data with data warehouses or relational database management systems alone.
A Data Lake allows you to store massive amounts of data in its original form, without the need to enforce a predefined schema, enabling a far more agile and flexible architecture, which makes it easier to gain new types of analytical insights from your data.
In this webinar, we will introduce key concepts of a Data Lake and present aspects related to its implementation. We will discuss critical success factors, pitfalls to avoid as well as operational aspects such as security, governance, search, indexing and metadata management.
Learning Objectives:
• Learn how AWS can help enable a Data Lake architecture
• Understand some of the key architectural considerations when building a Data Lake
• Hear some of the important Data Lake implementation considerations
Who Should Attend:
• Data architects, data scientists, advanced AWS developers
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service (Amazon Web Services)
Everything generates logs. Applications, infrastructure, security ... everything. Keeping track of the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we will use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data, and how to configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
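The producer side of that pipeline can be as small as this hedged boto3 sketch; the delivery stream name is a placeholder, and the stream's Elasticsearch destination is assumed to be configured separately (console or infrastructure-as-code, not shown).

```python
# Hedged sketch: ship one log event to a Kinesis Data Firehose delivery
# stream that delivers into Amazon Elasticsearch Service.
import json
import boto3

firehose = boto3.client("firehose")

log_event = {"level": "ERROR", "service": "checkout",
             "msg": "timeout calling payments"}
firehose.put_record(
    DeliveryStreamName="logs-to-elasticsearch",  # placeholder stream name
    Record={"Data": (json.dumps(log_event) + "\n").encode("utf-8")},
)
```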
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift (Amazon Web Services)
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
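One of those loading best practices in miniature: a hedged psycopg2 sketch issuing a parallel COPY from S3 instead of row-by-row INSERTs, with the endpoint, table, and IAM role as placeholders.

```python
# Hedged sketch: bulk-load Redshift with COPY from S3 (parallel across
# slices), rather than many single-row INSERT statements.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dev", user="awsuser", password="<PASSWORD>")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY sales FROM 's3://my-bucket/sales/2017/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        CSV GZIP;
    """)
conn.close()
```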
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2 (Amazon Web Services)
This document discusses building data warehouses and data lakes in the cloud using AWS services. It provides an overview of AWS databases, analytics, and machine learning services that can be used to store and analyze data at scale. These services allow customers to migrate existing data warehouses to the cloud, build new data warehouses and data lakes more cost effectively, and gain insights from their data more easily.
The document summarizes announcements from AWS re:Invent 2016 related to compute, storage, artificial intelligence, serverless computing, databases, migration tools, and developer tools. Key announcements included new EC2 instance types, cost reductions, Elastic GPUs, AWS Batch for batch processing, Aurora PostgreSQL, Athena for analytics on S3 data, VMware on AWS, AWS X-Ray for tracing distributed applications, and expanded machine learning capabilities through services like Polly, Lex, and Rekognition as well as support for MXNet as an AI framework.
This document discusses logging scenarios using DynamoDB and Elastic MapReduce. It covers collecting log data in real-time using tools like Fluentd and storing it in DynamoDB. It then describes using EMR to perform ETL processes on the data, extracting from DynamoDB, transforming the data across EC2 instances, and loading to S3 or DynamoDB. Finally, it discusses analyzing the data using Redshift for queries or CloudSearch for search capabilities.
Prague data management meetup 2018-03-27 (Martin Bém)
This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.
This document discusses building a data lake on AWS. It describes using Amazon S3 for storage of structured, semi-structured, and unstructured data at scale. Amazon Kinesis is used for streaming ingest of data. A metadata catalogue using Amazon DynamoDB and AWS Lambda allows for data discovery and governance. IAM policies control access and encryption using AWS KMS provides security. APIs built using Amazon API Gateway provide programmatic access to the data lake resources.
DBP-010_Using Azure Data Services for Modern Data Applications (decode2016)
This document discusses using Azure data services for modern data applications based on the Lambda architecture. It covers ingestion of streaming and batch data using services like Event Hubs, IoT Hubs, and Kafka. It describes processing streaming data in real-time using Stream Analytics, Storm, and Spark Streaming, and processing batch data using HDInsight, ADLA, and Spark. It also covers staging data in data lakes, SQL databases, NoSQL databases and data warehouses. Finally, it discusses serving and exploring data using Power BI and enriching data using Azure Data Factory and Machine Learning.
AWS Webcast - Managing Big Data in the AWS Cloud_20140924 (Amazon Web Services)
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
Similar to IBM Cloud Day January 2021 Data Lake Deep Dive
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud (Torsten Steinbach)
Cloud is a sharing economy that reduces your spending. But does this also apply to data and analytics? Doesn't this require you to provision dedicated data warehouse systems to run analytics SQL queries on terabytes of data? With IBM Cloud, the answer is no. By using serverless analytics via IBM Cloud SQL Query, you can analyze your data directly where it sits, be it in IBM Cloud Object Storage or in your NoSQL databases. Due to the serverless nature of SQL Query, you only pay for your queries depending on the data volume that they process. There are no standing costs, and you do not need to provision and wait for a data warehouse, but you can still run SQL queries on terabytes of data.
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL? (Torsten Steinbach)
You don't necessarily have to set up a relational database, create tables, and load data in order to use a surprisingly rich set of SQL capabilities on your data in the cloud. IBM SQL Query lets you analyze terabytes of distributed data in heterogeneous formats with a complete ANSI SQL dialect in a completely serverless usage model, elegantly ETL data between formats and partitioning layouts as needed, and run complex time series transformations, analyses, and correlations with advanced built-in time series SQL algorithms that are differentiating in the industry. It also supports a complete PostGIS-compliant geospatial SQL function set. Come explore the stunningly advanced world of SQL without a database in IBM Cloud.
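The ETL-between-formats claim looks roughly like this in practice; a hedged sketch using the ibmcloudsql client, with bucket URLs and the partition column invented, and the INTO/PARTITIONED BY target clause written from memory of the service's documented dialect.

```python
# Hedged sketch: one SQL statement reads CSV from Cloud Object Storage and
# rewrites it as partitioned Parquet, no database or cluster involved.
import ibmcloudsql

sql_client = ibmcloudsql.SQLQuery(
    api_key="<IBM_CLOUD_API_KEY>",              # placeholder credentials
    instance_crn="<SQL_QUERY_INSTANCE_CRN>",
)
sql_client.run_sql("""
    SELECT * FROM cos://us-geo/raw-bucket/events/ STORED AS CSV
    INTO cos://us-geo/curated-bucket/events/ STORED AS PARQUET
    PARTITIONED BY (event_date)
""")
```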
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud (Torsten Steinbach)
Agile user and workload insights are one of the key elements of a cloud-native solution. When done well, this represents a real competitive advantage. In this session, we show you how to run cloud-native clickstream analysis with IBM Cloud. By combining serverless mechanisms like object storage for affordable and scalable persistency with SQL Query for serverless analysis of your clickstream data, you can establish a very cost-effective clickstream analysis pipeline easily and quickly.
IBM THINK 2019 - Self-Service Cloud Data Management with SQL (Torsten Steinbach)
SQL is a powerful language to express data transformations. But did you know that you can also use IBM Cloud SQL Query to convert data between various data formats and layouts on disk? In this session, you will see the full power of using SQL Query to move and transform your cloud data in an entirely self-service fashion. You can specify any data format, layout, or partitioning with a simple SQL statement. See how you can move and transform terabytes of data in the cloud in a very scalable fashion while being charged only for the individual SQL movement and transformation jobs, without standing costs.
Torsten Steinbach and Chris Glew present IBM Cloud Query, a serverless analytics service that allows users to run ANSI SQL queries against data stored in cloud object storage. Some key points:
- IBM Cloud Query allows users to query data in various open formats like CSV, Parquet, and JSON stored in cloud object storage using SQL, with results also stored in object storage.
- It has a pay-per-query pricing model with no infrastructure to manage. Queries can be run via a web console, REST API, or Python client.
- The presentation outlines the architecture and provides examples of using Cloud Query for log analytics, data exploration, and building serverless data pipelines with Cloud Functions.
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud (Torsten Steinbach)
- dashDB is IBM's cloud data warehouse service that provides advanced analytics capabilities in the cloud
- It offers three deployment options: an entry plan deployed within Bluemix, a bare metal option, and virtual machine options
- The entry plan provides terabyte-scale capacity within Bluemix while the bare metal and VM options provide more capacity and dedicated resources
- All options provide in-database analytics and backup/restore to Swift object storage for high availability
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud (Torsten Steinbach)
This document describes the geospatial analytics capabilities in IBM dashDB: spatial data types and functions for predicates, constructors, and calculations that enable spatial queries and analysis. GeoJSON and other formats can be loaded, and dashDB implements OGC and ISO spatial standards. Predictive analytics on spatial data is also possible using dashDB's R extension.
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis (Torsten Steinbach)
This document discusses using IBM's Bluemix and dashDB services for Twitter analysis. It provides an overview of the IBM Insights for Twitter service in Bluemix, which allows querying and searching over enriched Twitter data stored in dashDB. Examples are given of queries that can be performed, such as searching for tweets about an upcoming movie within a time frame or searching for tweets with positive sentiment about a product. The document also discusses loading Twitter data into dashDB using a Bluemix app and performing predictive analytics on the data using built-in R and Python capabilities in dashDB.
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i... (Torsten Steinbach)
This document summarizes a presentation on analyzing weather data using IBM's Cloud-Based Analytics of The Weather Company in IBM Bluemix. It discusses loading weather data from various sources into the IBM dashDB data warehouse for analysis using R and Python. Key points include:
- Loading weather data from sources like S3, Swift, on-premise databases, and The Weather Company into IBM Bluemix for analysis in dashDB.
- Using the ibmdbR and ibmdbPy packages to interface with dashDB from R and Python, performing analytics like predictive modeling, statistics, and visualizations directly in the database.
- Publishing predictive models and analytics as web applications using the dashDB REST API.
This document discusses analyzing geospatial data with IBM Cloud Data Services and Esri ArcGIS. It provides an overview of using Cloudant as a NoSQL database to store geospatial data in GeoJSON format and then load it into IBM dashDB for analytics. GeoJSON data can be stored in Cloudant in three different structures - as simple geometry, feature collections, or features. The document also describes how geospatial data from Cloudant can be transformed and loaded into dashDB tables for analysis using IBM data warehousing technologies.
1. IBM Cloud Day 2021
Deep Dive into Cloud Native Data Lakes with IBM Cloud
Torsten Steinbach, IBM Cloud Data Lake Architect
James Bennett, IBM Cloud Data Lake Offering Lead
21st Jan 2021
4. IBM Cloud Data Lake – Big Picture
Diagram: Telemetry data is ingested into the cloud data lake via streaming and ETL, where it is explored, prepared, enriched, optimized, and analyzed. The data lake provides:
ü Seamless elasticity
ü Seamless scalability
ü Highly cost effective
ü Long-term retention
ü Any data formats
Optionally, ETL promotes data onward into a DWH and databases, which hold warm, high-quality data only and provide response-time SLAs for analytics.
5. IBM Serverless Stack for Analytics
Diagram: Three serverless layers, each with consumption-based billing:
• Serverless Storage – Object Storage: only pay for the volume of data that you really store
• Serverless Runtimes – Cloud Functions: only pay for the CPU that you really consume
• Serverless Analytics – Query: only pay for the amount of data that you really scan
Blog Article
§ Properties of Serverless:
– No management of resources, hosts, and processes
– Auto-scaling and auto-provisioning based on actual load
– Precise billing based on actually consumed system resources (memory, storage, CPU, network, I/O)
– High availability is always implicit
6. IBM SQL Query – The Central Cloud Data Lake Service
Diagram: The serverless SQL Query service sits between cloud data (Object Storage + RDBMS) and its users (developers, data engineers, data analysts, and data scientists), covering data ingestion, data transformation, data management, and analytics.
ü Supports ad-hoc and unknown data structures
ü ETL & ELT support
ü 100% pay-as-you-go ($5/TB)
ü 100% API enabled
ü Automatic big data scale-out with Spark
ü 100% self-service, no setup
ü Built-in database catalog & data skipping
7. IBM SQL Query Architecture
Diagram: An application (1) submits SQL to the Query service, which (2) reads data from cloud data services (Event Streams, Db2 on Cloud, Cloud Object Storage), (3) writes result data back to Cloud Object Storage, and the application (4) reads the results. Query capabilities include Geospatial SQL, Timeseries SQL, Data Skipping, and a Hive Metastore.
• Uses the IBM Analytic Engine service (Spark clusters aaS)
• Large farm of Spark clusters auto-provisioned & auto-managed in the background
• Manages a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)
8. IBM SQL Query – Access Patterns
Diagram: Ways to create, explore, integrate, and deploy queries:
• SQL Console
• Watson Studio Notebooks
• Cloud Functions
• Python SDK
• REST API
• JDBC
• Object Store Console
• Event Streams Console
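For the Python SDK path, a minimal sketch with the ibmcloudsql package might look as follows (the API key, instance CRN, bucket paths, and the query itself are placeholders, not from the deck):

import ibmcloudsql

# Placeholders: supply your own API key, SQL Query instance CRN, and COS bucket
sql_client = ibmcloudsql.SQLQuery(
    api_key="<IBM_CLOUD_API_KEY>",
    instance_crn="<SQL_QUERY_INSTANCE_CRN>",
    target_cos_url="cos://us-south/myBucket/results/",  # where result objects land
)
sql_client.logon()  # exchanges the API key for an IAM token

# Run a query over Parquet objects on COS; the result comes back as a pandas DataFrame
df = sql_client.run_sql(
    "SELECT city, AVG(temp) AS avg_temp "
    "FROM cos://us-south/myBucket/weather/ STORED AS PARQUET "
    "GROUP BY city"
)
print(df.head())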
9. IBM Cloud Data Lake – Meta Data
Diagram: The metadata layer connects serverless SQL and Spark with cloud data on Object Storage and RDBMS:
• Hive Metastore – schema, partitioning, statistics
• Kafka Schema Registry – schemas for streaming data
• Xskipper – data skipping indexes
• Iceberg & Deltalake – ACID table formats
• Watson Knowledge Catalog – governance policies & lineage
10. IBM Cloud Data Lake – 2021 Architecture
Diagram: Event Streams handles stream data landing into COS as well as stream transformations & joins with real-time queries; SQL Query runs batch queries and ETL & data preparation on Object Storage. Metadata is an integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg), providing schema management & enforcement.
11. Combining Spatial and Temporal Processing
Diagram: Sensor data from mobile devices, cars, and other GPS-equipped devices lands in IBM Cloud Object Storage. Queries then combine:
• Location analytics with SQL/MM – location filtering and spatial aggregation
• Timeseries SQL – timeseries assembly and timeseries joins over the sensor metrics
12. The SQL Sandwich
Diagram: Raw data on Object Storage is explored, prepared, and used for batch analytics via SQL; SQL ETL promotes high-quality data into a Data Warehouse for interactive analytics with SLAs; SQL ETL also moves archived data back out to Object Storage for compliance reporting. SQL federation spans all three layers.
Blog Article: SQL Sandwich
13. Promoting Data After Preparation
Promote on COS or promote to Db2 with a single statement:
SELECT …
  INTO <COS URI> <format & layout options> |
       <Db2 service CRN>/<table name> | <Db2 database URI>/<table name>
  [CREATE | OVERWRITE | APPEND]* [PARALLELISM <num>]
COS URI: e.g. cos://us-south/myBucket/myFolder/myData.parquet
COS format/layout: e.g. STORED AS PARQUET PARTITIONED BY (city, date)
Db2 options:
PARALLELISM: number of parallel threads for writing (default 1)
Examples:
… INTO db2://db2w-dja.us-south.db2w.cloud.ibm.com/MYSCHEMA.MYTABLE PARALLELISM 20
… INTO crn:v1:bluemix:public:dashdb-for-tx:us-south:s/c38…:cf-service-instance:/MYTABLE
* future
Blog Article: Db2 ETL
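Putting the grammar together, a full promotion statement could look like the following sketch (bucket paths and columns are illustrative, not from the deck):

SELECT city, date, temp
  FROM cos://us-south/myBucket/prepared/weather/ STORED AS PARQUET
  INTO cos://us-south/myBucket/promoted/weather/ STORED AS PARQUET PARTITIONED BY (city, date)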
16. Data Skipping in IBM SQL Query
• Avoids reading irrelevant objects by using indexes
• Complements partition pruning with object-level pruning
• Stores aggregate metadata per object to enable skipping decisions
• Indexes are stored in COS
• Supports multiple index types
• Currently MinMax, ValueList, BloomFilter, Geospatial
• The underlying data skipping library is extensible
• New index types can easily be supported
• Enables data skipping on SQL UDFs
• e.g. ST_Contains, ST_Distance
• UDFs are mapped to indexes
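The data skipping library behind this is available as open source (Xskipper, named on the metadata slide). A minimal sketch of building indexes with its Python API, assuming the xskipper Python package, a running Spark session, and placeholder COS paths and column names:

from pyspark.sql import SparkSession
from xskipper import Xskipper

spark = SparkSession.builder.getOrCreate()

# Placeholder location where the index objects themselves are stored
md_location = "cos://us-south/myBucket/metadata/"
Xskipper.setConf(spark, {
    "io.xskipper.parquet.mdlocation": md_location,
    "io.xskipper.parquet.mdlocation.type": "EXPLICIT_BASE_PATH_LOCATION",
})

# Build MinMax, ValueList, and BloomFilter indexes over selected columns
dataset = "cos://us-south/myBucket/weather/"  # placeholder dataset location
xskipper = Xskipper(spark, dataset)
reader = spark.read.format("parquet")
xskipper.indexBuilder() \
    .addMinMaxIndex("temp") \
    .addValueListIndex("city") \
    .addBloomFilterIndex("vid") \
    .build(reader) \
    .show()

# Enable filtering so subsequent Spark SQL queries skip irrelevant objects
Xskipper.enableFiltering(spark)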
17. How Data Skipping Works
Diagram: Spark SQL query execution flow, implemented via the Catalyst optimizer and the session extensions API.
Without data skipping: Query → Prune partitions → Read data
With data skipping: Query → Prune partitions → Optional file filter (metadata filter) → Read data
19. Geospatial Data Skipping Example
Example query (region: Raleigh Research Triangle, US):
SELECT * FROM Weather STORED AS parquet
WHERE ST_Contains(
        ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'),
        ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
The ST_Contains UDF is mapped to necessary conditions on lat and long, which are evaluated against the MinMax metadata:
Object Name              | lat Min | lat Max
...
dt=2020-08-17/part-00085 | 35.02   | 36.17
dt=2020-08-17/part-00086 | 43.59   | 44.95
dt=2020-08-17/part-00087 | 34.86   | 40.62
dt=2020-08-17/part-00088 | 23.67   | 25.92
...
Objects whose lat range cannot intersect the polygon (e.g. part-00086 and part-00088) are not relevant to this query and are skipped.
20. 10x Acceleration with Data Skipping and Catalog
Chart: Benchmark results for the Raleigh Research Triangle query; the query-rewrite approach (yellow) is the baseline.
• 10x speedup on average, even with an already optimized data format (Parquet/ORC)
• For other formats, e.g. CSV/JSON/Avro, the acceleration is much larger
22. An Idealized Enterprise Data Lake Topology
Diagram: Data flows from systems of record, LoB databases, and streaming topics through scheduled ETL or CDC into a landing zone (via landing subscriptions). From there it moves through a prep zone into an enterprise zone, which also feeds an archive zone and the EDW. Each LoB data lake project has its own LoB prep zone and LoB analytic zone, from which data is published to analytic apps. Flows are marked as automatic, on demand, or read only.
23. The COVID-19 Data Lake
Making trusted COVID-19 data available to a broad set of analytics, e.g.:
§ https://accelerator.weather.com/bi
§ Watson Health Return to Work Advisor
Ø Easily extensible with new data sources
Ø Maximized velocity and elasticity
Ø Full automation of all pipelines
Ø New pipeline prototyped in hours & productized in 2-3 days
Ø Radically minimized resource and operational costs by using IBM Cloud serverless and full ops automation
Service roles in the solution:
• Cloud Object Storage – persist, trigger
• Watson Studio – pipeline PoCs, usage tutorials
• SQL Query – transformation, transport, table catalog (mart), queries, export
• Cloud Functions – static content creation, schema management, pipeline productization, automation, monitoring & alerting, pulling external data
24. The Four Ps of Serverless Pipelines
Diagram: Pipelines mature in stages:
• Prototype – in Python, using ibmcloudsql for data operations and ibm_boto3 for COS object operations
• PoC – create Watson Studio notebooks
• Productize – schedule the notebooks in Watson Studio or deploy them as Cloud Functions
A sketch of the prototyping stage follows below.
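For the prototyping stage, COS object operations with ibm_boto3 might look like this sketch (API key, instance CRN, endpoint, and bucket name are placeholders):

import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials for the COS instance
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="<IBM_CLOUD_API_KEY>",
    ibm_service_instance_id="<COS_INSTANCE_CRN>",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",
)

# Inspect what a collector has landed so far (assumes a non-empty bucket)
for obj in cos.list_objects_v2(Bucket="covid-landing")["Contents"]:
    print(obj["Key"], obj["Size"])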
25. COVID-19 Data Lake Topology – High Level
Diagram: Three zones mirror the E/T/L stages, each with its own buckets and namespace:
• Landing Zone (E) – landing buckets and landing namespace; TWC scrapers and pipeline collector sequences pull from external data sources (some sources also push)
• Preparation Zone (T) – preparation buckets and preparation namespace; preparation sequences transform the landed data
• Integration Zone (L) – integration buckets and integration namespace plus a data mart instance; mart sequences maintain the table catalog, upload location statistics, update reference data, and add partitions; delivery sequences query, extract, and transform data onward to the dashboarding DWH (Cognos)
Surrounding projects and instances: a pipeline PoC project with preliminary pipeline notebooks, pipeline instances for schema management and static content management, a mart management project, a data mart access project, and usage notebooks in each zone for users.
26. COVID-19 Data Mart – Data Model
Fact Tables:
• COUNTY_STATISTICS – county_id, dt, collected, attribution_url, confirmed_cases (& *_delta), deaths (& *_delta), hospitalized (& *_delta), testsperformed (& *_delta), recovered (& *_delta)
• PROVINCE_STATISTICS – province_id, dt, collected, attribution_url, confirmed_cases (& *_delta), deaths (& *_delta), hospitalized (& *_delta), testsperformed (& *_delta), recovered (& *_delta)
• COUNTRY_STATISTICS – country_id, dt, collected, attribution_url, confirmed_cases (& *_delta), deaths (& *_delta), hospitalized (& *_delta), testsperformed (& *_delta), recovered (& *_delta)
• ECDC_STATISTICS – country_id, dt, collected, attribution_url, confirmed_cases, deaths, confirmed_cases_delta, deaths_delta
• WHO_STATISTICS – country_id, dt, collected, attribution_url, confirmed_cases, deaths, confirmed_cases_delta, deaths_delta
Dimension Tables:
• COUNTIES – county_id, county_name, country_id, province_id, code_type, fips_code, nuts_code
• PROVINCES – province_id, province_name, country_id, code_type, fips_code, nuts_code
• COUNTRIES – country_id, country_name, code_type
• GEOGRAPHIC_FULL – geo_id, geolevel, region, lat, lon, geometry_wkt, attribution, attribution_url
• US_DEMOGRAPHIC – geo_id, geolevel, total_population, male, …, 35_to_40_years, …, attribution, attribution_url
• EU_DEMOGRAPHIC – geo_id, geolevel, sex_total, sex_y15_29, …, sex_f_age_total, …, attribution, attribution_url
• WORLD_DEMOGRAPHIC – geo_id, geolevel, population, migrants_net, …, attribution, attribution_url
• MX_DEMOGRAPHIC – geo_id, geolevel, population, attribution, attribution_url
Views:
• GEOGRAPHIC – substr(geometry_wkt, 1, 30)
• WORLD_GEOGRAPHIC / EU_GEOGRAPHIC / US_GEOGRAPHIC – filtered on country_id = 'WORLD' / 'EU' / 'US'
• WORLD_GEOGRAPHIC_FULL / EU_GEOGRAPHIC_FULL / US_GEOGRAPHIC_FULL – filtered on country_id = 'WORLD' / 'EU' / 'US'
• US_COUNTIES, US_PROVINCES – filtered on country_id = 'US'
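An illustrative mart query joining a fact table with its dimension (the date value is a placeholder):

SELECT c.country_name, s.dt, s.confirmed_cases, s.deaths
FROM COUNTRY_STATISTICS s
JOIN COUNTRIES c ON c.country_id = s.country_id
WHERE s.dt = '2021-01-20'
ORDER BY s.confirmed_cases DESC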
28. IBM Cloud – PaaS Context for Data Lake
Diagram: The IBM Cloud Data Lake (IBM SQL Query, IBM Analytic Engine, Cloud Object Storage, IBM Event Streams, IBM DataStage) covers streaming ingest, ETL, explore, prepare, enrich, optimize, and query over telemetry data, IBM Cloud Databases, and Db2 Warehouse. Cloud Pak for Data as a Service builds on it along the Ladder to AI (Collect – Organize – Analyze – Infuse): Watson Knowledge Catalog organizes and governs, Watson Studio provides data science, models are trained with Watson Machine Learning and monitored with Watson OpenScale, Cognos Analytics adds dashboarding, and Data Virtualization spans the sources. Cross-cutting services: Key Protect (protect), Cloud Functions (automate), LogDNA & Sysdig (operate).
29. Serverless == Self-Service == Empowerment:
Data Producers Becoming Data Product Owners
§ Enable data producers to prepare and serve data for analytic consumption
§ Data lake building blocks are easy-to-use and easy-to-automate services
§ Minimize data lake operations and resource cost overhead
§ Objective:
§ Eliminate all hurdles (and excuses) for data producers NOT to serve their data for analytics
§ Reduce classical data engineers to the role of data lake infrastructure providers
§ Reference:
§ Paradigm shift to Data Mesh: https://martinfowler.com/articles/data-monolith-to-mesh.html
30. IBM SQL Query – Timeseries SQL 1/2
§ Intuitive, first-of-a-kind SQL extensions for timeseries operations
§ Industry-leading differentiators, including:
• Timeseries transformation functions: correlation, Fourier transformation, z-normalization, Granger, interpolation, and distances
• Temporal joins: SQL support for left/right/full inner and outer joins of multiple timeseries
Diagram: Alignment & joining of timeseries.
31. IBM SQL Query – Timeseries SQL 2/2
§ Further industry-leading differentiators:
• Numerical and categorical timeseries types
• Timeseries data skipping for fast queries
• Forecasting: ARIMA, BATS, anomaly detection, etc.
• Subsequence mining: train & match models for event sequences
• Segmentation: time-based, record-based, anchor-based, burst, and silence
Diagram: Segmentation of a timeseries.
32. IBM SQL Query – Spatial SQL
§ SQL/MM standard to store & analyze spatial data in RDBMS
§ Migration of PostGIS-compliant SQL queries
§ Aggregation, computation, and joins via native SQL syntax
§ Industry-leading differentiators:
• Geodetic full-earth support
• Increased developer productivity
• Avoids piece-wise planar projections
• High-precision calculations anywhere on the earth
• Very large polygons (e.g. countries), polar caps, crossing the anti-meridian
• Spatial data skipping for fast queries
• Native and fine-granular geohash support
• Fast spatial aggregation
An example query follows below.
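As an illustration (not from the deck; the table path, columns, and coordinates are placeholders), a geodetic distance filter with SQL/MM functions:

SELECT id,
       ST_Distance(ST_Point(lon, lat), ST_WKTToSQL('POINT (-78.94 36.00)')) AS distance
FROM cos://us-south/myBucket/sensors/ STORED AS PARQUET
WHERE ST_Distance(ST_Point(lon, lat), ST_WKTToSQL('POINT (-78.94 36.00)')) < 10000
INTO cos://us-south/results STORED AS PARQUET

Because of the geodetic full-earth support, such a query stays correct without piece-wise planar projections, even near the poles or across the anti-meridian.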
34. Secure Passing of Custom Data Source Credentials
Diagram: The user, IBM Key Protect, the Query service, and the data sources interact as follows:
1. Create a user/password combination or API key for the data source
2. Store the password or API key base64-encoded as a custom key in Key Protect
3. Submit the SQL statement, referencing the password or API key via its Key Protect CRN
4. The Query service securely retrieves the password or API key
5. The Query service connects to the data source with the retrieved user/password combination or API key
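Step 2 could be scripted as in the following sketch, which imports a base64-encoded secret as a standard key via the Key Protect REST API (the region endpoint, instance GUID, IAM token, and key name are placeholders; verify the request shape against the current Key Protect API reference):

import base64
import requests

KP_ENDPOINT = "https://us-south.kms.cloud.ibm.com"   # placeholder region endpoint
INSTANCE_ID = "<KEY_PROTECT_INSTANCE_GUID>"          # placeholder instance GUID
IAM_TOKEN = "<IAM_ACCESS_TOKEN>"                     # placeholder bearer token

# Base64-encode the data source password, as required for the custom key
payload = base64.b64encode(b"<DATA_SOURCE_PASSWORD>").decode("ascii")

body = {
    "metadata": {"collectionType": "application/vnd.ibm.kms.key+json",
                 "collectionTotal": 1},
    "resources": [{
        "type": "application/vnd.ibm.kms.key+json",
        "name": "my-datasource-password",  # placeholder key name
        "payload": payload,
        "extractable": True,  # standard (extractable) key, so it can be retrieved later
    }],
}
resp = requests.post(
    f"{KP_ENDPOINT}/api/v2/keys",
    headers={"Authorization": f"Bearer {IAM_TOKEN}",
             "Bluemix-Instance": INSTANCE_ID},
    json=body,
)
resp.raise_for_status()
print(resp.json()["resources"][0]["crn"])  # reference this CRN in the SQL statement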
35. Thank you
Torsten Steinbach, Data Lake Services Architect, IBM Cloud, IBM
Resources:
– SQL Query Documentation: https://cloud.ibm.com/docs/sql-query?topic=sql-query-overview
– SQL Query Tutorial: https://dataplatform.cloud.ibm.com/exchange/public/entry/view/4a9bb1c816fb1e0f31fec5d580e4e14d
– SQL Cloud Function: https://hub.docker.com/r/ibmfunctions/sqlquery/
– COVID-19 Data Lake Presentation: https://ibm.biz/Bdq5Ys
– THINK 2020 Cloud Data Lake Presentation: https://ibm.biz/Bdq5Yi
– IBM Cloud Data Lake Team: #wdp-sql-service & #sqlquery-support on Slack
– Blogs:
• https://www.ibm.com/cloud/blog/new-builders/data-lakes-in-the-cloud
• https://www.ibm.com/cloud/blog/big-data-layout
• https://www.ibm.com/cloud/blog/new-builders/big-data
• https://www.ibm.com/cloud/blog/sql-databases-and-object-storage
• https://www.ibm.com/cloud/blog/accelerate-your-big-data-analytics-and-reduce-costs-by-using-ibm-cloud-sql-query
• https://www.ibm.com/cloud/blog/a-serverless-attack-on-ugly-log-archives
• https://www.ibm.com/cloud/blog/announcements/automate-serverless-data-pipelines-for-your-data-warehouse-or-data-lakes