Serverless SQL is a serverless analytics offering that lets users analyze data stored in object storage without managing any infrastructure. Key features include seamless elasticity, pay-per-query consumption, and the ability to analyze data directly in object storage without moving it. The platform combines serverless storage, data ingest, data transformation, analytics, and automation capabilities, and it aims to create a sharing economy for analytics by giving developers, data engineers, and analysts flexible access to data and analytics.
Analyzing StackExchange data with Azure Data Lake – BizTalk360
Big data is the new big thing where storing the data is the easy part. Gaining insights in your pile of data is something different. Based on a data dump of the well-known StackExchange websites, we will store & analyse 150+ GB of data with Azure Data Lake Store & Analytics to gain some insights about their users. After that we will use Power BI to give an at a glance overview of our learnings.
If you are a developer that is interested in big data, this is your time to shine! We will use our existing SQL & C# skills to analyse everything without having to worry about running clusters.
The document discusses Azure Data Factory V2 data flows. It will provide an introduction to Azure Data Factory, discuss data flows, and have attendees build a simple data flow to demonstrate how they work. The speaker will introduce Azure Data Factory and data flows, explain concepts like pipelines, linked services, and data flows, and guide a hands-on demo where attendees build a data flow to join customer data to postal district data to add matching postal towns.
Azure Stream Analytics: Analyse Data in Motion – Ruhani Arora
The document discusses evolving approaches to data warehousing and analytics using Azure Data Factory and Azure Stream Analytics. It provides an example scenario of analyzing game usage logs to create a customer profiling view. Azure Data Factory is presented as a way to build data integration and analytics pipelines that move and transform data between on-premises and cloud data stores. Azure Stream Analytics is introduced for analyzing real-time streaming data using a declarative query language.
Customer migration to Azure SQL Database, December 2019 – George Walters
This is a real-life story of how a software-as-a-service application moved to the cloud, to Azure, over a period of two years. We discuss the migration, the business drivers, the technology, and how it got done, and we talk through more modern ways to refactor or change code to get into the cloud today.
This document provides an overview of Azure Databricks, including:
- Azure Databricks is an Apache Spark-based analytics platform optimized for Microsoft Azure cloud services. It includes Spark SQL, streaming, machine learning libraries, and integrates fully with Azure services.
- Clusters in Azure Databricks provide a unified platform for various analytics use cases. The workspace stores notebooks, libraries, dashboards, and folders. Notebooks provide a code environment with visualizations. Jobs and alerts can run and notify on notebooks.
- The Databricks File System (DBFS) stores files in Azure Blob storage in a distributed file system accessible from notebooks. Business intelligence tools can connect to Databricks clusters via JDBC.
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2 – Amazon Web Services
This document discusses building data warehouses and data lakes in the cloud using AWS services. It provides an overview of AWS databases, analytics, and machine learning services that can be used to store and analyze data at scale. These services allow customers to migrate existing data warehouses to the cloud, build new data warehouses and data lakes more cost effectively, and gain insights from their data more easily.
This document summarizes key components of Microsoft Azure's data platform, including SQL Database, NoSQL options like Azure Tables, Blob Storage, and Azure Files. It provides an overview of each service, how they work, common use cases, and demos of creating resources and accessing data. The document is aimed at helping readers understand Azure's database and data storage options for building cloud applications.
These slides are a copy of the Azure Cosmos DB + Gremlin API in Action session that I had the pleasure of presenting on June 2nd, 2018 at the PASS SQL Saturday event in Montreal. The original PowerPoint version contained a much more elaborate series of animations; those had to be flattened for this upload, but you should still get the idea of the logic involved.
Is there a way that we can build our Azure Synapse Pipelines all with paramet... – Erwin de Kreuk
Is there a way that we can build our Synapse Data Pipelines with parameters, all based on metadata? Yes there is, and I will show you how. During this session I will show how you can load incremental or full datasets from your SQL database to your Azure Data Lake. The next step is to track the history of these extracted tables, which we will do using Delta Lake. The last step is to make this data available in Azure SQL Database or Azure Synapse Analytics. Oh, and we want some logging from our processes as well. A lot to talk about and demo during this session.
Hear Ryan Millay, IBM Cloudant software development manager, discuss what you need to consider when moving from the world of relational databases to a NoSQL document store.
You'll learn about the key differences between relational databases and JSON document stores like Cloudant, as well as how to dodge the pitfalls of migrating from a relational database to NoSQL.
Azure Databricks is Easier Than You Think – Ike Ellis
Spark is a fast and general engine for large-scale data processing. It supports Scala, Python, Java, SQL, R and more. Spark applications can access data from many sources and perform tasks like ETL, machine learning, and SQL queries. Azure Databricks provides a managed Spark service on Azure that makes it easier to set up clusters and share notebooks across teams for data analysis. Databricks also integrates with many Azure services for storage and data integration.
This document provides an introduction to Cloudant, which is a fully managed NoSQL database as a service (DBaaS) that provides a scalable and flexible data layer for web and mobile applications. The presentation discusses NoSQL databases and why they are useful, describes Cloudant's features such as document storage, querying, indexing and its global data presence. It also provides examples of how companies like FitnessKeeper and Fidelity Investments use Cloudant to solve data scaling and management challenges. The document concludes by outlining next steps for signing up and exploring Cloudant.
AWS Data Services provide a suite of serverless analytics tools including Amazon Athena for interactive SQL queries, AWS Glue for ETL and data cataloging, and Amazon S3 for exabyte-scale data storage. Together these services enable building a data lake architecture for ingesting, storing, discovering, and analyzing all types of data at scale.
Moving to the cloud: PaaS, IaaS or Managed Instance – Thomas Sykes
In this session we'll look at the cloud choices available in Azure for SQL Server. Whether it's PaaS, IaaS or Managed Instance we'll look into the features provided, the major differences and the Pros and Cons of each solution and how to choose the best option available.
DynamoDB Accelerator (DAX) is a fully managed caching layer for DynamoDB that provides microsecond latency, millions of requests per second of throughput, and automatic hot key management to improve application performance. DAX caches both individual items and entire queries from DynamoDB to provide single-digit millisecond responses while handling all cache management and high availability.
You may know Google for search, YouTube, Android, Chrome, and Gmail, but that's only as an end-user of OUR apps. Did you know you can also integrate Google technologies into YOUR apps? We have many APIs and open source libraries that help you do that! If you have tried and found it challenging, didn't find enough examples, ran into roadblocks, got confused, or are just curious about what Google APIs can offer, join us to resolve any blockers. Code samples will be in Python and/or Node.js/JavaScript. This session focuses on showing you how to access Google Cloud APIs from one of Google Cloud's compute platforms, whether serverless or otherwise.
Presented at DDD Melbourne on Sat Aug 8th, 2015
Himanshu Desai, Ahmed El-Harouny & Daniel Janczak
DocumentDB, Mongo or RavenDB? If you are starting out on a new project and considering a NoSQL database as an option, which one should you choose? What if the option you choose today does not work out to be the best one for your needs?
Come and join us for this session, we will take you on a journey where we will explain each of these database on their merits and compare them and also share War stories.
http://dddmelbourne.com
The document discusses various data design strategies and challenges for microservices architectures on AWS. It recommends using polyglot persistence to allow each microservice to choose its own database technology. It also discusses using eventual consistency between services, implementing transaction rollback mechanisms, and aggregating data from multiple services for reporting and analytics purposes.
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14... – Lucas Jellema
This presentation gives a brief overview of the history of relational databases, ACID, and SQL, and presents some of their key strengths and potential weaknesses. It introduces the rise of NoSQL - why it arose, what it entails, and when to use it. The presentation focuses on MongoDB as a prime example of a NoSQL document store and shows how to interact with MongoDB from JavaScript (Node.js) and Java.
BI and AI updates in the Microsoft Data Platform stack – Ivan Donev
This document provides an overview and agenda for a presentation on new features in the Microsoft data platform universe including SQL Server, Azure, and Power BI, with a focus on business intelligence (BI) and artificial intelligence (AI).
The presentation covers new features in SQL Server 2019 for BI like calculation groups and dynamic formatting in Analysis Services tabular mode. It also highlights important updates in Azure including new serverless and hyperscale options for Azure SQL Database, updates to data ingestion and storage with Azure Data Factory v2 and Azure Data Lake Gen 2, and changes to data modeling, preparation, and serving capabilities in Azure SQL Data Warehouse.
The document concludes by discussing options for incorporating AI and machine learning into BI solutions using either
Discovery Day 2019 Sofia - Big data clusters – Ivan Donev
This document provides an overview of the architecture and components of SQL Server 2019 Big Data Clusters. It describes the key Kubernetes concepts used in Big Data Clusters like pods, services, and nodes. It then explains the different planes (control, compute, data) and nodes that make up a Big Data Cluster and their roles. Components in each plane like the SQL master instance, compute pools, storage pools, and data pools are also outlined.
Microsoft SQL Server 2017 Level 300 technical deck – George Walters
This deck covers new features in SQL Server 2017, as well as carryover features from 2012 onwards, including high availability, columnstore, Always On, in-memory tables, and other enterprise features.
Discovery Day 2019 Sofia - What is new in SQL Server 2019 – Ivan Donev
This document summarizes new features in SQL Server 2019 including:
- Support for big data analytics using SQL Server, Spark, and data lakes for high volume data storage and access.
- Enhanced data virtualization capabilities to combine data from multiple sources.
- An integrated AI platform to ingest, prepare, train, store, and operationalize machine learning models.
- Improved security, performance, and high availability features like Always Encrypted with secure enclaves and accelerated database recovery.
- Continued improvements based on community feedback.
Amazon Kinesis Analytics is the easiest way to process streaming data in real time with standard SQL without having to learn new programming languages or processing frameworks. Amazon Kinesis analytics enables you to create and run SQL queries on streaming data so that you can gain actionable insights and respond to your business and customer needs promptly. In this session, we will provide an overview of the capabilities of the Amazon Kinesis Analytics. We will show you how you can build an entire stream processing pipeline to collect, ingest, process, and emit streaming data using Amazon Kinesis Analytics, Amazon Kinesis Firehose, and Amazon Kinesis Streams.
Modern ETL: Azure Data Factory, Data Lake, and SQL Database – Eric Bragas
This document discusses modern Extract, Transform, Load (ETL) tools in Azure, including Azure Data Factory, Azure Data Lake, and Azure SQL Database. It provides an overview of each tool and how they can be used together in a data warehouse architecture with Azure Data Lake acting as the data hub and Azure SQL Database being used for analytics and reporting through the creation of data marts. It also includes two demonstrations, one on Azure Data Factory and another showing Azure Data Lake Store and Analytics.
PaaSport to Paradise: Lifting & Shifting with Azure SQL Database/Managed Inst... – Sandy Winarko
This session focuses on the all PaaS solution of Azure SQL DB/Managed Instance (MI) + SSIS in Azure Data Factory (ADF) to lift & shift, modernize, and extend ETL workflows. We will first show you how to provision Azure-SSIS Integration Runtime (IR) – dedicated ADF servers for running SSIS – with SSIS catalog (SSISDB) hosted by Azure SQL DB/MI, configure it to access data on premises using Windows authentication and Virtual Network injection/Self-Hosted IR as a proxy, and extend it with custom/Open Source/3rd party components. We will next show you how to use the familiar SSDT/SSMS tools to design/test/deploy/execute your SSIS packages in the cloud just like you do on premises. We will finally show you how to modernize your ETL workflows by invoking/scheduling SSIS package executions as first-class activities in ADF pipelines and combining/chaining them with other activities, allowing you to trigger your pipeline runs by events, automatically (de)provision SSIS IR just in time, etc.
This document provides an overview and summary of the author's background and expertise. It states that the author has over 30 years of experience in IT working on many BI and data warehouse projects. It also lists that the author has experience as a developer, DBA, architect, and consultant. It provides certifications held and publications authored as well as noting previous recognition as an SQL Server MVP.
This document summarizes an IBM Cloud Day 2021 presentation on IBM Cloud Data Lakes. It describes the architecture of IBM Cloud Data Lakes including data skipping capabilities, serverless analytics, and metadata management. It then discusses an example COVID-19 data lake built on IBM Cloud to provide trusted COVID-19 data to analytics applications. Key aspects included landing, preparation, and integration zones; serverless pipelines for data ingestion and transformation; and a data mart for querying and reporting.
The document summarizes IBM's cloud data lake and SQL query services. It discusses how these services allow users to ingest, store, and analyze large amounts of data in the cloud. Key points include that IBM's cloud data lake provides a fully managed data lake service with serverless consumption and fully elastic scaling. It also discusses how IBM SQL Query allows users to analyze data stored in cloud object storage using SQL, and supports various data formats and analytics use cases including log analysis, time series analysis, and spatial queries.
IBM Cloud Day January 2021 - A well architected data lake – Torsten Steinbach
- The document discusses an IBM Cloud Day 2021 event focused on well-architected data lakes. It provides an overview of two sessions on data lake architecture and building a cloud native data lake on IBM Cloud.
- It also summarizes the key capabilities organizations need from a data lake, including visualizing data, flexibility/accessibility, governance, and gaining insights. Cloud data lakes can address these needs for various roles.
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud – Torsten Steinbach
Cloud is a sharing economy that reduces your spending. But does this also apply to data and analytics? Doesn't this require you to provision dedicated data warehouse systems to run analytics SQL queries on terabytes of data? With IBM Cloud, the answer is no. By using serverless analytics via IBM Cloud SQL Query, you can analyze your data directly where it sits, be it in IBM Cloud Object Storage or in your NoSQL databases. Due to the serverless nature of SQL Query, you only pay for your queries depending on the data volume that they process. There are no standing costs. You do not need to provision and wait for a data warehouse. But you can still run SQLs on terabytes of data.
IBM Cloud Native Day April 2021: Serverless Data Lake – Torsten Steinbach
- The document discusses serverless data analytics using IBM's cloud services, including a serverless data lake built on cloud object storage, serverless SQL queries using Spark, and serverless data processing functions.
- It provides an example of a COVID-19 data lake built on IBM Cloud that collects and integrates data from various sources, prepares and transforms the data, and makes it available for analytics and dashboards through serverless SQL queries.
IBM's Cloud-based Data Lake for Analytics and AI presentation covered:
1) IBM's cloud data lake provides serverless architecture, low barriers to entry, and pay-as-you-go pricing for analytics on data stored in cloud object storage.
2) The data lake offers SQL-based data exploration, transformation, and analytics capabilities as well as industry-leading optimizations for time series and geospatial data.
3) Security features include customer-controlled encryption keys and options to hide SQL queries and keys from IBM.
This document discusses building a data lake on AWS. It describes using Amazon S3 for storage, Amazon Kinesis for streaming data, and AWS Lambda to populate metadata indexes in DynamoDB and search indexes. It covers using IAM for access control, AWS STS for temporary credentials, and API Gateway and Elastic Beanstalk for interfaces. The data lake provides a foundation for storing and analyzing structured, semi-structured, and unstructured data at scale from various sources in a cost-effective and secure manner.
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift – Amazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
The document discusses Azure Data Factory and its capabilities for cloud-first data integration and transformation. ADF allows orchestrating data movement and transforming data at scale across hybrid and multi-cloud environments using a visual, code-free interface. It provides serverless scalability without infrastructure to manage along with capabilities for lifting and running SQL Server Integration Services packages in Azure.
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
By 2020, 50% of all new software will process machine-generated data of some sort (Gartner). Historically, machine data use cases have required non-SQL data stores like Splunk, Elasticsearch, or InfluxDB.
Today, new SQL DB architectures rival the non-SQL solutions in ease of use, scalability, cost, and performance. Please join this webinar for a detailed comparison of machine data management approaches.
This document summarizes key announcements from re:Invent 2016, AWS's annual user conference. The main themes included artificial intelligence, serverless computing, devops, data, and migration tools. Notable product announcements included AWS Batch for batch processing, Aurora for PostgreSQL, Athena for querying data lakes, and X-Ray for debugging distributed applications. The document also discusses AWS's strategy around machine learning and deep learning using MXNet as its primary framework.
Estimating the Total Costs of Your Cloud Analytics Platform – DATAVERSITY
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a platform designed to address multi-faceted needs by offering multi-function Data Management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion. They need a worry-free experience with the architecture and its components.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
- Cloud computing is important for big data applications as it provides variable expense, elastic capacity, and global reach. Amazon Web Services provides data storage, processing, and analytics services across a global network of regions and availability zones.
- Amazon Redshift is a fully managed data warehouse service that allows for fast queries on petabytes of structured data using standard SQL. It uses a columnar data storage format and data compression techniques to improve performance and reduce costs.
- Amazon EMR allows users to easily run Hadoop frameworks like Hive and Pig on AWS without having to manage hardware. It provides a scalable and cost-effective way to process vast amounts of unstructured data in Amazon S3.
- Amazon Kinesis enables real-
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi... – Cisco DevNet
Data gravity is a reality when dealing with massive amounts of data in globally distributed systems. Processing this data requires distributed analytics processing across InterCloud. In this presentation we will share our real-world experience with storing, routing, and processing big data workloads on Cisco Cloud Services and Amazon Web Services clouds.
Building Modern Data Platform with Microsoft Azure – Dmitry Anoshin
This document provides an overview of building a modern cloud analytics solution using Microsoft Azure. It discusses the role of analytics, a history of cloud computing, and a data warehouse modernization project. Key challenges covered include lack of notifications, logging, self-service BI, and integrating streaming data. The document proposes solutions to these challenges using Azure services like Data Factory, Kafka, Databricks, and SQL Data Warehouse. It also discusses alternative implementations using tools like Matillion ETL and Snowflake.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture – DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS – Amazon Web Services
This session will focus on how to get from 'Minimum Viable Product' (MVP) to scale. It will also explain how to deal with unpredictable demand and how to build a scalable business. Attend this session to learn how to:
Scale web servers and app services with Elastic Load Balancing and Auto Scaling on Amazon EC2
Scale your storage on Amazon S3 and S3 Reduced Redundancy Storage
Scale your database with Amazon DynamoDB, Amazon RDS, and Amazon ElastiCache
Scale your customer base by reaching customers globally in minutes with Amazon CloudFront
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
Serverless Cloud Data Lake with Spark for Serving Weather Data
1) The document discusses using a serverless architecture with IBM Cloud services like SQL Query powered by Spark, Cloud Object Storage, and Cloud Functions to build a cost-effective cloud data lake for serving historical weather data on demand.
2) It describes how data skipping techniques and geospatial indexes in SQL Query can accelerate queries by an order of magnitude by pruning irrelevant data.
3) The new serverless solution provides unlimited storage, global coverage, and supports large queries for machine learning and analytics at an order of magnitude lower cost than the previous implementation.
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL? – Torsten Steinbach
You don't necessarily have to set up a relational database, create tables, and load data in order to use a surprisingly rich set of SQL capabilities on your data in the cloud. IBM SQL Query lets you analyze terabytes of distributed data of heterogeneous formats with a complete ANSI SQL dialect in a completely serverless usage model, elegantly ETL data between formats and partitioning layouts as needed, and run complex time series transformations, analysis, and correlations with advanced built-in timeseries SQL algorithms that are differentiating in the entire industry. It also supports a complete PostGIS-compliant geospatial SQL function set. Come explore the stunningly advanced world of SQL without a database in IBM Cloud.
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud – Torsten Steinbach
Agile user and workload insights are one of the key elements of a cloud-native solution. When done well, this represents a real competitive advantage. In this session, we show you how to run cloud-native clickstream analysis with IBM Cloud. By combining serverless mechanisms like object storage for affordable and scalable persistency with SQL Query for serverless analysis of your clickstream data, you can establish a very cost-effective clickstream analysis pipeline easily and quickly.
IBM THINK 2019 - Self-Service Cloud Data Management with SQL – Torsten Steinbach
SQL is a powerful language to express data transformations. But did you know that you can also use IBM Cloud SQL to convert data between various data formats and layouts on disks? In this session, you will see the full power of using SQL Query to move and transform your cloud data in an entirely self-service fashion. You can specify any data format, layout or partitioning with a simple SQL statement. See how you can move and transform terabytes of data in the cloud in a very scalable fashion and still being charged only for the individual SQL movement and transformation jobs without having standing costs.
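As a rough illustration of the format and layout conversion described above, the following sketch (not the presenter's demo) rewrites CSV objects as partitioned Parquet with a single SQL statement via the ibmcloudsql Python SDK; the API key, instance CRN, bucket URLs, and the exact PARTITIONED BY clause syntax are assumptions to verify against the SQL Query reference.

```python
# Sketch: SQL-driven ETL on Cloud Object Storage with the ibmcloudsql SDK.
# All credentials and bucket URLs below are placeholders.
import ibmcloudsql

sql_client = ibmcloudsql.SQLQuery(
    api_key="YOUR_IBM_CLOUD_API_KEY",            # placeholder IAM API key
    instance_crn="YOUR_SQL_QUERY_INSTANCE_CRN",  # placeholder instance CRN
    target_cos_url="cos://us-geo/my-target-bucket/etl/",
)

# Rewrite raw CSV objects into Parquet, partitioned by a column, entirely in SQL.
job_id = sql_client.submit_sql("""
    SELECT * FROM cos://us-geo/my-raw-bucket/events/ STORED AS CSV
    INTO cos://us-geo/my-target-bucket/events_parquet/ STORED AS PARQUET
    PARTITIONED BY (event_date)
""")
sql_client.wait_for_job(job_id)
print(sql_client.get_job(job_id)["status"])
```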
Torsten Steinbach and Chris Glew present IBM Cloud Query, a serverless analytics service that allows users to run ANSI SQL queries against data stored in cloud object storage. Some key points:
- IBM Cloud Query allows users to query data in various open formats like CSV, Parquet, and JSON stored in cloud object storage using SQL, with results also stored in object storage.
- It has a pay-per-query pricing model with no infrastructure to manage. Queries can be run via a web console, REST API, or Python client.
- The presentation outlines the architecture and provides examples of using Cloud Query for log analytics, data exploration, and building serverless data pipelines with Cloud Functions.
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud – Torsten Steinbach
- dashDB is IBM's cloud data warehouse service that provides advanced analytics capabilities in the cloud
- It offers three deployment options: an entry plan deployed within Bluemix, a bare metal option, and virtual machine options
- The entry plan provides terabyte-scale capacity within Bluemix while the bare metal and VM options provide more capacity and dedicated resources
- All options provide in-database analytics and backup/restore to Swift object storage for high availability
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud – Torsten Steinbach
This document discusses geospatial analytics capabilities in IBM dashDB. It describes how dashDB supports geospatial data types and functions that allow spatial queries and analysis. This includes functions for spatial predicates, constructors, and calculations. GeoJSON and other formats can be loaded and dashDB implements OGC and ISO spatial standards. Predictive analytics is also possible using the R extension to dashDB. Overall the summary discusses dashDB's geospatial and predictive analytic capabilities for spatial data.
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis – Torsten Steinbach
Using Bluemix and dashDB for Twitter Analysis
This document discusses using IBM's Bluemix and dashDB services for Twitter analysis. It provides an overview of the IBM Insights for Twitter service in Bluemix, which allows querying and searching over enriched Twitter data stored in dashDB. Examples are given of queries that can be performed, such as searching for tweets about an upcoming movie within a time frame or searching for tweets with positive sentiment about a product. The document also discusses loading Twitter data into dashDB using a Bluemix app and performing predictive analytics on the data using built-in R and Python capabilities in dashDB.
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i... – Torsten Steinbach
This document summarizes a presentation on analyzing weather data using IBM's Cloud-Based Analytics of The Weather Company in IBM Bluemix. It discusses loading weather data from various sources into the IBM dashDB data warehouse for analysis using R and Python. Key points include:
- Loading weather data from sources like S3, Swift, on-premise databases, and The Weather Company into IBM Bluemix for analysis in dashDB.
- Using the ibmdbR and ibmdbPy packages to interface with dashDB from R and Python, performing analytics like predictive modeling, statistics, and visualizations directly in the database.
- Publishing predictive models and analytics as web applications using the dashDB REST API.
This document discusses analyzing geospatial data with IBM Cloud Data Services and Esri ArcGIS. It provides an overview of using Cloudant as a NoSQL database to store geospatial data in GeoJSON format and then load it into IBM dashDB for analytics. GeoJSON data can be stored in Cloudant in three different structures - as simple geometry, feature collections, or features. The document also describes how geospatial data from Cloudant can be transformed and loaded into dashDB tables for analysis using IBM data warehousing technologies.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... – Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data – Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
End-to-end pipeline agility - Berlin Buzzwords 2024 – Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai... – Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Enhanced data collection methods can help uncover the true extent of child abuse and neglect. This includes Integrated Data Systems from various sources (e.g., schools, healthcare providers, social services) to identify patterns and potential cases of abuse and neglect.
The Ipsos - AI - Monitor 2024 Report.pdf – Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of March 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
3. Evolution of Form Factors for Big Data Analytics
The '90s – Enterprise Data Warehouses: tightly integrated and optimized systems
The 2000s – Hadoop: introduced open data formats & easy scaling on commodity hardware
Today – Cloud-Native: Serverless Analytics-aaS
• Seamless elasticity
• Pay-per-query consumption
• Analyze data as it sits in an object store
• Disaggregated architecture
• No more infrastructure headaches
4. Sharing Economy for Analytics
Ingredient 1: Serverless Storage
Ingredient 2: Serverless Data Ingest
Ingredient 3: Serverless Data Transformation
Ingredient 4: Serverless Analytics
Ingredient 5: Serverless Automation
5. Object Storage – IBM Cloud Object Storage
Objects are organized into buckets and accessed over an elastic, durable, flexible REST API, encrypted at rest and on the wire, for pennies per GB.
Further options: resiliency choices, storage classes, user-managed encryption keys, S3 compatibility, high-speed data transfer via Aspera, and SQL queries directly on the stored objects.
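Since the slide lists S3 compatibility among the key properties, here is a minimal sketch of reading and writing COS objects with a stock S3 client; the endpoint, bucket name, and HMAC credentials are placeholders, not values from the deck.

```python
# Sketch: COS speaks the S3 API, so a generic S3 client can manage bucket objects.
import boto3

cos = boto3.client(
    "s3",
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",  # example regional endpoint
    aws_access_key_id="YOUR_HMAC_ACCESS_KEY",
    aws_secret_access_key="YOUR_HMAC_SECRET_KEY",
)

# Upload one object and list what is in the bucket.
cos.put_object(Bucket="my-data-bucket", Key="events/2021/01/events.json",
               Body=b'{"device": "sensor-1", "temp": 21.5}')
for obj in cos.list_objects_v2(Bucket="my-data-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```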
6. Data Ingest Options
Options range from highly customizable to out-of-the-box, with varying degrees of serverless-ness:
• IBM Event Streams (Kafka aaS)
• IBM Cloud Functions
• IBM Streaming Analytics (IBM Streams aaS)
• Direct upload via the Cloud Object Storage API
• SQL Query ETL
• Cloudant Replication
• Blockchain Synch
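As a hedged sketch of the most customizable ingest option above, IBM Event Streams (Kafka aaS), the following produces a JSON event with a plain Kafka client; the broker address, API key, and topic name are placeholders, and the literal username "token" follows the Event Streams SASL/PLAIN convention.

```python
# Sketch: produce one event to Event Streams (Kafka aaS) with confluent-kafka.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker-0-xxxx.kafka.svc01.us-south.eventstreams.cloud.ibm.com:9093",  # placeholder
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "token",                     # literal "token" per the Event Streams auth convention
    "sasl.password": "YOUR_EVENT_STREAMS_API_KEY",
})

event = {"device": "sensor-1", "ts": "2021-04-01T12:00:00Z", "temp": 21.5}
producer.produce("sensor-events", key="sensor-1", value=json.dumps(event))
producer.flush()  # block until the message is delivered
```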
7. IBM SQL Query – Serverless SQL Analytics and Data Transformation on Cloud Data
Analyzes and transforms data in Object Storage and Db2, serving developers, data engineers, and data analysts.
✓ Perfect for machine-generated data
✓ Ad-hoc data exploration
✓ Operationalizing data pipelines
✓ Big data lakes
✓ Flexible data transformation
✓ Extremely affordable: $5/TB scanned
✓ 100% API enabled
✓ Analytics on Object Storage
✓ Big data scale-out, running on Spark
✓ 100% self-service – no setup
8. IBM SQL Query – Architecture
1. The application submits SQL to IBM SQL Query.
2. SQL Query reads the data sets from IBM Cloud Object Storage.
3. SQL Query writes the result set back to Object Storage.
4. The application reads the results.
Data lands in Object Storage via upload, IBM Cloud Streaming (IBM Streams, Event Streams), IBM Cloud Functions, or archive/export from IBM Cloud Databases and Db2 on Cloud. Built-in capabilities include Geospatial SQL, Timeseries SQL, and Data Skipping.
9. IBM Cloud SQL Query – Very High Level Architecture (MVP 1Q 2018): Spark Cluster Architecture
Incoming SQL jobs (SQL 1, SQL 2, SQL 3, …) are placed in a request queue and dispatched to a pool of Analytics Engine clusters spread across Data Centers 1–3. Each cluster consists of multiple nodes with kernel pools of 20 Spark kernels, connected via JKG (Web Sockets), and reads data from Cloud Object Storage.
10. IBM Cloud Query – Access Patterns
Create, query, explore, integrate, and deploy via:
• SQL Web Console
• Watson Studio Notebooks
• SQL REST API, Node SDK, Python SDK
• JDBC (e.g. Looker)
• SQL Cloud Functions
11. Best-of-breed Spark SQL Reference
• Complete, intuitive, and interactive SQL reference
• Each sample SQL can be executed immediately as-is
https://cloud.ibm.com/docs/services/sql-query/sqlref/sql_reference.html#sql-reference
Analytics using the full power of Spark SQL
12. IBM SQL Query – Timeseries SQL 1/2
§ Intuitive, first-of-a-kind SQL extensions for timeseries operations
§ Industry-leading differentiators, including:
• Timeseries transformation functions: correlation, Fourier transformation, z-normalization, Granger, interpolation, and distances
• Alignment & joining – temporal joins: SQL support for left/right/full inner and outer joins of multiple timeseries
13. IBM SQL Query – Timeseries SQL 2/2
§ Further industry-leading differentiators:
• Numerical and categorical timeseries types
• Timeseries data skipping for fast queries
• Forecasting: ARIMA, BATS, anomaly detection, etc.
• Subsequence mining: train & match models for event sequences
• Segmentation: time-based, record-based, anchor-based, burst, and silence
14. IBM SQL Query – Spatial SQL
§ SQL/MM standard to store & analyze spatial data in RDBMS
§ Migration of PostGIS-compliant SQL queries
§ Aggregation, computation, and joins via native SQL syntax
§ Industry-leading differentiators:
• Geodetic full-earth support
• Increased developer productivity
• Avoid piece-wise planar projections
• High-precision calculations anywhere on the earth
• Very large polygons (e.g. countries), polar caps, crossing the anti-meridian
• Spatial data skipping for fast queries
• Native and fine-granular geohash support
• Fast spatial aggregation
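For illustration, a spatial filter in this style might look as follows; the bucket, the columns, and the assumption that ST_Point/ST_Distance are the available SQL/MM functions (with geodetic distances in meters) are mine rather than from the deck, so verify against the geospatial section of the SQL reference:
-- Illustrative sketch only: bucket and columns are hypothetical; verify function names and units in the SQL reference
SELECT id, lat, long
FROM cos://us-south/myBucket/sensorData/ STORED AS PARQUET
WHERE ST_Distance(ST_Point(long, lat), ST_Point(-73.9857, 40.7484)) < 1000.0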
15. Example: Spatio-Temporal Processing of Sensor Data
Sensor data from mobile devices and cars (GPS locations plus sensor metrics over time) lands in IBM Cloud Object Storage. Queries combine SQL/MM for location filtering and spatial aggregation with Timeseries SQL for timeseries assembly and timeseries joins, enabling location analytics.
17. Use Cases of Cloud Functions Adding Value to SQL
1. Unstructured data prep: Cloud Functions extract features from unstructured data in COS; SQL Query then analyzes the extracted features.
2. Automated/scheduled SQL execution: develop the SQL, deploy it as a SQL Cloud Function, and set up a Cloud Function trigger or schedule.
3. Shield data from direct access: deploy a Cloud Function with the COS API key and grant users execute rights only on the SQL Cloud Function; users call the function to access the data.
4. Configure SQL pipelines: create a function sequence that automates the flow of consecutive SQLs.
18. Now, what is this all good for?
Ingredient 1: Serverless Storage ✓
Ingredient 2: Serverless Data Ingest ✓
Ingredient 3: Serverless Data Transformation ✓
Ingredient 4: Serverless Analytics ✓
Ingredient 5: Serverless Automation ✓
19. Use for Data Pipelines to fuel BI & AI
Data from applications, IoT and streaming devices, and log messages lands in IBM Cloud Object Storage, and further data is acquired from data warehouses & databases (Db2 on Cloud). Queries then process the data (cleanse, filter, merge, aggregate, compress), and the results are explored, analyzed, and promoted to BI & AI tools such as Watson Studio, Looker, Cognos, and Watson Machine Learning (WML).
20. Turn your Logs into Business – Log Data Is The Cloud-Native Currency
Data-Driven Decisions ☛ Understanding system health, user behavior & workload status
Collecting & Analyzing Log Data ☛ Is NOT an afterthought, but rather the foundation for decisions on system and feature design
Data Volume Growing Rapidly ☛ Growth rates and data volume at rest can jump dramatically; very high elasticity is required
Competitive Advantage ☛ Is based on short runways for turning data into actions
21. Use for analyzing application logs
Logs from your cloud application or solution land in IBM Cloud Object Storage. Queries transform them (compress, aggregate, repartition) and analyze them for anomaly detection, user segmentation, customer support, and resource planning.
• Build & run data pipelines and analytics on your log message data
• Flexible log data analytics with the full power of SQL
• Seamless scalability & elasticity according to your log message volume
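As a hedged sketch of one such pipeline step (the bucket names and the columns ts and level are made up; the INTO result clause follows the table locator syntax described later in this deck):
-- Illustrative sketch only: buckets and columns are hypothetical
SELECT to_date(ts) AS dt, level, COUNT(*) AS events
FROM cos://us-south/myLogBucket/raw/ STORED AS CSV
GROUP BY to_date(ts), level
INTO cos://us-south/myLogBucket/daily/ STORED AS PARQUET PARTITIONED BY (dt)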
22. IDUG Db2 Tech Conference
Charlotte, NC | June 2 – 6, 2019
Data Lake in IBM Cloud – How it works
Land: data reaches the IBM Cloud data lake via streaming, upload, ETL from Db2, Aspera, Cloudant replication, secure sync, IBM Blockchain, and applications.
Process: SQL Query, Analytics Engine, and IBM Cloud Functions perform feature extraction, data prep, index creation, and ETL, with a metastore, Knowledge Catalog, and Key Protect providing governance and key management.
Integrate: results are federated into IBM Cloud Databases (ICD), Db2, and OLAP, and feed analytics and AI in Watson Studio, Watson Machine Learning (WML), and ICP for Data.
23. Further Resources
Getting started: https://www.ibm.com/cloud/sql-query
SQL Query Intro Video: https://youtu.be/s-FznfHJpoU
SQL Query Starter Notebook in Watson Studio: https://ibm.biz/BdYNrN
SQL Reference: https://ibm.biz/Bd2jF7
SQL Query API doc: https://cloud.ibm.com/apidocs/sql-query
Big Data Layout Best Practices for COS: https://ibm.biz/Bd2jRg
Serverless Data & Analytics: https://ibm.biz/Bd2jF5
25. What Insights can I extract from a Clickstream?
1. Identify friction points in users’ digital journey, e.g.:
• Clicks-to-purchase ratio
• Unexpected repeated page visits per user (e.g. entering payment data should only happen once)
• Last page visited per session
2. Identify click sequences for successful purchases
• Sequence matching using timeseries analysis
3. Identify customers/segments likely to churn or expand
• Look for typical page visits, actions, or flows (e.g. terms & conditions, inviting additional users)
4. Determine your most important content online
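For example, "last page visited per session" can be expressed with a window function; the bucket and the columns session_id, ts, and page below are hypothetical:
-- Illustrative sketch only: bucket and columns are hypothetical
SELECT session_id, page AS last_page
FROM (
  SELECT session_id, page,
         ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY ts DESC) AS rn
  FROM cos://us-south/myClickstreamBucket/events/ STORED AS PARQUET
) t
WHERE rn = 1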
27. Building an IBM Cloud-Native Data Lake
✓ Persistency: Serverless Storage
✓ Processing: Serverless SQL
✓ Orchestration: Serverless Pipeline Automation
✓ Data Ingest
✓ Data Catalog
✓ Serverless Unstructured Data Processing
28. Serverless – The Key to the IT Sharing Economy ... also for Analytics
• Traditional analytics systems
• Fixed capacities of appliances
• Specialized teams of data engineers & DBAs who manage the data model, access, and ETL
• BI analysts who have access only to the curated data sets in the EDW
• Innovative enterprises today
• A wide range of teams requires direct access to the same data set at all stages of the data pipeline: BI analysts, data scientists, quantitative marketers, dev/ops, developers
• The data engineers who support these teams need a much, much more scalable and cost-effective platform to ensure all teams have the access they need, when they need it
• They build analytics platforms in the cloud because of the scale and cost-efficiencies that come with serverless analytics over object stores
31. Best Practice: minimize bytes scanned
1. Use Parquet
• Column based: only read the columns you need
• Column-wise compression
• Min/max metadata
2. Use Hive-style partitioning to avoid reading unnecessary objects altogether (the technique has limitations), e.g.:
GPMeterStream/dt=2017-08-17/part-00085.csv
GPMeterStream/dt=2017-08-17/part-00086.csv
GPMeterStream/dt=2017-08-17/part-00087.csv
GPMeterStream/dt=2017-08-17/part-00088.csv
GPMeterStream/dt=2017-08-17/part-00089.csv
GPMeterStream/dt=2017-08-18/part-00001.csv
GPMeterStream/dt=2017-08-18/part-00002.csv
GPMeterStream/dt=2017-08-18/part-00003.csv
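As a hedged sketch, a query like the one below benefits from the second practice: the dt filter prunes whole partitions so objects under other dt= prefixes are never read; with Parquet instead of CSV, only the referenced columns would be scanned as well. The bucket and the columns meter_id and reading are made up:
-- Illustrative sketch only: bucket and columns are hypothetical
SELECT meter_id, AVG(reading) AS avg_reading
FROM cos://us-south/myBucket/GPMeterStream/ STORED AS CSV
WHERE dt = '2017-08-17'
GROUP BY meter_id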
32. Table Locators
cos://<endpoint>/<bucket>/[<prefix>] <format definition>
Endpoint – of your object storage bucket or a short alias
E.g. s3.us-south.objectstorage.appdomain.cloud or alias us-south
Bucket – name in object storage
Prefix – one or multiple objects (i.e. table partitions) with same prefix
Used in FROM clauses for input data and in target field for result set data
Examples:
cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet
cos://us/otherBucket/myData
cos://us/otherBucket/myData/part
cos://eu/newBucket/
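A minimal sketch using these locators in a query (bucket names are the placeholder examples above; the placement of the INTO result clause should be checked against the SQL reference):
-- Illustrative sketch only: buckets are hypothetical
SELECT *
FROM cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet STORED AS PARQUET
INTO cos://eu/newBucket/ STORED AS CSV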
33. Table Format Definition
<Table Locator> [JOBPREFIX JOBID | NONE]
[STORED AS CSV | PARQUET | JSON]
• Specifies the data format of the input data
• The table schema is automatically inferred at SQL execution time
• The STORED AS clause is optional; the default is CSV
• Additional parameters for CSV, e.g.: FIELDS TERMINATED BY '\t' NOHEADER
• JOBPREFIX only for targets: defines a unique prefix to append; the default is JOBID
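For example, a hedged sketch that reads tab-separated files without a header and writes JSON (bucket names are hypothetical):
-- Illustrative sketch only: buckets are hypothetical; schema is inferred at execution time
SELECT *
FROM cos://us-south/myBucket/rawLogs/ STORED AS CSV FIELDS TERMINATED BY '\t' NOHEADER
INTO cos://us-south/myBucket/rawLogsAsJson/ STORED AS JSON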
34. Table Partitioning Definition
SELECT … INTO <Table Locator> [STORED AS CSV | PARQUET | JSON]
[PARTITIONED [BY (<column list>)]
[INTO <num> BUCKETS]
[EVERY <num> ROWS]]
[SORT BY (<column list>)]
• BY: produces Hive-style partitioning
• INTO: produces a fixed number of partitions (hash partitioned)
• EVERY: produces partitions of even size (e.g. for pagination)
• SORT BY: exact result order & clustering when combined with PARTITIONED
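A hedged sketch combining these clauses (the bucket and the columns dt and customer_id are hypothetical):
-- Illustrative sketch only: bucket and columns are hypothetical
SELECT *
FROM cos://us-south/myBucket/events/ STORED AS CSV
INTO cos://us-south/myBucket/eventsAsParquet/ STORED AS PARQUET
PARTITIONED BY (dt)
SORT BY (customer_id)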
35. IBM SQL Query REST API
Submit a SQL query:
POST https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Runs the SQL in the background and returns a job_id.
Detailed info for a SQL query (e.g. status, result location):
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs/{job_id}
Returns JSON with query execution details.
List of recent SQL query executions:
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Returns a JSON array with the last 30 SQL submissions and outcomes.
36. Scaling Analytics: Data Skipping Saving you Time and $
• SQL Query learns which objects are not relevant to a query using a data skipping index
• CREATE METAINDEX stores index summary metadata for each object; the metadata is much smaller than the data
• Based on the WHERE clause, SQL Query then reads only the candidate objects in IBM Cloud Object Storage and skips irrelevant ones, significantly reducing I/O and saving time and $
• Independent of data formats
• Index types: Min/Max, Value List, Bounding Box
Example: get location and time of heat waves (> 40 degrees Celsius):
SELECT lat, long, city, temp, date
FROM weather
WHERE temp > 40.0
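The exact indexing syntax is documented in the SQL reference linked above; purely as an illustrative sketch of the idea (the keywords below are my guess at how the Min/Max and Value List index types might be expressed, not verified syntax):
-- Hypothetical sketch only: verify the real CREATE METAINDEX syntax in the SQL reference
CREATE METAINDEX
  MINMAX FOR temp,
  VALUELIST FOR city
FOR cos://us-south/myBucket/weather/ STORED AS PARQUET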
38. JDBC Driver for BI Applications (Apply for Beta Now)
• JDBC-compliant driver library that wraps the REST API
• Wraps both the SQL Query and COS REST APIs
• Exposes a regular session interface (JDBC Connection)
• Enables custom JDBC application support
• Enables BI application support; early adopter: Looker
• Support for stored table metadata (simple catalog)
• Stored as JSON in COS and referenced via the JDBC connection string
• I.e. the DatabaseMetaData interface is also supported
The driver translates JDBC API calls into SQL Query and COS REST calls; result sets and the table catalog live in COS and are consumed by tools such as Looker.
39. Using the SQL Query JDBC Driver
Define a table catalog: a JSON file in COS containing:
• Table name
• Location of table objects on COS
• Object format
• Column names
• Column types: INT, FLOAT, VARCHAR, TIMESTAMP
JDBC connection string:
jdbc:SQLQuery:<sql-query instance crn>
?schemabucket=<COS bucket with json catalog>
?schemafile=<COS object with json catalog>
&apikey=<api key for your account>
&targetcosurl=<COS URL for result set>