Jim Gray gave a presentation on Microsoft SQL Server and database research. He discussed SQL Server's goals of being easy to use and scalable. He outlined enhancements to SQL Server 7 including improved replication, query processing, and data warehousing capabilities. Gray also discussed challenges around managing the growing volume of data being created and the importance of data analysis. He concluded by previewing new capabilities for future versions of SQL Server like support for XML and object-relational features.
Jim Gray presented on his work with large databases and grid computing. He discussed two major projects - TerraServer and SkyServer/World Wide Telescope. TerraServer is a photo database of the United States containing over 15 TB of imagery data accessed through an SQL database. SkyServer is a database of astronomical data containing images and attributes of celestial objects from surveys like SDSS. Gray discussed lessons learned from building and managing these large databases, and future plans to build databases from inexpensive disk bricks. He advocated for grid computing through web services as a way to federate and access distributed data sources on the internet.
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream (Precisely)
Valuable mainframe data is often the missing piece in a holistic infrastructure view within Splunk. But if you're not a mainframe expert, knowing which data sources, fields and calculations are needed to get results within Splunk can be a challenge. Even those with mainframe knowledge can sometimes struggle.
With Syncsort Ironstream® you can easily capture the elements you need in real-time – and Ironstream's new Mainframe Data Model makes it easier than ever to work with complex mainframe metrics in Splunk.
View this webinar on-demand to learn more about this new feature, as well as how to:
• See categorized mainframe metrics in easily understood terms
• Get results faster – no need to research data sources, fields and calculations
• Broaden access to more team members – without the need for deep mainframe knowledge
• Use built-in Splunk tooling to get up and running quickly
• Realize valuable ROI sooner and eliminate the mainframe blind spot
This document provides an overview of architecting cloud applications for scale. It discusses key concepts like horizontal scaling, distributed computing, and common cloud architecture patterns. Specific examples are given of how large companies like Facebook, Twitter, and Flickr architect their systems using horizontal scaling, partitioning, caching, and other techniques to handle massive loads in a scalable way.
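The partitioning technique this overview names can be made concrete with a small sketch. The following is an illustrative consistent-hashing partitioner (the node names and replica count are invented for the example), the mechanism caches and partitioned stores commonly use to spread keys across machines so that adding a node remaps only a small slice of the key space:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring: adding or removing a node only
    remaps the keys adjacent to it, not the whole key space."""
    def __init__(self, nodes, replicas=100):
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            for i in range(replicas):
                self._ring.append((_hash(f"{node}#{i}"), node))
        self._ring.sort()

    def node_for(self, key: str) -> str:
        point = _hash(key)
        points = [p for p, _ in self._ring]
        idx = bisect.bisect(points, point) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("user:42")
assert owner == ring.node_for("user:42")  # deterministic: same key, same node
```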
Realtime Indexing for Fast Queries on Massive Semi-Structured Data (ScyllaDB)
Rockset is a realtime indexing database that powers fast SQL over semi-structured data such as JSON, Parquet, or XML without requiring any schematization. All data loaded into Rockset is automatically indexed, and a fully featured SQL engine powers fast queries without requiring any database tuning. Rockset exploits the hardware fluidity available in the cloud and automatically grows and shrinks the cluster footprint based on demand. Available as a serverless cloud service, Rockset is used by developers to build data-driven applications and microservices.
In this talk, we discuss some of the key design aspects of Rockset, such as Smart Schema and Converged Index. We describe Rockset's Aggregator Leaf Tailer (ALT) architecture that provides low latency queries on large datasets. Then we describe how you can combine lightweight transactions in ScyllaDB with realtime analytics on Rockset to power a user-facing application.
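The converged-index idea can be illustrated with a toy sketch. This is not Rockset's implementation; it is a minimal inverted index showing how every field of schemaless JSON can be indexed for fast lookups without declaring a schema first (the documents are invented):

```python
from collections import defaultdict

docs = [
    {"id": 1, "city": "Atlanta", "tags": ["sql", "mining"]},
    {"id": 2, "city": "London",  "tags": ["drill", "sql"]},
]

# Index every (field, value) pair of every document -- no schema needed.
index = defaultdict(set)
for doc in docs:
    for field, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            index[(field, v)].add(doc["id"])

def search(field, value):
    """Return the ids of documents where `field` contains `value`."""
    return sorted(index.get((field, value), set()))

print(search("tags", "sql"))     # -> [1, 2]
print(search("city", "London"))  # -> [2]
```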
Presented at SQL Saturday 220, Atlanta, GA, May 18, 2013. If you have a SQL Server license (Standard or higher) then you already have the ability to start data mining. In this new presentation, you will see how to scale up data mining from the free Excel 2013 add-in to production use. Aimed at beginning to intermediate data miners, this presentation will show how mining models move from development to production. We will use SQL Server 2012 tools including SSMS, SSIS, and SSDT.
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Simplifying Change Data Capture using Databricks Delta (Databricks)
In this talk, we will present recent enhancements to the techniques previously discussed in this blog: https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html. We will start by discussing the different CDC architectures that can be deployed in concert with Databricks Delta. We will then use notebooks to demonstrate updated CDC SQL and look at performance tuning considerations for both batch as well as streaming CDC pipelines into Delta.
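The semantics a CDC merge applies can be sketched in plain Python. This is an illustration of MERGE-style upsert/delete behavior, not Databricks Delta's actual implementation; the table layout and operation names are assumptions made for the example:

```python
def apply_cdc_batch(target, changes):
    """Apply a batch of CDC records to `target`, a dict keyed by primary key.
    Each change is (op, key, row); later records win, like a MERGE statement."""
    for op, key, row in changes:
        if op == "DELETE":
            target.pop(key, None)
        else:  # INSERT and UPDATE both upsert
            target[key] = row
    return target

table = {1: {"name": "alice", "plan": "free"}}
batch = [
    ("UPDATE", 1, {"name": "alice", "plan": "pro"}),
    ("INSERT", 2, {"name": "bob", "plan": "free"}),
    ("DELETE", 1, None),
]
apply_cdc_batch(table, batch)
print(table)  # -> {2: {'name': 'bob', 'plan': 'free'}}
```

A streaming pipeline runs the same logic continuously over micro-batches of change records instead of one batch at a time.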
Efficient data access is key to a high-performing application. Amazon Web Services provides several database options to support modern data-driven apps, and software frameworks to make developing against them easy. We look at the design of a modern serverless web app using Amazon DynamoDB, the DynamoDB Mapper, AWS Lambda, Amazon API Gateway and the SDKs, and tackle the move from relational to NoSQL data models.
Speaker: Clayton Brown, Solutions Architect, Amazon Web Services
Off-Label Data Mesh: A Prescription for Healthier Data (HostedbyConfluent)
Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh.
This document discusses using Amazon DynamoDB for application development and data modeling. It provides examples of modeling website session data, game state for tic-tac-toe, image tagging, and social leaderboards. For each use case, it demonstrates how to model the data in DynamoDB and implement common queries and operations.
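The key-design pattern behind such models can be sketched without an AWS account. Below is a toy stand-in for a DynamoDB table with a hash (partition) key and a range (sort) key; the attribute names and the session example are illustrative, not taken from the talk:

```python
from collections import defaultdict

class MockTable:
    """Toy stand-in for a DynamoDB table with a hash key and a range key,
    enough to show how session data can be modeled and queried."""
    def __init__(self):
        self._items = defaultdict(dict)  # hash_key -> {range_key: item}

    def put_item(self, hash_key, range_key, **attrs):
        self._items[hash_key][range_key] = {"pk": hash_key, "sk": range_key, **attrs}

    def query(self, hash_key):
        """All items in one partition, ordered by range key (like Query)."""
        partition = self._items[hash_key]
        return [partition[sk] for sk in sorted(partition)]

sessions = MockTable()
# One partition per user, one item per session, sorted by start time.
sessions.put_item("user#42", "2013-05-18T10:00", page="/home")
sessions.put_item("user#42", "2013-05-18T09:00", page="/login")
history = sessions.query("user#42")
print([s["page"] for s in history])  # -> ['/login', '/home']
```

The same shape carries over to the other use cases: a game state or leaderboard entry becomes one item, and the partition/sort key pair is chosen to match the dominant query.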
Graph Database Use Cases - StampedeCon 2015 (StampedeCon)
Presented by Max De Marzi at StampedeCon 2015: Graphs are eating the world – but in what form? Starting off with a primer on Graph Databases, this talk will focus on practical examples of graph applications.
We’ll look at multiple use cases like job boards, dating sites, recommendation engines of all kinds, network management, scheduling engines, etc. We'll also see some examples of graph search in action.
This document provides an overview of graph databases and their use cases. It begins with definitions of graphs and graph databases. It then gives examples of how graph databases can be used for social networking, network management, and other domains where data is interconnected. It provides Cypher examples for creating and querying graph patterns in a social networking and IT network management scenario. Finally, it discusses the graph database ecosystem and how graphs can be deployed for both online transaction processing and batch processing use cases.
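What a graph query computes can be sketched in plain Python. The snippet below mimics a two-hop Cypher pattern such as MATCH (p)-[:FRIEND]->()-[:FRIEND]->(fof) over an adjacency list; the people and edges are invented for the example:

```python
# Adjacency-list graph; edges are FRIEND relationships.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"dan"},
    "cat": {"dan", "eve"},
    "dan": set(),
    "eve": set(),
}

def friends_of_friends(graph, person):
    """People exactly two hops away, excluding direct friends and self --
    the classic friend-of-friend recommendation query."""
    direct = graph[person]
    two_hop = set()
    for f in direct:
        two_hop |= graph[f]
    return sorted(two_hop - direct - {person})

print(friends_of_friends(friends, "ann"))  # -> ['dan', 'eve']
```

A graph database answers this by walking relationships directly, which is why such traversals stay fast as the data grows, where the equivalent relational query needs repeated self-joins.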
SoundCloud - User & Partner Conference - AT Internet (AT Internet)
This document discusses SoundCloud's use of Amazon Redshift for big data analytics. It summarizes how SoundCloud uses Hadoop to store large amounts of listener data from multiple sources, but faced challenges with data silos and slow access. It migrated to using Amazon Redshift for its data warehouse which provides fast query performance on petabytes of data. It developed ETL processes to load data from source systems into Redshift, build a data model, and create aggregated fact tables and reporting data cubes for exploration and reporting.
Apache IoTDB: a Time Series Database for Industrial IoT (jixuan1989)
This document discusses Apache IoTDB, an open source time series database for industrial IoT applications. It describes the origins and goals of IoTDB in managing the large volumes of time-oriented machine data produced by IoT devices. Key features of IoTDB include efficient storage and querying of time series data, native support for time series operations, and integration with common data analytics ecosystems. The document outlines IoTDB's architecture, data model, query language, optimized TsFile data format, and encoding schemes.
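The kind of time series aggregation such a database performs natively can be sketched as a tumbling-window average; the readings and window size below are illustrative, not from the document:

```python
from collections import defaultdict

def downsample(points, window_s):
    """Average (timestamp, value) points into tumbling windows of
    `window_s` seconds -- a basic time series aggregation that
    databases like IoTDB support as a native query operation."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

readings = [(0, 20.0), (30, 22.0), (60, 30.0), (90, 28.0)]
print(downsample(readings, 60))  # -> {0: 21.0, 60: 29.0}
```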
What is a distributed data science pipeline, and how to build one with Apache Spark and friends (Andy Petrella)
What a data product was before the world changed and got so complex.
Why distributed computing and data science are the solution.
What problems does that add?
How to solve most of them using the right technologies, such as Spark Notebook, Spark, Scala, and Mesos, within an accompanying framework.
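The distributed model these tools build on can be sketched with a stdlib map/reduce word count; the input lines are invented, and in a real cluster the map step would run on many nodes in parallel:

```python
from collections import Counter
from functools import reduce

lines = [
    "spark makes distributed computing simple",
    "distributed data science needs distributed computing",
]

# Map: each line becomes its own partial word count (parallelizable).
partials = [Counter(line.split()) for line in lines]

# Reduce: merge partial counts pairwise, as a cluster would across nodes.
totals = reduce(lambda a, b: a + b, partials)

print(totals["distributed"])  # -> 3
```

Spark generalizes exactly this split: map work runs where the data lives, and only the small partial results are shuffled and merged.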
This document provides an agenda for a presentation on best practices for developing XPages applications on IBM Bluemix. The agenda covers prerequisites for getting started with Bluemix, separating application design from data, deployment options using the Domino Designer plugin versus the command line, understanding the MANIFEST.YML configuration file, security considerations, plugin support, and tips/tricks.
Towards a rebirth of data science, by Data Fellas (Andy Petrella)
Nowadays, Data Science is buzzing all over the place.
But what is a so-called Data Scientist?
Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations on the data.
However, we are entering an age where velocity is key. Not only is the velocity of your data high, but the time to market is also shortened. Hence, the time separating the moment you receive a set of data and the moment you can deliver added value is crucial.
In this talk, we'll review the legacy Data Science methodologies and what they meant in terms of delivered work and results.
Afterwards, we'll move towards the different concepts, techniques and tools that Data Scientists will have to learn and adopt in order to accomplish their tasks in the age of Big Data.
The talk closes by presenting the Data Fellas view on a solution to these challenges, especially through the Spark Notebook and the Shar3 product we develop.
Large scale, interactive ad-hoc queries over different datastores with Apache... (jaxLondonConference)
Presented at JAX London 2013
Apache Drill is a distributed system for interactive ad-hoc query and analysis of large-scale datasets. It is the open source version of Google's Dremel technology. Apache Drill is designed to scale to thousands of servers and to process petabytes of data in seconds, enabling SQL-on-Hadoop and supporting a variety of data sources.
The IBM i is an extremely complex integrated system, and many users struggle to understand its capabilities. View this slideshow to learn how to use the features of IBM i and how to use it to access databases.
Watch the on-demand webinar on HelpSystems.com:
http://www.helpsystems.com/sequel/events/recorded-webinars/database-101-ibm-i
Understanding The Azure Platform November 09 (DavidGristwood)
The document discusses Microsoft Azure, a cloud computing platform. It describes how Azure allows developers to build and host scalable applications and services through its global data center infrastructure. Azure offers several services including compute, storage, SQL databases, and content delivery to help applications scale efficiently in the cloud. The platform uses a pay-as-you-go model with no long-term commitments and allows customers to focus on their code instead of managing infrastructure.
ITCamp 2013 - Cristian Lefter - Transact-SQL from 0 to SQL Server 2012 (ITCamp)
This document contains summaries of new features in various versions of Microsoft SQL Server from 2000 to 2012. It begins with a brief history of SQL and an overview of basic database concepts. Each major version is then discussed in its own section, with new syntax, functions, and capabilities highlighted at a high level. The document concludes with a recommendation to learn more about SQL Server memory-optimized tables and to attend additional training.
Introduction to Text Mining and Visualization with Interactive Web Application (Olga Scrivner)
The document introduces an interactive text mining suite (ITMS) that allows users to analyze and visualize unstructured text data. ITMS allows users to upload text files, preprocess the data by removing stopwords and stemming words, visualize the data through word clouds and cluster analysis, and perform topic modeling. The tool aims to make natural language processing and text mining techniques more accessible to users without programming skills. Key functions of ITMS include uploading various data formats, interactive preprocessing options, visualization of word frequencies and topic models, and clustering documents. The document demonstrates example visualizations and analyses produced by the tool.
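The preprocessing step described above can be sketched in a few lines; the stopword list and the plural-stripping "stemmer" here are deliberately crude stand-ins for ITMS's real options:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and crudely 'stem' plurals --
    the kind of cleanup applied before building a word cloud."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

doc = "The miners mined texts, and text mining is the mining of texts."
freqs = Counter(preprocess(doc))
print(freqs.most_common(2))  # -> [('text', 3), ('mining', 2)]
```

The resulting frequency table is exactly the input a word cloud or cluster analysis consumes.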
Big Data Analytics: Finding diamonds in the rough with Azure (Christos Charmatzis)
This session presents the main workflows and technologies for getting value from the Big Data stored in our enterprise using Azure:
- Recognizing when we have a Big Data problem
- Finding the best solution for our Big Data
- Working inside the data team
- Extracting the true value of our data
The Google Hacking Database: A Key Resource to Exposing Vulnerabilities (TechWell)
We all know the power of Google—or do we? Two types of people use Google: normal users like you and me, and the not-so-normal users—the hackers. What types of information can hackers collect from Google? How severe is the damage they can cause? Is there a way to circumvent this hacking? As a security tester, Kiran Karnad uses the GHDB (Google Hacking Database) to ensure their product will not be the next target for hackers. Kiran describes how to effectively use Google the way hackers do, using advanced operators, locating exploits and finding targets, network mapping, finding user names and passwords, and other secret stuff. Kiran provides a recipe of five simple security searches that work. Learn how to automate the Google Hacking Database using Python so security tests can be incorporated as a part of the SDLC for the next product you develop.
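Composing advanced-operator searches can be sketched as simple string building, for auditing a domain you are authorized to test; the operators shown (site:, filetype:, intitle:, inurl:) are standard Google operators, while the helper function and example domain are invented:

```python
def dork(site, **operators):
    """Compose a Google advanced-operator query for auditing your own
    domain, e.g. finding indexed files you never meant to expose."""
    parts = [f"site:{site}"]
    parts += [f"{op}:{value}" for op, value in sorted(operators.items())]
    return " ".join(parts)

# A few simple audit searches against a domain you are authorized to test.
queries = [
    dork("example.com", filetype="sql"),
    dork("example.com", filetype="env"),
    dork("example.com", intitle='"index of"'),
    dork("example.com", inurl="admin"),
]
print(queries[0])  # -> site:example.com filetype:sql
```

Generating the query strings programmatically is what lets such checks run as part of an SDLC pipeline rather than as one-off manual searches.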
Synchronicity: Just-In-Time Discovery of Lost Web Pages (Michael Nelson)
The document discusses techniques for discovering lost web pages using lexical signatures. It finds that lexical signatures generated from page titles and content evolve over time, with terms dropping out. Signatures perform best with 5-7 terms. Combining titles with signatures provides better discovery results than either alone. Future work includes predicting "good" titles and augmenting signatures with tags and link neighborhoods.
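A lexical signature of the kind the study evaluates can be sketched with plain TF-IDF; the corpus below is invented, and real signatures are built from full page content:

```python
import math
from collections import Counter

def lexical_signature(doc_tokens, corpus, k=5):
    """Pick the k terms that best identify `doc_tokens` against `corpus`
    (a list of token lists), scored by term frequency times inverse
    document frequency -- the basis of a lexical signature."""
    n = len(corpus)
    df = Counter()
    for other in corpus:
        df.update(set(other))
    tf = Counter(doc_tokens)
    score = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return sorted(score, key=score.get, reverse=True)[:k]

corpus = [
    "web archiving preserves lost pages".split(),
    "search engines index the live web".split(),
    "lexical signatures rediscover lost web pages".split(),
]
sig = lexical_signature(corpus[2], corpus, k=3)
print(sig)  # -> ['lexical', 'signatures', 'rediscover']
```

Feeding the signature terms to a search engine as a query is what lets a lost page be rediscovered at a new URL.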
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Monitoring and Managing Anomaly Detection on OpenShift (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
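A minimal version of the anomaly-detection model at the heart of the tutorial can be sketched as a rolling z-score detector; the window size, threshold, and sensor readings below are illustrative, not from the slides:

```python
import statistics
from collections import deque

class ZScoreDetector:
    """Flag a reading as anomalous when it sits more than `threshold`
    standard deviations from the mean of a sliding window -- a minimal
    stand-in for a model deployed to a resource-constrained edge device."""
    def __init__(self, window=20, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, x):
        anomalous = False
        if len(self.values) >= 2:
            mu = statistics.mean(self.values)
            sigma = statistics.stdev(self.values)
            anomalous = sigma > 0 and abs(x - mu) / sigma > self.threshold
        self.values.append(x)
        return anomalous

detector = ZScoreDetector(window=10, threshold=3.0)
normal = [detector.observe(v) for v in [10.0, 10.2, 9.9, 10.1, 9.8, 10.0]]
spike = detector.observe(42.0)
print(any(normal), spike)  # -> False True
```

In the tutorial's architecture, each flagged reading would be published to Kafka, archived to S3, and counted in a Prometheus metric for alerting.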
Simplifying Change Data Capture using Databricks DeltaDatabricks
In this talk, we will present recent enhancements to the techniques previously discussed in this blog: https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html. We will start by discussing the different CDC architectures that can be deployed in concert with Databricks Delta. We will then use notebooks to demonstrate updated CDC SQL and look at performance tuning considerations for both batch as well as streaming CDC pipelines into Delta.
Efficient data access is key to a high-performing application. Amazon Web Services provides several database options to support modern data-driven apps and software frameworks to make developing against them easy. We look at the design of a modern serverless web app using Amazon DynamoDB, the DynamoDB Mapper, Amazon Lambda, Amazon API Gateway and the SDKs and tackle the move from relational to NoSQL data models.
Speaker: Clayton Brown, Solutions Architect, Amazon Web Services
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
"Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh."
This document discusses using Amazon DynamoDB for application development and data modeling. It provides examples of modeling website session data, game state for tic-tac-toe, image tagging, and social leaderboards. For each use case, it demonstrates how to model the data in DynamoDB and implement common queries and operations.
Graph Database Use Cases - StampedeCon 2015StampedeCon
Presented by Max De Marzi at StampedeCon 2015: Graphs are eating the world – but in what form? Starting off with a primer on Graph Databases, this talk will focus on practical examples of graph applications.
We’ll look at multiple use cases like job boards, dating sites, recommendation engines of all kinds, network management, scheduling engines, etc. We'll also see some examples of graph search in action.
This document provides an overview of graph databases and their use cases. It begins with definitions of graphs and graph databases. It then gives examples of how graph databases can be used for social networking, network management, and other domains where data is interconnected. It provides Cypher examples for creating and querying graph patterns in a social networking and IT network management scenario. Finally, it discusses the graph database ecosystem and how graphs can be deployed for both online transaction processing and batch processing use cases.
Sound cloud - User & Partner Conference - AT InternetAT Internet
This document discusses SoundCloud's use of Amazon Redshift for big data analytics. It summarizes how SoundCloud uses Hadoop to store large amounts of listener data from multiple sources, but faced challenges with data silos and slow access. It migrated to using Amazon Redshift for its data warehouse which provides fast query performance on petabytes of data. It developed ETL processes to load data from source systems into Redshift, build a data model, and create aggregated fact tables and reporting data cubes for exploration and reporting.
Apache IOTDB: a Time Series Database for Industrial IoTjixuan1989
This document discusses Apache IoTDB, an open source time series database for industrial IoT applications. It describes the origins and goals of IoTDB in managing the large volumes of time-oriented machine data produced by IoT devices. Key features of IoTDB include efficient storage and querying of time series data, native support for time series operations, and integration with common data analytics ecosystems. The document outlines IoTDB's architecture, data model, query language, optimized TsFile data format, and encoding schemes.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
What was a data product before the world changed and got so complex.
Why distributed computing/data science is the solution.
What problems does that add?
How to solve most of them using the right technologies like spark notebook, spark, scala, mesos and so on in a accompanied framework
This document provides an agenda for a presentation on best practices for developing XPages applications on IBM Bluemix. The agenda covers prerequisites for getting started with Bluemix, separating application design from data, deployment options using the Domino Designer plugin versus the command line, understanding the MANIFEST.YML configuration file, security considerations, plugin support, and tips/tricks.
Towards a rebirth of data science (by Data Fellas)Andy Petrella
Nowadays, Data Science is buzzing all over the place.
But what is a, so-called, Data Scientist?
Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations on the data.
However, we enter an age where velocity is a key. Not only the velocity of your data is high, but the time to market is shortened. Hence, the time separating the moment you receive a set of data and the time you’ll be able to deliver added value is crucial.
In this talk, we’ll review the legacy Data Science methodologies, what it meant in terms of delivered work and results.
Afterwards, we’ll slightly move towards different concepts, techniques and tools that Data Scientists will have to learn and appropriate in order to accomplish their tasks in the age of Big Data.
The talk closes by presenting the Data Fellas view of a solution to these challenges, especially via the Spark Notebook and the Shar3 product we develop.
Large scale, interactive ad-hoc queries over different datastores with Apache... (jaxLondonConference)
Presented at JAX London 2013
Apache Drill is a distributed system for interactive ad-hoc query and analysis of large-scale datasets. It is the Open Source version of Google’s Dremel technology. Apache Drill is designed to scale to thousands of servers and able to process Petabytes of data in seconds, enabling SQL-on-Hadoop and supporting a variety of data sources.
The IBM i is an extremely complex integrated system, and many users struggle to understand its capabilities. View this slideshow to learn how to use the features of IBM i and how to use it to access databases.
Watch the on-demand webinar on HelpSystems.com:
http://www.helpsystems.com/sequel/events/recorded-webinars/database-101-ibm-i
Understanding The Azure Platform, November 09 (DavidGristwood)
The document discusses Microsoft Azure, a cloud computing platform. It describes how Azure allows developers to build and host scalable applications and services through its global data center infrastructure. Azure offers several services including compute, storage, SQL databases, and content delivery to help applications scale efficiently in the cloud. The platform uses a pay-as-you-go model with no long-term commitments and allows customers to focus on their code instead of managing infrastructure.
ITCamp 2013 - Cristian Lefter - Transact-SQL from 0 to SQL Server 2012 (ITCamp)
This document contains summaries of new features in various versions of Microsoft SQL Server from 2000 to 2012. It begins with a brief history of SQL and an overview of basic database concepts. Each major version is then discussed in its own section, with new syntax, functions, and capabilities highlighted at a high level. The document concludes with a recommendation to learn more about SQL Server memory-optimized tables and attending additional training.
Introduction to Text Mining and Visualization with Interactive Web Application (Olga Scrivner)
The document introduces an interactive text mining suite (ITMS) that allows users to analyze and visualize unstructured text data. ITMS allows users to upload text files, preprocess the data by removing stopwords and stemming words, visualize the data through word clouds and cluster analysis, and perform topic modeling. The tool aims to make natural language processing and text mining techniques more accessible to users without programming skills. Key functions of ITMS include uploading various data formats, interactive preprocessing options, visualization of word frequencies and topic models, and clustering documents. The document demonstrates example visualizations and analyses produced by the tool.
Big Data Analytics: Finding diamonds in the rough with Azure (Christos Charmatzis)
This session presents the main workflows and technologies for getting value from Big Data stored in our enterprise using Azure.
- When we have a Big Data problem
- Finding the best solution for our Big Data
- Working inside the Data Team
- Extract the true value of our data.
Presented at SQL Saturday Atlanta May 18, 2013
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
The Google Hacking Database: A Key Resource to Exposing Vulnerabilities (TechWell)
We all know the power of Google—or do we? Two types of people use Google: normal users like you and me, and the not-so-normal users—the hackers. What types of information can hackers collect from Google? How severe is the damage they can cause? Is there a way to circumvent this hacking? As a security tester, Kiran Karnad uses the GHDB (Google Hacking Database) to ensure their product will not be the next target for hackers. Kiran describes how to effectively use Google the way hackers do, using advanced operators, locating exploits and finding targets, network mapping, finding user names and passwords, and other secret stuff. Kiran provides a recipe of five simple security searches that work. Learn how to automate the Google Hacking Database using Python so security tests can be incorporated as a part of the SDLC for the next product you develop.
Synchronicity: Just-In-Time Discovery of Lost Web Pages (Michael Nelson)
The document discusses techniques for discovering lost web pages using lexical signatures. It finds that lexical signatures generated from page titles and content evolve over time, with terms dropping out. Signatures perform best with 5-7 terms. Combining titles with signatures provides better discovery results than either alone. Future work includes predicting "good" titles and augmenting signatures with tags and link neighborhoods.
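The signature idea above can be sketched in a few lines. This is a minimal stand-in, not the paper's method: it ranks plain term frequencies, whereas the paper weighs terms (e.g., TF-IDF against a corpus) and finds 5-7-term signatures work best; the stopword list and sample page are invented for illustration.

```python
import re
from collections import Counter

def lexical_signature(text: str, k: int = 5) -> list:
    """Top-k frequent content words as a page signature.
    (Plain term frequency and a toy stopword list stand in for the
    paper's weighted term selection.)"""
    stop = {"the", "a", "of", "and", "to", "in", "is", "for", "on", "over"}
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop]
    return [w for w, _ in Counter(words).most_common(k)]

# A hypothetical page to sign.
page = ("TerraServer is a photo database of the United States. "
        "The TerraServer database serves imagery over the web.")
print(lexical_signature(page))
```

Such a signature can then be fed to a search engine to rediscover a page that has moved from its original URL.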
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Trusted Execution Environment for Decentralized Process Mining (LucaBarbaro3)
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
A Comprehensive Guide to DeFi Development Services in 2024 (Intelisync)
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some approaches that can lead to unnecessary expense, e.g. using a person document instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep the overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Letter and Document Automation for Bonterra Impact Management (fka Social Sol...) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Gray_Compass99.ppt
1. Jim Gray, Research and Microsoft SQL Server
Microsoft Research http://research.Microsoft.com/~gray/talks/
3 October 1999
Chicago, Ill.
1
Microsoft SQL Server, Scalability, & Database Research
Jim Gray, Researcher, Microsoft Corporation
2. Outline
Summary of what you heard. (10 min)
The database scene in general. (10 min)
Scalability: Farms, Clones, Parts & Packs (15 min)
Microsoft DB research focus. (15 min)
• TerraServer (design and ops).
• RAGS.
• Data Mining
Q&A (10 min)
3. Organizations Are Going Online
Building a digital nervous system.
Inexpensive hardware means huge databases are possible.
But, we are drowning in data.
Databases help organize information.
Microsoft’s goal:
• Information at your fingertips.
• Make it easy to capture, manage, and analyze information.
4. Microsoft SQL Server 7 Goals
Easy
Dynamic self management
Multi-site management
Operation Scripting
Job scheduling and execution
Alert/response management
Scriptable Install+upgrade
DBA profiling/tuning tools
Unicode
English Language Query
Integrated with NT Security
Integrated with NT files
Scalability
Data Warehousing
5. Scalability
Win9x/NTW version
Dynamic row-level locking
Improved query optimizer
Intra-query parallelism
VLDB improvements
Replication improvements
Distributed query
High Availability Clusters
6. Scale Down to Windows 95-98
Full function (same as NTW)
Integration with Access 97
MSDE in Office2000 and MSDN
WinCE version demonstrated
7. Replication
Transactional and Merge
Remote update
ODBC and OLE DB subscribers
Wizards
Performance
[Diagram: a Publisher (with an updating subscriber using immediate updates via 2PC/RPC) feeds a Distributor, which replicates to Subscribers, including DB2 on OS/390 and CICS/VSAM systems]
8. Query Processor Enhancements
Parallelism
Improved scan, fetch, & sort
Smart hash & merge join
Large joins & grouping
Better query optimization
Multi-index operations
Automatic statistics maintenance
Distributed Query
Heterogeneous Query
Focus on Complex Queries
9. Parallel Query
SMP & Disk Parallelism
Plus Distributed
Plus Hash Join (fanciest on the planet)
Plus Optimized Partitioned views
[Diagram: disks hold 50,000 rows; local aggregation produces 4 x 50 rows; global aggregation yields the 50-row result (# of employees and total income per group)]
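The local/global aggregation pattern on this slide can be sketched outside SQL Server. The Python below is an illustrative sketch, not SQL Server's executor: four in-memory partitions stand in for four disks, and the group/income data is invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented employee rows: (group, income). Four in-memory partitions
# stand in for the four disks on the slide.
rows = [("g%d" % (i % 50), 100 + i % 7) for i in range(50_000)]
partitions = [rows[i::4] for i in range(4)]

def local_agg(part):
    """Per-partition pass: count employees and sum income per group."""
    acc = {}
    for group, income in part:
        cnt, tot = acc.get(group, (0, 0))
        acc[group] = (cnt + 1, tot + income)
    return acc

# Local aggregation runs on each partition in parallel (the "4 x 50 rows").
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(local_agg, partitions))

# Global aggregation merges the partials into the final 50-row result.
result = {}
for partial in partials:
    for group, (cnt, tot) in partial.items():
        c, t = result.get(group, (0, 0))
        result[group] = (c + cnt, t + tot)

print(len(result))                          # 50 groups
print(sum(c for c, _ in result.values()))   # 50,000 employees in total
```

The point of the two-phase split is that the expensive scan-and-aggregate work parallelizes across partitions, and only the small per-partition summaries need to be merged.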
10. Distributed Heterogeneous Queries
Data Fusion / Integration
Join spreadsheets, databases, directories, text DBs, etc.
Any source that exposes OLE DB interfaces
SQL Server as gateway, even on the desktop
[Diagram: the SQL 7.0 query processor federates databases (DB2, VSAM, Oracle, …), spreadsheets, photos, mail, maps, documents and the Web, and directory services]
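A heterogeneous join of the kind this slide describes can be imitated in miniature. The sketch below is only an analogy for the OLE DB federation idea, not OLE DB itself: SQLite stands in for a database source, a CSV string for a spreadsheet, and the part data is hypothetical.

```python
import csv, io, sqlite3

# A "database" source: part numbers with descriptions (hypothetical data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE parts (part_num TEXT, descr TEXT)")
db.executemany("INSERT INTO parts VALUES (?, ?)",
               [("ABC47-Z", "widget"), ("XYZ12-A", "gadget")])

# A "spreadsheet" source: inventory counts arriving as CSV text.
sheet = io.StringIO("part_num,quantity\nABC47-Z,100\nXYZ12-A,35\n")
inventory = {r["part_num"]: int(r["quantity"]) for r in csv.DictReader(sheet)}

# The federating step: one query-side join across both sources.
joined = [(part, descr, inventory.get(part, 0))
          for part, descr in db.execute("SELECT part_num, descr FROM parts")]
print(joined)
```

In the slide's architecture this join happens inside the SQL Server query processor, with each foreign source wrapped behind a common (OLE DB) rowset interface.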
11. Utilities: The Key to LARGE Databases
Backup
• Fuzzy
• Parallel
• Incremental
• Restartable
Recovery
• Fast
• File granularity
Reorganize
• shrinks file
• reclusters file
Auto-Repair
Index creation
~2x faster than 6.5
DBCC
• not required,
• a good practice
• 5x - 100x faster
[Chart: recovery time (secs) vs. # of indices, showing SQL Server 7.0 recovering much faster than SQL Server 6.5]
12. Data Warehousing
Warehousing Framework
Visual data modeler
Microsoft repository
Data transformation services (DTS)
Plato & Dcube - Multi-Dimensional Data Cubes
English query 2.0
13. Data Warehouse / Data Analysis
Data Transformation Services to get data into the warehouse
CUBE (OLE/DB OLAP) to analyze data
[Diagram: operational data is extracted & loaded into data warehouse storage, then served to OLAP]
14. Plato and Data Cube and HOLAP
[Diagram: a source table is split into partitions 1-3 holding Europe, USA, and Asia data (ROLAP); the “Plato” server answers MD SQL and SQL for the “Plato” Designer and Dcube client apps serving users 1 and 2]
[Cube illustration: aggregations by Color, by Make & Year, by Color & Year, by Make, by Year, and the overall Sum, over the colors RED, WHITE, BLUE]
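The cube of aggregates in the illustration (by Make, by Year, by Make & Year, …, Sum) amounts to grouping the same facts under every subset of the dimensions. A minimal sketch, with invented car-sales facts:

```python
from itertools import combinations

# Hypothetical car-sales facts: (color, make, year, units sold).
facts = [("RED", "Ford", 1998, 3), ("WHITE", "Ford", 1999, 5),
         ("BLUE", "Chevy", 1998, 2), ("RED", "Chevy", 1999, 4)]
dims = ("color", "make", "year")

# One aggregate table per subset of dimensions: the empty subset is the
# grand total ("Sum"), singletons are "by Color" / "by Make" / "by Year",
# pairs are "by Make & Year", and so on.
cube = {}
for r in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), r):
        agg = {}
        for row in facts:
            key = tuple(row[i] for i in subset)
            agg[key] = agg.get(key, 0) + row[3]
        names = tuple(dims[i] for i in subset) or ("sum",)
        cube[names] = agg

print(cube[("sum",)][()])                      # grand total: 14
print(cube[("color",)][("RED",)])              # RED across all: 7
print(cube[("make", "year")][("Ford", 1999)])  # 5
```

With n dimensions this materializes 2^n group-bys, which is why OLAP servers like Plato choose between precomputing cells (MOLAP), leaving them in relational tables (ROLAP), or mixing the two (HOLAP).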
15. English Query
16. Easy, Scalable Data Warehousing
17. “Shiloh”: The Next SQL Server
Shiloh (H1’00) - Strengthen Position
• Data Warehousing leadership
Materialized Views
Cascading Referential Integrity (#1 requested user-group feature)
XML support
• Scalability
WinCE support
W2K VLM (36 and 64 bit)
Multi-instance support
Yukon – Next Big Step
• Scalability (Clusters, Partitions)
• Programmability
• Ease of Use (Self Tuning, Auto Config)
18. Outline
Summary of what you heard. (10 min)
The database scene in general. (10 min)
Scalability: Farms, Clones, Parts & Packs (10 min)
Microsoft DB research focus. (15 min)
• TerraServer (design and ops).
• RAGS.
• Data Mining
Q&A (10 min)
19. Info Capture
You can record everything you see or hear or read.
What will you do with it?
How will you organize & analyze it?
Most data will never be seen.
Analysis and summarization are key technologies.
Video: 8 PB per lifetime (10 GB/h)
Audio: 30 TB (10 KB/s)
Read or write: 8 GB (words)
See: http://www.lesk.com/mlesk/ksg97/ksg.html
[Scale chart from Kilo to Yotta, placing a book, a photo, a movie, all books (words), all books multimedia, and everything recorded on the scale]
20. Data Tidal Wave
Seagate 47 GB drive @ $783 (= 1.7¢/MB)
• 100 GB penny-per-MB drive coming in 2000
$10/GB = $10k/terabyte!
• “Everyone” can afford one
What’s a terror bite?
• If you sell ten billion items a year (e.g. Wal-Mart)
• And you record 100 bytes on each one
• Then you get a TeraByte/year
Where will the terror bytes come from?
• Multimedia (like the TerraServer) and...
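The slide's terabyte arithmetic checks out; as a quick sanity check (the retailer scale and $/GB figures are taken from the slide):

```python
# Figures from the slide: ten billion items/year, 100 bytes recorded each.
items_per_year = 10_000_000_000
bytes_per_item = 100
total = items_per_year * bytes_per_item
print(total)                # 1,000,000,000,000 bytes = one terabyte/year

# And at the slide's $10/GB, a year of that data costs about:
print(total // 10**9 * 10)  # $10,000 -- i.e., 10 k$/terabyte
```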
21. Reducing Data’s Cost-of-Ownership
Self-Managing data
Cost of ownership:
One admin/TB (100K$ vs 10K$)
Admin cost exceeds storage cost.
SQL 7:
Suggests indices
Migrates data away from end of file
Truncates file
Someday:
Automatically move files to balance disks
Online defragmentation & restructuring
Online physical redesign
22. Object Relational: The Next Great DBMS Wave
All DB vendors have added objects to DB
Microsoft is adding DBs to Objects
Integration with COM+
Gives user-defined types and objects
Plug-ins will be a billion-dollar industry
• Blades for the SQL Server razor
23. Why Is XML Important?
Self-describing data
Data stream in a typical interface…
“ABC47-Z”, “100”, “STL”, “C”, “3”, “28”
Same data stream in XML…
<INVENTORY>
<PART_NUM>ABC47-Z</PART_NUM>
<QUANTITY>100</QUANTITY>
<WAREHOUSE>STL</WAREHOUSE>
<ZONE>C</ZONE>
<AISLE>3</AISLE>
<BIN>28</BIN>
</INVENTORY>
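The contrast between the positional record and the self-describing XML can be shown mechanically. A small sketch (the field names are taken from the slide's XML):

```python
import xml.etree.ElementTree as ET

# The slide's flat record: position alone says what each field means.
record = ["ABC47-Z", "100", "STL", "C", "3", "28"]
fields = ["PART_NUM", "QUANTITY", "WAREHOUSE", "ZONE", "AISLE", "BIN"]

# The same record as self-describing XML: every value carries its name.
inv = ET.Element("INVENTORY")
for name, value in zip(fields, record):
    ET.SubElement(inv, name).text = value

xml_text = ET.tostring(inv, encoding="unicode")
print(xml_text)
```

A receiver of the XML form needs no out-of-band agreement about field order, which is what makes it suitable for loosely coupled B2B exchanges.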
24. [Demo: the same XML rendered through different stylesheets: table.xsl, bar.xsl, art.xsl]
25. XML Applications
Exposing Software as a “Service”
• Websites without UIs
• Exposed services with a common schema
• Integration points at the enterprise, value-chain, workgroup, desktop and intelligent gizmo “levels”
B2B value chains
• Uses XML to transmit a wide range of data to a broad set of stakeholders (regulatory agencies, suppliers, customers, etc.)
• Leverage for prior efforts like EDI
• BizTalk a key industry effort in this regard
26. XML: BizTalk Framework
[Diagram: XML messages and documents flow between an Order Processing service (MVS CICS, SAP R/3), Another Service (JD Edwards), browser/client apps and new form factors, through XML service interfaces; XML schemas come from the library at www.biztalk.org]
27. Outline
Summary of what you heard. (20 min)
The database scene in general. (10 min)
Scalability: Farms, Clones, Parts & Packs (10 min)
Microsoft DB research focus. (15 min)
• TerraServer (design and ops).
• RAGS.
• Data Mining
Q&A (15 min)
28. Terminology for Scalability
Farms of servers:
• Clones: identical (scalability + availability)
• Partitions: scalability
• Packs: partition availability via fail-over
[Diagram: a farm contains clones and partitions; partitions are grouped into packs]
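The clone/partition/pack distinction amounts to two different routing rules. A minimal sketch with invented server names: clones can serve any request, while partitioned data is routed by key and fails over within its pack:

```python
clones = ["web1", "web2", "web3"]            # identical; any clone can serve
partitions = ["db-a-m", "db-n-z"]            # each owns one slice of the data
packs = {"db-a-m": ["db-a-m-1", "db-a-m-2"], # pack members for fail-over
         "db-n-z": ["db-n-z-1", "db-n-z-2"]}

def route_stateless(request_id: int) -> str:
    """Clones: spray requests across identical servers (round-robin here)."""
    return clones[request_id % len(clones)]

def route_partitioned(key: str, alive: set) -> str:
    """Partitions: the key decides which partition owns the data.
    Packs: fail over to another pack member if the primary is down."""
    part = partitions[0] if key.lower() < "n" else partitions[1]
    for member in packs[part]:
        if member in alive:
            return member
    raise RuntimeError("whole pack for " + part + " is down")

print(route_stateless(7))                                   # web2
print(route_partitioned("gray", {"db-a-m-2", "db-n-z-1"}))  # db-a-m-2
```

Cloning scales stateless work by adding anonymous replicas; partitioning scales stateful data by splitting it, and packs restore availability that partitioning alone would lose.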
29. Unpredictable Growth
The TerraServer Story:
• We expected 5 M hits per day
• We got 50 M hits on day 1
• We peak at 15-20 M hpd on a “hot” day
• Average 5 M hpd after 1 year
Most of us cannot predict demand
• Must be able to deal with NO demand
• Must be able to deal with HUGE demand
30. An Architecture for Internet Services?
Need to be able to add capacity
• New processing
• New storage
• New networking
Need continuous service
• Online change of all components (hardware and software)
• Multiple service sites
• Multiple network providers
Need great development tools
• Change the application several times per year.
• Add new services several times per year.
31. Jim Gray, Research and Microsoft SQL Server
Premise: Each Site is a Farm
Buy computing by the slice (brick):
• Rack of servers + disks
Grow by adding slices:
• Spread data and computation to new slices
Two growth styles:
• Clones: anonymous servers
• Parts+Packs: partitions fail over within a pack
In both cases, a remote farm for disaster recovery
[Diagram: 1997 Microsoft.Com Farm. Clones of www.microsoft.com, home.microsoft.com, search, premium, register, activex, support, and msid.msn.com servers sit on four FDDI rings (MIS1-MIS4) behind primary and secondary Gigaswitches, with routers to the Internet over 13 DS3 links (45 Mb/sec each), 2 OC3 links, and 2 switched 100 Mb/sec Ethernets. Staging servers, live SQL Servers, and SQL consolidators run at MOSWest (admin LAN, SQLNet feeder LAN, Building 11); the European and Japan data centers replicate www, premium, search, SQL Servers, and FTP/HTTP download servers; FTP.microsoft.com and cdm.microsoft.com handle download replication; internal WWW and SQL reporting sit behind the DMZ staging servers.]
32. Jim Gray, Research and Microsoft SQL Server
Scaleable Systems
Scale UP and scale OUT: everyone does both.
The choice is:
• Size of a brick
• Clones or partitions
• Size of a pack
33. Jim Gray, Research and Microsoft SQL Server
Everyone scales out
What's the brick?
• 1 M$/slice: IBM S390? Sun E10000?
• 100 K$/slice: Wintel 8x
• 10 K$/slice: Wintel 4x
• 1 K$/slice: Wintel 1x
34. Jim Gray, Research and Microsoft SQL Server
Clones: Availability + Scalability
Some applications are:
• Read-mostly
• Low consistency requirements
• Modest storage requirement (less than 1 TB)
Examples:
• HTTP web servers (IP sprayer/sieve + replication)
• LDAP servers (replication via gossip)
• App/compute servers or firewalls
Replicate the app at all nodes (clones).
Spray requests across the nodes.
Fault tolerance: stop sending to a dead clone.
Growth: add a clone.
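The clone model above (spray requests across identical servers, route around dead clones, grow by adding a clone) can be sketched in a few lines. This is a minimal illustration; the class and method names are invented here, not any real load balancer's API:

```python
import random

class CloneFarm:
    """Minimal sketch of a farm of clones: identical, anonymous servers."""

    def __init__(self, clones):
        self.clones = set(clones)   # every clone can serve any request
        self.dead = set()

    def add_clone(self, clone):
        self.clones.add(clone)      # growth: just add a clone

    def mark_dead(self, clone):
        self.dead.add(clone)        # fault tolerance: stop sending to it

    def route(self):
        live = list(self.clones - self.dead)
        if not live:
            raise RuntimeError("no live clones")
        return random.choice(live)  # "spray": any live clone will do
```

The sprayer needs no per-request state, which is exactly why clones suit read-mostly, low-consistency workloads.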
35. Jim Gray, Research and Microsoft SQL Server
Facilities Clones Need
Automatic replication:
• Applications (and system software)
• Data
Automatic request routing:
• Spray or sieve
Management:
• Who is up?
• Update management & propagation
• Application monitoring
Clones are very easy to manage:
• Rule of thumb: hundreds of clones per admin
36. Jim Gray, Research and Microsoft SQL Server
Partitions for Scalability
Clones are not appropriate for some apps:
• Stateful apps do not replicate well
• High update rates do not replicate well
• Huge DBs (disk is too expensive to clone)
Examples:
• Email / chat / ...
• Databases
Partition state among servers.
Scalability (online):
• Partition split/merge
• Partitioning must be transparent to the client
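A minimal sketch of the partition model: range-partition keys across servers, and keep splits transparent to the client because every request is routed by key lookup. The names here are illustrative, not a real system's API:

```python
import bisect

class PartitionedStore:
    """Sketch of range partitioning with client-transparent online split."""

    def __init__(self):
        # one partition covering the whole key space, on one server
        self.bounds = []            # upper bounds of all but the last partition
        self.servers = ["node0"]    # servers[i] owns keys in partition i

    def lookup(self, key):
        # routing layer: find which partition (hence server) owns the key
        return self.servers[bisect.bisect_right(self.bounds, key)]

    def split(self, at_key, new_server):
        # online growth: split the partition containing at_key in two
        i = bisect.bisect_right(self.bounds, at_key)
        self.bounds.insert(i, at_key)
        self.servers.insert(i + 1, new_server)
```

Clients keep calling `lookup`; a split only changes the routing table, which is the transparency the slide asks for.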
37. Jim Gray, Research and Microsoft SQL Server
Partitioned (a.k.a. Clustered) Apps
Mail servers:
• Perfectly partitionable
Business object servers:
• Partition by set of objects
Parallel databases:
• Transparent access to partitioned tables
• Parallel query
38. Jim Gray, Research and Microsoft SQL Server
Packs for Availability
Each partition may fail (independently of the others).
Partitions migrate to a new node via fail-over:
• Fail-over in seconds
Pack: the nodes supporting a partition
• VMS Cluster
• Tandem process pair
• SP2 HACMP
• Sysplex™
• WinNT MSCS (Wolfpack)
"Cluster in a box" is now a commodity.
Partitions grow in packs.
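The pack idea (a small group of nodes that can all host a partition, with fail-over to a survivor when the primary dies) reduces to simple membership bookkeeping. This is an illustrative sketch, not MSCS's or any cluster manager's actual interface:

```python
class Pack:
    """Sketch of a pack: the nodes supporting one partition, with fail-over."""

    def __init__(self, nodes, partition):
        self.nodes = list(nodes)
        self.partition = partition
        self.primary = self.nodes[0]    # partition currently served here

    def fail(self, node):
        # a node dies; if it held the partition, fail over within the pack
        self.nodes.remove(node)
        if node == self.primary:
            if not self.nodes:
                raise RuntimeError("pack exhausted: partition unavailable")
            self.primary = self.nodes[0]
```

Availability comes from the pack, scalability from having many partitions, which is why the slide pairs the two.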
39. Jim Gray, Research and Microsoft SQL Server
What Parts+Packs Need
Automatic partitioning (in DBMS, mail, files, ...):
• Location transparent
• Partition split/merge
• Grow without limits (100x10 TB)
Simple failover model:
• Partition migration is transparent
• MSCS-like model for services
Application-centric request routing
Management:
• Who is up?
• Automatic partition management (split/merge)
• Application monitoring
40. Jim Gray, Research and Microsoft SQL Server
Services on Clones & Partitions
An application provides a set of services.
If cloned:
• Services are on a subset of the clones
If partitioned:
• Services run at each partition
System load balancing routes each request to:
• Any clone
• The correct partition
• And routes around failures
41. Jim Gray, Research and Microsoft SQL Server
Farm pairs: Always Up
Two farms; changes from one are sent to the other.
When one farm fails, the other provides service.
Masks:
• Hardware/software faults
• Operations tasks (reorganize, upgrade, move)
• Environmental faults (power fail)
42. Jim Gray, Research and Microsoft SQL Server
Availability for a simple web site
[Diagram: web clients → load balance → cloned front ends → web file store and SQL temp state → SQL database. Clones for availability at the front; packs for availability at the database.]
43. Jim Gray, Research and Microsoft SQL Server
Farm Scale Out Scenarios
[Diagram: The FARM: clones and packs of partitions. Web clients → load balance → cloned front ends (firewall, sprayer, web server) → cloned, packed file servers (Web File Store A/B with replication; SQL temp state) → packed partitions for database transparency (the SQL database split across SQL Partitions 1, 2, and 3).]
44. Jim Gray, Research and Microsoft SQL Server
Reliable, Scalable, Modular
[Diagram: clients → Network Load Balancing clones of IIS web servers (or other IP-based services) → Component Load Balancing (COM+) clones of application servers running COM+ components → a Cluster Service pack of data servers (SQL, Exchange, File).]
45. Jim Gray, Research and Microsoft SQL Server
Talk 2 (if there is time)
Terminology for scaleability
Farms of servers:
• Clones: identical; scaleability + availability
• Partitions: scaleability
• Packs: partition availability via fail-over
[Diagram: a Farm contains Clones, Partitions, and Packs]
46. Jim Gray, Research and Microsoft SQL Server
Scalability: COM+ progress
Serving 1,000-statement ASPs (servlets).
[Charts: SPS (servlets per second; ASPs served per second by IIS running a 1,000-statement VBScript), 0-450, on 1P/2P/4P/8P machines, for NT4 in-proc vs. Windows 2000 B1/RC1/RC2 in-proc, and for the same releases out-of-proc.]
Poor SMP scaleability on IIS4/NT4.
Big improvements from standard transaction-processing tricks.
Out-of-proc (safe execution) is now much faster than in-proc was on IIS4.
Shift from 4x200 MHz to 8x450 MHz.
47. Jim Gray, Research and Microsoft SQL Server
Scaleability: So, what about the death of NT/Alpha?
Two simultaneous Compaq TPC-C numbers:

Intel Profusion, NT/SQL/COM+: 550 MHz, 8 processors, 4 GB memory; 40,368 tpmC @ $18.46/tpmC; $745 K 5-year cost; available 12/31/99.
Alpha, Unix/Sybase/Tuxedo: 700 MHz, 8 processors, 16 GB memory; 42,437 tpmC @ $55.45/tpmC; $2.35 M 5-year cost; available 10/18/99.

200% more expense for 5% more performance?
48. Jim Gray, Research and Microsoft SQL Server
Outline
Summary of what you heard. (10 min)
The database scene in general. (10 min)
Scaleability: Farms, Clones, Parts & Packs (15 min)
Microsoft DB research focus. (15 min)
• TerraServer (design and ops).
• RAGS.
• Data Mining
Q&A (10 min)
49. Jim Gray, Research and Microsoft SQL Server
The TerraServer
http://www.terraserver.microsoft.com/
50. Jim Gray, Research and Microsoft SQL Server
Coverage: range from 70ºN to 70ºS
Today: 35% of the U.S., 1% outside the U.S.
Source imagery:
• 4 TB of 1 m/pixel aerial photos (USGS: 60,000 files, 46 MB B&W to 151 MB color IR)
• 1 TB of 1.56 m/pixel satellite imagery (SPIN-2: 2,400 files of 300 MB, B&W)
Display imagery: 200x200-pixel tiles, subsampled to build an image pyramid.
Store 5x-compressed data.
Nav tools:
• 1.5 M place names
• "Click-on" coverage map
• Expedia & Virtual Globe
• Pick of the week
Image pyramid: 1.6 x 1.6 km "city view"; 0.8 x 0.8 km 8 m thumbnail; 0.4 x 0.4 km browse; 200 x 200 m tile.
Concept: the user navigates an "almost seamless" image of Earth.
Database & application UI
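The pyramid construction above (subsample the base imagery to build coarser levels) can be sketched for a grayscale image held as a list of pixel rows. This is a toy: TerraServer's actual cutter also handled 200x200 tiling and compression, which this omits:

```python
def subsample(img):
    """Halve resolution by averaging each 2x2 pixel block (img: list of rows)."""
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
         for x in range(0, len(img[0]) - 1, 2)]
        for y in range(0, len(img) - 1, 2)
    ]

def pyramid(img, levels):
    """Build an image pyramid: full-resolution base plus coarser levels."""
    out = [img]
    for _ in range(levels - 1):
        img = subsample(img)
        out.append(img)
    return out
```

Each level holds a quarter of the pixels of the one below it, so the whole pyramid costs only about a third more storage than the base.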
51. Jim Gray, Research and Microsoft SQL Server
Software: Classic 3-Tier Design
[Diagram: web clients (a browser running HTML or the Java viewer) reach the TerraServer web site across the Internet and a firewall. The web tier runs Internet Information Server 5.0 with Active Server Pages (ADO), MTS, and the image server. The image delivery application uses Internet Information Server 4.0 and Microsoft Site Server EE. The data tier is the TerraServer DB on SQL Server 7, reached through TerraServer stored procedures; image provider site(s) feed the database.]
52. Jim Gray, Research and Microsoft SQL Server
Logical Schema
[Schema diagram: the Terra database on TerraServer, with tables grouped roughly as Imagery (SourceMeta, Image, ImageType, SearchImage, Search, Scale, Pyramid), Load (Job, LoadJob, LoadMgmt), External links (ExternalLink, ExternalGroup, ExternalGeo), Famous places (FamousCategory, FamousPlace), Gazetteer (CountryName, StateName, PlaceName, PlaceType, SmallPlaceName), and TerraAdmin (Admin).]
53. Jim Gray, Research and Microsoft SQL Server
TerraServer File Group Layout
• Convert 324 disks to 28 RAID5 sets, plus 28 spare drives
• Make 4 NT volumes (RAID 50), 595 GB per volume (E:, F:, G:, H:)
• Build 30 20-GB files on each volume
• The DB is a file group of 120 files
54. Jim Gray, Research and Microsoft SQL Server
Hardware
[Diagram: the Internet connects over DS3 through a 100 Mbps Ethernet switch to the web servers, site servers, SPIN-2 map server, and the database server.]
2.9 TB database server:
• AlphaServer 8400, 8x400 MHz
• 10 GB RAM
• 324 StorageWorks disks
• 10-drive tape library (STK TimberWolf DLT7000)
55. Jim Gray, Research and Microsoft SQL Server
Backup and Recovery
• STK 9710 tape robot
• SQL Server Backup & Restore
• Legato NetWorker
• Fast, incremental, differential, online
• Clocked at 80 MBps peak (~200 GB/hr)
Restore:
• Fast, incremental (file oriented), not online

Load & backup/recovery performance:
Data bytes backed up: 1.2 TB
Total time: 7.25 hours
Tapes consumed: 27
Tape drives: 10
Data throughput: 168 GB/hour
Average throughput per device: 16.8 GB/hour (4.97 MB/sec)
NTFS logical volumes: 2
56. Jim Gray, Research and Microsoft SQL Server
BAD OLD Load
[Diagram: DLT tape is read via "tar" into a Drop'N'DoJob staging area; a LoadMgr DB coordinates jobs over a 100 Mbit EtherSwitch. The AlphaServer 8400 fronts an Enterprise Storage Array of 3x108 9.1 GB drives; two AlphaServer 4100s (with an ESA of 60 4.3 GB drives) run the ImgCutter and NT backup to an STC DLT tape library. Pipeline steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place, ...]
57. Jim Gray, Research and Microsoft SQL Server
New Image Load and Update
[Diagram: DLT tape is read via "tar" by the image cutter; the TerraLoader merges and dithers tiles into the TerraServer SQL DBMS over ODBC transactions, and the image pyramid is built from the base. A metadata load DB and Active Server Pages drive the cut-and-load scheduling system.]
58. Jim Gray, Research and Microsoft SQL Server
TerraServer Daily Traffic
June 22, 1998 through June 22, 1999
[Chart: daily counts (0-30 M) of sessions, hits, page views, DB queries, and images.]
After a year:
• 1 TB of data
• 750 M records
• 2.3 billion hits
• 2.0 billion DB queries
• 1.7 billion images sent
• 368 million page views
• 99.93% DB availability
• 3rd design now online
• Built and operated by a team of 4 people
[Chart: total up time vs. down time in hours; down time (hours:minutes) split into scheduled, HW+software, and operations.]
59. Jim Gray, Research and Microsoft SQL Server
What TerraServer Shows
Can serve huge databases on the Internet for about a penny a page view, mostly phone bill (!).
Advertising pays more than a penny a page.
Commodity tools do scale fairly far.
A few people (3 developers, 1 operator) using power tools can build an impressive web site.
60. Jim Gray, Research and Microsoft SQL Server
Outline
Summary of what you heard. (20 min)
The database scene in general. (10 min)
Scaleability: Packs & Mobs (10 min)
Microsoft DB research focus. (15 min)
• TerraServer (design and ops).
• RAGS.
• Data Mining
Q&A (15 min)
61. Jim Gray, Research and Microsoft SQL Server
Automatic Testing
60% of Microsoft R&D is testing. What can research do to help?
• Beyond joining the 500,000 Win2K beta testers
Test generation robot (RAGS):
• Make up SQL queries
• Send them to SQL Server, Oracle, DB2, Informix, ...
• If the answers are the same, great; if not, there is a problem
Also good for stress tests.
Found MANY bugs in our products (all fixed).
Found MANY bugs in others' products.
Very valuable tool.
MSR-TR-98-21, "Massive Stochastic Testing of SQL", Don Slutz
http://research.microsoft.com/scripts/pubDB/pubsasp.asp?RecordID=175
[Table: per-vendor results for four systems W, X, Y, Z over 1,672 generated queries each; error counts 232, 234, 241, and 31 respectively, with further breakdown rows (1/1/1/1, 31/15/12/28, 1/12/5/116, 0/29/32/4, 18/18/19/25, 45/19/18/113). All four agree in 84% of cases; W, X, and Y agree in 95%. One discrepancy involved an intermediate table.]
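The RAGS idea can be sketched as differential testing: generate random queries, run them through two independent implementations, and flag any disagreement. In this toy sketch sqlite3 stands in for one vendor and a naive Python evaluator for another (RAGS itself compared SQL Server, Oracle, DB2, and Informix):

```python
import random
import sqlite3

# A tiny table of (a, b) rows to query against.
ROWS = [(i, i * i % 7) for i in range(50)]

def random_predicate(rng):
    """Make up a random WHERE clause, RAGS-style (vastly simplified)."""
    col = rng.choice(["a", "b"])
    op = rng.choice(["<", ">", "=", "<="])
    return f"{col} {op} {rng.randint(0, 10)}"

def run_test(seed=0, trials=100):
    """Run random queries through SQLite and a naive evaluator; count mismatches."""
    rng = random.Random(seed)
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    ops = {"<": lambda x, y: x < y, ">": lambda x, y: x > y,
           "=": lambda x, y: x == y, "<=": lambda x, y: x <= y}
    mismatches = 0
    for _ in range(trials):
        pred = random_predicate(rng)
        col, op, lit = pred.split()
        got = sorted(con.execute(f"SELECT a, b FROM t WHERE {pred}"))
        want = sorted(r for r in ROWS
                      if ops[op](r[0 if col == "a" else 1], int(lit)))
        if got != want:   # a disagreement means one side has a bug
            mismatches += 1
    return mismatches
```

The power of the approach is that no oracle is needed: any disagreement between implementations is a bug somewhere.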
62. Jim Gray, Research and Microsoft SQL Server
Some Tera-Byte Databases
[Scale: kilo, mega, giga, tera, peta, exa, zetta, yotta]
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 20 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of the planet each week): 15 PB by 2007
• Federal clearing house (images of checks): 15 PB by 2006 (7-year history)
• Nuclear Stockpile Stewardship Program: 10 exabytes (???!!)
63. Jim Gray, Research and Microsoft SQL Server
Data Mining
Find interesting structure (patterns, relationships) in data:
• Prediction
• Segmentation (clustering)
• Dependency modeling (find the distribution)
• Summarization
• Trend and change detection and modeling
Allow the user to state the query in terms of the business logic:
• The user does not speak statistics or SQL
Use data to build predictors:
• Regression, classification, segmentation, etc.
Generate summaries and reports for insight:
• Find "easy to describe" segments in data automatically
• Find segments not known to the analyst
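Segmentation as described above can be illustrated with a toy one-dimensional k-means clusterer. Real mining engines handle mixed attribute types, missing values, and scale, all of which this sketch ignores:

```python
def kmeans_1d(xs, k=2, iters=20):
    """Minimal 1-D k-means: split data into k segments around moving centers."""
    centers = sorted(xs[:k])                       # naive initialization
    for _ in range(iters):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for x in xs:
            i = min(range(k), key=lambda i: abs(x - centers[i]))
            groups[i].append(x)
        # move each center to the mean of its group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups
```

The segments that fall out ("customers near 1", "customers near 10") are exactly the kind of easy-to-describe clusters the slide mentions.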
64. Jim Gray, Research and Microsoft SQL Server
Data Mining:
Microsoft SiteServer Commerce 3.0
Intelligent cross-sell, based on:
• Historical sales baskets in stores
• Contents of the current shopper's basket
• Browsing behavior of the shopper
Predict: a ranking of the products in the store likely to be most interesting to the shopper.
http://www.holtoutlet.com/outlet4
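The cross-sell prediction can be sketched as basket co-occurrence scoring: rank products by how often they appeared alongside the current basket's items in historical baskets. This is an illustrative stand-in, not SiteServer's actual model:

```python
from collections import Counter

def cross_sell(history, basket, top=3):
    """Rank products not yet in the basket by co-occurrence with its items.

    history: list of past baskets (sets of product names)
    basket:  the current shopper's basket (a set)
    """
    score = Counter()
    for past in history:
        if basket & past:                 # past basket shares an item with ours
            for item in past - basket:    # credit everything bought alongside
                score[item] += 1
    return [item for item, _ in score.most_common(top)]
```

A production system would also weight by the shopper's browsing behavior, which the slide lists as a third input.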
65. Jim Gray, Research and Microsoft SQL Server
Mail to 25% and capture 40%: 400% improved response!
[Gains chart: x-axis "% mailed" running 0.1%, 0.2%, 0.3%, 0.6%, 1.3%, 5.3%, 6.7%, 25.5%, 34.5%, 43.8%, 56.9%, 68.5%, 94.8%, 98.5%, 100.0%; y-axis "% captured of true targets" from 0% to 100%. Real data drawn from a Microsoft marketing example.]
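The gains-chart arithmetic ("mail to 25%, capture 40% of the true targets") can be reproduced with a small helper: rank customers by model score, mail the top fraction, and count the share of true responders reached. The data below is synthetic, for illustration only:

```python
def gains(scores_and_labels, frac):
    """Share of true responders captured by mailing the top `frac` by score.

    scores_and_labels: list of (model_score, label) with label 1 = responder.
    """
    ranked = sorted(scores_and_labels, key=lambda sl: -sl[0])
    n_mail = int(len(ranked) * frac)
    hits = sum(label for _, label in ranked[:n_mail])
    total = sum(label for _, label in ranked)
    return hits / total if total else 0.0
```

Mailing a random 25% would capture 25% of responders in expectation; the gap between the model's curve and that diagonal is the lift.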
66. Jim Gray, Research and Microsoft SQL Server
How do people use www.microsoft.com?
[Diagram: user browsing data (100 M hits per day, 14 M users/week) feeds a data mining (clustering) engine, which produces X segments viewed through a cluster visualizer wizard.]
67. Jim Gray, Research and Microsoft SQL Server
68. Jim Gray, Research and Microsoft SQL Server
Outline
Summary of what you heard. (10 min)
The database scene in general. (10 min)
Scaleability: Farms, Clones, Parts & Packs (15 min)
Microsoft DB research focus. (15 min)
• TerraServer (design and ops).
• RAGS.
• Data Mining
Q&A (10 min)