This document provides an overview and comparison of relational and NoSQL databases. Relational databases use SQL and enforce strict schemas, while NoSQL databases are schema-flexible and span document, key-value, wide-column, and graph models. NoSQL databases are designed to scale horizontally across commodity servers, to keep performance comparatively steady as data volumes grow, and to support flexible query processing, often through MapReduce rather than SQL. Popular NoSQL databases include MongoDB, Cassandra, HBase, and Redis.
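To make the contrast concrete, here is a minimal sketch of the same record modeled both ways, using Python's built-in sqlite3 as a stand-in relational store; the table and field names are illustrative, not taken from the document:

    import sqlite3, json

    # Relational model: the schema is declared up front and every row must conform.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    db.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Ada", "ada@example.com"))

    # Document model: each record is self-describing and fields may vary per record,
    # which is how document stores such as MongoDB organize data.
    users = [
        {"name": "Ada", "email": "ada@example.com"},
        {"name": "Grace", "email": "grace@example.com", "tags": ["admin"]},  # extra field is fine
    ]

    print(db.execute("SELECT name, email FROM users").fetchall())
    print(json.dumps(users, indent=2))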
A Practical Look at the NOSQL and Big Data Hullabaloo (Andrew Brust)
(Slide excerpt: a graph-model example listing Martha Washington's "friend of" relationships to figures such as Thomas Jefferson, Benjamin Franklin, John Adams, James Madison, Alexander Hamilton, the Marquis de Lafayette, Baron von Steuben, Henry Knox, Nathanael Greene, the Comte de Rochambeau, John Jay, and John Marshall.)
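As a sketch of how a graph data model holds relationships like those in the slide, an adjacency list is enough for illustration (pure Python; the node names come from the slide itself):

    # One node per person, one edge per "friend of" relationship.
    # Graph databases index these edges so traversals avoid join tables.
    friends = {
        "Martha Washington": [
            "Thomas Jefferson", "Benjamin Franklin", "John Adams",
            "James Madison", "Alexander Hamilton", "Marquis de Lafayette",
        ],
    }

    # A one-hop traversal: everyone directly connected to the starting node.
    for person in friends["Martha Washington"]:
        print("Martha Washington -> friend of ->", person)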
This document summarizes a presentation by Andrew Brust at the SQL Server Live! Orlando 2012 conference about Microsoft's strategy for big data. The presentation introduces key concepts like Hadoop, MapReduce, HDFS, and Hive. It describes Microsoft's HDInsight platform for running Hadoop on Windows Azure and integrating Hadoop with SQL Server and business intelligence tools using approaches like the Hive ODBC driver and PowerPivot. The presentation argues that Microsoft's technologies are making big data stored in Hadoop accessible to more business users.
The document provides an overview and agenda for a presentation on SQL Server Denali business intelligence (BI) capabilities. Key points include:
- PowerPivot and Excel Services allow self-service BI through a familiar Excel interface while leveraging Analysis Services for storage and collaboration features.
- Analysis Services Tabular Mode is the server implementation of PowerPivot, supporting partitions, roles and other enterprise features.
- Project "Crescent" provides ad hoc reporting directly against PowerPivot and Analysis Services Tabular models through a browser-based, Excel-like interface in Silverlight.
- Master Data Services and Data Quality Services provide master data management and data cleansing capabilities to support better data quality for BI initiatives.
Hadoop and its Ecosystem Components in Action (Andrew Brust)
This document provides an overview of Andrew Brust's presentation on Hadoop and its ecosystem components. The presentation introduces key concepts like MapReduce, HDFS, Hive, Pig, HBase, Zookeeper and Mahout. It also provides instructions on setting up and using Hadoop on Amazon Elastic MapReduce and Microsoft Azure HDInsight. The document includes examples of commands for working with HDFS, MapReduce, Hive, Pig, HBase and Mahout.
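The deck's own commands are not reproduced here, but as a hedged sketch of the MapReduce model those components share, the canonical word count fits in a few lines of Python written in the Hadoop Streaming style (both phases read and write plain key/value lines):

    import sys
    from itertools import groupby

    def mapper(lines):
        # Map phase: emit (word, 1) for every word; the framework then
        # shuffles and sorts these pairs so equal keys arrive together.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        # Reduce phase: sum the counts for each distinct word.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Local simulation of the map -> shuffle -> reduce pipeline over stdin.
        for word, total in reducer(mapper(sys.stdin)):
            print(word, total, sep="\t")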
Tony Gibbs gave a presentation on Amazon Redshift covering its history, architecture, concepts, and parallelism. The presentation included details on Redshift's cluster architecture, node components, storage design, data distribution styles, and terminology. It also provided a deep dive on parallelism in Redshift, explaining how queries are compiled and executed through streams, segments, and steps to enable massively parallel processing across nodes.
In-Memory Database Systems for Big Data Management: SAP HANA Database (George Joseph)
SAP HANA is an in-memory database system that stores data in main memory rather than on disk for faster access. It uses a column-oriented approach to optimize analytical queries. SAP HANA can scale from small single-server installations to very large clusters and cloud deployments. Its massively parallel processing architecture and in-memory analytics capabilities enable real-time processing of large datasets.
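Why a column-oriented layout speeds up analytical queries can be sketched in a few lines: an aggregate over one field scans a single contiguous array instead of touching every field of every row (plain Python, purely illustrative):

    # Row store: each record keeps all of its fields together.
    rows = [
        {"id": 1, "region": "EU", "amount": 120},
        {"id": 2, "region": "US", "amount": 80},
        {"id": 3, "region": "EU", "amount": 200},
    ]

    # Column store: one array per field, so a query reads only what it needs.
    columns = {"id": [1, 2, 3], "region": ["EU", "US", "EU"], "amount": [120, 80, 200]}

    # SUM(amount): the row store visits every record; the column store scans one list
    # that also compresses far better, since values of one type sit side by side.
    print(sum(r["amount"] for r in rows))
    print(sum(columns["amount"]))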
SQL Server to Redshift Data Load Using SSIS (Marc Leinbach)
This article explains how to load data from SQL Server into an Amazon Redshift data warehouse using SSIS. The techniques outlined here also apply when extracting data from other relational sources (e.g., loading data from MySQL or Oracle to Redshift). It first covers the steps and challenges of loading data into Redshift, then simplifies the whole process using an SSIS task for Amazon Redshift data transfer.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
Relational databases vs Non-relational databases (James Serra)
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Deep dive into Clustered Columnstore structures with information on compression algorithms, compression types, locking and dictionaries, as well as the Batch Processing Mode.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
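One common form of the intelligent key design mentioned above is key salting: prefixing monotonically increasing row keys, such as timestamps, with a small hash-derived bucket so writes spread across regions instead of hot-spotting on the last one. A minimal sketch, with the bucket count chosen arbitrarily:

    import hashlib

    NUM_BUCKETS = 16  # illustrative; in practice sized to the region server count

    def salted_key(row_key: str) -> str:
        # Derive a stable bucket from the key itself so readers can
        # recompute the prefix instead of storing it separately.
        bucket = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % NUM_BUCKETS
        return f"{bucket:02d}-{row_key}"

    # Sequential timestamps now land in different regions rather than one hot one.
    for ts in ["20240601120000", "20240601120001", "20240601120002"]:
        print(salted_key(ts))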
This document provides an overview of a NoSQL Night event presented by Clarence J M Tauro from Couchbase. The presentation introduces NoSQL databases and discusses some of their advantages over relational databases, including scalability, availability, and partition tolerance. It covers key concepts like the CAP theorem and BASE properties. The document also provides details about Couchbase, a popular document-oriented NoSQL database, including its architecture, data model using JSON documents, and basic operations. Finally, it advertises Couchbase training courses for getting started and administration.
In this talk, Ian will talk about Amazon Redshift, a managed, petabyte-scale data warehouse, give an overview of integration with Amazon Elastic MapReduce, a managed Hadoop environment, and cover some exciting new developments in the analytics space.
This document provides an overview and update on Amazon Aurora, Amazon's relational database service. It discusses new performance enhancements including improved read performance through caching, NUMA-aware scheduling, and lock compression to reduce contention. New availability features are also summarized, such as automatic repair and replacement of failed database nodes and storage volumes that can grow to 64TB. The document outlines Aurora's architecture advantages over traditional databases for scaling in the cloud through its distributed, self-healing design.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
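For flavor, a hedged pymongo sketch of the sharding basics: sharding is enabled per database and per collection through admin commands, and the shard key controls how documents spread across shards. The connection string, database, and key are placeholders, and the commands assume a sharded cluster behind a mongos router:

    from pymongo import MongoClient

    # Placeholder address; in a sharded deployment this points at a mongos router.
    client = MongoClient("mongodb://localhost:27017")

    # Enable sharding for the database, then shard one collection on a hashed key,
    # which spreads writes evenly at the cost of ordered range queries on that key.
    client.admin.command("enableSharding", "appdb")
    client.admin.command("shardCollection", "appdb.events", key={"user_id": "hashed"})

    # Reads and writes then go through the router unchanged.
    client.appdb.events.insert_one({"user_id": 42, "action": "login"})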
Jeremy Beard, a senior solutions architect at Cloudera, introduces Kudu, a new column-oriented storage system for Apache Hadoop designed for fast analytics on fast changing data. Kudu is meant to fill gaps in HDFS and HBase by providing efficient scanning, finding and writing capabilities simultaneously. It uses a relational data model with ACID transactions and integrates with common Hadoop tools like Impala, Spark and MapReduce. Kudu aims to simplify real-time analytics use cases by allowing data to be directly updated without complex ETL processes.
This document provides an introduction to NoSQL and Cassandra. It begins with an introduction of the presenter and an overview of what will be covered. It then discusses the history of databases and why alternatives to relational databases were needed to address challenges of scaling to internet-level data volumes, varieties, and velocities. It introduces key NoSQL concepts like CAP theorem, BASE, and the different types of NoSQL databases before focusing on Cassandra. The document summarizes Cassandra's origins, capabilities, data model involving column families and super column families, and architecture.
(1) Amazon Redshift is a fully managed data warehousing service in the cloud that makes it simple and cost-effective to analyze large amounts of data across petabytes of structured and semi-structured data. (2) It provides fast query performance by using massively parallel processing and columnar storage techniques. (3) Customers like NTT Docomo, Nasdaq, and Amazon have been able to analyze petabytes of data faster and at a lower cost using Amazon Redshift compared to their previous on-premises solutions.
This document discusses the limitations of relational databases for modern applications and real-time architectures. It describes how NoSQL databases like Aerospike can provide better performance and scalability. Specific examples are given of how Aerospike has been used to power applications in domains like advertising technology, social media, travel portals, and financial services that require high throughput, low latency access to large datasets.
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014 (Amazon Web Services)
In addition to running databases in Amazon EC2, AWS customers can choose among a variety of managed database services. These services save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We explain the fundamentals of Amazon DynamoDB, a fully managed NoSQL database service; Amazon RDS, a relational database service in the cloud; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be surprisingly economical. We'll cover how each service might help support your application, how much each service costs, and how to get started.
This is an introduction to relational and non-relational databases and how their performance affects scaling a web application.
This is a recording of a guest Lecture I gave at the University of Texas school of Information.
In this talk I address the technologies and tools Gowalla (gowalla.com) uses, including memcache, redis and cassandra.
Find more on my blog:
http://schneems.com
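As a sketch of the role memcache and redis play in a stack like this, here is the classic cache-aside pattern with the redis-py client; the host, key naming, and the stubbed database call are all placeholders:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)  # placeholder connection

    def fetch_user_from_db(user_id):
        # Stand-in for a slow relational query.
        return {"id": user_id, "name": "example"}

    def get_user(user_id, ttl=300):
        key = f"user:{user_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)         # cache hit: skip the database
        user = fetch_user_from_db(user_id)    # cache miss: load from the source
        r.set(key, json.dumps(user), ex=ttl)  # expire so stale entries age out
        return user

    print(get_user(1))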
This presentation can help you to apply partitioning when appropriate, and to avoid problems when using it. The one-liner is: Simple Works Best. The illustrating demos are on Postgres 12 (maybe 13 by the time of presenting) and show some of the problems and solutions that partitioning can provide. Some of this “experience” is quite old and the demo runs near-identical on Oracle…
These problems are the same on any database.
1) Hive is a data warehouse infrastructure built on top of Hadoop for querying large datasets stored in HDFS. It provides SQL-like capabilities to analyze data and supports complex queries using a MapReduce execution engine.
2) Hive compiles SQL queries into a directed acyclic graph (DAG) of MapReduce jobs that are executed by Hadoop. The metadata is stored in a metastore (typically an RDBMS).
3) Hive supports advanced features like partitioning, bucketing, ACID transactions, and complex types. It can handle petabyte-scale datasets and integrates with the Hadoop ecosystem but has limitations for low-latency queries.
Redshift is Amazon's cloud data warehousing service that allows users to interact with S3 storage and EC2 compute. It uses a columnar data structure and zone maps to optimize analytic queries. Data is distributed across nodes using either an even or keyed approach. Sort keys and queries are optimized using statistics from ANALYZE operations while VACUUM reclaims space. Security, monitoring, and backups are managed natively with Redshift.
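A hedged sketch of what those choices look like in DDL: a distribution key and sort key declared at table creation, issued here through psycopg2 since Redshift speaks the PostgreSQL wire protocol. The cluster address, credentials, and schema are illustrative:

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="admin", password="...")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE sales (
            sale_id   BIGINT,
            customer  BIGINT,
            sold_at   TIMESTAMP,
            amount    DECIMAL(12,2)
        )
        DISTKEY (customer)   -- co-locate each customer's rows for joins
        SORTKEY (sold_at)    -- range scans skip blocks via zone maps
    """)
    cur.execute("ANALYZE sales;")  # refresh planner statistics after loading
    conn.commit()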
Introduction to Sqoop - Aaron Kimball, Cloudera - Hadoop User Group UK (Skills Matter)
In this talk of Hadoop User Group UK meeting, Aaron Kimball from Cloudera introduces Sqoop, the open source SQL-to-Hadoop tool. Sqoop helps users perform efficient imports of data from RDBMS sources to Hadoop's distributed file system, where it can be processed in concert with other data sources. Sqoop also allows users to export Hadoop-generated results back to an RDBMS for use with other data pipelines.
After this session, users will understand how databases and Hadoop fit together, and how to use Sqoop to move data between these systems. The talk will provide suggestions for best practices when integrating Sqoop and Hadoop in your data processing pipelines. We'll also cover some deeper technical details of Sqoop's architecture, and take a look at some upcoming aspects of Sqoop's development roadmap.
AWS July Webinar Series: Amazon Redshift Optimizing Performance (Amazon Web Services)
This document provides an overview and best practices for optimizing performance on Amazon Redshift. It discusses topics like data distribution, sort keys, compression, loading data efficiently, vacuum operations, and query processing. The webinar agenda covers architecture, distribution styles, sort keys, compression, workload management and more. Examples are provided to demonstrate how different techniques can significantly improve query performance. Administrative scripts and views are also recommended as helpful tools.
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther... (DataStax Academy)
The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, DataStax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A real-time financial application will run for the audience, followed by a detailed look at how the application was built. An overview of DataStax Enterprise Solr features will be given, showing how the many enhancements in DSE make it unique in the marketplace.
This document discusses connecting Oracle Analytics Cloud (OAC) Essbase data to Microsoft Power BI. It provides an overview of Power BI and OAC, describes various methods for connecting the two including using a REST API and exporting data to Excel or CSV files, and demonstrates some visualization capabilities in Power BI including trends over time. Key lessons learned are that data can be accessed across tools through various connections, analytics concepts are often similar between tools, and while partnerships exist between Microsoft and Oracle, integration between specific products like Power BI and OAC is still limited.
This document provides an overview of Backbone.js, a lightweight JavaScript library that adds structure to client-side code. It discusses that Backbone.js is commonly used to create single-page applications and explains some of its key features and components. Models contain data and logic, views handle presentation, and collections manage sets of models. It also touches on events, listening to events, and Backbone's dependencies on other libraries like Underscore.js.
1. The document proposes a fully distributed, peer-to-peer architecture for web crawling. The goal is to provide an efficient, decentralized system for crawling, indexing, caching and querying web pages.
2. A traditional web crawler recursively visits web pages, extracts URLs, parses pages for keywords, and visits extracted URLs. The proposed system follows this process but with a distributed, peer-to-peer architecture without a central server.
3. Each peer node includes components for crawling, indexing, and storing a local database. Peers communicate through an overlay network to distribute URLs, indexes, and search queries/results across the system in a decentralized manner.
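The crawl loop it describes is the classic worklist algorithm; here is a minimal single-node sketch (seed URL, politeness, and robots.txt handling are simplified away, and the regex link extraction is deliberately crude):

    import re
    import urllib.request
    from collections import deque

    def crawl(seed, limit=10):
        seen, queue = {seed}, deque([seed])
        while queue and len(seen) <= limit:
            url = queue.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable page: skip it
            print("crawled:", url)
            # Extract absolute links; the peer-to-peer design would hand these
            # URLs to other peers over the overlay network instead of one queue.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)

    crawl("https://example.com")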
iChresemo Technologies is a professional software training institute. We deliver classroom, online, and corporate training for students. We specialize in Hadoop, PHP, Java, and Oracle Hyperion courses.
The document discusses Google's search engine algorithm updates including Hummingbird, Pigeon, Panda, and Penguin. It provides details on what each update aims to accomplish and implications for businesses. Key points covered include Hummingbird focusing on search query meaning over individual words, Pigeon prioritizing larger websites for local search, Panda favoring high-quality pages with original content, and Penguin penalizing sites with bad link profiles and keyword-stuffed pages. The document advises businesses to focus on useful content, natural links, and avoiding over-optimization in light of the algorithm changes.
Presented at the Boston Code Camp 19.
Demo app, CallButler can be found at http://ushag-backbonejs.site44.com/CallButler and the source code at https://github.com/ushag/call-butler
The document provides an introduction to NOSQL databases. It begins with basic concepts of databases and DBMS. It then discusses SQL and relational databases. The main part of the document defines NOSQL and explains why NOSQL databases were developed as an alternative to relational databases for handling large datasets. It provides examples of popular NOSQL databases like MongoDB, Cassandra, HBase, and CouchDB and describes their key features and use cases.
The document provides an introduction and agenda for an HBase presentation. It begins with an overview of HBase and discusses why relational databases are not scalable for big data through examples of a growing website. It then introduces concepts of HBase including its column-oriented design and architecture. The document concludes with hands-on examples of installing HBase and performing basic operations through the HBase shell.
This document presents an introduction to NoSQL databases. It begins with an overview comparing SQL and NoSQL databases, describing the architecture of NoSQL databases. Examples of different types of NoSQL databases are provided, including key-value stores, column family stores, document databases and graph databases. MapReduce programming is also introduced. Popular NoSQL databases like Cassandra, MongoDB, HBase, and CouchDB are described. The document concludes that NoSQL is well-suited for large, highly distributed data problems.
AWS Certified Cloud Practitioner Course S11-S17 (Neal Davis)
This deck contains the slides from our AWS Certified Cloud Practitioner video course. It covers:
Section 11 Databases and Analytics
Section 12 Management and Governance
Section 13 AWS Cloud Security and Identity
Section 14 Architecting for the Cloud
Section 15 Accounts, Billing and Support
Section 16 Migration, Machine Learning and More
Section 17 Exam Preparation and Tips
Full course can be found here: https://digitalcloud.training/courses/aws-certified-cloud-practitioner-video-course/
The document summarizes a meetup about NoSQL databases hosted by AWS in Sydney in 2012. It includes an agenda with presentations on Introduction to NoSQL and using EMR and DynamoDB. NoSQL is introduced as a class of databases that don't use SQL as the primary query language and are focused on scalability, availability and handling large volumes of data in real-time. Common NoSQL databases mentioned include DynamoDB, BigTable and document databases.
The document discusses the history and concepts of NoSQL databases. It notes that traditional single-processor relational database management systems (RDBMS) struggled to handle the increasing volume, velocity, variability, and agility of data due to various limitations. This led engineers to explore scaled-out solutions using multiple processors and NoSQL databases, which embrace concepts like horizontal scaling, schema flexibility, and high performance on commodity hardware. Popular NoSQL database models include key-value stores, column-oriented databases, document stores, and graph databases.
Databases in the Cloud discusses AWS database services for moving workloads to the cloud. It describes Amazon Relational Database Service (RDS) which provides several fully managed relational database options including MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Amazon Aurora. It also discusses non-relational database services like DynamoDB, ElastiCache, and Redshift for analytics workloads. The document provides guidance on choosing between SQL and NoSQL databases and discusses benefits of managed database services over hosting databases on-premises or in EC2 instances.
This document discusses NoSQL databases and when they should be used. It describes what NoSQL databases are, when to consider using one over a relational database, and introduces DynamoDB as an AWS NoSQL solution. Specific topics covered include the differences between relational and NoSQL data models, common use cases for NoSQL databases, and how to access and query DynamoDB tables.
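A hedged boto3 sketch of accessing and querying a DynamoDB table; the region, table name, and key schema are assumptions for illustration, and AWS credentials are assumed to be configured:

    import boto3
    from boto3.dynamodb.conditions import Key

    # Assumes a table "orders" with partition key "customer_id"
    # and sort key "order_date".
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("orders")

    table.put_item(Item={"customer_id": "c-42", "order_date": "2024-06-01", "total": 99})

    # Query by partition key: DynamoDB fetches matching items directly,
    # without the full-table scan a filter expression would imply.
    resp = table.query(KeyConditionExpression=Key("customer_id").eq("c-42"))
    for item in resp["Items"]:
        print(item)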
The document discusses database choices and provides an overview of different types of databases including relational, NoSQL, and Hadoop databases. It compares features of relational databases versus Hadoop/MapReduce and provides demos of various database options like AWS DynamoDB, MongoDB, Neo4j, SQL Server, and AWS Redshift. The document aims to help readers understand the different database choices available and which types may be best suited to different types of data and use cases.
If NoSQL is your answer, you are probably asking the wrong question. (Lukas Smith)
This session is not about bad-mouthing MongoDB, CouchDB, big data, MapReduce or any of the other more recent additions to the database buzzword bingo. Instead it is about looking at how NoSQL is a confusing term and a more realistic assessment of how old and new approaches in databases impact today's architectures...
This document provides an overview of NoSQL databases, including why they are used, common types, and how they work. The key points are:
1) SQL databases do not scale well for large amounts of distributed data, while NoSQL databases are designed for horizontal scaling across servers and partitions.
2) Common types of NoSQL databases include document, key-value, graph, and wide-column stores, each with different data models and query approaches.
3) NoSQL databases sacrifice consistency guarantees and complex queries for horizontal scalability and high availability. Eventual consistency is common, with different consistency models for different use cases.
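The consistency trade-off in point 3 is often made tunable through quorums: with N replicas, writing to W and reading from R guarantees the read set overlaps the write set whenever R + W > N. A toy sketch of that arithmetic:

    N, W, R = 3, 2, 2  # replicas, write quorum, read quorum (R + W > N)

    replicas = [{"value": None, "version": 0} for _ in range(N)]

    def write(value, version):
        # A quorum write waits for only W replicas; the rest catch up later.
        for replica in replicas[:W]:
            replica.update(value=value, version=version)

    def read():
        # Read a different subset of R replicas; because R + W > N it must
        # intersect the write set, so the highest version is the latest value.
        return max(replicas[N - R:], key=lambda rep: rep["version"])["value"]

    write("v1", version=1)
    print(read())  # "v1", even though one replica is still stale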
Nashville Analytics Summit Aug 9: NoSQL (Mike King, Dell, v1.5)
Here are potential solutions for the NoSQL use cases with explanations:
UC1 - Email: MongoDB. Email data is document-oriented with a well-defined schema. MongoDB would allow flexible schema and easy querying of email metadata and contents.
UC2 - Twitter: Cassandra. Twitter data arrives continuously in JSON format and relationships are many-to-many. Cassandra is a good fit as a wide column store for high write throughput and querying across hashtags and users.
UC3 - Householding: HBase. Large volumes of batch data from different sources and formats need to be joined on customer keys with no strict schema. HBase is optimized for random read/write access to structured and semi-structured data at scale.
NoSQL is not a buzzword anymore. The array of non-relational technologies has found wide-scale adoption even in non-Internet scale focus areas. With the advent of the Cloud... the churn has increased even more, yet there is no crystal clear guidance on adoption techniques and architectural choices surrounding the plethora of options available. This session initiates you into the whys & wherefores, architectural patterns, caveats and techniques that will augment your decision making process & boost your perception of architecting scalable, fault-tolerant & distributed solutions.
This document discusses relational database management systems (RDBMS) and NoSQL databases. It notes that while SQL is useful for flat data, it does not scale well for large, unstructured, distributed data. The CAP theorem is discussed, noting that databases must sacrifice availability, consistency, or partition tolerance. Several categories of NoSQL databases are described, including document, graph, columnar, and key-value stores. Factors like scalability, transactions, data modeling, querying and access are compared between SQL and NoSQL options. The performance of different databases is evaluated for read-write workloads. The future of polyglot persistence using multiple database technologies is envisioned.
The document provides an introduction and overview of NoSQL databases. It discusses why NoSQL databases were created, the different categories of NoSQL databases including column stores, document stores, and key-value stores. It also provides an overview of Hadoop, describing it as a framework that allows distributed processing of large datasets across computer clusters.
SQL vs NoSQL database differences explained (Satya Pal)
This document compares SQL and NoSQL databases. It outlines key differences between the two types of databases such as their data structures (tables vs documents/key-value pairs), schemas (strict vs dynamic), scalability (vertical vs horizontal), and query languages (SQL vs unstructured). Examples of popular SQL databases discussed are MySQL, MS-SQL Server, and Oracle. Examples of NoSQL databases discussed are MongoDB, CouchDB, and Redis. The document provides an overview of each example database's features and benefits.
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services (Amazon Web Services)
In this session, we discuss the benefits of NoSQL databases and take a tour of the main NoSQL services offered by AWS—Amazon DynamoDB and Amazon ElastiCache. Then, we hear from two leading customers, Expedia and Mapbox, about their use cases and architectural challenges, and how they addressed them using AWS NoSQL services, including design patterns and best practices. You will walk out of this session having a better understanding of NoSQL and its powerful capabilities, ready to tackle your database challenges with confidence.
The document provides an agenda for a two-day training on NoSQL and MongoDB. Day 1 covers an introduction to NoSQL concepts like distributed and decentralized databases, CAP theorem, and different types of NoSQL databases including key-value, column-oriented, and document-oriented databases. It also covers functions and indexing in MongoDB. Day 2 focuses on specific MongoDB topics like aggregation framework, sharding, queries, schema-less design, and indexing.
PASS Chapter Meeting Dec 2013 - Compression: a hidden gem for IO-heavy databases (Charley Hanania)
Compression: a hidden Gem for IO heavy Databases
The limiting factor in most database systems is the ability to read and write data to the IO subsystem.
We're still using storage layouts and methodologies in SQL Server that are a reflection of old spinning media in times gone by.
Until major changes are made to the internal storage layouts, we have "some" hope with options such as data compression, sparse columns and filtered indexes, which not only save space on disk, but also reflect a saving in memory.
In this session we will go over the IO-saving technologies available in SQL Server, and discuss how implementing some of these will assist in your operational performance goals.
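As a hedged illustration of the data compression option discussed, the T-SQL is a one-line table rebuild; here it is issued from Python via pyodbc, with the connection string and table name as placeholders:

    import pyodbc

    conn = pyodbc.connect("DSN=sqlserver;UID=user;PWD=...")  # placeholder DSN
    cur = conn.cursor()

    # Estimate the savings first with the built-in procedure, then rebuild.
    cur.execute("EXEC sp_estimate_data_compression_savings "
                "'dbo', 'SalesHistory', NULL, NULL, 'PAGE';")

    # PAGE compression subsumes ROW compression and usually saves the most IO;
    # the trade-off is extra CPU whenever pages are read or written.
    cur.execute("ALTER TABLE dbo.SalesHistory REBUILD "
                "WITH (DATA_COMPRESSION = PAGE);")
    conn.commit()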
Presenter: Charley Hanania, MVP
Charley is Principal Consultant at QS2 AG in Switzerland and has consulted to organisations of all sizes during his extensive career in Database and Platform Consulting.
He has focused on SQL Server since v4.2 on OS/2, and with over 15 years of experience in IT he has supported companies in the areas of DB training, development, architecture & administration throughout Europe, America & Australasia.
Communities are Charley's passion: he became active in database communities in the mid-90s, participating in heterogeneous database user groups in Australia. He continues to play an active role through community events such as Database Days, the European PASS Conference, PASS & the Swiss PASS Chapter.
Ocean Lotus Threat Actors project by John Sitima 2024 (1).pptx (SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
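The deck's notebooks are not reproduced here; as a minimal flavor of the kind of detector such a pipeline feeds, here is a rolling z-score check over a sensor series (window size and threshold are arbitrary):

    from statistics import mean, stdev

    def zscore_anomalies(series, window=20, threshold=3.0):
        # Flag points more than `threshold` standard deviations away from
        # the mean of the preceding window: a common edge-friendly baseline.
        anomalies = []
        for i in range(window, len(series)):
            history = series[i - window:i]
            mu, sigma = mean(history), stdev(history)
            if sigma and abs(series[i] - mu) / sigma > threshold:
                anomalies.append((i, series[i]))
        return anomalies

    readings = [1.0, 1.1, 0.9] * 10 + [9.0] + [1.0] * 5  # one obvious spike
    print(zscore_anomalies(readings))  # -> [(30, 9.0)]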
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Trusted Execution Environment for Decentralized Process Mining (LucaBarbaro3)
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
HCL Notes and Domino license cost reduction in the world of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help you do it!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some approaches that can lead to unnecessary spending, for example when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder bring this new world closer to you. It gives you the tools and the know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to put into action immediately
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Letter and Document Automation for Bonterra Impact Management (fka Social Solutions Apricot) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed.pdf (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Digital Marketing Trends in 2024 | Guide for Staying Ahead (Wask)
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
2. RELATIONAL
SQL
ACID
Relational algebra
Tables, Columns, Rows
Metadata separate from data
Normalized data
Optimized storage
Optimal for ad-hoc queries
Sharding can be difficult
3. POPULAR RDBMS
MySQL
SQL Server
Oracle
Postgres
DB2
Interbase, Firebird
Informix
Progress
Pervasive
Sybase
Access
…
4. SQL
Unified language to create and query both data and metadata
Similar to English
Verbose(!)
Can get complex for non-trivial queries
Does not expose the execution plan – you say what you want returned, not how to get it
5. SQL EXAMPLES
If you can say what you mean, you can query the existing data
Results are near-instant when querying based on primary key
select * from valute where id=1 and sid=42
Results are fast when querying based on non-unique index
select valuta from valute where (id=1 and sid=42) and (valute.firma_id=123 and valute.firma__sid=1)
Very readable for trivial queries
select r.customer, sum(rs.iznos) sveukupno from racuni r
join racuni_stavke rs on r.id=rs.racun_id
where r.id=5
group by r.customer
6. SQL EXAMPLES
Not so readable for non-trivial queries
select "MP" tip_prometa, mprac.broj broj_racuna, mprac_stavke.kolicina kolicina, (mprac.tecaj*mprac_stavke.kolicina*mprac_stavke.rabat_iznos)
rabat_iznos, (round(mprac_stavke.cijena - mprac_stavke.rabat_iznos - mprac_stavke.rabat2_iznos - mprac_stavke.rabat3_iznos - mprac_stavke.porez1 -
mprac_stavke.porez2 - mprac_stavke.porez_potrosnja,6)*mprac_stavke.kolicina) iznos, (mprac_stavke.kolicina* ifnull((select
sum(pn_cijena*kolicina)/sum(kolicina) from mprac_skl left join skl_stavke on mprac_skl.skl_id=skl_stavke.skl_id and
mprac_skl.skl__sid=skl_stavke.skl__sid where mprac_skl.mprac_id=mprac.id and mprac_skl.mprac__sid=mprac.sid and
skl_stavke.artikl_id=mprac_stavke.artikl_id and skl_stavke.artikl__sid=mprac_stavke.artikl__sid ),0) ) iznos_nabavno, ifnull( (select
sum(mprac_stavke.kolicina*ambalaze.naknada_kom) from artikli_ambalaze left join ambalaze on ambalaze.id=artikli_ambalaze.ambalaza_id and
ambalaze.sid=artikli_ambalaze.ambalaza__sid where artikli_ambalaze.artikl_id=artikli.id and artikli_ambalaze.artikl__sid=artikli.sid and
ambalaze.kalkulacija="N" ),0) naknada, radnici_komercijalisti.ime racun_komercijalist_ime, (select naziv from skladista where skladista.tip_skladista="M"
and pj_id=mprac.pj_id limit 1) skladiste_naziv , pj.naziv pj_naziv, mprac.datum,
cast(concat("(",if(DayOfWeek(mprac.datum)=1,7,DayOfWeek(mprac.datum)-1),") ", if(DayOfWeek(mprac.datum)=1,"1 Nedjelja",
if(DayOfWeek(mprac.datum)=2,"2 Ponedjeljak", if(DayOfWeek(mprac.datum)=3,"3 Utorak", if(DayOfWeek(mprac.datum)=4,"4 Srijeda",
if(DayOfWeek(mprac.datum)=5,"5 Èetvratk", if(DayOfWeek(mprac.datum)=6,"6 Petak", if(DayOfWeek(mprac.datum)=7,"7 Subota","")))))))) as char(15))
dan_u_tjednu, cast(month(mprac.datum) as unsigned) mjesec, cast(week(mprac.datum) as unsigned) tjedan, cast(quarter(mprac.datum) as unsigned) kvartal,
cast(year(mprac.datum) as unsigned) godina, cast(if(tipovi_komitenata.tip="F",trim(concat(partneri.ime," ",partneri.prezime)),partneri.naziv) as char(200))
kupac_naziv, partneri_mjesta.postanski_broj kupac_mjesto, partneri_mjesta.mjesto kupac_mjesto_naziv, partneri_grupe_mjesta.naziv …
7. RDBMS SCALING
Vertical scaling
• Better CPU, more CPUs
• More RAM
• More disks
• SAN
Partitioning
Sharding
8. PARTITIONING
With many rows and heavy usage, partitioning is a must
What to partition
• Tables
• Indexes
• Views
Typical cases
• Monthly data
• Alphabetical keys
9. RDBMS SHARDING
Sharding means using several databases, each holding part of the data (500 clients on one server, another 500 on another)
Requires changing application code (see the sketch below)
connect(calculate_server_from(sharding_key))
It is impossible to join data from different databases, so choose your sharding key wisely
Very difficult to repartition your databases based on a new key
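A minimal Python sketch of what the connect(calculate_server_from(...)) call above might look like; the shard list, key format, and hash-modulo scheme are illustrative assumptions, not any particular library's API.

import hashlib

# Hypothetical shard map: one connection string per database server.
SHARDS = ["db1.example.com", "db2.example.com", "db3.example.com"]

def calculate_server_from(sharding_key):
    # Stable hash of the key, taken modulo the number of shards.
    digest = hashlib.md5(sharding_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def connect(server):
    # Stand-in for a real driver call, e.g. MySQLdb.connect(host=server).
    print("connecting to", server)

connect(calculate_server_from("client:1042"))

The modulo also makes the last point concrete: change the number of shards (or the key), and almost every key suddenly maps to a different server.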
10. RDBMS METADATA
Metadata: data describing other data
RDBMS structures are explicitly defined, and each data type is optimized for storage
Lots of constraints
Can get slow with a lot of data
11. NOSQL
“Not SQL”, “Not only SQL”
Core NoSQL databases were invented mostly because RDBMSs made life very hard for huge, heavy-traffic web databases
NoSQL databases are those that differ significantly from relational databases
12. NOSQL TYPES
Wide Column Store / Column Families
Document Store
Key Value / Tuple Store
Graph Databases
Object Databases
XML Databases
Multivalue Databases
14. KEY/VALUE STORES
Lineage: Amazon's Dynamo paper and Distributed HashTables.
Data model: A global collection of key-value pairs.
Example: Voldemort, Dynomite, Tokyo Cabinet
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
15. BIGTABLE CLONES
Lineage: Google's BigTable paper.
Data model: Column family, i.e. a tabular model where each row, at least in theory, can have an individual configuration of columns.
Example: HBase, Hypertable, Cassandra
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
16. DOCUMENT DATABASES
Lineage: Inspired by Lotus Notes.
Data model: Collections of documents, where each document is a key-value collection.
Example: CouchDB, MongoDB, Riak
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
17. GRAPH DATABASES
Lineage: Draws from Euler and graph theory.
Data model: Nodes & relationships, both which can hold key-value
pairs
Example: AllegroGraph, InfoGrid, Neo4j
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
19. NOSQL CHARACTERISTICS
Almost infinite horizontal scaling
Very fast
Performance doesn’t deteriorate with growth (much)
No fixed table schemas
No join operations
Ad-hoc queries difficult or impossible
Structured storage
Almost everything happens in RAM
20. REAL-WORLD USE
Cassandra
• Facebook (original developer, used it till late 2010)
• Twitter
• Digg
• Reddit
• Rackspace
• Cisco
BigTable
• Google (open-source version is HBase)
MongoDB
• Foursquare
• Craigslist
• Bit.ly
• SourceForge
• GitHub
21. WHY NOSQL?
Handles huge databases (I know, I said it before)
Redundancy, data is pretty safe on commodity hardware
Super flexible queries using map/reduce
Rapid development (no fixed schema, yeah!)
Very fast for common use cases
22. PERFORMANCE
RDBMSs use buffering and write-ahead logging to ensure ACID properties
NoSQL does not guarantee ACID and is therefore much faster
We don’t need ACID everywhere!
I used MySQL and switched to MongoDB for my analytics app
• Data processing (every minute) is 4x faster with MongoDB, despite being a lot more detailed (thanks to much simpler development)
23. SCALING
Simple web application with not much traffic
• Application server, database server all on one machine
25. SCALING
Even more traffic comes in
• Load balancer
• Application server x2
• Database server
26. SCALING
Even more traffic comes in
• Load balancer x N
• easy
• Application server x N
• easy
• Database server x N
• hard for SQL databases
27. SQL SLOWDOWN
Not linear!
http://www.slideshare.net/rightscale/scaling-sql-and-nosql-databases-in-the-cloud
28. NOSQL SCALING
Need more storage?
• Add more servers!
Need higher performance?
• Add more servers!
Need better reliability?
• Add more servers!
29. SCALING SUMMARY
You can scale SQL databases (Oracle, MySQL, SQL Server…)
• This will cost you dearly
• If you don’t have a lot of money, you will reach limits quickly
You can scale NoSQL databases
• Very easy horizontal scaling
• Lots of open-source solutions
• Scaling was one of the basic design goals, so it is handled well
• Scaling forces trade-offs – that is why you end up using map/reduce
30. RAM
Why map/reduce? I just need some simple queries today – but tomorrow I will need some other queries…
SQL databases are optimized for very efficient disk access, but significant scaling requires RAM caching (MySQL + memcached)
NoSQL databases are designed to keep the whole working set in RAM
31. WORKING SET
In real-world use, the working set is much smaller than the complete database
• For analytics, 99% of queries will concern the last 30 days
As you need RAM only for the working set, you can use commodity servers or VPSes, and just add more as your app becomes more popular
32. WORKING SET WOES
Foursquare has millions of users and a working set the same size as the database
They used a single 66GB Amazon EC2 High-Memory Quadruple Extra Large instance (with cheese) for millions of users
When their RAM usage reached 65GB, they decided to shard
Too late – they were already swapping to disk
Disk is much slower than RAM - 100x slowdown
Server could not keep up due to swapping
11 hours outage (ouch!)
33. MAP/REDUCE
Google’s framework for processing highly distributable problems across huge datasets using a large number of computers
Let’s define “a large number of computers”
• Cluster, if they all have the same hardware
• Grid otherwise (if !Cluster, for old-style programmers)
34. MAP/REDUCE
Process split into two phases
• Map
• Take the input, partition it, and delegate the parts to other machines
• Other machines can repeat the process, leading to a tree structure
• Each machine returns its results to the machine that gave it the task
• Reduce
• Collect the results from the machines you gave tasks to
• Combine the results and return them to the requester
• Slower than sequential data processing, but massively parallel
• Sort a petabyte of data in a few hours
• Input, Map, Shuffle, Reduce, Output
35. MAP/REDUCE EXAMPLE
Count the different words in a set of documents
You need to write just two functions – map and reduce (a sketch follows below)
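A minimal, self-contained Python sketch of those two functions; a real framework would run map on many machines and shuffle the intermediate pairs between the phases, which is simulated here with a plain dictionary.

from collections import defaultdict

def map_fn(document):
    # Emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Sum all partial counts emitted for one word.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group the emitted pairs by key (the word).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)

# Reduce phase: one call per distinct word.
result = dict(reduce_fn(w, c) for w, c in grouped.items())
print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}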
39. MONGODB
Data is stored as BSON (binary JSON)
• Makes it very well suited for languages with native JSON support
Map/Reduce written in JavaScript
• Slow! There is a single thread of execution in JavaScript
Master/slave replication (auto failover with replica sets)
Sharding built-in
Uses memory mapped files for data storage
Performance over features
On 32-bit systems, limited to ~2.5 GB
An empty database takes up 192 MB
GridFS to store big data + metadata (not actually an FS)
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
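As a concrete illustration of the points above, a minimal sketch using the official pymongo driver; the connection string, database, and collection names are assumptions for the example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local server
db = client.analytics            # databases and collections appear lazily
# Documents are stored as BSON; no schema has to be declared up front,
# so the two documents below can have completely different fields.
db.events.insert_one({"user": "ana", "action": "login", "tags": ["web"]})
db.events.insert_one({"user": "ivo", "action": "buy", "amount": 42.0})
for event in db.events.find({"user": "ana"}):
    print(event)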
40. CASSANDRA
Written in: Java
Protocol: Custom, binary (Thrift)
Tunable trade-offs for distribution and replication (N, R, W)
Querying by column, range of keys
BigTable-like features: columns, column families
Writes are much faster than reads (!)
• Constant write time regardless of database size
Map/reduce possible with Apache Hadoop
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
41. HBASE
Written in: Java
Main point: Billions of rows X millions of columns
Modeled after BigTable
Map/reduce with Hadoop
Query predicate push down via server side scan and get filters
Optimizations for real time queries
A high performance Thrift gateway
HTTP supports XML, Protobuf, and binary
Cascading, Hive, and Pig source and sink modules
No single point of failure
While Hadoop streams data efficiently, it has overhead for starting map/reduce jobs; HBase is a column-oriented key/value store and allows for low-latency reads and writes
Random access performance is like MySQL
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
42. REDIS
Written in: C/C++
Main point: Blazing fast
Disk-backed in-memory database
Master-slave replication
Simple values or hash tables by keys
Has sets (also union/diff/inter)
Has lists (also a queue; blocking pop)
Has hashes (objects of multiple fields)
Sorted sets (high score table, good for range queries)
Has transactions (!)
Values can be set to expire (as in a cache)
Pub/Sub lets one implement messaging (!)
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
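A short sketch of those data types using the redis-py client, against an assumed local server; key names are illustrative.

import redis

r = redis.Redis(host="localhost", port=6379)    # assumes a local server
r.set("greeting", "hello", ex=60)               # simple value, expires in 60 s
r.sadd("tags", "fast", "in-memory")             # set
r.lpush("jobs", "job-1")                        # list used as a queue...
print(r.rpop("jobs"))                           # ...popped from the other end
r.zadd("highscores", {"ana": 120, "ivo": 95})   # sorted set
print(r.zrangebyscore("highscores", 100, 200))  # range query by score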
43. COUCHDB
Written in: Erlang
Main point: DB consistency, ease of use
Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)
MVCC - write operations do not block reads
Previous versions of documents are available
Crash-only (reliable) design
Needs compacting from time to time
Views: embedded map/reduce
Formatting views: lists & shows
Server-side document validation possible
Authentication possible
Real-time updates via _changes (!)
Attachment handling
CouchApps (standalone JS apps)
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
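Everything in CouchDB goes over HTTP with JSON bodies, so a generic HTTP client is enough; a minimal sketch using Python's requests against an assumed local, unauthenticated server on the default port.

import requests

base = "http://localhost:5984"          # assumes a local CouchDB
requests.put(base + "/demo")            # create a database
# Create a document; CouchDB assigns the _id and the _rev used by MVCC.
resp = requests.post(base + "/demo", json={"type": "note", "text": "hello"})
doc_id = resp.json()["id"]
print(requests.get(base + "/demo/" + doc_id).json())
print(requests.get(base + "/demo/_changes").json())  # real-time updates feed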
44. HADOOP
Apache project
A framework that allows for the distributed processing of large data sets across clusters of computers
Designed to scale up from single servers to thousands of machines
Designed to detect and handle failures at the application layer, instead of relying on hardware for it
45. HADOOP
Created by Doug Cutting, who named it after his son's toy elephant
Hadoop subprojects
• Cassandra
• HBase
• Pig
Hive was a Hadoop subproject, but is now a top-level Apache project
Used by many large & famous organizations
• http://wiki.apache.org/hadoop/PoweredBy
Scales to hundreds or thousands of computers, each with several processor cores
Designed to efficiently distribute large amounts of work across a set of machines
Hundreds of gigabytes of data constitute the low end of Hadoop-scale
Built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes
47. HADOOP
Uses distributed file system (HDFS)
• Designed to hold very large amounts of data (terabytes or even petabytes)
• Files are stored redundantly across multiple machines to ensure durability against failure and high availability to highly parallel applications
• Data organized into directories and files
• Files are divided into blocks (64 MB by default) and distributed across nodes
Design of HDFS is based on the design of the Google File System
48. HIVE
A petabyte-scale data warehouse system for Hadoop
Easy data summarization, ad-hoc queries
Query the data using a SQL-like language called HiveQL
Hive compiler generates map-reduce jobs for most queries
49. PIG
Platform for analyzing large data sets
High-level language for expressing data analysis programs
Compiler produces sequences of Map-Reduce programs
Textual language called Pig Latin
• Ease of programming
• System optimizes task execution automatically
• Users can create their own functions
50. PIG LATIN
Pig Latin – high level Map/Reduce programming
Plays the role for Map/Reduce that SQL plays for RDBMS systems
Pig Latin can be extended using Java User Defined Functions
“Word Count” script in Pig Latin
53. SUMMARY
NoSQL is a great problem solver if you need it
Choose your NoSQL platform carefully, as each is designed for a specific purpose
Get used to Map/Reduce
It’s not a sin to use NoSQL alongside a (yes)SQL database
I am really happy to work with MongoDB instead of MySQL