This document provides an overview of and introduction to Cosmos DB. It covers what Cosmos DB is, its data models, APIs, partitioning, and global distribution, and explains why Cosmos DB was created to address limitations of traditional databases. Key topics include throughput and consistency levels, indexing, backups, failovers, and working with Cosmos DB as a developer or database administrator. The document also discusses migration tools, limitations, and integrations with Power BI and geospatial data.
CosmosDB for DBAs & Developers
1. Cosmos DB for DBAs & DEVs
Niko Neugebauer – Consultant @ OH22
2. Speaker
Niko Neugebauer – Consultant, OH22
Professional Focus
• Data Platform (especially from Microsoft)
• Columnstore blogger (110+ posts) at http://www.nikoport.com/columnstore
• Creator of CISL – Columnstore Indexes Script Library (https://github.com/NikoNeugebauer/CSIL)
Community
• Led the first international SQLSaturday
• PASS User Group Leader
• TUGA Non-Profit Association Leader
Niko speaks regularly at events such as PASS Summit, SQLRally, SQLBits, and SQLSaturday events around the world.
/in/webcaravela/ • @NikoNeugebauer
4. CAP Theorem – old wisdom: pick just 2!
• Consistency
• Availability
• Partition tolerance
9. What is CosmosDB
• Azure Cosmos DB is Microsoft's globally distributed, multi-model database.
• With the click of a button, Azure Cosmos DB enables you to elastically and independently scale throughput and storage across any number of Azure's geographic regions.
• It offers throughput, latency, availability, and consistency guarantees with comprehensive service level agreements (SLAs), something no other database service can offer.
11. Data Models in CosmosDB
• The database engine operates on an atom-record-sequence (ARS) based type system; all data models are translated to ARS.
• APIs and wire protocols are supported via extensible modules.
• Currently supported data models: Documents, Graphs, Key-Value, Column-Family
12. API (30-11-2017)
• DocumentDB API
• SQL-like API
• MongoDB API
• Table API
• Graph API (TinkerPop, Gremlin/Groovy)
• Cassandra API
• Spark
• Geospatial support
• more will be coming!
13. A word on Table API vs Azure Table Storage comparison
• Latency: Table Storage – fast; Cosmos DB Table API – single-digit millisecond latency.
• Throughput: Table Storage – variable, scalable up to 20,000 operations/second; Cosmos DB Table API – highly scalable with dedicated reserved throughput per table, up to 10 million operations/sec.
• Global Distribution: Table Storage – single region; Cosmos DB Table API – turnkey global distribution.
• Indexing: Table Storage – only a primary index on PartitionKey and RowKey; Cosmos DB Table API – automatic and complete indexing on all properties, with no index management.
• Query: Table Storage – query execution uses the index for the primary key and scans otherwise; Cosmos DB Table API – queries can take advantage of automatic indexing on properties for fast query times.
• Consistency: Table Storage – strong in the primary region, eventual in secondary regions; Cosmos DB Table API – 5 well-defined consistency levels.
18. Partitioning
• Implemented on the tenant level (Collection, Graph, Table)
• A resource partition is a resource-governed primitive, which is limited to a subset of keys.
• Capable of doing splits, merges, etc. of the partitions
19. Partitioning Best Practices
• Select a PartitionKey that gives the best data distribution
• Use a location-aware partition key for the best access locality
• Select a PartitionKey which can serve as a transaction scope
• Don't use timestamps as the partition key for write-heavy workloads; use time ranges (hour, day, week, month, year) for even data distribution (see the sketch below).
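As a minimal sketch of these guidelines with the Python SDK (azure-cosmos), where the account endpoint, key, and the /deviceId partition key path are illustrative assumptions rather than values from the deck:

    # pip install azure-cosmos
    from azure.cosmos import CosmosClient, PartitionKey

    # Endpoint and key are placeholders -- substitute your own account values.
    client = CosmosClient("https://myaccount.documents.azure.com:443/",
                          credential="<account-key>")
    database = client.create_database_if_not_exists(id="telemetry")

    # A high-cardinality key such as /deviceId spreads writes evenly across
    # partitions; a raw timestamp would funnel all current writes into one.
    container = database.create_container_if_not_exists(
        id="readings",
        partition_key=PartitionKey(path="/deviceId"),
        offer_throughput=400,  # RU/sec
    )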
21. Why create CosmosDB?
• Traditional relational databases were designed in the 70s-80s
• Data is growing (petabytes, exabytes, etc.)
• Think about Internet scale and distributed systems
• Provide API choices
Think about:
• Availability
• Performance
• Costs
22. CosmosDB: the focus on performance

Percentile   Reads (1KB)   Indexed Writes (1KB)
50th         < 2ms         < 6ms
99th         < 10ms        < 15ms

▪ Globally distributed, with reads and writes served from/to the local region
▪ Write-optimised, latch-free engine designed for SSDs
▪ Synchronous/asynchronous automatic indexing
23. Azure Cosmos DB
• Azure Cosmos DB is fully schema agnostic.
• Uses JSON to describe the supported data models
• Automatic indexing of all ingested content
• Resource Governed, write-optimised engine
• Online Index operations
24. Core pieces of CosmosDB Architecture
• Global distribution
• Resource Governance
• Schema-agnostic service
25. Consistency Levels (and there are 5 of them):
• You pick a stronger consistency level like strong or bounded staleness for your account because a critical path in your e-commerce/LOB application needs the guarantee.
• But for some less-critical operations (like a reporting dashboard query), you would choose a weaker consistency level because it consumes only half the throughput.
• The current offering of consistency levels is: Strong / Bounded Staleness / Session / Consistent Prefix / Eventual
27. Default Consistency Levels:
• Strong – linearizable. Reads are guaranteed to return the most recent version of an item.
• Bounded Staleness – consistent prefix; reads lag behind writes by at most k prefixes or a time interval t.
• Session – consistent prefix; monotonic reads, monotonic writes, read-your-writes, and write-follows-reads within your geographical location.
• Consistent Prefix – updates returned are some prefix of all the updates, with no gaps: if sequential writes were applied, the earlier ones are always visible before the later ones.
• Eventual – out-of-order reads are possible.
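The consistency level can also be relaxed per client below the account default. A minimal sketch with the Python SDK, where the Eventual choice for a reporting client is an illustrative assumption:

    from azure.cosmos import CosmosClient

    # A client dedicated to the reporting dashboard: weaker consistency,
    # roughly half the read cost compared to strong consistency.
    reporting_client = CosmosClient(
        "https://myaccount.documents.azure.com:443/",
        credential="<account-key>",
        consistency_level="Eventual",  # may only weaken the account default
    )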
28. Indexing & Consistency Levels:
• Consistent indexing mode – Reads: choose from strong, bounded staleness, session, consistent prefix, or eventual. Queries: choose from strong, bounded staleness, session, or eventual.
• Lazy indexing mode – Reads: choose from strong, bounded staleness, session, consistent prefix, or eventual. Queries: eventual.
• None indexing mode – Reads: choose from strong, bounded staleness, session, consistent prefix, or eventual. Queries: eventual.
29. Throughput
• RU – Request Unit
• A blend of % memory / % CPU / % IOPS, just like for Azure SQL DB
• READ / INSERT / UPSERT / DELETE / QUERY operations all consume RUs
• QUERY cost = scans + index lookups + query complexity + instruction cost
• Everything is calculated by Azure ML
30. Throughput
• RU/sec – Request Units per second
• 400 RU/sec – 10,000 RU/sec (collections)
• 2,500 RU/sec – unlimited(?) RU/sec (partitioned collections)
• Minimum increase/decrease step is 100 RU/sec
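Every response reports the RUs it consumed, so these numbers can be measured rather than guessed. A sketch continuing the container from the earlier example (the header name x-ms-request-charge is real; the item itself is made up):

    # Write one document and inspect its RU charge (continuing the earlier sketch).
    item = {"id": "r-001", "deviceId": "dev-42", "temperature": 21.5}
    container.create_item(body=item)

    # The service returns the consumed RUs in the x-ms-request-charge header.
    charge = container.client_connection.last_response_headers["x-ms-request-charge"]
    print(f"Insert consumed {charge} RUs")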
31. Scaling Cosmos DB Up & Out
• Scale Up – increase the number of RUs
• Scale Out – increase the number of partitions for your collections/graphs/tables
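Scaling up is just a throughput change on the container's offer; a minimal sketch, again continuing the earlier Python example:

    # Read the current provisioned throughput, then scale up in 100 RU/sec steps.
    throughput = container.get_throughput()
    print(f"Current: {throughput.offer_throughput} RU/sec")
    container.replace_throughput(throughput.offer_throughput + 400)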
43. Azure CosmosDB Data Migration Tool
• Allows you to migrate your data into CosmosDB
• Supports a range of sources
• Does not support GraphDB ... yet
48. Azure Cosmos DB Emulator
Software requirements:
• Windows Server 2012 R2, Windows Server 2016, or Windows 10
Minimum Hardware requirements:
• 2 GB RAM
• 10 GB available hard disk space
51. Indexing Policy Modes
• Consistent – follows the same consistency level as specified for point reads (i.e. strong, bounded staleness, session or eventual). The index is updated synchronously as part of the document update. The workload target is "write quickly, query immediately".
• Lazy – to allow maximum document ingestion throughput, an Azure Cosmos DB collection can be configured with lazy indexing, meaning queries are eventually consistent. The index is updated asynchronously when the collection is quiescent.
• None – a collection marked with index mode "None" has no index associated with it. This is commonly used when Azure Cosmos DB is utilized as a key-value store and documents are accessed only by their ID property.
54. Indexing Paths
• / – the default path for the collection; recursive
• /name/? – hash or range indexes for predicates and sorts on the value at "name"
• /name/* – index path for all paths under the specified label (multiple levels down)
• /name/[]/prop/? – index path required to serve iteration and JOIN queries against arrays of objects like [{prop: "a"}, {prop: "b"}]
55. Indexes Types, Kinds & Precisions
DataTypes:
• String
• Number
• Point
• Polygon
• LineString
56. Indexes Types, Kinds & Precisions
Index Types:
• Hash – hash indexes, think Hekaton (hash indexes). Supports equality and JOIN queries; for most queries the default precision of 3 bytes is sufficient. DataType can be String or Number.
• Range – range indexes, think Hekaton (Bw-Tree). Supports equality & range queries (<, >, <=, >=, !=) and ORDER BY queries. DataType can be String or Number.
• Spatial – spatial indexes for Points, Polygons & LineStrings. Supports efficient spatial queries (within & distance).
57. Indexes Precision
• Lets you trade off between index storage overhead and query performance.
• For numbers, Microsoft recommends the default precision of -1 ("maximum"). Note that numbers are 8 bytes in JSON.
• Picking smaller precisions (1-7) means collisions and hence more RU consumption.
• For string ranges, which can be of arbitrary length, the index precision can impact the performance of range queries as well as storage. The precision can be specified between 1 and 100.
• Important: if you need sorting of the results (ORDER BY), you must specify a precision of 100. A combined example follows below.
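Putting paths, kinds, and precisions together: a hypothetical indexing policy in the legacy format this deck describes, passed at container creation with the Python SDK. The paths and precisions are illustrative assumptions, and newer API versions have since simplified this format:

    # Continuing the earlier sketch (client, database, PartitionKey in scope).
    indexing_policy = {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [
            {   # equality/JOIN lookups on /name
                "path": "/name/?",
                "indexes": [{"kind": "Hash", "dataType": "String", "precision": 3}],
            },
            {   # range queries and ORDER BY on numbers, full precision
                "path": "/*",
                "indexes": [{"kind": "Range", "dataType": "Number", "precision": -1}],
            },
        ],
        # A write-only blob we never query: excluding it saves RUs and storage.
        "excludedPaths": [{"path": "/rawPayload/*"}],
    }

    container = database.create_container_if_not_exists(
        id="readings",
        partition_key=PartitionKey(path="/deviceId"),
        indexing_policy=indexing_policy,
    )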
59. Indexing Policy Changes – what for?
• When importing bulk data using the lazy indexing mode for faster writes, then switching to consistent indexing for regular operation (see the sketch below).
• When reducing the throughput cost of writes, as well as the storage space used, by hand-selecting the properties to be indexed and changing them over time, or by varying the index precision of individual properties.
• When using new indexing features on your existing DocumentDB collections, such as ORDER BY and string range queries, which require the newly introduced string Range index kind.
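A sketch of the first scenario (lazy during bulk load, consistent afterwards), assuming the database, container, and policy objects from the earlier sketches; replace_container is the v4 Python SDK call for changing a container's policy:

    # Bulk import finished: switch the collection back to consistent indexing.
    indexing_policy["indexingMode"] = "consistent"
    database.replace_container(
        container,
        partition_key=PartitionKey(path="/deviceId"),
        indexing_policy=indexing_policy,
    )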
63. Backup for DBAs:
• Every 4 hours (approx.) a backup is taken (to Azure Blob Storage)
• At least 2 backups are stored at all times
• If you lose your data, you need to contact Azure Support within 8 hours
• Backup retention: 30 days for deleted partitions/databases
• If you want to maintain your own snapshots, you can use the export-to-JSON option in the Azure Cosmos DB Data Migration Tool to schedule additional backups.
64. Backup for DBAs – read carefully:
• As soon as corruption is detected, delete the corrupted container (collection/graph/table) so that backups are protected from being overwritten with corrupted data.
Source: https://docs.microsoft.com/en-us/azure/cosmos-db/online-backup-and-restore
65. Backup for DBAs – the alternative:
• Extract JSON files of your databases/collections/graphs with the help of the Azure Cosmos DB Data Migration Tool
71. Manual Failover Scenarios:
• Follow-the-clock model: if your applications have predictable traffic patterns based on the time of day, you can periodically change the write status to the most active geographic region for that time of day.
• Service update: certain globally distributed application deployments may involve rerouting traffic to a different region via Traffic Manager during a planned service update. Such deployments can use manual failover to keep the write status in the region that will have active traffic during the service update window.
• Business Continuity and Disaster Recovery (BCDR) and High Availability and Disaster Recovery (HADR) drills: most enterprise applications include business continuity tests as part of their development and release process. BCDR and HADR testing is often an important step in compliance certifications and in guaranteeing service availability in the case of regional outages. You can test the BCDR readiness of your applications that use Cosmos DB for storage by triggering a manual failover of your Cosmos DB account and/or adding and removing a region dynamically.
72. Global Distribution aka Geo-Replication aka Regional Failover
• Configuration
• First, deploy your application in multiple regions
• To ensure low-latency access from every region where your application is deployed, configure the corresponding preferred-regions list for each region via one of the supported SDKs.
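With the Python SDK the preferred-regions list is a client option; a minimal sketch where the region names are assumptions for an app instance deployed in Western Europe:

    from azure.cosmos import CosmosClient

    # Order matters: reads are routed to the first reachable region in the list.
    client = CosmosClient(
        "https://myaccount.documents.azure.com:443/",
        credential="<account-key>",
        preferred_locations=["West Europe", "North Europe"],
    )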
74. GraphDB
• Based on Apache TinkerPop (open source)
• Supports the Gremlin & Groovy (how much?) languages
75. GraphDB - possibilities
• Querying across graph collections – not supported right now
• Duplicate edge detection
• Duplicate vertex detection
• Betweenness centrality
• Eigenvector centrality (PageRank)
• Recommendations (like Products in SSAS)
• ...
76. GraphDB Gremlin querying
• g.V().count(); // count all vertices (documents)
• g.V().hasLabel('person').has('age', gt(40)); // people aged over 40
• g.V().hasLabel('person').values('firstName'); // list people's first names
Under the hood, the query
• g.V().hasLabel('Azure')
transforms into
• {"query":"SELECT N_2 FROM Node N_2 WHERE (IS_DEFINED(N_2._isEdge) = false AND (N_2.label = 'Azure'))"}
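These queries can be submitted from Python via the gremlinpython driver, the pattern Microsoft documents for the Graph API; the account, database, and graph names below are placeholders:

    # pip install gremlinpython
    from gremlin_python.driver import client, serializer

    gremlin_client = client.Client(
        "wss://myaccount.gremlin.cosmosdb.azure.com:443/",
        "g",
        username="/dbs/graphdb/colls/people",
        password="<account-key>",
        message_serializer=serializer.GraphSONSerializersV2d0(),
    )

    # Submit a raw Gremlin string; the service translates it to SQL as shown above.
    result = gremlin_client.submit("g.V().hasLabel('person').values('firstName')")
    print(result.all().result())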
80. Power BI
• Via Spark – https://github.com/Azure/azure-cosmosdb-spark/wiki/Configuring-Power-BI-Direct-Query-to-Azure-Cosmos-DB-via-Apache-Spark-(HDI)
81. Geospatial
• Working with geospatial and GeoJSON location data in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/geospatial
• Azure Cosmos DB: Expanded geospatial support, including automatic indexing of Polygon and LineString objects: https://azure.microsoft.com/en-us/updates/documentdb-expanded-geospatial-support-including-automatic-indexing-of-polygons-and-lines/
82. CosmosDB Links
• https://www.microsoft.com/en-us/download/details.aspx?id=46436
• Tunable data consistency levels in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
• Use the Azure Cosmos DB Emulator for local development and testing: https://docs.microsoft.com/en-us/azure/cosmos-db/local-emulator
• Indexing Policies: https://docs.microsoft.com/en-us/azure/cosmos-db/indexing-policies
83. CosmosDB Links
• Gremlin Console: http://tinkerpop.apache.org/docs/current/tutorials/the-gremlin-console/
• Tunable data consistency levels in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels