This document contains the planning and schedule for a 5-day Big Data class taught by Alexandre Bergere. The schedule outlines the topics covered each day, including What is Big Data, NoSQL, Cloud Architecture, Spark, data storage options, MongoDB, and more. Presentation slides provide further detail on MongoDB, its flexibility, the types of data it can store, CRUD operations, and related tools such as MongoDB Compass.
2. 17/07/2019 Big Data class by Alexandre Bergere 2
alexandre.bergere@gmail.com
https://fr.linkedin.com/in/alexandrebergere
@AlexPhile
ESAIP: Student, 2013 - 2016
Avanade: Sr Anls, Data Engineering, 2016 - 2019
ESAIP: Teacher, 2016 - ?
Freelance: Data Analyst & Data Architect, 2019 - x
As a senior analyst at Avanade France, I developed my skills in data analysis (MSBI, Power BI, R, Python) by working on innovative projects and proofs of concept in the energy industry.
3. 17/07/2019 Big Data class by Alexandre Bergere 3
Planning
Days: D-1, D-2, D-3, D-4, D-5 (each split into Morning / Afternoon)
Tracks: Theoretical, AWS Practice, Azure Practice, Exam
Topics across the week:
o What's Big Data + NoSQL + Cloud Architecture
o Azure IOT + Azure Stream Analytics + Power BI
o Analyse Big Data with Hadoop
o Spark (On Prem and Cloud)
o Redshift
o Cosmos DB
o Serverless architecture: AWS Lambda + DynamoDB + NodeJS
o Neo4J
o Mongo DB
o Free time / Prep. Oral
o Oral Exam / Written Exam
4. 17/07/2019 Big Data class by Alexandre Bergere 4
Planning
Days: D-1, D-2, D-3 (each split into Morning / Afternoon)
Tracks: Theoretical, Azure Practice
Topics across the three days:
o What's Big Data
o Cloud architecture
o Azure IOT + Azure Stream Analytics + Power BI
o Analyse Big Data with Hadoop
o Spark (On Prem and Cloud)
o Cosmos DB
o Neo4J
o Mongo DB
o BI & Machine Learning
o Written Exam
6. 17/07/2019 Big Data class by Alexandre Bergere 7
Data Storage
o Relational data store
o HDFS
o Key-value data store
o Columnar data store
o Object store
o Search data store
o Graph data store
o Document data store
8. 17/07/2019 Big Data class by Alexandre Bergere 9
Mongo DB
Created in 2007 & first released in 2010.
Easy and simple … as a leaf.
Document data store & schemaless.
11. MongoDB is easy
17/07/2019 Big Data class by Alexandre Bergere 12
For many developers, the data model goes hand in hand with object mapping, and for that purpose you may have used an object-relational mapping library, such as Java's Hibernate framework or Ruby's ActiveRecord.
Such libraries can be useful for efficiently building applications with an RDBMS, but they're less necessary with MongoDB. This is due in part to the fact that a document is already an object-like representation. It's also partly due to the MongoDB drivers, which already provide a fairly high-level interface to MongoDB. Without question, you can build applications on MongoDB using the driver interface alone.
12. Use cases
17/07/2019 Big Data class by Alexandre Bergere 13
o Web applications (MongoDB is well suited as a primary datastore for web applications)
o Agile development
o Analytics and logging
o Caching
o Variable Schemas
13. Mongo DB 4.0 : ACID transactions
17/07/2019 Big Data class by Alexandre Bergere 14
More info.
Beta test.
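A minimal sketch of what a multi-document transaction looks like in the mongo shell (MongoDB 4.0+, replica set required); the bank database and accounts collection are only assumptions for the example:
# Multi-document ACID transaction (MongoDB 4.0+, replica set required)
> session = db.getMongo().startSession()
> session.startTransaction()
> accounts = session.getDatabase("bank").accounts
> accounts.updateOne({ _id: "A" }, { $inc: { balance: -100 } })
> accounts.updateOne({ _id: "B" }, { $inc: { balance: 100 } })
# Either both updates become visible, or neither does
> session.commitTransaction()
> session.endSession()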
16. Analytics – use case
17/07/2019 Big Data class by Alexandre Bergere 17
More info.
The City of Chicago cuts crime and improves citizen welfare with a real-time geospatial analytics platform called WindyGrid. Using MongoDB, it analyzes data from 30+ different departments – like bus locations, 911 calls, and even tweets – to better understand and respond to emergencies.
17. The case for adding NoSQL
17/07/2019 Big Data class by Alexandre Bergere 18
o Large volumes of rapidly changing structured, semi-structured, and unstructured data
o Agile sprints, quick schema iteration, and frequent code pushes
o API-driven, object-oriented programming that is easy to use and flexible
o Geographically distributed scale-out architecture instead of expensive, monolithic architecture
Consider, for example, enterprise resource planning (ERP), a standard for relational databases. What if you want to offer ERP forms users can actually modify if they need to? A document-based NoSQL database such as MongoDB can provide that functionality without requiring you to rebuild your whole data schema every time a user wants to change the data format.
18. White papers
17/07/2019 Big Data class by Alexandre Bergere 19
MongoDB – BI & Analytics
MongoDB – Kafka
MongoDB – Spark
19. Leader in The Forrester Wave™: Big Data NoSQL, Q1 2019
17/07/2019 Big Data class by Alexandre Bergere 20
o Data Types
o Streaming and Loading
o Big Data Support
o In-memory
o Performance
o Scalability
o High Availability & Disaster Recovery
o Tools
o Workloads
o Use Cases
o Ability to Execute
o Road Map
o Open Source and Licensing
o Support
22. Mongo DB Atlas
17/07/2019 Big Data class by Alexandre Bergere 23
DBaaS: Database as a Service
• Schema design
• Query and index optimization
• Server size selection - you must select the appropriate size of server, coupled with IO and storage capacity
• Capacity planning - you must determine when you need additional capacity, typically using the monitoring telemetry provided by MongoDB Atlas, but you can make these changes with no downtime
• Initiating database restores
• How much you use
23. Mongo DB Cloud Manager
17/07/2019 Big Data class by Alexandre Bergere 24
24. Mongo DB Connector for BI
17/07/2019 Big Data class by Alexandre Bergere 25
25. MongoDB Charts (beta)
17/07/2019 Big Data class by Alexandre Bergere 26
MongoDB Charts is the fastest and easiest way to build visualizations of MongoDB data.
27. Change Streams
17/07/2019 Big Data class by Alexandre Bergere 28
More info.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection and immediately react to them.
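A hedged sketch of the shell pattern for consuming a change stream; the inventory collection is just an example name:
# Subscribe to all changes on a collection and print each change event
> watchCursor = db.inventory.watch()
> while (!watchCursor.isExhausted()) {
    if (watchCursor.hasNext()) {
      printjson(watchCursor.next())   // one document per insert/update/delete/replace
    }
  }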
28. Stitch
17/07/2019 Big Data class by Alexandre Bergere 29
Full access to MongoDB, declarative read/write controls, and integration with your choice of services.
MongoDB Stitch lets developers focus on building applications rather than on managing data manipulation code, service integration, or backend infrastructure. Whether you're just starting up and want a fully managed backend as a service, or you're part of an enterprise and want to expose existing MongoDB data to new applications, Stitch lets you focus on building the app users want, not on writing boilerplate backend logic.
30. Documents are rich data structures
17/07/2019 Big Data class by Alexandre Bergere 31
• JSON:
• String, Number, Array, Object, NULL, Boolean.
• BSON:
• Date, BinData, ObjectID, Geo-Location.
• Better storage performance.
ObjectID:
◦ _id : DATE[4] | MAC_ADDR[3] | PID[2] | COUNTER[3]
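A small illustrative document mixing JSON and BSON types; the events collection and its fields are made up for the example:
# One document mixing JSON types (string, number, array, object, boolean) and BSON types (ObjectID, Date)
> db.events.insertOne({
    _id: ObjectId(),                                            // BSON ObjectID
    name: "sensor-42",                                          // String
    reading: 21.5,                                              // Number (double)
    tags: ["temperature", "indoor"],                            // Array
    location: { type: "Point", coordinates: [2.35, 48.85] },    // Object (GeoJSON point)
    createdAt: new Date(),                                      // BSON Date
    active: true                                                // Boolean
  })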
31. Available Types
17/07/2019 Big Data class by Alexandre Bergere 32
Type Number Alias Notes
Double 1 “double”
String 2 “string”
Object 3 “object”
Array 4 “array”
Binary data 5 “binData”
Undefined 6 “undefined” Deprecated.
ObjectId 7 “objectId”
Boolean 8 “bool”
Date 9 “date”
Null 10 “null”
Regular Expression 11 “regex”
DBPointer 12 “dbPointer” Deprecated.
JavaScript 13 “javascript”
Symbol 14 “symbol” Deprecated.
JavaScript (with scope) 15 “javascriptWithScope”
32-bit integer 16 “int”
Timestamp 17 “timestamp”
64-bit integer 18 “long”
Decimal128 19 “decimal” New in version 3.4.
Min key -1 “minKey”
Max key 127 “maxKey”
32. SQL vs MongoDB Terms
17/07/2019 Big Data class by Alexandre Bergere 33
SQL Terms/Concepts MongoDB Terms/Concepts
Database Database
Table Collection
Row Document
Column Field
Index Index
Join Embedded or linked document
Primary key Primary key (the « _id » field)
34. Document Model
17/07/2019 Big Data class by Alexandre Bergere 35
RDBMS
PERSON
Pers_ID Surname First_Name City
0 Miller Paul London
1 Ortega Alvaro Valencia
2 Huber Urs Zurich
3 Blanc Gaston Paris
4 Bertolini Fabrizio Rome
CAR
Car_ID Model Year Value Pers_ID
101 Bentley 1973 100000 0
102 Rolls Royce 1965 330000 0
103 Peugeot 1993 500 3
104 Ferrari 2005 150000 4
105 Renault 1998 2000 3
106 Renault 2001 7000 3
107 Smart 1999 2000 2
Mongo DB: the same data modelled as documents (see the sketch below).
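One possible way the same data could be embedded in MongoDB, one document per person with the cars nested; this exact structure is an assumption, not taken from the slide:
# Person 3 from the relational tables, with the related CAR rows embedded
> db.persons.insertOne({
    _id: 3,
    surname: "Blanc",
    first_name: "Gaston",
    city: "Paris",
    cars: [
      { car_id: 103, model: "Peugeot", year: 1993, value: 500 },
      { car_id: 105, model: "Renault", year: 1998, value: 2000 },
      { car_id: 106, model: "Renault", year: 2001, value: 7000 }
    ]
  })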
36. CRUD
17/07/2019 Big Data class by Alexandre Bergere 37
# FIND()
> db.<collection>.find({<conditions>}, {<fields>})
> db.products.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 } )
Options:
.pretty()
.sort() : 1 : ASC, -1 : DESC : sort({'name': -1})
.skip() : number
.limit() : number
.count()
Apply sort first, skip second, and limit last, because that is the only order that makes sense.
37. CRUD
17/07/2019 Big Data class by Alexandre Bergere 38
# INSERT()
> db.<collection>.insert ({<value>})
> db.<collection>.insertMany([{<values>}])
> db.inventory.insertMany([
{ item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } },
{ item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } },
{ item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } }
])
db.collection.insertOne() Inserts a single document into a collection.
db.collection.insertMany() Inserts multiple documents into a collection.
db.collection.insert() Inserts a single document or multiple documents into a collection.
38. CRUD
17/07/2019 Big Data class by Alexandre Bergere 39
# UPDATE()
> db.<collection>.update({<conditions>}, {<update>}, {upsert: true/false, multi: true/false})
> { "_id": "artist:271", "last_name": "Cotillard", "first_name": "Marion", "birth_date": "1975" }
# Operator Update
> db.artists.update({"_id": "artist:271"}, { $set: {"last_name": "Page"} })
> { "_id": "artist:271", "last_name": "Page", "first_name": "Marion", "birth_date": "1975" }
# Replacement Update
> db.artists.update({"_id": "artist:271"}, {"last_name": "Page"})
> { "_id": "artist:271", "last_name": "Page" }
❑ Operator Update
❑ Replacement Update
upsert: boolean. Optional. If set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found.
multi: boolean. Optional. If set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false.
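A short hedged sketch of the upsert and multi options in action; the ids and field values are invented for the example:
# Upsert: update the artist if it exists, otherwise insert it
> db.artists.update({"_id": "artist:300"}, {$set: {"last_name": "Doe"}}, {upsert: true})
# Multi: update every artist born in 1975, not just the first match
> db.artists.update({"birth_date": "1975"}, {$set: {"decade": "1970s"}}, {multi: true})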
39. CRUD
17/07/2019 Big Data class by Alexandre Bergere 40
# DELETE()
> db.<collection>.remove({<conditions>})
> db.artists.remove({"_id": "artist:39"})
# Remove all documents from the collection
> db.artists.remove({})
40. Query Operator
17/07/2019 Big Data class by Alexandre Bergere 41
Name Description
$eq Matches values that are equal to a specified value.
$gt Matches values that are greater than a specified value.
$gte Matches values that are greater than or equal to a specified value.
$lt Matches values that are less than a specified value.
$lte Matches values that are less than or equal to a specified value.
$ne Matches all values that are not equal to a specified value.
$in Matches any of the values specified in an array.
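Two hedged examples combining these operators on the artists collection used elsewhere in the class; the values are illustrative:
# Artists born between 1960 and 1970 (birth_date is stored as a string in this dataset)
> db.artists.find({ "birth_date": { $gte: "1960", $lte: "1970" } })
# Artists whose first name is one of the listed values
> db.artists.find({ "first_name": { $in: ["Marion", "Jonathan", "Alexandre"] } })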
42. Query Operator : Arrays
17/07/2019 Big Data class by Alexandre Bergere 43
Name Description
$pull Removes all array elements that match a specified query.
$push Add an element to an array.
$pop Removes the first or last item of an array.
$addToSet Adds elements to an array only if they do not already exist in the set.
$in Matches any of the values specified in an array.
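A couple of hedged examples of the array update operators, reusing the hobbies field from the practice exercises; the values are illustrative:
# Add "piano" to the hobbies array only if it is not already present
> db.artists.update({"_id": "artist:280"}, {$addToSet: {"hobbies": "piano"}})
# Remove the last element of the hobbies array (-1 would remove the first)
> db.artists.update({"_id": "artist:280"}, {$pop: {"hobbies": 1}})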
43. DML
17/07/2019 Big Data class by Alexandre Bergere 44
# Returns all databases
> show dbs
# The current database name:
> db.getName()
# Returns all collections in the current database:
> db.getCollectionNames()
# Returns a collection or a view object:
> db.getCollection(name)
# The current database connection:
> db.getMongo()
# Clear the console:
> cls
# Return collection information:
> db.getCollectionInfos({name: "name"})
44. Command-line tools
17/07/2019 Big Data class by Alexandre Bergere 45
# Import multiple documents:
> mongoimport -d crunchbase -c companies D:\MongoDB\src\companies.json
# Import multiple documents from a JSON array:
> mongoimport -d crunchbase -c companies D:\MongoDB\src\companies.json --jsonArray
# Export
> mongoexport -d crunchbase -c artists --out D:\MongoDB\artists.json
Run these commands from the system shell, not inside a mongo instance.
Command Description
mongodump A utility for creating a binary export of the contents of a database. mongodump can export data from either mongod or mongos instances.
mongorestore Loads data from either a binary database dump created by mongodump or the standard input (starting in version 3.0.0) into a mongod or mongos instance.
mongostat Constantly polls MongoDB and the system to provide helpful stats, including the number of operations per second (inserts, queries, updates, deletes, and so on), the amount of virtual memory allocated, and the number of connections to the server.
mongoperf Helps you understand the disk operations happening in a running MongoDB instance.
mongotop Similar to top, this utility polls MongoDB and shows the amount of time it spends reading and writing data in each collection.
mongosniff A wire-sniffing tool for viewing operations sent to the database. It essentially translates the BSON going over the wire to human-readable shell statements.
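A hedged sketch of a dump-and-restore round trip; the paths and database names are examples only:
# Binary dump of the crunchbase database to a local folder
> mongodump -d crunchbase --out D:\MongoDB\backup
# Restore it into a (possibly different) mongod instance
> mongorestore -d crunchbase D:\MongoDB\backup\crunchbase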
45. $text
17/07/2019 Big Data class by Alexandre Bergere 46
# $text
> db.articles.find( { $text: { $search: "coffee" } } )
$text performs a text search on the content of the fields indexed with a text index. A $text expression has the following syntax:
# $text
> {
$text:
{
$search: <string>,
$language: <string>,
$caseSensitive: <boolean>,
$diacriticSensitive: <boolean>
}
}
# Create index first - you can index multiple fields for the text index:
db.reviews.createIndex(
{
subject: "text",
comments: "text"
}
)
46. Schema Validation
17/07/2019 Big Data class by Alexandre Bergere 47
Implement data governance without sacrificing the agility that comes from a dynamic schema. With schema validation, developers and operations spend less time defining data quality controls in their applications, and instead delegate these tasks to the database.
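A minimal sketch of a $jsonSchema validator (MongoDB 3.6+); the artists collection and the required fields are just an assumption for illustration:
# Create a collection whose documents must contain string last_name and first_name fields
> db.createCollection("artists", {
    validator: {
      $jsonSchema: {
        bsonType: "object",
        required: ["last_name", "first_name"],
        properties: {
          last_name: { bsonType: "string" },
          first_name: { bsonType: "string" },
          birth_date: { bsonType: "string" }
        }
      }
    }
  })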
47. Aggregation
17/07/2019 Big Data class by Alexandre Bergere 48
Swiss Army knife
Executes in native code
o Written in C++
o JSON parameter
Flexible, functional, simple
o Operation pipeline
o Computational expressions
48. Pipeline operators
17/07/2019 Big Data class by Alexandre Bergere 49
Operator Description
$match Filter documents
$project Reshape documents
$group Summarize documents
$unwind Expand arrays in documents
$sort Order documents
$limit / $skip Paginate documents
$redact Restrict documents
$geoNear Proximity sort documents
$let, $map Define variables
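A hedged end-to-end pipeline combining several of these stages; the books collection mirrors the examples on the following slides but is otherwise an assumption:
# Average page count per language for books longer than 100 pages, largest first
> db.books.aggregate([
    { $match: { pages: { $gt: 100 } } },
    { $group: { _id: "$language", books: { $sum: 1 }, avgPages: { $avg: "$pages" } } },
    { $sort: { avgPages: -1 } },
    { $limit: 5 }
  ])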
49. $match
17/07/2019 Big Data class by Alexandre Bergere 50
# Matching field values
> {$match: {
    language: "Russian"
  }}
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
}
# Matching with query operators
> {$match: {
    pages: {$gt: 100}
  }}
{
  title: "War and Peace",
  pages: 1440,
  language: "Russian"
},
{
  title: "Atlas Shrugged",
  pages: 1088,
  language: "English"
}
50. $project
17/07/2019 Big Data class by Alexandre Bergere 51
# Renaming and computing fields
> {$project:{
avgChapterLength:{
$divide:["$pages", "$chapters" ]
},
lang: "$language"
}}
{
_id:375,
avgChapterLength: 24.2222,
lang:"English"
}
# Including & excluding fields
> {$project:{
_id:0,
title:1,
language:1
}}
{
title:"Great Gatsby",
language:"English"
}
51. $group
17/07/2019 Big Data class by Alexandre Bergere 52
# Collect distinct values
> {$group: {
    _id: "$language",
    title: {$addToSet: "$title"}
  }}
{
  _id: "English",
  title: ["Atlas Shrugged", "The Great Gatsby"]
},
{
  _id: "Russian",
  title: ["War and Peace"]
}
# Calculating averages, summing fields…
> {$group: {
    _id: "$language",
    pages: {$sum: "$pages"},
    books: {$sum: 1},
    avgPages: {$avg: "$pages"}
  }}
{
  _id: "Russian",
  pages: 1440,
  books: 1,
  avgPages: 1440
}
52. $unwind
17/07/2019 Big Data class by Alexandre Bergere 53
# Expand an array into one document per element
> {$unwind: "$subjects"}
Input document:
{
  title: "The Great Gatsby",
  ISBN: "9762832930920323",
  subjects: [
    "Long Island",
    "New York",
    "1920s"
  ]
}
Output documents:
{
  title: "The Great Gatsby",
  ISBN: "9762832930920323",
  subjects: "Long Island"
},
{
  title: "The Great Gatsby",
  ISBN: "9762832930920323",
  subjects: "New York"
},
{
  title: "The Great Gatsby",
  ISBN: "9762832930920323",
  subjects: "1920s"
}
55. Instance
17/07/2019 Big Data class by Alexandre Bergere 56
Launch the server:
mongod --dbpath C:\Users\alexa\Documents\MongoDB\data
Connect with the shell:
mongo
Options Shortcut
--db -d
--collection -c
--username -u
--password -p
--host -h
56. Request practice
17/07/2019 Big Data class by Alexandre Bergere 57
# 1.0 Load artists.json
> mongoimport -d crunchbase -c artists --file C:\Users\alexa\Documents\Cours\MongoDB\2017-2018\src\artists.json --jsonArray --port 27018
# 1.1 Return first_name and birth_date of all artists born in 1964
> db.artists.find({"birth_date": "1964"},{"_id":0,"first_name":1, "birth_date":1})
# 1.2 Return all artists born after 1980 or whose first name begins with 'Chri'
> db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":/^Chri/}]},{})
> db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":{$regex : /^Chri/}}]},{})
# 1.3 Return the 6th to the 9th artist, sorted by last name descending
> db.artists.find().pretty().sort({"last_name":-1}).skip(5).limit(4)
# 1.4 Insert the following artist (replace the id):
{"_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992"}
> db.artists.insert({ "_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992" })
57. Request practice
17/07/2019 Big Data class by Alexandre Bergere 58
# 1.5 Change the first_name of the artist with id artist:266 to « Jonathan »
> db.artists.update({"_id": "artist:266"},{$set:{"first_name":"Jonathan"}})
# 1.6 Add « golf » to artist:280's hobbies
> db.artists.update({"_id": "artist:280"},{$push:{"hobbies":"golf"}})
# 1.7 Add « yoga » to artist:282's hobbies
> db.artists.update({"_id": "artist:282"},{$push:{"hobbies":"yoga"}})
# 1.8 Remove « poney » and « photo » from artist:280's hobbies
> db.artists.update({"_id": "artist:280"},{$pull:{"hobbies": {$in:["poney","photo"]}}})
58. Request practice
17/07/2019 Big Data class by Alexandre Bergere 59
# Convert string to integer
> db.artists.find({birth_date: {$exists: true}}).forEach(function(obj) {
obj.birth_date = new NumberInt(obj.birth_date);
db.artists.save(obj);
});
65. What is a graph database?
17/07/2019 Big Data class by Alexandre Bergere 66
A graph database is an online database management system with Create, Read, Update and Delete (CRUD) operations working on a graph data model. Graph databases are generally built for use with online transaction processing (OLTP) systems. Accordingly, they are normally optimized for transactional performance, and engineered with transactional integrity and operational availability in mind. ~ Neo4j
Unlike other databases, relationships take first priority in graph databases.
66. The case for graph databases
17/07/2019 Big Data class by Alexandre Bergere 67
67. What is Graph?
17/07/2019 Big Data class by Alexandre Bergere 68
A graph is just a collection of vertices and edges, or, in less intimidating language, a set of nodes and the relationships that connect them.
68. Definitions
17/07/2019 Big Data class by Alexandre Bergere 69
• Nodes
o Nodes are the main data elements
o Nodes are connected to other nodes via relationships
o Nodes can have one or more properties (i.e., attributes stored as key/value pairs)
o Nodes have one or more labels that describe their role in the graph
o Example: Person nodes vs Car nodes
• Relationships
o Relationships connect two nodes
o Relationships are directional
o Nodes can have multiple, even recursive relationships
o Relationships can have one or more properties (i.e., attributes stored as key/value pairs)
• Properties
o Properties are named values where the name (or key) is a string
o Properties can be indexed and constrained
o Composite indexes can be created from multiple properties
• Labels
o Labels are used to group nodes into sets
o A node may have multiple labels
o Labels are indexed to accelerate finding nodes in the graph
o Native label indexes are optimized for speed
69. Modelling relational to graph
17/07/2019 Big Data class by Alexandre Bergere 70
Similarities
Relational Graph
Rows Nodes
Joins Relationships
Table names Labels
Columns Properties
How the relational model differs from the graph model
Relational Graph
Each column must have a field value. Nodes with the same label aren't required to have the same set of properties.
Joins are calculated at query time. Relationships are stored on disk when they are created.
A row can belong to one table. A node can have many labels.
72. Neo4j Graph Platform
17/07/2019 Big Data class by Alexandre Bergere 73
The Neo4j Graph Platform includes out-of-the-box tooling that enables you to access graphs in Neo4j Databases. In addition, Neo4j provides APIs and drivers that enable you to create applications and custom tooling for accessing and visualizing graphs.
73. Dev env.
17/07/2019 Big Data class by Alexandre Bergere 75
Neo4j Desktop
o Neo4j Database server
o graph engine
o kernel (Cypher execution)
o Neo4j Browser
o additional libraries and drivers for accessing the Neo4j database
Neo4j Sandbox
o temporary, cloud-based instance of a Neo4j Server with its associated graph that you can access from any Web browser
o available for three days, but you can extend it for up to 10 days
o you can use Neo4j Browser Sync to save Cypher scripts from your sandbox
76. What’s Cypher?
17/07/2019 Big Data class by Alexandre Bergere 78
Cypher is a declarative query language that allows for expressive and efficient querying and updating of graph data.
Cypher is ASCII art: it focuses on the clarity of expressing what to retrieve from a graph.
Cypher is inspired by SPARQL, SQL, Python, and Haskell.
77. Node & Label
17/07/2019 Big Data class by Alexandre Bergere 79
() // anonymous node that will not be referenced later in the query
(p) // variable p, a reference to a node used later
(:Person) // anonymous node of type Person
(p:Person) // p, a reference to a node of type Person
(p:Actor:Director) // p, a reference to a node of types Actor and Director
Examining the data model
CALL db.schema
78. Using MATCH to retrieve nodes
17/07/2019 Big Data class by Alexandre Bergere 80
MATCH (n) // returns all nodes in the graph
RETURN n
MATCH (p:Person) // returns all Person nodes in the graph
RETURN p
When you specify a pattern for a MATCH clause, you should always specify a node label if possible. In doing so, the graph
engine uses an index to retrieve the nodes which will perform better than not using a label for the MATCH.
79. Properties
17/07/2019 Big Data class by Alexandre Bergere 81
A property is defined for a node and not for a type of node. All nodes of the same type need not have the same properties.
// Query the database for all property keys
CALL db.propertyKeys
MATCH (variable:Label {propertyKey: propertyValue, propertyKey2: propertyValue2})
RETURN variable
MATCH (m:Movie {released: 2003, tagline: 'Free your mind'})
RETURN m
80. Filtering queries using property values
17/07/2019 Big Data class by Alexandre Bergere 82
// Retrieve all Movie nodes that have a released property value of 2003.
MATCH (m:Movie {released:2003}) RETURN m
// Retrieve all Movies released in 2006, returning their titles
MATCH (m:Movie {released: 2006}) RETURN m.title
// Display title, released, and tagline values for every Movie node in the graph
MATCH (m:Movie) RETURN m.title AS `movie title`, m.released AS released, m.tagline
AS tagLine
81. Relationships
17/07/2019 Big Data class by Alexandre Bergere 83
A relationship is a directed connection between two nodes that has a relationship type (name). In addition, a relationship
can have properties, just like nodes.
Here is how Cypher uses ASCII art to specify the path used for a query:
() // a node
()--() // 2 nodes have some type of relationship
()-->() // the first node has a relationship to the second node
()<--() // the second node has a relationship to the first node
Querying using relationships:
MATCH (node1)-[:REL_TYPE]->(node2)
RETURN node1, node2
MATCH (node1)-[:REL_TYPEA | :REL_TYPEB]->(node2)
RETURN node1, node2
node1 is a specification of a node where you may include node labels and property values for filtering.
:REL_TYPE is the type (name) for the relationship. For this syntax the relationship is from node1 to node2.
:REL_TYPEA , :REL_TYPEB are the relationships from node1 to node2. The nodes are returned if at least one of the relationships exists.
node2 is a specification of a node where you may include node labels and property values for filtering.
82. Relationships
17/07/2019 Big Data class by Alexandre Bergere 84
Using patterns for queries:
MATCH (p:Person)-[:FOLLOWS]->(:Person {name:'Angela Scope'})
RETURN p
MATCH (p:Person)<-[:FOLLOWS]-(:Person {name:'Angela Scope'})
RETURN p
83. Relationships
17/07/2019 Big Data class by Alexandre Bergere 85
Using patterns for queries:
// Querying by any direction of the relationship
MATCH (p1:Person)-[:FOLLOWS]-(p2:Person {name:'Angela Scope'})
RETURN p1, p2
84. Relationships
17/07/2019 Big Data class by Alexandre Bergere 86
Using patterns for queries:
// Traversing relationships : query to return all followers of the followers
of Jessica Thompson.
MATCH (p:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica
Thompson'})
RETURN p
// Traversing relationships : return each person along the path by specifying
variables for the nodes and returning them
MATCH path = (:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica
Thompson'})
RETURN path
85. Relationships
17/07/2019 Big Data class by Alexandre Bergere 87
Using a relationship in a query:
MATCH (p:Person)-[rel:ACTED_IN]->(m:Movie {title: 'The Matrix'})
RETURN p, rel, m
Variables:
o p to represent the Person nodes during the query, the
variable
o m to represent the Movie node retrieved
o rel to represent the relationship for the relationship
type, ACTED_IN
Querying by multiple relationships:
MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN|:DIRECTED]->(m:Movie)
RETURN p.name, m.title
86. Relationships
17/07/2019 Big Data class by Alexandre Bergere 88
Using anonymous nodes in a query:
MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'})
RETURN p.name
A best practice is to place named nodes (those with variables) before anonymous nodes in a MATCH clause.
Using an anonymous relationship for a query:
// find all people who are in any way connected to the movie
MATCH (p:Person)-->(m:Movie {title: 'The Matrix'})
RETURN p, m
MATCH (p:Person)--(m:Movie {title: 'The Matrix'})
RETURN p, m
87. Relationships
17/07/2019 Big Data class by Alexandre Bergere 89
Retrieving the relationship types:
MATCH (p:Person)-[rel]->(:Movie {title:'The Matrix'})
RETURN p.name, type(rel)
Retrieving properties for relationships:
MATCH (p:Person)-[:REVIEWED {rating: 65}]->(:Movie {title: 'The Da Vinci Code'})
RETURN p.name
88. Filtering queries using relationships
17/07/2019 Big Data class by Alexandre Bergere 90
// Retrieve all people who wrote the movie Speed Racer
MATCH (p:Person)-[:WROTE]->(:Movie {title: 'Speed Racer'}) RETURN p.name
// Retrieve all movies that are connected to the person, Tom Hanks
MATCH (m:Movie)<--(:Person {name: 'Tom Hanks'}) RETURN m.title
or
MATCH(:Person {name: 'Tom Hanks'})-->(m:Movie) RETURN m.title
// Retrieve information about the relationships Tom Hanks has with the set of
movies retrieved earlier
MATCH (m:Movie)-[rel]-(:Person {name: 'Tom Hanks'}) RETURN m.title, type(rel)
// Retrieve information about the roles that Tom Hanks acted in
MATCH (m:Movie)-[rel:ACTED_IN]-(:Person {name: 'Tom Hanks'}) RETURN m.title,
rel.roles
89. Cypher style recommendations
17/07/2019 Big Data class by Alexandre Bergere 91
Here are the Neo4j-recommended Cypher coding standards:
o Node labels are CamelCase and begin with an upper-case letter (examples: Person, NetworkAddress). Note that node
labels are case-sensitive.
o Property keys, variables, parameters, aliases, and functions are camelCase and begin with a lower-case letter
(examples: businessAddress, title). Note that these elements are case-sensitive.
o Relationship types are in upper-case and can use the underscore. (examples: ACTED_IN, FOLLOWS). Note that
relationship types are case-sensitive and that you cannot use the “-” character in a relationship type.
o Cypher keywords are upper-case (examples: MATCH, RETURN). Note that Cypher keywords are case-insensitive, but a
best practice is to use upper-case.
o String constants are in single quotes, unless the string contains a quote or apostrophe (examples: ‘The Matrix’,
“Something’s Gotta Give”). Note that you can also escape single or double quotes within strings that are quoted with
the same using a backslash character.
o Specify variables only when needed for use later in the Cypher statement.
o Place named nodes and relationships (that use variables) before anonymous nodes and relationships in your MATCH
clauses when possible.
o Specify anonymous relationships with -->, --, or <--
MATCH (:Person {name: 'Diane Keaton'})-[movRel:ACTED_IN]->
(:Movie {title:"Something's Gotta Give"})
RETURN movRel.roles
Follow the Cypher Style Guide when writing your Cypher statements.
90. 17/07/2019 Big Data class by Alexandre Bergere 92
Getting More Out of Queries
91. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 93
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released = 2008
RETURN p, m
// complex conditions: filter on a range of values
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 2003 AND m.released <= 2004
RETURN p.name, m.title, m.released
// same as previous
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE 2003 <= m.released <= 2004
RETURN p.name, m.title, m.released
MATCH (p:Person)-[:ACTED_IN]->(m:Movie {released: 2008})
RETURN p, m
92. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 94
// Testing labels
MATCH (p:Person)
RETURN p.name
MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'})
RETURN p.name
MATCH (p)
WHERE p:Person
RETURN p.name
MATCH (p)-[:ACTED_IN]->(m)
WHERE p:Person AND m:Movie AND m.title='The Matrix'
RETURN p.name
93. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 95
// Testing the existence of a property
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name='Jack Nicholson' AND exists(m.tagline)
RETURN m.title, m.tagline
// Testing strings : You can specify STARTS WITH, ENDS WITH, and CONTAINS
MATCH (p:Person)-[:ACTED_IN]->()
WHERE toLower(p.name) STARTS WITH 'michael'
RETURN p.name
// Testing with regular expressions; You use the syntax =~
MATCH (p:Person)
WHERE p.name =~'Tom.*'
RETURN p.name
94. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 96
// Testing with patterns
// exclude people who directed that movie
MATCH (p:Person)-[:WROTE]->(m:Movie)
WHERE NOT exists( (p)-[:DIRECTED]->() )
RETURN p.name, m.title
// find Gene Hackman and the movies that he acted in with another person who also
directed the movie
MATCH (gene:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(other:Person)
WHERE gene.name= 'Gene Hackman'
AND exists( (other)-[:DIRECTED]->() )
RETURN gene, other, m
95. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 97
// Testing with list values : elements of the list have to be the same type of data
MATCH (p:Person)
WHERE p.born IN [1965, 1970]
RETURN p.name as name, p.born as yearBorn
// You can also compare a value to an existing list in the graph.
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WHERE 'Neo' IN r.roles AND m.title='The Matrix'
RETURN p.name
There are a number of syntax elements of Cypher that we have not covered in this training. For example, you can specify
CASE logic in your conditional testing for your WHERE clauses. You can learn more about these syntax elements in the
Neo4j Cypher Manual and the Cypher Refcard.
96. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 98
// Retrieve all actors that were born in the 70’s
MATCH (a:Person)
WHERE a.born >= 1970 AND a.born < 1980
RETURN a.name as Name, a.born as `Year Born`
// Retrieve all movies released in 2000 by testing the node label and the released
property, returning the movie titles
MATCH (m)
WHERE m:Movie AND m.released = 2000 AND exists(m.released)
RETURN m.title
// Retrieve all people that wrote movies by testing the relationship between two
nodes
MATCH (a)-[rel]->(m)
WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie
RETURN a.name as Name, m.title as Movie
// Retrieve all people in the graph that do not have the property ‘born’
MATCH (a:Person)
WHERE NOT exists(a.born)
RETURN a.name as Name
97. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 99
// Retrieve all people related to movies where the relationship has the rating
property, then return their name, movie title, and the rating.
MATCH (a:Person)-[rel]->(m:Movie)
WHERE exists(rel.rating)
RETURN a.name as Name, m.title as Movie, rel.rating as Rating
// Retrieve all REVIEW relationships from the graph where the summary of the review
contains the string fun, returning the movie title reviewed and the rating and
summary of the relationship.
MATCH (:Person)-[r:REVIEWED]->(m:Movie)
WHERE toLower(r.summary) CONTAINS 'fun'
RETURN m.title as Movie, r.summary as Review, r.rating as Rating
// Retrieve all people who have produced a movie, but have not directed a movie
MATCH (a:Person)-[:PRODUCED]->(m:Movie)
WHERE NOT ((a)-[:DIRECTED]->(:Movie))
RETURN a.name, m.title
// Retrieve the movies and their actors where one of the actors also directed the
movie
MATCH (a1:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2:Person)
WHERE exists( (a2)-[:DIRECTED]->(m) )
RETURN a1.name as Actor, a2.name as `Actor/Director`, m.title as Movie
98. Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 100
// Retrieve the movies that have an actor’s role that is the name of the movie
MATCH (a:Person)-[r:ACTED_IN]->(m:Movie)
WHERE m.title in r.roles
RETURN m.title as Movie, a.name as Actor
99. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 101
MATCH (a:Person)-[:ACTED_IN]->(m:Movie),
(m:Movie)<-[:DIRECTED]-(d:Person)
WHERE m.released = 2000
RETURN a.name, m.title, d.name
Specifying multiple MATCH patterns
This MATCH clause includes a pattern specified by two paths separated by a comma:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
WHERE m.released = 2000
RETURN a.name, m.title, d.name
If possible, you should write the same query as follows:
100. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 102
// retrieve the actors who acted in the same movies as Keanu Reeves, but not when
Hugo Weaving acted in the same movie
MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie)<-[:ACTED_IN]-(n:Person),
(hugo:Person)
WHERE keanu.name='Keanu Reeves' AND
hugo.name='Hugo Weaving'
AND NOT (hugo)-[:ACTED_IN]->(movie)
RETURN n.name
Specifying multiple MATCH patterns
// Suppose we want to retrieve the movies that Meg Ryan acted in and their
respective directors, as well as the other actors that acted in these movies.
MATCH (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(other:Person)-[:ACTED_IN]->(m)
WHERE meg.name = 'Meg Ryan'
RETURN m.title as movie, d.name AS director , other.name AS `co-actors`
101. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 103
MATCH megPath = (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(other:Person)-[:ACTED_IN]->(m)
WHERE meg.name = 'Meg Ryan'
RETURN megPath
Setting path variables
102. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 104
Specifying varying length paths
// all of the followers of the followers of a Person
MATCH (follower:Person)-[:FOLLOWS*2]->(p:Person)
WHERE follower.name = 'Paul Blythe'
RETURN p
// Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to
nodeB and beyond:
(nodeA)-[:RELTYPE*]->(nodeB)
// Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to
nodeB or from nodeB to nodeA and beyond:
(nodeA)-[:RELTYPE*]-(nodeB)
// Retrieve the paths of length 3 with the relationship,
(node1)-[:RELTYPE*3]->(node2)
// Retrieve the paths of lengths 1, 2, or 3 with the relationship
(node1)-[:RELTYPE*1..3]->(node2)
103. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 105
Finding the shortest path
MATCH p = shortestPath((m1:Movie)-[*]-(m2:Movie))
WHERE m1.title = 'A Few Good Men' AND
m2.title = 'The Matrix'
RETURN p
A built-in function that you may find useful in a graph that has many ways of traversing the graph to get to the same node
is the shortestPath() function. Using the shortest path between two nodes improves the performance of the query.
When you use the shortestPath() function, the query editor will show a warning that this type of query could potentially
run for a long time. You should heed the warning, especially for large graphs. Read the Graph Algorithms documentation
about the shortest path algorithm.
When you use shortestPath(), you can specify an upper limit for the length of the shortest path. In addition, you should aim to provide
patterns for the from and to nodes that execute efficiently. For example, use labels and indexes.
104. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 106
Specifying optional pattern matching
MATCH (p:Person)
WHERE p.name STARTS WITH 'James'
OPTIONAL MATCH (p)-[r:REVIEWED]->(m:Movie)
RETURN p.name, type(r), m.title
OPTIONAL MATCH matches patterns with your graph, just like MATCH does. The difference is that if no matches are found,
OPTIONAL MATCH will use NULLs for missing parts of the pattern. OPTIONAL MATCH could be considered the Cypher
equivalent of the outer join in SQL.
105. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 107
Collecting results
// the list of movies that Tom Cruise acted in
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name ='Tom Cruise'
RETURN collect(m.title) AS `movies for Tom Cruise`
Cypher has a built-in function, collect() that enables you to aggregate a value into a list.
106. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 108
Aggregation in Cypher
// implicitly groups by a.name and d.name
MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN a.name, d.name, count(*)
// count the paths retrieved where an actor and director collaborated in a movie
MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, director.name, count(m) AS collaborations, collect(m.title) AS
movies
Aggregation in Cypher is different from aggregation in SQL. In Cypher, you need not specify a grouping key. As soon as an
aggregation function is used, all non-aggregated result columns become grouping keys. The grouping is implicitly done,
based upon the fields in the RETURN clause.
There are more aggregating functions such as min()
or max() that you can also use in your queries.
These are described in the Aggregating Functions
section of the Neo4j Cypher Manual.
107. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 109
Additional processing using WITH
// only return actors that have 2 or 3 movies
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WITH a, count(a) AS numMovies, collect(m.title) as movies
WHERE numMovies > 1 AND numMovies < 4
RETURN a.name, numMovies, movies
During the execution of a MATCH clause, you can specify that you want some intermediate calculations or values that will
be used for further processing of the query, or for limiting the number of results before further processing is done. You use
the WITH clause to perform intermediate processing or data flow operations.
In a WITH clause, you must alias all expressions that are not simple variables.
// find all actors who have acted in at least five movies, and find (optionally)
the movies they directed and return the person and those movies
MATCH (p:Person)
WITH p, size((p)-[:ACTED_IN]->(:Movie)) AS movies
WHERE movies >= 5
OPTIONAL MATCH (p)-[:DIRECTED]->(m:Movie)
RETURN p.name, m.title
108. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 110
Additional processing using WITH
// retrieves all actors that acted in movies, and collects the list of movies for
any actor that acted in more than five movies.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p, collect(m) AS movies
WHERE size(movies) > 5
RETURN p.name, movies
109. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 111
// Write a Cypher query that retrieves all movies that Gene Hackman has acted in,
along with the directors of the movies. In addition, retrieve the actors that acted
in the same movies as Gene Hackman. Return the name of the movie, the name of the
director, and the names of actors that worked with Gene Hackman.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(a2:Person)-[:ACTED_IN]->(m)
WHERE a.name = 'Gene Hackman'
RETURN m.title as movie, d.name AS director , a2.name AS `co-actors`
// Retrieve all Person nodes that have a FOLLOWS relationship (in either direction)
with James Thompson
MATCH (p1:Person)-[:FOLLOWS]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
// Modify the query to retrieve nodes that are one and two hops away
MATCH (p1:Person)-[:FOLLOWS*1..2]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
// Modify the query to retrieve particular nodes that are connected no matter how
many hops are required
MATCH (p1:Person)-[:FOLLOWS*]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
110. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 112
// Retrieve all actors, collecting the list of movies each actor has acted in
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name as actor, collect(m.title) AS `movie list`
// Retrieve all movies that Tom Cruise has acted in and the co-actors that acted in
the same movie by collecting a list
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person)
WHERE p.name ='Tom Cruise'
RETURN m.title as movie, collect(p2.name) AS `co-actors`
// Retrieve all people who reviewed a movie, returning the list of reviewers and
how many reviewers reviewed the movie
MATCH (p:Person)-[:REVIEWED]->(m:Movie)
RETURN m.title as movie, count(p) as numReviews, collect(p.name) as reviewers
// Retrieve all directors, their movies, and people who acted in the movies,
returning the name of the director, the number of actors the director has worked
with, and the list of actors.
MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person)
RETURN d.name AS director, count(a) AS `number actors` , collect(a.name) AS `actors
worked with`
111. Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 113
// Retrieve the movies that have at least 2 directors, and optionally the names of
people who reviewed the movies.
MATCH (m:Movie)
WITH m, size((:Person)-[:DIRECTED]->(m)) AS directors
WHERE directors >= 2
OPTIONAL MATCH (p:Person)-[:REVIEWED]->(m)
RETURN m.title, p.name
112. Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 114
Eliminating duplication
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.released, collect(DISTINCT m.title) AS movies
You have seen a number of query results where there is duplication in the results returned. In most cases, you want to
eliminate duplicated results. You do so by using the DISTINCT keyword.
Using WITH and DISTINCT to eliminate duplication
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
WITH DISTINCT m
RETURN m.released, m.title
Another way that you can avoid duplication is to use WITH and DISTINCT together as follows:
113. Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 115
Ordering results
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.released, collect(DISTINCT m.title) AS movies ORDER BY m.released DESC
If you want the results to be sorted, you specify the expression to use for the sort using the ORDER BY keyword and
whether you want the order to be descending using the DESC keyword. Ascending order is the default.
114. Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 116
Limiting the number of results
MATCH (m:Movie)
RETURN m.title as title, m.released as year ORDER BY m.released DESC LIMIT 10
Although you can filter queries to reduce the number of results returned, you may also want to limit the number of results.
115. Controlling results returned
17/07/2019 Big Data class by Alexandre Bergere 117
// write a query to retrieve all actors that acted in movies during the 1990s,
where you return the released date, the movie title, and the collected actor names
for the movie. For now do not worry about duplication.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN DISTINCT m.released, m.title, collect(a.name)
// modify the query so that the released date records returned are not duplicated.
To implement this, you must add the collection of the movie titles to the results
returned.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN m.released, collect(m.title), collect(a.name)
// The results returned from the previous query returns the collection of movie
titles with duplicates. That is because there are multiple actors per released
year. Next, modify the query so that there is no duplication of the movies listed
for a year.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN m.released, collect(DISTINCT m.title), collect(a.name)
116. Controlling results returned
17/07/2019 Big Data class by Alexandre Bergere 118
// Retrieve the top 5 ratings and their associated movies, returning the movie
title and the rating.
MATCH (:Person)-[r:REVIEWED]->(m:Movie)
RETURN m.title AS movie, r.rating AS rating
ORDER BY r.rating DESC LIMIT 5
117. Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 119
Unwinding lists
// create a list with three elements, unwind the list and then return the values
WITH [1, 2, 3] AS list
UNWIND list AS row
RETURN list, row
There may be situations where you want to perform the opposite of collecting results and instead separate a list
into individual rows. This is done using the UNWIND clause.
The UNWIND clause is frequently used when importing data into a graph.
118. Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 120
Dates
MATCH (actor:Person)-[:ACTED_IN]->(:Movie)
WHERE exists(actor.born)
// calculate the age
WITH DISTINCT actor, date().year - actor.born as age
RETURN actor.name, age as `age today`
ORDER BY actor.born DESC
Cypher has a built-in date() function, as well as other temporal values and functions that you can use to calculate temporal
values.
You use a combination of numeric, temporal, spatial, list and string functions to calculate values that are useful to your
application. For example, suppose you wanted to calculate the age of a Person node, given a year they were born (the born
property must exist and have a value).
119. Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 121
// Modify the query you just wrote so that before the query processing ends, you
unwind the list of movies and then return the name of the actor and the title of
the associated movie
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p, collect(m) AS movies
WHERE size(movies) > 5
WITH p, movies UNWIND movies AS movie
RETURN p.name, movie.title
// retrieves all movies that Tom Hanks acted in, returning the title of the movie,
the year the movie was released, the number of years ago that the movie was
released, and the age of Tom when the movie was released
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE a.name = 'Tom Hanks'
RETURN m.title, m.released, date().year - m.released as yearsAgoReleased,
m.released - a.born AS `age of Tom`
ORDER BY yearsAgoReleased
127. Azure Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 129
A globally distributed, massively scalable, multi-model database service
Azure Cosmos DB
128. Global Distribution
17/07/2019 Big Data class by Alexandre Bergere 130
o Policy-based geo-fencing
o Dynamically add and remove regions
o Failover priorities
o Dynamically configurable read and write regions
o Geo-local reads and writes
o 99.99% SLA for read availability
Database designed for modern web and mobile applications, which are (typically) global applications in nature.
129. Multi-Master
17/07/2019 Big Data class by Alexandre Bergere 131
Improved write latency for end users
Improved write scalability and write throughput
Better support for disconnected environments (for example, edge devices)
Load balancing
130. Consistency
17/07/2019 Big Data class by Alexandre Bergere 133
Consistency levels and their guarantees:
o Strong: Linearizability (once an operation is complete, it will be visible to all).
o Bounded Staleness: Consistent prefix. Reads lag behind writes by at most k prefixes or t interval. Similar properties to strong consistency (except within the staleness window), while preserving 99.99% availability and low latency.
o Session: Consistent prefix. Within a session: monotonic reads, monotonic writes, read-your-writes, write-follows-reads. Predictable consistency for a session, high read throughput + low latency.
o Consistent Prefix: Reads will never see out-of-order writes (no gaps).
o Eventual: Potential for out-of-order reads. Lowest cost for reads of all consistency levels.
131. COMPREHENSIVE SLAs
17/07/2019 Big Data class by Alexandre Bergere 134
RUN YOUR APP ON WORLD-CLASS INFRASTRUCTURE
Azure Cosmos DB is the only service with financially-backed SLAs for
millisecond latency at the 99th percentile, 99.999% HA and guaranteed
throughput and consistency
Latency: <10 ms at the 99th percentile
HA: 99.999%
Throughput: Guaranteed
Consistency: Guaranteed
132. Trust your data to industry-leading Security & Compliance
17/07/2019 Big Data class by Alexandre Bergere 135
Azure is the world’s most trusted cloud, with more certifications
than any other cloud provider.
• Enterprise grade security
• Encryption at Rest
• Encryption is enabled automatically by default
• Comprehensive Azure compliance certification
133. Throughput
17/07/2019 Big Data class by Alexandre Bergere 136
Request unit calculator
Request unit considerations:
o Item size
o Item property count
o Data consistency
o Indexed properties
o Document indexing
o Script usage
The currency of Azure Cosmos DB is the request unit (RU). With request units, you don't need to reserve read/write capacities or provision CPU, memory, and IOPS.
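As a minimal sketch (not part of the original slides), the JavaScript SDK (@azure/cosmos) reports the RU cost of each operation; the endpoint, key, database and container names below are placeholders read from the environment.
const { CosmosClient } = require("@azure/cosmos");

async function showRequestCharge() {
    const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT, key: process.env.COSMOS_KEY });
    const container = client.database("demo").container("tickets");
    // fetchAll() returns a FeedResponse that carries the RU cost of the query.
    const response = await container.items
        .query("SELECT * FROM tickets t WHERE t.pricePaid > 500")
        .fetchAll();
    console.log(`Returned ${response.resources.length} items for ${response.requestCharge} RUs`);
}

showRequestCharge().catch(console.error);
Watching the requestCharge while tuning item size, indexed properties, and query shape is how the considerations listed above translate into a concrete RU budget.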
134. Serverless database
17/07/2019 Big Data class by Alexandre Bergere 137
Serverless computing is all about the ability to focus on individual pieces of logic that are repeatable and stateless.
o no infrastructure management.
o consume resources only for the seconds, or milliseconds, they run for.
o Azure Cosmos DB trigger to invoke an Azure Function
o Use an input binding to get data from Azure Cosmos DB
o Use an output binding to write data to Azure Cosmos DB (see the sketch below)
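As an illustration (not from the original deck), here is a minimal Azure Functions handler in JavaScript. It assumes a function.json that declares a cosmosDBTrigger input binding named documents and a cosmosDB output binding named outputDocument; all names and the target container are placeholders.
module.exports = async function (context, documents) {
    // "documents" is delivered by the Cosmos DB change feed trigger.
    if (documents && documents.length > 0) {
        context.log(`Processing ${documents.length} changed document(s)`);
        // Write a lightweight projection of each change through the output binding.
        context.bindings.outputDocument = documents.map(doc => ({
            id: doc.id,
            processedAt: new Date().toISOString()
        }));
    }
};
The function itself holds no state and runs only when changes arrive, which is the serverless property described above.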
135. Serverless database
17/07/2019 Big Data class by Alexandre Bergere 139
Serverless computing is all about the ability to focus on individual pieces of logic that are repeatable and stateless.
o no infrastructure management.
o consume resources only for the seconds, or milliseconds, they run for.
136. Cosmos DB Change Feed
17/07/2019 Big Data class by Alexandre Bergere 140
138. Top 10 reasons why customers use
Azure Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 142
1. Different types of data
2. Multi-tenancy and enterprise-grade security
3. Turnkey global distribution capability
4. Mission critical
5. Massive storage/throughput scalability to optimize for speed and cost
6. 5 well-defined consistency models
7. Analytics-ready big data
8. Event-driven architectures
9. Single-digit millisecond latency at the 99th percentile, worldwide
10. High availability and reliability
139. Powering global solutions
17/07/2019 Big Data class by Alexandre Bergere 143
Azure Cosmos DB was built to support modern app patterns and use cases.
It enables industry-leading organizations to unlock the value of data, and respond to
global customers and changing business dynamics in real-time.
o Data distributed and available globally: puts data where your users are
o Build real-time customer experiences: enable latency-sensitive personalization, bidding, and fraud detection
o Ideal for gaming, IoT & eCommerce: predictable and fast service, even during traffic spikes
o Simplified development with serverless architecture: fully-managed event-driven micro-services with elastic computing power
o Run Spark analytics over operational data: accelerate insights from fast, global data
o Lift and shift NoSQL data: lift and shift MongoDB and Cassandra workloads
140. Data distributed and available globally
17/07/2019 Big Data class by Alexandre Bergere 144
Put your data where your users are to give real-time access and
uninterrupted service to customers anywhere in the world.
o Turnkey global data replication across all Azure regions
o Guaranteed low-latency experience for global users
o Resiliency for high availability and disaster recovery
141. Build Real-Time Customer experiences
17/07/2019 Big Data class by Alexandre Bergere 145
Offer latency-sensitive applications with personalization, bidding, and
fraud-detection.
o Machine learning models generate real-time
recommendations across product catalogues
o Product analysis in milliseconds
o Low-latency ensures high app performance worldwide
o Tunable consistency models for rapid insight
[Diagram: Online Recommendations Service (HOT path) and Offline Recommendations Engine (COLD path).]
142. Ideal for gaming, IoT and ecommerce
17/07/2019 Big Data class by Alexandre Bergere 146
Maintain service quality during high-traffic periods requiring
massive scale and performance.
o Instant, elastic scaling handles traffic bursts
o Uninterrupted global user experience
o Low-latency data access and processing for large and
changing user bases
o High availability across multiple data centers
143. Massive Scale Telemetry Stores for IOT
17/07/2019 Big Data class by Alexandre Bergere 147
Diverse and unpredictable IoT sensor workloads require a
responsive data platform
o Seamless handling of any data output or volume
o Data made available immediately, and indexed
automatically
o High writes per second, with stable ingestion and
query performance
144. Simplified development with serverless architecture
17/07/2019 Big Data class by Alexandre Bergere 148
Experience decreased time-to-market, enhanced scalability, and
freedom from framework management with event-driven
micro-services.
o Seamless handling of any data output or volume
o Data made available immediately, and indexed
automatically
o High writes per second, with stable ingestion and
query performance
o Real-time, resilient change feeds logged forever and
always accessible
o Native integration with Azure Functions
145. Run Spark over operational data
17/07/2019 Big Data class by Alexandre Bergere 149
Accelerate analysis of fast-changing, high-volume, global data.
o Real-time big data processing across any data model
o Machine learning at scale over globally-distributed data
o Speeds analytical queries with automatic indexing and
push-down predicate filtering
o Native integration with Spark Connector
146. Lift and shift NoSQL apps
17/07/2019 Big Data class by Alexandre Bergere 150
Make data modernization easy with seamless lift and shift
migration of NoSQL workloads to the cloud.
o Azure Cosmos DB APIs for MongoDB and Cassandra
bring app data from anywhere to Azure Cosmos DB
o Leverage existing tools, drivers, and libraries, and
continue using existing apps’ current SDKs
o Turnkey geo-replication
o No infrastructure or VM management required
149. Document Data Model
17/07/2019 Big Data class by Alexandre Bergere 153
“Because at the end of the day, it’s all just keys and values – not just the key-value data model, but all these data models.”
“When it comes to actually building applications – well, that’s the developer’s job, and this is where the decision of which data model to
choose comes into play.”
o Document: SQL API (JSON), MongoDB API
o Graph: Gremlin API (graph traversal language)
o Key-Value: Table API (replaces Azure Table Storage)
o Columnar: Cassandra API
150. Atom Record Sequence (ARS)
17/07/2019 Big Data class by Alexandre Bergere 154
Your data is always stored as ARS – or Atom Record Sequence – a Microsoft creation that defines the
persistence layer for key-value pairs.
Switching Between Data Models
choosing an API = choosing a data model
151. Switching Between Data Models
17/07/2019 Big Data class by Alexandre Bergere 155
Each data model is merely a projection of the same underlying ARS format, and so eventually you will be
able to create a single account, and then switch freely between different APIs within the account. So that
then, you’ll be able to access one database as graph, key-value, document, or columnar, all at once.
Future release?
155. Resource Model
17/07/2019 Big Data class by Alexandre Bergere 159
Account > Database > Container > Item
A container surfaces as a collection, a graph, or a table, depending on the API.
156. Handle any data with no schema or indexing required
17/07/2019 Big Data class by Alexandre Bergere 160
Azure Cosmos DB’s schema-less service automatically indexes all your data,
regardless of the data model, to deliver blazing-fast queries.
Item: Geek mug; Color: Graphite; Microwave safe: Yes; Liquid capacity: 16oz; CPU/Memory/Storage: ???
Item: Coffee Bean mug; Color: Tan; Microwave safe: No; Liquid capacity: 12oz; CPU/Memory/Storage: ???
Item: Surface Book; Color: Gray; Microwave safe/Liquid capacity: ???; CPU: 3.4 GHz Intel Skylake Core i7-6600U; Memory: 16GB; Storage: 1 TB SSD
o Automatic index management
o Synchronous auto-indexing
o No schemas or secondary indices needed
o Works across every data model
157. Index
17/07/2019 Big Data class by Alexandre Bergere 161
Schema-agnostic, automatic indexing
o Automatically index every property of every record without
having to define schemas and indices upfront.
o No need for schema and index management
o Works across every data model
o Latch free data structure for highly write-optimized database
engine
o Multiple index types: Hash, range, and geospatial
158. Index POLICIES
17/07/2019 Big Data class by Alexandre Bergere 162
CUSTOM INDEXING POLICIES
Though all Azure Cosmos DB data is indexed by default, you
can specify a custom indexing policy for your collections. Custom
indexing policies allow you to design and customize the shape of
your index while maintaining schema flexibility.
o Define trade-offs between storage, write and query
performance, and query consistency
o Include or exclude documents and paths to and from the
index
o Configure various index types
{
"automatic": true,
"indexingMode": "Consistent",
"includedPaths": [{
"path": "/*",
"indexes": [{
"kind": "Hash",
"dataType": "String",
"precision": -1
}, {
"kind": "Range",
"dataType": "Number",
"precision": -1
}, {
"kind": "Spatial",
"dataType": "Point"
}]
}],
"excludedPaths": [{
"path": "/nonIndexedContent/*"
}]
}
159. Resource Model in Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 163
161. SQL SYNTAX
17/07/2019 Big Data class by Alexandre Bergere 165
Using the popular query language, SQL, to access semi-
structured JSON data.
This module will reference querying in the context of the SQL
API for Azure Cosmos DB.
162. SQL QUERY SYNTAX
17/07/2019 Big Data class by Alexandre Bergere 166
BASIC QUERY SYNTAX
The SELECT & FROM keywords are the basic components of
every query.
> SELECT
tickets.id,
tickets.pricePaid
FROM tickets
> SELECT
t.id,
t.pricePaid
FROM tickets t
163. SQL QUERY SYNTAX - WHERE
17/07/2019 Big Data class by Alexandre Bergere 167
FILTERING
WHERE supports complex scalar expressions including
arithmetic, comparison and logical operators
> SELECT
tickets.id,
tickets.pricePaid
FROM tickets
WHERE
tickets.pricePaid > 500.00 AND
tickets.pricePaid <= 1000.00
164. SQL QUERY SYNTAX - PROJECTION
17/07/2019 Big Data class by Alexandre Bergere 168
PROJECTION
If your workloads require a specific JSON schema, Azure
Cosmos DB supports JSON projection within its queries
> SELECT {
"id": tickets.id,
"flightNumber": tickets.assignedFlight.flightNumber,
"purchase": {
"cost": tickets.pricePaid
},
"stops": [
tickets.assignedFlight.origin,
tickets.assignedFlight.destination
]
} AS ticket
FROM tickets
165. SQL QUERY SYNTAX - PROJECTION
17/07/2019 Big Data class by Alexandre Bergere 169
PROJECTION
If your workloads require a specific JSON schema, Azure
Cosmos DB supports JSON projection within its queries
> SELECT VALUE {
"id": tickets.id,
"flightNumber": tickets.assignedFlight.flightNumber,
"purchase": {
"cost": tickets.pricePaid
},
"stops": [
tickets.assignedFlight.origin,
tickets.assignedFlight.destination
]
}
FROM tickets
166. INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 170
Azure Cosmos DB supports intra-document JOINs for de-normalized arrays
Let’s assume that we have two JSON documents in a collection:
{
"pricePaid": 575.5,
"assignedFlight": {
"number": "F125",
"origin": "SEA",
"destination": "JFK"
},
"seat": “12A",
"requests": [
"kosher_meal",
"aisle_seat"
],
"id": "6ebe1165836a"
}
{
"pricePaid": 234.75,
"assignedFlight": {
"number": "F752",
"origin": "SEA",
"destination": "LGA"
},
"seat": "14C",
"requests": [
"early_boarding",
"window_seat"
],
"id": "c4991b4d2efc"
}
167. INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 171
We can filter on a particular array index position without JOIN:
> SELECT
tickets.assignedFlight.number,
tickets.seat,
tickets.requests
FROM
tickets
WHERE
tickets.requests[1] = "aisle_seat"
[
{
"number":"F125","seat":"12A",
"requests": [
"kosher_meal",
"aisle_seat"
]
}
]
168. INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 172
JOIN allows us to merge embedded documents or arrays across multiple documents and
return a flattened result set:
> SELECT
tickets.assignedFlight.number,
tickets.seat,
requests
FROM
tickets
JOIN
requests IN tickets.requests
[
{
"number":"F125","seat":"12A",
"requests":"kosher_meal"
},
{
"number":"F125","seat":"12A",
"requests":"aisle_seat"
},
{
"number":"F752","seat":"14C",
"requests":"early_boarding"
},
{
"number":"F752","seat":"14C",
"requests":"window_seat"
}
]
169. INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 173
Along with JOIN, we can also filter the cross products without knowing the array index
position:
> SELECT
tickets.id, requests
FROM
tickets
JOIN
requests IN tickets.requests
WHERE
requests
IN ("aisle_seat", "window_seat")
[
{
"number":"F125","seat":"12A“,
"requests": "aisle_seat"
},
{
"number":"F752","seat":"14C",
"requests": "window_seat"
}
]
171. Cosmos DB Emulator
17/07/2019 Big Data class by Alexandre Bergere 175
The Azure Cosmos DB Emulator provides a local environment that emulates the Azure Cosmos DB service for development
purposes. Using the Azure Cosmos DB Emulator, you can develop and test your application locally, without creating an Azure
subscription or incurring any costs. When you're satisfied with how your application is working in the Azure Cosmos DB
Emulator, you can switch to using an Azure Cosmos DB account in the cloud.
At this time the Data Explorer in the emulator only fully supports SQL API collections and MongoDB collections. Table, Graph, and Cassandra containers are not
fully supported.
The Azure Cosmos DB Emulator provides a high-fidelity emulation of the Azure Cosmos DB service. It supports identical
functionality as Azure Cosmos DB, including support for creating and querying JSON documents, provisioning and scaling
collections, and executing stored procedures and triggers. You can develop and test applications using the Azure Cosmos DB
Emulator, and deploy them to Azure at global scale by just making a single configuration change to the connection endpoint for
Azure Cosmos DB.
The Azure Cosmos DB Emulator by default runs on the local machine ("localhost") listening on port 8081.
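As a hedged illustration (not in the original deck), the following JavaScript SDK snippet points an application at the emulator's default endpoint; the key is read from an environment variable and the database/container names are placeholders. Switching to a real Azure Cosmos DB account only requires changing the endpoint and key.
const { CosmosClient } = require("@azure/cosmos");

// The emulator listens on https://localhost:8081 by default; its authorization key
// comes from the emulator documentation (injected here via an environment variable).
const client = new CosmosClient({
    endpoint: "https://localhost:8081",
    key: process.env.COSMOS_EMULATOR_KEY
});

async function main() {
    const { database } = await client.databases.createIfNotExists({ id: "demo" });
    const { container } = await database.containers.createIfNotExists({ id: "items" });
    await container.items.create({ id: "1", category: "local-test" });
}

main().catch(console.error);
Depending on your local TLS setup, you may also need to trust the emulator's self-signed certificate before the connection succeeds.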
172. Azure Cosmos DB : Data migration tools
17/07/2019 Big Data class by Alexandre Bergere 176
Data Migration Tools support: SQL API, MongoDB API, Graph API, Table API, Cassandra API
173. Cosmos DB Explorer
17/07/2019 Big Data class by Alexandre Bergere 177
With Cosmos DB Explorer you can:
o Take advantage of the full screen real estate for your queries and
results.
o Access your database account and collections with a connection string,
without needing access to the Azure subscription or portal.
o Share query results with authorized peers who do not have Azure
portal access.
o Work with Cosmos DB data without having to download any desktop
tools locally.
https://cosmos.azure.com/
174. Azure Cosmos DB – Interface demo
17/07/2019 Big Data class by Alexandre Bergere 178
175. Azure Cosmos DB – SQL Query Exercise
17/07/2019 Big Data class by Alexandre Bergere 179
Add data using Data Explorer
https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-
sql-api/3-add-data
Explore SQL query types
https://docs.microsoft.com/en-ie/learn/modules/access-data-
with-cosmos-db-and-sql-api/4-query-types
176. 17/07/2019 Big Data class by Alexandre Bergere 180
Add Cosmos DB to your architecture
179. Stored Procedures
17/07/2019 Big Data class by Alexandre Bergere 183
BENEFITS
o Familiar programming language
o Atomic Transactions
o Built-in Optimizations
o Business Logic Encapsulation
Stored procedures perform complex transactions on documents and properties.
Stored procedures are written in JavaScript and are stored in a container on Azure
Cosmos DB. By performing the stored procedures on the database engine and
close to the data, you can improve performance over client-side programming.
Stored procedures are the only way to achieve atomic transactions within Azure
Cosmos DB; the client-side SDKs do not support transactions.
Performing batch operations in stored procedures is also recommended because
of the reduced need to create separate transactions.
180. Simple Stored Procedure
17/07/2019 Big Data class by Alexandre Bergere 184
function createSampleDocument(documentToCreate) {
var context = getContext();
var collection = context.getCollection();
var accepted = collection.createDocument(
collection.getSelfLink(),
documentToCreate,
function (error, documentCreated) {
context.getResponse().setBody(documentCreated.id)
}
);
if (!accepted) return;
}
181. Multi-DOCUMENT Transactions
17/07/2019 Big Data class by Alexandre Bergere 185
DATABASE TRANSACTIONS
In a typical database, a transaction can be defined as a sequence of operations performed as a single
logical unit of work. Each transaction provides ACID guarantees.
In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence,
requests made within stored procedures and triggers execute in the same scope of a database
session.
[Diagram: a single transaction containing Create New Document, Query Collection, Update Existing Document, and Delete Existing Document operations.]
Stored procedures utilize snapshot isolation to guarantee all reads within the transaction will see a consistent snapshot of the data.
182. Bounded Execution
17/07/2019 Big Data class by Alexandre Bergere 186
EXECUTION WITHIN TIME BOUNDARIES
All Azure Cosmos DB operations must complete within the server-specified request timeout duration. If an
operation does not complete within that time limit, the transaction is rolled back.
HELPER BOOLEAN VALUE
All functions under the collection object (for create, read, replace, and delete of documents and
attachments) return a Boolean value that represents whether that operation will complete:
o If true, the operation is expected to complete
o If false, the time limit will soon be reached and your function should end execution as soon as
possible.
183. Transaction Continuation Model
17/07/2019 Big Data class by Alexandre Bergere 187
CONTINUING LONG-RUNNING TRANSACTIONS
o JavaScript functions can implement a continuation-based model to batch/resume execution
o The continuation value can be any value of your own choosing. This value can then be used by your
applications to resume a transaction from a new “starting point”
[Diagram: bulk-create loop; try to create each document, observe the return value, and either continue until done or return a "pointer" so the batch can be resumed later.]
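To make the bounded-execution and continuation ideas concrete, here is a sketch (illustrative names, not the deck's own code) of a stored procedure that bulk-creates documents and returns the count created so far so the client can resume the batch.
function bulkCreate(docs) {
    var context = getContext();
    var collection = context.getCollection();
    var collectionLink = collection.getSelfLink();
    var count = 0;
    if (!docs || !docs.length) {
        context.getResponse().setBody(0);
        return;
    }
    tryCreate(docs[count], callback);
    function tryCreate(doc, cb) {
        // createDocument returns false when the time limit is near:
        // stop and report how far we got so the caller can resume from this point.
        var accepted = collection.createDocument(collectionLink, doc, cb);
        if (!accepted) context.getResponse().setBody(count);
    }
    function callback(error, created) {
        if (error) throw error;
        count++;
        if (count >= docs.length) {
            context.getResponse().setBody(count); // all documents created
        } else {
            tryCreate(docs[count], callback);
        }
    }
}
The client compares the returned count with the size of its array and re-invokes the procedure with the remaining documents, which is the "pointer to resume later" from the diagram.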
184. Control Flow
17/07/2019 Big Data class by Alexandre Bergere 188
JAVASCRIPT CONTROL FLOW
Stored procedures allow you to naturally express control flow, variable scoping, assignment, and
integration of exception handling primitives with database transactions directly in terms of the JavaScript
programming language.
ES6 PROMISES
ES6 promises can be used in Azure Cosmos DB stored procedures. Unfortunately,
promises “swallow” exceptions by default, so it is recommended to use callbacks instead of ES6 promises.
185. Stored Procedure Control Flow
17/07/2019 Big Data class by Alexandre Bergere 189
function createTwoDocuments(docA, docB) {
    var context = getContext();
    var collection = context.getCollection();
    var collectionLink = collection.getSelfLink();
    // Create the first document; docACallback runs when the write completes.
    var aAccepted = collection.createDocument(collectionLink, docA, docACallback);
    if (!aAccepted) return;
    function docACallback(error, created) {
        // Create the second document from inside the first callback.
        var bAccepted = collection.createDocument(collectionLink, docB, docBCallback);
        if (!bAccepted) return;
    };
    function docBCallback(error, created) {
        context.getResponse().setBody({
            "firstDocId": created.id,
            "secondDocId": created.id
        });
    };
}
186. Rolling Back Transactions
17/07/2019 Big Data class by Alexandre Bergere 190
TRANSACTION ROLL-BACK
Inside a JavaScript function, all operations are automatically wrapped under a single transaction:
o If the function completes without any exception, all data changes are committed
o If there is any exception that’s thrown from the script, Azure Cosmos DB’s JavaScript runtime will
roll back the whole transaction.
[Diagram: a transaction scope containing Create New Document, Query Collection, Update Existing Document, and Delete Existing Document; if an exception occurs, all changes are undone.]
187. Transaction ROLLBACK in Stored Procedure
17/07/2019 Big Data class by Alexandre Bergere 191
collection.createDocument(
    collection.getSelfLink(),
    documentToCreate,
    function (error, documentCreated) {
        if (error) throw "Unable to create document, aborting...";
    }
);
collection.replaceDocument(
    documentToReplace._self,
    replacementDocument,
    function (error, documentReplaced) {
        if (error) throw "Unable to update document, aborting...";
    }
);
188. User-defined Functions
17/07/2019 Big Data class by Alexandre Bergere 192
UDF
User-defined functions (UDFs) are used to extend the Azure Cosmos DB SQL API’s query language
grammar and implement custom business logic. UDFs can only be called from inside queries
They do not have access to the context object and are meant to be used as compute-only code.
189. User-Defined Function Definition
17/07/2019 Big Data class by Alexandre Bergere 193
var taxUdf = {
id: "tax",
serverScript: function tax(income) {
if (income == undefined)
    throw 'no input';
if (income < 1000)
return income * 0.1;
else if (income < 10000)
return income * 0.2;
else
return income * 0.4;
}
}
190. User-Defined Function USAGE in Queries
17/07/2019 Big Data class by Alexandre Bergere 194
> SELECT
*
FROM
TaxPayers t
WHERE
udf.tax(t.income) > 20000
193. Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 197
Embedded
“The guiding premise when normalizing data is to avoid storing redundant data on each
record and rather refer to data.”
Embedding data
194. Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 198
Embedded data
When to embed (an illustrative document is sketched below):
o There are contains relationships between entities.
o There are one-to-few relationships between entities.
o There is embedded data that changes infrequently.
o There is embedded data that won't grow without bound.
o There is embedded data that is integral to data in a document.
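As a hedged illustration of these rules (not from the original deck), a customer document that embeds its one-to-few, rarely changing data might look like this:
{
    "id": "1",
    "firstName": "Thomas",
    "lastName": "Andersen",
    "addresses": [
        { "line1": "100 Some Street", "city": "Seattle", "zip": "98012" }
    ],
    "contactDetails": [
        { "email": "thomas@andersen.com" },
        { "phone": "+1 555 555-5555" }
    ]
}
A person has only a few addresses and contact details, they change rarely, and they are always read together with the person, so embedding keeps the read to a single document.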
195. Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 199
Referenced data
The problem with this example is that the comments array is unbounded, meaning that there is no (practical) limit to the
number of comments any single post can have.
Referencing data
197. Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 201
Referenced data
When to reference (example documents are sketched below):
o Representing one-to-many relationships.
o Representing many-to-many relationships.
o Related data changes frequently.
o Referenced data could be unbounded.
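As a hedged illustration (the names are made up), the unbounded comments from the example above can be split into their own documents that reference the post:
{ "id": "post-1", "title": "Modelling data in Azure Cosmos DB", "author": "alex" }

{ "id": "comment-1", "postId": "post-1", "text": "Great write-up!" }
{ "id": "comment-2", "postId": "post-1", "text": "Very helpful, thanks." }
Each new comment grows the collection rather than a single document, and the comments for a post are retrieved with a query on postId.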
198. Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 202
Where do I put the relationship?
We have dropped the unbounded collection on the publisher document.
Instead we just have a reference to the publisher on each book document.
201. Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 205
Hybrid data models
Pre-calculating aggregate values saves expensive processing on a read operation. In
the example, some of the data embedded in the author document is data that is
calculated at run-time. Every time a new book is published, a book document is
created and the countOfBooks field is set to a calculated value based on the number of
book documents that exist for a particular author. This optimization would be good in
read heavy systems where we can afford to do computations on writes in order to
optimize reads.
We could've just stuck with id and left the application to get any additional information
it needed from the respective author document using the "link", but because our
application displays the author's name and a thumbnail picture with every book
displayed we can save a round trip to the server per book in a list by
denormalizing some data from the author.
Sure, if the author's name changed or they wanted to update their photo we'd have to
go and update every book they ever published, but for our application, based on the
assumption that authors don't change their names very often, this is an acceptable
design decision.
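A hedged sketch of the hybrid model described above (illustrative values): the author document carries a pre-calculated countOfBooks, and each book denormalizes the author's name and thumbnail so a list of books can be rendered without an extra round trip per book.
{
    "id": "a1",
    "name": "Thomas Andersen",
    "thumbnailUrl": "https://example.org/authors/a1.png",
    "countOfBooks": 3,
    "books": ["b1", "b2", "b3"]
}

{
    "id": "b1",
    "name": "Azure Cosmos DB 101",
    "authors": [
        { "id": "a1", "name": "Thomas Andersen", "thumbnailUrl": "https://example.org/authors/a1.png" }
    ]
}
If the author's name or photo changes, every book document must be updated, which is the write-side cost accepted in exchange for cheaper reads.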
205. Azure Cosmos DB - Change Feed Lab
17/07/2019 Big Data class by Alexandre Bergere 209
206. Cosmos DB & Spark
17/07/2019 Big Data class by Alexandre Bergere 210
207. Broadcast Real-time Updates from Cosmos DB with SignalR
Service and Azure Functions
17/07/2019 Big Data class by Alexandre Bergere 211
208. Advanced Analytics on big data architecture
17/07/2019 Big Data class by Alexandre Bergere 212
209. STRIIM FOR AZURE COSMOS DB
17/07/2019 Big Data class by Alexandre Bergere 213
Continuous, Real-Time Data Movement
210. Querying An Azure Cosmos DB Database using the SQL API
17/07/2019 Big Data class by Alexandre Bergere 214
https://cosmosdb.github.io/labs/dotnet/technical_deep_dive/03-querying_the_database_using_sql.html
Azure Data Factory
Azure Cosmos DB
Visual Studio Code
212. How Skype modernized its backend infrastructure using Azure
Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 216
Lessons learned
Looking back at the project, Kaduk recalls several “lessons learned.” These include:
o Use direct mode for better performance – How a client connects to Azure Cosmos DB has important performance implications, especially
with respect to observed client side latency. The team began by using the default Gateway Mode connection policy, but switched to a Direct
Mode connection policy because it delivers better performance.
o Learn how to write and handle stored procedures – With Azure Cosmos DB, transactions can only be implemented using stored
procedures—pieces of application logic that are written in JavaScript that are registered and executed against a collection as a single
transaction. (In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored
procedures execute in the same scope of a database session, which enables Azure Cosmos DB to guarantee ACID for all operations that are
part of a single stored procedure.)
o Pay attention to query design – With Azure Cosmos DB, queries have a large impact in terms of RU consumption. Developers didn’t pay
much attention to query design at first, but soon found that RU costs were higher than desired. This led to an increased focus on optimizing
query design, such as using point document reads wherever possible and optimizing the query selections per API.
o Use the Azure Cosmos DB SDK 2.x to optimize connection usage – Within Azure Cosmos DB, the data stored in each region is distributed
across tens of thousands of physical partitions. To serve reads and writes, the Azure Cosmos DB client SDK must establish a connection with
the physical node hosting the partition. The team started by using the Azure Cosmos DB SDK 1.x, but found that its lack of support for
connection multiplexing led to excessive connection establishment and closing rates. Switching to the Azure Cosmos DB SDK 2.x, which
supports connection multiplexing, helped solve the problem —and also helped mitigate SNAT port exhaustion issues.