Graph databases store data in graph structures with nodes, edges, and properties. Neo4j is a popular open-source graph database that uses a property graph model. It has a core API for programmatic access, indexes for fast lookups, and Cypher for graph querying. Neo4j provides high availability through master-slave replication and scales horizontally by sharding graphs across instances through techniques like cache sharding and domain-specific sharding.
2. Graph Databases
• Graph Based NoSQL Database
• Property Graph Model
• Neo4j
• Neo4j Architecture
• Data Storage
• Programmatic Data Access
• Core API
• Lucene
• Auto Index lifecycle
• Traversers API
• Cypher
• Graph Algorithms
• Neo4j HA
• Cache Sharding
• References
3. Graphs
• A collection of nodes (things) and edges (relationships) that
connect pairs of nodes
– Suitable for any data that is related
• Can attach properties (key-value pairs) on nodes and
relationships
• Each relationship connects two nodes; both nodes and
relationships can hold an arbitrary number of key-value pairs
6. Graphs
• Well-understood patterns and algorithms
– Studied since Leonhard Euler's Seven Bridges of Königsberg (1736)
– Codd's Relational Model (1970)
• Knowledge graph - beyond links, search is smarter when considering how things
are related
• Facebook graph search – people interested in finding things in their part of the
world
• Bing + Britannica: referencing and cross-referencing
• People - relationships to people, to organizations, to places, to things - personal
graph
7. A Graph Database
• Relationships are first-class citizens
• NoSQL database optimized for connected data
– Social networking, logistics networks, recommendation engines
– Relationships are as important as the records
– Can be up to 1000 times faster than an RDBMS for connected data
• Uses graph structures with nodes, edges and properties to store data
• Open source graph databases - Neo4j, InfiniteGraph, InfoGrid, OrientDB
• Very fast querying across records
9. A Graph Database
• Transactional with the usual operations
• RDBMS - can tell sales in last year
• Graph database – can tell customer which book to buy next
• Index-free adjacency
– Every node holds direct pointers to its adjacent elements, so no index lookup is needed to traverse
• Edges hold most of the important information and relations
– nodes to other nodes
– nodes to properties
10. Graph Based NoSQL Database
• No rigid format of SQL or the tables and columns representation
• Uses a flexible graphical representation - addresses scalability concerns
• Data can be easily transformed from one model to the other using a
graph based NoSQL database
• Nodes are organised by some relationships with one another represented
by edges between the nodes
• Both nodes and the relationships have some defined properties
11. Graph Based NoSQL Database
• Labelled, directed, attributed multi-graph - the graph contains labelled
nodes with properties, and the relationships between these nodes are
shown by directed edges
• While a relational database can model a graph, traversing each
edge requires a join, which is costly
12. Advantages
• Easier Relationships Analysis
• Very fast for associative data sets
– Like social networks
• Map more directly to object oriented applications
– Object classification and Parent->Child relationships
13. Disadvantages
• If data is just tabular with not much relationship between the
data, graph databases do not fare well
• OLAP support for graph databases not mature
14. Performance Experiment
• Compute whether a path exists between two persons in a social network
• 1000 persons
• Average 50 friends per person
• pathExists(a, b) limited to depth 4
Database              # persons    Query time
Relational database   1,000        2,000 ms
Neo4j                 1,000        2 ms
Neo4j                 1,000,000    2 ms
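The pathExists(a, b) check with a depth limit can be sketched as a bounded breadth-first search. This is a minimal illustration, not the code used in the experiment: the class FriendGraph, its methods, and the choice to treat friendship as undirected are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory friendship graph; this is not the Neo4j API.
class FriendGraph {
    private final Map<Integer, Set<Integer>> friends = new HashMap<>();

    // Assumption: friendship is undirected, so both directions are stored.
    void addFriendship(int a, int b) {
        friends.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search that expands at most maxDepth hops from 'from'.
    boolean pathExists(int from, int to, int maxDepth) {
        if (from == to) {
            return true;
        }
        Set<Integer> visited = new HashSet<>();
        visited.add(from);
        Deque<Integer> frontier = new ArrayDeque<>();
        frontier.add(from);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<Integer> next = new ArrayDeque<>();
            for (int person : frontier) {
                Set<Integer> neighbours =
                        friends.getOrDefault(person, Collections.<Integer>emptySet());
                for (int friend : neighbours) {
                    if (friend == to) {
                        return true;
                    }
                    // Set.add returns false for already visited persons.
                    if (visited.add(friend)) {
                        next.add(friend);
                    }
                }
            }
            frontier = next;
        }
        return false;
    }
}
```

The cost of such a search depends on the neighbourhood that is explored rather than on the total number of persons, which is consistent with the flat Neo4j timings in the table.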
15. Property Graph Model
name: the Doctor
age: 907
species: Time Lord
first name: Rose
last name: Tyler
vehicle: Skoda
model: Type 40
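The property graph model shown on this slide can be sketched with two small classes, where both nodes and relationships carry key-value properties. The class names PgNode, PgRelationship and PropertyGraphDemo are hypothetical, and the relationship type COMPANION_OF and its property are assumptions made only for illustration; none of this is the Neo4j API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical node: a bag of properties plus references to its relationships.
class PgNode {
    final Map<String, Object> properties = new HashMap<>();
    final List<PgRelationship> relationships = new ArrayList<>();
}

// Hypothetical relationship: directed, typed, and with its own properties.
class PgRelationship {
    final PgNode start;
    final PgNode end;
    final String type;
    final Map<String, Object> properties = new HashMap<>();

    PgRelationship(PgNode start, String type, PgNode end) {
        this.start = start;
        this.type = type;
        this.end = end;
        // Both endpoints keep a direct reference to the relationship,
        // so a traversal can follow it without an index lookup.
        start.relationships.add(this);
        end.relationships.add(this);
    }
}

class PropertyGraphDemo {
    static PgRelationship build() {
        PgNode doctor = new PgNode();
        doctor.properties.put("name", "the Doctor");
        doctor.properties.put("age", 907);
        doctor.properties.put("species", "Time Lord");

        PgNode rose = new PgNode();
        rose.properties.put("first name", "Rose");
        rose.properties.put("last name", "Tyler");

        // Assumed relationship type and property, for illustration only.
        PgRelationship rel = new PgRelationship(rose, "COMPANION_OF", doctor);
        rel.properties.put("note", "hypothetical relationship property");
        return rel;
    }
}
```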
17. Neo4j
• A Graph Database
• A Property Graph containing Nodes, Relationships with Properties on
both
• Manage complex, highly connected data
• Scalable - High-performance with High-Availability
– Traverse 1,000,000+ relationships / second on commodity hardware
• Server with REST API, or Embeddable on the JVM
18. Neo4j
• Full ACID transactions
• Schema free, bottom-up data model design
• Stable
• Easier than RDBMS since no need for normalization
• Implemented in Java
• Open Source
19. Neo4j
• Schema free – Data does not have to adhere to any convention
• Support for a wide variety of languages - Java, Python, Perl, Scala, Cypher
• A graph database can be thought of as a key-value store, with full support
for relationships.
• Graph databases don’t avoid design efforts
• Good design still requires effort
20. Why Neo4j?
• The internet is a network of pages connected to each other.
What is a better way to model that than in graphs?
• No time lost fighting with less expressive data-stores
• Easy to implement experimental features
• A single instance of Neo4j can house at most 34 billion nodes,
34 billion relationships and 68 billion properties
21. Neo4j Architecture
[Figure: layered architecture diagram. Labels: Java, Ruby, Clojure…; REST API; JVM Language Bindings; Traversal Framework; Core API; Graph Matching; Caches; Memory-Mapped (N)IO; Filesystem]
23. Data Storage
• Neo4j stores graph data in a number of different store files
• Each store file contains the data for a specific part of the
graph
– neostore.nodestore.db
– neostore.relationshipstore.db
– neostore.propertystore.db
– neostore.propertystore.db.index
– neostore.propertystore.db.strings
– neostore.propertystore.db.arrays
24. Node Store
• Size: 9 bytes
– 1st byte - in-use flag
– Next 4 bytes - ID of first relationship
– Last 4 bytes - ID of first property of node
• Fixed size records enable fast lookups
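Why fixed-size records make lookups fast can be shown with a short sketch: the byte offset of a node record is just the node ID multiplied by the record size. The class NodeRecordSketch and its methods are hypothetical, and plain signed big-endian integers are an assumption; this is not Neo4j's actual store code.

```java
import java.nio.ByteBuffer;

// Sketch of the 9-byte node record described above (hypothetical helper class).
class NodeRecordSketch {
    static final int RECORD_SIZE = 9; // 1 (in-use) + 4 (first relationship) + 4 (first property)

    // With fixed-size records, the offset is a single multiplication.
    static long offsetOf(long nodeId) {
        return nodeId * RECORD_SIZE;
    }

    // Packs one record; integers are written big-endian (ByteBuffer default).
    static byte[] encode(boolean inUse, int firstRelId, int firstPropId) {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_SIZE);
        buf.put((byte) (inUse ? 1 : 0));
        buf.putInt(firstRelId);
        buf.putInt(firstPropId);
        return buf.array();
    }

    // Reads the record for nodeId; returns {inUse (0 or 1), firstRelId, firstPropId}.
    static int[] decode(byte[] store, long nodeId) {
        ByteBuffer buf = ByteBuffer.wrap(store, (int) offsetOf(nodeId), RECORD_SIZE);
        int inUse = buf.get();
        int firstRelId = buf.getInt();
        int firstPropId = buf.getInt();
        return new int[] { inUse, firstRelId, firstPropId };
    }
}
```

No search is involved: reading node 2 means reading 9 bytes starting at byte 18.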
25. Relationship store
• neostore.relationshipstore.db
• Size: 33 bytes
• 1st byte - In use flag
• Next 8 bytes - IDs of the nodes at the start and end of the relationship
• 4 bytes - Pointer to the relationship type
• 16 bytes - pointers to the next and previous relationship records for each of the start and end nodes
• 4 bytes - next property id (property chain)
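The field sizes above add up to 33 bytes (1 + 8 + 4 + 16 + 4). A minimal sketch of the layout as byte offsets follows; the class RelationshipRecordLayout is hypothetical, and the order of the four chain pointers within the 16-byte block is an assumption, since the slide does not state it.

```java
import java.nio.ByteBuffer;

// Byte offsets within one 33-byte relationship record (hypothetical helper class).
class RelationshipRecordLayout {
    static final int IN_USE = 0;          // 1 byte
    static final int START_NODE = 1;      // 4 bytes
    static final int END_NODE = 5;        // 4 bytes
    static final int TYPE = 9;            // 4 bytes
    // Assumed order of the four 4-byte chain pointers:
    static final int START_PREV_REL = 13;
    static final int START_NEXT_REL = 17;
    static final int END_PREV_REL = 21;
    static final int END_NEXT_REL = 25;
    static final int NEXT_PROP = 29;      // 4 bytes
    static final int RECORD_SIZE = NEXT_PROP + 4; // 33 bytes

    // Fixed-size records: the record offset is relId * RECORD_SIZE.
    static long offsetOf(long relId) {
        return relId * RECORD_SIZE;
    }

    // Reads one 4-byte field of the given relationship record.
    static int readInt(byte[] store, long relId, int fieldOffset) {
        return ByteBuffer.wrap(store).getInt((int) offsetOf(relId) + fieldOffset);
    }
}
```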
33. Index a Graph
• Graphs themselves are indexes
• Can create short-cuts to well-known nodes
• In a program, keep a reference to any interesting node
• Indexes offer flexibility in what constitutes an “interesting
node”
34. Lucene
• The default index implementation for Neo4j
– Default implementation for IndexManager
• Supports many indexes per database
• Each index supports nodes or relationships
• Supports exact and regex-based matching
• Supports scoring
– Number of hits in the index for a given item
– Great for recommendations
35. Create a Node Index
GraphDatabaseService db = …
Index<Node> planets = db.index().forNodes("planets");
• Index<Node> - type of the index and of the indexed entities
• "planets" - index name
• forNodes(...) - creates or retrieves the index
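36. Create a Relationship Index
GraphDatabaseService db = …
Index<Relationship> enemies = db.index().forRelationships("enemies");
• Index<Relationship> - type of the index and of the indexed entities
• "enemies" - index name
• forRelationships(...) - creates or retrieves the index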
37. Exact Matches
GraphDatabaseService db = …
Index<Node> actors = db.index().forNodes("actors");
Node rogerDelgado = actors.get("actor", "Roger Delgado").getSingle();
• "actor" - key
• "Roger Delgado" - value to match
• getSingle() - first match only
38. Query Matches
GraphDatabaseService db = …
Index<Node> species = db.index().forNodes("species");
IndexHits<Node> speciesHits = species.query("species", "S*n");
• "species" - key
• "S*n" - query
39. Transactions to Mutate Indexes
• Mutating access is still protected by transactions which cover both index and graph
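GraphDatabaseService db = …
Transaction tx = db.beginTx();
try {
    // Create the node and set the property that will be indexed
    Node nixon = db.createNode();
    nixon.setProperty("character", "Richard Nixon");
    // Add the node to the "characters" index under the same key and value
    db.index().forNodes("characters").add(nixon,
        "character", nixon.getProperty("character"));
    tx.success(); // mark the transaction as successful
} finally {
    tx.finish(); // commits if success() was called, otherwise rolls back
}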
40. Auto Index lifecycle
• Auto Index - stays consistent with the graph data
• Specify the property names to index when the database is created
• If node/relationship or property is removed from the graph it is removed
from the index
• If the database is started with auto indexing enabled but with different
auto-indexed properties than the last run, already auto-indexed entities are
removed from the index as they are worked upon
• Re-indexing is manual
– Existing properties are not indexed unless touched
42. Core API
• Basic (nodes, relationships)
• Fast
• Imperative
• Flexible - Easily intermix mutating operations
43. Traversers API
• Mechanism for querying the graph by navigating from a starting node to
related nodes according to a traversal algorithm
• Expressive
• Fast
• Declarative (mostly)
• Opinionated
44. Cypher - A Graph Query Language
• Query Language for Neo4j
• A declarative graph pattern matching language
– SQL for graphs
– Tabular results
• aggregation, ordering and limits
• Mutating operations
• CRUD
• Easy to formulate queries based on relationships
• Many features stem from improving pain points of SQL like join tables
48. Operations
• Aggregation - COUNT, SUM, AVG, MAX, MIN, COLLECT
• Where clause
start doctor=node:characters(name = 'Doctor')
match (doctor)<-[:PLAYED]-(actor)-[:APPEARED_IN]->(episode)
where actor.actor = 'Tom Baker' and episode.title =~ /.*Dalek.*/
return episode.title
• Ordering
– order by <property>
– order by <property> desc
49. Graph Algorithms
• Neo4j has built-in algorithms
• Callable through JVM and REST APIs
• Higher level of abstraction
• Graph Matching
– Look for patterns in a data set - retail analytics
– Higher-level abstraction than raw traversers
• REST API
– Access the server
• Binary protocol
– JSON as default format
50. Neo4j HA - High Availability Cluster
• A scalability package known as high availability or HA that
uses a master-slave cluster architecture
– Full data redundancy
– Service fault tolerance
– Linear read scalability
– Master-slave replication
• Single data-centre or global zones
– tolerance for high latency
51. Neo4j HA
• Redundancy - improved uptime
– automatic failover
• In a Neo4j HA cluster the full graph is replicated to each instance
(server) in the cluster
• Read operations can be done locally on each slave
• Read capacity of the HA cluster increases linearly with the number of
servers
59. Server Overload Problem
• Unlike other classes of NoSQL database, a graph does not
have predictable lookups since it is a highly mutable structure
• We want to co-locate related nodes for traversal
performance, but we don’t want to place so many connected
nodes on the same database that it becomes heavily loaded
• The black-hole problem - popular nodes get lumped together
on a single instance, which overloads it even though the point cut is low
61. Thinly Spread Network
• The opposite is also true: we don't want connected nodes spread too widely
across different database instances, since traversals that cross the
(relatively high-latency) network incur a substantial performance penalty
at runtime
• Load-leveling alone can lead to many relationships crossing instances
• These are very expensive to traverse; network hops are many orders of
magnitude slower than in-memory traversals
63. Minimal Point Cut
• The best approach is to balance a graph across database instances by
creating a minimum point cut for a graph, where graph nodes are placed
such that there are few relationships that span shards
• A good strategy is to take a local view of the graph (no global locks) and
work incrementally (short bursts)
• Take usage patterns into account
• Unlike other NoSQL stores, graphs are not predictable, so we cannot use
techniques like consistent hashing for scale out
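How good a placement is can be measured by counting the relationships that span shards. The sketch below only evaluates a given placement; it does not compute a minimum cut. The class ShardCut, the edge representation and the shard map are all hypothetical.

```java
import java.util.Map;

// Hypothetical helper that scores a node-to-shard placement.
class ShardCut {
    // edges: each int[2] is {fromNode, toNode}.
    // shardOf: node id -> shard id; every node in 'edges' must have an entry.
    static int crossShardEdges(int[][] edges, Map<Integer, Integer> shardOf) {
        int cut = 0;
        for (int[] edge : edges) {
            Integer fromShard = shardOf.get(edge[0]);
            Integer toShard = shardOf.get(edge[1]);
            // A relationship is expensive to traverse when its ends live on different shards.
            if (!fromShard.equals(toShard)) {
                cut++;
            }
        }
        return cut;
    }
}
```

For two tightly connected clusters joined by a single relationship, placing each cluster on its own shard leaves only that one relationship crossing the network, whereas spreading the nodes evenly without regard to connectivity cuts many more.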
65. Cache Sharding
• A strategy for large data sets of terabyte scale
• Mandates consistent request routing
• For instance, requests for user A are always sent to server 1,
while requests for user B are always sent to server 2 and so on
• The key assumption is that requests for user A typically touch
parts of the graph around user A, such as his or her friends,
preferences, likes and so on
66. Cache Sharding
• This means that the neighbourhood of the graph around user
A will be cached on server 1, while the neighbourhood around
user B will be cached on server 2
• By employing consistent routing of requests, the caches of all
servers in the HA cluster can be utilized maximally
• Strategy is highly effective for managing a large graph that
does not fit in RAM
67. Consistent Routing
• Always try to route related requests to the same server to hopefully
benefit from warm caches
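Consistent request routing can be sketched as a deterministic mapping from a user key to a server. This is a minimal illustration with a simple hash-modulo rule, which is an assumption and not how any particular load balancer or Neo4j component does it; the class ConsistentRouter and the server names are hypothetical.

```java
// Hypothetical router: the same user key is always sent to the same server.
class ConsistentRouter {
    private final String[] servers;

    ConsistentRouter(String... servers) {
        if (servers.length == 0) {
            throw new IllegalArgumentException("at least one server is required");
        }
        this.servers = servers.clone();
    }

    // floorMod keeps the index non-negative even for negative hash codes.
    String serverFor(String userKey) {
        int index = Math.floorMod(userKey.hashCode(), servers.length);
        return servers[index];
    }
}
```

Because requests for one user always land on the same instance, that instance's cache tends to hold the neighbourhood of the graph around that user. Note that a plain modulo rule remaps most keys when the number of servers changes, so the caches would need to warm up again after resizing the cluster.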
68. Domain Specific Sharding
• Graphs are not as easy to shard as documents or KV stores
• High performance graph databases are limited in terms of the data set size
that can be handled by a single machine
• Replicas speed up reads and improve availability, but the data set size
stays limited to a single machine's disk/memory
• No perfect algorithm exists, but an expert's domain insight helps
69. Domain Specific Sharding
• Some domains can shard easily (geo, most web apps) using consistent
routing approach and cache sharding
– Geo - where the connections between cities are few compared with the
connections within the cities. So can place cities or countries on different
nodes
• Eventually, petabyte-scale data cannot practically be replicated
• Need to shard data across machines
70. References
1. http://www.neo4j.org
2. http://www.neo4j.org/learn/cypher
3. Bachman, Michal (2013). GraphAware - Towards Online Analytical Processing in Graph Databases.
http://graphaware.com/assets/bachman-msc-thesis.pdf
4. Hunger, Michael (2012). Cypher and Neo4j. http://vimeo.com/83797381
5. Mistry, Deep. Neo4j: A Developer’s Perspective.
http://osintegrators.com/opensoftwareintegrators%7Cneo4jadevelopersperspective
6. MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
7. Parallel Breadth First Search on GPU Clusters
8. DB-Engines Ranking of Graph DBMS