The document provides an introduction to the Neo4j graph database. It discusses trends driving the adoption of NoSQL databases like increasing data size, connectedness of data, semi-structured data, and changing application architectures. It categorizes NoSQL databases and describes key features of key-value stores, column family databases, document databases, and graph databases. The remainder focuses on graph databases and their suitability for interconnected data, introduces Neo4j as an example graph database, and discusses getting started with Neo4j including its core APIs.
This document discusses collaborative filtering and recommender systems. It begins with an overview of non-relational databases and graph databases. It then discusses collaborative filtering, including calculating similarity scores between users or items, predicting ratings for unseen items, and making recommendations. Specific methods discussed include Euclidean distance, Pearson correlation, and user-based filtering. The goal of collaborative filtering is to increase sales, market share, and targeted advertising by making personalized recommendations to users.
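The similarity and prediction steps named in this summary can be sketched as follows. This is a minimal illustration, not the presentation's own code: the ratings data, the user names, and the function names are all hypothetical, and converting a Euclidean distance to a similarity with 1 / (1 + distance) is one common convention assumed here.

```python
from math import sqrt

# Hypothetical ratings (assumption for illustration): user -> {item: rating}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def shared_items(u, v):
    """Items rated by both users."""
    return sorted(set(ratings[u]) & set(ratings[v]))


def euclidean_similarity(u, v):
    """Map Euclidean distance over shared items into (0, 1]; 0.0 if nothing is shared."""
    items = shared_items(u, v)
    if not items:
        return 0.0
    dist = sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2 for i in items))
    return 1.0 / (1.0 + dist)


def pearson_similarity(u, v):
    """Pearson correlation over shared items; 0.0 when it is undefined."""
    items = shared_items(u, v)
    n = len(items)
    if n < 2:
        return 0.0
    xs = [ratings[u][i] for i in items]
    ys = [ratings[v][i] for i in items]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def predict(user, item):
    """User-based filtering: similarity-weighted average of other users' ratings.

    Only positively correlated users contribute; returns None if there are none.
    """
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        sim = pearson_similarity(user, other)
        if sim <= 0:
            continue
        num += sim * ratings[other][item]
        den += sim
    return num / den if den else None
```

With this toy data, alice and bob rate their shared items in the same pattern, so bob's rating dominates the prediction of item "D" for alice, while carol (negatively correlated) is ignored.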
The document provides an overview of challenges in large-scale web search engines. It discusses scalability and efficiency issues including the size and dynamic nature of the web, high user volumes, and large data center costs. The main sections covered include web crawling, indexing, query processing, and caching. Open research problems are also mentioned such as web partitioning, crawler placement, and coupling crawling with distributed search and indexing.
Facebook uses a distributed systems architecture with services like Memcache, Scribe, Thrift, and HipHop to handle large data volumes and high concurrency. Key components include the Haystack photo storage system, BigPipe for faster page loading, and a PHP front-end optimized using HipHop. Data is partitioned horizontally and services communicate using lightweight protocols like Thrift.
Facebook uses Apache HBase to store messaging data at a massive scale. Some key points:
- HBase stores over 2PB of Facebook messaging data and handles 75+ billion read/write operations per day.
- Facebook migrated over 1 billion user accounts to HBase in 2011 and has since performed two schema changes while keeping the system online and available.
- Significant work has gone into optimizing performance, reliability, availability, and scalability of HBase at Facebook through improvements to compaction, metrics collection, and operational processes.
- Choosing HBase provided benefits like strong consistency, automatic failover, compression, and integration with HDFS for storage.
Monday, January 14, 2012 presentation on three data types (unstructured, structured, and semi-structured) and the role XML plays in content management systems, ONIX (bibliographic data sharing), RSS (Really Simple Syndication), and XML-first publishing for ebooks.
This document discusses transforming open government data from Romania into linked open data. It begins with background on linked data and open data initiatives. Then it describes efforts to model, transform, link, and publish Romanian open data as linked open data. This includes identifying common vocabularies and properties, creating URIs, linking to external datasets like DBpedia, and publishing the linked data for use in applications via a SPARQL endpoint. Overall the goal is to make this data more accessible and interoperable through semantic web standards.
1) The document discusses HTML5 and the W3C standards process. It notes that HTML5 is still a work in progress at the W3C, going through the recommendation process.
2) It also discusses new APIs being developed for HTML5, including vibration and device access APIs, as well as the Semantic Web and Linked Data initiatives.
3) Finally, it mentions the W3C's work on the Web of Things and assigning URIs to real-world objects to integrate them into the Web.
While new parents are often consumed with spending time with their new babies, Miguel Aliaga wants to make sure they have the finance tips they need to ensure a financially secure future for their child.
Mobile Strategy Partners 2010 Mobile Banking Summit Workshop Presentation - David Eads
This document provides an overview of mobile banking for financial institutions. It discusses why institutions implement mobile banking and how mobile affects the entire organization. Key points include measuring adoption and success, understanding the mobile landscape and technologies, connecting to existing infrastructure, considerations for offline customers, and working with existing partners or building solutions internally. The document emphasizes having a clear vision and business case to guide mobile decisions and strategies.
Jamaica Personal Income Tax Guide 2016 Edition (1) - Dawgen Global
The document provides guidance on Jamaica's personal income tax rates and thresholds for 2016-2017, which were increased from the previous levels. Key points include:
- For 2016, the threshold was increased to $1,000,272 from July 1, and the tax rate above $6 million increased to 30% for the latter half of the year.
- For 2017, the threshold further increased to $1,500,096 from April 1.
- Worked examples are provided to illustrate the tax calculations and potential refunds for individuals under the new thresholds.
- Guidance is given for applying the changes for employed and self-employed individuals for the dual tax periods in 2016.
The TDD approach to development has proven itself as a very reliable and fast way to deliver business requirements in code. But most examples in training courses and on the internet show how to apply TDD in very simple situations: input/output-style code, or code that uses stubs for simple dependencies. What about other areas of application development, such as database integration? Can TDD be applied to them? What does TDD give the developer in that case? In my talk I will try to answer these questions and show, with practical examples, how TDD can be useful for database integration code, how it reduces risk, and how it opens the door to database refactoring techniques. As a bonus, I will touch on some NoSQL solutions, which should make the topic even more popular!
P.S. All examples will be demonstrated in Java.
The Pomodoro Technique is a time management method developed by Francesco Cirillo in the late 1980s. The technique uses a timer to break down work into intervals traditionally 25 minutes in length, separated by short breaks.
Presentation by TachyonNexus & Intel at Strata Singapore 2015 - Tachyon Nexus, Inc.
Make Tachyon Ready for Next-Gen Data Center Platforms with NVM.
The talk was presented at Strata Singapore in December 2015 and focused on using Tachyon Tiered Storage with NVM on next-generation data center platforms.
This document summarizes a presentation about Tachyon, an open source memory-centric distributed storage system. It introduces Tachyon and how it can be used with Spark to resolve issues around slow data sharing, in-memory data loss during crashes, and data duplication. The presentation outlines new features in Tachyon 0.8.0 like tiered storage, pluggable data management policies, and a unified namespace across storage systems. It concludes by inviting users and collaborators to try, develop, and get involved with the Tachyon community.
Spark Summit EU 2015: Reynold Xin Keynote - Databricks
This document summarizes Spark's development over the past 12 months and provides a look ahead. It discusses improvements to both the frontend, such as DataFrames and machine learning pipelines, and the backend through projects like Tungsten for performance optimizations. Going forward, it mentions new features like the Dataset API, streaming DataFrames, and potential hardware improvements from technologies like 3D XPoint memory. The overall goal is to provide a unified engine and APIs that can automatically optimize analytics workloads across languages and domains.
Great functional testing with WebDriver and Thucydides - Mikalai Alimenkou
Presentation from online conference ConfeT&QA (October 2012) and Selenium Camp 2013 (February 2013) about techniques and approaches to create great functional automated tests.
Ceph at Work in Bloomberg: Object Store, RBD and OpenStack - Red_Hat_Storage
Bloomberg's Chris Jones and Chris Morgan joined Red Hat Storage Day New York on 1/19/16 to explain how Red Hat Ceph Storage helps the financial giant tackle its data storage challenges.
Epiphany: Connecting Millions of Events to Thirty Billion Data Points in Real... - DataWorks Summit
This document describes Epiphany, Rocket Fuel's real-time attribution platform. It connects millions of events to 50 billion data points to attribute conversions across devices and algorithms. It uses HBase to look up impressions in milliseconds. Data flows from actions keyed by user/impression/conversion days to HBase and Hive tables. It enables idempotent attribution across advertisers and algorithms at scale in real time.
Alluxio Use Cases at Strata+Hadoop World Beijing 2016 - Alluxio, Inc.
1) Alluxio is an open-source virtual distributed storage system that provides memory-speed access to data across various storage platforms including HDFS, S3, and Swift.
2) Alluxio was presented as having four main use cases - using off-heap memory to alleviate resource pressure, enabling fast data sharing between jobs, accelerating access to remote storage, and providing a unified namespace across different storage systems.
3) Case studies demonstrated that Alluxio improved performance and enabled new workflows, with speedups of 15-300x reported for different customers including Barclays, Qunar, and Baidu.
Spark Summit EU 2015: Lessons from 300+ production users - Databricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
CES 2016 Trends and Implications - Havas Tom Goodwin
The document summarizes trends observed at CES 2016. It notes that while hardware changes more slowly than software and expectations, CES 2016 showed an evolution in products that were faster, thinner, cheaper and more connected. Key trends included autonomous mobility with self-driving vehicles; collaborative systems as companies partner to create more value; cognitive robotics becoming more human-like; infinite screens as everything becomes a display; mixed reality with virtual and augmented reality gaining momentum; and diagnostic wearables that closely monitor health metrics.
Alluxio Presentation at Strata San Jose 2016 - Jiří Šimša
Alluxio (formerly Tachyon) provides a unified namespace and tiered storage that allows data to be shared across clusters at memory speed. It is a virtual distributed storage system with a memory-centric architecture that abstracts persistent storage from applications. Alluxio enables data sharing between frameworks by allowing inter-process sharing at memory speed rather than being slowed by network or disk I/O. It also provides data resilience during application crashes by allowing processes to re-read data from memory I/O rather than network or disk I/O. Alluxio further allows consolidating memory usage across applications by preventing data duplication at the memory level.
The document discusses different architects and what they find architecture in, including nature, curving forms, simple geometries, form and function, materials, details, volume, light, technology, and sustainability. It asks what architecture is, stating that architecture is simply a need that is developed through an innovative design for the future. The document encourages sharing ideas to lead to better ideas, and remembering that architecture is more than buildings but is the essence of life.
Rebeca González Eriksen has extensive experience in clinical and dietetic research. She holds doctorates in medicine and nutrigenomics from Imperial College London and master's degrees in nutrition and public health from the London School of Hygiene & Tropical Medicine. She currently works as a clinical dietitian in Marbella, Spain, after holding research positions at several universities and medical institutions in the United Kingdom, Ghana, and Spain.
An Introduction to NOSQL, Graph Databases and Neo4j - Debanjan Mahata
Neo4j is a graph database that stores data in nodes and relationships. It allows for efficient querying of connected data through graph traversals. Key aspects include nodes that can contain properties, relationships that connect nodes and also contain properties, and the ability to navigate the graph through traversals. Neo4j provides APIs for common graph operations like creating and removing nodes/relationships, running traversals, and managing transactions. It is well suited for domains that involve connected, semi-structured data like social networks.
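The property-graph model described here (nodes with properties, typed relationships with properties, and traversals) can be sketched in a few lines. This is a toy in-memory illustration only: the `Graph` class, its method names, and the sample people are hypothetical and are not Neo4j's API.

```python
from collections import deque


class Graph:
    """Toy in-memory property graph (illustrative only, not Neo4j's API)."""

    def __init__(self):
        self.nodes = {}      # node id -> properties dict
        self.out_edges = {}  # node id -> list of (rel_type, target id, properties)

    def create_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.out_edges.setdefault(node_id, [])

    def create_relationship(self, source, rel_type, target, **props):
        # Both endpoints are assumed to have been created already.
        self.out_edges[source].append((rel_type, target, props))

    def traverse(self, start, rel_type, max_depth):
        """Breadth-first traversal along rel_type edges; returns {node id: depth}."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            depth = seen[current]
            if depth == max_depth:
                continue  # do not expand beyond the depth limit
            for rtype, target, _props in self.out_edges[current]:
                if rtype == rel_type and target not in seen:
                    seen[target] = depth + 1
                    queue.append(target)
        del seen[start]  # report only the nodes reached, not the start node
        return seen


# Hypothetical social-network data.
g = Graph()
for name in ("ann", "ben", "cho", "dee"):
    g.create_node(name, name=name.title())
g.create_relationship("ann", "KNOWS", "ben", since=2010)
g.create_relationship("ben", "KNOWS", "cho")
g.create_relationship("cho", "KNOWS", "dee")
g.create_relationship("ann", "WORKS_WITH", "dee")

# Friends and friends-of-friends of ann, following only KNOWS relationships.
reachable = g.traverse("ann", "KNOWS", 2)
```

The traversal touches only the neighbourhood of the start node, which is the intuition behind graph databases suiting highly connected data.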
This document provides an overview of NoSQL databases and their characteristics. It discusses the different eras of databases and pressures that led to the rise of NoSQL databases. It then categorizes and describes the different types of NoSQL databases, including key-value stores, document stores, column family stores, and graph databases. Specific examples like MongoDB, Cassandra, HBase, Neo4j are also outlined. The document emphasizes that the type of database chosen should depend on the problem to be solved and characteristics of the data.
This document provides an overview of NoSQL databases, including a brief history, classifications, pros and cons of usage, and trends. It discusses how NoSQL technologies originated from distributed computing needs and were driven by scalability, parallelization, and costs. Major classifications of NoSQL databases are described as column-oriented stores, key-value stores, document stores, and graph databases. Examples like MongoDB, Cassandra, and Neo4j are outlined. Both benefits and limitations of NoSQL are presented. Emerging trends around SQL access and adoption of Hadoop are also noted.
This document discusses trends driving the adoption of NoSQL databases, including increasing data size, connectivity of information, semi-structured data, and distributed application architectures. It describes four categories of NoSQL databases - aggregate-oriented, key-value stores, column family (BigTable), and document databases - and provides examples and comparisons of their pros and cons.
How to use NoSQL in Enterprise Java Applications - NoSQL Roadshow Zurich - Patrick Baumgartner
This document discusses how to use NoSQL databases in enterprise Java applications. It provides an overview of Spring Data, an open source framework that supports NoSQL and SQL databases. Spring Data provides common infrastructure and repositories to access data stores like MongoDB, Redis, and Neo4J. The presentation includes an example of using Spring Data to access MongoDB, with annotations for entities, configuration for the data store, and repositories for data access. Attendees are encouraged to try Spring Data with a data model that matches their data.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
HBase is a distributed, scalable, big data store that provides fast lookup capabilities like Google BigTable. It uses a table-like data structure with rows indexed by a key and stores data in columns grouped by families. HBase is designed to operate on top of Hadoop HDFS for scalability and high availability. It allows for fast lookups, full table scans, and range scans across large datasets distributed across clusters of commodity servers.
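The data model in this summary (rows indexed by a sorted key, columns grouped into families, and range scans) can be illustrated with a toy single-process table. The `ToyTable` class, the row-key format, and the sample data are assumptions made for illustration; this is not the HBase client API and it ignores distribution, versions, and persistence. The scan is written as start-inclusive and stop-exclusive.

```python
from bisect import bisect_left, insort


class ToyTable:
    """Toy table: sorted row keys, each row holding column families of qualifiers."""

    def __init__(self, families):
        self.families = set(families)
        self.row_keys = []  # row keys kept in sorted order
        self.rows = {}      # row key -> {family: {qualifier: value}}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self.rows:
            insort(self.row_keys, row_key)  # keep keys sorted on insert
            self.rows[row_key] = {f: {} for f in self.families}
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key):
        """Point lookup by row key; None if the row does not exist."""
        return self.rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield (row_key, row) for start_key <= row_key < stop_key, in key order."""
        lo = bisect_left(self.row_keys, start_key)
        hi = bisect_left(self.row_keys, stop_key)
        for key in self.row_keys[lo:hi]:
            yield key, self.rows[key]


# Hypothetical message rows keyed as "<user>#<sequence>", inserted out of order.
table = ToyTable(["msg"])
table.put("user2#001", "msg", "body", "hey")
table.put("user1#002", "msg", "body", "second")
table.put("user1#001", "msg", "body", "first")

# Range scan over all of user1's rows.
keys = [key for key, _row in table.scan("user1", "user2")]
```

Because rows are ordered by key, choosing a key prefix such as the user id turns "all rows for one user" into a contiguous range scan.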
This document discusses Grails integration with Neo4j graph databases. It begins with an introduction to graph databases and Neo4j. It then covers the Grails Neo4j plugin which allows using Neo4j as the persistence layer for Grails domain classes. Finally, it addresses some challenges in mapping the Grails domain model to the Neo4j nodespace and potential solutions.
Traackr evaluated several NoSQL database options to store its heterogeneous, unstructured web data. Document databases were the best fit due to their flexibility to store variable length text like tweets and blog posts without predefined schemas. MongoDB was selected due to its maturity, adoption, and support for ad-hoc queries and batch processing needed by Traackr in early 2010.
This was presented at NHN on Jan. 27, 2009.
It introduces Big Data, its storage, and its analysis.
In particular, it covers the MapReduce debates and hybrid systems that combine an RDBMS with MapReduce.
In addition, it explains various schema-free, non-relational data stores.
How to Get Started with Your MongoDB Pilot Project - DATAVERSITY
Open source, high performance database MongoDB can be used for a pilot project. The document discusses finding a non-critical initial project, getting experience with MongoDB, benchmarking performance, and presenting the business case for broader use. It also outlines steps for moving a successful pilot to production, including using MongoDB's auto-sharding, replication, and commercial support options.
Disclaimer:
The images, company, product and service names that are used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners.
Data and images were collected from various sources on the Internet.
The intention was to present the big picture of Big Data & Hadoop.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Life Science Database Cross Search and MetadataMaori Ito
Life science databases are sometimes difficult to understand due to lack of information. I'd like to add metadata into databases and improve search results.
NoSQL databases provide an alternative to traditional relational databases that is well-suited for large datasets, high scalability needs, and flexible, changing schemas. NoSQL databases sacrifice strict consistency for greater scalability and availability. The document model is well-suited for semi-structured data and allows for embedding related data within documents. Key-value stores provide simple lookup of data by key but do not support complex queries. Graph databases effectively represent network-like connections between data elements.
"Get Ready for Big Data" presentation from Gilbane Boston 2011; for more details, see http://gilbaneboston.com/conference_program.html#t2 and http://pbokelly.blogspot.com/2011/12/gilbane-boston-2011-big-data.html
This document provides an introduction to NoSQL databases. It discusses the history and limitations of relational databases that led to the development of NoSQL databases. The key motivations for NoSQL databases are that they can handle big data, provide better scalability and flexibility than relational databases. The document describes some core NoSQL concepts like the CAP theorem and different types of NoSQL databases like key-value, columnar, document and graph databases. It also outlines some remaining research challenges in the area of NoSQL databases.
Similar to Neo4j Introduction at Imperial College London (20)
This document discusses recommendations engines that use graph databases like Neo4j. It introduces GraphAware, an open-source recommendation engine plugin for Neo4j. The document outlines the business and technical challenges of building recommendation engines, and how GraphAware addresses these challenges through its flexible, high-performance architecture and APIs. It provides an example of building a simple friend recommendation engine using GraphAware.
Advanced Neo4j Use Cases with the GraphAware FrameworkMichal Bachman
The document discusses GraphAware Framework, which makes it easy to build, test, and deploy custom APIs, transaction-driven behavior, and asynchronous computation functionality for Neo4j. It provides examples like representing time series data, tracking graph changes, assigning UUIDs, and running algorithms. GraphAware Framework is open source and supports building both generic and domain-specific Neo4j extensions.
The document discusses the GraphAware Framework, which allows developers to build custom APIs, transaction-driven behavior, and asynchronous computations for Neo4j. It provides examples like the TimeTree module for storing and querying time series data and a change feed module for tracking graph changes. The framework makes it easy to build, test, and deploy these advanced functionalities for Neo4j.
Modelling Data in Neo4j, bidirectional relationships, qualifying relationships with properties vs. relationship types (performance comparison), Neo4j hardware sizing, Cypher vs. Java API
This document discusses graph theory and its applications to data science. It provides examples of social and technological networks that can be represented as graphs, and covers graph theory concepts like connected components, triadic closure, structural balance, and centrality measures. Neo4j is presented as an open-source graph database that allows storing and querying graph data using the Cypher query language.
13. Key-Value Stores
• “Dynamo: Amazon’s Highly Available Key-Value Store” (2007)
• Data model:
– Global key-value mapping
– Big scalable HashMap
– Highly fault tolerant (typically)
• Examples:
– Riak, Redis, Voldemort
@bachmanm
14. Pros and Cons
• Strengths
– Simple data model
– Great at scaling out horizontally
• Scalable
• Available
• Weaknesses:
– Simplistic data model
– Poor for complex data
15. Column Family (BigTable)
• Google’s “Bigtable: A Distributed Storage
System for Structured Data” (2006)
• Data model:
– A big table, with column families
– Map-reduce for querying/processing
• Examples:
– HBase, HyperTable, Cassandra
16. Pros and Cons
• Strengths
– Data model supports semi-structured data
– Naturally indexed (columns)
– Good at scaling out horizontally
• Weaknesses:
– Unsuited for interconnected data
17. Document Databases
• Data model
– Collections of documents
– A document is a key-value collection
– Index-centric, lots of map-reduce
• Examples
– CouchDB, MongoDB
18. Pros and Cons
• Strengths
– Simple, powerful data model (just like SVN!)
– Good scaling (especially if sharding supported)
• Weaknesses:
– Unsuited for interconnected data
– Query model limited to keys (and indexes)
• Map reduce for larger queries
19. Graph Databases
• Data model:
– Nodes with properties
– Named relationships with properties
– Hypergraph, sometimes
• Examples:
– Neo4j (of course), Sones GraphDB, OrientDB,
InfiniteGraph, AllegroGraph
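The data model on this slide (nodes with properties, named relationships with properties) can be sketched in a few lines of plain Java. This is only an illustration of the shape of the model, not how Neo4j stores data: the class names `Node`, `Relationship` and `connect` are hypothetical, the relationship names DELIVERS and ABOUT are borrowed from the talk's later code, and the wiring between the sample nodes is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PropertyGraphSketch {

    /** A node is a bag of key-value properties plus the relationships attached to it. */
    static class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
        final List<Relationship> incoming = new ArrayList<>();

        Node with(String key, Object value) {
            properties.put(key, value);
            return this;
        }
    }

    /** A relationship is named, directed, and can carry properties of its own. */
    static class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Creates a relationship and registers it on both of its end points. */
    static Relationship connect(Node start, String type, Node end) {
        Relationship r = new Relationship(type, start, end);
        start.outgoing.add(r);
        end.incoming.add(r);
        return r;
    }

    public static void main(String[] args) {
        // Sample data: assumed wiring, for illustration only.
        Node speaker = new Node().with("name", "Michal Bachman");
        Node talk = new Node().with("title", "Intro to Neo4j").with("duration", 45);
        Node topic = new Node().with("name", "Neo4j");
        connect(speaker, "DELIVERS", talk);
        connect(talk, "ABOUT", topic);

        for (Relationship r : speaker.outgoing) {
            System.out.println(r.start.properties.get("name")
                    + " -[" + r.type + "]-> " + r.end.properties.get("title"));
        }
    }
}
```

Keeping the relationship as an object of its own, rather than a foreign key, is what lets it carry a name and properties.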
20. Pros and Cons
• Strengths
– Powerful data model
– Fast
• For connected data, can be many orders of magnitude
faster than RDBMS
• Weaknesses:
– Sharding
• Though they can scale reasonably well
• And for some domains you can shard too!
21. Social Network “path exists” Performance
• Experiment:
• ~1k persons
• Average 50 friends per person
• pathExists(a,b) limited to depth 4
• Caches warm to eliminate disk IO

Database              # persons   Query time
Relational database   1000        2000ms
Neo4j                 1000        2ms
Neo4j                 1000000     2ms
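For readers who want to see what a depth-limited pathExists(a,b) check does, here is a minimal sketch as a breadth-first search over an in-memory friendship map. It is not the code used in the experiment and it says nothing about how either database executes the query; the class name, the adjacency-map representation and the sample names are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class PathExistsSketch {
    // person -> friends (undirected friendship stored in both directions)
    private final Map<String, Set<String>> friends = new HashMap<>();

    void addFriendship(String a, String b) {
        friends.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** True if 'to' can be reached from 'from' in at most maxDepth hops. */
    boolean pathExists(String from, String to, int maxDepth) {
        if (from.equals(to)) {
            return true;
        }
        Set<String> visited = new HashSet<>();
        visited.add(from);
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(from);
        // Expand one level of friends per iteration, up to maxDepth levels.
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Queue<String> next = new ArrayDeque<>();
            for (String person : frontier) {
                Set<String> neighbours =
                        friends.getOrDefault(person, Collections.<String>emptySet());
                for (String friend : neighbours) {
                    if (friend.equals(to)) {
                        return true;
                    }
                    if (visited.add(friend)) {
                        next.add(friend);
                    }
                }
            }
            frontier = next;
        }
        return false;
    }

    public static void main(String[] args) {
        PathExistsSketch g = new PathExistsSketch();
        g.addFriendship("a", "b");
        g.addFriendship("b", "c");
        g.addFriendship("c", "d");
        g.addFriendship("d", "e");
        g.addFriendship("e", "f");
        System.out.println(g.pathExists("a", "e", 4)); // true: a-b-c-d-e is 4 hops
        System.out.println(g.pathExists("a", "f", 4)); // false: f is 5 hops away
    }
}
```

The work done depends on how many people are reachable within four hops of the start, not on the total number of persons stored.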
23. What are graphs good for?
• Recommendations
• Business intelligence
• Social computing
• Geospatial
• MDM
• Systems management
• Web of things
• Genealogy
• Time series data
• Product catalogue
• Web analytics
• Scientific computing (especially bioinformatics)
• Indexing your slow RDBMS
• And much more!
24. Neo4j is a Graph Database
So we need to detour through a little
graph theory
40. Getting started is easy
• Single package download, includes server stuff
– http://neo4j.org/download/
• For developer convenience, Ivy (or whatever):
– <dependency org="org.neo4j" name="neo4j-community" rev="1.9.M04"/>
41. Run it!
• Server is easy to start and stop
– cd <install directory>
– bin/neo4j start
– bin/neo4j stop
• Provides a REST API in addition to the other
APIs we’ve seen
• Provides some ops support
– JMX, data browser, graph visualisation
42. Embed it!
• If you want to host the database in your
process just load the jars
• And point the config at the right place on disk
• Embedded databases can be HA too
– You don’t have to run as server
43. [Figure: example graph. Node labels visible: name: Phil Johnson; title: Cognitive Psychology, duration: 30; name: Michal Bachman; name: UX; title: Intro to Neo4j, duration: 45; name: Martin Macke; name: Jeremy White; name: Neo4j; name: NOSQL. Relationship label visible: INTERESTED.]
50. All Conference Topics
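// Core API: walk by hand from the WebExpo reference node to every topic.
// INCOMING/OUTGOING are Direction constants; TALKS_AT, DELIVERS, ABOUT and NAME
// are presumably constants defined by the talk's own code.
Node webExpo = neo.getReferenceNode();
// speakers are the start nodes of incoming TALKS_AT relationships
for (Relationship talksAt : webExpo.getRelationships(INCOMING, TALKS_AT)) {
    Node speaker = talksAt.getStartNode();
    // follow each speaker's DELIVERS relationships to the talks
    for (Relationship delivers : speaker.getRelationships(OUTGOING, DELIVERS)) {
        Node talk = delivers.getEndNode();
        // follow each talk's ABOUT relationships to the topics
        for (Relationship about : talk.getRelationships(OUTGOING, ABOUT)) {
            String topicName = (String) about.getEndNode().getProperty(NAME);
            // add to result...
        }
    }
}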
-------------------
Printing all topics
All topics: development, data, advertising, education, usa, business, microsoft, webdesign, software,
responsiveness, ux, e-commerce, php, psychology, crm, api, chef, javascript, patterns, product design,
marketing, metro, social media, web, startup, analytics, lean, cqrs, node.js, branding, cloud, testing, neo4j,
rest, css, design, publishing, nosql. Took: 2 ms
53. Which talks should I attend?
// Traversal API: attendee -INTERESTED-> topic <-ABOUT- talk, keeping only nodes at depth 2
TraversalDescription talksTraversal = Traversal.description()
        .uniqueness(Uniqueness.NONE)
        .breadthFirst()
        .relationships(INTERESTED, OUTGOING)
        .relationships(ABOUT, INCOMING)
        .evaluator(Evaluators.atDepth(2));
// look the attendee up by name in the "people" index
Node attendee =
        neo.index().forNodes("people").get("name", "Jeremy White").getSingle();
Iterable<Node> talks = talksTraversal.traverse(attendee).nodes();
// iterate over talks and print
------------------------------------------
Suggesting talks for 100 random attendees.
...
Aneta Lebedova: Measure Everything!, To the USA, The real me. Took: 1 ms
Bohumir Kubat: Beyond the polar bear, How (not) to do API, Critical interface design. Took: 1 ms
Vladimir Vales: Application Development for Windows 8 Metro. Took: 1 ms
Suggested talks for 100 random attendees in 449 ms
56. What do we have in common?
// retrieve attendeeOne and attendeeTwo from index
int maxDepth = 2;
Iterable<Path> paths = GraphAlgoFactory
        .allPaths(Traversal.expanderForAllTypes(), maxDepth)
        .findAllPaths(attendeeOne, attendeeTwo);
for (Path path : paths) {
    // print it
}
------------------------------------------------------------
Finding things in common for 100 random couples of attendees
...
Karel Kunc and Phil Smith:
(Karel Kunc)--[INTERESTED]-->(ux)<--[INTERESTED]--(Phil Smith),
(Karel Kunc)--[DISLIKED]-->(Be a punk consumer!)<--[DISLIKED]--(Phil Smith),
(Karel Kunc)--[DISLIKED]-->(Beyond the polar bear)<--[LIKED]--(Phil Smith),
(Karel Kunc)--[LIKED]-->(Shipito.com – business in USA)<--[LIKED]--(Phil Smith).
Took: 0 ms.
...
Found things in common for 100 random couples of attendees in 142 ms.
58. Who is my beer mate?
[Figure: three-node pattern: myself, talk:?, beerMate:?]
59. Who is my beer mate?
[Figure: the same pattern written as (myself), (talk), (beerMate)]
60. Who is my beer mate?
start myself=node:people(name = "Emil Votruba")
match (myself)-[:LIKED]->(talk)<-[:LIKED]-(beerMate)
return distinct beerMate.name, count(beerMate)
order by count(beerMate) desc
limit 5;
61. Cypher Query
start myself=node:people(name = "Alex Smart")
match (myself)-[:LIKED]->(talk)<-[:LIKED]-(beerMate)
return distinct beerMate.name, count(beerMate)
order by count(beerMate) desc
limit 5;
62. Cypher Query
start myself=node:people(name = "Emil Votruba")
match (myself)-[:LIKED]->()<-[:LIKED]-(beerMate)
return distinct beerMate.name, count(beerMate)
order by count(beerMate) desc
limit 5;
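To make the Cypher pattern concrete, here is roughly what the "beer mate" query computes, written as plain Java over an in-memory map. This is a sketch of the logic only, not of how Neo4j evaluates Cypher: the class and method names are hypothetical, the LIKED data is made up (names and talk titles are reused from the talk's sample output), and the tie-break by name is an added assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BeerMateSketch {

    /**
     * likes maps a person's name to the set of talk titles that person LIKED.
     * Returns name -> number of talks liked in common with 'myself',
     * highest count first, cut off after 'limit' entries.
     */
    static LinkedHashMap<String, Integer> beerMates(
            Map<String, Set<String>> likes, String myself, int limit) {
        Set<String> myTalks = likes.getOrDefault(myself, Collections.<String>emptySet());

        // match (myself)-[:LIKED]->(talk)<-[:LIKED]-(beerMate), counted per beerMate
        Map<String, Integer> shared = new HashMap<>();
        for (Map.Entry<String, Set<String>> other : likes.entrySet()) {
            if (other.getKey().equals(myself)) {
                continue; // I am not my own beer mate
            }
            int common = 0;
            for (String talk : other.getValue()) {
                if (myTalks.contains(talk)) {
                    common++;
                }
            }
            if (common > 0) {
                shared.put(other.getKey(), common);
            }
        }

        // order by count desc (ties broken by name), then limit
        List<String> names = new ArrayList<>(shared.keySet());
        names.sort((a, b) -> {
            int byCount = Integer.compare(shared.get(b), shared.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        LinkedHashMap<String, Integer> top = new LinkedHashMap<>();
        for (String name : names.subList(0, Math.min(limit, names.size()))) {
            top.put(name, shared.get(name));
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> likes = new HashMap<>();
        likes.put("Emil Votruba", new HashSet<>(Arrays.asList(
                "Measure Everything!", "To the USA", "Beyond the polar bear")));
        likes.put("Alex Smart", new HashSet<>(Arrays.asList(
                "Measure Everything!", "To the USA")));
        likes.put("Karel Kunc", new HashSet<>(Arrays.asList("To the USA")));
        likes.put("Phil Smith", new HashSet<>(Arrays.asList("The real me")));

        System.out.println(beerMates(likes, "Emil Votruba", 5));
        // prints {Alex Smart=2, Karel Kunc=1}
    }
}
```

The Cypher version states the same pattern in five lines and leaves the looping to the database.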
64. Current Research
• Graph partitioning
• Graph analytics (“OLAP” and predictive)
• Performance improvements
• Query languages
• MVCC and single-threaded write models
• ACID (tradeoffs for weakening C and I)
• Yield and Harvest in distributed systems
• Application-level
– Recommendations
– Protein interactions
–…
Welcome
Introduce myself, NeoTech
Motivations:
• Presented this at a conference
• Conversations with friends
• Talked to Serena, no affiliation
• BigData and NOSQL are popular terms
• Graphs are getting more and more popular (Facebook)
• Not much attention at Imperial
• Ask about the audience: heard about graph databases? Graphs? Databases?
Outcomes:
• Learn about a new technology
• See application of graph theory in practice
• Tailored to students (not industry)
Agenda:
• Intro to NOSQL
• Intro to Graph Databases
• Intro to Neo4j
• Practical part: how to work with one
• Real experiences
• Current research
• Q & A
Why now? Nobody woke up one day thinking relational DBs are not cool any more. Trends.
Generate, process, store and work with
UGC = User Generated Content. GGG = Giant Global Graph (what the web will become): every little piece, every unit of interesting data is semantically connected to every other interesting unit of data (Tim Berners-Lee). Data is becoming more connected (linearly). RDFa (Resource Description Framework in attributes) is a technology for carrying structured information inside web pages. RDFa is one of the ways of writing (serialising) the Resource Description Framework (RDF) data format. An ontology, in computer science, is an explicit and formalised description of a particular problem domain. It is a formal and declarative representation that contains a glossary (definitions of terms) and a thesaurus (definitions of the relationships between the terms). An ontology is a dictionary used for storing and passing on knowledge about a particular domain.
Data losing predictable structure. Individualisation of data: can’t box each individual, want data about me. Shape of data: less predictable structure. Decentralisation of data creation accelerates this trend.
Apps can choose whatever makes sense for storing their data.
This is strictly about connected data: joins kill performance there. No bashing of RDBMS performance for tabular transaction processing.
The beauty of the NOSQL world is that nobody dictates your choice: pick the database that matches the type or characteristics of the data you work with. Key-value databases: one key, one value; hash maps; Redis, Riak (after Amazon Dynamo). Mostly highly tolerant of failures, simple data model, excellent horizontal scalability, availability. BigTable databases: a k-vvvvvvv store with implicit indexes; Cassandra (after Google BigTable). Support for semi-structured data, automatic index (columns), good horizontal scalability, again unsuitable for connected data. Document databases: well-known examples include Subversion, MongoDB, CouchDB, … Collections of documents; a document is a collection of key-value pairs; the index is important; lots of map-reduce. Scalability is fairly good (not as good as key-value, because of the more complex data model). Simple and powerful data model, like Subversion. The drawback of all three is that they are not really suited to densely connected data. An overly simple data model (a HashMap: fast, but…) means that if we want any immediate, deeper insight into the stored data, it has to be the responsibility of the application layer (that is, we have to program it somehow). Very often, therefore, these databases are paired with frameworks such as Map-Reduce, for which we have to write jobs that let us gain this insight. Map-reduce is a batch operation (I would contrast this with an on-line / in-the-click-stream synchronous operation) for getting a view of your connected data. All three work with aggregated data, meaning they require the structure up front: data that logically belongs together (such as an order and its line items) is stored together in the database and is also accessed as a whole in queries. In key-value stores that whole is the value, in column-family stores the column family, and in document databases the document. That is fine when data access needs exactly this structure. But if we want to look at the data differently, for example to analyse total sales of individual products from the orders, we have to fight the structure a little, and that is why map-reduce is talked about so much in connection with these databases.
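The point about aggregates can be illustrated with a small sketch. Orders are stored as whole aggregates (an order together with its line items), which suits "fetch order X", but answering "total sales per product" means visiting every aggregate and regrouping, which is the job a map-reduce pass performs. Everything here (class names, fields, prices) is hypothetical and single-threaded; a real map-reduce job would be distributed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AggregateSalesSketch {

    static class LineItem {
        final String product;
        final int quantity;
        final int unitPrice;

        LineItem(String product, int quantity, int unitPrice) {
            this.product = product;
            this.quantity = quantity;
            this.unitPrice = unitPrice;
        }
    }

    /** The aggregate: an order is stored and read together with its line items. */
    static class Order {
        final String id;
        final List<LineItem> items;

        Order(String id, LineItem... items) {
            this.id = id;
            this.items = Arrays.asList(items);
        }
    }

    /** Regroups the data by product: a "map" over line items, a "reduce" by sum. */
    static Map<String, Integer> totalSalesPerProduct(List<Order> orders) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Order order : orders) {
            for (LineItem item : order.items) {
                // map: emit (product, amount); reduce: add up amounts per product
                totals.merge(item.product, item.quantity * item.unitPrice, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order("o1", new LineItem("book", 2, 10), new LineItem("pen", 1, 2)),
                new Order("o2", new LineItem("book", 1, 10)));
        System.out.println(totalSalesPerProduct(orders)); // prints {book=30, pen=2}
    }
}
```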
The advantage of storing data in non-aggregated forms is that it can be analysed and presented from different points of view, depending on the particular case. And then, of course, there are graph databases, which are why we are here today and which we will learn a bit more about in a minute.
History: Amazon decided that they always wanted the shopping basket to be available, but couldn’t take a chance on an RDBMS, so they built their own. Big risk, but a simple data model with well-known computer science underpinning it (e.g. consistent hashing, Bloom filters for sensible replication).
+ Massive read/write scale
- Simplistic data model moves heavy lifting into the app tier (e.g. map reduce)
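Consistent hashing, mentioned above, can be sketched in a few lines. This is a generic textbook-style ring and not Dynamo's or Riak's implementation: the class name, the use of MD5, the `server#i` naming of virtual nodes and the server names are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class ConsistentHashRing {
    // position on the ring -> server owning that position
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places several virtual nodes per server on the ring to even out the load. */
    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    /** A key belongs to the first server at or after its hash, wrapping around. */
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 8 bytes of an MD5 digest, packed into a long. */
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addServer("s1");
        ring.addServer("s2");
        ring.addServer("s3");

        String[] before = new String[1000];
        for (int i = 0; i < before.length; i++) {
            before[i] = ring.serverFor("basket-" + i);
        }
        ring.removeServer("s3");
        int moved = 0;
        for (int i = 0; i < before.length; i++) {
            if (!ring.serverFor("basket-" + i).equals(before[i])) {
                moved++;
            }
        }
        // Only keys that lived on s3 should have moved.
        System.out.println("keys moved after removing s3: " + moved + " of " + before.length);
    }
}
```

The property that matters for availability is that removing a server only relocates the keys that server owned; every other key stays where it was.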
MongoDB has a reputation for taking liberties with durability to get speed. CouchDB has good multi-master replication from Lotus Notes.
People talk about Codd’s relational model being mature because it was proposed in 1969 (42 years old). Euler’s graph theory was proposed in 1736 (275 years old).
Can’t easily shard graphs like documents or KV stores. This means that high-performance graph databases are limited to data sets that a single machine can handle. Replicas can speed things up (and improve availability), but the data set size is still limited to a single machine’s disk/memory. Some domains can shard easily (e.g. geo, most web apps) using a consistent routing approach and cache sharding; we’ll cover that later.
Graph theory studies the properties of structures called graphs. These consist of vertices connected to each other by edges. A graph is usually drawn as a set of points joined by lines. Formally, a graph is an ordered pair of a set of vertices V and a set of edges E.
The Seven Bridges of Königsberg (today Kaliningrad). Who works for a big company, by which I mean several layers of management and a software architect on a different floor from the developers? This is for you: in such companies it tends to be hard to push through “new” technologies. But the relational model, which E.F. Codd came up with in 1969, is only 43 years old. The graph model is 276 years old. So next time your boss or a clever architect responds to adopting NOSQL with something like “we only use mature, proven technologies here”, you know where to send them… by which I mean, say, this talk on the web or the relevant Wikipedia pages. So how do we store data in a graph…
So how do we store data in a graph… In a graph we store data as vertices, and vertices are really documents that can have arbitrary keys with values assigned to them, just like a document in MongoDB. Where a graph differs from MongoDB is that a graph has relationships between vertices. That is the trade-off: MongoDB scales better because it does not do this; Neo4j is better for connected data because it does. It stores the relationships between individual vertices, but it does not scale as well. We have to take this into account when solving your problems: do you want massive scalability, or immediate insight into how your data is connected? DESCRIBE THE GRAPH. Relationships have semantic meaning! Speakers and talks in an RDBMS. It is a fairly intuitive way of storing data! The job of a graph database is to take this intuitive data, which we can easily sketch on a whiteboard or a piece of paper, and traverse it quickly in your programs.
And that is one nice property of graphs: they are ideal for whiteboards, the backs of envelopes, beer mats and cigarette packets… the things on which the best designs (especially in startups) usually come to life. I picked WebExpo as the example; originally I wanted to map the corruption scandals of Czech politicians, but this is a bit more harmless. We can draw the relationships between speakers, talks, topics, attendees and so on on a beer mat! WebExpo is a domain with lots of relationships: speakers have talks, … You can easily draw that on a whiteboard, which is, by the way, what we do as programmers when we sit down with people who need a piece of software and we try to understand the business problem, the domain. We sit at the whiteboard and draw customers, orders, invoices, products and so on, and the relationships between them! And what do we do next? We take our nice design and normalise it. We sweat over how to cram it all into tables. And we are happy and smiling until we launch it live, into production… and it runs like a tortoise. What do we do? We denormalise our model! All the energy we invested, the blood, sweat and tears, all for nothing. With a graph database, what is on paper is exactly what you put into the database.
That does not mean you are excused from the design phase. You still have to think hard about which entities (or objects) make up your domain and what the relationships between them are! You still need design. You cannot simply take the data from the tables you have and force it into your brand-new graph database. You have to start thinking in nodes and relationships. When designing the data model for WebExpo we have to make many design decisions: how do we tell speakers from attendees? Is that even necessary? Do we make Friday and Saturday nodes, or just a property on the individual talks? You still have to do design, but the point is that designing a data model for a graph database can be a pleasant and natural experience.
It takes care of nodes, the relationships between them, and indexes for you. Neo4j is stable and has been running since 2003. It is under active development. Primarily for Java, but usable with plenty of other technologies. Ideal for the scale of tens of servers in a cluster, not hundreds. For densely connected data; it is not a KV store.
Fully and militantly ACID. Who doesn’t know what that means? Explain quickly: atomicity, consistency, isolation, durability. Some other NOSQL databases give up some of these guarantees in favour of performance; with Neo4j you cannot turn this off. Data is always written to disk.
Look up the starting point in the index (Lucene). Explore the neighbourhood.
Neo mázabudovanoucelouknihovnugrafovýchalgoritmů, jakonejkratšícesta, všechnycesty, atp
1m hops zasekundunanormálnímlaptopu, žádnýrozdílpřiznásobenípočtudatHigh performance graph operationsTraverses 1,000,000+ relationships / second on commodity hardware
Obecněpokudpoužíváte MySQL a neplatítezaněj, nebudeteplatitaniza Neo.
Pojďmesikázatpoužití v embedded módunakonkrétnímpříkladu. Vytvořiljsemgraf z webexpa, řečníci a přednáškyjsouopravdové, 1000 účastníkůmánáhodněvygenerovanájména. Popsatgraf a scénář.KdonečteJavuKodbudenagithubu
Vztahymůžoubýtbuďřetězceznaků, neboEnum, kterévámdajívýhodustatickéhotypování v IDE, pro Neo4j v tom nenížádnýrozdíl.Postupopakujemedokudnemámecelýgraf
This is a screenshot from the web console, where we can browse the graph visually. It is running on my laptop; I will give you the URL at the end so you can play with it. So, we have a graph, but how do we get data out of it now?
There are several ways to write queries in Neo4j; they differ in readability, complexity, performance, and level of abstraction. I will show you some of them, starting from the bottom, that is, from the fastest, native API.
The Core API works directly with the units we stored in the database: nodes (vertices), relationships (edges), and their properties.
Let's look at the whole graph once more. A new graph always has one node with ID 0; we turned that one into WebExpo.
This is an imperative API: the programmer does all the work, and it is the most performant.
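What "the programmer does all the work" means can be sketched in plain Java. This is only an illustration of the imperative style, not the Core API itself: the edge table, the relationship names `HAS_TALK` and `SPEAKS_AT`, and the talk and speaker names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CoreApiSketch {
    // Each edge is a (from, type, to) triple; all names are hypothetical sample data.
    static final String[][] EDGES = {
        {"WebExpo", "HAS_TALK", "Talk A"},
        {"WebExpo", "HAS_TALK", "Talk B"},
        {"Speaker 1", "SPEAKS_AT", "Talk A"},
        {"Speaker 2", "SPEAKS_AT", "Talk B"},
        {"Speaker 1", "SPEAKS_AT", "Talk B"},
    };

    // Imperative style: we walk the relationships ourselves, hop by hop.
    static List<String> speakersOf(String conference) {
        List<String> speakers = new ArrayList<>();
        for (String[] hasTalk : EDGES) {
            // First hop: conference -> talk
            if (!hasTalk[0].equals(conference) || !hasTalk[1].equals("HAS_TALK")) {
                continue;
            }
            String talk = hasTalk[2];
            // Second hop: talk <- speaker
            for (String[] speaksAt : EDGES) {
                if (speaksAt[1].equals("SPEAKS_AT")
                        && speaksAt[2].equals(talk)
                        && !speakers.contains(speaksAt[0])) {
                    speakers.add(speaksAt[0]);
                }
            }
        }
        return speakers;
    }

    public static void main(String[] args) {
        System.out.println(speakersOf("WebExpo"));
    }
}
```

Every decision (which relationships to follow, in which direction, how to deduplicate) is spelled out in loops, which is why this style is fast but verbose.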
Let's move one level up in abstraction to the so-called traversal API, which lets us write queries declaratively, that is, describe how we want the graph to be traversed. Neo4j does the actual traversal for us.
We can write our own evaluators.
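The idea of describing a traversal and plugging in a custom evaluator can be sketched roughly as follows. This is plain Java, not Neo4j's traversal framework: `TraversalSketch`, its `traverse` method, and the sample names are assumptions, and the evaluator is simply a `Predicate` over the path visited so far.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class TraversalSketch {
    // Adjacency list of a tiny hypothetical graph (undirected for simplicity).
    final Map<String, List<String>> neighbors = new HashMap<>();

    void connect(String a, String b) {
        neighbors.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        neighbors.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    // The caller only describes the traversal: start node, depth limit in hops,
    // and an evaluator deciding whether a visited path belongs in the result.
    List<List<String>> traverse(String start, int maxDepth, Predicate<List<String>> evaluator) {
        List<List<String>> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<List<String>> queue = new ArrayDeque<>();
        List<String> first = new ArrayList<>();
        first.add(start);
        queue.add(first);
        visited.add(start);
        while (!queue.isEmpty()) {
            List<String> path = queue.remove();
            if (evaluator.test(path)) {
                result.add(path);
            }
            if (path.size() - 1 >= maxDepth) {
                continue; // depth limit reached, do not expand further
            }
            String last = path.get(path.size() - 1);
            for (String next : neighbors.getOrDefault(last, new ArrayList<>())) {
                if (visited.add(next)) { // visit every node at most once
                    List<String> longer = new ArrayList<>(path);
                    longer.add(next);
                    queue.add(longer);
                }
            }
        }
        return result;
    }

    static TraversalSketch sample() {
        TraversalSketch g = new TraversalSketch();
        g.connect("Emil Votruba", "Talk A");
        g.connect("Emil Votruba", "Talk B");
        g.connect("Talk A", "Attendee 1");
        g.connect("Talk B", "Attendee 2");
        g.connect("Attendee 2", "Talk C");
        return g;
    }

    public static void main(String[] args) {
        // Custom evaluator: keep only paths that are exactly two hops long,
        // i.e. people who share a talk with the start node.
        Predicate<List<String>> twoHops = path -> path.size() == 3;
        for (List<String> path : sample().traverse("Emil Votruba", 2, twoHops)) {
            System.out.println(path);
        }
    }
}
```

The walking logic lives in one place and the query shrinks to a start node, a depth, and a predicate, which is the trade the declarative style offers.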
Another nice feature is the library of algorithms for finding paths between two nodes.
Also shortest path, Dijkstra, and others.
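For readers who have not met these algorithms, a shortest path in an unweighted graph can be found with a breadth-first search. The sketch below is a generic textbook version in plain Java, not Neo4j's built-in library; the class name, helper names, and sample graph are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class ShortestPathSketch {
    // Breadth-first search: the first time we reach the target, the path is shortest in hops.
    static List<String> shortestPath(Map<String, List<String>> graph, String from, String to) {
        Map<String, String> cameFrom = new HashMap<>(); // node -> predecessor on the path
        Deque<String> queue = new ArrayDeque<>();
        queue.add(from);
        cameFrom.put(from, null);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(to)) {
                // Walk the predecessors back to the start to rebuild the path.
                LinkedList<String> path = new LinkedList<>();
                for (String n = to; n != null; n = cameFrom.get(n)) {
                    path.addFirst(n);
                }
                return path;
            }
            for (String next : graph.getOrDefault(current, Collections.emptyList())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList(); // no path exists
    }

    // Adds an undirected edge between two nodes.
    static void link(Map<String, List<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        graph.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> g = new HashMap<>();
        link(g, "Attendee 1", "Talk A");
        link(g, "Talk A", "Speaker 1");
        link(g, "Speaker 1", "Talk B");
        link(g, "Talk B", "Attendee 2");
        link(g, "Attendee 1", "Talk C");
        link(g, "Talk C", "Attendee 2");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(sampleGraph(), "Attendee 1", "Attendee 2"));
    }
}
```

Dijkstra's algorithm follows the same outline but replaces the queue with a priority queue ordered by accumulated relationship weight.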
That is hard for non-programmers, so let's look at something simpler.
At the highest level of abstraction, Neo4j provides its own query language, partly inspired by SQL. The language is called Cypher, and it understands human-readable statements such as the one you see here now.
We have to start somewhere, so we use an index named people, where we look up Mr. Emil Votruba by name. Next we have to specify what data we actually want to get back, in this case the person's name and a score: how many things we have in common. Finally, we probably do not want to go for a beer with absolutely everyone, but only with, say, the 5 people we have the most in common with. You can probably see the SQL influence.
----- Meeting Notes (09/09/2012 20:18) -----
animation
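The steps the query goes through (look up the start by name, count shared things per person, sort by score, keep the top 5) can be mirrored in plain Java. This is a sketch of the logic only, not the Cypher query from the slide; the map standing in for the people index, the `beerBuddies` helper, and the sample attendees and talks are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CommonInterestsSketch {
    static final class Score {
        final String name;
        final int score;
        Score(String name, int score) {
            this.name = name;
            this.score = score;
        }
        @Override public String toString() { return name + " (" + score + ")"; }
    }

    // people: name -> set of things the person is connected to (talks, topics, ...).
    static List<Score> beerBuddies(Map<String, Set<String>> people, String me, int limit) {
        // Step 1: find the starting point by name (the map plays the role of the index).
        Set<String> mine = people.getOrDefault(me, Collections.<String>emptySet());
        // Step 2: for everyone else, count how many things we have in common.
        List<Score> scores = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : people.entrySet()) {
            if (e.getKey().equals(me)) {
                continue;
            }
            Set<String> shared = new HashSet<>(e.getValue());
            shared.retainAll(mine);
            if (!shared.isEmpty()) {
                scores.add(new Score(e.getKey(), shared.size()));
            }
        }
        // Step 3: highest score first; ties are broken by name to keep the order stable.
        scores.sort((a, b) -> a.score != b.score
                ? Integer.compare(b.score, a.score)
                : a.name.compareTo(b.name));
        // Step 4: keep only the top few.
        return new ArrayList<>(scores.subList(0, Math.min(limit, scores.size())));
    }

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    static Map<String, Set<String>> samplePeople() {
        Map<String, Set<String>> people = new LinkedHashMap<>();
        people.put("Emil Votruba", setOf("Talk A", "Talk B", "Talk C"));
        people.put("Attendee 1", setOf("Talk A", "Talk B"));
        people.put("Attendee 2", setOf("Talk C"));
        people.put("Attendee 3", setOf("Talk D"));
        people.put("Attendee 4", setOf("Talk A", "Talk B", "Talk C"));
        return people;
    }

    public static void main(String[] args) {
        for (Score s : beerBuddies(samplePeople(), "Emil Votruba", 5)) {
            System.out.println(s);
        }
    }
}
```

The four commented steps correspond to the parts of the query described above, which is where the SQL-like flavor (select, order, limit) shows through.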