This document discusses using a multi-model database approach to manage time series and event sequence data. It describes some common approaches like using a relational database with timestamp fields or storing events in a document database. It then outlines how OrientDB combines graph and document models to provide flexibility while maintaining fast write and read speeds. Events can be connected in the graph and stored as documents to allow for relationships and complex properties. The document summarizes how OrientDB allows aggregating data during writes using hooks and querying pre-aggregated data to enable fast analysis of time-based data.
Representing the flow of time is not a simple undertaking, especially with "traditional" tools. Yet the time dimension is fundamental in a thousand different contexts, from statistical analysis to the representation of cause-and-effect relationships, from forecasting to automatic control. In this talk we will see how to make the best use of OrientDB, a Document-Graph Database, for storing, processing, and querying this kind of information.
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
1. Time flows, my friend
Managing event sequences and time series with a Document-Graph Database
Codemotion Milan 2014
Luigi Dell’Aquila
Orient Technologies LTD
Twitter: @ldellaquila
3. Time What…?
Time series:
A time series is a sequence of data points, typically consisting of successive measurements made over a time interval (Wikipedia)
4. Time What…?
Event sequences:
• A set of events with a timestamp
• A set of relationships “happened before/after”
• Cause and effect relationships
5. Time What…?
Time as a dimension:
• Direct:
– Eg. begin and end of relationships (I’m a friend of John since…)
• Calculated
– Eg. Speed (distance/time)
8. Fast and Effective
Fast write: Time doesn’t wait! Writes just arrive
Fast read: a lot of data to be read in a short time
Effective manipulation: complex operations like
- Aggregation
- Prediction
- Analysis
11. Current approaches
0. Relational approach: table
HH MM SS Value
14 35 0 1321
14 35 1 2444
14 35 2 2135
14 35 3 1833
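For concreteness, a minimal sketch of this approach in plain SQL (table and column names are hypothetical; the talk does not prescribe a schema):

CREATE TABLE measurement (
  ts    TIMESTAMP NOT NULL,   -- one full timestamp instead of separate HH/MM/SS columns
  value INTEGER   NOT NULL
);
CREATE INDEX idx_measurement_ts ON measurement (ts);

-- typical read: aggregate over a time window (this is the index the next slides refer to)
SELECT AVG(value)
FROM measurement
WHERE ts >= '2014-11-21 14:35:00' AND ts < '2014-11-21 14:36:00';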
12. Current approaches
0. Relational – Advantages
• Simple
• It can be used together with your application data (operational)
13. Current approaches
0. Relational – Disadvantages
• Slow read (relies on an index)
• Slow insert (update the index…)
14. Current approaches
1. Document Database
• Collections of Documents instead of tables
• Schemaless
• Complex data structures
15. Current approaches
1. Document approach: Minute Based
{
  timestamp: "2014-11-21 12.05",
  load: [10, 15, 3, … 30] // array of 60, one per second
}
16. Current approaches
1. Document approach: Hour Based
{
  timestamp: "2014-11-21 12.00",
  load: {
    0: [10, 15, 3, … 30], // array of 60, one per second
    1: [0, 12, 31, … 24],
    …
    59: [10, 10, 1, … 16]
  }
}
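As a concrete sketch, the minute-based variant could be stored and read like this in OrientDB SQL (the class name Load is hypothetical and the 60-element array is abbreviated):

CREATE CLASS Load
INSERT INTO Load CONTENT {"timestamp": "2014-11-21 12.05", "load": [10, 15, 3, 30]}
SELECT FROM Load WHERE timestamp = '2014-11-21 12.05'

Each subsequent second within the minute is then an update of this single document rather than a new row.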
17. Current approaches
1. Document approach – Advantages
• Fast write: One insert x 60 updates
• Fast fetch
18. Current approaches
1. Document approach – Disadvantages
• Fixed time windows
• Single point per unit
• How to pre-aggregate?
• Relationships with the rest of the world?
• Relationships between events?
19. Current approaches
2. Graph Database
• Nodes/Edges instead of tables
• Index free adjacency
• Fast traversal
• Dynamic structure
20. Current approaches
2. Graph approach: linked sequence
Diagram: a chain of event vertices e1 → e2 → e3 → e4 → e5 connected by “next” edges (the timestamp is stored on each vertex).
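A minimal sketch of this linked sequence in OrientDB SQL (Event and next are hypothetical class names; timestamps and values are illustrative):

CREATE CLASS Event EXTENDS V
CREATE CLASS next EXTENDS E
CREATE VERTEX Event SET timestamp = '2014-11-21 14:35:00', value = 1321
CREATE VERTEX Event SET timestamp = '2014-11-21 14:35:01', value = 2444
CREATE EDGE next FROM (SELECT FROM Event WHERE timestamp = '2014-11-21 14:35:00')
                 TO   (SELECT FROM Event WHERE timestamp = '2014-11-21 14:35:01')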
21. Current approaches
2. Graph approach: linked sequence (tag based)
Diagram: the same chain of event vertices e1 … e5, each labelled with its set of tags (e.g. [Tag1, Tag2], [Tag1], [Tag2]); a separate edge type per tag (nextTag1, nextTag2) chains together only the events that share that tag.
22. Current approaches
2. Graph approach: Hierarchy
Diagram: a time hierarchy of vertices (Days → Hours → Minutes → Seconds), with the event vertices attached at the leaves; each level fans out to its children (24 hours per day, 60 minutes per hour, 60 seconds per minute).
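A minimal sketch of the hierarchy's upper levels in OrientDB SQL (Day, Hour and the child edge are hypothetical names; the Minute and Second levels follow the same pattern):

CREATE CLASS Day EXTENDS V
CREATE CLASS Hour EXTENDS V
CREATE CLASS child EXTENDS E
CREATE VERTEX Day SET date = '2014-11-21'
CREATE VERTEX Hour SET hour = 14
CREATE EDGE child FROM (SELECT FROM Day WHERE date = '2014-11-21')
                  TO   (SELECT FROM Hour WHERE hour = 14)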
23. Current approaches
2. Graph approach: mixed
Diagram: the same Days → Hours → Minutes → Seconds hierarchy combined with the linked sequence of event vertices at the leaves, so events can be reached both through the time hierarchy and by following the sequence.
24. Current approaches
2. Graph approach – Advantages
• Flexible
• Events can be connected together in different ways
• You can connect events to other entities
• Fast traversal of dynamic time windows
• Fast aggregation (based on hierarchy)
25. Current approaches
2. Graph approach – Disadvantages
• Slow writes (vertex + edge + maintenance)
• Not so fast reads
26. Can we mix different models and get all the advantages?
27. Can we mix all this with the rest of application logic?
30. OrientDB
First step: put them together
Diagram: the Days → Hours → Minutes hierarchy from the graph approach, where each Minute vertex carries an embedded document with one value per second:
{
  0: 1000,
  1: 1500,
  …
  59: 96
}
31. OrientDB
First step: put them together
Diagram: the same picture, annotated: the Days/Hours/Minutes hierarchy is the Graph part, and the per-second map is the Document part (the document is a vertex too!).
32. OrientDB
First step: put them together
Diagram: the same idea one level up: the hierarchy stops at Days → Hours, and each Hour vertex (Graph) embeds a document (Document) with one nested map per minute:
{
  0: {
    0: 1000,
    1: 1500,
    …
    59: 210
  },
  1: { … },
  …
  59: { … }
}
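A minimal sketch of this combined record in OrientDB SQL: the vertex class makes it part of the graph, while CONTENT stores the embedded per-second map as a document (class and property names are hypothetical):

CREATE CLASS Minute EXTENDS V
CREATE VERTEX Minute CONTENT {"minute": "2014-11-21 12.05", "seconds": {"0": 1000, "1": 1500, "59": 96}}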
33. Where should I stop?
It depends on my domain and requirements.
34. OrientDB
Result:
• Same insert speed as the Document approach
• But with the flexibility of a Graph
• (as a side effect of mixing models, documents can also contain “pointers” to other elements of the app domain)
38. OrientDB
How to aggregate
Hooks: server-side triggers (Java or Javascript), executed when DB operations happen (eg. insert or update)
Java interface:
public RESULT onBeforeInsert(…);
public void onAfterInsert(…);
public RESULT onBeforeUpdate(…);
public void onAfterUpdate(…);
39. OrientDB
Aggregation logic
• Second 0 -> insert
• Second 1 -> update
• …
• Second 57 -> update
• Second 58 -> update
• Second 59 -> update + aggregate
– Write aggregate value on minute vertex
• Minute == 59? Calculate aggregate on hour vertex
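For example, when the last second of a minute arrives, the hook can write the pre-aggregated total directly onto the parent Minute vertex. A sketch in OrientDB SQL, assuming the vertex's record id is known (the #11:12 rid and the sum/complete properties are hypothetical):

UPDATE #11:12 SET sum = 15000, complete = true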
40. OrientDB
Diagram: the Days → Hours → Minutes hierarchy annotated with pre-aggregated values (e.g. sum = 1000, sum = 15000, sum = 300) on vertices whose time window is complete; vertices whose window is still open are marked incomplete and keep sum = null, while the embedded per-second document ({0: 1, 1: 12, …, 59: 3}) holds the raw values.
41. OrientDB
Query logic:
• Traverse from the root node to the specified level (filtering based on vertex data)
• Is there an aggregate value?
– Yes: return it
– No: go one level down and do the same
Aggregation on a level will be VERY fast if you have horizontal edges!
42. OrientDB
How to calculate aggregate values with a query
Input params:
- Root node (suppose it is #11:11)
select sum(aggregateVal) from (
traverse out() from #11:11
while in().aggregateVal is null
)
With the same logic you can query based on time windows
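A related sketch: the same traversal can also be bounded by level with the built-in $depth variable and filtered on vertex data to pick a time window (the hour property here is hypothetical):

SELECT sum(aggregateVal) FROM (
  TRAVERSE out() FROM #11:11
  WHILE $depth <= 2 AND (hour is null OR hour = 14)
)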
44. OrientDB
Another use case: Event Categories and OO
Diagram: the tag-based linked sequence again: event vertices e1 … e5 labelled with their tag sets ([Tag1, Tag2, Tag3], [Tag1], [Tag2], …) and chained by one edge type per tag (nextTag1, nextTag2, nextTag3).
45. OrientDB
Another use case: Event Categories and OO
Suppose tags are hierarchical categories (Classes for vertices and/or edges)
Diagram: an edge class hierarchy with nextTAG at the root, nextTagX and nextTag3 as its subclasses, and nextTag1 and nextTag2 as subclasses of nextTagX.
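A minimal sketch of that class hierarchy in OrientDB SQL (class names taken from the diagram):

CREATE CLASS nextTAG EXTENDS E
CREATE CLASS nextTagX EXTENDS nextTAG
CREATE CLASS nextTag3 EXTENDS nextTAG
CREATE CLASS nextTag1 EXTENDS nextTagX
CREATE CLASS nextTag2 EXTENDS nextTagX

This is what makes the polymorphic traversal on slide 48 work: out('nextTagX') also follows nextTag1 and nextTag2 edges.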
46. OrientDB
Subset of events
TRAVERSE out('nextTag1') FROM <e1>
Diagram: starting from e1, the traversal follows only nextTag1 edges and visits e1, e2, e4, e5 (the events tagged with Tag1).
47. OrientDB
Subset of events
TRAVERSE out('nextTag2') FROM <e1>
Diagram: the same traversal over nextTag2 edges visits e1, e3, e5 (the events tagged with Tag2).
48. OrientDB
Subset of events (Polymorphic!!!)
TRAVERSE out('nextTagX') FROM <e1>
Diagram: since nextTag1 and nextTag2 both extend nextTagX, the polymorphic traversal follows both edge types and visits all of e1 … e5.
52. Chase
• Your target is running away
• You have informers that track his moves (coordinates at a point in time) and give you additional (unstructured) information
• You have a street map
• You want to:
– Catch him ASAP
– Predict his moves
– Be sure that he is inside an area
55. Chase
• Map is made of points and distances
• You also have speed limits for streets
Diagram: map points (point1 … pointN) connected by Street edges that carry a distance and a speed limit, e.g. Distance: 1 Km, Max speed: 70 Km/h; Distance: 2 Km, Max speed: 120 Km/h; Distance: 8 Km, Max speed: 90 Km/h. Legend: Map point (vertex), Street (edge).
56. Chase
• Map is made of points and distances
• You also have speed limits for streets
• Distance / Speed = TIME!!!
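A minimal sketch of the map in OrientDB SQL (MapPoint and Street are hypothetical class names; distance in Km and maxSpeed in Km/h, as on the slide):

CREATE CLASS MapPoint EXTENDS V
CREATE CLASS Street EXTENDS E
CREATE VERTEX MapPoint SET name = 'point1'
CREATE VERTEX MapPoint SET name = 'pointN'
CREATE EDGE Street FROM (SELECT FROM MapPoint WHERE name = 'point1')
                   TO   (SELECT FROM MapPoint WHERE name = 'pointN')
                   SET distance = 1, maxSpeed = 70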
57. Chase
You have a time series of your target’s moves
Diagram: a sequence of Event documents, one per sighting, each of the form:
{
  Timestamp: 29/11/2014 17:15:00
  LAT: 19,12223
  LON: 42,134
}
(the sightings, e.g. at 17:15:00 and 17:55:00, are chained into an event sequence)
58. Chase
You have a time series of your target’s moves
Diagram: the street map again (Map point vertices, Street edges), with the target's sightings placed on it at 20/11/2014 1:20:00 PM and 21/11/2014 2:35:00 PM.
59. Chase
You have a time series of your target’s moves
Diagram: Event vertices (20/11/2014 13:20:00, 21/11/2014 14:35:00, 29/11/2014 17:55:00) chained into an event sequence and linked to Map point vertices through “Where” edges; the Map points stay connected by Street edges.
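A minimal sketch of that linking in OrientDB SQL (Event is as sketched earlier and MapPoint as above; the edge class is called SightedAt here only because the slide's label “Where” clashes with an SQL keyword):

CREATE CLASS SightedAt EXTENDS E
CREATE VERTEX Event SET timestamp = '2014-11-29 17:55:00', lat = 19.12223, lon = 42.134
CREATE EDGE SightedAt FROM (SELECT FROM Event WHERE timestamp = '2014-11-29 17:55:00')
                      TO   (SELECT FROM MapPoint WHERE name = 'point1')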
60. Chase
Vertices and edges are also documents
So you can store complex information inside them
{
  timestamp: 22213989487987,
  lat: xxxx,
  lon: yyy,
  informer: 15,
  additional: {
    speed: 120,
    description: "the target was in a car",
    car: {
      model: "Fiat 500",
      licensePlate: "AA 123 BB"
    }
  }
}
61. Chase
Now you can:
• Predict his moves (eg. statistical methods, interpolation on lat/lon + time)
• Calculate how far he can be (based on last position, avg speed and street data)
• Reach him quickly (shortest path, Dijkstra)
• … intelligence?
62. Chase
But to have all this you need:
• An easy way for your informers to send time series events
Hint: REST interface
With OrientDB you can expose Javascript functions as REST services!
63. Chase
And you need:
• An extended query language
Eg.
TRAVERSE out("street") FROM (
  SELECT out("point") FROM #11:11 // my last event
) WHILE canBeReached($current, #11:11)
(where he could be)
64. Chase
With OrientDB you can write
function canBeReached(node, event)
in Javascript and use it in your queries
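A minimal sketch of registering such a function in OrientDB SQL (the Javascript body here is just a placeholder, not a real reachability check):

CREATE FUNCTION canBeReached "return true;" PARAMETERS [node, event] LANGUAGE javascript

Functions defined this way can also be exposed through OrientDB's HTTP interface, which is what the REST hint on slide 62 refers to.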
65. Chase
It’s just a game, but think about:
• Fraud detection
• Traffic routing
• Multi-dimensional analytics
• Forecasting
• …
67. One model is not enough
One of the most common issues of my customers is:
“I have a zoo of technologies in my application stack, and it’s getting worse every day”
My answer is: Multi-Model DB
68. One model is not enough
One of the most common issues of my customers is:
“I have a zoo of technologies in my application stack, and it’s getting worse every day”
My answer is: Multi-Model DB
of course ;-)
69. From:
“choose the right data model for your use case”
To:
“Your application has multiple data models, you need all of them!”