This document discusses MongoDB and Fractal Tree indexes. It provides an overview of B-trees and how Fractal Tree indexes differ, allowing for larger node sizes and buffering of changes in internal nodes. Benchmarks show that Fractal Tree indexes outperform MongoDB's default indexing, providing much better insertion and query performance. The presenter outlines Tokutek's work on Fractal Tree indexes and the roadmap to further integrate them with MongoDB to improve performance and crash safety.
Fractal Tree Indexes: From Theory to Practice
Tim Callaghan
Fractal Tree indexes are compared to the indexing incumbent, the B-tree, and the capabilities they bring to MySQL (in TokuDB) and MongoDB (in TokuMX) are then shown.
Presented at Percona Live London 2013.
3. B-tree Definition
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
http://en.wikipedia.org/wiki/B-tree
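To make the logarithmic-search claim concrete, here is a minimal lookup sketch in JavaScript (my illustration under simplified assumptions, not any particular implementation): internal nodes hold separator keys and child pointers, leaves hold the data, and a search descends one node per level.

// Minimal B-tree lookup sketch. children[i] covers keys below keys[i];
// the last child covers everything at or above the last separator.
function search(node, key) {
  if (node.leaf) {
    var i = node.keys.indexOf(key);       // scan within the leaf
    return i >= 0 ? node.values[i] : null;
  }
  var i = 0;
  while (i < node.keys.length && key >= node.keys[i]) i++;
  return search(node.children[i], key);   // one node per level: O(log n)
}

// Example shaped like the trees on the following slides (root key 22).
var tree = {
  leaf: false, keys: [22],
  children: [
    { leaf: true, keys: [2, 3, 4, 10, 20], values: ["a", "b", "c", "d", "e"] },
    { leaf: true, keys: [22, 25, 99],      values: ["f", "g", "h"] }
  ]
};
print(search(tree, 25)); // "g"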
9. B-tree - storage
Performance is IO limited when bigger than RAM: try to fit all internal nodes and some leaf nodes.
[Diagram: example B-tree with root key 22 and internal keys 10 and 99 in RAM; leaf nodes (2, 3, 4), (10, 20), (22, 25), (99) largely on disk]
10. B-tree – serial insertions
Serial insertion workloads are in-memory: think MongoDB's "_id" index.
[Diagram: the same example B-tree; serial inserts only touch the right-most path, which stays in RAM]
12. Fractal Tree Indexes
[Diagram: a tree in which every internal node carries a message buffer]
All internal nodes have message buffers.
Similar to B-trees:
- store data in leaf nodes
- use PK for ordering
Different than B-trees:
- message buffer in all internal nodes
- doesn't need to update leaf node immediately
- much larger nodes (4MB vs. 8KB*)
13. Fractal Tree Indexes – “insert 15”
[Diagram: insert(15) lands in the root's message buffer of the example tree (root 22, internal nodes 10 and 99, leaves 2,3,4 | 10,20 | 22,25 | 99) instead of immediately updating a leaf]
No IO is required; all internal nodes usually fit in RAM.
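The buffering mechanics can be sketched in a few lines of JavaScript (a toy model of the technique under simplified assumptions, not Tokutek's code): inserts queue as messages at the root and are only pushed a level down, in bulk, when a buffer fills, so one eventual leaf write amortizes many inserts.

// Toy fractal-tree-style buffering: internal nodes queue messages and
// flush them downward in batches once the buffer fills.
var BUFFER_CAPACITY = 4;               // real nodes are ~4MB; tiny here

function insert(node, key) {
  if (node.leaf) {                     // leaves apply messages directly
    node.keys.push(key);
    node.keys.sort(function (a, b) { return a - b; });
    return;
  }
  node.buffer.push({ op: "insert", key: key });   // no leaf IO yet
  if (node.buffer.length >= BUFFER_CAPACITY) flush(node);
}

function flush(node) {
  node.buffer.forEach(function (msg) {
    var i = 0;                         // route to the child owning this key
    while (i < node.pivots.length && msg.key >= node.pivots[i]) i++;
    insert(node.children[i], msg.key); // may just buffer again, a level down
  });
  node.buffer = [];
}

var root = {
  leaf: false, pivots: [22], buffer: [],
  children: [
    { leaf: true, keys: [2, 3, 4, 10, 20] },
    { leaf: true, keys: [22, 25, 99] }
  ]
};
insert(root, 15);                      // buffered at the root: no leaf touched
print(root.buffer.length);             // 1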
17. Fractal Tree Indexes – compression
• Large node size (4MB) leads to high compression ratios.
• Supports zlib, quicklz, and lzma compression algorithms.
• Compression is generally 5x to 25x, similar to what gzip and 7z can do to your data.
• Significantly less disk space needed.
• Fewer, bigger writes – both of which are great for SSDs.
• Reads are highly compressed: more data per IO.
19. So what does this have to do with MongoDB?
* Watch Tyler Brock’s presentation “Indexing and Query Optimization”
20. MongoDB Storage
db.test.insert({foo:55})
db.test.ensureIndex({foo:1})
[Diagram: two B-trees side by side – the PK index (_id + pointer) and the secondary index (foo + pointer) – whose leaf entries are (key, pointer) pairs such as (55, ptr4)]
The “pointer” tells MongoDB where to look in the data files for the actual document data.
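To see the extra hop, here is a hedged illustration (the explain() fields shown are from MongoDB 2.2-era output and vary by version): a find on foo walks the secondary index, then follows each pointer into the data files.

// Query by the secondary index: MongoDB first searches the foo B-tree,
// then follows each entry's pointer into the data files for the document.
db.test.find({foo: 55}).explain()
// Abridged, illustrative 2.2-era output:
// {
//   "cursor" : "BtreeCursor foo_1",   // walked the secondary index on foo
//   "nscanned" : 1,                   // index entries examined
//   "nscannedObjects" : 1,            // documents fetched via their pointers
//   "indexOnly" : false               // the document itself was still read
// }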
22. Who is Tokutek and what have we done?
• Tokutek’s Fractal Tree Index Implementations
  • MySQL Storage Engine (TokuDB)
  • BerkeleyDB API
  • File System (TokuFS)
• Recently added Fractal Tree Indexes to MongoDB 2.2
  • Existing indexes are still supported
• Source changes are available via our blog at www.tokutek.com/tokuview
• This is a work in progress (see roadmap slides)
23. MongoDB and Fractal Tree Indexes
As simple as:
db.test.ensureIndex({foo:1}, {v:2})
25. Indexing Options #2
db.test.ensureIndex({foo:1}, {v:2, blocksize:4194304, basementsize:131072, compression:quicklz, clustering:false})
• Basement node size, defaults to 128K.
• The basement is the smallest retrievable unit of a leaf node, which keeps point queries efficient.
26. Indexing Options #3
db.test.ensureIndex({foo:1}, {v:2, blocksize:4194304, basementsize:131072, compression:quicklz, clustering:false})
• Compression algorithm, defaults to quicklz.
• Supports quicklz, lzma, zlib, and none.
• LZMA provides 40% additional compression beyond quicklz, but needs more CPU.
• Decompression speeds of quicklz and lzma are similar.
27. Indexing Options #4
db.test.ensureIndex({foo:1}, {v:2, blocksize:4194304, basementsize:131072, compression:quicklz, clustering:false})
• Clustering indexes store data by key and include the entire document as the payload (rather than a pointer to the document).
• They always “cover” a query: no need to retrieve the document data separately (example below).
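As a usage sketch (hedged, reusing the option syntax from these slides): with a clustering index on foo, a range scan is served entirely from the index, since every entry already carries the whole document.

// Clustering secondary index: leaf entries hold whole documents, not pointers.
db.test.ensureIndex({foo:1}, {v:2, clustering:true})
// Range queries need no second lookup into the data files per match.
db.test.find({foo: {$gte: 10, $lt: 20}})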
28. How well does it perform?
Three Benchmarks
• Benchmark 1: Raw insertion performance
• Benchmark 2: Insertion plus queries
• Benchmark 3: Covered indexes vs. clustering indexes
31. Benchmarks…
Race Results
• First Place = John
• Second Place = Tim
• Third Place = Frank
Frank can say the following: “I finished third, but Tim was second to last.”
Understand benchmark specifics and review all results.
32. Benchmark 1: Overview
• Measure single threaded insertion performance.
• Document is URI (character), name (character), origin (character), creation date (timestamp), and expiration date (timestamp).
• Secondary indexes on URI, name, origin, expiration (set-up sketched after this slide).
• Machine specifics:
  – Sun x4150, (2) Xeon 5460, 8GB RAM, StorageTek Controller (256MB, write-back), 4x10K SAS/RAID 0
  – Ubuntu 10.04 Server (64-bit), ext4 filesystem
  – MongoDB v2.2.RC0
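A hedged sketch of the benchmark set-up (field values and the sample URI are illustrative; the slides specify only the schema and the four secondary indexes):

// One sample document matching the benchmark schema.
db.tokubench.insert({
  URI: "http://example.com/page/12345",
  name: "page-12345",
  origin: "crawler-7",
  creation: new Date(),
  expiration: new Date(Date.now() + 86400000)
})
// The four secondary indexes under test.
db.tokubench.ensureIndex({URI: 1})
db.tokubench.ensureIndex({name: 1})
db.tokubench.ensureIndex({origin: 1})
db.tokubench.ensureIndex({expiration: 1})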
35. Benchmark 1: Observations
• Fractal Tree Indexing insertion performance is 8x better than standard MongoDB indexing with journaling, and 11x without journaling.
• Fractal Tree Indexing insertion performance reaches steady state, even at 200 million insertions. MongoDB insertion performance seems to be in continual decline at only 50 million insertions.
• B-tree performance is great until the working data set > RAM.
36. Benchmark 2: Overview
• Measure single threaded insertion performance while querying for 1000 documents with a URI greater than or equal to a randomly selected value once every 60 seconds (reconstructed below).
• Document is same as benchmark 1.
• Secondary indexes on URI, name, origin, expiration.
• Fractal Tree Index on URI is clustering:
  – clustering indexes store the entire document inline
  – compression controls disk usage
  – no need to get document data from elsewhere
  – db.tokubench.ensureIndex({URI:1}, {v:2, clustering:true})
• Same hardware as benchmark 1.
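The periodic query might look like the following (a hedged reconstruction; the random URI generation is illustrative):

// Every 60 seconds: fetch 1000 documents with URI >= a random value.
var randomURI = "http://example.com/page/" + Math.floor(Math.random() * 1000000);
db.tokubench.find({URI: {$gte: randomURI}}).limit(1000)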
39. Benchmark 2: Observations
• Fractal Tree Indexing insertion performance is 10x better than standard MongoDB indexing.
• Fractal Tree Indexing query latency is 268x better than standard MongoDB indexing.
• B-tree performance is great until the working data set > RAM.
• Random lookups are bad.
...but what about MongoDB’s covered indexes?
40. Benchmark 3: Overview
• Same workload and hardware as benchmark 2.
• Create a MongoDB covered index on URI to eliminate lookups in the data files (query shape sketched below).
  – db.tokubench.ensureIndex({URI:1,creation:1,name:1,origin:1})
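For MongoDB to actually cover the query, the projection must name only indexed fields and exclude _id; a hedged sketch of benchmark 3's query shape:

// Covered query: the projection names only fields present in the compound
// index and excludes _id, so the index alone can answer it.
db.tokubench.find(
  {URI: {$gte: randomURI}},            // randomURI as in benchmark 2
  {_id: 0, URI: 1, creation: 1, name: 1, origin: 1}
).limit(1000)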
43. Benchmark 3: Observations
• Fractal Tree Indexing insertion performance is still 3.7x better than standard MongoDB indexing.
• Fractal Tree Indexing query latency is 3.2x better than standard MongoDB indexing (although the MongoDB performance is highly variable).
• B-tree performance is great until the working data set > RAM.
• MongoDB’s covered indexes can help a lot.
  – But what happens when I add new fields to my document?
    o Do I drop and re-create by including my new field?
    o Do I live without it?
  – Clustered Fractal Tree Indexes keep on covering your queries!
44. Roadmap: Continuing the Implementation
• Optimize Indexing Insert/Update/Delete Operations
  – Each of our secondary indexes is currently creating and committing a transaction for each operation.
  – A single transaction envelope will improve performance (sketched below).
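A toy sketch of that change (the engine API here is hypothetical, purely to show the shape of the fix): all secondary-index updates for one write share a single transaction instead of committing one each.

// Today (per the slide): one transaction per secondary-index update.
//   indexes.forEach(function (ix) {
//     var t = engine.begin(); ix.apply(t, doc); t.commit();
//   });
// Planned: one transaction envelope around all of them.
function applyWrite(engine, indexes, doc) {
  var txn = engine.begin();            // hypothetical engine API
  indexes.forEach(function (ix) {
    ix.apply(txn, doc);                // every index update joins the envelope
  });
  txn.commit();                        // a single commit / log sync
}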
45. Roadmap: Continuing the Implementation
• Add Support for Parallel Array Indexes
  – MongoDB does not support indexing the following two fields (see the example below):
    o {a: [1, 2], b: [1, 2]}
  – “it could get out of hand”
  – Ticketed on 3/24/2010, jira.mongodb.org/browse/SERVER-826
  – Benchmark coming soon…
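The limitation is easy to reproduce in stock MongoDB (the exact error text varies by version; this illustration is hedged accordingly):

// A compound index over two fields...
db.test.ensureIndex({a: 1, b: 1})
// ...rejects any document in which both fields are arrays:
db.test.insert({a: [1, 2], b: [1, 2]})
// => error: "cannot index parallel arrays [b] [a]"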
46. Roadmap: Continuing the Implementation
• Add Crash Safety
  – Our implementation is not [yet] crash safe with the MongoDB PK/heap storage mechanism.
  – The MongoDB journal is separate from the Fractal Tree Index logs.
  – Need to create a transactional envelope around both of them.
47. Roadmap: Continuing the Implementation
• Replace MongoDB data store and PK index
  – A clustering index on _id eliminates the need for two storage systems.
  – Compression greatly reduces disk footprint.
  – This is a large task.
48. We are looking for evaluators!
Email me at tim@tokutek.com
See me after the presentation
49. Questions?
Tim Callaghan
tim@tokutek.com
@tmcallaghan
More detailed benchmark information in my blogs at www.tokutek.com/tokuview