The document provides an overview of indexing and querying in Couchbase Server 2.0. It discusses view basics like index definition, building, and querying phases. It covers topics like replica indexes, failover, primary and secondary indexes, and best practices. Examples are provided for simple indexing, aggregations, time-based rollups, and leaderboards using views.
Couchbase Server 2.0 - Indexing and Querying - Deep dive
1. Couchbase Server 2.0
Indexing and Querying
Quick Overview
Dipti Borkar
Director, Product Management
2. What we’ll talk about
• View basics
• Lifecycle of a view
Index definition, build, and query phase
Indexing details
• Replica indexes, failover and compaction
• Primary and Secondary indexes
• View best practices
• Additional patterns
3. JSON Documents
• Map more closely to objects or entities
• CRUD Operations, lightweight schema
{
"fields" : ["with basic types", 3.14159, true],
"like" : "your favorite language"
}
• Stored under an identifier key
client.set("mydocumentid", myDocument);
mySavedDocument = client.get("mydocumentid");
4. What are Views?
• Extract fields from JSON documents and produce an index of
the selected information
5. Views – The basics
• Define materialized views on JSON documents and then query
across the data set
• Using views you can define
• Primary indexes
• Simple secondary indexes (most common use case)
• Complex secondary, tertiary and composite indexes
• Aggregations (reduction)
• Views are eventually indexed: the index is updated asynchronously, after documents reach disk
• Queries are eventually consistent with respect to documents
• Built using Map/Reduce technology
• Map and Reduce functions are written in JavaScript
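To make the map/reduce idea concrete, here is a minimal sketch that simulates a view in plain JavaScript. It is not Couchbase code: `buildView`, `countByKey`, and the sample documents are hypothetical, and `emit` is passed in as a third argument for illustration (on the server, `emit` is provided by the view engine and map functions take only `doc` and `meta`).

```javascript
// Simulated view build: run the map function over every document,
// collect the emitted rows, and keep them sorted by key.
function buildView(docs, mapFn) {
  const rows = [];
  for (const [id, doc] of Object.entries(docs)) {
    // Each emit() call adds one index row tagged with the document id.
    const emit = (key, value) => rows.push({ id, key, value });
    mapFn(doc, { id }, emit);
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Simulated reduce: count rows per key (similar in spirit to the
// built-in _count reduce queried with grouping).
function countByKey(rows) {
  const counts = {};
  for (const row of rows) counts[row.key] = (counts[row.key] || 0) + 1;
  return counts;
}

// Hypothetical sample documents.
const docs = {
  beer_1: { type: "beer", name: "Pale Ale", brewery: "Alpha" },
  beer_2: { type: "beer", name: "Stout", brewery: "Beta" },
  beer_3: { type: "beer", name: "Porter", brewery: "Alpha" },
  brewery_1: { type: "brewery", name: "Alpha" },
};

// Secondary index: beers keyed by brewery name.
const byBrewery = (doc, meta, emit) => {
  if (doc.type === "beer" && doc.brewery) emit(doc.brewery, null);
};

const rows = buildView(docs, byBrewery);
const beerCounts = countByKey(rows);
console.log(rows.map((r) => r.key)); // [ 'Alpha', 'Alpha', 'Beta' ]
console.log(beerCounts); // { Alpha: 2, Beta: 1 }
```

The map step decides which documents appear in the index and under which key; the reduce step aggregates over the sorted rows.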
8. Eventually indexed Views – Data flow
[Diagram: the app server writes Doc 1 to a Couchbase Server node. The document lands in the managed cache, then enters the replication queue (to other nodes) and the disk queue. Once written to disk, it is picked up by the view engine.]
9. Distributed Indexing and Querying
• Indexing work is distributed amongst nodes
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
[Diagram: App Server 1 and App Server 2, each with a Couchbase client library and cluster map, create an index / view and send queries to a three-node Couchbase Server cluster. Each server holds active documents plus replica documents; user-configured replica count = 1.]
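The "queries combine the results from required nodes" step can be pictured as a merge of per-node result lists. The sketch below assumes each node returns rows already sorted by key; the slides do not describe the actual merge, so `mergeSorted` and the sample keys are purely illustrative.

```javascript
// Merge several key-sorted row lists into one key-sorted list.
function mergeSorted(nodeResults) {
  const merged = [];
  const cursors = nodeResults.map(() => 0); // next unread row per node
  for (;;) {
    let best = -1;
    for (let n = 0; n < nodeResults.length; n++) {
      const row = nodeResults[n][cursors[n]];
      if (row === undefined) continue; // this node is exhausted
      if (best === -1 || row.key < nodeResults[best][cursors[best]].key) {
        best = n;
      }
    }
    if (best === -1) return merged; // every node is exhausted
    merged.push(nodeResults[best][cursors[best]++]);
  }
}

// Hypothetical per-node results, each sorted by key.
const server1 = [{ key: "Doc 2" }, { key: "Doc 5" }, { key: "Doc 9" }];
const server2 = [{ key: "Doc 1" }, { key: "Doc 3" }, { key: "Doc 8" }];
const server3 = [{ key: "Doc 4" }, { key: "Doc 6" }, { key: "Doc 7" }];

const combined = mergeSorted([server1, server2, server3]);
console.log(combined.map((r) => r.key).join(", "));
```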
10. DEFINE Index / View Definition in JavaScript
CREATE INDEX City ON Brewery.City;
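As a rough illustration, the SQL-style statement above corresponds to a map function along these lines. The `type` and `city` field names and the sample documents are assumptions, and the `emit` defined here is only a local stand-in so the sketch runs outside the server (inside the view engine, `emit` is provided for you).

```javascript
// Local stand-in for the view engine's emit(); collects index rows.
const indexRows = [];
let currentId = null;
function emit(key, value) {
  indexRows.push({ id: currentId, key, value });
}

// Map function: index brewery documents by city.
function cityMap(doc, meta) {
  // Assumed schema: brewery documents carry type "brewery" and a city field.
  if (doc.type === "brewery" && doc.city) {
    emit(doc.city, null);
  }
}

// Hypothetical sample documents.
const sample = {
  brewery_alaskan: { type: "brewery", name: "Alaskan Brewing", city: "Juneau" },
  brewery_nocity: { type: "brewery", name: "No City Brewing" },
  beer_amber: { type: "beer", name: "Amber" },
};

for (const [id, doc] of Object.entries(sample)) {
  currentId = id;
  cityMap(doc, { id });
}
console.log(indexRows);
```

Only the document that is a brewery and has a city ends up in the index; the other two are skipped by the guard.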
11. BUILD Distributed Index Build Phase
• Optimized for lookups, in-order access and aggregations
• View reads are from disk (different performance profile than GET/SET)
• Views built against every document on every node
– Group them in a design document
• Views are automatically kept up to date
12. QUERY Dynamic Queries with Optional Aggregation
• Eventually consistent with respect to document updates
• Efficiently fetch a document or group of similar documents
• Queries will use cached values from B-tree inner nodes when possible
• Take advantage of in-order tree traversal with group_level queries
Query ?startkey="J"&endkey="K"
{"rows":[{"key":"Juneau","value":null}]}
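The range query above can be mimicked over a sorted list of rows. This is a sketch only: `queryRange` and the sample index are hypothetical, it uses plain JavaScript string comparison rather than the server's collation rules, and it treats `endkey` as inclusive.

```javascript
// Return the rows whose key falls between startkey and endkey (inclusive).
function queryRange(sortedRows, { startkey, endkey }) {
  return {
    rows: sortedRows.filter((r) => r.key >= startkey && r.key <= endkey),
  };
}

// Hypothetical city index, already sorted by key.
const cityIndex = [
  { key: "Boston", value: null },
  { key: "Juneau", value: null },
  { key: "Portland", value: null },
];

const result = queryRange(cityIndex, { startkey: "J", endkey: "K" });
console.log(JSON.stringify(result)); // {"rows":[{"key":"Juneau","value":null}]}
```

Because the index is ordered by key, a real engine can seek to the start key and read in order until it passes the end key instead of scanning every row as this sketch does.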
13. Index building details
– All the views within a design document are incrementally updated
when the view is accessed or auto-indexing kicks in
– Automatic view updates
• In addition to forcing an index build at query time, active & replica indexes are
updated every 3 seconds of inactivity if there are at least 5000 new changes
(configurable)
– The entire view is recreated if the view definition has changed
– Views can be conditionally updated by specifying the “stale”
argument to the view query
– The index information stored on disk consists of the combination
of both the key and value information defined within your view.
14. Queries run against stale indexes by default
• stale=update_after (default if nothing is specified)
– always get fastest response
– can take two queries to read your own writes
• stale=ok
– auto update will trigger eventually
– might not see your own writes for a few minutes
– least frequent updates -> least resource impact
• stale=false
– Use with “set with persistence” if data needs to be included in
view results
– BUT be aware of delay it adds, only use when really required
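A small helper can make the choice of `stale` explicit at each call site. The sketch below only builds a query URL; the host, port, bucket, design document, and view names are placeholders, and the path layout is an assumption to verify against the REST documentation for your server version.

```javascript
// The three values discussed above.
const STALE_VALUES = ["update_after", "ok", "false"];

// Build a view query URL with an explicit stale setting.
// Defaults to update_after, matching the server default described above.
function viewQueryUrl(baseUrl, bucket, designDoc, view, stale = "update_after") {
  if (!STALE_VALUES.includes(stale)) {
    throw new RangeError("stale must be one of: " + STALE_VALUES.join(", "));
  }
  // Assumed path layout for view queries.
  const path = `/${bucket}/_design/${designDoc}/_view/${view}`;
  return `${baseUrl}${path}?stale=${encodeURIComponent(stale)}`;
}

// Placeholder names throughout.
const fastUrl = viewQueryUrl("http://localhost:8092", "beer-sample", "beer", "by_city");
const freshUrl = viewQueryUrl("http://localhost:8092", "beer-sample", "beer", "by_city", "false");
console.log(fastUrl);
console.log(freshUrl);
```

Keeping `stale=false` as an explicit opt-in reflects the guidance above: use it only where a query really must see the latest persisted writes.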
15. Views and Replica indexes
• In addition to replicas for data (up to 3 copies), optionally
create replica for indexes
• Each node manages replica index data structures
• Set at a bucket level
• Replica index populated from replica data
• Replica index is used after a failover
16. Views and failover
• Replica indexes enabled on failover
• Replica indexes are rebuilt on replica nodes
– Automatically incrementally built based on replica data
– Updated every 3 seconds of inactivity if there are at least 5000
new changes
– Not copied/moved to be consistent with persisted replica data
17. View Compaction
• Compaction is ONLINE
• Reclaims empty allocated space from disk
• Indexes are stored on disk for active vBuckets on each
node and updated in append-only manner
• Auto-compaction performed in the background
– Set the database fragmentation levels
– Set the index fragmentation levels
– Choose a schedule
– Global and bucket specific settings
18. Development vs. Production Views
• Development views index a
subset of the data.
• Publishing a view builds the
index across the entire
cluster.
• Queries on production views
are scattered to all cluster
members and results are
gathered and returned to
the client.
27. View writing guidance
• Move frequently used views out to a separate design document
– All views in a design document are updated at the same time
– This can result in increased index building time if all views are in a single design
document, especially for frequently accessed views.
– However, grouping views into a smaller number of design documents improves overall
performance
• Try to avoid computing too many things with one view
• Use built-in reduces where possible - custom reduces are not optimized
• Check for attribute existence
function(doc, meta) {
  // Only index documents that actually have an ingredient attribute
  if (doc.ingredient) {
    emit(doc.ingredient.ingredtext, null);
  }
}
28. View writing guidance
• Do not include the document in the view value
– Instead either use the GET / SET API or the API that includes documents filtered by
the query [example: willIncludeDocs()]
– Emit either null or the ID instead (meta.id) in your key or value data
emit(doc.name, null)
• Don’t emit too much data into a view value
– Use views to filter documents
– Then use the data path to access the matched documents
• Use Document Types to make views more selective
function(doc, meta) {
  // Index only documents whose type is "player"
  if (doc.type == "player") {
    emit(doc.experience, null);
  }
}
29. What impact do views have on the system?
• Complexity of the index: CPU
• Size of the value emitted and selectivity: Disk size, I/O
• Replica index: Disk size, I/O, CPU
• Number of design docs: CPU, I/O, Disk size
– 4 active and 2 replica design documents are built in parallel by default
– Can be changed using the maxParallelIndexers and
maxParallelReplicaIndexers parameters
• Compaction of views: CPU, I/O
• Rebalance time: increases with views, to support consistent
query results during rebalance
– Can be disabled using the indexAwareRebalanceDisabled parameter
30. Views and OS caching
• File system cache availability for the index has a big impact
on performance
• Indexes are disk based and should have sufficient file system
cache available for faster query access
• In-house performance results show that doubling system
cache availability
– halves query latency
– increases throughput by 50%
• Runs based on 10 million items with 16GB bucket quota and
4GB, 8GB system RAM availability for indexes
38. dateToArray() is your friend
• String or Integer based timestamps
• Output optimized for group_level queries
• array of JSON numbers:
[2012,9,21,11,30,44]
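To show the shape of that output, here is a stand-in written in plain JavaScript. It is not the server's built-in: `dateToArraySketch` is a hypothetical name, and the use of UTC fields is an assumption made so the result does not depend on the local time zone.

```javascript
// Convert a timestamp (ISO string or epoch milliseconds) into
// [year, month, day, hour, minute, second], with month counted from 1.
function dateToArraySketch(timestamp) {
  const d = new Date(timestamp);
  return [
    d.getUTCFullYear(),
    d.getUTCMonth() + 1, // JavaScript months are 0-based
    d.getUTCDate(),
    d.getUTCHours(),
    d.getUTCMinutes(),
    d.getUTCSeconds(),
  ];
}

const fromString = dateToArraySketch("2012-09-21T11:30:44Z");
const fromInteger = dateToArraySketch(Date.UTC(2012, 8, 21, 11, 30, 44));
console.log(JSON.stringify(fromString)); // [2012,9,21,11,30,44]
console.log(JSON.stringify(fromInteger)); // [2012,9,21,11,30,44]
```

Emitting the array as the view key puts year, month, day and so on in separate positions, which is what lets group_level cut the key at any of them.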
39. group_level=2 results
• Monthly rollup
• Sorted by time. To rank by value, sort the query results in your
application (there is no chained map-reduce)
40. group_level=3 - daily results - great for graphing
• Daily, hourly, minute or second rollup all possible with the
same index.
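The monthly and daily rollups can be reproduced with a small simulation. This sketch assumes rows keyed by a date array with a value of 1 per event and sums the values per key prefix; `groupLevel` and the sample rows are hypothetical, and the rows are assumed to arrive sorted by key, as they would from an index.

```javascript
// Group rows by the first `level` elements of the array key and sum values.
function groupLevel(rows, level) {
  const groups = new Map(); // insertion order follows the sorted input
  for (const { key, value } of rows) {
    const prefix = JSON.stringify(key.slice(0, level));
    groups.set(prefix, (groups.get(prefix) || 0) + value);
  }
  return [...groups].map(([prefix, total]) => ({
    key: JSON.parse(prefix),
    value: total,
  }));
}

// Hypothetical events, sorted by key.
const events = [
  { key: [2012, 9, 20, 8, 15, 0], value: 1 },
  { key: [2012, 9, 21, 11, 30, 44], value: 1 },
  { key: [2012, 9, 21, 17, 5, 9], value: 1 },
  { key: [2012, 10, 1, 9, 0, 0], value: 1 },
];

const monthly = groupLevel(events, 2);
const daily = groupLevel(events, 3);
console.log(JSON.stringify(monthly)); // [{"key":[2012,9],"value":3},{"key":[2012,10],"value":1}]
console.log(JSON.stringify(daily));
```

The same rows serve every granularity; only the group_level changes, which is why one index is enough for monthly through per-second rollups.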
If you are ingesting Tweets, git commits, and LinkedIn API data, there's little value in transforming it before you save it. Just store it and sort it out later. The same holds for user data.
Schemaless is good as far as it goes, but what it's really saying is: "don't worry about the database", so a lot of the patterns move to the application. That's what this section is about.
1. A set request comes in from the application.
2. Couchbase Server responds that the key is written.
3. Couchbase Server then replicates the data out to memory on the other nodes.
4. At the same time, it puts the data into a write queue to be persisted to disk.
Bulletize the text. Make sure the builds work.
Defined via SDKs or the administration console. Deploying a new view to production is an online operation, but can be heavy.
no downtime deploy?
If Sum = 11 and Count = 2, the Average is 5.5
ratings are stored in a hash to ensure each user can only rate each beer once
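The two notes above fit together in a few lines. In this sketch the ratings hash is keyed by user id, so a second rating from the same user overwrites the first, and the average is the sum divided by the count. The user ids and values are made up to match the Sum = 11, Count = 2 example.

```javascript
// Ratings keyed by user id: one entry per user, so a user can
// only contribute one rating per beer.
const ratings = {};
ratings["user_a"] = 5;
ratings["user_b"] = 4;
ratings["user_b"] = 6; // user_b rates again: the old rating is replaced

// Compute sum, count and average over the hash.
function ratingStats(ratingsByUser) {
  const values = Object.values(ratingsByUser);
  const count = values.length;
  const sum = values.reduce((total, r) => total + r, 0);
  return { sum, count, average: count === 0 ? null : sum / count };
}

const stats = ratingStats(ratings);
console.log(stats); // { sum: 11, count: 2, average: 5.5 }
```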