Scaling Search at Lendingkart discusses how Lendingkart scaled their search capabilities to handle large increases in data volume. They initially tried scaling databases vertically and horizontally, but searches were still slow at around 8 seconds. They implemented Elasticsearch for its near-real-time search, high scalability, and out-of-the-box functionality. Logstash and mongo-connector were used to seed data from MySQL and MongoDB into Elasticsearch, and custom analyzers and mappings were developed. Search times then dropped to ~230ms and aggregations to ~200ms, allowing the business to scale as transactional data grew 3000% and leads 250%.
1. Scaling Search at Lendingkart
Shivendra Singh, Swapnil Bagadia and Nitesh Kumar
16 June 2018
2. Scaling
Scaling/Scalability is the capability of a system to handle a
growing amount of work, or its potential to be enlarged to
accommodate that growth. — Wikipedia
4. Scaling and High Availability (Application)
● Application does not change too often (static)
● If we need more performance, we add more resources
● Easy to scale and achieve High Availability
● But what happens with the database?
5. Scaling and High Availability (Databases)
● We have to distribute the changes to all the databases in real time
● It has to be available for all the applications
● The application has to be able to make changes
7. Horizontal Scaling
● Master-Master setup
○ MySQL Cluster
○ MariaDB / Galera / Percona XtraDB
○ Problems: messy, hard to identify/fix when issues arise, auto-increment conflicts
● Sharding databases
○ Problem: complexity of managing shards at the application level
● Multiple read replicas
○ Single master with multiple read replicas used by the application
○ Separate DB for analytics
○ Problem: replication lag
9. Query Optimizations
● Use indexes for better read performance
○ Multiple non-clustered/secondary indexes
○ Too many and too few indexes are both bad
○ Check for duplicate and unused indexes
○ Queries can run without indexes, but may take far longer
○ Best if all WHERE and JOIN clauses use an index for lookups
● Monitor and force use of indexes if required
○ FORCE INDEX to make MySQL use a specific index
● Fix top offenders (repeatedly)
○ Slow query logs (using long_query_time)
○ Use explain on these queries
■ Using Index - Good
■ Using Filesort, Using temporary - Bad
10. Server Side Optimizations for performance
● Sensible timeout for queries
○ Max query execution time
○ Lock wait timeouts
● Changed the transaction isolation level from Repeatable Read to Read Committed
11. Separating out Databases
● Smaller databases that are completely decoupled and independent
● Pros
○ Simplicity
○ More cost effective
○ High Availability
○ Enforces loose coupling across data stores
○ Allows better usage of connections to DB
● Cons
○ Hard to maintain referential integrity across different DBs
○ Analytics/reporting across stores becomes harder
○ Transaction management across DBs is harder
○ Does not solve the problem of a single table growing really large
12. Monitoring and Key Metrics
● Memory Usage
○ Often most important for performance
○ Your working set should fit comfortably in memory
○ Less memory = more pressure on IO
● IOPS
○ 1x IOPS/GB, burstable up to 3x, for General Purpose SSD
○ Provisioned IOPS for better performance
● CPU Usage
● Free disk space
● Replication Lag (in Read Replicas)
● Database Connections
13. Challenges of direct search in DB
● Searching on non indexed columns
● Perils of using LIKE queries
○ Full table scan
● Returning all columns
● Aggregations were killing the database performance
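The LIKE peril can be made concrete (again with SQLite as a runnable stand-in for MySQL, and an invented table): even with an index on the column, a pattern with a leading wildcard cannot use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (name TEXT)")
conn.execute("CREATE INDEX idx_name ON leads (name)")

# Equality lookup: satisfied from the index.
eq_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM leads WHERE name = 'lendingkart'"
).fetchall()

# Leading-wildcard LIKE: the engine has to scan every row.
like_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM leads WHERE name LIKE '%kart%'"
).fetchall()
```

A B-tree index orders values left-to-right, so `'%kart%'` gives it no prefix to seek on; this is the full-table-scan behaviour the slide calls out.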
15. Challenges of direct search on mongoDB
● Single index
○ MongoDB supports the creation of user-defined ascending/descending indexes on a single field of a document.
○ Default index on the _id field during the creation of a collection.
○ Problem: 24 searchable fields in a document. So 24 indexes???
● Case insensitive search
○ LendingKart or LENdingkart or lendingkart or lendingKART etc.
○ Problem: MongoDB's case-insensitive regular expression search cannot use indexes efficiently.
● Prefix search
○ Search query: John
■ First/middle/last name: John
■ Company name: Johnson
■ Email: johnkumar@gmail.com
○ Problem: slow performance
● Sorting and pagination
○ Sorting on specific fields like date or some id.
○ Pagination to separate a big result set into smaller chunks.
○ Problem: MongoDB has an in-memory sort limit when no index supports the sort.
16. Chain of thought for search improvement
● Compound index
○ An index that contains references to multiple fields within a document.
○ MongoDB imposes a limit of 31 fields for any compound index.
○ Example:
{
  "_id": ObjectId(...),
  "leadId": 1234,
  "companyName": "lendingkart",
  "city": "bangalore",
  "email": "lendingkart@abc.com",
  "phone": "9999999999"
}
db.leads.createIndex( { "leadId": 1, "companyName": 1, "email": 1 } )
● Elastic search
○ An open-source, broadly-distributable, readily-scalable, enterprise-grade search engine.
17. Why we needed some magic!!!
● Searches in MySQL were slow
○ Around 8 seconds for normal search
● Searches in MongoDB were slow
○ Around 8 seconds for normal search
● Aggregations were slow
○ Taking 21 seconds - 36 seconds for aggregations
● Data Growth
○ Transactional/Application from 0.04M to 1.2M
○ Non Transactional/Leads from 0.6M to 2M
● Our goal was to get searches to happen within 250ms
18. ElasticSearch - You know for search....
Wer Ordnung hält, ist nur zu faul zum Suchen.
(If you keep things tidily ordered, you're just too lazy to go
searching.)
—German proverb
20. What is Elasticsearch?
● Full-text search and analytics engine
○ It allows you to store, search, and analyze big volumes of data quickly.
● Near Real Time(NRT)
○ Slight latency (normally one second) from the time you index a document until the time it becomes
searchable.
● Highly scalable
○ Elastic, as the name suggests. It’s clustered by default— you call it a cluster even if you run it on a
single server.
○ Increase/Decrease nodes as per requirement
● It just works...
○ Open-source/Free built on top of Apache Lucene, in Java(inherently cross-platform)
○ Ships with sensible defaults, keeping complex theories for leisure reading
○ Mostly, plug and play.
○ Much more than Lucene - JSON Based, Distributed, web server.
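A minimal sketch of the "it just works" claim (the `leads` index and field names here are hypothetical, not from the deck): indexing a document and searching it back is two JSON-over-HTTP calls, with no schema declared up front.

```
PUT /leads/_doc/1
{ "companyName": "lendingkart", "city": "bangalore" }

GET /leads/_search
{ "query": { "match": { "companyName": "lendingkart" } } }
```

Elasticsearch infers a default mapping for the fields on first write, which is the plug-and-play behaviour the slide describes.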
21. Sharding for scalability
○ To add data to Elasticsearch, we need an index—a place to store related data. In reality, an index is just
a logical namespace that points to one or more physical shards.
○ Each shard can have zero or more replicas
○ Replicas live on different servers (server pools) for failover
○ One node in the cluster goes down? No problem.
○ Master: automatic master detection + failover
○ The master is responsible for distribution/balancing of shards
24. Data Seeding from MySQL to ES
● What were the options?
○ A binlog processor service syncing your MySQL data into Elasticsearch automatically
○ An asynchronous Kafka (as a queue) pipeline
● Why go through all that pain when we can get the same from the ELK stack itself?
○ Logstash was a perfect fit for our requirements
○ 100% Config Based
○ Not a single Line of Code
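A config-only pipeline of the kind described above might look like the following sketch, using Logstash's JDBC input and Elasticsearch output plugins. The table, columns, paths, and credentials are placeholders, not Lendingkart's actual setup.

```
input {
  jdbc {
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/leads_db"
    jdbc_user => "user"
    jdbc_password => "password"
    schedule => "* * * * *"   # poll every minute
    statement => "SELECT * FROM leads WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "leads"
    document_id => "%{lead_id}"   # stable id keeps re-runs idempotent
  }
}
```

The `:sql_last_value` bookmark means each scheduled run picks up only rows changed since the last run, which is what makes this "not a single line of code".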
25. Simplicity at its best- Logstash
● How does Logstash work?
○ Ah, just like the others, Logstash has input/filter/output plugins.
○ Attention: Logstash processes events, not (only) log lines!
○ "Inputs generate events, filters modify them, outputs ship them elsewhere." — [the life of an event in Logstash]
● Plugin Architecture
○ Input plugins: capture external data and transform it into Logstash events
○ Filter plugins: process/transform events
○ Output plugins: format events and ship them to an external destination
○ All Plugins
27. Logstash Configurations - introducing multiple pipelines
● Lack of congestion isolation: backpressure
● One size does not fit all: TCP-to-TCP (fast and light) vs JDBC-to-ES (large and low volume)
● The solution before Logstash 6.0: multiple Logstash instances (RPM/DEB, multi-JVM instances)
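With Logstash 6.0's multiple pipelines, the isolation can be expressed in a single `pipelines.yml`. This is a sketch; the pipeline ids, paths, and worker counts are illustrative assumptions.

```yaml
# pipelines.yml - a slow JDBC pipeline and a fast TCP pipeline,
# isolated so backpressure in one does not stall the other
- pipeline.id: jdbc_to_es
  path.config: "/etc/logstash/conf.d/jdbc_es.conf"
  pipeline.workers: 1
  queue.type: persisted      # survive restarts, absorb bursts
- pipeline.id: tcp_to_tcp
  path.config: "/etc/logstash/conf.d/tcp.conf"
  pipeline.workers: 4
  pipeline.batch.size: 125
```

Each pipeline gets its own queue and worker pool inside one JVM, replacing the old multi-instance workaround.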
28. Data seeding from mongo to elastic cluster
● How to copy data from mongo to elastic cluster?
○ Mongo-connector
● Do we need to copy all fields and their values of a document from mongo to elastic cluster?
○ Useful(or searchable) data on cluster
● What is Oplog (operations log)?
● How mongo-connector reads oplog to copy documents (new or updated documents) on elastic cluster?
● Can we use a custom configuration file to specify some options to mongo-connector?
● How to track whether mongo-connector has stopped syncing data?
29. MongoDB Connector
Mongo-connector creates a pipeline from a MongoDB cluster to an Elasticsearch cluster, copying your documents from MongoDB to the target system.
30. OpLog(operations log)
● The oplog (operations log) keeps a rolling record of all operations that modify the data stored in your databases. Example:
> use test //switched to db test
> db.leads.insert({"leadId":1})
> db.leads.update({"leadId":1}, {$set : {"city": "bangalore"}})
● Oplog entries for the operations above:
Insert:
{ "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i",
"ns" : "test.leads", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "leadId" : 1 } }
Update:
{ "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u",
"ns" : "test.leads", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") },
"o" : { "$set" : { "city": "bangalore" } } }
op: the write operation [i: insert, u: update]
31. How mongo-connector reads oplog to copy documents (new or updated
documents) on elastic cluster?
● Mongo Connector creates an oplog progress file (oplog.timestamp).
● The oplog progress file keeps track of the latest oplog entry seen for each replica set to which Mongo
Connector is connected.
● Mongo Connector uses this file to decide, where to begin reading the oplog on startup.
● When the oplog progress file cannot be found, or is empty, Mongo Connector will begin pulling data from all MongoDB collections in the "collection dump" phase.
● The oplog progress file is then updated with the most recent timestamp from before the dump
happened.
● Mongo Connector then applies all oplog operations from before the dump, so that the copied
documents will be up-to-date with what's on MongoDB.
32. Can we use a custom configuration file to specify some options to mongo-connector?
● You can use a custom configuration file to specify some options to mongo-connector.
● To invoke mongo-connector with a configuration file option, run:
○ mongo-connector -c config.json
● Configuration options:
○ excludeFields: comma-separated list of fields to exclude from MongoDB documents (i.e. not read from MongoDB). Example [database: test, collection: leads]:
"test.leads": {
  "excludeFields": ["isSynced", "comments", "dndMobile", "isDuplicateLead"]
}
○ oplogFile: The path to the oplog progress file.
○ batchSize: Number of records processed from the oplog before updating the timestamp file.
■ default bulk size is 1000 docs
33. How to track whether mongo-connector has stopped syncing data?
● Causes:
○ High write-load.
○ Mongo-connector connection with mongoDB or cluster got interrupted.
● Solution:
○ Write a script which runs at a scheduled time.
○ The script queries the total document counts from both Mongo and Elastic.
○ If the difference in counts is greater than a threshold, it sends a notification.
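The scheduled check above can be sketched as follows. In production the two counts would come from the real clients (e.g. pymongo's `count_documents({})` and Elasticsearch's `_count` API); here they are plain parameters so the drift logic itself is testable, and the threshold is an assumed value.

```python
def sync_alert_needed(mongo_count, es_count, threshold=100):
    """Return True when the two stores have drifted apart by more than threshold docs."""
    return abs(mongo_count - es_count) > threshold

# A cron-scheduled job would do something like (clients are hypothetical):
# if sync_alert_needed(get_mongo_count(), get_es_count()):
#     send_notification("mongo-connector may have stopped syncing")
```

Comparing absolute counts tolerates a small lag from normal write load while still catching a stalled connector.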
34. ES Analyzers
● An analyzer (whether built-in or custom) is just a package of three lower-level building blocks: character filters (zero or more), exactly one tokenizer, and token filters (zero or more).
● Character filters - A character filter receives the original text as a stream of characters and can
transform the stream by adding, removing, or changing characters.
● A tokenizer receives a stream of characters, breaks it up into individual tokens (usually
individual words), and outputs a stream of tokens.
● A token filter receives the token stream and may add, remove, or change tokens.
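The three building blocks compose into a pipeline. This toy sketch mirrors that shape in plain Python (real Elasticsearch analyzers run inside Lucene; the specific filters here are illustrative, not the deck's actual configuration):

```python
import re

def char_filter(text):
    # character filter: strip HTML-like tags from the raw character stream
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # tokenizer: break the character stream into word tokens
    return re.findall(r"\w+", text)

def token_filter(tokens):
    # token filter: lowercase every token
    return [t.lower() for t in tokens]

def analyze(text):
    # analyzer = char filters -> tokenizer -> token filters
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>LendingKart</b> Bangalore"))  # ['lendingkart', 'bangalore']
```

The lowercase token filter is exactly what makes the case-insensitive search from the MongoDB slides free in Elasticsearch: both indexed text and query strings pass through the same pipeline.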
36. Default Analyzers
● Standard Analyzer
The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most
punctuation, lowercases terms, and supports removing stop words.
● Simple Analyzer
The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
● Whitespace Analyzer
The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.
● Stop Analyzer
The stop analyzer is like the simple analyzer, but also supports removal of stop words.
● Keyword Analyzer
The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
● Language Analyzers
Elasticsearch provides many language-specific analyzers like english or french.
● Fingerprint Analyzer
The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.
39. Analyzed Mappings
● How to analyze a field?
● How to analyze using an analyzer?
● How to analyze your query string?
● Term Query vs Match Query
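As a sketch of the term-vs-match distinction (the `leads` index and `companyName` field are hypothetical): a term query is not analyzed and must match a stored token exactly, while a match query first runs the query string through the field's analyzer.

```
GET /leads/_search
{ "query": { "term": { "companyName.keyword": "LendingKart" } } }

GET /leads/_search
{ "query": { "match": { "companyName": "LendingKart" } } }
```

With a standard analyzer the stored token is "lendingkart", so the term query above finds nothing while the match query (whose input gets lowercased the same way) matches.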
40. Search-Sort-Filter Operations
● Where to perform sorting/pagination?
○ Direct mongoDB or elastic cluster
● How to perform prefix/match/smart search?
○ Searchable fields: first/middle/last name, company name, status/substatus, leadId, email, phone
○ Search queries
■ Query1: Lendingkart
■ Query2: +91 9999999999
■ Query3: LEA-1234
■ Query4: 9999999999lkart@gmail.com
● How to perform case insensitive search?
○ LendingKart
○ LENdingkart
○ LendingKARt
○ lendingkart
○ LENDINGKART
41. Aggregations
Aggregations allow us to ask sophisticated questions of our data. A combination of buckets and metrics.
Snapshot performance improvement: 21-36 sec to ~200ms
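The buckets-and-metrics combination looks like the following sketch (field names are illustrative assumptions, not from the deck): a terms aggregation builds one bucket per city, and a nested avg computes a metric inside each bucket.

```
GET /leads/_search
{
  "size": 0,
  "aggs": {
    "leads_per_city": {
      "terms": { "field": "city.keyword" },
      "aggs": {
        "avg_loan_amount": { "avg": { "field": "loanAmount" } }
      }
    }
  }
}
```

`"size": 0` skips returning documents entirely, so the response is just the aggregation tree, which is part of why these run in milliseconds rather than the 21-36 seconds the database took.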
42. Relevance Score
● Boolean model to find matching documents:
full AND text AND search AND (elasticsearch OR lucene)
● Term frequency/inverse document frequency
tf(t in d) = √frequency
idf(t) = 1 + log ( numDocs / (docFreq + 1))
44. Numbers
● Data Growth
● Transactional/Application: from 0.04M to 1.2M+ (~3000%)
● Non-transactional/Leads: from 0.6M to 2M+ (~250%)
● Speed of search
● Searches came down from 8 seconds to ~230ms
● Aggregations came down from 21-36 seconds to ~200ms
Scaling is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
It is the ability of a system to handle efficiently more work than it typically performs without going down, or otherwise the ability to be enlarged in order to perform more work efficiently
A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.
Most of the time we are concerned with load scalability: the ability of a distributed system to easily expand and contract its resource pool to accommodate heavier or lighter loads. Alternatively, the ease with which a system or component can be modified (added or removed) to accommodate changing load.
Also referred to as scaling up (increasing the hardware capability of the machine).
Storage and instance type are decoupled in RDS.
There is minimal downtime when scaling up in a Multi-AZ environment, because the standby database gets upgraded first and then a failover occurs to the newly sized database. A Single-AZ instance will be unavailable during the scale operation.
You can scale up/down when needed, but RDS does not allow downgrading the size of the disk; you can only upgrade it.
Also referred to as scaling out(add more machines). To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. An example might be scaling out from one web server system to three.
Problems with a master-master setup: management difficulty, the need for rigorous monitoring, and issues that are hard to diagnose and fix when they arise.
Multiple read replicas: replication lag under write-heavy load.
Sharding challenges: complexity at the application level.
We are using a multiple-read-replica setup.
There are MySQL Connectors that allow you to do read/write splitting.
In addition to using a MySQL Connector, you can add a load balancer between your application and database servers. You make this addition so that you have a single database endpoint presented to the application. This approach allows for a more dynamic environment where you can transparently add or remove read replicas behind the load balancer without constantly updating the database connection string of the application. You can also perform a custom health check by using scripts.
- Run large, repeating reporting queries and batch jobs on the slave instead
- Point completely read-only pages to serve from the slave
- When a crawler such as Google is identified in the headers, hardwire all queries to go to the slave
We changed the transaction isolation level in MySQL from "Repeatable Read" to "Read Committed", a lower level of isolation.
max_execution_time = 120000 (the execution timeout for SELECT statements, in milliseconds)
innodb_lock_wait_timeout = 60 (timeout in seconds an InnoDB transaction may wait for a row lock before giving up; increase for reliability, decrease for performance)
tx_isolation = READ-COMMITTED (allowed values: READ-UNCOMMITTED, READ-COMMITTED, REPEATABLE-READ, SERIALIZABLE. Repeatable Read is a higher isolation level: in addition to the guarantees of Read Committed, it also guarantees that any data read cannot change if the transaction reads the same data again.)
general_log = 0
long_query_time = 10
log_queries_not_using_indexes = 0
log_output = NONE (allowed values: TABLE, FILE, NONE)
We debated whether to do caching with NoSQL solutions or in memory (memcached or Redis), versus replication with read/write splitting and load balancing. We finally came to the conclusion that going from one server to two was:
- a lot simpler from an application-design standpoint,
- more cost effective, as each DB can be scaled up/down independently under high/low load,
- better in terms of high availability: an outage in one database will affect only the related services.
It also enforces loose coupling by preventing direct access to the DB from developers.
Cons: sharding or another solution is still needed if a table grows really huge.
Memory is most important. To tell whether your working set is almost all in memory, check the ReadIOPS metric while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class (to a class with more RAM) results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or until ReadIOPS is reduced to a very small amount.
Typical IOPS should stay within the baseline for consistent performance.
The alarm limit is around 75% for CPU, memory, and storage metrics. If a metric goes up, CloudWatch alarms are triggered, and we take action if the limit is breached consistently.
QPS (not currently monitored).
Context switching to go to Mongo from MySQL.
We are using MongoDB for non-transactional data, which is highly unstructured.
A separate Leads module was using MongoDB; its use cases were different, as the data is highly unstructured.
Application data is transactional; Leads data is non-transactional.
Transactional data grew 3000%; non-transactional data grew by 250%.
The goal was to get searches to happen within 250ms.
Nitesh started on the non-transactional side, which we released a year back, and then Swapnil picked it up for transactional data.
Lives easier, Machines Lazier
No need for an external load balancer, since the cluster does its own routing: ask any server in the cluster and it will delegate to the correct node. What if we need more? More data: more shards. More availability: more replicas per shard.
What does it add to Lucene? A RESTful service: a JSON API over HTTP. Want to use it from PHP? Just make cURL requests, as you would to the Facebook Graph API. High availability and performance through clustering. Long-term persistency: write-through to a persistent storage system.
Lucene is a java library. You can include it in your project and refer to its functions using function calls.
Elasticsearch is a JSON-based, distributed web server built over Lucene. Though it is Lucene doing the actual work beneath, Elasticsearch provides a convenient layer over it. Each shard in Elasticsearch is a separate Lucene instance. So, to summarize:
Elasticsearch is built over Lucene and provides a JSON-based REST API to access Lucene features.
Elasticsearch provides a distributed system on top of Lucene. A distributed system is not something Lucene is aware of or built for. Elasticsearch provides this abstraction of distributed structure.
Provides other supporting features like thread-pool, queues, node/cluster monitoring API, data monitoring API, Cluster management, etc.
The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time.
Let’s create an index called blogs in our empty one-node cluster. By default, indices are assigned five primary shards, but for the purpose of this demonstration, we’ll assign just three primary shards and one replica (one replica of every primary shard):
PUT /blogs
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
shard = hash(routing) % number_of_primary_shards
it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
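The routing formula above can be sketched directly. Elasticsearch actually uses a Murmur3 hash of the routing value (the document `_id` by default); Python's built-in `hash()` stands in for it here.

```python
def shard_for(routing, number_of_primary_shards=3):
    # shard = hash(routing) % number_of_primary_shards
    return hash(routing) % number_of_primary_shards

# The same routing value always lands on the same shard. This is also why
# the primary shard count is fixed at index creation: changing the modulus
# would re-map every document already stored.
```

For example, `shard_for("lead-1234")` is stable for the life of the index, so both writes and reads for that document go to one shard.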
Logstash's in-memory queue size is capped at an arbitrary 20 events (non-configurable).
Back pressure: in-flight events can be lost, hence the small queue size; persistent queues in the pipeline address this.
# How many events to retrieve from inputs before sending to filters+workers
# pipeline.batch.size: 125

# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
# pipeline.batch.delay: 50
Data has grown immensely.
Search times have dropped significantly.
This also means reduced load on the primary data stores (MySQL/Mongo): no scaling up has been needed in a year.