Scaling Search at Lendingkart
Shivendra Singh, Swapnil Bagadia and Nitesh Kumar
16 June 2018
Scaling
Scalability is the capability of a system to handle a growing
amount of work, or its potential to be enlarged to
accommodate that growth. ~ Wikipedia
Scaling and High Availability (Application)
● The application does not change too often (static)
● If we need more performance, we add more resources
● Easy to scale and achieve High Availability
● But what happens with the database?
Scaling and High Availability (Databases)
● We have to distribute the changes to all the databases in real time
● It has to be available for all the applications
● The application has to be able to make changes
Vertical Scaling
Horizontal Scaling
● Master-Master setup
● MySQL Cluster
● MariaDB / Galera / Percona XtraDB
● Problems - messy, hard to identify and fix issues when they arise, auto-increment conflicts
● Sharding Databases
● Complexity of managing at application level
● Multiple Read Replicas
● Single Master Multiple Read Replicas used by application
● Separate DB for Analytics
● Problems - replication lag
Horizontal Scaling (Multiple Read Replicas)
Query Optimizations
● Use indexes for better read performance
○ Multiple non-clustered/secondary indexes
○ Too many and too few indexes are both bad
○ Check for duplicate and unused indexes
○ Queries can run without indexes, but they can take a really long time
○ Best if all WHERE and JOIN clauses use an index for lookups
● Monitor and force use of indexes if required
○ FORCE INDEX to require a specific index
● Fix top offenders (repeatedly)
○ Slow query logs (using long_query_time)
○ Use explain on these queries
■ Using Index - Good
■ Using Filesort, Using temporary - Bad
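The EXPLAIN-driven workflow above can be sketched end to end. This uses SQLite only so the example is self-contained and runnable; MySQL's EXPLAIN output differs (look for "Using index" vs "Using filesort"/"Using temporary"), and the table and column names here are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, city TEXT, status TEXT)")
con.executemany("INSERT INTO leads (city, status) VALUES (?, ?)",
                [("bangalore", "NEW"), ("ahmedabad", "ACTIVE")] * 50)

# No index on city yet: the plan is a full table scan
before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM leads WHERE city = 'bangalore'").fetchall()

con.execute("CREATE INDEX idx_leads_city ON leads (city)")

# With the index, the same query becomes an index lookup
after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM leads WHERE city = 'bangalore'").fetchall()

print(before[-1][-1])  # e.g. a SCAN of the leads table
print(after[-1][-1])   # e.g. SEARCH using idx_leads_city
```

The same loop applies to MySQL: find a slow query in the slow query log, EXPLAIN it, add or force an index, and confirm the plan changed.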
Server Side Optimizations for performance
● Sensible timeout for queries
○ Max query execution time
○ Lock wait timeouts
● Changed to Read Committed from Repeatable Read
Separating out Databases
● Smaller databases that are completely decoupled and independent
● Pros
○ Simplicity
○ More cost effective
○ High Availability
○ Enforces loose coupling across data stores
○ Allows better usage of connections to DB
● Cons
○ Hard to maintain referential integrity across different DBs
○ Harder to use in analytics/reporting
○ Transaction Management
○ Does not solve the problem of a table growing really large
Monitoring and Key Metrics
● Memory Usage
○ Often most important for performance
○ Your working set should fit comfortably in memory
○ Less memory = more pressure on IO
● IOPS
○ 1× IOPS/GB, burstable up to 3×, for General Purpose SSD
○ Provisioned IOPS for better performance
● CPU Usage
● Free Disk usage
● Replication Lag (in Read Replicas)
● Database Connections
Challenges of direct search in DB
● Searching on non indexed columns
● Perils of using LIKE queries
○ Full table scan
● Returning all columns
● Aggregations were killing the database performance
Other database storages
● Non Transactional Data resides in MongoDB
● Data is highly unstructured
Challenges of direct search on MongoDB
● Single index
○ MongoDB supports the creation of user-defined ascending/descending indexes on a single field of a document.
○ Default index on the _id field during the creation of a collection.
○ Problem: 24 searchable fields in a document. So 24 indexes???
● Case insensitive search
○ LendingKart or LENdingkart or lendingkart or lendingKART etc.
○ Problem: MongoDB's case-insensitive regular expression search (/pattern/i) cannot use indexes efficiently.
● Prefix search
○ Search query: John
■ First/middle/last name: John
■ Company name: Johnson
■ Email: johnkumar@gmail.com
○ Problem: slow performance
● Sorting and pagination
○ Sorting on specific fields like date or some id.
○ Pagination to separate a big result set into smaller chunks.
○ Problem: MongoDB has an in-memory sort limit when no index supports the sort.
Chain of thought for search improvement
● Compound index
○ An index that contains references to multiple fields within a document.
○ MongoDB imposes a limit of 31 fields for any compound index.
○ Example:
{
"_id": ObjectId(...),
"leadId": 1234,
"companyName": "lendingkart",
"city": "bangalore",
"email": "lendingkart@abc.com",
"phone": "9999999999"
}
db.leads.createIndex( { "leadId": 1, "companyName": 1, "email":1 } )
● Elastic search
○ An open-source, broadly-distributable, readily-scalable, enterprise-grade search engine.
Why we needed some magic!!!
● Searches in MySQL were slow
○ Around 8 seconds for normal search
● Searches in MongoDB were slow
○ Around 8 seconds for normal search
● Aggregations were slow
○ Taking 21 seconds - 36 seconds for aggregations
● Data Growth
○ Transactional/Application from 0.04M to 1.2M
○ Non Transactional/Leads from .6M to 2M
● Our goal was to get searches to happen within 250ms
ElasticSearch - You know for search....
Wer Ordnung hält, ist nur zu faul zum Suchen.
(If you keep things tidily ordered, you're just too lazy to go
searching.)
—German proverb
How ElasticSearch Happened?
What is ELASTICsearch ?
● Full-text search and analytics engine
○ It allows you to store, search, and analyze big volumes of data quickly.
● Near Real Time(NRT)
○ Slight latency (normally one second) from the time you index a document until the time it becomes
searchable.
● Highly scalable
○ Elastic, as the name suggests. It’s clustered by default— you call it a cluster even if you run it on a
single server.
○ Increase/Decrease nodes as per requirement
● It just works...
○ Open-source/Free built on top of Apache Lucene, in Java(inherently cross-platform)
○ Ships with sensible defaults, keeping complex theories for leisure reading
○ Mostly, plug and play.
○ Much more than Lucene - JSON Based, Distributed, web server.
Sharding for scalability
○ To add data to Elasticsearch, we need an index—a place to store related data. In reality, an index is just
a logical namespace that points to one or more physical shards.
○ Each shard can have zero or more replicas
○ Replicas on different servers (server pools) for
failover
○ One node in the cluster goes down? No problem.
○ Master - Automatic Master detection + failover
○ Responsible for distribution/balancing of shards
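As a sketch, shard and replica counts are set per index at creation time; the index name and the counts here are illustrative (3 primary shards with 2 replicas each matches a 3-node cluster):

```json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
```

Sent as the body of the index-creation request (e.g. PUT /leads). The replica count can be changed later; the primary shard count cannot without reindexing.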
Distributed to the hilt
(Diagrams: a 1-node cluster, a 2-node cluster, a 3-node cluster with 2 replica shards, and what happens when 1 node goes down)
Where does Elasticsearch help us?
● Dashboards
● Fraud Detection
● Logs
● Analytics
Data Seeding from MySQL to ES
● What were the options?
○ A binlog processor service syncing your MySQL data into Elasticsearch automatically
○ Asynchronous Kafka(as a queue) pipeline
● Why go through all the pain when we can get all the same from ELK stack itself?
○ Logstash was a perfect fit for our requirements
○ 100% Config Based
○ Not a single Line of Code
Simplicity at its best - Logstash
● How does Logstash work?
○ Like the others, Logstash has input/filter/output plugins.
○ Attention: Logstash processes events, not (only) log lines!
○ "Inputs generate events, filters modify them, outputs ship them elsewhere." -- [the life of an event in Logstash]
● Plugin Architecture
○ Input plugins: captures external data+format & transform it to logstash events
○ Filter plugins: process/transform events
○ Output plugins: send events to external destination & format
○ All Plugins
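Putting the three plugin stages together, a JDBC-to-Elasticsearch pipeline config looks roughly like this. It is a sketch: host names, the table/column names, and the index name are made up, and the JDBC driver settings (jdbc_driver_library, jdbc_driver_class) are omitted:

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://db-host:3306/leads_db"
    jdbc_user => "reader"
    # :sql_last_value lets Logstash pull only rows changed since the last run
    statement => "SELECT * FROM leads WHERE updated_at > :sql_last_value"
    schedule => "* * * * *"
  }
}
filter {
  mutate { remove_field => ["@version"] }
}
output {
  elasticsearch {
    hosts => ["es-host:9200"]
    index => "leads"
    # keep ES in sync with MySQL by reusing the primary key as document id
    document_id => "%{lead_id}"
  }
}
```

100% config based, not a single line of code, as the previous slide says.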
How did we use Logstash?
Logstash Configurations - introducing multiple pipelines
● Lack of congestion isolation - backpressure
● One size does not fit all - TCP2TCP (fast and light) vs JDBC2ES (large and low volume)
● The solution before Logstash 6.0: Multiple Logstash Instances - RPM/DEB /Multi-JVM instances
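From Logstash 6.0 onward, multiple isolated pipelines can run in a single JVM via pipelines.yml. A sketch (pipeline ids, paths, and tuning values here are illustrative):

```yaml
- pipeline.id: tcp-to-tcp          # fast and light: more workers
  path.config: "/etc/logstash/conf.d/tcp_tcp.conf"
  pipeline.workers: 4
- pipeline.id: jdbc-to-es          # large, low volume: one worker, bigger batches
  path.config: "/etc/logstash/conf.d/jdbc_es.conf"
  pipeline.workers: 1
  pipeline.batch.size: 500
```

Each pipeline gets its own queue, so backpressure on the slow JDBC pipeline no longer congests the fast TCP one.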
Data seeding from mongo to elastic cluster
● How to copy data from mongo to elastic cluster?
○ Mongo-connector
● Do we need to copy all fields and their values of a document from mongo to elastic cluster?
○ Useful(or searchable) data on cluster
● What is Oplog (operations log)?
● How mongo-connector reads oplog to copy documents (new or updated documents) on elastic cluster?
● Can we use a custom configuration file to specify some options to mongo-connector?
● How to track whether mongo-connector has stopped syncing data?
MongoDB Connector
Mongo-connector creates a pipeline from a MongoDB cluster to an Elasticsearch cluster, copying your documents from MongoDB to the target system.
OpLog(operations log)
● The oplog (operations log) keeps a rolling record of all operations that modify the data stored in your
databases. Example:
> use test //switched to db test
> db.leads.insert({"leadId":1})
> db.leads.update({"leadId":1}, {$set : {"city": "bangalore"}})
● Oplog entries for the operations above:
{ "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "test.leads", "o" : { "_id"
: ObjectId("4cb35859007cc1f4f9f7f85d"), "leadId" : 1 } }
{ "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "test.leads", "o2" : {
"_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "city": "bangalore" } } }
op: the write operation (i = insert, u = update)
How mongo-connector reads oplog to copy documents (new or updated
documents) on elastic cluster?
● Mongo Connector creates an oplog progress file (oplog.timestamp).
● The oplog progress file keeps track of the latest oplog entry seen for each replica set to which Mongo
Connector is connected.
● Mongo Connector uses this file to decide where to begin reading the oplog on startup.
● When the oplog progress file cannot be found, or is empty, Mongo Connector begins pulling data
from all MongoDB collections in the "collection dump" phase.
● The oplog progress file is then updated with the most recent timestamp from before the dump
happened.
● Mongo Connector then applies all oplog operations from before the dump, so that the copied
documents will be up-to-date with what's on MongoDB.
Can we use a custom configuration file to specify some options to mongo-connector?
● You can use a custom configuration file to specify some options to mongo-connector.
● To invoke mongo-connector with a configuration file option, run:
○ mongo-connector -c config.json
● Configuration options:
○ excludeFields: fields to exclude from MongoDB documents (not read at all). Example (database: test, collection: leads):
"test.leads": {
"excludeFields": ["isSynced","comments","dndMobile","isDuplicateLead"]
}
○ oplogFile: The path to the oplog progress file.
○ batchSize: Number of records processed from the oplog before updating the timestamp file.
■ default bulk size is 1000 docs
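Combining the options above, a complete config.json might look roughly like this. The exact schema depends on the mongo-connector version, and the host names and file paths are illustrative:

```json
{
  "mainAddress": "mongo-host:27017",
  "oplogFile": "/var/log/mongo-connector/oplog.timestamp",
  "batchSize": 500,
  "docManagers": [
    {
      "docManager": "elastic2_doc_manager",
      "targetURL": "es-host:9200"
    }
  ],
  "namespaces": {
    "test.leads": {
      "excludeFields": ["isSynced", "comments", "dndMobile", "isDuplicateLead"]
    }
  }
}
```

Invoked as shown earlier: mongo-connector -c config.json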
How to track whether mongo-connector has stopped syncing data?
● Causes:
○ High write-load.
○ Mongo-connector connection with mongoDB or cluster got interrupted.
● Solution:
○ Write a script that runs at a scheduled time.
○ The script queries the total document count from MongoDB and from Elasticsearch.
○ If the difference in counts exceeds a threshold, it sends a notification.
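The check above can be sketched as a small function. The threshold is an assumption; in practice the two counts would come from the MongoDB count on the collection and Elasticsearch's _count API, and the message would go to an alerting channel:

```python
def check_sync(mongo_count: int, es_count: int, threshold: int = 100):
    """Return an alert message when document counts diverge past the
    threshold, else None. Meant to run on a schedule (e.g. cron)."""
    diff = abs(mongo_count - es_count)
    if diff > threshold:
        return (f"mongo-connector may have stopped syncing: "
                f"mongo={mongo_count} es={es_count} diff={diff}")
    return None  # counts are close enough; the connector looks healthy
```

A small lag is normal under high write load, which is why the check alerts only past a threshold instead of on any difference.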
(Diagram: MongoDB → Mongo-connector → Elastic cluster)
ES Analyzers
● An analyzer (built-in or custom) is just a package of three lower-level
building blocks: character filters (zero or more), exactly one tokenizer, and token filters (zero or more).
● Character filters - A character filter receives the original text as a stream of characters and can
transform the stream by adding, removing, or changing characters.
● A tokenizer receives a stream of characters, breaks it up into individual tokens (usually
individual words), and outputs a stream of tokens.
● A token filter receives the token stream and may add, remove, or change tokens.
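These building blocks combine into a custom analyzer in the index settings. A sketch addressing the case-insensitive and prefix-search problems from earlier (the analyzer and filter names are made up; edge_ngram emits prefixes of each token, lowercase makes matching case-insensitive):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "prefix_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 20 }
      },
      "analyzer": {
        "name_prefix_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "standard",
          "filter": ["lowercase", "prefix_filter"]
        }
      }
    }
  }
}
```

With this, "Johnson" is indexed as jo, joh, john, ... so a query for "John" (in any case) matches at index-lookup speed instead of via a scan.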
Analyzers basics...contd
Default Analyzers
● Standard Analyzer
The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most
punctuation, lowercases terms, and supports removing stop words.
● Simple Analyzer
The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
● Whitespace Analyzer
The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.
● Stop Analyzer
The stop analyzer is like the simple analyzer, but also supports removal of stop words.
● Keyword Analyzer
The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
● Language Analyzers
Elasticsearch provides many language-specific analyzers like english or french.
● Fingerprint Analyzer
The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.
Custom Analyzers
Our Custom Analyzers
Analyzed Mappings
● How to analyze a field?
● How to analyze using an analyzer?
● How to analyze your querystring?
● Term Query vs Match Query
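On the last point: a term query matches the stored token exactly and does not analyze the query string, while a match query runs the query string through the field's analyzer first. A rough sketch (field names assumed):

```json
{ "query": { "term":  { "leadId": "LEA-1234" } } }

{ "query": { "match": { "companyName": "LendingKART" } } }
```

The term query only hits if "LEA-1234" was stored verbatim; the match query lowercases and tokenizes "LendingKART" the same way the field was indexed, which is what makes case-insensitive search work.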
Search-Sort-Filter Operations
● Where to perform sorting/pagination?
○ Direct mongoDB or elastic cluster
● How to perform prefix/match/smart search?
○ Searchable fields: first/middle/last name, company name, status/substatus, leadId, email, phone
○ Search queries
■ Query1: Lendingkart
■ Query2: +91 9999999999
■ Query3: LEA-1234
■ Query4: 9999999999lkart@gmail.com
● How to perform case insensitive search?
○ LendingKart
○ LENdingkart
○ LendingKARt
○ lendingkart
○ LENDINGKART
Aggregations
Aggregations allow us to ask sophisticated questions of our data. A combination of buckets and metrics.
Snapshot performance improvement: 21-36 sec to ~200ms
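A sketch of what "buckets and metrics" means in the query DSL (field names assumed): a terms aggregation buckets leads by status, and a nested max metric is computed per bucket.

```json
{
  "size": 0,
  "aggs": {
    "leads_by_status": {
      "terms": { "field": "status" },
      "aggs": {
        "latest_lead": { "max": { "field": "createdAt" } }
      }
    }
  }
}
```

"size": 0 skips returning documents entirely, so only the aggregation work is done.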
Relevance Score
● Boolean model to find matching documents:
full AND text AND search AND (elasticsearch OR lucene)
● Term frequency/inverse document frequency
tf(t in d) = √frequency
idf(t) = 1 + log ( numDocs / (docFreq + 1))
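Plugging numbers into the two formulas above:

```python
import math

def tf(freq):
    # term frequency: square root of how often the term appears in the document
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # inverse document frequency: terms appearing in fewer documents score higher
    return 1 + math.log(num_docs / (doc_freq + 1))

tf_score = tf(4)          # a term appearing 4 times in a document -> 2.0
idf_score = idf(100, 9)   # a term appearing in 9 of 100 documents -> 1 + ln(10)
```

So a rare term ("elasticsearch") contributes far more to the relevance score than a common one ("search"), even at the same term frequency.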
Generic Framework
Numbers
● Data Growth
● Transactional/Application from 0.04M to 1.2M+ ~ 3000%
● Non Transactional/Leads from 0.6M to 2M+ ~ 250%
● Speed of search
● Searches came down from 8 seconds to ~ 230ms
● Aggregations came down from 21-36 seconds to ~ 200ms
Q & A!!!
Thank you!

Tatiana Kojar
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 

Recently uploaded (20)

Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 

Lendingkart Meetup #2: Scaling Search @Lendingkart

  • 1. Scaling Search at Lendingkart Shivendra Singh, Swapnil Bagadia and Nitesh Kumar 16 June 2018
  • 2. Scaling Scaling/Scalability is the capability of a system to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. ~Wikipedia
  • 3. Scaling and High Availability(Application)
  • 4. Scaling and High Availability (Application) ● The application does not change too often (static) ● If we need more performance, we add more resources ● Easy to scale and achieve High Availability ● But what happens to the database?
  • 5. Scaling and High Availability (Databases) ● We have to distribute changes to all the databases in real time ● The data has to be available to all the applications ● The applications have to be able to make changes
  • 7. Horizontal Scaling ● Master-Master setup ● MySQL Cluster ● MariaDB / Galera / Percona XtraDB ● Problems - messy, hard to identify/fix when issues arise, auto-increment conflicts ● Sharding databases ● Complexity of managing at the application level ● Multiple read replicas ● Single master, multiple read replicas used by the application ● Separate DB for analytics ● Problems - replication lag
  • 9. Query Optimizations ● Use indexes for better read performance ○ Multiple non-clustered/secondary indexes ○ Too many and too few indexes are both bad ○ Check for duplicate and unused indexes ○ Queries can run without indexes, but they can take a really long time ○ Best if all WHERE and JOIN clauses use an index for lookups ● Monitor and force the use of indexes if required ○ FORCE INDEX to pick the index to be used ● Fix top offenders (repeatedly) ○ Slow query logs (using long_query_time) ○ Use EXPLAIN on these queries ■ Using index - good ■ Using filesort, Using temporary - bad
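A minimal sketch of the workflow above. The table and index names (`loan_applications`, `idx_applicant_id`) are illustrative assumptions, not from the deck:

```sql
-- 1. Take a query found in the slow query log and explain it:
EXPLAIN SELECT id, status FROM loan_applications WHERE applicant_id = 42;
-- In the Extra column, "Using index" is good;
-- "Using filesort" or "Using temporary" are red flags.

-- 2. If the optimizer picks a poor plan, force a specific index:
SELECT id, status
FROM loan_applications FORCE INDEX (idx_applicant_id)
WHERE applicant_id = 42;
```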
  • 10. Server Side Optimizations for Performance ● Sensible timeouts for queries ○ Max query execution time ○ Lock wait timeouts ● Changed from Repeatable Read to Read Committed
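The specific values for these settings appear in the speaker notes; as a sketch, they can be applied like so (variable names per MySQL 5.7, adjust per workload):

```sql
SET GLOBAL max_execution_time = 120000;    -- SELECT timeout, in milliseconds
SET GLOBAL innodb_lock_wait_timeout = 60;  -- row-lock wait, in seconds
SET GLOBAL tx_isolation = 'READ-COMMITTED';
```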
  • 11. Separating out Databases ● Smaller databases that are completely decoupled and independent ● Pros ○ Simplicity ○ More cost effective ○ High Availability ○ Enforces loose coupling across data stores ○ Allows better usage of connections to the DB ● Cons ○ Hard to maintain referential integrity across different DBs ○ Harder to use for analytics/reporting ○ Transaction management ○ Does not solve the problem of a table growing really large
  • 12. Monitoring and Key Metrics ● Memory usage ○ Often the most important for performance ○ Your working set should fit well in memory ○ Less memory = more pressure on IO ● IOPS ○ 1x IOPS/GB, burstable up to 3x, for General Purpose SSD ○ Provisioned IOPS for better performance ● CPU usage ● Free disk space ● Replication lag (in read replicas) ● Database connections
  • 13. Challenges of direct search in the DB ● Searching on non-indexed columns ● Perils of using LIKE queries ○ Full table scans ● Returning all columns ● Aggregations were killing database performance
  • 14. Other database storages ● Non Transactional Data resides in MongoDB ● Data is highly unstructured
  • 15. Challenges of direct search on MongoDB ● Single index ○ MongoDB supports the creation of user-defined ascending/descending indexes on a single field of a document. ○ Default index on the _id field during the creation of a collection. ○ Problem: 24 searchable fields in a document. So 24 indexes??? ● Case-insensitive search ○ LendingKart or LENdingkart or lendingkart or lendingKART etc. ○ Problem: MongoDB's case-insensitive regular expression searches cannot use indexes efficiently. ● Prefix search ○ Search query: John First/middle/last name: John Company name: Johnson Email: johnkumar@gmail.com ○ Problem: slow performance ● Sorting and pagination ○ Sorting on specific fields like a date or some id. ○ Pagination to separate a big result set into smaller chunks. ○ Problem: MongoDB has a sort memory limit.
  • 16. Chain of thought for search improvement ● Compound index ○ An index that contains references to multiple fields within a document. ○ MongoDB imposes a limit of 31 fields for any compound index. ○ Example: { "_id": ObjectId(...), "leadId": 1234, "companyName": "lendingkart", "city": "bangalore", "email": "lendingkart@abc.com", "phone": "9999999999” } db.leads.createIndex( { "leadId": 1, "companyName": 1, "email":1 } ) ● Elastic search ○ An open-source, broadly-distributable, readily-scalable, enterprise-grade search engine.
  • 17. Why we needed some magic!!! ● Searches in MySQL were slow ○ Around 8 seconds for a normal search ● Searches in MongoDB were slow ○ Around 8 seconds for a normal search ● Aggregations were slow ○ Taking 21-36 seconds ● Data growth ○ Transactional/Application from 0.04M to 1.2M ○ Non-transactional/Leads from 0.6M to 2M ● Our goal was to get searches to happen within 250ms
  • 18. ElasticSearch - You know for search.... Wer Ordnung hält, ist nur zu faul zum Suchen. (If you keep things tidily ordered, you're just too lazy to go searching.) —German proverb
  • 20. What is ELASTICsearch? ● Full-text search and analytics engine ○ It allows you to store, search, and analyze big volumes of data quickly. ● Near Real Time (NRT) ○ Slight latency (normally one second) from the time you index a document until the time it becomes searchable. ● Highly scalable ○ Elastic, as the name suggests. It's clustered by default - you call it a cluster even if you run it on a single server. ○ Increase/decrease nodes as per requirement ● It just works... ○ Open-source/free, built on top of Apache Lucene, in Java (inherently cross-platform) ○ Ships with sensible defaults, keeping complex theories for leisure reading ○ Mostly plug and play. ○ Much more than Lucene - JSON-based, distributed, a web server.
  • 21. Sharding for scalability ○ To add data to Elasticsearch, we need an index—a place to store related data. In reality, an index is just a logical namespace that points to one or more physical shards. ○ Each shard can have zero or more replicas ○ Replicas on different servers (server pools) for failover ○ One in the cluster goes down? No problem. ○ Master - Automatic Master detection + failover ○ Responsible for distribution/balancing of shards
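The speaker notes walk through this with a concrete example: creating an index with three primary shards and one replica per shard (the index name `blogs` comes from the notes):

```json
PUT /blogs
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```

Documents are routed to a primary shard with `shard = hash(routing) % number_of_primary_shards`, which is why the primary shard count is fixed at index creation while the replica count can change at any time.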
  • 22. Distributed to the hilt: 1-node cluster, 2-node cluster, 3-node cluster, 2 replica shards, 1 node goes down!!!!
  • 23. Where does Elasticsearch help us? ● Dashboards ● Fraud Detection ● Logs ● Analytics
  • 24. Data Seeding from MySQL to ES ● What were the options? ○ A binlog processor service syncing your MySQL data into Elasticsearch automatically ○ An asynchronous Kafka (as a queue) pipeline ● Why go through all the pain when we can get all the same from the ELK stack itself? ○ Logstash was a perfect fit for our requirements ○ 100% config based ○ Not a single line of code
  • 25. Simplicity at its best - Logstash ● How does Logstash work? ○ Just like others, Logstash has input/filter/output plugins. ○ Attention: Logstash processes events, not (only) log lines! "Inputs generate events, filters modify them, outputs ship them elsewhere." -- [the life of an event in Logstash] ● Plugin architecture ○ Input plugins: capture external data, format and transform it into Logstash events ○ Filter plugins: process/transform events ○ Output plugins: send events to an external destination and format them ○ All plugins
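A config-only MySQL-to-ES pipeline of the kind described above might look like this sketch. The connection string, credentials, table, and field names are assumptions for illustration:

```conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://db-host:3306/appdb"
    jdbc_user => "reader"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    schedule => "* * * * *"   # poll once a minute
    statement => "SELECT * FROM leads WHERE updated_at > :sql_last_value"
  }
}
output {
  elasticsearch {
    hosts => ["es-host:9200"]
    index => "leads"
    document_id => "%{lead_id}"
  }
}
```

No application code involved: the input plugin generates events from query rows, and the output plugin ships them to Elasticsearch, matching the "100% config based" claim.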
  • 26. How did we use Logstash?
  • 27. Logstash Configurations - introducing multiple pipelines ● Lack of congestion isolation - backpressure ● One size does not fit all - TCP-to-TCP (fast and light) vs JDBC-to-ES (large and low volume) ● The solution before Logstash 6.0: multiple Logstash instances - RPM/DEB/multi-JVM instances
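From Logstash 6.0, multiple pipelines can run inside one JVM via `pipelines.yml`, isolating a fast/light pipeline from a bulky JDBC one. A sketch (ids and paths are illustrative; the batch values come from the speaker notes):

```yaml
# pipelines.yml (Logstash >= 6.0)
- pipeline.id: tcp-fast
  path.config: "/etc/logstash/conf.d/tcp.conf"
  pipeline.workers: 4
- pipeline.id: jdbc-bulk
  path.config: "/etc/logstash/conf.d/jdbc.conf"
  pipeline.batch.size: 125   # events pulled from inputs before sending to filters+workers
  pipeline.batch.delay: 50   # ms to wait before dispatching an undersized batch
```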
  • 28. Data seeding from mongo to elastic cluster ● How to copy data from mongo to elastic cluster? ○ Mongo-connector ● Do we need to copy all fields and their values of a document from mongo to elastic cluster? ○ Useful(or searchable) data on cluster ● What is Oplog (operations log)? ● How mongo-connector reads oplog to copy documents (new or updated documents) on elastic cluster? ● Can we use a custom configuration file to specify some options to mongo-connector? ● How to track whether mongo-connector has stopped syncing data?
  • 29. MongoDB Connector Mongo-connector creates a pipeline from a MongoDB cluster to an Elasticsearch cluster and copies your documents from MongoDB to your target system.
  • 30. OpLog (operations log) ● The oplog (operations log) keeps a rolling record of all operations that modify the data stored in your databases. Example: > use test //switched to db test > db.leads.insert({"leadId":1}) > db.leads.update({"leadId":1}, {$set : {"city": "bangalore"}}) ● Oplog entries for the above operations: { "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "test.leads", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "leadId" : 1 } } { "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "test.leads", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "city": "bangalore" } } } op: the write operation [i: insert, u: update] Insert Update
  • 31. How does mongo-connector read the oplog to copy documents (new or updated) to the elastic cluster? ● Mongo-connector creates an oplog progress file (oplog.timestamp). ● The oplog progress file keeps track of the latest oplog entry seen for each replica set to which mongo-connector is connected. ● Mongo-connector uses this file to decide where to begin reading the oplog on startup. ● When the oplog progress file cannot be found, or if it is empty, mongo-connector will begin pulling data from all MongoDB collections in the "collection dump" phase. ● The oplog progress file is then updated with the most recent timestamp from before the dump happened. ● Mongo-connector then applies all oplog operations from before the dump, so that the copied documents will be up to date with what's on MongoDB.
  • 32. Can we use a custom configuration file to specify some options to mongo-connector? ● You can use a custom configuration file to specify some options to mongo-connector. ● To invoke mongo-connector with a configuration file option, run: ○ mongo-connector -c config.json ● Configuration options: ○ excludeFields: list of fields not to read from MongoDB. Comma-separated list of fields to exclude from MongoDB documents. Example: [Database: test, Collection: leads] "test.leads": { "excludeFields": ["isSynced","comments","dndMobile","isDuplicateLead"] } ○ oplogFile: the path to the oplog progress file. ○ batchSize: number of records processed from the oplog before updating the timestamp file. ■ The default bulk size is 1000 docs
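Putting the options above together, a `config.json` might look roughly like this. The overall layout (`mainAddress`, `docManagers`, `namespaces`) is a sketch from memory of mongo-connector's config format and should be verified against the project's documentation; only the `excludeFields` list is taken from the slide:

```json
{
  "mainAddress": "mongodb-host:27017",
  "oplogFile": "/var/log/mongo-connector/oplog.timestamp",
  "docManagers": [
    {
      "docManager": "elastic2_doc_manager",
      "targetURL": "es-host:9200"
    }
  ],
  "namespaces": {
    "test.leads": {
      "excludeFields": ["isSynced", "comments", "dndMobile", "isDuplicateLead"]
    }
  }
}
```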
  • 33. How to track whether mongo-connector has stopped syncing data? ● Causes: ○ High write load. ○ Mongo-connector's connection with MongoDB or the cluster got interrupted. ● Solution: ○ Write a script which runs at a scheduled time. ○ This script queries the total count of documents from both Mongo and Elastic. ○ If the difference in counts is greater than a threshold, it sends a notification. (Diagram: MongoDB → Mongo-connector → Elastic Cluster)
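The core of that scheduled check can be sketched as a pure function; the two counts stand in for real MongoDB and Elasticsearch count queries, and the notification call is left out:

```python
def sync_lagging(mongo_count: int, es_count: int, threshold: int) -> bool:
    """Return True when the document-count gap exceeds the alert threshold."""
    return abs(mongo_count - es_count) > threshold

# In the real cron script, a True result would trigger a notification.
```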
  • 34. ES Analyzers ● An analyzer - whether built-in or custom - is just a package which contains three lower-level building blocks: character filters (>= 0), tokenizers (exactly 1), and token filters (>= 0). ● Character filters - a character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. ● A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. ● A token filter receives the token stream and may add, remove, or change tokens.
  • 36. Default Analyzers ● Standard Analyzer The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words. ● Simple Analyzer The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms. ● Whitespace Analyzer The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms. ● Stop Analyzer The stop analyzer is like the simple analyzer, but also supports removal of stop words. ● Keyword Analyzer The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term. ● Language Analyzers Elasticsearch provides many language-specific analyzers like english or french. ● Fingerprint Analyzer The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.
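When none of the built-in analyzers fit, the building blocks from slide 34 can be combined into a custom one. This sketch (index and analyzer names are assumptions) keeps each value as a single token but lowercases it, which is one way to get the case-insensitive matching discussed later:

```json
PUT /leads
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

With this analyzer on a field, "LendingKart", "LENDINGKART" and "lendingkart" all index and query as the same term.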
  • 39. Analyzed Mappings ● How to analyze a field? ● How to analyze using an analyzer? ● How to analyze your querystring? ● Term Query vs Match Query
  • 40. Search-Sort-Filter Operations ● Where to perform sorting/pagination? ○ Direct mongoDB or elastic cluster ● How to perform prefix/match/smart search? ○ Searchable fields: first/middle/last name, company name, status/substatus, leadId, email, phone ○ Search queries ■ Query1: Lendingkart ■ Query2: +91 9999999999 ■ Query3: LEA-1234 ■ Query4: 9999999999lkart@gmail.com ● How to perform case insensitive search? ○ LendingKart ○ LENdingkart ○ LendingKARt ○ lendingkart ○ LENDINGKART
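A prefix search with sorting and pagination of the kind described above can be expressed in one query. This is a sketch; `companyName` appears in the slide 16 example, while the `createdDate` sort field is an assumption:

```json
GET /leads/_search
{
  "query": {
    "match_phrase_prefix": { "companyName": "lending" }
  },
  "sort": [{ "createdDate": { "order": "desc" } }],
  "from": 0,
  "size": 20
}
```

`from`/`size` give pagination in small chunks, and sorting happens on the cluster rather than in application code or MongoDB.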
  • 41. Aggregations Aggregations allow us to ask sophisticated questions of our data. A combination of buckets and metrics. Snapshot performance improvement: 21-36 sec to ~200ms
  • 42. Relevance Score ● Boolean model to find matching documents: full AND text AND search AND (elasticsearch OR lucene) ● Term frequency/inverse document frequency: tf(t in d) = √frequency, idf(t) = 1 + log(numDocs / (docFreq + 1))
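The two scoring terms on the slide (Lucene's classic tf-idf similarity) are simple enough to compute directly; a small sketch:

```python
import math

def tf(frequency: int) -> float:
    """Term frequency component: sqrt of the term's count in the document."""
    return math.sqrt(frequency)

def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: 1 + log(numDocs / (docFreq + 1))."""
    return 1 + math.log(num_docs / (doc_freq + 1))

# A term occurring 4 times in a doc gives tf = 2.0; a term appearing in
# 9 of 100 docs gives idf = 1 + ln(10), so rarer terms score higher.
```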
  • 44. Numbers ● Data growth ● Transactional/Application from 0.04M to 1.2M+ (~3000%) ● Non-transactional/Leads from 0.6M to 2M+ (~250%) ● Speed of search ● Searches came down from 8 seconds to ~230ms ● Aggregations came down from 21-36 seconds to ~200ms

Editor's Notes

  1. Scaling is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. It is the ability of a system to efficiently handle more work than it typically performs without going down, or otherwise the ability to be enlarged in order to perform more work efficiently. A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system. Most of the time we are concerned about load scalability - the ability of a distributed system to easily expand and contract its resource pool to accommodate heavier or lighter loads. Alternatively, the ease with which a system or component can be modified (added to or removed from) to accommodate changing load.
  2. Also referred to as scaling up (increasing the hardware capability of the machine). Storage and instance type are decoupled in RDS. There is minimal downtime when you are scaling up in a Multi-AZ environment because the standby database gets upgraded first, then a failover occurs to the newly sized database. A Single-AZ instance will be unavailable during the scale operation. Can scale up/down when needed. RDS does not allow you to downgrade the size of the disk; you can only upgrade.
  3. Also referred to as scaling out (adding more machines). To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. An example might be scaling out from one web server system to three. Problems with a Master-Master setup: difficult to manage, needs rigorous monitoring, and hard to fix when facing issues. Multiple read replicas: replication lag under write-heavy load. Sharding challenges: complexity at the application level. We are using a multiple-read-replica setup.
  4. There are MySQL Connectors that allow you to do read/write splitting. In addition to using a MySQL Connector, you can add a load balancer between your application and database servers. You make this addition so that you have a single database endpoint presented to the application. This approach allows for a more dynamic environment where you can transparently add or remove read replicas behind the load balancer without constantly updating the database connection string of the application. You can also perform a custom health check by using scripts. Examples: run large, repeating reporting queries and batch jobs on the slave instead; point completely read-only pages to serve from the slave; when a crawler such as Google is identified in the headers, hardwire all queries to go to the slave.
  5. Changed the transaction isolation level in MySQL from "Repeatable Read" to "Read Committed" (a lower level of isolation). max_execution_time = 120000 - the execution timeout for SELECT statements, in milliseconds. innodb_lock_wait_timeout = 60 - timeout in seconds an InnoDB transaction may wait for a row lock before giving up; increase for reliability, decrease for performance. tx_isolation = READ-COMMITTED - allowed values: READ-UNCOMMITTED, READ-COMMITTED, REPEATABLE-READ, SERIALIZABLE. Repeatable Read is a higher isolation level that, in addition to the guarantees of the Read Committed level, also guarantees that any data read cannot change if the transaction reads the same data again. general_log = 0, long_query_time = 10, log_queries_not_using_indexes = 0, log_output = NONE (allowed values: TABLE, FILE, NONE).
  6. We debated whether to do caching with NoSQL solutions or in-memory memcached or Redis, versus replication with read/write splitting and load balancing. We finally came to the conclusion that going from one server to two was a lot simpler from an application-design standpoint, more cost effective (as we can scale the DB up/down under high/low load independently), and better in terms of high availability - an outage in one database will affect only the related services. It also enforces loose coupling by preventing direct access to the DB by developers. Cons: sharding or some other solution is needed if a table grows really huge.
  7. Memory is most important. To tell if your working set is almost all in memory, check the ReadIOPS metric while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class - to a class with more RAM - results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or ReadIOPS is reduced to a very small amount. Typical IOPS should be within the baseline for consistent performance. The limit is around 75% for CPU, memory and storage metrics; if a metric breaches this consistently, CloudWatch alarms are triggered and action is taken accordingly. QPS (not monitored).
  8. Context switching to go to Mongo from MySQL. We are using MongoDB for non-transactional data, which is highly unstructured. A separate Leads module was using MongoDB; its use cases were different, as the data is highly unstructured. Give context for keeping data in Mongo (not required).
  9. Application data is transactional; Leads data is non-transactional. Transactional data grew 3000%; non-transactional data grew by 250%. The goal was to get searches to happen within 250ms. Nitesh started on non-transactional data, which we released a year back, and then Swapnil picked it up for transactional data.
  10. Lives easier, Machines Lazier
  11. No need for an external load balancer, since the cluster does its own routing: ask any server in the cluster and it will delegate to the correct node. What if... more data? More shards. More availability? More replicas per shard. What does it add to Lucene? A RESTful service: a JSON API over HTTP. Want to use it from PHP? cURL requests, as if you'd make requests to the Facebook Graph API. High availability and performance via clustering. Long-term persistency: write-through to a persistent storage system.
  12. Lucene is a Java library. You can include it in your project and refer to its functions using function calls. Elasticsearch is a JSON-based, distributed web server built over Lucene. Though it's Lucene doing the actual work beneath, Elasticsearch provides us a convenient layer over Lucene. Each shard in Elasticsearch is a separate Lucene instance. So to summarize: Elasticsearch is built over Lucene and provides a JSON-based REST API to access Lucene features. Elasticsearch provides a distributed system on top of Lucene - a distributed system is not something Lucene is aware of or built for. Elasticsearch provides this abstraction of a distributed structure, along with other supporting features like thread pools, queues, node/cluster monitoring APIs, data monitoring APIs, cluster management, etc.
  13. The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time. Let's create an index called blogs in our empty one-node cluster. By default, indices are assigned five primary shards, but for the purpose of this demonstration, we'll assign just three primary shards and one replica (one replica of every primary shard): PUT /blogs { "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 } } shard = hash(routing) % number_of_primary_shards
  14. It is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from. Add one-liner texts for this. The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time. Let's create an index called blogs in our empty one-node cluster. By default, indices are assigned five primary shards, but for the purpose of this demonstration, we'll assign just three primary shards and one replica (one replica of every primary shard): PUT /blogs { "settings" : { "number_of_shards" : 3, "number_of_replicas" : 1 } }
  15. Logstash queue size capped at an arbitrary 20 (non-configurable). Back pressure. In-flight events can be lost - so queues are kept small. Persistent queues in the pipeline.
  16. pipeline.batch.size: 125 - how many events to retrieve from inputs before sending to filters+workers. pipeline.batch.delay: 50 - how long to wait in milliseconds while polling for the next event before dispatching an undersized batch to filters+outputs.
  17. Add the isSync flag etc. as an example.
  18. Take a break at the end
  19. Data has grown immensely. Search times have dropped significantly. This also means reduced load on the primary data stores (MySQL/Mongo). No scaling up has happened in a year.