Proprietary and confidential information of stackArmor
PRESENTATION FOR DEVOPSDC
MAY 17, 2016
ETL processing at scale with
MongoDB and Solr on AWS using Chef
Delivering innovation with Cloud, Data
Analytics and Automation
• Cloud orchestration and automation
• Migration and Operations support
• Cloud hosting and SaaS Development
www.stackArmor.com
Our Partnerships
The Customer
• Big Data Analytics SaaS Firm
◦ 500 million records every month needed to be processed as fast as possible at
the lowest possible cost
◦ The current on-premises infrastructure was taking 3 weeks to run and
complete the process
◦ Processing costs were a big deal and needed to be executed within a very
tight and specific budget
◦ Due to HIPAA compliance reasons, the application components could not be
altered
◦ Needed to process 5 million records per hour and complete the process in 1-2
days and tear down the environment post-completion
ETL System Overview
[Diagram] Input Files → Parsing Component → Staging Collection → Ingestion Component → Final Collection, with a Search Index consulted during ingestion; JSON docs flow between the stages.
1 - Read input records
2 - Store JSON docs
3 - Read each JSON doc
4 - Search for docs which might be a match
5 - Get the set of candidate doc IDs
6 - After evaluating candidates, merge with an existing doc, or save as a new doc
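The six steps above can be sketched as a minimal in-memory pipeline. All names here are illustrative stand-ins; the real system used MongoDB collections and a Solr index in place of these dicts.

```python
# Minimal sketch of the six-step ETL flow (illustrative only).

def parse(record):
    """Steps 1-2: parse a raw input record into a JSON-style doc."""
    name, email = record.split(",")
    return {"name": name.strip(), "email": email.strip().lower()}

def search_candidates(doc, index):
    """Steps 4-5: return IDs of existing docs that might match."""
    return index.get(doc["email"], [])

def ingest(records):
    staging = [parse(r) for r in records]           # steps 1-2: staging collection
    final, index = {}, {}                           # final collection + search index
    next_id = 1
    for doc in staging:                             # step 3: read each staged doc
        candidates = search_candidates(doc, index)  # steps 4-5
        if candidates:                              # step 6: merge with existing doc
            final[candidates[0]].update(doc)
        else:                                       # step 6: save as new doc
            final[next_id] = doc
            index.setdefault(doc["email"], []).append(next_id)
            next_id += 1
    return final

docs = ingest(["Ann, ann@x.com", "Bob, bob@x.com", "Ann Smith, ANN@x.com"])
print(len(docs))  # 2 -- the two Ann records were merged
```

Most incoming users already exist in the database (next slide), so in practice step 6 is usually a merge, not an insert.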
Summary of Process
• Customer receives batch files with user data from different sources
• Needs to reconcile with existing records
• Update info or create new records
• Most users already exist in the DBs
The Stack
• MongoDB 2.6
AWS environment with 10 MongoDB shards, each with one primary, one replica, and one arbiter. The primaries and replicas run on r3.2xlarge instances. Each has 4 EBS volumes attached: 1x 1TB, 1x 60GB, and 2x 25GB with 3000, 180/3000, 250, and 200 IOPS, respectively.
• Mongo config servers
Three mongo config servers run on 3 m3.large instances. These instances also run the Zookeeper processes used to manage the SolrCloud cluster. We saw Zookeeper stability issues here.
• Mongos
Mongos routers run on the same VMs as our application logic; each application server connects to a local mongos.
• Solr 5.4.1
A SolrCloud cluster with 20 shards and a replication factor of 5. Each primary and secondary is deployed to a c4.4xlarge instance. Each instance has 3 EBS volumes: 2x 60GB and 1x 200GB with 180/3000, 1000, and 600/3000 IOPS, respectively.
• Application Nodes
Ingestion/indexing nodes run on c4.xlarge instances: 10 such nodes running Tomcat 7.
Design Considerations
• Wanted to build the environment by “hand” to save time and meet
project goals – automate now or later?
◦ Which automation and orchestration technology to use?
• What is the optimal shard-to-replica ratio?
• To P-IOPS or not P-IOPS? Or supersize the instance?
• RAID or not? If so, which level?
• Scale-up or scale-out? Cost considerations
• Solr versus SolrCloud?
• Goal was to process 5 million records per hour
• Money, Money, Money?
The Math
• Needed to process a total data set of 2TB for Staging and
2TB for Production
◦ Estimated 50 shards with 80GB per shard
◦ Choose instance size with memory to fit 2TB / 50 shards
• Separate mongos nodes, or co-located with the app nodes?
• P-IOPS disks vs. RAID 0
◦ Cost vs performance
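The sizing arithmetic above works out as a quick back-of-the-envelope check (using only the figures on this slide):

```python
# Back-of-the-envelope shard sizing from the slide's figures.
staging_tb = 2       # staging data set, TB
production_tb = 2    # production data set, TB
shards = 50

total_gb = (staging_tb + production_tb) * 1000   # 4000 GB across both environments
gb_per_shard = total_gb / shards                 # matches the ~80 GB/shard estimate
mem_target_gb = staging_tb * 1000 / shards       # ~40 GB of one environment's data per shard

print(gb_per_shard, mem_target_gb)  # 80.0 40.0
```

The ~40 GB-per-shard working set is what drives the memory requirement when picking an instance size in the next slide.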
Instance Selection
• Scale-up versus Scale-out
• IO Calculations
Name         vCPU  Memory  Instance storage  I/O         On-Demand  Cost per GB RAM  Cost per core per GB RAM
r3.8xlarge   32    244     SSD 2 x 320       10 Gigabit  $ 2.6600   $ 0.01090        $ 0.0000447
r3.4xlarge   16    122     SSD 1 x 320       High        $ 1.3300   $ 0.01090        $ 0.0000894
r3.2xlarge   8     61      SSD 1 x 160       High        $ 0.6650   $ 0.01090        $ 0.0001787
m2.4xlarge   8     68.4    2 x 840           High        $ 0.9800   $ 0.01433        $ 0.0002095
cr1.8xlarge  32    244     SSD 2 x 120       10 Gigabit  $ 3.5000   $ 0.01434        $ 0.0000588
m4.xlarge    4     16      --                High        $ 0.2390   $ 0.01494        $ 0.0009336
m4.10xlarge  40    160     --                10 Gigabit  $ 2.3940   $ 0.01496        $ 0.0000935
m4.4xlarge   16    64      --                High        $ 0.9580   $ 0.01497        $ 0.0002339
m4.2xlarge   8     32      --                High        $ 0.4790   $ 0.01497        $ 0.0004678
m3.2xlarge   8     30      SSD 2 x 80        High        $ 0.5320   $ 0.01773        $ 0.0005911
m3.xlarge    4     15      SSD 2 x 40        High        $ 0.2660   $ 0.01773        $ 0.0011822
d2.8xlarge   36    244     HDD 24 x 2000     10 Gigabit  $ 5.5200   $ 0.02262        $ 0.0000927
This helped us find the right instance for us; there were other considerations as well – P-IOPS or not?
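The "Cost per GB RAM" column is just the on-demand hourly price divided by instance memory. Recomputing it for a few rows (the prices are the historical ones listed in the table above):

```python
# name: (vcpu, memory_gb, on_demand_usd_per_hr) -- values taken from the table above
instances = {
    "r3.2xlarge": (8, 61, 0.665),
    "m4.xlarge":  (4, 16, 0.239),
    "d2.8xlarge": (36, 244, 5.52),
}

for name, (vcpu, mem_gb, price) in instances.items():
    cost_per_gb_ram = price / mem_gb
    print(f"{name}: ${cost_per_gb_ram:.5f} per GB RAM per hour")
# Of these three, r3.2xlarge is cheapest per GB of RAM (~$0.01090),
# which is why the r3 family sorts to the top of the table.
```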
Initial Design
• <insert initial design>
Final Design
• MongoDB nodes
◦ 10x c4.8xlarge
• Solr nodes
◦ 100x c4.4xlarge
◦ Max heap 6 GB
◦ Replication factor of 1 (shard, non-cloud mode)
• Separate EBS volumes for MongoDB data, journal, and logs
• 4x1TB EBS RAID0 drives for MongoDB data
The tuning journey
• Driving optimal load from mongos to the Solr nodes
• Throttling the number of Java threads so they generate only as much
load as the cluster can handle
• Maximizing CPU utilization on the Solr nodes while not saturating
network bandwidth
• Use of 10 Gigabit networking between EC2 instances
• Determining the IOPS requirement to maximize throughput
(number of ingestion records processed)
• Gradually increasing threads instead of overloading Solr at the start
• Running Solr's optimize periodically to sustain steady Solr search
response times
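The gradual thread ramp-up described above can be sketched as a simple throttle. This is illustrative only (the real ingestion nodes were Java on Tomcat, and `index_doc` stands in for the actual Solr call):

```python
from concurrent.futures import ThreadPoolExecutor

def index_doc(doc):
    """Stand-in for the real call that posts one doc to Solr."""
    return doc["id"]

def ramp_up_ingest(docs, start_threads=2, max_threads=16, step=2, batch=100):
    """Feed batches to a growing thread pool rather than hitting Solr
    with max_threads of load from the very start."""
    threads, sent, indexed = start_threads, 0, 0
    while sent < len(docs):
        chunk = docs[sent:sent + batch]
        with ThreadPoolExecutor(max_workers=threads) as pool:
            indexed += sum(1 for _ in pool.map(index_doc, chunk))
        sent += len(chunk)
        threads = min(threads + step, max_threads)  # gradual ramp-up
    return indexed

print(ramp_up_ingest([{"id": i} for i in range(500)]))  # 500
```

In production the ramp-up would be driven by observed Solr response times rather than a fixed step size.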
Automation Saved the Day!
Solr
Parameterized ETL Environment
• Number of Instances
• Type of Instances
• Other params
Tuning by Doing!
Some charts along the way
Graph showing throughput
reaching around 75,000
records per 15 mins for each
node. We reached nearly 3
million records per hour but
with spikes (and errors).
Things are getting better!
Graph showing throughput
reaching around 75,000
records per 15 mins for each
node. We reached nearly 3
million records per hour
sustained throughput.
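As a sanity check, the per-node rate and the cluster rate on this chart are consistent, assuming the 10 ingestion nodes listed on the stack slide:

```python
# Cross-checking the chart: per-node rate vs. cluster rate.
records_per_15min_per_node = 75_000
app_nodes = 10  # ingestion/indexing nodes, per "The Stack" slide

cluster_per_hour = records_per_15min_per_node * 4 * app_nodes
print(cluster_per_hour)  # 3000000 -- i.e. ~3 million records/hour
```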
What worked to improve throughput?
Our primary goal was to maximize throughput at the optimal price.
• Gradual increase in Solr load/throttle
• Running Solr's optimize frequently to keep the Solr index evenly
spread and sustain Solr search times
• Use of 10 Gigabit networking between EC2 nodes, improving on the
roughly 3 Gigabit bandwidth we had before
• Use of RAID0 with 4 disks to improve IO, keeping the read queue smaller
• Use of a RAID read block size of 32 for MongoDB
• Disabling read-ahead for MongoDB
• Use of enhanced EBS networking
Results
• Client has an automated, dynamically built environment
• Automated parts of code deployment processes
◦ (Jenkins -> S3 -> Nodes)
• Capabilities delivered include:
◦ Offsite code archiving (S3)
◦ Infrastructure automation
◦ CloudTrail, VPC Flow Logs and S3 access logs helped with auditing
◦ Created over 100 Chef recipes
Next Steps
• Upgrade the technology stack and increase instance
utilization using Apache Mesos
• Enable Self Service for Engineers
◦ Create a new MongoDB / SOLR cluster
◦ Start a cluster
◦ Stop a cluster
◦ Update the code on a cluster (deploy)
◦ Update configuration parameters
◦ Resize instances within a named cluster
• Create Rundeck or Jenkins based dashboard
• Capture state of all clusters / generate daily reports for
management
• Optimize Cost and Performance of Ingestion Cluster
StackArmor Demo
Thank you
Gaurav “GP” Pal
Principal,
stackArmor.com
gpal@stackArmor.com
(571) 271 4396
www.stackArmor.com