stackArmor presentation for DevOpsDC ver 4

Proprietary and confidential information of stackArmor
PRESENTATION FOR DEVOPSDC
MAY 17, 2016
ETL processing at scale with
MongoDB and Solr on AWS using Chef

Delivering innovation with Cloud, Data
Analytics and Automation
• Cloud orchestration and automation
• Migration and Operations support
• Cloud hosting and SaaS Development
www.stackArmor.com
Our Partnerships

The Customer
• Big Data Analytics SaaS Firm
◦ 500 million records every month needed to be processed as fast as possible at
the lowest possible cost
◦ The existing on-premises infrastructure took 3 weeks to complete the process
◦ Processing costs were a major concern: the run had to fit within a very
tight and specific budget
◦ For HIPAA compliance reasons, the application components could not be
altered
◦ Needed to process 5 million records per hour, complete the process in 1-2
days and tear down the environment after completion
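The throughput targets above can be checked with simple arithmetic. The sketch below uses only the figures from this slide; the function names are illustrative. Note that 500 million records at 5 million per hour works out to about 100 hours, which is longer than 1-2 days. The slides do not say how the two targets were reconciled (for example, whether a single run covers the full monthly volume), so treat this as a sanity check rather than the customer's actual plan.

```python
def hours_needed(total_records: int, records_per_hour: int) -> float:
    """Hours required to process total_records at a steady hourly rate."""
    return total_records / records_per_hour


def rate_needed(total_records: int, hours: float) -> float:
    """Steady hourly rate required to finish total_records within hours."""
    return total_records / hours


MONTHLY_RECORDS = 500_000_000   # from the slide
TARGET_RATE = 5_000_000         # records per hour, from the slide

run_hours = hours_needed(MONTHLY_RECORDS, TARGET_RATE)
rate_for_two_days = rate_needed(MONTHLY_RECORDS, 48)

print(f"{run_hours:.0f} hours at the target rate")
print(f"{rate_for_two_days:,.0f} records/hour needed to finish in 48 hours")
```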

ETL System Overview
[Diagram: ETL data flow]
Components:
• Input Files
• Parsing Component
• Staging Collection (JSON Docs)
• Ingestion Component
• Search Index
• Final Collection (JSON Docs)
Steps:
1 - Read input records
2 - Store JSON Docs
3 - Read each JSON Doc
4 - Search for Docs which might be a match
5 - Get set of candidate Doc IDs
6 - After evaluating candidates, merge with existing Doc, or save as new Doc
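A minimal sketch of steps 3 to 6, using in-memory dictionaries as stand-ins for the MongoDB collections and the Solr search index. The field names (`email`, `last_name`) and the matching rule are hypothetical; the slides do not describe the customer's actual matching logic.

```python
from typing import Dict, List

Doc = Dict[str, str]


def find_candidates(index: Dict[str, List[int]], doc: Doc) -> List[int]:
    """Steps 4-5: return IDs of docs that might match (here: same email)."""
    return index.get(doc["email"].lower(), [])


def is_match(existing: Doc, incoming: Doc) -> bool:
    """Step 6, evaluation: hypothetical rule, same email and last name."""
    return (existing["email"].lower() == incoming["email"].lower()
            and existing["last_name"].lower() == incoming["last_name"].lower())


def ingest(staging: List[Doc], final: Dict[int, Doc],
           index: Dict[str, List[int]]) -> None:
    """Step 3 onward: read each staged doc, then merge or save as new."""
    for doc in staging:
        candidates = find_candidates(index, doc)
        match_id = next((i for i in candidates if is_match(final[i], doc)), None)
        if match_id is not None:
            final[match_id].update(doc)          # merge with existing doc
        else:
            new_id = max(final, default=0) + 1   # save as new doc
            final[new_id] = dict(doc)
            index.setdefault(doc["email"].lower(), []).append(new_id)


final_docs = {1: {"email": "a@example.com", "last_name": "Lee", "city": "Reston"}}
search_index = {"a@example.com": [1]}
staged = [
    {"email": "A@example.com", "last_name": "Lee", "city": "Herndon"},
    {"email": "b@example.com", "last_name": "Kim", "city": "Vienna"},
]
ingest(staged, final_docs, search_index)
```

The first staged record updates the existing document; the second is saved as a new one and added to the index so later records can match it.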

Summary of Process
• Customer receives batch files with user data from different sources
• Needs to reconcile with existing records
• Update info or create new records
• Most users already exist in the DBs

The Stack
• MongoDB 2.6
AWS environment with 10 MongoDB shards, each with one primary, one replica and one arbiter. The primaries
and replicas run on r3.2xlarge instances. Each has 4 EBS volumes attached: 1x 1TB, 1x 60GB and 2x 25GB,
with 3000, 180/3000, 250 and 200 IOPS, respectively.
• MongoConfig
Three MongoDB config servers run on 3 m3.large instances. These instances also run the ZooKeeper
processes used to manage the SolrCloud cluster. We saw issues with ZooKeeper stability.
• Mongos
Mongos nodes run on the same VMs as our application logic. Each application server connects to a local
mongos node.
• Solr 5.4.1
Running a SolrCloud cluster with 20 shards and a replication factor of 5. Each primary and secondary is
deployed to a c4.4xlarge instance. Each instance has 3 EBS volumes: 2x 60GB and 1x 200GB, with 180/3000,
1000 and 600/3000 IOPS, respectively.
• Application Nodes
Ingestion/indexing nodes run on c4.xlarge instances. There are 10 such nodes, each running Tomcat 7.

Design Considerations
• Wanted to build the environment by “hand” to save time and meet
project goals – automate now or later?
◦ Which automation and orchestration technology to use?
• What is the optimal shard to replica ratio?
• To P-IOPS or not P-IOPS? Or supersize the instance?
• RAID or not? If so, which level?
• Scale-up or Scale-out? Cost considerations
• Solr versus SolrCloud?
• Goal was to process 5 million records per hour
• Money, Money, Money?

The Math
• Needed to process a total data set of 2TB for Staging and
2TB for Production
◦ Estimated 50 shards with 80GB per shard
◦ Choose instance size with memory to fit 2TB / 50 shards
• Separate mongos nodes or with app nodes
• PIOPs disks vs RAID 0
◦ Cost vs performance
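The sizing figures above can be reproduced as follows. This is a sketch under two assumptions: decimal units (1 TB = 1000 GB), and that the 80GB-per-shard estimate covers both data sets (2TB Staging plus 2TB Production) spread over 50 shards. Under that reading, "2TB / 50 shards" gives the per-shard share of a single data set.

```python
GB_PER_TB = 1000  # assumption: decimal units

STAGING_TB = 2      # from the slide
PRODUCTION_TB = 2   # from the slide
SHARDS = 50         # from the slide

# Both data sets spread across all shards.
total_gb = (STAGING_TB + PRODUCTION_TB) * GB_PER_TB
per_shard_gb = total_gb / SHARDS

# One data set (2TB) spread across all shards.
per_shard_single_set_gb = PRODUCTION_TB * GB_PER_TB / SHARDS

print(f"{per_shard_gb:.0f} GB per shard for both data sets")
print(f"{per_shard_single_set_gb:.0f} GB per shard for one data set")
```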

Instance Selection
• Scale-up versus Scale-out
• IO Calculations
Name         vCPU  Memory (GB)  Instance Storage (GB)  I/O         On-Demand ($/hr)  Cost per GB RAM  Cost per Core per GB RAM
r3.8xlarge   32    244          SSD 2 x 320            10 Gigabit  $ 2.6600          $ 0.01090        $ 0.0000447
r3.4xlarge   16    122          SSD 1 x 320            High        $ 1.3300          $ 0.01090        $ 0.0000894
r3.2xlarge   8     61           SSD 1 x 160            High        $ 0.6650          $ 0.01090        $ 0.0001787
m2.4xlarge   8     68.4         2 x 840                High        $ 0.9800          $ 0.01433        $ 0.0002095
cr1.8xlarge  32    244          SSD 2 x 120            10 Gigabit  $ 3.5000          $ 0.01434        $ 0.0000588
m4.xlarge    4     16           --                     High        $ 0.2390          $ 0.01494        $ 0.0009336
m4.10xlarge  40    160          --                     10 Gigabit  $ 2.3940          $ 0.01496        $ 0.0000935
m4.4xlarge   16    64           --                     High        $ 0.9580          $ 0.01497        $ 0.0002339
m4.2xlarge   8     32           --                     High        $ 0.4790          $ 0.01497        $ 0.0004678
m3.2xlarge   8     30           SSD 2 x 80             High        $ 0.5320          $ 0.01773        $ 0.0005911
m3.xlarge    4     15           SSD 2 x 40             High        $ 0.2660          $ 0.01773        $ 0.0011822
d2.8xlarge   36    244          HDD 24 x 2000          10 Gigabit  $ 5.5200          $ 0.02262        $ 0.0000927
This comparison helped us find the right instance; there were other
considerations as well, such as P-IOPS or not?
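The "Cost per GB RAM" figures in the table are consistent with dividing the on-demand price by the instance memory. A short sketch of that calculation for a few rows (prices and memory sizes copied from the table; the function name is illustrative):

```python
from typing import Dict, Tuple

# name -> (on-demand price, memory in GB), copied from the table
INSTANCES: Dict[str, Tuple[float, float]] = {
    "r3.8xlarge": (2.6600, 244),
    "r3.2xlarge": (0.6650, 61),
    "m4.xlarge": (0.2390, 16),
    "m3.2xlarge": (0.5320, 30),
}


def cost_per_gb_ram(price: float, memory_gb: float) -> float:
    """On-demand price divided by memory size."""
    return price / memory_gb


# Rank instances from cheapest to most expensive per GB of RAM.
ranked = sorted(INSTANCES, key=lambda name: cost_per_gb_ram(*INSTANCES[name]))
for name in ranked:
    print(f"{name:12s} {cost_per_gb_ram(*INSTANCES[name]):.5f}")
```

Ranking by this metric puts the r3 family first, which matches its position at the top of the table.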

Initial Design
• <insert initial design>

Final Design
• MongoDB nodes
◦ 10x c4.8xlarge
• Solr nodes
◦ 100x c4.4xlarge
◦ Max heap 6 GB
◦ Replication factor of 1 (sharded, non-cloud mode)
• Separate EBS volumes for MongoDB data, journal, and logs
• 4x1TB EBS RAID0 drives for MongoDB data

The tuning journey
• Driving optimal load from the mongos nodes to the Solr nodes
• Throttling the number of Java threads so they generate only the load the
cluster can handle
• Maximizing CPU utilization on the Solr nodes without saturating
network bandwidth
• Use of a 10 Gigabit network between EC2 instances
• Determining the IOPS required to maximize throughput
(number of ingestion records processed)
• Gradually increasing threads instead of overloading Solr at the start
• Running the Solr optimize function periodically to sustain steady Solr search
response times
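The gradual thread increase can be sketched as a ramp schedule: start with a few worker threads and step up to the maximum the cluster tolerates. The thread counts and step size below are hypothetical; the slides do not give the actual values used.

```python
from typing import List


def ramp_schedule(start: int, maximum: int, step: int) -> List[int]:
    """Thread counts to apply in order, rising from start to maximum."""
    if start < 1 or step < 1 or maximum < start:
        raise ValueError("need 1 <= start <= maximum and step >= 1")
    counts = list(range(start, maximum, step))
    counts.append(maximum)  # always finish exactly at the maximum
    return counts


# Hypothetical values: begin with 2 threads, add 4 at a time, cap at 16.
schedule = ramp_schedule(start=2, maximum=16, step=4)
print(schedule)
```

Each value would be held for a fixed interval while watching Solr response times and error rates before moving to the next one.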

Automation Saved the Day!
• Parameterized ETL environment (Solr)
◦ Number of instances
◦ Type of instances
◦ Other params
• Tuning by doing!
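One way to picture a parameterized environment is as a small, typed parameter set that the automation consumes. The sketch below is illustrative only: the actual environment was built with Chef, and the class and field names here are hypothetical. The example values are taken from the Final Design slide.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class EtlEnvironment:
    """Hypothetical parameter set for one ETL environment build."""
    mongo_instance_type: str
    mongo_instance_count: int
    solr_instance_type: str
    solr_instance_count: int
    other_params: Dict[str, str] = field(default_factory=dict)

    def total_instances(self) -> int:
        """Number of MongoDB plus Solr instances to launch."""
        return self.mongo_instance_count + self.solr_instance_count


# Values from the Final Design slide; the "solr_max_heap" key is made up.
final_design = EtlEnvironment(
    mongo_instance_type="c4.8xlarge",
    mongo_instance_count=10,
    solr_instance_type="c4.4xlarge",
    solr_instance_count=100,
    other_params={"solr_max_heap": "6g"},
)
print(final_design.total_instances())
```

Changing one field and rebuilding is what makes "tuning by doing" cheap: each experiment is a new parameter set rather than a manual rebuild.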

Some charts along the way
[Graph: throughput reaching around 75,000 records per 15 minutes for each node.]
We reached nearly 3 million records per hour, but with spikes (and errors).

Things are getting better!
[Graph: throughput reaching around 75,000 records per 15 minutes for each node.]
We reached nearly 3 million records per hour of sustained throughput.
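The per-node and cluster-wide figures line up if "node" refers to the 10 ingestion/indexing application nodes listed under The Stack. That mapping is an assumption; the chart captions do not say which nodes are meant.

```python
RECORDS_PER_15_MIN_PER_NODE = 75_000  # from the chart caption
INTERVALS_PER_HOUR = 60 // 15         # four 15-minute intervals
NODES = 10                            # assumption: the 10 application nodes

per_node_per_hour = RECORDS_PER_15_MIN_PER_NODE * INTERVALS_PER_HOUR
cluster_per_hour = per_node_per_hour * NODES

print(f"{per_node_per_hour:,} records/hour per node")
print(f"{cluster_per_hour:,} records/hour across {NODES} nodes")
```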

What worked to improve throughput?
Our primary goal was to maximize throughput at the optimal price
• Gradual increase in the Solr load/throttle
• Frequent Solr optimize runs to keep the Solr index evenly spread and
sustain Solr search times
• Use of a 10 Gigabit network between EC2 nodes to sustain more
than 3GB of network bandwidth
• Use of RAID 0 across 4 disks to improve I/O, keeping the read queue smaller
• Use of a RAID read block size of 32 for MongoDB
• Disable read-ahead for MongoDB
• Use of enhanced EBS networking
Results
• Client has an automated, dynamically built environment
• Automated parts of code deployment processes
◦ (Jenkins -> S3 -> Nodes)
• Capabilities delivered include:
◦ Offsite code archiving (S3)
◦ Infrastructure automation
◦ CloudTrail, VPC Flow Logs and S3 access logs helped with auditing
◦ Created over 100 Chef recipes

Next Steps
• Upgrade the technology stack and increase instance
utilization using Apache Mesos
• Enable Self Service for Engineers
◦ Create a new MongoDB / SOLR cluster
◦ Start a cluster
◦ Stop a cluster
◦ Update the code on a cluster (deploy)
◦ Update configuration parameters
◦ Resize instances within a named cluster
• Create a Rundeck- or Jenkins-based dashboard
• Capture the state of all clusters / generate daily reports for
management
• Optimize cost and performance of the ingestion cluster

Thank you
Gaurav “GP” Pal
Principal,
stackArmor.com
gpal@stackArmor.com
(571) 271 4396
www.stackArmor.com
