High Level Architecture
[Architecture diagram: Multiple Data Sources → Data Ingest (Hadoop Jobs) → Processing → Data Storage → Output Datasets → Location API / Intelligence]
Our Challenges
• Multiple data sources – social, retail, events, news, census, location etc.
• Spatial data analysis and querying – location overlay on data
• Temporal nature of the input datasets
• Large input datasets – hundreds of GB of compressed input per job
• Complex processing and business logic based on use cases
• Custom output data formats – JSON, XML, XLS, flat files etc.
Why Amazon EMR?
"I am interested in using Hadoop to solve business problems – not in building and managing Hadoop infrastructure!"
• Scalable Storage – S3
• Flexible Computing – EC2
• No Hadoop Management – EMR
EMR Map Reduce Jobs
Amazon EMR supports streaming, custom JAR, Cascading, Pig and Hive jobs.
• Streaming – write Map Reduce jobs in any scripting language.
• Custom JAR – write in Java; good for speed/control.
• Cascading, Hive and Pig – higher levels of abstraction.
Check the AWS EMR forums if you need help.
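Streaming jobs are just two line-oriented scripts: Hadoop pipes input lines to the mapper on stdin, sorts the mapper's "key\tvalue" output by key, then pipes the sorted lines to the reducer. A minimal word-count sketch in Python (the structure and names here are my illustration, not from the deck):

```python
# Minimal Hadoop Streaming word-count sketch (illustrative; not from the deck).

def mapper(lines):
    # Emit "word\t1" for every token on every input line.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    # Lines arrive grouped (sorted) by key, so a running total per key suffices.
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# As streaming scripts, each half would read sys.stdin and print its records,
# e.g.: for rec in mapper(sys.stdin): print(rec)
```

On EMR this pair would be launched through hadoop-streaming.jar with -mapper and -reducer pointing at the two scripts (exact invocation depends on your EMR/Hadoop version).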
EMR – Good, Bad and Ugly
• Great for bootstrapping large clusters and very cost-effective for transient clusters.
• Most patches are applied and Amazon releases new AMIs with improvements – but not for everything.
• Intermittent network issues can sometimes cause serious degradation of performance.
• Network/disk IO is variable based on instance type, and streaming jobs will be noticeably slower on EMR compared to a dedicated setup.
• Be ready to face variable performance in the cloud.
Hadoop and EMR – Jobs
• Use a local Hadoop setup for debugging your jobs – there is no easy way to do it on EMR.
• Capture EMR cluster metrics – always bootstrap with Ganglia.
• High JVM memory allocation leads to long GC pauses.
• Don't trust EMR's tuned Hadoop configuration settings – benchmark on a small cluster for data points.
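With the era's Ruby elastic-mapreduce CLI, bootstrapping Ganglia meant attaching Amazon's published bootstrap action at cluster launch. A sketch (cluster name and sizing are placeholders; verify the CLI syntax and bootstrap-action path against the EMR documentation for your setup):

```shell
# Launch a cluster with Ganglia installed via an EMR bootstrap action (sketch).
elastic-mapreduce --create --alive --name "my-cluster" \
  --num-instances 6 --instance-type m1.xlarge \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia
```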
Hadoop and EMR – Job Performance
• GC overhead – increase memory and reduce the JVM reuse tasks setting.
• Avoid read contention at S3 – have at least as many files in S3 as available mappers.
• Use MapRed output compression to save storage, processing time and bandwidth costs.
• Set the MapRed task timeout to 0 if you have long-running tasks (> 10 mins), and disable speculative execution.
• Always benchmark third-party libraries used in your job code before putting them in production – there is too much sluggish stuff out there.
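The compression, timeout and speculative-execution settings above can be passed per job with -D flags. A sketch using the Hadoop 1.x-era property names that EMR used at the time (jar, class and bucket names are placeholders; verify property names against your Hadoop version):

```shell
# Per-job tuning flags (Hadoop 1.x-era names). For a custom JAR, -D flags are
# only picked up if the main class goes through ToolRunner/GenericOptionsParser.
hadoop jar my-job.jar com.example.MyJob \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D mapred.task.timeout=0 \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks.speculative.execution=false \
  s3n://my-bucket/input/ s3n://my-bucket/output/
```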
Hadoop – High Level Tuning
• Small files problem – avoid too many small files in S3.
• Tune your settings – JVM reuse, sort buffer, sort factor, map/reduce tasks, parallel copies, MapRed output compression etc.
• Know what is limiting you at a node level – CPU, memory, disk IO or network IN/OUT.
• Good thing is that you can use a small cluster and a sample input size for tuning.
Performance Tuning Golden Rules
When you are operating at very large scale, even 10 ms makes a big difference!
Example: moving away from Simple-JSON to Jackson
• JSON parsing – 600 ms
• Optimized parsing – 500 ms
• Number of input JSON records – 3 million
• Time saved by this simple optimization – 84 hrs
We have seen improvements from 10x to 100x in our production clusters – significant money savings.
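The arithmetic behind the parsing example is worth spelling out. Assuming the 600 ms / 500 ms figures are per record, the saving comes to just over 83 hours, consistent with the slide's quoted ~84 hrs:

```python
# Back-of-the-envelope check of the slide's numbers (illustrative).
ms_before = 600          # parse time per record before the switch (from the slide)
ms_after = 500           # parse time per record after the switch (from the slide)
records = 3_000_000      # input JSON records (from the slide)

saved_ms = (ms_before - ms_after) * records
saved_hours = saved_ms / 1000 / 3600
print(round(saved_hours, 1))  # -> 83.3, which the deck rounds to ~84 hrs
```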
Lesson Learned – Saving Time
Hadoop job with complex business logic operating on 350 MB input size

Job Language               Cluster Size   Input Files                 Processing Time
Ruby                       6 m1.xlarge    1000                        184 mins
Java                       6 m1.xlarge    1000                        69 mins
Java                       6 m1.xlarge    100 (1000 files combined)   39 mins
Java (EMR tuned)           6 m1.xlarge    100 (1000 files combined)   25 mins
Java (EMR and code tuned)  6 m1.xlarge    100 (1000 files combined)   13 mins
Lesson Learned – Saving Cost
A data mining job in production with 50 GB compressed input data

Job Language               Cluster Size    Processing Time   Each Job Cost   100 Jobs Cost Per Month
Ruby                       50 m2.2xlarge   240 mins          $242            $24200
Java                       20 m1.xlarge    200 mins          $68             $6800
Java (EMR tuned)           20 m1.xlarge    165 mins          $50             $5000
Java (EMR and code tuned)  20 m1.xlarge    50 mins           $17             $1700
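The per-job costs follow from instances × billed hours × hourly rate, with EC2/EMR billing each instance per started hour in this era. A sketch of that arithmetic (the $1.21/instance-hour rate below is inferred to reproduce the Ruby row, not stated in the deck):

```python
# Cost arithmetic sketch (hourly rate is an assumption, not from the deck).
import math

def job_cost(instances, minutes, hourly_rate):
    # Each instance is billed for whole hours, rounded up.
    billed_hours = math.ceil(minutes / 60.0)
    return instances * billed_hours * hourly_rate

# Ruby row: 50 instances x 240 mins at an assumed $1.21/instance-hour
print(job_cost(50, 240, 1.21))  # ~242.0, matching the slide's $242 per job
```

At 100 jobs per month the per-job figure scales linearly, which is why shrinking the cluster and the runtime compounds into the large monthly savings shown above.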
EMR Cost Optimization
• Use a small dedicated/transient cluster.
• Leverage spot instances for task nodes.
• Optimize, profile and tune your code always – code first and config next.
• Tune EMR configuration based on historical jobs data.
• Always benchmark third-party libraries.