Surviving Hadoop on AWS
Transcript

  • 1. SURVIVING HADOOP ON AWS IN PRODUCTION
  • 2. DISCLAIMER: I AM A BAD PERSON.
  • 3. ABOUT ME: Chief Data Scientist at Yieldbot, Co-Founder at StockTwits. @sorenmacbeth
  • 4. YIELDBOT: “Yieldbot's technology creates a marketplace where search advertisers buy real-time consumer intent on premium publishers.”
  • 5. WHERE WE ARE TODAY: MapR M3 on EMR. All data read from and written to S3.
  • 6. CLOJURE FOR DATA PROCESSING: All of our MapReduce jobs are written in Cascalog. This gives us speed, flexibility, and testability. More importantly, Clojure and Cascalog are fun to write.
  • 7. CASCALOG EXAMPLE

        (ns lucene-cascalog.core
          (:gen-class)
          (:use cascalog.api)
          (:import org.apache.lucene.analysis.standard.StandardAnalyzer
                   org.apache.lucene.analysis.TokenStream
                   org.apache.lucene.util.Version
                   org.apache.lucene.analysis.tokenattributes.TermAttribute))

        (defn tokenizer-seq
          "Build a lazy-seq out of a tokenizer with TermAttribute"
          [^TokenStream tokenizer ^TermAttribute term-att]
          (lazy-seq
            (when (.incrementToken tokenizer)
              (cons (.term term-att) (tokenizer-seq tokenizer term-att)))))

  • 8. HADOOP IS COMPLEX
  • 9. “Fact: There are more Hadoop configuration options than there are stars in our galaxy.”
  • 10. EVEN IN THE BEST CASE SCENARIO, IT TAKES A LOT OF TUNING TO GET A HADOOP CLUSTER RUNNING WELL. There are large companies that make money solely by configuring and supporting Hadoop clusters for enterprise customers.
  • 11. RUNNING HADOOP ON AWS
  • 12. SO WHY RUN ON AWS? $$$
  • 13. HADOOP ON AWS: A PERSONAL HISTORY
  • 14. PIG AND ELASTICMAPREDUCE: Slow development cycle; writing Java sucks.
  • 15. CASCALOG AND ELASTICMAPREDUCE: Learning Emacs, Clojure, and Cascalog was hard, but was worth it. The way our jobs were designed sucked and didn't work well with ElasticMapReduce.
  • 16. CASCALOG AND SELF-MANAGED HADOOP CLUSTER: We used a hacked-up version of a Cloudera Python script to launch and bootstrap a cluster. We ran on spot instances. Cluster boot-up time SUCKED, and boots often failed. We paid for instances during bootstrap and configuration. Our jobs weren't designed to tolerate things like spot instances going away in the middle of a job. Drinking heavily dulled the pain a little.
  • 17. CASCALOG AND ELASTICMAPREDUCE AGAIN: Rebuilt data processing pipeline from scratch (only took nine months!). Data pipelines were broken out into a handful of fault-tolerant jobflow steps; each step writes output to S3. EMR supported spot instances at this point.
  • 18. WEIRD BUGS THAT WE'VE HIT: Bootstrap script errors; random cluster fuckedupedness; AMI version changes; vendor issues. My personal favourite: invisible S3 write failures.
  • 19. IF YOU MUST RUN ON AWS: Break your processing pipelines into stages; write out to S3 after each stage. Bake (a lot of) variability into your expected jobflow run times. Compress the data you are reading from and writing to S3 as much as possible. Drinking helps.
  • 20. QUESTIONS?
  • 21. YIELDBOT IS HIRING! http://yieldbot.com/jobs
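
The advice on slides 17 to 19 (break the pipeline into stages, write output after each stage, and watch for invisible write failures) can be sketched roughly as follows. This is a minimal illustration in Python, not the Cascalog jobflow from the talk: a local directory stands in for S3, and the names `write_verified`, `run_pipeline`, the `part-00000.jsonl` file, and the `_SUCCESS` marker are assumptions made up for the example.

```python
import json
import tempfile
from pathlib import Path


def write_verified(path: Path, records: list) -> None:
    """Write records as JSON lines, then re-read the file and compare counts."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    # Guard against silent write failures: count what actually landed.
    with path.open() as f:
        written = sum(1 for _ in f)
    if written != len(records):
        raise IOError(f"{path}: expected {len(records)} records, found {written}")


def read_records(path: Path) -> list:
    """Load a checkpoint written by write_verified."""
    with path.open() as f:
        return [json.loads(line) for line in f]


def run_pipeline(stages, records, out_dir: Path) -> list:
    """Run stages in order, skipping any stage whose checkpoint is complete."""
    for name, fn in stages:
        checkpoint = out_dir / name / "part-00000.jsonl"  # hypothetical layout
        done_marker = out_dir / name / "_SUCCESS"         # hypothetical marker
        if done_marker.exists():
            records = read_records(checkpoint)  # resume from saved output
            continue
        records = fn(records)
        write_verified(checkpoint, records)
        done_marker.touch()  # created last, so it implies a verified write
    return records


# Two toy stages; a rerun against the same directory skips finished stages.
stages = [
    ("tokenize", lambda recs: [w for line in recs for w in line.split()]),
    ("lowercase", lambda recs: [w.lower() for w in recs]),
]
out = Path(tempfile.mkdtemp())
result = run_pipeline(stages, ["Hadoop on AWS", "Drinking helps"], out)
```

Because the marker is only created after the read-back check, a node that disappears mid-stage leaves no marker behind, and the next run redoes just that stage instead of the whole pipeline.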

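Slide 19 also recommends compressing everything that moves to and from S3. A small sketch of that idea, again in Python and under assumptions: the codec here is gzip from the standard library (the talk does not say which codec was used), and the record shape and the helper names `compress_records` and `decompress_records` are hypothetical.

```python
import gzip
import json


def compress_records(records: list) -> bytes:
    """Serialize records as JSON lines and gzip the result."""
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return gzip.compress(payload)


def decompress_records(blob: bytes) -> list:
    """Inverse of compress_records."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Made-up, repetitive log-style records, the kind of data that compresses well.
records = [{"query": "hadoop on aws", "clicks": i} for i in range(1000)]
raw_size = len("\n".join(json.dumps(r) for r in records).encode("utf-8"))
blob = compress_records(records)
print(f"raw: {raw_size} bytes, gzipped: {len(blob)} bytes")
```

One design caveat: Hadoop can read gzip files directly, but a gzip file cannot be split across mappers, so many moderately sized files tend to work better than one large archive.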