Why run Hadoop in AWS?
• elastic: batch jobs can consume many nodes, but demand is
bursty rather than 24/7 – a great case for using EC2
• commodity hardware: MapReduce (MR) is built to tolerate node
failures – a great case for launching identical nodes from AMIs
• right-sizing: it is difficult to know a priori how large a
cluster is needed without first running significant jobs (to
test key/value skew, data quality, etc.)
• when your input data is already in S3, SDB, EBS, RDS…
• when your output needs to be consumed in AWS …
You really don't want to buy rack space in a datacenter before
assessing these issues – besides, a private datacenter probably
won’t even be cost-effective afterward.
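
The cost argument can be sketched as back-of-the-envelope arithmetic. This is only an illustration: the function names and every rate below are hypothetical placeholders, not actual AWS or datacenter prices.

```python
def monthly_cost(nodes, hours_per_day, rate_per_node_hour, days=30):
    """Cost of running `nodes` machines for `hours_per_day` each day."""
    return nodes * hours_per_day * days * rate_per_node_hour


def breakeven_hours_per_day(on_demand_rate, always_on_rate):
    """Daily usage above which an always-on cluster becomes cheaper.

    Both rates are per node-hour; the always-on rate is paid for all
    24 hours whether or not jobs are running.
    """
    return 24 * always_on_rate / on_demand_rate


if __name__ == "__main__":
    # Hypothetical numbers: 20 nodes, batch jobs run 4 hours per day.
    elastic = monthly_cost(nodes=20, hours_per_day=4, rate_per_node_hour=0.50)
    fixed = monthly_cost(nodes=20, hours_per_day=24, rate_per_node_hour=0.20)
    print(f"elastic cluster:   ${elastic:,.2f} per month")
    print(f"always-on cluster: ${fixed:,.2f} per month")
    print(f"break-even usage:  {breakeven_hours_per_day(0.50, 0.20):.1f} h/day")
```

Under these assumed rates the elastic cluster stays cheaper until jobs keep it busy for most of the working day; with real prices the break-even point will differ, which is the reason to measure actual jobs first.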