‘Amazon EMR’ coming up…by Sujee Maniyam
Big Data Cloud Meetup Cost Effective Big-Data Processing using Amazon Elastic Map Reduce Sujee Maniyam hello@sujee.net  |  www.sujee.net July 08, 2011
Cost Effective Big-Data Processing using Amazon Elastic Map Reduce Sujee Maniyam http://sujee.net hello@sujee.net
Quiz PRIZE! Where was this picture taken?
Quiz : Where was this picture taken?
Answer : Montara Light House
Hi, I'm Sujee: 10+ years of software development (enterprise apps, web apps, iPhone apps, Hadoop). Hands-on experience with Hadoop / HBase / Amazon 'cloud'. More : http://sujee.net/tech
I am  an ‘expert’ 
Ah.. Data
Nature of Data… Primary data: email, blogs, pictures, tweets; critical for operation (Gmail can't lose emails). Secondary data: Wikipedia access logs, Google search logs; not 'critical', but used to 'enhance' user experience. Search logs help predict 'trends'; Yelp can figure out you like Chinese food
Data Explosion: Primary data has grown phenomenally, but secondary data has exploded in recent years: "log everything and ask questions later". Used for recommendations (books, restaurants, etc.), predicting trends (job skills in demand), showing ads ($$$), etc. 'Big Data' is no longer just a problem for the big guys (Google / Facebook); startups are struggling to get on top of 'big data'
Hadoop to Rescue Hadoop can help with BigData Hadoop has been proven in the field Under active development Throw hardware at the problem Getting cheaper by the year Bleeding edge technology Hire good people!
Hadoop: It is a CAREER
Data Spectrum
Who is Using Hadoop?
Big Guys
Startups
Startups and bigdata
About This Presentation: Based on my experience with a startup: 5 people (3 engineers), ad-serving space. Amazon EC2 is our 'data center'. Technologies: web stack (Python, Tornado, PHP, MySQL, LAMP); Amazon EMR to crunch data. Data size : 1 TB / week
Story of a Startup…month-1 Each web server writes logs locally. Logs were copied to a log server and purged from the web servers. Log data size : ~100-200 G
Story of a Startup…month-6 More web servers come online. Aggregate log server falls behind
Data @ 6 months 2 TB of data already 50-100 G new data / day  And we were operating on 20% of our capacity!
Future…
Solution? Scalable database (NOSQL) Hbase Cassandra Hadoop log processing / Map Reduce
What We Evaluated 1) Hbase cluster 2) Hadoop cluster 3) Amazon EMR
Hadoop on Amazon EC2 1) Permanent Cluster 2) On demand cluster (elastic map reduce)
1) Permanent Hadoop Cluster
Architecture 1
Hadoop Cluster 7 C1.xlarge machines 15 TB EBS volumes Sqoop exports mysql log tables into HDFS Logs are compressed (gz) to minimize disk usage (data locality trade-off) All is working well…
Lessons Learned: c1.xlarge is pretty stable (8 cores / 8G memory). EBS volumes max out at 1 TB, so attach several per node for higher density. DON'T RAID them; let Hadoop handle them as individual disks. Open question: skip EBS, use instance-store disks, and store the data in S3?
Amazon Storage Options
2 months later: A couple of EBS volumes DIE. A couple of EC2 instances DIE. Maintaining the Hadoop cluster is a mechanical job, so less appealing. COST! Our job utilization is about 50%, but we are still paying for machines running 24x7
Amazon EC2 Cost
Hadoop cluster on EC2 cost: $3,500 = 7 c1.xlarge @ $500 / month; $1,500 = 15 TB EBS storage @ $0.10 per GB per month; $500 = EBS I/O requests @ $0.10 per 1 million I/O requests. Total: $5,500 / month, or about $66,000 / year!
Buy / Rent? Buy: a typical Hadoop machine costs $10k, so a 10-node cluster = $100k, plus data center costs, plus IT-ops costs. Rent: Amazon EC2 10-node cluster: $500 x 10 = $5,000 / month = $60k / year
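The monthly total on this slide can be re-derived in a few lines of Python. This is only a sketch of the slide's arithmetic: the function and parameter names are illustrative, and the prices are the 2011 figures quoted on the slide, not current AWS pricing.

```python
def permanent_cluster_monthly_cost(nodes, node_cost_per_month,
                                   ebs_gb, ebs_price_per_gb, io_cost):
    """Monthly cost in dollars: instances + EBS storage + EBS I/O requests."""
    instances = nodes * node_cost_per_month
    storage = ebs_gb * ebs_price_per_gb
    return instances + storage + io_cost


# Figures from the slide: 7 c1.xlarge at $500/month, 15 TB of EBS at
# $0.10 per GB, and $500 of EBS I/O requests.
monthly = permanent_cluster_monthly_cost(
    nodes=7, node_cost_per_month=500,
    ebs_gb=15_000, ebs_price_per_gb=0.10,
    io_cost=500,
)
yearly = monthly * 12
print(f"${monthly:,.0f} / month, ${yearly:,.0f} / year")
```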
Buy / Rent: Amazon EC2 is great for quickly getting started (startups) and for scaling on demand / rapidly adding more servers (popular social games). Netflix story: streaming is powered by EC2; encoding movies, etc.; uses 1000s of instances. Not so economical for running clusters 24x7
Next : Amazon EMR
Where was this picture taken?
Answer : Pacifica Pier
Amazon’s solution :  Elastic Map Reduce Store data on Amazon S3 Kick off a hadoop cluster to process data Shutdown when done Pay for the HOURS used
Architecture : Amazon EMR
Moving parts: Logs go into Scribe. Scribe master ships logs into S3, gzipped. Spin up an EMR cluster, run the job, done. Using the same old Java MR jobs for EMR. Summary data gets updated directly in MySQL
EMR Launch Scripts scripts  to launch jar EMR jobs Custom parameters depending on job needs (instance types, size of cluster ..etc) monitor  job progress Save logs for later inspection Job status (finished / cancelled) https://github.com/sujee/amazon-emr-beyond-basics
Sample Launch Script
#!/bin/bash
## run-sitestats4.sh

# config
MASTER_INSTANCE_TYPE="m1.large"
SLAVE_INSTANCE_TYPE="c1.xlarge"
INSTANCES=5
export JOBNAME="SiteStats4"
export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
# end config

echo "==========================================="
echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
export t1=$(date +%s)

# launch the job flow and capture its id
export JOBID=$(elastic-mapreduce --plain-output --create \
  --name "${JOBNAME}__${TIMESTAMP}" \
  --num-instances "$INSTANCES" \
  --master-instance-type "$MASTER_INSTANCE_TYPE" \
  --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
  --jar s3://my_bucket/jars/adp.jar \
  --main-class com.adpredictive.hadoop.mr.SiteStats4 \
  --arg s3://my_bucket/jars/sitestats4-prod.config \
  --log-uri s3://my_bucket/emr-logs/ \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

sh ./emr-wait-for-completion.sh
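For launch scripts that need different parameters per job, the same command line can be assembled programmatically. The sketch below builds the argument list in Python using only the flags that appear in the script above; `build_emr_command` is a hypothetical helper, not part of the linked repository, and it does not invoke the CLI itself.

```python
from datetime import datetime


def build_emr_command(job_name, instances, master_type, slave_type,
                      jar, main_class, job_args, log_uri, now=None):
    """Return the argument list for the elastic-mapreduce call on the slide."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    cmd = [
        "elastic-mapreduce", "--plain-output", "--create",
        "--name", f"{job_name}__{timestamp}",
        "--num-instances", str(instances),
        "--master-instance-type", master_type,
        "--slave-instance-type", slave_type,
        "--jar", jar,
        "--main-class", main_class,
    ]
    for arg in job_args:
        cmd += ["--arg", arg]
    cmd += ["--log-uri", log_uri]
    return cmd


# Fixed timestamp so the example is reproducible.
cmd = build_emr_command(
    job_name="SiteStats4", instances=5,
    master_type="m1.large", slave_type="c1.xlarge",
    jar="s3://my_bucket/jars/adp.jar",
    main_class="com.adpredictive.hadoop.mr.SiteStats4",
    job_args=["s3://my_bucket/jars/sitestats4-prod.config"],
    log_uri="s3://my_bucket/emr-logs/",
    now=datetime(2011, 7, 8, 12, 0, 0),
)
print(" ".join(cmd))
```

Keeping the command as a list (rather than one string) means it can be handed to `subprocess.run` without shell-quoting problems, and smaller or larger clusters become a matter of changing arguments.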
Mapred-config-m1-xl.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx3000M</value>
    </property>
    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>3</value>
        <description>4 is running out of memory</description>
    </property>
    <property>
        <name>mapred.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
    </property>
</configuration>
emr-wait-for-completion.sh Polls for job status periodically Saves the logs  Calculates job run time
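The polling logic described on this slide could look roughly like the Python sketch below. It is not the author's script: the status source is passed in as a function so the example runs without AWS, the set of terminal state names is an assumption (the slides only mention "finished / cancelled"), and log saving is left out.

```python
import time

# Assumed terminal states; adjust to whatever the status source reports.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED"}


def wait_for_completion(get_status, poll_seconds=60,
                        sleep=time.sleep, clock=time.time):
    """Poll get_status() until a terminal state; return (state, elapsed_seconds)."""
    start = clock()
    while True:
        state = get_status()
        if state in TERMINAL_STATES:
            return state, clock() - start
        sleep(poll_seconds)


# Stand-in status source for demonstration: a canned sequence of states.
fake_states = iter(["STARTING", "RUNNING", "RUNNING", "COMPLETED"])
state, elapsed = wait_for_completion(lambda: next(fake_states), poll_seconds=0)
print(state, f"{elapsed:.1f}s")
```

In a real script `get_status` would query the job flow identified by `JOBID`, and the elapsed time is what the slide calls the job run time.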
Saved Logs
Sample Saved Log
Data joining (x-ref): Data is split across log files, so we need to x-ref during the Map phase. We used to load the data into the mapper's memory (data was small and in MySQL). Now we use Membase (Memcached). Two MR jobs are chained: the first one processes logfile_type_A and populates Membase (very quick, takes minutes); the second one processes logfile_type_B and cross-references values from Membase
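The two chained jobs amount to a lookup-table join. A minimal Python sketch of the idea follows, with a plain dict standing in for Membase; the record fields (`id`, `site`, `clicks`) are invented for illustration and are not the real log schema.

```python
def populate_store(type_a_records, store):
    """Job 1: index logfile_type_A records by key in the shared store."""
    for rec in type_a_records:
        store[rec["id"]] = rec["site"]


def cross_reference(type_b_records, store):
    """Job 2 (map side): enrich each logfile_type_B record from the store."""
    for rec in type_b_records:
        site = store.get(rec["id"])
        if site is not None:  # skip records with no match in type A
            yield {**rec, "site": site}


store = {}  # stand-in for Membase / Memcached
log_a = [{"id": "u1", "site": "example.com"},
         {"id": "u2", "site": "example.org"}]
log_b = [{"id": "u1", "clicks": 3},
         {"id": "u3", "clicks": 1}]

populate_store(log_a, store)
joined = list(cross_reference(log_b, store))
print(joined)
```

The external store matters once the lookup data no longer fits in each mapper's memory: every mapper queries the same shared cache instead of loading its own copy.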
X-ref
EMR Wins: Cost, only pay for use (http://aws.amazon.com/elasticmapreduce/pricing/). Example: EMR ran on 5 c1.xlarge for 3 hrs. EC2 instances = $0.68 per hr x 5 instances x 3 hrs = $10.20. EMR fee (http://aws.amazon.com/elasticmapreduce/faqs/#billing-4; 1 hour of c1.xlarge = 8 hours normalized compute time) = 5 instances x 3 hrs x 8 normalized hrs x $0.12 = $14.40. Total for the run: $10.20 + $14.40 = $24.60, about $25. Plus S3 storage cost : 1 TB / month = $150. Data bandwidth from S3 to EC2 is FREE!
EMR Wins: No Hadoop cluster to maintain: no failed nodes / disks. Bonus : can tailor the cluster for various jobs: smaller jobs -> fewer machines; memory-hungry tasks -> m1.xlarge; CPU-hungry tasks -> c1.xlarge
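The per-run cost on this slide is easy to recompute for other cluster sizes. The sketch below simply reproduces the slide's formula with the slide's 2011 rates; the function name is illustrative, and S3 storage is a separate monthly charge that is not included.

```python
def emr_job_cost(instances, hours, ec2_rate, normalized_factor, emr_rate):
    """Return (ec2_cost, emr_fee) in dollars, following the slide's formula."""
    ec2 = instances * hours * ec2_rate
    emr = instances * hours * normalized_factor * emr_rate
    return ec2, emr


# 5 c1.xlarge for 3 hours, at the rates quoted on the slide.
ec2, emr = emr_job_cost(instances=5, hours=3, ec2_rate=0.68,
                        normalized_factor=8, emr_rate=0.12)
print(f"EC2 ${ec2:.2f} + EMR ${emr:.2f} = ${ec2 + emr:.2f}")
```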
Design Wins Bidders now write logs to Scribe directly  No mysql at web server machines Writes much faster! S3 has been a reliable  storage and cheap
Next : Lessons Learned
Where was this pic taken?
Answer : Foster City
Lessons learned : Logfile format, CSV to JSON
Started with CSV
CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL
20-40 fields… fragile, position dependent, hard to code:
url = csv[18] … (counting position numbers gets old after the 100th time around)
if (csv.length == 29) url = csv[28]; else url = csv[26];
JSON: { "exchange_id": 2, "url": "http://housemdvideos.com/seasons/video.php?s=01&e=07" …}
Self-describing, easy to add new fields, easy to process:
url = map.get("url")
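The difference between the two formats shows up as soon as a field is read. A small Python illustration, using shortened made-up log lines rather than the real ones:

```python
import csv
import io
import json

# Shortened, made-up log lines for illustration only.
csv_line = '"2","26","http://example.com/a"\n'
json_line = '{"exchange_id": 2, "url": "http://example.com/a"}'

# CSV: the field is found by position, so every reader has to know the
# layout and must change whenever a column is added or moved.
row = next(csv.reader(io.StringIO(csv_line)))
url_from_csv = row[2]

# JSON: the field is found by name, and a field added later is simply
# absent (None here) in older lines instead of shifting every index.
record = json.loads(json_line)
url_from_json = record["url"]
campaign = record.get("campaign")

print(url_from_csv, url_from_json, campaign)
```

Note that strict JSON requires quoted keys, so a standard parser such as `json.loads` needs `"url"` rather than a bare `url`.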
Lessons Learned : Control the amount of Input We get different type of events event A (freq: 10,000)   >>> event B (100)  >> event C (1) Initially we put them all into a single log file A A A A B A A B C
Control Input… So we have to process the entire file, even if we are interested only in 'event C': too much wasted processing. So we split the logs: log_A….gz, log_B….gz, log_C…gz. Now we process only a fraction of our logs. Input : s3://my_bucket/logs/log_B* x-ref using memcache if needed
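Once logs are split by event type, choosing the job input is just a matter of matching file names. A sketch in Python, where the date suffixes in the key names are hypothetical (the slide elides the middle of the names):

```python
from fnmatch import fnmatchcase


def select_inputs(keys, pattern):
    """Keep only the keys matching a glob such as 'logs/log_B*'."""
    return [k for k in keys if fnmatchcase(k, pattern)]


# Hypothetical key names; the real naming scheme is not shown on the slide.
keys = [
    "logs/log_A-20110701.gz",
    "logs/log_B-20110701.gz",
    "logs/log_C-20110701.gz",
    "logs/log_B-20110702.gz",
]
selected = select_inputs(keys, "logs/log_B*")
print(selected)
```

With the frequencies from the previous slide, a job that only needs event B skips the event A files entirely, which is where most of the volume is.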
Lessons learned : Incremental Log Processing Recent data (today / yesterday / this week) is more relevant than older data (6 months +) Adding ‘time window’ to our stats only process newer logs faster
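A time window can be applied before the job starts by filtering input files on a date embedded in their names. The sketch below assumes a `yyyymmdd` stamp in each key, which is an illustrative convention and not necessarily the one used in the talk.

```python
import re
from datetime import date, timedelta


def within_window(key, today, days):
    """True if the yyyymmdd date embedded in key falls in the last `days` days."""
    m = re.search(r"(\d{4})(\d{2})(\d{2})", key)
    if not m:
        return False
    stamp = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    return today - timedelta(days=days) < stamp <= today


# Hypothetical key names with a date stamp.
keys = [
    "logs/log_B-20110101.gz",
    "logs/log_B-20110705.gz",
    "logs/log_B-20110708.gz",
]
recent = [k for k in keys if within_window(k, date(2011, 7, 8), days=7)]
print(recent)
```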
EMR trade-offs: Lower performance on MR jobs compared to a cluster: reduced data throughput (S3 isn't the same as local disk), and data is streamed from S3 for each job. EMR Hadoop is not the latest version. Missing tools : Oozie. Right now, trading performance for convenience and cost
Next steps : faster processing Streaming S3 data for each MR job is not optimal Spin cluster Copy data from S3 to HDFS Run all MR jobs (make use of data locality) terminate
Next Steps : More Processing More MR jobs More frequent data processing Frequent log rolls Smaller delta window
Next steps : new software. Python, mrjob (from Yelp). Scribe to Cloudera Flume? Use workflow tools like Oozie. Hive? Ad hoc SQL-like queries
Next Steps : SPOT instances SPOT instances : name your price (ebay style) Been available on EC2 for a while Just became available for Elastic map reduce! New cluster setup: 10 normal instances + 10 spot instances Spots may go away anytime That is fine!  Hadoop will handle node failures Bigger cluster : cheaper & faster
Example Price Comparison
Next Steps : nosql. Summary data goes into MySQL: potential weak link (some tables have ~100 million rows and growing). Evaluating NoSQL solutions: using Membase in limited capacity. Watch out for Amazon's HBase offering
Take a test drive Just bring your credit-card  http://aws.amazon.com/elasticmapreduce/ Forum : https://forums.aws.amazon.com/forum.jspa?forumID=52
Thanks Questions? Sujee Maniyam http://sujee.net hello@sujee.net Devil’s slide, Pacifica

A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
 
What Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will WinWhat Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will Win
 
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBig Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

BigDataCloud meetup - July 8th - Cost effective big-data processing using Amazon EMR - Presentation by Sujee

  • 1. ‘Amazon EMR’ coming up…by Sujee Maniyam
  • 2. Big Data Cloud Meetup Cost Effective Big-Data Processing using Amazon Elastic Map Reduce Sujee Maniyam hello@sujee.net | www.sujee.net July 08, 2011
  • 3. Cost Effective Big-Data Processing using Amazon Elastic Map Reduce Sujee Maniyam http://sujee.net hello@sujee.net
  • 4. Quiz PRIZE! Where was this picture taken?
  • 5. Quiz : Where was this picture taken?
  • 6. Answer : Montara Light House
  • 7. Hi, I’m Sujee. 10+ years of software development: enterprise apps → web apps → iPhone apps → Hadoop. Hands-on experience with Hadoop / HBase / Amazon ‘cloud’. More: http://sujee.net/tech
  • 8. I am an ‘expert’ 
  • 10. Nature of Data… Primary data: email, blogs, pictures, tweets. Critical for operation (Gmail can’t lose emails). Secondary data: Wikipedia access logs, Google search logs. Not ‘critical’, but used to ‘enhance’ user experience. Search logs help predict ‘trends’. Yelp can figure out you like Chinese food.
  • 11. Data Explosion: Primary data has grown phenomenally, but secondary data has exploded in recent years. “Log everything and ask questions later.” Used for: recommendations (books, restaurants, etc.), predicting trends (job skills in demand), showing ads ($$$), etc. ‘Big Data’ is no longer just a problem for the big guys (Google / Facebook). Startups are struggling to get on top of ‘big data’.
  • 12. Hadoop to the Rescue: Hadoop can help with big data. Hadoop has been proven in the field. Under active development. Throw hardware at the problem (it is getting cheaper by the year). Bleeding-edge technology: hire good people!
  • 13. Hadoop: It is a CAREER
  • 15. Who is Using Hadoop?
  • 19. About This Presentation: Based on my experience with a startup. 5 people (3 engineers). Ad-serving space. Amazon EC2 is our ‘data center’. Technologies: web stack: Python, Tornado, PHP, MySQL, LAMP; Amazon EMR to crunch data. Data size: 1 TB / week.
  • 20. Story of a Startup…month-1: Each web server writes logs locally. Logs were copied to a log server and purged from the web servers. Log data size: ~100-200 GB.
  • 21. Story of a Startup…month-6: More web servers come online. The aggregate log server falls behind.
  • 22. Data @ 6 months: 2 TB of data already. 50-100 GB of new data / day. And we were operating at 20% of our capacity!
  • 24. Solution? Scalable database (NoSQL): HBase, Cassandra. Hadoop log processing / MapReduce.
  • 25. What We Evaluated: 1) HBase cluster 2) Hadoop cluster 3) Amazon EMR
  • 26. Hadoop on Amazon EC2: 1) Permanent cluster 2) On-demand cluster (Elastic MapReduce)
  • 29. Hadoop Cluster: 7 c1.xlarge machines. 15 TB of EBS volumes. Sqoop exports MySQL log tables into HDFS. Logs are compressed (gz) to minimize disk usage (data locality trade-off). All is working well…
  • 30. Lessons Learned: c1.xlarge is pretty stable (8 core / 8G memory). EBS volumes max out at 1 TB, so string a few together for higher density per node. DON’T RAID them; let Hadoop handle them as individual disks. ??: Skip EBS. Use instance-store disks, and store data in S3.
  • 32. 2 months later: A couple of EBS volumes DIE. A couple of EC2 instances DIE. Maintaining the Hadoop cluster is a mechanical job (less appealing). COST! Our job utilization is about 50%, but we are still paying for machines running 24x7.
  • 34. Hadoop cluster on EC2, cost: $3,500 = 7 c1.xlarge @ $500 / month. $1,500 = 15 TB EBS storage @ $0.10 per GB. $500 = EBS I/O requests @ $0.10 per 1 million I/O requests. → $5,500 / month, about $66,000 / year!
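The cost breakdown on this slide can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted on the slide; the names are illustrative, and treating 1 TB as 1,000 GB for billing is an assumption.

```python
# Figures taken from the slide; names are illustrative.
INSTANCES = 7
INSTANCE_COST_PER_MONTH = 500     # USD per c1.xlarge per month
EBS_TB = 15
EBS_COST_PER_GB_MONTH = 0.10      # USD per GB per month
EBS_IO_COST_PER_MONTH = 500       # USD, the slide's own estimate


def monthly_cost():
    """Sum compute, EBS storage, and EBS I/O for one month."""
    compute = INSTANCES * INSTANCE_COST_PER_MONTH
    storage = EBS_TB * 1000 * EBS_COST_PER_GB_MONTH  # assumes 1 TB = 1,000 GB
    return compute + storage + EBS_IO_COST_PER_MONTH


if __name__ == "__main__":
    month = monthly_cost()
    print(f"monthly: ${month:,.0f}  yearly: ${month * 12:,.0f}")
```

Keeping the three components separate makes it easy to see that compute dominates the bill, which is what the on-demand EMR approach later in the deck targets.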
  • 35. Buy / Rent? Typical Hadoop machine cost: $10k. 10-node cluster = $100k, plus data center costs, plus IT-ops costs. Amazon EC2 10-node cluster: $500 x 10 = $5,000 / month = $60k / year.
  • 36. Buy / Rent: Amazon EC2 is great for quickly getting started (startups) and for scaling on demand / rapidly adding more servers (popular social games). Netflix story: streaming is powered by EC2; encoding movies, etc.; uses 1000s of instances. Not so economical for running clusters 24x7.
  • 38. Where was this picture taken?
  • 40. Amazon’s solution : Elastic Map Reduce. Store data on Amazon S3. Kick off a Hadoop cluster to process the data. Shut down when done. Pay for the HOURS used.
  • 42. Moving parts: Logs go into Scribe. The Scribe master ships logs into S3, gzipped. Spin up an EMR cluster, run the job, done. Using the same old Java MR jobs for EMR. Summary data gets written directly to MySQL.
  • 43. EMR Launch Scripts: Scripts to launch jar EMR jobs. Custom parameters depending on job needs (instance types, size of cluster, etc.). Monitor job progress. Save logs for later inspection. Job status (finished / cancelled). https://github.com/sujee/amazon-emr-beyond-basics
  • 44. Sample Launch Script
        #!/bin/bash
        ## run-sitestats4.sh

        # config
        MASTER_INSTANCE_TYPE="m1.large"
        SLAVE_INSTANCE_TYPE="c1.xlarge"
        INSTANCES=5
        export JOBNAME="SiteStats4"
        export TIMESTAMP=$(date +%Y%m%d-%H%M%S)
        # end config

        echo "==========================================="
        echo $(date +%Y%m%d.%H%M%S) " > $0 : starting...."
        export t1=$(date +%s)

        # launch the job flow and capture its id
        export JOBID=$(elastic-mapreduce --plain-output --create \
            --name "${JOBNAME}__${TIMESTAMP}" \
            --num-instances "$INSTANCES" \
            --master-instance-type "$MASTER_INSTANCE_TYPE" \
            --slave-instance-type "$SLAVE_INSTANCE_TYPE" \
            --jar s3://my_bucket/jars/adp.jar \
            --main-class com.adpredictive.hadoop.mr.SiteStats4 \
            --arg s3://my_bucket/jars/sitestats4-prod.config \
            --log-uri s3://my_bucket/emr-logs/ \
            --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
            --args "--core-config-file,s3://my_bucket/jars/core-site.xml,--mapred-config-file,s3://my_bucket/jars/mapred-site.xml")

        # block until the job finishes
        sh ./emr-wait-for-completion.sh
  • 45. Mapred-config-m1-xl.xml
        <?xml version="1.0"?>
        <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
        <configuration>
          <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx1024M</value>
          </property>
          <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx3000M</value>
          </property>
          <property>
            <name>mapred.tasktracker.reduce.tasks.maximum</name>
            <value>3</value>
            <description>4 is running out of memory</description>
          </property>
          <property>
            <name>mapred.output.compress</name>
            <value>true</value>
          </property>
          <property>
            <name>mapred.output.compression.type</name>
            <value>BLOCK</value>
          </property>
        </configuration>
  • 46. emr-wait-for-completion.sh: Polls for job status periodically. Saves the logs. Calculates job run time.
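The polling and run-time parts of such a wait script can be sketched as follows. This is not the author's script (that one is a shell script in the linked repository); it is a Python illustration in which `get_status` is a hypothetical callable standing in for whatever queries the job state, and the terminal state names are assumptions. Saving the logs is left out.

```python
import time

# Assumed terminal state names; real values depend on the EMR tooling in use.
TERMINAL_STATES = {"COMPLETED", "FAILED", "TERMINATED", "CANCELLED"}


def wait_for_completion(get_status, poll_seconds=60,
                        sleep=time.sleep, clock=time.time):
    """Poll get_status() until it reports a terminal state.

    get_status is a hypothetical callable returning the job's current
    state as a string. Returns (final_state, elapsed_seconds).
    """
    start = clock()
    while True:
        state = get_status()
        if state in TERMINAL_STATES:
            return state, clock() - start
        sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated job: two RUNNING polls, then COMPLETED.
    states = iter(["RUNNING", "RUNNING", "COMPLETED"])
    final, elapsed = wait_for_completion(lambda: next(states), poll_seconds=0)
    print(final, f"{elapsed:.1f}s")
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise without waiting on a real cluster.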
  • 49. Data joining (x-ref): Data is split across log files and needs to be cross-referenced during the Map phase. We used to load the data into the mapper’s memory (the data was small and in MySQL). Now we use Membase (Memcached). Two MR jobs are chained: the first processes logfile_type_A and populates Membase (very quick, takes minutes); the second processes logfile_type_B and cross-references values from Membase.
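The two-step join described on this slide can be illustrated with a small sketch. A plain Python dict stands in for Membase here, and the record fields (`request_id`, `site`, `clicks`) are hypothetical; the real jobs are Java MapReduce jobs.

```python
def populate_lookup(type_a_records, store):
    """Job 1: index type-A records by key in the shared store."""
    for rec in type_a_records:
        store[rec["request_id"]] = rec["site"]


def cross_reference(type_b_records, store):
    """Job 2 (map phase): enrich type-B records with type-A values."""
    for rec in type_b_records:
        site = store.get(rec["request_id"])
        if site is not None:          # drop records with no match
            yield {**rec, "site": site}


if __name__ == "__main__":
    store = {}                        # stand-in for the Membase bucket
    populate_lookup([{"request_id": "r1", "site": "example.com"}], store)
    type_b = [{"request_id": "r1", "clicks": 3},
              {"request_id": "r9", "clicks": 1}]
    print(list(cross_reference(type_b, store)))
```

The point of the external store is that the lookup table no longer has to fit in each mapper's memory.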
  • 50. X-ref
  • 51. EMR Wins: Cost → only pay for use (http://aws.amazon.com/elasticmapreduce/pricing/). Example: EMR ran on 5 c1.xlarge for 3 hrs. EC2 instances for 3 hrs = $0.68 per hr x 5 instances x 3 hrs = $10.20. Per http://aws.amazon.com/elasticmapreduce/faqs/#billing-4, 1 hour of c1.xlarge = 8 hours of normalized compute time, so EMR cost = 5 instances x 3 hrs x 8 normalized hrs x $0.12 = $14.40. Plus S3 storage cost: 1 TB / month = $150. Data bandwidth from S3 to EC2 is FREE! → about 25 bucks
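The per-run arithmetic on this slide can be reproduced as a small function. This sketch follows the slide's formula exactly as written and uses the 2011 rates it quotes; it is not a statement of how AWS bills today, and the function and parameter names are illustrative.

```python
def emr_job_cost(instances, hours, ec2_rate, emr_rate, normalized_factor):
    """Return (ec2_cost, emr_cost, total) in USD, per the slide's formula."""
    ec2 = ec2_rate * instances * hours
    emr = instances * hours * normalized_factor * emr_rate
    return ec2, emr, ec2 + emr


if __name__ == "__main__":
    # 5 c1.xlarge instances for 3 hours, rates as quoted on the slide.
    ec2, emr, total = emr_job_cost(5, 3, ec2_rate=0.68,
                                   emr_rate=0.12, normalized_factor=8)
    print(f"EC2 ${ec2:.2f} + EMR ${emr:.2f} = ${total:.2f}")
```

S3 storage is a separate monthly charge and is not included in the per-run total.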
  • 52. EMR Wins: No Hadoop cluster to maintain, no failed nodes / disks. Bonus: can tailor the cluster for various jobs: smaller jobs → fewer machines; memory-hungry tasks → m1.xlarge; CPU-hungry tasks → c1.xlarge.
  • 53. Design Wins: Bidders now write logs to Scribe directly. No MySQL on web server machines. Writes are much faster! S3 has been reliable and cheap storage.
  • 54. Next : Lessons Learned
  • 55. Where was this pic taken?
  • 57. Lessons learned : Logfile format, CSV → JSON. Started with CSV.
        CSV: "2","26","3","07807606-7637-41c0-9bc0-8d392ac73b42","MTY4Mjk2NDk0eDAuNDk4IDEyODQwMTkyMDB4LTM0MTk3OTg2Ng","2010-09-09 03:59:56:000 EDT","70.68.3.116","908105","http://housemdvideos.com/seasons/video.php?s=01&e=07","908105","160x600","performance","25","ca","housemdvideos.com","1","1.2840192E9","0","221","0.60000","NULL","NULL
        20-40 fields… fragile, position-dependent, hard to code (url = csv[18]… counting position numbers gets old after the 100th time around):
        if (csv.length == 29) url = csv[28] else url = csv[26]
        JSON: { exchange_id: 2, url : "http://housemdvideos.com/seasons/video.php?s=01&e=07"….}
        Self-describing, easy to add new fields, easy to process: url = map.get("url")
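The difference between positional and keyed access can be shown with a short sketch. The log lines below are shortened, made-up stand-ins for the real records, and the column position used in the CSV case is specific to this toy line.

```python
import csv
import io
import json


def url_from_csv(line):
    """Positional access: breaks silently if the column layout changes."""
    fields = next(csv.reader(io.StringIO(line)))
    return fields[2]              # index 2 is specific to this toy line


def url_from_json(line):
    """Keyed access: unaffected by field order or newly added fields."""
    return json.loads(line)["url"]


if __name__ == "__main__":
    csv_line = '"2","26","http://example.com/video?s=01&e=07","160x600"'
    json_line = ('{"exchange_id": 2, '
                 '"url": "http://example.com/video?s=01&e=07", '
                 '"size": "160x600"}')
    print(url_from_csv(csv_line))
    print(url_from_json(json_line))
```

Adding a field to the JSON record leaves `url_from_json` untouched, whereas inserting a CSV column before the URL would require changing the index everywhere it is used.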
  • 58. Lessons Learned : Control the amount of input. We get different types of events: event A (freq: 10,000) >>> event B (100) >> event C (1). Initially we put them all into a single log file: A A A A B A A B C
  • 59. Control Input… So we had to process the entire file, even if we were interested only in ‘event C’: too much wasted processing. So we split the logs: log_A….gz, log_B….gz, log_C…gz. Now we process only a fraction of our logs. Input: s3://my_bucket/logs/log_B*. X-ref using memcache if needed.
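Splitting a mixed event stream into per-type logs can be sketched as below. The `type` field and the file naming scheme are assumptions for illustration; in the deck the splitting happens in the logging pipeline, not in Python.

```python
from collections import defaultdict


def split_by_event_type(events):
    """Group event records by type so each type gets its own log file.

    Each event is a dict with a hypothetical "type" field.
    """
    buckets = defaultdict(list)
    for event in events:
        buckets[event["type"]].append(event)
    return dict(buckets)


def log_name(event_type, suffix):
    """Build a name such as log_B_<suffix>.gz (naming scheme assumed)."""
    return f"log_{event_type}_{suffix}.gz"


if __name__ == "__main__":
    stream = [{"type": t} for t in "AAAABAABC"]
    for event_type, items in split_by_event_type(stream).items():
        print(log_name(event_type, "20110708"), len(items))
```

With one file family per event type, a job that only cares about event B can be pointed at a prefix such as `log_B*` and never reads the far larger volume of event A.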
  • 60. Lessons learned : Incremental Log Processing. Recent data (today / yesterday / this week) is more relevant than older data (6 months +). Adding a ‘time window’ to our stats: only process newer logs → faster.
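A time window can be applied before a job starts by selecting only recent log files as input. The sketch below assumes log names end in `_YYYYMMDD.gz`, which is a hypothetical naming scheme and not something the deck specifies.

```python
from datetime import date, datetime, timedelta


def keys_in_window(keys, today, days):
    """Keep log keys whose embedded date is within the last `days` days.

    Assumes each key ends in _YYYYMMDD.gz (hypothetical naming scheme).
    """
    cutoff = today - timedelta(days=days)
    selected = []
    for key in keys:
        stamp = key.rsplit("_", 1)[1].split(".")[0]
        log_date = datetime.strptime(stamp, "%Y%m%d").date()
        if log_date >= cutoff:
            selected.append(key)
    return selected


if __name__ == "__main__":
    keys = ["log_B_20110101.gz", "log_B_20110705.gz", "log_B_20110708.gz"]
    print(keys_in_window(keys, today=date(2011, 7, 8), days=7))
```

Filtering the input list this way means older logs are never read at all, which is where the speed-up comes from.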
  • 61. EMR trade-offs: Lower performance on MR jobs compared to a cluster. Reduced data throughput (S3 isn’t the same as local disk). Streaming data from S3 for each job. EMR Hadoop is not the latest version. Missing tools: Oozie. Right now, trading performance for convenience and cost.
  • 62. Next steps : faster processing. Streaming S3 data for each MR job is not optimal. Instead: spin up a cluster, copy data from S3 to HDFS, run all MR jobs (making use of data locality), terminate.
  • 63. Next Steps : More Processing. More MR jobs. More frequent data processing. Frequent log rolls. Smaller delta window.
  • 64. Next steps : new software. Python, mrjob (from Yelp). Scribe → Cloudera Flume? Use workflow tools like Oozie. Hive? Ad hoc SQL-like queries.
  • 65. Next Steps : SPOT instances. SPOT instances: name your price (eBay style). Been available on EC2 for a while. Just became available for Elastic MapReduce! New cluster setup: 10 normal instances + 10 spot instances. Spots may go away anytime. That is fine! Hadoop will handle node failures. Bigger cluster: cheaper & faster.
  • 67. Next Steps : nosql. Summary data goes into MySQL: a potential weak link (some tables have ~100 million rows and growing). Evaluating NoSQL solutions; using Membase in limited capacity. Watch out for Amazon’s HBase offering.
  • 68. Take a test drive: Just bring your credit card. http://aws.amazon.com/elasticmapreduce/ Forum: https://forums.aws.amazon.com/forum.jspa?forumID=52
  • 69. Thanks! Questions? Sujee Maniyam, http://sujee.net, hello@sujee.net. Devil’s Slide, Pacifica