Introducing Elastic MapReduce
Usage Rights

© All Rights Reserved

Introducing Elastic MapReduce: Presentation Transcript

  • Introducing Elastic MapReduce. Karan Bhatia, PhD, Big Data Solutions Practice
  • Various tutorials, training, and mentoring in Portuguese. Sign up now! http://awshub.com.br
  • 4 bytes x 1,000,000 households x 1 measurement/month x 10 years = 480 MBytes
  • 4 bytes x 1,000,000 households x 1 measurement/min x 10 years ≈ 21 TBytes
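The arithmetic on these two slides can be checked with a short Python sketch. The 365-day year and decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) are assumptions, not something the slides state. Under those assumptions the per-minute case comes to roughly 21 TB, 43,800 times the monthly case.

```python
# Back-of-the-envelope data volume for 1,000,000 households over 10 years.
# Assumptions (not from the slides): 365-day years and decimal units.
BYTES_PER_MEASUREMENT = 4
HOUSEHOLDS = 1_000_000
YEARS = 10


def total_bytes(measurements_per_year: int) -> int:
    """Total storage for every household over the whole period."""
    return BYTES_PER_MEASUREMENT * HOUSEHOLDS * measurements_per_year * YEARS


monthly = total_bytes(12)                 # one reading per month
per_minute = total_bytes(60 * 24 * 365)   # one reading per minute

print(f"monthly readings:    {monthly / 10**6:,.0f} MB")
print(f"per-minute readings: {per_minute / 10**12:,.1f} TB")
print(f"growth factor:       {per_minute // monthly:,}x")
```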
  • Big Data as Business Transformation
  • [Chart: data volume over time, generated data vs. data available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"]
  • AWS Elastic MapReduce Map reduce HDFS
  • Thousands of customers, 2 million+ clusters in 2012
  • EMR Sample Use Cases
  • Apontador, MapLink, and AWS. Supported by:
  • What do I know about the user? {"BaseLogId":"RmlpbjZkWVhCM0NxckNjYjF3eFU0dGNTYnhJPQ","TrackUserId":"a18e0672-ad07-4f28-b447-fc0cba90ee17","SiteId":"apto-dv01","SessionId":"1369827720327:f52c5b","ExternalId":"1933510381","Hostname":"integra01.apontador.lan","Path":"/local/sp/sao_paulo/bares_e_casas_noturnas/QYN7825H/","Referer":null,"PageTitle":"Locais, Eventos, Endereços, Mapas - Apontador.com","IpAddress":"200.150.177.249","AgentInfo":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36","Position":"{ "lat": -23.5934691, "lon": -46.6882606, "acc": 36}","SearchInfo":null,"RawRequestInfo":”RawRequest”: ","CreateAt":"2013-06-24T14:39:46.7082358Z"} What else? Actions, clicks, searches. HOW do we bring the best to the user?
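As a rough illustration of how an event like this could be turned into per-user features, here is a minimal Python sketch. It works on a trimmed, hand-cleaned copy of the record, and it assumes that `Position` holds a JSON document stored as a string and that `Path` follows a `/local/<state>/<city>/<category>/<place id>/` layout. Neither assumption is confirmed by the slides, and `summarize` is a hypothetical helper name.

```python
import json

# Trimmed, hand-cleaned copy of the event on the slide. Assumption: the
# Position value is a JSON document that was serialized into a string.
raw_event = json.dumps({
    "TrackUserId": "a18e0672-ad07-4f28-b447-fc0cba90ee17",
    "Path": "/local/sp/sao_paulo/bares_e_casas_noturnas/QYN7825H/",
    "Position": '{"lat": -23.5934691, "lon": -46.6882606, "acc": 36}',
    "CreateAt": "2013-06-24T14:39:46.7082358Z",
})


def summarize(line: str) -> dict:
    """Extract a few user-level features from one tracking event."""
    event = json.loads(line)
    # Decode the nested position document, if the event carries one.
    position = json.loads(event["Position"]) if event.get("Position") else {}
    # Assumed path layout: /local/<state>/<city>/<category>/<place id>/
    parts = [p for p in event["Path"].split("/") if p]
    return {
        "user": event["TrackUserId"],
        "city": parts[2] if len(parts) > 2 else None,
        "category": parts[3] if len(parts) > 3 else None,
        "lat": position.get("lat"),
        "lon": position.get("lon"),
    }


profile = summarize(raw_event)
print(profile)
```

A function of this shape could serve as the body of a streaming mapper, emitting the user id as the key so that later stages can aggregate actions per user.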
  • What do we receive to determine traffic? <Route><Category>1</Category><DateTime>0001-01-01T00:00:00</DateTime><Destination xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial"><a:Lat>-8.150483</a:Lat><a:Lng>-35.420284</a:Lng></Destination><Origin xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial"><a:Lat>-8.149973</a:Lat><a:Lng>-35.41825</a:Lng></Origin> HOW do we figure out the traffic?
  • Bayes' theorem: the statistical MODEL
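The slide names Bayes' theorem but the transcript does not show the model itself, so the following is only a generic Python sketch of a single Bayesian update. The traffic framing and every probability in it are made-up numbers for illustration, not values from the talk.

```python
def posterior(prior: float, likelihood: float, likelihood_if_not: float) -> float:
    """P(H | E) by Bayes' theorem for a binary hypothesis H and evidence E."""
    # Total probability of the evidence, summed over H and not-H.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: H = "this road segment is congested",
# E = "a slow route report was received for this segment".
p_congested = posterior(
    prior=0.2,              # assumed P(H) before seeing the report
    likelihood=0.9,         # assumed P(E | H)
    likelihood_if_not=0.1,  # assumed P(E | not H)
)
print(round(p_congested, 3))
```

With these inputs a single slow report raises the estimate from 0.2 to about 0.69; repeated reports could be folded in by feeding each posterior back in as the next prior.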
  • WHAT do we use? • Hive (~40 m3.large Spot instances), 90%: daily utilities • Streaming, 10%: Solr, more complex MapReduce jobs (e.g., MCMC, fast Fourier transform) • Infrastructure used: Hive (~40 m3.large Spot instances), Elastic MapReduce, S3 (approximately 7 TB of structured data across several buckets), RDS (data describing how the S3 data is organized)
  • • Chaordic is the leading e-commerce personalization company in Brazil, with 9 of the country's 15 largest players as customers. • Chaordic's products integrate with the largest Brazilian e-commerce sites and need infrastructure that is reliable, fast, scalable, and low-cost. "With AWS we were able to build a single system to meet the demand of Brazil's largest e-commerce sites at a relatively low cost." "Building our own data center to meet our demand would be economically unviable." - João Bosco, CTO
  • The Challenge • Serve tens of millions of unique users per month; • Process Big Data; • Respond in under 100 ms; • Scale well during access peaks; • All of this at an affordable cost.
  • The Role of AWS and the Benefits Achieved • 4 billion requests per month; • 300 thousand+ requests per minute; • 200 million+ recommendations every day; • Spot instances: 20% lower AWS cost.
  • Map Reduce
  • Map Shuffle Reduce
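The three phases named on these slides can be sketched in plain Python with a word count, the usual introductory example. This runs in a single process and only mirrors the data flow; on a real Hadoop cluster the shuffle is performed by the framework across machines. The function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs) -> dict:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data on EMR", "big clusters for big data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread across as many nodes as are available.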
  • AWS Elastic MapReduce
  • Managed Hadoop analytics
  • [Diagram, built up over several slides: Code and Input data (S3, DynamoDB, Redshift) feed Elastic MapReduce, shown with a Name node and an Elastic cluster on S3/HDFS; Queries + BI via JDBC, Pig, Hive; Output.]
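For a present-day reader, the flow in this diagram (code plus input data in S3, a cluster that runs and then goes away, output back in S3) maps onto a request like the one sketched below. The field names follow the keyword arguments of boto3's EMR `run_job_flow` call. The bucket paths, cluster name, release label, and instance types are placeholders chosen for illustration, and none of this comes from the deck, which predates release labels. The sketch only builds and prints the request, so it runs without AWS credentials.

```python
import json


def build_emr_request(code_uri: str, input_uri: str, output_uri: str, log_uri: str) -> dict:
    """Build keyword arguments shaped like boto3's EMR run_job_flow call."""
    return {
        "Name": "wordcount-sketch",        # placeholder cluster name
        "LogUri": log_uri,
        "ReleaseLabel": "emr-5.36.0",      # assumption: any supported release
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Transient cluster: shut down once the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "streaming-wordcount",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", f"{code_uri}/mapper.py,{code_uri}/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", input_uri,
                    "-output", output_uri,
                ],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# All S3 locations below are hypothetical.
request = build_emr_request(
    code_uri="s3://example-bucket/code",
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
    log_uri="s3://example-bucket/logs/",
)
print(json.dumps(request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```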
  • [Chart: EC2 instance types plotted by memory (GB) against EC2 Compute Units. Families shown: Standard, 2nd Gen Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, High I/O, High-Storage, Cluster High-Mem.]
    t1.micro: 613 MB memory, up to 2 EC2 Compute Units, EBS storage only, 32-bit or 64-bit platform
    m1.small: 1.7 GB memory, 1 EC2 Compute Unit, 160 GB instance storage, 32-bit or 64-bit platform
    m1.medium: 3.75 GB memory, 2 EC2 Compute Units, 410 GB instance storage, 32-bit or 64-bit platform
    m1.large (EBS Optimizable): 7.5 GB memory, 4 EC2 Compute Units, 850 GB instance storage, 64-bit platform
    m1.xlarge (EBS Optimizable): 15 GB memory, 8 EC2 Compute Units, 1,690 GB instance storage, 64-bit platform
    m3.xlarge: 15 GB memory, 13 EC2 Compute Units
    m3.2xlarge (EBS Optimizable): 30 GB memory, 26 EC2 Compute Units
    m2.xlarge: 17.1 GB memory, 6.5 EC2 Compute Units, 420 GB instance storage, 64-bit platform
    m2.2xlarge: 34.2 GB memory, 13 EC2 Compute Units, 850 GB instance storage, 64-bit platform
    m2.4xlarge (EBS Optimizable): 68.4 GB memory, 26 EC2 Compute Units, 1,690 GB instance storage, 64-bit platform
    c1.medium: 1.7 GB memory, 5 EC2 Compute Units, 350 GB instance storage, 32-bit or 64-bit platform
    c1.xlarge: 7 GB memory, 20 EC2 Compute Units, 1,690 GB instance storage, 64-bit platform
    cc1.4xlarge: 23 GB memory, 33.5 EC2 Compute Units, 1,690 GB instance storage, 64-bit platform
    cc2.8xlarge: 60.5 GB memory, 88 EC2 Compute Units, 3,370 GB instance storage, 64-bit platform
    cg1.4xlarge: 22 GB memory, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla "Fermi" M2050 GPUs, 1,690 GB instance storage, 64-bit platform
    hi1.4xlarge: 60.5 GB memory, 35 EC2 Compute Units, 2 x 1,024 GB SSD instance storage, 64-bit platform
    hs1.8xlarge: 117 GB memory, 35 EC2 Compute Units, 24 x 2 TB instance storage, 64-bit platform
    cr1.8xlarge: 244 GB memory, 88 EC2 Compute Units, 2 x 120 GB SSD instance storage, 64-bit platform
  • 1. Elastic clusters
  • 10 hours
  • 5 hours
  • Peak capacity
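One common reading of the "10 hours" and "5 hours" slides is that, with hourly per-node billing and a job that parallelizes well, doubling the cluster halves the wall-clock time for the same spend. That reading, the node counts, and the price below are assumptions for illustration and do not come from the transcript.

```python
def cluster_cost_cents(nodes: int, hours: int, cents_per_node_hour: int) -> int:
    """Cost of a cluster billed per node-hour, in cents."""
    return nodes * hours * cents_per_node_hour


PRICE_CENTS = 10  # hypothetical price per node-hour

small_cluster = cluster_cost_cents(nodes=10, hours=10, cents_per_node_hour=PRICE_CENTS)
large_cluster = cluster_cost_cents(nodes=20, hours=5, cents_per_node_hour=PRICE_CENTS)

print(f"10 nodes x 10 hours: ${small_cluster / 100:.2f}")
print(f"20 nodes x  5 hours: ${large_cluster / 100:.2f}")
```

Both runs consume 100 node-hours, so the faster one costs the same under this simplified model; real jobs rarely scale perfectly linearly.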
  • 2. Rapid, tuned provisioning
  • Tedious.
  • Remove undifferentiated heavy lifting.
  • 3. Hadoop all the way down
  • Robust ecosystem. Databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...
  • 4. Agility for experimentation
  • Instance choice. Stay flexible on instance type & number.
  • 5. Cost optimizations
  • Built for Spot. Name-your-price supercomputing.
  • 1. Elastic clusters 2. Rapid, tuned provisioning 3. Hadoop all the way down 4. Agility for experimentation 5. Cost optimizations
  • Data, data, everywhere... Data is stored in silos.
  • S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises
  • AWS Data Pipeline Announced in November, available now. Orchestration for data-intensive workloads.
  • AWS Data Pipeline: data-intensive orchestration and automation; reliable and scheduled; easy to use, drag and drop; execution and retry logic; map data dependencies; create and manage temporary compute resources.
  • Anatomy of a pipeline
  • Additional checks and notifications
  • Arbitrarily complex pipelines
  • aws.amazon.com/datapipeline
  • aws.amazon.com/big-data
  • Thanks karanb@amazon.com