Introducing Elastic MapReduce

  1. Karan Bhatia, PhD. Introducing Elastic MapReduce. Big Data Solutions Practice.
  2. Various tutorials, training, and mentoring in Portuguese. Sign up now! http://awshub.com.br
  3. 4 bytes x 1,000,000 households x 1 measurement/month x 10 years = 480 MBytes
  4. 4 bytes x 1,000,000 households x 1 measurement/min x 10 years = 220 TBytes
  5. Big Data as Business Transformation
  6. Chart labels: Generated data; Available for analysis; Data volume. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares.
  7. AWS Elastic MapReduce; MapReduce; HDFS
  8. Thousands of customers, 2 million+ clusters in 2012
  9. EMR Sample Use Cases
  10. Apontador and MapLink and AWS. Support:
  11. • What do I know about the user? {"BaseLogId":"RmlpbjZkWVhCM0NxckNjYjF3eFU0dGNTYnhJPQ","TrackUserId":"a18e0672-ad07-4f28-b447-fc0cba90ee17","SiteId":"apto-dv01","SessionId":"1369827720327:f52c5b","ExternalId":"1933510381","Hostname":"integra01.apontador.lan","Path":"/local/sp/sao_paulo/bares_e_casas_noturnas/QYN7825H/","Referer":null,"PageTitle":"Locais, Eventos, Endereços, Mapas - Apontador.com","IpAddress":"200.150.177.249","AgentInfo":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36","Position":"{ "lat": -23.5934691, "lon": -46.6882606, "acc": 36}","SearchInfo":null,"RawRequestInfo":”RawRequest”: ","CreateAt":"2013-06-24T14:39:46.7082358Z"} • What else? Actions, clicks, searches. HOW do we bring the best to the user?
  12. • What do we receive to determine traffic? <Route><Category>1</Category><DateTime>0001-01-01T00:00:00</DateTime><Destination xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial"><a:Lat>-8.150483</a:Lat><a:Lng>-35.420284</a:Lng></Destination><Origin xmlns:a="http://schemas.datacontract.org/2004/07/SwissKnife.Spatial"><a:Lat>-8.149973</a:Lat><a:Lng>-35.41825</a:Lng></Origin> HOW do we work out the traffic?
  13. Bayes' theorem: the statistical MODEL
  14. • Hive (~40 m3.large Spot instances): 90%, daily utilities • Streaming: 10%, Solr and more complex MapReduce jobs (e.g. MCMC, fast Fourier) • Infrastructure used: Hive (~40 m3.large Spot instances), Elastic MapReduce; S3 (approximately 7 TB of structured data across several buckets); RDS (data describing how the S3 data is organized). WHAT do we use?
  15. • Chaordic is the leading company in e-commerce personalization in Brazil, with 9 of the country's 15 largest players as customers. • The products developed by Chaordic integrate with the largest Brazilian e-commerce sites and need infrastructure that is reliable, fast, scalable, and low-cost. "With AWS we were able to build a single system to serve the demand of Brazil's largest e-commerce sites at a relatively low cost." "Building our own data center to meet our demand would be economically unviable." - João Bosco, CTO
  16. The Challenge • Serve tens of millions of unique users per month; • Big Data processing; • Respond in under 100 ms; • Scale well during access peaks; • All of this at an affordable cost.
  17. About the Role of AWS and Benefits Achieved • 4 billion requests per month; • 300,000+ requests per minute; • 200+ million recommendations every day; • Spot Instances: 20% reduction in AWS cost.
  18. MapReduce
  19. Map, Shuffle, Reduce
  20. AWS Elastic MapReduce
  21. Managed Hadoop analytics
  22. Input data; S3, DynamoDB, Redshift
  23. Elastic MapReduce; Code; Input data; S3, DynamoDB, Redshift
  24. Elastic MapReduce; Code; Name node; Input data; S3, DynamoDB, Redshift
  25. Elastic MapReduce; Code; Name node; Input data; Elastic cluster; S3, DynamoDB, Redshift; S3/HDFS
  26. Elastic MapReduce; Code; Name node; Input data; S3/HDFS; Queries + BI via JDBC, Pig, Hive; S3, DynamoDB, Redshift; Elastic cluster
  27. Elastic MapReduce; Code; Name node; Output; Input data; Queries + BI via JDBC, Pig, Hive; S3, DynamoDB, Redshift; Elastic cluster; S3/HDFS
  28. Output; Input data; S3, DynamoDB, Redshift
  29. Instance Types. Chart axes: Memory (GB) and EC2 Compute Units, with power-of-two ticks from 1 to 256 and 1 to 128. Families: Standard, 2nd Gen Standard, Micro, High-Memory, High-CPU, Cluster Compute, Cluster GPU, High I/O, High-Storage, Cluster High-Mem.
      • hi1.4xlarge: 60.5 GB of memory; 35 EC2 Compute Units; 2x1024 GB SSD instance storage; 64-bit platform
      • cc1.4xlarge: 23 GB of memory; 33.5 EC2 Compute Units; 1690 GB of instance storage; 64-bit platform
      • c1.xlarge: 7 GB of memory; 20 EC2 Compute Units; 1690 GB of instance storage; 64-bit platform
      • m1.small: 1.7 GB memory; 1 EC2 Compute Unit; 160 GB instance storage; 32-bit or 64-bit
      • m1.medium: 3.75 GB memory; 2 EC2 Compute Units; 410 GB instance storage; 32-bit or 64-bit platform
      • m1.large (EBS Optimizable): 7.5 GB memory; 4 EC2 Compute Units; 850 GB instance storage; 64-bit platform
      • m1.xlarge (EBS Optimizable): 15 GB memory; 8 EC2 Compute Units; 1,690 GB instance storage; 64-bit platform
      • m2.xlarge: 17.1 GB of memory; 6.5 EC2 Compute Units; 420 GB of instance storage; 64-bit platform
      • m2.2xlarge: 34.2 GB of memory; 13 EC2 Compute Units; 850 GB of instance storage; 64-bit platform
      • m2.4xlarge (EBS Optimizable): 68.4 GB of memory; 26 EC2 Compute Units; 1690 GB of instance storage; 64-bit platform
      • t1.micro: 613 MB memory; up to 2 EC2 Compute Units; EBS storage only; 32-bit or 64-bit platform
      • c1.medium: 1.7 GB of memory; 5 EC2 Compute Units; 350 GB of instance storage; 32-bit or 64-bit platform
      • cg1.4xlarge: 22 GB of memory; 33.5 EC2 Compute Units; 2 x NVIDIA Tesla "Fermi" M2050 GPUs; 1690 GB of instance storage; 64-bit platform
      • cc2.8xlarge: 60.5 GB of memory; 88 EC2 Compute Units; 3370 GB of instance storage; 64-bit platform
      • m3.xlarge: 15 GB of memory; 13 EC2 Compute Units
      • m3.2xlarge (EBS Optimizable): 30 GB of memory; 26 EC2 Compute Units
      • hs1.8xlarge: 117 GB of memory; 35 EC2 Compute Units; 24x2 TB instance storage; 64-bit platform
      • cr1.8xlarge: 244 GB of memory; 88 EC2 Compute Units; 2x120 GB SSD instance storage; 64-bit platform
  30. 1. Elastic clusters
  31. 10 hours
  32. 5 hours
  33. Peak capacity
  34. 2. Rapid, tuned provisioning
  35. Tedious.
  36. Remove undifferentiated heavy lifting.
  37. 3. Hadoop all the way down
  38. Robust ecosystem. Databases, machine learning, segmentation, clustering, analytics, metadata stores, exchange formats, and so on...
  39. 4. Agility for experimentation
  40. Instance choice. Stay flexible on instance type & number.
  41. 5. Cost optimizations
  42. Built for Spot. Name-your-price supercomputing.
  43. 1. Elastic clusters 2. Rapid, tuned provisioning 3. Hadoop all the way down 4. Agility for experimentation 5. Cost optimizations
  44. Data, data, everywhere... Data is stored in silos.
  45. S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises
  46. S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises
  47. S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises
  48. S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises
  49. S3, DynamoDB, EMR, HBase on EMR, RDS, Redshift, On-premises
  50. AWS Data Pipeline. Announced in November, available now. Orchestration for data-intensive workloads.
  51. AWS Data Pipeline. Data-intensive orchestration and automation. Reliable and scheduled. Easy to use, drag and drop. Execution and retry logic. Map data dependencies. Create and manage temporary compute resources.
  52. Anatomy of a pipeline
  53. Additional checks and notifications
  54. Arbitrarily complex pipelines
  55. aws.amazon.com/datapipeline
  56. aws.amazon.com/big-data
  57. Thanks. karanb@amazon.com
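
The storage estimates on slides 3 and 4 can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units, 12 measurements per year for the monthly case, and 365-day years for the per-minute case; the constant and function names are illustrative and do not come from the deck. Taken literally, the per-minute factors work out to about 21 TB rather than the 220 TBytes shown on slide 4, so that slide may rest on assumptions the transcript does not show.

```python
# Illustrative constants taken from slides 3 and 4.
BYTES_PER_MEASUREMENT = 4
HOUSEHOLDS = 1_000_000
YEARS = 10


def total_bytes(measurements_per_year: int) -> int:
    """Raw storage for one reading per household at the given yearly rate."""
    return BYTES_PER_MEASUREMENT * HOUSEHOLDS * measurements_per_year * YEARS


# One measurement per month (assumption: 12 months per year).
monthly = total_bytes(12)

# One measurement per minute (assumption: 365-day years, no leap days).
per_minute = total_bytes(365 * 24 * 60)

print(f"monthly:    {monthly / 1e6:.0f} MB")
print(f"per minute: {per_minute / 1e12:.1f} TB")
```

Either way the point of the two slides holds: moving from monthly to per-minute readings multiplies the raw volume by more than four orders of magnitude.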
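
The Map, Shuffle, Reduce flow named on slide 19 can be illustrated with a small in-memory word count. This is only a single-process sketch of the programming model, not how Hadoop or Elastic MapReduce is implemented; the function names and the sample input are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters", "data data"]))
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data between them over the network; here all three phases are plain function calls so the data flow is easy to follow.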