AWS Summit 2013 Tel Aviv
Oct 16 – Tel Aviv, Israel

Data Analytics on Big Data
Jan Borch | AWS Solutions Architect
GENERATE  STORE  ANALYZE  SHARE
THE COST OF DATA
GENERATION IS FALLING
Progress is not evenly distributed

1980
14,000,000$/TB  450,000 ÷ 
 30,000 X 
100MB
 50 X 
4MB/s

Today
30$/TB
3TB
...
THE MORE DATA YOU COLLECT
THE MORE VALUE YOU CAN
DERIVE FROM IT
Lower cost,
higher throughput

GENERATE  STORE  ANALYZE  SHARE
Lower cost,
higher throughput



GENERATE  STORE  ANALYZE  SHARE
Highly
constrained
DATA VOLUME

Generated data

Available for analysis

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data ...
GENERATE

STORE  ANALYZE  SHARE
ACCELERATE

GENERATE 

STORE  ANALYZE  SHARE
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND

= REMOVE

CO...
AWS EC2
AWS CloudFront

GENERATE  STORE  ANALYZE  SHARE
•
•
•
•
•

Fluentd
Flume
Scribe
Chukwa
LogStash

{output{ s3 {
bucket => myBucket,
aws_credential_file => ~/cred.json
size...
“Poor man’s Analytics”
Embed poor-man pixel
http://www.poor-mananalytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban
.com&ut...
GENERATE  STORE  ANALYZE  SHARE
AWS Import / Export
AWS Direct Connect
AWS Elastic Map Reduce

GENERATE  STORE  ANALYZE  SHARE
Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regiona...
Aggregation with S3Distcp
S3distcp on EMR job sample
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar 
/home/hadoop/lib/emr-s3distcp-1.0.jar 
--a...
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2

GENERATE...
AMAZON S3
SIMPLE STORAGE SERVICE
AMAZON
DYNAMODB
HIGH-PERFORMANCE, FULLY MANAGED
NoSQL DATABASE SERVICE
DURABLE &
AVAILABLE
CONSISTENT, DISK-ONLY
WRITES (SSD)
LOW LATENCY
AVERAGE READS < 5MS,
WRITES < 10MS
NO ADMINISTRATION
Very general table structure
not many
rows

Ads

frequent
update
(near realtime)

advertiser

max-price

imps to
deliver

...
500,000 WRITES PER SECOND
DURING SUPER BOWL
AMAZON
GLACIER
reliable long term archiving
S3 Lifecycle policies
AMAZON S3

If object older than
5 month

Archive to
Amazon Glacier
S3 Lifecycle policies
AMAZON S3

If object older than
5 month

Delete object
from S3
If object older than
1 year

/dev/nul...
AMAZON
REDSHIFT
FULLY MANAGED, PETA-BYTE SCALE
DATAWAREHOUSE ON AWS
DESIGN OBJECTIVES:
A petabyte-scale data warehouse service that was…

A Lot Faster

AMAZON
REDSHIFT

A Lot Cheaper
A Whole...
AMAZON REDSHIFT
RUNS ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rat...
30 MINUTES
DOWN TO

12 SECONDS
AMAZON REDSHIFT LETS YOU
START SMALL AND GROW BIG
Extra Large Node
(HS1.XL)

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB ...
JDBC/ODBC
Price Per Hour for
HS1.XL Single
Node

Effective Hourly
Price Per TB

Effective Annual
Price per TB

On-Demand

$ 0.850

$...
DATA WAREHOUSING DONE THE AWS WAY
Easy to provision and scale up massively

No upfront costs, pay as you go
Really fast pe...
USAGE SCENARIOS
Reporting Warehouse

OLTP
ERP

RDBMS

Redshift

Reporting
and BI

Accelerated operational reporting
Support for short-time...
On-Premises Integration

OLTP
ERP

RDBMS

Data
Integration
Partners*

Redshift

Reporting
and BI
Live Archive for (Structured) Big Data

OLTP
Web Apps

DynamoDB

Redshift

Reporting
and BI

Direct integration with copy ...
Cloud ETL for Big Data

S3

Elastic MapReduce

Redshift

Reporting
and BI

Maintain online SQL access to historical logs
T...
COPY into Amazon Redshift
create table cf_logs
(
d date,
t char(8),
edge char(4),
bytes int,
cip varchar(15),
verb char(3)...
COPY into Amazon Redshift

copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
credentials
'aws_access_key_id=<key_id>;aws...
GENERATE  STORE  ANALYZE  SHARE
Amazon EC2
Amazon Elastic
MapReduce
AMAZON EC2
ELASTIC COMPUTE CLOUD
EC2 instance families – General purpose

m1.small

Virtual core: 1
Memory: 1.7 GiB
I/O performance: Moderate
EC2 instance families – Compute optimized
Virtual core: 32 - 2 x Intel Xeon
Memory: 60,5 GiB
I/O performance: 10 Gbit

m1....
EC2 instance families – Memory optimized
Virtual core: 32 - 2 x Intel Xeon
Memory: 240 GiB
I/O performance: 10 Gbit
SSD In...
EC2 instance families – Storage optimized

m1.small

cc2.8xlarge

cr1.8xlarge

hi.4xlarge

Virtual core: 16
Memory: 60.5 G...
ON A SINGLE INSTANCE

COMPUTE TIME: 4h
COST: 4h x $2.1 = $8.4
ON MULTIPLE INSTANCES

COMPUTE TIME: 1h
COST: 1h x 4 x $2.1 = $8.4
3 HOURS
FOR $4828.85/hr
Instead of
$20+ MILLIONS
in infrastructure
•
•
•
•

A FRAMEWORK
SPLITS DATA INTO PIECES
LETS PROCESSING OCCUR
GATHERS THE RESULTS
AMAZON ELASTIC
MAPREDUCE
HADOOP AS A SERVICE
Corporate Data
Center

Elastic Data
Center
Corporate Data
Center

Application data
and logs for
analysis pushed
to S3

Elastic Data
Center
Amazon Elastic
Map Reduce
master node to
control analysis
M

Corporate Data
Center

Elastic Data
Center
M

Corporate Data
Center

Hadoop cluster
started by Elastic
Map Reduce

Elastic Data
Center
M

Corporate Data
Center

Adding many
hundreds or
thousands of
nodes
Elastic Data
Center
Disposed of when
job completes

M

Corporate Data
Center

Elastic Data
Center
Corporate Data
Center

Results of
analysis pulled
back into your
systems

Elastic Data
Center
Your Spreadsheet does not
scale …
PIG
A real Pig script
(used at Twitter)
Run on
a sample
dataset on
your Laptop
$ pig –f myPigFile.q
M
Run the same
script on a
50 node
Hadoop cluster
Elastic Data
Center
$ ./elastic-mapreduce --create
--name "$USER's Pig JobFlow"
--pig-script
--args s3://myawsbucket/mypigquery.q
--instance-t...
$ elastic-mapreduce -j j-21IMWIA28LRK1
--add-instance-group task
--instance-count 10
--instance-type m1.xlarge
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2

GENERATE  STORE  ANALYZE  SHARE
PUBLIC DATA SETS
http://aws.amazon.com/publicdatasets
GENERATE  STORE  ANALYZE  SHARE

AWS Data Pipeline
AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution ...
AWS Import / Export
AWS Direct Connect

Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Stora...
FROM DATA TO
ACTIONABLE
INFORMATION
Shlomi Vaknin
Amazon AWS generates big data core component for Ginger
Software

Shlomi Vaknin
Oct 16, 2013
English writing
assistant

An open platform for
personal assistants

118
119
Natural language speech
interface for mobile apps

•

•

An end-to-end Speech-to-Action solution

•
120

Users talk natura...
Web Corpus

Domain
Corpus

Language model

User
Corpus

Semantic Model

NLP/NLU Algorithms

Writing Assistant

Proofrea
de...
Our platform depends on scanning and indexing
all the language we can find on the internet
• A collection of all the langu...
1. Crawling [own cluster, EMR+S3]
• Generated about 50 TB of raw data
• Reduced to about 5 TB of text data
2. Post process...
• Mainly an NLP task
• So we picked up
• It’s a Lisp!
• Integrates very well with EMR, S3, etc..
• n-Gram Counting
• How a...
• EMR cluster node types
• Master, Task, Core

• Ratio between Core and Task nodes
• We expected a very large output (100T...
• Job took about 30 hours to complete
• We generated nearly 100TB of output data
• During map phase, the cluster achieved ...
• Stay up to date with AMI releases
• Don't stick to an old AMI just because it previously worked
• Use the Job-Tracker
• ...
• Stash the data for later use, to reduce cost
• Glacier offers very cheap storage
• Important things to know about Glacie...
• EMR/S3 provides great power and elasticity, to grow and
shrink as required
• Do your homework before running large jobs!...
• Our platforms depends on scanning and indexing all the
language we can find on the internet
• To achieve this Ginger Sof...
Thank You!
We are hiring!
shlomiv@gingersoftware.com
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
1. AWS Summit 2013 Tel Aviv, Oct 16 – Tel Aviv, Israel. Data Analytics on Big Data. Jan Borch | AWS Solutions Architect
2. GENERATE  STORE  ANALYZE  SHARE
3. THE COST OF DATA GENERATION IS FALLING
4. Progress is not evenly distributed. 1980: $14,000,000/TB storage, 100 MB drives, 4 MB/s throughput. Today: $30/TB (÷450,000), 3 TB drives (×30,000), 200 MB/s (×50).
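The improvement factors on slide 4 follow directly from the raw 1980-vs-today numbers; a quick sanity check (figures taken from the slide):

```python
# Disk storage economics, 1980 vs. ~2013 (figures from the slide)
price_1980, price_today = 14_000_000, 30   # $/TB
cap_1980, cap_today = 100, 3_000_000       # drive capacity in MB (3 TB = 3,000,000 MB)
tput_1980, tput_today = 4, 200             # throughput in MB/s

print(round(price_1980 / price_today))     # ~467,000x cheaper (the slide rounds to 450,000)
print(cap_today // cap_1980)               # 30,000x more capacity
print(tput_today // tput_1980)             # 50x faster
```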
5. THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT
6. Lower cost, higher throughput: GENERATE  STORE  ANALYZE  SHARE
7. Lower cost, higher throughput: GENERATE  STORE  ANALYZE  SHARE – but highly constrained
8. DATA VOLUME: generated data vs. data available for analysis. Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
9. GENERATE  STORE  ANALYZE  SHARE
10. ACCELERATE: GENERATE  STORE  ANALYZE  SHARE
11. + ELASTIC AND HIGHLY SCALABLE + NO UPFRONT CAPITAL EXPENSE + ONLY PAY FOR WHAT YOU USE + AVAILABLE ON-DEMAND = REMOVE CONSTRAINTS
12. AWS EC2, AWS CloudFront. GENERATE  STORE  ANALYZE  SHARE
13. Log collection tools: Fluentd, Flume, Scribe, Chukwa, LogStash. Sample LogStash S3 output:
    output { s3 {
      bucket => myBucket,
      aws_credential_file => ~/cred.json,
      size_file => 120MB
    }}
14. “Poor man’s Analytics”
15. Embed poor-man pixel: http://www.poor-mananalytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=enus&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-70197651&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analyticsarchitecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
16. GENERATE  STORE  ANALYZE  SHARE
17. AWS Import / Export, AWS Direct Connect, AWS Elastic MapReduce. GENERATE  STORE  ANALYZE  SHARE
18. Getting data into AWS: generated and stored in AWS; inbound data transfer is free; multipart upload to S3; physical media; AWS Direct Connect; regional replication of AMIs and snapshots
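Multipart upload (slide 18) splits a large object into independently uploadable parts; S3 requires every part except the last to be at least 5 MiB. A hypothetical, AWS-free sketch of just the part-splitting step (the payload and part size here are illustrative):

```python
# Split a payload into fixed-size parts, as an S3 multipart upload would.
# Each part can then be uploaded independently (and in parallel); S3
# reassembles them when the multipart upload is completed.
PART_SIZE = 5 * 1024 * 1024  # 5 MiB, S3's minimum size for non-final parts

def split_parts(data, part_size=PART_SIZE):
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

payload = b"x" * (12 * 1024 * 1024)   # a 12 MiB object
parts = split_parts(payload)
print([len(p) for p in parts])        # two 5 MiB parts plus one 2 MiB final part
assert b"".join(parts) == payload     # reassembly is lossless
```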
19. Aggregation with S3DistCp
20. S3DistCp on EMR, job sample:
    ./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
      --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --args '--src,s3://myawsbucket/cf,
              --dest,s3://myoutputbucket/aggregate,
              --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
              --targetSize,128,
              --outputCodec,lzo,
              --deleteOnSuccess'
21. Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, data on Amazon EC2. GENERATE  STORE  ANALYZE  SHARE
22. AMAZON S3: SIMPLE STORAGE SERVICE
23. AMAZON DYNAMODB: HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
24. DURABLE & AVAILABLE. CONSISTENT, DISK-ONLY WRITES (SSD)
25. LOW LATENCY: AVERAGE READS < 5 MS, WRITES < 10 MS
26. NO ADMINISTRATION
27. Very general table structure.
    Ads (not many rows, frequent updates, near-realtime):
      ad-id   advertiser   max-price   imps to deliver   imps delivered
      1       AAA          100         50000             1200
      2       BBB          150         30000             2500
    Profiles (so many rows, updated in a batch manner):
      user-id   attribute1   attribute2   attribute3   attribute4
      A         XXX          XXX          XXX          XXX
      B         YYY          YYY          YYY          YYY
28. 500,000 WRITES PER SECOND DURING SUPER BOWL
29. AMAZON GLACIER: reliable long-term archiving
30. S3 Lifecycle policies: if an object is older than 5 months, archive it to Amazon Glacier
31. S3 Lifecycle policies: if an object is older than 5 months, archive it; if older than 1 year, delete the object from S3 (/dev/null)
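The two rules above can be expressed as an S3 lifecycle configuration. A sketch of the XML body (the prefix is illustrative, and “5 months” is approximated as 150 days):

```xml
<LifecycleConfiguration>
  <Rule>
    <ID>archive-then-delete</ID>
    <Prefix>logs/</Prefix>
    <Status>Enabled</Status>
    <Transition>
      <Days>150</Days> <!-- ~5 months: move to Glacier -->
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>365</Days> <!-- 1 year: delete the object from S3 -->
    </Expiration>
  </Rule>
</LifecycleConfiguration>
```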
32. AMAZON REDSHIFT: FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
33. DESIGN OBJECTIVES: a petabyte-scale data warehouse service that was… a lot faster, a lot cheaper, a whole lot simpler
34. AMAZON REDSHIFT RUNS ON OPTIMIZED HARDWARE. HS1.8XL: 128 GB RAM, 16 cores, 16 TB compressed user storage, 2 GB/sec scan rate. HS1.XL: 16 GB RAM, 2 cores, 2 TB compressed customer storage
35. 30 MINUTES DOWN TO 12 SECONDS
36. AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG. Extra Large Node (HS1.XL): single node (2 TB) or cluster of 2–32 nodes (4 TB – 64 TB). Eight Extra Large Node (HS1.8XL): cluster of 2–100 nodes (32 TB – 1.6 PB)
37. JDBC/ODBC
38. HS1.XL single-node pricing:

                          Price per hour   Effective $/TB/hour   Effective $/TB/year
    On-Demand             $0.850           $0.425                $3,723
    1-Year Reservation    $0.500           $0.250                $2,190
    3-Year Reservation    $0.228           $0.114                $999
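The “effective per-TB” columns on slide 38 are derived values: an HS1.XL node holds 2 TB, so the per-TB hourly rate is half the node rate, and the annual figure is that rate over 8,760 hours. A quick check:

```python
# Effective Redshift HS1.XL prices per TB, derived from the hourly node price.
HOURS_PER_YEAR = 24 * 365  # 8760

for name, hourly in [("On-Demand", 0.850), ("1yr Reserved", 0.500), ("3yr Reserved", 0.228)]:
    per_tb_hour = hourly / 2                      # node stores 2 TB
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    print(f"{name}: ${per_tb_hour:.3f}/TB/hr, ${per_tb_year:,.0f}/TB/yr")
```

This reproduces the slide's $3,723, $2,190, and $999 annual per-TB figures.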
39. DATA WAREHOUSING DONE THE AWS WAY: easy to provision and scale up massively; no upfront costs, pay as you go; really fast performance at a really low price; open and flexible, with support for popular tools
40. USAGE SCENARIOS
41. Reporting Warehouse: OLTP/ERP → RDBMS → Redshift → Reporting and BI. Accelerated operational reporting; support for short-time use cases; data compression, index redundancy
42. On-Premises Integration: OLTP/ERP → RDBMS → Data Integration Partners* → Redshift → Reporting and BI
43. Live Archive for (Structured) Big Data: OLTP/Web Apps → DynamoDB → Redshift → Reporting and BI. Direct integration with the copy command; high-velocity data; data ages into Redshift; low-cost, high-scale option for new apps
44. Cloud ETL for Big Data: S3 → Elastic MapReduce → Redshift → Reporting and BI. Maintain online SQL access to historical logs; transformation and enrichment with EMR; longer history ensures better insight
45. COPY into Amazon Redshift:
    create table cf_logs (
      d date,
      t char(8),
      edge char(4),
      bytes int,
      cip varchar(15),
      verb char(3),
      distro varchar(MAX),
      object varchar(MAX),
      status int,
      Referer varchar(MAX),
      agent varchar(MAX),
      qs varchar(MAX)
    )
46. COPY into Amazon Redshift:
    copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
    credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
    IGNOREHEADER 2 GZIP DELIMITER '\t' DATEFORMAT 'YYYY-MM-DD'
47. GENERATE  STORE  ANALYZE  SHARE. Amazon EC2, Amazon Elastic MapReduce
48. AMAZON EC2: ELASTIC COMPUTE CLOUD
49. EC2 instance families – General purpose. m1.small: 1 virtual core, 1.7 GiB memory, moderate I/O performance
50. EC2 instance families – Compute optimized. cc2.8xlarge: 32 virtual cores (2 x Intel Xeon), 60.5 GiB memory, 10 Gbit I/O performance
51. EC2 instance families – Memory optimized. cr1.8xlarge: 32 virtual cores (2 x Intel Xeon), 240 GiB memory, 10 Gbit I/O performance, 240 GB SSD instance store
52. EC2 instance families – Storage optimized. hi1.4xlarge: 16 virtual cores, 60.5 GiB memory, 10 Gbit I/O performance, 2 x 1 TB SSD instance store. hs1.8xlarge: 16 virtual cores, 117 GiB memory, 10 Gbit I/O performance, 24 x 2 TB instance store
53. ON A SINGLE INSTANCE: compute time 4h, cost 4h x $2.1 = $8.4
54. ON MULTIPLE INSTANCES: compute time 1h, cost 1h x 4 x $2.1 = $8.4
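Slides 53–54 make the core elasticity point: with per-hour pricing, four instances for one hour cost exactly the same as one instance for four hours, but the job finishes four times sooner. A quick check (the $2.1 rate is from the slides):

```python
# Cost parity of scaling out: same total instance-hours, same bill.
hourly_rate = 2.1                      # $/instance-hour, from the slides

single = 4 * 1 * hourly_rate           # 1 instance x 4 hours
parallel = 1 * 4 * hourly_rate         # 4 instances x 1 hour

print(single, parallel)                # identical cost; the parallel run is 4x faster
```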
55. 3 HOURS FOR $4,828.85/hr
56. Instead of $20+ MILLION in infrastructure
57. A FRAMEWORK that SPLITS DATA INTO PIECES, LETS PROCESSING OCCUR, and GATHERS THE RESULTS
58. AMAZON ELASTIC MAPREDUCE: HADOOP AS A SERVICE
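The split/process/gather pattern on slide 57 is the essence of MapReduce. A toy word count sketch (the chunks and helper names are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

# Toy MapReduce word count: split input into pieces, map each piece
# independently, then gather (reduce) the partial results.

def map_phase(chunk):
    # emit one (word, 1) pair per occurrence
    return [(w, 1) for w in chunk.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["generate store analyze", "store analyze share", "analyze"]
pairs = [p for c in chunks for p in map_phase(c)]   # map over every piece
print(reduce_phase(pairs))
```

On a real cluster each `map_phase` call runs on a different node; the gather step is what Hadoop's shuffle and reduce stages perform.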
59. Corporate Data Center | Elastic Data Center
60. Application data and logs for analysis pushed from the Corporate Data Center to S3
61. Amazon Elastic MapReduce master node started in the Elastic Data Center to control the analysis
62. Hadoop cluster started by Elastic MapReduce
63. Adding many hundreds or thousands of nodes
64. Disposed of when the job completes
65. Results of the analysis pulled back into your systems
66. Your spreadsheet does not scale…
67. PIG
68. A real Pig script (used at Twitter)
69. Run on a sample dataset on your laptop
70. $ pig -f myPigFile.q
71. Run the same script on a 50-node Hadoop cluster
72. $ ./elastic-mapreduce --create --name "$USER's Pig JobFlow" --pig-script --args s3://myawsbucket/mypigquery.q --instance-type m1.xlarge --instance-count 50
73. $ elastic-mapreduce -j j-21IMWIA28LRK1 --add-instance-group task --instance-count 10 --instance-type m1.xlarge
74. Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, data on Amazon EC2. GENERATE  STORE  ANALYZE  SHARE
75. PUBLIC DATA SETS: http://aws.amazon.com/publicdatasets
76. GENERATE  STORE  ANALYZE  SHARE. AWS Data Pipeline
77. AWS Data Pipeline: data-intensive orchestration and automation; reliable and scheduled; easy to use, drag and drop; execution and retry logic; map data dependencies; create and manage compute resources
78. Recap – GENERATE  STORE  ANALYZE  SHARE: AWS Import / Export, AWS Direct Connect; Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, data on Amazon EC2; Amazon EC2, Amazon Elastic MapReduce; Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, data on Amazon EC2; AWS Data Pipeline
79. FROM DATA TO ACTIONABLE INFORMATION
80. Shlomi Vaknin
81. Amazon AWS generates big data core component for Ginger Software. Shlomi Vaknin, Oct 16, 2013
82. English writing assistant. An open platform for personal assistants
83. (image-only slide)
84. Natural language speech interface for mobile apps: users talk naturally with any mobile application, Ginger understands and executes their command; first open platform for creating personal assistants; an end-to-end Speech-to-Action solution
85. Platform components: Web Corpus, Domain Corpus, User Corpus; Language model; Semantic Model; NLP/NLU Algorithms; Writing Assistant; Proofreader; Rephrase DB; Personal Coach; PA Platform; Speech Engine; Query Understanding
86. Our platform depends on scanning and indexing all the language we can find on the internet: a collection of all the language we found on the internet, accessible and pre-processed; has to contain lots and lots of sentences; needs to represent “common written language”; accessible both for offline (research) and online (service) uses
87. Pipeline: 1. Crawling [own cluster, EMR+S3] – generated about 50 TB of raw data, reduced to about 5 TB of text data. 2. Post processing – tokenize, normalize, split to n-grams [EMR+S3], generalize, count, filter. 3. Indexing/Serving [EMR+S3] – key/value (has to be super fast), full-text search. 4. Archiving [S3+Glacier] – keeping data available for later research while minimizing cost
88. Mainly an NLP task, so we picked up a language that is a Lisp and integrates very well with EMR, S3, etc. n-Gram counting: “How are you” yields How are you, How are, are you, How, are, you; lots of grams are repeated; generalize contextually similar tokens. Fits the map-reduce paradigm very well: most parts can be trivially parallelized; one part is sequential by grams
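The n-gram counting step on slide 88 can be sketched as follows (in Python rather than the Lisp dialect the team used; the function name is illustrative):

```python
from collections import Counter

# Enumerate all 1-, 2-, and 3-grams of a sentence, as on the slide:
# "How are you" -> How are you, How are, are you, How, are, you
def ngrams(tokens, max_n=3):
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

counts = Counter(ngrams("How are you".split()))
print(sorted(counts))
```

On the real corpus the enumeration runs in the map phase and the `Counter` aggregation becomes the reduce phase, which is why the task parallelizes so well.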
89. EMR cluster sizing: node types (Master, Task, Core); ratio between Core and Task nodes – we expected a very large output (100 TB), and each m2.4xlarge core node provides 1690 GB of storage, which determines the number of core nodes; estimate the total number of map tasks. Final specs:
    Node type   Instance      Count
    MASTER      cc2.8xlarge   1
    CORE        m2.4xlarge    200
    TASK        m2.2xlarge    500
90. The job took about 30 hours to complete; we generated nearly 100 TB of output data; during the map phase the cluster achieved nearly 100% utilization; after initial filtration, 20 TB remained
91. Lessons: stay up to date with AMI releases (don't stick to an old AMI just because it previously worked); use the Job-Tracker; use custom progress notification; increase mapred.task.timeout; limit the number of concurrent map tasks (use the minimum that gets you close to 100% CPU); beware of spot nodes (if you ask for too many you might compete against your own price)
92. Stash the data for later use, to reduce cost: Glacier offers very cheap storage. Important things to know about Glacier: restoring the data could be VERY expensive; the key to reducing restore costs is to restore SLOWLY; there is no built-in mechanism to restore slowly – use a 3rd-party application or do it manually; Glacier is very useful if your use case matches its design
93. EMR/S3 provides great power and elasticity, to grow and shrink as required. Do your homework before running large jobs!
94. Our platform depends on scanning and indexing all the language we can find on the internet. To achieve this, Ginger Software makes heavy use of Amazon EMR. With Amazon EMR, Ginger Software can scale up vast amounts of computing power and scale back down when it is not needed. This gives Ginger Software the ability to create the world's most accurate language enhancement technology without the need to have expensive hardware lying idle during quiet periods
95. Thank You! We are hiring! shlomiv@gingersoftware.com