    AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data - Presentation Transcript

    • AWS Summit 2013, Oct 16, Tel Aviv, Israel. Data Analytics on Big Data. Jan Borch | AWS Solutions Architect
    • GENERATE  STORE  ANALYZE  SHARE
    • THE COST OF DATA GENERATION IS FALLING
    • Progress is not evenly distributed:
        Cost:        1980: $14,000,000/TB    Today: $30/TB       (÷ 450,000)
        Capacity:    1980: 100 MB            Today: 3 TB         (x 30,000)
        Throughput:  1980: 4 MB/s            Today: 200 MB/s     (x 50)
    • THE MORE DATA YOU COLLECT THE MORE VALUE YOU CAN DERIVE FROM IT
    • Lower cost, higher throughput GENERATE  STORE  ANALYZE  SHARE
    • Lower cost, higher throughput  GENERATE  STORE  ANALYZE  SHARE Highly constrained
    • DATA VOLUME: generated data vs. data available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
    • GENERATE STORE  ANALYZE  SHARE
    • ACCELERATE GENERATE  STORE  ANALYZE  SHARE
    • + ELASTIC AND HIGHLY SCALABLE + NO UPFRONT CAPITAL EXPENSE + ONLY PAY FOR WHAT YOU USE + AVAILABLE ON-DEMAND = REMOVE CONSTRAINTS
    • AWS EC2 AWS CloudFront GENERATE  STORE  ANALYZE  SHARE
    • Log collection frameworks: Fluentd, Flume, Scribe, Chukwa, LogStash. Example LogStash S3 output:
        output {
          s3 {
            bucket => "myBucket"
            aws_credential_file => "~/cred.json"
            size_file => 120MB
          }
        }
    • “Poor man’s Analytics”
    • Embed poor-man pixel: http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-70197651&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
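      A minimal sketch of how such a tracking pixel can be served and logged for later analysis; the Flask endpoint and log path here are illustrative assumptions, not part of the deck.

        import logging
        from flask import Flask, request, Response

        app = Flask(__name__)
        logging.basicConfig(filename="/var/log/pixel/access.log",   # assumed log location
                            format="%(asctime)s %(message)s", level=logging.INFO)

        @app.route("/__track.gif")
        def track():
            # Append the raw query string (utmn, utmhn, utmp, ...) for later batch analysis.
            logging.info(request.query_string.decode())
            # A real tracker would return a 1x1 transparent GIF body here.
            return Response(status=204)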
    • GENERATE  STORE  ANALYZE  SHARE
    • AWS Import / Export AWS Direct Connect AWS Elastic Map Reduce GENERATE  STORE  ANALYZE  SHARE
    • Generated and stored in AWS; inbound data transfer is free; multipart upload to S3; physical media; AWS Direct Connect; regional replication of AMIs and snapshots
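      A minimal sketch of the multipart-upload path to S3 using boto3; the bucket, key, and part sizes are assumptions for illustration.

        import boto3
        from boto3.s3.transfer import TransferConfig

        s3 = boto3.client("s3")

        # Files larger than multipart_threshold are split into parts and uploaded in parallel.
        config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                                multipart_chunksize=64 * 1024 * 1024,
                                max_concurrency=8)

        s3.upload_file("/var/log/app/events-2013-10-16.gz",   # local log archive (assumed path)
                       "my-ingest-bucket",                    # assumed bucket
                       "raw/2013/10/16/events.gz",            # assumed key
                       Config=config)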
    • Aggregation with S3Distcp
    • S3DistCp on EMR job sample:
        ./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
          --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
          --args '--src,s3://myawsbucket/cf,
                  --dest,s3://myoutputbucket/aggregate,
                  --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
                  --targetSize,128,
                  --outputCodec,lzo,
                  --deleteOnSuccess'
    • Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE
    • AMAZON S3 SIMPLE STORAGE SERVICE
    • AMAZON DYNAMODB HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
    • DURABLE & AVAILABLE CONSISTENT, DISK-ONLY WRITES (SSD)
    • LOW LATENCY AVERAGE READS < 5MS, WRITES < 10MS
    • NO ADMINISTRATION
    • Very general table structure:
        Ads (not many rows, frequent update, near realtime):
          ad-id | advertiser | max-price | imps to deliver | imps delivered
          1     | AAA        | 100       | 50000           | 1200
          2     | BBB        | 150       | 30000           | 2500
        Profiles (so many rows, updated in a batch manner):
          user-id | attribute1 | attribute2 | attribute3 | attribute4
          A       | XXX        | XXX        | XXX        | XXX
          B       | YYY        | YYY        | YYY        | YYY
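      A sketch of the frequent, near-realtime update pattern on the Ads table above, using boto3; the table name and key type are assumptions beyond what the slide shows.

        import boto3

        dynamodb = boto3.resource("dynamodb")
        ads = dynamodb.Table("Ads")   # assumed table name

        # Atomically add newly delivered impressions to a single ad item.
        ads.update_item(
            Key={"ad-id": "1"},                                   # assumed string key
            UpdateExpression="ADD #delivered :n",
            ExpressionAttributeNames={"#delivered": "imps delivered"},
            ExpressionAttributeValues={":n": 100},
        )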
    • 500,000 WRITES PER SECOND DURING SUPER BOWL
    • AMAZON GLACIER reliable long term archiving
    • S3 lifecycle policies (Amazon S3): if an object is older than 5 months, archive it to Amazon Glacier
    • S3 lifecycle policies (Amazon S3): if an object is older than 5 months, archive it to Amazon Glacier; if it is older than 1 year, delete the object from S3 (/dev/null)
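      A sketch of the same two lifecycle rules expressed with boto3 (using 150 days as roughly 5 months); bucket name and prefix are assumptions.

        import boto3

        s3 = boto3.client("s3")

        # Archive objects to Glacier after ~5 months and delete them after 1 year.
        s3.put_bucket_lifecycle_configuration(
            Bucket="my-log-bucket",                    # assumed bucket
            LifecycleConfiguration={"Rules": [{
                "ID": "archive-then-expire",
                "Filter": {"Prefix": "logs/"},         # assumed prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 150, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }]},
        )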
    • AMAZON REDSHIFT FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
    • DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was… A Lot Faster AMAZON REDSHIFT A Lot Cheaper A Whole Lot Simpler
    • AMAZON REDSHIFT RUNS ON OPTIMIZED HARDWARE
        HS1.8XL: 128 GB RAM, 16 cores, 16 TB compressed user storage, 2 GB/sec scan rate
        HS1.XL:  16 GB RAM,  2 cores,  2 TB compressed customer storage
    • 30 MINUTES DOWN TO 12 SECONDS
    • AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG
        Extra Large Node (HS1.XL): single node (2 TB) or cluster of 2-32 nodes (4 TB - 64 TB)
        Eight Extra Large Node (HS1.8XL): cluster of 2-100 nodes (32 TB - 1.6 PB)
    • JDBC/ODBC
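      Because Amazon Redshift speaks the PostgreSQL wire protocol, a standard driver works over JDBC/ODBC-style connections; a sketch with psycopg2, where the endpoint, database, and credentials are placeholders.

        import psycopg2

        conn = psycopg2.connect(
            host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
            port=5439, dbname="analytics", user="admin", password="...")
        cur = conn.cursor()
        # Query the CloudFront log table loaded later in this deck.
        cur.execute("SELECT status, COUNT(*) FROM cf_logs GROUP BY status ORDER BY 2 DESC;")
        for status, hits in cur.fetchall():
            print(status, hits)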
    • Amazon Redshift pricing (HS1.XL single node):
                              Price per hour    Effective hourly price per TB    Effective annual price per TB
        On-Demand             $0.850            $0.425                           $3,723
        1 Year Reservation    $0.500            $0.250                           $2,190
        3 Year Reservation    $0.228            $0.114                           $999
    • DATA WAREHOUSING DONE THE AWS WAY Easy to provision and scale up massively No upfront costs, pay as you go Really fast performance at a really low price Open and flexible with support for popular tools
    • USAGE SCENARIOS
    • Reporting Warehouse: OLTP / ERP / RDBMS  Redshift  Reporting and BI. Accelerated operational reporting; support for short-time use cases; data compression, index redundancy
    • On-Premises Integration: OLTP / ERP / RDBMS  Data Integration Partners*  Redshift  Reporting and BI
    • Live Archive for (Structured) Big Data: OLTP Web Apps  DynamoDB  Redshift  Reporting and BI. Direct integration with the COPY command; high-velocity data; data ages into Redshift; low-cost, high-scale option for new apps
    • Cloud ETL for Big Data: S3  Elastic MapReduce  Redshift  Reporting and BI. Maintain online SQL access to historical logs; transformation and enrichment with EMR; longer history ensures better insight
    • COPY into Amazon Redshift: table definition
        create table cf_logs (
          d date, t char(8), edge char(4), bytes int, cip varchar(15),
          verb char(3), distro varchar(MAX), object varchar(MAX), status int,
          Referer varchar(MAX), agent varchar(MAX), qs varchar(MAX)
        )
    • COPY into Amazon Redshift: loading the data
        copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
        credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
        IGNOREHEADER 2 GZIP DELIMITER '\t' DATEFORMAT 'YYYY-MM-DD'
    • GENERATE  STORE  ANALYZE  SHARE Amazon EC2 Amazon Elastic MapReduce
    • AMAZON EC2 ELASTIC COMPUTE CLOUD
    • EC2 instance families – General purpose (m1.small): Virtual core: 1, Memory: 1.7 GiB, I/O performance: Moderate
    • EC2 instance families – Compute optimized (cc2.8xlarge): Virtual cores: 32 (2 x Intel Xeon), Memory: 60.5 GiB, I/O performance: 10 Gbit
    • EC2 instance families – Memory optimized (cr1.8xlarge): Virtual cores: 32 (2 x Intel Xeon), Memory: 240 GiB, I/O performance: 10 Gbit, SSD instance store: 240 GB
    • EC2 instance families – Storage optimized:
        hi1.4xlarge: Virtual cores: 16, Memory: 60.5 GiB, I/O performance: 10 Gbit, SSD instance store: 2 x 1 TB
        hs1.8xlarge: Virtual cores: 16, Memory: 117 GiB, I/O performance: 10 Gbit, Instance store: 24 x 2 TB
    • ON A SINGLE INSTANCE COMPUTE TIME: 4h COST: 4h x $2.1 = $8.4
    • ON MULTIPLE INSTANCES COMPUTE TIME: 1h COST: 1h x 4 x $2.1 = $8.4
    • 3 HOURS FOR $4828.85/hr
    • Instead of $20+ MILLION in infrastructure
    • A FRAMEWORK THAT SPLITS DATA INTO PIECES, LETS PROCESSING OCCUR, AND GATHERS THE RESULTS
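      A toy word-count sketch of that split / process / gather flow; it runs on a single machine and only illustrates the idea, it is not Hadoop itself.

        from collections import defaultdict

        def map_phase(chunk):
            # Emit (word, 1) for every word in one piece of the data.
            for line in chunk:
                for word in line.split():
                    yield word, 1

        def reduce_phase(pairs):
            # Gather the intermediate results into final counts.
            counts = defaultdict(int)
            for word, n in pairs:
                counts[word] += n
            return dict(counts)

        chunks = [["the quick brown fox"], ["the lazy dog and the fox"]]     # data split into pieces
        mapped = [kv for chunk in chunks for kv in map_phase(chunk)]         # processing occurs per piece
        print(reduce_phase(mapped))                                          # results gathered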
    • AMAZON ELASTIC MAPREDUCE HADOOP AS A SERVICE
    • Corporate Data Center Elastic Data Center
    • Corporate Data Center Application data and logs for analysis pushed to S3 Elastic Data Center
    • Amazon Elastic Map Reduce master node to control analysis M Corporate Data Center Elastic Data Center
    • M Corporate Data Center Hadoop cluster started by Elastic Map Reduce Elastic Data Center
    • M Corporate Data Center Adding many hundreds or thousands of nodes Elastic Data Center
    • Disposed of when job completes M Corporate Data Center Elastic Data Center
    • Corporate Data Center Results of analysis pulled back into your systems Elastic Data Center
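      A sketch of that lifecycle with boto3: start a transient cluster, let it read from and write to S3, and have it terminate itself when the work is done; the release label, instance types, and roles are assumptions.

        import boto3

        emr = boto3.client("emr")

        emr.run_job_flow(
            Name="log-analysis",
            ReleaseLabel="emr-5.36.0",                    # assumed EMR release
            LogUri="s3://my-bucket/emr-logs/",            # assumed bucket
            Instances={
                "MasterInstanceType": "m5.xlarge",
                "SlaveInstanceType": "m5.xlarge",
                "InstanceCount": 10,
                "KeepJobFlowAliveWhenNoSteps": False,     # cluster is disposed of when the job completes
            },
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )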
    • Your Spreadsheet does not scale …
    • PIG
    • A real Pig script (used at Twitter)
    • Run on a sample dataset on your Laptop
    • $ pig -f myPigFile.q
    • M Run the same script on a 50 node Hadoop cluster Elastic Data Center
    • $ ./elastic-mapreduce --create --name "$USER's Pig JobFlow" --pig-script --args s3://myawsbucket/mypigquery.q --instance-type m1.xlarge --instance-count 50
    • $ elastic-mapreduce -j j-21IMWIA28LRK1 --add-instance-group task --instance-count 10 --instance-type m1.xlarge
    • Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE
    • PUBLIC DATA SETS http://aws.amazon.com/publicdatasets
    • GENERATE  STORE  ANALYZE  SHARE AWS Data Pipeline
    • AWS Data Pipeline: data-intensive orchestration and automation; reliable and scheduled; easy to use, drag and drop; execution and retry logic; map data dependencies; create and manage compute resources
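      A sketch of defining such a pipeline programmatically with boto3; the schedule, shell command, and resource objects below are illustrative assumptions rather than a definition from the deck.

        import boto3

        dp = boto3.client("datapipeline")

        p = dp.create_pipeline(name="nightly-etl", uniqueId="nightly-etl-v1")

        dp.put_pipeline_definition(
            pipelineId=p["pipelineId"],
            pipelineObjects=[
                {"id": "Default", "name": "Default", "fields": [
                    {"key": "scheduleType", "stringValue": "cron"},
                    {"key": "schedule", "refValue": "Nightly"},
                    {"key": "pipelineLogUri", "stringValue": "s3://my-bucket/dp-logs/"},
                    {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                    {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"}]},
                {"id": "Nightly", "name": "Nightly", "fields": [
                    {"key": "type", "stringValue": "Schedule"},
                    {"key": "period", "stringValue": "1 day"},
                    {"key": "startDateTime", "stringValue": "2013-10-17T02:00:00"}]},
                {"id": "CopyLogs", "name": "CopyLogs", "fields": [
                    {"key": "type", "stringValue": "ShellCommandActivity"},
                    {"key": "command", "stringValue": "aws s3 sync s3://raw-logs s3://staged-logs"},
                    {"key": "runsOn", "refValue": "Worker"}]},
                {"id": "Worker", "name": "Worker", "fields": [
                    {"key": "type", "stringValue": "Ec2Resource"},
                    {"key": "instanceType", "stringValue": "m1.large"},
                    {"key": "terminateAfter", "stringValue": "2 Hours"}]},
            ],
        )
        dp.activate_pipeline(pipelineId=p["pipelineId"])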
    • AWS Import / Export AWS Direct Connect Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2 Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE Amazon EC2 Amazon Elastic MapReduce AWS Data Pipeline
    • FROM DATA TO ACTIONABLE INFORMATION
    • Shlomi Vaknin
    • Amazon AWS generates big data core component for Ginger Software Shlomi Vaknin Oct 16, 2013
    • English writing assistant. An open platform for personal assistants.
    • Natural language speech interface for mobile apps
        • Users talk naturally with any mobile application; Ginger understands and executes their command
        • An end-to-end Speech-to-Action solution
        • First open platform for creating personal assistants
    • Platform architecture (diagram): Web Corpus, Domain Corpus, and User Corpus feed a Language Model, Semantic Model, and NLP/NLU Algorithms, which power the Writing Assistant (Proofreader, Rephrase, Personal Coach) and the PA Platform (Speech Engine, Query Understanding), backed by a DB
    • Our platform depends on scanning and indexing all the language we can find on the internet
    • A collection of all the language we found on the internet, accessible and pre-processed
        • Has to contain lots and lots of sentences
        • Needs to represent "common written language"
        • Accessible both for offline (research) and online (service) uses
    • 1. Crawling [own cluster, EMR+S3]
        • Generated about 50 TB of raw data
        • Reduced to about 5 TB of text data
      2. Post-processing [EMR+S3]
        • Tokenize
        • Normalize
        • Split to n-grams
        • Generalize
        • Count
        • Filter
      3. Indexing/Serving [EMR+S3]
        • Key/Value – has to be super fast
        • Full-text search
      4. Archiving (Glacier) [S3+Glacier]
        • Keeping data available for later research while minimizing cost
    • Mainly an NLP task
    • So we picked up
        • It's a Lisp!
        • Integrates very well with EMR, S3, etc.
    • n-gram counting
        • "How are you"  How are you, How are, are you, How, are, you
        • Lots of grams are repeated
        • Generalize contextually similar tokens
    • Fits the map-reduce paradigm very well
        • Most parts can be trivially parallelized
        • One part is sequential by grams
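      A minimal sketch of the n-gram splitting and counting described above; the maximum n and the sample sentences are assumptions.

        from collections import Counter

        def ngrams(tokens, max_n=3):
            # "How are you" -> How are you, How are, are you, How, are, you
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    yield " ".join(tokens[i:i + n])

        counts = Counter()
        for sentence in ["How are you", "How are they"]:
            counts.update(ngrams(sentence.split()))

        # Repeated grams such as "How are" accumulate counts across sentences.
        print(counts.most_common(5))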
    • EMR cluster node types: Master, Task, Core
    • Ratio between Core and Task nodes
        • We expected a very large output (100 TB)
        • m2.4xlarge core node output: 1690 GB  number of core nodes
        • Estimate the number of total map tasks
    • Final specs:
        Node Type   Instance      Count
        MASTER      cc2.8xlarge   1
        CORE        m2.4xlarge    200
        TASK        m2.2xlarge    500
    • Job took about 30 hours to complete
    • We generated nearly 100 TB of output data
    • During the map phase, the cluster achieved nearly 100% utilization
    • After initial filtration, 20 TB remained
    • Stay up to date with AMI releases
        • Don't stick to an old AMI just because it previously worked
    • Use the Job-Tracker
    • Use custom progress notification
    • Increase mapred.task.timeout
    • Limit the number of concurrent map tasks
        • Use the minimum number that gets you close to 100% CPU
    • Beware of spot nodes
        • If you ask for too many you might compete against your own price
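      A hedged sketch of applying two of the tips above (a longer task timeout and a cap on concurrent map tasks per node) when launching an EMR cluster with boto3; the classification, property names, and values are assumptions about the Hadoop configuration in use.

        import boto3

        emr = boto3.client("emr")

        emr.run_job_flow(
            Name="tuned-cluster",
            ReleaseLabel="emr-5.36.0",                             # assumed release
            Configurations=[{
                "Classification": "mapred-site",
                "Properties": {
                    "mapred.task.timeout": "1800000",              # 30 minutes, in milliseconds
                    "mapred.tasktracker.map.tasks.maximum": "4",   # limit concurrent map tasks per node
                },
            }],
            Instances={"MasterInstanceType": "m5.xlarge",
                       "SlaveInstanceType": "m5.xlarge",
                       "InstanceCount": 4,
                       "KeepJobFlowAliveWhenNoSteps": False},
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )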
    • Stash the data for later use to reduce cost
    • Glacier offers very cheap storage
    • Important things to know about Glacier:
        • Restoring the data can be VERY expensive
        • The key to reducing restore costs: restore SLOWLY
        • There is no built-in mechanism to restore slowly; use a 3rd-party application or do it manually
    • Glacier is very useful if your use case matches its design
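      A sketch of "restoring slowly" by hand: request restores for archived objects in small batches with long pauses, since there is no built-in mechanism; batch size, pause, and retrieval tier are assumptions.

        import itertools
        import time

        import boto3

        s3 = boto3.client("s3")

        def restore_slowly(bucket, keys, batch_size=100, pause_hours=4):
            it = iter(keys)
            while True:
                batch = list(itertools.islice(it, batch_size))
                if not batch:
                    break
                for key in batch:
                    # Bulk tier is the cheapest (and slowest) retrieval option.
                    s3.restore_object(
                        Bucket=bucket, Key=key,
                        RestoreRequest={"Days": 7,
                                        "GlacierJobParameters": {"Tier": "Bulk"}},
                    )
                time.sleep(pause_hours * 3600)   # spread retrieval requests over time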
    • EMR/S3 provide great power and elasticity to grow and shrink as required
    • Do your homework before running large jobs!
    • Our platform depends on scanning and indexing all the language we can find on the internet
    • To achieve this, Ginger Software makes heavy use of Amazon EMR
    • With Amazon EMR, Ginger Software can scale up vast amounts of computing power and scale back down when it is not needed
    • This gives Ginger Software the ability to create the world's most accurate language enhancement technology without the need to have expensive hardware lying idle during quiet periods
    • Thank You! We are hiring! shlomiv@gingersoftware.com