Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jesper Söderlund, Solutions Architect, AWS
May 4...
Agenda
• Introduction
• Benefits
• Use cases
• Getting started
• Q&A
AnalyzeStore
Amazon
Glacier
Amazon S3
Amazon
DynamoDB
Amazon RDS,
Amazon Aurora
AWS big data portfolio
AWS Data Pipeline
A...
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at...
The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programmi...
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester R...
Selected Amazon Redshift customers
Amazon Redshift architecture
Leader node
Simple SQL endpoint
Stores metadata
Optimizes query plan
Coordinates query execut...
Benefit #1: Amazon Redshift is fast
Dramatically less I/O
Column storage
Data compression
Zone maps
Direct-attached storag...
Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
Benefit #1: Amazon Redshift is fast
Hardware optimized for I/O intensive workloads, 4 GB/sec/node
Enhanced networking, ove...
Benefit #1: Amazon Redshift is fast
New Dense Storage (HDD) instance type
Improved memory 2x, compute 2x, disk throughput ...
Benefit #2: Amazon Redshift is inexpensive
Ds2 (HDD)
Price per hour for
DW1.XL single node
Effective annual
price per TB c...
Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and ...
Benefit #3: Amazon Redshift is fully managed
Amazon S3
Amazon S3
Region 1
Region 2
Fault tolerance
Disk failures
Node fail...
Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE perfect forward security...
Benefit #5: We innovate quickly
Well over 100 new features added since launch
Release every two weeks
Automatic patching
S...
Benefit #6: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine learning
• Data science
Benefit #7: Amazon Redshift has a large ecosystem
Data integration Systems integratorsBusiness intelligence
Benefit #8: Service oriented architecture
DynamoDB
EMR
S3
EC2/SSH
RDS/Aurora
Amazon
Redshift
Amazon Kinesis
Amazon ML
Data...
Use cases
NTT Docomo: Japan’s largest mobile service
provider
68 million customers
Tens of TBs per day of data across a
mobile netwo...
Nasdaq: powering 100 marketplaces in 50
countries
Orders, quotes, trade executions,
market “tick” data from 7 exchanges
7 ...
Getting started
Provisioning
Enter cluster details
Select node configuration
Select security settings and provision
Point-and-click resize
Resize
• Resize while remaining online
• Provision a new cluster in the
background
• Copy data in parallel from node to
no...
Data modeling
SELECT COUNT(*) FROM LOGS WHERE DATE = ‘09-JUNE-2013’
MIN: 01-JUNE-2013
MAX: 20-JUNE-2013
MIN: 08-JUNE-2013
MAX: 30-JUNE-2...
• Single column
• Compound
• Interleaved
Single column
• Table is sorted by 1 column
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2...
Compound
• Table is sorted by first column, then second column, and so on
Date Region Country
2-JUN-2015 Africa Zaire
2-JU...
Interleaved
• Equal weight is given to each column
Date Region Country
2-JUN-2015 Africa Zaire
3-JUN-2015 Asia Singapore
2...
• KEY
• ALL
• EVEN
ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
...
ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
...
ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
...
ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
...
• KEY
• Large fact tables
• Large dimension tables
• ALL
• Medium dimension tables (1K–2M)
• EVEN
• Tables with no joins o...
Loading data
AWS cloudCorporate data center
Amazon S3
Amazon
Redshift
Flat files
Data loading options
AWS cloudCorporate data center
ETL
Source DBs
Amazon
Redshift
Amazon
Redshift
Data loading options
AWS cloud
Amazon
Redshift
Amazon
Kinesis
Firehose
Data loading options
Querying
JDBC/ODBC
Amazon Redshift
Amazon Redshift works with your existing
analysis tools
ODBC/JDBC
BI clients
Amazon
Redshift
ODBC/JDBC
BI server
Amazon
Redshift
Clients
Monitor query performance
https://www.flickr.com/photos/wandering_angel/834854901
Getting started with
Redshift
Experiences from Yle
Vision 2018
The best personal
user experience
https://www.flickr.com/photos/rh2ox/9990024683/in/photolist-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-...
Big Data =
Big Dada
Big Data =
Big Dada
Yle News
Government
One journalist
https://www.flickr.com/photos/rh2ox/9990024683/in/photolist-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-...
https://www.flickr.com/photos/rh2ox/9990016123/in/photolist-gdMrKi-6GiYkV-ax8z4B-9rnTdL-bf2wtK-22CjS2-gdMuhT-7GaoYw-FMVAH-...
Painpoint #1
Losing data
In production
Critical
Yle Analytics Pipeline – First idea
Elastic
Beanstalk
Analytics
Collector
Web events
~50 mill./day
...
Painpoint #2
Cloud doesn’t
scale?
Painpoint #3
It’s down
again?
Painpoint #4
Where are the
skills?
Pilot
In production
Critical
Responsible: JO
Testing
Yle Analytics Pipeline - Current situation
Elastic
Beanstalk
Kinesis ...
Yle Analytics Pipeline - Future scenario
In production
Critical
Responsible: JO
Elastic
Beanstalk Kinesis Streams
Analytic...
Redshift is still
being explored and
looks promising!
Aleksi.Rossi@yle.fi
@AlekRossi
Thanks!
Amazon Redshift
Amazon Redshift
Amazon Redshift
Amazon Redshift
Upcoming SlideShare
Loading in …5
×

Amazon Redshift

3,237 views

Published on

Services Track 2

Published in: Business
  • Be the first to comment

Amazon Redshift

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jesper Söderlund, Solutions Architect, AWS May 4th 2016 Getting Started with Amazon Redshift
  2. 2. Agenda • Introduction • Benefits • Use cases • Getting started • Q&A
  3. 3. AnalyzeStore Amazon Glacier Amazon S3 Amazon DynamoDB Amazon RDS, Amazon Aurora AWS big data portfolio AWS Data Pipeline Amazon CloudSearch Amazon EMR Amazon EC2 Amazon Redshift Amazon Machine Learning Amazon Elasticsearch Service AWS Database Migration Service Amazon QuickSight Amazon Kinesis Firehose AWS Import/Export AWS Direct Connect Collect Amazon Kinesis Streams
  4. 4. Relational data warehouse Massively parallel; petabyte scale Fully managed HDD and SSD platforms $1,000/TB/year; starts at $0.25/hour Amazon Redshift a lot faster a lot simpler a lot cheaper
  5. 5. The Amazon Redshift view of data warehousing 10x cheaper Easy to provision Higher DBA productivity 10x faster No programming Easily leverage BI tools, Hadoop, machine learning, streaming Analysis inline with process flows Pay as you go, grow as you need Managed availability and disaster recovery Enterprise Big data SaaS
  6. 6. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change. Forrester Wave™ Enterprise Data Warehouse Q4 ’15
  7. 7. Selected Amazon Redshift customers
  8. 8. Amazon Redshift architecture Leader node Simple SQL endpoint Stores metadata Optimizes query plan Coordinates query execution Compute nodes Local columnar storage Parallel/distributed execution of all queries, loads, backups, restores, resizes Start at just $0.25/hour, grow to 2 PB (compressed) DC1: SSD; scale from 160 GB to 326 TB DS2: HDD; scale from 2 TB to 2 PB Ingestion/Backup Backup Restore JDBC/ODBC 10 GigE (HPC)
  9. 9. Benefit #1: Amazon Redshift is fast Dramatically less I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw 10 | 13 | 14 | 26 |… … | 100 | 245 | 324 375 | 393 | 417… … 512 | 549 | 623 637 | 712 | 809 … … | 834 | 921 | 959 10 324 375 623 637 959
  10. 10. Benefit #1: Amazon Redshift is fast Parallel and distributed Query Load Export Backup Restore Resize
  11. 11. Benefit #1: Amazon Redshift is fast Hardware optimized for I/O intensive workloads, 4 GB/sec/node Enhanced networking, over 1 million packets/sec/node Choice of storage type, instance size Regular cadence of autopatched improvements
  12. 12. Benefit #1: Amazon Redshift is fast New Dense Storage (HDD) instance type Improved memory 2x, compute 2x, disk throughput 1.5x Cost: Same as our prior generation! Performance improvement: 50% Enhanced I/O and commit improvements (Jan ’16) Reduce amount of time to commit data Performance improvement: 35%
  13. 13. Benefit #2: Amazon Redshift is inexpensive Ds2 (HDD) Price per hour for DW1.XL single node Effective annual price per TB compressed On-demand $ 0.850 $ 3,725 1 year reservation $ 0.500 $ 2,190 3 year reservation $ 0.228 $ 999 Dc1 (SSD) Price per hour for DW2.L single node Effective annual price per TB compressed On-demand $ 0.250 $ 13,690 1 year reservation $ 0.161 $ 8,795 3 year reservation $ 0.100 $ 5,500 Pricing is simple Number of nodes x price/hour No charge for leader node No upfront costs Pay as you go
  14. 14. Benefit #3: Amazon Redshift is fully managed Continuous/incremental backups Multiple copies within cluster Continuous and incremental backups to Amazon S3 Continuous and incremental backups across regions Streaming restore Amazon S3 Amazon S3 Region 1 Region 2
  15. 15. Benefit #3: Amazon Redshift is fully managed Amazon S3 Amazon S3 Region 1 Region 2 Fault tolerance Disk failures Node failures Network failures Availability Zone/region level disasters
  16. 16. Benefit #4: Security is built-in • Load encrypted from S3 • SSL to secure data in transit • ECDHE perfect forward security • Amazon VPC for network isolation • Encryption to secure data at rest • All blocks on disks and in S3 encrypted • Block key, cluster key, master key (AES-256) • On-premises HSM & AWS CloudHSM support • Audit logging and AWS CloudTrail integration • SOC 1/2/3, PCI-DSS, FedRAMP, BAA 10 GigE (HPC) Ingestion Backup Restore Customer VPC Internal VPC JDBC/ODBC
  17. 17. Benefit #5: We innovate quickly Well over 100 new features added since launch Release every two weeks Automatic patching Service Launch (2/14) PDX (4/2) Temp Credentials (4/11) DUB (4/25) SOC1/2/3 (5/8) Unload Encrypted Files NRT (6/5) JDBC Fetch Size (6/27) Unload logs (7/5) SHA1 Builtin (7/15) 4 byte UTF-8 (7/18) Sharing snapshots (7/18) Statement Timeout (7/22) Timezone, Epoch, Autoformat (7/25) WLM Timeout/Wildcards (8/1) CRC32 Builtin, CSV, Restore Progress (8/9) Resource Level IAM (8/9) PCI (8/22) UTF-8 Substitution (8/29) JSON, Regex, Cursors (9/10) Split_part, Audit tables (10/3) SIN/SYD (10/8) HSM Support (11/11) Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13) Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13) EIP Support for VPC Clusters (12/28) New query monitoring system tables and diststyle all (1/13) Redshift on DW2 (SSD) Nodes (1/23) Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strotol() and query termination (2/13) Resize progress indicator & Cluster Version (3/21) Regex_Substr, COPY from JSON (3/25) 50 slots, COPY from EMR, ECDHE ciphers (4/22) 3 new regex features, Unload to single file, FedRAMP(5/6) Rename Cluster (6/2) Copy from multiple regions, percentile_cont, percentile_disc (6/30) Free Trial (7/1) pg_last_unload_count (9/15) AES-128 S3 encryption (9/29) UTF-16 support (9/29)
  18. 18. Benefit #6: Amazon Redshift is powerful • Approximate functions • User defined functions • Machine learning • Data science
  19. 19. Benefit #7: Amazon Redshift has a large ecosystem Data integration Systems integratorsBusiness intelligence
  20. 20. Benefit #8: Service oriented architecture DynamoDB EMR S3 EC2/SSH RDS/Aurora Amazon Redshift Amazon Kinesis Amazon ML Data Pipeline CloudSearch Amazon Mobile Analytics
  21. 21. Use cases
  22. 22. NTT Docomo: Japan’s largest mobile service provider 68 million customers Tens of TBs per day of data across a mobile network 6 PB of total data (uncompressed) Data science for marketing operations, logistics, and so on Greenplum on-premises Scaling challenges Performance issues Need same level of security Need for a hybrid environment
  23. 23. Nasdaq: powering 100 marketplaces in 50 countries Orders, quotes, trade executions, market “tick” data from 7 exchanges 7 billion rows/day Analyze market share, client activity, surveillance, billing, and so on Microsoft SQL Server on-premises Expensive legacy DW ($1.16 M/yr.) Limited capacity (1 yr. of data online) Needed lower TCO Must satisfy multiple security and regulatory requirements Similar performance
  24. 24. Getting started
  25. 25. Provisioning
  26. 26. Enter cluster details
  27. 27. Select node configuration
  28. 28. Select security settings and provision
  29. 29. Point-and-click resize
  30. 30. Resize • Resize while remaining online • Provision a new cluster in the background • Copy data in parallel from node to node • Only charged for source cluster
  31. 31. Data modeling
  32. 32. SELECT COUNT(*) FROM LOGS WHERE DATE = ‘09-JUNE-2013’ MIN: 01-JUNE-2013 MAX: 20-JUNE-2013 MIN: 08-JUNE-2013 MAX: 30-JUNE-2013 MIN: 12-JUNE-2013 MAX: 20-JUNE-2013 MIN: 02-JUNE-2013 MAX: 25-JUNE-2013 Unsorted table MIN: 01-JUNE-2013 MAX: 06-JUNE-2013 MIN: 07-JUNE-2013 MAX: 12-JUNE-2013 MIN: 13-JUNE-2013 MAX: 18-JUNE-2013 MIN: 19-JUNE-2013 MAX: 24-JUNE-2013 Sorted by date Zone maps
  33. 33. • Single column • Compound • Interleaved
  34. 34. Single column • Table is sorted by 1 column Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea [ SORTKEY ( date ) ] • Best for: • Queries that use first column (that is, date) as primary filter • Can speed up joins and group bys • Quickest to VACUUM
  35. 35. Compound • Table is sorted by first column, then second column, and so on Date Region Country 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Korea 2-JUN-2015 Asia Singapore 2-JUN-2015 Europe Germany 3-JUN-2015 Asia Hong Kong 3-JUN-2015 Asia Korea [ SORTKEY COMPOUND ( date, region, country) ] • Best for: • Queries that use first column as primary filter, then other columns • Can speed up joins and group bys • Slower to VACUUM
  36. 36. Interleaved • Equal weight is given to each column Date Region Country 2-JUN-2015 Africa Zaire 3-JUN-2015 Asia Singapore 2-JUN-2015 Asia Korea 2-JUN-2015 Europe Germany 3-JUN-2015 Asia Hong Kong 2-JUN-2015 Asia Korea [ SORTKEY INTERLEAVED ( date, region, country) ] • Best for: • Queries that use different columns in filter • Queries get faster the more columns used in the filter • Slowest to VACUUM
  37. 37. • KEY • ALL • EVEN
  38. 38. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green 2 3 4 ID Gender Name 101 M John Smith 306 F Lisa Green ID Gender Name 292 F Jane Jones 209 M James White ID Gender Name 139 M Peter Black 164 M Brian Snail ID Gender Name 446 M Pat Partridge 658 F Sarah Cyan Round Robin DISTSTYLE EVEN
  39. 39. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green Hash Function ID Gender Name 101 M John Smith 306 F Lisa Green ID Gender Name 292 F Jane Jones 209 M James White ID Gender Name 139 M Peter Black 164 M Brian Snail ID Gender Name 446 M Pat Partridge 658 F Sarah Cyan DISTSTYLE KEY
  40. 40. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green Hash Function ID Gender Name 101 M John Smith 139 M Peter Black 446 M Pat Partridge 164 M Brian Snail 209 M James White ID Gender Name 292 F Jane Jones 658 F Sarah Cyan 306 F Lisa Green DISTSTYLE KEY
  41. 41. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White All DISTSTYLE ALL
  42. 42. • KEY • Large fact tables • Large dimension tables • ALL • Medium dimension tables (1K–2M) • EVEN • Tables with no joins or group bys • Small dimension tables (<1000)
  43. 43. Loading data
  44. 44. AWS cloudCorporate data center Amazon S3 Amazon Redshift Flat files Data loading options
  45. 45. AWS cloudCorporate data center ETL Source DBs Amazon Redshift Amazon Redshift Data loading options
  46. 46. AWS cloud Amazon Redshift Amazon Kinesis Firehose Data loading options
  47. 47. Querying
  48. 48. JDBC/ODBC Amazon Redshift Amazon Redshift works with your existing analysis tools
  49. 49. ODBC/JDBC BI clients Amazon Redshift
  50. 50. ODBC/JDBC BI server Amazon Redshift Clients
  51. 51. Monitor query performance
  52. 52. https://www.flickr.com/photos/wandering_angel/834854901 Getting started with Redshift Experiences from Yle
  53. 53. Vision 2018 The best personal user experience
  54. 54. https://www.flickr.com/photos/rh2ox/9990024683/in/photolist-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-9fLRVv-mWf6or-nMui2q-uqFgjs-fY9CgE-92kp1F-5YiALt-bf2wDV-h4RV3y-7R6usE- 5hrCbv-6yx8zu-da8jMn-ddn815-4nWCrn-98rFiH-2Wa9no-fS7qGg-gbA7WS-4C1MAu-kwvVNz-kwxBw1-amkoeY-p1mpRv-8eKo29-7mfH7s-mWXwd5-in4qSv-7n3E7Y-aywJ1i-kwvpGe-ddn5G7-5BK5qi-ddmZbF-egsmCb-75dpDG- Big Data =
  55. 55. Big Data = Big Dada
  56. 56. Big Data = Big Dada Yle News Government One journalist
  57. 57. https://www.flickr.com/photos/rh2ox/9990024683/in/photolist-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-9fLRVv-mWf6or-nMui2q-uqFgjs-fY9CgE-92kp1F-5YiALt-bf2wDV-h4RV3y-7R6usE- 5hrCbv-6yx8zu-da8jMn-ddn815-4nWCrn-98rFiH-2Wa9no-fS7qGg-gbA7WS-4C1MAu-kwvVNz-kwxBw1-amkoeY-p1mpRv-8eKo29-7mfH7s-mWXwd5-in4qSv-7n3E7Y-aywJ1i-kwvpGe-ddn5G7-5BK5qi-ddmZbF-egsmCb-75dpDG- Big Data = Small data
  58. 58. https://www.flickr.com/photos/rh2ox/9990016123/in/photolist-gdMrKi-6GiYkV-ax8z4B-9rnTdL-bf2wtK-22CjS2-gdMuhT-7GaoYw-FMVAH-5NfuQU-pVKYKn-9uDrJR-bvtCC4-gdLJnk-ekYLQy-9fLRVv-mWf6or-nMui2q-uqFgjs- fY9CgE-92kp1F-5YiALt-bf2wDV-h4RV3y-7R6usE-5hrCbv-6yx8zu-da8jMn-ddn815-4nWCrn-98rFiH-2Wa9no-fS7qGg-gbA7WS-4C1MAu-kwvVNz-kwxBw1-amkoeY-p1mpRv-8eKo29-7mfH7s-mWXwd5-in4qSv-7n3E7Y-aywJ1i-
  59. 59. Painpoint #1 Losing data
  60. 60. In production Critical Yle Analytics Pipeline – First idea Elastic Beanstalk Analytics Collector Web events ~50 mill./day RedShift Recommendation service
  61. 61. Painpoint #2 Cloud doesn’t scale?
  62. 62. Painpoint #3 It’s down again?
  63. 63. Painpoint #4 Where are the skills?
  64. 64. Pilot In production Critical Responsible: JO Testing Yle Analytics Pipeline - Current situation Elastic Beanstalk Kinesis Streams Analytics Collector Web events ~50 mill./day Kinesis Firehose RedShift Lambda S3 EMR Daily archiving Areena Recommendation Delay: 10-15 minutes Note: Altogether tens of source application types with in total 120 different labels (most of them common). Labels may change. Small batch sizes Compression and larger file sizes. ~60 Gb per day CloudWatch Dashboards Small sized files Delay: minute?
  65. 65. Yle Analytics Pipeline - Future scenario In production Critical Responsible: JO Elastic Beanstalk Kinesis Streams Analytics Collector RedShift Lambda S3 Delay: 15 minutes to hours Kinesis Firehose Aggregated analysis (hours) Ad hoc analysis Validation Parsing Aggregating? CloudWatch Web events ~50 mill./day Delay: minutes? Delay: seconds DynamoDBEMR Specific real time analysis (seconds) Selected analysis EMR Ad hoc analysis w/ random tools Batch analysis Raw data queries APIs,buffering Additional data Daily/weekly division into paths? ElasticSearch? Lambda EC2 Container Service? What kind of architecture? Data vault? Schema versioning Remove old information
  66. 66. Redshift is still being explored and looks promising!
  67. 67. Aleksi.Rossi@yle.fi @AlekRossi Thanks!

×