The document provides the agenda for a Master AWS Redshift training event, part of CloudZone's Big Data Month 2016. The agenda comprises an introduction and two lab sessions on using Amazon Redshift for data warehousing, followed by ironSource's presentation on processing 200 billion data events per month with Node.js and Docker on AWS, including tips for optimizing Redshift performance.
CloudZone Big Data Month 2016 Agenda for Mastering AWS Redshift
1. All content is the property and proprietary interest of CloudZone. The removal of any proprietary notices, including attribution information, is strictly prohibited.
2. Big Data Month 2016 – Up Next…
Session dates: 14.11, 15.11, 22.11, 22.11, 28.11, 30.11
3. Master AWS Redshift - Agenda
13:00 – 13:20 Intro to Amazon Redshift by ironSource
13:20 – 15:00 LAB I – Using Amazon Redshift
15:00 – 15:15 Break
15:15 – 17:25 LAB II – Table Layout and Schema Design with Amazon Redshift
17:25 – 17:30 Your next steps on AWS by CloudZone
4. Shimon Tolts
General Manager, Data Solutions
Atom – Data Pipeline: Processing 200B Events with Node.js and Docker on AWS
5. About ironSource: Hypergrowth
● 800M People Reached Each Month
● 4200 Apps Installed Every Minute with the ironSource Platform
● 200B Registered & Analyzed Data Events Every Month
[Chart: registered & analyzed data events per month, growing from about 50B in Jun 2015 to 200B in May 2016]
6. Our Business Challenge
We needed a way to manage this data:
Collect → Process → Store
8. Collection
● Multi-region layer – latency-based routing
● Low latency from client to Atom servers
● High availability – AWS regions do fail!
● Store raw data + headers upon receiving (see the sketch below)
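A rough sketch of what such a collection endpoint can look like in Node.js: acknowledge only after the raw payload and its headers are durably stored, and leave regional failover to latency-based DNS routing. The endpoint path, bucket name, and library choices here are illustrative assumptions, not Atom's actual code.

```typescript
// Hypothetical collection endpoint: persist the raw payload plus request
// headers the moment they arrive, before any processing, so no event is
// lost if a later pipeline stage fails.
import express from "express";
import { S3 } from "aws-sdk";

const app = express();
const s3 = new S3(); // region is taken from the instance environment

app.post("/track", express.raw({ type: "*/*" }), async (req, res) => {
  const key = `raw/${Date.now()}-${Math.random().toString(36).slice(2)}.json`;
  try {
    await s3
      .putObject({
        Bucket: "atom-raw-events", // hypothetical bucket name
        Key: key,
        Body: JSON.stringify({
          headers: req.headers, // keep headers alongside the payload
          body: req.body.toString("utf8"),
          receivedAt: new Date().toISOString(),
        }),
      })
      .promise();
    res.status(200).end(); // ack only after the raw copy is durable
  } catch (err) {
    res.status(500).end(); // client retries; HA comes from multi-region DNS
  }
});

app.listen(8080);
```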
9. Data Enrichment
● Enrich data before storing it in your data lake and/or warehouse:
○ IP to country
○ Currency conversion
○ Decrypt data
○ User-agent parsing – OS, browser, device...
● Any custom logic you would like – fully extensible (see the sketch below)
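One way to structure that kind of pluggable enrichment is a list of pure functions applied in order, so custom logic is just another entry in the list. A minimal sketch; geoip-lite and ua-parser-js are stand-in libraries chosen for illustration, and the event field names are assumptions.

```typescript
// Sketch of a pluggable enrichment stage: each enricher is a pure
// function over the event; adding custom logic means appending to the list.
import geoip from "geoip-lite"; // IP -> country lookup (illustrative choice)
import { UAParser } from "ua-parser-js"; // user agent -> OS/browser/device

type Event = Record<string, unknown> & { ip?: string; userAgent?: string };
type Enricher = (e: Event) => Event;

const ipToCountry: Enricher = (e) => ({
  ...e,
  country: e.ip ? geoip.lookup(e.ip)?.country : undefined,
});

const parseUserAgent: Enricher = (e) => {
  if (!e.userAgent) return e;
  const ua = new UAParser(e.userAgent).getResult();
  return { ...e, os: ua.os.name, browser: ua.browser.name, device: ua.device.type };
};

// Currency conversion, decryption, or any custom logic slot in the same way.
const enrichers: Enricher[] = [ipToCountry, parseUserAgent];

export const enrich = (e: Event): Event =>
  enrichers.reduce((acc, fn) => fn(acc), e);
```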
10. Data Targets
● Near real-time data insertion – 1 minute!
● Stream data to Google Cloud Storage and/or AWS S3
● Smart insertion of data into AWS Redshift (see the sketch below)
○ Set the number of parallel COPYs
○ Configure priority per table
● BigQuery – load data via batch file imports rather than streaming inserts (saves 20% of the cost)
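The Redshift bullets above translate roughly into: write micro-batches to S3, then COPY them in, with a connection-pool cap deciding how many COPYs run in parallel and a priority sort deciding which tables load first. A sketch with node-postgres; the cluster endpoint, credentials, IAM role, and table names are placeholders.

```typescript
// Sketch: drain a queue of S3 batch files into Redshift via COPY.
// The pg pool size caps concurrent COPYs; a priority sort decides order.
import { Pool } from "pg"; // Redshift speaks the Postgres protocol

const MAX_PARALLEL_COPIES = 2; // tune per cluster; unbounded COPYs starve it

const pool = new Pool({
  host: "my-cluster.example.redshift.amazonaws.com", // placeholder endpoint
  port: 5439,
  database: "analytics",
  user: "loader",
  password: process.env.REDSHIFT_PASSWORD,
  max: MAX_PARALLEL_COPIES, // pool size = COPY concurrency limit
});

type Batch = { table: string; s3Path: string; priority: number };

async function copyBatch(b: Batch): Promise<void> {
  // JSON 'auto' maps JSON keys to columns; the IAM role ARN is a placeholder.
  await pool.query(
    `COPY ${b.table}
     FROM '${b.s3Path}'
     IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
     FORMAT AS JSON 'auto'`
  );
}

export async function drainQueue(queue: Batch[]): Promise<void> {
  // Higher-priority tables load first ("configure priority per table").
  queue.sort((a, b) => b.priority - a.priority);
  // Promise.all issues everything, but the pool admits only
  // MAX_PARALLEL_COPIES connections at a time; the rest wait in line.
  await Promise.all(queue.map(copyBatch));
  await pool.end();
}
```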
13. Docker
● Linux containers
● Save provisioning time
● Infrastructure as code (see the sketch below)
● Dev-Test-Production – identical containers
● Ship easily
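For a Node.js service like the ones above, the "identical container in dev, test, and production" idea comes down to a Dockerfile along these lines. A minimal sketch; the base image tag, file layout, and entry point are illustrative.

```dockerfile
# Illustrative Dockerfile for a Node.js pipeline service: one image runs
# unchanged in dev, test, and production, and provisioning a new instance
# becomes a `docker run`.

# Node 4 "argon" LTS tag of the era; pick whichever tag you target.
FROM node:argon

WORKDIR /app

# Install dependencies first so Docker's layer cache skips this step
# when only application code changes.
COPY package.json ./
RUN npm install --production

COPY . .

EXPOSE 8080
CMD ["node", "server.js"]
```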
14. Cloud infrastructure
● Pay as you go (and grow)
● SaaS services
● Auto Scaling groups
● DynamoDB
● RDS (*SQL)
● Redshift data warehouse
15. Continuous Integration
● From commit to production
● Jenkins commit hook
● Git branching model
● AWS dynamic slaves
● Unit tests
● Docker builds
● Updating live environment
18. STARTING POINT
● Xplenty – Hadoop service – ~40 min queries
● One big cluster – 96 xlarge nodes
● No WLM configuration
● CSV copy
● No reserved nodes
● A different ETL process implemented by every department
21. SOLUTION
● Using 8xl nodes where needed
● A Redshift cluster per department
● “Hot and cold” clusters – SSD: fast and furious; HDD: slow but cheap
● WLM configuration (see the example below)
● Reserved nodes
● JSON copy
● One pipeline to rule them all – ironBeast – currently supporting over 50B events per month, inserting data into more than 10 Redshift clusters
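For reference, WLM is configured as a JSON array in the cluster parameter group (the wlm_json_configuration parameter). The queue names, concurrency limits, and memory split below are illustrative, not ironSource's actual settings; queries are routed into a queue with SET query_group, and the final entry with no query_group is the default queue.

```json
[
  { "query_group": ["etl"],        "query_concurrency": 3,  "memory_percent_to_use": 60 },
  { "query_group": ["dashboards"], "query_concurrency": 10, "memory_percent_to_use": 30 },
  { "query_concurrency": 5 }
]
```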
23. THINGS WE LEARNED ALONG THE WAY
● https://github.com/awslabs/amazon-redshift-utils (AdminViews)
● User permissions do not apply to new tables created later in a schema – set default privileges (see the sketch after this list)
● Vacuum, vacuum, vacuum
● Avoid parallel inserts (especially on 8xl nodes) – if you copy to multiple tables, it is better to implement a COPY queue
● STL_LOAD_ERRORS – money on the floor: check it regularly for silently failed loads
● A columnar datastore does not mean you can use as many columns as you want – it is better to split into multiple tables
● Encode your columns – ‘analyze compression’
● Instances that query Redshift should use MTU 1500 – link
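Several of those learnings map to concrete statements that can run on a schedule. A sketch of driving them from Node.js with node-postgres; the cluster endpoint, schema, table, and group names are placeholders.

```typescript
// Sketch: recurring Redshift housekeeping implied by the list above.
// Names are placeholders; run as a superuser or the table owner.
import { Client } from "pg";

async function housekeeping(): Promise<void> {
  const client = new Client({
    host: "my-cluster.example.redshift.amazonaws.com", // placeholder endpoint
    port: 5439,
    database: "analytics",
    user: "admin",
    password: process.env.REDSHIFT_PASSWORD,
  });
  await client.connect();

  // "Vacuum, vacuum, vacuum": reclaim space and re-sort after heavy loads.
  await client.query("VACUUM events");
  await client.query("ANALYZE events");

  // "Encode your columns": ask Redshift to recommend compression encodings.
  const enc = await client.query("ANALYZE COMPRESSION events");
  console.log(enc.rows);

  // Permissions granted on a schema do not cover tables created later;
  // default privileges close that gap for future tables.
  await client.query(
    "ALTER DEFAULT PRIVILEGES IN SCHEMA analytics GRANT SELECT ON TABLES TO GROUP readers"
  );

  // "Money on the floor": surface recent load errors instead of losing rows silently.
  const errs = await client.query(
    "SELECT starttime, filename, err_reason FROM stl_load_errors ORDER BY starttime DESC LIMIT 20"
  );
  console.log(errs.rows);

  await client.end();
}

housekeeping().catch(console.error);
```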