Understanding AWS Big Data Strategy and Related AWS Services

2018.12.6
Bespin Global CDP Team, Choi Jeong-sik (최정식)
js.choi@bespinglobal.com

AGENDA
Ⅰ. Introduction
1. Database architectures and considerations
2. Seven trends to date
3. Big Data challenges
Ⅱ. AWS Big Data strategy
4. AWS services from a Data Store perspective
5. Big Data Architecting process
6. AWS Big Data services
Ⅲ. Introduction to AWS Redshift

1. Database architectures and considerations

Copyright © 2018 BESPIN GLOBAL Co., Ltd. All rights reserved | Confidential
http://www.bespinglobal.com
[Figure: database architectures]
• Single Instance: one node A (R/W)
• Read Replica: A (R/W) replicates to A' (Read Only) in async mode
• HA: Shared Storage Cluster: Active and Standby nodes on Shared Storage
• HA: Shared Nothing Cluster: Active A replicates to Standby A in sync mode
• Parallel Server with shared disk: Active, Active, Active… on Shared Storage
• Parallel Server with no master: Active A, Active B, Active C, …
• Parallel Server with master: one Master with Slave A, Slave B, Slave C, …
DB type consideration
• Data Model: Relational vs. Non-relational
• Purpose: Analytic vs. Operational
• Types: RDB, NewSQL, DW, NoSQL, Hadoop
Infra Consideration
• Performance scale-out
• Fault tolerance
• Parallel processing
• Shared Storage for cloud adoption

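2. Seven trends to date

2.1 Data types are diversifying and data volumes are growing.
[Figure: data model vs. purpose quadrant. Model: Relational (structured data) vs. Non-relational (unstructured data). Purpose: Operational vs. Analytic. RDB and DW are on the relational side; text, logs, IoT, external data, and so on are on the non-relational side.]
▪ Unstructured data keeps growing, and so does the need to analyze it.
▪ A large share of data still goes unanalyzed.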

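2.2 The meaning of "Big Data" is expanding to cover analytic systems as a whole.
[Figure: data model vs. purpose quadrant with RDB and DW]
▪ "Big Data" used to mean Hadoop and its ecosystem, the systems for analyzing unstructured data, as distinct from the DW, the analytic system for traditional structured (RDB) data.
→ The trend now is to call analytic systems as a whole "Big Data".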

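2.3 The boundary between structured and unstructured analytics is breaking down.
[Figure: data model vs. purpose quadrant with RDB and DW]
▪ Regardless of the type of source data, and driven by what each service requires, analytic systems are gaining
• a variety of "data stores",
• a variety of "analysis methods",
• a variety of "visualization methods".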

2.4 There is a need for near-real-time data processing.
[Figure: data model vs. purpose quadrant with RDB and DW, showing a D+1 batch path and a Streaming path]
▪ Processing has been mostly batch (retrospective).
→ Near-real-time processing is being added where requirements call for it.

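2.5 Don't discard data; collect it first (Data is Money).
[Figure: data model vs. purpose quadrant with RDB and DW. Callout: "The analysis targets and purpose are clearly defined in advance."]
▪ Data Lake
Even when the purpose and method of analysis are not yet defined, the trend is to gather data that is expected to be meaningful in one place.
▪ Data Lake requirements
• It must be a single store that holds large volumes of data cheaply and safely.
• It must allow many kinds of data to be collected cheaply, easily, and in a variety of ways.
• Analysis and visualization must be quick and easy.
• Data must be easy to move to another store, depending on the analysis or visualization method.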

2.6 Visualization and analysis are diversifying.
[Figure: data model vs. purpose quadrant with RDB and DW. Labels: Deep Learning, AI/ML, DW, Big Data.]
▪ Visualization tools
• From ODBC/JDBC-based data sources → toward support for a wide variety of data sources
• Fixed-format reports → Self-trained BI
• Time-series DBs and time-series visualization tools for real-time analysis
▪ Use of AI/ML
• Collected data is used for higher-level analysis and prediction.
• AI/ML is tied to analytic systems because it depends on abundant training data.

2.7 Infrastructure is shifting to the cloud.
Looking back at the on-premise environment…
• Difficulty of resource expansion
• Resource sizing is important
• Multiple businesses are running on one database
• Various business types are running on one type of DB
• Sizing considers the 5-year workload increase
• Refer to tpmC
• Resource over-spec sizing
• Oracle RAC is preferred
• Infra cost efficiency
• Performance test is important

[Figure: On-premise (Physical Machine → Virtual Machine) → Public Cloud (VM on Cloud = IaaS → Managed Service = PaaS → Server-less = FaaS); SaaS is also shown.]
▪ The trend is to reduce the burden, issues, and cost of infrastructure management and to focus on the essence of the customer's own service.
▪ Server-less (FaaS) characteristics
• No need to choose a server spec
• Auto-scaling
• Pay as you go

▪ For Big Data environments, the trend is to consider Public Cloud first.
Public Cloud characteristics
• Deploy immediately, at any time…
• Managed, Server-less
• Pay for what you use
• Size only for what you need…
• Flexible re-architecture
Public Cloud goals: cost, flexibility, efficiency
Survey result: "What do enterprises want?"
1) Why did you build your business on the cloud?
- Cost saving (39%)
- Resource scalability based on application / workload demands (32%)
- Time to market / agility (32%)
- Less to manage internally (23%)
- Improved availability / uptime (27%)

3. Big Data challenges

4. AWS services from a Data Store perspective

4.1 Data related 3 Elements
[Figure: Data producer → Data Store → Data consumer]
• Data Store
- A database is one kind of Data Store.
- There are 7 types of Data Store, and each type has its own characteristics.
- Business requirements mainly relate to the Data Store.
- Some are solutions (such as RDB or NoSQL) and some are simply files.
• Data producer and Data consumer
- There can be many kinds, depending on the Data Store.
- Some are services (such as Kinesis streams) and some are user applications that use an API or SQL.
[Figure: sample. Labels: Oracle, MySQL, Teradata, CDC, ETL, SQL, Java App, BI App, EIS App.]

4.2 Data Store 7 Type
No. | Data Store | Usage | AWS Service | Major Keyword
1 | RDB | OLTP | RDS, Aurora (Oracle, MS-SQL, PostgreSQL, MySQL, MariaDB) | ACID, SQL based, relational model, row based
2 | DW | OLAP | Redshift | SQL based, columnar, parallel server
3 | NoSQL | On-line, no consistency | DynamoDB, Neptune | API, key-value set, schema-free, parallel server, eventual consistency
4 | Cache | In-memory read cache | ElastiCache, Elasticsearch, DAX (DynamoDB Accelerator) | API, key-value set, parallel server, memory, volatile
5 | Stream | Near real-time processing | Kinesis Data Streams, Kinesis Data Firehose, DynamoDB Streams, SQS | API, key-value, parallel server, volatile
6 | Block Storage (file system) | File | EBS | OS-based accessible file
7 | Object Storage | File | S3 | URL-based accessible file
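The table above can be read as a lookup from what a workload needs to a store type. The following is a minimal Python sketch of that idea, not a sizing tool: the `DataStore` record, the `candidates` helper, and the lower-case keyword slugs are hypothetical names introduced here for illustration, and the keyword sets are a simplified transcription of the "Major Keyword" column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    kind: str
    usage: str
    aws_services: tuple
    keywords: frozenset


# Rows transcribed from the "Data Store 7 Type" table; the keyword slugs are
# an illustrative normalization, not official AWS terminology.
DATA_STORES = (
    DataStore("RDB", "OLTP", ("RDS", "Aurora"),
              frozenset({"acid", "sql", "relational", "row"})),
    DataStore("DW", "OLAP", ("Redshift",),
              frozenset({"sql", "columnar", "parallel"})),
    DataStore("NoSQL", "On-line, no consistency", ("DynamoDB", "Neptune"),
              frozenset({"api", "key-value", "schema-free", "parallel",
                         "eventual-consistency"})),
    DataStore("Cache", "In-memory read cache",
              ("ElastiCache", "Elasticsearch", "DAX"),
              frozenset({"api", "key-value", "parallel", "memory", "volatile"})),
    DataStore("Stream", "Near real-time processing",
              ("Kinesis Data Streams", "Kinesis Data Firehose",
               "DynamoDB Streams", "SQS"),
              frozenset({"api", "key-value", "parallel", "volatile"})),
    DataStore("Block Storage", "File", ("EBS",),
              frozenset({"file", "os-access"})),
    DataStore("Object Storage", "File", ("S3",),
              frozenset({"file", "url-access"})),
)


def candidates(required):
    """Return the store types whose keywords include every required keyword."""
    need = frozenset(k.lower() for k in required)
    return [store.kind for store in DATA_STORES if need <= store.keywords]
```

For example, `candidates({"sql", "columnar"})` narrows the choice to the DW row, while `candidates({"file"})` leaves both storage rows open, which is where the access-pattern and cost considerations come in.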

Reference: Data store - AWS RDS
[Figure: RDS with its data producers and consumers. Labels: Custom Application, Glue, QuickSight, DMS. Columns: Producer, Consumer, Producer & Consumer.]
[Legend]
- Development with the AWS SDK (API) or a library is required
- No development required; use the AWS Console, CLI, etc.
[QuickSight]
- Fast, easy-to-use business analytics
(SPICE: the Super-fast, Parallel, In-memory Calculation Engine)
- Supports RDS, Athena, Aurora, Redshift, Redshift Spectrum
[Glue]
- Fully managed ETL service
- ETL jobs
- Creates a metadata catalog (table definitions, schemas, etc.) that Athena, EMR, and Redshift Spectrum use
- Scheduling: dependency checking, job monitoring, and alerting are built in
- Runs on Apache Spark
[Supported File]
- JSON, CSV, ORC, Apache Parquet, Avro
- gzip, bzip2, lz4
[DMS]
- Source
: Oracle, MySQL, MS-SQL, MariaDB, PostgreSQL, Aurora for MySQL
- Target
: Oracle, MySQL, MS-SQL, MariaDB, PostgreSQL, Aurora for MySQL
: Redshift, S3

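Reference: Data store - AWS Redshift
[Figure: Redshift with its data producers and consumers. Labels: S3, Apache Spark, Glue, DMS, EMR, DynamoDB, Lambda, Kinesis Firehose, Custom Application, QuickSight. Columns: Producer, Consumer, Producer & Consumer.]
- EMR: files in the EMR cluster's HDFS are loaded in parallel.
[Apache Spark]
- In-memory engine for high-speed processing of large data sets; a general-purpose distributed cluster computing framework
- Supports structured data processing (Spark SQL), real-time processing (Spark Streaming), machine learning (MLlib), and more
- Next-generation Big Data stack (Big Think): HDFS + YARN + Spark
[AWS EMR]
- Managed cluster platform that simplifies running Big Data frameworks such as Apache Hadoop and Spark
- Supports related open-source projects such as Apache Hive and Apache Pig (analytics, BI processing)
- Transforms and moves data to and from S3, DynamoDB, and others, in both directions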

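Reference: Data store - AWS DynamoDB
[Figure: DynamoDB with its data producers and consumers. Labels: AWS CLI, AWS SDK (Application), Apache Hadoop / Apache Hive / Apache Spark, EMR, Redshift (Redshift Copy), DynamoDB Streams, DAX (DynamoDB Accelerator), Lambda, Notification. Columns: Producer, Consumer, Producer & Consumer.]
AWS SDK (Application)
- Users implement Put/Get through the API.
Apache Hadoop / Hive / Spark
- Use the AWS EMR connector.
- In an AWS EMR environment, data can be queried and stored with HiveQL.
DynamoDB Streams
- Uses the table activity log of DynamoDB.
- Use the Cross-Region Replication Library (implements replication between Regions).
- Process stream data with the DynamoDB Streams Kinesis Adapter (similar to KCL).
- Implement Lambda functions that respond automatically to DynamoDB Streams events.
- DynamoDB and DynamoDB Streams have different endpoints.
- Data can be sent on to Elasticsearch through a Logstash plug-in.
DAX (DynamoDB Accelerator)
- An in-memory cache dedicated to DynamoDB
- Compatible with the DynamoDB API
- Application code does not need to change, but the steps below are required.
- Table management operations (DDL, etc.) are not supported.
[Steps]
1. Create a DAX cluster.
2. Download the DAX SDK (DynamoDB API compatible).
3. Rebuild the app to use the DAX client.
4. Specify the DAX cluster endpoint.
[Caching Strategies]
- Read: on a cache miss, the item is fetched from DynamoDB automatically.
- Write: the item is written to DynamoDB and then updated in the cache (write-through cache).
[DynamoDB]
- Fully managed NoSQL database service
- Seamless scalability with predictable (consistent) performance: requests are rejected (throttled) rather than slowed down
: RCU/WCU settings; strongly and eventually consistent reads are available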
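The two DAX caching strategies listed on this slide (fetch from DynamoDB on a read miss, write to DynamoDB and then update the cache) can be sketched in plain Python. This is a toy model under stated assumptions, not the DAX client: `WriteThroughCache` is a hypothetical class, a dict stands in for the DynamoDB table, and eviction and TTL are left out.

```python
class WriteThroughCache:
    """Toy read-through / write-through cache in front of a backing store."""

    def __init__(self, backing_store):
        self.store = backing_store  # stands in for the DynamoDB table
        self.cache = {}             # stands in for the cache cluster's memory
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.store.get(key)  # read-through: go to the table on a miss
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        self.store[key] = value      # write to the table first
        self.cache[key] = value      # then update the cache (write-through)


table = {"user#1": {"name": "kim"}}
dax = WriteThroughCache(table)
first = dax.get("user#1")    # miss: loaded from the table
second = dax.get("user#1")   # hit: served from the cache
dax.put("user#2", {"name": "lee"})
```

Because writes always reach the table before the cache, a reader that bypasses the cache never sees data that the table does not hold; the cost is that every write pays the table's latency.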

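Reference: Data store - AWS ElastiCache
[Figure: ElastiCache accessed through the API by AWS CLI and AWS SDK (Application). Columns: Producer, Consumer, Producer & Consumer.]
- SDKs exist for Java, PHP, Python, Ruby, and .NET.
- Implement against the API using the SDK for each language.
[ElastiCache]
- Key/value in-memory cache (Redis and Memcached engines supported)
- Considerations when choosing a cache: speed and cost, data and access patterns, expiry (staleness)
e.g., keeping session information and handling sticky sessions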

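Reference: Data store - AWS Kinesis Data Streams
[Figure: Kinesis Data Streams with its producers and consumers. Labels: Kinesis-enabled Application (on both the producer and the consumer side), Log4J Appender, Apache Flume, Fluentd, Kinesis Agent, Kinesis Firehose, Lambda, EMR, Apache Storm, Kinesis Data Analytics, and the Kinesis Connector Library with targets DynamoDB, Redshift, S3, Elasticsearch. Columns: Producer, Consumer, Producer & Consumer. Legend: Plug-in.]
Figure callouts:
- Log4J Appender: replace the Apache Log4j appender with the Kinesis Log4j Appender.
- Fluentd: use the plug-in for Kinesis Data Streams.
- Kinesis Connector Library: requires KCL.
- Currently supports Java, Python, Node.js, and .NET, but Java must be installed.
- Kinesis Agent: a standalone Java program (requires installation, configuration, and startup).
- Java only.
- Kinesis Data Analytics: stream data can be processed and analyzed with standard SQL (real-time analytics, real-time dashboards, real-time metrics). Results can be delivered to another Kinesis stream.
- EMR: reads and analyzes Kinesis streams. With Hive, two different Kinesis streams can be joined.
[Fluentd]
- Open-source engine for collecting and consuming data
- Filters, buffers, and routes a wide variety of logs
- Many plug-ins are available
[Apache Storm]
- Distributed cluster computing framework for real-time processing
[Apache Flume]
- Framework for efficiently collecting, processing, and transporting large volumes of log data in a distributed environment
- Based on a simple, flexible streaming data flow architecture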

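Reference: Data store - AWS Kinesis Data Firehose
[Figure: Kinesis Data Firehose with its producers and consumers. Labels: Kinesis Agent, Kinesis Data Analytics, IoT, CloudWatch, Kinesis Streams, Fluentd, Lambda; destinations Redshift, S3, Splunk, Elasticsearch. Columns: Producer, Consumer, Producer & Consumer.]
Figure callouts:
- Kinesis Agent: a standalone Java program (requires installation, configuration, and startup).
- CloudWatch: log and event data from CloudWatch can be processed.
- A Firehose that is connected to a Kinesis stream cannot be a producer for Analytics.
- Firehose Put API: use the AWS SDK for Java, Node.js, Python, or Ruby.
- Fluentd: use the Fluentd plug-in (which calls the Kinesis Firehose API).
- Transform processing can be applied to the output data.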
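The transform step on this slide is typically done by a Lambda function that Firehose invokes with a batch of base64-encoded records and that returns each record with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The sketch below follows that record contract; the function name `transform_handler` and the rule it applies (upper-casing a hypothetical `level` field in JSON log lines) are assumptions made only for illustration.

```python
import base64
import json


def transform_handler(event, context):
    """Normalize the 'level' field of each JSON record in a Firehose batch."""
    output = []
    for record in event["records"]:
        raw = record["data"]  # base64-encoded payload as delivered by Firehose
        try:
            payload = json.loads(base64.b64decode(raw))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
            payload["level"] = str(payload.get("level", "info")).upper()
            line = json.dumps(payload) + "\n"  # newline-delimited output
            data = base64.b64encode(line.encode("utf-8")).decode("ascii")
            output.append({"recordId": record["recordId"],
                           "result": "Ok",
                           "data": data})
        except ValueError:
            # Bad base64 and bad JSON both raise subclasses of ValueError;
            # hand the record back untouched and mark it as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": raw})
    return {"records": output}
```

Every `recordId` received must appear in the response, which is why unparsable records are returned as `ProcessingFailed` rather than silently skipped.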

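Reference: Data store - AWS Elasticsearch
[Figure: Elasticsearch with its producers and consumers. Labels: Lambda, S3, Kinesis Streams, DynamoDB Streams, CloudWatch, CloudTrail, Kinesis Firehose, Logstash, Fluentd, Kibana, AWS CLI, AWS SDK (API). Columns: Producer, Consumer, Producer & Consumer.]
Figure callouts:
- Lambda: used as the event handler for S3, Kinesis, and DynamoDB Streams.
- EBS is used as the data storage area.
- Built into Elasticsearch (marked on two components in the figure).
- An endpoint for Kibana is provided.
[Logstash]
- Open-source data collection engine with real-time pipelining
- Flexibly unifies data from different sources and normalizes it into the chosen destination
- Many plug-ins are available
[Kibana]
- Powerful, polished open-source data visualization platform
- Combines a variety of visualizations with custom dashboards to give insight into data
- Provides Discover / Visualize / Dashboard / Settings features
[ElasticSearch]
- A service that makes Elasticsearch easy to deploy, operate, and scale
- Log analytics, full-text search, and application monitoring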

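Reference: Data store - AWS EBS
[Figure: EBS accessed by AWS CLI, AWS SDK, and OS commands, with S3 behind it. Columns: Producer, Consumer, Producer & Consumer.]
- AWS CLI / AWS SDK: Create, Delete, Describe, Attach, Detach, and so on
- S3: used for EC2 snapshot and restore (currently incremental backup only)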

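Reference: Data store - AWS S3
[Figure: S3 with its producers and consumers. Labels: DMS, Glue, EMR, S3 Transfer Acceleration, Kinesis Firehose, Snowball, Edge/Mobile, Storage Gateway, On-Premise (Data Center), Athena, EBS, Redshift Spectrum, SNS, SQS, Lambda, Redshift (Unload/Copy), AWS CLI, AWS SDK, QuickSight, Fluentd. Columns: Producer, Consumer, Producer & Consumer.]
Figure callouts:
- AWS CLI / AWS SDK: use the S3 API
: management and data operations
: ls, cp, mv, sync, upload/download
- S3 Select SDK
: filters data at the S3 level before it is returned (offloaded search)
: ANSI SQL queries through Presto on AWS EMR
: Select SDK available for Lambda, Java, and Python
- S3 Transfer Acceleration: fast transfer of large S3 objects
- SNS / SQS / Lambda: invoked when a change to an object is detected as an event
- Redshift Spectrum: query S3 data without loading it into Redshift, and query it together with Redshift data
- Athena: interactive query service; query S3 data with standard SQL
[Presto]
- Open-source distributed SQL query engine for low-latency, ad hoc data analysis
- Supports standard ANSI SQL
- Supports data in HDFS and S3
[S3]
- A core AWS service, and the core of the AWS Big Data strategy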

4.3 AWS Big Data services
[Figure: AWS Big Data service map. AWS services: Glue, QuickSight, DMS, Redshift, S3, RDS / Aurora, DynamoDB, EMR, DAX (DynamoDB Accelerator), DynamoDB Streams, Kinesis Firehose, Kinesis Streams, Kinesis Analytics, IoT, CloudWatch, CloudTrail, Elasticsearch, SageMaker (ML/DL platform), Athena. Open source and third party: Apache Spark, Log4J Appender, Fluentd, Apache Flume, Apache Storm, Splunk, Logstash, Kibana.]
Build a Big Data architecture with these additions:
• User applications on Lambda
• User applications with API/SQL
• 3rd-party solutions/services

5. Big Data Architecting process

5.1 Big Data Architecting process
[Figure: Business Requirements A → Decide Data store(s) → Decide Data consumer / Data producer → Data Architecture V1.0 → …; Business Requirements B → Data Architecture V2.0 → …]
Step 1: From the requirements, first decide the data store type.
Step 2: Decide the data store.
Step 3: Based on the chosen data store, decide the data producer and data consumer.
Step 4: Re-architect as business requirements change.
Considerations → data type, data structure, data access pattern, data temperature, data processing requirements, data size, data durability, processing time, cost, and so on

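6. AWS Big Data services

6.1 Introduction to AWS Lambda
[Figure: Lambda event sources: S3, DynamoDB Streams, Kinesis Streams, SNS, CloudWatch, SES, Cognito, CloudFormation, IoT, API Gateway, Kinesis Firehose]
Other event sources: CodeCommit, Config, Alexa, Lex, ALB, CloudFront, Custom Application
• Server-less, event-driven compute service
• Supported languages: Java, Node.js, .NET, Python, Go (Ruby added)
• Used for real-time processing and for building flexible back-end services
• A cron-job scheduler can be implemented with CloudWatch
• Constraints: execution time within 300 seconds (AWS raised this limit to 900 seconds in October 2018), no reuse, and stateless, so state must be stored elsewhere
• Consider permissions for the account (who/what) and whether the function runs inside or outside a VPC
• Inside a VPC, consider pre-warming, including the number of IPs and ENI allocation
• Versioning and aliases are supported, and weights can be assigned per version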
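To make the event-driven, stateless model above concrete, here is a small sketch of a Python handler for an S3 event notification. It relies only on the documented shape of the S3 event (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with the key URL-encoded); the function name `s3_event_handler` and the returned summary are illustrative assumptions.

```python
from urllib.parse import unquote_plus


def s3_event_handler(event, context):
    """Summarize the objects named in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded, with spaces sent as '+'.
        key = unquote_plus(s3["object"]["key"])
        size = s3["object"].get("size", 0)
        objects.append((bucket, key, size))
    # Nothing is kept in module-level state between invocations; anything
    # that must survive would be written to an external store such as S3
    # or DynamoDB.
    return {"count": len(objects), "objects": objects}
```

Keeping the handler a pure function of its event also makes it easy to exercise locally with a hand-built event before wiring it to a real bucket.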

6.2 Introduction to AWS EMR
Hadoop on VM
• When storage runs short, nodes must be added.
AWS EMR
• Nodes can be added temporarily, only while large jobs run.
• EC2 Spot instances can be used.

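6.3 Introduction to AWS S3
▪ Reliability / scalability
• Up to 99.999999999% durability, 99.99% availability
• Automatically replicated across 3 Availability Zones
• Replication to another region
• Virtually unlimited capacity
• Very high bandwidth
▪ Broad usability
• Athena, S3 Select, Redshift Spectrum, and so on
• Tight integration with many AWS services
▪ Tiered service by required availability, speed, and access frequency
• S3 Standard → S3 Standard-Infrequent Access → S3 One Zone-IA → Glacier
• Pay for what you use, with tiered pricing: 2.1c per GB per month → 1.25c per GB → 1c per GB → 0.4c per GB
▪ Other
• Security
• Follow-up work can be triggered through event notifications
• Versioning, lifecycle management, tag management
S3 Intelligent-Tiering
Glacier Deep Archive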
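The tiered prices quoted on this slide lend themselves to a quick worked calculation. The sketch below uses exactly those four per-GB monthly figures; the helper names are hypothetical, the storage-class keys follow the S3 API spelling, and request, retrieval, and minimum-duration charges are deliberately ignored, so treat the result as a rough comparison rather than a bill.

```python
# US cents per GB-month as quoted on the slide (not current list prices).
PRICE_CENTS_PER_GB = {
    "STANDARD": 2.1,
    "STANDARD_IA": 1.25,
    "ONEZONE_IA": 1.0,
    "GLACIER": 0.4,
}


def monthly_cost_usd(gb, storage_class):
    """Storage-only monthly cost in USD for `gb` gigabytes in one class."""
    return round(gb * PRICE_CENTS_PER_GB[storage_class] / 100, 2)


def tiered_cost_usd(gb, hot_fraction):
    """Cost when the hot fraction stays in STANDARD and the rest is in GLACIER."""
    hot_gb = gb * hot_fraction
    cold_gb = gb - hot_gb
    return round(monthly_cost_usd(hot_gb, "STANDARD")
                 + monthly_cost_usd(cold_gb, "GLACIER"), 2)


all_standard = monthly_cost_usd(10_000, "STANDARD")  # 10,000 GB kept hot
tiered = tiered_cost_usd(10_000, 0.2)                # only 20% kept hot
```

With these figures, 10,000 GB costs about 210 USD a month in S3 Standard, and about 74 USD if only a fifth of it stays in Standard and the rest moves to Glacier.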

6.4 Introduction to AWS Athena

6.5 Introduction to AWS Glue

6.5 Introduction to AWS Lake Formation

6.6 Introduction to AWS QuickSight
ML-powered anomaly detection
ML-powered forecasting

6.7 Introduction to AWS AI/ML
[Figure: AWS AI/ML stack]
• AI Service
- Vision (Image, Audio, Video): Amazon Rekognition, Amazon Rekognition Video, Amazon Kinesis Video Streams
- Language (NLP): Amazon Polly, Amazon Transcribe, Amazon Lex, Amazon Translate, Amazon Comprehend, Alexa for Business
- Recommendation: Amazon Personalize (New)
- Forecast: Amazon Forecast (New)
• ML Service: Amazon SageMaker
• Frameworks and Infrastructure: Apache MXNet, Caffe2, CNTK, PyTorch, TensorFlow, Theano, Torch, Keras, Gluon
Other labels: Algorithm Paper, User invented, Customer Service, AWS AI/ML VM

6.7 AWS Big Data Architecture Example
[Figure: example AWS Big Data architecture, built across multiple Availability Zones (AZ).
- Sources: internal data (JDBC, streaming), external data (weather, stocks, SNS, etc.), factory data (sensors, logs, etc.), File/Log, stream data, IoT data
- Real-time, Batch Process: 3rd-party ETL server (EC2), Glue (AWS ETL), Kinesis Streams and Kinesis Firehose (real-time stream processing), IoT, Hadoop / EMR (batch processing)
- Data Store (Lake): S3 (permanent data storage and staging area), DW (Redshift, Copy / Unload), Elasticsearch, shared file area (EFS); areas are marked as temporary-use or permanent/cold storage
- Data Analyze: Athena, External Table (Spectrum), Deep Learning (EC2), SageMaker (ML/DL platform)
- Data Visualize: BI Tool (EC2) behind ELB (3rd-party BI), QuickSight (AWS BI), ODBC
- AI/ML: AI Service (Lex, Polly, Rekognition), ML/DL Service (Personalize, Forecast, …), ECS, service endpoints behind ELB, API Gateway, Serverless, Deploy
- Online: Web/WAS Server (EC2) behind ELB, DB Server (RDS) with replica and standby, NoSQL (DynamoDB)
- Users: internal customers (Big Data), internal customers (AI/ML), internal customers]

7.1 AWS Redshift is…

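7.2 AWS Redshift - Architecture & Price
[Figure: Redshift architecture. Clients connect to the Leader Node over JDBC/ODBC; the Leader Node connects to the Compute Nodes, each divided into Slices, over 10GigE; S3 is used for Data Load/Unload, Backup/Restore, and Redshift Spectrum.]
Node specs
Class | Node Type | vCPU | Memory (GB) | Storage (TB) | I/O (GB/s) | Slices per node | Nodes | Max disk
Dense Compute (SSD) | dc2.large | 2 | 15 | 0.16 | 0.6 | 2 | 1~32 | 5.12 TB
Dense Compute (SSD) | dc2.8xlarge | 32 | 244 | 2.56 | 7.5 | 16 | 2~128 | 326 TB
Dense Storage (HDD) | ds2.xlarge | 4 | 31 | 2 | 0.4 | 2 | 1~32 | 64 TB
Dense Storage (HDD) | ds2.8xlarge | 36 | 244 | 16 | 3.3 | 16 | 2~128 | 2 PB
Example annual price per 1 TB (USD)
Class | Node Type | Price per hour (USD) | On-demand | 1-year, no upfront | 1-year, partial upfront | 1-year, all upfront | 3-year, partial upfront | 3-year, all upfront
Dense Compute (SSD) | dc2.large | 0.3 | 16,425 | 13,140 | 10,348 | 9,691 | 6,406 | 5,749
Dense Compute (SSD) | dc2.8xlarge | 5.8 | 19,847 | 15,878 | 12,504 | 8,336 | 6,153 | 5,756
Dense Storage (HDD) | ds2.xlarge | 1.15 | 5,037 | 3,979 | 2,972 | 2,921 | 1,310 | 1,209
Dense Storage (HDD) | ds2.8xlarge | 9.05 | 4,955 | 3,964 | 2,973 | 2,923 | 1,288 | 1,189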
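The on-demand column of the per-TB price example can be reproduced from the other two figures on this slide: the hourly price and the storage per node. The sketch below does that arithmetic, assuming a 365-day year; the dictionary and function names are illustrative, and the prices are the ones shown on the slide rather than current list prices.

```python
HOURS_PER_YEAR = 24 * 365

# node type -> (on-demand USD per hour, storage in TB per node), from the slide
NODES = {
    "dc2.large": (0.3, 0.16),
    "dc2.8xlarge": (5.8, 2.56),
    "ds2.xlarge": (1.15, 2.0),
    "ds2.8xlarge": (9.05, 16.0),
}


def annual_usd_per_tb(node_type):
    """On-demand cost of running one node for a year, per TB of its storage."""
    hourly_usd, storage_tb = NODES[node_type]
    return hourly_usd * HOURS_PER_YEAR / storage_tb


per_tb = {name: round(annual_usd_per_tb(name)) for name in NODES}
```

Rounded to whole dollars this gives 16,425 / 19,847 / 5,037 / 4,955 for the four node types, matching the on-demand column, and it shows why the HDD-backed Dense Storage nodes come out far cheaper per TB than the SSD-backed Dense Compute nodes.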

7.3 AWS Redshift – Spectrum
[Figure: query flow. Client (JDBC/ODBC) → Amazon Redshift (leader and compute nodes) → Amazon Redshift Spectrum nodes (1, 2, 3, 4, … N) → Amazon S3 (exabyte-scale object storage), with the Data Catalog / Apache Hive Metastore alongside.]
1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. Query is optimized and compiled at the leader node. Determine what gets run locally and what goes to Amazon Redshift Spectrum.
3. Query plan is sent to all compute nodes.
4. Compute nodes obtain partition info from Data Catalog; dynamically prune partitions.
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
6. Amazon Redshift Spectrum nodes scan your S3 data.
7. Amazon Redshift Spectrum projects, filters, and aggregates.
8. Final aggregations and joins with local Amazon Redshift tables done in-cluster.
9. Result is sent back to client.
• Data archiving to cold storage (S3)
• Archived data on S3 can be joined with Redshift data using SQL
• Cost saving for old, infrequently accessed data
• Performance for mass reads
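The middle of the Spectrum query flow (prune partitions, scan the remaining S3 data in pieces, aggregate partially, then finish the aggregation in the cluster) can be mimicked in a few lines of Python. This is a conceptual sketch only: `PARTITIONS`, `prune`, `scan_and_aggregate`, and `run_query` are hypothetical names, and in-memory lists stand in for files under S3 prefixes.

```python
from collections import Counter

# Hypothetical external table: partition name -> rows stored under that prefix.
PARTITIONS = {
    "dt=2018-12-01": [{"region": "kr"}, {"region": "us"}],
    "dt=2018-12-02": [{"region": "kr"}, {"region": "kr"}],
    "dt=2018-12-03": [{"region": "us"}],
}


def prune(partitions, wanted):
    """Keep only the partitions that the predicate can match."""
    return {name: rows for name, rows in partitions.items() if wanted(name)}


def scan_and_aggregate(rows):
    """One scan worker: project the grouping column and count per group."""
    return Counter(row["region"] for row in rows)


def run_query(partitions, wanted):
    """Prune, scan each surviving partition, then merge the partial counts."""
    partials = [scan_and_aggregate(rows)
                for rows in prune(partitions, wanted).values()]
    total = Counter()
    for partial in partials:
        total.update(partial)  # final aggregation, done "in-cluster"
    return dict(total)


# Roughly: SELECT region, COUNT(*) ... WHERE dt >= '2018-12-02' GROUP BY region
result = run_query(PARTITIONS, lambda name: name >= "dt=2018-12-02")
```

Pruning is what keeps the scan cost proportional to the data the query actually needs: the first partition is never read at all in this example.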

7.4 AWS Redshift – Managed Service
▪ Quick and easy setup
▪ Availability
• Multiple copies within cluster
▪ Backup
• Continuous and incremental backups to S3
• Continuous and incremental backups across regions
• Amazon Redshift provides free storage for snapshots that is equal to the storage capacity of your cluster.
• Automated snapshots → retention policy (1~35 days)
• Manual snapshots → the system never deletes a manual snapshot
▪ Restore (streaming restore)
• The whole cluster or only specific tables can be restored
• During restore, your node is provisioned in less than two minutes
• Query the provisioned node immediately, as data is automatically streamed from the S3 snapshot
• Most frequently used data is restored first
[Figure: snapshots in Amazon S3 in Region 1 and in Amazon S3 in Region 2]

7.5 AWS Redshift – Node resize
1. Cluster is put into read-only mode
2. New cluster is provisioned according to resizing needs
3. Node-to-node parallel data copy
(only charged for the source cluster)
4. There will be a short outage at the end of cluster resizing
(automatic SQL endpoint switchover via DNS)
5. Decommission the source cluster
To resize without downtime:
• Restore cluster from snapshot
• Apply to the new cluster the schema and data updates that were made to the online cluster during the restore
• Change applications to point to the new cluster
Redshift concurrency scaling

7.6 AWS Redshift – Performance
1. Parallel server architecture → parallel execution by slice (Query, Load/Unload, Backup/Restore)
2. Columnar store and related loading features
3. Local storage (HDD/SSD) and 10GigE network
4. Data distribution options across slices (EVEN, HASH, ALL)
5. Column compression algorithm chosen by data characteristics
6. SQL tuning features
7. Sort key: zone map (Single, Compound, Interleaved)
8. WLM (Workload Manager) and SQA (Short Query Acceleration)
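The three distribution options in item 4 differ only in how each row is assigned to a bucket. The sketch below illustrates the assignment rules under simplifying assumptions: `distribute` is a hypothetical helper, CRC32 stands in for whatever hash the engine really uses, and each bucket stands for a slice (for ALL, Redshift keeps one full copy per node, which the sketch models as one copy per bucket). In Redshift DDL the hash option is spelled `DISTSTYLE KEY`.

```python
import zlib


def distribute(rows, num_buckets, style, key=None):
    """Assign rows (dicts) to buckets under a simplified EVEN, HASH, or ALL policy."""
    buckets = [[] for _ in range(num_buckets)]
    for i, row in enumerate(rows):
        if style == "EVEN":
            buckets[i % num_buckets].append(row)  # round-robin
        elif style == "HASH":
            # Stable stand-in hash: equal key values always land together.
            h = zlib.crc32(str(row[key]).encode("utf-8"))
            buckets[h % num_buckets].append(row)
        elif style == "ALL":
            for bucket in buckets:
                bucket.append(row)                # full copy everywhere
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return buckets


orders = [{"id": i, "cust": i % 3} for i in range(12)]
even = distribute(orders, 4, "EVEN")
hashed = distribute(orders, 4, "HASH", key="cust")
copied = distribute(orders, 4, "ALL")
```

EVEN balances storage but ignores join keys, HASH co-locates rows that share a key (so joins on that key need no data movement, at the risk of skew), and ALL trades storage for local joins against small dimension tables.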

7.6 AWS Redshift – Performance
As-is analysis
To-be performance target

7.6 AWS Redshift – Performance
Performance test results
Company T

7.7 AWS Redshift – Security
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3 encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM, AWS CloudHSM & KMS support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

7.8 Redshift summary
• Parallel Server Architecture
• Columnar store
• Cost efficiency
• RedShift Spectrum
• Performance
• Security
• Flexibility
• Managed Service

Closing
[Figure: 3rd industrial revolution: OIL; 4th industrial revolution: Data (Big Data)]
1. Open Source
2. Data Lake
3. Public Cloud: cost, flexibility, efficiency
▪ A Big Data build strategy is grounded in data, and strategy in these three areas is what matters.

Once you have decided to move to the cloud,
you have to choose who to go with.
If you are looking for a partner you can trust from start to finish,
Bespin Global is the answer.
Thank you.
