Recommender System Principles and Implementation, Centered on AWS Personalize

© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved
Recommender Systems: Principles and Implementation Cases
Sungmin Kim
AWS Solutions Architect
Amazon Personalize

Agenda
• What recommendation is
• Why recommendation matters
• Recommendation performance metrics
• Introduction to Amazon Personalize
• Architectures for applying Amazon Personalize

What is "recommendation"?

Why does "recommendation" matter?
Key e-commerce metrics
• Retention Rate: how long customers stay
• Churn Rate: how many customers leave
• (Purchase) Conversion Rate
[Diagram labels: Recommendation, Search, Checkout]

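Why does "recommendation" matter?
Search
• Find the desired result quickly
• Accurate
• Fast
Recommendation
• New experiences
• Exploration and browsing
• The joy of discovery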

Types of recommendation algorithms
• Collaborative Filtering
• Content-based Filtering
• Hybrid = Collaborative Filtering + Content-based Filtering

Collaborative Filtering
[Diagram: user1, user2, and user3 shown twice, linked by the items they have in common]

Content-based Filtering
[Diagram: an article read by me is matched to a similar new article, which is then recommended]

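Collaborative Filtering
(+)
• No need to analyze content.
• Recommendations can be computed from user behavior logs alone.
• Applicable in many domains.
(-)
• Suffers from the cold-start problem.
Content-based Filtering
(+)
• Recommendations are possible from content analysis alone.
• Recommendations can be computed without user behavior logs.
• Mitigates the cold-start problem.
(-)
• Applicable only where content is rich.
• Relatively limited range of application.
(e.g., hard to apply to music, movies, etc.)
Hybrid = Collaborative + Content-based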
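The trade-off above can be illustrated with a very small user-based collaborative filtering sketch. It is only a toy under stated assumptions: the `interactions` data, the set-based cosine similarity, and the `recommend` helper are all hypothetical, and this is not the algorithm Amazon Personalize uses. It shows that behavior logs alone are enough to recommend, and that a user with no logs (cold start) gets nothing.

```python
from math import sqrt

# Hypothetical interaction data: user -> set of items the user interacted with.
interactions = {
    "user1": {"A", "B", "C"},
    "user2": {"A", "B", "D"},
    "user3": {"E"},
}


def cosine(a, b):
    """Cosine similarity between two item sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(user, data, k=3):
    """Score unseen items by the similarity of the users who touched them."""
    seen = data.get(user, set())
    scores = {}
    for other, items in data.items():
        if other == user:
            continue
        sim = cosine(seen, items)
        if sim == 0.0:
            continue  # users with nothing in common contribute no signal
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


print(recommend("user1", interactions))     # user2 is similar, so "D" is suggested
print(recommend("new_user", interactions))  # cold start: no logs, no recommendations
```

A content-based or hybrid approach would fall back to item attributes in the second case, which is the mitigation listed above.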

How should recommendations be applied?
Personalization
Non-Personalization

[Diagram: recommendation approaches (CF, Content-based, Hybrid) arranged by data (User-Item Interactions, e.g. View, Buy, Cart; Content, e.g. Customer reviews, Product Details) and by application (Personalization, Non-Personalization), with the Cold-Start Problem marked]

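Developing and applying a recommendation service: an iterative process
[Diagram: a cycle of preparing data, defining requirements, developing the recommendation model, deploying and applying it, and running an A/B Test; new features and new requirements restart the cycle]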

Amazon Personalize
A real-time personalization and recommendation service built on the machine learning technology used at amazon.com
• Deliver high-quality recommendations
• Deliver personalization in days, not months
• Real-time
• Works with any product or content

Amazon Personalize: How it works
• Data Sets (in a Data Set Group): Users, Items, Interactions
  - User events / interactions
  - Item metadata (a.k.a. catalog information, optional)
  - User metadata (e.g. demographics, optional)
• Solution (Recipes): model selection, training, tuning, and verification
• Campaign: model hosting and inference
• Runtime APIs
  - GetRecommendations
  - GetPersonalizedRanking
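What data do you need to prepare?
• Three kinds of data
  - Users
    · User metadata
    · Age, gender, membership tier, etc.
  - Items
    · Item metadata
    · Price, SKU (stock keeping unit), stock availability, etc.
  - (User-Item) Interactions
    · Logs of user actions on items
    · Purchase (buy), add to cart (cart), product view (view), etc.
• Interactions data is mandatory because the recommendations are computed from it
• Users and Items data are used to exclude data from the recommendation computation and to filter recommendation results
• Input data is stored in S3 in CSV format; the first row must contain the column headers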
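A minimal sketch of producing an Interactions CSV with the required header row. The column names USER_ID, ITEM_ID, and TIMESTAMP are the fields Personalize requires for Interactions; EVENT_TYPE is included to carry view/cart/buy. The sample events and the helper name are hypothetical, and uploading the result to S3 is left out.

```python
import csv
import io

# Hypothetical behavior logs: (user, item, unix timestamp in seconds, event type).
events = [
    ("u1", "sku-100", 1600000000, "view"),
    ("u1", "sku-100", 1600000060, "cart"),
    ("u2", "sku-200", 1600000120, "buy"),
]


def to_interactions_csv(rows):
    """Serialize interaction rows as CSV text with the column header first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"])
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_interactions_csv(events)
print(csv_text)
```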

Formatting Your Input Data: Example
[Figure labels: column header, CSV format]

Users, Items, Interactions schemas
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
{
"type": "record",
"name": "Users | Items | Interactions",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "Field Name",
"type": "Data Type"
},
….
],
"version": "1.0"
}
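Filling in the template above for the Interactions dataset might look like the following sketch. The four fields shown are an assumption about what a typical view/cart/buy log would carry (USER_ID, ITEM_ID, and TIMESTAMP are required for Interactions; EVENT_TYPE is added here); the function name is hypothetical. The resulting JSON string is what would be passed as the schema when registering it with Personalize.

```python
import json


def interactions_schema():
    """Build an Avro-style schema dict for an Interactions dataset."""
    return {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {"name": "USER_ID", "type": "string"},
            {"name": "ITEM_ID", "type": "string"},
            {"name": "TIMESTAMP", "type": "long"},   # Unix epoch seconds
            {"name": "EVENT_TYPE", "type": "string"},
        ],
        "version": "1.0",
    }


schema_json = json.dumps(interactions_schema(), indent=2)
print(schema_json)
```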

How much data do you need to prepare?
Minimum Suggested Data Volume
• More than 50 users.
• More than 50 items.
• More than 1,500 interactions.
※ https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md

Use cases by recipe
• User personalization
  - Recipes: User-Personalization; HRNN, HRNN-Metadata, HRNN-Coldstart (legacy); Popularity-Count (baseline)
• Similar items
  - Recipe: SIMS
• Personalized ranking
  - Recipe: Personalized-Ranking

[Diagram: the grid of data (User-Item Interactions, e.g. View, Buy, Cart; Content, e.g. Customer reviews, Product Details) against application (Personalization, Non-Personalization), annotated with the Personalize recipes User-Personalization, Popularity-Count, and SIMS]

Like a delicious dish, what makes a good recommendation?
• Coverage
• Relevance (≈ Accuracy)
  - Mean Reciprocal Rank@K
  - NDCG@K
  - Precision@K
※ Relevance: https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html
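The metrics listed above can be sketched in a few lines of Python. These are common textbook definitions with binary relevance, written as an illustration only; the exact formulas Amazon Personalize reports may differ, and all function names here are hypothetical. The reciprocal rank is per user; Mean Reciprocal Rank is its average over users.

```python
from math import log2


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / position of the first relevant item in the top k, or 0 if none."""
    for pos, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the ranking divided by that of an ideal ranking."""
    dcg = sum(1.0 / log2(pos + 1)
              for pos, item in enumerate(ranked[:k], start=1) if item in relevant)
    ideal = sum(1.0 / log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def coverage(all_recommendations, catalog):
    """Share of the catalog that appears in at least one recommendation list."""
    if not catalog:
        return 0.0
    recommended = set().union(*all_recommendations) if all_recommendations else set()
    return len(recommended & set(catalog)) / len(catalog)


ranked = ["A", "B", "C", "D"]
relevant = {"B", "D"}
print(precision_at_k(ranked, relevant, 4))
print(reciprocal_rank_at_k(ranked, relevant, 4))
print(round(ndcg_at_k(ranked, relevant, 4), 3))
print(coverage([["A", "B"], ["B", "C"]], ["A", "B", "C", "D", "E"]))
```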

What makes a recommendation truly better?
• Coverage
• Relevance (≈ Accuracy)
  - Mean Reciprocal Rank@K
  - NDCG@K
  - Precision@K
• Serendipity (≈ Surprise)
※ Relevance: https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html

relevance
serendipity
Why Serendipity?
It keeps your customers interested.

Does Amazon Personalize really work?
https://aws.amazon.com/ko/solutions/case-studies/brandi/

Customers using Amazon Personalize

[Architecture diagram: Web Server handling visit, view, cart, and buy events; stored data: Users, Items, Transactions (Interactions)]

[Architecture diagram: the Web Server sends a request to a Reco. Server (marked "?") and receives recommendations in the response; events: visit, view, cart, buy; data: Users, Items, Transactions (Interactions)]

[Architecture diagram: Web Server (visit, view, cart, buy) with Amazon Personalize in the recommendation server role; data: Users, Items, Transactions (Interactions)]

Amazon Personalize
[Diagram: Data Sets (Users, Items, and Interactions) in a Data Set Group; Solution (Recipes): model selection, training, tuning, and verification; Campaign: model hosting and inference]

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, and Transactions (Interactions) data in S3; Personalize steps: Dataset Group, Dataset Import, Recipe & Solution, Campaign. Open question: automation?]

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, and Transactions (Interactions) data in S3; Personalize steps: Dataset Group, Dataset Import, Recipe & Solution, Campaign. Answer to the automation question: AWS Step Functions]
https://github.com/aws-samples/amazon-personalize-samples/tree/master/next_steps/operations/ml_ops

https://github.com/aws-samples/amazon-personalize-samples/tree/master/next_steps/operations/ml_ops
Amazon Personalize MLOps using Step Functions

AWS Step Functions Use Cases
#1: Function orchestration
#2: Branching
#3: Error handling
#4: Human in the loop
#5: Parallel processing
#6: Dynamic parallelism

Amazon Personalize MLOps using Step Functions

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, and Transactions (Interactions) data in S3; AWS Step Functions around the Personalize steps Dataset Group, Dataset Import, Recipe & Solution, Campaign]

Real-time updates of user interactions?
[Architecture diagram: Web Server (visit, view, cart, buy); S3 (Users, Items, Transactions (Interactions)); Step Functions; Personalize; Runtime API]

Amazon Personalize
[Diagram: Data Sets (Users, Items, and Interactions) in a Data Set Group, plus an Events Tracker; Solution (Recipes): model selection, training, tuning, and verification; Campaign: model hosting and inference]

Amazon Personalize
[Diagram: Real-Time Events enter through the Events Tracker alongside the Data Sets (Users, Items, and Interactions) in a Data Set Group; Solution (Recipes): model selection, training, tuning, and verification; Campaign: model hosting and inference]

Real-time updates of user interactions
[Architecture diagram: Web Server (visit, view, cart, buy); S3 (Users, Items, Transactions (Interactions)); Step Functions; Personalize; Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Event Tracker; S3 (Users, Items, Transactions (Interactions)); Step Functions; Personalize; Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; Event Tracker; S3 (Users, Items, Transactions (Interactions)); Step Functions; Personalize; Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; Lambda calling PutEvents on the Event Tracker; S3 (Users, Items, Transactions (Interactions)); Step Functions; Personalize; Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; Lambda calling PutEvents on the Event Tracker; Kinesis Firehose delivering to S3 (Users, Items, Transactions (Interactions)); Step Functions; Personalize; Runtime API]
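The Lambda step in this architecture could look roughly like the sketch below: it decodes the base64 payload of each Kinesis record and forwards it with PutEvents. The client is expected to behave like `boto3.client("personalize-events")`. The payload field names (`user_id`, `session_id`, `item_id`, `event_type`, `timestamp`), the tracking ID, and the `FakeEventsClient` stand-in are assumptions for illustration, not part of the original deck.

```python
import base64
import json
from datetime import datetime, timezone


def forward_to_personalize(kinesis_event, events_client, tracking_id):
    """Decode Kinesis records and send each one to Personalize via PutEvents."""
    sent = 0
    for record in kinesis_event["Records"]:
        # Kinesis delivers the payload base64-encoded inside the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        events_client.put_events(
            trackingId=tracking_id,
            userId=payload["user_id"],
            sessionId=payload["session_id"],
            eventList=[{
                "eventType": payload["event_type"],
                "itemId": payload["item_id"],
                "sentAt": datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc),
            }],
        )
        sent += 1
    return sent


class FakeEventsClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_events(self, **kwargs):
        self.calls.append(kwargs)


sample = {"user_id": "u1", "session_id": "s1", "item_id": "sku-100",
          "event_type": "view", "timestamp": 1600000000}
event = {"Records": [
    {"kinesis": {"data": base64.b64encode(json.dumps(sample).encode()).decode()}}
]}
client = FakeEventsClient()
print(forward_to_personalize(event, client, "hypothetical-tracking-id"))
```

Sending one PutEvents call per record keeps the sketch simple; grouping events by user and session would reduce the number of calls.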

Filtering Recommendations
• Filter: exclude records from training
  - EVENT_TYPE: when you want to use only records of a specific EVENT_TYPE
  - EVENT_VALUE: when you want to use only records at or above a threshold value (e.g. Review Rating > 2.0, Watch > 5)
• Dynamic Filter: exclude items from the results when the recommendation API is called
[Diagram: Amazon Personalize steps: Dataset Group, Dataset Import, Recipe & Solution, Campaign]
※ https://docs.aws.amazon.com/personalize/latest/dg/filter.html

Dynamic Filter
• Filtering by item
  - EXCLUDE ItemId WHERE items.genre IN ($GENRE)
  - EXCLUDE ItemId WHERE items.genre IN ("Comedy")
  - INCLUDE ItemId WHERE items.number_of_downloads < 20
• Filtering by interactions
  - INCLUDE ItemId WHERE interactions.event_type IN ("*")
  - EXCLUDE ItemId WHERE interactions.event_type IN ("click", "stream")
• Filtering by item based on user properties
  - EXCLUDE ItemId WHERE items.number_of_downloads < 20 IF CurrentUser.age > 18 AND CurrentUser.age < 30
  - INCLUDE ItemId WHERE items.genre IN ("Comedy") | EXCLUDE ItemId WHERE items.description IN ("classic")
※ Amazon Personalize now supports dynamic filters for applying business rules to your recommendations on the fly
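A sketch of passing a value for the `$GENRE` placeholder at request time. It assumes a filter with the expression `EXCLUDE ItemId WHERE items.genre IN ($GENRE)` has already been created and that its ARN is known; for an IN placeholder, each value is wrapped in double quotes and multiple values are separated by commas. The helper names and the `FakeRuntime` stand-in are hypothetical.

```python
def in_list_value(values):
    """Format values for an IN ($PARAM) placeholder: quoted and comma-separated."""
    return ",".join('"{}"'.format(v) for v in values)


def recommend_excluding_genres(runtime_client, campaign_arn, filter_arn, user_id, genres):
    """Call GetRecommendations with a dynamic filter that excludes the given genres."""
    response = runtime_client.get_recommendations(
        campaignArn=campaign_arn,
        userId=user_id,
        filterArn=filter_arn,
        filterValues={"GENRE": in_list_value(genres)},
    )
    return [item["itemId"] for item in response["itemList"]]


class FakeRuntime:
    """Hypothetical stand-in for the personalize-runtime client."""

    def __init__(self):
        self.last_request = None

    def get_recommendations(self, **kwargs):
        self.last_request = kwargs
        return {"itemList": [{"itemId": "item-1"}, {"itemId": "item-2"}]}


runtime = FakeRuntime()
items = recommend_excluding_genres(runtime, "campaign-arn", "filter-arn", "user-1",
                                   ["Comedy", "Drama"])
print(items, runtime.last_request["filterValues"])
```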

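Applying recommendation filtering as business logic
[Diagram: two setups compared, labeled in order "Using Amazon Personalize filtering" (Web Server, Amazon Personalize) and "Applying filtering business logic" (Web Server, Lambda, Amazon Personalize); the recommendation results, with filtered-out items marked X, are labeled "Partially Filled" and "Fully Filled"]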
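One possible way to keep the list full when filtering is done in application code (for example, in a Lambda function) is to over-fetch and filter until enough items survive. The sketch below is an assumption about how such post-processing could be written, not the method shown in the deck; `fetch`, `is_allowed`, and the out-of-stock rule are hypothetical.

```python
def fill_after_filtering(fetch, is_allowed, wanted, max_fetch=100):
    """Over-fetch and filter until `wanted` allowed items are collected.

    fetch(n) returns the top-n recommended item IDs; is_allowed(item) is the
    business rule. The fetch size doubles until the list is full, the source
    runs out of items, or max_fetch is reached.
    """
    size = wanted
    while True:
        candidates = fetch(size)
        kept = [item for item in candidates if is_allowed(item)][:wanted]
        if len(kept) == wanted or size >= max_fetch or len(candidates) < size:
            return kept
        size = min(size * 2, max_fetch)


# Hypothetical ranked recommendations and a business rule (skip out-of-stock items).
ranked = [f"item-{n}" for n in range(20)]
out_of_stock = {"item-1", "item-3"}

result = fill_after_filtering(
    fetch=lambda n: ranked[:n],
    is_allowed=lambda item: item not in out_of_stock,
    wanted=4,
)
print(result)
```

Filtering a single fetch of four items would leave only two here; the retry with a larger fetch is what restores the full list.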

Batch Recommendation
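• Use cases
  § When you want to compute and store recommendations for a large number of users at once
  § When you want to deliver recommendations continuously through a batch-based workflow (e.g. sending emails or notifications)
• Cost
  § When it is acceptable to serve users the same recommendation results
  § Batch processing is far more convenient and economical
• How to run a batch inference job
  § AWS web console
  § API call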
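A batch inference job reads its input from a file in S3 with one JSON object per line. The sketch below only builds that input for a user-based recipe, assuming the `{"userId": "..."}` line format; the helper name is hypothetical, and uploading the file and starting the job (from the console or through the API) are left out.

```python
import json


def batch_input_lines(user_ids):
    """Build JSON Lines input: one {"userId": ...} object per user."""
    return "\n".join(json.dumps({"userId": str(uid)}) for uid in user_ids) + "\n"


payload = batch_input_lines([1, 2, "u-3"])
print(payload)
```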

Batch Recommendation
[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Transactions (Interactions)); Personalize Batch Inference; S3; Glue; DynamoDB; Lambda; API GW]

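Real-time vs. batch recommendation cost
[Chart: cost against data volume for real-time recommendation and batch recommendation]
Cost factors:
• Volume of training data
• Model training time
• Inference API call time (TPS)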

Summary
• Why recommendation matters
- It affects key e-commerce metrics such as Retention Rate, Churn Rate, and Conversion Rate
• Criteria for evaluating recommendations
- A balance of Coverage, Relevance, and Serendipity
• Data needed to apply Amazon Personalize
- Users, Items, Interactions
• Amazon Personalize Recipes
- User-Personalization, SIMS, Popularity-Count, Personalized-Ranking
• Steps for applying Amazon Personalize
- Data Ingestion (create Data Set Group & import Data Sets) → Training (create Recipe & Solution) → Inference (create Campaign)
• Automating recommendation computation with AWS Step Functions (MLOps)
• Ways to filter recommendation results
- Filter the Interactions data when computing recommendations
- Filter the results when calling the recommendation API (Dynamic Filter)
- Post-process the computed recommendations (apply business logic)

Reference
• Amazon Personalize Immersion Day
• https://personalization-immersionday.workshop.aws/en/
• Sample Code
• https://github.com/aws-samples/amazon-personalize-samples/
• Personalize Cheat Sheet
• https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md
• AWS Resource Hub: AI & Machine Learning
• https://kr-resources.awscloud.com/aws-ai-and-machinelearning

Building a Data Analytics System for Recommendation Services
Sungmin Kim

Agenda
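• Background knowledge for data analytics
  - Data structure
  - Data temperature spectrum
  - Data pipeline
• Data needed to build a recommendation system
  - User behavior logs
  - Recommendation performance metrics
• Data analytics architecture for collecting user behavior logs
• Extending the data analytics system for recommendation performance analysis
• Lessons Learned: Architectural Principles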

Three concepts needed for data analytics

Structured, Unstructured, and Semi-Structured

Data Temperature Spectrum
Attribute | Hot data | Warm data | Cold data
Request rate | High | | Low
Cost / GB | High | | Low
Latency | Low | | High
Data Volume | Low | | High
[Chart: storage types placed by Structure (Low to High) and temperature: In-Memory, SQL, NoSQL, Search, Graph, Object Storage, Archive Storage]

Simplify Big Data Processing
[Diagram: Data → Collect → Store → Process/Analyze (ETL) → Consume → Answers & Insights; considerations: Time to answer (Latency), Throughput, Cost]

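Data needed for a recommendation system
• User behavior logs for computing recommendations
  - Product detail page views
  - Add to cart
  - Product purchases
  - …
• Data for measuring recommendation performance
  - Number of impressions of recommended items
  - Number of clicks on recommended items
  - Display position of recommended items
  - …
[Figure labels: visit, view, cart, buy]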

Collecting user behavior logs
[Diagram: Amazon Personalize with Real-Time Events entering through the Events Tracker alongside the Data Sets (Users, Items, and Interactions) in a Data Set Group; Solution (Recipes): model selection, training, tuning, and verification; Campaign: model hosting and inference]

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, Interactions; Amazon Personalize]

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, Interactions; Amazon Personalize steps: Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, Interactions; AWS Step Functions workflow around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, Interactions; S3; AWS Step Functions workflow around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Users, Items, and Interactions stored in S3; AWS Step Functions workflow around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API. Question between the Web Server and S3: how to deliver the data quickly and without loss?]

Key Components of Real-time Analytics
• Data Source: devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: data from tens of thousands of data sources can be written to a single stream
• Stream Storage: data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
• Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink: data lake (most common), database (least common)

[Architecture diagram: Web Server (visit, view, cart, buy) as the Data Source; S3 (Users, Items, Interactions) as the Data Sink; Step Functions; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy) as the Data Source; Stream Storage; S3 (Users, Items, Interactions) as the Data Sink; Step Functions; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy) as the Data Source; Stream Storage; Stream Delivery; S3 (Users, Items, Interactions) as the Data Sink; Step Functions; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Stream Storage options: Kinesis Data Streams, Managed Streaming for Kafka; Stream Delivery: Kinesis Data Firehose; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API]

Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce

Anatomy of Amazon Kinesis Data Streams / Amazon Managed Streaming for Kafka
[Diagram: Producers send records with partition keys (PK) through a Hash Function to shard/partition-1, shard/partition-2, and shard/partition-3; each shard holds numbered records from oldest data to newest data; the Consumers in a Consumer Group each track their next consumer offset]
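The hash-function step in the diagram can be sketched as follows. Kinesis Data Streams maps a partition key to a 128-bit integer with MD5 and assigns the record to the shard whose hash key range contains it; the sketch assumes the shards split the range evenly, which is a simplification, and the function name is hypothetical. Kafka's default partitioner uses a different hash, so this is only a model of the idea.

```python
import hashlib


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index via MD5 over equal hash key ranges."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // shard_count
    # The last shard absorbs the remainder left by integer division.
    return min(hash_value // range_size, shard_count - 1)


for key in ("user-1", "user-2", "user-3", "user-1"):
    print(key, "->", shard_for(key, 3))
```

Because the mapping depends only on the key, all events of one user land on the same shard, which is what preserves per-user ordering.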

Comparing Amazon Kinesis Data Streams to MSK
Amazon Kinesis Data Streams
• Streams and shards
• AWS API experience
• Throughput provisioning model
• Seamless scaling
• Typically lower costs
• Deep AWS integrations
Amazon MSK
• Topics and partitions
• Open-source compatibility
• Strong third-party tooling
• Cluster provisioning model
• Apache Kafka scaling isn’t seamless to clients
• Raw performance

Stream Ingestion
• AWS SDKs
  - Publish directly from application code via APIs
  - AWS Mobile SDK
• Kinesis Agent
  - Monitors log files and forwards lines as messages to Kinesis Data Streams
• Kinesis Producer Library (KPL)
  - Background process aggregates and batches messages
• 3rd party and open source
  - Kafka Connect (kinesis-kafka-connector)
  - fluentd (aws-fluent-plugin-kinesis)
  - Log4J Appender (kinesis-log4j-appender)
  - and more …
[Diagram: Data Source → Stream Ingestion → Stream Storage (Amazon Kinesis Data Streams) → Stream Process → Data Sink]

Stream Delivery
Sources for Kinesis Data Firehose:
• Kinesis Agent
• CloudWatch Logs
• CloudWatch Events
• AWS IoT
• Direct PUT using APIs
• Kinesis Data Streams
• MSK (Kafka) using Kafka Connect
[Diagram: Data Source → Stream Ingestion → Stream Storage → Stream Process → Stream Delivery (Kinesis Data Firehose, with Kinesis Data Analytics) → Data Sink (S3, Redshift, Elasticsearch Service)]

Amazon Kinesis Data Firehose
• Zero administration and seamless elasticity
• Direct-to-data store integration
• Serverless continuous data transformations
• Near real-time

Kinesis Firehose: Filter, Enrich, Convert
[Diagram, steps numbered 1 to 3: Data Source → apache log → Kinesis Data Firehose → apache log → Lambda function (geo-ip) → json → Kinesis Data Firehose → json → Data Sink]
Input (apache log):
[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
Lambda function output:
{
"recordId": "1",
"result": "Ok",
"data": {
"date": "2017/10/11 14:32:52",
"status": "error",
"source": "192.34.86.178",
"city": "Boston",
"state": "MA"
}
},
{
"recordId": "2",
"result": "Dropped"
}
Delivered to the Data Sink (json):
{
"date": "2017/10/11 14:32:52",
"status": "error",
"source": "192.34.86.178",
"city": "Boston",
"state": "MA"
}
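A transformation Lambda for this slide might look like the sketch below. It follows the Firehose transformation contract (each input record has a `recordId` and base64 `data`; each output record reports `Ok`, `Dropped`, or `ProcessingFailed`). The regular expression, the rule of keeping only `error` lines, and the omission of the geo-ip enrichment and date reformatting are assumptions made to keep the example self-contained.

```python
import base64
import json
import re

# Matches lines such as: [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
LOG_PATTERN = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>[^\]]+)\] \[client (?P<source>[^\]]+)\]"
)


def handler(event, context=None):
    """Convert apache error-log lines to JSON and drop everything else."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        match = LOG_PATTERN.match(line)
        if match is None or match.group("status") != "error":
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        doc = json.dumps(match.groupdict()) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(doc.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}


def _encode(line):
    return base64.b64encode(line.encode("utf-8")).decode("utf-8")


sample_event = {"records": [
    {"recordId": "1", "data": _encode("[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]")},
    {"recordId": "2", "data": _encode("[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]")},
]}
result = handler(sample_event)
print([r["result"] for r in result["records"]])
```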

Pre-built Data Conversion
• Convert the format of your input data from JSON to a columnar data format (Apache Parquet or Apache ORC) before storing the data in Amazon S3
• Works in conjunction with the transform features to convert other formats to JSON before the data conversion
[Diagram: Data Source → JSON data → Kinesis Data Firehose (schema from the AWS Glue Data Catalog; convert to columnar format) → Amazon S3, with a /failed location]

[Architecture diagram: Web Server (visit, view, cart, buy); Stream Storage options: Kinesis Data Streams, Managed Streaming for Kafka; Stream Delivery: Kinesis Data Firehose; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; Kinesis Data Firehose; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; Kinesis Data Firehose; S3 (Users, Items, Interactions); Lambda and Event Tracker; Step Functions; Personalize Runtime API]

Recommendation performance analysis
• Key e-commerce metrics
  - Retention Rate (time on site)
  - Churn Rate
  - Conversion Rate
• Recommendation algorithm metrics (A/B Test)
  - Coverage
  - CTR (Click-through Rate) ≈ Relevance + Serendipity
If you can’t measure it, you can’t improve it.
– Peter Drucker
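As a small illustration of measuring CTR in an A/B test, the sketch below divides clicks by impressions per variant. The variant names and counts are made up for the example, and a real comparison would also need a significance test before declaring a winner.

```python
def click_through_rate(impressions, clicks):
    """CTR = clicks on recommended items / impressions of recommended items."""
    return clicks / impressions if impressions else 0.0


def compare_variants(stats):
    """Compute the CTR for each A/B test variant."""
    return {name: click_through_rate(s["impressions"], s["clicks"])
            for name, s in stats.items()}


# Hypothetical counts collected from impression and click logs.
ab_stats = {
    "A": {"impressions": 1000, "clicks": 30},
    "B": {"impressions": 1000, "clicks": 50},
}
ctr = compare_variants(ab_stats)
print(ctr)
```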

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Lambda and Event Tracker; Step Functions; Personalize Runtime API]

[Architecture diagram: Web Server (visit, view, cart, buy); Kinesis Data Streams carrying user behavior logs; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize]

[Architecture diagram: Web Server emitting user behavior logs (visit, view, cart, buy) and recommendation performance metrics (click, impression, channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize]

[Architecture diagram: Web Server emitting user behavior logs (visit, view, cart, buy) and recommendation performance metrics (click, impression, channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; data consumers: Marketer, Data Scientist, Business Analyst]

[Architecture diagram: Web Server emitting user behavior logs (visit, view, cart, buy) and recommendation performance metrics (click, impression, channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; QuickSight for the Marketer, Data Scientist, and Business Analyst]

Amazon QuickSight
Fast BI Service with Pay-per-Session Pricing and ML Insights for everyone
[Diagram: DATA SOURCES (Relational Databases, Flat Files, and many others) → DATA SETS (Retail Data, Ops Data, Marketing Data) → ANALYSES → DASHBOARDS & STORIES]

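[Architecture diagram: Web Server emitting user behavior logs (visit, view, cart, buy) and recommendation performance metrics (click, impression, channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; an open choice ("?") of query engine in front of QuickSight: Batch (Glue, EMR) or Interactive (Athena, Redshift)]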

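Comparison of SQL Processing engines
• AWS Glue: data structure Semi; languages API/SQL; data store S3 (Glue), S3/HDFS (Spark); use case: Transformation
• (engine name missing in the source): data structure Semi; language SQL; data store S3/HDFS; use case: SQL Queries for S3/HDFS
• Amazon Athena: data structure Semi; language SQL; data store S3; use case: Serverless SQL Queries for S3
• Amazon Redshift: data structure Full; language SQL; data store Local; use case: Fully Featured SQL Database
• Performance: (values not recoverable from the slide)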

[Architecture diagram: Web Server emitting user behavior logs (visit, view, cart, buy) and recommendation performance metrics (click, impression, channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight]

How to analyze data in real-time?
[Architecture diagram: Web Server (visit, view, cart, buy; click, impression, channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight]

[Diagram: three numbered real-time paths from Kinesis Data Streams across Collect, Store, Process/Analyze (ETL), and Consume. Services shown: Kinesis Data Firehose, Amazon Elasticsearch Service, Kibana, EMR, ElastiCache, real-time dashboard, Kinesis Data Analytics, Lambda function, Kinesis Data Streams, DynamoDB, Amazon RDS, QuickSight]

• Amazon Kinesis Data Analytics for Flink: Serverless
• AWS Glue: Serverless
• Amazon EMR: Fully Managed

Amazon EMR
Easily run and scale Apache Spark, Hive, Presto, and other big data frameworks
[Diagram: Amazon EMR layers: Applications, Framework, Process Layer, Data Layer (EMRFS on Amazon S3), Infrastructure (Instances, Spot Instances)]

• Interact with streaming data in real-time using SQL or integrated Apache Flink applications
• Build fully managed and elastic stream processing applications
Amazon Kinesis Data Analytics
A managed Apache Flink solution that enables building of sophisticated streaming
applications

Kinesis Data Analytics for SQL
[Diagram: Data Source → Stream Ingestion → Stream Storage → Data Sink. Word-count example: the input “It's raining cats and dogs!” becomes [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)], shown as a table of each word with count 1]

Kinesis Data Analytics (SQL)
• STREAM (in-application): a continuously updated entity that you can SELECT from and INSERT into like a TABLE
• PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM, and INSERT SQL results into an output STREAM
• Create output stream, which can be used to send to a destination
[Diagram: Source → SOURCE STREAM → INSERT & SELECT (PUMP) → DESTIN. STREAM → Destination, with the word-count tuples [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]]

[Diagram: three numbered real-time paths from Kinesis Data Streams across Collect, Store, Process/analyze (ETL), and Consume. Services shown: Kinesis Data Firehose, Elasticsearch Service, Kibana, EMR, ElastiCache, real-time dashboard, Kinesis Data Analytics, Lambda function, Kinesis Data Streams, DynamoDB, Amazon RDS, QuickSight]

Amazon Elasticsearch Service
Fully managed, scalable, and secure Elasticsearch service

3 ways to build Real-time Analytics
[Diagram: three numbered paths from Kinesis Data Streams. Services shown: Kinesis Data Firehose, Elasticsearch Service, Kibana, EMR, ElastiCache, real-time dashboard, Kinesis Data Analytics, Lambda function, Kinesis Data Streams, DynamoDB, Amazon RDS, QuickSight]

[Architecture diagram: Web Server (visit, view, cart, buy; click, impression, channel); Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight; a second Kinesis Data Firehose to Amazon ES with Kibana]

Data analytics system
[Architecture diagram: Web Server; Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions); Athena; QuickSight; Kinesis Firehose to Amazon ES with Kibana; Amazon Personalize]

[Architecture diagram: Web Server; Kinesis Data Streams; Kinesis Firehose to the Data Lake (Users, Items, Interactions); Athena; QuickSight; Kinesis Firehose to Amazon ES with Kibana; Amazon Personalize]

Use Amazon S3 as your Data lake
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
• Multiple & heterogeneous analysis clusters and services can use the same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure: SSL, client/server-side encryption at rest
• Low cost

[Architecture diagram: Web Server; Stream Storage: Kinesis Data Streams; Stream Delivery: two Kinesis Data Firehose streams; Data Lake: S3 (Users, Items, Interactions); Athena; QuickSight; Amazon ES with Kibana; Amazon Personalize]

[Architecture diagram annotated with Lambda Architecture layers. Batch Layer: Kinesis Firehose, Data Lake on S3 (Users, Items, Interactions), Amazon Personalize. Speed Layer: Kinesis Data Streams, Kinesis Firehose, Amazon ES. Serving Layer: Athena, QuickSight, Kibana. Source: Web Server]

Lambda Architecture
[Diagram: Streaming Data feeds the Batch Layer (Raw Data → Batch Process → Batch View) and the Speed Layer (Stream Process → Real-time View); the Serving Layer exposes the Batch View and the Real-time View to queries]

Recommendation + data analytics system
[Architecture diagram: Web Server; Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions); Athena; QuickSight; Kinesis Firehose to Amazon ES with Kibana; Lambda and Event Tracker; Step Functions around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

Recommendation system
[Architecture diagram with the Amazon Personalize part highlighted: Web Server; Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions); Athena; QuickSight; Kinesis Firehose to Amazon ES with Kibana; Lambda and Event Tracker; Step Functions around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

[Architecture diagram with the Analytics part highlighted: Web Server; Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions); Athena; QuickSight; Kinesis Firehose to Amazon ES with Kibana; Lambda and Event Tracker; Step Functions around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

Recommendation + data analytics
[Architecture diagram with both the Amazon Personalize and Analytics parts highlighted: Web Server; Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions); Athena; QuickSight; Kinesis Firehose to Amazon ES with Kibana; Lambda and Event Tracker; Step Functions around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API]

Recommendation + data analytics
[Architecture diagram: Web Server; Kinesis Data Streams; S3 (Users, Items, Interactions) shared by the Amazon Personalize part (Lambda and Event Tracker; Step Functions around Dataset Group, Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API) and the Analytics part (Kinesis Firehose, Athena, QuickSight, Amazon ES, Kibana)]

From Batch to Real-time: Lambda Architecture
[Diagram: Data Source → Stream Ingestion → Stream Storage (Streaming Data). Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View. Speed Layer: Stream Process → Real-time View. Service Layer: the Consumer queries and merges results from both views]

[Diagram: AWS analytics services on the Data → Collect → Store → Process/Analyze (ETL) → Consume → Answers & Insights pipeline: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, AWS Lambda, Amazon Redshift, Elasticsearch Service, Amazon Machine Learning, Amazon Athena, Amazon QuickSight]

[Diagram: AWS analytics services on the Collect → Store → Process/Analyze → Consume pipeline, with the Stream Storage services highlighted]

[Diagram: AWS analytics services on the Collect → Store → Process/Analyze → Consume pipeline, with the Stream Delivery service highlighted]

[Diagram: AWS analytics services on the Collect → Store → Process/Analyze → Consume pipeline, with the Data Lake highlighted]

[Diagram: AWS analytics services on the Collect → Store → Process/Analyze → Consume pipeline, with the Stream/Batch Process services (Batch Layer, Speed Layer) highlighted]

[Diagram: AWS analytics services on the Collect → Store → Process/Analyze → Consume pipeline, with the Serving Layer highlighted]

Lessons Learned: Architectural Principles
• Build decoupled systems
- Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
- Data structure, latency, throughput, access patterns
• Leverage managed and serverless services
- Scalable/elastic, available, reliable, secure, no/low admin
• Use log-centric design patterns
- Immutable logs (data lake), materialized views
• Be cost-conscious
- Big data ≠ Big cost
• Working backwards
- Design from consume to collect

AWS services for e-commerce
[Diagram: Data and the Data Pipeline feed Amazon Personalize, which targets Retention Rate, Churn Rate, and Conversion Rate]

Reference
• Build BI System From Scratch
• Hands-on-Lab: https://serverless-bi-system-from-scratch.workshop.aws/ko/
• Sample Code: https://tinyurl.com/37d9kd76
• Video: https://tinyurl.com/y2r6kljp
• Real-time Analytics on AWS
• Slide: https://tinyurl.com/tpacz9w3
• Video: https://tinyurl.com/s5d83982
• Choose Right Stream Storage: Amazon Kinesis Data Streams vs. MSK(Kafka)
• Slide: https://tinyurl.com/3eetzek5
• Video: https://tinyurl.com/yfzttdbm
• AWS Resource Hub: Databases and Data Analytics
• https://kr-resources.awscloud.com/data-databases-and-analytics

Making Your E-Commerce Site One Step Smarter
Sungmin Kim

Agenda
• Amazon Comprehend: getting 120% out of your text data
• Amazon Fraud Detector: real-time detection of online fraud

Data
“The world’s most valuable resource is no longer oil, but data.”*
*Copyright: The Economist, 2017, David Parkins

Behavior logs of the many users who visit the site

The large amount of text data on the site

The difficulty of analyzing text data

Structuring text data
Word frequency counts
search 52
recommendation 30
service 27
product 25
customer 23
value 20
exploration 17
advertising 14
experience 9
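Turning free text into a word-frequency table like the one on this slide can be sketched with the standard library. The regular-expression tokenizer is a deliberate simplification and an assumption of this example; Korean text in particular usually needs a morphological analyzer, or a managed NLP service, rather than whitespace-style splitting.

```python
import re
from collections import Counter


def term_frequencies(text, top_n=3):
    """Count word occurrences and return the top_n most frequent terms."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top_n)


# Hypothetical snippet of site text.
sample_text = "search recommend search service search recommend"
print(term_frequencies(sample_text, top_n=2))
```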

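Amazon Comprehend
A fully managed natural language processing service with a deep-learning-based NLP engine (POWERED BY DEEP LEARNING)
• Entity extraction
• Automatic language detection
• Key phrases
• Topic modeling
• Sentiment analysis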

Amazon Comprehend: Korean text analysis example

Demo
https://ai-service-demos.go-aws.com/

Customer review analysis
https://aws.amazon.com/ko/blogs/machine-learning/detect-sentiment-from-customer-reviews-using-amazon-comprehend/

Real-time text analysis
https://aws.amazon.com/ko/blogs/machine-learning/enable-smart-text-analytics-using-amazon-elasticsearch-search-and-amazon-comprehend/

AI Driven Social Media Dashboard
https://aws.amazon.com/ko/solutions/implementations/ai-driven-social-media-dashboard/

Reference
• Detect sentiment from customer reviews using Amazon Comprehend
• Enable smart text analytics using Amazon Elasticsearch Service and Amazon Comprehend
• Build a social media dashboard using machine learning and BI services
• Building a custom classifier using Amazon Comprehend
• Build a custom entity recognizer using Amazon Comprehend

Detect more online fraud faster
Amazon Fraud Detector

How can fraud detection be done?
• Hand-designed rules
• Automated rule learning from data

Hand Designed Rules
When people develop fraud detection rules by hand, the challenges are:
• Scalability
• Detecting new types of fraud
• Domain experts

Automated Rule learning from Data
Fraud detection with ML is hard, too
• Lack of ML (machine learning) experts
• Repeated training and model evaluation
• Time-consuming work

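A fraud detection service that uses machine learning to detect online fraud easily, at scale, and in real time
• Pre-built fraud detection model templates
• Automatic creation of custom fraud detection models
• Diverse patterns drawn from Amazon's internal experience
• Integration with Amazon SageMaker
• Integrated review of past evaluations and detection logic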

Generating Fraud Predictions
Guest Checkout: Purchase
IP: 1.23.123.123
email: joe@example.com
Payment: Bank123
…
Call service with:
IP: 1.23.123.123
email: joe@example.com
Payment: Bank123
…
Fraud Detector returns:
Outcome: Approved
ML Score: 160
Purchase Approved

ML template: Online Fraud Insights
• Detect risky events based on an event’s attributes
• Best for detecting potential fraud when historical account/user data is
limited
• Inspired by models and techniques used to protect Amazon.com/AWS
account registration
• Use cases: new account, first transaction, guest checkout
• Inputs: 3 required data elements and 50+ optional

Data requirements (for Online Fraud Insights template)
EVENT_TIMESTAMP | Variable 1 | Variable 2 | Variable N | EVENT_LABEL
4/10/2019 11:05 | … | … | … | Legit / 0
4/10/2019 19:34 | … | … | … | Legit / 0
4/10/2019 20:29 | … | … | … | Fraud / 1
… | … | … | … | …
• EVENT_TIMESTAMP and EVENT_LABEL columns: required
• At least 2 variables required (max 100)
• At least 10K total examples
• At least 500 fraud examples
• Data must reside in S3 (same region as AFD)
• Data should be in CSV format
• First line of CSV file should have headers
• 2 required headers: EVENT_TIMESTAMP and EVENT_LABEL (they should not have any NULL or missing values)
• Maximum file size of 5GB
• Minimum 6 weeks of data
• Recommended: 3-6 months of data
• AFD can handle NULL and missing values (for variables)
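Before uploading a training file, the requirements above could be checked locally with a sketch like this one. The thresholds come from the slide; the function name, the accepted fraud label spellings, and the sample rows are assumptions made for illustration, and the file-size and date-range checks are left out.

```python
import csv
import io

REQUIRED = ("EVENT_TIMESTAMP", "EVENT_LABEL")


def check_training_csv(text, min_rows=10_000, min_fraud=500, fraud_labels=("fraud", "1")):
    """Return a list of problems found in a fraud-training CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    headers = reader.fieldnames or []
    for name in REQUIRED:
        if name not in headers:
            problems.append(f"missing required header {name}")
    if len([h for h in headers if h not in REQUIRED]) < 2:
        problems.append("fewer than 2 variables")
    rows = fraud = 0
    for row in reader:
        rows += 1
        # Required columns must not be empty; other variables may be.
        if any(not (row.get(name) or "").strip() for name in REQUIRED if name in headers):
            problems.append(f"row {rows}: empty required value")
        if (row.get("EVENT_LABEL") or "").strip().lower() in fraud_labels:
            fraud += 1
    if rows < min_rows:
        problems.append(f"only {rows} rows, need at least {min_rows}")
    if fraud < min_fraud:
        problems.append(f"only {fraud} fraud rows, need at least {min_fraud}")
    return problems


# Hypothetical two-row sample; real training data needs far more rows.
sample_csv = (
    "EVENT_TIMESTAMP,ip_address,email_address,EVENT_LABEL\n"
    "4/10/2019 11:05,1.2.3.4,a@example.com,legit\n"
    "4/10/2019 20:29,5.6.7.8,b@example.com,fraud\n"
)
print(check_training_csv(sample_csv, min_rows=2, min_fraud=1))
print(check_training_csv(sample_csv))
```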

• You will need to map all the event variables to a variable type
• Amazon Fraud Detector can also do this automatically, when you import the dataset
• For more information see Variable types.
Variable types:
EMAIL_ADDRESS
IP_ADDRESS
PHONE_NUMBER
USERAGENT
FINGERPRINT
PAYMENT_TYPE
CARD_BIN
AUTH_CODE
AVS
BILLING_NAME
BILLING_PHONE
BILLING_ADDRESS_L1
BILLING_ADDRESS_L2
BILLING_CITY
BILLING_STATE
BILLING_COUNTRY
BILLING_ZIP
SHIPPING_NAME
SHIPPING_PHONE
SHIPPING_ADDRESS_L1
SHIPPING_ADDRESS_L2
SHIPPING_CITY
SHIPPING_STATE
SHIPPING_COUNTRY
SHIPPING_ZIP
ORDER_ID
PRODUCT_CATEGORY
CURRENCY_CODE
PRICE
NUMERIC
CATEGORICAL
FREE_FORM_TEXT
Variables

ML Template: Automated model building
Input: training data in Amazon S3
1. Data Validation
2. Data Enrichment & Transformation
3. Feature Engineering
4. Model Training & Selection
5. Performance Metrics
6. Deployment & Hosting

Interactive ML performance metrics
Part of Fraud Detector UI
• GUI for defining the optimal decision threshold for the best separation between fraud and legits
• Confusion matrix
• Easily control the trade-off between FP and FN

Reference
• Catching fraud faster by building a proof of concept in
• Reviewing online fraud using Amazon Fraud Detector and
Amazon A2I
• AWS Fraud Detector Samples

AWS services for e-commerce
[Diagram: Data and the Data Pipeline feed Amazon Personalize, Amazon Comprehend, and Amazon Fraud Detector, which target Retention Rate, Churn Rate, and Conversion Rate]
