Serverless content recommendation

Content Recommendation 
(Item-Item Collaborative Filtering)
In Serverless

Kurt Lee
Technical Leader, Vingle Inc
iOS / Frontend / Backend
kurt@vingle.net
https://github.com/breath103

Goal:
“Recommend new content
based on users' reactions”

1. Item-Item Collaborative Filtering
2. Serverless Data Pipeline for CF

A bought B
People who buy B also buy C
So A will buy C.

A. User-Item Approach
Can be done with SageMaker / TensorFlow, etc.
B. Item-Item Approach
Can be done with S3 + Athena, etc.

Assume) 
Rating(U, I) = UserFeatures[U] * ItemFeatures[I]
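
As a rough illustration of the assumption above, the predicted rating can be read as the dot product of a user's feature vector and an item's feature vector. The vectors and their values below are made up for the example; in a real system they would be learned by the model.

```typescript
// Hypothetical learned feature vectors (values are illustrative only).
const userFeatures: Record<string, number[]> = { U: [0.5, 1.0, 0.0] };
const itemFeatures: Record<string, number[]> = { I: [2.0, 1.0, 4.0] };

// Rating(U, I) = UserFeatures[U] * ItemFeatures[I], read as a dot product.
function predictedRating(user: string, item: string): number {
  const u = userFeatures[user];
  const v = itemFeatures[item];
  return u.reduce((sum, value, k) => sum + value * v[k], 0);
}

const rating = predictedRating("U", "I"); // 0.5*2 + 1*1 + 0*4 = 2
```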

SageMaker Factorization Machine 
(https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines.html)
Spark ML 
(EMR)
Collaborative Filtering 
(https://spark.apache.org/docs/latest/ml-collaborative-filtering.html)

1) It requires training
2) It requires a massive amount of data to be stored: 
User * Features + Item * Features
3) It gets really expensive as it grows: 
O(user * item)

Predicted score of a User for ItemA
= AVG(score the User already gave ItemB * similarity score between ItemB and ItemA)
Predicted score of a User for ItemA (when there are only “Likes”)
= AVG(similarity score between ItemA and each ItemB the User liked)
If I liked A and B, and S(A, C) = 0.5, S(B, C) = 0.8,
my predicted score for C = (0.5 + 0.8) / 2 = 0.65
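
The “Like”-only formula above can be sketched in a few lines. This is a minimal illustration, not the production code: the `Similarity` map layout is an assumption, and a missing similarity entry is treated as 0, which the slides do not specify.

```typescript
// Similarity scores S(liked item, candidate item); values taken from the example above.
type Similarity = Map<string, Map<string, number>>;

const similarity: Similarity = new Map<string, Map<string, number>>([
  ["A", new Map<string, number>([["C", 0.5]])],
  ["B", new Map<string, number>([["C", 0.8]])],
]);

// "Like"-only case: average the similarity between each liked item and the candidate.
// Assumption: a pair with no stored similarity counts as 0.
function predictedScore(liked: string[], candidate: string, s: Similarity): number {
  const scores = liked.map(item => s.get(item)?.get(candidate) ?? 0);
  if (scores.length === 0) { return 0; }
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

const scoreForC = predictedScore(["A", "B"], "C", similarity); // (0.5 + 0.8) / 2
```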

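Advantages
1) Only two things need to be maintained, User-Item Rating and Item-Item Similarity,
so the data is much smaller. 
(especially when the number of Items is small compared to the number of Users) 
2) Item-Item Similarity comes out naturally,  
so "Related Items for this Item..." comes for free: two birds with one stone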

How do we turn the “similarity” of  
two given items A and B into a score?

There are many variants (Tanimoto, Sigmoid, …),  
but the most important property they share is,

A = number of people who liked A
B = number of people who liked B
A ⋂ B = number of people who liked both A and B
ex)

Note: we compared several Jaccard similarity functions,  
and Sigmoid Jaccard performed best.
http://www.semantic-web-journal.net/system/files/swj1740.pdf
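
For reference, the plain Jaccard score built from the three counts above can be sketched as follows. This is the unweighted form, matching the count(A ⋂ B) / (count(A) + count(B) - count(A ⋂ B)) expression used in the Athena query later in the deck; it is not the Sigmoid variant, and the user ids are made up.

```typescript
// Plain Jaccard similarity between the sets of users who liked A and B:
// |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|.
function jaccard(likedA: Set<number>, likedB: Set<number>): number {
  let both = 0;
  likedA.forEach(user => {
    if (likedB.has(user)) { both += 1; }
  });
  const union = likedA.size + likedB.size - both;
  return union === 0 ? 0 : both / union;
}

// Illustrative user ids only.
const usersWhoLikedA = new Set<number>([1, 2, 3, 4]);
const usersWhoLikedB = new Set<number>([3, 4, 5]);
const jaccardAB = jaccard(usersWhoLikedA, usersWhoLikedB); // 2 shared / 5 distinct = 0.4
```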

Application
Interest Recommendation

Data collection
{ 
content_id: string,
user_id: number,
at: timestamp
}
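
A minimal sketch of how one such event might be typed and serialized. The epoch-millisecond timestamp and the newline-delimited JSON layout are assumptions (chosen so that a query engine can read one row per line); the slide only names the three fields.

```typescript
// Shape of one collected event, mirroring the fields on the slide.
interface UserAction {
  content_id: string;
  user_id: number;
  at: number; // assumption: Unix epoch milliseconds
}

// Assumption: events are written as newline-delimited JSON, one event per line.
function toJsonLine(action: UserAction): string {
  return JSON.stringify(action) + "\n";
}

// Illustrative values only.
const eventLine = toJsonLine({ content_id: "interest-42", user_id: 7, at: 1500000000000 });
```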

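Assuming data collection is already in place…
1) Compute the Item-Item Jaccard score in Athena,  
then load it into Aurora with "LOAD DATA FROM S3" 
2) Aggregate User-Item in Athena,  
then load it into Aurora with "LOAD DATA FROM S3" 
3) In Lambda, look up both tables by UserId and compute the result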

Data that must be quickly accessible
1) Interest - Interest relatedness 
Array<{ A, B, Score (0~1) }>
2) List of Interests that “I” recently liked 
Array<{ user, interest }>

Interest-Interest score
A = people who visited A
B = people who visited B
A ⋂ B = people who visited both A and B

Interest-Interest score
WITH
interest_reads AS (
SELECT
user_id,
content_id as interest
FROM user_actions
WHERE
(year || month || day) >
date_format(CURRENT_TIMESTAMP - interval '30' DAY, '%Y%m%d')
GROUP BY 1, 2
),
ab_inner_reads_count AS (
SELECT
a.interest AS a,
b.interest AS b,
count(1) AS count
FROM interest_reads a
JOIN interest_reads b ON a.user_id = b.user_id AND a.interest < b.interest
GROUP BY 1, 2
),
reads_count AS (
SELECT
interest, count(1) AS count
FROM interest_reads
GROUP BY 1
),
similarity AS (
SELECT
innerCnt.a AS a,
innerCnt.b AS b,
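-- cast first: bigint / bigint would be integer division and truncate the score to 0
(CAST(innerCnt.count AS double) / (aCount.count + bCount.count - innerCnt.count)) AS score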
FROM ab_inner_reads_count AS innerCnt
JOIN reads_count AS aCount ON aCount.interest = innerCnt.a
JOIN reads_count AS bCount ON bCount.interest = innerCnt.b
)
SELECT * FROM similarity
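Intermediate results:
interest_reads: [user_id, interest]
reads_count: [A, count]
ab_inner_reads_count: [A, B, count] 
(a < b, to avoid the duplicate [B, A, count])
similarity: [A, B, count] JOIN [A, count] JOIN [B, count]
=> [A, B, score]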
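
To make the stages of the query easier to follow, here is a small in-memory sketch of the same computation. It is only an illustration of the logic, not something meant to replace Athena; the `Read` type, the function name, and the tab-separated pair key are assumptions for the example.

```typescript
type Read = { userId: number; interest: string };

// Mirrors the query stages: distinct reads -> per-interest counts and
// co-occurrence counts (a < b) -> Jaccard score per pair.
function similarityRows(reads: Read[]): Array<[string, string, number]> {
  // interest_reads: distinct interests per user
  const byUser = new Map<number, Set<string>>();
  for (const r of reads) {
    if (!byUser.has(r.userId)) { byUser.set(r.userId, new Set<string>()); }
    byUser.get(r.userId)!.add(r.interest);
  }
  const readsCount = new Map<string, number>(); // [A, count]
  const pairCount = new Map<string, number>();  // [A, B, count], keyed "A\tB" with A < B
  byUser.forEach(interests => {
    const sorted = Array.from(interests).sort();
    sorted.forEach((a, i) => {
      readsCount.set(a, (readsCount.get(a) ?? 0) + 1);
      for (const b of sorted.slice(i + 1)) {
        const key = a + "\t" + b; // assumes interest ids contain no tab character
        pairCount.set(key, (pairCount.get(key) ?? 0) + 1);
      }
    });
  });
  const rows: Array<[string, string, number]> = [];
  pairCount.forEach((inner, key) => {
    const [a, b] = key.split("\t");
    const union = readsCount.get(a)! + readsCount.get(b)! - inner;
    rows.push([a, b, inner / union]);
  });
  return rows;
}

// Illustrative data: user 1 read A and B (twice), user 2 read A, B and C, user 3 read A.
const sampleRows = similarityRows([
  { userId: 1, interest: "A" }, { userId: 1, interest: "B" }, { userId: 1, interest: "B" },
  { userId: 2, interest: "A" }, { userId: 2, interest: "B" }, { userId: 2, interest: "C" },
  { userId: 3, interest: "A" },
]);
```

Each pair appears once, with the smaller id first, just as in the `a.interest < b.interest` join condition.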

Too many records. 
With n items, that is n*n records...
30000 => 900,000,000
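
As a quick check of the arithmetic, the sketch below computes the worst-case record count, both for all ordered pairs and for the unordered pairs that remain when only a < b is kept. These are upper bounds that assume every pair of items co-occurs for at least one user; the real count depends on the data.

```typescript
// Worst-case number of similarity records for n items.
function pairCounts(n: number): { ordered: number; unordered: number } {
  return {
    ordered: n * n,                // every (a, b) combination
    unordered: (n * (n - 1)) / 2,  // only pairs with a < b
  };
}

const worstCase = pairCounts(30000);
// worstCase.ordered   = 900,000,000
// worstCase.unordered = 449,985,000
```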

Not a problem for S3,  
but too expensive to put into Aurora

[A, B, 0.5]
[A, C, 0.3]
[A, D, 0.2]
[A, "(B, 0.5), (C, 0.3), (D, 0.2)"]
ARRAY_AGG!
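
The effect of the aggregation can be sketched in memory: many [A, B, score] rows collapse into one row per A whose second column is a JSON string. This only illustrates the shape of the result; the function and type names are made up for the example.

```typescript
type SimilarityRow = [string, string, number]; // [A, B, score]

// Collapse many [A, B, score] rows into one [A, "<JSON list of [B, score]>"] row per A.
function aggregateByInterest(rows: SimilarityRow[]): Array<[string, string]> {
  const grouped = new Map<string, Array<[string, number]>>();
  for (const [a, b, score] of rows) {
    if (!grouped.has(a)) { grouped.set(a, []); }
    grouped.get(a)!.push([b, score]);
  }
  const out: Array<[string, string]> = [];
  grouped.forEach((others, a) => {
    out.push([a, JSON.stringify(others)]);
  });
  return out;
}

const aggregated = aggregateByInterest([
  ["A", "B", 0.5],
  ["A", "C", 0.3],
  ["A", "D", 0.2],
]);
// aggregated[0] is ["A", '[["B",0.5],["C",0.3],["D",0.2]]']
```

One row per item keeps the table at n records instead of up to n*n, at the cost of parsing the JSON column on read.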

WITH
interest_reads AS (
SELECT
user_id,
content_id as interest
FROM user_actions
WHERE
(year || month || day) >
date_format(CURRENT_TIMESTAMP - interval '30' DAY, '%Y%m%d')
GROUP BY 1, 2
),
ab_inner_reads_count AS (
SELECT
a.interest AS a,
b.interest AS b,
count(1) AS count
FROM interest_reads a
JOIN interest_reads b ON a.user_id = b.user_id AND a.interest < b.interest
GROUP BY 1, 2
),
reads_count AS (
SELECT
interest, count(1) AS count
FROM interest_reads
GROUP BY 1
),
similarity AS (
SELECT
innerCnt.a AS a,
innerCnt.b AS b,
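-- cast first: bigint / bigint would be integer division and truncate the score to 0
(CAST(innerCnt.count AS double) / (aCount.count + bCount.count - innerCnt.count)) AS score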
FROM ab_inner_reads_count AS innerCnt
JOIN reads_count AS aCount ON aCount.interest = innerCnt.a
JOIN reads_count AS bCount ON bCount.interest = innerCnt.b
)
SELECT
a,  
json_format(
CAST(
array_agg(
ROW(b, score)
) AS JSON
)
) as cards
FROM similarity
GROUP BY 1
[A, "[[B, 0.1], [C, 0.2]....]"] 
[B, "[[C, 0.1], [D, 0.2]....]"]

LOAD DATA FROM S3 's3://vingle-redshift/athena_output.csv'
REPLACE
INTO TABLE interest_similarity
CHARACTER SET 'utf8mb4'
FIELDS
TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS (@interest, @others)
SET
interest = @interest,
others = @others,
created_at = CURRENT_TIMESTAMP;
CREATE TABLE `interest_similarity` (
`interest` VARCHAR(50) NOT NULL,
`others` text COLLATE utf8mb4_bin NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`interest`)
)
Load into Aurora!

User-Recently-Visited-Interest
Too many records.
If a user views 100 on average => User * 100

User-Recently-Visited-Interest
WITH
interest_reads AS (
SELECT
session_data.user_id,
data.content.id as interest
FROM track_tickets.user_content_action
WHERE
(year || month || day) > date_format(CURRENT_TIMESTAMP - interval '30' DAY, '%Y%m%d')
GROUP BY
1, 2
)
SELECT
user_id,
json_format(CAST(array_agg(ROW(interest)) AS JSON)) as interests
FROM interest_reads
GROUP BY 1
[user_id, "[['A'], ['B'], ....]"]

LOAD DATA FROM S3 's3://vingle-redshift/athena_output.csv'
REPLACE
INTO TABLE user_interest
CHARACTER SET 'utf8mb4'
FIELDS
TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS (@user_id, @others)
SET
user_id = @user_id,
others = @others,
created_at = CURRENT_TIMESTAMP;
CREATE TABLE `user_interest` (
`user_id` INT NOT NULL,
`others` text COLLATE utf8mb4_bin NOT NULL,
`created_at` datetime NOT NULL,
PRIMARY KEY (`user_id`)
)
Load into Aurora!

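import * as _ from "lodash";

async function recommendedInterests(userId: number) {
  // user_interest.others holds JSON like [["A"], ["B"]] (one ROW per interest),
  // so unwrap each single-element row into a plain string.
  const visitedInterests = (JSON.parse(
    (await UserVisitedInterests.find(userId)).others
  ) as Array<[string]>).map(row => row[0]);
  // Rows of [interest, "[[A, score], [B, score]]"]...
  const related = await InterestInterestSimilarity.findAll({ interest: visitedInterests });
  return _.chain(visitedInterests)
    .flatMap((base) => {
      const similarItems = related.find(i => i.interest === base);
      // No similarity row for this interest: it contributes no candidates.
      if (!similarItems) {
        return [];
      }
      // How much the user likes `base`. 1.0 for every interest for now.
      return (
        JSON.parse(similarItems.others) as Array<[string, number]>
      ).map(item => {
        return {
          interest: item[0],
          score: item[1] * (1.0) /** base.score */,
        };
      });
    })
    .groupBy(i => i.interest)
    // Average the scores per candidate, keeping the interest id next to its score.
    .map((items, interest) => ({
      interest,
      score: _.sumBy(items, i => i.score) / items.length,
    }))
    // Highest predicted score first.
    .sortBy(i => -i.score)
    .value();
}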

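The S3 + Athena + Aurora combination
Heavy queries go to Athena; results that need fast access go to Aurora
Data collection goes through Firehose + S3
+ JSON, the one true format
Athena can also be driven through its API, on a cron schedule
CloudWatch Cron => Lambda => Athena => Aurora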
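
The scheduled flow could be wired together roughly as follows. This is a sketch under stated assumptions: `QueryEngine` and `Database` are hypothetical interfaces standing in for an Athena API client and an Aurora connection (they are not real SDK types), and the simplified LOAD statement omits the column mapping shown in the full statements earlier in the deck.

```typescript
// Hypothetical stand-ins for the Athena API client and an Aurora connection.
interface QueryEngine {
  // Runs a query and resolves to the S3 path of its CSV result.
  runQuery(sql: string): Promise<string>;
}
interface Database {
  execute(sql: string): Promise<void>;
}

// One scheduled run: compute in Athena, then bulk-load the CSV output into Aurora.
// `table` and the S3 path are assumed to be trusted constants, not user input.
async function refreshTable(
  athena: QueryEngine,
  aurora: Database,
  query: string,
  table: string,
): Promise<string> {
  const outputPath = await athena.runQuery(query);
  await aurora.execute(
    "LOAD DATA FROM S3 '" + outputPath + "' REPLACE INTO TABLE " + table +
    " FIELDS TERMINATED BY ',' ENCLOSED BY '\"' IGNORE 1 ROWS"
  );
  return outputPath;
}
```

A Lambda handler triggered by the cron rule would call `refreshTable` once per table, passing real implementations of the two interfaces.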
