Recommender System Implementation

Big Data Processing and Analysis Practicum (7)

Environment setup
• Start the HDP Sandbox
– Allocate about 20 GB of memory or more to the VM
• Log in to the HDP environment
– ssh root@xxx.xxx.xxx.xxx
• Set the Ambari admin password
– ambari-admin-password-reset
• Get the data
– cd /tmp
– git clone https://github.com/orozcohsu/weblog.git

HDP Hive exercise – uploading data
• Log in to Ambari as admin
– http://IP:8080
• Open Files View and create an HDFS directory
– /tmp/data
• Upload the local files
– drivers.csv
– drivers_tmp.csv
– timesheet.csv
• Select the data directory, click Permission, and set the permissions

HDP Hive exercise – DDL
• In Hive View, create the drivers table (created in the default database)
CREATE TABLE DRIVERS (
DRIVERID INT
,NAME STRING
,SSN BIGINT
,LOCATION STRING
,CERTIFIED STRING
,WAGEPLAN STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

• Create a staging table for the raw drivers data (created in the default database)
CREATE TABLE TEMP_DRIVERS (COL_VALUE STRING);

• Create the timesheet table (created in the default database)
CREATE TABLE TIMESHEET (
DRIVERID INT
,WEEK INT
,HOURS_LOGGED INT
,MILES_LOGGED INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1");

• Read the data from HDFS and load it into the tables
LOAD DATA INPATH '/tmp/data/drivers.csv' OVERWRITE INTO TABLE drivers;
LOAD DATA INPATH '/tmp/data/drivers_tmp.csv' OVERWRITE INTO TABLE
TEMP_DRIVERS;
LOAD DATA INPATH '/tmp/data/timesheet.csv' OVERWRITE INTO TABLE timesheet;

• Set hive.execution.engine => mr or tez
LLAP requires additional configuration in the current version
When memory is insufficient, use mr

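• Set hive.auto.convert.join => false
• Shuffle join (MapReduce join)
• Broadcast join (the small table is broadcast through the distributed cache; mappers scan the large table and join against it)
• Sort-Merge-Bucket join (mappers locate matching keys directly; requires bucketed, sorted tables in the CREATE TABLE statement)
• hive.mapjoin.smalltable.filesize = 25000000 (about 25 MB)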
http://henning.kropponline.de/2016/10/09/hive-join-strategies/

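• Set hive.tez.container.size => 8192
hive.tez.java.opts must be smaller than hive.tez.container.size
Example: on a host with 28 GB of memory, if 10 GB goes to the container, the usual practice is to give about 80% of that to the Java heap (opts)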
• SET hive.tez.container.size=10240
• SET hive.tez.java.opts=-Xmx8192m
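The 80% rule of thumb above can be sketched as a small helper. This is only an illustration: the function name `tez_java_opts` and the `heap_percent` parameter are hypothetical and are not part of Hive or Tez.

```python
def tez_java_opts(container_mb, heap_percent=80):
    """Return a -Xmx value sized as a percentage of the Tez container.

    heap_percent=80 follows the rule of thumb on the slide; it is a
    convention, not a value enforced by Hive.
    """
    heap_mb = container_mb * heap_percent // 100
    return "-Xmx{}m".format(heap_mb)


container_mb = 10240  # 10 GB container, as in the example above
print("SET hive.tez.container.size={};".format(container_mb))
print("SET hive.tez.java.opts={};".format(tez_java_opts(container_mb)))
```

Integer arithmetic is used so that the result is a whole number of megabytes.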

HDP Hive exercise – DML
• Join the tables
SELECT
D.*
,T.HOURS_LOGGED
,T.MILES_LOGGED
FROM DRIVERS D
JOIN TIMESHEET T
ON D.DRIVERID = T.DRIVERID;

HDP Hive exercise – DML
• Join against an aggregated subquery
SELECT
D.DRIVERID
,D.NAME
,T.TOTAL_HOURS
,T.TOTAL_MILES
FROM DEFAULT.DRIVERS D
JOIN (
SELECT
DRIVERID
,SUM(HOURS_LOGGED) AS TOTAL_HOURS
,SUM(MILES_LOGGED) AS TOTAL_MILES
FROM DEFAULT.TIMESHEET
GROUP BY DRIVERID
) T
ON (D.DRIVERID = T.DRIVERID);

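HDP Hive exercise – DCL
• Run
• Benefits of running ANALYZE
– With Tez and the cost-based optimizer (CBO), up-to-date table statistics make the execution plan more accurate. Statistics are normally gathered automatically, but they can also be computed manually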
ANALYZE TABLE DRIVERS COMPUTE STATISTICS;
DESCRIBE EXTENDED DRIVERS;

HDP Hive exercise – built-in functions
• Run
SELECT
REGEXP_EXTRACT(COL_VALUE, '^(?:([^,]*),?){1}', 1) DRIVERID
,REGEXP_EXTRACT(COL_VALUE, '^(?:([^,]*),?){2}', 1) NAME
,REGEXP_EXTRACT(COL_VALUE, '^(?:([^,]*),?){3}', 1) SSN
,REGEXP_EXTRACT(COL_VALUE, '^(?:([^,]*),?){4}', 1) LOCATION
,REGEXP_EXTRACT(COL_VALUE, '^(?:([^,]*),?){5}', 1) CERTIFIED
,REGEXP_EXTRACT(COL_VALUE, '^(?:([^,]*),?){6}', 1) WAGEPLAN
FROM TEMP_DRIVERS;
Load every line as a single string first, then split it into columns afterwards
Hint:
create table CTAS as…
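To see why the pattern `'^(?:([^,]*),?){N}'` returns the N-th column, the same expression can be tried outside Hive. The sketch below uses Python's `re` module, which, like Java's regex engine, keeps the last repetition of a repeated capturing group. The sample row is made up for illustration and only follows the DRIVERID, NAME, SSN, LOCATION, CERTIFIED, WAGEPLAN layout.

```python
import re


def extract_field(line, n):
    """Return the n-th (1-based) comma-separated field of a line.

    The group ([^,]*) is repeated n times; after the match, group 1
    holds the text captured by the last repetition, i.e. field n.
    """
    pattern = r'^(?:([^,]*),?){%d}' % n
    match = re.match(pattern, line)
    return match.group(1) if match else None


# Hypothetical row, not taken from drivers_tmp.csv.
sample = "10,Jane Doe,123456789,1 Example Rd,N,miles"
fields = [extract_field(sample, n) for n in range(1, 7)]
print(fields)
```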

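• Directed Acyclic Graph (DAG)
– Built by Tez when it runs a job
– Shows how the work is distributed across the cluster, counters (such as the memory used by tasks and vertices), and error messages
– Simple Hive queries can usually be answered without Tez, but more complex queries (filtering, grouping, sorting, joins, and so on) are executed by Tez
Further reading (1)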

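• Joins in Hive fall into two types
– Common Join (the join is completed in the reduce phase)
– Map Join (the join is completed in the map phase)
Further reading (2)
SELECT a.id,a.dept,b.age FROM a join b ON (a.id = b.id);
A common join groups rows by the join key and uses a tag to tell which table each row came from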
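The tag mechanism of a common (reduce-side) join can be imitated in a few lines of plain Python. This is a conceptual sketch only, not how Hive is implemented: the "shuffle" is a dictionary, and the function name `common_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def common_join(table_a, table_b):
    """Imitate SELECT a.id, a.dept, b.age FROM a JOIN b ON a.id = b.id."""
    # Map phase: emit (join key, (tag, row)); the tag records the source table.
    shuffled = defaultdict(list)
    for row in table_a:
        shuffled[row["id"]].append(("a", row))
    for row in table_b:
        shuffled[row["id"]].append(("b", row))

    # Reduce phase: for each key, separate rows by tag and pair them up.
    result = []
    for key, tagged_rows in shuffled.items():
        a_rows = [row for tag, row in tagged_rows if tag == "a"]
        b_rows = [row for tag, row in tagged_rows if tag == "b"]
        for a in a_rows:
            for b in b_rows:
                result.append((key, a["dept"], b["age"]))
    return result


a = [{"id": 1, "dept": "sales"}, {"id": 3, "dept": "ops"}]
b = [{"id": 1, "age": 30}, {"id": 2, "age": 45}]
print(common_join(a, b))
```

Only id 1 appears in both inputs, so it is the only key that produces a joined row.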

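• What is Stats & Cost Based Optimization (CBO)?
– Collects statistics on the distribution of column values, which the optimizer uses to produce better query plans
– Improves query efficiency on the cluster
• Common kinds of CBO statistics
– Table Stats
– Column Stats
• How to make sure Hive uses CBO?
– Check the plan with explain
– ANALYZE TABLE has been run on the table
– CBO is enabled
– The table is partitioned
– When the join condition is simple, the optimizer may not engage
Further reading (3)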
ANALYZE TABLE table [partition(key)]
COMPUTE STATISTICS;
ANALYZE TABLE table [partition(key)]
COMPUTE STATISTICS FOR COLUMNS col1,col2,...;

• Equipment logs from a wafer manufacturer
– 77 million rows * 50 columns
– 12 machines, each with 32 GB of memory
• Query time with a where filter
– Spark SQL = 240 seconds
– Hive ORC + Tez + LLAP = 20 seconds
Further reading (4)

• Compression format
– high level compression (one of NONE, ZLIB, SNAPPY)
• Create a table
– create table Addresses ( name string, street string, city string, state string, zip int ) stored as orc tblproperties ("orc.compress"="NONE");
• Knowledge layer
Further reading (5)

• Techniques for faster SQL
– Use Tez or LLAP
– Use ORCFILE
– Use Vectorization
– Use CBO
– Write good SQL
Further reading (6)
Reference: https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/

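Data model
• Prestage
– Raw data as it arrives
• Stage
– Cleaned and organized data
• Pdata
– The main tables
• Pmart
– Wide tables built for a specific business purpose and for visualization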

HDP Sandbox environment overview
• Outer layer
– The host Linux operating system
• Inner layer
– Docker container (HDP2.6)
• Directory shared between the two layers
– Outer: /var/lib/docker/volumes/hadoop/_data
– Inner: /hadoop

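HDP Sandbox environment preparation
• Copy the data into the inner layer
– docker cp /tmp/weblog/ 99a03ce676c4:/hadoop
• Log in to the inner HDP environment (default password: hadoop)
– ssh root@localhost -p 2222
– cd /hadoop/weblog
• Create the exchange directory for ES
– /hadoop/hadoop-log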

Adjusting the system time
• Set the time zone and sync the clock
– ln -sf /usr/share/zoneinfo/Asia/Taipei /etc/localtime
– yum install -y ntpdate
– ntpdate time.stdtime.gov.tw
• Check
– date

Creating the profile table
• Create the database in Hive View
– create database pdata;
• Create the profile table in MySQL
– mysql -u root -p (default password: hadoop)
– create database weblog;
– use weblog;
– create table profile (id int not null auto_increment, uuid varchar(50), name varchar(50), primary key(id) );

Importing the profile data
• On the inner layer, import the data (default password: hadoop)
– mysql -uroot weblog -p -e "LOAD DATA LOCAL INFILE '/hadoop/weblog/hivename.csv' INTO TABLE profile FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'"
• View the data
– select * from weblog.profile;

Sqoop import into Hive
• On the inner layer, import the data
– sqoop import --connect jdbc:mysql://localhost/weblog --username root --password hadoop --driver com.mysql.jdbc.Driver --table profile --hive-import --hive-table pdata.profile
• View the data (Hive View)
– SELECT * FROM pdata.profile LIMIT 100;
If the import fails with a missing-parameter error, see: https://community.hortonworks.com/questions/110580/unable-to-publish-import-data-to-publisher-orgapac.html

Uploading the URL breadcrumbs
• In Hive View, select the local file
– host_url.csv
– Enter the table and column names, choose the column types, and pick the database

Generating weblogs automatically
• On the inner layer, run the log generator
– python realtime_data.py -t "2017-07-15 09:00:00" -p "/hadoop/weblog/log"
• Inspect the log format
– tail -n +20 20170711071217.csv
– Delimiter: |
– Fields: date, page view, user agent, uuid
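A pipe-delimited log of this shape can be read with Python's standard `csv` module. The sketch below is illustrative only: the field names in `FIELDS` and the sample line are assumptions based on the four columns listed above, not output copied from realtime_data.py.

```python
import csv
import io

# Assumed column order: date, page view, user agent, uuid.
FIELDS = ["date", "page_view", "user_agent", "uuid"]


def parse_weblog(text):
    """Parse pipe-delimited log text into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [dict(zip(FIELDS, row)) for row in reader if row]


# Hypothetical log line for illustration.
sample = ("2017-07-11 07:12:17|http://example.com/item/1|"
          "Mozilla/5.0 (X11; Linux x86_64)|u-0001\n")
rows = parse_weblog(sample)
print(rows[0]["uuid"], rows[0]["page_view"])
```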

A simple ETL program
• Upload to HDFS (/user/hive/prestage/weblog)
– bash /hadoop/weblog/SH/moveETL.sh &
• Check in Files View

Running the log batch jobs (1)
• Run the stage program (run several batches to simulate repeated loads)
– sudo -u hive beeline -u "jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" --hivevar MDATE='20170711' --hivevar SDATE='06' -f '/hadoop/weblog/SQL/stage.sql';
• View the data
– SELECT * FROM stage.s_weblog LIMIT 100;
Check whether the prestage data on HDFS still exists

• Run the pdata program (run several batches to simulate repeated loads)
– sudo -u hive beeline -u "jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" --hivevar MDATE='20170711' --hivevar SDATE='06' -f '/hadoop/weblog/SQL/pdata.sql';
• View the data
– SELECT * FROM pdata.p_weblog LIMIT 100;
In the current design PDATA holds hourly data; the daily summary runs as a batch at 1 a.m. the following day

• Run the pmart program (run several batches to simulate repeated loads)
– sudo -u hive beeline -u "jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" --hivevar CDATE='2017-07-11' -f '/hadoop/weblog/SQL/pmart.sql';
• View the data
– SELECT * FROM pmart.m_weblog LIMIT 100;
All of these jobs run on Tez

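Data analysis
• Customer list extraction
– Find users who like perfume and fragrance (香水香氛)
– Find users interested in both nutritional supplements (營養補給) and business finance (商業理財)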
SELECT NAME FROM PMART.M_WEBLOG
WHERE CAT3 LIKE '%香水香氛%';
SELECT NAME FROM PMART.M_WEBLOG
WHERE CAT2 LIKE '%營養補給%'
AND CAT3 LIKE '%商業理財%';
Look at your own data to decide on the query conditions
Alternatively, query with beeline (non-Chinese text only); exit with !q:
sudo -u hive beeline -u "jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

Common data query tools
– SparkSQL: Good for iterative processing and for accessing existing Hive tables, provided the results fit in memory
– HAWQ: Good for traditional BI-like queries, star schemas, and OLAP cubes
– Hive (LLAP): Good for petabyte-scale data mixed with smaller tables requiring sub-second queries
– Phoenix: Good way to interact with HBase tables; good with time series, good indexing
– Drill, Presto: Query federation-like capabilities but limited SQL syntax. Performance varies quite a bit.

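• From the customer-extraction point of view, what other applications are there?
• Load the June historical data manually
– Historical data: /hadoop/weblog/history-log
• Which function helps when building monthly tables?
– Reference function: trunc(dt,'MM')
Further reading (1)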

• On the inner layer, access Hive data from Python
– yum install gcc-c++ python-devel.x86_64 cyrus-sasl-devel.x86_64
– yum -y install python-setuptools python-setuptools-devel
– easy_install pip
– pip install sasl
– pip install thrift
– pip install thrift-sasl
– pip install PyHive
– pip install pyhs2
– pip install woothee
Further reading (2)

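• On the inner layer, access Hive data from Python
Further reading (3)
import pyhs2
import woothee

# Connect to HiveServer2 on the sandbox, using the pmart database
conn = pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                     user='hive', password='', database='pmart')

# Print the sixth column of the first ten rows
with conn.cursor() as cur:
    cur.execute("select * from m_weblog limit 10")
    for i in cur.fetch():
        print(i[5].decode('utf-8'))

# Parse the user-agent string; this query returns a single column, so use index 0
with conn.cursor() as cur:
    cur.execute("select ua from m_weblog limit 10")
    for i in cur.fetch():
        print(woothee.parse(i[0]))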

Big Data Processing and Analysis Practicum (8)

Recommendation system (1)
• Personalized Recommender Systems
– Built on logs of customer behavior and transactions
– Provide recommendations within a short time
– Create cross-sell and up-sell opportunities
– Reduce bounce rate or churn rate
– Increase customer intimacy or customer stickiness
– Used in B2C online shopping, video, music, mobile advertising, etc.

• Collaborative Filtering recommendation
– Relies on a group of people who share the same interests or have common experiences and preferences
– Individuals give feedback through response mechanisms (e.g. item ratings such as like or dislike); this feedback is used for social filtering, to find other people who may be interested in the same items

• CF is a common term used in many fields
– Tapestry System in Xerox PARC lab
– GroupLens/ MovieLens
• http://grouplens.org/datasets/movielens/
– Amazon (item-to-item CF)
• http://www.cin.ufpe.br/~idal/rs/Amazon-Recommendations.pdf
– Facebook
• http://cs229.stanford.edu/proj2012/DavidBajajJazra-AFacebookProfileBasedTvRecommenderSystem.pdf
• https://code.facebook.com/posts/861999383875667/recommending-items-to-more-than-a-billion-people/

• User-based CF
– Collect customer preferences explicitly or implicitly
– Use a similarity measure to find a group of N people who share customer A's interests, then use their ratings to estimate the ratings customer A has not given yet
• Top-N recommendation
• Association rule recommendation

• Pearson Correlation Coefficient
• Cosine-based Similarity
• Adjusted Cosine Similarity
• …
S13 => similarity between item 1 and item 3
Compute the similarity for every pair of items
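Two of the similarity measures listed above can be written directly from their definitions. The sketch assumes both rating vectors cover the same users in the same order; the vectors `item1` and `item3` are made-up ratings used only to illustrate S13.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical ratings given by the same three users to item 1 and item 3.
item1 = [5, 3, 4]
item3 = [4, 2, 3]
print("S13 cosine :", cosine(item1, item3))
print("S13 pearson:", pearson(item1, item3))
```

Here item 3 is rated exactly one point below item 1 by every user, so the Pearson correlation is 1 (up to rounding) while the plain cosine is slightly lower.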

• Item-based CF
– Some items frequently occur together
– Count the item pairs that appear together most often in orders and recommend from them
• Content-based CF
– Categorize items by their content and compute item similarity to make more abstract recommendations (content attributes serve as features for prediction)
Reference: http://files.grouplens.org/papers/www10_sarwar.pdf

• Model-based CF
– Build a model from historical data (user-based or item-based)
– Use the model to make predictions
– Building the model in advance saves time and improves response efficiency

• Knowledge-based CF
• Hybrid ensemble-based CF
• Context-sensitive CF
– Time
– Location
• Advanced topics in CF
Reference:
http://rd.springer.com/book/10.1007%2F978-3-319-29659-3
Feedback (reciprocity) is very important to a recommender

• Slope one
– A family of CF algorithms, in the same vein as user-based and item-based CF
– Easy to compute, with good accuracy
user     itemA   …   itemB
UserA    1       …   1.5
…        …       …   …
UserB    2       …   ?
1.5 - 1 = ? - 2, so ? = 2 + 0.5 = 2.5
user     itemA   itemB   itemC
UserA    5       3       7
UserB    3       4       ?
UserC    ?       2       9
Average difference between itemA and itemB: [(5-3)+(3-4)]/2 = 0.5
UserC's itemA rating, estimated from itemB: ? = 2 + 0.5 = 2.5
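The arithmetic above can be reproduced with a short weighted Slope One sketch. The function name `slope_one` and the dictionary layout are illustrative choices, not taken from the course material. With only itemA and itemB the prediction matches the 2.5 worked out on the slide; once itemC is included as well, the weighted scheme combines both deviations and gives a different value.

```python
def slope_one(ratings, user, target):
    """Weighted Slope One prediction of `target` for `user`.

    ratings maps user -> {item: rating}. For every item the user has
    rated, the average difference (target - item) over users who rated
    both is added to the user's rating, weighted by how many such users
    there are.
    """
    numerator, weight = 0.0, 0
    for other, value in ratings[user].items():
        if other == target:
            continue
        diffs = [r[target] - r[other] for r in ratings.values()
                 if target in r and other in r]
        if diffs:
            deviation = sum(diffs) / len(diffs)
            numerator += (deviation + value) * len(diffs)
            weight += len(diffs)
    return numerator / weight if weight else None


two_items = {
    "UserA": {"itemA": 5, "itemB": 3},
    "UserB": {"itemA": 3, "itemB": 4},
    "UserC": {"itemB": 2},
}
three_items = {
    "UserA": {"itemA": 5, "itemB": 3, "itemC": 7},
    "UserB": {"itemA": 3, "itemB": 4},
    "UserC": {"itemB": 2, "itemC": 9},
}
print(slope_one(two_items, "UserC", "itemA"))    # itemB only
print(slope_one(three_items, "UserC", "itemA"))  # itemB and itemC combined
```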

• CF pros
– Finds items of interest without complex computation
– Can produce surprising recommendations (driven by behavior data)
• CF cons (challenges)
– Sparsity
• Similar users or items are hard to find in a sparse matrix
– Scalability
• Fast responses are hard when the computation is large
– Accuracy
• Categorization and weighting in the transformation stage introduce bias
– Cold start
• New customers or items have no history

• In a live system, run A/B tests (confusion matrix)
• Offline evaluation
– Training/ testing dataset
– 0.5 vs 0.5
– Hold-out test (0.9 vs 0.1)
– Cross validation (k-fold)
– R², AIC & BIC
– Stepwise validation
Variables:
• Verify & Validate
Model:
• Evaluation

• Recommendation performance
– Coverage
• The proportion of items that the recommender is actually able to recommend
– MAE (Mean Absolute Error), RMSE
• The difference between predicted and actual ratings (how well the recommended ordering matches reality)
– ROC curve
• The closer the curve is to the upper-left corner, the better (larger area under the curve)
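MAE and RMSE follow directly from their definitions. The ratings below are made-up numbers for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical ratings: what users gave vs. what the model predicted.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
print("MAE :", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

The single two-point miss on the last rating raises RMSE above MAE, which is why RMSE is the stricter of the two.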

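[Figure: two overlapping sets, retrieved items and relevant items. A = retrieved and relevant, B = relevant but not retrieved, C = retrieved but not relevant]
                 Relevant    Not relevant
Retrieved        A (TP)      C (FP)
Not retrieved    B (FN)      D (TN)
Precision: the more of the retrieved items that are relevant, the better (higher is better)
Recall: the more of the relevant items that are retrieved, the better
On large datasets these two metrics trade off against each other
For example, retrieving more items raises recall but lowers precision
For recommender systems, A/B testing is the suggested evaluation; RMSE/ MSE can also be used
Accuracy: A = (TP+TN)/(ALL)
Precision: P = TP/(TP+FP)
Recall: R = TP/(TP+FN)
F1 = (P*R*2)/(P+R)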
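The four formulas can be checked with a small helper. The counts passed in below are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 30 relevant items found, 10 irrelevant items found,
# 20 relevant items missed, 40 irrelevant items correctly left out.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(metrics)
```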

Further reading
• Libraries that speed up matrix computation
– OpenBLAS (Basic Linear Algebra Subprograms)
– Atlas (Automatically Tuned Linear Algebra Software)
– MKL (Math Kernel Library)
– LAPACK (Linear Algebra PACKage)

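Building the recommender system (1)
• Goal
– Recommend products using collaborative filtering (CF)
• Data sources
– Historical data (cold data)
– Rating data from front-end users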

• VMware (HDP Sandbox)
• Learn HDP quickly
– https://hortonworks.com/tutorials/
Host name: sandbox.hortonworks.com
IP: 192.168.214.140
Memory / Cores: 24GB+ / 4+
Method: start the VM directly
Distribution: HDP2.6
Note: at least 4 cores are required

[Architecture diagram: Front End (Webpage); Data Lake (Cold: Hadoop, Hot: MySQL); Recommendation (Mahout)]

What is Mahout (1)
• Runs in single-node mode or Hadoop distributed mode
• Provides many common algorithms
• Mahout-Samsara: an interactive shell on a Spark cluster (since 0.10) with an R-like, Matlab-like linear algebra library
• Batch processing, for back-end analysis
– Can be combined with Apache Storm (speed layer) in a lambda architecture
Mahout: 0.13.1 – 2017, May,13

What is Mahout (2)
• After logging in, run mahout
– mahout
– rpm -qa | grep mahout

What is Mahout (3)
Mahout runs in the batch layer of the back-end process

Further reading
• What is the Lambda architecture?
• How can Storm and Mahout be used together?
• Mahout-Samsara
– http://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/samsara/play-with-shell.html
• More comparisons involving Mahout (Mahout, Spark, H2O)
– https://www.linkedin.com/pulse/choosing-machine-learning-frameworks-apache-mahout-vs-debajani
– https://www.h2o.ai/h2o/h2o-flow/

Alternating Least Squares (ALS) Algorithm (1)
• The rating table (R matrix) is sparse: most entries are missing
• Compared with the SVD algorithm, ALS is much easier and faster
• Popularized by the Netflix Prize competition
• A matrix A is fitted to the known ratings in R

• Users give ratings to items, forming the rating matrix
Recommendation matrix = User features matrix × Item features matrix
In practice, U and M are stored as non-human-readable files in the Mahout output
The least squares method makes the squared difference between the two sides of the equation as small as possible

• How to find the optimal point
– Progress slows down as the solution approaches the optimum
• Overfitting
Smoothing

• Minimize the objective function (lambda is a coefficient; the penalty term is called L1 regularization)
• Solve with the least squares method
• Stopping condition / convergence, based on
– Number of iterations
– Δ RMSE
Correction: L1 norm
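The alternating idea can be sketched in plain Python: fix the item features, solve a small regularized least-squares problem for each user, then swap roles and repeat. This is a teaching sketch under stated assumptions, not Mahout's implementation. It uses a plain squared (ridge-style) penalty `lam * (|U|^2 + |M|^2)`, which may differ from the regularizer shown on the slide, and it does not apply the per-user weighting that ALS-WR uses. All function names and the sample ratings are hypothetical.

```python
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def solve(a, b):
    """Solve the small system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def update(ratings, fixed, k, lam):
    """Recompute one side's feature vectors while the other side stays fixed."""
    updated = {}
    for key, rated in ratings.items():
        a = [[lam if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for other, r in rated.items():
            f = fixed[other]
            for i in range(k):
                b[i] += r * f[i]
                for j in range(k):
                    a[i][j] += f[i] * f[j]
        updated[key] = solve(a, b)  # normal equations of the ridge problem
    return updated


def objective(by_user, users, items, lam):
    """Squared error on known ratings plus the squared-norm penalty."""
    error = sum((r - dot(users[u], items[i])) ** 2
                for u, rated in by_user.items() for i, r in rated.items())
    penalty = (sum(dot(v, v) for v in users.values())
               + sum(dot(v, v) for v in items.values()))
    return error + lam * penalty


def als(triples, k=2, lam=0.065, iterations=5, seed=0):
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in triples:
        by_user.setdefault(u, {})[i] = r
        by_item.setdefault(i, {})[u] = r
    items = {i: [rng.random() for _ in range(k)] for i in by_item}
    history = []
    for _ in range(iterations):
        users = update(by_user, items, k, lam)   # fix M, solve for U
        items = update(by_item, users, k, lam)   # fix U, solve for M
        history.append(objective(by_user, users, items, lam))
    return users, items, history


# Hypothetical (user, item, rating) triples.
triples = [("UserA", "itemA", 5), ("UserA", "itemB", 3), ("UserA", "itemC", 4),
           ("UserB", "itemA", 3), ("UserB", "itemB", 4),
           ("UserC", "itemB", 2), ("UserC", "itemC", 5)]
users, items, history = als(triples)
print(history)
print("predicted UserB/itemC:", dot(users["UserB"], items["itemC"]))
```

Each half-step solves its subproblem exactly, so the objective recorded in `history` never increases from one iteration to the next; the slide's stopping conditions (iteration count or a small change in RMSE) decide when to stop.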

Further reading
• Finding the optimum of a cost function (optimization)
– Gradient descent
– Least square method
– Newton method

Exploratory data analysis – EDA (1)
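• Analyze a dataset visually
Compute descriptive statistics >> plot statistical charts >> identify outliers and data errors >> find the basic characteristics of the data

• Data insight

• Data presentation should serve a specific purpose
– http://blog.infographics.tw/2015/02/seven-common-chart-visualization-mistakes/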

What is Sqoop (1)
• SQL to Hadoop (import/ export)
• Uses JDBC for the connection
• Uses mappers to fetch the data (default: 4 mappers)

What is Sqoop (2)
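• Each mapper uses the split-by column to fetch its own batch of data
• Use boundary-query to control the range (minimum and maximum) that the splits cover; the filtering is pushed down to the database
• Support multi-writer
[Figure: RDBMS → mapper1 (age >= 0 and age < 30), mapper2 (age >= 30 and age < 50), mapper3 (age >= 50 and age < 80) → HDFS]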
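The way a numeric split-by column is divided among mappers can be sketched as follows. This is a simplified illustration of the idea rather than Sqoop's own code: the function name `split_predicates` is hypothetical, and even-width ranges with an inclusive upper bound on the last split are assumptions made for the sketch.

```python
def split_predicates(column, low, high, num_mappers):
    """Divide [low, high] into contiguous ranges, one WHERE clause per mapper."""
    # Integer boundaries, evenly spaced between the minimum and the maximum.
    bounds = [low + (high - low) * i // num_mappers
              for i in range(num_mappers + 1)]
    predicates = []
    for i in range(num_mappers):
        # Only the last range includes its upper bound, so no row is
        # fetched twice and the maximum value is not lost.
        upper_op = "<=" if i == num_mappers - 1 else "<"
        predicates.append("{0} >= {1} AND {0} {2} {3}".format(
            column, bounds[i], upper_op, bounds[i + 1]))
    return predicates


for clause in split_predicates("age", 0, 80, 4):
    print(clause)
```

In Sqoop the minimum and maximum come from a boundary query on the split-by column; here they are simply passed in.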

• Log in to the outer layer (default password: hadoop)
– ssh root@xxx.xxx.xxx.xxx
• List the docker instances
– docker ps
• Check the docker ssh port mapping
– docker port deb4bcadda8f | grep 22

• HDP sample datasets
– hadoop fs -ls /demo/data

• After logging in to the inner layer, install Mahout (version: 0.9)
– yum install mahout
Installing through Ambari currently runs into problems

• Download the recommender website
– cd ~
– git clone git@github.com:orozcohsu/webRecommend.git
• Install the Python packages
– easy_install pip
– cd webRecommend && pip install -r env/requirements.txt

• Python packages used
– flask
– flask-bootstrap
– flask-script
– flask-sqlalchemy
– flask-moment
– pymysql

• Create the iii database
– mysql -u root -p (default password: hadoop)
– mysql> create database iii;
• Grant privileges to root
– mysql> GRANT ALL ON *.* to root@'%';
– mysql> GRANT ALL PRIVILEGES ON *.* TO root@'%' IDENTIFIED BY 'hadoop' WITH GRANT OPTION;
– mysql> FLUSH PRIVILEGES;
– mysql> quit;
• Create the tables
– Review and edit /root/webRecommend/app.py
– Review and edit /root/webRecommend/env/settings.py
– Create the tables by running: python /root/webRecommend/env/log_models.py

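• Start the website and check for errors
– python app.py runserver
• Exit back to the outer layer and commit the container to a new image
– exit
– docker commit sandbox sandbox
• Stop and remove the running container
– docker stop sandbox
– docker rm sandbox
– docker ps -a
Check the currently running containers:
docker ps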

• Edit the startup script (add the 5001:5001 port mapping)
– vi /root/start_scripts/start_sandbox.sh
• Restart the system
– reboot
• Check that the service and port are up
– docker ps | grep 5001

• Enter the inner layer and run the website
– python /root/webRecommend/app.py runserver &
• What the site looks like (http://xxx.xxx.xxx.xxx:5001)

• Rate items by clicking on the web page
• Import the training dataset
– mysql -uroot iii -p -e "LOAD DATA LOCAL INFILE '/root/webRecommend/weblog.csv' INTO TABLE weblog FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'"

• Create a directory for the Sqoop data
– hadoop fs -mkdir -p /data
• Import the MySQL data into HDFS with Sqoop
– sqoop import --connect jdbc:mysql://localhost/iii --username root --password hadoop --driver com.mysql.jdbc.Driver --query "select uuid,item,rating from iii.weblog where 1=1 and \$CONDITIONS" --target-dir /data/weblog -m 1

• Create a tmp directory for Mahout
– hadoop fs -mkdir /data/tmp
• Run the ALS-WR algorithm
– mahout parallelALS --input /data/weblog/* --output /data/logout --tempDir /data/tmp --numFeatures 5 --numIterations 2 --lambda 0.065
• Check
– hadoop fs -ls /data/logout
M directory: itemFeatures
U directory: userFeatures
userRatings directory

• Generate recommendation results
– mahout recommendfactorized --input /data/logout/userRatings/ --output /data/recommend/ --userFeatures /data/logout/U --itemFeatures /data/logout/M --numRecommendations 3 --maxRating 5 --numThreads 2
• Check
– hadoop fs -ls /data/recommend
– hadoop fs -cat /data/recommend/*
– hadoop fs -cat /data/recommend/* > /tmp/result.txt

• Parse the file format (output in /tmp)
– cd ~/webRecommend && python mahout_to_mysql.py
• Upload data into DB
– mysql -uroot iii -p -e "LOAD DATA LOCAL INFILE '/tmp/mahout_result.csv' INTO TABLE recommend FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'"
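The parsing step can be sketched as below. The contents of mahout_to_mysql.py are not shown in the slides, so this is not that script. It assumes each line of /tmp/result.txt has the form `user<TAB>[item:score,item:score,...]`; check this against your own file before relying on it. The function name and the sample line are hypothetical.

```python
def parse_recommendations(lines):
    """Flatten 'user<TAB>[item:score,...]' lines into (user, item, score) rows."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, items = line.split("\t", 1)
        for pair in items.strip("[]").split(","):
            if not pair:
                continue  # user with an empty recommendation list
            item, score = pair.split(":")
            rows.append((user, item, float(score)))
    return rows


# Hypothetical line in the assumed format.
sample = ["1\t[101:4.5,102:3.9]\n"]
for user, item, score in parse_recommendations(sample):
    print("{},{},{}".format(user, item, score))
```

Each output row is one comma-separated line, which matches the FIELDS TERMINATED BY ',' setting of the LOAD DATA command above.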

• View the recommend table
– mysql -u root iii -p -e "select * from iii.recommend;"

• Start the website and look up the recommendations by id

Further reading (1)
• Switch Mahout to run on Tez
– http://blog.sequenceiq.com/blog/2014/03/31/mahout-on-tez/
http://xxx.xxx.xxx.xxx:8088

Further reading (2)
• Recommendation results are usually kept in a hot data store for fast lookup
– HBase
– MongoDB
– RethinkDB
• https://www.rethinkdb.com/
– Redis

Further reading (3)
• Download the ml-1m test dataset
– wget http://www.grouplens.org/system/files/ml-1m.zip
– cat ratings.dat |sed -e s/::/,/g| cut -d, -f1,2,3,4 > ratings.csv
• Create the training and test datasets (hold-out method)
– mahout splitDataset --input /ratings.csv --output /recommend/factorized/dataset --trainingPercentage 0.9 --probePercentage 0.1 --tempDir /tmp
• Build the model
– mahout parallelALS --input /recommend/factorized/dataset/trainingSet/ --output /recommend/factorized/als/out --tempDir /tmp --numFeatures 5 --numIterations 10 --lambda 0.065

Further reading (4)
• Evaluation result (RMSE)
– mahout evaluateFactorization --input /recommend/factorized/dataset/probeSet/ --output /recommend/factorized/als/rmse/ --userFeatures /recommend/factorized/als/out/U/ --itemFeatures /recommend/factorized/als/out/M/ --tempDir /recommend/factorized/als/tmp
• Recommendation list
– mahout recommendfactorized --input /recommend/factorized/als/out/userRatings/ --output /recommend/factorized/recommendations/ --userFeatures /recommend/factorized/als/out/U/ --itemFeatures /recommend/factorized/als/out/M/ --numRecommendations 6 --maxRating 5
Keep in mind that a model with a good RMSE may still recommend items that are irrelevant to you, or that hold no surprise

Hot data zone
• Suited to fast lookups, queried over TCP
– Memcached
– AeroSpike
– Redis (ranked 9th as of 2017-07)
• https://db-engines.com/en/ranking

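Redis
• Offers rich data types
– Hash, List, Set
• Has persistence mechanisms
– Periodically persists in-memory data to disk
– Keeps an append-only log of writes (similar to a bin-log), which can be replayed to recover data after a failure
• Supports LRU (least recently used) eviction
– When the stored data exceeds the configured memory limit, the least recently used keys are evicted
• Supports clustering
– Supports replication (Replica)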

Redis installation and practice (1)
• Log in to the inner layer, then install and run Redis
– cd /tmp && wget http://download.redis.io/releases/redis-4.0.0.tar.gz
– tar -zxvf /tmp/redis-4.0.0.tar.gz
– mv /tmp/redis-4.0.0 /opt
– ln -s /opt/redis-4.0.0/ /opt/redis
– cd /opt/redis && make
– cd /opt/redis/src && cp redis-benchmark redis-cli redis-server /usr/bin
• Configure Redis to start in the background (daemonize yes)
– vi /opt/redis/redis.conf

• Start Redis
– mkdir /var/lib/redis
– redis-server /opt/redis/redis.conf
• Try connecting (the data is still there after you quit and reconnect)
– redis-cli
> set class iii
> get class
> get clas
> quit

• Install the Python Redis client
– easy_install redis
• Access Redis from Python
– python-redis1.py
• Switch the Python code to use a pipeline
Reference: https://redis.readthedocs.io/en/2.4/set.html

• Store and retrieve different kinds of objects
• Search for users with the same preferences

Finding the users recommended the same item
• Export the MySQL table
– mysql -uroot iii -p --skip-column-names -e "select uuid, item from iii.recommend" > /tmp/recommend.txt
• Write to Redis and query the recommendation data
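One possible way to load the exported file into Redis is to keep one set per item, holding the uuids that were recommended that item, and to batch the writes through a pipeline. The sketch below is an assumption about how this step might look, not the course's own script. It relies only on the redis-py calls `pipeline()`, `sadd()`, `execute()` and `smembers()`. The key prefix `item:` and both function names are hypothetical, and the input is assumed to be the tab-separated `uuid<TAB>item` output of the mysql command above.

```python
def load_recommendations(client, lines, prefix="item:"):
    """Add each uuid to a Redis set keyed by the recommended item.

    `client` is expected to behave like redis.Redis; writes are queued
    on a pipeline and sent in one round trip by execute().
    """
    pipe = client.pipeline()
    count = 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        uuid, item = parts
        pipe.sadd(prefix + item, uuid)
        count += 1
    pipe.execute()
    return count


def users_for_item(client, item, prefix="item:"):
    """Return the set of uuids that were recommended the given item."""
    return client.smembers(prefix + item)
```

With the redis package installed and the server from the earlier slides running, a client could be created with `redis.Redis(host="localhost", port=6379)` and the file passed as `open("/tmp/recommend.txt")`.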

Writing the Mahout results to Redis
• Write to Redis

Reading recommended products from Redis
• Modify the website
– app.py
– templates

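Further reading (1)
• Change the recommendation lookup to query Redis
• Differences between RDB and AOF (full persistence) backups
• What is a Redis Bloom filter?
– Often used in web crawling
– For reference: 512 MB of memory can hold about 200 million entries
• Install and test a Redis cluster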

Further reading (2)
• Ad hoc query tools commonly used for big data
– Learn about Presto
– Learn about HAWQ
– Learn about Hive (LLAP)
• Common in-memory databases (non-commercial)
– MemSQL
– VoltDB

Future Outlook
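• Collect user behavior on the website, including user profiles and browsing and click records
• Observe changes in site traffic in real time and build topic-based KPI indicators
• Cross-reference internal user data and add derived columns to build a complete, wide user-analysis table
• Build user behavior, product association, and clustering models to support decision analysis
• Feed the analysis results back, for example as real-time ad recommendations to users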

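Website operations:
● Build your own website, install plugins and modify styles, site SEO
Understanding customers:
● User behavior, GA, tracking-code instrumentation, data streaming, data integration in system design
Storage and query:
● Data warehousing and methodology, cold and hot data pools, data analysis, KPI charts and alerts
Data analysis:
● Introduction to data science, data mining methodology, clustering and collaborative filtering
Feedback and performance:
● Visualization, periodic scoring, automated modeling and variable updates, ad recommendation and performance evaluation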
Future Outlook

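Data analysis
RFM model
• RFM
– R (Recency): how long ago the customer made their most recent purchase
– F (Frequency): how many purchases the customer made in a recent period
– M (Monetary): how much the customer spent in a recent period
• Typical analytical CRM focuses on customer contribution, whereas RFM segments customers by their behavior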
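The three RFM values can be computed from an order list in a few lines. The order data, customer names and the function name `rfm` below are all made up for illustration.

```python
from datetime import date


def rfm(orders, today):
    """Compute recency (days), frequency and monetary value per customer.

    orders is an iterable of (customer, order_date, amount) tuples.
    """
    summary = {}
    for customer, day, amount in orders:
        rec = summary.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0})
        rec["last"] = max(rec["last"], day)   # most recent purchase
        rec["frequency"] += 1                 # number of purchases
        rec["monetary"] += amount             # total amount spent
    return {
        customer: {
            "recency_days": (today - rec["last"]).days,
            "frequency": rec["frequency"],
            "monetary": rec["monetary"],
        }
        for customer, rec in summary.items()
    }


# Hypothetical orders.
orders = [
    ("alice", date(2017, 7, 1), 120.0),
    ("alice", date(2017, 7, 10), 80.0),
    ("bob", date(2017, 6, 15), 300.0),
]
scores = rfm(orders, today=date(2017, 7, 15))
print(scores)
```

In this sample alice bought more recently and more often, while bob spent more in total; RFM keeps those three behaviors separate instead of collapsing them into a single contribution figure.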

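• Set up your own WordPress site and learn the techniques for running it
• Collect browsing and click behavior on the site for real-time traffic monitoring, analysis of specific behaviors, and topic-based KPI visualization
• Cross-reference existing data for in-depth user and product analysis
• Feed ads back to the front end in real time to improve site performance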
Future Outlook

Contact:
orozcohsu@hotmail.com
