12. HDP Hive Exercise – DML
• Perform a table join
SELECT
D.*
,T.HOURS_LOGGED
,T.MILES_LOGGED
FROM DRIVERS D
JOIN TIMESHEET T
ON D.DRIVERID = T.DRIVERID;
13. HDP Hive Exercise – DML
• Perform a table join
SELECT
D.DRIVERID
,D.NAME
,T.TOTAL_HOURS
,T.TOTAL_MILES
FROM DEFAULT.DRIVERS D
JOIN (
SELECT
DRIVERID
,SUM(HOURS_LOGGED) AS TOTAL_HOURS
,SUM(MILES_LOGGED) AS TOTAL_MILES
FROM DEFAULT.TIMESHEET
GROUP BY DRIVERID
) T
ON (D.DRIVERID = T.DRIVERID);
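To try the aggregated join without a Hive cluster, the same query shape can be run against an in-memory SQLite database from Python. This is only a sketch: the table contents below are made-up sample rows, and SQLite stands in for Hive, so it illustrates the SQL logic rather than Hive's execution.

```python
import sqlite3

# In-memory stand-in for the DRIVERS and TIMESHEET tables (sample rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drivers (driverid INTEGER, name TEXT);
    CREATE TABLE timesheet (driverid INTEGER, hours_logged INTEGER, miles_logged INTEGER);
    INSERT INTO drivers VALUES (10, 'Driver A'), (11, 'Driver B');
    INSERT INTO timesheet VALUES (10, 70, 3300), (10, 50, 2500), (11, 60, 3000);
""")

# Aggregate the timesheet per driver first, then join the totals to the driver rows.
rows = conn.execute("""
    SELECT d.driverid, d.name, t.total_hours, t.total_miles
    FROM drivers d
    JOIN (
        SELECT driverid,
               SUM(hours_logged) AS total_hours,
               SUM(miles_logged) AS total_miles
        FROM timesheet
        GROUP BY driverid
    ) t ON d.driverid = t.driverid
    ORDER BY d.driverid
""").fetchall()

for row in rows:
    print(row)
```

Aggregating inside the subquery keeps one row per driver before the join, so the outer query does not need its own GROUP BY.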
20. • Compression format
– High-level compression (one of NONE, ZLIB, SNAPPY)
• Create a table
– create table Addresses ( name string, street string, city string, state string, zip int ) stored as orc tblproperties ("orc.compress"="NONE");
• Knowledge layer
Further reading (5)
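37. Data analysis
• Selecting customers
– Find a group of users who like perfume and fragrance (香水香氛)
– Find a group of users who follow nutritional supplements (營養補給) and business and finance (商業理財)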
SELECT NAME FROM PMART.M_WEBLOG
WHERE CAT3 LIKE '%香水香氛%';
SELECT NAME FROM PMART.M_WEBLOG
WHERE CAT2 LIKE '%營養補給%'
AND CAT3 LIKE '%商業理財%';
Look at the state of your own data to decide the query conditions
Or query with beeline (non-Chinese only); to exit, type !q:
sudo -u hive beeline -u "jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
38. Common data query tools
SparkSQL: Good for iterative processing and for accessing existing Hive tables, provided the results fit in memory
HAWQ: Good for traditional BI-like queries, star schemas, OLAP cubes
Hive (LLAP): Good for petabyte-scale data mixed with smaller tables requiring sub-second queries
Phoenix: Good way to interact with HBase tables; good with time series, good indexing
Drill, Presto: Query federation-like capabilities but limited SQL syntax; performance varies quite a bit
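41. • Go to the inner layer and access Hive data with Python
Further reading (3)
import pyhs2
import woothee

# Connect to HiveServer2 (pyhs2 targets Python 2)
conn = pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                     user='hive', password='', database='pmart')

# Print the sixth column of the first 10 rows, decoded as UTF-8
with conn.cursor() as cur:
    cur.execute("select * from m_weblog limit 10")
    for i in cur.fetch():
        print(i[5].decode('utf-8'))

# Parse the user-agent string; the query returns a single column, so use index 0
with conn.cursor() as cur:
    cur.execute("select ua from m_weblog limit 10")
    for i in cur.fetch():
        print(woothee.parse(i[0]))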
44. Recommendation system (1)
• Personalized recommender systems
– Built on customer behavior and transaction logs
– Provide recommendations within a short time
– Create cross-sell or up-sell opportunities
– Reduce bounce rate or churn rate
– Increase customer intimacy or customer stickiness
– Used in B2C online shopping, video, music, mobile advertising, etc.
45. Recommendation system (2)
• Collaborative filtering (CF) recommendation
– Based on a group of people who share the same interests or have similar experience and preferences
– Individuals supply a considerable amount of feedback through response mechanisms (e.g. item ratings such as like or dislike); this feedback drives social filtering to find people who are likely to be interested in an item
46. Recommendation system (3)
• CF is a common term used in many fields
– Tapestry system at Xerox PARC
– GroupLens / MovieLens
• http://grouplens.org/datasets/movielens/
– Amazon (item-to-item CF)
• http://www.cin.ufpe.br/~idal/rs/Amazon-Recommendations.pdf
– Facebook
• http://cs229.stanford.edu/proj2012/DavidBajajJazra-AFacebookProfileBasedTvRecommenderSystem.pdf
• https://code.facebook.com/posts/861999383875667/recommending-items-to-more-than-a-billion-people/
47. Recommendation system (4)
• User-based CF
– Collect customer preferences explicitly or implicitly
– Use a similarity statistic to find a group of N people who share customer A's interests, then use their ratings to estimate the ratings customer A has not yet given
• Top-N recommendation
• Association rule recommendation
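The user-based idea can be sketched in a few lines of Python. The ratings, the user and item names, and the choice of cosine similarity over co-rated items are all assumptions made for illustration; the slides do not prescribe a specific similarity measure or data set.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "A": {"item1": 5, "item2": 3, "item3": 4},
    "B": {"item1": 4, "item2": 3, "item3": 5, "item4": 4},
    "C": {"item1": 1, "item2": 5, "item4": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item, n=2):
    """Similarity-weighted average of the ratings given by the N most similar users."""
    neighbors = sorted(
        ((cosine(ratings[user], r), r[item])
         for name, r in ratings.items()
         if name != user and item in r),
        reverse=True,
    )[:n]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in neighbors) / total

prediction = predict(ratings, "A", "item4")
print(round(prediction, 2))
```

Because user B is more similar to A than user C is, the estimate for item4 lands closer to B's rating of 4 than to C's rating of 2.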
48. Recommendation system (5)
• Pearson correlation coefficient
• Cosine-based similarity
• Adjusted cosine similarity
• …
S13 => similarity between item 1 and item 3
Compute the similarity between every pair of items
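The three similarity measures listed above can be compared on a small example. This is a minimal sketch on an invented, fully rated 3-user by 3-item matrix; real rating matrices are sparse and need a rule for missing values, which is not handled here.

```python
from math import sqrt

def cosine(x, y):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def pearson(x, y):
    """Pearson correlation: cosine after subtracting each vector's own mean."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

def adjusted_cosine(x, y, user_means):
    """Adjusted cosine: cosine after subtracting each user's mean rating."""
    return cosine([a - m for a, m in zip(x, user_means)],
                  [b - m for b, m in zip(y, user_means)])

# Hypothetical ratings: rows are users, columns are item 1, item 2, item 3.
matrix = [
    [5, 3, 4],
    [3, 1, 2],
    [4, 2, 5],
]
user_means = [sum(row) / len(row) for row in matrix]
item1 = [row[0] for row in matrix]
item3 = [row[2] for row in matrix]

s13_cosine = cosine(item1, item3)
s13_pearson = pearson(item1, item3)
s13_adjusted = adjusted_cosine(item1, item3, user_means)
print(s13_cosine, s13_pearson, s13_adjusted)
```

The three values differ because each measure removes a different baseline: none, the item's mean, or each user's mean.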
49. Recommendation system (6)
• Item-based CF
– Items frequently co-occur
– Count the item pairs that appear together most often in orders and recommend from them
• Content-based CF
– Categorize items by their content and compute item similarity from those categories, giving a more abstract recommendation (the categories serve as features for prediction)
Reference: http://files.grouplens.org/papers/www10_sarwar.pdf
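Counting co-occurring item pairs can be sketched as follows. The orders and item names are invented, and ranking by raw pair counts is the simplest possible scoring; it is an assumption here, not the method of the referenced paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order history: each order is a list of item names.
orders = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs", "bread"],
]

def cooccurrence(orders):
    """Count how often each unordered item pair appears in the same order."""
    counts = Counter()
    for order in orders:
        for a, b in combinations(sorted(set(order)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(item, counts, top_n=2):
    """Return the items that most often co-occur with the given item."""
    scores = Counter()
    for (a, b), count in counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]

counts = cooccurrence(orders)
print(recommend("milk", counts))
```

Sorting each order before pairing guarantees that a pair is always stored under one key, whichever order the items were bought in.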
50. Recommendation system (7)
• Model-based CF
– Build a model from historical data (user-based or item-based)
– The model is then used to make forecasts
– Building the model in advance saves time and improves response efficiency
51. Recommendation system (8)
• Knowledge-based CF
• Hybrid ensemble-based CF
• Context-sensitive CF
– Time
– Location
• Advanced CF topics
Reference:
http://rd.springer.com/book/10.1007%2F978-3-319-29659-3
Feedback (reciprocity) is very important to a recommender system
52. Recommendation system (9)
• Slope One
– A family of CF algorithms, in the same line as user-based and item-based CF
– Easy to compute, with good accuracy
user   itemA  …  itemB
UserA  1      …  1.5
…      …      …  …
UserB  2      …  ?
1.5 - 1 = ? - 2
user   itemA  itemB  itemC
UserA  5      3      7
UserB  3      4      ?
UserC  ?      2      9
[(5-3)+(3-4)]/2 = 0.5
? = 2 + 0.5 = 2.5
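The arithmetic on this slide can be reproduced with a short Python sketch that uses the three-user table above. The function names are illustrative, and the weighted variant shown in `predict` (which combines all of a user's rated items, weighted by how many users support each deviation) is an assumption about which Slope One variant is meant.

```python
# Ratings from the three-user table on the slide (unknown cells are simply absent).
ratings = {
    "UserA": {"itemA": 5, "itemB": 3, "itemC": 7},
    "UserB": {"itemA": 3, "itemB": 4},
    "UserC": {"itemB": 2, "itemC": 9},
}

def average_deviation(ratings, target, other):
    """Average of (target - other) over users who rated both items, plus the user count."""
    diffs = [r[target] - r[other] for r in ratings.values()
             if target in r and other in r]
    if not diffs:
        return 0.0, 0
    return sum(diffs) / len(diffs), len(diffs)

def predict(ratings, user, target):
    """Weighted Slope One: combine every rated item, weighted by its support count."""
    numerator = denominator = 0.0
    for other, value in ratings[user].items():
        if other == target:
            continue
        deviation, count = average_deviation(ratings, target, other)
        numerator += (value + deviation) * count
        denominator += count
    return numerator / denominator if denominator else None

# Slide calculation: itemA is on average 0.5 higher than itemB,
# so UserC's itemA estimated from itemB alone is 2 + 0.5.
deviation, support = average_deviation(ratings, "itemA", "itemB")
from_item_b = ratings["UserC"]["itemB"] + deviation
print(deviation, from_item_b)

# Using itemB and itemC together gives a different estimate.
print(predict(ratings, "UserC", "itemA"))
```

The slide's 2.5 uses itemB only; once itemC is also taken into account, the weighted estimate shifts, which shows how much the result depends on which item pairs are included.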
53. Recommendation system (10)
• CF pros
– Finds items of interest easily, without complex computation
– Can produce surprising recommendations (driven by behavior data)
• CF cons (challenges)
– Sparsity
• Hard to find a usable subset of targets in a sparse matrix
– Scalability
• Hard to respond quickly when the computation is large
– Accuracy
• Categorization and weighting introduce bias in the transformation stage
– Cold start
• New customers or items have no history yet
54. Recommendation system (11)
• In a live system, run A/B tests
(confusion matrix)
• Offline evaluation
– Training / testing datasets
– 0.5 vs 0.5 split
– Hold-out test (0.9 vs 0.1)
– Cross-validation (k-fold)
– R², AIC and BIC
– Stepwise validation
Variables:
• Verify & Validate
Model:
• Evaluation
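The hold-out and k-fold splits listed under offline evaluation can be sketched with the standard library alone. The function names, the fixed random seed, and the integer sample data are assumptions for illustration.

```python
import random

def hold_out(data, test_ratio=0.1, seed=42):
    """Shuffle once, then split into training and testing parts (default 0.9 vs 0.1)."""
    items = list(data)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * (1 - test_ratio)))
    return items[:cut], items[cut:]

def k_fold(data, k=5, seed=42):
    """Yield (train, test) pairs; each record lands in the test part exactly once."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

train, test = hold_out(range(100))
print(len(train), len(test))

for fold_train, fold_test in k_fold(range(10), k=5):
    print(sorted(fold_test))
```

A 0.5 vs 0.5 split is the same `hold_out` call with `test_ratio=0.5`.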
55. Recommendation system (12)
• Recommendation performance
– Coverage
• The proportion of items for which forecasts can be produced in a real evaluation
– MAE (Mean Absolute Error), RMSE (Root Mean Squared Error)
• The size of the difference between forecast and actual results (recommended orders compared with real ones)
– ROC curve
• The closer the curve bends toward the upper left, the better the ROC value
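MAE, RMSE and coverage are simple enough to compute by hand. In the sketch below the sample values are invented, and coverage is taken to mean the share of catalog items that appear in at least one recommendation, which is one common reading of the term and an assumption here.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error between actual and forecast ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def coverage(recommended_items, catalog):
    """Share of catalog items that were recommended at least once."""
    catalog = set(catalog)
    return len(set(recommended_items) & catalog) / len(catalog)

# Hypothetical actual and forecast ratings for four user-item pairs.
actual = [4, 3, 5, 2]
predicted = [3.5, 3, 4, 3]

print(mae(actual, predicted))
print(rmse(actual, predicted))
print(coverage(["a", "b", "a"], ["a", "b", "c", "d"]))
```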
63. What is Mahout (1)
• Standalone mode / Hadoop distributed mode
• Provides many common algorithms
• Mahout-Samsara: interactive shell on a Spark cluster (since 0.10) with an R-like, MATLAB-like linear algebra library
• Batch processing for back-end analysis
– Combine with Apache Storm (speed layer) for a lambda architecture
Mahout: 0.13.1 – May 13, 2017
64. What is Mahout (2)
• After logging in to the system, run mahout
– mahout
– rpm -qa | grep mahout
65. What is Mahout (3)
Mahout runs in the batch layer of the back-end process
68. Alternating Least Squares Algorithm (1)
• The rating table (R matrix) has many sparse columns
• Compared with the SVD algorithm, ALS is much simpler and faster
• Comes from the Netflix competition
• Build a matrix A and fit it to R:
69. Alternating Least Squares Algorithm (2)
• Matrix of the ratings users give to items
Rating matrix → Recommendation matrix = User features matrix X Item features matrix
In practice, U and M are non-human-readable files in the Mahout results
The least squares method works by making the squared difference between the two sides of the (function's) equation as small as possible
70. Alternating Least Squares Algorithm (3)
• How to find the optimal point
– Move more slowly as you get close to the optimal value
• Overfitting
Smoothing
71. Alternating Least Squares Algorithm (4)
• Minimize the objective function (lambda is a coefficient, called L1 regularization)
• Use the least squares method
• Stopping condition / convergence on
– Number of iterations
– Δ RMSE
Correction: L1 norm
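The alternating procedure and the two stopping conditions can be sketched in plain Python. Several things here are assumptions: the penalty is a squared (ridge-style) term on the factor vectors, which is how ALS is usually written and differs from the L1 wording on the slide; the ratings reuse the small Slope One table; and the function names are illustrative and unrelated to Mahout's implementation.

```python
import random
from math import sqrt

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def update(rows, fixed, k, lam):
    """For each row of known ratings r, solve (F^T F + lam I) x = F^T r."""
    result = []
    for known in rows:  # known: {index into fixed: rating}
        a = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for j, rating in known.items():
            f = fixed[j]
            for p in range(k):
                b[p] += rating * f[p]
                for q in range(k):
                    a[p][q] += f[p] * f[q]
        result.append(solve(a, b))
    return result

def rmse(ratings, u, m):
    """RMSE over the known ratings only."""
    errors = [(r - sum(x * y for x, y in zip(u[i], m[j]))) ** 2
              for (i, j), r in ratings.items()]
    return sqrt(sum(errors) / len(errors))

def als(ratings, n_users, n_items, k=2, lam=0.1, max_iter=50, tol=1e-4, seed=0):
    rng = random.Random(seed)
    m = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = [{j: r for (i, j), r in ratings.items() if i == u} for u in range(n_users)]
    by_item = [{i: r for (i, j), r in ratings.items() if j == v} for v in range(n_items)]
    previous = float("inf")
    for _ in range(max_iter):            # stopping condition 1: iteration limit
        u = update(by_user, m, k, lam)   # fix M, solve for U
        m = update(by_item, u, k, lam)   # fix U, solve for M
        error = rmse(ratings, u, m)
        if abs(previous - error) < tol:  # stopping condition 2: small delta RMSE
            break
        previous = error
    return u, m, error

# (user index, item index) -> rating; missing pairs are the unknown cells.
ratings = {(0, 0): 5, (0, 1): 3, (0, 2): 7, (1, 0): 3, (1, 1): 4, (2, 1): 2, (2, 2): 9}
u, m, error = als(ratings, n_users=3, n_items=3)
print(error)
print(sum(x * y for x, y in zip(u[1], m[2])))  # estimate for the unknown cell (user 1, item 2)
```

Each half-step is an ordinary least squares problem because one factor matrix is held fixed, which is what makes the method simple compared with optimizing U and M jointly.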
72. Further reading
• Finding the optimal solution from the cost function (optimization)
– Gradient descent
– Least squares method
– Newton's method
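Two of the listed methods can be compared on a one-dimensional cost function. The cost function (x - 3)^2 + 1, the learning rate, and the tolerances are all chosen for illustration only.

```python
def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10000):
    """Step against the gradient until the step becomes negligible."""
    x = x0
    for i in range(max_iter):
        step = lr * grad(x)
        x -= step
        if abs(step) < tol:
            break
    return x, i + 1

def newton(grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton's method: scale the gradient by the inverse second derivative."""
    x = x0
    for i in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x, i + 1

# Cost function: cost(x) = (x - 3)^2 + 1, minimized at x = 3.
def grad(x):
    return 2 * (x - 3)

def hess(x):
    return 2.0

x_gd, steps_gd = gradient_descent(grad, x0=0.0)
x_newton, steps_newton = newton(grad, hess, x0=0.0)
print(x_gd, steps_gd)
print(x_newton, steps_newton)
```

On a quadratic cost Newton's method reaches the minimum in a single step, while gradient descent approaches it gradually, with steps that shrink as it gets closer to the optimum.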
78. What is Sqoop (1)
• SQL to Hadoop (import/export)
• Connects over JDBC
• Uses mapper functions to fetch the data (default: 4 mappers)
79. What is Sqoop (2)
• Each mapper uses split-by to fetch its own batch of data
• Use boundary-query for predicate pushdown queries
• Supports multiple writers
RDBMS → mapper1, mapper2, mapper3 → HDFS
mapper1: age>0 & age < 30
mapper2: age>30 & age < 50
mapper3: age>50 & age < 80
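How a split-by column turns into one range per mapper can be illustrated in Python. This is a simplified sketch of the idea, not Sqoop's actual code: it assumes an integer column whose minimum and maximum are already known (the values a boundary query would supply), and `split_ranges` is a hypothetical helper name.

```python
def split_ranges(column, lo, hi, num_mappers=4):
    """Divide [lo, hi] into contiguous ranges, one WHERE clause per mapper."""
    size = (hi - lo) / num_mappers
    bounds = [lo + round(size * i) for i in range(num_mappers)] + [hi]
    clauses = []
    for i in range(num_mappers):
        # Only the last range includes its upper bound, so no row is read twice or skipped.
        upper_op = "<=" if i == num_mappers - 1 else "<"
        clauses.append(
            f"{column} >= {bounds[i]} AND {column} {upper_op} {bounds[i + 1]}"
        )
    return clauses

for clause in split_ranges("age", 0, 80, num_mappers=4):
    print(clause)
```

Using half-open ranges matters: with strict inequalities on both ends, as drawn in the diagram, rows whose value equals a boundary such as 30 or 50 would not be fetched by any mapper.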
89. Building a recommender system (12)
• Score items by web page clicks
• Import the training dataset
– mysql -uroot iii -p -e "LOAD DATA LOCAL INFILE '~/webRecommend/weblog.csv' INTO TABLE weblog FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'"
93. Building a recommender system (16)
• Parse the file format (output in /tmp)
– cd ~/webRecommend && python mahout_to_mysql.py
• Upload the data into the DB
– mysql -uroot iii -p -e "LOAD DATA LOCAL INFILE '/tmp/mahout_result.csv' INTO TABLE recommend FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n'"
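The parsing step might look roughly like the sketch below. It is not the contents of mahout_to_mysql.py, which the slides do not show. It assumes the recommender output uses lines of the form userID<TAB>[itemID:score,itemID:score,...] and that the recommend table takes user, item, score columns in that order; both are assumptions to check against the real files.

```python
import csv
import io

def parse_line(line):
    """Turn 'user<TAB>[item:score,item:score]' into (user, item, score) rows."""
    user, recommendations = line.strip().split("\t")
    rows = []
    for pair in recommendations.strip("[]").split(","):
        if pair:
            item, score = pair.split(":")
            rows.append((int(user), int(item), float(score)))
    return rows

def to_csv(lines, out):
    """Write one CSV row per recommendation, using '\\n' line endings for LOAD DATA."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerows(parse_line(line))

# Hypothetical recommender output lines.
sample = ["1\t[104:4.5,102:3.2]", "2\t[101:5.0]"]
buffer = io.StringIO()
to_csv(sample, buffer)
print(buffer.getvalue())
```

To produce a file instead of an in-memory buffer, pass an open file handle (for example one opened on /tmp/mahout_result.csv) as `out`.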
94. Building a recommender system (17)
• View the recommend table
– mysql -u root iii -p -e "select * from iii.recommend;"