[CS570] Machine Learning Team Project (I know what items really are)

2013. 06. 13(Thu)
Team 11. Junghyun Kwon
Kunwoo Park
Jongin Lee
Seungkyu Nam
I know
what items really are

• Problem
• Challenges
• Related works
• Motivation
• Approaches
• Experiment setup
• Feature extraction
• Result
• Discussion
Contents

• Purpose of Track 1 of KDD Cup 2012
• Predict which users (or items) a Weibo user might follow.
• Recommendation system [1]
• Saves users valuable time otherwise spent sifting through less relevant stories
• Increases customer satisfaction
Problem
Twitter.com

• 90% of the world's data was generated in the last three years
• 1.0 × 10^16 bytes every day
• Sensors, mobile, SNS, online transactions
• 10 billion tweets every day
• 30 billion FB messages every day (*)
• …
Problem
Source: http://goo.gl/9xXaG
*: BLOTER.NET 12.01.26

• Problem
• Too much data to find the informative features
• 80 million training examples, plus large user and item metadata
• Few accepted results compared with many rejected results
• Data processing takes too much time
• SVM on all data: 16 days
• Lack of computing resources
• Our goal
• Train on as much of the large, complex Weibo data as possible on a single machine
• Find effective features with a simpler (and faster) approach
Challenges

• Online learning [2],[3]
• Learns one instance at a time
• Ex. product search
• Pro – minimizes some performance criterion
• Con – much of the label feedback is incorrect
• Map-Reduce [4]
• Parallel, distributed model for processing large data
• Pro – good for large amounts of input, intermediate, and output data
• Con – poorly suited to data that requires synchronization
Related works

Motivation
User: Keywords, Year of birth, Gender, Number of tweets, Tag-ids
Item: Category, Keywords
Which item does which user prefer?

Motivation
User
Item
User IDs in User_profile.txt include item IDs in item.txt.

Motivation
[Figure: user metadata (Keywords, Year of birth, Gender, Number of tweets, Tag-ids) and item metadata (Category, Keywords, plus the item's own Keywords, Year of birth, Gender, Number of tweets, Tag-ids) are combined into Feature 1 to Feature 5.]
Our training data!

• Extract features between users and items using the metadata of users and items.
• Train the model with a Support Vector Machine (SVM)
• LIBSVM in R
Initial Approach
Failure!
Too much computation time: 16 days to train the SVM
Lack of computational resources: a single machine

• Apply logistic regression using stochastic gradient
descent
• Logistic regression
• Stochastic gradient descent
Alternative Approach
[Equations: update rules for stochastic gradient descent and for gradient descent]
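As a rough illustration of the alternative approach, the sketch below trains a logistic regression with stochastic gradient descent, updating the weights after every single example instead of after a full pass over the data. It is a minimal pure-Python sketch, not the team's implementation: the function names, the learning rate, the epoch count, and the toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_sgd(rows, labels, lr=0.1, epochs=20, seed=0):
    """Fit logistic regression with SGD (hypothetical helper).

    rows   : list of feature vectors (lists of floats)
    labels : list of 0/1 labels (1 = accepted, 0 = rejected)
    """
    rng = random.Random(seed)
    w = [0.0] * len(rows[0])
    b = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit examples in random order
        for i in order:
            x, y = rows[i], labels[i]
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            # One update per example: this is the "stochastic" part.
            w = [wj - lr * g * xj for wj, xj in zip(w, x)]
            b -= lr * g
    return w, b


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data: one z-scored feature, positive values tend to be accepted.
rows = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]
w, b = train_sgd(rows, labels)
print(predict_proba(w, b, [1.5]), predict_proba(w, b, [-1.5]))
```

Because each update touches only one example, memory use stays constant in the number of training pairs, which is what makes this kind of model feasible on a single machine.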

1. Training data (73,209,277 user-item pairs)
- after applying the target ID: 38,332,489 user-item pairs
2. Test data (public, 2,617,106 user-item pairs)
3. Features used
- User's number of tweets
- User's number of tags
- Age similarity
- Item's number of tweets
- Item's number of tags
- Gender similarity
- Network similarity
- Item's number of followers
- Keyword similarity
4. Construct a separate model for each feature
5. Evaluation metrics: F1 score, MAP@3
6. Baseline: random prediction
Experiment Setup

• Age similarity = zscore( |user_age − item_age| )
• Gender similarity:
$$\text{Gender similarity} = \begin{cases} 1 & \text{if same gender} \\ -1 & \text{if different gender} \\ 0 & \text{if unknown gender} \end{cases}$$
• Z-scored number of tweets from user
• Z-scored number of tweets from item
• Z-scored number of tags from user
• Z-scored number of tags from item
• Z-scored number of followers of item
Feature Extraction
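The scalar features above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the z-score uses the population standard deviation, and the code `0` for an unknown gender is an assumed encoding.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sigma for v in values]


def gender_similarity(user_gender, item_gender, unknown=0):
    """1 if same gender, -1 if different, 0 if either one is unknown."""
    if user_gender == unknown or item_gender == unknown:
        return 0
    return 1 if user_gender == item_gender else -1


# Hypothetical (user_age, item_age) pairs.
pairs = [(25, 30), (40, 22), (33, 33)]
age_similarity = zscore([abs(u - i) for u, i in pairs])
print(age_similarity)
print(gender_similarity(1, 2), gender_similarity(2, 2), gender_similarity(0, 1))
```

Note that the z-score is computed over the whole column of pairs, so the mean and standard deviation would have to be taken from the training data and reused for the test data.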

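• Keyword similarity = $\dfrac{\text{user\_keyword} \cdot \text{item\_keyword}}{\lVert \text{user\_keyword} \rVert \, \lVert \text{item\_keyword} \rVert}$ (cosine similarity)
1. Remove keywords with low document frequency (DF under 20%). (255,141 → 2,507)
2. Using PCA, reduce the dimension (2,507 → 1,191) by choosing the smallest k (k = 1, ..., N, where N is the total number of principal components) such that
$$\text{error} = 1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \le 0.05$$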
Feature Extraction
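Both parts of the keyword feature can be sketched briefly: the cosine similarity between two keyword-weight vectors, and the rule for picking the number of principal components k from the PCA eigenvalues. The sketch assumes sparse vectors stored as dictionaries and takes the eigenvalues as given; the function names and sample numbers are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors (dict: keyword -> weight)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no keywords on one side: treat as no similarity
    return dot / (norm_u * norm_v)


def choose_k(eigenvalues, max_error=0.05):
    """Smallest k whose top-k eigenvalues leave at most max_error of the
    total variance unexplained."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for k, lam in enumerate(lams, start=1):
        running += lam
        if 1.0 - running / total <= max_error:
            return k
    return len(lams)


user_kw = {"music": 1.0, "movie": 1.0}
item_kw = {"music": 1.0}
print(cosine_similarity(user_kw, item_kw))
print(choose_k([60, 30, 9, 1]))
```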

• Network similarity = $\dfrac{|\text{Following}(\text{user}) \cap \text{Following}(\text{item})|}{|\text{Following}(\text{user}) \cup \text{Following}(\text{item})|}$
Following(u): set of nodes that user u follows
Feature Extraction
[Figure: a user and an item share 2 of the 5 distinct nodes they follow, so network similarity = 2/5 = 0.4]
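The network similarity is the Jaccard index of the two following sets. A minimal sketch, with a hypothetical function name and made-up node IDs chosen to reproduce the 2/5 example:

```python
def network_similarity(following_user, following_item):
    """Jaccard index of the sets of nodes followed by a user and an item."""
    a, b = set(following_user), set(following_item)
    union = a | b
    if not union:
        return 0.0  # neither side follows anyone
    return len(a & b) / len(union)


# 2 shared nodes out of 5 distinct nodes overall.
print(network_similarity({1, 2, 3, 4}, {3, 4, 5}))
```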

• Homophily
• Similar people get together!
Age similarity, gender similarity
Background of choosing features

• Friend recommendation in Facebook
Background of choosing features
Common Friends
Works!!!!!

Results
• All models outperformed the random predictor
• Among the single-feature models, network similarity showed the highest F1 score
• The model using all features showed the best performance
• The Top-5 model covers more accepted items than the model using all features
• Interestingly, predictions made with only two features, age similarity and network similarity, gave results similar to the Top-5 model.

• Contribution
• Successfully trained on a large data set with a light classifier
• Found many features by analyzing metadata
• We saw the unseen
• Limitation
• Our models showed fairly good prediction results, but they do not reach the level of the KDD Cup winners
• Possible solution: ensemble learning
• to build the best model from multiple weak classifiers (predictors)
Discussion

• Power of feature scaling
• Importance of learning rate
• Difficulty of handling Big Data
• Data reduction techniques are essential for handling high-dimensional data.
What we learned 

[1] Phelan, Owen, Kevin McCarthy, and Barry Smyth. "Using twitter to recommend real-time topical news." Proceedings of the third ACM conference on Recommender systems. ACM, 2009.
[2] Littlestone, Nick. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm." Machine learning 2.4 (1988): 285-318.
[3] Mairal, Julien, et al. "Online learning for matrix factorization and sparse coding." The Journal of Machine Learning Research 11 (2010): 19-60.
[4] Tang, Jie, et al. "Social influence analysis in large-scale networks." Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009.
References

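• F-score
$$F = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = 2 \times \frac{\frac{tp}{tp+fp} \times \frac{tp}{tp+fn}}{\frac{tp}{tp+fp} + \frac{tp}{tp+fn}} = 2 \times \frac{tp}{2 \times tp + fp + fn}$$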
Baseline for F-score
total : 2617106
tp(true positive) : 30792
fp(false positive) : 1276492
tn(true negative) : 1279030
fn(false negative) : 30792
precision : 0.0235541779751
recall : 0.5
f-score : 0.0449889982087
rec_log_test.txt (target_user & public)
http://en.wikipedia.org/wiki/F1_score
Random Prediction
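The random-prediction baseline can be checked directly from the confusion counts listed above. The helper name is hypothetical; the counts are the ones reported on the slide, and the result should match the reported precision, recall, and F-score up to rounding.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Counts reported for the random predictor.
tp, fp, tn, fn = 30792, 1276492, 1279030, 30792
print(tp + fp + tn + fn)  # total number of test pairs
print(precision_recall_f1(tp, fp, fn))
```

The low precision at a recall of 0.5 reflects the class imbalance noted earlier: accepted pairs are a small fraction of the test set, so a coin flip labels far more rejected pairs as accepted than true ones.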

• MAP@3 (Mean Average Precision)
• $\mathrm{ap@}n = \dfrac{\sum_{k=1}^{n} P(k)}{\text{number of items clicked in } m \text{ items}}$
• $\mathrm{AP@}n = \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{ap@}n_i$
Baseline for MAP@3
rec_log_test.txt(target_user&public)
https://www.kddcup2012.org/c/kddcup2012-track1/details/Evaluation
(UserId)\t(ItemId)\t(Result)\t(Unix-timestamp)
(UserId)\t(ItemId)\t(ItemId)\t(ItemId)
MAP@3 0.034106932193
Random Prediction
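A minimal sketch of the MAP@3 computation follows. It assumes that P(k) is the precision at cut-off k when the k-th recommended item was clicked and 0 otherwise, that the denominator is the number of items the user clicked, and that each recommendation list contains no duplicates. The function names and sample IDs are hypothetical.

```python
def ap_at_n(recommended, clicked, n=3):
    """Average precision at n for one user."""
    clicked = set(clicked)
    if not clicked:
        return 0.0  # nothing was clicked, so nothing can be credited
    hits = 0
    total = 0.0
    for k, item in enumerate(recommended[:n], start=1):
        if item in clicked:
            hits += 1
            total += hits / k  # P(k): precision at cut-off k
    return total / len(clicked)


def map_at_n(recs_by_user, clicks_by_user, n=3):
    """Mean of ap@n over all users that received recommendations."""
    users = list(recs_by_user)
    if not users:
        return 0.0
    scores = [ap_at_n(recs_by_user[u], clicks_by_user.get(u, ()), n) for u in users]
    return sum(scores) / len(users)


recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
clicks = {"u1": {"a", "c", "d"}, "u2": set()}
print(map_at_n(recs, clicks))
```

For user u1 the first and third recommendations are hits, giving (1/1 + 2/3) / 3; user u2 clicked nothing and contributes 0.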
