Lastfm crawler


Big social networks rarely disclose their data to outside parties, since their data and users are what add value to their business. So in order to estimate parameters such as the average age or other user characteristics or interests, we need to access the nodes individually. How these nodes are accessed depends on the network. In the case of LastFM it is easy to pick a node and from there select one of its friends at random. This technique is called a random walk (RW): a path generated by repeatedly moving to a randomly selected neighbour. Once the LastFM crawler has travelled through a significant number of nodes, it is possible to estimate the average of the desired metrics.

The problem with this simple implementation is that, although one might think that picking friends uniformly at random gives every node the same probability of being accessed, globally the nodes with higher degrees (number of friends) have a higher probability of being visited. For example, if we are estimating the number of friends, and friends with higher degree are more likely to be visited, then the estimated number of friends will be higher than the real value.

To fix this problem we used another type of random walk called RWRW (re-weighted random walk), which in this case weights each node's metric according to its degree while calculating the average. This way, while estimating the number of friends, nodes with high degree are still visited with higher probability, but their impact on the estimated value is lower. This technique gave more accurate estimates; check the graphs and values we obtained in the presentation. Also, I might add that it is very easy to experiment with LastFM using their API: http://www.last.fm/api Check my blog: www.marioalmeida.eu
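The bias described above can be reproduced on a toy graph. The following is a minimal sketch, not the crawler's code: the wheel-shaped graph, the function names and the seed are all assumptions chosen for illustration. It walks the graph, then estimates the average degree once with a plain mean (RW) and once with samples weighted by 1/degree (RWRW).

```python
import random

def wheel_graph(rim):
    """Hypothetical toy network: hub node 0 is friends with every rim node,
    and rim nodes 1..rim form a ring."""
    graph = {0: list(range(1, rim + 1))}
    for i in range(1, rim + 1):
        left = rim if i == 1 else i - 1
        right = 1 if i == rim else i + 1
        graph[i] = [0, left, right]
    return graph

def random_walk(graph, start, steps, rng):
    """Visit `steps` nodes, moving each time to a uniformly chosen friend."""
    node, visited = start, []
    for _ in range(steps):
        visited.append(node)
        node = rng.choice(graph[node])
    return visited

graph = wheel_graph(8)            # hub has degree 8, rim nodes have degree 3
rng = random.Random(42)           # fixed seed so the run is repeatable
walk = random_walk(graph, start=1, steps=50_000, rng=rng)
degrees = [len(graph[v]) for v in walk]

# Ground truth, available only because the toy graph is fully known.
true_mean = sum(len(friends) for friends in graph.values()) / len(graph)

# RW: plain average over visited nodes (the hub is over-represented).
rw_estimate = sum(degrees) / len(degrees)

# RWRW: weight each sample by 1/degree. For the degree metric the numerator
# sum(d * 1/d) is just the sample count.
rwrw_estimate = len(degrees) / sum(1 / d for d in degrees)

print(f"true {true_mean:.3f}  RW {rw_estimate:.3f}  RWRW {rwrw_estimate:.3f}")
```

On this graph the plain average drifts towards the hub's degree, while the re-weighted average stays near the true mean of 32/9.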

last.fm crawler
RW vs RWRW
Mário Almeida malmeida@kth.se
Zafar Gilani szuhgi@kth.se
Arinto Murdopo arinto@kth.se

Outline
● Parameters
● Methodology
● Results
● Challenges
● Conclusion

Parameters
1. Playcounts
2. Playlists
3. Ages
4. IDs
5. Number of friends (degrees)

Compare average using RW and RWRW!

Methodology
Used lastfm APIs to obtain
● user info
● number of friends (degree)
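As a rough sketch of what such API requests can look like, the snippet below only builds request URLs and makes no network call. It assumes the Last.fm web service root and the `user.getInfo` and `user.getFriends` methods from the public API documentation; the slides do not say which calls the crawler actually made, and `some_user`, `YOUR_API_KEY` and the helper name are placeholders.

```python
from urllib.parse import urlencode

# Assumed root of the Last.fm web service (version 2.0 of the API).
API_ROOT = "http://ws.audioscrobbler.com/2.0/"

def build_request_url(method, user, api_key, **extra):
    """Hypothetical helper: build the GET URL for a user-related API method."""
    params = {"method": method, "user": user,
              "api_key": api_key, "format": "json"}
    params.update(extra)
    return API_ROOT + "?" + urlencode(params)

# User info (profile fields such as playcount) for one user.
info_url = build_request_url("user.getInfo", "some_user", "YOUR_API_KEY")

# One page of the user's friends; the walk picks its next node from this list.
friends_url = build_request_url("user.getFriends", "some_user",
                                "YOUR_API_KEY", limit=50)

print(info_url)
print(friends_url)
```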
RW with UIS-WR
We applied the following RW formula:
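The formula shown on the slide did not survive in this text version. As a hedged reconstruction, assuming UIS-WR stands for uniform independence sampling with replacement, the RW estimate would be the plain sample mean over the n visited users. The symbols n, f and v_i are notation introduced here, not taken from the slide: v_i is the i-th visited user and f(v_i) is the metric being averaged (age, playcount, and so on).

```latex
\hat{\mu}_{\mathrm{RW}} = \frac{1}{n} \sum_{i=1}^{n} f(v_i)
```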

Methodology
For RWRW, we apply:

The weight Wv is set to number of friends (degree)
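The slide's formula is missing from this text version. A common form of the re-weighted estimator, given here as an assumption rather than as the authors' exact expression, divides each sample by its weight and normalizes by the sum of inverse weights. Here v_i is the i-th visited user, f(v_i) its metric, n the number of samples, and w_{v_i} its degree, matching the weight Wv above.

```latex
\hat{\mu}_{\mathrm{RWRW}} =
  \frac{\sum_{i=1}^{n} f(v_i) / w_{v_i}}{\sum_{i=1}^{n} 1 / w_{v_i}}
```

For example, three samples with degrees 2, 4, 4 and ages 20, 30, 30 give (20/2 + 30/4 + 30/4) / (1/2 + 1/4 + 1/4) = 25 / 1 = 25, whereas the plain mean of the same samples is about 26.7.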

Results
Crawled for ~10 hours
Number of samples: 48000
Number of age samples: 36363 (not all users show their age)

Results - Ages

RW estimates lower average age values.
After about 25k samples, the age stabilizes.
There is a big correlation between age and the degree.

Results - Playlists

Most users do
not have
playlists.

RW estimates higher
numbers of playlists. Users
with higher degrees tend to
have more playlists.

Results - Playcounts

We found some
users having
playcounts in the
order of millions.
RW estimates higher
playcounts. Users with
higher degree tend to
have higher playcounts

Results - IDs

Not yet stable.

RW estimates a lower
average ID compared to
RWRW. A user with a lower
ID generally has a higher
degree.

Results - Degrees

RWRW reduces the bias towards high-degree
nodes, which have a higher probability of
being visited. The RWRW estimate is
indeed close to the expected degree
value.

Conclusion
● A simple random walk in a social network
generally results in biased averages.
○ A node with higher degree has a higher probability of
being discovered.
● RWRW normalizes the averages.
○ High variations do not abruptly impact the
estimation.
○ RWRW reduces the biases of RW.
● Low variance means a smaller difference
between the RW and RWRW estimates.
● Crawling lastfm poses many challenges
○ e.g.: 0 degree, banned user, huge playcounts
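The slides do not say how the crawler dealt with these cases. One possible way to keep a walk going past users with no friends or banned users is sketched below. Everything in it is hypothetical: the function name, the use of `LookupError` to signal a banned user, and the choice to restart from a seed list. Note that restarting from seeds changes the sampling distribution slightly.

```python
import random

def next_user(current, get_friends, seeds, rng):
    """Pick the next user of the walk, restarting from a seed on a dead end.

    `get_friends` is a caller-supplied function (hypothetical interface) that
    returns a list of friend names and raises LookupError for banned users.
    """
    try:
        friends = get_friends(current)
    except LookupError:
        friends = []                 # banned or missing user: treat as dead end
    if not friends:
        return rng.choice(seeds)     # 0 degree: restart from a known-good seed
    return rng.choice(friends)       # normal step: uniform choice among friends

# Toy data standing in for API responses.
toy_friends = {"alice": ["bob"], "bob": []}

def toy_get_friends(user):
    if user not in toy_friends:
        raise LookupError(user)      # simulate a banned user
    return toy_friends[user]

rng = random.Random(0)
step_normal = next_user("alice", toy_get_friends, ["alice"], rng)
step_dead_end = next_user("bob", toy_get_friends, ["alice"], rng)
step_banned = next_user("banned_user", toy_get_friends, ["alice"], rng)
print(step_normal, step_dead_end, step_banned)
```

Skipping 0-degree users also avoids a division by zero when samples are weighted by 1/degree.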

Questions
Check the code in:
● http://code.google.com/p/lastfm-rwrw/
