Reference classes: a case study with the poweRlaw package
Colin Gillespie
Newcastle University, UK

Power-law distributions have been used extensively to characterise many disparate scenarios, inter alia, the sizes of moon craters and annual incomes. Recently, power laws have even been used to characterise terrorist attacks and interstate wars. However, for every correct characterisation that a particular process obeys a power law, there are many systems that have been incorrectly labelled as being scale-free.

Part of the reason for incorrectly categorising systems with power-law properties is the lack of easy-to-use software. The poweRlaw package aims to tackle this problem by allowing multiple heavy-tailed distributions to be fitted within a standard framework. Within this package, different distributions are represented using reference classes. This enables a consistent interface to be constructed for plotting and parameter inference.

This talk will describe the advantages (and disadvantages) of using reference classes. In particular, it will show how reference classes can be leveraged to allow fast, efficient computation via parameter caching. The talk will also touch upon potential difficulties such as combining reference classes with parallel computation.

  1. Reference classes: a case study with the poweRlaw package
     Colin Gillespie, Newcastle University, UK
     http://aperiodical.com/2013/01/log-log-whos-there-not-a-power-law/
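  2. The power law distribution

     | Name        | f(x)                                                                  | Notes                                      |
     |-------------|-----------------------------------------------------------------------|--------------------------------------------|
     | Power law   | $x^{-\alpha}$                                                         | Pareto distribution                        |
     | Log-normal  | $\frac{1}{x}\exp\left(-\frac{(\ln(x)-\mu)^2}{2\sigma^2}\right)$       |                                            |
     | Exponential | $e^{-\lambda x}$                                                      |                                            |
     | Power law   | $x^{-\alpha}$                                                         | Zeta distribution                          |
     | Power law   | $x^{-\alpha}$                                                         | $x = 1, \ldots, n$, Zipf's distribution    |
     | Yule        | $\Gamma(x)/\Gamma(x+\alpha)$                                          |                                            |
     | Poisson     | $\lambda^x/x!$                                                        |                                            |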
  4. Alleged power-law phenomena
     • The frequency of occurrence of unique words in the novel Moby Dick by Herman Melville
     • The numbers of customers affected in electrical blackouts in the United States between 1984 and 2002
     • The number of links to web sites found in a 1997 web crawl of about 200 million web pages
     • The number of hits on web pages
     • The number of papers scientists write
     • The number of citations received by papers
     • Annual incomes
     • Sales of books, music; in fact anything that can be sold
  5. Zipf plots
     [Figure: six log-log panels (Blackouts, Fires, Flares, Moby Dick, Terrorism, Web links); x-axis: x, y-axis: 1 − P(x)]
  7. The power law distribution
     • The power-law distribution is $p(x) \propto x^{-\alpha}$, where $\alpha$, the scaling parameter, is constant
     • The scaling parameter typically lies in the range $2 < \alpha < 3$, although there are occasional exceptions
     • When $\alpha < 2$, all moments are infinite
     • Typically, the entire process doesn't obey a power law
     • Instead, the power law applies only for values greater than some minimum $x_{\min}$
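
The moment claim on this slide can be checked with a short worked derivation. This is a sketch that assumes the continuous form of the power law with lower cut-off $x_{\min}$; the normalising constant is not given on the slide.

```latex
p(x) = \frac{\alpha - 1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha},
\qquad x \ge x_{\min},\ \alpha > 1

\mathrm{E}[X^m] = \int_{x_{\min}}^{\infty} x^m \, p(x) \, dx
               = \frac{\alpha - 1}{\alpha - 1 - m} \, x_{\min}^{m},
\qquad m < \alpha - 1
```

The integral diverges whenever $m \ge \alpha - 1$. Under this assumption the mean ($m = 1$) is finite only for $\alpha > 2$ and the variance ($m = 2$) only for $\alpha > 3$, so in the typical range $2 < \alpha < 3$ the mean exists but the variance does not.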
  8. Power law: PMF & CMF
     • For the discrete power law, the PMF is
       $p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}$
       where $\alpha > 1$, $x_{\min} \ge 1$ and
       $\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$
       is the generalised zeta function
     • When $x_{\min} = 1$, $\zeta(\alpha, 1)$ is the standard zeta function
     [Figure: PDF and CDF panels for x from 0 to 50, with curves for α between 1.50 and 2.50]
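
The PMF above can be evaluated numerically with a few lines of base R. The following is a minimal sketch, not the poweRlaw implementation: `hurwitz_zeta` and `dpl_discrete` are hypothetical helper names, and the generalised zeta function is approximated by a truncated sum plus an integral estimate of the discarded tail.

```r
# Generalised (Hurwitz) zeta function, approximated by summing the first
# n_terms terms and estimating the remaining tail with an integral.
# (Hypothetical helper; an approximation, not an exact evaluation.)
hurwitz_zeta <- function(alpha, xmin, n_terms = 1e5) {
  stopifnot(alpha > 1, xmin >= 1)
  n <- 0:(n_terms - 1)
  head_sum <- sum((n + xmin)^(-alpha))
  # Tail sum over n >= n_terms, approximated by the integral of t^(-alpha)
  # from (n_terms + xmin - 0.5) to infinity (midpoint rule).
  tail_est <- (n_terms + xmin - 0.5)^(1 - alpha) / (alpha - 1)
  head_sum + tail_est
}

# Discrete power-law PMF: x^(-alpha) / zeta(alpha, xmin) for x >= xmin.
dpl_discrete <- function(x, alpha, xmin = 1) {
  ifelse(x >= xmin, x^(-alpha) / hurwitz_zeta(alpha, xmin), 0)
}

# With xmin = 1 this reduces to the standard zeta function: zeta(2) = pi^2 / 6.
zeta2 <- hurwitz_zeta(2, 1)

# The PMF should sum to (almost) one over a long range of x.
probs <- dpl_discrete(1:1e5, alpha = 2.5, xmin = 1)
total <- sum(probs)
```

Comparing `zeta2` with `pi^2 / 6` is a quick sanity check on the truncation.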
  9. Fitting power laws
     • The main technique for fitting power laws comes from Clauset et al., 2009
     • This paper gets around ten new citations a week
     • Estimating $\alpha$ given $x_{\min}$ is straightforward: just use the MLE
     • The lower cut-off, $x_{\min}$, is estimated using a Kolmogorov-Smirnov approach
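
The two steps on this slide can be sketched in base R. This is an illustration under stated assumptions rather than the package's code: it uses the closed-form MLE for the continuous power law (the discrete case used for Moby Dick needs a numerical fit), and `fit_alpha`, `ks_distance`, `estimate_xmin_sketch` and the `min_tail` argument are hypothetical names.

```r
# Closed-form MLE of alpha for a continuous power law, given xmin.
fit_alpha <- function(x, xmin) {
  tail_x <- x[x >= xmin]
  1 + length(tail_x) / sum(log(tail_x / xmin))
}

# Kolmogorov-Smirnov distance between the empirical CDF of the tail
# and the fitted power-law CDF.
ks_distance <- function(x, xmin) {
  tail_x <- sort(x[x >= xmin])
  n <- length(tail_x)
  alpha <- fit_alpha(x, xmin)
  fitted_cdf <- 1 - (tail_x / xmin)^(1 - alpha)
  empirical_cdf <- (seq_len(n) - 1) / n
  max(abs(empirical_cdf - fitted_cdf))
}

# Try each observed value as a candidate xmin and keep the one with the
# smallest KS distance. Candidates leaving fewer than min_tail points in
# the tail are skipped (an assumption made here to avoid degenerate fits).
estimate_xmin_sketch <- function(x, min_tail = 50) {
  candidates <- sort(unique(x))
  tail_sizes <- vapply(candidates, function(v) sum(x >= v), numeric(1))
  candidates <- candidates[tail_sizes >= min_tail]
  ks <- vapply(candidates, function(v) ks_distance(x, v), numeric(1))
  best <- which.min(ks)
  list(xmin = candidates[best],
       alpha = fit_alpha(x, candidates[best]),
       KS = ks[best])
}

set.seed(1)
# Inverse-transform sample from a continuous power law, xmin = 1, alpha = 2.5.
x <- (1 - runif(3000))^(-1 / (2.5 - 1))
alpha_hat <- fit_alpha(x, xmin = 1)
est <- estimate_xmin_sketch(x)
```

On simulated data with a known cut-off, `alpha_hat` should land close to the true value of 2.5, which gives a simple check that the estimator is wired up correctly.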
  10. The poweRlaw package
     • The package is available on CRAN and at https://github.com/csgillespie/poweRlaw
     • Makes power laws easy to fit
     • Crucially, it makes fitting (to the tails of) the log-normal, exponential and Poisson distributions equally easy
     • Consistent interface between distributions
     • Estimate parameter uncertainty
     • Compare distributions (statistically and visually)
  12. Case study: Moby Dick
     R> m_pl = displ$new(moby)
     R> plot(m_pl)
     [Figure: CDF of the Moby Dick word frequencies on log-log axes; x-axis: Words, y-axis: CDF]
  13. Case study: Moby Dick
     R> m_pl = displ$new(moby)
     R> (est = estimate_xmin(m_pl))
     $KS
     [1] 0.009229

     $xmin
     [1] 7

     $pars
     [1] 1.95

     attr(,"class")
     [1] "estimate_xmin"
  16. Case study: Moby Dick
     R> m_pl = displ$new(moby)
     R> est = estimate_xmin(m_pl)
     R> m_pl$setXmin(est)
     R> lines(m_pl)
     R> m_ln = dislnorm$new(moby)
     R> est = estimate_xmin(m_ln)
     R> m_ln$setXmin(est)
     R> lines(m_ln)
     [Figure: CDF of the Moby Dick word frequencies on log-log axes with the fitted power-law and log-normal lines added; x-axis: Words, y-axis: CDF]
  17. Why use objects?
     • Each distribution is represented by an object:
       Parent class: distribution
       Power-law: displ, log-normal: dislnorm, . . .
     • Method dispatch on object class: dist_pdf(m) returns the probability density function based on the class of m
     • Consistent interface:
       Bootstrapping: R> bootstrap(m)
       Model selection: R> compare_distributions(m1, m2)
     • Simple interface that enables easy addition of new distributions (currently there are seven available distributions to fit)
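
The idea of a parent class with per-distribution dispatch can be sketched with the methods package alone. All names below (`DistributionSketch`, `PowerLawSketch`, `ExponentialSketch`, `pdf_sketch`) are hypothetical and chosen to avoid clashing with poweRlaw; the densities are the continuous forms shifted to start at xmin, assumed here for illustration.

```r
library(methods)

# Hypothetical parent class holding what every distribution shares.
DistributionSketch <- setRefClass("DistributionSketch",
  fields = list(dat = "numeric", xmin = "numeric", pars = "numeric"))

# Child classes: one per distribution, inheriting the shared fields.
PowerLawSketch <- setRefClass("PowerLawSketch",
  contains = "DistributionSketch")
ExponentialSketch <- setRefClass("ExponentialSketch",
  contains = "DistributionSketch")

# One generic; the method that runs depends on the class of m.
setGeneric("pdf_sketch", function(m, q) standardGeneric("pdf_sketch"))

setMethod("pdf_sketch", signature(m = "PowerLawSketch"), function(m, q) {
  alpha <- m$pars[1]
  ifelse(q >= m$xmin, (alpha - 1) / m$xmin * (q / m$xmin)^(-alpha), 0)
})

setMethod("pdf_sketch", signature(m = "ExponentialSketch"), function(m, q) {
  rate <- m$pars[1]
  ifelse(q >= m$xmin, rate * exp(-rate * (q - m$xmin)), 0)
})

# The calling code is identical for both models.
models <- list(
  PowerLawSketch$new(dat = c(1, 2, 5), xmin = 1, pars = 2.5),
  ExponentialSketch$new(dat = c(1, 2, 5), xmin = 1, pars = 0.5)
)
densities <- vapply(models, function(m) pdf_sketch(m, 2), numeric(1))
```

Adding a new distribution under this design means defining one more child class and one more method, while code such as the `vapply()` call stays unchanged.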
  18. Reference classes
     • Reference classes behave like classes in C++, Python and many other languages, not like standard R classes
     • You can use these classes with ordinary R expressions and functions
     • An extension to core R (October 2010)
     • Big difference: mutable state
  21. Mutable states
     R> displ = setRefClass("displ", fields = "xmin")
     R> d1 = displ$new(xmin = 1)
     R> d1$xmin
     [1] 1
     R> d2 = d1
     R> d2$xmin = 100
     R> d2$xmin
     [1] 100
     R> d1$xmin
     [1] 100
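
To make the contrast with ordinary R objects explicit, the sketch below repeats the experiment with a plain list and with the `$copy()` method that reference class objects provide. The class name `DisplSketch` is hypothetical, used so that it does not clash with the `displ` class in poweRlaw.

```r
library(methods)

DisplSketch <- setRefClass("DisplSketch", fields = list(xmin = "numeric"))

d1 <- DisplSketch$new(xmin = 1)
d2 <- d1           # a second name for the same object
d3 <- d1$copy()    # an independent copy
d2$xmin <- 100     # also changes d1, but not d3

# An ordinary list follows R's usual copy-on-modify semantics.
l1 <- list(xmin = 1)
l2 <- l1
l2$xmin <- 100     # l1 is left untouched
```

After running this, `d1$xmin` is 100 while `d3$xmin` and `l1$xmin` are still 1, so `$copy()` is the tool to reach for when an independent object is needed.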
  23. Mutable states
     • When estimating xmin, a naive implementation makes this calculation slow
     • Efficient caching speeds up calculations 100-fold
     • For example, the call
       R> m_pl$setXmin(10)
       updates internal variables that make future calculations quicker
     • On creation of a distribution object, we make "multiple copies" of the data
       R> x
       R> cumsum(log(x))
     • Using reference classes avoids constant copying and speeds up calculations
       R> pl_ref$xmin = 10
       R> pl_s4@xmin = 10
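
The caching idea can be illustrated with a small reference class. This is a sketch of the general technique, not the internals of poweRlaw: the class `CachedPL`, its fields and the `alpha_mle` method are hypothetical, and the continuous closed-form MLE is assumed for simplicity.

```r
library(methods)

CachedPL <- setRefClass("CachedPL",
  fields = list(
    dat = "numeric",           # sorted data, stored once
    log_cumsum = "numeric",    # sum of log(dat) from each position to the end
    xmin = "numeric",
    tail_n = "numeric",        # number of points >= xmin (cached)
    tail_log_sum = "numeric"   # sum of log(dat) over the tail (cached)
  ),
  methods = list(
    initialize = function(x = numeric(0), ...) {
      # The expensive work (sorting, logs, cumulative sums) happens once.
      dat <<- sort(as.numeric(x))
      log_cumsum <<- rev(cumsum(rev(log(dat))))
      callSuper(...)
    },
    setXmin = function(value) {
      first <- which(dat >= value)[1]
      stopifnot(!is.na(first))
      # Updating xmin only looks up cached values; no data is copied.
      xmin <<- value
      tail_n <<- length(dat) - first + 1
      tail_log_sum <<- log_cumsum[first]
      invisible(.self)
    },
    alpha_mle = function() {
      # sum(log(x / xmin)) over the tail, rebuilt from the cached pieces.
      1 + tail_n / (tail_log_sum - tail_n * log(xmin))
    }
  )
)

set.seed(42)
x <- (1 - runif(1000))^(-1 / 1.5)   # simulated power-law data, alpha = 2.5
m <- CachedPL$new(x)
m$setXmin(2)
cached <- m$alpha_mle()

# The same estimate computed from scratch, for comparison.
direct <- 1 + sum(x >= 2) / sum(log(x[x >= 2] / 2))
```

Because the object is mutable, scanning many candidate values of xmin reuses the same sorted data and cumulative sums instead of recomputing them or copying the object on every update.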
  24. Comments
     • Reference classes are still new
     • Code has now broken twice with R upgrades
     • roxygen2 and reference classes didn't play well together
     • Very few questions on Stack Overflow about reference classes
     • Structuring code and files
     • Care has to be taken when using them with parallel computing
  25. References
     • Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review 51.4 (2009): 661–703.
     • poweRlaw package: https://github.com/csgillespie/poweRlaw