
# Reference classes: a case study with the poweRlaw package


Power-law distributions have been used extensively to characterise many disparate scenarios, inter alia, the sizes of moon craters and annual incomes. Recently, power laws have even been used to characterise terrorist attacks and interstate wars. However, for every correct characterisation that a particular process obeys a power law, there are many systems that have been incorrectly labelled as being scale-free.

Part of the reason for incorrectly categorising systems as having power-law properties is the lack of easy-to-use software. The poweRlaw package aims to tackle this problem by allowing multiple heavy-tailed distributions to be fitted within a standard framework. Within this package, different distributions are represented using reference classes. This enables a consistent interface to be constructed for plotting and parameter inference.

This talk will describe the advantages (and disadvantages) of using reference classes, in particular how reference classes can be leveraged to allow fast, efficient computation via parameter caching. The talk will also touch upon potential difficulties, such as combining reference classes with parallel computation.

Published in: Technology

### Reference classes: a case study with the poweRlaw package

1. Reference classes: a case study with the poweRlaw package. Colin Gillespie, Newcastle University, UK. http://aperiodical.com/2013/01/log-log-whos-there-not-a-power-law/
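2. The power law distribution

   | Name | $f(x)$ | Notes |
   |---|---|---|
   | Power law | $x^{-\alpha}$ | Pareto distribution |
   | Log-normal | $\frac{1}{x} \exp\left(-\frac{(\ln(x) - \mu)^2}{2\sigma^2}\right)$ | |
   | Exponential | $e^{-\lambda x}$ | |
   | Power law | $x^{-\alpha}$ | Zeta distribution |
   | Power law | $x^{-\alpha}$ | $x = 1, \ldots, n$, Zipf's distribution |
   | Yule | $\frac{\Gamma(x)}{\Gamma(x + \alpha)}$ | |
   | Poisson | $\lambda^x / x!$ | |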
4. Alleged power-law phenomena

   - The frequency of occurrence of unique words in the novel Moby Dick by Herman Melville
   - The numbers of customers affected in electrical blackouts in the United States between 1984 and 2002
   - The number of links to web sites found in a 1997 web crawl of about 200 million web pages
   - The number of hits on web pages
   - The number of papers scientists write
   - The number of citations received by papers
   - Annual incomes
   - Sales of books, music; in fact anything that can be sold
5. Zipf plots

   [Figure: six panels (Blackouts, Fires, Flares, Moby Dick, Terrorism, Web links) showing 1 − P(x) against x on log-log axes]
7. The power law distribution

   - The power-law distribution is $p(x) \propto x^{-\alpha}$ where α, the scaling parameter, is constant
   - The scaling parameter typically lies in the range 2 < α < 3, although there are occasional exceptions
   - When α < 2, the mean and all higher moments are infinite
   - Typically, the entire process doesn't obey a power law
   - Instead, the power law applies only for values greater than some minimum xmin
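The statement about infinite moments follows from a short calculation. As a sketch, assume the continuous power law normalised on $x \geq x_{\min}$ with $\alpha > 1$ (the normalising constant is not given on the slides and is an assumption here):

$$p(x) = \frac{\alpha - 1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha}, \qquad E[X^m] = \int_{x_{\min}}^{\infty} x^m \, p(x) \, dx = \frac{\alpha - 1}{\alpha - 1 - m} \, x_{\min}^m \quad \text{for } m < \alpha - 1.$$

The integral diverges whenever $m \geq \alpha - 1$, so the mean ($m = 1$) is infinite for $\alpha \leq 2$ and the variance ($m = 2$) is infinite for $\alpha \leq 3$.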
8. Power law: PMF & CDF

   For the discrete power law, the PMF is

   $$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}$$

   where $\alpha > 1$, $x_{\min} \geq 1$ and

   $$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$$

   is the generalised zeta function. When $x_{\min} = 1$, $\zeta(\alpha, 1)$ is the standard zeta function.

   [Figure: PDF and CDF panels plotted against x (0 to 50) for α between 1.50 and 2.50]
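The PMF above can be evaluated directly in base R. The following is a minimal sketch, not the package's implementation: `hurwitz_zeta()` and `discrete_pl_pmf()` are hypothetical helper names, and the infinite sum is simply truncated after `n_terms` terms, which is an assumption that trades accuracy for simplicity.

```r
# Truncated generalised zeta function: sum over n = 0, ..., n_terms - 1 of
# (n + xmin)^(-alpha). The truncation point is an assumption of this sketch.
hurwitz_zeta <- function(alpha, xmin, n_terms = 1e6) {
  stopifnot(alpha > 1, xmin >= 1)
  sum((seq_len(n_terms) - 1 + xmin)^(-alpha))
}

# PMF of the discrete power law; values below xmin get probability zero.
discrete_pl_pmf <- function(x, alpha, xmin = 1) {
  ifelse(x >= xmin, x^(-alpha) / hurwitz_zeta(alpha, xmin), 0)
}

# With xmin = 1 the sum approximates the standard zeta function,
# so this value should be close to pi^2 / 6.
zeta_2 <- hurwitz_zeta(2, 1)

# PMF values for x = 1, ..., 5 when the power law starts at xmin = 2.
pmf_head <- discrete_pl_pmf(1:5, alpha = 2.5, xmin = 2)
```

For α close to 1 the truncated sum converges slowly, so a dedicated zeta implementation would be preferable in practice.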
9. Fitting power laws

   - The main technique for fitting power laws comes from Clauset et al., 2009
   - This paper gets around ten new citations a week
   - Estimating α given xmin is straightforward: just use the MLE
   - The lower cut-off, xmin, is estimated using a Kolmogorov-Smirnov approach
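The two estimation steps can be sketched in base R. This example makes several assumptions: it uses the continuous power law, for which the MLE given xmin is 1 + n / Σ ln(x_i / xmin), it runs on simulated data, and it scans a coarse grid of candidate cut-offs. The names `fit_alpha()` and `ks_distance()` are hypothetical and are not functions of the poweRlaw package.

```r
set.seed(1)

# Simulate from a continuous power law by inverse transform sampling.
alpha_true <- 2.5
xmin_true <- 1
x <- xmin_true * (1 - runif(5000))^(-1 / (alpha_true - 1))

# MLE of alpha for the tail x >= xmin (continuous case).
fit_alpha <- function(x, xmin) {
  tail_x <- x[x >= xmin]
  1 + length(tail_x) / sum(log(tail_x / xmin))
}

# Kolmogorov-Smirnov distance between the empirical and fitted tail CDFs.
ks_distance <- function(x, xmin) {
  tail_x <- sort(x[x >= xmin])
  n <- length(tail_x)
  alpha <- fit_alpha(tail_x, xmin)
  fitted_cdf <- 1 - (tail_x / xmin)^(-(alpha - 1))
  empirical_cdf <- seq_len(n) / n
  max(abs(empirical_cdf - fitted_cdf))
}

# Scan candidate cut-offs and keep the one with the smallest KS distance.
candidates <- quantile(x, probs = seq(0, 0.9, by = 0.05), names = FALSE)
ks <- vapply(candidates, function(xm) ks_distance(x, xm), numeric(1))
xmin_hat <- candidates[which.min(ks)]
alpha_hat <- fit_alpha(x, xmin_hat)
```

Each candidate cut-off refits α from scratch here, which is the kind of repeated work that caching is meant to avoid.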
10. The poweRlaw package

    - The package is available on CRAN and at https://github.com/csgillespie/poweRlaw
    - Makes fitting power laws easy
    - Crucially, it makes fitting (to the tails of) the log-normal, exponential and Poisson distributions equally easy
    - Consistent interface between distributions
    - Estimate parameter uncertainty
    - Compare distributions (statistically and visually)
12. Case study: Moby Dick

    - `R> m_pl = displ$new(moby)`
    - `R> plot(m_pl)`

    [Figure: CDF against word frequency (Words) on log-log axes]
13. Case study: Moby Dick

    - `R> m_pl = displ$new(moby)`
    - `R> (est = estimate_xmin(m_pl))`

    Output:

    - `$KS [1] 0.009229`
    - `$xmin [1] 7`
    - `$pars [1] 1.95`
    - `attr(,"class") [1] "estimate_xmin"`

    [Figure: CDF against word frequency (Words) on log-log axes]
16. Case study: Moby Dick

    - `R> m_pl = displ$new(moby)`
    - `R> est = estimate_xmin(m_pl)`
    - `R> m_pl$setXmin(est)`
    - `R> lines(m_pl)`
    - `R> m_ln = dislnorm$new(moby)`
    - `R> est = estimate_xmin(m_ln)`
    - `R> m_ln$setXmin(est)`
    - `R> lines(m_ln)`

    [Figure: CDF against word frequency (Words) on log-log axes]
17. Why use objects?

    - Each distribution is represented by an object:
      - Parent class: `distribution`
      - Power-law: `displ`, log-normal: `dislnorm`, ...
    - Method dispatch on object class: `dist_pdf(m)` returns the probability density function based on the class of `m`
    - Consistent interface:
      - Bootstrapping: `R> bootstrap(m)`
      - Model selection: `R> compare_distributions(m1, m2)`
    - Simple interface that enables easy addition of new distributions (currently there are seven available distributions to fit)
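The pattern of a parent class with one child class per distribution, plus a generic that dispatches on the class of the object, can be sketched with base R's methods package. All names below (`toy_dist`, `toy_exp`, `toy_pareto`, `toy_pdf`) are hypothetical and chosen for illustration; the package's actual class definitions may differ.

```r
library(methods)

# Parent reference class holding the data and the fitted parameters.
toy_dist <- setRefClass("toy_dist",
  fields = list(dat = "numeric", pars = "numeric"))

# One child class per distribution.
toy_exp <- setRefClass("toy_exp", contains = "toy_dist")
toy_pareto <- setRefClass("toy_pareto", contains = "toy_dist",
  fields = list(xmin = "numeric"))

# A single generic; the method that runs depends on the class of m.
setGeneric("toy_pdf", function(m, q) standardGeneric("toy_pdf"))

setMethod("toy_pdf", "toy_exp", function(m, q) {
  dexp(q, rate = m$pars)
})

setMethod("toy_pdf", "toy_pareto", function(m, q) {
  # Continuous power-law density on q >= xmin, zero elsewhere.
  ifelse(q >= m$xmin, (m$pars - 1) / m$xmin * (q / m$xmin)^(-m$pars), 0)
})

m_exp <- toy_exp$new(pars = 2)
m_par <- toy_pareto$new(pars = 3, xmin = 1)
pdf_exp <- toy_pdf(m_exp, 0)
pdf_par <- toy_pdf(m_par, c(0.5, 1))
```

Adding a new distribution then only requires a new child class and a `toy_pdf` method; code that calls the generic does not change.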
18. Reference classes

    - Reference classes behave like classes in C++, Python and many other languages, not like standard R classes
    - You can use these classes with ordinary R expressions and functions
    - An extension to core R (October 2010)
    - Big difference: mutable state
21. Mutable states

    - `R> displ = setRefClass("displ", fields = "xmin")`
    - `R> d1 = displ$new(xmin = 1)`
    - `R> d1$xmin` gives `[1] 1`
    - `R> d2 = d1`
    - `R> d2$xmin = 100`
    - `R> d2$xmin` gives `[1] 100`
    - `R> d1$xmin` gives `[1] 100`
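The behaviour on this slide contrasts with ordinary R objects, which are copied on modification. The sketch below uses a hypothetical class `toy_ref` to show the difference, together with the `copy()` method that every reference class object provides for making an independent copy.

```r
library(methods)

toy_ref <- setRefClass("toy_ref", fields = list(xmin = "numeric"))

original <- toy_ref$new(xmin = 1)
alias <- original               # a second name for the same object
independent <- original$copy()  # a separate object with the same field values

alias$xmin <- 100               # also changes `original`, but not `independent`

# An ordinary list behaves differently: assignment creates a copy on modification.
plain_list <- list(xmin = 1)
list_copy <- plain_list
list_copy$xmin <- 100           # `plain_list` is left untouched
```

After running this, `original$xmin` is 100 while `independent$xmin` and `plain_list$xmin` are still 1.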
23. Mutable states

    - When estimating xmin, a naive implementation makes this calculation slow
    - Efficient caching speeds up calculations 100-fold
    - For example, the call `R> m_pl$setXmin(10)` updates internal variables that make future calculations quicker
    - On creation of a distribution object, we make "multiple copies" of the data:
      - `R> x`
      - `R> cumsum(log(x))`
    - Using reference classes avoids constant copying and speeds up calculations:
      - `R> pl_ref$xmin = 10`
      - `R> pl_s4@xmin = 10`
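The caching idea can be illustrated with a toy reference class. This is a sketch under stated assumptions, not the package's internals: the class `cached_pl`, its fields and its `alpha_mle()` method are hypothetical, and the continuous MLE is used for simplicity. The cumulative sum of logs is computed once at creation, and `setXmin()` only looks up the cached values, so refitting α for a new cut-off does not rescan the data.

```r
library(methods)

cached_pl <- setRefClass("cached_pl",
  fields = list(
    dat = "numeric",            # data, sorted once at creation
    rev_cumsum_log = "numeric", # element i holds sum(log(dat[i:n]))
    xmin = "numeric",
    n_tail = "numeric",         # cached number of points >= xmin
    sum_log_tail = "numeric"    # cached sum of log(x) over the tail
  ),
  methods = list(
    initialize = function(x = numeric(0), ...) {
      dat <<- sort(x)
      rev_cumsum_log <<- rev(cumsum(log(rev(dat))))
      callSuper(...)
    },
    setXmin = function(value) {
      idx <- which(dat >= value)
      if (length(idx) == 0) stop("xmin exceeds all data points")
      xmin <<- value
      n_tail <<- as.numeric(length(idx))
      sum_log_tail <<- rev_cumsum_log[idx[1]]
    },
    alpha_mle = function() {
      # Continuous MLE: 1 + n / sum(log(x_i / xmin)), using cached sums only.
      1 + n_tail / (sum_log_tail - n_tail * log(xmin))
    }
  )
)

x <- c(1, 2, 3, 4, 5, 10)
m <- cached_pl$new(x = x)
m$setXmin(3)
alpha_cached <- m$alpha_mle()

# Direct calculation for comparison.
tail_x <- x[x >= 3]
alpha_direct <- 1 + length(tail_x) / sum(log(tail_x / 3))
```

Because `setXmin()` modifies the object in place, the cached values stay attached to `m` without the object being copied on each update.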
24. Comments

    - Reference classes are still new
    - Code has now broken twice with R upgrades
    - roxygen2 and reference classes didn't play well together
    - Very few questions on Stack Overflow on reference classes
    - Structuring code and files
    - Care has to be taken when using them with parallel computing
25. References

    - Clauset, A., Shalizi, C. R., and Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review 51(4): 661–703.
    - poweRlaw package: https://github.com/csgillespie/poweRlaw