Cascade Project

A look at the challenges involved in creating a big data product in the context of the Cascade Project (https://www.cascadeproject.com/)

Published in: Technology

  1. Real Time-Big Data-Social Network-Data Science-Gamified!
     Jason Capehart
     a.k.a. The Cascade Project
     12/12/12
     (Okay … that last part of the title isn't true)
  2. 1. Visualization
     2. Data
     3. Analysis
  3. Show Me!
  4. The Good, The Bad, The Ugly
  5. Surely, You Must Be Joking.
     Store        Examples
     Key-Value    Hadoop, Memcached, Redis
     Document     MongoDB, CouchDB
     Graph        Neo4j, Giraph, Titan
     Real Time    Storm, Impala
  6. Citation: Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social Network or a News Media? Proceedings of the 19th International World Wide Web (WWW) Conference (pp. 591-600). Raleigh, NC: ACM.
  7. Citation: A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data", SIAM Review 51(4), 661-703 (2009). (arXiv:0706.1062, doi:10.1137/070710111)
  8. 800,000,000
     (that's a lot of users)
     (cost = 200k for fire hose)
  9. Sampled / Not Sampled
     Citation: Stumpf, M. P., Wiuf, C., & May, R. M. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, 4221-4224.
  10. # Pseudo Code
      id_guess = randint(0, 10^9)
      user = api.get_user(id = id_guess)
      Repeat until tired or rate limited
  11. Plot legend: Power Law (xmin = 281, α = 2.19), Lognormal
      Discrete Power Law vs. Lognormal
      Loglikelihood Ratio       89.46
      Vuong's Test Statistic    7.14
      p-val (1-sided)           >0.99
  12. Plot legend: Power Law (xmin = 222, α = 2.33), Lognormal, Stretched Exponential
  13. • Conclusions = None!
        - All work is in progress
      • Discussion
        - Cascade uses open source
        - Opportunities to give back?
  14. References
      1. A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data", SIAM Review 51(4), 661-703 (2009). (arXiv:0706.1062, doi:10.1137/070710111). Code: http://tuvalu.santafe.edu/~aaronc/powerlaws/
      2. Newman, M. (2005, September-October). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5), 323-351.
      3. Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social Network or a News Media? Proceedings of the 19th International World Wide Web (WWW) Conference (pp. 591-600). Raleigh, NC: ACM.
      4. Stumpf, M. P., Wiuf, C., & May, R. M. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, 4221-4224.
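
The random-ID sampling on slide 10 and the power-law exponent reported on slide 11 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Cascade Project's actual code: `lookup` is a hypothetical stand-in for `api.get_user`, `max_tries` stands in for "until tired or rate limited", and `fit_alpha` uses the continuous-approximation maximum-likelihood estimator from Clauset, Shalizi and Newman (2009) with `xmin` supplied by the caller, whereas the slides report a discrete fit.

```python
import math
import random


def sample_user_ids(lookup, n_wanted, max_id=10**9, max_tries=100_000, rng=None):
    """Guess random IDs until n_wanted users are found or tries run out.

    `lookup` is a hypothetical callable standing in for api.get_user:
    it returns a user record, or None when the ID is unassigned.
    """
    rng = rng or random.Random()
    found = {}
    for _ in range(max_tries):  # stand-in for "until tired or rate limited"
        if len(found) >= n_wanted:
            break
        # 10**9 is ten to the ninth; in Python, 10^9 would be a bitwise XOR.
        id_guess = rng.randint(0, max_id)
        user = lookup(id_guess)
        if user is not None:
            found[id_guess] = user
    return found


def fit_alpha(values, xmin):
    """Continuous-approximation MLE for the power-law exponent.

    alpha = 1 + n / sum(ln(x / xmin)) over the tail x >= xmin.
    xmin is taken as given here (an assumption of this sketch).
    """
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    return 1.0 + len(tail) / log_sum
```

A caller would pass a function that wraps a real API client as `lookup`, collect a count such as followers from each returned record, and hand those counts to `fit_alpha`. Comparing the fit against a lognormal alternative, as slide 11 does, is outside the scope of this sketch.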
