EOS5 Demo


Published in: Technology, Education
  1. 1. Collective Intelligence <ul><li>Engineering </li></ul><ul><ul><li>Not the bigger research picture (Tom) </li></ul></ul><ul><li>Progress </li></ul><ul><li>Findings </li></ul><ul><li>Proposal </li></ul>
  2. 2. Progress <ul><li>Reading. </li></ul>
  3. 3. Progress <ul><li>Investigate Compute Problem. </li></ul>
  4. 4. Progress <ul><li>Investigate Compute Problem. </li></ul><ul><ul><li>Pearson example. </li></ul></ul>
  5. 5. Progress <ul><li>Investigate Compute Problem. </li></ul><ul><ul><li>Pearson example. </li></ul></ul>
  6. 6. Progress <ul><li>Investigate Compute Problem. </li></ul><ul><ul><li>Pearson example. </li></ul></ul>Store all comparisons = 1/2 N^2
  7. 7. Progress <ul><li>Investigate Scale </li></ul><ul><ul><li>N films </li></ul></ul><ul><ul><li>M people </li></ul></ul><ul><ul><li>M[N(N-1)/2] times the algorithm cost </li></ul></ul><ul><ul><ul><li>Pearson: </li></ul></ul></ul><ul><ul><ul><ul><li>Numerator: 2 subtractions, 1 multiplication, 1 addition </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Denominator: 2 subtractions, 2 squarings, 2 additions </li></ul></ul></ul></ul><ul><ul><li>M × N time to compute averages </li></ul></ul><ul><ul><ul><li>Can be done on ingest in M × N time </li></ul></ul></ul>
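A minimal sketch of the cost model on this slide, assuming each film's ratings are held as a plain list with one entry per person. The function names (`pearson`, `pair_count`, `total_inner_steps`) are illustrative and do not come from the deck.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors (one entry per person)."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: per person, 2 subtractions, 1 multiplication, 1 addition.
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: per person, 2 subtractions, 2 squarings, 2 additions.
    den = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den if den else 0.0

def pair_count(n_films):
    """Number of distinct film pairs: N(N-1)/2."""
    return n_films * (n_films - 1) // 2

def total_inner_steps(n_films, m_people):
    """M people visited once for each of the N(N-1)/2 film pairs."""
    return m_people * pair_count(n_films)
```

The means are recomputed here for simplicity; the slide's point is that they can instead be accumulated once at ingest.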
  8. 8. Progress <ul><li>Tractable? </li></ul><ul><ul><li>A typical P4 has a theoretical max of 20-40 GFLOPS. </li></ul></ul><ul><ul><li>With L2 cache bandwidth, supporting instructions, etc., a max of 3-7 GFLOPS is more realistic (my further benchmarking shows 7 GFLOPS on a dual-core Centrino). </li></ul></ul><ul><ul><li>What could we expect from various technologies? </li></ul></ul><ul><ul><ul><li>Matrix multiplication is a good estimate... </li></ul></ul></ul>
  9. 9. Progress <ul><li>41 mins </li></ul><ul><li>17 mins </li></ul><ul><li>8 mins </li></ul>
  10. 10. Progress <ul><li>40 Seconds </li></ul>
  11. 11. Progress <ul><li>Genial MFlops </li></ul>Which correlate well with: More investigation details at:
  12. 12. Progress <ul><li>Computation </li></ul><ul><ul><li>A 1M x 1M dense matrix multiply requires at least 1M^3 FLOP = 1E18 FLOP = 1 exaFLOP. </li></ul></ul><ul><ul><li>On a single P4 CPU this would take 1E18 / 7E9 = 142E6 seconds, or 1653 days. </li></ul></ul><ul><ul><li>So even a 100,000 x 100,000 matrix has a theoretical time of 1.65 days. </li></ul></ul><ul><ul><ul><li>Of course, comparisons are ½ of this </li></ul></ul></ul>
  13. 13. Progress <ul><li>Realisation. </li></ul><ul><ul><li>Huge compute problem </li></ul></ul><ul><ul><li>1M matrix 1650 days </li></ul></ul><ul><ul><ul><li>Parallelise? </li></ul></ul></ul><ul><ul><ul><li>16.5 days on 100 nodes </li></ul></ul></ul><ul><ul><ul><li>1.65 days on 1000 nodes </li></ul></ul></ul><ul><ul><li>10M matrix 1,650,000 days </li></ul></ul><ul><ul><ul><li>Parallelise? </li></ul></ul></ul><ul><ul><ul><li>16,500 days on 100 nodes </li></ul></ul></ul><ul><ul><ul><li>1,650 days on 1000 nodes </li></ul></ul></ul><ul><ul><ul><li>165 days on 10,000 nodes </li></ul></ul></ul>
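The day counts above follow from simple arithmetic, sketched here under two assumptions: N^3 floating-point operations for an N x N dense multiply, and perfectly linear scaling across nodes (no communication overhead). The function name `dense_matmul_days` is illustrative.

```python
SECONDS_PER_DAY = 86_400

def dense_matmul_days(n, flops_per_second=7e9, nodes=1):
    """Estimated days for an n x n dense multiply at a sustained FLOP rate.

    Assumes n**3 operations and ideal linear speed-up over `nodes`.
    """
    operations = n ** 3
    seconds = operations / (flops_per_second * nodes)
    return seconds / SECONDS_PER_DAY

for size, nodes in [(1_000_000, 1), (1_000_000, 100), (10_000_000, 10_000)]:
    print(size, nodes, round(dense_matmul_days(size, nodes=nodes), 1))
```

The 7E9 default is the sustained single-CPU rate quoted earlier in the deck; real clusters would fall short of the ideal speed-up.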
  14. 14. Progress <ul><li>Brute force </li></ul><ul><ul><li>IBM's US$133M Roadrunner, sustaining over 1 petaFLOPS </li></ul></ul><ul><ul><ul><li>12,960 IBM PowerXCell 8i CPUs and 6,480 AMD Opteron dual-core processors </li></ul></ul></ul><ul><ul><ul><li>PowerXCell: 32 GFLOPS (similar to GPUs) </li></ul></ul></ul><ul><ul><ul><li>1M matrix = 1,000 seconds (10M = 11.6 days) </li></ul></ul></ul>
  15. 15. Progress <ul><li>Brute force </li></ul><ul><ul><li>Folding@Home (free!) has reached over 4.1 PFLOPS </li></ul></ul><ul><ul><ul><li>1M matrix = 244 seconds (10M = 2.8 days) </li></ul></ul></ul><ul><ul><li>ATI Radeon™ HD 4870 X2: 2.4 teraFLOPS </li></ul></ul><ul><ul><ul><li>500 * $500 + 250 * $1000 (each backplane) = $500,000 (£331,665) for 1 PFLOPS </li></ul></ul></ul>
  16. 16. Progress <ul><li>Optimisations </li></ul><ul><ul><li>Intuitively sparse - Ignore Nulls? How sparse? </li></ul></ul><ul><ul><li>True Pearson for linear algebra requires zeros, but Nulls? </li></ul></ul><ul><ul><li>Depends on data – generally yes </li></ul></ul><ul><ul><ul><li>E.g. three people A, B, C - A has seen no films in common with B or C </li></ul></ul></ul><ul><ul><ul><li>A has seen 10 films, B – 5 and C – 15 </li></ul></ul></ul><ul><ul><ul><li>Pearson numerator for B would be –15 and for C –25 </li></ul></ul></ul><ul><ul><ul><li>So C is less similar to A than B is. </li></ul></ul></ul><ul><ul><li>So can ignore nulls - tfft! </li></ul></ul>
  17. 17. Progress <ul><li>600 Elsevier Full Text Articles. </li></ul><ul><ul><li>Single core running C++ </li></ul></ul><ul><ul><li>Processes 20 articles / 80,000 terms per second. </li></ul></ul><ul><ul><li>Computations are far faster than for a dense matrix. </li></ul></ul><ul><ul><li>Only 600 articles, but 150,000 unique terms. </li></ul></ul>A little distraction...
  18. 18. Progress
  19. 19. Progress (from van Rijsbergen, 1979) The most frequent words are not the most descriptive <ul><ul><li>More Optimisation </li></ul></ul><ul><ul><ul><li>Word Count (LSA) Characteristics </li></ul></ul></ul>Be careful: the less discriminatory words can still provide good information... (and serendipity)
  20. 20. Progress
  21. 21. Progress <ul><li>How Sparse? </li></ul><ul><ul><li>Term Document Count from 2.18 million DBPedia Abstracts </li></ul></ul>
  22. 22. Progress <ul><li>Distraction.... </li></ul><ul><ul><li>Some Least Popular (stemmed) – 3 docs each </li></ul></ul><ul><ul><ul><ul><li>Accretionari - an increase in a beneficiary's share in an estate </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Accordiana - a musical radio series which was heard on CBS in 1934 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Accokeek - Located in the southwest corner of Prince George's County </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Nazarbayev - President of Kazakhstan </li></ul></ul></ul></ul><ul><ul><li>The Most Popular </li></ul></ul><ul><ul><ul><ul><li>15938 – year </li></ul></ul></ul></ul><ul><ul><ul><ul><li>12476 – season </li></ul></ul></ul></ul><ul><ul><ul><ul><li>11410 – state </li></ul></ul></ul></ul><ul><ul><ul><ul><li>10758 – world </li></ul></ul></ul></ul><ul><ul><ul><ul><li>10722 – name </li></ul></ul></ul></ul>Serendipity
  23. 23. Progress <ul><li>Very Sparse </li></ul><ul><ul><li>Turns out to be a Zipf-Mandelbrot distribution. </li></ul></ul><ul><ul><ul><ul><li>[1] G. K. Zipf, Human Behavior and the Principle of Least Effort. (Cambridge, Mass., 1949; Addison-Wesley, 1965). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>[2] B. Mandelbrot, “An informational theory of the statistical structure of language”, in Communication Theory, ed. Willis Jackson. (Butterworths, 1953). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Word Count is .0025% dense </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Ignore Null for huge optimisation. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>40,000 x less compute </li></ul></ul></ul></ul><ul><ul><ul><ul><li>(using uniform density assumption) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Zipf-Mandelbrot has the form: </li></ul></ul></ul></ul><ul><ul><ul><ul><li>y = P1/(x+P2)^P3. </li></ul></ul></ul></ul>
  24. 24. Progress <ul><li>DBPedia words follow Zipf-Mandelbrot </li></ul><ul><ul><li>Zoomed in chunk of DBPedia word count </li></ul></ul><ul><ul><ul><ul><li>y = P1/(x+P2)^P3 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>best fit regression (red curve) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>with factors </li></ul></ul></ul></ul><ul><ul><ul><ul><li>P1 = 874150 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>P2 = 60.0000 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>P3 = 1.01000 </li></ul></ul></ul></ul>
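The fitted curve can be evaluated directly. This is a small sketch that plugs the regression factors quoted above into y = P1/(x+P2)^P3; the function name `zipf_mandelbrot` and the choice of sample ranks are illustrative.

```python
def zipf_mandelbrot(rank, p1=874150.0, p2=60.0, p3=1.01):
    """Predicted document count for the term at a given rank: P1 / (rank + P2) ** P3.

    Defaults are the best-fit factors quoted for the DBPedia word counts.
    """
    return p1 / (rank + p2) ** p3

# The curve falls quickly at first and then flattens into a long tail.
for rank in (1, 10, 100, 1_000, 10_000, 100_000):
    print(rank, round(zipf_mandelbrot(rank), 1))
```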
  25. 25. Progress <ul><li>Calculate Comparisons. </li></ul><ul><ul><li>2.18 million Abstracts – 1.3M unique terms. </li></ul></ul><ul><ul><li>Fits in 2 GB of RAM (1.3M * 2.18M * .000025 * 21 = 1.5 GB, as it is .0025% dense and each entry is 21 bytes) </li></ul></ul><ul><ul><li>Uniform density assumption </li></ul></ul><ul><ul><li>Comparisons computable in a few minutes </li></ul></ul><ul><ul><ul><li>Not storable in RAM (3.6 TB!) </li></ul></ul></ul><ul><ul><li>Big underestimate </li></ul></ul><ul><ul><ul><li>(Stopped the run after 4 hours) </li></ul></ul></ul>Stored a random 100-article sample and compared it with all 2M others (0.2 GB) to allow intuitive QA
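The RAM figure on this slide is a product of four numbers, reproduced here as a sketch. The function name `sparse_matrix_bytes` is illustrative; the inputs are the ones quoted on the slide.

```python
def sparse_matrix_bytes(n_terms, n_docs, density, bytes_per_entry):
    """Bytes needed to hold only the non-null cells of a term x document matrix."""
    return n_terms * n_docs * density * bytes_per_entry

# 1.3M unique terms, 2.18M abstracts, .0025% dense, 21 bytes per stored entry.
estimate = sparse_matrix_bytes(1.3e6, 2.18e6, 0.000025, 21)
print(f"{estimate / 1e9:.2f} GB")
```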
  26. 26. Progress <ul><li>Extrapolating Compute Times. </li></ul><ul><ul><li>Not 40,000x less compute (.0025% dense) </li></ul></ul><ul><ul><ul><li>Word Count is .0025% dense - 40,000 x less compute </li></ul></ul></ul><ul><ul><ul><li>Top 1000 most commonly occurring terms the density is 0.78% </li></ul></ul></ul><ul><ul><li>Not 128x less (.78% dense) </li></ul></ul><ul><ul><li>So use square of area under curve </li></ul></ul><ul><ul><li>= Integral of Zipf Mandelbrot Squared </li></ul></ul><ul><ul><li>Roughly a power law. </li></ul></ul><ul><ul><ul><li>y = (x^P)/N </li></ul></ul></ul>
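One way to see why the uniform-density estimate undercounts is to compare pairwise work for a skewed set of term frequencies against a flat set with about the same total. This is an illustration with made-up frequencies, not the deck's calculation; `pair_ops` and `flatten` are hypothetical helper names.

```python
def pair_ops(doc_freqs):
    """Document pairs sharing each term: d(d-1)/2 per term, summed over terms."""
    return sum(d * (d - 1) // 2 for d in doc_freqs)

def flatten(doc_freqs):
    """Uniform-density stand-in: every term gets the (integer) mean frequency."""
    mean = sum(doc_freqs) // len(doc_freqs)
    return [mean] * len(doc_freqs)

# Made-up, Zipf-like frequencies: one very common term and a long tail.
skewed = [90, 5, 2, 1, 1, 1]
print(pair_ops(skewed), pair_ops(flatten(skewed)))
```

Because the work grows with the square of each term's frequency, the few head terms dominate the total, which is the motivation for squaring the area under the curve rather than using a single density figure.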
  27. 27. Progress DBPedia Abstracts Ops (not including algorithm cost) Regression using simple power law (P = 2.1, N = 1E0.73)
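A power law of the form y = (x^P)/N becomes a straight line in log-log space, log y = P log x - log N, so P and N can be recovered with ordinary least squares. The sketch below assumes that is what "regression using simple power law" means here; `fit_power_law` is an illustrative name and the sample points are synthetic.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = x**P / N on log10-transformed data. Returns (P, N)."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    p = sxy / sxx                  # slope of the log-log line
    log_n = p * mean_x - mean_y    # intercept is -log10(N)
    return p, 10 ** log_n

# Synthetic points generated from the quoted factors, to check the round trip.
xs = [10, 100, 1_000, 10_000]
ys = [x ** 2.1 / 10 ** 0.73 for x in xs]
p, n = fit_power_law(xs, ys)
```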
  28. 28. Progress <ul><li>Extrapolating Compute Times. </li></ul><ul><ul><li>2 Million abstracts </li></ul></ul><ul><ul><li>Square of Integral of Zipf Mandelbrot </li></ul></ul><ul><ul><li>Predicts calculable in 5.42 hours </li></ul></ul><ul><ul><ul><li>Assuming all in RAM. </li></ul></ul></ul><ul><ul><ul><li>But writing gigabytes to disk is a big overhead. </li></ul></ul></ul><ul><ul><ul><li>(Not yet run to prove this) </li></ul></ul></ul>
  29. 29. Progress Loans Data Ops (not including algorithm cost) (proving power law linear regression prediction) Regression using simple power law (P = 3.5, N = 1E10.15)
  30. 30. Progress <ul><li>Loans Data </li></ul><ul><ul><li>Hereford Libraries. </li></ul></ul><ul><ul><li>C++ in-memory 'super fast hash' </li></ul></ul><ul><ul><li>Processed 19M loans in 1 min 20 sec. </li></ul></ul><ul><ul><li>Producing 269,000 unique borrowers and 491,000 unique books. </li></ul></ul><ul><ul><li>8 million unique loan events </li></ul></ul>
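The ingest step amounts to hashing each loan record into a few in-memory sets. The sketch below uses Python sets rather than the C++ hash mentioned on the slide, and it assumes a "unique loan event" means a distinct (borrower, book) pair; both the record layout and the name `ingest_loans` are hypothetical.

```python
def ingest_loans(loans):
    """Single pass over (borrower_id, book_id) records, counting uniques with hash sets."""
    borrowers = set()
    books = set()
    events = set()
    total = 0
    for borrower_id, book_id in loans:
        total += 1
        borrowers.add(borrower_id)
        books.add(book_id)
        events.add((borrower_id, book_id))  # repeat loans of the same book collapse
    return {
        "loans": total,
        "borrowers": len(borrowers),
        "books": len(books),
        "unique_events": len(events),
    }

sample = [("u1", "b1"), ("u1", "b1"), ("u2", "b1"), ("u2", "b2")]
summary = ingest_loans(sample)
```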
  31. 31. Progress Loans per Individual – Nice Zipf Mandelbrot curve
  32. 32. Progress <ul><li>Zipf Mandelbrot – Good Assumption? </li></ul><ul><ul><li>Most data that we are likely to process (all large complex systems?) will follow a Zipf-Mandelbrot model. </li></ul></ul><ul><ul><li>Z. K. Silagadze shows [1] that these comply... </li></ul></ul><ul><ul><ul><li>Clickstreams </li></ul></ul></ul><ul><ul><ul><li>Page-rank (Linkage/Centrality) </li></ul></ul></ul><ul><ul><ul><li>Citations </li></ul></ul></ul><ul><ul><ul><li>Other long tail interactions. </li></ul></ul></ul><ul><ul><ul><ul><li>[1] Z. K. Silagadze, “Citations and the Zipf-Mandelbrot’s law”, [physics.soc-ph], 26 Jan 1999. Budker Institute of Nuclear Physics, 630 090, Novosibirsk, Russia. </li></ul></ul></ul></ul>
  33. 33. Findings <ul><li>Can't Store All Comparisons </li></ul><ul><ul><li>50 TB for a 10M matrix (½ N^2 for a triangular matrix). </li></ul></ul><ul><ul><li>Store only the meaningful ones? – Threshold (Thld) = n or f. </li></ul></ul><ul><li>Can Compute All 2M (squared) Comparisons </li></ul><ul><ul><li>In 6 hours (1 core). </li></ul></ul><ul><li>Can't Compute 1 Billion Comparisons </li></ul><ul><ul><li>287 years (7 days on 20,000 cores; 10 billion?). </li></ul></ul><ul><li>Zipf Mandelbrot Curve is Useful. </li></ul><ul><ul><li>Can store all(?) raw metrics </li></ul></ul>n = count (fixed); f = factor (Z-M)
  34. 34. Findings <ul><li>Zipf Mandelbrot Curve is Useful. </li></ul><ul><ul><li>Head </li></ul></ul><ul><ul><ul><li>Big proportion of compute </li></ul></ul></ul><ul><ul><ul><ul><li>Large M1 M2 intersection. </li></ul></ul></ul></ul><ul><ul><ul><li>Low discrimination </li></ul></ul></ul><ul><ul><li>Body </li></ul></ul><ul><ul><ul><li>Good info </li></ul></ul></ul><ul><ul><ul><li>Medium compute </li></ul></ul></ul><ul><ul><li>Tail </li></ul></ul><ul><ul><ul><li>Specialist </li></ul></ul></ul><ul><ul><ul><li>Trivial or no compute </li></ul></ul></ul>
  35. 35. Findings <ul><li>Zipf Mandelbrot Curve is Useful. </li></ul><ul><ul><li>Allows us to make optimisations </li></ul></ul><ul><ul><ul><li>In reducing the y axis (and x a bit) </li></ul></ul></ul><ul><ul><ul><li>Chop the head off. </li></ul></ul></ul><ul><ul><li>Body </li></ul></ul><ul><ul><ul><li>Dimensionality reduction </li></ul></ul></ul><ul><ul><li>Tail </li></ul></ul><ul><ul><ul><li>Chop the tail off </li></ul></ul></ul><ul><ul><ul><li>Or dimensionality reduction </li></ul></ul></ul><ul><ul><li>X axis = N^2, Y axis = M </li></ul></ul>
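Chopping the head and tail can be sketched as a filter on document frequency. The thresholds and the name `chop_head_and_tail` are assumptions for illustration; the counts for "year", "season" and "accokeek" are the ones quoted earlier in the deck, while "river" is a made-up mid-frequency term.

```python
def chop_head_and_tail(doc_freq, min_docs, max_docs):
    """Keep only terms whose document count lies within [min_docs, max_docs]."""
    return {term: count for term, count in doc_freq.items()
            if min_docs <= count <= max_docs}

doc_freq = {"year": 15938, "season": 12476, "river": 500, "accokeek": 3}
body = chop_head_and_tail(doc_freq, min_docs=5, max_docs=10_000)
```

As an earlier slide warns, low-discrimination and rare terms can still carry useful information, so the cut-offs would need tuning per data set.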
  36. 36. Findings <ul><li>What about storing meaningful comparisons? </li></ul><ul><ul><li>Solves storage problem </li></ul></ul><ul><ul><li>Creates a repeated-compute problem </li></ul></ul><ul><ul><ul><li>Deltas could affect the whole set </li></ul></ul></ul><ul><ul><ul><li>Will affect a chunk of the set </li></ul></ul></ul><ul><ul><ul><li>Could trade off timely accuracy against batch processing. </li></ul></ul></ul>
  37. 37. Proposal <ul><li>Store raw curve </li></ul><ul><ul><li>Sparse Storage – Bigtable-like </li></ul></ul><ul><ul><ul><li>HBase, Hypertable, etc. </li></ul></ul></ul><ul><ul><ul><li>Offloads indexing and lookup to nodes. </li></ul></ul></ul><ul><li>Calculate on the fly </li></ul><ul><ul><li>Two indices: Books -> People and People -> Books </li></ul></ul><ul><ul><li>Not 1/2N^2 – just 1* Intersect (* M) </li></ul></ul><ul><ul><li>Tail – Retrieval problem M~0 Intersect~0. </li></ul></ul><ul><ul><li>Body – Some retrieval and compute. </li></ul></ul><ul><ul><li>Head – Big retrieval and compute big M big intersect. </li></ul></ul>
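The two-index idea can be sketched with in-memory dictionaries standing in for the Bigtable-like store. Scoring by raw overlap count is an assumption made for illustration, and the names `build_indexes` and `similar_books` are hypothetical.

```python
from collections import defaultdict

def build_indexes(loans):
    """Build Books -> People and People -> Books from (person, book) pairs."""
    book_to_people = defaultdict(set)
    person_to_books = defaultdict(set)
    for person, book in loans:
        book_to_people[book].add(person)
        person_to_books[person].add(book)
    return book_to_people, person_to_books

def similar_books(book, book_to_people, person_to_books, top_n=10):
    """Rank other books by how many borrowers they share with `book`.

    Only books reachable through a shared borrower are ever touched,
    so the work is proportional to the intersection, not to N^2.
    """
    overlap = defaultdict(int)
    for person in book_to_people.get(book, set()):
        for other in person_to_books[person]:
            if other != book:
                overlap[other] += 1
    return sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

loans = [("ann", "b1"), ("ann", "b2"), ("bob", "b1"),
         ("bob", "b2"), ("bob", "b3"), ("cat", "b3")]
b2p, p2b = build_indexes(loans)
ranked = similar_books("b1", b2p, p2b)
```

A tail item has few borrowers and returns almost immediately, while a head item fans out to many borrowers and many books, matching the tail, body and head split on the slide.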
  38. 38. Proposal <ul><li>Pre Compute Head </li></ul><ul><ul><li>Store top n </li></ul></ul><ul><ul><li>Store any that take more than .5 seconds </li></ul></ul><ul><ul><ul><li>Zipf-Mandelbrot – retrieval only problem </li></ul></ul></ul><ul><ul><ul><li>Dynamic – finding them is linear </li></ul></ul></ul><ul><ul><li>Store as a cache – only when requested? </li></ul></ul><ul><ul><ul><li>Depends on acceptable delay? </li></ul></ul></ul><ul><li>This Hybrid Scales Better. </li></ul><ul><ul><li>Better than Storing all or Computing all </li></ul></ul>
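The "store any that take more than .5 seconds" rule can be sketched as a cache that keeps a result only when computing it was slow. The class name `HybridStore`, the in-memory dictionary and the injectable clock are assumptions for illustration; a real deployment would persist to the sparse store instead.

```python
import time

class HybridStore:
    """Compute on request; remember only the results that were slow to compute."""

    def __init__(self, compute, threshold_seconds=0.5, clock=time.perf_counter):
        self.compute = compute
        self.threshold = threshold_seconds
        self.clock = clock          # injectable so timing can be simulated
        self.cache = {}

    def get(self, key):
        if key in self.cache:       # head items: pure retrieval
            return self.cache[key]
        start = self.clock()
        result = self.compute(key)
        if self.clock() - start > self.threshold:
            self.cache[key] = result
        return result
```

Usage would look like `store = HybridStore(find_similar)` followed by `store.get(item_id)`, where `find_similar` is whatever on-the-fly comparison function is in use.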
  39. 39. Proposal <ul><li>It doesn't scale indefinitely. </li></ul><ul><ul><li>Scale × 10 means nodes × 100 </li></ul></ul><ul><ul><li>Dimensionality reduction will HAVE to kick in. </li></ul></ul><ul><ul><ul><li>This approach allows that, but at bigger scales than most </li></ul></ul></ul><ul><ul><ul><li>Consider severing head and tail as an early approach. </li></ul></ul></ul><ul><li>Only optimised for individual requests. </li></ul><ul><ul><li>Given this article, find the 10 next most similar </li></ul></ul><ul><ul><li>What about... “Given this corpus, find the 100 most similar things”? </li></ul></ul><ul><ul><ul><li>Then set n = infinity (or f = 0) and the service will tell you in how many days to come back for your results </li></ul></ul></ul>
  40. 40. Proposal <ul><li>Experimentation </li></ul><ul><ul><li>Used HBase, HDFS, Hadoop </li></ul></ul><ul><ul><ul><li>Not using Hadoop yet, but it is a good fit for data ingest </li></ul></ul></ul><ul><ul><ul><li>Hadoop is not very efficient for comparison, but doable. </li></ul></ul></ul><ul><ul><li>Used loan data and binary Pearson </li></ul></ul><ul><ul><ul><li>Ignoring nulls (and sigma) – so counts only. </li></ul></ul></ul>Quick demo
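For reference, the full Pearson correlation of two binary (borrowed / not borrowed) vectors can be written purely in terms of counts, which is why nulls never need to be stored. This sketch shows that standard form, (n·|A∩B| − |A|·|B|) / sqrt(|A|(n−|A|)·|B|(n−|B|)); the deck's own variant also drops sigma, so its numbers may differ. `binary_pearson` and `universe_size` are illustrative names.

```python
from math import sqrt

def binary_pearson(a, b, universe_size):
    """Pearson correlation of two 0/1 vectors, given as sets of the positions that are 1.

    Needs only the two set sizes, the intersection size and the vector length.
    """
    n = universe_size
    na, nb, nab = len(a), len(b), len(a & b)
    den = sqrt(na * (n - na) * nb * (n - nb))
    return (n * nab - na * nb) / den if den else 0.0
```

Here `a` and `b` would be the borrower sets of two books and `universe_size` the total number of borrowers.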
  41. 41. Proposal <ul><li>Next Steps </li></ul><ul><ul><li>Prove this approach </li></ul></ul><ul><ul><li>Performance Testing. </li></ul></ul><ul><ul><ul><li>HBase over n nodes (perf lab, possibly then EC2?). </li></ul></ul></ul><ul><ul><ul><li>Timing retrieval vs compute </li></ul></ul></ul><ul><ul><ul><ul><li>Good logging </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Configurable variables </li></ul></ul></ul></ul><ul><ul><ul><li>Multiple 'stores' (data sets). </li></ul></ul></ul><ul><ul><ul><li>8M loans now – 80M? </li></ul></ul></ul><ul><ul><ul><li>Hadoop the ingest, if only to save time during trials </li></ul></ul></ul>