End of Sprint 5


Published in: Technology, Education
  • End of Sprint 5

    1. 1. Collective Intelligence <ul><li>Engineering </li></ul><ul><ul><li>Not Bigger Research Picture (Tom) </li></ul></ul><ul><li>Progress </li></ul><ul><li>Findings </li></ul><ul><li>Proposal </li></ul>
    2. 2. Progress <ul><li>Reading. </li></ul>
    3. 3. Progress <ul><li>Investigate Compute Problem. </li></ul>
    4. 4. Progress <ul><li>Investigate Compute Problem. </li></ul><ul><ul><li>Pearson example. </li></ul></ul>
    5. 5. Progress <ul><li>Investigate Compute Problem. </li></ul><ul><ul><li>Pearson example. </li></ul></ul>
    6. 6. Progress <ul><li>Investigate Compute Problem. </li></ul><ul><ul><li>Pearson example. </li></ul></ul>Store all comparisons = 1/2 N^2
    7. 7. Progress <ul><li>Investigate Scale </li></ul><ul><ul><li>N films </li></ul></ul><ul><ul><li>M people </li></ul></ul><ul><ul><li>M[N(N-1)/2] times the algorithm cost </li></ul></ul><ul><ul><ul><li>Pearson: </li></ul></ul></ul><ul><ul><ul><ul><li>Numerator: 2 subtractions, 1 multiplication, 1 addition </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Denominator: 2 subtractions, 2 squarings, 2 additions </li></ul></ul></ul></ul><ul><ul><li>M(N) time to compute averages </li></ul></ul><ul><ul><ul><li>Can be done on ingest in M(N) time </li></ul></ul></ul>
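The pair count and the per-pair Pearson cost above can be sketched as follows. This is a minimal illustration, not the code used in the investigation: the function names and the tiny `ratings` table are hypothetical, with films as rows and people as columns.

```python
from itertools import combinations
from math import sqrt


def pair_count(n_items: int) -> int:
    """Number of unordered item pairs, N(N-1)/2."""
    return n_items * (n_items - 1) // 2


def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: per element, 2 subtractions, 1 multiplication, 1 addition.
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: per element, 2 subtractions, 2 squarings, 2 additions.
    den = sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return num / den if den else 0.0


# Hypothetical ratings: N = 3 films, M = 4 people.
ratings = {
    "film_a": [5, 3, 4, 1],
    "film_b": [4, 2, 5, 1],
    "film_c": [1, 5, 2, 4],
}

# One Pearson evaluation per unordered pair of films.
similarities = {
    (f, g): pearson(ratings[f], ratings[g])
    for f, g in combinations(sorted(ratings), 2)
}

print(pair_count(len(ratings)), "pairs")
print(pair_count(1_000_000), "pairs for a million films")
```

Each of the N(N-1)/2 pairs touches all M people, which is where the M[N(N-1)/2] factor comes from.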
    8. 8. Progress <ul><li>Tractable? </li></ul><ul><ul><li>A typical P4 has a theoretical max of 20-40 GFLOPS. </li></ul></ul><ul><ul><li>With L2 cache bandwidth, supporting instructions etc., a max of 3-7 GFLOPS is more realistic (my further benchmarking shows 7 GFLOPS on a dual-core Centrino). </li></ul></ul><ul><ul><li>What could we expect from various technologies? </li></ul></ul><ul><ul><ul><li>Matrix multiplication is a good estimate..... </li></ul></ul></ul>
    9. 9. Progress <ul><li>41 mins </li></ul><ul><li>17 mins </li></ul><ul><li>8 mins </li></ul>
    10. 10. Progress <ul><li>40 Seconds </li></ul>
    11. 11. Progress <ul><li>Genial MFlops </li></ul>Which correlate well with: http://www.ient.rwth-aachen.de/~laurent/genial/benchmark_gemm_4T.html More investigation details at: http://jira.talis.com/browse/COL-5
    12. 12. Progress <ul><li>Computation </li></ul><ul><ul><li>A 1M x 1M dense matrix multiply takes at least 1M^3 = 1E18 floating point operations (1 exaFLOP). </li></ul></ul><ul><ul><li>On a single P4 CPU this would take 1E18 / 7E9 = 142E6 seconds, or 1653 days. </li></ul></ul><ul><ul><li>Even a 100,000 x 100,000 matrix has a theoretical time of 1.65 days. </li></ul></ul><ul><ul><ul><li>Of course, comparisons are ½ this. </li></ul></ul></ul>
    13. 13. Progress <ul><li>Realisation. </li></ul><ul><ul><li>Huge compute problem </li></ul></ul><ul><ul><li>1M matrix 1650 days </li></ul></ul><ul><ul><ul><li>Parallelise? </li></ul></ul></ul><ul><ul><ul><li>16.5 days on 100 nodes </li></ul></ul></ul><ul><ul><ul><li>1.65 days on 1000 nodes </li></ul></ul></ul><ul><ul><li>10M matrix 1,650,000 days </li></ul></ul><ul><ul><ul><li>Parallelise? </li></ul></ul></ul><ul><ul><ul><li>16,500 days on 100 nodes </li></ul></ul></ul><ul><ul><ul><li>1,650 days on 1000 nodes </li></ul></ul></ul><ul><ul><ul><li>165 days on 10,000 nodes </li></ul></ul></ul>
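The estimates on the last two slides follow from one formula: n^3 operations divided by the sustained FLOP rate. A small sketch, assuming the 7 GFLOPS per node figure from the benchmark slide and perfect scaling across nodes (an optimistic assumption); the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400
SUSTAINED_FLOPS = 7e9  # 7 GFLOPS per node, the benchmark figure quoted earlier


def dense_multiply_days(n: float, nodes: int = 1,
                        flops_per_node: float = SUSTAINED_FLOPS) -> float:
    """Days needed for n^3 floating point operations, assuming perfect scaling."""
    return n ** 3 / (flops_per_node * nodes) / SECONDS_PER_DAY


for n, nodes in [(1e5, 1), (1e6, 1), (1e6, 100), (1e6, 1000), (1e7, 10_000)]:
    print(f"n={n:.0e} nodes={nodes}: {dense_multiply_days(n, nodes):,.2f} days")
```

Growing the matrix by 10 multiplies the work by 1000, so adding nodes only buys a constant factor against cubic growth.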
    14. 14. Progress <ul><li>Brute force </li></ul><ul><ul><li>IBM's US$133M Roadrunner sustaining over 1 petaFLOPS </li></ul></ul><ul><ul><ul><li>12,960 IBM PowerXCell 8i CPUs and 6,480 AMD Opteron dual-core processors </li></ul></ul></ul><ul><ul><ul><li>PowerXCell 32 GFLOPS (similar to GPUs) </li></ul></ul></ul><ul><ul><ul><li>1M matrix = 1000 seconds (10M = 11.6 days) </li></ul></ul></ul>
    15. 15. Progress <ul><li>Brute force </li></ul><ul><ul><li>Folding@Home (free!) has reached over 4.1 PFLOPS </li></ul></ul><ul><ul><ul><li>10M matrix = 25 seconds </li></ul></ul></ul><ul><ul><li>ATI Radeon™ HD 4870 X2 2.4 teraFLOPS </li></ul></ul><ul><ul><ul><li>500 * $500 + 250 * $1000 (each backplane) = $500,000 (£331,665) for 1 PFLOPS </li></ul></ul></ul>
    16. 16. Progress <ul><li>Optimisations </li></ul><ul><ul><li>Intuitively sparse - Ignore Nulls? How sparse? </li></ul></ul><ul><ul><li>True Pearson for linear algebra requires zeros, but Nulls? </li></ul></ul><ul><ul><li>Depends on data – generally yes </li></ul></ul><ul><ul><ul><li>I.e. three people A, B, C - A has seen no films in common with B or C </li></ul></ul></ul><ul><ul><ul><li>A has seen 10 films, B – 5 and C – 15 </li></ul></ul></ul><ul><ul><ul><li>Pearson numerator for B would be –15 and for C –25 </li></ul></ul></ul><ul><ul><ul><li>So C is less similar to A than B is. </li></ul></ul></ul><ul><ul><li>So can ignore nulls - tfft! </li></ul></ul>
    17. 17. Progress <ul><li>600 Elsevier Full Text Articles. </li></ul><ul><ul><li>Single core running C++ </li></ul></ul><ul><ul><li>processes 20 articles / 80,000 terms per second. </li></ul></ul><ul><ul><li>Computations way faster than dense matrix. </li></ul></ul><ul><ul><li>Only 600 articles 150,000 unique terms. </li></ul></ul>A little distraction...
    18. 18. Progress
    19. 19. Progress (from van Rijsbergen, 1979) The most frequent words are not the most descriptive <ul><ul><li>More Optimisation </li></ul></ul><ul><ul><ul><li>Word Count (LSA) Characteristics </li></ul></ul></ul>Be careful, the lower discriminatory words can provide good information... (and serendipity)
    20. 20. Progress
    21. 21. Progress <ul><li>How Sparse? </li></ul><ul><ul><li>Term Document Count from 2.18 million DBPedia Abstracts </li></ul></ul>
    22. 22. Progress <ul><li>Distraction.... </li></ul><ul><ul><li>Some Least Popular (stemmed) – 3 docs each </li></ul></ul><ul><ul><ul><ul><li>Accretionari - an increase in a beneficiary's share in an estate </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Accordiana - a musical radio series which was heard on CBS in 1934 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Accokeek - Located in the southwest corner of Prince George's County </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Nazarbayev - President of Kazakhstan </li></ul></ul></ul></ul><ul><ul><li>The Most Popular </li></ul></ul><ul><ul><ul><ul><li>15938 – year </li></ul></ul></ul></ul><ul><ul><ul><ul><li>12476 – season </li></ul></ul></ul></ul><ul><ul><ul><ul><li>11410 – state </li></ul></ul></ul></ul><ul><ul><ul><ul><li>10758 – world </li></ul></ul></ul></ul><ul><ul><ul><ul><li>10722 – name </li></ul></ul></ul></ul>Serendipity
    23. 23. Progress <ul><li>Very Sparse </li></ul><ul><ul><li>Turns out to be Zipf – Mandelbrot distribution. </li></ul></ul><ul><ul><ul><ul><li>[1] G. K. Zipf, Human Behavior and the Principle of Least Effort. (Cambridge, Mass., 1949; Addison-Wesley, 1965). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>[2] B. Mandelbrot, “An informational theory of the statistical structure of language”, in Communication Theory, ed. Willis Jackson. (Butterworths, 1953). </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Word Count is .0025% dense </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Ignore Null for huge optimisation. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>40,000 x less compute </li></ul></ul></ul></ul><ul><ul><ul><ul><li>(using uniform density assumption) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Zipf-Mandelbrot has the form: </li></ul></ul></ul></ul><ul><ul><ul><ul><li>y = P1/(x+P2)^P3. </li></ul></ul></ul></ul>
    24. 24. Progress <ul><li>DBPedia words follow Zipf-Mandelbrot </li></ul><ul><ul><li>Zoomed in chunk of DBPedia word count </li></ul></ul><ul><ul><ul><ul><li>y = P1/(x+P2)^P3 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>best fit regression (red curve) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>with factors </li></ul></ul></ul></ul><ul><ul><ul><ul><li>P1 = 874150 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>P2 = 60.0000 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>P3 = 1.01000 </li></ul></ul></ul></ul>
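The fitted curve can be evaluated directly. A minimal sketch, assuming x is the rank of a term (1 = most frequent) and y its count, which the slide does not state explicitly; the default parameters are the factors quoted above and the function name is illustrative.

```python
def zipf_mandelbrot(rank: float, p1: float = 874150.0,
                    p2: float = 60.0, p3: float = 1.01) -> float:
    """y = P1 / (x + P2)^P3, defaulting to the fitted factors from the slide."""
    return p1 / (rank + p2) ** p3


# Predicted counts at a few ranks: a flat head, then a long thin tail.
for rank in (1, 10, 100, 1_000, 10_000, 100_000):
    print(f"rank {rank:>7}: {zipf_mandelbrot(rank):,.1f}")
```

P2 flattens the head of the curve compared with a plain Zipf law, and P3 sets how fast the tail falls away.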
    25. 25. Progress <ul><li>Calculate Comparisons. </li></ul><ul><ul><li>2.18 million Abstracts – 1.3M unique terms. </li></ul></ul><ul><ul><li>Fits in 2G RAM - (1.3M*2.18M*.000025 * 21 = 1.5G ) (as it is .0025% dense, and each entry is 21 bytes) </li></ul></ul><ul><ul><li>Uniform density assumption </li></ul></ul><ul><ul><li>Comparisons computable in a few minutes </li></ul></ul><ul><ul><ul><li>Not storable in RAM (3.6 Tbytes!) </li></ul></ul></ul><ul><ul><li>Big underestimate </li></ul></ul><ul><ul><ul><li>(Stopped the run after 4 hours) </li></ul></ul></ul>Stored a random 100-article sample, compared with all 2M others (0.2Gb), to allow intuitive QA
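The memory estimate on this slide is plain arithmetic, reproduced here as a sketch. All four inputs are the figures quoted above; the variable names are illustrative only.

```python
# Figures quoted on the slide.
UNIQUE_TERMS = 1.3e6
ABSTRACTS = 2.18e6
DENSITY = 0.000025      # .0025% of term/abstract cells are non-null
BYTES_PER_ENTRY = 21

dense_cells = UNIQUE_TERMS * ABSTRACTS
nonzero_cells = dense_cells * DENSITY
sparse_bytes = nonzero_cells * BYTES_PER_ENTRY

# Under the uniform density assumption, ignoring nulls saves a factor 1/density.
uniform_speedup = 1 / DENSITY

print(f"non-null cells: {nonzero_cells:,.0f}")
print(f"sparse storage: {sparse_bytes / 1e9:.2f} GB")
print(f"uniform-density speed-up: {uniform_speedup:,.0f}x")
```

The storage figure holds up, but the speed-up does not: the stopped 4-hour run shows that the non-null cells are concentrated in a few very common terms rather than spread evenly.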
    26. 26. Progress <ul><li>Extrapolating Compute Times. </li></ul><ul><ul><li>Not 40,000x less compute (.0025% dense) </li></ul></ul><ul><ul><ul><li>Word Count is .0025% dense - 40,000 x less compute </li></ul></ul></ul><ul><ul><ul><li>Top 1000 most commonly occurring terms the density is 0.78% </li></ul></ul></ul><ul><ul><li>Not 128x less (.78% dense) </li></ul></ul><ul><ul><li>So use square of area under curve </li></ul></ul><ul><ul><li>= Integral of Zipf Mandelbrot Squared </li></ul></ul><ul><ul><li>Roughly a power law. </li></ul></ul><ul><ul><ul><li>y = (x^P)/N </li></ul></ul></ul>
    27. 27. Progress DBPedia Abstracts Ops (not including algorithm cost) Regression using simple power law (P = 2.1 N = 1E0.73)
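A power law y = (x^P)/N is a straight line on log-log axes, so P and N can be recovered with ordinary least squares on the logs. The sketch below shows that technique on made-up data; it is not the regression tool used for the chart, and `fit_power_law` is a hypothetical helper.

```python
from math import exp, log


def fit_power_law(xs, ys):
    """Fit y = x**P / N by least squares on (log x, log y); returns (P, N)."""
    log_x = [log(x) for x in xs]
    log_y = [log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the log-log line is the exponent P.
    p = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
        / sum((a - mean_x) ** 2 for a in log_x)
    )
    # Intercept is -log(N), because log y = P * log x - log N.
    intercept = mean_y - p * mean_x
    return p, exp(-intercept)


# Synthetic check: points generated from y = x^2 / 5.
xs = [1, 2, 4, 8, 16]
ys = [x ** 2 / 5 for x in xs]
p, n_factor = fit_power_law(xs, ys)
print(f"P = {p:.3f}, N = {n_factor:.3f}")
```

With measured operation counts for a few sample sizes as `xs` and `ys`, the fitted P and N extrapolate the cost to the full data set.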
    28. 28. Progress <ul><li>Extrapolating Compute Times. </li></ul><ul><ul><li>2 Million abstracts </li></ul></ul><ul><ul><li>Square of Integral of Zipf Mandelbrot </li></ul></ul><ul><ul><li>Predicts calculable in 5.42 Hours </li></ul></ul><ul><ul><ul><li>Assuming all in RAM. </li></ul></ul></ul><ul><ul><ul><li>But writing Gigs to disk big overhead. </li></ul></ul></ul><ul><ul><ul><li>(Not run this yet to prove) </li></ul></ul></ul>
    29. 29. Progress Loans Data Ops (not including algorithm cost) (proving power law linear regression prediction) Regression using simple power law (P = 3.5 N = 1E10.15)
    30. 30. Progress <ul><li>Loans Data </li></ul><ul><ul><li>Hereford Libraries. </li></ul></ul><ul><ul><li>C++ In memory 'super fast hash' </li></ul></ul><ul><ul><li>Processed 19M loans in 1min 20sec. </li></ul></ul><ul><ul><li>Producing 269,000 unique borrowers and 491,000 unique books. </li></ul></ul><ul><ul><li>8 Million unique loan events </li></ul></ul>
    31. 31. Progress Loans per Individual – Nice Zipf Mandelbrot curve
    32. 32. Progress <ul><li>Zipf Mandelbrot – Good Assumption? </li></ul><ul><ul><li>Most (all large complex systems?) data that we are likely to process will follow a Zipf-Mandelbrot model. </li></ul></ul><ul><ul><li>Z. K. Silagadze shows [1] that these comply... </li></ul></ul><ul><ul><ul><li>Clickstreams </li></ul></ul></ul><ul><ul><ul><li>Page-rank (Linkage/Centrality) </li></ul></ul></ul><ul><ul><ul><li>Citations </li></ul></ul></ul><ul><ul><ul><li>Other long tail interactions. </li></ul></ul></ul><ul><ul><ul><ul><li>[1] Z. K. Silagadze, “Citations and the Zipf-Mandelbrot’s law”, [physics.soc-ph], 26 Jan 1999. Budker Institute of Nuclear Physics, 630 090, Novosibirsk, Russia. </li></ul></ul></ul></ul>
    33. 33. Findings <ul><li>Can't Store All Comparisons </li></ul><ul><ul><li>50 Tb for 10M matrix (½ N^2) for triangle matrix. </li></ul></ul><ul><ul><li>Store only meaningful? – Threshold = n or f. </li></ul></ul><ul><li>Can compute All 2M (squared) Comparisons </li></ul><ul><ul><li>In 6 hours (1 core). </li></ul></ul><ul><li>Can't Compute 1 Billion Comparisons </li></ul><ul><ul><li>287 years (7 days on 20,000 cores 10Billion?). </li></ul></ul><ul><li>Zipf Mandelbrot Curve is Useful. </li></ul><ul><ul><li>Can store All(?) raw metrics </li></ul></ul>n = count (fixed), f = factor (Z-M)
    34. 34. Findings <ul><li>Zipf Mandelbrot Curve is Useful. </li></ul><ul><ul><li>Head </li></ul></ul><ul><ul><ul><li>Big proportion of compute </li></ul></ul></ul><ul><ul><ul><ul><li>Large M1 M2 intersection. </li></ul></ul></ul></ul><ul><ul><ul><li>Low discrimination </li></ul></ul></ul><ul><ul><li>Body </li></ul></ul><ul><ul><ul><li>Good info </li></ul></ul></ul><ul><ul><ul><li>Medium compute </li></ul></ul></ul><ul><ul><li>Tail </li></ul></ul><ul><ul><ul><li>Specialist </li></ul></ul></ul><ul><ul><ul><li>Trivial or no compute </li></ul></ul></ul>
    35. 35. Findings <ul><li>Zipf Mandelbrot Curve is Useful. </li></ul><ul><ul><li>Allows us to make optimisations </li></ul></ul><ul><ul><ul><li>In reducing the y axis (and x a bit) </li></ul></ul></ul><ul><ul><ul><li>Chop the head off. </li></ul></ul></ul><ul><ul><li>Body </li></ul></ul><ul><ul><ul><li>Dimensionality reduction </li></ul></ul></ul><ul><ul><li>Tail </li></ul></ul><ul><ul><ul><li>Chop the tail off </li></ul></ul></ul><ul><ul><ul><li>Or Dimensionality reduction </li></ul></ul></ul><ul><ul><li>X axis = N^2, Y axis = M </li></ul></ul>
    36. 36. Findings <ul><li>What about storing meaningful comparisons? </li></ul><ul><ul><li>Solves storage problem </li></ul></ul><ul><ul><li>Requires repeated compute problem </li></ul></ul><ul><ul><ul><li>Deltas, could affect whole set </li></ul></ul></ul><ul><ul><ul><li>Will affect a chunk of the set </li></ul></ul></ul><ul><ul><ul><li>Could trade off timely accuracy with batch processing. </li></ul></ul></ul>
    37. 37. Proposal <ul><li>Store raw curve </li></ul><ul><ul><li>Sparse Storage – Bigtable like </li></ul></ul><ul><ul><ul><li>HBase, Hypertable, etc </li></ul></ul></ul><ul><ul><ul><li>Unloads indexing and lookup to nodes. </li></ul></ul></ul><ul><li>Calculate on the fly </li></ul><ul><ul><li>Two indices: Books -> people and People -> books </li></ul></ul><ul><ul><li>Not 1/2N^2 – just 1* Intersect (* M) </li></ul></ul><ul><ul><li>Tail – Retrieval problem M~0 Intersect~0. </li></ul></ul><ul><ul><li>Body – Some retrieval and compute. </li></ul></ul><ul><ul><li>Head – Big retrieval and compute, big M, big intersect. </li></ul></ul>
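The two-index idea can be sketched in memory with plain dictionaries. This is an illustration under assumptions, not the HBase experiment: `build_indices` and `similar_books` are hypothetical names, the loan tuples are invented, and similarity is a simple co-borrower count rather than a full Pearson score.

```python
from collections import defaultdict


def build_indices(loans):
    """Build both lookups from (person, book) loan events."""
    book_to_people = defaultdict(set)
    person_to_books = defaultdict(set)
    for person, book in loans:
        book_to_people[book].add(person)
        person_to_books[person].add(book)
    return book_to_people, person_to_books


def similar_books(book, book_to_people, person_to_books, top_n=10):
    """Rank other books by how many borrowers they share with `book`."""
    shared = defaultdict(int)
    # Only people who borrowed this book are visited, so the work is
    # proportional to the intersection, not to all N^2 pairs.
    for person in book_to_people.get(book, ()):
        for other in person_to_books[person]:
            if other != book:
                shared[other] += 1
    return sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Invented loan events for illustration.
loans = [
    ("ann", "b1"), ("ann", "b2"),
    ("bob", "b1"), ("bob", "b2"), ("bob", "b3"),
    ("cat", "b3"),
]
book_to_people, person_to_books = build_indices(loans)
print(similar_books("b1", book_to_people, person_to_books))
```

A tail book with one borrower costs a couple of lookups, while a head book walks many long borrower lists, which matches the tail, body and head split above.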
    38. 38. Proposal <ul><li>Pre Compute Head </li></ul><ul><ul><li>Store top n </li></ul></ul><ul><ul><li>Store any that take more than 0.5 seconds </li></ul></ul><ul><ul><ul><li>Zipf-Mandelbrot – retrieval only problem </li></ul></ul></ul><ul><ul><ul><li>Dynamic – finding them is linear </li></ul></ul></ul><ul><ul><li>Store as a cache – only when requested? </li></ul></ul><ul><ul><ul><li>Depends on acceptable delay? </li></ul></ul></ul><ul><li>This Hybrid Scales Better. </li></ul><ul><ul><li>Better than Storing all or Computing all </li></ul></ul>
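One way to read "store any that take more than 0.5 seconds" is a cache that only keeps results whose computation was slow. The sketch below is an assumption about how that could look, not the proposed service: `HybridSimilarity` and its parameters are hypothetical, and the clock is injectable so the behaviour can be checked without real delays.

```python
import time


class HybridSimilarity:
    """Compute on the fly, but keep results that were slow to produce."""

    def __init__(self, compute, slow_threshold_s=0.5, clock=time.perf_counter):
        self.compute = compute              # function: key -> result
        self.slow_threshold_s = slow_threshold_s
        self.clock = clock                  # injectable for testing
        self.cache = {}

    def get(self, key):
        # Head items that were slow before are now a retrieval-only problem.
        if key in self.cache:
            return self.cache[key]
        start = self.clock()
        result = self.compute(key)
        elapsed = self.clock() - start
        # Cheap (tail and most body) results are recomputed on every request.
        if elapsed > self.slow_threshold_s:
            self.cache[key] = result
        return result


def slow_lookup(key):
    """Stand-in for an expensive head comparison."""
    time.sleep(0.01)
    return key.upper()


service = HybridSimilarity(slow_lookup, slow_threshold_s=0.005)
print(service.get("popular book"), "cached:", "popular book" in service.cache)
```

The threshold trades storage for latency: lowering it stores more of the body, raising it keeps only the heaviest head items.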
    39. 39. Proposal <ul><li>It doesn't scale indefinitely. </li></ul><ul><ul><li>Scale up by 10, nodes up by 100 </li></ul></ul><ul><ul><li>Dimensionality reduction will HAVE to kick in. </li></ul></ul><ul><ul><ul><li>This approach allows that, but at bigger scales than most </li></ul></ul></ul><ul><ul><ul><li>Consider severing head and tail as an early approach. </li></ul></ul></ul><ul><li>Only optimised for individual requests. </li></ul><ul><ul><li>Given this article find the 10 next similar </li></ul></ul><ul><ul><li>What about...”Given this corpus find the 100 most similar things” </li></ul></ul><ul><ul><ul><li>Then set n = infinity (or f=0) and the service will tell you how many days to come back for your results </li></ul></ul></ul>
    40. 40. Proposal <ul><li>Experimentation </li></ul><ul><ul><li>Used HBase, HDFS, Hadoop </li></ul></ul><ul><ul><ul><li>Not using Hadoop yet – but is good fit for data ingest </li></ul></ul></ul><ul><ul><ul><li>Hadoop not very efficient for comparison – but doable. </li></ul></ul></ul><ul><ul><li>Used Loan data and binary Pearson </li></ul></ul><ul><ul><ul><li>Ignoring nulls (and sigma) – so counts only. </li></ul></ul></ul>Quick demo
    41. 41. Proposal <ul><li>Next Steps </li></ul><ul><ul><li>Prove this approach </li></ul></ul><ul><ul><li>Performance Testing. </li></ul></ul><ul><ul><ul><li>HBase over n nodes (perf lab, poss then EC2?). </li></ul></ul></ul><ul><ul><ul><li>Timing retrieval vs compute </li></ul></ul></ul><ul><ul><ul><ul><li>Good logging </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Configurable variables </li></ul></ul></ul></ul><ul><ul><ul><li>Multiple 'stores' (data sets). </li></ul></ul></ul><ul><ul><ul><li>8M loans now – 80M? </li></ul></ul></ul><ul><ul><ul><li>Hadoop the ingest – if just to save time during trials </li></ul></ul></ul>