Anegdotic Maxeler (Romania)

  • Elastic makes things worse

    1. 1. V. Milutinovic, G. Rakocevic, S. Stojanovic, and Z. Sustran (University of Belgrade); Oskar Mencer (Imperial College, London); Oliver Pell (Maxeler Technologies, London and Palo Alto); Michael Flynn (Stanford University, Palo Alto, USA); Valentina Balas (Aurel Vlaicu University of Arad, Romania, Maxeler Ambassador)
    2. 2. An Alternative Title: How to hire more than 1000 PhD students at no additional cost to tax payers?
    3. 3. For Big Data algorithms, at the same hardware price as before, achieving: a) a speed-up of 20 to 200 times; b) monthly electricity bills reduced 20 times; c) size 20 times smaller. The major issues of engineering are design cost and design complexity. Remember, the economy has its own rules: production count and market demand!
    4. 4. Elaboration :) If a computer center spends EUR 50M/year on electricity bills and moves most of its time-consuming algorithms to Maxeler, which uses 20 times less power, the yearly spending drops to EUR 2.5M, and EUR 47.5M is saved for taxpayers :) If the average net monthly salary of a PhD student in Germany is EUR 1500, and if the overhead factor is 1.00, it is easy to calculate that EUR 47.5M can pay about 2611 PhD students to work for one year, and that can go on year after year :) If the overhead factor is 2.611 (I do not know how big it is, but it is less than 2.611, for sure), one can hire 1000 PhD students at no additional cost :)
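The arithmetic on this slide can be rechecked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the assumption of 12 salaries per year is not stated on the slide. Under that assumption the result comes out at roughly 2,600 students for an overhead factor of 1.00 and roughly 1,000 for an overhead factor of 2.611, which agrees with the slide to within rounding.

```python
import math


def yearly_savings(current_bill_eur, power_reduction):
    """Money saved per year when power use drops by the given factor."""
    return current_bill_eur - current_bill_eur / power_reduction


def fundable_students(budget_eur, monthly_salary_eur,
                      overhead_factor=1.0, salaries_per_year=12):
    """Whole number of students a yearly budget can pay.

    salaries_per_year=12 is an assumption; the slide only gives a salary.
    """
    yearly_cost = monthly_salary_eur * salaries_per_year * overhead_factor
    return math.floor(budget_eur / yearly_cost)


saved = yearly_savings(50e6, 20)  # EUR 50M bill, 20 times less power
print(saved)
print(fundable_students(saved, 1500))                         # overhead 1.00
print(fundable_students(saved, 1500, overhead_factor=2.611))  # overhead 2.611
```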
    5. 5. Essence:
        1. Over 95% of run time in loops [loops to almost zero]
        2. Reusability of data (e.g., x + x^2 + x^3 + x^4 + …) [how close to zero?]
        3. BigData [prog: for data streaming, not for data control]
        4. Latency
        5. A new programming model
        6. WORM [prog. effort + comp. time]
        Use a tractor, not a Ferrari, to drive over a plowed field
    6. 6. Absolutely all results achieved in Europe: a) All hardware produced in Europe, specifically UK b) All software generated by programmers of EU and WB 6/83
    7. 7. ControlFlow (MultiFlow and ManyFlow): Top500 ranks using Linpack (Japanese K, IBM Sequoia, Cray Titan, …). DataFlow: Coarse Grain (HEP) vs. Fine Grain (Maxeler). The history starts in the 1960s! The enabler technology did not exist before the year 2000!
    8. 8. Compiling below the machine code level brings speedups, and also lower power, size, and cost. The price to pay: the machine is more difficult to program. Consequently: ideal for WORM applications :) Examples using Maxeler: GeoPhysics (20-200), Banking (200-2000, with JP Morgan 20%), M&C (New York City), Datamining (Google), …
    9. 9. Simulator builder Hardware builder 9 2n+3
    10. 10. 10/83
    11. 11. Why Java? Minimal Kolmogorov Complexity, etc… 11/83
    12. 12. 12
    13. 13. 13
    14. 14. Execution time models:
        $t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$
        $t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$
        $t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$
        Assumptions: 1. Software includes enough parallelism to keep all cores busy. 2. The only limiting factor is the number of cores.
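The three formulas on this slide can be turned into a small calculator. The sketch below assumes a reading of the symbols that the slide does not spell out: N is the number of data items, NOPS the operations per item, C the clock cycles per operation, Tclk the clock period, and Ncores (or NDF) the number of parallel cores (or dataflow pipes). The function and parameter names are illustrative only.

```python
def t_control_flow(n_items, n_ops, cycles_per_op, t_clk, n_cores):
    """tCPU / tGPU model: every item pays for every operation."""
    return n_items * n_ops * cycles_per_op * t_clk / n_cores


def t_data_flow(n_items, n_ops, cycles_per_op, t_clk, n_pipes):
    """tDF model: fill the pipeline once, then stream the remaining items."""
    fill = n_ops * cycles_per_op * t_clk        # latency of the first result
    stream = (n_items - 1) * t_clk / n_pipes    # one item per clock per pipe
    return fill + stream


# Hypothetical numbers, chosen only to show the shape of the comparison:
# a fast 8-core control-flow machine vs. a slow-clocked dataflow engine.
cpu = t_control_flow(10**9, 100, 1, 1 / 3e9, 8)
dfe = t_data_flow(10**9, 100, 1, 1 / 200e6, 4)
print(cpu, dfe, cpu / dfe)
```

In this model the dataflow time grows with N alone once the pipeline is full, while the control-flow time grows with N times NOPS, which is where the claimed advantage for long loops over big data comes from.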
    15. 15. DualCore? Which way are the horses going? 15/83
    16. 16. Is it possible to use 2000 chickens instead of two horses? What is better, real and anecdotal?
    17. 17. 2 x 1000 chickens (CUDA and rCUDA) 17/83
    18. 18. How about 2 000 000 ants?
    19. 19. Big Data Input Results Marmalade 19/83
    20. 20. Factor: 20 to 200 MultiCore/ManyCore Dataflow Machine Level Code Gate Transfer Level 20/83
    21. 21. Factor: 20 MultiCore/ManyCore Dataflow 21/83
    22. 22. Factor: 20 MultiCore/ManyCore DataFlow Data Processing Data Processing Process Control Process Control 22/83
    23. 23. MultiCore: Explain what to do, to the driver. Caches, instruction buffers, and predictors needed. ManyCore: Explain what to do, to many sub-drivers. Reduced caches and instruction buffers needed. DataFlow: Make a field of processing gates: 1C+2nJava+3Java. No caches, etc. (300 students/year: BGD, BCN, LjU, ICL, …)
    24. 24. MultiCore: Business as usual. ManyCore: More difficult. DataFlow: Much more difficult. Debugging both application and configuration code.
    25. 25. MultiCore/ManyCore: Several minutes. DataFlow: Several hours for the real hardware. Fortunately, only several minutes for the simulator, several seconds for reload (90% due to DRAM inertia), and several milliseconds to restart. The simulator supports both the large JPMorgan machine and the smallest “University Support” machine. Good news: Tabula@2GHz
    26. 26. 26/83
    27. 27. MultiCore: Horse stable. ManyCore: Chicken house. DataFlow: Ant hole.
    28. 28. MultiCore: Haystack. ManyCore: Cornbits. DataFlow: Crumbs.
    29. 29. Small Data: Toy Benchmarks (e.g., Linpack) 29/83
    30. 30. Medium Data (benchmarks favoring NVidia, compared to Intel, …)
    31. 31. Big Data 31/83
    32. 32. Maxeler Hardware
        CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
        DFEs shared over Infiniband: Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
        Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
        MaxWorkstation: Desktop development system
        MaxCloud: On-demand scalable accelerated compute resource, hosted in London
    33. 33. Major Classes of Algorithms, from the Computational Perspective
        1. Coarse grained, stateful: Business – CPU requires DFE for minutes or hours – Interrupts
        2. Fine grained, transactional with shared database: DM – CPU utilizes DFE for ms to s – Many short computations, accessing common database data
        3. Fine grained, stateless transactional: Science (Phy, ...) – CPU requires DFE for ms to s – Many short computations
    34. 34. Coarse Grained: Modeling
        [Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (Hz)]
        [Figure: equivalent CPU cores vs. number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]
        • Long runtime, but:
        • Memory requirements change dramatically based on modelled frequency
        • Number of DFEs allocated to a CPU process can be easily varied to increase available memory
        • Streaming compression
        • Boundary data exchanged over chassis MaxRing
    35. 35. Fine Grained, Shared Data: Monitoring • DFE DRAM contains the database to be searched • CPUs issue transactions find(x, db) • Complex search function – Text search against documents – Shortest distance to coordinate (multi-dimensional) – Smith Waterman sequence alignment for genomes • Any CPU runs on any DFE that has been loaded with the database – MaxelerOS may add or remove DFEs from the processing group to balance system demands – New DFEs must be loaded with the search DB before use 35/83
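To make the find(x, db) transaction concrete, here is a minimal software sketch of one of the search functions named on this slide, Smith Waterman local alignment. It is not Maxeler code: the scoring values (match, mismatch, gap), the linear gap penalty, and the `find` helper are assumptions chosen for brevity, and a DFE implementation would stream the database from DFE DRAM instead of looping on a CPU.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def find(x, db):
    """Return the database entry that aligns best with the query x."""
    return max(db, key=lambda entry: smith_waterman_score(x, entry))


print(find("ACGT", ["TTTT", "GGACGTGG", "ACCT"]))
```

Each call to `find` is independent of the others, which is why any CPU can issue it to any DFE that already holds the database.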
    36. 36. Fine Grained, Stateless: The BSOP Control
        • Analyse > 1,000,000 scenarios
        • Many CPU processes run on many DFEs
        • ≈50x MPC-X vs. multi-core x86 node
        • Each transaction executes on any DFE in the assigned group atomically
        [Figure: CPUs send market and instruments data to the DFEs; each DFE loops over instruments, runs a random number generator with sampling of underliers, and prices instruments using Black Scholes; the instrument values return to the CPUs for tail analysis]
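The per-DFE work described on this slide (sample the underliers, loop over instruments, price each with Black Scholes) can be sketched in plain Python. Everything specific here is an assumption made for illustration: the instruments are European calls given as (strike, maturity) pairs, a single underlier is shocked with a lognormal factor, and the names `scenario_values` and `shock_vol` are hypothetical. Only the closed-form Black Scholes call price itself is standard.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form Black Scholes price of a European call."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (
        vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def scenario_values(instruments, spot, rate, vol, n_scenarios,
                    shock_vol=0.1, seed=0):
    """Portfolio value per scenario: sample the underlier, then price everything."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        shocked_spot = spot * math.exp(rng.gauss(0.0, shock_vol))  # sampling
        values.append(sum(black_scholes_call(shocked_spot, k, rate, vol, t)
                          for k, t in instruments))                # loop over instruments
    return values


values = scenario_values([(100.0, 1.0), (110.0, 0.5)], 100.0, 0.05, 0.2, 1000)
worst = sorted(values)[:10]  # a crude stand-in for the CPU-side tail analysis
print(len(values), len(worst))
```

Scenarios do not depend on each other, so each one can run on any DFE in the group, which matches the stateless transaction model of the slide.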
    37. 37. Selected Examples: Business, Mathematics, GeoPhysics, etc. 37/83
    38. 38. 38
    39. 39. An MIS Example: Credit Derivatives
    40. 40. Orbital station Climber Tether HW
    41. 41. 41
    42. 42. Seismic Imaging • Running on MaxNode servers - 8 parallel compute pipelines per chip - 150MHz => low power consumption! - 30x faster than microprocessors. Source: T. Nemeth†, J. Stefani†, W. Liu†, R. Dimond‡, O. Pell‡, R. Ergas§, An Implementation of the Acoustic Wave Equation on FPGAs, SEG 2008 († Chevron, ‡ Maxeler, § formerly Chevron)
    43. 43. The CRS Results: Performance of one MAX2 card vs. 1 CPU core. Land case (8 params), speedup of 230x. Marine case (6 params), speedup of 190x. [Figure: CPU coherency vs. MAX2 coherency]
    44. 44. 44
    45. 45.
    46. 46. Trace Stacking: Speed-up 217 (P. Marchetti et al, 2010)
        • DM for Monitoring and Control in Seismic processing
        • Velocity independent / data driven method to obtain a stack of traces, based on 8 parameters
        • Search for every sample of each output trace:
          $t_{hyp}^2 = \left( t_0 + \frac{2}{v_0} w^T m \right)^2 + \frac{2 t_0}{v_0} \left( m^T H_{zy} K_N H_{zy}^T m + h^T H_{zy} K_{NIP} H_{zy}^T h \right)$
        • 2 parameters (emergence angle & azimuth)
        • 3 Normal Wave front parameters ($K_{N,11}$; $K_{N,12}$; $K_{N,22}$)
        • 3 NIP Wave front parameters ($K_{NIP,11}$; $K_{NIP,12}$; $K_{NIP,22}$)
    47. 47. Maxeler running Smith Waterman 48
    48. 48. Molecular Correlates of Tumor Signatures from a Large Cohort From whole slide sections, of a cohort, to pathway analysis (Prof Bahram Parvin, Berkeley) High Content Analysis (HCA) on MPC-X
    49. 49. 51
    50. 50. Conclusion: Nota Bene This is about algorithmic changes, to maximize the algorithm to architecture match: algorithmic modifications, pipeline utilization, data choreography, and decision making precision. The winning paradigm of Big Data ExaScale? 52/83
    51. 51. Algorithmic Changes: Data Dependencies [Figure: dependency diagram from PSI[0] … PSI[N-1] through operators OP and OP' with coefficients cbeta[0] … cbeta[N-2], producing PSI[0] … PSI[N-1]] Example generated by Sasa Stojanovic (Gross-Pitaevskii)
    52. 52. Pipeline Changes: Higher Efficiency [Figure: pipeline diagram with inputs X[0,0], X[0,1], stages [0,0] … [7,0], and result R[0,0]] Example generated by Sasa Stojanovic (Gross-Pitaevskii)
    53. 53. Data Rechoreography: Pipeline Utilization [Figure: order of data accesses inside of a burst] Example generated by Sasa Stojanovic (Gross-Pitaevskii)
    54. 54. Fixed Point: Savings Reinvestable
        • Consider fixed point compared to single precision floating point
        • If the range is tightly confined, one could use 24-bit fixed point
        • If data has a wider range, may need 32-bit fixed point
                      hwFloat(8,24)   hwFix(24,...)   hwFix(32,...)
        Add           500 LUTs        24 LUTs         32 LUTs
        Multiply      2 DSPs          2 DSPs          4 DSPs
        • Arithmetic is not 100% of the chip. In practice, often ~5x performance boost from fixed point.
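What "24-bit fixed point with a tightly confined range" means for the data can be shown with a small quantization sketch. This is ordinary Python, not MaxJ, and the split into 16 fractional bits is a hypothetical choice (the slide elides the second hwFix parameter). The point is only that values inside the representable range lose at most half of the last fractional bit, while values outside it saturate.

```python
def to_fixed(value, total_bits=24, frac_bits=16):
    """Quantize a real value to a signed fixed-point integer, saturating on overflow.

    frac_bits=16 is an illustrative assumption, not a value from the slide.
    """
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def from_fixed(raw, frac_bits=16):
    """Convert the fixed-point integer back to a float."""
    return raw / (1 << frac_bits)


x = 0.1
error = abs(from_fixed(to_fixed(x)) - x)
print(to_fixed(x), error)
print(to_fixed(1000.0))  # outside the 24-bit range, so it saturates
```

If the saturation case can occur in real data, that is the signal that the wider 32-bit format from the table is needed.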
    55. 55. Revisiting the Top 500 SuperComputers benchmarks: our paper in Communications of the ACM. Revisiting all major Big Data DM algorithms: massive static parallelism at low clock frequencies. Concurrency and communication: concurrency between millions of tiny cores is difficult, and “jitter” between cores will harm performance at synchronization points. Reliability and fault tolerance: 10-100x fewer nodes, failures much less often. Memory bandwidth and FLOP/byte ratio: optimize data choreography, data movement, and the algorithmic computation. New architecture of n-Programming paradigms.
    56. 56. FP7: RoMoL@BCN The SAB goal: Out of box thinking! 58/83
    57. 57. FP7: BalCon@SRB The vision of Alkis Konstantellos The SAB goal: Seed for new proposals! 59/83
    58. 58. DAFNE: Leader MISANU 60/83
    59. 59. DAFNE = South (MaxCode) + North (BigData)
        South: MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, IRB, QPLAN, Bogazici, U of Istanbul, U of Bucharest, U of Arad, U of Tuzla, Technion, Maxeler Israel, IPSI
        North: UK, Sweden, Norway, Denmark, Germany, France, Austria, Switzerland, Poland, Hungary
    60. 60. The DAFNE Map 62/83
    61. 61. The TriPeak @ DATAMAN Siena + BSC + Imperial College + Maxeler + Belgrade 63/83 46/83
    62. 62. The TriPeak: Essence MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM) Maxeler = A FineGrain DataFlow (FPGA) How about a happy marriage? MontBlanc (ompSS) and Maxeler (an accelerator) In each happy marriage, it is known who does what :) The Big Data DM algorithms: What part goes to MontBlanc and what to Maxeler? 64/83 64/83
    63. 63. TriPeak: Core of the Symbiotic Success An intelligent DM algorithmic scheduler, partially implemented for compile time, and partially for run time. At compile time: Checking what part of code fits where (MontBlanc or Maxeler): LoC 1M vs 2K vs 20K At run time: Rechecking the compile time decision, based on the current data values. 65/83 65/83
    64. 64. 66 66/83
    65. 65. Maxeler: Research (Google: good method)
        Structure of a Typical Research Paper: Scenario #1 [Comparison of Platforms for One Algorithm]
        Curve A: MultiCore of approximately the same PurchasePrice
        Curve B: ManyCore of approximately the same PurchasePrice
        Curve C: Maxeler after a direct algorithm migration
        Curve D: Maxeler after algorithmic improvements
        Curve E: Maxeler after data choreography
        Curve F: Maxeler after precision modifications
        Structure of a Typical Research Paper: Scenario #2 [Ranking of Algorithms for One Application]
        CurveSet A: Comparison of Algorithms on a MultiCore
        CurveSet B: Comparison of Algorithms on a ManyCore
        CurveSet C: Comparison on Maxeler, after a direct algorithm migration
        CurveSet D: Comparison on Maxeler, after algorithmic improvements
        CurveSet E: Comparison on Maxeler, after data choreography
        CurveSet F: Comparison on Maxeler, after precision modifications
    66. 66. Maxeler Research in Serbia: Special Issue of IPSI Transactions Journal
        KG: Blood Flow, Tijana Djukic and Prof. Filipovic
        NS: Combinatorial Math, Prof. Senk and Ivan Stanojevic
        MISANU: The SAT Math, Zivojin Sustran and Prof. Ognjanovic
        ETF: Meteorology, Radomir Radojicic and Marko Stankovic
        ETF: Physics (Gross Pitaevskii 3D real), Sasa Stojanovic
        ETF: Physics (Gross Pitaevskii 3D imaginary), Lena Parezanovic
    67. 67. Maxeler Research WorldWide: Special Issue of Advances in Computers @ SCI Stanford, Texas, Imperial, Maxeler, ETF, MF, MISANU, IMP, KG, NS, BSC, UPV, U of Siena, U of Roma, IJS, FRI, … 69/83 69/83
    68. 68. © H. Maurer 70 70/83
    69. 69. Maxeler: Teaching (Google: prof vm) VLSI, PowerPoints, Maxeler: TEACHING, Maxeler Veljko Explanations, August 2012 Maxeler Veljko Anegdotic, Maxeler Oskar Talk, August 2012 Maxeler Forbes Article Flyer by JP Morgan Flyer by Maxeler HPC Tutorial Slides by Sasha and Veljko: Practice (Current Update) Paper, unconditionally accepted for Advances in Computers by Elsevier Paper, unconditionally accepted for Communications of the ACM Tutorial Slides by Oskar: Theory (7 parts) Slides by Jacob, New York Slides by Jacob, Alabama Slides by Sasha: Practice (Current Update) Maxeler in Meteorology Maxeler in Mathematics Examples generated in Belgrade and Worldwide THE COURSE ALSO INCLUDES DARPA METHODOLOGY FOR MICROPROCESSOR DESIGN, with an example 71/83 71/83
    70. 70. Maxeler PreConference Tutorials (2013) Google: IEEE HiPEAC, Berlin, Germany, January 2013; ACM iSAC, Coimbra, Portugal, March 2013; IEEE MECO, Budva, Montenegro, June 2013; ACM ISCA, Tel Aviv, Israel, June 2013
    71. 71. Maxeler InHouse Tutorials (2013) 73/83 73/83
    72. 72. © H. Maurer 74 74/83
    73. 73. Maxeler University Program Members 75/83
    74. 74. How to Become a Family Member? Options to consider: a. MAX-UP free of charge b. Purchasing a university-level machine (min about $10K) c. Purchasing a JPM-level machine (slowly approaching $100M), or at least a Schlumberger-level machine (slowly moving above $10M) 76/83 76/83
    75. 75. Good to Know! Maxeler employs close to 100 people, GBR and USA: a. Maxeler cash burn per year = about $10M b. If a university-level machine is sold at the 100% profit margin, the company life of Maxeler is extended for about 2 hours. c. If a university-level machine is sold at the 1% profit margin, the company life of Maxeler is extended for 1 minute. Our past or ongoing FP7 projects requiring Maxeler speeds: a. ProSense b. ARTreat c. HiPEAC 77/83 77/83
    76. 76. The Educational Mission Important note: a. Total number of accredited universities in the whole world? b. As per WeboMetrics, about 20000. c. Consequently, all universities of the world together bring only: 20000 minutes of extra life, or about two weeks of extra life. The reality: a. University-level machines are sold at the ZERO profit margin! b. Only the Xilinx costs, handling, and shipping. c. Email support for student doing thesis is practically unlimited! Conclusion: This is a chance for those who jump in first :) 78/83 78/83
    77. 77. Our Work Impacting Maxeler Milutinovic, V., Knezevic, P., Radunovic, B., Casselman, S., Schewel, J., Obelix Searches Internet Using Customer Data, IEEE COMPUTER, July 2000 (impact factor 2.205/2010). Milutinovic, V., Cvetkovic, D., Mirkovic, J., Genetic Search Based on Multiple Mutation Approaches, IEEE COMPUTER, November 2000 (impact factor 2.205/2010). Milutinovic, V., Ngom, A., Stojmenovic, I., STRIP --- A Strip Based Neural Network Growth Algorithm for Learning Multiple-Valued Functions, IEEE TRANSACTIONS ON NEURAL NETWORKS, March 2001, Vol.12, No.2, pp. 212-227. Jovanov, E., Milutinovic, V., Hurson, A., Acceleration of Nonnumeric Operations Using Hardware Support for the Ordered Table Hashing Algorithms, IEEE TRANSACTIONS ON COMPUTERS, September 2002, Vol.51, No.9, pp. 1026-1040 (impact factor 1.822/2010). 79/83 79/83
    78. 78. Maxeler Impacting Our Work
        Tafa, Z., Rakocevic, G., Mihailovic, Dj., Milutinovic, V., Effects of Interdisciplinary Education On Technology-driven Application Design, IEEE Transactions on Education, August 2011, pp. 462-470 (impact factor 1.328/2010).
        Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S., Milutinovic, V., Fast File Existence Checking in Archiving Systems, ACM Transactions on Storage (TOS), Volume 7, Issue 1, June 2011, ACM New York, NY, USA.
        Jovanovic, Z., Milutinovic, V., FPGA Accelerator for Floating-Point Matrix Multiplication, IET Computers & Digital Techniques, 2012, 6, (4), pp. 249-256.
        Flynn, M., Mencer, O., Milutinovic, V., Rakocevic, G., Stenstrom, P., Trobec, R., and Valero, M., Moving from Petaflops (on Simple Benchmarks) to Petadata per Unit of Time and Power (on Sophisticated Benchmarks), Communications of the ACM, May 2013 (impact factor 1.919/2010).
    79. 79. Current Main Efforts of Maxeler 1. To encourage a lot of software to be written/ported. This is a key business opportunity that needs to be developed. 2. Maxeler is building up a website and a community to share software for DFEs. This would allow the software to also be sold directly from the Maxeler website. 3. If a PhD student ports an important software to a Maxeler machine, she/he could become the first software vendor in the world for dataflow computers, and Maxeler would be happy to help sell licenses. 81/83
    80. 80. Current Side Efforts of Maxeler 1. Developing new tools for easier making of kernels. 2. Bringing new languages to Maxeler: C, C++, MATLAB, Mathematica 3. Porting popular application packages to Maxeler: OpenSees, etc... 4. Trying the Tabula FPGA! 5. Getting more than 1TeraByte/sec thru I/O 6. Minimizing the hardware, so it can go into Galaxy 5,6…
    81. 81. NewTools: MaxSkins [Figure: tool flow. Developer side: Dataflow Design (.maxj) and Custom Engine Interfaces (.c) go through MaxCompiler to a .max file, followed by testing / application integration and the MaxCompiler App Packager. User side: App Installer, then SLiC level programming from MATLAB (.mex, .m), C/C++, R, Excel, and Python.]
    82. 82. Getting Started with Practical Work from the Linux Shell
        1. Open a shell terminal (e.g., $ /usr/bin/xfce4-terminal).
        2. Connect to the Maxeler machine (e.g., $ ssh root@147.91.12.216).
        3. If more shell screens are needed, start screen (e.g., $ screen).
        4. Switch to the directory that contains the 2n+3 programs you wrote (e.g., $ cd Desktop/workspace/src/ind/z88/).
        5. Prepare your C code for measuring the execution time (e.g., clock_gettime(CLOCK_REALTIME, &t2);).
        6. See what you can do (e.g., $ make).
        7. Select one of those that you can do (e.g., $ make build-sim, $ make run-sim, $ make build-hw, $ make run-hw).
        8. Measure the power consumption at the wall plug.
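Step 5 brackets the region of interest with two timestamps and subtracts them. The slide does this in C with clock_gettime; the sketch below shows the same pattern in Python with `time.perf_counter`, purely as an illustration of the measurement idea. The helper names `measure` and `speedup` are hypothetical, and taking the best of several repeats is a common habit, not something the slide prescribes.

```python
import time


def measure(fn, *args, repeats=3):
    """Run fn(*args) several times; return its result and the best wall-clock time."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()             # timestamp before the region
        result = fn(*args)
        elapsed = time.perf_counter() - start   # timestamp after, minus before
        best = min(best, elapsed)
    return result, best


def speedup(t_baseline, t_accelerated):
    """How many times faster the accelerated run is than the baseline."""
    return t_baseline / t_accelerated


result, seconds = measure(sum, range(1_000_000))
print(result, seconds)
```

The same two numbers, one from the CPU-only run and one from the run-sim or run-hw target, are what go into the speed-up factor quoted throughout the deck.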
    83. 83. Q&A vm@etf.rs © H. Maurer 85 85/83