ISPASS 2012, Session 6, Presentation 1

  1. Evaluating FPGA-acceleration for Real-time Unstructured Search
     Sai Rahul Chalamalasetti†, Martin Margala†, Wim Vanderbauwhede*, Mitch Wright‡, Parthasarathy Ranganathan‡‡
     †University of Massachusetts Lowell, Lowell, MA
     *University of Glasgow, Scotland, UK
     ‡Hewlett Packard, Houston, TX
     ‡‡Hewlett Packard Labs, Palo Alto, CA
  2. Outline
     - Motivation
     - Workload and Algorithm Description
     - Hardware Systems
     - Synthetic Datasets
     - Performance Results
     - Alternatives and Future Work
     - Conclusion
  3. Motivation: The era of "big data"
     - Explosion in data, particularly unstructured data
       - Information doubling every 18 months or faster
       - Enterprise server systems processed and delivered over 9 zettabytes in 2008 (UCSD report)
       - Walmart: 1M transactions/hour; LHC: 1 PB/second; YouTube: 48 hours of video/minute; Facebook: 100 TB of logs/day
     - Explosion in data-centric workloads
       - Collect, store, access, share, visualize, analyze, interpret, ...
       - Consumer, enterprise, scientific, ...
       - New applications emerging: search, live business analytics, social correlation, collaborative filtering, ...
     - Need better performance for deeper analytics across diverse data
  4. Motivation: The era of "green computing"
     - Power and cooling are important constraints for servers and datacenters
       - Only four countries consume more electricity than the world's datacenters; power costs run to millions of dollars for cloud datacenters
       - Thermal density and the costs of power delivery and cooling infrastructure
     - Sustainability is a growing concern
       - Lifecycle minimization of environmental effects and carbon emissions
       - Corporate initiatives from HP, Cisco, Dell, Google, IBM, and Intel; government initiatives from the EPA, DOE, etc.
  5. This Work
     High-performance, energy-efficient data-centric architectures
     - FPGAs/accelerators are a good way to improve energy efficiency
     - Accelerated unstructured search, mainly data analysis (document filtering and profile matching)
     - GiDEL PROCStar IV board (four Altera Stratix IV 530 FPGAs)
     Recent developments offer promise
     - Better toolkits and IP for host-to-FPGA interfaces, e.g. from GiDEL
     - Future platforms, e.g. ARM+FPGA on a single die from Altera and Xilinx
     - Recent commercial successes, e.g. Fusion-io, Netezza, etc.
     What we achieved
     - Performance speedup of 23x to 38x
     - Energy-efficiency improvement of 31x to 40x
     - Performance-per-cost improvement of 10x
  6. Choice of Workload
     - Wide variety of emerging data-centric workloads
       - Operations: collect and store; maintain and manage; retrieve, interpret, and analyze
     - Focus on an important emerging class: real-time unstructured search
       - Searching patent repositories for related-work comparison
       - Searching emails and SharePoint sites for enterprise information management
       - Detecting spam in incoming emails
       - Monitoring communications, e.g. for terrorist activity
       - News story topic detection and tracking
       - Searching through books, images, and videos for matching profiles
  7. Algorithm Description
     Document model
     - Each document is modeled as a bag of words D of pairs (t, f)
       - t is a term; f is the number of occurrences of t in document D
     - A profile M is a set of pairs p = (t, w)
       - t is a term; w is its weight
     - A Bayesian algorithm is used offline to precompute the profile based on user requirements
     (A software sketch of the filtering step follows below.)
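The slide gives the data model but not the scoring step itself. Below is a minimal C++ sketch of how a document could be filtered against a profile under this model, assuming (our assumption, not stated on the slide) that a document's score is the sum of f·w over terms found in the profile and that a fixed threshold decides a profile hit; all names and the threshold value are illustrative.

```cpp
// Minimal software sketch of document filtering under the bag-of-words model
// above. Assumption: score = sum of f * w over profile terms; a threshold
// decides whether the document is a hit.
#include <cstdint>
#include <unordered_map>
#include <vector>

using TermId = std::uint32_t;

struct TermFreq { TermId term; std::uint32_t freq; };     // (t, f)
using Document = std::vector<TermFreq>;                   // bag of words D
using Profile  = std::unordered_map<TermId, double>;      // M: t -> w

// Returns true if the document matches the profile.
bool matches(const Document& doc, const Profile& profile, double threshold) {
    double score = 0.0;
    for (const auto& tf : doc) {
        auto it = profile.find(tf.term);    // most lookups miss in practice
        if (it != profile.end())
            score += tf.freq * it->second;  // accumulate f * w for profile hits
    }
    return score >= threshold;
}

int main() {
    Profile m  = {{7, 0.5}, {42, 1.5}};        // toy profile M
    Document d = {{7, 3}, {9, 1}, {42, 2}};    // toy document D
    return matches(d, m, 2.0) ? 0 : 1;         // score = 3*0.5 + 2*1.5 = 4.5 -> hit
}
```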
  8. Hardware Platform
     FPGA board
     - GiDEL PROCStar IV development board
     - Internal FPGA memory of 20 Mb
     - External memory per FPGA:
       - Bank A: 512 MB (profile and scores)
       - Banks B/C: 2 GB each (document stream)
     Application implementation
     - GiDEL external memory IPs
     - Algorithm written in VHDL in Altera Quartus
     - GiDEL ProcWizard used to integrate the algorithm with its IPs
  9. Baseline Systems
     - An optimized multi-threaded reference implementation
       - Written in C++, compiled with g++ at optimization level -O3
     - Two platforms
       - System 1: Intel Core 2 Duo Mobile E8435, 3.06 GHz, 8 GB RAM
       - System 2: Intel Core i7-2600 (4 cores / 8 threads), 3.4 GHz, 16 GB RAM
     - The high-memory baselines are required to hold the document collection in memory
       - Reading the data from disk would dominate the performance
       - The collection is preloaded in memory
  10. Hardware Algorithm Description
      - Per-term lookup latency: profile storage in external memory (Bank A) ~20 clock cycles; on-chip Bloom filter ~1 clock cycle
      - The probability of an input term being a profile hit is extremely small, so a Bloom filter is used to discard misses before the full profile lookup
      - How do we extract parallelism out of the FPGA? Parallel term lookup in the Bloom filter (a sketch follows below)
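For illustration, here is a hedged software sketch of the Bloom-filter pre-check described on this slide: every incoming term is tested against a small filter first, and only possible hits go on to the slow external profile lookup. The bit-array size, number of hash functions, and hash mixing are illustrative assumptions, not details of the FPGA design.

```cpp
// Sketch of a Bloom-filter pre-check. A Bloom filter has no false negatives,
// so no real profile hit is ever discarded; false positives simply fall
// through to the exact profile lookup in external memory.
#include <bitset>
#include <cstdint>

constexpr std::size_t kBits = std::size_t{1} << 22;  // 4 Mi-bit filter (illustrative)
constexpr int kHashes = 4;                           // assumed number of hash functions

std::bitset<kBits> bloom;                            // one bit per bucket

// Cheap 64-bit mixer (splitmix64-style) used to derive k hash values.
std::uint64_t mix(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

void bloom_insert(std::uint32_t term) {
    std::uint64_t h = term;
    for (int i = 0; i < kHashes; ++i) {
        h = mix(h);
        bloom.set(h % kBits);
    }
}

bool bloom_maybe_in_profile(std::uint32_t term) {
    std::uint64_t h = term;
    for (int i = 0; i < kHashes; ++i) {
        h = mix(h);
        if (!bloom.test(h % kBits)) return false;    // definite miss: discard term
    }
    return true;                                     // possible hit: do full lookup
}

int main() {
    bloom_insert(12345);                             // a profile term
    bool hit  = bloom_maybe_in_profile(12345);       // always true
    bool miss = bloom_maybe_in_profile(99999);       // almost certainly false
    return (hit && !miss) ? 0 : 1;
}
```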
  11. Hardware Algorithm Description (continued)
      - Multi-bank Bloom filter to decrease congestion under multiple concurrent lookups
        - Eight terms are looked up in parallel in the Bloom filter
      - Individual banks are implemented in Altera M9K hard memory blocks on the FPGA
      - The current implementation uses only half of the 1280 M9K blocks to map 4 Mb of profile
      - To decrease/eliminate false positives, future Bloom filter designs include:
        - 8 Mb using all of the M9K blocks (130 MHz)
        - A 16 Mb profile using the M9K and M144K blocks together (100 MHz)
      (A banked-lookup sketch follows below.)
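To illustrate the banking idea in software terms, the sketch below partitions the filter's address space into banks and counts how many of a group of eight lookups would collide on the same bank in one cycle; in hardware each bank is a separately ported M9K block, so colliding lookups would have to be serialized. The bank count, bank size, and hashing here are assumptions for illustration only.

```cpp
// Banked address mapping: low hash bits pick the bank, the rest pick the bit
// inside that bank, so independent terms usually land in different banks.
#include <cstdint>
#include <vector>

constexpr unsigned kBanks       = 64;        // assumed number of banks
constexpr unsigned kBitsPerBank = 9 * 1024;  // one M9K-sized bank

struct BankedAddress { unsigned bank; unsigned bit; };

BankedAddress address_of(std::uint64_t hash) {
    return { static_cast<unsigned>(hash % kBanks),
             static_cast<unsigned>((hash / kBanks) % kBitsPerBank) };
}

// Counts how many of the lookups issued in one "cycle" collide on a bank.
unsigned bank_conflicts(const std::vector<std::uint64_t>& hashes) {
    unsigned used[kBanks] = {0}, conflicts = 0;
    for (auto h : hashes)
        if (used[address_of(h).bank]++) ++conflicts;
    return conflicts;
}

int main() {
    // Eight lookups; hashes 1 and 65 map to the same bank, so one conflict.
    std::vector<std::uint64_t> cycle = {1, 2, 3, 4, 5, 6, 7, 65};
    return bank_conflicts(cycle) == 1 ? 0 : 1;
}
```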
  12. Implemented Algorithm Utilization
      Utilization        Logic Elements (424 KLEs)   Memory / MRAMs (20 Mbits)
      Total              17,562                      4
      Algorithm          4,561 (1%)                  4 (22%)
  13. Synthetic Data Sets
      - Creating synthetic data sets
        - Real-world data is hard to access, e.g. patent collections are governed by licenses that restrict their use
        - Synthetic document collections statistically match real-world collections
      - Real-world document collections
        - Newspaper collection (TREC Aquaint)
        - Patent collections from the US Patent Office (USPTO) and the European Patent Office (EPO)
      - The Lemur Information Retrieval toolkit (www.lemurproject.org) is used to determine the rank-frequency distribution of all terms in each collection

      Collection   # Docs       Avg. Doc. Len.   Avg. Uniq. Terms
      Aquaint      1,033,461    437              169
      USPTO        1,406,200    1,718            353
      EPO          989,507      3,863            705
  14. Synthetic Data Sets (continued)
      Modeling the distribution of terms
      - Most natural-language documents follow a Zipfian rank-frequency distribution
      - We use Montemurro's extension to Zipf's law
      Modeling document lengths
      - Sampled from a truncated Gaussian
      - Verified using a χ² test at 95% confidence
      Synthetic documents of varying lengths
      - Terms in each document follow the fitted rank-frequency distribution
      - Documents are converted into the standard bag-of-words representation
      (A generator sketch follows below.)
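A hedged sketch of the generator described above: document lengths are drawn from a truncated Gaussian and term ranks from a rank-frequency distribution. For brevity the sketch uses a plain Zipf law via std::discrete_distribution, whereas the authors fit Montemurro's extension; all parameter values in the usage example are illustrative.

```cpp
// Synthetic document generator sketch: truncated-Gaussian lengths,
// Zipf-distributed term ranks, output as a bag of words (rank -> frequency).
#include <cmath>
#include <map>
#include <random>
#include <vector>

std::mt19937 rng{42};

// Truncated Gaussian: resample until the value falls in [lo, hi].
int sample_doc_length(double mean, double stddev, int lo, int hi) {
    std::normal_distribution<double> gauss(mean, stddev);
    double x;
    do { x = gauss(rng); } while (x < lo || x > hi);
    return static_cast<int>(x);
}

std::map<int, int> make_document(int vocab_size, double zipf_s, int length) {
    std::vector<double> weights(vocab_size);
    for (int r = 0; r < vocab_size; ++r)
        weights[r] = 1.0 / std::pow(r + 1, zipf_s);    // Zipf: p(rank) ∝ rank^-s
    std::discrete_distribution<int> zipf(weights.begin(), weights.end());

    std::map<int, int> bag;
    for (int i = 0; i < length; ++i) ++bag[zipf(rng)];  // draw `length` terms
    return bag;
}

int main() {
    // Illustrative USPTO-like document: mean length 1718, 100k-term vocabulary.
    int len  = sample_doc_length(1718, 600, 64, 8192);
    auto doc = make_document(100000, 1.0, len);
    return doc.empty() ? 1 : 0;
}
```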
  15. Experimental Parameters
      The performance of the algorithm on a system depends on:
      - The size of the collection
        - 256K documents of 4096 terms (patent-like collection)
        - 1M documents of 1024 terms (Aquaint-like collection)
      - The size of the profile
        - 4K, 16K, and 64K terms, similar to TREC Aquaint and EPO
      - The profile type
        - "Random": random documents are added until the desired profile size is reached; hit probability ~10^-5
        - "Selected": terms that occur in very few documents are selected (most representative of real-world usage); hit probability ~5·10^-4
  16. Performance Results
      Throughput (M terms/s):
                        256K docs of 4096 terms      1M docs of 1024 terms
      Profile           Sys1   Sys2   FPGA board     Sys1   Sys2   FPGA board
      Random, 4K        269    416    3090           292    1118   3090
      Random, 16K       245    324    3090           288    1014   3090
      Random, 64K       223    379    3090           253    945    3090
      Selected, 4K      118    232    3088           120    309    3088
      Selected, 16K     107    164    3088           94     350    3088
      Selected, 64K     82     136    3088           72     183    3088
      Empty, 4K         710    1564   3090           911    2005   3090
      Empty, 16K        711    1664   3090           844    1976   3090
      Empty, 64K        710    1338   3090           877    1952   3090
      Full, 4K          8      11     36             7      10     36
      Full, 16K         8      12     36             8      12     36
      Full, 64K         9      10     36             8      11     36

      Power consumption of the document-filtering application (W):
      # Threads         System 1   System 2   FPGA system
      0 (idle)          40         67         35
      1                 67         93         61.5
      2                 67         107        68
      4                 67         135        74.5
      8                 67*        141        81

      FPGA / System:
                        vs. System 1   vs. System 2
      Speedup           38x            23x
      Perf./Watt        31x            40x
      (A worked check of the headline ratios follows below.)
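The headline speedup and performance-per-watt figures appear to follow from the "Selected, 64K" row of the throughput table and the 8-thread row of the power table; the short check below reproduces them under that assumption.

```cpp
// Consistency check of the headline ratios, assuming they come from the
// "Selected, 64K" throughput row and the 8-thread power figures above.
#include <cstdio>

int main() {
    const double fpga_tput = 3088, sys1_tput = 82, sys2_tput = 136;   // M terms/s
    const double fpga_pwr  = 81,   sys1_pwr  = 67, sys2_pwr  = 141;   // W
    std::printf("speedup    vs Sys1: %.0fx, vs Sys2: %.0fx\n",
                fpga_tput / sys1_tput, fpga_tput / sys2_tput);        // ~38x, ~23x
    std::printf("perf/watt  vs Sys1: %.0fx, vs Sys2: %.0fx\n",
                (fpga_tput / fpga_pwr) / (sys1_tput / sys1_pwr),      // ~31x
                (fpga_tput / fpga_pwr) / (sys2_tput / sys2_pwr));     // ~40x
    return 0;
}
```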
  17. Performance versus Cost
      - We used the cost model from Shah and Patel's work

      Cost breakdown               CPU           CPU+FPGA
      Space                        21 M$/y
      Power & cooling              52 M$/y       29 M$/y
      IT infrastructure            59 M$/y       248 M$/y
      Total                        132 M$/y      299 M$/y
      Performance (single system)  136 Mops/s    3090 Mops/s
      Performance/Cost             32 Mops/$     330 Mops/$

      - Considering device-demand economics, Performance/Cost is calculated for various FPGA costs: $2000, $4000, and $8000
      - (Figures: Performance/Cost versus FPGA system cost; effect of various speedup/gain factors on performance gains)
      (A consistency check of the ~10x performance-per-cost claim follows below.)
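The absolute Mops/$ figures use a scaling not shown on the slide, but the roughly 10x performance-per-cost improvement follows directly from the performance and total-cost rows; the snippet below checks that ratio.

```cpp
// Quick check of the ~10x performance-per-cost claim using the table totals.
#include <cstdio>

int main() {
    const double cpu_perf  = 136,  cpu_cost  = 132;   // Mops/s, M$/y
    const double fpga_perf = 3090, fpga_cost = 299;   // Mops/s, M$/y
    const double gain = (fpga_perf / fpga_cost) / (cpu_perf / cpu_cost);
    std::printf("performance-per-cost gain: %.1fx\n", gain);  // ~10x
    return 0;
}
```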
  18. Alternatives and Future Work
      - ASIC Bloom filter
        - Frequency of operation and its effect on power consumption
      - GPGPU
        - Frequency of operation, size of internal memory, and I/O bottlenecks
      - Decrease the congestion probability for multi-term access to the Bloom filter by increasing the number of banks
      - In-depth characterization of other diverse workloads
      - Explore low-power host systems, such as ARM, Atom, etc.
      - Implement the hardware algorithm using high-level languages such as Impulse-C, Catapult-C, the MORA framework, and OpenCL
  19. Conclusion
      - The growing demands of data-center computing and "green computing" motivate designers to build high-performance systems with improved energy efficiency
      - We presented a new FPGA-accelerated system design for information retrieval / unstructured search
      - The algorithm is implemented on the GiDEL PROCStar IV board (Altera Stratix IV 530 FPGAs), achieving 800 M terms/s of throughput at a power consumption of 6 W
      - Comparison of the FPGA system with the baseline systems:
        - Speedup of 23x to 38x
        - Energy-efficiency improvement of 31x to 40x
        - Performance-per-cost improvement of 10x
  20. Thank you. Questions?
