What should be done to IR algorithms to meet current, and possible future, hardware trends.

Hardware Developments and Algorithm Design:
“What should be done to IR algorithms to meet
current, and possible future, hardware trends?”
Simon Jonassen
Department of Computer and Information Science
Norwegian University of Science and Technology

This talk is not about…
Topics not covered, but highly related:
–  Query processing on specialized hardware, including GPU.
–  Succinct indexes, suffix arrays, wavelet trees, etc.
–  Map-Reduce and machine learning.
–  Green and Cloud computing.
–  Distributed query processing.
–  Shared memory and NUMA.
–  Scalability and availability.
–  Solid-state drives.
–  Virtualization.
–  …

Information Retrieval
Information Retrieval (IR): representing, searching and manipulating large collections of
electronic and human-language data.
Scope for this talk:
•  Indexed search in document collections.
Other examples and applications:
•  Clustering and categorization.
•  Information extraction.
•  Question answering.
•  Multimedia retrieval.
•  Real-time search.
•  Etc.
[Figure: search engine overview with the labels Users, Queries, Results, Search Engine, Index and Documents.]

Search in inverted indexes
Posting lists:
•  Contain document IDs and frequencies.
•  May also contain scores, context ID, positions and other information.
•  Ordered by document, frequency or impact.
Query processing:
•  Term- vs document-at-a-time.
•  Boolean vs score-based evaluation.
•  Pruning.
Other alternatives:
•  Bitmaps, trees, etc.
Other matters:
•  Preprocessing:
–  E.g., stemming.
•  Two-phase search.
•  Postprocessing:
–  E.g., snippet generation.
•  Static pruning, result cache, etc.
©apple.com
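The posting-list layout and term-at-a-time processing listed above can be sketched as follows. This is a minimal illustration, not the design of any particular engine: the class name `InvertedIndexSketch`, the toy score (a plain sum of term frequencies) and the whitespace/punctuation tokenizer are assumptions made here, and documents are assumed to be added in increasing docID order so that the lists stay document-ordered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, minimal inverted index: term -> list of {docId, frequency}.
final class InvertedIndexSketch {
    private final Map<String, List<int[]>> postings = new HashMap<>();

    // Assumes documents are added in increasing docId order,
    // which keeps every posting list document-ordered.
    void add(int docId, String text) {
        Map<String, Integer> tf = new TreeMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(new int[] {docId, e.getValue()});
        }
    }

    // Term-at-a-time: each posting list is consumed completely before the
    // next one; partial scores live in per-document accumulators.
    // The score is a toy one (sum of term frequencies).
    Map<Integer, Integer> scoreTermAtATime(String... queryTerms) {
        Map<Integer, Integer> accumulators = new TreeMap<>();
        for (String term : queryTerms) {
            for (int[] p : postings.getOrDefault(term, Collections.emptyList())) {
                accumulators.merge(p[0], p[1], Integer::sum);
            }
        }
        return accumulators;
    }
}
```

A document-at-a-time variant would instead advance all posting lists of the query in parallel and finish one document before moving to the next, which removes the accumulator map at the cost of more complex iteration.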

Recent hardware trends
seen from a naïve IR perspective
               2002        2012
Processor      2GHz--      4x2x3GHz++
Main Memory    4x512MB-    4x8GB+
Disk           80GB-       512GB
(Figure annotations: “fast!”, “not so fast =(”, “fast!”, “super fast!!!”; part of the figure is marked “Scope for this talk.”)

CPU: From GHz to multi-core
Moore’s Law:
•  ~ the number of transistors on
an IC doubles every two years.
–  Less space, more complexity.
–  Shorter gates, higher clock rate.
Strategy of the 80s and 90s:
•  Add more complexity!
•  Increase the clock rate!
Pollack’s Rule:
•  The performance increase is roughly
proportional to the square root of the
increase in complexity. [Borkar 2007]
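As a rough worked example of Pollack's Rule (the symbols P for performance and C for complexity are introduced here only for illustration):

```latex
P \propto \sqrt{C}
\quad\Longrightarrow\quad
\frac{P_{\text{new}}}{P_{\text{old}}} \approx \sqrt{\frac{C_{\text{new}}}{C_{\text{old}}}},
\qquad \text{e.g. } \frac{C_{\text{new}}}{C_{\text{old}}} = 2
\;\Rightarrow\; \frac{P_{\text{new}}}{P_{\text{old}}} \approx \sqrt{2} \approx 1.41 .
```

Under this approximation, doubling the complexity of a single core buys roughly 41% more performance, whereas spending the same transistors on a second core of the original complexity could, in the ideal case of perfectly parallel work, double the throughput.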
The Power Wall:
•  Increasing clock rate and transistor
current leakage lead to excess power
consumption, while RC delays in signal
transmission grow as feature sizes
shrink. [Borkar et al. 2005]

Instruction-level parallelism
– “It’s not about GHz’s, but how you spend them!”
Pipeline length: 31 (P4) vs 14 stages (i7).
Multiple execution units and
out-of-order execution:
•  i7: 2 load/store address, 1 store data,
and 3 computational operations can
be executed simultaneously.
Dependences and hazards:
•  Control: branches.*
•  Data: output dependence,
antidependence (naming).
•  Structural: access to the same
physical unit of the processor.
Simultaneous multi-threading (“Hyper-threading”):
•  Duplicate certain sections of a processor (registers etc., but not execution units).
•  Reduces the impact of cache miss, branch misprediction and data dependency stalls.
•  Drawback: logical processors are likely to be treated just like physical processors, although they share execution units.
(*[Dean 2010]: a branch misprediction costs ~5ns)

Computer memory hierarchy
source: [Jahre 2010]

(1ms = 1 000 µs = 1 000 000 ns; 1ns = 3 clock cycles at 3GHz or ~30cm of light travel)

Level                  Latency   Size       Technology   Managed by
Registers              <<1ns     ?1KB       CMOS         Compiler
L1 Cache (on-chip)     <1ns      4x32KBx2   SRAM         Hardware
L2 Cache (off-chip)    2.5ns     4x256KB    SRAM         Hardware
L3 Cache (shared)      5ns       8MB        SRAM         Hardware
Main Memory            50ns      4x8GB+     DRAM         OS
Solid-State Drive      <100µs    512GB-     NAND Flash   Hardware/OS/User
Hard-Disk Drive        3-12ms    1TB+       Magnetic     Hardware/OS/User
(Intel Core i7-2600K)

Computer memory hierarchy
L1-L3 cache and performance implications
Some of the main challenges of CMP:
•  Cache coherence
•  Cache conflicts
•  Cache affinity
Other important cache-related issues:
•  Data size and cache line utilization.
–  i7 has 64B cache lines.
•  Data alignment and padding.
•  Cache associativity and replacement.
Additional memory issues:
•  Random memory accesses over a large address
range may be slowed down further by TLB misses.
•  Some of the virtual memory pages can also
be swapped out to disk.
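The effect of cache line utilization can be sketched with two traversal orders over the same data. This is an illustrative example only: the class name `TraversalSketch` is made up here, the matrix is assumed to be rectangular, and the actual timing difference depends on the hardware, the JVM and the matrix size. With 64B cache lines, a row-major scan touches 16 consecutive 4-byte ints per line, while a column-major scan jumps to a different row array on every access.

```java
// Hypothetical sketch: both methods compute the same sum, but with
// different memory access patterns.
final class TraversalSketch {
    // Row-major: consecutive elements of a row share cache lines.
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                sum += m[r][c];
            }
        }
        return sum;
    }

    // Column-major: every access goes to another row array, so only a
    // small part of each fetched cache line is used before eviction.
    // Assumes a rectangular matrix.
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        int cols = m.length == 0 ? 0 : m[0].length;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < m.length; r++) {
                sum += m[r][c];
            }
        }
        return sum;
    }
}
```

Timing the two methods on a matrix much larger than the L3 cache is one simple way to observe the difference on a given machine.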
[Figure: four cores running Thread1–Thread4, each with a 32KB L1D and a 256KB L2 cache, sharing an 8MB L3 cache and main memory.]

Writing efficient IR algorithms
– “The troubles with radix sort are in implementation, not in conception.” (McIlroy et al. 1993)
In-Place MSB Radix Sort:
[Birkeland 2008, Gorset 2011]
•  Starting from the most significant byte.
•  For each of the 256 byte values: count
the cardinality and initialize the pointers.
•  Apply Counting-Sort (shown on the right).
•  Recurse on the next less significant byte,
down to the least significant byte; use
insertion sort if the range is small.
Complexity:
•  O(kN), where k = 4 for 32-bit integers.
•  Has also been shown to be 3x faster than the native Java/C++
QuickSort implementation on large 32-bit integer arrays [Gorset 2011].
Benefits from:
•  Memory usage.
•  Comparing groups of bits at once.
•  Swaps instead of branches.
code: https://github.com/gorset/radix
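The steps above can be sketched as follows. This is an independent illustration written for this text, not code taken from the linked repository: the class name `RadixSortSketch` and the insertion-sort cut-off of 32 elements are assumptions, and the sign bit is flipped when extracting a byte so that negative values sort before positive ones.

```java
// Hypothetical sketch of an in-place MSB radix sort for 32-bit integers.
final class RadixSortSketch {
    private static final int INSERTION_THRESHOLD = 32; // assumed cut-off

    static void sort(int[] a) {
        sort(a, 0, a.length, 24); // start at the most significant byte
    }

    // Byte at the given shift; the sign bit is flipped to get signed order.
    private static int digit(int v, int shift) {
        return ((v ^ Integer.MIN_VALUE) >>> shift) & 0xFF;
    }

    private static void sort(int[] a, int from, int to, int shift) {
        if (to - from <= INSERTION_THRESHOLD) {
            insertionSort(a, from, to);
            return;
        }
        // Count the cardinality of each of the 256 byte values.
        int[] count = new int[256];
        for (int i = from; i < to; i++) count[digit(a[i], shift)]++;
        // Initialize the bucket pointers.
        int[] next = new int[256];
        int[] end = new int[256];
        int pos = from;
        for (int b = 0; b < 256; b++) {
            next[b] = pos;
            pos += count[b];
            end[b] = pos;
        }
        // Permute in place: swap each element into its bucket.
        for (int b = 0; b < 256; b++) {
            while (next[b] < end[b]) {
                int v = a[next[b]];
                int d = digit(v, shift);
                if (d == b) {
                    next[b]++;
                } else {
                    int t = next[d]++;
                    a[next[b]] = a[t];
                    a[t] = v;
                }
            }
        }
        // Recurse on the next less significant byte within each bucket.
        if (shift > 0) {
            int lo = from;
            for (int b = 0; b < 256; b++) {
                if (count[b] > 1) sort(a, lo, lo + count[b], shift - 8);
                lo += count[b];
            }
        }
    }

    private static void insertionSort(int[] a, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            int v = a[i];
            int j = i - 1;
            while (j >= from && a[j] > v) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }
}
```

The sketch allocates its counter arrays on every recursive call for readability; a tuned implementation would likely reuse them.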

Cache- and processor-efficient query processing
Modern compression methods for IR:
•  BP, S9/S16, PFOR, NewPFD, etc.
•  Fast, superscalar and branch-free.
•  Loops/methods can be generated by a script.
While compression works on chunks of
postings, query processing itself remains
posting-at-a-time.
What about:
•  Branches and loops?
•  Cache utilization?
•  ILP utilization?
•  Candidates and results?
Interesting alternatives and trade-offs:
•  Impact-ordered vs document-ordered lists.
•  Term- vs document-at-a-time processing.
•  Posting list iteration vs random access.
•  Mixed vs two-phase search.
•  Bitmaps vs posting lists.
code: https://github.com/javasoze/kamikaze
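What "branch-free" and "generated by a script" can mean for bit-packing (BP) is sketched below for a single bit width. This is not taken from the linked library: the class name `BitPackingSketch` and the fixed width of 4 bits are assumptions for illustration. A real codec would have one such unrolled routine per bit width, which is why generating them with a script is attractive.

```java
// Hypothetical sketch: 8 values of 4 bits each packed into one 32-bit word.
final class BitPackingSketch {
    // Packs in[offset .. offset+7]; every value is assumed to be < 16.
    static int pack4(int[] in, int offset) {
        int w = 0;
        for (int i = 0; i < 8; i++) {
            w |= (in[offset + i] & 0xF) << (4 * i);
        }
        return w;
    }

    // Fully unrolled decoder without loops or branches. The eight
    // statements are independent of each other, so a superscalar
    // processor can execute several of them at the same time.
    static void unpack4(int w, int[] out, int offset) {
        out[offset]     =  w         & 0xF;
        out[offset + 1] = (w >>> 4)  & 0xF;
        out[offset + 2] = (w >>> 8)  & 0xF;
        out[offset + 3] = (w >>> 12) & 0xF;
        out[offset + 4] = (w >>> 16) & 0xF;
        out[offset + 5] = (w >>> 20) & 0xF;
        out[offset + 6] = (w >>> 24) & 0xF;
        out[offset + 7] = (w >>> 28) & 0xF;
    }
}
```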

Some experience from Databases
source: [Zukowski 2009]
Vector-at-a-time execution [Zukowski 2009]
provides a good trade-off between tuple and
column-at-a-time execution:
•  Less time spent in interpretation logic.
•  “SIMDization” and data alignment.
•  Parallel memory access (prefetching).
•  In-cache execution.
Loop compilation can be another
alternative, especially if the application
already has a tuple-at-a-time API.
•  [Sompolski et al. 2011] show that
plain loop compilation can be inferior
to vectorization and motivate further
combination of the two techniques.
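The idea of vector-at-a-time execution can be sketched with a tiny two-operator pipeline ("sum one column where another column is below a bound"). This is a simplified illustration and not the engine described in [Zukowski 2009]: the class name `VectorAtATimeSketch`, the vector size of 1024 values and the two primitives are assumptions made here.

```java
// Hypothetical sketch: operators work on small vectors of values, so the
// per-call overhead is amortized and the intermediate data stays in cache.
final class VectorAtATimeSketch {
    static final int VECTOR_SIZE = 1024; // assumed; small enough to stay in cache

    // Primitive 1: writes the positions of values < bound into sel and
    // returns how many were selected. The position is always written and
    // the counter advances conditionally, which a compiler may turn into
    // branch-free code.
    static int selectLessThan(int[] col, int from, int n, int bound, int[] sel) {
        int k = 0;
        for (int i = 0; i < n; i++) {
            sel[k] = from + i;
            k += (col[from + i] < bound) ? 1 : 0;
        }
        return k;
    }

    // Primitive 2: sums the selected positions of another column.
    static long sumSelected(int[] col, int[] sel, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += col[sel[i]];
        return s;
    }

    // Driver: feeds one vector at a time through both primitives.
    // Assumes both columns have the same length.
    static long sumWhereLessThan(int[] filterCol, int[] valueCol, int bound) {
        int[] sel = new int[VECTOR_SIZE];
        long total = 0;
        for (int from = 0; from < filterCol.length; from += VECTOR_SIZE) {
            int n = Math.min(VECTOR_SIZE, filterCol.length - from);
            int k = selectLessThan(filterCol, from, n, bound, sel);
            total += sumSelected(valueCol, sel, k);
        }
        return total;
    }
}
```

A tuple-at-a-time engine would call both operators once per value, and a column-at-a-time engine would materialize the full selection before summing; the vector size trades interpretation overhead against cache footprint.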

Concurrent query processing
– In-memory indexes and “1024 core CPU”s: What to expect?
Inter-query vs intra-query concurrency:
•  Inter:
–  Each thread works with a different query.
–  Improves throughput, but latency may degrade.
•  Intra:
–  A query is processed by multiple threads.
–  Improves latency, but throughput may degrade.
Inter-query concurrency and memory access:
•  [Strohman and Croft 2007]:
–  Top-k query processing with impact-ordered lists.
–  Observed that shared memory bandwidth
becomes a bottleneck with four processors.
•  [Tatikonda et al. 2009]:
–  Intersection with document-ordered lists.
–  Observed no cache or memory bandwidth problems.
•  [Qiao et al. 2008]:
–  DBMS query processing with a very large table.
–  Demonstrated that when all cores are used,
main memory bandwidth becomes a bottleneck.
source: [Qiao et al. 2008]
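Inter-query concurrency, where each thread works on a different query, can be sketched with a fixed thread pool. This is a minimal illustration under stated assumptions: the class name `InterQuerySketch` is made up, and `evaluate` is only a stand-in that counts documents containing the query string, not a real query processor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one task per query, all tasks share the read-only data.
final class InterQuerySketch {
    // Stand-in for real query evaluation (assumption): counts the
    // documents that contain the query string.
    static int evaluate(List<String> docs, String query) {
        int hits = 0;
        for (String d : docs) {
            if (d.contains(query)) hits++;
        }
        return hits;
    }

    // Runs every query as its own task; results keep the query order.
    static List<Integer> runAll(List<String> docs, List<String> queries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String q : queries) {
                futures.add(pool.submit(() -> evaluate(docs, q)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because all threads scan the same in-memory data, this setup is exactly where the shared cache and memory bandwidth effects reported in the studies above would show up.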

Concurrent query processing
– In-memory indexes and “1024 core CPU”s: What to expect?
Intra-query concurrency and memory access:
•  [Lilleengen 2010]:
–  CPU simulator for Vespa Search Engine Platform (Yahoo! Trondheim).
–  Evaluated intra-query concurrency, its scalability, impact on the
processor caches and performance under various workloads.
Other ideas:
•  [Qiao et al. 2008] studied efficient memory scan sharing
for multi-core CPUs in databases. Suggested solution:
–  Each core gets a batch of queries, restricted by the
estimated working set size.
–  Queries in each batch share memory scans, i.e., a
block of data is fed through all queries in the batch.
–  Note: queries operate on a single but very large table.
•  Batch optimizations similar to those presented by
[Ding et al. 2011] can be interesting at the sub-query level.
–  Query reordering.
–  Reusing partial results.
source: [Qiao et al. 2008]
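The scan-sharing idea (a block of data is fed through all queries in a batch before the next block is read) can be sketched as follows. This is a simplified illustration, not the method of [Qiao et al. 2008]: the class name `ScanSharingSketch`, the block size of 4096 values and the modelling of queries as simple predicates with a match count are assumptions.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: a batch of queries shares one scan over a table.
final class ScanSharingSketch {
    static final int BLOCK = 4096; // assumed; 4096 ints = 16KB

    // Returns the number of matching values for each query in the batch.
    static long[] sharedScan(int[] table, IntPredicate[] batch) {
        long[] counts = new long[batch.length];
        for (int from = 0; from < table.length; from += BLOCK) {
            int to = Math.min(table.length, from + BLOCK);
            // The block is brought into cache once and then reused
            // by every query in the batch.
            for (int q = 0; q < batch.length; q++) {
                IntPredicate p = batch[q];
                long c = 0;
                for (int i = from; i < to; i++) {
                    if (p.test(table[i])) c++;
                }
                counts[q] += c;
            }
        }
        return counts;
    }
}
```

Without sharing, each query would stream the whole table from main memory on its own, so the memory traffic would grow with the number of queries.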

Data-level parallelism
Single-instruction multiple-data (SIMD)
Driven by the game industry, SIMD extensions are
very common in modern desktop computers.
•  Intel’s implementations: MMX (1996), SSE (1999),
SSE4 (2006), AVX (2011), etc.
Vector size:
•  SSE2: 128 bits containing 2 doubles, 2 longs,
4 ints, 8 shorts or 16 chars.
•  AVX: 256 bits.
Operations:
•  Data movement, arithmetic, comparison, shuffle
(broadcast, swap, rotate), type conversion,
cache and memory management, etc.
Drawbacks:
•  Portability and compatibility.
•  Unaligned memory access is very expensive.
•  SIMD restricts how operations should be performed.
©realworldtech.com

Data-level parallelism
Single-instruction multiple-data (SIMD)
Example, SIMD-based optimizations/algorithms:
•  [Chhugani et al. 2008] – integer sorting (cache-aware merge sort).
•  [Lemire and Boytsov 2012] – integer compression (SIMD-BP128).
•  [Ladra et al. 2012] – rank, select in bit-sequences, Horspool algorithm.
Fast intersection of sorted 32-bit integer lists:
•  Two vectors can be intersected either by computing a mask of common (32b) elements and
rotating one of the vectors or by using PCMPESTRM (obtains a mask of common 16b elements).
•  PCMPESTRM variant requires a custom data
structure when integers are larger than 2^16.
•  Requires more comparisons than a simple scalar
intersection, but runs much faster
(PCMPESTRM: ~5.4x speedup with one
million elements in each list and selectivity
around 30%).
•  [Schlegel et al. 2011, Katsov 2012].
© [Katsov 2012]
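For reference, the "simple scalar intersection" that the SIMD variants are compared against can be sketched as a merge of two sorted lists. This is a generic baseline written for illustration (the class name `ScalarIntersectSketch` is made up), not code from [Schlegel et al. 2011] or [Katsov 2012]; both inputs are assumed to be sorted and free of duplicates.

```java
import java.util.Arrays;

// Hypothetical sketch of the scalar baseline: one comparison step per
// loop iteration, with data-dependent branches.
final class ScalarIntersectSketch {
    // Both inputs must be sorted in increasing order without duplicates.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            int x = a[i], y = b[j];
            if (x < y) {
                i++;
            } else if (x > y) {
                j++;
            } else {
                out[k++] = x;
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, k);
    }
}
```

The SIMD variants compare every element of one vector with every element of another in a few instructions, which is why they perform more comparisons in total and can still be faster.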

Scale matters!
– “What would Google do?”
© Venray Technologies
(based on Jeff Dean’s talks [Dean 2009, Dean 2010]):
•  “Don’t design to scale infinitely:
~5x–50x growth good to consider,
>100x probably requires rethink and rewrite.”
•  Buying 2x more machines, rather than buying better machines.
•  Low-end storage and networking hardware; in-memory data.
•  Single machine performance is not important.
•  Key focus: distribution and availability.
•  Interference between your (multiple) jobs and
random cron tasks.
Facebook is rumored to be testing out low-
power ARM processors in their data centers.
•  Challenges and opportunities of using a large
number of slower but more energy-efficient
nodes coupled with low-power storage have
been discussed by [Vasudevan et al. 2011].

One more thing… Java!
Bytecode and Just-in-time (JIT) compilation:
•  Java bytecode is halfway between human-readable source code and machine code.
•  Bytecode can be interpreted by the JVM or compiled to machine code at runtime.
•  JIT/HotSpot tricks: inlining, dead-code elimination, optimization/deoptimization.
•  Intrinsics: some functions can be replaced by machine instructions (e.g., popcount, max/min).
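A small example of where the popcount intrinsic matters for IR is counting the documents common to two bitmaps. This is an illustrative sketch with a made-up class name (`BitmapSketch`); whether `Long.bitCount` is actually compiled to a single population-count instruction depends on the JVM and on the CPU.

```java
// Hypothetical sketch: document sets as bitmaps, one bit per document ID.
final class BitmapSketch {
    // Builds a bitmap that can hold document IDs 0..maxDocId.
    static long[] toBitmap(int[] docIds, int maxDocId) {
        long[] bits = new long[(maxDocId >>> 6) + 1];
        for (int d : docIds) {
            bits[d >>> 6] |= 1L << (d & 63);
        }
        return bits;
    }

    // Number of documents present in both bitmaps. Long.bitCount is a
    // candidate for replacement by a machine instruction by the JIT.
    static int intersectionCount(long[] a, long[] b) {
        int n = Math.min(a.length, b.length);
        int count = 0;
        for (int i = 0; i < n; i++) {
            count += Long.bitCount(a[i] & b[i]);
        }
        return count;
    }
}
```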
Concurrent processing in Java:
•  Powerful and flexible features (e.g., thread pools,
concurrent data structures, Fork/Join).
•  To be efficient, needs a careful understanding of
synchronization and the Java memory model.
•  Does not provide any affinity or low-level thread control.
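The Fork/Join feature mentioned above can be sketched with a divide-and-conquer sum over an array, the kind of splitting that intra-query concurrency would apply to a long posting list. This is a minimal illustration: the class name `ForkJoinSketch`, the task name `SumTask` and the sequential cut-off of 10 000 elements are assumptions, and the summed values merely stand in for per-posting work.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: one logical operation split across worker threads.
final class ForkJoinSketch {
    static final int SEQUENTIAL_CUTOFF = 10_000; // assumed

    static final class SumTask extends RecursiveTask<Long> {
        private final int[] values;
        private final int from;
        private final int to;

        SumTask(int[] values, int from, int to) {
            this.values = values;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= SEQUENTIAL_CUTOFF) {
                long s = 0;
                for (int i = from; i < to; i++) s += values[i];
                return s;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(values, from, mid);
            SumTask right = new SumTask(values, mid, to);
            left.fork();                          // run the left half asynchronously
            return right.compute() + left.join(); // do the right half in this thread
        }
    }

    static long parallelSum(int[] values) {
        return ForkJoinPool.commonPool().invoke(new SumTask(values, 0, values.length));
    }
}
```

The tasks share nothing but the read-only input array, so no explicit synchronization is needed beyond `join`.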
Garbage collection (GC) and memory management:
•  Multiple areas/generations: eden and survivor (young),
tenured (old), permgen (internal).
•  Minor (young generation) vs major (old generation) GC.
•  Low-pause vs high-throughput GC algorithms.
•  Escape analysis.

One more thing… Java!
Efficiency tips:
•  Data:
–  Avoid big class hierarchies. Write simple and, when applicable, immutable objects.
–  Avoid creating unnecessary objects, use primitives.
–  Avoid frequent allocation of very large arrays.
•  Methods:
–  Write compact, clean, reusable and, when applicable, static methods.
•  Concurrency:
–  Divide and conquer!
–  Minimize synchronization and resource sharing between threads.
•  Development:
–  Correctness over performance.
–  Use existing collections and libraries.
–  Learn to profile, version control and unit-test your code.

Conclusions
•  Processors are getting faster and more advanced. However, these improvements are
becoming more challenging to harness by memory-intensive applications, such as IR.
•  Future IR algorithms should pay more attention to the CPU and cache-related issues.
•  Understanding the hardware and programming language principles and their
interaction is essential for turning a conceptual advantage into actual performance
in an implementation.
•  Certain optimizations and performance improvements can be limited to the chosen
architecture and/or technology. For large-scale and heterogeneous IR systems such
optimizations may be less beneficial, economically infeasible or even impossible.
•  Low-power RISC processors are capable of delivering higher performance-per-watt
as well as performance-per-$ when compared to the high-end/desktop processors.
However, it remains unclear whether they can be more advantageous for efficient IR
and which challenges they may introduce.

References:
1.  Birkeland: “Searching large data volumes with MISD processing”, PhD Thesis, NTNU 2008.
2.  Borkar: “Thousand core chips: a technology perspective”, In Proc. DAC 2007, pp. 746-749.
3.  Borkar et al.: “Platform 2015: Intel® Processor and Platform Evolution for the Next Decade”, Intel 2005.
4.  Bosworth: “The Power Wall: Why aren’t modern CPUs faster? What happened in the late 1990’s?”, 2011.
5.  Büttcher et al.: “Information Retrieval: Implementing and Evaluating Search Engines”, 2010.
6.  Chhugani et al.: “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture”, In Proc. VLDB 2008, pp. 1313-1324.
7.  Clark: “Facebook stretches ARM chips in datacentre tests”, ZDNet news article, 24th September 2012.
8.  Dean: “Challenges in Building Large-Scale Information Retrieval Systems”, keynote at WSDM 2009.
9.  Dean: “Building Software Systems at Google and Lessons Learned”, talk at Stanford University 2010.
10.  Ding et al.: “Batch Query Processing for Web Search Engines”, In Proc. WSDM 2011, pp. 137-146.
11.  Evans and Verburg: “Well Grounded Java Developer: Vital Techniques of Java 7 and polyglot programming”, 2013.
12.  Gorset: http://erik.gorset.no/2011/04/radix-sort-is-faster-than-quicksort.html, 2011.
13.  Hennessy and Patterson: “Computer Architecture: A Quantitative Approach”, 3rd ed., 2003.
14.  Jahre: “Managing Shared Resources in Chip Multiprocessor Memory Systems”, PhD Thesis, NTNU 2010.
15.  Katsov: http://scalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/, 2012.
16.  Ladra et al.: “Exploiting SIMD instructions in current processors to improve classical string algorithms”, In Proc. ADBIS 2012, pp. 254-267.
17.  Lemire and Boytsov: “Decoding billions of integers per second through vectorization”, CoRR abs/1209.2137, 2012.
18.  Lilleengen: “Parallel query evaluation on multicore architectures”, Master Thesis, NTNU 2010.
19.  Qiao et al.: “Main-Memory Scan Sharing For Multi-Core CPUs”, PVLDB 2008:1(1), pp. 610-621.
20.  Schlegel et al.: “Fast Sorted-Set Intersection using SIMD Instructions”, In ADMS Workshop, VLDB 2011.
21.  Strohman and Croft: “Efficient Document Retrieval in Main Memory”, In Proc. SIGIR 2007, pp. 175-182.
22.  Tatikonda et al.: “On efficient posting list intersection with multicore processors”, In Proc. SIGIR 2009, pp. 738-739.
23.  Vasudevan et al.: “Challenges and Opportunities for Efficient Computing with FAWN”, In Proc. SIGOPS 2011, pp. 34-44.
24.  Zukowski: “Balancing Vectorized Query Execution with Bandwidth-Optimized Storage”, PhD Thesis, University of Amsterdam 2009.

General Purpose GPU
Graphics Processing Unit:
•  Large number of stream processors
(Nvidia GeForce GTX 690: 2x1536).
•  Supports SIMD/MIMD and scatter.
•  Dedicated on-board memory (4GB).
Query processing on GPU:
•  [Ding et al. 2009, Hovland 2009, etc.]
Drawbacks:
•  Mispredicted branches are very expensive.
•  Requires transferring data to and from the graphics card.
•  Might be economically infeasible to write software specifically for GPU.
Integrated graphics sub-systems on modern CPUs:
•  Major advantage: proximity to CPU.
•  Major drawback: use of the computer system’s RAM =(
© PGI Insider

Solid-State Drives
Based on NAND floating gate transistors.
Each disk is a redundant array of NAND.
Cannot delete/overwrite individual pages.
•  Consequence: frequent writes are
problematic and write performance
degrades with aging.
•  Solutions: 128MB+ on-board memory,
background garbage collection, trimming,
overprovisioning.
•  Other (SandForce DuraWhite): compression, deduplication and differencing.
Lifetime is limited due to writes, but a modern SSD should last as long as an HDD.
Single-level vs multi-level cell:
•  SLC is more reliable, but expensive
•  MLC may have larger capacity/cheaper, but is less reliable.

Solid-State vs Hard-Disk Drives
SSD was found to improve the performance of
several applications such as spatial query
processing with R-trees ([Emrich et al. 2010]).
January 2013: A 3TB HDD and 32GB DRAM
cost less than a 512GB SSD.
SSD may be considered infeasible for
large data centers
•  see the discussion in the paper by
[Ananthanarayanan et al. 2011].
SSD and HDD can be combined in the
same system (e.g., [Risvik et al. 2013]).
SSD and HDD require different trade-offs.
Some other numbers [Dean 2010]:
Send 2KB over 1 Gbps network         20µs
Round trip within same datacenter    500µs
Send packet CA->Netherlands->CA      150ms

        Access time   Bandwidth   Price        Capacity
HDD     3-12ms        <140MB/s    <0.05$/GB    1TB+
SSD     <100µs        <600MB/s    0.5-1$/GB    512GB-
DRAM    <50ns         <21GB/s     5-10$/GB     32GB-
References:
1.  Ananthanarayanan et al.: “Disk-Locality in Datacenter Computing Considered Irrelevant”, In Proc. HotOS Workshop at USENIX 2011.
2.  Chhugani et al.: “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture”, In Proc. VLDB 2008, pp. 1313-1324.
3.  Emrich et al.: “On the Impact of Flash SSDs on Spatial Indexing”, In Proc. DaMoN 2010, pp. 3-8.
4.  Hovland: “Throughput Computing on Future GPUs”, Master Thesis, NTNU 2009.
5.  Hutchinson: “Solid-state revolution: in-depth on how SSDs really work”, Ars Technica, 2012.
6.  Risvik et al.: “Maguro, a system for indexing and searching over very large text collections”, In Proc. WSDM 2013, To appear.
