Hardware Developments and Algorithm Design:
“What should be done to IR algorithms to meet
current, and possible future, hardware trends?”
Simon Jonassen
Department of Computer and Information Science
Norwegian University of Science and Technology
This talk is not about….
Uncovered, but highly related topics:
–  Query processing on specialized hardware, including GPU.
–  Succinct indexes, suffix arrays, wavelet trees, etc.
–  Map-Reduce and machine learning.
–  Green and Cloud computing.
–  Distributed query processing.
–  Shared memory and NUMA.
–  Scalability and availability.
–  Solid-state drives.
–  Virtualization.
–  …
Information Retrieval
Information Retrieval (IR): representing, searching and manipulating large collections of
electronic and human-language data.
Scope for this talk:
•  Indexed search in document collections.
Other examples and applications:
•  Clustering and categorization.
•  Information extraction.
•  Question answering.
•  Multimedia retrieval.
•  Real-time search.
•  Etc.
Search Engine
Search in inverted indexes
Posting lists:
•  Contain document IDs and frequencies.
•  May also contain scores, context ID, positions and other information.
•  Ordered by document, frequency or impact.
Query processing:
•  Term- vs document-at-a-time.
•  Boolean vs score-based evaluation.
•  Pruning.
Other alternatives:
•  Bitmaps, trees, etc.
Other matters:
•  Preprocessing:
–  E.g., stemming.
•  Two-phase search.
•  Postprocessing:
–  E.g., snippet generation.
•  Static pruning, result cache, etc.
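As an illustration of Boolean, document-at-a-time evaluation over document-ordered posting lists, the following is a minimal sketch and not any particular engine's implementation. It assumes each posting list is a plain `int[]` of strictly increasing document IDs; the class and method names are made up for this example.

```java
import java.util.Arrays;

final class PostingIntersection {
    /** Intersects two posting lists sorted by strictly increasing document ID. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                 // document only in a
            } else if (a[i] > b[j]) {
                j++;                 // document only in b
            } else {
                out[n++] = a[i];     // document contains both terms
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] hardware = {1, 4, 7, 9, 12};
        int[] trends = {2, 4, 9, 10, 12, 15};
        System.out.println(Arrays.toString(intersect(hardware, trends))); // [4, 9, 12]
    }
}
```

Both lists are read strictly sequentially, which is what makes this access pattern friendly to caches and prefetching.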
Recent hardware trends
seen from a naïve IR perspective
[Figure: timeline from 2002 to 2012 with the labels "not so fast =(", "fast!" and "super fast!!!"; one part is marked "Scope for this talk".]
CPU: From GHz to multi-core
Moore’s Law:
•  ~ the number of transistors on
an IC doubles every two years.
–  Less space, more complexity.
–  Shorter gates, higher clock rate.
Strategy of the 80s and 90s:
•  Add more complexity!
•  Increase the clock rate!
Pollack’s Rule:
•  The performance increase is ~
square root of the increased
complexity. [Borkar 2007]
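As a rough worked example of the rule, taking "complexity" to mean the transistor budget of a single core (an assumption made for this illustration):

```latex
\text{performance} \propto \sqrt{\text{complexity}}
\quad\Longrightarrow\quad
\frac{\text{performance}(2c)}{\text{performance}(c)} \approx \sqrt{2} \approx 1.41
```

Doubling the complexity of one core would thus buy roughly 41% more single-thread performance, whereas spending the same budget on two cores of the original complexity could, in the ideal case of a perfectly parallel workload, approach twice the throughput.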
The Power Wall:
•  Increasing clock rate and transistor
current leakage lead to excess power
consumption, while RC delays in signal
transmission grow as feature sizes
shrink. [Borkar et al. 2005]
Instruction-level parallelism
– “It’s not about GHz’s, but how you spend them!”
Pipeline length: 31 (P4) vs 14 stages (i7).
Multiple execution units and
out-of-order execution:
•  i7: 2 load/store address, 1 store data,
and 3 computational operations can
be executed simultaneously.
Dependences and hazards:
•  Control: branches.*
•  Data: output dependence,
antidependence (naming).
•  Structural: access to the same
physical unit of the processor.
Simultaneous multi-threading (“Hyper-threading”):
•  Duplicate certain sections of a processor (registers etc., but not execution units).
•  Reduces the impact of cache miss, branch misprediction and data dependency stalls.
•  Drawback: logical processors are most likely to be treated just like physical processors.
(*[Dean 2010]: a branch misprediction costs ~5ns)
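To make the control-hazard point concrete, here is a small sketch that replaces a data-dependent branch with arithmetic. It assumes all values and the limit are non-negative, so the subtraction cannot overflow; the names are hypothetical. Whether the branch-free form is actually faster depends on the data and on what the JIT compiler emits, so it should be measured rather than assumed.

```java
final class BranchFreeCount {
    /** Branchy version: one data-dependent, possibly mispredicted branch per element. */
    static int countBelowBranchy(int[] values, int limit) {
        int count = 0;
        for (int v : values) {
            if (v < limit) {
                count++;
            }
        }
        return count;
    }

    /**
     * Branch-free version. Assumes v >= 0 and limit >= 0, so (v - limit) cannot overflow
     * and its sign bit is 1 exactly when v < limit.
     */
    static int countBelowBranchFree(int[] values, int limit) {
        int count = 0;
        for (int v : values) {
            count += (v - limit) >>> 31; // adds the sign bit: 1 if v < limit, else 0
        }
        return count;
    }

    public static void main(String[] args) {
        int[] freqs = {5, 1, 9, 3, 7, 3};
        System.out.println(countBelowBranchy(freqs, 4));
        System.out.println(countBelowBranchFree(freqs, 4));
    }
}
```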
Computer memory hierarchy

(1ms = 1 000 µs = 1 000 000 ns; 1ns = 3 clock cycles at 3GHz or about 30cm of light travel)

Computer memory hierarchy
L1-L3 cache and performance implications
Some of the main challenges of CMP:
•  Cache coherence
•  Cache conflicts
•  Cache affinity
Other important cache-related issues:
•  Data size and cache line utilization.
–  i7 has 64B cache lines.
•  Data alignment and padding.
•  Cache associativity and replacement.
Additional memory issues:
•  A large span of random memory accesses may
have additional slowdown due to TLB misses.
•  Some of the virtual memory pages can also
be swapped out to disk.
[Figure: Thread1 to Thread4, an 8MB L3 cache and main memory.]
Writing efficient IR algorithms
–”The troubles with radix sort are in implementation, not in conception.” (McIlroy et al. 1993)
In-Place MSB Radix Sort:
[Birkeland 2008, Gorset 2011]
•  Starting from the most significant byte.
•  For each of the 256 combinations: count
the cardinality and initialize the pointers.
•  Apply Counting-Sort (shown on the right).
•  Recursively apply on the less significant
byte until the least significant byte; use
insertion sort if the range is too small.
•  O(kN), where k = 4 for 32-bit integers.
•  Has also been shown to be 3x faster than the native Java/C++
QuickSort implementation on large 32-bit integer arrays [Gorset 2011].
Benefits from:
•  Memory usage.
•  Comparing groups of bits at once.
•  Swaps instead of branches.
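The steps listed above can be sketched as follows. This is an illustrative version written for this text, not the code from [Birkeland 2008] or [Gorset 2011]. It assumes non-negative 32-bit integers (such as document IDs), and the insertion-sort cut-off of 32 elements is an arbitrary choice that would need tuning.

```java
import java.util.Arrays;

final class MsbRadixSort {
    private static final int INSERTION_THRESHOLD = 32; // assumed cut-off, tune by measurement

    /** Sorts non-negative ints in place, starting from the most significant byte. */
    static void sort(int[] a) {
        sort(a, 0, a.length, 24);
    }

    private static void sort(int[] a, int from, int to, int shift) {
        if (to - from <= INSERTION_THRESHOLD) {
            insertionSort(a, from, to);
            return;
        }
        // 1. Count the cardinality of each of the 256 byte values.
        int[] count = new int[256];
        for (int i = from; i < to; i++) {
            count[(a[i] >>> shift) & 0xFF]++;
        }
        // 2. Initialize the pointers: next[b] is the next free slot, end[b] the bucket end.
        int[] next = new int[256];
        int[] end = new int[256];
        int pos = from;
        for (int b = 0; b < 256; b++) {
            next[b] = pos;
            pos += count[b];
            end[b] = pos;
        }
        // 3. In-place counting sort: move each element to its bucket by chained swaps.
        for (int b = 0; b < 256; b++) {
            while (next[b] < end[b]) {
                int v = a[next[b]];
                int target = (v >>> shift) & 0xFF;
                while (target != b) {
                    int slot = next[target]++;
                    int displaced = a[slot];
                    a[slot] = v;
                    v = displaced;
                    target = (v >>> shift) & 0xFF;
                }
                a[next[b]++] = v;
            }
        }
        // 4. Recurse on the next less significant byte inside each bucket.
        if (shift > 0) {
            for (int b = 0; b < 256; b++) {
                if (count[b] > 1) {
                    sort(a, end[b] - count[b], end[b], shift - 8);
                }
            }
        }
    }

    private static void insertionSort(int[] a, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            int v = a[i];
            int j = i - 1;
            while (j >= from && a[j] > v) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }

    public static void main(String[] args) {
        int[] docIds = {170, 45, 75, 90, 802, 24, 2, 66};
        sort(docIds);
        System.out.println(Arrays.toString(docIds));
    }
}
```

The bucket index is computed with a shift and a mask, so the distribution pass contains no comparisons between elements; the only data-dependent control flow is the swap chain itself.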
Writing efficient IR algorithms
Cache- and processor-efficient query processing
Modern compression methods for IR:
•  BP, S9/S16, PFOR, NewPFD, etc.
•  Fast, superscalar and branch-free.
•  Loops/methods can be generated by a script.
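To show the basic idea behind binary packing (BP), here is a minimal sketch of fixed-width bit packing. It is a generic loop written for illustration, with hypothetical names, and it assumes every value fits in `bits` bits with `bits` between 1 and 32. Unlike the generated routines mentioned above, it still has one branch per value for values that straddle a word boundary; unrolling one routine per bit width is what removes that branch.

```java
final class BitPacking {
    /** Packs values (each assumed to be below 2^bits) into 64-bit words, bits per value in 1..32. */
    static long[] pack(int[] values, int bits) {
        long[] words = new long[(values.length * bits + 63) >>> 6];
        for (int i = 0; i < values.length; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 6;
            int offset = bitPos & 63;
            long v = values[i] & 0xFFFFFFFFL;
            words[word] |= v << offset;
            if (offset + bits > 64) {
                // The value straddles two words: store its upper part in the next word.
                words[word + 1] |= v >>> (64 - offset);
            }
        }
        return words;
    }

    /** Unpacks count values of the given bit width. */
    static int[] unpack(long[] words, int count, int bits) {
        int[] out = new int[count];
        long mask = (1L << bits) - 1;
        for (int i = 0; i < count; i++) {
            int bitPos = i * bits;
            int word = bitPos >>> 6;
            int offset = bitPos & 63;
            long v = words[word] >>> offset;
            if (offset + bits > 64) {
                v |= words[word + 1] << (64 - offset);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] gaps = {3, 1, 7, 2, 5, 4, 6, 1};
        long[] packed = pack(gaps, 3);
        System.out.println(java.util.Arrays.toString(unpack(packed, gaps.length, 3)));
    }
}
```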
While compression works on chunks of
postings, query processing itself remains
What about:
•  Branches and loops?
•  Cache utilization?
•  ILP utilization?
•  Candidates and results?
Interesting alternatives and trade-offs:
•  Impact-ordered vs document-ordered lists.
•  Term vs document-at-a-time processing.
•  Posting list iteration vs random access.
•  Mixed vs two-phase search.
•  Bitmaps vs posting lists.
source: [Zukowski 2009]
Writing efficient IR algorithms
Some experience from Databases
Vector-at-a-time execution [Zukowski 2009]
provides a good trade-off between tuple and
column-at-a-time execution:
•  Less time spent in interpretation logic.
•  “SIMDization” and data alignment.
•  Parallel memory access (prefetching).
•  In-cache execution.
Loop compilation can be another
alternative, especially if the application
already has a tuple-at-a-time API.
•  [Sompolski et al. 2011] show that
plain loop compilation can be inferior
to vectorization and motivate further
combination of the two techniques.
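A minimal sketch of the vector-at-a-time idea, written for illustration and not taken from [Zukowski 2009]: a filter-and-sum query is split into two simple primitives that each run over a batch of up to 1024 values and communicate through a selection vector. The batch size and all names are assumptions.

```java
final class VectorAtATime {
    static final int VECTOR_SIZE = 1024; // assumed batch size, small enough to stay in cache

    /** Primitive 1: writes the positions with key < limit into sel and returns how many. */
    static int selectLessThan(int[] key, int from, int n, int limit, int[] sel) {
        int k = 0;
        for (int i = 0; i < n; i++) {
            sel[k] = from + i;                     // always write the candidate position
            k += (key[from + i] < limit) ? 1 : 0;  // keep it only if the predicate holds
        }
        return k;
    }

    /** Primitive 2: sums the payload at the selected positions. */
    static long sumSelected(int[] payload, int[] sel, int k) {
        long sum = 0;
        for (int i = 0; i < k; i++) {
            sum += payload[sel[i]];
        }
        return sum;
    }

    /** SUM(payload) WHERE key < limit, evaluated one vector at a time. */
    static long sumWhereLessThan(int[] key, int[] payload, int limit) {
        int[] sel = new int[VECTOR_SIZE];
        long total = 0;
        for (int from = 0; from < key.length; from += VECTOR_SIZE) {
            int n = Math.min(VECTOR_SIZE, key.length - from);
            int k = selectLessThan(key, from, n, limit, sel);
            total += sumSelected(payload, sel, k);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] key = {5, 1, 9, 3};
        int[] payload = {10, 20, 30, 40};
        System.out.println(sumWhereLessThan(key, payload, 4)); // 20 + 40 = 60
    }
}
```

Each primitive is a tight loop over primitive arrays, so the per-tuple interpretation overhead is paid once per vector rather than once per value.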
Concurrent query processing
– In-memory indexes and “1024 core CPU”s: What to expect?
Inter-query vs intra-query concurrency:
•  Inter:
–  Each thread works with a different query.
–  Improves throughput, but latency may degrade.
•  Intra:
–  A query is processed by multiple threads.
–  Improves latency, but throughput may degrade.
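The two forms of concurrency can be contrasted with a small sketch. As a stand-in for a real query it counts the common document IDs of two sorted posting lists; the class, the method names and the way the work is partitioned are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class QueryConcurrency {
    /** Counts doc IDs common to a[from, to) and b; both lists are sorted and duplicate-free. */
    static int countCommon(int[] a, int from, int to, int[] b) {
        if (from >= to) return 0;
        int j = Arrays.binarySearch(b, a[from]);
        if (j < 0) j = -j - 1; // not found: start at the insertion point
        int count = 0;
        int i = from;
        while (i < to && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { count++; i++; j++; }
        }
        return count;
    }

    /** Inter-query: every query (a pair of lists) is one task on the pool. */
    static int[] interQuery(ExecutorService pool, int[][][] queries) {
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int[][] q : queries) {
            futures.add(CompletableFuture.supplyAsync(
                    () -> countCommon(q[0], 0, q[0].length, q[1]), pool));
        }
        int[] results = new int[queries.length];
        for (int i = 0; i < results.length; i++) results[i] = futures.get(i).join();
        return results;
    }

    /** Intra-query: one query, with list a split into up to `parts` ranges (parts >= 1). */
    static int intraQuery(ExecutorService pool, int[] a, int[] b, int parts) {
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        int chunk = (a.length + parts - 1) / parts;
        for (int from = 0; from < a.length; from += chunk) {
            final int lo = from;
            final int hi = Math.min(a.length, from + chunk);
            futures.add(CompletableFuture.supplyAsync(() -> countCommon(a, lo, hi, b), pool));
        }
        int total = 0;
        for (CompletableFuture<Integer> f : futures) total += f.join();
        return total;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            int[] a = {1, 3, 5, 7, 9, 11};
            int[] b = {3, 4, 5, 6, 11, 12};
            System.out.println(intraQuery(pool, a, b, 2));
            System.out.println(Arrays.toString(interQuery(pool, new int[][][] {{a, b}, {b, a}})));
        } finally {
            pool.shutdown();
        }
    }
}
```

In the intra-query variant every partition needs its own binary search and a final merge of partial results, which is the kind of overhead that can cost throughput even when latency improves.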
Inter-query concurrency and memory access:
•  [Strohman and Croft 2007]:
–  Top-k query processing with impact-ordered lists.
–  Observed that shared memory bandwidth
becomes a bottleneck with four processors.
•  [Tatikonda et al. 2009]:
–  Intersection with document-ordered lists.
–  Observed no cache or memory bandwidth problems.
•  [Qiao et al. 2008]:
–  DBMS query processing with a very large table.
–  Demonstrated that when all cores are used,
main memory bandwidth becomes bottleneck.
source: [Qiao et al. 2008]
Concurrent query processing
– In-memory indexes and “1024 core CPU”s: What to expect?
Intra-query concurrency and memory access:
•  [Lilleengen 2010]:
–  CPU simulator for Vespa Search Engine Platform (Yahoo! Trondheim).
–  Evaluated intra-query concurrency, its scalability, impact on the
processor caches and performance under various workloads.
Other ideas:
•  [Qiao et al. 2008] studied efficient memory scan sharing
for multi-core CPUs in databases. Suggested solution:
–  Each core gets a batch of queries, restricted by the
estimated working set size.
–  Queries in each batch share memory scans, i.e., a
block of data is fed through all queries in the batch.
–  Note: queries operate on a single but very large table.
•  Batch optimizations similar to those presented by
[Ding et al. 2011] can be interesting on sub-query level.
–  Query reordering.
–  Reusing partial results.
source: [Qiao et al. 2008]
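The scan-sharing idea can be sketched as below. This is a simplified illustration, not the algorithm of [Qiao et al. 2008]: the "queries" are hypothetical range counts over one large `int[]` table, the block size is an assumed constant, and the grouping of queries into batches by estimated working set size is left out.

```java
final class SharedScan {
    static final int BLOCK = 4096; // assumed block size, meant to fit in cache

    /** A hypothetical query: counts the table values in [lo, hi). */
    static final class RangeCount {
        final int lo;
        final int hi;
        long count;

        RangeCount(int lo, int hi) {
            this.lo = lo;
            this.hi = hi;
        }

        void consume(int[] table, int from, int to) {
            for (int i = from; i < to; i++) {
                if (table[i] >= lo && table[i] < hi) count++;
            }
        }
    }

    /** One pass over the table: each block is fed through every query in the batch. */
    static void run(int[] table, RangeCount[] batch) {
        for (int from = 0; from < table.length; from += BLOCK) {
            int to = Math.min(table.length, from + BLOCK);
            for (RangeCount q : batch) {
                q.consume(table, from, to); // the block is reused while it is still in cache
            }
        }
    }

    public static void main(String[] args) {
        int[] table = new int[10000];
        for (int i = 0; i < table.length; i++) table[i] = i % 100;
        RangeCount[] batch = {new RangeCount(0, 10), new RangeCount(50, 100)};
        run(table, batch);
        System.out.println(batch[0].count + " " + batch[1].count);
    }
}
```

The table is read from main memory once per batch instead of once per query, which is the point of sharing the scan when memory bandwidth is the bottleneck.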
Data-level parallelism
Single-instruction multiple-data (SIMD)
Driven by game industry, SIMD extensions are
very common in modern desktop computers.
•  Intel’s implementations: MMX (1996), SSE (1999),
SSE4 (2006), AVX (2011), etc.
Vector size:
•  SSE2: 128 bits, holding 2 doubles, 2 longs,
4 ints, 8 shorts or 16 chars.
•  AVX: 256 bit
•  Data movement, arithmetic, comparison, shuffle
(broadcast, swap, rotate), type conversion,
cache and memory management, etc.
•  Portability and compatibility.
•  Unaligned memory access is very expensive.
•  SIMD restricts how operations should be performed.
Data-level parallelism
Single-instruction multiple-data (SIMD)
Example, SIMD-based optimizations/algorithms:
•  [Chhugani et al. 2008] – integer sorting (cache-aware merge sort).
•  [Lemire and Boytsov 2012] – integer compression (SIMD-BP128).
•  [Ladra et al. 2012] – rank, select in bit-sequences, Horspool algorithm.
Fast intersection of sorted 32-bit integer lists:
•  Two vectors can be intersected either by computing a mask of common (32b) elements and
rotating one of the vectors or by using PCMPESTRM (obtains a mask of common 16b elements).
•  PCMPESTRM variant requires a custom data
structure when integers are larger than 2^16.
•  Requires more comparisons than a simple scalar
intersection, but runs much faster
(PCMPESTRM: ~5.4x speedup with one
million elements in each list and selectivity
around 30%).
•  [Schlegel et al. 2011, Katsov 2012].
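The block-wise structure of the first variant can be shown with a scalar emulation. This sketch uses no SIMD instructions at all: the nested loop over a 4x4 block stands in for the all-pairs comparison that a real implementation would perform with vector compares and rotations, so the speedups quoted above do not apply to it. It assumes strictly increasing lists; all names are hypothetical.

```java
import java.util.Arrays;

final class BlockIntersection {
    static final int W = 4; // lanes of a 128-bit register holding 32-bit integers

    /** Intersects two strictly increasing lists, comparing W x W blocks at a time. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        // Block phase: each iteration stands in for one SIMD all-pairs comparison.
        while (i + W <= a.length && j + W <= b.length) {
            for (int x = 0; x < W; x++) {
                for (int y = 0; y < W; y++) {
                    if (a[i + x] == b[j + y]) out[n++] = a[i + x];
                }
            }
            int lastA = a[i + W - 1];
            int lastB = b[j + W - 1];
            if (lastA <= lastB) i += W; // block of a is exhausted
            if (lastB <= lastA) j += W; // block of b is exhausted
        }
        // Scalar tail for whatever does not fill a whole block.
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 4, 7, 9, 12, 15, 18, 21, 30};
        int[] b = {2, 3, 7, 8, 12, 13, 18, 19, 20, 21, 22, 30, 31};
        System.out.println(Arrays.toString(intersect(a, b))); // [3, 7, 12, 18, 21, 30]
    }
}
```

The block phase performs up to 16 comparisons where a scalar merge would need at most 8 steps, which illustrates why the SIMD approach does more comparisons overall yet can still win by executing them in parallel and with fewer branches.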
© [Katsov 2012]
© Venray
–“What would Google do?”
Scale matters!
(based on Jeff Dean’s talks [Dean 2009, Dean 2010]):
•  “Don’t design to scale infinitely:
~5x–50x growth good to consider,
>100x probably requires rethink and rewrite.”
•  Buying 2x more machines, rather than buying better machines.
•  Low-end storage and networking hardware; in-memory data.
•  Single machine performance is not important.
•  Key focus: distribution and availability.
•  Interference between your (multiple) jobs and
random cron tasks.
Facebook is rumored to be testing out low-
power ARM processors in their data centers.
•  Challenges and opportunities of using a large
number of slower but more energy-efficient
nodes coupled with low-power storage have
been discussed by [Vasudevan et al. 2011].
One more thing… Java!
Bytecode and Just-in-time (JIT) compilation:
•  Java bytecode is halfway between the human-readable and machine code.
•  Bytecode can be interpreted by JVM or compiled to machine code at runtime.
•  JIT/HotSpot tricks: inlining, dead-code elimination, optimization/deoptimization.
•  Intrinsics: some functions can be replaced by machine instructions (e.g., popcount, max/min).
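As a small example of relying on an intrinsic, the sketch below counts the documents two terms have in common when each term is stored as a bitmap. `Long.bitCount` is a standard library method that HotSpot can typically replace with a population-count instruction on CPUs that provide one; the bitmap layout and the helper names are assumptions for this illustration.

```java
final class BitmapIntersection {
    /** Builds a bitmap in which bit d is set when document d contains the term. */
    static long[] toBitmap(int[] docIds, int maxDocId) {
        long[] words = new long[(maxDocId >>> 6) + 1];
        for (int d : docIds) {
            words[d >>> 6] |= 1L << (d & 63);
        }
        return words;
    }

    /** Counts the documents present in both bitmaps. */
    static long countCommon(long[] x, long[] y) {
        int n = Math.min(x.length, y.length);
        long count = 0;
        for (int i = 0; i < n; i++) {
            count += Long.bitCount(x[i] & y[i]); // AND the words, then popcount
        }
        return count;
    }

    public static void main(String[] args) {
        long[] x = toBitmap(new int[] {1, 64, 65, 200}, 300);
        long[] y = toBitmap(new int[] {0, 64, 200, 300}, 300);
        System.out.println(countCommon(x, y)); // documents 64 and 200, so 2
    }
}
```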
Concurrent processing in Java:
•  Powerful and flexible features (e.g., thread pools,
synchronous data structures, Fork/Join).
•  To be efficient, needs a careful understanding of
synchronization and the Java memory model.
•  Does not provide any affinity or low-level thread control.
Garbage collection (GC) and memory management:
•  Multiple areas/generations: eden and survivor (young),
tenured (old), permgen (internal).
•  Minor (young generation) vs major (old generation) GC.
•  Low-pause vs high-throughput GC algorithms.
•  Escape analysis.
One more thing… Java!
Efficiency tips:
•  Data:
–  Avoid big class hierarchies. Write simple and when applicable immutable objects.
–  Avoid creating unnecessary objects, use primitives.
–  Avoid frequent allocation of very large arrays.
•  Methods:
–  Write compact, clean, reusable and when applicable static methods.
•  Concurrency:
–  Divide and conquer!
–  Minimize synchronization and resource sharing between threads.
•  Development:
–  Correctness over performance.
–  Use existing collections and libraries.
–  Learn to profile, version control and unit-test your code.
•  Processors are getting faster and more advanced. However, these improvements are
becoming more challenging to harness by memory-intensive applications, such as IR.
•  Future IR algorithms should pay more attention to the CPU and cache-related issues.
•  Understanding hardware and programming language principles, and how they
interact, is essential for turning a conceptual advantage into actual performance in an
implementation.
•  Certain optimizations and performance improvements can be limited to the chosen
architecture and/or technology. For large-scale and heterogeneous IR systems such
optimizations may be less beneficial, economically infeasible or even impossible.
•  Low-power RISC processors are capable of delivering higher performance-per-watt
as well as performance-per-$ when compared to the high-end/desktop processors.
However, it remains unclear whether they can be more advantageous for efficient IR
and which challenges they may introduce.
1.  Birkeland: “Searching large data volumes with MISD processing”, PhD Thesis, NTNU 2008.
2.  Borkar: "Thousand core chips: a technology perspective”, In Proc. DAC 2007, pp.746-749.
3.  Borkar et al.: “Platform 2015: Intel® Processor and Platform Evolution for the Next Decade”, Intel 2005.
4.  Bosworth: “The Power Wall: Why aren’t modern CPUs faster? What happened in the late 1990’s?”, 2011.
5.  Büttcher et al.: “Information Retrieval: Implementing and Evaluating Search Engines”, 2010.
6.  Chhugani et al.: “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture”, In Proc. VLDB 2008, pp.
7.  Clark: “Facebook stretches ARM chips in datacentre tests”, ZDNet news article, 24th September 2012.
8.  Dean: “Challenges in Building Large-Scale Information Retrieval Systems”, keynote at WSDM 2009.
9.  Dean: “Building Software Systems at Google and Lessons Learned”, talk at Stanford University 2010.
10.  Ding et al.: “Batch Query Processing for Web Search Engines”, In Proc. WSDM 2011, pp. 137-146.
11.  Evans and Verburg: “Well Grounded Java Developer: Vital Techniques of Java 7 and polyglot programming”, 2013.
12.  Gorset:, 2010.
13.  Hennessy and Patterson: “Computer Architecture: A Quantitative Approach”, 3rd ed., 2003.
14.  Jahre: “Managing Shared Resources in Chip Multiprocessor Memory Systems”, PhD Thesis, NTNU 2010.
15.  Katsov:, 2012.
16.  Ladra et al.: “Exploiting SIMD instructions in current processors to improve classical string algorithms”, In Proc. ADBIS 2012,
pp. 254-267.
17.  Lemire and Boytsov: “Decoding billions of integers per second through vectorization”, CoRR abs/1209.2137, 2012.
18.  Lilleengen: “Parallel query evaluation on multicore architectures”, Master Thesis, NTNU 2010.
19.  Qiao et al.: “Main-Memory Scan Sharing For Multi-Core CPUs”, PVLDB 2008:1(1), pp. 610-621.
20.  Schlegel et al.: “Fast Sorted-Set Intersection using SIMD Instructions”, In ADMS Workshop, VLDB 2011.
21.  Strohman and Croft: “Efficient Document Retrieval in Main Memory”, In Proc. SIGIR 2007, pp. 175-182.
22.  Tatikonda et al.: “On efficient posting list intersection with multicore processors”, In Proc. SIGIR 2009, pp. 738-739.
23.  Vasudevan et al.: “Challenges and Opportunities for Efficient Computing with FAWN”, In Proc. SIGOPS 2011, pp. 34-44.
24.  Zukowski: “Balancing Vectorized Query Execution with Bandwidth-Optimized Storage”, PhD Thesis, University of
Amsterdam 2009.
(backup slides)
General Purpose GPU
Graphical Processing Unit:
•  Large number of stream processors
(Nvidia GeForce GTX 690: 2x1536).
•  Supports SIMD/MIMD and scatter.
•  Dedicated on-board memory (4GB).
Query processing on GPU:
•  [Ding et al. 2009, Hovland 2009, etc.]
•  Mispredicted branches are very expensive.
•  Requires transferring data to and from the graphics card.
•  Might be economically infeasible to write software specifically for GPU.
Integrated graphic sub-systems on modern CPUs:
•  Major advantage: proximity to CPU.
•  Major drawback: use of computer system’s RAM =(
© PGI Insider
Solid-State Drives
Based on NAND floating gate transistors.
Each disk is a redundant array of NAND.
Cannot delete/overwrite individual pages.
•  Consequence: frequent writes are
problematic and write performance
degrades with aging.
•  Solutions: 128MB+ on-board memory,
background garbage collection, trimming.
•  Other (SandForce DuraWrite): compression, deduplication and differencing.
Lifetime is limited due to writes, but modern SSD should last as long as HDD.
Single-level vs multi-level charge:
•  SLC is more reliable, but expensive
•  MLC may have larger capacity/cheaper, but is less reliable.
Solid-State vs Hard-Disk Drives
SSD was found to improve the performance of
several applications such as spatial query
processing with R-trees ([Emrich et al. 2010]).
January 2013: A 3TB HDD and 32GB DRAM
cost less than a 512GB SSD.
SSD may be considered as infeasible for
large data centers
•  see the discussion in the paper by
[Ananthanarayanan et al. 2011].
SSD and HDD can be combined in the
same system (e.g., [Risvik 2013]).
SSD and HDD require different trade-offs.
1.  Ananthanarayanan et al.: “Disk-Locality in Datacenter Computing Considered Irrelevant”, In Proc. HotOS Workshop at
USENIX 2011.
2.  Chhugani et al.: “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture”, In Proc. VLDB 2008, pp.
3.  Emrich et al.: “On the Impact of Flash SSDs on Spatial Indexing”, In Proc. DaMoN 2010, pp. 3-8.
4.  Hovland: “Throughput Computing on Future GPUs”, Master Thesis, NTNU 2009.
5.  Hutchinson: “Solid-state revolution: in-depth on how SSDs really work”, Ars Technica, 2012.
6.  Risvik et al.: “Maguro, a system for indexing and searching over very large text collections”, In Proc. WSDM 2013, To

HyperLogLog and friends
Simon Lia-Jonassen
No more bad news!
No more bad news!No more bad news!
No more bad news!
Simon Lia-Jonassen
Xgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - ExplainedXgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - Explained
Simon Lia-Jonassen
Chatbots are coming!
Chatbots are coming!Chatbots are coming!
Chatbots are coming!
Simon Lia-Jonassen
Large-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and MonetizationLarge-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and Monetization
Simon Lia-Jonassen
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search Engines
Simon Lia-Jonassen
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at Cxense
Simon Lia-Jonassen
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
Simon Lia-Jonassen
Efficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search EnginesEfficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search Engines
Simon Lia-Jonassen

More from Simon Lia-Jonassen (9)

HyperLogLog and friends
HyperLogLog and friendsHyperLogLog and friends
HyperLogLog and friends
No more bad news!
No more bad news!No more bad news!
No more bad news!
Xgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - ExplainedXgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - Explained
Chatbots are coming!
Chatbots are coming!Chatbots are coming!
Chatbots are coming!
Large-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and MonetizationLarge-Scale Real-Time Data Management for Engagement and Monetization
Large-Scale Real-Time Data Management for Engagement and Monetization
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search Engines
Leveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at CxenseLeveraging Big Data and Real-Time Analytics at Cxense
Leveraging Big Data and Real-Time Analytics at Cxense
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
Efficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search EnginesEfficient Query Processing in Distributed Search Engines
Efficient Query Processing in Distributed Search Engines

Recently uploaded

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
Claudio Di Ciccio
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
Techgropse Pvt.Ltd.
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
名前 です男
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides

What should be done to IR algorithms to meet current, and possible future, hardware trends.

  • 1. Hardware Developments and Algorithm Design: “What should be done to IR algorithms to meet current, and possible future, hardware trends?” Simon Jonassen Department of Computer and Information Science Norwegian University of Science and Technology
  • 2. This talk is not about…. Uncovered, but highly related topics: –  Query processing on specialized hardware, including GPU. –  Succinct indexes, suffix arrays, wavelet trees, etc. –  Map-Reduce and machine learning. –  Green and Cloud computing. –  Distributed query processing. –  Shared memory and NUMA. –  Scalability and availability. –  Solid-state drives. –  Virtualization. –  …
  • 3. Information Retrieval Information Retrieval (IR): representing, searching and manipulating large collections of electronic and human-language data. Scope for this talk: •  Indexed search in document collections. Other examples and applications: •  Clustering and categorization. •  Information extraction. •  Question answering. •  Multimedia retrieval. •  Real-time search. •  Etc. [Figure: search engine overview with the labels Documents, Index, Search Engine, Queries, Results and Users.]
  • 4. Search in inverted indexes Posting lists: •  Contain document IDs and frequencies. •  May also contain scores, context ID, positions and other information. •  Ordered by document, frequency or impact. Query processing: •  Term- vs document-at-a-time. •  Boolean vs score-based evaluation. •  Pruning. Other alternatives: •  Bitmaps, trees, etc. Other matters: •  Preprocessing: –  E.g., stemming. •  Two-phase search. •  Postprocessing: –  E.g., snippet generation. •  Static pruning, result cache, etc.
  • 5. Recent hardware trends seen from a naïve IR perspective Scope for this talk. [Figure: 2002 vs 2012 hardware. Processor: 2GHz-- vs 4x2x3GHz++. Main memory: 4x512MB- vs 4x8GB+. Disk: 80GB- vs 512GB. Speed labels in the figure: “fast!”, “not so fast =(”, “fast!”, “super fast!!!”.]
  • 6. CPU: From GHz to multi-core Moore’s Law: •  ~ the number of transistors on an IC doubles every two years. –  Less space, more complexity. –  Shorter gates, higher clock rate. Strategy of the 80s and 90s: •  Add more complexity! •  Increase the clock rate! Pollack’s Rule: •  The performance increase is ~ square root of the increased complexity. [Borkar 2007] The Power Wall: •  Increasing clock rate and transistor current leakage lead to excess power consumption, while RC delays in signal transmission grow as feature sizes shrink. [Borkar et al. 2005]
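A worked reading of Pollack’s Rule as stated above, taking the square-root relation at face value (illustrative arithmetic, not a measurement):

```latex
\[
\text{performance} \propto \sqrt{\text{complexity}}
\quad\Longrightarrow\quad
\frac{P(2\times\text{ complexity})}{P(1\times\text{ complexity})} \approx \sqrt{2} \approx 1.41
\]
```

Under this reading, doubling the transistor budget of a single core buys roughly 41% more performance, whereas spending the same budget on a second core can offer up to 2x throughput when the workload parallelizes well.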
  • 7. Instruction-level parallelism – “It’s not about GHz’s, but how you spend them!” Pipeline length: 31 (P4) vs 14 stages (i7). Multiple execution units and out-of-order execution: •  i7: 2 load/store address, 1 store data, and 3 computational operations can be executed simultaneously. Dependences and hazards: •  Control: branches.* •  Data: output dependence, antidependence (naming). •  Structural: access to the same physical unit of the processor. Simultaneous multi-threading (“Hyper-threading”): •  Duplicate certain sections of a processor (registers etc., but not execution units). •  Reduces the impact of cache miss, branch misprediction and data dependency stalls. •  Drawback: logical processors are most likely to be treated just like physical processors. (*[Dean 2010]: a branch misprediction costs ~5ns)
  • 8. Computer memory hierarchy (Intel Core i7-2600K), source: [Jahre 2010] (1ms = 1 000 µs = 1 000 000 ns; 1ns = 3 clock cycles at 3GHz or 29.98cm of light travel)
      Level                 Latency   Size        Technology   Managed by
      Registers             <<1ns     ?1KB        CMOS         Compiler
      L1 Cache (on-chip)    <1ns      4x32KBx2    SRAM         Hardware
      L2 Cache (off-chip)   2.5ns     4x256KB     SRAM         Hardware
      L3 Cache (shared)     5ns       8MB         SRAM         Hardware
      Main Memory           50ns      4x8GB+      DRAM         OS
      Solid-State Drive     <100µs    512GB-      NAND Flash   Hardware/OS/User
      Hard-Disk Drive       3-12ms    1TB+        Magnetic     Hardware/OS/User
  • 9. Computer memory hierarchy L1-L3 cache and performance implications Some of the main challenges of CMP: •  Cache coherence •  Cache conflicts •  Cache affinity Other important cache-related issues: •  Data size and cache line utilization. –  i7 has 64B cache lines. •  Data alignment and padding. •  Cache associativity and replacement. Additional memory issues: •  A large span of random memory accesses may have additional slowdown due to TLB misses. •  Some of the virtual memory pages can also be swapped out to disk. [Figure: four cores running Thread1 to Thread4, each with a 32KB L1D and a 256KB L2 cache, sharing an 8MB L3 cache in front of main memory.]
  • 10. Writing efficient IR algorithms – “The troubles with radix sort are in implementation, not in conception.” (McIlroy et al. 1993) In-Place MSB Radix Sort: [Birkeland 2008, Gorset 2011] •  Start from the most significant byte. •  For each of the 256 byte values: count the cardinality and initialize the bucket pointers. •  Apply Counting-Sort (shown on the right). •  Recursively apply to each bucket on the next less significant byte, down to the least significant byte; use insertion sort if the range is too small. Complexity: •  O(kN), where k = 4 for 32-bit integers. •  Has also been shown to be 3x faster than the native Java/C++ QuickSort implementation on large 32-bit integer arrays [Gorset 2011]. Benefits from: •  Memory usage. •  Comparing groups of bits at once. •  Swaps instead of branches. code:
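The steps listed on this slide can be sketched as follows. This is a minimal illustration, not the code from [Birkeland 2008] or [Gorset 2011]: the class name `MsbRadixSort`, the insertion-sort cut-off of 32 elements and the sign-bit flip (so that negative integers sort first) are assumptions made for this sketch.

```java
import java.util.Arrays;

final class MsbRadixSort {
    // Assumed cut-off below which insertion sort is used; not taken from the slides.
    private static final int INSERTION_THRESHOLD = 32;

    static void sort(int[] a) {
        sort(a, 0, a.length, 24); // start from the most significant byte
    }

    // Sorts a[from, to) on the byte at bit offset `shift`, then recurses on lower bytes.
    private static void sort(int[] a, int from, int to, int shift) {
        if (to - from <= INSERTION_THRESHOLD) {
            insertionSort(a, from, to);
            return;
        }
        // 1. Count the cardinality of each of the 256 byte values.
        int[] count = new int[256];
        for (int i = from; i < to; i++) count[digit(a[i], shift)]++;

        // 2. Initialize the pointers: next[b] is the next free slot of bucket b.
        int[] next = new int[256];
        int[] end = new int[256];
        int pos = from;
        for (int b = 0; b < 256; b++) {
            next[b] = pos;
            pos += count[b];
            end[b] = pos;
        }

        // 3. In-place counting sort: follow the displacement cycle until an
        //    element belonging to bucket b turns up, then close the cycle.
        for (int b = 0; b < 256; b++) {
            while (next[b] < end[b]) {
                int v = a[next[b]];
                int d = digit(v, shift);
                while (d != b) {
                    int target = next[d]++;
                    int displaced = a[target];
                    a[target] = v;
                    v = displaced;
                    d = digit(v, shift);
                }
                a[next[b]++] = v;
            }
        }

        // 4. Recurse into each bucket on the next less significant byte.
        if (shift > 0) {
            int start = from;
            for (int b = 0; b < 256; b++) {
                if (count[b] > 1) sort(a, start, start + count[b], shift - 8);
                start += count[b];
            }
        }
    }

    // Extracts one byte; the top byte has its sign bit flipped (an assumption
    // of this sketch) so that negative values end up before positive ones.
    private static int digit(int x, int shift) {
        int d = (x >>> shift) & 0xFF;
        return shift == 24 ? d ^ 0x80 : d;
    }

    private static void insertionSort(int[] a, int from, int to) {
        for (int i = from + 1; i < to; i++) {
            int v = a[i];
            int j = i - 1;
            while (j >= from && a[j] > v) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }

    public static void main(String[] args) {
        int[] data = {170, -45, 75, 90, 802, 24, 2, 66};
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

The recursion depth is at most four for 32-bit integers, and the only extra memory is the three 256-entry arrays per level.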
  • 11. Writing efficient IR algorithms Cache- and processor-efficient query processing Modern compression methods for IR: •  BP, S9/S16, PFOR, NewPFD, etc. •  Fast, superscalar and branch-free. •  Loops/methods can be generated by a script. While compression works on chunks of postings, query processing itself remains posting-at-a-time. What about: •  Branches and loops? •  Cache utilization? •  ILP utilization? •  Candidates and results? Interesting alternatives and trade-offs: •  Impact-ordered vs document-ordered lists. •  Term vs document-at-a-time processing. •  Posting list iteration vs random access. •  Mixed vs two-phase search. •  Bitmaps vs posting lists. code:
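To make the idea of chunk-wise, branch-light decompression concrete, here is a minimal sketch of fixed-width bit packing, the basic building block behind BP-style schemes. It is not any of the published codecs named above: the class `BitPacking`, its method names and the one-word padding are assumptions of this sketch, and production codecs typically use one unrolled routine per bit width (the kind of code a script can generate).

```java
final class BitPacking {
    // Packs values, each assumed to fit in `bits` bits (1..32), back to back.
    // One padding word is appended so neither loop needs a bounds check.
    static int[] pack(int[] values, int bits) {
        int[] out = new int[((values.length * bits + 31) >>> 5) + 1];
        long bitPos = 0;
        for (int v : values) {
            int word = (int) (bitPos >>> 5);
            int offset = (int) (bitPos & 31);
            long shifted = (v & 0xFFFFFFFFL) << offset;
            out[word] |= (int) shifted;               // low part
            out[word + 1] |= (int) (shifted >>> 32);  // spill-over, zero if none
            bitPos += bits;
        }
        return out;
    }

    // Unpacks n values; the loop body has no data-dependent branches.
    static int[] unpack(int[] packed, int bits, int n) {
        int[] out = new int[n];
        long mask = (1L << bits) - 1;
        long bitPos = 0;
        for (int i = 0; i < n; i++) {
            int word = (int) (bitPos >>> 5);
            int offset = (int) (bitPos & 31);
            // 64-bit window over two adjacent words.
            long window = (packed[word] & 0xFFFFFFFFL)
                        | ((packed[word + 1] & 0xFFFFFFFFL) << 32);
            out[i] = (int) ((window >>> offset) & mask);
            bitPos += bits;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] gaps = {5, 0, 7, 3, 1, 6, 2, 4};
        int[] packed = pack(gaps, 3);
        System.out.println(java.util.Arrays.toString(unpack(packed, 3, gaps.length)));
    }
}
```

In an index, the packed values would usually be document-ID gaps or frequencies from one chunk of postings, with the bit width stored in the chunk header.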
  • 12. Writing efficient IR algorithms Some experience from Databases Vector-at-a-time execution [Zukowski 2009] provides a good trade-off between tuple and column-at-a-time execution: •  Less time spent in interpretation logic. •  “SIMDization” and data alignment. •  Parallel memory access (prefetching). •  In-cache execution. Loop compilation can be another alternative, especially if the application already has a tuple-at-a-time API. •  [Sompolski et al. 2011] show that plain loop compilation can be inferior to vectorization and motivate further combination of the two techniques. Figure source: [Zukowski 2009]
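A small sketch of what vector-at-a-time execution looks like in code, assuming a toy query of the form "sum `price` where `qty < bound`" over two primitive columns. The class `VectorAtATime`, the column names and the vector size of 1024 are hypothetical choices for illustration and are not taken from [Zukowski 2009].

```java
final class VectorAtATime {
    // Assumed vector size, picked so that a vector stays resident in cache.
    static final int VECTOR_SIZE = 1024;

    // Primitive 1: collect the positions in [from, to) where col[i] < bound.
    // The position is written unconditionally and the counter advanced by 0 or 1,
    // which keeps an explicit if-statement out of the tight loop.
    static int selectLessThan(int[] col, int from, int to, int bound, int[] sel) {
        int n = 0;
        for (int i = from; i < to; i++) {
            sel[n] = i;
            n += (col[i] < bound) ? 1 : 0;
        }
        return n;
    }

    // Primitive 2: sum the column values at the selected positions.
    static long sumSelected(int[] col, int[] sel, int n) {
        long sum = 0;
        for (int k = 0; k < n; k++) sum += col[sel[k]];
        return sum;
    }

    // The "interpreter": one call per primitive per vector instead of per tuple.
    static long sumPriceWhereQtyBelow(int[] qty, int[] price, int bound) {
        int[] sel = new int[VECTOR_SIZE];
        long total = 0;
        for (int from = 0; from < qty.length; from += VECTOR_SIZE) {
            int to = Math.min(from + VECTOR_SIZE, qty.length);
            int n = selectLessThan(qty, from, to, bound, sel);
            total += sumSelected(price, sel, n);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] qty = {5, 40, 12, 90, 3};
        int[] price = {100, 200, 300, 400, 500};
        System.out.println(sumPriceWhereQtyBelow(qty, price, 30));
    }
}
```

Each primitive is a simple loop over a small array, so the interpretation overhead is paid once per vector and the intermediate selection vector can stay in cache between the two primitives.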
  • 13. Concurrent query processing – In-memory indexes and “1024 core CPU”s: What to expect? Inter-query vs intra-query concurrency: •  Inter: –  Each thread works with a different query. –  Improves throughput, but latency may degrade. •  Intra: –  A query is processed by multiple threads. –  Improves latency, but throughput may degrade. Inter-query concurrency and memory access: •  [Strohman and Croft 2007]: –  Top-k query processing with impact-ordered lists. –  Observed that shared memory bandwidth becomes a bottleneck with four processors. •  [Tatikonda et al. 2009]: –  Intersection with document-ordered lists. –  Observed no cache or memory bandwidth problems. •  [Qiao et al. 2008]: –  DBMS query processing with a very large table. –  Demonstrated that when all cores are used, main memory bandwidth becomes the bottleneck. source: [Qiao et al. 2008]
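Inter-query concurrency, as described on this slide, can be sketched with a fixed thread pool in which every query is evaluated start to finish by a single worker. The class `InterQueryExecutor`, the toy term-to-postings map and the two-term AND query model are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class InterQueryExecutor {
    // Toy in-memory index: term -> sorted document IDs. It is only read
    // after construction, so the worker threads can share it without locks.
    private final Map<String, int[]> index;

    InterQueryExecutor(Map<String, int[]> index) {
        this.index = index;
    }

    // Evaluates one two-term AND query on the calling thread and
    // returns the number of documents containing both terms.
    int evaluate(String termA, String termB) {
        int[] a = index.getOrDefault(termA, new int[0]);
        int[] b = index.getOrDefault(termB, new int[0]);
        int i = 0, j = 0, hits = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { hits++; i++; j++; }
        }
        return hits;
    }

    // Inter-query concurrency: one task per query, results in query order.
    List<Integer> evaluateAll(List<String[]> queries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String[] q : queries) {
                tasks.add(() -> evaluate(q[0], q[1]));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

All workers stream through the same shared posting lists, which is the access pattern behind the memory-bandwidth observations cited above. An intra-query variant would instead split one query's posting lists into ranges and hand those to the pool.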
  • 14. Concurrent query processing – In-memory indexes and “1024 core CPU”s: What to expect? Intra-query concurrency and memory access: •  [Lilleengen 2010]: –  CPU simulator for Vespa Search Engine Platform (Yahoo! Trondheim). –  Evaluated intra-query concurrency, its scalability, impact on the processor caches and performance under various workloads. Other ideas: •  [Qiao et al. 2008] studied efficient memory scan sharing for multi-core CPUs in databases. Suggested solution: –  Each core gets a batch of queries, restricted by the estimated working set size. –  Queries in each batch share memory scans, i.e., a block of data is fed through all queries in the batch. –  Note: queries operate on a single but very large table. •  Batch optimizations similar to those presented by [Ding et al. 2011] can be interesting on sub-query level. –  Query reordering. –  Reusing partial results. source: [Qiao et al. 2008]
  • 15. Data-level parallelism Single-instruction multiple-data (SIMD) Driven by the game industry, SIMD extensions are very common in modern desktop computers. •  Intel’s implementations: MMX (1996), SSE (1999), SSE4 (2006), AVX (2011), etc. Vector size: •  SSE2: 128 bits, containing 2 doubles, 2 longs, 4 ints, 8 shorts or 16 chars. •  AVX: 256 bits. Operations: •  Data movement, arithmetic, comparison, shuffle (broadcast, swap, rotate), type conversion, cache and memory management, etc. Drawbacks: •  Portability and compatibility. •  Unaligned memory access is very expensive. •  SIMD restricts how operations should be performed.
  • 16. Data-level parallelism Single-instruction multiple-data (SIMD) Example, SIMD-based optimizations/algorithms: •  [Chhugani et al. 2008] – integer sorting (cache-aware merge sort). •  [Lemire and Boytsov 2012] – integer compression (SIMD-BP128). •  [Ladra et al. 2012] – rank, select in bit-sequences, Horspool algorithm. Fast intersection of sorted 32-bit integer lists: •  Two vectors can be intersected either by computing a mask of common (32b) elements and rotating one of the vectors or by using PCMPESTRM (obtains a mask of common 16b elements). •  PCMPESTRM variant requires a custom data structure when integers are larger than 2^16. •  Requires more comparisons than a simple scalar intersection, but runs much faster (PCMPESTRM: ~5.4x speedup with one million elements in each list and selectivity around 30%). •  [Schlegel et al. 2011, Katsov 2012]. © [Katsov 2012]
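The block-wise scheme can be illustrated in plain Java, although without the actual SIMD instructions it shows only the control flow, not the speedup. The sketch below is a scalar emulation under stated assumptions: both lists are strictly increasing, the block width of 4 mirrors four 32-bit lanes of a 128-bit register, and the class `SortedIntersection` and its method names are hypothetical. The inner all-pairs loop stands in for what a SIMD implementation would do with a few compare and shuffle instructions.

```java
import java.util.Arrays;

final class SortedIntersection {
    // Baseline: classic two-pointer merge intersection.
    static int[] intersectMerge(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        return Arrays.copyOf(out, merge(a, 0, b, 0, out, 0));
    }

    // Merges a[i..] with b[j..], appending matches to out from index n.
    private static int merge(int[] a, int i, int[] b, int j, int[] out, int n) {
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; }
        }
        return n;
    }

    // Scalar emulation of block-wise intersection: compare every element of a
    // 4-wide block of `a` with every element of a 4-wide block of `b`, then
    // advance the block whose largest element is smaller (both if equal).
    static int[] intersectBlocks(int[] a, int[] b) {
        final int w = 4;
        int[] out = new int[Math.min(a.length, b.length)];
        int aEnd = a.length - a.length % w;
        int bEnd = b.length - b.length % w;
        int i = 0, j = 0, n = 0;
        while (i < aEnd && j < bEnd) {
            for (int x = 0; x < w; x++) {
                boolean hit = false;
                for (int y = 0; y < w; y++) hit |= a[i + x] == b[j + y];
                if (hit) out[n++] = a[i + x];
            }
            int aMax = a[i + w - 1];
            int bMax = b[j + w - 1];
            if (aMax <= bMax) i += w;
            if (bMax <= aMax) j += w;
        }
        // Leftover elements that do not fill a whole block.
        n = merge(a, i, b, j, out, n);
        return Arrays.copyOf(out, n);
    }
}
```

The block version performs 16 comparisons per step where the merge performs one or two, which matches the remark above that the SIMD approach needs more comparisons overall but can still be faster when they execute in parallel.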
  • 17. Scale matters! – “What would Google do?” (based on Jeff Dean’s talks [Dean 2009, Dean 2010]): •  “Don’t design to scale infinitely: ~5x–50x growth good to consider, >100x probably requires rethink and rewrite.” •  Buying 2x more machines, rather than buying better machines. •  Low-end storage and networking hardware; in-memory data. •  Single machine performance is not important. •  Key focus: distribution and availability. •  Interference between your (multiple) jobs and random cron tasks. Facebook is rumored to be testing out low-power ARM processors in their data centers. •  Challenges and opportunities of using a large number of slower but more energy-efficient nodes coupled with low-power storage have been discussed by [Vasudevan et al. 2011]. © Venray Technologies
  • 18. One more thing… Java! Bytecode and Just-in-time (JIT) compilation: •  Java bytecode is halfway between the human-readable and machine code. •  Bytecode can be interpreted by JVM or compiled to machine code at runtime. •  JIT/HotSpot tricks: inlining, dead-code elimination, optimization/deoptimization. •  Intrinsics: some functions can be replaced by machine instructions (e.g., popcount, max/min). Concurrent processing in Java: •  Powerful and flexible features (e.g., thread pools, synchronous data structures, Fork/Join). •  To be efficient, needs a careful understanding of synchronization and the Java memory model. •  Does not provide any affinity or low-level thread control. Garbage collection (GC) and memory management: •  Multiple areas/generations: eden and survivor (young), tenured (old), permgen (internal). •  Minor (young generation) vs major (old generation) GC. •  Low-pause vs high-throughput GC algorithms. •  Escape analysis.
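As a small illustration of two of the features listed on this slide, the sketch below counts the documents common to two bitmaps using `Long.bitCount` (a method HotSpot can typically replace with a population-count instruction where the CPU supports it) and splits the work with the Fork/Join framework. The class name `BitmapIntersectionCount` and the sequential threshold are assumptions of this sketch.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class BitmapIntersectionCount extends RecursiveTask<Long> {
    // Assumed cut-off (in 64-bit words) below which a range is scanned sequentially.
    private static final int SEQUENTIAL_THRESHOLD = 1 << 12;

    private final long[] a;
    private final long[] b;
    private final int from;
    private final int to;

    private BitmapIntersectionCount(long[] a, long[] b, int from, int to) {
        this.a = a;
        this.b = b;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= SEQUENTIAL_THRESHOLD) {
            long count = 0;
            for (int i = from; i < to; i++) {
                count += Long.bitCount(a[i] & b[i]); // popcount of the AND-ed words
            }
            return count;
        }
        // Divide and conquer: fork the left half, compute the right half here.
        int mid = (from + to) >>> 1;
        BitmapIntersectionCount left = new BitmapIntersectionCount(a, b, from, mid);
        BitmapIntersectionCount right = new BitmapIntersectionCount(a, b, mid, to);
        left.fork();
        return right.compute() + left.join();
    }

    // Number of bit positions set in both bitmaps (over their common length).
    static long count(long[] a, long[] b) {
        int n = Math.min(a.length, b.length);
        return ForkJoinPool.commonPool().invoke(new BitmapIntersectionCount(a, b, 0, n));
    }
}
```

The tasks only read the two arrays and share no mutable state, so no synchronization is needed beyond what `fork` and `join` already provide.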
  • 19. One more thing… Java! Efficiency tips: •  Data: –  Avoid big class hierarchies. Write simple and when applicable immutable objects. –  Avoid creating unnecessary objects, use primitives. –  Avoid frequent allocation of very large arrays. •  Methods: –  Write compact, clean, reusable and when applicable static methods. •  Concurrency: –  Divide and conquer! –  Minimize synchronization and resource sharing between threads. •  Development: –  Correctness over performance. –  Use existing collections and libraries. –  Learn to profile, version control and unit-test your code.
  • 20. Conclusions •  Processors are getting faster and more advanced. However, these improvements are becoming more challenging to harness by memory-intensive applications, such as IR. •  Future IR algorithms should pay more attention to the CPU and cache-related issues. •  Understanding the hardware, the programming language principles and their interaction is essential for turning a conceptual advantage into actual implementation performance. •  Certain optimizations and performance improvements can be limited to the chosen architecture and/or technology. For large-scale and heterogeneous IR systems such optimizations may be less beneficial, economically infeasible or even impossible. •  Low-power RISC processors are capable of delivering higher performance-per-watt as well as performance-per-$ when compared to the high-end/desktop processors. However, it remains unclear whether they can be more advantageous for efficient IR and which challenges they may introduce.
  • 21. References:
      1.  Birkeland: “Searching large data volumes with MISD processing”, PhD Thesis, NTNU 2008.
      2.  Borkar: “Thousand core chips: a technology perspective”, In Proc. DAC 2007, pp. 746-749.
      3.  Borkar et al.: “Platform 2015: Intel® Processor and Platform Evolution for the Next Decade”, Intel 2005.
      4.  Bosworth: “The Power Wall: Why aren’t modern CPUs faster? What happened in the late 1990’s?”, 2011.
      5.  Büttcher et al.: “Information Retrieval: Implementing and Evaluating Search Engines”, 2010.
      6.  Chhugani et al.: “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture”, In Proc. VLDB 2008, pp. 1313-1324.
      7.  Clark: “Facebook stretches ARM chips in datacentre tests”, ZDNet news article, 24th September 2012.
      8.  Dean: “Challenges in Building Large-Scale Information Retrieval Systems”, keynote at WSDM 2009.
      9.  Dean: “Building Software Systems at Google and Lessons Learned”, talk at Stanford University 2010.
      10.  Ding et al.: “Batch Query Processing for Web Search Engines”, In Proc. WSDM 2011, pp. 137-146.
      11.  Evans and Verburg: “Well Grounded Java Developer: Vital Techniques of Java 7 and polyglot programming”, 2013.
      12.  Gorset:, 2010.
      13.  Hennessy and Patterson: “Computer Architecture: A Quantitative Approach”, 3rd ed., 2003.
      14.  Jahre: “Managing Shared Resources in Chip Multiprocessor Memory Systems”, PhD Thesis, NTNU 2010.
      15.  Katsov:, 2012.
      16.  Ladra et al.: “Exploiting SIMD instructions in current processors to improve classical string algorithms”, In Proc. ADBIS 2012, pp. 254-267.
      17.  Lemire and Boytsov: “Decoding billions of integers per second through vectorization”, CoRR abs/1209.2137, 2012.
      18.  Lilleengen: “Parallel query evaluation on multicore architectures”, Master Thesis, NTNU 2010.
      19.  Qiao et al.: “Main-Memory Scan Sharing For Multi-Core CPUs”, PVLDB 2008:1(1), pp. 610-621.
      20.  Schlegel et al.: “Fast Sorted-Set Intersection using SIMD Instructions”, In ADMS Workshop, VLDB 2011.
      21.  Strohman and Croft: “Efficient Document Retrieval in Main Memory”, In Proc. SIGIR 2007, pp. 175-182.
      22.  Tatikonda et al.: “On efficient posting list intersection with multicore processors”, In Proc. SIGIR 2009, pp. 738-739.
      23.  Vasudevan et al.: “Challenges and Opportunities for Efficient Computing with FAWN”, In Proc. SIGOPS 2011, pp. 34-44.
      24.  Zukowski: “Balancing Vectorized Query Execution with Bandwidth-Optimized Storage”, PhD Thesis, University of Amsterdam 2009.
  • 24. General Purpose GPU Graphical Processing Unit: •  Large number of stream processors (Nvidia GeForce GTX 690: 2x1536). •  Supports SIMD/MIMD and scatter. •  Dedicated on-board memory (4GB). Query processing on GPU: •  [Ding et al. 2009, Hovland 2009, etc.] Drawbacks: •  Mispredicted branches are very expensive. •  Requires transferring data to/from the graphics card. •  Might be economically infeasible to write software specifically for GPU. Integrated graphics sub-systems on modern CPUs: •  Major advantage: proximity to CPU. •  Major drawback: use of the computer system’s RAM. © PGI Insider
  • 25. Solid-State Drives Based on NAND floating gate transistors. Each disk is a redundant array of NAND. Cannot delete/overwrite individual pages. •  Consequence: frequent writes are problematic and write performance degrades with aging. •  Solutions: 128MB+ on-board memory, background garbage collection, trimming, overprovisioning. •  Other (SandForce DuraWhite): compression, deduplication and differencing. Lifetime is limited due to writes, but modern SSD should last as long as HDD. Single-level vs multi-level charge: •  SLC is more reliable, but expensive •  MLC may have larger capacity/cheaper, but is less reliable.
  • 26. Solid-State vs Hard-Disk Drives SSD was found to improve the performance of several applications such as spatial query processing with R-trees ([Emrich et al. 2010]). January 2013: A 3TB HDD and 32GB DRAM cost less than a 512GB SSD. SSD may be considered as infeasible for large data centers •  see the discussion in the paper by [Ananthanarayanan et al. 2011]. SSD and HDD can be combined in the same system (e.g., [Risvik 2013]). SSD and HDD require different trade-offs.
      Device   Access time   Bandwidth   Price        Capacity
      HDD      3-12ms        <140MB/s    <0.05$/GB    1TB+
      SSD      <100µs        <600MB/s    0.5-1$/GB    512GB-
      DRAM     <50ns         <21GB/s     5-10$/GB     32GB-
      Some other numbers [Dean 2010]:
      Send 2KB over 1 Gbps network        20µs
      Round trip within same datacenter   500µs
      Send packet CA->Netherlands->CA     150ms
  • 27. References:
      1.  Ananthanarayanan et al.: “Disk-Locality in Datacenter Computing Considered Irrelevant”, In Proc. HotOS Workshop at USENIX 2011.
      2.  Chhugani et al.: “Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture”, In Proc. VLDB 2008, pp. 1313-1324.
      3.  Emrich et al.: “On the Impact of Flash SSDs on Spatial Indexing”, In Proc. DaMoN 2010, pp. 3-8.
      4.  Hovland: “Throughput Computing on Future GPUs”, Master Thesis, NTNU 2009.
      5.  Hutchinson: “Solid-state revolution: in-depth on how SSDs really work”, Ars Technica, 2012.
      6.  Risvik et al.: “Maguro, a system for indexing and searching over very large text collections”, In Proc. WSDM 2013, To appear.