Content-Based Matching on GPUs
Presentation Transcript

  • High Performance Content-Based Matching Using GPUs
    Alessandro Margara and Gianpaolo Cugola
    margara@elet.polimi.it, cugola@elet.polimi.it
    Dip. Elettronica e Informazione (DEI)
    Politecnico di Milano
  • The Problem: Content-Based Matching
    High Performance Content-Based Matching Using GPUs - DEBS 2011
    Publishers
    Content-Based Matching
    Subscribers
    Predicate
    Filter
    (Smoke=true and Room = “Kitchen”) or (Light>30 and Room=“Bedroom”)
    Light=50, Room=Bedroom, Sender=“Sensor1”
    Attribute
    Constraint
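A minimal CPU sketch of this model (all names illustrative, not taken from the paper's code): an event carries attribute/value pairs, a constraint tests one attribute, and a filter matches when all of its constraints are satisfied. Attribute values are numeric here for simplicity; the slide's examples also use string-valued attributes such as Room.

```cpp
#include <map>
#include <string>
#include <vector>

// An event is a set of named attribute values (numeric here for simplicity).
using Event = std::map<std::string, double>;

// A constraint tests a single attribute against a value.
struct Constraint {
    std::string attr;
    char op;       // one of '=', '<', '>'
    double value;
    bool satisfiedBy(const Event& e) const {
        auto it = e.find(attr);
        if (it == e.end()) return false;   // attribute absent: not satisfied
        switch (op) {
            case '=': return it->second == value;
            case '<': return it->second <  value;
            case '>': return it->second >  value;
        }
        return false;
    }
};

// A filter is a conjunction of constraints; a predicate (not shown)
// would be a disjunction of filters.
struct Filter {
    std::vector<Constraint> constraints;
    bool matches(const Event& e) const {
        for (const auto& c : constraints)
            if (!c.satisfiedBy(e)) return false;
        return true;
    }
};
```

For instance, a filter with the constraints A>10 and B=20 matches an event carrying A=12 and B=20, while a filter requiring B>15 and C<30 does not, because the event has no attribute C.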
  • Programming GPUs: CUDA
    Introduced by Nvidia in 2006
    General purpose parallel computing architecture
    New instruction set
    New programming model
    Programmable using high-level languages
    CUDA C (a C dialect)
  • Programming Model: Basics
    The device (GPU) acts as a coprocessor for the host (CPU) and has its own separate memory space
    It is necessary to copy input data from the main memory to the GPU memory before starting a computation …
    … and to copy results back to the main memory when the computation finishes
    Often the most expensive operations
    Involve sending information through the PCI-Ex bus
    Bandwidth but also latency
    Also requires serialization of data structures!
    They must be kept simple
  • Typical Workflow
    Allocate memory on device
    Serialize and copy data to device
    Execute one or more kernels on the device
    Wait for the device to finish processing
    Copy results back
  • Programming Model: Fundamentals
    Single Program Multiple Threads implementation strategy
    A single kernel (function) is executed by multiple threads in parallel
    Threads are organized in blocks
    Threads within different blocks operate independently
    Threads within the same block cooperate to solve a single sub-problem
    The runtime provides a blockId and a threadId variable, to uniquely identify each running thread
    Accessing these variables is the only way to differentiate the work done by different threads
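A CPU sketch of this idea (identifiers follow the slide's blockId/threadId naming; CUDA C actually exposes them as blockIdx and threadIdx): every simulated thread runs the same function and uses only its ids to pick its portion of the data.

```cpp
#include <vector>

// Simulates a kernel launched on numBlocks blocks of threadsPerBlock threads.
// Each simulated thread computes one output element, chosen purely from its
// (blockId, threadId) pair -- the only per-thread state, as on the GPU.
void saxpyKernel(int blockId, int threadId, int threadsPerBlock,
                 float a, const std::vector<float>& x, std::vector<float>& y) {
    std::size_t i =
        static_cast<std::size_t>(blockId) * threadsPerBlock + threadId;
    if (i < x.size())            // guard: the grid may exceed the data size
        y[i] = a * x[i] + y[i];
}

void launch(int numBlocks, int threadsPerBlock,
            float a, const std::vector<float>& x, std::vector<float>& y) {
    for (int b = 0; b < numBlocks; ++b)            // on a real GPU these
        for (int t = 0; t < threadsPerBlock; ++t)  // loops run in parallel
            saxpyKernel(b, t, threadsPerBlock, a, x, y);
}
```

The two nested loops stand in for the hardware scheduler; nothing else distinguishes one thread from another.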
  • Programming Model: Memory management
    Hierarchical organization of memory
    All threads have access to the same common global memory
    Large (512MB-6GB) but slow (DRAM)
    Stores information received from the host
    Persistent across different function calls
    Threads within a block coordinate themselves using a shared memory
    Implemented on-chip
    Fast but limited (16-48KB)
    Each thread has its own local memory
    It’s the only “cache” available
    No hardware/system support
    Must be explicitly controlled by the application code
  • More on Memory Management
    Without hardware managed caches, accesses to global memory can easily become a bottleneck
    Issues to consider when designing algorithms and data structures
    Maximize usage of shared (block local) memory
    Without exceeding its size
    Threads with contiguous ids should access contiguous global memory regions
    The hardware can then coalesce them into a few memory-wide accesses
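One common way to honor the contiguity rule is a structure-of-arrays layout; the sketch below (a CPU illustration, not code from the paper) contrasts it with the array-of-structs alternative. When thread i reads values[i], threads with contiguous ids touch contiguous addresses, which is what the hardware can coalesce.

```cpp
#include <vector>

// Array-of-structs: thread i reading constraints[i].value touches addresses
// that are sizeof(Constraint) apart -- a strided, coalescing-unfriendly pattern.
struct Constraint {
    char op;       // '=', '<', or '>'
    float value;
    int filterId;  // filter this constraint belongs to
};

// Struct-of-arrays: thread i reading values[i] touches addresses exactly
// sizeof(float) apart, so contiguous thread ids produce contiguous accesses.
struct ConstraintArray {
    std::vector<char>  ops;
    std::vector<float> values;
    std::vector<int>   filterIds;

    void push(const Constraint& c) {
        ops.push_back(c.op);
        values.push_back(c.value);
        filterIds.push_back(c.filterId);
    }
    std::size_t size() const { return ops.size(); }
};
```

The trade-off is purely one of layout: both forms store the same constraints, but the second keeps each field in its own contiguous region.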
  • Hardware Implementation
    An array of Streaming Multiprocessors (SMs) containing many (extremely simple) processing cores
    Each SM executes threads in groups of 32 called warps
    Scheduling is performed in hardware with zero overhead
    Optimized for data parallel problems
    Maximum efficiency only if all threads in a warp agree on the execution path
  • Some Numbers
    NVIDIA GTX 460
    1GB RAM (Global Memory)
    7 Streaming Multiprocessors
    Each SM contains 48 cores
    Each SM manages up to 48 warps (32 threads each)
    Up to 10752 threads managed concurrently!!!
    Up to 336 threads running concurrently!!!
    Today’s cheap GPU: less than $160
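The two "concurrently" figures follow directly from the numbers above: resident warps per SM times the warp size gives the threads managed, while the core count gives the threads actually executing at any instant.

```cpp
// Occupancy arithmetic for the GTX 460 configuration on the slide.
constexpr int sms            = 7;    // streaming multiprocessors
constexpr int coresPerSm     = 48;   // processing cores per SM
constexpr int warpsPerSm     = 48;   // maximum resident warps per SM
constexpr int threadsPerWarp = 32;

// Threads the scheduler keeps in flight (resident, ready to swap in):
constexpr int threadsManaged = sms * warpsPerSm * threadsPerWarp;  // 10752
// Threads with an actual core executing them at one instant:
constexpr int threadsRunning = sms * coresPerSm;                   // 336
```

The large gap between the two numbers is deliberate: oversubscribing the cores lets the zero-overhead scheduler hide memory latency by swapping warps.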
  • Existing Algorithms
    Two approaches
    Counting algorithms
    Tree-based algorithms
    Complex data structures to optimize sequential execution
    Trees, Maps, …
    Lots of pointers!!!
    Hardly fit the data parallel programming model!
  • Algorithm Description
    (Diagram: filters F1: A>10 and B=20 and F2: B>15 and C<30 belong to subscriber S1; filter F3: D=20 belongs to subscriber S2. For the event A=12, B=20, the per-filter counts of satisfied constraints are F1: 2, F2: 1, F3: 0, so only F1 is fully satisfied.)
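The counting idea behind the diagram can be sketched on the CPU as follows (structure illustrative): each satisfied constraint increments its filter's counter, and a filter matches when the counter reaches that filter's total number of constraints.

```cpp
#include <vector>

struct Constraint {
    int filterId;   // filter this constraint belongs to
    char op;        // '=', '>', or '<'
    double value;
};

// One counting pass: 'constraints' holds the constraints on one attribute,
// 'attrValue' is that attribute's value in the current event.
void countPass(const std::vector<Constraint>& constraints, double attrValue,
               std::vector<int>& filterCount) {
    for (const auto& c : constraints) {
        bool sat = (c.op == '=') ? attrValue == c.value
                 : (c.op == '>') ? attrValue >  c.value
                 :                 attrValue <  c.value;  // '<'
        if (sat) ++filterCount[c.filterId];  // an atomic add on the GPU
    }
}
```

Replaying the diagram's example: filter 0 (A>10 and B=20) reaches count 2 and matches; filter 1 (B>15 and C<30) stops at 1; filter 2 (D=20) stays at 0 since the event carries no D.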
  • Algorithm Description
    Constraints with the same attribute name are stored in arrays on the GPU
    Contiguous memory regions
    When processing an event E, the CPU selects all relevant constraint arrays
    Based on the names of the attributes in E
  • Algorithm Description
    Bi-dimensional organization of threads
    One thread for each attribute/constraint pair
    Threads in the same block evaluate the same attribute
    It can be copied in shared memory
    Threads with contiguous ids access contiguous constraints
    Accesses coalesced into a few memory-wide operations
    Filter counts are updated with an atomic operation
    (Diagram: event attributes B=32, C=21, A=7)
  • Improvement
    Problem: before processing each event we need to reset the filter counts and the interface selection vector
    Naïve version: use a memset
    Communication with the GPU introduces additional delay
    Solution: keep two copies of the filter counts and the interface vector
    While processing an event
    One copy is used
    One copy is reset for the next event
    Inside the same kernel
    No communication overhead
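A CPU sketch of this double-buffering trick (names illustrative): processing event n uses one copy of the counters and, in the same pass, zeroes the other copy, so no separate reset round-trip to the device is needed.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

struct MatchingState {
    std::array<std::vector<int>, 2> filterCount;  // two copies of the counters
    int current = 0;                              // copy used for this event

    explicit MatchingState(std::size_t numFilters) {
        filterCount[0].assign(numFilters, 0);
        filterCount[1].assign(numFilters, 0);
    }

    // Returns the counters to use for this event and, as a side effect,
    // resets the *other* copy so it is clean for the next event. On the
    // GPU both happen inside the same kernel, avoiding an extra memset
    // and its host/device communication.
    std::vector<int>& beginEvent() {
        std::vector<int>& active = filterCount[current];
        std::vector<int>& spare  = filterCount[1 - current];
        std::fill(spare.begin(), spare.end(), 0);
        current = 1 - current;   // ping-pong for the next event
        return active;
    }
};
```

Each call hands out a freshly zeroed buffer while the previous one is being recycled, at the cost of keeping both copies resident in GPU memory.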
  • Results: Default Scenario
    Comparison against state of the art sequential implementation
    SFF (Siena) 1.9.4
    AMD CPU @ 2.8GHz
    Default scenario
    Relatively “simple”
    10 interfaces, 25k filters, 1M constraints
    Analysis changing various parameters
    We measure latency
    Processing time for a single event
    (Chart: ~7× speedup over SFF in the default scenario)
  • Results: Number of Constraints
    (Chart: up to 10× speedup)
  • Results: Number of Filters
    (Chart: up to 13× speedup)
  • Results
    What is the time needed to install subscriptions?
    Need to serialize data structures
    Need to copy from CPU memory to GPU memory
    But data structures are simple!
    Memory requirements?
    35MB in the default scenario
    Up to 200MB in all our tests
    Not a problem for a modern GPU
  • Results
    We measured the latency when processing a single event
    0.14 ms processing time → ~7,000 events/s?
    What about the maximum throughput?
    (Chart: maximum measured throughput of 9,400 events/s)
  • Conclusions
    Benefits of GPU in a wide range of scenarios
    In particular in the most challenging workloads
    Additional advantage
    It leaves the CPU free to perform other tasks
    e.g., communication-related tasks
    Available for download
    Includes a translator from Siena subscriptions / messages
    More info at http://home.dei.polimi.it/margara
  • Future Work
    We are currently working with multi-core CPUs
    Using OpenMP
    We are currently testing our algorithm within a real system
    Both GPUs and multi-core CPUs
    Take into account communication overhead
    Measure of latency and throughput
    We plan to explore the advantages of GPUs with probabilistic (as opposed to exact) matching
    Encoded filters (Bloom filters)
    Balance between performance and percentage of false positives
  • Questions?