This document discusses using GPUs to improve the performance of content-based matching. It describes how GPUs can process subscriptions and events in parallel using thousands of lightweight threads. The algorithm stores constraints in arrays to maximize memory coalescing. Testing shows the GPU implementation is 7-13x faster than a software implementation on CPUs and can process over 9,000 events per second while using modest memory. Future work includes integrating the algorithm into a real system and exploring probabilistic matching.
Content-Based Matching on GPUs
1. High Performance Content-Based Matching Using GPUs. Alessandro Margara and Gianpaolo Cugola (margara@elet.polimi.it, cugola@elet.polimi.it), Dip. Elettronica e Informazione (DEI), Politecnico di Milano
2. The Problem: Content-Based Matching (from "High Performance Content-Based Matching Using GPUs", DEBS 2011). Publishers publish events that a content-based matcher delivers to interested subscribers. Each subscription is a predicate, i.e., a disjunction of filters such as (Smoke=true and Room="Kitchen") or (Light>30 and Room="Bedroom"), where each filter is a conjunction of constraints. An event is a set of attributes, e.g., Light=50, Room="Bedroom", Sender="Sensor1".
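To make the terminology concrete, here is a minimal sketch of the matching problem in plain C. All names (`Attribute`, `Constraint`, `filter_matches`, the integer encoding of attribute values) are illustrative, not the data structures used in the paper:

```c
#include <string.h>

/* Illustrative model: an event is a set of attributes (name/value pairs);
 * a filter is a conjunction of constraints; a predicate (subscription)
 * is a disjunction of filters. String-valued attributes like Room are
 * encoded as integers here for brevity. */

typedef struct { const char *name; int value; } Attribute;
typedef enum { OP_EQ, OP_GT } Op;
typedef struct { const char *name; Op op; int operand; } Constraint;

/* A constraint is satisfied if the event carries the named attribute
 * and its value passes the comparison. */
static int constraint_matches(const Constraint *c,
                              const Attribute *attrs, int n_attrs) {
    for (int i = 0; i < n_attrs; i++) {
        if (strcmp(attrs[i].name, c->name) == 0) {
            switch (c->op) {
            case OP_EQ: return attrs[i].value == c->operand;
            case OP_GT: return attrs[i].value >  c->operand;
            }
        }
    }
    return 0;  /* attribute absent: constraint not satisfied */
}

/* A filter matches if every one of its constraints matches. */
static int filter_matches(const Constraint *cs, int n_cs,
                          const Attribute *attrs, int n_attrs) {
    for (int i = 0; i < n_cs; i++)
        if (!constraint_matches(&cs[i], attrs, n_attrs))
            return 0;
    return 1;
}
```

Evaluating a whole subscription then reduces to OR-ing `filter_matches` over its filters; it is this kind of constraint-by-constraint evaluation that the deck proposes to parallelize across thousands of GPU threads.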
3. Programming GPUs: CUDA. Introduced by Nvidia in 2006, CUDA is a general-purpose parallel computing architecture comprising a new instruction set and a new programming model, programmable using high-level languages such as CUDA C (a C dialect).
4. Programming Model: Basics. The device (GPU) acts as a coprocessor for the host (CPU) and has its own separate memory space. It is necessary to copy input data from main memory to GPU memory before starting a computation, and to copy results back to main memory when the computation finishes. These copies are often the most expensive operations: they send information through the PCI-Express bus, which limits bandwidth and adds latency, and they require serialization of data structures, which must therefore be kept simple.
5. Typical Workflow. Allocate memory on the device; serialize and copy data to the device; execute one or more kernels on the device; wait for the device to finish processing; copy results back.
6. Programming Model: Fundamentals. CUDA follows a Single Program Multiple Threads execution strategy: a single kernel (function) is executed by many threads in parallel. Threads are organized in blocks; threads in different blocks operate independently, while threads within the same block cooperate to solve a single sub-problem. The runtime provides blockId and threadId variables to uniquely identify each running thread; reading these variables is the only way to differentiate the work done by different threads.
7. Programming Model: Memory Management. Memory is organized hierarchically. All threads have access to a common global memory: large (512MB-6GB) but slow (DRAM), it stores the data received from the host and persists across kernel calls. Threads within a block coordinate through an on-chip shared memory: fast but small (16-48KB). Each thread also has its own local memory. There is no hardware- or system-managed cache: the only "cache" available is what the application code explicitly controls.
8. More on Memory Management. Without hardware-managed caches, accesses to global memory easily become a bottleneck. Two issues to consider when designing algorithms and data structures: maximize use of the (block-local) shared memory without exceeding its size, and have threads with contiguous ids access contiguous global memory regions, so that the hardware can coalesce them into a few memory-wide accesses.
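The coalescing rule is essentially a layout question: structure-of-arrays layouts put the element touched by thread t at a stride-1 offset, while array-of-structures layouts spread consecutive threads' accesses apart. A small index-arithmetic sketch (illustrative names, not from the paper):

```cpp
#include <cstddef>

// Structure-of-arrays: operand t of attribute a lives at base + a*N + t,
// so threads with contiguous ids touch contiguous addresses (coalesced).
std::size_t soa_index(std::size_t attr, std::size_t thread, std::size_t numConstraints) {
    return attr * numConstraints + thread;
}

// Array-of-structures: the same operand sits inside a struct of width
// `stride`, so consecutive threads touch addresses `stride` apart
// (not coalesced).
std::size_t aos_index(std::size_t attr, std::size_t thread, std::size_t stride) {
    return thread * stride + attr;
}
```

This is why the algorithm below stores constraints in flat per-attribute arrays rather than in structs holding one whole filter each.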
9. Hardware Implementation. The GPU is an array of Streaming Multiprocessors (SMs), each containing many (extremely simple) processing cores. Each SM executes threads in groups of 32 called warps; scheduling is performed in hardware with zero overhead. The design is optimized for data-parallel problems: maximum efficiency is reached only if all threads in a warp agree on the execution path.
10. Some Numbers. NVIDIA GTX 460: 1GB RAM (global memory), 7 Streaming Multiprocessors, 48 cores per SM, each SM managing up to 48 warps of 32 threads each. That is up to 10752 threads managed concurrently and up to 336 threads running concurrently, in a cheap GPU costing less than $160 today.
11. Existing Algorithms. Two main approaches exist: counting algorithms and tree-based algorithms. Both rely on complex data structures (trees, maps, ...) full of pointers, optimized for sequential execution; they hardly fit the data-parallel programming model.
12. Algorithm Description. A counting example: filters F1: A>10 and B=20, F2: B>15 and C<30 (subscriber S1), F3: D=20 (subscriber S2). Processing the event A=12, B=20 brings the per-filter counters to 2, 1, 0: F1 is fully satisfied (both of its constraints match), F2 only partially, F3 not at all.
13. Algorithm Description. Constraints with the same attribute name are stored in a single array on the GPU, i.e., in contiguous memory regions. When processing an event E, the CPU selects the relevant constraint arrays based on the names of the attributes in E.
14. Algorithm Description. Threads are organized bi-dimensionally, one thread for each attribute/constraint pair (event attributes, e.g. B=32, C=21, A=7). Threads in the same block evaluate the same attribute, so its value can be copied into shared memory; threads with contiguous ids access contiguous constraints, so their accesses are coalesced into a few memory-wide operations. Each filter's count is updated with an atomic operation.
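The counting logic of slides 12-14 can be mirrored on the CPU to make it concrete. In this sketch each inner loop iteration plays the role of one GPU thread, and the plain increment stands in for the kernel's atomic add; all names are my own, not the paper's:

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

enum class Op { Eq, Gt, Lt };

// A constraint remembers which filter it belongs to.
struct Constraint { Op op; int operand; int filterId; };

// One contiguous array of constraints per attribute name, as on the GPU.
using ConstraintTable = std::unordered_map<std::string, std::vector<Constraint>>;

// CPU mirror of the counting kernel: one "thread" per attribute/constraint
// pair bumps the counter of the owning filter; a filter matches when its
// counter reaches its total number of constraints.
std::vector<bool> match(const ConstraintTable& table,
                        const std::vector<int>& filterSize,
                        const std::vector<std::pair<std::string, int>>& event) {
    std::vector<int> count(filterSize.size(), 0);
    for (const auto& [name, value] : event) {
        auto it = table.find(name);
        if (it == table.end()) continue;      // CPU skips irrelevant arrays
        for (const auto& c : it->second) {    // each iteration = one GPU thread
            bool sat = (c.op == Op::Eq && value == c.operand) ||
                       (c.op == Op::Gt && value >  c.operand) ||
                       (c.op == Op::Lt && value <  c.operand);
            if (sat) ++count[c.filterId];     // atomic add on the GPU
        }
    }
    std::vector<bool> matched(filterSize.size());
    for (std::size_t i = 0; i < matched.size(); ++i)
        matched[i] = (count[i] == filterSize[i]);
    return matched;
}
```

Running this on the slide-12 example (F1: A>10 and B=20; F2: B>15 and C<30; F3: D=20, event A=12, B=20) reproduces the counters 2, 1, 0 and reports only F1 as matched.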
15. Improvement. Problem: before processing each event, the per-filter counts and the interface selection vector must be reset. The naïve version uses a memset, but this extra communication with the GPU introduces additional delay. Solution: keep two copies of the filter counts and the interface vector. While one copy is used to process an event, the other is reset inside the same kernel, ready for the next event, with no communication overhead.
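The double-buffering trick can be sketched on the CPU as a ping-pong pair of counter arrays (an illustrative sketch; the paper's kernel does the spare-copy reset in device code, fused into the matching kernel):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Two copies of the per-filter counters; each event uses one copy while
// the other is zeroed for the next event, avoiding a separate memset
// round-trip to the GPU.
struct PingPongCounters {
    std::array<std::vector<int>, 2> buf;
    int active = 0;

    explicit PingPongCounters(std::size_t n) {
        buf[0].assign(n, 0);
        buf[1].assign(n, 0);
    }

    // Counters used while processing the current event.
    std::vector<int>& current() { return buf[active]; }

    // At the end of each event: reset the spare copy (done inside the same
    // kernel launch on the GPU), then swap roles for the next event.
    void finishEvent() {
        std::fill(buf[1 - active].begin(), buf[1 - active].end(), 0);
        active = 1 - active;
    }
};
```

The invariant is that current() is always freshly zeroed at the start of an event, without ever issuing a standalone reset operation.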
16. Results: Default Scenario. Comparison against a state-of-the-art sequential implementation, SFF (Siena) 1.9.4, on an AMD CPU @ 2.8GHz. The default scenario is relatively "simple": 10 interfaces, 25k filters, 1M constraints. The analysis varies several parameters and measures latency, i.e., the processing time for a single event. Speedup: 7x.
17. Results: Number of Constraints. Speedup: 10x.
18. Results: Number of Filters. Speedup: 13x.
19. Results. What is the time needed to install subscriptions? Data structures must be serialized and copied from CPU memory to GPU memory, but they are simple, so installation is cheap. Memory requirements: 35MB in the default scenario, and up to 200MB across all our tests, which is not a problem for a modern GPU.
20. Results. We measured a latency of 0.14ms when processing a single event, which by itself would suggest about 7000 events/s. What about the maximum throughput? Measured: 9400 events/s.
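The slide's arithmetic is worth making explicit: latency alone bounds throughput at 1/latency, about 7142 events/s for 0.14ms, and the measured 9400 events/s exceeds that bound, presumably because work on consecutive events overlaps (a reading I am inferring; the slide does not say how the higher figure is achieved):

```cpp
// Throughput bound implied by per-event latency alone: 1 / latency.
double latencyBoundThroughput(double latencySeconds) {
    return 1.0 / latencySeconds;
}
```

For latencySeconds = 0.00014 this gives roughly 7142.9 events/s, just above the slide's rounded "7000 events/s?" figure.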
21. Conclusions. The GPU brings benefits in a wide range of scenarios, particularly in the most challenging workloads. An additional advantage: it leaves the CPU free to perform other tasks, e.g. communication-related ones. The implementation is available for download and includes a translator from Siena subscriptions/messages. More info at http://home.dei.polimi.it/margara
22. Future Work. We are currently extending the approach to multi-core CPUs (using OpenMP) and testing our algorithm within a real system, on both GPUs and multi-core CPUs, to take communication overhead into account and measure latency and throughput. We also plan to explore the advantages of GPUs for probabilistic (as opposed to exact) matching, using encoded filters (Bloom filters) to balance performance against the percentage of false positives.