More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
Node.js at Joyent: Engineering for Production (jclulow)
Joyent is one of the largest deployers of Node.js in production systems. In order to successfully deploy large-scale, distributed systems, we must understand the systems we build! For us, that means having first-class tools for debugging our software, and understanding and improving its performance.
Come on a whirlwind tour of the tools and techniques we use at Joyent as we build out large-scale distributed software with Node.js: from mdb for Post-Mortem Debugging, to Flame Graphs for performance analysis; from DTrace for dynamic, production-safe instrumentation and tracing, to JSON-formatted logging with Bunyan.
Java and the machine - Martijn Verburg and Kirk Pepperdine (JAX London)
In Terminator 3 - Rise of the Machines, bare metal comes back to haunt humanity, ruthlessly crushing all resistance. This keynote is here to warn you that the same thing is happening to Java and the JVM! Java was designed in a world where there were a wide range of hardware platforms to support. Its premise of Write Once Run Anywhere (WORA) proved to be one of the compelling reasons behind Java's dominance (even if the reality didn't quite meet the marketing hype). However, this WORA property means that Java and the JVM struggled to utilise specialist hardware and operating system features that could make a massive difference in the performance of your application. This problem has recently gotten much, much worse. Due to the rise of multi-core processors, massive increases in main memory and enhancements to other major hardware components (e.g. SSD), the JVM is now distant from utilising that hardware, causing some major performance and scalability issues! Kirk Pepperdine and Martijn Verburg will take you through the complexities of where Java meets the machine and loses. They'll give up some of their hard-won insights on how to work around these issues so that you can plan to avoid termination, unlike some of the poor souls that ran into the T-800...
Improving the Performance of the qcow2 Format (KVM Forum 2017) (Igalia)
By Alberto García.
qcow2 is QEMU's native file format for storing disk images. One of its features is that it grows dynamically, so disk space is only allocated when the virtual machine needs to store data. This makes the format efficient in terms of space requirements, but has an impact on its I/O performance. This presentation will describe some of those performance problems and will discuss possible ways to address them. Some of them can be solved by simply adjusting configuration parameters, others require improving the qcow2 driver in QEMU, and others need extending the file format itself.
(c) KVM Forum 2017
October 25 - 27, 2017
Hilton Prague, Prague, Czech Republic
http://events.linuxfoundation.org/events/archive/2017/kvm-forum
Accelerating HBase with NVMe and Bucket Cache (David Grier)
This set of slides describes initial experiments we designed to discover performance improvements in Hadoop technologies using NVMe storage.
Accelerating HBase with NVMe and Bucket Cache (Nicolas Poggi)
The Non-Volatile Memory Express (NVMe) standard promises an order of magnitude faster storage than regular SSDs, while at the same time being more economical than regular RAM in TB/$. This talk evaluates the use cases and benefits of NVMe drives for use in Big Data clusters with HBase and Hadoop HDFS.
First, we benchmark the different drives using system level tools (FIO) to get maximum expected values for each different device type and set expectations. Second, we explore the different options and use cases of HBase storage and benchmark the different setups. And finally, we evaluate the speedups obtained by the NVMe technology for the different Big Data use cases from the YCSB benchmark.
In summary, while the NVMe drives show up to 8x speedup in best case scenarios, testing the cost-efficiency of new device technologies is not straightforward in Big Data, where we need to overcome system level caching to measure the maximum benefits.
Trip down the GPU lane with Machine Learning (Renaldas Zioma)
What a Machine Learning professional should know about the GPU!
Brief outline of the deck:
* GPU architecture explained with simple images
* memory bandwidth cheat-sheets for common hardware configurations
* overview of the GPU programming model
* an under-the-hood peek at the main building block of ML - matrix multiplication
* the effect of mini-batch size on performance
Originally I gave this talk at the internal Machine Learning Workshop at Unity Seattle.
HIGH QUALITY pdf slides: http://bit.ly/2iQxm7X (on Dropbox)
Elasticsearch Architecture & What's New in Version 5 (Burak Tungut)
General architectural concepts of Elasticsearch and what's new in version 5? Examples were prepared using our company's business data and are therefore excluded from the presentation.
Responding rapidly when you have 100+ GB data sets in Java (Peter Lawrey)
One way to speed up your application is to bring more of your data into memory. But how do you handle hundreds of GB of data in a JVM, and what tools can help you?
Mentions: Speedment, Azul, Terracotta, Hazelcast and Chronicle.
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi... (Fred de Villamil)
The talk I gave at the Snow Unix Event in the Netherlands about upgrading a massive production Elasticsearch cluster from one major version to another without downtime and with a complete rollback plan.
Optimizing MongoDB: Lessons Learned at Localytics (andrew311)
Tips, tricks, and gotchas learned at Localytics for optimizing MongoDB installs. Includes information about document design, indexes, fragmentation, migration, AWS EC2/EBS, and more.
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ... (Chester Chen)
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLlib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed PageRank. Beyond rooflining, we believe there are great opportunities from deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism, and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
Have you heard that all in-memory databases are equally fast but unreliable, inconsistent and expensive? This session highlights in-memory technology that busts all those myths.
Redis, the fastest database on the planet, is not simply an in-memory key-value data store, but rather a rich in-memory data-structure engine that serves the world’s most popular apps. Redis Labs’ unique clustering technology enables Redis to be highly reliable, keeping every data byte intact despite hundreds of cloud instance failures and dozens of complete data-center outages. It delivers full CP system characteristics at high performance. And with the latest Redis on Flash technology, Redis Labs achieves close to in-memory performance at 70% lower operational costs. Learn about the best uses of in-memory computing to accelerate everyday applications such as high volume transactions, real time analytics, IoT data ingestion and more.
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I... (Databricks)
In this session, the speakers will discuss their experiences porting Apache Spark to the Cray XC family of supercomputers. One scalability bottleneck is in handling the global file system present in all large-scale HPC installations. Using two techniques (file open pooling, and mounting the Spark file hierarchy in a specific manner), they were able to improve scalability from O(100) cores to O(10,000) cores. This is the first result at such a large scale on HPC systems, and it had a transformative impact on research, enabling their colleagues to run on 50,000 cores.
With this baseline performance fixed, they will then discuss the impact of the storage hierarchy and of the network on Spark performance. They will contrast a Cray system with two levels of storage with a “data intensive” system with fast local SSDs. The Cray contains a back-end global file system and a mid-tier fast SSD storage. One conclusion is that local SSDs are not needed for good performance on a very broad workload, including spark-perf, TeraSort, genomics, etc.
They will also provide a detailed analysis of the impact of latency of file and network I/O operations on Spark scalability. This analysis is very useful to both system procurements and Spark core developers. By examining the mean/median value in conjunction with variability, one can infer the expected scalability on a given system. For example, the Cray mid-tier storage has been marketed as the magic bullet for data intensive applications. Initially, it did improve scalability and end-to-end performance. After understanding and eliminating variability in I/O operations, they were able to outperform any configurations involving mid-tier storage by using the back-end file system directly. They will also discuss the impact of network performance and contrast results on the Cray Aries HPC network with results on InfiniBand.
Lightning talk showing various aspects of software system performance. It goes through: latency, data structures, garbage collection, troubleshooting methods like the workload saturation method, quick diagnostic tools, flame graphs, and PerfView.
http://cs264.org
Abstract:
High-level scripting languages are in many ways polar opposites to
GPUs. GPUs are highly parallel, subject to hardware subtleties, and
designed for maximum throughput, and they offer a tremendous advance
in the performance achievable for a significant number of
computational problems. On the other hand, scripting languages such as
Python favor ease of use over computational speed and do not generally
emphasize parallelism. PyOpenCL and PyCUDA are two packages that
attempt to join the two together. By showing concrete examples, both
at the toy and the whole-application level, this talk aims to
demonstrate that by combining these opposites, a programming
environment is created that is greater than just the sum of its two
parts.
Speaker biography:
Andreas Klöckner obtained his PhD degree working with Jan Hesthaven at
the Department of Applied Mathematics at Brown University. He worked
on a variety of topics all aiming to broaden the utility of
discontinuous Galerkin (DG) methods. This included their use in the
simulation of plasma physics and the demonstration of their particular
suitability for computation on throughput-oriented graphics processors
(GPUs). He also worked on multi-rate time stepping methods and shock
capturing schemes for DG.
In the fall of 2010, he joined the Courant Institute of Mathematical
Sciences at New York University as a Courant Instructor. There, he is
working on problems in computational electromagnetics with Leslie
Greengard.
His research interests include:
- Discontinuous Galerkin and integral equation methods for wave
propagation
- Programming tools for parallel architectures
- High-order unstructured particle-in-cell methods for plasma simulation
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl... (npinto)
Abstract:
Machine learning researchers and practitioners develop computer
algorithms that "improve performance automatically through
experience". At Google, machine learning is applied to solve many
problems, such as prioritizing emails in Gmail, recommending tags for
YouTube videos, and identifying different aspects from online user
reviews. Machine learning on big data, however, is challenging. Some
"simple" machine learning algorithms with quadratic time complexity,
while running fine with hundreds of records, are almost impractical to
use on billions of records.
In this talk, I will describe lessons drawn from various Google
projects on developing large scale machine learning systems. These
systems build on top of Google's computing infrastructure such as GFS
and MapReduce, and attack the scalability problem through massively
parallel algorithms. I will present the design decisions made in
these systems, strategies of scaling and speeding up machine learning
systems on web scale data.
Speaker biography:
Max Lin is a software engineer with Google Research in the New York City
office. He is the tech lead of the Google Prediction API, a machine
learning web service in the cloud. Prior to Google, he published
research work on video content analysis, sentiment analysis, machine
learning, and cross-lingual information retrieval. He holds a PhD in
Computer Science from Carnegie Mellon University.
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's CUDA (Gene Cooperman, NEU)
1. Out-of-Core Programming with NVIDIA’s CUDA
Gene Cooperman
High Performance Computing Lab
College of Computer and Information Science
Northeastern University
Boston, Massachusetts 02115
USA
gene@ccs.neu.edu
2. Pencil and Paper Calculation
• GeForce 8800:
– 16 Streaming Multiprocessors (SMs),
8 cores per SM: 128 cores
– Aggregate bandwidth to off-chip global memory: 86.4 GB/s (optimal)
– Average bandwidth to global memory per core: 0.67 GB/s
• Motherboard
– 4 CPU cores
– About 10 GB/s bandwidth to main RAM
– Average bandwidth to RAM per core: 2.5 GB/s
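The per-core figures above follow directly from the slide's own numbers; a minimal check in code (all constants are the slide's estimates, not measurements):

```python
# Back-of-envelope bandwidth-per-core figures from the slide.
# GeForce 8800: 16 SMs x 8 cores each, 86.4 GB/s peak to global memory.
gpu_cores = 16 * 8                      # 128 cores
gpu_bw_per_core = 86.4 / gpu_cores      # ~0.67 GB/s per core

# Motherboard: 4 CPU cores sharing about 10 GB/s to main RAM.
cpu_cores = 4
cpu_bw_per_core = 10.0 / cpu_cores      # 2.5 GB/s per core

print(f"GPU core: {gpu_bw_per_core:.2f} GB/s, CPU core: {cpu_bw_per_core:.2f} GB/s")
```

Each CPU core thus enjoys roughly four times the per-core memory bandwidth of a GPU core, which is why keeping the pipe to global memory full dominates CUDA performance tuning.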
3. Keeping Pipe to Memory Flowing
• Thread block: threads on a single chip
• Thread block organized into warps
• Warp of 32 threads required (minimize overhead of switching thread blocks)
• Highest bandwidth when all SMs executing same code
4. Memory-Bound Computations
• So, how much data can we keep in the SMs before they overflow?
• 16 KB/SM × 16 SMs → 256 KB total cache
• Any computation with an active working set of more than 256 KB risks being memory-bound.
5. Memory Bandwidth in Numbers
(Thanks to Kapil Arya and Viral Gupta; illustrative of trends only.)
[Plot: bandwidth (MB/s) on the Y-axis vs. number of thread blocks on the X-axis; one curve per number of threads per thread block.]
6. Is Life Any Better Back on the Motherboard?
• Up to 10 GB/s bandwidth to main RAM (in practice, perhaps five times slower than the NVIDIA card)
• Four cores competing for bandwidth
• Cache of at least 1 MB, and possibly much more (e.g., an L3 cache)
• Conclusion: Less pressure on memory, but similar order of magnitude
7. Is Life Any Better between CPU and Disk?
• Between 0.05 GB/s and 0.1 GB/s bandwidth to disk
• Four cores competing for bandwidth
• Cache consists of 4 GB or more of RAM
• Conclusion: huge pressure on memory (but RAM as cache is large)
8. Our Solution
• Disk is the New RAM
• Bandwidth of Disk: ~100 MB/s
• Bandwidth of 50 Disks: 50 × 100 MB/s = 5 GB/s
• Bandwidth of RAM: approximately 5 GB/s
• Conclusion:
1. CLAIM: A computer cluster of 50 quad-core nodes, each with 500 GB of mostly
idle disk space, is a good approximation to a shared memory computer with 200
CPU cores and a single subsystem with 25 TB of shared memory.
(The arguments also work for a SAN with multiple access nodes, but we consider
local disks for simplicity.)
2. The disks of a cluster can serve as if they were RAM.
3. The traditional RAM can then serve as if it were cache.
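The claim's arithmetic can be sketched directly, under the slide's assumptions (100 MB/s per disk, 4 cores and 500 GB per node, ~5 GB/s per RAM subsystem):

```python
# "Disk is the new RAM": aggregate disk bandwidth across a cluster
# roughly matches one RAM subsystem. All numbers are the slide's estimates.
nodes = 50
disk_bw_mb_s = 100                                   # per-disk streaming bandwidth
aggregate_disk_gb_s = nodes * disk_bw_mb_s / 1000.0  # 5.0 GB/s across the cluster
ram_gb_s = 5.0                                       # one RAM subsystem, approx.

cores = nodes * 4                                    # 200 CPU cores
shared_tb = nodes * 500 / 1000.0                     # 25 TB of "shared memory"
print(aggregate_disk_gb_s, ram_gb_s, cores, shared_tb)
```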
10. What About Disk Latency?
• Unfortunately, adding 50 disks does nothing to improve latency.
• So, re-organize the data structures and low-level algorithms.
• Our group has five years of case histories applying this approach in computational algebra — but each case required months of development and debugging.
• We’re now developing both higher level abstractions for run-time libraries, and a
language extension that will make future development much faster.
11. Applications Benefiting from Disk-Based Parallel Computation
Discipline Example Application
1. Verification Symbolic Computation using BDDs
2. Verification Explicit State Verification
3. Comp. Group Theory Search and Enumeration in Mathematical Structures
4. Coding Theory Search for New Codes
5. Security Exhaustive Search for Passwords
6. Semantic Web RDF query language; OWL Web Ontology Language
7. Artificial Intelligence Planning
8. Proteomics Protein folding via a kinetic network model
9. Operations Research Branch and Bound
10. Operations Research Integer Programming (applic. of Branch-and-Bound)
11. Economics Dynamic Programming
12. Numerical Analysis ATLAS, PHiPAC, FFTW, and other adaptive software
13. Engineering Sensor Data
14. A.I. Search Rubik’s Cube
12. Central Claim
Suppose one had a single computer with 10 terabytes of RAM and 200 CPU cores. Does
that satisfy your need for computers with more RAM?
CLAIM: A computer cluster of 50 quad-core nodes, each with a 500 GB local disk, is
a good approximation of the above computer. (The arguments also work for a SAN with
multiple access nodes, but we discuss local disks for simplicity.)
13. When is a cluster like a 10 TB shared memory computer?
• Assume 200 GB/node of free disk space
• Assume 50 nodes,
• The bandwidth of 50 disks is 50 × 100 MB/s = 5 GB/s.
• The bandwidth of a single RAM subsystem is about 5 GB/s.
CLAIM: You probably have the 10 TB of temporary disk space lying idle on your own
recent-model computer cluster. You just didn’t know it.
(Or were you just not telling other people about the space, so you could use it for yourself?)
The economics of disks are such that one saves very little by buying less than a 500 GB
disk per node. It's common to buy the 500 GB disk and reserve the extra space for
expansion.
14. When is a cluster NOT like a 10 TB shared memory computer?
1. We require a parallel program. (We must access the local disks of many cluster nodes
in parallel.)
2. The latency problem of disk.
3. Can the network keep up with the disk?
15. When is a cluster NOT like a 10 TB shared memory computer?
. . . and why doesn’t it matter for our purposes?
• ANSWER 1: We’ve used this architecture, and it works for us.
• We’ve developed solutions for a series of algorithmically simple computational kernels
from computational algebra — especially mathematical group theory. All of the
following computations completed in less than one cluster-week on a cluster of 60 nodes
or less.
– Construction of Thompson Sporadic Simple Group (2003)
2 gigabytes (temporary space), 1.4 × 10^8 states, 4 bytes per state
– Construction of Baby Monster Sporadic Simple Group (2006)
6 terabytes (temporary space), 1.4 × 10^10 states, 12 bytes per state
– Condensation of Fi23 Sporadic Simple Group (2007)
400 GB (temporary space), 1.2 × 10^10 states, 30 bytes per state
(larger condensation for J4 now in progress)
– Rubik’s Cube: 26 Moves Suffice to Solve Rubik’s Cube (2007)
7 terabytes (temporary space), 10^12 states, 6 bytes per state
– In progress: coset enumeration (pointer-chasing: similar to algorithm for converting
NFA to DFA (finite automata)).
16. When is a cluster NOT like a 10 TB shared memory computer?
1. We require a parallel program.
2. The latency problem of disk.
3. Can the network keep up with the disk?
17. When is a cluster NOT like a 10 TB shared memory computer?
. . . and why doesn’t it matter for our purposes?
1. We require a parallel program. (We must access the local disks of many nodes in
parallel.)
• Our bet (still to be proved): Any sequential algorithm that already creates gigabytes
of RAM-based data should have a way to create that data in parallel.
2. The latency problem of disk. Solutions exist:
(a) For duplicates on frontier in state space search: Delayed Duplicate Detection
implies waiting until many nodes of the next frontier (and duplicates from previous
iterations) have been discovered. Then remove duplicates.
(b) For hash tables, wait until there are millions of hash queries. Then sort on the hash
index, and scan the disk to resolve queries.
(c) For pointer-chasing, wait until millions of pointers are available for chasing. Then
sort and scan the disk to dereference pointers.
(d) For tracing strings, with each string being a lookup, wait until millions of strings are
available. Then ....
3. Can the network keep up with the disk?
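The batching pattern behind solutions 2(b)-2(d) is the same in each case: accumulate requests until a sequential pass is amortized, sort them on the access key, then resolve them all in one scan. A minimal in-memory sketch (the sorted list stands in for a sorted file on disk; the function name is mine, not the talk's):

```python
# Latency hiding by batching: answer many point lookups with one
# merge-style forward scan instead of one random access per query.
def resolve_batch(sorted_table, queries):
    """sorted_table: list of (key, value) pairs sorted by key
    (stands in for a sorted file streamed from disk).
    queries: unordered keys accumulated since the last batch.
    Returns {key: value} for every key found."""
    results = {}
    i = 0
    for key in sorted(queries):      # sort queries on the lookup key
        while i < len(sorted_table) and sorted_table[i][0] < key:
            i += 1                   # one forward pass over the table
        if i < len(sorted_table) and sorted_table[i][0] == key:
            results[key] = sorted_table[i][1]
    return results

table = [(k, k * k) for k in range(0, 1000, 3)]   # sorted by key
print(resolve_batch(table, [9, 300, 4, 999]))     # key 4 is absent
```

Random lookups that would each pay a disk seek become a single sorted merge at streaming bandwidth, which is exactly how the latency problem is hidden.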
18. When is a cluster NOT like a 10 TB shared memory computer?
. . . and why doesn’t it matter for our purposes?
1. We require a parallel program. (We must access the local disks of many nodes in
parallel.)
2. The latency problem of disk.
3. Can the network keep up with the disk?
(In our experience to date, the network does keep up. Here are some reasons why it
seems to just work.)
• The point-to-point bandwidth of Gigabit Ethernet is about 100 MB/s. The bandwidth
of disk is about 100 MB/s. As long as the aggregate bandwidth of network can keep
up, everything is fine.
• Researchers already face the issue of aggregate network bandwidth in RAM-based
programs. The disk is slower than RAM. So, probably traditional parallel programs
can cope.
19. Applications from Computational Group Theory (2003–2007)
Group            Space Size     State Size   Total Storage
Fischer Fi23     1.17 × 10^10   100 bytes    1 TB
“Baby Monster”   1.35 × 10^10   548 bytes    7 TB
Janko J4         1.31 × 10^11   64 bytes     8 TB
(joint with Eric Robinson)
20. History of Rubik’s Cube
• Invented in late 1970s in Hungary.
• In 1982, in Cubik Math, Singmaster and Frey conjectured:
No one knows how many moves would be needed for “God’s Algorithm”
assuming he always used the fewest moves required to restore the cube. It
has been proven that some patterns must exist that require at least seventeen
moves to restore but no one knows what those patterns may be. Experienced
group theorists have conjectured that the smallest number of moves which would
be sufficient to restore any scrambled pattern — that is, the number of moves
required for “God’s Algorithm” — is probably in the low twenties.
• Current Best Guess: 20 moves suffice
– States needing 20 moves are known
21. History of Rubik’s Cube (cont.)
• Invented in late 1970s in Hungary.
• 1982: “God’s Number” (number of moves needed) was known by authors of conjecture
to be between 17 and 52.
• 1990: C., Finkelstein, and Sarawagi showed 11 moves suffice for Rubik’s 2 × 2 × 2 cube
(corner cubies only)
• 1995: Reid showed 29 moves suffice (lower bound of 20 already known)
• 2006: Radu showed 27 moves suffice
• 2007: Kunkle and C. showed 26 moves suffice
• 2008: Rokicki showed 22 moves suffice (using idle resources at Sony Pictures)
22. Large-Memory Apps: Experience in N.U. Course
(mixed undergrads and grads)
1. Chaitin’s Algorithm
2. Fast Permutation Multiplication
3. Kernighan-Lin Partitioning Algorithm
4. Large matrix-matrix Multiplication
5. Voronoi Diagrams
6. Cellular Automata
7. GAA* Search
8. Static Performance Evaluation for Memory Bound Computing
Others:
• BFS using External Sort
• BFS using Segments & Hash Array
• Fast Permutation Multiplication
• Kernighan-Lin Partitioning Algorithm
• Large matrix-matrix Multiplication
23. Example: Rubik’s Cube: Sorting Delayed Duplicate Detection
1. Breadth-first search: storing new frontier (open list) on disk
2. Use Bucket Sorting to sort and eliminate duplicate states from the new
frontier
(The bucket size is chosen to fit in RAM, the new cache.)
3. Storing the new frontier requires 6 terabytes of disk space (and we would
use more if we had it). Saving a large new frontier on disk prior to sorting
delays duplicate detection, but makes the routine more efficient due to
economies of scale.
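The deduplication step can be sketched in miniature. In the real computation the frontier and previously visited states are sorted runs on disk and the merge is a streaming pass; here small lists stand in for them (the function name is mine):

```python
# Sorting-based Delayed Duplicate Detection: sort the accumulated new
# frontier, then remove internal duplicates and states already seen in
# earlier BFS levels with one sequential merge pass.
def dedup_frontier(new_frontier, visited_sorted):
    """new_frontier: unsorted iterable of states (orderable).
    visited_sorted: sorted list of states from earlier levels
    (stands in for sorted runs on disk).
    Returns the sorted, deduplicated new frontier."""
    out = []
    prev = object()      # sentinel unequal to any state
    vi = 0
    for s in sorted(new_frontier):
        if s == prev:
            continue     # duplicate within the new frontier
        prev = s
        while vi < len(visited_sorted) and visited_sorted[vi] < s:
            vi += 1
        if vi < len(visited_sorted) and visited_sorted[vi] == s:
            continue     # already reached at an earlier level
        out.append(s)
    return out

print(dedup_frontier([5, 3, 5, 8, 3, 1], visited_sorted=[1, 2, 8]))  # [3, 5]
```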
24. Rubik’s Cube: Two-Bit trick
1. The final representation of the state space (1.4 × 10^12 states) could use only 2 bits per
state. (We use 4 bits per state for convenience.)
2. We used mathematical group theory to derive a highly dense, perfect hash function (no
collisions) for the states of |cube|/|S|.
3. Our hash function represents symmetrized cosets (the union of all symmetric states of
|cube|/|S| under the symmetries of the cube).
4. Each hash slot need only store the level in the search tree modulo 3. This allows
the algorithm to distinguish states from the current frontier, the next frontier, and the
previous frontier (current level; current level plus one; and current level minus one).
This is all that is needed.
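A sketch of such a packed table, assuming a dense perfect hash has already mapped each state to an index i (the class and constants are illustrative, not the authors' code):

```python
# Two-bit trick: each hash slot stores the state's BFS level mod 3
# (values 0-2), with the fourth 2-bit pattern (3) meaning "not yet seen".
# Knowing the current level L, a stored value distinguishes the current,
# next, and previous frontiers (L, L+1, L-1, all mod 3).
UNSEEN = 3

class TwoBitTable:
    def __init__(self, n_states):
        # initialize every 2-bit slot to UNSEEN (0b11 in each position)
        self.data = bytearray([0xFF]) * ((n_states + 3) // 4)

    def get(self, i):
        return (self.data[i >> 2] >> ((i & 3) * 2)) & 3

    def set(self, i, value):
        shift = (i & 3) * 2
        byte = self.data[i >> 2] & ~(3 << shift)   # clear the 2-bit slot
        self.data[i >> 2] = (byte | (value << shift)) & 0xFF

t = TwoBitTable(10)
t.set(7, 2 % 3)            # state 7 first reached at level 2
print(t.get(7), t.get(6))  # 2 3
```

Four states per byte is the "4 bits per state for convenience" halved: the packing costs a few shifts and masks per access but quarters the table's footprint relative to one byte per state.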
25. Space-Time Tradeoffs using Additional Disk
• Use even more disk space in order to speed up the algorithm.
“A Comparative Analysis of Parallel Disk-Based Methods for Enumerating Implicit Graphs”, Eric Robinson,
Daniel Kunkle and Gene Cooperman, Proc. of 2007 International Workshop on Parallel Symbolic and
Algebraic Computation (PASCO ’07), ACM Press, 2007, pp. 78–87
26. LONGER-TERM GOAL: Mini-Language Extension
Well-understood building blocks already exist: external sorting, B-trees, Bloom filters,
Delayed Duplicate Detection, Distributed Hash Trees (DHT), and some still more exotic
algorithms.
GOAL: Provide language extensions for common data structures and algorithms (including
breadth-first search) that invoke a run-time library. Design the language to bias the
programmer toward efficient use of disk.
ROOMY LANGUAGE:
New Parallel Disk-Based Language, Roomy, in development by Daniel Kunkle.
Implementation: Run-time C library with #define and typedef for nicer syntax.
The language appears sequential; the back end runs on a cluster with local disks, a cluster
with a SAN, or a single computer using RAM (for simpler development and debugging).
Expected availability: mid-2009