By Václav Košař (GLAMI): Pathways Language Model (PaLM) is a 540-billion-parameter language model with an architecture similar to GPT-3's. The model, published April 4th, 2022, achieves breakthrough capabilities on language understanding and generation, reasoning, and coding tasks. For reasoning tasks, for example, PaLM used chain-of-thought prompting, which applies a simulated inner monologue to solve grade-school-level math questions. In this talk, we will discuss both a general-public-accessible intuition of how knowledge and reasoning can be represented in computers and the technical details of the PaLM model architecture.
Google's Pathways Language Model and Chain-of-Thought
1. How Computers Understood Humans
ideas existed at least since the 1700s
but there was not enough compute and computer science
How to instruct computer to perform tasks?
How to represent knowledge in computers?
How to generate the answers?
by his contrivance, the most ignorant person, at a reasonable charge, and with
a little bodily labour, might write books in philosophy, poetry, politics, laws,
mathematics, and theology, without the least assistance from genius or study.
... He then commanded six-and-thirty of the lads, to read the several lines
softly, as they appeared upon the frame
(Gulliver's Travels, by Jonathan Swift, 1726, making fun of ideas like Ramon Llull's)
Ramon Llull, 1232
2. Prompt as an Interface
2001: A Space Odyssey HAL 9000
input textual instructions, e.g. explain a riddle
based on its knowledge, the computer generates the answer text
3.
4. How To Represent Knowledge
library ~> textual documents in a database
search by list of words (query) ~1970s, find topics ~1980
counting word occurrences at the document level into sparse matrices
methods: TF*IDF, Latent Semantic Analysis
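The document-level word counting above can be sketched as a toy TF*IDF computation. This is a minimal illustration with an invented three-document corpus and whitespace tokenization, not anything from the talk:

```python
import math

# Toy corpus: each document is represented only by its word counts.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def tf_idf(doc_tokens, term, corpus):
    # Term frequency: raw count of the term in this document.
    tf = doc_tokens.count(term)
    # Inverse document frequency: terms in fewer documents get higher weight.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# One sparse row per document: only non-zero weights are stored.
matrix = [
    {t: w for t in vocab if (w := tf_idf(doc, t, tokenized)) > 0}
    for doc in tokenized
]
```

Because the rows are dictionaries, words absent from a document take no space at all, which is what makes document-term matrices practical at library scale.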
5.
6. Non-Contextual Words Vectors
document -> sentence or a small running window of ~10 words
a vector is a point in a multidimensional space - an array of numbers
each of 10k words gets one general vector in a 300-dimensional space
each vector has to fit in "only" 300 dimensions - much less than 10k words
global (non-)contextual word vectors - no disambiguation: fruit (flowering) vs fruit (food)
8. Word2vec: Word To a Global Vector
GloVe (Pennington 2014): count co-occurrences in a 10-word window
word2vec (Mikolov 2013): the 10 surrounding word vectors sum close to the middle word vector
words appearing in similar contexts are close in the 300-dimensional space
no disambiguation - a word's string should be just a name, not an id!
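How "close in the 300-dimensional space" is usually measured: cosine similarity. The 3-dimensional toy vectors below are made up for illustration; real word2vec or GloVe vectors are learned from co-occurrence statistics:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-dimensional "word vectors" (real models use ~300 dimensions).
# "king" and "queen" appear in similar contexts, so their vectors point
# in similar directions; "banana" does not.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

royal = cosine(vectors["king"], vectors["queen"])    # close to 1
random_pair = cosine(vectors["king"], vectors["banana"])  # much smaller
```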
9. Transformer: Contextual Word Vectors
word meaning based on a context of 100s of words
recurrent neural networks (LSTM, GRU) - sequential with memory
transformer architecture (Vaswani 2017) - calculates on the entire input sequence
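The core difference can be sketched as a minimal single-head self-attention: every output position mixes information from the whole sequence at once. Learned Q/K/V projections are omitted for brevity, so this illustrates the data flow, not an actual transformer layer:

```python
import numpy as np

def attention(x):
    """Minimal single-head self-attention: each position's output is a
    weighted mix of ALL positions, making every word vector contextual."""
    d = x.shape[-1]
    # in a real transformer, Q, K, V come from learned projections of x;
    # here we use x directly to keep the sketch short
    scores = x @ x.T / np.sqrt(d)                    # pairwise similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ x                               # mix value vectors

seq = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 tokens, 2 dims
out = attention(seq)
# unlike an RNN, every output row depends on the entire input sequence
```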
10. Knowledge Graph's Nodes Are Disambiguated
knowledge graph e.g. Wikidata: each node is a specific entity: fruit (flowering) vs fruit (food)
imperfect tradeoff between database and training data samples
Wikipedia and internet is between knowledge graph and set of documents
random walk ~ valid "sentences", link prediction ~ generating text
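The random-walk analogy can be sketched on a toy graph (the entities and edges below are made up for illustration):

```python
import random

# A tiny knowledge graph: nodes are disambiguated entities, edges are facts.
graph = {
    "fruit (food)":      ["apple", "banana"],
    "apple":             ["fruit (food)", "apple tree"],
    "banana":            ["fruit (food)"],
    "apple tree":        ["fruit (flowering)", "apple"],
    "fruit (flowering)": ["apple tree"],
}

def random_walk(start, steps, rng=random.Random(0)):
    """A random walk over the graph yields a chain of related entities,
    loosely analogous to a valid 'sentence' about the domain."""
    node, path = start, [start]
    for _ in range(steps):
        node = rng.choice(graph[node])
        path.append(node)
    return path

walk = random_walk("apple", 4)
# every consecutive pair in the walk is an edge of the graph
```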
11. Big Transformer Models
generate by predicting the input text's continuation
~$10M transformers trained on large amounts of text from the internet in 2022
solve a wide variety of naturally described problems, sometimes with human-level performance
examples: PaLM (2022), RETRO (2021), GPT-3, ...
13. Google's Pathways Language Model and Chain-of-Thought
training task: given this text, predict the next word (token)
the biggest dense transformer (540B) and likely the most expensive (~$10M) as of early 2022
highly efficient training on 6k chips (TPU v4) across 2 clusters (Pods)
improvements from scaling continue in language understanding (few-shot)
disproportionate gains at certain scales, e.g. reasoning: 62B to 540B vs 8B to 62B
breakthrough performance:
outperforming the average human on grade school logic and math (BIG-bench)
outperforms specialized and fine-tuned models on multistep reasoning
chain-of-thought prompting simulates inner monologue
14. PaLM's Size
PaLM has 540B parameters = 3x bigger than GPT-3's 175B parameters
2x smaller than the sparse 1T Switch Transformer, where only parts of the model are activated at a time
human brain: ~100T connections
likely the most expensive model to train (~$10M, 2.5 yottaFLOPS) vs GPT-3 (~$5M)
PaLM and GPT-3 are fascinating, but likely not economical now
16. Zero-Shot vs Few-Shot Prompting vs Fine-Tuning
prompting: instructing via addition of textual context
zero-shot: task described, but demonstrations not given
few-shot: task described and random demonstrations provided
fine-tuning: model parameters are updated with correct answers
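The first two regimes differ only in what text surrounds the query (fine-tuning, in contrast, changes the weights). A sketch with a hypothetical `build_prompt` helper:

```python
def build_prompt(task, demonstrations=(), query=""):
    """Assemble a prompt string. With no demonstrations this is zero-shot;
    with a few (input, output) pairs it becomes few-shot. The model's
    parameters never change - only the textual context does."""
    lines = [task]
    for x, y in demonstrations:
        lines.append(f"Q: {x}\nA: {y}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

zero_shot = build_prompt("Answer the riddle.",
                         query="What has keys but no locks?")
few_shot = build_prompt(
    "Answer the riddle.",
    demonstrations=[("What has hands but cannot clap?", "A clock.")],
    query="What has keys but no locks?",
)
```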
18. PaLM's Breakthrough Capabilities on BIG-bench
BIG-bench: 150+ tasks (a task contains samples) like reasoning, math, QA, translation, programming
58 tasks have results available for other models, often multiple-choice questions
PaLM is SOTA: outperforming the human average and other models (GPT-3, Gopher, Chinchilla)
still underperforms the average human on many tasks
certain capabilities emerge only when a scale is reached (not log-linear)
English proverbs and logical sequence emerge at 62B to 540B vs 8B to 62B
20. Chain-of-Thought Prompting
inserts reasoning steps for each demonstration in the prompt
multi-step arithmetic: grade-school level math problems where the difficulty is converting to equations
commonsense logical reasoning: requires strong world knowledge and chaining logical inferences
the generated thoughts are also useful for interpretation
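A minimal illustration of the difference, using the well-known tennis-ball example in the style of the chain-of-thought paper:

```python
# A standard few-shot demonstration shows only the answer; a chain-of-thought
# demonstration inserts the intermediate reasoning steps before it.
standard = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls. "
    "How many balls does he have now?\n"
    "A: The answer is 11."
)
chain_of_thought = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)
```

Demonstrations in the prompt are the only difference: the same model, prompted with `chain_of_thought`-style examples, then produces its own reasoning steps before the final answer.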
22. PaLM Chain-of-Thought Prompting Results
reasoning task datasets: GSM8K, SVAMP, MAWPS, AQuA, CommonsenseQA, StrategyQA
humans were allowed calculators and search engines too
the model triggers a calculator by generating special annotation text (+~5% acc)
on GSM8K: PaLM+chain-of-thought is competitive with GPT-3+finetuning+calculator+verifier; it outperforms with a calculator, but how about GPT-3+chain-of-thought?
generating explicit inference chains drastically increases the quality of the predictions
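A sketch of the calculator idea, assuming GSM8K-style `<<expression=result>>` annotations (the exact annotation syntax a given model emits is an assumption here):

```python
import re

def apply_calculator(text):
    """Replace calculator annotations of the form <<expression=...>> with an
    externally computed result, so the model's arithmetic mistakes get fixed.
    The <<...>> format follows GSM8K-style annotations; a real system may
    use a different syntax."""
    def compute(match):
        expression = match.group(1)
        # eval() on a regex-vetted arithmetic expression only - a sketch,
        # not something to run on untrusted model output as-is
        return str(eval(expression, {"__builtins__": {}}, {}))
    return re.sub(r"<<([0-9+\-*/. ()]+)=[^>]*>>", compute, text)

fixed = apply_calculator("2 cans of 3 balls is <<2*3=7>> balls.")
# the model wrote 7; the calculator corrects the span to 6
```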
24. Explanation Using Chain-of-Thought Prompting
right reasons for the right answer?
interpretability may be a desired output
2-shot prompting with greedy decoding (deterministic)
not in the paper: I added GPT-3 results with zero temperature
average airplane speed 460 – 575 mph
Input: Jennifer looked out her window and sees a really cool cloud below her.
She unbuckles her seatbelt and heads to the bathroom. Is Jennifer probably
traveling more than 300 miles per hour relative to the earth?
PaLM: 300 miles per hour is about 480 km/h. This is about the speed of a
commercial airplane. Clouds are usually below airplanes, so Jennifer is
probably on an airplane. The answer is "yes"
GPT-3: No, because an airplane typically travels around 500-600 miles per
hour, and the bathroom is usually located in the back of the plane.
25. PaLM on Code Tasks
achieves results similar to OpenAI's specialized Codex 12B with only 3% of the code training data
after fine-tuning, outperforms OpenAI's Davinci Codex
27. PaLM on Other Tasks
generally SOTA in few-shot comparisons
but few-shot is not competitive with fine-tuned models
fine-tuned PaLM is only competitive with specialized fine-tuned variants
translation: without fine-tuning outperforms GPT-3, and outperforms some supervised baselines (de-en, ro-en)
summarization: fine-tuned results competitive; few-shot largely underperforms the fine-tuned
multilingual question answering: fine-tuned results competitive; few-shot largely underperforms the fine-tuned
29. PaLM Architecture
standard decoder-only transformer (attending only to the past, similar to GPT-3)
modified feed-forward layer (MLP): instead of ReLU uses SwiGLU feed-forward
~1% better in a compute-equivalent setup
uses GLU: gated linear unit - a sigmoid-controlled output
standard FFN: FFN(x) = max(0, xW1 + b1)W2 + b2
SwiGLU: FFN_SwiGLU(x) := (Swish(xW1) ⊗ xV)W2
Swish activation: Swish(x) = x(1 + exp(−x))^−1
parallel Attention and feed-forward layer (MLP) from GPT-J:
instead of sequential it is additive: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
15% speedup for a small degradation
uses multi-query attention
30. Multi-Query Attention and RoPE Embeddings
multi-query attention: block-shared key and value projections, different query projections
speeds up autoregressive decoding, where queries are processed one token at a time
RoPE embeddings:
want relative position info in the query-key dot-product
use a multiplicative rotational matrix mixing pairwise neighboring dimensions
improves performance on long sequences
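The feed-forward variants above can be sketched in NumPy. Dimensions and random weights are illustrative; PaLM's actual layers additionally omit biases and use the parallel formulation:

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x) = x * (1 + exp(-x))^-1."""
    return x / (1.0 + np.exp(-x))

def ffn_relu(x, W1, b1, W2, b2):
    """Standard transformer feed-forward: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W, V, W2):
    """SwiGLU feed-forward: (Swish(x W) ⊗ x V) W2 - the Swish branch
    gates the linear branch elementwise (a gated linear unit)."""
    return (swish(x @ W) * (x @ V)) @ W2

rng = np.random.default_rng(0)
d, h = 4, 8                            # model dim 4, hidden dim 8
x = rng.normal(size=(3, d))            # 3 tokens
W, V = rng.normal(size=(d, h)), rng.normal(size=(d, h))
W2 = rng.normal(size=(h, d))
y = ffn_swiglu(x, W, V, W2)            # same shape as the input: (3, 4)
```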
31. PaLM Training Dataset
780B tokens of high-quality text (~100B human days) (cf. MassiveText 5T tokens, GPT-3 300B tokens, MT-NLG 339B)
social media 50%, webpages 27%, books 13%, code 5%, Wikipedia 4%, news 1%
based on the datasets used for LaMDA and GLaM
private non-reproducible dataset, while MT-NLG's 339B tokens are reproducible but non-hosted
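The mixture proportions above determine how often each source is drawn from during training; the sampling helper below is illustrative, not how the actual data pipeline works:

```python
import random

# PaLM's reported training mixture (approximate proportions).
mixture = {
    "social media": 0.50, "webpages": 0.27, "books": 0.13,
    "code": 0.05, "wikipedia": 0.04, "news": 0.01,
}

def sample_source(rng=random.Random(0)):
    """Pick which corpus the next training document comes from,
    proportionally to the mixture weights."""
    sources, weights = zip(*mixture.items())
    return rng.choices(sources, weights=weights, k=1)[0]

source = sample_source()  # e.g. "social media" about half the time
```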
33. PaLM Training Requirements
~17 TB of RAM, 2.5 yottaFLOPS (10^24 FLOPs) needed for training
2 TPU v4 Pod clusters connected via the data center network
the mind of PaLM is shattered across many chips (cheaper, replaceable, cooling)
each Pod ~1 exaflop/s: 768 hosts, 3072 TPU v4 chips
but how to split the work and communicate?
35. Parallel Training of Large Scale Models
parallel computing trades off compute ("time"), memory ("space"), and communication throughput (no cool name)
data parallelism - batches are divided between workers
tensor model parallelism - splits model layers, i.e. a transformer block, into attention heads and feed-forward parts
pipeline parallelism (Megatron-Turing (MT-NLG) by Microsoft and NVidia):
splits the computation DAG into stages, e.g. layers
stages exchange forward and backward propagation information (micro-batches)
step-by-step passing causes "bubbles" - idling
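The "bubble" overhead of a GPipe-style schedule is commonly estimated as (p − 1)/(m + p − 1) for p stages and m micro-batches — a standard result from the pipeline-parallelism literature, not a PaLM-specific figure:

```python
def pipeline_bubble_fraction(stages, micro_batches):
    """Fraction of time pipeline stages sit idle ('bubbles') in a
    GPipe-style schedule: (p - 1) / (m + p - 1) for p stages and
    m micro-batches. More micro-batches shrink the bubble."""
    return (stages - 1) / (micro_batches + stages - 1)

few = pipeline_bubble_fraction(stages=8, micro_batches=8)    # ~47% idle
many = pipeline_bubble_fraction(stages=8, micro_batches=64)  # ~10% idle
```

This is why PaLM's pipeline-free 2D approach is notable: it avoids paying this idle fraction entirely.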
37. PaLM Training Infrastructure
PaLM uses pipeline-free 2D parallelism
data parallel across 2 clusters (2 TPU v4 Pods)
each cluster (Pod) has full model copy
model partitioned into 12 parts
data partitioned into 256 parts
768 hosts connected to 3k chips which are interconnected
each batch triggers a gradient exchange between the clusters to keep the two model copies identical
each host exchanges 1.3 GB with its counterpart
39. Training Efficiency
observed throughput relative to the theoretical maximum of a system
in the case of PaLM, throughput is measured in tokens-per-second
the price of parallelization for PaLM is ~50%
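Token throughput can be turned into a hardware-utilization estimate with the common ~6·N FLOPs-per-token rule of thumb for training an N-parameter decoder; the numbers below are illustrative, not PaLM's published figures:

```python
def model_flops_utilization(tokens_per_second, params, peak_flops):
    """Rough hardware utilization: a decoder-only model needs about
    6 * params FLOPs per trained token (forward + backward), so
    MFU ~= observed token throughput * 6 * params / peak FLOPs."""
    return tokens_per_second * 6 * params / peak_flops

# Illustrative: a 540B-parameter model at 200k tokens/s on hardware
# with 2 pods * 1 exaflop/s of peak compute.
mfu = model_flops_utilization(2.0e5, 540e9, 2.0e18)
```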
40. More Posts Like This
SRU++ Model Speeds Up Transformer with Simple Recurrent Unit
DeepMind's RETRO Transformer Model
41. Appendix
GPT-3 Explanations Full Prompts
Logical Inference 1
Best of 2, temperature 0.7, otherwise defaults:
We can infer that Jennifer is on an airplane because she is looking out a
window and sees a cloud below her. We can also infer that she is traveling
at a high speed because she is on an airplane. The average cruising speed
of an airplane is about 550 miles per hour, so Jennifer is probably traveling
more than 300 miles per hour relative to the earth. The answer is "yes".
We can infer that Jennifer is in an airplane because she is looking out a
window. We can also infer that she is high up in the sky because she can
see a cloud "below" her. Clouds are usually found at altitudes above 3,000
feet. Finally, we know that airplanes typically travel around 500 miles per
hour. So, Jennifer is probably traveling more than 300 miles per hour, but
we cannot be certain. The answer is "yes".
I will explain how to make logical inferences:
(1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a
Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack
(2) Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your r
Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "C
(3) Jennifer looked out her window and sees a really cool cloud below her. She unbuckles her seatbelt and heads to the b
Answer:
Greedy:
No, because an airplane typically travels around 500-600 miles per hour,
and the bathroom is usually located in the back of the plane.
43. Logical Inference 2
Best of 2, temperature 0.7, otherwise defaults:
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.
I will explain how to make logical inferences:
(1) John wakes to the sound of his children giggling outside the door. They bust in and say "Happy Father's Day!" with a
Answer: We can infer that John is a father because of "his children". We can infer that the current year is 2022. Barack
(2) Sam sees a piano in the subway stations. Laughing, his friends prod him to play, saying "Remember how amazing your r
Answer: Juilliard is a top school for music, so someone who majored in piano at Juilliard would be an expert pianist. "C
(3) Shelley is from Virginia, but is visiting that city with that famous market where they throw the fish. Going home ne
Question: Is it likely that Shelley will be near the Pacific Ocean this weekend?
Answer:
Greedy:
It is not likely that Shelley will be near the Pacific Ocean this weekend.
Shelley is from Virginia, which is on the East Coast of the United States.
The city with the famous fish market is most likely Seattle, which is also on
the West Coast. Shelley is visiting, so she is not from there. She is going
home next Tuesday, so she will likely be going back to Virginia this
weekend.
45. Jokes
Best of 2, temperature 0.7, otherwise defaults:
This joke is about how annoying mothers can be. The speaker is trying to
visit their family, but their mother has just informed them that their step-
dad's poetry reading is on the same night. So now the speaker has to fly in
on the next day.
This joke is about how annoying it can be to have to deal with family
obligations. The person telling the joke is trying to visit their family on
April 6th, but their mom tells them that their step dad's poetry reading is
that night. So the person has to fly in on April 7th instead.
This joke is about how annoying and inconsiderate step-parents can be.
The punchline is that the person is flying in a day later just to avoid their
step-dad's poetry reading.
I will explain these jokes:
(1) The problem with kleptomaniacs is that they always take things literally.
Explanation: This joke is wordplay. Someone who "takes things literally" is someone who doesn't fully understand social
But the definition of kleptomania is someone who literally takes things.
(2) Always borrow money from a pessimist. They’ll never expect it back.
Explanation: Most people expect you to pay them back when you borrow money, however a pessimist is someone who always as
(3) I was going to fly to visit my family on April 6th. My mom said "Oh great, your step dad's poetry reading is that ni
Explanation:
Greedy:
This joke is about how annoying and intrusive in-laws can be. The person is
trying to visit their family, but their mother says that their step-dad's
poetry reading is that night. So the person is now flying in the next day to
avoid it.