XLNet is a generalized autoregressive pretraining method that aims to overcome limitations of previous autoregressive and autoencoding models. It uses a novel permutation language modeling technique that allows it to capture bidirectional context while maintaining the benefits of autoregressive modeling. The model utilizes two-stream self-attention to incorporate positional information and allow predictions to be aware of the target position. XLNet achieved state-of-the-art results on various NLP tasks such as question answering and textual entailment.
A simple explanation of the XLNet paper (https://arxiv.org/abs/1906.08237).
It helps to get to grips with the concepts behind XLNet before you dive into the paper.
1. XLNet
Generalized Autoregressive Pretraining for Language Understanding
by Zhilin Yang and Zihang Dai et al.
Presented by:
V S Siva Kumar Lakkoju
CS2139
Information Retrieval
MTech CS
Indian Statistical Institute, Kolkata
June 23, 2022
7. XLNet
Introduction
Proposed
Method
Design of
XLNet
HyperParameter
Results
Transformer-XL [2]
Increases the usable context through segment-level recurrence and a relative positional encoding scheme
• Caches and reuses the hidden states of previous segments
• Allows variable-length context, which helps with long-term dependencies
• Resolves the problem of context fragmentation
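The caching idea on this slide can be sketched in a few lines. This is a toy illustration only: token strings stand in for hidden states, and the names `split_into_segments`, `contexts_with_memory`, `seg_len`, and `mem_len` are hypothetical, not taken from the Transformer-XL code. It shows which cached items each segment could attend to, and that a memory length of zero reproduces the fragmented, segment-only context.

```python
from typing import List, Tuple


def split_into_segments(tokens: List[str], seg_len: int) -> List[List[str]]:
    """Cut a long sequence into fixed-length segments (the last may be shorter)."""
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]


def contexts_with_memory(
    tokens: List[str], seg_len: int, mem_len: int
) -> List[Tuple[List[str], List[str]]]:
    """Return (memory, segment) pairs.

    `memory` is what was cached from earlier segments; tokens are used here
    as stand-ins for the hidden states a real model would cache.
    """
    memory: List[str] = []
    result = []
    for segment in split_into_segments(tokens, seg_len):
        result.append((list(memory), segment))
        combined = memory + segment
        # Keep only the most recent `mem_len` items for the next segment.
        memory = combined[max(0, len(combined) - mem_len):]
    return result


if __name__ == "__main__":
    for mem, seg in contexts_with_memory(list("abcdefgh"), seg_len=3, mem_len=3):
        print("memory:", mem, "segment:", seg)
```

In an actual model the cached states would be per-layer hidden vectors with gradients stopped; that detail is omitted here.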
14. XLNet
The Idea
• Permutation: applied only to the factorization order, not to the original sequence order
• Attention masks provide the context for each prediction
• Two-stream self-attention allows each prediction to be aware of its target position
• Partial prediction: only predict 1/K of the tokens (the last ones in the factorization order) in each permutation
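The points above can be made concrete with a small sketch of how a factorization order turns into attention masks. This is an illustrative construction under stated assumptions, not the authors' implementation: the function name `permutation_masks` and the parameter `num_predict` are hypothetical, and masks are plain nested lists where `True` means "may attend".

```python
from typing import List, Tuple


def permutation_masks(
    order: List[int], num_predict: int
) -> Tuple[List[List[bool]], List[List[bool]], List[int]]:
    """Build attention masks from a factorization order.

    order[k] is the original position predicted at step k. The tokens stay in
    their original positions; only the masks encode the permutation.
    """
    n = len(order)
    rank = {pos: step for step, pos in enumerate(order)}
    # Content stream: position i sees itself and everything earlier in the order.
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: position i sees only what is strictly earlier in the order,
    # so it knows its target position but not its own content.
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    # Partial prediction: only the last `num_predict` positions are targets.
    targets = order[n - num_predict:]
    return content, query, targets


if __name__ == "__main__":
    tokens = ["New", "York", "is", "a", "city"]
    content, query, targets = permutation_masks([2, 3, 4, 0, 1], num_predict=2)
    for pos in targets:
        visible = [tokens[j] for j in range(len(tokens)) if query[pos][j]]
        print(tokens[pos], "is predicted from", visible)
```

Note that the input sequence is never reordered: positional encodings stay tied to the original positions, and the masks alone decide what each prediction may see.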
21. XLNet
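How to reparameterize?
Standard softmax does not work, because the hidden state h does not depend on the target position z_t:
$$P(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}) = \frac{\exp\left(\mathrm{emb}(x)^{\top} h(\mathbf{x}_{\mathbf{z}_{<t}})\right)}{\sum_{x'} \exp\left(\mathrm{emb}(x')^{\top} h(\mathbf{x}_{\mathbf{z}_{<t}})\right)} \tag{10}$$
Solution: incorporate the target position z_t into the hidden state
$$P(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}) = \frac{\exp\left(\mathrm{emb}(x)^{\top} g(z_t, \mathbf{x}_{\mathbf{z}_{<t}})\right)}{\sum_{x'} \exp\left(\mathrm{emb}(x')^{\top} g(z_t, \mathbf{x}_{\mathbf{z}_{<t}})\right)} \tag{11}$$
This is implemented using two-stream self-attention.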
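A toy numerical sketch may help show why the position-unaware form fails. Everything here is an assumption made for illustration: the two-dimensional embeddings in `EMB`, the position vectors in `POS`, and the way `h` and `g` are computed (a sum of context embeddings, plus a position vector for `g`) are hypothetical stand-ins, not the model's actual functions. With `h`, the same context gives the same distribution regardless of which position is being predicted; with `g`, the distribution changes with the target position.

```python
import math
from typing import Dict, List

# Hypothetical toy embeddings and position vectors (illustration only).
EMB: Dict[str, List[float]] = {
    "New": [1.0, 0.0],
    "York": [0.0, 1.0],
    "is": [0.2, 0.1],
    "a": [0.1, 0.2],
    "city": [0.3, 0.3],
}
POS: Dict[int, List[float]] = {0: [2.0, 0.0], 1: [0.0, 2.0]}


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(hidden: List[float]) -> Dict[str, float]:
    """Softmax over emb(x)^T hidden for every word x in the toy vocabulary."""
    words = list(EMB)
    scores = [sum(e * v for e, v in zip(EMB[w], hidden)) for w in words]
    return dict(zip(words, softmax(scores)))


def h(context: List[str]) -> List[float]:
    """Position-unaware hidden state: depends on the context only."""
    return [sum(col) for col in zip(*(EMB[w] for w in context))]


def g(target_pos: int, context: List[str]) -> List[float]:
    """Position-aware hidden state: context plus the target position."""
    return [b + p for b, p in zip(h(context), POS[target_pos])]


if __name__ == "__main__":
    ctx = ["is", "a", "city"]
    print("h, any target position:", predict(h(ctx)))
    print("g, target position 0:  ", predict(g(0, ctx)))
    print("g, target position 1:  ", predict(g(1, ctx)))
```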
28. XLNet
Information about specific datasets
• RACE: sequence length 512
• SQuAD: during fine-tuning on the SQuAD2.0 dataset, a logistic regression loss is applied jointly to predict whether a question can be answered
• Layer-wise decay: if the learning rate of the 24th layer is l and the decay rate is q, then the learning rate of layer m is
$$\mathrm{lr}_m = l \cdot q^{\,24-m} \tag{15}$$
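The layer-wise decay rule in Eq. (15) is easy to check numerically. The sketch below is a direct transcription of the formula; the function name `layerwise_learning_rates` and the example values for `top_lr` and `decay` are hypothetical and chosen only for illustration.

```python
from typing import Dict


def layerwise_learning_rates(
    top_lr: float, decay: float, num_layers: int = 24
) -> Dict[int, float]:
    """Learning rate per layer: lr_m = top_lr * decay ** (num_layers - m).

    Layer `num_layers` (the top layer) gets `top_lr`; lower layers get
    progressively smaller rates when decay < 1.
    """
    return {m: top_lr * decay ** (num_layers - m) for m in range(1, num_layers + 1)}


if __name__ == "__main__":
    rates = layerwise_learning_rates(top_lr=2e-5, decay=0.75)  # example values
    for m in (24, 23, 12, 1):
        print(f"layer {m:2d}: {rates[m]:.3e}")
```

With a decay rate below 1, the layers closest to the input change the least during fine-tuning, while the top layer trains at the full rate.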
30. XLNet
XLNet vs BERT
Input sequence: New York is a city
XLNet factorization order: [is, a, city, New, York]
For BERT,
$$\log p(\text{New York} \mid \text{is a city}) = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city}) \tag{16}$$
For XLNet,
$$\log p(\text{New York} \mid \text{is a city}) = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New, is a city}) \tag{17}$$
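The difference between the two objectives comes down to what each target is conditioned on. The following sketch lists those conditioning contexts for the example on this slide; the function names `bert_contexts` and `xlnet_contexts` are hypothetical, and the code only enumerates visible tokens rather than computing any probabilities.

```python
from typing import Dict, List, Set


def bert_contexts(tokens: List[str], masked: Set[int]) -> Dict[int, List[str]]:
    """Masked LM: every masked position sees only the unmasked tokens."""
    visible = [t for i, t in enumerate(tokens) if i not in masked]
    return {pos: list(visible) for pos in sorted(masked)}


def xlnet_contexts(
    tokens: List[str], order: List[int], num_predict: int
) -> Dict[int, List[str]]:
    """Permutation LM: a target sees everything earlier in the factorization order."""
    contexts = {}
    for step in range(len(order) - num_predict, len(order)):
        pos = order[step]
        # Tokens are listed in original sequence order for readability.
        contexts[pos] = [tokens[p] for p in sorted(order[:step])]
    return contexts


if __name__ == "__main__":
    tokens = ["New", "York", "is", "a", "city"]
    print("BERT: ", bert_contexts(tokens, masked={0, 1}))
    print("XLNet:", xlnet_contexts(tokens, order=[2, 3, 4, 0, 1], num_predict=2))
```

For "York" (position 1), the XLNet context includes "New" while the BERT context does not, which is the extra dependency that appears in Eq. (17).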
35. XLNet
References I
[1] Z. Yang, Z. Dai, et al., “XLNet: Generalized autoregressive pretraining for language understanding,” in 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019.
[2] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” 2019.
[3] https://towardsdatascience.com/transformer-xl-explained-combining-transformers-and-rnns-into-a-state-of-the-art-language-model-c0cfe9e5a924