BERT: Bidirectional Encoder Representations from Transformers.
BERT is a pretrained model by Google for state-of-the-art NLP tasks.
BERT is able to take into account both the syntactic and semantic meaning of text.
2. Defining Language
Language can be divided into 3 parts:
● Syntax
● Semantics
● Pragmatics
Syntax - word ordering and sentence form.
Semantics - the meaning of words.
Pragmatics - the social language skills we use in our daily interactions with others.
3. Example of Syntax, Semantics, Pragmatics
+ This discussion is about BERT. → syntax, semantics, pragmatics
+ The green frogs sleep soundly. → syntax, semantics
+ BERT play football good → none
4. Why study BERT?
BERT achieves state-of-the-art performance on many Natural Language Processing tasks. It can perform tasks such as text classification, text similarity, next sentence prediction, question answering, auto-summarization, named entity recognition, etc.
What is BERT exactly?
BERT is a pretrained model by Google that learns bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
5. Datasets used to pre-train BERT
+ BooksCorpus (800M words)
+ English Wikipedia (2,500+M words)
A pretrained model can be applied with a feature-based approach or by fine-tuning.
+ In fine-tuning, all weights change.
+ In the feature-based approach, only the final layer weights change (the approach taken by ELMo).
The pretrained model is then fine-tuned on different NLP tasks.
Pretraining and fine-tuning: you train a model m on data A, then continue training m on data B from that checkpoint. [SLIDE 17]
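A minimal sketch of the two adaptation styles, assuming the Hugging Face transformers library (the checkpoint name, the 2-label setup, and the example sentence are illustrative assumptions, not part of the original slides):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tuning: every pretrained weight stays trainable (the default),
# so the whole encoder is updated together with the new classifier head.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Feature-based alternative (ELMo-style): freeze the encoder so that only
# the task-specific head is trained.
# for param in model.bert.parameters():
#     param.requires_grad = False

inputs = tokenizer("This email is urgent.", return_tensors="pt")
labels = torch.tensor([1])
loss = model(**inputs, labels=labels).loss   # classification loss for this example
loss.backward()
optimizer.step()
```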
6. Language Training Approach
To train a language model, there are two approaches:
Context-free
+ Traditionally we convert words with word2vec or use GloVe.
Contextual
+ RNN
+ BERT
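A minimal sketch of the difference, assuming the Hugging Face transformers library (the checkpoint name and the two "bank" sentences are illustrative): a context-free method assigns one fixed vector per word, while BERT produces a different vector for the same word in each context.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # contextual embedding of the token "bank" in this particular sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("I deposited money at the bank.")
v2 = bank_vector("We had a picnic on the bank of the river.")
# A static word2vec/GloVe table would give identical vectors here; BERT does not.
print(torch.cosine_similarity(v1, v2, dim=0))
```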
7. How does BERT work?
BERT's weights are learned in advance through two unsupervised tasks: masked language modeling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another).
BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. (Paper 2: "Attention Is All You Need")
BERT uses multi-headed attention: multiple layers of attention, with multiple attention "heads" in every layer (12 or 16). Since model weights are not shared between layers, a single BERT-base model effectively has up to 12 x 12 = 144 different attention mechanisms.
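A small sketch, assuming the Hugging Face transformers library, that pulls these attention matrices out of BERT-base and confirms the 12-layers × 12-heads structure (the example sentence is illustrative):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The animal didn't cross the street because it was tired.",
                   return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # one tensor per layer

print(len(attentions))       # 12 layers
print(attentions[0].shape)   # (batch=1, heads=12, seq_len, seq_len)
```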
8. What does BERT learn, and how does it tokenize and handle OOV words?
Consider the input example: "I went to the store. At the store, I bought fresh strawberries."
BERT uses a WordPiece tokenizer, which breaks an OOV (out-of-vocabulary) word into segments. For example, if play, ##ing, and ##ed are present in the vocabulary but playing and played are OOV words, they will be broken down into play + ##ing and play + ##ed respectively (## marks sub-words).
BERT also requires a special [CLS] classifier token at the beginning of a sequence and [SEP] at the end of each segment.
[CLS] I went to the store. [SEP] At the store I bought fresh straw ##berries. [SEP]
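A minimal tokenization sketch, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary (the exact sub-word split depends on the vocabulary; the slide's straw + ##berries example is assumed here):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# OOV words are split into known word pieces marked with "##".
print(tokenizer.tokenize("I bought fresh strawberries"))
# e.g. ['i', 'bought', 'fresh', 'straw', '##berries']

# Encoding a sentence pair adds the [CLS] and [SEP] special tokens.
ids = tokenizer.encode("I went to the store.",
                       "At the store, I bought fresh strawberries.")
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'i', 'went', ..., '[SEP]', 'at', 'the', 'store', ..., '##berries', '.', '[SEP]']
```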
9. Attention
An attention probe is a task for a pair of tokens (token_i, token_j) where the input is a model-wide attention vector, formed by concatenating the entry a_ij from every attention matrix, from every attention head, in every layer.
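A sketch of how such a model-wide attention vector could be assembled for a token pair, assuming the Hugging Face transformers library (the sentence and the helper name probe_features are illustrative):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The keys to the cabinet are on the table", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions      # 12 x (1, 12, seq_len, seq_len)

def probe_features(i, j):
    # one scalar a_ij from each of the 12 heads in each of the 12 layers,
    # concatenated into a single 144-dimensional feature vector
    return torch.cat([layer[0, :, i, j] for layer in attentions])

print(probe_features(2, 5).shape)   # torch.Size([144])
```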
10. Some visual attention patterns, and why do we use the attention mechanism?
Reason for attention: attention helps in the two main training tasks of BERT, MLM (Masked Language Modeling) and NSP (Next Sentence Prediction).
11. Visual patterns from the attention mechanism
● Attention to the next word. [Layer 2, Head 0] | like a backward RNN
● Attention to the previous word. [Layer 0, Head 2] | like a forward RNN
● Attention to identical/related words.
● Attention to identical words in the other sentence. | Helps in the next sentence prediction task
● Attention to other words predictive of the current word.
● Attention to delimiter tokens [CLS], [SEP].
● Attention to bag of words.
13. MLM: Masked Language Model
Input: My dog is hairy.
Masking is done randomly: 15% of all WordPiece tokens in each input sequence are masked. We only predict the masked tokens rather than reconstructing the entire input sequence.
Procedure for each selected token:
+ 80% of the time: replace the word with [MASK]. My dog is [MASK].
+ 10% of the time: replace the word with a random word. My dog is apple.
+ 10% of the time: keep the word unchanged. My dog is hairy.
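A toy sketch of the 80/10/10 rule applied to one input sentence (the vocabulary list and helper names are illustrative, not BERT's actual training code):

```python
import random

def mask_word(token, vocab, mask_sym="[MASK]"):
    r = random.random()
    if r < 0.8:                     # 80%: replace with [MASK]
        return mask_sym
    elif r < 0.9:                   # 10%: replace with a random word
        return random.choice(vocab)
    else:                           # 10%: keep the original word
        return token

tokens = ["my", "dog", "is", "hairy"]
vocab = ["apple", "river", "happy", "run"]

# choose 15% of the positions to predict, then apply the rule to each of them
num_to_mask = max(1, int(0.15 * len(tokens)))
positions = random.sample(range(len(tokens)), num_to_mask)
masked = [mask_word(t, vocab) if i in positions else t for i, t in enumerate(tokens)]
print(positions, masked)
```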
15. NSP: Next Sentence Prediction
Training Method:
From unlabelled data, we take an input sequence A; 50% of the time the sequence that actually follows it is used as B, and the remaining 50% of the time we randomly pick any sequence as B.
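A toy sketch of how such sentence pairs could be generated (the sentence list and the IsNext/NotNext labels are illustrative; for simplicity the random pick does not exclude the true next sentence):

```python
import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if random.random() < 0.5:
            b, label = sentences[i + 1], "IsNext"        # true next sentence
        else:
            b, label = random.choice(sentences), "NotNext"  # random sentence
        pairs.append((a, b, label))
    return pairs

docs = ["I went to the store.", "At the store I bought fresh strawberries.",
        "Penguins live in the Antarctic.", "BERT uses a Transformer encoder."]
for a, b, label in make_nsp_pairs(docs):
    print(label, "|", a, "->", b)
```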
16. BERT Architecture
BERT is a multi-layer bidirectional Transformer encoder.
There are two models introduced in the paper.
● BERT Base – 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
● BERT Large – 24 layers, 16 attention heads, and 340 million parameters.
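These sizes can be checked directly, a sketch assuming the Hugging Face transformers library and the public bert-base-uncased / bert-large-uncased checkpoints:

```python
from transformers import BertConfig

base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")
print(base.num_hidden_layers, base.num_attention_heads, base.hidden_size)     # 12 12 768
print(large.num_hidden_layers, large.num_attention_heads, large.hidden_size)  # 24 16 1024
```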
18. Illustration of how the BERT pre-trained architecture remains the same and only the fine-tuning layer architecture changes for different NLP tasks.
19. Related Work
ELMo: a pretrained, feature-based model (only the final layer weights change) for NLP tasks. Difference: ELMo uses LSTMs; BERT uses the Transformer, an attention-based model with positional encodings to represent word positions. ELMo also falls short because it is word-based and cannot handle OOV words.
OpenAI GPT: uses a left-to-right architecture where every token can only attend to previous tokens in the self-attention layers of the Transformer. It falls short because it cannot capture full bidirectional contextual knowledge.
20. How does BERT outperform others?
The paper Visualizing and Measuring the Geometry of BERT shows how BERT holds semantic and syntactic features of a text.
The paper aims to show how the attention matrices contain grammatical representations. Turning to semantics, using visualizations of the activations created by different pieces of text, it shows suggestive evidence that BERT distinguishes word senses at a very fine level.
BERT's internals consist of two parts. First, an initial embedding for each token is created by combining a pre-trained word piece embedding with position and segment information. Next, this initial sequence of embeddings is run through multiple transformer layers, producing a new sequence of context embeddings at each step. Implicit in each transformer layer is a set of attention matrices, one for each attention head, each of which contains a scalar value for each ordered pair (token_i, token_j). [SLIDE 11]
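A minimal sketch of the first part, the initial token embedding as the element-wise sum of word-piece, position, and segment embeddings (dimensions follow BERT-base; the embedding tables here are randomly initialised rather than pretrained, and the token ids are illustrative):

```python
import torch

vocab_size, max_pos, n_segments, hidden = 30522, 512, 2, 768
word_emb = torch.nn.Embedding(vocab_size, hidden)   # word-piece embeddings
pos_emb = torch.nn.Embedding(max_pos, hidden)       # position embeddings
seg_emb = torch.nn.Embedding(n_segments, hidden)    # segment (sentence A/B) embeddings

token_ids = torch.tensor([[101, 1045, 2253, 102]])          # illustrative ids for [CLS] i went [SEP]
positions = torch.arange(token_ids.size(1)).unsqueeze(0)    # 0, 1, 2, 3
segments = torch.zeros_like(token_ids)                      # all tokens in segment A

embeddings = word_emb(token_ids) + pos_emb(positions) + seg_emb(segments)
print(embeddings.shape)   # (1, 4, 768)
```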
21. Experiment for Syntax Representation
The experiment uses a corpus derived from the Penn Treebank (3.1M dependency relations). With the PyStanfordDependencies library we obtained the grammatical dependencies, then ran BERT-base through each sentence and obtained the model-wide attention vectors [SLIDE 9].
On this dataset, with a 30% test split, we achieve an accuracy of 85.8% on the binary probe and 71.9% on the multiclass probe.
Conclusion: the attention mechanism contains syntactic features.
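A sketch of what such a binary probe could look like: a linear classifier over the 144-dimensional attention features from the earlier probe_features sketch (the data below is a random placeholder, not the Penn Treebank relations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 144)           # one row of attention features per (token_i, token_j) pair
y = np.random.randint(0, 2, size=1000)   # 1 = a dependency relation exists between the two tokens

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(probe.score(X_test, y_test))       # probe accuracy on held-out pairs
```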
22. Geometry of Word Sense (Experiment)
On Wikipedia articles containing a query word, we applied a nearest-neighbour classifier where each neighbour is the centroid of a given word sense's BERT-base embeddings in the training data.
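A sketch of the nearest-centroid idea (the embeddings here are random placeholders; real ones would come from BERT-base as in the earlier sketches, and the sense labels are illustrative):

```python
import numpy as np

def nearest_sense(query_vec, sense_embeddings):
    # centroid of each sense's training embeddings; assign the closest one
    centroids = {sense: np.mean(vecs, axis=0) for sense, vecs in sense_embeddings.items()}
    return min(centroids, key=lambda s: np.linalg.norm(query_vec - centroids[s]))

sense_embeddings = {
    "bank (river)": [np.random.randn(768) for _ in range(10)],
    "bank (finance)": [np.random.randn(768) for _ in range(10)],
}
print(nearest_sense(np.random.randn(768), sense_embeddings))
```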
23. Conclusion
BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language
Processing. The fact that it’s approachable and allows fast fine-tuning will likely allow a wide
range of practical applications in the future.
Tested on our SupportLen data for text classification.
We have a priority column in SupportLen where we manually label whether a customer email is urgent or not.
On this dataset we used BERT-base-uncased.
Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}
24. Some FAQs on BERT
1. WHAT IS THE MAXIMUM SEQUENCE LENGTH OF THE INPUT?
512 tokens
2. OPTIMAL VALUES OF THE HYPERPARAMETERS USED IN FINE-TUNING
● Dropout – 0.1
● Batch Size – 16, 32
● Learning Rate (Adam) – 5e-5, 3e-5, 2e-5
● Number of epochs – 3, 4
3. HOW MANY LAYERS ARE FROZEN IN THE FINE-TUNING STEP?
No layers are frozen during fine-tuning. All the pre-trained layers along with the task-specific parameters are trained
simultaneously.
4. HOW LONG DID IT TAKE GOOGLE TO PRETRAIN BERT?
Google took 4 days to pretrain BERT with 16 TPUs.
25. ULMFiT: Universal Language Model Fine-tuning for Text Classification
The ULMFiT paper added an intermediate step in which the model is fine-tuned on text from the same domain as the target task. Performing the classification task with such a domain-adapted pretrained model results in better accuracy than simply using the BERT model alone. We too fine-tune BERT on our custom data; it took around 50 minutes per epoch on a Tesla K80 12GB GPU (P2 instance).
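A sketch of what this intermediate, domain-level adaptation step could look like, assuming the Hugging Face transformers library (the file name domain_emails.txt, the hyperparameters, and the output directory are illustrative assumptions): continue BERT's masked language-model pretraining on in-domain text before the task fine-tuning.

```python
from transformers import (BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling, LineByLineTextDataset)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# one in-domain sentence per line (illustrative file name)
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="domain_emails.txt",
                                block_size=128)
# applies the 15% masking rule on the fly
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="bert-domain", num_train_epochs=1),
                  data_collator=collator, train_dataset=dataset)
trainer.train()
```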
26. Future work and use cases that BERT can solve for us
+ Email prioritization
+ Sentiment analysis of reviews
+ Review tagging
+ Question answering for chatbot & community
+ Similar-products problem, where we currently use cosine similarity on description text.
The ULMFiT-style experiment is still to be tested, by fine-tuning BERT on our domain dataset.
Editor's Notes
Language discussion
Examples
BERT's power
What is bert
Data used and pretrain vs finetune
Talk about:-
Bank of the River
Bank account
As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
Word Piece Tokenizer: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
Understanding the Attention Patterns: https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
A positional embedding is also added to each token to indicate its position in the sequence.
The advantage of this method is that the Transformer does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every token.
NSP helps in Q&A and understand the relation b/w sentences.
State of the Art: the most recent stage in the development of a product, incorporating the newest ideas and features.
Parse Tree Embedding Concept- mathematical proof
Miscellaneous:-
Matthew Correlation Coefficient: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
Miscellaneous: What is a TPU? https://www.google.com/search?q=tpu+full+form&rlz=1C5CHFA_enIN835IN835&oq=TPU+full+form&aqs=chrome.0.0l6.3501j0j9&sourceid=chrome&ie=UTF-8
Which Activation is used in BERT?
https://datascience.stackexchange.com/questions/49522/what-is-gelu-activation
Gaussian Error Linear Unit
Demo of Google Collab:
Sentiment on Movie Reviews: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb?authuser=1#scrollTo=VG3lQz_j2BtD
Sentence Pairing and Sentence Classification: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb?authuser=1#scrollTo=0yamCRHcV-nQ
BERT FineTune on Data. https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning