Massive distillation of pre-trained language models like multilingual BERT, achieving 35x compression and 51x speedup (98% smaller and faster) while retaining 95% of the F1-score across 41 languages.
This document contains the text of the GATE CS exam from 1992. It includes 20 multiple choice questions and 9 short answer/essay questions covering topics in computer science such as computer architecture, algorithms, theory of computation and data structures. It also provides information about joining an online test preparation community that offers full length and section mock exams designed by IISc alumni to help students prepare for the GATE exam.
This document discusses strategies for scaling declarative language specifications to full-featured languages and real-world applications. One strategy is to identify a "core" language and translate new features down to the core features. Forwarding in attribute grammars supports both explicit and implicit specification of semantics via translation. Extensible language frameworks allow for pluggable domain-specific language extensions with their own syntax, analysis, and optimizations. Composable language specifications are needed for extensible languages, but modular analysis is also required to ensure safe composability of extensions.
Fault Tolerant Parallel Filters Based on BCH Codes (IJERA Editor)
Digital filters are used in signal processing and communication systems. In some cases, the reliability of those systems is critical, and fault tolerant filter implementations are needed. Over the years, many techniques that exploit the filters’ structure and properties to achieve fault tolerance have been proposed. As technology scales, it enables more complex systems that incorporate many filters. In those complex systems, it is common that some of the filters operate in parallel, for example, by applying the same filter to different input signals. Recently, a simple technique that exploits the presence of parallel filters to achieve multiple fault tolerance has been presented. In this brief, that idea is generalized to show that parallel filters can be protected using Bose–Chaudhuri–Hocquenghem (BCH) codes in which each filter is the equivalent of a bit in a traditional ECC. This new scheme allows more efficient protection when the number of parallel filters is large.
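The linearity trick behind such schemes can be sketched in a few lines of Python. This is a minimal illustration, not the paper's BCH construction: four parallel FIR filters are protected with three Hamming(7,4)-style parity filters, and a single faulty filter output is located from the syndrome and recomputed. All names (`fir`, `protect`, `correct`, `CHECKS`) are our own.

```python
def fir(x, h):
    # Direct-form FIR filter: full convolution of input x with taps h.
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# Hamming(7,4)-style parity groups over 4 parallel filters: each tuple
# lists the data-filter indices that one parity filter covers.
CHECKS = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]

def protect(inputs, h):
    # Run the 4 data filters, plus one parity filter per check over the
    # summed inputs (by linearity, its output equals the summed outputs).
    y = [fir(x, h) for x in inputs]
    z = []
    for group in CHECKS:
        s = [sum(inputs[i][n] for i in group) for n in range(len(inputs[0]))]
        z.append(fir(s, h))
    return y, z

def correct(y, z, tol=1e-9):
    # Compare each parity output with the sum it should equal; the set of
    # failing checks (the syndrome) identifies the single faulty filter,
    # whose output is then rebuilt from any parity filter covering it.
    failing = set()
    for k, (group, zk) in enumerate(zip(CHECKS, z)):
        expected = [sum(y[i][n] for i in group) for n in range(len(zk))]
        if any(abs(a - b) > tol for a, b in zip(zk, expected)):
            failing.add(k)
    if not failing:
        return y
    fault = next(i for i in range(4)
                 if {k for k, g in enumerate(CHECKS) if i in g} == failing)
    k = next(iter(failing))
    y[fault] = [z[k][n] - sum(y[i][n] for i in CHECKS[k] if i != fault)
                for n in range(len(z[k]))]
    return y
```

A real BCH-based scheme replaces the hand-written parity groups with the code's parity-check matrix, which scales to many more filters and multiple simultaneous faults, but the structure is the same.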
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ... (Jinho Choi)
This paper presents an approach to enhance multilingual dependency parsing using transformer encoders. Specifically, it adapts multilingual BERT and language-specific transformers like ALBERT and RoBERTa as encoders. A biaffine decoder is used to extract features and classify arcs and labels. The approach performs dependency tree parsing and graph parsing, and also uses an ensemble method combining the two for improved results. Evaluation on 13 languages shows the multilingual approach outperforms most language-specific models, and ensemble parsing achieves best performance overall.
Reed Solomon Coding for Error Detection and Correction (inventionjournals)
This document discusses Reed-Solomon coding for error detection and correction. It provides an overview of Reed-Solomon codes, including that they are a type of linear block code well-suited for correcting burst errors. The encoding and decoding processes are described at a high-level, involving the generation and root-finding of polynomials over Galois fields. Applications mentioned include data storage, satellite communications, barcodes, and CD audio. In summary, the document provides a technical introduction to Reed-Solomon coding, the encoding and decoding algorithms, and common uses in digital communication and storage systems.
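The polynomial view mentioned above is easy to demonstrate. The sketch below, with names of our own choosing, encodes a message by evaluating it as a polynomial over the small prime field GF(17) (real deployments use GF(2^8)) and recovers it from any k intact symbols by Lagrange interpolation; for brevity it decodes erasures only, not errors at unknown positions.

```python
P = 17  # small prime field GF(17) for readability; real RS codes use GF(2^8)

def poly_eval(coeffs, x):
    # Horner evaluation of a polynomial (lowest-degree coefficient first).
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def rs_encode(msg, n):
    # Classic "evaluation" view of Reed-Solomon: the codeword is the message
    # polynomial evaluated at n distinct points; the n - len(msg) extra
    # evaluations are the redundancy.
    return [poly_eval(msg, x) for x in range(1, n + 1)]

def rs_decode_erasures(codeword, positions, k):
    # Recover the k message coefficients from any k intact evaluations
    # via Lagrange interpolation (erasure decoding only).
    xs = [p + 1 for p in positions]
    ys = [codeword[p] for p in positions]
    coeffs = [0] * k
    for i in range(k):
        basis, denom = [1], 1              # Lagrange basis for point xs[i]
        for j in range(k):
            if j == i:
                continue
            new = [0] * (len(basis) + 1)
            for d, b in enumerate(basis):  # multiply basis by (x - xs[j])
                new[d] = (new[d] - xs[j] * b) % P
                new[d + 1] = (new[d + 1] + b) % P
            basis = new
            denom = (denom * (xs[i] - xs[j])) % P
        scale = (ys[i] * pow(denom, P - 2, P)) % P  # Fermat modular inverse
        for d in range(k):
            coeffs[d] = (coeffs[d] + scale * basis[d]) % P
    return coeffs
```

Correcting errors at unknown positions needs the syndrome-based machinery (Berlekamp-Massey, Chien search) that production decoders use; the field arithmetic, however, is exactly what this sketch shows.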
Please download the deck and view it in Slideshow mode; that is where the animation moves the telemetry frame elements around to illustrate the inner workings.
JPL / NASA Deep Space Network Telemetry
This document discusses lexical analysis in compilers. It begins with an outline of the topics to be covered, including lexical analysis, regular expressions, finite state automata, and the process of converting regular expressions to deterministic finite automata (DFAs). It then provides more details on each phase of a compiler and the role of lexical analysis. Key aspects of lexical analysis like tokenizing source code and classifying tokens are explained. The document also covers implementation of regular expressions using non-deterministic finite automata (NFAs) and their conversion to equivalent DFAs using techniques like epsilon-closure and transition tables.
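The epsilon-closure and subset-construction steps the summary refers to fit in a few lines. Below is a sketch over an illustrative epsilon-NFA of our own (recognizing a*b); `'eps'` marks epsilon edges, and `nfa_to_dfa` builds the DFA transition table by repeatedly taking epsilon-closures of move sets.

```python
# Tiny epsilon-NFA recognizing a*b; states and layout are illustrative.
NFA = {
    0: {'eps': {1}},
    1: {'a': {1}, 'b': {2}},
}
ACCEPT = {2}

def eps_closure(states):
    # All states reachable from `states` via epsilon moves alone.
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in NFA.get(q, {}).get('eps', set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def move(states, ch):
    # One-symbol NFA step (without epsilon moves).
    out = set()
    for q in states:
        out |= NFA.get(q, {}).get(ch, set())
    return out

def nfa_accepts(s):
    current = eps_closure({0})
    for ch in s:
        current = eps_closure(move(current, ch))
    return bool(current & ACCEPT)

def nfa_to_dfa(alphabet):
    # Subset construction: each DFA state is an epsilon-closed set of
    # NFA states, and the table maps (state set, symbol) -> state set.
    start = eps_closure({0})
    table, work = {}, [start]
    while work:
        S = work.pop()
        if S in table:
            continue
        table[S] = {}
        for ch in alphabet:
            T = eps_closure(move(S, ch))
            table[S][ch] = T
            if T not in table:
                work.append(T)
    return start, table
```

A lexer generator runs this construction once per token class, then merges and minimizes the resulting DFAs into a single transition table.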
"Assembly Language Programming and Organization of the IBM PC" by Ytha Yu (Education)
This document contains solutions to chapters 1 through 10 of a manual on assembly language programming and organization of the IBM PC. It includes contents, chapter summaries, programming exercises and their solutions. Appendices provide information on how to run programs and some useful procedures.
LDPC Encoding and Hamming Encoding using MATLAB.
An LDPC code is a linear block code characterised by a very sparse parity-check matrix, that is, a matrix with a very low density of 1s, hence the name “low-density parity-check” code. This sparseness is what allows LDPC codes to achieve excellent performance in terms of bit error rates.
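To make the sparsity idea concrete, here is a toy parity-check matrix and syndrome computation in Python. The (6,3) matrix below is far too small to be a real LDPC code; it only illustrates that each check touches few bits and that a nonzero syndrome flags errors.

```python
# Toy parity-check matrix H: rows are checks, columns are codeword bits.
# Each row has only 3 ones, mimicking at miniature scale the sparsity
# real LDPC matrices have at block lengths of thousands of bits.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def syndrome(word):
    # Each syndrome bit is the GF(2) sum (XOR) of the bits its check
    # covers; an all-zero syndrome means every parity check is satisfied.
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]
```

For example, `[1, 0, 1, 1, 1, 0]` satisfies all three checks, while flipping its first bit violates checks 1 and 3; LDPC decoders iterate belief propagation over exactly this bit/check structure.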
Transformer Seq2Seq Models: Concepts, Trends & Limitations (DLI) (Deep Learning Italia)
This document provides an overview of transformer seq2seq models, including their concepts, trends, and limitations. It discusses how transformer models have replaced RNNs for seq2seq tasks due to being more parallelizable and effective at modeling long-term dependencies. Popular seq2seq models like T5, BART, and Pegasus are introduced. The document reviews common pretraining objectives for seq2seq models and current trends in larger model sizes, task-specific pretraining, and long-range modeling techniques. Limitations discussed include the need for grounded representations and efficient generation for seq2seq models.
Evolution of Structure of Some Binary Group-Based N-Bit Comparator, N-To-2N De... (VLSICS Design)
Reversible logic has attracted substantial interest due to its low power consumption, which is the main concern of low-power VLSI systems. In this paper, a novel 4x4 reversible gate called the inventive gate is introduced, and using this gate, 1-bit, 2-bit, 8-bit, 32-bit and n-bit group-based reversible comparators are constructed with low values of the reversible parameters. MOS transistor realizations of the 1-bit, 2-bit, and 8-bit reversible comparators are also presented, along with their power, delay and power-delay product (PDP) for an appropriate aspect ratio W/L. The novel inventive gate can also be used as an n-to-2n decoder. The novel reversible circuit design style is compared with existing ones; the comparative results show that the novel gate has wide utility and that the group-based reversible comparator outperforms existing designs in terms of number of gates, garbage outputs and constant inputs.
This document discusses extracting variability from C code and lifting it to the mbeddr platform. It analyzes several large C codebases to understand how variability is implemented and which constructs can be lifted. The analysis finds that typical preprocessor constructs use identifiers, integers, and logical/comparison operations. Over 90% of symbols behave like constants that could be modeled as configuration parameters. While #error and #warning directives are present, they only make up a small portion of preprocessor statements. The results inform how to model variability in mbeddr's higher-level configuration language. A case study of the ChibiOS real-time OS module shows its use of preprocessor definitions and conditions.
JetBrains MPS: Projectional Editing in Domain-Specific Languages (Oscar Rodriguez)
JetBrains MPS is a language workbench that allows for projectional editing in domain-specific languages. It enables the combination of multiple languages in code by separating a language's definition into aspects like abstract syntax, concrete syntax, and code generation. This allows languages to have rich syntaxes and notations without being limited by parsing. MPS uses model-to-model and text transformations to map between language concepts and implementation domains through code generation.
The document discusses turbo codes, a type of forward error correction code. Turbo codes use parallel concatenated convolutional encoders with pseudorandom interleaving and iterative decoding. This structure allows turbo codes to achieve error correction performance very close to the theoretical limit defined by the Shannon capacity. Some key advantages of turbo codes are their remarkable power efficiency and ability to support delivery of multimedia services through design tradeoffs. However, turbo codes also have long latency and poor performance at very low bit error rates.
The document discusses Reed-Solomon codes, an error-correcting code invented in 1960 that remains widely used. It describes how Reed-Solomon codes work by adding redundant bits that allow the decoder to detect and correct a certain number of errors by processing each data block. Applications include data storage, wireless communications, QR codes, and more. The document also covers topics like symbol errors, decoding procedures, implementation methods, and finite field arithmetic used.
This document discusses using decipherment techniques to improve machine translation when parallel data is scarce. It presents an overview of machine translation pipelines and notes that performance drops when parallel data is limited. The document proposes using monolingual data to improve machine translation in real-world scenarios with limited parallel data. It outlines contributions including fast, accurate decipherment of over 1 billion tokens with 93% accuracy, and using decipherment to improve machine translation for domain adaptation and low-resource languages.
This document proposes a new approach called contextual parameter generator (CPG) for multilingual neural machine translation that generates parameters for the system based on language embeddings. The CPG allows for semi-supervised learning, parameter sharing across similar languages, and achieves state-of-the-art results on two datasets while using less data and computation compared to existing approaches.
Prolog is a logic programming language invented in the 1970s. It uses a declarative programming paradigm where programs describe relations and logic, rather than algorithms. The document provides an introduction to Prolog, covering its key features like facts, rules, questions, terms, backtracking, recursion, and lists. It also discusses resolution and unification, depth first search, scope, type systems, and bindings in Prolog. Examples demonstrate implementing the Towers of Hanoi problem and a non-deterministic finite state automaton in Prolog.
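For comparison with the Prolog version mentioned above, the Towers of Hanoi recursion can be transliterated into Python (our own sketch, not taken from the document):

```python
def hanoi(n, src, aux, dst):
    # Move n-1 disks out of the way, move the largest disk, then move
    # the n-1 disks back on top of it: the same recursion a Prolog
    # hanoi/4 predicate expresses with two recursive subgoals.
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi(n - 1, aux, src, dst))
```

Prolog derives the same move list declaratively through unification over the two subgoals; in both formulations, n disks require 2^n - 1 moves.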
Scaling Up AI Research to Production with PyTorch and MLflow (Databricks)
PyTorch, the popular open-source ML framework, has continued to evolve rapidly since the introduction of PyTorch 1.0, which brought an accelerated workflow from research to production.
Deep-learning based Language Understanding and Emotion Extractions (Jeongkyu Shin)
This document discusses natural language understanding and emotion extraction using deep learning. It summarizes SyntaxNet and DRAGNN, which are frameworks for natural language processing using deep learning. It then discusses using these frameworks to build models for part-of-speech tagging, dependency parsing, and language understanding. It also discusses building models for extracting emotions from text using techniques like SentiWordNet and modeling emotions in a vector space.
Recent Advances in Natural Language Processing (Apache MXNet)
The document provides an overview of recent advances in natural language processing (NLP), including traditional methods like bag-of-words models and word2vec, as well as more recent contextualized word embedding techniques like ELMo and BERT. It discusses applications of NLP like text classification, language modeling, machine translation and question answering, and how different models like recurrent neural networks, convolutional neural networks, and transformer models are used.
MaterialX has been integrated into MayaUSD and ArnoldUSD in the following ways:
- MaterialX definitions can be imported and exported from MayaUSD to enable interoperability via UsdShade and the MaterialX Preview Surface.
- The standard MaterialX shader generator is used to visualize MaterialX materials directly in Maya's viewport without translation, leveraging the same GLSL code generator as hdStorm.
- A plugin architecture allows custom MaterialX to Preview Surface translations to be defined and used for import and export between DCC tools and renderers.
Generating super resolution images using transformers (Neeraj Baghel)
The document summarizes a research paper on the transformer architecture for natural language processing tasks. Some key points:
- Transformers use attention mechanisms to draw global dependencies between input and output without regard to sequence length, addressing limitations of RNNs and CNNs for NLP tasks.
- The proposed transformer architecture contains self-attention layers in the encoder and decoder, as well as an attention mechanism between the encoder and decoder.
- The transformer uses scaled dot-product attention and multi-head attention. Self-attention allows relating different positions of a single sequence to compute representations.
- Other components include feedforward layers and positional encoding to inject information about the relative or absolute positions of the tokens in the sequence.
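The scaled dot-product attention described in these bullets is compact enough to sketch directly. A minimal pure-Python version (single head, no masking or learned projections; Q, K, V are plain lists of row vectors):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        # Output row = attention-weighted average of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

Multi-head attention simply runs several of these in parallel on learned projections of Q, K, V and concatenates the results; self-attention is the case where Q, K, V all come from the same sequence.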
This document presents research on developing an automatic speech recognition system for Egyptian Arabic. It explores using a diacritized lexicon versus an undiacritized graphemic lexicon, and compares different acoustic models. The researchers used the Al-Jazeera Egyptian dialect corpus and evaluated performance using word error rate. Their results showed that a graphemic lexicon outperformed a diacritized lexicon, and that reducing out-of-vocabulary words and using deep neural network acoustic models improved word error rate.
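Word error rate, the metric used in that evaluation, is the word-level edit distance (substitutions, insertions, deletions) between reference and hypothesis, divided by the reference length. A standard dynamic-programming sketch:

```python
def wer(ref, hyp):
    # Word error rate: Levenshtein distance over words / reference length.
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why out-of-vocabulary reduction matters so much in the dialectal setting the researchers describe.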
The document summarizes region inference techniques for an object-oriented language. It discusses representing regions as type parameters for classes and methods, enforcing no dangling reference properties as class invariants, handling region subtyping and polymorphism, localizing regions that don't escape blocks, maintaining constraints during method overriding and downcasting, and presenting experimental results showing the approach is fast and competitive with hand-annotation.
A Distributed Tableau Algorithm for Package-based Description Logics (Jie Bao)
The document describes a distributed tableau algorithm for reasoning with modular ontologies expressed in Package-based Description Logics (P-DL). The algorithm uses multiple local reasoners, each maintaining a local tableau for a single ontology module. Local reasoners communicate by querying each other or reporting clashes to collectively construct a global tableau without fully integrating the modules. The algorithm is proven sound and complete for P-DL with acyclic module importing. It can support reasoning across modules to answer queries.
System 1 and System 2 were basic early systems for image matching that used color and texture matching. Descriptor-based approaches like SIFT provided more invariance but not perfect invariance. Patch descriptors like SIFT were improved by making them more invariant to lighting changes like color and illumination shifts. The best performance came from combining descriptors with color invariance. Representing images as histograms of visual word occurrences captured patterns in local image patches and allowed measuring similarity between images. Large vocabularies of visual words provided more discriminative power but were costly to compute and store.
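The bag-of-visual-words representation reduces to counting quantized patch descriptors. A minimal sketch, assuming descriptors have already been mapped to integer visual-word IDs by some vector quantizer:

```python
def bow_histogram(word_ids, vocab_size):
    # Normalized histogram of visual-word occurrences for one image.
    h = [0] * vocab_size
    for w in word_ids:
        h[w] += 1
    total = sum(h)
    return [c / total for c in h]

def hist_intersection(h1, h2):
    # Histogram-intersection similarity: 1.0 for identical normalized
    # histograms, 0.0 for histograms with disjoint support.
    return sum(min(a, b) for a, b in zip(h1, h2))
```

With vocabularies of 10^4 to 10^6 visual words these histograms are stored sparsely, which is exactly the cost-versus-discriminative-power trade-off the summary mentions.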
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. In this lecture, we will cover some of the fundamentals of neural representation learning for text retrieval. We will also discuss some of the recent advances in the applications of deep neural architectures to retrieval tasks.
(These slides were presented at a lecture as part of the Information Retrieval and Data Mining course taught at UCL.)
Classification of CNN.com Articles using a TF*IDF MetricMarie Vans
This document summarizes a research paper that classified CNN news articles using term frequency-inverse document frequency (TF-IDF) metrics. It discusses the TF-IDF family of metrics used to calculate word frequencies, describes preprocessing steps and algorithms for classification. The dataset contained 3,000 news articles from 12 CNN categories split into training and test sets. Keywords were extracted and weighted using various TF-IDF variants to classify articles by comparing words to the training set categories.
This document summarizes structured prediction and structured large margin estimation approaches. It discusses how structured prediction can model complex, correlated outputs like sequences, trees, and matchings. It presents a min-max formulation that casts structured prediction as a linear program for inference, allowing joint training with large margin methods. This provides tractable learning for problems like conditional random fields, context-free grammars, and associative Markov networks.
An optimal and progressive algorithm for skyline queries slideWooSung Choi
The document presents an optimal and progressive algorithm for processing skyline queries using an R-tree index. It discusses two strategies - recursive nearest neighbor queries and a branch and bound skyline algorithm. The recursive NN query approach requires additional processing to eliminate duplicate results for higher dimensions, while the branch and bound skyline algorithm prunes non-skyline points during traversal to directly generate the skyline without duplicates. The algorithm processes the R-tree in a best-first manner by maintaining a priority queue of tree nodes ordered by their minimum possible skyline size.
A machine-learning view on heterogeneous catalyst design and discoveryIchigaku Takigawa
Telluride Workshop on Computational Materials Chemistry, Telluride, Colorado, USA, July 1, 2021.
https://research.chem.ucr.edu/groups/jiang/Telluride_Workshop.html
https://www.telluridescience.org/meetings/workshop-details?wid=901
https://www.telluridescience.org/meetings/workshop-details?wid=945
The document provides information about Shivananda R Koteshwar, a teacher who works in semiconductors. It discusses the evolution of semiconductors from early vacuum tube computers that took up large spaces and weighed tons, to modern integrated circuits with billions of transistors in compact devices. The document also covers topics like programmable logic devices, hardware description languages, digital design with FPGAs and state machines.
Similar to XtremeDistil: Multi-stage Distillation for Massive Multilingual Models (20)
OpenTag: Open Attribute Value Extraction From Product ProfilesSubhabrata Mukherjee
Guineng Zheng, Subhabrata Mukherjee, Xin Luna Dong, Feifei Li
KDD 2018, London, UK
OpenTag brings deep learning and active learning together for state-of-the-art imputation and open entity extraction system.
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Co...Subhabrata Mukherjee
One of the major hurdles preventing the full exploitation of information from online communities is the widespread concern regarding the quality and credibility of user-contributed content. We propose probabilistic graphical models that can leverage the joint interplay between multiple factors --- like user interactions, community dynamics, and textual content --- to automatically assess the credibility of user-contributed online information, expertise of users and their evolution with user-interpretable explanation. We devise new models based on Conditional Random Fields that enable applications such as extracting reliable side-effects of drugs from user-contributed posts in health forums, and identifying credible news articles in news forums.
Online communities are dynamic, as users join and leave, adapt to evolving trends, and mature over time. To capture this dynamics, we propose generative models based on Hidden Markov Model, Latent Dirichlet Allocation, and Brownian Motion to trace the continuous evolution of user expertise and their language model over time. This allows us to identify expert users and credible content jointly over time, improving state-of-the-art recommender systems by explicitly considering the maturity of users. This enables applications such as identifying useful product reviews, and detecting fake and anomalous reviews with limited information.
Continuous Experience-aware Language Model
Subhabrata Mukherjee, Stephan Günnemann and Gerhard Weikum
Proc. of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 2016
Experience aware Item Recommendation in Evolving Review CommunitiesSubhabrata Mukherjee
Experience aware Item Recommendation in Evolving Review Communities
Subhabrata Mukherjee, Hemank Lamba and Gerhard Weikum
IEEE International Conference in Data Mining (ICDM) 2015
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...Subhabrata Mukherjee
Subhabrata Mukherjee, Jitendra Ajmera and Sachindra Joshi.
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construction from Corpus
Proc. of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). 2014.
Leveraging Joint Interactions for Credibility Analysis in News CommunitiesSubhabrata Mukherjee
Leveraging Joint Interactions for Credibility Analysis in News Communities,
Subhabrata Mukherjee and Gerhard Weikum,
Max Planck Institute for Informatics,
CIKM 2015
People on Drugs: Credibility of User Statements in Health ForumsSubhabrata Mukherjee
People on Drugs: Credibility of User Statements in Health Communities. Subhabrata Mukherjee, Gerhard Weikum and Cristian Danescu-Niculescu-Mizil. Proc. of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 2014
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of R...Subhabrata Mukherjee
Author-Specific Hierarchical Sentiment Aggregation for Rating Prediction of Reviews, Subhabrata Mukherjee and Sachindra Joshi, In Proc. of the 9th edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, May 26-31, 2014
Joint Author Sentiment Topic Model, Subhabrata Mukherjee, Gaurab Basu and Sachindra Joshi, In Proc. of the SIAM International Conference in Data Mining (SDM 2014), Pennsylvania, USA, Apr 24-26, 2014 [http://people.mpi-inf.mpg.de/~smukherjee/jast.pdf]
TwiSent: A Multi-Stage System for Analyzing Sentiment in TwitterSubhabrata Mukherjee
TwiSent: A Multi-Stage System for Analyzing Sentiment in Twitter, Subhabrata Mukherjee, Akshat Malu, Balamurali A.R. and Pushpak Bhattacharyya, In Proceedings of The 21st ACM Conference on Information and Knowledge Management (CIKM 2012), Hawai, Oct 29 - Nov 2, 2012 (http://www.cse.iitb.ac.in/~pb/papers/cikm2012-twisent.pdf)
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Languag...Subhabrata Mukherjee
Adaptation of Sentiment Analysis to New Linguistic Features, Informal Language Form and World Knowledge, Subhabrata Mukherjee and Pushpak Bhattacharyya, Master's Thesis, IIT Bombay, Dept. of Computer Science and Engineering
Leveraging Sentiment to Compute Word Similarity, Balamurali A.R., Subhabrata Mukherjee, Akshat Malu and Pushpak Bhattacharyya, In Proceedings of the 6th International Global Wordnet Conference (GWC 2011), Matsue, Japan, Jan, 2012 (http://www.cse.iitb.ac.in/~pb/papers/gwc12-sense-sa.pdf)
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarizat...Subhabrata Mukherjee
WikiSent : Weakly Supervised Sentiment Analysis Through Extractive Summarization With Wikipedia, Subhabrata Mukherjee and Pushpak Bhattacharyya, In Proceedings of the European Conference on Machine Learning (ECML PKDD 2012), Bristol, U.K., 24-28 Sept, 2012 (http://www.cs.bris.ac.uk/~flach/ECMLPKDD2012papers/1125567.pdf)
Feature Specific Sentiment Analysis for Product Reviews, Subhabrata Mukherjee and Pushpak Bhattacharyya, In Proceedings of the 13th International Conference on Intelligent Text Processing and Computational Intelligence (CICLING 2012), New Delhi, India, March, 2012 (http://www.cse.iitb.ac.in/~pb/papers/cicling12-feature-specific-sa.pdf)
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data...Subhabrata Mukherjee
YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data & User Comments using WordNet & Wikipedia, Subhabrata Mukherjee and Pushpak Bhattacharyya, In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), IIT Bombay, Mumbai, Dec 8 - Dec 15, 2012 (Long Paper)
Sentiment Analysis in Twitter with Lightweight Discourse AnalysisSubhabrata Mukherjee
Sentiment Analysis in Twitter with Lightweight Discourse Analysis, Subhabrata Mukherjee and Pushpak Bhattacharyya, In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), IIT Bombay, Mumbai, Dec 8 - Dec 15, 2012 (http://www.cse.iitb.ac.in/~pb/papers/coling12-discourse-sa.pdf)
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Project Management Semester Long Project - Acuityjpupo2018
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
1. MICROSOFT RESEARCH AI
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
Subhabrata Mukherjee, Ahmed H. Awadallah
ACL 2020
2. Language Model Pre-Training
Pre-training: representations are learned via self-supervision (language modeling objectives).
Representation transfer: fine-tune on a target task (e.g., paraphrase detection, Q&A, sentiment) on labeled data (e.g., with a cross-entropy loss objective).
3. Pre-trained Model Complexity
Timeline of pre-trained models:
• 2017: CoVe
• 2018: ELMo, OpenAI GPT, BERT
• 2019: OpenAI GPT-2, NVIDIA Megatron, Google T5
• 2020: Turing-NLG, OpenAI GPT-3
4. Pre-trained Model Complexity
Increase in model parameters:
• CoVe (2017): 30M params
• ELMo (2018): ~90M params
• OpenAI GPT (2018): ~100M params
• BERT (2018): 340M params
• OpenAI GPT-2 (2019): 1.5B params
• NVIDIA Megatron (2019): 8B params
• Google T5 (2019): 11B params
• Turing-NLG (2020): 17B params
• OpenAI GPT-3 (2020): 175B params
5. Knowledge Distillation to Mimic the Teacher's Output
Teacher: huge pre-trained model. Student: shallow model. Transfer set: unlabeled data.
(Ba and Caruana, 2014; Romero et al., 2014; Hinton, Vinyals and Dean, 2015)
The student can learn from two kinds of teacher signal:
• Logits: log-probability values over classes (before the softmax); these capture uncertainty better than hard labels.
• Internal representations: hint-based guidance from the teacher.
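The difference between hard labels and logit-based soft targets can be sketched in a few lines of numpy (a minimal sketch: the 3-class setup and all logit values are hypothetical, not from the paper):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical teacher logits over 3 classes for two inputs.
teacher_logits = np.array([[4.0, 1.0, 0.5],
                           [1.2, 1.0, 0.9]])

hard_labels = teacher_logits.argmax(axis=-1)   # all a hard label keeps: [0, 0]
soft_targets = softmax(teacher_logits)         # what distillation also keeps

# The second input is genuinely ambiguous; its soft target spreads mass over
# all classes, while the hard label discards that uncertainty entirely.
```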
6. Knowledge Distillation
Pipeline:
(1) LM pre-training (optional).
(2) Task-specific training of the teacher on labeled data with cross-entropy loss.
(3) The teacher generates logits and representations on unlabeled data.
(4) Task-specific soft training of the student on this augmented data.
7. Task: Multi-lingual Named Entity Recognition
Identify PER, ORG and LOC entities from multiple languages jointly. Examples (tokens with IOB2 tags):
• "R.H. Saunders ( St. Lawrence River ) ( 968 MW )" → B-ORG I-ORG O B-ORG I-ORG I-ORG O O O O O
• "Широко распространён в Австралии , где его выращивают на срезку ." → O O O B-LOC O O O O O O O
• "Jakers ! Aventurile lui Piggley Winks" → B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG
We adopt multilingual BERT, pre-trained on 104 languages in Wikipedia, as the teacher.
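As a small illustration of the IOB2 tagging scheme used above, a hypothetical helper (`iob2_spans` is not from the paper) that recovers entity spans from tagged tokens:

```python
def iob2_spans(tokens, tags):
    """Extract (entity_type, text) spans from IOB2-tagged tokens."""
    spans, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always opens a new span, closing any span in progress.
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            # I- continues a span of the same type.
            cur_toks.append(tok)
        else:
            # "O" (or an inconsistent I- tag) closes the current span.
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_type:
        spans.append((cur_type, " ".join(cur_toks)))
    return spans
```

For the last example sentence above, this yields a single ORG span covering all six tokens.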
8. Knowledge Distillation (1/2)
Student architecture: shared word embedding layer → shared RNN layer → shared feed-forward layer, producing token output z_s(x_k) with p_s(x_k) = softmax(z_s(x_k)). Given labeled data {x_l, y_l} ∈ D_l (T: teacher, s: student), the student is trained with cross-entropy loss on the labeled data:

multitask loss = − Σ_{x ∈ D_l} Σ_{x_k ∈ x} y_k log p_s(x_k)
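The labeled cross-entropy term above can be sketched in numpy (a minimal sketch; the function name `token_ce_loss`, the array shapes, and the example logits are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def token_ce_loss(student_logits, labels):
    """Token-level cross-entropy: -y_k * log p_s(x_k), averaged over tokens.

    student_logits: (num_tokens, num_tags) array of z_s(x_k)
    labels: (num_tokens,) gold tag indices y_k
    """
    p = softmax(student_logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels])))

# Confident, correct predictions give a loss near 0;
# uniform logits give log(num_tags).
logits = np.array([[9.0, 0.0, 0.0],
                   [0.0, 9.0, 0.0]])
loss = token_ce_loss(logits, np.array([0, 1]))
```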
9. Knowledge Distillation (2/2)
Student architecture as before (shared word embedding layer → shared BiLSTM layer → shared feed-forward layer), now with two heads over the token output z_s(x_k): a softmax head p_s(x) = softmax(·) for the labeled cross-entropy loss, and a linear head p_s(x_k) = linear(·) for regressing the teacher's logits. Given labeled data {x_k, y_k} ∈ D_l and unlabeled transfer data {x_k, logit_T(y_k), z_T(x_k)} ∈ D_u (T: teacher, s: student):

multitask loss = CE_loss + Σ_{x ∈ D_u} Σ_{x_k ∈ x} [ ||logit_T(y_k) − p_s(x_k)||² + KLD( z_T(x_k) || z_s(x_k) ) ]

i.e., cross-entropy loss on labeled data + mean-squared logit loss on unlabeled data + representation loss on unlabeled data.
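A rough sketch of how the three loss terms might combine (illustrative only: the reduction over tokens, the normalization of representations inside the KL term, and all function names are assumptions, not the paper's implementation):

```python
import numpy as np

def mse_logit_loss(teacher_logits, student_out):
    # ||logit_T(y_k) - p_s(x_k)||^2, summed over all entries.
    return float(np.sum((teacher_logits - student_out) ** 2))

def kld(z_teacher, z_student, eps=1e-9):
    # KL divergence between teacher and student representations, treated
    # here as normalized non-negative vectors (an illustrative choice).
    p = z_teacher / z_teacher.sum()
    q = z_student / z_student.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def multitask_loss(ce, teacher_logits, student_out, z_teacher, z_student):
    # CE on labeled data + logit MSE + representation KLD on unlabeled data.
    return ce + mse_logit_loss(teacher_logits, student_out) + kld(z_teacher, z_student)

# Sanity check: a student that matches the teacher exactly contributes
# nothing beyond the supervised cross-entropy term.
z = np.array([0.2, 0.3, 0.5])
total = multitask_loss(0.7, np.ones(3), np.ones(3), z, z)
```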
10. Multi-Stage Distillation with Labeled (L) and Unlabeled (U) Data
• Joint: α·CE_loss(L) + β·Logit_loss(U) + γ·Rep_loss(U)
• 2-Stage: Stage 1: Rep_loss(U); Stage 2: α·CE_loss(L) + β·Logit_loss(U)
• 3-Stage: Stage 1: Rep_loss(U); Stage 2: Logit_loss(U); Stage 3: CE_loss(L)
11. Hyper-parameter-free Loss Combination
The 3-stage schedule needs no loss-weighting hyper-parameters:
• Stage 1: Rep_loss(U) — general representation
• Stage 2: Logit_loss(U) — task-specific tuning
• Stage 3: CE_loss(L) — task-specific tuning
12. Multi-Stage Distillation: Caveats
• Catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999): the model forgets information from earlier stages/tasks.
• General solution: update model parameters progressively*
  1. progressively in time (freezing)
  2. progressively in intensity (lower learning rates)
  3. progressively vs. the pre-trained model (regularization)
• Our approach: multi-stage distillation of the student with gradual unfreezing and a cosine learning rate schedule [1 + 2 + 3]
*https://ruder.io/state-of-transfer-learning-in-nlp/
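The two progressive-update mechanisms above can be sketched as follows (a hedged sketch; the layer-group names and schedule shape are illustrative assumptions, not the paper's exact configuration):

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine learning-rate schedule decaying from lr_max to lr_min."""
    t = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Gradual unfreezing: train the top layer group first, then progressively
# unfreeze lower groups in later stages (group names are illustrative).
LAYER_GROUPS = ["word_embedding", "bilstm", "feedforward"]

def trainable_groups(stage):
    """Stage 1 trains only the top group; each later stage unfreezes one more."""
    return LAYER_GROUPS[-stage:]
```

Combining both, each distillation stage starts from the full learning rate and decays it, while lower layers join training only in later stages.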
13. Massive Multi-lingual NER F1 over 41 Languages
Dataset: WikiAnn [1,2] with 705K train, 329K dev and 329K test sequences in IOB2 format with PER, ORG, LOC tags.
[1] Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. Cross-lingual name tagging and linking for 282 languages. ACL 2017.
[2] Afshin Rahimi, Yuan Li, and Trevor Cohn. Massively multilingual transfer for NER. ACL 2019.
Refer to the paper for text classification datasets and experiments.
14. XtremeDistil Summary: Distilled mBERT for NER
• Language-agnostic: a single model for all languages
• >35x parameter compression
• >51x latency speedup for batch inference
• 95% of mBERT F1 for NER over 41 languages
15. Parameter Compression (x) vs. F1 against mBERT
[Figure: scatter plot of parameter compression (0–40x) against F1 measure (84–89) for student configurations, where (E,H) denotes word embedding dimension and BiLSTM hidden states; the 95%-of-mBERT-F1 level is marked.]
16. Inference Speedup (x) vs. F1 against mBERT
[Figure: scatter plot of inference speedup (0–80x) against F1 measure (84–89) for student configurations, where (E,H) denotes word embedding dimension and BiLSTM hidden states; the 95%-of-mBERT-F1 level is marked. Config: batch_size=32 on a single P100 GPU.]
17. XtremeDistil NER F1 (41 Lang.): Model Features vs. Transfer Data
• Model features: performance increases with internal representations and the learning schedule.
• Unlabeled transfer data: performance increases with more unlabeled transfer data until saturation.
18. Which teacher layer to distil from?
Higher layers provide more task-specific knowledge, but they are harder for a shallow student to distil.
19. Which (multi-lingual) word embeddings to initialize with?
• Random initialization works well.
• Singular Value Decomposition (SVD) for dimensionality reduction over fine-tuned mBERT word embeddings obtains compression plus better performance.
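The SVD reduction can be sketched with numpy (a random matrix stands in for the fine-tuned mBERT embeddings; the vocabulary size and target dimension are illustrative):

```python
import numpy as np

# Stand-in for fine-tuned mBERT word embeddings (vocab 1000, dim 768);
# in practice these would be loaded from the fine-tuned teacher.
rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 768))

k = 300  # target student embedding dimension (illustrative choice)
U, S, Vt = np.linalg.svd(E, full_matrices=False)
E_reduced = U[:, :k] * S[:k]  # best rank-k approximation's row factors: 768 -> 300 dims
```

The student's embedding layer is then initialized with `E_reduced` instead of random weights.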
20. Takeaways
• Useful distillation aspects: hidden representations from intermediate teacher layers; stage-wise optimization with gradual unfreezing and a learning rate scheduler.
• Key trade-offs: student architecture for low-latency configurations vs. F1 score; parameter compression vs. latency speedup vs. F1 score.
• 35x parameter compression with 51x latency speedup, retaining 95% of mBERT F1 score for NER over 41 languages.
• XtremeDistil can be easily extended to other tasks and languages.
21. Making SOTA Affordable in Practice
Inference cost-saving potential for a single hypothetical scenario:
• 100 million active users, 10 queries by an average user per productive day, 260 productive days in a year
• Query processing time: 100 ms for a BERT-based model vs. 2 ms after distillation
• Cost of 1x P100 GPU per hour: $2.07
With XtremeDistil's 35x compression and 51x latency speedup at a 95% performance match, the distilled model achieves parity with the full model while giving roughly a 25x increase in inference capacity and ~$15 MM savings per year for this single scenario. Opportunity for application in Enterprise Search, Communication and Productivity scenarios.
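The back-of-the-envelope arithmetic behind the ~$15 MM figure can be reproduced directly from the slide's hypothetical numbers:

```python
USERS = 100e6             # active users
QUERIES_PER_DAY = 10      # queries by an average user per productive day
DAYS_PER_YEAR = 260       # productive days in a year
GPU_COST_PER_HOUR = 2.07  # cost of 1x P100 GPU per hour (USD)

def yearly_inference_cost(latency_seconds):
    """GPU cost of serving a year of queries at the given per-query latency."""
    gpu_hours = USERS * QUERIES_PER_DAY * DAYS_PER_YEAR * latency_seconds / 3600
    return gpu_hours * GPU_COST_PER_HOUR

cost_bert = yearly_inference_cost(0.100)       # 100 ms per query
cost_distilled = yearly_inference_cost(0.002)  # 2 ms per query
savings = cost_bert - cost_distilled           # ~$14.7M, i.e. roughly $15 MM
```

The estimate assumes queries are served serially on single GPUs, so it is an order-of-magnitude illustration rather than a capacity plan.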
22. Thank You
Code and resources available at: https://aka.ms/XtremeDistil
23. Low-resource NER for 41 Languages
100 labeled samples per language, plus unlabeled transfer samples.
[Figure: F1 over 41 languages as a function of the amount of unlabeled transfer data.]
24. Distillation Most Effective in Low-resource Settings
XtremeDistil matches the Transformer teacher with >26x compression (distilled student: 13M parameters; BERT Large teacher: 340M parameters).
[Figure: accuracy (50–100) on AG News, IMDB, Elec and DBPedia with 500 labeled examples, comparing RNN, TinyBERT and BERT Large (teacher).]
25. XtremeDistil for Sentiment Classification on SST-2
12 million sentences sampled from IMDB as the unlabeled transfer set for distillation.
26. Which student architecture to use?
27. Multi-Stage Distillation: Progressive Improvement