{ } // Self-contained ML & NLP library in GO
Matteo Grella
● Artificial Intelligence to save lives.
○ Head of AI at EXOP GmbH
■ Detect global security incidents quickly from online media;
■ Match security incidents with travelers' locations;
■ Automatically initiate emergency measures;
● Research on NLP and related Machine Learning algorithms.
○ Strong focus on Dependency Parsing.
■ Non-Projective Dependency Parsing via Latent Heads Representation (LHR)
M. Grella and S. Cangialosi, 2018 (https://arxiv.org/abs/1802.02116)
● Main author of spaGO.
What is spaGO?
● Open-source project initiated by members of .
○ BSD-Like License.
○ Work-in-Progress.
● Beautiful and maintainable machine learning (ML) library written in Go.
○ Design focuses on relevant neural architectures in Natural Language Processing (NLP).
● Train, load and perform inference on state-of-the-art NLP models.
○ e.g., Named Entity Recognition, Question-Answering, etc.
○ The first pure Go implementation that we know of.
Why spaGO?
1. Educational purposes.
a. The first goal was to learn Go and refresh basic concepts of ML.
b. Very few dependencies; developed almost from scratch.
2. Production use.
a. The scope of the goals broadened after things started working pretty well.
b. Pivot towards a comprehensive, production-ready, cloud-native NLP library.
c. e.g., stateless server, TLS encryption, gRPC, Docker, etc.
3. High-quality implementation.
a. Go does not have a rich ML/NLP ecosystem.
b. spaGO fills that niche by providing the most important architectures in the field.
Features
● Automatic differentiation
○ Define-by-run.
● Optimizers
○ Adam, RAdam, RMSProp, AdaGrad, and SGD.
● Feed-forward models
○ Linear, Highway, Convolution, etc.
● Recurrent models
○ LSTM, GRU, etc.
● Attention mechanisms
○ Self-Attention, Multi-Head Attention, etc.
● Memory-efficient Word Embeddings
○ Uses badger key–value store.
Features (continued)
● Language Modeling
○ Next char/word/sentence prediction, masked models.
● Sequence Tagger
○ BiLSTM+CRF, etc.
● Transformers
○ BERT-like
● Text classification
● Named Entity Recognition (NER)
● Question-Answering
● etc.
Ready-to-use
Compatible with PyTorch models:
● // state-of-the-art multi-lingual sequence labeling
● // 32+ pretrained transformers in 100+ languages
Caveats
● No support for GPU, yet.
○ spaGO is not GPU/TPU-friendly by design.
○ On the roadmap.
● Even for the CPU, there are constraints.
○ Far from the optimizations available in mainstream DL (deep learning) frameworks:
■ How data are stored in tensors.
■ Efficient batch operations.
■ etc.
Where would I use spaGO?
● Production NLP systems.
○ Requires optimization of CPU inference, though.
■ PyTorch is ~2x faster for large matrices
○ Relies on GONUM assembly code with SIMD instructions (AMD64).
○ CGO is a bottleneck, so there are experiments using Intel MKL and Apache Arrow.
○ Also, experimenting with sparse models, LSH, etc.
● Training of not-very-deep models.
○ Training is the less compelling use case: it is very expensive for modern models, even with GPUs/TPUs.
Use cases for `production ML/NLP`
● train shallow models
● fine-tune deep models
● inference
● train deep models
● fine-tune deep models
● inference
Weights Importer
Building Blocks
Lightweight internal Machine Learning framework
1. mat package (pkg/mat)
a. 2D dense and sparse matrices (vectors and scalars are special cases).
i. go test dense 80% statements
ii. go test sparse 77% statements
b. []float64 as data storage.
c. BLAS linear algebra asm/f64 (implementation by GONUM Authors).
d. sync.Pool for an efficient reuse of allocated memory.
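A minimal sketch of the two storage ideas listed above: a matrix backed by a flat []float64 and a sync.Pool that recycles backing buffers. The names Dense, NewDense, Release and poolCap are hypothetical and chosen for illustration; this is not the code of pkg/mat, and the single fixed-capacity pool is a simplifying assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// poolCap is the fixed capacity of pooled buffers (a simplification for this sketch).
const poolCap = 64

// pool recycles backing buffers so that short-lived matrices do not allocate each time.
var pool = sync.Pool{
	New: func() interface{} {
		buf := make([]float64, poolCap)
		return &buf
	},
}

// Dense is a hypothetical row-major 2D matrix backed by a flat []float64.
type Dense struct {
	rows, cols int
	data       []float64
	buf        *[]float64 // non-nil when data comes from the pool
}

// NewDense returns a zeroed rows x cols matrix, reusing pooled storage when it fits.
func NewDense(rows, cols int) *Dense {
	n := rows * cols
	m := &Dense{rows: rows, cols: cols}
	if n <= poolCap {
		m.buf = pool.Get().(*[]float64)
		m.data = (*m.buf)[:n]
		for i := range m.data {
			m.data[i] = 0 // pooled memory may hold stale values
		}
	} else {
		m.data = make([]float64, n)
	}
	return m
}

// Release hands the buffer back to the pool; m must not be used afterwards.
func (m *Dense) Release() {
	if m.buf != nil {
		pool.Put(m.buf)
		m.buf = nil
	}
	m.data = nil
}

// At and Set map the (i, j) index onto the flat slice.
func (m *Dense) At(i, j int) float64     { return m.data[i*m.cols+j] }
func (m *Dense) Set(i, j int, v float64) { m.data[i*m.cols+j] = v }

// MulVec computes y = M x for a column vector x.
func (m *Dense) MulVec(x []float64) []float64 {
	y := make([]float64, m.rows)
	for i := 0; i < m.rows; i++ {
		for j := 0; j < m.cols; j++ {
			y[i] += m.At(i, j) * x[j]
		}
	}
	return y
}

func main() {
	m := NewDense(2, 3)
	m.Set(0, 0, 1)
	m.Set(1, 2, 2)
	fmt.Println(m.MulVec([]float64{1, 2, 3})) // [1 6]
	m.Release()
}
```

Storing a pointer to the slice in the pool avoids an extra allocation on every Put, which is the usual advice for sync.Pool.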
Building Blocks
1. mat package (pkg/mat)
2. auto-grad package (pkg/ml/ag)
a. dynamically created expression Graph.
i. variable node around a `Matrix` (pkg/ml/ag/variable.go).
ii. operator node e.g. Add, Mul, Tanh, ReLU, Concat, etc. (pkg/ml/ag/operator.go).
b. differentiation for all supported operations (pkg/ml/ag/fn).
i. go test: 100% files, 89% statements
c. enables training via back-propagation of gradients (BP, BPTT, TBPTT).
d. minimal overhead
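To make the define-by-run idea concrete, here is a toy scalar version of such a graph. It is a sketch under stated assumptions, not the pkg/ml/ag implementation: Node, Graph and their methods are hypothetical names, values are plain float64 rather than matrices, and only Add and Mul are covered.

```go
package main

import "fmt"

// Node is a hypothetical scalar graph node: a value, an accumulated gradient,
// and a closure that propagates the gradient to its operands.
type Node struct {
	Value, Grad float64
	backward    func()
}

// Graph records nodes in creation order. Because every operator is evaluated
// as soon as it is created (define-by-run), that order is already topological.
type Graph struct{ nodes []*Node }

func (g *Graph) add(n *Node) *Node {
	g.nodes = append(g.nodes, n)
	return n
}

// NewVariable creates a leaf node holding v.
func (g *Graph) NewVariable(v float64) *Node { return g.add(&Node{Value: v}) }

// Add creates and immediately evaluates z = x + y.
func (g *Graph) Add(x, y *Node) *Node {
	z := &Node{Value: x.Value + y.Value}
	z.backward = func() { x.Grad += z.Grad; y.Grad += z.Grad }
	return g.add(z)
}

// Mul creates and immediately evaluates z = x * y.
func (g *Graph) Mul(x, y *Node) *Node {
	z := &Node{Value: x.Value * y.Value}
	z.backward = func() { x.Grad += z.Grad * y.Value; y.Grad += z.Grad * x.Value }
	return g.add(z)
}

// Backward seeds out with outGrad and visits nodes in reverse creation order.
func (g *Graph) Backward(out *Node, outGrad float64) {
	out.Grad = outGrad
	for i := len(g.nodes) - 1; i >= 0; i-- {
		if n := g.nodes[i]; n.backward != nil {
			n.backward()
		}
	}
}

func main() {
	g := &Graph{}
	a := g.NewVariable(2)
	b := g.NewVariable(5)
	c := g.Add(g.Mul(a, b), a) // c = a*b + a
	g.Backward(c, 1)
	fmt.Println(c.Value, a.Grad, b.Grad) // 12 6 2
}
```

Gradients are accumulated with += so that a node used more than once (a in the example) receives the sum of the contributions from every path.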
Building Blocks (continued)
1. mat package (pkg/mat)
2. auto-grad package (pkg/ml/ag)
3. optimizers (pkg/ml/optimizers)
○ Params optimization via Gradient Descent (pkg/ml/optimizers/gd)
■ Adam, RAdam, RMSProp, AdaGrad, SGD
■ Decay methods
■ Gradient-clipping
○ Params optimization via Differential Evolution (pkg/ml/optimizers/de)
■ Population
■ Crossover
■ Mutation
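The gradient-descent side of this list can be sketched in a few lines. The helper names (clipByNorm, decayedLR, sgdStep) are hypothetical, and both the clipping rule (rescale to a maximum L2 norm) and the decay schedule (exponential) are assumptions picked for illustration; the slide does not say which variants pkg/ml/optimizers/gd provides.

```go
package main

import (
	"fmt"
	"math"
)

// clipByNorm rescales grads in place so that their L2 norm does not exceed maxNorm.
func clipByNorm(grads []float64, maxNorm float64) {
	sum := 0.0
	for _, g := range grads {
		sum += g * g
	}
	norm := math.Sqrt(sum)
	if norm > maxNorm {
		scale := maxNorm / norm
		for i := range grads {
			grads[i] *= scale
		}
	}
}

// decayedLR returns lr0 * rate^step (exponential decay, assumed for this sketch).
func decayedLR(lr0, rate float64, step int) float64 {
	return lr0 * math.Pow(rate, float64(step))
}

// sgdStep applies the plain SGD update params[i] -= lr * grads[i].
func sgdStep(params, grads []float64, lr float64) {
	for i := range params {
		params[i] -= lr * grads[i]
	}
}

func main() {
	params := []float64{1, 1}
	grads := []float64{3, 4}                        // L2 norm 5
	clipByNorm(grads, 1)                            // rescaled to norm 1: about [0.6 0.8]
	sgdStep(params, grads, decayedLR(0.5, 0.5, 1))  // learning rate 0.25 at step 1
	fmt.Printf("%.2f %.2f\n", params[0], params[1]) // 0.85 0.80
}
```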
Building Blocks (continued)
1. mat package (pkg/mat)
2. auto-grad package (pkg/ml/ag)
3. optimizers (pkg/ml/optimizers)
4. neural networks (pkg/ml/nn)
○ Param
■ Weights, Biases
■ Optimizable, Serializable
○ Model
■ Serializable parameters of a neural model (params, sub-models, etc.)
■ Can instantiate a `Processor`
○ Processor
■ Perform the model’s `Forward()` operation within the `ag.Graph`
■ Suitable for feedforward, convolutional, recurrent models
Toy Example (c = a+b)
g := ag.NewGraph()
// create a new node of type variable with a scalar
a := g.NewVariable(mat.NewScalar(2.0), true)
// create another node of type variable with a scalar
b := g.NewVariable(mat.NewScalar(5.0), true)
// create an addition operator (the calculation is actually performed here)
c := g.Add(a, b)
// print the result
fmt.Printf("c = %v\n", c.Value()) // c = [7]
g.Backward(c, ag.OutputGrad(mat.NewScalar(0.5)))
// print the gradients
fmt.Printf("ga = %v\n", a.Grad()) // ga = [0.5]
fmt.Printf("gb = %v\n", b.Grad()) // gb = [0.5]
Affine transformation
func Affine(g *ag.Graph, xs ...ag.Node) ag.Node {
y := g.Add(xs[0], g.Mul(xs[1], xs[2])) // y = b + Wx
for i := 3; i < len(xs)-1; i += 2 {
w, x := xs[i], xs[i+1]
if x != nil {
y = g.Add(y, g.Mul(w, x)) // y += Wx
}
}
return y
}
// y = b + W1x1 + W2x2 + ... + Wnxn
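The same formula can be checked by hand with plain slices. This is a standalone sketch with no graph and no gradients; term, matVec and affine are hypothetical names. One difference from the function above is an assumption made for brevity: here any term with a nil input is skipped, including the first one.

```go
package main

import "fmt"

// matVec returns W x for a row-major matrix W.
func matVec(w [][]float64, x []float64) []float64 {
	y := make([]float64, len(w))
	for i, row := range w {
		for j, v := range row {
			y[i] += v * x[j]
		}
	}
	return y
}

// term pairs a weight matrix with its input vector.
type term struct {
	W [][]float64
	X []float64
}

// affine computes b + W1x1 + W2x2 + ... and skips terms whose input is nil.
func affine(b []float64, terms ...term) []float64 {
	y := append([]float64(nil), b...) // copy so that b is not modified
	for _, t := range terms {
		if t.X == nil {
			continue
		}
		for i, v := range matVec(t.W, t.X) {
			y[i] += v
		}
	}
	return y
}

func main() {
	b := []float64{1, 1}
	w1 := [][]float64{{1, 0}, {0, 2}}
	w2 := [][]float64{{0.5, 0.5}, {1, 1}}
	x := []float64{2, 3}
	fmt.Println(affine(b, term{w1, x}))                // [3 7]
	fmt.Println(affine(b, term{w1, x}, term{w2, nil})) // [3 7]
	fmt.Println(affine(b, term{w1, x}, term{w2, x}))   // [5.5 12]
}
```

Skipping nil inputs is what lets a recurrent cell reuse the same call at the first time step, when there is no previous output yet.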
Linear Model (pkg/ml/nn/linear)
var (
_ nn.Model = &Model{}
_ nn.Processor = &Processor{}
)
type Model struct {
W *nn.Param `type:"weights"`
B *nn.Param `type:"biases"`
}
// New returns a new Linear model
func New(in, out int) *Model {
return &Model{
W: nn.NewParam(mat.NewEmptyDense(out, in)),
B: nn.NewParam(mat.NewEmptyVecDense(out)),
}
}
type Processor struct {
nn.BaseProcessor
w, b ag.Node
}
// NewProc returns a new Linear processor operating on the graph
func (m *Model) NewProc(g *ag.Graph) nn.Processor {
return &Processor{
BaseProcessor: nn.BaseProcessor{
Model: m,
Mode: nn.Training,
Graph: g,
},
// insert the weights and biases into the graph
w: g.NewWrap(m.W),
b: g.NewWrap(m.B),
}
}
// Forward computes y[i] = w (dot) x[i] + b
func (p *Processor) Forward(xs ...ag.Node) []ag.Node {
ys := make([]ag.Node, len(xs))
for i, x := range xs {
ys[i] = nn.Affine(p.Graph, p.b, p.w, x)
}
return ys
}
Build a Multi-Layer Perceptron (MLP)
mlp := stack.New(
linear.New(inputSize, hiddenSize),
activation.New(ag.OpTanh),
linear.New(hiddenSize, outputSize),
activation.New(ag.OpSoftmax),
)
// y = Softmax(W_out * Tanh(W_in * x + b_in) + b_out)
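For reference, the forward pass of such a two-layer perceptron, y = Softmax(W_out * Tanh(W_in * x + b_in) + b_out), written with plain slices. This is a sketch for checking the arithmetic, not spaGO code; the function names and the example weights are made up.

```go
package main

import (
	"fmt"
	"math"
)

// linear computes W x + b for a row-major weight matrix W.
func linear(w [][]float64, b, x []float64) []float64 {
	y := make([]float64, len(w))
	for i, row := range w {
		y[i] = b[i]
		for j, v := range row {
			y[i] += v * x[j]
		}
	}
	return y
}

// tanhVec applies tanh element-wise.
func tanhVec(x []float64) []float64 {
	y := make([]float64, len(x))
	for i, v := range x {
		y[i] = math.Tanh(v)
	}
	return y
}

// softmax subtracts the maximum before exponentiating for numerical stability.
func softmax(x []float64) []float64 {
	m := x[0]
	for _, v := range x[1:] {
		if v > m {
			m = v
		}
	}
	y := make([]float64, len(x))
	sum := 0.0
	for i, v := range x {
		y[i] = math.Exp(v - m)
		sum += y[i]
	}
	for i := range y {
		y[i] /= sum
	}
	return y
}

// mlp computes Softmax(W_out * Tanh(W_in * x + b_in) + b_out).
func mlp(wIn [][]float64, bIn []float64, wOut [][]float64, bOut, x []float64) []float64 {
	return softmax(linear(wOut, bOut, tanhVec(linear(wIn, bIn, x))))
}

func main() {
	wIn := [][]float64{{0.5, -0.5}, {1, 1}, {-1, 0}} // hidden size 3, input size 2
	bIn := []float64{0, 0.1, -0.1}
	wOut := [][]float64{{1, 0, -1}, {0, 1, 1}} // output size 2
	bOut := []float64{0, 0}
	fmt.Println(mlp(wIn, bIn, wOut, bOut, []float64{1, 2})) // two probabilities that sum to 1
}
```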
Long-Short Term Memory (pkg/ml/nn/rec/lstm)
// inG = sigmoid(wIn (dot) x + bIn + wInRec (dot) yPrev)
// outG = sigmoid(wOut (dot) x + bOut + wOutRec (dot) yPrev)
// forG = sigmoid(wFor (dot) x + bFor + wForRec (dot) yPrev)
// cand = f(wCand (dot) x + bCand + wCandRec (dot) yPrev)
// cell = inG * cand + forG * cellPrev
// y = outG * f(cell)
func (p *Processor) forward(x ag.Node) (s *State) {
g := p.Graph
s = new(State)
yPrev, cellPrev := p.prev()
s.InG = g.Sigmoid(nn.Affine(g, p.bIn, p.wIn, x, p.wInRec, yPrev))
s.OutG = g.Sigmoid(nn.Affine(g, p.bOut, p.wOut, x, p.wOutRec, yPrev))
s.ForG = g.Sigmoid(nn.Affine(g, p.bFor, p.wFor, x, p.wForRec, yPrev))
s.Cand = g.Tanh(nn.Affine(g, p.bCand, p.wCand, x, p.wCandRec, yPrev))
if cellPrev != nil {
s.Cell = g.Add(g.Prod(s.InG, s.Cand), g.Prod(s.ForG, cellPrev))
} else {
s.Cell = g.Prod(s.InG, s.Cand)
}
s.Y = g.Prod(s.OutG, g.Tanh(s.Cell))
return
}
func (p *Processor) prev() (
yPrev, cellPrev ag.Node
) {
s := p.LastState()
if s != nil {
yPrev = s.Y
cellPrev = s.Cell
}
return
}
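The gate equations in the comments can be exercised with a scalar cell. This is a sketch under stated assumptions: gate, cell and state are hypothetical names, every weight is a single float64, f is taken to be tanh (as in the code above), and a zero previous state plays the role of the nil check.

```go
package main

import (
	"fmt"
	"math"
)

// gate holds the input weight, recurrent weight and bias of one gate.
type gate struct{ w, wRec, b float64 }

// pre computes the pre-activation w*x + b + wRec*yPrev.
func (g gate) pre(x, yPrev float64) float64 { return g.w*x + g.b + g.wRec*yPrev }

func sigmoid(v float64) float64 { return 1 / (1 + math.Exp(-v)) }

// cell is a scalar LSTM cell with input, output, forget and candidate gates.
type cell struct{ in, out, forget, cand gate }

// state carries the output y and the cell value c between time steps.
type state struct{ y, c float64 }

// step applies the LSTM equations once.
func (l cell) step(x float64, prev state) state {
	inG := sigmoid(l.in.pre(x, prev.y))
	outG := sigmoid(l.out.pre(x, prev.y))
	forG := sigmoid(l.forget.pre(x, prev.y))
	cand := math.Tanh(l.cand.pre(x, prev.y))
	c := inG*cand + forG*prev.c
	return state{y: outG * math.Tanh(c), c: c}
}

func main() {
	l := cell{
		in:     gate{w: 0.5, wRec: 0.1, b: 0},
		out:    gate{w: 0.3, wRec: 0.2, b: 0},
		forget: gate{w: 0.4, wRec: 0.1, b: 1},
		cand:   gate{w: 1.0, wRec: 0.5, b: 0},
	}
	s := state{} // zero state: the recurrent and cellPrev terms vanish at the first step
	for _, x := range []float64{1, -1, 0.5} {
		s = l.step(x, s)
		fmt.Printf("y=%.4f cell=%.4f\n", s.y, s.c)
	}
}
```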
Put the pieces together
model := linear.New(1, 1) // y = b+Wx
optimizer := gd.NewOptimizer(
sgd.New(sgd.NewConfig(0.0001, 0.0, false)),
nn.NewDefaultParamsIterator(model),
)
criterion := losses.MSESeq // g.ReduceMean(g.ProdScalar(g.Square(g.Sub(x, y)), g.NewScalar(0.5)))
g := ag.NewGraph()
predicted := model.NewProc(g).Forward(features...)
loss := criterion(g, predicted, expected, true)
g.Backward(loss)
optimizer.Optimize()
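The forward / loss / backward / optimize cycle above can be reproduced without any library for the one-dimensional case. This sketch assumes full-batch gradient descent on mean(0.5 * (pred - y)^2), mirroring the loss in the comment; fit is a hypothetical helper, the gradients are written out by hand, and the data and learning rate are chosen only to make the example converge quickly.

```go
package main

import "fmt"

// fit runs full-batch gradient descent on the model y = w*x + b
// with the loss mean(0.5 * (pred - y)^2).
func fit(xs, ys []float64, lr float64, epochs int) (w, b float64) {
	n := float64(len(xs))
	for e := 0; e < epochs; e++ {
		var gw, gb float64
		for i, x := range xs {
			diff := w*x + b - ys[i] // forward pass and prediction error
			gw += diff * x / n      // d(loss)/dw
			gb += diff / n          // d(loss)/db
		}
		w -= lr * gw // optimizer step
		b -= lr * gb
	}
	return w, b
}

func main() {
	xs := []float64{0, 1, 2, 3}
	ys := []float64{1, 3, 5, 7} // generated by y = 2x + 1
	w, b := fit(xs, ys, 0.1, 1000)
	fmt.Printf("w=%.3f b=%.3f\n", w, b) // w=2.000 b=1.000
}
```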
NLP: Sequence Labeler (Flair Architecture)
flair := &sequencelabeler.Model{
Labels: config.Labels,
TaggerLayer: &birnncrf.Model{
BiRNN: birnn.New(
lstm.New(...),
lstm.New(...),
birnn.Concat,
),
Scorer: linear.New(...),
CRF: crf.New(...),
},
EmbeddingsLayer: &stackedembeddings.Model{
WordsEncoders: []nn.Model{
embeddings.New(...), // GloVe
contextualstringembeddings.New(
charlm.New(...), // left to right
charlm.New(...), // right to left
contextualstringembeddings.Concat,
),
},
ProjectionLayer: linear.New(...),
},
}
NLP: BERT Layer
type Layer struct {
MultiHeadAttention *multiheadattention.Model
NormAttention *layernorm.Model
FFN *stack.Model
NormFFN *layernorm.Model
}
func (m *Layer) NewProc(g *ag.Graph) nn.Processor {
return &Processor{
BaseProcessor: nn.BaseProcessor{Model: m, Mode: nn.Training, Graph: g},
MultiHeadAttention: m.MultiHeadAttention.NewProc(g).(*multiheadattention.Processor),
NormAttention: m.NormAttention.NewProc(g).(*layernorm.Processor),
FFN: m.FFN.NewProc(g).(*stack.Processor),
NormFFN: m.NormFFN.NewProc(g).(*layernorm.Processor),
}
}
func (p *Processor) Forward(xs ...ag.Node) []ag.Node {
subLayer1 := rc.PostNorm(p.Graph, p.MultiHeadAttention.Forward, p.NormAttention.Forward, xs...)
subLayer2 := rc.PostNorm(p.Graph, p.FFN.Forward, p.NormFFN.Forward, subLayer1...)
return subLayer2
}
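Each sub-layer above is wrapped in a post-norm residual connection. A minimal sketch of that pattern, assuming the usual definition y = Norm(x + f(x)); the slide does not show the body of rc.PostNorm, so this is an assumption, and layerNorm here has no learned gain or bias.

```go
package main

import (
	"fmt"
	"math"
)

// layerNorm rescales x to zero mean and unit variance (no learned gain or bias).
func layerNorm(x []float64) []float64 {
	const eps = 1e-12
	n := float64(len(x))
	mean := 0.0
	for _, v := range x {
		mean += v
	}
	mean /= n
	variance := 0.0
	for _, v := range x {
		variance += (v - mean) * (v - mean)
	}
	variance /= n
	y := make([]float64, len(x))
	for i, v := range x {
		y[i] = (v - mean) / math.Sqrt(variance+eps)
	}
	return y
}

// postNorm computes norm(x + f(x)): a residual connection followed by normalization.
func postNorm(x []float64, f, norm func([]float64) []float64) []float64 {
	fx := f(x)
	sum := make([]float64, len(x))
	for i := range x {
		sum[i] = x[i] + fx[i]
	}
	return norm(sum)
}

func main() {
	// double stands in for a real sub-layer such as attention or the feed-forward block.
	double := func(x []float64) []float64 {
		y := make([]float64, len(x))
		for i, v := range x {
			y[i] = 2 * v
		}
		return y
	}
	x := []float64{1, 2, 3}
	h := postNorm(x, double, layerNorm) // first sub-layer
	y := postNorm(h, double, layerNorm) // second sub-layer
	fmt.Println(y)
}
```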
Live Demo
1. Question-Answering
a. GOOS=linux GOARCH=amd64 go build -o hugging_face_importer cmd/huggingfaceimporter/main.go
b. GOOS=linux GOARCH=amd64 go build -o bert_server cmd/bert/main.go
c. ./hugging_face_importer --model=deepset/bert-base-cased-squad2 --repo=~/.spago
d. ./bert_server server --model=~/.spago/deepset/bert-base-cased-squad2 --tls-disable
e. http://localhost:1987/bert-qa-ui
2. Named Entity Recognition
a. http://localhost:1988/ner-ui
Recap
● Educational.
○ Code relatively simple and quite succinct.
■ you don't need to be a math expert to understand spaGO.
● Easy-to-use (and to deploy).
○ GO module.
○ HTTP API (TLS available).
○ gRPC.
○ Docker.
○ Statically linked binary (no heavy dependencies; no burden of DL frameworks).
■ you can copy it “anywhere” and it will just run.
● Potential.
○ Fast inference of NLP models on CPUs
■ Explore concurrent computation
Current status (v 0.0.z)
{
“Stars”: >600,
“Forks”: 23,
“Pull Requests”: 19,
“Issues”: 17,
“Main Contributors”: [
“M. Grella”, “E. Brambilla”, “S. Cangialosi”, “M. Nicola”,
“E. McClure”, “J. Viana”,
],
“Building”: “passing”,
“Codecov”: “60%”,
}
Related projects
GoPickle - Loading Python's data serialized with pickle and PyTorch module files
https://github.com/nlpodyssey/gopickle
GoTokenizers - Port of Hugging Face's tokenizers (Rust)
https://github.com/nlpodyssey/gotokenizers
GoSlide - Port of SLIDE for high performance deep-learning with CPU (C++)
https://github.com/nlpodyssey/goslide
Main contributor: M. Nicola (nicolamarco@protonmail.com)
Thanks! Questions?
Link to the repo:
https://github.com/nlpodyssey/spago
If you like the project, please leave a ★ to show your support!
Contacts:
Matteo Grella
matteogrella@gmail.com
