spaGO: A self-contained ML & NLP library in GO

{ } // Self-contained ML & NLP library in GO

Matteo Grella
● Artificial Intelligence to save lives.
○ Head of AI at EXOP GmbH
■ Detect global security incidents in short time from online-media;
■ Match security-incidents with travelers locations;
■ Automatically initiate emergency measures;
● Research on NLP and related Machine Learning algorithms.
○ Strong focus on Dependency Parsing.
■ Non-Projective Dependency Parsing via Latent Heads Representation (LHR)
M. Grella and S. Cangialosi, 2018 (https://arxiv.org/abs/1802.02116)
● Main author of spaGO.

What is spaGO?
● Open-source project initiated by members of .
○ BSD-Like License.
○ Work-in-Progress.
● Beautiful and maintainable machine learning (ML) library written in Go.
○ Design focuses on relevant neural architectures in Natural Language Processing (NLP).
● Train, load and perform inference on state-of-the-art NLP models.
○ e.g., Named Entity Recognition, Question-Answering, etc.
○ The first pure Go implementation that we know of.

Why spaGO?
1. Educational purposes.
a. The first goal was to learn Go and refresh basic concepts of ML.
b. Very few dependencies; developed almost from scratch.
2. Production use.
a. The scope of the goals broadened after things started working pretty well.
b. Pivot towards a comprehensive, production-ready, cloud-native NLP library.
c. e.g., stateless server, TLS encryption, gRPC, Docker, etc.
3. High-quality implementation.
a. Go does not have a rich ML/NLP ecosystem.
b. spaGO fills that niche by providing the most important architectures in the field.

Features
● Automatic differentiation
○ Define-by-run.
● Optimizers
○ Adam, RAdam, RMSProp, AdaGrad, and SGD.
● Feed-forward models
○ Linear, Highway, Convolution, etc.
● Recurrent models
○ LSTM, GRU, etc.
● Attention mechanisms
○ Self-Attention, Multi-Head Attention, etc.
● Memory-efficient Word Embeddings
○ Uses badger key–value store.

Features (continued)
● Language Modeling
○ Next char/word/sentence prediction, masked models.
● Sequence Tagger
○ BiLSTM+CRF, etc.
● Transformers
○ BERT-like
● Text classification
● Named Entity Recognition (NER)
● Question-Answering
● etc.

Ready-to-use
Compatible with PyTorch models:
● // state-of-the-art multi-lingual sequence labeling
● // 32+ pretrained transformers in 100+ languages

Caveats
● No support for GPU, yet.
○ spaGO is not GPU/TPU-friendly by design.
○ On the roadmap.
● Even for the CPU, there are constraints.
○ Far from the optimizations available in mainstream DL (deep learning) frameworks:
■ How data are stored in tensors.
■ Efficient batch operations.
■ etc.

Where would I use spaGO?
● Production NLP systems.
○ Requires optimization of CPU inference, though.
■ PyTorch is ~2x faster for large matrices
○ Relies on GONUM assembly code with SIMD instructions (AMD64).
○ CGO is a bottleneck, so there are experiments using Intel MKL and Apache Arrow.
○ Also, experimenting with sparse models, LSH, etc.
● Training of not-very-deep models.
○ The training bit seems less interesting, as it is really expensive for modern models even with GPUs/TPUs.

Use cases for `production ML/NLP`
● train shallow models
● fine-tune deep models
● inference
● train deep models
● fine-tune deep models
● inference
Weights Importer

Building Blocks
Lightweight internal Machine Learning framework
1. mat package (pkg/mat)
a. 2D dense and sparse matrices (vectors and scalars are treated as special cases).
i. go test dense 80% statements
ii. go test sparse 77% statements
b. []float64 as data storage.
c. BLAS linear algebra asm/f64 (implementation by GONUM Authors).
d. sync.Pool for an efficient reuse of allocated memory.
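
The memory-reuse idea in item (d) can be sketched with the standard library alone. This is only an illustration of the sync.Pool pattern applied to []float64 buffers, not spaGO's actual code; the names bufPool, getBuffer and putBuffer are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable []float64 buffers (hypothetical helper, not spaGO code).
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]float64, 0, 64)
		return &b
	},
}

// getBuffer returns a zeroed slice of length n, reusing pooled storage
// whenever the pooled buffer has enough capacity.
func getBuffer(n int) *[]float64 {
	p := bufPool.Get().(*[]float64)
	if cap(*p) < n {
		*p = make([]float64, n)
		return p
	}
	*p = (*p)[:n]
	for i := range *p {
		(*p)[i] = 0
	}
	return p
}

// putBuffer gives the buffer back to the pool for later reuse.
func putBuffer(p *[]float64) {
	bufPool.Put(p)
}

func main() {
	p := getBuffer(4)
	(*p)[0] = 1.5
	fmt.Println(len(*p), (*p)[0])
	putBuffer(p)

	// Whether q reuses p's storage or not, it always comes back zeroed.
	q := getBuffer(4)
	fmt.Println(len(*q), (*q)[0])
	putBuffer(q)
}
```

Storing a pointer to the slice in the pool avoids an extra allocation on every Put, which is the usual reason for this shape.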

Building Blocks
2. auto-grad package (pkg/ml/ag)
a. dynamically created expression Graph.
i. variable node around a `Matrix` (pkg/ml/ag/variable.go).
ii. operator node e.g. Add, Mul, Tanh, ReLU, Concat, etc. (pkg/ml/ag/operator.go).
b. differentiation for all supported operations (pkg/ml/ag/fn).
i. go test: 100% files, 89% statements
c. enable training via back-propagation of gradients (BP, BPTT, TBPTT).
d. minimal overhead
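
To make the "dynamically created expression graph" idea concrete, here is a minimal define-by-run sketch on scalars using only the standard library. It is not the ag package: the types node and graph and their methods are hypothetical stand-ins that only show how values are computed eagerly while the backward rules are recorded.

```go
package main

import "fmt"

// node is one entry of the dynamically built graph (illustrative, not ag.Node).
type node struct {
	value    float64
	grad     float64
	backward func() // pushes this node's grad to its operands; nil for variables
}

// graph records nodes in creation order, which is already a topological order.
type graph struct {
	nodes []*node
}

// variable adds a leaf node holding v.
func (g *graph) variable(v float64) *node {
	n := &node{value: v}
	g.nodes = append(g.nodes, n)
	return n
}

// add computes a+b immediately (define-by-run) and records the backward rule.
func (g *graph) add(a, b *node) *node {
	out := &node{value: a.value + b.value}
	out.backward = func() {
		a.grad += out.grad
		b.grad += out.grad
	}
	g.nodes = append(g.nodes, out)
	return out
}

// mul computes a*b immediately and records the product rule.
func (g *graph) mul(a, b *node) *node {
	out := &node{value: a.value * b.value}
	out.backward = func() {
		a.grad += b.value * out.grad
		b.grad += a.value * out.grad
	}
	g.nodes = append(g.nodes, out)
	return out
}

// backprop seeds the output gradient and walks the recorded nodes in reverse.
func (g *graph) backprop(out *node, seed float64) {
	out.grad = seed
	for i := len(g.nodes) - 1; i >= 0; i-- {
		if n := g.nodes[i]; n.backward != nil {
			n.backward()
		}
	}
}

func main() {
	g := &graph{}
	a := g.variable(2)
	b := g.variable(5)
	c := g.add(g.mul(a, b), a) // c = a*b + a
	g.backprop(c, 1)
	fmt.Println(c.value, a.grad, b.grad) // 12 6 2
}
```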

Building Blocks (continued)
3. optimizers (pkg/ml/optimizers)
○ Params optimization via Gradient Descent (pkg/ml/optimizers/gd)
■ Adam, RAdam, RMSProp, AdaGrad, SGD
■ Decay methods
■ Gradient-clipping
○ Params optimization via Differential Evolution (pkg/ml/optimizers/de)
■ Population
■ Crossover
■ Mutation

Building Blocks (continued)
4. neural networks (pkg/ml/nn)
○ Param
■ Weights, Biases
■ Optimizable, Serializable
○ Model
■ Serializable parameters of a neural model (params, sub-models, etc.)
■ Can instantiate a `Processor`
○ Processor
■ Perform the model’s `Forward()` operation within the `ag.Graph`
■ Suitable for feedforward, convolutional, recurrent models

Toy Example (c = a+b)
g := ag.NewGraph()
// create a new node of type variable with a scalar
a := g.NewVariable(mat.NewScalar(2.0), true)
// create another node of type variable with a scalar
b := g.NewVariable(mat.NewScalar(5.0), true)
// create an addition operator (the calculation is actually performed here)
c := g.Add(a, b)
// print the result
fmt.Printf("c = %v\n", c.Value()) // c = [7]
g.Backward(c, ag.OutputGrad(mat.NewScalar(0.5)))
// print the gradients
fmt.Printf("ga = %v\n", a.Grad()) // ga = [0.5]
fmt.Printf("gb = %v\n", b.Grad()) // gb = [0.5]

Affine transformation
func Affine(g *ag.Graph, xs ...ag.Node) ag.Node {
	y := g.Add(xs[0], g.Mul(xs[1], xs[2])) // y = b + Wx
	for i := 3; i < len(xs)-1; i += 2 {
		w, x := xs[i], xs[i+1]
		if x != nil {
			y = g.Add(y, g.Mul(w, x)) // y += Wx
		}
	}
	return y
}
// y = b + W1x1 + W2x2 + ... + Wnxn
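
The same computation can be written without the graph, on plain slices, to show what the operator nodes evaluate to. This is a standalone sketch using only the standard library; the term type and the affine and matVec helpers are hypothetical and are not part of spaGO.

```go
package main

import "fmt"

// matVec returns W·x for a row-major dense matrix W.
func matVec(w [][]float64, x []float64) []float64 {
	y := make([]float64, len(w))
	for i, row := range w {
		for j, v := range row {
			y[i] += v * x[j]
		}
	}
	return y
}

// term pairs a weight matrix with its input vector.
type term struct {
	W [][]float64
	X []float64
}

// affine computes y = b + W1·x1 + ... + Wn·xn, skipping terms whose input is nil.
func affine(b []float64, terms ...term) []float64 {
	y := append([]float64(nil), b...) // copy so that b is left untouched
	for _, t := range terms {
		if t.X == nil {
			continue
		}
		for i, v := range matVec(t.W, t.X) {
			y[i] += v
		}
	}
	return y
}

func main() {
	b := []float64{1, 1}
	w1 := [][]float64{{1, 2}, {3, 4}}
	x1 := []float64{1, 1}
	w2 := [][]float64{{2}, {0}}
	x2 := []float64{3}
	// The last term has a nil input and is ignored.
	fmt.Println(affine(b, term{W: w1, X: x1}, term{W: w2, X: x2}, term{W: w2, X: nil})) // [10 8]
}
```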

Linear Model (pkg/ml/nn/linear)
var (
	_ nn.Model     = &Model{}
	_ nn.Processor = &Processor{}
)

type Model struct {
	W *nn.Param `type:"weights"`
	B *nn.Param `type:"biases"`
}

// New returns a new Linear model
func New(in, out int) *Model {
	return &Model{
		W: nn.NewParam(mat.NewEmptyDense(out, in)),
		B: nn.NewParam(mat.NewEmptyVecDense(out)),
	}
}

type Processor struct {
	nn.BaseProcessor
	w, b ag.Node
}

// NewProc returns a Linear processor operating on the graph
func (m *Model) NewProc(g *ag.Graph) nn.Processor {
	return &Processor{
		BaseProcessor: nn.BaseProcessor{
			Model: m,
			Mode:  nn.Training,
			Graph: g,
		},
		// insert the weights and biases into the graph
		w: g.NewWrap(m.W),
		b: g.NewWrap(m.B),
	}
}

// Forward computes y[i] = w (dot) x[i] + b
func (p *Processor) Forward(xs ...ag.Node) []ag.Node {
	ys := make([]ag.Node, len(xs))
	for i, x := range xs {
		ys[i] = nn.Affine(p.Graph, p.b, p.w, x)
	}
	return ys
}

Build a Multi-Layer Perceptron (MLP)
mlp := stack.New(
	linear.New(inputSize, hiddenSize),
	activation.New(ag.OpTanh),
	linear.New(hiddenSize, outputSize),
	activation.New(ag.OpSoftmax),
)
// y = Softmax(W_out · Tanh(W_in · x + b_in) + b_out)

Long-Short Term Memory (pkg/ml/nn/rec/lstm)
// inG = sigmoid(wIn (dot) x + bIn + wInRec (dot) yPrev)
// outG = sigmoid(wOut (dot) x + bOut + wOutRec (dot) yPrev)
// forG = sigmoid(wFor (dot) x + bFor + wForRec (dot) yPrev)
// cand = f(wCand (dot) x + bCand + wCandRec (dot) yPrev)
// cell = inG * cand + forG * cellPrev
// y = outG * f(cell)
func (p *Processor) forward(x ag.Node) (s *State) {
	g := p.Graph
	s = new(State)
	yPrev, cellPrev := p.prev()
	s.InG = g.Sigmoid(nn.Affine(g, p.bIn, p.wIn, x, p.wInRec, yPrev))
	s.OutG = g.Sigmoid(nn.Affine(g, p.bOut, p.wOut, x, p.wOutRec, yPrev))
	s.ForG = g.Sigmoid(nn.Affine(g, p.bFor, p.wFor, x, p.wForRec, yPrev))
	s.Cand = g.Tanh(nn.Affine(g, p.bCand, p.wCand, x, p.wCandRec, yPrev))
	if cellPrev != nil {
		s.Cell = g.Add(g.Prod(s.InG, s.Cand), g.Prod(s.ForG, cellPrev))
	} else {
		// first step: there is no previous cell state to carry over
		s.Cell = g.Prod(s.InG, s.Cand)
	}
	s.Y = g.Prod(s.OutG, g.Tanh(s.Cell))
	return
}

// prev returns the output and cell of the last state (nil on the first step)
func (p *Processor) prev() (yPrev, cellPrev ag.Node) {
	s := p.LastState()
	if s != nil {
		yPrev = s.Y
		cellPrev = s.Cell
	}
	return
}

Put the pieces together
model := linear.New(1, 1) // y = b+Wx
optimizer := gd.NewOptimizer(
	sgd.New(sgd.NewConfig(0.0001, 0.0, false)),
	nn.NewDefaultParamsIterator(model),
)
criterion := losses.MSESeq // g.ReduceMean(g.ProdScalar(g.Square(g.Sub(x, y)), g.NewScalar(0.5)))
g := ag.NewGraph()
predicted := model.NewProc(g).Forward(features...)
loss := criterion(g, predicted, expected, true)
g.Backward(loss)
optimizer.Optimize()
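
The snippet above performs a single forward/backward/update step. As a rough standalone illustration of what repeating that step achieves, the sketch below fits y = w*x + b with hand-written SGD on the loss 0.5*(pred-target)^2, using only the standard library. The fit function, the learning rate and the data are made up for the example and do not come from spaGO.

```go
package main

import "fmt"

// fit trains y = w*x + b with plain SGD, one example at a time,
// on the loss 0.5*(pred-target)^2.
func fit(xs, ys []float64, lr float64, epochs int) (w, b float64) {
	for e := 0; e < epochs; e++ {
		for i, x := range xs {
			diff := w*x + b - ys[i] // dLoss/dPred
			w -= lr * diff * x      // dLoss/dw = diff * x
			b -= lr * diff          // dLoss/db = diff
		}
	}
	return w, b
}

func main() {
	xs := []float64{0, 1, 2, 3}
	ys := []float64{1, 3, 5, 7} // generated by y = 2x + 1
	w, b := fit(xs, ys, 0.05, 2000)
	fmt.Printf("w=%.3f b=%.3f\n", w, b)
}
```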

NLP: Sequence Labeler (Flair Architecture)
flair := &sequencelabeler.Model{
	Labels: config.Labels,
	TaggerLayer: &birnncrf.Model{
		BiRNN: birnn.New(
			lstm.New(...),
			lstm.New(...),
			birnn.Concat,
		),
		Scorer: linear.New(...),
		CRF:    crf.New(...),
	},
	EmbeddingsLayer: &stackedembeddings.Model{
		WordsEncoders: []nn.Model{
			embeddings.New(...), // GloVe
			contextualstringembeddings.New(
				charlm.New(...), // left to right
				charlm.New(...), // right to left
				contextualstringembeddings.Concat,
			),
		},
		ProjectionLayer: linear.New(...),
	},
}

NLP: BERT Layer
type Layer struct {
	MultiHeadAttention *multiheadattention.Model
	NormAttention      *layernorm.Model
	FFN                *stack.Model
	NormFFN            *layernorm.Model
}

func (m *Layer) NewProc(g *ag.Graph) nn.Processor {
	return &Processor{
		BaseProcessor:      nn.BaseProcessor{Model: m, Mode: nn.Training, Graph: g},
		MultiHeadAttention: m.MultiHeadAttention.NewProc(g).(*multiheadattention.Processor),
		NormAttention:      m.NormAttention.NewProc(g).(*layernorm.Processor),
		FFN:                m.FFN.NewProc(g).(*stack.Processor),
		NormFFN:            m.NormFFN.NewProc(g).(*layernorm.Processor),
	}
}

func (p *Processor) Forward(xs ...ag.Node) []ag.Node {
	subLayer1 := rc.PostNorm(p.Graph, p.MultiHeadAttention.Forward, p.NormAttention.Forward, xs...)
	subLayer2 := rc.PostNorm(p.Graph, p.FFN.Forward, p.NormFFN.Forward, subLayer1...)
	return subLayer2
}

Live Demo
1. Question-Answering
a. GOOS=linux GOARCH=amd64 go build -o hugging_face_importer cmd/huggingfaceimporter/main.go
b. GOOS=linux GOARCH=amd64 go build -o bert_server cmd/bert/main.go
c. ./hugging_face_importer --model=deepset/bert-base-cased-squad2 --repo=~/.spago
d. ./bert_server server --model=~/.spago/deepset/bert-base-cased-squad2 --tls-disable
e. http://localhost:1987/bert-qa-ui
2. Named Entity Recognition
a. http://localhost:1988/ner-ui

Recap
● Educational.
○ Code relatively simple and quite succinct.
■ you don't need to be a math expert to understand spaGO.
● Easy-to-use (and to deploy).
○ GO module.
○ HTTP API (TLS available).
○ gRPC.
○ Docker.
○ Static-linked binary (no heavy dependencies; no burden of DL frameworks).
■ you can copy it “anywhere” and it will just run.
● Potentials.
○ Fast inference of NLP models on CPUs
■ Explore concurrent computation

Current status (v 0.0.z)
{
	"Stars": >600,
	"Forks": 23,
	"Pull Requests": 19,
	"Issues": 17,
	"Main Contributors": [
		"M. Grella", "E. Brambilla", "S. Cangialosi", "M. Nicola",
		"E. McClure", "J. Viana"
	],
	"Building": "passing",
	"Codecov": "60%"
}

Related projects
GoPickle - Loading Python's data serialized with pickle and PyTorch module files
https://github.com/nlpodyssey/gopickle
GoTokenizers - Port of Hugging Face's tokenizers (Rust)
https://github.com/nlpodyssey/gotokenizers
GoSlide - Port of SLIDE for high performance deep-learning with CPU (C++)
https://github.com/nlpodyssey/goslide
Main contributor: M. Nicola (nicolamarco@protonmail.com)

Thanks! Questions?
Link to the repo:
https://github.com/nlpodyssey/spago
If you like the project, please leave a ★ to show your support!
Contacts:
Matteo Grella
matteogrella@gmail.com
