2. Agenda
QUICK INTRO TO MACHINE LEARNING
COMPLEXITY OF WORKING WITH LANGUAGE
OVERVIEW OF NLP CONCEPTS …
AND THEIR APPLICATIONS IN PRODUCTION
FUTURE
3. General ML approach
• Formalize a decision-making process as a mathematical function
• Reduce the error rate of this function
• Formalization and minimization
4. General ML approach
• A mathematical function requires input of constant dimension
• We can break an image down into a sequence of brightness levels from 0 to 255
• How can we break down language?
FORMALIZATION
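A minimal sketch of the image case, assuming a made-up 3x3 grayscale picture (the pixel values and the `flatten` helper are illustrative, not from the slides). It shows how any picture of the same size becomes a vector of the same length:

```python
# A tiny 3x3 grayscale "image": each value is a brightness level from 0 to 255.
# The pixel values are made up for illustration.
image = [
    [0, 128, 255],
    [64, 200, 32],
    [255, 0, 16],
]


def flatten(image):
    """Turn a 2D grid of brightness levels into one flat sequence."""
    return [pixel for row in image for pixel in row]


vector = flatten(image)
# Every 3x3 image yields a vector of length 9: the constant dimension
# that a mathematical function needs as its input.
print(len(vector), vector)
```

Text has no such natural fixed-size grid, which is why the question of breaking down language needs a separate answer.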
5. General ML approach
• Find cats in pictures
• Each picture is an equation in terms of filter theory
• Compose the equations into a system of equations
• Solve the system of equations to get a cat-finding framework
FORMALIZATION
6. General ML approach
x - 2y = 3
x + y = 6
3y = 3, so y = 1 and x = 5
ANALYTICAL SOLUTION
Brightness levels of the picture from samples and the cat position are in red
Neural network coefficients are in blue

Pixel 1   Pixel 2   The cat
   1        -2         3
   1         1         6
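The analytical route can be sketched in a few lines. This is an illustration using Cramer's rule for a 2x2 system; the `solve_2x2` helper is hypothetical and not part of the slides:

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve  a11*x + a12*y = b1,  a21*x + a22*y = b2  by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the system has no unique solution")
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - b1 * a21) / det
    return x, y


# The system from the slide:  x - 2y = 3,  x + y = 6
x, y = solve_2x2(1, -2, 3, 1, 1, 6)
print(x, y)  # 5.0 1.0
```

Each row of the table (pixel 1, pixel 2, the cat) is one labeled sample; the unknowns x and y play the role of the network coefficients.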
7. General ML approach
x - 2y = 3
x + y = 6
You are a really smart
human being, too
smart to do the analytics
x = 5
y = 1
STOCHASTIC SOLUTION
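One way to reach the same answer without algebra is to start from a guess and repeatedly reduce the error on a randomly chosen equation. The sketch below uses stochastic gradient descent on the squared error; the step count, learning rate, and seed are assumptions chosen for illustration, not values from the slides:

```python
import random

# Each row is (coefficient of x, coefficient of y, right-hand side).
EQUATIONS = [(1.0, -2.0, 3.0), (1.0, 1.0, 6.0)]


def solve_stochastically(equations, steps=2000, learning_rate=0.05, seed=0):
    """Minimize the squared error one randomly picked equation at a time."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0  # arbitrary starting guess
    for _ in range(steps):
        a, b, target = equations[rng.randrange(len(equations))]
        error = a * x + b * y - target
        # Gradient of error**2 with respect to x and y.
        x -= learning_rate * 2 * error * a
        y -= learning_rate * 2 * error * b
    return x, y


x, y = solve_stochastically(EQUATIONS)
print(round(x, 6), round(y, 6))
```

This is formalization and minimization in miniature: the equations formalize the problem, and the loop minimizes the error.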
8. Is it possible to break down language?
• Language consists of spoken sounds or written symbols
• A vast number of spoken languages, each with a unique set of sounds
• A long list of written traditions with local modifications. Signs are used for sounds, words, and expressions
• No common interface at all!
FORMALIZATION
9. Focus on the business goal, leave linguistics to linguists
• Written text only
• English alphabet only (or, for example, Latin only)
FORMALIZATION
10. One-hot encoding
     A  B  C  D  E  F  …  K  L  M  …  O  P  Q  …  Food  Company
A    1  0  0  0  0  0  …  0  0  0  …  0  0  0  …  0     1
P    0  0  0  0  0  0  …  0  0  0  …  0  1  0  …  0     1
P    0  0  0  0  0  0  …  0  0  0  …  0  1  0  …  0     1
L    0  0  0  0  0  0  …  0  1  0  …  0  0  0  …  0     1
E    0  0  0  0  1  0  …  0  0  0  …  0  0  0  …  0     1
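A minimal sketch of the letter columns of this table, assuming a 26-letter uppercase Latin alphabet (the `one_hot` and `encode_word` helpers are illustrative names; the Food and Company label columns are left out):

```python
import string

ALPHABET = string.ascii_uppercase  # the 26 Latin capital letters


def one_hot(letter):
    """Return a vector of zeros with a single 1 at the letter's position."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(letter)] = 1
    return vector


def encode_word(word):
    """Encode a word as one row per letter, as in the table."""
    return [one_hot(letter) for letter in word.upper()]


encoded = encode_word("APPLE")
for letter, row in zip("APPLE", encoded):
    print(letter, row)
```

Every letter becomes a vector of the same length, so a word turns into a sequence of constant-dimension inputs.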
11. Unstructured data extraction: Sequence neural network
IT IS NOT TRUE THAT DRUNK DRIVING IS SAFE
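A possible reading of this example, offered as an assumption rather than something the slide states: word order carries meaning, so a model has to read text as a sequence. The sketch below compares the sentence with a reordered variant (made up for illustration) that uses exactly the same words but means the opposite:

```python
from collections import Counter

original = "it is not true that drunk driving is safe".split()
reordered = "it is true that drunk driving is not safe".split()

# As unordered bags of words the two sentences are identical...
same_bag = Counter(original) == Counter(reordered)
# ...but as sequences they differ, and so does their meaning.
same_sequence = original == reordered

print(same_bag, same_sequence)  # True False
```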
13. Data labeling
• Data labeling is straightforward in some cases, but
• Data might be sensitive
• Labeling might require rare expertise, such as exotic languages or a
complex business domain
• For us, synthetic labeled data and production data are the same, so
let's build a feedback loop: retrain the model automatically over
time from user interaction with the data.
CHALLENGES AND APPROACHES
14. Pretrained models and fine-tuning boom
• Large models pretrained on huge datasets, prepared through an
expensive training process
• Tech giants train, you use
• Let’s download and apply word2vec to improve semantic
understanding
MINIMIZATION
18. With semantics insights we can…
• Prioritize emails by sentiment or topic from a sentiment summary
• Extract legal and personal names to process and prioritize emails
automatically, or even register an application in an internal system
automatically
• Predict the business domain from a text summary and automatically
forward an insurance compensation claim to the proper department
SUM UP TEXT SEMANTICS AND CATEGORIZE
19. With semantics insights we can…
SUM UP THE WORDS IN A SENTENCE AND DO SEMANTIC SEARCH, JUST FINE-TUNE IT FOR YOUR DOMAIN
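A toy sketch of this idea, assuming hand-made 3-dimensional word vectors (all vector values, documents, and helper names below are hypothetical; a real system would use pretrained embeddings fine-tuned for the domain). A sentence is summed up as the average of its word vectors, and documents are ranked by cosine similarity to the query:

```python
import math

# Hypothetical word vectors, made up for illustration only.
WORD_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.1],
    "insurance": [0.5, 0.8, 0.0],
    "claim": [0.3, 0.9, 0.1],
    "pizza": [0.0, 0.1, 0.9],
    "recipe": [0.1, 0.0, 0.8],
}


def embed(text):
    """Sum up a text as the average of the vectors of its known words."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query, documents):
    """Rank documents from most to least similar to the query."""
    query_vector = embed(query)
    return sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)


documents = ["vehicle insurance claim", "pizza recipe"]
ranked = search("car", documents)
print(ranked)
```

The query "car" shares no word with either document, yet the insurance document ranks first because its vectors point in a similar direction.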
20. With semantics insights we can…
ENCODE WORDS INTO VECTORS AND DECODE THE SENTENCE INTO ANOTHER LANGUAGE OR STYLE
21. The transformer age
• State of the art
• Increased ability to summarize the semantics of text
• Generate text from metadata: the opposite of summarization
• Google BERT, XLNet, RoBERTa
• Our invoice extraction may be implemented with FastText
embeddings and BERT classification instead of one-hot encoding +
LSTM to become a state-of-the-art data extraction algorithm
• Increased summarization capabilities boosted the tasks that were
discussed previously
• New text generation capabilities enabled intelligent Q&A and chatbots
• AutoML features from cloud providers help build ML solutions
without ML knowledge
• Applied in COVID-19 research to help pharma companies with
master data management
22. GPT-3 is coming…
• Largest transformer ever: costs between $11.5 million and $27.6 million,
plus the overhead of parallel GPUs, for infrastructure only
• Made quite a stir with its public API. The demo version of the model
produced many offensive texts and articles. Philosophical
discussions about control over AI became more relevant.
• In beta now, release date is unclear. Microsoft acquired a license and
will provide GPT-3 power by subscription.
23. Summary
• Machine Learning is formalization and minimization
• Production Machine Learning depends heavily on business insights and constraints
• Modern NLP concepts are combinations of encoding, summarization and decoding
• Let’s keep an eye on GPT-3 and stay safe