1. The Allen AI Science Challenge & DeepHack.Q&A
St. Petersburg Data Science Meetup #6, Feb 19th, 2016
2. Q: When athletes begin to exercise, their heart rates and respiration rates
increase. At what level of organization does the human body coordinate these
functions?
A. at the tissue level
B. at the organ level
C. at the system level
D. at the cellular level
Wed 7 Oct 2015 – Sat 13 Feb 2016
Stage 1: 800 teams (>1000 participants), Stage 2: 170 teams
https://www.kaggle.com/c/the-allen-ai-science-challenge
2700 questions - train set
8132 questions - validation set
21298 questions - final test set
3. DeepHack Q&A qa.deephack.me/
Qualification round: Top-50 participants with the highest scores
Tough competition: Kaggle Top-40 just to get into the Top-50 o_O
Winter ML school + hackathon: Jan 31st – Feb 5th, 2016
GP team formed on Jan 31st from four teams
The final 30 minutes of the hackathon: https://www.youtube.com/watch?v=tCKL5vbiHuo
4. Pavel Kalaidin (VK)
Marat Zainutdinov (Quantbrothers)
Roman Trusov (ITMO University)
Artyom Korkhov (Zvooq)
Igor Shilov (Zvooq)
Timur Luguev (Clevapi)
Ilyas Luguev (Clevapi)
Team Generation Gap
DeepHack: 1st, ~0.556
Allen AI: 7th, 0.55059
8. AdaGram (a.k.a. Reptil)
Breaking Sticks and Ambiguities with Adaptive Skip-gram: http://arxiv.org/abs/1502.07257
Reference implementation in Julia: https://github.com/sbos/AdaGram.jl
9. reptil art cultur final play
signific role folklor
religion popular cultur moch
peopl noun coldblood anim
scale general move stomach
short leg exampl snake lizard
turtl noun aw person
10. Model trained like this:
sh train.sh --min-freq 20 --window 5 --workers 40 --epochs 5 --dim 300 --alpha 0.1 corpus.txt adam.dict adam.model
Number of prototypes is 5 by default.
12. N-gram PMI
PMI(x, y) = log p(x, y) / (p(x) p(y)), where x and y are n-grams
Examples, 1-gram -> 1-gram:
unit -> state
magnet -> field
carbon -> dioxid
million -> year
year -> ago
amino -> acid
Examples, 1-gram -> 3-gram:
around -> million year ago
period -> million year ago
forc -> van der waal
fossil -> million year ago
nobel -> prize physiolog medicin
date -> million year ago
mercuri -> venus earth mar
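A minimal sketch of how such PMI pairs can be collected (pure Python; the stemmed-sentence iterator and the per-sentence co-occurrence window are assumptions, not our exact pipeline):

import math
from collections import Counter

def ngrams(tokens, max_n=3):
    # all 1..max_n grams of a stemmed token list
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

gram_count = Counter()   # marginal n-gram counts
pair_count = Counter()   # (x, y) co-occurrences within a sentence
n_sents = 0

for tokens in stemmed_sentences:   # iterable of stemmed token lists (assumed)
    grams = set(ngrams(tokens))
    gram_count.update(grams)
    # quadratic in the grams per sentence; fine for short sentences
    pair_count.update((x, y) for x in grams for y in grams if x != y)
    n_sents += 1

def pmi(x, y):
    # PMI(x, y) = log p(x, y) / (p(x) p(y)), probabilities per sentence
    if not pair_count[(x, y)]:
        return float("-inf")
    return math.log(pair_count[(x, y)] * n_sents / (gram_count[x] * gram_count[y]))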
13. N-gram PMI
Question: What is the greatest contributor to air pollution in the United States?
Stemmed: greatest contributor air pollut unit state
Question n-grams:
1-grams: greatest, contributor, air, ...
2-grams: greatest contributor, contributor air, air pollut, ...
3-grams: ...
Answer: Power plants
1-grams: power, plant
2-grams: power plant
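Scoring an option then reduces to aggregating PMI links between the two n-gram sets. A sketch reusing ngrams() and pmi() from above; keeping each answer n-gram's best link and averaging is an assumption, the exact aggregation may differ:

def option_score(question_tokens, answer_tokens):
    q_grams = ngrams(question_tokens)
    a_grams = ngrams(answer_tokens)
    # each answer n-gram keeps its strongest PMI link to the question
    best = [max((pmi(q, a) for q in q_grams), default=float("-inf"))
            for a in a_grams]
    linked = [s for s in best if s != float("-inf")]
    return sum(linked) / len(linked) if linked else float("-inf")

# pick the option with the highest aggregate PMI
# (stem_tokenize is a hypothetical stemming tokenizer):
# answer = max(options, key=lambda o: option_score(q_tokens, stem_tokenize(o)))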
16. LSA + Lucene
Corpus -> LSA -> topic indices TI_1, TI_2, ..., TI_n, each indexed with Lucene
Every QA pair (qa pair 1 ... qa pair 4) is run as a query against each topic index.
Result: for each QA pair, max(s1...sn)
Gave a 1% improvement over basic Lucene, but took an EXTREMELY long time to process :(
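A rough sketch of the idea, with a single gensim similarity index standing in for the per-topic Lucene indices (the topic count and tokenization are illustrative; the max over documents plays the role of max(s1...sn)):

from gensim import corpora, models, similarities

texts = [doc.split() for doc in corpus_docs]   # corpus_docs: list of strings (assumed)
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
lsa = models.LsiModel(bow, id2word=dictionary, num_topics=200)
index = similarities.MatrixSimilarity(lsa[bow])

def qa_score(question, answer):
    query = dictionary.doc2bow((question + " " + answer).split())
    sims = index[lsa[query]]   # similarity of the QA pair to every document
    return float(sims.max())   # keep only the best hit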
17. Syntax co-occurrence
nobel chemistry prize 517
national science academy 445
long time period 340
also role play 306
nobel physic prize 279
national medical library 273
carbon water dioxide 261
second thermodynamics law 247
speed sound (of_pobj)
density population (compound)
take place (dobj)
link external (compound)
Score: 0.3 :(
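Counts like these can be collected from a dependency-parsed corpus; a sketch with spaCy (our actual parser and normalization differed, and the relation set here is illustrative):

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
triple_count = Counter()

for doc in nlp.pipe(raw_sentences):   # raw_sentences: iterable of strings (assumed)
    for tok in doc:
        if tok.dep_ in ("compound", "dobj", "pobj", "amod"):
            # (head lemma, child lemma, relation), e.g. ("prize", "nobel", "compound")
            triple_count[(tok.head.lemma_, tok.lemma_, tok.dep_)] += 1

# triple_count.most_common(10) -> the most frequent head/child pairs with counts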
18. word2vec combinations
Wanted to capture the intersection of meanings, but didn't know how to combine word2vec representations.
TF-IDF over QA pairs; combinations of question tokens vs. combinations of answer tokens, compared by cosine similarity (see the sketch below).
Max score ~0.3 :( even with careful keyword filtering
word2gauss didn't help either
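One simple way to combine the representations is a TF-IDF-weighted average; a sketch where a single weighted mean stands in for the token-subset combinations we tried (w2v is e.g. a gensim KeyedVectors model, tfidf a token -> weight dict, both assumed):

import numpy as np

def combine(tokens, w2v, tfidf):
    # TF-IDF-weighted average of the word2vec vectors of known tokens
    vecs = [tfidf.get(t, 1.0) * w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def qa_similarity(q_tokens, a_tokens, w2v, tfidf):
    q, a = combine(q_tokens, w2v, tfidf), combine(a_tokens, w2v, tfidf)
    return cosine(q, a) if q is not None and a is not None else 0.0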
21. Semantic Neural networks (2nd encounter)
+ Paragraphs
LSTM = LSTM(w2v)
LSTM(s1 | s2) > LSTM(s1 | s3) if s1 and s2 are from the same paragraph, while s1 and s3 are not
LSTM(a, b) is low when a and b are from the same paragraph (energy-based learning)
Loss = max(0, M - LSTM(s1, s2) + LSTM(s1, s3))
Score: 0.26
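A minimal PyTorch sketch of that margin loss (the scoring network itself and the batch format are assumptions; score_same and score_diff are scalar scores per pair):

import torch.nn.functional as F

def paragraph_margin_loss(score_same, score_diff, margin=1.0):
    # score_same = LSTM(s1, s2), s1 and s2 from the same paragraph
    # score_diff = LSTM(s1, s3), s3 from a different paragraph
    # Loss = max(0, M - LSTM(s1, s2) + LSTM(s1, s3)), averaged over the batch
    return F.relu(margin - score_same + score_diff).mean()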
24. Reading Neural networks (3rd encounter)
+ Lots of paragraphs
+ Search Engine
+ A survey:
- bigrams are not accounted for
- the main idea (keywords) of a sentence is not recognized
26. Reading Neural networks (3rd encounter)
All we want is to know whether a sentence comes from a given paragraph, so that we can rerank the Lucene scores.
29. Reading Neural networks (3rd encounter)
sentences -> LSTM -> Dense NN -> Embedding
w2v -> LSTM -> Dense NN -> Embedding
w2v -> Mean -> Dense NN -> Embedding
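A sketch of the third variant (w2v -> Mean -> Dense NN -> Embedding) in PyTorch; all dimensions are illustrative, and the LSTM variants would replace the mean with the final hidden state of an nn.LSTM:

import torch
import torch.nn as nn

class MeanEncoder(nn.Module):
    def __init__(self, w2v_dim=300, emb_dim=128):
        super().__init__()
        self.dense = nn.Sequential(
            nn.Linear(w2v_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, w2v_seq):                  # (batch, seq_len, w2v_dim)
        return self.dense(w2v_seq.mean(dim=1))   # average tokens, then project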
30. Neural networks: lessons learned
Start as small as possible
Corruption is important for siamese networks
Learning curves are misleading in NLP
31. Lessons learned
Start early: we wasted the first two months of the competition (but had a week of 24/7 hackathon at the end)
No stickers in the team channel (except ones with Yann LeCun on a good submit)
A common toolbox is nice
A dedicated server is a good thing to have (no need for AWS spot instances)
Experiment fast, fail early
Teamwork means a lot