Financial News Mining @ FD Mediagroep/Company.info

Part of FD Mediagroep
Financial News Mining
Data Science Northeast Netherlands Meetup, 16 Nov 2017

Who am I?
• BA. Media Studies (UvA)
• Science editor (NTR)
• MSc. Media Technology (Leiden)
• Ph.D. in Information Retrieval @ UvA (2017)
• “Entities of Interest: Discovery in Digital Traces”
• Data Scientist at FD Mediagroep/Company.info

Outline
• Financial News @ FDMG/Company.info
• Entity Linking
• What is it?
• Entity Linking with custom KB:
• Approach
• Results
• Applications

FD Mediagroep

Financial News

Data
• News articles:
• Hundreds of sources (Dutch, online)
• From Het Financieele Dagblad to the Groninger Gezinsbode
• Thousands of articles per day
• Multiple years of archive
• Knowledge Base:
• ~2.7M companies & organisations
• Rich metadata: sector information, financial information, people, buildings, etc.

Linking companies in news
• Before: humans
• Now: machines

Entity Linking

Entity Linking
1. Identify entity mentions (words that refer to organisations)
• NER: Named-entity Recognition
2. Link entity mentions to unique ID of entities in KB (KvK #)
• EL: Entity Linking
• Aka Entity Resolution
• Aka Entity Disambiguation

Step 1: NER
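• De Hoge Dennen Capital heeft een minderheidsbelang genomen in Pseudonimiseer, een Amsterdamse start-up die is gespecialiseerd in privacybescherming bij data-analyse. (English: “De Hoge Dennen Capital has taken a minority stake in Pseudonimiseer, an Amsterdam start-up specialising in privacy protection for data analysis.”)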

Step 2: EL
• Hoge Dennen Capital -> 32102936 0000 (De Hoge Dennen Capital B.V.)
• Pseudonimiseer -> 58388702 0000 (Viacryp B.V.)

Challenges
• A single entity mention can refer to multiple entities
• A single entity can be referred to by multiple entity mentions

Approach

Approach: NER
• NER: Sequence Prediction
• Based on [Graus et al., ECIR ’14]
• B-I-O scheme
• Beginning of entity mention
• Inside entity mention
• Outside entity mention
• E.g.: “Daarnaast sloot het bedrijf twee nieuwe overeenkomsten met Xenos en Big Bazar voor in totaal 2000 vierkante meter voor een periode van 10 jaar.” (English: “In addition, the company signed two new agreements with Xenos and Big Bazar for a total of 2000 square metres for a period of 10 years.”)
Daarnaast O
sloot O
het O
bedrijf O
twee O
nieuwe O
overeenkomsten O
met O
Xenos B-ORG
en O
Big B-ORG
Bazar I-ORG
voor O
in O
totaal O
2000 O
vierkante O
meter O
voor O
een O
periode O
van O
10 O
jaar O
. O
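
The tagged tokens above can be turned back into entity mentions by grouping each B tag with the I tags that follow it. The snippet below is a minimal sketch of that decoding step; the function name `bio_to_mentions` is an illustrative choice and not taken from the talk.

```python
from typing import List, Tuple


def bio_to_mentions(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (mention_text, entity_type) pairs from B-I-O tagged tokens."""
    mentions = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new mention starts; close any mention that is still open.
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continue the open mention.
            current.append(token)
        else:
            # "O" (or a stray I- tag) closes the open mention.
            if current:
                mentions.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        mentions.append((" ".join(current), current_type))
    return mentions


# Fragment of the example sentence with its tags.
tokens = ["met", "Xenos", "en", "Big", "Bazar", "voor"]
tags = ["O", "B-ORG", "O", "B-ORG", "I-ORG", "O"]
print(bio_to_mentions(tokens, tags))
```

For the fragment shown, this yields the two organisation mentions “Xenos” and “Big Bazar”.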

Approach: NER
• Features (for token t in sentence s):
• Token-identity: token=Xenos
• Word-shape: TokenIsCaps={1,0}, TokenIsNumber={1,0}, …
• Context: prevToken=met, nextToken=en, …
• Dictionary: TokenInCompanyDict={1,0}, InPersonNameDict={1,0}, …
• Corpus: token’s TF-IDF weight, token’s word-cluster membership, …
• And more…
• Structured Perceptron
• Predict tag {B, I, O}
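
As an illustration of the per-token features listed above, here is a small sketch of a feature extractor. It covers only the token-identity, word-shape, context and dictionary features. Two assumptions are made here: `TokenIsCaps` is read as "first letter is uppercase", and the company dictionary is a hypothetical one-entry set. The structured perceptron itself is not shown.

```python
from typing import Dict, List, Set


def token_features(sentence: List[str], i: int, company_dict: Set[str]) -> Dict[str, object]:
    """Features for token i of a tokenised sentence (subset of the listed features)."""
    token = sentence[i]
    return {
        "token": token,                                   # token identity
        "TokenIsCaps": int(token[:1].isupper()),          # assumption: capitalised first letter
        "TokenIsNumber": int(token.isdigit()),
        "prevToken": sentence[i - 1] if i > 0 else "<START>",
        "nextToken": sentence[i + 1] if i < len(sentence) - 1 else "<END>",
        "TokenInCompanyDict": int(token in company_dict),
    }


sentence = ["overeenkomsten", "met", "Xenos", "en", "Big", "Bazar"]
company_dict = {"Xenos"}  # hypothetical dictionary, for illustration only
features = token_features(sentence, 2, company_dict)
print(features)
```

A sequence model such as the structured perceptron would consume one such feature dictionary per token and predict the tag sequence jointly.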

Approach: EL
• Common: Linking to Wikipedia

EL to Wikipedia
• Use mappings:
• Anchor texts to Wikipedia pages.
• Kendrick Lamar -> Kendrick_Lamar
• Kendrick Duckworth -> Kendrick_Lamar
• Use statistics:
• How often are words used as anchor?
• To which pages do they link?
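
The second statistic ("to which pages do they link?") is often summarised as the fraction of an anchor text's links that point to a given page, a quantity usually called commonness in the entity-linking literature. The sketch below computes it from a toy list of anchor/target pairs; the counts are invented purely for illustration and do not come from Wikipedia or from the talk.

```python
from collections import Counter, defaultdict

# Toy (anchor text, target page) pairs; the counts are invented for illustration.
anchors = [
    ("Kendrick Lamar", "Kendrick_Lamar"),
    ("Kendrick Lamar", "Kendrick_Lamar"),
    ("Kendrick Duckworth", "Kendrick_Lamar"),
    ("Kendrick", "Kendrick_Lamar"),
    ("Kendrick", "Anna_Kendrick"),
    ("Kendrick", "Kendrick_Lamar"),
]

# link_counts[anchor text][page] = number of links with that text to that page
link_counts = defaultdict(Counter)
for text, page in anchors:
    link_counts[text][page] += 1


def commonness(mention: str, page: str) -> float:
    """Fraction of links with this anchor text that point to `page`."""
    total = sum(link_counts[mention].values())
    return link_counts[mention][page] / total if total else 0.0


print(commonness("Kendrick", "Kendrick_Lamar"))
print(commonness("Kendrick Duckworth", "Kendrick_Lamar"))
```

An ambiguous anchor such as "Kendrick" spreads its mass over several pages, while an unambiguous one concentrates it on a single page.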

Approach: EL
• Custom KB, custom features
• Based on [Meij et al., WSDM ’12]
• Binary classification over <mention, candidate> pairs
1. For each mention m:
• Retrieve candidate organisations (query the Company.info database with m)
2. For each candidate c:
• Entity features: Turnover, Size, etc.
• Mention features: MentionLength, etc.
• Entity-Mention features: MentionTitleOverlap, etc.
• Doc features: WoonplaatsInDocument (Woonplaats = place of residence), etc.
• Classify(m, c, doc) -> score
3. Take the top-ranked entity
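
The three steps above can be sketched as a small candidate-ranking loop. Everything concrete in this snippet is an assumption made for illustration: the KB records, their `title` and `city` fields, the Jaccard definition of `MentionTitleOverlap`, and the hand-set `toy_score` that stands in for the trained binary classifier.

```python
from typing import Callable, Dict, List, Optional


def mention_title_overlap(mention: str, title: str) -> float:
    """Jaccard overlap between the words of the mention and the KB title."""
    m, t = set(mention.lower().split()), set(title.lower().split())
    return len(m & t) / len(m | t) if m | t else 0.0


def extract_features(mention: str, candidate: Dict, doc: str) -> Dict[str, float]:
    return {
        "MentionLength": float(len(mention.split())),
        "MentionTitleOverlap": mention_title_overlap(mention, candidate["title"]),
        "WoonplaatsInDocument": float(candidate["city"].lower() in doc.lower()),
    }


def link(mention: str, candidates: List[Dict], doc: str,
         score: Callable[[Dict[str, float]], float]) -> Optional[Dict]:
    """Score every <mention, candidate, doc> tuple and return the top-ranked candidate."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(extract_features(mention, c, doc)))


def toy_score(features: Dict[str, float]) -> float:
    # Stand-in for a trained classifier's confidence; the weights are invented.
    return features["MentionTitleOverlap"] + 0.5 * features["WoonplaatsInDocument"]


# Hypothetical candidates, as if retrieved from the KB for the mention.
candidates = [
    {"id": "kb-1", "title": "Acme Holding B.V.", "city": "Utrecht"},
    {"id": "kb-2", "title": "Acme Bakkerij B.V.", "city": "Groningen"},
]
doc = "Acme Holding, based in Utrecht, acquired a stake."
best = link("Acme Holding", candidates, doc, toy_score)
print(best["id"])
```

In a setup like the one described, `toy_score` would be replaced by the positive-class probability of the trained classifier.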

Data
• Multiple years of (hand-labeled) articles.
• NER:
• Split article into sentences
• Filter sentences with at least 2 entity mentions
• EL:
• Apply NER to article
• For each mention (m) in doc:
• Query KB (retrieve 20 candidates)
• For each <m, c, doc>-tuple:
• Extract features
• Label: If c == groundtruth: label POS, else NEG
• Train binary classifier
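
The EL training-data recipe above can be sketched as follows. The toy KB, the word-overlap `query_kb` function and the `TitleInDocument` feature are all hypothetical placeholders; only the candidate limit of 20 and the POS/NEG labeling rule come from the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical toy KB: every ID and name here is invented for illustration.
TOY_KB = [
    {"id": "kb-1", "title": "Acme Holding B.V."},
    {"id": "kb-2", "title": "Acme Bakkerij B.V."},
    {"id": "kb-3", "title": "Zenith Vastgoed B.V."},
]


def query_kb(mention: str, n: int = 20) -> List[Dict]:
    """Return up to n KB records that share at least one word with the mention."""
    words = set(mention.lower().split())
    hits = [r for r in TOY_KB if words & set(r["title"].lower().split())]
    return hits[:n]


def build_examples(labeled_mentions: List[Tuple[str, str, str]]) -> List[Tuple[Dict, str]]:
    """labeled_mentions holds (mention, ground-truth KB id, document text) triples."""
    examples = []
    for mention, gold_id, doc in labeled_mentions:
        for candidate in query_kb(mention):
            features = {
                "MentionLength": len(mention.split()),
                # Hypothetical document feature, for illustration only.
                "TitleInDocument": int(candidate["title"].lower() in doc.lower()),
            }
            # The ground-truth candidate is the positive example; all others are negative.
            label = "POS" if candidate["id"] == gold_id else "NEG"
            examples.append((features, label))
    return examples


examples = build_examples([("Acme Holding", "kb-1", "Acme Holding B.V. opens an office.")])
for features, label in examples:
    print(label, features)
```

Because each mention has at most one correct candidate, this procedure produces many more NEG than POS examples, which is worth keeping in mind when training and evaluating the classifier.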

Evaluation
• Take data, make train/test-split
• NER: ~85%
• EL: ~85%
• But: Data is noisy/biased
• + Manual inspection

Bonus: Entity Salience
• Based on [Reinanda et al., CIKM ‘16]
• Simple baseline approach:
• Prominence: where in the document is entity first mentioned?
• Frequency: how often is entity mentioned?
• Salience: math.sqrt(Prominence*Frequency)
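
A minimal sketch of this baseline is shown below. The slide gives only the final formula, so the definitions of the two inputs are assumptions: prominence is taken as one minus the relative position of the entity's first mention, and frequency as the entity's share of all mentions in the document.

```python
import math
from typing import List


def salience(entity: str, mentions: List[str]) -> float:
    """Geometric mean of prominence and frequency.

    `mentions` lists the linked entities of one document in reading order.
    """
    if entity not in mentions:
        return 0.0
    # Assumption: an earlier first mention gives a higher prominence (1.0 = first).
    prominence = 1.0 - mentions.index(entity) / len(mentions)
    # Assumption: frequency is the entity's share of all mentions.
    frequency = mentions.count(entity) / len(mentions)
    return math.sqrt(prominence * frequency)


mentions = ["NAM", "Arcadis", "NAM", "TU Delft"]
for entity in ["NAM", "Arcadis", "TU Delft"]:
    print(entity, round(salience(entity, mentions), 3))
```

With these definitions, an entity that is mentioned first and often scores close to 1, while an entity mentioned once near the end scores close to 0.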

Bonus: Sentiment analysis
• Simple Bag-of-Words binary classifier (Naive Bayes)
• Trained on ~10k hand-labeled articles (labeled POS/NEG)
• Given article (TF-IDF weighted vector), predict {POS, NEG}
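
To make the idea concrete, here is a small bag-of-words Naive Bayes classifier in plain Python. It is a simplified sketch: it uses raw word counts with add-one smoothing rather than the TF-IDF weighted vectors mentioned above, and the four training texts are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def train_nb(docs: List[Tuple[str, str]]):
    """Return (log priors, per-class word counts, vocabulary) from (text, label) pairs."""
    class_docs = Counter(label for _, label in docs)
    word_counts: Dict[str, Counter] = {label: Counter() for label in class_docs}
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {label: math.log(n / len(docs)) for label, n in class_docs.items()}
    return priors, word_counts, vocab


def predict_nb(model, text: str) -> str:
    priors, word_counts, vocab = model
    scores = {}
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy training data; the real model was trained on ~10k labeled articles.
model = train_nb([
    ("profit rises strongly", "POS"),
    ("record profit and growth", "POS"),
    ("loss deepens after fraud", "NEG"),
    ("bankruptcy and heavy loss", "NEG"),
])
print(predict_nb(model, "profit growth"))
print(predict_nb(model, "heavy loss"))
```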

Document Enrichment
• On average: 0.24 s/article
1. NER: Feature extraction + Prediction
2. EL: Retrieve Candidates (one query per mention)
3. EL: Feature Extraction+Classification (for each candidate)
4. Entity Salience Scoring
5. Sentiment analysis
• Number of published articles per day: approx. +160%
• Number of linked orgs: approx. +310%
• Works 24h/day
• More “long tail” articles

Applications

Burst detection/summarization
• Simple burst detection algo:
• Take rolling average of time series
• Take cutoff (e.g., mean+std)
• Any point over cutoff = burst
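
The three steps above can be sketched in a few lines of Python. The window size, the use of a trailing window, the population standard deviation and the example counts are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def rolling_average(series: List[float], window: int = 3) -> List[float]:
    """Trailing rolling mean; early points average over the values seen so far."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def detect_bursts(series: List[float], window: int = 3) -> List[int]:
    """Indices whose smoothed value exceeds mean + one standard deviation."""
    smoothed = rolling_average(series, window)
    cutoff = mean(smoothed) + pstdev(smoothed)
    return [i for i, value in enumerate(smoothed) if value > cutoff]


# Invented daily article counts for one organisation.
counts = [2, 3, 2, 3, 2, 20, 25, 18, 3, 2, 2, 3]
print(detect_bursts(counts))
```

Because the rolling average trails the raw series, detected bursts start slightly later than the first spike in the raw counts; a centred window would reduce that lag.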
Nederlandse Aardolie Maatschappij B.V.

2016-08
• Green light for oil extraction in Drenthe
• Robotic crane RoBorg on board the Kroonborg
• NAM wastewater again routed through Hardenberg to Twente
• Minister Kamp: NAM may again inject wastewater into Twente soil
• NAM to resume water injection next month
• “Trust in NAM and CVW at an all-time low.”
• Green light for restart of oil extraction in Schoonebeek
• Oil extraction in Schoonebeek to resume mid-September
• TU Delft: 'Arcadis damage study is flawed'

2017-03
• NAM liable for non-material damage from earthquakes
• NAM liable for psychological damage from earthquakes
• Earthquake misery: 'It is eating away at us'
• NAM liable for psychological harm to residents of the earthquake area
• NAM liable for non-material damage to residents of the Groningen field
• NAM also liable for non-material damage caused by earthquakes
• Live: Court case on non-material damage from earthquakes [ended]
• NAM must also compensate non-material earthquake damage
• 'Ruling is a sledgehammer blow for NAM and Minister Kamp'

Sentiment+events

Affiliation Networks

As a feature

Fin
Questions?
@dvdgrs
www.graus.co
david.graus@fdmediagroep.nl
References:
D. Graus, M. Tsagkias, L. Buitinck, and M. de Rijke, “Generating pseudo-ground truth for predicting new concepts in social streams,” ECIR 2014.
E. Meij, W. Weerkamp, and M. de Rijke, “Adding semantics to microblog posts,” WSDM 2012.
R. Reinanda, E. Meij, and M. de Rijke, “Document Filtering for Long-tail Entities,” CIKM 2016.
