DMAP (Data Mining and Automation for Platforms) is an online framework that presents a wide variety of official data, news, and information about companies. It was developed to act as a one-stop platform for displaying everything official related to a company and its competitors. It aggregates data from the following sources: Bing News, companies' official blogs, RSS sources, Facebook, Twitter, Google Trends, Crunchbase, financial data sources, and Alexa Web Analytics.
The aggregated information is presented so that a user can perform exploratory analysis of a particular company. In particular, the interface makes it easy to compare a company with its competitors. The user can apply filters to select a set of companies, data sources, and date ranges. All the information is presented within the framework, so the user doesn't have to visit different domains. The fetched news articles are cleaned, enriched, and presented in a manner that allows an analyst to navigate through them using named entities. For each news article, four types of named entities are extracted: people, locations, organizations, and company names. For a given search query, the interface returns matching news articles along with their associated entities, which are shown as word clouds. This makes it easy to discover connections between entities.
The framework was developed as part of a 2015 summer internship with The Center for Global Enterprise (CGE). I conceptualized, designed, and developed the framework during the three-month period.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
DMAP - Data Mining Automation for Platforms Aggregates Official Company Info
1. DMAP - Data Mining Automation for Platforms
Parang Saraf, Ph.D. Candidate
Discovery Analytics Center, Department of Computer Science, Virginia Tech
parang@vt.edu
2. What is DMAP?
• DMAP is a system for aggregating official information about a company from a wide range of data sources
  – News Media
  – Official Company Blog Posts
  – Official Twitter and Facebook Handles
• The intuitive presentation of articles helps in analyzing them faster and better
3. Objective
  1. Overview of the system architecture
  2. Example News data files
  3. Ravens Interface
  4. Overview of the web architecture
  5. Example RSS data files
  6. Horse’s Mouth Interface
  7. Example Facebook and Twitter data files
  8. Gossipers Interface
  9. Moving from Proof of Concept to Production System
4. DMAP - Objective
• A one-stop platform for displaying everything official related to a company
  – News Sources (Ravens)
  – Official Blogs (Horse’s Mouth)
  – Official Twitter and Facebook Accounts (Gossipers)
  – Other Information (Facts)
    • Company Information
    • Web activity information
    • Financial Information
6. System Architecture: Data Collection Through Official APIs
• We use official APIs to collect data; no scraping
• Why use Bing and not Google News?
• Example search result data
• Bing Cost Calculation
7. System Architecture: Data Cleaning and Enrichment
• Structured data doesn’t mean clean data
• Fetch the original article by following the URL in the search results
• Remove Boilerplate
• Filter again for the keyword
• Enrich article with Basis Entity Extractor
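The "filter again for the keyword" step above can be sketched as follows. This is a minimal illustration, not the DMAP implementation: it assumes each cleaned article is a dict with a `text` field, and the helper names `mentions_company` and `filter_articles`, as well as the whole-word, case-insensitive matching rule, are assumptions made for the example.

```python
import re

def mentions_company(article_text: str, keyword: str) -> bool:
    """Return True if the cleaned article body mentions the keyword.

    Hypothetical helper: a search hit may match only on page boilerplate
    (menus, related-story links), so the check is repeated on the
    article body after boilerplate removal.
    """
    # Whole-word, case-insensitive match; re.escape guards against
    # regex metacharacters that may appear in company names.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, article_text, flags=re.IGNORECASE) is not None

def filter_articles(articles, keyword):
    """Keep only articles whose cleaned text still mentions the keyword."""
    return [a for a in articles if mentions_company(a["text"], keyword)]
```

For example, `filter_articles(articles, "Apple")` would keep an article about Apple's earnings but drop one that only mentions "pineapple".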
8. System Architecture: Duplicate Detection and Data Loading
• Search for duplicate content
• Load data in the database
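The slides do not say how duplicate content is detected. One simple possibility, shown here purely as an assumed sketch, is to fingerprint each article's text after normalizing case and whitespace and to skip any article whose fingerprint has already been seen. This catches only exact duplicates after normalization; near-duplicates (for example, syndicated copies with small edits) would need a similarity measure instead. The `text` field and function names are hypothetical.

```python
import hashlib
import re

def content_fingerprint(text: str) -> str:
    """Hash the article text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_duplicates(articles):
    """Return articles in their original order, skipping repeated content."""
    seen = set()
    unique = []
    for article in articles:
        fingerprint = content_fingerprint(article["text"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(article)
    return unique
```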
9. Ravens Interface
• Results control options
• Dynamically generated Google Trends chart
• Dynamically generated Word Clouds
• Article Detailed View
• Article specific word clouds
What happens behind the scenes when you click the green “SUBMIT” button?
10. System Architecture: Web Architecture
• Model-View-Controller (MVC) Framework
[Diagram: Model, Controller, and View components divided between server side and client side, labeled 40% and 60%]
11. System Architecture: What happens after clicking “Submit”
• Client-side Sanity Check
• Checks written
• If everything is as expected, sends the query parameters to the server
12. System Architecture: What happens after clicking “Submit”
  1. Perform basic sanity checks
  2. Get the qualifying article set based on the search parameters
  3. Identify the 10 search results from this set based on the page number
  4. Do a frequency count of all entities (People, Location, Organization, and Product), pick the top 50 in each category, and determine their font size based on frequency
  5. Generate Google Trends chart data
  6. Return everything back to the client
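Steps 3 and 4 above can be sketched as below. The page size of 10 and the top-50 cutoff come from the slide; everything else is an assumption for illustration: the article layout (a dict with one list of entity strings per category), the function names, the linear frequency-to-size mapping, and the pixel bounds. The slide also does not say whether entities are counted over the full qualifying set or only the current page, so the sketch counts over whatever list it is given.

```python
from collections import Counter

PAGE_SIZE = 10   # results per page (step 3)
TOP_N = 50       # entities kept per category (step 4)
CATEGORIES = ("people", "location", "organization", "product")

def page_slice(articles, page_number):
    """Return the articles shown on a 1-based page number."""
    start = (page_number - 1) * PAGE_SIZE
    return articles[start:start + PAGE_SIZE]

def word_cloud_data(articles, min_px=12, max_px=48):
    """Count entities per category and map each frequency to a font size.

    Assumed scaling: linear between min_px (least frequent of the kept
    entities) and max_px (most frequent).
    """
    clouds = {}
    for category in CATEGORIES:
        counts = Counter()
        for article in articles:
            counts.update(article.get(category, []))
        top = counts.most_common(TOP_N)
        if not top:
            clouds[category] = []
            continue
        highest, lowest = top[0][1], top[-1][1]
        span = (highest - lowest) or 1  # avoid dividing by zero on ties
        clouds[category] = [
            {"text": name, "count": n,
             "size": round(min_px + (n - lowest) * (max_px - min_px) / span)}
            for name, n in top
        ]
    return clouds
```

The resulting per-category lists of text and size pairs are the kind of payload a client-side word cloud renderer could consume directly.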
13. System Architecture: What happens after clicking “Submit”
  1. Display the returned search results
  2. Embed dynamically generated Google Search Trends
  3. Dynamically generate word clouds using D3
14. RSS Reader
• Sample blog feed
• Problem: How to know if someone has published a new post?
• Use a feed reader: Inoreader
  – How to subscribe to a blog in Inoreader
  – Checks blogs at regular intervals for new posts
• Wrote scripts that download “new” posts found by Inoreader
• Sample blog post example fetched through API
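The bookkeeping behind downloading only “new” posts can be sketched as follows. This example does not call the Inoreader API. It assumes the feed items have already been fetched as dicts carrying a unique `id`, and it keeps the set of previously seen ids in a local JSON file. The function names and the file-based storage are hypothetical choices for the illustration.

```python
import json
from pathlib import Path

def load_seen(path):
    """Load the set of already-downloaded post ids (empty if no file yet)."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()

def save_seen(path, seen_ids):
    """Persist the seen ids so the next run can skip them."""
    Path(path).write_text(json.dumps(sorted(seen_ids)))

def select_new_posts(items, seen_ids):
    """Return unseen items in order and record their ids in seen_ids."""
    new_items = []
    for item in items:
        if item["id"] not in seen_ids:
            seen_ids.add(item["id"])  # mutates the caller's set on purpose
            new_items.append(item)
    return new_items
```

A periodic job would call `load_seen`, pass the latest fetched items through `select_new_posts`, store the new posts, and then call `save_seen`.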
15. Horse’s Mouth
• Selected companies carry over
• Now articles are limited only to the official blogs
16. Twitter API
• Have to abide by time-based rates while fetching data
• Tweets and replies both count towards tweets from an account
  – We are only interested in tweets
  – Discard reply tweets
  – There can be a lag while fetching tweets
• Sample Tweet fetched through API
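Discarding reply tweets can be sketched as a filter over already-fetched tweet objects. The sketch assumes the tweets are dicts in the Twitter REST API v1.1 JSON format, where `in_reply_to_status_id` and `in_reply_to_user_id` are null for tweets that are not replies. The function names are hypothetical, and this is not necessarily how DMAP filtered them.

```python
def is_reply(tweet: dict) -> bool:
    """True if the tweet object is a reply to another tweet or user."""
    return (tweet.get("in_reply_to_status_id") is not None
            or tweet.get("in_reply_to_user_id") is not None)

def original_tweets(timeline):
    """Keep only the tweets in a fetched timeline that are not replies."""
    return [t for t in timeline if not is_reply(t)]
```

Because replies still count toward the number of tweets returned per request, filtering them out afterwards can leave fewer usable tweets than were requested, which is one reason fetching may appear to lag. Note that this filter also drops an account's replies to its own tweets, such as threads.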
17. Facebook API
• Can’t be automated. Have to generate a new authorization every time before fetching data
• Pictures are provided in low resolution
• Sample Facebook post fetched through API
18. Gossipers
• Solves the feed personalization problem in the true sense
• Twitter allows embedding of tweets:
  – Leads to the exact same presentation format as they appear on twitter.com
  – The media links are all embedded and work
• Facebook doesn’t allow embedding of posts due to privacy issues
  – Have to recreate the presentation look and style
  – Media links don’t work
• Pagination works individually for both of them
19. Facts
• Will show the following:
  – Static information about the company
  – Financial Information
  – Web usage data
  – Other miscellaneous information