Segmentation

“Segmentation” as the Workhorse of Business Analytics

Overview
We study the hierarchy of topics extracted from documents (news,
publications, discussions, etc.).
Our system is targeted at data researchers.
It provides:
• Trend tracking
• Detection of similar and related topics
• Topic segmentation, which aims to solve the information overload problem (http://mlvl.github.io/Hierarchie/)
The topic model we use is not a collection of tags but a combination of
NLP and statistical analysis.

Possible applications
• creating concept infographics (http://findtheconversation.com/concept-map/)
• estimating concept influence (http://brightpointinc.com/political_influence/)
• detecting semantic relations (http://bl.ocks.org/mbostock/1153292)
• visualizing nested segments as a concept hierarchy (http://bl.ocks.org/mbostock/7607535)

Test prototype
We developed a prototype called Data Mining Tool (DMT) to
test the analytics model.
As test data, we use tech and political news (about 2k + 1k RSS
feeds delivering 10k news items daily).

DMT workflow
1. Import Documents to Index
2. Extract metadata for each Document (NLP: keywords, labels, terms, etc.)
3. Extract Chains using Cluster Analysis
4. Assign Weights to Topics
5. Build Trends using ranking by Current Weight and Weight Dynamics
6. Build Segments (related topics, nested topics)
7. Visualize Data (Trends Statistics, Segments Hierarchy)
8. Explore Data (Flexible Search UI: Trends, Documents, Segments, Keywords, etc.)
9. Use API to communicate with the system
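Step 5 ranks trends by current weight and weight dynamics. The slides do not give the formula, so the following is only a minimal sketch of one way such a ranking could work. The `rank_trends` function, the `alpha` blend factor, and the definition of dynamics as the change since the previous observation are all illustrative assumptions, not DMT's actual method.

```python
from typing import Dict, List, Tuple


def rank_trends(
    history: Dict[str, List[float]], alpha: float = 0.5
) -> List[Tuple[str, float]]:
    """Rank trends by a blend of current weight and recent weight change.

    history maps a trend name to its weights over time (oldest first).
    alpha is a hypothetical blend factor: 1.0 ranks by current weight only,
    0.0 ranks by weight dynamics only.
    """
    scored = []
    for name, weights in history.items():
        if not weights:
            continue  # no observations yet, nothing to rank
        current = weights[-1]
        # Assumption: dynamics = change since the previous observation.
        dynamics = current - weights[-2] if len(weights) > 1 else 0.0
        scored.append((name, alpha * current + (1 - alpha) * dynamics))
    # Highest blended score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up sample data for illustration only.
example = {
    "china economy": [4.0, 5.0, 5.0],
    "japan election": [1.0, 2.0, 4.0],
}
ranking = rank_trends(example, alpha=0.5)
```

With this sample data, the fast-growing "japan election" trend ranks above the heavier but flat "china economy" trend, which is the kind of behavior a blend of weight and dynamics is meant to produce.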

Test documents
About 370k news items were imported in September-November 2014.
[Chart: document and term distribution statistics]
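Document and term distribution statistics of this kind can be gathered with simple counting. The following is a minimal sketch, assuming each document has already been reduced to a list of terms; the `term_distribution` name and the sample documents are illustrative and not taken from DMT.

```python
from collections import Counter
from typing import Dict, Iterable, List


def term_distribution(docs: Iterable[List[str]]) -> Dict[str, Counter]:
    """Count total term occurrences and per-term document frequency."""
    term_freq: Counter = Counter()
    doc_freq: Counter = Counter()
    for terms in docs:
        term_freq.update(terms)      # every occurrence counts
        doc_freq.update(set(terms))  # each document counts once per term
    return {"term_freq": term_freq, "doc_freq": doc_freq}


# Made-up sample documents for illustration only.
docs = [
    ["japan", "election", "japan"],
    ["china", "economy"],
    ["japan", "china"],
]
stats = term_distribution(docs)
print(stats["term_freq"].most_common(2))
print(stats["doc_freq"]["japan"])
```

Keeping term frequency and document frequency separate makes it possible to tell a term that is repeated inside a few documents from one that is spread across many.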

NLP analysis (meta-data extraction)
[Screenshot labels: Sentence NLP in index; Current NLP; Metadata]

Clustering analysis
[Screenshot labels: Assign weight; Creating topics tree; By Summary; By Terms; By Labels]

Clustering visualization
[Screenshot labels: Clustering histogram; Statistics attributes]

Trends
[Screenshot labels: Weight of trends; Segments and related topics; Trends Weight Dynamics]

Visualization of Related and Nested Topics
[Figure label: zoom in]

Visualization. Related Topics
[Figure labels: Main Topic; Related Topic; China; zoom in]

Visualization. Nested Topics
[Figure labels: Nested Topics; Main Topic; China; zoom in]

Visualization of topics for “Japan” keyword
[Figure labels: United States; election; Japan]

Hierarchies Topic Tree. Graph
[Figure labels: JAPAN; 42 topics for “Japan” keyword; 20 topics for “Japan” keyword]

Grouping similar Topics (all topics)
[Figure labels: 790 topics; 526 topics]
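The slide shows two topic counts, 790 and 526, presumably before and after grouping. The slides do not say how similar topics are grouped, so the following is a minimal sketch of one common approach: greedy merging of topics whose keyword sets overlap strongly, measured by Jaccard similarity. The `group_topics` function, the 0.5 threshold, and the sample topics are assumptions for illustration, not DMT's actual algorithm.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of keywords two topics have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_topics(topics: List[Set[str]], threshold: float = 0.5) -> List[Set[str]]:
    """Greedily merge each topic into the first group it is similar enough to."""
    groups: List[Set[str]] = []
    for topic in topics:
        for group in groups:
            if jaccard(topic, group) >= threshold:
                group |= topic  # in-place merge: absorb the topic's keywords
                break
        else:
            groups.append(set(topic))  # no similar group found, start a new one
    return groups


# Made-up sample topics for illustration only.
topics = [
    {"japan", "election", "abe"},
    {"japan", "election", "vote"},
    {"china", "economy"},
    {"china", "economy", "growth"},
]
grouped = group_topics(topics)
print(len(topics), "topics grouped into", len(grouped))
```

A greedy pass like this depends on input order and on the threshold; a hierarchical clustering method would be a more robust alternative if order sensitivity matters.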

Document search
[Screenshot labels: Search by metadata; Keep track of the analyzed articles]

Glossary
Term – a sequence of characters used to train the NLP application
(represents a named entity).
Trend – a unique keyword chain with a weight.
Topic – an abstract ‘cluster’ of relations between particular keywords
that occur in a Trend.
Segment – a group of similar Trends, intersected by search results.
Segmentation – relations between topics from different segments,
based on subtopic dynamics. Represents ‘new knowledge’.
Thread – a sequence of keywords extracted from a given sentence.
Label – an attribute of a Term that defines its properties.
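To make the glossary concrete, the entities could be modeled roughly as follows. This is a hedged sketch only: the class layout, field names, and the `total_weight` helper are assumptions made for illustration and do not describe DMT's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    text: str  # character sequence, e.g. a named entity
    labels: List[str] = field(default_factory=list)  # Labels: attributes of the Term


@dataclass
class Thread:
    keywords: List[str]  # keywords extracted from one sentence


@dataclass
class Trend:
    chain: List[str]  # unique keyword chain
    weight: float


@dataclass
class Segment:
    trends: List[Trend]  # group of similar Trends

    def total_weight(self) -> float:
        """Hypothetical helper: sum of the weights of the grouped Trends."""
        return sum(t.weight for t in self.trends)


# Made-up sample data for illustration only.
segment = Segment(
    trends=[
        Trend(chain=["japan", "election"], weight=3.0),
        Trend(chain=["japan", "vote"], weight=1.5),
    ]
)
print(segment.total_weight())
```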

Technologies
PHP, CakePHP framework
Python, frameworks: NLTK, Django, Django REST Framework
Java, Jersey framework, Stanford CoreNLP
Elasticsearch, MySQL

Team
We are a team with more than 3 years of experience in data mining
research and projects.
We are interested in making sense of big data and experimenting with
machine learning techniques. We build semantic networks and NLP
projects based on open-source projects as well as our own.
Oleksandr Shamrai - PHP software engineer, responsible for core
algorithm implementation and performance, and for team development tools
and rules
Pavel Yakovlev - Business analyst and QA, with a passion for data mining:
cluster analysis and recommendation solutions
Max Leonov - Python software engineer, responsible for modeling,
development, testing, and deployment of NLP (Natural Language
Processing) applications
