“Segmentation” as the Workhorse of Business Analytics
Overview
We research the hierarchy of topics extracted from documents (news,
publications, discussions, etc.).
Our system is targeted at data researchers.
It provides:
• Trend tracking
• Detection of similar and related topics
• Topic segmentation, which aims to solve the information overload
problem (http://mlvl.github.io/Hierarchie/)
The topic model we use is not a collection of tags but a combination of
NLP and statistical analysis.
Possible applications
• creating concept infographics (http://findtheconversation.com/concept-map/)
• estimating concept influence (http://brightpointinc.com/political_influence/)
• detecting semantic relations (http://bl.ocks.org/mbostock/1153292)
• visualizing nested segments, i.e. the concept hierarchy (http://bl.ocks.org/mbostock/7607535)
Research plans
Test prototype
We developed a prototype called Data Mining Tool (DMT) to test
the analytics model.
As test data, we use tech and political news (about 2k + 1k RSS
feeds delivering 10k news items daily).
DMT workflow
1. Import Documents to Index
2. Extract metadata for each Document (NLP: keywords, labels, terms, etc.)
3. Extract Chains using Cluster Analysis
4. Assign Weights to Topics
5. Build Trends by ranking on Current Weight and Weight Dynamics
6. Build Segments (related topics, nested topics)
7. Visualize Data (Trends Statistics, Segments Hierarchy)
8. Explore Data (Flexible Search UI: Trends, Documents, Segments, Keywords, etc.)
9. Use API to communicate with the system
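Steps 4 and 5 can be illustrated with a minimal Python sketch. The slides do not give the actual formulas, so the sketch assumes that a topic's weight on a given day is the number of documents mentioning its keyword chain, and that weight dynamics is the change from the previous day. The names `Trend`, `build_trends`, `rank_trends` and the `alpha` mixing parameter are hypothetical and are not part of DMT.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Trend:
    keywords: Tuple[str, ...]   # keyword chain identifying the trend
    current_weight: float       # weight on the most recent day
    dynamics: float             # change versus the previous day


def build_trends(daily_counts: Dict[Tuple[str, ...], List[int]]) -> List[Trend]:
    """Turn per-day document counts into trends (assumption: weight = count)."""
    trends = []
    for keywords, counts in daily_counts.items():
        current = float(counts[-1])
        previous = float(counts[-2]) if len(counts) > 1 else 0.0
        trends.append(Trend(keywords, current, current - previous))
    return trends


def rank_trends(trends: List[Trend], alpha: float = 0.5) -> List[Trend]:
    """Sort by a blend of current weight and dynamics, highest score first."""
    def score(trend: Trend) -> float:
        return alpha * trend.current_weight + (1 - alpha) * trend.dynamics
    return sorted(trends, key=score, reverse=True)


if __name__ == "__main__":
    counts = {
        ("china", "economy"): [40, 42],
        ("japan", "election"): [5, 30],
        ("apple", "iphone"): [50, 20],
    }
    for trend in rank_trends(build_trends(counts)):
        print(trend.keywords, trend.current_weight, trend.dynamics)
```

With `alpha = 0.5`, a small but fast-growing trend can outrank a larger one that is flat or declining; raising `alpha` shifts the ranking toward current weight.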
DMT workflow chart
Test documents
About 370k news items were imported in September-November 2014.
Document & Terms distribution statistics
NLP analysis (metadata extraction)
[Figure labels: Sentence NLP in index Current NLP; Metadata]
Clustering analysis
Assign weight
By Summary
Creating topics tree
By Terms
By Labels
Clustering visualization
[Figure: clustering histogram; statistics attributes]
Trends
[Figure labels: Weight of trends; Segments and related topics; Trends Weight Dynamics]
Segments
Visualization of Related and Nested Topics
Visualization. Related Topics
[Figure labels: Main Topic; Related Topic; China]
Visualization. Nested Topics
[Figure labels: Nested Topics; Main Topic; China]
Visualization of topics for the “Japan” keyword
[Figure labels: United States; election; Japan]
Hierarchies Topic Tree. Graph
[Figure labels: 42 Topics for “Japan” keyword; JAPAN; 20 Topics for “Japan” keyword]
Grouping similar Topics (all topics)
[Figure labels: 790 topics; 526 topics]
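One simple way to group similar topics is to merge topics whose keyword sets overlap strongly. The Python sketch below uses Jaccard similarity with a greedy single pass. This is an assumed stand-in for illustration, not necessarily the method used in DMT, and the function names and the `threshold` value are hypothetical.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similar_topics(topics: List[Set[str]], threshold: float = 0.5) -> List[Set[str]]:
    """Greedily merge each topic into the first group it is similar to."""
    groups: List[Set[str]] = []
    for topic in topics:
        for group in groups:
            if jaccard(topic, group) >= threshold:
                group |= topic  # merge keywords into the existing group in place
                break
        else:
            groups.append(set(topic))  # no similar group found: start a new one
    return groups


if __name__ == "__main__":
    topics = [
        {"japan", "election"},
        {"japan", "election", "abe"},
        {"china", "economy"},
    ]
    for group in group_similar_topics(topics):
        print(sorted(group))
```

The result of a greedy pass depends on input order; a production system would more likely use a proper clustering algorithm, but the sketch shows how grouping reduces the total number of topics.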
Document search
Search by metadata
Keep track of the analyzed articles
Glossary
Term - a sequence of characters used for training the NLP application
(represents a Named Entity).
Trend - a unique keyword chain with a weight.
Topic - an abstract 'cluster' of relations between particular keywords
that occur in a Trend.
Segment - a group of similar Trends, intersected by search results.
Segmentation - relations between topics from different segments,
based on subtopic dynamics. Represents 'new knowledge'.
Thread - a sequence of keywords extracted from a given sentence.
Label - an attribute of a Term that defines its properties.
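To make the Thread definition concrete, here is a minimal Python sketch that extracts an ordered keyword sequence from one sentence. DMT relies on NLP libraries for this step; the sketch replaces them with a tiny hard-coded stop-word list, so `STOP_WORDS` and `extract_thread` are illustrative assumptions only.

```python
import re
from typing import List

# Tiny illustrative stop-word list; a real system would use an NLP library.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for", "with"}


def extract_thread(sentence: str) -> List[str]:
    """Return the sentence's keywords in order, skipping stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(extract_thread("The election in Japan is a test for the economy"))
```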
Technologies
PHP, CakePHP Framework
Python, Frameworks: NLTK, Django, Django REST Framework
Java, Jersey Framework, Stanford CoreNLP
Elasticsearch, MySQL DB
Team
We are a team with more than 3 years of experience in Data Mining
research and projects.
We are interested in making sense of big data and experimenting with
Machine Learning techniques. We build Semantic Networks and NLP
projects based on open-source projects as well as our own.
Oleksandr Shamrai - PHP software engineer, responsible for core
algorithm implementation and performance, team development tools
and rules
Pavel Yakovlev - Business analyst and QA, has a passion for data mining:
cluster analysis and recommendation solutions
Max Leonov - Python software engineer, responsible for modeling,
development, testing and deployment of NLP (Natural Language
Processing) applications
