Competitive analysis, product differentiation, nearest neighbor, topological data analysis, summary visualization, data science use cases, data access, data preparation, data exploration.
3. Goal of the Analysis
The goal was to identify the closest competitors, functionally speaking,
in the data science software industry.
Our hypothesis was that text analysis and novel nearest neighbor
algorithms could distill text-based reports into a useful summary
visualization of the products in the space.
4. Step 1: Text Analysis and Data Transformation
Two related reports covering the range of product capabilities across
four use cases were used for source data.
Source report content was converted to numeric representations of the
text. A matrix was populated with quantitative values ranging from 1 to
5.
5. Scope Adjustment
The source reports did not provide a full breakdown of the sub dimensions
within the four use cases. As a result, many fields in the matrix had
missing values.
The estimated completion time for a full analysis of all four use cases
exceeded what the cost-benefit metrics could justify. The goal was narrowed
to focus on the first use case, which had four sub dimensions: access,
preparation, exploration, and automation.
6. Step 2: Imputing Values and Adjusting Imputed Values
A value of [3] was set as the estimate for all missing values.
Scores for each sub dimension were averaged into a total for the use
case. Result totals in the cqtinf model score table came close to the
source report result totals for the use case.
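The imputation and averaging described above can be sketched in a few lines. This is a minimal illustration, not the cqtinf implementation: the product names and scores are hypothetical, and only the fill value of 3 and the four sub dimensions come from the text.

```python
from statistics import mean

SUB_DIMENSIONS = ["access", "preparation", "exploration", "automation"]
IMPUTED_VALUE = 3  # the estimate the text assigns to every missing score

# Hypothetical scores on the 1-5 scale; None marks a missing value.
raw_scores = {
    "Product A": {"access": 5, "preparation": 4, "exploration": None, "automation": 4},
    "Product B": {"access": None, "preparation": 3, "exploration": 2, "automation": None},
}


def impute(scores, fill=IMPUTED_VALUE):
    """Return a copy of one product's scores with missing values set to `fill`."""
    return {
        dim: fill if scores.get(dim) is None else scores[dim]
        for dim in SUB_DIMENSIONS
    }


def use_case_total(scores):
    """Average the four sub dimension scores into one use case score."""
    return mean(scores[dim] for dim in SUB_DIMENSIONS)


completed = {name: impute(s) for name, s in raw_scores.items()}
totals = {name: use_case_total(s) for name, s in completed.items()}

for name, total in totals.items():
    print(name, total)
```

Comparing `totals` against the totals published in the source reports is what indicates whether the imputed values need adjusting.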
Minor adjustments to sub dimension values brought 12 of the 16
product scores into very close alignment with the source scores,
without raising concerns about overfitting.
Alteryx [AYX] was chosen as the fixed variable for further analysis.
7. Step 3: Analysis Model Outputs
The cqtinf model provided two outputs:
1] a short list of AYX closest competitors, based on the number of times
a competitor is within range, where frequency represents closeness:
   Dataiku    Datawatch    TIBCO    SAS VDMML    KNIME
   4          3.3          3.3      3            3
2] an input for a complex topological / nearest neighbor data analysis,
based on actual distance measures of competitors.
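One plausible reading of the first output, the frequency-based short list, is sketched here. The text does not say how cqtinf defines "within range," so the tolerance, the per-sub-dimension comparison, and all scores are assumptions made for illustration.

```python
TOLERANCE = 0.5  # hypothetical threshold for "within range"

# Hypothetical 1-5 scores; only the AYX anchor and the sub dimensions come from the text.
scores = {
    "AYX": {"access": 4, "preparation": 5, "exploration": 4, "automation": 3},
    "Product A": {"access": 4, "preparation": 4.5, "exploration": 4, "automation": 3},
    "Product B": {"access": 2, "preparation": 5, "exploration": 3, "automation": 1},
}


def within_range_count(anchor, other, tolerance=TOLERANCE):
    """Count the sub dimensions where `other` scores within `tolerance` of `anchor`."""
    return sum(1 for dim in anchor if abs(anchor[dim] - other[dim]) <= tolerance)


def shortlist(all_scores, anchor_name="AYX"):
    """Rank competitors by how often they fall within range of the anchor product."""
    anchor = all_scores[anchor_name]
    counts = {
        name: within_range_count(anchor, s)
        for name, s in all_scores.items()
        if name != anchor_name
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ranking = shortlist(scores)
print(ranking)
```

A higher count means the competitor matches the anchor on more sub dimensions, which is the sense in which frequency stands in for closeness.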
8. Step 4: Nearest Neighbor Conversion
To perform this nearest neighbor analysis, the matrix score values had
to be transformed into [x, y] grid coordinates which could be plotted on
a graph. cqtinf heuristics provided the conversion.
Once the modeling was completed, the full set of DSML software
products could be positioned on a grid, for summary visualization.
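The cqtinf conversion heuristics are not described in the source, so the sketch below substitutes a simple stand-in: each sub dimension is pinned to a corner of a unit square (a hypothetical layout), and a product is placed at the score-weighted centroid of those corners. It only shows how four scores can become one [x, y] point with measurable distances.

```python
import math

# Hypothetical layout: one corner of the unit square per sub dimension.
CORNERS = {
    "access": (0.0, 1.0),
    "preparation": (1.0, 1.0),
    "exploration": (1.0, 0.0),
    "automation": (0.0, 0.0),
}


def to_grid(scores):
    """Map four sub dimension scores to an (x, y) point by weighted centroid."""
    total = sum(scores[dim] for dim in CORNERS)
    x = sum(scores[dim] * CORNERS[dim][0] for dim in CORNERS) / total
    y = sum(scores[dim] * CORNERS[dim][1] for dim in CORNERS) / total
    return (x, y)


def distance(p, q):
    """Euclidean distance between two grid points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


balanced = to_grid({"access": 3, "preparation": 3, "exploration": 3, "automation": 3})
access_heavy = to_grid({"access": 5, "preparation": 1, "exploration": 1, "automation": 1})
print(balanced, access_heavy, distance(balanced, access_heavy))
```

Under this stand-in, a product with even scores lands in the center, while a product dominated by one sub dimension is pulled toward that corner.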
9. Step 5: Selection of Graphic Style
Four dimensions had to be shown. A layout was designed from scratch to
support a simple representation in which product nodes can straddle two
dimensions without any crisscrossing of relationships.
10. Comparing Two Model Outputs
The resulting TDA map varied slightly from the simpler frequency table.
Dataiku, #1 in the frequency table, fell just outside the map inclusion
criteria. Expanding the cqtinf model’s ‘top 5’ constraint from 5 to 6
would result in Dataiku being on the map.
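The effect of the 'top 5' constraint can be illustrated with a generic k-nearest-neighbor selection. This is a sketch under assumptions: the coordinates and product labels are hypothetical, and plain Euclidean distance stands in for whatever measure the cqtinf model uses.

```python
import math


def nearest(points, anchor_name, k=5):
    """Return the names of the k points closest to the anchor, nearest first."""
    ax, ay = points[anchor_name]
    others = [
        (name, math.hypot(x - ax, y - ay))
        for name, (x, y) in points.items()
        if name != anchor_name
    ]
    others.sort(key=lambda item: item[1])
    return [name for name, _ in others[:k]]


# Hypothetical grid positions; P6 is the sixth-closest product to the anchor.
points = {
    "AYX": (0.0, 0.0),
    "P1": (1.0, 0.0),
    "P2": (0.0, 2.0),
    "P3": (3.0, 0.0),
    "P4": (0.0, 4.0),
    "P5": (5.0, 0.0),
    "P6": (0.0, 6.0),
    "P7": (7.0, 0.0),
}

top5 = nearest(points, "AYX", k=5)
top6 = nearest(points, "AYX", k=6)
print(top5)
print(top6)
```

A product that sits just beyond the cutoff, like `P6` here, is excluded at `k=5` and included at `k=6`, which mirrors the Dataiku case described above.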
According to the map, RapidMiner appears to be within shortlist
distance of AYX, which is inconsistent with other arguments.
The cqtinf node positioning heuristic delivers maps quickly, and in
theory these visualizations are explanatory. Spending more time on
additional calculations might repair this ‘problem.’ However, because the
model is transparent, analysts can explain strengths or weaknesses in the
underlying data and in the positioning algorithm. Outputs of the model
that are not perfect can still be useful.