A Blended Approach to Big Data Analytics
Richard Heimann & Max Watson
Data Tactics Corporation
A Blended Approach to Data Science and Big Data Analytics:
The Blended Approach to Big Data Analytics and Data Science acknowledges that data science and big data analytics are more than algorithm development; they also require deployment. Deployment is traditionally
thought of in terms of the machine rather than the user. The utility of an analytic, however, lies in its ultimate use, which depends on its deployment and presentation to users. Data Science requires two elements to deploy:
Hard Elements = Objective pattern discovery and recognition of precise and accurate patterns minus pattern paradoxes.
Soft Elements = Sensitive to user experiences and workflow, allowing users to subjectively evaluate patterns for their uniqueness, unexpectedness, actionability and novelty.
Analytics promises wisdom, but has at times delivered trivial pattern discovery, pattern paradoxes, and/or an overwhelming number of patterns, ultimately confusing users and leaving its potential unfulfilled. A Blended
Approach delivers a nontrivial process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from big data systems that can be used by domain experts to support crucial intelligence
• Nontrivial: Complex computations are required to expose novel insights into big data. This includes analytics pluralism (many algorithms).
• Novel: The discovered patterns should be new to the organization and understood to be meaningful by users.
• Useful: Analytic frameworks aid the development of valid algorithms. Users of such analytics should be able to act upon these patterns to make better decisions.
• Comprehensible: New patterns should be understandable and thereby improve understanding.
• Valid: Algorithms need to be designed and developed by competent data scientists to ensure reliability and validity.
The key components of a Blended Approach to Data Science and Big Data Analytics:
1) Objective and Subjective Pattern Discovery -- A Blended Approach to Analytics
2) Interactive Analytics + Enhanced Visualization = Intelligent Data Analysis; Shiny
Users are given mutable analytical elements and can tweak parameters, refine the
method, visualize the effect, and interpret the subsequent changes.
Data Science can often generate hundreds, maybe thousands, of patterns. The task of
pattern recognition then becomes one of separating the most useful patterns from
those that are trivial. One of the most efficient ways to do this is to let users
engage more with the algorithms.
The challenge is to address the subjective and the objective jointly in a hybrid model, and to mediate
the dichotomy using techniques that reduce the knowledge acquisition bottleneck.
The evaluation of hybrid models has to yield good results. Pattern paradox detection lies
midway between the subjective and objective measures. Goal-directed analysis with user
enrichment, versus data-directed analysis, is intimately related to the use of subjective and
objective measures.
Shiny is a web-based presentation layer for the statistical programming language R and enables
the sought-after interaction between users and a given analytic. For example:
Figure 1: Discontinuities is a change detection algorithm used to detect breaks, e.g. social media
Figure 2: a unique implementation of topic models coined topic graphs - treating network
analysis and topic discovery jointly.
Figure 3: a density-based cluster analysis is used to classify data, e.g. log files & netflow data.
Figure 4: a supervised-by-supervised outlier detection algorithm, coined LUBaP.
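The interaction described above, where a user tweaks a parameter and inspects the effect, can be illustrated without Shiny. The following is a minimal Python sketch, not the authors' implementation and not Shiny itself (Shiny is an R framework). The function names `flag_outliers` and `sweep`, the z-score rule, and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold):
    """Return the values whose absolute z-score exceeds `threshold`.

    The z-score rule is an assumed stand-in for whatever analytic
    the user is actually tuning.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def sweep(values, thresholds):
    """Map each candidate threshold to the outliers it would flag.

    This mimics a user dragging a slider and watching the result change.
    """
    return {t: flag_outliers(values, t) for t in thresholds}


if __name__ == "__main__":
    # Hypothetical sample data with one obvious spike.
    data = [10, 11, 9, 10, 12, 10, 30, 11, 9, 10]
    for t, flagged in sweep(data, [1.0, 2.0, 3.0]).items():
        print(f"threshold={t}: {len(flagged)} flagged -> {flagged}")
```

Seeing how the flagged set changes as the threshold moves is the kind of subjective evaluation the soft elements call for: the user, not the algorithm, decides which setting yields patterns worth acting on.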
How does the Blended Approach fit into the Data Science ecosystem?
No Free Lunch (NFL) for Data Science suggests that an analytic must be designed for a specific type of problem, because no analytic performs better than any other when averaged over all possible
problem sets. There is no such thing as a general-purpose algorithm across all problems. The lesson is that the elegance of analytics lies, at times, in its inelegance. Overlapping
solutions, or a pluralist approach, may therefore be optimally fitted within the blended approach.
The blended approach is a mixture of objective and subjective pattern discovery, facilitated by interactive analytics as well as overlapping solutions. An example would be the
overlapping of two solutions to analyze Data D with Analytics A and Analytics B, where A provides some insight into smooth patterns, like a summary analytic, and B offers some
insight into rough patterns in the data, such as outlier detection. These two methods offer unique insights and may at times validate each other. Users would understand both structural
patterns and structural breaks in D. NFL suggests that this may be the best environment for data science.
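The overlap of Analytics A and Analytics B on Data D described above can be sketched as follows. This is a minimal illustration, not the authors' method: a centered moving median stands in for the "smooth" summary analytic, a residual cutoff stands in for the "rough" outlier analytic, and the names `smooth`, `rough`, `blended`, the window size, and the cutoff are all assumptions.

```python
from statistics import median


def smooth(series, window=3):
    """Analytics A (assumed): centered moving median summarizing structure."""
    half = window // 2
    fitted = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        fitted.append(median(series[lo:hi]))
    return fitted


def rough(series, fitted, cutoff):
    """Analytics B (assumed): indices where the data departs from the fit."""
    return [
        i
        for i, (x, s) in enumerate(zip(series, fitted))
        if abs(x - s) > cutoff
    ]


def blended(series, window=3, cutoff=5.0):
    """Run both analytics on the same data and return both views."""
    fitted = smooth(series, window)
    return {"smooth": fitted, "rough": rough(series, fitted, cutoff)}


if __name__ == "__main__":
    # Hypothetical Data D: a steady trend with one structural break.
    d = [1, 2, 3, 4, 50, 6, 7, 8, 9]
    result = blended(d)
    print("smooth:", result["smooth"])
    print("rough :", result["rough"])
```

In this sketch the rough view is computed from the residuals of the smooth view, so the two results are complementary by construction. Two fully independent analytics would be needed for them to validate each other in the stronger sense the text describes.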
The Blended Approach unlocks the power of data science. Data Science still approaches a problem by assuming there is one best way to solve it, ignoring alternate
solutions and, most egregiously, ignoring the user. The lesson is to abandon the question "What is the cleverest way to solve the problem?" in favor of "Are there multiple, overlapping
ways to solve this problem?"
Data scientists who have massive amounts of data without massive amounts of clue are going to be displaced by data scientists who have less data but more clue.