Data Science Symposium @ NIST
Poster session with Max Watson at the inaugural Data Science Symposium @ NIST



A Blended Approach to Big Data Analytics
Richard Heimann & Max Watson
Data Tactics Corporation

A Blended Approach to Data Science and Big Data Analytics:

The Blended Approach to Big Data Analytics and Data Science acknowledges that data science and big data analytics are more than algorithm development; they also require deployment. Deployment is traditionally thought of in terms of the machine rather than the user. The utility of an analytic, however, lies in its ultimate use, which is mediated by its deployment and presentation to users.

Data Science requires two elements to deploy successful analytics:

• Hard Elements: objective pattern discovery, that is, recognition of precise and accurate patterns free of pattern paradoxes.
• Soft Elements: sensitivity to user experience and workflow, allowing users to subjectively evaluate patterns for their uniqueness, unexpectedness, actionability, and novelty.

Analytics promises wisdom but has at times delivered trivial pattern discovery, pattern paradoxes, or an overwhelming number of patterns, ultimately confusing users and leaving its potential unfulfilled. A Blended Approach delivers a nontrivial process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from big data systems that can be used by domain experts to support crucial intelligence decisions.

• Nontrivial: Complex computations are required to expose novel insights into big data. This includes analytics pluralism (many algorithms).
• Novel: The discovered patterns should be new to the organization and understood to be meaningful by users.
• Useful: Analytic frameworks aid the development of valid algorithms. Analysts using such analytics should be able to act upon these patterns to make better decisions.
• Comprehensible: New patterns should be understandable and thereby improve understanding.
• Valid: Algorithms need to be designed and developed by competent data scientists to ensure reliability and validity.
The key components of a Blended Approach to Data Science and Big Data Analytics:

1) Objective and Subjective Pattern Discovery: A Blended Approach to Analytics
2) Interactive Analytics + Enhanced Visualization = Intelligent Data Analysis; Shiny

Users are given mutable analytical elements: they can tweak parameters, refine the method, visualize the effect, and interpret the resulting changes.

Data Science can often generate hundreds, maybe thousands, of patterns. The task of pattern recognition then becomes one of separating the most useful patterns from the trivial ones. One of the most efficient ways to do this is to let users engage more with the algorithms. The challenge is to address the subjective and the objective jointly in a hybrid model, and to mediate the dichotomy with techniques that reduce the knowledge acquisition bottleneck. The evaluation of hybrid models has to yield good results. Pattern paradox detection lies midway between the subjective and objective measures. Goal-directed analysis with user enrichment, as opposed to data-directed analysis, is intimately related to the use of subjective and objective measures.

Shiny is a web-based presentation layer for the statistical programming language R and enables the sought-after interaction between users and a given analytic. For example:

• Figure 1: Discontinuities, a change detection algorithm used to detect breaks, e.g. in social media.
• Figure 2: a unique implementation of topic models, coined topic graphs, that treats network analysis and topic discovery jointly.
• Figure 3: a density-based cluster analysis used to classify data, e.g. log files and netflow data.
• Figure 4: a supervised-by-supervised outlier detection algorithm, coined LUBaP.

How does the Blended Approach fit into the Data Science ecosystem?
No Free Lunch (NFL) for Data Science suggests that an analytic must be designed for a specific type of problem, and that any one analytic performs no better than any other when averaged over all possible problem sets. There is no general-purpose algorithm that works best across all problems. The lesson is that the elegance of analytics lies, at times, in its inelegance. Overlapping solutions, or a pluralist approach, may therefore fit well within the blended approach.

The blended approach is a mixture of objective and subjective pattern discovery, facilitated by interactive analytics as well as overlapping solutions. An example would be overlapping two solutions to analyze data D with Analytic A and Analytic B, where A provides insight into smooth patterns, as a summary analytic does, and B offers insight into rough patterns in the data, as outlier detection does. These two methods offer unique insights and may at times validate each other. Users would understand both the structural patterns and the structural breaks in D. NFL shows us that this may be the best environment for data science.

The Blended Approach unlocks the power of data science. Data science still tends to approach a problem by assuming there is one best way to solve it, ignoring alternate solutions and, most egregiously, ignoring the user. The lesson is to abandon the question "What is the cleverest way to solve the problem?" in favor of "Are there multiple, overlapping ways to solve this problem?"

Data scientists who have massive amounts of data without massive amounts of clue are going to be displaced by data scientists who have less data but more clue.
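
The overlapping-solutions example above (Analytic A and Analytic B applied to the same data D) can be illustrated with a minimal sketch. It is written in Python purely for illustration; the poster's own tooling is R and Shiny, and none of this is the authors' implementation. The function names, the sample data, the centered moving average standing in for "summary analytics", and the median/MAD modified z-score rule standing in for "outlier detection" are all assumptions. The `window` and `threshold` arguments play the role of the parameters a user might tweak interactively.

```python
from statistics import mean, median


def analytic_a_smooth(data, window=3):
    """Hypothetical Analytic A: centered moving average (smooth structure)."""
    half = window // 2
    smoothed = []
    for i in range(len(data)):
        lo = max(0, i - half)
        hi = min(len(data), i + half + 1)
        # Windows are truncated at the edges of the series.
        smoothed.append(mean(data[lo:hi]))
    return smoothed


def analytic_b_outliers(data, threshold=3.5):
    """Hypothetical Analytic B: indices whose modified z-score
    (based on the median and the median absolute deviation) exceeds
    the user-tunable threshold (rough structure)."""
    med = median(data)
    mad = median([abs(x - med) for x in data])
    if mad == 0:
        # No spread around the median: nothing can be scored.
        return []
    return [
        i for i, x in enumerate(data)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


def blended_view(data, window=3, threshold=3.5):
    """Run both analytics on the same data and check whether they agree:
    does the point that deviates most from the smooth curve also appear
    among the flagged outliers?"""
    smoothed = analytic_a_smooth(data, window)
    outliers = analytic_b_outliers(data, threshold)
    largest = max(range(len(data)), key=lambda i: abs(data[i] - smoothed[i]))
    return {
        "smooth": smoothed,
        "outliers": outliers,
        "largest_residual_index": largest,
        "validated": largest in outliers,
    }


# Made-up data D with one obvious break at index 5.
D = [10, 11, 10, 12, 11, 50, 11, 10, 12, 11]

if __name__ == "__main__":
    view = blended_view(D)
    print("Outlier indices:", view["outliers"])
    print("Largest residual at index:", view["largest_residual_index"])
    print("Analytics agree:", view["validated"])
    # Lowering the threshold, as a user might in an interactive front end,
    # flags more points and shifts the judgment back to the user.
    print("Looser threshold:", analytic_b_outliers(D, threshold=0.5))
```

Keeping the two analytics as separate functions with their own parameters mirrors the pluralist idea in the text: each can be tuned or swapped independently, and the comparison step only reports whether they agree, leaving the interpretation to the user.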