This thesis reviewed various case studies to determine the types of modelling, the choice of algorithm, and the analytical approaches used, and to identify the complexities arising in each case. Based on these reviews, procedures are proposed to improve efficiency and manage the various types of complexity from an agile methodological perspective. The focus was mostly on Customer Segmentation and Clustering, with the purpose of bridging Big Data and Business Intelligence through Analytics.
2. Contents
▪ Introduction to Big Data and Analytics
▪ What is Agile Analytics and how it is related to the scope of this thesis
▪ Analytics and its case studies
▪ Lessons learned from case studies on technical complexity management
▪ Proposed Methodology
▪ Data Acquisition
▪ Selection of Tools for Analytics
▪ Selection of Algorithms
▪ Choice of modelling approach
▪ Outcomes and Benefits from proposed methodology
▪ Conclusions
3. Introduction to Big Data and Analytics
▪ What is Big Data?
▪ What is Analytics?
▪ Why do we need Analytics?
Fig 1. Big Data Analytics Pipeline Model
4. What is Agile Analytics and how is it related to the scope of this thesis?
• Agile Analytics is a style of building data warehouses, data marts, business intelligence applications and analytics applications that focuses on early and continuous delivery of business value.
• The agile approach incorporates feedback in each iteration, making faster delivery possible with every iteration.
• It is not a rigid methodology.
• The scope of this thesis is to understand the various technical complexities involved in different Big Data analytics efforts and to implement an agile approach to address these issues. Technical complexity is not addressed in current agile methodologies, and I will try to focus on bridging this gap.
Fig 2: Various Influences of Agile Analytics
5. Goals and Methodology
▪ Agile analytics focuses on soft/managerial variables.
▪ My goal is to establish technical guidelines.
▪ This thesis reports on an explanatory study surveying public case studies to identify the various technical complexities involved and how they can be addressed by implementing Agile Analytics.
▪ 12 multi-case studies have been reviewed.
▪ 2 detailed case studies on Customer Segmentation and Clustering have also been reviewed.
▪ After intensive review of all the cases, I implemented my proposed methodologies.
Sources of the multi-case studies (12 in total):
▪ Analytical Data Solutions — 2 cases
▪ IBM — 5 cases
▪ Invenco Oy — 1 case
▪ PricewaterhouseCoopers — 2 cases
▪ Deloitte — 1 case
▪ ThoughtWorks — 1 case
Detailed case studies: an online UK retailer and 4 major Australian banks.
6. Methodology Steps
[Diagram: Case Studies → Data Acquisition (acquisition variables, data sampling period) → Selection of Tools → Selection of Algorithm → Modelling Techniques]
From the various case studies, the following methodological steps have been performed:
1) Data Acquisition: a) determining the variables that can be used to sample the data; b) determining the time period of data to sample.
2) Selection of Tools: determining the type of tool to be used and the pros and cons of the various tools.
3) Selection of Algorithm: determining the algorithm with the most efficient time complexity for the required analysis.
4) Modelling Techniques: identifying whether descriptive, predictive or prescriptive modelling has been used, and how to improve it.
Discussion of Benefits from the methodology: the outcomes and advantages of applying the methodology.
7. Catalogue of case studies: a blueprint
Domain | Company | Modelling | Software
Retail | US-based luxury company, Top Toy, GAP | Descriptive, Predictive | MongoDB, IBM Cognos
Electronics | Global PC manufacturer | Descriptive, Predictive | Proprietary software of Analytical Data Solutions
Food | Colombus Food | Descriptive, Predictive | IBM Cognos TM1
Finance | Corona Direct, Canadian insurance firm | Descriptive, Predictive, slight Prescriptive | IBM SPSS
Marketing | MediaMath | Descriptive, Predictive, Prescriptive | Netezza, MathClarity
Manufacturing | Mueller Inc | Descriptive, Predictive, Prescriptive | IBM Cognos, IBM SPSS
Telecommunications | Telecom company in the US | Descriptive | Hadoop
Information Technology | PricewaterhouseCoopers | Descriptive | Open source tools
Fig 3: Geolocations of the countries in which the case studies were carried out
8. Conclusion from the Multi-Case studies
• Agile is not completely
present in all the cases.
• Descriptive modelling and
predictive modelling is present
in most of the cases.
• Marketing is the only sector
which has the influence of all
the modelling factors.
• We will be focusing into the
marketing domain, interested in
the topic of Customer
Segmentation and Clustering
in the following slides.
Fig 4 : Domains Vs Modelling Chart
9. Key technical complexity issues in case studies
Key points realized from the Customer Segmentation case studies:
1) The choice of algorithm was the K-means clustering method, along with the RFM approach.
2) Data complexities were experienced while pre-processing the datasets.
3) The choice of algorithm also resulted in algorithm-processing complexity.
4) Process complexity lies in how parallelization of computer processing can be used to compute the data.
Fig 5: A flowchart of Customer Segmentation procedure.
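The RFM approach named in point 1 can be sketched quickly. A minimal plain-Python example with a hypothetical transaction log (customer id, date, amount); the customer ids, dates and amounts are illustrative only:

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("C1", date(2015, 1, 10), 120.0),
    ("C1", date(2015, 3, 5), 80.0),
    ("C2", date(2014, 11, 20), 40.0),
    ("C3", date(2015, 3, 28), 300.0),
    ("C3", date(2015, 2, 1), 150.0),
]
today = date(2015, 4, 1)

# Aggregate Recency (days since last purchase), Frequency and Monetary value.
rfm = {}
for cid, d, amt in transactions:
    rec, freq, mon = rfm.get(cid, (None, 0, 0.0))
    days = (today - d).days
    rec = days if rec is None else min(rec, days)
    rfm[cid] = (rec, freq + 1, mon + amt)

print(rfm)  # e.g. C1 -> (27, 2, 200.0)
```

In the case studies these R, F and M values would then be binned into scores (e.g. quintiles) and fed into the clustering step.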
10. Proposed Methodologies
I. Data Acquisition
Proper data acquisition techniques are very important in the analytics domain.
My proposed techniques are:
1) Start with a selection of 6 variables:
3 from the customer's perspective (Postal Code, Gender, Age Group),
3 from the company's perspective (Product, Price, Quantity).
2) For analytic segmentation, we should try to gather a large sample (more than 600). Segments as small as 5% can often promise high profitability once they are weighted according to their spend levels.
Conclusion:
An agile approach of early analytics and implementation can reduce Data Warehouse / Business Intelligence risk.
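The sample-size rule behind point 2 (a segment as small as 5% should still contain roughly 30 respondents, hence 30 / 5% = 600 overall) can be computed directly; a small sketch:

```python
import math

def required_sample(min_segment_share, min_segment_n=30):
    """Overall sample needed so that the smallest expected segment
    still contains at least min_segment_n respondents."""
    return math.ceil(min_segment_n / min_segment_share)

print(required_sample(0.05))  # a 5% segment implies an overall sample of 600
```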
11. II. Selection Tool for Analytics
In the selection-of-tools methodology, we have taken the following queries into consideration. When choosing data analysis software, we need to answer a basic set of questions:
▪ Does it run natively on the computer?
▪ Does it handle large datasets?
▪ Does the software provide all the methods we need?
▪ Is it affordable?
▪ Is it easy to use?
Fig 6.A (left): Lavastorm survey of analytics tools
Fig 6.B (right): Tools used by respondents to the O'Reilly 2015 salary survey
12. A Tabular Summary of Software and their
functionalities
Name | Advantages | Disadvantages | Open source? | Typical users
R | Library support; visualization | Steep learning curve | Yes | Finance; statistics
Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering
SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering
Excel | Easy; visual; flexible | Large datasets | No | Business
SAS | Large datasets | Expensive; outdated programming language | No | Business; government
Stata | Easy statistical analysis | — | No | Science
SPSS | Like Stata, but more expensive and worse | — | — | —
13. III. Choice of Algorithm
The choice of algorithm is extremely crucial to any form of analytics, because an algorithm addresses two issues: a) time complexity and b) space complexity.
Some proposals regarding the selection of algorithm are:
1) Implement pre-clustering methods such as canopy clustering. (Canopy clustering can process huge datasets efficiently; it is an unsupervised pre-clustering algorithm.)
2) Divisive algorithms can minimize cut-based cost in place of K-means, as the latter maximizes only similarity within a cluster and ignores the cost of cuts.
Fig 7: Hierarchical Distribution of Analytical Segmentation.
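Proposal 1 (canopy pre-clustering) can be sketched as follows. This is a minimal illustration of the canopy idea with loose/tight thresholds T1 > T2 on toy 1-D data, not the exact implementation used in the case studies:

```python
import random

def canopy_clusters(points, t1, t2, dist):
    """Canopy pre-clustering with a loose threshold t1 and a tight
    threshold t2 (t1 > t2). Points within t2 of a canopy centre are
    removed from further consideration; points within t1 join the
    canopy (and may belong to several canopies)."""
    assert t1 > t2, "loose threshold must exceed tight threshold"
    candidates = list(points)
    canopies = []
    while candidates:
        # Pick a random remaining point as the next canopy centre.
        centre = candidates.pop(random.randrange(len(candidates)))
        members = [centre] + [p for p in candidates if dist(centre, p) < t1]
        canopies.append((centre, members))
        # Tightly-covered points never become centres themselves.
        candidates = [p for p in candidates if dist(centre, p) >= t2]
    return canopies

# Toy 1-D data with two obvious groups, around 0 and around 10.
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
canopies = canopy_clusters(data, t1=3.0, t2=1.0, dist=lambda a, b: abs(a - b))
print(len(canopies))  # the two groups end up as two canopies
```

The resulting canopies can then seed K-means or hierarchical clustering, so expensive distance computations are only done within a canopy.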
14. K-Clustering Method
1) K-means is the simplest algorithm for solving clustering problems.
2) The number of clusters is initialized first, and the initial cluster centres are selected at random.
3) A new partition is created by assigning each data point to the cluster with the closest centroid, and each centroid is recomputed as the mean of its assigned points.
4) Step 3 is repeated until the centroids no longer move.
5) The downside of K-means is that the clustering result depends heavily on the initially selected centroids.
6) Non-spherical datasets cannot be efficiently clustered using K-means.
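The steps above can be sketched as a plain NumPy implementation. A minimal version, assuming Euclidean distance and random initial centres drawn from the data (no handling of empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: random initial centres from the data, assign
    each point to the closest centroid, recompute centroids, and
    stop when the centroids no longer move."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centres as the mean of the assigned points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# Toy data: two well-separated blobs.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centres = kmeans(X, k=2)
print(labels)
```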
15. A few guidelines: clustering as a sample problem
1) Precede interdependence segmentation with focus groups.
2) To reduce differential bias, an exhaustive 5-point scale can be used.
3) Conjoint or discrete-choice-based utilities can be used as the basis of clustering.
4) We should try to cluster only unique dimensions, via factor analysis.
5) We should focus on the reliability and accuracy of the dependent variable (e.g. using CHAID).
6) Heavy data cleansing can be avoided when using decision trees (e.g. CHAID).
7) Different algorithms have a best-practice parametrization, e.g. for CHAID:
(i) the smallest child node should be set at approximately 5% of total N,
(ii) the parent node at twice the size of the child node,
(iii) alpha set at 0.5, with the Bonferroni adjustment turned off.
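The node-size rule in guideline 7 translates into a one-line computation; a small sketch (the total sample size N = 4000 is illustrative only):

```python
def chaid_node_sizes(total_n, child_share=0.05):
    """Best-practice CHAID node sizes from the guidelines above:
    smallest child node ~5% of total N, parent node twice the child."""
    child = round(total_n * child_share)
    return {"min_child_node": child, "min_parent_node": 2 * child}

print(chaid_node_sizes(4000))  # {'min_child_node': 200, 'min_parent_node': 400}
```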
16. IV. Modelling Approach
DESCRIPTIVE MODELLING:
a) We estimate probability densities using either a parametric or a non-parametric approach.
b) Some parametric modelling techniques that can be followed are:
I) Mixture models
II) Expectation Maximization (time complexity O(Kp²n); space complexity O(Kn); can be slow to converge; local maxima).
c) Non-parametric density estimation does not scale well.
d) For partitioning-based clustering we can use K-means, Partitioning Around Medoids (PAM), or Clustering Large Applications (CLARA).
e) For identifying the number of clusters, we can use the method of Mojena.
f) Some of the defining decisions for clustering with a descriptive approach are:
1) Ability to deal with various attributes.
2) Discovery of clusters with arbitrary shapes.
3) Minimal requirement for domain knowledge to determine input parameters.
Fig 8: Components of Modelling
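As an illustration of the parametric mixture-model route in (b), here is a minimal EM sketch for a two-component 1-D Gaussian mixture on synthetic data; real descriptive modelling would use more components, more dimensions and proper convergence checks:

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture (parametric
    density estimation). Returns means, std devs and weights."""
    # Crude initialization from the data quantiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(8, 1, 500)])
mu, sigma, pi = em_gmm_1d(x)
print(np.sort(mu))  # the recovered means should land near 0 and 8
```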
17. Predictive Modelling
The most commonly used clustering approaches based on predictive modelling are:
1) Behavioural clustering
This approach reveals how people behave while purchasing.
2) Product-based / category-based clustering
This approach discovers which groupings of products people buy from.
3) Brand-based clustering
This clustering tells us which brands people prefer.
Other modelling techniques are used in propensity models for predictions and collaborative filtering for recommendations.
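Category-based clustering (point 2) can be illustrated in a much-simplified form: count purchases per customer per category and group customers by their dominant category. The purchase log and category names below are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: (customer_id, product_category).
purchases = [
    ("C1", "electronics"), ("C1", "electronics"), ("C1", "books"),
    ("C2", "food"), ("C2", "food"),
    ("C3", "books"), ("C3", "books"), ("C3", "electronics"),
]

# Count purchases per customer per category ...
profiles = defaultdict(Counter)
for cid, cat in purchases:
    profiles[cid][cat] += 1

# ... and group customers by their dominant category.
segments = defaultdict(list)
for cid, counts in profiles.items():
    segments[counts.most_common(1)[0][0]].append(cid)

print(dict(segments))
```

A production system would instead cluster the full normalized customer-by-category matrix (e.g. with K-means), rather than assigning by a single dominant category.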
18. Prescriptive Modelling
Prescriptive modelling is an emerging modelling approach; it is still in its infancy.
Maintaining an agile approach towards failures during prescriptive modelling has made it one of the strongest modelling techniques on the rise.
From the case studies, we have seen that very few companies are able to implement prescriptive modelling.
It can also be said that all prescriptive models are descriptive in nature, but not all descriptive models are prescriptive.
Fig 9: Comparison of different modeling techniques.
19. V. Outcomes and Benefits from Methodology
By extracting common themes from the case studies and literature reviews, I have noticed three underlying principles. These principles are present in different magnitudes across the organizations, irrespective of their size, sector and structure. They are as follows:
1. Commonality, i.e., business functions in a common field.
2. Concentration, i.e., grouping of enterprises that can interact.
3. Connectivity, i.e., interconnected organizations with different types of relationships.
20. Conclusion
1) The data integration process is cumbersome, and companies should try to migrate to the latest Big Data technologies such as Hadoop, Cloudera, Spark, Hive and the like.
2) Various technical complexities can be addressed using an agile approach.
3) My guidelines on technical complexity management are agile, as they help the data scientist obtain continuous outputs. Agile analytics focuses on soft managerial variables. Putting this idea into practice in the situations of technical complexity seen in the case study reviews, we find that with every iteration the result can be improved: faster deployment, faster feedback incorporation, faster delivery. The potential to scale the system using feedback, together with its inherently flexible nature, makes it the state-of-the-art choice for analytics.
In one phrase: "Get Lean. Get Agile. Get Started."
Editor's Notes
Volume, Velocity, Variety and Value.
Discovery and communication of meaningful patterns in data.
It depends on the documented, simultaneous application of computer programming, operations research and statistics.
It is a multidimensional discipline.
3) We need analytics for recommendation actions and decision making.
4) The primary goal of big data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modellers and other analytics professionals to analyze data, including data left untapped by conventional business intelligence programs.
Agile analytics starts with learning and testing, so that companies can build their models and strategies based on solid answers to their most crucial business questions
Agile big data analytics focuses not on the data itself but on the insight and action that can ultimately be drawn from nimble business intelligence systems.
Agile Approach focuses on continuously delivering working features. In traditional models, developers take months on a feature, only to find it no longer applicable to the business environment.
Frequent testing should be done.
Using an agile approach, we can bring agility to big data instead of pivoting around a single key insight.
Adapt Agile Methods to individual projects and teams.
Descriptive analytics looks at past performance and comprehends that performance by mining past information to look for the motives behind previous success or failure
Classical descriptive approaches are usually seen in data warehousing systems.
Predictive models select major characteristics differently- i.e. while we may know a lot about our customers, we may not be able to precisely estimate what they will do next
Prescriptive models though, take advantage of the data of descriptive models and the premise of predictive models and try to answer, not only what the client will do next, but why they will do so.
This approach can be considered a good starting point for companies that want to cluster their clients for marketing purposes.
This is because it gives way to the possibility to answer various questions like,
“Which product is preferred by which age group?”, or “how much money is spent by people of a particular location?” etc.
2) Segments as small as 5% can often promise high profitability once they are weighted according to their spend levels. But generally, one would never want to make inferences from a subsample any smaller than around n=30. Thus, one would have had to start with an overall sample of 600, (30/5% = 600) in order to ensure that a segment as small as 5% could be reliably described.
Conclusion : We cannot know how well the data warehouse design matches the available data until we try to load it, nor how well it matches the stakeholder’s actual BI requirements until they use it.
This is why Agile Approach of early analytics and implementation can reduce the Data Warehouse/ Business Intelligence risk
Although all of these software packages support an agile approach, the final choice lies with the user and the purpose they want to focus on.
1) I would propose to implement pre-clustering methods like Canopy Clusters. (Canopy Clusters can process huge datasets efficiently. It is an unsupervised pre-clustering algorithm). This can be used as a pre-processing step for K-Means algorithm or the Hierarchical Clustering Algorithm. This will speed up the clustering operations on large datasets.
Cluster development policies can benefit from understanding the clustering process for the following reasons:
1. Clusters are not static and do not have fixed boundaries.
2. Clusters have different stages of development. (Agile approach can be implemented in developing a cluster as well as analyzing it.)