This thesis reviewed various case studies to determine the types of modelling, the choice of algorithm, and the analytical approaches used, and to identify the complexities arising in each case. Based on these reviews, procedures are proposed to improve efficiency and manage the various types of complexity from an agile methodological perspective. The focus was mostly on Customer Segmentation and Clustering, with the purpose of bridging Big Data and Business Intelligence through Analytics.
2. Contents
▪ Introduction to Big Data and Analytics
▪ What is Agile Analytics and how it is related to the scope of this thesis
▪ Analytics and its case studies
▪ Lessons learned from case studies on technical complexity management
▪ Proposed Methodology
▪ Data Acquisition
▪ Selection of Tools for Analytics
▪ Selection of Algorithms
▪ Choice of modelling approach
▪ Outcomes and Benefits from proposed methodology
▪ Conclusions
3. Introduction to Big Data and Analytics
▪ What is Big Data?
▪ What is Analytics?
▪ Why do we need Analytics?
Fig 1. Big Data Analytics Pipeline Model
4. What is Agile Analytics and how is it related to the scope of this thesis?
• Agile Analytics is a style of building data warehouses, data marts, business intelligence applications and analytics applications that focuses on early and continuous delivery of business value.
• The agile approach incorporates feedback in each iteration, making faster delivery possible with every iteration.
• It is not a rigid methodology.
• The scope of this thesis is to understand the various technical complexities involved in different Big Data analytics efforts and to implement an agile approach to address these issues. Technical complexity is not addressed in current agile methodologies, and I will try to focus on bridging this gap.
Fig 2: Various Influences of Agile Analytics
5. Goals and Methodology
▪ Agile analytics focuses on soft/managerial variables.
▪ My goal is to establish technical guidelines.
▪ This thesis reports on an explanatory study surveying public case studies to identify the various technical complexities involved and how they can be addressed by implementing Agile Analytics.
▪ 12 multi-case studies have been reviewed.
▪ 2 detailed case studies on Customer Segmentation and Clustering have also been reviewed.
▪ After intensive review of all the cases, I implemented my proposed methodologies.
Sources of the multi-case studies (12 in total):
▪ Analytical Data Solutions — 2 cases
▪ IBM — 5 cases
▪ Invenco Oy — 1 case
▪ PricewaterhouseCoopers — 2 cases
▪ Deloitte — 1 case
▪ ThoughtWorks — 1 case
Detailed case studies: an online UK retailer and 4 major Australian banks.
6. Methodology Steps
[Diagram: Case Studies → Data Acquisition (acquisition variables, data sampling period) → Selection of Tools → Selection of Algorithm → Modelling Techniques]
From the various case studies, the following methodological steps have been performed:
1) Data Acquisition: a) determining the variables that can be used to sample the data; b) determining the time period of data to sample.
2) Selection of Tools: determining the type of tool to be used and the pros and cons of the various tools.
3) Selection of Algorithm: determining the algorithm with the most efficient time complexity for the required analysis.
4) Modelling Techniques: identifying whether descriptive, predictive or prescriptive modelling has been used, and how to improve it.
Discussion of Benefits from the methodology: the outcomes and advantages of applying the methodology.
7. Catalogue of case studies: a blueprint
Domain | Company | Modelling | Software
Retail | US-based luxury company, Top Toy, GAP | Descriptive, Predictive | MongoDB, IBM Cognos
Electronics | Global PC manufacturer | Descriptive, Predictive | Proprietary software of Analytical Data Solutions
Food | Colombus Food | Descriptive, Predictive | IBM Cognos TM1
Finance | Corona Direct, Canadian insurance firm | Descriptive, Predictive, slight Prescriptive | IBM SPSS
Marketing | MediaMath | Descriptive, Predictive, Prescriptive | Netezza, MathClarity
Manufacturing | Mueller Inc | Descriptive, Predictive, Prescriptive | IBM Cognos, IBM SPSS
Telecommunications | Telecom company in the US | Descriptive | Hadoop
Information Technology | PricewaterhouseCoopers | Descriptive | Open source tools
Fig 3: Geolocations of the countries in which the case studies were carried out
8. Conclusion from the Multi-Case studies
• Agile is not completely
present in all the cases.
• Descriptive modelling and
predictive modelling is present
in most of the cases.
• Marketing is the only sector
which has the influence of all
the modelling factors.
• We will be focusing into the
marketing domain, interested in
the topic of Customer
Segmentation and Clustering
in the following slides.
Fig 4 : Domains Vs Modelling Chart
9. Key technical complexity issues in case studies
Key points realized from the Customer Segmentation case studies:
1) The choice of algorithm was the K-means clustering method, along with the RFM approach.
2) Data complexities were experienced while pre-processing the datasets.
3) The choice of algorithm also resulted in algorithm-processing complexity.
4) Process complexity lies in how parallelization of computer processing can be used to compute the data.
Fig 5: A flowchart of Customer Segmentation procedure.
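The RFM approach named in point 1 can be sketched quickly. A minimal plain-Python example with a hypothetical transaction log (customer id, date, amount); the customer ids, dates and amounts are illustrative only:

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("C1", date(2015, 1, 10), 120.0),
    ("C1", date(2015, 3, 5), 80.0),
    ("C2", date(2014, 11, 20), 40.0),
    ("C3", date(2015, 3, 28), 300.0),
    ("C3", date(2015, 2, 1), 150.0),
]
today = date(2015, 4, 1)

# Aggregate Recency (days since last purchase), Frequency and Monetary value.
rfm = {}
for cid, d, amt in transactions:
    rec, freq, mon = rfm.get(cid, (None, 0, 0.0))
    days = (today - d).days
    rec = days if rec is None else min(rec, days)
    rfm[cid] = (rec, freq + 1, mon + amt)

print(rfm)  # e.g. C1 -> (27, 2, 200.0)
```

In the case studies these R, F and M values would then be binned into scores (e.g. quintiles) and fed into the clustering step.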
10. Proposed Methodologies
I. Data Acquisition
Proper data acquisition techniques are very important in the analytics domain.
My proposed techniques are:
1) Start with a selection of 6 variables:
3 from the customer's perspective (Postal Code, Gender, Age Group),
3 from the company's perspective (Product, Price, Quantity).
2) For analytic segmentation, we should try to gather a large sample (more than 600). Segments as small as 5% can often promise high profitability once they are weighted according to their spend levels.
Conclusion:
An agile approach of early analytics and implementation can reduce Data Warehouse / Business Intelligence risk.
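The sample-size rule behind point 2 (a segment as small as 5% should still contain roughly 30 respondents, hence 30 / 5% = 600 overall) can be computed directly; a small sketch:

```python
import math

def required_sample(min_segment_share, min_segment_n=30):
    """Overall sample needed so that the smallest expected segment
    still contains at least min_segment_n respondents."""
    return math.ceil(min_segment_n / min_segment_share)

print(required_sample(0.05))  # a 5% segment implies an overall sample of 600
```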
11. II. Selection Tool for Analytics
In the selection-of-tools methodology, we have taken the following queries into consideration. When choosing data analysis software, we need to answer a basic set of questions:
▪ Does it run natively on the computer?
▪ Does it handle large datasets?
▪ Does the software provide all the methods we need?
▪ Is it affordable?
▪ Is it easy to use?
Fig 6.A (left): Lavastorm survey of analytics tools
Fig 6.B (right): Tools used by respondents to the O'Reilly 2015 salary survey
12. A Tabular Summary of Software and their
functionalities
Name | Advantages | Disadvantages | Open source? | Typical users
R | Library support; visualization | Steep learning curve | Yes | Finance; statistics
Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering
SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering
Excel | Easy; visual; flexible | Large datasets | No | Business
SAS | Large datasets | Expensive; outdated programming language | No | Business; government
Stata | Easy statistical analysis | — | No | Science
SPSS | Like Stata, but more expensive and worse | — | — | —
13. III. Choice of Algorithm
The choice of algorithm is extremely crucial to any form of analytics, because an algorithm addresses two issues: a) time complexity and b) space complexity.
Some proposals regarding the selection of algorithm are:
1) Implement pre-clustering methods such as canopy clustering. (Canopy clustering can process huge datasets efficiently; it is an unsupervised pre-clustering algorithm.)
2) Divisive algorithms can minimize cut-based cost in place of K-means, as the latter maximizes only similarity within a cluster and ignores the cost of cuts.
Fig 7: Hierarchical Distribution of Analytical Segmentation.
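Proposal 1 (canopy pre-clustering) can be sketched as follows. This is a minimal illustration of the canopy idea with loose/tight thresholds T1 > T2 on toy 1-D data, not the exact implementation used in the case studies:

```python
import random

def canopy_clusters(points, t1, t2, dist):
    """Canopy pre-clustering with a loose threshold t1 and a tight
    threshold t2 (t1 > t2). Points within t2 of a canopy centre are
    removed from further consideration; points within t1 join the
    canopy (and may belong to several canopies)."""
    assert t1 > t2, "loose threshold must exceed tight threshold"
    candidates = list(points)
    canopies = []
    while candidates:
        # Pick a random remaining point as the next canopy centre.
        centre = candidates.pop(random.randrange(len(candidates)))
        members = [centre] + [p for p in candidates if dist(centre, p) < t1]
        canopies.append((centre, members))
        # Tightly-covered points never become centres themselves.
        candidates = [p for p in candidates if dist(centre, p) >= t2]
    return canopies

# Toy 1-D data with two obvious groups, around 0 and around 10.
data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
canopies = canopy_clusters(data, t1=3.0, t2=1.0, dist=lambda a, b: abs(a - b))
print(len(canopies))  # the two groups end up as two canopies
```

The resulting canopies can then seed K-means or hierarchical clustering, so expensive distance computations are only done within a canopy.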
14. K-Clustering Method
1) K-means is the simplest algorithm for solving clustering problems.
2) The number of clusters is initialized first, and the initial cluster centres are selected at random.
3) A new partition is created by assigning each data point to the cluster with the closest centroid, and each centroid is recomputed as the mean of its assigned points.
4) Step 3 is repeated until the centroids no longer move.
5) The downside of K-means is that the clustering result depends heavily on the initially selected centroids.
6) Non-spherical datasets cannot be efficiently clustered using K-means.
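The steps above can be sketched as a plain NumPy implementation. A minimal version, assuming Euclidean distance and random initial centres drawn from the data (no handling of empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: random initial centres from the data, assign
    each point to the closest centroid, recompute centroids, and
    stop when the centroids no longer move."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centres as the mean of the assigned points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# Toy data: two well-separated blobs.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centres = kmeans(X, k=2)
print(labels)
```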
15. A few guidelines: clustering as a sample problem
1) Precede interdependence segmentation with focus groups.
2) To reduce differential bias, an exhaustive 5-point scale can be used.
3) Conjoint or discrete-choice-based utilities can be used as the basis of clustering.
4) We should try to cluster only unique dimensions, via factor analysis.
5) We should focus on the reliability and accuracy of the dependent variable (e.g. using CHAID).
6) Heavy data cleansing can be avoided when using decision trees (e.g. CHAID).
7) Different algorithms have a best-practice parametrization, e.g. for CHAID:
(i) the smallest child node should be set at approximately 5% of total N,
(ii) the parent node at twice the size of the child node,
(iii) alpha set at 0.5, with the Bonferroni adjustment turned off.
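The node-size rule in guideline 7 translates into a one-line computation; a small sketch (the total sample size N = 4000 is illustrative only):

```python
def chaid_node_sizes(total_n, child_share=0.05):
    """Best-practice CHAID node sizes from the guidelines above:
    smallest child node ~5% of total N, parent node twice the child."""
    child = round(total_n * child_share)
    return {"min_child_node": child, "min_parent_node": 2 * child}

print(chaid_node_sizes(4000))  # {'min_child_node': 200, 'min_parent_node': 400}
```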
16. IV. Modelling Approach
DESCRIPTIVE MODELLING:
a) We estimate probability densities using either a parametric or a non-parametric approach.
b) Some parametric modelling techniques that can be followed are:
I) Mixture models
II) Expectation Maximization (time complexity O(Kp²n); space complexity O(Kn); can be slow to converge; local maxima).
c) Non-parametric density estimation does not scale well.
d) For partitioning-based clustering we can use K-means, Partitioning Around Medoids (PAM), or Clustering Large Applications (CLARA).
e) For identifying the number of clusters, we can use the method of Mojena.
f) Some of the defining decisions for clustering with a descriptive approach are:
1) Ability to deal with various attributes.
2) Discovery of clusters with arbitrary shapes.
3) Minimal requirement for domain knowledge to determine input parameters.
Fig 8: Components of Modelling
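As an illustration of the parametric mixture-model route in (b), here is a minimal EM sketch for a two-component 1-D Gaussian mixture on synthetic data; real descriptive modelling would use more components, more dimensions and proper convergence checks:

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture (parametric
    density estimation). Returns means, std devs and weights."""
    # Crude initialization from the data quantiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(8, 1, 500)])
mu, sigma, pi = em_gmm_1d(x)
print(np.sort(mu))  # the recovered means should land near 0 and 8
```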
17. Predictive Modelling
The most commonly used clustering approaches based on predictive modelling are:
1) Behavioural clustering
This approach reveals how people behave while purchasing.
2) Product-based / category-based clustering
This approach discovers which groupings of products people buy from.
3) Brand-based clustering
This clustering tells us which brands people prefer.
Other modelling techniques are used in propensity models for predictions and collaborative filtering for recommendations.
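Category-based clustering (point 2) can be illustrated in a much-simplified form: count purchases per customer per category and group customers by their dominant category. The purchase log and category names below are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: (customer_id, product_category).
purchases = [
    ("C1", "electronics"), ("C1", "electronics"), ("C1", "books"),
    ("C2", "food"), ("C2", "food"),
    ("C3", "books"), ("C3", "books"), ("C3", "electronics"),
]

# Count purchases per customer per category ...
profiles = defaultdict(Counter)
for cid, cat in purchases:
    profiles[cid][cat] += 1

# ... and group customers by their dominant category.
segments = defaultdict(list)
for cid, counts in profiles.items():
    segments[counts.most_common(1)[0][0]].append(cid)

print(dict(segments))
```

A production system would instead cluster the full normalized customer-by-category matrix (e.g. with K-means), rather than assigning by a single dominant category.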
18. Prescriptive Modelling
Prescriptive modelling is an emerging modelling approach; it is still in its infancy.
Maintaining an agile approach towards failures during prescriptive modelling has made it one of the strongest modelling techniques on the rise.
From the case studies, we have seen that very few companies are able to implement prescriptive modelling.
It can also be said that all prescriptive models are descriptive in nature, but not all descriptive models are prescriptive.
Fig 9: Comparison of different modeling techniques.
19. V. Outcomes and Benefits from Methodology
By extracting common themes from the case studies and literature reviews, I have noticed three underlying principles. These principles are present in different magnitudes across the organizations, irrespective of their size, sector and structure. They are as follows:
1. Commonality, i.e., business functions in a common field.
2. Concentration, i.e., grouping of enterprises that can interact.
3. Connectivity, i.e., interconnected organizations with different types of relationships.
20. Conclusion
1) The data integration process is cumbersome, and companies should try to migrate to the latest Big Data technologies such as Hadoop, Cloudera, Spark, Hive and the like.
2) Various technical complexities can be addressed using an agile approach.
3) My guidelines on technical complexity management are agile, as they help the data scientist obtain continuous outputs. Agile analytics focuses on soft managerial variables. Putting this idea into practice in the situations of technical complexity seen in the case study reviews, we find that with every iteration the result can be improved: faster deployment, faster feedback incorporation, faster delivery. The potential to scale the system using feedback, together with its inherently flexible nature, makes it the state-of-the-art choice for analytics.
In one phrase: "Get Lean. Get Agile. Get Started."
Editor's Notes
Volume, Velocity, Variety and Value.
Discovery and communication of meaningful patterns in data.
It depends on the documented, simultaneous application of computer programming, operations research and statistics.
It is a multidimensional discipline.
3) We need analytics for recommendation actions and decision making.
4) The primary goal of big data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modellers and other analytics professionals to analyze data, including data left untapped by conventional business intelligence programs.
Agile analytics starts with learning and testing, so that companies can build their models and strategies based on solid answers to their most crucial business questions
Agile big data analytics focuses not on the data itself but on the insight and action that can ultimately be drawn from nimble business intelligence systems.
Agile Approach focuses on continuously delivering working features. In traditional models, developers take months on a feature, only to find it no longer applicable to the business environment.
Frequent testing should be done.
Using an agile approach, we can bring agility to big data instead of pivoting around a single key insight.
Adapt Agile Methods to individual projects and teams.
Descriptive analytics looks at past performance and comprehends that performance by mining past information to look for the motives behind previous success or failure
Classical descriptive approaches are usually seen in data warehousing systems.
Predictive models select major characteristics differently- i.e. while we may know a lot about our customers, we may not be able to precisely estimate what they will do next
Prescriptive models though, take advantage of the data of descriptive models and the premise of predictive models and try to answer, not only what the client will do next, but why they will do so.
This approach can be considered a good starting point for companies that want to cluster their clients for marketing purposes.
This is because it gives way to the possibility to answer various questions like,
“Which product is preferred by which age group?”, or “how much money is spent by people of a particular location?” etc.
2) Segments as small as 5% can often promise high profitability once they are weighted according to their spend levels. But generally, one would never want to make inferences from a subsample any smaller than around n=30. Thus, one would have had to start with an overall sample of 600, (30/5% = 600) in order to ensure that a segment as small as 5% could be reliably described.
Conclusion : We cannot know how well the data warehouse design matches the available data until we try to load it, nor how well it matches the stakeholder’s actual BI requirements until they use it.
This is why Agile Approach of early analytics and implementation can reduce the Data Warehouse/ Business Intelligence risk
Although all of these software packages support an agile approach, the final choice lies with the user and the purpose they want to focus on.
1) I would propose to implement pre-clustering methods like Canopy Clusters. (Canopy Clusters can process huge datasets efficiently. It is an unsupervised pre-clustering algorithm). This can be used as a pre-processing step for K-Means algorithm or the Hierarchical Clustering Algorithm. This will speed up the clustering operations on large datasets.
Cluster development policies can benefit from understanding the clustering process for the following reasons:
1. Clusters are not static and do not have fixed boundaries.
2. Clusters have different stages of development. (Agile approach can be implemented in developing a cluster as well as analyzing it.)