This was a 90 min talk given to the Dublin R user group in Nov 2013. It describes how one might go about a data analysis project using the R language and packages, using an open source dataset.
45min talk given at LondonR March 2014 Meetup.
The presentation describes how one might go about an insights-driven data science project using the R language and packages, using an open source dataset.
Customer Segmentation with R - Deep Dive into flexclust, by Jim Porzak
Jim Porzak's presentation at useR! 2015 in Aalborg, Denmark. Learn how to segment customers based on stated interest surveys using the flexclust package in R. Covers basic customer segmentation concepts, an introduction to flexclust, and solutions to three practical issues: the numbering problem, the stability problem, and the best choice for k.
DN 2017 | Multi-Paradigm Data Science - On the many dimensions of Knowledge D..., by Dataconomy Media
Gaining insight from data is not as straightforward as we often wish it would be – as diverse as the questions we’re asking are the quality and the quantity of the data we may have at hand. Any attempt to turn data into knowledge thus strongly depends on it dealing with big or not-so-big data, high- or low-dimensional data, exact or fuzzy data, exact or fuzzy questions, and the goal being accurate prediction or understanding. This presentation emphasizes the need for a multi-paradigm data science to tackle all the challenges we are facing today and may be facing in the future. Luckily, solutions are starting to emerge...
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A, by Dataconomy Media
Making the data of a company accessible to analysts, business users and data scientists can be a quite painful endeavor. In the past 5 years, Project A has supported many of its portfolio companies with building data infrastructures and we experienced many of these pains first-hand. This talk shows how some of these pains can be overcome by applying common sense and standard software engineering best practices.
An Interactive Introduction To R (Programming Language For Statistics), by Dataspora
This is an interactive introduction to R.
R is an open source language for statistical computing, data analysis, and graphical visualization.
While most commonly used within academia, in fields such as computational biology and applied statistics, it is gaining currency in industry as well – both Facebook and Google use R within their firms.
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing, by Salah Amean
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
DATA SHARING TAXONOMY RECORDS FOR SECURITY CONSERVATION, by csandit
Classification is a fundamental problem in data analysis. Training a classifier requires access to a large collection of data. Releasing person-specific data, such as customer data or patient records, may pose a threat to an individual's privacy. Even after removing explicit identifying information such as Name and SSN, it is still possible to link released records back to their identities by matching some combination of non-identifying attributes such as {Sex, Zip, Birthdate}. A useful approach to combat such linking attacks, called k-anonymization, is anonymizing the linking attributes so that at least k released records match each value combination of the linking attributes. Our goal is to find a k-anonymization which preserves the classification structure. Experiments on real-life data show that the quality of classification can be preserved even for highly restrictive anonymity requirements.
This presentation includes segmentation based on shopper behavior and a Pantaloons case study. It describes the shopping behavior of consumers in a Pantaloons store.
In this presentation, Jeff Maloy discusses how shopper marketers and retailers can benefit from customizing shopper marketing programs and solutions to specific shopping occasions.
A presentation from the 2014 National Postal Forum by Gary Seitz, Executive VP of C.TRAC, on Recency, Frequency, and Monetary (RFM) analysis as a simple tool to help mailers.
Efficient customer segmentation in Google Analytics (Blueffect 2013 Warsaw, Poland) - examples and best practices of accurate data analysis and advanced segmentation principles in order to improve revenues of your business.
- What makes you wrongly evaluate marketing campaigns: do you know the real conversion rate of your website?
- How to prioritize content sections of an e-commerce website.
- What customer segments and cohorts are useful.
Application of Clustering in Data Science using Real-life Examples, by Edureka!
Clustering data into subsets is an important task for many data science applications. It is considered one of the most important unsupervised learning techniques. Keeping this in mind, we have come up with a free webinar, 'Application of Clustering in Data Science using Real-life Examples.'
Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R, by shanelynn
Self-Organising maps for Customer Segmentation using R.
These slides are from a talk given to the Dublin R Users group on 20th January 2014. The slides describe the uses of customer segmentation, the algorithm behind Self-Organising Maps (SOMs) and go through two use cases, with example code in R.
Accompanying code and datasets now available at http://shanelynn.ie/index.php/self-organising-maps-for-customer-segmentation-using-r/.
RFM Segmentation is the easiest and most frequently used form of database segmentation. It is based on three key metrics: Recency, Frequency and Monetary Value of customer activity. RFM is often used with transactional history in e-commerce, but can also work for social media interactions, online gaming or discussion boards. Based on the calculated segments, a marketer can prepare cross-sell, up-sell, retention and reactivation campaigns. This deck provides a simple introduction to the RFM Segmentation methodology.
Chadwick Martin Bailey’s Brant Cruz and Jeff McKenna presented best practices of market segmentation based on their years of experience working with clients like eBay, Electronic Arts, Plantronics, and Microsoft.
AWS July Webinar Series: Amazon Redshift Optimizing Performance, by Amazon Web Services
Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data for a fraction of the cost of traditional data warehouses.
By following a few best practices for schema design and cluster design, you can unleash the high performance capabilities of Amazon Redshift. This webinar is a deep dive into performance tuning techniques based on real-world use cases.
Learning Objectives:
Learn how to get the best performance from your Redshift cluster
Design Amazon Redshift clusters based on real world use cases
See sample tuning scripts to diagnose and maximize cluster performance
Learn about increasing query performance using interleaved sorting
Our fall 12-week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced machine learning. In this meet-up, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will do a remote session with us through Google Hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and also a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the XGBoost R package, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisite (if any): R / Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction of Xgboost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XgBoost Demo
Reference:
https://github.com/dmlc/xgboost
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that recur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of relational algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
Data Exploration (EDA)
- Relationship between variables
- Check for
- Multicollinearity
- Distribution of variables
- Presence of outliers and its treatment
- Statistical significance of variables
- Class imbalance and its treatment
Feature Engineering
- Whether any transformations are required
- Scaling the data
- Feature selection
- Dimensionality reduction
Assumptions
- Check for the assumptions to be satisfied for each of the models in
- Regression – SLR, Multiple Linear Regression, Logistic Regression
- Classification – Decision Tree, Random Forest, SVM, Bagged and boosted models
- Clustering – PCA (multicollinearity), K-Means (presence of outliers, scaling, conversion to numerical, etc.)
----------------------------- Interim Presentation Checkpoint----------------------------------------------------------
Model building
- Split the data into train and test sets.
- Start with a simple model which satisfies all the above assumptions based on your dataset.
- Check for bias and variance errors.
- To improve the performance, try cross-validation, ensemble models, hyperparameter
tuning, grid search
Evaluation of model
- Regression – RMSE, R-Squared value,
- Classification – Classification report with precision, recall, F1-score, Support, AUC, etc.
- Clustering – Inertia value
- Comparison of different models built and discussion of the same
- Time taken for the inferences/ predictions
Business Recommendations & Future enhancements
- How to improve data collection, processing, and model accuracy?
- Commercial value/ Social value / Research value
- Recommendations based on insights
----------------------------- Final Presentation Checkpoint----------------------------------------------------------
Dashboard
- EDA – Correlation matrix, pair plots, box plots, distribution plots
- Model
- Model Parameters
- Visualization of performance of the model with varying parameters
- Visualization of model Metrics
- Testing outcome
- Failure cases and explanation for the same
- Most successful and obvious cases
- Border cases
Reinforcement Learning (RL) is being increasingly used to learn and adapt application behavior in many domains, including large-scale and safety critical systems, as for example, autonomous driving. With the advent of plug-n-play RL libraries, its applicability has further increased, enabling integration of RL algorithms by users. We note, however, that the majority of such code is not developed by RL engineers, which as a consequence, may lead to poor program quality yielding bugs, suboptimal performance, maintainability, and evolution problems for RL-based projects. In this paper we begin the exploration of this hypothesis, specific to code utilizing RL, analyzing different projects found in the wild, to assess their quality from a software engineering perspective. Our study includes 24 popular RL-based Python projects, analyzed with standard software engineering metrics. Our results, aligned with similar analyses for ML code in general, show that popular and widely reused RL repositories contain many code smells (3.95% of the code base on average), significantly affecting the projects’ maintainability. The most common code smells detected are long method and long method chain, highlighting problems in the definition and interaction of agents. Detected code smells suggest problems in responsibility separation, and the appropriateness of current abstractions for the definition of RL algorithms.
paper preprint: https://arxiv.org/abs/2303.10236
Feature Engineering - Getting most out of data for predictive models - TDC 2017, by Gabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How can we identify the most predictive attributes of a dataset? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity and accuracy of the models: the analysis of the distribution of features and their correlations, and the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Python, Scikit-learn, and Spark SQL examples will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
Data science is a powerful confluence of statistics, software and large datasets, but what does it mean for large corporates, and how can we gain real value by learning from data?
In this high-level presentation, I describe how I see data science fitting into the regular day-to-day processes of banking and insurance: enabling better decision making, improving top-line revenues and automating the mundane. I'll reference a number of real-world projects that we have undertaken to enable clients to learn from their data, improve their product-market fit and improve their business processes.
Viewers will hopefully gain a better appreciation of data science as applied to business operations in finance.
How is Data Science going to Improve Insurance? by Jonathan Sedar
An overview of the applications for data science in insurance: (1) Intelligent use of external data, (2) Advanced but interpretable statistical modelling, (3) Careful use of exotic machine learning. As presented at QCon London 2016 Conference.
Topic Modelling on the Enron Email Corpus @ ODSC 13 Apr 2016, by Jonathan Sedar
Slides from my presentation at ODSC and Python Quants meetup in London on 13 Apr 2016. Very lightly covering a demo project on topic modelling and network analysis of the Enron Email Corpus.
Text mining to correct missing CRM information: a practical data science project, by Jonathan Sedar
20min talk given at PyData London 2014
A client in the energy sector wanted to create predictive behavioural models of business customers at the company level, but the CRM data was messy, often containing several sub-accounts for each business, without any grouping identifiers, and so aggregation was impossible. In this talk I describe a short project where we used text mining, a handful of unsupervised learning techniques and pragmatic use of human skill, to identify the true company level structures in the CRM data.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Smart TV Buyer Insights Survey 2024, by 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Connector Corner: Automate dynamic content and events by pushing a button, by DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 4, by DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
JMeter webinar - integration with InfluxDB and Grafana, by RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Epistemic Interaction - tuning interfaces to provide information for AI support, by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Leading Change strategies and insights for effective change management
Customer Clustering for Retailer Marketing
1. Customer Clustering for Retailer Marketing
A worked example of machine learning in industry, with reference to useful R packages
Talk for Dublin R User Group - 06 Nov 2013
Jonathan Sedar - Consulting Data Scientist
@jonsedar
3. Let's try to group them by similar shopping behaviours
4. Overview
Intro: What's the problem we're trying to solve?
Sourcing, Cleaning & Exploration: What does the data let us do?
    read.table {utils}, ggplot {ggplot2}, lubridate {lubridate}
Feature Creation: Extract additional information to enrich the set
    data.table {data.table}, cut2 {Hmisc}, dcast {reshape2}
Feature Selection: Reduce to a smaller dataset to speed up computation
    scale {base}, prcomp {stats}
Mixture Modelling: Finding similar customers without prior information … and interpreting the results
    mclust {mclust}
5. Intro
Customer profiling enables targeted marketing and can improve operations:
Retention offers
Product promotions
Loyalty rewards
Optimise stock levels & store layout
[Slide graphic: persona card "Bob, Dublin 6, Age 42, Married?" becomes "Bob, Dublin 6, Age 42, Married? Type: 'Family First'"]
6. Intro
We want to turn transactional data into customer classifications.
A real dataset:
32,000 customers
24,000 items
800,000 transactions
… over a 4-month period
[Slide graphic: transactions → "Magic" → classifications]
Many, many ways to approach the problem!
7. Intro
Practical machine learning projects tend to have a similar structure:
Source your raw data → Cleaning & importing → Exploration & visualisation →
Feature creation → Feature selection → Model creation → Model optimisation & interpretation
10. Sourcing, Cleansing, Exploration
What information do we have?
“Ta-Feng” grocery shopping dataset
800,000 transactions
32,000 customer ids
24,000 product ids
4-month period over Winter 2000-2001
http://recsyswiki.com/wiki/Grocery_shopping_datasets
   trans_date  cust_id age res_area    product_id quantity price
1: 2000-11-01 00046855   D        E 4710085120468        3    57
2: 2000-11-01 00539166   E        E 4714981010038        2    48
3: 2000-11-01 00663373   F        E 4710265847666        1   135
...
11. Sourcing, Cleansing, Exploration
Data definition and audit (1 of 2)
A README file, excellent...
4 ASCII text files:
# D11: Transaction data collected in November, 2000
# D12: Transaction data collected in December, 2000
# D01: Transaction data collected in January, 2001
# D02: Transaction data collected in February, 2001
Curious choice of delimiter and an extended charset
# First line: Column definition in Traditional Chinese
# §È¥¡;∑|≠˚•d∏π;¶~ƒ÷;∞œ∞Ï;∞”´~§¿√˛;∞”´~ΩsΩX;º∆∂q;¶®•ª;æP∞‚
# Second line and the rest: data columns separated by ";"
Pre-clean in shell
awk -F";" 'gsub(":","",$1)' D02
12. Sourcing, Cleansing, Exploration
Data definition and audit (2 of 2)
Although prepared by another researcher, we can still find undocumented gotchas:
# 1: Transaction date and time (time invalid and useless)
# 2: Customer ID
# 3: Age: 10 possible values,
#    A <25, B 25-29, C 30-34, D 35-39, E 40-44, F 45-49, G 50-54, H 55-59, I 60-64, J >65
#    actually there's 22362 rows with value K, will assume it's Unknown
# 4: Residence Area: 8 possible values,
#    A-F: zipcode area: 105,106,110,114,115,221, G: others, H: Unknown
#    Distance to store, from the closest: 115,221,114,105,106,110
#    so we'll factor this with levels "E","F","D","A","B","C","G","H"
# 5: Product subclass
# 6: Product ID
# 7: Amount
# 8: Asset
#    not explained, low values, not an id, will ignore
13. Sourcing, Cleansing, Exploration
Import & preprocess (read.table)
Read each file into a data.table, whilst applying basic data types
> dtnov <- data.table(read.table(fqn,col.names=cl$names,colClasses=cl$types
                                 ,encoding="UTF-8",stringsAsFactors=F))
Alternatives inc RODBC / RPostgreSQL
> con <- dbConnect(dbDriver("PostgreSQL"), host="localhost", port=5432
                   ,dbname="tafeng", user="jon", password="")
> dtnov <- dbGetQuery(con,"select * from NovTransactions")
14. Sourcing, Cleansing, Exploration
Import & preprocess (lubridate)
Convert some datatypes to be more useful:
String -> POSIXct datetime (UNIX time UTC) using lubridate
> dtraw[,trans_date:= ymd(trans_date )]
> cat(ymd("2013-11-05"))
1383609600
… also, applying factor levels to the residential area
> dtraw[,res_area:= factor(res_area
,levels=c("E","F","D","A","B","C","G","H") )]
15. Sourcing, Cleansing, Exploration
Explore: Group By (data.table)
How many transactions, dates, customers, products and product subclasses?
> nrow(dt[,1,by=cust_id])
# 32,266
Using data.table's dt[i, j, by] structure, where:
i   subselects rows             (SQL WHERE)
j   selects / creates columns   (SQL SELECT)
by  groups by columns           (SQL GROUP BY)
e.g. the above is:
select count(*)
from dt
group by cust_id
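To make the i/j/by mapping concrete, here is a small hypothetical query in the same style (columns as in the transaction table above):
> # total spend per customer, restricted to one residential area
> dt[res_area=="E"                      # i: WHERE
     ,list(spend=sum(quantity*price))   # j: SELECT
     ,by=cust_id]                       # by: GROUP BY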
16. Sourcing, Cleansing, Exploration
Example of data logic-level cleaning
Product hierarchies: we assumed many product_ids to one product_category,
but actually … a handful of product_ids belong to 2 or 3 product_cats:
> transcatid <- dt[,list(nbask=length(trans_id)),by=list(prod_cat,
prod_id)]
> transid <- transcatid[,list(ncat=length(prod_cat),nbask=sum(nbask))
,by=prod_id]
> transid[,length(prod_id),by=ncat]
   ncat    V1
1:    1 23557
2:    2   253
3:    3     2
Solution: dedupe. Keep the prod_id-prod_cat combos with the largest nbask
> ids <- transid[ncat>1,prod_id]
> transcatid[prod_id %in% ids,rank :=rank(-nbask),by=prod_id]
> goodprodcat <- transcatid[is.na(rank) | rank ==1,list(prod_cat,
prod_id)]
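Not shown on the slide: applying the deduped lookup back onto the main table. A minimal sketch, assuming keys as set below (nomatch=0 gives inner-join semantics, dropping the duplicate-category rows):
> setkey(dt,prod_cat,prod_id)
> setkey(goodprodcat,prod_cat,prod_id)
> dt <- dt[goodprodcat,nomatch=0]   # keep only the winning prod_id-prod_cat pairs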
17. Sourcing, Cleansing, Exploration
Explore: Visualise (ggplot) (1 of 4)
e.g. transactions by date
p1 <- ggplot(dt[,list(num_trans=length(trans_id)),by=trans_date]) +
geom_bar(aes(x=trans_date,y=num_trans),stat='identity',alpha=0.8)
plot(p1)
18. Sourcing, Cleansing, Exploration
Explore: Visualise (ggplot) (1.5 of 4)
e.g. transactions by date (alternate plotting)
p1b <- ggplot(dt[,list(num_trans=length(trans_id)),by=trans_date]) +
  geom_point(aes(x=trans_date,y=num_trans),stat='identity',alpha=0.8) +
  geom_smooth(aes(x=trans_date,y=num_trans),method='loess',alpha=0.8)
plot(p1b)
19. Sourcing, Cleansing, Exploration
Explore: Visualise (ggplot) (2 of 4)
e.g. histogram count of customers with N items bought
p2 <- ggplot(dt[,list(numitem=length(trans_id)),by=cust_id]) +
  geom_bar(aes(x=numitem),stat='bin',binwidth=10,alpha=0.8,fill=orange) +
  coord_cartesian(xlim=c(0,200))
plot(p2)
20. Sourcing, Cleansing, Exploration
Explore: Visualise (ggplot) (3 of 4)
e.g. scatterplot of total items vs total baskets per customer
p4a <- ggplot(dttt) +
geom_point(aes(x=numbask,y=numitem),size=1,alpha=0.8) +
geom_smooth(aes(x=numbask,y=numitem),method="lm")
plot(p4a)
21. Sourcing, Cleansing, Exploration
Explore: Visualise (ggplot) (4 of 4)
e.g. scatterplot of total items vs total baskets per customer per res_area
p5 <- ggplot(dttt) +
  geom_point(aes(x=numbask,y=numitem,color=res_area),size=1,alpha=0.8) +
  geom_smooth(aes(x=numbask,y=numitem),method="lm",color=colorBlind[1]) +
  facet_wrap(~res_area)
plot(p5)
A-F: zipcode area:
105,106,110,114,115,221
G: others
H: Unknown
Dist to store, from closest:
E < F < D < A < B < C
23. Feature Creation
Create New Features
Per customer (32,000 of them):
Counts:
# total baskets (==unique days)
# total items
# total spend
# unique prod_subclass, unique prod_id
Distributions (min - med - max will do):
# items per basket
# spend per basket
# product_ids, prod_cats per basket
# duration between visits
Product preferences:
# prop. of baskets in the N bands of product cats & ids by item pop.
# prop. of baskets in the N bands of product ids by item price
24. Feature Creation
Counts
Pretty straightforward use of group by with data.table
> counts <- dt[,list(nbask=length(trans_id)
,nitem=sum(quantity)
,spend=sum(quantity*price))
,by=list(cust_id)]
> setkey(counts,cust_id)
> counts
    cust_id nbask nitem spend
1: 00046855     1     3   171
2: 00539166     4     8   300
3: 00663373     1     1   135
25. Feature Creation
Distributions
Again, making use of group by with data.table and list to form new data table
> dists_ispb <- dt[,list(nitem=sum(quantity)
,spend=sum(quantity*price))
,by=list(cust_id,trans_date)]
> dists_ispb <- dists_ispb[,list(ipb_max=max(nitem)
,ipb_med=median(nitem)
,ipb_min=min(nitem)
,spb_max=max(spend)
,spb_med=median(spend)
,spb_min=min(spend))
,by=cust_id]
> setkey(dists_ispb,cust_id)
26. Feature Creation
Example considerations: is it acceptable to lose datapoints?
Feature: duration between visits
If customers visited once only, they have value NA - an issue for Mclust
Solutions:
A: remove them from modelling? wasteful in this case (lose 30%!)
But maybe we don’t care about classifying one-time shoppers
B: or give them all the same value
But which value? all == 0 isn't quite true, and many will skew clustering
C: impute values based on the global mean and SD of each col
Usually a reasonable fix, except for ratio columns, where it is clumsy and likely misleading, requiring mirroring to get positive axes
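A minimal sketch of option C for this feature, assuming dt carries cust_id and trans_date as in the earlier slides; one-visit customers come out NA and are imputed from the global mean and SD:
> visits <- unique(dt[,list(cust_id,trans_date)])
> setkey(visits,cust_id,trans_date)
> # median gap in days between visits; NA where there is only one visit
> gaps <- visits[,list(gap_med=median(as.numeric(diff(trans_date),units="days")))
                 ,by=cust_id]
> mu <- mean(gaps$gap_med,na.rm=T); sdev <- sd(gaps$gap_med,na.rm=T)
> gaps[is.na(gap_med),gap_med:=pmax(0,rnorm(.N,mu,sdev))]  # impute, clamp at 0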
27. Feature Creation
Product Preferences (1 of 2) (Hmisc)
Trickier, since we don’t have a product hierarchy
e.g. Food > Bakery > Bread > Sliced White > Brennans
But we do have price per unit and an inherent measure of popularity in the transaction log, e.g.
> priceid <- dt[,list(aveprice=median(price)),by=prod_id]
> priceid[,prodid_pricerank:=LETTERS[as.numeric(cut2(aveprice,g=5))]]
# A low, E high
> priceid
         prod_id aveprice prodid_pricerank
1: 4710085120468       21                A
2: 4714981010038       26                A
3: 4710265847666      185                D
28. Feature Creation
Product Preferences (2 of 2) (dcast)
Now:
1. Merge the product price class back onto each transaction row
2. Reformat and sum transaction count in each class per customer id, e.g.
> dtpop_prodprice <- data.table(dcast(dtpop
,cust_id~prodid_pricerank
,value.var="trans_id"))
> dtpop_prodprice
    cust_id A B C D E
1: 00001069 1 3 3 2 2
2: 00001113 7 1 4 5 1
3: 00001250 6 4 0 2 2
3. And further process to make proportional per row
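Steps 1 and 3 are not shown on the slide; a minimal sketch, assuming dtpop is the merged transaction table and the price-band columns are named A-E as above:
> setkey(dt,prod_id); setkey(priceid,prod_id)
> dtpop <- priceid[dt]        # step 1: price class onto each transaction row
> bands <- c("A","B","C","D","E")
> dtpop_prodprice[,tot:=rowSums(.SD),.SDcols=bands]
> dtpop_prodprice[,(bands):=lapply(.SD,function(x) x/tot),.SDcols=bands]  # step 3
> dtpop_prodprice[,tot:=NULL]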
29. Feature Selection
Too many features?
We have a data.table of 20,000 customers x 40 synthetic features
… which we hope represent their behaviour sufficiently to distinguish them
More features == heavier processing for clustering
Can we lighten the load?
30. Feature Selection
Principal Components Analysis (PCA)
Standard method for reducing dimensionality: maps each datapoint to a new coordinate system created from principal components (PCs).
PCs are ordered:
- 1st PC is aligned to maximum variance in all features
- 2nd PC aligned to the remaining max variance in an orthogonal plane
- ...etc.
Where each datapoint had N features, it now has N PC values which are a composite of the original features.
We can now feed the first few PCs to the clustering and keep the majority of the variance.
31. Feature Selection
PCA (scale & prcomp) (1 of 2)
Scale first, so we can remove extreme outliers in original features
> cstZi_sc <- scale(cst[,which(!colnames(cst) %in% c("cust_id")),with=F])
> cstZall <- data.table(cst[,list(cust_id)],cstZi_sc)
Now all features are in units of 1 s.d.
For each row, if any one feature has value > 6 s.d., record in a filter vector
> sel <- apply(cstZall[,colnames(cstZi),with=F]
,1
,function(x){if(max(abs(x))>6){T}else{F}})
> cstZoutliers <- cstZall[sel]
> nrow(cstZoutliers)
# 830 (/20381 == 4% loss)
Health warning: we’ve moved the centre, but prcomp will re-center for us
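The slide stops at the outlier count; presumably the complement becomes the modelling set cstZ used on the next slide:
> cstZ <- cstZall[!sel]   # drop the ~4% outlier rows before PCA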
32. Feature Selection
PCA (scale & prcomp) (2 of 2)
And now run prcomp to generate PCs
> cstPCAi <- prcomp(cstZ[,which(!colnames(cstZ) %in% c("cust_id")),
with=F])
> cstPCAi$sdev      # sqrt of eigenvalues
> cstPCAi$rotation  # loadings
> cstPCAi$x         # PCs (aka scores)
> summary(cstPCAi)
Importance of components:
                          PC1    PC2    PC3     PC4 … etc
Standard deviation     2.1916 1.8488 1.7923 1.37567
Proportion of Variance 0.1746 0.1243 0.1168 0.06882
Cumulative Proportion  0.1746 0.2989 0.4158 0.48457
Wait, prcomp vs princomp?
Apparently princomp is faster but potentially less accurate. Performance of prcomp is very acceptable on this small dataset (20,000 x 40).
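One way to pick how many PCs to pass on to the clustering is a cumulative-variance cutoff; the 80% threshold below is a judgement call, not from the slides:
> cumvar <- cumsum(cstPCAi$sdev^2)/sum(cstPCAi$sdev^2)  # proportion of variance
> npc <- which(cumvar>=0.80)[1]
> cstPCs <- data.table(cstZ[,list(cust_id)],cstPCAi$x[,1:npc])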
35. Clustering
Finite Mixture Modelling
● Assume each datapoint has a mixture of classes, each explained by a different model.
● Pick a number of models and fit to the data; best fit wins.
36. Clustering
Gaussian Mixture Modelling (GMM)
● Models have Gaussian dist., we can vary params
● Place N models at a random point, move and fit to data using the Expectation Maximisation (EM) algorithm
● EM is an iterative method of finding the local max likelihood estimate
● Slow but effective
● GMM advantage over e.g. k-means is the ability to vary model params to fit better
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
37. Clustering
Of course there’s an R package (mclust)
Mclust v4 provides:
● Clustering, classification, density estimation
● Auto parameter estimation
● Excellent default plotting to aid live investigation
In CRAN and detail at http://www.stat.washington.edu/mclust/
38. Clustering
Finding the optimal # models (mclust)
Will automatically iterate over a number of models (components) and covariance params
Will use the combination with best fit (highest BIC)
Best fit here: C5, VVV (i.e. 5 components with the VVV covariance model)
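A minimal sketch of the fit, assuming cstPCs holds cust_id plus the retained PCs from the PCA step:
> library(mclust)
> pcs <- as.matrix(cstPCs[,which(!colnames(cstPCs) %in% c("cust_id")),with=F])
> fitBIC <- mclustBIC(pcs,G=1:9)   # iterate over components x covariance models
> plot(fitBIC)                     # BIC curves; highest wins
> fit <- Mclust(pcs,x=fitBIC)      # refit the winning combination
> summary(fit)                     # e.g. 5 components with VVV covariance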
39. Clustering
Interpreting the model fit (1 of 3) (mclust)
The classification pairs-plot lets us view the clustering by principal component (axes PC1 … PC5).
40. Clustering
Interpretation (2 of 3) (mclust)
‘Read’ the distributions w.r.t. PC components
PC1: “Variety axis”: distinct products per basket and raw count of distinct products overall
    prodctpb_max   0.85
    prodctpb_med   0.81
    ipb_med        0.77
    ipb_max        0.77
    nprodcat       0.75
PC2: “Spendy axis”: prop. of baskets containing expensive items, and simply raw count of items and visits
    popcat_nbaskE -0.71
    popid_nbaskE  -0.69
    popcat_nbaskD  0.60
    nbask         -0.51
    nitem         -0.51
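These loadings come straight from the rotation matrix; a quick sketch to pull the top-weighted features per PC:
> rot <- cstPCAi$rotation
> rot[order(abs(rot[,"PC1"]),decreasing=T)[1:5],"PC1"]
> rot[order(abs(rot[,"PC2"]),decreasing=T)[1:5],"PC2"]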
41. Clustering
Interpretation (3 of 3) (mclust)
‘Read’ the distributions w.r.t PC components
PC1 (greater variety): Bob: spendier, higher variety, family oriented?
PC2 (reduced selection of expensive items, fewer items): Charles: thriftier, reduced selection, shopping to a budget?
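Closing the loop, a sketch: attach each customer's cluster label back onto their id for targeting, using fit from the earlier sketch:
> segments <- data.table(cust_id=cstPCs$cust_id
                         ,segment=factor(fit$classification))
> segments[,list(n=length(cust_id)),by=segment]   # size of each segment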
42. We covered...
Intro: What's the problem we're trying to solve?
Sourcing, Cleaning & Exploration: What does the data let us do?
    read.table {utils}, ggplot {ggplot2}, lubridate {lubridate}
Feature Creation: Extract additional information to enrich the set
    data.table {data.table}, cut2 {Hmisc}, dcast {reshape2}
Feature Selection: Reduce to a smaller dataset to speed up computation
    scale {base}, prcomp {stats}
Mixture Modelling: Finding similar customers without prior information … and interpreting the results
    mclust {mclust}