Data Science Company 
Introduction to (big) data science 
Infofarm - Seminar 
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be 
30/09/2014
Agenda 
• About us 
• What is Data Science? 
• Data Science in practice 
– Models 
– Tools 
• Case study
About us 
InfoFarm - Company 
• Data Science and BigData startup 
• Part of the Cronos group 
– Largest independent IT services supplier in Belgium 
– Organized in limited-sized highly focused competence centers 
– 3000+ Consultants 
• Incubated at Xplore Group, within the context of: 
– Java 
– PHP 
– e-commerce (Hybris, Intershop, Magento, DrupalCommerce, ...) 
– Mobile development (iOS, Android, ...) 
– Web development (HTML5, CSS3, ...)
InfoFarm - Team 
• Mixed skills team 
– 2 Data Scientists 
• Mathematics 
• Statistics 
– 4 BigData Consultants 
– 1 Infra specialist 
– n Cronos colleagues with various backgrounds 
• Certifications 
– CCDH - Cloudera Certified Hadoop Developer 
– CCAD - Cloudera Certified Hadoop Administrator 
– OCJP – Oracle Certified Java Programmer
InfoFarm - Focus 
• Mission 
– “Help our customers to excel in their business activities by 
providing them with new information and insights of high 
business value. 
Identifying, extracting and using data of all types and origins; 
exploring, correlating and using it in new and innovative ways in 
order to extract meaning and business value from it.” 
• Focus Domains 
– Data Science 
– Machine Learning 
– Big Data
Introduction: what is Data Science? 
What is Data Science? 
• Data Science & Business decisions 
• Data Science vs … 
– Statistics 
– Business Intelligence 
– Big Data 
• What can Data Science do for your business? 
• The Data Science maturity model
Business decisions 
• Any business requires continuous decision making 
– Will we offer this customer a discount or not? 
– Do we need to keep extra stock for product X? 
– How do we answer this customer question? 
– At which supplier do we buy this product? 
– With which solution will we respond to this RFP? 
– Do we need to replace device X? 
– … 
• The possible answers to these questions are based on prior 
experience with the business 
• Each decision can turn out to be right or wrong; business 
knowledge should help avoid the wrong ones
Business decisions 
– However … 
• Do you really know your business that well? 
• Hasn’t it evolved in this fast-changing world? 
• Are you sure your competitors aren’t making better decisions? 
– You probably own a lot more information than you might realize! 
• All your business processes are generating data which you can 
use to your advantage! 
• Quotes you made vs deals you won 
• Historical sales records 
• Web logs showing user activity 
• Social media activity referring to your brand/product 
• Metering info on devices (internet of things) 
• …
Types of Data 
– Proprietary data 
• ERP, CRM, Orders, Customers, Products, etc… 
– “Dark Data” – currently unused, possibly without the company being aware of it 
• Unknown, but present in the company 
• Cost-efficient BigData tools might enable business cases using this data 
– External data 
• Websites, social media, open data, … 
– Data still to be captured 
• “If only we knew X or Y” … 
– There might be a huge added value in “mashing up” proprietary 
data with public/open data!
Business Knowledge vs Data Science 
(Intuitive knowledge vs data driven decisions) 
• Business Knowledge 
– Acquired by experience 
– (Assumed) insights 
– RISK: too much bias towards past experience and gut feeling 
• Data Science 
– Complementary to business knowledge 
– Confirmative or new insights 
– Data-driven decision making 
– RISK: too naive data interpretation, disconnected from the business
Business decisions: marketing example 
• Example: 
We want to send mailings about our new product 
• Decisions to take: 
– Which mail to send to which customers? 
– We need customer segmentation! 
• Risks in failing to do this correctly 
– Missing opportunities (not informing customers) 
– Annoying customers with irrelevant mailings (churn, reputation damage, …)
Business decisions: marketing example 
• Business knowledge based approach 
– “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female” 
– But is this (still) true? 
– E.g.: do we really want to send an ad of the new iPhone to a long-time Android 
user because he’s a 30-something male customer? 
Business decisions: marketing example 
• Data-driven approach: Can we identify different segments automatically? (machine learning!) 
– WEB SERVER LOGS 
Which customers have already looked at similar 
products on our website? 
– ORDER HISTORY 
Which customers own complementary products? 
– CRM INFORMATION 
What is the typical profile of a customer that clicked 
through on the last e-mail campaign for a similar product? 
– … 
• Business knowledge and Data Science become in- and output for 
each other! 
– Ideas/hypotheses and data to be examined should be identified from business 
knowledge! 
– A/B testing can be applied to test approaches and check results 
– Let the data speak for itself! New business insights are generated
Being a Data Scientist 
• “Data Scientist – the sexiest job of the 21st century” 
- Thomas H. Davenport 
• Data Scientist: “A person who is better at statistics than any software 
engineer and better at software engineering than any statistician” 
- Josh Wills
Data Science = team work!
Data Science vs Statistics 
• Basic Statistics concepts 
– Reliability and validity 
– Probability 
– Descriptive statistics and graphics 
• Inferential statistics (and hypothesis testing) 
– Probability distributions 
– Populations and samples 
– Confidence intervals 
– Correlation 
• Data Science 
– Link with IT (tooling, scale, …) 
– Data preparation & hacking (get data from databases, websites, …) 
– Machine learning and automation 
– Working interactively together with business
Data Science vs Business Intelligence 
• Basic BI concepts: structuring data to report and query upon it 
– DWH, OLAP, ETL processes 
– Star- and snowflake schemas 
– Query-oriented architectures 
– Close to typical IT development cycle 
• Data Science: working and experimenting with data to gain insights 
– Exploratory working 
– Work in a research cycle rather than development cycle 
– Limited investment towards analysis that might or might not deliver 
– Tools designed to avoid heavy ETL (loosely structured data) 
– Eventually valuable analyses can be ported to BI systems 
Data Science vs Business Intelligence 
• Using tools that are designed to support exploratory 
working 
– Not requiring strict up-front schema design 
– Allowing fast and cheap hypotheses testing 
– Open up opportunities to quickly integrate many data sources 
• Excel files, Text files, Word Documents 
• Log files 
• Relational databases 
• Sensor data 
• Timeseries data 
• ... 
• Integrations with online (OLTP) and analytical 
(OLAP/BI) systems 
– Typically for automating repetitive analysis and reporting outputs 
Data Science vs Big Data 
• Process of statistical inference: sampling & induction 
• BigData allows: 
– N=ALL (avoid sampling errors) 
• Sampling issues can be overcome by just processing ALL available data (process massive data) 
– N=1 (avoid issues with non-homogeneous datasets) 
• Categorization becomes true personalisation: project towards ONE individual (calculate per item) 
• Significance considerations are not applicable!
What can Data Science do for your business? 
• Extract meaning from data 
– Using and combining data in ways it has never been used before 
– Finding patterns and correlations in data from all possible sources 
– Detecting anomalies and changes in known patterns 
• Transform data of various types into valuable information 
– As a basis for management decisions 
– As a basis for data products 
– That can improve your business in any way 
• Build and integrate Data Products 
– Recommendation engines, Prediction models, Automated classification, … 
• The key point is spotting opportunities to outperform your 
competitors using any data available!
Scientific cycle 
Question → Hypothesis → Experiment (data) → Analyse results → Conclusion 
• This is NOT a development cycle! 
• Experimentation vs engineering 
• Because it is a science, the outcome cannot be predicted 
• This makes it hard to integrate into an IT development process
Scientific cycle 
• Take small steps 
• Formulate hypotheses 
• Actually build things 
• Apply A/B testing 
• Even without success, 
you learned something!
The Data Science maturity model 
• Don’t run before you can walk: The Data Science Maturity model 
Each level builds on the quality of the underlying step. It’s science, not magic … 
– Start off by simply collecting the data you need (type, quantity, quality) 
– Then report on your current business (confirmative analysis) 
– Discover new and valuable information (exploratory analysis) 
– Build and test prediction models (predictive analysis) 
– Steer your business based on the advice that your predictions produce (data-driven) 
Collect → Describe → Discover → Predict → Advise
The Data Science maturity model 
Phase: Collect 
– Actions: logging information; gathering data from different sources 
– Examples in commerce: logging user actions on a website; using loyalty cards to identify customers 
Phase: Describe 
– Actions: explorative data analysis; basic analytical functions; checking quantity and quality of data 
– Examples in commerce: typical reporting; correlating data over sources 
Phase: Discover 
– Actions: finding correlations; building models 
– Examples in commerce: finding similarly behaving customers 
Phase: Predict 
– Actions: building prediction models; formulating expectations for the future based on past info 
– Examples in commerce: predict sales figures for a new product; predict whether a certain customer will or will not buy a certain product 
Phase: Advise 
– Actions: use prediction models to evaluate decision possibilities and pick the best 
– Examples in commerce: target advertising to the right customer groups to optimize revenue
Data Science in practice 
Overview 
• Tools: R, Hive, Pig 
• Modeling methods & statistics: Decision trees, Naive Bayes, Regression, Nearest Neighbor, K-means clustering, A priori, …
Tools – Data Science 
• Analytics: R 
• Visualisation: Shiny 
• Docs: MarkDown 
• Data retrieval 
– CSV, TAB, ... files 
– Apache Hive 
• Data processing 
– Apache Pig 
• Open Source based
Tools – Machine Learning 
• Apache Mahout 
• Apache Spark MLlib 
• R 
• Open Source based
Tools - BigData 
• Hadoop 
– HDFS 
– MapReduce 
– Pig 
– Hive 
– Oozie 
– Impala 
– ... 
• Spark 
– Shark, SparkR 
• Platforms 
– Open Source Apache Hadoop 
– CDH - Cloudera (partnership at Cronos level) 
– HDP – Hortonworks Data Platform
Tools - HDFS
Tools – MapReduce : Wordcount 
• Stages: Input → Splitting → Mapping → Shuffling → Reducing → Output 
• Mapping and Reducing are your own code; the other stages are handled by the framework
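The word count flow can be imitated in a few lines of plain Python. This is only a single-machine sketch of the idea, not Hadoop code: the function names and the three input lines are made up for illustration, and each input line stands in for one split handled by one mapper.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper (your code): emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle (framework): group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer (your code): sum the counts for one word."""
    return key, sum(values)

def word_count(lines):
    # Each line plays the role of one input split handled by one mapper.
    emitted = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input, three "splits".
counts = word_count(["deer bear river", "car car river", "deer car bear"])
```

Only `map_phase` and `reduce_phase` correspond to code you would write yourself; in a real cluster the grouping step runs across machines.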
Modeling methods & statistics 
• Basic patterns 
– Recommendations 
Based on known taste, propose items that might be liked as well 
– Clustering 
Detecting correlation groups in data without using pre-defined 
segmentation based on business knowledge 
– Classification 
Automated labeling, acceptance/rejection of data based on 
probability models 
• Supervised & unsupervised learning methods 
– k-means, Naive Bayes, k-nearest neighbors, random forests, 
logistic regression, A priori, ...
Modeling methods: Decision Tree 
• Query: which kind of fruit am I looking at 
– More general: image recognition 
• Clean your data 
– What to do with missing values? 
• Insert average value 
• Insert special value 
• Delete data 
– What to do with outliers? 
• Wrong data?
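The three options for missing values can be sketched as follows. This is a minimal illustration that assumes missing entries are represented as `None`; the function name, the sentinel value `-1.0` and the sample sizes are hypothetical.

```python
from statistics import mean

def fill_missing(values, strategy="average", special=-1.0):
    """Handle missing entries (None) in a numeric column.

    strategy: "average" inserts the mean of the known values,
              "special" inserts a sentinel value,
              "delete"  drops the missing entries.
    """
    known = [v for v in values if v is not None]
    if strategy == "delete":
        return known
    fill = mean(known) if strategy == "average" else special
    return [fill if v is None else v for v in values]

sizes = [7.0, None, 9.0, 8.0, None]  # hypothetical fruit sizes in cm
averaged = fill_missing(sizes, "average")
flagged = fill_missing(sizes, "special")
dropped = fill_missing(sizes, "delete")
```

Which option is appropriate depends on the data: inserting the average keeps the row but hides the gap, while a special value keeps the gap visible to the model.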
Modeling methods: Decision Tree 
• Find most decisive variable 
– Categorical variable: one leaf for each category or one leaf for a 
group of categories 
– Numerical variable: find best cut-off(s) 
Query → Color → Green / Yellow / Red
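One common way to decide which variable is "most decisive" is to pick the split that leaves the purest groups. The sketch below uses Gini impurity for that; the slides do not say which criterion was used, so treat the measure, the function names and the toy fruit rows as assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_impurity(rows, feature, target):
    """Weighted Gini impurity after splitting on a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def most_decisive(rows, features, target):
    """Pick the feature whose split leaves the least impurity."""
    return min(features, key=lambda f: split_impurity(rows, f, target))

# Hypothetical toy data, loosely following the fruit example.
fruits = [
    {"color": "green", "shape": "round", "fruit": "watermelon"},
    {"color": "green", "shape": "round", "fruit": "watermelon"},
    {"color": "yellow", "shape": "round", "fruit": "lemon"},
    {"color": "yellow", "shape": "thin", "fruit": "banana"},
    {"color": "red", "shape": "round", "fruit": "cherry"},
    {"color": "red", "shape": "round", "fruit": "cherry"},
]
best = most_decisive(fruits, ["color", "shape"], "fruit")
```

On this toy data, splitting on color leaves only the yellow group mixed, so color wins over shape. Numerical variables would need an extra loop over candidate cut-offs, which is left out here.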
Modeling methods: Decision Tree 
• For each leaf, repeat the process: 
Size is actually numerical: find size cut-offs 
Query 
  Color = Green → Size 
    Big 
    Medium 
    Small 
  Color = Yellow → Shape 
    Round 
    Thin 
  Color = Red → Size 
    Medium 
    Small
Modeling methods: Decision Tree 
Query 
  Color = Green → Size 
    Big → Watermelon 
    Medium → Green apple 
    Small → Grapes 
  Color = Yellow → Shape 
    Round → Size 
      Big → Grapefruit 
      Medium → Lemon 
    Thin → Banana 
  Color = Red → Size 
    Medium → Red apple 
    Small → Try it 
      Sweet → Cherry 
      Sour → Grape
Modeling methods: Decision Tree - Distributed 
• A big advantage of big data tools is their distributed 
processing power (run processes in parallel) 
• Build your decision tree 
– Each leaf can be processed by another node 
– All your data should still be available to every mapper 
• Upgrading your decision tree 
– Bagging trees (sampling your data) 
– Random Forest (sampling your variables) 
– Every mapper should only read a part of your data 
– In general, still better results than a single decision tree
Modeling methods: Decision Tree 
• QUESTION: Can we predict whether a customer will place an 
order during this web session? 
[Figure: example decision tree with splits on Date_added > 1.5, Hour_added > 16.29 and Date_added < 5.113; leaf values 0.06, 0.1136, 0.1829 and 0.3273] 
• Modeling (data mining) 
– Input: historical surfing information 
– Decision tree algorithm 
• Look at historical data 
• Find most decisive variable 
• For each leaf, repeat 
– Avoid overfitting! 
• Runtime usage 
– Pass the current session info through the tree model 
– Allow certain discounts to increase conversion? 
– Put user on checkout or in-store after putting product in basket?
Modeling methods: Naive Bayes 
• QUESTION: Will I play tennis today? 
• Start with labeled data from the past 
Again clean your data! 
• Often used with plain text 
• Assumes that each variable is independent from all others 
• Named after Bayes rule (statistics)
Modeling methods: Naive Bayes 
Day  Outlook   Temperature  Humidity  Wind    PlayTennis 
D1   Sunny     Hot          High      Weak    No 
D2   Sunny     Hot          High      Strong  No 
D3   Overcast  Hot          High      Weak    Yes 
D4   Rain      Mild         High      Weak    Yes 
D5   Rain      Cool         Normal    Weak    Yes 
D6   Rain      Cool         Normal    Strong  No 
D7   Overcast  Cool         Normal    Strong  Yes 
D8   Sunny     Mild         High      Weak    No 
D9   Sunny     Cool         Normal    Weak    Yes 
D10  Rain      Mild         Normal    Weak    Yes 
D11  Sunny     Mild         Normal    Strong  Yes 
D12  Overcast  Mild         High      Strong  Yes 
D13  Overcast  Hot          Normal    Weak    Yes 
D14  Rain      Mild         High      Strong  No
Modeling methods: Naive Bayes 
• Consider the PlayTennis problem and a new instance (sun, cool, high, strong)
Modeling methods: Naive Bayes 
• Estimate parameters 
– P(yes) = 9/14 P(no) = 5/14 
– P(Wind=strong|yes) = 3/9 
– P(Wind=strong|no) = 3/5 
– … 
• We have 
P(y)P(sun|y)P(cool|y)P(high|y)P(strong|y) = 0.005 
P(n)P(sun|n)P(cool|n)P(high|n)P(strong|n) = 0.021 
• Therefore this new instance is classified as “no”
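The two products can be checked by counting directly in the 14-row PlayTennis table. The sketch below does exactly that, without any smoothing; the function and variable names are illustrative.

```python
from collections import Counter

# The 14 training rows from the PlayTennis table:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def naive_bayes_scores(rows, instance):
    """Return P(label) * product of P(value | label) for each label."""
    label_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in label_counts.items():
        score = n_label / len(rows)  # prior, e.g. P(yes) = 9/14
        for i, value in enumerate(instance):
            matches = sum(1 for r in rows if r[-1] == label and r[i] == value)
            score *= matches / n_label  # e.g. P(Wind=strong | yes) = 3/9
        scores[label] = score
    return scores

scores = naive_bayes_scores(DATA, ("Sunny", "Cool", "High", "Strong"))
prediction = max(scores, key=scores.get)
```

Rounded to three decimals the scores are 0.005 for "Yes" and 0.021 for "No", so the instance is classified as "No". The scores are not normalised probabilities; only their relative size matters for the decision.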
Modeling methods: Naive Bayes - distributed 
• Vectorisation of training data (more or less wordcount) can 
easily be distributed: 
– Each text goes to one mapper 
– Even when dealing with a large text → cut your text into pieces 
– Every small block of data is read only once, by one mapper 
• Vectorisation of your new instance 
• Actual prediction is a multiplication of all conditional probabilities 
→ the calculation of the prediction is also easy to distribute
Modeling methods: Naive Bayes 
• QUESTION: Can we route incoming questions (free text) to the 
right person/department? 
• Modeling (data mining) 
– Input: historical information questions and handling person/department 
– Naive Bayes algorithm 
• For each word or n-gram (2 or 3 words) – count occurrences per document 
• Very valuable are words with high frequency in a single document 
• Very valuable are words only used in a small number of documents 
• Remove stopwords, generic words, etc… 
• Runtime usage 
– Vectorize incoming document (which words/n-grams occur how many 
times?) 
– Predict category based on comparison with historical documents
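The two "very valuable" criteria above (frequent within one document, rare across documents) match what is commonly called TF-IDF weighting. The slides do not name it, so the sketch below is one possible reading; the stopword list, function names and sample questions are hypothetical, and only single words (no n-grams) are counted.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "my", "to", "i", "for"}  # hypothetical, tiny list

def tokenize(text):
    """Lower-case, split on whitespace, drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    weight = (count of word in this document)
             * log(N / number of documents containing the word)
    """
    tokenized = [tokenize(d) for d in documents]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(documents)
    return [
        {w: c * math.log(n_docs / doc_freq[w]) for w, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

# Hypothetical incoming questions.
questions = [
    "my invoice is wrong invoice number missing",
    "the delivery is late",
    "question for the invoice department",
]
weights = tf_idf(questions)
```

A word such as "invoice" gains weight by appearing twice in the first question but loses some because it also appears in the third; a word that occurs in every document would get weight zero.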
Modeling methods: k-means Clustering 
• QUESTION: Which countries have the same type of food 
consumption? 
• Your data is not labeled! 
• You define labels for your clusters after applying the cluster 
algorithm 
• Choose the number of clusters you are expecting 
– Try different numbers of clusters 
– Run an algorithm to decide the optimal number of clusters 
• Plot your final results mapped on your principal components
Modeling methods: k-means Clustering 
Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg 
1 Albania 10.1 1.4 0.5 8.9 0.2 42.3 0.6 5.5 1.7 
2 Austria 8.9 14.0 4.3 19.9 2.1 28.0 3.6 1.3 4.3 
3 Belgium 13.5 9.3 4.1 17.5 4.5 26.6 5.7 2.1 4.0 
4 Bulgaria 7.8 6.0 1.6 8.3 1.2 56.7 1.1 3.7 4.2 
5 Czechoslovakia 9.7 11.4 2.8 12.5 2.0 34.3 5.0 1.1 4.0 
6 Denmark 10.6 10.8 3.7 25.0 9.9 21.9 4.8 0.7 2.4 
7 E Germany 8.4 11.6 3.7 11.1 5.4 24.6 6.5 0.8 3.6 
8 Finland 9.5 4.9 2.7 33.7 5.8 26.3 5.1 1.0 1.4 
9 France 18.0 9.9 3.3 19.5 5.7 28.1 4.8 2.4 6.5 
10 Greece 10.2 3.0 2.8 17.6 5.9 41.7 2.2 7.8 6.5 
11 Hungary 5.3 12.4 2.9 9.7 0.3 40.1 4.0 5.4 4.2 
12 Ireland 13.9 10.0 4.7 25.8 2.2 24.0 6.2 1.6 2.9 
13 Italy 9.0 5.1 2.9 13.7 3.4 36.8 2.1 4.3 6.7 
14 Netherlands 9.5 13.6 3.6 23.4 2.5 22.4 4.2 1.8 3.7 
15 Norway 9.4 4.7 2.7 23.3 9.7 23.0 4.6 1.6 2.7 
16 Poland 6.9 10.2 2.7 19.3 3.0 36.1 5.9 2.0 6.6 
17 Portugal 6.2 3.7 1.1 4.9 14.2 27.0 5.9 4.7 7.9 
18 Romania 6.2 6.3 1.5 11.1 1.0 49.6 3.1 5.3 2.8 
19 Spain 7.1 3.4 3.1 8.6 7.0 29.2 5.7 5.9 7.2 
20 Sweden 9.9 7.8 3.5 24.7 7.5 19.5 3.7 1.4 2.0 
21 Switzerland 13.1 10.1 3.1 23.8 2.3 25.6 2.8 2.4 4.9 
22 UK 17.4 5.7 4.7 20.6 4.3 24.3 4.7 3.4 3.3 
23 USSR 9.3 4.6 2.1 16.6 3.0 43.6 6.4 3.4 2.9 
24 W Germany 11.4 12.5 4.1 18.8 3.4 18.6 5.2 1.5 3.8 
25 Yugoslavia 4.4 5.0 1.2 9.5 0.6 55.9 3.0 5.7 3.2 
Modeling methods: k-means Clustering 
• Define a metric: take every variable into account as much as all 
other variables 
• Create random starting points (as many as clusters you expect) 
• Assign each point to the closest center (or starting) point 
• Calculate the center of each cluster 
• Iterate the previous two steps
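The steps above can be sketched in plain Python. This is an illustration rather than the implementation behind the slides: it standardises each variable so all count equally, picks random starting points with a fixed seed, and runs a fixed number of iterations instead of testing for convergence. The six sample rows take (RedMeat, Fish, Fr.Veg) from the table above; function names are made up.

```python
import random
from statistics import mean, pstdev

def standardize(points):
    """Scale every variable to mean 0 and standard deviation 1 so each counts equally."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(p, mus, sigmas)) for p in points]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20, seed=0):
    """Return a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting points
    assignment = []
    for _ in range(iterations):
        # Step 1: assign each point to the closest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 2: move each center to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(mean(col) for col in zip(*members))
    return assignment

# (RedMeat, Fish, Fr.Veg) for a few countries from the table above.
countries = ["Albania", "Bulgaria", "Denmark", "Norway", "Portugal", "Spain"]
data = [(10.1, 0.2, 1.7), (7.8, 1.2, 4.2), (10.6, 9.9, 2.4),
        (9.4, 9.7, 2.7), (6.2, 14.2, 7.9), (7.1, 7.0, 7.2)]
labels = k_means(standardize(data), k=3)
```

The result depends on the random starting points, which is one reason to try several runs and several values of `k`.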
Modeling methods: k-means Clustering 
"cluster 1" 
Country RedMeat Fish Fr.Veg 
Albania 10.1 0.2 1.7 
Bulgaria 7.8 1.2 4.2 
Romania 6.2 1.0 2.8 
Yugoslavia 4.4 0.6 3.2 
"cluster 2" 
Country RedMeat Fish Fr.Veg 
Denmark 10.6 9.9 2.4 
Finland 9.5 5.8 1.4 
Norway 9.4 9.7 2.7 
Sweden 9.9 7.5 2.0 
"cluster 3" 
Country RedMeat Fish Fr.Veg 
Czechoslovakia 9.7 2.0 4.0 
E Germany 8.4 5.4 3.6 
Hungary 5.3 0.3 4.2 
Poland 6.9 3.0 6.6 
USSR 9.3 3.0 2.9 
"cluster 4" 
Country RedMeat Fish Fr.Veg 
Austria 8.9 2.1 4.3 
Belgium 13.5 4.5 4.0 
France 18.0 5.7 6.5 
Ireland 13.9 2.2 2.9 
Netherlands 9.5 2.5 3.7 
Switzerland 13.1 2.3 4.9 
UK 17.4 4.3 3.3 
W Germany 11.4 3.4 3.8 
"cluster 5" 
Country RedMeat Fish Fr.Veg 
Greece 10.2 5.9 6.5 
Italy 9.0 3.4 6.7 
Portugal 6.2 14.2 7.9 
Spain 7.1 7.0 7.2
Modeling methods: k-means Clustering - distributed 
• Calculate conditional chances 
– Every mapper only needs one variable 
• Assigning points to clusters: 
– All centers in distributed cache 
– Rest of the data only read once by one mapper 
– Calculate distances and assign to the closest center point 
• Update center points 
– One mapper for each cluster
Modeling methods: k-means Clustering 
• QUESTION: In which different segments can we split our 
customer base? 
• Modeling (data mining) 
– Input: any information on the customers (CRM, ERP, Social Media, …) 
– Very important to find columns to use (requires business knowledge to 
formulate hypotheses!) 
– K-means clustering algorithm 
• Define a “distance” formula to calculate how close two customers are to 
each other 
• Define starting points for each cluster center 
• Iterate and re-allocate customers to a cluster, move cluster centers 
• Runtime usage 
– Quickly check the cluster in which a new customer could be residing
Modeling methods: A priori 
• QUESTION: Which books might be interesting for you, knowing 
which books you have read? 
• Modeling (data mining) 
– Input: all titles of books someone has read 
– Make sure that the same books have the same titles (e.g.: drop the edition from 
the title) 
– A priori algorithm 
• Make baskets of read books, labeled with the reader 
• Identify commonly occurring books 
• Tweak your recommendation rules: 
– Choose a big enough support 
– Confidence of recommendations can be calculated 
– The bigger the lift, the more valuable your recommendation might be for the reader 
• Runtime usage 
– Check if a subset of the books occurs as the left-hand side of a rule
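Support, confidence and lift can be computed directly from the baskets. The sketch below is limited to rules with a single book on each side, so it is a simplified stand-in for the full A priori algorithm; the basket contents, the function name and the 0.5 support threshold are hypothetical.

```python
from itertools import permutations

def pair_rules(baskets, min_support=0.5):
    """Return (lhs, rhs, support, confidence, lift) for rules of the form lhs -> rhs."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    support = {i: sum(i in b for b in baskets) / n for i in items}
    rules = []
    for lhs, rhs in permutations(items, 2):
        both = sum(lhs in b and rhs in b for b in baskets) / n
        if both < min_support:
            continue  # the pair does not occur together often enough
        confidence = both / support[lhs]  # P(rhs | lhs)
        lift = confidence / support[rhs]  # > 1: together more often than by chance
        rules.append((lhs, rhs, both, confidence, lift))
    return rules

# Hypothetical reading lists, one set of titles per reader.
baskets = [
    {"Book A", "Book A part 2"},
    {"Book A", "Book A part 2", "Book B"},
    {"Book A", "Book B"},
    {"Book B", "Book C"},
    {"Book A", "Book A part 2", "Book C"},
]
rules = pair_rules(baskets)
```

With these baskets only the book and its sequel pass the support threshold, and the rule appears in both directions with different confidence values but the same lift.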
Modeling methods: A priori 
• Data consists of books bought online 
• There were more than 40000 users buying more than one book (If they only 
bought one book, they are not useful to make your model) 
• In total they bought more than 220000 books 
• Notice the permutations in the rules 
• As you might expect, sequel books are bought together
Modeling methods: A priori - distributed 
• Make list of books bought together (training data) 
– Similar to n-grams (Naïve Bayes) 
– Every customer only read once by one mapper 
• Make recommendations 
– Every mapper handles a number of rules
Modeling methods: A priori 
• QUESTION: Which ads can I show on a website? 
• Modeling (data mining) 
– Input: All visited links, all bought items, … 
– Decide what you think is important: you want to show items others were 
also interested in, items others also bought, …. 
– A priori algorithm 
• Find items which occur together 
• Define the support, confidence and lift you want 
• Runtime usage 
– Check if a subset of the visited links occurs as the left-hand side of a rule
Case study 
End: Wrap up & Lunch 

Actionable Analytics - Solving Real World Problems With Big Data, Xerox Innov...Innovation Enterprise
 
S ba0881 big-data-use-cases-pearson-edge2015-v7
S ba0881 big-data-use-cases-pearson-edge2015-v7S ba0881 big-data-use-cases-pearson-edge2015-v7
S ba0881 big-data-use-cases-pearson-edge2015-v7Tony Pearson
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
big data analytics pgpmx2015
big data analytics pgpmx2015big data analytics pgpmx2015
big data analytics pgpmx2015Sanmeet Dhokay
 




Introduction to (Big) Data Science

  • 1. Data Science Company Introduction to (big) data science Infofarm - Seminar Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be 30/09/2014
  • 2. Agenda • About us • What is Data Science? • Data Science in practice – Models – Tools • Case study
  • 3. About us
  • 4. InfoFarm - Company • Data Science and BigData startup • Part of the Cronos group – Largest independent IT services supplier in Belgium – Organized in limited-sized, highly focused competence centers – 3000+ Consultants • Incubated at Xplore Group, within the context of: – Java – PHP – e-commerce (Hybris, Intershop, Magento, DrupalCommerce, ...) – Mobile development (iOS, Android, ...) – Web development (HTML5, CSS3, ...)
  • 5. InfoFarm - Team • Mixed skills team – 2 Data Scientists • Mathematics • Statistics – 4 BigData Consultants – 1 Infra specialist – n Cronos colleagues with various backgrounds • Certifications – CCDH - Cloudera Certified Developer for Apache Hadoop – CCAH - Cloudera Certified Administrator for Apache Hadoop – OCJP – Oracle Certified Java Programmer
  • 6. InfoFarm - Focus • Mission – “Help our customers to excel in their business activities by providing them with new information and insights of high business value. Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.” • Focus Domains – Data Science – Machine Learning – Big Data
  • 7. Introduction: what is Data Science?
  • 8. What is Data Science? • Data Science & Business decisions • Data Science vs … – Statistics – Business Intelligence – Big Data • What can Data Science do for your business? • The Data Science maturity model
  • 9. Business decisions • Any business requires continuous decision making – Will we offer this customer a discount or not? – Do we need to keep extra stock for product X? – How do we answer this customer question? – At which supplier do we buy this product? – With which solution will we respond to this RFP? – Do we need to replace device X? – … • The possible answers to these questions are based on prior experience with the business • Each decision can turn out to be right or wrong; business knowledge should help avoid the wrong ones
  • 10. Business decisions – However … • Do you really know your business that well? • Hasn’t it evolved in this fast-changing world? • Are you sure your competitors aren’t making better decisions? – You probably own a lot more information than you might realize! • All your business processes are generating data which you can use to your advantage! • Quotes you made vs deals you won • Historical sales records • Web logs showing user activity • Social media activity referring to your brand/product • Metering info on devices (internet of things) • …
  • 11. Types of Data – Proprietary data • ERP, CRM, Orders, Customers, Products, etc… – “Dark Data”: currently unused, possibly without the company even being aware of it • Unknown, but present in the company • Cost-efficient BigData tools might enable business cases using this data – External data • Websites, social media, open data, … – Data still to be captured • “If only we knew X or Y” … – There might be a huge added value in “mashing up” proprietary data with public/open data!
  • 12. Business Knowledge vs Data Science (Intuitive knowledge vs data-driven decisions) • Business Knowledge – Acquired by experience – (Assumed) insights – RISK: too much bias towards past experience and gut feeling • Data Science – Complementary to business knowledge – Confirmative or new insights – Data-driven decision making – RISK: too naive data interpretation, disconnected from the business
  • 13. Business Knowledge vs Data Science (Intuitive knowledge vs data-driven decisions)
  • 14. Business decisions: marketing example • Example: We want to send mailings about our new product • Decisions to make: – Which mail to send to which customers? – We need customer segmentation! • Risks in failing to do this correctly – Missing opportunities (not informing customers) – Annoying customers with irrelevant mailings (churn, reputation damage, …)
  • 15. Business decisions: marketing example • Business knowledge based approach – “We know our segments: -25y, 25y-35y, 35y+ groups, and male/female” – But is this (still) true? – E.g.: do we really want to send an ad for the new iPhone to a long-time Android user because he’s a 30-something male customer?
  • 16. Business decisions: marketing example • Data-driven approach: Can we identify different segments automatically? (machine learning!) – WEB SERVER LOGS Which customers have already looked at similar products on our website? – ORDER HISTORY Which customers own complementary products? – CRM INFORMATION What is the typical profile of a customer that clicked through on the last e-mail campaign for a similar product? – … • Business knowledge and Data Science become input and output for each other! – Ideas/hypotheses and data to be examined should be identified from business knowledge! – A/B testing can be applied to test approaches and check results – Let the data speak for itself! New business insights are generated
  • 17. Being a Data Scientist • “Data Scientist – the sexiest job of the 21st century” - Thomas H. Davenport • Data Scientist: “A person who is better at statistics than any software engineer and better at software engineering than any statistician” - Josh Wills
  • 18. Data Science = team work!
  • 19. Data Science vs Statistics • Basic Statistics concepts – Reliability and validity – Probability – Descriptive statistics and graphics • Inferential statistics (and hypothesis testing) – Probability distributions – Populations and samples – Confidence intervals – Correlation • Data Science – Link with IT (tooling, scale, …) – Data preparation & hacking (get data from databases, websites, …) – Machine learning and automation – Working interactively together with business
  • 20. Data Science vs Business Intelligence • Basic BI concepts: structuring data to report on and query it – DWH, OLAP, ETL processes – Star and snowflake schemas – Query-oriented architectures – Close to typical IT development cycle • Data Science: working and experimenting with data to gain insights – Exploratory working – Work in a research cycle rather than development cycle – Limited investment towards analysis that might or might not deliver – Tools designed to avoid heavy ETL (loosely structured data) – Eventually valuable analyses can be ported to BI systems
  • 21. Data Science vs Business Intelligence • Using tools that are designed to support exploratory working – Not requiring strict up-front schema design – Allowing fast and cheap hypothesis testing – Opening up opportunities to quickly integrate many data sources • Excel files, Text files, Word Documents • Log files • Relational databases • Sensor data • Timeseries data • ... • Integrations with online (OLTP) and analytical (OLAP/BI) systems – Typically for automating repetitive analysis and reporting outputs
  • 22. Data Science vs Big Data • Process of statistical inference: sampling & induction • BigData allows: – N=ALL (avoid sampling errors) • Sampling issues can be overcome by just processing ALL available data (process massive data) – N=1 (avoid issues with non-homogeneous datasets) • Categorization becomes true personalisation: project towards ONE individual (calculate per item) • Significance considerations are not applicable!
  • 23. What can Data Science do for your business? • Extract meaning from data – Using and combining data in ways that have never been done before – Finding patterns and correlations in data from all possible sources – Detecting anomalies and changes in known patterns • Transform data of various types into valuable information – As a basis for management decisions – As a basis for data products – That can improve your business in any way • Build and integrate Data Products – Recommendation engines, Prediction models, Automated classification, … • The key point is spotting opportunities to outperform your competitors using any data available!
  • 24. Scientific cycle • Question → Hypothesis → Experiment (data) → Analyse results → Conclusion • This is NOT a development cycle! • Experimentation vs engineering • Because it is a science, the outcome cannot be predicted • This makes it hard to integrate into an IT development process
  • 25. Scientific cycle • Take small steps • Formulate hypotheses • Actually build things • Apply A/B testing • Even without success, you learned something!
  • 26. The Data Science maturity model • Don’t run before you can walk: each level builds on the quality of the underlying step. It’s science, not magic … – Start off by simply collecting the data you need (type, quantity, quality) – Then report on your current business (confirmative analysis) – Discover new and valuable information (exploratory analysis) – Build and test prediction models (predictive analysis) – Steer your business based on advice from your predictions (data-driven) • Levels: Collect → Describe → Discover → Predict → Advise
  • 27. The Data Science maturity model (Phase: Actions / Examples in commerce) • Collect – Actions: Logging information; Gathering data from different sources – Examples in commerce: Logging user actions on a website; Using loyalty cards to id customers • Describe – Actions: Explorative Data Analysis; Basic analytical functions – Examples in commerce: Checking quantity and quality of data; Typical reporting; Correlating data over sources • Discover – Actions: Finding correlations; Building models – Examples in commerce: Finding similarly behaving customers • Predict – Actions: Building prediction models; Formulating expectations for the future based on past info – Examples in commerce: Predict sales figures for a new product; Predict whether a certain customer will or will not buy a certain product • Advise – Actions: Use prediction models to evaluate decision possibilities and pick the best – Examples in commerce: Target advertising to the right customer groups to optimize revenue
  • 28. Data Science in practice
  • 29. Overview • Tools: R, Hive, Pig • Modeling methods & statistics: Decision trees, Naive Bayes, Regression, Nearest Neighbor, K-means clustering, Apriori, …
  • 30. Tools – Data Science • Analytics: R • Visualisation: Shiny • Docs: Markdown • Data retrieval – CSV, TAB, ... files – Apache Hive • Data processing – Apache Pig • Open Source based
  • 31. Tools – Machine Learning • Apache Mahout • Apache Spark MLlib • R • Open Source based
  • 32. Tools - BigData • Hadoop – HDFS – MapReduce – Pig – Hive – Oozie – Impala – ... • Spark – Shark, SparkR • Platforms – Open Source Apache Hadoop – CDH - Cloudera (partnership at Cronos level) – HDP – Hortonworks Data Platform
  • 33. Tools - HDFS
  • 34. Tools – MapReduce: Wordcount • Stages: Input → Splitting → Mapping → Shuffling → Reducing → Output • Mapping and Reducing are your own code; the framework handles the other stages
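
The wordcount stages named on this slide can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code: the function names and the sample text are made up for illustration, and a real framework would run the map and reduce steps on different machines.

```python
from collections import defaultdict


def split_input(text):
    """Splitting: one record per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]


def map_words(line):
    """Mapping (your code): emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffling (framework): group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reducing (your code): sum the counts for one word."""
    return key, sum(values)


def word_count(text):
    mapped = [pair for line in split_input(text) for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input, three lines that would become three splits.
counts = word_count("deer bear river\ncar car river\ndeer car bear")
print(counts)
```

Only `map_words` and `reduce_counts` correspond to code a developer would write; the rest stands in for what the framework does.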
  • 35. Modeling methods & statistics • Basic patterns – Recommendations: based on known taste, propose items that might be liked as well – Clustering: detecting correlation groups in data without using pre-defined segmentation based on business knowledge – Classification: automated labeling, acceptance/rejection of data based on probability models • Supervised & unsupervised learning methods – k-means, Naive Bayes, k-nearest neighbors, random forests, logistic regression, Apriori, ...
  • 36. Modeling methods: Decision Tree • Query: which kind of fruit am I looking at? – More general: image recognition • Clean your data – What to do with missing values? • Insert average value • Insert special value • Delete data – What to do with outliers? • Wrong data?
  • 37. Modeling methods: Decision Tree • Find most decisive variable – Categorical variable: one leaf for each category or one leaf for a group of categories – Numerical variable: find best cut-off(s) • (Figure: Query → Color, with branches Green, Yellow, Red)
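
One common way to make "most decisive variable" concrete is information gain: pick the variable whose split reduces the entropy of the labels the most. The slides do not say which criterion was used, so treat the criterion, the function names and the tiny fruit table below as assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy reduction obtained by splitting rows on one categorical feature."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def most_decisive(rows, features, target):
    """Return the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))


# Hypothetical training rows, loosely following the fruit example.
fruits = [
    {"color": "green", "shape": "round", "fruit": "watermelon"},
    {"color": "green", "shape": "round", "fruit": "apple"},
    {"color": "yellow", "shape": "round", "fruit": "lemon"},
    {"color": "yellow", "shape": "thin", "fruit": "banana"},
    {"color": "red", "shape": "round", "fruit": "cherry"},
    {"color": "red", "shape": "round", "fruit": "cherry"},
]
best = most_decisive(fruits, ["color", "shape"], "fruit")
print(best)
```

On this toy table, colour separates the fruits better than shape, which matches the first split in the slide's tree. For a numerical variable the same gain would be evaluated for each candidate cut-off.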
  • 38. Modeling methods: Decision Tree • For each leaf, repeat the process – Size is actually numerical: find size cut-offs • (Figure: Query → Color; Green → Size (Big, Medium, Small); Yellow → Shape (Round, Thin); Red → Size (Medium, Small))
  • 39. Modeling methods: Decision Tree • (Figure: the completed tree)
      Query → Color
        Green → Size
          Big → Watermelon
          Medium → Green apple
          Small → Grapes
        Yellow → Shape
          Round → Size
            Big → Grapefruit
            Medium → Lemon
          Thin → Banana
        Red → Size
          Medium → Red apple
          Small → Try it
            Sweet → Cherry
            Sour → Grape
  • 40. Modeling methods: Decision Tree - Distributed • A big advantage of big data tools is their distributed processing power (run processes in parallel) • Build your decision tree – Each leaf can be processed by another node – All your data should still be available to every mapper • Upgrading your decision tree – Bagging trees (sampling your data) – Random Forest (sampling your variables) – Every mapper should only read a part of your data – In general still better results than a single decision tree
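
The two sampling ideas on this slide can be sketched as follows. This only prepares the inputs each tree would be trained on; it does not train any trees. All names are hypothetical, and the variable sampling is simplified to once per tree, whereas common Random Forest implementations resample the variables at every split.

```python
import random


def bootstrap_sample(rows, rng):
    """Bagging: draw as many rows as the original set, with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, k, rng):
    """Random Forest: restrict a tree to k randomly chosen variables."""
    return rng.sample(features, k)


def plan_forest(rows, features, n_trees, k, seed=0):
    """Return, for each tree, the rows and variables it would be trained on."""
    rng = random.Random(seed)
    return [
        {
            "rows": bootstrap_sample(rows, rng),
            "features": random_feature_subset(features, k, rng),
        }
        for _ in range(n_trees)
    ]


# Hypothetical data: 10 row ids and 4 variable names.
plan = plan_forest(list(range(10)), ["color", "size", "shape", "taste"], n_trees=3, k=2)
for tree_input in plan:
    print(tree_input["features"], tree_input["rows"])
```

Because every tree only needs its own sample, each entry of `plan` could be handed to a different worker.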
  • 41. Modeling methods: Decision Tree • QUESTION: Can we predict whether a customer will place an order during this web session? • Modeling (data mining) – Input: historical surfing information – Decision tree algorithm • Look at historical data • Find most decisive variable • For each leaf, repeat – Avoid overfitting! • Runtime usage – Pass current info into the tree model – Allow certain discounts to increase conversion? – Put user on checkout or in-store after putting product in basket? • (Figure: decision tree with splits Date_added > 1.5, Hour_added > 16.29 and Date_added < 5.113, and leaf values 0.06, 0.1136, 0.1829 and 0.3273)
  • 42. Modeling methods: Naive Bayes • QUESTION: Will I play tennis today? • Start with labeled data from the past – Again: clean your data! • Often used with plain text • Assumes that each variable is independent of all others • Named after Bayes' rule (statistics)
  • 43. Modeling methods: Naive Bayes
      Day  Outlook   Temperature  Humidity  Wind    PlayTennis
      D1   Sunny     Hot          High      Weak    No
      D2   Sunny     Hot          High      Strong  No
      D3   Overcast  Hot          High      Weak    Yes
      D4   Rain      Mild         High      Weak    Yes
      D5   Rain      Cool         Normal    Weak    Yes
      D6   Rain      Cool         Normal    Strong  No
      D7   Overcast  Cool         Normal    Strong  Yes
      D8   Sunny     Mild         High      Weak    No
      D9   Sunny     Cool         Normal    Weak    Yes
      D10  Rain      Mild         Normal    Weak    Yes
      D11  Sunny     Mild         Normal    Strong  Yes
      D12  Overcast  Mild         High      Strong  Yes
      D13  Overcast  Hot          Normal    Weak    Yes
      D14  Rain      Mild         High      Strong  No
  • 44. Modeling methods: Naive Bayes • Consider the PlayTennis problem and a new instance (sun, cool, high, strong)
  • 45. Modeling methods: Naive Bayes • Estimate parameters – P(yes) = 9/14, P(no) = 5/14 – P(Wind=strong|yes) = 3/9 – P(Wind=strong|no) = 3/5 – … • We have – P(y)P(sun|y)P(cool|y)P(high|y)P(strong|y) = 0.005 – P(n)P(sun|n)P(cool|n)P(high|n)P(strong|n) = 0.021 • Therefore this new instance is classified as “no”
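The two products on this slide can be recomputed directly from the PlayTennis table. The sketch below counts frequencies per class and multiplies them, using exact fractions; it applies no smoothing, so a value never seen with a class would give a score of zero. The function name `score` is illustrative.

```python
from fractions import Fraction

# PlayTennis rows: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def score(instance, label):
    """P(label) times the product of P(value | label) over all variables."""
    rows = [r for r in DATA if r[-1] == label]
    p = Fraction(len(rows), len(DATA))
    for i, value in enumerate(instance):
        p *= Fraction(sum(1 for r in rows if r[i] == value), len(rows))
    return p

new_day = ("Sunny", "Cool", "High", "Strong")
scores = {label: score(new_day, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
```

The scores are not normalised probabilities; only their relative size matters for choosing the class.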
  • 46. Modeling methods: Naive Bayes - distributed • Vectorisation of training data (more or less a word count) can easily be distributed: – Each text goes to one mapper – Even when dealing with a large text → cut your text into pieces – Every small block of data is read only once, by one mapper • Vectorisation of your new instance • The actual prediction is a multiplication of all conditional probabilities → calculating the prediction is also easy to distribute
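The word-count style vectorisation can be imitated in a single process to show why it distributes well. This is a sketch only: `map_chunk` and `reduce_counts` are hypothetical names standing in for a mapper and a reducer, the chunks are made-up text, and no actual cluster framework is involved.

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Mapper: count the words in one block of text, read exactly once."""
    return Counter(chunk.lower().split())

def reduce_counts(a, b):
    """Reducer: merge two partial word counts."""
    return a + b

# Hypothetical text, already cut into pieces
chunks = ["Invoice question about my invoice", "question about delivery"]
total = reduce(reduce_counts, map(map_chunk, chunks), Counter())
```

Since each mapper only needs its own chunk, the chunks can be processed in any order and on any node before the partial counts are merged.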
  • 47. Modeling methods: Naive Bayes • QUESTION: Can we route incoming questions (free text) to the right person/department? • Modeling (data mining) – Input: historical information on questions and the handling person/department – Naive Bayes algorithm • For each word or n-gram (2 or 3 words): count occurrences per file • Very valuable are words with a high frequency in a single document • Very valuable are words used in only a small number of documents • Remove stopwords, generic words, etc. • Runtime usage – Vectorize the incoming document (which words/n-grams occur how many times?) – Predict the category based on comparison with historical documents
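The two "very valuable" criteria (frequent within one document, rare across documents) are what TF-IDF weighting expresses. The slide does not name TF-IDF, so using it here is an assumption; the stopword list and the example questions are also made up.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "my", "is", "to", "of"}  # hypothetical, far from complete

def tokens(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf_idf(docs):
    """Weight each word by its count in the document times log(N / document frequency)."""
    counts = [Counter(tokens(d)) for d in docs]
    n = len(docs)
    df = Counter(w for c in counts for w in c)  # in how many documents each word occurs
    return [{w: tf * math.log(n / df[w]) for w, tf in c.items()} for c in counts]

docs = ["my invoice is wrong", "invoice for the delivery", "delivery is late late"]
weights = tf_idf(docs)
```

In the third document "late" gets a higher weight than "delivery": it occurs twice there and in no other document.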
  • 48. Modeling methods: k-means Clustering • QUESTION: Which countries have the same type of food consumption? • Your data is not labeled! • You define labels for your clusters after applying the clustering algorithm • Choose the number of clusters you expect – Try different numbers of clusters – Run an algorithm to decide the optimal number of clusters • Plot your final results mapped onto your principal components
  • 49. Modeling methods: k-means Clustering
    #   Country         RedMeat  WhiteMeat  Eggs  Milk  Fish  Cereals  Starch  Nuts  Fr.Veg
    1   Albania            10.1        1.4   0.5   8.9   0.2     42.3     0.6   5.5     1.7
    2   Austria             8.9       14.0   4.3  19.9   2.1     28.0     3.6   1.3     4.3
    3   Belgium            13.5        9.3   4.1  17.5   4.5     26.6     5.7   2.1     4.0
    4   Bulgaria            7.8        6.0   1.6   8.3   1.2     56.7     1.1   3.7     4.2
    5   Czechoslovakia      9.7       11.4   2.8  12.5   2.0     34.3     5.0   1.1     4.0
    6   Denmark            10.6       10.8   3.7  25.0   9.9     21.9     4.8   0.7     2.4
    7   E Germany           8.4       11.6   3.7  11.1   5.4     24.6     6.5   0.8     3.6
    8   Finland             9.5        4.9   2.7  33.7   5.8     26.3     5.1   1.0     1.4
    9   France             18.0        9.9   3.3  19.5   5.7     28.1     4.8   2.4     6.5
    10  Greece             10.2        3.0   2.8  17.6   5.9     41.7     2.2   7.8     6.5
    11  Hungary             5.3       12.4   2.9   9.7   0.3     40.1     4.0   5.4     4.2
    12  Ireland            13.9       10.0   4.7  25.8   2.2     24.0     6.2   1.6     2.9
    13  Italy               9.0        5.1   2.9  13.7   3.4     36.8     2.1   4.3     6.7
    14  Netherlands         9.5       13.6   3.6  23.4   2.5     22.4     4.2   1.8     3.7
    15  Norway              9.4        4.7   2.7  23.3   9.7     23.0     4.6   1.6     2.7
    16  Poland              6.9       10.2   2.7  19.3   3.0     36.1     5.9   2.0     6.6
    17  Portugal            6.2        3.7   1.1   4.9  14.2     27.0     5.9   4.7     7.9
    18  Romania             6.2        6.3   1.5  11.1   1.0     49.6     3.1   5.3     2.8
    19  Spain               7.1        3.4   3.1   8.6   7.0     29.2     5.7   5.9     7.2
    20  Sweden              9.9        7.8   3.5  24.7   7.5     19.5     3.7   1.4     2.0
    21  Switzerland        13.1       10.1   3.1  23.8   2.3     25.6     2.8   2.4     4.9
    22  UK                 17.4        5.7   4.7  20.6   4.3     24.3     4.7   3.4     3.3
    23  USSR                9.3        4.6   2.1  16.6   3.0     43.6     6.4   3.4     2.9
    24  W Germany          11.4       12.5   4.1  18.8   3.4     18.6     5.2   1.5     3.8
    25  Yugoslavia          4.4        5.0   1.2   9.5   0.6     55.9     3.0   5.7     3.2
  • 50. Modeling methods: k-means Clustering • Define a metric: give every variable the same weight as all other variables • Create random starting points (as many as the number of clusters you expect) • Assign each point to the closest center (or starting) point • Calculate the center of each cluster • Repeat the previous two steps until the centers stop moving
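The steps above can be sketched in plain Python. To keep the run reproducible, this sketch takes the starting centers as an argument instead of drawing them at random, uses unscaled Euclidean distance, and clusters on only two columns (RedMeat, Fish) of six countries from the food table; all three choices are simplifying assumptions.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate between assigning points and recomputing centers until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster; an empty cluster keeps its old center
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# (RedMeat, Fish) for Albania, Bulgaria, Romania, Denmark, Norway, Sweden
points = [(10.1, 0.2), (7.8, 1.2), (6.2, 1.0), (10.6, 9.9), (9.4, 9.7), (9.9, 7.5)]
centers, clusters = kmeans(points, [points[0], points[3]])
```

With these inputs the three Balkan countries end up in one cluster and the three Nordic countries in the other, separated mainly by fish consumption.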
  • 51. Modeling methods: k-means clustering
  • 52. Modeling methods: k-means Clustering
  • 53. Modeling methods: k-means Clustering
    "cluster 1"
    Country         RedMeat  Fish  Fr.Veg
    Albania            10.1   0.2     1.7
    Bulgaria            7.8   1.2     4.2
    Romania             6.2   1.0     2.8
    Yugoslavia          4.4   0.6     3.2
    "cluster 2"
    Country         RedMeat  Fish  Fr.Veg
    Denmark            10.6   9.9     2.4
    Finland             9.5   5.8     1.4
    Norway              9.4   9.7     2.7
    Sweden              9.9   7.5     2.0
    "cluster 3"
    Country         RedMeat  Fish  Fr.Veg
    Czechoslovakia      9.7   2.0     4.0
    E Germany           8.4   5.4     3.6
    Hungary             5.3   0.3     4.2
    Poland              6.9   3.0     6.6
    USSR                9.3   3.0     2.9
    "cluster 4"
    Country         RedMeat  Fish  Fr.Veg
    Austria             8.9   2.1     4.3
    Belgium            13.5   4.5     4.0
    France             18.0   5.7     6.5
    Ireland            13.9   2.2     2.9
    Netherlands         9.5   2.5     3.7
    Switzerland        13.1   2.3     4.9
    UK                 17.4   4.3     3.3
    W Germany          11.4   3.4     3.8
    "cluster 5"
    Country         RedMeat  Fish  Fr.Veg
    Greece             10.2   5.9     6.5
    Italy               9.0   3.4     6.7
    Portugal            6.2  14.2     7.9
    Spain               7.1   7.0     7.2
  • 54. Modeling methods: k-means Clustering - distributed • Calculate conditional probabilities – Every mapper only needs one variable • Assigning points to clusters: – All centers in the distributed cache – The rest of the data is read only once, by one mapper – Calculate distances and assign each point to the closest center point • Update center points – One mapper for each cluster
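One iteration of the assign-and-update cycle can be imitated in a single process. In this sketch each "mapper" receives one block of points plus the full list of centers (the role of the distributed cache) and returns partial sums per cluster; a final step merges them into new centers. The function names and the toy points are hypothetical, and merging partial sums is one possible design, not necessarily the one used by the presenters.

```python
import math
from collections import defaultdict

def map_block(block, centers):
    """Mapper: per-cluster coordinate sums and counts for one block of points."""
    partial = defaultdict(lambda: [[0.0] * len(centers[0]), 0])
    for p in block:
        cid = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        sums = partial[cid][0]
        for d, v in enumerate(p):
            sums[d] += v
        partial[cid][1] += 1
    return partial

def reduce_centers(partials, centers):
    """Merge the partial sums of all mappers and compute the new centers."""
    totals = {i: [[0.0] * len(centers[0]), 0] for i in range(len(centers))}
    for partial in partials:
        for cid, (sums, count) in partial.items():
            for d, v in enumerate(sums):
                totals[cid][0][d] += v
            totals[cid][1] += count
    # A cluster that received no points keeps its previous center
    return [
        tuple(s / n for s in sums) if n else centers[cid]
        for cid, (sums, n) in sorted(totals.items())
    ]

blocks = [[(1.0, 1.0), (9.0, 9.0)], [(2.0, 2.0), (10.0, 10.0)]]
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = reduce_centers([map_block(b, centers) for b in blocks], centers)
```

Each block is read once, and only the small per-cluster sums travel between workers, not the points themselves.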
  • 55. Modeling methods: k-means Clustering • QUESTION: Into which different segments can we split our customer base? • Modeling (data mining) – Input: any information on the customers (CRM, ERP, social media, …) – Very important: find the right columns to use (requires business knowledge to formulate hypotheses!) – K-means clustering algorithm • Define a “distance” formula to calculate how close two customers are to each other • Define starting points for each cluster center • Iterate: re-allocate customers to a cluster, then move the cluster centers • Runtime usage – Quickly check which cluster a new customer falls into
  • 56. Modeling methods: Apriori • QUESTION: Which books might be interesting for you, knowing which books you have read? • Modeling (data mining) – Input: all titles of books someone has read – Make sure that the same books have the same titles (e.g. drop the edition from the title) – Apriori algorithm • Make baskets of read books, labeled with the reader • Identify commonly occurring books • Tweak your recommendation rules: – Choose a large enough support – The confidence of recommendations can be calculated – The bigger the lift, the more valuable your recommendation might be for the reader • Runtime usage – Check if a subset of the books occurs as the left-hand side of a rule
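Support, confidence and lift of a single rule can be computed directly from the baskets, using their standard definitions. The sketch below does only that; it does not perform the candidate generation of the full algorithm, and the book titles are made up.

```python
def support(baskets, items):
    """Fraction of baskets that contain all the given items."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / len(baskets)

def rule_metrics(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs."""
    s = support(baskets, set(lhs) | set(rhs))
    confidence = s / support(baskets, lhs)
    lift = confidence / support(baskets, rhs)
    return s, confidence, lift

# Hypothetical baskets, one set of read books per reader
baskets = [
    {"Saga part 1", "Saga part 2"},
    {"Saga part 1", "Saga part 2", "Cookbook"},
    {"Saga part 1", "Cookbook"},
    {"Cookbook", "Travel guide"},
]
s, confidence, lift = rule_metrics(baskets, {"Saga part 1"}, {"Saga part 2"})
```

A lift above 1 means readers of the left-hand side pick up the right-hand side more often than readers in general do.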
  • 57. Modeling methods: Apriori • Data consists of books bought online • More than 40000 users bought more than one book (users who bought only one book are not useful for building the model) • In total they bought more than 220000 books • Notice the permutations in the rules • As you might expect, sequels are bought together
  • 58. Modeling methods: Apriori
  • 59. Modeling methods: Apriori - distributed • Make a list of books bought together (training data) – Similar to n-grams (Naïve Bayes) – Every customer is read only once, by one mapper • Make recommendations – Every mapper handles a number of rules
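Building the list of books bought together can be sketched as counting pairs per customer and then merging the counts. This single-process imitation uses hypothetical names (`map_customer`, `frequent_pairs`) and toy baskets; restricting the sketch to pairs is a simplification.

```python
from collections import Counter
from itertools import combinations

def map_customer(basket):
    """Mapper: emit every unordered pair of distinct items in one customer's basket."""
    return Counter(combinations(sorted(set(basket)), 2))

def frequent_pairs(baskets, min_count):
    """Merge the per-customer counts and keep pairs seen at least min_count times."""
    total = Counter()
    for basket in baskets:
        total += map_customer(basket)
    return {pair: n for pair, n in total.items() if n >= min_count}

baskets = [["b", "a", "c"], ["a", "b"], ["c", "d"]]
pairs = frequent_pairs(baskets, min_count=2)
```

Sorting each basket first makes ("a", "b") and ("b", "a") count as the same pair, whichever order the customer bought them in.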
  • 60. Modeling methods: Apriori • QUESTION: Which ads can I show on a website? • Modeling (data mining) – Input: all visited links, all bought items, … – Decide what you think is important: you want to show items others were also interested in, items others also bought, … – Apriori algorithm • Find items which occur together • Define the support, confidence and lift you want • Runtime usage – Check if a subset of the visited links occurs as the left-hand side of a rule
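The runtime check can be sketched as a lookup over precomputed rules. The rule format (left-hand side, right-hand side, lift), the function name `recommend` and the example items are all assumptions made for illustration; ranking by lift is one reasonable choice among several.

```python
def recommend(rules, visited):
    """Return right-hand sides of rules whose left-hand side was fully visited, best lift first."""
    visited = set(visited)
    hits = [
        (lift, rhs)
        for lhs, rhs, lift in rules
        if set(lhs) <= visited and rhs not in visited
    ]
    return [rhs for lift, rhs in sorted(hits, reverse=True)]

# Hypothetical rules: (left-hand side, right-hand side, lift)
rules = [
    (("shoes",), "socks", 2.0),
    (("shoes", "socks"), "polish", 3.5),
    (("hat",), "scarf", 1.8),
]
after_both = recommend(rules, ["shoes", "socks"])
after_one = recommend(rules, ["shoes"])
```

Items the visitor has already seen are skipped, so the same ad is not suggested twice.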
  • 61. Case study
  • 62. End: Wrap up & Lunch