The report describes the results of a Discrete Choice Experiment (a type of Conjoint-Analysis) to explore the potential configuration of a tablet computer from a new entrant to the category.
In this paper, I develop a custom binary classifier of search queries for the makeup category using different Machine Learning techniques and models. An extensive exploration of shallow and Deep Learning models was performed using a cross-validation framework to identify the top three models, optimize them by tuning their hyperparameters, and finally create an ensemble of models with a custom decision threshold that outperforms all other models. The final classifier achieves an accuracy of 98.83% on a test set, making it ready for production.
A large appliance manufacturer was interested in using propensity models to better target consumers with direct mail campaigns. A data set containing transactional data from past purchases, enriched with data about the consumer, the household, and the zip code from third-party providers, was used to develop a model to predict non-responders and avoid targeting them. Simulations varying the estimated revenue per customer and the cutoff point used to filter out potential consumers allowed me to identify different optimal points in the reach-vs-response-rate tradeoff.
Modeling Sexual Selection with Agent-Based Models – Esteban Ribero
The paper discusses a well-known principle in evolutionary biology called the handicap principle. Two agent-based models were developed to illustrate the principle in an attempt to better understand its implications for the study of human behavior.
A focused practice aimed at using simulations from simple System Dynamics models to help us better understand the intended and unintended consequences of our actions.
Brand Communications Modeling: Developing and Using Econometric Models in Adv... – Esteban Ribero
This report presents a description and a complete example of the modeling process required to build a comprehensive market response model that would account for the impact of previous marketing actions on sales.
Adjusting primitives for graph : SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... – Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2024 State of Marketing Report – by HubSpot – Marius Sescu
https://www.hubspot.com/state-of-marketing
· Scaling relationships and proving ROI
· Social media is the place for search, sales, and service
· Authentic influencer partnerships fuel brand growth
· The strongest connections happen via call, click, chat, and camera.
· Time saved with AI leads to more creative work
· Seeking: A single source of truth
· TLDR; Get on social, try AI, and align your systems.
· More human marketing, powered by robots
ChatGPT has been a revolutionary addition to the world since its introduction in 2022. A big shift in the sector of information gathering and processing has happened because of this chatbot. What is the story of ChatGPT? How does the bot respond to prompts and generate content? Swipe through these slides, prepared by Expeed Software, a web development company, on the development and technical intricacies of ChatGPT!
Product Design Trends in 2024 | Teenage Engineerings – Pixeldarts
The realm of product design is a constantly changing environment where technology and style intersect. Every year introduces fresh challenges and exciting trends that mold the future of this captivating art form. In this piece, we delve into the significant trends set to influence the look and functionality of product design in the year 2024.
How Race, Age and Gender Shape Attitudes Towards Mental Health – ThinkNow
Mental health has been in the news quite a bit lately. Dozens of U.S. states are currently suing Meta for contributing to the youth mental health crisis by inserting addictive features into their products, while the U.S. Surgeon General is touring the nation to bring awareness to the growing epidemic of loneliness and isolation. The country has endured periods of low national morale, such as in the 1970s when high inflation and the energy crisis worsened public sentiment following the Vietnam War. The current mood, however, feels different. Gallup recently reported that national mental health is at an all-time low, with few bright spots to lift spirits.
To better understand how Americans are feeling and their attitudes towards mental health in general, ThinkNow conducted a nationally representative quantitative survey of 1,500 respondents and found some interesting differences among ethnic, age and gender groups.
Technology
For example, 52% agree that technology and social media have a negative impact on mental health, but when broken out by race, 61% of Whites felt technology had a negative effect, and only 48% of Hispanics thought it did.
While technology has helped us keep in touch with friends and family in faraway places, it appears to have degraded our ability to connect in person. Staying connected online is a double-edged sword since the same news feed that brings us pictures of the grandkids and fluffy kittens also feeds us news about the wars in Israel and Ukraine, the dysfunction in Washington, the latest mass shooting and the climate crisis.
Hispanics may have a built-in defense against the isolation technology breeds, owing to their large, multigenerational households, strong social support systems, and tendency to use social media to stay connected with relatives abroad.
Age and Gender
When asked how individuals rate their mental health, men rate it higher than women by 11 percentage points, and Baby Boomers rank it highest at 83%, saying it’s good or excellent vs. 57% of Gen Z saying the same.
Gen Z spends the most amount of time on social media, so the notion that social media negatively affects mental health appears to be correlated. Unfortunately, Gen Z is also the generation that’s least comfortable discussing mental health concerns with healthcare professionals. Only 40% of them state they’re comfortable discussing their issues with a professional compared to 60% of Millennials and 65% of Boomers.
Race Affects Attitudes
As seen in previous research conducted by ThinkNow, Asian Americans lag other groups when it comes to awareness of mental health issues. Twenty-four percent of Asian Americans believe that having a mental health issue is a sign of weakness compared to the 16% average for all groups. Asians are also considerably less likely to be aware of mental health services in their communities (42% vs. 55%) and most likely to seek out information on social media (51% vs. 35%).
AI Trends in Creative Operations 2024 by Artwork Flow.pdf – marketingartwork
This article covers the AI trends expected to emerge in the field of creative operations in 2024. Marketers and brand builders should be aware of these trends for future use and to save themselves some time!
A report by thenetworkone and Kurio.
The contributing experts and agencies are (in an alphabetical order): Sylwia Rytel, Social Media Supervisor, 180heartbeats + JUNG v MATT (PL), Sharlene Jenner, Vice President - Director of Engagement Strategy, Abelson Taylor (USA), Alex Casanovas, Digital Director, Atrevia (ES), Dora Beilin, Senior Social Strategist, Barrett Hoffher (USA), Min Seo, Campaign Director, Brand New Agency (KR), Deshé M. Gully, Associate Strategist, Day One Agency (USA), Francesca Trevisan, Strategist, Different (IT), Trevor Crossman, CX and Digital Transformation Director; Olivia Hussey, Strategic Planner; Simi Srinarula, Social Media Manager, The Hallway (AUS), James Hebbert, Managing Director, Hylink (CN / UK), Mundy Álvarez, Planning Director; Pedro Rojas, Social Media Manager; Pancho González, CCO, Inbrax (CH), Oana Oprea, Head of Digital Planning, Jam Session Agency (RO), Amy Bottrill, Social Account Director, Launch (UK), Gaby Arriaga, Founder, Leonardo1452 (MX), Shantesh S Row, Creative Director, Liwa (UAE), Rajesh Mehta, Chief Strategy Officer; Dhruv Gaur, Digital Planning Lead; Leonie Mergulhao, Account Supervisor - Social Media & PR, Medulla (IN), Aurelija Plioplytė, Head of Digital & Social, Not Perfect (LI), Daiana Khaidargaliyeva, Account Manager, Osaka Labs (UK / USA), Stefanie Söhnchen, Vice President Digital, PIABO Communications (DE), Elisabeth Winiartati, Managing Consultant, Head of Global Integrated Communications; Lydia Aprina, Account Manager, Integrated Marketing and Communications; Nita Prabowo, Account Manager, Integrated Marketing and Communications; Okhi, Web Developer, PNTR Group (ID), Kei Obusan, Insights Director; Daffi Ranandi, Insights Manager, Radarr (SG), Gautam Reghunath, Co-founder & CEO, Talented (IN), Donagh Humphreys, Head of Social and Digital Innovation, THINKHOUSE (IRE), Sarah Yim, Strategy Director, Zulu Alpha Kilo (CA).
Trends In Paid Search: Navigating The Digital Landscape In 2024 – Search Engine Journal
The search marketing landscape is evolving rapidly with new technologies, and professionals, like you, rely on innovative paid search strategies to meet changing demands.
It’s important that you’re ready to implement new strategies in 2024.
Check this out and learn the top trends in paid search advertising that are expected to gain traction, so you can drive higher ROI more efficiently in 2024.
You’ll learn:
- The latest trends in AI and automation, and what this means for an evolving paid search ecosystem.
- New developments in privacy and data regulation.
- Emerging ad formats that are expected to make an impact next year.
Watch Sreekant Lanka from iQuanti and Irina Klein from OneMain Financial as they dive into the future of paid search and explore the trends, strategies, and technologies that will shape the search marketing landscape.
If you’re looking to assess your paid search strategy and design an industry-aligned plan for 2024, then this webinar is for you.
5 Public speaking tips from TED - Visualized summary – SpeakerHub
From their humble beginnings in 1984, TED has grown into the world’s most powerful amplifier for speakers and thought-leaders to share their ideas. They have over 2,400 filmed talks (not including the 30,000+ TEDx videos) freely available online, and have hosted over 17,500 events around the world.
With over one billion views in a year, it’s no wonder that so many speakers are looking to TED for ideas on how to share their message more effectively.
The article “5 Public-Speaking Tips TED Gives Its Speakers”, by Carmine Gallo for Forbes, gives speakers five practical ways to connect with their audience, and effectively share their ideas on stage.
Whether you are gearing up to get on a TED stage yourself, or just want to master the skills that so many of their speakers possess, these tips and quotes from Chris Anderson, the TED Talks Curator, will encourage you to make the most impactful impression on your audience.
See the full article and more summaries like this on SpeakerHub here: https://speakerhub.com/blog/5-presentation-tips-ted-gives-its-speakers
See the original article on Forbes here:
http://www.forbes.com/forbes/welcome/?toURL=http://www.forbes.com/sites/carminegallo/2016/05/06/5-public-speaking-tips-ted-gives-its-speakers/&refURL=&referrer=#5c07a8221d9b
ChatGPT and the Future of Work - Clark Boyd
Everyone is in agreement that ChatGPT (and other generative AI tools) will shape the future of work. Yet there is little consensus on exactly how, when, and to what extent this technology will change our world.
Businesses that extract maximum value from ChatGPT will use it as a collaborative tool for everything from brainstorming to technical maintenance.
For individuals, now is the time to pinpoint the skills the future professional will need to thrive in the AI age.
Check out this presentation to understand what ChatGPT is, how it will shape the future of work, and how you can prepare to take advantage.
A brief introduction to DataScience with explaining of the concepts, algorithms, machine learning, supervised and unsupervised learning, clustering, statistics, data preprocessing, real-world applications etc.
It's part of a Data Science Corner Campaign where I will be discussing the fundamentals of DataScience, AIML, Statistics etc.
Time Management & Productivity - Best Practices – Vit Horky
Here's my presentation on proven best practices for managing your work time effectively and improving your productivity. It includes practical tips on how to use tools such as Slack, Google Apps, HubSpot, Google Calendar, Gmail and others.
The six step guide to practical project management – MindGenius
The six step guide to practical project management
If you think managing projects is too difficult, think again. We've stripped back project management processes to the basics – to make it quicker and easier, without sacrificing the vital ingredients for success.
“If you’re looking for some real-world guidance, then The Six Step Guide to Practical Project Management will help.”
Dr Andrew Makar, Tactical Project Management
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Random Forest and Gradient Boosting
Esteban Ribero, Assignment #4 - MSDS 422 | Winter 2019
Evaluating Random Forests and Gradient Boosting for Regression
Purpose and summary of results
The purpose of this exercise is to expand the analysis of the regression methods from assignment 2 by employing Random Forests and Gradient Boosting Regression Trees and comparing them with the best linear regression models used previously. A cross-validation design was used for the comparisons. The data set used to perform the study is again the Boston Housing Study, which contains 506 census tract observations and 13 variables. The target variable to predict is the median value of homes in Boston in 1970. Table 1, extracted from Miller (2015) and shown in the appendix, describes the variables contained in the data set.

Several versions of random forest and gradient boosting were fitted to the data and cross-validated by tweaking the hyperparameters. Random forest and gradient boosting performed significantly better than all the linear regression models used previously. The default settings for these two methods already produced great results, reducing the Root Mean Squared Error (RMSE) across the test sets from more than 0.52 to less than 0.35, a 32% improvement. Tweaking some hyperparameters led to an optimized gradient boosting model with an RMSE of 0.30 and a 'pseudo R-squared' of 0.90, well above the ~0.72 for the linear regression models. A comparison of the feature importances is provided, and some managerial implications for a real estate brokerage firm are derived from the study.
Loading the required packages
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # static plotting
import seaborn as sns # pretty plotting, including heat map
import sklearn.linear_model # modeling routines from Scikit Learn
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt # for root mean-squared error calculation
from sklearn.preprocessing import StandardScaler #for scaling the data
from sklearn.model_selection import KFold #for cross-validation
#importing the regressors to be used
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
In [2]: #setup for displaying multiple outputs from a single Jupyter cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
The data
In [3]: # loading the data into a dataframe
boston_input = pd.read_csv('boston.csv')
# drop neighborhood from the data being considered
boston = boston_input.drop('neighborhood', 1)
#Setting up the data for fitting the models into numpy arrays
prelim_model_data = np.array([boston.mv,
boston.crim,
boston.zn,
boston.indus,
boston.chas,
boston.nox,
boston.rooms,
boston.age,
boston.dis,
boston.rad,
boston.tax,
boston.ptratio,
boston.lstat]).T
The data was already explored in the prior analysis, and some concerns regarding the distribution of some explanatory variables, as well as the presence of several outliers and extreme outliers, were raised. For the purpose of this and the prior analysis these concerns were not addressed, and only a simple standardization of the data using the StandardScaler was performed. This was done for the linear regression models, which are susceptible to strong variations in the scales; although this is not necessary for random forest and gradient boosting, we used the same standardized data set for ease of comparison. The standardization centers the data around 0 and scales every variable to unit standard deviation. This was done even for the target variable.
In [4]: # Scaling the data using standardization
scaler = StandardScaler()
model_data = scaler.fit_transform(prelim_model_data)
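
As a quick sanity check (an illustrative addition, not part of the original notebook), the effect of the scaler can be verified directly: every column of the scaled matrix should now have a mean of approximately 0 and a standard deviation of 1.

# verify the standardization (illustrative check, not in the original notebook)
print(np.round(model_data.mean(axis=0), 3))  # all columns centered at ~0
print(np.round(model_data.std(axis=0), 3))   # all columns scaled to ~1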
The following boxplot shows the results of the standardization for reference
In [5]: #Boxplot for standardized variables
var_names = ['mv', 'crim','zn', 'indus', 'chas', 'nox',
'rooms','age','dis','rad','tax','ptratio','lstat']
model_data_df = pd.DataFrame(model_data, columns = var_names)
fig, axis = plt.subplots(figsize=(12,10))
ax = plt.title('Boxplot for Standardized Features')
ax = sns.boxplot(data=model_data_df, orient="h")
As can be observed in the boxplots, all the variables have been centered and standardized. This process maintains the shape of the distributions, so the differences in the range of values, the distributions, and the presence of outliers can still be easily observed.
Regression models and cross-validation
As before, the following block of code sets up the models to be evaluated. In this study, I compare the best linear regression models from the prior exercise with a set of random forest and gradient boosting models. The best-performing linear regressions used for comparison are the baseline linear regression (no regularization), the Ridge regression with alpha = 50, and the Lasso regression with alpha = 0.01.
In [6]: #Setup code for regression models being considered
RANDOM_SEED = 1 #to obtain reproducible results
SET_FIT_INTERCEPT = True #to include intercept in the regression
##Specifying the set of regression models being evaluated
names = ['a_Linear_Regression',
'b_Ridge_Regression_50',
'c_Lasso_Regression_0.01',
'd_Random_Forest',
'e_Random_Forest_100_log2',
'f_Random_Forest_100_4',
'g_Random_Forest_10_500_4',
'h_Gradient_Boosting',
'i_Gradient_Boosting_3_500',
'j_Gradient_Boosting_2_500_6',
'k_Gradient_Boosting_3_100_0.3',
'l_Gradient_Boosting_3_50_0.3']
For the set of Random Forests I used the default parameters from scikit-learn first, then fine-tuned them iteratively until I got a satisfactory result. After a full exploration I ended up with four versions. Sampling with replacement (bootstrap) was used for all four:

d_Random_Forest (the default):
With 100 trees, unconstrained depth for the trees, and the ability to use all features.

e_Random_Forest_100_log2:
With 100 trees as well and unconstrained depth, but constraining the number of features considered at each split to log2 of the total, which in this case is equivalent to setting max_features = 3 (see the short check after this list). This creates random exploration over the features, leading to more diversity.

f_Random_Forest_100_4:
Same as before but increasing the range of feature exploration to 4.

g_Random_Forest_10_500_4:
With 500 trees to average across more trees, reducing the chances of overfitting, limiting the maximum depth of each tree to 10 with the same goal, and max_features set to 4.
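
As a quick check of the log2 rule mentioned above (an illustrative addition, not part of the original notebook), with 12 explanatory variables scikit-learn considers floor(log2(12)) = 3 candidate features at each split:

# max_features='log2' with 12 predictors -> 3 candidate features per split
n_features = model_data.shape[1] - 1   # 12 explanatory variables
print(int(np.log2(n_features)))        # prints 3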
For the set of Gradient Boosting Regression Trees, I ended up with five versions:

h_Gradient_Boosting (the default):
With 100 trees, a maximum depth of 3, and a learning rate of 0.1.

i_Gradient_Boosting_3_500:
Same as above but with 500 trees, for a more complex model.

j_Gradient_Boosting_2_500_6:
Same as before but reducing the maximum depth to 2, in an attempt to reduce overfitting by pruning earlier, as well as limiting the feature exploration to 6.

k_Gradient_Boosting_3_100_0.3:
Similar to the default model but increasing the learning rate to 0.3.

l_Gradient_Boosting_3_50_0.3:
Same as above but reducing the number of trees to 50, to reduce model complexity and balance the increased learning rate.
In [7]: #code to set the parameters of the regressors
regressors = [
LinearRegression(fit_intercept = SET_FIT_INTERCEPT),
Ridge(alpha = 50, solver = 'cholesky',
fit_intercept = SET_FIT_INTERCEPT,
normalize = False, # data was standardized before
random_state = RANDOM_SEED),
Lasso(alpha = 0.01, max_iter=10000, tol=0.01,
fit_intercept = SET_FIT_INTERCEPT,
random_state = RANDOM_SEED),
RandomForestRegressor(n_estimators = 100, bootstrap=True,
max_depth = None, max_features='auto',
random_state = RANDOM_SEED),
RandomForestRegressor(n_estimators = 100, bootstrap=True,
max_depth = None, max_features= 'log2',
random_state = RANDOM_SEED),
RandomForestRegressor(n_estimators = 100, bootstrap=True,
max_depth = None, max_features= 4,
random_state = RANDOM_SEED),
RandomForestRegressor(n_estimators = 500, bootstrap=True,
max_depth = 10, max_features= 4,
random_state = RANDOM_SEED),
GradientBoostingRegressor(n_estimators = 100,
max_depth = 3, max_features=None,
learning_rate = 0.1, subsample = 1,
random_state = RANDOM_SEED),
GradientBoostingRegressor(n_estimators = 500,
max_depth = 3, max_features=None,
learning_rate = 0.1, subsample = 1,
random_state = RANDOM_SEED),
GradientBoostingRegressor(n_estimators = 500,
max_depth = 2, max_features=6,
learning_rate = 0.1, subsample = 1,
random_state = RANDOM_SEED),
GradientBoostingRegressor(n_estimators = 100,
max_depth = 3, max_features=None,
learning_rate = 0.3, subsample = 1,
random_state = RANDOM_SEED),
GradientBoostingRegressor(n_estimators = 50,
max_depth = 3, max_features=None,
learning_rate = 0.3, subsample = 1,
random_state = RANDOM_SEED),
]
The following code sets up numpy arrays for storing the results as Python iterates over the for loops during the cross-validation. Although the main performance indicator for this study is the Root Mean Squared Error (RMSE) on the test sets, we will also collect the RMSE for the train sets to more easily identify overfitting, as well as a measure of the variance explained by the model: R-squared for the linear regression models, and a pseudo R-squared for the random forest and gradient boosting models.
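
For reference (an illustrative addition, not part of the original notebook), the two metrics collected in the loop below can be written out explicitly; r2_score computes exactly the 1 - SS_res/SS_tot quantity that serves as the pseudo R-squared for the tree-based models:

# the two metrics used in the cross-validation loop, written out explicitly
def rmse(y_true, y_pred):
    return sqrt(mean_squared_error(y_true, y_pred))      # root mean squared error

def pseudo_r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)               # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot                          # same value r2_score returns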
In [9]: #Setting up numpy arrays for storing results
N_FOLDS = 10 #number of folds for cross-validation
rmse_test = np.zeros((len(names), N_FOLDS))
rmse_train = np.zeros((len(names), N_FOLDS))
r2_test = np.zeros((len(names), N_FOLDS))
r2_train = np.zeros((len(names), N_FOLDS))
As before, I used a cross-validation design with 10 folds. This means that we will cut the data into a training set and a test set ten times. We will train the models on each training set and validate their prediction accuracy on each of the ten test sets.
In [10]: # specifying the k-fold cross-validation design
kf = KFold(n_splits = N_FOLDS, shuffle=True, random_state = RANDOM_SEED)

index_for_fold = 0 # fold count initialized
for train_index, test_index in kf.split(model_data):
    # the structure of modeling data for this study has the
    # response variable coming first and explanatory variables later
    # so 1:model_data.shape[1] slices for explanatory variables
    # and 0 is the index for the response variable
    X_train = model_data[train_index, 1:model_data.shape[1]]
    X_test = model_data[test_index, 1:model_data.shape[1]]
    y_train = model_data[train_index, 0]
    y_test = model_data[test_index, 0]

    index_for_method = 0 # initialize
    for name, reg_model in zip(names, regressors):
        ## fit on the train set for this fold
        rmodel = reg_model.fit(X_train, y_train)
        ## evaluate the model for this fold
        y_test_predict = reg_model.predict(X_test)
        y_train_predict = reg_model.predict(X_train)
        # R-squared
        r2_test[index_for_method, index_for_fold] = r2_score(y_test, y_test_predict)
        r2_train[index_for_method, index_for_fold] = r2_score(y_train, y_train_predict)
        # Root mean squared error
        fold_method_rmse_test = sqrt(mean_squared_error(y_test, y_test_predict))
        fold_method_rmse_train = sqrt(mean_squared_error(y_train, y_train_predict))
        rmse_test[index_for_method, index_for_fold] = fold_method_rmse_test
        rmse_train[index_for_method, index_for_fold] = fold_method_rmse_train
        index_for_method += 1
    index_for_fold += 1
The following code creates a Pandas DataFrame with the results of each fold and then averages the results across all folds.
In [11]: ##creating multilevel index for dataframes
model_name = names #to avoid confusion in next line
multi_index = pd.MultiIndex.from_product(
    [model_name, np.arange(N_FOLDS)],
    names=['model','fold'])

##the dataframe
fit_results_df = pd.DataFrame(
    np.hstack((rmse_train.reshape(N_FOLDS*len(names),1),
               rmse_test.reshape(N_FOLDS*len(names),1),
               r2_train.reshape(N_FOLDS*len(names),1),
               r2_test.reshape(N_FOLDS*len(names),1))),
    index=multi_index,
    columns=['Train_RMSE','Test_RMSE',
             'Train_r2','Test_r2'])

##averaging results across folds
av_fit = fit_results_df.groupby('model').mean()
Results
The following table shows the average results of the cross-validation across the 10 folds. The average RMSE for the training sets and the test sets is presented, as well as the coefficient of determination, R-squared.
In [12]: print('----- Results of cross-validation across 10 folds -----\n\n',
              round(av_fit, ndigits=3))

----- Results of cross-validation across 10 folds -----

                               Train_RMSE  Test_RMSE  Train_r2  Test_r2
model
a_Linear_Regression                 0.510      0.516     0.739    0.719
b_Ridge_Regression_50               0.521      0.520     0.729    0.716
c_Lasso_Regression_0.01             0.516      0.519     0.734    0.715
d_Random_Forest                     0.129      0.350     0.983    0.863
e_Random_Forest_100_log2            0.126      0.339     0.984    0.878
f_Random_Forest_100_4               0.125      0.328     0.984    0.886
g_Random_Forest_10_500_4            0.140      0.326     0.981    0.888
h_Gradient_Boosting                 0.153      0.310     0.977    0.896
i_Gradient_Boosting_3_500           0.039      0.308     0.998    0.897
j_Gradient_Boosting_2_500_6         0.121      0.342     0.985    0.872
k_Gradient_Boosting_3_100_0.3       0.071      0.304     0.995    0.899
l_Gradient_Boosting_3_50_0.3        0.123      0.301     0.985    0.901
As can be seen above, the default random forest performs quite well relative to the linear models. Its average test RMSE across the 10 folds is 0.350, well below the 0.516 of the best-performing linear regression model. There is a clear sign of overfitting given that its average performance on the train data is so much better, with an RMSE of 0.129 and an R-squared of 0.983 (vs 0.863 on the test set). Constraining the space for feature exploration to 3 (log2 of 12) does improve the performance slightly, reducing the RMSE to 0.339. After several iterations, the best setting for max_features was 4, which reduces the RMSE even further to 0.328. There is still overfitting, so the random forest with the number of trees increased to 500 and the depth of each tree limited to 10 reduces the RMSE slightly further to 0.326.
It is possible that there is an even better combination of parameters that would improve the performance of the random forests, given that there are still signs of overfitting. However, moving to gradient boosting improved performance much faster: the default setting for gradient boosting provided a model with a test RMSE of 0.310 and an R-squared of 0.896. Increasing the number of trees in the model to 500, while keeping the default max_depth of 3 and learning_rate of 0.1, increased the performance on the training set significantly (RMSE = 0.039, R-squared = 0.998!) and the performance on the test set slightly (RMSE = 0.308, R-squared = 0.897). So there was an improvement overall, but the model is clearly overfitting the data given its increased complexity. Note that for gradient boosting, unlike random forest, more trees increase complexity by following the data more closely, while for random forests more trees reduce the chances of overfitting.

In an attempt to decrease overfitting for the gradient boosting with 500 trees, I limited the depth of the trees to 2 and introduced randomness in the feature space by restricting the exploration of features to 6. This did not improve performance and actually made the model perform worse than the default gradient boosting. Another exploration involved increasing the learning rate while reducing the number of trees. Since these two parameters are highly related (the more trees, the more complex the model, and the higher the learning rate, the stronger the corrections at each iteration), I started with the default number of trees (100) and increased the learning rate to 0.3. This had a more significant impact on performance, reducing the RMSE on the test data to 0.304, the best so far. The model is likely overfitting since the RMSE on the train set is 0.071 and the train R-squared is 0.995, so there is still a chance to improve it slightly. Indeed, reducing the number of trees to only 50, while keeping the learning rate at 0.3 and max_depth = 3, provides the best performance: test RMSE of 0.301 and test R-squared of 0.901. Not bad.
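
To make this trees-versus-overfitting tradeoff visible, one could trace the test error as boosting stages are added. The sketch below is an illustrative addition (not part of the original assignment); it assumes the X_train/X_test and y_train/y_test arrays from the last cross-validation fold are still in memory and uses scikit-learn's staged_predict to score the model after each stage:

# illustrative sketch: test RMSE as a function of the number of boosting stages
gbm = GradientBoostingRegressor(n_estimators = 500, max_depth = 3,
                                learning_rate = 0.1,
                                random_state = RANDOM_SEED)
gbm = gbm.fit(X_train, y_train)              # any single fold's training data
stage_rmse = [sqrt(mean_squared_error(y_test, y_hat))
              for y_hat in gbm.staged_predict(X_test)]
ax = plt.plot(range(1, len(stage_rmse) + 1), stage_rmse)
ax = plt.xlabel('number of trees (boosting stages)')
ax = plt.ylabel('test RMSE')

In such a plot the test RMSE typically flattens, or starts creeping up, well before 500 trees, which is consistent with the small gain observed when moving from 100 to 500 trees.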
Feature importance
Now that we have a good-performing model, let's explore the contribution and importance of each of the features. To do that, we will train the best models using the full data set.

The following code collects the full data for the features and the target variable.
In [13]: #X and Y train values for full dataset
X_f = model_data[:, 1:model_data.shape[1]]
y_f = model_data[:, 0]
The following code fits the linear regression models on the full data set and estimates and stores the magnitude of the regression coefficients. The intercept is not collected since, for comparability with the tree models, we only care about the importance of the features.
In [14]: #setting the numpy array to collect the
#coefficients for the three linear models
regression_coef = np.zeros((len(names[:3]), #[:3] slices for just the
                                            #linear regression models
                            model_data.shape[1]-1))

#same code as before but using .coef_ to collect coefficients
index_for_method = 0 # initialize
for name, reg_model in zip(names[:3], regressors[:3]):
    # fit the method on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    #regression coefficients (features)
    regression_coef[index_for_method] = reg_model.coef_
    index_for_method += 1
The following code does the same for the tree models but uses .feature_importances_ instead of .coef_ to get the importance of each feature.
In [15]: #setting the numpy array to collect the
#feature importance for the tree models
feature_importance = np.zeros((len(names[3:]), #[3:] slices for the
                                               #tree models
                               model_data.shape[1]-1))

index_for_method = 0 # initialize
for name, reg_model in zip(names[3:], regressors[3:]):
    # fit the method on the full data set
    rmodel = reg_model.fit(X_f, y_f)
    #feature importance
    feature_importance[index_for_method] = rmodel.feature_importances_
    index_for_method += 1
The following two blocks of code create pandas DataFrames with the coefficients for the linear regression models and the feature importance for the tree models. Note that, for ease of visualization with seaborn plots, the shape of the dataframes is changed to a tidy format.
In [16]: #creating a dataframe for storing the feature importance for each model
column_names = var_names[1:].copy() #using the list of variables names
##the dataframe
feature_importance_pd = pd.DataFrame(feature_importance)
feature_importance_pd.columns = column_names
feature_importance_pd.index = names[3:]
#creating a subset of the dataframe for the best
#performing tree-based models for visualization
fi_df = feature_importance_pd.loc[['d_Random_Forest',
                                   'g_Random_Forest_10_500_4',
                                   'h_Gradient_Boosting',
                                   'l_Gradient_Boosting_3_50_0.3'], :]
#reshaping the layout for ease of visualization
fi_df = fi_df.stack().reset_index() #making it tidy
fi_df.columns = ['model','feature','importance']
To properly compare the importance of each feature across the linear and tree models, I converted the regression coefficients of the linear models to their absolute values.
In [17]: #creating a dataframe for storing the regression coefficients
column_names = var_names[1:].copy() #using the list of variables names
##the dataframe
regression_coef_pd = pd.DataFrame(regression_coef)
regression_coef_pd.columns = column_names
regression_coef_pd.index = names[:3]
#reshaping the layout for ease of visualization
cf_df = regression_coef_pd
cf_df = cf_df.stack().reset_index()#making it tidy
cf_df.columns = ['model','feature','abs_magnitude']
#converting the magnitude of the coefficients to absolute values
cf_df['abs_magnitude'] = np.absolute(cf_df['abs_magnitude'])
The following plot compares the absolute magnitude of the linear regression coefficients with the importance given to the features by the best and default random forest and gradient boosting models.
In [18]: sns.set()
fig, axis = plt.subplots(1,2, figsize=(12,8))
ax = plt.suptitle("__________ Feature Importance __________")
ax = plt.subplot(1,2,1)
ax = plt.title('feature coefficients (abs value)')
ax = sns.stripplot(data=cf_df, size=10, x='abs_magnitude',
y='feature', hue='model', palette ="Blues")
ax = plt.subplot(1,2,2)
ax = plt.title('feature importance')
ax = sns.stripplot(data=fi_df, size=10, x='importance',
y='feature', hue='model', palette ="Reds")
As can be seen, there is a different pattern between the linear regression models and the tree models. Although the two measures of importance are not the same, so direct comparison has to be done carefully, relative to the importance of the number of rooms (rooms) and the percent of the population of lower socio-economic status (lstat), the tree-based models give much less importance to the other features. In fact, the importance of zn, indus, chas, and rad is 0 or close to it, except for one of the optimized random forests, which gives some small importance to indus. That model also has the most balanced importance profile relative to the other tree-based models.

The best performing gradient boosting model relies mostly on the number of rooms and the percent of the population of lower socio-economic status, as well as a little on the distance to employment centers (dis). Some smaller importance is given to the pupil/teacher ratio in public schools (ptratio) and air pollution (nox).
Conclusion
Ensemble models based on trees are among the most widely used models in machine learning, given their strong performance and ease of training. This study adds evidence that these models are indeed great tools for supervised learning. With the optimized gradient boosting model, the real estate brokerage firm can confidently use this machine learning technique to estimate the values of houses in Boston at the time. Particular attention should be given to the number of rooms and the percentage of the population of lower socio-economic status. These might be the most obvious features, however, so additional attention should be given to the distance from employment centers and the level of air pollution. The crime rate and the tax rate are less important but still contribute to the precision of the model. Using the model's predictions would likely yield a more accurate estimation of the value of residential real estate than even the assessment of an expert, so it is strongly advised that the firm use the model as the primary method and complement it with more traditional approaches if desired.
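
As a final illustration of how the firm might use the model (a hedged sketch added here, not part of the original assignment), the chosen gradient boosting configuration can be refit on the full standardized data and its predictions mapped back to the original mv scale using the mean and standard deviation stored by the StandardScaler (the target occupies column 0 of the scaled matrix):

# illustrative sketch: score a property and convert the prediction back to mv units
best_gbm = GradientBoostingRegressor(n_estimators = 50, max_depth = 3,
                                     learning_rate = 0.3,
                                     random_state = RANDOM_SEED)
best_gbm = best_gbm.fit(X_f, y_f)                        # full standardized data
example_home = X_f[:1]                                   # stand-in for a new, already-standardized property
pred_std = best_gbm.predict(example_home)                # prediction on the standardized scale
pred_mv = pred_std * scaler.scale_[0] + scaler.mean_[0]  # back to the original mv scale
print(np.round(pred_mv, 2))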
References:
Thomas W. Miller. Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python. Pearson Education, Old Tappan, N.J., 2015. Data sets and programs available at http://www.ftpress.com/miller/ and https://github.com/mtpa/.
Appendix