The document summarizes a machine learning project analyzing data from the Titanic disaster to predict passenger survival. It describes the machine learning algorithms applied, including decision trees, logistic regression, random forests and conditional inference trees. Feature engineering, such as deriving title and family size, improved predictions. While first class passengers overall had better chances of survival, first class adult males survived at lower rates than expected.
Final pink panthers_03_30
1. The Titanic:
Machine Learning
from Disaster
Data Mining and Machine Learning. Winter 2014. Final Project
Jean Callao | Michelle Darling | Paul Marxhausen
2. AGENDA
In-depth analysis by Jean Callao
• Logistic Regression: glm
• Tree-based methods: rpart, ctree
In-depth analysis by Paul Marxhausen
• Ensemble Methods: randomForest, cForest
Summary by Michelle Darling
• Data Visualization
• Machine Learning Kaggle Results
3. Titanic: Machine Learning from Disaster
Why we picked this project:
● Historical context to understand
"What does the data mean?"
● Learn one data set well, and then apply
different algorithms and modelling tools.
● Practice the steps of data analysis:
○ Data exploration and visualization.
○ Model selection, building and
testing.
● Prize: $0 + "knowledge & confidence"
to go on to more challenging data science
problems.
kaggle.com provides:
Online data science competitions.
Structured problems, tutorials,
help forums and discussion groups.
Easy, consistent way to test models
and track results.
>>> Focus <<<
5. RMS Titanic, April 1912
A priori knowledge from problem domain
What factors contributed to survival?
Gender, Age, Passenger Class, Fare, Family
More likely to survive
• Females
• Children, Adults<50
• 1st Class
• Paid higher fares
• Travelling with family
More likely to perish
• Males
• Adults >50
• 2nd, 3rd class
• Paid lower fares
• Travelling alone
• Immigrants
6. Titanic Dataset
Predictor & Target Variables

Response variable:
Survived – Survival (1 = Yes; 0 = No)

Predictor variables:
Pclass   – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name     – Passenger Name
Sex      – Sex ("male", "female")
Age      – Age (numeric, fractions allowed, e.g., 1.5)
Fare     – Passenger Fare
SibSp    – Number of Siblings/Spouses Aboard
Parch    – Number of Parents/Children Aboard
Ticket   – Ticket Number
Cabin    – Cabin
Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Age, Fare, SibSp and Parch are QUANTITATIVE variables; the rest are QUALITATIVE.
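For reference, the train and test files come straight from kaggle.com; a minimal loading sketch (standard Kaggle file names assumed, not shown in the deck):

# Read the Kaggle CSVs; keep strings as characters so Name can be parsed later.
train <- read.csv("train.csv", stringsAsFactors = FALSE)   # 891 labelled passengers
test  <- read.csv("test.csv",  stringsAsFactors = FALSE)   # 418 passengers to predict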
7. Feature Engineering

Data relating to one's location on the ship:
data$cabin.last.digit <- str_sub(data$Cabin, -1)
data$Side <- "Unknown"
data$Side[which(isEven(data$cabin.last.digit))] <- "port"
data$Side[which(isOdd(data$cabin.last.digit))] <- "starboard"

Classifying Fares:
combi$Fare2 <- '30+'
combi$Fare2[combi$Fare < 30 & combi$Fare >= 20] <- '20-30'
combi$Fare2[combi$Fare < 20 & combi$Fare >= 10] <- '10-20'
combi$Fare2[combi$Fare < 10] <- '<10'

Title – extracted from Name to find wealthy passengers:
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'
combi$Title[combi$Title %in% c('Capt', 'Col', 'Don', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir')] <- 'Noble'

FamilySize – combining spouse, siblings and parents:
combi$FamilySize <- combi$SibSp + combi$Parch + 1
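The Title recoding above assumes a Title column has already been pulled out of Name; a sketch of that extraction step (a common approach for this dataset, assumed here rather than quoted from the slides):

# Name looks like "Braund, Mr. Owen Harris": take the word between the comma and the period.
combi$Title <- sapply(combi$Name, FUN = function(x) {
  strsplit(as.character(x), split = '[,.]')[[1]][2]
})
combi$Title <- sub(' ', '', combi$Title)   # drop the leading space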
9. Decision Trees
• A decision tree is a simple, but
powerful form of multiple variable
analysis. It displays a tree-like
graph of decisions and their
possible consequences.
• Recursive Partitioning-> at each
step, we identify a question that
we use to partition the data.
Advantages:
• Data-driven: Makes no prior
assumptions; selects significant predictors
based on the greatest information gain.
• Flexible: No data pre-processing needed!
Handles numeric and categorical data.
• Easy to interpret and explain to others.
10. Decision Tree with New Variables

tree <- rpart(Survived ~ Class + Sex + Age + SibSp + Parch + Fare + Title + Side,
              data = train, method = "class",
              control = rpart.control(minsplit = 0, minbucket = 0, maxdepth = 10))
fancyRpartPlot(tree)
Prediction <- predict(tree, test, type = "class")
table(Prediction)
Perished Survived
     262      156
11. Decision Tree with New Variables
Root node -> 62% perished, 38% survived
Mr or Noble -> 84% perished, 16% survived
Not a Mr or Noble -> 28% perished, 72% survived
3rd class -> 52% perished, 48% survived
Not 3rd class -> 5% perished, 95% survived
Pay >= $23 -> 91% perished, 9% survived
Pay < $23 -> 38% perished, 62% survived
If >= 36 yrs -> 86% perished, 14% survived
If < 36 yrs -> 36% perished, 64% survived
12. Overfitted rpart Decision Tree
Disadvantages of rpart:
• Can suffer from:
o High Variance
o High Bias
• Decision tree algorithms can result in
overly complex or overfitted trees.
Function ctree() in package party
addresses these weaknesses by providing:
• Unbiased variable selection
• Statistical stopping rules to
optimize tree growth.
13. Conditional Tree: ctree

train.ctree <- ctree(Survived ~ Class + Sex + Age + Fare + Title + Side, data = train)
plot(train.ctree)
Prediction2 <- predict(train.ctree, newdata = test, type = "response")
table(Prediction2)
Perished Survived
     256      162
14. Conditional Tree: ctree
Mr or Noble -> Side = Port or Starboard: 40% survived, 60% perished
Mr or Noble -> Side = Unknown: 16% survived, 84% perished
Not a Mr or Noble -> 1st or 2nd Class: 98% survived, 2% perished
Not a Mr or Noble -> 3rd Class -> Pay <= $23.25: 61% survived, 39% perished
Not a Mr or Noble -> 3rd Class -> Pay > $23.25: 14% survived, 86% perished
15. Logistic Regression

Least squares linear regression: predicted probabilities can be greater than 1 or less than 0 if used for classification!

LOGISTIC REGRESSION
• Used for a binary qualitative response.
• Using the logit ensures all probabilities stay between 0 and 1.

Why use Logistic Regression?
It allows us to establish a relationship between a binary outcome variable and a group of predictor variables. It can be used as:
• CLASSIFICATION METHOD: classifies a binary response (e.g., Yes/No, Pass/Fail, Survived/Perished)
• REGRESSION METHOD: calculates the probability (0.0 to 1.0) of the response.
16. The "logit" model solves the problem:

Log odds (logit) as a linear combination:
  ln( p / (1 - p) ) = B0 + B1*X

Where:
• "p" is the probability that Y equals 1, p(Y=1).
• "1 - p" is the probability that Y equals 0.
Transformed, the "log odds" are linear.

Solving for the odds:
  p / (1 - p) = e^(B0 + B1*X)

Probability (logistic function), which produces an S-shaped curve:
  p = e^(B0 + B1*X) / (1 + e^(B0 + B1*X))
17. Confirming the "women & children first" policy

Titanic.glm <- glm(Survived ~ I(Sex=="female") + Class + I(Age<=10) + Embarked + Fare2,
                   data = train, family = binomial("logit"))
table(test$Survived)
Perished Survived
     252      166
summary(Titanic.glm)

The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.
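The slide reports class counts with table(); as a rough sketch (not shown in the deck), predicted probabilities and 0/1 labels for the test set could be generated from the fitted model like this:

# Assumes test carries the same engineered columns (Fare2, etc.) as train.
test.prob <- predict(Titanic.glm, newdata = test, type = "response")  # probabilities in [0, 1]
test$Survived <- ifelse(test.prob > 0.5, 1, 0)                        # threshold at 0.5
table(test$Survived)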
18. Making Predictions

Predicted probabilities come from the fitted coefficients via the logistic function
  p = e^(B0 + B1*X1 + ... + Bk*Xk) / (1 + e^(B0 + B1*X1 + ... + Bk*Xk))
with each passenger's predictor values substituted in:
• A female passenger who is 10 years old has an estimated survival probability of about 0.99.
• A 2nd class man who paid 20 dollars for a ticket has an estimated survival probability of about 0.70.
20. Passengers travelling with relatives have higher chances of survival.

Titanic.glm2 <- glm(Survived ~ Class + I(FamilySize>=2) + Parch + I(SibSp>=2),
                    data = train, family = binomial("logit"))
table(test$Survived)
Perished Survived
     276      142
summary(Titanic.glm2)

We see that Pclass is a strong predictor, supporting the hypotheses about:
• location on the ship
• lifeboat access.
21. First class adult males have lower chances of survival

Titanic.glm3 <- glm(Survived ~ Class + I(Title=="Mr") + I(Title=="Noble") +
                      I(Age>=30 & Age<=50) + I(Fare>=27),
                    data = train, family = binomial("logit"))
table(test$Survived)
Perished Survived
     239      179
summary(Titanic.glm3)
22. "Any data relating to one's location on the ship could
prove helpful to survival predictions…"
23. First class adult males had lower chances of survival

summary(Titanic.glm3)

Those in the upper decks (1st class) had more timely, accurate information and a shorter journey to the lifeboats... yet why did 1st Class males have lower survival rates?
Possible explanation:
• 1st Class males were expected to be "gentlemen" and perish with the ship.
  "No woman shall be left aboard this ship because Ben Guggenheim was a coward."
• 1st Class male survivors were condemned by society:
  > Bruce Ismay – had to resign as Chairman of the White Star Line.
  > William Carter – divorced by his wife.
24. Third class adult males had lower chances of survival

Titanic.glm4 <- glm(Survived ~ Class + I(Age>=30 & Age<=65) +
                      I(Title=="Mr" & Class=="Third") + I(Fare<=10),
                    data = train, family = binomial("logit"))
table(test$Survived)
Perished Survived
     258      160
summary(Titanic.glm4)

Those located in the bow or lower decks (3rd Class) had less chance of survival.
26. Random Forests
Advantages:
• Easy to use: can be used quite efficiently
with default parameters.
• Ideal for people without a deep
background in statistics.
• Produces fairly strong predictions with
only a small amount of coding.
• An example of an ENSEMBLE
METHOD -- combines multiple
models to produce one result.
• Unlike single decision trees which
can suffer from high variance or
high bias, Random Forests use
random sampling and
averaging to find a natural
balance between the two
extremes.
27. Random Forests: Data pre-processing
Disadvantages:
• Data has to be pre-processed to
remove NAs, NULLs, blanks.
• Factor levels must be <22.
• We have to fix Age, Fare,
Embarked and FamilyID to
meet these requirements.
DATA PRE-PROCESSING TASKS
• Age
• Fare
• Embarked
• FamilyID
# Fill in Fare NAs
summary(combi$Fare)
which(is.na(combi$Fare))   # row 1044 is the missing one
combi$Fare[1044] <- median(combi$Fare, na.rm=TRUE)
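The slide shows only the Fare fix; a sketch of how the Age and Embarked gaps might be filled (a common approach for this dataset, assumed here rather than taken from the deck):

# Predict the missing Ages from the other variables with an rpart regression tree.
Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Title + FamilySize,
                data = combi[!is.na(combi$Age), ], method = "anova")
combi$Age[is.na(combi$Age)] <- predict(Agefit, combi[is.na(combi$Age), ])
# Replace the blank Embarked values with the most common port ("S").
combi$Embarked[combi$Embarked == ""] <- "S"
combi$Embarked <- factor(combi$Embarked)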
28. Model: RF using the 'randomForest' package

# Build Random Forest Ensemble
set.seed(415)
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                      Embarked + Title + FamilySize + FamilyID2,
                    data = train, importance = TRUE, ntree = 2000)
# Now let's make a prediction
Prediction <- predict(fit, test)

kaggle.com score: 0.81818
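Because the forest is fit with importance=TRUE, the relative weight of each engineered feature can be inspected; a quick sketch (not shown in the deck) using the randomForest package's own helpers:

# Inspect which predictors the forest relies on most.
importance(fit)    # mean decrease in accuracy and Gini per variable
varImpPlot(fit)    # dot chart of the same importance measures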
29. Model: RF using the 'party' package

# Build conditional inference tree forest
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare +
                 Embarked + Title + FamilySize + FamilyID,
               data = train, controls = cforest_unbiased(ntree = 4000, mtry = 2))
# Now let's make a prediction and write a submission file
Prediction <- predict(fit, test, OOB = TRUE, type = "response")

kaggle.com score: 0.81818
30. randomForest vs. party

randomForest package
• randomForest(…) function
• mtry defaults to floor(sqrt(p)), the number of features randomly selected at each split.
• randomForest is computationally faster.
• Popular in applied research.

party package
• cforest(…) function
• mtry is set to 5 by default, for technical reasons.
• The resulting forests are unbiased, even when the predictor variables are of different types.
• A conditional importance measure helps evaluate the importance of correlated predictor variables.
31. Ensemble Methods: kaggle results

Model / Description / Result:
• fit <- cforest(…): changed ntree from 2000 to 4000, and mtry from 3 to 2 – 0.81818
• fit <- randomForest(…): traditional Random Forest (randomForest package) – 0.81818
• fit <- cforest(…): conditional inference tree forest (party package) – 0.81340
33. Data Visualization
Summary
1. Created Conceptual Data Model
• to understand denormalized data file.
2. Tried lots of visualizations:
• Categorical vs. Continuous
• Uni-, Bi- and Multivariate
3. Compared datasets:
• Titanic vs. train vs. test ARE similar
4. Created rule-based models using the
most significant predictors:
• Sex == "female"
• Sex=="female OR Age <10
• Sex:Child:Fare:FamilySize
Data Visualization prototyping tools:
• MS Excel
• wordle.net
• Google Fusion
• R {rattle} package
Conceptual data model (diagram): PORT (Embarked: S=Southampton, C=Cherbourg, Q=Queenstown); TICKET (Ticket, Pclass, Cabin); PASSENGER (PassengerID, Name, Age, SibSp, Parch, Fare, Survived).
35. What is the relationship between Embarked, Pclass, Ticket, and Fare?

Cherbourg, France | Southampton, England | Queenstown, Ireland

All three embarkation ports (C, Q, S) boarded passengers from all classes (1st, 2nd, 3rd).
But 50% of Cherbourg passengers were 1st Class; they paid much higher fares (blue spikes).
Based on this, Fare is likely a stronger predictor of survival than Embarked.
Graph created in MS Excel using data from table(Embarked, Pclass, Fare, Ticket)
36. Text Analysis of Passenger Name

Word clouds for SURVIVORS vs. PERISHED, created in www.wordle.net from Survivors$Name and Perished$Name:
Survivors <- train[train$Survived==1,]; Perished <- train[train$Survived==0,]
Sex ("male" vs. "female") is an important predictor of survival.
37. Google Fusion Tables
Geospatial Heatmap, Network Diagrams

Google Fusion heatmap, GEOCODED by embarkation port:
• Southampton, UK – 644 passengers
• Cherbourg, France – 168 passengers
• Queenstown, Ireland – 77 passengers

Network diagrams showing SURVIVORS vs. PERISHED (plus "No Lifeboat"), with lifeboats (orange) vs. embarkation ports (blue), based on external data (Encyclopedia Titanica) imported into Google Fusion Tables.
38. Data Visualization in R
R Visualization Packages:
• Base R: plot, barplot, boxplot,
hist, dotchart, heatmap, pairs
• ggplot2: qplot, ggplot
• lattice: xyplot, dotplot,
parallelplot
• vcd: "Visualizing Categorical Data"
mosaic, assoc
• rcmdr: "Rcommander" scatter3d
• rattle: Explore Tab.
latticist, ggobi
39. Continuous vs. Discrete (Categorical) Variables
CORRELOGRAM: {base R} pairs()

t <- data.frame(Survived, Pclass, Sex, Age, Fare, Embarked, SibSp, Parch)  # columns from attach(train)
pairs(t, col = t$Pclass + 2)
# Shift base R color palette by 2
# 1st class – green (1+2=3)
# 2nd class – blue (2+2=4)
# 3rd class – cyan (3+2=5)
# base R Color Wheel is not very subtle!

• A correlogram is meant to show pair-wise relationships.
• Continuous variables appear as "clouds".
• Discrete variables appear as "bands".
40. Continuous, Multivariate
Intensity Map
{base R} heatmap()
• Useful for visualizing and
comparing data sets.
• Requires a data matrix.
• Values must be numeric
(recode qualitative variables e.g.,
Pclass, Gender).
• Can use custom color palette
(e.g., RColorBrewer)
test does not have a Survived attribute.
PassengerID 1:891 (train, 891 obs.) vs. 892:1309 (test, 418 obs.)
train is representative of test.
"Soup analogy": values look randomly distributed and "well-stirred" – no big chunks of dark or light bands.
Models based on train can be used to predict test fairly accurately.
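As a rough illustration of the recoding step described above (an assumed sketch, not code from the slides), a numeric matrix can be built and passed to heatmap():

# Recode qualitative columns to numbers, then plot the matrix.
m <- train[, c("Survived", "Pclass", "Age", "SibSp", "Parch", "Fare")]
m$Sex <- as.numeric(factor(train$Sex))        # "female"/"male" -> 1/2
m <- as.matrix(m[complete.cases(m), ])        # heatmap() needs a numeric matrix without NAs
heatmap(m, scale = "column")                  # scale each column before colouring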
41. Continuous, Univariate
Histogram: {base R} hist()
Show range, density
and distribution of a
single, continuous
variable.
# Use 2X2 grid
par(mfrow=c(2,2))
hist(test$Age)
hist(test$Fare)
hist(train$Age)
hist(train$Fare)
"Small Multiples"
concept by Tukey:
Displaying multiple small
plots side-by-side is
effective for analysis.
test and train have
similar distributions for
continuous variables.
42. "Small Multiples" of Bar Plots for categorical variables. E.g., barplot(table(test$Child))
Categorical, Univariate
Bar Plots: {base R} barplot()
test and train have similar
distributions for
categorical variables.
43. Continuous, Univariate
Dot Plot: {lattice} dotplot()
library(lattice)
attach(train)
# Each dot is
# a passenger.
# Survived==1 Red
# Survived==0 Black
dotplot(Age,
pch=1,
col=Survived,
main="train$Age")
dotplot(Fare,
pch=1,
col=Survived,
main="train$Fare")
cluster of survivors
(young children)
outliers
cluster of perished passengers
(who paid lowest fares).
44. Continuous, Univariate
Box Plot: {Base R} boxplot()
Shows the interquartile range (IQR), median, and outliers.

# Plot Age grouped by Pclass
par(mfrow=c(1,2))
Survivors <- train[train$Survived==1,]
Perished <- train[train$Survived==0,]
boxplot(Age ~ Pclass, data = Survivors, col = "light blue",
        main="Survived", xlab="Passenger Class", ylab="Age")
boxplot(Age ~ Pclass, data = Perished, col = "gray",
        main="Perished", xlab="Passenger Class", ylab="Age")

Survivors had a younger age range than those who perished, across all three passenger classes.
(Medians shown on the plots: 33.50, 28.00, 27.00, 28.00, 30.00, 38.50.)
45. Categorical, Multivariate
Spine Plot = 3 Bar Plots

Counts from the plot: 314 females (233 survived, 81 perished) and 577 males (109 survived, 468 perished).
FEMALES: greater-than-expected survival rate (68% shown on the plot).
MALES: greater-than-expected mortality rate (85% shown on the plot).

Class: a mutually exclusive, rectilinear partition, e.g., Female Survivors.
Probability: frequency count / whole set, e.g., 233 female survivors out of 342 total survivors ≈ 68%.
A Spine Plot is a visualization of a rules-based model; it exhaustively describes the feature space = Titanic passengers (female vs. male).
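For reference, a plot of this kind can be drawn directly in base R; a minimal sketch (assuming the train data frame from earlier slides; this call is not in the deck):

# Survival proportions by sex as width-proportional stacked bars.
spineplot(factor(Survived) ~ factor(Sex), data = train,
          xlab = "Sex", ylab = "Survived")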
47. Categorical, Multivariate
Mosaic Plot: {vcd} mosaic()

Visualization of a contingency table.
vcd = "Visualizing Categorical Data"
• Blue – high frequency
• Gray – neutral
• Red – low frequency count

Example: 3rd Class male (Sex=="male" & Pclass==3)
• High frequency: Survived == 0
• Low frequency: Survived == 1

# Mosaic Plot
library(vcd)
attach(train)
t <- table(Sex, Survived, Child)
mosaic(t, shade=TRUE, main="train dataset")

Tiles in the plot: female adults, female children, male adults, male children; overall 60% perished, 40% survived.
48. Mosaic Plot vs. Decision Tree: similar partitions

Groups in both views: female adults, female children, male adults, male children.
Overall: 60% perished, 40% survived.
• females (survived): 36% of all passengers, 77% of all survivors
• male adults (perished): 61% of all passengers, 83% of all who perished
Both the Mosaic Plot and the Decision Tree split the passengers into the same groups: females (survived), male children (survived), male adults (perished).
50. Which variables are correlated?
(Models perform better when variables are independent!)
Correlation plots created using the {rattle} R package, over FamilySize, SibSp, Parch, Fare3, Fare and Age.
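The same relationships can be checked numerically; a sketch (assuming the engineered numeric columns exist on train, and omitting the binned Fare3 factor):

# Pearson correlations between numeric predictors; pairwise-complete to tolerate Age NAs.
num <- train[, c("FamilySize", "SibSp", "Parch", "Fare", "Age")]
round(cor(num, use = "pairwise.complete.obs"), 2)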
51. Rule-Based Models
Everyone Survived vs. Everyone Perished

# Model: Everyone survived
test$Survived <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_0.csv", row.names = FALSE)
Result: 0.37321 ☹

# Model: Everyone perished
test$Survived <- 0
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_1.csv", row.names = FALSE)
Result: Your Best Entry: 0.62679 ☺
You improved on your best score by 0.25359.
You just moved up 12 positions on the leaderboard.
Survival rate for test is similar to the RMS Titanic overall.
52. Rule-Based Models
Random vs. Informed Guess

# Model: Random Guess
test$Survived <- sample(c(0,1), 418, replace = TRUE)
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_1random.csv", row.names = FALSE)
Your submission scored 0.50718, ☹ which is not an improvement of your best score.

Model: Informed Guess
● Used problem domain info, data visualizations and intuition to make an "informed guess" about each passenger.
● Manually typed 1 or 0 into the test.csv file with 418 rows…
Your Best Entry: 0.70335! ☺
You improved on your best score by 0.07656!

The process is similar to everyday human decision-making (no machine learning).
The score is much better than random chance!
53. Rule-Based Models
"Females" / "Women or Children"

# Model: Females Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_female.csv", row.names = FALSE)
Your Best Entry: 0.76555 ☺
You improved on your best score by 0.06220.

# Model: Women OR Children Survive
test$Survived <- 0
test$Survived[test$Sex=='female'] <- 1
test$Survived[test$Age<10] <- 1
# Tried different age cutoffs until score improved.
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "mdarling_model_wc.csv", row.names = FALSE)
Your Best Entry: 0.77033 ☺
You improved on your best score by 0.00478.
54. Rule-based model (70 rules)
Sex : Child : Fare2 : FamilySize

• Inspired by Principal Components Analysis (PCA).
• Performed better than naiveBayes, qda, glm, and svm (radial, sigmoid, polynomial)!

aggregate(Survived ~ Sex + Child + Fare2 + FamilySize,
          data=train, FUN=function(x) {sum(x)/length(x)})
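The aggregate() call returns a survival rate for every Sex/Child/Fare2/FamilySize combination. One way those rates could be turned into 0/1 predictions for test (a sketch assuming test carries the same engineered columns; this step is not shown in the deck):

rates <- aggregate(Survived ~ Sex + Child + Fare2 + FamilySize,
                   data = train, FUN = function(x) sum(x) / length(x))
names(rates)[names(rates) == "Survived"] <- "Rate"        # avoid clashing with a Survived column on test
pred <- merge(test, rates, by = c("Sex", "Child", "Fare2", "FamilySize"), all.x = TRUE)
pred$Survived <- ifelse(!is.na(pred$Rate) & pred$Rate > 0.5, 1, 0)  # unseen combinations default to perished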
55. Summary: kaggle.com results so far…

Model / Description / Result:
• 70-Rule Model: aggregate(Survived~Sex+Child+Fare2+FamilySize, data=train, FUN=function(x) {sum(x)/length(x)}) – 0.77512
• Female OR Child: test$Sex=='female' | test$Age < 10 – 0.77033
• Female: test$Sex=='female' – 0.76555
• Informed Guess: data visualization + problem domain info + manually typing 1,0 into the .csv file – 0.70335
• Random Guess: sample(c(1,0), 418, replace=TRUE) – 0.50718
• Everyone Perished: test$Survived <- 0 – 0.62679
• Everyone Survived: test$Survived <- 1 – 0.37321
56. Machine Learning: Titanic Dataset

START: Is training data available?
• No -> UNSUPERVISED LEARNING
• Yes (train.csv) -> SUPERVISED LEARNING
  • Continuous target -> REGRESSION
  • Categorical target (Survived) -> CLASSIFICATION
    • Multivariate classification
    • BINARY classification (1, 0)
      • SINGLE CLASSIFIERS: glm, knn, qda, naiveBayes, rpart, ctree, svm
      • ENSEMBLE METHODS: randomForest, cforest
58. QDA (0.75598) vs. Logistic Regression (0.76077)
• Logistic regression: linear model = straight-line boundaries; the better fit for the Titanic data set (higher score).
• QDA: polynomial (quadratic) model = curved boundaries.
• Both are eager learners. 2-step process: 1) fit the model using global info; 2) predict test using the reusable model.
59. Naïve Bayes (0.76555) vs. KNN (0.77990)

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="sknn")
end <- (proc.time() - ptm)
# 769.72 milliseconds – MORE TIME CONSUMING, but MORE CUSTOMIZED BOUNDARIES -> greater accuracy.

ptm <- proc.time()
partimat(Survived~., data=train_bc, method="naiveBayes")
end <- (proc.time() - ptm)
# 39.99 milliseconds – only 5% of the knn time.
60. AdaBoost (0.77990 – same as KNN)
# rattle Model output
Summary of the Ada Boost model:
Call:
ada(Survived ~ ., data = crs$dataset[crs$train, c(crs$input,
crs$target)], control = rpart.control(maxdepth = 30, cp =
0.01, minsplit = 20, xval = 10), iter = 50)
Loss: exponential Method: discrete Iteration: 50
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 350 23
1 45 205
Train Error: 0.109
Out-Of-Bag Error: 0.136 iteration= 50
Additional Estimates of number of iterations:
train.err1 train.kap1
50 50
Variables actually used in tree construction:
[1] "Age" "FamilyID2" "Fare" "Sex" "Title"
Frequency of variables actually used:
FamilyID2 Fare Title Age Sex
49 49 48 46 8
Time taken: 3.42 secs
Only 50 trees compared
to 4000 trees in
cforest, hence lower
performance.
61. Support Vector Machines (2D)

SVM kernels & decision boundary shapes:
• Linear -> line (cost=1, 68% correct)
• Radial -> circle (cost=100, 73.4% correct)
• Polynomial -> C curve (cost=10, 68% correct)
• Sigmoid -> S curve (cost=0.1, 66% correct)

"Goodness of fit" – svm: radial performed best with two dimensions (0.77033).
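A minimal sketch of how such a two-dimensional radial-kernel fit could be run with the {e1071} package (the deck reports only the results; the predictors and call below are assumptions):

library(e1071)
# Two-predictor SVM with a radial kernel, matching the best 2-D kernel above.
train2 <- na.omit(train[, c("Survived", "Age", "Fare")])
svm.fit <- svm(factor(Survived) ~ Age + Fare, data = train2, kernel = "radial", cost = 100)
table(predict(svm.fit, train2), train2$Survived)   # confusion matrix on the training rows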
62. Scatterplots for visualizing SVM
2D {ggplot2} qplot vs. 3D {Rcmdr} scatter3d

# Interactive 3D hyperplane with spline
library(Rcmdr); attach(train)
scatter3d(Age, Survived, Fare)

# Point and Line ScatterPlot
library(ggplot2); attach(train)
qplot(Age, Fare, data=train, geom=c("point","line"), colour=Survived,
      main = "Titanic Passengers")
63. SVM
using 11 inputs
Advantages of SVM:
• Minimal pre-processing needed.
• Tuning improves accuracy.
• Helps reveal best fit
(linear/poly/radial/sigmoid).
• Immune to "Curse of
Dimensionality".
• Instead of worsening, accuracy
improved when dimensions
increased from 2 to 11
attributes.
Score: 0.79904 – good, but still not better than cforest or randomForest (0.81818).
64. cforest (0.81818) + Lifeboat Data Fusion = 0.83732

# Added 12 male survivors based on merged
# lifeboat data from Encyclopedia Titanica.
ciforest2 <- read.csv("ciforest2.csv")
testlb <- read.csv("test_lifeboats.csv")
ensembles <- merge(ciforest2, testlb, by.x="PassengerId", by.y="PassengerId")
ensembles$Survived[ensembles$Lifeboat==1] <- 1
table(ensembles$Survived)
#  0   1
#272 146
submit <- data.frame(PassengerId = ensembles$PassengerId, Survived = ensembles$Survived)
write.csv(submit, file = "ensembles_5.csv", row.names = FALSE)
65. "Ensemble of ensembles":
randomForest + cForest + random tiebreaker

What if we combine results from randomForest and cForest? Use a random tiebreaker for non-unanimous votes.

# Code for 95/05 tiebreaker (score 0.81818)
# Merge randomForest and cForest and average
# the results. Reuse unanimous votes.
ensembles <- merge(rforest, ciforest2, by.x="PassengerId", by.y="PassengerId")
ensembles$Vote <- (as.numeric(ensembles$Survived.x) + as.numeric(ensembles$Survived.y))/2
ensembles$Survived[ensembles$Vote==1.0] <- 1
ensembles$Survived[ensembles$Vote==0.0] <- 0
# Create vector of 418 random 0s and 1s
set.seed(pi)
probs <- c(.95,.05)
ensembles$rvote <- sample(c(0,1), 418, replace = TRUE, prob=probs)
# For each tie, use a random vote
ensembles$Survived[ensembles$Vote==0.5] <- ensembles$rvote[ensembles$Vote==0.5]
table(ensembles$Survived)
  0   1
281 137

Results: The combinations did not outperform the individual models, even when lifeboat data was added.
66. Machine Learning Summary

• Data mining using lifeboat info = competitive edge. The 12 additional male survivors are highly significant because they countered social norms and survived "against the odds".
• Ensemble methods (randomForest, cforest) outperform single classifiers. "Many models work better than one."
• Embedded feature selection models (svm, ctree, rpart) outperform models that need "manual" feature selection. Decision trees are great communication tools.
• knn has the same accuracy as glm and AdaBoost, but takes a lot of processing time.
• Simple rule-based models can outperform naiveBayes if features are chosen by Principal Components Analysis (PCA).
• Social norms ("Women and Children First", "Male survivors are cowards") greatly influenced survival.
• Human decision-making outperforms random chance, and can outperform machine learning (depending on the human's expertise).
• Math-based models like glm are sensitive to feature selection.
• "Goodness of fit" determines performance. Linear and radial fits (glm, svm: linear/radial) outperformed the others (qda, svm: polynomial/sigmoid).