2. Team Details
Name                     Campus        Roll No.   Mobile No.   Email Id
Krishna Priya            IIT Roorkee   15411007   8910091352   kpriya@es.iitr.ac.in
Manish Kumar Kushwaha    IIT Roorkee   15110013   9456522346   mkushwaha@ar.iitr.ac.in
Team Name: KMAnalytica
3. Estimation Technique Used
Please provide the estimation/modeling technique(s)/approach
used to arrive at the solution/equation
• Iterative Imputer (explored, as it came closest to a workable approximation, but the results were not good enough, so it was not used; a sketch of this experiment follows this list).
• Feature transformation.
• Outlier removal.
• Creation of interaction features (between variables).
• XGBoost, LightGBM.
• Model tuning and optimization.
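The imputation experiment mentioned above could look roughly like the following minimal sketch in Python. The file path and the column selection are assumptions for illustration, not the exact setup used:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("train.csv")                        # hypothetical file path
num_cols = df.select_dtypes(include="number").columns

imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = df.copy()
df_imputed[num_cols] = imputer.fit_transform(df[num_cols])
# The imputed values did not improve validation scores, so the raw columns
# (with NaNs left in place) were fed directly to the tree models instead.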
4. Strategy to decide final list
Please provide the strategy employed to decide the final list for
submission
• Filled missing values in VAR10 by forward filling, after observing a pattern in that column (a preprocessing sketch follows this list).
• Identified outliers using quantiles and the mean, and removed them.
• Tried to impute missing values using iterative imputer but that did
not give satisfactory results.
• Applied reciprocal, square-root and Gaussian transformations to the skewed variables.
• Prevented overfitting by controlling max_depth and
min_child_weight.
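A rough sketch of these preprocessing steps is shown below. Column names other than VAR10, the quantile thresholds, and the hyper-parameter values are assumptions, and a log1p transform stands in for the Gaussian transformation; an XGBClassifier is shown, but a regressor would be used if the target is continuous:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

df = pd.read_csv("train.csv")                        # hypothetical file path

# 1) Forward fill VAR10, following the pattern observed in the column.
df["VAR10"] = df["VAR10"].ffill()

# 2) Quantile-based outlier screening: keep rows inside the 1st-99th
#    percentile band (columns and thresholds are illustrative).
mask = pd.Series(True, index=df.index)
for col in ["VAR3", "VAR7"]:                         # hypothetical columns
    lo, hi = df[col].quantile([0.01, 0.99])
    mask &= df[col].isna() | df[col].between(lo, hi)
df = df[mask].copy()

# 3) Transform skewed variables: reciprocal, square root, and a
#    log1p stand-in for the Gaussian transformation.
for col in ["VAR5", "VAR12"]:                        # hypothetical columns
    df[col + "_recip"] = 1.0 / (df[col] + 1e-6)
    df[col + "_sqrt"] = np.sqrt(df[col].clip(lower=0))
    df[col + "_gauss"] = np.log1p(df[col].clip(lower=0))

# 4) Control overfitting through max_depth and min_child_weight.
model = XGBClassifier(max_depth=5, min_child_weight=5,
                      n_estimators=500, learning_rate=0.05)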
5. Details of each Variable used in the logic/model/strategy
Please provide details of each variable used in the final logic
These are a few variables which I added on top of the existing features (a sketch of how they are constructed follows this list):
• FICO / CMV.
• Reported annual business revenue / Average amount paid towards card bills in the last 3 months.
• Risk score associated with probability of default * Average utilization of credit line in the last six months (Balance / Credit line).
• Reported annual business income / Average amount paid towards card bills in the last 3 months.
• TSR / Months in Business.
• FICO / TSR.
• TSR * CMV.
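A minimal sketch of these interaction (ratio and product) features follows. FICO, CMV and TSR appear as in the list above; every other column name and the helper function name are assumptions:

import pandas as pd

def add_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
    eps = 1e-6  # guards against division by zero
    df = df.copy()
    df["fico_by_cmv"] = df["FICO"] / (df["CMV"] + eps)
    df["revenue_by_avg_paid"] = df["ANNUAL_REVENUE"] / (df["AVG_PAID_3M"] + eps)
    df["risk_x_utilization"] = df["RISK_SCORE"] * df["UTILIZATION_6M"]
    df["income_by_avg_paid"] = df["ANNUAL_INCOME"] / (df["AVG_PAID_3M"] + eps)
    df["tsr_by_mib"] = df["TSR"] / (df["MONTHS_IN_BUSINESS"] + eps)
    df["fico_by_tsr"] = df["FICO"] / (df["TSR"] + eps)
    df["tsr_x_cmv"] = df["TSR"] * df["CMV"]
    return df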
On top of these, I generated three statistical features for each column, which included the following (see the sketch after this list):
• Mapped each value in the column to its value count, capped at a maximum of 10.
• Mapped each value to (feature value - mean) * value count, floored at a minimum of 1.
• Mapped each value to feature value * value count, floored at a minimum of 2.
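A sketch of these three value-count based statistical features is given below. The caps and floors (10, 1, 2) come from the list above; the helper name and column selection are illustrative:

import pandas as pd

def add_count_features(df: pd.DataFrame, cols) -> pd.DataFrame:
    df = df.copy()
    for col in cols:
        counts = df[col].map(df[col].value_counts())  # value count of each entry
        mean = df[col].mean()
        df[col + "_cnt"] = counts.clip(upper=10)                           # capped at 10
        df[col + "_dev_cnt"] = ((df[col] - mean) * counts).clip(lower=1)   # floored at 1
        df[col + "_val_cnt"] = (df[col] * counts).clip(lower=2)            # floored at 2
    return df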
6. Reasons for Technique(s) Used
Why do you think this is the best technique(s) for this particular
problem?
• First, missing values could not be imputed with any domain logic, since most of the columns had values missing completely at random. That left only two models worth exploring on this dataset, LightGBM and XGBoost, as both handle missing values natively.
• Two types of feature engineering also played a major role:
1) Interaction features between variables, as they captured relationships and domain meaning.
2) Statistical features: surprisingly, these worked on this dataset. Because the values were already scaled, the interaction features alone could not do all the work, and the value-count based statistical features brought out additional meaning for the model.
• Finally, XGBoost performed a bit better than LightGBM both before and after tuning. Since the dataset was not that large, training time was not an issue, so I opted for XGBoost to gain that extra 0.5% (a comparison sketch follows).
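The comparison between the two models could be run roughly as below. The metric, cross-validation setup, file path, target name and hyper-parameters are assumptions; only the conclusion that XGBoost came out slightly ahead is from the write-up above:

import pandas as pd
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train_features.csv")              # hypothetical prepared data
X, y = df.drop(columns=["TARGET"]), df["TARGET"]    # hypothetical target name

models = {
    "xgboost": XGBClassifier(max_depth=5, min_child_weight=5,
                             n_estimators=500, learning_rate=0.05),
    "lightgbm": LGBMClassifier(num_leaves=31, n_estimators=500,
                               learning_rate=0.05),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(name, round(scores.mean(), 4))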