A Predictive Model on a Big File
This model was developed with SPSS Modeler. The client was a luxury brand company that wanted to launch a direct mail campaign targeting wealthy middle-class people with some of its products. I had to build a model from the public US Census data to forecast whether a person is likely to earn more than 50,000 USD annually; these people were the target of the campaign.
DATA FILE
The data file contains 32561 records of US residents, divided into 14 variables, some quantitative and others qualitative.
The variable «income-GT-50k-USD» will be my target.
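The original workflow is built in the SPSS Modeler GUI, but the setup can be sketched in plain Python. The field names and the tiny inline sample below are assumptions for illustration; the real file follows the public US Census «Adult» layout with 14 predictors plus the income field.

```python
import csv
import io

# A tiny inline sample in the layout of the US Census "Adult" file
# (field names and rows are assumptions for illustration only).
sample = io.StringIO(
    "age,workclass,education-num,hours-per-week,income\n"
    "39,State-gov,13,40,<=50K\n"
    "52,Self-emp,9,45,>50K\n"
    "31,Private,14,50,>50K\n"
)

rows = list(csv.DictReader(sample))

# Derive the binary target «income-GT-50k-USD»:
# 1 if income is above 50k USD, 0 otherwise.
for r in rows:
    r["income-GT-50k-USD"] = 1 if r["income"] == ">50K" else 0

targets = [r["income-GT-50k-USD"] for r in rows]
print(targets)  # [0, 1, 1]
```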
DATA UNDERSTANDING
Let's look at the file layout. I want to build a predictive model, or possibly multiple models, to see which one will have the best performance.
At this point it is important to check the variables for missing data and outliers.
First I set the measurement levels and the role of the target variable. Its value is 1 if earnings are more than 50,000 USD per year, and 0 otherwise.
To remove missing data, I choose «On» in the column «Missing» and «Nullify» in the column «Check».
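The «Nullify» check replaces invalid entries with nulls. A minimal sketch of the same idea, assuming the Adult file's convention of marking missing values with "?":

```python
# Mimic Modeler's «Nullify» check: turn blank or "?" entries into None
# ("?" is the missing-value marker used in the Adult census file).
def nullify(value):
    return None if value in ("", "?") else value

# A hypothetical record with two missing fields:
record = {"age": "39", "workclass": "?", "native-country": ""}
cleaned = {k: nullify(v) for k, v in record.items()}
print(cleaned)  # {'age': '39', 'workclass': None, 'native-country': None}
```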
DATA AUDIT
I insert a «Data Audit» node and can see, for each variable, the name (1st column), the histogram (2nd column) and the measurement level (3rd column). The main statistics follow in columns 4-9. In the last column I see the number of valid records. Three variables have fewer valid records than the maximum, which is 32561.
Sex and income are flag variables with values 0 and 1. For income, 0 means an income of at most 50,000 USD and 1 means more.
Here I activate «Quality» and see that there are five variables with outliers. However, the number of outliers carries little weight, even in the case of «capital-loss», considering the huge total number of records (32561).
I also see that complete fields are 78.57% and complete records are 92.63%.
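The two completeness figures follow directly from the counts above: 78.57% of fields complete corresponds to 11 of the 14 variables having no missing values. A sketch of the arithmetic (the count of 30162 fully complete records is an assumption derived from the 92.63% figure):

```python
# Completeness as the Data Audit node reports it.
total_fields = 14
fields_without_missing = 11  # three variables have missing values
complete_fields_pct = round(100 * fields_without_missing / total_fields, 2)
print(complete_fields_pct)  # 78.57

total_records = 32561
complete_records = 30162  # records with no missing value (assumed count)
complete_records_pct = round(100 * complete_records / total_records, 2)
print(complete_records_pct)  # 92.63
```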
Now I check the connection between «Education number» and «Education», which refer to the same concept. The Plot graph confirms a strong connection, with the points tending to the diagonal. Therefore I can delete one of the two variables, because it would be redundant. I choose to delete «Education», using a Filter node.
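The check in the Plot node amounts to measuring how strongly the two fields move together. A small Python sketch of the same idea, with a hand-rolled Pearson correlation and hypothetical code pairs (in the Adult data, «education-num» is the ordinal code of «Education», so the association is essentially perfect):

```python
from statistics import mean

def pearson(x, y):
    # Pearson correlation coefficient, computed from first principles.
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical pairs: the code derived from the «Education» label
# against the «education-num» field itself.
education_num = [9, 13, 14, 9, 16, 10]
education_code = [9, 13, 14, 9, 16, 10]

print(round(pearson(education_num, education_code), 4))  # 1.0
```

A correlation this close to 1 is exactly the "tending to the diagonal" pattern seen in the plot, which justifies filtering one of the two fields out.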
After filtering out «Education», I insert «Auto Data Prep» to optimize the variables. Moreover, I will optimize the data transformation process for accuracy, in order to create the best predictive model.
After optimizing I switch to Analysis and find that one predictor is not used: evidently, it is useless for my accuracy goal. Moreover, I can see that 11 predictors were transformed.
In the plot on the right I find the predictive power of the independent variables.
Now I use «Data Audit» again, and in this table I can see that all the predictors were transformed (standardized) and cleaned. Now the valid records are 32561 for all variables.
I choose «Quality» in «Data Audit» and find that there are only 2100 outliers left, in the «Capital gain» variable. The other variables were cleaned by the previous transformation.
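The standardization that Auto Data Prep applies to continuous fields is the usual z-score transformation. A minimal sketch, with hypothetical hours-per-week values:

```python
from statistics import mean, pstdev

def standardize(values):
    # z-score: subtract the mean, divide by the (population) std deviation.
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

hours = [40, 50, 60]  # hypothetical hours-per-week values
z = standardize(hours)
print([round(v, 4) for v in z])  # [-1.2247, 0.0, 1.2247]
```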
MODELLING
Here I insert «Partition» and link it to «Auto Data Prep», to perform a classification analysis. I use a three-stage procedure: «Training» (60%), «Testing» (20%) and «Validation» (20%), with the default seed.
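The Partition node assigns each record to one of the three samples at random, reproducibly via the seed. A sketch of the same mechanism (the seed value here is an assumption, not Modeler's default):

```python
import random

def partition(n_records, seed=1234):
    # 60/20/20 split into training / testing / validation with a fixed
    # seed, mirroring the Partition node (seed value is an assumption).
    rng = random.Random(seed)
    labels = []
    for _ in range(n_records):
        r = rng.random()
        labels.append("train" if r < 0.6 else "test" if r < 0.8 else "valid")
    return labels

labels = partition(32561)
print({k: labels.count(k) for k in ("train", "test", "valid")})
```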
I insert the «Auto Classifier» node, connected to «Partition».
I switch to the «Model» option and choose to optimize «Overall accuracy». Moreover, in the «Expert» window, I choose two models for comparison: «C5» and «C&R Tree».
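Auto Classifier's comparison boils down to scoring each candidate model on the same partition and ranking by overall accuracy. A sketch with two hypothetical stand-in models represented as scoring functions (C5 and C&R Tree themselves are proprietary tree algorithms, not reimplemented here):

```python
def accuracy(model, records, targets):
    # Overall accuracy: fraction of records the model classifies correctly.
    hits = sum(model(r) == t for r, t in zip(records, targets))
    return hits / len(targets)

# Hypothetical toy records and targets:
records = [{"education-num": 13, "hours": 50}, {"education-num": 9, "hours": 40},
           {"education-num": 14, "hours": 60}, {"education-num": 10, "hours": 38}]
targets = [1, 0, 1, 0]

model_a = lambda r: 1 if r["education-num"] >= 13 else 0  # stand-in for C5
model_b = lambda r: 1 if r["hours"] > 45 else 0           # stand-in for C&R Tree

scores = {"C5": accuracy(model_a, records, targets),
          "C&R Tree": accuracy(model_b, records, targets)}
best = max(scores, key=scores.get)
print(best, scores[best])
```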
EVALUATION
I have compared the two models, and the best is C5, with an «Overall Accuracy» of 84.907%. I think this level of accuracy is good, so I stop the model comparison here.
I insert an Evaluation node to measure the gains of the best model (C5).
The most important graph is the testing one (the second), and my model line (purple) is quite close to the best line (blue). Therefore I can state that the model is good.
I evaluate another plot (using another «Evaluation» node) for «lift»: at the 20th percentile, in the testing graph, I find a lift of 2.9752. This value measures the ratio between my model (purple line) and the random model (red line). This confirms the good performance of the model.
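Lift at a percentile is the response rate among the top-scored records divided by the overall response rate (the random model's rate). A sketch of the computation on hypothetical scores and targets:

```python
def lift_at(scores, targets, pct):
    # Rank records by model score (descending), take the top pct fraction,
    # and compare their response rate to the overall response rate.
    ranked = [t for _, t in sorted(zip(scores, targets), reverse=True)]
    top = ranked[: int(len(ranked) * pct)]
    return (sum(top) / len(top)) / (sum(targets) / len(targets))

# Hypothetical model scores and true targets for ten records:
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
targets = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(round(lift_at(scores, targets, 0.20), 2))  # 3.33
```

Here the top 20% are all responders (rate 1.0) against an overall rate of 0.3, giving a lift of about 3.33; the 2.9752 found above reads the same way.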
ANALYSIS
From the «Analysis» node I get the «Confusion Matrix».
In particular, for the testing partition, I see 4684 people with an income of at most 50,000 USD, though for some the contrary is true. On the other side, for the model, 269 people earn more than 50,000 USD.
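The table the Analysis node produces is the standard confusion matrix: actual class against predicted class. A sketch of how it is tabulated, on hypothetical testing-partition labels:

```python
def confusion_matrix(actual, predicted):
    # Counts of (actual, predicted) pairs for a binary target.
    m = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        m[(a, p)] += 1
    return m

# Hypothetical labels: 1 = income > 50k USD, 0 = income <= 50k USD.
actual    = [0, 0, 0, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 1, 0, 0, 1, 0]

m = confusion_matrix(actual, predicted)
# correct 0s, correct 1s, false positives, false negatives:
print(m[(0, 0)], m[(1, 1)], m[(0, 1)], m[(1, 0)])  # 4 2 1 1
```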