Saturday, October 25, 2014
   

Building Attrition Models Using Logistic Regression in Analysis Studio

In statistical analysis and data mining projects, building an attrition analysis (also known as churn analysis) is about finding the relationships between customer attrition and the variables that affect it. Although attrition models have specific requirements, the process described in this article can be used to produce any binomial logistic model. The goal of attrition analysis is to give the manager or researcher the ability to understand which variables most strongly drive attrition and how likely each customer is to churn.

It may look easy to guess the main factors that affect attrition: customer satisfaction, length of service, etc. Using such rules of thumb, a user might predict 15% of all churners, but a statistical analysis procedure such as the one in Analysis Studio can yield more than 60% precision.

Analysis Studio makes use of four logistic regression methods to find the best model for explaining the main reasons for attrition. In this how-to paper, we will discuss a simple yet powerful statistical analysis method for obtaining a good attrition model. We will also discuss model interpretation, to give the manager or researcher the tools to conduct and analyze the model as well as deploy it as the final step of the data mining process.

A predictive statistical model based on logistic regression analyzes each variable's weight and contribution to the model's goal. The contribution is expressed as a percentage, so the manager or researcher can see the weight each variable carries on the target variable (in this example: attrition). Having the weight of each variable, and being able to see the effect each variable has on the prediction, makes logistic regression a preferred modeling method (as opposed to methods such as neural networks, which act like a black box).

As suggested at the beginning of this article, you can also use the Analysis Studio logistic regression procedure in a wide variety of fields, such as project failure analysis, employee attrition in HR, social research, engineering, finance, and any other research that aims to explain and predict the occurrence of a binary event (such as "0"/"1" or "Churned"/"Not churned").

Preparing the data set

Logistic regression produces a statistical model from a data set whose target variable is binary, i.e. has two possible values: "1" means that the event occurred and "0" means that it did not.

Customer_ID | Children | Age | Education | Calls | Visits | Attrition
102654      | 2        | 61  | 12        | 12    | 0      | 1
103540      | 1        | 32  | 20        | 18    | 2      | 0
104426      | 1        | 35  | 20        | 14    | 2      | 0
105312      | 0        | 26  | 20        | 20    | 2      | 0
106198      | 0        | 25  | 12        | 90    | 0      | 1
107084      | 5        | 59  | 10        | 6     | 2      | 1
107970      | 3        | 46  | 10        | 70    | 2      | 1
108856      | 4        | 65  | 16        | 6     | 2      | 0
109742      | 3        | 57  | 10        | 5     | 2      | 0
110628      | 2        | 64  | 14        | 12    | 0      | 1
111514      | 0        | 72  | 9         | 40    | 0      | 0
112400      | 5        | 67  | 12        | 8     | 2      | 1
113286      | 0        | 33  | 15        | 12    | 2      | 1
114172      | 1        | 23  | 14        | 12    | 0      | 0
115058      | 1        | 33  | 12        | 12    | 2      | 1
115944      | 2        | 59  | 12        | 6     | 0      | 0
116830      | 1        | 60  | 14        | 6     | 2      | 0
117716      | 2        | 77  | 9         | 86    | 0      | 1
118602      | 2        | 52  | 14        | 12    | 2      | 0
119488      | 1        | 55  | 7         | 98    | 2      | 1

Example data set

The data set contains 20 customers: 10 of them churned last year and 10 are still with the company. The goal of the analysts in this data mining project is to detect each customer's individual risk of churning (e.g. John Smith has a 95% risk of churning) so that marketing or the call center can contact them in advance. An equally important outcome is the ability to analyze and understand what causes the attrition: what influence does each variable have on the customer's decision to leave the company? Being able to answer these questions is what binds statistical analysis, data mining and business decisions together, and is what many refer to as decision support.

Data set variables:

· Children – Number of children the customer has.

· Age – Customer's age.

· Education – Customer's years of education.

· Calls – Number of calls the customer made to the service center.

· Visits – Number of visits the customer made to the local service center.

· Attrition – Whether the customer churned ("1") or not ("0") – this is the model's target variable.
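In Analysis Studio, fitting the model happens through the wizard described below. For readers who want to see the underlying mechanics in code, here is a minimal, purely illustrative Python sketch of fitting a binomial logistic model on a few rows of the example data by gradient descent (the solver, learning rate and iteration count are arbitrary choices for this sketch, not what Analysis Studio uses):

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# A few rows of the example data set: [Children, Age, Education, Calls, Visits]
X = [[2, 61, 12, 12, 0],
     [1, 32, 20, 18, 2],
     [0, 25, 12, 90, 0],
     [5, 59, 10, 6, 2],
     [2, 77, 9, 86, 0],
     [1, 23, 14, 12, 0]]
y = [1, 0, 1, 1, 1, 0]          # Attrition: 1 = churned, 0 = stayed

b = [0.0] * (len(X[0]) + 1)     # intercept + one weight per variable
lr = 0.001
for _ in range(3000):           # gradient descent on the log-loss
    for xi, yi in zip(X, y):
        p = sigmoid(b[0] + sum(w * v for w, v in zip(b[1:], xi)))
        err = yi - p
        b[0] += lr * err
        for j, v in enumerate(xi):
            b[j + 1] += lr * err * v

# Churn probability for a new (hypothetical) customer
p_new = sigmoid(b[0] + sum(w * v for w, v in zip(b[1:], [2, 57, 12, 12, 2])))
print(round(p_new, 3))
```

With only six rows this is a toy, of course; the point is only that a logistic model is a set of weights pushed through the sigmoid function.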

To start using the logistic regression modeling process described here, you will need Analysis Studio. After starting the software and retrieving the data, click the Statistics menu -> Logistic Regression.

As discussed above, the target (explained) variable is, in this case, Attrition: the variable whose behavior we want to explain in terms of changes in the explanatory variables (here: Age, Calls, Children, Education, Visits). Select the Attrition variable in the explained variable box. To define the model, move the desired explanatory variables from the Explanatory Variables frame on the left side of the wizard window to the Selected Columns frame on the right.

In a good data mining or statistical analysis process, the analyst makes assumptions and then tries to disprove them, so that in the end only strong, well-based assumptions are left standing.

Now we have all model components:

1. The explained variable – the target variable "Attrition", which we would like to predict or analyze.

2. The explanatory variables (assumptions, for now) – the variables we think influence the target variable's outcome. These are the selected columns; in this case: Age, Calls, Children, Education and Visits.

3. The selected modeling method: Enter All, the simplest modeling technique, which tries to add all variables to the model (this may not always be possible, e.g. due to correlated data).

All we have to do now is click the Next button and let the software calculate the model for us. In the Analysis Studio logistic regression modeling wizard, the process can be stopped at any point, and you may move back and forth to make changes and refine your model before publishing it.

At this point, the logistic regression model is calculated and the process of reviewing, analyzing and refining it begins. The screen should now display the ROC curve and the Area Under the Curve (AUC) value. We will not discuss the ROC or AUC methods in detail here but, generally speaking, they measure the model's success at distinguishing between the "1" and "0" events of the target variable. An AUC close to 1 means that the model distinguishes very well between the two binary events; values close to 0.5 usually mean that the model performs poorly and should be discarded.

AUC value | Interpretation
0.5       | No distinguishing ability
0.5-0.7   | Not a very good model
0.7-0.9   | Very good model
0.9-1.0   | Excellent model
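AUC also has a concrete interpretation: it is the probability that a randomly chosen churner receives a higher predicted risk than a randomly chosen non-churner. A small Python sketch with hypothetical scores (not produced by Analysis Studio) makes this concrete:

```python
# AUC as a probability: the chance that a randomly picked "1" case
# outscores a randomly picked "0" case. All scores here are hypothetical.
churner_scores = [0.9, 0.8, 0.7, 0.55]   # predicted risk for actual churners
keeper_scores  = [0.6, 0.4, 0.3, 0.2]    # predicted risk for actual non-churners

pairs = wins = 0
for c in churner_scores:
    for k in keeper_scores:
        pairs += 1
        if c > k:
            wins += 1
        elif c == k:
            wins += 0.5                  # ties count half
auc = wins / pairs
print(auc)  # 0.9375 -> "excellent" on the scale above
```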

Note that models that seem too good to be true should be examined carefully to make sure that no variables "from the future" are present. Imagine that our database contained a variable holding the number of next month's orders. Customers who are no longer active will have 0 orders, so this variable would look like a strong, but spurious, predictor if entered into the model.

Most good predictive business models have an AUC of 0.7-0.8, depending on data quality and the nature of the problem.

In our example, we have an AUC of 0.83, so we proceed to view the rest of the results for further analysis. Interpreting all the statistical parameters in the model is a complicated task that is beyond the scope of this how-to paper. We will view the model parameters and predictive results that help us understand the attrition phenomenon as well as our model's performance.

Note that you may make assumptions about model quality and performance, but verifying a model's correctness, quality and true performance requires the skills of a professional statistician or analyst.

At this point, click the Next button and then click Finish. The model is now published and ready to be reviewed. In the main attrition model (logistic regression) window, each variable has its own value representing its contribution to the attrition phenomenon.

For example, Age has the value 0.9512, which means that for each additional year the churn risk decreases by 4.88%: (1 - 0.9512) * 100 = 4.88. Calls has the value 1.0458, which means that each additional call to the call center increases the churn risk by 4.58%: (1.0458 - 1) * 100 = 4.58.
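These percentages come directly from the reported per-variable values (odds ratios): a value below 1 lowers the churn odds per unit increase, a value above 1 raises them. The same arithmetic in Python:

```python
# Odds ratios as reported in the main model window.
or_age   = 0.9512   # per additional year of age
or_calls = 1.0458   # per additional call to the call center

pct_age   = (1 - or_age)   * 100   # % decrease in churn risk per year
pct_calls = (or_calls - 1) * 100   # % increase in churn risk per call
print(round(pct_age, 2), round(pct_calls, 2))  # 4.88 4.58
```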

Classifications

Click the Classifications tab to view the model's results on the current data set. This tab shows how many cases were classified as churners or non-churners, compared with how many customers actually churned or not.

1. Model performance identifying the non-churners (80% success).

2. Model performance identifying the churners (70% success).

3. Overall model performance (75% success).
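With 10 churners and 10 non-churners in the example set, these three percentages follow from a simple confusion-matrix calculation (the counts below are implied by the reported percentages):

```python
# Confusion-matrix counts implied by the reported success rates:
# 8 of 10 non-churners and 7 of 10 churners classified correctly.
tn, fp = 8, 2   # actual non-churners: classified correctly / wrongly
tp, fn = 7, 3   # actual churners:     classified correctly / wrongly

specificity = tn / (tn + fp)               # non-churner success rate
sensitivity = tp / (tp + fn)               # churner success rate
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall performance
print(specificity, sensitivity, accuracy)  # 0.8 0.7 0.75
```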

After computing the logistic regression procedure, we can finally try to answer our question: what affects the attrition phenomenon, and what is the weight of each explanatory variable on "Attrition"?

Analysis Studio 6 arms you with four powerful analytics tools:

What-If scenario – Lets you analyze and view a specific case in order to learn from it about your customers' attrition, or analyze a specific customer in order to understand how and why he or she was classified by the model. Take a look at the image below:

1. The variables calculator that calculates the probability or risk of Attrition based on the given variable values: Age, Calls, Children, Education and Visits.

2. The calculation result when parameters are entered.

In the example below, the probability of attrition is 40% for a customer who is 57 years old, has called the call center 12 times, has 2 children, has 12 years of education, and has visited the customer service centers twice.
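Behind the What-If calculator sits the standard logistic formula p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ...))). The sketch below shows the mechanics; the intercept and coefficients are purely hypothetical (the article does not list the fitted coefficients), chosen only so that the arithmetic lands near the example's 40% result:

```python
import math

# Hypothetical coefficients for illustration only -- the article does not
# publish the fitted intercept or per-variable coefficients.
b0 = 1.15
coefs  = {"Age": -0.05, "Calls": 0.045, "Children": 0.4,
          "Education": -0.02, "Visits": 0.1}
values = {"Age": 57, "Calls": 12, "Children": 2,
          "Education": 12, "Visits": 2}

z = b0 + sum(coefs[k] * values[k] for k in coefs)  # linear predictor
p = 1.0 / (1.0 + math.exp(-z))                     # logistic transform
print(round(p * 100), "%")
```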

Sensitivity Table – Analyzes your customers' attrition sensitivity to changes in one of the explanatory variables in the attrition model. Take a look at the image below, showing a customer whose other variable values are held constant:

1. The only variable changed is the number of children; the rest of the variables are held constant.

2. The number-of-children variable starts with no children at all (0) and ends with six children.

3. The attrition probability increases from 14.9% with no children to 90.78% with six children (figure no. 3).
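The same logistic formula explains the shape of the sensitivity table: holding everything else fixed and varying one variable traces an S-curve of churn risk. In the sketch below, the two coefficients are hypothetical, picked only so that the sweep roughly mirrors the 14.9% to 90.78% range described above:

```python
import math

def churn_prob(children, b0=-1.74, b_children=0.67):
    # b0 folds in the (constant) contribution of all other variables;
    # both numbers are hypothetical, not the fitted model
    z = b0 + b_children * children
    return 1.0 / (1.0 + math.exp(-z))

probs = [churn_prob(n) for n in range(7)]   # 0 through 6 children
for n, p in enumerate(probs):
    print(n, round(p * 100, 1), "%")
```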

Deploying the model – The Deploy Model button applies the model to the current data set (Current Results) or to a different data set (a future data set in which attrition is unknown, or a test data set that was not used to build the model but in which attrition is known). In many data mining and statistical analysis projects, analysts test the attrition model on a data set that was not used to build it in order to assess model stability; such a test increases confidence in the generated model. Many statistical analysts also deploy the model on the current data set (the Deploy Model: Current Results option) in order to examine the model's behaviour on it (e.g. after deploying, create a new cross-tab table, put the deciles column in the column variables and the did-hit variable in the row variables, and view which customers were classified in each decile and whether the model performed as expected).

The Deployment process generates different variables on the selected data set:

1. PROBABILITY – The model outcome: a number between 0 and 1 that represents the probability that the data record has a value of 1 (in this case, the probability of the customer's attrition).

2. DECILE – The decile in which the record was classified: a number between 1 and 10. It is a coarser view of the probability, but more readable to the human eye and easier to interpret and use for further analysis.

3. DID_HIT – Whether the model's classification was correct. In our example, this will be 1 for customers classified as churners who truly churned, or classified as non-churners who are still our customers. DID_HIT will be 0 for customers classified as churners who are still our customers, or classified as non-churners who nevertheless churned.
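A rough Python sketch of how the three generated columns could be derived from a model score (the scores and the fixed 0.5 classification threshold are assumptions for this sketch; Analysis Studio's DECILE is rank-based over the scored data set, which the fixed banding below only approximates):

```python
# The probabilities below are hypothetical model outputs; "actual" holds
# the observed Attrition values they are compared against.
probs  = [0.91, 0.78, 0.63, 0.55, 0.42, 0.31, 0.18, 0.07]
actual = [1,    1,    0,    1,    0,    0,    1,    0]

def decile(p):
    # fixed 1-10 banding of the probability (1 = lowest risk, 10 = highest)
    return min(int(p * 10) + 1, 10)

rows = []
for p, a in zip(probs, actual):
    predicted = 1 if p >= 0.5 else 0      # classification threshold
    did_hit = 1 if predicted == a else 0  # 1 when the model was right
    rows.append({"PROBABILITY": p, "DECILE": decile(p), "DID_HIT": did_hit})

for r in rows:
    print(r)
```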

In many cases, when the probability is around 50% the model performs poorly. This is logical, since people or events with a 50% chance of occurring are very hard to predict.

Below you can see the output of deploying the model on the current dataset (Current Results).

Another way to deploy the model is simply to copy the model formula into a new field in any SQL engine and let the SQL tool compute the formula, and hence future predictions, for the given variable values.