Predictive analytics 101-4: Logistic regression
WHAT CAN YOU LEARN IN THIS LECTURE?
In this lecture I explain a statistical model called the "logistic regression model" and how it can be used to predict the "probability of default". This lecture presents the basic structure of the model. The next lecture explains how to program it in the R language. I use as little math as possible so that beginners in data analysis can follow easily. Let us start now!
If you are not familiar with "probability of default", let me explain it. When we apply for a bank loan, we hope to borrow the amount we need at a low interest rate. You may then wonder how banks decide who can borrow the requested amount at a low rate, in other words, how banks assess a customer's creditworthiness. This is where "probability of default" is used. "Probability of default" is the likelihood that a customer will default within a certain period, such as one year. To keep the story simple, I use unsecured loans, that is, loans without collateral, as the example.
TARGET, FEATURES AND STATISTICAL MODEL, AGAIN!
Do you remember the target, features and statistical model that I explained in the earlier lecture on linear regression? If not, please go back to the introductory lecture, as they are critically important for us. In this lecture, I introduce a new model called the "logistic regression model". It can be represented in the chart below. The target is "y". Unlike in the linear regression model, "y" must be either "1" or "0": "1" means "something occurs" and "0" means "something does not occur". The features are x1, x2 and x3. The statistical model in this lecture is the logistic regression model. You might be wondering what exactly a logistic regression model is. I will explain it step by step, using "probability of default" as the example. (In general, the number of features can be more than three.)
PARAMETERS SHOULD BE OBTAINED TO PREDICT THE TARGET
As in the linear regression model, the formula contains θ. θ is called the "parameter" or "weight" of each feature. As the formula shows, each θ is multiplied by the corresponding feature x, and all the products are added up. This sum, θX, is then used to obtain the prediction of the target y. In other words, the features x are the inputs, and they are weighted by the parameters θ. The formulas of the logistic regression model are also presented here. The values of θ are unknown at first, so we have to estimate them. Once the value of each θ is obtained, we can compute predictions of the target. The parameters θ are therefore critically important for accurate predictions, and in practice most of the computational effort goes into obtaining them. Some of you may be seeing parameters for the first time. It pays to become familiar with how θ works, because more advanced models also have parameters θ. The values of θ are usually estimated from a large amount of data, which is why collecting data is so important in practice. In the next lecture, I will explain how to calculate θ.
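For readers who like to see formulas as code, here is a minimal sketch in Python (my own choice of language for illustration; the next lecture uses R). It assumes the two formulas take the form used in the worked example later in this lecture: a weighted sum θX with an intercept, followed by y = 1/(1+exp(-θX)). The function names and the example numbers are hypothetical and are not estimated from any data.

```python
import math

def linear_score(theta, x):
    """Return thetaX = theta[0] + theta[1]*x[0] + theta[2]*x[1] + ..."""
    # theta[0] is the intercept; each remaining theta weights one feature.
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def logistic(score):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_probability(theta, x):
    """Weight the features with theta, then apply the logistic function."""
    return logistic(linear_score(theta, x))

# Hypothetical parameters and features, chosen only for illustration.
example_theta = [0.5, -1.0, 2.0, 0.25]   # intercept, then one weight per feature
example_x = [1.0, 0.5, 2.0]              # x1, x2, x3
example_probability = predict_probability(example_theta, example_x)
print(example_probability)
```

Whatever values θ and x take, the result always lies between 0 and 1, which is what allows us to read it as a probability.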
LET US START PREDICTING PROBABILITY OF DEFAULT
"y" is "target". In this lecture, the target is "probability of default" which is "1" default or "0" not default. Then let us take examples and see how "Logistic regression model" can predict the probability of default for each customer in more details. Each customer has values of “features”. Please look at the table below. It is the simple example of "features". Each customer has features 1 (amount of borrowing) and feature 2 (times of delinquency). Based on the values of “features”, each customer obtains his/her own “one value” of "target"of "probability of default"according to formulas above.
Let us look at Steeve's case. Based on my calculation (I explain it later!), Steeve's θX is about -23.8. According to the graph below, Steeve's probability of default, shown on the y-axis, is close to 0. Steeve has a low probability of default, which means that he is unlikely to default in the near term. The curve below is called the "logistic curve".
Let us look at Hanna's case. Based on my calculation, Hanna's θX is about 23.8. According to the graph below, Hanna's probability of default, shown on the y-axis, is close to 1. Hanna has a high probability of default, which means that she is likely to default in the near term.
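If you want to check the shape of the logistic curve yourself, the following short Python sketch evaluates y = 1/(1+exp(-θX)) at the two θX values quoted above, plus θX = 0 as a reference point that I added. It is only an illustration of how to read the curve, and the variable names are my own.

```python
import math

def logistic(score):
    """Logistic curve: converts a score thetaX into a probability."""
    return 1.0 / (1.0 + math.exp(-score))

steeve_probability = logistic(-23.8)    # thetaX quoted for Steeve
midpoint_probability = logistic(0.0)    # reference point added for illustration
hanna_probability = logistic(23.8)      # thetaX quoted for Hanna

for name, probability in [("Steeve", steeve_probability),
                          ("thetaX = 0", midpoint_probability),
                          ("Hanna", hanna_probability)]:
    print(f"{name}: probability of default = {probability:.6f}")
```

A large negative θX is pushed toward 0, a large positive θX toward 1, and θX = 0 sits exactly in the middle at 0.5.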
Here is a summary of my calculations of the probability of default. The blue data are provided in advance and the red data are the results of the calculations. The probability of default is obtained from the two formulas above. In Steeve's case, θX = -129.106 + 1.923*45 + 18.769*1 = -23.802 and y = 1/(1+exp(-θX)), which is very close to 0.
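Steeve's arithmetic can be reproduced in a few lines of Python. This is a sketch that reuses the numbers quoted in this lecture. I assume that -129.106 is the intercept, and that 45 and 1 are the values of feature 1 and feature 2, in the order the terms appear in the formula. The variable names are my own.

```python
import math

# Parameter values quoted in the worked example.
theta_0 = -129.106   # assumed to be the intercept
theta_1 = 1.923      # weight of the first feature term
theta_2 = 18.769     # weight of the second feature term

# Steeve's feature values as they appear in the worked example
# (assumed order: feature 1, then feature 2).
x1 = 45
x2 = 1

# Step 1: weighted sum of the features.
theta_x = theta_0 + theta_1 * x1 + theta_2 * x2

# Step 2: logistic function turns the sum into a probability.
probability = 1.0 / (1.0 + math.exp(-theta_x))

print(round(theta_x, 3))
print(probability)
```

The first printed value matches the θX of about -23.8 used for Steeve, and the second is a tiny positive number, in line with a probability of default close to 0.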
Although there are other methods of prediction, the logistic regression model is, as far as I know, widely used in many industries. In theory, the probability of default can be obtained for many kinds of customers, from individuals to big companies and sovereigns. In practice, however, more data are available for loans to individuals and small and medium-sized enterprises (SMEs) than for loans to big companies. The more data are available, the more accurately the probability of default can be predicted.
I hope you now understand how the target, the probability of default, is obtained. More generally, the logistic regression model can be used to predict "Who is likely to buy the products?", "Who is likely to be cured?", "What is likely to fail?" and "What movie is likely to be popular?". In the next lecture, I will explain how the parameters θ can be obtained with the R language. See you again!
October 4, 2015 : The lecture is released
Notice: TOSHI STATS SDN. BHD. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithm or ideas contained herein, or acting or refraining from acting as a result of such use. TOSHI STATS SDN. BHD. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on TOSHI STATS SDN. BHD. and me to correct any errors or defects in the codes and the software.