logistic regression
Given a binary response variable $Y$ with probability of success $p$, logistic regression is a nonlinear regression model with the following model equation:
$\operatorname{E}[Y]=\frac{\operatorname{exp}(\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta})}{1+\operatorname{exp}(\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta})},$
where $\boldsymbol{X}^{{\operatorname{T}}}\boldsymbol{\beta}$ is the product of the transpose of the column matrix $\boldsymbol{X}$ of explanatory variables and the unknown column matrix $\boldsymbol{\beta}$ of regression coefficients. Rewriting this so that the right hand side is $\boldsymbol{X}^{{\operatorname{T}}}\boldsymbol{\beta}$, we arrive at a new equation
$\ln\Big(\frac{\operatorname{E}[Y]}{1-\operatorname{E}[Y]}\Big)=\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta}.$
The left hand side of this new equation is known as the logit function, defined on the open unit interval $(0,1)$ with range the entire real line $\mathbb{R}$:
$\operatorname{logit}(p):=\ln\Big(\frac{p}{1-p}\Big)\mbox{ where }p\in(0,1).$
Note that the logit of $p$ is the natural log of the odds of success (to failure) when the probability of success is $p$. Since $Y$ is a binary response variable, it has a binomial distribution (with a single trial) with parameter (probability of success) $p=\operatorname{E}[Y]$, so the logistic regression model equation can be rewritten as
$\operatorname{logit}\big(\operatorname{E}[Y]\big)=\operatorname{logit}(p)=\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta}.$  (1)
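As a small numerical illustration of Equation (1), the following sketch evaluates the logit and its inverse (the logistic function) in plain Python. The coefficient and covariate values are hypothetical, chosen only to show that applying the logit to the modeled mean recovers the linear predictor.

```python
import math

def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Logistic function: maps a real linear predictor eta into [0, 1]."""
    # Branch on the sign of eta so that exp() is never called on a large
    # positive argument (which would overflow).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Hypothetical values: the first entry of x plays the role of an intercept.
beta = [-1.0, 0.5]
x = [1.0, 3.0]

eta = sum(b * xi for b, xi in zip(beta, x))  # X^T beta
p = inv_logit(eta)                           # E[Y] under the model

print(eta, p, logit(p))
```

Here `logit(p)` agrees with `eta` up to floating-point rounding, which is the content of Equation (1).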
Logistic regression is a particular type of generalized linear model. In addition, the associated logit function is the natural choice for a link function. By natural we mean that $\operatorname{logit}(p)$ is equal to the natural parameter $\theta$ appearing in the distribution function for the GLM (generalized linear model). To see this, first note that the distribution function for a binomial random variable $Y$ is
$P(Y=y)=\left({n\atop y}\right)p^{y}(1-p)^{n-y},$
where $n$ is the number of trials and $Y=y$ is the event that there are $y$ successes in these $n$ trials. The parameter $p$ is the probability of success. Let there be $N$ independent binomial random variables $Y_{1},Y_{2},\ldots,Y_{N}$, with $Y_{i}$ corresponding to $n_{i}$ trials with probability of success $p_{i}$. Then the joint probability distribution of these $N$ random variables is simply the product of the individual binomial distributions. Equating this to the distribution for the GLM, which belongs to the exponential family of distributions, we have:
$\prod_{i=1}^{N}\left({n_{i}\atop y_{i}}\right){p_{i}}^{y_{i}}(1-p_{i})^{n_{i}-y_{i}}=\prod_{i=1}^{N}\operatorname{exp}\big[y_{i}\theta_{i}-b(\theta_{i})+c(y_{i})\big].$
Taking the natural log on both sides, we have the equality of the log-likelihood function in two different forms:
$\sum_{i=1}^{N}\big[\ln\left({n_{i}\atop y_{i}}\right)+y_{i}\ln p_{i}+(n_{i}-y_{i})\ln(1-p_{i})\big]=\sum_{i=1}^{N}\big[y_{i}\theta_{i}-b(\theta_{i})+c(y_{i})\big].$
Rearranging the left hand side and comparing term $i$, we have
$y_{i}\ln\Big(\frac{p_{i}}{1-p_{i}}\Big)+n_{i}\ln(1-p_{i})+\ln\left({n_{i}\atop y_{i}}\right)=y_{i}\theta_{i}-b(\theta_{i})+c(y_{i}),$
so that $\theta_{i}=\ln\big(p_{i}/(1-p_{i})\big)=\operatorname{logit}(p_{i})$.
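The term-by-term comparison can be checked numerically. The sketch below assumes the identification $b(\theta)=n\ln(1+e^{\theta})$, which equals $-n\ln(1-p)$, and $c(y)=\ln\left({n\atop y}\right)$; these follow from matching the remaining terms but are not spelled out in the text above. The values of `n` and `p` are arbitrary.

```python
import math

def binom_pmf(y, n, p):
    """Binomial probability P(Y = y) in its usual form."""
    return math.comb(n, y) * p**y * (1.0 - p) ** (n - y)

def binom_pmf_expfam(y, n, p):
    """The same probability written as exp[y*theta - b(theta) + c(y)]."""
    theta = math.log(p / (1.0 - p))          # natural parameter = logit(p)
    b = n * math.log(1.0 + math.exp(theta))  # b(theta) = -n * ln(1 - p)
    c = math.log(math.comb(n, y))            # c(y) = ln C(n, y)
    return math.exp(y * theta - b + c)

n, p = 10, 0.3  # illustrative values only

# Largest disagreement between the two forms over all possible outcomes y.
max_gap = max(
    abs(binom_pmf(y, n, p) - binom_pmf_expfam(y, n, p)) for y in range(n + 1)
)
print(max_gap)
```

The two forms should agree up to rounding error for every $y$, which is consistent with $\theta=\operatorname{logit}(p)$ being the natural parameter.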
Next, setting the natural link function, the logit of the success probability $p_{i}$ (the expected value of $Y_{i}/n_{i}$), equal to the linear portion of the GLM, we have
$\operatorname{logit}(p_{i})={\boldsymbol{X}_{i}}^{\operatorname{T}}\boldsymbol{\beta}.$
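To make the model concrete, here is a minimal sketch of estimating $\boldsymbol{\beta}$ by maximum likelihood using Newton-Raphson iterations, for the special case of one explanatory variable plus an intercept and ungrouped binary data ($n_{i}=1$). The article does not prescribe a fitting method, so the algorithm, the function name `fit_logistic`, and the data are all assumptions made for illustration. The sketch has no step-size control and would not converge on perfectly separable data.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson for logit(p_i) = b0 + b1 * x_i with binary y_i."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of X^T W X (the negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)  # variance of a single binary response
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 linear system (X^T W X) d = g explicitly.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1

# Made-up, non-separable data: larger x tends to go with y = 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(b0, b1)
```

At the fitted values the gradient of the log-likelihood vanishes, that is, $\sum_i (y_i-p_i)=0$ and $\sum_i (y_i-p_i)x_i=0$, which gives a simple way to check the result.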
Remarks.

Comparing the model equation for logistic regression to that of the normal or Gaussian linear regression model, we see that the difference is in the choice of link function. In the normal linear model, the regression equation looks like
$\operatorname{E}[Y]=\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta}.$  (2)
The link function in this case is the identity function. The model equation is consistent because the linear terms on the right hand side allow $\operatorname{E}[Y]$ on the left hand side to vary over the reals. However, for a binary response variable, Equation (2) would not be appropriate, as the left hand side is restricted to the unit interval, whereas the right hand side can fall outside of $(0,1)$. Therefore, Equation (1) is more appropriate when we are dealing with a binary response variable.
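The range problem with the identity link is easy to see numerically. In the sketch below the coefficients `beta0`, `beta1` and the covariate values are hypothetical; the same linear predictor is read once as a mean directly, as in Equation (2), and once through the inverse logit, as in Equation (1).

```python
import math

beta0, beta1 = 0.2, 0.3  # hypothetical regression coefficients
xs = [-5.0, 0.0, 5.0]    # hypothetical values of one explanatory variable

# Identity link: the linear predictor itself is taken to be E[Y].
identity_means = [beta0 + beta1 * x for x in xs]

# Logit link: the linear predictor is mapped back through the logistic function.
logistic_means = [1.0 / (1.0 + math.exp(-(beta0 + beta1 * x))) for x in xs]

for x, m_id, m_lg in zip(xs, identity_means, logistic_means):
    print(x, m_id, m_lg)
```

For the extreme covariate values the identity-link "probabilities" fall below 0 or above 1, while the logit-link values stay inside the unit interval.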

The logit function is not the only choice of link function for a binary response. Other, “non-natural” link functions are available. Two such examples are the probit function, or the inverse cumulative normal distribution function $\Phi^{-1}(p)$, and the complementary log-log function $\ln(-\ln(1-p))$. Both of these functions map the open unit interval to $\mathbb{R}$.
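The three link functions can be compared side by side. This is a minimal sketch using only the Python standard library; `statistics.NormalDist().inv_cdf` supplies $\Phi^{-1}$, and the probabilities in `probs` are arbitrary illustrative values.

```python
import math
from statistics import NormalDist

def logit(p):
    """Natural link: log-odds of p."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)

def cloglog(p):
    """Complementary log-log link: ln(-ln(1 - p))."""
    return math.log(-math.log(1.0 - p))

probs = [0.05, 0.25, 0.5, 0.75, 0.95]  # illustrative probabilities
table = [(p, logit(p), probit(p), cloglog(p)) for p in probs]

for row in table:
    print(row)
```

All three are increasing in $p$. The logit and probit are antisymmetric about $p=1/2$ (both vanish there), whereas the complementary log-log is not.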
Mathematics Subject Classification
62J12, 62J02