What is Linear Regression?
Linear regression is an approach for modelling a linear relationship between a dependent variable and one or more independent variables. A model with a single independent variable is called ‘simple linear regression’, while a model with multiple independent variables is called ‘multiple linear regression’. When multiple dependent variables exist within a model, it is called ‘multivariate linear regression’. (Freedman, David A., 2009)
Linear regression owes much of its popularity to its simplicity: the relationships are modelled through linear predictor functions, and the unknown parameters are estimated from the existing data. Models built on non-linear predictor functions are considerably harder to fit. Certain assumptions are made about the variables and their relationships whenever these predictor functions are employed. A number of extensions have been designed to relax the assumptions of the standard linear regression model, but each one trades away some of the system's simplicity for added complexity.
The practical applications of linear regression in machine learning fall into two main types:
- It can be used to fit a predictive model to an observed dataset of independent variables and their corresponding responses, which can then be used for forecasting. When corresponding responses are missing from the dataset, the fitted model can supply the most probable values (see the sketch after this list).
- Linear regression can also be used to quantify the strength of the relationship between the response variable and the independent variables, in particular how much of the variation in the response each independent variable explains.
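As a minimal sketch of the predictive use case, the snippet below fits a scikit-learn `LinearRegression` on a hypothetical feature matrix `X` and response vector `y`; the variable names and simulated data are illustrative assumptions, not from the source.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 100 observations, 3 independent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)   # estimate coefficients from the data
print(model.coef_, model.intercept_)   # learned parameters
print(model.predict(X[:5]))            # most probable responses for given rows
```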
It must be noted that the terms ‘least squares’ and ‘linear model’ are closely related, but they are by no means synonymous. Least squares is a fitting method that can also be applied to non-linear models, and it works by minimizing the lack of fit between the model and the dataset; linear models, in turn, can be fitted by methods other than least squares.
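Concretely, ordinary least squares chooses the coefficients that minimize the sum of squared residuals (a standard textbook formulation, not specific to the source):

```latex
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \beta \right)^2
```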
The linear regression algorithm produces better results when the features influence the output through linear, independent effects. When the features influence the output in a more complex way, other algorithms might perform better; in either case, the developer has to establish whether the interactions are linear or non-linear to avoid confusion.
Core Concepts when used in Machine Learning
The linear regression model describes a fitted relationship between a predictor variable and the response variable under the condition that all other predictor values within the model are held fixed. By ‘fixed’, it means that the focus is only on those subgroups of the data that share the same value for the corresponding predictor variable.
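For reference, the standard form of the multiple linear regression model, with the error term written as epsilon (a textbook formulation, not specific to the source):

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon
```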
When more than one feature affects the response variable, each coefficient captures a ‘unique effect’ within a complex system, which can be interpreted as the causal effect linked to that predictor variable. However, in multiple regression analysis, the interrelationship between predictor variables and response variables is hard to disentangle. In such cases, commonality analysis can shed light on both the exclusive and the common influences of correlated independent variables. (Warne, Russell T., 2011)
Linear Regression Modelling Assumptions
There are certain assumptions made regarding the interactions between predictor variables and response variables in linear regression models. Some of them are as follows:
- Homoscedasticity (Constant Variance)
This assumption states that the variance of the errors in the response variable does not depend on the values of the predictor variables; the errors have uniform variance throughout. Simply put, a residual plot will show an even spread of errors around the regression line (see the diagnostics sketch after this list).
- Independence of Errors
The response variable errors are independent of each other.
- Linearity
This assumption states that the mean of the response variable is a linear combination of the regression coefficients and the predictor values. Since the predictor values can always be transformed, in practice this is a restriction on the regression coefficients rather than on the predictors.
- Lack of Perfect Multicollinearity
In standard linear regression, the assumption is specifically an absence of perfect multicollinearity among the predictor variables; that is, no predictor can be an exact linear combination of the others. Perfect multicollinearity can arise when there are too few data points, or when a predictor variable is erroneously repeated in the model, either as an unaltered copy or as a linear transformation of the original.
- Effect Sparsity
When a large number of candidate effects are included in the model, it is assumed that only a few of them actually matter; to find a better fit for the linear model, the bulk of the effects is set equal to zero.
- Weak Exogeneity
The predictor variables can be treated as fixed values, free of measurement error, rather than as random variables.
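A minimal sketch of how two of these assumptions are commonly checked in practice, using a statsmodels OLS fit on hypothetical data; the simulated variables and the diagnostic readout are illustrative assumptions, not from the source.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(X)).fit()  # ordinary least squares fit
residuals = results.resid
fitted = results.fittedvalues

# Homoscedasticity: residual spread should not vary with the fitted values.
print(np.corrcoef(np.abs(residuals), fitted)[0, 1])  # near 0 under constant variance

# Independence of errors: a Durbin-Watson statistic near 2 suggests uncorrelated errors.
print(durbin_watson(residuals))
```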
A number of extensions of linear regression permit some or all of the aforementioned assumptions to be relaxed. They are as follows:
Measurement Errors model
When the predictor variables are measured with error, a standard linear regression model produces coefficient estimates that are biased towards zero (attenuation). Measurement-error (errors-in-variables) models account for this noise in the predictors explicitly.
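To illustrate the attenuation effect, the simulation below adds measurement noise to a predictor and shows the estimated slope shrinking toward zero; the noise levels and true slope are arbitrary assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
x_true = rng.normal(size=5000)
y = 2.0 * x_true + rng.normal(scale=0.5, size=5000)  # true slope is 2.0

for noise in (0.0, 0.5, 1.0):
    # Regress y on an error-prone measurement of x instead of x itself.
    x_observed = x_true + rng.normal(scale=noise, size=5000)
    slope = np.polyfit(x_observed, y, deg=1)[0]
    print(f"measurement noise {noise}: estimated slope {slope:.2f}")  # shrinks toward 0
```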
General Linear model
This model covers the instance when the response variable is a vector rather than a scalar. It is also called a ‘multivariate linear model’, which is distinct from a multivariable or multiple linear regression model.
Generalized Linear Model (GLM)
This model is for response variables that are bounded or discrete, for example (see the sketch after this list):
- When handling values that are always positive and vary over a large range, such as populations or prices.
- When modelling data arising from a choice among discrete alternatives, such as election results.
- When displaying values from an ordinal scale, such as ratings on a 0-5 response scale.
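As a sketch of the discrete-choice case, the snippet below fits a logistic (binomial) GLM with statsmodels on simulated binary outcomes; the data-generating process is an assumption made purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(500, 1)))
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, 1.5]))))  # true log-odds model
y = rng.binomial(1, p)                                  # binary responses

glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link by default
print(glm.params)  # estimates close to [0.5, 1.5]
```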
Heteroscedastic model
This model is used when the response variables have different variances among their errors.
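One common way to handle unequal error variances is weighted least squares, which downweights the noisier observations; below is a minimal statsmodels sketch, assuming (for illustration only) that the error variances are known.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=300)
sigma = 0.5 * x                                   # error spread grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)
wls = sm.WLS(y, X, weights=1.0 / sigma**2).fit()  # weight = inverse error variance
print(wls.params)
```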
Multilevel Regression (Hierarchical Linear model)
This model uses a nested hierarchical structure: for example, an organization within an industry, the departments within that organization, the employees within each department, and the performance of each employee. The response variable for each employee can in turn be modelled from nested predictors such as educational qualification, professional experience, monthly/annual performance, and salary scale.
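A minimal sketch of a two-level model using statsmodels' `MixedLM`, where employees (rows) are nested within hypothetical departments; the column names, group structure, and effect sizes are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_departments, n_per_dept = 10, 30
dept = np.repeat(np.arange(n_departments), n_per_dept)
dept_effect = rng.normal(scale=1.0, size=n_departments)[dept]  # shared within a department
experience = rng.uniform(0, 20, size=dept.size)
performance = 50 + 0.8 * experience + dept_effect + rng.normal(size=dept.size)

df = pd.DataFrame({"performance": performance, "experience": experience, "dept": dept})
model = smf.mixedlm("performance ~ experience", df, groups=df["dept"]).fit()
print(model.params)  # fixed effect for experience plus a department-level variance
```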
Simple and Multiple Linear Regression model
The simple linear regression model has scalar values for both the predictor and the response variable. When the predictor variable becomes a vector, the model turns into a multiple linear regression model. Almost every practical regression model involves multiple predictor variables, though the response variable may still be a scalar.
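The difference is essentially the shape of the design matrix; below is a sketch using NumPy's least-squares solver on synthetic data (the names and coefficients are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100

# Simple linear regression: one scalar predictor per observation.
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)
A = np.column_stack([np.ones(n), x])          # design matrix: intercept + x
print(np.linalg.lstsq(A, y, rcond=None)[0])   # ~[1.0, 2.0]

# Multiple linear regression: each observation carries a vector of predictors.
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)
A = np.column_stack([np.ones(n), X])
print(np.linalg.lstsq(A, y, rcond=None)[0])   # ~[1.0, 2.0, -1.0, 0.5]
```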
Practical Application
Linear regression models have great practical usage within multiple disciplines, such as the behavioural and social sciences, biomedicine, and economics, in addition to machine learning. A program developer therefore has a wide range of established techniques to draw on for problems in each of these fields. Some examples are as follows:
Econometrics
A number of empirical findings within the discipline of econometrics are based on linear regression: for example, labor demand and supply, national spending on imports and exports, and consumption patterns.
Environmental Sciences
There is a lot of scope for linear regression applications in environmental sciences, for example, the measurement of air quality within a city and the impact of climate change.
Epidemiology
Linear regression analysis is very useful for investigating causal linkages in the observational data collected in epidemiological research studies. Studies of the effects of alcohol consumption on organ failure, morbidity, and mortality, for example, employ linear regression models. If alcohol consumption is set as the independent variable and the average lifespan of a user as the dependent variable, the researcher may add socio-economic status and educational background as additional independent variables to adjust for confounding.
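A sketch of the adjustment described above, using a statsmodels formula on a hypothetical data frame; the column names (alcohol, lifespan, ses, education) and the data-generating process are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "alcohol": rng.uniform(0, 10, n),     # drinks per week (illustrative)
    "ses": rng.normal(size=n),            # socio-economic status score
    "education": rng.integers(8, 20, n),  # years of schooling
})
df["lifespan"] = (80 - 0.4 * df["alcohol"] + 1.5 * df["ses"]
                  + 0.2 * df["education"] + rng.normal(size=n))

# Confounders enter as additional independent variables in the formula.
fit = smf.ols("lifespan ~ alcohol + ses + education", data=df).fit()
print(fit.params["alcohol"])  # alcohol effect, holding ses and education fixed
```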
Finance
The beta coefficient of the linear regression model is employed directly in the financial capital asset pricing model, where it quantifies the systematic risk of an investment.
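The beta in question is the slope obtained by regressing an asset's excess returns on the market's excess returns; below is a minimal NumPy sketch on simulated return series (the return data and the true beta of 1.2 are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(8)
market_excess = rng.normal(scale=0.01, size=252)  # daily market excess returns
asset_excess = 1.2 * market_excess + rng.normal(scale=0.005, size=252)

# Slope of the fitted line is the CAPM beta; the intercept is alpha.
beta, alpha = np.polyfit(market_excess, asset_excess, deg=1)
print(f"estimated beta: {beta:.2f}")
```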
Machine Learning
Linear regression is among the primary supervised machine learning algorithms: it keeps system modelling simple and interpretable while still serving as a building block within more complex artificial intelligence systems.
References
- Freedman, David A. (2009). Statistical Models: Theory and Practice. Cambridge University Press. p. 26.
- Warne, Russell T. (2011). “Beyond multiple regression: Using commonality analysis to better understand R2 results”. Gifted Child Quarterly. 55 (4): 313–318. https://journals.sagepub.com/doi/10.1177/0016986211422217
- Hauskrecht, M. “Linear Regression (Machine Learning)” (PDF). University of Pittsburgh. https://people.cs.pitt.edu/~milos/courses/cs2750-Spring03/lectures/class6.pdf