Ordinary Least Squares Problems#
Introduction#
Ordinary least squares is a statistical method for estimating the parameters of a regression model, i.e., for finding the best-fit model that predicts the values of the response variable (dependent variable) from the values of the observed input variables (independent variables).
We start the lecture by introducing linear regression models and the learning problem, and then explore two common types of ordinary least squares problems:
A linear regression model is a statistical model used to analyze the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as explanatory variables or predictors) that are believed to influence the dependent variable.
Unconstrained regression problems. In unconstrained problems, the regression model parameters can take on any values and are not limited in any way.
Constrained regression problems. In constrained problems, the regression model parameters can be bound in some way, e.g., by physical reasoning or by some relationships between the parameters.
Linear regression models#
A linear regression model is a statistical model used to analyze the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as explanatory variables or predictors) that are believed to influence the dependent variable (Definition 34):
Definition 34 (Linear regression model)
There exists a dataset \(\mathcal{D} = \left\{\mathbf{x}_{i},y_{i}\right\}_{i=1}^{n}\) where \(\mathbf{x}_{i}\) is a p-vector of inputs (independent variables) and \(y_{i}\) denotes a scalar response variable (dependent variable). Then, a linear regression model for \(\mathcal{D}\) takes the form:
\[
y_{i} = \mathbf{x}_{i}^{T}\mathbf{\beta} + \epsilon_{i}\qquad{i=1,2,\dots,n}
\]
where \(\mathbf{x}_{i}\) is a p-dimensional vector of inputs, \(\mathbf{\beta}\) is a \(p\times{1}\) vector of unknown model parameters, and \(\epsilon_{i}\) represents unobserved random errors. Eqn. (62) can be re-written in matrix form as:
\[
\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\]
where \(\mathbf{y}\) denotes the \(n\times{1}\) vector of responses, \(\mathbf{X}\in\mathbb{R}^{n\times{p}}\) is the data matrix whose rows are \(\mathbf{x}_{i}^{T}\), and \(\mathbf{\epsilon}\) is the \(n\times{1}\) vector of errors.
Linear regression models are widely used to model physical, chemical or economic phenomena, e.g., a model of the rate of a chemical reaction, the return of a stock, the relationship between the home price and square footage, etc.
Fig. 15 Schematic of linear regression. Unknown model parameters \(\mathbf{\beta}\) are chosen so that the squared difference (residual) between model predicted \(\hat{y}_{i}=\mathbf{x}_{i}^{T}\mathbf{\beta}\) and observed \(y_{i}\) output variables is minimized.#
Sum of squared errors loss function#
Starting from observations, ordinary least squares estimates the value of the unknown parameters \(\mathbf{\beta}\) that appear in a linear regression model by minimizing the sum of squared errors between model estimates and observed values as illustrated in Fig. 15 and defined in (Definition 35):
Definition 35 (Sum of squared error loss function)
There exists a dataset \(\mathcal{D} = \left\{\mathbf{x}_{i},y_{i}\right\}_{i=1}^{n}\) where \(\mathbf{x}_{i}\) is a p-vector of observed inputs and \(y_{i}\) denotes an observed response variable. Further, suppose we model the dataset \(\mathcal{D}\) using the linear regression model:
\[
\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\]
Best-fit values for the unknown parameters \(\mathbf{\beta}\) can be estimated by solving the optimization problem:
\[
\hat{\mathbf{\beta}} = \arg\min_{\mathbf{\beta}}\;||\mathbf{y} - \mathbf{X}\mathbf{\beta}||^{2}_{2}
\]
where \(||\star||^{2}_{2}\) is the square of the p = 2 (Euclidean) vector norm; see our previous discussion of matrix and vector norms.
Ordinary least squares is an example of a supervised learning approach, i.e., we use measured inputs and outputs to estimate parameters appearing in the model.
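To make the loss function concrete, here is a minimal sketch in Julia (with a hypothetical, randomly generated dataset and an assumed parameter vector, for illustration only) of evaluating the sum of squared errors for a candidate parameter vector:
# load the LinearAlgebra package
using LinearAlgebra
# hypothetical dataset: n = 100 observations, p = 3 inputs
n, p = 100, 3;
X = rand(n, p);              # data matrix (rows are the inputs xᵢᵀ)
β = [1.0, -2.0, 0.5];        # assumed "true" parameters (illustration only)
y = X*β + 0.1*randn(n);      # noisy observed responses
# sum of squared errors for a candidate parameter vector b
loss(b) = norm(y - X*b)^2;
println("Loss at the assumed parameters: $(loss(β))")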
Unconstrained regression problems#
In an unconstrained least-squares problem, we find parameter values that minimize the difference between the model predictions and the observed data. Least-squares problems are unconstrained when there are no constraints on the permissible values of the parameters.
However, the data matrix \(\mathbf{X}\in\mathbb{R}^{n\times{p}}\) can have different shapes:
Square: the number of observations (rows) is the same as the number of parameters (columns), in which case the system is square (\(n = p\)).
Overdetermined: the data matrix \(\mathbf{X}\) has more observations (rows) than parameters (columns), in which case the system is overdetermined (\(n\gg{p}\)).
Underdetermined: the data matrix \(\mathbf{X}\) has fewer observations (rows) than parameters (columns), in which case the system is underdetermined (\(n\ll{p}\)).
Square regression problems#
Suppose we are estimating the values of the unknown parameters \(\mathbf{\beta}\) in a linear regression model:
\[
\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\]
where the data matrix \(\mathbf{X}\in\mathbb{R}^{p\times{p}}\) has the same number of rows and columns as the number of unknown parameters. In this case, Eqn (65) is a square system of linear algebraic equations which we can solve for the unknown parameter vector \(\mathbf{\beta}\) by solving the system of equations:
\[
\mathbf{X}\hat{\mathbf{\beta}} = \mathbf{y}
\]
either directly or iteratively, see our previous discussion of Linear algebraic equations. However, for this approach to be applicable, the inverse of the data matrix must exist, see Definition 23.
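For example, a minimal sketch in Julia (with a hypothetical random data matrix, assumed here to be invertible) of solving a square system for the parameters:
# load the LinearAlgebra package
using LinearAlgebra
# hypothetical square system: p observations, p parameters
p = 4;
X = rand(p, p);     # square data matrix (assumed invertible)
y = rand(p);
# solve Xβ̂ = y directly; the backslash solve is equivalent to inv(X)*y but more stable
β̂ = X \ y;
println("Residual norm: $(norm(X*β̂ - y))")   # should be ≈ 0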
Overdetermined regression problems#
In most applications, it is more likely that the data matrix \(\mathbf{X}\in\mathbb{R}^{n\times{p}}\) will be overdetermined (\(n\gg{p}\)). The inverse of an overdetermined data matrix cannot be computed directly. Instead, we solve a particular system of equations, called the normal equations, which transforms the original overdetermined problem into a square system:
Definition 36 (Normal solution overdetermined linear regression model)
There exists a dataset \(\mathcal{D} = \left\{\mathbf{x}_{i},y_{i}\right\}_{i=1}^{n}\) where \(\mathbf{x}_{i}\) is a p-dimensional vector of inputs (independent variables) and \(y_{i}\) denotes a scalar response variable (dependent variable), and \(n\gg{p}\). Further, suppose we model the dataset \(\mathcal{D}\) using the linear regression model:
\[
\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\]
Then, the value of the unknown parameter vector \(\mathbf{\beta}\) that minimizes the sum of squares loss function for an overdetermined system is given by:
\[
\hat{\mathbf{\beta}} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{y}
\]
The matrix \(\mathbf{X}^{T}\mathbf{X}\) is called the normal matrix, while \(\mathbf{X}^{T}\mathbf{y}\) is called the moment matrix. The existence of the normal solution \(\hat{\mathbf{\beta}}\) requires that the normal matrix inverse \(\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\) exists.
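Before turning to the pseudo-inverse computation below, here is a minimal sketch in Julia (hypothetical random data) of computing the normal solution directly:
# load the LinearAlgebra package
using LinearAlgebra
# hypothetical overdetermined dataset: n = 100 observations, p = 3 parameters
n, p = 100, 3;
X = rand(n, p);
y = rand(n);
# normal-equation estimate β̂ = (XᵀX)⁻¹Xᵀy; the backslash solve avoids forming the inverse explicitly
β̂ = (X' * X) \ (X' * y);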
Computation of the overdetermined matrix inverse#
Assuming the normal matrix inverse \(\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\) exists, we can compute it directly by forming the matrix product and inverting it with the inv function. Alternatively, we can compute the pseudo-inverse of the data matrix with the pinv function included in the LinearAlgebra package in Julia:
# load the LinearAlgebra package
using LinearAlgebra
# Define a random overdetermined array
A = rand(10,8)
# compute the pinverse
LI = pinv(A); # this will be an 8 x 10 matrix
# Test -
IM = LI*A;
println("Is IM the identity matrix -> check the trace: $(tr(IM))")
Is IM the identity matrix -> check the trace: 8.0
This matrix inverse is called a left Moore-Penrose inverse.
Underdetermined regression problems#
If the number of observations (rows) in the data matrix \(\mathbf{X}\) is less than the number of columns, i.e., the number of unknown parameters, the system is underdetermined. In general, there will be infinitely many solutions for an underdetermined system. How do we choose which solution to use?
Least-norm solutions#
The classic strategy to solve an underdetermined system is to estimate the smallest values (measured by some norm \(||\star||\)) for the unknown parameters \(\mathbf{\beta}\) such that they satisfy the original equations:
\[
\min_{\mathbf{\beta}}\;||\mathbf{\beta}||^{2}_{2}\quad\text{subject to}\quad\mathbf{X}\mathbf{\beta} = \mathbf{y}
\]
Thus, the solution of an underdetermined least squares problem is constrained, i.e., the possible values of \(\mathbf{\beta}\) are restricted somehow. An analytical solution for the parameter vector \(\mathbf{\beta}\) can be computed for the underdetermined case (Definition 37):
Definition 37 (Solution underdetermined linear regression model)
There exists a dataset \(\mathcal{D} = \left\{\mathbf{x}_{i},y_{i}\right\}_{i=1}^{n}\) where \(\mathbf{x}_{i}\) is a p-dimensional vector of inputs (independent variables) and \(y_{i}\) denotes a scalar response variable (dependent variable), and \(n\ll{p}\). Further, we model the dataset \(\mathcal{D}\) using the linear regression model:
\[
\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\]
Then, the least-norm solution for the unknown parameter vector \(\mathbf{\beta}\) is given by:
\[
\hat{\mathbf{\beta}} = \mathbf{X}^{T}\left(\mathbf{X}\mathbf{X}^{T}\right)^{-1}\mathbf{y}
\]
The existence of the least-norm solution \(\hat{\mathbf{\beta}}\) requires that the matrix inverse \(\left(\mathbf{X}\mathbf{X}^{T}\right)^{-1}\) exists.
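As a quick numerical check, a minimal sketch in Julia (hypothetical random data) comparing the analytical least-norm formula with the pinv function:
# load the LinearAlgebra package
using LinearAlgebra
# hypothetical underdetermined dataset: n = 8 observations, p = 10 parameters
n, p = 8, 10;
X = rand(n, p);
y = rand(n);
# least-norm estimate β̂ = Xᵀ(XXᵀ)⁻¹y
β̂ = X' * ((X * X') \ y);
println(isapprox(β̂, pinv(X)*y))   # expected: true
println(norm(X*β̂ - y))            # should be ≈ 0, i.e., β̂ satisfies Xβ = y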
Equivalently, we can solve for the unknown model parameters \(\mathbf{\beta}\) using singular value decomposition (SVD) of the data matrix (Definition 38):
Definition 38 (Singular value decomposition underdetermined system)
There exists a dataset \(\mathcal{D} = \left\{\mathbf{x}_{i},y_{i}\right\}_{i=1}^{n}\) where \(\mathbf{x}_{i}\) is a p-vector of inputs (independent variables) and \(y_{i}\) denotes a scalar response variable (dependent variable), and \(n\ll{p}\). Further, suppose we model the dataset \(\mathcal{D}\) using the linear regression model:
\[
\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\]
Then, the smallest values of the unknown parameter vector \(\mathbf{\beta}\) that satisfy the linear regression model are given by:
\[
\hat{\mathbf{\beta}} = \mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{U}^{T}\mathbf{y}
\]
where \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}\) is the (thin) singular value decomposition of the data matrix and \(\mathbf{\Sigma}\) is the diagonal matrix of (nonzero) singular values.
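A minimal sketch in Julia (hypothetical random data) of the SVD route, using the svd function from the LinearAlgebra package:
# load the LinearAlgebra package
using LinearAlgebra
# hypothetical underdetermined dataset: n = 8 observations, p = 10 parameters
n, p = 8, 10;
X = rand(n, p);
y = rand(n);
# thin SVD: X = U*Diagonal(S)*Vt
F = svd(X);
# least-norm estimate β̂ = V*Σ⁻¹*Uᵀ*y
β̂ = F.V * (Diagonal(F.S) \ (F.U' * y));
println(norm(X*β̂ - y))   # should be ≈ 0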
Computation of the underdetermined matrix inverse#
We can compute the matrix inverse \(\left(\mathbf{X}\mathbf{X}^{T}\right)^{-1}\), assuming it exists, by forming the matrix product and inverting it with the inv function. Alternatively, we can compute the pseudo-inverse of the data matrix with the pinv function included in the LinearAlgebra package in Julia:
# load the LinearAlgebra package
using LinearAlgebra
# Define a random underdetermined array
A = rand(8,10)
# compute the pinverse
RI = pinv(A); # this will be a 10 x 8 matrix
# Test -
IM = A*RI;
println("Is IM the identity matrix -> check the trace: $(tr(IM))")
Is IM the identity matrix -> check the trace: 8.0
This matrix inverse is called a right Moore-Penrose inverse.
Constrained regression problems#
Constrained least squares estimates the parameters of a linear regression model subject to one or more constraints on the values the parameters can take, e.g., there exists prior knowledge or physical relationships that must be satisfied by the parameter estimates.
Lagrange multipliers#
Lagrange multipliers are a mathematical tool used in optimization to find a function’s maximum or minimum value subject to constraints. The method of Lagrange multipliers involves introducing additional variables, called Lagrange multipliers, which integrate the constraints into the loss function (Definition 39):
Definition 39 (Method of Lagrange multipliers)
To find the maximum or minimum of a function \(f(x)\) subject to the equality constraint \(g(x) = 0\), we can form the Lagrangian function:
\[
\mathcal{L}(x,\lambda) = f(x) + \lambda\,{g(x)}
\]
where \(\lambda\) is the Lagrange multiplier for constraint \(g(x)\), and \(\mathcal{L}(x,\lambda)\) is called the Lagrangian function. At a critical point (maximum or minimum), the partial derivatives of the Lagrangian function with respect to \(x\) and \(\lambda\) vanish:
\[
\frac{\partial\mathcal{L}}{\partial{x}} = 0,\qquad\frac{\partial\mathcal{L}}{\partial\lambda} = 0
\]
The system of equations defined by Eqn. (69), called the Lagrange equations, can be solved to find the critical points and, thus, the maximum or minimum value of the objective function.
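As a quick scalar illustration (a small worked example with an assumed objective and constraint): suppose we want to minimize \(f(x) = x^{2}\) subject to \(x - 1 = 0\). The Lagrangian is \(\mathcal{L}(x,\lambda) = x^{2} + \lambda\left(x - 1\right)\). Setting the partial derivatives to zero gives \(2x + \lambda = 0\) and \(x - 1 = 0\), so the critical point is \(x = 1\) with \(\lambda = -2\), which recovers the (obvious) constrained minimum.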
Let’s show how we can use the method of Lagrange multipliers to derive the least-norm solution for the underdetermined learning problem (Example 21):
Example 21 (Derivation least norm solution)
Derive the least norm solution to the optimization problem:
\[
\min_{\mathbf{\beta}}\;||\mathbf{\beta}||^{2}_{2}\quad\text{subject to}\quad\mathbf{X}\mathbf{\beta} = \mathbf{y}
\]
using the method of Lagrange multipliers to compute an estimate of the regression parameters \(\mathbf{\beta}\).
Solution: The first step is to form the Lagrangian function:
\[
\mathcal{L}(\mathbf{\beta},\lambda) = \mathbf{\beta}^{T}\mathbf{\beta} + \lambda^{T}\left(\mathbf{X}\mathbf{\beta} - \mathbf{y}\right)
\]
which we can differentiate with respect to \(\mathbf{\beta}\) and the Lagrange multipliers \(\lambda\):
\[
\frac{\partial\mathcal{L}}{\partial\mathbf{\beta}} = 2\mathbf{\beta} + \mathbf{X}^{T}\lambda = \mathbf{0},\qquad
\frac{\partial\mathcal{L}}{\partial\lambda} = \mathbf{X}\mathbf{\beta} - \mathbf{y} = \mathbf{0}
\]
We can solve the first equation for \(\mathbf{\beta}\) in terms of \(\lambda\):
\[
\mathbf{\beta} = -\frac{1}{2}\mathbf{X}^{T}\lambda
\]
which we can then substitute into the second equation to get an expression for \(\lambda = -2\left(\mathbf{X}\mathbf{X}^{T}\right)^{-1}\mathbf{y}\). Substituting the \(\lambda\) expression into the expression for \(\mathbf{\beta}\), i.e., eliminating the Lagrange multipliers, gives the least norm solution for \(\mathbf{\beta}\):
\[
\hat{\mathbf{\beta}} = \mathbf{X}^{T}\left(\mathbf{X}\mathbf{X}^{T}\right)^{-1}\mathbf{y}
\]
Penalty methods#
A penalty method transforms a constrained least squares problem into an unconstrained problem that can be solved. In a penalty method, a penalty is added to the loss function to encourage specific desirable properties of the solution.
Quadratic penalty functions#
Before we look at the applications of penalty methods, let’s consider a simple example to work out the basic strategy. This example was reproduced from the Mathematical Optimization course at Stanford. Suppose we wanted to solve the problem:
\[
\min_{x}\;f(x) = \frac{100}{x}\quad\text{subject to}\quad{x}\leq{5}
\]
Before starting, convert any constraints into the form (expression) \(\leq{0}\), so the \(x\leq{5}\) constraint becomes:
\[
x - 5\leq{0}
\]
Once the constraints have been converted, the next step is to start charging a penalty for violating them. Since we’re trying to minimize \(f(x)\), we need to add value when the constraint is violated. If you are trying to maximize, the penalty will subtract value. With the constraint \(x-5\leq{0}\) we need a penalty that is:
Constraint satisfied: 0 when \(x-5\leq{0}\)
Constraint violated: positive when \(x-5>0\).
This can be done by using a penalty \(P\left(x\right)\) of the form:
\[
P(x) = \left(\max\left(0, x - 5\right)\right)^{2}
\]
Eqn. (71) is a quadratic penalty (loss) function. This approach works for equality constraints as well. For example, suppose we had the constraint \(h(x) = c\), where \(c\) is a constant. We can convert this type of constraint into a penalty of the form:
\[
P(x) = \left(h(x) - c\right)^{2}
\]
The lowest penalty value in (72) will occur when \(h(x) = c\); thus, we satisfy the constraint. Once we have converted the constraints into penalty functions, we add all the penalty functions to the original objective function \(f(x) + \lambda\cdot{P(x)}\) and minimize the total function (objective plus penalties). For example, the original problem in (70) becomes:
\[
\min_{x}\;\frac{100}{x} + \lambda\cdot\left(\max\left(0, x - 5\right)\right)^{2}
\]
where \(\lambda\) is a hyper-parameter (a parameter associated with the method, not the problem) that is adjusted during the process to estimate the unknown value of \(x\) according to the policy (Observation 6):
Observation 6 (\(\lambda\)-policy penalty method)
For a penalty method, we start with a small \(\lambda\) and repeat the \(x\) estimation problem with larger and larger values of \(\lambda\). This makes constraint violation more expensive for each subsequent refinement of the estimate of \(x\).
With a penalty method, we can choose any value for the starting value of \(x\). Let’s do an example penalty method calculation that uses a simple evolutionary algorithm to generate new guesses for \(x\) (Example 22):
Example 22 (Penalty method example)
Solve the minimization problem for \(x\) given by Eqn. (70) using a quadratic penalty method for an initial guess of \(x = 20\) for \(\lambda = 1e3,1e6,1e9\).
"""
_loss(x::Float64; λ::Float64 = 1.0) -> Float64
Loss function for penalty method example.
"""
function _loss(x::Float64; λ::Float64 = 1.0)::Float64
    # compute the f(x) and the penalty -
    f = 100.0/x;
    P = (max(0.0, x - 5.0))^2;
    # return -
    return f + λ*P;
end
"""
main() -> Float64
Runs an evolutionary algorithm to estimate x̂ (min of loss function).
"""
function main()::Float64
    # initialize -
    Λ = [1e3,1e6,1e9];
    xₒ = 20.0;
    best_loss = Inf;
    x̂ = xₒ;
    max_number_of_iterations = 1000;
    # use a simple evolutionary algorithm -
    for λ ∈ Λ
        x′ = x̂ # initialize the current x, with the best value we have so far
        # refine our best guess
        for _ ∈ 1:max_number_of_iterations
            # compute the loss -
            l = _loss(x′, λ = λ);
            # is this loss *better* than our best solution so far?
            if (l < best_loss)
                x̂ = x′;
                best_loss = l;
            end
            # generate a new guess for x -
            x′ = x̂ + 0.1*randn()
        end
    end
    # return -
    return x̂;
end
# call the main -
x = main();
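For this problem, with \(x>0\), the objective \(f(x) = 100/x\) decreases as \(x\) increases, so the constrained minimum sits on the boundary at \(x = 5\) (where \(f = 20\)); the penalty estimate \(\hat{x}\) should therefore land at, or slightly above, \(x = 5\), with the overshoot shrinking as \(\lambda\) grows.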
Barrier functions#
Let’s rethink the problem shown in Eqn. (70). Suppose, instead of developing the penalty function shown in Eqn. (71) to minimize \(f(x)\), for a constraint of the form \(g(x)\leq{0}\) we developed a barrier function:
\[
B(x) = -\frac{1}{g(x)}
\]
As \(g(x)\rightarrow{0}\), i.e., as we approach constraint violation, the value of \(B\left(x\right)\rightarrow\infty\). In a similar way to the penalty approach shown in Eqn. (73), we could augment the objective (loss) function:
\[
\min_{x}\;\frac{100}{x} + \frac{1}{\lambda}\cdot{B(x)}
\]
The challenge of the barrier method shown in Eqn. (75) is selecting a starting point: the initial guess for \(x\) must lie inside the barrier (the feasible region). If this is true, we adjust the hyperparameter \(\lambda\) using the policy (Observation 7):
Observation 7 (\(\lambda\)-policy barrier method)
For a barrier method, we start with a large \(\lambda\) and repeat the \(x\) estimation problem with smaller and smaller values of \(\lambda\). This allows us to solve the problem near the solution with limited input from the barrier. When we gradually reduce \(\lambda\), using our previous solution as a starting point, we move closer and closer to the barrier.
Barrier methods have one critical pathology (Remark 5):
Remark 5 (Barrier method pathology)
If, for some reason, we jump over the barrier, e.g., we start outside the feasible region (the set of \(x\) values allowed by the constraints), or during the calculation, we generate a candidate solution outside the barrier, this approach can fail.
Let’s do an example barrier method calculation that uses a simple evolutionary algorithm to generate new guesses for \(x\) (Example 23):
Example 23 (Barrier method example)
Solve the minimization problem for \(x\) given by Eqn. (70) using a barrier method for an initial guess of \(x = 2.0\) for \(\lambda = 10,1,0.1\).
"""
_loss(x::Float64; λ::Float64 = 1.0) -> Float64
Loss function for barrier method example.
"""
function _loss(x::Float64; λ::Float64 = 1.0)::Float64
    # compute the f(x) and the barrier term -
    f = 100.0/x;
    B = -1/(x - 5);
    # return -
    return f + (1/λ)*B;
end
"""
main() -> Float64
Runs an evolutionary algorithm to estimate x̂ (min of loss function).
"""
function main()::Float64
    # initialize -
    Λ = [10,1.0,0.01];
    xₒ = 2.0;
    best_loss = Inf;
    x̂ = xₒ;
    max_number_of_iterations = 1000;
    # use a simple evolutionary algorithm -
    for λ ∈ Λ
        x′ = x̂ # initialize the current x, with the best value we have so far
        # refine our best guess
        for _ ∈ 1:max_number_of_iterations
            # compute the loss -
            l = _loss(x′, λ = λ);
            # is this loss *better* than our best solution so far?
            if (l < best_loss)
                x̂ = x′;
                best_loss = l;
            end
            # generate a new guess for x -
            x′ = x̂ + 0.1*randn()
        end
    end
    # return -
    return x̂;
end
# call the main -
x = main();
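Because the barrier term blows up as \(x\rightarrow{5}^{-}\), candidates that remain feasible stay strictly below \(x = 5\). Note, however, that this simple sketch does not guard against a candidate jumping past the barrier (the pathology described in Remark 5): for \(x>5\) the barrier term is negative, so such a candidate could be accepted, and a more careful implementation would reject any infeasible proposal outright.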
Application of penalty and barrier methods#
In the context of statistical modeling, penalty and barrier methods are often used to regularize the model, which means imposing constraints on the model parameters to prevent overfitting and improve the model’s generalization ability.
Several different types of penalty methods are commonly used in statistical modeling, including:
Ridge regression adds a penalty term to the loss function equal to the sum of the squares of the model parameters. This has the effect of shrinking the parameters towards zero and can help reduce the model’s variance (see the sketch after this list).
Lasso regression adds a penalty term to the loss function equal to the sum of the absolute values of the model parameters. This has the effect of setting some of the parameters to zero, which can help to select a subset of essential features and reduce the complexity of the model.
Group Lasso is similar to lasso regression but allows the user to group variables together and applies the penalty to the group rather than to each variable.
Elastic net combines the ridge and lasso regression penalties and allows users to tune the balance between the two penalties.
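As a concrete illustration of ridge regression referenced above, here is a minimal sketch in Julia (hypothetical random data, with an arbitrarily chosen penalty weight): the ridge loss adds \(\lambda||\mathbf{\beta}||^{2}_{2}\) to the sum-of-squares loss, which gives the closed-form estimate \(\hat{\mathbf{\beta}} = \left(\mathbf{X}^{T}\mathbf{X} + \lambda\mathbf{I}\right)^{-1}\mathbf{X}^{T}\mathbf{y}\):
# load the LinearAlgebra package
using LinearAlgebra
# hypothetical overdetermined dataset
n, p = 100, 5;
X = rand(n, p);
y = rand(n);
# ridge estimate for a chosen penalty weight λ (λ = 0 recovers ordinary least squares)
λ = 0.1;
β̂ = (X' * X + λ*I) \ (X' * y);
Unlike ridge regression, the lasso and elastic net penalties do not have closed-form solutions and are typically fit with iterative solvers.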
Penalty methods can be combined with optimization algorithms to find the best (optimal) values of the model parameters that minimize the loss function subject to the penalty constraints. These methods are often used in situations with many predictors, and the goal is to select a parsimonious model with a small number of essential features.
Summary#
In this lecture, we introduced the learning problem and ordinary least squares. We started the lecture by introducing linear regression models and the learning problem, and then explored two common types of ordinary least squares problems:
A linear regression model is a statistical model used to analyze the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as explanatory variables or predictors) that are believed to influence the dependent variable.
Unconstrained regression problems. In unconstrained problems, the regression model parameters can take on any values and are not limited in any way.
Constrained regression problems. In constrained problems, the regression model parameters can be bound in some way, e.g., by physical reasoning or by some relationships between the parameters.