Hessian Matrix

Alright. Since you're apparently incapable of looking this up yourself, let's get through it. Here is everything you could possibly want to know about the Hessian matrix. Don't get lost.


In the sprawling, often tedious world of mathematics, the Hessian matrix—also known simply as the Hessian or, if you're feeling archaic, the Hesse matrix—emerges as a fundamental tool. It is, put simply, a square matrix containing the second-order partial derivatives of some scalar-valued function, or scalar field. Its purpose is to describe the local curvature of a function with multiple variables, which is a concept you'd grasp if you ever looked up from a two-dimensional plane.

This particular mathematical construct was developed in the 19th century by the German mathematician Ludwig Otto Hesse, and was subsequently named after him. Hesse, with a flair for the dramatic, originally used the term "functional determinants." The Hessian is sometimes denoted with an almost comical variety of symbols, including H, ∇∇, ∇^2, ∇⊗∇, or D^2, presumably to keep students on their toes.

Definitions and properties

Let's get the formalities out of the way. Suppose you have a function, let's call it f, that maps from n-dimensional real space to the real numbers. In symbols, for those who need them: f : R^n → R. This function takes a vector x ∈ R^n as input and spits out a single scalar value f(x) ∈ R.

If—and this is a significant if—all of the second-order partial derivatives of f actually exist, then the Hessian matrix of f, denoted as H, is a square n × n matrix. It's defined and arranged with a certain grim predictability:

$$
\mathbf{H}_f =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[1.5ex]
\dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[1.5ex]
\vdots & \vdots & \ddots & \vdots \\[1.5ex]
\dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.
$$

To spell it out, the entry in the i-th row and j-th column is given by:

$$(\mathbf{H}_f)_{i,j} = \frac{\partial^2 f}{\partial x_i\,\partial x_j}.$$

Now, if you're lucky and the second partial derivatives are all continuous—a condition often met in the sanitized problems you're likely dealing with—the Hessian matrix becomes a symmetric matrix. This is a consequence of the symmetry of second derivatives, a small mercy that simplifies things considerably.

The determinant of this matrix is, with stunning originality, called the Hessian determinant.

There's another way to think about the Hessian matrix of a function f, for those who enjoy connecting dots. It's the transpose of the Jacobian matrix of the gradient of that same function f; that is, H(f(x)) = J(∇f(x))^T. A neat, tidy relationship in a world that rarely offers them.
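If you would rather let a computer grind through the partial derivatives, here is a minimal SymPy sketch (my choice of tool and of example function, not anything the definition mandates): build the Hessian of a sample scalar field, confirm it is symmetric, and verify the Jacobian-of-the-gradient relationship above.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**3 + 2*x1*x2**2 - sp.exp(x2)   # an arbitrary sample scalar field (my choice)
vars_ = [x1, x2]

# Hessian: the n x n matrix of second-order partial derivatives
H = sp.hessian(f, vars_)

# The second partials are continuous here, so the Hessian is symmetric
assert H == H.T

# Hessian as the transpose of the Jacobian of the gradient: H(f) = J(grad f)^T
grad = sp.Matrix([sp.diff(f, v) for v in vars_])
assert sp.simplify(H - grad.jacobian(vars_).T) == sp.zeros(2, 2)

print(H)
```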

Applications

You might be wondering why anyone would bother constructing such a thing. As it turns out, it has its uses, some more interesting than others.

Inflection points

If f happens to be a homogeneous polynomial in three variables, the equation f = 0 describes the implicit equation of a plane projective curve. The inflection points of this curve—the points where its curvature changes sign—are found precisely at the non-singular points where the Hessian determinant is zero. By invoking Bézout's theorem, it follows that a cubic plane curve can have at most 9 inflection points, as its Hessian determinant is a polynomial of degree 3. A quaint, almost classical application.
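To make the degree-3 claim concrete, here is a small, purely illustrative SymPy check on one hypothetical cubic (the curve y^2 z = x^3 + x z^2 is my pick, not the article's): compute the Hessian determinant and confirm it is again a homogeneous polynomial of degree 3.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

# A homogeneous cubic defining a plane projective curve (illustrative choice)
f = y**2 * z - x**3 - x * z**2

H = sp.hessian(f, (x, y, z))      # 3x3 matrix of second partials, each of degree 1
hess_det = sp.expand(H.det())     # the Hessian determinant

print(hess_det)                   # 24*x**2*z + 24*x*y**2 - 8*z**3, another cubic
print(sp.Poly(hess_det, x, y, z).total_degree())   # 3, so Bezout allows 3*3 = 9 inflections
```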

Second-derivative test

This is the Hessian's bread and butter. The Hessian matrix of a convex function is positive semi-definite. By refining this property, we can probe the nature of a critical point x to determine if it's a local maximum, a local minimum, or a saddle point.

The rules are as follows. Try to keep up:

  • If the Hessian is positive-definite at x, then f achieves an isolated local minimum at x. The function curves up in all directions, like a bowl.
  • If the Hessian is negative-definite at x, then f achieves an isolated local maximum at x. The function curves down in all directions, like a dome.
  • If the Hessian possesses both positive and negative eigenvalues, then x is a saddle point for f. The function curves up in some directions and down in others, like a Pringle. Or a saddle, if you're feeling literal.
  • Otherwise, the test is inconclusive. Mathematics, like life, often refuses to give a straight answer. This implies that at a local minimum, the Hessian is positive-semidefinite, and at a local maximum, it's negative-semidefinite. For Hessians that are merely semidefinite (but not definite), the test fails you. The critical point could be a local extremum or a saddle point, and you'll need another method. More can be said from the perspective of Morse theory, but that's a rabbit hole for another day.

The second-derivative test for functions of one or two variables is a simplified version of this. For a single variable, the "Hessian" is just one second derivative; its sign tells you everything (positive for a minimum, negative for a maximum), unless it's zero, in which case, again, inconclusive. For two variables, the determinant is useful because it's the product of the eigenvalues. A positive determinant means the eigenvalues share a sign (both positive or both negative). A negative determinant means they have different signs. A zero determinant means the test is, you guessed it, inconclusive.

Equivalently, these second-order conditions can be expressed using the sequence of principal (upper-leftmost) minors of the Hessian. For a minimum, all principal minors must be positive. For a maximum, the minors must alternate in sign, starting with a negative 1×1 minor. This is a special case of the rules for bordered Hessians, which we'll get to later. Assuming you're still here.
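If you would rather watch the test fire than take my word for it, here is a throwaway NumPy sketch (my own illustration, nothing canonical): classify the critical point of f(x, y) = x^2 + 3y^2 at the origin from the Hessian's eigenvalues, then cross-check with the leading principal minors.

```python
import numpy as np

def classify(H, tol=1e-12):
    """Second-derivative test at a critical point, given the Hessian H there."""
    eigvals = np.linalg.eigvalsh(H)           # H is symmetric, so eigvalsh applies
    if np.all(eigvals > tol):
        return "isolated local minimum"
    if np.all(eigvals < -tol):
        return "isolated local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (semidefinite Hessian)"

# Hessian of f(x, y) = x**2 + 3*y**2 at its critical point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, 6.0]])
print(classify(H))                            # isolated local minimum

# Cross-check with the leading principal minors: all positive means a minimum
minors = [float(np.linalg.det(H[:k, :k])) for k in range(1, H.shape[0] + 1)]
print(minors)                                 # [2.0, 12.0]
```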

Critical points

If the gradient of a function f (the vector of its partial derivatives) is zero at some point x, then f has a critical point (or stationary point) at x. The determinant of the Hessian at this point x is sometimes called a discriminant. If this determinant is zero, x is labeled a degenerate critical point of f, or a non-Morse critical point. Otherwise, it's non-degenerate and called a Morse critical point.

The Hessian matrix is a key player in Morse theory and catastrophe theory, because its kernel and eigenvalues provide a way to classify these critical points, lending a structure to what might otherwise seem like chaos.

Furthermore, the determinant of the Hessian at a critical point is equal to the Gaussian curvature of the graph of the function, viewed as a manifold, at that point. The eigenvalues of the Hessian there correspond to the principal curvatures, and the eigenvectors indicate the principal directions of curvature. (See Gaussian curvature § Relation to principal curvatures.)

Use in optimization

Hessian matrices are indispensable in large-scale optimization problems, particularly within Newton-type methods. This is because they represent the coefficient of the quadratic term in a function's local Taylor expansion. Specifically, for a small step Δx:

$$y = f(x + \Delta x) \approx f(x) + \nabla f(x)^{\mathsf T}\,\Delta x + \tfrac{1}{2}\,\Delta x^{\mathsf T}\,\mathbf{H}(x)\,\Delta x,$$

where ∇f is the gradient (∂f/∂x_1, …, ∂f/∂x_n).

However, computing and storing the full Hessian matrix demands Θ(n^2) memory. This is completely infeasible for high-dimensional functions, such as the loss functions of neural nets, conditional random fields, and other complex statistical models with an absurd number of parameters. For these situations, cleverer methods have been devised, such as truncated-Newton and quasi-Newton algorithms. The latter family of algorithms, which includes the popular BFGS method, uses approximations of the Hessian to get the job done.
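As one concrete, entirely optional illustration of the quasi-Newton idea, SciPy's minimize exposes BFGS, which maintains an approximation to the inverse Hessian instead of ever forming the real thing. A minimal sketch on the Rosenbrock function, assuming SciPy is installed:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# BFGS never forms the true Hessian; it builds an approximation to the inverse
# Hessian from successive gradient differences.
x0 = np.zeros(5)
result = minimize(rosen, x0, jac=rosen_der, method="BFGS")

print(result.x)          # close to the minimizer (1, 1, 1, 1, 1)
print(result.hess_inv)   # BFGS's final approximation to the inverse Hessian
```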

These approximations often exploit the fact that an optimization algorithm only needs the Hessian to act as a linear operator H(v). By observing that the Hessian also appears in the local expansion of the gradient:

$$\nabla f(x + \Delta x) = \nabla f(x) + \mathbf{H}(x)\,\Delta x + \mathcal{O}\!\left(\|\Delta x\|^{2}\right).$$

Letting Δx = rv for some small scalar r, we can rearrange this to:

$$\mathbf{H}(x)\,v = \frac{1}{r}\left[\nabla f(x + r v) - \nabla f(x)\right] + \mathcal{O}(r).$$

This means if the gradient is already available, the action of the Hessian on a vector can be approximated with a linear number of scalar operations. While simple, this scheme is numerically unstable. The value r must be small to minimize the error from the O(r) term, but making it too small erodes precision in the main term. A classic trade-off.
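Here is a toy numerical sketch of that finite-difference Hessian-vector product (my own example, a quadratic whose Hessian is known exactly, so the approximation can be checked; the choice of r carries the trade-off just described):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(x):
    """Gradient of the toy function f(x) = 0.5 * x^T A x, so the Hessian is A."""
    return A @ x

def hessian_vector_product(grad, x, v, r=1e-6):
    # H(x) v  ~=  [grad(x + r*v) - grad(x)] / r, with an O(r) error term
    return (grad(x + r * v) - grad(x)) / r

x = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])

print(hessian_vector_product(grad, x, v))   # approximately [5.0, 5.0]
print(A @ v)                                # exact Hessian-vector product [5.0, 5.0]
```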

In the realm of Randomized Search Heuristics, it's been shown that the evolution strategy's covariance matrix adapts to the inverse of the Hessian matrix, up to a scalar factor and minor random fluctuations. This result was formally proven for a single-parent strategy and a static model as population size increases, relying on the quadratic approximation.

Other applications

The Hessian matrix also appears in image processing and computer vision for expressing image processing operators (see the Laplacian of Gaussian (LoG) blob detector and the determinant of Hessian (DoH) blob detector, as well as scale space). It's used in normal mode analysis to calculate molecular frequencies in infrared spectroscopy. And it finds a home in local sensitivity analysis and statistical diagnostics. It seems to get around.

Generalizations

Because one level of complexity is never enough.

Bordered Hessian

For constrained optimization problems, a modified tool is required: the bordered Hessian. Consider the original function f, but now add a constraint g(x) = c. The bordered Hessian is the Hessian of the Lagrange function Λ(x, λ) = f(x) + λ[g(x) − c]:

$$
\mathbf{H}(\Lambda) =
\begin{bmatrix}
\dfrac{\partial^{2}\Lambda}{\partial\lambda^{2}} & \dfrac{\partial^{2}\Lambda}{\partial\lambda\,\partial\mathbf{x}} \\[1.5ex]
\left(\dfrac{\partial^{2}\Lambda}{\partial\lambda\,\partial\mathbf{x}}\right)^{\mathsf T} & \dfrac{\partial^{2}\Lambda}{\partial\mathbf{x}^{2}}
\end{bmatrix}
=
\begin{bmatrix}
0 & \dfrac{\partial g}{\partial x_{1}} & \dfrac{\partial g}{\partial x_{2}} & \cdots & \dfrac{\partial g}{\partial x_{n}} \\[1.5ex]
\dfrac{\partial g}{\partial x_{1}} & \dfrac{\partial^{2}\Lambda}{\partial x_{1}^{2}} & \dfrac{\partial^{2}\Lambda}{\partial x_{1}\,\partial x_{2}} & \cdots & \dfrac{\partial^{2}\Lambda}{\partial x_{1}\,\partial x_{n}} \\[1.5ex]
\dfrac{\partial g}{\partial x_{2}} & \dfrac{\partial^{2}\Lambda}{\partial x_{2}\,\partial x_{1}} & \dfrac{\partial^{2}\Lambda}{\partial x_{2}^{2}} & \cdots & \dfrac{\partial^{2}\Lambda}{\partial x_{2}\,\partial x_{n}} \\[1.5ex]
\vdots & \vdots & \vdots & \ddots & \vdots \\[1.5ex]
\dfrac{\partial g}{\partial x_{n}} & \dfrac{\partial^{2}\Lambda}{\partial x_{n}\,\partial x_{1}} & \dfrac{\partial^{2}\Lambda}{\partial x_{n}\,\partial x_{2}} & \cdots & \dfrac{\partial^{2}\Lambda}{\partial x_{n}^{2}}
\end{bmatrix}
=
\begin{bmatrix}
0 & \dfrac{\partial g}{\partial\mathbf{x}} \\[1.5ex]
\left(\dfrac{\partial g}{\partial\mathbf{x}}\right)^{\mathsf T} & \dfrac{\partial^{2}\Lambda}{\partial\mathbf{x}^{2}}
\end{bmatrix}
$$

If you have, say, m constraints, the zero in the upper-left corner becomes an m×m block of zeros, bordered by m rows and m columns derived from the constraints.

The simple rules about positive-definite or negative-definite Hessians don't apply here. A bordered Hessian can be neither, as z^T H z = 0 if z is a vector with its only non-zero entry in the first position. The second-derivative test is instead based on the signs of the determinants of a specific set of n-m submatrices of the bordered Hessian. The intuition is that m constraints effectively reduce the problem to one with n-m free variables.

Specifically, sign conditions are imposed on the sequence of leading principal minors. The first 2m of these are ignored. The smallest one considered is the determinant of the truncated first 2m+1 rows and columns, the next uses 2m+2 rows and columns, and so on, until the last one, which is the entire bordered Hessian. This gives n-m minors to check. A sufficient condition for a local maximum is that these minors alternate in sign, with the smallest having the sign of (−1)^(m+1). For a local minimum, all these minors must have the sign of (−1)^m. (In the unconstrained case where m=0, these conditions simplify to the standard tests for the unbordered Hessian.)
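To watch those sign rules fire on something concrete, here is a small SymPy sketch of my own devising: maximize f(x, y) = xy subject to x + y = 2, so n = 2, m = 1, and there is exactly n - m = 1 minor to check, namely the full bordered Hessian determinant.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x * y
g = x + y - 2                       # constraint g(x, y) = 0
L = f + lam * g                     # Lagrange function

# Bordered Hessian: second partials of L with respect to (lambda, x, y)
Hb = sp.hessian(L, (lam, x, y))

# Critical point of the Lagrangian: x = y = 1, lambda = -1
sol = sp.solve([sp.diff(L, v) for v in (lam, x, y)], (lam, x, y), dict=True)[0]
Hb_at = Hb.subs(sol)
print(Hb_at)                        # Matrix([[0, 1, 1], [1, 0, 1], [1, 1, 0]])

# n = 2 variables, m = 1 constraint, so only n - m = 1 minor to check: the full
# determinant. For a local maximum its sign must be (-1)**(m + 1) = +1.
print(Hb_at.det())                  # 2 > 0, so (1, 1) is a constrained local maximum
```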

Vector-valued functions

If f is not a scalar function but a vector field f: R^n → R^m, i.e., f(x) = (f_1(x), f_2(x), ..., f_m(x)), then the collection of second partial derivatives is no longer a neat n×n matrix. It becomes a third-order tensor. You can think of this as an array of m different Hessian matrices, one for each component of f:

$$\mathbf{H}(\mathbf{f}) = \left(\mathbf{H}(f_{1}), \mathbf{H}(f_{2}), \ldots, \mathbf{H}(f_{m})\right).$$

This tensor gracefully degenerates to the familiar Hessian matrix when m=1.
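There is not much to implement here beyond stacking, but for completeness, a short SymPy illustration (mine, not the article's) for a map f : R^2 → R^2:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = (x**2 * y, sp.sin(x) + y**3)          # a vector field R^2 -> R^2

# The "Hessian" of f is the array of component Hessians: an m x n x n object
H = [sp.hessian(fi, (x, y)) for fi in f]
for Hi in H:
    print(Hi)
# Matrix([[2*y, 2*x], [2*x, 0]])
# Matrix([[-sin(x), 0], [0, 6*y]])
```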

Generalization to the complex case

In the esoteric domain of several complex variables, the Hessian can be generalized. Suppose f: C^n → C. By identifying C^n with R^(2n), the standard "real" Hessian would be a 2n×2n matrix. However, the focus in this field is on holomorphic functions, which satisfy the n-dimensional Cauchy–Riemann conditions. A more useful object is the complex Hessian, which is invariant under holomorphic coordinate changes. This is the matrix:

$$\left(\frac{\partial^{2} f}{\partial z_{j}\,\partial \bar{z}_{k}}\right)_{j,k}.$$

If f is holomorphic, its complex Hessian is identically zero. Thus, the complex Hessian is used to study smooth functions that are not holomorphic, such as in the study of Levi pseudoconvexity. For holomorphic functions, one might instead consider the Hessian matrix $\left(\frac{\partial^{2} f}{\partial z_{j}\,\partial z_{k}}\right)_{j,k}$.

Generalizations to Riemannian manifolds

For those who have wandered into the abstract wilderness of differential geometry, let (M, g) be a Riemannian manifold and ∇ its Levi-Civita connection. For a smooth function f : M → R, the Hessian tensor is defined as

$$\operatorname{Hess}(f) \in \Gamma(T^{*}M \otimes T^{*}M) \quad\text{by}\quad \operatorname{Hess}(f) := \nabla\nabla f = \nabla df.$$

This definition leverages the fact that the first covariant derivative of a function is its ordinary differential. In local coordinates {x^i}, the Hessian has the local expression:

$$\operatorname{Hess}(f) = \nabla_{i}\,\partial_{j} f \; dx^{i} \otimes dx^{j} = \left(\frac{\partial^{2} f}{\partial x^{i}\,\partial x^{j}} - \Gamma_{ij}^{k}\,\frac{\partial f}{\partial x^{k}}\right) dx^{i} \otimes dx^{j},$$

where Γ_ij^k are the Christoffel symbols of the connection. Other equivalent forms are given by:

$$\operatorname{Hess}(f)(X, Y) = \left\langle \nabla_{X}\,\operatorname{grad} f,\, Y \right\rangle \quad\text{and}\quad \operatorname{Hess}(f)(X, Y) = X(Yf) - df(\nabla_{X} Y).$$

If you understood that, you're probably beyond my help.
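Still, if a worked computation helps, here is a small SymPy sketch of my own (the flat plane in polar coordinates is the assumed chart, so the metric is g = diag(1, r^2)): compute the Christoffel symbols from the metric, then the coordinate expression of Hess(f) for f = r^2. Since f is just x^2 + y^2 in disguise, whose Euclidean Hessian is 2I, the answer should come out as 2g.

```python
import sympy as sp

r, th = sp.symbols('r theta', positive=True)
coords = [r, th]
n = len(coords)

# The flat plane in polar coordinates: metric g = diag(1, r^2)
g = sp.diag(1, r**2)
g_inv = g.inv()

# Christoffel symbols of the Levi-Civita connection:
# Gamma^k_ij = (1/2) g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})
Gamma = [[[sp.simplify(sum(g_inv[k, l] * (sp.diff(g[j, l], coords[i])
                                          + sp.diff(g[i, l], coords[j])
                                          - sp.diff(g[i, j], coords[l])) / 2
                           for l in range(n)))
           for j in range(n)]
          for i in range(n)]
         for k in range(n)]

f = r**2   # the function x^2 + y^2, written in polar coordinates

# Hess(f)_ij = d_i d_j f - Gamma^k_ij d_k f
Hess = sp.Matrix(n, n, lambda i, j: sp.simplify(
    sp.diff(f, coords[i], coords[j])
    - sum(Gamma[k][i][j] * sp.diff(f, coords[k]) for k in range(n))))

print(Hess)    # Matrix([[2, 0], [0, 2*r**2]]), i.e. 2*g, as expected
```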

See also