|
"""Trust Region Reflective algorithm for least-squares optimization. |
|
|
|
The algorithm is based on ideas from the paper [STIR]_. The main idea is to
account for the presence of the bounds by appropriate scaling of the
variables (or, equivalently, by changing the trust-region shape). Let's
introduce a vector v:
|
|
|
| ub[i] - x[i], if g[i] < 0 and ub[i] < np.inf |
|
v[i] = | x[i] - lb[i], if g[i] > 0 and lb[i] > -np.inf |
|
| 1, otherwise |
|
|
|
where g is the gradient of the cost function and lb, ub are the bounds. Its
components are the distances to the respective bounds at which the
anti-gradient points (whenever this distance is finite). Define a scaling
matrix D = diag(v**0.5). The first-order optimality conditions can then be
stated as
|
|
|
D^2 g(x) = 0. |
|
|
|
This means that components of the gradient must be zero for strictly
interior variables, and must point inside the feasible region for variables
on a bound.
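
As a minimal sketch of this scaling (an illustration only; the library
computes v, together with the derivative dv used later, in
``CL_scaling_vector`` from ``common.py``):

    import numpy as np

    def cl_scaling(x, g, lb, ub):
        # Compute v from the piecewise definition above.
        v = np.ones_like(x)
        mask = (g < 0) & (ub < np.inf)
        v[mask] = ub[mask] - x[mask]
        mask = (g > 0) & (lb > -np.inf)
        v[mask] = x[mask] - lb[mask]
        return v

    # Scaled optimality measure: norm(D^2 g, inf) = norm(v * g, inf) must
    # vanish at a first-order optimal point:
    # g_norm = np.linalg.norm(cl_scaling(x, g, lb, ub) * g, ord=np.inf)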
|
|
|
Now consider this system of equations as a new optimization problem. If the |
|
point x is strictly interior (not on the bound), then the left-hand side is |
|
differentiable and the Newton step for it satisfies |
|
|
|
(D^2 H + diag(g) Jv) p = -D^2 g |
|
|
|
where H is the Hessian matrix (or its J^T J approximation in least squares), |
|
Jv is the Jacobian matrix of v with components -1, 1 or 0, such that all |
|
elements of the matrix C = diag(g) Jv are non-negative. Introduce the change
of variables x = D x_h (_h stands for "hat" in LaTeX). In the new variables,
|
we have a Newton step satisfying |
|
|
|
B_h p_h = -g_h, |
|
|
|
where B_h = D H D + C, g_h = D g. In least squares B_h = J_h^T J_h, where |
|
J_h = J D. Note that J_h and g_h are proper Jacobian and gradient with respect |
|
to "hat" variables. To guarantee global convergence we formulate a |
|
trust-region problem based on the Newton step in the new variables: |
|
|
|
0.5 * p_h^T B_h p_h + g_h^T p_h -> min, ||p_h|| <= Delta
|
|
|
In the original space B = H + D^{-1} C D^{-1}, and the equivalent trust-region |
|
problem is |
|
|
|
0.5 * p^T B p + g^T p -> min, ||D^{-1} p|| <= Delta |
|
|
|
Here the meaning of the matrix D becomes clearer: it alters the shape of
the trust region, such that large steps towards the bounds are not allowed.
In the implementation, the trust-region problem is solved in "hat" space,
but the handling of the bounds is done in the original space (see below and
read the code).
|
|
|
The introduction of the matrix D doesn't allow the algorithm to ignore
bounds: iterates must be kept strictly feasible (to satisfy the
aforementioned differentiability requirement). The parameter theta controls
how far the step backs off from the boundary (see the code for details).
|
|
|
The algorithm does another important trick. If the trust-region solution
doesn't fit into the bounds, then a search direction reflected from the
first encountered bound is considered. For motivation and analysis refer to
the [STIR]_ paper (and other papers by the same authors). In practice, not
much extra justification is needed; the algorithm simply chooses the best
among three candidate steps: a constrained trust-region step, a reflected
step, and a constrained Cauchy step (a minimizer along -g_h in "hat" space,
or along -D^2 g in the original space).
|
|
|
Another feature is that the trust-region radius control strategy is
modified to account for the appearance of the diagonal matrix C (called
diag_h in the code).
|
|
|
Note that all of the described peculiarities vanish for problems without
bounds (the algorithm then becomes a standard trust-region method, very
similar to the ones implemented in MINPACK).
|
|
|
The implementation supports two methods of solving the trust-region
problem. The first, called 'exact', applies SVD to the Jacobian and then
solves the problem very accurately using the algorithm described in
[JJMore]_. It is not applicable to large problems. The second, called
'lsmr', uses the 2-D subspace approach (sometimes called "indefinite
dogleg"), where the problem is solved in a subspace spanned by the gradient
and the approximate Gauss-Newton step found by
``scipy.sparse.linalg.lsmr``. The 2-D trust-region problem is reformulated
as a 4th order algebraic equation and solved very accurately by
``numpy.roots``. The subspace approach allows solving very large problems
(up to a couple of million residuals on a regular PC), provided the
Jacobian matrix is sufficiently sparse.
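
To make the 2-D reduction concrete, below is a simplified sketch, derived
here for illustration only (the production routine is
``solve_trust_region_2d`` from ``common.py``, whose bookkeeping differs),
of minimizing 0.5 * p^T B p + g^T p over ||p|| <= Delta for a 2x2 matrix B:

    import numpy as np

    def solve_2d_trust_region(B, g, Delta):
        # If B is positive definite and the Newton point lies inside the
        # ball, it is the global minimizer.
        if np.all(np.linalg.eigvalsh(B) > 0):
            p = -np.linalg.solve(B, g)
            if p.dot(p) <= Delta**2:
                return p

        # Otherwise the solution lies on the boundary. Parametrize
        # p = Delta * (cos t, sin t); the stationarity condition in t,
        # after the substitution u = tan(t / 2), becomes a 4th order
        # polynomial in u.
        a, b, c = B[0, 0] * Delta**2, B[0, 1] * Delta**2, B[1, 1] * Delta**2
        d, f = g[0] * Delta, g[1] * Delta
        coeffs = [b - f, 2 * (a - c - d), -6 * b, 2 * (c - a - d), b + f]
        u = np.roots(coeffs)  # np.roots handles leading zeros.
        u = np.real(u[np.isreal(u)])

        # Map the roots back to the circle; u -> inf (t = pi) is lost by
        # the substitution, so append that candidate explicitly.
        p = Delta * np.vstack(((1 - u**2) / (1 + u**2), 2 * u / (1 + u**2)))
        p = np.hstack((p, Delta * np.array([[-1.0], [0.0]])))
        values = 0.5 * np.sum(p * B.dot(p), axis=0) + g.dot(p)
        return p[:, np.argmin(values)]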
|
|
|
References |
|
---------- |
|
.. [STIR] Branch, M.A., T.F. Coleman, and Y. Li, "A Subspace, Interior, |
|
and Conjugate Gradient Method for Large-Scale Bound-Constrained |
|
Minimization Problems," SIAM Journal on Scientific Computing, |
|
Vol. 21, Number 1, pp 1-23, 1999. |
|
.. [JJMore] More, J. J., "The Levenberg-Marquardt Algorithm: Implementation
    and Theory," Numerical Analysis, ed. G. A. Watson, Lecture Notes
    in Mathematics 630, Springer-Verlag, pp. 105-116, 1977.
|
""" |
|
import numpy as np |
|
from numpy.linalg import norm |
|
from scipy.linalg import svd, qr |
|
from scipy.sparse.linalg import lsmr |
|
from scipy.optimize import OptimizeResult |
|
|
|
from .common import ( |
|
step_size_to_bound, find_active_constraints, in_bounds, |
|
make_strictly_feasible, intersect_trust_region, solve_lsq_trust_region, |
|
solve_trust_region_2d, minimize_quadratic_1d, build_quadratic_1d, |
|
evaluate_quadratic, right_multiplied_operator, regularized_lsq_operator, |
|
CL_scaling_vector, compute_grad, compute_jac_scale, check_termination, |
|
update_tr_radius, scale_for_robust_loss_function, print_header_nonlinear, |
|
print_iteration_nonlinear) |
|
|
|
|
|
def trf(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale, |
|
loss_function, tr_solver, tr_options, verbose): |
|
|
|
|
|
|
|
|
|
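    # For efficiency, run a simplified version of the algorithm when no
    # bounds are imposed. Keeping two separate functions duplicates some
    # code, but keeps each variant readable.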
if np.all(lb == -np.inf) and np.all(ub == np.inf): |
|
return trf_no_bounds( |
|
fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev, x_scale, |
|
loss_function, tr_solver, tr_options, verbose) |
|
else: |
|
return trf_bounds( |
|
fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, x_scale, |
|
loss_function, tr_solver, tr_options, verbose) |
|
|
|
|
|
def select_step(x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta): |
|
"""Select the best step according to Trust Region Reflective algorithm.""" |
|
if in_bounds(x + p, lb, ub): |
|
p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h) |
|
return p, p_h, -p_value |
|
|
|
p_stride, hits = step_size_to_bound(x, p, lb, ub) |
|
|
|
|
|
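    # Compute the reflected direction: components of p_h that hit a bound
    # change sign.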
r_h = np.copy(p_h) |
|
r_h[hits.astype(bool)] *= -1 |
|
r = d * r_h |
|
|
|
|
|
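    # Restrict the trust-region step, such that it exactly hits the bound.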
p *= p_stride |
|
p_h *= p_stride |
|
x_on_bound = x + p |
|
|
|
|
|
|
|
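    # The reflected direction will first cross either the feasible-region
    # boundary or the trust-region boundary.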
_, to_tr = intersect_trust_region(p_h, r_h, Delta) |
|
to_bound, _ = step_size_to_bound(x_on_bound, r, lb, ub) |
|
|
|
|
|
|
|
|
|
|
|
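    # Find lower and upper bounds on the step size along the reflected
    # direction that keep the iterate strictly feasible. There is no single
    # correct way to do this; the heuristic below works well in practice.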
r_stride = min(to_bound, to_tr) |
|
if r_stride > 0: |
|
r_stride_l = (1 - theta) * p_stride / r_stride |
|
if r_stride == to_bound: |
|
r_stride_u = theta * to_bound |
|
else: |
|
r_stride_u = to_tr |
|
else: |
|
r_stride_l = 0 |
|
r_stride_u = -1 |
|
|
|
|
|
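    # Check if the reflected step is available.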
if r_stride_l <= r_stride_u: |
|
a, b, c = build_quadratic_1d(J_h, g_h, r_h, s0=p_h, diag=diag_h) |
|
r_stride, r_value = minimize_quadratic_1d( |
|
a, b, r_stride_l, r_stride_u, c=c) |
|
r_h *= r_stride |
|
r_h += p_h |
|
r = r_h * d |
|
else: |
|
r_value = np.inf |
|
|
|
|
|
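    # Scale the constrained trust-region step back by theta to keep the new
    # iterate strictly interior.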
p *= theta |
|
p_h *= theta |
|
p_value = evaluate_quadratic(J_h, g_h, p_h, diag=diag_h) |
|
|
|
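    # Finally, consider the constrained Cauchy step along the scaled
    # anti-gradient.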
ag_h = -g_h |
|
ag = d * ag_h |
|
|
|
to_tr = Delta / norm(ag_h) |
|
to_bound, _ = step_size_to_bound(x, ag, lb, ub) |
|
if to_bound < to_tr: |
|
ag_stride = theta * to_bound |
|
else: |
|
ag_stride = to_tr |
|
|
|
a, b = build_quadratic_1d(J_h, g_h, ag_h, diag=diag_h) |
|
ag_stride, ag_value = minimize_quadratic_1d(a, b, 0, ag_stride) |
|
ag_h *= ag_stride |
|
ag *= ag_stride |
|
|
|
if p_value < r_value and p_value < ag_value: |
|
return p, p_h, -p_value |
|
elif r_value < p_value and r_value < ag_value: |
|
return r, r_h, -r_value |
|
else: |
|
return ag, ag_h, -ag_value |
|
|
|
|
|
def trf_bounds(fun, jac, x0, f0, J0, lb, ub, ftol, xtol, gtol, max_nfev, |
|
x_scale, loss_function, tr_solver, tr_options, verbose): |
|
x = x0.copy() |
|
|
|
f = f0 |
|
f_true = f.copy() |
|
nfev = 1 |
|
|
|
J = J0 |
|
njev = 1 |
|
m, n = J.shape |
|
|
|
if loss_function is not None: |
|
rho = loss_function(f) |
|
cost = 0.5 * np.sum(rho[0]) |
|
J, f = scale_for_robust_loss_function(J, f, rho) |
|
else: |
|
cost = 0.5 * np.dot(f, f) |
|
|
|
g = compute_grad(J, f) |
|
|
|
jac_scale = isinstance(x_scale, str) and x_scale == 'jac' |
|
if jac_scale: |
|
scale, scale_inv = compute_jac_scale(J) |
|
else: |
|
scale, scale_inv = x_scale, 1 / x_scale |
|
|
|
v, dv = CL_scaling_vector(x, g, lb, ub) |
|
v[dv != 0] *= scale_inv[dv != 0] |
|
Delta = norm(x0 * scale_inv / v**0.5) |
|
if Delta == 0: |
|
Delta = 1.0 |
|
|
|
g_norm = norm(g * v, ord=np.inf) |
|
|
|
f_augmented = np.zeros(m + n) |
|
if tr_solver == 'exact': |
|
J_augmented = np.empty((m + n, n)) |
|
elif tr_solver == 'lsmr': |
|
reg_term = 0.0 |
|
regularize = tr_options.pop('regularize', True) |
|
|
|
if max_nfev is None: |
|
max_nfev = x0.size * 100 |
|
|
|
alpha = 0.0 |
|
|
|
termination_status = None |
|
iteration = 0 |
|
step_norm = None |
|
actual_reduction = None |
|
|
|
if verbose == 2: |
|
print_header_nonlinear() |
|
|
|
while True: |
|
v, dv = CL_scaling_vector(x, g, lb, ub) |
|
|
|
g_norm = norm(g * v, ord=np.inf) |
|
if g_norm < gtol: |
|
termination_status = 1 |
|
|
|
if verbose == 2: |
|
print_iteration_nonlinear(iteration, nfev, cost, actual_reduction, |
|
step_norm, g_norm) |
|
|
|
if termination_status is not None or nfev == max_nfev: |
|
break |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
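        # Compute quantities in "hat" space, also accounting for the scaling
        # introduced by `x_scale`. Recompute v in the scaled variables;
        # components of v that are identically 1 are not affected.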
v[dv != 0] *= scale_inv[dv != 0] |
|
|
|
|
|
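        # d combines the CL scaling v**0.5 with `x_scale`.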
d = v**0.5 * scale |
|
|
|
|
|
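        # Diagonal of C = diag(g) Jv (dv holds the diagonal of Jv), with an
        # extra `x_scale` factor.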
diag_h = g * dv * scale |
|
|
|
|
|
|
|
|
|
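        # Gradient in "hat" variables.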
g_h = d * g |
|
|
|
f_augmented[:m] = f |
|
if tr_solver == 'exact': |
|
J_augmented[:m] = J * d |
|
J_h = J_augmented[:m] |
|
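            # Augmenting J_h with diag(diag_h)**0.5 makes the SVD solve the
            # problem with B_h = J_h^T J_h + diag(diag_h).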
J_augmented[m:] = np.diag(diag_h**0.5) |
|
U, s, V = svd(J_augmented, full_matrices=False) |
|
V = V.T |
|
uf = U.T.dot(f_augmented) |
|
elif tr_solver == 'lsmr': |
|
J_h = right_multiplied_operator(J, d) |
|
|
|
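            # Estimate the regularization term for lsmr from the value of
            # the constrained Cauchy step.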
if regularize: |
|
a, b = build_quadratic_1d(J_h, g_h, -g_h, diag=diag_h) |
|
to_tr = Delta / norm(g_h) |
|
ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1] |
|
reg_term = -ag_value / Delta**2 |
|
|
|
lsmr_op = regularized_lsq_operator(J_h, (diag_h + reg_term)**0.5) |
|
gn_h = lsmr(lsmr_op, f_augmented, **tr_options)[0] |
|
S = np.vstack((g_h, gn_h)).T |
|
S, _ = qr(S, mode='economic') |
|
JS = J_h.dot(S) |
|
B_S = np.dot(JS.T, JS) + np.dot(S.T * diag_h, S) |
|
g_S = S.T.dot(g_h) |
|
|
|
|
|
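        # theta controls how far the step may go towards the boundary; it
        # approaches 1 as the gradient norm goes to 0.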
theta = max(0.995, 1 - g_norm) |
|
|
|
actual_reduction = -1 |
|
while actual_reduction <= 0 and nfev < max_nfev: |
|
if tr_solver == 'exact': |
|
p_h, alpha, n_iter = solve_lsq_trust_region( |
|
n, m, uf, s, V, Delta, initial_alpha=alpha) |
|
elif tr_solver == 'lsmr': |
|
p_S, _ = solve_trust_region_2d(B_S, g_S, Delta) |
|
p_h = S.dot(p_S) |
|
|
|
p = d * p_h |
|
step, step_h, predicted_reduction = select_step( |
|
x, J_h, diag_h, g_h, p, p_h, d, Delta, lb, ub, theta) |
|
|
|
x_new = make_strictly_feasible(x + step, lb, ub, rstep=0) |
|
f_new = fun(x_new) |
|
nfev += 1 |
|
|
|
step_h_norm = norm(step_h) |
|
|
|
if not np.all(np.isfinite(f_new)): |
|
Delta = 0.25 * step_h_norm |
|
continue |
|
|
|
|
|
if loss_function is not None: |
|
cost_new = loss_function(f_new, cost_only=True) |
|
else: |
|
cost_new = 0.5 * np.dot(f_new, f_new) |
|
actual_reduction = cost - cost_new |
|
Delta_new, ratio = update_tr_radius( |
|
Delta, actual_reduction, predicted_reduction, |
|
step_h_norm, step_h_norm > 0.95 * Delta) |
|
|
|
step_norm = norm(step) |
|
termination_status = check_termination( |
|
actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol) |
|
if termination_status is not None: |
|
break |
|
|
|
alpha *= Delta / Delta_new |
|
Delta = Delta_new |
|
|
|
if actual_reduction > 0: |
|
x = x_new |
|
|
|
f = f_new |
|
f_true = f.copy() |
|
|
|
cost = cost_new |
|
|
|
J = jac(x, f) |
|
njev += 1 |
|
|
|
if loss_function is not None: |
|
rho = loss_function(f) |
|
J, f = scale_for_robust_loss_function(J, f, rho) |
|
|
|
g = compute_grad(J, f) |
|
|
|
if jac_scale: |
|
scale, scale_inv = compute_jac_scale(J, scale_inv) |
|
else: |
|
step_norm = 0 |
|
actual_reduction = 0 |
|
|
|
iteration += 1 |
|
|
|
if termination_status is None: |
|
termination_status = 0 |
|
|
|
active_mask = find_active_constraints(x, lb, ub, rtol=xtol) |
|
return OptimizeResult( |
|
x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm, |
|
active_mask=active_mask, nfev=nfev, njev=njev, |
|
status=termination_status) |
|
|
|
|
|
def trf_no_bounds(fun, jac, x0, f0, J0, ftol, xtol, gtol, max_nfev, |
|
x_scale, loss_function, tr_solver, tr_options, verbose): |
|
x = x0.copy() |
|
|
|
f = f0 |
|
f_true = f.copy() |
|
nfev = 1 |
|
|
|
J = J0 |
|
njev = 1 |
|
m, n = J.shape |
|
|
|
if loss_function is not None: |
|
rho = loss_function(f) |
|
cost = 0.5 * np.sum(rho[0]) |
|
J, f = scale_for_robust_loss_function(J, f, rho) |
|
else: |
|
cost = 0.5 * np.dot(f, f) |
|
|
|
g = compute_grad(J, f) |
|
|
|
jac_scale = isinstance(x_scale, str) and x_scale == 'jac' |
|
if jac_scale: |
|
scale, scale_inv = compute_jac_scale(J) |
|
else: |
|
scale, scale_inv = x_scale, 1 / x_scale |
|
|
|
Delta = norm(x0 * scale_inv) |
|
if Delta == 0: |
|
Delta = 1.0 |
|
|
|
if tr_solver == 'lsmr': |
|
reg_term = 0 |
|
damp = tr_options.pop('damp', 0.0) |
|
regularize = tr_options.pop('regularize', True) |
|
|
|
if max_nfev is None: |
|
max_nfev = x0.size * 100 |
|
|
|
alpha = 0.0 |
|
|
|
termination_status = None |
|
iteration = 0 |
|
step_norm = None |
|
actual_reduction = None |
|
|
|
if verbose == 2: |
|
print_header_nonlinear() |
|
|
|
while True: |
|
g_norm = norm(g, ord=np.inf) |
|
if g_norm < gtol: |
|
termination_status = 1 |
|
|
|
if verbose == 2: |
|
print_iteration_nonlinear(iteration, nfev, cost, actual_reduction, |
|
step_norm, g_norm) |
|
|
|
if termination_status is not None or nfev == max_nfev: |
|
break |
|
|
|
d = scale |
|
g_h = d * g |
|
|
|
if tr_solver == 'exact': |
|
J_h = J * d |
|
U, s, V = svd(J_h, full_matrices=False) |
|
V = V.T |
|
uf = U.T.dot(f) |
|
elif tr_solver == 'lsmr': |
|
J_h = right_multiplied_operator(J, d) |
|
|
|
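            # Same adaptive regularization as in `trf_bounds`.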
if regularize: |
|
a, b = build_quadratic_1d(J_h, g_h, -g_h) |
|
to_tr = Delta / norm(g_h) |
|
ag_value = minimize_quadratic_1d(a, b, 0, to_tr)[1] |
|
reg_term = -ag_value / Delta**2 |
|
|
|
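            # Combine the user-supplied `damp` with the adaptive term.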
damp_full = (damp**2 + reg_term)**0.5 |
|
gn_h = lsmr(J_h, f, damp=damp_full, **tr_options)[0] |
|
S = np.vstack((g_h, gn_h)).T |
|
S, _ = qr(S, mode='economic') |
|
JS = J_h.dot(S) |
|
B_S = np.dot(JS.T, JS) |
|
g_S = S.T.dot(g_h) |
|
|
|
actual_reduction = -1 |
|
while actual_reduction <= 0 and nfev < max_nfev: |
|
if tr_solver == 'exact': |
|
step_h, alpha, n_iter = solve_lsq_trust_region( |
|
n, m, uf, s, V, Delta, initial_alpha=alpha) |
|
elif tr_solver == 'lsmr': |
|
p_S, _ = solve_trust_region_2d(B_S, g_S, Delta) |
|
step_h = S.dot(p_S) |
|
|
|
predicted_reduction = -evaluate_quadratic(J_h, g_h, step_h) |
|
step = d * step_h |
|
x_new = x + step |
|
f_new = fun(x_new) |
|
nfev += 1 |
|
|
|
step_h_norm = norm(step_h) |
|
|
|
if not np.all(np.isfinite(f_new)): |
|
Delta = 0.25 * step_h_norm |
|
continue |
|
|
|
|
|
if loss_function is not None: |
|
cost_new = loss_function(f_new, cost_only=True) |
|
else: |
|
cost_new = 0.5 * np.dot(f_new, f_new) |
|
actual_reduction = cost - cost_new |
|
|
|
Delta_new, ratio = update_tr_radius( |
|
Delta, actual_reduction, predicted_reduction, |
|
step_h_norm, step_h_norm > 0.95 * Delta) |
|
|
|
step_norm = norm(step) |
|
termination_status = check_termination( |
|
actual_reduction, cost, step_norm, norm(x), ratio, ftol, xtol) |
|
if termination_status is not None: |
|
break |
|
|
|
alpha *= Delta / Delta_new |
|
Delta = Delta_new |
|
|
|
if actual_reduction > 0: |
|
x = x_new |
|
|
|
f = f_new |
|
f_true = f.copy() |
|
|
|
cost = cost_new |
|
|
|
J = jac(x, f) |
|
njev += 1 |
|
|
|
if loss_function is not None: |
|
rho = loss_function(f) |
|
J, f = scale_for_robust_loss_function(J, f, rho) |
|
|
|
g = compute_grad(J, f) |
|
|
|
if jac_scale: |
|
scale, scale_inv = compute_jac_scale(J, scale_inv) |
|
else: |
|
step_norm = 0 |
|
actual_reduction = 0 |
|
|
|
iteration += 1 |
|
|
|
if termination_status is None: |
|
termination_status = 0 |
|
|
|
active_mask = np.zeros_like(x) |
|
return OptimizeResult( |
|
x=x, cost=cost, fun=f_true, jac=J, grad=g, optimality=g_norm, |
|
active_mask=active_mask, nfev=nfev, njev=njev, |
|
status=termination_status) |
|
|