A General Optimization Framework for Dynamic Time Warping

Dave Deriso · 2026

Given two signals $x$ and $y$, the goal of dynamic time warping (DTW) is to find a warping function $\phi$ such that $x \circ \phi \approx y$. In other words, instead of modifying amplitude, we modify (or "warp") time to align the two signals.

This problem arises naturally whenever you compare signals that share the same shape but differ in timing. A cardiologist comparing heartbeats across patients sees the same P-wave, QRS complex, and T-wave — but each occurs at slightly different times and speeds. An ADC clock hang stalls the digitized signal while the reference keeps running, then races to catch up. A multimodal waveform may stretch or compress differently across each of its lobes. In each case, a meaningful comparison requires first removing the timing differences — and that means finding the right warping function.

[Figure: three warping examples (heartbeats, ADC clock stall, multimodal waveform)]

The figures above show three such problems. In each, the blue and orange signals share the same underlying shape, but their timing differs — the gray lines show the pointwise correspondence found by a warping function. Simple operations like shifting or scaling the time axis cannot fix these misalignments; the timing distortion varies across the signal, so only a nonlinear warping function can bring them into register.

DTW is the standard tool for this, used wherever signals must be aligned before they can be compared. The classic algorithm uses dynamic programming to find a warping path that minimizes misalignment, subject to monotonicity, boundary, and continuity constraints.
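To make that concrete, here is a minimal NumPy sketch of the standard recurrence; the function name and quadratic local cost are illustrative choices, not the API of any particular DTW library:

```python
import numpy as np

def classic_dtw(x, y):
    """Minimal classic DTW: O(len(x) * len(y)) dynamic program.

    Returns the accumulated alignment cost and the warping path as a list
    of (i, j) index pairs satisfying the boundary, monotonicity, and
    continuity constraints.
    """
    n, m = len(x), len(y)
    D = np.full((n, m), np.inf)          # accumulated cost matrix
    D[0, 0] = (x[0] - y[0]) ** 2
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                D[i - 1, j] if i > 0 else np.inf,                 # advance x only
                D[i, j - 1] if j > 0 else np.inf,                 # advance y only
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance both
            )
            D[i, j] = (x[i] - y[j]) ** 2 + best_prev
    # backtrack from (n-1, m-1) to (0, 0) to recover the path
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: D[c])
        path.append((i, j))
    return D[-1, -1], path[::-1]
```

The flat runs in the returned path, where one index advances while the other stays put, are exactly the singularities discussed below.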

Despite its popularity, DTW has a longstanding problem: the warping path is restricted to a discrete grid, so the resulting warping function is piecewise constant — a staircase. Flat segments of the staircase are singularities, where one signal's time stands still while the other advances, collapsing many time points onto one. The comparison below illustrates this issue.

[Figure: Classic DTW alignment, showing the staircase warping path and its singularities]

Most approaches to reducing singularities fall into two camps. The first preprocesses the input signals — using combinations of first and second derivatives, square-root velocity functions, adaptive down-sampling, or ensembles of features including wavelet transforms — to indirectly influence warping smoothness. The second varies the continuity constraints by expanding allowable transitions via step patterns. Both are ad-hoc techniques that require hand-tuning for different signal types.

A continuous-time optimization framework

We handle these issues entirely within an optimization framework in continuous time. Instead of discrete paths, we choose a warping function $\phi$ that minimizes a loss measuring misalignment plus two regularization terms: one penalizing cumulative warping (limiting overfitting, analogous to ridge regularization), the other penalizing the instantaneous warping rate (directly producing smoother warping functions). Our continuous-time formulation also allows for non-uniformly sampled signals, enabling simple out-of-sample validation to select penalty parameters and detect overfitting.

Concretely, we pose DTW as choosing a time warp function $\phi: [0,1] \to [0,1]$ that minimizes:

$$f(\phi) = \mathcal{L}(\phi) + \lambda^{\text{cuml}} \cdot \mathcal{R}^{\text{cuml}}(\phi) + \lambda^{\text{inst}} \cdot \mathcal{R}^{\text{inst}}(\phi)$$

subject to $\phi(0) = 0$ and $\phi(1) = 1$.

Let's unpack each term.

The loss functional

The loss measures how well the time-warped signal approximates the target:

$$\mathcal{L}(\phi) = \int_0^1 L\big(x(\phi(t)) - y(t)\big) \, dt$$

This is simply the average penalty for the difference between the time-warped first signal and the second signal. Common choices include:

  • Squared loss: $L(u) = \|u\|_2^2$
  • Absolute loss: $L(u) = \|u\|_1$
  • Huber loss: squared for small deviations, linear for outliers
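
As a quick illustration, these choices are just different penalty functions applied to the residual; a NumPy sketch, where the Huber threshold delta is an arbitrary value for this example:

```python
import numpy as np

def squared_loss(u):
    return np.sum(u ** 2)                 # L(u) = ||u||_2^2

def absolute_loss(u):
    return np.sum(np.abs(u))              # L(u) = ||u||_1

def huber_loss(u, delta=1.0):
    # quadratic for small residuals, linear for large ones (robust to outliers)
    a = np.abs(u)
    return np.sum(np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta)))
```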

Cumulative warp regularization

The cumulative warp $\phi(t) - t$ measures how much the warped time deviates from the original time at each moment. We regularize this with:

$$\mathcal{R}^{\text{cuml}}(\phi) = \int_0^1 R^{\text{cuml}}\big(\phi(t) - t\big) \, dt$$

This term limits overfitting, similar to ridge or lasso regularization in machine learning. The larger $\lambda^{\text{cuml}}$, the less $\phi(t)$ deviates from $t$.

Instantaneous warp (or "smoothing") regularization

The instantaneous warp $\phi'(t) - 1$ measures how much the warping rate deviates from unit speed at each moment. We regularize this with:

$$\mathcal{R}^{\text{inst}}(\phi) = \int_0^1 R^{\text{inst}}\big(\phi'(t) - 1\big) \, dt$$

This is the key to eliminating singularities. By penalizing deviations of the instantaneous warping rate from one, we directly encourage smooth warping functions. The larger $\lambda^{\text{inst}}$, the smoother the time warping function.
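
To see how the three terms combine, here is a rough sketch of the discretized objective on a grid, using squared penalties throughout; the signals are passed as callables (e.g. interpolators), which is also what makes non-uniform sampling easy to handle. The function name, simple Riemann-sum quadrature, and defaults (which mirror the demo's slider values below) are our choices:

```python
import numpy as np

def gdtw_objective(tau, t, x, y, lam_cuml=0.1, lam_inst=1.0):
    """Discretized f(phi) for a piecewise-linear warp with tau[i] = phi(t[i]).

    x and y are callables on [0, 1]; all three integrals are approximated
    with simple Riemann sums on the (possibly non-uniform) grid t.
    """
    dt = np.diff(t)
    loss = np.sum(((x(tau) - y(t)) ** 2)[:-1] * dt)        # misalignment
    cuml = np.sum(((tau - t) ** 2)[:-1] * dt)              # cumulative warp
    inst = np.sum((np.diff(tau) / dt - 1.0) ** 2 * dt)     # instantaneous warp rate
    return loss + lam_cuml * cuml + lam_inst * inst

# identity warp on the quadratic-warp example from the demo below
t = np.linspace(0.0, 1.0, 200)
x = lambda s: np.sin(2 * np.pi * s)
y = lambda s: np.sin(2 * np.pi * s ** 2)
print(gdtw_objective(t, t, x, y))   # regularizers vanish; only misalignment remains
```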

Explore the impact of the smoothing regularization below to see how it improves over the classic DTW formulation.

[Interactive comparison: Classic DTW vs. GDTW, with an adjustable smoothing parameter]

Constraints via infinite penalties

A clever aspect of this formulation: we can impose hard constraints by allowing the regularization functions to take the value $+\infty$. For example, requiring that the slope $\phi'$ stay between bounds $S_{\min}$ and $S_{\max}$:

$$R^{\text{inst}}(u) = \begin{cases} u^2 & S_{\min} \leq u + 1 \leq S_{\max} \\ +\infty & \text{otherwise} \end{cases}$$

This elegantly unifies soft penalties and hard constraints in a single framework.
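
In code, this is just a penalty that returns infinity outside the feasible range; a sketch, with the bounds interpreted as limits on the slope $\phi' = u + 1$ and default values chosen arbitrarily:

```python
import numpy as np

def r_inst(u, s_min=0.0, s_max=2.0):
    """Instantaneous penalty on u = phi'(t) - 1 with hard slope bounds.

    The slope is phi'(t) = u + 1; outside [s_min, s_max] the penalty is
    infinite, which turns the soft penalty into a hard constraint.
    """
    slope = u + 1.0
    return np.where((slope >= s_min) & (slope <= s_max), u ** 2, np.inf)
```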

The interactive demo below shows how these penalty terms shape the warping function $\phi(t)$. The top chart shows the signals and their alignment via the warp lines. Below it, three charts display $\phi(t)$ directly (with the dotted diagonal representing identity), the deviation $\phi(t) - t$, and the instantaneous warping rate $\phi'(t) - 1$. Adjust the regularization parameters to see how they control smoothness and total warping.

A sine wave warped by a quadratic time function $\phi(t) = t^2$.

[Interactive demo: sliders for $\lambda^{\text{inst}}$ (smoothness, default 1; higher values give smoother warping functions, lower values may produce singularities) and $\lambda^{\text{cuml}}$ (cumulative warp, default 0.1; higher values limit the total amount of warping, preventing over-fitting), with a loss-and-penalty readout]

Solving via dynamic programming with refinement

The optimization problem we've formulated is a classical continuous-time optimal control problem with scalar state $\phi(t)$ and action $u(t) = \phi'(t)$. While such problems are generally intractable to solve exactly, this particular problem — with state dimension one — can be solved efficiently by discretization and dynamic programming.

Discretization

We discretize time into $N$ values and assume $\phi$ is piecewise linear with knot points at these times. This gives us a vector $\tau \in \mathbb{R}^N$ where $\tau_i = \phi(t_i)$.

We further discretize the values each $\tau_i$ can take into $M$ possibilities within bounds determined by the slope constraints.
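
One reasonable way to build that grid is to keep, at each time $t_i$, only the values reachable from $\phi(0) = 0$ and from which $\phi(1) = 1$ is still reachable under the slope bounds. A sketch, with the function name, grid sizes, and bounds chosen for illustration:

```python
import numpy as np

def make_grid(N=100, M=50, s_min=0.0, s_max=2.0):
    """Candidate values tau[i, j] for phi(t_i), j = 0..M-1.

    Each row spans only values reachable from phi(0) = 0 and from which
    phi(1) = 1 remains reachable under slopes in [s_min, s_max].
    """
    t = np.linspace(0.0, 1.0, N)
    lo = np.maximum(s_min * t, 1.0 - s_max * (1.0 - t))   # lower envelope
    hi = np.minimum(s_max * t, 1.0 - s_min * (1.0 - t))   # upper envelope
    tau = lo[:, None] + (hi - lo)[:, None] * np.linspace(0.0, 1.0, M)[None, :]
    return t, tau
```

Note that the boundary conditions come for free: every candidate at $t = 0$ equals 0 and every candidate at $t = 1$ equals 1.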

Shortest path formulation

The discretized problem becomes a shortest path problem on a graph. We form a graph with $MN$ nodes, where each node $\tau_{ij}$ (for time index $i$ and value index $j$) has $M$ outgoing edges to nodes at the next time step.

At each node, we associate the state cost (loss plus cumulative regularization). On each edge, we associate the instantaneous regularization cost. The total cost of a path from $\tau_{11} = 0$ to $\tau_{NM} = 1$ is exactly our discretized objective.

Standard dynamic programming finds the globally optimal path in $O(NM^2)$ time.
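
A compact sketch of that dynamic program, reusing make_grid and the warped-sine example from the earlier sketches; the vectorized inner step, squared penalties, and parameter defaults are our choices, and the grid refinement alluded to in this section's title is omitted:

```python
import numpy as np

def solve_gdtw(x, y, t, tau, lam_cuml=0.1, lam_inst=1.0, s_min=0.0, s_max=2.0):
    """Globally optimal warp over the N x M grid of candidate values tau[i, j].

    Node cost = loss + cumulative penalty; edge cost = instantaneous penalty,
    set to infinity when the implied slope violates [s_min, s_max].
    """
    N, M = tau.shape
    dt = np.diff(t)
    node = (x(tau) - y(t)[:, None]) ** 2 + lam_cuml * (tau - t[:, None]) ** 2

    cost = np.full((N, M), np.inf)        # best cost to reach each node
    prev = np.zeros((N, M), dtype=int)    # backpointers for path recovery
    cost[0] = 0.0                         # every candidate at t = 0 equals 0

    for i in range(N - 1):
        # slope[j, k] is the slope of the segment from tau[i, j] to tau[i+1, k]
        slope = (tau[i + 1][None, :] - tau[i][:, None]) / dt[i]
        edge = lam_inst * (slope - 1.0) ** 2 * dt[i]
        edge[(slope < s_min) | (slope > s_max)] = np.inf
        total = cost[i][:, None] + edge + node[i + 1][None, :] * dt[i]
        prev[i + 1] = np.argmin(total, axis=0)
        cost[i + 1] = np.min(total, axis=0)

    # backtrack from the best terminal node (all candidates at t = 1 equal 1)
    j = int(np.argmin(cost[-1]))
    phi = np.empty(N)
    for i in range(N - 1, -1, -1):
        phi[i] = tau[i, j]
        if i > 0:
            j = prev[i, j]
    return phi

# usage: approximately recover the quadratic warp phi(t) = t^2 from the demo above
t, tau = make_grid(N=200, M=100)
phi = solve_gdtw(lambda s: np.sin(2 * np.pi * s),
                 lambda s: np.sin(2 * np.pi * s ** 2),
                 t, tau)
```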

Looking ahead

This post introduced the core idea behind GDTW: by recasting dynamic time warping as a continuous-time optimization problem with explicit regularization, we can eliminate the singularities that have plagued classical DTW for decades. But this framework opens the door to far more than just smoother alignments.

In upcoming posts, we'll explore these ideas in much greater depth.

The unifying theme is that time warping, when treated as a first-class optimization variable rather than a preprocessing step, becomes a remarkably flexible primitive for signal processing and machine learning. Stay tuned.