The random walk method on the space of unit quaternions produces uniformly distributed quaternions without trigonometric functions. We study some performance and quality properties. CPU benchmarks show favorable results compared to the classical algorithm. Using shared memory, speedups can also be found on the GPU. Implementations in SIMD and GLSL are provided.

(Note: This post relies on mathjax).

\( \def\H{\mathbb{H}} \def\H1{\mathbb{H}^1} \def\N{\mathbb{N}} \def\E{\mathbb{E}} \def\Z{\mathbb{Z}} \def\R{\mathbb{R}} \def\C{\mathbb{C}} \def\S{\mathbb{S}} \def\V{\mathbb{V}} \def\HR{\H(\R)} \def\HU{\mathbb{H}^1(\mathbb{R})} \def\SU{\text{SU}(2)} \def\su{\mathfrak{su}_2} \def\SO{\text{SO}(3)} \def\So{\text{SO}(2)} \)## 1. Quaternions as rotations

Consider the space of quaternions

$$q= r+xi+yj+zk \in\HR$$

where \( x,y,z\in \R \). The vectors \(i, j, k\) represent the imaginary units satisfying the multiplication laws

$$i^2 = j^2 = k^2 = -1, \quad ijk = -1.$$

By linearity, these rules define a multiplication among quaternions. Using the norm of \(\R^4\), the unit quaternions \(\HU\) are those with $$\|q\|^2 = r^2 + x^2 +y^2 +z^2 =1$$ and thus define elements on the 3-sphere \(\S^3\).

Since for two quaternions \(q_1, q_2\) we have $$ \|q_1q_2\|=\|q_1\|\|q_2\|,$$ the unit quaternions are closed under multiplication and thus provide \(\S^3\) with a group structure. This multiplication can also be obtained by using the matrix group of \(\SU\) under the following isomorphism:

$$m:\HU \to \SU \quad q= r+xi+yj+zk \mapsto m_q=\begin{pmatrix} r + ix & y + iz \\ -y + iz & r – ix \end{pmatrix}.$$

Note that \[ \det m_q = \|q\|^2 = 1. \]

This isomorphism lets us understand how quaternions act on two complex planes \(\C^2\).

The common algorithm^{1} for producing random quaternions using these planes is

```
quat rand_quat()
{
float phi = rand_float(0,2*pi);
float theta = rand_float(0,2*pi);
float z = rand_float(0,1);
float r = sqrt(z);
float t = sqrt(1.f-z);
return {r*cos(phi), r*sin(phi), t*cos(theta), t*sin(theta)};
}
```

We shall refer to this algorithm (and its variations) to either the polar method or Marsaglia’s algorithm.

Quaternions are useful because of the two-to-one representation \(\SU \to \SO\). This representation can be obtained from the Adjoint representation where \(\SU\) acts by conjugation on its own Lie algebra. The Lie algebra \( \su \) of \(\SU\) is the vector space (over \(\R\)) of complex \(2\times2\) matrices \(m=(a_{ij})\) with zero trace (\(a_{11} = -a_{22}\)) that are skew-Hermitian (\(a_{ij} = -\overline{a_{ji}}\)). It is spanned by the three vectors

\[ \begin{pmatrix} i & \\ & -i \end{pmatrix}, \begin{pmatrix} & 1 \\ -1 & \end{pmatrix}, \begin{pmatrix} & i \\ i & \end{pmatrix}, \]

which are the images of \(i\), \(j\) and \(k\) under the map given before.

Identify \( \R^3\) with the set of *pure *quaternions \( v = xi+yj+zk \). Then conjugation by a unit quaternion \( q\in\HU \) $$v\mapsto qvq^{-1} = qv\overline{q}$$ where $$q^{-1} = \overline{q} = r- xi-yj-zk \quad \forall q\in\HU $$ is equivalent to the \( 2\times2 \) matrix multiplication $$ m_{qv\overline{q}} = m_qm_vm_q^{-1} = m_qm_vm_q^{*} $$ where \( m_q^{*} \) is the complex conjugate of a matrix, which is by definition the inverse of a unitary matrix \( m_q \in \SU \). We know that this action of \( \HU \) on \( \R^3 \) must be a rotation since conjugation by a quaternion preserves the norm.

With respect to this basis \( i, j, k \) this action has the matrix form

\[ R_q = \begin{pmatrix} 1 – 2y^2 – 2z^2 & 2xy – 2rz & 2xz + 2ry \\ 2xy + 2rz & 1 – 2x^2 – 2z^2 & 2yz – 2rx \\ 2xz – 2ry & 2yz + 2rx & 1 – 2x^2 – 2y^2 \end{pmatrix}, \]

which is the usual formula given to convert a quaternion into a rotation.^{2}

It is useful to have a complete multiplication rule in terms of the scalar real and the 3-dimensional imaginary part of a quaternion: Let \(q_1 = r_1 + v_1\) and \(q_2 = r_2+v_2\) then

\[q_1q_2 = (r_1r_2-\langle v_1,v_2\rangle) + (r_1v_2+r_2v_1 + v_1\times v_2).\]

where the angle brackets denote the 3-dimensional inner product and the multiplication sign denotes the cross product. If \(q_2\) is pure then \(q_1q_2 = -\langle v_1,v_2\rangle +r_1v_2+v_1\times v_2\). Multiplying by \(\overline{q_1}\) from the right, we know that the real part will cancel out and we collect only the imaginary part:

\[ q_1q_2\overline{q_1} = -\langle v_1,v_2\rangle (-v_1) + r_1(r_1v_2+v_1\times v_2) + (r_1v_2+v_1\times v_2)\times (-v_1)\]

\[=\langle v_1,v_2\rangle v_1 + r_1^2v_2+ v_1\times(2r_1v_2+v_1\times v_2).\]

Substitute \(r_1^2 = 1- \langle v_1, v_1\rangle\), we can apply the vector triple product formula,^{3} \(\langle v_1,v_2\rangle v_1 – \langle v_1, v_1\rangle v_2 = v_1\times(v_1\times v_2)\) to get

\[=v_2+ 2v_1\times(r_1v_2+v_1\times v_2).\]

This results in commonly known GLSL code for quaternion-vector multiplication.^{4}

```
vec3 quat_mult(vec4 q, vec3 v)
{
return v + 2*cross(q.yzw, (q.x*v + cross(q.yzw, v)));
}
```

## 2. Random quaternions

Our goal is to produce a random unit quaternion or a random rotation \( g \in \SO \). Here the notation of random implies that we have picked a measure \(\mu=\mu_{\SO} \) (that is, a “volume” or “area” function) on the group of rotation, namely the uniform measure, also known as Haar measure.^{5}

Up to normalization, it is the unique translation invariant measure: For any^{6} subset \( E\subset \SO \), for any \( g \in \SO \), $$\mu(gE ) =\mu(E ) $$ where \( gE = \{gh : h\in E\} \).

Equivalently, we can use the uniform measure \( \nu=\nu_{\S^3} \) of the 3-sphere (which is also a Haar measure thanks to the multiplication structure of the unit quaternions) and its push forward measure on \( \SO \) defined by \(\mu(E) = \nu(R^{-1}E)\) where \( R^{-1}E \) denotes the preimage of \(E\subset\SO \) under \(q\mapsto R_q\). This must be equal to \( \mu \) (up to normalization) by equivariance of the map \(q\mapsto R_q\), that is, \( R_{q_2q_1} = R_{q_2}R_{q_1} \) and the requirement that the measures are translation invariant.

Algorithms which produce random rotations are discussed in the Graphics Gems series,^{7} II and III by Arvo and Shoemake.^{8} These include the Box-Muller method to first produce a Gaussian random variable in 4 dimensions. Since it is rotation invariant, all is left is to normalize. The other proposed method is the subgroup algorithm where the fact is used that \( \So \subset \SO \) and the quotient is \( \SO/\So \simeq \S^2\). The algorithm combines then a random generator of the 2-dimensional unit sphere (a random axis of rotation) with one on the 1-dimensional unit circle (a random angle of rotation).^{9}

We note that the rejection method in which we sample the 4-dimensional cube \( [-1,1]^4 \) until we find a vector that lies inside the unit ball which we then normalize to lie on the unit sphere, becomes already quite inefficient: only \( \frac{\frac12 \pi^2}{2^4} \sim 0.308 \) of the samples are accepted. Marsaglia^{10} also provides a method in which we sample twice from the unit disk via rejection sampling which succeeds with probability \( \frac{\pi}{2^2}\frac{\pi}{2^2}\sim 0.617 \) which improves the acceptance rate by a factor of two. Replacing the rejection method with the polar method results in the previously given rand_quat() function. One also arrives to rand_quat() from the subgroup algorithm using a variable substitution.^{1}

## 3. Integer quaternions and the tree structure

Quaternions already make sense over the integers,^{12} and a fundamental fact for the method discussed in this post.

The group \( \SO \) has the interesting property that it contains many trees. To begin with, recall that the free group of rank \( s \) is obtained by considering all possible words built from a fixed set \( S \) of generators \(a_1, \dots, a_s \) and their inverses. For \(s=2\), we may write the generators as \(a,b\) with inverses \(a^{-1}, b^{-1} \) which are defined by \(aa^{-1} = e = bb^{-1} \) where \( e\) denotes the identity element and represents the “empty word”. It is the simplest non-commutative group. “Simple” here refers to the lack of any other relation between the generators. Given a group with a set of generators, one can consider the associated Cayley graph. The vertex set is the group. Edge relations between two elements are defined if the two elements differ by a translation from the generator set or their inverses. Hence, inside a free group of rank \(s\), every element has \( 2s \) neighbors. The lack of relations implies that there are no closed paths inside the graph. Hence, the Cayley graph of a free group of rank \(s\) is a \( 2s \)-regular tree. The group of rotations in 3 dimensions or larger contains a free group.^{13}

The following quaternions give rise to a free group of rank 3 inside \( \SO \):

$$ S_5=\{1\pm2i, 1\pm2j, 1\pm 2k\}. $$

If we want to write them as unit quaternions, we have to divide by their norm \( \sqrt{5} \).

The associated rotations are

$$ \frac{1}{5}\begin{pmatrix}

1 & 0 & 0 \\

0 & -3 & -4 \\

0 & 4 & -3

\end{pmatrix}, \quad \frac{1}{5}\begin{pmatrix}

-3 & 0 & 4 \\

0 & 1 & 0 \\

-4 & 0 & -3

\end{pmatrix}, \quad \frac{1}{5}\begin{pmatrix}

-3 & -4 & 0 \\

4 & -3 & 0 \\

0 & 0 & 1

\end{pmatrix}$$

and their inverses. In fact, for any prime \(p\) which satisfies \( p \text{ mod } 4 = 1 \) there exist exactly \(2(p+1)\) many integer quaternions of norm \( p \) that will generate a free group of rank \( p+1 \) when considering their representation as \( \SO \). The next primes after \(p=5\) satisfying this congruence condition are \(13\) and \(17\). The integer quaternions of norm \(13\) and \(17\) are

$$S_{13} = \{1±i±j±k, 3±2i, 3±2j, 3±2k\}$$

and

$$S_{17} = \{1±4i, 1±4j, 1±4k, 3±2i±2j, 3±2j±2k, 3±2k±2i\}.$$

A proof that these generate free groups can be found in Lubotzky’s book.^{14}

## 4. Random walk

The simple random walk on the graph generated by \(S_5=\{1\pm2i, 1\pm2j, 1\pm 2k\}\)

starting at the identity element which is the root of the tree \(G_5\), picks a next neighbor at every step, and the step itself is done by quaternion multiplication. At time n we are at the vertex defined by the word \(g_0,\dots,g_n \) where \(g_i=s_{j_i}\in S_5\) is randomly picked among all the elements \(S_5\).

Here is a pseudo C++ implementation:

```
float s5 = 1.f/sqrt(5.f);
quat s[6]
{ s5*quat{1, 2,0,0},
s5*quat{1,0, 2,0},
s5*quat{1,0,0, 2},
s5*quat{1,-2,0,0},
s5*quat{1,0,-2,0},
s5*quat{1,0,0,-2},
};
quat random_quat(quat &q)
{
u32 a = uniform_int(0,5); // Draw an integer in [0,5].
q= quat_mult(s[a], q);
return q;
}
```

## 5. Convolution and spectral gap

It was shown by Lubotzky, Phillips and Sarnak^{15} (LPS) that the convolution operator associated to these elements has the strongest possible spectral gap possible.

To explain what this means, we pick out the tree \( G_p \) associated to \( S_{p} \). To study the spectral properties of this graph, one considers the space of functions \( l^2(G_p) \) on \(G_p\) that are square-integrable with respect to the (infinite) counting measure. We can consider the normalized adjacency matrix as convolution operator \( T_p \) on this space:

$$T_pf (g)= \frac{1}{|S_p|}\sum_{h\in S_p} f(hg).$$ By a theorem of Kesten^{16} about random walks on free groups, the spectral radius of this operator is \( \frac{2 \sqrt{p}}{p+1}\).

LPS show that if we consider the average operator \( T_p \) on \( L_0^2(\SO,\mu) \) instead, the space of square integrable functions with respect to the uniform measure \( \mu \) which are orthogonal to the constant function (equivalently, have integral zero), then the operator still has the same spectral radius:

$$ \lambda_1 = \|T_p\|_\text{Op} = \frac{2 \sqrt{p}}{p+1}. $$

Here one restricts to functions orthogonal to the constant function \(g\mapsto 1\) as to discard the eigenvalue 1. The “spectral gap” alluded to in the heading is the difference between this trivial eigenvalue \(1\) and \( \lambda_1 \). The larger this gap, the quicker will \( T_p^nf \) converge to \(\mu(f)\), which is a result about equidistribution: For any square-integrable \(f\) on \(\SO\),

$$ \|T_p^nf-\mu(f)\| = \mathcal{O}(p^{-\frac{n}{2}}).$$

The \(L^2\) bounds on the Hecke operators imply analogous bounds for the \(L^\infty\) norm.

Thus, there exists \(\kappa>0\) such that for every \(g\), every Lipschitz function \(f\) on \(\SO\),

$$\left|T_p^{n}(f)(g)-\mu(f)\right| = \mathcal{O}_f(p^{-\kappa n})$$

From here one can also deduce a discrepancy result: Let

$$S_{p}^n = \text{ words of length }n,$$

equivalently, the sphere of radius \(n\) inside the tree \( G_p \).

There exist some exponent \(\kappa>0\) such that for any nice set \(E\subset\SO\) it holds

$$ \left|\frac{|E \cap S^n_p|}{|S^n_p|} – \frac{\mu(E)}{\mu(\SO)}\right| = \mathcal{O}(p^{-n\kappa}).$$

The theorems of LPS show that the balls and spheres in tree \( G_p \) define a low-discrepancy sequence.^{17} In principle they have the disadvantage that the sequences one can generate grow exponentially, so that a given number might be hard to approximate as it falls in between these cardinalities. However, if we consider intermediate cardinalities, they will still equidistribute if we enumerate the points in the tree in a width-first manner. In this case, an intermediate value will correspond to the union of the largest containing ball inside the set (i.e. finite subtree of same degree except for its leaves), plus some translates of smaller balls which individually are all well distributed. The disadvantage of generating the whole tree is the costly memory access, as all nodes from the previously generated sphere have to be read to generate the next one. We return to this matter in the section on the GPU implementation.

We now give up on the low-discrepancy property in favor for an algorithm that can produce an arbitrary number of points by producing a random sequence of elements in \( S_p \).

Here are images of the spheres of radius 5, 6, and the whole tree up to radius 8.

For each quaternion, we plotted three points, the images of the rotation action under the standard basis.

For comparison, here is the polar method for the same number of points:

Both image sequences look similarly random, except that the random walk method produces some visible patterns. A hint that they come from a very arithmetic structure after all.

## 6. Nonbacktracking random walks

The dynamics related to \(T_p^n\) is the simple random walk. Averaging over \(S_{p}^n\) is the non-backtracking random walk. Note that for the non-backtracking walk, we start with at a point \( g_0 \) on \( \SO \), and move with equal probability to any of the 6 neighbors \( R_{s_1}g \) for \( s_1\in S_5=\{1\pm2i, 1\pm2j, 1\pm 2k\} \).

For the next move, we restrict to \( s_{2} \neq s_1^{-1} \), the non-backtracking condition. This will hold for all future times \(n+1\) as well: \( s_{n+1} \neq s_n^{-1} \). The set of non-backtracking paths of length \( n \) starting at \( g_0 \) is \( S_{5}^n g_0\) using the tree sphere introduced in the previous section.

The non-backtracking walk will have to save the last step in its internal state. Using the particular ordering in the array, if the last step was \(k\), then \( k + 3\mod6 \) corresponds to a step back. Hence we can choose any element in \([ k+3+1, k+3+5] \mod 6\).

```
struct internal_state { quat q; u32 last_choice; };
quat random_rotation_nonbacktracking(internal_state &state)
{
u32 a = ( state.last_choice + 4 + uniform_int(0,4)) % 6;
state.q = quat_mult(s[a], state.q);
state.last_choice = a;
return state.q;
}
```

We omitted the seed for the uniform integer generator which should also be part of the internal_state struct.

## 7. Equidistribution of the random walk

As suggested more recently,^{18} the spectral gap of the operator \( T_5 \) (also called *Hecke *operator) implies equidistribution of the (non-backtracking) random walk.

The large deviation principle states that the fraction of walks \(\gamma=(s_1g_0,s_2s_1g_0, \dots, s_n\dots s_1g_0)\) of length \( n \) that do not equidistribute, that is, given a set \( E\subset \SO \) and any \(\epsilon>0 \)

$$ \left| \frac{ |\gamma \cap E|}{n} – \frac{\mu(E)}{\mu(\SO)}\right| >\epsilon$$

decays exponentially as \(n\to\infty\).

The law of large numbers (LLN) can be phrased as follows:

Let \(f:\SO\to\R\) any continuous function. Then for any \( g\in\SO \),

$$ \frac{1}{n} \left( f(s_1g) + f(s_2s_1g) + \dots + f(s_ns_{n-1}\dots s_1g)\right)\to\mu(f).$$

We wish to show the law of large numbers for sum of random variables \(f(\gamma_k\dots\gamma_1g)\):

$$ Y_n(\gamma_1,\dots,\gamma_n) := \frac{1}{n}\left(f(\gamma_1g)+f(\gamma_2\gamma_1g)+\dots+f(\gamma_n\dots\gamma_1g)\right) $$

where \(\gamma_i\) is chosen uniformly from \(S_p\setminus\{\gamma_i\}\) (non-backtracking random walk)

or uniformly among \(S_p\) (simple random walk).

For brevity, we restrict now to the simple random walk.

Let \(\nu\) be the uniform measure on \(S_p\). This is the law of the walk.

The space of all paths is \(S_p^{\otimes \N}\) carrying the Bernoulli measure \(\beta = \nu^{\otimes\N}\).

The expectation \(\E\) of \(Y_n\) with respect to \(\beta\) is

$$\E[nY_n] = \frac{1}{(p+1)^n} \sum_{\gamma_1,\dots,\gamma_n} Y_n(\gamma_1,\dots,\gamma_n) = Tf(g) + T^2f(g) + \dots + T^nf(g).$$

Applying the above operator bound from the spectral gap discussion,

$$\E[Y_n] = \mu(f) + \mathcal{O}(n^{-1})$$

since the error term sum to a geometric series bounded by a constant.

Analogously one shows that the variance behaves like \(\text{Var}[Y_n] = (\mu(f^2)-\mu(f)^2)n^{-1} + \mathcal{O}(n^{-2})\),

using that if \(k>\ell\) then \(T^{k}_p(f)T^{\ell}_p(f)= T^\ell_p(fT^{k-\ell}f)\).

The pointwise convergence of \(T^n_p\) implies that for any continuous function \(f\), and any \(T_p\)-invariant probability measure on \(\mu’\) on \(\SO\), we have \(\mu'(f)=\mu(f)\).

It follows from Breiman’s LLT for Markov chains,^{19} which applies to both the simple random walk and the non-back tracking walk, that

$$Y_n\to \mu(f).$$

Using these bounds it is also possible to deduce a central limit theorem.^{20}

## 8. Estimating the discrepancy

The results in LPS refer to the following setup.

Pick a point \(v_0\in \S^2\) and study their images of a set \( \{g_1,\dots g_n\}\) of rotations. Let \(m=m_{\S^2}\) denote the uniform measure on \(\S^2\).

In LPS, the spherical cap discrepancy is employed. A spherical cap is determined by a unit vector \(w\in\S^2\) and a number \(\omega\) to define the set \(\text{Cap}_{w,\omega}=\{ v\in\S^2 : \langle v,w\rangle < \omega \}\) where angle brackets denote the inner product of \(\R^3\).

Given a point set \(\Lambda\) the **extremal spherical cap discrepancy** is

\[ D_\infty(\Lambda) = \sup_{\text{Cap}_{w,\omega}} \left| \frac{|\Lambda\cap \text{Cap}_{w,\omega}|}{|\Lambda|}- \frac{m(\text{Cap}_{w,\omega})}{m(\S^2)} \right|. \]

An analytically easier discrepancy is the **mean square spherical cap discrepancy**

\[ D_2(\Lambda) = \left(\int \left| \frac{|\Lambda\cap \text{Cap}_{w,\omega}|}{|\Lambda|}- \frac{m(\text{Cap}_{w,\omega})}{m(\S^2)} \right|^2 d\nu(w,\omega) \right)^{\frac12}, \]

where \(\nu = m_{\S^2} \times m_{[-1,1]}\) is a measure on the set of caps using the natural parametrization.

LPS prove \[ D_\infty(S^n_p) = \mathcal{O}(|S^n_p|^{-\frac13+\epsilon})\] where I used an \(\epsilon\) to absorb the log factors. They conjecture that the true exponent is \(\frac12-\epsilon\), which indeed holds for the mean square discrepancy:

\[ D_2(S^n_p) = \mathcal{O}(|S^n_p|^{-\frac12+\epsilon}).\]

The exponent \(\frac12-\epsilon\) also holds for the error term when averaging a Sobolev functions of order one which is also Lipschitz.

LPS also did numerical experiments in Pascal. They formulated the results in terms of the error exponent,

\[ \alpha = -\frac{1}{\log |\Lambda|}\log err(\Lambda, \text{Cap}) \] where \( err(\Lambda, \text{Cap})= \left| \frac{|\Lambda\cap \text{Cap}|}{|\Lambda|}- \frac{m(\text{Cap})}{m(\S^2)} \right| \) so that \(err(\Lambda, \text{Cap}) = |\Lambda|^{-\alpha}\). Testing for a various choice of caps and initial points \(v_0\in \S^2\), \(\alpha\) ranges between 0.5 and 0.7 for \(\Lambda = S^n_5\), \(n=2\dots11\).

The rotations associated to \(S_5\) are rotations around the standard axis in \(\R^3\) with angle \(\text{acos}(-\frac35)\simeq 2.21\) radians. In different experiments, LPS changed that angle to \(\frac12 \pi\) and \(\frac23 \pi\) radians. For words of size 10, \(\alpha\) comes out to be 0.29 and 0.46.

More recently, in a study on illumination integrals^{21} the *Stolarsky’s invariance principle*^{22} has been employed to measure discrepancy for point sets on the 2-sphere. Consider the following expression

\[ \frac{1}{|\Lambda|^2}\sum_{v,w\in\Lambda}\|v-w\|_2 \]

which sums the distances within the point set. Then the principle states

\[ E(\Lambda)^2:=\int\int\|x-y\|dm(x)dm(y) \;-\; \frac{1}{|\Lambda|^2}\sum_{v,w\in\Lambda}|v-w|_2 = c_2 D_2(\Lambda)^2 \]

which gives a trivial way to measure discrepancy for moderately large sets for which an \(\mathcal{O(n^2)}\) algorithm is still feasible. The expression \(E(\Lambda)\) is described as energy. Note that the first term is a constant, equal to \(\frac{4}{3}\). The other constant in the formula appearing is \(c_2=4\). Moreover, this formula holds for spheres of any dimensions larger than two, so we can directly apply it to the unit quaternions. For \(d=3\), the first term is \(\frac{64}{15\pi}\) and \(c_3=\frac{2}{3\pi}\).

While the energy method works well for small n, we employ Monte Carlo integration to calculate the defining integral of the mean square integral (which involves integral over the 4-sphere). We employ the Super-Fibonacci Spirals^{23}, which is the state-of-the-art method for low discrepancy quaternions if n is known in advance. The discrepancy of that point set is an order of magnitude lower compared to the discrepancies of the random methods studied here. The construction of these spirals is based on replacing the uniform sampler in Marsaglia’s method with equally spaced points on a winding line in the (3-dim) cylinder parametrized by [0,1]x[0,2pi]x[0,2pi].

The following plots show the extremal and mean square spherical discrepancy for up to a million points. Points are produced using various primes p, various integer random number generators and various random walk laws. Some of these will only be discussed later. We observe an error exponent of around 0.5 for all these variations.

## 9. Biased walks to avoid modular arithmetic

It is also possible to obtain equidistribution if the law for the choice among \( S_5 \) is non-uniform. That is, the next edge \(s_i\in S_5\) is chosen with some probability \(p_i\) where \( p_i \neq \frac{1}{|S_5|}\) in the case of the simple random walk. If the probability weights are sufficiently close to equal this follows from perturbation theory.^{24} A theorem more specific to our setup implies that the spectral gap of \(T_5\) implies the spectral gap for any other reasonable averaging measure.^{25} Alternatively, a recent breakthrough of random walks of groups on homogeneous spaces applies as well.^{26}

This allows to use the following algorithm that avoids integer modulo calculation by extending the array for \( S_5 \) by one of size 8, repeating some of the elements (as to make them twice as likely). This optimization will become handy when porting the code to SIMD.

```
quat s5_biased[8]
{ s5*quat{1, 2,0,0},
s5*quat{1,0, 2,0},
s5*quat{1,0,0, 2},
s5*quat{1,-2,0,0},
s5*quat{1,0,-2,0},
s5*quat{1,0,0,-2},
s5*quat{1, 2,0,0},
s5*quat{1,0, 2,0},
};
mat3 random_rotation_biased(quat &state){
u32 a = random_int() & 7; // Draw an integer in [0,7].
state = quat_mult(s5_biased[a], state);
return rot_from_quat(state);
}
```

## 10. Performance and SIMD implementation

A quaternion fits into a 128-bit SIMD register:

```
union mm_quat
{
__m128 mm_q;
quat q;
};
```

Quaternion multiplication in compressed SIMD accumulates to around a dozen SSE instructions.^{27} Quaternion to rotation matrix (and back) is discussed by van Waveren^{28} and available in the Doom 3 BFG Edition GPL source code.^{29} In this approach, we can still use the previous implementation for advancing the random walk.

We measure performance on two tests, “write” and “sample”. In the write test we simply fill a large array of quaternions. The sample test not only produces quaternions but immediately uses them to calculate the discrepancy. We produce \(2^{16}\) many points for each test.

On a AMD Ryzen Zen 4 we count around 14 cycles per write and 100 cycles per sample for the scalar versions. A look at the disassembly shows that the MSVC compiler uses 128-bit SIMD. Marsaglia’s algorithm is implemented both using the rejection algorithm and the polar algorithm for sampling on a disk with a cycle count of 50/100 for write and 70/110 for sample.

The custom SIMD by compression is profiled with 20 cycles resp. 40 cycles.

In contrast, we can employ SIMD instructions by optimizing for throughput, here the 256-bit variant:

```
struct quatx8
{
__m256 r,x,y,z;
quatx8() : r(_mm256_set1_ps(1.)), x(_mm256_set1_ps(0.)),
y(_mm256_set1_ps(0.)), z(_mm256_set1_ps(0.)) {}
};
```

Multiplication and the transform to a rotation matrix is a straight forward translation from the corresponding scalar code. The write test will additional have a struct-of-arrays to arrays-of-struct routine that will shuffle `quatx8 `

to `quat[8]`

.

Instead of using an array, we construct the elements of \(S_5\) as `__m256`

variables in-place using the following function `construct_T5`

.

```
// Encode each element in three bits: s_i = b2b1b0
//b0 = sign.
//b2b1 in {0,1,2,3} where 2 and 3 both get mapped to choice 2
quatx8 construct_T5(__m256i s)
{
const __m256 s5_x8 =_mm256_set1_ps(1./std::sqrt(5.));
const __m256 s5_2_x8 =_mm256_add_ps(s5_x8,s5_x8);
const __m256i one = _mm256_set1_epi32(1);
__m256 a = _mm256_xor_ps(s5_2_x8, _mm256_castsi256_ps(_mm256_slli_epi32(s,31))) ;
s = _mm256_srli_epi32(s, 1);
__m256 x = _mm256_and_ps(a, _mm256_castsi256_ps(_mm256_cmpeq_epi32(s,
_mm256_setzero_si256()))); // == 0?
__m256 y = _mm256_and_ps(a, _mm256_castsi256_ps(_mm256_cmpeq_epi32(s,one))); // == 1?
s = _mm256_srli_epi32(s, 1);
__m256 z = _mm256_and_ps(a, _mm256_castsi256_ps(_mm256_cmpeq_epi32(s,one))); // >= 2?
return { s5_x8, x, y, z };
}
```

While the biased simple random walk is straight forward to translate, we come up with a new scalar version for the non-backtracking random walk avoiding modulus which is missing in AVX2.

We reorder the array as to have the pair \(q, q^{-1}\) next to each other.

Then we set (arbitrarily) the rule that if we were to walk back, we instead repeat the previous choice, that is, walk forward in the same direction.

```
struct no_backtracking_walk_scalar
{
u32 seed;
u32 last;
};
u32 random_int(u32 seed);
u32 step(no_backtracking_walk_scalar *walk)
{
u32 x = rng(walk->seed) & 7;
x = x > 5 ? x-2 : x; // The last two choices are copies.
const u32 forbidden = walk->last & 1 ? walk->last-1 : walk->last+1;
walk->last = forbidden == x ? walk->last : x;
return walk->last;
}
```

Note that there are now two sources of bias. The first is the same from the “no integer arithmetic” optimization where we fill up the array to be of size 8.

But this time, we also wish to keep the non-backtracking property, so if we were to backtrack, we walk forward in the same direction instead. Effectively, this makes repeating the last step twice as likely. The convergence of the associated Markov chain is likely to hold.^{30}

The AVX version could look something like this:

```
struct no_backtracking_walk
{
__m256i seed;
__m256i last;
};
__m256i step(no_backtracking_walk *walk)
{
const __m256i one =_mm256_set1_epi32(1);
const __m256i two =_mm256_add_epi32(one,one);
const __m256i five = _mm256_set1_epi32(5);
const __m256i seven= _mm256_add_epi32(two,five);
const __m256i sign_bit = _mm256_and_si256(walk->last, one);
const __m256i sign_bit_h = _mm256_cmpeq_epi32(sign_bit, one);
const __m256i forbidden = _mm256_blendv_epi8(
_mm256_add_epi32(walk->last, one),
_mm256_sub_epi32(walk->last, one),
sign_bit_h); // flip lsb which encodes sign
__m256i x = _mm256_and_si256(rng(walk->seed), seven);
const __m256i too_big = _mm256_cmpgt_epi32(x, five);
x = _mm256_blendv_epi8(x, _mm256_sub_epi32(x, two), too_big);
const __m256i is_backtracking = _mm256_cmpeq_epi32(x, forbidden);
walk->last = _mm256_blendv_epi8(x, walk->last, is_backtracking);
return walk->last;
}
```

The biased simple random walk flies through with 2 cycles per write and 7 cycles per sample. The write is three times faster than Marsaglia’s algorithm when implemented using the Intel math SSE library for `sincos`

whereas the sample test gives comparable results.

The following plots summarize the data. They also include the discrepancy for the produced set of 65536 quaternions.

## 11. GLSL implementation

For the following we set our task to generate one or more quaternions per pixel per frame for some fictional graphical application.

A compute shader is run for every frame and each kernel invocation accounts for a pixel.

A seed is generated from the pixel location \((x,y)\) and a frame number \(n_f\).

The set of quaternions for a fixed frame \(n_f\) should be equidistributed.

The generated sets for different frames should differ.

Here we chose to measure the time that it takes to calculate the discrepancy of the quaternions produced in each frame. We produce 1024 quaternions per frame per pixel, that is, have a loop of length 1024 in each compute invocation.

There is not a straight forward implementation of the random walk method without an explicit formula for the nth word given a list of choices (other then naive multiplication). Further, random walks of short length starting at the same position will produce the same elements.

To accommodate this difficulty, we first produce \(S_5^m\) the sphere of radius m on the tree generated by \(S_5\) and store it in shared memory. Then the seed is used to multiply

random elements among \(S_5^m\).

Here we use \(m = 4\) (750 elements) and part of the next sphere to fill an array of 1024 quaternions.

```
vec4 quat_mul(vec4 a, vec4 b)
{
return vec4(a.x*b.x - dot(a.yzw, b.yzw),
a.x*b.yzw + b.x*a.yzw + cross(a.yzw, b.yzw));
}
#define s5inv 0.4472135f
#define s5inv2 0.8944271f
vec4 construct_ts(uint u)
{
u = u & 7;
float r = s5inv - (u&1u)*s5inv2;
u >>=1;
float x = u==0 ? s5inv2 : 0;
float y = u==1 ? s5inv2 : 0;
float z = u>=2 ? s5inv2 : 0;
return vec4(r, x , y, z);
}
const uint shared_size = 1024;
shared vec4 tree_root[shared_size];
void rand_quat_init()
{
uint id = gl_LocalInvocationIndex;
uint id5 = id/5;
uint id25 = id5/5;
uint id125 = id25/5;
tree_root[id] = id < 750 ? vec4(1,0,0,0) : construct_ts(0);
uint cur = id125;
tree_root[id] = quat_mul( construct_ts(cur % 6) , tree_root[id]);
memoryBarrierShared(); barrier();
cur = cur+ 4 + id25 % 5;
tree_root[id] = quat_mul(construct_ts(cur % 6) , tree_root[id]);
memoryBarrierShared(); barrier();
cur = cur + 4 + id5 % 5;
tree_root[id] = quat_mul(construct_ts(cur % 6), tree_root[id]);
memoryBarrierShared(); barrier();
cur = cur + 4 + id % 5;
tree_root[id] = quat_mul(construct_ts(cur % 6), tree_root[id]);
memoryBarrierShared(); barrier();
}
vec4 rand_quat(uvec4 seed)
{
seed = pcg4d(seed);
vec4 p = quat_mul( tree_root[ seed.y & shared_size-1], tree_root[seed.x & shared_size-1]);
return p;
}
```

In each loop iteration, we can either multiply with random point on the larger sphere stored in shared memory (“random_walk_s5_4”) or go through it iteratively (“sphere_walk_s5_4”).

```
// Loop over 0<=i<1024
case random_walk_s5_4:
{
seed.x = pcg(seed.x);
q = quat_mul( tree_root[seed.x & shared_size-1], q);
break;
}
case sphere_walk_s5_4:
{
q = quat_mul( tree_root[i & shared_size-1], q);
break;
}
```

We compare these algorithms to the polar version of Marsaglia’s algorithm, the Super-Fibonacci algorithm, and the non-uniform rejection algorithm which samples in the 4-cube and normalizes to one.

Here is a table for the throughput and discrepancies on a Nvidia RTX 4700 for creating 16’777’216 quaternions per frame for 1000 frames in total. “GSample” is measures the number of intersections (in billions) for the discrepancy test per second. The number of intersections is equal to the number of quaternions produced times the number of quaternions used in the Monte Carlo integration (here 512 or 1024 via the Fibonacci method).

Operations from the sum redux in each compute shader invocation to calculate the discrepancy are not measured.

For reference, according to the hardware specification^{31} the theoretical peak performance is 29 TFlops.

We also measure performance for the integrated GPU of a AMD Zen 4 Ryzen CPU on a smaller point set. The compute experiments are run both in OpenGL & GLSL and with Vulkan & GLSL/SPIR-V. The Vulkan program is used to be able to switch between the discrete and integrated GPU. For the discrete GPU, the OpenGL and Vulkan versions give similar results (though the shader compilation differs).

Numbers in parenthesis refer to the standard deviation.

Nvidia RTX 4700 | n=16’777’216 | Zen 4 Ryzen integrated GPU | n=2’097’152 | |

Algorithm | GSample mean/stddev | Discrepancy mean/stddev | GSample mean/stddev | Discrepancy mean/stddev |

Sphere Walk S5_4 | 4045 (49) | 0.000107 (0.000008) | 88 (0.03) | 0.000347 (0.000080) |

Random Walk S5_4 | 3167 (32) | 0.000125 (0.000016) | 67 (0.02) | 0.000308 (0.000048) |

Marsaglia PCG | 2971 (32) | 0.000104 (0.000020) | 43 (0.01) | 0.000316 (0.000062) |

Marsaglia PCG3 | 3295 (33) | 0.000103 (0.000020) | 41 (0.01) | 0.000304 (0.000059) |

Super-Fibonacci | 3167(32) | 0.000009 (0) | 68 (0.02) | 0.000037 (0) |

Non-uniform Rejection | 3126 (31) | 0.009137 (0.000010) | 38 (0.01) | 0.009007 (0.000019) |

The next plot shows the distribution of discrepancy for various seeds (i.e. for different frames). To decipher the axis, it is best to open the image it in a new tab.

The fact that the rejection algorithm does not sample from the uniform distribution can be deduced by observing that the discrepancy stays constant as the point size goes up.

The super Fibonacci algorithm is deterministic and therefore only appears as a Dirac measure.

In contrast to the xorshift and lcg random number generator which we used for the CPU implementation, we employed here recently discovered PCG variants.^{32}

## 11. Normalization and using integers instead of floats

Quaternion multiplication will suffer from floating point drift. It is therefore required to normalize the internal state, at least every few ten thousand calls:

`state = normalize(state);`

Since we are dealing with integer quaternions that are normalized, we may try to make the quaternion multiplication over the integers, doing the norm calculation separately. Of course, we expect an overflow after \( \lfloor\log_{\sqrt{5}}( 2^{31} )\rfloor = 26 \) steps. Hence, we will have to restart the walk at this point. In the following code, we use integer arithmetic for quaternion multiplication.

```
quatz s5i[6] =
{
1+2i, 1+2j, 1+2k, 1-2i, 1-2j, 1-2k
};
template <class RandomWalk>
quat sampler_quaternion_int_5()
{
static RandomWalk walk{};
static quatz q{};
static u32 tt = 26;
static u64 norm = 1;
if (tt == 0)
{
q = {};
tt = 26;
norm = 1;
}
tt--;
q = s5i[walk()] * q;
norm2 *= 5u;
return to_quat_f32(q, (f32)(1./std::sqrt(static_cast<f64>(norm2))));
}
```

Restarting the walk every \(N\) step has the effect that every \(N\)th quaternion in the series is from \(S^N_5\). Thus, the convergence error will be made up from discrepancy errors of the \(N\) different sets \(S^N_5\) contributing each with equal weight \(\frac{1}{N}\).

Consequentially, the first level \(S_5\) will essentially contribute an error of 5% and we cannot expect better. Also note that any bias which was previously allowed might become visible, at least for those \(N\) for which \(|S^N_5| \) is much smaller than the total number of points produced.

This implementation is twice as fast as its scalar version. But it is important to note that the discrepancy will not go below a certain threshold because of the error of short paths just addressed. Hence, it is only suitable for a small number of samples.

We can combine different Hecke operators \(T_p\) and \(T_{p’}\) for two different primes \(p,p’\). For example, consider the following thickening of the previous integer code, where we walk the walk associated to \( p = 5\) but at each step calculate the first sphere on the tree corresponding to \(p= 29\) before continuing. We have to restart the walk slightly earlier in order to adjust for one additional multiplication of a norm \(\sqrt{29}\) quaternion to avoid integer overflow.

## 12. Summary

We studied the random walk of quaternions in order to sample the uniform measure on the sphere of unit quaternions. We discussed several variants of the following algorithm:

Let p be a prime congruent to 1 mod 4. Let \(D(p)\) be all the integer quaternions of length p. Up to some permutations and sign symmetry, this is a collection of p+1 elements, which can be identified through exhaustive search. Normalize these elements and collect them in an array \(S[]\).

```
quat seed = quat{.r = 1, .x = 0 y = 0, z = 0 };
quat rand_quat(quat& seed)
{
u32 u = rand() % (p+1);
return seed = S[u]*seed;
}
```

Using more memory allows to implement non-backtracking versions. Theory shows that also biased random integer generators will lead to convergence of the random walk which allows for more flexible implementations.

The algorithm lends itself to sequential SIMD implementations and fared well in two simple performance tests on the authors machine. Our GPU implementation requires the use of shared memory.

A small number of samples can be generated using integer arithmetic. For a large number of samples, the convergence speed follows the typical \(n^{-0.5}\) asymptotic. The discrepancy is comparable to or better than that of the uniform sampler. It loses its quasi-Monte Carlo (QMC) behavior as compared to taking increasing balls inside the tree generated by \(D(p)\).

Source code for this project can be found on https://github.com/reneruhr/quaternions. It contains four programs (render , discrepancy, compute, compute_vk) that compile on Microsoft’s C++ compiler and a python script for plotting the graphs.

- Graphic Gems III, Shoemake. urot.c ↩︎
- Wiki reference. ↩︎
- https://en.wikipedia.org/wiki/Triple_product#Vector_triple_product ↩︎
- https://community.khronos.org/t/quaternion-functions-for-glsl/50140 ↩︎
- The construction of such measures for Lie groups is due to Hurwitz https://eudml.org/doc/58378 but they are now named after Haar who generalized the constructed to topological groups in https://doi.org/10.2307/1968346. (Links shared only for historic curiosity.) ↩︎
- Correct is
*any measurable set*on which \(\mu\)’s domain is restricted to exclude certain paradoxical properties as a later footnote shows. ↩︎ - http://www.graphicsgems.org/ ↩︎
- As observed in Graphics Gems III, the algorithm in Graphics Gems II, Section VII.7 does not sample the uniform measure. ↩︎
- A variant of this algorithm was recently studied in Sampling Rotation Groups by Successive Orthogonal Images ↩︎
- Choosing a Point from the Surface of a Sphere ↩︎
- Graphic Gems III, Shoemake. urot.c ↩︎
- See Hurwitz’ lecture notes Vorlesungen über die Zahlentheorie der Quaternionen on integer quaternions which are quite accessible. ↩︎
- A construction of Banach and Tarski uses the free group for a seemingly paradoxical reconstruction of sets. See Lubotzky’s book below. ↩︎
- Corollary 2.1.11 in Lubotzky’s book Discrete Groups, Expanding Graphs and Invariant Measures. ↩︎
- Hecke operators and distributing points on the sphere I and Hecke operators and distributing points on S2. II. See also these Bourbaki seminar notes for a concise summary. ↩︎
- Symmetric random walks on groups. There is no trivial eigenvalue \( 1 \) as with finite graphs. Kesten shows that 1 is contained in the spectrum of the averaging operator of a Cayley graph if and only if the group is amenable. ↩︎
- The first paper
^{8}actually only talks about the action of the rotation group on the 2-sphere. The book of Sarnak, Some Applications of Modular Forms discusses how to upgrade the result. The second paper^{8}introduces the “Lattice method”, the approach taking by the book by Lubotzky which is extremely general, see e.g. Hecke operators and equidistribution of Hecke points.

↩︎ - Ellenberg, Michel, Venkatesh and a follow-up study on the large deviations principle in this situation by Khayutin ↩︎
- Breiman: The Strong Law of Large Numbers for a Class of Markov Chains

See also Section 3.2 Benoist, Quint: Random walks on reductive groups.

↩︎ - Chapter 3: Dolgopyat, Sarig: Local Limit Theorems for Inhomogeneous Markov Chains ↩︎
- See Spherical Fibonacci Point Sets for Illumination Integrals and a related book Efficient Quadrature Rules for Illumination Integrals references therein. The spherical Fibonacci point set seems to be the state of the art construction for low discrepancy point sets on the sphere. ↩︎
- Stolarsky, Theorem 2 in Sums of distances between points on a sphere. II. Formulas for the involved constants can be found in QMC designs: Optimal order Quasi Monte Carlo integration schemes on the sphere ↩︎
- Alexa: Super-Fibonacci Spirals: Fast, Low-Discrepancy Sampling of SO(3) ↩︎
- Section 3.3 Lecture notes of Sarig ↩︎
- Reasonable here is means not supported on a coset of a proper subgroup of \(\Gamma\), the free subgroup generated by \(S_5\). See Sharp ergodic theorems for group actions and strong ergodicity, combining Theorem 1.6 and Theorem 6.3. ↩︎
- Benoist and Quint Stationary measures and invariant subsets of homogeneous spaces (III) ↩︎
- Stackoverflow and Agner Fog: Extension to vector class library ↩︎
- https://mrelusive.com/publications/papers/SIMD-From-Quaternion-to-Matrix-and-Back.pdf ↩︎
- https://github.com/id-Software/DOOM-3-BFG/blob/1caba1979589971b5ed44e315d9ead30b278d8b4/neo/idlib/math/Simd_SSE.cpp ↩︎
- While the theorem
^{13}of Benoist & Quint only holds for the simple, possibly biased, random walks, this paper generalizes some cases to the Markov chain situation required by the non-backtracking walk. Their setup does not cover the S-adic case but a generalization seems feasible. ↩︎ - https://www.techpowerup.com/gpu-specs/geforce-rtx-4070.c3924 ↩︎
- https://www.pcg-random.org/ and Hash Functions for GPU Rendering ↩︎

## Leave a Reply