Random walk method for quaternions

The random walk method on the space of unit quaternions produces uniformly distributed quaternions without trigonometric functions. We study some performance and quality properties. Implementations in SIMD and GLSL are provided.

1. Quaternions as rotations

Consider the space of quaternions

q=r+xi+yj+zkH1(R)

where x,y,zR. The vectors i,j,k represent the imaginary units satisfying the multiplication laws

i2=j2=k2=1,ijk=1.

By linearity, these rules define a multiplication among quaternions. Using the norm of R4, the unit quaternions H1(R) are those with

q2=r2+x2+y2+z2=1
and thus define elements on the 3-sphere S3.

Since for two quaternions q1,q2 we have

q1q2=q1q2,
the unit quaternions are closed under multiplication and thus provide S3 with a group structure. This multiplication can also be obtained by using the matrix group of SU(2) under the following isomorphism:

m:H1(R)SU(2)q=r+xi+yj+zkmq=(r+ixy+izy+izrix).

Note that

detmq=q2=1.

This isomorphism lets us understand how quaternions act on two complex planes C2.

The common algorithm1 for producing random quaternions using these planes is

quat rand_quat()
{
	float phi   = rand_float(0,2*pi);
	float theta = rand_float(0,2*pi);

	float z = rand_float(0,1);
	float r = sqrt(z);
	float t = sqrt(1.f-z);
	return {r*cos(phi), r*sin(phi), t*cos(theta), t*sin(theta)};
}

We shall refer to this algorithm (and its variations) to either the polar method or Marsaglia’s algorithm.

Quaternions are useful because of the two-to-one representation SU(2)SO(3). This representation can be obtained from the Adjoint representation where SU(2) acts by conjugation on its own Lie algebra. The Lie algebra su2 of SU(2) is the vector space (over R) of complex 2×2 matrices m=(aij) with zero trace (a11=a22) that are skew-Hermitian (aij=aji¯¯¯¯¯¯). It is spanned by the three vectors

(ii),(11),(ii),

which are the images of i, j and k under the map given before.

Identify R3 with the set of pure quaternions v=xi+yj+zk. Then conjugation by a unit quaternion qH1(R)

vqvq1=qvq¯¯
where
q1=q¯¯=rxiyjzkqH1(R)
is equivalent to the 2×2 matrix multiplication
mqvq¯=mqmvm1q=mqmvmq
where mq is the complex conjugate of a matrix, which is by definition the inverse of a unitary matrix mqSU(2). We know that this action of H1(R) on R3 must be a rotation since conjugation by a quaternion preserves the norm.

With respect to this basis i,j,k this action has the matrix form

Rq=12y22z22xy+2rz2xz2ry2xy2rz12x22z22yz+2rx2xz+2ry2yz2rx12x22y2,

which is the usual formula given to convert a quaternion into a rotation.2

It is useful to have a complete multiplication rule in terms of the scalar real and the 3-dimensional imaginary part of a quaternion: Let q1=r1+v1 and q2=r2+v2 then

q1q2=(r1r2v1,v2)+(r1v2+r2v1+v1×v2).

where the angle brackets denote the 3-dimensional inner product and the multiplication sign denotes the cross product. If q2 is pure then q1q2=v1,v2+r1v2+v1×v2. Multiplying by q1¯¯¯¯¯ from the right, we know that the real part will cancel out and we collect only the imaginary part:

q1q2q1¯¯¯¯¯=v1,v2(v1)+r1(r1v2+v1×v2)+(r1v2+v1×v2)×(v1)

=v1,v2v1+r21v2+v1×(2r1v2+v1×v2).

Substitute r21=1v1,v1, we can apply the vector triple product formula,3 v1,v2v1v1,v1v2=v1×(v1×v2) to get

=v2+2v1×(r1v2+v1×v2).

This results in commonly known GLSL code for quaternion-vector multiplication.4

vec3 quat_mult(vec4 q, vec3 v)
{
   return v + 2*cross(q.yzw, (q.x*v + cross(q.yzw, v));
}

2. Random quaternions

Our goal is to produce a random unit quaternion or a random rotation gSO(3). Here the notation of random implies that we have picked a measure μ=μSO(3) (that is, a “volume” or “area” function) on the group of rotation, namely the uniform measure, also known as Haar measure.5

Up to normalization, it is the unique translation invariant measure: For any6 subset ESO(3), for any gSO(3),

μ(gE)=μ(E)
where gE={gh:hE}.
Equivalently, we can use the uniform measure ν=νS3 of the 3-sphere (which is also a Haar measure thanks to the multiplication structure of the unit quaternions) and its push forward measure on SO(3) defined by μ(E)=ν(R1E) where R1E denotes the preimage of ESO(3) under qRq. This must be equal to μ (up to normalization) by equivariance of the map qRq, that is, Rq2q1=Rq2Rq1 and the requirement that the measures are translation invariant.

Algorithms which produce random rotations are discussed in the Graphics Gems series,7 II and III by Arvo and Shoemake.8 These include the Box-Muller method to first produce a Gaussian random variable in 4 dimensions. Since it is rotation invariant, all is left is to normalize. The other proposed method is the subgroup algorithm where the fact is used that SO(2)SO(3) and the quotient is SO(3)/SO(2)S2. The algorithm combines then a random generator of the 2-dimensional unit sphere (a random axis of rotation) with one on the 1-dimensional unit circle (a random angle of rotation).9
We note that the rejection method in which we sample the 4-dimensional cube [1,1]4 until we find a vector that lies inside the unit ball which we then normalize to lie on the unit sphere, becomes already quite inefficient: only 12π2240.308 of the samples are accepted. Marsaglia10 also provides a method in which we sample twice from the unit disk via rejection sampling which succeeds with probability π22π220.617 which improves the acceptance rate by a factor of two. Replacing the rejection method with the polar method results in the previously given rand_quat() function. One also arrives to rand_quat() from the subgroup algorithm using a variable substitution.1

3. Integer quaternions and the tree structure

Quaternions already make sense over the integers,12 and a fundamental fact for the method discussed in this post.

The group SO(3) has the interesting property that it contains many trees. To begin with, recall that the free group of rank s is obtained by considering all possible words built from a fixed set S of generators a1,,as and their inverses. For s=2, we may write the generators as a,b with inverses a1,b1 which are defined by aa1=e=bb1 where e denotes the identity element and represents the “empty word”. It is the simplest non-commutative group. “Simple” here refers to the lack of any other relation between the generators. Given a group with a set of generators, one can consider the associated Cayley graph. The vertex set is the group. Edge relations between two elements are defined if the two elements differ by a translation from the generator set or their inverses. Hence, inside a free group of rank s, every element has 2s neighbors. The lack of relations implies that there are no closed paths inside the graph. Hence, the Cayley graph of a free group of rank s is a 2s-regular tree. The group of rotations in 3 dimensions or larger contains a free group.13

The following quaternions give rise to a free group of rank 3 inside SO(3):

S5={1±2i,1±2j,1±2k}.

If we want to write them as unit quaternions, we have to divide by their norm 5.
The associated rotations are

15100034043,15304010403,15340430001

and their inverses. In fact, for any prime p which satisfies p mod 4=1 there exist exactly 2(p+1) many integer quaternions of norm p that will generate a free group of rank p+1 when considering their representation as SO(3). The next primes after p=5 satisfying this congruence condition are 13 and 17. The integer quaternions of norm 13 and 17 are

S13={1±i±j±k,3±2i,3±2j,3±2k}

and

S17={1±4i,1±4j,1±4k,3±2i±2j,3±2j±2k,3±2k±2i}.

A proof that these generate free groups can be found in Lubotzky’s book.14

4. Random walk

The simple random walk on the graph generated by S5={1±2i,1±2j,1±2k}

starting at the identity element which is the root of the tree G5, picks a next neighbor at every step, and the step itself is done by quaternion multiplication. At time n we are at the vertex defined by the word g0,,gn where gi=sjiS5 is randomly picked among all the elements S5.

Here is a pseudo C++ implementation:

float s5 = 1.f/sqrt(5.f);
quat s[6] 
{ 			s5*quat{1, 2,0,0},
			s5*quat{1,0, 2,0},
			s5*quat{1,0,0, 2},
			s5*quat{1,-2,0,0},
			s5*quat{1,0,-2,0},
			s5*quat{1,0,0,-2},
}; 

mat3 random_rotation(quat &state)
{
   u32 a = uniform_int(0,5);        // Draw an integer in [0,5].
   state = quat_mult(s[a], state);  // Update the internal state q.
   return rot_from_quat(state);     // Return the rotation matrix R_q.
}

5. Convolution and spectral gap

It was shown by Lubotzky, Phillips and Sarnak15 (LPS) that the convolution operator associated to these elements has the strongest possible spectral gap possible.
To explain what this means, we pick out the tree Gp associated to Sp. To study the spectral properties of this graph, one considers the space of functions l2(Gp) on Gp that are square-integrable with respect to the (infinite) counting measure. We can consider the normalized adjacency matrix as convolution operator Tp on this space:

Tpf(g)=1|Sp|hSpf(hg).
By a theorem of Kesten16 about random walks on free groups, the spectral radius of this operator is 2pp+1.
LPS show that if we consider the average operator Tp on L20(SO(3),μ) instead, the space of square integrable functions with respect to the uniform measure μ which are orthogonal to the constant function (equivalently, have integral zero), then the operator still has the same spectral radius:

λ1=TpOp=2pp+1.

Here one restricts to functions orthogonal to the constant function g1 as to discard the eigenvalue 1. The “spectral gap” alluded to in the heading is the difference between this trivial eigenvalue 1 and λ1. The larger this gap, the quicker will Tnpf converge to μ(f), which is a result about equidistribution: For any square-integrable f on SO(3),

Tnpfμ(f)=O(pn2).

The L2 bounds on the Hecke operators imply analogous bounds for the L norm.

Thus, there exists κ>0 such that for every g, every Lipschitz function f on SO(3),

|Tnp(f)(g)μ(f)|=Of(pκn)

From here one can also deduce a discrepancy result: Let

Snp= words of length n,

equivalently, the sphere of radius n inside the tree Gp.

There exist some exponent κ>0 such that for any nice set ESO(3) it holds

|ESnp||Snp|μ(E)μ(SO(3))=O(pnκ).

The theorems of LPS show that the balls and spheres in tree Gp define a low-discrepancy sequence.17 They have the disadvantage that the sequences one can generate grow exponentially, so that a given number might be hard to approximate as it falls in between these cardinalities. We now give up on the low-discrepancy property in favor for an algorithm that can produce an arbitrary number of points by producing a random sequence of elements in Sp.

Here are images of the spheres of radius 5, 6, and the whole tree up to radius 8.

For each quaternion, we plotted three points, the images of the rotation action under the standard basis.

For comparison, here is the polar method for the same number of points:

6. Nonbacktracking random walks

The dynamics related to Tnp is the simple random walk. Averaging over Snp is the non-backtracking random walk. Note that for the non-backtracking walk, we start with at a point g0 on SO(3), and move with equal probability to any of the 6 neighbors Rs1g for s1S5={1±2i,1±2j,1±2k}.
For the next move, we restrict to s2s11, the non-backtracking condition. This will hold for all future times n+1 as well: sn+1s1n. The set of non-backtracking paths of length n starting at g0 is Sn5g0 using the tree sphere introduced in the previous section.

The non-backtracking walk will have to save the last step in its internal state. Using the particular ordering in the array, if the last step was k, then k+3mod6 corresponds to a step back. Hence we can choose any element in [k+3+1,k+3+5]mod6.

struct internal_state { quat q; u32 last_choice; };
mat3 random_rotation_nonbacktracking(internal_state &state)
{
   u32 a = ( state.last_choice + 4 +  uniform_int(0,4)) % 6;
   state.q = quat_mult(s[a], state.q);
   state.last_choice = a;
   return rot_from_quat(state.q);
}

We omitted the seed for the uniform integer generator which should also be part of the internal_state struct.

7. Equidistribution of the the random walk

As suggested more recently,18 the spectral gap of the operator T5 (also called Hecke operator) implies equidistribution of the (non-backtracking) random walk.

The large deviation principle states that the fraction of walks γ=(s1g0,s2s1g0,,sns1g0) of length n that do not equidistribute, that is, given a set ESO(3) and any ϵ>0

|γE|nμ(E)μ(SO(3))>ϵ

decays exponentially as n.

The law of large numbers (LLN) can be phrased as follows:
Let f:SO(3)R any continuous function. Then for any gSO(3),

1n(f(s1g)+f(s2s1g)++f(snsn1s1g))μ(f).

We wish to show the law of large numbers for sum of random variables f(γkγ1g):

Yn(γ1,,γn):=1n(f(γ1g)+f(γ2γ1g)++f(γnγ1g))

where γi is chosen uniformly from Sp{γi} (non-backtracking random walk)

or uniformly among Sp (simple random walk).

For brevity, we restrict now to the simple random walk.

Let ν be the uniform measure on Sp. This is the law of the walk.

The space of all paths is SNp carrying the Bernoulli measure β=νN.

The expectation E of Yn with respect to β is

E[nYn]=1(p+1)nγ1,,γnYn(γ1,,γn)=Tf(g)+T2f(g)++Tnf(g).

Applying the above operator bound from the spectral gap discussion,

E[Yn]=μ(f)+O(n1)

since the error term sum to a geometric series bounded by a constant.

Analogously one shows that the variance behaves like Var[Yn]=(μ(f2)μ(f)2)n1+O(n2),

using that if k> then Tkp(f)Tp(f)=Tp(fTkf).

The pointwise convergence of Tnp implies that for any continuous function f, and any Tp-invariant probability measure on μ on SO(3), we have μ(f)=μ(f).

It follows from Breiman’s LLT for Markov chains,19 which applies to both the simple random walk and the non-back tracking walk, that

Ynμ(f).

Using these bounds it is also possible to deduce a central limit theorem.20

8. Estimating the discrepancy

The results in LPS refer to the following setup.

Pick a point v0S2 and study their images of a set {g1,gn} of rotations. Let m=mS2 denote the uniform measure on S2.

In LPS, the spherical cap discrepancy is employed. A spherical cap is determined by a unit vector wS2 and a number ω to define the set Capw,ω={vS2:v,w<ω} where angle brackets denote the inner product of R3.

Given a point set Λ the extremal spherical cap discrepancy is

D(Λ)=supCapw,ω|ΛCapw,ω||Λ|m(Capw,ω)m(S2).

An analytically easier discrepancy is the mean square spherical cap discrepancy

D2(Λ)=(|ΛCapw,ω||Λ|m(Capw,ω)m(S2)2dν(w,ω))12,

where ν=mS2×m[1,1] is a measure on the set of caps using the natural parametrization.

LPS prove

D(Snp)=O(|Snp|13+ϵ)
where I used an ϵ to absorb the log factors. They conjecture that the true exponent is 12ϵ, which indeed holds for the mean square discrepancy:

D2(Snp)=O(|Snp|12+ϵ).

The exponent 12ϵ also holds for the error term when averaging a Sobolev functions of order one which is also Lipschitz.

LPS also did numerical experiments in Pascal. They formulated the results in terms of the error exponent,

α=1log|Λ|logerr(Λ,Cap)
where err(Λ,Cap)=|ΛCap||Λ|m(Cap)m(S2) so that err(Λ,Cap)=|Λ|α. Testing for a various choice of caps and initial points v0S2, α ranges between 0.5 and 0.7 for Λ=Sn5, n=211.

The rotations associated to S5 are rotations around the standard axis in R3 with angle acos(35)2.21 radians. In different experiments, LPS changed that angle to 12π and 23π radians. For words of size 10, α comes out to be 0.29 and 0.46.

More recently, in a study on illumination integrals21 the Stolarsky’s invariance principle22 has been employed to measure discrepancy for point sets on the 2-sphere. Consider the following expression

1|Λ|2v,wΛvw2

which sums the distances within the point set. Then the principle states

E(Λ)2:=xydm(x)dm(y)1|Λ|2v,wΛ|vw|2=c2D2(Λ)2

which gives a trivial way to measure discrepancy for moderately large sets for which an O(n2) algorithm is still feasible. The expression E(Λ) is described as energy. Note that the first term is a constant, equal to 43. The other constant in the formula appearing is c2=4. Moreover, this formula holds for spheres of any dimensions larger than two, so we can directly apply it to the unit quaternions. For d=3, the first term is 6415π and c3=23π.

While the energy method works well for small n, we employ Monte Carlo integration to calculate the defining integral of the mean square integral (which involves integral over the 4-sphere). We employ the Super-Fibonacci Spirals23, which is the state-of-the-art method for low discrepancy quaternions if n is known in advance. The discrepancy of that point set is an order of magnitude lower compared to the discrepancies of the random methods studied here.

The following plots show the extremal and mean square spherical discrepancy for up to a million points. Points are produced using various primes p, various integer random number generators and various random walk laws. Some of these will only be discussed later. We observe an error exponent of around 0.5 for all these variations.

9. Biased walks to avoid modular arithmetic

It is also possible to obtain equidistribution if the law for the choice among S5 is non-uniform. That is, the next edge siS5 is chosen with some probability pi where pi1|S5| in the case of the simple random walk. If the probability weights are sufficiently close to equal this follows from perturbation theory.24 Without this restriction, we can also apply a recent breakthrough of random walks of groups on homogeneous spaces.25 This allows to use the following algorithm that avoids integer modulo calculation by extending the array for S5 by one of size 8, repeating some of the elements (as to make them twice as likely). This optimization will become handy when porting the code to SIMD.

quat s5_biased[8] 
{ 			s5*quat{1, 2,0,0},
			s5*quat{1,0, 2,0},
			s5*quat{1,0,0, 2},
			s5*quat{1,-2,0,0},
			s5*quat{1,0,-2,0},
			s5*quat{1,0,0,-2},
                        s5*quat{1, 2,0,0},
                        s5*quat{1,0, 2,0},
}; 

mat3 random_rotation_biased(quat &state){
   u32 a = random_int() & 7;        // Draw an integer in [0,7].
   state = quat_mult(s5_biased[a], state); 
   return rot_from_quat(state);
}

10. Performance and SIMD implementation

A quaternion fits into a 128-bit SIMD register:

union mm_quat
{
	__m128 mm_q;
	quat q;
};

Quaternion multiplication in compressed SIMD accumulates to around a dozen SSE instructions.26 Quaternion to rotation matrix (and back) is discussed by van Waveren27 and available in the Doom 3 BFG Edition GPL source code.28 In this approach, we can still use the previous implementation for advancing the random walk.

We measure performance on two tests, “write” and “sample”. In the write test we simply fill a large array of quaternions. The sample test not only produces quaternions but immediately uses them to calculate the discrepancy. We produce 216 many points for each test.

On a AMD Ryzen Zen 4 we count around 14 cycles per write and 100 cycles per sample for the scalar versions. A look at the disassembly shows that the MSVC compiler uses 128-bit SIMD. Marsaglia’s algorithm is implemented both using the rejection algorithm and the polar algorithm for sampling on a disk with a cycle count of 50/100 for write and 70/110 for sample.

The custom SIMD by compression is profiled with 20 cycles resp. 40 cycles.

In contrast, we can employ SIMD instructions by optimizing for throughput, here the 256-bit variant:

struct quatx8
{
	__m256 r,x,y,z;
	quatx8() : r(_mm256_set1_ps(1.)), x(_mm256_set1_ps(0.)), 
                   y(_mm256_set1_ps(0.)), z(_mm256_set1_ps(0.)) {}
};

Multiplication and the transform to a rotation matrix is a straight forward translation from the corresponding scalar code. The write test will additional have a struct-of-arrays to arrays-of-struct routine that will shuffle quatx8 to quat[8].

Instead of using an array, we construct the elements of S5 as __m256 variables in-place using the following function construct_T5.


// Encode each element in three bits: s_i = b2b1b0 
//b0 = sign. 
//b2b1 in {0,1,2,3} where 2 and 3 both get mapped to choice 2
quatx8 construct_T5(__m256i s)
{
   const __m256 s5_x8 =_mm256_set1_ps(1./std::sqrt(5.));
   const __m256 s5_2_x8 =_mm256_add_ps(s5_x8,s5_x8);
   const __m256i one = _mm256_set1_epi32(1);
   
   __m256 a = _mm256_xor_ps(s5_2_x8, _mm256_castsi256_ps(_mm256_slli_epi32(s,31))) ; 
   s = _mm256_srli_epi32(s, 1);
   __m256 x = _mm256_and_ps(a, _mm256_castsi256_ps(_mm256_cmpeq_epi32(s,
                                                         _mm256_setzero_si256())));   // == 0?
   __m256 y = _mm256_and_ps(a, _mm256_castsi256_ps(_mm256_cmpeq_epi32(s,one)));   // == 1? 
   s = _mm256_srli_epi32(s, 1);
   __m256 z = _mm256_and_ps(a, _mm256_castsi256_ps(_mm256_cmpeq_epi32(s,one)));   // >= 2?

   return { s5_x8, x, y, z };
}

While the biased simple random walk is straight forward to translate, we come up with a new scalar version for the non-backtracking random walk avoiding modulus which is missing in AVX2.

We reorder the array as to have the pair q,q1 next to each other.

Then we set (arbitrarily) the rule that if we were to walk back, we instead repeat the previous choice, that is, walk forward in the same direction.

struct no_backtracking_walk_scalar
{
	u32 seed;
	u32 last;
};

u32 random_int(u32 seed);

u32 step(no_backtracking_walk_scalar *walk)
{
   u32 x = rng(walk->seed) & 7;
   x = x > 5 ? x-2 : x;  // The last two choices are copies.
   const u32 forbidden = walk->last & 1 ? walk->last-1 : walk->last+1;      
   walk->last = forbidden == x ? walk->last : x;               
   return walk->last;
}

Note that there are now two sources of bias. The first is the same from the “no integer arithmetic” optimization where we fill up the array to be of size 8.

But this time, we also wish to keep the non-backtracking property, so if we were to backtrack, we walk forward in the same direction instead. Effectively, this makes repeating the last step twice as likely. The convergence of the associated Markov chain is likely to hold.29

The AVX version could look something like this:

struct no_backtracking_walk
{
   __m256i seed;
   __m256i last;
};

__m256i step(no_backtracking_walk *walk)
{
   const __m256i one =_mm256_set1_epi32(1);
   const __m256i two =_mm256_add_epi32(one,one);
   const __m256i five = _mm256_set1_epi32(5);
   const __m256i seven= _mm256_add_epi32(two,five);

   const __m256i sign_bit = _mm256_and_si256(walk->last, one);
   const __m256i sign_bit_h = _mm256_cmpeq_epi32(sign_bit, one);
   const __m256i forbidden  = _mm256_blendv_epi8(
                              _mm256_add_epi32(walk->last, one),
                              _mm256_sub_epi32(walk->last, one), 
                              sign_bit_h); // flip lsb which encodes sign

   __m256i x = _mm256_and_si256(rng(walk->seed), seven);
   const __m256i too_big = _mm256_cmpgt_epi32(x, five);
   x = _mm256_blendv_epi8(x, _mm256_sub_epi32(x, two), too_big);
   const __m256i is_backtracking = _mm256_cmpeq_epi32(x, forbidden);
   walk->last = _mm256_blendv_epi8(x, walk->last, is_backtracking);
   return walk->last;
}

The biased simple random walk flies through with 2 cycles per write and 7 cycles per sample. The write is three times faster than Marsaglia’s algorithm when implemented using the Intel math SSE library for sincos whereas the sample test gives comparable results.

The following plots summarize the data. They also include the discrepancy for the produced set of 65536 quaternions.

11. GLSL implementation

For the following we set our task to generate one or more quaternions per pixel per frame for some fictional graphical application.

A compute shader is run for every frame and each kernel invocation accounts for a pixel.

A seed is generated from the pixel location (x,y) and a frame number nf.

The set of quaternions for a fixed frame nf should be equidistributed.

The generated sets for different frames should differ.

Here we chose to measure the time that it takes to calculate the discrepancy of the quaternions produced in each frame. We produce 1024 quaternions per frame, that is, have a loop of length 1024 in each compute invocation.

There is not a straight forward implementation of the random walk method without an explicit formula for the nth word given a list of choices (other then naive multiplication). Further, random walks of short length starting at the same position will produce the same elements.

To accommodate this difficulty, we first produce Sm5 the sphere of radius m on the tree generated by S5 and store it in shared memory. Then the seed is used to multiply

random elements among Sm5.

Here we use m=4 (750 elements) and part of the next sphere to fill an array of 1024 quaternions.

vec4 quat_mul(vec4 a, vec4 b)
{
	return vec4(a.x*b.x   - dot(a.yzw, b.yzw), 
                   a.x*b.yzw + b.x*a.yzw + cross(a.yzw, b.yzw));
}

#define s5inv   0.4472135f
#define s5inv2  0.8944271f
vec4 construct_ts(uint u)
{
	u = u & 7;
	float r = s5inv - (u&1u)*s5inv2;
	u >>=1;
	float x = u==0 ? s5inv2 : 0;
	float y = u==1 ? s5inv2 : 0;
	float z = u>=2 ? s5inv2 : 0;
	return vec4(r, x , y, z);
}

const uint shared_size = 1024;
shared vec4 tree_root[shared_size];

void rand_quat_init()
{
        uint id = gl_LocalInvocationIndex;
	uint id5 = id/5;
	uint id25 = id5/5;
	uint id125 = id25/5;

	tree_root[id] =  id < 750 ? vec4(1,0,0,0) : construct_ts(0);

	uint cur = id125;
	tree_root[id] = quat_mul( construct_ts(cur % 6) , tree_root[id]);
	memoryBarrierShared(); barrier();

	cur = cur+ 4 + id25 % 5;
	tree_root[id] = quat_mul(construct_ts(cur % 6) , tree_root[id]);
	memoryBarrierShared(); barrier();

	cur = cur + 4 + id5 % 5;
	tree_root[id] = quat_mul(construct_ts(cur % 6), tree_root[id]);
	memoryBarrierShared(); barrier();

	cur = cur + 4 + id % 5;
	tree_root[id] = quat_mul(construct_ts(cur % 6), tree_root[id]);
	memoryBarrierShared(); barrier();
}

vec4 rand_quat(uvec4 seed)
{
	seed = pcg4d(seed);
	vec4 p = quat_mul( tree_root[ seed.y  & shared_size-1], tree_root[seed.x & shared_size-1]);
        return p;
}

In each loop iteration, we can either multiply with random point on the larger sphere stored in shared memory (“random_walk_s5_4”) or go through it iteratively (“sphere_walk_s5_4”).

// Loop over 0<=i<1024

case random_walk_s5_4:
	{
	seed.x = pcg(seed.x);
	q = quat_mul( tree_root[seed.x & shared_size-1], q);
	break;
        }
case sphere_walk_s5_4:
	{
	q = quat_mul( tree_root[i & shared_size-1], q);
	break;
	} 

We compare these algorithms to the polar version of Marsaglias algorithm, the Super-Fibonacci algorithm, and the non-uniform rejection algorithm which samples in the 4-cube and normalizes to one.

Here is a table for the throughput and discrepancies on a Nvidia RTX 4700 for creating 16’777’216 quaternions per frame for 1000 frames in total. “GSample” is measures the number of intersections (in billions) for the discrepancy test per second. The number of intersections is equal to the number of quaternions produced times the number of quaternions used in the Monte Carlo integration (here 512 or 1024 via the Fibonacci method).

Operations from the sum redux in each compute shader invocation to calculate the discrepancy are not measured.

For reference, according to the hardware specification30 the theoretical peak performance is 29 TFlops.

We also measure performance for the integrated GPU of a AMD Zen 4 Ryzen CPU on a smaller point set. The compute experiments are run both in OpenGL & GLSL and with Vulkan & GLSL/SPIR-V. The Vulkan program is used to be able to switch between the discrete and integrated GPU. For the discrete GPU, the OpenGL and Vulkan versions give similar results (though the shader compilation differs).

Numbers in parenthesis refer to the standard deviation.

Nvidia RTX 4700n=16’777’216 Zen 4 Ryzen integrated GPU n=2’097’152
AlgorithmGSample mean/stddevDiscrepancy mean/stddevGSample mean/stddevDiscrepancy mean/stddev
Sphere Walk S5_44045 (49)0.000107 (0.000008)88 (0.03)0.000347 (0.000080)
Random Walk S5_43167 (32)0.000125 (0.000016)67 (0.02)0.000308 (0.000048)
Marsaglia PCG2971 (32)0.000104 (0.000020)43 (0.01)0.000316 (0.000062)
Marsaglia PCG33295 (33)0.000103 (0.000020)41 (0.01)0.000304 (0.000059)
Super-Fibonacci3167(32)0.000009 (0)68 (0.02)0.000037 (0)
Non-uniform Rejection3126 (31)0.009137 (0.000010)38 (0.01)0.009007 (0.000019)

The next plot shows the distribution of discrepancy for various seeds (i.e. for different frames). To decipher the axis, it is best to open the image it in a new tab.

The fact that the rejection algorithm does not sample from the uniform distribution can be deduced by observing that the discrepancy stays constant as the point size goes up.

In contrast, the super Fibonacci algorithm is deterministic and therefore only appears as a Dirac measure.

In contrast to the xorshift and lcg random number generator which we used for the CPU implementation, we employed here recently discovered PCG variants.31

11. Normalization and using integers instead of floats

Quaternion multiplication will suffer from floating point drift. It is therefore required to normalize the internal state, at least every few ten thousand calls:

state = normalize(state);

Since we are dealing with integer quaternions that are normalized, we may try to make the quaternion multiplication over the integers, doing the norm calculation separately. Of course, we expect an overflow after log5(231)=26 steps. Hence, we will have to restart the walk at this point. In the following code, we use integer arithmetic for quaternion multiplication.

quatz s5i[6] = 
{ 
  1+2i, 1+2j, 1+2k, 1-2i, 1-2j, 1-2k 
};

template <class RandomWalk> 
quat sampler_quaternion_int_5()
{
	static RandomWalk walk{};
	static quatz q{};
	static u32 tt = 26;
	static u64 norm = 1;
	if (tt == 0)
	{
		q = {};
		tt = 26;
		norm = 1;
	}
	tt--;
	q = s5i[walk()] * q;
	norm2 *= 5u;
	return to_quat_f32(q, (f32)(1./std::sqrt(static_cast<f64>(norm2))));
}

Restarting the walk every N step has the effect that every Nth quaternion in the series is from SN5. Thus, the convergence error will be made up from discrepancy errors of the N different sets SN5 contributing each with equal weight 1N.

Consequentially, the first level S5 will essentially contribute an error of 5% and we cannot expect better. Also note that any bias which was previously allowed might become visible, at least for those N for which |SN5| is much smaller than the total number of points produced.

This implementation is twice as fast as its scalar version. But it is important to note that the discrepancy will not go below a certain threshold because of the error of short paths just addressed. Hence, it is only suitable for a small number of samples.

We can combine different Hecke operators Tp and Tp for two different primes p,p. For example, consider the following thickening of the previous integer code, where we walk the walk associated to p=5 but at each step calculate the first sphere on the tree corresponding to p=29 before continuing. We have to restart the walk slightly earlier in order to adjust for one additional multiplication of a norm 29 quaternion to avoid integer overflow.

Source code for this project can be found on https://github.com/reneruhr/quaternions. It contains four programs (render , discrepancy, compute, compute_vk) that compile on Microsoft’s C++ compiler and a python script for plotting the graphs.

  1. Graphic Gems III, Shoemake. urot.c ↩︎
  2. Wiki reference. ↩︎
  3. https://en.wikipedia.org/wiki/Triple_product#Vector_triple_product ↩︎
  4. https://community.khronos.org/t/quaternion-functions-for-glsl/50140 ↩︎
  5. The construction of such measures for Lie groups is due to Hurwitz https://eudml.org/doc/58378 but they are now named after Haar who generalized the constructed to topological groups in https://doi.org/10.2307/1968346. (Links shared only for historic curiosity.) ↩︎
  6. Correct is any measurable set on which μ’s domain is restricted to exclude certain paradoxical properties as a later footnote shows. ↩︎
  7. http://www.graphicsgems.org/ ↩︎
  8. As observed in Graphics Gems III, the algorithm in Graphics Gems II, Section VII.7 does not sample the uniform measure. ↩︎
  9. A variant of this algorithm was recently studied in Sampling Rotation Groups by Successive Orthogonal Images ↩︎
  10. Choosing a Point from the Surface of a Sphere ↩︎
  11. Graphic Gems III, Shoemake. urot.c ↩︎
  12. See Hurwitz’ lecture notes Vorlesungen über die Zahlentheorie der Quaternionen on integer quaternions which are quite accessible. ↩︎
  13. A construction of Banach and Tarski uses the free group for a seemingly paradoxical reconstruction of sets. See Lubotzky’s book below. ↩︎
  14. Corollary 2.1.11 in Lubotzky’s book Discrete Groups, Expanding Graphs and Invariant Measures. ↩︎
  15. Hecke operators and distributing points on the sphere I and Hecke operators and distributing points on S2. II. See also these Bourbaki seminar notes for a concise summary. ↩︎
  16. Symmetric random walks on groups. There is no trivial eigenvalue 1 as with finite graphs. Kesten shows that 1 is contained in the spectrum of the averaging operator of a Cayley graph if and only if the group is amenable. ↩︎
  17. The first paper 8 actually only talks about the action of the rotation group on the 2-sphere. The book of Sarnak, Some Applications of Modular Forms discusses how to upgrade the result. The second paper 8 introduces the “Lattice method”, the approach taking by the book by Lubotzky which is extremely general, see e.g. Hecke operators and equidistribution of Hecke points.
    ↩︎
  18. Ellenberg, Michel, Venkatesh and a follow-up study on the large deviations principle in this situation by Khayutin ↩︎
  19. Breiman: The Strong Law of Large Numbers for a Class of Markov Chains
    See also Section 3.2 Benoist, Quint: Random walks on reductive groups.
    ↩︎
  20. Chapter 3: Dolgopyat, Sarig: Local Limit Theorems for Inhomogeneous Markov Chains ↩︎
  21. See Spherical Fibonacci Point Sets for Illumination Integrals and a related book Efficient Quadrature Rules for Illumination Integrals references therein. The spherical Fibonacci point set seems to be the state of the art construction for low discrepancy point sets on the sphere. ↩︎
  22. Stolarsky, Theorem 2 in Sums of distances between points on a sphere. II. Formulas for the involved constants can be found in QMC designs: Optimal order Quasi Monte Carlo integration schemes on the sphere ↩︎
  23. Alexa: Super-Fibonacci Spirals: Fast, Low-Discrepancy Sampling of SO(3) ↩︎
  24. Section 3.3 Lecture notes of Sarig ↩︎
  25. Benoist and Quint Stationary measures and invariant subsets of homogeneous spaces (III) ↩︎
  26. Stackoverflow and Agner Fog: Extension to vector class library ↩︎
  27. https://mrelusive.com/publications/papers/SIMD-From-Quaternion-to-Matrix-and-Back.pdf ↩︎
  28. https://github.com/id-Software/DOOM-3-BFG/blob/1caba1979589971b5ed44e315d9ead30b278d8b4/neo/idlib/math/Simd_SSE.cpp ↩︎
  29. While the theorem13 of Benoist & Quint only holds for the simple, possibly biased, random walks, this paper generalizes some cases to the Markov chain situation required by the non-backtracking walk. Their setup does not cover the S-adic case but a generalization seems feasible. ↩︎
  30. https://www.techpowerup.com/gpu-specs/geforce-rtx-4070.c3924 ↩︎
  31. https://www.pcg-random.org/ and Hash Functions for GPU Rendering ↩︎

Comments

Leave a Reply

Logged in as rene. Edit your profile. Log out? Required fields are marked *