We introduce a new disk sampling algorithm that maps the square to the disk by a local isometry. As with the classical rejection algorithm, this allows low discrepancy sampling strategies to be pushed from the square to the disk without warping. In contrast to rejection, it is amenable to SIMD implementations, and our benchmark shows a 2-3x performance increase on 256-bit vectors.
First, recall rejection sampling: the square circumscribing the disk is sampled, and a sample is rejected if it lies outside the disk.
point_2d rejection_sampling()
{
    f32 x, y;
    do
    {
        x = random_float(-1, 1);
        y = random_float(-1, 1);
    }
    while (x*x + y*y > 1);
    return {x, y};
}
In this post we modify this algorithm: we sample the square inscribed in the disk, so no sample is ever rejected. To make up for the parts of the disk that we miss, we exploit some symmetries, explained next.
The algorithm
We now present a new algorithm in which rejection is replaced by “adoption”.
We introduce some notation:
D = {p : ‖p‖ ≤ 1}  (unit disk)
S = {(x,y) : |x|, |y| ≤ √2/2}  (square inscribed in the unit disk)
H₁ = {(x,y) : x > 0}, H₂ = {(x,y) : y > 0}, H₃ = {(x,y) : x < 0}, H₄ = {(x,y) : y < 0}  (half-planes)
⌓Dᵢ = (D ∩ Hᵢ) ∖ S  (circular segments)
t₁ = (−√2, 0), t₂ = (0, −√2), t₃ = (√2, 0), t₄ = (0, √2)  (translation vectors)
Dᵢ = D + tᵢ = {p : ‖p − tᵢ‖ ≤ 1}  (translated unit disks)
⌓Sᵢ = ⌓Dᵢ + tᵢ = S ∩ Dᵢ  (translated segments)
We sample from the square S inscribed in the unit disk D. We divide the points inside the disk that lie outside the square into the four circular segments ⌓Dᵢ.
Consider further the 4 translated unit disks Dᵢ = D + tᵢ.
The sampling strategy is now as follows:
- Let p be a uniform sample on S.
- Check if p lies in any of the Dᵢ.
- If not, take the sample p.
- If yes, then p also lies in some ⌓Sᵢ. Take the sample p, but additionally cache p′ = p − tᵢ, which lies in the segment ⌓Dᵢ.
- In the next function call, directly return p′.
Implementation
We now present a C++-like implementation. We scale the above picture by √2, so that the square becomes [−1,1]², the disk has radius √2, and the translation vectors become (±2, 0) and (0, ±2).
point_2d adoption_sampling()
{
    const f32 s2 = std::sqrt(2.f)/2; // scale back to the unit disk
    f32 x, y;
    static f32 x_cache = 0;
    static f32 y_cache = 0;
    if (x_cache != 0 || y_cache != 0) // a sample was adopted last call
    {
        x = x_cache;
        y = y_cache;
        x_cache = 0;
        y_cache = 0;
        return {s2*x, s2*y};
    }
    x_cache = x = random_float(-1, 1);
    y_cache = y = random_float(-1, 1);
    f32 r2 = x * x + y * y;
    f32 t = r2 + 2;
    if (t <= 4 * x)       // inside the disk centered at (2,0)
    {
        x_cache -= 2;
    }
    else if (t <= -4 * x) // inside the disk centered at (-2,0)
    {
        x_cache += 2;
    }
    else if (t <= 4 * y)  // inside the disk centered at (0,2)
    {
        y_cache -= 2;
    }
    else if (t <= -4 * y) // inside the disk centered at (0,-2)
    {
        y_cache += 2;
    }
    else                  // inside C: nothing to adopt
    {
        x_cache = y_cache = 0;
    }
    return {s2*x, s2*y};
}
Comparison with Rejection
We point out the differences to the rejection sampling strategy:
| | Adoption | Rejection |
| Sample multiplier | π/2 ≈ 1.57 | π/4 ≈ 0.79 |
| Maximal instruction cost | 2 rng calls & 4 comparisons | unbounded (expected 4/π iterations) |
| Memory requirement | rng seed & last adopted sample | rng seed |
| Invertibility | with extra encoding | right inverse |
| Continuity | discontinuous at boundary of C | continuous |
| Isometry | locally isometric | isometric |
| Symmetries | .73 probability of translational symmetry | none |
Sample multiplier
To produce n samples, the main body (producing random numbers and checking intersections) of the rejection algorithm has to run 4n/π ≈ 1.27n times on average, while for adoption 2n/π ≈ 0.64n runs suffice, since each run may yield two samples.
Maximal instruction cost
In the rejection algorithm, bad luck can cause many loop iterations. Of course, the probability that the main body runs n times decreases exponentially in n, so that the expected iteration number equals one over the sample multiplier, 4/π ≈ 1.27. The adoption cost, in comparison, has a fixed upper bound. This helps with SIMD implementation.
Memory requirement
The main disadvantage of the adoption strategy is that it needs memory. This makes GPU implementations more restrictive, as invocations are no longer independent of each other.
Invertibility
Neither strategy is trivially invertible if one tries to reinterpret it as a map from the square to the disk. Rejection, which can be understood as the restriction map from the square to the disk, does have a right inverse, namely the inclusion map from the disk into the square.
Adoption can be understood as a one-to-many mapping. It is possible to encode in each sample whether it was adopted or not, which can be used to recover the original point in the square. This yields a many-to-one mapping.
Suppose we wish to use the inverse map to look up a pixel in a square texture. For the rejection map, pixels outside the disk are simply unused. For the adoption map, values in the outer segments ⌓Dᵢ are fetched from the translated positions in ⌓Sᵢ, so those texels inside the square are used twice.
Continuity/Isometry
Inside the (open) disk, the rejection sampling algorithm is continuous, since it is essentially the identity map. The adoption algorithm produces additional samples in the segments outside the set C, and points may suddenly disappear if they move from outside C to inside C.
Symmetry
The adoption method makes use of antithetic variates; see the corresponding subsection below.
Markov Chain description
This algorithm describes a Markov chain on D as follows.
Let C be the complement of the eight segments inside the disk, C = D ∖ (⌓D₁ ∪ … ∪ ⌓D₄ ∪ ⌓S₁ ∪ … ∪ ⌓S₄).
For simplicity, label all the partition elements P₀ = C, Pᵢ = ⌓Sᵢ and P₄₊ᵢ = ⌓Dᵢ for i = 1,…,4.
Let Xₙ be the partition element containing the n-th returned sample.
The transition probability from any state other than a ⌓Sᵢ into an element P ⊂ S is area(P)/area(S), since a fresh uniform sample on S is drawn; from ⌓Sᵢ the chain moves deterministically to ⌓Dᵢ, since the cached translate is returned.
Solving the stationarity equations shows that the stationary measure assigns to each partition element a mass proportional to its area, i.e. the returned samples equidistribute over D.
This chain is ergodic [2], and we can deduce convergence from that fact: for each fixed j, the empirical frequency of visits to Pⱼ converges to area(Pⱼ)/π, in analogy with the law of large numbers [1].
Usage as warp function
Here we wish to study the use of the rejection and adoption samplers as warp functions in the following sense.
Suppose we have high quality low discrepancy (qmc) point sets on the square and we wish to map them to the unit disk. The two common strategies are the polar map and the concentric map [3]. See the Pixar Technical Memo [4] on sampling, which brings up this strategy, or the PBRT book [5].
Here we make the explicit case for using the rejection and the adoption maps, by replacing the uniform sampler with a qmc sequence.
Some plots of qmc sequences on the square and their mappings to the disk using the four warps polar/concentric/rejection/adoption are shown next. The qmc sequences are taken from the Sampling and Reconstruction chapter of PBRT 4.
We scale the size of each plotted disk according to the average number of samples produced per point on the square, which is one for polar/concentric, >1 for adoption, and <1 for rejection.
Note how point sets that exhibit the symmetries of a torus, like the grid or the Sobol sequence [7], feature seamless borders around the square in the adoption method.
We show some discrepancy plots of the mean squared error over 100 trials against the total number of samples. The qmc methods are randomized using Owen scrambling [6]. Random refers to the uniform sampler, and jitter refers to the stratified sampler.
Antithetic variates
The notion of antithetic variates [8] refers to introducing area-preserving transformations which are applied to the sampled points. The standard examples are negation on the real line and reflection around the origin inside the disk [9]. These can lead to lower variance for odd functions.
One may consider the adoption method to be a simple antithetic version of the restriction method. Indeed, given any rational rotation R of order n, one could generate an additional (n−1) antithetic samples Rp, R²p, …, Rⁿ⁻¹p from each sample p.
In the method proposed here, antithetic samples are only produced on a subset of the disk, namely on the inner segments ⌓Sᵢ, each paired with a translated copy in ⌓Dᵢ.
The set C, which consists of the disk minus the eight segments ⌓Dᵢ and ⌓Sᵢ, has relative area (4−π)/π ≈ 0.27; its complement accounts for the .73 probability of translational symmetry above.
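Written out, the area bookkeeping behind these two numbers is (each single segment has area (π−2)/4):

```latex
\begin{aligned}
\operatorname{area}(D \setminus S) &= \pi - 2 &&\text{(the four outer segments)}\\
\operatorname{area}(C) &= \pi - 2(\pi - 2) = 4 - \pi,\\
\frac{\operatorname{area}(C)}{\operatorname{area}(D)} &= \frac{4-\pi}{\pi} \approx 0.2732,\\
1 - \frac{4-\pi}{\pi} &= 2 - \frac{4}{\pi} \approx 0.7268.
\end{aligned}
```

The last line is the .73 probability of translational symmetry quoted in the comparison table.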
SIMD benchmarks
We now show performance characteristics of SIMD implementations using AVX2 and AVX512. We produce 2^23 = 8388608 samples and run a discrepancy test against a collection of sets on the unit disk. The units in the following table are cycles per sample.
AMD Zen 4 | AVX2 msvc | AVX2 clang | AVX512 msvc | AVX512 clang |
polar | 4.65 | – | 3.16 | – |
rejection | 7.99 | 7.76 | 5.50 | 5.32 |
adoption | 3.61 | 2.67 | 3.94 (?!) | 2.06 |
Note that we did not succeed in getting the performance benefits AVX512 promises over AVX2; in fact, we are slower using the msvc compiler. Comparisons in AVX512 use mask registers, forcing a different implementation which may cause the performance regression.
We only found Intel’s SVML library, which provides the sincos implementation, working for msvc; hence the polar version is missing from the profiling data for clang.
The associated github repository https://github.com/reneruhr/adoption for this project contains three programs: the animation video is a recording from an OpenGL program, the plots have been drawn using OpenGL in a separate program, and finally there is a test bench for the SIMD code. The programs compile on Windows.
Update 14.04.2024:
Alias method and GPU implementation
As pointed out in discussion with Won in the comments, there is an alias sampling strategy that both removes the need for a cache (making it easy to implement on the GPU) and removes any unwelcome translational symmetries.
We use a variant of the alias method [11] as follows. Recall that we sample inside the square S, which is subdivided into C ∩ S and the four segments ⌓Sᵢ. The probability of sampling inside C is (4−π)/2 ≈ 0.43, while a uniform sample on the disk lands in C only with probability (4−π)/π ≈ 0.27, which gives a resampling probability of 2/π: if the sample lies in C, we redraw it once with probability 2/π. Having the correct probability for C, we sample inside the segments by returning a point p ∈ ⌓Sᵢ either as-is or translated by −tᵢ, each with probability 1/2.
Here’s a CUDA C++ implementation:
__global__ void adoption_sampling(vec2* out, u32 n)
{
    u32 idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        u32 s = idx;
        f32 x, y;
        x = (s = rng(s)) * 0x1.0p-31f - 1.f; // uniform in [-1,1)
        y = (s = rng(s)) * 0x1.0p-31f - 1.f;
        f32 r2 = x * x + y * y;
        f32 t = r2 + 2.f;
        f32 x_cache = x;
        f32 y_cache = y;
        if (t <= 4 * x)       // inside a translated disk: store the translate
        {
            x_cache -= 2;
        }
        else if (t <= -4 * x)
        {
            x_cache += 2;
        }
        else if (t <= 4 * y)
        {
            y_cache -= 2;
        }
        else if (t <= -4 * y)
        {
            y_cache += 2;
        }
        else // inside C: redraw once with probability 2/pi
        {
            if ((s = rng(s)) * 0x1.0p-32f < CUDART_2_OVER_PI_F)
            {
                x = (s = rng(s)) * 0x1.0p-31f - 1.f;
                y = (s = rng(s)) * 0x1.0p-31f - 1.f;
                r2 = x * x + y * y;
                t = r2 + 2.f;
                x_cache = x;
                y_cache = y;
                if (t <= 4 * x)
                {
                    x_cache -= 2;
                }
                else if (t <= -4 * x)
                {
                    x_cache += 2;
                }
                else if (t <= 4 * y)
                {
                    y_cache -= 2;
                }
                else if (t <= -4 * y)
                {
                    y_cache += 2;
                }
            }
        }
        // fair coin between the sample and its translate
        if ((s = rng(s)) < 1u << 31)
        {
            x_cache = x;
            y_cache = y;
        }
        out[idx].x = x_cache;
        out[idx].y = y_cache;
    }
}
It is possible to optimize somewhat by directly checking whether a point (x,y) lies inside C using
bool inside_c = t > 4 * fmaxf(fabsf(x), fabsf(y));
Comparing against the polar method on an RTX 4070, adoption is clearly slower and at best comparable to rejection:
| | ns/sample |
| Polar | 0.0020 |
| Rejection | 0.0058 |
| Adoption | 0.0065 |
The algorithms are iterated 100 times inside the kernel calls to make the computation time dominate over the memory write times. Times are averages over 30 runs. The CUDA compile flag --use_fast_math is used for fast sin/cos.
The full code can be found in the CUDA folder on github.
[1] https://en.wikipedia.org/wiki/Law_of_large_numbers
[2] https://en.wikipedia.org/wiki/Discrete-time_Markov_chain
[3] Shirley & Chiu: A Low Distortion Map Between Disk and Square
[4] Andrew Kensler: Correlated Multi-Jittered Sampling
[5] PBRT 4: A.5.1 Sampling a Unit Disk
[6] PBRT 4: util/lowdiscrepancy.h
[7] PBRT 4: Sobol Sampling
[8] https://en.wikipedia.org/wiki/Antithetic_variates
[9] Fishman & Huang: Antithetic variates revisited
[10] PBRT 4: Hemisphere Sampling
[11] PBRT 4: A.1 The Alias Method