Sampling the disk from a square: The adoption method

We introduce a new disk sampling algorithm that maps the square locally isometrically to the disk. As with the classical rejection algorithm, this makes it possible to push low discrepancy sampling strategies from the square to the disk with no warping. In contrast to rejection, it is amenable to SIMD implementations, and our benchmark shows a 2-3x performance increase on 256-bit wide vectors.

First, recall rejection sampling, in which the square circumscribing the disk is sampled and a sample is rejected if it lies outside the disk.

point_2d rejection_sampling()
{
   f32 x,y;
   do
   {
      // sample uniformly in the circumscribing square [-1,1]^2
      x = random_float(-1,1);
      y = random_float(-1,1);
   }
   while(x*x + y*y > 1); // reject points outside the unit disk and retry
   return {x,y};
}

In this post we modify this algorithm: we sample the square inscribed in the disk, so we never reject any sample. To make up for the parts of the disk that we miss, we make use of some symmetries explained next.

The algorithm

We now present a new algorithm in which the rejection is replaced by “adoption”.

We introduce some notation:

  • $D=\{p:\|p\|\le 1\}$, the unit disk
  • $S=\{(x,y):|x|,|y|\le \frac{\sqrt{2}}{2}\}$, the square inscribed in the unit disk
  • $H_1=\{(x,y):x>0\}$, $H_2=\{(x,y):y>0\}$, $H_3=\{(x,y):x<0\}$, $H_4=\{(x,y):y<0\}$, halfplanes
  • $D^{⌓}_i=\{p\in D\cap H_i : p\notin S\}$, the circular segments of $D$ outside $S$
  • $t_1=(\sqrt{2},0)$, $t_2=(0,\sqrt{2})$, $t_3=(-\sqrt{2},0)$, $t_4=(0,-\sqrt{2})$, translation vectors
  • $D_i=D+t_i=\{p:\|p-t_i\|\le 1\}$, the translated unit disks
  • $S^{⌓}_i=D^{⌓}_i\pm t_i=S\cap D_i$, the circular segments of $S$ cut out by the translated disks

We sample from the square $S$ inscribed in the unit disk $D$. We divide the points inside the disk that lie outside the square into four circular segments $D^{⌓}_i$ of the initial disk $D$.

Consider further the four unit disks $D_i$ centered at an offset of $\pm\sqrt{2}$ on the horizontal and vertical axes. These disks intersect $S$ in four circular segments $S^{⌓}_i$. We can choose the labeling of the $D^{⌓}_i$ and $S^{⌓}_i$ such that $D^{⌓}_i=S^{⌓}_i\pm\sqrt{2}$, where the translation is along the horizontal axis for odd $i$ and the vertical axis for even $i$; the sign is uniquely determined by $i$. For each point $p$ in some $S^{⌓}_i$, we let $p'$ be its image in $D^{⌓}_i$ under this translation. This is the “adopted” point.

The sampling strategy is now as follows:

  • Let p be a uniform sample on S.
  • Check whether p lies in any of the translated disks $D_i$.
  • If not, take the sample p.
  • If yes, then p is also inside some $S^{⌓}_i$.
  • Take the sample p, but also cache the adopted point p′.
  • In the next function call, return p′ directly.
Animation of the algorithm

Implementation

We now present a C++-like implementation. We scale the picture described above by $\sqrt{2}$ so that the square $S$ becomes the standard square $[-1,1]^2$ of side length 2, and the final sample lies inside the disk of radius $\sqrt{2}$, which we scale down by $1/\sqrt{2}=\frac{\sqrt{2}}{2}$ before returning.


point_2d adoption_sampling()
{
   const f32 s2 = std::sqrt(2.f)/2; // scale factor back down to the unit disk
   f32 x,y;
   static f32 x_cache = 0;
   static f32 y_cache = 0;

   // If the previous call adopted a point, return it now.
   // (An adopted point never equals the origin, so (0,0) can mark the empty cache.)
   if(x_cache != 0 || y_cache != 0)
   {
      x = x_cache;
      y = y_cache;
      x_cache = 0;
      y_cache = 0;
      return {s2*x, s2*y};
   }

   // Sample uniformly in the square [-1,1]^2.
   x_cache = x = random_float(-1,1);
   y_cache = y = random_float(-1,1);

   f32 r2 = x * x + y * y;
   f32 t = r2 + 2;

   // t <= 4x  <=>  (x-2)^2 + y^2 <= 2, i.e. (x,y) lies in the segment S⌓_i cut out
   // by the disk centered at (2,0); analogously for the other three segments.
   // In that case cache the adopted copy, translated by ±2 along the corresponding axis.
   if (t <= 4 * x)
   {
      x_cache -= 2;
   }
   else if (t <= -4 * x)
   {
      x_cache += 2;
   }
   else if (t <= 4 * y)
   {
      y_cache -= 2;
   }
   else if (t <= -4 * y)
   {
      y_cache += 2;
   }
   else // (x,y) lies in C: nothing to adopt
   {
      x_cache = y_cache = 0;
   }
   return {s2*x, s2*y};
}
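
As a quick usage sketch (assuming the f32/point_2d aliases and the random_float used above; the test function and loop are ours, not part of the method), one can average $r^2$ over the disk, whose uniform mean is $\int_0^1 r^2\,2r\,dr = 1/2$:

#include <cstdio>

int main()
{
   const int n = 1 << 20;
   double sum = 0;
   for (int i = 0; i < n; ++i)
   {
      point_2d p = adoption_sampling();
      sum += p.x * p.x + p.y * p.y; // f(p) = r^2
   }
   std::printf("mean r^2 = %f (expect 0.5)\n", sum / n);
}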

Comparison with Rejection

We point out the differences from the rejection sampling strategy:

|                          | Adoption                                   | Rejection     |
|--------------------------|--------------------------------------------|---------------|
| Sample multiplier        | $1/(2-\pi/2)$                              | $\pi/4$       |
| Maximal instruction cost | 2 rng calls & 4 comparisons                | unbounded     |
| Memory requirement       | rng seed & last adopted sample             | rng seed      |
| Invertibility            | with extra encoding                        | right inverse |
| Continuity               | discontinuous at the boundary of $C$       | continuous    |
| Isometry                 | locally isometric                          | isometric     |
| Symmetries               | 0.73 probability of translational symmetry | none          |

Sample multiplier

To produce $n$ samples, the main body (producing random numbers and checking intersections) of the rejection algorithm has to run $\frac{4}{\pi}n$ times in expectation. In the adoption algorithm, the main body only runs $(2-\frac{\pi}{2})n$ times on average. If the rng (random number generator) is expensive, say, because it is a custom low discrepancy sequence, then adoption is $\frac{4/\pi}{2-\pi/2}\approx 3$ times as economical as rejection in expectation.
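
A minimal numeric check of these formulas (our illustration; not part of the benchmark code):

#include <cstdio>

int main()
{
   const double pi = 3.14159265358979323846;
   double rejection = 4 / pi;     // ~1.27 body runs per sample
   double adoption  = 2 - pi / 2; // ~0.43 body runs per sample
   std::printf("ratio %.2f\n", rejection / adoption); // ~2.97, i.e. about 3x
}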

Maximal instruction cost

In the rejection algorithm, bad luck can cause many loop iterations. Of course, the probability that the main body of the rejection algorithm has to run $n$ times decreases exponentially in $n$ (it equals $(1-\frac{\pi}{4})^{n-1}\frac{\pi}{4}$), so the expected iteration count equals one over the sample multiplier. The adoption cost, in comparison, has a fixed upper bound. This helps with SIMD implementation.

Memory requirement

The main disadvantage of the adoption strategy is that it needs memory. This makes GPU implementations more restrictive, as invocations are no longer independent of each other.

Invertibility

Neither strategy is trivially invertible if one tries to reinterpret it as a map from the square to the disk. Rejection, which can be understood as the restriction map from the square to the disk, does have a right inverse, namely the inclusion map from the disk into the square.

Adoption can be understood as a one-to-many mapping. It is possible to encode in each sample whether it was adopted or not, which can be used to recover the original point in the square. This inverse is a many-to-one mapping.

Suppose we wish to use the inverse map to look up a pixel in a square texture. For the rejection map, pixels outside the disk would simply not be used.

For the adoption map, values in the segments $S^{⌓}_i$ have to be compressed, as they hold two values: one for the point in $S^{⌓}_i$ itself and one for the adopted copy in $D^{⌓}_i$.
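
A minimal sketch of such an inverse, assuming the adopted flag is stored alongside each sample (the helper name and flag are hypothetical, not from the repository):

#include <cmath>

// Given a disk sample q and a flag telling whether it was adopted,
// recover the square point that adoption_sampling drew before output scaling.
point_2d adoption_inverse(point_2d q, bool adopted)
{
   const f32 s2 = std::sqrt(2.f)/2;
   f32 x = q.x / s2, y = q.y / s2; // undo the output scaling; now |p| <= sqrt(2)
   if (adopted)
   {
      // adopted points lie in the outer segments D⌓_i, where the dominant
      // coordinate has absolute value >= 1; translate back by ±2 along it
      if (std::fabs(x) > std::fabs(y)) x -= std::copysign(2.f, x);
      else                             y -= std::copysign(2.f, y);
   }
   return {x, y}; // point in the square [-1,1]^2
}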

Continuity/Isometry

Inside the (open) disk, the rejection algorithm is continuous, since it is essentially the identity map. The adoption algorithm produces additional samples outside the set $C$ (defined in the Markov chain section below), and points may suddenly disappear as they move from outside $C$ to inside $C$.

Symmetry

The adoption method makes use of antithetic variates; see the corresponding section below.

Markov Chain description

This algorithm describes a Markov chain on D as follows.

Let $C$ be the complement of the $S^{⌓}_i$ inside $S$.

For simplicity, label all the partition elements $C, S^{⌓}_1, S^{⌓}_2, S^{⌓}_3, S^{⌓}_4, D^{⌓}_1, D^{⌓}_2, D^{⌓}_3, D^{⌓}_4$ by $P_i$.

Let $m_i$ be the uniform measure on $P_i$, and let $m$ be the uniform measure on $D$.

The transition probability $\rho(dx,y)$ of $x$ given $y$ (for each fixed $y$ this is a measure) is easily seen to be $\rho(dx,y)=m_{i_x}(dx)\,p_{i_x i_y}$, where $i_x,i_y$ are the indices for which $x\in P_{i_x}$, $y\in P_{i_y}$, and $(p_{i_x i_y})$ is the transition matrix of a finite Markov chain on 9 symbols. Let $m(⌓)$ denote the area of any of the circular segments.

Let $a=m(⌓)/m(S)$ and note $1-4a=\frac{m(S)-4m(⌓)}{m(S)}=\frac{1-8m(⌓)}{m(S)}=m(C)/m(S)$.

Then

$$P=(p_{ji})=\begin{pmatrix}1-4a&0&0&0&0&1-4a&1-4a&1-4a&1-4a\\ a&0&0&0&0&a&a&a&a\\ a&0&0&0&0&a&a&a&a\\ a&0&0&0&0&a&a&a&a\\ a&0&0&0&0&a&a&a&a\\ 0&1&0&0&0&0&0&0&0\\ 0&0&1&0&0&0&0&0&0\\ 0&0&0&1&0&0&0&0&0\\ 0&0&0&0&1&0&0&0&0\end{pmatrix}.$$

Solving $\pi P=\pi$, we find the stationary distribution

$$\pi=(m(C),m(⌓),m(⌓),m(⌓),m(⌓),m(⌓),m(⌓),m(⌓),m(⌓)).$$

This chain is ergodic, and we can deduce convergence from that fact: let $x_i$ denote the outcomes of the sampling and let $f:D\to\mathbb{R}$. We need to show that, almost surely,

$$\frac{1}{n}\sum_{i<n} f(x_i)\to m(f).$$
Reorder the sum according to which $P_j$ each $x_i$ falls into. Relabel them $x^k_j$ and let $n_j$ be their cardinalities:

$$\frac{1}{n}\sum_{i<n} f(x_i)=\sum_j \frac{n_j}{n}\,\frac{1}{n_j}\sum_{k<n_j} f(x^k_j).$$

For each fixed $j$, the sequence $x^k_j$ is sampled from $m_j$. By the law of large numbers for iid random variables¹, $\frac{1}{n_j}\sum_{k<n_j} f(x^k_j)=m_j(f)+o(1)$ as $n_j\to\infty$. By the ergodic theorem for Markov chains², $\frac{n_j}{n}=\pi_j+o(1)$ (in particular, $n_j\to\infty$). Combining these two facts,

$$\frac{1}{n}\sum_{i<n} f(x_i)=\sum_j \pi_j m_j(f)+o(1)=m(f)+o(1).$$
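
As a sanity check, here is a hedged simulation sketch of the finite chain (our own illustration; the state indexing and seed are arbitrary). With $a=m(⌓)/m(S)=\frac{\pi-2}{8}$, the empirical state frequencies should approach $\pi=(m(C),m(⌓),\dots,m(⌓))$:

#include <cstdio>
#include <random>

int main()
{
   const double pi = 3.14159265358979323846;
   const double a  = (pi - 2) / 8; // a = m(⌓)/m(S)
   std::mt19937 gen(42);
   std::uniform_real_distribution<double> u(0, 1);
   const long long n = 10000000;
   long long counts[9] = {};
   int state = 0; // 0 = C, 1..4 = S⌓_i, 5..8 = D⌓_i
   for (long long i = 0; i < n; ++i)
   {
      if (state >= 1 && state <= 4)
         state += 4; // deterministic step S⌓_i -> D⌓_i
      else
      {
         // fresh uniform sample on S: C with probability 1-4a, each S⌓_i with a
         double r = u(gen);
         int k = (r < 1 - 4 * a) ? 0 : 1 + (int)((r - (1 - 4 * a)) / a);
         state = k > 4 ? 4 : k;
      }
      ++counts[state];
   }
   std::printf("C freq %.4f, expect %.4f\n", counts[0] / (double)n, 4 / pi - 1);
   std::printf("segment freq %.4f, expect %.4f\n", counts[1] / (double)n, (pi - 2) / (4 * pi));
}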

Usage as warp function

Here we wish to study the use of the rejection and adoption samplers as warp functions in the following sense.

Suppose we have high quality low discrepancy point sets (qmc) on the square and we wish to map them to the unit disk. The two common strategies are the polar map and the concentric map³. See the Pixar Technical Memo⁴ on sampling, which brings up this strategy, or the PBRT book⁵.

Here we make the explicit case for using the rejection and adoption maps, replacing the uniform sampler by a qmc sequence.
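
As a concrete sketch of what this means for adoption (a hypothetical helper in the style of the implementation above, not taken from the repository): viewed as a warp, each square point maps to its scaled copy and, if it lies in a segment $S^{⌓}_i$, additionally to its adopted copy.

#include <cmath>

// Deterministic one-to-many warp from the square [-1,1]^2 to the unit disk.
// Returns the number of disk samples written to out (1 or 2).
u32 adoption_warp(f32 x, f32 y, point_2d out[2])
{
   const f32 s2 = std::sqrt(2.f)/2;
   out[0] = {s2 * x, s2 * y};
   f32 t = x * x + y * y + 2;
   // t <= 4x  <=>  (x-2)^2 + y^2 <= 2, i.e. (x,y) lies in the right segment, etc.
   if      (t <=  4 * x) { out[1] = {s2 * (x - 2), s2 * y}; return 2; }
   else if (t <= -4 * x) { out[1] = {s2 * (x + 2), s2 * y}; return 2; }
   else if (t <=  4 * y) { out[1] = {s2 * x, s2 * (y - 2)}; return 2; }
   else if (t <= -4 * y) { out[1] = {s2 * x, s2 * (y + 2)}; return 2; }
   return 1; // (x,y) lies in C: no adopted copy
}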

Some plots of qmc sequences on the square and their images on the disk under the four warps polar/concentric/rejection/adoption are shown next. The qmc sequences are taken from the Sampling and Reconstruction chapter of PBRT 4.

We scale the size of the plotted disk according to the average number of samples produced per point on the square, which is one for polar/concentric, >1 for adoption, and <1 for rejection.

Note how point sets which exhibit the symmetries of a torus, like the grid or the Sobol sequence⁷, feature seamless borders around the square in the adoption method.

We show some discrepancy plots of the mean squared error over 100 trials against the total number of samples. The qmc methods are randomized using Owen scrambling⁶. Random refers to the uniform sampler, and jitter refers to the stratified sampler.

Antithetic variates

The notion of antithetic variates⁸ refers to introducing area-preserving transformations which are applied to the sampled points. The standard examples are negation on the real line or reflection through the origin inside the disk⁹. These can lead to lowered variance for odd functions, $f(-x)=-f(x)$, but make sampling for even functions less effective. The other property was listed as a feature in the comparison table between adoption and rejection: an expensive sample might be made more economical if the symmetries are cheap to produce.

One may consider the adoption method to be a simple antithetic version of the restriction (i.e. rejection) method. Indeed, given any rational rotation $R$ of order $n$, one can generate $(n-1)$ additional antithetic samples $R^k x$ from some sampled point $x$ on the circle of radius $|x|$.
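
A small sketch of this rotational construction (our illustration; the helper and its output layout are not from the post):

#include <cmath>

// Emit the (n-1) antithetic samples R^k p, k = 1..n-1, where R is the
// rotation by 2*pi/n; all of them lie on the circle of radius |p|.
void rotational_antithetics(point_2d p, int n, point_2d* out)
{
   const f32 phi = 2.f * 3.14159265f / n;
   const f32 c = std::cos(phi), s = std::sin(phi);
   f32 x = p.x, y = p.y;
   for (int k = 0; k + 1 < n; ++k)
   {
      f32 xr = c * x - s * y; // rotate by phi each step
      f32 yr = s * x + c * y;
      out[k] = {xr, yr};
      x = xr; y = yr;
   }
}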

In the method proposed here, antithetics are only produced on a subset of the disk, namely on the inner segments $S^{⌓}_i$, and the transformations are the translations $T_i:S^{⌓}_i\to D^{⌓}_i$.

The set $C$, which consists of the disk minus the eight segments ⌓, has relative area $\frac{4}{\pi}-1\approx 0.27$. This “pure” area is untouched by the artificially introduced symmetries. If disk sampling is applied to cosine-weighted hemisphere sampling¹⁰, then the projection of $C$, also of relative area 0.27, is located around the north pole.

SIMD benchmarks

We now show performance characteristics of SIMD implementations using AVX2 and AVX512. We produce $2^{23}=8388608$ samples and run a discrepancy test against a collection of sets on the unit disk. The units in the following table are cycles per sample.

| AMD Zen 4 | AVX2 msvc | AVX2 clang | AVX512 msvc | AVX512 clang |
|-----------|-----------|------------|-------------|--------------|
| polar     | 4.65      | n/a        | 3.16        | n/a          |
| rejection | 7.99      | 7.76       | 5.50        | 5.32         |
| adoption  | 3.61      | 2.67       | 3.94 (?!)   | 2.06         |

Note that we did not succeed in getting the performance benefits AVX512 promises over AVX2; in fact, we are slower using the msvc compiler. The comparison mechanism in AVX512 uses mask vectors, forcing a different implementation, which may cause the performance catastrophe.

We only found Intel’s SVML library, which provides the sincos implementation, working with msvc; hence the polar version is missing from the profiling data for clang.

The associated github repository https://github.com/reneruhr/adoption for this project contains three programs: the animation video is a recording from an OpenGL program; the plots have also been drawn using OpenGL in a separate program; finally, there is a test bench for the SIMD code. The programs compile on Windows.

Update 14.04.2024:

Alias method and GPU implementation

As pointed out in discussion with Won in the comments, there is an alias sampling strategy that both removes the need for a cache (making it easy to implement on the GPU) and removes any unwelcome translational symmetries.

We use a variant of the alias method¹¹ as follows. Recall that we sample inside a square $S$, which is subdivided into $C$ and its complement. The probability of sampling inside $C$ is $\frac{m(C)}{m(S)}$. We modify this probability to be $m(C)$ (recall the normalization $m(D)=1$) by attaching an alias to the set $S$, i.e. we sample again, with an alias probability $p$. This results in an equation for $p$:

$$m(C)=\frac{m(C)}{m(S)}(1-p)+\frac{m(C)}{m(S)}\,p\,\frac{m(C)}{m(S)},$$

which gives $p=\frac{1-m(S)}{1-m(C)/m(S)}=\frac{2}{\pi}\ (=m(S))$.

Having the correct probability for $C$, we now sample inside each $S^{⌓}_i$ twice as often. We add an alias to $D^{⌓}_i$ with alias probability $1/2$.

Here’s a CUDA C++ implementation:

__global__ void adoption_sampling(vec2* out, u32 n)
{
    u32 idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        u32 s = idx;
        f32 x, y;
        // sample uniformly in the square [-1,1]^2
        x = (s = rng(s)) * 0x1.0p-31f - 1.f;
        y = (s = rng(s)) * 0x1.0p-31f - 1.f;

        f32 r2 = x * x + y * y;
        f32 t = r2 + 2.f;
        f32 x_cache = x;
        f32 y_cache = y;

        // if (x,y) lies in a segment S⌓_i, compute the adopted copy
        if      (t <=  4 * x) x_cache -= 2;
        else if (t <= -4 * x) x_cache += 2;
        else if (t <=  4 * y) y_cache -= 2;
        else if (t <= -4 * y) y_cache += 2;
        else
        {
            // (x,y) lies in C: resample once with alias probability p = 2/pi
            if ((s = rng(s)) * 0x1.0p-32f < CUDART_2_OVER_PI_F)
            {
                x = (s = rng(s)) * 0x1.0p-31f - 1.f;
                y = (s = rng(s)) * 0x1.0p-31f - 1.f;

                r2 = x * x + y * y;
                t = r2 + 2.f;
                x_cache = x;
                y_cache = y;

                if      (t <=  4 * x) x_cache -= 2;
                else if (t <= -4 * x) x_cache += 2;
                else if (t <=  4 * y) y_cache -= 2;
                else if (t <= -4 * y) y_cache += 2;
            }
        }

        // with probability 1/2 keep the original point instead of the adopted copy
        // (a no-op for points in C, where the cache still holds the original)
        if ((s = rng(s)) < 1u << 31)
        {
            x_cache = x;
            y_cache = y;
        }

        // note: samples lie in the disk of radius sqrt(2); scale by sqrt(2)/2
        // for the unit disk, as in the scalar version above
        out[idx].x = x_cache;
        out[idx].y = y_cache;
    }
}

It is possible to optimize somewhat by directly checking if a point (x,y) is inside C using

        bool inside_c = t > 4*fmaxf(fabsf(x),fabsf(y));
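
Building on this test, the four comparisons can then be collapsed, since inside a segment the coordinate of larger magnitude identifies the translation axis and its sign the direction (a sketch using the same variable names as the kernel above):

        if (!inside_c)
        {
            // the dominant coordinate picks the axis; its sign the direction
            if (fabsf(x) > fabsf(y)) x_cache = x - copysignf(2.f, x);
            else                     y_cache = y - copysignf(2.f, y);
        }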

When comparing to the polar method on an RTX 4070, the alias version is clearly slower and at best comparable to rejection:

|           | ns/sample |
|-----------|-----------|
| Polar     | 0.0020    |
| Rejection | 0.0058    |
| Adoption  | 0.0065    |

The algorithms are iterated 100 times inside the kernel calls to make the computation time dominate over the memory write time. Times are averages over 30 runs. The CUDA compile flag --use_fast_math is used for fast sin/cos.

The full code can be found in the Cuda folder on GitHub.

  1. https://en.wikipedia.org/wiki/Law_of_large_numbers
  2. https://en.wikipedia.org/wiki/Discrete-time_Markov_chain
  3. Shirley & Chiu: A Low Distortion Map Between Disk and Square
  4. Andrew Kensler: Correlated Multi-Jittered Sampling
  5. PBRT 4: A.5.1 Sampling a Unit Disk
  6. PBRT 4: util/lowdiscrepancy.h
  7. PBRT 4: Sobol Sampling
  8. https://en.wikipedia.org/wiki/Antithetic_variates
  9. Fishman & Huang: Antithetic variates revisited
  10. PBRT 4: Hemisphere Sampling
  11. PBRT 4: A.1 The Alias Method
Comments

  1. Won

Very neat, and definitely worth a thorough read. This kind of reminds me of Marsaglia’s “Monty Python” method (which I’m not sure he took too seriously, given the name). In it he uses a piecewise approach to atlas a distribution into a box to minimize the rejection ratio, but perhaps the adoption technique could be used there as well?

    https://dl.acm.org/doi/pdf/10.1145/292395.292453

  2. Won

Ah, I think I didn’t quite understand your idea at first, but I think I get it on the second read — my mistake was thinking that you were flipping a coin in the overlap case to see whether it belonged in the square or the translated sector, but this unfortunately underestimates the number of samples.

The connection with antithetic variates is interesting — I wonder if it means that it would be beneficial to sometimes mirror the sample about the axis of translation, balancing in both axes at once. With a few more random bits, you could decide to flip randomly, or even select a different destination sector. By D4 symmetry there are 8 possible locations, but you could reduce it to 4 and always select distant spots.

    I’m not sure if this is a good idea, or simply my post-facto justification of my initial confusion, thinking that adoption was a form of an alias method!

    1. rene

      Yes, it seems the Monty Python method is related as it describes cutting and transforming part of the area.

      Antithetic samples one can always add in addition if one has the symmetries, which could be sampled in a structured manner or randomly.

      Yes, one can use the alias method, by encoding two steps into one (making the Markov chain Bernoullian): when sampling in the C area, one samples again with p=2/pi. And when sampling a segment, one samples again with p’=1/2 among them (flip randomly).

    2. rene

      The approach of your first thought is easily modified by removing probability from the non-overlap to correct the undersampling.

    3. rene

      I added a discussion and gave some cuda benchmark using an alias approach.
