WebGL best practices

WebGL is a complicated API, and it's often not obvious what the recommended ways to use it are. This page tackles recommendations across the spectrum of expertise, and not only highlights dos and don'ts, but also details why. You can rely on this document to guide your choice of approach, and ensure you're on the right track no matter what browser or hardware your users run.

General Topics

Address and eliminate WebGL errors

Your application should run without generating any WebGL errors (as returned by getError). Every WebGL error is reported in the Web Console as a JavaScript warning with a descriptive message. After too many errors (32 in Firefox), WebGL stops generating descriptive messages, which really hinders debugging.

The only errors a well-formed page generates are OUT_OF_MEMORY and CONTEXT_LOST.

Know your limits (and extensions)

The availability of most WebGL extensions depends on the client system. When using WebGL extensions, if possible, try to make them optional by gracefully adapting to the case where they are not supported.

Likewise, the limits of your system will differ from those of your clients' systems! Don't assume you can use thirty texture samplers per shader just because it works on your machine!
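A minimal sketch of both ideas, assuming hypothetical helpers named queryCapabilities and clampTextureSize and using OES_texture_float purely as an example of an optional extension. Since getParameter can block, a probe like this would normally run once at startup rather than per frame:

```javascript
// Hypothetical helper: probe an optional extension and a few limits once,
// then let the rest of the app adapt to what is actually available.
function queryCapabilities(gl) {
  // getExtension returns null when the extension is not supported.
  const floatTexExt = gl.getExtension('OES_texture_float');
  return {
    hasFloatTextures: floatTexExt !== null,
    maxTextureSize: gl.getParameter(gl.MAX_TEXTURE_SIZE),
    maxFragmentSamplers: gl.getParameter(gl.MAX_TEXTURE_IMAGE_UNITS),
  };
}

// Hypothetical helper: clamp a requested texture size to the client's limit.
function clampTextureSize(requested, caps) {
  return Math.min(requested, caps.maxTextureSize);
}
```

The returned object can then drive fallbacks, for example choosing a lower-resolution asset when maxTextureSize is small.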

Take advantage of universally supported WebGL 1 extensions

These WebGL 1 extensions are universally supported:

  • ANGLE_instanced_arrays
  • EXT_blend_minmax
  • OES_element_index_uint
  • OES_standard_derivatives
  • OES_vertex_array_object
  • WEBGL_debug_renderer_info
  • WEBGL_lose_context

(see also: https://jdashg.github.io/misc/webgl/webgl-feature-levels.html)

Consider polyfilling these into WebGLRenderingContext, like: https://github.com/jdashg/misc/blob/master/webgl/webgl-v1.1.js

Universally supported limits

The minimum requirements for WebGL are quite low. In practice, effectively all systems support at least the following:

    MAX_CUBE_MAP_TEXTURE_SIZE: 4096
    MAX_RENDERBUFFER_SIZE: 4096
    MAX_TEXTURE_SIZE: 4096
    MAX_VIEWPORT_DIMS: [4096,4096]
    MAX_VERTEX_TEXTURE_IMAGE_UNITS: 4
    MAX_TEXTURE_IMAGE_UNITS: 8
    MAX_COMBINED_TEXTURE_IMAGE_UNITS: 8
    MAX_VERTEX_ATTRIBS: 16
    MAX_VARYING_VECTORS: 8
    MAX_VERTEX_UNIFORM_VECTORS: 128
    MAX_FRAGMENT_UNIFORM_VECTORS: 64
    ALIASED_POINT_SIZE_RANGE: [1,100]

Your desktop may support 16k textures, or maybe 16 texture units in the vertex shader, but most other systems don't, and content that works for you will not work for them!

Avoid invalidating FBO attachment bindings

Almost any change to an FBO's attachment bindings will invalidate its framebuffer completeness. Set up your hot framebuffers ahead of time.

In Firefox, setting the pref webgl.perf.max-warnings to -1 in about:config will enable performance warnings that include warnings about FB completeness invalidations.

And to a lesser degree, VAO attachments (vertexAttribPointer, disable/enableVertexAttribArray)

Drawing from static, unchanging VAOs is faster than mutating the same VAO for every draw call. For unchanged VAOs, browsers can cache the fetch limits, whereas when VAOs change, browsers must revalidate and recalculate limits. The overhead for this is relatively low, but re-using VAOs means fewer vertexAttribPointer calls too, so it's worth doing wherever it's easy.

Delete objects eagerly

Don't wait for the garbage collector/cycle collector to realize objects are orphaned and destroy them. Implementations track the liveness of objects, so 'deleting' them at the API level only releases the handle that refers to the actual object (conceptually releasing the handle's ref-pointer to the object). Only once the object is unused in the implementation is it actually freed. For example, if you never want to access your shader objects directly again, just delete their handles after attaching them to a program object.
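As a sketch of the shader example, assuming a hypothetical helper createProgramFromShaders that receives two already-compiled shaders. Note that once the handles are deleted, the shader info logs can no longer be queried through them:

```javascript
// Hypothetical helper: the program keeps its attached shaders alive,
// so the handles can be released right after attachShader.
function createProgramFromShaders(gl, vs, fs) {
  const prog = gl.createProgram();
  gl.attachShader(prog, vs);
  gl.attachShader(prog, fs);
  // This releases only the handles; the underlying shaders are freed
  // once the program no longer uses them.
  gl.deleteShader(vs);
  gl.deleteShader(fs);
  gl.linkProgram(prog);
  return prog;
}
```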

Eagerly lose contexts too

WEBGL_lose_context.loseContext() can be used to release a WebGL context and its resources eagerly. Use this if you are finished with any contexts, such as probe contexts, or if you hit a fallback case.
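A minimal sketch, assuming a hypothetical helper named releaseContext that is called once a probe context is no longer needed:

```javascript
// Hypothetical helper: eagerly release a context that is no longer needed.
// Returns true if the context was released, false if the extension is missing.
function releaseContext(gl) {
  const ext = gl.getExtension('WEBGL_lose_context');
  if (ext === null) return false;
  ext.loseContext();
  return true;
}
```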

Flush when expecting results (like queries or rendering frame completion)

Flush tells the implementation to push all pending commands out for execution, flushing them out of the queue, instead of waiting for more commands to enqueue before sending for execution.

For example, it is possible for the following to never complete without context loss:

sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glClientWaitSync(sync, 0, GL_TIMEOUT_IGNORED);

WebGL doesn't have a SwapBuffers call by default, so a flush can help fill the gap, as well.

Use webgl.flush() when not using requestAnimationFrame

When not using RAF (such as when using RPAF; see below), use webgl.flush() to encourage eager execution of enqueued commands.

Because RAF is directly followed by the frame boundary, an explicit webgl.flush() isn't really needed with RAF.

Avoid blocking API calls in production (e.g. getError, getParameter)

Certain WebGL entry points cause synchronous stalls on the calling thread. Even basic requests can take as long as 1ms, but they can take even longer if they need to wait for all graphics work to be completed (with an effect similar to glFinish() in native OpenGL).

In production code, avoid such entry points, especially on the browser main thread where they can cause the entire page to jank (often including scrolling or even the whole browser).

  • getError(): causes a flush + round-trip to fetch errors from the GPU process.

    For example, within Firefox, the only time glGetError is checked is after allocations (bufferData, *texImage*, texStorage*) to pick up any GL_OUT_OF_MEMORY errors.

  • getShader/ProgramParameter(), getShader/ProgramInfoLog(), other gets on shaders/programs: flush + shader compile + round-trip, if not done after shader compilation is complete. (See also parallel shader compilation below.)

  • get*Parameter() in general: possible flush + round-trip. In some cases, these will be cached to avoid the round-trip, but try to avoid relying on this.

  • checkFramebufferStatus(): possible flush + round-trip.

  • getBufferSubData(): usual finish + round-trip. (This is okay for READ buffers in conjunction with fences - see async data readback below.)

  • readPixels() to the CPU (i.e. without a PIXEL_PACK_BUFFER bound): finish + round-trip. Instead, use GPU-GPU readPixels in conjunction with async data readback.
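One way to keep such calls out of production is to gate them behind a debug flag. The following is a sketch only; checkError and its debugEnabled parameter are hypothetical names, not part of WebGL:

```javascript
// Hypothetical helper: only pay for the blocking getError() round-trip
// when debugging is explicitly enabled.
function checkError(gl, label, debugEnabled) {
  if (!debugEnabled) return null; // production path: no blocking call
  const err = gl.getError();
  if (err !== gl.NO_ERROR) {
    console.warn('WebGL error 0x' + err.toString(16) + ' after ' + label);
    return err;
  }
  return null;
}
```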

Always keep vertex attrib 0 array-enabled

If you draw with vertex attrib 0 array disabled, you will force the browser to do complicated emulation when running on desktop OpenGL (such as on macOS). This is because in desktop OpenGL, nothing gets drawn if vertex attrib 0 is not array-enabled. You can use bindAttribLocation to force a vertex attribute to use location 0, and use enableVertexAttribArray(0) to make it array-enabled.
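A short sketch of that setup, assuming hypothetical helper names and an attribute name chosen by the caller. bindAttribLocation only takes effect at the next link, so it has to come before linkProgram:

```javascript
// Hypothetical helper: pin the named attribute to location 0, then link.
function linkWithAttribZero(gl, prog, attribName) {
  // Must happen before linkProgram to affect the linked program.
  gl.bindAttribLocation(prog, 0, attribName);
  gl.linkProgram(prog);
}

// Hypothetical helper: make attrib 0 array-enabled before drawing.
function enableAttribZero(gl) {
  gl.enableVertexAttribArray(0);
}
```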

Estimate a per-pixel VRAM Budget

WebGL doesn't offer APIs to query the maximum amount of video memory on the system because such queries are not portable. Still, applications must be conscious of VRAM usage and not just allocate as much as possible.

One technique pioneered by the Google Maps team is the notion of a per-pixel VRAM budget:

1) For one system (e.g. a particular desktop / laptop), decide the maximum amount of VRAM your application should use.
2) Compute the number of pixels covered by a maximized browser window. E.g. (window.innerWidth * devicePixelRatio) * (window.innerHeight * window.devicePixelRatio)
3) The per-pixel VRAM budget is (1) divided by (2), and is a constant.

This constant should generally be portable among systems. Mobile devices typically have smaller screens than powerful desktop machines with large monitors. Re-compute this constant on a few target systems to get a reliable estimate.

Now adjust all internal caching in the application (WebGLBuffers, WebGLTextures, etc.) to obey a maximum size, computed by this constant multiplied by the number of pixels covered by the current browser window. This requires estimating the number of bytes consumed by each texture, for example. The cap also must typically be updated as the browser window resizes, and older resources above the limit must be purged.

Keeping the application's VRAM usage under this cap will help to avoid out-of-memory errors and associated instability.
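The arithmetic can be sketched as follows. The function names and the example numbers are hypothetical; only the formula comes from the steps above:

```javascript
// Step 3: derive the per-pixel constant from a reference system.
function perPixelBudget(maxVramBytes, refWidthPx, refHeightPx) {
  return maxVramBytes / (refWidthPx * refHeightPx);
}

// Cap for the current window: constant multiplied by covered device pixels.
function vramCapBytes(bytesPerPixel, cssWidth, cssHeight, devicePixelRatio) {
  const pixels = (cssWidth * devicePixelRatio) * (cssHeight * devicePixelRatio);
  return Math.floor(bytesPerPixel * pixels);
}
```

In a browser, the cap would be recomputed on resize with window.innerWidth, window.innerHeight, and window.devicePixelRatio as inputs, and caches trimmed whenever they exceed it.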

Consider rendering to a smaller backbuffer size

A common (and easy) way to trade off quality for speed is rendering into a smaller backbuffer, and upscaling the result. Consider reducing canvas.width and height and keeping canvas.style.width and height at a constant size.
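A minimal sketch, assuming a hypothetical helper setRenderScale where scale is a factor such as 0.5 for half resolution:

```javascript
// Hypothetical helper: keep the CSS size fixed while shrinking the backbuffer.
function setRenderScale(canvas, cssWidth, cssHeight, scale) {
  // The displayed size stays constant...
  canvas.style.width = cssWidth + 'px';
  canvas.style.height = cssHeight + 'px';
  // ...while the backbuffer is rendered at a reduced resolution.
  canvas.width = Math.max(1, Math.round(cssWidth * scale));
  canvas.height = Math.max(1, Math.round(cssHeight * scale));
}
```

After resizing, the viewport would typically be updated as well, for example with gl.viewport(0, 0, gl.drawingBufferWidth, gl.drawingBufferHeight).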

Batch draw calls (prefer fewer-but-larger draw calls)

Fewer, larger draw operations will generally improve performance. If you have 1000 sprites to paint, try to do it as a single drawArrays() or drawElements() call.

It's common to use "degenerate triangles" if you need to draw discontinuous objects as a single drawArrays(TRIANGLE_STRIP) call. Degenerate triangles are triangles with no area, therefore any triangle where more than one point is in the same exact location. These triangles are effectively skipped, which lets you start a new triangle strip unattached to your previous one, without having to split into multiple draw calls.
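As an illustration, here is one way to join two triangle strips given as plain index arrays (for use with drawElements). The helper name joinStrips is hypothetical; the same idea applies to duplicating vertices for drawArrays:

```javascript
// Hypothetical helper: join two triangle-strip index lists into one strip
// by inserting repeated indices, which produce zero-area triangles.
function joinStrips(a, b) {
  if (a.length === 0) return b.slice();
  if (b.length === 0) return a.slice();
  // Repeat the last index of `a` and the first index of `b`.
  const bridge = [a[a.length - 1], b[0]];
  // Strip winding alternates per triangle, so `b` must start at an even
  // position to keep its original winding; add one more repeat if needed.
  if (a.length % 2 === 1) bridge.push(b[0]);
  return a.concat(bridge, b);
}
```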

Another important method for batching is texture atlasing, where multiple images are placed into a single texture, often like a checkerboard. Since you need to split draw call batches to change textures, texture atlasing lets you combine more draw calls into fewer, bigger batches.
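For a checkerboard-style atlas, the texture coordinates of each cell follow directly from its index. This sketch assumes a uniform grid with no padding between cells, and atlasCellUv is a hypothetical name:

```javascript
// Hypothetical helper: UV rectangle of one cell in a uniform grid atlas.
// Cells are numbered row by row, starting at 0.
function atlasCellUv(cellIndex, columns, rows) {
  const col = cellIndex % columns;
  const row = Math.floor(cellIndex / columns);
  return {
    u0: col / columns,
    v0: row / rows,
    u1: (col + 1) / columns,
    v1: (row + 1) / rows,
  };
}
```

Each sprite's vertices then carry the UVs of its own cell, so sprites using different images can share one texture binding and one draw call.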

Shaders, Programs, and GLSL

Avoid "#ifdef GL_ES", which is always true

You should never use #ifdef GL_ES in your WebGL shaders; although some early examples used this, it's not necessary, since this condition is always true in WebGL shaders.

Prefer doing more work in vertex (not fragment) shaders

Do as much as you can in the vertex shader, rather than in the fragment shader. This is because per draw call, fragment shaders generally run many more times than vertex shaders. Any calculation that can be done on the vertices and then just interpolated among fragments (via varyings) is a performance boon. (The interpolation of varyings is very cheap, and is done automatically for you through the fixed functionality rasterization phase of the graphics pipeline)

For example, a simple animation of a textured surface can be achieved through a time-dependent transformation of texture coordinates. (The simplest case being adding a uniform vector to the texture coordinates attribute vector) If visually acceptable, one can transform the texture coordinates in the vertex shader rather than in the fragment shader, to get better performance.

One common trade-off is to do some lighting calculations per-vertex instead of per-fragment (pixel). In some cases, especially with simple models or dense vertices, this looks good enough.

The inversion of this is if a model has more vertices than pixels in the rendered output. However, LOD meshes are usually the answer to this problem; moving work from the vertex shader to the fragment shader rarely is.

It's tempting to compile shaders and link programs serially, but many browsers can compile and link in parallel on background threads.

Instead of:

function compileOnce(gl, shader) {
  if (shader.compiled) return;
  gl.compileShader(shader);
  shader.compiled = true;
}
for (const [vs, fs, prog] of programs) {
  compileOnce(gl, vs);
  compileOnce(gl, fs);
  gl.linkProgram(prog);
  if (!gl.getProgramParameter(prog, gl.LINK_STATUS)) {
    console.error('Link failed: ' + gl.getProgramInfoLog(prog));
    console.error('vs info-log: ' + gl.getShaderInfoLog(vs));
    console.error('fs info-log: ' + gl.getShaderInfoLog(fs));
  }
}

Consider:

function compileOnce(gl, shader) {
  if (shader.compiled) return;
  gl.compileShader(shader);
  shader.compiled = true;
}
for (const [vs, fs, prog] of programs) {
  compileOnce(gl, vs);
  compileOnce(gl, fs);
}
for (const [vs, fs, prog] of programs) {
  gl.linkProgram(prog);
}
for (const [vs, fs, prog] of programs) {
  if (!gl.getProgramParameter(prog, gl.LINK_STATUS)) {
    console.error('Link failed: ' + gl.getProgramInfoLog(prog));
    console.error('vs info-log: ' + gl.getShaderInfoLog(vs));
    console.error('fs info-log: ' + gl.getShaderInfoLog(fs));
  }
}

While we've described a pattern to allow browsers to compile and link in parallel, normally checking COMPILE_STATUS or LINK_STATUS blocks until the compile or link completes. In browsers where it's available, the KHR_parallel_shader_compile extension provides a non-blocking COMPLETION_STATUS query.

Example usage:

const ext = gl.getExtension('KHR_parallel_shader_compile');
gl.compileShader(vs);
gl.compileShader(fs);
gl.attachShader(prog, vs);
gl.attachShader(prog, fs);
gl.linkProgram(prog);

// Store program in your data structure.
// Later, for example the next frame:

if (ext) {
  if (gl.getProgramParameter(prog, ext.COMPLETION_STATUS_KHR)) {
    // Check program link status; if OK, use and draw with it.
  }
} else {
  // Program linking is synchronous.
  // Check program link status; if OK, use and draw with it.
}

This technique may not work in all applications, for example those which require programs to be immediately available for rendering. Still, consider how variations may work.
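One possible variation is to keep a list of pending programs and poll it once per frame. This is a sketch under assumptions: isProgramReady, pollPrograms, and the onReady callback are hypothetical names, and ext is the result of getExtension('KHR_parallel_shader_compile') (null when unsupported):

```javascript
// Hypothetical helper: non-blocking readiness check when the extension exists.
function isProgramReady(gl, ext, prog) {
  // Without the extension there is no non-blocking query; report "ready"
  // and accept that the LINK_STATUS query below may block.
  if (ext === null) return true;
  return gl.getProgramParameter(prog, ext.COMPLETION_STATUS_KHR) === true;
}

// Hypothetical helper: call once per frame; returns programs still pending.
function pollPrograms(gl, ext, pending, onReady) {
  const stillPending = [];
  for (const prog of pending) {
    if (!isProgramReady(gl, ext, prog)) {
      stillPending.push(prog);
      continue;
    }
    // Linking has finished, so this query no longer waits on the compiler.
    onReady(prog, gl.getProgramParameter(prog, gl.LINK_STATUS));
  }
  return stillPending;
}
```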

Don't check shader compile status until linking fails

There are very few errors that are guaranteed to cause shader compilation failure, but cannot be deferred to link time. The ESSL3 spec says this under "Error Handling":

The implementation should report errors as early as possible but in any case must satisfy the following:

  • All lexical, grammatical and semantic errors must have been detected following a call to glLinkProgram
  • Errors due to mismatch between the vertex and fragment shader (link errors) must have been detected following a call to glLinkProgram
  • Errors due to exceeding resource limits must have been detected following any draw call or a call to glValidateProgram
  • A call to glValidateProgram must report all errors associated with a program object given the current GL state.

The allocation of tasks between the compiler and linker is implementation dependent. Consequently there are many errors which may be detected either at compile or link time, depending on the implementation.

Additionally, querying compile status is a synchronous call, which breaks pipelining.

Instead of:

gl.compileShader(vs);
if (!gl.getShaderParameter(vs, gl.COMPILE_STATUS)) {
  console.error('vs compile failed: ' + gl.getShaderInfoLog(vs));
}

gl.compileShader(fs);
if (!gl.getShaderParameter(fs, gl.COMPILE_STATUS)) {
  console.error('fs compile failed: ' + gl.getShaderInfoLog(fs));
}

gl.linkProgram(prog);
if (!gl.getProgramParameter(prog, gl.LINK_STATUS)) {
  console.error('Link failed: ' + gl.getProgramInfoLog(prog));
}

Consider:

gl.compileShader(vs);
gl.compileShader(fs);
gl.linkProgram(prog);
if (!gl.getProgramParameter(prog, gl.LINK_STATUS)) {
  console.error('Link failed: ' + gl.getProgramInfoLog(prog));
  console.error('vs info-log: ' + gl.getShaderInfoLog(vs));
  console.error('fs info-log: ' + gl.getShaderInfoLog(fs));
}

Be precise with GLSL variable precision annotations

If you expect to pass an essl300 int between shaders, and you need it to have 32-bits, you must use highp or you will have portability problems. (Works on Desktop, not on Android)

If you have a float texture, iOS requires that you use highp sampler2D foo;, or it will very painfully give you lowp texture samples! (+/-2.0 max is probably not good enough for you)

Implicit defaults

The vertex language has the following predeclared globally scoped default precision statements:

precision highp float;
precision highp int;
precision lowp sampler2D;
precision lowp samplerCube;

The fragment language has the following predeclared globally scoped default precision statements:

precision mediump int;
precision lowp sampler2D;
precision lowp samplerCube;

In WebGL 1, "highp float" support is optional in fragment shaders

Using highp precision unconditionally in fragment shaders will prevent your content from working on some older mobile hardware.

You can use mediump float instead, but be aware that this often results in corrupted rendering due to lack of precision, particularly on mobile systems, even though the corruption will not be visible on a typical desktop computer.

If you know your precision requirements, getShaderPrecisionFormat() will tell you what the system supports.
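For instance, a sketch of a check for highp float support in fragment shaders. The helper name is hypothetical; it relies on getShaderPrecisionFormat reporting a precision of 0 when the format is not supported:

```javascript
// Hypothetical helper: true if fragment shaders support highp float.
function fragmentHighpSupported(gl) {
  const fmt = gl.getShaderPrecisionFormat(gl.FRAGMENT_SHADER, gl.HIGH_FLOAT);
  // An unsupported format is reported with range and precision of 0.
  return fmt !== null && fmt.precision > 0;
}
```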

If highp float is available, GL_FRAGMENT_PRECISION_HIGH will be defined as 1.

A good pattern for "always give me the highest precision":

#ifdef GL_FRAGMENT_PRECISION_HIGH
precision highp float;
#else
precision mediump float;
#endif

ESSL100 minimum requirements (WebGL 1)

    float     think                  range            min above zero   precision
    highp     float24*               (-2^62, 2^62)    2^-62            2^-16 relative
    mediump   IEEE float16           (-2^14, 2^14)    2^-14            2^-10 relative
    lowp      10-bit signed fixed    (-2, 2)          2^-8             2^-8 absolute

    int       think    range
    highp     int17    (-2^16, 2^16)
    mediump   int11    (-2^10, 2^10)
    lowp      int9     (-2^8, 2^8)

*float24: sign bit, 7-bit for exponent, 16-bit for mantissa

ESSL300 minimum requirements (WebGL 2)

    float     think                  range              min above zero   precision
    highp     IEEE float32           (-2^126, 2^127)    2^-126           2^-24 relative
    mediump   IEEE float16           (-2^14, 2^14)      2^-14            2^-10 relative
    lowp      10-bit signed fixed    (-2, 2)            2^-8             2^-8 absolute

    (u)int    think       int range        unsigned int range
    highp     (u)int32    [-2^31, 2^31]    [0, 2^32]
    mediump   (u)int16    [-2^15, 2^15]    [0, 2^16]
    lowp      (u)int9     [-2^8, 2^8]      [0, 2^9]

Prefer builtins like dot, mix, and normalize instead of building your own

At best, custom implementations of builtins might run as fast as the builtins they replace, but don't expect them to. Hardware often has hyper-optimized or even specialized instructions for builtins, and the compiler can't reliably replace your custom builtin-replacements with the special builtin codepaths.

Textures

Use mipmaps for any texture you'll see in 3d!

When in doubt, call generateMipmap() after texture uploads. Mipmaps are cheap on memory (only 30% overhead) while providing often-large performance advantages when textures are "zoomed out" or generally downscaled in the distance in 3d, or even for cube-maps!

It's quicker to sample from smaller texture images due to better inherent texture fetch cache locality: Zooming out on a non-mipmapped texture ruins texture fetch cache locality, because neighboring pixels no longer sample from neighboring texels!
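The memory overhead of a full mip chain can be checked with a little arithmetic: each level has about a quarter of the texels of the previous one, so the whole chain is roughly 4/3 of the base level. A sketch, with mipChainBytes as a hypothetical helper for uncompressed 2D textures:

```javascript
// Hypothetical helper: total bytes of a full mip chain for an uncompressed
// 2D texture, halving each dimension (rounding down, minimum 1) per level.
function mipChainBytes(width, height, bytesPerPixel) {
  let total = 0;
  let w = width;
  let h = height;
  while (true) {
    total += w * h * bytesPerPixel;
    if (w === 1 && h === 1) break;
    w = Math.max(1, w >> 1);
    h = Math.max(1, h >> 1);
  }
  return total;
}
```

For a 256x256 RGBA8 texture this gives 349524 bytes against 262144 bytes for the base level alone, a ratio of about 1.33.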

However, for 2d resources that are never "zoomed out", don't pay the 30% memory surcharge for mipmaps:

const tex = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, tex);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR); // Defaults to NEAREST_MIPMAP_LINEAR, for mipmapping!

(In WebGL 2, you should just use texStorage with levels=1)

One caveat: generateMipmap only works if you would be able to render into the texture if you attached it to a framebuffer. (The spec calls this "color-renderable formats".) For example, if a system supports float-textures but not render-to-float, generateMipmap will fail for float formats.

Support for float textures doesn't mean you can render into them!

There are many, many systems that support RGBA32F textures, but if you attach one to a framebuffer you'll get FRAMEBUFFER_INCOMPLETE_ATTACHMENT from checkFramebufferStatus(). It may work on your system, but most mobile systems will not support it!

On WebGL 1, use the EXT_color_buffer_half_float and WEBGL_color_buffer_float extensions to check for render-to-float-texture support for float16 and float32 respectively.

On WebGL 2, EXT_color_buffer_float is your check for render-to-float-texture support for both float32 and float16.
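These checks can be wrapped in one place. The sketch below assumes a hypothetical helper floatRenderSupport and a caller-supplied isWebGL2 flag. On WebGL 1, creating the float textures themselves additionally requires OES_texture_float or OES_texture_half_float, which this sketch does not check:

```javascript
// Hypothetical helper: report render-to-float-texture support per format.
function floatRenderSupport(gl, isWebGL2) {
  if (isWebGL2) {
    // One extension covers both float16 and float32 render targets.
    const both = gl.getExtension('EXT_color_buffer_float') !== null;
    return { float16: both, float32: both };
  }
  return {
    float16: gl.getExtension('EXT_color_buffer_half_float') !== null,
    float32: gl.getExtension('WEBGL_color_buffer_float') !== null,
  };
}
```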

Render-to-float32 doesn't imply float32-blending!

It may work on your system, but on many others it won't. Avoid it if you can. Check for the EXT_float_blend extension to test for support.

Float16-blending is always supported.

Some formats (e.g. RGB) on some systems are emulated

A number of formats (particularly three-channel formats) are emulated. For example, RGB32F is often actually RGBA32F, and Luminance8 may actually be RGBA8. RGB8 in particular is often surprisingly slow, as masking out the alpha channel and/or patching blend functions has fairly high overhead. Prefer to use RGBA8 and ignore the alpha yourself for better performance.

Consider compressed texture formats

While JPG and PNG are generally smaller over-the-wire, GPU compressed texture formats are smaller in GPU memory, and are faster to sample from. (This reduces texture memory bandwidth, which is precious on mobile.) However, compressed texture formats have worse quality than JPG, and are generally only acceptable for colors (not e.g. normals or coordinates).

Unfortunately, there's no single universally supported format. Every system has at least one of the following though:

  • WEBGL_compressed_texture_s3tc (desktop)
  • WEBGL_compressed_texture_etc1 (Android)
  • WEBGL_compressed_texture_pvrtc (iOS)

WebGL 2 has universal support by combining:

  • WEBGL_compressed_texture_s3tc (desktop)
  • WEBGL_compressed_texture_etc (mobile)

WEBGL_compressed_texture_astc offers higher quality and/or higher compression, but is only supported on newer hardware.
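Selecting a format at runtime can be sketched as a probe over the extensions listed above. The preference order here is an assumption for illustration, and pickCompressedTextureExtension is a hypothetical helper:

```javascript
// Assumed preference order; adjust to the formats your assets ship in.
const COMPRESSED_EXTENSIONS = [
  'WEBGL_compressed_texture_astc',
  'WEBGL_compressed_texture_s3tc',
  'WEBGL_compressed_texture_etc',
  'WEBGL_compressed_texture_etc1',
  'WEBGL_compressed_texture_pvrtc',
];

// Hypothetical helper: return the first supported extension, or null.
function pickCompressedTextureExtension(gl) {
  for (const name of COMPRESSED_EXTENSIONS) {
    const ext = gl.getExtension(name);
    if (ext !== null) return { name, ext };
  }
  return null; // fall back to uncompressed textures
}
```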

Basis Universal texture compression format/library

Basis Universal solves several of the issues mentioned above. It offers a way to support all common compressed texture formats with a single compressed texture file, through a JavaScript library that efficiently converts formats at load time. It also adds additional compression that makes Basis Universal compressed texture files much smaller than regular compressed textures over-the-wire, more comparable to JPEG.

https://github.com/BinomialLLC/basis_universal/blob/master/webgl/README.md

Memory usage of depth and stencil formats

Depth and stencil attachments and formats are actually inseparable on many devices. You may ask for DEPTH_COMPONENT24 or STENCIL_INDEX8, but you're often getting D24X8 and X24S8 32bpp formats behind the scenes. Assume that the memory usage of depth and stencil formats is rounded up to the nearest four bytes.

texImage/texSubImage uploads (particularly with videos) can cause pipeline flushes

Most texture uploads from DOM elements will incur a processing pass that will temporarily switch GL programs internally, causing a pipeline flush. (Pipelines are formalized explicitly in Vulkan et al., but are implicit behind the scenes in OpenGL and WebGL. Pipelines are more or less the tuple of shader program and depth/stencil/multisample/blend/rasterization state.)

In WebGL:

    ...
    useProgram(prog1)
<pipeline flush>
    bindFramebuffer(target)
    drawArrays()
    bindTexture(webgl_texture)
    texImage2D(HTMLVideoElement)
    drawArrays()
    ...

Behind the scenes in the browser:

    ...
    useProgram(prog1)
<pipeline flush>
    bindFramebuffer(target)
    drawArrays()
    bindTexture(webgl_texture)
    -texImage2D(HTMLVideoElement):
        +useProgram(_internal_tex_tranform_prog)
<pipeline flush>
        +bindFramebuffer(webgl_texture._internal_framebuffer)
        +bindTexture(HTMLVideoElement._internal_video_tex)
        +drawArrays() // y-flip/colorspace-transform/alpha-(un)premultiply
        +bindTexture(webgl_texture)
        +bindFramebuffer(target)
        +useProgram(prog1)
<pipeline flush>
    drawArrays()
    ...

Prefer doing uploads before starting drawing, or at least between pipelines:

In WebGL:

    ...
    bindTexture(webgl_texture)
    texImage2D(HTMLVideoElement)
    useProgram(prog1)
<pipeline flush>
    bindFramebuffer(target)
    drawArrays()
    bindTexture(webgl_texture)
    drawArrays()
    ...

Behind the scenes in the browser:

    ...
    bindTexture(webgl_texture)
    -texImage2D(HTMLVideoElement):
        +useProgram(_internal_tex_tranform_prog)
<pipeline flush>
        +bindFramebuffer(webgl_texture._internal_framebuffer)
        +bindTexture(HTMLVideoElement._internal_video_tex)
        +drawArrays() // y-flip/colorspace-transform/alpha-(un)premultiply
        +bindTexture(webgl_texture)
        +bindFramebuffer(target)
    useProgram(prog1)
<pipeline flush>
    bindFramebuffer(target)
    drawArrays()
    bindTexture(webgl_texture)
    drawArrays()
    ...

WebGL 2

Use texStorage to create textures

The texImage* API lets you define each mip level independently and at any size. Even mismatched mip sizes are not an error until draw time, which means the driver cannot actually prepare the texture in GPU memory until the first time the texture is drawn.

Further, some drivers might unconditionally allocate the whole mip-chain (+30% memory!) even if you only want a single level.

So, prefer texStorage+texSubImage for textures in WebGL 2.
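A minimal sketch of that pattern for a single-level RGBA8 texture, assuming a WebGL 2 context and a hypothetical helper createImmutableTexture whose pixels argument is a Uint8Array of width * height * 4 bytes:

```javascript
// Hypothetical helper: allocate immutable storage once, then upload into it.
function createImmutableTexture(gl, width, height, pixels) {
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
  // levels = 1: size and format are fixed up front, with no mip chain.
  gl.texStorage2D(gl.TEXTURE_2D, 1, gl.RGBA8, width, height);
  // Fill level 0 of the already-allocated storage.
  gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, width, height,
                   gl.RGBA, gl.UNSIGNED_BYTE, pixels);
  return tex;
}
```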

invalidateFramebuffer

Storing data that you won't use again can have high cost, particularly on tiled-rendering GPUs common on mobile. When you're done with the contents of a framebuffer attachment, use invalidateFramebuffer to discard the data, instead of leaving the driver to waste time storing the data for later use. DEPTH/STENCIL and/or multisampled attachments in particular are great candidates for invalidateFramebuffer.
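As a sketch, discarding depth and stencil after the last draw that needs them might look like this. discardDepthStencil is a hypothetical helper, it acts on the currently bound framebuffer, and the default framebuffer uses different attachment names than framebuffer objects:

```javascript
// Hypothetical helper: tell the driver that depth/stencil contents of the
// currently bound framebuffer are no longer needed.
function discardDepthStencil(gl, isDefaultFramebuffer) {
  const attachments = isDefaultFramebuffer
    ? [gl.DEPTH, gl.STENCIL]             // names for the default framebuffer
    : [gl.DEPTH_STENCIL_ATTACHMENT];     // name for a framebuffer object
  gl.invalidateFramebuffer(gl.FRAMEBUFFER, attachments);
}
```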

Non-blocking async data download/readback

The approach in WebGL 2 is analogous to the approach in OpenGL: https://jdashg.github.io/misc/async-gpu-downloads.html

function clientWaitAsync(gl, sync, flags, interval_ms) {
  return new Promise((resolve, reject) => {
    function test() {
      const res = gl.clientWaitSync(sync, flags, 0);
      if (res == gl.WAIT_FAILED) {
        reject();
        return;
      }
      if (res == gl.TIMEOUT_EXPIRED) {
        setTimeout(test, interval_ms);
        return;
      }
      resolve();
    }
    test();
  });
}

async function getBufferSubDataAsync(
    gl, target, buffer, srcByteOffset, dstBuffer,
    /* optional */ dstOffset, /* optional */ length) {
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  gl.flush();

  await clientWaitAsync(gl, sync, 0, 10);
  gl.deleteSync(sync);

  gl.bindBuffer(target, buffer);
  gl.getBufferSubData(target, srcByteOffset, dstBuffer, dstOffset, length);
  gl.bindBuffer(target, null);

  return dstBuffer;
}

async function readPixelsAsync(gl, x, y, w, h, format, type, dest) {
  const buf = gl.createBuffer();
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buf);
  gl.bufferData(gl.PIXEL_PACK_BUFFER, dest.byteLength, gl.STREAM_READ);
  gl.readPixels(x, y, w, h, format, type, 0);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);

  await getBufferSubDataAsync(gl, gl.PIXEL_PACK_BUFFER, buf, 0, dest);

  gl.deleteBuffer(buf);
  return dest;
}

Some tips are relevant to WebGL, but deal with other APIs.

Use requestPostAnimationFrame not requestAnimationFrame

While it's well-known that apps should use requestAnimationFrame ("RAF") instead of setTimeout (et al) to redraw on-demand, what's less well-known is that non-trivial WebGL apps should often not render within a RAF callback.

RAF callbacks (and their microtasks/promises) are the last JS run at the end of each Browser content frame.

For robust non-trivial (particularly WebGL) content, requestPostAnimationFrame ("RPAF") is the first JS run at the beginning of each Browser content frame. That is, it's the first JS run after RAF callbacks and the Browser content (transaction) presentation step. (RPAF explainer)

This allows as much time as possible for rendering each frame.

devicePixelRatio and high-dpi rendering

Handling devicePixelRatio != 1.0 is tricky. While the common approach is to set canvas.width = width * devicePixelRatio, this will cause moire artifacts with non-integer values of devicePixelRatio, as is common with UI scaling on Windows, as well as zooming on all platforms.

Instead, we can use non-integer values for CSS's top/bottom/left/right to fairly reliably 'pre-snap' our canvas to whole integer device coordinates.

Demo: https://jdashg.github.io/misc/webgl/device-pixel-presnap.html

ResizeObserver and 'device-pixel-content-box'

On supporting browsers (Chromium?), ResizeObserver can be used with 'device-pixel-content-box' to request a callback that includes the true device pixel size of an element. This can be used to build an async-but-accurate function:

window.getDevicePixelSize = window.getDevicePixelSize || async function(elem) {
   return new Promise(fn_resolve => {
      const observer = new ResizeObserver(entries => {
         for (const cur of entries) {
            const dev_size = cur.devicePixelContentBoxSize;
            const ret = {
               width: dev_size[0].inlineSize,
               height: dev_size[0].blockSize,
            };
            fn_resolve(ret);
            observer.disconnect();
            return;
         }
         throw 'device-pixel-content-box not observed for elem ' + elem;
      });
      observer.observe(elem, {box: 'device-pixel-content-box'});
   });
};

Please refer to the specification for more details.

ImageBitmap creation

Using the ImageBitmapOptions dictionary is essential for properly preparing textures for upload to WebGL, but unfortunately there's no obvious way to query exactly which dictionary members are supported by a given browser.

This JSFiddle illustrates how to determine which dictionary members a given browser supports.