panfrost: Rewrite u-interleaving code

Submitted by Alyssa Rosenzweig on June 25, 2019, 6:54 p.m.

Details

Message ID: 20190625185414.16926-1-alyssa.rosenzweig@collabora.com
State: New
Series "panfrost: Rewrite u-interleaving code" ( rev: 1 ) in Mesa

Commit Message

Alyssa Rosenzweig June 25, 2019, 6:54 p.m.
Rather than using a magic lookup table with no explanations, let's add
liberal comments to the code to explain what this tiling scheme is and
how to encode/decode it efficiently.

It's not so mysterious after all -- just reordering bits with some XORs
thrown in.

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Vasily Khoruzhick <anarsoul@gmail.com>
---
 src/panfrost/shared/pan_tiling.c | 330 +++++++++++++++++++++----------
 1 file changed, 229 insertions(+), 101 deletions(-)

diff --git a/src/panfrost/shared/pan_tiling.c b/src/panfrost/shared/pan_tiling.c
index 413cd89420b..de0d6812831 100644
--- a/src/panfrost/shared/pan_tiling.c
+++ b/src/panfrost/shared/pan_tiling.c
@@ -2,6 +2,7 @@ 
  * Copyright (c) 2011-2013 Luc Verhaegen <libv@skynet.be>
  * Copyright (c) 2018 Alyssa Rosenzweig <alyssa@rosenzweig.io>
  * Copyright (c) 2018 Vasily Khoruzhick <anarsoul@gmail.com>
+ * Copyright (c) 2019 Collabora
  *
  * Permission is hereby granted, free of charge, to any person obtaining a
  * copy of this software and associated documentation files (the "Software"),
@@ -26,127 +27,230 @@ 
 
 #include "pan_tiling.h"
 
-uint32_t space_filler[16][16] = {
-   { 0,   1,   4,   5,   16,  17,  20,  21,  64,  65,  68,  69,  80,  81,  84,  85, },
-   { 3,   2,   7,   6,   19,  18,  23,  22,  67,  66,  71,  70,  83,  82,  87,  86, },
-   { 12,  13,  8,   9,   28,  29,  24,  25,  76,  77,  72,  73,  92,  93,  88,  89, },
-   { 15,  14,  11,  10,  31,  30,  27,  26,  79,  78,  75,  74,  95,  94,  91,  90, },
-   { 48,  49,  52,  53,  32,  33,  36,  37,  112, 113, 116, 117, 96,  97,  100, 101, },
-   { 51,  50,  55,  54,  35,  34,  39,  38,  115, 114, 119, 118, 99,  98,  103, 102, },
-   { 60,  61,  56,  57,  44,  45,  40,  41,  124, 125, 120, 121, 108, 109, 104, 105, },
-   { 63,  62,  59,  58,  47,  46,  43,  42,  127, 126, 123, 122, 111, 110, 107, 106, },
-   { 192, 193, 196, 197, 208, 209, 212, 213, 128, 129, 132, 133, 144, 145, 148, 149, },
-   { 195, 194, 199, 198, 211, 210, 215, 214, 131, 130, 135, 134, 147, 146, 151, 150, },
-   { 204, 205, 200, 201, 220, 221, 216, 217, 140, 141, 136, 137, 156, 157, 152, 153, },
-   { 207, 206, 203, 202, 223, 222, 219, 218, 143, 142, 139, 138, 159, 158, 155, 154, },
-   { 240, 241, 244, 245, 224, 225, 228, 229, 176, 177, 180, 181, 160, 161, 164, 165, },
-   { 243, 242, 247, 246, 227, 226, 231, 230, 179, 178, 183, 182, 163, 162, 167, 166, },
-   { 252, 253, 248, 249, 236, 237, 232, 233, 188, 189, 184, 185, 172, 173, 168, 169, },
-   { 255, 254, 251, 250, 239, 238, 235, 234, 191, 190, 187, 186, 175, 174, 171, 170, },
+/* This file implements software encode/decode of the tiling format used for
+ * textures and framebuffers primarily on Utgard GPUs. Names for this format
+ * include "Utgard-style tiling", "(Mali) swizzled textures", and
+ * "U-interleaved" (the former two names being used in the community
+ * Lima/Panfrost drivers; the latter name used internally at Arm).
+ * Conceptually, like any tiling scheme, the pixel reordering exploits 2D
+ * spatial locality, to improve cache behaviour in both the horizontal and
+ * vertical directions.
+ *
+ * This format is tiled: first, the image dimensions must be aligned to 16
+ * pixels in each axis. Once aligned, the image is divided into 16x16 tiles.
+ * This size harmonizes with other properties of the GPU; on Midgard,
+ * framebuffer tiles are logically 16x16 (this is the tile size used in
+ * Transaction Elimination and the minimum tile size used in Hierarchical
+ * Tiling). Conversely, for a standard 4 bytes-per-pixel format (like
+ * RGBA8888), 16 pixels * 4 bytes/pixel = 64 bytes, equal to the cache line
+ * size.
+ *
+ * Within each 16x16 block, the bits are reordered according to this pattern:
+ *
+ * | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
+ *
+ * Basically, interleaving the X and Y bits, with XORs thrown in for every
+ * adjacent bit pair.
+ *
+ * Encoding and decoding this format is cheap in both hardware and software.
+ * In hardware, lines are simply rerouted to reorder the bits and a few XOR
+ * gates are thrown in. Software has to be a bit more clever.
+ *
+ * In software, the trick is to divide the pattern into two lines:
+ *
+ *    | y3 | y3 | y2 | y2 | y1 | y1 | y0 | y0 |
+ *  ^ |  0 | x3 |  0 | x2 |  0 | x1 |  0 | x0 |
+ *
+ * That is, duplicate the bits of the Y and space out the bits of the X. The
+ * top line is a function only of Y, so it can be calculated once per row and
+ * stored in a register. The bottom line is simply X with the bits spaced out.
+ * The initial value can be calculated once at the beginning, and then after
+ * each pixel it can be incremented by subtracting a mask pattern and ANDing with
+ * the mask pattern (abusing carry bits, where the mask pattern is the mask of
+ * bits we care about, so 0b01010101).
+ *
+ * Alternatively, the 16 indices can be precalculated for each row and reused
+ * across the row, which may be amenable to easier vectorization.
+ *
+ * This format is also supported on Midgard GPUs, where it *can* be used for
+ * textures and framebuffers. That said, in practice it is usually used as a
+ * fallback layout; Midgard introduces Arm FrameBuffer Compression, which is
+ * significantly more efficient than Utgard-style tiling and preferred for
+ * both textures and framebuffers where possible. For texture types that AFBC
+ * does not support, for instance sRGB textures and framebuffers, this tiling
+ * scheme is used instead, at a performance penalty.
+ */
+
+/* Given the lower 4 bits of the Y coordinate, we would like to
+ * duplicate every bit over. So instead of 0b1010, we would like
+ * 0b11001100. The idea is that the bits in the Y-only positions get a
+ * Y, and the bits in the XOR positions *also* get a Y. */
+
+uint32_t bit_duplication[16] = {
+   0b00000000,
+   0b00000011,
+   0b00001100,
+   0b00001111,
+   0b00110000,
+   0b00110011,
+   0b00111100,
+   0b00111111,
+   0b11000000,
+   0b11000011,
+   0b11001100,
+   0b11001111,
+   0b11110000,
+   0b11110011,
+   0b11111100,
+   0b11111111,
 };
 
+/* Space out the bits of a 4-bit nibble; doesn't need to be particularly fast
+ * as it only runs once per tiling (and only on the unaligned slow path). Makes
+ * every other bit a zero, basically. */
+
+static inline unsigned
+space_4(unsigned x)
+{
+   return
+         ((x & 0x1) << 0) |
+         ((x & 0x2) << 1) |
+         ((x & 0x4) << 2) |
+         ((x & 0x8) << 3);
+}
+
+/* For the X column, the mask of bits that matter (every other bit; the rest
+ * are zero) */
+#define X_MASK 0b01010101
+
+/* The scheme uses 16x16 tiles */
+
+#define TILE_WIDTH 16
+#define TILE_HEIGHT 16
+#define PIXELS_PER_TILE (TILE_WIDTH * TILE_HEIGHT)
+
+/* An optimized routine to tile an aligned (width & 0xF == 0) bpp4 texture */
+
 static void
 panfrost_store_tiled_image_bpp4(void *dst, const void *src,
                                const struct pipe_box *box,
                                uint32_t dst_stride,
                                uint32_t src_stride)
 {
+   /* Precompute the offset to the beginning of the first horizontal tile we're
+    * writing to, knowing that box->x is 16-aligned. Tiles themselves are
+    * stored linearly, so we get the X tile number by shifting and then
+    * multiplying by the bytes per tile */
+
+   uint8_t *dest_start = dst + ((box->x >> 4) * PIXELS_PER_TILE * 4);
+
+   /* Iterate across the pixels we're trying to store in source-order */
+
    for (int y = box->y, src_y = 0; src_y < box->height; ++y, ++src_y) {
+      /* For each pixel in the destination image, figure out the part
+       * corresponding to the 16x16 block index */
+
       int block_y = y & ~0x0f;
-      int rem_y = y & 0x0F;
-      int block_start_s = block_y * dst_stride;
-      int source_start = src_y * src_stride;
 
-      for (int x = box->x, src_x = 0; src_x < box->width; ++x, ++src_x) {
-         int block_x_s = (x >> 4) * 256;
-         int rem_x = x & 0x0F;
+      /* In pixel coordinates (where the origin is the top-left), (block_y, 0)
+       * is the top-left corner of the leftmost tile in this row. While pixels
+       * are reordered within a block, the blocks themselves are stored
+       * linearly, so multiplying block_y by the pixel stride of the
+       * destination image equals the byte offset of that top-left corner of
+       * the block this row is in */
+
+      uint32_t *dest = (uint32_t *) (dest_start + (block_y * dst_stride));
+
+      /* The source is actually linear, so compute the byte offset to the start
+       * and end of this row in the source */
+
+      const uint32_t *source = src + (y * src_stride) + (box->x * 4);
+      const uint32_t *source_end = source + box->width;
+
+      /* We want to duplicate the bits of the bottom nibble of Y */
+      unsigned expanded_y = bit_duplication[y & 0xF];
+
+      /* Initialize the offset */
+      unsigned space_x = 0;
+
+      /* Iterate the row in source order. The outer loop steps across the
+       * 16-pixel-wide tiles; after each tile, we advance dest by the size of
+       * a full tile in pixels. */
 
-         int index = space_filler[rem_y][rem_x];
-         const uint32_t *source = src + source_start + 4 * src_x;
-         uint32_t *dest = dst + block_start_s + 4 * (block_x_s + index);
+      for (; source < source_end; dest += PIXELS_PER_TILE) {
+         /* Within each tile, we iterate each of the 16 pixels in the row of
+          * the tile. This loop should be unrolled. */
 
-         *dest = *source;
+         for (int i = 0; i < 16; ++i) {
+            /* We have the X component spaced out in space_x and we have the Y
+             * component duplicated. So we just XOR them together. The X bits
+             * get the XOR like the pattern needs. The Y bits are XORing with
+             * zero so this is a no-op */
+
+            unsigned index = expanded_y ^ space_x;
+
+            /* For the X component, we need the bits spaced out. Instead of
+             * 0b1111, we need 0b01010101. To accomplish this, we use the
+             * subtract/and trick, outlined on
+             * https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
+             *
+             * Essentially, we have a mask of the bits that we do care to set
+             * -- 0b01010101 (X_MASK), and a clever use of the carry bit
+             * allows us to increment only the bits we care about. So this is
+             * how we iterate X */
+
+            space_x = (space_x - X_MASK) & X_MASK;
+
+            /* Copy over the pixel */
+            dest[index] = *(source++);
+         }
       }
    }
 }
 
 static void
-panfrost_store_tiled_image_generic(void *dst, const void *src,
+panfrost_access_tiled_image_generic(void *dst, void *src,
                                const struct pipe_box *box,
                                uint32_t dst_stride,
                                uint32_t src_stride,
-                               uint32_t bpp)
+                               uint32_t bpp,
+                               unsigned is_store)
 {
+   unsigned space_x_initial = space_4(box->x);
+
    for (int y = box->y, src_y = 0; src_y < box->height; ++y, ++src_y) {
       int block_y = y & ~0x0f;
-      int rem_y = y & 0x0F;
       int block_start_s = block_y * dst_stride;
       int source_start = src_y * src_stride;
 
-      for (int x = box->x, src_x = 0; src_x < box->width; ++x, ++src_x) {
-         int block_x_s = (x >> 4) * 256;
-         int rem_x = x & 0x0F;
-
-         int index = space_filler[rem_y][rem_x];
-         const uint8_t *src8 = src;
-         const uint8_t *source = &src8[source_start + bpp * src_x];
-         uint8_t *dest = dst + block_start_s + bpp * (block_x_s + index);
-
-         for (int b = 0; b < bpp; ++b)
-            dest[b] = source[b];
-      }
-   }
-}
-
-static void
-panfrost_load_tiled_image_bpp4(void *dst, const void *src,
-                              const struct pipe_box *box,
-                              uint32_t dst_stride,
-                              uint32_t src_stride)
-{
-   for (int y = box->y, dest_y = 0; dest_y < box->height; ++y, ++dest_y) {
-      int block_y = y & ~0x0f;
-      int rem_y = y & 0x0F;
-      int block_start_s = block_y * src_stride;
-      int dest_start = dest_y * dst_stride;
+      unsigned expanded_y = bit_duplication[y & 0xF];
+      unsigned space_x = space_x_initial;
 
-      for (int x = box->x, dest_x = 0; dest_x < box->width; ++x, ++dest_x) {
+      for (int x = box->x, src_x = 0; src_x < box->width; ++x, ++src_x) {
          int block_x_s = (x >> 4) * 256;
-         int rem_x = x & 0x0F;
 
-         int index = space_filler[rem_y][rem_x];
-         uint32_t *dest = dst + dest_start + 4 * dest_x;
-         const uint32_t *source = src + block_start_s + 4 * (block_x_s + index);
+         unsigned index = expanded_y ^ space_x;
+         space_x = (space_x - X_MASK) & X_MASK;
 
-         *dest = *source;
-      }
-   }
-}
+         uint8_t *src8 = src;
+         uint8_t *source = &src8[source_start + bpp * src_x];
+         uint8_t *dest = dst + block_start_s + bpp * (block_x_s + index);
 
-static void
-panfrost_load_tiled_image_generic(void *dst, const void *src,
-                              const struct pipe_box *box,
-                              uint32_t dst_stride,
-                              uint32_t src_stride,
-                              uint32_t bpp)
-{
-   for (int y = box->y, dest_y = 0; dest_y < box->height; ++y, ++dest_y) {
-      int block_y = y & ~0x0f;
-      int rem_y = y & 0x0F;
-      int block_start_s = block_y * src_stride;
-      int dest_start = dest_y * dst_stride;
+         uint8_t *out = is_store ? dest : source;
+         uint8_t *in = is_store ? source : dest;
 
-      for (int x = box->x, dest_x = 0; dest_x < box->width; ++x, ++dest_x) {
-         int block_x_s = (x >> 4) * 256;
-         int rem_x = x & 0x0F;
+         /* Write out 1-4 bytes. Written like this rather than a loop so the
+          * compiler doesn't need to do branching (just some predication) */
 
-         int index = space_filler[rem_y][rem_x];
-         uint8_t *dst8 = dst;
-         uint8_t *dest = &dst8[dest_start + bpp * dest_x];
-         const uint8_t *source = src + block_start_s + bpp * (block_x_s + index);
-
-         for (int b = 0; b < bpp; ++b)
-            dest[b] = source[b];
+         out[0] = in[0];
+         if (bpp > 1) {
+            out[1] = in[1];
+            if (bpp > 2) {
+               out[2] = in[2];
+               if (bpp > 3)
+                  out[3] = in[3];
+            }
+         }
       }
    }
 }
@@ -158,13 +262,43 @@  panfrost_store_tiled_image(void *dst, const void *src,
                            uint32_t src_stride,
                            uint32_t bpp)
 {
-	switch (bpp) {
-	case 4:
-		panfrost_store_tiled_image_bpp4(dst, src, box, dst_stride, src_stride);
-		break;
-	default:
-		panfrost_store_tiled_image_generic(dst, src, box, dst_stride, src_stride, bpp);
-	}
+   /* We handle unaligned writes by breaking them up into an aligned write
+    * and a small generic unaligned write */
+   if (box->x & 0xF) {
+      /* Split up writes so we can get X tile aligned for the larger (major) write */
+      const struct pipe_box major = {
+         .x = (box->x & ~0xF) + 0x10,
+         .y = box->y,
+         .width = box->width - (box->x & 0xF),
+         .height = box->height
+      };
+
+      const struct pipe_box minor = {
+         .x = box->x,
+         .y = box->y,
+         .width = 0x10 - (box->x & 0xF),
+         .height = box->height
+      };
+
+      if (box->x > 0x10) {
+         panfrost_store_tiled_image(dst, src, &minor, dst_stride, src_stride, bpp);
+         panfrost_store_tiled_image(dst, src, &major, dst_stride, src_stride, bpp);
+      } else {
+         panfrost_access_tiled_image_generic(dst, (void *) src, box, dst_stride, src_stride, bpp, 1);
+      }
+
+      return;
+   }
+
+   /* From here, we can guarantee tile alignment of the X */
+
+     switch (bpp) {
+     case 4:
+             panfrost_store_tiled_image_bpp4(dst, (void *) src, box, dst_stride, src_stride);
+             break;
+     default:
+             panfrost_access_tiled_image_generic(dst, (void *) src, box, dst_stride, src_stride, bpp, 1);
+     }
 }
 
 void
@@ -174,11 +308,5 @@  panfrost_load_tiled_image(void *dst, const void *src,
                            uint32_t src_stride,
                            uint32_t bpp)
 {
-	switch (bpp) {
-	case 4:
-		panfrost_load_tiled_image_bpp4(dst, src, box, dst_stride, src_stride);
-		break;
-	default:
-		panfrost_load_tiled_image_generic(dst, src, box, dst_stride, src_stride, bpp);
-	}
+   panfrost_access_tiled_image_generic((void *) src, dst, box, dst_stride, src_stride, bpp, 0);
 }
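
For readers who want to sanity-check the pattern against the old lookup table,
here is a minimal standalone sketch of the index computation described in the
new comment block -- the naive per-pixel form rather than the optimized loops
in the patch (the function name is made up for illustration):

#include <stdint.h>

/* Compute a pixel's offset (0..255) within a 16x16 u-interleaved tile:
 * bit (2i + 1) of the index is y_i and bit (2i) is x_i ^ y_i, matching
 * | y3 | x3^y3 | y2 | x2^y2 | y1 | x1^y1 | y0 | x0^y0 |. */
static uint32_t
u_interleave_index(uint32_t x, uint32_t y)
{
   uint32_t index = 0;

   for (unsigned bit = 0; bit < 4; ++bit) {
      uint32_t xb = (x >> bit) & 1;
      uint32_t yb = (y >> bit) & 1;

      index |= yb << (2 * bit + 1);
      index |= (xb ^ yb) << (2 * bit);
   }

   return index;
}

/* e.g. u_interleave_index(1, 0) == 1 and u_interleave_index(0, 1) == 3,
 * matching space_filler[0][1] and space_filler[1][0] in the old table. */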

Comments

Hi Alyssa,

On Tue, 25 Jun 2019 at 19:54, Alyssa Rosenzweig
<alyssa.rosenzweig@collabora.com> wrote:
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2011-2013 Luc Verhaegen <libv@skynet.be>
>   * Copyright (c) 2018 Alyssa Rosenzweig <alyssa@rosenzweig.io>
>   * Copyright (c) 2018 Vasily Khoruzhick <anarsoul@gmail.com>
> + * Copyright (c) 2019 Collabora

Please use 'Collabora, Ltd.' as that's our full legal form.

Afraid I have no feelings on the actual code however. ;)

Cheers,
Daniel
On Tue, Jun 25, 2019 at 11:54 AM Alyssa Rosenzweig
<alyssa.rosenzweig@collabora.com> wrote:

Hi Alyssa,

> Rather than using a magic lookup table with no explanations, let's add
> liberal comments to the code to explain what this tiling scheme is and
> how to encode/decode it efficiently.
>
> It's not so mysterious after all -- just reordering bits with some XORs
> thrown in.
>
> Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
> Cc: Vasily Khoruzhick <anarsoul@gmail.com>
> ---
>  src/panfrost/shared/pan_tiling.c | 330 +++++++++++++++++++++----------
>  1 file changed, 229 insertions(+), 101 deletions(-)
>
> diff --git a/src/panfrost/shared/pan_tiling.c b/src/panfrost/shared/pan_tiling.c
> index 413cd89420b..de0d6812831 100644
> --- a/src/panfrost/shared/pan_tiling.c
> +++ b/src/panfrost/shared/pan_tiling.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2011-2013 Luc Verhaegen <libv@skynet.be>
>   * Copyright (c) 2018 Alyssa Rosenzweig <alyssa@rosenzweig.io>
>   * Copyright (c) 2018 Vasily Khoruzhick <anarsoul@gmail.com>
> + * Copyright (c) 2019 Collabora
>   *
>   * Permission is hereby granted, free of charge, to any person obtaining a
>   * copy of this software and associated documentation files (the "Software"),
> @@ -26,127 +27,230 @@
>
>  #include "pan_tiling.h"
>
> -uint32_t space_filler[16][16] = {
> -   { 0,   1,   4,   5,   16,  17,  20,  21,  64,  65,  68,  69,  80,  81,  84,  85, },
> -   { 3,   2,   7,   6,   19,  18,  23,  22,  67,  66,  71,  70,  83,  82,  87,  86, },
> -   { 12,  13,  8,   9,   28,  29,  24,  25,  76,  77,  72,  73,  92,  93,  88,  89, },
> -   { 15,  14,  11,  10,  31,  30,  27,  26,  79,  78,  75,  74,  95,  94,  91,  90, },
> -   { 48,  49,  52,  53,  32,  33,  36,  37,  112, 113, 116, 117, 96,  97,  100, 101, },
> -   { 51,  50,  55,  54,  35,  34,  39,  38,  115, 114, 119, 118, 99,  98,  103, 102, },
> -   { 60,  61,  56,  57,  44,  45,  40,  41,  124, 125, 120, 121, 108, 109, 104, 105, },
> -   { 63,  62,  59,  58,  47,  46,  43,  42,  127, 126, 123, 122, 111, 110, 107, 106, },
> -   { 192, 193, 196, 197, 208, 209, 212, 213, 128, 129, 132, 133, 144, 145, 148, 149, },
> -   { 195, 194, 199, 198, 211, 210, 215, 214, 131, 130, 135, 134, 147, 146, 151, 150, },
> -   { 204, 205, 200, 201, 220, 221, 216, 217, 140, 141, 136, 137, 156, 157, 152, 153, },
> -   { 207, 206, 203, 202, 223, 222, 219, 218, 143, 142, 139, 138, 159, 158, 155, 154, },
> -   { 240, 241, 244, 245, 224, 225, 228, 229, 176, 177, 180, 181, 160, 161, 164, 165, },
> -   { 243, 242, 247, 246, 227, 226, 231, 230, 179, 178, 183, 182, 163, 162, 167, 166, },
> -   { 252, 253, 248, 249, 236, 237, 232, 233, 188, 189, 184, 185, 172, 173, 168, 169, },
> -   { 255, 254, 251, 250, 239, 238, 235, 234, 191, 190, 187, 186, 175, 174, 171, 170, },
> +/* This file implements software encode/decode of the tiling format used for
> + * textures and framebuffers primarily on Utgard GPUs. Names for this format
> + * include "Utgard-style tiling", "(Mali) swizzled textures", and
> + * "U-interleaved" (the former two names being used in the community
> + * Lima/Panfrost drivers; the latter name used internally at Arm).
> + * Conceptually, like any tiling scheme, the pixel reordering attempts to 2D
> + * spatial locality, to improve cache locality in both horizontal and vertical
> + * directions.
> + *
> + * This format is tiled: first, the image dimensions must be aligned to 16
> + * pixels in each axis. Once aligned, the image is divided into 16x16 tiles.
> + * This size harmonizes with other properties of the GPU; on Midgard,
> + * framebuffer tiles are logically 16x16 (this is the tile size used in
> + * Transaction Elimination and the minimum tile size used in Hierarchical
> + * Tiling). Conversely, for a standard 4 bytes-per-pixel format (like
> + * RGBA8888), 16 pixels * 4 bytes/pixel = 64 bytes, equal to the cache line
> + * size.
> + *
> + * Within each 16x16 block, the bits are reordered according to this pattern:
> + *
> + * | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^ x0) |
> + *
> + * Basically, interleaving the X and Y bits, with XORs thrown in for every
> + * adjacent bit pair.
> + *
> + * This is cheap to implement both encode/decode in both hardware and software.
> + * In hardware, lines are simply rerouted to reorder and some XOR gates are
> + * thrown in. Software has to be a bit more clever.
> + *
> + * In software, the trick is to divide the pattern into two lines:
> + *
> + *    | y3 | y3 | y2 | y2 | y1 | y1 | y0 | y0 |
> + *  ^ |  0 | x3 |  0 | x2 |  0 | x1 |  0 | x0 |
> + *
> + * That is, duplicate the bits of the Y and space out the bits of the X. The
> + * top line is a function only of Y, so it can be calculated once per row and
> + * stored in a register. The bottom line is simply X with the bits spaced out.
> + * The initial value can be calculated once at the beginning, and then after
> + * each pixel it can incremented by subtracting a mask pattern and ANDing with
> + * the mask pattern (abusing carry bits, where the mask pattern is the mask of
> + * bits we care about, so 0b01010101).
> + *
> + * Alternatively, the 16 indices can be precalculated for each row and reused
> + * across the row, which may be amenable to easier vectorization.
> + *
> + * This format is also supported on Midgard GPUs, where it *can* be used for
> + * textures and framebuffers. That said, in practice it is usually as a
> + * fallback layout; Midgard introduces Arm FrameBuffer Compression, which is
> + * significantly more efficient than Utgard-style tiling and preferred for both
> + * textures and framebuffers, where possible. For unsupported texture types,
> + * for instance sRGB textures and framebuffers, this tiling scheme is used at a
> + * performance penality, as AFBC is not compatible.

s/penality/penalty

> + */
> +
> +/* Given the lower 4-bits of the Y coordinate, we would like to
> + * duplicate every bit over. So instead of 0b1010, we would like
> + * 0b11001100. The idea is that for the bits in the solely Y place, we
> + * get a Y place, and the bits in the XOR place *also* get a Y. */
> +
> +uint32_t bit_duplication[16] = {
> +   0b00000000,
> +   0b00000011,
> +   0b00001100,
> +   0b00001111,
> +   0b00110000,
> +   0b00110011,
> +   0b00111100,
> +   0b00111111,
> +   0b11000000,
> +   0b11000011,
> +   0b11001100,
> +   0b11001111,
> +   0b11110000,
> +   0b11110011,
> +   0b11111100,
> +   0b11111111,
>  };
>
> +/* Space the bits out of a 4-bit nibble; doesn't need to be particularly fast
> + * as it only runs once per tiling (and only on the unaligned slow path). Makes
> + * every other bit a zero, basically. */
> +
> +static inline unsigned
> +space_4(unsigned x)
> +{
> +   return
> +         ((x & 0x1) << 0) |
> +         ((x & 0x2) << 1) |
> +         ((x & 0x4) << 2) |
> +         ((x & 0x8) << 3);
> +}

Is there any reason not to use a LUT here?

> +
> +/* For the X column, the mask of bits that matter (every other bit; the rest
> + * are zero) */
> +#define X_MASK 0b01010101
> +
> +/* The scheme uses 16x16 tiles */
> +
> +#define TILE_WIDTH 16
> +#define TILE_HEIGHT 16
> +#define PIXELS_PER_TILE (TILE_WIDTH * TILE_HEIGHT)
> +
> +/* An optimized routine to tile an aligned (width & 0xF == 0) bpp4 texture */
> +
>  static void
>  panfrost_store_tiled_image_bpp4(void *dst, const void *src,
>                                 const struct pipe_box *box,
>                                 uint32_t dst_stride,
>                                 uint32_t src_stride)
>  {
> +   /* Precompute the offset to the beginning of the first horizontal tile we're
> +    * writing to, knowing that box->x is 16-aligned. Tiles themselves are
> +    * stored linearly, so we get the X tile number by shifting and then
> +    * multiply by the bytes per tile */
> +
> +   uint8_t *dest_start = dst + ((box->x >> 4) * PIXELS_PER_TILE * 4);
> +
> +   /* Iterate across the pixels we're trying to store in source-order */
> +
>     for (int y = box->y, src_y = 0; src_y < box->height; ++y, ++src_y) {
> +      /* For each pixel in the destination image, figure out the part
> +       * corresponding to the 16x16 block index */
> +
>        int block_y = y & ~0x0f;
> -      int rem_y = y & 0x0F;
> -      int block_start_s = block_y * dst_stride;
> -      int source_start = src_y * src_stride;
>
> -      for (int x = box->x, src_x = 0; src_x < box->width; ++x, ++src_x) {
> -         int block_x_s = (x >> 4) * 256;
> -         int rem_x = x & 0x0F;
> +      /* In pixel coordinates (where the origin is the top-left), (block_y, 0)
> +       * is the top-left corner of the leftmost tile in this row. While pixels
> +       * are reordered within a block, the blocks themselves are stored
> +       * linearly, so multiplying block_y by the pixel stride of the
> +       * destination image equals the byte offset of that top-left corner of
> +       * the block this row is in */
> +
> +      uint32_t *dest = (uint32_t *) (dest_start + (block_y * dst_stride));
> +
> +      /* The source is actually linear, so compute the byte offset to the start
> +       * and end of this row in the source */
> +
> +      const uint32_t *source = src + (y * src_stride) + (box->x * 4);
> +      const uint32_t *source_end = source + box->width;
> +
> +      /* We want to duplicate the bits of the bottom nibble of Y */
> +      unsigned expanded_y = bit_duplication[y & 0xF];
> +
> +      /* Initialize the offset */
> +      unsigned space_x = 0;
> +
> +      /* Iterate the row in source order. In the outer loop, we iterate 16
> +       * bytes tiles. After each tile, we increment dest to include the size of
> +       * that tile in pixels. */
>
> -         int index = space_filler[rem_y][rem_x];
> -         const uint32_t *source = src + source_start + 4 * src_x;
> -         uint32_t *dest = dst + block_start_s + 4 * (block_x_s + index);
> +      for (; source < source_end; dest += PIXELS_PER_TILE) {
> +         /* Within each tile, we iterate each of the 16 pixels in the row of
> +          * the tile. This loop should be unrolled. */
>
> -         *dest = *source;
> +         for (int i = 0; i < 16; ++i) {
> +            /* We have the X component spaced out in space_x and we have the Y
> +             * component duplicated. So we just XOR them together. The X bits
> +             * get the XOR like the pattern needs. The Y bits are XORing with
> +             * zero so this is a no-op */
> +
> +            unsigned index = expanded_y ^ space_x;
> +
> +            /* For the X component, we need the bits spaced out. Instead of
> +             * 0b1111, we need 0b010101. To accomplish this, we use the

It should be either 0b111 or 0b01010101

> +             * subtract/and trick, outlined on
> +             * https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
> +             *
> +             * Essentially, we have a mask of the bits that we do care to set
> +             * -- 0b01010101 (X_MASK), and then a clever use of the carry bit
> +             *  allows us to increments only the bits we care about. So this is
> +             *  how we iterate X */
> +
> +            space_x = (space_x - X_MASK) & X_MASK;

Again, why not use a LUT here? Does it improve performance? If not,
let's keep it simple by using a LUT.

> +
> +            /* Copy over the pixel */
> +            dest[index] = *(source++);
> +         }
>        }
>     }
>  }
>
>  static void
> -panfrost_store_tiled_image_generic(void *dst, const void *src,
> +panfrost_access_tiled_image_generic(void *dst, void *src,
>                                 const struct pipe_box *box,
>                                 uint32_t dst_stride,
>                                 uint32_t src_stride,
> -                               uint32_t bpp)
> +                               uint32_t bpp,
> +                               unsigned is_store)
>  {
> +   unsigned space_x_initial = space_4(box->x);
> +
>     for (int y = box->y, src_y = 0; src_y < box->height; ++y, ++src_y) {
>        int block_y = y & ~0x0f;
> -      int rem_y = y & 0x0F;
>        int block_start_s = block_y * dst_stride;
>        int source_start = src_y * src_stride;
>
> -      for (int x = box->x, src_x = 0; src_x < box->width; ++x, ++src_x) {
> -         int block_x_s = (x >> 4) * 256;
> -         int rem_x = x & 0x0F;
> -
> -         int index = space_filler[rem_y][rem_x];
> -         const uint8_t *src8 = src;
> -         const uint8_t *source = &src8[source_start + bpp * src_x];
> -         uint8_t *dest = dst + block_start_s + bpp * (block_x_s + index);
> -
> -         for (int b = 0; b < bpp; ++b)
> -            dest[b] = source[b];
> -      }
> -   }
> -}
> -
> -static void
> -panfrost_load_tiled_image_bpp4(void *dst, const void *src,
> -                              const struct pipe_box *box,
> -                              uint32_t dst_stride,
> -                              uint32_t src_stride)
> -{
> -   for (int y = box->y, dest_y = 0; dest_y < box->height; ++y, ++dest_y) {
> -      int block_y = y & ~0x0f;
> -      int rem_y = y & 0x0F;
> -      int block_start_s = block_y * src_stride;
> -      int dest_start = dest_y * dst_stride;
> +      unsigned expanded_y = bit_duplication[y & 0xF];
> +      unsigned space_x = space_x_initial;
>
> -      for (int x = box->x, dest_x = 0; dest_x < box->width; ++x, ++dest_x) {
> +      for (int x = box->x, src_x = 0; src_x < box->width; ++x, ++src_x) {
>           int block_x_s = (x >> 4) * 256;
> -         int rem_x = x & 0x0F;
>
> -         int index = space_filler[rem_y][rem_x];
> -         uint32_t *dest = dst + dest_start + 4 * dest_x;
> -         const uint32_t *source = src + block_start_s + 4 * (block_x_s + index);
> +         unsigned index = expanded_y ^ space_x;
> +         space_x = (space_x - X_MASK) & X_MASK;
>
> -         *dest = *source;
> -      }
> -   }
> -}
> +         uint8_t *src8 = src;
> +         uint8_t *source = &src8[source_start + bpp * src_x];
> +         uint8_t *dest = dst + block_start_s + bpp * (block_x_s + index);
>
> -static void
> -panfrost_load_tiled_image_generic(void *dst, const void *src,
> -                              const struct pipe_box *box,
> -                              uint32_t dst_stride,
> -                              uint32_t src_stride,
> -                              uint32_t bpp)
> -{
> -   for (int y = box->y, dest_y = 0; dest_y < box->height; ++y, ++dest_y) {
> -      int block_y = y & ~0x0f;
> -      int rem_y = y & 0x0F;
> -      int block_start_s = block_y * src_stride;
> -      int dest_start = dest_y * dst_stride;
> +         uint8_t *out = is_store ? dest : source;
> +         uint8_t *in = is_store ? source : dest;
>
> -      for (int x = box->x, dest_x = 0; dest_x < box->width; ++x, ++dest_x) {
> -         int block_x_s = (x >> 4) * 256;
> -         int rem_x = x & 0x0F;
> +         /* Write out 1-4 bytes. Written like this rather than a loop so the
> +          * compiler doesn't need to do branching (just some predication) */
>
> -         int index = space_filler[rem_y][rem_x];
> -         uint8_t *dst8 = dst;
> -         uint8_t *dest = &dst8[dest_start + bpp * dest_x];
> -         const uint8_t *source = src + block_start_s + bpp * (block_x_s + index);
> -
> -         for (int b = 0; b < bpp; ++b)
> -            dest[b] = source[b];
> +         out[0] = in[0];
> +         if (bpp > 1) {
> +            out[1] = in[1];
> +            if (bpp > 2) {
> +               out[2] = in[2];
> +               if (bpp > 3)
> +                  out[3] = in[3];
> +            }
> +         }
>        }
>     }
>  }
> @@ -158,13 +262,43 @@ panfrost_store_tiled_image(void *dst, const void *src,
>                             uint32_t src_stride,
>                             uint32_t bpp)
>  {
> -       switch (bpp) {
> -       case 4:
> -               panfrost_store_tiled_image_bpp4(dst, src, box, dst_stride, src_stride);
> -               break;
> -       default:
> -               panfrost_store_tiled_image_generic(dst, src, box, dst_stride, src_stride, bpp);
> -       }
> +   /* Unaligned writes we handle by breaking up into an aligned write and a
> +    * small generic unaligned write */
> +   if (box->x & 0xF) {
> +      /* Split up writes so we can get X tile aligned for the larger (major) write */

Is there any benefit in doing so? The old code worked fine with
boxes not aligned to tile boundaries.

> +      const struct pipe_box major = {
> +         .x = (box->x & ~0xF) + 0x10,
> +         .y = box->y,
> +         .width = box->width - (box->x & 0xF),
> +         .height = box->height
> +      };
> +
> +      const struct pipe_box minor = {
> +         .x = box->x,
> +         .y = box->y,
> +         .width = 0x10 - (box->x & 0xF),
> +         .height = box->height
> +      };
> +
> +      if (box->x > 0x10) {
> +         panfrost_store_tiled_image(dst, src, &minor, dst_stride, src_stride, bpp);
> +         panfrost_store_tiled_image(dst, src, &major, dst_stride, src_stride, bpp);
> +      } else {
> +         panfrost_access_tiled_image_generic(dst, (void *) src, box, dst_stride, src_stride, bpp, 1);
> +      }
> +
> +      return;
> +   }
> +
> +   /* From here, we can guarantee tile alignment of the X */
> +
> +     switch (bpp) {
> +     case 4:
> +             panfrost_store_tiled_image_bpp4(dst, (void *) src, box, dst_stride, src_stride);
> +             break;
> +     default:
> +             panfrost_access_tiled_image_generic(dst, (void *) src, box, dst_stride, src_stride, bpp, 1);
> +     }
>  }
>
>  void
> @@ -174,11 +308,5 @@ panfrost_load_tiled_image(void *dst, const void *src,
>                             uint32_t src_stride,
>                             uint32_t bpp)
>  {
> -       switch (bpp) {
> -       case 4:
> -               panfrost_load_tiled_image_bpp4(dst, src, box, dst_stride, src_stride);
> -               break;
> -       default:
> -               panfrost_load_tiled_image_generic(dst, src, box, dst_stride, src_stride, bpp);
> -       }
> +   panfrost_access_tiled_image_generic((void *) src, dst, box, dst_stride, src_stride, bpp, 0);
>  }
> --
> 2.20.1
>
On Tue, 2019-06-25 at 12:36 -0700, Vasily Khoruzhick wrote:
> On Tue, Jun 25, 2019 at 11:54 AM Alyssa Rosenzweig
> <alyssa.rosenzweig@collabora.com> wrote:
> 
> Hi Alyssa,
> 
> > Rather than using a magic lookup table with no explanations, let's
> > add
> > liberal comments to the code to explain what this tiling scheme is
> > and
> > how to encode/decode it efficiently.
> > 
> > It's not so mysterious after all -- just reordering bits with some
> > XORs
> > thrown in.
> > 
> > Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
> > Cc: Vasily Khoruzhick <anarsoul@gmail.com>
> > ---
> >  src/panfrost/shared/pan_tiling.c | 330 +++++++++++++++++++++----
> > ------
> >  1 file changed, 229 insertions(+), 101 deletions(-)
> > 
> > diff --git a/src/panfrost/shared/pan_tiling.c
> > b/src/panfrost/shared/pan_tiling.c
> > index 413cd89420b..de0d6812831 100644
> > --- a/src/panfrost/shared/pan_tiling.c
> > +++ b/src/panfrost/shared/pan_tiling.c
> > @@ -2,6 +2,7 @@
> >   * Copyright (c) 2011-2013 Luc Verhaegen <libv@skynet.be>
> >   * Copyright (c) 2018 Alyssa Rosenzweig <alyssa@rosenzweig.io>
> >   * Copyright (c) 2018 Vasily Khoruzhick <anarsoul@gmail.com>
> > + * Copyright (c) 2019 Collabora
> >   *
> >   * Permission is hereby granted, free of charge, to any person
> > obtaining a
> >   * copy of this software and associated documentation files (the
> > "Software"),
> > @@ -26,127 +27,230 @@
> > 
> >  #include "pan_tiling.h"
> > 
> > -uint32_t space_filler[16][16] = {
> > -   {
> > 0,   1,   4,   5,   16,  17,  20,  21,  64,  65,  68,  69,  80,  81
> > ,  84,  85, },
> > -   {
> > 3,   2,   7,   6,   19,  18,  23,  22,  67,  66,  71,  70,  83,  82
> > ,  87,  86, },
> > -   {
> > 12,  13,  8,   9,   28,  29,  24,  25,  76,  77,  72,  73,  92,  93
> > ,  88,  89, },
> > -   {
> > 15,  14,  11,  10,  31,  30,  27,  26,  79,  78,  75,  74,  95,  94
> > ,  91,  90, },
> > -   { 48,  49,  52,  53,  32,  33,  36,  37,  112, 113, 116, 117,
> > 96,  97,  100, 101, },
> > -   { 51,  50,  55,  54,  35,  34,  39,  38,  115, 114, 119, 118,
> > 99,  98,  103, 102, },
> > -   { 60,  61,  56,  57,  44,  45,  40,  41,  124, 125, 120, 121,
> > 108, 109, 104, 105, },
> > -   { 63,  62,  59,  58,  47,  46,  43,  42,  127, 126, 123, 122,
> > 111, 110, 107, 106, },
> > -   { 192, 193, 196, 197, 208, 209, 212, 213, 128, 129, 132, 133,
> > 144, 145, 148, 149, },
> > -   { 195, 194, 199, 198, 211, 210, 215, 214, 131, 130, 135, 134,
> > 147, 146, 151, 150, },
> > -   { 204, 205, 200, 201, 220, 221, 216, 217, 140, 141, 136, 137,
> > 156, 157, 152, 153, },
> > -   { 207, 206, 203, 202, 223, 222, 219, 218, 143, 142, 139, 138,
> > 159, 158, 155, 154, },
> > -   { 240, 241, 244, 245, 224, 225, 228, 229, 176, 177, 180, 181,
> > 160, 161, 164, 165, },
> > -   { 243, 242, 247, 246, 227, 226, 231, 230, 179, 178, 183, 182,
> > 163, 162, 167, 166, },
> > -   { 252, 253, 248, 249, 236, 237, 232, 233, 188, 189, 184, 185,
> > 172, 173, 168, 169, },
> > -   { 255, 254, 251, 250, 239, 238, 235, 234, 191, 190, 187, 186,
> > 175, 174, 171, 170, },
> > +/* This file implements software encode/decode of the tiling
> > format used for
> > + * textures and framebuffers primarily on Utgard GPUs. Names for
> > this format
> > + * include "Utgard-style tiling", "(Mali) swizzled textures", and
> > + * "U-interleaved" (the former two names being used in the
> > community
> > + * Lima/Panfrost drivers; the latter name used internally at Arm).
> > + * Conceptually, like any tiling scheme, the pixel reordering
> > attempts to 2D
> > + * spatial locality, to improve cache locality in both horizontal
> > and vertical
> > + * directions.
> > + *
> > + * This format is tiled: first, the image dimensions must be
> > aligned to 16
> > + * pixels in each axis. Once aligned, the image is divided into
> > 16x16 tiles.
> > + * This size harmonizes with other properties of the GPU; on
> > Midgard,
> > + * framebuffer tiles are logically 16x16 (this is the tile size
> > used in
> > + * Transaction Elimination and the minimum tile size used in
> > Hierarchical
> > + * Tiling). Conversely, for a standard 4 bytes-per-pixel format
> > (like
> > + * RGBA8888), 16 pixels * 4 bytes/pixel = 64 bytes, equal to the
> > cache line
> > + * size.
> > + *
> > + * Within each 16x16 block, the bits are reordered according to
> > this pattern:
> > + *
> > + * | y3 | (x3 ^ y3) | y2 | (y2 ^ x2) | y1 | (y1 ^ x1) | y0 | (y0 ^
> > x0) |
> > + *
> > + * Basically, interleaving the X and Y bits, with XORs thrown in
> > for every
> > + * adjacent bit pair.
> > + *
> > + * This is cheap to implement both encode/decode in both hardware
> > and software.
> > + * In hardware, lines are simply rerouted to reorder and some XOR
> > gates are
> > + * thrown in. Software has to be a bit more clever.
> > + *
> > + * In software, the trick is to divide the pattern into two lines:
> > + *
> > + *    | y3 | y3 | y2 | y2 | y1 | y1 | y0 | y0 |
> > + *  ^ |  0 | x3 |  0 | x2 |  0 | x1 |  0 | x0 |
> > + *
> > + * That is, duplicate the bits of the Y and space out the bits of
> > the X. The
> > + * top line is a function only of Y, so it can be calculated once
> > per row and
> > + * stored in a register. The bottom line is simply X with the bits
> > spaced out.
> > + * The initial value can be calculated once at the beginning, and
> > then after
> > + * each pixel it can incremented by subtracting a mask pattern and
> > ANDing with
> > + * the mask pattern (abusing carry bits, where the mask pattern is
> > the mask of
> > + * bits we care about, so 0b01010101).
> > + *
> > + * Alternatively, the 16 indices can be precalculated for each row
> > and reused
> > + * across the row, which may be amenable to easier vectorization.
> > + *
> > + * This format is also supported on Midgard GPUs, where it *can*
> > be used for
> > + * textures and framebuffers. That said, in practice it is usually
> > as a
> > + * fallback layout; Midgard introduces Arm FrameBuffer
> > Compression, which is
> > + * significantly more efficient than Utgard-style tiling and
> > preferred for both
> > + * textures and framebuffers, where possible. For unsupported
> > texture types,
> > + * for instance sRGB textures and framebuffers, this tiling scheme
> > is used at a
> > + * performance penality, as AFBC is not compatible.
> 
> s/penality/penalty
> 
> > + */
> > +
> > +/* Given the lower 4-bits of the Y coordinate, we would like to
> > + * duplicate every bit over. So instead of 0b1010, we would like
> > + * 0b11001100. The idea is that for the bits in the solely Y
> > place, we
> > + * get a Y place, and the bits in the XOR place *also* get a Y. */
> > +
> > +uint32_t bit_duplication[16] = {
> > +   0b00000000,
> > +   0b00000011,
> > +   0b00001100,
> > +   0b00001111,
> > +   0b00110000,
> > +   0b00110011,
> > +   0b00111100,
> > +   0b00111111,
> > +   0b11000000,
> > +   0b11000011,
> > +   0b11001100,
> > +   0b11001111,
> > +   0b11110000,
> > +   0b11110011,
> > +   0b11111100,
> > +   0b11111111,
> >  };
> > 
> > +/* Space the bits out of a 4-bit nibble; doesn't need to be
> > particularly fast
> > + * as it only runs once per tiling (and only on the unaligned slow
> > path). Makes
> > + * every other bit a zero, basically. */
> > +
> > +static inline unsigned
> > +space_4(unsigned x)
> > +{
> > +   return
> > +         ((x & 0x1) << 0) |
> > +         ((x & 0x2) << 1) |
> > +         ((x & 0x4) << 2) |
> > +         ((x & 0x8) << 3);
> > +}
> 
> Is there any reason not to use a LUT here?
> 

The comment says that it doesn't need to be fast, so I guess it's
because this is a bit clearer?

Generally speaking, I think using a LUT here is a good idea, as it's
so tiny, it'll be swallowed up by the L1 cache.

Also, I don't think you *technically* need the bit_duplication LUT; you
can just use space_4, and OR the result with itself shifted left by
one. Something like this:

unsigned
bit_duplication(unsigned x)
{
   unsigned tmp = space_4(x);
   return tmp | (tmp << 1);
}

Exactly what ends up being the right speed-vs-tidiness trade-off needs
measurements, though, I think. So this is just my two cents. Take it
or leave it :)
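
For what it's worth, a LUT-based variant of this could look something like
the sketch below, in the context of pan_tiling.c (purely illustrative; the
table just precomputes space_4, and the names are made up):

/* 16-entry LUT precomputing space_4(): each X bit lands in an even position */
static const uint8_t space_4_lut[16] = {
   0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
   0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55,
};

/* Duplicated-Y value derived from the same table, as suggested above: space
 * the bits out, then OR with itself shifted left by one */
static inline unsigned
duplicate_4(unsigned y)
{
   unsigned tmp = space_4_lut[y & 0xF];
   return tmp | (tmp << 1);
}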

> > +
> > +/* For the X column, the mask of bits that matter (every other
> > bit; the rest
> > + * are zero) */
> > +#define X_MASK 0b01010101
> > +
> > +/* The scheme uses 16x16 tiles */
> > +
> > +#define TILE_WIDTH 16
> > +#define TILE_HEIGHT 16
> > +#define PIXELS_PER_TILE (TILE_WIDTH * TILE_HEIGHT)
> > +
> > +/* An optimized routine to tile an aligned (width & 0xF == 0) bpp4
> > texture */
> > +
> >  static void
> >  panfrost_store_tiled_image_bpp4(void *dst, const void *src,
> >                                 const struct pipe_box *box,
> >                                 uint32_t dst_stride,
> >                                 uint32_t src_stride)
> >  {
> > +   /* Precompute the offset to the beginning of the first
> > horizontal tile we're
> > +    * writing to, knowing that box->x is 16-aligned. Tiles
> > themselves are
> > +    * stored linearly, so we get the X tile number by shifting and
> > then
> > +    * multiply by the bytes per tile */
> > +
> > +   uint8_t *dest_start = dst + ((box->x >> 4) * PIXELS_PER_TILE *
> > 4);
> > +
> > +   /* Iterate across the pixels we're trying to store in source-
> > order */
> > +
> >     for (int y = box->y, src_y = 0; src_y < box->height; ++y,
> > ++src_y) {
> > +      /* For each pixel in the destination image, figure out the
> > part
> > +       * corresponding to the 16x16 block index */
> > +
> >        int block_y = y & ~0x0f;
> > -      int rem_y = y & 0x0F;
> > -      int block_start_s = block_y * dst_stride;
> > -      int source_start = src_y * src_stride;
> > 
> > -      for (int x = box->x, src_x = 0; src_x < box->width; ++x,
> > ++src_x) {
> > -         int block_x_s = (x >> 4) * 256;
> > -         int rem_x = x & 0x0F;
> > +      /* In pixel coordinates (where the origin is the top-left),
> > (block_y, 0)
> > +       * is the top-left corner of the leftmost tile in this row.
> > While pixels
> > +       * are reordered within a block, the blocks themselves are
> > stored
> > +       * linearly, so multiplying block_y by the pixel stride of
> > the
> > +       * destination image equals the byte offset of that top-left 
> > corner of
> > +       * the block this row is in */
> > +
> > +      uint32_t *dest = (uint32_t *) (dest_start + (block_y *
> > dst_stride));
> > +
> > +      /* The source is actually linear, so compute the byte offset
> > to the start
> > +       * and end of this row in the source */
> > +
> > +      const uint32_t *source = src + (y * src_stride) + (box->x *
> > 4);
> > +      const uint32_t *source_end = source + box->width;
> > +
> > +      /* We want to duplicate the bits of the bottom nibble of Y
> > */
> > +      unsigned expanded_y = bit_duplication[y & 0xF];
> > +
> > +      /* Initialize the offset */
> > +      unsigned space_x = 0;
> > +
> > +      /* Iterate the row in source order. In the outer loop, we
> > iterate 16
> > +       * bytes tiles. After each tile, we increment dest to
> > include the size of
> > +       * that tile in pixels. */
> > 
> > -         int index = space_filler[rem_y][rem_x];
> > -         const uint32_t *source = src + source_start + 4 * src_x;
> > -         uint32_t *dest = dst + block_start_s + 4 * (block_x_s +
> > index);
> > +      for (; source < source_end; dest += PIXELS_PER_TILE) {
> > +         /* Within each tile, we iterate each of the 16 pixels in
> > the row of
> > +          * the tile. This loop should be unrolled. */
> > 
> > -         *dest = *source;
> > +         for (int i = 0; i < 16; ++i) {
> > +            /* We have the X component spaced out in space_x and
> > we have the Y
> > +             * component duplicated. So we just XOR them together.
> > The X bits
> > +             * get the XOR like the pattern needs. The Y bits are
> > XORing with
> > +             * zero so this is a no-op */
> > +
> > +            unsigned index = expanded_y ^ space_x;
> > +
> > +            /* For the X component, we need the bits spaced out.
> > Instead of
> > +             * 0b1111, we need 0b010101. To accomplish this, we
> > use the
> 
> It should be either 0b111 or 0b01010101
> 
> > +             * subtract/and trick, outlined on
> > +             * 
> > https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
> > +             *
> > +             * Essentially, we have a mask of the bits that we do
> > care to set
> > +             * -- 0b01010101 (X_MASK), and then a clever use of
> > the carry bit
> > +             *  allows us to increments only the bits we care
> > about. So this is
> > +             *  how we iterate X */
> > +
> > +            space_x = (space_x - X_MASK) & X_MASK;
> 
> Again, why not use a LUT here? Does it improve performance? If not,
> let's keep it simple by using a LUT.
> 

It looks to me like it might be faster this way.

I think the performance question here is really: Is the ALU-based
masking (subtract + and) faster than an L1 cache hit? We can preload
the cache before the loop to avoid the initial hit.

I'm not 100% convinced it is, but I could be wrong here. I guess it
depends on the exact CPU? In either case, the ALU-based masking is
pretty fast on all CPUs, so I would go with that, I think.

Of course, some numbers would be awesome :)
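
As a starting point, a rough micro-benchmark along these lines could compare
the two ways of stepping the spaced-out X coordinate (entirely illustrative;
the iteration count and timing method are arbitrary, and real numbers should
come from the actual tiling loops):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define X_MASK 0x55
#define ITERS (1u << 26)

static long
elapsed_ns(struct timespec a, struct timespec b)
{
   return (long) (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int
main(void)
{
   static const uint8_t lut[16] = {
      0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
      0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55,
   };
   volatile unsigned sink = 0;
   struct timespec t0, t1;

   /* Subtract/AND carry trick */
   clock_gettime(CLOCK_MONOTONIC, &t0);
   for (unsigned i = 0, sx = 0; i < ITERS; ++i) {
      sink += sx;
      sx = (sx - X_MASK) & X_MASK;
   }
   clock_gettime(CLOCK_MONOTONIC, &t1);
   printf("alu: %ld ns\n", elapsed_ns(t0, t1));

   /* 16-entry LUT */
   clock_gettime(CLOCK_MONOTONIC, &t0);
   for (unsigned i = 0; i < ITERS; ++i)
      sink += lut[i & 0xF];
   clock_gettime(CLOCK_MONOTONIC, &t1);
   printf("lut: %ld ns\n", elapsed_ns(t0, t1));

   return (int) (sink & 1);
}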

> > +
> > +            /* Copy over the pixel */
> > +            dest[index] = *(source++);
> > +         }
> >        }
> >     }
> >  }
> > 
> >  static void
> > -panfrost_store_tiled_image_generic(void *dst, const void *src,
> > +panfrost_access_tiled_image_generic(void *dst, void *src,
> >                                 const struct pipe_box *box,
> >                                 uint32_t dst_stride,
> >                                 uint32_t src_stride,
> > -                               uint32_t bpp)
> > +                               uint32_t bpp,
> > +                               unsigned is_store)
> >  {
> > +   unsigned space_x_initial = space_4(box->x);
> > +
> >     for (int y = box->y, src_y = 0; src_y < box->height; ++y,
> > ++src_y) {
> >        int block_y = y & ~0x0f;
> > -      int rem_y = y & 0x0F;
> >        int block_start_s = block_y * dst_stride;
> >        int source_start = src_y * src_stride;
> > 
> > -      for (int x = box->x, src_x = 0; src_x < box->width; ++x,
> > ++src_x) {
> > -         int block_x_s = (x >> 4) * 256;
> > -         int rem_x = x & 0x0F;
> > -
> > -         int index = space_filler[rem_y][rem_x];
> > -         const uint8_t *src8 = src;
> > -         const uint8_t *source = &src8[source_start + bpp *
> > src_x];
> > -         uint8_t *dest = dst + block_start_s + bpp * (block_x_s +
> > index);
> > -
> > -         for (int b = 0; b < bpp; ++b)
> > -            dest[b] = source[b];
> > -      }
> > -   }
> > -}
> > -
> > -static void
> > -panfrost_load_tiled_image_bpp4(void *dst, const void *src,
> > -                              const struct pipe_box *box,
> > -                              uint32_t dst_stride,
> > -                              uint32_t src_stride)
> > -{
> > -   for (int y = box->y, dest_y = 0; dest_y < box->height; ++y,
> > ++dest_y) {
> > -      int block_y = y & ~0x0f;
> > -      int rem_y = y & 0x0F;
> > -      int block_start_s = block_y * src_stride;
> > -      int dest_start = dest_y * dst_stride;
> > +      unsigned expanded_y = bit_duplication[y & 0xF];
> > +      unsigned space_x = space_x_initial;
> > 
> > -      for (int x = box->x, dest_x = 0; dest_x < box->width; ++x,
> > ++dest_x) {
> > +      for (int x = box->x, src_x = 0; src_x < box->width; ++x,
> > ++src_x) {
> >           int block_x_s = (x >> 4) * 256;
> > -         int rem_x = x & 0x0F;
> > 
> > -         int index = space_filler[rem_y][rem_x];
> > -         uint32_t *dest = dst + dest_start + 4 * dest_x;
> > -         const uint32_t *source = src + block_start_s + 4 *
> > (block_x_s + index);
> > +         unsigned index = expanded_y ^ space_x;
> > +         space_x = (space_x - X_MASK) & X_MASK;
> > 
> > -         *dest = *source;
> > -      }
> > -   }
> > -}
> > +         uint8_t *src8 = src;
> > +         uint8_t *source = &src8[source_start + bpp * src_x];
> > +         uint8_t *dest = dst + block_start_s + bpp * (block_x_s +
> > index);
> > 
> > -static void
> > -panfrost_load_tiled_image_generic(void *dst, const void *src,
> > -                              const struct pipe_box *box,
> > -                              uint32_t dst_stride,
> > -                              uint32_t src_stride,
> > -                              uint32_t bpp)
> > -{
> > -   for (int y = box->y, dest_y = 0; dest_y < box->height; ++y,
> > ++dest_y) {
> > -      int block_y = y & ~0x0f;
> > -      int rem_y = y & 0x0F;
> > -      int block_start_s = block_y * src_stride;
> > -      int dest_start = dest_y * dst_stride;
> > +         uint8_t *out = is_store ? dest : source;
> > +         uint8_t *in = is_store ? source : dest;
> > 
> > -      for (int x = box->x, dest_x = 0; dest_x < box->width; ++x,
> > ++dest_x) {
> > -         int block_x_s = (x >> 4) * 256;
> > -         int rem_x = x & 0x0F;
> > +         /* Write out 1-4 bytes. Written like this rather than a
> > loop so the
> > +          * compiler doesn't need to do branching (just some
> > predication) */
> > 
> > -         int index = space_filler[rem_y][rem_x];
> > -         uint8_t *dst8 = dst;
> > -         uint8_t *dest = &dst8[dest_start + bpp * dest_x];
> > -         const uint8_t *source = src + block_start_s + bpp *
> > (block_x_s + index);
> > -
> > -         for (int b = 0; b < bpp; ++b)
> > -            dest[b] = source[b];
> > +         out[0] = in[0];
> > +         if (bpp > 1) {
> > +            out[1] = in[1];
> > +            if (bpp > 2) {
> > +               out[2] = in[2];
> > +               if (bpp > 3)
> > +                  out[3] = in[3];
> > +            }
> > +         }
> >        }
> >     }
> >  }
> > @@ -158,13 +262,43 @@ panfrost_store_tiled_image(void *dst, const
> > void *src,
> >                             uint32_t src_stride,
> >                             uint32_t bpp)
> >  {
> > -       switch (bpp) {
> > -       case 4:
> > -               panfrost_store_tiled_image_bpp4(dst, src, box,
> > dst_stride, src_stride);
> > -               break;
> > -       default:
> > -               panfrost_store_tiled_image_generic(dst, src, box,
> > dst_stride, src_stride, bpp);
> > -       }
> > +   /* Unaligned writes we handle by breaking up into an aligned
> > write and a
> > +    * small generic unaligned write */
> > +   if (box->x & 0xF) {
> > +      /* Split up writes so we can get X tile aligned for the
> > larger (major) write */
> 
> Is there any benefit in doing so? The old code worked fine with
> boxes not aligned to tile boundaries.
> 
> > +      const struct pipe_box major = {
> > +         .x = (box->x & ~0xF) + 0x10,
> > +         .y = box->y,
> > +         .width = box->width - (box->x & 0xF),
> > +         .height = box->height
> > +      };
> > +
> > +      const struct pipe_box minor = {
> > +         .x = box->x,
> > +         .y = box->y,
> > +         .width = 0x10 - (box->x & 0xF),
> > +         .height = box->height
> > +      };
> > +
> > +      if (box->x > 0x10) {
> > +         panfrost_store_tiled_image(dst, src, &minor, dst_stride,
> > src_stride, bpp);
> > +         panfrost_store_tiled_image(dst, src, &major, dst_stride,
> > src_stride, bpp);
> > +      } else {
> > +         panfrost_access_tiled_image_generic(dst, (void *) src,
> > box, dst_stride, src_stride, bpp, 1);
> > +      }
> > +
> > +      return;
> > +   }
> > +
> > +   /* From here, we can guarantee tile alignment of the X */
> > +
> > +     switch (bpp) {
> > +     case 4:
> > +             panfrost_store_tiled_image_bpp4(dst, (void *) src,
> > box, dst_stride, src_stride);
> > +             break;
> > +     default:
> > +             panfrost_access_tiled_image_generic(dst, (void *)
> > src, box, dst_stride, src_stride, bpp, 1);
> > +     }
> >  }
> > 
> >  void
> > @@ -174,11 +308,5 @@ panfrost_load_tiled_image(void *dst, const
> > void *src,
> >                             uint32_t src_stride,
> >                             uint32_t bpp)
> >  {
> > -       switch (bpp) {
> > -       case 4:
> > -               panfrost_load_tiled_image_bpp4(dst, src, box,
> > dst_stride, src_stride);
> > -               break;
> > -       default:
> > -               panfrost_load_tiled_image_generic(dst, src, box,
> > dst_stride, src_stride, bpp);
> > -       }
> > +   panfrost_access_tiled_image_generic((void *) src, dst, box,
> > dst_stride, src_stride, bpp, 0);
> >  }
> > --
> > 2.20.1
> > 