[v3,24/25] panfrost: Support batch pipelining

Submitted by Boris Brezillon on Sept. 5, 2019, 7:41 p.m.

Details

Message ID 20190905194150.17608-25-boris.brezillon@collabora.com
State New
Series "panfrost: Rework the batch pipelining logic" ( rev: 1 ) in Mesa

Commit Message

Boris Brezillon Sept. 5, 2019, 7:41 p.m.
We adjust the code so that each operation explicitly requests a flush
of the batches accessing the BOs it cares about. Thanks to that, we can
get rid of the implicit serialization done in panfrost_batch_submit()
and panfrost_set_framebuffer_state(). Finally, panfrost_flush() is
changed to flush all pending batches.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 src/gallium/drivers/panfrost/pan_compute.c  |   2 +-
 src/gallium/drivers/panfrost/pan_context.c  | 145 +++++++++++++-------
 src/gallium/drivers/panfrost/pan_job.c      |  15 +-
 src/gallium/drivers/panfrost/pan_resource.c |  26 ++--
 4 files changed, 115 insertions(+), 73 deletions(-)

diff --git a/src/gallium/drivers/panfrost/pan_compute.c b/src/gallium/drivers/panfrost/pan_compute.c
index 4639c1b03c38..036dffbb17be 100644
--- a/src/gallium/drivers/panfrost/pan_compute.c
+++ b/src/gallium/drivers/panfrost/pan_compute.c
@@ -133,7 +133,7 @@  panfrost_launch_grid(struct pipe_context *pipe,
         /* Queue the job */
         panfrost_scoreboard_queue_compute_job(batch, transfer);
 
-        panfrost_flush(pipe, NULL, PIPE_FLUSH_END_OF_FRAME);
+        panfrost_flush_all_batches(ctx, true);
 }
 
 void
diff --git a/src/gallium/drivers/panfrost/pan_context.c b/src/gallium/drivers/panfrost/pan_context.c
index 02726e7cd349..993744a1ffd0 100644
--- a/src/gallium/drivers/panfrost/pan_context.c
+++ b/src/gallium/drivers/panfrost/pan_context.c
@@ -150,6 +150,28 @@  panfrost_emit_mfbd(struct panfrost_batch *batch, unsigned vertex_count)
         return framebuffer;
 }
 
+static void
+panfrost_flush_fbo_deps(struct panfrost_context *ctx)
+{
+        struct pipe_framebuffer_state *fb = &ctx->pipe_framebuffer;
+        for (unsigned i = 0; i < fb->nr_cbufs; i++) {
+                if (!fb->cbufs[i])
+                        continue;
+
+                struct panfrost_resource *rsrc = pan_resource(fb->cbufs[i]->texture);
+
+                panfrost_flush_batch_writing_bo(ctx, rsrc->bo, true);
+                panfrost_flush_batches_reading_bo(ctx, rsrc->bo, true);
+        }
+
+        if (fb->zsbuf) {
+                struct panfrost_resource *rsrc = pan_resource(fb->zsbuf->texture);
+
+                panfrost_flush_batch_writing_bo(ctx, rsrc->bo, true);
+                panfrost_flush_batches_reading_bo(ctx, rsrc->bo, true);
+        }
+}
+
 static void
 panfrost_clear(
         struct pipe_context *pipe,
@@ -160,6 +182,7 @@  panfrost_clear(
         struct panfrost_context *ctx = pan_context(pipe);
         struct panfrost_batch *batch = panfrost_get_batch_for_fbo(ctx);
 
+        panfrost_flush_fbo_deps(ctx);
         panfrost_batch_add_fbo_bos(batch);
         panfrost_batch_clear(batch, buffers, color, depth, stencil);
 }
@@ -1324,10 +1347,9 @@  panfrost_flush(
         unsigned flags)
 {
         struct panfrost_context *ctx = pan_context(pipe);
-        struct panfrost_batch *batch = panfrost_get_batch_for_fbo(ctx);
 
-        /* Submit the frame itself */
-        panfrost_batch_submit(batch);
+        /* Submit all pending jobs */
+        panfrost_flush_all_batches(ctx, false);
 
         if (fence) {
                 struct panfrost_fence *f = panfrost_fence_create(ctx);
@@ -1433,6 +1455,71 @@  panfrost_statistics_record(
         ctx->tf_prims_generated += prims;
 }
 
+static void
+panfrost_flush_draw_deps(struct panfrost_context *ctx, const struct pipe_draw_info *info)
+{
+	struct panfrost_resource *rsrc;
+
+        if (ctx->wallpaper_batch)
+                return;
+
+        panfrost_flush_fbo_deps(ctx);
+
+        for (unsigned stage = 0; stage < PIPE_SHADER_TYPES; stage++) {
+                for (unsigned i = 0; i < ctx->sampler_view_count[stage]; i++) {
+                        struct panfrost_sampler_view *view = ctx->sampler_views[stage][i];
+
+                        if (!view)
+                                continue;
+
+                        rsrc = pan_resource(view->base.texture);
+                        panfrost_flush_batch_writing_bo(ctx, rsrc->bo, true);
+                }
+
+                for (unsigned i = 0; i < 32; i++) {
+                        if (!(ctx->ssbo_mask[stage] & (1 << i)))
+                                continue;
+
+                        rsrc = pan_resource(ctx->ssbo[stage][i].buffer);
+                        panfrost_flush_batch_writing_bo(ctx, rsrc->bo, true);
+                        panfrost_flush_batches_reading_bo(ctx, rsrc->bo, true);
+                }
+        }
+
+        if (info->index_size && !info->has_user_indices) {
+                struct panfrost_resource *rsrc = pan_resource(info->index.resource);
+
+                panfrost_flush_batch_writing_bo(ctx, rsrc->bo, true);
+        }
+
+        for (unsigned i = 0; ctx->vertex && i < ctx->vertex->num_elements; i++) {
+                struct pipe_vertex_element *velem = &ctx->vertex->pipe[i];
+                unsigned vbi = velem->vertex_buffer_index;
+
+                if (!(ctx->vb_mask & (1 << vbi)))
+                        continue;
+
+                struct pipe_vertex_buffer *buf = &ctx->vertex_buffers[vbi];
+
+                if (!buf->buffer.resource)
+                        continue;
+
+                rsrc = pan_resource(buf->buffer.resource);
+                panfrost_flush_batch_writing_bo(ctx, rsrc->bo, true);
+        }
+
+        for (unsigned i = 0; i < ctx->streamout.num_targets; i++) {
+                struct pipe_stream_output_target *target = ctx->streamout.targets[i];
+
+                if (!target)
+                        continue;
+
+                rsrc = pan_resource(target->buffer);
+                panfrost_flush_batch_writing_bo(ctx, rsrc->bo, true);
+                panfrost_flush_batches_reading_bo(ctx, rsrc->bo, true);
+        }
+}
+
 static void
 panfrost_draw_vbo(
         struct pipe_context *pipe,
@@ -1477,6 +1564,7 @@  panfrost_draw_vbo(
                 }
         }
 
+        panfrost_flush_draw_deps(ctx, info);
         ctx->payloads[PIPE_SHADER_VERTEX].offset_start = info->start;
         ctx->payloads[PIPE_SHADER_FRAGMENT].offset_start = info->start;
 
@@ -2252,50 +2340,10 @@  panfrost_set_framebuffer_state(struct pipe_context *pctx,
 {
         struct panfrost_context *ctx = pan_context(pctx);
 
-        /* Flush when switching framebuffers, but not if the framebuffer
-         * state is being restored by u_blitter
-         */
-
-        struct panfrost_batch *batch = panfrost_get_batch_for_fbo(ctx);
-        bool is_scanout = panfrost_batch_is_scanout(batch);
-        bool has_draws = batch->last_job.gpu;
-
-        /* Bail out early when the current and new states are the same. */
-        if (util_framebuffer_state_equal(&ctx->pipe_framebuffer, fb))
-                return;
-
-        /* The wallpaper logic sets a new FB state before doing the blit and
-         * restore the old one when it's done. Those FB states are reported to
-         * be different because the surface they are pointing to are different,
-         * but those surfaces actually point to the same cbufs/zbufs. In that
-         * case we definitely don't want new FB descs to be emitted/attached
-         * since the job is expected to be flushed just after the blit is done,
-         * so let's just copy the new state and return here.
-         */
-        if (ctx->wallpaper_batch) {
-                util_copy_framebuffer_state(&ctx->pipe_framebuffer, fb);
-                return;
-        }
-
-        if (!is_scanout || has_draws)
-                panfrost_flush(pctx, NULL, PIPE_FLUSH_END_OF_FRAME);
-        else
-                assert(!ctx->payloads[PIPE_SHADER_VERTEX].postfix.framebuffer &&
-                       !ctx->payloads[PIPE_SHADER_FRAGMENT].postfix.framebuffer);
-
-        /* Invalidate the FBO job cache since we've just been assigned a new
-         * FB state.
-         */
-        ctx->batch = NULL;
-
+        panfrost_hint_afbc(pan_screen(pctx->screen), fb);
         util_copy_framebuffer_state(&ctx->pipe_framebuffer, fb);
-
-        /* Given that we're rendering, we'd love to have compression */
-        struct panfrost_screen *screen = pan_screen(ctx->base.screen);
-
-        panfrost_hint_afbc(screen, &ctx->pipe_framebuffer);
-        for (unsigned i = 0; i < PIPE_SHADER_TYPES; ++i)
-                ctx->payloads[i].postfix.framebuffer = 0;
+        ctx->batch = NULL;
+        panfrost_invalidate_frame(ctx);
 }
 
 static void *
@@ -2513,6 +2561,7 @@  panfrost_get_query_result(struct pipe_context *pipe,
                           bool wait,
                           union pipe_query_result *vresult)
 {
+        struct panfrost_context *ctx = pan_context(pipe);
         struct panfrost_query *query = (struct panfrost_query *) q;
 
 
@@ -2521,7 +2570,7 @@  panfrost_get_query_result(struct pipe_context *pipe,
         case PIPE_QUERY_OCCLUSION_PREDICATE:
         case PIPE_QUERY_OCCLUSION_PREDICATE_CONSERVATIVE:
                 /* Flush first */
-                panfrost_flush(pipe, NULL, PIPE_FLUSH_END_OF_FRAME);
+                panfrost_flush_all_batches(ctx, true);
 
                 /* Read back the query results */
                 unsigned *result = (unsigned *) query->transfer.cpu;
@@ -2537,7 +2586,7 @@  panfrost_get_query_result(struct pipe_context *pipe,
 
         case PIPE_QUERY_PRIMITIVES_GENERATED:
         case PIPE_QUERY_PRIMITIVES_EMITTED:
-                panfrost_flush(pipe, NULL, PIPE_FLUSH_END_OF_FRAME);
+                panfrost_flush_all_batches(ctx, true);
                 vresult->u64 = query->end - query->start;
                 break;
 
diff --git a/src/gallium/drivers/panfrost/pan_job.c b/src/gallium/drivers/panfrost/pan_job.c
index 8eda110542c3..c40f3f12ec13 100644
--- a/src/gallium/drivers/panfrost/pan_job.c
+++ b/src/gallium/drivers/panfrost/pan_job.c
@@ -384,21 +384,14 @@  panfrost_batch_submit(struct panfrost_batch *batch)
                 fprintf(stderr, "panfrost_batch_submit failed: %d\n", ret);
 
 out:
-        if (ctx->batch == batch)
-                panfrost_invalidate_frame(ctx);
-
         /* The job has been submitted, let's invalidate the current FBO job
          * cache.
 	 */
-        assert(!ctx->batch || batch == ctx->batch);
-        ctx->batch = NULL;
+        if (ctx->batch == batch) {
+                panfrost_invalidate_frame(ctx);
+                ctx->batch = NULL;
+        }
 
-        /* We always stall the pipeline for correct results since pipelined
-	 * rendering is quite broken right now (to be fixed by the panfrost_job
-	 * refactor, just take the perf hit for correctness)
-	 */
-        drmSyncobjWait(pan_screen(ctx->base.screen)->fd, &ctx->out_sync, 1,
-                       INT64_MAX, 0, NULL);
         panfrost_free_batch(batch);
 }
 
diff --git a/src/gallium/drivers/panfrost/pan_resource.c b/src/gallium/drivers/panfrost/pan_resource.c
index a5869768f9de..25b72478dccd 100644
--- a/src/gallium/drivers/panfrost/pan_resource.c
+++ b/src/gallium/drivers/panfrost/pan_resource.c
@@ -578,10 +578,8 @@  panfrost_transfer_map(struct pipe_context *pctx,
                         is_bound |= fb->cbufs[c]->texture == resource;
         }
 
-        if (is_bound && (usage & PIPE_TRANSFER_READ)) {
-                assert(level == 0);
-                panfrost_flush(pctx, NULL, PIPE_FLUSH_END_OF_FRAME);
-        }
+        if (is_bound && (usage & PIPE_TRANSFER_READ))
+                 assert(level == 0);
 
         /* TODO: Respect usage flags */
 
@@ -594,9 +592,10 @@  panfrost_transfer_map(struct pipe_context *pctx,
                 /* No flush for writes to uninitialized */
         } else if (!(usage & PIPE_TRANSFER_UNSYNCHRONIZED)) {
                 if (usage & PIPE_TRANSFER_WRITE) {
-                        /* STUB: flush reading */
-                        //printf("debug: missed reading flush %d\n", resource->target);
+                        panfrost_flush_batch_writing_bo(ctx, bo, false);
+                        panfrost_flush_batches_reading_bo(ctx, bo, false);
                 } else if (usage & PIPE_TRANSFER_READ) {
+                        panfrost_flush_batch_writing_bo(ctx, bo, false);
                         /* STUB: flush writing */
                         //printf("debug: missed writing flush %d (%d-%d)\n", resource->target, box->x, box->x + box->width);
                 } else {
@@ -604,6 +603,8 @@  panfrost_transfer_map(struct pipe_context *pctx,
                 }
         }
 
+        panfrost_bo_wait(bo, INT64_MAX);
+
         if (rsrc->layout != PAN_LINEAR) {
                 /* Non-linear resources need to be indirectly mapped */
 
@@ -748,11 +749,8 @@  panfrost_generate_mipmap(
          * reorder-type optimizations in place. But for now prioritize
          * correctness. */
 
-        struct panfrost_batch *batch = panfrost_get_batch_for_fbo(ctx);
-        bool has_draws = batch->last_job.gpu;
-
-        if (has_draws)
-                panfrost_flush(pctx, NULL, PIPE_FLUSH_END_OF_FRAME);
+        panfrost_flush_batch_writing_bo(ctx, rsrc->bo, false);
+        panfrost_bo_wait(rsrc->bo, INT64_MAX);
 
         /* We've flushed the original buffer if needed, now trigger a blit */
 
@@ -765,8 +763,10 @@  panfrost_generate_mipmap(
         /* If the blit was successful, flush once more. If it wasn't, well, let
          * the state tracker deal with it. */
 
-        if (blit_res)
-                panfrost_flush(pctx, NULL, PIPE_FLUSH_END_OF_FRAME);
+        if (blit_res) {
+                panfrost_flush_batch_writing_bo(ctx, rsrc->bo, false);
+                panfrost_bo_wait(rsrc->bo, INT64_MAX);
+        }
 
         return blit_res;
 }

Comments

I think we can simplify `panfrost_flush_draw_deps`. We need to flush
any batches that write BOs we read or write, and any batches that read
BOs we write. Since we collect this information via add_bo, we can
implement this logic generically, without requiring a special case
for every kind of BO we might need to flush, which is verbose and easy
to forget when adding new BOs later. You might need some extra tables in
panfrost_batch.
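
As a rough sketch of the shape I have in mind (nothing here exists
today: it assumes batch->bos becomes a hash table mapping each BO to
made-up PAN_BO_ACCESS_READ/WRITE flags, and that ctx->batches can be
iterated):

static void
panfrost_flush_batches_accessing_bo(struct panfrost_context *ctx,
                                    struct panfrost_bo *bo,
                                    uint32_t access)
{
        /* Walk the pending batches and flush every one whose recorded
         * access to 'bo' conflicts with what the caller is about to do.
         * Read vs. read is the only combination that needs no flush.
         * (A real version would collect the batches first, since
         * submitting may drop them from ctx->batches.) */
        hash_table_foreach(ctx->batches, entry) {
                struct panfrost_batch *batch = entry->data;
                struct hash_entry *e = _mesa_hash_table_search(batch->bos, bo);

                if (!e)
                        continue;

                uint32_t batch_access = (uintptr_t)e->data;

                if ((access | batch_access) & PAN_BO_ACCESS_WRITE)
                        panfrost_batch_submit(batch);
        }
}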

----

On design more generally:

I don't think we want to trigger any flushes at draw time. Rather, we
want to trigger at flush time. Conceptually, right before we send a
batch to the GPU, we ensure all of the other batches it needs have been
sent first and there is a barrier between them (via wait_bo).

The first consequence of delaying is that CPU-side logic can proceed
without being stalled on results.

The second is that different batches can be _totally_ independent.
Consider an app that does the following passes:

[FBO 1: Render a depth map of an object ]
[FBO 2: Render a normal map of that object ]
[Scanout: Render using the depth/normal maps as textures ]

In this case, the app should generate CPU-side batches for all three
render targets at once. Then, when flush() is called, fbo #1 and fbo #2
should be submitted and waited upon so they execute concurrently, then
scanout is submitted and waited. This should be a little faster,
especially paired with _NEXT changes in the kernel. CC'ing Steven to
ensure the principle is sound.

We can model this with a dependency graph, where batches are nodes and
the dependency of a batch X on a batch Y is represented as an edge from
Y to X. So this is a directed arrow graph. For well-behaved apps, the
graph must be acyclic (why?).

This touches on the idea of topological sorting: a topological sort of
the dependency graph is a valid order to submit the batches in. So
hypothetically, you could delay everything to flush time, construct the
dependency graph, do a topological sort, and then submit/wait in the
order of the sort.

But more interesting will be to extend to the concurrent FBO case, an
algorithm for which follows simply from topological sorting:

---

0. Create the dependency graph. Cull nodes that are not connected to the
node we're trying to flush (the scanout batch). In other words, reduce
the graph to its component containing the flushed node. See also
https://en.wikipedia.org/wiki/Connected_component_(graph_theory)#Algorithms

1. For each node with no incoming edges (=batch with no dependencies),
submit this batch. Remove it from the dependency graph, removing all
outgoing edges. Add it to a set of submitted batches.

2. For each submitted batch, wait on that batch.

3. Jump back to step #1 until there are no more nodes with no incoming
edges.

---

Intuitively, the idea is "submit as much as we can all at once, then
wait for it. Keep doing that until we submitted everything we need."

A bit more formally, nodes with no edges have no unsatisfied
dependencies by definition, so we can submit them in any order. We
choose to submit these first. We are allowed to submit a wait at any
time. Once we wait on a batch, it is complete, so any batches that
depend on it have that dependency satisfied, represented by removing the
edge from the dependency graph.

Do note the subtlety of the termination condition: no more nodes
with no incoming edges. This makes it easy to prove that the algorithm
halts, since every iteration either removes a node or halts, and the
number of nodes is a finite non-negative integer.
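
For illustration only, here is the loop above in stand-alone C (the
structs are made up, this is not driver code, and the culling from step
0 is left out):

#include <stdbool.h>

struct batch_node {
        int ndeps;                       /* unsatisfied incoming edges */
        struct batch_node **dependents;  /* outgoing edges */
        int ndependents;
        bool submitted, done;
};

static void
flush_graph(struct batch_node **nodes, int count,
            void (*submit)(struct batch_node *),
            void (*wait)(struct batch_node *))
{
        bool progress;

        do {
                progress = false;

                /* Step 1: submit every batch whose dependencies are done. */
                for (int i = 0; i < count; i++) {
                        struct batch_node *n = nodes[i];
                        if (!n->submitted && n->ndeps == 0) {
                                submit(n);
                                n->submitted = true;
                                progress = true;
                        }
                }

                /* Step 2: wait on that wave; each completed batch drops
                 * its outgoing edges, possibly making dependents ready. */
                for (int i = 0; i < count; i++) {
                        struct batch_node *n = nodes[i];
                        if (n->submitted && !n->done) {
                                wait(n);
                                n->done = true;
                                for (int j = 0; j < n->ndependents; j++)
                                        n->dependents[j]->ndeps--;
                        }
                }
        } while (progress);
}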

* Whether this is a useful optimization is greatly dependent on the
  hardware. The Arm guys can chime in here, but I do know the GPU has
  some parallel execution capabilities so this shouldn't be a total
  waste.
On Fri, 6 Sep 2019 07:40:17 -0400
Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> I think we can simplify `panfrost_flush_draw_deps`. We need to flush
> any batches that write BOs we read or write, and any batches that read
> BOs we write. Since we collect this information via add_bo, we can
> implement this logic generically, without requiring a special case
> for every kind of BO we might need to flush, which is verbose and easy
> to forget when adding new BOs later. You might need some extra tables in
> panfrost_batch.

With the current design, where deps are flushed before issuing a
draw/clear job, the existing add_bo() calls happen too late. This being said,
we could add BOs earlier and store the type of access in batch->bos
(turn it into a hash table where the key is the BO and the data
contains the flags). With that in place, we'd be able to automatically
add BOs to the ctx->{write,read}_bos hash tables.
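
Something along these lines, as a sketch only (pan_bo_access and the
flag names are made up, and BO refcounting is ignored):

enum pan_bo_access {
        PAN_BO_ACCESS_READ  = (1 << 0),
        PAN_BO_ACCESS_WRITE = (1 << 1),
};

static void
panfrost_batch_add_bo(struct panfrost_batch *batch,
                      struct panfrost_bo *bo, uint32_t access)
{
        /* batch->bos as a hash table keyed on the BO; the value
         * accumulates how this batch accesses it. */
        struct hash_entry *entry = _mesa_hash_table_search(batch->bos, bo);

        if (entry) {
                entry->data = (void *)((uintptr_t)entry->data | access);
                return;
        }

        _mesa_hash_table_insert(batch->bos, bo, (void *)(uintptr_t)access);

        /* With the access flags recorded here, the ctx->{write,read}_bos
         * tables (or, later, dep graph edges) can be updated in one place
         * instead of in every draw/clear helper. */
}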

Now, if we go for the dep graph solution, that's probably a non issue,
since deps can be added at any point as long as they are described
before the flush happens.

> 
> ----
> 
> On design more generally:
> 
> I don't think we want to trigger any flushes at draw time. Rather, we
> want to trigger at flush time. Conceptually, right before we send a
> batch to the GPU, we ensure all of the other batches it needs have been
> sent first and there is a barrier between them (via wait_bo).

I agree, and actually had this rework on my TODO list.

> 
> The first consequence of delaying is that CPU-side logic can proceed
> without being stalled on results.
> 
> The second is that different batches can be _totally_ independent.
> Consider an app that does the following passes:
> 
> [FBO 1: Render a depth map of an object ]
> [FBO 2: Render a normal map of that object ]
> [Scanout: Render using the depth/normal maps as textures ]
> 
> In this case, the app should generate CPU-side batches for all three
> render targets at once. Then, when flush() is called, fbo #1 and fbo #2
> should be submitted and waited upon so they execute concurrently, then
> scanout is submitted and waited.

Yes, also thought about that. We'd need to move the out_sync object
to the batch to make that possible, but that's definitely an
improvement I had in mind.
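
Roughly (a sketch, with batch->out_sync being a new hypothetical field):

/* Per-batch syncobj instead of the context-wide ctx->out_sync, so that
 * several batches can be in flight and waited on as a group. */
static void
panfrost_batch_create_sync(struct panfrost_screen *screen,
                           struct panfrost_batch *batch)
{
        drmSyncobjCreate(screen->fd, 0, &batch->out_sync);
}

static void
panfrost_batch_wait(struct panfrost_screen *screen,
                    struct panfrost_batch *batch)
{
        drmSyncobjWait(screen->fd, &batch->out_sync, 1, INT64_MAX, 0, NULL);
}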

> This should be a little faster,
> especially paired with _NEXT changes in the kernel. CC'ing Steven to
> ensure the principle is sound.

Haven't looked at that patch yet.

> 
> We can model this with a dependency graph, where batches are nodes and
> the dependency of a batch X on a batch Y is represented as an edge from
> Y to X. So this is a directed arrow graph. For well-behaved apps, the
> graph must be acyclic (why?).
> 
> This touches on the idea of topological sorting: a topological sort of
> the dependency graph is a valid order to submit the batches in. So
> hypothetically, you could delay everything to flush time, construct the
> dependency graph, do a topological sort, and then submit/wait in the
> order of the sort.
> 
> But more interesting will be to extend to the concurrent FBO case, an
> algorithm for which follows simply from topological sorting:
> 
> ---
> 
> 0. Create the dependency graph. Cull nodes that are not connected to the
> node we're trying to flush (the scanout batch). In other words, reduce
> the graph to its component containing the flushed node. See also
> https://en.wikipedia.org/wiki/Connected_component_(graph_theory)#Algorithms
> 
> 1. For each node with no incoming edges (=batch with no dependencies),
> submit this batch. Remove it from the dependency graph, removing all
> outgoing edges. Add it to a set of submitted batches.
> 
> 2. For each submitted batch, wait on that batch.
> 
> 3. Jump back to step #1 until there are no more nodes with no incoming
> edges.
> 
> ---
> 
> Intuitively, the idea is "submit as much as we can all at once, then
> wait for it. Keep doing that until we submitted everything we need."
> 
> A bit more formally, nodes with no edges have no unsatisfied
> dependencies by definition, so we can submit them in any order. We
> choose to submit these first. We are allowed to submit a wait at any
> time. Once we wait on a batch, it is complete, so any batches that
> depend on it have that dependency satisfied, represented by removing the
> edge from the dependency graph.
> 
> Do note the subtlety of the termination condition: no more nodes
> with no incoming edges. This makes it easy to prove that the algorithm
> halts, since every iteration either removes a node or halts, and the
> number of nodes is a finite non-negative integer.
> 
> * Whether this is a useful optimization is greatly dependent on the
>   hardware. The Arm guys can chime in here, but I do know the GPU has
>   some parallel execution capabilities so this shouldn't be a total
>   waste.

Thanks for the detailed explanation. I'll look into that. This being
said, I was wondering if we shouldn't merge this patch (after I
addressed your first comment maybe) before getting involved in a more
advanced solution (which I agree is what we should aim for).
> Now, if we go for the dep graph solution, that's probably a non issue,
> since deps can be added at any point as long as they are described
> before the flush happens.
>
> [snip]
>
> Thanks for the detailed explanation. I'll look into that. This being
> said, I was wondering if we shouldn't merge this patch (after I
> addressed your first comment maybe) before getting involved in a more
> advanced solution (which I agree is what we should aim for).

If it's alright, I would prefer to focus on patches 1-23; most of it
looks wonderful so the few comments I had should be easily addressed for
the v2.

Once all of that initial work is merged (and your revision queue and my
review queue are cleared), we can circle back to this change.

I would prefer to go straight to a dep graph approach; this patch is a
good intermediate step for testing the earlier patches in the series but
given the extra complexity added for the draw flushing (which you
mention is only needed with the non-graph solution), I don't know if we
should merge.

Thoughts?
On Fri, 6 Sep 2019 08:10:55 -0400
Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> > Now, if we go for the dep graph solution, that's probably a non issue,
> > since deps can be added at any point as long as they are described
> > before the flush happens.
> >
> > [snip]
> >
> > Thanks for the detailed explanation. I'll look into that. This being
> > said, I was wondering if we shouldn't merge this patch (after I
> > addressed your first comment maybe) before getting involved in a more
> > advanced solution (which I agree is what we should aim for).  
> 
> If it's alright, I would prefer to focus on patches 1-23; most of it
> looks wonderful so the few comments I had should be easily addressed for
> the v2.
> 
> Once all of that initial work is merged (and your revision queue and my
> review queue are cleared), we can circle back to this change.
> 
> I would prefer to go straight to a dep graph approach; this patch is a
> good intermediate step for testing the earlier patches in the series but
> given the extra complexity added for the draw flushing (which you
> mention is only needed with the non-graph solution), I don't know if we
> should merge.
> 
> Thoughts?

I'm definitely biased :-), but I do find the changes at hand not that
invasive: most of the logic is placed in helpers that are called in one
or 2 places. I mean, removing those explicit flushes when the time
comes shouldn't be too hard, and I do think it's one step in the right
direction even though it's not the perfect solution yet.

Anyway, I guess having patch 1 to 23 merged would already
significantly reduce my patch queue, and I'm definitely interested in
working on the dep graph solution, so I'm not strongly opposed to the
idea of dropping this patch.
> I'm definitely biased :-), but I do find the changes at hand not that
> invasive: most of the logic is placed in helpers that are called in one
> or 2 places. I mean, removing those explicit flushes when the time
> comes shouldn't be too hard, and I do think it's one step in the right
> direction even though it's not the perfect solution yet.
> 
> Anyway, I guess having patch 1 to 23 merged would already
> significantly reduce my patch queue, and I'm definitely interested in
> working on the dep graph solution, so I'm not strongly opposed to the
> idea of dropping this patch. 

Okay, let's focus on patch 1 to 23 first. Once that's all through, we
can revisit :)
On Fri, 2019-09-06 at 07:40 -0400, Alyssa Rosenzweig wrote:
> I think we can simplify `panfrost_flush_draw_deps`. We need to flush
> any batches that write BOs we read or write, and any batches that read
> BOs we write. Since we collect this information via add_bo, we can
> implement this logic generically, without requiring a special case
> for every kind of BO we might need to flush, which is verbose and easy
> to forget when adding new BOs later. You might need some extra tables in
> panfrost_batch.
> 
> ----
> 
> On design more generally:
> 
> I don't think we want to trigger any flushes at draw time. Rather, we
> want to trigger at flush time. Conceptually, right before we send a
> batch to the GPU, we ensure all of the other batches it needs have been
> sent first and there is a barrier between them (via wait_bo).
> 
> The first consequence of delaying is that CPU-side logic can proceed
> without being stalled on results.
> 
> The second is that different batches can be _totally_ independent.
> Consider an app that does the following passes:
> 
> [FBO 1: Render a depth map of an object ]
> [FBO 2: Render a normal map of that object ]
> [Scanout: Render using the depth/normal maps as textures ]
> 
> In this case, the app should generate CPU-side batches for all three
> render targets at once. Then, when flush() is called, fbo #1 and fbo #2
> should be submitted and waited upon so they execute concurrently, then
> scanout is submitted and waited. This should be a little faster,
> especially paired with _NEXT changes in the kernel. CC'ing Steven to
> ensure the principle is sound.

Yes, this is how the hardware was designed to be used. The idea is that
the vertex processing can be submitted into the hardware back-to-back
(using the _NEXT registers) and then the fragment shading of e.g. FBO 1
can overlap with the vertex processing of FBO 2.

> We can model this with a dependency graph, where batches are nodes and
> the dependency of a batch X on a batch Y is represented as an edge from
> Y to X. So this is a directed arrow graph. For well-behaved apps, the
> graph must be acyclic (why?).

Again, this is how kbase is designed. kbase refers to the hardware job
chains as "atoms" (because we also have "soft-jobs" that are software
only equivalents executed in the kernel). The base_jd_atom_v2 structure
has two dependencies and we have a "dependency only atom" to allow
greater fan-out.

The submission mechanism ensures the graph is acyclic by submitting the
atoms one-by-one and not allowing a dependency on an atom which hasn't
been submitted yet (aside: this is then completely broken by the
existence of other synchronisation mechanisms which can introduce
cycles).

> This touches on the idea of topological sorting: a topological sort of
> the dependency graph is a valid order to submit the batches in. So
> hypothetically, you could delay everything to flush time, construct the
> dependency graph, do a topological sort, and then submit/wait in the
> order of the sort.

Be aware that the topological sort needs to be intelligent. For
instance, frames are (usually) logically independent in the sort, but
you really don't want the GPU to do the work for frame N+1 before doing
frame N.

kbase actually has two different types of dependency: "data dependency"
and "order dependency". A "data dependency" is where the job uses the
output of a previous job (the 'real' dependencies). An "order dependency"
is where there is a meaningful order but the jobs are actually
independent.

The main benefit of the two types is that it enables better recovery in
case of errors (e.g. if frame N fails to render, we can still render
frame N+1 even though we have an ordering dependency between them).
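
If the Mesa side grows a dep graph, the edges could carry the same
distinction. Something like (made-up names, purely illustrative):

enum pan_dep_type {
        PAN_DEP_ORDER, /* jobs are independent, only the order matters */
        PAN_DEP_DATA,  /* the consumer reads the producer's output */
};

struct pan_batch_dep {
        struct panfrost_batch *producer;
        enum pan_dep_type type;
};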

Steve