[3/3] win32: Allow gdi operations for argb32 surfaces (allowed by surface flags)

Submitted by Vasily Galkin on April 28, 2018, 7:27 p.m.

Details

Message ID 20180428192701.22786-3-galkin-vv@yandex.ru
State New
Series "Series without cover letter" ( rev: 1 ) in Cairo

Commit Message

This ends the patch series that speeds up CAIRO_OPERATOR_SOURCE
when it is used to copy data
to an argb32 cairo surface corresponding to a win32 dc
from a "backbuffer": a DibSection-based cairo surface
created with cairo_surface_create_similar().

This final patch allows the gdi compositor to be used on argb32 surfaces.
In practice, for display surfaces only copying is allowed with gdi (by BitBlt),
since other operations are filtered by flags in the implementations.

But since copying pixels is the only operation used in the common scenario
"prepare an offscreen image and put it on the screen", this is important for
presenting argb32 windows with cairo directly
or with gtk+gdk (which nowadays always creates argb32 windows).

Before this patch, pixel copy worked by:
1. mapping the image to memory (by copying data from the window dc to system memory,
which is very slow on windows, maybe due to gpu or interprocess access);
2. copying the new data over that image;
3. copying the updated image from system memory back to the window dc.

After this patch there is only one step:

2+3. Copying the new data over the window dc.
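
The difference between the two paths can be sketched with a toy model in portable C. This is only an illustration: plain memory buffers stand in for the window dc and the DibSection backbuffer, no GDI call is made, and every name (toy_surface_t, present_via_map, present_direct) is hypothetical rather than taken from cairo. The byte counter shows why dropping step 1 matters: the old path moves three times the data, and in reality the read-back is the slow part.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: a plain buffer plays the role of the window dc. */
typedef struct {
    uint32_t *pixels;       /* 32-bit pixels of the "window" */
    size_t    count;        /* number of pixels */
    size_t    bytes_moved;  /* running total of bytes copied */
} toy_surface_t;

/* Copy pixels and account for the traffic on the given surface. */
static void toy_copy (toy_surface_t *acct, uint32_t *dst,
                      const uint32_t *src, size_t count)
{
    memcpy (dst, src, count * sizeof (uint32_t));
    acct->bytes_moved += count * sizeof (uint32_t);
}

/* Old path: read back, overwrite, write back. Returns 0 on success. */
static int present_via_map (toy_surface_t *window, const uint32_t *backbuffer)
{
    uint32_t *image = malloc (window->count * sizeof (uint32_t));
    if (image == NULL)
        return -1;
    toy_copy (window, image, window->pixels, window->count); /* step 1 */
    toy_copy (window, image, backbuffer, window->count);     /* step 2 */
    toy_copy (window, window->pixels, image, window->count); /* step 3 */
    free (image);
    return 0;
}

/* New path: steps 2+3 merged into one copy (BitBlt in the real code). */
static void present_direct (toy_surface_t *window, const uint32_t *backbuffer)
{
    toy_copy (window, window->pixels, backbuffer, window->count);
}
```

Both functions leave the same pixels in the "window"; only the amount of data moved differs.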

Completely eliminating step 1 gives a huge speedup and allows
argb32 cairo drawing to be as fast as typical dibsection-buffered gdi drawing.

There is a quick and dirty cairo-vs-gdi perf test made for this patch set:
https://gitlab.gnome.org/galkinvv/cairo/snippets/109
The results show a severalfold improvement:

Before speedup

Painting 5000 32bits-per-pixel single-color frames of size 1056x1056 for profiling
GDI entire pipeline      : 4.123983 GB/s, 5408.053900 ms, 924.546998 FPS
GDI entire drawing      : 4.156272 GB/s, 5366.039400 ms
cairo entire pipeline      : 0.835951 GB/s, 26679.463300 ms, 187.410067 FPS
cairo entire drawing      : 0.838992 GB/s, 26582.750800 ms
cairo fill inmem    : 16.130683 GB/s, 1382.627100 ms
cairo to window     : 1.102623 GB/s, 20226.963700 ms

After speedup (running several times shows that there is 5-10% inaccuracy, so these results shouldn't be used as a source for comparing raw gdi vs cairo)

Painting 5000 32bits-per-pixel single-color frames of size 1056x1056 for profiling
GDI entire pipeline      : 4.139421 GB/s, 5387.883400 ms, 928.008204 FPS
GDI entire drawing      : 4.165124 GB/s, 5354.635400 ms
cairo entire pipeline      : 4.029344 GB/s, 5535.075100 ms, 903.330110 FPS
cairo entire drawing      : 4.063073 GB/s, 5489.126000 ms
cairo fill inmem    : 22.665569 GB/s, 983.991200 ms
cairo to window     : 5.049950 GB/s, 4416.423700 ms
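
The GB/s and FPS columns can be reproduced from the frame count, frame size and elapsed time. The sketch below is not the code of the perf test; the helper names are hypothetical, and it assumes 4 bytes per pixel and 1 GB = 10^9 bytes, which is what makes the printed figures come out.

```c
/* Total bytes written for a run of identical frames. */
static double total_bytes (int frames, int width, int height, int bytes_per_pixel)
{
    return (double) frames * width * height * bytes_per_pixel;
}

/* Throughput in GB/s, assuming 1 GB = 1e9 bytes. */
static double throughput_gb_per_s (double bytes, double elapsed_ms)
{
    return bytes / (elapsed_ms / 1000.0) / 1e9;
}

/* Frames per second for the whole run. */
static double frames_per_s (int frames, double elapsed_ms)
{
    return frames / (elapsed_ms / 1000.0);
}

/* Ratio of two elapsed times for the same workload. */
static double speedup (double before_ms, double after_ms)
{
    return before_ms / after_ms;
}
```

For example, throughput_gb_per_s (total_bytes (5000, 1056, 1056, 4), 5408.0539) gives about 4.124, matching the first GDI line. Applied to the two runs quoted here, speedup (26679.4633, 5535.0751) is about 4.8 for the whole cairo pipeline and speedup (20226.9637, 4416.4237) is about 4.6 for "cairo to window".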

There is an end-user visible speedup too; it relates to the following bug:

https://gitlab.gnome.org/GNOME/meld/issues/133

The cairo speedup allows more simultaneous meld windows
without eating 100% of cpu core time on spinner rendering.

gtk's speedup is near 1.7x, not as huge as pure cairo's ~7-8x on the results above.
It looks like gtk has some problems caching cairo surfaces
and recreates them every frame with an initial black fill.
---
 src/win32/cairo-win32-gdi-compositor.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/win32/cairo-win32-gdi-compositor.c b/src/win32/cairo-win32-gdi-compositor.c
index 0873391..4a09a70 100644
--- a/src/win32/cairo-win32-gdi-compositor.c
+++ b/src/win32/cairo-win32-gdi-compositor.c
@@ -488,7 +488,8 @@  static cairo_bool_t check_blit (cairo_composite_rectangles_t *composite)
     if (dst->fallback)
 	return FALSE;
 
-    if (dst->win32.format != CAIRO_FORMAT_RGB24)
+    if (dst->win32.format != CAIRO_FORMAT_RGB24
+	&& dst->win32.format != CAIRO_FORMAT_ARGB32)
 	return FALSE;
 
     if (dst->win32.flags & CAIRO_WIN32_SURFACE_CAN_BITBLT)

Comments

On 28.04.2018 22:27, Vasily Galkin wrote:
> gtk's speedup is near 1.7x,
2.3-2.5x for me

> not such huge as pure cairo ~7-8x on results above
> It looks that gtk has some problems in caching cairo surfaces
> and recreates them every frame with initial black fill.

Not exactly. Here's FPS data for GTK-3.22 fishbowl with old and new cairo:

old and busted:
Layered mode - 35.2 fps
Normal mode - 21.2 fps
Normal mode (optimized double-buffering) - 21.2 fps
Normal mode (generic double-buffering) - 20.0 fps

new hotness:
Layered mode - 35.0 fps
Normal mode - 21.2 fps
Normal mode (optimized double-buffering) - 53 fps (x2.5 faster than normal)
Normal mode (generic double-buffering) - 49 fps (x2.3 faster than normal)

Layered mode doesn't blit anything, so it shows no improvement (the 0.2 change is
likely due to measurement error).

Normal mode doesn't seem to be covered by the "draw everything into a buffer
then blit once" case that you've optimized for, so it also has no visible
improvements.

Double-buffered mode with GDK built-in double-buffering (where GDK creates a
new double-buffer on every redraw) is x2.3 faster, and optimized backend
double-buffering (where DB surface is not re-created on every redraw) is x2.5
faster.

Note that in either case GDK will erase the painted region (well, in case of
generic DB it likely re-creates the new DB surface in a clear state) before
drawing anything, which is required for correct alpha-transparency - otherwise
semi-transparent regions will "stack up" on every redraw. If i deliberately
disable that eraser code, optimized double-buffering fps increases to 55 fps
(i.e. very little), but alpha-transparent regions are screwed.
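
The "stacking up" effect can be shown with a single-channel model of the OVER operator. This is only a sketch, not GDK's code: it tracks one premultiplied alpha value per pixel, and the names over_alpha and redraw are hypothetical.

```c
/* Alpha of "src OVER dst" for premultiplied values in [0, 1]. */
static double over_alpha (double src_a, double dst_a)
{
    return src_a + dst_a * (1.0 - src_a);
}

/* Redraw one semi-transparent pixel `frames` times into a buffer
 * that starts fully transparent. If erase_first is nonzero, the
 * buffer is cleared before every frame, as the eraser code does. */
static double redraw (double src_a, int frames, int erase_first)
{
    double dst_a = 0.0;
    for (int i = 0; i < frames; i++) {
        if (erase_first)
            dst_a = 0.0;
        dst_a = over_alpha (src_a, dst_a);
    }
    return dst_a;
}
```

With a 50% transparent source, redraw (0.5, 2, 1) stays at 0.5, while redraw (0.5, 2, 0) climbs to 0.75 after two frames and keeps approaching 1.0 on later ones.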

FPS values are for the GTK fishbowl benchmark window maximized on my 4K
desktop, so, taking into account the taskbar, that makes it 3591x2160, and the
fishbowl widget itself is a bit smaller vertically.

Anyway, i'm pretty sure that GTK is drawing as best as it can, and there's no
x7 speedup anywhere in sight (that said, i was also pretty sure that cairo was
drawing as best as it can; shows what i know...).

As for the x1.7 vs x2.3, it could be attributed to you [presumably] having
smaller test windows, in which case the time spent on actual blitting is
smaller, and thus the speedup, only affecting that time, doesn't have as much
impact.
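
This is the usual Amdahl's-law argument: if only the blit gets faster, the overall gain depends on what share of the frame time the blit took. A minimal sketch, with a hypothetical function name and made-up input values in the example (they are not measurements from either machine):

```c
/* Overall speedup when a fraction `blit_fraction` of the frame time
 * (0..1) becomes `blit_speedup` times faster and the rest is unchanged. */
static double overall_speedup (double blit_fraction, double blit_speedup)
{
    return 1.0 / ((1.0 - blit_fraction) + blit_fraction / blit_speedup);
}
```

With a blit that becomes 4x faster, overall_speedup (0.5, 4.0) is 1.6, while overall_speedup (0.75, 4.0) is about 2.29, so a larger window, where blitting dominates the frame time, sees more of the improvement.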