[Spice-devel] streaming: Use the optimal number of threads for VP8 encoding

Submitted by Francois Gouget on Oct. 31, 2016, 7:49 p.m.

Details

Message ID E1c1IaB-0002Ro-5w@amboise
State New
Headers show
Series "streaming: Use the optimal number of threads for VP8 encoding" ( rev: 1 ) in Spice

Not browsing as part of any series.

Commit Message

Francois Gouget Oct. 31, 2016, 7:49 p.m.
We run the VP8 encoder in real time mode so it uses only the minimum
amount of time needed to encode each frame. However by default it
only uses one thread so that for large/complex frames it may run at
less than the source fps. Besides resulting in dropped frames this
blocks the main server thread for most of the time.
So this patch configures the VP8 encoder to use all the CPU's physical
core, resulting in less wall clock time spent in encode_frame().

Signed-off-by: Francois Gouget <fgouget@codeweavers.com>
---

I am resubmitting this patch because I think the reasons for not
applying it last time were wrong. See:
https://lists.freedesktop.org/archives/spice-devel/2016-March/027026.html

Here is an illustration of the impact of threading for the 
big_buck_bunny_1080p_h264.mov video:

http://fgouget.free.fr/tmp/Spice-vp8-threads.png
http://fgouget.free.fr/tmp/Spice-vp8-threads.xls

The graph shows the time spent in encode_frame() (taken from the 
standard traces) when vp8enc uses 1, 2 or 4 threads.

One can see that the one-thread line spends quite a bit of time above 
the 33 ms mark which corresponds to the 30 fps of the source material. 
This means dropped frames. Indeed, the x axis corresponds to the frame 
number and we can clearly see the 1-thread line getting out of sync with 
the others as it encoded fewer frames.

The two-thread line is much lower and only goes above the 33 ms mark for 
a short time. The four-thread line is a bit lower still but we can also 
see diminishing returns there.

I'll also note that the h264 encoder automatically uses multiple 
threads already so this patch only brings vp8enc in line with it.


 configure.ac               |  4 ++++
 server/gstreamer-encoder.c | 32 ++++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

Patch hide | download patch | download mbox

diff --git a/configure.ac b/configure.ac
index 68aed15..6742577 100644
--- a/configure.ac
+++ b/configure.ac
@@ -150,6 +150,10 @@  AC_SUBST([SPICE_PROTOCOL_MIN_VER])
 PKG_CHECK_MODULES([GLIB2], [glib-2.0 >= 2.22 gio-2.0 >= 2.22])
 AS_VAR_APPEND([SPICE_REQUIRES], [" glib-2.0 >= 2.22 gio-2.0 >= 2.22"])
 
+AC_CHECK_LIB(glib-2.0, g_get_num_processors,
+             AC_DEFINE([HAVE_G_GET_NUMPROCESSORS], 1, [Defined if we have g_get_num_processors()]),,
+             $GLIB2_LIBS)
+
 PKG_CHECK_MODULES([GOBJECT2], [gobject-2.0 >= 2.22])
 AS_VAR_APPEND([SPICE_REQUIRES], [" gobject-2.0 >= 2.22"])
 
diff --git a/server/gstreamer-encoder.c b/server/gstreamer-encoder.c
index a101ab6..eb2a28c 100644
--- a/server/gstreamer-encoder.c
+++ b/server/gstreamer-encoder.c
@@ -866,6 +866,27 @@  static GstFlowReturn new_sample(GstAppSink *gstappsink, gpointer video_encoder)
     return GST_FLOW_OK;
 }
 
+static int physical_core_count = 0;
+static int get_physical_core_count(void)
+{
+    if (!physical_core_count) {
+#ifdef HAVE_G_GET_NUMPROCESSORS
+        physical_core_count = g_get_num_processors();
+#elif defined(_SC_NPROCESSORS_ONLN)
+        physical_core_count = sysconf(_SC_NPROCESSORS_ONLN);
+#endif
+        if (system("egrep -l '^flags\\b.*: .*\\bht\\b' /proc/cpuinfo >/dev/null 2>&1") == 0) {
+            /* Hyperthreading is enabled so divide by two to get the number
+             * of physical cores.
+             */
+            physical_core_count = physical_core_count / 2;
+        }
+        if (physical_core_count == 0)
+            physical_core_count = 1;
+    }
+    return physical_core_count;
+}
+
 static const gchar* get_gst_codec_name(SpiceGstEncoder *encoder)
 {
     switch (encoder->base.codec_type)
@@ -887,6 +908,7 @@  static const gchar* get_gst_codec_name(SpiceGstEncoder *encoder)
     }
 }
 
+/* A helper for spice_gst_encoder_encode_frame() */
 static gboolean create_pipeline(SpiceGstEncoder *encoder)
 {
 #ifdef HAVE_GSTREAMER_0_10
@@ -925,11 +947,17 @@  static gboolean create_pipeline(SpiceGstEncoder *encoder)
          *   75% CPU usage while speed simply prioritizes encoding speed.
          * - deadline is supposed to be set in microseconds but in practice
          *   it behaves like a boolean.
+         * - At least up to GStreamer 1.6.2, vp8enc cannot be trusted to pick
+         *   the optimal number of threads. Also exceeding the number of
+         *   physical core really degrades image quality.
+         * - token-parts/token-partitions parallelizes more operations.
          */
+        int threads = get_physical_core_count();
+        int parts = threads < 2 ? 0 : threads < 4 ? 1 : threads < 8 ? 2 : 3;
 #ifdef HAVE_GSTREAMER_0_10
-        gstenc_opts = g_strdup_printf("mode=cbr min-quantizer=10 error-resilient=true max-latency=0 speed=7");
+        gstenc_opts = g_strdup_printf("mode=cbr min-quantizer=10 error-resilient=true max-latency=0 speed=7 threads=%d token-parts=%d", threads, parts);
 #else
-        gstenc_opts = g_strdup_printf("end-usage=cbr min-quantizer=10 error-resilient=default lag-in-frames=0 deadline=1 cpu-used=4");
+        gstenc_opts = g_strdup_printf("end-usage=cbr min-quantizer=10 error-resilient=default lag-in-frames=0 deadline=1 cpu-used=4 threads=%d token-partitions=%d", threads, parts);
 #endif
         break;
         }

Comments

Hi,

On Mon, Oct 31, 2016 at 08:49:23PM +0100, Francois Gouget wrote:
> We run the VP8 encoder in real time mode so it uses only the minimum
> amount of time needed to encode each frame. However by default it
> only uses one thread so that for large/complex frames it may run at
> less than the source fps. Besides resulting in dropped frames this
> blocks the main server thread for most of the time.
> So this patch configures the VP8 encoder to use all the CPU's physical
> core, resulting in less wall clock time spent in encode_frame().
> 
> Signed-off-by: Francois Gouget <fgouget@codeweavers.com>
> ---
> 
> I am resubmitting this patch because I think the reasons for not
> applying it last time were wrong. See:
> https://lists.freedesktop.org/archives/spice-devel/2016-March/027026.html
> 
> Here is an illustration of the impact of threading for the 
> big_buck_bunny_1080p_h264.mov video:
> 
> http://fgouget.free.fr/tmp/Spice-vp8-threads.png
> http://fgouget.free.fr/tmp/Spice-vp8-threads.xls
> 
> The graph shows the time spent in encode_frame() (taken from the 
> standard traces) when vp8enc uses 1, 2 or 4 threads.

Well, I did not say that using more threads was not useful, I was only
worried that using all CPUs could not be so good (eg multiple VMs
encoding at the same time), especially as I think you mentioned that
trying to use too many threads was causing rendering issues?

> I'll also note that the h264 encoder automatically uses multiple 
> threads already so this patch only brings vp8enc in line with it.

After a quick look at x264, it seems to be using more threads than
physiical CPUs. Is it ok too with vp8enc? I'd really like that we don't
have that code invoking egrep as a fallback, ie no g_get_num_processors,
no _SC_NPROCESSORS_ONLN, tough luck? (the latter is available on el6).

Christophe
> 
> Hi,
> 
> On Mon, Oct 31, 2016 at 08:49:23PM +0100, Francois Gouget wrote:
> > We run the VP8 encoder in real time mode so it uses only the minimum
> > amount of time needed to encode each frame. However by default it
> > only uses one thread so that for large/complex frames it may run at
> > less than the source fps. Besides resulting in dropped frames this
> > blocks the main server thread for most of the time.
> > So this patch configures the VP8 encoder to use all the CPU's physical
> > core, resulting in less wall clock time spent in encode_frame().
> > 
> > Signed-off-by: Francois Gouget <fgouget@codeweavers.com>
> > ---
> > 
> > I am resubmitting this patch because I think the reasons for not
> > applying it last time were wrong. See:
> > https://lists.freedesktop.org/archives/spice-devel/2016-March/027026.html
> > 
> > Here is an illustration of the impact of threading for the
> > big_buck_bunny_1080p_h264.mov video:
> > 
> > http://fgouget.free.fr/tmp/Spice-vp8-threads.png
> > http://fgouget.free.fr/tmp/Spice-vp8-threads.xls
> > 
> > The graph shows the time spent in encode_frame() (taken from the
> > standard traces) when vp8enc uses 1, 2 or 4 threads.
> 
> Well, I did not say that using more threads was not useful, I was only
> worried that using all CPUs could not be so good (eg multiple VMs
> encoding at the same time), especially as I think you mentioned that
> trying to use too many threads was causing rendering issues?
> 

I agree with both... I think the solution is an option to enable/disable
and personally I would prefer to use less thread by default.

> > I'll also note that the h264 encoder automatically uses multiple
> > threads already so this patch only brings vp8enc in line with it.
> 
> After a quick look at x264, it seems to be using more threads than
> physiical CPUs. Is it ok too with vp8enc? I'd really like that we don't
> have that code invoking egrep as a fallback, ie no g_get_num_processors,
> no _SC_NPROCESSORS_ONLN, tough luck? (the latter is available on el6).
> 
> Christophe
> 

I think we can assume, if system has not these options has only one
core...

Frediano
> 
> > I'll also note that the h264 encoder automatically uses multiple
> > threads already so this patch only brings vp8enc in line with it.
> 
> After a quick look at x264, it seems to be using more threads than
> physiical CPUs. Is it ok too with vp8enc? I'd really like that we don't
> have that code invoking egrep as a fallback, ie no g_get_num_processors,
> no _SC_NPROCESSORS_ONLN, tough luck? (the latter is available on el6).
> 
> Christophe
> 

About getting the number of processors in the system looks like
nproc(1) uses sched_getaffinity followed by a CPU_COUNT on the
result. It's supported since kernel 2.5.8 and glibc 2.3.
Basically is equivalent to use sysconf(_SC_NPROCESSORS_ONLN).

Frediano