[PATCHv4,wayland-protocols] text-input: Add v3 of the text-input protocol

Submitted by Dorota Czaplejewicz on May 3, 2018, 3:41 p.m.

Details

Message ID 20180503154121.5894-1-dorota.czaplejewicz@puri.sm
State Superseded
Series "text-input: Add v3 of the text-input protocol"
Commit Message

Dorota Czaplejewicz May 3, 2018, 3:41 p.m.
From: Carlos Garnacho <carlosg@gnome.org>

This new protocol description is a simplification over v2.

- All pre-edit text styling is gone.
- Pre-edit cursor can span characters.
- No events regarding input panel (OSK) state nor covered rectangle.
  Compositors are still free to handle situations where the keyboard
  focus rectangle is covered by the input panel.
- No set_preferred_language request for clients.
- There is no event to send keysyms. Compositors can use wl_keyboard
  interface instead.
- All state is double-buffered, with specified state.
- Use Unicode codepoints to measure strings.

Signed-off-by: Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm>
Signed-off-by: Carlos Garnacho <carlosg@gnome.org>
---
This is the next update coming from Purism to perfect the text input protocol.

The following changes were added on top of PATCHv3:

- Fixed whitespace.
- Removed enable flags - the same information can be gathered from the first requests after enter.
- Changed offsets inside UTF-8 strings to use Unicode character counts in order to remove the possibility of communicating invalid state.
- Specified the exact lifetime of double-buffered state, and initial values.
- Made changes requested by the IM double-buffered.
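
The double-buffering described in the last two items can be modelled roughly as follows. This is a minimal Python sketch and not part of the patch: the names `State` and `TextInputModel` are made up for illustration, and treating disable as taking effect on commit is an assumption, since the XML does not spell that out.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    """One snapshot of client-side text-input state (hypothetical model)."""
    enabled: bool = False
    # (text, cursor, anchor), offsets in code points; None means "not supported".
    surrounding: Optional[Tuple[str, int, int]] = None
    # (hint, purpose); the documented initial values are none (0x0) and normal (0).
    content_type: Tuple[int, int] = (0x0, 0)
    # (x, y, width, height) in surface-local coordinates; None means unset.
    cursor_rect: Optional[Tuple[int, int, int, int]] = None


class TextInputModel:
    """Tracks pending vs. current state the way the compositor would see it."""

    def __init__(self) -> None:
        self.current = State()
        self.pending = State()

    def enable(self) -> None:
        # enable resets everything to initial values; applied on commit.
        self.pending = State(enabled=True)

    def disable(self) -> None:
        # Assumption: disable is also applied on the next commit.
        self.pending = State(enabled=False)

    def set_surrounding_text(self, text: str, cursor: int, anchor: int) -> None:
        self.pending = replace(self.pending, surrounding=(text, cursor, anchor))

    def set_content_type(self, hint: int, purpose: int) -> None:
        # A real compositor may ignore updates after the first commit.
        self.pending = replace(self.pending, content_type=(hint, purpose))

    def set_cursor_rectangle(self, x: int, y: int, width: int, height: int) -> None:
        self.pending = replace(self.pending, cursor_rect=(x, y, width, height))

    def commit(self) -> None:
        # Atomically replace the current state with the pending state.
        self.current = self.pending
```

Nothing set between enable and commit is visible in `current` until commit is called, and a second enable followed by commit drops the previous values back to their initial state.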

Some questions remain open. One is: how to specify how much text to capture in set_surrounding_text, and how often to update?
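
To make the question concrete, here is one possible client-side policy, sketched in Python. It is only an illustration and not something this patch specifies: it keeps the selection and grows a window around it until the 4000-byte limit mentioned in the set_surrounding_text description is reached. The function name `window_surrounding_text` and the choice to grow evenly on both sides are assumptions.

```python
def window_surrounding_text(text, cursor, anchor, max_bytes=4000):
    """Trim text around the selection so its UTF-8 form fits in max_bytes.

    cursor and anchor are code point offsets into text. Returns a tuple
    (trimmed_text, cursor, anchor) with the offsets rebased to the trimmed
    string. Python str indices are code points, so slicing never splits one.
    """
    start, end = min(cursor, anchor), max(cursor, anchor)
    size = len(text[start:end].encode("utf-8"))
    if size > max_bytes:
        raise ValueError("selection alone exceeds the message limit")

    grew = True
    while grew:
        grew = False
        # Try to take one more code point before the window.
        if start > 0:
            n = len(text[start - 1].encode("utf-8"))
            if size + n <= max_bytes:
                start -= 1
                size += n
                grew = True
        # Try to take one more code point after the window.
        if end < len(text):
            n = len(text[end].encode("utf-8"))
            if size + n <= max_bytes:
                end += 1
                size += n
                grew = True

    return text[start:end], cursor - start, anchor - start
```

Short strings pass through unchanged. How often to resend the window is a separate question that this sketch does not address.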

A possible change that I decided against for now is to replace the enable/disable requests with create/destroy of a new object, which would encode more of the state lifetimes in the protocol itself.

After reading a blog post on fcitx [0], I got the impression that letting the compositor know some persistent ID of a text edit instance could be useful. However, I'm not sure what the use cases are.

As always, I'm happy to hear feedback.

Cheers,
Dorota Czaplejewicz

[0] https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/

 Makefile.am                                    |   1 +
 unstable/text-input/text-input-unstable-v3.xml | 362 +++++++++++++++++++++++++
 2 files changed, 363 insertions(+)
 create mode 100644 unstable/text-input/text-input-unstable-v3.xml

diff --git a/Makefile.am b/Makefile.am
index 4b9a901..86d7ca9 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -3,6 +3,7 @@  unstable_protocols =								\
 	unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml		\
 	unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml			\
 	unstable/text-input/text-input-unstable-v1.xml				\
+	unstable/text-input/text-input-unstable-v3.xml				\
 	unstable/input-method/input-method-unstable-v1.xml			\
 	unstable/xdg-shell/xdg-shell-unstable-v5.xml				\
 	unstable/xdg-shell/xdg-shell-unstable-v6.xml				\
diff --git a/unstable/text-input/text-input-unstable-v3.xml b/unstable/text-input/text-input-unstable-v3.xml
new file mode 100644
index 0000000..ed5204f
--- /dev/null
+++ b/unstable/text-input/text-input-unstable-v3.xml
@@ -0,0 +1,362 @@ 
+<?xml version="1.0" encoding="UTF-8"?>
+
+<protocol name="text_input_unstable_v3">
+  <copyright>
+    Copyright © 2012, 2013 Intel Corporation
+    Copyright © 2015, 2016 Jan Arne Petersen
+    Copyright © 2017, 2018 Red Hat, Inc.
+    Copyright © 2018 Purism SPC
+
+    Permission to use, copy, modify, distribute, and sell this
+    software and its documentation for any purpose is hereby granted
+    without fee, provided that the above copyright notice appear in
+    all copies and that both that copyright notice and this permission
+    notice appear in supporting documentation, and that the name of
+    the copyright holders not be used in advertising or publicity
+    pertaining to distribution of the software without specific,
+    written prior permission.  The copyright holders make no
+    representations about the suitability of this software for any
+    purpose.  It is provided "as is" without express or implied
+    warranty.
+
+    THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
+    SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+    FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
+    SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
+    AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
+    ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
+    THIS SOFTWARE.
+  </copyright>
+
+  <interface name="zwp_text_input_v3" version="1">
+    <description summary="text input">
+      The zwp_text_input_v3 interface represents text input and input methods
+      associated with a seat. It provides enter/leave events to follow the
+      text input focus for a seat.
+
+      Requests are used to enable/disable the text-input object and set
+      state information like surrounding and selected text or the content type.
+      The information about the entered text is sent to the text-input object
+      via the pre-edit and commit_string events.
+
+      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
+      grapheme is made up of multiple code points, an index pointing to any of
+      them should be interpreted as pointing to the first one.
+
+      Focus moving throughout surfaces will result in the emission of
+      zwp_text_input_v3.enter and zwp_text_input_v3.leave events. The focused
+      surface must perform zwp_text_input_v3.enable and
+      zwp_text_input_v3.disable requests as the keyboard focus moves across
+      editable and non-editable elements of the UI. Those two requests are not
+      expected to be paired with each other, the compositor must be able to
+      handle consecutive series of the same request.
+
+      State is sent by the state requests (set_surrounding_text,
+      set_content_type and set_cursor_rectangle) and a commit request.
+      After an enter event or disable request all state information is
+      invalidated and needs to be resent by the client.
+
+      This protocol defines requests and events necessary for regular clients
+      to communicate with an input method. The zwp_input_method protocol
+      defines the interfaces necessary to implement standalone input methods.
+      If a compositor implements both interfaces, it will be the arbiter of the
+      communication between both.
+
+      Warning! The protocol described in this file is experimental and
+      backward incompatible changes may be made. Backward compatible changes
+      may be added together with the corresponding interface version bump.
+      Backward incompatible changes are done by bumping the version number in
+      the protocol and interface names and resetting the interface version.
+      Once the protocol is to be declared stable, the 'z' prefix and the
+      version number in the protocol and interface names are removed and the
+      interface version number is reset.
+    </description>
+
+    <request name="destroy" type="destructor">
+      <description summary="Destroy the wp_text_input">
+       Destroy the wp_text_input object. Also disables all surfaces enabled
+       through this wp_text_input object.
+      </description>
+    </request>
+
+    <request name="enable">
+      <description summary="Request text input to be enabled">
+        Requests text input. This request should be issued every time the
+        active text input changes, including within one surface.
+
+        This request resets all state associated with previous enable,
+        set_surrounding_text, set_content_type, and set_cursor_rectangle
+        requests, as well as the state associated with preedit_string,
+        commit_string, and delete_surrounding_text events.
+
+        The set_surrounding_text, set_content_type and set_cursor_rectangle
+        requests should follow if the text input supports the necessary
+        functionality.
+
+        The changes must be applied by the compositor after issuing a
+        zwp_text_input_v3.commit request.
+      </description>
+    </request>
+
+    <request name="disable">
+      <description summary="Disable text input on a surface">
+        Explicitly disable text input in a surface (typically when there is no
+        focus on any text entry inside the surface).
+      </description>
+    </request>
+
+    <request name="set_surrounding_text">
+      <description summary="sets the surrounding text">
+        Sets the surrounding plain text around the input position.
+
+        Text is UTF-8 encoded. Cursor is the Unicode code point offset within
+        the surrounding text.
+        Anchor is the Unicode code point offset of the selection anchor within
+        the surrounding text. If there is no selected text, anchor is the same
+        as cursor.
+
+        If the client is unaware of the text around the cursor, it should not
+        issue this request, to signify lack of support to the compositor.
+
+        There is a maximum length of wayland messages so text can not be
+        longer than 4000 bytes.
+
+        Values set with this request are double-buffered. They will get applied
+        on the next zwp_text_input_v3.commit request, and stay valid until the
+        next enable or disable request.
+
+        The initial state for affected fields is empty, meaning that the text
+        input does not support sending surrounding text. If the empty values
+        get applied, subsequent attempts to change them may have no effect.
+      </description>
+      <arg name="text" type="string"/>
+      <arg name="cursor" type="int"/>
+      <arg name="anchor" type="int"/>
+    </request>
+
+    <enum name="content_hint" bitfield="true">
+      <description summary="content hint">
+       Content hint is a bitmask to allow to modify the behavior of the text
+       input.
+      </description>
+      <entry name="none" value="0x0" summary="no special behavior"/>
+      <entry name="completion" value="0x1" summary="suggest word completions"/>
+      <entry name="spellcheck" value="0x2" summary="suggest word corrections"/>
+      <entry name="auto_capitalization" value="0x4" summary="switch to uppercase letters at the start of a sentence"/>
+      <entry name="lowercase" value="0x8" summary="prefer lowercase letters"/>
+      <entry name="uppercase" value="0x10" summary="prefer uppercase letters"/>
+      <entry name="titlecase" value="0x20" summary="prefer casing for titles and headings (can be language dependent)"/>
+      <entry name="hidden_text" value="0x40" summary="characters should be hidden"/>
+      <entry name="sensitive_data" value="0x80" summary="typed text should not be stored"/>
+      <entry name="latin" value="0x100" summary="just Latin characters should be entered"/>
+      <entry name="multiline" value="0x200" summary="the text input is multiline"/>
+    </enum>
+
+    <enum name="content_purpose">
+      <description summary="content purpose">
+       The content purpose allows to specify the primary purpose of a text
+       input.
+
+       This allows an input method to show special purpose input panels with
+       extra characters or to disallow some characters.
+      </description>
+      <entry name="normal" value="0" summary="default input, allowing all characters"/>
+      <entry name="alpha" value="1" summary="allow only alphabetic characters"/>
+      <entry name="digits" value="2" summary="allow only digits"/>
+      <entry name="number" value="3" summary="input a number (including decimal separator and sign)"/>
+      <entry name="phone" value="4" summary="input a phone number"/>
+      <entry name="url" value="5" summary="input an URL"/>
+      <entry name="email" value="6" summary="input an email address"/>
+      <entry name="name" value="7" summary="input a name of a person"/>
+      <entry name="password" value="8" summary="input a password (combine with sensitive_data hint)"/>
+      <entry name="pin" value="9" summary="input is a numeric password (combine with sensitive_data hint)"/>
+      <entry name="date" value="10" summary="input a date"/>
+      <entry name="time" value="11" summary="input a time"/>
+      <entry name="datetime" value="12" summary="input a date and time"/>
+      <entry name="terminal" value="13" summary="input for a terminal"/>
+    </enum>
+
+    <request name="set_content_type">
+      <description summary="set content purpose and hint">
+        Sets the content purpose and content hint. While the purpose is the
+        basic purpose of an input field, the hint flags allow to modify some
+        of the behavior.
+
+        Values set with this request are double-buffered. They will get applied
+        on the first zwp_text_input_v3.commit request after an enable request.
+        Subsequent attempts to update them may have no effect. The values
+        remain valid until the next enable or disable request.
+
+        The initial value for hint is none, and the initial value for purpose
+        is normal.
+      </description>
+      <arg name="hint" type="uint" enum="content_hint"/>
+      <arg name="purpose" type="uint" enum="content_purpose"/>
+    </request>
+
+    <request name="set_cursor_rectangle">
+      <description summary="set cursor position">
+        Marks an area around the cursor as a x, y, width, height rectangle in surface
+        local coordinates.
+
+        Allows the compositor to put a window with word suggestions near the
+        cursor, without obstructing the text being input.
+
+        If the client is unaware of the position of edited text, it should not
+        issue this request, to signify lack of support to the compositor.
+
+        Values set with this request are double-buffered. They will get applied
+        on the next zwp_text_input_v3.commit request, and stay valid until the
+        next enable or disable request.
+
+        The initial values describing a cursor rectangle are empty. That means
+        the text input does not support describing the cursor area. If the
+        empty values get applied, subsequent attempts to change them may have
+        no effect.
+      </description>
+      <arg name="x" type="int"/>
+      <arg name="y" type="int"/>
+      <arg name="width" type="int"/>
+      <arg name="height" type="int"/>
+    </request>
+
+    <request name="commit">
+      <description summary="commit state">
+        Text input state (content purpose, content hint, surrounding text,
+        cursor rectangle) is conceptually double-buffered within the context
+        of a text input, i.e. between an enable request and the following
+        enable or disable request.
+
+        Protocol requests modify the pending state, as opposed to the current
+        state in use by the input method. A commit request atomically applies
+        all pending state, replacing the current state. After commit, the new
+        pending state is as documented for each related request.
+
+        The enable request performs a special role by indicating that the state
+        should be reset and updated with new values on the nearest commit.
+
+        The current or pending state are not modified unless noted otherwise.
+      </description>
+    </request>
+
+    <event name="enter">
+      <description summary="enter event">
+       Notification that this seat's text-input focus is on a certain surface.
+
+       When the seat has the keyboard capability the text-input focus follows
+       the keyboard focus.
+      </description>
+      <arg name="surface" type="object" interface="wl_surface"/>
+    </event>
+
+    <event name="leave">
+      <description summary="leave event">
+       Notification that this seat's text-input focus is no longer on
+       a certain surface. The client should reset any preedit string previously
+       set.
+
+       The leave notification is sent before the enter notification
+       for the new focus.
+
+       When the seat has the keyboard capability the text-input focus follows
+       the keyboard focus.
+      </description>
+      <arg name="surface" type="object" interface="wl_surface"/>
+    </event>
+
+    <event name="preedit_string">
+      <description summary="pre-edit">
+        Notify when a new composing text (pre-edit) should be set around the
+        current cursor position. Any previously set composing text should
+        be removed.
+
+        Values set with this event are double-buffered. They must be applied on
+        the next zwp_text_input_v3.done event, and stay valid until the
+        next enable or disable request.
+
+        The parameters cursor_begin and cursor_end are counted in Unicode
+        code points relative to the beginning of the submitted string. Cursor
+        should be hidden when both are equal to -1.
+
+        They could be represented by the client as a line if both values are the
+        same, or as a text highlight otherwise.
+
+        The initial value of text is an empty string, and cursor_begin and
+        cursor_end are both 0.
+      </description>
+      <arg name="text" type="string" allow-null="true"/>
+      <arg name="cursor_begin" type="int"/>
+      <arg name="cursor_end" type="int"/>
+    </event>
+
+    <event name="commit_string">
+      <description summary="text commit">
+        Notify when text should be inserted into the editor widget. The text to
+        commit could be either just a single character after a key press or the
+        result of some composing (pre-edit).
+
+        Values set with this event are double-buffered. They must be applied
+        and reset to initial on the next zwp_text_input_v3.done event.
+
+        The initial value of text is an empty string.
+      </description>
+      <arg name="text" type="string" allow-null="true"/>
+    </event>
+
+    <event name="delete_surrounding_text">
+      <description summary="delete surrounding text">
+        Notify when the text around the current cursor position should be
+        deleted. Before_length and after_length are the number of Unicode
+        code points before and after the current cursor position (excluding the
+        selection) to delete.
+
+        Values set with this event are double-buffered. They must be applied
+        and reset to initial on the next zwp_text_input_v3.done event.
+
+        The initial values of both before_length and after_length are 0.
+      </description>
+      <arg name="before_length" type="uint" summary="length of text before current cursor position"/>
+      <arg name="after_length" type="uint" summary="length of text after current cursor position"/>
+    </event>
+
+    <event name="done">
+      <description summary="apply changes">
+        Instruct the application to apply changes to state requested by the
+        preedit_string, commit_string and delete_surrounding_text events. The
+        state relating to these events is double-buffered, and each one
+        modifies the pending state. This event replaces the current state with
+        the pending state.
+
+        The application should proceed by evaluating the changes in the
+        following order:
+
+        1. Replace existing preedit string with the cursor.
+        2. Delete requested surrounding text.
+        3. Insert commit string with the cursor at its end.
+        4. Insert new preedit text in cursor position.
+        5. Place cursor inside preedit text.
+      </description>
+    </event>
+  </interface>
+
+  <interface name="zwp_text_input_manager_v3" version="1">
+    <description summary="text input manager">
+      A factory for text-input objects. This object is a global singleton.
+    </description>
+
+    <request name="destroy" type="destructor">
+      <description summary="Destroy the wp_text_input_manager">
+       Destroy the wp_text_input_manager object.
+      </description>
+    </request>
+
+    <request name="get_text_input">
+      <description summary="create a new text input object">
+       Creates a new text-input object for a given seat.
+      </description>
+      <arg name="id" type="new_id" interface="zwp_text_input_v3"/>
+      <arg name="seat" type="object" interface="wl_seat"/>
+    </request>
+  </interface>
+</protocol>
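
The evaluation order listed for the done event can be illustrated with a small client-side model. This is a hedged Python sketch, not code from the patch: the `Editor` class and its field names are invented for the example, selections are ignored, and the preedit is kept separate from the committed text so that step 1 reduces to clearing it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Editor:
    """Toy text widget; all offsets are code points (Python str indices)."""
    text: str = ""                            # committed text only
    cursor: int = 0                           # insertion point in text
    preedit: str = ""                         # composing text shown at cursor
    preedit_cursor: Tuple[int, int] = (0, 0)  # (cursor_begin, cursor_end)

    def apply_done(self, preedit="", cursor_begin=0, cursor_end=0,
                   commit="", before=0, after=0):
        # 1. Replace the existing preedit string with the cursor.
        self.preedit = ""
        # 2. Delete the requested surrounding text (clamped to the buffer).
        start = max(0, self.cursor - before)
        end = min(len(self.text), self.cursor + after)
        self.text = self.text[:start] + self.text[end:]
        self.cursor = start
        # 3. Insert the commit string, leaving the cursor at its end.
        self.text = self.text[:self.cursor] + commit + self.text[self.cursor:]
        self.cursor += len(commit)
        # 4. Insert the new preedit text at the cursor position.
        self.preedit = preedit
        # 5. Place the cursor inside the preedit text.
        self.preedit_cursor = (cursor_begin, cursor_end)

    def display(self) -> str:
        """What the user sees: committed text with the preedit spliced in."""
        return self.text[:self.cursor] + self.preedit + self.text[self.cursor:]
```

The keyword arguments stand in for the pending values of preedit_string, commit_string and delete_surrounding_text; their defaults mirror the documented initial values, so an event that was not sent before done has no effect.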

Comments

Silvan Jegen May 3, 2018, 6:47 p.m.
Hi Dorota

Some comments and typo fixes below.

On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> diff --git a/unstable/text-input/text-input-unstable-v3.xml b/unstable/text-input/text-input-unstable-v3.xml
> new file mode 100644
> index 0000000..ed5204f
> --- /dev/null
> +++ b/unstable/text-input/text-input-unstable-v3.xml
> @@ -0,0 +1,362 @@
> +<?xml version="1.0" encoding="UTF-8"?>
> +
> +<protocol name="text_input_unstable_v3">
> +  <copyright>
> +    Copyright © 2012, 2013 Intel Corporation
> +    Copyright © 2015, 2016 Jan Arne Petersen
> +    Copyright © 2017, 2018 Red Hat, Inc.
> +    Copyright © 2018 Purism SPC
> +
> +    Permission to use, copy, modify, distribute, and sell this
> +    software and its documentation for any purpose is hereby granted
> +    without fee, provided that the above copyright notice appear in
> +    all copies and that both that copyright notice and this permission
> +    notice appear in supporting documentation, and that the name of
> +    the copyright holders not be used in advertising or publicity
> +    pertaining to distribution of the software without specific,
> +    written prior permission.  The copyright holders make no
> +    representations about the suitability of this software for any
> +    purpose.  It is provided "as is" without express or implied
> +    warranty.
> +
> +    THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
> +    SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
> +    FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
> +    SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> +    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
> +    AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
> +    ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
> +    THIS SOFTWARE.
> +  </copyright>
> +
> +  <interface name="zwp_text_input_v3" version="1">
> +    <description summary="text input">
> +      The zwp_text_input_v3 interface represents text input and input methods
> +      associated with a seat. It provides enter/leave events to follow the
> +      text input focus for a seat.
> +
> +      Requests are used to enable/disable the text-input object and set
> +      state information like surrounding and selected text or the content type.
> +      The information about the entered text is sent to the text-input object
> +      via the pre-edit and commit_string events.
> +
> +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> +      grapheme is made up of multiple code points, an index pointing to any of
> +      them should be interpreted as pointing to the first one.

That way we make sure we don't put the cursor/anchor between bytes that
belong to the same UTF-8 encoded Unicode code point, which is nice. It
also means that the client has to parse all the UTF-8 encoded strings
into Unicode code points up to the desired cursor/anchor position
on each "preedit_string" event. For each "delete_surrounding_text" event
the client has to parse the UTF-8 sequences before and after the cursor
position up to the requested Unicode code point.

I feel like we are processing the UTF-8 string already in the
input-method. So I am not sure that we should parse it again on the
client side. Parsing it again would also mean that the client would need
to know about UTF-8 which would be nice to avoid.

Thoughts?
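
For reference, the client-side parsing discussed here amounts to one linear scan over the bytes. A minimal Python sketch, assuming the input is already valid UTF-8 as the protocol requires (the helper name `byte_offset` is made up for this example):

```python
def byte_offset(utf8: bytes, code_points: int) -> int:
    """Return the byte index at which the given code point offset starts.

    Every byte that is not a continuation byte (0b10xxxxxx) begins a new
    code point, so counting those bytes is enough for valid UTF-8.
    """
    seen = 0
    for i, b in enumerate(utf8):
        if (b & 0xC0) != 0x80:          # start of a code point
            if seen == code_points:
                return i
            seen += 1
    if seen == code_points:             # offset just past the last code point
        return len(utf8)
    raise ValueError("offset is past the end of the string")
```

The scan costs time proportional to the byte length up to the requested position, which is the overhead in question for each preedit_string and delete_surrounding_text event.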


> +
> +      Focus moving throughout surfaces will result in the emission of
> +      zwp_text_input_v3.enter and zwp_text_input_v3.leave events. The focused
> +      surface must perform zwp_text_input_v3.enable and
> +      zwp_text_input_v3.disable requests as the keyboard focus moves across
> +      editable and non-editable elements of the UI. Those two requests are not
> +      expected to be paired with each other, the compositor must be able to
> +      handle consecutive series of the same request.
> +
> +      State is sent by the state requests (set_surrounding_text,
> +      set_content_type and set_cursor_rectangle) and a commit request.
> +      After an enter event or disable request all state information is
> +      invalidated and needs to be resent by the client.
> +
> +      This protocol defines requests and events necessary for regular clients
> +      to communicate with an input method. The zwp_input_method protocol
> +      defines the interfaces necessary to implement standalone input methods.
> +      If a compositor implements both interfaces, it will be the arbiter of the
> +      communication between both.
> +
> +      Warning! The protocol described in this file is experimental and
> +      backward incompatible changes may be made. Backward compatible changes
> +      may be added together with the corresponding interface version bump.
> +      Backward incompatible changes are done by bumping the version number in
> +      the protocol and interface names and resetting the interface version.
> +      Once the protocol is to be declared stable, the 'z' prefix and the
> +      version number in the protocol and interface names are removed and the
> +      interface version number is reset.
> +    </description>
> +
> +    <request name="destroy" type="destructor">
> +      <description summary="Destroy the wp_text_input">
> +       Destroy the wp_text_input object. Also disables all surfaces enabled
> +       through this wp_text_input object.
> +      </description>
> +    </request>
> +
> +    <request name="enable">
> +      <description summary="Request text input to be enabled">
> +        Requests text input. This request should be issued every time the
> +        active text input changes, including within one surface.
> +
> +        This request resets all state associated with previous enable,
> +        set_surrounding_text, set_content_type, and set_cursor_rectangle
> +        requests, as well as the state associated with preedit_string,
> +        commit_string, and delete_surrounding_text events.
> +
> +        The set_surrounding_text, set_content_type and set_cursor_rectangle
> +        requests should follow if the text input supports the necessary
> +        functionality.
> +
> +        The changes must be applied by the compositor after issuing a
> +        zwp_text_input_v3.commit request.
> +      </description>
> +    </request>
> +
> +    <request name="disable">
> +      <description summary="Disable text input on a surface">
> +        Explicitly disable text input in a surface (typically when there is no
> +        focus on any text entry inside the surface).
> +      </description>
> +    </request>
> +
> +    <request name="set_surrounding_text">
> +      <description summary="sets the surrounding text">
> +        Sets the surrounding plain text around the input position.
> +
> +        Text is UTF-8 encoded. Cursor is the Unicode code point offset within
> +        the surrounding text.
> +        Anchor is the Unicode code point offset of the selection anchor within
> +        the surrounding text. If there is no selected text, anchor is the same
> +        as cursor.
> +
> +        If the client is unaware of the text around the cursor, it should not
> +        issue this request, to signify lack of support to the compositor.
> +
> +        There is a maximum length of wayland messages so text can not be
> +        longer than 4000 bytes.
> +
> +        Values set with this request are double-buffered. They will get applied
> +        on the next zwp_text_input_v3.commit request, and stay valid until the
> +        next enable or disable request.
> +
> +        The initial state for affected fields is empty, meaning that the text
> +        input does not support sending surrounding text. If the empty values
> +        get applied, subsequent attempts to change them may have no effect.
> +      </description>
> +      <arg name="text" type="string"/>
> +      <arg name="cursor" type="int"/>
> +      <arg name="anchor" type="int"/>
> +    </request>
> +
> +    <enum name="content_hint" bitfield="true">
> +      <description summary="content hint">
> +       Content hint is a bitmask to allow to modify the behavior of the text
> +       input.
> +      </description>
> +      <entry name="none" value="0x0" summary="no special behavior"/>
> +      <entry name="completion" value="0x1" summary="suggest word completions"/>
> +      <entry name="spellcheck" value="0x2" summary="suggest word corrections"/>
> +      <entry name="auto_capitalization" value="0x4" summary="switch to uppercase letters at the start of a sentence"/>
> +      <entry name="lowercase" value="0x8" summary="prefer lowercase letters"/>
> +      <entry name="uppercase" value="0x10" summary="prefer uppercase letters"/>
> +      <entry name="titlecase" value="0x20" summary="prefer casing for titles and headings (can be language dependent)"/>
> +      <entry name="hidden_text" value="0x40" summary="characters should be hidden"/>
> +      <entry name="sensitive_data" value="0x80" summary="typed text should not be stored"/>
> +      <entry name="latin" value="0x100" summary="just Latin characters should be entered"/>
> +      <entry name="multiline" value="0x200" summary="the text input is multiline"/>
> +    </enum>
> +
> +    <enum name="content_purpose">
> +      <description summary="content purpose">
> +       The content purpose allows to specify the primary purpose of a text
> +       input.
> +
> +       This allows an input method to show special purpose input panels with
> +       extra characters or to disallow some characters.
> +      </description>
> +      <entry name="normal" value="0" summary="default input, allowing all characters"/>
> +      <entry name="alpha" value="1" summary="allow only alphabetic characters"/>
> +      <entry name="digits" value="2" summary="allow only digits"/>
> +      <entry name="number" value="3" summary="input a number (including decimal separator and sign)"/>
> +      <entry name="phone" value="4" summary="input a phone number"/>
> +      <entry name="url" value="5" summary="input an URL"/>
> +      <entry name="email" value="6" summary="input an email address"/>
> +      <entry name="name" value="7" summary="input a name of a person"/>
> +      <entry name="password" value="8" summary="input a password (combine with sensitive_data hint)"/>
> +      <entry name="pin" value="9" summary="input is a numeric password (combine with sensitive_data hint)"/>
> +      <entry name="date" value="10" summary="input a date"/>
> +      <entry name="time" value="11" summary="input a time"/>
> +      <entry name="datetime" value="12" summary="input a date and time"/>
> +      <entry name="terminal" value="13" summary="input for a terminal"/>
> +    </enum>
> +
> +    <request name="set_content_type">
> +      <description summary="set content purpose and hint">
> +        Sets the content purpose and content hint. While the purpose is the
> +        basic purpose of an input field, the hint flags allow to modify some
> +        of the behavior.
> +
> +        Values set with this request are double-buffered. They will get applied
> +        on the first zwp_text_input_v3.commit request after an enabl request.

s/enabl/enable/


> +        Subsequent attempts to update them may have no effect. The values
> +        remain valid until the next enable or disable request.
> +
> +        The initial value for hint is none, and the initial value for purpose
> +        is normal.
> +      </description>
> +      <arg name="hint" type="uint" enum="content_hint"/>
> +      <arg name="purpose" type="uint" enum="content_purpose"/>
> +    </request>
> +
> +    <request name="set_cursor_rectangle">
> +      <description summary="set cursor position">
> +        Marks an area around the cursor as a x, y, width, height rectangle in surface
> +        local coordinates.
> +
> +        Allows the compositor to put a window with word suggestions near the
> +        cursor, without obstructing the text being input.
> +
> +        If the client is unaware of the position of edited text, it should not
> +        issue this request, to signify lack of support to the compositor.
> +
> +        Values set with this request are double-buffered. They will get applied
> +        on the next zwp_text_input_v3.commit request, and stay valid until the
> +        next enable or disable request.
> +
> +        The initial values describing a cursor rectangle are empty. That means
> +        the text input does not support describing the cursor area. If the
> +        empty values get applied, subsequent attempts to change them may have
> +        no effect.
> +      </description>
> +      <arg name="x" type="int"/>
> +      <arg name="y" type="int"/>
> +      <arg name="width" type="int"/>
> +      <arg name="height" type="int"/>
> +    </request>
> +
> +    <request name="commit">
> +      <description summary="commit state">
> +        Text input state (content purpose, content hint, surrounding text,
> +        cursor rectangle) is conceptually double-buffered within the context
> +        of a text input, i.e. between an enable request and the following
> +        enable or disable request.
> +
> +        Protocol requests modify the pending state, as opposed to the current
> +        state in use by the input method. A commit request atomically applies
> +        all pending state, replacing the current state. After commit, the new
> +        pending state is as documented for each related request.
> +
> +        The enable request performs a special role by indicating that the state

Maybe "plays a special role" sounds more natural than "performs a special
role"?


> +        should be reset and updated with new values on the nearest commit.
> +
> +        The current or pending state are not modified unless noted otherwise.
> +      </description>
> +    </request>
> +
> +    <event name="enter">
> +      <description summary="enter event">
> +       Notification that this seat's text-input focus is on a certain surface.
> +
> +       When the seat has the keyboard capability the text-input focus follows
> +       the keyboard focus.
> +      </description>
> +      <arg name="surface" type="object" interface="wl_surface"/>
> +    </event>
> +
> +    <event name="leave">
> +      <description summary="leave event">
> +       Notification that this seat's text-input focus is no longer on
> +       a certain surface. The client should reset any preedit string previously
> +       set.
> +
> +       The leave notification is sent before the enter notification
> +       for the new focus.
> +
> +       When the seat has the keyboard capability the text-input focus follows
> +       the keyboard focus.
> +      </description>
> +      <arg name="surface" type="object" interface="wl_surface"/>
> +    </event>
> +
> +    <event name="preedit_string">
> +      <description summary="pre-edit">
> +        Notify when a new composing text (pre-edit) should be set around the
> +        current cursor position. Any previously set composing text should
> +        be removed.
> +
> +        Values set with this event are double-buffered. They must be applied on
> +        the next zwp_text_input_v3.done event, and stay valid until the
> +        next enable or disable request.
> +
> +        The parameters cursor_begin and cursor_end are counted in Unicode
> +        code points relative to the beginning of the submitted string. Cursor
> +        should be hidden when both are equal to -1.
> +
> +        They could be represented by the cient as a line if both values are the
> +        same, or as a text highligt otherwise.

s/highligt/highlight/


> +
> +        The initial value of text is an empty string, and cursor_begin and
> +        cursor_end are both 0.
> +      </description>
> +      <arg name="text" type="string" allow-null="true"/>
> +      <arg name="cursor_begin" type="int"/>
> +      <arg name="cursor_end" type="int"/>
> +    </event>
> +
> +    <event name="commit_string">
> +      <description summary="text commit">
> +        Notify when text should be inserted into the editor widget. The text to
> +        commit could be either just a single character after a key press or the
> +        result of some composing (pre-edit).
> +
> +        Values set with this event are double-buffered. They must be applied
> +        and reset to initial on the next zwp_text_input_v3.done event.
> +
> +        The initial value of text is an empty string.
> +      </description>
> +      <arg name="text" type="string" allow-null="true"/>
> +    </event>
> +
> +    <event name="delete_surrounding_text">
> +      <description summary="delete surrounding text">
> +        Notify when the text around the current cursor position should be
> +        deleted. Before_length and after_length are the number of Unicode
> +        code points before and after the current cursor position (excluding the
> +        selection) to delete.
> +
> +        Values set with this event are double-buffered. They must be applied
> +        and reset to initial on the next zwp_text_input_v3.done event.
> +
> +        The initial values of both before_length and after_length are 0.
> +      </description>
> +      <arg name="before_length" type="uint" summary="length of text before current cursor position"/>
> +      <arg name="after_length" type="uint" summary="length of text after current cursor position"/>
> +    </event>
> +
> +    <event name="done">
> +      <description summary="apply changes">
> +        Instruct the application to apply changes to state requested by the
> +        preedit_string, commit_string and delete_surrounding_string events. The

s/delete_surrounding_string/delete_surrounding_text/

Thanks for all the work!


Cheers,

Silvan
Dorota Czaplejewicz May 3, 2018, 7:22 p.m.
On Thu, 3 May 2018 20:47:27 +0200
Silvan Jegen <s.jegen@gmail.com> wrote:

> Hi Dorota
> 
> Some comments and typo fixes below.
> 
> On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > This new protocol description is a simplification over v2.
> > 
> > - All pre-edit text styling is gone.
> > - Pre-edit cursor can span characters.
> > - No events regarding input panel (OSK) state nor covered rectangle.
> >   Compositors are still free to handle situations where the keyboard
> >   focus rectangle is covered by the input panel.
> > - No set_preferred_language request for clients.
> > - There is no event to send keysyms. Compositors can use wl_keyboard
> >   interface instead.
> > - All state is double-buffered, with specified state.
> > - Use Unicode codepoints to measure strings.
> > 
> > Signed-off-by: Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm>
> > Signed-off-by: Carlos Garnacho <carlosg@gnome.org>
> > ---
> > This is the next update coming from Purism to perfect the text input protocol.
> > 
> > The following changes added on top of PATCHv3:
> > 
> > - Fixed whitespaces.
> > - Removed enable flags - the same information can be gathered from the first requests after enter.
> > - Changed offsets inside UTF-8 strings to use Unicode character counts in order to remove the possibility of communicating invalid state.
> > - Specified the exact lifetime of double-buffered state, and initial values.
> > - Made changes requested by the IM double-buffered.
> > 
> > Some questions remain open. One is: how to specify how much text to capture in set_surrounding_text, and how often to update?
> > 
> > A possible change that I decided against for now is to replace enable/disable events by create/destroy of a new object, which would make more state lifetimes encoded in the protocol.
> > 
> > After reading a blog post on fcitx [0], I got the impression that letting the compositor know some persistent ID of a text edit instance could be useful, however I'm not sure what the use cases are.
> > 
> > As always, I'm happy to hear feedback.
> > 
> > Cheers,
> > Dorota Czaplejewicz
> > 
> > [0] https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
> > 
> >  Makefile.am                                    |   1 +
> >  unstable/text-input/text-input-unstable-v3.xml | 362 +++++++++++++++++++++++++
> >  2 files changed, 363 insertions(+)
> >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
> > 
> > diff --git a/Makefile.am b/Makefile.am
> > index 4b9a901..86d7ca9 100644
> > --- a/Makefile.am
> > +++ b/Makefile.am
> > @@ -3,6 +3,7 @@ unstable_protocols =								\
> >  	unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml		\
> >  	unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml			\
> >  	unstable/text-input/text-input-unstable-v1.xml				\
> > +	unstable/text-input/text-input-unstable-v3.xml				\
> >  	unstable/input-method/input-method-unstable-v1.xml			\
> >  	unstable/xdg-shell/xdg-shell-unstable-v5.xml				\
> >  	unstable/xdg-shell/xdg-shell-unstable-v6.xml				\
> > diff --git a/unstable/text-input/text-input-unstable-v3.xml b/unstable/text-input/text-input-unstable-v3.xml
> > new file mode 100644
> > index 0000000..ed5204f
> > --- /dev/null
> > +++ b/unstable/text-input/text-input-unstable-v3.xml
> > @@ -0,0 +1,362 @@
> > +<?xml version="1.0" encoding="UTF-8"?>
> > +
> > +<protocol name="text_input_unstable_v3">
> > +  <copyright>
> > +    Copyright © 2012, 2013 Intel Corporation
> > +    Copyright © 2015, 2016 Jan Arne Petersen
> > +    Copyright © 2017, 2018 Red Hat, Inc.
> > +    Copyright © 2018 Purism SPC
> > +
> > +    Permission to use, copy, modify, distribute, and sell this
> > +    software and its documentation for any purpose is hereby granted
> > +    without fee, provided that the above copyright notice appear in
> > +    all copies and that both that copyright notice and this permission
> > +    notice appear in supporting documentation, and that the name of
> > +    the copyright holders not be used in advertising or publicity
> > +    pertaining to distribution of the software without specific,
> > +    written prior permission.  The copyright holders make no
> > +    representations about the suitability of this software for any
> > +    purpose.  It is provided "as is" without express or implied
> > +    warranty.
> > +
> > +    THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
> > +    SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
> > +    FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
> > +    SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> > +    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
> > +    AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
> > +    ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
> > +    THIS SOFTWARE.
> > +  </copyright>
> > +
> > +  <interface name="zwp_text_input_v3" version="1">
> > +    <description summary="text input">
> > +      The zwp_text_input_v3 interface represents text input and input methods
> > +      associated with a seat. It provides enter/leave events to follow the
> > +      text input focus for a seat.
> > +
> > +      Requests are used to enable/disable the text-input object and set
> > +      state information like surrounding and selected text or the content type.
> > +      The information about the entered text is sent to the text-input object
> > +      via the pre-edit and commit_string events.
> > +
> > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > +      grapheme is made up of multiple code points, an index pointing to any of
> > +      them should be interpreted as pointing to the first one.  
> 
> That way we make sure we don't put the cursor/anchor between bytes that
> belong to the same UTF-8 encoded Unicode code point which is nice. It
> also means that the client has to parse all the UTF-8 encoded strings
> into Unicode code points up to the desired cursor/anchor position
> on each "preedit_string" event. For each "delete_surrounding_text" event
> the client has to parse the UTF-8 sequences before and after the cursor
> position up to the requested Unicode code point.
> 
> I feel like we are processing the UTF-8 string already in the
> input-method. So I am not sure that we should parse it again on the
> client side. Parsing it again would also mean that the client would need
> to know about UTF-8 which would be nice to avoid.
> 
> Thoughts?

The client needs to know about Unicode, but not necessarily about UTF-8. Specifying code points is actually an advantage here, because byte offsets are inherently expressed relative to UTF-8. By counting in code points, the client's internal representation can be UTF-16 or maybe even something else.

There's no avoiding the parsing either. What the application cares about is that the cursor falls between glyphs. The application cannot know that in all cases. Unicode allows the same sequence to be displayed in multiple ways (fallback):

https://unicode.org/emoji/charts/emoji-zwj-sequences.html

One could make an argument that byte offsets should never be close to ZWJ characters, but I think this decision is better left to the application, which knows what exactly it is presenting to the user.
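
To illustrate the point about code point offsets, here is a minimal sketch in Python. It is not part of the patch, and the helper names are hypothetical. It shows how a client could map a code point offset onto whatever internal representation it uses (UTF-8 bytes or UTF-16 code units), and that a single ZWJ emoji sequence still consists of several code points that the application has to group itself.

```python
def codepoint_to_utf8_offset(text: str, cp_offset: int) -> int:
    """Byte offset in the UTF-8 encoding of `text` for a code point offset.

    Python strings are indexed by code point, so slicing gives the prefix
    directly; a C client would walk the UTF-8 lead bytes instead.
    """
    return len(text[:cp_offset].encode("utf-8"))


def codepoint_to_utf16_offset(text: str, cp_offset: int) -> int:
    """Offset in UTF-16 code units for the same code point offset."""
    return len(text[:cp_offset].encode("utf-16-le")) // 2


# "a", EURO SIGN, GRINNING FACE, "b": 1 + 3 + 4 UTF-8 bytes before "b".
sample = "a\u20ac\U0001F600b"
utf8_offset = codepoint_to_utf8_offset(sample, 3)
utf16_offset = codepoint_to_utf16_offset(sample, 3)

# A family emoji built from MAN, ZWJ, WOMAN, ZWJ, GIRL is five code points,
# although a font with the right glyph draws it as a single image.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
family_codepoints = len(family)
```

Whether an offset in the middle of such a sequence is acceptable depends on how the application renders it, which is why the decision stays with the application.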

> 
> 
> > +
> > +      Focus moving throughout surfaces will result in the emission of
> > +      zwp_text_input_v3.enter and zwp_text_input_v3.leave events. The focused
> > +      surface must perform zwp_text_input_v3.enable and
> > +      zwp_text_input_v3.disable requests as the keyboard focus moves across
> > +      editable and non-editable elements of the UI. Those two requests are not
> > +      expected to be paired with each other, the compositor must be able to
> > +      handle consecutive series of the same request.
> > +
> > +      State is sent by the state requests (set_surrounding_text,
> > +      set_content_type and set_cursor_rectangle) and a commit request.
> > +      After an enter event or disable request all state information is
> > +      invalidated and needs to be resent by the client.
> > +
> > +      This protocol defines requests and events necessary for regular clients
> > +      to communicate with an input method. The zwp_input_method protocol
> > +      defines the interfaces necessary to implement standalone input methods.
> > +      If a compositor implements both interfaces, it will be the arbiter of the
> > +      communication between both.
> > +
> > +      Warning! The protocol described in this file is experimental and
> > +      backward incompatible changes may be made. Backward compatible changes
> > +      may be added together with the corresponding interface version bump.
> > +      Backward incompatible changes are done by bumping the version number in
> > +      the protocol and interface names and resetting the interface version.
> > +      Once the protocol is to be declared stable, the 'z' prefix and the
> > +      version number in the protocol and interface names are removed and the
> > +      interface version number is reset.
> > +    </description>
> > +
> > +    <request name="destroy" type="destructor">
> > +      <description summary="Destroy the wp_text_input">
> > +       Destroy the wp_text_input object. Also disables all surfaces enabled
> > +       through this wp_text_input object.
> > +      </description>
> > +    </request>
> > +
> > +    <request name="enable">
> > +      <description summary="Request text input to be enabled">
> > +        Requests text input. This request should be issued every time the
> > +        active text input changes, including within one surface.
> > +
> > +        This request resets all state associated with previous enable,
> > +        set_surrounding_text, set_content_type, and set_cursor_rectangle
> > +        requests, as well as the state associated with preedit_string,
> > +        commit_string, and delete_surrounding_text events.
> > +
> > +        The set_surrounding_text, set_content_type and set_cursor_rectangle
> > +        requests should follow if the text input supports the necessary
> > +        functionality.
> > +
> > +        The changes must be applied by the compositor after issuing a
> > +        zwp_text_input_v3.commit request.
> > +      </description>
> > +    </request>
> > +
> > +    <request name="disable">
> > +      <description summary="Disable text input on a surface">
> > +        Explicitly disable text input in a surface (typically when there is no
> > +        focus on any text entry inside the surface).
> > +      </description>
> > +    </request>
> > +
> > +    <request name="set_surrounding_text">
> > +      <description summary="sets the surrounding text">
> > +        Sets the surrounding plain text around the input position.
> > +
> > +        Text is UTF-8 encoded. Cursor is the Unicode code point offset within
> > +        the surrounding text.
> > +        Anchor is the Unicode code point offset of the selection anchor within
> > +        the surrounding text. If there is no selected text, anchor is the same
> > +        as cursor.
> > +
> > +        If the client is unaware of the text around the cursor, it should not
> > +        issue this request, to signify lack of support to the compositor.
> > +
> > +        There is a maximum length of wayland messages so text can not be
> > +        longer than 4000 bytes.
> > +
> > +        Values set with this request are double-buffered. They will get applied
> > +        on the next zwp_text_input_v3.commit request, and stay valid until the
> > +        next enable or disable request.
> > +
> > +        The initial state for affected fields is empty, meaning that the text
> > +        input does not support sending surrounding text. If the empty values
> > +        get applied, subsequent attempts to change them may have no effect.
> > +      </description>
> > +      <arg name="text" type="string"/>
> > +      <arg name="cursor" type="int"/>
> > +      <arg name="anchor" type="int"/>
> > +    </request>
> > +
> > +    <enum name="content_hint" bitfield="true">
> > +      <description summary="content hint">
> > +       Content hint is a bitmask to allow to modify the behavior of the text
> > +       input.
> > +      </description>
> > +      <entry name="none" value="0x0" summary="no special behavior"/>
> > +      <entry name="completion" value="0x1" summary="suggest word completions"/>
> > +      <entry name="spellcheck" value="0x2" summary="suggest word corrections"/>
> > +      <entry name="auto_capitalization" value="0x4" summary="switch to uppercase letters at the start of a sentence"/>
> > +      <entry name="lowercase" value="0x8" summary="prefer lowercase letters"/>
> > +      <entry name="uppercase" value="0x10" summary="prefer uppercase letters"/>
> > +      <entry name="titlecase" value="0x20" summary="prefer casing for titles and headings (can be language dependent)"/>
> > +      <entry name="hidden_text" value="0x40" summary="characters should be hidden"/>
> > +      <entry name="sensitive_data" value="0x80" summary="typed text should not be stored"/>
> > +      <entry name="latin" value="0x100" summary="just Latin characters should be entered"/>
> > +      <entry name="multiline" value="0x200" summary="the text input is multiline"/>
> > +    </enum>
> > +
> > +    <enum name="content_purpose">
> > +      <description summary="content purpose">
> > +       The content purpose allows to specify the primary purpose of a text
> > +       input.
> > +
> > +       This allows an input method to show special purpose input panels with
> > +       extra characters or to disallow some characters.
> > +      </description>
> > +      <entry name="normal" value="0" summary="default input, allowing all characters"/>
> > +      <entry name="alpha" value="1" summary="allow only alphabetic characters"/>
> > +      <entry name="digits" value="2" summary="allow only digits"/>
> > +      <entry name="number" value="3" summary="input a number (including decimal separator and sign)"/>
> > +      <entry name="phone" value="4" summary="input a phone number"/>
> > +      <entry name="url" value="5" summary="input an URL"/>
> > +      <entry name="email" value="6" summary="input an email address"/>
> > +      <entry name="name" value="7" summary="input a name of a person"/>
> > +      <entry name="password" value="8" summary="input a password (combine with sensitive_data hint)"/>
> > +      <entry name="pin" value="9" summary="input is a numeric password (combine with sensitive_data hint)"/>
> > +      <entry name="date" value="10" summary="input a date"/>
> > +      <entry name="time" value="11" summary="input a time"/>
> > +      <entry name="datetime" value="12" summary="input a date and time"/>
> > +      <entry name="terminal" value="13" summary="input for a terminal"/>
> > +    </enum>
> > +
> > +    <request name="set_content_type">
> > +      <description summary="set content purpose and hint">
> > +        Sets the content purpose and content hint. While the purpose is the
> > +        basic purpose of an input field, the hint flags allow to modify some
> > +        of the behavior.
> > +
> > +        Values set with this request are double-buffered. They will get applied
> > +        on the first zwp_text_input_v3.commit request after an enabl request.  
> 
> s/enabl/enable/
> 
> 
> > +        Subsequent attempts to update them may have no effect. The values
> > +        remain valid until the next enable or disable request.
> > +
> > +        The initial value for hint is none, and the initial value for purpose
> > +        is normal.
> > +      </description>
> > +      <arg name="hint" type="uint" enum="content_hint"/>
> > +      <arg name="purpose" type="uint" enum="content_purpose"/>
> > +    </request>
> > +
> > +    <request name="set_cursor_rectangle">
> > +      <description summary="set cursor position">
> > +        Marks an area around the cursor as a x, y, width, height rectangle in surface
> > +        local coordinates.
> > +
> > +        Allows the compositor to put a window with word suggestions near the
> > +        cursor, without obstructing the text being input.
> > +
> > +        If the client is unaware of the position of edited text, it should not
> > +        issue this request, to signify lack of support to the compositor.
> > +
> > +        Values set with this request are double-buffered. They will get applied
> > +        on the next zwp_text_input_v3.commit request, and stay valid until the
> > +        next enable or disable request.
> > +
> > +        The initial values describing a cursor rectangle are empty. That means
> > +        the text input does not support describing the cursor area. If the
> > +        empty values get applied, subsequent attempts to change them may have
> > +        no effect.
> > +      </description>
> > +      <arg name="x" type="int"/>
> > +      <arg name="y" type="int"/>
> > +      <arg name="width" type="int"/>
> > +      <arg name="height" type="int"/>
> > +    </request>
> > +
> > +    <request name="commit">
> > +      <description summary="commit state">
> > +        Text input state (content purpose, content hint, surrounding text,
> > +        cursor rectangle) is conceptually double-buffered within the context
> > +        of a text input, i.e. between an enable request and the following
> > +        enable or disable request.
> > +
> > +        Protocol requests modify the pending state, as opposed to the current
> > +        state in use by the input method. A commit request atomically applies
> > +        all pending state, replacing the current state. After commit, the new
> > +        pending state is as documented for each related request.
> > +
> > +        The enable request performs a special role by indicating that the state  
> 
> Maybe "plays a special role" sounds more natural than "performs a special
> role"?
> 
> 
> > +        should be reset and updated with new values on the nearest commit.
> > +
> > +        The current or pending state are not modified unless noted otherwise.
> > +      </description>
> > +    </request>
> > +
> > +    <event name="enter">
> > +      <description summary="enter event">
> > +       Notification that this seat's text-input focus is on a certain surface.
> > +
> > +       When the seat has the keyboard capability the text-input focus follows
> > +       the keyboard focus.
> > +      </description>
> > +      <arg name="surface" type="object" interface="wl_surface"/>
> > +    </event>
> > +
> > +    <event name="leave">
> > +      <description summary="leave event">
> > +       Notification that this seat's text-input focus is no longer on
> > +       a certain surface. The client should reset any preedit string previously
> > +       set.
> > +
> > +       The leave notification is sent before the enter notification
> > +       for the new focus.
> > +
> > +       When the seat has the keyboard capability the text-input focus follows
> > +       the keyboard focus.
> > +      </description>
> > +      <arg name="surface" type="object" interface="wl_surface"/>
> > +    </event>
> > +
> > +    <event name="preedit_string">
> > +      <description summary="pre-edit">
> > +        Notify when a new composing text (pre-edit) should be set around the
> > +        current cursor position. Any previously set composing text should
> > +        be removed.
> > +
> > +        Values set with this event are double-buffered. They must be applied on
> > +        the next zwp_text_input_v3.done event, and stay valid until the
> > +        next enable or disable request.
> > +
> > +        The parameters cursor_begin and cursor_end are counted in Unicode
> > +        code points relative to the beginning of the submitted string. Cursor
> > +        should be hidden when both are equal to -1.
> > +
> > +        They could be represented by the cient as a line if both values are the
> > +        same, or as a text highligt otherwise.  
> 
> s/highligt/highlight/
> 
> 
> > +
> > +        The initial value of text is an empty string, and cursor_begin and
> > +        cursor_end are both 0.
> > +      </description>
> > +      <arg name="text" type="string" allow-null="true"/>
> > +      <arg name="cursor_begin" type="int"/>
> > +      <arg name="cursor_end" type="int"/>
> > +    </event>
> > +
> > +    <event name="commit_string">
> > +      <description summary="text commit">
> > +        Notify when text should be inserted into the editor widget. The text to
> > +        commit could be either just a single character after a key press or the
> > +        result of some composing (pre-edit).
> > +
> > +        Values set with this event are double-buffered. They must be applied
> > +        and reset to initial on the next zwp_text_input_v3.done event.
> > +
> > +        The initial value of text is an empty string.
> > +      </description>
> > +      <arg name="text" type="string" allow-null="true"/>
> > +    </event>
> > +
> > +    <event name="delete_surrounding_text">
> > +      <description summary="delete surrounding text">
> > +        Notify when the text around the current cursor position should be
> > +        deleted. Before_length and after_length are the number of Unicode
> > +        code points before and after the current cursor position (excluding the
> > +        selection) to delete.
> > +
> > +        Values set with this event are double-buffered. They must be applied
> > +        and reset to initial on the next zwp_text_input_v3.done event.
> > +
> > +        The initial values of both before_length and after_length are 0.
> > +      </description>
> > +      <arg name="before_length" type="uint" summary="length of text before current cursor position"/>
> > +      <arg name="after_length" type="uint" summary="length of text after current cursor position"/>
> > +    </event>
> > +
> > +    <event name="done">
> > +      <description summary="apply changes">
> > +        Instruct the application to apply changes to state requested by the
> > +        preedit_string, commit_string and delete_surrounding_string events. The  
> 
> s/delete_surrounding_string/delete_surrounding_text/
> 
> Thanks for all the work!
> 
Thanks for reviewing, and for reminding me of the importance of spellcheck :)

I'll take this opportunity to point out that I made the algorithm for applying changes depend on the cursor, while elsewhere I allowed the cursor to be removed altogether. That will be fixed in the next revision.
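
For reference, a rough sketch in Python of how a client might buffer the input method events and apply them on done. Everything here is an assumption for illustration: the `TextField` class and its method names are hypothetical, and the order of application (delete, then insert the commit string, then set the pre-edit) is a guess, since the done description is not quoted in full in this message. Note that it relies on a cursor position, which is exactly the dependency mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextField:
    """Hypothetical client-side editor state; `cursor` is a code point index."""
    text: str = ""
    cursor: int = 0
    preedit: str = ""
    # Pending values, filled in by events and applied on done.
    _commit: str = ""
    _before: int = 0
    _after: int = 0
    _preedit: str = ""

    def on_preedit_string(self, text: Optional[str], cursor_begin: int, cursor_end: int) -> None:
        # The pre-edit cursor range is ignored in this sketch.
        self._preedit = text or ""

    def on_commit_string(self, text: Optional[str]) -> None:
        self._commit = text or ""

    def on_delete_surrounding_text(self, before_length: int, after_length: int) -> None:
        self._before, self._after = before_length, after_length

    def on_done(self) -> None:
        # Assumed order: delete around the cursor, insert the commit string,
        # then replace the pre-edit. Lengths are counted in code points.
        start = max(self.cursor - self._before, 0)
        end = min(self.cursor + self._after, len(self.text))
        self.text = self.text[:start] + self._commit + self.text[end:]
        self.cursor = start + len(self._commit)
        self.preedit = self._preedit
        # commit_string and delete_surrounding_text reset to their initial
        # values; the pre-edit value stays valid until replaced.
        self._commit, self._before, self._after = "", 0, 0


field = TextField(text="h\u00e9llo w\u00f6rld", cursor=5)
field.on_delete_surrounding_text(2, 1)
field.on_commit_string("p!")
field.on_preedit_string("x", 1, 1)
field.on_done()
```

Nothing changes in `field.text` until `on_done()` runs, which is the double-buffering behaviour the event descriptions ask for.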

Cheers,
Dorota
Silvan Jegen May 3, 2018, 7:55 p.m.
On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> On Thu, 3 May 2018 20:47:27 +0200
> Silvan Jegen <s.jegen@gmail.com> wrote:
> 
> > Hi Dorota
> > 
> > Some comments and typo fixes below.
> > 
> > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > > This new protocol description is a simplification over v2.
> > > 
> > > - All pre-edit text styling is gone.
> > > - Pre-edit cursor can span characters.
> > > - No events regarding input panel (OSK) state nor covered rectangle.
> > >   Compositors are still free to handle situations where the keyboard
> > >   focus rectangle is covered by the input panel.
> > > - No set_preferred_language request for clients.
> > > - There is no event to send keysyms. Compositors can use wl_keyboard
> > >   interface instead.
> > > - All state is double-buffered, with specified state.
> > > - Use Unicode codepoints to measure strings.
> > > 
> > > Signed-off-by: Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm>
> > > Signed-off-by: Carlos Garnacho <carlosg@gnome.org>
> > > ---
> > > This is the next update coming from Purism to perfect the text input protocol.
> > > 
> > > The following changes added on top of PATCHv3:
> > > 
> > > - Fixed whitespaces.
> > > - Removed enable flags - the same information can be gathered from
> > > the first requests after enter.
> > > - Changed offsets inside UTF-8 strings to use Unicode character
> > > counts in order to remove the possibility of communicating invalid
> > > state.
> > > - Specified the exact lifetime of double-buffered state, and initial values.
> > > - Made changes requested by the IM double-buffered.
> > > 
> > > Some questions remain open. One is: how to specify how much text
> > > to capture in set_surrounding_text, and how often to update?
> > > 
> > > A possible change that I decided against for now is to replace
> > > enable/disable events by create/destroy of a new object, which
> > > would make more state lifetimes encoded in the protocol.
> > > 
> > > After reading a blog post on fcitx [0], I got the impression that
> > > letting the compositor know some persistent ID of a text edit
> > > instance could be useful, however I'm not sure what the use cases
> > > are.
> > > 
> > > As always, I'm happy to hear feedback.
> > > 
> > > Cheers,
> > > Dorota Czaplejewicz
> > > 
> > > [0] https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
> > > 
> > >  Makefile.am                                    |   1 +
> > >  unstable/text-input/text-input-unstable-v3.xml | 362 +++++++++++++++++++++++++
> > >  2 files changed, 363 insertions(+)
> > >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
> > > 
> > > diff --git a/Makefile.am b/Makefile.am
> > > index 4b9a901..86d7ca9 100644
> > > --- a/Makefile.am
> > > +++ b/Makefile.am
> > > @@ -3,6 +3,7 @@ unstable_protocols =								\
> > >  	unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml		\
> > >  	unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml			\
> > >  	unstable/text-input/text-input-unstable-v1.xml				\
> > > +	unstable/text-input/text-input-unstable-v3.xml				\
> > >  	unstable/input-method/input-method-unstable-v1.xml			\
> > >  	unstable/xdg-shell/xdg-shell-unstable-v5.xml				\
> > >  	unstable/xdg-shell/xdg-shell-unstable-v6.xml				\
> > > diff --git a/unstable/text-input/text-input-unstable-v3.xml b/unstable/text-input/text-input-unstable-v3.xml
> > > new file mode 100644
> > > index 0000000..ed5204f
> > > --- /dev/null
> > > +++ b/unstable/text-input/text-input-unstable-v3.xml
> > > @@ -0,0 +1,362 @@
> > > +<?xml version="1.0" encoding="UTF-8"?>
> > > +
> > > +<protocol name="text_input_unstable_v3">
> > > +  <copyright>
> > > +    Copyright © 2012, 2013 Intel Corporation
> > > +    Copyright © 2015, 2016 Jan Arne Petersen
> > > +    Copyright © 2017, 2018 Red Hat, Inc.
> > > +    Copyright © 2018 Purism SPC
> > > +
> > > +    Permission to use, copy, modify, distribute, and sell this
> > > +    software and its documentation for any purpose is hereby granted
> > > +    without fee, provided that the above copyright notice appear in
> > > +    all copies and that both that copyright notice and this permission
> > > +    notice appear in supporting documentation, and that the name of
> > > +    the copyright holders not be used in advertising or publicity
> > > +    pertaining to distribution of the software without specific,
> > > +    written prior permission.  The copyright holders make no
> > > +    representations about the suitability of this software for any
> > > +    purpose.  It is provided "as is" without express or implied
> > > +    warranty.
> > > +
> > > +    THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
> > > +    SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
> > > +    FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
> > > +    SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> > > +    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
> > > +    AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
> > > +    ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
> > > +    THIS SOFTWARE.
> > > +  </copyright>
> > > +
> > > +  <interface name="zwp_text_input_v3" version="1">
> > > +    <description summary="text input">
> > > +      The zwp_text_input_v3 interface represents text input and input methods
> > > +      associated with a seat. It provides enter/leave events to follow the
> > > +      text input focus for a seat.
> > > +
> > > +      Requests are used to enable/disable the text-input object and set
> > > +      state information like surrounding and selected text or the content type.
> > > +      The information about the entered text is sent to the text-input object
> > > +      via the pre-edit and commit_string events.
> > > +
> > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > +      grapheme is made up of multiple code points, an index pointing to any of
> > > +      them should be interpreted as pointing to the first one.  
> > 
> > That way we make sure we don't put the cursor/anchor between bytes that
> > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > also means that the client has to parse all the UTF-8 encoded strings
> > into Unicode code points up to the desired cursor/anchor position
> > on each "preedit_string" event. For each "delete_surrounding_text" event
> > the client has to parse the UTF-8 sequences before and after the cursor
> > position up to the requested Unicode code point.
> > 
> > I feel like we are processing the UTF-8 string already in the
> > input-method. So I am not sure that we should parse it again on the
> > client side. Parsing it again would also mean that the client would need
> > to know about UTF-8 which would be nice to avoid.
> > 
> > Thoughts?
> 
> The client needs to know about Unicode, but not necessarily about
> UTF-8. Specifying code points is actually an advantage here, because
> byte offsets are inherently expressed relative to UTF-8. By counting
> with code points, client's internal representation can be UTF-16 or
> maybe even something else.

Maybe I am misunderstanding something, but the protocol specifies that
the strings are valid UTF-8 and that the cursor/anchor offsets into the
strings are given in Unicode code points. To me that indicates that the
application *has to parse* the UTF-8 string into code points when
receiving the event; otherwise it doesn't know after which code point
to draw the cursor. Of course the application can then decide to
convert the UTF-8 string into another encoding like UTF-16 for internal
processing (for whatever reason), but that doesn't change the fact that
it still has to parse the incoming UTF-8 (and thus know about
UTF-8).
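
To illustrate the parsing step I mean, here is a rough sketch in Python
(not taken from the patch; the function name is made up for
illustration). It assumes the input is already valid UTF-8 and walks the
bytes to find where a given code point index lands:

```python
def codepoint_to_byte_offset(data: bytes, cp_index: int) -> int:
    """Return the byte offset of code point number cp_index in UTF-8 data.

    Assumes data is valid UTF-8 (hypothetical helper, for illustration).
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 0b10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == cp_index:
                return offset
            seen += 1
    if seen == cp_index:
        return len(data)  # index just past the last code point
    raise IndexError("code point index out of range")


text = "æþ".encode("utf-8")
print(codepoint_to_byte_offset(text, 1))  # 2
```

The walk is linear in the length of the string, which is the extra work
a client with a byte-oriented text library would have to do per event.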


> There's no avoiding the parsing either. What the application cares
> about is that the cursor falls between glyphs. The application cannot
> know that in all cases. Unicode allows the same sequence to be
> displayed in multiple ways (fallback):
> 
> https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> 
> One could make an argument that byte offsets should never be close
> to ZWJ characters, but I think this decision is better left to the
> application, which knows what exactly it is presenting to the user.

The idea of the previous version of the protocol (from my understanding)
was to make sure that only valid UTF-8 and valid byte-offsets (== not
falling between bytes of a Unicode code point) into the string will be
sent to the client. If you just get a byte-offset into a UTF-8 encoded
string you trust the sender to honor the protocol and thus you can just
pass the UTF-8 encoded string unprocessed to your text rendering library
(provided that the library supports UTF-8 strings which is what I am
assuming) without having to parse the UTF-8 string into Unicode code
points.

Of course the Unicode code points will have to be parsed at some point
if you want to render them. Using byte-offsets just lets you do that at
a later stage if your libraries support UTF-8.
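
For what it's worth, a receiver that does not want to trust the sender
could still check a byte offset cheaply. A minimal sketch, assuming the
string itself is valid UTF-8 (the helper name is hypothetical):

```python
def is_codepoint_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not split a UTF-8 encoded code point in data.

    Assumes data is valid UTF-8 (hypothetical helper, for illustration).
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # the end of the string is always a boundary
    # A continuation byte (0b10xxxxxx) means we are inside a code point.
    return data[offset] & 0xC0 != 0x80


data = "æþ".encode("utf-8")
print([is_codepoint_boundary(data, i) for i in range(5)])
# [True, False, True, False, True]
```

This only inspects a single byte, so it needs no decoding of the rest of
the string.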


Cheers,

Silvan
Dorota Czaplejewicz May 3, 2018, 8:46 p.m.
On Thu, 3 May 2018 21:55:40 +0200
Silvan Jegen <s.jegen@gmail.com> wrote:

> On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 20:47:27 +0200
> > Silvan Jegen <s.jegen@gmail.com> wrote:
> >   
> > > Hi Dorota
> > > 
> > > Some comments and typo fixes below.
> > > 
> > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > +      grapheme is made up of multiple code points, an index pointing to any of
> > > > +      them should be interpreted as pointing to the first one.    
> > > 
> > > That way we make sure we don't put the cursor/anchor between bytes that
> > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > also means that the client has to parse all the UTF-8 encoded strings
> > > into Unicode code points up to the desired cursor/anchor position
> > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > the client has to parse the UTF-8 sequences before and after the cursor
> > > position up to the requested Unicode code point.
> > > 
> > > I feel like we are processing the UTF-8 string already in the
> > > input-method. So I am not sure that we should parse it again on the
> > > client side. Parsing it again would also mean that the client would need
> > > to know about UTF-8 which would be nice to avoid.
> > > 
> > > Thoughts?  
> > 
> > The client needs to know about Unicode, but not necessarily about
> > UTF-8. Specifying code points is actually an advantage here, because
> > byte offsets are inherently expressed relative to UTF-8. By counting
> > with code points, client's internal representation can be UTF-16 or
> > maybe even something else.  
> 
> Maybe I am misunderstanding something but the protocol specifies that
> the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> the strings are specified in Unicode points. To me that indicates that
> the application *has to parse* the UTF-8 string into Unicode points
> when receiving the event otherwise it doesn't know after which Unicode
> point to draw the cursor. Of course the application can then decide to
> convert the UTF-8 string into another encoding like UTF-16 for internal
> processing (for whatever reason) but that doesn't change the fact that
> it still would have to parse the incoming UTF-8 (and thus know about
> UTF-8).
> 
Can you see any way to avoid parsing UTF-8 in order to draw the cursor? I tried to come up with a way to do that, but even with specifying byte strings, I believe that calculating the position of the cursor - either in pixels or in glyphs - requires full parsing of the input string.

> 
> > There's no avoiding the parsing either. What the application cares
> > about is that the cursor falls between glyphs. The application cannot
> > know that in all cases. Unicode allows the same sequence to be
> > displayed in multiple ways (fallback):
> > 
> > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > 
> > One could make an argument that byte offsets should never be close
> > to ZWJ characters, but I think this decision is better left to the
> > application, which knows what exactly it is presenting to the user.  
> 
> The idea of the previous version of the protocol (from my understanding)
> was to make sure that only valid UTF-8 and valid byte-offsets (== not
> falling between bytes of a Unicode code point) into the string will be
> sent to the client. If you just get a byte-offset into a UTF-8 encoded
> string you trust the sender to honor the protocol and thus you can just
> pass the UTF-8 encoded string unprocessed to your text rendering library
> (provided that the library supports UTF-8 strings which is what I am
> assuming) without having to parse the UTF-8 string into Unicode code
> points.
> 
> Of course the Unicode code points will have to be parsed at some point
> if you want to render them. Using byte-offsets just lets you do that at
> a later stage if your libraries support UTF-8.
> 
> 
Doesn't that chiefly depend on the kind of text rendering library, though? As far as I understand, passing text to rendering is necessary to calculate the cursor position. At the same time, it doesn't matter much for the calculations whether the cursor offset is in bytes or code points - the library does the parsing in the last step anyway.

I think you mean that if the rendering library accepts byte offsets as the only format, the application would have to parse the UTF-8 unnecessarily. I agree with this, but I'm not sure we should optimize for this case. Other libraries may support only code points instead.

Did I understand you correctly?

Cheers,
Dorota
Silvan Jegen May 4, 2018, 8:32 p.m.
On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> On Thu, 3 May 2018 21:55:40 +0200
> Silvan Jegen <s.jegen@gmail.com> wrote:
> 
> > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > > On Thu, 3 May 2018 20:47:27 +0200
> > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > >   
> > > > Hi Dorota
> > > > 
> > > > Some comments and typo fixes below.
> > > > 
> > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> > > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > +      grapheme is made up of multiple code points, an index pointing to any of
> > > > > +      them should be interpreted as pointing to the first one.    
> > > > 
> > > > That way we make sure we don't put the cursor/anchor between bytes that
> > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > into Unicode code points up to the desired cursor/anchor position
> > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > > the client has to parse the UTF-8 sequences before and after the cursor
> > > > position up to the requested Unicode code point.
> > > > 
> > > > I feel like we are processing the UTF-8 string already in the
> > > > input-method. So I am not sure that we should parse it again on the
> > > > client side. Parsing it again would also mean that the client would need
> > > > to know about UTF-8 which would be nice to avoid.
> > > > 
> > > > Thoughts?  
> > > 
> > > The client needs to know about Unicode, but not necessarily about
> > > UTF-8. Specifying code points is actually an advantage here, because
> > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > with code points, client's internal representation can be UTF-16 or
> > > maybe even something else.  
> > 
> > Maybe I am misunderstanding something but the protocol specifies that
> > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > the strings are specified in Unicode points. To me that indicates that
> > the application *has to parse* the UTF-8 string into Unicode points
> > when receiving the event otherwise it doesn't know after which Unicode
> > point to draw the cursor. Of course the application can then decide to
> > convert the UTF-8 string into another encoding like UTF-16 for internal
> > processing (for whatever reason) but that doesn't change the fact that
> > it still would have to parse the incoming UTF-8 (and thus know about
> > UTF-8).
> > 
> Can you see any way to avoid parsing UTF-8 in order to draw the
> cursor? I tried to come up with a way to do that, but even with
> specifying byte strings, I believe that calculating the position of
> the cursor - either in pixels or in glyphs - requires full parsing of
> the input string.

Yes, I don't think it's avoidable either. You just don't have to do
it twice if your text rendering library consumes UTF-8 strings with
byte-offsets though. See my response below.


> > > There's no avoiding the parsing either. What the application cares
> > > about is that the cursor falls between glyphs. The application cannot
> > > know that in all cases. Unicode allows the same sequence to be
> > > displayed in multiple ways (fallback):
> > > 
> > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > 
> > > One could make an argument that byte offsets should never be close
> > > to ZWJ characters, but I think this decision is better left to the
> > > application, which knows what exactly it is presenting to the user.  
> > 
> > The idea of the previous version of the protocol (from my understanding)
> > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > falling between bytes of a Unicode code point) into the string will be
> > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > string you trust the sender to honor the protocol and thus you can just
> > pass the UTF-8 encoded string unprocessed to your text rendering library
> > (provided that the library supports UTF-8 strings which is what I am
> > assuming) without having to parse the UTF-8 string into Unicode code
> > points.
> > 
> > Of course the Unicode code points will have to be parsed at some point
> > if you want to render them. Using byte-offsets just lets you do that at
> > a later stage if your libraries support UTF-8.
> > 
> > 
> Doesn't that chiefly depend on what kind of the text rendering library
> though? As far as I understand, passing text to rendering is necessary
> to calculate the cursor position. At the same time, it doesn't matter
> much for the calculations whether the cursor offset is in bytes or
> code points - the library does the parsing in the last step anyway.
> 
> I think you mean that if the rendering library accepts byte offsets
> as the only format, the application would have to parse the UTF-8
> unnecessarily. I agree with this, but I'm not sure we should optimize
> for this case. Other libraries may support only code points instead.
>
> Did I understand you correctly?

Yes, that's what I meant. I also assumed that no text rendering library
expects you to pass the string length in Unicode code points. I had a
look and the ones I managed to find expected their lengths in bytes:

* Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
* Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html

For those you would need to parse the UTF-8 string yourself first in
order to find the byte position at which the Unicode code point ends
where the protocol wants you to draw the cursor (if the protocol sends
code point offsets).
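
A client keeping its text as a UTF-8 buffer with a byte cursor would
need a similar walk for "delete_surrounding_text". A rough sketch of
what I have in mind, assuming valid UTF-8, a cursor on a code point
boundary and no selection (the function name is made up, and clamping
at the buffer ends is my own choice, not something the patch specifies):

```python
from typing import Tuple


def delete_surrounding(buf: bytes, cursor: int,
                       before: int, after: int) -> Tuple[bytes, int]:
    """Delete `before`/`after` code points around byte offset `cursor`.

    Returns the new buffer and the new cursor byte offset.
    Hypothetical helper; assumes valid UTF-8 and no selection.
    """
    start = cursor
    for _ in range(before):
        if start == 0:
            break  # clamp at the start of the buffer (assumption)
        start -= 1
        # Step back over continuation bytes (0b10xxxxxx) to the lead byte.
        while start > 0 and buf[start] & 0xC0 == 0x80:
            start -= 1
    end = cursor
    for _ in range(after):
        if end == len(buf):
            break  # clamp at the end of the buffer (assumption)
        end += 1
        # Step forward past the continuation bytes of this code point.
        while end < len(buf) and buf[end] & 0xC0 == 0x80:
            end += 1
    return buf[:start] + buf[end:], start


buf = "aæþb".encode("utf-8")
print(delete_surrounding(buf, 3, 1, 1))  # (b'ab', 1)
```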

I feel like it would make sense to optimize for the more common case. I
assume that is the one where you need to pass a length in bytes to the
text rendering library, not in Unicode code points.

Admittedly, I haven't used a lot of text rendering libraries so I would
very much like to hear more opinions on the issue.


Cheers,

Silvan
Dorota Czaplejewicz May 5, 2018, 9:09 a.m.
On Fri, 4 May 2018 22:32:15 +0200
Silvan Jegen <s.jegen@gmail.com> wrote:

> On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 21:55:40 +0200
> > Silvan Jegen <s.jegen@gmail.com> wrote:
> >   
> > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > > >     
> > > > > Hi Dorota
> > > > > 
> > > > > Some comments and typo fixes below.
> > > > > 
> > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:    
> > > > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > > +      grapheme is made up of multiple code points, an index pointing to any of
> > > > > > +      them should be interpreted as pointing to the first one.      
> > > > > 
> > > > > That way we make sure we don't put the cursor/anchor between bytes that
> > > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > > into Unicode code points up to the desired cursor/anchor position
> > > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > > > the client has to parse the UTF-8 sequences before and after the cursor
> > > > > position up to the requested Unicode code point.
> > > > > 
> > > > > I feel like we are processing the UTF-8 string already in the
> > > > > input-method. So I am not sure that we should parse it again on the
> > > > > client side. Parsing it again would also mean that the client would need
> > > > > to know about UTF-8 which would be nice to avoid.
> > > > > 
> > > > > Thoughts?    
> > > > 
> > > > The client needs to know about Unicode, but not necessarily about
> > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > with code points, client's internal representation can be UTF-16 or
> > > > maybe even something else.    
> > > 
> > > Maybe I am misunderstanding something but the protocol specifies that
> > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > > the strings are specified in Unicode points. To me that indicates that
> > > the application *has to parse* the UTF-8 string into Unicode points
> > > when receiving the event otherwise it doesn't know after which Unicode
> > > point to draw the cursor. Of course the application can then decide to
> > > convert the UTF-8 string into another encoding like UTF-16 for internal
> > > processing (for whatever reason) but that doesn't change the fact that
> > > it still would have to parse the incoming UTF-8 (and thus know about
> > > UTF-8).
> > >   
> > Can you see any way to avoid parsing UTF-8 in order to draw the
> > cursor? I tried to come up with a way to do that, but even with
> > specifying byte strings, I believe that calculating the position of
> > the cursor - either in pixels or in glyphs - requires full parsing of
> > the input string.  
> 
> Yes, I don't think it's avoidable either. You just don't have to do
> it twice if your text rendering library consumes UTF-8 strings with
> byte-offsets though. See my response below.
> 
> 
> > > > There's no avoiding the parsing either. What the application cares
> > > > about is that the cursor falls between glyphs. The application cannot
> > > > know that in all cases. Unicode allows the same sequence to be
> > > > displayed in multiple ways (fallback):
> > > > 
> > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > 
> > > > One could make an argument that byte offsets should never be close
> > > > to ZWJ characters, but I think this decision is better left to the
> > > > application, which knows what exactly it is presenting to the user.    
> > > 
> > > The idea of the previous version of the protocol (from my understanding)
> > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > > falling between bytes of a Unicode code point) into the string will be
> > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > > string you trust the sender to honor the protocol and thus you can just
> > > pass the UTF-8 encoded string unprocessed to your text rendering library
> > > (provided that the library supports UTF-8 strings which is what I am
> > > assuming) without having to parse the UTF-8 string into Unicode code
> > > points.
> > > 
> > > Of course the Unicode code points will have to be parsed at some point
> > > if you want to render them. Using byte-offsets just lets you do that at
> > > a later stage if your libraries support UTF-8.
> > > 
> > >   
> > Doesn't that chiefly depend on what kind of the text rendering library
> > though? As far as I understand, passing text to rendering is necessary
> > to calculate the cursor position. At the same time, it doesn't matter
> > much for the calculations whether the cursor offset is in bytes or
> > code points - the library does the parsing in the last step anyway.
> > 
> > I think you mean that if the rendering library accepts byte offsets
> > as the only format, the application would have to parse the UTF-8
> > unnecessarily. I agree with this, but I'm not sure we should optimize
> > for this case. Other libraries may support only code points instead.
> >
> > Did I understand you correctly?  
> 
> Yes, that's what I meant. I also assumed that no text rendering library
> expects you to pass the string length in Unicode points. I had a look
> and the ones I managed to find expected their lengths in bytes:
> 
> * Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
> * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html

I looked a bit deeper and found hb_buffer_add_utf8:

https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576

It seems to require both (either?) the number of bytes (for buffer size) and the number of code points in the same call. In this case, it doesn't matter how the position information is expressed.

> 
> For those you would need to parse the UTF-8 string yourself first in
> order to find out at which byte position the Unicodepoint stops where
> the protocol wants you to draw the cursor (if the protocol sends Unicode
> point offsets).
> 
> I feel like it would make sense to optimize for the more common case. I
> assume that is the one where you need to pass a length in bytes to the
> text rendering library, not in Unicode points.
> 
> Admittedly, I haven't used a lot of text rendering libraries so I would
> very much like to hear more opinions on the issue.
> 

Even if some libraries expect to work with bytes, I see three reasons not to provide them. Most importantly, I believe that we should avoid letting people shoot themselves in the foot whenever possible. Specifying bytes leaves a lot of wiggle room to communicate invalid state. The supporting reason is that protocols shouldn't be tied to implementation details.
The least important reason is that handling Unicode is getting better than it used to be. Taking Python as an example:

>>> 'æþ'[1]
'þ'
>>> len('æþ'.encode('utf-8'))
4

Strings are natively indexed with code points. This matches at least my intuition when I'm asked to place a cursor somewhere inside a string and tell the index.
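
The same code point index can also be mapped to whatever encoding the client uses internally. A small sketch, assuming a client that stores UTF-16 (the helper name is made up for illustration, it is not part of any library):

```python
def codepoint_to_utf16_offset(text: str, cp_index: int) -> int:
    """Number of UTF-16 code units that precede code point cp_index.

    Hypothetical helper, for illustration only.
    """
    # Python str is indexed by code point, so slicing counts code points;
    # encoding the prefix as UTF-16 (2 bytes per code unit) counts units.
    return len(text[:cp_index].encode("utf-16-le")) // 2


s = "a\U0001F600b"  # 'a', an emoji outside the BMP, 'b'
print(codepoint_to_utf16_offset(s, 2))  # 3
print(len(s[:2].encode("utf-8")))       # 5
```

Code point index 2 corresponds to 3 UTF-16 code units but 5 UTF-8 bytes, so a byte offset would only be directly usable by a client that keeps UTF-8.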

In the end, I'm not an expert in that area either - perhaps treating client side strings as UTF-8 buffers makes sense, but at the moment I'm still leaning towards the code point abstraction.

Cheers,
Dorota
Silvan Jegen May 5, 2018, 11:37 a.m.
On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> On Fri, 4 May 2018 22:32:15 +0200
> Silvan Jegen <s.jegen@gmail.com> wrote:
> 
> > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > > On Thu, 3 May 2018 21:55:40 +0200
> > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > >   
> > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> > > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > > > >     
> > > > > > Hi Dorota
> > > > > > 
> > > > > > Some comments and typo fixes below.
> > > > > > 
> > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:    
> > > > > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > > > +      grapheme is made up of multiple code points, an index pointing to any of
> > > > > > > +      them should be interpreted as pointing to the first one.      
> > > > > > 
> > > > > > That way we make sure we don't put the cursor/anchor between bytes that
> > > > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > > > into Unicode code points up to the desired cursor/anchor position
> > > > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > > > > the client has to parse the UTF-8 sequences before and after the cursor
> > > > > > position up to the requested Unicode code point.
> > > > > > 
> > > > > > I feel like we are processing the UTF-8 string already in the
> > > > > > input-method. So I am not sure that we should parse it again on the
> > > > > > client side. Parsing it again would also mean that the client would need
> > > > > > to know about UTF-8 which would be nice to avoid.
> > > > > > 
> > > > > > Thoughts?    
> > > > > 
> > > > > The client needs to know about Unicode, but not necessarily about
> > > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > > with code points, client's internal representation can be UTF-16 or
> > > > > maybe even something else.    
> > > > 
> > > > Maybe I am misunderstanding something but the protocol specifies that
> > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > > > the strings are specified in Unicode points. To me that indicates that
> > > > the application *has to parse* the UTF-8 string into Unicode points
> > > > when receiving the event otherwise it doesn't know after which Unicode
> > > > point to draw the cursor. Of course the application can then decide to
> > > > convert the UTF-8 string into another encoding like UTF-16 for internal
> > > > processing (for whatever reason) but that doesn't change the fact that
> > > > it still would have to parse the incoming UTF-8 (and thus know about
> > > > UTF-8).
> > > >   
> > > Can you see any way to avoid parsing UTF-8 in order to draw the
> > > cursor? I tried to come up with a way to do that, but even with
> > > specifying byte strings, I believe that calculating the position of
> > > the cursor - either in pixels or in glyphs - requires full parsing of
> > > the input string.  
> > 
> > Yes, I don't think it's avoidable either. You just don't have to do
> > it twice if your text rendering library consumes UTF-8 strings with
> > byte-offsets though. See my response below.
> > 
> > 
> > > > > There's no avoiding the parsing either. What the application cares
> > > > > about is that the cursor falls between glyphs. The application cannot
> > > > > know that in all cases. Unicode allows the same sequence to be
> > > > > displayed in multiple ways (fallback):
> > > > > 
> > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > > 
> > > > > One could make an argument that byte offsets should never be close
> > > > > to ZWJ characters, but I think this decision is better left to the
> > > > > application, which knows what exactly it is presenting to the user.    
> > > > 
> > > > The idea of the previous version of the protocol (from my understanding)
> > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > > > falling between bytes of a Unicode code point) into the string will be
> > > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > > > string you trust the sender to honor the protocol and thus you can just
> > > > pass the UTF-8 encoded string unprocessed to your text rendering library
> > > > (provided that the library supports UTF-8 strings which is what I am
> > > > assuming) without having to parse the UTF-8 string into Unicode code
> > > > points.
> > > > 
> > > > Of course the Unicode code points will have to be parsed at some point
> > > > if you want to render them. Using byte-offsets just lets you do that at
> > > > a later stage if your libraries support UTF-8.
> > > > 
> > > >   
> > > Doesn't that chiefly depend on what kind of the text rendering library
> > > though? As far as I understand, passing text to rendering is necessary
> > > to calculate the cursor position. At the same time, it doesn't matter
> > > much for the calculations whether the cursor offset is in bytes or
> > > code points - the library does the parsing in the last step anyway.
> > > 
> > > I think you mean that if the rendering library accepts byte offsets
> > > as the only format, the application would have to parse the UTF-8
> > > unnecessarily. I agree with this, but I'm not sure we should optimize
> > > for this case. Other libraries may support only code points instead.
> > >
> > > Did I understand you correctly?  
> > 
> > Yes, that's what I meant. I also assumed that no text rendering library
> > expects you to pass the string length in Unicode points. I had a look
> > and the ones I managed to find expected their lengths in bytes:
> > 
> > * Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
> > * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html
> 
> I looked a bit deeper and found hb_buffer_add_utf8:
> 
> https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576
> 
> It seems to require both (either?) the number of bytes (for buffer
> size) and the number of code points in the same call. In this case, it
> doesn't matter how the position information is expressed.

Haha, as an API I think that's horrible...


> > For those you would need to parse the UTF-8 string yourself first in
> > order to find out at which byte position the Unicodepoint stops where
> > the protocol wants you to draw the cursor (if the protocol sends Unicode
> > point offsets).
> > 
> > I feel like it would make sense to optimize for the more common case. I
> > assume that is the one where you need to pass a length in bytes to the
> > text rendering library, not in Unicode points.
> > 
> > Admittedly, I haven't used a lot of text rendering libraries so I would
> > very much like to hear more opinions on the issue.
> > 
> 
> Even if some libraries expect to work with bytes, I see three
> reasons not to provide them. Most importantly, I believe that we
> should avoid letting people shoot themselves in the foot whenever
> possible. Specifying bytes leaves a lot of wiggle room to communicate
> invalid state. The supporting reason is that protocols shouldn't be
> tied to implementation details.

I agree that this is an advantage of using offsets measured in Unicode
code points.

Still, it worries me to think about how for the next 10-20 years people
using these protocols have to parse their UTF-8 strings into Unicode
points twice for no good reason...


> The least important reason is that handling Unicode is getting better
> than it used to be. Taking Python as an example:
> 

That's true to some extent (personally I like Go's string and Unicode handling)
but Python is a bad example IMO. Python 3 handles strings this way while
Python 2 handles them in a completely different way:

Python 2.7.15 (default, May  1 2018, 20:16:04)
[GCC 7.3.1 20180406] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'æþ'
'\xc3\xa6\xc3\xbe'
>>> 'æþ'[1]
'\xa6'

and I am not sure either of them is easy and efficient to work with.
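
For comparison, the byte-indexing behaviour shown in the Python 2 session can be reproduced in Python 3 by encoding the string explicitly. This is only a minimal sketch, assuming Python 3 and the same "æþ" test string; the variable names are illustrative:

```python
# Python 3 keeps two distinct types: str is indexed by code point,
# bytes is indexed by byte.
text = "æþ"
raw = text.encode("utf-8")

print(text[1])   # the second code point
print(raw)       # the four UTF-8 bytes of the string
print(raw[1:2])  # the second byte: a continuation byte, not a character
```

Indexing `raw` gives the Python 2 result ('\xa6'), indexing `text` gives the Python 3 result ('þ').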


> >>> 'æþ'[1]
> 'þ'
> >>> len('æþ'.encode('utf-8'))
> 4
> 
> Strings are natively indexed with code points. This matches at least
> my intuition when I'm asked to place a cursor somewhere inside a
> string and tell the index.

Go expects all strings to be UTF-8 encoded and they are indexed by
byte. You can iterate over strings to get unicode points (called 'rune's
there) should you need them:

for offset, r := range "æþ" {
   fmt.Printf("start byte pos: %d, code point: %c\n", offset, r)
}

start byte pos: 0, code point: æ
start byte pos: 2, code point: þ

Using Go's approach you can treat strings as UTF-8 bytes if that's all
you want to care about while still having an easy way to parse them into
Unicode points if you need them.


> In the end, I'm not an expert in that area either - perhaps treating
> client side strings as UTF-8 buffers makes sense, but at the moment
> I'm still leaning towards the code point abstraction.

Someone (™) should probably implement a client making use of the protocol
to see what the real world impact of this protocol change would be.

The editor in the weston project uses pango for its text layout:

https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824

so it would have to parse the UTF-8 string twice. The same is most likely
true for all programs using GTK...
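
The extra pass discussed here could look roughly like the following. It is a sketch only, assuming Python 3; the function names are hypothetical and are not part of Pango, weston, or the protocol. It converts a code point offset (as the proposed protocol would send it) into the byte offset a byte-based layout API would expect, and back:

```python
def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Return the UTF-8 byte offset that corresponds to a code point offset."""
    if not 0 <= cp_offset <= len(text):
        raise ValueError("code point offset out of range")
    # Encoding the prefix walks the string once: this is the "second parse".
    return len(text[:cp_offset].encode("utf-8"))


def byte_to_codepoint_offset(raw: bytes, byte_offset: int) -> int:
    """Return the code point offset that corresponds to a UTF-8 byte offset.

    Raises UnicodeDecodeError if byte_offset falls inside a code point.
    """
    if not 0 <= byte_offset <= len(raw):
        raise ValueError("byte offset out of range")
    return len(raw[:byte_offset].decode("utf-8"))


print(codepoint_to_byte_offset("æþ", 1))                  # byte offset of the cursor in "æ|þ"
print(byte_to_codepoint_offset("æþ".encode("utf-8"), 2))  # and back again
```

Either direction costs one linear scan of the prefix; the question in this thread is only which side of the protocol has to pay for it.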


Cheers,

Silvan
Dorota Czaplejewicz May 6, 2018, 8:37 p.m.
On Sat, 5 May 2018 13:37:44 +0200
Silvan Jegen <s.jegen@gmail.com> wrote:

> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> > On Fri, 4 May 2018 22:32:15 +0200
> > Silvan Jegen <s.jegen@gmail.com> wrote:
> >   
> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:  
> > > > On Thu, 3 May 2018 21:55:40 +0200
> > > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > > >     
> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:    
> > > > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > > > > >       
> > > > > > > Hi Dorota
> > > > > > > 
> > > > > > > Some comments and typo fixes below.
> > > > > > > 
> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:      
> > > > > > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > > > > +      grapheme is made up of multiple code points, an index pointing to any of
> > > > > > > > +      them should be interpreted as pointing to the first one.        
> > > > > > > 
> > > > > > > That way we make sure we don't put the cursor/anchor between bytes that
> > > > > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > > > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > > > > into Unicode code points up to the desired cursor/anchor position
> > > > > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > > > > > the client has to parse the UTF-8 sequences before and after the cursor
> > > > > > > position up to the requested Unicode code point.
> > > > > > > 
> > > > > > > I feel like we are processing the UTF-8 string already in the
> > > > > > > input-method. So I am not sure that we should parse it again on the
> > > > > > > client side. Parsing it again would also mean that the client would need
> > > > > > > to know about UTF-8 which would be nice to avoid.
> > > > > > > 
> > > > > > > Thoughts?      
> > > > > > 
> > > > > > The client needs to know about Unicode, but not necessarily about
> > > > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > > > with code points, client's internal representation can be UTF-16 or
> > > > > > maybe even something else.      
> > > > > 
> > > > > Maybe I am misunderstanding something but the protocol specifies that
> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > > > > the strings are specified in Unicode points. To me that indicates that
> > > > > the application *has to parse* the UTF-8 string into Unicode points
> > > > > when receiving the event otherwise it doesn't know after which Unicode
> > > > > point to draw the cursor. Of course the application can then decide to
> > > > > convert the UTF-8 string into another encoding like UTF-16 for internal
> > > > > processing (for whatever reason) but that doesn't change the fact that
> > > > > it still would have to parse the incoming UTF-8 (and thus know about
> > > > > UTF-8).
> > > > >     
> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
> > > > cursor? I tried to come up with a way to do that, but even with
> > > > specifying byte strings, I believe that calculating the position of
> > > > the cursor - either in pixels or in glyphs - requires full parsing of
> > > > the input string.    
> > > 
> > > Yes, I don't think it's avoidable either. You just don't have to do
> > > it twice if your text rendering library consumes UTF-8 strings with
> > > byte-offsets though. See my response below.
> > > 
> > >   
> > > > > > There's no avoiding the parsing either. What the application cares
> > > > > > about is that the cursor falls between glyphs. The application cannot
> > > > > > know that in all cases. Unicode allows the same sequence to be
> > > > > > displayed in multiple ways (fallback):
> > > > > > 
> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > > > 
> > > > > > One could make an argument that byte offsets should never be close
> > > > > > to ZWJ characters, but I think this decision is better left to the
> > > > > > application, which knows what exactly it is presenting to the user.      
> > > > > 
> > > > > The idea of the previous version of the protocol (from my understanding)
> > > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > > > > falling between bytes of a Unicode code point) into the string will be
> > > > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > > > > string you trust the sender to honor the protocol and thus you can just
> > > > > pass the UTF-8 encoded string unprocessed to your text rendering library
> > > > > (provided that the library supports UTF-8 strings which is what I am
> > > > > assuming) without having to parse the UTF-8 string into Unicode code
> > > > > points.
> > > > > 
> > > > > Of course the Unicode code points will have to be parsed at some point
> > > > > if you want to render them. Using byte-offsets just lets you do that at
> > > > > a later stage if your libraries support UTF-8.
> > > > > 
> > > > >     
> > > > Doesn't that chiefly depend on what kind of the text rendering library
> > > > though? As far as I understand, passing text to rendering is necessary
> > > > to calculate the cursor position. At the same time, it doesn't matter
> > > > much for the calculations whether the cursor offset is in bytes or
> > > > code points - the library does the parsing in the last step anyway.
> > > > 
> > > > I think you mean that if the rendering library accepts byte offsets
> > > > as the only format, the application would have to parse the UTF-8
> > > > unnecessarily. I agree with this, but I'm not sure we should optimize
> > > > for this case. Other libraries may support only code points instead.
> > > >
> > > > Did I understand you correctly?    
> > > 
> > > Yes, that's what I meant. I also assumed that no text rendering library
> > > expects you to pass the string length in Unicode points. I had a look
> > > and the ones I managed to find expected their lengths in bytes:
> > > 
> > > * Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
> > > * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html  
> > 
> > I looked a bit deeper and found hb_buffer_add_utf8:
> > 
> > https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576
> > 
> > It seems to require both (either?) the number of bytes (for buffer
> > size) and the number of code points in the same call. In this case, it
> > doesn't matter how the position information is expressed.  
> 
> Haha, as an API I think that's horrible...
> 
> 
> > > For those you would need to parse the UTF-8 string yourself first in
> > > order to find out at which byte position the Unicodepoint stops where
> > > the protocol wants you to draw the cursor (if the protocol sends Unicode
> > > point offsets).
> > > 
> > > I feel like it would make sense to optimize for the more common case. I
> > > assume that is the one where you need to pass a length in bytes to the
> > > text rendering library, not in Unicode points.
> > > 
> > > Admittedly, I haven't used a lot of text rendering libraries so I would
> > > very much like to hear more opinions on the issue.
> > >   
> > 
> > Even if some libraries expect to work with bytes, I see three
> > reasons not to provide them. Most importantly, I believe that we
> > should avoid letting people shoot themselves in the foot whenever
> > possible. Specifying bytes leaves a lot of wiggle room to communicate
> > invalid state. The supporting reason is that protocols shouldn't be
> > tied to implementation details.  
> 
> I agree that this is an advantage of using offsets measured in Unicode
> code points.
> 
> Still, it worries me to think about how for the next 10-20 years people
> using these protocols have to parse their UTF-8 strings into Unicode
> points twice for no good reason...
> 
> 
> > The least important reason is that handling Unicode is getting better
> > than it used to be. Taking Python as an example:
> >   
> 
> That's true to some extent (personally I like Go's string and Unicode handling)
> but Python is a bad example IMO. Python 3 handles strings this way while
> Python 2 handles them in a completely different way:
> 
> Python 2.7.15 (default, May  1 2018, 20:16:04)
> [GCC 7.3.1 20180406] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 'æþ'  
> '\xc3\xa6\xc3\xbe'
> >>> 'æþ'[1]  
> '\xa6'
> 
> and I am not sure either of them is easy and efficient to work with.
> 
> 
> > >>> 'æþ'[1]  
> > 'þ'  
> > >>> len('æþ'.encode('utf-8'))  
> > 4
> > 
> > Strings are natively indexed with code points. This matches at least
> > my intuition when I'm asked to place a cursor somewhere inside a
> > string and tell the index.  
> 
> Go expects all strings to be UTF-8 encoded and they are indexed by
> byte. You can iterate over strings to get unicode points (called 'rune's
> there) should you need them:
> 
> for offset, r := range "æþ" {
>    fmt.Printf("start byte pos: %d, code point: %c\n", offset, r)
> }
> 
> start byte pos: 0, code point: æ
> start byte pos: 2, code point: þ
> 
> Using Go's approach you can treat strings as UTF-8 bytes if that's all
> you want to care about while still having an easy way to parse them into
> Unicode points if you need them.
> 
> 
> > In the end, I'm not an expert in that area either - perhaps treating
> > client side strings as UTF-8 buffers makes sense, but at the moment
> > I'm still leaning towards the code point abstraction.  
> 
> Someone (™) should probably implement a client making use of the protocol
> to see what the real world impact of this protocol change would be.
> 
> The editor in the weston project uses pango for its text layout:
> 
> https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
> 
> so it would have to parse the UTF-8 string twice. The same is most likely
> true for all programs using GTK...
> 
> 

I made an attempt to dig deeper, and while I stopped short of becoming this Someone for now, I gathered what I think are some important results.

First, the state of the libraries. I gathered a lot of data, so I'll keep this section rather dense. To start with, there is another contender for the title of text layout library, and it uses code points exclusively:

https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h `gr_make_seg`

https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c

Afterwards, I focused on GTK and Qt. As an input method plugin developer, I looked at the IM interfaces and internal data structures they expose. The results were not that clear - no mention of "code points", some references to "bytes", many to "characters" (not "chars"). What is certain is that there's a lot of converting going on behind the scenes anyway. First off, GTK seems to be moving away from bytes, judging by the comments:

gtk 3.22 (`gtkimcontext.c`)

`gtk_im_context_delete_surrounding`

> * Asks the widget that the input context is attached to to delete
> * characters around the cursor position by emitting the
> * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
> * are in characters not in bytes which differs from the usage other
> * places in #GtkIMContext.

`gtk_im_context_get_preedit_string`

> * @cursor_pos: (out): location to store position of cursor (in characters)
> *              within the preedit string.  

`gtk_im_context_get_surrounding`

> * @cursor_index: (out): location to store byte index of the insertion
> *        cursor within @text.

GtkEntry seems to store things internally as characters.

While GTK using code points internally is not proof of anything, it suggests that there is a reason not to use bytes.

Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString

> replaceLength specifies the number of characters to be replaced

Confirmation that "characters" means "code points" comes from https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value reported when "æþ|" is displayed is 2.
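
A note on what this check can and cannot show, as a minimal Python 3 sketch (the `lengths` helper is hypothetical): for "æþ" the number of code points and the number of UTF-16 code units are both 2, so a reported value of 2 rules out UTF-8 bytes but is consistent with either of the other two units. A string containing a character outside the Basic Multilingual Plane tells them apart:

```python
def lengths(text: str) -> dict:
    """Measure one string in the three units discussed in this thread."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(lengths("æþ"))           # code points and UTF-16 units agree here
print(lengths("a\U0001F600"))  # they differ once a non-BMP character appears
```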

I also spent more time than I should have writing a demo implementation of an input method and a client connecting to it, to check out the proposed interfaces. Predictably, it gave me a lot of trouble on the edges between bytes and code points, but I blame that on Rust's scarcity of UTF handling functions. The hack is available at https://code.puri.sm/dorota.czaplejewicz/impoc

My impression at the moment is that it doesn't matter much how offsets within UTF-8 strings are encoded, but that code points slightly better reflect what's going on in the GUI toolkits, apart from the benefits mentioned in my other emails. There seems to be so much going on behind the scenes, and the parsing is so cheap, that it doesn't make sense to worry about the computational aspect; we should just try to make things easier to get right.

Unless someone chimes in with more arguments, I'm going to keep using code points in the following revisions.

Cheers,
Dorota
Joshua Watt May 7, 2018, 3:11 a.m.
On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
<dorota.czaplejewicz@puri.sm> wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen <s.jegen@gmail.com> wrote:
>
>> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
>> > On Fri, 4 May 2018 22:32:15 +0200
>> > Silvan Jegen <s.jegen@gmail.com> wrote:
>> >
>> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
>> > > > On Thu, 3 May 2018 21:55:40 +0200
>> > > > Silvan Jegen <s.jegen@gmail.com> wrote:
>> > > >
>> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
>> > > > > > On Thu, 3 May 2018 20:47:27 +0200
>> > > > > > Silvan Jegen <s.jegen@gmail.com> wrote:
>> > > > > >
>> > > > > > > Hi Dorota
>> > > > > > >
>> > > > > > > Some comments and typo fixes below.
>> > > > > > >
>> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
>> > > > > > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
>> > > > > > > > +      grapheme is made up of multiple code points, an index pointing to any of
>> > > > > > > > +      them should be interpreted as pointing to the first one.
>> > > > > > >
>> > > > > > > That way we make sure we don't put the cursor/anchor between bytes that
>> > > > > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
>> > > > > > > also means that the client has to parse all the UTF-8 encoded strings
>> > > > > > > into Unicode code points up to the desired cursor/anchor position
>> > > > > > > on each "preedit_string" event. For each "delete_surrounding_text" event
>> > > > > > > the client has to parse the UTF-8 sequences before and after the cursor
>> > > > > > > position up to the requested Unicode code point.
>> > > > > > >
>> > > > > > > I feel like we are processing the UTF-8 string already in the
>> > > > > > > input-method. So I am not sure that we should parse it again on the
>> > > > > > > client side. Parsing it again would also mean that the client would need
>> > > > > > > to know about UTF-8 which would be nice to avoid.
>> > > > > > >
>> > > > > > > Thoughts?
>> > > > > >
>> > > > > > The client needs to know about Unicode, but not necessarily about
>> > > > > > UTF-8. Specifying code points is actually an advantage here, because
>> > > > > > byte offsets are inherently expressed relative to UTF-8. By counting
>> > > > > > with code points, client's internal representation can be UTF-16 or
>> > > > > > maybe even something else.
>> > > > >
>> > > > > Maybe I am misunderstanding something but the protocol specifies that
>> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
>> > > > > the strings are specified in Unicode points. To me that indicates that
>> > > > > the application *has to parse* the UTF-8 string into Unicode points
>> > > > > when receiving the event otherwise it doesn't know after which Unicode
>> > > > > point to draw the cursor. Of course the application can then decide to
>> > > > > convert the UTF-8 string into another encoding like UTF-16 for internal
>> > > > > processing (for whatever reason) but that doesn't change the fact that
>> > > > > it still would have to parse the incoming UTF-8 (and thus know about
>> > > > > UTF-8).
>> > > > >
>> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
>> > > > cursor? I tried to come up with a way to do that, but even with
>> > > > specifying byte strings, I believe that calculating the position of
>> > > > the cursor - either in pixels or in glyphs - requires full parsing of
>> > > > the input string.
>> > >
>> > > Yes, I don't think it's avoidable either. You just don't have to do
>> > > it twice if your text rendering library consumes UTF-8 strings with
>> > > byte-offsets though. See my response below.
>> > >
>> > >
>> > > > > > There's no avoiding the parsing either. What the application cares
>> > > > > > about is that the cursor falls between glyphs. The application cannot
>> > > > > > know that in all cases. Unicode allows the same sequence to be
>> > > > > > displayed in multiple ways (fallback):
>> > > > > >
>> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
>> > > > > >
>> > > > > > One could make an argument that byte offsets should never be close
>> > > > > > to ZWJ characters, but I think this decision is better left to the
>> > > > > > application, which knows what exactly it is presenting to the user.
>> > > > >
>> > > > > The idea of the previous version of the protocol (from my understanding)
>> > > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
>> > > > > falling between bytes of a Unicode code point) into the string will be
>> > > > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
>> > > > > string you trust the sender to honor the protocol and thus you can just
>> > > > > pass the UTF-8 encoded string unprocessed to your text rendering library
>> > > > > (provided that the library supports UTF-8 strings which is what I am
>> > > > > assuming) without having to parse the UTF-8 string into Unicode code
>> > > > > points.
>> > > > >
>> > > > > Of course the Unicode code points will have to be parsed at some point
>> > > > > if you want to render them. Using byte-offsets just lets you do that at
>> > > > > a later stage if your libraries support UTF-8.
>> > > > >
>> > > > >
>> > > > Doesn't that chiefly depend on what kind of the text rendering library
>> > > > though? As far as I understand, passing text to rendering is necessary
>> > > > to calculate the cursor position. At the same time, it doesn't matter
>> > > > much for the calculations whether the cursor offset is in bytes or
>> > > > code points - the library does the parsing in the last step anyway.
>> > > >
>> > > > I think you mean that if the rendering library accepts byte offsets
>> > > > as the only format, the application would have to parse the UTF-8
>> > > > unnecessarily. I agree with this, but I'm not sure we should optimize
>> > > > for this case. Other libraries may support only code points instead.
>> > > >
>> > > > Did I understand you correctly?
>> > >
>> > > Yes, that's what I meant. I also assumed that no text rendering library
>> > > expects you to pass the string length in Unicode points. I had a look
>> > > and the ones I managed to find expected their lengths in bytes:
>> > >
>> > > * Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
>> > > * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html
>> >
>> > I looked a bit deeper and found hb_buffer_add_utf8:
>> >
>> > https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576
>> >
>> > It seems to require both (either?) the number of bytes (for buffer
>> > size) and the number of code points in the same call. In this case, it
>> > doesn't matter how the position information is expressed.
>>
>> Haha, as an API I think that's horrible...
>>
>>
>> > > For those you would need to parse the UTF-8 string yourself first in
>> > > order to find out at which byte position the Unicodepoint stops where
>> > > the protocol wants you to draw the cursor (if the protocol sends Unicode
>> > > point offsets).
>> > >
>> > > I feel like it would make sense to optimize for the more common case. I
>> > > assume that is the one where you need to pass a length in bytes to the
>> > > text rendering library, not in Unicode points.
>> > >
>> > > Admittedly, I haven't used a lot of text rendering libraries so I would
>> > > very much like to hear more opinions on the issue.
>> > >
>> >
>> > Even if some libraries expect to work with bytes, I see three
>> > reasons not to provide them. Most importantly, I believe that we
>> > should avoid letting people shoot themselves in the foot whenever
>> > possible. Specifying bytes leaves a lot of wiggle room to communicate
>> > invalid state. The supporting reason is that protocols shouldn't be
>> > tied to implementation details.
>>
>> I agree that this is an advantage of using offsets measured in Unicode
>> code points.
>>
>> Still, it worries me to think about how for the next 10-20 years people
>> using these protocols have to parse their UTF-8 strings into Unicode
>> points twice for no good reason...
>>
>>
>> > The least important reason is that handling Unicode is getting better
>> > than it used to be. Taking Python as an example:
>> >
>>
>> That's true to some extent (personally I like Go's string and Unicode handling)
>> but Python is a bad example IMO. Python 3 handles strings this way while
>> Python 2 handles them in a completely different way:
>>
>> Python 2.7.15 (default, May  1 2018, 20:16:04)
>> [GCC 7.3.1 20180406] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> 'æþ'
>> '\xc3\xa6\xc3\xbe'
>> >>> 'æþ'[1]
>> '\xa6'
>>
>> and I am not sure either of them is easy and efficient to work with.
>>
>>
>> > >>> 'æþ'[1]
>> > 'þ'
>> > >>> len('æþ'.encode('utf-8'))
>> > 4
>> >
>> > Strings are natively indexed with code points. This matches at least
>> > my intuition when I'm asked to place a cursor somewhere inside a
>> > string and tell the index.
>>
>> Go expects all strings to be UTF-8 encoded and they are indexed by
>> byte. You can iterate over strings to get unicode points (called 'rune's
>> there) should you need them:
>>
>> for offset, r := range "æþ" {
>>    fmt.Printf("start byte pos: %d, code point: %c\n", offset, r)
>> }
>>
>> start byte pos: 0, code point: æ
>> start byte pos: 2, code point: þ
>>
>> Using Go's approach you can treat strings as UTF-8 bytes if that's all
>> you want to care about while still having an easy way to parse them into
>> Unicode points if you need them.
>>
>>
>> > In the end, I'm not an expert in that area either - perhaps treating
>> > client side strings as UTF-8 buffers makes sense, but at the moment
>> > I'm still leaning towards the code point abstraction.
>>
>> Someone (™) should probably implement a client making use of the protocol
>> to see what the real world impact of this protocol change would be.
>>
>> The editor in the weston project uses pango for its text layout:
>>
>> https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
>>
>> so it would have to parse the UTF-8 string twice. The same is most likely
>> true for all programs using GTK...
>>
>>
>
> I made an attempt to dig deeper, and while I stopped short of becoming this Someone for now, I gathered what I think are some important results.
>
> First, the state of the libraries. There's a lot of data I gathered, so I'll keep this section rather dense. First, another contender for the title of text layout library, and that one uses code points exclusively:
>
> https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h `gr_make_seg`
>
> https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
>
> Afterwards, I focused on GTK and Qt. As an input method plugin developer, I looked at the IM interfaces and internal data structures they expose. The results were not that clear - no mention of "code points", some references to "bytes", many to "characters" (not "chars"). What is certain is that there's a lot of converting going on behind the scenes anyway. First off, GTK seems to be moving away from bytes, judging by the comments:
>
> gtk 3.22 (`gtkimcontext.c`)
>
> `gtk_im_context_delete_surrounding`
>
>> * Asks the widget that the input context is attached to to delete
>> * characters around the cursor position by emitting the
>> * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
>> * are in characters not in bytes which differs from the usage other
>> * places in #GtkIMContext.
>
> `gtk_im_context_get_preedit_string`
>
>> * @cursor_pos: (out): location to store position of cursor (in characters)
>> *              within the preedit string.
>
> `gtk_im_context_get_surrounding`
>
>> * @cursor_index: (out): location to store byte index of the insertion
>> *        cursor within @text.
>
> gtkEntry seems to store things internally as characters.
>
> While GTK using code points internally is not a proof of anything, it's a suggestion that there is a reason not to use bytes.
>
> Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
>
>> replaceLength specifies the number of characters to be replaced
>
> a confirmation that "characters" means "code points" comes from https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value reported when "æþ|" is displayed is 2.
>
> I also spent more time than I should writing a demo implementation of an input method and a client connecting to it to check out the proposed interfaces. Predictably, it gave me a lot of trouble on the edges between bytes and code points, but I blame it on Rust's scarcity of UTF handling functions. The hack is available at https://code.puri.sm/dorota.czaplejewicz/impoc
>
> My impression at the moment is that it doesn't matter much how offsets within UTF strings are encoded, but that code points slightly better reflect what's going on in the GUI toolkits, apart from the benefits mentioned in my other emails. There seems to be so much going on behind the scenes and the parsing is so cheap that it doesn't make sense to worry about the computational aspect, just try to make things easier to get right.
>
> Unless someone chimes in with more arguments, I'm going to keep using code points in following revisions.

I don't mean to do a drive by or bikeshed, I do actually have a vested
interest in this protocol (I've implemented the previous IM protocols
on Webkit For Wayland). I've really been meaning to try it out, but
haven't yet had time. I also have quite a bit of experience with
unicode (and specifically UTF-8) due to my day job, so I wanted to
chime in...

IMHO, if you are doing UTF-8 (which you should), you should *always*
specify any offset in the string as a byte offset. I have a few
reasons for this justification:
 1. Unicode is *hard*, and it has a lot of terms that people aren't
always familiar with (code points, glyphs, encodings, and the worst
overloaded term "characters"). "a byte offset in UTF-8" should be
universally and unambiguously understood.
 2. Even if you specified the cursor offset as an index into a UTF-32
array of codepoints, you *still* could end up with the cursor "in
between" a printed glyph due to combining diacriticals.
 3. Due to UTF-8's self-synchronizing encoding, it is actually very
easy to determine if a given byte is the start of a code point, or in
the middle (and even determine *which* byte in the sequence it is).
Consequently, if you do find the offset is in the middle of a
codepoint, it is pretty trivial to either move to the next code point,
or move back to the beginning of the current code point. As such, I
have always found byte a more useful offset, because it can more
easily be converted to a code point than the other way around.
 4. As more of a "gut feel" sort of thing.... A Wayland protocol is a
pretty well defined binary API (like a networking API...), and
specifying in bytes feels more "stable"... Sorry I really don't have
solid data to back that up, but I would need a lot of convincing that
codepoints were better if someone was proposing throwing this data in
a UDP packet and blasting it across a network :)
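
To illustrate point 3, here is a minimal sketch of the resynchronization
step. It assumes the input is valid UTF-8; the helper names
(is_continuation, snap_back, snap_forward) are made up for this example and
are not part of any protocol or library.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def snap_back(data: bytes, offset: int) -> int:
    # Move a byte offset back to the start of the code point it falls in.
    while 0 < offset < len(data) and is_continuation(data[offset]):
        offset -= 1
    return offset


def snap_forward(data: bytes, offset: int) -> int:
    # Move a byte offset forward to the start of the next code point.
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


text = "æþ".encode("utf-8")  # two code points, two bytes each
print(snap_back(text, 1), snap_forward(text, 1))
print(snap_back(text, 3), snap_forward(text, 3))
```

Offsets that already sit on a code point boundary (0, 2 and 4 here) are
returned unchanged by both helpers.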

Thanks,
Joshua Watt

>
> Cheers,
> Dorota
>
Dorota Czaplejewicz May 7, 2018, 7:18 p.m.
On Sun, 6 May 2018 22:11:32 -0500
Joshua Watt <jpewhacker@gmail.com> wrote:

> On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
> <dorota.czaplejewicz@puri.sm> wrote:
> > On Sat, 5 May 2018 13:37:44 +0200
> > Silvan Jegen <s.jegen@gmail.com> wrote:
> >  
> >> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:  
> >> > On Fri, 4 May 2018 22:32:15 +0200
> >> > Silvan Jegen <s.jegen@gmail.com> wrote:
> >> >  
> >> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:  
> >> > > > On Thu, 3 May 2018 21:55:40 +0200
> >> > > > Silvan Jegen <s.jegen@gmail.com> wrote:
> >> > > >  
> >> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> >> > > > > > On Thu, 3 May 2018 20:47:27 +0200
> >> > > > > > Silvan Jegen <s.jegen@gmail.com> wrote:
> >> > > > > >  
> >> > > > > > > Hi Dorota
> >> > > > > > >
> >> > > > > > > Some comments and typo fixes below.
> >> > > > > > >
> >> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> >> > > > > > > > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> >> > > > > > > > +      grapheme is made up of multiple code points, an index pointing to any of
> >> > > > > > > > +      them should be interpreted as pointing to the first one.  
> >> > > > > > >
> >> > > > > > > That way we make sure we don't put the cursor/anchor between bytes that
> >> > > > > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> >> > > > > > > also means that the client has to parse all the UTF-8 encoded strings
> >> > > > > > > into Unicode code points up to the desired cursor/anchor position
> >> > > > > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> >> > > > > > > the client has to parse the UTF-8 sequences before and after the cursor
> >> > > > > > > position up to the requested Unicode code point.
> >> > > > > > >
> >> > > > > > > I feel like we are processing the UTF-8 string already in the
> >> > > > > > > input-method. So I am not sure that we should parse it again on the
> >> > > > > > > client side. Parsing it again would also mean that the client would need
> >> > > > > > > to know about UTF-8 which would be nice to avoid.
> >> > > > > > >
> >> > > > > > > Thoughts?  
> >> > > > > >
> >> > > > > > The client needs to know about Unicode, but not necessarily about
> >> > > > > > UTF-8. Specifying code points is actually an advantage here, because
> >> > > > > > byte offsets are inherently expressed relative to UTF-8. By counting
> >> > > > > > with code points, client's internal representation can be UTF-16 or
> >> > > > > > maybe even something else.  
> >> > > > >
> >> > > > > Maybe I am misunderstanding something but the protocol specifies that
> >> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> >> > > > > the strings are specified in Unicode points. To me that indicates that
> >> > > > > the application *has to parse* the UTF-8 string into Unicode points
> >> > > > > when receiving the event otherwise it doesn't know after which Unicode
> >> > > > > point to draw the cursor. Of course the application can then decide to
> >> > > > > convert the UTF-8 string into another encoding like UTF-16 for internal
> >> > > > > processing (for whatever reason) but that doesn't change the fact that
> >> > > > > it still would have to parse the incoming UTF-8 (and thus know about
> >> > > > > UTF-8).
> >> > > > >  
> >> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
> >> > > > cursor? I tried to come up with a way to do that, but even with
> >> > > > specifying byte strings, I believe that calculating the position of
> >> > > > the cursor - either in pixels or in glyphs - requires full parsing of
> >> > > > the input string.  
> >> > >
> >> > > Yes, I don't think it's avoidable either. You just don't have to do
> >> > > it twice if your text rendering library consumes UTF-8 strings with
> >> > > byte-offsets though. See my response below.
> >> > >
> >> > >  
> >> > > > > > There's no avoiding the parsing either. What the application cares
> >> > > > > > about is that the cursor falls between glyphs. The application cannot
> >> > > > > > know that in all cases. Unicode allows the same sequence to be
> >> > > > > > displayed in multiple ways (fallback):
> >> > > > > >
> >> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> >> > > > > >
> >> > > > > > One could make an argument that byte offsets should never be close
> >> > > > > > to ZWJ characters, but I think this decision is better left to the
> >> > > > > > application, which knows what exactly it is presenting to the user.  
> >> > > > >
> >> > > > > The idea of the previous version of the protocol (from my understanding)
> >> > > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> >> > > > > falling between bytes of a Unicode code point) into the string will be
> >> > > > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> >> > > > > string you trust the sender to honor the protocol and thus you can just
> >> > > > > pass the UTF-8 encoded string unprocessed to your text rendering library
> >> > > > > (provided that the library supports UTF-8 strings which is what I am
> >> > > > > assuming) without having to parse the UTF-8 string into Unicode code
> >> > > > > points.
> >> > > > >
> >> > > > > Of course the Unicode code points will have to be parsed at some point
> >> > > > > if you want to render them. Using byte-offsets just lets you do that at
> >> > > > > a later stage if your libraries support UTF-8.
> >> > > > >
> >> > > > >  
> >> > > > Doesn't that chiefly depend on what kind of the text rendering library
> >> > > > though? As far as I understand, passing text to rendering is necessary
> >> > > > to calculate the cursor position. At the same time, it doesn't matter
> >> > > > much for the calculations whether the cursor offset is in bytes or
> >> > > > code points - the library does the parsing in the last step anyway.
> >> > > >
> >> > > > I think you mean that if the rendering library accepts byte offsets
> >> > > > as the only format, the application would have to parse the UTF-8
> >> > > > unnecessarily. I agree with this, but I'm not sure we should optimize
> >> > > > for this case. Other libraries may support only code points instead.
> >> > > >
> >> > > > Did I understand you correctly?  
> >> > >
> >> > > Yes, that's what I meant. I also assumed that no text rendering library
> >> > > expects you to pass the string length in Unicode points. I had a look
> >> > > and the ones I managed to find expected their lengths in bytes:
> >> > >
> >> > > * Pango: https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
> >> > > * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html  
> >> >
> >> > I looked a bit deeper and found hb_buffer_add_utf8:
> >> >
> >> > https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576
> >> >
> >> > It seems to require both (either?) the number of bytes (for buffer
> >> > size) and the number of code points in the same call. In this case, it
> >> > doesn't matter how the position information is expressed.  
> >>
> >> Haha, as an API I think that's horrible...
> >>
> >>  
> >> > > For those you would need to parse the UTF-8 string yourself first in
> >> > > order to find out at which byte position the Unicodepoint stops where
> >> > > the protocol wants you to draw the cursor (if the protocol sends Unicode
> >> > > point offsets).
> >> > >
> >> > > I feel like it would make sense to optimize for the more common case. I
> >> > > assume that is the one where you need to pass a length in bytes to the
> >> > > text rendering library, not in Unicode points.
> >> > >
> >> > > Admittedly, I haven't used a lot of text rendering libraries so I would
> >> > > very much like to hear more opinions on the issue.
> >> > >  
> >> >
> >> > Even if some libraries expect to work with bytes, I see three
> >> > reasons not to provide them. Most importantly, I believe that we
> >> > should avoid letting people shoot themselves in the foot whenever
> >> > possible. Specifying bytes leaves a lot of wiggle room to communicate
> >> > invalid state. The supporting reason is that protocols shouldn't be
> >> > tied to implementation details.  
> >>
> >> I agree that this is an advantage of using offsets measured in Unicode
> >> code points.
> >>
> >> Still, it worries me to think about how for the next 10-20 years people
> >> using these protocols have to parse their UTF-8 strings into Unicode
> >> points twice for no good reason...
> >>
> >>  
> >> > The least important reason is that handling Unicode is getting better
> >> > than it used to be. Taking Python as an example:
> >> >  
> >>
> >> That's true to some extent (personally I like Go's string and Unicode handling)
> >> but Python is a bad example IMO. Python 3 handles strings this way while
> >> Python 2 handles them in a completely different way:
> >>
> >> Python 2.7.15 (default, May  1 2018, 20:16:04)
> >> [GCC 7.3.1 20180406] on linux2
> >> Type "help", "copyright", "credits" or "license" for more information.  
> >> >>> 'æþ'  
> >> '\xc3\xa6\xc3\xbe'  
> >> >>> 'æþ'[1]  
> >> '\xa6'
> >>
> >> and I am not sure either of them is easy and efficient to work with.
> >>
> >>  
> >> > >>> 'æþ'[1]  
> >> > 'þ'  
> >> > >>> len('æþ'.encode('utf-8'))  
> >> > 4
> >> >
> >> > Strings are natively indexed with code points. This matches at least
> >> > my intuition when I'm asked to place a cursor somewhere inside a
> >> > string and tell the index.  
> >>
> >> Go expects all strings to be UTF-8 encoded and they are indexed by
> >> byte. You can iterate over strings to get unicode points (called 'rune's
> >> there) should you need them:
> >>
> >> for offset, r := range "æþ" {
> >>    fmt.Printf("start byte pos: %d, code point: %c\n", offset, r)
> >> }
> >>
> >> start byte pos: 0, code point: æ
> >> start byte pos: 2, code point: þ
> >>
> >> Using Go's approach you can treat strings as UTF-8 bytes if that's all
> >> you want to care about while still having an easy way to parse them into
> >> Unicode points if you need them.
> >>
> >>  
> >> > In the end, I'm not an expert in that area either - perhaps treating
> >> > client side strings as UTF-8 buffers makes sense, but at the moment
> >> > I'm still leaning towards the code point abstraction.  
> >>
> >> Someone (™) should probably implement a client making use of the protocol
> >> to see what the real world impact of this protocol change would be.
> >>
> >> The editor in the weston project uses pango for its text layout:
> >>
> >> https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
> >>
> >> so it would have to parse the UTF-8 string twice. The same is most likely
> >> true for all programs using GTK...
> >>
> >>  
> >
> > I made an attempt to dig deeper, and while I stopped short of becoming this Someone for now, I gathered what I think are some important results.
> >
> > First, the state of the libraries. There's a lot of data I gathered, so I'll keep this section rather dense. First, another contender for the title of text layout library, and that one uses code points exclusively:
> >
> > https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h `gr_make_seg`
> >
> > https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
> >
> > Afterwards, I focused on GTK and Qt. As an input method plugin developer, I looked at the IM interfaces and internal data structures they expose. The results were not that clear - no mention of "code points", some references to "bytes", many to "characters" (not "chars"). What is certain is that there's a lot of converting going on behind the scenes anyway. First off, GTK seems to be moving away from bytes, judging by the comments:
> >
> > gtk 3.22 (`gtkimcontext.c`)
> >
> > `gtk_im_context_delete_surrounding`
> >  
> >> * Asks the widget that the input context is attached to to delete
> >> * characters around the cursor position by emitting the
> >> * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
> >> * are in characters not in bytes which differs from the usage other
> >> * places in #GtkIMContext.  
> >
> > `gtk_im_context_get_preedit_string`
> >  
> >> * @cursor_pos: (out): location to store position of cursor (in characters)
> >> *              within the preedit string.  
> >
> > `gtk_im_context_get_surrounding`
> >  
> >> * @cursor_index: (out): location to store byte index of the insertion
> >> *        cursor within @text.  
> >
> > gtkEntry seems to store things internally as characters.
> >
> > While GTK using code points internally is not a proof of anything, it's a suggestion that there is a reason not to use bytes.
> >
> > Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
> >  
> >> replaceLength specifies the number of characters to be replaced  
> >
> > a confirmation that "characters" means "code points" comes from https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value reported when "æþ|" is displayed is 2.
> >
> > I also spent more time than I should writing a demo implementation of an input method and a client connecting to it to check out the proposed interfaces. Predictably, it gave me a lot of trouble on the edges between bytes and code points, but I blame it on Rust's scarcity of UTF handling functions. The hack is available at https://code.puri.sm/dorota.czaplejewicz/impoc
> >
> > My impression at the moment is that it doesn't matter much how offsets within UTF strings are encoded, but that code points slightly better reflect what's going on in the GUI toolkits, apart from the benefits mentioned in my other emails. There seems to be so much going on behind the scenes and the parsing is so cheap that it doesn't make sense to worry about the computational aspect, just try to make things easier to get right.
> >
> > Unless someone chimes in with more arguments, I'm going to keep using code points in following revisions.  
> 
> I don't mean to do a drive by or bikeshed, I do actually have a vested
> interest in this protocol (I've implemented the previous IM protocols
> on Webkit For Wayland). I've really been meaning to try it out, but
> haven't yet had time. I also have quite a bit of experience with
> unicode (and specifically UTF-8) due to my day job, so I wanted to
> chime in...
> 
> IMHO, if you are doing UTF-8 (which you should), you should *always*
> specify any offset in the string as a byte offset. I have a few
> reasons for this justification:
>  1. Unicode is *hard*, and it has a lot of terms that people aren't
> always familiar with (code points, glyphs, encodings, and the worst
> overloaded term "characters"). "a byte offset in UTF-8" should be
> universally and unambiguously understood.
>  2. Even if you specified the cursor offset as an index into a UTF-32
> array of codepoints, you *still* could end up with the cursor "in
> between" a printed glyph due to combining diacriticals.
>  3. Due to UTF-8's self-synchronizing encoding, it is actually very
> easy to determine if a given byte is the start of a code point, or in
> the middle (and even determine *which* byte in the sequence it is).
> Consequently, if you do find the offset is in the middle of a
> codepoint, it is pretty trivial to either move to the next code point,
> or move back to the beginning of the current code point. As such, I
> have always found byte a more useful offset, because it can more
> easily be converted to a code point than the other way around.
>  4. As more of a "gut feel" sort of thing.... A Wayland protocol is a
> pretty well defined binary API (like a networking API...), and
> specifying in bytes feels more "stable"... Sorry I really don't have
> solid data to back that up, but I would need a lot of convincing that
> codepoints were better if someone was proposing throwing this data in
> a UDP packet and blasting it across a network :)
> 

Thanks for the input. My plan is to implement the server side of this protocol in wlroots, so your Webkit experience could be complementary there. Either way, I'm glad there is interest and feedback, and I hope we can find a solution that satisfies the needs.

Cheers,
Dorota
Silvan Jegen May 7, 2018, 7:55 p.m.
On Sun, May 06, 2018 at 10:37:57PM +0200, Dorota Czaplejewicz wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen <s.jegen@gmail.com> wrote:
> 
> > On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> > > On Fri, 4 May 2018 22:32:15 +0200
> > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > >   
> > > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:  
> > > > > On Thu, 3 May 2018 21:55:40 +0200
> > > > > Silvan Jegen <s.jegen@gmail.com> wrote:
> > >
> > > [...]
> > >
> > > In the end, I'm not an expert in that area either - perhaps treating
> > > client side strings as UTF-8 buffers makes sense, but at the moment
> > > I'm still leaning towards the code point abstraction.  
> > 
> > Someone (™) should probably implement a client making use of the protocol
> > to see what the real world impact of this protocol change would be.
> > 
> > The editor in the weston project uses pango for its text layout:
> > 
> > https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
> > 
> > so it would have to parse the UTF-8 string twice. The same is most likely
> > true for all programs using GTK...
> > 
> > 
> 
> I made an attempt to dig deeper, and while I stopped short of becoming
> this Someone for now, I gathered what I think are some important
> results.
> 
> First, the state of the libraries. There's a lot of data I gathered,
> so I'll keep this section rather dense. First, another contender
> for the title of text layout library, and that one uses code points
> exclusively:
> 
> https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h `gr_make_seg`
> 
> https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
> 
> Afterwards, I focused on GTK and Qt. As an input method plugin
> developer, I looked at the IM interfaces and internal data structures
> they expose. The results were not that clear - no mention of "code
> points", some references to "bytes", many to "characters" (not
> "chars"). What is certain is that there's a lot of converting going on

Yes, it's very unfortunate that a lot of developers do not strive for
more clarity and precision in terminology when processing text.


> behind the scenes anyway. First off, GTK seems to be moving away from
> bytes, judging by the comments:
> 
> gtk 3.22 (`gtkimcontext.c`)
> 
> `gtk_im_context_delete_surrounding`
> 
> > * Asks the widget that the input context is attached to to delete
> > * characters around the cursor position by emitting the
> > * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
> > * are in characters not in bytes which differs from the usage other
> > * places in #GtkIMContext.
> 
> `gtk_im_context_get_preedit_string`
> 
> > * @cursor_pos: (out): location to store position of cursor (in characters)
> > *              within the preedit string.  
> 
> `gtk_im_context_get_surrounding`
> 
> > * @cursor_index: (out): location to store byte index of the insertion
> > *        cursor within @text.
> 
> gtkEntry seems to store things internally as characters.

They mention "characters" but what they most likely mean are Unicode
code points.

One would think they would try to keep their APIs consistent but that
doesn't seem to be the case.


> While GTK using code points internally is not a proof of anything,
> it's a suggestion that there is a reason not to use bytes.
> 
> Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
> 
> > replaceLength specifies the number of characters to be replaced
> 
> a confirmation that "characters" means "code points" comes from
> https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value
> reported when "æþ|" is displayed is 2.

https://doc.qt.io/qt-5/qstring.html

Qt uses UTF-16 internally so they *could* also be counting "QChars"
which are 16-bit (assuming the position is 0 indexed):

Python 3.6.5 (default, Apr 14 2018, 13:17:30)
[GCC 7.3.1 20180406] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "æþ"
'æþ'
>>> "æþ".encode("utf-16")
b'\xff\xfe\xe6\x00\xfe\x00'

If they are really doing that you would only notice it with characters
outside of the BMP because:

"(Unicode characters with code values above 65535 are stored using
surrogate pairs, i.e., two consecutive QChars.)"
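
A quick way to see the difference is to compare the two counts for a
character outside the BMP. This is only a sketch in Python, not a check
against Qt itself; the helper name utf16_units is made up for the example.

```python
def utf16_units(text: str) -> int:
    # "utf-16-le" is used so that no byte-order mark is prepended;
    # every UTF-16 code unit is exactly two bytes.
    return len(text.encode("utf-16-le")) // 2


bmp = "æþ"              # both characters are inside the BMP
astral = "\U0001F600"   # U+1F600, outside the BMP

# len() on a Python 3 str counts code points.
print(len(bmp), utf16_units(bmp))
print(len(astral), utf16_units(astral))
```

For "æþ" both counts agree, so that string alone cannot tell code points
apart from 16-bit units. Only the second string shows a difference.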

I think everybody agrees that (Unicode) text handling is a mess in
general...


> I also spent more time than I should writing a demo implementation
> of an input method and a client connecting to it to check out the
> proposed interfaces. Predictably, it gave me a lot of trouble
> on the edges between bytes and code points, but I blame it on
> Rust's scarcity of UTF handling functions. The hack is available at
> https://code.puri.sm/dorota.czaplejewicz/impoc

Thanks for taking the time! I compiled and ran it but my Rust is weak...

Rust has an interesting String type:

https://doc.rust-lang.org/std/string/struct.String.html#utf-8

It's UTF-8 encoded but you are not allowed to index into it with an
integer; you have to slice it by byte range or iterate over its chars.


> My impression at the moment is that it doesn't matter much how offsets
> within UTF strings are encoded, but that code points slightly better
> reflect what's going on in the GUI toolkits, apart from the benefits
> mentioned in my other emails. There seems to be so much going on
> behind the scenes and the parsing is so cheap that it doesn't make
> sense to worry about the computational aspect, just try to make things
> easier to get right.
> 
> Unless someone chimes in with more arguments, I'm going to keep using
> code points in following revisions.

The only argument I have for using byte offsets instead of Unicode code
points is that you will have to parse the UTF-8 string twice in case
your text rendering library lets you only use byte lengths. That seems
to be the case for pango, which I assume is commonly used.
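
To make the "second parse" concrete, here is a rough sketch of the
conversion a client would need before handing a byte length to such a
library. The function names are made up for the example, and a real client
would do this on its own string type rather than in Python.

```python
def codepoint_to_byte_offset(text: str, index: int) -> int:
    # Encoding the prefix walks every code point before `index`,
    # which is the extra pass over the string discussed above.
    return len(text[:index].encode("utf-8"))


def byte_to_codepoint_offset(data: bytes, offset: int) -> int:
    # Raises UnicodeDecodeError if `offset` splits a code point.
    return len(data[:offset].decode("utf-8"))


text = "æþ"
data = text.encode("utf-8")

print(codepoint_to_byte_offset(text, 1))
print(byte_to_codepoint_offset(data, 2))
```

Both directions need a scan of the prefix, so neither unit is free to
convert. The difference is only in which side has to do it.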

If I come up with more arguments I will send another mail...


Cheers,

Silvan
Silvan Jegen May 7, 2018, 8:09 p.m.
Hi Joshua

On Sun, May 06, 2018 at 10:11:32PM -0500, Joshua Watt wrote:
> On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
> <dorota.czaplejewicz@puri.sm> wrote:
> > Unless someone chimes in with more arguments, I'm going to keep
> > using code points in following revisions.
> 
> I don't mean to do a drive by or bikeshed, I do actually have a vested
> interest in this protocol (I've implemented the previous IM protocols
> on Webkit For Wayland). I've really been meaning to try it out, but
> haven't yet had time. I also have quite a bit of experience with
> unicode (and specifically UTF-8) due to my day job, so I wanted to
> chime in...
> 
> IMHO, if you are doing UTF-8 (which you should), you should *always*
> specify any offset in the string as a byte offset. I have a few
> reasons for this justification:
>  1. Unicode is *hard*, and it has a lot of terms that people aren't
> always familiar with (code points, glyphs, encodings, and the worst
> overloaded term "characters"). "a byte offset in UTF-8" should be
> universally and unambiguously understood.
>  2. Even if you specified the cursor offset as an index into a UTF-32
> array of codepoints, you *still* could end up with the cursor "in
> between" a printed glyph due to combining diacriticals.

This case should be covered by the following paragraph in the protocol
spec:

+      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
+      grapheme is made up of multiple code points, an index pointing to any of
+      them should be interpreted as pointing to the first one.
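
As a rough sketch of what a client could do with that rule: the snippet
below only looks at the canonical combining class, which is an assumption
made to keep the example short. It is not full grapheme segmentation (that
would follow UAX #29), and the function name snap_to_first is made up.

```python
import unicodedata


def snap_to_first(text: str, index: int) -> int:
    # Step back over combining marks so the index points at the
    # code point that starts the sequence.
    while 0 < index < len(text) and unicodedata.combining(text[index]) != 0:
        index -= 1
    return index


text = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT, then 'x'

print(snap_to_first(text, 1))  # index of the accent is mapped to the 'e'
print(snap_to_first(text, 2))  # 'x' starts its own sequence
```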


>  3. Due to UTF-8's self-synchronizing encoding, it is actually very
> easy to determine if a given byte is the start of a code point, or in
> the middle (and even determine *which* byte in the sequence it is).
> Consequently, if you do find the offset is in the middle of a
> codepoint, it is pretty trivial to either move to the next code point,
> or move back to the beginning of the current code point. As such, I
> have always found byte a more useful offset, because it can more
> easily be converted to a code point than the other way around.

This property of UTF-8 only makes it easier to recover from an issue
you won't have to deal with at all if you specify the offsets in Unicode
code points...


>  4. As more of a "gut feel" sort of thing.... A Wayland protocol is a
> pretty well defined binary API (like a networking API...), and
> specifying in bytes feels more "stable"... Sorry I really don't have
> solid data to back that up, but I would need a lot of convincing that
> codepoints were better if someone was proposing throwing this data in
> a UDP packet and blasting it across a network :)

I am afraid gut feels don't count. And I am with you on this :P


Cheers,

Silvan
Silvan Jegen May 8, 2018, 7:07 a.m.
On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker@gmail.com> wrote:
> IMHO, if you are doing UTF-8 (which you should), you should *always*
> specify any offset in the string as a byte offset. I have a few
> reasons for this justification:

I agree with this as well. I thought some more about how to spell out my
gut feeling on this matter in more technical terms.

UTF-8 is a byte (sequence) representation of Unicode code points. This
indicates to me that an offset within a UTF-8-encoded string should also
be given in bytes. Specifying the offset in Unicode points mixes the
abstraction of the Unicode code point with (one of) its representations as
a byte sequence. This is reflected in the fact that an offset in Unicode
code points is not applicable to the UTF-8 string without first processing
the string.

Unicode code points do not give us that much either since what we most
likely want are grapheme clusters anyway (which, like any more advanced
Unicode processing, should be handled by a specialised library):
http://utf8everywhere.org/#myth.strlen


Cheers,

Silvan
Dorota Czaplejewicz May 10, 2018, 9:43 a.m.
On Tue, 08 May 2018 07:07:24 +0000
Silvan Jegen <s.jegen@gmail.com> wrote:

> On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker@gmail.com> wrote:
> > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > specify any offset in the string as a byte offset. I have a few
> > reasons for this justification:  
> 
> I agree with this as well. I thought some more about how to spell out my
> gut feeling on this matter in more technical terms.
> 
> UTF-8 is a byte (sequence) representation of Unicode code points. This
> indicates to me that an offset within a UTF-8-encoded string should also
> be given in bytes. Specifying the offset in Unicode points mixes the
> abstraction of the Unicode code point with (one of) its representations as
> a byte sequence. This is reflected in the fact that an offset in Unicode
> code points is not applicable to the UTF-8 string without first processing
> the string.
> 
> Unicode code points do not give us that much either since what we most
> likely want are grapheme clusters anyway (which, like any more advanced
> Unicode processing, should be handled by a specialised library):
> http://utf8everywhere.org/#myth.strlen
> 
> 
> Cheers,
> 
> Silvan

This message made me feel obliged to turn my own gut feeling into words. This is not to be construed as an argument, but more of an explanation.

I view wayland protocols as rather high level: their responsibility is to specify the type and the purpose of the data they are transporting. In this case, the data is a Unicode string, and the purpose is display. Or, the data is a number and the purpose is indexing.

I think that when a protocol starts to specify the type and purpose, it can no longer be thought of as high level. In this view, indexing a Unicode string in terms of bytes would be akin to indexing any other vector of Foo in bytes. (I didn't actually check if there is any other vector, or bytes type available in wayland).

As you noted, there is some mixing between abstraction levels in the protocol. Hardcoding that it's not *just* Unicode, but also the particular encoding (UTF-8) eliminates problems with byte indexing we would have encountered if we decided to use things like Punycode (München => Mnchen-3ya). Knowing that it's always UTF-8 allows the protocol to use a tailored indexing scheme. While I consider this a layer-breaking hack, this property nevertheless partially counters the above reasoning.
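
A small sketch of the Punycode point, using Python's built-in "punycode"
codec: the same code point index maps to a sensible byte offset in UTF-8,
while the Punycode form has no byte range for the "ü" at all.

```python
word = "München"

utf8 = word.encode("utf-8")
puny = word.encode("punycode")

print(utf8)
print(puny)

# In UTF-8 the prefix before code point index 2 ("Mü") is 3 bytes long,
# so a byte offset for the cursor is well defined.
print(len(word[:2].encode("utf-8")))

# In Punycode the non-ASCII character is folded into the trailing
# suffix, so its UTF-8 bytes appear nowhere in the encoded form.
print("ü".encode("utf-8") in puny)
```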

* * *

To be honest, neither Unicode code points nor graphemes nor clusters are what we're truly looking for here. To understand what I mean, I recommend playing with this word:

नमस्ते

According to the Rust book [0], it's composed of 6 code points: ['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor around, I would be led to believe it's 4 "pieces" long only.
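
The different counts for that word can be reproduced with a short sketch.
The string is written with escapes so that the six code points are explicit.
Counting the code points that are not marks is only a crude stand-in for
cursor positions, assumed here for illustration; it is not the segmentation
a toolkit would actually use.

```python
import unicodedata

# न म स ् त े
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

code_points = len(word)
utf8_bytes = len(word.encode("utf-8"))

# General categories starting with "M" are combining marks
# (here the virama and the vowel sign).
marks = [c for c in word if unicodedata.category(c).startswith("M")]

print(code_points, utf8_bytes, len(marks))
```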

Cheers,
Dorota

[0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html
Dorota Czaplejewicz May 10, 2018, 9:46 a.m.
On Thu, 10 May 2018 11:43:12 +0200
Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm> wrote:

> On Tue, 08 May 2018 07:07:24 +0000
> Silvan Jegen <s.jegen@gmail.com> wrote:
> 
> > On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker@gmail.com> wrote:  
> > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > specify any offset in the string as a byte offset. I have a few
> > > reasons for this justification:    
> > 
> > I agree with this as well. I thought some more about how to spell out my
> > gut feeling on this matter in more technical terms.
> > 
> > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > indicates to me that an offset within an UTF-8-encoded string should also
> > be given in bytes. Specifying the offset in Unicode points mixes the
> > abstraction of the Unicode code point with (one of) its representations as
> > a byte sequence. This is reflected in the fact that an offset in Unicode
> > code points is not applicable to the UTF-8 string without first processing
> > the string.
> > 
> > Unicode code points do not give us that much either since what we most
> > likely want are grapheme clusters anyway (which, like any more advanced
> > Unicode processing, should be handled by a specialised library):
> > http://utf8everywhere.org/#myth.strlen
> > 
> > 
> > Cheers,
> > 
> > Silvan  
> 
> This message made me feel obliged to turn my own gut feeling into words. This is not to be construed as an argument, but more of an explanation.
> 
> I view wayland protocols as rather high level: their responsibility is to specify the type and the purpose of the data they are transporting. In this case, the data is a Unicode string, and the purpose is display. Or, the data is a number and the purpose is indexing.
> 
> I think that when a protocol starts to specify the type and purpose, it can no longer be thought as high level. In this view, indexing a Unicode string in terms of bytes would be akin to indexing any other vector of Foo in bytes. (I didn't actually check if there is any other vector, or bytes type available in wayland).
> 
> As you noted, there is some mixing between abstraction levels in the protocol. Hardcoding that it's not *just* Unicode, but also the particular encoding (UTF-8) eliminates problems with byte indexing we would have encountered if we decided to use things like Punycode (München => Mnchen-3ya). Knowing that it's always UTF-8 allows the protocol to use a tailoring indexing scheme. While I consider this a layer-breaking hack, nevertheless, this property partially counters the above reasoning.
> 
> * * *
> 
> To be honest, neither Unicode code points nor graphemes nor clusters are what we're truly looking for here. To understand what I mean, I recommend to play with this grapheme cluster:
> 
> नमस्ते
> 
> According to the Rust book [0], it's composed of 6 code points: ['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor around, I would be led to believe it's 4 "pieces" long only.
> 
> Cheers,
> Dorota
> 
> [0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html

On second thought, perhaps graphemes are actually the relevant thing here...
Silvan Jegen May 10, 2018, 12:29 p.m.
On Thu, May 10, 2018 at 11:46:32AM +0200, Dorota Czaplejewicz wrote:
> On Thu, 10 May 2018 11:43:12 +0200
> Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm> wrote:
> 
> > On Tue, 08 May 2018 07:07:24 +0000
> > Silvan Jegen <s.jegen@gmail.com> wrote:
> > 
> > > On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhacker@gmail.com> wrote:  
> > > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > > specify any offset in the string as a byte offset. I have a few
> > > > reasons for this justification:    
> > > 
> > > I agree with this as well. I thought some more about how to spell out my
> > > gut feeling on this matter in more technical terms.
> > > 
> > > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > > indicates to me that an offset within an UTF-8-encoded string should also
> > > be given in bytes. Specifying the offset in Unicode points mixes the
> > > abstraction of the Unicode code point with (one of) its representations as
> > > a byte sequence. This is reflected in the fact that an offset in Unicode
> > > code points is not applicable to the UTF-8 string without first processing
> > > the string.
> > > 
> > > Unicode code points do not give us that much either since what we most
> > > likely want are grapheme clusters anyway (which, like any more advanced
> > > Unicode processing, should be handled by a specialised library):
> > > http://utf8everywhere.org/#myth.strlen
> > > 
> > > 
> > > Cheers,
> > > 
> > > Silvan  
> > 
> > This message made me feel obliged to turn my own gut feeling into
> > words. This is not to be construed as an argument, but more of an
> > explanation.
> > 
> > I view wayland protocols as rather high level: their responsibility
> > is to specify the type and the purpose of the data they are
> > transporting. In this case, the data is a Unicode string, and the
> > purpose is display. Or, the data is a number and the purpose is
> > indexing.
> > 
> > I think that when a protocol starts to specify the type and purpose,
> > it can no longer be thought as high level. In this view, indexing a
> > Unicode string in terms of bytes would be akin to indexing any other
> > vector of Foo in bytes. (I didn't actually check if there is any
> > other vector, or bytes type available in wayland).
> > 
> > As you noted, there is some mixing between abstraction levels in
> > the protocol. Hardcoding that it's not *just* Unicode, but also the
> > particular encoding (UTF-8) eliminates problems with byte indexing
> > we would have encountered if we decided to use things like Punycode
> > (München => Mnchen-3ya). Knowing that it's always UTF-8 allows the
> > protocol to use a tailoring indexing scheme. While I consider this a
> > layer-breaking hack, nevertheless, this property partially counters
> > the above reasoning.
> > 
> > * * *
> > 
> > To be honest, neither Unicode code points nor graphemes nor clusters
> > are what we're truly looking for here. To understand what I mean, I
> > recommend to play with this grapheme cluster:
> > 
> > नमस्ते
> > 
> > According to the Rust book [0], it's composed of 6 code points:
> > ['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor
> > around, I would be led to believe it's 4 "pieces" long only.
> > 
> > Cheers,
> > Dorota
> > 
> > [0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html
> 
> On a second thought, perhaps graphemes are actually the relevant thing here...

Yes, that's also mentioned in the rust book:

https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my

and what I mentioned in my mail.

I agree with what is mentioned in http://utf8everywhere.org/#myth.strlen
which is that Unicode code points are almost never what people making
use of the protocol would want:

"Yet, the number of code points in it is irrelevant to almost any software
engineering task, with perhaps the only exception of converting the
string to UTF-32"

So instead just specifying a byte offset (thus not mixing layers of
abstraction) and leaving more specialized Unicode handling (if desired by
the client) to specialized libraries seems like the best way to go.
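
As a rough illustration of the processing a code point offset requires before it can be applied to a UTF-8 string, here is a small Python sketch. The function name is made up for this mail and is not part of any Wayland API; it only assumes the input is valid UTF-8.

```python
def byte_offset_for_code_point(data: bytes, cp_index: int) -> int:
    """Return the byte offset at which code point number cp_index starts.

    The buffer has to be scanned from the beginning, because code points
    occupy between one and four bytes in UTF-8.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == cp_index:
                return offset
            seen += 1
    if seen == cp_index:
        # An index one past the last code point maps to the end of the buffer.
        return len(data)
    raise IndexError("code point index out of range")


data = "M\u00fcnchen".encode("utf-8")  # 7 code points, 8 bytes
print(byte_offset_for_code_point(data, 2))
```

A byte offset, by contrast, can be used on the buffer directly without this scan.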


Cheers,

Silvan
Daniel Stone May 17, 2018, 5:05 p.m.
Hi Dorota,

On 3 May 2018 at 16:41, Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm> wrote:
> - There is no event to send keysyms. Compositors can use wl_keyboard
>   interface instead.

The reason we explicitly chose to have a keysym (really, 'Unicode
codepoint') event, is to support characters which don't appear in any
keymap. As a trivial example, emoji keyboards will want to send
symbols which appear in no sane keymap. Similarly, CJK input methods
may offer streams of characters pre-composed from component runs; it
is not practical to insert the entire CJK unicode space into a keymap.

Cheers,
Daniel
Dorota Czaplejewicz May 17, 2018, 6:02 p.m.
On Thu, 17 May 2018 18:05:34 +0100
Daniel Stone <daniel@fooishbar.org> wrote:

> Hi Dorota,
> 
> On 3 May 2018 at 16:41, Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm> wrote:
> > - There is no event to send keysyms. Compositors can use wl_keyboard
> >   interface instead.  
> 
> The reason we explicitly chose to have a keysym (really, 'Unicode
> codepoint') event, is to support characters which don't appear in any
> keymap. As a trivial example, emoji keyboards will want to send
> symbols which appear in no sane keymap. Similarly, CJK input methods
> may offer streams of characters pre-composed from component runs; it
> is not practical to insert the entire CJK unicode space into a keymap.
> 
> Cheers,
> Daniel


Hi Daniel,

I think that anyone wanting to support inserting arbitrary Unicode characters should use the text composition requests instead (commit_string and friends). Input methods, especially CJK ones, will make use of that functionality anyway. If removing keysyms makes something impossible, I would rather fix the text composition portion of the protocol.

Cheers,
Dorota
Carlos Garnacho July 17, 2018, 5:18 p.m.
Hi!,

(Way way late, trying to revive the conversation...)

On Thu, May 3, 2018 at 9:22 PM, Dorota Czaplejewicz
<dorota.czaplejewicz@puri.sm> wrote:
> On Thu, 3 May 2018 20:47:27 +0200
> Silvan Jegen <s.jegen@gmail.com> wrote:
>
>> Hi Dorota
>>
>> Some comments and typo fixes below.
>>
>> On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
>> > This new protocol description is a simplification over v2.
>> >
>> > - All pre-edit text styling is gone.
>> > - Pre-edit cursor can span characters.
>> > - No events regarding input panel (OSK) state nor covered rectangle.
>> >   Compositors are still free to handle situations where the keyboard
>> >   focus rectangle is covered by the input panel.
>> > - No set_preferred_language request for clients.
>> > - There is no event to send keysyms. Compositors can use wl_keyboard
>> >   interface instead.
>> > - All state is double-buffered, with specified state.
>> > - Use Unicode codepoints to measure strings.
>> >
>> > Signed-off-by: Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm>
>> > Signed-off-by: Carlos Garnacho <carlosg@gnome.org>
>> > ---
>> > This is the next update coming from Purism to perfect the text input protocol.
>> >
>> > The following changes added on top of PATCHv3:
>> >
>> > - Fixed whitespaces.
>> > - Removed enable flags - the same information can be gathered from the first requests after enter.
>> > - Changed offsets inside UTF-8 strings to use Unicode character counts in order to remove the possibility of communicating invalid state.
>> > - Specified the exact lifetime of double-buffered state, and initial values.
>> > - Made changes requested by the IM double-buffered.
>> >
>> > Some questions remain open. One is: how to specify how much text to capture in set_surrounding_text, and how often to update?

IMHO the only reason to state it here is that it's more likely that a
lazy implementation will try to squeeze a full book here, than eg. an
application setting an insanely long title. But certainly other
messages across protocols may hit this limit (the long title issue
wasn't made up :).

As for how much, I think it ultimately depends on the IM behind. Text
correction probably just wants the current word, any sort of
prediction will probably require phrases to paragraphs, and char
composition can probably do without. Sounds like this could be some
sort of hint, but I don't think IMs can tell you today how much text
they want...
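
To sketch what a client could do in the meantime: cut a window of text around the cursor that fits a byte budget, without splitting a UTF-8 sequence. This is only an illustration in Python; the function name and the budget value are assumptions of mine, nothing in the patch specifies them, and the cursor is assumed to already sit on a code point boundary.

```python
def surrounding_window(data: bytes, cursor: int, budget: int):
    """Return (window, cursor_in_window) with len(window) <= budget.

    data is valid UTF-8, cursor is a byte offset on a code point boundary,
    and budget is a hypothetical byte limit chosen by the client.
    """
    half = budget // 2
    start = max(0, cursor - half)
    end = min(len(data), cursor + half)

    # Move the start forward until it no longer points at a continuation byte.
    while start < cursor and (data[start] & 0xC0) == 0x80:
        start += 1
    # Move the end backward until the byte after the window starts a code point.
    while end > cursor and end < len(data) and (data[end] & 0xC0) == 0x80:
        end -= 1

    return data[start:end], cursor - start


text = ("\u00fc" * 10).encode("utf-8")  # 20 bytes, 2 bytes per character
window, new_cursor = surrounding_window(text, 10, 7)
print(window.decode("utf-8"), new_cursor)
```

How large the budget should be is exactly the open question above; the sketch only shows that trimming on boundaries is cheap once a number is picked.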

>> >
>> > A possible change that I decided against for now is to replace enable/disable events by create/destroy of a new object, which would make more state lifetimes encoded in the protocol.
>> >
>> > After reading a blog post on fcitx [0], I got the impression that letting the compositor know some persistent ID of a text edit instance could be useful, however I'm not sure what the use cases are.
>> >
>> > As always, I'm happy to hear feedback.
>> >
>> > Cheers,
>> > Dorota Czaplejewicz
>> >
>> > [0] https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
>> >
>> >  Makefile.am                                    |   1 +
>> >  unstable/text-input/text-input-unstable-v3.xml | 362 +++++++++++++++++++++++++
>> >  2 files changed, 363 insertions(+)
>> >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
>> >
>> > diff --git a/Makefile.am b/Makefile.am
>> > index 4b9a901..86d7ca9 100644
>> > --- a/Makefile.am
>> > +++ b/Makefile.am
>> > @@ -3,6 +3,7 @@ unstable_protocols =                                                                \
>> >     unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml              \
>> >     unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml                      \
>> >     unstable/text-input/text-input-unstable-v1.xml                          \
>> > +   unstable/text-input/text-input-unstable-v3.xml                          \
>> >     unstable/input-method/input-method-unstable-v1.xml                      \
>> >     unstable/xdg-shell/xdg-shell-unstable-v5.xml                            \
>> >     unstable/xdg-shell/xdg-shell-unstable-v6.xml                            \
>> > diff --git a/unstable/text-input/text-input-unstable-v3.xml b/unstable/text-input/text-input-unstable-v3.xml
>> > new file mode 100644
>> > index 0000000..ed5204f
>> > --- /dev/null
>> > +++ b/unstable/text-input/text-input-unstable-v3.xml
>> > @@ -0,0 +1,362 @@
>> > +<?xml version="1.0" encoding="UTF-8"?>
>> > +
>> > +<protocol name="text_input_unstable_v3">
>> > +  <copyright>
>> > +    Copyright © 2012, 2013 Intel Corporation
>> > +    Copyright © 2015, 2016 Jan Arne Petersen
>> > +    Copyright © 2017, 2018 Red Hat, Inc.
>> > +    Copyright © 2018 Purism SPC
>> > +
>> > +    Permission to use, copy, modify, distribute, and sell this
>> > +    software and its documentation for any purpose is hereby granted
>> > +    without fee, provided that the above copyright notice appear in
>> > +    all copies and that both that copyright notice and this permission
>> > +    notice appear in supporting documentation, and that the name of
>> > +    the copyright holders not be used in advertising or publicity
>> > +    pertaining to distribution of the software without specific,
>> > +    written prior permission.  The copyright holders make no
>> > +    representations about the suitability of this software for any
>> > +    purpose.  It is provided "as is" without express or implied
>> > +    warranty.
>> > +
>> > +    THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
>> > +    SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
>> > +    FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
>> > +    SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
>> > +    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
>> > +    AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
>> > +    ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
>> > +    THIS SOFTWARE.
>> > +  </copyright>
>> > +
>> > +  <interface name="zwp_text_input_v3" version="1">
>> > +    <description summary="text input">
>> > +      The zwp_text_input_v3 interface represents text input and input methods
>> > +      associated with a seat. It provides enter/leave events to follow the
>> > +      text input focus for a seat.
>> > +
>> > +      Requests are used to enable/disable the text-input object and set
>> > +      state information like surrounding and selected text or the content type.
>> > +      The information about the entered text is sent to the text-input object
>> > +      via the pre-edit and commit_string events.
>> > +
>> > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
>> > +      grapheme is made up of multiple code points, an index pointing to any of
>> > +      them should be interpreted as pointing to the first one.
>>
>> That way we make sure we don't put the cursor/anchor between bytes that
>> belong to the same UTF-8 encoded Unicode code point which is nice. It
>> also means that the client has to parse all the UTF-8 encoded strings
>> into Unicode code points up to the desired cursor/anchor position
>> on each "preedit_string" event. For each "delete_surrounding_text" event
>> the client has to parse the UTF-8 sequences before and after the cursor
>> position up to the requested Unicode code point.
>>
>> I feel like we are processing the UTF-8 string already in the
>> input-method. So I am not sure that we should parse it again on the
>> client side. Parsing it again would also mean that the client would need
>> to know about UTF-8 which would be nice to avoid.
>>
>> Thoughts?
>
> The client needs to know about Unicode, but not necessarily about UTF-8. Specifying code points is actually an advantage here, because byte offsets are inherently expressed relative to UTF-8. By counting with code points, client's internal representation can be UTF-16 or maybe even something else.

I personally think byte offsets are handier than code points:
pointer math is O(1), str*() functions are "sensible" (on UTF-8 at
least, and past the bytes!=chars gotchas), and it's relatively simple
to find out whether you are in the middle of a UTF-8 char. That seems
simpler to deal with than the other way around if UTF-16/code points
are used on either side. This might even be moot, as all parties are
interested in chopping strings at word/char boundaries.
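
For reference, the "middle of a UTF-8 char" check is a single bit test on the byte at the offset. A minimal Python sketch follows; the helper names are made up here and the input is assumed to be valid UTF-8.

```python
def is_code_point_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not fall inside a multi-byte UTF-8 sequence."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # Continuation bytes always have the bit pattern 0b10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def snap_to_boundary(data: bytes, offset: int) -> int:
    """Move an in-range offset back to the start of the code point it is in."""
    while offset > 0 and not is_code_point_boundary(data, offset):
        offset -= 1
    return offset


data = "M\u00fcnchen".encode("utf-8")
print([is_code_point_boundary(data, i) for i in range(len(data) + 1)])
print(snap_to_boundary(data, 2))
```

Since a UTF-8 sequence is at most four bytes long, snapping backward inspects at most three continuation bytes, so the check stays O(1).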

As for using UTF-8 specifically, other protocols do use it for
exchange of strings (eg. xdg_surface.set_title). It's the perfect fit
for glib/pango/etc, so it wouldn't be me who objects, either :).

Cheers,
  Carlos
Dorota Czaplejewicz July 23, 2018, 12:26 p.m.
Hi Carlos,

thanks for reviewing!

On Tue, 17 Jul 2018 19:18:36 +0200
Carlos Garnacho <carlosg@gnome.org> wrote:

> Hi!,
> 
> (Way way late, trying to revive the conversation...)
> 
> On Thu, May 3, 2018 at 9:22 PM, Dorota Czaplejewicz
> <dorota.czaplejewicz@puri.sm> wrote:
> > On Thu, 3 May 2018 20:47:27 +0200
> > Silvan Jegen <s.jegen@gmail.com> wrote:
> >  
> >> Hi Dorota
> >>
> >> Some comments and typo fixes below.
> >>
> >> On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> >> > This new protocol description is a simplification over v2.
> >> >
> >> > - All pre-edit text styling is gone.
> >> > - Pre-edit cursor can span characters.
> >> > - No events regarding input panel (OSK) state nor covered rectangle.
> >> >   Compositors are still free to handle situations where the keyboard
> >> >   focus rectangle is covered by the input panel.
> >> > - No set_preferred_language request for clients.
> >> > - There is no event to send keysyms. Compositors can use wl_keyboard
> >> >   interface instead.
> >> > - All state is double-buffered, with specified state.
> >> > - Use Unicode codepoints to measure strings.
> >> >
> >> > Signed-off-by: Dorota Czaplejewicz <dorota.czaplejewicz@puri.sm>
> >> > Signed-off-by: Carlos Garnacho <carlosg@gnome.org>
> >> > ---
> >> > This is the next update coming from Purism to perfect the text input protocol.
> >> >
> >> > The following changes added on top of PATCHv3:
> >> >
> >> > - Fixed whitespaces.
> >> > - Removed enable flags - the same information can be gathered from the first requests after enter.
> >> > - Changed offsets inside UTF-8 strings to use Unicode character counts in order to remove the possibility of communicating invalid state.
> >> > - Specified the exact lifetime of double-buffered state, and initial values.
> >> > - Made changes requested by the IM double-buffered.
> >> >
> >> > Some questions remain open. One is: how to specify how much text to capture in set_surrounding_text, and how often to update?  
> 
> IMHO the only reason to state it here is that it's more likely that a
> lazy implementation will try to squeeze a full book here, than eg. an
> application setting an insanely long title. But certainly other
> messages across protocols may hit this limit (the long title issue
> wasn't made up :).
> 
> As for how much, I think it ultimately depends on the IM behind. Text
> correction probably just wants the current word, any sort of
> prediction will probably require phrases to paragraphs, char
> composition can probably do without. Sounds like this could be some
> sort of hint, but I don't think IMs can tell you today how much text
> do they want...
> 
> >> >
> >> > A possible change that I decided against for now is to replace enable/disable events by create/destroy of a new object, which would make more state lifetimes encoded in the protocol.
> >> >
> >> > After reading a blog post on fcitx [0], I got the impression that letting the compositor know some persistent ID of a text edit instance could be useful, however I'm not sure what the use cases are.
> >> >
> >> > As always, I'm happy to hear feedback.
> >> >
> >> > Cheers,
> >> > Dorota Czaplejewicz
> >> >
> >> > [0] https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
> >> >
> >> >  Makefile.am                                    |   1 +
> >> >  unstable/text-input/text-input-unstable-v3.xml | 362 +++++++++++++++++++++++++
> >> >  2 files changed, 363 insertions(+)
> >> >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
> >> >
> >> > diff --git a/Makefile.am b/Makefile.am
> >> > index 4b9a901..86d7ca9 100644
> >> > --- a/Makefile.am
> >> > +++ b/Makefile.am
> >> > @@ -3,6 +3,7 @@ unstable_protocols =                                                                \
> >> >     unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml              \
> >> >     unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml                      \
> >> >     unstable/text-input/text-input-unstable-v1.xml                          \
> >> > +   unstable/text-input/text-input-unstable-v3.xml                          \
> >> >     unstable/input-method/input-method-unstable-v1.xml                      \
> >> >     unstable/xdg-shell/xdg-shell-unstable-v5.xml                            \
> >> >     unstable/xdg-shell/xdg-shell-unstable-v6.xml                            \
> >> > diff --git a/unstable/text-input/text-input-unstable-v3.xml b/unstable/text-input/text-input-unstable-v3.xml
> >> > new file mode 100644
> >> > index 0000000..ed5204f
> >> > --- /dev/null
> >> > +++ b/unstable/text-input/text-input-unstable-v3.xml
> >> > @@ -0,0 +1,362 @@
> >> > +<?xml version="1.0" encoding="UTF-8"?>
> >> > +
> >> > +<protocol name="text_input_unstable_v3">
> >> > +  <copyright>
> >> > +    Copyright © 2012, 2013 Intel Corporation
> >> > +    Copyright © 2015, 2016 Jan Arne Petersen
> >> > +    Copyright © 2017, 2018 Red Hat, Inc.
> >> > +    Copyright © 2018 Purism SPC
> >> > +
> >> > +    Permission to use, copy, modify, distribute, and sell this
> >> > +    software and its documentation for any purpose is hereby granted
> >> > +    without fee, provided that the above copyright notice appear in
> >> > +    all copies and that both that copyright notice and this permission
> >> > +    notice appear in supporting documentation, and that the name of
> >> > +    the copyright holders not be used in advertising or publicity
> >> > +    pertaining to distribution of the software without specific,
> >> > +    written prior permission.  The copyright holders make no
> >> > +    representations about the suitability of this software for any
> >> > +    purpose.  It is provided "as is" without express or implied
> >> > +    warranty.
> >> > +
> >> > +    THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
> >> > +    SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
> >> > +    FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
> >> > +    SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> >> > +    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
> >> > +    AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
> >> > +    ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
> >> > +    THIS SOFTWARE.
> >> > +  </copyright>
> >> > +
> >> > +  <interface name="zwp_text_input_v3" version="1">
> >> > +    <description summary="text input">
> >> > +      The zwp_text_input_v3 interface represents text input and input methods
> >> > +      associated with a seat. It provides enter/leave events to follow the
> >> > +      text input focus for a seat.
> >> > +
> >> > +      Requests are used to enable/disable the text-input object and set
> >> > +      state information like surrounding and selected text or the content type.
> >> > +      The information about the entered text is sent to the text-input object
> >> > +      via the pre-edit and commit_string events.
> >> > +
> >> > +      Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> >> > +      grapheme is made up of multiple code points, an index pointing to any of
> >> > +      them should be interpreted as pointing to the first one.  
> >>
> >> That way we make sure we don't put the cursor/anchor between bytes that
> >> belong to the same UTF-8 encoded Unicode code point which is nice. It
> >> also means that the client has to parse all the UTF-8 encoded strings
> >> into Unicode code points up to the desired cursor/anchor position
> >> on each "preedit_string" event. For each "delete_surrounding_text" event
> >> the client has to parse the UTF-8 sequences before and after the cursor
> >> position up to the requested Unicode code point.
> >>
> >> I feel like we are processing the UTF-8 string already in the
> >> input-method. So I am not sure that we should parse it again on the
> >> client side. Parsing it again would also mean that the client would need
> >> to know about UTF-8 which would be nice to avoid.
> >>
> >> Thoughts?  
> >
> > The client needs to know about Unicode, but not necessarily about UTF-8. Specifying code points is actually an advantage here, because byte offsets are inherently expressed relative to UTF-8. By counting with code points, client's internal representation can be UTF-16 or maybe even something else.  
> 
> I personally think byte offsets are more handy than codepoints:
> pointer math is O(1) and str*() functions are "sensible" (on UTF-8 at
> least, and past the bytes!=chars gotchas), it's relatively simple to
> find out whether you are in the middle of a UTF-8 char, it seems
> simpler to deal with than the other way around if utf16/codepoints are
> used in either side; and this might even be moot as all parties are
> interested in chopping strings between word/char boundaries.
> 
> As for using UTF-8 specifically, other protocols do use it for
> exchange of strings (eg. xdg_surface.set_title). It's the perfect fit
> for glib/pango/etc, so it wouldn't be me who objects, either :).
> 
> Cheers,
>   Carlos

I think you're tipping the scales here. In the interest of having the protocol move forward I'm changing code points to bytes, since I don't think they make a huge difference in practice. v5 incoming!

Cheers,
Dorota