46 changes: 34 additions & 12 deletions Doc/library/re.rst
@@ -279,25 +279,47 @@ The special characters are:
``[]()[{}]`` will match a right bracket, as well as left bracket, braces,
and parentheses.

-.. .. index:: single: --; in regular expressions
-.. .. index:: single: &&; in regular expressions
-.. .. index:: single: ~~; in regular expressions
-.. .. index:: single: ||; in regular expressions
-
-* Support of nested sets and set operations as in `Unicode Technical
-Standard #18`_ might be added in the future. This would change the
-syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
-in ambiguous cases for the time being.
-That includes sets starting with a literal ``'['`` or containing literal
-character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
-avoid a warning escape them with a backslash.
+.. index::
+single: --; in regular expressions
+single: &&; in regular expressions
+single: ||; in regular expressions
+
+* A character set may contain a nested set written in square brackets, and
+two sets may be combined with a set operator, as in `Unicode Technical
+Standard #18`_:
+
+* ``[A--B]`` (*difference*) matches a character that is in *A* but not
+in *B*; for example ``[a-z--[aeiou]]`` matches an ASCII lowercase
+consonant.
+* ``[A&&B]`` (*intersection*) matches a character that is in both *A*
+and *B*; for example ``[\w&&[a-z]]`` matches an ASCII lowercase letter.
+* ``[A||B]`` (*union*) matches a character that is in *A* or in *B*; this
+is the same as listing the members of both sets in a single set, but
+allows combining nested sets.
+
+Operators have no precedence and are applied from left to right. To
+group, write a nested set as the operand after an operator, as in
+``[a-z--[aeiou]]``. A leading ``'^'`` complements the whole result.
+A ``'['`` begins a nested set only immediately after a set operator;
+anywhere else -- including at the start of a character set -- it is an
+ordinary character, so existing patterns keep their meaning. Escape it
+as ``'\['`` to include a literal ``'['`` right after an operator.

.. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
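
On an interpreter without this change, the difference and intersection examples can be approximated with lookaheads. The following is a minimal sketch under that assumption, not the implementation from this PR; the names ``CONSONANT``, ``LOWER_WORD_CHAR``, ``consonants`` and ``lower_word_chars`` are illustrative only and not part of :mod:`re`.

```python
import re

# Stand-in for the proposed [a-z--[aeiou]] (difference): the negative
# lookahead rejects vowels before [a-z] consumes one character.
CONSONANT = re.compile(r'(?![aeiou])[a-z]')

# Stand-in for the proposed [\w&&[a-z]] (intersection): the positive
# lookahead requires the character to be in [a-z] as well as in \w.
LOWER_WORD_CHAR = re.compile(r'(?=[a-z])\w')


def consonants(text):
    """Return every ASCII lowercase consonant in text, in order."""
    return CONSONANT.findall(text)


def lower_word_chars(text):
    """Return every character of text that is both \\w and in a-z."""
    return LOWER_WORD_CHAR.findall(text)
```

For example, ``consonants("regex")`` returns ``['r', 'g', 'x']``. The lookahead forms need one assertion per subtracted or intersected set, which is the clutter the set operators are meant to remove.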

+.. note::
+
+Symmetric difference (``A~~B``) is not yet supported; a literal ``'~~'``
+in a character set still raises a :exc:`FutureWarning`.
+
.. versionchanged:: 3.7
:exc:`FutureWarning` is raised if a character set contains constructs
that will change semantically in the future.

+.. versionchanged:: next
+Added support for nested sets and the set operators ``--``, ``&&``
+and ``||``.

.. index:: single: | (vertical bar); in regular expressions

``|``
12 changes: 12 additions & 0 deletions Doc/whatsnew/3.16.rst
@@ -181,6 +181,18 @@ os
(Contributed by Maurycy Pawłowski-Wieroński in :gh:`149464`.)


+re
+--
+
+* :mod:`re` now supports set operations and nested sets in character classes,
+as described in `Unicode Technical Standard #18
+<https://unicode.org/reports/tr18/>`__: set difference (``[A--B]``),
+intersection (``[A&&B]``) and union (``[A||B]``), where an operand may be a
+nested set written in square brackets. For example, ``[a-z--[aeiou]]``
+matches an ASCII lowercase consonant.
+(Contributed by Serhiy Storchaka in :gh:`152100`.)
+
+
shlex
-----

2 changes: 1 addition & 1 deletion Lib/_strptime.py
@@ -238,7 +238,7 @@ def __calc_date_time(self):
current_format = current_format.replace(tz, "%Z")
# Transform all non-ASCII digits to digits in range U+0660 to U+0669.
if not current_format.isascii() and self.LC_alt_digits is None:
-current_format = re_sub(r'\d(?<![0-9])',
+current_format = re_sub(r'[\d--0-9]',
lambda m: chr(0x0660 + int(m[0])),
current_format)
for old, new in replacement_pairs:
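
Both forms of the pattern in this hunk are meant to select decimal digits outside ASCII ``0-9``. A small sketch of the pre-change form, which runs on current interpreters, may help when reviewing the replacement; the function name ``to_arabic_indic`` is hypothetical and does not exist in ``_strptime``.

```python
import re


def to_arabic_indic(text):
    """Map every non-ASCII decimal digit to the U+0660..U+0669 range."""
    # \d matches any Unicode decimal digit; the lookbehind then rejects
    # ASCII 0-9, so only non-ASCII digits are replaced.
    return re.sub(r'\d(?<![0-9])',
                  lambda m: chr(0x0660 + int(m[0])),
                  text)
```

ASCII digits pass through unchanged, while a digit such as DEVANAGARI DIGIT FIVE (U+096B) becomes ARABIC-INDIC DIGIT FIVE (U+0665), because ``int()`` accepts any Unicode decimal digit.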
2 changes: 1 addition & 1 deletion Lib/doctest.py
@@ -1768,7 +1768,7 @@ def check_output(self, want, got, optionflags):
'', want)
# If a line in got contains only spaces, then remove the
# spaces.
-got = re.sub(r'(?m)^[^\S\n]+$', '', got)
+got = re.sub(r'(?m)^[\s--\n]+$', '', got)
if got == want:
return True
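
The pre-change pattern relies on a double negative: ``[^\S\n]`` reads as "not non-whitespace and not newline", that is, whitespace other than a newline. A short sketch of that behaviour on a current interpreter follows; the helper name ``blank_out_space_only_lines`` is hypothetical and not part of ``doctest``.

```python
import re


def blank_out_space_only_lines(got):
    """Empty every line that consists only of non-newline whitespace."""
    # (?m) makes ^ and $ match at line boundaries; [^\S\n]+ matches one or
    # more whitespace characters that are not a newline.
    return re.sub(r'(?m)^[^\S\n]+$', '', got)
```

Lines that contain any non-whitespace character are left untouched, including their leading and trailing spaces.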

2 changes: 1 addition & 1 deletion Lib/pkgutil.py
@@ -443,7 +443,7 @@ def resolve_name(name, *, strict=False):
within the imported package to get to the desired object.
"""
global _LENIENT_PATTERN, _STRICT_PATTERN
-dotted_words = r'(?!\d)(\w+)(\.(?!\d)(\w+))*'
+dotted_words = r'([\w--\d]\w*)(\.([\w--\d]\w*))*'
if strict:
if _STRICT_PATTERN is None:
_STRICT_PATTERN = re.compile(
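
The pre-change ``dotted_words`` pattern uses ``(?!\d)`` so that no dotted component starts with a digit, which is the role ``[\w--\d]`` takes in the replacement. A rough sketch of the old form in isolation, runnable on a current interpreter; ``DOTTED_WORDS`` and ``is_dotted_name`` are hypothetical names, and ``fullmatch`` is used here only for illustration rather than to mirror how ``resolve_name`` anchors its patterns.

```python
import re

# Pre-change pattern copied from the hunk: each component is \w+ that must
# not begin with a digit, and components are separated by literal dots.
DOTTED_WORDS = re.compile(r'(?!\d)(\w+)(\.(?!\d)(\w+))*')


def is_dotted_name(name):
    """Return True if the whole string is a dotted sequence of words."""
    return DOTTED_WORDS.fullmatch(name) is not None
```

With this helper, ``is_dotted_name("os.path")`` is true, while ``"1abc"`` and ``"a.1b"`` are rejected because a component begins with a digit.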