Regular Expressions with full Unicode support

  1. Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
  2. Regular Expressions with full Unicode support: The ins and outs of the new regular expression functions and the ICU library. Martin Hansson, Software Development, MySQL Optimizer Team
  3. Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  4. What Happened? Old regexp library (Henry Spencer): • Does not support Unicode • Limited features • No resource control • Only Boolean search. https://mysqlserverteam.com/new-regular-expression-functions-in-mysql-8-0/
  5. Not some niche feature. Feature requests for extracting a substring:
     Bug#79428 No way to extract a substring matching a regex
     Bug#29781 Adding in Pattern Replace (RegExp) for MySQL Engine
     Bug#16357 add in functions to do regular expression replacements in a select query
     Bug#9105 Regular expression support for Search & Replace
     51 “affects me” total. CTE had 59 “affects me”.
  6. New Regular Expression Functions: REGEXP_INSTR, REGEXP_LIKE, REGEXP_REPLACE, REGEXP_SUBSTR
  7. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  8. Two Security Concerns: Memory, Runtime
  9. Security: Cap on runtime
     mysql> SELECT regexp_instr( 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC', '(A+)+B' );
     ERROR 3699 (HY000): Timeout exceeded in regular expression match.
  10. Security: Cap on memory
     mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
     ERROR 3699 (HY000): Timeout exceeded in regular expression match.
     mysql> SET GLOBAL regexp_stack_limit = 239;
     mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' );
     ERROR 3698 (HY000): Overflow in the regular expression backtrack stack.
  11. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  12. ICU library
  13. Building ICU. Need three libraries: • i18n library (regular expressions, character sets) • Common library • Data library
  14. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  15. UTF-32: "ab🍺d" → 0x00000061 0x00000062 0x0001f37a 0x00000064
  16. UTF-8: "ab🍺d" → 0x61 0x62 0xF09F8DBA 0x64
  17. UTF-16: "ab🍺d" → 0x0061 0x0062 0x3CD87ADF 0x0064
  18. Under the Hood: • Count codepoints • Convert to UTF-16 • Use the C API • Convert back if needed
  19. Program Agenda: Security, ICU library, Unicode, Working with Unicode in Regular Expressions
  20. Case folding: Simple case sensitivity
     mysql> SELECT regexp_like( 'a', '(?i)A' ); # mode modifier → 1
     mysql> SELECT regexp_like( 'a', 'A', 'i' ); # match_parameter → 1
     mysql> SELECT regexp_like( 'a' COLLATE utf8mb4_0900_as_cs, 'A' ); # collation → 0
  21. Case folding: Simple case sensitivity
     mysql> SELECT regexp_like( 'Abc', 'abC', 'c' ); → 0
     mysql> SELECT regexp_like( 'Abc', 'abC', 'i' ); → 1
  22. Case folding: Case-mapping process. A → a, B → b, C → c
  23. Case folding: Full Case Folding. ß → ss
     mysql> SELECT regexp_like( 'ß', '^ss$', 'c' ); → 0
     mysql> SELECT regexp_like( 'ß', '^ss$', 'i' ); → 1
  24. Case folding: Full Case Folding. ᾛ ⇒ ἣι
     U+1F9B GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
     U+1F23 GREEK SMALL LETTER ETA WITH DASIA AND VARIA
     U+03B9 GREEK SMALL LETTER IOTA
  25. Case folding: Has to Look Like a String in order to Match
     mysql> SELECT regexp_like( 'ß', '^ss$' ); → 1
     mysql> SELECT regexp_like( 'ß', '^s+$' ); → 0
     mysql> SELECT regexp_like( 'ß', '^s{2}$' ); → 0
  26. Case folding: Can’t Start Match Within Expanded Character
     mysql> SELECT regexp_like( 'ß', 's$' ); → 0
     mysql> SELECT regexp_like( 'ß', '^s' ); → 0
  27. Case folding: Collations
     mysql> select 'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss'\G
     *************************** 1. row ***************************
     'ß' collate utf8mb4_de_pb_0900_ai_ci = 'ss': 1
     mysql> select 'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss'\G
     *************************** 1. row ***************************
     'ß' collate utf8mb4_de_pb_0900_as_cs = 'ss': 0
  28. Case folding: Language Dependent Case Folding
     mysql> SELECT regexp_like( 'I', 'i' ); → 1
     mysql> SELECT regexp_like( 'İ', 'i' ); → 0
     mysql> SELECT regexp_like( 'I', 'ı' ); → 0
  29. Beware of Conversion!
     mysql> set names latin1;
     mysql> create table t1 ( a char ( 10 ) );
     mysql> insert into t1 values ( 'å' );
     mysql> select a from t1\G
     *************************** 1. row ***************************
     a: å
     mysql> select regexp_like( a, 'å' ) from t1\G
     *************************** 1. row ***************************
     regexp_like( a, 'å' ): 1
  30. Beware of Conversion! Use Hex Codes!
     mysql> select hex( a ) from t1;
     +----------+
     | hex( a ) |
     +----------+
     | C383C2A5 |
     +----------+
  31. Beware of Conversion! Use Hex Codes!
     mysql> select hex( a ) from t1;
     +----------+
     | hex( a ) |
     +----------+
     | C383C2A5 |
     +----------+
     å is encoded as: Latin-1: 0xe5, UTF-8: 0xc3a5
  32. Conversion flow: Terminal (UTF-8) sends å as c3a5 → Server converts Latin-1 → UTF-8 → Table (UTF-8) stores C383C2A5 = Ã¥ → Server converts UTF-8 → Latin-1 → Terminal (UTF-8)
  33. Power Tip: Use Hex Codes and Character Set Introducers!
     mysql> set global character_set_client = utf8mb4;
     mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
     +-----------------+--------------+
     | _utf8mb4 0xc3a5 | _latin1 0xe5 |
     +-----------------+--------------+
     | å               | å            |
     +-----------------+--------------+
     mysql> set global character_set_client = latin1;
     mysql> select _utf8mb4 0xc3a5, _latin1 0xe5;
     +-----------------+--------------+
     | _utf8mb4 0xc3a5 | _latin1 0xe5 |
     +-----------------+--------------+
     | å               | å            |
     +-----------------+--------------+
  34. Questions?

Editor's Notes

  • I am
    I have worked with MySQL since time immemorial, back in the MySQL AB days.
    I work from Uppsala, Sweden, the former head office of MySQL. Swedish is my native tongue. That makes a difference, as you will see.

  • So what’s all this about? We switched our regex library in 8.0.4. At the time I blogged about it here.

    The old one was written by Henry Spencer in 1986. Called regexp, it is a very good regex library. It has been used widely in the Unix realm and is part of the POSIX standard. It is also called “the book regexp library” because he updated it for the book Software Solutions in C in 1994.

    It made its way into Tcl, Postgres and even early Perl. Apparently Postgres still uses it.

    Really good. Great performance. But ASCII only. It works byte by byte and lacks many features.

    Not safe: it is easy to put it into an infinite loop.

    You can only do a boolean search, not extract what matched, because it doesn’t have a pattern buffer out of the box. Hence it doesn’t support search-and-replace.


  • And this was quite a popular request. There were four feature-request bugs asking for a way to get just the matched substring.

    We had 51 “Affects me” in total. CTE had 59, but that’s a really popular feature.
  • Now we have four functions
    Instr → position, before or after
    Like → boolean
    Replace → replaces a match, capture
    Substr → the matched substring
  • So here’s the agenda. On top, we have security, which is why we chose ICU. Perhaps not the obvious choice given the candidates. It also has close ties to Unicode.

    What I won’t cover here is all the features of regular expressions. These are documented in our manual, and you can always head to the ICU documentation.

    My ambition is to teach you how to work efficiently and securely with Unicode and to give some insight into where common wisdom breaks down.

    I presented here 3 years ago and had a really good time, so I wanted to come again. I asked my boss what I was going to talk about, since I hadn’t really added anything new since last time. All I could think of was the new regular expressions. “Tell ’em about that,” he said. “They’ll love that.” So I submitted this talk as a 20-minute presentation. Not only did it get accepted, it got upgraded to 30 minutes. I couldn’t think of much to say, so I asked around: “What do YOU want to know about regular expressions with Unicode?” Nobody had a clue. So that’s why I just picked some common pitfalls that I consider tricky.

  • The way a malicious user can exploit regex matching is by exhausting the memory or by creating an infinite loop that consumes all the CPU time.
  • Out of the box there’s always a cap on runtime.

    Runtime is specified in “steps of the match engine”. A bit vague. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds.

    Match the first A, capture, then repeat that match. Backtrack, match the 2nd, repeat that and so on. Eventually fail because of the C.

    The limit (regexp_time_limit) is set conservatively to 32 (secure by default).
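
The effect of a step cap can be sketched outside MySQL. The Python below is only an illustration under stated assumptions: it is not ICU's algorithm, the names `match_nested_plus`, `StepLimitExceeded` and `try_match` are hypothetical, the matcher handles only the single pattern (A+)+B from the slide, and the budget of 100,000 steps is arbitrary and unrelated to MySQL's units.

```python
class StepLimitExceeded(Exception):
    """Raised when the matcher uses up its step budget (hypothetical name)."""


def match_nested_plus(subject, step_limit):
    """Naive backtracking match of (A+)+B at the start of subject.

    Illustration only: this is not how ICU is implemented.
    """
    steps = 0

    def match_groups(pos):
        nonlocal steps
        # Find how far the run of 'A' extends from pos.
        end = pos
        while end < len(subject) and subject[end] == "A":
            end += 1
        # Try every possible length for this A+ group, longest first.
        for stop in range(end, pos, -1):
            steps += 1
            if steps > step_limit:
                raise StepLimitExceeded(f"gave up after {step_limit} steps")
            # After the group comes either the final B or one more A+ group.
            if stop < len(subject) and subject[stop] == "B":
                return True
            if match_groups(stop):
                return True
        return False

    return match_groups(0)


def try_match(subject, step_limit=100_000):
    """Return True/False for a finished match, or 'timeout' if capped."""
    try:
        return match_nested_plus(subject, step_limit)
    except StepLimitExceeded:
        return "timeout"
```

In this sketch the work for a failing subject roughly doubles with every additional A, so a short subject such as "AAC" fails quickly while a long run of A's followed by C hits the cap instead of running indefinitely.
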
  • Here I’m trying to run out of memory. I really have to provoke it here, because I reach the time limit first.

    Match the empty string 120 times, repeat that 11 times, repeat that 11 times, etc.

    This exercises the backtracking stack used by the engine.

    The limit (regexp_stack_limit) is in bytes. Here I’m choking it to 239 bytes.

    Default size is 8 MB (8,000,000 bytes). I never managed to DoS the server.
  • So… about the ICU library
  • What is the ICU library? A set of i18n libraries. What they provide is globalization support and Unicode for software applications. They have an open source license. From what I gather it is compatible with GNU, but IANAL.

    Used by Java, Apple, Amazon, IBM…
    The Unicode Consortium is mostly known for emoji nowadays. New releases of Unicode typically contain new emojis, and so you have to be able to search for them. Haha-papa a.k.a. the Sushi-beer bug. And so regexps have to support them.

    💬 5 billion emojis are sent daily on Facebook Messenger
    📸 By mid-2015, half of all comments on Instagram included an emoji
    🍑 Only 7% of people use the peach emoji as a fruit
    The rest mostly use it as a butt or for other non-fruit uses
    According to emojipedia

    In a sense ICU is Unicode. Support for all of Unicode

  • We ship ICU with MySQL, and you can optionally build with the bundled version. We ship 59.1. I notice Ubuntu 18.04 ships 60.

    There’s the internationalization (i18n) library, which contains regexp and character sets. That is all we use right now, and all we bundle. The common library contains things like the BreakIterator, which helps when working with grapheme clusters. I won’t go into grapheme clusters in this presentation. We don’t handle those yet.

    The data library is not used currently, so we don’t ship it. It is fairly big and not needed for regexp.
  • Tell you a bit about Unicode

    Specifies three encodings.
  • + constant size
    + maps 1-to-1 to unicode codepoints
    - space consuming
  • + Optimized for Western ASCII
    + Small (for Western)
    + Self-synchronizing (what isn’t???)
    - Variable size

    De facto standard for the web (92.9%).
  • Generally regarded as the worst of both worlds
    - Bigger than UTF-8
    - Not fixed like UTF-32
    + More is constant (what? Which planes?)
    + Also self-synchronizing

    Surrogate pairs

    Broken in Java. How?

    Alas, used by ICU
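
The three encodings can be compared on one short string. This is a small Python sketch, not anything MySQL runs; the sample string "ab🍺d" is an assumption reconstructed from the code unit values on the slides (U+1F37A is the beer mug emoji), and `code_units` is a hypothetical helper name.

```python
def code_units(text, encoding, unit_bytes):
    """Split the encoded bytes of text into hex strings, one per code unit."""
    data = text.encode(encoding)
    return [data[i:i + unit_bytes].hex() for i in range(0, len(data), unit_bytes)]


sample = "ab\U0001F37Ad"  # "ab🍺d"

# UTF-32: one fixed-size 4-byte unit per code point.
utf32 = code_units(sample, "utf-32-be", 4)

# UTF-16: 2-byte units; the emoji needs a surrogate pair (two units).
utf16 = code_units(sample, "utf-16-be", 2)

# UTF-8: 1 to 4 bytes per code point, shown here grouped per character.
utf8 = [ch.encode("utf-8").hex() for ch in sample]

# The same surrogate pair in little-endian byte order.
emoji_utf16_le = sample[2].encode("utf-16-le").hex()
```

Only UTF-32 keeps a one-to-one mapping between code units and code points. In UTF-16 the emoji occupies two units, which is why character positions and code unit positions can differ.
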
  • So the way we use ICU is this: unless you start on the first character, we count the code points before the start position. Then we convert the rest to UTF-16 and search with ICU. We use ICU’s C API. There is also a C++ API.
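
The bookkeeping described above can be sketched in a few lines of Python. This is a rough illustration and not the server's code: `utf16_length` and `find_from` are hypothetical names, and a plain substring search stands in for the regular expression engine.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


def find_from(subject, needle, start):
    """Search for needle from code point position start (0-based).

    Returns the 0-based code point position of the match, or -1.
    """
    # Skip `start` code points, then convert only the rest to UTF-16.
    rest = subject[start:].encode("utf-16-le")
    target = needle.encode("utf-16-le")
    # Search the UTF-16 data; accept only hits aligned on a 2-byte unit.
    byte_pos = rest.find(target)
    while byte_pos != -1 and byte_pos % 2 != 0:
        byte_pos = rest.find(target, byte_pos + 1)
    if byte_pos == -1:
        return -1
    # Convert the code unit offset back to a code point offset.
    prefix = rest[:byte_pos].decode("utf-16-le")
    return start + len(prefix)
```

The conversion back matters because a character outside the Basic Multilingual Plane takes two UTF-16 code units, so offsets reported in code units would otherwise be too large for any match that comes after an emoji.
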
  • So, I have two examples how to work with Unicode.


  • You can specify case sensitivity in three ways. Mode modifiers inside the regexp have the highest priority.

    If there are no mode modifiers, match_parameter is used. It is a string of modifiers: ‘c’ means case-sensitive, ‘i’ means case-insensitive.

    If there are none of those, we look at the collation. There are rules for computing which collation should be used in any comparison. Those rules apply here.

  • Case insensitivity seems simple at first. Text is normalized by transforming to the same case. Then compare.

    On the next slide we see how such a case mapping could look.
  • Totally obvious, right? One character maps to exactly one character. This is called simple case insensitive matching.

    Well there are some trickier cases.
  • The German Eszett (ß) is generally understood to be equivalent to two s’es. So in full case-insensitive matching they should be equal. Since there is no Eszett in any other language, this folding is part of the default.

    I could go on all day about case mapping, it’s a 61-page document in the Unicode standard. But these are the essentials.
  • This example is a little more complicated for me.
    Here one letter obviously maps to two letters. Actual letters, not just code points.

    If you paste them and press backspace, the little iota goes away.

    In this case they’re different. It works the same way.


    It’s all greek to me.
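
Both expansions can be checked with Python's `str.casefold()`, which applies Unicode full case folding. This is only a sketch of the mapping itself, outside MySQL and ICU; `caseless_equal` is a hypothetical helper name.

```python
import unicodedata


def caseless_equal(a, b):
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()


eszett = "\u00df"  # ß
greek = "\u1f9b"   # ᾛ

# Full case folding may expand one character into two.
folded_eszett = eszett.casefold()
folded_greek = greek.casefold()

# Look up the names of the characters the Greek letter expands to.
folded_greek_names = [unicodedata.name(ch) for ch in folded_greek]
```

A simple one-to-one case mapping cannot express either of these foldings, since in both cases the folded form is longer than the original.
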

  • Full case folding is used when the pattern contains anything that looks like a character string, even just one character.

  • A match can never start within an expanded character.

    The anchors here enforce a match that would 1) start in the middle or 2) end in the middle of the expanded character.

  • This is consistent with how collations work with the equals predicate.

    Hard to read collation name
    Charset, language code, pb – don’t know, accent sensitive, case sensitive.
  • Case folding can also be language dependent. In the default case folding, capital I folds to small i with a dot. However, in the Turkish case folding, a dotted capital İ is folded to dotted lowercase i, and dotless capital I folds to dotless lowercase ı.

    In a Turkish locale, the default is actually wrong.
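
The difference between the default and the Turkish folding can be sketched in Python. `str.casefold()` applies the default, language-independent folding. `turkic_fold` is a hypothetical helper that first applies the two Turkic-specific mappings from Unicode's case folding data (I → ı and İ → i). It illustrates the idea only and is not something MySQL or ICU exposes under that name.

```python
def default_fold_equal(a, b):
    """Caseless comparison with the default (language-independent) folding."""
    return a.casefold() == b.casefold()


def turkic_fold(text):
    """Hypothetical helper: Turkic mappings first, then default folding."""
    # Dotless capital I folds to dotless small ı; dotted capital İ to i.
    return text.replace("I", "\u0131").replace("\u0130", "i").casefold()


def turkic_fold_equal(a, b):
    """Caseless comparison using the Turkic-aware folding above."""
    return turkic_fold(a) == turkic_fold(b)
```

Under the default folding, I matches i but İ does not, which is the behaviour shown on the slide. Under the Turkic folding the pairs are I/ı and İ/i instead.
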
  • Another problem with full Unicode and regexp: you need to be careful when you send non-ASCII data from a client. Here is a cautionary tale. Here I changed the variables character_set_connection, character_set_client and character_set_results. That is what SET NAMES does.

    So, I create a table.

    I populate it. Swedish letter å. Pronounced

    Read back.

    Check with a regular expression match.


    So, everything is fine, right? Let’s do a “trust but verify” here. I want to see what’s actually in the table. The problem is that it will always be converted to my character set. I want to apply a function to it on the server side. Problem is, all functions will also convert their arguments. What to do? All functions save one: the hex() function. It will tell the truth.
  • So here we have… what? Is this really an a with a ring? Let’s check.
  • This is not å in any encoding. What is going on?
  • My terminal is UTF-8. So, when I type å on my Swedish keyboard, it sends c3a5 to the server. Now, when I set character_set_client, what I really said was “interpret this as latin1”. Fine, c3a5, that’s Ã¥ (A-tilde, yen sign). It stores that. But the table stores utf8, so let’s convert.
    And that becomes C383C2A5.


    Now, when I do select, it reads character_set_results, oh yeah, you speak latin-1. Let me translate for ya.

    And so we’re back full circle.

    Especially tricky with latin-1 since anything is valid latin-1. No check fails.
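
The round trip can be replayed step by step in Python. This is a sketch of the byte conversions only, not of the server itself, and it assumes Python's `latin-1` codec is a fair stand-in for MySQL's latin1 for these particular bytes.

```python
typed = "\u00e5"  # å, as typed on the keyboard

# The UTF-8 terminal sends the bytes c3 a5.
sent = typed.encode("utf-8")

# The server was told the client speaks latin1, so it reads two characters.
misread = sent.decode("latin-1")

# The table stores UTF-8, so those two characters are converted.
stored = misread.encode("utf-8")

# On SELECT the server converts back to latin1 for the client...
returned = stored.decode("utf-8").encode("latin-1")

# ...and the UTF-8 terminal displays those bytes as a single å again.
shown = returned.decode("utf-8")

stored_hex = stored.hex().upper()
```

The value looks right on the way out only because the same wrong conversion is applied in reverse. The stored bytes are what hex() reveals.
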
  • So here’s a power tip for troubleshooting your multilingual regexps. If you use hex codes and character set introducers, it’s totally unambiguous. As you see here.
