Perl And Unicode

Mike Whitaker, BBC/EnlightenedPerl.org

The problem

• Keeping track of input and output encodings
• Not losing encoding data in the middle
• Understanding the difference between characters and bytes

Characters vs bytes

characters (2)       $         €
                     U+0024    U+20AC
bytes (UTF-8) (4)    0x24      0xE2 0x82 0xAC
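
The same count can be reproduced in a few lines of Perl. This is a minimal sketch rather than anything from the slides: the variable names are made up for illustration, and the characters are written as escapes so the encoding of the source file does not matter.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters: "$" (U+0024) and the euro sign (U+20AC).
my $chars = "\x{24}\x{20AC}";

# Encoding turns characters into bytes; the target encoding here is UTF-8.
my $bytes = encode('UTF-8', $chars);

my $char_count = length($chars);    # length() on a character string counts characters
my $byte_count = length($bytes);    # length() on an encoded string counts bytes

# Hex dump of the encoded bytes.
my $hex = join ' ', map { sprintf '0x%02X', ord } split //, $bytes;

print "$char_count characters, $byte_count bytes: $hex\n";
# prints "2 characters, 4 bytes: 0x24 0xE2 0x82 0xAC"
```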

Handling Encodings

input:  àbçdé  bytes in some input encoding or other
  decode:  use Encode; $chars = decode($enc, $bytes);
        àbçdé  character-based internal representation
  encode:  use Encode; $bytes = encode($enc, $chars);
output: àbçdé  bytes in desired encoding
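
A small round trip through that picture might look like the following. It is a sketch under stated assumptions: the input is assumed to arrive as ISO-8859-1 and the desired output is assumed to be UTF-8; neither choice comes from the slides.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Input: "àbçdé" as ISO-8859-1 bytes (one byte per character).
my $input_bytes = "\xE0\x62\xE7\x64\xE9";

# decode: bytes in the input encoding -> character string
my $chars = decode('ISO-8859-1', $input_bytes);

# encode: character string -> bytes in the desired output encoding
my $output_bytes = encode('UTF-8', $chars);

printf "%d input bytes, %d characters, %d output bytes\n",
    length($input_bytes), length($chars), length($output_bytes);
# prints "5 input bytes, 5 characters, 8 output bytes"
```

The character count stays at five throughout; only the byte count changes with the encoding.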

The Holy Grail

• Can represent all encodings
• Has multibyte character support
  • for example, length() should count characters, not bytes

It doesn't work like that

use Encode;
Only works in Perl 5.8 and above

Why the $£%^&*() are you using 5.6 ANYWAY?

There are solutions for 5.6 and even earlier. But they're HORRIBLE.

character-based internal representation

Perl has one!

Magic internal representation.
All string functions know about it.
It's encoding-agnostic.

In fact....

IT'S almost UTF-8!

Handling Encodings

input:  àbçdé  bytes in some input encoding or other
  decode:  use Encode; $chars = decode($enc, $bytes);
        Perl's magic internal representation
  encode:  use Encode; $bytes = encode($enc, $chars);
output: àbçdé  bytes in desired encoding

I18N? What the £$%^&*('s that?

input:  àbçdé  bytes in machine's 8bit encoding
        àbçdé  bytes in machine's 8bit encoding
output: àbçdé  bytes in machine's 8bit encoding
People are still writing Perl like it was Perl 4

...and we have to support them.

Even though our string functions expect chars.

???? representation

if
    all characters are representable in local machine's 8 bit charset, use that;
else
    use UTF-8

àbçdé  UTF-8 characters
  use Encode; $bytes = encode($enc, $chars);
àbçdé  bytes in desired encoding (output)

àbçdé  machine bytes
  use Encode; $bytes = encode($enc, $chars);
àbçdé  bytes in desired encoding (output)

UTF-8 characters + àbçdé machine bytes = ?????

promote: àbçdé machine bytes -> àbçdé UTF-8 characters

UTF-8 characters + àbçdé machine bytes = àbçdé UTF-8 bytes
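
The promotion can be observed from Perl itself. The sketch below is illustrative only: the variable names are invented, and utf8::is_utf8() is used purely to peek at the internal storage, which is not something production code should depend on.

```perl
use strict;
use warnings;

my $wide   = "\x{20AC}";                  # euro sign: does not fit in 8 bits, stored as UTF-8
my $narrow = "\x{E0}b\x{E7}d\x{E9}";      # "àbçdé": fits the 8-bit representation

# Concatenation promotes the 8-bit operand so both halves share one representation.
my $joined = $wide . $narrow;

printf "wide is UTF-8 internally:   %s\n", utf8::is_utf8($wide)   ? 'yes' : 'no';
printf "narrow is UTF-8 internally: %s\n", utf8::is_utf8($narrow) ? 'yes' : 'no';
printf "joined is UTF-8 internally: %s\n", utf8::is_utf8($joined) ? 'yes' : 'no';
printf "joined has %d characters\n", length($joined);
```

The characters themselves are unchanged by the promotion: the last five characters of $joined still compare equal to $narrow.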

output                    header                          displayed
àbçdé machine bytes       Content-Encoding: UTF-8         ?b?d?
àbçdé machine bytes       Content-Encoding: ISO-8859-1    àbçdé
àbçdé UTF-8 characters    Content-Encoding: UTF-8         àbçdé
àbçdé UTF-8 characters    Content-Encoding: ISO-8859-1    Ã bÃ§dÃ©
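
Both failure modes can be simulated with Encode by decoding bytes with the wrong encoding, which is roughly what a reader does when the declared encoding does not match the bytes sent. This is a sketch with invented variable names, not a model of any particular browser.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = "\x{E0}b\x{E7}d\x{E9}";                 # "àbçdé"

my $utf8_bytes   = encode('UTF-8', $chars);
my $latin1_bytes = encode('ISO-8859-1', $chars);

# UTF-8 bytes read as ISO-8859-1: every two-byte sequence turns into two characters.
my $mojibake = decode('ISO-8859-1', $utf8_bytes);

# ISO-8859-1 bytes read as UTF-8: the accented bytes are malformed UTF-8, and with
# no CHECK argument Encode substitutes U+FFFD (the replacement character) for them.
my $broken = decode('UTF-8', $latin1_bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "UTF-8 read as ISO-8859-1: $mojibake\n";
print "ISO-8859-1 read as UTF-8: $broken\n";
```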

You can't tell what you've actually got

utf8::is_utf8()
does not mean what you think it means

encoded bytes            utf8::is_utf8() = false   (EVEN IF they're UTF-8)
decoded UTF-8 chars      utf8::is_utf8() = true
decoded machine bytes    utf8::is_utf8() = false
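
The three cases can be reproduced as follows. A hedged sketch: utf8::downgrade() is used here only as a convenient way to obtain a decoded string held in the 8-bit form, and the labels are invented for the printout.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars   = decode('UTF-8', "\xC3\xA0\x62\xC3\xA7\x64\xC3\xA9");   # decoded "àbçdé"
my $encoded = encode('UTF-8', $chars);                               # encoded UTF-8 bytes

# The same five characters, forced into the 8-bit internal form.
my $machine = $chars;
utf8::downgrade($machine);

my @cases = (
    [ 'encoded bytes (valid UTF-8)', $encoded ],
    [ 'decoded, UTF-8 internally',   $chars   ],
    [ 'decoded, 8-bit internally',   $machine ],
);
for my $case (@cases) {
    printf "%-28s is_utf8 = %s\n", $case->[0],
        utf8::is_utf8($case->[1]) ? 'true' : 'false';
}

# The flag differs, yet the two decoded strings hold the same characters.
print "decoded strings are equal\n" if $machine eq $chars;
```

The flag describes how Perl happens to store the string, not whether the string has been decoded.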

The science bit
• Encode.pm
  use Encode; $bytes = encode($enc, $chars);
• 3 argument form of open() - PerlIO layers
  open(FILEHANDLE, ">:encoding(UTF-8)", $file);
• binmode(FILEHANDLE,
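
A sketch of the PerlIO route, assuming a scratch file created with File::Temp (the file and the variable names are hypothetical). It writes through an :encoding(UTF-8) layer, then reads the file back twice: once raw, and once through a layer applied with binmode().

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $chars = "\x{24}\x{20AC}";                   # "$" and the euro sign: two characters

my ($tmp_fh, $file) = tempfile(UNLINK => 1);    # hypothetical scratch file
close $tmp_fh;

# Writing: the layer encodes characters to UTF-8 bytes on the way out.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} $chars;
close $out or die "Cannot close $file: $!";

# Reading raw: no decoding, so we get the bytes exactly as they sit on disk.
open(my $raw, '<:raw', $file) or die "Cannot read $file: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Reading through a layer set with binmode(): bytes are decoded back to characters.
open(my $in, '<', $file) or die "Cannot read $file: $!";
binmode($in, ':encoding(UTF-8)');
my $round_trip = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters read back\n",
    length($bytes), length($round_trip);
# prints "4 bytes on disk, 2 characters read back"
```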

'utf8' vs 'UTF-8'
• Encode.pm
  • utf8 = marks it as UTF-8 and hopes...
  • UTF-8 = is actually valid UTF-8
• PerlIO layers:
  • :utf8
  • :encoding(UTF-8)
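
One way to see the Encode.pm difference is to hand both encodings a code point above U+10FFFF, which is outside Unicode. This is a sketch, with the test value and variable names chosen for illustration: the lax 'utf8' encoding is expected to let it through, while the strict 'UTF-8' encoding should refuse when asked to croak on errors.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding FB_CROAK LEAVE_SRC);

# Encode resolves the two spellings to different encodings.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 -> $strict_name\n";    # prints "UTF-8 -> utf-8-strict"
print "utf8  -> $lax_name\n";       # prints "utf8  -> utf8"

# A code point beyond U+10FFFF: representable by Perl, but not valid Unicode.
my $not_unicode = chr(0x110000);

# FB_CROAK makes encode() die on invalid data; LEAVE_SRC keeps the input untouched.
my $lax_ok    = eval { encode('utf8',  $not_unicode, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
my $strict_ok = eval { encode('UTF-8', $not_unicode, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;

print "lax utf8 accepted it:     $lax_ok\n";
print "strict UTF-8 accepted it: $strict_ok\n";
```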

use utf8;

• Does NOT do what you might think it does
• All it says is 'my source code is UTF-8'.
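
Because the pragma is lexically scoped, its effect can be shown within one file. The sketch below assumes the script itself is saved as UTF-8; the variable names are illustrative. Note that nothing here touches input or output: only the reading of the string literal changes.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;                         # "the source text in this block is UTF-8"
    $with_pragma = length("é");       # the literal is read as one character
}
{
    no utf8;                          # the source text is taken as raw bytes
    $without_pragma = length("é");    # the same literal is read as its two UTF-8 bytes
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
# prints "with use utf8: 1, without: 2" when this file is saved as UTF-8
```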

Modules
• It depends on the module:
• CGI - $CGI::PARAM_UTF8=1;
• LWP::UserAgent - ->decoded_content() method honours Content-Encoding:
• DBI - mysql_enable_utf8 in DBI::connect()
• XML::LibXML - looks at encoding,

In summary
• decode bytes as soon as you get them:
  • decode(), binmode(STDIN), 3 arg open()
• encode characters just before you output:
  • encode(), binmode(STDOUT), 3 arg open()
• keep track of whether your strings are
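
Put together, the decode-early, encode-late pattern might be sketched like this. The shout() helper and its parameters are hypothetical, and the encodings passed in are arbitrary choices for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: bytes in, bytes out, characters everywhere in between.
sub shout {
    my ($input_bytes, $in_enc, $out_enc) = @_;

    my $chars = decode($in_enc, $input_bytes);   # decode as soon as the bytes arrive
    $chars = uc $chars;                          # string functions work on characters
    return encode($out_enc, $chars);             # encode just before output
}

# "àbçdé" arrives as ISO-8859-1 bytes and leaves as UTF-8 bytes for "ÀBÇDÉ".
my $out = shout("\xE0\x62\xE7\x64\xE9", 'ISO-8859-1', 'UTF-8');

print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
# prints "C3 80 42 C3 87 44 C3 89"
```

Keeping the encoding names at the edges of the program means the code in the middle never needs to know which encodings were involved.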

NEVER EVER EVER rely on the encoding of Perl's internal representation

...there is NO SUCH THING as "plain text"

The Holy Fail (thanks Joel!)

input:  àbçdé  bytes in some input encoding or other
  decode:  use Encode; $chars = decode($enc, $bytes);
        representation
  encode:  use Encode; $bytes = encode($enc, $chars);
output: àbçdé  bytes in desired encoding
