This document provides an overview of Unicode and character encodings to avoid corrupting international text. It discusses:
- The difference between bytes and characters, noting that characters are often multiple bytes wide and an encoding is needed to interpret byte sequences as character sequences.
- Common mistakes like assuming a default encoding, mixing bytes and characters, and not specifying an encoding which can lead to text being corrupted when read by systems using different encodings.
- Encoding issues that can occur in different languages and file types like text files, HTML, XML, if an encoding is not properly declared or honored.
The key lessons are: you must know the character encoding to interpret byte sequences correctly, and bytes and characters should not be mixed.
3. Out of Scope
• Internationalization (i18n)
– Extending a program to emit messages in
multiple languages
• Localization (l10n)
– Extending a program to emit messages in a
specific language, such as German
• Manipulating Unicode characters within strings
4. Problems
• Customer A writes some text to a file or app.
Customer B reads it back, but it is different.
In particular it has a bunch of ??? or ���.
– ß ➔ �
• UnicodeEncodeError: 'ascii' codec can't
encode character '\ua000' in position
0: ordinal not in range(128)
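Both symptoms on this slide can be reproduced in a few lines. The following is a minimal Python 3 sketch, not taken from the original deck; the sample text and variable names are illustrative assumptions.

```python
# Illustrative sample text (an assumption, not from a real customer file).
text = "Mein Fuß"

# Symptom 1: text written as UTF-8 but read back with the wrong encoding.
raw = text.encode("utf-8")
# Each byte that ASCII cannot decode is replaced by U+FFFD, the
# replacement character that renders as a question-mark glyph.
garbled = raw.decode("ascii", errors="replace")

# Symptom 2: encoding a character the target encoding cannot represent.
try:
    "\ua000".encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(ascii(garbled))
print(error_message)
```

The first case loses data silently, while the second fails loudly; both come from a mismatch between the encoding used to write and the one used to read.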
6. Bytes vs. Characters
Byte stream: 77 101 105 110 32 70 117 195 159
↓ Decode utf-8 (the character encoding: often forgotten!)
Character stream: M e i n (space) F u ß
☝ ß is multiple bytes wide (195 159)!
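The diagram on this slide can be checked directly. A minimal Python 3 sketch, assuming the byte values shown in the diagram are the UTF-8 encoding of "Mein Fuß":

```python
# The byte stream from the slide, as raw integer values.
byte_stream = bytes([77, 101, 105, 110, 32, 70, 117, 195, 159])

# Decoding turns the byte stream into a character stream.
characters = byte_stream.decode("utf-8")

# The counts differ because ß occupies two bytes in UTF-8.
byte_count = len(byte_stream)
char_count = len(characters)
eszett_bytes = list("ß".encode("utf-8"))

print(byte_count, char_count, eszett_bytes)
```

The variable names are hypothetical; the point is only that the number of bytes and the number of characters are different quantities.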
7. What is the character encoding?
• There is usually some signal (sometimes out-of-
band) that specifies the encoding that should be
used to interpret a byte stream as characters.
– HTTP: Content-Type: text/html; charset=UTF-8
– HTML: <meta charset="UTF-8"/>
– XML: <?xml version="1.0" encoding="UTF-8"?>
– Python: # -*- coding: utf-8 -*-
– POSIX: LANG=en_US.UTF-8
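As one illustration of reading such a signal, the sketch below extracts the charset from an HTTP Content-Type header value using Python's standard `email.message.Message`. The helper name `charset_from_content_type` and the UTF-8 fallback are assumptions made for this example, not part of the original deck; a real client should follow the fallback rules of its protocol.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lower-cased.

    The default is an assumption for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")

# The declared charset is what should be used to decode the body bytes.
body = "Mein Fuß tut weh!".encode(declared)
decoded_body = body.decode(declared)
print(declared, undeclared)
```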
8. What is the character encoding?
• Unfortunately some types of files don't contain any
information about their encoding.
– Text files (*.txt)
• Usually the OS default character encoding is assumed,
which depends on its locale. Yikes.
– JSON files (*.json)
• Usually UTF-8 is assumed, but other Unicode encodings are
permitted by RFC 4627. (Its successor, RFC 8259, requires UTF-8
for JSON exchanged between systems.)
– Java source files (*.java)
• Encoding is derived from the -encoding compiler flag.
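To make the first two cases concrete, here is a small Python 3 sketch (3.6 or later is assumed for `json.loads` on bytes). It shows the locale-dependent default that plain text files fall back to, and that the standard `json` module accepts UTF-8 and UTF-16 byte input. The payload is an illustrative assumption.

```python
import json
import locale

# The encoding open() uses when none is given; it varies by machine.
os_default = locale.getpreferredencoding(False)

# The same JSON document in two Unicode encodings.
document = '{"name": "Fuß"}'
payload_utf8 = document.encode("utf-8")
payload_utf16 = document.encode("utf-16")

# json.loads detects the Unicode encoding when it is given bytes.
name_from_utf8 = json.loads(payload_utf8)["name"]
name_from_utf16 = json.loads(payload_utf16)["name"]

print(os_default)
```

Because `os_default` differs between machines, code that relies on it behaves differently for different customers.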
9. Big Mistake #1
You cannot interpret a
byte sequence as a
character sequence
without knowing the
character encoding.
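A short Python 3 sketch of why the encoding must be known: the same two bytes yield different characters under different encodings. The choice of Windows-1252 as the "wrong" encoding is an assumption for illustration.

```python
# The UTF-8 bytes for "Fuß": b'Fu\xc3\x9f'
raw = "Fuß".encode("utf-8")

# Interpreted with the encoding that was used to write them.
as_utf8 = raw.decode("utf-8")

# Interpreted with a different encoding: no error is raised,
# but the result is silently wrong.
as_cp1252 = raw.decode("windows-1252")

print(ascii(as_utf8), ascii(as_cp1252))
```

Nothing in the bytes themselves says which interpretation is right, which is why the encoding has to come from somewhere else.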
10. What's wrong with this code? (A1)
#!/usr/bin/python2.7
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
11. What's wrong with this code? (A1)
#!/usr/bin/python2.7
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
• No character encoding is specified!
– Python 2 does not decode the file at all. It returns raw bytes, and
whatever displays them (typically a terminal using the OS default
character encoding, which depends on its locale) decides how they
are interpreted.
– Therefore a customer running this program on a
Japanese OS will see different text than a customer on an English OS!
• Reads byte strings instead of character strings!
12. What's wrong with this code? (A1)
#!/usr/bin/python2.7
import codecs
with codecs.open("names.txt", "r",
                 "utf-8") as f:
    for name in f:
        print(u'Hello ' + name.strip())
• Fixed. Will always read character strings, and as UTF-8.
13. What's wrong with this code? (A2)
#!/usr/bin/python3.4
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
14. What's wrong with this code? (A2)
#!/usr/bin/python3.4
with open("names.txt", "r") as f:
    for name in f:
        print('Hello ' + name.strip())
• No character encoding is specified!
15. What's wrong with this code? (A2)
#!/usr/bin/python3.4
with open("names.txt", "r",
          encoding="utf-8") as f:
    for name in f:
        print('Hello ' + name.strip())
• Fixed. Will always read as UTF-8.
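The effect of the fix can be seen in a self-contained round trip. This Python 3 sketch writes a temporary names.txt and reads it back twice; the names and the temporary path are illustrative assumptions.

```python
import os
import tempfile

# Hypothetical sample names containing non-ASCII characters.
names = ["Jürgen", "Zoë", "Søren"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")

    # Write with an explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(names) + "\n")

    # Read with the same explicit encoding: the text survives.
    with open(path, "r", encoding="utf-8") as f:
        greetings = ["Hello " + line.strip() for line in f]

    # Read with a different encoding, as a machine with another
    # default might: no error, but the names are corrupted.
    with open(path, "r", encoding="windows-1252") as f:
        wrong = [line.strip() for line in f]

print(len(greetings), wrong != names)
```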
16. What's wrong with this code? (B)
<!DOCTYPE html>
<html>
<head>
<title>Krankenzimmer</title>
</head>
<body>Mein Fuß tut weh!</body>
</html>
17. What's wrong with this code? (B)
<!DOCTYPE html>
<html>
<head>
<title>Krankenzimmer</title>
</head>
<body>Mein Fuß tut weh!</body>
</html>
• No character encoding is specified!
18. What's wrong with this code? (B)
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8"/>
<title>Krankenzimmer</title>
</head>
<body>Mein Fuß tut weh!</body>
</html>
• Fixed. Declares self as UTF-8 encoded.
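A reader of the page has to find that declaration before it can decode the body. The sketch below is a simplified Python 3 illustration of that idea, not the algorithm browsers actually use: it scans the first bytes as ASCII for a meta charset attribute. The class name `MetaCharsetFinder`, the function `sniff_charset`, the 1024-byte window, and the UTF-8 fallback are all assumptions for this example.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Records the first <meta charset="..."> value it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.charset is None:
            self.charset = dict(attrs).get("charset")


def sniff_charset(raw, default="utf-8"):
    """Guess the declared charset of an HTML byte string (simplified)."""
    finder = MetaCharsetFinder()
    # The declaration itself is ASCII, so a lossy ASCII decode of the
    # first bytes is enough to locate it.
    finder.feed(raw[:1024].decode("ascii", errors="replace"))
    return (finder.charset or default).lower()


page = (
    "<!DOCTYPE html><html><head>"
    '<meta charset="UTF-8"/><title>Krankenzimmer</title>'
    "</head><body>Mein Fuß tut weh!</body></html>"
).encode("utf-8")

declared = sniff_charset(page)
html_text = page.decode(declared)
print(declared)
```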
19. What's wrong with this code? (C)
<?xml version="1.0"?>
<messages>
<message>Mein Fuß tut weh!</message>
</messages>
20. What's wrong with this code? (C)
<?xml version="1.0"?>
<messages>
<message>Mein Fuß tut weh!</message>
</messages>
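• No character encoding is specified!
– A parser will then assume UTF-8 (or UTF-16 if a byte order mark is
present), which corrupts the text if the file was saved in any
other encoding.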
21. What's wrong with this code? (C)
<?xml version="1.0" encoding="UTF-8"?>
<messages>
<message>Mein Fuß tut weh!</message>
</messages>
• Fixed. Declares self as UTF-8 encoded.
22. What's wrong with this code? (D)
// C#
// TextReader is a character stream
// OpenText always assumes UTF-8 encoding
using (TextReader r = File.OpenText("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(r);
    ...
}
23. What's wrong with this code? (D)
// C#
// TextReader is a character stream
// OpenText always assumes UTF-8 encoding
using (TextReader r = File.OpenText("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(r);
    ...
}
• The encoding declaration in the XML is ignored!
UTF-8 is always forced.
24. What's wrong with this code? (D)
// C#
// Stream is a byte stream
using (Stream s = File.OpenRead("names.xml"))
{
    XmlDocument doc = new XmlDocument();
    doc.Load(s);
    ...
}
• Fixed. XmlDocument will internally determine the
encoding based on the declaration in the byte stream.
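The same principle applies outside C#. As a hedged Python 3 analogue (not from the original deck), the standard `xml.etree.ElementTree` parser honors the encoding declaration when it is given bytes. The ISO-8859-1 document below is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# An XML document that declares, and is saved in, ISO-8859-1.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<messages><message>Mein Fuß tut weh!</message></messages>"
).encode("iso-8859-1")

# Feeding bytes lets the parser read the declaration and decode correctly.
root = ET.fromstring(xml_bytes)
message = root.find("message").text

# Forcing UTF-8 before parsing ignores the declaration and fails here,
# because the ISO-8859-1 byte for ß is not valid UTF-8 in this position.
try:
    xml_bytes.decode("utf-8")
    forced_utf8_failed = False
except UnicodeDecodeError:
    forced_utf8_failed = True

print(forced_utf8_failed)
```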
26. Unfortunately many languages blur the line
between byte strings and character strings.
– Python 2.x
• All strings are byte strings by default.
• Byte and ASCII character strings are implicitly convertible.
– C / C++
• String functions in the C standard library manipulate
byte strings by default.
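For contrast, Python 3 keeps the two types separate, which makes the difference easy to observe. A minimal sketch with illustrative variable names:

```python
word = "Fuß"                   # character string (str)
encoded = word.encode("utf-8")  # byte string (bytes)

# Length is measured in characters for str and in bytes for bytes.
char_len = len(word)
byte_len = len(encoded)

# Slicing a byte string can cut a multi-byte character in half,
# which is what byte-oriented string functions risk doing.
truncated = encoded[:3]
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False

print(char_len, byte_len, truncated_is_valid)
```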
27. What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
28. What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
• A byte string (with international chars) was printed.
Only character strings should be printed.
– On OS X, which has the UTF-8 locale by default rather than
Windows-1252, the second word will be printed as "Fu?"
instead of "Fuß".
29. What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
print(u'Mein Fuß tut weh!')
• This is the smallest possible fix.
30. What's wrong with this code? (E1)
#!/usr/bin/python2.7
# -*- coding: windows-1252 -*-
from __future__ import unicode_literals
print('Mein Fuß tut weh!')
• A better fix, since it avoids adding u'…' everywhere.
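What goes wrong in example E1 can be simulated in Python 3. This sketch models the terminal as a UTF-8 decoder that substitutes a replacement character for invalid bytes; that model is an assumption, since real terminals differ in how they render invalid input.

```python
# The byte string Python 2 holds for the literal, given the
# windows-1252 source encoding: ß is the single byte 0xDF.
source_bytes = "Mein Fuß tut weh!".encode("windows-1252")

# A UTF-8 terminal receiving those raw bytes cannot decode 0xDF here.
shown_on_utf8_terminal = source_bytes.decode("utf-8", errors="replace")

# A character string is instead encoded for the terminal at print time,
# so the terminal receives valid UTF-8.
sent_correctly = "Mein Fuß tut weh!".encode("utf-8")
shown_correctly = sent_correctly.decode("utf-8")

print(ascii(shown_on_utf8_terminal))
```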
31. What's wrong with this code? (E2)
#!/usr/bin/python3.4
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
32. What's wrong with this code? (E2)
#!/usr/bin/python3.4
# -*- coding: windows-1252 -*-
print('Mein Fuß tut weh!')
• Nothing!
– Python 3.x interprets string literals as character strings
by default.
33. What's wrong with this code? (F)
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)
34. What's wrong with this code? (F)
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)
• Mixing a byte string literal with character input.
– Python 2.x interprets string literals as bytes by default.
– The + operator implicitly decodes the byte literal as ASCII,
so the 'ä' raises a UnicodeDecodeError.
35. What's wrong with this code? (F)
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import codecs
with codecs.open('hurts.txt', 'r', 'utf-8') as f:
    status = f.read().strip()
print('Schädigung: ' + status)
• Fixed. All strings are character strings now.
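Python 3 refuses this kind of mixing outright instead of converting implicitly. A minimal sketch; the `status` value stands in for the file contents and is a hypothetical example.

```python
prefix_bytes = "Schädigung: ".encode("utf-8")  # a byte string
status = "schwer"  # hypothetical text read from hurts.txt

# Concatenating bytes and str raises TypeError in Python 3.
try:
    prefix_bytes + status
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Decode the bytes first, then work with character strings only.
message = prefix_bytes.decode("utf-8") + status

print(mixing_allowed)
```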
36. Summary: Special Considerations
• Python 2.x
– String literals are byte strings by default rather than characters.
– Implicitly converts between byte strings and ASCII character strings.
• HTML, CSS, JavaScript
– Must declare an encoding in HTML.
• XML files
– Must declare an encoding in XML. Must honor such a declaration.
– Feed bytes to XML parsers rather than characters.
• Text files
– Must always assume an encoding. Usually UTF-8.
37. Don't Forget
1. You cannot interpret a byte sequence as a
character sequence without knowing the
character encoding.
2. Bytes and characters are not the same thing.
Do not mix them.
40. What's wrong with this code? (#1)
// Java
Reader r = new FileReader("names.txt");
41. What's wrong with this code? (#1)
// Java
Reader r = new FileReader("names.txt");
• No character encoding is specified!
– Java will fall back to the OS default character encoding,
which depends on its locale.
– Therefore a customer running this program on a
Japanese OS will read different text than an English OS!
42. What's wrong with this code? (#1)
// Java
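Reader r = new InputStreamReader(
    new FileInputStream("names.txt"), "UTF-8");
• Fixed. Will always read as UTF-8.
– FileReader itself only accepts a charset argument (as a Charset
object, not a string) from Java 11 onward.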
43. What's wrong with this code? (#2)
// C#
TextReader r = new StreamReader("names.txt");
44. What's wrong with this code? (#2)
// C#
TextReader r = new StreamReader("names.txt");
• Nothing!
– C#'s StreamReader always uses UTF-8 encoding if no
encoding is specified.
– You must always read the documentation. Don't assume.
45. What's wrong with this code? (#2)
// C#
TextReader r = new StreamReader(
    "names.txt", Encoding.UTF8);
• Nevertheless, always explicitly specifying the encoding is still a
good idea.