Character sets and collations are an important part of the database setup. In this presentation I show you the history of character sets and how they are used today, how UTF-8 works and how MySQL handles all this.
2. Agenda
• About Anders Karlsson
• Part 1 - The gruesome background
• The history of character sets and
collations
• The “classic” 7 and 8 bit ASCII
character sets
• Part 2 – UNICODE Rocks!
• What is UNICODE and encodings
• Why UTF-8 is smart. Or not so smart
• Part 3 - MySQL and UNICODE
• Questions? Answers?
3. About Anders Karlsson
• Senior Sales Engineer at SkySQL
• Former Database Architect at Recorded Future, Sales
Engineer and Consultant with
Oracle, Informix, TimesTen, MySQL / Sun / Oracle etc.
• Has been in the RDBMS business for 20+ years
• Has also worked as Tech Support engineer, Porting
Engineer and in many other roles
• Outside SkySQL I build websites
(www.papablues.com), develop Open Source software
(MyQuery, mycleaner etc), am a keen
photographer, have an affection for English Real Ales
and a great interest in computer history
22/11/2012 SkySQL Ab 2011 Confidential 3
4. Part 1 – The history which we
are not to ignore (but which has
already been ignored several
times)
5. The history of Character Sets and
collations
• At first there were no characters, only numbers
• Then on the 7th day we realized characters and
words were a good thing, but that computers
can only handle numbers, so we needed a way
of representing characters as numbers
• So we got different mappings from characters to
numbers: ASCII, EBCDIC, FIELDATA, Baudot
etc, in different variations (in particular
EBCDIC)
6. ASCII – The mother of character sets
• For anyone who is not a masochist (i.e.
anyone not using a mainframe), the character
set of choice soon became 7-bit ASCII
(American Standard Code for Information
Interchange), first published in 1963
• 7-bits was enough for US English characters
and control characters, with some legroom
(note that ASCII is US English, not UK
English, centric)
• The 8th bit was used for parity in transmission
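To make the parity idea concrete, here is a minimal sketch in Python. It assumes even parity (the slide does not say which variant was used), and the helper name `with_even_parity` is made up for this illustration.

```python
def with_even_parity(ch: str) -> int:
    """Return an 8-bit value: the 7-bit ASCII code with an even-parity bit on top.

    Assumption for illustration: even parity, i.e. the 8th bit is chosen so
    that the total number of 1-bits in the byte is even.
    """
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")
    parity = ones % 2  # 1 when the 7-bit code has an odd number of 1-bits
    return (parity << 7) | code


for ch in "AC":
    print(ch, format(with_even_parity(ch), "08b"))
# A 01000001
# C 11000011
```

Because the top bit is spent on error detection, only 7 bits are left for the character itself, which is why the 8th bit was not available for extra characters.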
7. All ASCII hell breaks loose
• As the original 7-bit US ASCII didn’t support
anything but US English, variations started to
appear.
• Any decent computer supported 8-bit
characters, but the assumption was still
that bit 8 was a parity bit.
• So 7-bit local variations were
developed, Swedish 7-bit ASCII for example
(anyone coding in C knows and hates this)
8. And then we get 8-bit ASCII hell!
• Extended 8-bit ASCII solves a few problems, but also
introduces a few new ones. Most of the new problems
came from an attempt to make 8-bit Extended ASCII
compatible with 7-bit ASCII variations
• The Extended 8-bit “ASCII” character sets are largely
standardized as ISO 8859 (with variations). Most
common is ISO 8859-1 (latin-1)
• 8859-15 is a not so popular 8859-1 update, including a
Euro-sign among a few other
things. If the Euro-sign really is a useful
addition is yet to be determined
• Another 8859-1 variation is Windows CP1252,
which is an enhanced 8859-1 character set
9. Oh, then we have collations!
• A “collation” determines how characters in a
character set are to be sorted!
• 7-bit ASCII was great (numeric order same as
character order)
– Or was it? Really? Upper / Lower case?
• 7-bit localized ASCII was not so great. To say the
least. Swedish 7-BIT ASCII was not correctly
sorted (å last in the alphabet, after ä and ö)
• 8-bit Extended ASCII didn’t help much (Swedish
again being in the wrong order, but not the same
wrong order as with 7-bit “Swedish ASCII”)
10. Collation basics
• Don’t ever think that the character
set determines the sorting!
– The same character set used in
different countries may be sorted
differently
– Different sorting models may be used in
the same country (A good example is
case sensitivity)
• Also, collations are not only about
sorting, it’s also about comparisons
and a few other things
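The difference between "numeric order" and a language-specific collation can be sketched in a few lines of Python. The alphabet table below is a deliberately simplified, hypothetical stand-in for a real Swedish collation (it only knows lowercase a to z plus å, ä, ö), and `swedish_key` is a name invented for this example.

```python
# Simplified, hypothetical Swedish alphabet: å, ä, ö come after z, in that order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str) -> list:
    """Case-insensitive sort key based on the simplified alphabet above."""
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


words = ["öl", "Zebra", "ål", "apa", "äpple"]

# Plain sorted() compares code points: uppercase first, and ä before å.
by_code_point = sorted(words)
# The collation-style key gives the order a Swedish reader expects.
by_swedish = sorted(words, key=swedish_key)

print(by_code_point)  # ['Zebra', 'apa', 'äpple', 'ål', 'öl']
print(by_swedish)     # ['apa', 'Zebra', 'ål', 'äpple', 'öl']

# The same key also decides comparisons: under this collation the two
# spellings are equal, although the strings themselves are not.
print(swedish_key("APA") == swedish_key("apa"))  # True
```

The character set is identical in both sorts; only the collation rule changes, which is the point of the slide.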
11. Interoperating with ASCII
• As long as we were all using 1 single computer
or a bunch of similar computers in a LAN, the
issues were limited
• As usual, the Internet turned this beautiful
environment into something truly evil!
• The Internet got started in the US
– Which means, again, that the founders were
convinced that 7-bit ASCII would be OK. That this
had been an incorrect assumption 30 years before
Facebook came around made no difference. Of
course not.
12. Interoperability necessities
• For us to be able to communicate we need to
be able to tell what character set we expect
here at the client side, the server has to tell
what it delivers, and then we need a way to
align all this.
• The trick: <meta http-equiv=Content-Type
content="text/html; charset=iso-8859-1">
– Or maybe not? This tells what I get, but doesn’t
allow me to say what I want!
• Actually, this didn’t help as much as we hoped
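What goes wrong when client and server disagree about the character set can be reproduced with Python's built-in codecs. This is only a sketch of the failure mode; the sample word is arbitrary and no real HTTP exchange is involved.

```python
text = "Räksmörgås"

# The "server" sends UTF-8 bytes ...
utf8_bytes = text.encode("utf-8")

# ... but the "client" assumes ISO 8859-1 and decodes every byte on its own.
wrong = utf8_bytes.decode("iso-8859-1")
# A client that uses the charset the server actually sent gets the text back.
right = utf8_bytes.decode("utf-8")

print(wrong)  # each of ä, ö, å shows up as two unrelated characters
print(right)  # Räksmörgås

# The opposite mismatch: Latin-1 bytes read as UTF-8 are not even valid,
# so a lenient decoder substitutes U+FFFD replacement characters.
latin1_bytes = text.encode("iso-8859-1")
print(latin1_bytes.decode("utf-8", errors="replace"))
```

Nothing in the bytes themselves says which interpretation is right, which is why the charset has to be declared and agreed on out of band.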
13. Part 1 Conclusion
• The many different local variations of
characters served us well, for a while
• Now we have a global IT environment with
many different character sets and
collations, and we can’t deal with multiple
local versions anymore
• And we have languages whose character set
will not fit in 8 bits anyway
• And then we need to sort and compare all this!
14. Part 2 – UNICODE and Ken
Thompson save the
world, without Batman and not by
tracking down the penguin
15. UNICODE – One Character set for all
• Yes, that is what UNICODE (or ISO/IEC 10646)
sets out to do – A common character set for
ALL languages (close to 240.000 characters are
defined in UNICODE 4.1 today, MySQL is
somewhat at UNICODE 3.0). Sort of.
• This means that UNICODE has character codes
that cannot fit in 1 byte. This is a big surprise
to anyone on the other side of the pond, but
there you go
• But there is a remedy: UNICODE Encodings!
16. UNICODE Encodings
• A UNICODE encoding is a standardized way of
representing a character in the UNICODE
character set
• Some UNICODE encodings (UCS-2, for example) can
represent only part of the full UNICODE character set
• UNICODE encodings are part of the UNICODE
standard itself (and this is a VERY good thing!
If this wasn’t the case, both Apple and
Microsoft would have invented their own
encodings I’m sure)
17. UNICODE Encodings
• Among the UNICODE encodings are
– UCS-2 – 2 bytes wide (i.e. only 64k different
characters can be represented)
– UTF-16 – 2 or 4 bytes wide. This is then a variable
length scheme with a very complex setup. When
only 2 bytes are used, they are the same as UCS-2
– UTF-32 – 4 bytes fixed size
• To be honest, besides UTF-16 / UCS-2 that is
common in Windows and related frameworks
(like COM), none of these are very popular
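A small sketch of the byte widths listed above, using Python's built-in codecs. The sample characters are arbitrary picks, and the little-endian codec variants are used only so that no byte order mark is added to the output.

```python
# Sample characters (arbitrary choices for illustration).
samples = {
    "LATIN CAPITAL A": "\u0041",   # plain ASCII
    "EURO SIGN": "\u20ac",         # inside the Basic Multilingual Plane
    "GRINNING FACE": "\U0001F600"  # outside the Basic Multilingual Plane
}

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in UTF-16 and UTF-32."""
    return {
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

The character outside the BMP is the one that needs 4 bytes in UTF-16, and it is also the one that UCS-2 cannot represent at all.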
18. UTF-8 – Some smart dudes at work!
• The problem that UNICODE has is that it has
to represent all those characters. This should
break some applications for sure.
• Well, Encodings solve that too, and the
mother of all encodings is UTF-8.
Invented not by Albert Einstein or
Batman but by Ken Thompson!
• Let’s now have a round of applause
for Ken Thompson!
19. The details of UTF-8
• UNICODE characters 0 – 127 are the same as
in standard 7-bit ASCII (remember that?)
• UTF-8 works the same: For characters 0 –
127, the most significant (first) bit of the first
(and only) byte is 0
• Beyond 7-bit ASCII characters, the number of
“leading” 1’s in the first byte tells how many
bytes make up the character
• All other bytes start with a 1 and a 0
• And the rest of the bits make up the character
20. The details of UTF-8
• So in the first byte, it is one of two things:
– A leading 0 meaning a single byte character
– A number of 1’s (at least 2, as 1 byte characters
are indicated by a leading 0) followed by a 0
• This means that the first byte in a character NEVER
starts with the sequence 10
– All other bytes start with 10
– 1 UTF-8 byte can contain up to 7 bits of data
– 2 UTF-8 bytes contain from 8 to 11 bits of data
– 3 UTF-8 bytes contain from 12 to 16 bits of data
– 4 UTF-8 bytes contain from 17 to 21 bits of data
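The bit layout can be sketched as a hand-written encoder. This is an illustration only: the function name utf8_encode is made up, and the sketch does not reject surrogate code points the way a real codec would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the leading-bits rules."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:
        # 0xxxxxxx: a single byte carrying 7 bits
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 = 11 bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 bits
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 = 21 bits
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare with Python's built-in encoder for a few sample characters.
for ch in ["A", "\u00dc", "\u20ac", "\U0001F600"]:
    print(hex(ord(ch)), utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```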
21. Some useful aspects of UTF-8
• You can always find the leading byte of a
character in a word, starting from any byte
– Just move “backward” until a byte not having a
leading 10 is found
• Byte values 0 – 127 are ONLY present as
character values 0 – 127, nowhere else!
– All other byte values have the highest bit set
– So strlen(), strcmp() etc. still work, but on a byte
by byte, not character by character, level
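A rough sketch of both points: stepping backward to the leading byte, and the difference between counting bytes and counting characters. The helper names char_start and count_chars are made up for this example, and the input is assumed to be valid UTF-8.

```python
def char_start(data: bytes, index: int) -> int:
    """Move backward from index until a byte that does not start with 10."""
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

def count_chars(data: bytes) -> int:
    """Count characters by counting the bytes that are not continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

data = "Zürich €".encode("utf-8")

print(len(data))             # byte length, what a C strlen() would report
print(count_chars(data))     # character length
print(char_start(data, 10))  # index of the leading byte of the last character
```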
22. So, are we all OK with UTF-8 now?
• Let’s see. Using UTF-8 we can represent
binary values with up to 21 bits, which is
2.097.152 characters! Which is
more than enough! (But 640K
RAM was ALSO more than enough)
• If we limit ourselves to 3 bytes UTF-8
we can represent 65.536 different
characters, the same as if we use
UCS-2 (which is fixed 2-byte format).
65.536 characters is what is in the
UNICODE Basic Multilingual Plane
23. Why we actually need 4-bytes UTF-8
• Beyond the BMP come a couple of other
“planes”. The one that causes most issues is
the one that adds a bunch of
Chinese, Japanese and Korean characters
• For these, we need to go beyond the BMP and
hence beyond the nice and cosy 65.536
characters. Duh!
• And this is why the MySQL assumption that
UTF-8 means a maximum of 3 bytes might not
be such a good idea after all
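A quick sketch of where the 3-byte limit bites. The helper fits_in_three_byte_utf8 is a made-up name, not a MySQL or Python function; it only checks whether every character of a string lies inside the BMP.

```python
def fits_in_three_byte_utf8(text: str) -> bool:
    """True if every character is in the BMP, i.e. needs at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)

bmp_kanji = "\u6f22\u5b57"   # two common CJK characters inside the BMP
rare_kanji = "\U00020BB7"    # a CJK Extension B character, outside the BMP

print(len("\uffff".encode("utf-8")))      # last BMP code point
print(len("\U00010000".encode("utf-8")))  # first code point beyond the BMP
print(fits_in_three_byte_utf8(bmp_kanji))
print(fits_in_three_byte_utf8(rare_kanji))
```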
25. So, how does MySQL handle all this?
• MySQL supports a whole range of UNICODE
Encodings and collations! Good!
• MySQL understands the case when we have
one character set stored in a column in a table
and another one on the client side, and nicely
does a conversion for us! Good!
• Not all UNICODE Encodings are valid on the
Client side! Not so good
• Actually, anything beyond UTF-8, when it
comes to UNICODE on the client side, is
troublesome
26. Lessons in MySQL and UNICODE
• Lesson 1: Learn about UNICODE and
understand how it works
• Lesson 2: Stick with UTF-8! Most others do
that too. Including Java, JavaScript, JSON, the
web and many, many others!
• Lesson 3: UCS-2 may seem like a good idea, it
is fixed length after all. It’s not (a good idea
that is, fixed length it is)
• Lesson 4: Don’t forget about collations! They
are important!
27. Collations – The Sequel
• Collations determine how strings are sorted
– Order by
– Indexes
– WHERE col1 > 'Über'
• Collations determine how strings are
compared
– Is A = Ä or not? Y = Ü?
• Watch out in particular for COLLATIONS used for
PRIMARY KEYs
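To see why the collation matters, here is a rough sketch that compares a plain code point sort with a made-up accent-insensitive and case-insensitive comparison. The fold function is an assumption for illustration only; it is not how any MySQL collation is implemented.

```python
import unicodedata

def fold(s: str) -> str:
    """Hypothetical comparison key: strip accents, then ignore case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

words = ["Zebra", "Über", "apple", "Uhr"]

# Binary-style ordering: compares raw code points, so "Über" lands last.
print(sorted(words))

# Ordering with the folding key: "Über" sorts among the other U words.
print(sorted(words, key=fold))

# Equality depends on the rule too: under this key, A and Ä compare equal.
print(fold("A") == fold("Ä"))
```

Two rows that are distinct under one collation can be duplicates under another, which is exactly what hurts with a PRIMARY KEY.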
28. Storing UTF-8 data in MySQL
• Most Storage Engines are happy to use utf-8
• The MySQL Interpretation of UTF-8 is 1 – 3
bytes, or 65.536 different characters!
– This means that
• A CHAR(10) column requires 30 bytes fixed space!
• A VARCHAR(10) column is limited to 30 bytes
• MySQL 5.5 and up also supports 4-byte UTF-8
by using the character set utf8mb4
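The space arithmetic above, as a small sketch. The table maps MySQL character set names to their maximum bytes per character; the function name max_column_bytes is made up, and the length prefix that VARCHAR stores is ignored.

```python
# Maximum bytes per character for a few MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}

def max_column_bytes(char_length: int, charset: str) -> int:
    """Worst-case bytes needed to hold char_length characters in charset."""
    return char_length * MAX_BYTES_PER_CHAR[charset]

for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_column_bytes(10, charset))
```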
29. Storing UTF-8 data in MySQL
• VARCHAR columns are actually fixed in some
Storage Engines, most notably those engines
developed sometime around the time of the
American Civil War, when variable length data
was still in its infancy
• UTF-8 can potentially waste A LOT of space
• Extra space for UTF-8 also affects byte size
limits, such as VARCHAR and INDEX sizes
• UTF-8 data sorting is way more complex than
a simple binary sort (so in some ways, things
were better in the old 7-bit ASCII days)
30. Some simple demos, Questions
and Answers.
And I haven’t even begun to talk
about byte ordering and byte order
marks.