ASCII is so 1963. Nowadays, computers must support a broad range of different characters beyond the 128 we had in the early days of computing - not just accents and emojis but also completely different writing systems used around the globe. The Unicode standard packs a whopping 143,859 characters into an elegant system used by over 95% of the Internet, but PHP's string functions don't play nicely with Unicode by default, making it difficult for developers to properly handle such a wide array of possible user inputs.
In this talk, we'll explore why Unicode is important, how the various encodings like UTF-8 work under the hood, how to handle them within PHP, and some nifty tricks and shortcuts to preserve performance.
2. Colin O’Dell
● Principal Engineer at Unleashed Technologies
● PHP for ~20 years; 13 years professionally
● Creator & maintainer of league/commonmark library
● PHP League leadership team
● Owner of moderngeekware.com
● @colinodell
3. Agenda
● A History of Encoding Systems
● Unicode Standard
● Unicode Encodings
● Using Unicode in PHP
● Tips & Tricks
● Questions & Answers
11. 1960s: ASCII
● American Standard Code for Information Interchange
● 7-bit binary encoding
○ 0000000 = 0
○ ...
○ 1111111 = 127
12. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
13. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
Character Hex Binary Character Hex Binary
LF (line feed) 0x0A 0001010 E 0x45 1000101
3 0x33 0110011 e 0x65 1100101
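The hex and binary columns above can be reproduced with a few lines of code. This is a minimal sketch in Python; the helper name `ascii_row` is made up for illustration and is not part of the talk.

```python
def ascii_row(ch):
    """Return (hex, 7-bit binary) for one ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        # Anything above 127 does not fit in 7 bits, so it is not ASCII.
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    return f"0x{code:02X}", f"{code:07b}"


# The four characters shown in the table above.
for ch in ["\n", "E", "3", "e"]:
    print(repr(ch), *ascii_row(ch))
```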
14. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
00xxxxx
01xxxxx
10xxxxx
11xxxxx
00xxxxx = 32 control codes
01xxxxx = 32 numbers & symbols
10xxxxx = 32 uppercase letters and some extra symbols
11xxxxx = 32 lowercase letters and some extra symbols
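The four groups above fall out of the top two bits of the 7-bit code. A small sketch, assuming Python; `GROUPS` and `ascii_group` are illustrative names, and the labels are copied from the slide.

```python
# The top two bits of a 7-bit ASCII code select one of four blocks of 32.
GROUPS = {
    0b00: "control codes",
    0b01: "numbers & symbols",
    0b10: "uppercase letters and some extra symbols",
    0b11: "lowercase letters and some extra symbols",
}


def ascii_group(ch):
    """Classify an ASCII character by the two highest bits of its code."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    # Shifting right by 5 drops the low five bits (the xxxxx part).
    return GROUPS[code >> 5]


for ch in ["\t", "7", "A", "[", "z"]:
    print(repr(ch), f"{ord(ch):07b}", ascii_group(ch))
```

Note that DEL (1111111) lands in the last group even though it is a control code, which is one of the "extra symbols" cases.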
15. A = 0x41 = 1000001
B = 0x42 = 1000010
…
Z = 0x5A = 1011010
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
00xxxxx
01xxxxx
10xxxxx
11xxxxx
16. A = 0x41 = 1000001
B = 0x42 = 1000010
…
Z = 0x5A = 1011010
a = 0x61 = 1100001
b = 0x62 = 1100010
…
z = 0x7A = 1111010
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
00xxxxx
01xxxxx
10xxxxx
11xxxxx
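Comparing the A/a and Z/z rows above, each uppercase letter differs from its lowercase partner in exactly one bit (0100000, or 0x20). A minimal Python sketch of that observation; `CASE_BIT` and `toggle_case` are hypothetical names used only here.

```python
CASE_BIT = 0x20  # 0100000: the single bit separating 'A' (1000001) from 'a' (1100001)


def toggle_case(ch):
    """Flip the case of an ASCII letter by XOR-ing one bit; leave other characters alone."""
    if not (ch.isascii() and ch.isalpha()):
        return ch
    return chr(ord(ch) ^ CASE_BIT)


for ch in "AbZz5":
    print(ch, f"{ord(ch):07b}", "->", toggle_case(ch))
```

This trick only holds for the 26 ASCII letters; it is not a general way to change case in Unicode text.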
17. But computers use 8-bit bytes...
ASCII (7 Bits) ???
Start 00000000 10000000
End 01111111 11111111
Count 128 128
18. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
7-bit ASCII
19. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
8-F ???
8-bit “Extended ASCII”
22. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż
B ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż
C Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď
D Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß
E ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď
F đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙
ISO 8859-2
23. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
1 ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ ⌂
8 Ç ü é â ä à å ç ê ë è ï î ì Ä Å
9 É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
A á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
B ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
C └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
D ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
E α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
F ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP
Code Page 437 (IBM PC)
25. 8-bit “Extended ASCII”
● ISO 8859 - 16 variations:
○ ISO 8859-1 (“Latin 1”, Western European)
○ ISO 8859-2 (“Latin 2”, Central European)
○ ISO 8859-3 (“Latin 3”, South European)
○ ISO 8859-4 (“Latin 4”, North European)
○ ISO 8859-5 (Latin/Cyrillic)
○ ISO 8859-6 (Latin/Arabic)
○ ISO 8859-7 (Latin/Greek)
○ ISO 8859-8 (Latin/Hebrew)
○ ISO 8859-9 (“Latin 5”, Turkish)
○ ISO 8859-10 (“Latin 6”, Nordic)
○ ISO 8859-11 (Latin/Thai)
○ ISO 8859-12 (Latin/Devanagari) - abandoned
○ ISO 8859-13 (“Latin 7”, Baltic Rim)
○ ISO 8859-14 (“Latin 8”, Celtic)
○ ISO 8859-15 (“Latin 9”)
■ Revision of 8859-1 that swaps out less-used chars; adds euro currency symbol
○ ISO 8859-16 (“Latin 10”, South-Eastern European)
● Windows-1252
● CP 437 - Original IBM PC
● Mac OS Roman character set
● TRS-80 character set
● Atari’s ATASCII
● Commodore’s PETSCII
● HP Roman-8 and Roman-9
● DEC’s Multinational Character Set
● Lotus International Character Set
● ECMA-94
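The practical problem with this many 8-bit sets is that the same byte means different things depending on which table the reader assumes. A short Python sketch, using three of the character sets above through Python's built-in codec names (`latin-1`, `iso8859_2`, `cp437`); `decode_all` is an illustrative helper.

```python
def decode_all(raw, codecs=("latin-1", "iso8859_2", "cp437")):
    """Decode the same bytes under several 8-bit character sets."""
    return {name: raw.decode(name) for name in codecs}


# One byte from the upper half (0x80-0xFF), three different characters.
for name, text in decode_all(bytes([0xA3])).items():
    print(f"0xA3 as {name}: {text}")

# Bytes in the lower half agree everywhere, because every set keeps ASCII intact.
print(decode_all(b"A"))
```

Without out-of-band knowledge of the intended character set, there is no reliable way to tell which of the three readings is correct.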
31. “The Unicode Standard is the universal character
encoding standard for written characters and text. It
defines a consistent way of encoding multilingual text
that enables the exchange of text data internationally and
creates the foundation for global software”
32. Code Points
Problem:
How to accommodate larger character sets without wasting memory?
Solution:
Break the one-to-one correspondence between characters and
bits/encoding! Offer different ways to encode based on
different needs.
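To make the "different ways to encode" idea concrete, here is a sketch in Python that encodes single characters three ways. UTF-8, UTF-16 and UTF-32 are standard Unicode encodings rather than something this slide lists, and `encodings` is a made-up helper name.

```python
def encodings(ch):
    """Show the bytes (as hex) that three standard encodings produce for one character."""
    return {name: ch.encode(name).hex() for name in ("utf-8", "utf-16-be", "utf-32-be")}


# Same code point, different byte sequences depending on the encoding chosen.
for ch in ["H", "\u03a3", "\U0001F638"]:
    print(f"U+{ord(ch):04X}", encodings(ch))
```

The code point stays the same in every row; only the encoded bytes change, which is the decoupling the slide describes.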
33. ASCII vs. Unicode
Character Encoded Bits
H 01001000 (0x48)
P 01010000 (0x50)
Glyph Code Point Encoded Bits
P U+0050 LATIN CAPITAL LETTER P ????
H U+0048 LATIN CAPITAL LETTER H ????
34. Glyph Code Point Encoded Bits
P U+0050 LATIN CAPITAL LETTER P ????
h U+0068 LATIN SMALL LETTER H ????
Σ U+03A3 GREEK CAPITAL LETTER SIGMA ????
ش U+0634 ARABIC LETTER SHEEN ????
𝋭 U+1D2ED MAYAN NUMERAL THIRTEEN ????
😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES ????
H U+0048 LATIN CAPITAL LETTER H ????
45. Recap
● Code Point: a number representing a single character*
○ 143,859 defined as of Unicode 13.0
○ Format: U+hhhhhh
● Codespace: A range of numerical values available for encoding characters
○ Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
● Code Planes: Continuous group of 65,536 (2^16) code points
○ 17 planes, numbered 0 - 16, which correspond to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh)
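Because each plane holds exactly 0x10000 code points, the plane number is just the code point shifted right by 16 bits. A minimal sketch of that arithmetic; the helper name `codePointPlane` is made up for illustration:

```php
<?php
// Hypothetical helper: returns the plane number (0-16) of a code point.
function codePointPlane(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Outside the Unicode codespace');
    }

    // Each plane holds 0x10000 (65,536) code points.
    return $codePoint >> 16;
}

echo codePointPlane(0x0050), PHP_EOL;   // 0 (Basic Multilingual Plane)
echo codePointPlane(0x1F638), PHP_EOL;  // 1
echo codePointPlane(0x10FFFF), PHP_EOL; // 16
```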
48. Character / Code Point:
a U+0061 LATIN SMALL LETTER A
Glyphs:
a a a a a a a a
49. Glyphs and Graphemes
Glyph / Grapheme: c a f e
Unicode Character: c a f e
Code Point: U+0063 U+0061 U+0066 U+0065
Names: LATIN SMALL LETTER C, LATIN SMALL LETTER A, LATIN SMALL LETTER F, LATIN SMALL LETTER E
50. Glyphs and Graphemes: Combining Diacritical Marks
Glyph / Grapheme: c a f é
Unicode Character: c a f e ◌́
Code Point: U+0063 U+0061 U+0066 U+0065 U+0301
Names: LATIN SMALL LETTER C, LATIN SMALL LETTER A, LATIN SMALL LETTER F, LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
51. Glyphs and Graphemes: Combining Diacritical Marks
Glyph / Grapheme: c a f é
Unicode Character: c a f e ◌́
Code Point: U+0063 U+0061 U+0066 U+0065 U+0301
Names: LATIN SMALL LETTER C, LATIN SMALL LETTER A, LATIN SMALL LETTER F, LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
e + ◌́ = é
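The difference between bytes, code points, and graphemes can be checked directly in PHP. This is a small sketch that uses only core functions and assumes PCRE was built with Unicode support (the usual case); `\X` matches one extended grapheme cluster:

```php
<?php
// "café" written with a combining accent: c, a, f, e, U+0301
$cafe = "cafe\u{0301}";

$bytes      = strlen($cafe);                  // raw byte count
$codePoints = preg_match_all('/./su', $cafe); // one match per code point
$graphemes  = preg_match_all('/\X/u', $cafe); // one match per grapheme cluster

printf("%d bytes, %d code points, %d graphemes\n", $bytes, $codePoints, $graphemes);
// 6 bytes, 5 code points, 4 graphemes
```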
52. Glyphs and Graphemes: Combining Diacritical Marks
Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀
L̵͉̣̄̇̀G
̸̮͉̊ O
̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T
̸̰̺̝̍̈
53. Glyphs and Graphemes: Variation Selectors
Glyph / Grapheme: ✈
Unicode Character: ✈ VS15
Code Point: U+2708 U+FE0E
Names: AIRPLANE, VARIATION SELECTOR-15 (TEXT STYLE)
54. Glyphs and Graphemes: Variation Selectors
Text style:
Glyph / Grapheme: ✈
Unicode Character: ✈ VS15
Code Point: U+2708 U+FE0E
Names: AIRPLANE, VARIATION SELECTOR-15 (TEXT STYLE)
Emoji style:
Glyph / Grapheme:
Unicode Character: ✈ VS16
Code Point: U+2708 U+FE0F
Names: AIRPLANE, VARIATION SELECTOR-16 (EMOJI STYLE)
55. Glyphs and Graphemes: Regional Indicator Symbols
Glyph / Grapheme: 🇺🇸
Unicode Character: 🇺 🇸
Code Point: U+1F1FA U+1F1F8
Names: REGIONAL INDICATOR SYMBOL LETTER U, REGIONAL INDICATOR SYMBOL LETTER S
Glyph / Grapheme: 🇨🇦
Unicode Character: 🇨 🇦
Code Point: U+1F1E8 U+1F1E6
Names: REGIONAL INDICATOR SYMBOL LETTER C, REGIONAL INDICATOR SYMBOL LETTER A
56. Glyphs and Graphemes: Modifiers
Glyph / Grapheme:
Unicode Character: 👋
Code Point: U+1F44B U+1F3FC
Names: WAVING HAND SIGN, EMOJI MODIFIER FITZPATRICK TYPE-3
Glyph / Grapheme:
Unicode Character: 👋
Code Point: U+1F44B U+1F3FE
Names: WAVING HAND SIGN, EMOJI MODIFIER FITZPATRICK TYPE-5
57. Glyphs and Graphemes: ZWJ Sequences
Glyph / Grapheme: 👨 👩 👶 👧
Unicode Character: 👨 👩 👶 👧
Code Point: U+1F468 U+1F469 U+1F476 U+1F467
Names: MAN, WOMAN, BABY, GIRL
58. Glyphs and Graphemes: ZWJ Sequences
Glyph / Grapheme:
Unicode Character: 👨 ZWJ 👩 ZWJ 👶 ZWJ 👧
Code Point: U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D U+1F467
Names: MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, BABY, ZERO WIDTH JOINER, GIRL
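To make the size of such a sequence concrete, the sketch below builds the family sequence from its seven code points and counts what PHP sees. Whether it renders as a single glyph depends on the font and platform, so only byte and code point counts are shown:

```php
<?php
$zwj = "\u{200D}"; // ZERO WIDTH JOINER

// MAN, WOMAN, BABY, GIRL joined by ZWJ
$family = implode($zwj, ["\u{1F468}", "\u{1F469}", "\u{1F476}", "\u{1F467}"]);

echo strlen($family), PHP_EOL;                  // 25 (4 x 4 bytes + 3 x 3 bytes)
echo preg_match_all('/./su', $family), PHP_EOL; // 7 code points
```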
65. Glyph Code Point Encoded Bits
P U+0050 LATIN CAPITAL LETTER P ????
h U+0068 LATIN SMALL LETTER H ????
Σ U+03A3 GREEK CAPITAL LETTER SIGMA ????
ش U+0634 ARABIC LETTER SHEEN ????
𝋭 U+1D2ED MAYAN NUMERAL THIRTEEN ????
😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES ????
H U+0048 LATIN CAPITAL LETTER H ????
67. UTF-32
Fixed-byte encoding; 4 bytes per code point
Codepoint range: U+0000..U+D7FF, U+E000..U+10FFFF
Unicode scalar value (binary): xxxxxxxxxxxxxxxxxxxxx
Encoded bytes: 00000000 000xxxxx xxxxxxxx xxxxxxxx
68. UTF-32
Fixed-byte encoding; 4 bytes per character
Codepoint range: U+0000..U+D7FF, U+E000..U+10FFFF
Unicode scalar value (binary): xxxxxxxxxxxxxxxxxxxxx
Encoded bytes: 00000000 000xxxxx xxxxxxxx xxxxxxxx
Examples:
A U+0041 LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 00000000 00000000 01000001
😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES
0x1F638 => 11111011000111000 => 00000000 00000001 11110110 00111000
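Since UTF-32 stores the code point number directly, encoding is a single `pack()` call. A minimal sketch, assuming big-endian byte order to match the examples above; the helper name `utf32beEncode` is hypothetical, and it does not reject the reserved surrogate range:

```php
<?php
// Hypothetical helper: encode one code point as big-endian UTF-32.
function utf32beEncode(int $codePoint): string
{
    // 'N' packs an unsigned 32-bit integer in big-endian byte order.
    return pack('N', $codePoint);
}

echo bin2hex(utf32beEncode(0x0041)), PHP_EOL;  // 00000041
echo bin2hex(utf32beEncode(0x1F638)), PHP_EOL; // 0001f638
```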
69. UTF-16
Variable-length encoding; 2 or 4 bytes per character
Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)
Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx
Encoded bytes: xxxxxxxx xxxxxxxx
70. UTF-16
Variable-length encoding; 2 or 4 bytes per character
Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)
Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx
Encoded bytes: xxxxxxxx xxxxxxxx
Example:
A U+0041 LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 01000001
71. UTF-16
Variable-length encoding; 2 or 4 bytes per character
Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)
Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx
Encoded bytes: xxxxxxxx xxxxxxxx
Codepoint range: U+010000..U+10FFFF (Supplementary Planes)
Unicode scalar value (binary): uuuuu uuuuuuuu uuuuuuuu
Encoded bytes: 110110xx xxxxxxxx 110111yy yyyyyyyy
72. UTF-16
Variable-length encoding; 2 or 4 bytes per character
Codepoint range: U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)
Unicode scalar value (binary): 00000 xxxxxxxx xxxxxxxx
Encoded bytes: xxxxxxxx xxxxxxxx
Codepoint range: U+010000..U+10FFFF (Supplementary Planes)
Unicode scalar value (binary): uuuuu uuuuuuuu uuuuuuuu
Encoded bytes: 110110xx xxxxxxxx 110111yy yyyyyyyy
U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000
W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx
W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy
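The U' / W1 / W2 steps translate almost line for line into PHP. The following is a sketch for illustration only; the function name `utf16Surrogates` is invented here:

```php
<?php
// Hypothetical helper: split a supplementary-plane code point into
// its UTF-16 high and low surrogate code units.
function utf16Surrogates(int $codePoint): array
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a supplementary-plane code point');
    }

    $u  = $codePoint - 0x10000;  // U': a 20-bit value
    $w1 = 0xD800 + ($u >> 10);   // high surrogate carries the top 10 bits
    $w2 = 0xDC00 + ($u & 0x3FF); // low surrogate carries the bottom 10 bits

    return [$w1, $w2];
}

[$hi, $lo] = utf16Surrogates(0x1F638); // GRINNING CAT FACE WITH SMILING EYES
printf("%04X %04X\n", $hi, $lo);       // D83D DE38
```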
75. UTF-8
Variable-length encoding; 1-4 bytes per code point
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
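The four rows of the table map onto four branches of a hand-written encoder. This is a sketch to show the bit layout, not something to use in place of built-in functions; `utf8Encode` is a made-up name:

```php
<?php
// Hypothetical helper: encode one Unicode scalar value as UTF-8 bytes.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {          // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {         // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x50)), PHP_EOL;    // 50
echo bin2hex(utf8Encode(0x3A3)), PHP_EOL;   // cea3
echo bin2hex(utf8Encode(0x1F638)), PHP_EOL; // f09f98b8
```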
76. UTF-8
Trick 1: ASCII === UTF-8
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
77. UTF-8
Trick 2: Virtually all languages only need 1, 2, or 3 bytes
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
78. UTF-8
Trick 3: First byte tells you the length
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
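Trick 3 can be sketched as a check on the leading bits of the first byte. The helper name `utf8SequenceLength` is hypothetical, and the sketch assumes it is handed the first byte of a well-formed sequence:

```php
<?php
// Hypothetical helper: how many bytes does the UTF-8 sequence starting
// with this lead byte occupy?
function utf8SequenceLength(string $char): int
{
    $b = ord($char[0]);

    if ($b < 0x80) {
        return 1; // 0xxxxxxx
    }
    if (($b & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($b & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($b & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }

    throw new InvalidArgumentException('Continuation byte or invalid lead byte');
}

echo utf8SequenceLength("A"), PHP_EOL;         // 1
echo utf8SequenceLength("\u{03A3}"), PHP_EOL;  // 2
echo utf8SequenceLength("\u{20AC}"), PHP_EOL;  // 3
echo utf8SequenceLength("\u{1F638}"), PHP_EOL; // 4
```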
79. UTF-8
Trick 4: Self-synchronization
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
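Self-synchronization means that from any byte offset you can find the start of the current character by skipping backwards over continuation bytes (10xxxxxx). A minimal sketch, assuming valid UTF-8 input and an offset inside the string; `utf8CharStart` is an invented name:

```php
<?php
// Hypothetical helper: move a byte offset back to the first byte of
// the character it falls inside.
function utf8CharStart(string $s, int $offset): int
{
    // Continuation bytes look like 10xxxxxx; step back until we leave them.
    while ($offset > 0 && (ord($s[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }

    return $offset;
}

$s = "a\u{1F638}b"; // bytes: 61 | F0 9F 98 B8 | 62

echo utf8CharStart($s, 3), PHP_EOL; // 1 (offset 3 is in the middle of the emoji)
echo utf8CharStart($s, 5), PHP_EOL; // 5 ("b" is already a character start)
```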
80. UTF-8
Trick 5: No 0x00 bytes, except for NUL
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
81. UTF Encoding Summary
                             UTF-32    UTF-16     UTF-8
Encoding length              Fixed     Variable   Variable
Bytes per code point         4         2 or 4     1-4
Memory-efficient             No        Somewhat   Yes
CPU-efficient                Yes       Somewhat   Somewhat
Self-synchronizing           No        Yes        Yes
Contains null (0x00) bytes   Yes       Yes        No
ASCII-compatible             No        No         Yes
84. Handling Text In Programming Languages
1. Treat text as a sequence of bytes (PHP, C)
$smile = "\xF0\x9F\x98\x80";
echo $smile; // => '😀'
echo strlen($smile); // => 4
2. Treat text as a sequence of Unicode code points (Python 3)
3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)
const smile = '\uD83D\uDE00';
console.log(smile); // => '😀'
console.log(smile.length); // => 2
85. PHP Strings
Be careful!
● Strings are simply byte sequences
● Encoding-agnostic
● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
86. PHP String Functions
Function What It Actually Does
strlen() Counts the length in bytes
str_replace() Replaces bytes
substr() Returns a subset of bytes
strtoupper() Converts alphabetic ASCII bytes to uppercase based on globally-set locale
Works for ASCII; not entirely safe* for Unicode!
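A short illustration of how the byte-oriented functions misbehave on UTF-8 input. It uses only core functions; the `//u` pattern is used here purely as a quick validity check, since `preg_match()` returns false when the subject is not valid UTF-8:

```php
<?php
$word = "caf\u{E9}"; // "café" with precomposed U+00E9, which is 2 bytes in UTF-8

echo strlen($word), PHP_EOL; // 5, not 4: strlen() counts bytes

$cut = substr($word, 0, 4);  // slices through the middle of "é"
echo bin2hex($cut), PHP_EOL; // 636166c3 (a dangling lead byte at the end)

var_dump(preg_match('//u', $cut)); // bool(false): $cut is no longer valid UTF-8
```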
87. ext/mbstring
Provides multibyte-safe string functions
Standard Function mbstring Alternative
strlen() mb_strlen()
str_replace() (none)
substr() mb_substr()
strtoupper() mb_strtoupper()
Tip: All functions accept an optional parameter to specify the encoding; if omitted, the internal encoding (see mb_internal_encoding(); UTF-8 by default) is used.
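The mbstring counterparts applied to the same kind of input, as a sketch that assumes ext/mbstring is loaded. The encoding is passed explicitly so the result does not depend on configuration:

```php
<?php
$word = "caf\u{E9}"; // "café", 5 bytes in UTF-8

$length = mb_strlen($word, 'UTF-8');       // counts code points, not bytes
$prefix = mb_substr($word, 0, 4, 'UTF-8'); // never cuts a character in half
$upper  = mb_strtoupper($word, 'UTF-8');   // knows that É is the uppercase of é

echo $length, PHP_EOL; // 4
echo $prefix, PHP_EOL; // café
echo $upper, PHP_EOL;  // CAFÉ
```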
88. ext/mbstring
Provides multibyte-safe string functions
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string $mode Output
Mary had a little lamb
MB_CASE_UPPER MARY HAD A LITTLE LAMB
MB_CASE_LOWER mary had a little lamb
MB_CASE_TITLE Mary Had A Little Lamb
MB_CASE_FOLD mary had a little lamb
89. ext/mbstring
Provides multibyte-safe string functions
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string $mode Output
Ich grüße den Mann
(I greet the man)
MB_CASE_UPPER ICH GRÜSSE DEN MANN
MB_CASE_LOWER ich grüße den mann
MB_CASE_TITLE Ich Grüße Den Mann
MB_CASE_FOLD ich grüsse den mann
90. ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Property Code Matches Example
L Any letter \p{L}
Ll Lower case letter \p{Ll}
Lu Upper case letter \p{Lu}
Lm Modifier letter \p{Lm}
Lt Title case letter \p{Lt}
Lo Other letter \p{Lo}
S Any symbol \p{S}
Sc Currency symbol \p{Sc}
Sk Modifier symbol \p{Sk}
Sm Mathematical symbol \p{Sm}
So Other symbol \p{So}
91. ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
92. ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Match a character without a Unicode property: \P{xx}
93. ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Match a character without a Unicode property: \P{xx}
Match a Unicode extended grapheme cluster: \X
Think of it like a . but for multiple characters that combine into a single glyph
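A few of these escapes in action. This is a sketch that assumes the script file is saved as UTF-8 and that PCRE has Unicode support enabled; the sample text is invented for illustration:

```php
<?php
$text = "Sigma: Σιγμα";

// \p{Lu}: any uppercase letter, in any script
preg_match_all('/\p{Lu}/u', $text, $upper);
print_r($upper[0]); // S and Σ

// \p{Greek}: characters belonging to the Greek script
preg_match_all('/\p{Greek}+/u', $text, $greek);
print_r($greek[0]); // Σιγμα

// \P{L}: anything that is NOT a letter
$lettersOnly = preg_replace('/\P{L}/u', '', $text);
echo $lettersOnly, PHP_EOL; // SigmaΣιγμα

// \X: "e" + COMBINING ACUTE ACCENT is matched as one grapheme cluster
echo preg_match_all('/\X/u', "e\u{0301}"), PHP_EOL; // 1
```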
94. ext/intl - IntlChar class
var_dump(IntlChar::charName('⛄'));
// string(20) "SNOWMAN WITHOUT SNOW"
$name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
var_dump(IntlChar::charFromName($name));
// int(9843)
var_dump(IntlChar::isupper("A"));
// bool(true)
95. ext/intl - Normalizer class
1. U+01FA - “Precomposed” character (LATIN CAPITAL
LETTER A WITH RING ABOVE AND ACUTE)
2. A + U+030A + U+0301 - A base letter A followed by two
combining marks (U+030A COMBINING RING ABOVE
and U+0301 COMBINING ACUTE ACCENT)
3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE) followed by a
combining accent (U+0301 COMBINING ACUTE
ACCENT)
4. U+212B + U+0301 - A compatibility character (U+212B
ANGSTROM SIGN) followed by a combining accent
(U+0301 COMBINING ACUTE ACCENT)
Ǻ
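The four spellings listed above can be reduced to one with the Normalizer class. A minimal sketch, assuming ext/intl is installed; it normalizes each form to NFC and prints the resulting bytes:

```php
<?php
$forms = [
    "\u{01FA}",          // 1. precomposed character
    "A\u{030A}\u{0301}", // 2. base letter + two combining marks
    "\u{00C5}\u{0301}",  // 3. accented letter + combining accent
    "\u{212B}\u{0301}",  // 4. ANGSTROM SIGN + combining accent
];

foreach ($forms as $form) {
    $nfc = Normalizer::normalize($form, Normalizer::FORM_C);
    printf("%d bytes in, %s out\n", strlen($form), bin2hex($nfc));
}
// 2 bytes in, c7ba out
// 5 bytes in, c7ba out
// 4 bytes in, c7ba out
// 5 bytes in, c7ba out
```

Normalizing both sides before comparing or storing strings avoids treating these visually identical spellings as different values.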
100. ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
101. ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
102. ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
// This is the Euro symbol 'EUR'.
echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
// This is the Euro symbol ''.
103. PHP Extension Summary
ext/iconv: Convert between encodings
ext/mbstring: Work with multi-byte string encodings like UTF-8
ext/pcre: Special UTF-compatible matching when /u modifier enabled
ext/intl: Work with individual codepoints and graphemes
105. Disclaimer
Clever hacks and micro-optimizations are usually unnecessary and can be
detrimental to long-term maintenance!
Don’t use these unless you absolutely need them.
106. Taking Advantage of UTF-Encoded Bytes
PHP string functions can still be used in some cases:
if (str_contains($utf8, '&')) { … }
$trimmed = trim($utf8);
$firstChar = substr($utf32, 0, 4);
Requires solid understanding of UTF encodings and what the functions do
Don’t be clever unless there’s a clear advantage!
107. Splitting Strings Into Codepoints
mb_str_split($str) - returns array of individual codepoints (PHP 7.4+)
UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
(Works for codepoints, not graphemes)
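A sketch of the polyfill approach using only core functions. The `PREG_SPLIT_NO_EMPTY` flag drops the empty strings that an empty pattern otherwise produces at both ends; the wrapper name `splitCodePoints` is made up:

```php
<?php
// Hypothetical wrapper around the preg_split() polyfill.
function splitCodePoints(string $utf8): array
{
    return preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
}

// "é" written as e + COMBINING ACUTE ACCENT, followed by an emoji
$parts = splitCodePoints("e\u{0301}\u{1F638}");

echo count($parts), PHP_EOL; // 3: the accent becomes its own array element
```

The combining accent ending up in its own element is exactly the "codepoints, not graphemes" caveat.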
108. ASCII-Only UTF-8 Strings
Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions:
$isAscii = mb_detect_encoding($str, 'ASCII', true) !== false;
Micro-optimization (2x faster):
$isAscii = strlen($str) === mb_strlen($str);
Speed is fractions of milliseconds; micro-optimization only
important for parsing-heavy applications
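Both checks can be wrapped for comparison. The first function follows the length comparison from the slide (assuming ext/mbstring and valid UTF-8 input); the second is an alternative not taken from the talk that needs only core PCRE. Both function names are hypothetical:

```php
<?php
// Length comparison: in valid UTF-8, byte count equals code point count
// only when every character is a single byte, i.e. ASCII.
function isAsciiByLength(string $s): bool
{
    return strlen($s) === mb_strlen($s, 'UTF-8');
}

// Alternative (an assumption of this sketch): look for any byte above 0x7F.
function isAsciiByRegex(string $s): bool
{
    return preg_match('/[^\x00-\x7F]/', $s) === 0;
}

var_dump(isAsciiByLength("hello"));     // bool(true)
var_dump(isAsciiByLength("caf\u{E9}")); // bool(false)
var_dump(isAsciiByRegex("hello"));      // bool(true)
var_dump(isAsciiByRegex("caf\u{E9}"));  // bool(false)
```

The length comparison is only trustworthy for input that is already known to be valid UTF-8.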
109. Writing Silly Code
PHP supports Unicode in variable and function names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;
110. Writing Silly Code
PHP supports Unicode in variable and function names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;
Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT
PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
111. Writing Silly Code (Don’t Do This)
PHP supports Unicode in variable and function names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
112. Writing Silly Code (Seriously, Don’t Do This)
PHP supports Unicode in variable and function names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
$you can use = 'U+2000 EN QUAD whitespace';
114. Recap & Recommendations
● Unicode supports virtually every known modern and historic writing system
● Codepoints != Glyphs/Graphemes != Encoding
● Use and support UTF-8 everywhere, especially for user input
● PHP strings are just raw bytes
● Use mbstring functions
Simple device
Type a key, sends some numbers, same letter comes out the other side
But there needs to be a standard
Developed in 1960s for teleprinters (“Teletype”) and early computers
7-bit: each letter you type in gets converted into 7 bits
Support for:
Upper and lowercase letters
Numbers
Basic, common symbols
More control codes (CR, LF, BS, HT, BEL)
(next for examples)
(how to encode/decode)
Something really clever going on here
Group by first two bits
4 “pages” or sections, 32 chars each
Letters in alphabetical order, starting at 1 (not random)
Even more clever - converting between upper and lowercase by changing one bit
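The one-bit case trick is easy to verify: uppercase and lowercase ASCII letters differ only in bit 0x20. A small sketch for ASCII letters only:

```php
<?php
printf("%07b\n", ord('A')); // 1000001
printf("%07b\n", ord('a')); // 1100001

// Flipping bit 0x20 toggles the case of an ASCII letter.
echo chr(ord('a') ^ 0x20), PHP_EOL; // A
echo chr(ord('Q') ^ 0x20), PHP_EOL; // q
```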
“Extended ASCII” sounds like a standard, but it’s not
AKA Latin 1 for the Americas, Western Europe, Oceania, and much of Africa
Superset/extension of ISO 8859-1
Adds curly quotation marks
De-facto standard for Windows
Aka Latin 2 for Central or Eastern European Languages
UI graphics, science, and math
Standard EGA VGA encoding on gfx cards
That’s a lot! However,
In practice, most users only used one standard locally. Which was fine...
Standards proliferation
(Problem) You could add more bits, but that wasted computing resources (which were scarce at the time) for users who only needed Latin or ASCII-like characters
ATTN: 4 vs 5 char convention
Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
Code Planes: Continuous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh)
Codespace: entire range of numerical values available for encoding characters
Unicode does not specify how the character / code point should be displayed (or encoded)!
Combining Diacritical Marks
In this example: 5 code points but 4 graphemes
GRAPHEME = smallest unit of a writing system
Think about putting cursor in this text and selecting something or pressing backspace
“Zalgo text” or “glitch text”
Windows supports 52,000 family combinations
If system lacks dedicated image, individual emojis are shown
Pros: Code points always use the same number of bytes; very straightforward
Cons: not very memory efficient, can contain null bytes, not self-synchronizing
BMP = basically everything except emojis and historical scripts
“Surrogate pairs”; values are reserved, no code points with those values
Pros: more memory efficient (most of the time), works well for BMP; is self-synchronizing
Cons: 4-byte encoding logic somewhat messy; can contain null bytes
This symbol can be encoded 4 different ways
Intl normalizer class
In UTF-8: 3 bytes for snowman, 1 for space, 1 for each letter c a f e, and 1 for diacritical combining acute accent mark