The document is a slide presentation by Adrian Roselli for London Web Standards about using the lang attribute in HTML. It discusses what the lang attribute is, examples of its use, research showing that about 47% of pages set it on the <html> element, and its importance for HTML validation, internationalization, accessibility, and screen readers. It also covers fun facts like the history of the "en-US-x-Hixie" language code.
Mind Your Lang — London Web Standards
1. Mind Your lang
Presented by Adrian Roselli (@aardrian)
for London Web Standards
Slides from this talk will be available at
rosel.li/lws18
London skyline by Taras Kalapun, CC BY 2.0
2. About Adrian Roselli
• I’ve written some stuff,
• Member of W3C,
• Building for the web since 1993,
• Learn more at AdrianRoselli.com,
• Avoid on Twitter @aardrian.
4. What Is lang?
• Examples:
<html lang="en">
<html lang="en-gb">
<html lang="en-us">
<html lang="en-GB-x-hixie">
• Source:
BCP47: Tags for Identifying Languages,
https://tools.ietf.org/html/bcp47
We’ll come back to that last one.
6. Who Uses lang?
• WHATWG Bug: “why do these examples of <html>
lack the lang attribute?”
This is where my research started.
“Why not? Realistically,
few people include it. It
just means the language
is unknown.”
7. Who Uses lang?
• Pulled January 2015 archive from
WebDevData.org (a W3C Community Group),
• Parsed 84,054 pages,
• Found that 39,433 pages use the lang
attribute on the <html> element,
• 47% use <html lang="…">.
12,762 use xml:lang, which is wrong.
8. Who Uses lang?
• “why do these examples of <html> lack the
lang attribute?”
• WHATWG HTML bug (26942)
• Reported: 2014-09-30
• Resolved: 2016-04-18
• Git merge:
• Editorial: Add lang to most examples #1061
Spoiled the surprise, I know, but we aren’t here for a bug.
10. Why Would You Use lang?
• HTML 5 Specification
• HTML Validation
• Internationalization (i18n)
• WCAG 2.0 A, AA
• Numbers
• Dates
• Hyphens
• Quotes
• Screen Readers
12. HTML 5 Specification
• The spec provides a warning,
• Notes that it must match detected language of
the page,
• Identifies ways in which it is used,
• Added in April 2016
• add warning/advice about lang attribute use #218
https://github.com/w3c/html/issues/218
15. HTML Validation
• The W3C HTML validator compares the
following attributes on the page with the
detected page language:
• dir
• lang
• If there is a mismatch, the validator will
provide a warning,
• If there is no dir or lang, the validator will
provide a warning.
It will know if you lie.
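As a rough illustration of the mismatch case, the page below declares English while its content is French. The title and sample sentence are invented for this sketch, and a page this short may not contain enough text for language detection to kick in.

```html
<!DOCTYPE html>
<!-- Hypothetical page: the declared language is English, the content is French. -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Exemple</title>
</head>
<body>
  <p>Ce paragraphe est écrit en français, pas en anglais.</p>
</body>
</html>
```

Changing the opening tag to <html lang="fr"> makes the declared language match the content.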
18. Internationalization (i18n)
• Spelling and grammar checkers:
• spellcheck attribute (at caniuse.com)
• CSS:
• ::first-letter (at caniuse.com)
• Hanging punctuation
• Translation tools (particularly when looking at
parts of a page).
https://www.w3.org/International/questions/qa-lang-why
19. Internationalization (i18n)
• Font selection for CJK (for political reasons).
https://medium.com/behancetech/localization-gotchas-for-asian-languages-cjk-e52a57c0fde1
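A minimal sketch of steering font choice per language with the :lang() pseudo-class, for cases where the browser does not pick a suitable font on its own. The font names are assumptions used as placeholders; substitute fonts you actually ship or can rely on being installed.

```html
<style>
  /* Placeholder font names; each rule applies to content whose language matches. */
  :lang(ja)      { font-family: "Noto Sans JP", sans-serif; }
  :lang(ko)      { font-family: "Noto Sans KR", sans-serif; }
  :lang(zh-Hans) { font-family: "Noto Sans SC", sans-serif; }
  :lang(zh-Hant) { font-family: "Noto Sans TC", sans-serif; }
</style>

<!-- The same code point, tagged with four different languages. -->
<p lang="ja">直</p>
<p lang="ko">直</p>
<p lang="zh-Hans">直</p>
<p lang="zh-Hant">直</p>
```

The character has one Unicode value in all four paragraphs; only the lang value and the font it selects differ.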
22. WCAG 2.0 A, AA
• Guideline 3.1 Readable: Make text content
readable and understandable.
• 3.1.1 Language of Page (Level A)
• H57: Using language attributes on the html
element
• 3.1.2 Language of Parts (Level AA)
• H58: Using language attributes to identify changes
in the human language
https://www.w3.org/TR/2008/REC-WCAG20-20081211/#meaning-doc-lang-id
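The two success criteria can be sketched in one small page: lang on the html element covers 3.1.1, and lang on individual elements covers 3.1.2. The text and the language switcher URLs below are made up for illustration.

```html
<!DOCTYPE html>
<!-- 3.1.1 Language of Page: declare the primary language once. -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Language of parts example</title>
</head>
<body>
  <!-- 3.1.2 Language of Parts: mark the phrase that changes language. -->
  <p>She greeted us with a cheerful <span lang="fr">bonjour</span>.</p>

  <!-- A language switcher: each language name is written in that language. -->
  <ul>
    <li><a href="/de/" lang="de" hreflang="de">Deutsch</a></li>
    <li><a href="/fr/" lang="fr" hreflang="fr">Français</a></li>
  </ul>
</body>
</html>
```

Here lang describes the link text itself, while hreflang describes the language of the page the link points to.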
24. Numbers
• A browser can adjust decimal characters in
number fields,
• Some use comma, some use period,
• Yes, this is for Latin scripts.
• Do not worry about browser support unless
you are mixing within a page.
• In that case, Firefox is the way to go.
If left blank, the browser should go with locale settings.
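A hedged sketch of mixing languages on number fields within one page. The ids, labels, and the choice of "nb" (Norwegian Bokmål) are assumptions, not taken from the speaker's test page.

```html
<!-- The value attribute always uses a period, whatever the language. -->
<label for="price-en">Price (English)</label>
<input type="number" id="price-en" lang="en" step="0.01" value="1.50">

<!-- A browser that honors lang here may display a comma as the decimal separator. -->
<label for="price-nb">Pris (norsk)</label>
<input type="number" id="price-nb" lang="nb" step="0.01" value="1.50">
```

Only the displayed format should change; the submitted value stays in the period-delimited form.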
30. Hyphens
• For browsers that support hyphens, you will
enjoy the benefit just by using the attribute.
• This assumes you use the following CSS:
• hyphens: auto;
• -ms-hyphens: auto; (ugh)
• -webkit-hyphens: auto; (also ugh)
• Browser support:
• http://caniuse.com/#search=hyphens
If left blank, the browser should go with locale settings.
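Put together, the CSS above and a lang value might look like the following sketch. The class name, width, and sample text are invented for illustration.

```html
<style>
  /* A narrow column so long words need to break. */
  .narrow {
    width: 8em;
    -webkit-hyphens: auto; /* prefixed forms first */
    -ms-hyphens: auto;
    hyphens: auto;         /* unprefixed form last */
  }
</style>

<!-- The browser needs a language to choose a hyphenation dictionary. -->
<p class="narrow" lang="en">
  Internationalization and accessibility considerations are
  interdependent responsibilities.
</p>
```

Removing the lang attribute from the paragraph (and from any ancestor) is a quick way to see whether the browser stops hyphenating.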
34. Quotes
• Let the browser choose the quote marks
based on the language.
• This assumes you use the following HTML:
• <q>…</q>
Obviously you can override this with CSS, but that would be silly.
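A small sketch of the same q element under three lang values. The sample phrases are made up, and the marks shown depend on the browser.

```html
<!-- The markup is identical apart from lang; the browser supplies the quote marks. -->
<p lang="en"><q>Hello</q></p>
<p lang="fr"><q>Bonjour</q></p>
<p lang="de"><q>Guten Tag</q></p>
```

A browser that localizes quotes would typically use curly double quotes for English, guillemets for French, and low-high quotes for German.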
38. Screen Readers
• VoiceOver uses it to auto-switch voices.
• VoiceOver can speak using a different accent.
• JAWS uses it to load the correct phonetic engine /
phonologic dictionary.
• NVDA uses it in the same way as VoiceOver and JAWS.
• For HTML in ePub or Apple iBooks documents, it affects
how VoiceOver will read the book.
• Leaving out the lang attribute may require the user to
manually switch to the correct language for proper
pronunciation.
The gist is that things can sound funny if done wrong.
42. Fun Facts
• WHATWG HTML 5
<html class=split lang=en-US-x-hixie>
• W3C HTML 5.0
<html lang="en-US-x-Hixie">
• W3C HTML 5.1
<html lang="en">
You can confirm this by viewing the source of each.
43. Fun Facts
“Private-use subtags do not appear in the
subtag registry, and are chosen and maintained
by private agreement amongst parties.”
“Because these subtags are only meaningful
within private agreements and cannot be used
interoperably across the Web, they should be
used with great care, and avoided whenever
possible.”
http://www.w3.org/International/articles/language-tags/Overview.en.php#extension
44. Fun Facts
• There is a normative spec:
• Hixie English
• Version: 1.0-pre43
• Language Tag: en-GB-x-Hixie
• “This is a normative reference to Hixie English.
Hixie English is a variant of the language
spoken by the majority of the residents of the
United Kingdom (England) and the United
States of America.”
http://ian.hixie.ch/bible/english
46. Mind Your lang
Presented by Adrian Roselli (@aardrian)
for London Web Standards
Slides from this talk will be available at
rosel.li/lws18
London skyline by Taras Kalapun, CC BY 2.0
Editor's Notes
The most exciting talk you have ever seen about a single HTML attribute.
Maybe.
Specifically, what is the attribute?
Where does it live?
The first example sets the language of the page as English
The second sets the language as British English
The third sets the language as American English
The fourth is… we’ll come back to that.
“Case distinctions are ignored in extensions (as with any language subtag) and normalized subtags of this type are expected to be in lowercase.”
This question might not seem relevant, but it helps explain how I got here.
I stumbled across this issue when trying to suss out why en-GB-x-Hixie was a thing.
But it was the response to the issue that bothered me.
It made an assertion with no support.
I set out to find data to see if that assertion was true.
Nearly half used the lang attribute.
I consider this different than “few people”.
I also found that nearly 13,000 use xml:lang, which is only valid for XML or HTML5 polyglot.
The good news is that the bug was resolved.
I am not here to re-litigate the bug.
It did get me an acknowledgment in the WHATWG spec. So yeah.
But now you have context for the following slides.
In my opinion, this was the more important question.
This is where it gets exciting.
I have collected 9 reasons.
For context, that is about the same time WHATWG closed its lang bug.
The W3C spec was learning from WHATWG’s mistake?
Either way, clarity was needed.
The language of HTML documents is indicated using a lang attribute (on the html element itself, to indicate the primary language of the document, and on individual elements, to indicate a change in language). It provides an explicit indication to user agents about the language of content in order to enable language specific behavior. For example, use of an appropriate language dictionary; selection of an appropriate font or glyphs for characters shared between different languages; or in the case of screen readers and similar assistive technologies with voice output, pronunciation of content using the correct voice / language library.
Incorrect or absent lang attributes can produce unexpected results in other circumstances, as they are also used to determine quotation marks for q elements, styling such as hyphenation, case conversion, line-breaking, and spell-checking in some editors, etc.
Setting the lang attribute to a language which does not match the language of the document or document parts will result in some users being unable to understand the content.
It stands to reason it would come into play here.
Heuristics!
This warning is helpful.
It explains how it came to that conclusion.
E.g., it sees that the page content does not match the declared language.
Localization or internationalization.
In the UK you likely worry about it more than we do in the U.S.
Though with our diversity we need to be better at in-country localization.
https://caniuse.com/#search=spellcheck: “Browsers have different behavior in how they deal with spellchecking in combination with the the lang attribute. Generally spelling is based on the browser's language, not the language of the document.”
http://caniuse.com/#search=%3A%3Afirst-letter: “The spec says that both letters of digraphs which are always capitalized together (such as "IJ" in Dutch) should be matched by ::first-letter, but no browser has ever implemented this.”
Mandarin, Cantonese, Japanese, and Korean
Yes, Korean fonts also contain Chinese characters, in addition to Hangul.
These 4 languages contain the same Chinese characters.
Many characters are drawn differently in each language.
Each language’s version of the character shares the same Unicode value.
Hopefully this graphic is more obvious.
You can use the :lang() pseudo-class selector to choose the right font when the browser cannot.
There are potential political ramifications for using the wrong character.
Accessibility!
A single A Success Criterion is to define the language of the page.
A double A SC is to define the language of any parts of the page that deviate from that.
For many sites, 3.1.2 may not be an issue.
However, think of language switchers, which are a language change in-page to present the other language name.
Which are not words.
In a vacuum, a browser will default to the current system’s language setting.
But this is where you can help make the experience better for people who are not on their regular system.
Travelers using computers in foreign countries come to mind for me pretty easily.
This is using Firefox.
The first two fields are English.
The second two are Norwegian.
Norwegians use a comma as a decimal delimiter.
Here you can see me increasing the value in each field.
The first is in .01 steps, then in whole numbers.
For Norwegian it is the same.
In each case, it can be confusing for non-native users.
Ok, here we go.
Yeah, no.
I did make a test page, so you can play around later.
This has more support, though.
https://caniuse.com/#search=hyphens: Chrome < 55 and Android 4.0 Browser support "-webkit-hyphens: none", but not the "auto" property. It is advisable to set the @lang attribute on the HTML element to enable hyphenation support and improve accessibility.
In case you wanted to disable hyphens.
A sample using Flexbox for layout.
6 columns of text at just too narrow a width.
Once those words are allowed to hyphenate, we have a proper six columns for that width.
All I am doing is adding and removing the lang attribute.
Specifically quote marks.
Not inspirational messages.
Free, built-in localization.
I have an example of dialog. This may be familiar to some of you.
All I am doing is changing the lang value to demonstrate how the browser switches the quote marks.
This was done in Firefox, works in WebKit browsers.
Did not work in Edge this morning.
There is a lot happening under the hood.
Stuff you need not manage other than using the right attribute/value.
NVDA actively changes the pronunciation.
Worst or best German text track ever for club music?
JAWS prepends the block of words with the chosen language.
Does not change pronunciation for my English-only system.
Bringing it back to where we started.
This is how things used to be.
But what was with that -x-Hixie thing?
The -x-[word] is for private-use subtags.
Note, “avoided whenever possible.”
But wait… it gets better…
There is a normative spec.
This means you too can make up your own language and have it considered acceptable by some standards.
Also, don’t.
Don’t be WHATWG.
The most exciting talk you have ever seen about a single HTML attribute.
Maybe.