Makoto Murata
eb2mmrt@gmail.com
Keio University and JEPA
 usual
 የተለመደው
 معتاد
 սովորական
 ‫געוויינטלעך‬
 सामान्य
 Įprasta
 வழக்கமான
 ပုံမှန်ထက်
 চলিত
 демейдеги
 보통의
 bình thường
 обычный
 通常の
 137,374 characters
 87,887 CJK Unified Ideographs
 Mistakenly introduced characters
 Separation good enough for most people is not good enough
for somebody (see CJK compatibility ideographs).
 Japanese do not necessarily need what Chinese need, and
vice versa.
 Implementations may support subsets.
 No subsets are defined.
 No mechanisms for describing subsets are defined.
 However, it is true that Unicode regular expressions can be
used for representing subsets.
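As a minimal sketch of that last point, a character class can stand in for a subset. The example uses Python's standard `re` module, which is not a full Unicode regular expression engine, and an illustrative subset made of two ranges. The names `SUBSET`, `in_subset`, and `first_offender` are assumptions made for this sketch.

```python
import re

# Illustrative subset: printable Basic Latin plus U+00A0..U+00FF.
# The pattern and function names are assumptions made for this sketch.
SUBSET = re.compile(r"[\u0020-\u007E\u00A0-\u00FF]*")

def in_subset(text: str) -> bool:
    """Return True when every code point of `text` lies in the subset."""
    return SUBSET.fullmatch(text) is not None

def first_offender(text: str):
    """Return the first code point outside the subset, or None."""
    for ch in text:
        if SUBSET.fullmatch(ch) is None:
            return f"U+{ord(ch):04X}"
    return None
```

A caller would typically use `in_subset` as a gate and `first_offender` only to build an error message.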
 Implementations may support subsets.
 Taxonomy of subsets
 implementation-defined lists of code points,
 standardized collections as defined in Annex A
 combination of the two.
 Annex A uses multiple notations without formal definitions.
 LATIN-1 SUPPLEMENT (collection 2) is a range 00A0-00FF.
 MULTILINGUAL EUROPEAN SUBSET 2 (collection 282)
Plane 00
Row Values within row
 00 20-7E A0-FF
 01 00-7F 8F 92 B7 DE-EF FA-FF
 02 18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
 03 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
 04 00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
 1E 02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B F2-F3
 1F 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
 20 13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A 7F 82 A3-A4 A7 AC AF
 21 05 16 22 26 5B-5E 90-95 A8
 22 00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B 48 59 60-61 64-65 82-83 95 97
 23 02 10 20-21 29-2A
 25 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 AC B2 BA BC C4 CA-CB D8-D9
 26 3A-3C 40 42 60 63 65-66 6A-6B
 FB 01-02
 FF FD
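The row notation above has no formal definition, but one plausible reading is that each row number supplies the high byte of the code point and each value or range supplies the low byte. The sketch below expands rows under that assumption. The function `expand_row` and the `ROWS` dictionary are hypothetical, and only two rows from the table are included.

```python
def expand_row(plane: int, row: int, values: str) -> set:
    """Expand one 'Row / Values within row' line into a set of code points."""
    code_points = set()
    base = (plane << 16) | (row << 8)
    for token in values.split():
        first, _, last = token.partition("-")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # a bare value is a one-cell range
        code_points.update(base + cell for cell in range(lo, hi + 1))
    return code_points

# Two rows copied from the table; the dict layout is an assumption.
ROWS = {0x00: "20-7E A0-FF", 0x21: "05 16 22 26 5B-5E 90-95 A8"}

subset = set()
for row, values in ROWS.items():
    subset |= expand_row(0x00, row, values)
```

Once expanded, the subset is an ordinary set of integers, so membership tests no longer depend on the notation.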
 JIS2004 IDEOGRAPHICS EXTENSION (collection 371) has 3695
code points.
 BASIC JAPANESE (collection 285) contains 6884 code points.
 IICORE (collection 370) has 9810 code points.
 Ranges are not very useful since code points in CJK
collections are scattered.
 Some collections defined in Annex A contain unassigned code
points.
 Unassigned code points may be assigned by later versions of
ISO/IEC 10646.
 So, validation should provide “yes”, “no”, or “I don’t know”.
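A three-valued check of this kind might look like the sketch below. It assumes the caller supplies both the code points a collection lists and the ones among them that are assigned in the version of ISO/IEC 10646 at hand. The names `Verdict` and `validate` and the toy data are hypothetical.

```python
from enum import Enum

class Verdict(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "I don't know"

def validate(code_point: int, collection: set, assigned: set) -> Verdict:
    """Three-valued membership test for a collection that may be open.

    `collection` holds every code point the collection lists;
    `assigned` holds those among them that are assigned in the known
    version of the standard. Both are supplied by the caller.
    """
    if code_point not in collection:
        return Verdict.NO
    if code_point in assigned:
        return Verdict.YES
    # Listed by the collection but unassigned today: a later version may
    # assign it, so neither "yes" nor "no" is a safe answer.
    return Verdict.UNKNOWN

# Toy data, not taken from Annex A.
collection = {0x10, 0x11, 0x12}
assigned = {0x10, 0x11}
```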
 Some collections are defined as the union of other collections.
 MODERN EUROPEAN SCRIPTS (collection 283) is the union of more
than 30 collections, each of which is a simple range.
 COMMON JAPANESE (collection 287) is defined as the union of BASIC
JAPANESE (collection 285) and an enumerated list of 609 code
points.
 A grapheme cluster is a sequence of code points that
represents a single “user-perceived character”.
 Example: ‘e’ followed by a combining accent character
 Japan now has tons of grapheme clusters.
 Plane 00
 00 41-50 52-56 59-5A 61-70 72-76 79-7A C0-C1 C3 C8-C9 CC-CD D1-D3 D5 D9-DA DD E0-E1 E3 E8-E9 F1-F3 F5 F9-FA FD
 01 04-05 0C-0D 16-19 28 2E-2F 60-61 68-6B 72-73 7D-7E
 1E BC-BD F8-F9
 UCS Sequence Identifiers
 <0104, 0301> <0105, 0301> <0104, 0303> <0105, 0303> <0118, 0301>
<0119, 0301> <0118, 0303> <0119, 0303> <0116, 0301> <0117, 0301>
<0116, 0303> <0117, 0303> <0069, 0307, 0300> <0069, 0307, 0301>
<0069, 0307, 0303> <012E, 0301> <012F, 0307, 0301> <012E, 0303>
<012F, 0307, 0303> <004A, 0303> <006A, 0307, 0303> <004C, 0303>
<006C, 0303> <004D, 0303> <006D, 0303> <0052, 0303> <0072, 0303>
<0172, 0301> <0173, 0301> <0172, 0303> <0173, 0303> <016A, 0301>
<016B, 0301> <016A, 0303> <016B, 0303>
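A collection like this one mixes single code points with multi-code-point sequences, so a validator has to look at clusters rather than individual code points. The sketch below parses UCS Sequence Identifiers and checks a text cluster by cluster. Its segmentation (a base code point plus any following combining marks) is a deliberate simplification of the rules in UAX #29, and all function names are hypothetical.

```python
import unicodedata

def parse_usi(text: str) -> str:
    """Turn a UCS Sequence Identifier such as '<0104, 0301>' into a string."""
    body = text.strip().strip("<>")
    return "".join(chr(int(part, 16)) for part in body.split(","))

def clusters(text: str) -> list:
    """Naive segmentation: a base code point plus following combining marks."""
    result = []
    for ch in text:
        if result and unicodedata.combining(ch) != 0:
            result[-1] += ch  # attach the mark to the preceding base
        else:
            result.append(ch)
    return result

def all_allowed(text: str, code_points: set, sequences: set) -> bool:
    """Accept each cluster if it is an allowed code point or an allowed sequence."""
    return all(
        (len(c) == 1 and ord(c) in code_points) or c in sequences
        for c in clusters(text)
    )

# A small excerpt of the data above, chosen for illustration.
code_points = {0x0041, 0x0104}
sequences = {parse_usi("<0104, 0301>"), parse_usi("<0069, 0307, 0301>")}
```

Storing sequences as strings in a set keeps the lookup hash-based, in the same way as for single code points.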
 a collection applicable to persons' names in Japanese public
service.
 The number of code points is more than 52000 and that of
grapheme clusters is more than 10000.
 Kyouiku Kanji
 elementary school education
 1006 characters.
 Jouyou Kanji
 use in official government documents
 2136 characters
 A list of such subsets from Asian governments is available at
https://github.com/cjkvi/cjkvi-tables
 Font coverage is based on Adobe-Japan1, JIS standards, 10646
collections and so forth.
 But font vendors tend to add several characters for commercial
reasons.
 Font vendors in CITPC (Japanese Character Information
Technology Promotion Council) are searching for
machine-readable notations for describing font coverage.
 Unicode regular expressions can be used for representing
subsets.
 The Unicode Common Locale Data Repository uses regular expressions
for defining subsets.
 10646 collections (even CJK collections) can be captured by
Unicode regular expressions in theory.
 Cannot reference collections defined in ISO/IEC 10646.
 Cannot reference other regular expressions.
 Copying is acceptable for small collections, but it is not
acceptable for huge collections.
 COMMON JAPANESE (collection 287) references JIS2004
IDEOGRAPHICS EXTENSION (collection 371), which contains 3695
code points.
 Regular expression engines are slow.
 Hash-based set operations are much faster.
 20 times faster for MULTILINGUAL EUROPEAN SUBSET 2 collection.
 1600 times faster for the IICORE collection.
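The comparison can be tried on a small scale with the sketch below. It uses Python's `re` module rather than ICU and a made-up scattered collection rather than a real Annex A collection, so it will not reproduce the ratios on the slide. It only shows how the two approaches can be checked against each other and timed.

```python
import re
import timeit

# A hypothetical scattered collection: every third code point in a CJK range.
collection = frozenset(range(0x4E00, 0x9FA6, 3))

# Equivalent character class that enumerates each code point explicitly.
pattern = re.compile("[" + "".join(chr(cp) for cp in sorted(collection)) + "]")

def by_regex(ch: str) -> bool:
    """Membership test through the regular expression engine."""
    return pattern.fullmatch(ch) is not None

def by_hash(ch: str) -> bool:
    """Membership test through a hash-based set lookup."""
    return ord(ch) in collection

sample = [chr(cp) for cp in range(0x4E00, 0x5200)]

if __name__ == "__main__":
    for check in (by_regex, by_hash):
        seconds = timeit.timeit(lambda: [check(ch) for ch in sample], number=20)
        print(f"{check.__name__}: {seconds:.4f} s")
```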
 Interesting but never implemented.
 Its own syntax (rather than regular expressions) for
representing ranges and individual code points.
 Kernel and hull elements for defining open collections.
 References to other subset descriptions or well-known
subsets (e.g., collections in ISO/IEC 10646)
 Set operations (union, inverse, difference, and intersection).
 No mechanisms for describing grapheme clusters.
 Unicode regular expressions as atomic expressions.
 <code>[abc]</code>
 References to collections defined in ISO/IEC 10646.
 <repertoire registry="10646" number="370"/>
 Typically implemented by hash-based sets.
 References to well-known subsets.
 <ref href="URI-of-another-CREPDL-script"/>
 Set operations by means of the union, intersection, and difference
elements.
 kernel and hull
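One way to picture how kernel and hull interact with the set operations is to carry both through every operation: the kernel holds code points that are definitely in the subset, the hull those that are possibly in it. The sketch below is an interpretation made for illustration, not the normative semantics of CREPDL, and the `Repertoire` class is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Repertoire:
    kernel: frozenset  # code points definitely in the subset
    hull: frozenset    # code points possibly in the subset (includes the kernel)

    def union(self, other: "Repertoire") -> "Repertoire":
        return Repertoire(self.kernel | other.kernel, self.hull | other.hull)

    def intersection(self, other: "Repertoire") -> "Repertoire":
        return Repertoire(self.kernel & other.kernel, self.hull & other.hull)

    def difference(self, other: "Repertoire") -> "Repertoire":
        # Definitely in the result: definitely in self, impossible in other.
        # Possibly in the result: possibly in self, not definitely in other.
        return Repertoire(self.kernel - other.hull, self.hull - other.kernel)

    def check(self, code_point: int) -> str:
        if code_point in self.kernel:
            return "yes"
        if code_point not in self.hull:
            return "no"
        return "unknown"

# Toy repertoires with small integers standing in for code points.
a = Repertoire(frozenset({1, 2}), frozenset({1, 2, 3}))
b = Repertoire(frozenset({2}), frozenset({2, 3, 4}))
```

Because both parts are plain hash-based sets, each check stays a constant-time lookup however the repertoire was composed.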
 An open source implementation of CREPDL is available at
https://github.com/CITPCSHARE/CREPDL.
 Written in F# (a functional programming language)
 Uses the ICU regular expression engine
 Large collections in Annex A of ISO/IEC 10646 are implemented as
hash-based sets. Validation against such collections is thus very
efficient.
 Another GitHub repository with example CREPDL scripts is available at
https://github.com/CITPCSHARE/CREPDLScripts.
 Create the DIS of ISO/IEC 19757-7 CREPDL and start a ballot.
 Sell CREPDL to font vendors in the Japanese Character
Information Technology Promotion Council, of which I am a
board member.
 Compare coverage of fonts automatically by comparing
CREPDL scripts.

CREPDL: Protect Yourself from the Proliferation of Unicode Characters


Editor's Notes

  1. I am going to talk about CREPDL, an XML language for describing subsets of Unicode or 10646. When we have to handle huge subsets, my CREPDL validator is more than 1000 times faster than the ICU Unicode regular expression engine.
  2. Let’s begin with an exam. I used Google this morning to translate “usual” to many languages. …. PowerPoint can display all of them. Word can. Emacs can. But my favorite XML editor, Oxygen, cannot. Why?
  3. How many characters does Unicode have? Now, the latest version is 11. It has so many …
  4. Do you believe that all CJK ideographic characters are needed? A short answer is No.
  5. So, we have too many characters. Does Unicode mandate the support of all characters?
  6. Then, how about 10646? There are interesting and significant differences.
  7. So, let’s have a quick look at Collections in 10646. Simple ones are very simple.
  8. This is more complicated, but is still not extremely complicated.
  9. But CJK collections are much bigger.
  10. Here let me introduce open collections.
  11. We very often want to define subsets in terms of other subsets.
  12. I have used the word “character”, but what users think is a “character” is not necessarily a single code point in Unicode.
  13. Let’s see the first collection containing grapheme clusters.
  14. But a CJK collection containing grapheme clusters is much, much bigger.
  15. I have talked about collections, which are subsets defined by 10646. But there are many other subsets.
  16. An important type of subset is font coverage. Each font implicitly defines a subset.
  17. Every Westerner says we only have to use Unicode regular expressions.
  18. There are two significant problems. They are not problematic for small collections, but are, in my opinion, fatal for large collections. The first problem is the inability to reference other subsets.
  19. The other problem is performance.
  20. An interesting alternative is “A Notation ….”. It has never been implemented, but we can still learn from it.
  21. CREPDL is my attempt to combine the best parts of both Unicode regular expressions and the W3C notation.
  22. I have an implementation of CREPDL. It is open source. It is ….
  23. CREPDL is an attempt to combine regular expressions and the W3C notation.