This document discusses representing text from different languages and scripts on the web. It begins by noting that while the web is international, much of its text remains in English. It then explores some of the challenges of displaying non-Roman scripts like Cyrillic, Japanese, and Chinese on the web due to differences in character encodings and font support across platforms. The document emphasizes that XML and Unicode help address these issues by providing a universal character encoding that supports all languages. It also discusses the requirements for properly displaying different scripts, including a character set, font, input method, and software support.
Computer languages can be categorized into high-level languages, low-level languages, and machine language. High-level languages are easier for humans to read and write but require compilers or interpreters, while low-level languages like assembly language are closer to machine language but still use symbolic instructions. Machine language uses only binary and is directly executable by computers. Languages have evolved through five generations from low-level machine and assembly languages to modern high-level languages.
What are higher-level and lower-level languages in programming?
This document differentiates between higher-level and lower-level languages. Higher-level languages are more abstract, user-friendly, and portable across platforms but generally slower, while lower-level languages are closer to machine language and faster, but harder for humans to understand and, being closer to the hardware, demand more expertise to use.
This document discusses different types of computer languages. It begins by distinguishing between programming languages and other computer languages like markup languages. It then categorizes languages as either low-level or high-level. Low-level languages like machine language and assembly language are closer to machine code, while high-level languages use English-like syntax and are translated to machine code. Several examples of each type are provided along with their advantages and disadvantages.
The document provides an overview of different programming languages including their origins, uses, and key features. It discusses low-level languages like machine language and assembly language as well as high-level languages including BASIC, Java, C, C++, C#, Objective-C, PHP, Python, Ruby, JavaScript, and SQL. Each language is briefly described in terms of its history, applications, and role in software development.
The document discusses the compilation process, which consists of four phases: lexical analysis, parsing, semantic analysis/code generation, and code optimization. Lexical analysis identifies the language's tokens. Parsing checks syntactic validity using grammars. Semantic analysis generates machine code from the parse tree if the program is semantically valid. Optimization aims to improve code efficiency. The overall goal of a compiler is to correctly translate high-level code into efficient machine language.
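The first of those phases, lexical analysis, can be sketched in a few lines of Python. This is an illustrative tokenizer only: the token names and patterns are invented for the sketch and do not come from any particular language's grammar.

```python
import re

# Token kinds and their patterns -- an invented, minimal set for illustration.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),        # whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (kind, text) pairs for each token, dropping whitespace."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x = 42 + y")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]
```

The output stream of (kind, text) pairs is what a parser would then check against the grammar in the next phase.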
This document provides an overview of different computer languages. It begins by explaining that computer languages allow communication between humans and computers. It then distinguishes between low-level languages like machine language and assembly language, which are close to hardware, and high-level languages, which are closer to human languages. Popular high-level languages mentioned include Python, JavaScript, Java, C, and C++. Procedural languages require specifying how to solve a problem, while non-procedural languages only require specifying what problem to solve.
Computer languages can be categorized as either low-level or high-level. Low-level languages like machine language and assembly language provide little abstraction from computer hardware and use numeric codes that are directly understandable by computers. High-level languages allow problems to be solved using terminology more familiar to humans and are easier for programmers to use. Examples include C, C++, Java, and JavaScript. Operating systems act as an interface between application software, hardware, and users, performing functions like memory management, task scheduling, and file handling.
Computer languages allow humans to communicate with computers through programming. There are different types of computer languages at different levels of abstraction from machine language up to high-level languages. High-level languages are closer to human language while low-level languages are closer to machine-readable code. Programs written in high-level languages require compilers or interpreters to convert them to machine-readable code that can be executed by computers.
There are two types of programming languages: high-level languages and low-level languages. High-level languages are closer to human languages and provide more abstraction from machine-level instructions, while low-level languages like assembly language closely map to processor instructions. Programs written in high-level languages need to be translated into machine code using compilers or interpreters, while low-level language programs are assembled directly into machine code. Common examples of high-level languages include C++, Java, and Python, while assembly language is the standard example of a low-level language.
The document discusses the differences between low-level machine code used by CPUs and high-level computer languages used by programmers. It explains that programmers write source code in high-level languages, which are then compiled into machine code through a compiler or interpreted line-by-line using an interpreter. Compiled code tends to run faster while interpreted languages typically have more programmer-friendly features. Modern dynamic languages often use a just-in-time compiler to bridge performance gaps.
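The compile-versus-interpret distinction described above can be sketched with a toy arithmetic AST. Everything here (the tuple-based tree, `interpret`, `compile_expr`) is invented for illustration; the "compiler" simply translates the tree once into a Python expression, which Python's own compiler then turns into bytecode.

```python
# Interpreter: walks the tree and re-evaluates every node on each run.
def interpret(node):
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = interpret(left), interpret(right)
    return a + b if op == "+" else a * b

# "Compiler": translates the tree once into source text, then hands it to
# Python's compile() so later runs execute pre-built bytecode instead.
def compile_expr(node):
    if isinstance(node, int):
        return str(node)
    op, left, right = node
    return f"({compile_expr(left)} {op} {compile_expr(right)})"

ast = ("+", 2, ("*", 3, 4))          # represents 2 + (3 * 4)
print(interpret(ast))                 # 14 -- evaluated by tree-walking
code = compile(compile_expr(ast), "<expr>", "eval")
print(eval(code))                     # 14 -- evaluated from compiled bytecode
```

The translation cost is paid once in the compiled case, which is the essence of why compiled code tends to run faster on repeated execution.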
The document discusses computer languages and programming. It describes three categories of programming languages: machine languages, assembly languages, and high-level languages. It also discusses common programming language tools like assemblers, compilers, linkers, and interpreters. Popular programming languages like FORTRAN, COBOL, BASIC, Pascal, C, C++, C#, Java, RPG, LISP and SNOBOL are also mentioned. The key characteristics and uses of machine languages, assembly languages, high-level languages, compilers, linkers and interpreters are summarized.
There are different types of computer languages, including high-level and low-level languages. High-level languages like Pascal, Ada, Logo, and FORTRAN are easily understood by humans but must be translated before the computer can understand them. Low-level languages like assembler and machine code are understood directly by computers but are difficult for humans to read. Translators like interpreters and compilers are used to translate high-level languages into low-level languages understood by computers. Interpreters translate programs line-by-line while compilers translate the entire program at once.
High-level languages like C, FORTRAN, and Pascal allow programmers to write programs independently of a particular computer's architecture. They are considered "high-level" because they are more similar to human languages than machine languages. Examples are Python, C, Fortran, and Pascal, which have syntax that is closer to human languages and easier for humans to read compared to low-level languages like assembly, which are closer to machine languages.
The document discusses different types of computer languages and language translators. There are two main types of computer languages - low-level languages which are close to machine language like assembly, and high-level languages which are closer to human languages like C++ and Java. Language translators like compilers, interpreters, and assemblers are used to translate programs written in high-level and assembly languages into machine-readable object code. Compilers translate the entire program at once while interpreters translate line-by-line, and assemblers specifically assemble assembly language programs into machine code.
This presentation describes the development of computer languages, tracing slide by slide how machine languages have been updated over time.
A programming language allows humans to write programs that instruct computers to perform tasks. Programming languages can be classified as low-level languages that are closer to machine code like assembly, or high-level languages that are easier for humans to read and write like Python and Java. The document outlines different types of programming languages including low-level languages that are machine-oriented and specific, assembly languages that use symbols instead of binary, and high-level languages that are easier for humans, more portable between machines, and abstracted from the underlying hardware.
This document discusses different types of computer programming languages. It defines machine language, assembly language, procedure languages, and object-oriented languages. Machine language uses binary instructions directly understood by computers. Assembly language uses human-readable instructions for the CPU. Procedure languages specify steps and procedures. Object-oriented languages represent concepts as objects with data and methods. Examples given are assembler, C, C++, and how operating systems use those languages.
Interfacing With High Level Programming Language
High Level Programming Language
Categories of programming languages
Processing a High-Level Language Program
Advantages of high-level languages
Interface-Based Programming
Interfaces in Object Oriented Programming Languages
Implementing an Interface
This document provides an overview of computer languages and the programming process. It discusses machine language as the lowest-level language understood by computers. Higher-level languages like assembly and programming languages make programming easier for humans. It also describes the programming process, which involves defining problems, planning solutions, coding, testing, and documenting programs. Finally, it discusses different types of program translators like assemblers, compilers, and interpreters that translate human-readable code into machine-readable code.
This document discusses keyboards and internationalization. It provides examples of different keyboard layouts for languages like Hebrew, Russian, and Japanese. It explains that input methods editors (IMEs) are used for languages with large character sets, like Chinese, to map keyboard inputs to characters. Common Chinese IME types include pinyin romanization, component-based, and stroke-based methods. It also gives an overview of specific Chinese IMEs like Bopomofo, Changjie, and Dayi and how they map keyboard inputs to Chinese characters.
This document discusses programming languages and development tools. It defines programming languages and categorizes them from low-level machine languages to high-level languages. Popular languages today include Visual Basic, C, C++, Java, and SQL. The document also outlines tools for application development, macros, RAD, web pages using HTML and XML, and multimedia programs.
The document discusses UI testing for multi-lingual applications. It covers testing the text and controls to ensure they accommodate longer translated strings, storing localizable strings in separate files, testing sorting, justification and directionality for different languages, checking for issues with images, printing for international paper sizes and fonts, and introduces pseudo-translation testing to find localization issues without a full translation.
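Pseudo-translation, mentioned above, can be sketched in a few lines: accenting characters and padding strings makes truncation, hard-coded text, and encoding bugs visible before any real translation exists. The accent map, bracket markers, and 30% expansion factor below are illustrative assumptions, not a standard.

```python
# Map vowels to accented equivalents so non-ASCII handling gets exercised.
ACCENTED = str.maketrans("aeiouAEIOU", "àéîöûÀÉÎÖÛ")

def pseudo_translate(s, expansion=0.3):
    """Accent vowels, pad ~30% (translations are often longer than English),
    and wrap in brackets so clipped strings are obvious on screen."""
    padded = s.translate(ACCENTED)
    pad = "~" * max(1, int(len(s) * expansion))
    return f"[{padded}{pad}]"

print(pseudo_translate("Save file"))   # [Sàvé fîlé~~]
```

Running the UI with every localizable string passed through a function like this reveals layouts that cannot accommodate longer text, and any string that appears unbracketed was never routed through the string table at all.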
There are three main types of computer languages:
1. Machine language - Understood directly by computers as binary, fast but difficult for humans.
2. Assembly language - Uses mnemonics like ADD instead of binary, easier for humans but still machine-dependent. Requires an assembler to translate to machine language.
3. High-level languages - Are machine independent, use familiar words and symbols, and require compilers or interpreters to translate to machine language. They are easier for programmers but provide less control over hardware.
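The mnemonic-to-opcode translation an assembler performs (point 2 above) can be sketched as follows. The opcode table is invented for illustration and does not match any real instruction set.

```python
# Hypothetical opcode table: each mnemonic maps to a numeric machine opcode.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "HALT": 0xFF}

def assemble(lines):
    """Translate 'MNEMONIC operand' source lines into (opcode, operand) pairs,
    the numeric form a machine could execute directly."""
    program = []
    for line in lines:
        parts = line.split()
        mnemonic = parts[0]
        operand = int(parts[1]) if len(parts) > 1 else 0
        program.append((OPCODES[mnemonic], operand))
    return program

source = ["LOAD 10", "ADD 20", "STORE 30", "HALT"]
print(assemble(source))
# [(1, 10), (2, 20), (3, 30), (255, 0)]
```

This also shows why assembly is machine-dependent: the table of valid mnemonics and opcodes is fixed by the target processor.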
Rome was the capital of the Roman Empire from around 400 BC to 400 AD, which covered much of Europe. It was originally a Republic where citizens voted for their government. The Forum was the center of Roman public life, containing buildings like the Senate and spaces for speeches. Major Roman gods included Jupiter, king of the gods, Mars the god of war, and Venus the goddess of love. Roman entertainment included chariot races at the Circus Maximus and gladiator fights at the Colosseum. Julius Caesar was a famous Roman general who was later assassinated, and his nephew Augustus became the first Roman emperor and established the Praetorian Guard. Christianity later spread throughout the Roman Empire despite Roman persecution of Christians and Jews.
The document provides information on various ancient writing systems including:
1) Rock paintings from the Stone Age found in South Africa that depict humans and animals and are preserved in shades of ochre and white.
2) Cuneiform writing developed by the Sumerians around 3500 BC which used wedge-shaped markings pressed into wet clay tablets.
3) Egyptian hieroglyphics formed around 3000 BC with thousands of symbols representing ideas, objects, and sounds that were carved on walls and monuments.
4) The Greek alphabet developed around 1000 BC from the Phoenician alphabet and was adapted to represent vowels, increasing its accuracy for the Greek language. It evolved into modern scripts and was widely adopted.
Technology company XYZ today announced a new line of light, powerful laptops designed for modern students and professionals. The new laptops offer fast processors, sharp displays, long battery life, and affordable prices to help users browse, create, and connect.
The document summarizes the evolution of the Latin alphabet from its origins in the 7th century BC, when the Etruscan alphabet was adapted for Latin. It originally had 21 letters before additions and replacements over time, including the addition of "G" in the 3rd century BC and of "W", "J", and "U" in the Middle Ages, completing the modern alphabet by the 15th century.
The Greek alphabet is one of the oldest writing systems, dating back to the 9th-8th century BC. It uses separate symbols for consonants and vowels and has influenced the development of other alphabets. The original Greek alphabet was derived from the Phoenician alphabet and over time evolved into three main variants, which gave rise to other scripts such as the Etruscan alphabet as well as the modern Greek alphabet. The Greek alphabet continues to be used in fields like mathematics and science and in naming stars and storms.
Writing originated as pictograms but evolved to become more phonetic over time. Early writing systems included logographic scripts like Sumerian cuneiform and Egyptian hieroglyphs which represented morphemes and words. Later, the rebus principle was developed where symbols began representing sounds, moving towards syllabic scripts. The Phoenician alphabet was adapted by Greeks to create a true alphabet representing individual phonemes, which was then adapted by Romans. Runic writing also developed as an offshoot of early European scripts.
The Celts were loosely connected groups of people who lived across Europe during the Iron Age. They spoke similar languages and shared cultural and religious practices. The Celts lived in scattered farming villages and hill forts, building houses out of materials like wood, stone, and mud. They wore colorful woven clothes and ate a variety of plants and animals. The Celts had many gods and goddesses and were led spiritually by druids. They believed in an afterlife and buried the dead with possessions. Celtic society was divided into social classes including warriors and intellectuals.
This document discusses different styles of lettering used in technical drawings. It describes Gothic, Roman, and Italic letter styles, and defines each by the characteristics of their elementary strokes. It also lists extended, condensed, lightface, and boldface as types of letter styles classified by the width of the letters. References used in preparing the document on drawing fundamentals and lettering are also provided.
1) The document discusses the origins of the alphabet, proposing that it was invented in Egypt during the reign of Pharaoh Amenemhat III in the 18th century BCE.
2) It suggests that migrant workers known as Habiru, who were living in Egypt at sites like Avaris and Wadi-el-Hol, developed the first alphabet by associating Egyptian hieroglyphs with the sounds of Canaanite words.
3) Evidence for this comes from an inscription found at the temple of the goddess Hathor at Serabit el-Khadem in the Sinai, which associates early alphabetical letters with Egyptian hieroglyphs.
This document discusses low-level and high-level programming languages. It defines machine language and assembly language as low-level languages that are closer to binary machine code. It then explains that high-level languages use English-like terms and are easier for humans to read and write but can have performance tradeoffs. It also discusses interpreters, which interpret code line-by-line, and compilers, which translate entire programs to machine code ahead of time.
This document provides an introduction to Unicode and character encoding standards. It explains that Unicode is a character set standard that supports all languages worldwide. It describes different character encoding schemes like UTF-8 and UTF-16 that are used to represent Unicode characters in binary. It highlights issues with older single-byte encodings and the benefits of adopting a Unicode encoding to support globalization.
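The difference between encoding schemes such as UTF-8 and UTF-16 is easy to see by encoding a single string both ways. This is standard Python; the byte counts follow directly from how each scheme represents code points.

```python
# One Unicode string, two binary representations.
text = "héllo"  # 5 code points; 'é' is U+00E9, outside ASCII

utf8 = text.encode("utf-8")       # ASCII chars take 1 byte, 'é' takes 2
utf16 = text.encode("utf-16-le")  # every char here takes 2 bytes

print(len(text), len(utf8), len(utf16))  # 5 code points, 6 bytes, 10 bytes
print(utf8.hex())                        # 68 c3a9 6c 6c 6f -- 'é' is c3 a9
```

This is why "string length" and "byte length" diverge once text leaves the ASCII range, and why a single-byte legacy encoding cannot round-trip text from all languages the way a Unicode encoding can.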
The document discusses the history of programming languages from the first to the fifth generation. It defines a program as a set of instructions that tells a computer what to do. First-generation languages used binary machine code, while second-generation assembly languages made programming easier by using letter mnemonics. Third-generation high-level languages like FORTRAN, COBOL, and BASIC improved data management and were easier for non-professionals to use. Fourth- and fifth-generation languages attempted to make programming even more like natural languages through visual interfaces and English-like syntax.
There are four categories of computer languages: high-level languages, low-level languages, assembly language, and machine language. High-level languages are closer to human language and need translators to be understood by computers. Low-level languages are closer to machine language and do not need translators. Assembly language sits between high-level and machine language by using mnemonic codes. Machine language consists of binary and is the only language computers can directly understand. Translators like compilers, interpreters, and assemblers are used to convert between these language categories.
Computer languages can be categorized into high-level languages, low-level languages, and machine language. High-level languages are closer to human language and require compilers or interpreters, while low-level languages like assembly language are closer to machine language. Machine language is binary code that is directly executable by computers. There are also different generations of languages that evolved with advances in hardware and software.
The document provides information about high level and low level programming languages. It defines low level languages as assembly language and machine language, which computers can directly understand as binary code. High level languages are closer to human language and include C++, SQL, Java, C#, FORTRAN, COBOL, C, JavaScript, PHP, and HTML. Each high level language is then briefly described in terms of its history, purpose, and basic syntax structure.
Computer languages can be categorized into different generations based on their level of abstraction from machine language. First generation languages are machine languages that use binary, while assembly languages as second generation are closer to machine language with mnemonic codes. High-level languages of the third generation like FORTRAN and COBOL are easier for humans to read and write. Fourth generation languages attempt more natural language programming, and fifth generation use visual interfaces to generate code compiled by lower level languages. The key aspects of a program include variables, statements, keywords, instructions, and the ability to perform tasks through organized lists of commands.
Machine languages were the earliest programming languages using binary. Assembly languages provided a slight abstraction from machine code using mnemonics. Higher-level languages evolved through generations, with third-generation languages like COBOL and FORTRAN using more English-like syntax. Fourth-generation languages offered further abstraction through visual programming, while fifth-generation languages aimed to use artificial intelligence. Programming languages have transitioned from machine-focused to problem-focused and include compiled, interpreted, procedural, object-oriented, functional, and declarative paradigms.
There are three main categories of programming languages: machine languages, assembly languages, and higher-level languages. Higher-level languages are divided into five generations - third being the first true English-like languages, fourth allowing visual programming, and fifth hypothetically using artificial intelligence. The software development life cycle has five phases - needs analysis, program design, development, implementation, and maintenance.
This document discusses the evolution of programming languages from early machine languages to modern higher-level languages. It begins with an introduction to human and computer languages. It then covers the development of machine languages, assembly languages, and higher-level languages like FORTRAN and COBOL. The document discusses the advantages of each generation of languages and examples of languages from the 1950s to modern times.
This document discusses programming languages. It begins by asking what a programming language is and why there are so many types. It then defines a programming language as a set of rules that tells a computer what operations to perform. The document discusses the different types of programming languages like low-level languages close to machine code and high-level languages closer to English. It covers many popular programming languages from early generations like FORTRAN and COBOL to modern languages like C, C++, Java, and scripting languages. It concludes by discussing qualities of good programming languages like writability, readability, reliability and maintainability.
A programming language defines a set of instructions that are compiled by the CPU to perform tasks. Programming languages can be classified as low-level or high-level based on their level of abstraction from hardware. Low-level languages like machine code and assembly language provide little abstraction and are closer to binary machine instructions, while high-level languages like C++ and Python provide more abstraction and are easier for humans to read and write.
How To Build And Launch A Successful Globalized App From Day One Or All The ...agileware
Significant compromises are often made taking a product to market that cause downstream pain—success can mean endless hours re-architecting and retrofitting to go global, get past 508 compliance at universities or integrate partners. The good news is there are freely available technologies and strategies to avoid the pain. Learn from Zimbra’s experiences with ZCS and Zimbra Desktop (an offline-capable AJAX email application) including a checklist of do’s and don’ts and a deep dive into: i18n and l10n, 508 compliance (Americans with Disabilities Act), skinning, templates, time-date formatting and more.
From http://en.oreilly.com/oscon2008/public/schedule/detail/4834
Xml For Dummies Chapter 6 Adding Character(S) To Xmlphanleson
This document provides an overview of character encodings and how they are handled in XML. It discusses the limitations of 7-bit and 8-bit character encodings and how Unicode addresses these by supporting a much wider range of characters with 16-bit encoding. It also describes how characters maps to numeric codes in Unicode/ISO 10646 and how UTF encodings implement Unicode. Additional topics covered include common character sets, using Unicode characters, and resources for finding character entity information.
COMPUTER LANGUAGES AND THERE DIFFERENCE Pavan Kalyan
In this ppt you will understand the difference among languages and You will know what is necessary for a language to become best in the present software filed
This document provides an introduction to the Perl programming language. It begins with an overview of Perl, including what Perl is, its history and key features. It then discusses Perl syntax, variables types (scalars, arrays and hashes), and other important Perl concepts like variable scoping and context. The document also provides examples of basic Perl programs and commands.
Information about the level of programming language, types of programming language, the principal paradigms, few programming languages, criteria for good language.
Lect 1. introduction to programming languagesVarun Garg
A programming language is a set of rules that allows humans to communicate instructions to computers. There are many programming languages because they have evolved over time as better ways to design them have been developed. Programming languages can be categorized based on their generation or programming paradigm such as imperative, object-oriented, logic-based, and functional. Characteristics like writability, readability, reliability and maintainability are important qualities for programming languages.
The document discusses different types of programming languages: machine language uses binary; assembly language uses symbols but still maps to binary; and high-level languages are abstracted from hardware and use English-like syntax. It provides details on each type, including their advantages like efficiency for machine language or readability for high-level languages, and disadvantages like lack of portability or required translation.
Chapter 7
Foreign Languages and Non-Roman Text

✦ ✦ ✦ ✦

In This Chapter

Understanding the effects of non-Roman scripts on the Web

Using scripts, character sets, fonts, and glyphs

Legacy character sets

Using the Unicode Character Set

Writing XML in Unicode

✦ ✦ ✦ ✦

The Web is international, yet most of the text you’ll find on it is English. XML is
starting to change this. XML provides full support for the double-byte Unicode
character set, as well as its more compact representations. This is good news for
Web authors because Unicode supports almost every character commonly used in
every modern script on Earth.

In this chapter, you’ll learn how international text is represented in computer
applications, how XML understands text, and how you can take advantage of the
software you have to read and write in languages other than English.
Non-Roman Scripts on the Web
Although the Web is international, much of its text is in
English. Because of the Web’s expansiveness, however, you
can still surf through Web pages in French, Spanish, Chinese,
Arabic, Hebrew, Russian, Hindi, and other languages. Most
of the time, though, these pages come out looking less than
ideal. Figure 7-1 shows the October 1998 cover page of one of
the United States Information Agency’s propaganda journals,
Issues in Democracy (http://www.usia.gov/journals/
itdhr/1098/ijdr/ijdr1098.htm), in Russian translation
viewed in an English encoding. The red Cyrillic text in the
upper left is a bitmapped image file so it’s legible (if you
speak Russian) and so are a few words in English such as
“Adobe Acrobat.” However, the rest of the text is mostly a
bunch of accented Roman vowels, not the Cyrillic letters
they are supposed to be.
Part I ✦ Introducing XML
The quality of Web pages deteriorates even further when complicated, non-
Western scripts like Chinese and Japanese are used. Figure 7-2 shows the home
page for the Japanese translation of my book JavaBeans (IDG Books, 1997,
http://www. ohmsha.co.jp/data/books/contents/4-274-06271-6.htm)
viewed in an English browser. Once again the bitmapped image shows the proper
Japanese (and English) text, but the rest of the text on the page looks almost like a
random collection of characters except for a few recognizable English words like
JavaBeans. The Kanji characters you’re supposed to see are completely absent.
Figure 7-1: The Russian translation of the October 1998 issue of
Issues of Democracy viewed in a Roman script
These pages look as they’re intended to look if viewed with the right encoding and
application software, and if the correct font is installed. Figure 7-3 shows Issues in
Democracy viewed with the Windows 1251 encoding of Cyrillic. As you can see, the
text below the picture is now readable (if you can read Russian).
You can select the encoding for a Web page from the View/Encoding menu in
Netscape Navigator or Internet Explorer. In an ideal world, the Web server would
tell the Web browser what encoding to use, and the Web browser would listen. It
would also be nice if the Web server could send the Web browser the fonts it
needed to display the page. In practice, however, you often need to select the
encoding manually, even trying several to find the right one when more than
one encoding is available for a script. For instance, a Cyrillic page might be
encoded in Windows 1251, ISO 8859-5, or KOI8-R. Picking the wrong encoding may
make Cyrillic letters appear, but the words will be gibberish.
Figure 7-2: The
Japanese translation of
JavaBeans viewed
in an English browser
Figure 7-3: Issues of Democracy viewed in a Cyrillic script
Even when you can identify the encoding, there’s no guarantee you have fonts
available to display it. Figure 7-4 shows the Japanese home page for JavaBeans with
Japanese encoding, but without a Japanese font installed on the computer. Most
of the characters in the text are shown as a box, which indicates an unavailable
character glyph. Fortunately, Netscape Navigator can recognize that some of the
bytes on the page are double-byte Japanese characters rather than two one-byte
Western characters.
Figure 7-4: The Japanese translation of JavaBeans in Kanji
without the necessary fonts installed
If you do have a Japanese localized edition of your operating system that includes
the necessary fonts, or additional software like Apple’s Japanese Language Kit or
NJStar’s NJWin (http://www.njstar.com/) that adds Japanese-language support
to your existing system, you would be able to see the text more or less as it was
meant to be seen as shown in Figure 7-5.
Note: Of course, the higher the quality of the fonts you use, the better the text
will look. Chinese and Japanese fonts tend to be quite large (there are over 80,000
characters in Chinese alone) and the distinctions between individual ideographs
can be quite subtle. Japanese publishers generally require higher-quality paper and
printing than Western publishers, so they can maintain the fine detail necessary to
print Japanese letters. Regrettably, a 72-dpi computer monitor can’t do justice to
most Japanese and Chinese characters unless they’re displayed at almost obscenely
large point sizes.
Figure 7-5: The Japanese translation of JavaBeans in Kanji
with the necessary fonts installed
Because each page can have only a single encoding, it is difficult to write a Web
page that integrates multiple scripts, such as a French commentary on a Chinese
text. For reasons such as this, the Web community needs a single, universal
character set that can display all characters on all computers and Web browsers.
We don’t have such a character set yet, but XML and Unicode get as close
as is currently possible.
XML files are written in Unicode, a double-byte character set that can represent
most characters in most of the world’s languages. If a Web page is written in
Unicode, as XML pages are, and if the browser understands Unicode, as XML
browsers should, then it’s not a problem for characters from different languages
to be included on the same page.
Furthermore, the browser doesn’t need to distinguish between different encodings
like Windows 1251, ISO 8859-5, or KOI8-R. It can just assume everything’s written
in Unicode. As long as the double-byte set has the space to hold all of the different
characters, there’s no need to use more than one character set. Therefore there’s
no need for browsers to try to detect which character set is in use.
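The mojibake of Figures 7-1 and 7-2 is easy to reproduce in code. This short Python sketch (illustrative, and not part of the original chapter; Python's standard codec names stand in for the browser's encoding menu) writes Cyrillic text in Windows 1251 and then misreads it as Latin-1:

```python
# Cyrillic text encoded in the legacy Windows 1251 character set.
text = "Россия"                      # "Russia"
raw = text.encode("windows-1251")

# Misreading the bytes as Latin-1 yields accented Roman vowels,
# exactly the effect shown in Figure 7-1.
print(raw.decode("latin-1"))         # Ðîññèÿ
print(raw.decode("windows-1251"))    # Россия

# A Unicode encoding such as UTF-8 removes the guesswork entirely.
assert text.encode("utf-8").decode("utf-8") == text
```

With a single universal character set, the browser never has to guess which of several Cyrillic encodings was used.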
Scripts, Character Sets, Fonts, and Glyphs
Most modern human languages have written forms. The set of characters used to
write a language is called a script. A script may be a phonetic alphabet, but it
doesn’t have to be. For instance, Chinese, Japanese, and Korean are written with
ideographic characters that represent whole words. Different languages often share
scripts, sometimes with slight variations. For instance, the modern Turkish
alphabet is essentially the familiar Roman alphabet with three extra letters — ğ, ş,
and ı. Chinese, Japanese, and Korean, on the other hand, share essentially the same
80,000 Han ideographs, though many characters have different meanings in the
different languages.
Note: The word script is also often used to refer to programs written in weakly
typed, interpreted languages like JavaScript, Perl, and Tcl. In this chapter, the word
script always refers to the characters used to write a language and not to any sort
of program.
Some languages can even be written in different scripts. Serbian and Croatian are
virtually identical and are generally referred to as Serbo-Croatian. However, Serbian
is written in a modified Cyrillic script, and Croatian is written in a modified Roman
script. As long as a computer doesn’t attempt to grasp the meaning of the words it
processes, working with a script is equivalent to working with any language that
can be written in that script.
Unfortunately, XML alone is not enough to read a script. For each script a computer
processes, four things are required:
1. A character set for the script
2. A font for the character set
3. An input method for the character set
4. An operating system and application software that understand the
character set
If any of these four elements are missing, you won’t be able to work easily in the
script, though XML does provide a work-around that’s adequate for occasional use.
If the only thing your application is missing is an input method, you’ll be able to
read text written in the script. You just won’t be able to write in it.
A Character Set for the Script
Computers only understand numbers. Before they can work with text, that text has
to be encoded as numbers in a specified character set. For example, the popular
ASCII character set encodes the capital letter ‘A’ as 65. The capital letter ‘B’ is
encoded as 66. ‘C’ is 67, and so on.
These are semantic encodings that provide no style or font information. A roman C,
an italic C, or even a bold C are all 67. Information about how the character is
drawn is stored elsewhere.
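In Python, for instance, the mapping between characters and numbers is exposed by the built-in ord() and chr() functions (a minimal sketch, not from the original text):

```python
# ASCII encodes characters as numbers; ord() and chr() expose the mapping.
assert ord("A") == 65
assert ord("B") == 66
assert ord("C") == 67
assert chr(67) == "C"

# The encoding is purely semantic: 67 says nothing about whether the C
# is drawn in roman, italic, or bold type.
```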
A Font for the Character Set
A font is a collection of glyphs for a character set, generally in a specific size, face,
and style. For example, a roman C, an italic C, and a bold C are all the same
character, but they are drawn with different glyphs. Nonetheless, their essential
meaning is the same.
Exactly how the glyphs are stored varies from system to system. They may be
bitmaps or vector drawings; they may even consist of hot lead on a printing press.
The form they take doesn’t concern us here. The key idea is that a font tells the
computer how to draw each character in the character set.
An Input Method for the Character Set
An input method enables you to enter text. English speakers don’t think much
about the need for an input method for a script. We just type on our keyboards
and everything’s hunky-dory. The same is true in most of Europe, where all that’s
needed is a slightly modified keyboard with a few extra umlauts, cedillas, or
thorns (depending on the country).
Radically different character sets like Cyrillic, Hebrew, Arabic, and Greek are more
difficult to input. There’s a finite number of keys on the keyboard, generally not
enough for Arabic and Roman letters, or Roman and Greek letters. Assuming both
are needed though, a keyboard can have a Greek lock key that shifts the keyboard
from Roman to Greek and back. Both Greek and Roman letters can be printed on
the keys in different colors. The same scheme works for Hebrew, Arabic, Cyrillic,
and other non-Roman alphabetic character sets.
However, this scheme really breaks down when faced with ideographic scripts
like Chinese and Japanese. Japanese keyboards can have in the ballpark of 5,000
different keys, and that’s still less than 10% of the language! Syllabic, phonetic,
and radical representations exist that can reduce the number of keys, but it is
questionable whether a keyboard is really an appropriate means of entering text
in these languages. Reliable speech and handwriting recognition have even
greater potential in Asia than in the West.
Since speech and handwriting recognition still haven’t reached the reliability of
even a mediocre typist like myself, most input methods today map multiple
sequences of keys on the keyboard to a single character. For example, to type the
Chinese character for sheep, you might hold down the Alt key and type a tilde (~),
then type yang, then hit the Enter key. The input method would then present you
with a list of words that are pronounced more or less like yang. You would then
choose the character you wanted, 羊. The exact details of both the GUI and the
transliteration system used to convert typed keys like yang to ideographic
characters like 羊 vary from program to program, operating system to operating
system, and language to language.
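The scheme can be sketched as a simple lookup table. The candidate lists below are invented for illustration and are far smaller than any real input method's database:

```python
# A toy transliteration-based input method: a typed Roman syllable maps
# to a list of candidate ideographs, and the user picks one from a menu.
CANDIDATES = {
    "yang": ["羊", "阳", "样", "洋"],  # characters pronounced roughly "yang"
}

def lookup(keys):
    """Return the candidate characters for a typed syllable."""
    return CANDIDATES.get(keys, [])

def choose(keys, index):
    """Simulate the user picking one candidate from the displayed list."""
    return lookup(keys)[index]

print(lookup("yang"))      # the menu the input method would display
print(choose("yang", 0))   # 羊, the character for sheep
```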
Operating System and Application Software
As of this writing, the major Web browsers (Netscape Navigator and Internet
Explorer) do a surprisingly good job of displaying non-Roman scripts. Provided
the underlying operating system supports a given script and has the right fonts
installed, a Web browser can probably display it.
MacOS 7.1 and later can handle most common scripts in the world today. However,
the base operating system only supports Western European languages. Chinese,
Japanese, Korean, Arabic, Hebrew, and Cyrillic are available as language kits that
cost about $100 apiece. Each provides fonts and input methods for languages
written in those scripts. There’s also an Indian language kit, which handles the
Devanagari, Gujarati, and Gurmukhi scripts common on the Indian subcontinent.
MacOS 8.5 adds optional, limited support for Unicode (which most applications
don’t yet take advantage of).
Windows NT 4.0 uses Unicode as its native character set. NT 4.0 does a fairly good
job with Roman languages, Cyrillic, Greek, Hebrew, and a few others. The Lucida
Sans Unicode font covers about 1300 of the most common of Unicode’s 40,000 or so
characters. Microsoft Office 97 includes Chinese, Japanese, and Korean fonts that
you can install to read text in these languages. (Look in the Fareast folder in the
Valupack folder on your Office CD-ROM.)
Microsoft claims Windows 2000 (previously known as NT 5.0) will also include
fonts covering most of the Chinese-Japanese-Korean ideographs, as well as input
methods for these scripts. However, they also promised that Windows 95 would
include Unicode support, and that got dropped before shipment. Consequently,
I’m not holding my breath. Certainly, it would be nice if they did provide full
international support in all versions of NT rather than relying on localized
systems.
Microsoft’s consumer operating systems, Windows 3.1, 95, and 98, do not fully
support Unicode. Instead they rely on localized systems that can only handle basic
English characters plus the localized script.
The major Unix variants have varying levels of support for Unicode. Solaris 2.6
supports European languages, Greek, and Cyrillic. Chinese, Japanese, and Korean
are supported by localized versions using different encodings rather than Unicode.
Linux has embryonic support for Unicode, which may grow to something useful in
the near future.
Legacy Character Sets
Different computers in different locales use different default character sets. Most
modern computers use a superset of the ASCII character set. ASCII encodes the
English alphabet and the most common punctuation and whitespace characters.
In the United States, Macs use the MacRoman character set, Windows PCs use a
character set called Windows ANSI, and most Unix workstations use ISO Latin-1.
These are all extensions of ASCII that support additional characters like ç and ¿
that are needed for Western European languages like French and Spanish. In other
locales like Japan, Greece, and Israel, computers use a still more confusing
hodgepodge of character sets that mostly support ASCII plus the local language.
This doesn’t work on the Internet. It’s unlikely that while you’re reading the San
Jose Mercury News you’ll turn the page and be confronted with several columns
written in German or Chinese. However, on the Web it’s entirely possible a user will
follow a link and end up staring at a page of Japanese. Even if the surfer can’t read
Japanese it would still be nice if they saw a correct version of the language, as seen
in Figure 7-5, instead of a random collection of characters like those shown in
Figure 7-2.
XML addresses this problem by moving beyond small, local character sets to one
large set that’s supposed to encompass all scripts used in all living languages (and a
few dead ones) on planet Earth. This character set is called Unicode. As previously
noted, Unicode is a double-byte character set that provides representations of over
40,000 different characters in dozens of scripts and hundreds of languages. All XML
processors are required to understand Unicode, even if they can’t fully display it.
As you learned in Chapter 6, an XML document is divided into text and binary
entities. Each text entity has an encoding. If the encoding is not explicitly specified
in the entity’s definition, then the default is UTF-8 — a compressed form of Unicode
which leaves pure ASCII text unchanged. Thus XML files that contain nothing but
the common ASCII characters may be edited with tools that are unaware of the
complications of dealing with multi-byte character sets like Unicode.
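You can verify this property of UTF-8 directly. In this Python sketch (the greeting element is a made-up example, not from the chapter), an ASCII-only document encodes to exactly the same bytes either way:

```python
# UTF-8 leaves pure ASCII unchanged, byte for byte.
doc = "<greeting>Hello</greeting>"
assert doc.encode("utf-8") == doc.encode("ascii")

# Characters beyond ASCII expand to multiple bytes instead.
assert len("é".encode("utf-8")) == 2   # Latin-1 range: 2 bytes
assert len("羊".encode("utf-8")) == 3   # Han ideograph: 3 bytes
```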
The ASCII Character Set
ASCII, the American Standard Code for Information Interchange, is one of the
original character sets, and is by far the most common. It forms a sort of lowest
common denominator for what a character set must support. It defines all the
characters needed to write U.S. English, and essentially nothing else. The charac-
ters are encoded as the numbers 0-127. Table 7-1 presents the ASCII character set.
Table 7-1
The ASCII Character Set

Code  Character                              Code  Character   Code  Character   Code  Character
0     null (Control-@)                       32    space       64    @           96    `
1     start of heading (Control-A)           33    !           65    A           97    a
2     start of text (Control-B)              34    "           66    B           98    b
3     end of text (Control-C)                35    #           67    C           99    c
4     end of transmission (Control-D)        36    $           68    D           100   d
5     enquiry (Control-E)                    37    %           69    E           101   e
6     acknowledge (Control-F)                38    &           70    F           102   f
7     bell (Control-G)                       39    '           71    G           103   g
8     backspace (Control-H)                  40    (           72    H           104   h
9     tab (Control-I)                        41    )           73    I           105   i
10    linefeed (Control-J)                   42    *           74    J           106   j
11    vertical tab (Control-K)               43    +           75    K           107   k
12    formfeed (Control-L)                   44    ,           76    L           108   l
13    carriage return (Control-M)            45    -           77    M           109   m
14    shift out (Control-N)                  46    .           78    N           110   n
15    shift in (Control-O)                   47    /           79    O           111   o
16    data link escape (Control-P)           48    0           80    P           112   p
17    device control 1 (Control-Q)           49    1           81    Q           113   q
18    device control 2 (Control-R)           50    2           82    R           114   r
19    device control 3 (Control-S)           51    3           83    S           115   s
20    device control 4 (Control-T)           52    4           84    T           116   t
21    negative acknowledge (Control-U)       53    5           85    U           117   u
22    synchronous idle (Control-V)           54    6           86    V           118   v
23    end of transmission block (Control-W)  55    7           87    W           119   w
24    cancel (Control-X)                     56    8           88    X           120   x
25    end of medium (Control-Y)              57    9           89    Y           121   y
26    substitute (Control-Z)                 58    :           90    Z           122   z
27    escape (Control-[)                     59    ;           91    [           123   {
28    file separator (Control-\)             60    <           92    \           124   |
29    group separator (Control-])            61    =           93    ]           125   }
30    record separator (Control-^)           62    >           94    ^           126   ~
31    unit separator (Control-_)             63    ?           95    _           127   delete
Characters 0 through 31 are non-printing control characters. They include the
carriage return, the linefeed, the tab, the bell, and similar characters. Many of these
are leftovers from the days of paper-based teletype terminals. For instance, carriage
return used to literally mean move the carriage back to the left margin, as you’d do
on a typewriter. Linefeed moved the platen up one line. Aside from the few control
characters mentioned, these aren’t used much anymore.
Most other character sets you’re likely to encounter are supersets of ASCII. In other
words, they define 0 through 127 exactly the same as ASCII, but add additional
characters from 128 on up.
Table 7-2
The Latin-1 Character Set (continued)

Code  Character   Code  Character   Code  Character   Code  Character
143   SS3         175   ¯           207   Ï           239   ï
144   DCS         176   °           208   Ð           240   ð
145   PU1         177   ±           209   Ñ           241   ñ
146   PU2         178   ²           210   Ò           242   ò
147   STS         179   ³           211   Ó           243   ó
148   CCH         180   ´           212   Ô           244   ô
149   MW          181   µ           213   Õ           245   õ
150   SPA         182   ¶           214   Ö           246   ö
151   EPA         183   ·           215   ×           247   ÷
152   SOS         184   ¸           216   Ø           248   ø
153   Undefined   185   ¹           217   Ù           249   ù
154   SCI         186   º           218   Ú           250   ú
155   CSI         187   »           219   Û           251   û
156   ST          188   ¼           220   Ü           252   ü
157   OSC         189   ½           221   Ý           253   ý
158   PM          190   ¾           222   Þ           254   þ
159   APC         191   ¿           223   ß           255   ÿ
Latin-1 still lacks many useful characters including those needed for Greek, Cyrillic,
Chinese, and many other scripts and languages. You might think these could just be
moved into the numbers from 256 up. However there’s a catch. A single byte can only
hold values from 0 to 255. To go beyond that, you need to use a multi-byte character
set. For historical reasons most programs are written under the assumption that
characters and bytes are identical, and they tend to break when faced with multi-byte
character sets. Therefore, most current operating systems (Windows NT being the
notable exception) use different, single-byte character sets rather than one large
multi-byte set. Latin-1 is the most common such set, but other sets are needed to
handle additional languages.
ISO 8859 defines ten other character sets (8859-2 through 8859-10 and 8859-15)
suitable for different scripts, with four more (8859-11 through 8859-14) in active
development. Table 7-3 lists the ISO character sets and the languages and scripts
they can be used for. All share the same ASCII characters from 0 to 127, and then
each includes additional characters from 128 to 255.
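The family relationship is easy to demonstrate in code. In this brief Python illustration (using Python's standard ISO 8859 codec names), every part decodes bytes 0 through 127 identically, while the same upper byte becomes a different letter in each part:

```python
# The byte 0x41 is "A" in every ISO 8859 part.
assert bytes([0x41]).decode("iso8859-1") == "A"
assert bytes([0x41]).decode("iso8859-7") == "A"

# The byte 0xE9, however, depends entirely on which part you pick.
b = bytes([0xE9])
print(b.decode("iso8859-1"))   # é  (Latin-1, Western European)
print(b.decode("iso8859-5"))   # щ  (Cyrillic)
print(b.decode("iso8859-7"))   # ι  (Greek)
```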
Table 7-3
The ISO Character Sets

Character Set   Also Known As   Languages
ISO 8859-1 Latin-1 ASCII plus the characters required for most Western European
languages including Albanian, Afrikaans, Basque, Catalan,
Danish, Dutch, English, Faroese, Finnish, Flemish, Galician,
German, Icelandic, Irish, Italian, Norwegian, Portuguese,
Scottish, Spanish, and Swedish. However it omits the ligatures
ij (Dutch), Œ (French), and German quotation marks.
ISO 8859-2 Latin-2 ASCII plus the characters required for most Central European
languages including Czech, English, German, Hungarian,
Polish, Romanian, Croatian, Slovak, Slovene, and Sorbian.
ISO 8859-3 Latin-3 ASCII plus the characters required for English, Esperanto,
German, Maltese, and Galician.
ISO 8859-4 Latin-4 ASCII plus the characters required for the Baltic languages
Latvian, Lithuanian, German, Greenlandic, and Lappish;
superseded by ISO 8859-10, Latin-6
ISO 8859-5 ASCII plus Cyrillic characters required for Byelorussian,
Bulgarian, Macedonian, Russian, Serbian, and Ukrainian.
ISO 8859-6 ASCII plus Arabic.
ISO 8859-7 ASCII plus Greek.
ISO 8859-8 ASCII plus Hebrew.
ISO 8859-9 Latin-5 Latin-1 except that the Turkish letters Ğ, ğ, İ, ı, Ş, and ş take
the place of the less commonly used Icelandic letters Ð, ð, Ý,
ý, Þ, and þ.
ISO 8859-10 Latin-6 ASCII plus characters for the Nordic languages Lithuanian,
Inuit (Greenlandic Eskimo), non-Skolt Sami (Lappish), and
Icelandic.
ISO 8859-11 ASCII plus Thai.
ISO 8859-12 This may eventually be used for ASCII plus Devanagari (Hindi,
Sanskrit, etc.) but no proposal is yet available.
ISO 8859-13 Latin-7 ASCII plus the Baltic Rim, particularly Latvian.
ISO 8859-14 Latin-8 ASCII plus Gaelic and Welsh.
ISO 8859-15 Latin-9, Essentially the same as Latin-1 but with a Euro sign instead
Latin-0 of the international currency sign ¤. Furthermore, the Finnish
characters Š, š, Ž, and ž replace the uncommon symbols ¦, ¨,
´, and ¸. And the French Œ, œ, and Ÿ characters replace the
fractions 1/4, 1/2, and 3/4.
These sets often overlap. Several languages, most notably English and German, can
be written in more than one of the character sets. To some extent the different sets
are designed to allow different combinations of languages. For instance Latin-1 can
combine most Western languages and Icelandic whereas Latin-5 combines most
Western languages with Turkish instead of Icelandic. Thus if you needed a document
in English, French, and Icelandic, you’d use Latin-1, whereas a document containing
English, French, and Turkish would use Latin-5. A document that required English,
Hebrew, and Turkish, however, would have to use Unicode since no single-byte
character set handles all three languages and scripts.
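Whether a given repertoire fits a particular single-byte set can be checked mechanically. In this Python sketch, fits() is a hypothetical helper written for illustration:

```python
def fits(text, encoding):
    """Report whether every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

doc = "English, Français, İstanbul"          # English + French + Turkish
print(fits(doc, "iso8859-1"))                # False: Latin-1 lacks İ
print(fits(doc, "iso8859-9"))                # True: Latin-5 covers all three
print(fits(doc + " עברית", "iso8859-9"))     # False: no single-byte set adds Hebrew
print(fits(doc + " עברית", "utf-8"))         # True: Unicode handles everything
```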
A single-byte set is insufficient for Chinese, Japanese, and Korean. These languages
have more than 256 characters apiece, so they must use multi-byte character sets.
The MacRoman Character Set
The MacOS predates Latin-1 by several years. (The ISO 8859-1 standard was first
adopted in 1987. The first Mac was released in 1984.) Unfortunately this means
that Apple had to define its own extended character set, called MacRoman.
MacRoman has most of the same extended characters as Latin-1 (except for the
Icelandic letters Ð, Ý, and Þ), but the characters are assigned to different numbers.
MacRoman is the same as ASCII and Latin-1 in codes 0 through 127. This is one
reason text files that use extended characters often look funny when moved from
a PC to a Mac or vice versa. Table 7-4 lists the upper half of the MacRoman
character set.
Table 7-4
The MacRoman Character Set

Code  Character   Code  Character   Code  Character   Code  Character
128   Ä           160   †           192   ¿           224   ‡
129   Å           161   °           193   ¡           225   ·
130   Ç           162   ¢           194   ¬           226   ‚
131   É           163   £           195   √           227   „
132   Ñ           164   §           196   ƒ           228   ‰
133   Ö           165   •           197   ≈           229   Â
134   Ü           166   ¶           198   ∆           230   Ê
135   á           167   ß           199   «           231   Á
136   à           168   ®           200   »           232   Ë
Table 7-5
The Windows ANSI Character Set

Code  Character   Code  Character   Code  Character   Code  Character
128   Undefined   136   ˆ           144   Undefined   152   ˜
129   Undefined   137   ‰           145   ‘           153   ™
130   ‚           138   Š           146   ’           154   š
131   ƒ           139   ‹           147   “           155   ›
132   „           140   Œ           148   ”           156   œ
133   …           141   Undefined   149   •           157   Undefined
134   †           142   Undefined   150   –           158   Undefined
135   ‡           143   Undefined   151   —           159   Ÿ
The Unicode Character Set
Using different character sets for different scripts and languages works well enough
as long as:
1. You don’t need to work in more than one script at once.
2. You never trade files with anyone using a different character set.
Since Macs and PCs use different character sets, more people fail these criteria
than not. Obviously what is needed is a single character set that everyone agrees
on and that encodes all characters in all the world’s scripts. Creating such a set is
difficult. It requires a detailed understanding of hundreds of languages and their
scripts. Getting software developers to agree to use that set once it’s been created
is even harder. Nonetheless work is ongoing to create exactly such a set called
Unicode, and the major vendors (Microsoft, Apple, IBM, Sun, Be, and many others)
are slowly moving toward complying with it. XML specifies Unicode as its default
character set.
Unicode encodes each character as a two-byte unsigned number with a value
between 0 and 65,535. Currently a few more than 40,000 different Unicode charac-
ters are defined. The remaining 25,000 spaces are reserved for future extensions.
About 20,000 of the characters are used for the Han ideographs and another
11,000 or so are used for the Korean Hangul syllables. The remainder of the char-
acters encodes most of the rest of the world’s languages. Unicode characters 0
through 255 are identical to Latin-1 characters 0 through 255.
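A few code points, checked in Python (the specific characters are arbitrary examples):

```python
# Unicode assigns each character a number from 0 to 65,535 in the
# original two-byte design; values 0-255 are identical to Latin-1.
assert ord("A") == 65         # same as ASCII
assert ord("é") == 233        # same as Latin-1
assert ord("Ω") == 937        # Greek block
assert ord("я") == 1103       # Cyrillic block
assert ord("羊") == 32650     # Han ideographs

# Each of these occupies exactly two bytes in UTF-16.
for ch in "Aéя羊":
    assert len(ch.encode("utf-16-be")) == 2
```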
I’d love to show you a table of all the characters in Unicode, but if I did this book
would consist entirely of this table and not much else. If you need to know more
about the specific encodings of the different characters in Unicode, get a copy of
The Unicode Standard (second edition, ISBN 0-201-48346-9, from Addison-Wesley).
This 950-page book includes the complete Unicode 2.0 specification, including
character charts for all the different characters defined in Unicode 2.0. You can
also find information online at the Unicode Consortium Web site at
http://www.unicode.org/ and http://charts.unicode.org/. Table 7-6 lists the
different scripts encoded by Unicode, which should give you some idea of Unicode’s
versatility. The characters of each script are generally encoded in a consecutive
sub-range (block) of the 65,536 code points in Unicode. Most languages can be
written with the characters in one of these blocks (for example, Russian can be
written with the Cyrillic block) though some languages like Croatian or Turkish
may need to mix and match characters from the first four Latin blocks.
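Python's standard unicodedata module can report the formal name of any character, which shows this block structure at work (a small illustrative sketch):

```python
import unicodedata

# Each character's formal Unicode name reveals the script it belongs to.
for ch in "Aжצ羊":
    print(hex(ord(ch)), unicodedata.name(ch))

# 0x41   LATIN CAPITAL LETTER A
# 0x436  CYRILLIC SMALL LETTER ZHE
# 0x5e6  HEBREW LETTER TSADI
# 0x7f8a CJK UNIFIED IDEOGRAPH-7F8A
```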
Table 7-6
Unicode Script Blocks
Script Range Purpose
Basic Latin 0-127 ASCII, American English.
Latin-1 Supplement 128-255 Upper half of ISO Latin-1, in conjunction with the
Basic Latin block can handle Danish, Dutch, English,
Faroese, Flemish, German, Hawaiian, Icelandic,
Indonesian, Irish, Italian, Norwegian, Portuguese,
Spanish, Swahili, and Swedish.
Latin Extended-A 256-383 This block adds the characters from the ISO 8859 sets
Latin-2, Latin-3, Latin-4, and Latin-5 not already found
in the Basic Latin and Latin-1 blocks. In conjunction
with those blocks, this block can encode Afrikaans,
Breton, Basque, Catalan, Czech, Esperanto, Estonian,
French, Frisian, Greenlandic, Hungarian, Latvian,
Lithuanian, Maltese, Polish, Provençal, Rhaeto-
Romanic, Romanian, Romany, Slovak, Slovenian,
Sorbian, Turkish, and Welsh.
Latin Extended-B 384-591 Mostly characters needed to extend the Latin script to
handle languages not traditionally written in this
script; includes many African languages, Croatian
digraphs to match Serbian Cyrillic letters, the Pinyin
transcription of Chinese, and the Sami characters
from Latin-10.
IPA Extensions 592-687 The International Phonetic Alphabet.
Spacing Modifier 688-767 Small symbols that somehow change (generally
Letters phonetically) the previous letter.
Combining Diacritical 768-879 Diacritical marks like ~, ‘, and _ that will somehow be
Marks combined with the previous character (most com-
monly, be placed on top of) rather than drawn as a
separate character.
Greek 880-1023 Modern Greek, based on ISO 8859-7; also provides
characters for Coptic.
Cyrillic 1024-1279 Russian and most other Slavic languages (Ukrainian,
Byelorussian, and so forth), and many non-Slavic
languages of the former Soviet Union (Azerbaijani,
Ossetian, Kabardian, Chechen, Tajik, and so forth);
based on ISO 8859-5. A few languages (Kurdish,
Abkhazian) require both Latin and Cyrillic characters.
Armenian 1328-1423 Armenian
Hebrew 1424-1535 Hebrew (classical and modern), Yiddish, Judezmo,
early Aramaic.
Arabic 1536-1791 Arabic, Persian, Pashto, Sindhi, Kurdish, and classical
Turkish.
Devanagari 2304-2431 Sanskrit, Hindi, Nepali, and other languages of the
Indian subcontinent including Awadhi, Bagheli,
Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi,
Garhwali, Gondi, Harauti, Ho, Jaipuri, Kachchhi,
Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh,
Marwari, Mundari, Newari, Palpa, and Santali.
Bengali 2432-2559 A North Indian script used in India’s West Bengal state
and Bangladesh; used for Bengali, Assamese, Daphla,
Garo, Hallam, Khasi, Manipuri, Mizo, Naga, Munda,
Rian, Santali.
Gurmukhi 2560-2687 Punjabi
Gujarati 2688-2815 Gujarati
Oriya 2816-2943 Oriya, Khondi, Santali.
Tamil 2944-3071 Tamil and Badaga, used in south India, Sri Lanka,
Singapore, and parts of Malaysia.
Telugu 3072-3199 Telugu, Gondi, Lambadi.
Kannada 3200-3327 Kannada, Tulu.
Malayalam 3328-3455 Malayalam
Thai 3584-3711 Thai, Kuy, Lavna, Pali.
Lao 3712-3839 Lao
Tibetan 3840-4031 Himalayan languages including Tibetan, Ladakhi, and
Lahuli.
Georgian 4256-4351 Georgian, the language of the former Soviet Republic
of Georgia on the Black Sea.
Hangul Jamo 4352-4607 The alphabetic components of the Korean Hangul
syllabary.
Latin Extended 7680-7935 Normal Latin letters like E and Y combined with
Additional diacritical marks, rarely used except for Vietnamese
vowels.
Greek Extended 7936-8191 Greek letters combined with diacritical marks; used in
Polytonic and classical Greek.
General Punctuation 8192-8303 Assorted punctuation marks.
Superscripts and 8304-8351 Common subscripts and superscripts.
Subscripts
Currency Symbols 8352-8399 Currency symbols not already present in other blocks.
Combining Marks for 8400-8447 Used to make a diacritical mark span two or more
Symbols characters.
Letterlike Symbols 8448-8527 Symbols that look like letters, such as ™ and ℓ.
Number Forms 8528-8591 Fractions and Roman numerals.
Arrows 8592-8703 Arrows
Mathematical 8704-8959 Mathematical operators that don’t already appear in
Operators other blocks.
Miscellaneous 8960-9039 Crop marks, bra-ket notation from quantum
Technical mechanics, symbols needed for the APL programming
language, and assorted other technical symbols.
Control Pictures 9216-9279 Pictures of the ASCII control characters; generally
used in debugging and network-packet sniffing.
Optical Character 9280-9311 OCR-A and the MICR (magnetic ink character
Recognition recognition) symbols on printed checks.
Enclosed 9312-9471 Letters and numbers in circles and parentheses.
alphanumerics
Box Drawing 9472-9599 Characters for drawing boxes on monospaced
terminals.
Block Elements 9600-9631 Monospaced terminal graphics as used in DOS and
elsewhere.
Geometric Shapes 9632-9727 Squares, diamonds, triangles, and the like.
Chapter 7 ✦ Foreign Languages and Non-Roman Text
Miscellaneous 9728-9983 Cards, chess, astrology, and more.
Symbols
Dingbats 9984-10175 The Zapf Dingbat characters.
CJK Symbols and 12288- Symbols and punctuation used in Chinese, Japanese,
Punctuation 12351 and Korean.
Hiragana 12352- A cursive syllabary for Japanese
12447
Katakana 12448- A non-cursive syllabary used to write words imported
12543 from the West in Japanese, especially modern words
like “keyboard”.
Bopomofo 12544- A phonetic alphabet for Chinese used primarily for
12591 teaching.
Hangul Compatibility 12592- Korean characters needed for compatibility with the
Jamo 12687 KSC 5601 encoding.
Kanbun 12688- Marks used in Japanese to indicate the reading order
12703 of classical Chinese.
Enclosed CJK Letters 12800- Hangul and Katakana characters enclosed in circles
and Months 13055 and parentheses.
CJK Compatibility 13056- Characters needed only to encode KSC 5601 and
13311 CNS 11643.
CJK Unified 19968- The Han ideographs used for Chinese, Japanese, and
Ideographs 40959 Korean.
Hangul Syllables 44032- A Korean syllabary.
55203
Surrogates 55296- Currently unused, but will eventually allow the
57343 extension of Unicode to over one million different
characters.
Private Use 57344- Software developers can include their custom
63743 characters here; not compatible across
implementations.
CJK Compatibility 63744- A few extra Han ideographs needed only to maintain
Ideographs 64255 compatibility with existing standards like KSC 5601.
Alphabetic Presen- 64256- Ligatures and variants sometimes used in Latin,
tation Forms 64335 Armenian, and Hebrew.
Arabic Presentation 64336- Variants of assorted Arabic characters.
Forms 65023
Combining Half 65056- Combining multiple diacritical marks into a single
Marks 65071 diacritical mark that spans multiple characters.
CJK Compatibility 65072- Mostly vertical variants of Han ideographs used in
Forms 65103 Taiwan.
Small Form Variants 65104- Smaller version of ASCII punctuation mostly used in
65135 Taiwan.
Additional Arabic 65136- More variants of assorted Arabic characters.
Presentation Forms 65279
Half-width and Full- 65280- Characters that allow conversion between different
width Forms 65519 Chinese and Japanese encodings of the same
characters.
Specials 65520- The byte order mark and the zero-width no-break
65535 space often used to start Unicode files.
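The block ranges in this table are easy to explore interactively. As a minimal sketch, Python's standard unicodedata module reports any character's code point and formal Unicode name, which you can check against the ranges above:

```python
import unicodedata

# ord() gives the Unicode value; unicodedata.name() gives the formal name.
assert ord("π") == 960  # falls in the Greek block
assert unicodedata.name("π") == "GREEK SMALL LETTER PI"

# A character from the Hiragana block (12352-12447 in Table 7-6):
assert ord("ひ") == 12402
assert unicodedata.name("ひ") == "HIRAGANA LETTER HI"
```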
UTF-8
Since Unicode uses two bytes for each character, files of English text are
about twice as large in Unicode as they would be in ASCII or Latin-1. UTF-8 is
a compressed version of Unicode that uses only a single byte for the most
common characters, that is, the ASCII characters 0 through 127, at the expense
of having to use three bytes for the less common characters, particularly the
Hangul syllables and Han ideographs. If you're writing mostly in English, UTF-8
can reduce your file sizes by as much as 50 percent. On the other hand, if
you're writing mostly in Chinese, Korean, or Japanese, UTF-8 can increase your
file size by as much as 50 percent, so it should be used with caution. UTF-8
has essentially no effect on non-Roman, non-CJK scripts such as Greek, Arabic,
Cyrillic, and Hebrew, which occupy two bytes per character either way.
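These size trade-offs are easy to verify. The following Python sketch compares byte counts for the same text in UTF-8 and in raw two-byte Unicode (UTF-16, big-endian):

```python
# ASCII text: 1 byte per character in UTF-8, 2 bytes in raw Unicode.
ascii_text = "hello"
assert len(ascii_text.encode("utf-8")) == 5
assert len(ascii_text.encode("utf-16-be")) == 10

# Greek (and Cyrillic, Arabic, Hebrew): 2 bytes each in both encodings.
greek = "πρ"
assert len(greek.encode("utf-8")) == 4
assert len(greek.encode("utf-16-be")) == 4

# Han ideographs: 3 bytes each in UTF-8, but only 2 in raw Unicode.
cjk = "日本"
assert len(cjk.encode("utf-8")) == 6
assert len(cjk.encode("utf-16-be")) == 4
```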
XML processors assume text data is in the UTF-8 format unless told otherwise. This
means they can read ASCII files, but other formats like MacRoman or Latin-1 cause
them trouble. You’ll learn how to fix this problem shortly.
The Universal Character Set
Unicode has been criticized for not encompassing enough, especially in regard to
East Asian languages. It only defines about 20,000 of the 80,000 Han ideographs used
amongst Chinese, Japanese, Korean, and historical Vietnamese. (Modern Vietnamese
uses a Roman alphabet.)
UCS (Universal Character Set), also known as ISO 10646, uses four bytes per
character (more precisely, 31 bits) to provide space for over two billion different
characters. This easily covers every character ever used in any language in any
script on the planet Earth. Among other things this enables a full set of characters
to be assigned to each language so that the French “e” is not the same as the
English “e” is not the same as the German “e,” and so on.
Like Unicode, UCS defines a number of different variants and compressed forms.
Pure Unicode is sometimes referred to as UCS-2, which is two-byte UCS. UTF-16 is a
special encoding that maps some of the UCS characters into byte strings of varying
length in such a fashion that Unicode (UCS-2) data is unchanged.
At this point, the advantage of UCS over Unicode is mostly theoretical. The only
characters that have actually been defined in UCS are precisely those already in
Unicode. However, it does provide more room for future expansion.
How to Write XML in Unicode
Unicode is the native character set of XML, and XML browsers will probably do a
pretty good job of displaying it, at least to the limits of the available
fonts. Nonetheless, there simply aren't many, if any, text editors that support
the full range of Unicode. Consequently, you'll probably have to tackle this
problem in one of two ways:
1. Write in a localized character set like Latin-3; then convert your file to Unicode.
2. Include Unicode character references in the text that numerically identify
particular characters.
The first option is preferable when you’ve got a large amount of text to enter in
essentially one script, or one script plus ASCII. The second works best when you
need to mix small portions of multiple scripts into your document.
Inserting Characters in XML Files with
Character References
Every Unicode character is a number between 0 and 65,535. If you do not have a
text editor that can write in Unicode, you can always use a character reference to
insert the character in your XML file instead.
A Unicode character reference consists of the two characters &# followed by the
character code, followed by a semicolon. For instance, the Greek letter π has
Unicode value 960, so it may be inserted in an XML file as &#960;. The Cyrillic
character Ҷ has Unicode value 1206, so it can be included in an XML file with
the character reference &#1206;.
Unicode character references may also be specified in hexadecimal (base 16).
Although most people are more comfortable with decimal numbers, the Unicode
Specification gives character values as two-byte hexadecimal numbers, so it's
often easier to use hex values directly rather than converting them to decimal.
All you need to do is include an x after the &# to signify that you're using a
hexadecimal value. For example, π has hexadecimal value 3C0, so it may be
inserted in an XML file as &#x3C0;. The Cyrillic character Ҷ has hexadecimal
value 4B6, so it can be included in an XML file with the escape sequence
&#x4B6;. Because two bytes always produce exactly four hexadecimal digits, it's
customary (though not required) to include leading zeros in hexadecimal
character references, as in &#x03C0;, so they are rounded out to four digits.
Unicode character references, both hexadecimal and decimal, may be used to
embed characters that would otherwise be interpreted as markup. For instance,
the ampersand (&) is encoded as &#38; or &#x26;. The less-than sign (<) is
encoded as &#60; or &#x3C;.
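Generating these references can be automated. Here is a minimal Python sketch; the helper names decimal_ref and hex_ref are illustrative, not part of any standard library:

```python
def decimal_ref(ch):
    # Decimal character reference: &# + code point + semicolon.
    return "&#%d;" % ord(ch)

def hex_ref(ch):
    # Hexadecimal reference, padded to four digits as is customary.
    return "&#x%04X;" % ord(ch)

assert decimal_ref("π") == "&#960;"
assert hex_ref("π") == "&#x03C0;"
assert decimal_ref("&") == "&#38;"
assert hex_ref("<") == "&#x003C;"
```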
Converting to and from Unicode
Application software that exports XML files, such as Adobe FrameMaker, handles
the conversion to Unicode or UTF-8 automatically. Otherwise you'll need to use
a conversion tool. Sun's freely available Java Development Kit (JDK) includes a
simple command-line utility called native2ascii that converts between many
common and uncommon localized character sets and Unicode.
For example, the following command converts a text file named myfile.txt from
the platform's default encoding to Unicode:
C:> native2ascii myfile.txt myfile.uni
You can specify other encodings with the -encoding option:
C:> native2ascii -encoding Big5 chinese.txt chinese.uni
You can also reverse the process, going from Unicode to a local encoding, with
the -reverse option:
C:> native2ascii -encoding Big5 -reverse chinese.uni chinese.txt
If the output file name is omitted, the converted text is written to standard output.
The native2ascii program also processes Java-style Unicode escapes, in which
characters are embedded as \u09E3. These are not in the same format as XML
numeric character references, though they're similar. If you convert to Unicode
using native2ascii, you can still use XML character references; the viewer will
still recognize them.
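For readers without the JDK at hand, the same escaping scheme can be sketched in Python, whose unicode_escape codec understands Java-style \uXXXX sequences:

```python
# native2ascii writes non-ASCII characters as Java-style \uXXXX escapes.
escaped = r"caf\u00E9 \u03C0"

# Python's unicode_escape codec decodes the same \uXXXX notation.
decoded = escaped.encode("ascii").decode("unicode_escape")
assert decoded == "café π"

# Reversing the process yields ASCII-safe text again, much as native2ascii
# does when run without the -reverse option.
roundtrip = decoded.encode("unicode_escape").decode("ascii")
assert "03c0" in roundtrip
```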
How to Write XML in Other Character Sets
Unless told otherwise, an XML processor assumes that text entity characters are
encoded in UTF-8. Since UTF-8 includes ASCII as a subset, ASCII text is easily parsed
by XML processors as well.
The only character set other than UTF-8 that an XML processor is required to
understand is raw Unicode. If you cannot convert your text into either UTF-8 or
raw Unicode, you can leave the text in its native character set and tell the
XML processor which set that is. This should be a last resort, though, because
there's no guarantee that an arbitrary XML processor can handle other
encodings. Nonetheless, Netscape Navigator and Internet Explorer both do a
pretty good job of interpreting the common character sets.
To warn the XML processor that you’re using a non-Unicode encoding, you include
an encoding attribute in the XML declaration at the start of the file. For example,
to specify that the entire document uses Latin-1 by default (unless overridden by
another processing instruction in a nested entity) you would use this XML
declaration:
<?xml version="1.0" encoding="ISO-8859-1" ?>
You can also include the encoding declaration as part of a separate processing
instruction after the XML declaration but before any character data appears.
<?xml encoding="ISO-8859-1"?>
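To see the encoding declaration at work, this Python sketch stores a small document in Latin-1 and lets a parser decode it by reading the declaration:

```python
import xml.etree.ElementTree as ET

# A document stored in Latin-1 must announce that fact in its declaration.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><note>café</note>'
data = doc.encode("iso-8859-1")  # the é becomes the single byte 0xE9

# A conforming parser reads the declaration and decodes the bytes accordingly.
note = ET.fromstring(data)
assert note.text == "café"

# Without the declaration, a parser would assume UTF-8 and reject the
# document, because the lone 0xE9 byte is not well-formed UTF-8.
```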
Table 7-7 lists the official names of the most common character sets used today, as
they would be given in XML encoding attributes. For encodings not found in this list,
consult the official list maintained by the Internet Assigned Numbers Authority
(IANA) at http://www.isi.edu/in-notes/iana/assignments/character-sets.
Table 7-7
Names of Common Character Sets
Character Set Name Languages/Countries
US-ASCII English
UTF-8 Compressed Unicode
UTF-16 Compressed UCS
ISO-10646-UCS-2 Raw Unicode
ISO-10646-UCS-4 Raw UCS
ISO-8859-1 Latin-1, Western Europe
ISO-8859-2 Latin-2, Eastern Europe
ISO-8859-3 Latin-3, Southern Europe
ISO-8859-4 Latin-4, Northern Europe
ISO-8859-5 ASCII plus Cyrillic
ISO-8859-6 ASCII plus Arabic
ISO-8859-7 ASCII plus Greek
ISO-8859-8 ASCII plus Hebrew
ISO-8859-9 Latin-5, Turkish
ISO-8859-10 Latin-6, ASCII plus the Nordic languages
ISO-8859-11 ASCII plus Thai
ISO-8859-13 Latin-7, ASCII plus the Baltic Rim languages,
particularly Latvian
ISO-8859-14 Latin-8, ASCII plus Gaelic and Welsh
ISO-8859-15 Latin-9, Latin-0; Western Europe
ISO-2022-JP Japanese
Shift_JIS Japanese, Windows
EUC-JP Japanese, Unix
Big5 Chinese, Taiwan
GB2312 Chinese, mainland China
KOI8-R Russian
ISO-2022-KR Korean
EUC-KR Korean, Unix
ISO-2022-CN Chinese
Summary
In this chapter you learned:
✦ Web pages should identify the encoding they use.
✦ What a script is, how it relates to languages, and the four things a script
requires.
✦ How scripts are used in computers with character sets, fonts, glyphs, and
input methods.
✦ What character sets are commonly used on different platforms and that most
are based on ASCII.
✦ How to write XML in Unicode without a Unicode editor (write the document
in ASCII and include Unicode character references).
✦ When writing XML in other encodings, include an encoding attribute in the
XML declaration.
In the next chapter, you’ll begin exploring DTDs and how they enable you to define
and enforce a vocabulary, syntax, and grammar for your documents.
✦ ✦ ✦