Java course - IAG0040




               Text processing,
             Charsets & Encodings




Anton Keks                             2011
String processing
 ●   The following classes provide String processing:
     String, StringBuilder/StringBuffer, StringTokenizer
 ●   All primitives can be converted to/from Strings using
     their wrapper classes (e.g. Integer, Float, etc.)
 ●   java.util.regex provides regular expressions
 ●   The java.text package provides classes and interfaces for
     parsing and formatting text, dates, numbers, and
     messages in a manner independent of natural
     languages
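
A minimal sketch of the wrapper-class conversions mentioned above. The class and method names (PrimitiveStringConversions, parseCount, describe) are hypothetical and chosen only for illustration.

```java
class PrimitiveStringConversions {

    // String -> primitive: each wrapper class has a parseXxx method.
    static int parseCount(String text) {
        return Integer.parseInt(text.trim());
    }

    // primitive -> String: Integer.toString and String.valueOf both work.
    static String describe(int count, double ratio) {
        return Integer.toString(count) + " / " + String.valueOf(ratio);
    }

    public static void main(String[] args) {
        int count = parseCount(" 42 ");
        double ratio = Double.parseDouble("2.5");
        System.out.println(describe(count, ratio));
    }
}
```

Note that the parse methods throw NumberFormatException on malformed input, so user-supplied text usually needs a try/catch around them.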


Java course – IAG0040                                   Lecture 7
Anton Keks                                                Slide 2
Locales
 ●   Java also supports locales, just like most OSs
 ●   A java.util.Locale object represents a specific
     geographical, political, or cultural region.
     –   There is a default locale, which is used by some
         String operations (e.g. toUpperCase) and formatters
         in the java.text package.
     –   A Locale is initialized with an ISO 2-letter language code
         (lower case), an ISO 2-letter country code (upper case),
         and a variant. The latter two are optional
                        ●   e.g. “de”, “et_EE”, “en_GB”
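
A small sketch of why the locale matters even for toUpperCase. It assumes Locale.forLanguageTag (available since Java 7) instead of the Locale constructor; the class and method names are hypothetical.

```java
import java.util.Locale;

class LocaleDemo {

    // Upper-cases text according to the rules of the given locale.
    static String upper(String text, String languageTag) {
        return text.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        Locale estonian = Locale.forLanguageTag("et-EE");
        // Language and country are stored separately; toString() joins them with '_'.
        System.out.println(estonian.getLanguage() + " " + estonian.getCountry() + " " + estonian);

        // In Turkish, the upper-case form of 'i' is the dotted capital I (U+0130).
        System.out.println(upper("title", "en"));
        System.out.println(upper("title", "tr"));
    }
}
```

Because toUpperCase() without arguments uses the default locale, code that processes identifiers or protocol keywords is usually safer passing an explicit locale.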

Localization
 ●   ResourceBundle classes can be used for
     localization of your programs
           –   ResourceBundles contain locale-specific
                objects, e.g. Strings
           –   ListResourceBundle and
                PropertyResourceBundle are simple
                implementations
           –   ResourceBundle.getBundle(...)
                returns a locale-specific bundle
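
A minimal sketch of how a bundle is used once obtained. To stay self-contained it builds PropertyResourceBundle instances from in-memory text rather than calling ResourceBundle.getBundle(...), which would need real .properties files on the classpath. The key "greeting" and the class name are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class BundleDemo {

    // Parses "key=value" lines, the same format as a .properties file.
    static ResourceBundle load(String properties) {
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In a real program these would be e.g. Messages_en.properties and
        // Messages_et.properties, selected by ResourceBundle.getBundle("Messages", locale).
        ResourceBundle english = load("greeting=Hello\n");
        ResourceBundle estonian = load("greeting=Tere\n");

        System.out.println(english.getString("greeting"));
        System.out.println(estonian.getString("greeting"));
    }
}
```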

Natural language comparison
 ●   String.compareTo() does lexicographical
     comparison, i.e. compares character codes
 ●   Collators are used for locale-sensitive
     comparison/sorting, according to the rules of
     the specific language/locale
        –   java.text.Collator implements Comparator<Object>,
              so it can be used for sorting Strings
        –   Use Collator.getInstance(...) for obtaining one
        –   RuleBasedCollator is the common implementation; it
              allows you to specify your own rules
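
A short sketch contrasting the two kinds of comparison described above. The word list and the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollatorDemo {

    // Sorts by character codes, as String.compareTo() does.
    static List<String> sortByCharCodes(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sorts by the rules of the given locale.
    static List<String> sortNaturally(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "apple", "Banana");
        // 'B' (66) sorts before 'a' (97) when comparing character codes.
        System.out.println(sortByCharCodes(words));
        // The English collator orders letters alphabetically regardless of case.
        System.out.println(sortNaturally(words, Locale.ENGLISH));
    }
}
```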

StringBuffer vs String
 ●   A StringBuilder (and StringBuffer) is a mutable String
 ●   Use it when doing complex String processing, especially when
     doing a lot of concatenations in a loop
 ●   The compiler uses StringBuilder internally in place of the '+' operator
     (compilers since Java 9 may use a different but equivalent strategy)
      –   String s = a + b + 25; is compiled roughly as
      –   String s = new StringBuilder()
             .append(a).append(b).append(25).toString();
      –   There are many different append() methods for all primitive types as well as
          any objects. For an arbitrary object, toString() is called.
 ●   StringBuffer, StringBuilder, and String implement CharSequence
 ●   StringBuilder has the same methods as StringBuffer, but is a bit faster,
     because it is not thread safe (not internally synchronized)
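
A minimal sketch of concatenation in a loop with a single StringBuilder, which avoids creating a new String on every iteration. The class and method names are hypothetical.

```java
class JoinDemo {

    // Joins the values with ", " using one mutable buffer.
    static String joinWithCommas(int[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(values[i]); // append(int) converts the primitive itself
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinWithCommas(new int[] {1, 2, 3}));
    }
}
```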


Formatting and Parsing
 ●   Locale-specific formatting and parsing is provided by java.text.
 ●   java.text.Format is an abstract base class for
      –   DateFormat (SimpleDateFormat) – date and time. Calendar is used for
          manipulation of date and time.
      –   NumberFormat (ChoiceFormat, DecimalFormat) – numbers, currencies,
          percentages, etc
      –   MessageFormat – for complex concatenated messages
      –   all of them provide various format and parse methods
      –   all of them can be initialized for the default or specified locale using
          provided static methods
      –   all of them can be created directly, specifying the custom format
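
A brief sketch of locale-specific formatting and parsing with NumberFormat and MessageFormat. The message pattern and the class and method names are hypothetical, and the exact separators come from the locale data shipped with the JDK.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class FormatDemo {

    // Formats a number using the grouping and decimal separators of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Parses text written with the separators of the locale.
    static double parseNumber(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    // Fills the placeholders {0} and {1} of a message pattern.
    static String describe(String directory, int fileCount) {
        MessageFormat format = new MessageFormat("{0} contains {1} files", Locale.US);
        return format.format(new Object[] {directory, fileCount});
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234.5, Locale.US));
        System.out.println(formatNumber(1234.5, Locale.GERMANY));
        System.out.println(parseNumber(formatNumber(1234.5, Locale.GERMANY), Locale.GERMANY));
        System.out.println(describe("docs", 3));
    }
}
```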



Regular expressions
 ●   Regular expressions allow easy searching and matching of textual data.
     They are built into many languages, like Perl and PHP, and are
     widely used in the Unix command line
 ●   Regular expression classes are in the java.util.regex package.
 ●   In Java, they are represented as Strings, but must be 'compiled' by
     Pattern.compile() before use.
 ●   However, many String methods provide convenient 'shortcuts', like
     split(), matches(), replaceFirst(), replaceAll(), etc.
 ●   Pattern is an immutable compiled representation, which can be used for
     creation of mutable Matcher objects.
 ●   Use Pattern directly if you intend to reuse the regexp, so that it is
     compiled only once
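
A minimal sketch of both styles: a Pattern compiled once and reused through Matcher objects, and the String shortcut methods that compile the regexp on each call. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternReuseDemo {

    // Compiled once; Pattern is immutable and safe to share.
    private static final Pattern NUMBER = Pattern.compile("[0-9]+");

    // Each call creates a new mutable Matcher from the shared Pattern.
    static List<String> findNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = NUMBER.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    // String shortcut: convenient for one-off use.
    static String[] splitOnCommas(String text) {
        return text.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("room 12, floor 3"));
        System.out.println(String.join("|", splitOnCommas("a, b ,c")));
        System.out.println("x1 y22".replaceAll("[0-9]+", "#"));
    }
}
```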




Regular Expressions (cont)
 ●   Read the javadoc of the Pattern class!
      –   . (a dot) matches any character (except line terminators by default)
      –   [] can be used for matching any one of the specified characters, e.g. [abc]
      –   \s, \S, \d, \w, etc. save you typing sometimes (note: double escaping
          is needed within String literals, e.g. “\\s”)
      –   ?, +, * specify the number of occurrences of the preceding element:
          0 or 1, 1 or more, any number respectively
      –   () defines groups (they can be accessed individually)
      –   | means 'or', e.g. (dog|cat) matches both “dog” and “cat”
      –   ^ and $ match the beginning and end of the input, respectively
          (of each line in MULTILINE mode)
      –   \b matches a word boundary
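
A small sketch that combines the constructs listed above (\d, \s, +, *, ?, groups, |, ^, $ and \b), with the backslashes doubled inside the String literals. The patterns and the class and method names are hypothetical examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSyntaxDemo {

    // One or more digits, optional whitespace, "dog" or "cat", optional plural "s".
    private static final Pattern PET_COUNT = Pattern.compile("^(\\d+)\\s*(dog|cat)s?$");

    // "cat" only as a whole word, not inside "concatenate".
    private static final Pattern WHOLE_WORD_CAT = Pattern.compile("\\bcat\\b");

    // Returns {count, animal} taken from groups 1 and 2, or null if there is no match.
    static String[] parsePetCount(String text) {
        Matcher matcher = PET_COUNT.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new String[] {matcher.group(1), matcher.group(2)};
    }

    static int countWholeWordCat(String text) {
        int count = 0;
        Matcher matcher = WHOLE_WORD_CAT.matcher(text);
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] parsed = parsePetCount("3 dogs");
        System.out.println(parsed[0] + " x " + parsed[1]);
        System.out.println(countWholeWordCat("cat concatenate cat."));
    }
}
```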


Scanning
 ●   java.util.Scanner can be used for parsing Strings, InputStreams, Readers, or
     Files
 ●   It uses either built-in or custom regular expressions for parsing input data
     and is sensitive to either the default or a specified Locale
 ●   The default delimiter is whitespace (like “\\s+”); a custom delimiter may be
     set using the useDelimiter() method
 ●   It implements Iterator<String>, therefore has hasNext() and next()
     methods, various type-specific methods, e.g. hasNextInt(), nextInt(),
     etc., as well as finding and skipping facilities
 ●   Can be used for parsing the standard input:
      –   Scanner s = new Scanner(System.in);
          int n = s.nextInt();
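
A minimal sketch of Scanner on String input, using the type-specific methods and a custom delimiter. The input strings and the class and method names are hypothetical; Locale.US is set explicitly so number parsing does not depend on the default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerDemo {

    // Sums the integer tokens and skips everything else.
    static int sumInts(String input) {
        int sum = 0;
        try (Scanner scanner = new Scanner(input).useLocale(Locale.US)) {
            while (scanner.hasNext()) {
                if (scanner.hasNextInt()) {
                    sum += scanner.nextInt();
                } else {
                    scanner.next(); // not an int: consume and ignore the token
                }
            }
        }
        return sum;
    }

    // Splits on commas with optional surrounding whitespace.
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        try (Scanner scanner = new Scanner(line).useDelimiter("\\s*,\\s*")) {
            while (scanner.hasNext()) {
                fields.add(scanner.next());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("10 apples 20 pears 12"));
        System.out.println(splitCsv("a , b,c"));
    }
}
```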




Charsets and encodings
 ●   In the 21st century, there is no excuse for any programmer
     not to know charsets and encodings well
 ●   Charsets map glyphs (symbols) to numeric codes
 ●   Charsets are represented by character encodings (actual
     bits and bytes that are stored in files)
 ●   Fonts must support charsets in order to display texts in
     respective encodings properly
 ●   Example:
      –   Glyph (symbol): A
      –   Numeric code: 65              (ASCII charset)
      –   Encoding: 0x41 == 1000001 b   (ASCII 7-bit encoding)
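
The example above can be reproduced in a few lines. This sketch assumes java.nio.charset.StandardCharsets (Java 7 and later); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class AsciiDemo {

    // The numeric code of a character.
    static int codeOf(char symbol) {
        return symbol;
    }

    // The bytes that represent the text in the US-ASCII encoding.
    static byte[] encodeAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte encoded = encodeAscii("A")[0];
        System.out.println(codeOf('A'));                     // numeric code
        System.out.println(Integer.toHexString(encoded));    // hexadecimal form of the byte
        System.out.println(Integer.toBinaryString(encoded)); // the 7 bits
    }
}
```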
ASCII
 ●   American Standard Code for Information Interchange
 ●   Created in 1963, ANSI in 1967, ISO-646 in 1972
 ●   Allowed for text exchange between computers
 ●   Only 7 bits are defined, nowadays called US-ASCII
 ●   0-31 and 127 – control chars
 ●   32-126 – printable (32 is the space)
 ●   Was designed for the English language



ASCII extensions
 ●   ASCII is enough only for Latin, English, Hawaiian and Swahili
 ●   For most other languages a number of 8-bit ASCII extensions
     were developed, incompatible with each other
 ●   ISO-8859 was an attempt to standardize them by defining the
     upper 128 characters in 8-bit wide bytes
      –   All of them have the first 128 codes (7 bits) the same as ASCII
      –   ISO-8859-1 (Latin-1) – Western European
      –   ISO-8859-4 – Northern European, ISO-8859-13 – Baltic,
          WIN-1257 – MS Baltic (modified ISO)
      –   ISO-8859-5, KOI8-R – Cyrillic,
          WIN-1251 – MS Cyrillic (different from ISO)
      –   Many of them are still used today in legacy systems or formats
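
A small sketch of why these extensions are incompatible: the same byte decodes to different characters depending on the charset. It assumes the JDK in use provides the windows-1251 charset (common, but not guaranteed by the specification); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class LegacyCharsetDemo {

    // Decodes raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE4};
        // Upper half of the byte range: meaning depends on the charset.
        System.out.println(decode(raw, "ISO-8859-1"));   // Latin small a with diaeresis
        System.out.println(decode(raw, "windows-1251")); // Cyrillic small de

        byte[] ascii = {0x41};
        // Lower half: identical to ASCII in both.
        System.out.println(decode(ascii, "ISO-8859-1") + decode(ascii, "windows-1251"));
    }
}
```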
Unicode (UCS, ISO-10646)
 ●   Unicode solves the problem of incompatible charsets
 ●   Unicode defines standardized numeric codes (code
     points) for most glyphs used in the world
      –   Code points are abstract – they don't define representation
      –   First 256 code points correspond to ISO-8859-1
      –   16 bit BMP (Basic Multilingual Plane) – most modern
          languages (including Chinese, Japanese, etc.)
      –   More planes for other scripts (mathematical symbols,
          musical notation, ancient alphabets, etc.)
 ●   Apart from UCS, Unicode defines formatting and
     combining rules as well (e.g. for bidirectional text)
Unicode encodings
 ●   Define representation of code points in bits and bytes
 ●   Fixed-width UCS-2 (2 bytes, BMP only) and UCS-4 (4 bytes)
 ●   UTF (Unicode Transformation Format)
     –   All of them can encode any Unicode code point
     –   UTF-8 – variable size from 1 to 4 bytes (the original design
         allowed up to 6; no longer than 3 bytes for BMP code points),
         compatible with ASCII, the most popular and compact
     –   UTF-16 – 2 or 4 bytes, 2 bytes for BMP code points, 4 bytes
         for other planes
     –   UTF-32 – constant size, 4 bytes per character, 'raw' Unicode
     –   UTF-7 – 7-bit safe encoding (less popular nowadays)
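
A minimal sketch that measures how many bytes the same text takes in UTF-8 and UTF-16. It uses StandardCharsets.UTF_16BE rather than UTF_16, because the latter prepends a 2-byte byte order mark when encoding; the sample characters and names are hypothetical choices.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {

    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII
            "\u00e4",       // U+00E4, Latin small a with diaeresis
            "\u20ac",       // U+20AC, euro sign
            "\uD834\uDD1E"  // U+1D11E, musical G clef, outside the BMP
        };
        for (String sample : samples) {
            System.out.println(sample + ": UTF-8 " + utf8Length(sample)
                    + " bytes, UTF-16 " + utf16Length(sample) + " bytes");
        }
    }
}
```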
Charsets and Java
 ●   char and String are UTF-16
      –   Beware that length(), indexOf(), etc. operate on chars (UTF-16 code
          units), not Unicode code points, and therefore can return 'logically
          wrong' values for 4-byte characters, which are stored as a surrogate
          pair of two chars – this was a performance decision
 ●   Encoding conversions are built-in
      –   Encoded text is binary data for Java, therefore stored in bytes
      –   There always exists a default encoding (traditionally the one the
          OS uses; UTF-8 since Java 18)
      –   Charset class is provided for encoding/decoding, enumeration, etc.
      –   s.getBytes(...) - encodes a String
      –   new String(...) - decodes raw bytes to a String
      –   System.out and System.in automatically convert to/from the default
          encoding
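
A short sketch of the two points above: length() counts chars rather than code points, and getBytes/new String convert between Strings and encoded bytes. Passing the charset explicitly avoids depending on the default encoding. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JavaStringUnicodeDemo {

    // Number of Unicode code points, as opposed to length(), which counts chars.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // Encodes to bytes and decodes back with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, stored as a surrogate pair
        System.out.println(clef.length());    // chars
        System.out.println(codePoints(clef)); // code points

        // UTF-8 can represent the euro sign, US-ASCII cannot:
        // unmappable characters are replaced with '?' when encoding.
        System.out.println(roundTrip("\u20ac", StandardCharsets.UTF_8));
        System.out.println(roundTrip("\u20ac", StandardCharsets.US_ASCII));
    }
}
```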

Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
 

Java Course 7: Text processing, Charsets & Encodings

  • 1. Java course - IAG0040 Text processing, Charsets & Encodings Anton Keks 2011
  • 2. String processing
      ● The following classes provide String processing: String, StringBuilder/StringBuffer, StringTokenizer
      ● All primitives can be converted to/from Strings using their wrapper classes (e.g. Integer, Float, etc.)
      ● java.util.regex provides regular expressions
      ● The java.text package provides classes and interfaces for parsing and formatting text, dates, numbers, and messages in a manner independent of natural languages
      Java course – IAG0040 Lecture 7 Anton Keks Slide 2
  • 3. Locales
      ● Java also supports locales, just like most operating systems
      ● A java.util.Locale object represents a specific geographical, political, or cultural region
          – There is a default locale, which is used by some String operations (e.g. toUpperCase) and by the formatters in the java.text package
          – A Locale is initialized with an ISO 2-letter language code (lower case), an ISO 2-letter country code (upper case), and a variant. The latter two are optional
          – e.g. “de”, “et_EE”, “en_GB”
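The effect of the locale on String operations can be sketched as follows. This is a minimal illustration, not part of the original slides: the class name LocaleDemo and the helper upper() are hypothetical, and Locale.forLanguageTag() (available since Java 7) is used here in place of the Locale constructors.

```java
import java.util.Locale;

final class LocaleDemo {
    // Upper-cases text with the rules of an explicit locale instead of the default one.
    static String upper(String text, Locale locale) {
        return text.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // Language "et" (lower case) plus country "EE" (upper case).
        Locale estonian = Locale.forLanguageTag("et-EE");
        System.out.println(estonian + ": " + estonian.getLanguage() + " / " + estonian.getCountry());

        // In Turkish, lower-case 'i' upper-cases to a dotted capital I (U+0130),
        // so the same call gives different results for different locales.
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(upper("title", Locale.ROOT));
        System.out.println(upper("title", turkish));
    }
}
```

Passing the locale explicitly, as above, avoids surprises when the same program runs on a machine with a different default locale.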
  • 4. Localization
      ● ResourceBundle classes can be used for localization of your programs
          – ResourceBundles contain locale-specific objects, e.g. Strings
          – ListResourceBundle and PropertyResourceBundle are simple implementations
          – ResourceBundle.getBundle(...) returns a locale-specific bundle
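A ListResourceBundle can be sketched like this. The bundle classes, the "greeting" key and the helper bundleFor() are hypothetical names chosen for illustration; the selection is done by hand here only to keep the example in one file, whereas ResourceBundle.getBundle(...) would normally locate the right bundle class or properties file by name and locale.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

final class BundleDemo {
    // Hand-written stand-in for the lookup that ResourceBundle.getBundle(...) automates.
    static ResourceBundle bundleFor(String language) {
        return "et".equals(language) ? new GreetingsEt() : new GreetingsEn();
    }

    public static void main(String[] args) {
        System.out.println(bundleFor("en").getString("greeting"));
        System.out.println(bundleFor("et").getString("greeting"));
    }
}

// Hypothetical English bundle: key/value pairs returned as a two-column array.
final class GreetingsEn extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {{"greeting", "Hello"}};
    }
}

// Hypothetical Estonian bundle with the same key and a translated value.
final class GreetingsEt extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {{"greeting", "Tere"}};
    }
}
```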
  • 5. Natural language comparison
      ● String.compareTo() does lexicographical comparison, i.e. it compares character codes
      ● Collators are used for locale-sensitive comparison/sorting, according to the rules of the specific language/locale
          – java.text.Collator implements Comparator<Object>, so it can be passed wherever a Comparator for Strings is expected
          – Use Collator.getInstance(...) to obtain one
          – RuleBasedCollator is the common implementation and allows you to specify your own rules
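The difference between the two kinds of comparison can be sketched as below. The class CollatorDemo and its two helper methods are hypothetical; only Collator.getInstance(...) and the standard collection calls come from the JDK.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class CollatorDemo {
    // Natural String order: compares char codes, so all upper-case letters sort before lower-case ones.
    static List<String> sortByCharCodes(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // Locale-sensitive order: the Collator applies the alphabet rules of the given locale.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "apple", "Banana");
        System.out.println(sortByCharCodes(words));               // [Banana, apple, cherry]
        System.out.println(sortForLocale(words, Locale.ENGLISH)); // [apple, Banana, cherry]
    }
}
```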
  • 6. StringBuffer vs String
      ● A StringBuilder (and StringBuffer) is a mutable String
      ● Always use it when doing complex String processing, especially when doing a lot of concatenations in a loop
      ● Java uses StringBuilder internally in place of the '+' operator (since Java 9, javac uses invokedynamic instead, with the same result)
          – String s = a + b + 25; is the same as
          – String s = new StringBuilder().append(a).append(b).append(25).toString();
          – There are many different append() methods for all primitive types as well as for any object. For an arbitrary object, toString() is called
      ● StringBuffer, StringBuilder, and String implement CharSequence
      ● StringBuilder has the same methods as StringBuffer, but is a bit faster, because it is not thread-safe (not internally synchronized)
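Concatenation in a loop, the case singled out above, might look like this with a single StringBuilder. The class JoinDemo and the method joinNumbers() are hypothetical names used only for this sketch.

```java
final class JoinDemo {
    // Builds "0, 1, 2, ..." with one StringBuilder instead of creating a new String on every iteration.
    static String joinNumbers(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(i); // append(int) converts the primitive to text
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinNumbers(4)); // 0, 1, 2, 3
    }
}
```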
  • 7. Formatting and Parsing
      ● Locale-specific formatting and parsing is provided by the java.text package
      ● java.text.Format is an abstract base class for
          – DateFormat (SimpleDateFormat) – date and time. Calendar is used for manipulation of date and time
          – NumberFormat (ChoiceFormat, DecimalFormat) – numbers, currencies, percentages, etc.
          – MessageFormat – for complex concatenated messages
          – All of them provide various format and parse methods
          – All of them can be initialized for the default or a specified locale using the provided static methods
          – All of them can be created directly, specifying a custom format
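Locale-specific number formatting and parsing can be sketched with NumberFormat as follows. The class FormatDemo and its helpers are hypothetical; wrapping ParseException in an unchecked exception is a simplification for the example, not a recommendation from the slides.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

final class FormatDemo {
    // Formats a number with the grouping and decimal separators of the given locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Parses text written with the separators of the given locale.
    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number for " + locale + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(1234567.5, Locale.US));      // 1,234,567.5
        System.out.println(format(1234567.5, Locale.GERMANY)); // 1.234.567,5
        System.out.println(parse("1.234,5", Locale.GERMANY));  // 1234.5
    }
}
```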
  • 8. Regular expressions
      ● Regular expressions allow easy searching and matching of textual data. They are built into many languages, like Perl and PHP, and are widely used on the Unix command line
      ● Regular expression classes are in the java.util.regex package
      ● In Java, they are represented as Strings, but must be 'compiled' by Pattern.compile() before use
      ● However, many String methods provide convenient 'shortcuts', like split(), matches(), replaceFirst(), replaceAll(), etc.
      ● Pattern is an immutable compiled representation, which can be used to create mutable Matcher objects
      ● Use Patterns directly if you intend to reuse the regexp
  • 9. Regular Expressions (cont)
      ● Read the javadoc of the Pattern class!
          – . (a dot) matches any character (except line terminators, unless Pattern.DOTALL is set)
          – [] can be used for matching any one of the specified characters
          – \s, \S, \d, \w, etc. save you typing sometimes (note: double escaping is needed within String literals, e.g. “\\s”)
          – ?, +, * specify the number of occurrences of the preceding element: 0 or 1, 1 or more, any number respectively
          – () defines groups (they can be accessed individually)
          – | means 'or', e.g. (dog|cat) matches both “dog” and “cat”
          – ^ and $ match the beginning and end of the input (of each line with Pattern.MULTILINE), respectively
          – \b matches a word boundary
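A reusable Pattern with groups, next to the String shortcuts, can be sketched as below. The class RegexDemo, the constant ISO_DATE and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexDemo {
    // Compiled once and reused. Note the double escaping of \b and \d inside the String literal.
    private static final Pattern ISO_DATE = Pattern.compile("\\b(\\d{4})-(\\d{2})-(\\d{2})\\b");

    // Collects group 1 (the year) of every date-like match found in the text.
    static List<String> years(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = ISO_DATE.matcher(text); // a Matcher is mutable and bound to one input
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(years("from 2011-01-15 to 2012-06-30")); // [2011, 2012]

        // String shortcuts compile the pattern on every call.
        System.out.println("dog".matches("(dog|cat)"));             // true
        System.out.println("dogs".matches("(dog|cat)"));            // false: matches() tests the whole String
        System.out.println(Arrays.toString("a1b22c".split("\\d+"))); // [a, b, c]
    }
}
```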
  • 10. Scanning
      ● java.util.Scanner can be used for parsing Strings, InputStreams, Readers, or Files
      ● It uses either built-in or custom regular expressions for parsing input data and is sensitive to either the default or a specified Locale
      ● The default delimiter is whitespace; a custom delimiter may be set using the useDelimiter() method
      ● It implements Iterator<String>, therefore it has hasNext() and next() methods, as well as various type-specific methods, e.g. hasNextInt(), nextInt(), etc., and finding and skipping facilities
      ● Can be used for parsing the standard input:
          – Scanner s = new Scanner(System.in); int n = s.nextInt();
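Scanning a String works the same way as scanning the standard input and is easier to try out. In this sketch the class ScannerDemo and both helper methods are hypothetical; Locale.ROOT is set explicitly so that number parsing does not depend on the default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

final class ScannerDemo {
    // Sums every whitespace-separated token that parses as an int and skips the rest.
    static int sumInts(String input) {
        int sum = 0;
        try (Scanner scanner = new Scanner(input).useLocale(Locale.ROOT)) {
            while (scanner.hasNext()) {
                if (scanner.hasNextInt()) {
                    sum += scanner.nextInt();
                } else {
                    scanner.next(); // consume the non-numeric token
                }
            }
        }
        return sum;
    }

    // Splits the input on a custom delimiter, given as a regular expression.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter(delimiterRegex)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("10 20 abc 30")); // 60
        System.out.println(tokens("a,b,c", ","));    // [a, b, c]
    }
}
```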
  • 11. Charsets and encodings
      ● In the 21st century, there is no excuse for any programmer not to know charsets and encodings well
      ● Charsets map glyphs (symbols) to numeric codes
      ● Charsets are represented by character encodings (the actual bits and bytes that are stored in files)
      ● Fonts must support charsets in order to display texts in the respective encodings properly
      ● Example:
          – Glyph (symbol): A
          – Numeric code: 65 (ASCII charset)
          – Encoding: 0x41 == 1000001b (ASCII 7-bit encoding)
  • 12. ASCII
      ● American Standard Code for Information Interchange
      ● Created in 1963; ANSI standard in 1967; ISO-646 in 1972
      ● Allowed for text exchange between computers
      ● Only 7 bits are defined; nowadays called US-ASCII
      ● 0-31 and 127 – control chars
      ● 32 – space, 33-126 – printable
      ● Was designed for the English language
  • 13. ASCII extensions
      ● ASCII is sufficient only for Latin, English, Hawaiian and Swahili
      ● For most other languages, a number of 8-bit ASCII extensions were developed, incompatible with each other
      ● ISO-8859 was an attempt to standardize them by defining the upper 128 characters in 8-bit wide bytes
          – All of them keep the lower 128 codes (7 bits) the same as ASCII
          – ISO-8859-1 (Latin-1) – Western European
          – ISO-8859-4 – Northern, ISO-8859-13 – Baltic, WIN-1257 – MS Baltic (modified ISO)
          – ISO-8859-5, KOI8-R – Cyrillic, WIN-1251 – MS Cyrillic (different from ISO)
          – Many of them are still used today in legacy systems or formats
  • 14. Unicode (UCS, ISO-10646)
      ● Unicode solves the problem of incompatible charsets
      ● Unicode defines standardized numeric codes (code points) for most glyphs used in the world
          – Code points are abstract – they don't define representation
          – The first 256 code points correspond to ISO-8859-1
          – 16-bit BMP (Basic Multilingual Plane) – most modern languages (including Chinese, Japanese, etc.)
          – More planes for other scripts (mathematical symbols, musical notation, ancient alphabets, etc.)
      ● Apart from UCS, Unicode defines formatting and combining rules as well (e.g. for bidirectional text)
  • 15. Unicode encodings
      ● Define the representation of code points in bits and bytes
      ● Fixed-width UCS-2 (2 bytes, BMP only) and UCS-4 (4 bytes)
      ● UTF (Unicode Transformation Format)
          – All of them can encode any Unicode code point
          – UTF-8 – variable size from 1 to 4 bytes (the original design allowed up to 6; usually no longer than 3 bytes, compatible with ASCII), the most popular and compact
          – UTF-16 – 2 or 4 bytes: 2 bytes for BMP code points, 4 bytes for other planes
          – UTF-32 – constant size, 4 bytes per character, 'raw' Unicode
          – UTF-7 – 7-bit safe encoding (less popular nowadays)
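The variable sizes listed above can be checked with a short sketch. The class EncodingSizes and the helper encodedLength() are hypothetical; the sample characters are written as Unicode escapes so that the example does not depend on the encoding of the source file, and UTF-16BE is used because plain UTF-16 would add a 2-byte byte order mark when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Number of bytes the text occupies in the given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII
            "\u00E4",       // U+00E4, a with diaeresis (Latin-1 range)
            "\u20AC",       // U+20AC, euro sign (BMP)
            "\uD83D\uDE00"  // U+1F600, outside the BMP (surrogate pair in Java)
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```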
  • 16. Charsets and Java
      ● char and String are UTF-16
          – Beware that length(), indexOf(), etc. operate on chars (UTF-16 code units), not Unicode code points, and therefore can return 'logically wrong' values for 4-byte characters, which are stored as two chars (a surrogate pair) – this was a performance decision
      ● Encoding conversions are built-in
          – Encoded text is binary data for Java, therefore it is stored in bytes
          – There is always a default encoding (historically the one the OS uses)
          – The Charset class is provided for encoding/decoding, enumeration, etc.
          – s.getBytes(...) – encodes a String
          – new String(...) – decodes raw bytes to a String
          – System.out and System.in automatically convert to/from the default encoding
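Both pitfalls, char counts versus code points and decoding with the wrong charset, can be sketched as follows. The class CharsetDemo and its helper methods are hypothetical names for this illustration; the charsets are always passed explicitly so that the result does not depend on the default encoding.

```java
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
    // Number of Unicode code points, as opposed to length(), which counts UTF-16 chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Encodes to UTF-8 bytes and decodes them back with the same charset.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encodes to UTF-8 but decodes as ISO-8859-1: the classic source of garbled text.
    static String misdecode(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, one code point stored as two chars
        System.out.println(emoji.length());    // 2
        System.out.println(codePoints(emoji)); // 1

        String umlaut = "\u00E4"; // a with diaeresis
        System.out.println(roundTrip(umlaut).equals(umlaut)); // true
        System.out.println(misdecode(umlaut).length());       // 2: each UTF-8 byte became a separate char
    }
}
```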