Unicode basics in Python

Unicode basics in python Presentation Transcript

  • 1. Unicode in Python
  • 2. What we cover ● Unicode history ● Terminology (code point, BOM, UTF-8, UTF-16) ● Decoding and encoding in Python ● How Django handles these ● Helpful Python modules for dealing with it Note: the BOM (byte order mark) matters mainly for UTF-16, since each code unit spans multiple bytes and the byte order has to be known
  • 3. How did it come about? Americans came up with the 7-bit ASCII representation, covering only the English alphabet, as a standard for exchanging information ('A' - 65, 'a' - 97). The rest of the world encoded their own characters, such as accented letters ('ä'), each in their own incompatible way, which made a mess.
  • 4. What led to Unicode? To exchange information in all languages, there were some requirements ● A unique and simple rule was needed ● Adoptable across all machines (Windows, IBM, etc.) ● Storage as efficient as possible
  • 5. Unicode Unicode = UCS (Universal Character Set) + bit representation logic UCS: character + code point ('a', 97) Bit representation: the BOM (byte order mark) signals big-endian (or) little-endian byte order: 00 48 00 65 00 6C 00 6C 00 6F (or) 48 00 65 00 6C
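As a rough illustration of slide 5, the sketch below looks up a code point and prints the two UTF-16 byte orders for "Hello". It uses Python 3 syntax (the slides themselves use Python 2), and `hex_bytes` is a hypothetical helper introduced only for display.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# A character and its code point (the UCS part of the slide).
code_point = ord("a")
char = chr(97)

# The same text in both UTF-16 byte orders, written without a BOM.
big_endian = "Hello".encode("utf-16-be")
little_endian = "Hello".encode("utf-16-le")

print(hex_bytes(big_endian))     # 00 48 00 65 00 6c 00 6c 00 6f
print(hex_bytes(little_endian))  # 48 00 65 00 6c 00 6c 00 6f 00

# The BOM is U+FEFF written at the start of the data; the order of its
# two bytes tells the reader which variant follows.
print(hex_bytes(codecs.BOM_UTF16_BE))  # fe ff
print(hex_bytes(codecs.BOM_UTF16_LE))  # ff fe
```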
  • 6. UTF-8 is popular because ● it is a multi-byte encoding ● it is a variable-width encoding ● a code point takes up to 4 bytes ● it mostly needs no BOM (8-bit code units have no byte order) ● it is memory efficient for ASCII-heavy text How? For non-ASCII characters, the leading bits of the first byte indicate how many bytes the character uses (similar in spirit to compression)
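The "first byte tells the length" rule from slide 6 can be sketched as follows (Python 3 syntax). The function name `utf8_length_from_lead_byte` is made up for this example; it only inspects the leading bits and does not validate the continuation bytes.

```python
def utf8_length_from_lead_byte(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence uses, judged by its first byte."""
    if lead >> 7 == 0b0:        # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


# 'a', 'ä' (U+00E4), '€' (U+20AC), and an emoji (U+1F600).
samples = ["a", "\u00e4", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    predicted = utf8_length_from_lead_byte(encoded[0])
    lengths.append(predicted)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), lead byte predicts {predicted}")
```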
  • 7. Decoding Conversion from bytes to characters (code points) ● from <type 'str'> to <type 'unicode'> (Python 2) ● failures mostly surface as “UnicodeDecodeError:” (samples demo)
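A minimal decoding sketch in Python 3 terms, where the byte type is `bytes` and the text type is `str` (the Python 2 equivalents on the slide are `str` and `unicode`). The sample bytes are an assumption chosen for illustration.

```python
raw = b"caf\xc3\xa9"            # UTF-8 bytes for "café"

# Decoding with the right codec yields text (code points).
text = raw.decode("utf-8")

# Decoding with the wrong codec raises UnicodeDecodeError.
try:
    raw.decode("ascii")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    error_position = exc.start  # index of the first offending byte

# An error handler trades the exception for placeholder characters.
lenient = raw.decode("ascii", errors="replace")
```

With `errors="replace"`, each undecodable byte becomes U+FFFD, so data is lost silently; that is usually only acceptable for display or logging.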
  • 8. Encoding ● Conversion from characters (code points) to bytes ● from <type 'unicode'> to <type 'str'> (Python 2) ● failures mostly surface as “UnicodeEncodeError:” (samples demo)
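The reverse direction, again as a Python 3 sketch with an illustrative sample string: text is encoded to bytes, and a codec that cannot represent a character raises `UnicodeEncodeError`.

```python
text = "caf\u00e9"                      # "café" as text (code points)

# UTF-8 can represent every code point, so this always succeeds.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent 'é', so encoding fails.
try:
    text.encode("ascii")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    bad_char = exc.object[exc.start]    # the character that could not be encoded

# With errors="replace" the unencodable character becomes "?".
fallback = text.encode("ascii", errors="replace")
```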
  • 9. Rules to remember… ● Decode early, Unicode everywhere, encode late ● UTF-8 is the best guess for an unknown encoding ● chardet.detect() can guess when you do not know ========================== In Python 3 this is largely solved… ● str (<class 'str'>) is a Unicode object
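"Decode early, Unicode everywhere, encode late" might look like the sketch below in Python 3. The function `shout` and its input are hypothetical; the point is that bytes are converted to text at the boundary, processed as text, and converted back only on the way out.

```python
def shout(raw_input: bytes, encoding: str = "utf-8") -> bytes:
    """Upper-case some encoded text using the decode/process/encode pattern."""
    text = raw_input.decode(encoding)   # decode early, at the input boundary
    result = text.upper()               # work on text (code points) only
    return result.encode(encoding)      # encode late, at the output boundary


raw = b"caf\xc3\xa9"                    # UTF-8 bytes for "café"

correct = shout(raw)

# Skipping the decode step operates on raw bytes: bytes.upper() only
# touches ASCII letters, so the encoded 'é' is left in lower case.
naive = raw.upper()
```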
  • 10. How Django handles it
    >>> def to_unicode(obj, encoding='utf-8'):
    ...     if isinstance(obj, basestring):
    ...         if not isinstance(obj, unicode):
    ...             obj = unicode(obj, encoding)
    ...     return obj
    smart_text(s, encoding='utf-8', strings_only=False, errors='strict')
    force_text(s, encoding='utf-8', strings_only=False, errors='strict')
    smart_bytes(s, encoding='utf-8',
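The `to_unicode` helper on slide 10 is Python 2 code (`basestring` and `unicode` do not exist in Python 3). A rough Python 3 analogue, under the assumption that only `bytes` input needs converting, could look like this; `to_text` is a name invented here and is not Django's implementation.

```python
def to_text(obj, encoding="utf-8", errors="strict"):
    """Return obj as text if it is bytes; pass anything else through unchanged."""
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors)
    return obj


from_bytes = to_text(b"\xc3\xa4")   # UTF-8 bytes for 'ä'
from_text = to_text("already text")
from_other = to_text(42)            # non-string objects are returned as-is
```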
  • 11. How to set your Python default encoding? (Python 2 only)
    >>> import sys
    >>> reload(sys)
    >>> sys.setdefaultencoding('utf-8')
    >>> sys.getdefaultencoding()
    'utf-8'
    (or)
    # -*- coding: utf-8 -*-
    (tells Python that you saved <mod_name.py> in UTF-8; this declares the source file's encoding)
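In Python 3 there is no `sys.setdefaultencoding`, and `sys.getdefaultencoding()` already reports UTF-8. A common alternative, sketched below with a throwaway temporary file, is to state the encoding explicitly wherever bytes enter or leave the program.

```python
import os
import sys
import tempfile

default = sys.getdefaultencoding()      # 'utf-8' on Python 3

# Hypothetical sample file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Write text with an explicit encoding instead of relying on a default.
with open(path, "w", encoding="utf-8") as fh:
    fh.write("na\u00efve")              # "naïve"

# The bytes on disk are UTF-8 ...
with open(path, "rb") as fh:
    raw = fh.read()

# ... and reading with the same explicit encoding returns the original text.
with open(path, encoding="utf-8") as fh:
    text = fh.read()
```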
  • 12. Related Python modules ● chardet (chardet.detect()) ● unicodedata ● codecs
  • 13. Thanks for your time. Post your questions.
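A short tour of the two standard-library modules from slide 12, in Python 3 syntax. `chardet` is a third-party package and is left out so the sketch runs with the standard library alone.

```python
import codecs
import unicodedata

# unicodedata: look up the official name of a character.
name = unicodedata.name("\u00e4")

# unicodedata: normalization converts between composed and decomposed forms.
decomposed = unicodedata.normalize("NFD", "\u00e4")   # 'a' + combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)   # back to a single code point

# codecs: resolve an encoding alias to its canonical codec name.
canonical = codecs.lookup("UTF8").name

print(name)
print(len(decomposed), len(composed))
print(canonical)
```

Normalizing both sides before comparing strings avoids mismatches between visually identical text stored in different forms.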
  • 14. samples demo….
  • 15. screenshot2