
Revision aff28f5f

ID: aff28f5f747b05a2f5cf61b5d0c43d8cb4e0cb6d

Added by Georgios D. Tsoukalas about 11 years ago

uenc: utility function for encoding unicode to str

1. Motivation.

Too often a programmer outputs an object that may be either str or
unicode. Python's default 'ascii' encoding cannot represent all
unicode text, but nothing fails until such text is first encountered.

A popular response is to force unicode into a UTF-8 encoding,
but this forcing breaks cases where other encodings may be desired
or needed (e.g. user terminal settings, CSV files).

The 'force' approach is sufficient for application data, since the
application decides for itself how to handle it consistently.

However, forcing an encoding on user-facing output
(e.g. terminal, notification messages) disrespects the user's
configuration.
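
The failure mode and the cost of forcing UTF-8 can be illustrated with a
small sketch. The sample strings and the choice of ISO-8859-7 as the
"user's terminal encoding" are assumptions made for illustration only;
the code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustration only: the sample text and the ISO-8859-7 "terminal" are assumed.

plain = u"hello"
greek = u"\u03b3\u03b5\u03b9\u03b1"  # four Greek letters

# ASCII-only text encodes without trouble, so the problem stays hidden.
ascii_bytes = plain.encode("ascii")

# The same call fails as soon as non-ASCII text shows up.
try:
    greek.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Forcing UTF-8 always succeeds, but the bytes differ from what a
# terminal configured for ISO-8859-7 would expect to receive.
utf8_bytes = greek.encode("utf-8")
greek_terminal_bytes = greek.encode("iso-8859-7")
```

Here the forced UTF-8 output is twice as long as the ISO-8859-7 output
and would be displayed as garbage on a terminal expecting the latter.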

2. Approach.

uenc() will honor the user's configuration as defined through
the POSIX call setlocale(), which expects the user's preference
in the LC_* environment variables. LC_CTYPE is most relevant here.

However, these preferences are not honored automatically; a call to
setlocale() must first be made. Therefore, if the locale is not set,
it will be set when uenc()'s parent module is imported.
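
The approach described above could be sketched roughly as follows. This
is not the committed implementation: the helper name init_locale(), the
PREFERRED_ENCODING constant, the optional encoding argument and the
handling of non-text objects are all assumptions. The text_type shim is
only there so the sketch runs on both Python 2 and Python 3.

```python
import locale

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3


def init_locale():
    """Adopt the user's LC_CTYPE preference if no locale is set yet (assumed helper)."""
    if locale.setlocale(locale.LC_CTYPE) in ("C", "POSIX"):
        try:
            # An empty string means "use the LC_* environment variables".
            locale.setlocale(locale.LC_CTYPE, "")
        except locale.Error:
            pass  # invalid user settings: stay with the default locale
    return locale.getpreferredencoding()


# Runs once, at import time of the module that defines uenc().
PREFERRED_ENCODING = init_locale()


def uenc(thing, encoding=None):
    """Return a byte string; encode only when the input is unicode text."""
    if isinstance(thing, text_type):
        return thing.encode(encoding or PREFERRED_ENCODING)
    if isinstance(thing, bytes):
        # Already encoded (str on Python 2): leave it untouched.
        return thing
    # Any other object: take its text representation, then encode it.
    return text_type(thing).encode(encoding or PREFERRED_ENCODING)
```

A caller would wrap output in uenc() without checking the type first,
for example sys.stdout.write(uenc(message)) on Python 2. Text that has
already been encoded elsewhere passes through unchanged.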

Guidelines

- Programmers who wish to honor the preferences, but do not want to
care about either str or unicode (or even another object) can call
uenc() to encode (if needed) all their output text objects.

- Programmers who want to force a specific encoding must
immediately encode their text. Later calls to uenc() on that text
from other modules will not touch the resulting str objects.

- Programmers who want to honor the configuration of their output
devices (e.g. a file with its encoding attribute set) must either
trust the LC_* configuration and use uenc(), or trust the configuration
of the output and NOT use uenc().

- Programmers who manage storage of internal application data are better
off encoding all their text to UTF-8 and never forwarding unicode, which
would expose them to the uncertainty of unicode-to-string encoding.
UTF-8 can handle all unicode text and is compatible with plain 7-bit
ascii.

- Programmers who output text according to communication protocols
(e.g. HTTP, JSON) must always be aware of and honor the encoding
requirements of the protocol, even when they use libraries that
'do the right thing' with unicode. It is more cumbersome, but always
safer, to encode unicode to string before handing output to a
protocol library.

Often, the very question 'what encoding should I use' will make the
programmer aware of encoding issues and protocol details, whereas
passing on unicode would trigger no such inquiry.
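
As an illustration of the protocol guideline, the sketch below encodes a
JSON body explicitly before it would be handed to an HTTP library. The
function name encode_json_body() and the header layout are hypothetical;
only the standard json module is used.

```python
import json


def encode_json_body(payload, charset="utf-8"):
    """Serialize payload and encode it explicitly (hypothetical helper)."""
    # ensure_ascii=False keeps non-ASCII characters as text, so the
    # encoding step below is the one place where bytes are produced.
    text = json.dumps(payload, ensure_ascii=False, sort_keys=True)
    body = text.encode(charset)
    headers = {
        "Content-Type": "application/json; charset=" + charset,
        # The length is counted in bytes, not in characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = encode_json_body({u"name": u"\u03b1\u03b2"})

# UTF-8 leaves plain 7-bit ascii text byte-for-byte unchanged.
same_as_ascii = u"plain".encode("utf-8") == u"plain".encode("ascii")
```

Because the body is already a byte string, the declared charset and
Content-Length are known to match what is actually sent, whatever the
library downstream would have chosen on its own.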
