tech proposal: FALLBACK_ENCODINGS
This is a technical proposal for a standard way to convey a computer user's preferred precedence list of character sets to try when decoding textual data whose encoding is unknown.
Proposal: FALLBACK_ENCODINGS environment variable
The user sets the FALLBACK_ENCODINGS variable in their computing environment according to this EBNF:
FALLBACK_ENCODINGS = word , { ":" , word } ;
word = char , { char } ;
char = "a" .. "z" | "A" .. "Z" | "0" .. "9" | "_" | "-" ;
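The grammar above can be checked with a regular expression. The following is a minimal sketch, not part of the proposal: the function name parse_fallback_encodings is hypothetical, and rejecting a malformed value with ValueError is an assumption, since the proposal does not say how malformed values should be handled.

```python
import re

# One or more words made of [A-Za-z0-9_-], separated by single colons,
# mirroring the EBNF above.
_FALLBACK_ENCODINGS_RE = re.compile(r'[A-Za-z0-9_-]+(?::[A-Za-z0-9_-]+)*')

def parse_fallback_encodings(value):
    """Return the list of charset names in value.

    Raises ValueError if value does not match the grammar
    (hypothetical policy; the proposal leaves this open).
    """
    if not _FALLBACK_ENCODINGS_RE.fullmatch(value):
        raise ValueError('malformed FALLBACK_ENCODINGS value: %r' % (value,))
    return value.split(':')
```

For example, parse_fallback_encodings('UTF-8:Shift_JIS:ISO-8859-1') returns ['UTF-8', 'Shift_JIS', 'ISO-8859-1'], while an empty string or a value with an empty item such as 'UTF-8::KOI8-R' is rejected.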
Process Flow
A program that encounters encoded text with no metadata indicating its charset reads the FALLBACK_ENCODINGS string from the OS environment and treats it as a list of charset names delimited by the colon ":" character.
It may optionally append its own program-specific fallback charset name, if any, to the end of the list. "UTF-8" is recommended for this purpose.
It then tries to decode the data using the first charset in the list. If the decoding routine succeeds, the program returns the decoded text to the user, optionally with the charset used as metadata.
If decoding fails, the program continues with the next item. If no charset succeeds, the data decoding task fails.
Reference Implementation
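In the Python programming language:
import os

def decode_textual_data(data, metadata):
    # A charset named by the metadata, if any, is tried first.
    encodings = []
    detected_charset = metadata.get('charset')
    if detected_charset:
        encodings.append(detected_charset)
    # The user's fallback list; empty items (e.g. an unset variable) are skipped.
    fallback_encodings = [
        name for name in os.environ.get('FALLBACK_ENCODINGS', '').split(':') if name
    ]
    # The program-specific fallback goes last.
    fallback_encodings.append('UTF-8')
    encodings.extend(fallback_encodings)
    for encoding in encodings:
        try:
            text = data.decode(encoding, 'strict')
        except (UnicodeDecodeError, LookupError):
            # Decoding failed or the charset name is unknown; try the next item.
            continue
        return {'text': text, 'metadata': {'charset': encoding}}
    # No charset succeeded, so the decoding task fails.
    raise ValueError('none of the charsets %r could decode the data' % (encodings,))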
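The order of the items matters because the first charset that decodes without error wins, even if the resulting text is not what the user expects. The sketch below illustrates this with a small self-contained helper. The name first_successful_decode and the sample data are assumptions made for illustration, not part of the proposal.

```python
def first_successful_decode(data, encodings):
    """Try each charset in order; return (text, charset) for the first
    one that decodes data without error, or None if none does."""
    for encoding in encodings:
        try:
            return data.decode(encoding, 'strict'), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    return None

# Sample bytes whose charset the decoder is not told about.
data = 'Привет'.encode('koi8-r')

# UTF-8 rejects these bytes, so the next item, KOI8-R, is used.
print(first_successful_decode(data, ['UTF-8', 'KOI8-R', 'ISO-8859-1']))
# → ('Привет', 'KOI8-R')

# ISO-8859-1 accepts any byte sequence, so KOI8-R is never reached
# and the returned text is mojibake.
print(first_successful_decode(data, ['UTF-8', 'ISO-8859-1', 'KOI8-R']))
```

A practical consequence is that single-byte charsets that accept every byte value, such as ISO-8859-1, are best placed last in FALLBACK_ENCODINGS, since no item after them can ever be tried.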
Related Standards
- LANG
- LANGUAGE
- LC_MESSAGES