
tech proposal: FALLBACK_ENCODINGS

This is a technical proposal for a standard way to convey a computer user's preferred precedence list of character sets to try when decoding textual data whose encoding is unknown.

Proposal: FALLBACK_ENCODINGS environment variable

The user sets the FALLBACK_ENCODINGS variable in their computing environment according to this EBNF:

FALLBACK_ENCODINGS  = word , { ":" , word } ;

word                = char , { char } ;

char                = "a" .. "z" | "A" .. "Z" | "0" .. "9" | "_" | "-" ;
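A program may want to check a value against this grammar before using it. The following is a minimal sketch, assuming a regular expression is an acceptable stand-in for the EBNF; the function name parse_fallback_encodings is hypothetical and not part of the proposal.

```python
import re

# `word` rule: one or more ASCII letters, digits, "_" or "-".
_WORD = r"[A-Za-z0-9_-]+"
# `FALLBACK_ENCODINGS` rule: a word followed by zero or more ":" word pairs.
_FALLBACK_ENCODINGS_RE = re.compile(rf"{_WORD}(?::{_WORD})*")


def parse_fallback_encodings(value):
    """Return the charset names in order, or raise ValueError if malformed.

    Hypothetical helper: the proposal itself does not name such a function.
    """
    if not _FALLBACK_ENCODINGS_RE.fullmatch(value):
        raise ValueError(f"malformed FALLBACK_ENCODINGS value: {value!r}")
    return value.split(":")
```

Under this grammar an empty string, a trailing colon, or two adjacent colons are all malformed, so a program has to decide whether to reject such values or to skip the empty items silently.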

Process Flow

A program that encounters encoded text with no metadata indicating its charset reads the FALLBACK_ENCODINGS string from the OS environment and interprets it as a list of charset names delimited by the colon ":" character.

It optionally appends its own program-specific fallback charset name, if any, to the end of the list. "UTF-8" is recommended for this purpose.

It then tries to decode the data using the first charset in the list. If the decoding routine succeeds, it returns the decoded text to the user, optionally with the charset used as metadata.

If decoding fails, it continues with the next item. If no charset succeeds, the decoding task fails.
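Because the first successful decode wins, the order of the list matters. The sketch below illustrates this with a hypothetical helper, first_successful_decode, and sample data chosen for the example; it assumes Python's built-in codecs and is not part of the proposal. ISO-8859-1 maps every possible byte to a character, so it never fails and hides every charset listed after it.

```python
def first_successful_decode(data, charsets):
    """Return (charset, text) for the first charset that decodes data strictly."""
    for name in charsets:
        try:
            return name, data.decode(name, "strict")
        except (UnicodeDecodeError, LookupError):
            continue  # wrong charset or unknown name: try the next item
    raise ValueError("no charset in the list could decode the data")


# Sample data for illustration: Japanese text encoded as Shift_JIS.
data = "日本語".encode("shift_jis")

# UTF-8 rejects these bytes, so the walk moves on to Shift_JIS.
good = first_successful_decode(data, ["utf-8", "shift_jis", "iso-8859-1"])

# ISO-8859-1 is listed first and accepts any bytes, so Shift_JIS is never tried.
bad = first_successful_decode(data, ["iso-8859-1", "shift_jis"])
```

Here good carries the original text, while bad reports ISO-8859-1 and contains mojibake. A user would therefore normally place such catch-all single-byte charsets at the end of FALLBACK_ENCODINGS.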

Reference Implementation

In the Python programming language:

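import os

def decode_textual_data(data, metadata):
  # Candidate charsets, in order of precedence.
  encodings = []
  detected_charset = metadata.get('charset')
  if detected_charset:
    encodings.append(detected_charset)

  # User preference from the environment; skip empty items (e.g. unset variable).
  fallback = os.environ.get('FALLBACK_ENCODINGS', '')
  encodings.extend(name for name in fallback.split(':') if name)
  # Program-specific fallback, tried last.
  encodings.append('UTF-8')

  for encoding in encodings:
    try:
      text = data.decode(encoding, 'strict')
    except (UnicodeDecodeError, LookupError):
      # Wrong charset for this data, or a name Python does not know.
      continue
    return {'text': text, 'metadata': {'charset': encoding}}

  # No candidate succeeded.
  raise ValueError('data could not be decoded with any of: ' + ', '.join(encodings))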

Related Standards

  • LANG
  • LANGUAGE
  • LC_MESSAGES

 
