Revisions for tech proposal: FALLBACK_ENCODINGS

2025, August 29 - 22:25 by sysop
2025, August 29 - 22:34 by sysop

wording

Changes to Body
 
-------- PROCESS FLOW --------------------------------------------------------

- Programms encountering data with encoded text for which there is no known
- metadata about which charset to decode to, takes the FALLBACK_ENCODINGS
+ Programms encountering data with encoded text, to which there is no metadata
+ indicating which charset to decode it into, takes the FALLBACK_ENCODINGS
  string, had been read from the OS environment, as charset names delimited by
  the colon ":" char.

Current revision:

This is a technical proposal for a standard way to convey a computer user's preferred precedence list of character sets to try when decoding textual data whose encoding is unknown.

Proposal: FALLBACK_ENCODINGS environment variable

The user sets the FALLBACK_ENCODINGS variable in their computing environment according to this EBNF:

FALLBACK_ENCODINGS  = word , { ":" , word } ;

word                = char , { char } ;

char                = "a" .. "z" | "A" .. "Z" | "0" .. "9" | "_" | "-" ;
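A value can be checked against this grammar with a regular expression. The following is a minimal sketch, not part of the proposal; the function name is_valid_fallback_encodings is hypothetical.

```python
import re

# One "word" of the grammar: one or more of a-z, A-Z, 0-9, "_" or "-".
_WORD = r"[A-Za-z0-9_-]+"

# FALLBACK_ENCODINGS = word , { ":" , word } ;
_FALLBACK_ENCODINGS_RE = re.compile(rf"{_WORD}(?::{_WORD})*")


def is_valid_fallback_encodings(value):
    """Return True if value matches the FALLBACK_ENCODINGS grammar.

    Hypothetical helper, named here for illustration only.
    """
    return _FALLBACK_ENCODINGS_RE.fullmatch(value) is not None
```

Under this grammar an empty string, a leading or trailing colon, and two adjacent colons are all invalid, so a program would probably want to treat an unset or empty variable as "no preference" rather than as a malformed value.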

Process Flow

A program that encounters data containing encoded text, with no metadata indicating which charset to decode it from, reads the FALLBACK_ENCODINGS string from the OS environment and interprets it as a list of charset names delimited by the colon ":" character.

The program may optionally append its own program-specific fallback charset name, if any, to the end of the list. "UTF-8" is recommended for this purpose.

The program then tries to decode the data using the first charset in the list. If the decoding routine succeeds, it returns the decoded text to the user, optionally with the charset used as metadata.

If decoding fails, the program continues with the next item. If no charset succeeds, the data decoding task fails.
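As a worked example of this flow, the sketch below builds the candidate list from a sample value and decodes a short byte string. The helper names candidate_charsets and first_successful_decode, the sample value "ascii:cp1252", and the choice to skip empty items are assumptions made for illustration; they are not part of the proposal.

```python
import os


def candidate_charsets(environ=None, program_fallback="UTF-8"):
    """Build the ordered list of charset names to try (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("FALLBACK_ENCODINGS", "")
    # Skip empty items so an unset or empty variable yields no names.
    names = [name for name in raw.split(":") if name]
    # The program-specific fallback, if any, goes at the end.
    if program_fallback:
        names.append(program_fallback)
    return names


def first_successful_decode(data, charsets):
    """Return (text, charset) for the first charset that decodes data."""
    for name in charsets:
        try:
            return data.decode(name, "strict"), name
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or unknown charset name: try the next item.
            continue
    raise ValueError("no candidate charset could decode the data")


# A plain dict stands in for the OS environment in this example.
env = {"FALLBACK_ENCODINGS": "ascii:cp1252"}
data = "café".encode("cp1252")  # contains one non-ASCII byte
text, used = first_successful_decode(data, candidate_charsets(env))
```

Here "ascii" rejects the non-ASCII byte, so the second item is used. The order of the list matters: a charset such as ISO-8859-1 accepts every byte sequence, so any items placed after it are never reached.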

Reference Implementation

In the Python programming language:

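import os

def decode_textual_data(data, metadata):
  # Candidate charsets, in order of precedence.
  encodings = []

  # A charset declared in the metadata, if any, is tried first.
  detected_charset = metadata.get('charset')
  if detected_charset:
    encodings.append(detected_charset)

  # The user's preference list; empty items (e.g. unset variable) are skipped.
  fallback_encodings = os.environ.get('FALLBACK_ENCODINGS', '').split(':')
  encodings.extend(name for name in fallback_encodings if name)

  # The program-specific fallback charset goes last.
  encodings.append('UTF-8')

  for encoding in encodings:
    try:
      text = data.decode(encoding, 'strict')
    except (UnicodeDecodeError, LookupError):
      # Decoding failed or the charset name is unknown; try the next item.
      continue
    return {'text': text, 'metadata': {'charset': encoding}}

  # No charset succeeded: fail the data decoding task.
  raise ValueError('no candidate charset could decode the data')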

Related Standards

  • LANG
  • LANGUAGE
  • LC_MESSAGES

 
