Get content (plain text) in UTF-8

ggrossetie · January 31, 2016, 5:51pm

Hello,

I’ve developed an add-on to preview AsciiDoc as HTML inside Firefox: https://github.com/asciidoctor/asciidoctor-firefox-addon

We have an issue when the character encoding of the plain text document is not declared.
You can read in the Firefox console something like:

The character encoding of the plain text document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the file needs to be declared in the transfer protocol or file needs to use a byte order mark as an encoding signature.

We need to obtain the plain text in UTF-8 (regardless of the browser’s default encoding). Is there a way to achieve this?

The following hack works but was rejected by AMO (because I’m using XMLHttpRequest):

// if charset is not UTF-8, try techniques to coerce it to UTF-8
// likely used only for local files
if (document.characterSet.toUpperCase() != 'UTF-8') {
  try {
    // this technique works if all characters are in standard ASCII set
    // see: http://www.ascii-code.com
    sanitizeAndShowHTML(convertToHTML(decodeURIComponent(escape(document.firstChild.textContent))));
  } catch (decodeError) {
    // XMLHttpRequest responseText is UTF-8 encoded by default
    var xhr = new XMLHttpRequest();
    xhr.open('GET', window.location.href, true);
    xhr.onload = function (evt) {
      if (xhr.readyState === 4) {
        // NOTE status is 0 for local files (i.e., file:// URIs)
        if (xhr.status === 200 || xhr.status === 0) {
          sanitizeAndShowHTML(convertToHTML(xhr.responseText));
        } else {
          console.error('Could not read AsciiDoc source. Reason: [' + xhr.status + '] ' + xhr.statusText);
        }
      }
    };
    xhr.onerror = function (evt) {
      console.error(xhr.statusText);
    };
    xhr.send();
  }
} else {
  sanitizeAndShowHTML(convertToHTML(document.firstChild.textContent));
}

Without the UTF-8 text, we cannot properly support non-English languages: https://github.com/asciidoctor/asciidoctor-firefox-addon/issues/43

Thanks,
Guillaume

Lithopsian · January 31, 2016, 7:16pm

Possibly a TextEncoder will do what you want:
https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder
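
A minimal sketch of that idea, assuming the page was decoded as ISO-8859-1, so that every garbled character maps back to exactly one original byte. The helper name `redecodeAsUtf8` is hypothetical and not part of the add-on. Note that the decoding step is done by TextDecoder rather than TextEncoder, since TextEncoder only produces UTF-8 bytes.

```javascript
// Hypothetical helper: recover UTF-8 text that was wrongly decoded as ISO-8859-1.
// Assumption: every character in `garbled` has a code point <= 0xFF, i.e. each
// character corresponds to exactly one original byte.
function redecodeAsUtf8(garbled) {
  var bytes = new Uint8Array(garbled.length);
  for (var i = 0; i < garbled.length; i++) {
    var code = garbled.charCodeAt(i);
    if (code > 0xFF) {
      throw new RangeError('Character at index ' + i + ' is not a single byte');
    }
    bytes[i] = code;
  }
  // fatal: true makes decode() throw a TypeError on invalid UTF-8
  // instead of silently inserting U+FFFD replacement characters.
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

// "é" is encoded in UTF-8 as the bytes 0xC3 0xA9, which read as "Ã©" in ISO-8859-1.
var fixed = redecodeAsUtf8('\u00C3\u00A9');
```

If the page was decoded as windows-1252 instead, bytes in the 0x80 to 0x9F range become characters above U+00FF (for example, 0x80 becomes "€"), and this helper throws rather than guessing.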

You should clarify just why XMLHttpRequest was rejected. So far as I know there is no reason why this object cannot be used in an addon. The rejection may have been tangential to XHR itself and might be solved without entirely ditching XHR. I see from your code that the parsed HTML is simply written to innerHTML, which is not allowed except in very restricted circumstances. Retrieving data from an arbitrary URL and throwing it at innerHTML is probably going to be rejected no matter what sort of validation and sanitisation you do, but that problem isn’t specific to XMLHttpRequest. So, for example, it might be acceptable if you can demonstrate that it is only being used on local files.

ggrossetie · January 31, 2016, 10:48pm

Thanks for your reply @Lithopsian

I’m not sure that the TextEncoder will work because document.firstChild.textContent contains “garbled text” (characters outside ASCII range).
I will give it a try.

Here is the reason for XMLHttpRequest: “Your add-on makes remote, synchronous XMLHttpRequests which have the ability to lock-up the browser UI and are not allowed in public add-ons. Please use asynchronous requests instead.”

I’m not retrieving data from an arbitrary URL but from the URL that the user is browsing. The add-on just retrieves the content as plain text, converts it to HTML with Asciidoctor and writes the result to innerHTML.
Also, I was forced to use this sanitize method, which isn’t perfect because links are removed from the generated HTML, but that is another subject.

Lithopsian · January 31, 2016, 11:48pm

This is either a mistake or obsolete. The code you show in your first post is making asynchronous XMLHttpRequests (third parameter to xhr.open is true). Could there be code elsewhere that this rejection is actually referring to?
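
For reference, a minimal sketch of an asynchronous request that reads the raw bytes and decodes them explicitly as UTF-8, so the result does not depend on the browser's default encoding. The helper names `decodeUtf8` and `fetchAsUtf8` are hypothetical, and whether this form would pass review is an assumption on my part.

```javascript
// Hypothetical helper: decode raw bytes as UTF-8, independent of the document's charset.
// With default options, TextDecoder also drops a leading UTF-8 byte order mark.
function decodeUtf8(buffer) {
  return new TextDecoder('utf-8').decode(new Uint8Array(buffer));
}

// Hypothetical helper: asynchronously fetch `url` as bytes,
// then pass the decoded text to `onText` (or an Error to `onError`).
function fetchAsUtf8(url, onText, onError) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, true); // third argument true = asynchronous
  xhr.responseType = 'arraybuffer'; // raw bytes, not text decoded by the browser
  xhr.onload = function () {
    // status is 0 for local files (file:// URIs)
    if (xhr.status === 200 || xhr.status === 0) {
      onText(decodeUtf8(xhr.response));
    } else {
      onError(new Error('[' + xhr.status + '] ' + xhr.statusText));
    }
  };
  xhr.onerror = function () {
    onError(new Error('Network error while reading ' + url));
  };
  xhr.send();
}
```

With the functions from the first post, the call would look something like `fetchAsUtf8(window.location.href, function (text) { sanitizeAndShowHTML(convertToHTML(text)); }, console.error)`.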

Lithopsian · February 1, 2016, 12:24am

If you’re only ever interested in a document that Firefox has already loaded into a browser window, it shouldn’t be necessary to do your own parsing. Garbage characters would indicate that the character set (encoding) of the document is not correct, but this can be changed. An easy way to do this is to call BrowserSetForcedCharacterSet("UTF-8") which will set the character set, remember it for that page (unless you’re in private browsing mode), and reload the page. In most cases, it won’t actually need to reload the document, just reparse it. I know that web progress listeners get called (with a special flag LOAD_FLAGS_CHARSET_CHANGE), and I assume that DOMContentLoaded and load events are also generated for the new page.

ggrossetie · February 1, 2016, 10:14am

I tried to call BrowserSetForcedCharacterSet in both the content script and the extension script, but I get this error:

ReferenceError: BrowserSetForcedCharacterSet is not defined

I also tried to use TextEncoder/TextDecoder but all my attempts failed.
Could you please post a code snippet showing how to use BrowserSetForcedCharacterSet and/or TextEncoder?

Full source code of my addon is available on GitHub: https://github.com/asciidoctor/asciidoctor-firefox-addon

Thanks,
Guillaume