Lightweight Page Content Encoding

Posted on June 2, 2018 in dotnet, umbraco

You are creating a website supporting languages with accents (as in, everything but English). You have created the proper html element with lang="es", and the proper meta tags to indicate that your page is text/html; charset=utf-8. Then you render the title of the page as:

<div class="title">@Model.Title</div>

And... what you get in the source code of the page is: Espa&#241;a, not España.

Oops

How come? Well, ASP.NET Razor encodes everything it renders, for obvious security reasons. Anything suspicious such as < turns into &lt; and this is a good thing. However, the encoder is... a bit aggressive: besides the characters that are special in HTML, it also replaces non-ASCII characters in the U+00A0 to U+00FF range with numeric character references, and that includes the ñ (U+00F1, decimal 241) in the title.

And it does not look nice in the page source. Oh, and it also means that instead of sending one character (two bytes in UTF-8), you send six. Bandwidth is cheap nowadays, but nevertheless, this feels like overkill.
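To see this behaviour outside of a Razor view, here is a minimal console sketch. It assumes that calling System.Net.WebUtility.HtmlEncode directly is representative of what Razor does, which is what the comment in the encoder skeleton further down suggests. The class name EncodingDemo is made up for the example.

```csharp
using System;
using System.Net;
using System.Text;

public static class EncodingDemo
{
    public static void Main()
    {
        // "\u00F1" is the letter ñ, written as an escape so the source
        // file encoding does not matter.
        var encoded = WebUtility.HtmlEncode("Espa\u00F1a");

        // On .NET Framework this is expected to print Espa&#241;a,
        // matching what shows up in the page source.
        Console.WriteLine(encoded);

        // Size on the wire in UTF-8: the raw character versus
        // its numeric character reference.
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u00F1")); // 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("&#241;")); // 6
    }
}
```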

Let us fix this

Turns out there's no out-of-the-box solution, but ASP.NET provides enough extensibility points for us to fix the situation. We are going to write our own encoder, and tell ASP.NET to use it. The encoder will look like:

public class HttpEncoder : System.Web.Util.HttpEncoder
{
  protected override void HtmlEncode(string value, TextWriter output)
  {
    // is used by Razor to encode all @... content,
    // when it's not an IHtmlString - the original
    // HttpEncoder calls System.Net.WebUtility.HtmlEncode(...)

    // do something different...
  }
}

And we tell ASP.NET to use it by setting the encoderType attribute of the httpRuntime element in web.config:

<httpRuntime encoderType="MyAssembly.HttpEncoder, MyAssembly" />

And now, we have to write an encoder. On my own website, it looks like:

protected override void HtmlEncode(string value, TextWriter output)
{
  HtmlEncodeUnsafe(value, output);
}

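And the magic happens in HtmlEncodeUnsafe, which is an unsafe method (so the project must be compiled with unsafe code allowed) and is, basically, the original encoder's code, slightly modified. I have posted the code below. It still encodes the characters that matter in HTML (&, ', ", < and >) but writes everything else, accented characters included, unchanged, which is fine because the page is served as UTF-8.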

Problem fixed!

Code

    private static unsafe void HtmlEncodeUnsafe(string value, TextWriter output)
    {
        // this is the raw reflector disassembly output
        // modified and refactored

        if (value == null) return;

        if (output == null)
            throw new ArgumentNullException(nameof(output));

        var pos = IndexOfHtmlEncodingChars2(value, 0);
        if (pos == -1)
        {
            output.Write(value);
        }
        else
        {
            var rem = value.Length - pos;
            fixed (char* sp0 = value)
            {
                var sp = sp0;
                while (pos-- > 0)
                {
                    //sp++;
                    output.Write(sp[0]);
                    sp++;
                }
                while (rem > 0)
                {
                    //sp++;
                    var ch = sp[0];
                    if (ch <= '>')
                    {
                        switch (ch)
                        {
                            case '&':
                                output.Write("&amp;");
                                break;
                            case '\'':
                                output.Write("&#39;");
                                break;
                            case '"':
                                output.Write("&quot;");
                                break;
                            case '<':
                                output.Write("&lt;");
                                break;
                            case '>':
                                output.Write("&gt;");
                                break;
                            default:
                                output.Write(ch);
                                break;
                        }
                    }
                    // else
                    //if ((ch >= '\x00a0') && (ch < 'Ā'))
                    //{
                    //    output.Write("&#");
                    //    output.Write(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
                    //    output.Write(';');
                    //}
                    else
                    {
                        output.Write(ch);
                    }
                    sp++;
                    rem--;
                }
            }
        }
    }

    private static unsafe int IndexOfHtmlEncodingChars2(string s, int startPos)
    {
        // this is the raw reflector disassembly output
        // modified and refactored

        var rem = s.Length - startPos;
        fixed (char* sp0 = s)
        {
            var sp = sp0 + startPos;
            while (rem > 0)
            {
                var ch = sp[0];
                if (ch <= '>')
                {
                    switch (ch)
                    {
                        case '&':
                        case '\'':
                        case '"':
                        case '<':
                        case '>':
                            return s.Length - rem;

                        //case '=':
                        //    goto Label_0086;
                    }
                }
                //else if ((ch >= '\x00a0') && (ch < 'Ā'))
                //{
                //    return (s.Length - num);
                //}
                //Label_0086:
                sp++;
                rem--;
            }
        }
        return -1;
    }
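
If unsafe code is not an option in your project, the same idea can be sketched without pointers. This is not the code running on my website, just a hedged equivalent: the names LightweightHtmlEncoder and HtmlEncodeSafe are made up for the example, and it trades the fast path of the unsafe method above (writing the whole string at once when nothing needs encoding) for simplicity.

```csharp
using System;
using System.IO;

public static class LightweightHtmlEncoder
{
    // Hypothetical pointer-free counterpart of the unsafe method:
    // only & ' " < > are replaced, every other character,
    // accented ones included, is written unchanged.
    public static void HtmlEncodeSafe(string value, TextWriter output)
    {
        if (value == null) return;

        if (output == null)
            throw new ArgumentNullException(nameof(output));

        foreach (var ch in value)
        {
            switch (ch)
            {
                case '&': output.Write("&amp;"); break;
                case '\'': output.Write("&#39;"); break;
                case '"': output.Write("&quot;"); break;
                case '<': output.Write("&lt;"); break;
                case '>': output.Write("&gt;"); break;
                default: output.Write(ch); break;
            }
        }
    }

    public static void Main()
    {
        var writer = new StringWriter();
        HtmlEncodeSafe("<b>Espa\u00F1a & \"co\"</b>", writer);

        // The writer now holds:
        // &lt;b&gt;España &amp; &quot;co&quot;&lt;/b&gt;
        Console.WriteLine(writer.ToString());
    }
}
```

The HtmlEncode override shown earlier could call HtmlEncodeSafe(value, output) instead of the unsafe method; whether the per-character writes matter for performance would need measuring on a real site.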
