Why you should be using a proper HTML sanitization library

The idea that it’s easy to protect your web app from XSS seems to be floating around the interwebz. I’ve seen a whole bunch of “tutorials” saying stuff like “just use htmlspecialchars()”.

If you’re building web applications, even small ones, do yourself a favor and use a proper HTML sanitization library. I hear good things about HTML Purifier (PHP) and Sanitize (Ruby); there is probably a library for most other languages as well. I don’t dare recommend anything specific, so you’ll need to do some research, but it will be time well spent. And if you’re using a framework or CMS, use the sanitization functions it provides.

“Nah, I’ll just use htmlspecialchars(). It’ll encode tags and quotes, and I don’t need anything else really, cos I’m not doing any fancy stuff and libraries are just unnecessary bloat,” a lazy developer might say.

Wrong. HTML is not easy to sanitize. Here’s why:

Unless your web application is just static HTML, there are most probably anchors and/or inputs somewhere displaying user input – a comment form or whatever. Here’s a really simple example:

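<?php // Encode HTML tag characters only; ENT_NOQUOTES leaves quotes untouched ?>
<a href="<?php echo htmlspecialchars( $_POST['url'], ENT_NOQUOTES, 'UTF-8' ) ?>">
    <?php echo htmlspecialchars( $_POST['title'], ENT_NOQUOTES, 'UTF-8' ) ?>
</a>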

This will, indeed, take care of HTML tags. But you probably know that it’s not safe:

$_POST['url'] = ' www.google.com" onclick="javascript:alert(document.cookie); ';

Simple as that. This is a lame example though; everyone knows that you need to encode quotes too.
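To make the difference concrete, here is a minimal sketch. It assumes the "tags only" version corresponds to PHP's htmlspecialchars() with ENT_NOQUOTES and the "quotes too" version to ENT_QUOTES; the variable names are made up for illustration.

```php
<?php
// The payload from the example above.
$url = ' www.google.com" onclick="javascript:alert(document.cookie); ';

// Tag characters only: the double quote survives and closes the href attribute early.
$tagsOnly = htmlspecialchars($url, ENT_NOQUOTES, 'UTF-8');

// Tags and both quote styles: each double quote becomes &quot; and stays inside href.
$withQuotes = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

echo $tagsOnly, "\n";
echo $withQuotes, "\n";
```

The first line printed is identical to the input, so the injected onclick attribute lands in the page as real markup. The second has every double quote turned into &quot;, which keeps the whole value inside the href.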

<?php // Encode all HTML special characters and also quotes. ?>
<a href="<?php echo htmlspecialchars( $_POST['url'], ENT_QUOTES, 'UTF-8' ) ?>">
    <?php echo htmlspecialchars( $_POST['title'], ENT_QUOTES, 'UTF-8' ) ?>
</a>

Still not good enough:

$_POST['url'] = ' javascript:alert(document.cookie); ';

Yes, you can XSS the fuck out of a website without any quotes or tags at all. In my experience, Estonian developers tend to forget this way too often. I’m somewhat afraid the situation isn’t better in any other corner of the earth.
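A quick way to convince yourself, sketched with PHP's htmlspecialchars() (variable names are just for illustration): the payload contains none of the characters that encoding touches, so it comes out exactly as it went in.

```php
<?php
$url = ' javascript:alert(document.cookie); ';

// Encoding only rewrites &, <, > and quotes. None of them appear in the payload.
$encoded = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

// The encoded value is byte-for-byte the same as the input.
$unchanged = ($encoded === $url);
var_dump($unchanged);
```

Encoding answers the question "can this value break out of the attribute?". It says nothing about whether the value is a URL you would want anyone to click.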

To counter this one, the lazy developer will start filtering stuff out. I can’t say exactly how, but in my experience it’s usually done wrong.

<?php
    function sanitize($data) {
        // Encode all HTML special chars and quotes
        $data = htmlspecialchars( $data, ENT_QUOTES, 'UTF-8' );
        // Strip out any 'javascript' strings
        $data = str_replace('javascript', '', $data);
        return $data;
    }
?>
<a href="<?php echo sanitize( $_POST['url'] ) ?>">

$_POST['url'] = ' Javascript:alert(document.cookie); ';

Lol forgot case sensitivity. That was silly, thanks for pointing it out tester. Fixed now.

<?php
    function sanitize($data) {
        // Encode all HTML special chars and quotes
        $data = htmlspecialchars( $data, ENT_QUOTES, 'UTF-8' );
        // Strip out any 'javascript' strings CASE INSENSITIVE
        $data = str_ireplace('javascript', '', $data);
        return $data;
    }
?>
<a href="<?php echo sanitize( $_POST['url'] ) ?>">

$_POST['url'] = ' javajavascriptscript:alert(document.cookie) ';

Thanks again tester, I’m not in the best shape today it seems. Fixed now.
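For anyone wondering why the nested payload got through, a minimal sketch (variable names are illustrative): str_ireplace() makes a single pass over the input, so removing the inner match glues the outer pieces back together.

```php
<?php
$payload = 'javajavascriptscript:alert(document.cookie)';

// One pass: the only match is the 'javascript' in the middle.
// Removing it leaves 'java' + 'script:...' sitting next to each other.
$once = str_ireplace('javascript', '', $payload);

echo $once, "\n";
```

The filter's own output is the string it was supposed to remove.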

<?php
    function sanitize($data) {
        $data = htmlspecialchars( $data, ENT_QUOTES, 'UTF-8' );
        // Strip out any 'javascript' strings in a loop until there are no more CASE INSENSITIVE
        while( stripos( $data, 'javascript' ) !== false ) {
            $data = str_ireplace('javascript', '', $data);
        }
        return $data;
    }
?>
<a href="<?php echo sanitize( $_POST['url'] ) ?>">

$_POST['url'] = ' http://en.wikipedia.org/wiki/JavaScript ';

[facepalm]

<?php
    function sanitize($data) {
        $data = htmlspecialchars( $data, ENT_QUOTES, 'UTF-8' );
        // Strip out 'javascript:' strings with colons only to avoid breaking real URLs
        while( stripos( $data, 'javascript:' ) !== false ) {
            $data = str_ireplace('javascript:', '', $data);
        }
        return $data;
    }
?>
<a href="<?php echo sanitize( $_POST['url'] ) ?>">

$_POST['url'] = " jav\tascript:alert(1) "; // \t is a tab character

This bug has been marked CLOSED (WORKSFORME) by LazyDeveloper@dev.com

This bug has been marked REOPENED by LazyDeveloper@dev.com: you’re right, sorry, I use the latest Firefox for developing, didn’t realize I should also test with IE8.

.. I think you get the point by now.

There are a ton of ways to exploit a simple <a href="">. A lot of them are browser-specific. And this is just one tag with one attribute. What about the rest? What about user input in JSON requests? What about letting your users enter limited HTML?

There’s a fairly simple solution for most of these problems: use a library for sanitizing your HTML.

I should emphasize that using one will not automagically save you from any and all XSS attacks. Software development is really complex and humans tend to make mistakes, overlook things, forget.

You might accidentally use the wrong options when configuring the library. You might forget to sanitize something (imported .csv files, for example; true story). Anything could go wrong. That’s why you need to test.
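What "test" can look like in practice, as a rough sketch: keep a list of known payloads and run every one of them through whatever sanitizer you end up with. Everything here is an assumption for illustration. sanitize() is the last hand-rolled attempt from above, copied so the snippet runs on its own, and looksLikeScriptUrl() is a hypothetical helper that only approximates how a browser reads an href (it decodes entities and drops tabs and newlines); it is not a real URL parser.

```php
<?php
// The last hand-rolled attempt from the article, copied so this sketch is self-contained.
function sanitize(string $data): string {
    $data = htmlspecialchars($data, ENT_QUOTES, 'UTF-8');
    while (stripos($data, 'javascript:') !== false) {
        $data = str_ireplace('javascript:', '', $data);
    }
    return $data;
}

// Hypothetical helper: a rough approximation of how a browser reads an href value.
function looksLikeScriptUrl(string $attributeValue): bool {
    // The browser decodes entities in attribute values before using them.
    $url = html_entity_decode($attributeValue, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Tabs and newlines inside a URL are dropped before the scheme is read.
    $url = str_replace(["\t", "\n", "\r"], '', $url);
    return stripos(ltrim($url), 'javascript:') === 0;
}

// Payloads in the spirit of the ones the tester used above.
$payloads = [
    'javascript:alert(1)',
    'Javascript:alert(1)',
    'javajavascriptscript:alert(1)',
    "jav\tascript:alert(1)",
];

// Collect every payload that still looks executable after sanitizing.
$failures = [];
foreach ($payloads as $payload) {
    if (looksLikeScriptUrl(sanitize($payload))) {
        $failures[] = $payload;
    }
}

var_dump($failures);
```

With this list, only the tab payload ends up in $failures. A list like this is never complete, which is rather the point of the whole post: point it at a real library's output and keep adding payloads as you find them.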

Security is difficult. Use existing tools. Always test.
