The innerHTML
property is extremely popular because it provides a simple way to completely replace the contents of an HTML element. Another way to do that is to use the DOM Level 2 API (removeChild
, createElement
, appendChild
) but using innerHTML
is by far the easiest and most efficient way to modify the DOM tree. However, innerHTML
has few problems of its own that you need to be aware of:
- Improper handling of the
innerHTML
property can enable script-injection attacks on Internet Explorer when the HTML string contains a script tag marked as deffered:<script defer>...<script>
- Setting
innerHTML
will destroy existing HTML elements that have event handlers attached to them, potentially creating a memory leak on some browsers.
There are a few other minor drawbacks worth mentioning:
- You don’t get back a reference to the element(s) you just created, forcing you to add code to retrieve those references manually (using the DOM APIs…)
- You can’t set the
innerHTML
property on all HTML elements on all browsers (for instance, Internet Explorer won’t let you set theinnerHTML
property of a table row element)
I am more concerned with the security and memory issues associated with using the innerHTML
property. Obviously, this problem is nothing new, and very bright people have already figured out ways to work around some of these problems.
Douglas Crockford wrote a purge
function that takes care of breaking some circular references caused by attaching event handlers to HTML elements, allowing the garbage collector to release all the memory associated with these HTML elements.
Removing the script tags from the HTML string is not as easy as it seems. A regular expression should do the trick, although it’s hard to know whether it covers all possible cases. Here is the one I came up with:
/<script[^>]*>[\S\s]*?<\/script[^>]*>/ig
Now, let’s put these two techniques together in a single setInnerHTML
function (Update: Thanks to those who commented on this article. I fixed the errors/holes you mentioned, and also decided to bind the setInnerHTML
function to YAHOO.util.Dom
)
YAHOO.util.Dom.setInnerHTML = function (el, html) { el = YAHOO.util.Dom.get(el); if (!el || typeof html !== 'string') { return null; } // Break circular references. (function (o) { var a = o.attributes, i, l, n, c; if (a) { l = a.length; for (i = 0; i < l; i += 1) { n = a[i].name; if (typeof o[n] === 'function') { o[n] = null; } } } a = o.childNodes; if (a) { l = a.length; for (i = 0; i < l; i += 1) { c = o.childNodes[i]; // Purge child nodes. arguments.callee(c); // Removes all listeners attached to the element via YUI's addListener. YAHOO.util.Event.purgeElement(c); } } })(el); // Remove scripts from HTML string, and set innerHTML property el.innerHTML = html.replace(/<script[^>]*>[\S\s]*?<\/script[^>]*>/ig, ""); // Return a reference to the first child return el.firstChild; };
Voila! Let me know if there is anything else that should be part of this function, or if I missed anything obvious in the regular expression.
Update: There are obviously many more ways to inject malicious code in a web page. The setInnerHTML
function barely normalizes the <script>
tag execution behavior across all A-grade browsers. If you are going to inject HTML code that cannot be trusted, make sure you sanitize it first on the server side. There are many libraries available for this.
Update: IE8 has a new toStaticHTML
function attached to the window object that removes any potentially executable content from an HTML string!
Forgive me for asking what might seem like a dumb question, but what’s the point of replacing script tags in the innerHTML? If you had to replace script tags, it would mean that you couldn’t trust it to be free of XSS attacks (which is certainly a valid concern). However, if the code wasn’t trustworthy, couldn’t someone just set the innerHTML to something like ”? Or, better yet, the innerHTML could be something like ”
Additionally, it seems that your regex may have a slight flaw in it. A (.) in JS’s regex parser doesn’t match a \n, so if I throw in at least one of those into my evil script, I can get stuff through the regex. Just as a check, I threw up a quick test page to see what happens w/ that regex and some \n chars: . I think a better solution might be to use /]*>(.|\n)*/g or /]*>(.|\n)*?/g (depending on your preference for greedy vs lazy in case the evil person tries to throw in a \n//\n into the attempted XSS attack).
Anyway, am I seeing that correctly?
Regarding the regular expression for scripts - how about ingoring case with «i» modifier and detecting unclosed script tags? And what is «\\?» for? «»?
Your blog broke my code :> The last was similar to «»
Argh it even doesn’t allow me to post ending script tag with zero-width joiners between every letter. I’ll try with a space: «».
That doesn’t work either. Well, mayble like that «<\/script>»
Pingback: The Problem With innerHTML | David Bisset: Web Designer, Coder, Wordpress Guru
Good to see that you acknowledge that, in real-life scenarios,
innerHTML
is sometimes the best approach (especially with certain AJAX scenarios). Douglas solution is great, and nice with your addition.I can’t help of being wary of using a variable named html, though. Seems like a perfect scenario where IE would get a messed up reference in certain cases… :-)
Although a great idea, the code still references Douglas’ purge function (which in your example) is not included.
I too wonder about the script stuff….
a.) As long as I am not exposing my method to other 3rd parties to use, then I don’t see the worry.
b.) as mentioned by others, it should be a case insensitive regex
c.) since I may want to insert script tags (quite likely), what I would rather have, is a regex to remove the defer attribute!
Doug’s purge method works only when you’ve attached event handlers directly to the element, via element.onclick, etc. If you attach events using attachEvent(), the circular references remain and there’s no way to know that the event handlers are still attached before changing innerHTML.
Another problem with innerHTML is that it doesn’t stay true to the string you’ve used. It will change tag case, add/remove attributes, etc., meaning that you can’t necessarily test innerHTML to see if it contains the text you had previous set it to.
Conclusion: innerHTML is nifty but far from perfect. :)
P.S. Looks like you’ve got some spam comments.
@Nicholas
Indeed, I forgot to mention that important point. I would like to see something similar to setInnerHTML as part of the YUI library. That way, events attached using the DOM Level 2 API could be automatically removed. Your second point is less of an issue for most uses of the innerHTML property.
I don’t see any spam comments.
@Nicholas
I updated the post to take your remark into account and fix some stupid bugs I had left out.
Script tags are just one out of many ways to inject malicious code. Stripping only the script tag is sub par. You must also be careful of IFRAMEs, OBJECT, EMBED, IMG … and even some attributes ( some browsers still allow script execution in style=”background:url(javascript:…);”.
The only way to sanitize properly markup to be injected is to use a whitelist of tags and attributes … or simply don’t inject markup using innerHTML.
Pingback: Javascript News » Blog Archive » The problem with innerHTML
Pingback: The problem with innerHTML « outaTiME
Pingback: Ajax Girl » Blog Archive » The problem with innerHTML
@Mathieu
I could not agree more. The
setInnerHTML
function barely normalizes the script tag execution behavior across all A-grade browsers.Pingback: Woodworkingde Yahoo! nos cuenta los problemas de innerHTML - elWebmaster
Nice article, now I won’t feel so dirty using innerHTML as much as I do. Regarding Nicholas’ comment couldn’t you use YUI’s YAHOO.util.Event.purgeElement method to remove the event handlers set with attachEvent (or would that only be for those set with the Event utility). Either way it’s a nice solution but will definitely be a task to forge a silver bullet to solve all scenarios.
thanks again,
Dylan
To Nicholas’ point, you break circular references differently with DOM 0 events ( ex. inline onclick events, etc. ) than with DOM 2 events ( ex. attachEvent, addEventListener ).
So, to be effective, you need a way to do both.
Another Caveat with innerHTML. Internet Explorer normalizes all the space when inserting into innerHTML. Including a pre-tag. I ran into this with an app I wrote that processed xml using xslt, and inserted the results into the tree. The xslt preserved space, and went back and changed \n to ‘s after the fact, but that didn’t work in IE.
There isn’t really much use on using regex to filter tags. Regex aren’t powerful enough on their own, without grammars (<a title=”x”>bbb<a title=””> kind of stuff would filter too much) and therefore you can always get around it. Try filtering something like
html = “<scrivoid(0);pt defer=\”true\”>alert(‘foo’);”
So there’s no real security benefit. Filtering scripts before adding them to any DOM structure (even one that doesn’t belong to the document tree) won’t be an easy job…
Pingback: Blog do Márcio d’Ávila
/<script[^>]*>((.|[\r\n])*?)<\\?\/script>/ig
…should be:
/<script[^>]*>[\S\s]*?<\/script>/ig
For one, it’s more readable, and secondly it’s much more efficient. If you want to also match self-closed script elements you could use:
/<script[^>]*(?:\/>|>[\S\s]*?<\/script>)/ig
One more security hole to plug is </script> tags with whitespace or other attributes, which I believe browsers allow (at least whitespace). So you could modify the last regex to /<script[^>]*(?:\/>|>[\S\s]*?<\/script[^>]*>)/ig
@Steven
Thanks. You’re absolutely right!
What I have noticed while using innerHTML is that if I create a select box through innerHTML in a DIV which is in a form and then I submit the for the select box is not captured in request(JAVA) in case of Firefox but gets captured in case of IE.
Is this due to
“You don’t get back a reference to the element(s) you just created, forcing you to add code to retrieve those references manually (using the DOM APIs…)”
Really nice article Julien!
but, simple using innerHTML is not enough, especially it is problematic with html form elements, “table” and “option” elements.
I think your code needs some improvements to fix all noted problems with various tags. see code below:
wrapper = document.createElement('div');
if (/^t(body|foot|head)$/i.test(el.tagName)) {
wrapper.innerHTML = '' + html + '';
wrapper = wrapper.getElementsByTagName('tbody')[0];
} else if (/^select$/i.test(el.tagName)) {
wrapper.innerHTML = '' + html + '';
wrapper = wrapper.getElementsByTagName('select')[0];
} else {
wrapper.innerHTML = html;
}
while(wrapper.firstChild) {
el.appendChild(wrapper.firstChild);
}
I’m sorry, code was posted incorrectly:
wrapper = document.createElement('div');
if (/^t(body|foot|head)$/i.test(el.tagName)) {
wrapper.innerHTML = '<table>' + html + '</table>';
wrapper = wrapper.getElementsByTagName('tbody')[0];
} else if (/^select$/i.test(el.tagName)) {
wrapper.innerHTML = '<select name="tmp">' + html + '</select>';
wrapper = wrapper.getElementsByTagName('select')[0];
} else {
wrapper.innerHTML = html;
}
while(wrapper.firstChild) {
el.appendChild(wrapper.firstChild);
}
@Joseph,
if you have too many table or select/option entries - recreate the whole table or select. If it is still too many, your app has a faulty design anyway, since bloated selects are not user-friendly and max column tables should be replaced by an optimized design as well.
my $0.02 of course.
Merry Christmas time everybody,
Frankie / Berlin / Germany
… max row tables, of course. Sry.