
[Request] Faster HTML escaping? #2885

Closed
RunDevelopment opened this issue Nov 21, 2020 · 9 comments
Labels
enhancement (An enhancement or new feature)

Comments

@RunDevelopment

Right now, HTML escaping is implemented like this, but there are faster methods out there. This can speed up highlighting quite a bit, e.g. Prism is about 13% faster in total with the faster escaping, but Highlight.js should run its own performance tests.
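
For context, the current approach is essentially a chain of replace calls, one pass per escaped character - roughly this (paraphrased from memory, not an exact copy of the source):

  // one full pass over the string per character being escaped
  function escapeHTML(value) {
    return value
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;')
      .replace(/"/g, '&quot;')
      .replace(/'/g, '&#x27;');
  }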

I also want to point out that this is very much a tradeoff: execution speed vs bundle size.

@RunDevelopment RunDevelopment added the enhancement An enhancement or new feature label Nov 21, 2020
@joshgoebel
Member

joshgoebel commented Nov 22, 2020

I also want to point out that this is very much a tradeoff: execution speed vs bundle size.

And complexity. And potential for the performance to be pulled out from under you in the future. :-) I have some strong opinions on things like this (big picture). I tend to be a skeptic. :-)

  1. People often focus entirely on the wrong things with performance tuning.

I worry the win wouldn't be nearly as profound with us (in terms of total time). We spend a lot more time parsing (WAY more for auto-detect) than we do escaping HTML... and actually there are other huge wins we could consider IF we cared more about performance - before something like this that increases the bundle size, adds complexity to what should be a trivial function, and can mortgage your future to save your present.

But for a second let's pretend optimizing the escaping was highly relevant... Right now highlighting is a complete operation EVEN if we only need the relevance weighting (for language detection - a very common use of our library)... The very worst case to "auto-detect" a language when using all 190 languages is that we parse, tokenize, generate HTML, and escape your code snippet 190 times - once for each language. A much simpler win: don't generate HTML or escape the code for 189 of those attempts. :-)

If the metric were only "time spent escaping", I've just made a very common use case 189x faster, seriously! (Actually even faster, since I "cheated" and also removed HTML generation.) And no need for gnarly replaceHTML code. :-) Of course this would scale linearly with the number of languages one is auto-detecting... the more languages, the bigger the win.

That's actually probably very much something we should consider. :-)
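
A rough sketch of that idea (purely illustrative; scoreRelevance and fullHighlight are hypothetical stand-ins for "parse + tokenize only" and "parse + tokenize + generate/escape HTML"):

  // Score every candidate language WITHOUT emitting or escaping HTML,
  // then pay the HTML-generation + escaping cost exactly once.
  function autoDetect(code, languages, scoreRelevance, fullHighlight) {
    const best = languages
      .map((lang) => ({ lang, relevance: scoreRelevance(lang, code) }))
      .sort((a, b) => b.relevance - a.relevance)[0];
    return fullHighlight(best.lang, code);
  }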


  2. Focusing on micro-performance wins (this may not be that, but it's close to that category in my mind) requires a "performance culture" to be successful.

At first glance I dislike everything about this kind of patch. :-) Replacing items in a string should be replace's bread and butter. What other reason does it have to exist? This code is very complex (vs the simple cases) and entirely unintuitive to me. I can think of a few other "simple" changes we could consider here (2-3 line, very clean changes):

  • replaceAll (with strings vs regex)
  • replace(/[group]/g, callback)

The first might be faster (strings vs regex)... the latter might be too, but I wouldn't bet on it, because now the speed of JS function callbacks (and the table lookup you need) might hurt you. The only thing I could imagine that could possibly be faster would be replace("x","y") where the strings were exactly the same length, since then you could imagine doing the internal replace inline without generating a new string (yes, I realize you need at least one new string since strings are immutable).
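
For concreteness, those two variants might look like this (a sketch; the sample input and lookup-table name are arbitrary):

  const html = '<a href="?x=1&y=2">link</a>';

  // 1. chained replaceAll with plain strings (no regex at all)
  const escapedA = html
    .replaceAll('&', '&amp;')
    .replaceAll('<', '&lt;')
    .replaceAll('>', '&gt;');

  // 2. one pass: a character class plus a lookup-table callback
  const RAW_to_ESCAPED = { '&': '&amp;', '<': '&lt;', '>': '&gt;' };
  const escapedB = html.replace(/[&<>]/g, (m) => RAW_to_ESCAPED[m]);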

So anytime I see something like this I ask myself: what is so broken with replace - is this an engine problem? WHY is it SO slow that we can beat what should really be a native C++ function with tons of additional JS code? I skimmed a few of the links but I didn't see any answers to that, just people throwing around lots of crazy solutions that bench faster without any explanation of why.

The why matters. Let's say replace is slow... so we use this crazy convoluted code. We get some win now, but at the cost of tons of complexity and size. Then finally someone fixes replace (to be fast, like it should be)... but how are we going to know? So now we're stuck with an inferior, complex solution to a trivially simple problem - vs just trusting the core library functions to be fast. Unless you have a culture that constantly revisits these things or tracks them, I see this as the path to the dark side. I guess I often see this as long-term vs short-term thinking.

I see this kind of talk a lot with the new ES6/2018 stuff. "Iterators are so slow, don't use them." I believe they will get faster; the compilation engines will improve. Engines have had a LONG time to optimize the hell out of ES5 code... so if an iterator is a much simpler solution, I'd rather use that and trust the engines to catch up later.

So I tend to worry if no one can explain WHY something is faster when it seems very intuitive that the library functions should be fastest... and also I fear there could be differences here across engines - that should be tested. Is it possible replace is natively VERY fast in Safari and this hack "only" speeds up Chrome? Those are the kinds of things that are easy to miss without a mindset specifically focused on performance as a key metric.


On a smaller note things like this set off my spidey sense really fast:

function escapeHTML(...)
  const match = (attr ? ATTR_REGEX : CONTENT_REGEX).exec(html);
  if (!match) return html;

Anytime I see people guard with CPU burn in an attempt to avoid CPU burn... so many times this is the wrong call - or at least very input dependent - and a lot of people never stop to ask those questions. If 99% of highlighted code includes one of those symbols that needs to be escaped, then this becomes an entirely wasted check - we might as well just do the replace operation. On the other hand, if most code does not, then it's likely a win. But OTTOMH I'd say a lot of code snippets (esp. if medium or large) are likely to contain a character needing escaping... given random input, though, this is a complete coin flip - and if that type of optimization were a good idea (as a general 50/50 bet) then I'd expect replace to do it internally.

Ok point 2 probably covered a bit much. :-)


It'd be fun to focus more on performance with Highlight.js, but to do it "right" I think that would require some dedicated CI servers (at the very least CPU-locked VPSs) and a very specific suite of performance tests - to make sure we're actually moving forward, avoiding regressions, etc... Short of that, I just kind of keep an eye on the time the test suite takes to run, and as long as it's "sorta the same" I'm happy.

@joshgoebel
Member

Now it's possible there are answers to WHY. It's possible the ECMAScript spec requires very specific behavior from replace (though I don't know why it would preclude any beneficial performance optimizations) that "ties its hands" so to speak. If that were determined then this kind of thing would be slightly more interesting... but still not as much as you might think... it seems like the kind of thing better suited to a tiny (well maintained) JS library replaceHTML... by someone who cares about the speed and is going to keep an eye on that ball. And then perhaps it'd make sense to consider using that library vs 'rolling our own'...

I.e., replaceHTML is really more of a "utility" than our bread and butter. I'm sure such libraries already exist, I just don't know if any are small and performance focused. :)

But honestly speed has never been a super high priority vs reliability and consistency. In many common use cases we're way, way past "fast enough" so micro tuning at the cost of code complexity is a poor tradeoff.

@joshgoebel
Member

joshgoebel commented Nov 22, 2020

I just tried removing all the wasted extra HTML generation and HTML encoding for our auto-detection. Our auto-detect test suite runs over 36,000 combinations of highlighting. So we're looking for the 189x faster encoding and HTML generation... and drum roll:

  • Before: ~8.0s
  • After: ~8.0s

So even if that was a big win it's such a small total part of our time spent as to be irrelevant in the big picture.

Oh but it's worse. :-) Our auto-detect tests IGNORE the output (they don't care about HTML at all)... they are only looking for "I expected this to be JS, was it detected as JS"... so actually I just eliminated 100% of the HTML generation and encoding for the auto-detect test suite with 0 measurable impact. :-/

Your library must be insanely fast at parsing if you can find 13% wins in the HTML encoding alone.

@RunDevelopment
Author

Seems like faster HTML escaping doesn't help Highlight.js. Just wanted to let you know in case it did.

As you correctly said, maintainability is a big factor as well but sometimes the tradeoff is worth it ;)

I'm sure such libraries already exist, I just don't know if any are small and performance focused. :)

Oh yes, they do.


I'll close the issue as it doesn't seem like this will benefit Highlight.js.

@joshgoebel
Member

Did you have any comments/thoughts/insight on WHY it's faster? Seems very strange to me that replace would be so very slow...

@RunDevelopment
Author

Functions that only use basic control structures (if, switch, and simple loops) are what browsers can optimize best. My best guess is that the optimized escape function gets compiled into something essentially equivalent to a single replace call, so it only takes one pass to properly escape the text.
But I'm no expert on JIT optimizations; this is just my guess from what I know.

@joshgoebel
Member

joshgoebel commented Nov 22, 2020

My best guess is that the optimized escape function gets compiled into something essentially equivalent to a single replace call, so it only takes one pass to properly escape the text.

Ok, we have 5 calls - so we are iterating over the string 5 times. I get that. It's not 100% apparent to me that once would be faster (C++ code vs JS code) - which is why I didn't change it (I've thought about it). But even if so, replace can already do that. Off the top of my head:

code.replace(/[&<>"']/g, (m) => RAW_to_ESCAPED[m]);

Actually (I just looked again) that's exactly what they were doing initially - a single pass, but with switch/case instead of an object lookup. Again I'm expecting replace to be fast native code and wondering how a bunch of hand-written JS could trump it... If I had unlimited time I'd totally go exploring down this rabbit hole. :-)

I'm not sure your benchmark was comparing apples to apples... If you have the time (or inclination) I'd be very curious if you could benchmark just:

.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/\u00a0/g, ' ');

vs

// use a lookup table and callback
.replace(/[&<\u00a0]/g, (m) => RAW_to_ESCAPED[m]);

vs

// crazy version

My strong suspicion is the middle version wins. #benchmarkingishard

My worry is perhaps you were really only benchmarking the "do it 3 times" vs "do it once".
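
For what it's worth, a minimal harness for that comparison could look like this (a sketch using Benchmark.js, which the ops/sec output below appears to come from; the sample input and lookup table are made up, and only two of the variants are shown):

  const Benchmark = require('benchmark');

  const SAMPLE = '<div class="x">if (a && b < c) { return "hi"; }</div>\n'.repeat(1000);
  const RAW_to_ESCAPED = { '&': '&amp;', '<': '&lt;', '\u00a0': ' ' };

  new Benchmark.Suite()
    .add('sequential simple', () => {
      SAMPLE.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/\u00a0/g, ' ');
    })
    .add('simple all at once with callback', () => {
      SAMPLE.replace(/[&<\u00a0]/g, (m) => RAW_to_ESCAPED[m]);
    })
    .on('cycle', (event) => console.log(String(event.target)))
    .on('complete', function () {
      console.log('Fastest is ' + this.filter('fastest').map('name'));
    })
    .run();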

@joshgoebel
Member

crazy js version x 355 ops/sec ±0.60% (85 runs sampled)
sequential simple x 304 ops/sec ±0.50% (86 runs sampled)
simple all at once with callback x 294 ops/sec ±0.39% (84 runs sampled)
all at once with switch x 299 ops/sec ±0.48% (84 runs sampled)
escape same length x 295 ops/sec ±0.56% (83 runs sampled)
Fastest is crazy js version

Well damn. Now off I go down the rabbit hole of WHY.

@joshgoebel
Member

joshgoebel commented Nov 23, 2020

@RunDevelopment Well thanks for that little diversion, it was fun. :-) When you look closely you see they use the regex as a sort of "guard" to get the first replacement position, but then they loop over the string one character at a time.
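
The pattern being described looks roughly like this (paraphrasing the escape-html shape from memory; a sketch, not an exact copy of that module):

  function escapeHtml(string) {
    // the regex is only used as a "guard" to find the FIRST char needing escaping
    const match = /["'&<>]/.exec(string);
    if (!match) return string; // fast path: nothing to escape

    let html = '';
    let lastIndex = 0;
    let index = match.index;
    // ...after that, the string is walked one character at a time
    for (; index < string.length; index++) {
      let escape;
      switch (string.charCodeAt(index)) {
        case 34: escape = '&quot;'; break; // "
        case 38: escape = '&amp;'; break;  // &
        case 39: escape = '&#39;'; break;  // '
        case 60: escape = '&lt;'; break;   // <
        case 62: escape = '&gt;'; break;   // >
        default: continue;
      }
      if (lastIndex !== index) html += string.substring(lastIndex, index);
      html += escape;
      lastIndex = index + 1;
    }
    return lastIndex !== index ? html + string.substring(lastIndex, index) : html;
  }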

That pattern immediately led me to all sorts of good questions:

  • Is regex REALLY that slow?
  • That we can beat it by essentially reimplementing it in JS?
  • ... looping manually one character at a time over a string?
  • If it's so slow, why aren't we starting at index 0 and avoiding the regex entirely?
  • ...or if it's faster, why are we only using it to locate the first replacement, not all of them?

So I'm more confused than ever about my original question... Why is replace so SLOW? ...but I made your version of the function (for Prism) 75% faster and the escape-html NPM version 30% faster. Turns out regex IS indeed faster than looping one character at a time. At least something makes sense. :-)

component/escape-html#28
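
For illustration, "let the regex find every match" (instead of walking the string character by character) looks roughly like this - a sketch of the general direction, not the exact diff in that PR:

  const ESCAPE_RE = /["'&<>]/g;
  const ESCAPES = { '"': '&quot;', "'": '&#39;', '&': '&amp;', '<': '&lt;', '>': '&gt;' };

  // jump from match to match via the global regex instead of charCodeAt per character
  function escapeHtmlRegexDriven(str) {
    ESCAPE_RE.lastIndex = 0;
    let match = ESCAPE_RE.exec(str);
    if (!match) return str;

    let html = '';
    let lastIndex = 0;
    do {
      html += str.substring(lastIndex, match.index) + ESCAPES[match[0]];
      lastIndex = match.index + 1;
    } while ((match = ESCAPE_RE.exec(str)) !== null);

    return html + str.substring(lastIndex);
  }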
