Use regex for ALL search: 30% faster #28

Open · wants to merge 2 commits into master

Conversation

@joshgoebel

If regex is so fast that it's worth using at the beginning of the function, why not use it for the stepping?

regex stepping x 280 ops/sec ±0.41% (83 runs sampled)
original x 214 ops/sec ±0.60% (85 runs sampled)
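Roughly, the idea is something like this (a simplified sketch of the shape, not the exact code in the two commits; the short-circuit for strings with nothing to escape falls out of the same loop):

  // Simplified sketch: a global regex finds every special character and we
  // only slice/append around the matches, instead of stepping through each
  // remaining character after the first hit.
  var escapeAllRegExp = /["'&<>]/g;

  function escapeHtmlRegexOnly(string) {
    var str = '' + string;
    var html = '';
    var lastIndex = 0;
    var match;
    var escaped;

    while ((match = escapeAllRegExp.exec(str))) {
      switch (match[0]) {
        case '"': escaped = '&quot;'; break;
        case "'": escaped = '&#39;'; break;
        case '&': escaped = '&amp;'; break;
        case '<': escaped = '&lt;'; break;
        case '>': escaped = '&gt;'; break;
      }
      html += str.substring(lastIndex, match.index) + escaped;
      lastIndex = match.index + 1;
    }

    // exec() resets lastIndex to 0 when it finally fails to match, so the
    // shared global regex is safe to reuse on the next call.
    return lastIndex === 0 ? str : html + str.substring(lastIndex);
  }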

@dougwilson (Contributor) left a comment

Hi, and thank you so much for this!

Please make sure you run npm test and fix up the failing unit tests. We want to be sure that everything is passing before merging in.

This module has a benchmark suite, and testing master vs this PR shows that this PR is overall slower, especially when there are lots of special characters to escape. You can run npm run bench.

master:

  no special characters    x 17,654,390 ops/sec ±1.35% (185 runs sampled)
  single special character x  5,221,477 ops/sec ±1.84% (178 runs sampled)
  many special characters  x  2,653,137 ops/sec ±0.88% (182 runs sampled)

PR #28

  no special characters    x 17,720,386 ops/sec ±1.04% (187 runs sampled)
  single special character x  4,523,898 ops/sec ±1.03% (186 runs sampled)
  many special characters  x    668,211 ops/sec ±0.97% (186 runs sampled)

@joshgoebel (Author) commented Nov 23, 2020

Please make sure you run npm test and fix up the failing unit tests. We want to be sure that everything is passing before merging in.

If there's an edge case or two, those can be fixed... but first we should sort out the performance discrepancy.

I'm the maintainer of Highlight.js so I was working within that context. My test sample (for benching) was a 1.2mb Javascript file of Highlight.js compiled with all languages. This means there are plenty of " and < and & but not every few characters... To me this is a very "real world" workload - esp. for our use cases. I tested both a Prism-like encoder (escaping only 3 characters) and a Highlight.js encoder [same as here] (encoding 5 characters).

The Prism workload was 75% faster and the Highlight.js workload 30% faster. 30% is an impressive number; otherwise I never would have opened this PR. I'm finding it hard to comprehend the test results (so opposite to my own)... which leads me to notice a few things and ask a few questions... I worry we're comparing apples to oranges and first we need to agree on what the fruit is. :-)

  • What version of Node are these tests running on?
  • All the tests are tiny, tiny snippets of random letters. Surely a 10-20 byte string is NOT a representative sample of what people will be using this library for in the real world? I worry this could be making your performance tests meaningless. I.e., they are testing something - but not testing how fast the code is in the real world on real data. I'm finding it very hard to otherwise account for the very different results (unless it's the Node version).
  • I also feel we may need to decide which cases we are optimizing for. What is the most common "real life" scenario you're trying to solve?
  • "many special characters" (like every few characters as in your sample) seems suspect to me. What about "all special characters"? :-) I say "suspect" as in perhaps it represents an edge case (a worst- or best-case scenario - depending on the replace algorithm chosen)
  • It seems obvious (from the current benchmark results - unless they are totally skewed by string length) that a "one character at a time" solution could be quite fast when a string requires most characters to be replaced, whereas a regex solution might be faster (and faster) the fewer characters need to be replaced.

I feel like once we have more representative samples we may have to prioritize which test cases are more important. Encoding actual real HTML or text might be way faster with this algorithm, while encoding payloads with 100% special characters might be much slower (since that is a worst-case scenario). And of course it makes sense to benchmark the worst-case scenario, but not to make all decisions based on it.

I feel like perhaps we first need to agree on what representative test samples are.

  • Is my 1.2mb Javascript source code a valid test sample?
  • Where should it be weighted along with the other test samples in importance?

Perhaps we need some tests of 1k samples, 10k samples, 100k samples? Repeating the same test 100 times is not at all the same as testing a string 100x longer. Perhaps some ACTUAL real life samples of source code, HTML files, etc?
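For example, adding larger fixtures to the existing suite would look something like this (assuming the suite is benchmark.js-based, which the ops/sec output format suggests; the fixture paths here are made up):

  var fs = require('fs');
  var Benchmark = require('benchmark');
  var escapeHtml = require('..');

  // Hypothetical fixtures of increasing size - real HTML, source code, etc.
  var samples = {
    '1k of HTML': fs.readFileSync('benchmark/fixtures/sample-1k.html', 'utf8'),
    '10k of HTML': fs.readFileSync('benchmark/fixtures/sample-10k.html', 'utf8'),
    '100k of HTML': fs.readFileSync('benchmark/fixtures/sample-100k.html', 'utf8')
  };

  var suite = new Benchmark.Suite();

  Object.keys(samples).forEach(function (name) {
    suite.add(name, function () {
      escapeHtml(samples[name]);
    });
  });

  suite.on('cycle', function (event) {
    console.log(String(event.target));
  }).run();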

From what I've seen it looks like those results may be VERY different from the results of the current benchmark run you posted in response.

Thoughts?

@dougwilson (Contributor)

Thank you for all those questions. One thing I wanted to mention, though, is that I think you are incorrectly assuming I actually wrote the benchmarks and the current code; both of those things landed as PRs, just the same as you are making a PR to get changes landed. So I cannot really answer any of your questions about why I did x or y, as I likely did neither, so would have no answer for you.

I think perhaps you may need to rewrite the above to remove anything that asks why I did something one way or another, and instead work towards asking objective questions rather than subjective ones, so I can actually provide answers.

@joshgoebel (Author) commented Nov 23, 2020

  single special character x  5,221,477 ops/sec ±1.84% (178 runs sampled)
VS  single special character x  4,523,898 ops/sec ±1.03% (186 runs sampled)

This test specifically makes me feel the string size is skewing things... because for a single character what happens is so simple:

  • My implementation runs the regex twice (one match, one fail), adds two strings, done.
  • The library (as currently implemented) runs the regex once, then loops over the whole string checking every character one at a time, then adds two strings.

It seems impossible to believe that "loop over the whole string checking every character one at a time" in JS is faster than the regex engine running once (compiled C++ code). And indeed with a 1.2mb sample of actual code, it's not at all. It's demonstrably slower. I'm betting the existing tests are measuring some "start up" penalty of the regex engine... (making every single one of them truly a "worst case" test, not a realistic one) so if there are only 10 characters, a tight loop wins... but if there are 100, 1000, 10000 characters... the question is where is that line.
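For reference, the current implementation is shaped roughly like this (paraphrased and simplified - the real code lives in index.js):

  // Roughly the current shape: one non-global regex finds the first special
  // character (the short-circuit), then a for loop walks every remaining
  // character, switching on its char code.
  var matchHtmlRegExp = /["'&<>]/;

  function escapeHtmlCharByChar(string) {
    var str = '' + string;
    var match = matchHtmlRegExp.exec(str);
    if (!match) return str; // nothing to escape

    var html = '';
    var lastIndex = 0;
    var escape;

    for (var index = match.index; index < str.length; index++) {
      switch (str.charCodeAt(index)) {
        case 34: escape = '&quot;'; break; // "
        case 38: escape = '&amp;'; break;  // &
        case 39: escape = '&#39;'; break;  // '
        case 60: escape = '&lt;'; break;   // <
        case 62: escape = '&gt;'; break;   // >
        default: continue;
      }
      html += str.substring(lastIndex, index) + escape;
      lastIndex = index + 1;
    }

    return html + str.substring(lastIndex);
  }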

The thing that led me down this rabbit hole is: which is faster, regex or a for loop? And if it's clear - why use both? Why use the regex to find only the first match? If regex is slow, why use it at all; just walk the whole string. I have a sneaking suspicion that for these tiny strings it might be even faster if you removed the regex completely. It's the first thing I tried - but for larger payloads it's way worse - because the regex engine is much faster.

Sorry for writing a book. :)

@dougwilson (Contributor)

Hi, thank you for your response. It does not seem like you are interested in having an objective discussion on this, even bolding out asking why I specifically did something. If you cannot have a civil, objective discussion, there is likely nothing further to discuss here. Please let me know whether you are going to update your comments to address this, so I know if there is a point in going forward.

@joshgoebel (Author) commented Nov 24, 2020

It does not seem like you are interested in having an objective discussion on this

I totally am. sighs 😢 Sometimes I really really hate the internet's complete lack of tone.

even bolding out asking why I specifically did something.

I bolded "And if it's clear - why are use both?"

I specifically removed the you (forgot and left the "are") when proofing to AVOID it sounding accusatory. So it should have said "why use both"? It's just a question (a key question, hence the bold), not an accusation.

If you cannot have a civil, objective discussion, there is likely nothing further to discuss here.

A civil objective discussion is exactly what I wanted. I think I did assume you had written the code but nothing I said was intended to blame or slander - even if you had written it all.

Please let me know whether you are going to update your comments to address this, so I know if there is a point in going forward.

What exactly do you want me to change, and I will? (going back to reread it now)

(edited to replace bold with italics)

@joshgoebel (Author) commented Nov 24, 2020

Do you just want me to remove all the "you" and "your" and replace them with "this code"? Done.

Again, no slight was implied or intended. To me "your library", "my patch", "your code", "my code", etc. is just the most natural way to refer to what I'm talking about. I do understand how it implies ownership, but it was not intended to be offensive, and I truly apologize.

@joshgoebel (Author) commented Nov 24, 2020

Meanwhile over in the SSR thread on this topic:

I do think given these are small template inserts where the static parts are already processed at compile time the average is going to be in that 10 character region.

Oh wow. :-) I was imagining dropping in processed user input like a question posted on StackOverflow or a forum post or something... (which would be much longer than 10, but of course not 1.2mb by any stretch) Your use case then is very different from mine. I kind of wonder if a truly optimal solution (for all cases) would have two entirely different code paths based on the size of the string. It seems there may be some real "startup costs" with just engaging regex on tiny strings...


I'm not sure how much complexity you want to allow in the library, but it could be that we could have our cake AND eat it too by using one algorithm for short snippets and a regex-powered solution for longer ones.
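Something like this, purely as an illustration (the function names stand in for the two implementations sketched earlier in this thread, and the threshold is invented - the right cutoff, if there is one, would have to be measured):

  // Hypothetical dispatch on input length: tiny strings go to the existing
  // char-stepping code, longer ones to the regex-driven code from this PR.
  var SHORT_INPUT_THRESHOLD = 512; // made-up number, purely for illustration

  function escapeHtmlAdaptive(string) {
    var str = '' + string;
    return str.length < SHORT_INPUT_THRESHOLD
      ? escapeHtmlCharByChar(str)  // current-style implementation
      : escapeHtmlRegexOnly(str);  // regex-everywhere implementation
  }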

@joshgoebel (Author) commented Nov 24, 2020

Ok, so now I'm using the included benchmark suite (with more tests) and I switched to a likely more representative sample (a large page of HTML from GitHub vs the Javascript code). I'd think HTML would be a realistic real-world scenario for an HTML escape, yes? Let's walk through a few comparisons.

What are we comparing:

  • Using regex for all search
  • Using regex for the first find (current implementation)
  • Not using regex at all (pure loop)

I don't always include the latter as it's really only interesting vs the original implementation in one case.


The results:

  Long HTML page                           x       251 ops/sec ±2.35% (81 runs sampled)
  Long HTML page REGEX                     x       285 ops/sec ±0.92% (82 runs sampled)

First 650kb of HTML from a Github discussion thread... Using regex is 13% faster.


  Short HTML page                          x      5,707 ops/sec ±0.68% (87 runs sampled)
  Short HTML page REGEX                    x      6,380 ops/sec ±0.54% (90 runs sampled)

That same webpage but truncated to 30kb... and now regex is 11% faster.


  single special character                 x 9,381,587 ops/sec ±5.86% (84 runs sampled)
  single special character REGEX           x 9,470,886 ops/sec ±7.36% (78 runs sampled)

I'm not sure why this is reversed from your original results (perhaps because I switched to test vs exec), but this is really still neck and neck and well within the huge margin of error.


  single special character (large)         x    26,388 ops/sec ±1.15% (80 runs sampled)
  single special character (large) REGEX   x    98,110 ops/sec ±3.91% (87 runs sampled)

The full-on regex is now winning handily at finding a single needle in a haystack: 3.7x faster. This makes sense since we're avoiding iterating through the last half of the string one char at a time (the needle was inserted in the middle). If it's placed at the beginning the performance gets much worse, and if it's at the very end then we'll start to approach the speed of "no special character" since the for loop will iterate so few times.


Now let's look at "many special characters", which is indeed where regex falls on its face... though even the original is slow here compared to not using regex at all:

  many special characters                  x 4,154,659 ops/sec ±4.18% (86 runs sampled)
  many special characters NO REGEX         x 5,197,501 ops/sec ±6.71% (80 runs sampled)
  many special characters REGEX            x 1,461,307 ops/sec ±6.34% (77 runs sampled)

As I guessed, almost 25% faster to not use regex at all... and using regex exclusively now costs us, so we run at only 35% of the speed of iterating a single character at a time.

  many special characters (large)          x     5,510 ops/sec ±1.19% (85 runs sampled)
  many special characters (large) REGEX    x     1,474 ops/sec ±1.08% (85 runs sampled)

So of course this bears out with larger samples as well.


Note: I also purposely excluded "no special characters" from discussion here since both implementations use regex and should be effectively the same speed (as the original benchmark you ran pretty much showed).


So I think it boils down to what type of input data is the most common or realistic use case...

  • When used to escape actual HTML code it seems this is a win. 11-13% faster.
  • When used to escape source code (which contains fewer special characters) this is a bigger win. 30% faster.
  • When used to escape worst-case input (\'>\'\\"\\"&>h<e>&<y>") with a huge proportion of special characters, my proposed solution starts to lose out to the original.

Thoughts?


Also: All tests are green now (on my machine). I pushed EVERYTHING so you'd be able to run the same tests. If we agree on a path forward I can of course clean up the PR to remove all the extra code, etc. Though if the HTML test is reasonable we need a location to keep it, etc.

@makeable commented Nov 24, 2020

I'm going to weigh-in because I use this lib.

I think the current balance is spot on. There's a faster-than-iteration short-circuit for fragments that don't need escaping, and faster-than-multi-match Regexp iteration for fragments containing a realistic percentage of real-world entities in a typical document (both prose and html).

Please don't change it!

@@ -0,0 +1,11747 @@ (review comments on the large HTML sample file added in this PR)

Contributor:

I don't think we can include this content here according to the posted policy at https://docs.github.com/en/free-pro-team@latest/github/site-policy/github-terms-of-service#f-copyright-infringement-and-dmca-policy

You may not duplicate, copy, or reuse any portion of the HTML/CSS, Javascript, or visual design elements or concepts without express written permission from GitHub.

Author:

Good point. :-) Didn't even think of this. Can you think of another site that perhaps wouldn't care if we borrowed an HTML sample without just making one up? I'm not sure GitHub would come after us for this, but if you want to err on the side of caution I also understand that. :)

Author:

If we were to add a "source code" sample test would any large OSS project with a permissive license work? (ie Vue, React, etc?)

Another commenter:

I'm a Svelte maintainer and would be happy if you want to use svelte.dev or kit.svelte.dev. Our sites are MIT licensed

@@ -20,7 +20,7 @@ var matchHtmlRegExp = /["'&<>]/
 * @public
 */

-module.exports = escapeHtml
+module.exports = { escapeHtml, escapeHtmlFast, escapeHtmlNoRegex }
Contributor:

Please make sure your syntax works on the Node.js versions declared in our package.json; generally ES5. If there are now multiple exports, they of course should all be documented in the README.

I also see there is no longer any default export, definitely making this a major version bump, at least.

Author:

That's why I said:

If we agree on a path forward I can of course clean up the PR to remove the extra code, etc.

I pushed what I had so that anyone could easily test it exactly as I had. (see both results side by side, tweak both, etc) Unless we decide to change the API these changes don't need to happen. I'm just first trying to get agreement on what we're doing here.

generally ES5.

Ah, I'm spoiled by our ES6 codebase, but I can fix. (or again, simply revert)

I also see there is no longer any default export, definitely making this a major version bump, at least.

I'd agree if we are going to have multiple functions.

@dougwilson (Contributor)

Thank you so much for weighing in @makeable ! I also use this module all over the place in various other npm modules as well, all of which pass through it very short strings, most of which contain nothing to escape or a few characters to escape. Certainly keeping that use-case fast would be a priority for me as well, but I think if we can improve the performance of very large inputs, we should do that too.

But looking over the pull request, I don't think we should do that at the expense of an API requiring the user to choose which method to use. Just like how the v8 engine uses two different sorting algorithms for the Array.prototype.sort, I think the same kind of thing can be applied here: employ perhaps two or more (this PR adds 2 and keeps the original, so seems like maybe we need 3?) algorithms and then have the single export choose the right one to use in order to be performant across the board while providing an easy-to-use API.

@joshgoebel (Author) commented Nov 25, 2020

There's a faster-than-iteration short-circuit for fragments that don't need escaping, and faster-than-multi-match Regexp iteration for fragments containing a realistic percentage of real-world entities in a typical document (both prose and html).

The short-circuit isn't going anywhere - both implementations short-circuit.

Could you weigh in on what you mean by "prose" exactly? As I've shown, actual HTML (I think a complex GitHub page would be a pretty good reference sample) is 10-13% faster. Perhaps a huge section of Brainfuck could be slower with this new way... but:

  • HTML encoding of actual HTML should be faster. (13%)
  • HTML encoding of some source code (Javascript) - 30% faster.
  • HTML encoding of text or poetry or something should be even faster.
  • Code with no characters needing to be escaped is equally fast.

It's possible there is a use case that contains a HUGE % of escapable characters, but no one has mentioned what it might be yet. :-)

We don't disagree. :-) The algorithm (unless it can detect and run two paths) should be optimized for "a realistic percentage of real-world entities". The question right now is what exactly is "a realistic percentage of real-world entities" for the "common" 80/20 case... If the primary goal is escaping actual HTML it looks like right now the new algorithm has an edge.

@joshgoebel (Author) commented Nov 25, 2020

well, all of which pass through it very short strings, most of which contain nothing to escape or a few characters to escape. Certainly keeping that use-case fast would be a priority for me as well, but I think if we can improve the performance of very large inputs, we should do that too.

Just to be clear, the latest benchmark does not seem to show a penalty for small inputs against the original method. Its speed seems consistent whether the input is short or long (making my original thinking wrong). So the "mostly text, short input" case should still be equally fast. There is definitely a huge win for regex with larger inputs though. As the string gets larger the regex method gets faster and faster (since regex is better at covering the vast amount of ground) - again assuming an "HTML" level of characters needing to be encoded (or less).

The only downside I'm seeing now is with what seems to me slightly "artificial" input with a huge proportion of characters needing to be escaped: "<<<<<<<<<<<<<<<<<<<". Unless someone pipes up with real-life examples showing this is a very common use case, I'd suggest that it's the edge case. :-)

But looking over the pull request, I don't think we should do that at the expense of an API requiring the user to choose which method to use.

I was never recommending this although I'm also not 100% opposed. I'd name them differently though separating escapeHTML from escapeHTMLCharByChar (or something). Honestly (unless someone mentions a use case) I feel that the latter is an edge case - making it not worthy of inclusion.

employ perhaps two or more (this PR adds 2 and keeps the original, so seems like maybe we need 3?) algorithms and then have the single export choose the right one to use in order to be performant across the board while providing an easy-to-use API.

This is what I considered at first - when I thought the issue was length. If the performance were skewed by size then we could just figure out where the "boundary" was and have an if size < 500 check or some such... but sadly it seems it all comes down to content - which (unless someone has a genius solution) there is no way to know in advance without scanning the string first (further slowing us down) - or the user telling us (having 2 APIs).

The best we can do is a regex short-circuit ensuring content with 0 characters to be encoded is as fast as can be - which both algorithms already do. :)

so seems like maybe we need 3?

Oh please no. :) I was never suggesting that "loop over EVERY character without regex" be a serious contender - only trying to make the point that a lot of different algorithms have different performance - all depending on the characteristics of the input.

@joshgoebel (Author) commented Nov 25, 2020

If we can reach an agreement (that escaping actual HTML, code, or textual-ish content is the primary use case) then this PR can be simplified to a tiny, tiny PR that just swaps out the core function with the faster one (i.e., like it was originally - but with all tests and the linter green). Again, the current "state" of the repo is just to allow anyone who wants to reproduce my results to do so easily (or expand the test cases).

This really boils down to: does this library aim to solve the 80/20 or 90/10 case super-well, or is the ambition larger - to provide a hammer, a screwdriver, and a pair of pliers all at the same time, to cover every conceivable scenario? ;-)

@joshgoebel (Author) commented Nov 25, 2020

I guess I should mention a few of the use cases I'm thinking of (or have personally):

  • Source code (highlighted output from Highlight.js or Prism.js), so really source code with a bunch of intermingled HTML.
  • Raw source code (just to get it on the web)
  • Encoded Markdown content (so like a forum, StackOverflow, GitHub, etc) - ie, more HTML.
  • Raw text being sanitized for the webpage (mostly text, few special characters, best case scenario)

All these cases would use something like escapeHTML at the very end of the process before rendering the content or piping it to the client. So for all these cases I'm seeing 10-30% (or better) performance increases. And even a 60% improvement for replacing 3 characters vs 5, though that's perhaps not relevant here. Seems to suggest the regex gets a lot faster the less work it has to do, which makes sense.

@makeable

By prose, I mean "non-html content". <-- this very string for example!

I'm confused by your use cases to be honest. Why do you want to encode all entities in a 1.2MB HTML document? Then it is no longer HTML. If it is for display in a code block, for example, then you aren't going to be using HUGE documents. Even so, html contains many entities to be encoded, so the multi-match mechanism is inefficient compared to simple iteration.

If it is for safely outputting user-generated content, then it is generally working on textual content, which will be sparsely populated with entities (if at all), and generally only a few paragraphs. In this case the short circuit works well, as much prose doesn't contain entities, and for those that do, it will be sparsely populated, and generally in pairs (quotes for example), meaning the multi-match mechanism is less efficient.

From the benchmarks given, I am not seeing any benefit over what already exists. Maybe it is worth putting together a jsperf or something, to demonstrate for a number of use cases, rather than trying to submit this as a PR?

@joshgoebel (Author) commented Nov 25, 2020

By prose, I mean "non-html content". <-- this very string for example!

That's the best case (nothing to escape) and is equally fast in both implementations. Although it can quickly slow down with the current implementation as soon as the content starts including any special characters at all. The new implementation performs best the sparser the content, since that's when regex really shines.

I'm confused by your use cases to be honest. Why do you want to encode all entities in a 1.2MB HTML document?

It was just the first test case I had laying around. :-) Don't get too focused on that - as I show above the performance gains seem to hold for small snippets. The new implementation just gets faster and faster the larger the content.

Stepping over a string one character at a time is very inefficient (unless you need to replace almost all the characters).

If it is for display in a code block, for example, then you aren't going to be using HUGE documents.

See sites like GitHub - sometimes you indeed do have a LARGE file with syntax coloring... But let's forget 1.2mb... The benchmarks clearly show HUGE (3.7x) speedups for only 13kb of data in optimal cases.

Even so, html contains many entities to be encoded, so the multi-match mechanism is inefficient compared to simple iteration.

Right now RegExp.test seems 10-12% faster for HTML. See the benchmarks which I cited above and provided (so that you can easily reproduce should you so choose). I'll inline it again for convenience:

  Short HTML page                          x      5,707 ops/sec ±0.68% (87 runs sampled)
  Short HTML page REGEX                    x      6,380 ops/sec ±0.54% (90 runs sampled)

HTML is processed 13% faster with the implementation proposed in this PR. If you have an opposing benchmark or some issue with the benchmark itself (as provided in the PR) please be specific.

it will be sparsely populated, and generally in pairs (quotes for example), meaning the multi-match mechanism is less efficient.

Based on what are you saying it's less efficient? This statement simply does not agree with the benchmark data. The more sparsely populated the content, the faster regex gets and the slower character-by-character gets. For "single character" the new algorithm currently has a slight edge for small snippets, and the larger the snippets get, the bigger the performance gains. Again, 3.7x faster for a small, optimal 13kb snippet.

From the benchmarks given, I am not seeing any benefit over what already exists.

Are we looking at the same benchmarks? Again to summarize what the benchmarks above show:

  • HTML encoding of actual HTML should be faster. (13%)
  • HTML encoding of some source code (Javascript) - 30% faster.
  • HTML encoding of text or poetry or something should be even faster.
  • Code with a single character or a few characters can easily be 3-4 TIMES faster. The larger the sample the more the speed-up.
  • Code with no characters needing to be escaped is equally fast for both implementations.

The attached benchmarks show this clearly - I shared the numbers. If you believe I'm reading them wrong it would be helpful for you to mention a specific case and the issue you're having with the benchmark itself.

Maybe it is worth putting together a jsperf or something

That's what benchmark/index.js is. I've benchmarked the changes and provided the data above. Anyone can download the PR and run the benchmarks to verify. If we need to add some additional test cases, that'd be fine... we just need to agree on what additional cases we should be testing.


Summary

Again from what I see right now the existing algorithm seems to only be superior (vs the new one) when given content where almost every letter needs to be replaced. In all other cases (as shown by benchmark), the new algorithm seems equivalent or faster - either just a little faster, or in some cases a LOT faster. If I'm missing something, I'm happy to have someone point to it specifically or provide an opposing benchmark.

@joshgoebel (Author) commented Nov 25, 2020

Again from what I see right now the existing algorithm seems to only be superior (vs the new one) when given content where almost every letter needs to be replaced

Ok, that's entirely wrong - my bad. So I should clarify. It turns out the "density" of special characters in HTML is just very low. The large HTML sample from GitHub is 6% special characters (of the entire file). And the new algorithm seems to outperform the current one when the density is less than 10% (for small samples).
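(Roughly how such density-controlled samples can be generated - a hypothetical helper, not necessarily the exact code behind the numbers below:)

  // Hypothetical generator: an n-character string where roughly `percent`
  // percent of the characters are HTML-special, spread out evenly.
  function makeSample(n, percent) {
    var specials = '"\'&<>';
    var interval = percent > 0 ? Math.round(100 / percent) : 0;
    var out = '';
    for (var i = 0; i < n; i++) {
      if (interval && i % interval === 0) {
        out += specials[i % specials.length];
      } else {
        out += 'a';
      }
    }
    return out;
  }

  // e.g. the "100 characters (6% special)" case would be built roughly as:
  var sample = makeSample(100, 6);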

  100 characters (0% special) (current)        x 9,741,265 ops/sec ±0.64% (38 runs sampled)
  100 characters (0% special) Regex            x 9,750,900 ops/sec ±0.45% (39 runs sampled)
  100 characters (2% special) (current)        x 3,275,650 ops/sec ±0.60% (35 runs sampled)
  100 characters (2% special) Regex            x 5,856,570 ops/sec ±0.76% (37 runs sampled)
  100 characters (4% special) (current)        x 1,779,370 ops/sec ±0.48% (35 runs sampled)
  100 characters (4% special) Regex            x 3,484,068 ops/sec ±0.67% (37 runs sampled)
  100 characters (6% special) (current)        x 1,573,226 ops/sec ±1.86% (38 runs sampled)
  100 characters (6% special) Regex            x 1,770,686 ops/sec ±0.64% (37 runs sampled)
  100 characters (8% special) (current)        x 1,436,798 ops/sec ±2.13% (37 runs sampled)
  100 characters (8% special) Regex            x 1,830,717 ops/sec ±0.86% (38 runs sampled)
  100 characters (10% special) (current)       x 1,301,051 ops/sec ±1.50% (39 runs sampled)
  100 characters (10% special) Regex           x 1,408,288 ops/sec ±0.62% (39 runs sampled)
  100 characters (12% special) (current)       x 1,143,921 ops/sec ±1.59% (38 runs sampled)
  100 characters (12% special) Regex           x 1,108,194 ops/sec ±0.82% (39 runs sampled)
  100 characters (14% special) (current)       x 1,291,972 ops/sec ±1.74% (38 runs sampled)
  100 characters (14% special) Regex           x 1,120,855 ops/sec ±2.90% (35 runs sampled)
  Long HTML page                               x       268 ops/sec ±1.59% (30 runs sampled)
  Long HTML page REGEX                         x       296 ops/sec ±1.00% (30 runs sampled)
  Short HTML page                              x     5,405 ops/sec ±1.58% (35 runs sampled)
  Short HTML page REGEX                        x     6,306 ops/sec ±2.16% (36 runs sampled)

The larger the data size, the better... so here with 10k of data we're still 40% faster even when the density has risen to 14%.

  10000 characters (0% special) (current)        x 133,299 ops/sec ±2.53% (36 runs sampled)
  10000 characters (0% special)           Regex  x 137,775 ops/sec ±0.85% (35 runs sampled)
  10000 characters (2% special) (current)        x  20,104 ops/sec ±3.65% (29 runs sampled)
  10000 characters (2% special)           Regex  x  51,094 ops/sec ±0.90% (37 runs sampled)
  10000 characters (4% special) (current)        x  14,380 ops/sec ±2.91% (38 runs sampled)
  10000 characters (4% special)           Regex  x  26,989 ops/sec ±3.43% (35 runs sampled)
  10000 characters (6% special) (current)        x  11,639 ops/sec ±1.13% (36 runs sampled)
  10000 characters (6% special)           Regex  x  19,690 ops/sec ±1.17% (37 runs sampled)
  10000 characters (8% special) (current)        x   9,400 ops/sec ±1.39% (36 runs sampled)
  10000 characters (8% special)           Regex  x  15,068 ops/sec ±0.94% (36 runs sampled)
  10000 characters (10% special) (current)       x   8,375 ops/sec ±1.07% (37 runs sampled)
  10000 characters (10% special)           Regex x  12,777 ops/sec ±0.95% (36 runs sampled)
  10000 characters (12% special) (current)       x   7,075 ops/sec ±0.83% (37 runs sampled)
  10000 characters (12% special)           Regex x  10,316 ops/sec ±1.05% (35 runs sampled)
  10000 characters (14% special) (current)       x   6,533 ops/sec ±0.79% (38 runs sampled)
  10000 characters (14% special)           Regex x   9,166 ops/sec ±1.02% (36 runs sampled)

@makeable
Actually... I am going to have to concede.

I ran your multi-match against 300k user-generated statements (generally tweet-sized). It is 30% faster for my use case in both node and browser.

I would be happy to see this change, but I do not like the idea of choosing between 3 different strategies. If anything, this should simply replace the existing one.

My suggestion would be to create a PR with just changes to the benchmark, to include real-world scenarios (tweets, chat statements, small code snippets, textual blog content, etc). Randomly generating data (and the current tiny-strings approach) is not a fair assessment - hence my initial apprehension. Then the maintainer can make an objective assessment of the benefits.

But... I do like it - sorry for being a blocker 👍

@joshgoebel (Author)

Actually you could maybe have your cake and eat it too (for larger snippets)... though I'm not sure I have time to pursue that angle. But the idea:

  • Short-circuit to protect against 0 changes needed (as is already done)
  • Analyze the next 100-200 characters (one at a time) to get a density count.
  • If density seems in the "sweet spot", switch to REGEX processing.
  • If density was crazy high, you would continue processing a character at a time.

This would mean you'd never get the performance improvements for small snippets (less than whatever the analysis window is) though, only larger ones.
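A rough sketch of that idea (the window size and density cut-off are guesses, and for simplicity this version re-runs one of the two full implementations - names borrowed from the earlier sketches - rather than resuming mid-string):

  // Illustrative only: sample a window after the first hit, then pick a path.
  var SAMPLE_WINDOW = 200;    // guess - would need tuning
  var DENSITY_CUTOFF = 0.10;  // guess - roughly 10% special characters

  function escapeHtmlDensityAware(string) {
    var str = '' + string;
    var first = /["'&<>]/.exec(str);
    if (!first) return str; // existing short-circuit: nothing to escape

    // Count special characters in the next SAMPLE_WINDOW characters.
    var end = Math.min(str.length, first.index + SAMPLE_WINDOW);
    var specials = 0;
    for (var i = first.index; i < end; i++) {
      var code = str.charCodeAt(i);
      if (code === 34 || code === 38 || code === 39 || code === 60 || code === 62) {
        specials++;
      }
    }

    return (specials / (end - first.index)) > DENSITY_CUTOFF
      ? escapeHtmlCharByChar(str)   // dense input: char-at-a-time wins
      : escapeHtmlRegexOnly(str);   // sparse input: let the regex engine run
  }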

@joshgoebel (Author) commented Nov 25, 2020

It is 30% faster for my use case in both node and browser.

Awesome! HTML seems to be a real sweet spot for the new code.

If anything, this should simply replace the existing one.

Yes, that's the hope. The PR is only in its current state to make benching the multiple algorithms easier. We should only have a single happy path (unless someone wants to pursue the idea I just posted). If the maintainer agrees this is the right direction I can update the PR to just change the existing function and little else.

My suggestion would be to create a PR with just changes to the benchmark, to include real-world scenarios (tweets, chat statements, small code snippets, textual blog content, etc). ... Then the maintainer can make an objective assessment of the benefits.

I've already spent way more time than I intended on this. If this needs a LOT of additional validation/research it may have to wait for someone else to pick it up on a different day. I do agree though, a folder of fixtures containing actual real-life samples that were run instead of the current artificial ones would be optimal. I'm not sure that's necessary to come to an agreement here though if HTML (or less dense prose) is a super common use case.

@makeable commented Nov 25, 2020

To confirm - my use case isn't escaping HTML. It is escaping text (which occasionally contains quotes, etc) for display in HTML. I would expect this to be a much more prevalent use case than actually escaping HTML code.

For example, in a web view for a chat application, you could expect in excess of a hundred instances of escaping per second (one per chat statement) while rapidly scrolling through a chat history (escaping those 100 instances accounts for 0.1ms on low-end devices with the current solution).

It seems any use-case for HTML escaping would generally be limited to displaying code, which I would assume occurs a lot less frequently. (Your speedup mentioned above for the 650KB document accounts for less than 0.25ms, for example)

Anyway - as mentioned, for my text-based requirements escaping via the regex multi-match works well - so I'm happy for a 30% performance boost - although we are talking a 0.3μs difference, and it occurs on the client, so I don't care THAT much. But I think you need to realise what the primary use case for this lib is - and I don't think it's actually escaping HTML (despite the name).

Also, please keep it ES5 - it's more portable that way.

@joshgoebel (Author)

No intention to change ES5 support.

@benmccann mentioned this pull request May 6, 2022
@jonchurch commented Mar 10, 2024

Here's an almost standalone benchmark. You can run it from within the escape-html repo: https://gist.github.com/jonchurch/82bfcec44675f3d8fb24dc5c6fb7e45b

I added in a naive adaptive version, which switches based on the length of the string. The adaptive approach "underperforms" when there is a string which is over the length threshold and contains a high density of special chars.

The default approach "underperforms" when there is a long string with very low density of special chars.

Overall the default one is quite fast. I don't think any of the benchmarks so far are conclusive as to an obviously better algorithm. Leaving this comment in case anyone else stumbles onto it and wants to kick the tires or create better benchmarks.
