
3.0 Performance problems #17627

Closed
paulidale opened this issue Feb 2, 2022 · 55 comments
Assignees
Labels
branch: master, branch: 3.0, branch: 3.1, branch: 3.2, inactive, triaged: bug, triaged: performance

Comments

@paulidale
Contributor

Performance problems encountered with 3.0 should be mentioned here.

@paulidale paulidale added the branch: master and branch: 3.0 labels Feb 2, 2022
@paulidale
Contributor Author

paulidale commented Feb 2, 2022

3.0.0 OpenVMS performance issues #16552

@paulidale
Contributor Author

paulidale commented Feb 2, 2022

Massive performance degradation in OpenSSL 3.0 if used in a heavy multi threaded server application #17064

This is likely resolved by #17543; however, confirmation is lacking.

@paulidale
Contributor Author

Performance measurements #16791

@paulidale
Contributor Author

Performance regression for CMAC #14787

@paulidale paulidale added the hold: need otc decision label Feb 8, 2022
@paulidale
Contributor Author

Adding an OTC decision label here.

Question to the OTC: which of these do we consider important enough to address sooner rather than later?

@t8m
Member

t8m commented Feb 9, 2022

PEM/DER decoding of PKCS8 RSA private keys are 80 times slower in 3.0 #15199
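For context, a minimal sketch of the kind of decode loop such a benchmark implies (not taken from #15199 itself; the key.pem path and the iteration count are placeholders, and the key is assumed to be an unencrypted PEM-encoded PKCS#8 key):

```c
#include <openssl/bio.h>
#include <openssl/evp.h>
#include <openssl/pem.h>

int main(void)
{
    /* Decode the same private key repeatedly; the cost of each decode is
     * what the linked issue measures. */
    for (int i = 0; i < 10000; i++) {           /* placeholder iteration count */
        BIO *bio = BIO_new_file("key.pem", "r"); /* placeholder file name */
        if (bio == NULL)
            return 1;
        EVP_PKEY *pkey = PEM_read_bio_PrivateKey(bio, NULL, NULL, NULL);
        if (pkey == NULL)
            return 1;
        EVP_PKEY_free(pkey);
        BIO_free(bio);
    }
    return 0;
}
```

Timing this loop against 1.1.1 and 3.0 builds of libcrypto is enough to show the gap; in 3.0 the decode goes through the provider/decoder machinery rather than the legacy ASN.1 path, which is consistent with the note in the table below that decoding is generally slower but more flexible.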

@t8m t8m added the triaged: bug label Feb 10, 2022
@paulidale
Contributor Author

OTC: looking for suggestions as to which, if any, should be in 3.1.

@paulidale
Contributor Author

paulidale commented Mar 7, 2022

| PR# | 3.0 | 3.1 | 4.0 | later | Description | Impact / Notes | Fixed by |
|---|---|---|---|---|---|---|---|
| #16552 | 🗹 | 🗹 | 🗹 | 🗹 | huge slowdown for AES operations | OPENSSL_NO_AES_CONST_TIME related | #17600 |
| #17064 | 🗷 | 🗹 | 🗹 | 🗹 | highly threaded AES/SHA: 30× CPU load | unsure of source; profile output doesn't seem to match; ossl_lib_ctx_get_data is suspect | |
| #17116 | 🗷 | 🗹 | 🗹 | 🗹 | locking in libctx (ex_data) | possibly related to the following #17881 | |
| #16791 | 🗷 | 🗹 | 🗹 | 🗹 | MS QUIC handshakes per second 10× slower | microsoft/msquic#2588 | #18151 |
| #17950 | ? | ? | 🗹 | 🗹 | d2i_X509 in highly threaded environment: 90-95% CPU | likely related to ex_data in lib_ctx or serialisation of decoders | #18151 ? |
| #14787 | 🗷 | 🗷 | 🗷 | 🗹 | CMAC 3× slower, EVP_MAC API 1.5× slower | faster than direct CMAC APIs already (see the EVP_MAC sketch below) | |
| #15199 | ? | 🗹 | 🗹 | 🗹 | PKCS8 decode 80× slower | decode is generally a lot slower in 3.0 but more flexible | #18151 ? |
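On the #14787 row, a hedged sketch of computing a CMAC through the 3.0 EVP_MAC interface, which is the API compared against the deprecated CMAC_CTX_new()/CMAC_Init()/CMAC_Update()/CMAC_Final() calls (the key, message and AES-128-CBC cipher choice below are illustrative, not taken from the issue):

```c
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/params.h>

int main(void)
{
    /* Illustrative key and message; not taken from the issue. */
    unsigned char key[16] = {0};
    unsigned char msg[64] = {0};
    unsigned char out[EVP_MAX_MD_SIZE];
    size_t outlen = 0;
    int ok = 0;

    EVP_MAC *mac = EVP_MAC_fetch(NULL, "CMAC", NULL);
    EVP_MAC_CTX *ctx = mac != NULL ? EVP_MAC_CTX_new(mac) : NULL;
    OSSL_PARAM params[] = {
        OSSL_PARAM_construct_utf8_string("cipher", "AES-128-CBC", 0),
        OSSL_PARAM_construct_end()
    };

    /* Fetch-init-update-final; the per-operation fetch/init overhead is
     * where the 3.0 regression shows up. */
    if (ctx != NULL
            && EVP_MAC_init(ctx, key, sizeof(key), params)
            && EVP_MAC_update(ctx, msg, sizeof(msg))
            && EVP_MAC_final(ctx, out, &outlen, sizeof(out)))
        ok = 1;

    if (ok) {
        for (size_t i = 0; i < outlen; i++)
            printf("%02x", out[i]);
        printf("\n");
    }

    EVP_MAC_CTX_free(ctx);
    EVP_MAC_free(mac);
    return ok ? 0 : 1;
}
```

The row's figures appear to compare how much slower each of those two paths is in 3.0.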

@paulidale
Contributor Author

paulidale commented Mar 7, 2022

Question for OTC: do any of the releases need to be changed?
If so, to what?

The question marks remain to be populated.

@t8m
Member

t8m commented Mar 9, 2022

Another serious performance regression: #17844

@paulidale
Contributor Author

Added that to the table.

@t8m
Member

t8m commented Mar 10, 2022

Dropped it again, as it wasn't an OpenSSL problem in the end.

@paulidale
Contributor Author

#17950 is encountering very high CPU load when using d2i_X509, likely related to cache conflicts when accessing ex_data in the lib ctx.
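For illustration, a rough sketch of that kind of workload (not taken from #17950; the thread count, iteration count and cert.der path are placeholders): many threads decoding the same DER certificate in parallel, which is where the suspected ex_data contention would be hit on every call.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <openssl/x509.h>

#define NTHREADS 32            /* placeholder thread count */
#define NITERS   100000        /* placeholder iteration count */

static unsigned char *der;     /* shared DER-encoded certificate */
static long der_len;

static void *decode_loop(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        const unsigned char *p = der;   /* d2i_X509 advances the pointer */
        X509 *x = d2i_X509(NULL, &p, der_len);
        X509_free(x);
    }
    return NULL;
}

int main(void)
{
    /* Load one DER certificate ("cert.der" is a placeholder path). */
    FILE *f = fopen("cert.der", "rb");
    if (f == NULL)
        return 1;
    fseek(f, 0, SEEK_END);
    der_len = ftell(f);
    rewind(f);
    der = malloc(der_len);
    if (der == NULL || fread(der, 1, der_len, f) != (size_t)der_len)
        return 1;
    fclose(f);

    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, decode_loop, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    free(der);
    return 0;
}
```

Building this against 1.1.1 and 3.x (cc repro.c -lcrypto -lpthread) and comparing the CPU profiles is the sort of comparison the issue describes.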

@mattcaswell mattcaswell removed the hold: need otc decision label Mar 29, 2022
@paulidale
Contributor Author

The OTC has discussed this; there is no longer a need for the hold.

@richsalz
Contributor

richsalz commented Jan 4, 2023

In view of the performance concerns, perhaps #15600 should be re-opened.

@richsalz
Contributor

richsalz commented Jan 4, 2023

Consider a configuration option that disallowed no-store.

@richsalz
Contributor

#20171 is a performance issue.

Is anyone on the team doing periodic sweeps through new issues and referencing them here?

@richsalz
Contributor

#20286 is a performance issue that should be linked here.

@mattcaswell mattcaswell self-assigned this Jan 8, 2024
@mattcaswell mattcaswell moved this from New to In progress in Project Board Jan 8, 2024
@davidben
Contributor

davidben commented Jan 8, 2024

The number of cores on the machine you're testing with likely also matters. Thread contention issues with highly parallel workloads are often only visible on machines that can support that level of parallelism.
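To make that concrete, a rough scaling check along those lines (a sketch, not from this thread; the digest choice, thread count and iteration count are arbitrary): run the same EVP digest loop across an increasing number of threads and see whether throughput keeps scaling.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <openssl/evp.h>

#define NITERS 200000          /* placeholder iteration count */

static void *digest_loop(void *arg)
{
    (void)arg;
    unsigned char buf[64] = {0}, md[EVP_MAX_MD_SIZE];
    unsigned int mdlen;

    for (int i = 0; i < NITERS; i++) {
        /* Each init fetches SHA-256 and touches shared library state;
         * contention only shows once enough cores run this loop at once. */
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);
        EVP_DigestUpdate(ctx, buf, sizeof(buf));
        EVP_DigestFinal_ex(ctx, md, &mdlen);
        EVP_MD_CTX_free(ctx);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 4;   /* vary this per run */
    pthread_t *tid = malloc(sizeof(*tid) * nthreads);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, digest_loop, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads: %.2f digests/sec\n",
           nthreads, nthreads * (double)NITERS / secs);
    free(tid);
    return 0;
}
```

If per-thread throughput collapses as the thread count approaches the core count on one build but not on another, internal contention rather than raw crypto speed is the likely culprit, and a machine with too few cores never reaches that regime.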

@mattcaswell
Member

There are 20 cores on my test machine (as opposed to 32 on @zh-jq's machine). I would have thought that 20 would be sufficient to demonstrate an effect... but the only way to know for sure would be for me to set up a proper isolated environment, which is quite a lot more work.

@zh-jq-b

zh-jq-b commented Jan 9, 2024

There was an infinite-loop bug in the G3 openssl wrapper, so I updated the graph today, but the only difference is that the final RPS goes from 5xxx to 6xxx.

There are 20 cores on my test machine (as opposed to 32 on @zh-jq's machine). I would have thought that 20 would be sufficient to demonstrate an effect... but the only way to know for sure would be for me to set up a proper isolated environment, which is quite a lot more work.

I also changed the g3tiles worker number to 20 and re-ran the test; the result is:

[graph: g3tiles re-run with 20 workers]

The graph is then a bit different and the final RPS is much better (from 6xxx to about 9000), but still far from ideal.
So the conclusion should be that some lock contention caused the poor performance on 32 cores.

@zh-jq-b

zh-jq-b commented Jan 9, 2024

What config parameters do you use to build OpenSSL?

The build logic is here: https://github.com/alexcrichton/openssl-src-rs/blob/main/src/lib.rs.
I will try to link against a dynamic build of libssl and test again once I can find some more time.

@mattcaswell
Member

Thanks to @quarckster, I now have a proper test environment for this consisting of 3 VMs, all with 16 cores (fewer than @zh-jq used, but this is what we had easily available). I've set things up as I understand @zh-jq has set things up, i.e. one VM for angie, one VM for g3tiles and one VM for g3bench.

The g3bench installation is linked against OpenSSL 1.1.1 in all tests. I vary the version of OpenSSL that g3tiles uses. For each concurrency level I repeated the test 5 times and took the average. I restarted the g3tiles server between each test run to ensure we have a consistent starting point.

Here is my graph:

[graph: g3tilesperf]

This shows some interesting properties:

  • 3.2 looks to be consistently a bit better than 3.0 as we might expect
  • 3.2 looks to be more-or-less in line with the 1.1.1 performance
  • master performance looks a bit odd, and I can't explain it. It seems to have poorer performance than 3.2 at lower concurrency levels, before later exceeding it (and 1.1.1) at higher concurrency levels

None of these look anything like @zh-jq's graph - and they do not show the drop off in performance for 3.x that is shown there. So, I also cannot explain this.

@zh-jq-b

zh-jq-b commented Jan 17, 2024

I re-ran the test against g3tiles with 16 and 12 workers; the graph:
[graph: g3tiles with 16 and 12 workers]

@mattcaswell The 12-worker curve is the same as yours, so the way to reproduce the problem is to add more cores to the g3tiles host.

The architecture of the AWS c6gn.8xlarge is arm64, but I don't think this makes too much difference.

@lukastribus

On the contrary, AFAIK architecture matters a lot; ARM (including Cortex-A76) is especially bad for RSA-based TLS handshaking:

See:
haproxy/haproxy#2169 (comment)

Some technical explanation:
haproxy/haproxy#2169 (comment)

Benchmarking:
haproxy/haproxy#2169 (comment)

@zh-jq-b

zh-jq-b commented Jan 17, 2024

I ran the test on an Intel 8260 x2 host today; you can see the result here:
https://bytedance.larkoffice.com/docx/NEqXdDyq8op8mwx0XTWcwgionnf

The graph is indeed different, but the 3.2 performance is still far from ideal.

@mattcaswell
Member

The graph is indeed different, but the 3.2 performance is still far from ideal.

How many cores does the g3tiles VM have in this setup?

@zh-jq-b

zh-jq-b commented Jan 17, 2024

How many cores does the g3tiles VM have in this setup?

It runs directly on the physical host with all of the 96 cores (96 workers, each bound to a core).

@mattcaswell
Member

It runs directly on the physical host with all of the 96 cores (96 workers, each bound to a core).

Are you doing something specific here to configure the number of workers? In my tests I haven't done anything to configure the number of workers - am I supposed to be?

@zh-jq-b

zh-jq-b commented Jan 18, 2024

Are you doing something specific here to configure the number of workers? In my tests I haven't done anything to configure the number of workers - am I supposed to be?

The config

worker: {}

will default to using all available cores as the worker count, so nothing needs to be done in your test.

On physical machines we use the following config:

worker:
  sched_affinity: true

to default to using all available cores and also bind each worker thread to a specific CPU.

@zh-jq-b

zh-jq-b commented Jan 18, 2024

I also added TLS 1.2 test results here: https://bytedance.larkoffice.com/docx/NEqXdDyq8op8mwx0XTWcwgionnf.
The OpenSSL 3.2 TLS 1.2 performance is also much better than TLS 1.3, but still much worse than OpenSSL 1.1 TLS 1.2.

@lukastribus

lukastribus commented Jan 27, 2024

Is the lack of available high-core-count ARM devices impeding the ability to troubleshoot at the OpenSSL end?

I'd guess that 3x dedicated ARM servers like Hetzner RX170 could be used for this (unless of course you reproduce with the exact AWS setup that @zh-jq-b uses).

Does the openssl project have the ability to rent those? Otherwise, would some company step up to rent those for the openssl project?

It is my understanding that the business impact of the openssl 3.x performance regressions is quite bad, so somebody ought to come up with the < 1000 bucks needed to rent them, if that is what's required.

@mattcaswell
Member

Is the lack of available high-core-count ARM devices impeding the ability to troubleshoot at the OpenSSL end?

I don't currently have access to this - but I am hoping to have such access soon.

Does the openssl project have the ability to rent those?

Yes. It's just taken me a while to organise it (partly due to some other competing priorities).

@nhorman
Contributor

nhorman commented May 23, 2024

Having read over this, we have a collection of issues here, some of which have been addressed (or at least closed), while others remain open. Given that I don't see a clear direction forward on this particular ticket, I'd like to propose that we close it and prioritize the remaining individual issues for addressing. Are there any objections?

@richsalz
Contributor

I would like it to remain open until the OMC or OTC decides, by vote, that the concerns are met.

Yes, this is a sprawling topic. The fact that team members, or at least one, do not see a way forward to address the overall concern raised by the user community does not justify closing it. If the vote happens, so be it.

Thank you for asking first.

@nhorman
Contributor

nhorman commented May 23, 2024

@richsalz I'm happy to oblige that request, but in its current state it's a request that can't be satisfied, at least not without additional input. This issue has no OTC/OMC hold label on it, nor is there a distinct question for either group to address. As such, we would be in a catch-22 situation.

We could, of course, add that hold if there were a specific question to answer, but as you note, this is a sprawling issue with no clear definition of purpose, so I am concerned that simply asking "are the concerns met?" is insufficient to drive this to resolution.

A better approach may be to enumerate all the issues captured here (we already have at least a partial list) and address them individually. I've already added the remaining issues on that list to our refinement board. If there are more, and we can enumerate them here, I'm happy to add those as well.

@richsalz
Contributor

If you close it, make sure to lock the issue and have the final comment be a pointer to the list.

Please note that I didn't say "satisfied," I said "concerns are met." "Addressed" would have been a better word.

@nhorman
Contributor

nhorman commented Jun 9, 2024

Marking as inactive, to be closed at the end of the 3.4 dev cycle, barring further input.

Link to the broken-out issue list: #17627 (comment)
