
Non-blocking ConcurrentDictionary #50337

Closed · wants to merge 12 commits

Conversation

VSadov
Member

@VSadov VSadov commented Mar 28, 2021

Not done yet. In progress.

  • initial port
  • run existing tests, make sure all pass.
  • port some additional tests from NonBlocking repo.
  • do some cleanup, maybe add more comments.

@ghost

ghost commented Mar 28, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

Issue Details

Not done yet. In progress.

  • initial port
  • run existing tests, make sure all pass.
  • add more tests
  • do some cleanup, maybe add more comments.
Author: VSadov
Assignees: -
Labels:

area-System.Collections

Milestone: -

@VSadov VSadov changed the title Non blocking ConcurrentDictionary Non-blocking ConcurrentDictionary Mar 28, 2021
@danmoseley
Member

@AArnott does VS have a benchmark that stresses ConcurrentDictionary? Or perhaps a usage pattern we should be sure to measure.

@stephentoub
Member

Thanks for making progress on this! I remember talking to you about trying to replace the guts of ConcurrentDictionary with your implementation over six years ago 😄

I'll look forward to reviewing the implementation. In the meantime, there are some benchmarks in dotnet/performance. However, from a performance perspective, the most important thing is that reads are as fast or faster, and then obviously the purpose of such an implementation would be to improve the scalability of updates, so that should be proven, while also not meaningfully regressing at lower core counts. It's also relatively common to create ConcurrentDictionary instances even if they're not heavily used, so measuring just the raw overhead of constructing an instance (including allocations) is of interest.

@VSadov
Member Author

VSadov commented Mar 29, 2021

Thanks for making progress on this! I remember talking to you about trying to replace the guts of ConcurrentDictionary with your implementation over six years ago 😄

I was not happy with the guarantee on when dead keys can be GC-ed (eventually, maybe, if/when a resize happens).
That works fine for a third-party component, since the user has a choice and often does not care. It would not seem right for a default platform implementation, though. For a while I was not sure it was even possible to have better guarantees without significantly degrading perf or lock freedom.

@davidfowl
Member

What are the key differences here? What does non blocking mean? No locks?

@VSadov
Member Author

VSadov commented Mar 29, 2021

@davidfowl - it is basically lock-free, without ambiguities like "are spin-locks ok?".
Imagine the OS can preempt a thread at a random location for an hour. Non-blocking means the rest of the system still makes progress.
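A minimal sketch (not code from this PR) of what "non-blocking" buys: in a CAS-retry design a preempted thread holds no lock, so every other thread's operation still completes. `LockFreeCounter` here is a hypothetical illustration, not the dictionary's implementation.

```csharp
using System.Threading;

public class LockFreeCounter
{
    private int _value;

    // With a lock, a preempted lock holder blocks everyone else.
    // With a CAS loop, a preempted thread holds nothing; a failed
    // CompareExchange just means some other thread committed first.
    public void Increment()
    {
        int observed;
        do
        {
            observed = Volatile.Read(ref _value);
            // Retry if another thread changed _value between the read and the CAS.
        } while (Interlocked.CompareExchange(ref _value, observed + 1, observed) != observed);
    }

    public int Value => Volatile.Read(ref _value);
}
```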

@AArnott
Contributor

AArnott commented Mar 29, 2021

@danmoseley Jeff Robison's file watcher service in VS makes heavy concurrent use of ConcurrentDictionary I believe. I suspect he has stress tests for it. Do you want to reach out to him?

@danmoseley
Member

@VSadov would that be useful?

@stephentoub
Member

You don't need to go far to find uses of ConcurrentDictionary. Every socket operation on Linux performs a read on a ConcurrentDictionary (for a mapping from socket file descriptor to corresponding context). Every HTTP request on an HttpClient/SocketsHttpHandler performs a read on a ConcurrentDictionary (for the connection pool). Every static method on Regex reads a ConcurrentDictionary (for the regex cache). Every serialization with JsonSerializer reads a ConcurrentDictionary (to find the previously catalogued data for the target type). Etc.

@Tornhoof
Contributor

As for benchmarks, you could take a look at the code provided by Microsoft FASTER. They have perf tests around different ConcurrentDictionary use cases:
https://github.com/Microsoft/FASTER/wiki/Performance-of-FASTER-in-C%23
and https://github.com/microsoft/FASTER/blob/master/cs/benchmark/Program.cs

@@ -109,5 +109,19 @@ public static uint FastMod(uint value, uint divisor, ulong multiplier)
Debug.Assert(highbits == value % divisor);
return highbits;
}

// returns 2^x >= size
public static int AlignToPowerOfTwo(int size)
Member


Should use

internal static uint RoundUpToPowerOf2(uint value)
?
(#43135 for reference).
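For context, a hedged sketch of the classic bit-smearing fallback that `BitOperations.RoundUpToPowerOf2` (added for #43135) replaces; the class and method names here are illustrative, not this PR's code.

```csharp
public static class PowerOfTwo
{
    // Round size up to the nearest power of two by smearing the highest
    // set bit into all lower positions, then adding one.
    // Equivalent to System.Numerics.BitOperations.RoundUpToPowerOf2(size),
    // which uses LZCNT where the hardware supports it.
    public static uint AlignToPowerOfTwo(uint size)
    {
        --size;              // so exact powers of two map to themselves
        size |= size >> 1;
        size |= size >> 2;
        size |= size >> 4;
        size |= size >> 8;
        size |= size >> 16;
        return size + 1;
    }
}
```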

Member Author


@gfoidl This was written a long time ago. When porting it over I suspected we would have an LZCNT-based helper for this by now, but did not have time to look it up yet. Thanks!!

@VSadov
Member Author

VSadov commented Apr 4, 2021

I have run concurrent dictionary benchmarks from dotnet/performance.

These benchmarks for the most part test the throughput of singlethreaded operations on a small 512-element dictionary. This is challenging for an implementation designed to scale, since the extra complexity that improves scalability brings little benefit in such scenarios.
With that in mind I think the nonblocking implementation performed pretty well. There are some regressions, but generally within 20%.

https://gist.githubusercontent.com/VSadov/8c8a7ab5a5380d5cc32c870af1162de4/raw/491434ade3075abaddac5eb60997f3b6cb3563af/bench01_512.txt

Here is also the same set of tests with a larger 512K dictionary size. The nonblocking implementation fares better here. The workload is still mostly singlethreaded.

https://gist.githubusercontent.com/VSadov/c9c1c52fbbcec532774a1ab5f749478f/raw/e5eee09793642a2fd0a554053a772b07b6de7207/bench01_512K.txt

Few things to note:

  • baseline is running with a nondefault concurrency level of 4. I guess this is for results to be more comparable between machines, or to optimize a bit more for low thread counts.
    The nonblocking implementation scales as needed and does not have a concept of concurrency level.

  • the regression in the CtorGivenSize tests is expected and ByDesign. A closed-hash (open addressing) implementation allocates the space for its cells upfront. Once the dictionary is filled to capacity it is expected to amortize this cost and likely get ahead. A test that preallocates a dictionary of a certain capacity without actually filling it will not see the second part.
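A hypothetical sketch (not this PR's actual layout) of why an open-addressing table pays its allocation cost at construction: key/value cells live inline in one array, so the whole backing store is allocated up front, whereas a chaining design can defer per-entry node allocations until inserts happen.

```csharp
// Illustrative cell layout: hash (or an empty/tombstone marker) plus
// the key and value stored inline, with no per-entry node objects.
public struct Cell<TKey, TValue>
{
    public int HashOrState;
    public TKey Key;
    public TValue Value;
}

public class OpenAddressingTable<TKey, TValue>
{
    private readonly Cell<TKey, TValue>[] _cells;

    public OpenAddressingTable(int capacity)
    {
        // The entire backing store is paid for here, even if the
        // table is never filled - hence the CtorGivenSize regression.
        _cells = new Cell<TKey, TValue>[capacity];
    }

    public int Capacity => _cells.Length;
}
```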

@danmoseley
Member

danmoseley commented Apr 5, 2021

These benchmarks for the most part test throughput of singlethreaded operations with a small 512-element dictionary.

Do you want to first augment those with some others, which perhaps are more realistic?

@stephentoub
Member

stephentoub commented Apr 5, 2021

The values that concern me the most are the ones for reading on smaller dictionaries, e.g. TryGetValue / Contains / etc. Some of these appear to have taken huge hits, with ratios like 1.42, 1.43, 1.26, etc. And such reading on a ConcurrentDictionary today scales well (in terms of concurrency) and is wait-free. It's also one of the most heavily exercised code paths: while some CDs are indeed about purely adding concurrently and building up a large dictionary as a result of some processing, the most common case is sporadic adds but heavy reads, e.g. for a cache.

@iSazonov
Contributor

iSazonov commented Apr 5, 2021

It is hardly possible for the same code to work equally well in all scenarios. Can we think of "strategies" similar to those that are implemented for FileStream?

the most common case is sporadic adds but heavy reads, e.g. for a cache.

What if we implement a separate highly optimized class for this most popular case?

(In the PowerShell repo we have an interesting case related to ConcurrentDictionary performance. Normally users never use many breakpoints in scripts, but Pester sets breakpoints on every line in the code coverage scenario. And this works very slowly. It would be interesting to know whether this PR will solve this scenario or we will still have to accept PowerShell/PowerShell#14953)

@VSadov
Member Author

VSadov commented Apr 5, 2021

@stephentoub Right. The baseline implementation is good at the Get/ContainsKey scenario with small dictionaries, and in particular with small int->int dictionaries. These are among the easiest and fastest scenarios for both implementations. Since the advantage does not hold as size increases, the reason for the differences is likely computational (not the memory access patterns).

An extreme case would be reading from a one-element int->int dictionary.
I wonder if something can be gained by examining the JIT'ed code.

Ultimately, having pauseless rehashes requires a "read barrier", so some small cost will be present on the read path due to that.
At larger scale the cost seems to be absorbed by other gains, but the "easy" case is tough, primarily because it is "easy", thus offering little opportunity to save on something.

@VSadov
Member Author

VSadov commented May 21, 2021

A brief update.
I have made a few changes to improve performance a bit. Nothing architectural, just saving a few cycles here and there. The new results are:

512 elements: https://gist.githubusercontent.com/VSadov/4cdda8fedd9b33b857018b58849b1a23/raw/88b312a2ca0c797201cfad1555a314eb2fc7f641/bench01_512.txt

512K elements: https://gist.githubusercontent.com/VSadov/17ce56428cd6c76818ede817737d4034/raw/1cd7d4f83e5d7550d163678c02d590cb041ea201/bench01_512K.txt

For 512K elements everything is generally faster, except foreach, which may not be very interesting.
For 512 elements, with an int->int dictionary, a successful get is 21% slower and a missing get is 32% slower.

I have doubts that the Get scenario could be further improved in a significant way. Ultimately there is the extra cost of the “read barrier”, and there are open-hash/closed-hash differences that make some scenarios faster and some slower.
(e.g. a reference-type TValue is faster since the value lives directly in the table with no extra space or indirection)

As with any hashtable there are ways to shift the cost from one scenario to another. Tweaking the resize heuristics, for example, can improve perf by making the table less occupied and reducing the average reprobe count, at the cost of consuming more memory. There are ways to improve the performance of lookup misses at the cost of lookup hits. I tried a few changes like that, but it felt too much like fitting a particular benchmark, and the overall tradeoffs did not seem good.

I also tried running some TechEmpower benchmarks (plaintext, platform.json with various numbers of connections, on Linux), but I did not see differences that would be consistent and reproducible. Even though sockets use a concurrent dictionary, it does not seem to be a big enough use to show up in end-to-end results.

I need to think about what all this means, but it seems to me that if Get on a small int->int dictionary is the bar, the new implementation would not meet it.

@sharwell
Member

Imagine OS can preempt a thread at a random location for an hour. Non-blocking means the rest of the system still makes progress.

For me, this sounds like "wait free", as a strict subset of "lock free".

/// Returns the approximate value of the counter at the time of the call.
/// </summary>
/// <remarks>
/// EstimatedValue could be significantly cheaper to obtain, but may be slightly delayed.
Member

@sharwell sharwell May 21, 2021


❔ Is 'stale' a better word choice here? Concern is the reader could interpret "slightly delayed" to mean "slower to return". Stale has its own downsides (e.g. is it eventually consistent?) so would leave the call to y'all.

Suggested change
/// EstimatedValue could be significantly cheaper to obtain, but may be slightly delayed.
/// EstimatedValue could be significantly cheaper to obtain, but may be stale.

Comment on lines 35 to 44
if (IntPtr.Size == 4)
{
uint addr = (uint)&cellCount;
return (int)(addr % cellCount);
}
else
{
ulong addr = (ulong)&cellCount;
return (int)(addr % cellCount);
}
Member


❔ Can this use nuint now?

Member Author


Strangely enough, (nuint)&cellCount; is an error in C#.

Member Author


(nuint)(&cellCount); works though, so I will use that. Thanks!
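For illustration, a hedged sketch of the cast quirk discussed above; `HomeIndex` is a hypothetical name for exposition, not code from this PR, and requires an unsafe context.

```csharp
public static class AddrDemo
{
    // Maps the stack address of a local to a cell index, as in the
    // reviewed snippet, using nuint instead of the IntPtr.Size branch.
    public unsafe static int HomeIndex(int cellCount)
    {
        // nuint addr = (nuint)&cellCount;   // does not compile
        nuint addr = (nuint)(&cellCount);    // the parenthesized form works
        return (int)(addr % (nuint)cellCount);
    }
}
```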

Member Author

@VSadov VSadov May 22, 2021


Also, FastMod could be used here, although it is cumbersome since it needs two values that must match and thus be updated atomically.

@VSadov
Member Author

VSadov commented May 22, 2021

@sharwell "Wait-free" is generally understood as non-blocking with an upper bound on the number of steps per action. So if threads A and B collide, in the non-blocking case thread A can simply retry, possibly after helping B – to make sure we are not in the same situation again after going around. With wait-free, we would also need to ensure it is not the same thread A that keeps pulling the short straw. That can be arranged by adding a turn/ticket system and maybe some roll-back ability, so that B could yield even if it has made more progress than A.

In practice the wait-free guarantee is rarely necessary. If collisions are rare, or it can be ensured that they are rare – by doing exponential back-off, for example – then there is typically more than enough "long-term fairness" in the system without explicit guarantees.
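As an illustration of the distinction (a sketch, not code from this PR): a Treiber-style stack is lock-free but not wait-free. A failed CAS proves some other thread committed, so the system as a whole makes progress, yet there is no bound on how many times one unlucky thread retries.

```csharp
using System.Threading;

public class LockFreeStack<T>
{
    private sealed class Node
    {
        public readonly T Value;
        public Node? Next;
        public Node(T value) => Value = value;
    }

    private Node? _head;

    public void Push(T value)
    {
        var node = new Node(value);
        while (true)
        {
            Node? observed = Volatile.Read(ref _head);
            node.Next = observed;
            // A failed CAS means a concurrent Push/Pop committed: lock-free.
            // This thread may retry indefinitely, though: not wait-free.
            if (Interlocked.CompareExchange(ref _head, node, observed) == observed)
                return;
        }
    }

    public bool TryPop(out T? value)
    {
        while (true)
        {
            Node? observed = Volatile.Read(ref _head);
            if (observed is null)
            {
                value = default;
                return false;
            }
            if (Interlocked.CompareExchange(ref _head, observed.Next, observed) == observed)
            {
                value = observed.Value;
                return true;
            }
        }
    }
}
```

(The GC makes the usual ABA hazard of Treiber stacks a non-issue in C#, since a node's address cannot be reused while any thread still references it.)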

@ghost ghost closed this Jun 21, 2021
@ghost

ghost commented Jun 21, 2021

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 21, 2021
This pull request was closed.

9 participants