Add a matcher for partitioning services #3224

Closed
lawrencegripper opened this issue Apr 23, 2018 · 23 comments

Comments

@lawrencegripper
Contributor

Do you want to request a feature or report a bug?

Feature (I'll write it)

What did you do?

Service Fabric uses partitioning of services to improve scalability. I would like to add a matching rule which allows requests to be partitioned. A frontend would be created per partition and the matching rule would ensure requests are matched to the correct frontend based on the value of a hash function – allowing you to evenly distribute across n number of partitions. This would be useful to other providers, for example allowing requests to be partitioned across multiple container instances or services in Kubernetes.

Additional discussion: jjcollinge/traefik-on-service-fabric#45

Proposal

Add an additional matching rule to Traefik which enables a hashed range match, for example HashedRange: type:header value:x-partitionheader match:0-100 range:0-300. It would take an input and use a hashing algorithm to convert it to an integer evenly distributed over a range. In this case the full range would be 0-300, and the rule would match if the hashed value of the header x-partitionheader fell in the range 0-100.

This could be used to create 3 partitions with a KeyMin=0 and KeyMax=300 for example and distribute load between them:

  • Frontend for Partition 1 with matcher HashedRange: type:header value:x-partitionheader match:0-100 range:0-300
  • Frontend for Partition 2 with matcher HashedRange: type:header value:x-partitionheader match:100-200 range:0-300
  • Frontend for Partition 3 with matcher HashedRange: type:header value:x-partitionheader match:200-300 range:0-300
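A minimal sketch of how such a matcher might behave, written as standalone Go rather than against Traefik's actual matcher API. The `HashedRange` type, the `bucket` helper, the choice of FNV-1a, and the half-open interpretation of `match` (so that 0-100 and 100-200 do not overlap) are all assumptions made for illustration; the proposal does not specify them.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/http"
)

// HashedRange is a hypothetical matcher: it hashes a request header into
// [0, RangeMax) and matches when the result falls in [MatchMin, MatchMax).
type HashedRange struct {
	Header             string
	MatchMin, MatchMax uint32
	RangeMax           uint32 // must be non-zero
}

// bucket maps a key onto [0, rangeMax) using FNV-1a. The hash function is an
// assumption; the proposal does not name one.
func bucket(key string, rangeMax uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % rangeMax
}

// Match reports whether the request falls into this matcher's slice of the range.
// Requests without the header never match.
func (m HashedRange) Match(r *http.Request) bool {
	v := r.Header.Get(m.Header)
	if v == "" {
		return false
	}
	b := bucket(v, m.RangeMax)
	return b >= m.MatchMin && b < m.MatchMax
}

func main() {
	// The three partitions from the list above.
	partitions := []HashedRange{
		{"x-partitionheader", 0, 100, 300},
		{"x-partitionheader", 100, 200, 300},
		{"x-partitionheader", 200, 300, 300},
	}
	req, _ := http.NewRequest("GET", "http://example.com/", nil)
	req.Header.Set("x-partitionheader", "jamesnesbit")
	for i, p := range partitions {
		fmt.Printf("partition %d matches: %v\n", i+1, p.Match(req))
	}
}
```

With half-open ranges that tile 0-300, exactly one of the three frontends matches any request that carries the header.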

In addition to the type:header option, I would also look to add url-regex, which would match a section of the URL to hash.

I can think of more types but I think these two cover most use cases.

Example of url-regex type

URL: http://example.com/bob/?customerid=jamesnesbit
HashedRange: type:url-regex value:[=].* match:0-100 range:0-300

This would hash jamesnesbit and match if the result was in the range 0-100.

What did you expect to see?

The Service Fabric provider would query stateful services and create a frontend for each partition with the appropriate HashedRange matcher. Requests would then match the correct partition based on the value of their header or url-regex.
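As an illustration of what the provider might generate, the sketch below builds one rule string per partition by splitting the range evenly. This is an assumption for the sake of the example: a real provider would presumably read each partition's key range from Service Fabric rather than compute it, and `partitionRules` is a hypothetical helper, not existing Traefik code.

```go
package main

import "fmt"

// partitionRules returns one HashedRange rule per partition, splitting
// [0, rangeMax) into n contiguous slices. Integer division means slice
// sizes can differ by one when rangeMax is not a multiple of n.
func partitionRules(header string, rangeMax, n int) []string {
	if n <= 0 {
		return nil
	}
	rules := make([]string, 0, n)
	for i := 0; i < n; i++ {
		lo := i * rangeMax / n
		hi := (i + 1) * rangeMax / n
		rules = append(rules, fmt.Sprintf(
			"HashedRange: type:header value:%s match:%d-%d range:0-%d",
			header, lo, hi, rangeMax))
	}
	return rules
}

func main() {
	for _, r := range partitionRules("x-partitionheader", 300, 3) {
		fmt.Println(r)
	}
}
```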

What did you see instead?

I don't believe it's currently possible to achieve this behavior in Traefik.

CC: @jjcollinge

@lawrencegripper lawrencegripper changed the title from "Add a Matcher for Partitioned services" to "Add a matcher for partitioning services" Apr 23, 2018
@ldez ldez added the kind/proposal label (a proposal that needs to be discussed) Apr 27, 2018
@geraldcroes
Contributor

Hi, I'm quite new to the Service Fabric world so excuse my candor here.

Why can't you use the existing matchers in your use case? Like, a simple regex-based matcher?
Why do you need to hash the value?

To me, it looks like you're trying to implement a load balancing rule in the matcher.

Before taking any action, we need to fully understand your use case.

More specifically, what is a partitioned service? Can you give us some pointers, a diagram along with a use case?

Thanks for your help

@petertiedemann

petertiedemann commented Apr 27, 2018

@geraldcroes A partitioned service is typically a stateful service that has been broken into partitions (like shards for databases). Imagine you have customers A, B, C, D, and you decide to have 3 partitions.

The client only knows the customer; it does not know how A, B, C, D are allocated to partitions, but this information is required to route the request correctly. Without @lawrencegripper's feature, either the client must call somewhere else to get the partition, or a separate proxy service has to be set up.

Also see https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-concepts-partitioning .

FYI, the reason I am replying here is that we are starting to use Traefik in Service Fabric to replace the Azure APIM in our setup, and this is a bit of an annoyance (so there is a real-world "customer" here :)

@geraldcroes
Contributor

Thank you for your pointers, I'll read them right away.

I still don't get why the partitioning logic has to exist both in the stateful service and in Traefik (and why Traefik can't contact some kind of master that would handle the routing).

Also, one of my questions was, "why do you need to hash / compute the value and not use the value "as is" using a regexp?"

@jjcollinge
Contributor

jjcollinge commented Apr 27, 2018

The SF partitions have no knowledge of their fellow partitions, and each needs to be load balanced across individually; hence the appropriate partition endpoint must be resolved before the request is load balanced over the partition instances. There is an SF API that can resolve the request, but this would require a lookup from a piece of custom middleware for each request destined for a stateful partition (pick your poison?).

The need for a hash is to support range-based matching rather than direct string matching. Hashing ensures even distribution across the partitions, avoiding hot spots and making effective use of the underlying resources.
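To make the distribution point concrete, here is a small sketch that hashes a batch of synthetic, sequential customer IDs and counts how many land in each of three partitions. The IDs, the FNV-1a hash, and the simplification of reducing the hash directly modulo the partition count (rather than mapping to a range and then slicing it) are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionOf hashes key with FNV-1a (an assumed choice) and reduces it to a
// partition index in [0, partitions).
func partitionOf(key string, partitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitions))
}

// spread counts how many keys land in each partition.
func spread(keys []string, partitions int) []int {
	counts := make([]int, partitions)
	for _, k := range keys {
		counts[partitionOf(k, partitions)]++
	}
	return counts
}

func main() {
	// Sequential IDs share a common prefix, which makes them awkward to
	// split evenly with prefix-based or direct string matching.
	keys := make([]string, 3000)
	for i := range keys {
		keys[i] = fmt.Sprintf("customer-%04d", i)
	}
	fmt.Println(spread(keys, 3))
}
```

Running it prints the per-partition counts, which gives a quick way to compare candidate hash functions on realistic keys.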

@lawrencegripper
Contributor Author

lawrencegripper commented Apr 27, 2018

A similar approach is used by the Metaparticle project to handle this in Kubernetes - good doc with diagram. The doc explains how this approach is used, hopefully demonstrating that the label is useful in both SF and other orchestrators.

@geraldcroes
Contributor

A quick update -- I wasn't able to work on it last week but am now setting up an environment so I can test it and move forward.

@geraldcroes
Contributor

Another update -- I've dived into Service Fabric and now have a better grasp of the problem at hand.

To be completely honest, I was not familiar with the stateful services approach. Until now, I've always preferred the stateless one (a computing unit with an external persistence store).

That being said, I understand its value and the fact that it is an important feature (even if I have questions that I cannot yet find the answers to).

Allow us a bit more time to discuss it. We'll soon come back to you.

@lawrencegripper
Contributor Author

lawrencegripper commented May 11, 2018

Thanks for taking a look into the Service Fabric use case. I think the matcher also has broader usability for other systems too - as partitioning can be used for both scale and A/B testing.

For scale in large deployments

The Metaparticle link I shared provides a good example:

Sharding is useful because it ensures that only a small number of a containers handle any particular request. This in turn ensures that caches stay hot and if failures occur, they are limited to a small subset of users.

https://metaparticle.io/tutorials/dotnet-sharding/

For A/B testing

For example, you want to A/B test a new UI change. You want to expose the new version to a low number of users initially to understand how it affects engagement or error rates. To do this you treat deployments as immutable, keeping the old version deployed alongside the new version and sending a % of requests to the new version. The problem is that, once a user has the new UI, you don't want them to jump randomly between new and old versions between each request or device.

With the HashedRange matcher you can run your 2 deployments with the following labels:

  • New Version: HashedRange: type:header value:x-userid match:0-5 range:0-100
  • Old Version: HashedRange: type:header value:x-userid match:5-100 range:0-100

This would ensure that 5% of users are always directed to the new version, even if they disconnect/reconnect, log in on a different device or browser, etc. The same 5% of users (identified through the x-userid header) will always see the new deployment. This gives users a consistent experience during an A/B test and gives you a consistent test group.

This method may be preferable to sticky sessions (cookies) as, even if the user disconnects/reconnects, flushes cookies, or uses incognito mode or a different browser, they will always see the new deployment.
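The stickiness property can be sketched in a few lines of Go: because the decision depends only on the user ID, the same user gets the same answer on every request and device, with no cookie involved. The `percentile` and `variant` helpers and the FNV-1a hash are hypothetical choices for this sketch; the 0-5 and 5-100 ranges are read as half-open.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// percentile maps a user ID onto [0, 100) with FNV-1a (an assumed hash).
func percentile(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32() % 100
}

// variant returns "new" for user IDs whose percentile is below newPercent,
// and "old" otherwise. The result depends only on the user ID.
func variant(userID string, newPercent uint32) string {
	if percentile(userID) < newPercent {
		return "new"
	}
	return "old"
}

func main() {
	// "alice" appears twice to show that repeated requests get the same answer.
	for _, id := range []string{"alice", "bob", "alice"} {
		fmt.Printf("%s -> %s\n", id, variant(id, 5))
	}
}
```

Raising `newPercent` from 5 to 10 keeps every user who already saw the new version on it, since their percentile does not change.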

@geraldcroes
Contributor

Another update --

We've discussed the proposal heavily, and there is still some debate about whether Traefik should embed this feature. For my part, after having investigated the issue and its use cases, I'm convinced that it should be included (at some point).

There are some cons though.

  • The matcher would stand out as being more complicated than the others.
  • The matcher could have a serious impact on performance.
  • There should be more options to define the routes, maybe a chain system.
  • There should be support for custom algorithms to select the shards.
  • The feature is currently not being asked for by many (nor supported by a large community), even if I think that it would be welcome.

So for now, even if the team seems interested in the feature, it doesn't fully agree (yet) on the proposal.

Still, in the foreseeable future, Traefik will provide a feature that should enable users to customise and introduce the behaviour you're asking for.

In the meantime, I'll let maintainers take over and move forward.

@lawrencegripper
Contributor Author

Hi @geraldcroes, thanks for taking a look, and thanks to the wider team for the discussions. I appreciate the time and effort taken and agree with a number of the cons listed. In the interest of exploring all options, do any of the following give us a way forward?

  1. Move code under SF Provider

In the proposal I tried to make the matcher generic, making it work outside the Service Fabric provider. If it was specific to the SF provider code and located within it, would that change the team's view? I believe it would mitigate the impact on the wider Traefik codebase while still allowing Traefik to support the SF stateful service use case and leaving support and maintenance to the SF community.

We would need a way to add a matcher to the list from the provider code, like we do at the moment with the Application Insights hook for logrus here.

  2. Explore a plugin model

We ruled it out as go-plugin still doesn't support Windows. Would you be open to using something like hashicorp/go-plugin? I'd be happy to build a POC that creates an extension point with it, allowing plugins to register matchers. This approach would have an added benefit: many of the users of SF are .NET developers, so they could write their own partitioning matcher in .NET. I would want to benchmark it to ensure the RPC calls didn't introduce too much latency, but any such cost would be specific to SF users.

Let me know your thoughts.

@petertiedemann

@geraldcroes I definitely agree that it makes sense to support multiple sharding algorithms, but I am not sure why it would be considered so much more complicated than the other matchers or have that significant a performance impact?

You mention that it is not a very requested feature, but it is certainly a feature we would like where I work. Would it make any difference if we were a paying customer (I noticed you introduced commercial support)?

@geraldcroes
Contributor

@lawrencegripper We're discussing options, I'll keep you updated as soon as I can.

@ldez
Contributor

ldez commented Jun 1, 2018

Moving the code under the SF Provider doesn't look appealing because it would make it stand apart (even more than it currently does).

One of our goals is to offer a cohesive and straightforward API, whatever provider the users have chosen and we don't welcome the idea of proposing features here but not there.

Once again we understand that the feature would be welcome by the Service Fabric community, but unfortunately we're not yet ready to include it as is.

This is not the first time that plugin systems (or others) have come up into the discussion (see below for references), but even if we're working toward solutions that would make it possible, we're not ready yet, and by "yet", I mean that we're actively working on it.

It's never an easy thing to answer with "sorry, not yet," but this is all I can do for now.

@petertiedemann Whether you're a paying customer or not has not come up once in the debate. The only reason why we're postponing the proposal is that we truly are not ready yet.

We thank you once again for the proposal that we'll keep open, and regret to close the current pull request.

Rule #1 of open-source: no is temporary, yes is forever.

https://twitter.com/solomonstre/status/715277134978113536?lang=en

grpc plugin: #2362
go plugin: #1865
plugin: #1336

@lawrencegripper
Contributor Author

@ldez Really appreciate the response, thanks for taking a look at the alternatives I proposed and all the help with SF provider 👍

Let me know when/how things go with the plugin model and look forward to taking another crack at this in the future.

@petertiedemann

@ldez I only brought the support thing up because @geraldcroes said this functionality was not a much requested feature and not supported by a large community; I thought that having paying customers using the feature might help justify having to support it.

Without this feature we will either have to use a fork of Traefik, or write stateless proxies for our stateful services (luckily we only have a few). I haven't explored how paid support would work if using a fork, but I doubt it would work out well.

You guys really need plugin support :)

@lawrencegripper
Contributor Author

So I recently came across goja, an ECMAScript implementation in Go: https://github.com/dop251/goja. It's used by the k6 load testing project here: https://github.com/loadimpact/k6/blob/master/js/compiler/compiler_test.go

It would, in theory, allow us to have simple JS functions defining matching rules/middleware. These could be base64 encoded and set as labels on the services, then loaded and run dynamically, or provided to Traefik in the TOML.

We would need to run some tests to understand the impact on performance; my hope is that basic rules would be faster than out-of-process RPC-style plugin models.

@ldez If this sounds of interest I'd be happy to look at running some benchmarks.

@marshalYuan

@lawrencegripper I also want a dynamic matcher for A/B testing or service chaining, and go-lua was my original plan. But our engineers debated its performance. What about goja?

@lawrencegripper
Contributor Author

lawrencegripper commented Aug 1, 2018

I’m unclear on the perf as I haven’t run any benchmarks, but I’d be happy to do some testing if this is something that the Traefik team would consider merging, assuming it can meet performance goals.

@clazarr

clazarr commented Mar 3, 2019

As a potentially interested party seeking additional options to Azure APIM and custom coded API gateways, I was wondering if any progress has been made, since the original PR almost a year ago and more than 9 months since the "no, not yet" response, in supporting stateful services in Service Fabric with Traefik? Is there another path under development by the Traefik maintainers such as the JS functions as matching rules/middleware approach @lawrencegripper mentioned?

This is important to Traefik's integration with the SF platform since stateful service support is a major differentiator of the SF platform. In other words, without it, folks in my situation will likely look elsewhere.

@lawrencegripper
Contributor Author

There hasn't been any progress on this that I'm aware of as it's blocked by the availability of a plugin model to move this out of the tree.

@ldez's comment is a good summary of the situation. I understand that managing an OSS project which has lots of different users and supported platforms means some will not get everything they want.

On a related note, building OSS is hard, and people can expect a lot and sometimes unintentionally not appear grateful for the hard work of others. Please keep in mind that @ldez and the Traefik team have taken a lot of time to review, improve and maintain the Service Fabric provider.

@clazarr

clazarr commented Mar 4, 2019

On a related note, building OSS is hard, and people can expect a lot...

I appreciate the response and status update. I agree that successful OSS projects are built upon lots of hard work. Contributions of ideas and code from the community help further that success. I appreciate the Traefik team's specific vision for the right way to evolve the project. It's understandable that there has been much interest in an extensibility model to allow additional functionality or leverage platform capabilities (e.g. Service Fabric and others) for a long time. I think we're all just trying to move things forward and address significant functional requirements / use cases.

@aantono
Contributor

aantono commented Mar 15, 2019

For what it’s worth, I’ve been prototyping various embeddable interpreters like go-lua, goja, etc. So far their performance hasn’t been great (worse than the previous attempts with gRPC or HashiCorp go-plugin). I have got good results with https://github.com/d5/tengo, so I will try to make a PR for the Traefik folks to consider.

@nmengin
Contributor

nmengin commented Feb 8, 2024

Hello,

This proposal targets Traefik v1, which is not supported anymore.
I am closing the issue accordingly.

We'll re-open it later if necessary.

@nmengin nmengin closed this as completed Feb 8, 2024
@traefik traefik locked and limited conversation to collaborators Mar 10, 2024

10 participants