I think at some point there were more usages in the tests. But now it
is pretty easy.
PriorityLb.ChildLbState.picker is initialized to
FixedResultPicker(NoResult). So now that GracefulSwitchLb is using the
same picker, equals() is able to de-dup an update.
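A simplified sketch of why sharing the picker enables de-duping: a picker with value equality, modeled on `FixedResultPicker` (details assumed):

```java
import io.grpc.LoadBalancer.PickResult;
import io.grpc.LoadBalancer.PickSubchannelArgs;
import io.grpc.LoadBalancer.SubchannelPicker;

// Two pickers wrapping equal results compare equal, so a repeated
// updateBalancingState() with an identical picker can be skipped.
final class FixedResultPicker extends SubchannelPicker {
  private final PickResult result;

  FixedResultPicker(PickResult result) {
    this.result = result;
  }

  @Override
  public PickResult pickSubchannel(PickSubchannelArgs args) {
    return result;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof FixedResultPicker
        && result.equals(((FixedResultPicker) o).result);
  }

  @Override
  public int hashCode() {
    return result.hashCode();
  }
}
```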
Previously it would wait for the new LB to enter READY. However, that
prevented there from being an upper bound on how long the old policy would
continue to be used. The point of graceful switch is to avoid RPCs
seeing increased latency when we swap config. We don't want it to
prevent the system from becoming eventually consistent.
Some clients watching health status can cancel their watch, and `HealthService` was getting a CANCELLED exception when trying to notify those watchers because no cancellation handler was set on the `StreamObserver`. This change sets a cancellation handler that removes the watcher from the set of watchers to be notified of the health status.
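A rough sketch of the fix (the watcher bookkeeping names are hypothetical; `setOnCancelHandler` is the real gRPC API):

```java
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class WatcherSet<T> {
  private final Set<StreamObserver<T>> watchers = ConcurrentHashMap.newKeySet();

  void add(StreamObserver<T> responseObserver) {
    ServerCallStreamObserver<T> serverObserver =
        (ServerCallStreamObserver<T>) responseObserver;
    // Without this handler, onNext() on a cancelled watcher throws a
    // CANCELLED StatusRuntimeException.
    serverObserver.setOnCancelHandler(() -> watchers.remove(responseObserver));
    watchers.add(responseObserver);
  }

  void broadcast(T update) {
    for (StreamObserver<T> watcher : watchers) {
      watcher.onNext(update);
    }
  }
}
```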
panic() calls a good amount of code, so it could get another exception.
The SynchronizationContext is running on an arbitrary thread and we
don't want to propagate this secondary exception up its stack (to be
handled by its UncaughtExceptionHandler); if we wanted that, we'd
propagate the original exception.
This second exception will only be seen in the logs; the first exception
was logged and will be used to fail RPCs.
Also related to http://yaqs/8493785598685872128 and b692b9d26
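A sketch of the containment (structure assumed, not the actual code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class PanicGuard {
  private static final Logger logger = Logger.getLogger(PanicGuard.class.getName());

  void handleUncaught(Throwable originalException, Runnable panic) {
    try {
      panic.run(); // panic() runs a good amount of code and may itself throw
    } catch (Throwable secondary) {
      // The original exception was already logged and will fail RPCs; the
      // secondary one is only visible in the logs and is not propagated up
      // the SynchronizationContext's arbitrary calling thread.
      logger.log(Level.SEVERE, "Exception while panicking", secondary);
    }
  }
}
```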
This is to support gRFC A83 xDS GCP Authentication Filter:
> Otherwise, the filter will look in the CDS resource's metadata for a
> key corresponding to the filter's instance name.
It is much harder to debug refcounting problems when we ignore
impossible situations. So make such impossible cases complain loudly so
the bug is obvious.
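An illustrative sketch of the idea, with hypothetical names:

```java
final class RefCounted {
  private int refCount;

  void retain() {
    refCount++;
  }

  void release() {
    // Previously an "impossible" extra release could be silently ignored,
    // hiding the bug. Complain loudly instead.
    if (refCount <= 0) {
      throw new IllegalStateException("refcount underflow: " + refCount);
    }
    if (--refCount == 0) {
      close();
    }
  }

  private void close() {}
}
```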
This allows Filters to access the xds configuration for their own
processing. From gRFC A83:
> This data is available via the XdsConfig attribute introduced in A74.
> If the xDS ConfigSelector is not already passing that attribute to the
> filters, it will need to be changed to do so.
Contributes to the gRFC A74 effort.
https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md
The alternative to using Mockito's ArgumentMatcher is to use Hamcrest.
However, Hamcrest did not impress me. ArgumentMatcher is trivial if you
don't care about the error message.
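For example, a lambda-based ArgumentMatcher against a hypothetical interface:

```java
import static org.mockito.ArgumentMatchers.argThat;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import java.util.List;

class ArgumentMatcherExample {
  interface ClusterWatcher {
    void onUpdate(List<String> clusters);
  }

  void example() {
    ClusterWatcher watcher = mock(ClusterWatcher.class);
    watcher.onUpdate(List.of("cluster-a", "cluster-b"));
    // Trivial to write; the trade-off is that a failure only prints the
    // lambda's toString() rather than a descriptive message.
    verify(watcher).onUpdate(argThat(clusters -> clusters.contains("cluster-a")));
  }
}
```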
This fixes a pre-existing issue where ConfigSelector.releaseCluster
could revert the LB config back to using the cluster manager after all
RPCs using a cluster have committed.
Co-authored-by: Larry Safran <lsafran@google.com>
The logs are confusing/harder to read when the
activations/deactivations caused by the config happen before the log
entry describing the new config.
Listener2.onResult() doesn't require running in the sync context, so
when called from the sync context it is guaranteed not to do its
processing immediately (instead, it schedules work into the sync
context).
The code was doing an update dance: 1) update service config to add new
cluster, 2) update config selector to use new cluster, 3) update service
config to remove old clusters. But the onResult() wasn't being processed
immediately, so the actual execution order was 2, 1, 3 which has a small
window where RPCs will fail. But onResult2() does run immediately. And
since ca4819ac6, updateBalancingState() updates the picker immediately.
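A sketch of the intended dance, with hypothetical helpers standing in for the real plumbing:

```java
import java.util.ArrayList;
import java.util.List;

final class ClusterSwap {
  void swapTo(String newCluster, List<String> oldClusters) {
    List<String> both = new ArrayList<>(oldClusters);
    both.add(newCluster);
    updateServiceConfig(both);                // 1) add the new cluster
    updateConfigSelector(newCluster);         // 2) route RPCs to it
    updateServiceConfig(List.of(newCluster)); // 3) remove the old clusters
    // With onResult(), step 2's processing was queued into the sync context
    // and could land before step 1, so RPCs could briefly select a cluster
    // the service config didn't contain. onResult2() processes inline.
  }

  void updateServiceConfig(List<String> clusters) {}

  void updateConfigSelector(String cluster) {}
}
```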
cleanUpRoutes() was also racy because it updated the routingConfig
before swapping to the new config selector, so RPCs could fail saying
there was no route instead of the useful error message. Even with the
opposite order, some RPCs may be executing the while loop of
selectConfig(), trying to acquire a cluster. The code unreffed the
clusters before updating the routingConfig, so those RPCs could go into
a tight loop until the routingConfig was updated. Also, once the
routingConfig was updated to EMPTY those RPCs would similarly
see the wrong error message. To give the correct error message,
selectConfig() must fail such RPCs directly, and once it can do that
there's no need to stop using the config selector in error cases. This
has the benefit of fewer moving parts and more consistent threading
among cases.
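A simplified sketch of the resulting shape of selectConfig(), with assumed types and names:

```java
import io.grpc.Status;

// The config selector stays installed even in error cases, and
// selectConfig() fails RPCs directly with the useful status instead of a
// generic "no route" error.
final class SelectorSketch {
  static final class RoutingConfig {
    final Status errorStatus; // set once routes are cleaned up
    final String cluster;     // null when no route matches
    RoutingConfig(Status errorStatus, String cluster) {
      this.errorStatus = errorStatus;
      this.cluster = cluster;
    }
  }

  volatile RoutingConfig routingConfig = new RoutingConfig(null, "cluster-a");

  Status selectConfig() {
    while (true) {
      RoutingConfig config = routingConfig;
      if (config.errorStatus != null) {
        return config.errorStatus; // the correct error message, directly
      }
      if (config.cluster == null) {
        return Status.UNAVAILABLE.withDescription("Could not find route");
      }
      if (tryAcquireClusterRef(config.cluster)) {
        return Status.OK; // proceed with this cluster
      }
      // The cluster ref was released concurrently; re-read routingConfig,
      // which is updated before refs are released, avoiding a tight loop.
    }
  }

  private boolean tryAcquireClusterRef(String cluster) {
    return true; // stand-in for refcounted cluster acquisition
  }
}
```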
The added test was able to detect the race 2% of the time. The slower
the code/machine, the more reliably the test failed. ca4819ac6 along
with this commit reduced it to 0 failures in 1000 runs.
Discovered when investigating b/394850611
ffcc360ba adjusted updateBalancingState() to require being run within
the sync context. However, it still queued the work into the sync
context, which was unnecessary. Re-entering the sync context
unnecessarily delays the new state from being used.
This PR adds support for filter state retention in Java. The mechanism
will be similar to the one described in
[A83](https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.
### Filter instance lifecycle
#### xDS gRPC clients
New filter instances are created per combination of:
1. `XdsNameResolver` instance,
2. Filter name+typeUrl as configured in
HttpConnectionManager (HCM) http_filters.
Existing client-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains an
  HCM that is missing the filter configuration for the name+typeUrl
  combination of this instance.
- All filter instances are shut down when the watched LDS resource is
  missing from an LDS update.
- All filter instances are shut down on name resolver shutdown.
#### xDS-enabled gRPC servers
New filter instances are created per combination of:
1. Server instance,
2. FilterChain name,
3. Filter name+typeUrl as configured in FilterChain's HCM.http_filters.
Filter instances of the default filter chain are tracked separately per:
1. Server instance,
2. Filter name+typeUrl in default_filter_chain's HCM.http_filters.
Existing server-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains a
  FilterChain with HCM.http_filters that is missing the configuration
  for the filter name+typeUrl.
- All filter instances associated with a FilterChain are shut down when
  an LDS update no longer contains that FilterChain's name.
- All filter instances are shut down when the watched LDS resource is
  missing from an LDS update.
- All filter instances are shut down on server shutdown.
### Related
- Part 1: #11883
`XdsServerWrapper#generatePerRouteInterceptors` was always intended
to be executed within a sync context. This PR ensures that by calling
`syncContext.throwIfNotInThisSynchronizationContext()`.
This change is needed for upcoming xDS filter state retention because
the new tests in XdsServerWrapperTest flake with this NPE:
> `Cannot invoke "io.grpc.xds.client.XdsClient$ResourceWatcher.onChanged(io.grpc.xds.client.XdsClient$ResourceUpdate)" because "this.ldsWatcher" is null`
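A minimal illustration of the guard; the surrounding method and setup are hypothetical, but `throwIfNotInThisSynchronizationContext()` is the real API:

```java
import io.grpc.SynchronizationContext;

final class SyncGuarded {
  private final SynchronizationContext syncContext =
      new SynchronizationContext((thread, error) -> {
        throw new RuntimeException(error);
      });

  void generatePerRouteInterceptors() {
    // Throws IllegalStateException when not already running inside
    // syncContext, turning a silent race into an obvious failure.
    syncContext.throwIfNotInThisSynchronizationContext();
    // ... read and combine state owned by syncContext
  }
}
```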
We want to move away from handleResolvedAddresses(). These are "easy" in
that they need no logic. LBs extending ForwardingLoadBalancer had the
method duplicated from handleResolvedAddresses() and swapped away from
`super` because ForwardingLoadBalancer only forwards
handleResolvedAddresses() reliably today. Duplicating small methods was
less bug-prone than dealing with ForwardingLoadBalancer.
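The duplicated method is tiny; an illustrative subclass (assuming the Status-returning `acceptResolvedAddresses()` used in this codebase):

```java
import io.grpc.LoadBalancer;
import io.grpc.Status;
import io.grpc.util.ForwardingLoadBalancer;

// Each affected LB duplicates this small method rather than relying on
// `super`, since ForwardingLoadBalancer only forwards
// handleResolvedAddresses() reliably today.
abstract class ExampleLb extends ForwardingLoadBalancer {
  @Override
  public Status acceptResolvedAddresses(LoadBalancer.ResolvedAddresses resolvedAddresses) {
    return delegate().acceptResolvedAddresses(resolvedAddresses);
  }
}
```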
Implements [gRFC A76: explicitly setting the request hash key for the
ring hash LB policy][A76]
* Explicitly setting the request hash key is guarded by the
`GRPC_EXPERIMENTAL_RING_HASH_SET_REQUEST_HASH_KEY` environment
variable until the API is stabilized.
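A sketch of the pick-time behavior described in A76, with assumed names (a stand-in hash replaces the ring hash's real hash function):

```java
import io.grpc.Metadata;
import java.util.concurrent.ThreadLocalRandom;

final class RequestHashSketch {
  // Hash the configured request header when present; otherwise generate a
  // random hash per pick so RPCs spread uniformly across the ring.
  static long requestHash(Metadata headers, String requestHashHeader) {
    Metadata.Key<String> key =
        Metadata.Key.of(requestHashHeader, Metadata.ASCII_STRING_MARSHALLER);
    String value = headers.get(key);
    if (value != null) {
      return value.hashCode(); // stand-in for the real hash function
    }
    return ThreadLocalRandom.current().nextLong();
  }
}
```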
Tested:
* Verified end-to-end by spinning up multiple gRPC servers and a gRPC
client that injects a custom service (load balancing) config with
`ring_hash_experimental` and a custom `request_hash_header` (with
NO associated value in the metadata headers) which generates a random
hash for each request to the ring hash LB. Verified picks/RPCs are
split evenly/uniformly across all backends.
* Ran affected unit tests with thread sanitizer and 1000 iterations to
check for data races.
[A76]: https://github.com/grpc/proposal/blob/master/A76-ring-hash-improvements.md#explicitly-setting-the-request-hash-key
This is the first step towards supporting filter state retention in
Java. The mechanism will be similar to the one described in
[A83](https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.
In Java, xDS HTTP filters are backed by classes implementing
`io.grpc.xds.Filter`, from here just called "Filters". To support
Filter state retention (next PR), Java's xDS implementation must be
able to create unique Filter instances per:
- HCM
  `envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager`
- Filter name as specified in
  `envoy.extensions.filters.network.http_connection_manager.v3.HttpFilter.name`
This PR **does not** implement Filter state retention, but lays the
groundwork for it by changing how filters are registered and
instantiated. To achieve this, all existing Filter classes had to be
updated to the new instantiation mechanism described below.
Prior to this PR, Filters had no lifecycle. FilterRegistry
provided singleton instances for a given typeUrl. This PR introduces
a new interface `Filter.Provider`, which instantiates Filter classes.
All functionality that doesn't need an instance of a Filter is moved
to the Filter.Provider. This includes parsing filter config proto
into FilterConfig and determining the filter kind
(client-side, server-side, or both).
This PR is limited to refactoring, and there are no changes to the
existing behavior. Note that all Filter Providers still return
singleton Filter instances. However, with this PR, it is now possible
to create Providers that return a new Filter instance each time
`newInstance` is called.
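A condensed sketch of the new shape (simplified signatures; the real `Filter.Provider` differs):

```java
// The Provider owns everything that doesn't need a Filter instance;
// newInstance() is the hook a later PR can use to return fresh
// (non-singleton) Filters.
interface Filter {
  interface Provider {
    // Filter config proto types this provider handles.
    String[] typeUrls();

    // The filter kind: client-side, server-side, or both.
    boolean isClientFilter();

    boolean isServerFilter();

    // Parses the raw config proto into this filter's FilterConfig
    // representation (simplified to Object here).
    Object parseFilterConfig(com.google.protobuf.Message rawProtoMessage);

    // Today all providers still return a singleton; the interface permits
    // a new instance per call.
    Filter newInstance(String name);
  }
}
```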
A failing Status from acceptResolvedAddresses means something is wrong
with the config, but parts of the config may still have been applied.
Thus there are now two possible flows: errors that should prevent
updateOverallBalancingState() and errors that should have no effect
other than the return code. To manage that, MultiChildLb must always be
responsible for calling updateOverallBalancingState().
acceptResolvedAddressesInternal() was inlined to make that error
processing easier. No existing usages actually needed to have logic
between updating the children and regenerating the picker.
RingHashLb already was verifying that the address list was not empty, so
the short-circuiting when acceptResolvedAddressesInternal() returned an
error was impossible to trigger. WrrLb's updateWeightTask() calls the
last picker, so it can run before acceptResolvedAddressesInternal(); the
only part that matters is re-creating the weightUpdateTimer.
S2AStub is an internal API and shouldn't be used outside of s2a. It is
still available for tests.
IntegrationTest was moved to io.grpc.s2a. It uses an io.grpc.s2a class,
so it shouldn't be in internal.handler.
Switched to using 8192, which is the current value of Segment.SIZE, and added a test checking that they are equal.
The reason for doing this is that Segment.SIZE is Kotlin-internal, so it shouldn't be used outside of its module.
To try to address this failure when building android-interop-testing:
```
The Daemon will expire after the build after running out of JVM heap space.
The project memory settings are likely not configured or are configured to an insufficient value.
The daemon will restart for the next build, which may increase subsequent build times.
These settings can be adjusted by setting 'org.gradle.jvmargs' in 'gradle.properties'.
The currently configured max heap space is '512 MiB' and the configured max metaspace is '384 MiB'.
...
Exception in thread "Daemon client event forwarder" java.lang.OutOfMemoryError: Java heap space
...
> Task :grpc-android-interop-testing:mergeDexDebug FAILED
ERROR:D8: java.lang.OutOfMemoryError: Java heap space
com.android.builder.dexing.DexArchiveMergerException: Error while merging dex archives:
```
* Don't process resourceDoesNotExist for watchers that have been cancelled.
* Change the test to use an ArgumentMatcher instead of expecting that only the final result will be sent, since depending on timing there may be configs sent for clusters being removed, with their entries as errors.
Currently this improves 2 flows:
1. Known-length messages whose length is greater than 1 MiB. Previously the
first buffer was 1 MiB, and then many buffers of 4096 bytes (from
CodedOutputStream); now subsequent buffers are also up to 1 MiB.
2. In the case of compression, the first write is always a 10-byte buffer
(the gzip header), but it is worth allocating more space.
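A sketch of the sizing rule, with an assumed constant and names:

```java
final class BufferSizing {
  private static final int MAX_BUFFER_SIZE = 1024 * 1024; // 1 MiB

  // Instead of falling back to 4096-byte chunks after the first buffer,
  // keep allocating up to 1 MiB, bounded by what remains to be written.
  static int nextBufferSize(int remainingBytes) {
    return Math.min(remainingBytes, MAX_BUFFER_SIZE);
  }
}
```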