Commit Graph

6814 Commits

Author SHA1 Message Date
Kannan J 42e1829b37
xds: Do RLS fallback policy eagar start (#12211)
The resource subscription to the fallback target was done only at the time of falling back, which can cause rpcs to fail. This change makes the fallback target to be subscribed and cached earlier, similar to C++ and go gRPC implementations.
2025-07-24 16:58:32 +05:30
Eric Anderson c4256add4d xds: Align PriorityLB child selection with A56
The PriorityLB predates A56. tryNextPriority() now matches
ChoosePriority() from the gRFC.

The biggest change is waiting on CONNECTING children instead of failing
after the failOverTimer fires. The failOverTimer should be used to start
lower priorities more eagerly, but shouldn't cause the overall
connectivity state to become TRANSIENT_FAILURE on its own. The prior
behavior of creating the "Connection timeout for priority" failing
picker was particularly strange, because it didn't update child's
connectivity state. This previous behavior was creating errors because
of the failOverTimer with no way to diagnose what was going wrong.

b/428517222
2025-07-23 06:38:33 -07:00
Eric Anderson 6ff8ecac09 core: Don't pre-compute DEADLINE_EXCEEDED message for delayed calls
The main reason I made a change here was to fix the tense from the
deadline "will be exceeded in" to "was exceeded after". But we really
don't want to be doing the string formatting unless the deadline is
actually exceeded. There were a few more changes to make some variables
effectively final.
2025-07-22 06:56:02 -07:00
Patrick Strawderman 80217275db
api: Size Sets and Maps correctly in handling of Metadata values to be exchanged during a call (#12229)
Fix HashSet / HashMap initializations to have sufficient capacity allocated based on the number of keys to be inserted, without which it would always lead to a rehash / resize operation.
2025-07-22 09:14:08 +05:30
Eric Anderson 2e96fbf1e8 netty: Associate netty stream eagerly to avoid client hang
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
2025-07-17 21:55:53 +00:00
George Gensure a37d3eb349 Guarantee missing stream promise delivery
In observed cases, whether RST_STREAM or another failure from netty or
the server, listeners can fail to be notified when a connection yields a
null stream for the selected streamId. This causes hangs in clients,
despite deadlines, with no obvious resolution.

Tests which relied upon this promise succeeding must now change.
2025-07-17 21:55:16 +00:00
Eric Anderson 1fc4ab0bb2 LBs should avoid calling LBs after lb.shutdown()
LoadBalancers shouldn't be called after shutdown(), but RingHashLb could
have enqueued work to the SynchronizationContext that executed after
shutdown(). This commit fixes problems discovered when auditing all LBs
usage of the syncContext for that type of problem.

Similarly, PickFirstLb could have requested a new connection after
shutdown(). We want to avoid that sort of thing too.

RingHashLb's test changed from CONNECTING to TRANSIENT_FAILURE to get
the latest picker. Because two subchannels have failed it will be in
TRANSIENT_FAILURE. Previously the test was using an older picker with
out-of-date subchannelView, and the verifyConnection() was too imprecise
to notice it was creating the wrong subchannel.

As discovered in b/430347751, where ClusterImplLb was seeing a new
subchannel being called after the child LB was shutdown (the shutdown
itself had been caused by RingHashConfig not implementing equals() and
was fixed by a8de9f07ab, which caused ClusterResolverLb to replace its
state):

```
java.lang.NullPointerException
	at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createClusterLocalityFromAttributes(ClusterImplLoadBalancer.java:322)
	at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createSubchannel(ClusterImplLoadBalancer.java:236)
	at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
	at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
	at io.grpc.internal.PickFirstLeafLoadBalancer.createNewSubchannel(PickFirstLeafLoadBalancer.java:527)
	at io.grpc.internal.PickFirstLeafLoadBalancer.requestConnection(PickFirstLeafLoadBalancer.java:459)
	at io.grpc.internal.PickFirstLeafLoadBalancer.acceptResolvedAddresses(PickFirstLeafLoadBalancer.java:174)
	at io.grpc.xds.LazyLoadBalancer$LazyDelegate.activate(LazyLoadBalancer.java:64)
	at io.grpc.xds.LazyLoadBalancer$LazyDelegate.requestConnection(LazyLoadBalancer.java:97)
	at io.grpc.util.ForwardingLoadBalancer.requestConnection(ForwardingLoadBalancer.java:61)
	at io.grpc.xds.RingHashLoadBalancer$RingHashPicker.lambda$pickSubchannel$0(RingHashLoadBalancer.java:440)
	at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
	at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:128)
	at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.onData(XdsClientImpl.java:817)
```
2025-07-17 12:56:33 +00:00
MV Shiva 6935d3a115
Revert "xds: add "resource_timer_is_transient_failure" server feature (#12063)" (#12228) 2025-07-17 11:35:34 +05:30
MV Shiva d7d70c6905
xds: cncf/xds proto sync to 2025-05-02 (#12225) 2025-07-17 10:26:12 +05:30
Kannan J d352540a02
api: Add more Javadoc for NameResolver.Listener2 interface (#12220) 2025-07-16 14:39:43 +05:30
MV Shiva 5a8326f1c7
xds: add "resource_timer_is_transient_failure" server feature (#12063) 2025-07-15 15:33:02 +05:30
Eric Anderson a8de9f07ab xds: Implement equals in RingHashConfig
Lack of equals causes cluster_resolver to consider every update a
different configuration and restart itself.

b/430347751
2025-07-14 14:06:15 +00:00
Eric Anderson 9d191b31b5 xds: Check isHttp11ProxyAvailable in equals()
This fixes an equals/hashCode bug introduced in 12197065fe.

Discovered when investigating b/430347751
2025-07-14 14:05:35 +00:00
Richard Belleville 01bd63d88f
Remove inactive maintainers (#12187) 2025-07-11 15:07:00 -07:00
John Cormie 94532a6b56
binder: Introduce server pre-authorization (#12127)
grpc-binder clients authorize servers by checking the UID of the sender of the SETUP_TRANSPORT Binder transaction against some SecurityPolicy. But merely binding to an unauthorized server to learn its UID can enable "keep-alive" and "background activity launch" abuse, even if security policy ultimately decides the connection is unauthorized. Pre-authorization mitigates this kind of abuse by looking up and authorizing a candidate server Application's UID before binding to it. Pre-auth is especially important when the server's address is not fixed in advance but discovered by PackageManager lookup.
2025-07-10 14:14:36 -07:00
Eric Anderson 6dfa03c51c
core: grpc-timeout should always be positive (#12201)
PROTOCOL-HTTP2.md specifies "TimeoutValue → {positive integer as ASCII
string of at most 8 digits}". Zero is not positive, so it should be
avoided. So make sure timeouts are at least 1 nanosecond instead of 0
nanoseconds.

grpc-go recently began disallowing zero timeouts in
https://github.com/grpc/grpc-go/pull/8290 which caused a regression as
grpc-java can generate such timeouts. Apparently no gRPC implementation
had previously been checking for zero timeouts.

Instead of changing the max(0) to max(1) everywhere, just move the max
handling into TimeoutMarshaller, since every caller of TIMEOUT_KEY was
doing the same max() handling.

Before fd8fd517d (in 2016!), grpc-java actually behaved correctly, as it
failed RPCs with timeouts "<= 0". The commit changed the handling to the
max(0) handling we see now.

b/427338711
2025-07-03 11:44:04 +05:30
Abhishek Agrawal 919370172d
census: APIs for stats and tracing (#12050) 2025-07-01 20:44:28 +05:30
Eric Anderson ca99a8c478 Fix RLS regressions from XdsDepMan conversion
297ab05ef converted CDS to XdsDependencyManager. This caused three
regressions:

 * CdsLB2 as a RLS child would always fail with "Unable to find
   non-dynamic root cluster" because is_dynamic=true was missing in
   its service config
 * XdsNameResolver only propagated resolution updates when the clusters
   changed, so a CdsUpdate change would be ignored. This caused a hang
   for RLS even with is_dynamic=true. For non-RLS the lack config update
   broke the circuit breaking psm interop test. This would have been
   more severe if ClusterResolverLb had been converted to
   XdsDependenceManager, as it would have ignored EDS updates
 * RLS did not propagate resolution updates, so CdsLB2 even with
   is_dynamic=true the CdsUpdate for the new cluster would never arrive,
   causing a hang

b/428120265
b/427912384
2025-06-30 14:23:32 +00:00
John Cormie 2ee4f9b488
AndroidComponentAddress constructor can be private. (#12188) 2025-06-27 10:58:48 +05:30
John Cormie 74aee11389
Clarify requirements for creating a cross-user Channel. (#12181)
The @SystemApi runtime visibility requirement isn't really new. It has always been implicit in the required INTERACT_ACROSS_USERS permission, which (in production) can only be held by system apps.

The SDK_INT >= 30 requirement was also always present, via @RequiresApi() on  BinderChannelBuilder#bindAsUser. This change just updates its replacement APIs (AndroidComponentAddress and TARGET_ANDROID_USER) to require it too.
2025-06-26 17:43:13 -07:00
vimanikag 64322c3243
11243: RLS cleanups (#12085) 2025-06-25 10:55:00 +05:30
Eric Anderson af7efeb9f5 core: Rely on ping-pong for flow control testing
The previous code did a ping-pong to make sure the transport had enough
time to process, but then proceeded to sleep 5 seconds. That sleep would
have been needed without the ping-pong, but with the ping-pong we are
confident all events have been drained from the transport. Deleting the
unnecessary sleeps saves 10 seconds, for each of the 9 instances of this
test.
2025-06-25 04:52:46 +00:00
Eric Anderson ebc6d3e932 Start 1.75.0 development cycle 2025-06-25 04:52:17 +00:00
Eric Anderson d374b26b68 xds: Disable LOGICAL_DNS in XdsDepMan until used
ClusterResolverLb is still doing DNS itself, so disable it in XdsDepMan
until that migration has finished. EDS is fine in XdsDepman, because
XdsClient will share the result with ClusterResolverLb.
2025-06-24 14:56:16 +00:00
MV Shiva f99b2aaef8
release: Migrate artifacts publishing from legacy OSSRH to Central Portal (#12156) 2025-06-24 10:12:35 +05:30
John Cormie 30d40a6179
binder: Cancel checkAuthorization() request if still pending upon termination (#12167) 2025-06-23 12:51:40 -07:00
John Cormie 9a6bdc70af
download maven using archive/permalink url (#12169) 2025-06-23 12:00:25 -07:00
Kun Zhang e6e7bcadaf
binder: stops emulating for 21/22 Lollipop in tests
See cl/769936336 internally
2025-06-18 14:47:24 -07:00
John Cormie 922dc8a999
Mark a few test helper methods as @CanIgnoreReturnValue (#12162) 2025-06-18 10:59:21 +05:30
Eric Anderson d2d8ed8efa xds: Add logical dns cluster support to XdsDepManager
ClusterResolverLb gets the NameResolverRegistry from
LoadBalancer.Helper, so a new API was added in NameResover.Args to
propagate the same object to the name resolver tree.

RetryingNameResolver was exposed to xds. This is expected to be
temporary, as the retrying is being removed from ManagedChannelImpl and
moved into the resolvers. At that point, DnsNameResolverProvider would
wrap DnsNameResolver with a similar API to RetryingNameResolver and xds
would no longer be responsible.
2025-06-17 22:14:20 +00:00
Eric Anderson f07eb47cac
util: Deliver addresses in a random order in MultiChildLb
This should often not matter much, but in b/412468630 it was cleary
visible that child creation order can skew load for the first batch of
RPCs. This doesn't solve all the cases, as further-away backends will
still be less likely chosen initially and it is ignorant of the LB
policy. But this doesn't impact correctness, is easy, and is one fewer
cases to worry about.
2025-06-17 07:14:32 -07:00
Eric Anderson 2604ce8a55 xds: XdsNR should be subscribing to clusters with XdsDepManager
This is missing behavior defined in gRFC A74:

> As per gRFC A31, the ConfigSelector gives each RPC a ref to the
> cluster that was selected for it to ensure that the cluster is not
> removed from the xds_cluster_manager LB policy config before the RPC
> is done with its LB picks. These cluster refs will also hold a
> subscription for the cluster from the XdsDependencyManager, so that
> the XdsDependencyManager will not stop watching the cluster resource
> until the cluster is removed from the xds_cluster_manager LB policy
> config.

Without the logic, RPCs can race and see the error:

> INTERNAL: CdsLb for cluster0: Unable to find non-dynamic root cluster

Fixes #12152. This fixes the regression introduced in 297ab05e
2025-06-17 13:36:24 +00:00
vimanikag 1c43098990
util: OutlierDetection should use Ticker, not TimeProvider (#12110)
TimeProvider provides wall time. That can move forward and backward as time is adjusted. OutlierDetection is measuring durations, so it should use a monotonic clock.

Fixes #11622
2025-06-16 07:47:36 -07:00
Eric Anderson d5b4fb51c2 xds: Support tracking non-xds resources in XdsDepManager
This will be used for logical dns clusters as part of gRFC A74. Swapping
to EnumMap wasn't really necessary, but was easy given the new type
system.

I can't say I'm particularly happy with the name of the new
TrackedWatcher type, but XdsConfigWatcher prevented using "Watcher"
because it won't implement the new interface, and ResourceWatcher
already exists in XdsClient. So we have TrackedWatcher, WatcherTracer,
TypeWatchers, and TrackedWatcherType.
2025-06-16 14:32:27 +00:00
Eric Anderson 8974a306af
util: Mark OutlierDetectionLb classes final
None of these classes were intended to be extended. Even non-public
classes need final to prevent mocks from doing horrible things.
2025-06-13 10:52:00 -07:00
Eric Anderson d88ef97a87 core: Remove RetryingNR.RESOLUTION_RESULT_LISTENER_KEY
It was introduced in fcb5c54e4 because at the time we didn't change the
API to communicate the status. When onResult2() was introduced in
90d0fabb1 this hack stopped being necessary.
2025-06-13 15:20:10 +00:00
Eric Anderson 240f731e00 xds: Avoid changing cache when watching children in XdsDepManager
The watchers can be completely regular, so the base class can do the
cache management while the subclasses are only concerned with
subscribing to children.
2025-06-13 15:18:14 +00:00
MV Shiva 26bd0eee47
core: Use lazy message formatting in checkState (#12144) 2025-06-13 12:30:13 +05:30
David Sanderson 6f69363d90
bazel: Migrate java_grpc_library to use DefaultInfo (#12148)
We here address the following obstacles in grpc-java to using Bazel's
--incompatible_disable_target_default_provider_fields flag:

```
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: in _java_grpc_library rule @@googleapis+//google/devtools/build/v1:build_java_grpc:
Traceback (most recent call last):
        File "/private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/grpc-java+/java_grpc_library.bzl", line 94, column 30, in _java_rpc_library_impl
                args.add(toolchain.plugin.files_to_run.executable, format = "--plugin=protoc-gen-rpc-plugin=%s")
Error: Accessing the default provider in this manner is deprecated and will be removed soon. It may be temporarily re-enabled by setting --incompatible_disable_target_default_provider_fields=false. See https://github.com/bazelbuild/bazel/issues/20183 for details.
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: Analysis of target '@@googleapis+//google/devtools/build/v1:build_java_grpc' failed
ERROR: Analysis of target '//src:bazel' failed; build aborted: Analysis failed
```
2025-06-12 08:43:23 -07:00
Eric Anderson 6cc2ff1ced util: In OutlierDetectionLb, don't box longs if they can't be null
Changed the builder pattern to pass the builder to the constructor,
since I was changing almost all the arguments of the constructor anyway.
2025-06-12 04:02:40 +00:00
Eric Anderson 13fe008044
rls: Refactor estimatedSizeBytes updates (#12145)
Just use a regular method instead of reusing the EvictionListener API.
Fix a few comments as well. Both of these changes were based on review
comments to pre-existing code in #11203.

Contributes to #11243
2025-06-12 09:31:19 +05:30
John Cormie 30f6a4db77
google-java-format a line that was too long (#12147) 2025-06-11 12:19:19 -07:00
Eric Anderson 297ab05efe
xds: Convert CdsLb to XdsDepManager
I noticed we deviated from gRFC A37 in some ways. It turned out those
were added to the gRFC later in https://github.com/grpc/proposal/pull/344:
- NACKing empty aggregate clusters
- Failing aggregate cluster when children could not be loaded
- Recusion limit of 16. We had this behavior already, but it was
  ascribed to matching C++

There's disagreement on whether we should actually fail the aggregate
cluster for bad children, so I'm preserving the pre-existing behavior
for now.

The code is now doing a depth-first leaf traversal, not breadth-first.
This was odd to see, but the code was also pretty old, so the reasoning
seems lost to history. Since we haven't seen more than a single level of
aggregate clusters in practice, this wouldn't have been noticed by
users.

XdsDependencyManager.start() was created to guarantee that the callback
could not be called before returning from the constructor. Otherwise
XDS_CLUSTER_SUBSCRIPT_REGISTRY could potentially be null.
2025-06-11 11:56:13 -07:00
MV Shiva a16d655919
compiler: generate blocking v2 unary calls that throw StatusException (#12126) 2025-06-10 10:31:03 +05:30
Eric Anderson 6afacf589e xds: Don't cache rdsName in XdsDepManager
We can easily compute the rdsName and avoiding the state means we don't
need to override onResourceDoesNotExist() to keep the cache in-sync with
the config.
2025-06-09 14:13:59 +00:00
Eric Anderson 4ee662fbcf xds: cancelled=true on watch close in XdsDepManager
1fd29bc80 replaced cancelWatcher() with watcher.close(). But setting
cancelled was missing. Because the config update checks for shutdown,
the cancelled flag no longer avoids exceptions. But it seems best to
continue avoiding any processing after close to avoid surprises.
2025-06-09 14:13:41 +00:00
Eric Anderson 1fd29bc804 xds: Use tracing GC in XdsDepManager
Reference counting doesn't release cycles, so swap to a tracing garbage
collector. This greatly simplifies the code as well, as diffing is no
longer necessary. (If vanilla reference counting was used, diffing
wouldn't have been necessary either as you just increment all the new
objects and decrement the old ones. But that doesn't work when use a set
instead of an integer.)
2025-06-06 08:43:35 -07:00
eshitachandwani dc192f5c5e
Create SPIFFE tests config (#12133) 2025-06-06 19:24:27 +05:30
John Cormie c206428749
binder: Rationalize @ThreadSafe-ty of BinderTransport. (#12130)
- Use @BinderThread to document restrictions on methods and certain fields.
- Make TransactionHandler non-public since only Android should call it.
- Replace an unnecessary AtomicLong with a plain old long.
2025-06-05 19:07:59 -07:00
Eric Anderson 4cd7881086 xds: Fix XdsDepManager aggregate cluster child ordering and loop detection
The children of aggregate clusters have a priority order, so we can't
ever throw them in an ordinary set for later iteration.

This now detects recusion limits only after subscribing, but that
matches our existing behavior in CdsLoadBalancer2. We don't get much
value detecting the limit before subscribing and doing so makes watcher
types more different.

Loops are still a bit broken as they won't be unwatched when orphaned,
as they will form a reference loop. In CdsLoadBalancer2, duplicate
clusters had duplicate watchers so there was single-ownership and
reference cycles couldn't form. Fixing that is a bigger change.

Intermediate aggregate clusters are now included in XdsConfig, just for
simplicity. It doesn't hurt anything whether they are present or
missing. but it required updates to some tests.
2025-06-05 08:43:02 -07:00