Message Processor timeouts against Cassandra

Hello,

In the last few weeks we started experiencing the following error in Apigee.

The Router shows errors against the client since it does not have upstreams (Message Processors) available:

2024/11/08 12:31:46 [error] 6382#6382: *359174066 upstream timed out (110: Connection timed out) while connecting to upstream, client: xxx.xxx.xx.xx, server: api.xxxxxxxx.xx.xxxx, request: "GET /customers/api/customers/00000000/documents HTTP/1.1", upstream: "http://xx.xx.xx.xx:8998/customers-api/api/customers/00000000/documents", host: "api.xxxxxxxx.xx.xxxx"

At the same time, Message Processor return time outs against the Cassandra Cluster:

2024-11-08 12:31:46,508 org:XXXX env:production api:xxxxx rev:2 messageid:iacvm0000-4447-347605043-673 policy:OAuth-v20-Store-External-Token Apigee-Main-37 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.insert() : Exception while insert rowkey [eyJhbGcixx, oauth_20_access_tokens, kms] to the columnFamily : {} in the keyspace {}
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=None(0.0.0.0):0, latency=2001(2001), attempts=2]Timed out waiting for connection
at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
....
2024-11-08 12:31:46,510 org:xxxx env:production api:oauthv2 rev:2 messageid:iacvm0000-4447-347605043-673 policy:OAuth-v20-Store-External-Token Apigee-Main-37 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.logHostPoolInCaseOfErrors() : Cassandra Host Pool under use - All Hosts: xx.xx.xx.xx(xx.xx.xx.xx):9160,yy.yy.yy.yy(yy.yy.yy.yy):9160,zz.zz.zz.zz(zz.zz.zz.zz):9160. Active Hosts: xx.xx.xx.xx(xx.xx.xx.xx):9160,zz.zz.zz.zz(zz.zz.zz.zz):9160
2024-11-08 12:31:46,510 org:xxxx env:production api:oauthv2 rev:2 messageid:iacvm0000-4447-347605043-673 policy:OAuth-v20-Store-External-Token Apigee-Main-37 ERROR KERNEL - ErrorMessages.formatMessage() : Unable to locate a resource bundle for error code invalid_request
2024-11-08 12:31:46,510 org:xxxx env:production api:oauthv2 rev:2 messageid:iacvm0000-4447-347605043-673 Apigee-Main-37 ERROR MESSAGING.FLOW - AbstractAsyncExecutionStrategy$AsyncExecutionTask.logException() : Exception caught. Message is invalid_request
2024-11-08 12:31:46,510 org:xxxx env:production api:oauthv2 rev:2 messageid:iacvm0000-10276-194337648-574 policy:OAuth-v20-Store-External-Token Apigee-Main-35 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.insert() : Exception while insert rowkey [eyJhbGcxxxxx, oauth_20_access_tokens, kms] to the columnFamily : {} in the keyspace {}
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=None(0.0.0.0):0, latency=2000(2000), attempts=2]Timed out waiting for connection
at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
....

In Cassandra we did not see any error or “reconection”. It only shows logs about garbage collector but at every time (sometimes 20sec of gc duration) and not only during error.

Also, we exposed jmx metrics in grafana, and during the issue we see a blink in message processor.

Do you know why that blink?
Do you know why Message Processors have those timeouts against the Cassandra?
Is there any command we can execute to determinate the root cause?

Thanks,
Mateo.

Hi @mateoporcar , thank you so much for your question and for your patience! We’re closely following this conversation and ensuring you get a response. In the meantime, we’d like to invite other community members to share their insights.

Also, if you have a chance we’d love for you to join us tomorrow at our virtual community event on Nov 14 “Spec-Driven Development for Apigee: A Fully Automated CI/CD Approach” :backhand_index_pointing_right:t3: Register here.

Hi Alex, thanks you! We will wait for the response.

I couldn’t participate and join to the Spec-Driven event. Was it recorded and is it available somewhere?

Thanks!

Mate.

@mateoporcar What is your architecture here, is this ApigeeHybrid running on some K8s cluster? Just a guess here but from what you’ve described so far, that seems network related if both events happened at the same time.

1 Like

Hello grsiepka,

It is Apigee Edge private cloud instalation (v4.52.00) in rhel8.

We have 6 nodes of cassandra distrubuted in 2 datacenters (3 nodes in dc1 and 3 nodes in dc2).

We have 4 edge-router and 4 edge-message-processors (2 router/message processor in dc1 and 2 in dc2).

Today we have a new message-processor error log against Cassandra:

2024-11-15 11:44:14,054 org:xxxxxxxxx env:production api:push-subscription rev:2 messageid:iacvm1472-10029-34722162-1 policy:OAuth-v20-Verify-Token Apigee-Main-63 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.logHostPoolInCaseOfErrors() : Cassandra Host Pool under use - All Hosts: xx.xx.xx.xx(xx.xx.xx.xx):9160,yy.yy.yy.yy(yy.yy.yy.yy):9160,zz.zz.zz.zz(zz.zz.zz.zz):9160. Active Hosts: xx.xx.xx.xx(xx.xx.xx.xx):9160,zz.zz.zz.zz(zz.zz.zz.zz):9160
2024-11-15 11:44:14,054 org:xxxxxxxxx env:production api:offers-service rev:1 messageid:iacvm1474-10199-34950864-1 policy:OAuth-v20-Verify-Token Apigee-Main-26 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.logHostPoolInCaseOfErrors() : Cassandra Host Pool under use - All Hosts: xx.xx.xx.xx(xx.xx.xx.xx):9160,yy.yy.yy.yy(yy.yy.yy.yy):9160,zz.zz.zz.zz(zz.zz.zz.zz):9160. Active Hosts: xx.xx.xx.xx(xx.xx.xx.xx):9160,zz.zz.zz.zz(zz.zz.zz.zz):9160
2024-11-15 11:44:14,054 org:xxxxxxxxx env:production api:push-subscription rev:2 messageid:iacvm1472-10029-34722162-1  Apigee-Main-63 ERROR MESSAGING.FLOW - AbstractAsyncExecutionStrategy$AsyncExecutionTask.logException() : Exception caught. Message is Error while accessing datastore;Please retry later 
2024-11-15 11:44:14,054 org:xxxxxxxxx env:production api:offers-service rev:1 messageid:iacvm1474-10199-34950864-1  Apigee-Main-26 ERROR MESSAGING.FLOW - AbstractAsyncExecutionStrategy$AsyncExecutionTask.logException() : Exception caught. Message is Error while accessing datastore;Please retry later 
2024-11-15 11:44:14,066 org:xxxxxxxxx env:production api:mya-subscription-api rev:2 messageid:iacvm1474-10199-34950866-1 policy:OAuth-v20-Verify-Token Apigee-Main-87 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.searchForKeysWithSearchMap() :  Couldn't search for specific criteria  
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=None(0.0.0.0):0, latency=2001(2001), attempts=2]Timed out waiting for connection
	at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
	at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:198)
	at com.netflix.astyanax.connectionpool.impl.LeastOutstandingExecuteWithFailover.borrowConnection(LeastOutstandingExecuteWithFailover.java:74)
	at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:117)
--

The cause error seems to be one: Message processor time out against cassandra. And Router error it seems to be consequense of message processor connection error.

Is there any command o configuration we can execute to determinate the correct operation of cassandra?

Excecuting a ./nodetool status we have the following information:

[apigee@iacvmxxxx ~]# /opt/apigee/apigee-cassandra/bin/nodetool status
Picked up JAVA_TOOL_OPTIONS: -Dcom.sun.jndi.rmiURLParsing=legacy
Datacenter: dc-1
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  xx.xx.xx.xx  92.55 GB   1       100.0%            e0059d36-01bc-46d3-9b27-0879f55ca8e3  ra-1
UN  yy.yy.yy.yy  91.14 GB   1       100.0%            8b697312-000e-4de8-9281-25b1e81ad4b6  ra-1
UN  zz.zz.zz.zz  93.56 GB   1       100.0%            10683353-5be6-4226-b301-df8627bd3ef1  ra-1
Datacenter: dc-2
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
UN  aa.aa.aa.aa  89.31 GB   1       100.0%            1a2b6540-d388-41ca-857c-87d3d1617e86  ra-1
UN  bb.bb.bb.bb  89.58 GB   1       100.0%            77a2e60f-973e-401e-b401-a24e57abfa61  ra-1
UN  cc.cc.cc.cc  91.8 GB    1       100.0%            bb4136d0-c00c-4772-8ba7-aae1f7407508  ra-1

Thanks,

Mate.

No worries @mateoporcar , you can find the recording here :backhand_index_pointing_right:t3: Spec-Driven Development for Apigee A fully automated CI/CD Approach | Virtual Session.

1 Like

Couple of things to Check -

  1. What is the compaction Strategy. Kindly refer this document . The preferred strategy is LeveledCompactionStrategy.
  2. Check if these errors happen during any ongoing compaction or backup. Check if you are running any cron jobs to backup and whats the frequency. Check the command - nodetool compactionstats

If you think you have the right compaction strategy, please reaise a support ticket with google for further debugging.

1 Like

Hello @rajeshmi , thanks for your reply.

  1. Our compaction strategy is LeveledCompactionStrategy (I think is the default for v4.52.00).

  2. Let me show you some logs that we saw at the same time in Message Processor and in Cassandra system.log.

Message Processor at 10:27:59:

2024-10-25 10:27:59,293 org:xxxxx env:production api:proxy-login-api rev:2 messageid:iacvmxxxxx-10276-80569763-1 policy:OAuth-v20-Verify-Token Apigee-Main-63 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.get() : Exception while fetching rowKey : [xxxxxxxxxxxxxxxxxx, oauth_20_access_tokens, kms] from column family:{} in keyspace:{}
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=None(0.0.0.0):0, latency=2000(2000), attempts=2]Timed out waiting for connection

In Cassandra system.log:

WARN  [GossipTasks:1] 2024-10-25 10:27:59,669 FailureDetector.java:258 - Not marking nodes down due to local pause of 11132532588 > 5000000000
INFO  [ScheduledTasks:1] 2024-10-25 10:27:59,670 MessagingService.java:929 - MUTATION messages were dropped in last 5000 ms: 1 for internal timeout and 0 for cross node timeout

Some minutes later in Message Processor the same error (10:44:50):

2024-10-25 10:44:50,441 org:xxxxxxx env:production api:accounts-api rev:1 messageid:iacvmxxxxx-10276-80764926-1 policy:OAuth-v20-Verify-Token Apigee-Main-29 ERROR DATASTORE.CASSANDRA - AstyanaxCassandraClient.get() : Exception while fetching rowKey : [xxxxxxxxxxxxxxxxxx, oauth_20_access_tokens, kms] from column family:{} in keyspace:{}
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=None(0.0.0.0):0, latency=2001(2001), attempts=2]Timed out waiting for connection

Cassandra:

WARN  [GossipTasks:1] 2024-10-25 10:44:57,897 FailureDetector.java:258 - Not marking nodes down due to local pause of 12258925947 > 5000000000
INFO  [ScheduledTasks:1] 2024-10-25 10:45:02,899 MessagingService.java:929 - MUTATION messages were dropped in last 5000 ms: 561 for internal timeout and 0 for cross node timeout

Do you know what this WARN and MUTATION means ?

Thanks!

Mate.