We have a 1-node Cassandra setup in dev/qa, and 3 nodes each in dc1 and dc2.
We recently had server maintenance and saw strange behavior: data loss followed by a partial recovery.
(This was our first such incident and I want to share it.)
Steps to reproduce the situation:
1. As part of maintenance, the servers where the Apigee CS (Cassandra) node runs were rebooted.
2. As part of hardening, the server was rebooted with /tmp having no exec permissions.
3. Apigee services start up while the server boots. Check the Apigee Cassandra logs for the error below.
==
Caused by: org.apache.cassandra.exceptions.ConfigurationException: SnappyCompressor.create() threw an error: java.lang.NoClassDefFoundError Could not initialize class org.xerial.snappy.Snappy
    at org.apache.cassandra.io.compress.CompressionParameters.createCompressor(CompressionParameters.java:179)
    at org.apache.cassandra.io.compress.CompressionParameters.<init>(CompressionParameters.java:71)
    at org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:95)
    … 11 more
==
4. Change the permissions on /tmp back to allow execute.
5. Restart the Apigee services and check the recent API proxies. In our case we lost all information for the proxies we had recently worked on.
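For anyone who wants to check for this condition before starting the services, here is a minimal sketch in Python. It assumes that "no exec permissions" on /tmp means the filesystem is mounted with the `noexec` option, and that the host is Linux with /proc/mounts available. The helper names `mount_options_for` and `is_noexec` are made up for this example. As far as I know, snappy-java unpacks its native library into the JVM temp directory and loads it from there, which would explain why a non-executable /tmp leads to the `NoClassDefFoundError` above.

```python
import os


def mount_options_for(path, mounts_text):
    """Return the mount options of the mount point that contains `path`.

    `mounts_text` is expected to be in /proc/mounts format:
    device, mount point, fs type, comma-separated options, dump, pass.
    """
    path = os.path.abspath(path)
    best_point, best_opts = "", set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, opts = fields[1], fields[3]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the longest matching mount point; on a tie the later
            # entry wins, since a later mount shadows an earlier one.
            if len(mount_point) >= len(best_point):
                best_point, best_opts = mount_point, set(opts.split(","))
    return best_opts


def is_noexec(path, mounts_text):
    """True if the filesystem holding `path` is mounted with noexec."""
    return "noexec" in mount_options_for(path, mounts_text)


if __name__ == "__main__":
    # Linux only: read the live mount table.
    with open("/proc/mounts") as fh:
        mounts = fh.read()
    print("/tmp mounted noexec:", is_noexec("/tmp", mounts))
```

If /tmp has to stay noexec for hardening reasons, snappy-java also reads the `org.xerial.snappy.tempdir` system property, so pointing that at an executable directory may be an alternative. I have not verified where that would be set in an Apigee install.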
What we found was that the data was not available in CS (Cassandra) but was still present in ZK (ZooKeeper).
We opened a support case and found there is no way to recover, since it is a single node with no backup. Support recommended clearing the entries in ZK and re-creating the org.
After we cleared the entries, re-created the org, and restarted the services, all the proxies were visible again (other things in the org, such as KVMs, are still missing, but that's fine).
Has anyone seen this behavior? Can a Cassandra expert explain it?
How frequently does the commit happen, and how can data be lost and then recovered? How does this work internally?
-Vinay