Thanks for the response. I’m fairly certain the issue I’m facing is a bug with Cloud SQL and have contacted support, but that’s been slow moving. Maybe posting in this channel will expediate that process.
For those who don’t want to read the exhaustive explanation a little further down: PITR appears to do a restore from the most recent full backup (prior to chosen point in time) and re-apply transaction logs up until the desired point in time. If a transaction log is corrupted at any point after the full backup and before the chosen point in time, PITR will fail.
Assuming that’s true, there’s a couple of problems:
- The documentation should very clearly state for PITR the transaction logs are restored from the last full backup which equates to the most recent of [on_demand_backup, oldest_automatic_backup].
- A corrupted transaction log at any point since the last full backup will cause PITR to fail no matter what time you select. PITR will only work if a healthy full backup exists and the time selected is after that healthy backup. This means if you have a full backup on Monday, corrupted transaction on Tuesday, something terrible happens Wednesday and you need to do a PITR, you can only choose a time prior to the corrupted transaction log on Tuesday. This to me, is a huge issue.
- Restoring a backup to an instance and then cloning to a new instance will give you a completely incorrect instance state unless you run an on-demand backup prior to the PITR
Hopefully I get a meaningful response from support, or someone in the forums can point out an issue with my testing and help clarify how Cloud SQL is handling things.
For full details on how I came to this conclusion, read the below.
I have three databases:
When cloning databases individually, DB_A and DB_B succeed (despite the .mdf file not found error) and it’s only DB_C that fails.
Reason? The error I originally reported on is (probably) a red herring and can probably be ignored (although it’s pretty confusing to see show up in the logs even when the clone is successful).
I dug through some logs and found an error occurred during the restoration of a log in DB_C.
Could not redo log record (496884:85120:3), for transaction ID (0:1129725227), on page (1:9), allocation unit 6488064, database ‘DB_C’ (database ID 5). Page: LSN = (496884:85120:2), allocation unit = 6488064, type = 13. Log: OpCode = 4, context 20, PrevPageLSN: (496884:85088:6). Restore from a backup of the database, or repair the database.\r"
Given that neither DB_A or DB_B report an error of this nature, I’m tempted to think this error is the culprit of the clone failure.
To try and resolve this issue:
- I ran CHECKDB (which returned no issues) and re-attempted the clone.
- Clone failed
- Tried a clone from a later point in time. This way, a different backup would be used (or so I think)
- Clone failed
- Create a new instance, restore an automated backup from the original instance, clone the new instance using PITR
- Clone succeeded! But there was no data? That’s unusual.
To keep some names clear:
- og: The orginal sql server instance. Cloning this failed.
- created_from_og_backup: Created from a backup of “og”
- backup_clone: Cloned from “created_from_og_backup”
This is some bizarre behavior. Maybe it’s because I chose a PITR time that occurred after the “created_from_og_backup” was created, but before the backup from “og” was restored? To test this, I took a look at the backup history and operation history of “created_from_og_backup”
backup history
Created | Type | Location | Description
-----------------------------------------|-------------------|--------------------------|-----------------------------------
Dec 21, 2023, 1:32:15 PM | Automated | Multi-region: us | Backup for SQL Server PITR setup
Dec 21, 2023, 1:30:43 PM | Automated | Multi-region: us | Backup for SQL Server PITR setup
Operations history
Creation Time | Completion Time | Type | Status
-----------------------------------------|-----------------------------------------|------------------------------------------------|-------------------------------
Dec 21, 2023, 2:08:32 PM | Dec 21, 2023, 2:08:47 PM| Update | Update finished
Dec 21, 2023, 1:39:47 PM | Dec 21, 2023, 1:48:42 PM| Restore instance from backup| Restore instance from backup finished
Dec 21, 2023, 1:32:15 PM | Dec 21, 2023, 1:33:46 PM| Backup | Backup finished
Dec 21, 2023, 1:30:43 PM | Dec 21, 2023, 1:32:15 PM| Backup | Backup finished
Dec 21, 2023, 1:24:37 PM | Dec 21, 2023, 1:35:37 PM| Create | Create finished
Ok, so if I do a PITR after “Dec 21, 2023, 1:39:47 PM” it should contain all of the databases and all my data, right?
No! None my data was available. Only available databases were master, model, tempdb, and msdb. This is very unexpected behavior and is almost certainly a bug.
Ok, ok. What if I take a new, on-demand backup of “created_from_og_backup”, wait 10 minutes for some logs to be created and backed up, then try another PITR?
the backup
2023-12-21 17:04:44.10 Backup Database backed up. Database: DB_C, creation date(time): 2023/12/11(11:14:29), pages dumped: 16362819, first LSN: 496889:160560:40, last LSN: 496889:160600:1, number of dump devices: 1, device information: (FILE=1, TYPE=VIRTUAL_DEVICE: {‘15e8ea9178786b60’}). This is an informational message only. No user action is required.
the transaction log backup
2023-12-21 17:18:21.06 Backup Log was backed up. Database: DB_C, creation date(time): 2023/12/11(11:14:29), first LSN: 496889:160624:1, last LSN: 496889:160664:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘/var/opt/mssql/data/pitr/tlog/ca90d11b-b37f-47fd-9655-433bedde312c_DB_C/ca90d11b-b37f-47fd-9655-433bedde312c_DB_C_2023-12-21-23-18-21.006.log’}). This is an informational message only. No user action is required.
So, did cloning “created_from_og_backup” create a clone with the correct data available?
Yup. That worked.
If taking an on-demand backup of “created_from_og_backup” and then cloning resolved all my issues, would the same work if done on “og”?
Yup. No failure due to a corrupted transaction log.
What does this all mean?
What the above investigation implies to me, is that PITR is applying all transaction logs since that last full backup. Assuming that’s true, there’s a couple of problems:
- The documentation should very clearly state for PITR the transaction logs are restored from the last full backup prior to the chosen point in time.
- A corrupted transaction log at any point since the last full backup previous to the selected point in time will cause PITR to fail no matter what time you select. PITR will only work if a healthy full backup exists and the time selected is after that healthy backup. This means if you have a full backup on Monday, corrupted transaction on Tuesday, something terrible happens Wednesday and you need to do a PITR, you can only choose a time prior to the corrupted transaction log on Tuesday. This to me, is a huge issue.
- Restoring a backup to an instance and then cloning to a new instance will give you a completely incorrect instance state unless you run an on-demand backup prior to the PITR
PITR SQL Server docs: https://cloud.google.com/sql/docs/sqlserver/backup-recovery/pitr