HANA System Replication – Backup
This blog will focus on how backup is done for SAP HANA System Replication (HSR).
As an introduction to HSR, you can read our previous blog: HANA System Replication - Take-over process. There you will also find further material about HSR in general.
In that blog, we discussed how to set up HSR and how to run through a takeover. In this blog, we will discuss how to keep backups continuous across a takeover.
A how-to guide can be found here: How to Perform System Replication for SAP HANA
Before discussing backup for HSR, we will do a quick summary of how backup works in HANA:
- To make backups in HANA, you have to start with a full data backup: the continuously written log backups are the "diff" to the last full data backup. Hence, to be replayed, log backups require a full data backup as a starting point.
- A full backup contains everything which is required to restore the database, to the point in time the full backup was made.
- If HANA is restored to a full backup, the state of the full backup is restored and open transactions (open at the state of the full backup) are rolled back.
- A log backup contains all actions executed on the database, regardless of whether they are committed or not. Log backups allow the database to be restored to any point in time, but require a full backup as a starting point. Hence:
- For a restore, there is always a full backup required. Also log backups require a full backup as starting point: Starting from the full backup, log backups are “replayed” to restore the required state (e.g., specific point in time, all log backups available, etc.).
- Log backups before the oldest full backup cannot be replayed any more.
- Log backups which cannot be replayed are not deleted automatically. See, e.g., https://service.sap.com/sap/support/notes/1852242 for a script that deletes log backups which are no longer needed.
- Deleting backup files (e.g., running the script from note 1852242) does *NOT* clear the backup catalog (see info about backup catalog below).
- When committed, transactions are written to persistent storage in "log segments". Log backups are written asynchronously, i.e., there is no immediate backup once a transaction is committed.
- If a log backup is written, the corresponding log segment is closed. Or: if a log segment is closed, a corresponding log backup is written. Hence, defining a backup interval also defines how often a log segment is closed.
- Technically, the backup is written asynchronously, i.e., closed segments are put into an internal backup queue. Only once a segment has actually been written to backup and is contained in a savepoint is it "free" and can be overwritten. This is why a disabled log backup will cause the log area to run full and hence will cause HANA to hang.
- Furthermore, a log segment is closed if it is full. Hence, the backup interval is only the maximum time between two log backups; they can be written more often.
- Note that if there is no commit for a segment and the segment is not full, this segment is not written (backed up), as there is nothing which could be restored (although data are written to segments before they are committed, only transactions which have been committed are restored). Hence, the absence of new log backups does not necessarily mean that backup is failing; it may simply mean that there are no commits.
- HANA keeps a history of written backups called backup catalog (e.g., see in studio -> Backup -> Backup Catalog). As this information is needed to restore HANA, the backup catalog itself is also backed up. This backup catalog contains the file paths to the full/log backup files.
Note that removing backup files (e.g., from the file system) does *NOT* clear the backup catalog. For details on housekeeping for backup catalog, check the administration guide http://help.sap.com/hana/SAP_HANA_Administration_Guide_en.pdf “Housekeeping for Backup Catalog and Backup Storage”
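The rule that log backups need a full backup as their starting point can be illustrated with a small sketch. It is written in Python and is deliberately simplified: the `Backup` class and the function `split_log_backups` are hypothetical, and plain timestamps stand in for the log positions HANA actually tracks in its backup catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    kind: str         # "full" or "log"
    finished_at: int  # illustrative timestamp, not a real HANA log position


def split_log_backups(backups):
    """Return (replayable, obsolete) log backups.

    A log backup counts as replayable only if it is newer than the oldest
    full backup, because replay needs a full backup as its starting point.
    """
    fulls = [b.finished_at for b in backups if b.kind == "full"]
    logs = [b for b in backups if b.kind == "log"]
    if not fulls:
        # Without any full backup, no log backup can be replayed.
        return [], logs
    oldest_full = min(fulls)
    replayable = [b for b in logs if b.finished_at > oldest_full]
    obsolete = [b for b in logs if b.finished_at <= oldest_full]
    return replayable, obsolete
```

Log backups in the "obsolete" group are the ones a housekeeping script could remove. As noted above, removing the files alone does not clear the backup catalog.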
Backup for System Replication: Facts
- For system replication, the primary has to be configured for backup, otherwise the secondary will not be able to register.
- While a functioning system replication can guarantee data redundancy, backups allow recovery to any point in time. During a takeover, backup is essential, as system replication is not available as a redundancy mechanism.
- Only the primary site runs backups.
=> Recommendation: A script or agent starting the backup should check whether the site is running as primary.
- With a takeover, the backup catalog is also taken over, i.e., the "backup state".
=> Recommendation: The secondary site should have access to the same file path for backups as the primary site. It could be the same "physical directory" or an automatically or manually replicated directory on different hardware.
- The secondary receives the information which log entry is written in a new log segment.
- After a takeover from site A to site B, site B is able to write log backups of transactions started and/or committed on site A. The backup catalog is, like everything else, taken over with the takeover. Thus, site B continues to write log backups to the backup catalog. To be able to recover on site B, site B needs 1) the data/log backups written on site A, 2) the data/log backups written on site B, and 3) a backup catalog written on site B. If site A and site B use the same path for data and log backups, a restore on site B will, according to the backup catalog written on site B, use the right data/log backup files.
- After a takeover from site A to site B (and no further action), site A is still in primary mode. If site A is still running, it is writing backups and backup catalogs. Such backup catalogs written on site A after the takeover can be a source of error: they reference log backups written on site A, although site A is not the "source of truth" any more. The backup catalog on site A contains backups of site A and no backups of site B and hence must not be used for a restore. As HANA chooses the latest backup catalog for a restore, those backup catalogs must not be accessible to HANA on site B, as otherwise the "wrong history" is restored.
=> Recommendation: if the backup directories on site A and site B are not physically the same, there should be a replication from site A to site B. After a takeover, site B (i.e., the logical primary site) should be in control (of replication) of data and log backups. Hence, site B should either delete or not replicate backup catalogs created on site A after the takeover. Backup catalogs are stored in the backup log folder and have the form log_backup_0_0_0_0_.<seconds since Jan 01 1970><milliseconds>
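The first recommendation above (start backups only on the primary) could be checked along these lines. This is a minimal sketch in Python, assuming that the script runs as the <sid>adm user, that `hdbnsutil -sr_state` is available on the PATH, and that its output contains a line of the form `mode: primary`. The exact output format may differ between HANA revisions, so the parsing is an assumption. The function names are hypothetical.

```python
import subprocess


def is_primary(sr_state_output: str) -> bool:
    """Return True if the given text contains a line 'mode: primary'.

    Assumption: the replication state output lists the mode as 'mode: <value>'.
    Lines such as 'operation mode: ...' have a different key and are ignored.
    """
    for line in sr_state_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "mode":
            return value.strip().lower() == "primary"
    return False


def site_is_primary() -> bool:
    """Run 'hdbnsutil -sr_state' and parse its output (assumed format)."""
    result = subprocess.run(
        ["hdbnsutil", "-sr_state"],
        capture_output=True,
        text=True,
        check=True,
    )
    return is_primary(result.stdout)
```

A backup script would call `site_is_primary()` and exit without starting a backup if it returns False. Note that, as described above, site A stays in primary mode after a takeover unless further action is taken, so this check alone does not prevent the old primary from writing backups.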
Exemplary run through
We will now run through an exemplary takeover. Note that this is one possible run in a variety of setups.
- Before the takeover, site A is running as primary, site B is running as secondary.
- Site A is writing backups to /backup/<SID>/data (full backups) and /backup/<SID>/log (log backups), where /backup is mounted to NAS_A
- Site B has mounted /backup to NAS_B
- There is a synchronisation of the backup folders from NAS_A to NAS_B
- T1 (Timestamp 1): transaction C1 is committed (site A)
- T2: a log backup and backup catalog is written to /backup/<SID>/log/<name><T2> (physically to NAS_A) and replicated to NAS_B
- T3: transaction C2 is committed (site A)
- The Data Center of site A has some network problems and it is decided to take over to site B
- T4: The virtual IPs pointing to site A are reconfigured to point to site B, i.e., existing connections from clients to site A will break, new connections will wait for site B to become available
Note: this is the crucial point, where two things have to be ensured:
- No client must be able to connect to site A and commit a transaction. There is no (and there cannot be a) built-in mechanism which would make sure that site A knows site B has taken over (e.g., if there is no connection between site A and site B due to the cause which resulted in the takeover, you cannot rely on the mechanisms HANA possesses to make sure no client connects to site A).
- Site A must not write a backup catalog which overwrites or supersedes the backup catalog of site B, as in case of a restore, the backup history of site A would be restored, although site B was running as the source of truth. It is sufficient if the backup catalog of site B is the most recent one; however, it is safer to ensure that site A does not write any backup catalog at all.
As noted above, this cannot be guaranteed by HANA, as HANA would have to rely on infrastructure which may be the cause of the takeover (i.e., is currently failing). It depends on the infrastructure how those points can be ensured, e.g., using virtual IPs moved from site A to site B, or unmounting the shared backup site, etc.
- T5: On site B, sr_takeover is executed, i.e., site B starts to replay the logs on its most recent sync point
- T6: transaction C3 is committed on site A (it must not be from a remote client)
- T7: a log backup and backup catalog is written to /backup/<SID>/log/<name><T7> (physically to NAS_A), only the backup, but not the backup catalog is replicated to NAS_B
- T8: site B writes a log backup and backup catalog to /backup/<SID>/log/<name><T8> (physically to NAS_B) containing the log segment started after T2, i.e., including transaction C2
- T9: site B starts to serve client requests
- T10: transaction C4 is committed (site B)
- T11: site B writes a log backup and backup catalog to /backup/<SID>/log/<name><T11> (physically to NAS_B)
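The replication behaviour at T7 (the log backup is replicated from NAS_A to NAS_B, the backup catalog is not) could be implemented as a file filter. The following Python sketch assumes the backup catalog name form given in the recommendations above and accepts the name with or without the underscore before the dot. The function names and the flag `source_is_logical_primary` are hypothetical.

```python
import re

# Assumed name form of backup catalog files: log_backup_0_0_0_0 followed by
# an optional underscore, a dot, and a numeric timestamp.
CATALOG_NAME = re.compile(r"^log_backup_0_0_0_0_?\.(\d+)$")


def is_backup_catalog(filename: str) -> bool:
    """Return True if the file name matches the assumed catalog name form."""
    return CATALOG_NAME.match(filename) is not None


def files_to_replicate(filenames, source_is_logical_primary: bool):
    """Select the files to copy from the source NAS to the other NAS.

    If the source site is no longer the logical primary (e.g., site A after
    the takeover), its backup catalogs are skipped, while its data and log
    backups are still copied.
    """
    if source_is_logical_primary:
        return list(filenames)
    return [name for name in filenames if not is_backup_catalog(name)]
```

How the replication job learns that its source site is no longer the logical primary depends on the infrastructure and is left open here.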
If site B required a restore (from backup) after T11, site B would
- Find the most recent backup catalog in /backup/<SID>/log at NAS_B, i.e., the backup catalog written at T11
- Backup catalog T11 would reference the log backups
- T2: it was written on site A before the takeover and hence was part of the backup catalog which was taken over with the takeover started at T5
- T8: was written on site B, but containing the transactions committed on site A not part of the backup catalog before the takeover
- T11: written on site B, containing the transactions committed on site B (i.e., C4)
- Backup catalog T11 would not reference
- T6: transaction C3 was committed on site A when site A was no longer the “source of truth”; hence it is not committed on site B and must not be contained in the backup catalog. Whoever performs a takeover is responsible for ensuring that no client is able to commit a transaction on site A after the takeover. See also HANA System Replication - Take-over process
- Backup catalog T11 does not reference the log backup T7, as this backup was written after the takeover and hence was not in the backup catalog
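The restore logic of this run-through can be modelled in a few lines. The Python sketch below is purely illustrative: the `Catalog` class and `latest_catalog` are hypothetical, timestamps are the T-numbers from the example, and the only behaviour taken from the text is that the most recent backup catalog visible to HANA is used for a restore.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Catalog:
    written_at: int     # T-number from the run-through, e.g. 11 for T11
    site: str           # site that wrote the catalog
    log_backups: tuple  # T-numbers of the log backups it references


def latest_catalog(visible_catalogs):
    """Pick the most recent catalog among those visible on the NAS."""
    return max(visible_catalogs, key=lambda c: c.written_at)


# Catalogs present on NAS_B in the run-through (catalog T7 was not replicated).
on_nas_b = [
    Catalog(2, "A", (2,)),
    Catalog(8, "B", (2, 8)),
    Catalog(11, "B", (2, 8, 11)),
]
chosen = latest_catalog(on_nas_b)
print(chosen.site, chosen.log_backups)  # B (2, 8, 11)

# Hypothetical failure case: a catalog written by site A after T11 reaches NAS_B.
leaked = on_nas_b + [Catalog(12, "A", (2, 7, 12))]
print(latest_catalog(leaked).site)  # A
```

The second case shows why catalogs written on site A after the takeover must not become visible to site B: the newest catalog wins, regardless of which site wrote it.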
General note: the mechanisms discussed here have been described for backup to files; however, the general mechanisms are also valid when using BACKINT!