High swapping rates when diskbackup is running

Former Member
0 Kudos

SLES 10 SP1 with MaxDB

During backup of the "archive logs" (10 GB each file) the system is heavily swapping. Apparently all I/O reads are being pushed through the filesystem cache.

Note: I'm not talking about saving the logs out of the database to a directory but of the backup of those created files using a backup software.

I was playing around with /proc/sys/vm/swappiness but I'm not sure if that is the right parameter to adjust.
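For reference, checking and changing it at runtime looks like this (a minimal sketch; the value 10 is only an example, not a recommendation):

# cat /proc/sys/vm/swappiness
# sysctl -w vm.swappiness=10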

I have no way to configure the backup software (Legato/EMC Networker) to use DIRECT_IO when reading files, so I'm looking for a way to limit the amount of filesystem cache used. Our database runs on raw devices, so this wouldn't impact performance.

Any ideas about tunables?


top - 19:59:11 up 26 days,  5:16,  4 users,  load average: 13.83, 13.66, 13.86
Tasks: 307 total,   2 running, 303 sleeping,   0 stopped,   2 zombie
Cpu(s):  3.3%us,  3.9%sy,  0.0%ni, 78.7%id, 12.6%wa,  0.0%hi,  1.5%si,  0.0%st
Mem:  53580916k total, 43299300k used, 10281616k free,    80092k buffers
Swap: 63154160k total, 31574380k used, 31579780k free,  4315632k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                    
  908 maxdb     16   0 34.5g  31g  35m S   48 61.7  24788:07 /sapdb/P01/db/pgm/kernel P01                
29306 maxdb     15   0 27332 5636 4112 R    8  0.0   2:31.88 /usr/sbin/save -v -s srvlibrh.aubi.de -b SAP
29317 maxdb     15   0 27324 5632 4112 S    5  0.0   2:15.38 /usr/sbin/save -v -s srvlibrh.aubi.de -b 

Regards,

Markus

Accepted Solutions (1)


hannes_kuehnemund
Active Contributor
0 Kudos

Hi Markus,

maybe drop_caches is an option for you. Since SLES 10 and RHEL 5 there is a new proc parameter. According to proc.txt:

"

drop_caches
-----------
Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.

To free pagecache:

echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first.

"

At least you can free the filesystem cache...
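Combining the sync advice with the full flush gives a one-liner like this (a minimal sketch; it must be run as root):

# sync && echo 3 > /proc/sys/vm/drop_caches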

Thanks,

Hannes

markus_doehr2
Active Contributor
0 Kudos

Hi Hannes,

thank you for the idea - I tried that; the filesystem cache is indeed being freed, but the system is back in that unresponsive state a few seconds later because the backup process continues to thrash the cache...

Markus

hannes_kuehnemund
Active Contributor
0 Kudos

Hi Markus,

just to make sure: which MaxDB version are you running? There are known swapping problems during the backup of the MaxDB volumes.

SAP Note [977515|https://service.sap.com/sap/support/notes/977515] - Linux System paged during MaxDB data backup/add volume

describes the issue and the solution. Before digging into this, did you know this note?

Thanks,

Hannes

markus_doehr2
Active Contributor
0 Kudos

Hi Hannes,

yes - I know that note but that is not our problem.

It´s not specific to the database but to the OS.

You can easily reproduce it by creating several 10 GB files using dd and then using an external backup tool to save them to tape. Or just create them and watch "top d1c" on a loaded system. The filesystem reads (or writes) thrash the buffers, and the system becomes unresponsive simply because kswapd uses a lot of system CPU. BTW: that effect is much stronger on NUMA machines than on non-NUMA (Intel) ones (due to missing processor affinity?)
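A reproduction along those lines could look like this (a minimal sketch; the path and file count are only examples):

# for i in 1 2 3; do dd if=/dev/zero of=/archivelog/testfile$i bs=1M count=10240; done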

Since I can tell neither "dd" nor the Legato save process to open files with O_DIRECT, everything goes through the filesystem cache. And because the SAN is pretty fast at reading (or writing) a sequential file, the system CPU load goes up to 80 % on an 8-way machine.

If you try that with a modified dd using O_DIRECT, you won't see that effect - the same way you don't see it when configuring the database to open the volumes as described in that note.
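For comparison: newer GNU coreutils dd can open files with O_DIRECT itself, so no modified binary is needed (a sketch; the path is an example, and the flag requires a sufficiently recent dd):

# dd if=/dev/zero of=/archivelog/testfile bs=1M count=10240 oflag=direct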

The question is: how can the Linux kernel be configured with 0 filesystem cache, or alternatively with an algorithm that detects such "races" and switches to non-cached I/O?

Markus

Former Member
0 Kudos

Hi Markus,

what happens if you mount the filesystem where the backup files are with the "sync" option in /etc/fstab?

It is normally used for floppies.

Then all I/O is written synchronously to the disks. I don't know if this option prevents the kernel from using RAM for filesystem caching.
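The fstab entry would look something like this (a sketch using the device and mount point mentioned later in this thread):

/dev/sda1  /archivelog  reiserfs  acl,user_xattr,sync  0 0

or, at runtime:

# mount -o remount,sync /archivelog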

Regards

Manuel

markus_doehr2
Active Contributor
0 Kudos

Hi Manuel,

I will try that - thank you for the idea!


# mount | grep arch
/dev/sda1 on /archivelog type reiserfs (rw,sync,acl,user_xattr)

Backup runs at 6:30 pm - will post results!

Markus

Former Member
0 Kudos

Hi Markus,

sync only works with ext2, ext3 or ufs.

Not reiser.

Format your archivelog partition as ext3.

By the way, I'm also using reiser, but I've learned my lessons. There are many recovery utilities for destroyed ext3 partitions which are not available for reiser. Since then, I try to use ext3 wherever I can.

(It's no problem to recover a destroyed ext3 partition, even when all inodes have been completely deleted. This is not possible for reiser, simply because the tools are missing.)

Regards

Manuel

markus_doehr2
Active Contributor
0 Kudos

Hi Manuel,

> sync only works with ext2, ext3 or ufs.

> Not reiser.

Oh...

UFS on Linux? I'd give it a shot (but not on this production system).

> Format your archivelog Partition as ext3.

Ok - I will do so - thanks for the advice.

> By the way, I'm also using reiser, but I've learned my lessons. There are many recovery utilities for destroyed ext3 partitions which are not available for reiser. Since then, I try to use ext3 wherever I can.

I prefer to use raw devices wherever possible - but unfortunately there's no way to put the SAP kernel onto a raw device.

Markus

markus_doehr2
Active Contributor
0 Kudos

> Hi Markus,

> sync only works with ext2, ext3 or ufs.

> Not reiser.

Well - I see a difference. I didn't reformat to ext3, but I no longer see any kswapd activity or high memory pressure...

Markus

Former Member
0 Kudos

Great!

I'm wondering why the sync option works with reiser...

I don't have to understand everything

Regards Manuel

markus_doehr2
Active Contributor
0 Kudos

Just to add:

The kernel still does a lot of kswapd ins/outs, but it's not as aggressive:


top - 17:44:11 up 36 days,  3:01,  5 users,  load average: 7.92, 7.87, 7.81
Tasks: 307 total,   4 running, 301 sleeping,   0 stopped,   2 zombie
Cpu(s): 15.1%us,  3.1%sy,  0.0%ni, 67.2%id, 13.6%wa,  0.1%hi,  0.8%si,  0.0%st
Mem:  53580916k total, 53481932k used,    98984k free,   212768k buffers
Swap: 63154160k total, 26770540k used, 36383620k free, 11697336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                    
10629 p01adm    16   0 73204  27m 5488 R   43  0.1 147:53.83 /usr/sap/P01/SYS/exe/run/R3load -e SAPSSEXC.
11907 p01adm    16   0 32.1g 3.5g 3.3g R   33  6.8 586:07.86 dw.sapP01_DVEBMGS00 pf=/usr/sap/P01/SYS/prof
  908 maxdb     16   0 34.5g  33g  47m S   31 66.3  34985:21 /sapdb/P01/db/pgm/kernel P01                
31666 root      16   0 28432 5352 4148 D   13  0.0   0:19.18 /usr/sbin/savepnpc -s srvlibrh.aubi.de -g P0
11895 p01adm    15   0 32.2g 4.3g 4.0g R   12  8.4   2166:28 dw.sapP01_DVEBMGS00 pf=/usr/sap/P01/SYS/prof
11814 root      16   0 46020  19m 1268 S    5  0.0 219:10.95 /usr/sap/P01/SYS/exe/run/saposcol           
32107 p01adm    15   0  5724 1376  872 R    4  0.0   0:06.29 top d1c                                     
  299 root      15   0     0    0    0 S    2  0.0 129:43.87 [kswapd2]                                   
  298 root      15   0     0    0    0 S    1  0.0  92:45.84 [kswapd3]                                   
  300 root      15   0     0    0    0 S    1  0.0 168:35.53 [kswapd1]                                   
  301 root      15   0     0    0    0 S    1  0.0 123:00.98 [kswapd0]                                   
    1 root      16   0   780   72   40 S    0  0.0   0:12.15 init [5]        

and the system time stays low (as opposed to > 60 % before the 'sync' option).

Markus

former_member184709
Participant
0 Kudos

Hello,

I know this is an old thread, but we had a similar issue with copying a DB2 backup from a local filesystem to an NFS filesystem, on Linux as well. This was done on a new server with a huge amount of physical free RAM available. I believe the filesystem copy grabbed all the free RAM and pushed other processes into swap.

Remounting the backup filesystems seems to have helped performance and reduced swapping.

Markus, did keeping the sync option on your mount ultimately help your issue?

markus_doehr2
Active Contributor
0 Kudos

> Markus, did keeping the sync option on your mount ultimately help your issue?

Not really.

When we back up archive logs now, I run this script

# cat clean_bc.sh 
sync
echo 3 > /proc/sys/vm/drop_caches

each minute via cron. This keeps the load acceptable, but I don't think it's an optimal solution: three or four days after a system restart, almost all available swap space is "in use" because of it. I haven't looked into SLES 11 yet to see whether there's a way to avoid this behavior completely.
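The corresponding crontab entry looks like this (a minimal sketch; the script path is hypothetical):

* * * * * /root/clean_bc.sh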

Other Unix-like OSes have an option to mount a filesystem without using the cache; I hope Novell will sooner or later implement that in SLES 11 as well.

Markus

former_member184709
Participant
0 Kudos

Hi Markus,

Thanks for the feedback and for your workaround.

I have been researching this issue, and it seems to be a common problem with the Linux memory manager: high swap rates can be encountered, especially in workloads where a backup operation and a database workload are combined. It does not seem to matter whether the system has a very large amount of physical RAM available or not.

I have seen other people mention that they run a cron script similar to yours to clear out trapped cache; one admin went pretty extreme and used physical RAM as his swap device, though he is happy with the solution. We may have to go with a similar cron script that runs before and after our nightly backups.

I am wondering if this was a major driving force behind DB2 implementing the NO FILE SYSTEM CACHING option for their tablespaces (which uses the O_DIRECT call).
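For reference, enabling it from the DB2 command line would look like this (a sketch; the tablespace name is hypothetical):

# db2 "ALTER TABLESPACE USERSPACE1 NO FILE SYSTEM CACHING"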

Here is a similar issue on Oracle:

https://bugzilla.redhat.com/show_bug.cgi?id=160033

The same issue on MySQL (not Livecache version):

http://don.blogs.smugmug.com/2008/05/01/mysql-and-the-linux-swap-problem/

And here is a description of some problem workloads with the memory manager, where two types of workloads are mentioned as possibly being problematic:

"Database, 128GB RAM, 2GB swap"

"Database + backup"

http://linux-mm.org/ProblemWorkloads

The last link also points to a page about the memory manager being rewritten; the author mentions this would probably land in newer Linux kernel versions, but I'm not sure what the most recent status of that is. We will probably engage SAP and Red Hat on this issue since we are moving more of our enterprise environments over to Linux, and this is a pretty big scalability issue.

Others have mentioned that the swappiness tunable works better in newer versions of Linux, and that a swappiness value of 10 is better than 0.

This situation also looks like exactly what happened in our situation:

http://www.mentby.com/Group/linux-kernel/swappiness-vs-mmap-and-interactive-response.html

former_member184709
Participant
0 Kudos

Darn, looks like drop_caches cannot be used on Red Hat 4, but it is fine on Red Hat 5:

http://kbase.redhat.com/faq/docs/DOC-23223

Though I did find the following parameters that are mentioned for DB2 specifically:

vm.swappiness = 0

vm.dirty_ratio = 10

vm.dirty_background_ratio = 5

fs.file-max = 262144

http://www.idug.org/conferences/EU2008/data/EU08D15.pdf

http://www.vmware.com/appliances/directory/uploaded_files/01-DB2-EXPC-9.5.0-2.2-Install-Instructions...
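Applied persistently, the parameters above would go into /etc/sysctl.conf (a minimal sketch; the values are the ones quoted from the DB2 recommendations, not tested here):

# cat >> /etc/sysctl.conf <<EOF
vm.swappiness = 0
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
fs.file-max = 262144
EOF
# sysctl -p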

markus_doehr2
Active Contributor
0 Kudos

Hi Derek,

thank you for the additional info - so we're not alone here

I follow the linux-mm kernel list irregularly, and Rik had some interesting ideas; however, I was not keen enough to build an mm kernel and test it out.

The whole problem could be circumvented by offering a mount option to not use the cache when data is read/written to that particular disk. I think that approach is much better than a complex memory/VFS handling algorithm.

We use Novell SLES 10 (and SLES 11) for our production systems, but the behavior is the same. I was in contact with the LinuxLab in the past; I think they already filed an RFE with Red Hat and SuSE to provide those patches for the VFS (as far as I understood, the patches exist, they were just not (yet) backported to the kernels in use).

Maybe in the next support package...

For our databases (2.9 TB, 1.1 TB) we use raw devices, and for self-written tools we use O_DIRECT. Unfortunately the backup software doesn't open the files with that option.

Markus

Former Member
0 Kudos

A few remarks from the OS perspective:

- We don't recommend too low swappiness values. 40 -- 75 is a sane range.

- Backup apps could use direct I/O or madvise(POSIX_MADV_DONTNEED) to make the job easier for the OS ...

- With SLES11SP1, our kernel treats unmapped pagecache with lower prio, resulting in less swapping caused by filesystem IO. We expect that this will be sufficient to cure most problems reported here.

- If still too much swapping happens, we also have an experimental feature in SLES11SP1 that allows you to limit (unmapped) pagecache usage. Please get in touch with LinuxLab if you are interested in testing ...
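If that experimental feature works like the pagecache limit in later SLES kernels, using it would look something like this (a sketch under that assumption; the tunable name vm.pagecache_limit_mb and the value are not confirmed for SP1):

# sysctl -w vm.pagecache_limit_mb=4096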

markus_doehr2
Active Contributor
0 Kudos

Hi Kurt,

> A few remarks from the OS perspective:

> - We don't recommend too low swappiness values. 40 -- 75 is a sane range.

We currently use 0 - which eliminates the "races" we saw in the past (SLES 10 SP1) when doing backups, and we see highly decreased swap usage.

> - Backup apps could use direct I/O or madvise(POSIX_MADV_DONTNEED) to make the job easier for the OS ...

Well - I can't tell EMC to implement that in their backup software. And since Linux is the only OS having that problem (I don't see it on Solaris or HP-UX in that intensity), it's not necessarily a backup tool thing only. (Just my EUR 0.02.)

> - With SLES11SP1, our kernel treats unmapped pagecache with lower prio, resulting in less swapping caused by filesystem IO. We expect that this will be sufficient to cure most problems reported here.

That's really good to know - I'll try it as soon as it's out.

> - If still too much swapping happens, we also have an experimental feature in SLES11SP1 that allows you to limit (unmapped) pagecache usage. Please get in touch with LinuxLab if you are interested in testing ...

Thank you! It's good to see that something is being done.

Markus

Former Member
0 Kudos

> > - We don't recommend too low swappiness values. 40 -- 75 is a sane range.

> We currently use 0 - which eliminates the "races" we saw in the past (SLES 10 SP1) when doing backups, and we see highly decreased swap usage.

This is only safe since Aug 2007 -- before that, the system could loop in the VM with a depleted inactive list ...

Still not great for performance ...

> > - Backup apps could use direct I/O or madvise(POSIX_MADV_DONTNEED) to make the job easier for the OS ...

> Well - I can't tell EMC to implement that in their backup software. And since Linux is the only OS having that problem (I don't see it on Solaris or HP-UX in that intensity), it's not necessarily a backup tool thing only. (Just my EUR 0.02.)

It's a good idea to give the OS some info on what's important, so it does not have to guess ... Apparently, Slowlaris and hpux guess better here.

> > - With SLES11SP1, our kernel treats unmapped pagecache with lower prio, resulting in less swapping caused by filesystem IO. We expect that this will be sufficient to cure most problems reported here.

> That's really good to know - I'll try it as soon as it's out.

Two weeks to go. Feedback is welcome!

> > - If still too much swapping happens, we also have an experimental feature in SLES11SP1 that allows you to limit (unmapped) pagecache usage. Please get in touch with LinuxLab if you are interested in testing ...

> Thank you! It's good to see that something is being done.

You're welcome!


markus_doehr2
Active Contributor
0 Kudos

> It's a good idea to give the OS some info on what's important, so it does not have to guess ... Apparently, Slowlaris and hpux guess better here.

I told them, by using the appropriate mount option for the filesystem. Unfortunately there's no equivalent of mount -o convosync=direct or mount -o forcedirectio available for either Reiser or ext3. That would be the easiest option (for users/administrators) because, especially in database environments, the archive log area is usually on a separate disk/volume anyway.
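For comparison, on Solaris UFS the whole problem is one mount option (a sketch; device and mount point are hypothetical):

# mount -o forcedirectio /dev/dsk/c0t0d0s6 /archivelog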

> Two weeks to go. Feedback is welcome!

You'll get that for sure!

Markus

former_member184709
Participant
0 Kudos

Thanks Markus and Kurt for your continued discussion in this thread. This seems to be an emerging area, as Linux is now replacing a lot of enterprise-level UNIX workloads; before, it seemed to be adopted more for lighter workloads. I notice that we are now comparing Solaris and HP-UX to Linux, which speaks to the kind of big workloads we are seeing.

I do like Markus' idea of just setting everything at the filesystem level; that keeps things simple. Our Solaris servers do not have this issue either: typically we have free RAM available and see it being taken and then released back, with no paging contention between app/db operations and filesystem operations.

My goal for our SAP environments, which are moving off of Solaris to Linux, is to eliminate swapping entirely, unless that is just technically impossible.

It seems a lot of the Linux tuning specific to SAP is geared toward SLES, but hopefully many of the same concepts can be applied to Red Hat installations as well.

One more thread I found that seems relevant is one where Linus himself talks about the memory manager:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-01/msg02784.html

former_member184709
Participant
0 Kudos

We rebooted our server this weekend, but within 2 days it was already swapping up to 2 GB. There's 72 GB of RAM in that machine, so it has plenty of room to spare. Performance is better than ever on that server, but it's still frustrating to see it wanting to use swap like that.

I wish there was a "don't use swap" button or something haha.

Former Member
0 Kudos

> > Two weeks to go. Feedback is welcome!

> You'll get that for sure!

Did you get a chance to test SLES 11 SP1 already?

Any observations?

If you're only almost happy, we have some pagecache_limit settings for you to play with.

markus_doehr2
Active Contributor
0 Kudos

Hi Kurt,

> Did you get a chance to test SLES 11 SP1 already?

> Any observations?

> If you're only almost happy, we have some pagecache_limit settings for you to play with.

We are running SLES 11 SP1 as a hypervisor in our test environment; no logs are written there, so I can't tell.

In the next 2 months we'll get new hardware for our production systems, and there we'll see the immediate outcome, because they'll be installed with SLES 11 SP1.

Interestingly enough, I don't see this behaviour on SLES 9 SP3, where our production BW is (still) running. An equal amount of logs is backed up there, but SLES 9 is perfectly happy (2.6.5-7.244-smp) and does not swap at all during the backups.

Markus

former_member184709
Participant
0 Kudos

Sorry, I know this is an old thread, but recently, in the DB2 world, I came across a registry parameter in another forum thread that seems like it would help our Linux systems with this particular problem. Oracle also has a similar setting. This was not available on Linux in DB2 8.

http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.db2.udb.uprun.doc/doc/r0...

DB2_PINNED_BP

Setting this variable to YES causes DB2 to request that the operating system pins DB2's Database Shared Memory. When configuring DB2 to pin Database Shared Memory, care should be taken to ensure that the system is not overcommitted, as the operating system will have reduced flexibility in managing memory.
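Setting the registry variable is a one-liner (a minimal sketch; it must be run as the instance owner, and the instance must be restarted for it to take effect):

# db2set DB2_PINNED_BP=YES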

Answers (1)


nelis
Active Contributor
0 Kudos

Hi Markus,

We use Legato Networker too.

For the "archive logs" we just include them in the normal file system backup using the client which happens each night when the system is not being used. The archive logs are written to a separate SAN disk initially.

I gather you're piping these backups directly to Legato - how often?

Sorry if I'm not answering your original question; I'm just interested to know how other people use Legato.

Regards,

Nelis

markus_doehr2
Active Contributor
0 Kudos

Hi Nelis,

we back up the logs to a local filesystem, and savepnpc grabs them and puts them on tape.

Our database is 1.8 TB; we back up with two processes in parallel over Gbit Ethernet to an MSL 6000. Since we have no window where the system is not in use (night shifts; at about 2 am our time, China starts its work day), we have to do the backup during online hours.

We are a manufacturing company and have many batch processes running at night (up to 20 in parallel) doing a lot of I/O, which slows down the backup significantly. Currently we need about 5 hours for a full backup when no other systems are being backed up to the same tapes in parallel.

We recently tried backup-to-disk onto a different SAN but found that this backup is slower than piping directly to tape.

Our problem is that the backup of the files (archive logs), which are 10 GB each, completely thrashes the filesystem cache, making the system almost unresponsive. Depending on the number of logs created (up to 10), the system remains in that state for > 30 minutes, with intermittent "hangs" where several CPUs are at 100 % system time. A logon during that time is impossible.

I want to decrease the filesystem cache to 0 so that it's not used any more.

Markus