SIGSEGV (Signal 11) crashdumps in TimerThread after enabling "unused_retention_period" (Note 2127458)

Former Member

Greetings all,

We are just surviving on a now-undersized BW 7.3 SP8 on HDB rev83 (6-node scale-out). While we wait for new hardware over the next few months, we are exploring ways to keep our heads above water for longer. One of these options is letting HANA unload columns that have not been used for some time (e.g. non-reporting data in a typical LSA propagation layer) to avoid main-memory contention during the day while users run their reports. We are aware this is not the SAP-recommended approach and are only exploring it as a workaround.

2127458 - FAQ: SAP HANA Loads and Unloads

This note describes a way to have columns displaced from memory once they have been unused for a specified retention period.
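For reference, here is a minimal Python sketch that builds the SQL statements for switching the parameter on, either at the 'System' layer or per host (the two variants we tested). It assumes, based on our reading of Note 2127458, that the parameter lives in global.ini, section [memoryobjects], and takes a value in seconds, with 0 meaning disabled. The function name `retention_config_sql` and the host names are illustrative only; please check the note for your revision before running anything.

```python
def retention_config_sql(seconds, hosts=None):
    """Build ALTER SYSTEM statements that set unused_retention_period.

    Assumption: the parameter is global.ini -> [memoryobjects] ->
    unused_retention_period, in seconds (0 = disabled), per Note 2127458.
    hosts=None targets the SYSTEM layer; otherwise one HOST-layer
    statement is generated per host name.
    """
    if seconds < 0:
        raise ValueError("retention period must be >= 0 seconds (0 disables it)")
    setting = (
        "SET ('memoryobjects', 'unused_retention_period') = "
        f"'{int(seconds)}' WITH RECONFIGURE;"
    )
    if hosts is None:
        return [f"ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM') {setting}"]
    return [
        f"ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'HOST', '{host}') {setting}"
        for host in hosts
    ]
```

For example, `retention_config_sql(4 * 3600)` yields a single SYSTEM-layer statement, while passing `hosts=["saphdp-an3", "saphdp-an4", "saphdp-an5"]` yields one HOST-layer statement per slave host.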

When we tested with this enabled 'System'-wide across all hosts, we found that the master node was crashdumping and rte-dumping frequently.

Here is a sample of the trace files generated:

indexserver_saphdp-an1.30003.crashdump.20150427-155006.060789.trc

indexserver.perspage.20150427-134110000.0x00000000002ad1e8L.0x0000c00000120000P.1.ok.dmp

indexserver_saphdp-an1.30003.rtedump.20150427-134037.060789.page.trc

indexserver.perspage.20150427-134042000.0x00000000002ad1e8L.0x0000c00000120000P.0.corrupt.dmp

indexserver_saphdp-an1.30003.crashdump.20150427-132538.058793.trc

indexserver.perspage.20150427-132101000.0x00000000002ad1e8L.0x0000c00000120000P.9.corrupt.dmp

indexserver_saphdp-an1.30003.rtedump.20150427-132032.058793.page.trc

indexserver.perspage.20150427-132033000.0x00000000002ad1e8L.0x0000c00000120000P.0.corrupt.dmp

We then tested with it enabled only on the slave hosts, and found that they crashdumped occasionally.

Here is a sample of the trace files generated:

indexserver_saphdp-an4.30003.crashdump.20150505-064045.050854.trc

indexserver_saphdp-an3.30003.crashdump.20150503-152138.003832.trc

indexserver_saphdp-an4.30003.crashdump.20150503-131004.037534.trc

indexserver_saphdp-an5.30003.crashdump.20150503-074249.029552.trc

indexserver_saphdp-an3.30003.crashdump.20150503-021555.040263.trc

indexserver_saphdp-an5.30003.crashdump.20150502-182801.028108.trc
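With this many dumps, a quick tally per host can help show where the crashes concentrate. The sketch below parses file names like the ones listed above. The layout `<service>_<host>.<port>.<kind>.<YYYYMMDD-HHMMSS>.<id>[.<suffix>].trc` is inferred from these examples only; treating the last numeric field as a process ID is a guess, and non-matching names such as the perspage `.dmp` files are simply skipped.

```python
import re
from collections import Counter
from datetime import datetime

# Layout inferred from the file names in this thread (an assumption):
# <service>_<host>.<port>.<kind>.<YYYYMMDD-HHMMSS>.<pid>[.<suffix>].trc
TRACE_NAME = re.compile(
    r"^(?P<service>[a-z]+)_(?P<host>[^.]+)\.(?P<port>\d+)\."
    r"(?P<kind>crashdump|rtedump)\.(?P<stamp>\d{8}-\d{6})\.(?P<pid>\d+)"
    r"(?:\.[a-z]+)?\.trc$"
)

def parse_trace_name(name):
    """Split a crashdump/rtedump file name into its parts, or return None."""
    m = TRACE_NAME.match(name)
    if m is None:
        return None
    info = m.groupdict()
    info["port"] = int(info["port"])
    # pid stays a string so leading zeros (e.g. "050854") are preserved
    info["time"] = datetime.strptime(info.pop("stamp"), "%Y%m%d-%H%M%S")
    return info

def crashes_per_host(names):
    """Count crashdump (not rtedump) files per host."""
    counts = Counter()
    for name in names:
        info = parse_trace_name(name)
        if info is not None and info["kind"] == "crashdump":
            counts[info["host"]] += 1
    return counts
```

Applied to the six slave-host file names above, each of saphdp-an3, saphdp-an4 and saphdp-an5 gets a count of 2.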

In the crashdump trace files, we notice it is always a Signal 11 on the TimerThread of the indexserver that crashed. Here is a summary (the key words below are found in every crashdump trace file):

"[CRASH_SHORTINFO]  exception short info: (2015-05-05 06:40:45 366 Local)

SIGNAL 11 (SIGSEGV) caught, thread: 4302[thr=nnnnn]: TimerThread, etc..."

Also, in case it helps, the signal code reported with Signal 11 is 1:

"[CRASH_EXTINFO]  extended exception info: (2015-05-05 06:40:45 367 Local)

----> Dump of siginfo contents <----

  signal:      11(SIGSEGV)

  code:        1(SEGV_MAPERR: address not mapped to object)"
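When sifting through many crashdump files, a simple text check for this particular signature can separate these crashes from unrelated ones. A minimal sketch, assuming the [CRASH_SHORTINFO] and siginfo sections look like the excerpts above (the exact layout may differ between revisions); `is_timer_thread_segv` and `matching_dumps` are hypothetical helper names.

```python
import re
from pathlib import Path

# Patterns taken from the excerpts quoted above; the layout is assumed stable.
SHORT_INFO = re.compile(r"SIGNAL 11 \(SIGSEGV\) caught, thread: [^\n]*TimerThread")
SEGV_MAPERR = re.compile(r"code:\s*1\(SEGV_MAPERR")

def is_timer_thread_segv(trace_text: str) -> bool:
    """True if the trace reports SIGSEGV on a TimerThread with code SEGV_MAPERR."""
    return bool(SHORT_INFO.search(trace_text)) and bool(SEGV_MAPERR.search(trace_text))

def matching_dumps(trace_dir):
    """Names of crashdump files in trace_dir that show this signature."""
    return sorted(
        p.name
        for p in Path(trace_dir).glob("*.crashdump.*.trc")
        if is_timer_thread_segv(p.read_text(errors="replace"))
    )
```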

We have logged an SAP message and are going through the motions of opening system access, uploading trace files, etc.

Just wanted to check with the community here: have you encountered and/or resolved similar crashdump issues on Rev83?

Thanks.

Accepted Solutions (0)

Answers (1)

Former Member

Here's the SAP reply:

"Dear customer,

This is a known crash. Unfortunately, we do not know the root cause.  Therefore, we implemented (sic) further tracing starting with Revision 91.

Best regards.

Moritz"

The Note's parameter is a great way to bring forward the unloading of columns that have been unused for, say, 4 hours, instead of waiting until HANA hits its threshold for memory contention. In our now-undersized rev83 scale-out system, we have found that whenever HANA memory is under peak operational stress, LRU unloading cannot keep up with the flood of data requesting to be loaded into memory. This leads to "oom/rtedump" on the slave hosts, where most of the column stores reside. When the users repeat their BW queries 5 minutes later, the required data is then available. It is a little frustrating, although this only occurs about once a week.
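To make the difference concrete, here is a toy model of the two policies. It is a simplification under our own assumptions (whole tables, sizes in GB, a single memory limit, timestamps in seconds), not HANA's actual unload algorithm, and all names are hypothetical. The retention-style policy frees idle objects ahead of time, whereas the threshold-style LRU policy only starts evicting once memory is already over the limit, which is exactly when a flood of load requests is competing for it.

```python
from dataclasses import dataclass

@dataclass
class LoadedTable:
    name: str
    size_gb: float
    last_access: float  # seconds since an arbitrary start (toy model)

def unload_by_retention(tables, now, retention_s):
    """Retention-style: unload anything idle longer than retention_s,
    regardless of current memory pressure."""
    return [t.name for t in tables if now - t.last_access > retention_s]

def unload_by_lru(tables, used_gb, limit_gb):
    """Threshold-style: only once used_gb exceeds limit_gb, evict the
    least recently used tables until usage is back under the limit."""
    victims = []
    for t in sorted(tables, key=lambda t: t.last_access):
        if used_gb <= limit_gb:
            break
        victims.append(t.name)
        used_gb -= t.size_gb
    return victims
```

With three tables last touched 6 h, 5.5 min and about 4.6 h ago, a 4 h retention unloads the first and third proactively, while the LRU policy does nothing until memory is actually over the limit.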

Pity about the Signal 11 crashdumps triggered by the Note's parameter; otherwise this would be the panacea until the hardware is upgraded.

We have consultants hard-selling an unproven bespoke application that unloads unused columns at varying times, with the BW tables specified manually. Great concept but, needless to say, it is unlikely to have SAP's blessing, and it would be money down the drain once SAP resolves the Signal 11 crashdumps.

Would be interested in the community's view.

Thanks.

lbreddemann
Active Contributor

Besides the fact that you cannot unload single columns: if the workaround actually leads to fewer crashes and less downtime, why not go for it?

Implementing a solution that unloads tables that haven't been touched for some time is certainly not so difficult that it would justify a big expenditure. After all, it doesn't have to be perfect, just reasonable.

So, checking overall memory consumption and then unloading tables for which no column has been touched recently (not sure about SPS 8, but SPS 9 has M_CS_COLUMNS.LAST_ACCESS_TIME) should be doable in a simple procedure.

Schedule to run this regularly and you should be good to go until more RAM arrives.
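A minimal Python sketch of that logic follows. It assumes the per-column rows have already been fetched, for example with something like `SELECT SCHEMA_NAME, TABLE_NAME, LAST_ACCESS_TIME FROM M_CS_COLUMNS`, and that current memory use and a limit (in GB) come from whatever monitoring you already have. Both the query and the function names are illustrative, not an SAP-provided interface, and the 4-hour default is just the example figure mentioned earlier in the thread. It only generates table-level `UNLOAD` statements; execution and scheduling are left to you.

```python
from datetime import datetime, timedelta

def quote_ident(name: str) -> str:
    """Double-quote an identifier (BW names such as /BIC/... need it)."""
    return '"' + name.replace('"', '""') + '"'

def idle_table_unloads(column_rows, now, used_gb, limit_gb, idle=timedelta(hours=4)):
    """Return UNLOAD statements for tables with no recently accessed column.

    column_rows: iterable of (schema, table, last_access_time), one per column;
    last_access_time may be None if the column was never accessed.
    Nothing is unloaded while used_gb is within limit_gb.
    """
    if used_gb <= limit_gb:
        return []
    # Most recent access over all columns of each table
    latest = {}
    for schema, table, last_access in column_rows:
        seen = last_access or datetime.min
        key = (schema, table)
        latest[key] = max(latest.get(key, datetime.min), seen)
    cutoff = now - idle
    return [
        f"UNLOAD {quote_ident(schema)}.{quote_ident(table)};"
        for (schema, table), last_seen in sorted(latest.items())
        if last_seen < cutoff
    ]
```

Unloading whole tables rather than columns matches the limitation noted above; a table stays loaded as long as any one of its columns was used within the idle window.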

Sometimes bugs can be terribly difficult to analyze, especially in a massively multi-threaded environment like SAP HANA. Adding more tracing may then be the only way to tackle the issue, as frustrating as this is for everybody.

- Lars

Former Member

Hi Lars,

Appreciate you taking the time to reply.

Actually, we already have a daily workaround: a memory-reset job, scheduled to run before our daily BW processing starts at midnight, that unloads all BW A/B/F tables (e.g. DSO/PSA/cubes). The BW ETL jobs complete around 7am, before the business users start running their reports. Those BW process chains typically include unloading the non-reporting DSO tables after their ETL processing.

During the day, we schedule repeats of the same table unloads, since our IT team may also load those non-reporting tables into memory while investigating and resolving data issues raised by users.
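A sketch of how the table selection for such a reset job could look. It assumes the standard BW naming in which the letter after the /BIC/ (customer) or /BI0/ (SAP-delivered) namespace prefix marks the table type: A for DSO tables, B for PSA tables, F for InfoCube fact tables. The `keep` set is a hypothetical stand-in for any tables that should stay loaded during the daytime runs, and the schema and table names used with it are made up.

```python
import re

# Assumed BW naming: /BIC/ or /BI0/ followed by the table-type letter (A, B or F).
BW_ABF_TABLE = re.compile(r"^/BI[C0]/[ABF]")

def quote_ident(name: str) -> str:
    """Double-quote an identifier so names containing '/' are valid SQL."""
    return '"' + name.replace('"', '""') + '"'

def bw_reset_statements(schema, table_names, keep=frozenset()):
    """UNLOAD statements for BW A/B/F tables, skipping anything in `keep`."""
    return [
        f"UNLOAD {quote_ident(schema)}.{quote_ident(name)};"
        for name in table_names
        if BW_ABF_TABLE.match(name) and name not in keep
    ]
```

The midnight run would pass an empty `keep`; a daytime run could pass the reporting tables so that only the non-reporting ones are unloaded again.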

By the way, we were told their custom application can unload individual, infrequently used columns, presumably based on the LAST_ACCESS_TIME column (yes, it is also in SPS8). I am glad you have clarified that this is not currently possible.

So thanks, that saved us from beta-testing it for them, not to mention the extra administrative overhead of maintaining and supporting those table entries manually (even if it were possible to unload single columns).


- Boon