
Scalability and progress of data import into HANA

Former Member

Hi

I'm experimenting with importing and merging data into a single table on a HANA SPS6 Developer Edition VM on Cloudshare. My control file for the IMPORT looks like this:

import data
into table "LEONID"."CDRs"
from 'CDRs_01.gz_15000000_b.out'
    record delimited by '\n'
    field delimited by '|'
error log 'LEONID_CDRs.err'
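For context, a control file like this is executed from SQL with the control-file form of the command; the path below is illustrative (the file must be readable by the HANA server process):

```sql
-- Run the import described by the control file (server-side path is illustrative)
IMPORT FROM CONTROL FILE '/data/import/LEONID_CDRs.ctl';
```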

Unfortunately, I see that at most 1.5 of the 4 virtual CPU cores available on the VM are active during the IMPORT, while the MERGE is executed in a much more parallel way. In other words, the scalability of IMPORT seems to be quite limited. What can be done to overcome this? Is it expected behavior?

My other question is about logging the progress of the IMPORT operation. So far, the only way I have found to watch the progress of IMPORT and MERGE is to monitor the M_CS_TABLES view. Is it possible to see the progress in a textual log, or something like that?
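The M_CS_TABLES polling I'm doing can be scripted; a rough monitoring query (column names are from the standard view, the schema and table name are from my setup) looks like this:

```sql
-- Poll row count and memory size of the target table while the IMPORT runs
SELECT table_name,
       record_count,
       memory_size_in_total,
       raw_record_count_in_delta   -- rows still in the delta store, waiting for MERGE
FROM   m_cs_tables
WHERE  schema_name = 'LEONID'
  AND  table_name  = 'CDRs';
```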

Also, has anybody succeeded in using the M_IMPORT_BINARY_STATUS view? In my environment it is always empty. Given the view's description in the documentation ("status information for current session"), I probably just don't understand how to use it correctly; please help.

Thanks in advance

Leonid Gvirtz

Accepted Solutions (1)


rama_shankar3
Active Contributor

Leo,

In addition to Thomas's input, please be sure to review the article below and follow its tips.

http://www.saphana.com/community/blogs/blog/2013/04/07/best-practices-for-sap-hana-data-loads

Hope this helps.

Rama

Answers (3)


Former Member

Hi Thomas, Rama

I studied the blog post provided by Rama, did some experiments and have come to some interesting findings:

Using the THREADS clause of the IMPORT command can have a huge effect on the performance and scalability of IMPORT. This confirms John Appleby's findings in the blog post and completely contradicts the documentation, as cited by Thomas, which says "When importing a single table THREADS has no effect".

Specifically, importing 5,000,000 rows into my test table with auto-merge turned off runs as follows:

-- Import into non-partitioned table with default import options: 102 sec.

-- Import into partitioned table (5 round-robin partitions) with default import options: 95 sec.

-- Import into partitioned table with "THREADS 10": 39 sec.

-- Import into partitioned table with "WITH THREADS 10 BATCH 200000": 23 sec.

-- Import into non-partitioned table with "WITH THREADS 10 BATCH 200000": 27 sec.
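For reference, the faster runs above used a statement of roughly this shape (the server-side path is illustrative; the THREADS and BATCH values are the ones from my tests):

```sql
IMPORT FROM CSV FILE '/data/import/CDRs_01.gz_15000000_b.out'
INTO "LEONID"."CDRs"
WITH RECORD DELIMITED BY '\n'
     FIELD DELIMITED BY '|'
     THREADS 10
     BATCH 200000;
```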

As we can see, increasing the number of threads to 10 has a major effect on IMPORT performance. The effect could probably be even bigger, but all 4 virtual CPUs available on the Cloudshare Developer Edition VM (which is similar to m2.xlarge, as far as I understand) are already saturated while the IMPORT is running.

Round-robin partitioning has a relatively small effect on IMPORT performance in my case.

Increasing the batch size has a generally positive effect on performance, but it has a drawback. In particular, when I tried to import 15,000,000 rows instead of 5,000,000, increasing the batch size to 200,000 caused HANA to run out of memory, while the IMPORT of the same data finishes successfully with the default batch size. I have verified that the out-of-memory condition is caused by the increased batch size and not by something else.

In addition, I tried disabling HANA logging, with no measurable effect on either the IMPORT or the MERGE.

Any comments are welcome. In addition, I have a number of questions:

1. Why does increasing the import batch size cause over-allocation of memory? How much memory should be allocated for it?

2. When HANA runs out of memory during the IMPORT, I don't receive any error message; I can only see the problem in the HANA diagnostic logs. This is very confusing. Is this expected behavior?

3. What is the default batch size for the IMPORT? I have not managed to find it in the documentation so far.

4. Is it possible to make the IMPORT command load all the data (millions of rows) in a single transaction? Is that good practice in HANA?

Thanks in advance

Leonid Gvirtz

rama_shankar3
Active Contributor

Leo,

Please make sure that your AWS instance is at least m2.xlarge - since SP5 I have noticed that an m2.xlarge server is required.

Regards,

Rama

thomas_jung
Developer Advocate

The IMPORT DATA command has a parameter threads. The documentation has the following section on THREADS:

Number of threads to use

The THREADS parameter specifies how many objects will be imported in parallel; the default is 1. Increasing the number of threads reduces import time, but can also negatively affect database system performance. The following items should be considered when using this parameter:

- When importing a single table, THREADS has no effect.

- When importing a view or procedure, 2 or more threads should be used, up to the number of dependent objects.

- When importing a whole schema, consider using more than 10 threads, with the maximum being the number of CPU cores in the system.

- When importing a whole BW / ERP system database with tens of thousands of tables using the ALL keyword, a large number of threads can be used, e.g. up to 256.

So since you have only one table, THREADS won't help. It would seem the parallelization is designed for importing multiple objects with this command.
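Per the list above, a case where THREADS does apply is a multi-object import, e.g. restoring a whole previously exported schema. A sketch of that form, with the export path illustrative and the option syntax as I read it in the documentation:

```sql
-- Import every object of the schema from a binary export, 10 objects in parallel
IMPORT "LEONID"."*" FROM '/data/export/leonid'
WITH REPLACE THREADS 10;
```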