
HANA Tables - Fill gaps in big data

Former Member

Hi,

I have a large table with 10 million records.

Some values are NULL and I need to UPDATE them with the value from the row above.

Example :

Key  Attribute
A    1
A    2
A    NULL
B    6
B    NULL
B    NULL

Content of the table after the fixing process:

Key  Attribute
A    1
A    2
A    2
B    6
B    6
B    6

Are there any best practices to make this UPDATE fast?

Thank you

Accepted Solutions (0)

Answers (2)

Former Member

See my test case below:

create table test_fill_gaps
(
    type_id char(1) not null,
    attr_value int null
);

insert into test_fill_gaps values ('A', 1);
insert into test_fill_gaps values ('A', 2);
insert into test_fill_gaps values ('A', NULL);
insert into test_fill_gaps values ('B', 6);
insert into test_fill_gaps values ('B', NULL);
insert into test_fill_gaps values ('B', NULL);
commit;

create index idx_test_fill_gaps_type_id on test_fill_gaps (type_id);

update test_fill_gaps set
    attr_value = (select max(attr_value) from test_fill_gaps b
                  where test_fill_gaps.type_id = b.type_id)
where attr_value is null;


The general recommendation is to avoid procedural code if you can do the same in SQL.

Hope it helps

Leonid Gvirtz

former_member184768
Active Contributor

Hi Leonid,

My experience with a similar situation:

  • The table had about 300 million records.
  • I needed to update some fields where the value should come from the previous row, or from the next row if the previous row was null.
  • I used the select max(value) with a nested query, as you did.
  • It brought the system down.

The nested loop for the max values spawned numerous worker jobs, and they started blocking each other. Resource utilization hit the maximum, impacting everything else in the system.

We had to restart the HANA server.

My suggestion would be:

Before trying to run such a query, please try it with a table of a few thousand records and check the processes. If you see multiple worker jobs, then you may end up in a similar situation to mine. If you don't, then I guess my bad luck was really, really bad that day..
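
One way to check this (a sketch only; it assumes access to the standard monitoring view M_SERVICE_THREADS, and the exact thread type names may differ between revisions) is to watch the thread counts while the test UPDATE is running:

select thread_type, thread_state, count(*) as thread_count
from m_service_threads
group by thread_type, thread_state
order by thread_count desc;

A rapidly growing number of 'JobWorker' threads would be the symptom described above.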

Please keep us updated with the progress of this thread. It is quite an interesting topic.

Regards,

Ravi

rindia
Active Contributor

Thanks for the update Ravi. I was about to do the same but saw your message. Somewhat lucky today.

Former Member

If the attr_value is not sequential, and assuming that you could include one more column, I'd add a calculated column with the row number and execute an SQL statement like:

update test_fill_gaps set
    attr_value = (select b.attr_value from test_fill_gaps b
                  where (test_fill_gaps.row_number - 1) = b.row_number)
where attr_value is null;
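
A rough sketch of this idea against the test table from the first answer (not the exact statement proposed here), assuming a hypothetical ordering column "seq" (a column table has no inherent row order) and a revision that supports window functions:

create column table test_fill_gaps_rn as (
    select type_id, attr_value, seq,
           row_number() over (order by type_id, seq) as rn
    from test_fill_gaps
);

update test_fill_gaps_rn set
    attr_value = (select b.attr_value from test_fill_gaps_rn b
                  where b.rn = test_fill_gaps_rn.rn - 1)
where attr_value is null;

Note that a single pass only fills from the immediately preceding row; runs of consecutive NULLs need repeated passes, because the statement reads the pre-update state of the table.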

Former Member

Hi Ravi

The information that you have provided is very interesting. Actually, I'm quite new to HANA, and the behavior you have described is very surprising and unexpected to me. I will surely perform the test that you have suggested and will report the results.

Thanks

Leonid Gvirtz

former_member184768
Active Contributor

Completely at your own risk:

Can you please try it with a table of a few thousand records and see if you get multiple worker jobs?

Regards,

Ravi

former_member184768
Active Contributor

But then it would require a change to the table model. Also, the sequence numbers would have to be maintained whenever data is added to, modified in, or deleted from the table.

Regards,

Ravi

former_member184768
Active Contributor

Hi Leonid, Raj,

I am eagerly waiting for your test results. I still have a similar requirement, as I need to replicate data from 300 million records to about 5 billion records, updating some of the attribute values in the replicated records. The requirement is to update the attribute value in the replicated data with the previous record's value. If the previous record's value is null, then update with the next value...

As I mentioned, I implemented a workaround like the one Rodrigo describes below, but instead of a nested subquery I used a join. It works quite well for me.. or did work for 400 - 500 million records. Now, with more than 1 billion records, I need to find out whether there is a more effective way.
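
The exact join statement is not posted in this thread; a minimal sketch of the idea against the earlier test table would be to pre-aggregate the fill values once into a small helper table and then update against it, so the big table is not re-aggregated for every row:

create column table fill_values as (
    select type_id, max(attr_value) as fill_value
    from test_fill_gaps
    group by type_id
);

update test_fill_gaps set
    attr_value = (select fill_value from fill_values b
                  where b.type_id = test_fill_gaps.type_id)
where attr_value is null;

drop table fill_values;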

Thanks and regards,

Ravi

Former Member

Hi Ravi

I tested with 100,000 records - no problem so far, the UPDATE finished in 132 ms. I will try to repeat the test with millions and tens of millions of rows, probably tomorrow. Unfortunately, my resources are limited by the HANA Developer Edition VM on Cloudshare. Here is my test case:

create column table LEONID.test_fill_gaps
(
    type_id char(1) not null,
    attr_value int null
);

-- about 1/3 of records have attr_value of NULL
import from '/tmp/test_fill_gaps.ctl';

create index idx_test_fill_gaps_type_id on LEONID.test_fill_gaps (type_id);

MERGE DELTA OF LEONID.test_fill_gaps;

update LEONID.test_fill_gaps set
    attr_value = (select max(attr_value) from LEONID.test_fill_gaps b
                  where LEONID.test_fill_gaps.type_id = b.type_id)
where attr_value is null;

rindia
Active Contributor

Hi Ravi,

I am building a fact table with 10 million records to recreate this scenario; so far 1 million have been inserted. Once the loading is done, I will start working on it.

Looks like an interesting thread for us to work on.

Regards

Raj

Former Member

Hi Ravi

I tested the same scenario with 5, 10, 20 and 40 million rows - the problem didn't reproduce. Here is my test case:

create column table LEONID.test_fill_gaps
(
    type_id char(1) not null,
    attr_value smallint null
);

-- Import 1,000,000 rows, repeat the needed number of times
import from '/tmp/test_fill_gaps_1000000.ctl' WITH THREADS 5;

create index idx_test_fill_gaps_type_id on LEONID.test_fill_gaps (type_id);

MERGE DELTA OF LEONID.test_fill_gaps;

-- 5000000 rows  - 5868 ms.  Rows Affected: 1670510
-- 10000000 rows - 13397 ms. Rows Affected: 3341020
-- 20000000 rows - 26440 ms. Rows Affected: 6682040
-- 40000000 rows - 49221 ms. Rows Affected: 13364080
update LEONID.test_fill_gaps set
    attr_value = (select max(attr_value) from LEONID.test_fill_gaps b
                  where LEONID.test_fill_gaps.type_id = b.type_id)
where attr_value is null;

As you can see, the response time grows nearly linearly, which is expected because the number of rows to update also grows linearly. I monitored the CPU consumption on the HANA host using vmstat - nothing unusual, not even close to CPU saturation. Actually, I would have expected a better degree of parallelism. Unfortunately, I couldn't perform the test with a bigger number of rows because of the severe memory limitations of the HANA Developer Edition VM on Cloudshare (SPS6, 4 virtual CPU cores). I succeeded in importing 60 million rows, but the merge failed with out-of-memory errors for no apparent reason.

What version of HANA did you use when the problem occurred? To me it looks like a bug (not a feature), or maybe your real case is much more complicated than mine. It would be very strange (at the very least) if HANA required us to write procedural code to implement such simple update logic.

Hope it helps

Leonid Gvirtz

rindia
Active Contributor

Hi Leonid,

It would be very strange (at the very least) if HANA required us to write procedural code to implement such simple update logic.

I agree with you, but what about the case where the table has many columns and the column we are updating should not simply take the maximum value?

Regards

Raj

lbreddemann
Active Contributor

Hi Leonid,

you made some interesting remarks here:

Leonid Gvirtz wrote:

As you can see, the response time grows nearly linearly, which is expected because the number of rows to update also grows linearly.

Why do you expect this to be linear? Is the linear growth of input set size the only reason for your assumption?

Leonid Gvirtz wrote:

Actually, I would have expected a better degree of parallelism.

...

I succeeded in importing 60 million rows, but the merge failed with out-of-memory errors for no apparent reason.

Uh? So what degree of parallelism did you expect?

And why?

What do you expect the database to do with your data in order to parallelize more?

And also: how did you measure the degree of parallelism?

Or could it be that you mistake the % of CPU time consumption for it?

Don't get me wrong, but I would really like to understand where your expectations come from.

Concerning the OOM - well, it's very likely that the error occurred because there was not enough memory available for the merge...

- Lars

former_member184768
Active Contributor

Hi Lars,

Thanks for pitching in. What is your recommendation? Have you ever faced a situation where you had to update data with a nested / correlated subquery in HANA?

In my case, the table structure was quite big (around 215 columns, 65 characteristics and 150 KFs). With such a big table and millions of records, the max(value) approach did not work very well.

In your opinion and experience, is there any other, more optimal way?

Regards,

Ravi

lbreddemann
Active Contributor

Hi Ravi,

As usual, there is no one-size-fits-all best approach.

I still would like to know the bottleneck first and optimize against that.

For update statements like this, it is usually necessary to understand the nature of the missing data.

If there are only a few actual "source" groups for the data, it could be worthwhile to go down the path of writing SQLScript to first create the filler information and only access the records with the missing values in a second run.

If the source data set is really huge, one could also consider breaking it up into several partitions (not table partitions!) and running them as separate update commands.
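
For example (the key ranges here are purely illustrative, not actual code from this thread), the single statement from the earlier test case could be split into a few smaller ones that each touch only part of the data:

update test_fill_gaps set
    attr_value = (select max(attr_value) from test_fill_gaps b
                  where b.type_id = test_fill_gaps.type_id)
where attr_value is null
  and type_id between 'A' and 'M';

update test_fill_gaps set
    attr_value = (select max(attr_value) from test_fill_gaps b
                  where b.type_id = test_fill_gaps.type_id)
where attr_value is null
  and type_id between 'N' and 'Z';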

And remember, with SAP HANA you can effectively partition both vertically and horizontally.

Things like stopping the delta merge during the run might also improve the total runtime.

You see, there are many possible options. Which one of them will finally improve the runtime of a specific data load/update scenario totally depends on the bottleneck at hand.

I know this is not the answer everybody is looking for, but that's what I know about this matter.

Cheers, Lars

Former Member

Hi Lars

Thanks for your input. To your remarks:

I expected the growth in the response time of the UPDATE statement to be roughly proportional to the number of rows to update, which grew linearly in my test. I thought that the update/logging part of the statement would be by far the most expensive, without performing enough tests (my fault). Actually, I'm not sure how I can check the performance of a SELECT like the one listed below vs. the performance of the UPDATE without the overhead of fetching millions of rows to the client and with minimal possible logging. The example below takes 15 seconds vs. 49 seconds for the UPDATE:

create column table LEONID.test_fill_gaps_join
(
    type_id char(1) not null,
    attr_value smallint null,
    max_attr_value smallint null
) no logging without auto merge;

insert into LEONID.test_fill_gaps_join
select a.type_id, b.attr_value, a.max_attr_value
from
    (
        select type_id, max(attr_value) max_attr_value
        from LEONID.test_fill_gaps
        group by type_id
    ) a,
    LEONID.test_fill_gaps b
where a.type_id = b.type_id
and b.attr_value is null;


Anyway, the growth of the UPDATE response time appears to be proportional to the number of updated rows and to the table size, regardless of my (probably wrong) assumptions. Could you provide a better explanation for it? My test is fully reproducible; I can send you my test dataset if needed.

I measured the degree of parallelism by the value of user% using "vmstat 5". In my opinion, it is quite a good metric, given that no other activity runs in the system and physical I/O doesn't represent a performance bottleneck. I have 4 CPU cores on my system, so if I see mostly values close to 25% then I can assume that HANA uses only about 1 CPU core out of 4. I'm quite sure that behind the scenes HANA employs multiple threads to process my UPDATE statement, and that these threads process my query in parallel. But, at the bottom line, HANA consumes fewer of the available resources than it probably could to answer my query as quickly as possible. That's what I meant by "degree of parallelism".

As to my expectations about the parallelism - I didn't have any specific figure in mind. I just saw that HANA doesn't fully utilize the available CPU resources for most of the update execution, with some CPU consumption spikes that can probably be attributed to automatic merges.

Regarding possible scenarios with more parallelism - I can imagine a database (not necessarily HANA) splitting the input dataset into a number of chunks that can be processed independently in parallel, such as the different type_id's in my test query. Of course, I understand that not every action can or should be parallelized. By the way, I inserted the same data into SAP Sybase IQ 16, which is generally really good at parallel processing, and tried the same query after putting adequate indexes on the table. While the response time of the UPDATE turned out to be about twice as good as in HANA, the degree of parallelism was still quite low: IQ used at most 2 CPU cores, the IQ query plan confirmed it, and the number of allocated threads was also very low (mostly 1).

I'm not sure whether the comparison of response times between IQ and HANA is valid in this case, because my test IQ runs on a much more powerful physical server (16 modern CPU cores with hyper-threading), even though the CPU consumption during the UPDATE statement was quite low. So, the bottom line is that my expectations for query parallelism in this specific case may be too optimistic.

If I understand you correctly, there is a better method to measure query parallelism in HANA. Could you explain more about it or provide relevant links? I didn't see it in the query plans.

Regarding the out-of-memory errors that I received - most probably you are right; I didn't take into account all the memory requirements during the merge.

Generally, my goal here is to learn HANA and compare it to SAP Sybase IQ, another column-based database that I have years of experience with. I'd like to understand what to expect from HANA and how HANA and IQ can be used together.

Implementing the solution for the problem described by the original poster using SQL (and not procedural code) looked quite natural to me. Ravi’s post about his negative experience with an UPDATE statement made me curious and I decided to build a simple test case to be sure that at least this test case works OK. Now I see that Ravi’s case is much more complicated than my intentionally simplified test case and I agree that more analysis is required to find the real cause of the problem.

Thanks

Leonid Gvirtz


Former Member

Hi Raj

For more complicated cases, more complicated solutions may be required. I have just proposed a solution for the original poster's problem. I'm not sure I understand the scenario you propose correctly; I would need more details to provide an answer.

Generally, my approach is to use SQL instead of procedural code, unless it is proven that the procedural code is indeed better. Sometimes, splitting a single SQL statement into a small number of simpler SQLs may be beneficial, usually because of specific query optimizer limitations. Statistically, implementing the solution using SQL (set-based processing) will be better than a cursor-based approach (row-by-row processing) in the vast majority of cases. All of the above is based on my experience with a number of SAP and non-SAP database products (not HANA!). It may be different in HANA, but I doubt it.

Thanks

Leonid Gvirtz

rindia
Active Contributor

Hi Carlo,

You can do this by using a procedure where you loop over the table and, if the value is null, update it with the previous value.

Regards

Raj

former_member184768
Active Contributor

Looping over a table of 10 million records? Good luck with that.. but I guess HANA should be able to handle it. The approach is good..

Defining a cursor, getting the data of the previous row into a set of variables, checking the data in the current row for null, and updating from the variables will work technically. But I am not sure how efficient that would be. Secondly, this approach is OK if it is a one-time activity. If it has to be repeated over and over, then I don't think it would be a good idea.
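
A rough SQLScript sketch of that cursor approach, reusing the test table from the earlier posts with an assumed key/ordering column "id" (exact syntax may vary between revisions); it issues one UPDATE per NULL row, which is exactly the cost discussed further below:

create procedure fill_gaps_with_cursor()
language sqlscript as
begin
    declare v_prev_value int;
    declare cursor c_rows for
        select id, type_id, attr_value
        from test_fill_gaps
        order by type_id, id;   -- "id" is an assumed ordering/key column

    for r as c_rows do
        if r.attr_value is null then
            -- one single-row UPDATE per NULL value: correct, but expensive
            update test_fill_gaps set attr_value = :v_prev_value
             where id = r.id;
        else
            v_prev_value := r.attr_value;
        end if;
    end for;
end;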

Regards,

Ravi

Former Member

Maybe the problem is not the loop, but the UPDATE that will be executed for each record.

I am looking for a cursor that permits modifying the data directly, like ABAP field symbols.

Thank you