
Oracle "DISTINCT" vs. "SORT + Delete adjacent duplicates"

peter_strauss
Participant
0 Kudos

Hello,

I have a simple "select distinct" statement on a small table. This single statement accounts for over 80% of all CPU time in the DB.

There are quite a lot of comments that in ABAP code SELECT DISTINCT should be avoided and replaced by a SELECT into an internal table followed by SORT + DELETE ADJACENT DUPLICATES.

I cannot understand why this would be a general recommendation.

Is ABAP better than Oracle (or other databases) at filtering duplicates?

I have a feeling that if I do this I will reduce CPU time on the DB but increase it (maybe to a greater extent) on the application servers.

Are there certain criteria where this recommendation makes sense?

At a general level this is a question about ABAP; however, I would be interested in the opinion of Oracle experts, so I'm posting here.

Accepted Solutions (1)

Former Member
0 Kudos

Hi Peter,

Because "SELECT DISTINCT" performs a sort operation. If you look at the same query with and without the DISTINCT keyword, both execution plans should be the same. Under these circumstances the data read strategy does not differ between the two queries; the only difference is the sort operation performed while reading the dataset with DISTINCT. Using SELECT DISTINCT is not a performance penalty when the table contains only a few records, but it becomes a serious performance problem when the dataset to be sorted is large.

For this reason, it is good to analyze the characteristics of the table first, before deciding how to access the data.

From the ABAP point of view, you just read the dataset (for example via a secondary key), sort it in memory and then delete the duplicates there, so the whole operation is done in memory. I am not completely sure what happens behind the scenes of the DISTINCT statement at the Oracle level, but I am fairly sure that more additional operations are performed there than for a simple ABAP call.

However, it is important to understand why the records are duplicated in the table in the first place. Another approach may be to apply the normalization forms and read the data with the correct indexes to optimize your application.

Best regards,

Orkun Gedik

peter_strauss
Participant
0 Kudos

Hello Orkun,

I did a little testing on the production system.

Assuming that the sort operation would also be done in ABAP I just ran the SQL select on the DB without the DISTINCT clause.

The surprising thing was that the select WITHOUT the distinct clause was actually slower (and had higher CPU time).

The number of records returned without the distinct was close to the number of rows in the table, so I realised that there was no point using the index. Running the same select, including the DISTINCT and adding a FULL hint reduced the CPU time by about half.

(The table only has about 10,000 rows, so a full table scan doesn't take long at all.)
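For reference, a sketch of the hinted variant, using the placeholder table and field names that appear in the autotrace output further down (the exact statement is not reproduced here, and the hint placement is my own):

-- Sketch only: placeholder names, hint placement assumed
SELECT /*+ FULL(TABLEA) */ DISTINCT "FIELD1", "FIELD2", "FIELD3", "FIELD4"
  FROM "TABLEA"
 WHERE "MANDT" = :A0 AND "FIELD1" = :A1 AND "FIELD3" = :A2 AND "FIELD5" = :A3;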

I take your point about normalization.

Back to my original question: I suspect that the advice to always use DELETE ADJACENT DUPLICATES rather than DISTINCT might come from the fact that when a DISTINCT clause is used the ABAP table buffer is bypassed. However, my experience with the current problem is that it is impossible to make a general statement in favour of DELETE ADJACENT DUPLICATES over DISTINCT, and that it may actually make the situation worse.

Kind regards,

Peter

Former Member
0 Kudos

>> The surprising thing was that the select WITHOUT the distinct clause was actually slower (and had higher CPU time).

Make sure that the performance difference is not caused by reading the data from the application server buffer and/or the database buffer cache.

Maybe it is better to share both SQL statements, the tkprof output and the execution plans here, so we can discuss the results in more detail.
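For example, a raw SQL trace for tkprof can be produced roughly like this (just a generic sketch; in an SAP system the trace is usually taken via ST05/ST12 instead):

-- Sketch only: generic 10046 trace, then format the resulting trace file with tkprof
alter session set tracefile_identifier = 'distinct_test';
alter session set events '10046 trace name context forever, level 12';
-- run the statement under test here
alter session set events '10046 trace name context off';
-- then on the database server: tkprof <tracefile>.trc distinct_test.prf sys=no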

Best regards,

Orkun Gedik

peter_strauss
Participant
0 Kudos

Thanks for the offer, but I think this will lead us off topic.

Former Member
0 Kudos

I don't think that the DISTINCT statement generally performs worse than the equivalent ABAP statements. It cannot be a general rule, as I noted previously.

Best regards,

Orkun Gedik

Answers (2)

stefan_koehler
Active Contributor
0 Kudos

Hi Peter,

> This single statement accounts for over 80% of all CPU time in the DB.

Where do you get this information from? AWR? How often is this single SQL executed in a specific time frame? Is the CPU load caused by a single run or by the sum of all executions? Just looking at the shared pool (e.g. via DBACOCKPIT) gives you misleading values, as other SQLs are aged out and so on.

> There are quite a lot of comments that in ABAP code SELECT DISTINCT should be avoided and replaced by a SELECT into an internal table followed by SORT + DELETE ADJACENT DUPLICATES.

That might work pretty well if the data set is not very large, but just think about the memory needed if you have a large dataset. The large amount of data has to be transferred to the application server first and processed afterwards.

> Because, "SELECT DISTINCT" does sort operation. If you take a look at same queries with the DISTINCT statement and without the DISTINCT statement, both execution plans should be same.

@Orkun: No, it certainly does not. That was the case before Oracle 10g; Oracle introduced hash aggregation in 10g. Here is just a tiny test case on my 11.2.0.3.2 database.

http://oracle-randolf.blogspot.de/2011/01/hash-aggregation.html

SYS@T11:133> create table TESTAB as select * from dba_objects;

SYS@T11:133> select distinct(owner) from TESTAB;
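If you want to verify which aggregation operation was chosen, one way is to display the plan of the last executed statement right afterwards, for example:

-- Show the plan of the last statement executed in this session;
-- on 10g and later it typically shows HASH UNIQUE instead of SORT UNIQUE
select * from table(dbms_xplan.display_cursor);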

> The surprising thing was that the select WITHOUT the distinct clause was actually slower (and had higher CPU time).

The question here is - are the execution plans the same? How did you find out the "higher CPU time"? CPU is usually caused by excessive logical I/O or specific functions (like hashing and so on). Oracle SQL Monitor is perfect for determining what causes the CPU usage.

> Running the same select, including the DISTINCT and adding a FULL hint reduced the CPU time by about half. (the table only has about 10000 rows so a full table scan doesn't take long at all).

That observation confirms the suspicion about a different execution plan. It seems like an index access and the corresponding table access (by rowid) cause unnecessary logical I/O (maybe a bad clustering factor as well), and high CPU time is the consequence. Execution plan information would make everything clear (especially with possible filters). By the way, the number of rows has nothing to do with the work that needs to be done by a full table scan (high water mark): you can have a table with only one row and still read hundreds of GBs by FTS.

> lack of official supporting documentation is what brought me here.

ABAP developers are usually not very good SQL developers, and so they do what they know best: transfer everything to the ABAP application server and handle it there. This might work at first (when the dataset is not huge and the parallel usage is not high), but unfortunately it will not scale properly.

The database engine is able to do it much more efficiently in many cases (if the SQL is written well and the database structure fits), but you need to know how to handle the different database platforms to get the best result.

Regards

Stefan

peter_strauss
Participant
0 Kudos

> Where do you get this information from? ...

Yes, the information came from the DBACOCKPIT cache analysis. I understand that other SQL is aged out of the cache; however, this single statement accounts for more than 10 times the executions and CPU of any other statement displayed, so it seemed worth taking a look at (though I understand that saying "80%" is meaningless in this context).

To be cynical, if I can tune this statement it will look good to the customer based on information from the SQL cache.

> The question here is - are the execution plans the same?

Yes, the execution plans were the same. I don't expect you to believe this without evidence, but it's Friday afternoon here in Japan and I'm heading home in ten minutes max. I'll update later.

Thank you for your help!

Kind regards,

Peter

Back again.

Here is the Autotrace for both statements (one with the distinct clause and the other without).

SQL> SELECT DISTINCT "FIELD1", "FIELD2", "FIELD3", "FIELD4"
       FROM "TABLEA"
      WHERE "MANDT" = :A0 AND "FIELD1" = :A1 AND "FIELD3" = :A2
        AND "FIELD5" = :A3;

1125 rows selected.

Elapsed: 00:00:00.04

Execution Plan

----------------------------------------------------------

Plan hash value: 2879723666

---------------------------------------------------------------------------------------------

| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |

---------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT             |              |  1231 | 32006 |     8  (25)| 00:00:01 |

|   1 |  HASH UNIQUE                 |              |  1231 | 32006 |     7  (15)| 00:00:01 |

|   2 |   TABLE ACCESS BY INDEX ROWID| TABLEA     |  1231 | 32006 |     6   (0)| 00:00:01 |

|*  3 |    INDEX RANGE SCAN          | TABLEA~ID1 |  1477 |       |     2   (0)| 00:00:01 |

---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   3 - access("MANDT"=:A0 AND "FIELD1"=:A1 AND "FIELD3"=:A2 AND "FIELD5"=:A3)

Statistics

----------------------------------------------------------

          0  recursive calls

          0  db block gets

        157  consistent gets

          0  physical reads

          0  redo size

      28625  bytes sent via SQL*Net to client

       1302  bytes received via SQL*Net from client

         76  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

       1125  rows processed

SQL> SELECT "FIELD1", "FIELD2", "FIELD3", "FIELD4"
       FROM "TABLEA"
      WHERE "MANDT" = :A0 AND "FIELD1" = :A1 AND "FIELD3" = :A2
        AND "FIELD5" = :A3;

8696 rows selected.

Elapsed: 00:00:00.09

Execution Plan

----------------------------------------------------------

Plan hash value: 2427608097

--------------------------------------------------------------------------------------------

| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |

--------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |              |  1231 | 32006 |     7  (15)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| TABLEA     |  1231 | 32006 |     6   (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | TABLEA~ID1 |  1477 |       |     2   (0)| 00:00:01 |

--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   2 - access("MANDT"=:A0 AND "FIELD1"=:A1 AND "FIELD3"=:A2 AND "FIELD5"=:A3)

Statistics

----------------------------------------------------------

          0  recursive calls

          0  db block gets

       1311  consistent gets

          0  physical reads

          0  redo size

     196181  bytes sent via SQL*Net to client

       6857  bytes received via SQL*Net from client

        581  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

       8696  rows processed

SQL> set autotrace off

SQL> select count(*) from "TABLEA";

  COUNT(*)

----------

      9848

Elapsed: 00:00:00.01

SQL>

This was the second execution.

I can't figure out why the select with the DISTINCT is faster. I thought the same number of rows would be fetched first and then the HASH UNIQUE operation applied to the fetched dataset, but what we see is that the number of consistent gets is lower when the DISTINCT is included. Can you make sense of this?

By the way, you can see that this select is very small; it is just executed very often.

There is no CPU bottleneck on the system, so tuning this statement is not critical.

Peter


Message was edited by: Peter Strauss Finished off my post

Former Member
0 Kudos

Interesting, I did a bunch of tests and can observe the same behavior. I suspected that with DISTINCT over an index the database could skip a few blocks compared to the full select scenario, but that does not seem to be the case. Then again, SQL*Plus might be a bad tool for testing stuff like this.

One other difference is obviously the size of the result set:

76  SQL*Net roundtrips to/from client

<- versus ->

581  SQL*Net roundtrips to/from client

And that is a good reason to do the data reduction on the database server.

It is usually good practice to use SELECT SINGLE FROM table WHERE col = :a1 on an indexed column to check for a certain value; it is very seldom necessary to look at all possible values.

Cheers Michael

stefan_koehler
Active Contributor
0 Kudos

Hi Peter,

Thanks for the provided examples; your observations are clear now. Michael nearly hit on the reason why you see such a (marginal) performance difference between "00:00:00.04" and "00:00:00.09".

> I thought the same number of rows would be fetched first and then the HASH UNIQUE operation applied to the fetched dataset, but what we see is that the number of consistent gets is lower when the DISTINCT is included. Can you make sense of this?

First of all, you are using AUTOTRACE, which gives no granularity at all about where the buffer gets occur ... however:

Yes, but in the case of a DISTINCT much less data needs to be transferred/fetched from the database to the client. Fetches are limited by the array size (which is determined dynamically by the SAP DBSL depending on codepage, table and column definitions).

Here is a tiny example for clarification (I used literals and hints to get the same execution plan, because it makes no difference here). My example is a more drastic case, but it demonstrates the impact of the array size even better.

-- Building test environment on Oracle 11.2.0.3.2

SYS@T11:12> create table TABLEA (MANDT VARCHAR(10), FIELD1 VARCHAR(10), FIELD2 VARCHAR(10), FIELD3 VARCHAR(10), FIELD4 VARCHAR(10), FIELD5 VARCHAR(10));

SYS@T11:12> create index TABLEAI on TABLEA(MANDT, FIELD1, FIELD3, FIELD5);

SYS@T11:12> begin

for i in 1 .. 10000 loop

    insert into TABLEA values('010','AAAA','BBBB','CCCC','DDDD','EEEE');

end loop;

commit;

end;

/

-- SELECTs with default array size(15)

SYS@T11:12> select /*+ GATHER_PLAN_STATISTICS INDEX(TABLEA TABLEAI)  */ distinct FIELD1 , FIELD2 , FIELD3 , FIELD4

     from TABLEA

     where "MANDT" = '010' and "FIELD1" = 'AAAA' and "FIELD3" = 'CCCC' and "FIELD5" = 'EEEE';

As you can see, 10,000 rows are fetched from the index and the table (86 buffer gets in sum), but only one row needs to be transferred to the client (ID 0).

-- SELECTs with default array size (15)

SYS@T11:12> select /*+ GATHER_PLAN_STATISTICS INDEX(TABLEA TABLEAI)  */ FIELD1 , FIELD2 , FIELD3 , FIELD4

     from TABLEA

     where "MANDT" = '010' and "FIELD1" = 'AAAA' and "FIELD3" = 'CCCC' and "FIELD5" = 'EEEE';

In this case 10,000 rows need to be transferred to the client (ID 0), but only 15 rows are transferred by each fetch (so you may need to visit the same blocks several times to transfer all the data).

Now let's tune it up to the maximum SQL*Plus array size of 5,000.

SYS@T11:12> set arraysize 5000

SYS@T11:12>select /*+ GATHER_PLAN_STATISTICS INDEX(TABLEA TABLEAI)  */ distinct FIELD1 , FIELD2 , FIELD3 , FIELD4

     from TABLEA

     where "MANDT" = '010' and "FIELD1" = 'AAAA' and "FIELD3" = 'CCCC' and "FIELD5" = 'EEEE';

Look closely at the buffer gets. The "SELECT DISTINCT" needs the same amount of buffer gets as before (with an array size of 15), because only 1 row needs to be returned to the client and this fits into one fetch call.

SYS@T11:12> set arraysize 5000

SYS@T11:12> select /*+ GATHER_PLAN_STATISTICS INDEX(TABLEA TABLEAI)  */ FIELD1 , FIELD2 , FIELD3 , FIELD4

     from TABLEA

     where "MANDT" = '010' and "FIELD1" = 'AAAA' and "FIELD3" = 'CCCC' and "FIELD5" = 'EEEE';

Look closely at the buffer gets now. The same SELECT (without DISTINCT) needs only 90 buffer gets now. It needed 1418 buffer gets with a much smaller array size.

I am pretty sure that this explains the (minimal) "performance" difference that you see. You need many more fetch calls, more SQL*Net traffic and more logical I/O to transfer the data (for the SQL without the DISTINCT). Just try your example once again with the max array size and you will see this.
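In SQL*Plus that would look roughly like this (sketch only; the bind variables would have to be defined with VARIABLE or replaced by literals):

-- Sketch only: repeat the original test with the maximum SQL*Plus array size
set arraysize 5000
set autotrace on
SELECT "FIELD1", "FIELD2", "FIELD3", "FIELD4"
  FROM "TABLEA"
 WHERE "MANDT" = :A0 AND "FIELD1" = :A1 AND "FIELD3" = :A2 AND "FIELD5" = :A3;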

> To be cynical, if I can tune this statement it will look good to the customer based on information from the SQL cache.

It is better to explain the background to the customer instead of doing useless implementations ... customers hire consultants to help them get their problems solved, not only to implement their own ideas, which are wrong from time to time.

> There is no CPU bottleneck on the system, so tuning this statement is not critical.

Well, your SQLs are already pretty efficient (roughly 7 rows per buffer get for the DISTINCT and 6.6 rows per buffer get for the bare SELECT), but as previously explained these values (especially the last one) are currently misleading, because the SAP DBSL uses a larger fetch size. Maybe you can reduce the buffer gets by eliminating the table access by rowid, or change the application logic to execute fewer SQLs (which is even better, if possible).
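For example, an index that also covers the selected columns would remove the table access by rowid completely (just a sketch with the placeholder names from above and a made-up index name; in an SAP system such an index would of course be created via the ABAP dictionary, and whether it pays off depends on how often the table is changed):

-- Sketch only: hypothetical covering index, so the query can be answered
-- from the index alone (no TABLE ACCESS BY INDEX ROWID step)
create index "TABLEA~ZC1" on "TABLEA"
  ("MANDT", "FIELD1", "FIELD3", "FIELD5", "FIELD2", "FIELD4");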

Regards

Stefan

P.S.: You can verify the client communication / fetch behavior with an OPI trace as well.

Former Member
0 Kudos

Nice, Stefan, you did it again. I had a very similar test case with a 16-block table, where the DISTINCT took 37 gets and the full select took 100 gets. Setting the arraysize to a high enough value results in 16 gets for the DISTINCT (as expected) and 17 gets for the full select.

So I repeat my statement: reducing the result set on the database server is usually the best option, especially when this leads to fewer round trips (which will be the case for larger tables).

Cheers Michael

Update: SAP Note 1164043 - DBSL (Oracle): Maximum buffer size for array operations

shaikhtabrez3
Explorer
0 Kudos

Can anyone help me resolve the performance issue with the code below?

IF it_vart04_art[] IS NOT INITIAL.

  SELECT DISTINCT * FROM makt INTO TABLE p_it_xmaktx
    FOR ALL ENTRIES IN it_vart04_art
    WHERE matnr = it_vart04_art-matnr
      AND spras = sy-langu.

  SORT p_it_xmaktx BY matnr.

ENDIF.


Thanks in advance

Regards


Tabrez

Former Member
0 Kudos

> There are quite a lot of comments that in ABAP code SELECT DISTINCT should be avoided and replaced by a SELECT into an internal table followed by SORT + DELETE ADJACENT DUPLICATES.

Can you provide any links to these recommendations?

peter_strauss
Participant
0 Kudos

Certainly, but nothing official; lack of official supporting documentation is what brought me here.

http://wiki.sdn.sap.com/wiki/display/Community/ABAP+Performance+tips

"Whenever it's possible avoid SELECT DISTINCT, instead select data into internal table, sort and use DELETE ADJACENT DUPLICATES."

Similar statements pop up here and there.

The only related official documentation I could find was to the effect that using DISTINCT bypasses the ABAP table buffer, from which I might assume that avoiding the DISTINCT clause could be a good idea for a buffered table. This is not enough to support a general statement like the one above.