on 10-03-2013 2:15 PM
Several of our long queries on the S670 table have started to choose the S670~VAB index rather than the S670~0 index. This causes extremely long run times going from 20 minutes to 58 hours. Trying to test a solution is challenging because in our non-prod systems (all refreshed in the last 30 days), the Explain Plans correctly choose the S670~0 index.
Any advice on how to determine why the stats in PRD lead the CBO to choose ~VAB but the stats in DEV lead the CBO to choose ~0?
All of our systems are updating stats on this table regularly and using the same estimate/compute values. The distinct values are the same because these are all refreshes from PRD.
Trying to understand what causes PRD to behave differently.
Hi Richard,
> All of our systems are updating stats on this table regularly and using the same estimate/compute values. The distinct values are the same because these are all refreshes from PRD.
Can you please create a CBO trace for the two identical queries (one on DEV and one on PRD) and post the trace files here. Placeholder <SQL_ID> needs to be replaced with the real SQL ID.
SQL> exec dbms_sqldiag.dump_trace(p_sql_id=>'<SQL_ID>', p_child_number=>0, p_component=>'Compiler', p_file_id=>'CBO_TRACE_DEV');
SQL> exec dbms_sqldiag.dump_trace(p_sql_id=>'<SQL_ID>', p_child_number=>0, p_component=>'Compiler', p_file_id=>'CBO_TRACE_PRD');
The CBO trace includes all the necessary information for us.
Regards
Stefan
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi Richard,
you can zip it before attaching it on SCN ("ZIP trace file - Change suffix from .zip to .txt - Upload it") or you can drop me an Email (check my SCN profile) with it, but the best way would be here on SCN so that everybody can check it.
However it is pretty strange that there is such a huge (size) difference between the PRD and DEV trace file, but a first indicator that there is something different between PRD and DEV.
Regards
Stefan
Hi Richard,
thank you for the traces. However DEV and PRD are not comparable at all, because of the SQL statements are totally different. For example:
DEV
... "PKUNAG" IN (..., :A220 ,:A221 ) AND "MATNR"=:A222 AND "ZCNTIND"=:A223 ...
PRD
... "PKUNAG" IN (..., :A219 ,:A220 ) AND "PKUNAG"<>:A221 AND "MATNR"=:A222 AND "ZCNTIND"=:A223 ...
However the optimizer has chosen an execution plan with "BITMAP CONVERSION FROM ROWIDS" in PRD. Easily spoken, that means that the (retrieved data from) B*Tree indexes are converted "on-the-fly" into a kind of bitmap index for comparison ("BITMAP AND").
This behavior can be influenced by init parameter "_B_TREE_BITMAP_PLANS". This feature is pretty well known for possible issues in a SAP environment and you maybe need to differ from the SAP default suggestions (SAPnote #1431798) here. You can crosscheck this by comparing the parameter suggestions over the time (sometimes it should be enabled and sometimes it should be disabled). The Oracle parameter suggestions by SAP are working for the majority of the clients, but not for all
Maybe the optimizer has done a correct calculation based on the given SQL and statistic data, but i do such deep analysis only on commissioning for sure. Please make sure that you are looking at the same SQLs in DEV and PRD. So if you still have an example where the SQL statement is completely identical, please create the same traces for that example and attach it here.
Regards
Stefan
Sorry about that Stefan. See if this is better. It should be the same statement from both PRD and DEV. I was able to use the same SQL_ID since DEV was refreshed from PRD just last week. I'll take a look at Note 1431798. Attaching a new zip file with the two CBO Trace files.
Hi Richard,
thank you very much for the new trace files. The SQLs are identical now, but your optimizer environment is configured differently and not like suggested by SAP. For example:
DEV
*************************************
PARAMETERS WITH ALTERED VALUES
******************************
Compilation Environment Dump
parallel_threads_per_cpu = 1
star_transformation_enabled = true
optimizer_index_cost_adj = 20
optimizer_index_caching = 50 <-> differs
query_rewrite_enabled = false
_optim_peek_user_binds = false
_fix_control_key = 1218097180
_optimizer_extended_cursor_sharing_rel = none
_optimizer_adaptive_cursor_sharing = false
is_recur_flags = 8
_optimizer_use_feedback = false
Bug Fix Control Environment
PRD
*************************************
PARAMETERS WITH ALTERED VALUES
******************************
Compilation Environment Dump
parallel_threads_per_cpu = 1
_pga_max_size = 2097152 KB <-> differs
star_transformation_enabled = true
optimizer_index_cost_adj = 20
query_rewrite_enabled = false
_optim_peek_user_binds = false
_fix_control_key = 1218097180
_optimizer_extended_cursor_sharing_rel = none
_optimizer_adaptive_cursor_sharing = false
is_recur_flags = 8
_optimizer_use_feedback = false
Bug Fix Control Environment
Especially the optimizer parameter "optimizer_index_caching" is pretty meaningful in this context here. I really recommend to crosscheck your parameter settings with SAPnote #1431798 first and apply the correct settings (e.g optimizer_index_caching should not be set at all) in both environments. Afterwards you can compare both environments again, if they behave the same way and test your previously mentioned solution. (the solution in this case should be pretty obvious anyway)
Just for your information about parameter "optimizer_index_caching":
The initialization parameter optimizer_index_caching is used to specify the expected amount (in percent) of index blocks cached in the buffer cache during the execution of in-list iterators and nested loop joins. It is important to note that the value of this initialization parameter is used by the query optimizer to adjust its estimations. In other words, it does not specify how much of each of the indexes is actually cached by the database engine. Valid values range from 0 to 100. The default is 0. Values greater than 0 decrease the cost of index scans performed for in-list iterators and in the inner loop of nested loop joins. Because of this, it is used to push the utilization of these operations.
Source: Troubleshooting Oracle Performance by Christian Antognini
Regards
Stefan
Hi Richard,
you are welcome - no problem at all.
Just let me know, if you need further assistance by troubleshooting the execution plan itself (based on the current information i would expect that "bad" execution plan in DEV as well). I also assist several US clients by troubleshooting Oracle performance (SQL / CBO) issues
Regards
Stefan
We found a lot of parameters that are not the same in DEV so maybe it's not the best for analysis. I looked at QAS and it has only a couple of parameters that are out of sync with PRD. The optimizer_index_caching is set to default on both QAS and PRD, for example. QAS selects the S670~ZIN index, which is identical to the S670~0 index except is doesn't have mandt (we have only one client). If you get time, see if anything jumps out when comparing QAS to PRD. It might give us better insight.
Thanks,
Richard
Hi Richard,
thanks for these traces from QA and PRD. We are reaching a point now, where it is pretty hard or impossible to make a serious and well-founded statement without access to the system (key word "knowing your data").
You can also strip down the SQL statement to one OR concatenation part to focus on the root cause of this CBO / SQL issue (which also makes the CBO trace files easier to read).
However i have noticed some anomalies in your PRD system (and between your QA and PRD system) by browsing through the CBO traces shortly. Some statistic values are totally different in some parts (which may lead to the "bad" changes here), which can be based on the following root causes:
Here is an example from the latest traces:
QA
Table Stats::
Table: S670 Alias: S670
#Rows: 9225000 #Blks: 147306 AvgRowLen: 110.00 ChainCnt: 0.00
Index: S670~VAB Col#: 1 12 ( = MANDT, MATNR)
LVLS: 2 #LB: 37331 #DK: 2711 LB/K: 75.00 DB/K: 13004.00 CLUF: 6580428.00
Single Table Cardinality Estimation for S670[S670]
Column (#1): MANDT(
AvgLen: 4 NDV: 1 Nulls: 0 Density: 1.000000
...
Column (#12): MATNR(
AvgLen: 7 NDV: 2711 Nulls: 0 Density: 0.000369
PRD
Table Stats::
Table: S670 Alias: S670
#Rows: 9987400 #Blks: 159541 AvgRowLen: 110.00 ChainCnt: 0.00
Index: S670~VAB Col#: 1 12 ( = MANDT, MATNR)
LVLS: 2 #LB: 29609 #DK: 6985 LB/K: 4.00 DB/K: 985.00 CLUF: 6884800.00
Single Table Cardinality Estimation for S670[S670]
Column (#1): MANDT(
AvgLen: 4 NDV: 1 Nulls: 0 Density: 1.000000
...
Column (#12): MATNR(
AvgLen: 7 NDV: 2752 Nulls: 0 Density: 0.000363
The index "S670~VAB" is used for "BITMAP AND" in your PRD system due to a cost calculation of 1.414157 and selectivity of 0.000143 (e.g. predicate "MANDT"=:A0 AND "MATNR"=:A222). I don't know how your real data looks like, but you will notice an anomaly here, if you look closely at the column and index statistics in PRD. How is it possible that the index "S670~VAB" (with column MANDT, MATNR) can have a higher distinct key value than the combined distinct key values from the columns itself. Crosscheck this with the stats in QA and you will notice that this currently makes sense in QA, but not in PRD.
Some simple arithmetic about selectivity and cardinality
QA: 1 / 2711 = 0.00036886757654 = 9225000 * 0.000369 = 3404 rows
PRD with DK from index: 1 / 6985 = 0.00014316392269 = 9987400 * 0.000143 = 1428 rows
PRD with DK from col stats: (1 / 1) * (1 / 2752) = 1 * 0.00036337209302 = 9987400 * 0.000363 = 3625 rows
So what is the reality (knowing your data) and why do the statistics differ here in PRD from reality? As a short test you can set the DK value of index "S670~VAB" manually and verify the execution plan in PRD after cross-checking how your real data looks like for that particular predicate ... or you can export the statistics in PRD, import them in QA and do the manual adjustment there for verification.
SQL> exec DBMS_STATS.SET_INDEX_STATS('SAPR3','"S670~VAB"', numdist => 2752, no_invalidate => FALSE);
If this works you just need to find out why the statistics are so weird in parts in PRD and if it does not work you need to dig deeper in the stripped SQL example.
Regards
Stefan
P.S.: Just out of curiosity. What is the purpose of index "S670~ZIN"? It includes every column (like index "S670~Z01") except leading column MANDT. I assume that table S670 is client dependent and so i can not see any use case of index "S670~ZIN" (even if you only have one MANDT in your SAP system). The MANDT predicate is always generated for client dependent tables and so you can benefit of an index only access with index "S670~Z01", but you need to access the table S670 by rowid to filter the data on MANDT with index "S670~ZIN" (b-level 3 and leaf blocks 130519 vs. 111065 based on statistic data in PRD). However in addition index "S670~Z01" has the same structure as unique index "S670~0". What is the purpose of that index except the missing unique attribute?
Stefan, your last post has been very revealing in what is causing our issue. I took a look at the number of distinct values and PRD is correctly seeing the 6985 values for matnr. QA and DEV, by refresh, also have these numbers. So why the anomoly? I think I've found the root cause ...
DEV and PRD is where I first started my analysis. The first thing I checked was the Stats and that they were calculated the same. They were:
DEV
INDEX_NAME SAMPLE_SIZE LAST_ANAL GLO USE
------------------------------ ----------- --------- --- ---
S670~Z01 299229 10-SEP-13 YES NO
S670~ZIN 9995095 12-SEP-13 NO NO
S670~0 288415 10-SEP-13 YES NO
S670~VAB 9995095 12-SEP-13 NO NO
PRD
INDEX_NAME SAMPLE_SIZE LAST_ANAL GLO USE
------------------------------ ----------- --------- --- ---
S670~Z01 299229 10-SEP-13 YES NO
S670~ZIN 9995095 12-SEP-13 NO NO
S670~0 288415 10-SEP-13 YES NO
S670~VAB 9995095 12-SEP-13 NO NO
But as you pointed out in previous posts, the behavior of the CBO is altered because we have optimizer_index_caching=50 in our DEV system. So I took at look at QA and how it collects stats.
QA
INDEX_NAME SAMPLE_SIZE LAST_ANAL GLO USE
------------------------------ ----------- --------- --- ---
S670~Z01 271894 13-JAN-13 YES NO
S670~ZIN 273789 13-JAN-13 YES NO
S670~0 281224 13-JAN-13 YES NO
S670~VAB 292059 13-JAN-13 YES NO
So I interpret this to mean that the Distinct Keys for QA is different than PRD because QA is doing an ESTIMATE P3 to gather stats, while PRD is doing a full COMPUTE for our two Z indexes. Is that too simplistic?
It also jumps out to me that the value of DBA_IND_STATISTICS.GLOBAL_STATS is NO in PRD for S670~VAB and S670~ZIN. The value of DBA_TABLES.GLOBAL_STATS for the table S670 is YES. I believe that difference in the collection of the stats is what causes the Explain Plans to perform so differently between QA and PRD.
Thoughts? I'm not quite sure what would cause the global_stats value for the individual indexes to be set to NO.
Thanks,
Richard
Hi Richard,
> PRD is correctly seeing the 6985 values for matnr.
Ok perfect, so the "exact" DK value is 6985 like it is stored for the index "S670~VAB" in PRD.
> So I interpret this to mean that the Distinct Keys for QA is different than PRD because QA is doing an ESTIMATE P3 to gather stats, while PRD is doing a full COMPUTE for our two Z indexes. Is that too simplistic?
I guess you have misunderstood some parts of my previous posting. The statistics between QA and PRD are different (and so the cases are not 100 percent comparable), because of it seems like that QA is not an actual copy of PRD (like DEV). So the statistic difference here is pretty common (e.g. blkcnt also differs), but you can not do an 100 percent comparison between QA and PRD. So let's just focus on the issue in PRD.
The anomaly is only in PRD, because of the index S670~VAB got a tremendous different DK value with the corresponding selectivity than the standard formula based on the column stats (1 / NDV col stat of MANDT) * (1 / NDV col stat of MATNR). Luckily you have posted the sample sizes of each object, which is a reasonable explanation for that difference:
S670~Z01 = (1 / 9987400) * 299229 ≈ 3 percent sample size
S670~ZIN = (1 / 9987400) * 9995095 ≈ 100 percent sample size
S670~0 = (1 / 9987400) * 288415 ≈ 2.9 percent sample size
S670~VAB = (1 / 9987400) * 9995095 ≈ 100 percent sample size
So the question here is why do your objects have different sample sizes? The statistics are usually collected by DBMS_STATS on the table object with cascade option. One possibility (which also explains the GLOBAL_STATS value) would be a rebuild or creation of the indexes S670~ZIN and S670~VAB at 12.09.2013.
These changes in the statistic values for only a part of the dependent objects (and the special case by using a predicate that can use the DK from the index stats) are the root cause here. Have you tried my previous mentioned fix with DBMS_STATS.SET_INDEX_STATS and did it work?
Here is a short demo on my 11.2.0.3.6 database. The DK values from index and the column stats are pretty close due to the created test data pattern and i have to generate that amount of data due to DBMS_STATS behavior.
SQL> create table TEST (A number, B number);
SQL> create index TEST_I_A on TEST(A);
SQL> create index TEST_I_B on TEST(B);
SQL> begin
for i in 1..1000000 loop
insert into TEST values(i,i);
end loop;
end;
/
SQL> commit;
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER, 'TEST',
estimate_percent=> 3, cascade=> true);
SQL> select round(((1/NUM_ROWS)*SAMPLE_SIZE)*100) as PERCENT, NUM_ROWS,
GLOBAL_STATS from DBA_TABLES where TABLE_NAME = 'TEST';
PERCENT NUM_ROWS GLO
---------- ---------- ---
3 1004000 YES
SQL> select COLUMN_NAME, NUM_DISTINCT, GLOBAL_STATS
from DBA_TAB_COL_STATISTICS where TABLE_NAME = 'TEST';
COLUMN_NAME NUM_DISTINCT GLO
------------------------------ ------------ ---
B 1004000 YES
A 1004000 YES
SQL> select INDEX_NAME, round(((1/NUM_ROWS)*SAMPLE_SIZE)*100) as PERCENT,
DISTINCT_KEYS, LEAF_BLOCKS, GLOBAL_STATS
from DBA_INDEXES where TABLE_NAME = 'TEST';
INDEX_NAME PERCENT DISTINCT_KEYS LEAF_BLOCKS GLO
------------------------------ ---------- ------------- ----------- ---
TEST_I_A 57 973837 1947 YES
TEST_I_B 57 1004000 2020 YES
SQL> alter index TEST_I_A rebuild online;
SQL> select INDEX_NAME, round(((1/NUM_ROWS)*SAMPLE_SIZE)*100) as PERCENT,
DISTINCT_KEYS, LEAF_BLOCKS, GLOBAL_STATS
from DBA_INDEXES where TABLE_NAME = 'TEST';
INDEX_NAME PERCENT DISTINCT_KEYS LEAF_BLOCKS GLO
------------------------------ ---------- ------------- ----------- ---
TEST_I_A 100 1000000 2226 NO
TEST_I_B 57 1004000 2020 YES
Please take a look at the percentage and GLOBAL_STATS value after the index rebuild - it looks exactly the same like your provided data :-))
You can also crosscheck this with column LAST_DDL_TIME in view DBA_OBJECTS.
I hope this clears the doubts.
Regards
Stefan
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Following your example, I've been able to finally replicate the issue in QA.
SQL> exec DBMS_STATS.GATHER_TABLE_STATS (USER,'S670', estimate_percent=>3, cascade=>true);
SQL> select INDEX_NAME, round(((1/NUM_ROWS)*SAMPLE_SIZE)*100) as PERCENT,
DISTINCT_KEYS, LEAF_BLOCKS, GLOBAL_STATS
from DBA_INDEXES where TABLE_NAME = 'S670';
INDEX_NAME PERCENT DISTINCT_KEYS LEAF_BLOCKS GLO
---------- ---------- ------------- ----------- ---
S670~Z01 3 9328433 122870 YES
S670~ZIN 3 9769533 117566 YES
S670~0 3 9454233 130119 YES
S670~VAB 3 2697 38716 YES
SQL> alter index "S670~VAB" rebuild online;
Index altered.
SQL> select INDEX_NAME, round(((1/NUM_ROWS)*SAMPLE_SIZE)*100) as PERCENT,
DISTINCT_KEYS, LEAF_BLOCKS, GLOBAL_STATS
from DBA_INDEXES where TABLE_NAME = 'S670';
INDEX_NAME PERCENT DISTINCT_KEYS LEAF_BLOCKS GLO
---------- ---------- ------------- ----------- ---
S670~Z01 3 9328433 122870 YES
S670~ZIN 3 9769533 117566 YES
S670~0 3 9454233 130119 YES
S670~VAB 100 6875 28106 NO
Both of the above resulted in my Explain Plan chosing the S670~VAB index. I don't understand that, since it is now much fewer distinct keys. I tried your suggestion. That reduced the number of DK as expected, but the Explain Plan still chose S670~VAB.
SQL> exec DBMS_STATS.SET_INDEX_STATS('SAPR3','"S670~VAB"', numdist => 2752, no_invalidate => FALSE);
SQL> select INDEX_NAME, round(((1/NUM_ROWS)*SAMPLE_SIZE)*100) as PERCENT,
DISTINCT_KEYS, LEAF_BLOCKS, GLOBAL_STATS
from DBA_INDEXES where TABLE_NAME = 'S670';
INDEX_NAME PERCENT DISTINCT_KEYS LEAF_BLOCKS GLO
---------- ---------- ------------- ----------- ---
S670~Z01 3 9328433 122870 YES
S670~ZIN 3 9769533 117566 YES
S670~0 3 9454233 130119 YES
S670~VAB 100 2752 28106 NO
I wanted to reset QA back so I ran brconnect to gather the stats. Now, all indexes are using a sampling including VAB which was using 100% but the Explain Plan still goes with S670~VAB. I'm now so confused. At least I now have QA behaving the same as PRD.
brconnect -u / -c -f stats -o SAPR3 -t S670 -e null -m E -s P3 -f allsel,collect,method,sample,keep
SQL> select INDEX_NAME, round(((1/NUM_ROWS)*SAMPLE_SIZE)*100) as PERCENT,
DISTINCT_KEYS, LEAF_BLOCKS, GLOBAL_STATS
from DBA_INDEXES where TABLE_NAME = 'S670';
INDEX_NAME PERCENT DISTINCT_KEYS LEAF_BLOCKS GLO
---------- ---------- ------------- ----------- ---
S670~Z01 3 9504133 122870 YES
S670~ZIN 3 9170300 117566 YES
S670~0 3 9621400 130119 YES
S670~VAB 4 2744 28184 YES
Hi Richard,
i am not sure about your QA environment right now, but my example was just for demonstrating how your current production state could be accrued (depending on the provided information). However now it is proofed, that it happened exactly this way.
Have you re-parsed the SQL statement (e.g. by changing a character) or flushed the shared pool in QA? Please try the following in QA to be absolutely sure and re-check the execution plan after each try.
SQL> exec DBMS_STATS.SET_INDEX_STATS('SAPR3','"S670~VAB"', numdist => 2697);
SQL> alter system flush shared_pool;
-- Explain plan
SQL> exec DBMS_STATS.SET_INDEX_STATS('SAPR3','"S670~VAB"', numdist => 2697, numlblks => 38716);
SQL> alter system flush shared_pool;
-- Explain plan
SQL> exec DBMS_STATS.SET_INDEX_STATS('SAPR3','"S670~VAB"', numdist => 2697, numlblks => 38716, clstfct => 6580428);
SQL> alter system flush shared_pool;
-- Explain plan
I don't know what your clustering factor of index "S670~VAB" looked like before rebuilding in QA (but in the latest provided trace it was 6580428). You maybe need to crosscheck this as well. Sometimes rebuilding indexes can cause such problems due to its "disproportion" to other objects afterwards.
One of the provided examples from above should lead to the "old" execution plan.
Regards
Stefan
I tried manipulating the stats and flushing the shared_pool. What I found is that it still doesn't change the behavior of a couple of queries we have.
However, I was able to re-run stats with brconnect E P3 on production and now have most of our S670 queries working ok. The others, we have resolved by using an Oracle Hint. I'm sure there was a way to do the same thing with the stats but the hint was something we knew would stay put.
The root cause that led to this was just as Stefan mentioned, an online rebuild of the index changed how stats were being collected. I'm not sure that I understand it but at least I know the sequence of events that led to the changed behavior by the CBO.
Awarding correct answer to Stefan for showing me how to compare the stats between PRD and non-prd by using dbms_sqldiag.dump_trace. That really did open this up and helped lead us to explain what happened.
Thanks again Stefan. VERY useful responses.
Hi Richard,
glad to hear that you were able to fix your issue. Please let me just add an additional information to this thread for the other SCN users.
> What I found is that it still doesn't change the behavior of a couple of queries we have.
Yes, because of the manual statistics manipulation was designed for the given SQL query and the "special predicate case" (MANDT = <VALUE> and MATNR = <VALUE>) with its derived distinct key from the index statistics. Additional queries with a different predicate are working differently from a cost based optimizer point of view.
> The others, we have resolved by using an Oracle Hint.
... i hope you have used the optimizer parameter hint with "_B_TREE_BITMAP_PLANS", if the other issues were also caused by "BITMAP CONVERSION FROM ROWIDS" instead of "hard fixing" the execution plan. Contrary to the documentation, the hint "opt_param" can also be used for other optimizer parameters than "OPTIMIZER_DYNAMIC_SAMPLING, OPTIMIZER_INDEX_CACHING, OPTIMIZER_INDEX_COST_ADJ, OPTIMIZER_SECURE_VIEW_MERGING, and STAR_TRANSFORMATION_ENABLED".
Regards
Stefan
User | Count |
---|---|
83 | |
10 | |
10 | |
9 | |
7 | |
6 | |
5 | |
5 | |
4 | |
3 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.