on 10-18-2008 2:56 PM
Hi,
I executed the UPDATE STATISTICS command for some table columns, supplying a 'sample definition'. It ran successfully and returned an OK message. My question is -
Where should I look to find the details of the statistics that have been created as a result of this run? More specifically, I am looking for histograms for the columns; or any other kind of detailed information that MaxDB stores as part of table column statistics.
~maximus
P.S.- Some background:
System details - MaxDB version - 7.6.05.09, Platform - Linux, 64-bit
I am a novice user of MaxDB. I have installed MaxDB, created a database, created tables and loaded them with data. Now I want to update the statistics for the columns of the tables.
I looked in the table OPTIMIZERSTATISTICS, but it only gives the number of distinct values, not the histogram. In fact, the PAGECOUNT field in this table always contains nothing but a '?'.
Hi Lars,
I was not sure about which parameters you were interested in. I executed PARAM_GETDIRECTALL at the DBMCLI prompt and chose some parameters that I thought might be relevant. Please correct me if this is not what you are looking for.
System details - MaxDB version - 7.6.05.09, Platform - Linux, 64-bit
Database Parameters -
KERNELVERSION KERNEL 7.6.05 BUILD 009-123-191-997
INSTANCE_TYPE OLTP
OPTIM_MAX_MERGE 500
OPTIM_INV_ONLY YES
OPTIM_CACHE NO
OPTIM_JOIN_FETCH 0
JOIN_SEARCH_LEVEL 0
JOIN_MAXTAB_LEVEL4 16
JOIN_MAXTAB_LEVEL9 5
TRACE_JOIN 0
TRACE_KSQL 0
OPTIMIZE_AGGREGATION YES
OPTIMIZE_FETCH_REVERSE YES
OPTIMIZE_STAR_JOIN YES
OPTIMIZE_JOIN_ONEPHASE YES
OPTIMIZE_JOIN_OUTER YES
OPTIMIZE_MIN_MAX YES
OPTIMIZE_FIRST_ROWS YES
OPTIMIZE_OPERATOR_JOIN YES
OPTIMIZE_JOIN_HASHTABLE YES
OPTIMIZE_JOIN_HASH_MINIMAL_RATIO 1
OPTIMIZE_OPERATOR_JOIN_COSTFUNC YES
OPTIMIZE_JOIN_PARALLEL_MINSIZE 1000000
OPTIMIZE_JOIN_PARALLEL_SERVERS 0
OPTIMIZE_JOIN_OPERATOR_SORT YES
OPTIMIZE_QUAL_ON_INDEX YES
OPTIMIZE_QUERYREWRITE OPERATOR
UPDSTATWANTED 1
UPDATESTAT_SAMPLE_ALGO 1
DDL statements
create table t1 (c1 integer primary key, c2 integer, c3 integer)
insert into t1 (c1) (select rowno from tables a, tables b, tables c where rowno <=1000000)
update t1 set c2=mod(c1, 14)
update t1 set c3= c1
create table t2 (c1 integer primary key, c2 integer, c3 integer)
insert into t2 (c1) (select rowno from tables a, tables b, tables c where rowno <=1000000)
update t2 set c2=0 where c1<=100
update t2 set c2=1 where 100<c1 and c1<=300
update t2 set c2=2 where 300<c1 and c1<=700
update t2 set c2=3 where 700<c1 and c1<=1500
update t2 set c2=4 where 1500<c1 and c1<=3100
update t2 set c2=5 where 3100<c1 and c1<=6300
update t2 set c2=6 where 6300<c1 and c1<=12700
update t2 set c2=7 where 12700<c1 and c1<=25500
update t2 set c2=8 where 25500<c1 and c1<=51100
update t2 set c2=9 where 51100<c1 and c1<=100000
update t2 set c2=10 where 100000<c1 and c1<=200000
update t2 set c2=11 where 200000<c1 and c1<=400000
update t2 set c2=12 where 400000<c1 and c1<=800000
update t2 set c2=13 where 800000<c1
update t2 set c3= c1
update statistics column (*) for t1
update statistics column (*) for t2
As can be seen, in the table t2, there are 100 rows with c2=0, 200 rows with c2=1, and so on.
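The bucket sizes described above can be sanity-checked outside the database. This is a plain Python sketch (independent of MaxDB) that recomputes the row count per c2 value from the boundaries used in the UPDATE statements:

```python
# Reproduce the c2 bucket boundaries used for table t2 above and
# count how many c1 values (1..1000000) fall into each bucket.
bounds = [100, 300, 700, 1500, 3100, 6300, 12700, 25500,
          51100, 100000, 200000, 400000, 800000, 1000000]

prev = 0
sizes = []
for b in bounds:
    sizes.append(b - prev)   # rows with prev < c1 <= b
    prev = b

print(sizes)
# The smaller buckets roughly double in size (100, 200, 400, ...),
# so the frequencies of the c2 values are highly skewed.
```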
I tried EXPLAIN-ing the following query
explain join (result) select * from t1, t1 t3, t2 where t1.c1=t3.c1 and t1.c1=t2.c1 and t1.c2=1 and t2.c2<3
| STRATEGY | TABLENAME | ADDITIONAL ROWS | ACCUMULATED_COSTS |
| ---------------------| ------------- | -------------------------- | -------------------------- |
| | T1 | 998527.9999999999 | 4304 |
| JOIN VIA SINGLE KEY | T3 | 998527.9999999999 | 8608 |
| JOIN VIA SINGLE KEY | T2 | 998527.9999999999 | 13126.22624434389 |
The other columns have comparable values for all 3 tables. Note that the estimated ADDITIONAL ROWS in all tables is approx 1 million - the size of the whole table.
In the 3-way join above, the optimizer chooses the join order \[ t1 t3 t2 ]. But I would expect the ideal order to be \[ t2 t1 t3 ]. The condition c2<3 retrieves only 700 rows, so t2 should be selected first. This result should be joined with t1 (which has the next smallest result on its own, due to the condition c2=1, which yields 1,000,000/14 ≈ 71,000 rows). Finally, the result should be joined with t3. Is this reasoning valid?
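For what it's worth, the back-of-the-envelope arithmetic behind this expected join order can be written out explicitly (plain Python; the numbers are taken from the data setup above, not from MaxDB):

```python
# Rough predicate cardinalities for the 3-way join, based on the data
# loaded above (1,000,000 rows per table).
N = 1_000_000

rows_t2 = 100 + 200 + 400   # t2.c2 < 3 selects the buckets for c2 = 0, 1, 2
rows_t1 = N // 14           # t1.c2 = 1 under the uniform mod-14 spread

print(rows_t2, rows_t1)
# t2's predicate keeps only 700 rows while t1's keeps about 71,000,
# so starting with t2, then joining t1, then t3 minimizes the
# intermediate result sizes.
```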
Warm regards,
Maximus
Edited by: maximus on Oct 30, 2008 1:12 PM
Hi Maximus,
Ok, I tried to reproduce your examples.
The first thing here is obvious: c2 is not indexed, so although a condition like 't2.c2<3' reduces the result set a lot, MaxDB cannot exploit this knowledge, as it can only use the primary key or a table scan for data access.
So, for this unindexed c2 there won't be any evaluation, as the column cannot be used for data access at all.
Create an index on c2 and you'll see that the estimates get much better.
Concerning the 3-way-join: I just can't reproduce this.
That's what I get:
explain join select * from t1, t1 t3, t2 where t1.c1=t3.c1 and t1.c1=t2.c1 and t1.c2=1 and t2.c2<3;
| STRATEGY | OWNER | TABLENAME | MAX ADDNL. ROWS | MULTIPLIER | REVERSE MULTIPLIER | ADDITIONAL PAGES | ADDITIONAL ROWS | ACCUMULATED_COSTS |
| -------------------- | ----- | --------- | --------------- | ---------- | ------------------ | ---------------- | --------------- | ----------------- |
| | LARS | T2 | 997473 | 1 | 1 | 8673.67826086956 | 997473 | 4281 |
| JOIN VIA SINGLE KEY | LARS | T1 | 999141 | 1 | 1 | 12468.4125 | 997473 | 8794.45248868778 |
| JOIN VIA SINGLE KEY | | T3 | 999141 | 1 | 1 | 13853.7916666667 | 997473 | 13307.9049773756 |
That's the correct join order.
regards,
Lars
I tried again after building indexes. Now it gives the right join order. That answers my question.
Thanks a lot, Lars, for the help!
Warm regards,
Maximus
I performed the following experiment -
Create 2 tables t1(c1,c2) and t2(c1,c2) (c1 is the key in both tables).
I filled 1 million records in t1 so that c1=1,2,...,1000000 and c2= c1 mod 14. Thus column c2 has 14 values, all having the same frequency of occurrence in the table t1.
I filled 1 million records in t2 so that c1=1,2,...,1000000 and c2=0 for the first 100 records, c2=1 for the next 200 records, c2=2 for the next 400 records and so on. It turns out that column c2 has 14 values but with vastly differing frequencies of occurrence in the table t2.
Now, when I executed an EXPLAIN JOIN on the 2 queries -
Q1:
select * from t1, t1 t3 where t1.c1=t3.c1 and t1.c2>6;
Q2:
select * from t2, t2 t4 where t2.c1=t4.c1 and t2.c2>6;
I got very similar cost estimates (see below)
explain join (result) select * from t1, t1 t3 where t1.c1=t3.c1 and t1.c2>6
| STRATEGY | TABLENAME | ACCUMULATED_COSTS |
| ---------------------------------------- | -------------------------------- | -------------------------- |
| | T1 | 4304 |
| JOIN VIA SINGLE KEY | T3 | 8608 |
explain join (result) select * from t2, t2 t4 where t2.c1=t4.c1 and t2.c2>6
| STRATEGY | TABLENAME | ACCUMULATED_COSTS |
| ---------------------------------------- | -------------------------------- | -------------------------- |
| | T2 | 4522 |
| JOIN VIA SINGLE KEY | T4 | 9044 |
(The other columns, omitted here due to space constraints, also had almost identical values.)
Ideally, I feel, the optimizer should have estimated the cost of Q2 to be much higher than that of Q1.
Based on this, I am tempted to conclude that the MaxDB optimizer either (a) does not consider the distribution of values, or (b) is not provided with this data by the statistics, and hence cannot do better anyway.
Is the experiment correctly designed and is the conclusion valid? Please comment.
Aside: I also tried other queries on these tables - notably 3-way joins. In each case, where I expected the optimizer to pick a particular least-cost join order, it did not, and made a naïve choice instead (as was evident in the EXPLAIN SEQUENCE output).
Regards,
Maximus
P.S.- Thanks once again Lars for the tips. I did try to read most of the relevant documentation, as you suggested.
Hi Maximus,
time to show some more information!
Let us know:
- database version
- parameter setup
- exact DDL/DML used to produce the testcase
You did not mention that you gathered statistics after the dataload.
If you did it anyhow, then what was the exact command used for that?
As already stated, MaxDB does not store histogram or data distribution information, but samples the B*Trees during strategy evaluation.
Concerning the execution plans: both join orders are correct here. So even if the displayed ACCUMULATED_COSTS do not fit your expectation, the right plan is still chosen.
The statement and plan for the wrong join order are missing from your post - so currently it's just what you think...
regards,
Lars
From Melanie's answer, it is clear to me that MaxDB statistics do not include histograms. How, then, does the optimizer calculate the cardinality of a condition?
e.g. In the query
SELECT * FROM t1, t2 WHERE t1.f1 = t2.f2 AND t1.f3 BETWEEN <val-1> AND <val-n>
the optimizer will try to calculate the selectivity of the BETWEEN condition. But, in the absence of a histogram, it does not have accurate selectivities for the individual values <val-i> between <val-1> and <val-n>.
Does the optimizer assume a uniform distribution of the values? In other words, if there are N distinct values for the field t1.f3 and the range <val-1> to <val-n> contains n values, does the optimizer assume that the selectivity of the condition is n/N ?
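To make the question concrete, here is what the uniform-distribution assumption would compute. Note this only illustrates the n/N formula being asked about, not what MaxDB actually does (per the replies below, MaxDB samples instead); the function name is made up for illustration:

```python
# Hypothetical illustration of the uniform-distribution assumption:
# the selectivity of a range predicate is estimated as n / N.
def uniform_range_selectivity(n_values_in_range: int, n_distinct: int) -> float:
    """Selectivity of 'col BETWEEN lo AND hi' if every distinct value
    is assumed equally frequent."""
    return n_values_in_range / n_distinct

# Example from the t1 table above: c2 has 14 distinct values, and the
# range c2 > 6 covers 7 of them (7..13).
sel = uniform_range_selectivity(7, 14)
print(sel)   # 0.5, i.e. about 500,000 of the 1,000,000 rows
```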
Warm regards,
Maximus
P.S.- Thanks Melanie and Lars! The answers were very useful.
> How does the optimizer calculate the cardinality of a condition?
You should really read the information about the optimizer in the wiki: [SQL Optimizer|https://wiki.sdn.sap.com/wiki/x/DSg].
MaxDB does not use stored histograms but performs evaluation a.k.a. sampling during optimization time.
> Does the optimizer assume a uniform distribution of the values?
No, it does not (see above) - evaluations are used.
Basically, MaxDB performs test reads into the B*Tree to figure out how big (in terms of pages that need to be read) a specific range would be.
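As an illustration only (not MaxDB internals), the idea of probing a sorted structure to size a range, instead of consulting a stored histogram, can be sketched in a few lines of Python, with a sorted list standing in for the B*Tree:

```python
# Illustrative sketch: estimate the size of a range by probing a sorted
# structure rather than looking it up in a histogram. The sorted list
# stands in for the B*Tree; bisect locates the range bounds, much like
# test reads descending to the first and last qualifying leaf entries.
import bisect

keys = list(range(1, 1_000_001))   # the key column, kept sorted as in a B*Tree

def estimated_rows(lo: int, hi: int) -> int:
    left = bisect.bisect_left(keys, lo)    # position of first entry >= lo
    right = bisect.bisect_right(keys, hi)  # position of first entry > hi
    return right - left

print(estimated_rows(300, 700))   # entries 300..700 inclusive
```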
Before you ask - there is no way to see the results of this sampling process in a reasonable way.
Anyhow, for SQL tuning you may want to take a closer look at EXPLAIN SEQUENCE and EXPLAIN JOIN.
regards,
Lars
Hi Maximus,
Melanie is of course completely right: read the documentation and the wiki first!
Anyhow, here are just a few steps to demonstrate what kind of information is gathered when:
create table test (kcol1 varchar(10) primary key, nkcol2 integer, nkcol3 varchar(20))
//
insert into test (kcol1) (select rowno from tables a, tables b, tables c where rowno <=10000)
//
update test set nkcol2=mod(kcol1, 100)
//
update test set nkcol3= kcol1
Ok, kcol1 is obviously the primary key for this table and has 10000 distinct values;
nkcol3 is a direct copy of it, so it also has 10000 distinct values, whereas nkcol2 has only 100 distinct values due to the mod(x, 100) function used in the update.
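These distinct-value counts are easy to double-check outside the database with a few lines of Python:

```python
# Recompute the distinct-value counts claimed for the TEST table above.
rows = range(1, 10_001)                   # kcol1 = 1 .. 10000
nkcol2 = {k % 100 for k in rows}          # mod(kcol1, 100)
nkcol3 = {k for k in rows}                # direct copy of kcol1

print(len(nkcol2), len(nkcol3))           # 100 and 10000 distinct values
```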
sqlcli DB760=>
sqlcli DB760=> \mu
Multiline mode switched ON
sqlcli DB760=> select t.tablename, f.entrycount, treeindexsize+treeleavessize+lobsize SIZE_IN_PAGES
=> from files f, tables t
=> where
=> t.tableid = f.fileid
=> and f.type='TABLE'
=> and t.tablename='TEST'
=> and t.schemaname=user;
| TABLENAME | ENTRYCOUNT | SIZE_IN_PAGES |
| ---------- | ------------------ | ------------------- |
| TEST | 10000 | 368 |
1 row selected (122.726 msec)
So, the first thing I looked at here are actually no statistics, but the current size values from the file directory.
I did this just to see what statistical values we could expect.
By the way: to run multiline statements like the one above in sqlcli, you have to enter \mu first to activate multiline mode. To mark the end of a statement in multiline mode, you have to terminate it with a ';', just as you do in Oracle's sqlplus.
sqlcli DB760=> select * from optimizerinformation where tablename='TEST' and schemaname=user;
| SCHEMANAME | TABLENAME | COLUMNNAME | INDEXNAME | DISTINCTVALUECOUNT | CALCULATEDSIZE |
| ---------- | ---------- | ---------- | ---------- | -------------------- | ---------------- |
| LARS | TEST | KCOL1 | ? | 0 | ? |
| LARS | TEST | ? | ? | 0 | 8 |
2 rows selected (24.137 msec)
As we see, even without ever having run an UPDATE STATISTICS, there are two entries in the table OPTIMIZERINFORMATION (if you check the documentation, you'll find that OPTIMIZERSTATISTICS is deprecated nowadays).
We have one line for the table (you'll notice the NULL value '?' in the field COLUMNNAME) and one for the primary key "KCOL1". Unfortunately, the values we see for DISTINCTVALUECOUNT and CALCULATEDSIZE are pretty much nonsense, as we know from the first query.
ATTENTION: here you can see why it is such a big problem for the optimizer if the statistics do not fit reality. Although MaxDB uses statistics only for join statements, this table would be estimated to be very small...
sqlcli DB760=> update statistics test;
0 rows affected (81.886 msec)
sqlcli DB760=> select * from optimizerinformation where tablename='TEST' and schemaname=user;
| SCHEMANAME | TABLENAME | COLUMNNAME | INDEXNAME | DISTINCTVALUECOUNT | CALCULATEDSIZE |
| ----------- | ---------- | ----------- | ---------- | ------------------------ | ------------------ |
| LARS | TEST | KCOL1 | ? | 10000 | ? |
| LARS | TEST | ? | ? | 10000 | 360 |
2 rows selected (24.013 msec)
This looks much better, now!
But what about the other columns? For all non-indexed, non-primary-key columns you have to specify explicitly that you want to gather statistics on them. Usually this is not necessary, as such columns won't ever be used for a join transition:
sqlcli DB760=> update statistics column (*) for test;
0 rows affected (140.526 msec)
sqlcli DB760=> select * from optimizerinformation where tablename='TEST' and schemaname=user;
| SCHEMANAME | TABLENAME | COLUMNNAME | INDEXNAME | DISTINCTVALUECOUNT | CALCULATEDSIZE |
| ---------- | ---------- | ---------- | --------- | ------------------ | --------------- |
| LARS | TEST | KCOL1 | ? | 10000 | ? |
| LARS | TEST | NKCOL2 | ? | 100 | ? |
| LARS | TEST | NKCOL3 | ? | 10000 | ? |
| LARS | TEST | ? | ? | 10000 | 360 |
4 rows selected (24.795 msec)
As we see, OPTIMIZERINFORMATION does indeed show the different results of an UPDATE STATISTICS command - you just have to know where to look.
regards,
Lars
Hi,
if you are new to MaxDB you might want to read our MaxDB performance guide and the support guide.
Both can be found here:
https://wiki.sdn.sap.com/wiki/x/gjc
This section specifically explains the statistics:
https://wiki.sdn.sap.com/wiki/x/Dyg
We don't have any histograms - so you won't find any.
Best regards,
Melanie