When should I run update [index] statistics on a table?

Former Member

Greetings,

     A few of our tables on the production server have a few hundred million rows (between 200 million and 800 million). We add between 200,000 and 500,000 rows to each of them every week, which is less than one percent of their total rows. Most of these tables have chronologically increasing index columns such as dates or identity values. We run update stats on weekends and it runs for more than 24 hours. Because the maintenance window is not long enough, we run update stats on a different group of tables each weekend, and it takes about 4 weekends to cover all the tables.

Do I really need to run update stats on tables that may not have changed much? What threshold of percentage data change, or what other criteria, can I use to filter my table list so I can cover many more tables that need their statistics refreshed?


I can create & manipulate a metadata table and run update stats in parallel on multiple tables that qualify the criteria. This way I can cover many more tables and reduce cycle time between stats run on each table.

Thank you & regards,

Sid.

Accepted Solutions (1)

marcin_najs
Explorer

Hi.

You don't need to run update stats every time for tables that do not change much...

Here are some simple ways to reduce the time the update process takes:

1. You can update statistics based on the output of the datachange() function.

For example:

If datachange(tableA) > 5% update stats tableA [...] (see the sketch below)

2. You can reduce 'sampling' parameter for large tables (NOTE: histograms may be less accurate...)
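
For illustration, here is a minimal sketch combining both ideas in ASE T-SQL. The table name, the 5% threshold, and the 10 percent sampling rate are assumptions you would tune for your own data, and you should verify the exact syntax against your ASE version:

-- refresh stats on one table only if enough of it has changed;
-- reduced sampling keeps runtime down on very large tables at the
-- cost of somewhat less accurate histograms
if datachange('tableA', null, null) > 5
    update statistics tableA with sampling = 10 percent

-- or list every user table over the threshold so a maintenance
-- script can decide what to run
select name, datachange(name, null, null) as pct_changed
from   sysobjects
where  type = 'U'
  and  datachange(name, null, null) > 5
order by 2 desc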

Parameters like:

- datachange() threshold

- sampling

- number of rows (when your tables are considered as "large")

you have to tune yourself, based on your data and your queries.

Here are some useful posts from SCN:

(Optimizer Stats - PDF doc.)  http://tinyurl.com/qhrj925

datachange function doc

Hope that helps.

Best Regards.

--
Marcin

Former Member

Marcin,

     That is exactly what I have. I am just finishing my script. Here is some detail that may help others.

Metadata-1

TableGroupId, GroupName, DBname, TableName, MinRows, MaxRows, RowCnt, DataChangeHWM, DataChanged, StepValue

RowCnt & DataChanged are calculated during runtime and rest of the columns are being set by a script in initial setup.

Metadata-2

TableListId, TableGroupId, StartTime, EndTime, RunStatus, Action

RunStatus = 'NEW', 'Running', 'Completed', 'Failed'

Action = 'STATS', 'INDEX_STATS', 'REORG_REBUILD', 'REORG_DEFRAG' ....
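
A rough sketch of what those two metadata tables could look like in T-SQL (the table names and datatypes here are illustrative assumptions; the actual DDL is not posted):

-- "Metadata-1": one row per table enrolled in a maintenance group
create table TableMaintGroup (
    TableGroupId   int          not null,
    GroupName      varchar(30)  not null,
    DBname         varchar(30)  not null,
    TableName      varchar(255) not null,
    MinRows        bigint       null,
    MaxRows        bigint       null,
    RowCnt         bigint       null,   -- refreshed at runtime
    DataChangeHWM  float        null,   -- datachange() threshold for this table
    DataChanged    float        null,   -- refreshed at runtime
    StepValue      int          null
)

-- "Metadata-2": one row per scheduled/executed maintenance action
create table TableMaintRun (
    TableListId    int          not null,
    TableGroupId   int          not null,
    StartTime      datetime     null,
    EndTime        datetime     null,
    RunStatus      varchar(12)  not null,  -- 'NEW', 'Running', 'Completed', 'Failed'
    Action         varchar(20)  not null   -- 'STATS', 'INDEX_STATS', 'REORG_REBUILD', 'REORG_DEFRAG', ...
)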

I have two sets of Scripts.

AddNewGroup.ksh populates Metadata-1


     AddNewGroup.ksh <TargetServer> <TableGroup> <DbName> <MinRows> <MaxRows> <StepValue> <DataChangeHWM>

TableMaintRunner.ksh & ExecTableMaint.ksh run the underline action.


     TableMaintRunner.ksh <TargetServer> <TableGroup> <DbName> <TableAction> <ParallelDegree> <Duration in Minutes> <DEBUG/NODEBUG>

TableMaintRunner executes ExecTableMaint in the background to achieve parallelism.

TableMaintRunner.ksh controls the number of ExecTableMaint.ksh processes and will exit once the allowed duration is exceeded. This may be calculated based on the current time and the last execution duration for a specific object. ExecTableMaint runs the actual action (update stats, reorg, etc.).
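
The queue handling inside the worker could be as simple as claiming the oldest 'NEW' row and marking it 'Running'. A sketch only, using the hypothetical table names from the DDL above (the actual scripts are not posted):

declare @id int

begin tran
    -- claim the next queued entry; holdlock keeps two parallel workers
    -- from picking up the same row
    select @id = min(TableListId)
    from   TableMaintRun holdlock
    where  RunStatus = 'NEW'

    update TableMaintRun
    set    RunStatus = 'Running',
           StartTime = getdate()
    where  TableListId = @id
commit tran

-- ... run the requested Action (update stats, reorg, ...), then set the row
-- to 'Completed' or 'Failed' and stamp EndTime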


Thanks,

Sid.


Answers (1)

kevin_sherlock
Contributor

The standard answer here is that you only need to update statistics when you have performance issues due to inefficient query plans (caused by out-of-date statistics).

However, that answer basically implies that update stats is a reactive job (i.e., wait until you have a problem first).

So, the classic DBA challenge is to determine how to proactively avoid performance problems.  This usually involves techniques already suggested above involving "datachange()", cluster-ratio functions like "dpcr()", etc.   The complexity and logic behind such a task is where the DBA earns his/her money because, as we all know, every dataset and transaction workload is different.

One caution about using the "datachange()" function.  Above it's suggested that you base your decisions on "datachange(<table>,null,null)".  I've never considered that query useful whatsoever in determining when to run a specific update stats command.  That form returns the MAXIMUM datachange value over all columns which have histograms, over all partitions.  So, while it may indicate that _SOME_ column on _some_ partition has changed a lot, it doesn't indicate _WHICH_ column/partition.  The only use I can see for that command is to "pre-qualify" which tables need to be looked at more closely.  Something like:

for each <table>

    if datachange(<table>, null, null) > 20

        then for each <column> of <table>

            if datachange(<table>, null, <column>) > 20 then add_column_to_list(<table>, <column>)

next <table>

...
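
A rough T-SQL rendering of that pre-qualification idea (the 20% threshold and the #stats_todo table are illustrative; a production version would also restrict the inner check to columns that actually have histograms):

create table #stats_todo (tab varchar(255), col varchar(255))

declare @tab varchar(255)

-- first pass: tables whose maximum datachange over all columns/partitions exceeds 20%
select @tab = min(name)
from   sysobjects
where  type = 'U'
  and  datachange(name, null, null) > 20

while (@tab is not null)
begin
    -- second pass: which columns of this table have changed enough to matter
    insert #stats_todo
    select @tab, c.name
    from   syscolumns c
    where  c.id = object_id(@tab)
      and  datachange(@tab, null, c.name) > 20

    -- move on to the next qualifying table
    select @tab = min(name)
    from   sysobjects
    where  type = 'U'
      and  name > @tab
      and  datachange(name, null, null) > 20
end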

Also keep in mind that "datachange()" values aren't affected by rollbacks.  If you have lots of rollback transactions, that can skew results.  The whole "update stats" -via- "datachange()" subject area is a complex and kinda fascinating topic (to me anyway), and probably deserves a better treatment at some point.  Hmmm...  ISUGTECH seeds now planted...

There are ways to make "update stats" run faster as well, which may or may not simplify the job described above.  If you can fit all of your update stats within your maintenance window, you possibly have less work to do.  My main suggestion is to partition your large tables, especially tables that monotonically grow at "the end" as you describe.  There is no need to repeatedly update stats on "old" data, just on the newly added data.  Another suggestion is to run update stats jobs in "parallel".  Leverage multiple jobs running at the same time rather than one script that serially runs through tables/indexes/columns one at a time.  Leverage all of your threads/engines and push your I/O controllers so that more work can be done in less time.
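
To illustrate the partitioning idea, a table that grows by date could be range partitioned so that only the partition still receiving rows needs fresh statistics. Table, column, and partition names below are hypothetical, and the exact DDL depends on your ASE version:

-- range-partition the table on its ever-increasing date column
create table trades (
    trade_id   int      not null,
    trade_date datetime not null
)
partition by range (trade_date)
(   p_2013 values <= ('Dec 31 2013 11:59PM'),
    p_2014 values <= ('Dec 31 2014 11:59PM'),
    p_max  values <= (MAX)
)

-- refresh statistics only on the partition that is still being loaded
update index statistics trades partition p_2014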

You didn't post your ASE version, but there are also options for "update statistics" that can help as well.

"using out_of_range on"  -  Set's stats for columns whose key values always increase (or decrease for that matter).  This sounds as if it might be useful for your case you outline above

"with consumers = n" - if you are updating non-indexed column stats where sorting is necessary, and you have parallel resources enabled

"with partial_hashing | hashing " - to leverage a possibly faster engine to gen stats