Call (800) 766-1884 for Oracle support & training
Free Oracle Tips


Oracle Consulting Support
Oracle Upgrades
Use New Oracle Features
Oracle Replication Support
Oracle Training
Remote Oracle DBA
System Documentation
Oracle Tips
Oracle Performance
 

Free Oracle Tips


 

HTML Text

BC Oracle tuning

Oracle training

Oracle support

Remote Oracle

 

Oracle Tips by Burleson

Oracle SQL Tuning and CBO Internal

Histograms

 

Another feature requiring the cost-based optimizer is histograms. Histograms are created for indexed columns that are badly skewed, such as the REGION column in the STATS table. The purpose of a histogram is to help the cost-based optimizer make a better decision about whether or not to use an index and to reduce the number of full table scans. The following query shows that one row contains a region code of NW, while the other seven region codes contain either 125,000 or 249,999 entries.

 

SQL> COL    REGION  FORMAT A6

SQL> SELECT REGION, COUNT(*)

 2   FROM   STATS

 3*  GROUP  BY REGION;

 

REGION   COUNT(*)

------ ----------

     E     125000

     N     125000

    NE     125000

    NW          1

     S     125000

    SE     125000

    SW     125000

     W     249999

 

8 rows selected.

 

Without a histogram on the REGION column, Oracle will not use an index on the REGION column. The following code creates a btree index on the REGION column of the STATS table.

 

SQL> CREATE INDEX I_STATS_REGION ON STATS(REGION);

Index created.

 

Next, analyze the STATS table because you just added a new index.

 

SQL> ANALYZE TABLE STATS COMPUTE STATISTICS;

Table analyzed.

 

Use AUTOTRACE to see the plan.

 

SQL> SET AUTOTRACE TRACEONLY EXPLAIN

 

Notice that without a histogram, the cost-based optimizer would perform a full table scan on a one million-row table to return one row.

 

SQL> SELECT *

 2   FROM   STATS

 3   WHERE  REGION = 'NW';

 

Execution Plan

-----------------------------------------

0  SELECT STATEMENT Optimizer=CHOOSE

   (Cost=583 Card=125000 Bytes=2625000)

1 0  TABLE ACCESS (FULL) OF 'STATS'

     (Cost=583 Card=125000 Bytes=2625000)

 

Why does Oracle do this? The following two queries reveal why. In user_indexes, the DISTINCT_KEYS column shows that Oracle knows there are eight different values. In user_tables, the NUM_ROWS column indicates that Oracle knows the number of rows in the stats table. You can almost see the light go on in the optimizer. Ah Hah! Because there are one million rows with eight different values, I’ll just divide 8 into 1,000,000. The optimizer’s incorrect conclusion is that each of the eight values comprise 125,000 rows of the stats table. Obviously, the optimizer will not use an index to retrieve 1/8th of the rows from a one million-row table, and will perform a full table scan instead.

 

SQL> SELECT DISTINCT_KEYS

 2   FROM   USER_INDEXES

 3   WHERE  INDEX_NAME = 'I_STATS_REGION';

 

DISTINCT_KEYS

-------------

            8

SQL> SELECT NUM_ROWS

 2   FROM   USER_TABLES

 3   WHERE  TABLE_NAME = 'STATS';

 

  NUM_ROWS

----------

   1000000

 

The REGION column is in bad need of a histogram. It is better to ask for too many buckets than to ask for too few. Although you request 250 buckets, Oracle creates 7 buckets (see Exhibit 4).

 

SQL> ANALYZE TABLE STATS -- Default 75 Buckets

 2           COMPUTE STATISTICS -- Maximum Of 254 Buckets

 3           FOR TABLE -- Oracle Uses Only Required Buckets

 4*          FOR COLUMNS REGION SIZE 250;

Table analyzed.

 

SQL> SELECT NUM_DISTINCT,

 2          LOW_VALUE,

 3          HIGH_VALUE,

 4          DENSITY,

 5          NUM_BUCKETS

 6   FROM   USER_TAB_COL_STATISTICS

 7*  WHERE  TABLE_NAME = 'STATS' AND COLUMN_NAME = 'REGION';

 

NUM_DISTINCT LOW_VALUE HIGH_VALUE DENSITY    NUM_BUCKETS

------------ --------- ---------- ---------- -----------

           8        45         57   .0000005           7

Exhibit 4. Histogram for REGION Column

 

Oracle now has a better feel for how the data is distributed in the REGION column by the values in the ENDPOINT_NUMBER column (see Exhibit 5).

 

SQL> SELECT COLUMN_NAME,

 2          ENDPOINT_NUMBER,

 3          ENDPOINT_VALUE

 4   FROM   USER_HISTOGRAMS

 5   WHERE  TABLE_NAME = 'STATS' AND COLUMN_NAME = 'REGION'

 6*  ORDER  ENDPOINT_NUMBER;

 


COLUMN_NAME  ENDPOINT_NUMBER ENDPOINT_VALUE

------------ --------------- --------------

REGION                125000     3.5827E+35

REGION                250000     4.0500E+35

REGION                375000     4.0640E+35

REGION                375001     4.0676E+35

REGION                500001     4.3096E+35

REGION                625001     4.3236E+35

REGION                750001     4.3273E+35

REGION               1000000     4.5173E+35

8 rows selected.

Exhibit 5. Data Distribution in the REGION Column

 

With this new histogram information, the query executed above should now cause Oracle to use the index on the REGION column instead of a full table scan, and it does. Notice the cost is only 4 compared to a cost of 583. In addition, this query requires 0.1 seconds to execute using the index, whereas a full table scan requires nearly 10 seconds.

 

SQL> SET AUTOTRACE TRACEONLY EXPLAIN

SQL> SELECT *

 2   FROM   STATS

 3   WHERE  REGION = 'NW';

 

Execution Plan

-----------------------------------------

0 SELECT STATEMENT Optimizer=CHOOSE

    (Cost=4 Card=1 Bytes=21)

1 0 TABLE ACCESS (BY INDEX ROWID) OF 'STATS'

    (Cost=4 Card=1 Bytes=21)

2 1 INDEX (RANGE SCAN) OF 'I_STATS_REGION' (NON-UNIQUE)

    (Cost=3 Card=1)

Histograms

In some cases, the distribution of values within a column of a table will affect the optimizer's decision to use an index vs. performing a full-table scan.  This scenario occurs when the value with a where clause does not have an equivalent distribution of values, making a full-table scan cheaper than index access in some cases.  The problem is the optimizer can’t tell when it is using a value with a few rows, and when it is using a value with a large number of rows.

 

A column histogram should only be created when we have data skew exists or suspected.  In the real world, that happens rarely, and one of the most common mistakes with the optimizer is the unnecessary introduction of histograms into optimizer statistics.  The histograms signals the optimizer that the column is not linearly distributed, and the optimizer will peek into the literal value in the SQL where clause, and compare that value to the histogram buckets in the histogram statistics.

 

While they are used to make a yes-or-no decision about the use of an index to access the table, histograms are most commonly used to predict the size of the intermediate result set from a multi-way table join.

 

For example, assume that we have a five-way table join whose result set will be only 10 rows.  Oracle will want to join the tables together in such a way as to make the result set (cardinality) of the first join as small as possible.  By carrying less baggage in the intermediate result sets, the query will run faster.  To minimize intermediate results, the optimizer attempts to estimate the cardinality of each result set during the parse phase of SQL execution.  Having histograms on skewed columns will greatly aid the optimizer in making a proper decision.  (Remember, you can create a histogram even if the column does not have an index and does not participate as a join key.)

 

Because a complex schema might have tens of thousands of columns, it is impractical to evaluate each column for skew and thus Oracle provides an automated method for building histograms as part of the dbms_stats utility.  By using the method_opt=>'for all columns size skewonly' option of dbms_stats, you can direct Oracle to automatically create histograms for those columns whose values are heavily skewed.

 

As a general rule, histograms are used to predict the cardinality and the number of rows returned in the result set.  For example, assume that we have a product_type index and 70% of the values are for the HARDWARE type.  Whenever SQL with where product_type='HARDWARE' is specified, a full-table scan is the fastest execution plan, while a query with where product_type='SOFTWARE' would be fastest using index access.

 

Because histograms add additional overhead to the parsing phase of SQL, you should avoid them unless they are required for a faster optimizer execution plan.  But, there are several conditions where creating histograms is advised:

  • When the column is referenced in a query — there is no point in creating histograms if the queries do not reference the column.

  • When there is a significant skew in the distribution of columns values — this skew should be sufficiently significant that the value in the WHERE clause will make the optimizer choose a different execution plan.

  • When the column values causes an incorrect assumption — if the optimizer makes an incorrect guess about the size of an intermediate result set, it may choose a sub-optimal table join method.  Adding a histogram to this column will often provide the information required for the optimizer to use the best join method.

So how do you find those columns that are appropriate for histograms?  There is a feature in dbms_stats that provides for the ability to automatically look for columns that should have histograms, and create the histograms.  Multi-bucket histograms add a huge parsing overhead to SQL statements, and histograms should only be used when the SQL will choose a different execution plan based upon the column value.

 

To aid in intelligent histogram generation, Oracle uses the method_optparameter of dbms_stats.  There are also important new options within the method_opt clause, namely skewonlyand auto (and others).

 

method_opt=>'for all columns size skewonly'

method_opt=>'for all columns size auto'

 

The first is the "skewonlyoption, which is very time-intensive because it examines the distribution of values for every column within every index. If dbms_stats discovers an index with columns that are unevenly distributed, it will create histograms for that index to aid the cost-based SQL optimizer in making a decision about index vs. full-table scan access. For example, if an index has one column that is in 50% of the rows, a full-table scan is faster than an index scan to retrieve these rows.

 

Histograms are also used with SQL that has bind variables and SQL with cursor_sharing enabled.  In these cases, the optimizer determines if the column value could affect the execution plan, and if so, replace the bind variable with a literal and performs a hard parse.


begin

 dbms_stats.gather_schema_stats(

   ownname          => 'SCOTT',

   estimate_percent => dbms_stats.auto_sample_size,

   method_opt       => 'for all columns size skewonly',

   degree           => 7

 );

end;

/

 

The auto option is used when monitoring is implemented (alter table xxx monitoring) and creates histograms based upon data distribution and the manner in which the column is accessed by the application (e.g., the workload on the column as determined by monitoring). Using method_opt=>'auto' is similar to using the gather auto in the option parameter of dbms_stats:

 

begin

   dbms_stats.gather_schema_stats(

      ownname          => 'SCOTT',

      estimate_percent => dbms_stats.auto_sample_size,

      method_opt       => 'for all columns size auto',

      degree           => 7

   );

end;

/

 


The above text is an excerpt from:

Oracle SQL Tuning & CBO Internals
ISBN 0-9745993-3-6

by Kimberly Floss


Download your Oracle scripts now:

www.oracle-script.com

The definitive Oracle Script collection for every Oracle professional DBA

 

  
 

Oracle performance tuning software 
 
 
 
 

Oracle performance tuning book

 

 
 
 
Oracle performance Tuning 10g reference poster
 
 
 
Oracle training in Linux commands
 
Oracle training Excel
 
Oracle training & performance tuning books
 
   

Copyright © 1996 -  2014 by Burleson. All rights reserved.

Oracle® is the registered trademark of Oracle Corporation.