Refer Me for Oracle: Histogram Example

Histogram (12c)

----------------

A histogram is a column statistics that provide the data distribution on a particular column.

Histogram type is chosen based on the NDV (number of distinct values) on that column.

Diffent type for Historgrams are

1) Frequency & Top Frequency

2) Height based (Legacy)

3) Hybrid

Purpose

By default the optimizer assumes a uniform distribution of rows across the distinct values in a column. For columns that contain data skew (a nonuniform distribution of data within the column), a histogram enables the optimizer to generate accurate cardinality estimates for filter and join predicates that involve these columns.
For example, a California-based book store ships 95% of the books to California, 4% to Oregon, and 1% to Nevada. The book orders table has 300,000 rows. A table column stores the state to which orders are shipped. A user queries the number of books shipped to Oregon. Without a histogram, the optimizer assumes an even distribution of 300000/3 (the NDV is 3), estimating cardinality at 100,000 rows. With this estimate, the optimizer chooses afull table scan. With a histogram, the optimizer calculates that 4% of the books are shipped to Oregon, and chooses an index scan.

SQL> select TABLE_NAME, COLUMN_NAME, ENDPOINT_NUMBER, ENDPOINT_VALUE , ENDPOINT_ACTUAL_VALUE, ENDPOINT_REPEAT_COUNT, scope from user_tab_histograms where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID' order by 3;

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ENDPOINT_REPEAT_COUNT SCOPE

-------------------- -------------------- --------------- ------------------------ -------------------- --------------------- -------

BOOK_SALES SOLD_TO_COUNTRY_ID 18000 1001 1001 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 30000 1002 1002 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 40000 1003 1003 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 50000 1004 1004 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 65000 1005 1005 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 75000 1006 1006 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 80000 1007 1007 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 90000 1008 1008 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 93000 1009 1009 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 95000 1010 1010 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 100000 1011 1011 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 130000 1012 1012 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 160000 1013 1013 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 160001 1014 1014 0 SHARED

SQL> select SOLD_TO_COUNTRY_ID, count(*) from BOOK_SALES group by SOLD_TO_COUNTRY_ID order by 1;

SOLD_TO_COUNTRY_ID COUNT(*)

------------------ ----------

1001 18000

1002 12000

1003 10000

1004 10000

1005 15000

1006 10000

1007 5000

1008 10000

1009 3000

1010 2000

1011 5000

1012 30000

1013 30000

1014 1

14 rows selected.

Popular values

a) Optimizer checks whether it a popular values by first checking whether it is a end point value of a bucket. If so the optimizer subtracts the end point number for the previous bucket from the end point number of the current bucket, if the value is greater than 1 then it is popular.

Eg. (popular)

SQL> select SOLD_TO_COUNTRY_ID, count(*) from BOOK_SALES where SOLD_TO_COUNTRY_ID = 1010 group by SOLD_TO_COUNTRY_ID;

SOLD_TO_COUNTRY_ID COUNT(*)

------------------ ----------

1010 2000

Optimizer uses the below formula for calculating cardinality for popular values.

Cardinally of popular values = number of rows in table * (number of end point spanned by this value / total num of endpoints).

SQL> select count(*) from BOOK_SALES;

COUNT(*)

----------

160001

Num. of rows in table = 160001

Num. Of end points spanned by this value (1010) is = (95000 – 93000) = 2000

Total number of end points is = 160001

then

C = 160001 * ( 2000/160001) = 2000

Thus the optimizer derives the cardinlity.

Eg. Non popular

SQL> select SOLD_TO_COUNTRY_ID, count(*) from BOOK_SALES where SOLD_TO_COUNTRY_ID = 1014 group by SOLD_TO_COUNTRY_ID;

SOLD_TO_COUNTRY_ID COUNT(*)

------------------ ----------

1014 1

Cardinality estimate of non popular = number of rows in table * density

SQL> select table_name, COLUMN_NAME, NUM_DISTINCT, DENSITY, HISTOGRAM from user_tab_col_statistics where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID';

TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM

-------------------- -------------------- ------------ ---------- ---------------

BOOK_SALES SOLD_TO_COUNTRY_ID 14 .000003125 FREQUENCY

Optimizers calculates density using the factors such as number of buckets and NDV, density is expressed in decimal number between

0 and 1 , value close to 1 indicates that optimizer expects many rows to be returned and close to 0 indicates optimizer to expect few rows to be returned.

C = 160001 * .000003125

SQL> select 160001*DENSITY from user_tab_col_statistics where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID';

160001*DENSITY

--------------

Creating top frequency

IF a small number of values occupies most of the rows, then creating a frequency histogram on this small set of values is useful

Even when the NDV is greater than n (number of buckets). To create better quality histogram on popular values the optimizer ignores

The non popular values. The database create top when following conditions are met.

a) The data set NDV should be more than number of buckets specified.

b) The percentage of rows occupied by the top n frequent values should be >= threashld p where p = (1-(1/n))*100

N= number of buckets.

c) The estimate parameter is set to AUTO_SAMPLE_SIZE in dbms_stats.

Eg.

SQL> select SOLD_TO_COUNTRY_ID, count(*) from book_sales group by SOLD_TO_COUNTRY_ID ORDER BY 1;

SOLD_TO_COUNTRY_ID COUNT(*)

------------------ ----------

1001 18000

1002 12000

1003 10000

1004 10000

1005 15000

1006 10000

1007 5000

1008 10000

1009 3000

1010 2000

1011 5000

1012 30000

1013 30000

1014 1

14 rows selected.

In the above the number of distinct value is 14, lets create a histogram by specifying 10 buckets.

SQL> exec dbms_stats.gather_table_stats('MANZOOR','BOOK_SALES', method_opt => 'FOR COLUMNS SOLD_TO_COUNTRY_ID SIZE 10');

PL/SQL procedure successfully completed.

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ENDPOINT_REPEAT_COUNT SCOPE

-------------------- -------------------- --------------- ------------------------ -------------------- --------------------- -------

BOOK_SALES SOLD_TO_COUNTRY_ID 18000 1001 1001 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 30000 1002 1002 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 40000 1003 1003 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 50000 1004 1004 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 65000 1005 1005 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 75000 1006 1006 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 85000 1008 1008 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 115000 1012 1012 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 145000 1013 1013 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 145001 1014 1014 0 SHARED

10 rows selected.

SQL> select table_name, COLUMN_NAME, NUM_DISTINCT, DENSITY, HISTOGRAM from user_tab_col_statistics where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID';

TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM

-------------------- -------------------- ------------ ---------- ---------------

BOOK_SALES SOLD_TO_COUNTRY_ID 14 .000003125 TOP-FREQUENCY

As per above we could see that the TOP-FREQUNCY histogram has been created by using the popular values. The see whether the

Conditions are met.

1) NDV should > number of buckets = TRUE i.e ( NDV is 14 and number of buckets specified is 10).

2) Percentage of rows occupied by top frequency values should be greater than or equal to threashold p ( p = (1-(1/n))*100)

p=(1-1/10))*100

p=90

As per the formula the percentage of rows occupied by the top frequenct values should be >= 90% . Lets check.

SQL> select (a.top_frequent_sum/b.total_sum)*100 "Top Frequenct values %" from

( select sum(full_count) top_frequent_sum from

(select count(*) full_count from book_sales where SOLD_TO_COUNTRY_ID in

(select ENDPOINT_VALUE from user_tab_histograms where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID')

)) a,

(select count(*) total_sum from book_sales ) b

Top Frequenct values %

----------------------

90.6250586

As we could see that the top frequecnt values occupied 90.62 % of rows in the table, hence this condition is satisfied.

3) Estimate percent should be AUTO_SAMPLE_SIZE, we have not specified it, it has taken the default values. Lets check the

Default value.

SQL> var A VARCHAR2(30);

SQL> exec :A := dbms_stats.get_prefs (PNAME => 'ESTIMATE_PERCENT', OWNNAME => 'MANZOOR', TABNAME => 'BOOK_SALES');

PL/SQL procedure successfully completed.

SQL> print :A

------------------------------------------------------------------

DBMS_STATS.AUTO_SAMPLE_SIZE

Condition satisified.

HeightBased Histogram (Legacy) in 12c

The default value of number of buckets is 254 when it is not specified using method_opt parameter. Before 12c oracle

Creates a height based histrogram for which the NDV is > number of buckets. This type of histrogram was useful for range

Predicates and equality predicats on values that appear as endpoint in at least 2 buckets.

Eg.

SQL> exec dbms_stats.delete_table_stats('MANZOOR','BOOK_SALES');

PL/SQL procedure successfully completed.

To simulate Oracle Database 11g behavior, which is necessary to create a height-based histogram, set estimate_percent to a nondefault value. If you specify a nondefault percentage, then the database creates frequency or height-balanced histograms.

SQL> exec dbms_stats.gather_table_stats

('MANZOOR','BOOK_SALES',METHOD_OPT => 'FOR COLUMNS SOLD_TO_COUNTRY_ID SIZE 10', ESTIMATE_PERCENT => 100);

PL/SQL procedure successfully completed.

TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ENDPOINT_REPEAT_COUNT SCOPE

-------------------- -------------------- --------------- -------------- -------------------- --------------------- -------

BOOK_SALES SOLD_TO_COUNTRY_ID 1 1001 1001 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 2 1003 1003 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 3 1004 1004 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 4 1005 1005 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 5 1007 1007 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 6 1011 1011 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 8 1012 1012 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 9 1013 1013 0 SHARED

BOOK_SALES SOLD_TO_COUNTRY_ID 10 1014 1014 0 SHARED

9 rows selected.

SQL> select table_name, COLUMN_NAME, NUM_DISTINCT, DENSITY, HISTOGRAM from user_tab_col_statistics where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID';

TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM

-------------------- -------------------- ------------ ---------- ---------------

BOOK_SALES SOLD_TO_COUNTRY_ID 14 .098844776 HEIGHT BALANCED

SQL> select SOLD_TO_COUNTRY_ID, count(*) from book_sales group by SOLD_TO_COUNTRY_ID ORDER BY 1;

SOLD_TO_COUNTRY_ID COUNT(*)

------------------ ----------

1001 18000

1002 12000

1003 10000

1004 10000

1005 15000

1006 10000

1007 5000

1008 10000

1009 3000

1010 2000

1011 5000

1012 30000

1013 30000

1014 1

Refer Me for Oracle

Tuesday, 27 October 2015

Histogram Example

1 comment:

About Me