Tuesday, 27 October 2015

Histogram Example

Histogram  (12c)
----------------

A histogram is a column statistics that provide the data distribution on a particular column.

Histogram type is chosen based on the NDV (number of distinct values) on that column.

Diffent type for Historgrams are

1)  Frequency & Top Frequency
2)  Height based (Legacy)
3)  Hybrid

Purpose
By default the optimizer assumes a uniform distribution of rows across the distinct values in a column. For columns that contain data skew (a nonuniform distribution of data within the column), a histogram enables the optimizer to generate accurate cardinality estimates for filter and join predicates that involve these columns.
For example, a California-based book store ships 95% of the books to California, 4% to Oregon, and 1% to Nevada. The book orders table has 300,000 rows. A table column stores the state to which orders are shipped. A user queries the number of books shipped to Oregon. Without a histogram, the optimizer assumes an even distribution of 300000/3 (the NDV is 3), estimating cardinality at 100,000 rows. With this estimate, the optimizer chooses afull table scan. With a histogram, the optimizer calculates that 4% of the books are shipped to Oregon, and chooses an index scan.

SQL> select TABLE_NAME, COLUMN_NAME, ENDPOINT_NUMBER, ENDPOINT_VALUE , ENDPOINT_ACTUAL_VALUE, ENDPOINT_REPEAT_COUNT, scope from user_tab_histograms where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID' order by 3;
 
TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER           ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ENDPOINT_REPEAT_COUNT SCOPE
-------------------- -------------------- --------------- ------------------------ -------------------- --------------------- -------
BOOK_SALES           SOLD_TO_COUNTRY_ID             18000                     1001 1001                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             30000                     1002 1002                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             40000                     1003 1003                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             50000                     1004 1004                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             65000                     1005 1005                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             75000                     1006 1006                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             80000                     1007 1007                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             90000                     1008 1008                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             93000                     1009 1009                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             95000                     1010 1010                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID            100000                     1011 1011                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID            130000                     1012 1012                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID            160000                     1013 1013                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID            160001                     1014 1014                                     0 SHARED

SQL> select SOLD_TO_COUNTRY_ID, count(*) from BOOK_SALES group by SOLD_TO_COUNTRY_ID order by 1;

SOLD_TO_COUNTRY_ID   COUNT(*)
------------------ ----------
              1001      18000
              1002      12000
              1003      10000
              1004      10000
              1005      15000
              1006      10000
              1007       5000
              1008      10000
              1009       3000
              1010       2000
              1011       5000
              1012      30000
              1013      30000
              1014          1

14 rows selected.

Popular values

a)  Optimizer checks whether it a popular values by first checking whether it is a end point value of a bucket. If so the optimizer subtracts the end point number for the previous bucket from the end point number of the current bucket, if the value is greater than 1 then it is popular.

Eg.  (popular)

SQL> select SOLD_TO_COUNTRY_ID, count(*) from BOOK_SALES where SOLD_TO_COUNTRY_ID = 1010 group by SOLD_TO_COUNTRY_ID;

SOLD_TO_COUNTRY_ID   COUNT(*)
------------------ ----------
              1010       2000



Optimizer uses the below formula for calculating cardinality for popular values.

Cardinally of popular values = number of rows in table * (number of end point spanned by this value / total num of endpoints).

SQL> select count(*) from BOOK_SALES;

  COUNT(*)
----------
    160001

Num. of rows in table = 160001
Num. Of end points spanned by this value (1010) is  = (95000 – 93000) = 2000
Total number of end points is = 160001

then

C = 160001 * ( 2000/160001) = 2000

Thus the optimizer derives the cardinlity.






Eg. Non popular


SQL> select SOLD_TO_COUNTRY_ID, count(*) from BOOK_SALES where SOLD_TO_COUNTRY_ID = 1014 group by SOLD_TO_COUNTRY_ID;

SOLD_TO_COUNTRY_ID   COUNT(*)
------------------ ----------
              1014          1


Cardinality estimate of non popular = number of rows in table * density

SQL> select table_name, COLUMN_NAME, NUM_DISTINCT, DENSITY, HISTOGRAM from user_tab_col_statistics where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID';

TABLE_NAME           COLUMN_NAME          NUM_DISTINCT    DENSITY HISTOGRAM
-------------------- -------------------- ------------ ---------- ---------------
BOOK_SALES           SOLD_TO_COUNTRY_ID             14 .000003125 FREQUENCY

Optimizers calculates density using the factors such as number of buckets and NDV, density is expressed in decimal number between 
0  and 1 , value close to 1 indicates that optimizer expects many rows to be returned and close to 0 indicates optimizer to expect few rows to be returned.

C = 160001 * .000003125

SQL> select 160001*DENSITY from user_tab_col_statistics where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID';

160001*DENSITY
--------------
            .5


Creating top frequency

IF a small number of values occupies most of the rows, then creating a frequency histogram on this small set of values is useful
Even when the NDV is greater than n (number of buckets). To create better quality histogram on popular values the optimizer ignores
The non popular values. The database create top when following conditions are met.

a)  The data set NDV should be more than number of buckets specified.
b)  The percentage of rows occupied by the top n frequent values should be >= threashld p where p = (1-(1/n))*100
N= number of buckets.
c)  The estimate parameter is set to AUTO_SAMPLE_SIZE in dbms_stats.

Eg.
SQL> select SOLD_TO_COUNTRY_ID, count(*) from book_sales group by SOLD_TO_COUNTRY_ID ORDER BY 1;

SOLD_TO_COUNTRY_ID   COUNT(*)
------------------ ----------
              1001      18000
              1002      12000
              1003      10000
              1004      10000
              1005      15000
              1006      10000
              1007       5000
              1008      10000
              1009       3000
              1010       2000
              1011       5000
              1012      30000
              1013      30000
              1014          1

14 rows selected.


In the above the number of distinct value is 14, lets create a histogram by specifying 10 buckets.


SQL> exec dbms_stats.gather_table_stats('MANZOOR','BOOK_SALES', method_opt => 'FOR COLUMNS SOLD_TO_COUNTRY_ID SIZE 10');

PL/SQL procedure successfully completed.

SQL> select TABLE_NAME, COLUMN_NAME, ENDPOINT_NUMBER, ENDPOINT_VALUE , ENDPOINT_ACTUAL_VALUE, ENDPOINT_REPEAT_COUNT, scope from user_tab_histograms where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID' order by 3;
 
TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER           ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ENDPOINT_REPEAT_COUNT SCOPE
-------------------- -------------------- --------------- ------------------------ -------------------- --------------------- -------
BOOK_SALES           SOLD_TO_COUNTRY_ID             18000                     1001 1001                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             30000                     1002 1002                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             40000                     1003 1003                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             50000                     1004 1004                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             65000                     1005 1005                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             75000                     1006 1006                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID             85000                     1008 1008                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID            115000                     1012 1012                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID            145000                     1013 1013                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID            145001                     1014 1014                                     0 SHARED

10 rows selected.

SQL> select table_name, COLUMN_NAME, NUM_DISTINCT, DENSITY, HISTOGRAM from user_tab_col_statistics where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID';

TABLE_NAME           COLUMN_NAME          NUM_DISTINCT    DENSITY HISTOGRAM
-------------------- -------------------- ------------ ---------- ---------------
BOOK_SALES           SOLD_TO_COUNTRY_ID             14 .000003125 TOP-FREQUENCY

As per above we could see that the TOP-FREQUNCY histogram has been created by using the popular values. The see whether the
Conditions are met.




1)  NDV should > number of buckets  = TRUE  i.e  ( NDV is 14 and number of buckets specified is 10).
2)  Percentage of rows occupied by top frequency values should be greater than or equal to threashold p ( p = (1-(1/n))*100)
       p=(1-1/10))*100
       p=90
    As per the formula the percentage of rows occupied by the top frequenct values should be >= 90% . Lets check.

SQL>  select (a.top_frequent_sum/b.total_sum)*100 "Top Frequenct values %" from
      (   select sum(full_count) top_frequent_sum from
          (select count(*) full_count from book_sales where SOLD_TO_COUNTRY_ID in
          (select ENDPOINT_VALUE from user_tab_histograms where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID')
                )) a,
     (select count(*) total_sum from book_sales ) b
/

Top Frequenct values %
----------------------
            90.6250586

As we could see that the top frequecnt values occupied 90.62 % of rows in the table, hence this condition is satisfied.

3)  Estimate percent should be AUTO_SAMPLE_SIZE, we have not specified it, it has taken the default values. Lets check the
Default value.

SQL> var A VARCHAR2(30);
SQL> exec :A := dbms_stats.get_prefs (PNAME => 'ESTIMATE_PERCENT', OWNNAME => 'MANZOOR', TABNAME => 'BOOK_SALES');

PL/SQL procedure successfully completed.

SQL> print :A

A
------------------------------------------------------------------
DBMS_STATS.AUTO_SAMPLE_SIZE

Condition satisified.



HeightBased Histogram (Legacy) in 12c

The default value of number of buckets is 254 when it is not specified using method_opt parameter. Before 12c oracle
Creates a height based histrogram for which the NDV is > number of buckets. This type of histrogram was useful for range
Predicates and equality predicats on values that appear as endpoint in at least 2 buckets.


Eg.

SQL> exec dbms_stats.delete_table_stats('MANZOOR','BOOK_SALES');

PL/SQL procedure successfully completed.


To simulate Oracle Database 11g behavior, which is necessary to create a height-based histogram, set estimate_percent to a nondefault value. If you specify a nondefault percentage, then the database creates frequency or height-balanced histograms.

SQL> exec dbms_stats.gather_table_stats
('MANZOOR','BOOK_SALES',METHOD_OPT => 'FOR COLUMNS SOLD_TO_COUNTRY_ID SIZE 10', ESTIMATE_PERCENT => 100);

PL/SQL procedure successfully completed.

SQL> select TABLE_NAME, COLUMN_NAME, ENDPOINT_NUMBER, ENDPOINT_VALUE , ENDPOINT_ACTUAL_VALUE, ENDPOINT_REPEAT_COUNT, scope from user_tab_histograms where table_name = 'BOOK_SALES' and column_name = 'SOLD_TO_COUNTRY_ID' order by 3;

TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ENDPOINT_REPEAT_COUNT SCOPE
-------------------- -------------------- --------------- -------------- -------------------- --------------------- -------
BOOK_SALES           SOLD_TO_COUNTRY_ID                 1           1001 1001                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                 2           1003 1003                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                 3           1004 1004                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                 4           1005 1005                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                 5           1007 1007                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                 6           1011 1011                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                 8           1012 1012                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                 9           1013 1013                                     0 SHARED
BOOK_SALES           SOLD_TO_COUNTRY_ID                10           1014 1014                                     0 SHARED

9 rows selected.

SQL>  select table_name, COLUMN_NAME, NUM_DISTINCT, DENSITY, HISTOGRAM from user_tab_col_statistics where table_name = 'BOOK_SALES' and  column_name = 'SOLD_TO_COUNTRY_ID';

TABLE_NAME           COLUMN_NAME          NUM_DISTINCT    DENSITY HISTOGRAM
-------------------- -------------------- ------------ ---------- ---------------
BOOK_SALES           SOLD_TO_COUNTRY_ID             14 .098844776 HEIGHT BALANCED

SQL> select SOLD_TO_COUNTRY_ID, count(*) from book_sales group by SOLD_TO_COUNTRY_ID ORDER BY 1;

SOLD_TO_COUNTRY_ID   COUNT(*)
------------------ ----------
              1001      18000
              1002      12000
              1003      10000
              1004      10000
              1005      15000
              1006      10000
              1007       5000
              1008      10000
              1009       3000
              1010       2000
              1011       5000
              1012      30000
              1013      30000
              1014          1



1 comment:

  1. CASINO 100% RANGE BONUS CODES - JTM Hub
    The Casino is 서울특별 출장마사지 one of the most popular destinations around 이천 출장안마 in 통영 출장샵 the United States. You can't miss the excitement of gaming excitement at the 원주 출장샵 Casino in 김해 출장안마

    ReplyDelete