常见latch的争用分析以及解决办法

chenyu 发表于 2010-11-12 13:11:29

一、buffer cache 中的CBC latch（cache buffer chains）,这表名在buffer中查找数据块时发生了latch争用，则有两种可能，一种是系统的latch确实不够用，二是大量进程集中访问某些数据块(热点快)导致latch争用，现实中可能大多是这种情况；怎么判断到底是那种情况导致latch争用呢？我们可以查询v$latch_children视图，如果发现latch平均分布在各个子latch中说明latch不够可以增加cpu（增加CPU可以增加latch）或修改隐藏参数,如果分布不均匀说明出现热点块，解决热点块可以通过下面的两个方法：1、对这个热点快对象进行分区；2、增加热点快对象的pctfree，把块分散开；为什么分区和增加pctfree可以解决热点块的问题呢？原因如下：
对表分区:假如一个表中有city字段，某个块中有北京、天津、青岛三个city的数据，这时有三个session分别访问这个块中三个城市的数据行，哪么这个块就会被三个session访问，如果对这张表city列进行范围分区，哪么北京、天津、青岛的数据肯定分布在不同的分区（不同的数据块）中，这时三个session就会访问不同的数据块了，从而解决热点块的问题；
增加pctfree:假如原pctfree由10%增加到20%,哪么数据块中用来存放数据的空间就减少10%，原来存放在这10%中的数据就会放到其他的数据块中，这样假如原来的session访问这10%的数据此时就会访问其他的数据块，从而解决热点块的问题；
解决热点快的流程
1、查找v$system_event查看系统等待事件，这里假设发现cbc latch争用严重
2、查找v$latch_children找到那些子latch争用比较厉害
select addr,name,gets,spin_gets,sleeps,hash from v$latch where name = 'cache buffers chains' order by sleeps
3、查找热点块属于那个对象
select t1.object_name,t2.file#,t2.dbablk
from dba_objects t1,(select file#,dbablk,obj,addr,hladdr from x$bh where hladdr = '？') t2 where t1.object_id = t2.obj;//？就是上一步查到的addr
4、定位导致热点快的SQL语句
select sql_text,executions from v$sqlarea where sql_text like '%？%';//？就是上一步查询到的object_name
我们根据SQL语句的执行次数就应该能够判断出是那些SQL语句导致的热点快
模拟发生热点块时的检查步骤：
SQL> select rowid,ename from emp;
ROWID          ENAME
------------------ ----------
AAACYVAAEAAAAA6AAA SMITH
AAACYVAAEAAAAA6AAB ALLEN
AAACYVAAEAAAAA6AAC WARD
AAACYVAAEAAAAA6AAD JONES
AAACYVAAEAAAAA6AAE MARTIN
AAACYVAAEAAAAA6AAF BLAKE
AAACYVAAEAAAAA6AAG CLARK
AAACYVAAEAAAAA6AAH SCOTT
AAACYVAAEAAAAA6AAI KING //查询这行数据
AAACYVAAEAAAAA6AAJ TURNER
AAACYVAAEAAAAA6AAK ADAMS
AAACYVAAEAAAAA6AAL JAMES
AAACYVAAEAAAAA6AAM FORD
AAACYVAAEAAAAA6AAN MILLER
14 rows selected.
--find.sql
declare
name varchar2(100);
begin
loop
select ename into name from emp where rowid = 'AAACYVAAEAAAAA6AAI';
end loop;
end;
/
exit;
--run.sql
sqlplus -s scott/tiger@PROD @find.sql &
sqlplus -s scott/tiger@PROD @find.sql &
sqlplus -s scott/tiger@PROD @find.sql
chmod +x run.sql
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: cache buffers chains                                           99       164
运行一段时间后
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: cache buffers chains                                           477       835 //latch争用比较严重
--查看子latch，得到地址
select addr,latch#,child#,name,hash,gets,misses,sleeps,spin_gets from v$latch_children where name = 'cache buffers chains' order by sleeps;
ADDR       LATCH# CHILD# NAME                               HASH    GETS MISSES SLEEPSSPIN_GETS
-------- ---------- ---------- ------------------------------ ---------- ---------- ---------- ---------- ----------
31D476B0    122    870 cache buffers chains       3563305585 6230630    63862    908    63095
--查看热点快属于那个对象
select t1.object_name from dba_objects t1,x$bh t2 where t1.object_id = t2.obj and t2.hladdr='31D476B0';
OBJECT_NAME
-----------------------
I_OBJ1
C_OBJ#_INTCOL#
I_H_OBJ#_COL#
SMON_SCN_TO_TIME
WRI$_ADV_TASKS_IDX_01
I_WRI$_OPTSTAT_H_ST
I_WRI$_OPTSTAT_H_ST
EMP
BIGTAB
BIGTAB
BIGTAB
BIGTAB
WRH$_LATCH
WRH$_WAITSTAT
WRH$_TABLESPACE_STAT
从上面的结果发现emp和bigtab两张表比较可疑，假如现在我们不知道是那张表发生热点快
--查询导致热点快的SQL
select sql_text,executions from v$sqlarea where upper(sql_text) like '%EMP%' or upper(sql_text) like '%BIGTAB%' order by executions;
SQL_TEXT                                                 EXECUTIONS
------------------------------------------------------------ ----------
declare name varchar2(100); begin loop select ename       0
into name from emp where rowid = 'AAACYVAAEAAAAA6AAI'; end
loop; end;
select sql_text,executions from v$sqlarea where upper(sql_te       2
xt) like '%EMP%' or upper(sql_text) like '%BIGTAB%' order by
executions
select value$ from props$ where name='DEFAULT_TEMP_TABLESPAC       3
E'
select sql_id,TEMPSEG_SIZE,MAX_MEM_USED,WORK_AREA_SIZE from       3
v$sql_workarea_active
select dbms_metadata.get_ddl('TABLESPACE','TEMP') from dual       3
select sql_id,MAX_TEMPSEG_SIZE,LAST_TEMPSEG_SIZE from v$sql_       6
workarea order by MAX_TEMPSEG_SIZE,LAST_TEMPSEG_SIZE
select obj#, dataobj#, part#, hiboundlen, hiboundval, ts#, f    184
ile#, block#, pctfree$, pctused$, initrans, maxtrans, flags,
analyzetime, samplesize, rowcnt, blkcnt, empcnt, avgspc, ch
ncnt, avgrln, length(bhiboundval), bhiboundval from tabpart$
where bo# = :1 order by part#
select t.ts#,t.file#,t.block#,nvl(t.bobj#,0),nvl(t.tab#,0),t    16165
.intcols,nvl(t.clucols,0),t.audit$,t.flags,t.pctfree$,t.pctu
sed$,t.initrans,t.maxtrans,t.rowcnt,t.blkcnt,t.empcnt,t.avgs
pc,t.chncnt,t.avgrln,t.analyzetime,t.samplesize,t.cols,t.pro
perty,nvl(t.degree,1),nvl(t.instances,1),t.avgspc_flb,t.flbc
nt,t.kernelcols,nvl(t.trigflag, 0),nvl(t.spare1,0),nvl(t.spa
re2,0),t.spare4,t.spare6,ts.cachedblk,ts.cachehit,ts.logical
read from tab$ t, tab_stats$ ts where t.obj#= :1 and t.obj#
= ts.obj# (+)
SELECT ENAME FROM EMP WHERE ROWID = 'AAACYVAAEAAAAA6AAI'    13374107
至此发现是由于SELECT ENAME FROM EMP WHERE ROWID = 'AAACYVAAEAAAAA6AAI'这个SQL语句导致的CBC latch的争用，下一步就可以按照上面的步骤进行处理了；
二、shared pool latch: 从shared pool中找空闲空间时需要获取,排他latch，如果这个latch比较高，大部分原因是因为SQL没共享,这时一般情况下hard parse也相应比较高，解决方法是查询v$sql或v$sqlarea，查询是那些SQL。
模拟shared pool latch的测试
SQL> select count(*) from bigtab;
COUNT(*)
----------
1000000
SQL> desc bigtab;
Name                                  Null? Type
----------------------------------------- -------- ----------------------------
ID                                              NUMBER
NAME                                           VARCHAR2(32)
--查看开始时的shared pool情况
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: shared pool                                                    318       161
--运行程序
declare
name varchar2(50);
sqlstr varchar2(100);
begin
for i in 1..1000000000 loop
sqlstr := 'select name from bigtab where id = ' || i;
execute immediate sqlstr into name; //未绑定变量,模拟硬解析
end loop;
end;
/
--再次查看shared pool情况，发现有争用但是不明显
select event,TOTAL_WAITS,TIME_WAITED from v$system_event where wait_class!='Idle' order by TIME_WAITED;
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: shared pool                                                    324       162
--当前数据库设置,sga自动管理，可能是由于sga比较大，SQL有比较简单所以shared pool的争用不明显
SQL> show parameter shared_pool_size
NAME                               TYPE    VALUE
------------------------------------ ----------- ------------------------------
shared_pool_size                   big integer 0
SQL> show parameter sga_target
NAME                               TYPE    VALUE
------------------------------------ ----------- ------------------------------
sga_target                         big integer 300M
--换一种分析思路，如果shared pool争用比较厉害，那么一般是由于没有使用绑定变量导致大量硬解析造成的，这时我们查询系统硬解析的情况
SQL> select name,value from v$sysstat where name like '%parse%hard%';
NAME                            VALUE
------------------------------ ----------
parse count (hard)             142897
SQL> /
NAME                            VALUE
------------------------------ ----------
parse count (hard)             143182 //比较明显
SQL> /
NAME                            VALUE
------------------------------ ----------
parse count (hard)             143673 //比较明显
--查询是由于那些SQL语句导致的硬解析，从结果中发现有大量的select name from bigtab where id = ？的语句没有使用绑定变量，至此问题已经找到；
select sql_text,executions from v$sqlarea order by sql_text,executions desc;
SQL_TEXT                                                 EXECUTIONS
------------------------------------------------------------ ----------
......
select name from bigtab where id = 93510                            1
select name from bigtab where id = 93511                            1
select name from bigtab where id = 93512                            1
select name from bigtab where id = 93513                            1
select name from bigtab where id = 93514                            1
select name from bigtab where id = 93515                            1
.......

三、library cache latch:产生共享SQL和找共享SQL时需要获取,排他latch;
如果shared pool latch不高但是library cache latch很高时表示有很多相同的SQL要执行(软解析)，需要到shared_pool的链中查找，这时查询v$sqlarea视图看那个SQL的executions比较大；
模拟library cache latch测试
--把session_cached_cursors设置为0，确保发生软解析
SQL> alter system set session_cached_cursors=0 scope=spfile;
System altered.
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area314572800 bytes
Fixed Size                1219184 bytes
Variable Size          88081808 bytes
Database Buffers       218103808 bytes
Redo Buffers             7168000 bytes
Database mounted.
Database opened.
SQL> show parameter session_cached_cursors;
NAME                               TYPE    VALUE
------------------------------------ ----------- ------------------------------
session_cached_cursors             integer 0
--表进行建立索引
SQL> select index_name from user_indexes where table_name = 'BIGTAB';
INDEX_NAME
------------------------------
IDX_BIGTAB_ID
--find.sql
declare
str varchar2(50);
n integer;
begin
for i in 1..10 loop
n := floor(dbms_random.value*1000000);
select name into str from bigtab where id = n;
end loop;
end;
/
exit;
--run.sql
sqlplus -s scott/tiger@prod @find.sql &
sqlplus -s scott/tiger@prod @find.sql &
sqlplus -s scott/tiger@prod @find.sql
chmod +x run.sql
--开始执行程序之前的library cache情况
SQL> select event,TOTAL_WAITS,TIME_WAITED from v$system_event where event like '%library cache%';
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
library cache load lock                                                 2       1
--执行程序
./run.sql
--一段时间后再次查询library cache争用情况
SQL> select event,TOTAL_WAITS,TIME_WAITED from v$system_event where event like '%library cache%';
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: library cache                                                    1       3
library cache pin                                                       3       11
library cache load lock                                                 2       1
/
EVENT                                                          TOTAL_WAITS TIME_WAITED
---------------------------------------------------------------- ----------- -----------
latch: library cache                                                    23       52//争用出现
latch: library cache pin                                                 5       12
library cache pin                                                       3       11
library cache load lock                                                 2       1
--查询子latch的地址和详细的争用情况
SQL> select addr,latch#,child#,name,hash,gets,misses,sleeps,spin_gets from v$latch_children where name = 'library cache' order by sleeps;
ADDR       LATCH# CHILD# NAME                               HASH    GETS MISSES SLEEPSSPIN_GETS
-------- ---------- ---------- ------------------------------ ---------- ---------- ---------- ---------- ----------
313C3B3C    214       1 library cache                3055961779    24242       18       0       18
313C3A74    214       3 library cache                3055961779    41359       31       2       29
313C3AD8    214       2 library cache                3055961779 2707431    15096       79    15020 //争用显著
--查询执行次数多的SQL语句
select * from (select sql_text,executions from v$sqlarea order by executions desc) where rownum <= 10;
SQL_TEXT                                                 EXECUTIONS
------------------------------------------------------------ ----------
SELECT NAME FROM BIGTAB WHERE ID = :B1                      2457736 //发现问题，实有与这个SQL引起的library cache latch争用
select /*+ rule */ bucket_cnt, row_cnt, cache_cnt, null_cnt,    1960
timestamp#, sample_size, minimum, maximum, distcnt, lowval,
hival, density, col#, spare1, spare2, avgcln from hist_head
$ where obj#=:1 and intcol#=:2
select /*+ rule */ bucket, endpoint, col#, epvalue from hist    744
grm$ where obj#=:1 and intcol#=:2 and row#=:3 order by bucke
t
select timestamp from fixed_obj$ where obj# = :1                597
select type#,blocks,extents,minexts,maxexts,extsize,extpct,u    555
ser#,iniexts,NVL(lists,65535),NVL(groups,65535),cachehint,hw
mincr, NVL(spare1,0),NVL(scanhint,0) from seg$ where ts#=:1
and file#=:2 and block#=:3
select intcol#,nvl(pos#,0),col#,nvl(spare1,0) from ccol$ whe    492
re con#=:1
select o.owner#,o.name,o.namespace,o.remoteowner,o.linkname,    324
o.subname,o.dataobj#,o.flags from obj$ o where o.obj#=:1
select obj#,type#,ctime,mtime,stime,status,dataobj#,flags,oi    320
d$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and
namespace=:3 and remoteowner is null and linkname is null a
nd subname is null
select pos#,intcol#,col#,spare1,bo#,spare2 from icol$ where       243
obj#=:1
select count(*) from sys.job$ where (next_date > sysdate) an    204
d (next_date < (sysdate+5/86400))

kevin.zhang 发表于 2010-11-12 15:05:16

:):):):)不错，顶一个

oraunix 发表于 2010-11-12 17:42:17

上面是对三种latch的解决方案。
这三种latch是最常见的latch。
希望大家认真的阅读和理解，当然前提是大家对latch的工作机制以及两种pool的内存结构有比较清楚的理解。

小小草 发表于 2014-5-7 08:28:28

不错，顶一个学习一下:)

页: [1]

iiDBA's Archiver

常见latch的争用分析以及解决办法