100 Things You Probably Didn't Know About Oracle
Published by Arup
Recently, while delivering a presentation on Cache Fusion at the New York Oracle Users Group
(www.nyoug.org), the regional user group where I have been a long-time member, I was surprised to
hear from many participants some beliefs they had held for a long time that were completely wrong. I
always thought these facts were as obvious as they come; but of course I was dead wrong. What was even
more surprising was that most of these believers were veterans in Oracle Database technologies; not
newbies. Part of the problem, I think, lies with a system that focuses on execution rather than
learning, and part of it is due to the lack of clear documentation. During that discussion some encouraged
me to write about these. I immediately agreed it was a great idea and merited serious attention. Here
is the product: my attempt at explaining some of the “mysteries” of how Oracle Database operates. I
will cover 100 such nuggets of information, roughly once a week.
Before you start, however, I would like to bring your attention to this important point. You may already
be aware of these facts. I was; so it is reasonable to believe that a vast majority would be as well. Under
no circumstances am I claiming these to be ground breaking or awe-inspiring. If you are already
familiar with this specific knowledge, I am not at all surprised. Please feel free to skip. For those who
read on, I hope you find these helpful and will take a moment to write to me how you felt.
Part 1: Myth of Commit Causing Buffer to be Flushed to the Disk
Consider a scenario: In the EMP table I updated my salary from 1000 to 2000; and committed
immediately. The very instant after I issued the commit, if I check the datafile on the disk of the
tablespace where this table is located, which value should I see: 1000 or 2000? (Remember, the value
was committed.)
Did you answer 2000 (perhaps because it was committed)? If so, then consider a normal application
where commits are issued up to tens of thousands of times every minute. In a traditional database the
weakest link in the chain is always I/O. If Oracle issued an update of the datafile every time someone
commits, it would grind to a halt.
Did you answer 1000? Well, in that case, consider a case when the instance crashes. The datafile would
have had 1000; not 2000 – the value that was committed. In such a case the instance must bring back
the value committed (2000, in this case) to the datafile. How?
Let’s examine a different scenario. Suppose I did not issue a commit after the update (perhaps I was
not sure of the implication of giving myself a pay hike or perhaps I had a pang of conscience). I left the
session as is and left for the day. The next day I was sick and didn’t come to work. 24 hours passed
since I updated the record. At that point, if someone reads the datafile, what value would they see –
1000 or 2000?
Did you answer 1000 – a logical choice since the read consistency model of Oracle Database guarantees
that the other sessions will see the pre-change data for the un-committed transactions?
Question #3 in this scenario: if you check the redo log file (not the datafile), what value will you find
there – 1000 or 2000? Remember, it has not been committed. Did you answer 1000? It sort of makes
sense; the changes are not committed so there is no reason for them to be in the redo log file, which is
a very important part of the recovery process. If you answered 2000, then how would you explain the
recovery process? In case of instance failure, the recovery must read the redo log file and since the
transaction was not committed, it must roll the value back to the previous one, 1000. How would it do that
if the redo log file contains 2000, not 1000?
The answers, if you can’t wait any longer: 1000 for the first question, 2000 for the second and 2000 for
the third. How so? Please read on.
To understand the mechanics of the process, let’s go over the buffer management process of the
Oracle database. It’s a rudimentary detail but is quite vital on the path to understanding the myth here.
Consider a very small table in an equally small tablespace we created:
SQL> create tablespace testts datafile '/tmp/testts_01.dbf' size 1M;
SQL> create table mythbuster1 (col1 varchar2(200)) tablespace testts;
Insert a row:
SQL> insert into mythbuster1 values ('ORIGINAL_VALUE');
SQL> commit;
Shutdown and restart the database so that the buffer cache is completely devoid of this table. You can
also issue ALTER SYSTEM FLUSH BUFFER_CACHE; but I want to make sure all traces of this table (and
value of the column inside) vanish from all memory areas – buffer cache, shared pool, PGA, whatever.
You can now check the presence of the value in the datafile:
$ strings /tmp/testts_01.dbf
The value is right there. Now suppose a user issues a statement like this from SQL*Plus:
SQL> select * from mythbuster1;
Oracle creates a process, called the “server process”, on behalf of this user session to service the
request from the session. On Unix and Unix-like operating systems this process is named oracle followed
by the instance SID. Here is how you find it out:
$ ps -aef | grep sqlplus
14257 14214 0 13:42 pts/2
$ ps -aef | grep 14257
14257 14214 0 13:42 pts/2
14258 14257 0 13:42 ?
The process 14258 is the server process. The SQL*Plus process is known as the user process, which can
be any process a user executes such as a Java program, a Pro*C code, a TOAD session and so on. It’s
the server process that handles all the interaction with the Oracle database; not the user process. This
is why Oracle database interaction is said to be based on a Two Task Architecture; there are
always two tasks – the user task that a regular user has written and the server task that performs the
database operations. This is an important concept established during the early foundations of the
Oracle database to protect the database from errant code in the user task introduced either
maliciously or inadvertently.
The server process then identifies the block the row exists in. Since the database instance just came up
the buffer cache is empty and the block will not be found. Therefore the server process issues a read
call from the datafile for that specific block. The block is read from the disk to the buffer cache. Until
the loading of the block from the disk to the buffer cache is complete, the session waits with the event
– db file scattered read. In this very case the session issues a full table scan. Had it performed an
index scan, the session would have waited with the event db file sequential read. [I know, I know – it
seems to defy conventional logic a little bit. I would have assumed index scan to be named scattered
reads and full table scans to be sequential].
Once this process is complete, the buffer cache holds the copy of the block of the table mythbuster1.
Subsequent sessions, if they select from the table, will simply get the data from this buffer; not from
the disk.
Now, suppose the session issues the statement:
SQL> update mythbuster1 set col1 = 'CHANGED_VALUE';
SQL> commit;
Immediately afterwards, check for the presence of the values in the datafile:
$ strings /tmp/testts_01.dbf
The old value, not the new value, is found. The datafile on the disk still has the old value; not the new
one, even though the transaction has been committed. The update statement actually updated only
the buffer in the cache; not the disk. So, when is the data on the disk updated?
The datafile gets updated by a process known as Database Writer (a.k.a. Database Buffer Writer). It’s
named DBW0. Actually, there may be more than one such process and they are named DBW0, DBW1,
etc. – more conveniently addressed as DBWn. For the purpose of the discussion here, let’s assume only
one process – DBW0. It has only one responsibility – to update the datafile with the most up to date
buffers from the buffer caches. [Note: I used buffer caches – plural. This is not a typo. There may be
more than one buffer cache in the database – keep, recycle, default and other block sizes – but that’s
for another day]. The buffer that has been updated is known as a dirty buffer since its contents are
different from the block on the disk. DBW0 process writes the contents of the buffer to the disk –
making it clean again.
But the big question is: when does DBW0 write the dirty buffer to the disk? Ah, that’s the very question we
are pondering over here. There are several “triggering” events that cause DBW0 to copy the buffers to
the disk – also called flushing of the buffers. By the way, DBW0 is a lazy process; it does not flush
buffers by itself or on a regular basis. It sleeps most of the time and must be woken up by another
process to perform its duties. One such watchdog process is called the Checkpoint (you can check its
existence with ps -aef | grep ckpt on Unix systems). Checkpoint does not actually perform the
flushing (also called checkpointing activity); but calls the DBW0 process to do it. How often does
Checkpoint process perform a checkpoint? It depends on various conditions – the biggest of all is the
MTTR setting, which we will cover later in a different installment.
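To make the separation between commit and flushing concrete, here is a minimal toy model in Python. It is only a sketch of the behavior described above, not how Oracle is implemented: the class ToyInstance and its methods are hypothetical names, the "datafile" is just a dictionary, and the checkpoint method stands in for the Checkpoint process waking up DBW0.

```python
class ToyInstance:
    """Toy model: a 'datafile' on disk plus a buffer cache in memory."""

    def __init__(self, datafile):
        self.datafile = dict(datafile)  # block id -> value stored on disk
        self.cache = {}                 # block id -> value held in the buffer cache
        self.dirty = set()              # block ids whose buffer differs from the disk

    def read(self, block):
        # A cache miss loads the block from the datafile into the cache.
        if block not in self.cache:
            self.cache[block] = self.datafile[block]
        return self.cache[block]

    def update(self, block, value):
        self.read(block)                # make sure the block is in the cache
        self.cache[block] = value       # only the buffer is changed
        self.dirty.add(block)           # the buffer is now a dirty buffer

    def commit(self):
        # In this model a commit never touches the datafile.
        pass

    def checkpoint(self):
        # Stand-in for Checkpoint calling DBW0: write every dirty buffer to disk.
        for block in sorted(self.dirty):
            self.datafile[block] = self.cache[block]
        self.dirty.clear()              # the buffers are clean again


inst = ToyInstance({1: "ORIGINAL_VALUE"})
inst.update(1, "CHANGED_VALUE")
inst.commit()
after_commit = inst.datafile[1]         # what strings would find right after commit
inst.checkpoint()
after_checkpoint = inst.datafile[1]     # what strings would find after the checkpoint
print(after_commit, after_checkpoint)
```

In this sketch the datafile still holds the old value right after the commit and picks up the new value only when the checkpoint runs, which mirrors the two strings checks in the text.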
Next, let's examine a different scenario. Drop the table, create the table again, recycle the database
to remove all buffers of the table and then perform the update; but do not commit. Then flush the
buffers from the cache to the disk. You can also trigger a checkpointing activity manually instead of
waiting for the checkpoint process. Here is how to do it:
SQL> alter system checkpoint;
After that statement completes, check the presence of the values in the datafile again:
$ strings /tmp/testts_01.dbf
The old value is still there; but that is an artifact; it will eventually be gone. The new value is updated
in the datafile. But do you remember a very important fact – the transaction is still not committed? In a
different session, if you check the data in COL1 column, you will see the value ORIGINAL_VALUE.
Where does Oracle get that value from? It gets that value from the Undo Segments in the Undo
Tablespace. The undo tablespace contains the pre-change value.
Well, now you may wonder how on earth the server process knows that the data is uncommitted and
therefore the undo segment is to be checked. Good question. Let me add yet another wrinkle to it –
the datafile contains the new value; not the old one. How does Oracle even know which rows to return
with their pre-change values? It gets that information from the header of the block where the transactions are recorded
– called Transaction Table – or, a little bit differently: Interested Transaction List (ITL). I will cover
that in detail in a future installment of this series. For the time being, please bear in mind that the
block header holds that information. When the server process accesses the buffer (or the block on the
disk) to get the column value, it accesses the transaction table, sees that there is an uncommitted
transaction against it and gets the undo information from there. Finally it creates a different copy of
the buffer as it would have looked like had the update statement not been issued. This process is
called Consistent Read (CR) Processing.
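The idea of Consistent Read processing can be sketched with a small Python model. This is a deliberately simplified illustration under my own assumptions: the Block class, the undo dictionary and the function names are hypothetical, and the real mechanism works with SCNs and undo records rather than a simple "active" flag.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    rows: dict                                     # row id -> current value in the buffer
    itl: dict = field(default_factory=dict)        # txn id -> "active" or "committed"
    changed_by: dict = field(default_factory=dict) # row id -> txn id that changed the row


undo = {}  # (txn id, row id) -> pre-change value, standing in for the undo segment


def update(block, txn, row, new_value):
    # Keep the first pre-change value seen by this transaction for this row.
    undo.setdefault((txn, row), block.rows[row])
    block.itl[txn] = "active"        # record the transaction in the block header
    block.changed_by[row] = txn
    block.rows[row] = new_value


def commit(block, txn):
    block.itl[txn] = "committed"


def consistent_read(block, reader_txn):
    # Build a separate copy of the block as the reader is allowed to see it.
    cr_copy = dict(block.rows)
    for row, txn in block.changed_by.items():
        if txn != reader_txn and block.itl[txn] == "active":
            cr_copy[row] = undo[(txn, row)]  # replace with the pre-change value
    return cr_copy


blk = Block(rows={1: "ORIGINAL_VALUE"})
update(blk, "T1", 1, "CHANGED_VALUE")
seen_by_other = consistent_read(blk, "T2")[1]      # another session, before commit
seen_by_self = consistent_read(blk, "T1")[1]       # the updating session itself
commit(blk, "T1")
seen_after_commit = consistent_read(blk, "T2")[1]  # another session, after commit
```

The current buffer is never modified by the reader; the reader works on a copy, which is the point of creating a different copy of the buffer.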
Now back to our original discussion. Since DBW0 does not immediately flush the buffers to the datafile,
the datafile can be inconsistent with the committed data. Won’t that compromise the recovery process?
What would happen if the instance crashes before the flushing has occurred? Since the transaction
was previously committed, the recovery should update the datafile. Where does that information come
from? The undo tablespace? No; the undo tablespace is just another datafile; it gets flushed in the same
manner; so it may not have those values. Besides, it may not even contain the new value.
This is where the other leg of the database’s guarantee of the committed transaction comes in. When
the changes occur in the table, Oracle also records the information in another pool in the memory
called the Log Buffer. Compared to the buffer caches, which could be several terabytes, this buffer is tiny,
often just a few MBs. The update statement records the pre and post change values to the log buffer
(not to the log file, mind you). But the log buffer is just an area of memory; it also goes away when the
instance crashes. So how does Oracle use the information to protect the committed data?
This is where the redo log files (a.k.a. online redo logs) come into the picture. When the session commits,
the contents of the log buffer are flushed to the redo log files. Until the flushing is completed, the
session waits with various wait events depending on conditions, the majority of which are “log file
sync” and “log file parallel write”. But does the log buffer flushing occur only when a commit occurs?
No. There are other triggering events as well:
(1) When one third of the log buffer is full
(2) When 1 MB of log buffer is written
(3) Every three seconds
There are other events as well; but these are the major ones. Since commit statement flushes the log
buffer to the redo log file, even if the instance crashes the information is stored in the redo log file
and can be easily read by the instance recovery processes. In case of a RAC database, a single instance
may have crashed. The instance recovery is done by one of the surviving instances. But it must read the
redo entries of the crashed instance to reconstruct the blocks on the disk. This is why the redo log
files, although for only one instance, must be visible to all nodes.
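The triggering events listed above can be put into a small Python sketch. It is a toy model under stated assumptions, not Oracle's log writer: ToyLogBuffer and its methods are hypothetical names, time is passed in by hand as seconds, and only the commit, one-third-full, 1 MB and three-second triggers from the text are modeled.

```python
ONE_MB = 1024 * 1024


class ToyLogBuffer:
    """Toy model of the log buffer and the redo log file."""

    def __init__(self, size_bytes):
        self.size = size_bytes
        self.pending = []        # redo records still only in memory
        self.pending_bytes = 0
        self.redo_log = []       # records already written to the "redo log file"
        self.last_flush = 0.0    # time of the last flush, in seconds

    def flush(self, now):
        self.redo_log.extend(self.pending)
        self.pending = []
        self.pending_bytes = 0
        self.last_flush = now

    def add(self, record, nbytes, now):
        self.pending.append(record)
        self.pending_bytes += nbytes
        # Triggers (1) and (2): one third full, or 1 MB of redo accumulated.
        if self.pending_bytes >= self.size / 3 or self.pending_bytes >= ONE_MB:
            self.flush(now)

    def tick(self, now):
        # Trigger (3): three seconds have passed since the last flush.
        if self.pending and now - self.last_flush >= 3:
            self.flush(now)

    def commit(self, txn, now):
        # A commit adds a commit marker and always forces a flush.
        self.pending.append(("COMMIT", txn))
        self.flush(now)


buf = ToyLogBuffer(size_bytes=3000)
buf.add(("CHANGE", "T1"), 100, now=0.0)   # small change, stays in memory
on_disk_immediately = list(buf.redo_log)
buf.tick(now=3.0)                         # three-second trigger fires
on_disk_after_timeout = list(buf.redo_log)
buf.add(("CHANGE", "T2"), 1000, now=3.5)  # reaches one third of the buffer
buf.add(("CHANGE", "T3"), 10, now=4.0)
buf.commit("T3", now=4.1)                 # commit flushes whatever is pending
```

Note that at the end of this run the changes of T1 and T2 are in the redo log even though neither transaction committed; only T3 has a commit marker.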
Even if the commit is not issued, the other triggering events flush the contents of the log buffer to the
redo log files as well. The presence of the redo entries on the redo log files is independent of the
commit. From the previous discussion you learned that the checkpoint flushes the buffers from the
cache to the disk, regardless of the issuance of the commit statement. Therefore, these are the
interesting possibilities after a session updates the data (which is updated in the buffer cache):
[Table of scenarios, with columns including Checkpoint, Datafile and Redo; one row carries the comment "Commit will force a redo log flush".]
Looking at the table above you may see some interesting conundrums – redo log has the changed data
but datafile does not and vice versa. How does Oracle know when and what exactly to recover since
the presence of record in the redo log file is not a guarantee that the data was committed?
To address that issue, Oracle places a special “marker”, called a Commit Marker, in the redo stream
which goes into the redo log buffer. When instance recovery is required, Oracle doesn’t just recover
anything that is present in the redo log files; it looks for a commit marker. If one is not found, then
the changes are deemed to be uncommitted; and therefore Oracle rolls them back. If the changes are
not found in the redo log, then the changes are uncommitted, guaranteed (remember, a commit will
definitely flush the log buffer to redo). In that case Oracle rolls them back from the datafiles, a
process known as rolling back. When the changes are found in the redo log (along with the commit marker)
but there are no corresponding changes in the datafile (scenario #5), Oracle will apply the changes to the
datafile from the redo entries, a process known as roll forward. Recovery consists of both rolling back
and rolling forward.
To put it all together, here is a rough algorithm for the actions of the recovery process:
(1) Read the redo log entries starting with the oldest one.
(2) Check the SCN of the change.
(3) Look for the commit marker.
    (a) If found, then the data has been committed. Look for the changes in the datafile (via the SCN).
        Has the change been reflected in the datafile?
        If yes, then move on.
        If no, then apply the changes to the datafile (roll forward).
    (b) If not found, then the data is uncommitted. Look for the changes in the datafile.
        Is the change found in the datafile?
        If no, then move on.
        If yes, then update the datafile with the pre-change data (rollback).
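The rough algorithm above translates almost line for line into Python. The sketch below is only an illustration of that rough algorithm: the tuple layout of the redo entries, the dictionary used as a datafile and the function name recover are all my own assumptions, and it handles just one change per block.

```python
def recover(redo_log, datafile):
    """Apply the rough recovery algorithm to a toy redo log and datafile.

    redo_log entries, oldest first:
        ("CHANGE", scn, txn, block, before, after) or ("COMMIT", scn, txn)
    datafile: block -> {"value": ..., "scn": SCN of the last change on disk}
    Returns the list of (block, action) pairs that recovery performed.
    """
    # Transactions that have a commit marker in the redo log.
    committed = {entry[2] for entry in redo_log if entry[0] == "COMMIT"}
    actions = []
    for entry in redo_log:                       # start with the oldest entry
        if entry[0] != "CHANGE":
            continue
        _, scn, txn, block, before, after = entry
        # The change is on disk if the block is at least as new as the change.
        in_datafile = datafile[block]["scn"] >= scn
        if txn in committed:
            if not in_datafile:                  # committed but never flushed
                datafile[block] = {"value": after, "scn": scn}
                actions.append((block, "roll forward"))
        elif in_datafile:                        # uncommitted but checkpointed
            datafile[block]["value"] = before
            actions.append((block, "roll back"))
    return actions


redo_log = [
    ("CHANGE", 101, "T1", "A", "1000", "2000"),
    ("COMMIT", 102, "T1"),
    ("CHANGE", 103, "T2", "B", "ORIGINAL_VALUE", "CHANGED_VALUE"),
]
datafile = {
    "A": {"value": "1000", "scn": 90},            # committed change not yet on disk
    "B": {"value": "CHANGED_VALUE", "scn": 103},  # uncommitted change already on disk
}
actions = recover(redo_log, datafile)
```

Block A is rolled forward because its transaction has a commit marker but the datafile is behind; block B is rolled back because its change reached the datafile without a commit marker.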
Let me reiterate some of the lessons from this installment.
(1) Data buffers are flushed to the disk from the buffer cache independently of the commit statement.
Commit does not flush the buffers to the disk.
(2) If the buffer is modified in the buffer cache but not yet flushed to the disk, it is known as a dirty
buffer.
(3) If a buffer is clean (i.e. not dirty), it does not mean that the data changes have been committed.
(4) When a commit occurs, the log buffer (not the buffer cache) is flushed to the disk
(5) Log buffer may already have been flushed to the disk due to other triggering events. So if a change
is found in the redo log file, the change is not necessarily committed.
(6) A commit statement puts a special “commit marker” on the redo log, which is the guarantee of a
committed transaction.
(7) The frequency of the flushing of the buffer cache to the datafiles is controlled by the MTTR setting
and whether free buffers are needed in the cache due to incoming blocks from the datafiles.
How do you use this information? There are several things for you to consider:
(1) The more often you commit, the more often the log buffer will be flushed, which is not very good for I/O.
(2) The more aggressive the MTTR target is, the less time recovery will take if the instance crashes, but the
more frequent the flushing to the datafile will be as well, causing more I/O.
(3) The MTTR target has nothing to do with commit frequency; they are two independent activities. So,
reducing commit frequency will not cause a reduction in flushing frequency.
(4) If your buffer cache is small, there will be more need to flush.
(5) Exactly how small is “small”? There is no fixed formula; it depends on how much of the data in the
buffer cache is updated.