HW4_0711282.pdf

Introduction to Database Systems HW4
(程式碼的說明會備註於程式碼裡頭)
2-A
1. Describe the general structure of SQLite code base
Reference:
1.Architecture of SQLite (https://www.sqlite.org/arch.html)
2.A Look at SQLite (https://www.a2hosting.com/blog/sqlite-benefits/)
3.SQLite As An Application File Format (https://www.sqlite.org/aff_short.html)
4.SQLite Advantages (https://www.javatpoint.com/sqlite-advantages-and-disadvantages)
5.Stack overflow (https://stackoverflow.com/questions/19946298/what-is-the-advantage-of-using-sqlite-rather-than-file)
a.
b.
1.Tokenizer: 當偵測到String含有SQL語法時他將會先被送到Tokenizer，Tokenizer將會把SQL Text
切成多個token然後丟給Parser(Tokenizer 包含了 tokenize.c)
2.Parser: SQLite 採⽤ Lemon parser generator，Lemon與YACC/BISON相似但使⽤不同的輸⼊語
法來防⽌編碼錯誤。當語法錯誤時候因為Lemon設計了non-terminal destructor所以不會因為語法錯
誤導致leak memory. (Parser 包含了 parse.y)。因此Parser的主要⼯作就是Parser接收到Token後，
使⽤語法解析器Lemon在指定的Context Free Language賦予Token具體的意義
因為還沒學過編譯器概論，稍微上網查了⼀下(Yacc 與 Lex 快速⼊⾨
(http://inspiregate.com/programming/other/483-yacc-and-lex-getting-started.html))(簡單學 Parser - lex 與 yacc
(https://mropengate.blogspot.com/2015/05/parser-lex-yacc-1.html))，上述所提到的YACC是⽤來⽣成編譯器的編譯

器，⽣成的編譯器主要是⽤C語⾔寫成的Parser，需要與詞語解析器Lex⼀起使⽤，再把兩部份產⽣出
來的C file⼀起編譯
BISON則是將LALR形式的Context Free Language轉為可做語法分析的C或C++ file，⼀般與flex⼀起
作⽤(Yacc/Bison (https://www.cs.ccu.edu.tw/~naiwei/cs5605/YaccBison.html))
所以我們這裡可以理解為其實Lemon做的事情跟上述兩者差不多，由官⽅⽂件的說明來看(The
Lemon LALR(1) Parser Generator (https://www.sqlite.org/lemon.html))，我們可以理解Lemon是為了SQLite
良深打造的Parser Generator
3.Code Generator: Code Generator將會分析Parse Tree並且將其轉換為Bytecode，所以⼀開始為
SQL Statment，最後將會變成Bytecode，再由VM執⾏。
包含了多個檔案:attach.c, auth.c, build.c, delete.c, expr.c, insert.c, pragma.c, select.c, trigger.c,
update.c, vacuum.c, where.c, wherecode.c,whereexpr.c, attach.c, delete.c, insert.c, select.c,
trigger.c, update.c, vacuum.c,build.c.
其中where*.c負責SQL中的WHERE語法，attach.c負責ATTACH，delete.c負責DELETE，insert.c負
責ISNERT，select.c負責SELECT，trigger.c負責TRIGGER，update.c負責UPDATE,vacuum.c負責
VACUUM
(上述的Bytecode與機器碼不同，所謂的機器碼就是由0與1組成，可被CPU直接讀取的指令，
Bytecode我們可以把他想成⼀個中間狀態，，需要經過直譯器轉譯後才成為機器碼，Bytecode主要為
了實現特定軟件運⾏與軟件環境，與硬體環境無關)(參考資料
(https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/696466/))
因此我們總結上述三者，透過Tokenizer、Parser我們對SQL進⾏語法檢查，把它轉化為底層更⽅便處
理的Parse Tree，再把Parse Tree傳給Code Generator最後⽣成Bytecode，最後由Virtual Machine
執⾏。
4.Virtual machine(Bytecode Engine):在code generator後會由原本SQL statment產⽣為bytecode
program，再由Virtual machine運⾏。
此部份包含了: vdbe.c, vdbe.h, vdbeInt.h, vdbeapi.c, vdbemem.c等。
其中vdbe.c為virtual machine本⾝; vdbe.h為header file為virtual machine與SQLite library的接⼝
vbde*.c的檔案(包含
vdbeapi.c,vdbeblob.c,vdbe.h,vdbemem.c,vdbetrace.c,vdbeaux.c,vdbe.c,vdbeInt.h,vdbesort.c,
vdbevtab.c)都為為了運⾏virtual machine的週邊檔案。
SQLite實做SQL function(ex: abs()、count()、substr())的部份是利⽤回傳回去C語⾔，以C語⾔實
現，此檔案為func.c
以abs()為例，擷取部份程式碼如下:

/*
** Implementation of the abs() function.
**
** IMP: R-23979-26855 The abs(X) function returns the absolute value of
** the numeric argument X.
*/
static void absFunc(sqlite3_context *context
,int argc, sqlite3_value **argv){
assert( argc==1 );
UNUSED_PARAMETER(argc);
switch( sqlite3_value_type(argv[0]) ){
case SQLITE_INTEGER: {
i64 iVal = sqlite3_value_int64(argv[0]);
if( iVal<0 ){
if( iVal==SMALLEST_INT64 ){
/* IMP: R-31676-45509 If X is the integer -9223372036854775808
** then abs(X) throws an integer overflow error since there is no
** equivalent positive 64-bit two complement value. */
sqlite3_result_error(context, "integer overflow", -1);
return;
}
iVal = -iVal;
}
sqlite3_result_int64(context, iVal);
break;
}
case SQLITE_NULL: {
/* IMP: R-37434-19929 Abs(X) returns NULL if X is NULL. */
sqlite3_result_null(context);
break;
}
default: {
/* Because sqlite3_value_double() returns 0.0 if the argument is not
** something that can be converted into a number, we have:
** IMP: R-01992-00519 Abs(X) returns 0.0 if X is a string or blob
** that cannot be converted to a numeric value.
*/
double rVal = sqlite3_value_double(argv[0]);
if( rVal<0 ) rVal = -rVal;
sqlite3_result_double(context, rVal);
break;
}
}
}
5.B-tree:
SQLite database維護Disk的資料結構是使⽤B-tree
(但在實際操作中會使⽤到b* tree與b tree兩種結構)
全部的B-tree都存在同⼀個Disk file上
相關檔案有btree.c、btree.h
6.Page Cache:
B-tree會以固定⼤⼩的page(預設是4096)從DIsk上請求資料，Page Cache的⼯作就是負責讀、寫還
有儲存這些page，page cache提供rollback、atomic commit的功能還有負責檔案的LOCK，當btree
要求了特定的page或是要修改page或是rollback當前修改時，也會通知Page Cache，簡單的說Page
Cache會處理所有細節確保所有要求都可以有效率且安全地被處理
相關檔案:pager.c、pager.h、wal.c、pcache1.c等等
7.OS Interface:
為了在不同作業系統間具有可移植性，SQLite採⽤VFS操作OS的interface，使⽤VFS提供在DISK上
讀寫以及開關檔案
⽬前提供的作業系統為:windows以及unix
相關檔案:os_unix.c、os_win.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

Components FIle Purpose
Tokenizer tokenize.c 將SQL statment切割成Token
Parser parser.y 藉由Lemon實做Parser
Code
Generator
update.c 處理UPDATE的statment
delete.c 處理DELETE的statment
insert.c 處理INSERT的statment
trigger.c 處理TRIGGER的statment
attach.c 處理ATTACH和DEATTACH的statment
select.c 處理SELECT的statment
where.c 處理WHERE的statment
vacuum.c 處理VACUUM的statment
pragma.c 處理PRAGMA的statment
expr.c 處理SQL statment中的Expression
auth.c 實做sqlite3_set_authorizer()
analyze.c 實現ANALYZE命令
alter.c 實做ALTER TABLE
build.c
處理以下的命令(語法) CREATE TABLE、DROP TABLE、
CREATE INDEX、DROP INDEX、creating ID lists、BEGIN
TRANSACTION、COMMIT、ROLLBACK
func.c 實做SQL中函數的部份
data.c 與⽇期以及時間轉換有關的function
Virtual
machine
vdbeapi.c 虛擬機提供上層使⽤的API實做
vdbe.c 虛擬機本體
vdbe.h Header file of vdbe.c
vdbeInt.h vdbe.c的私有Header file，定義了VDBE常⽤的Data structure
vdbemem.c 操作mem的data structure的Function
B-tree btree.h Header file of btree.c
btree.c Btree的本體
Page Cache pager.c Page Cache的實做
pager.h Header file of Page Cache
OS Interface os.h Header file of os.c
os.c 實做對⼀般常⾒的OS architectures的Interface
os_win.c windows下的OS Interface
os_unix.c unix下的OS Interface
os_os2.c OS2下的OS Interface

c.

課程中的Architecture
Sqlite的Architecture如上述a⼩題所提供的
SQLite並⾮作為⼀個獨⽴Process，⽽是直接作為應⽤程式的⼀部分
因此使它較為輕量、快速以及⾼效率
SQLite的優點如下:
1. 容易設定，且無伺服器的特性容易安裝
2. 輕量級，無論是設定或是管理上所佔⽤的資源少
3. 可移植⾼，可以跨平台(OS)使⽤
與課程中所使⽤的Architecture⼤致上相同
差異是SQLite不提供網路訪問，也不能管理User
因此較⼤型的應⽤程式可較為不適合使⽤SQLite
⽐較適合嵌⼊式設備或是物聯網等等

2. Describe how SQLite process a SQL statement
Reference:
1.The SQLite Query Optimizer Overview (https://www.sqlite.org/optoverview.html)
2.Architecture of SQLite (https://www.sqlite.org/arch.html)
3.Query Planning (https://www.sqlite.org/queryplanner.html)
4.Indexes On Expressions (https://www.sqlite.org/expridx.html)
5.TEXT affinity (https://www.sqlite.org/datatype3.html#affinity)
a. list the components included in the procedure, describe their roles in the procedure (a brief
explanation is enough here)
如果單看Process SQL statment的話⼤概是這幾個components

各Components的主要⼯作:
Tokenizer: 接收到SQL statment後將其切割成多個Token，並把這些Token傳給Parser
Parser: Parser接收到Token後，使⽤語法解析器Lemon在指定的Context Free Language賦予Token
具體的意義
Code Generator: Parser把Token組裝成完整的SQL statment後轉換成Bytecode給虛擬機執⾏SQL
Statment請求的⼯作內容

Components FIle Purpose
Tokenizer tokenize.c 將SQL statment切割成Token
Parser parser.y 藉由Lemon實做Parser
Code
Generator
update.c 處理UPDATE的statment
delete.c 處理DELETE的statment
insert.c 處理INSERT的statment
trigger.c 處理TRIGGER的statment
attach.c 處理ATTACH和DEATTACH的statment
select.c 處理SELECT的statment
where.c 處理WHERE的statment
vacuum.c 處理VACUUM的statment
pragma.c 處理PRAGMA的statment
expr.c 處理SQL statment中的Expression
auth.c 實做sqlite3_set_authorizer()
analyze.c 實現ANALYZE命令
alter.c 實做ALTER TABLE
build.c
處理以下的命令(語法) CREATE TABLE、DROP TABLE、
CREATE INDEX、DROP INDEX、creating ID lists、BEGIN
TRANSACTION、COMMIT、ROLLBACK
fun.c 實做SQL中函數的部份
data.c 與⽇期以及時間轉換有關的function

b. describe how SQLite optimize the query execution in detail
i. explain how each term (where, like, between etc.) are optimized, how indexes are used
in the optimization
ii.explain the query planner as detail as possible
以下優化like between or等的程式碼皆位於whereexpr.c中的exprAnalyze(…)
1. WHERE clause analysis
Where clause會被以AND為界線切割為多個Term，如果Where clause是以OR所分割組成，那整個
clause應該要被視為OR優化的單個Term
接著分析所有item看看他們能否滿⾜index的條件，要使⽤index的Term必須為下列形式:
如果index使⽤下⾯這樣的statment建⽴

如果 initial column a、b等等出現在where clause中，就可以使⽤index，index的初始值必須與IN、
IS、= ⼀起使⽤，最右邊的column可以使⽤不等式，然⽽最右邊column的index最多可以有兩個不等
式，但此不等式必須將允許值控制在兩個極值之間，且須注意的是不能中斷使⽤index column，⽐如
上圖c不能使⽤index，那麼⼜只有a與b可以使⽤index，其餘的index cloumn將無法正常使⽤
2. The BETWEEN optimization
如果分割後的Clause包含Between，則把Between轉化成兩個不等式
例如下圖:
可以轉化成下圖形式
轉化成兩個不等式僅僅拿來分析使⽤，⽽不會產⽣VDBE bytecode，然⽽如果Between已經被編碼
過，那麼轉換後的Clause將會被忽略，如果還未編碼且轉化後的Clause可以使⽤index，那SQLite將
會跳過原本的Between形式
可以在WHERE.c中找到function驗證
#ifndef SQLITE_OMIT_BETWEEN_OPTIMIZATION
/* If a term is the BETWEEN operator, create two new virtual terms
** that define the range that the BETWEEN implements. For example:
**
** a BETWEEN b AND c
**
** is converted into:
**
** (a BETWEEN b AND c) AND (a>=b) AND (a<=c)
**
** The two new terms are added onto the end of the WhereClause object.
** The new terms are "dynamic" and are children of the original BETWEEN
** term. That means that if the BETWEEN term is coded, the children are
** skipped. Or, if the children are satisfied by an index, the original
** BETWEEN term is skipped.
*/
else if( pExpr->op==TK_BETWEEN && pWC->op==TK_AND ){
//判斷是不是Between形式Clause
ExprList *pList = pExpr->x.pList;
int i;
static const u8 ops[] = {TK_GE, TK_LE};
assert( pList!=0 );
assert( pList->nExpr==2 );
for(i=0; i<2; i++){
Expr *pNewExpr; // ⽤來存轉換後的Expression
int idxNew; // 新插⼊的expression在Where clause中的index
pNewExpr = sqlite3PExpr(pParse, ops[i], //創建>=和<= expression
sqlite3ExprDup(db, pExpr->pLeft, 0),
sqlite3ExprDup(db, pList->a[i].pExpr, 0));
transferJoinMarkings(pNewExpr, pExpr);
idxNew = whereClauseInsert(pWC, pNewExpr, TERM_VIRTUAL|TERM_DYNAMIC);
testcase( idxNew==0 ); //檢查轉換有沒有成功
exprAnalyze(pSrc, pWC, idxNew);
pTerm = &pWC->a[idxTerm];
// 表⽰新建的expression是Bwtween clause的變化
markTermAsChild(pWC, idxNew, idxTerm);
}
}
#endif /* SQLITE_OMIT_BETWEEN_OPTIMIZATION */
3. OR optimizations
如上圖，當分割的Clause是由於OR連接時有兩種情況要考慮:
1. 如果OR連結的所有clause都是在同⼀Table的同⼀column時，可以使⽤IN來替換OR clause
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40

2. 如果OR連結的所有clause不是在同⼀column且各個clause的operation為"=", “<”, “<=”, “>”, ">=
", “IS NULL”, or "IN"中的⼀個，並且各個clause的column是某些index的column，就能夠把OR改
寫為UNION形式，每個clause都可以依靠相對應的index進⾏優化
如果兩者接符合者，SQLite將會預設使⽤第⼀種⽅式進⾏優化
#if !defined(SQLITE_OMIT_OR_OPTIMIZATION) && !defined(SQLITE_OMIT_SUBQUERY)
/* Analyze a term that is composed of two or more subterms connected by
** an OR operator.
*/
else if( pExpr->op==TK_OR ){ //如果該Expression是由OR組合成的
assert( pWC->op==TK_AND ); //如果Where clause是由AND分割的
exprAnalyzeOrTerm(pSrc, pWC, idxTerm); // 優化該expression
}
#endif /* SQLITE_OMIT_OR_OPTIMIZATION */
4. The LIKE optimization
Like優化時需要滿⾜底下條件:
1. LIKE左邊必須是有index的column，且型態必須是TEXT affinity的
2. LIKE右邊必須是string或是⼀個為string類型的參數且不可以wildcard character開頭
3. ESCAPE不能出現在LIKE中
4. 如果case_sensitive_like模式是啟⽤的（區分⼤⼩寫），那麼column必須使⽤BINARY collating
sequence排列。如果滿⾜了上述條件，我們則可以添加⼀些不等式來減少LIKE的查詢範圍。例
如： x LIKE ‘abc%’ 添加條件後變為： x >= ‘abc’ AND x < ‘abd’ AND x LIKE ‘abc%’ 這樣x可以
使⽤索引並且可以縮⼩LIKE的查詢範圍。
優化的具體實現
⼀個單獨的SQL statment可能有好幾種⽅法完成，所花的時間與SQLite最終決定⽤哪種⽅法有關
所以where.c的⽬的就是主要⽣成WHERE statment的Bytecode並進⾏優話處理
sqlite3WhereBegin()是整個查詢優化處理的核⼼，主要完成WHERE的優話以及opcode(要丟給虛擬
機的)的⽣成
1
2
3
4
5
6
7
8
9
10

#ifndef SQLITE_OMIT_LIKE_OPTIMIZATION
/* Add constraints to reduce the search space on a LIKE or GLOB
** operator.
**
** A like pattern of the form "x LIKE 'aBc%'" is changed into constraints
**
** x>='ABC' AND x<'abd' AND x LIKE 'aBc%'
**
** The last character of the prefix "abc" is incremented to form the
** termination condition "abd". If case is not significant (the default
** for LIKE) then the lower-bound is made all uppercase and the upper-
** bound is made all lowercase so that the bounds also work when comparing
** BLOBs.
*/
if( pWC->op==TK_AND
&& isLikeOrGlob(pParse, pExpr, &pStr1, &isComplete, &noCase)
){
Expr *pLeft; /* LHS of LIKE/GLOB operator */
Expr *pStr2; /* Copy of pStr1 - RHS of LIKE/GLOB operator */
Expr *pNewExpr1;
Expr *pNewExpr2;
int idxNew1;
int idxNew2;
const char *zCollSeqName; /* Name of collating sequence */
const u16 wtFlags = TERM_LIKEOPT | TERM_VIRTUAL | TERM_DYNAMIC;
pLeft = pExpr->x.pList->a[1].pExpr;
pStr2 = sqlite3ExprDup(db, pStr1, 0);
/* Convert the lower bound to upper-case and the upper bound to
** lower-case (upper-case is less than lower-case in ASCII) so that
** the range constraints also work for BLOBs
*/
if( noCase && !pParse->db->mallocFailed ){
int i;
char c;
pTerm->wtFlags |= TERM_LIKE;
for(i=0; (c = pStr1->u.zToken[i])!=0; i++){
pStr1->u.zToken[i] = sqlite3Toupper(c);
pStr2->u.zToken[i] = sqlite3Tolower(c);
}
}
if( !db->mallocFailed ){
u8 c, *pC; /* Last character before the first wildcard */
pC = (u8*)&pStr2->u.zToken[sqlite3Strlen30(pStr2->u.zToken)-1];
c = *pC;
if( noCase ){
/* The point is to increment the last character before the first
** wildcard. But if we increment '@', that will push it into the
** alphabetic range where case conversions will mess up the
** inequality. To avoid this, make sure to also run the full
** LIKE on all candidate expressions by clearing the isComplete flag
*/
if( c=='A'-1 ) isComplete = 0;
c = sqlite3UpperToLower[c];
}
*pC = c + 1;
}
zCollSeqName = noCase ? "NOCASE" : sqlite3StrBINARY;
pNewExpr1 = sqlite3ExprDup(db, pLeft, 0);
pNewExpr1 = sqlite3PExpr(pParse, TK_GE,
sqlite3ExprAddCollateString(pParse,pNewExpr1,zCollSeqName),
pStr1);
// Create >=的 expression
transferJoinMarkings(pNewExpr1, pExpr);
idxNew1 = whereClauseInsert(pWC, pNewExpr1, wtFlags);
//插⼊>= expression 到 where clause中
testcase( idxNew1==0 );
//測試有沒有成功插⼊
exprAnalyze(pSrc, pWC, idxNew1);
//分析Expression是不是operation+expression這種形式
//如果不是的話轉換為operation+expression這種形式
pNewExpr2 = sqlite3ExprDup(db, pLeft, 0);
pNewExpr2 = sqlite3PExpr(pParse, TK_LT,
sqlite3ExprAddCollateString(pParse,pNewExpr2,zCollSeqName),
pStr2);
// Create <的 expression
transferJoinMarkings(pNewExpr2, pExpr);
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

t a s e Jo a gs(p e p , p p );
idxNew2 = whereClauseInsert(pWC, pNewExpr2, wtFlags);
//插⼊< expression 到 where clause中
testcase( idxNew2==0 );
//測試有無插⼊成功
exprAnalyze(pSrc, pWC, idxNew2);
//分析Expression是不是operation+expression這種形式
//如果不是的話轉換為operation+expression這種形式
if( isComplete ){
markTermAsChild(pWC, idxNew1, idxTerm);
markTermAsChild(pWC, idxNew2, idxTerm);
}
}
#endif /* SQLITE_OMIT_LIKE_OPTIMIZATION */
struct Select {
u8 op;
/* One of: TK_UNION TK_ALL TK_INTERSECT TK_EXCEPT */
LogEst nSelectRow; /* Estimated number of result rows */
u32 selFlags; /* Various SF_* values */
int iLimit, iOffset;
/* Memory registers holding LIMIT & OFFSET counters */
u32 selId; /* Unique identifier number for this SELECT */
int addrOpenEphm[2]; /* OP_OpenEphem opcodes related to this select */
ExprList *pEList; /* The fields of the result */
SrcList *pSrc; /* FROM 內容的parse tree */
Expr *pWhere; /* WHERE 內容的parse tree */
ExprList *pGroupBy; /* GROUP 內容的parse tree */
Expr *pHaving; /* HAVING 內容的parse tree */
ExprList *pOrderBy; /* ORDER 內容的parse tree */
Select *pPrior; /* Prior select in a compound select statement */
Select *pNext; /* Next select to the left in a compound */
Expr *pLimit; /* LIMIT expression. 不使⽤的話以NULL表⽰. */
With *pWith; /* WITH clause attached to this select. Or NULL. */
#ifndef SQLITE_OMIT_WINDOWFUNC
Window *pWin; /* List of window functions */
Window *pWinDefn; /* List of named window definitions */
#endif
};
因為在WHERE中使⽤了select這個struct
我們可以在sqliteint.h裡找到他的定義
sqlite3WhereBegin()的parameter
WhereInfo *sqlite3WhereBegin(
Parse *pParse, /* The parser context */
SrcList *pTabList, /* FROM clause: A list of all tables to be scanned */
Expr *pWhere, /* The WHERE clause */
ExprList *pOrderBy, /* An ORDER BY (or GROUP BY) clause, or NULL */
ExprList *pResultSet, /* Query result set. Req'd for DISTINCT */
u16 wctrlFlags, /* The WHERE_* flags defined in sqliteInt.h */
int iAuxArg /* If WHERE_OR_SUBCLAUSE is set, index cursor number
** If WHERE_USE_LIMIT, then the limit amount */
// pTablist是由parser⽣成對FROM部份⽣成的Parse Tree
// 包含FROM Table的資料
// pWHERE 是WHERE部份的Parse Tree
// 會包含WHERE中的EXpression
// porderby對應order by 的 parse tree
)
WhereInfo *pWInfo;
//宣告在sqlite3WhereBegin
//pWInfo將會成為sqlite3WhereBegin的return value
9
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
1
2
3

/* 分配記憶體空間並初始化 WhereInfo structure pWInfo
* (同時也是這個Function的return value)
* */
pWInfo = sqlite3DbMallocRawNN(db, nByteWInfo + sizeof(WhereLoop));
if( db->mallocFailed ){
sqlite3DbFree(db, pWInfo);
pWInfo = 0;
goto whereBeginError;
}
pWInfo->pParse = pParse;
pWInfo->pTabList = pTabList;
pWInfo->pOrderBy = pOrderBy;
pWInfo->pWhere = pWhere;
pWInfo->pResultSet = pResultSet;
pWInfo->aiCurOnePass[0] = pWInfo->aiCurOnePass[1] = -1;
pWInfo->nLevel = nTabList;
pWInfo->iBreak = pWInfo->iContinue = sqlite3VdbeMakeLabel(pParse);
pWInfo->wctrlFlags = wctrlFlags;
pWInfo->iLimit = iAuxArg;
pWInfo->savedNQueryLoop = pParse->nQueryLoop;
memset(&pWInfo->nOBSat, 0,
offsetof(WhereInfo,sWC) - offsetof(WhereInfo,nOBSat));
memset(&pWInfo->a[0], 0, sizeof(WhereLoop)+nTabList*sizeof(WhereLevel));
assert( pWInfo->eOnePass==ONEPASS_OFF ); /* ONEPASS defaults to OFF */
pMaskSet = &pWInfo->sMaskSet;
sWLB.pWInfo = pWInfo;
sWLB.pWC = &pWInfo->sWC;
sWLB.pNew = (WhereLoop*)(((char*)pWInfo)+nByteWInfo);
assert( EIGHT_BYTE_ALIGNMENT(sWLB.pNew) );
whereLoopInit(sWLB.pNew);
#ifdef SQLITE_DEBUG
sWLB.pNew->cId = '*';
#endif
接著透過要做的是切割使⽤AND的WHERE中的內容
//初始化MaskSet與WhereClause
// 並且呼叫sqlite3WhereSplit切割內容
initMaskSet(pMaskSet);
sqlite3WhereClauseInit(&pWInfo->sWC, pWInfo);
sqlite3WhereSplit(&pWInfo->sWC, pWhere, TK_AND);
切割使⽤的Function概念如下，實做上採取遞迴的⽅式:
WHERE a==‘hello’ AND coalesce(b,11)<10 AND (c+12!=d OR c==22)
_ _ _ _/ _ _ _ _ _ _ _ _ _/ _ _ _ _ _ _ _ _ _ _/
slot[0] slot[1] slot[2]
void sqlite3WhereSplit(WhereClause *pWC, Expr *pExpr, u8 op){
Expr *pE2 = sqlite3ExprSkipCollateAndLikely(pExpr);
pWC->op = op;
if( pE2==0 ) return;
if( pE2->op!=op ){
whereClauseInsert(pWC, pExpr, 0);
}else{
sqlite3WhereSplit(pWC, pE2->pLeft, op);
sqlite3WhereSplit(pWC, pE2->pRight, op);
}
}

呼叫sqlite3WhereExprAnalyze()準備開始分析所有的subexpressions(⼦表達式)
sqlite3WhereExprAnalyze(pTabList, &pWInfo->sWC);
sqlite3WhereExprAnalyze再呼叫exprAnalyze分析所有的subexpressions
需要注意的是因為exprAnalyze()可能會加個virtual terms在WHERE段落的最後⾯
所以以倒序的⽅式跑迴圈避免出問題
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
1
2
3
4
5
1
2
3
4
5
6
7
8
9
10
11
1

void sqlite3WhereExprAnalyze(
SrcList *pTabList, /* the FROM clause */
WhereClause *pWC /* the WHERE clause to be analyzed */
){
int i;
for(i=pWC->nTerm-1; i>=0; i--){
exprAnalyze(pTabList, pWC, i);
}
}
最後就可以接回最上⾯的Between、or、like等等的優化內容(也就是exprAnalyze()這個function的內
容)

query planner:
1.Tables Without Indices
如果對於沒有index的table⽽⾔，以下⾯這個每個⽔果價格的Table為例⼦
如果要尋找Peach這個⽔果的價格
SQLite將會從Table上到下全部掃描直到fruit=peach為⽌
然後將peach對應的價格輸出
這種掃描⽅式稱為Full Table Scan
2.Lookup By Rowid
如果我們已經知道peach的rowid=4
如果rowid有照順序存在table裡頭
SQLite便可以透過rowid進⾏binary search快速找到所需要的⾏
時間複雜度為O（logN）
⽐起整張table都掃描的O(logn)還快上許多
1
2
3
4
5
6
7
8
9

3.Lookup By Index
雖然上述的⽅法很快，但是我們並不會知道peach的rowid是多少
因此為了讓查詢更有效率，我們就得為fruit加上index，
Fruit列是⽤來給element排序的primary key，⽽rowid是當相同fruit出現時所需要的secondary key
需要注意的是因為rowid是唯⼀的，所以fruit與rowid的複合鍵也是唯⼀
透過這個⽅法我們可以先對第⼀張Table進⾏fruit=peach的⼆分搜尋法，再找到fruit=peach後取得
rowid再到原本的table進⾏binary search
4.Multiple Result Rows
在上個例⼦中只有⼀⾏滿⾜，如果結果顯⽰多⾏滿⾜⼀樣可以使⽤相同的⽅法解決
5.Multiple AND-Connected WHERE-Clause Terms

如果此時你要查的是Orange的價格且產地是California
⼀種⽅案是先找出所有的Orange⾏，再查看產地是否符合
另⼀種⽅案為替state建index
查詢⽅式為先找出產地是CA者，再過濾⽔果是不是Orange
使⽤第⼆種⽅法在這裡與第⼀種⽅法查詢所花的時間是相同的
(查詢⽅式不同並不會改變結果，但會影響查詢時間)
SQLite官⽅⽂件說明到如果在平常沒有下ANALYZE命令時這兩個⽅法將會隨機執⾏
如果有下ANALYZE命令的話SQLite將會收集index數據，
SQLite會發現在⽅法⼀的情況下搜尋範圍縮⼩到1⾏
⽅法⼆通常將範圍縮⼩到2⾏
在沒有其他條件的情況下SQLite將會選擇⽅法1
6.Multi-Column Indices
為了實做多個AND的WHERE⼦句的最⼤性能，
需要⼀個Multi-Column來覆蓋每個AND的情況
這種情況則在fruit與state上加上index

使⽤新的⽅法SQLite有可能使⽤2次binary search就找出來⾃CA的Orange
因為這個新的Table包含了最⼀開始的這個table
所以在新的table的存在之下就不需要最⼀開始的table了
如果要查詢peach的價格，
只需要使⽤新的table上並忽略state就可以實做
7.Covering Indices
通過two-column index，可以使查詢產地為CA的Orange的速度更快，
但是如果Create⼀個three-column index，
SQLite將可以查詢的更加快速

由上圖可以清楚看到如果變為three-column index的話
甚⾄可以避免需要回原圖找價格的時間
使的Binary search的次數下降
8.OR-Connected Terms In The WHERE Clause
當要查詢的是Orange，或⽣產⾃CA的⽔果的話上⾯⼀個⽅法就起不到作⽤
在處理OR連接的WHERE statment時
SQLite要檢查每個OR的條件
並且盡量使⽤sub statment都使⽤index
然後對這些根據條件所塞選出來的rowid取聯集，過程如上圖
上圖顯⽰根據各個index取出rowid後對他們取聯集在到最⼀開始的Table查詢
但實際上的作法是取得rowid是與取聯集同時進⾏的
SQlite⼀次使⽤⼀個index來找rowid，同時把他們標記為已經查過了
避免之後出現重複
Sort

SQLite也可以透過index來完成ORDER BY的statment
也就是說index不僅可以在查詢時加快速度
甚⾄在排序上也可以幫助加速
1.Sorting By Rowid
因為排序相當耗費資源
因此SQLite會想辦法盡量避免排序
如果SQlite已經確定輸出結果已經按照指定的⽅法排序那他就不會進⾏排序
2.Sorting By Index
如果查詢要按照Fruit進⾏排序
那左Table就要從前⾯開始往後⾯掃描然後找到rowid後回到右邊最原始的table再輸出價格
所以每⼀次回到右邊Table時都要使⽤⼀次Binary search
所以如果有N個數據要排序，使⽤此⽅法的時間複雜度為Nlog(N)
然⽽如果是整張表直接進⾏排序的話時間複雜度也是Nlog(N)
SQLite如果有多個⽅案可以選擇的話，
會試圖分析每個⽅案的花費(通常考慮的為運⾏時間)，
然後選擇花費最⼩者
但通常⽽⾔Sorting By Index會被選擇，
因為他不⽤將結果額外在存⼊另⼀張臨時被建⽴的Table中，
因次空間上的花費最⼩
3.Sorting By Covering Index

與上述提到的Covering Indices是⼀樣意思
如果能使⽤Covering Indices的話那就可以不⽤回原本的Table做搜尋
時間與空間上的花費都將會變⼩許多

3. Describe the interaction between SQLite and OS
(以下均取⾃官⽅⽂件說明Database File Format (https://www.sqlite.org/fileformat2.html))
a. describe how SQLite stores files in file system (format, pages, headers etc.) in detail
⾸先，我們要知道所有對Database的讀寫，都是以Page為單位(除了剛打開Database時，對前100
bytes of the database file的讀取 (the database file header))
⼀個完整的SQLite database通常稱作 “main database file”，通常在⼀個Transaction中，SQLite會將
事物處理訊息保在⼀個"Rollback Jounal"中，但如果SQLite處於WAL狀態時，則會保存在⼀個Write-
head log中)。當⼀個Transaction在完成之前出現主機crash或是應⽤崩潰等等的情況，那麼必須透過
Rollback Jounal或是Write-head log恢復database，因此稱這些含有恢復database所需要的訊息為
hot Jounal或是hot WAL file
Page的⼤⼩: 要注意的是，同⼀個database中的page⼤⼩⼀樣，且從offset第16bit開始(稍後就會看到
為什麼是16)。Page number從1開始，最⼤可以到2147483646 (2^31 - 2)，⽽Page Size最⼩為
512Bytes，最⼤可以到65536。
因此我們可以總結: SQLite Database最⼩的是1個512Bytes的Page，最⼤可以有2147483646個
65536Bytes的Page。
Database Header:
Database file header由最開始的100Bytes組成分別如下(Big-Endian):
(圖⽚取⾃官⽅說明網站)
⾸先，每個正確的SQLite database file會以這16Bytes作為開頭: (in hex): 53 51 4c 69 74 65 20 66
6f 72 6d 61 74 20 33 00.
對照到的UTF-8為字串: “SQLite format 3” 官⽅說明稱此為Magic Header String
接著offset 16與17Byte為表⽰Page size的2的次⽅數，上述有提到，最⼩為512(2 15)
(但從3.7.1版開始⽀持Page size為65536，以0x00 0x01)
offset 18 write version: 為1表⽰rollback jouranl模式，為2表⽰WAL journal模式(上述已經說明何謂
rollback與WAL)
offset 19 read version: 為1表⽰rollback jouranl模式，為2表⽰WAL journal模式(上述已經說明何謂
rollback與WAL)
offset 20 Reserved bytes per page: SQLite保留此段做其他⽤途，如加密SQLite⽂件
offset 21-23 Payload fractions: 原本設計⽤來可以修改B-tree的storage format，不過⽬前都為固定
值分別為64,32,32
9)，最⼤為32768(2

offset 24⾄27表⽰file修改次數，每次DB file修改並解鎖後都會加1，這樣在多個程序讀取同⼀個
Database file時就能夠監控⽂件是被修改，如果被修改，其他程序就要更新。
offset 28-31 In-header database size: 代表database file中page的⼤⼩，且當file修改次數(offset24
與version-valid-for number相同且In-header database size不為0時database才有效)
offset 32-35 Free page list中第⼀個page的⾴號(0表⽰Free page list為空)
offset 36-39 Free page list中所有空間中page的總數量
offset 40-43 Schema cookie: 相當於該database schema的版本號次，每次schema發⽣改變都會使
該值增加。需要注意的是，每當⼀個prepared statement要運作的時候，要先確認該Schema cookie
是否和prepare的時候⼀樣，若不⼀樣就要重新prepare
offset 44-47 Schema格式版本號次: 這邊代表的是上層的SQL格式，⾮底層的B-tree，⽬前四個格式
代表:
格式1.: 被⽀持3.0.0之後的SQLite⽀持
格式2.: 增加對增減colume的⽀持，包含ADD COLUME等等
格式3.: 增加ALTER TABLE、ADD COLUME等命令⾮non-NULL default values的⽀持
格式4.: 增加對DESC的關鍵字⽀持
offset 48-51 Suggested cache size:該值如同字⾯意思僅僅是個建議值不⼀定要遵守，可以通過
default_cache_size pragma來恢復成這個預設值
offset 52-55 Incremental vacuum settings:位於offset52與64的兩個Bytes⽤於管理auto-vacuum模式
和incremental-vacuum模式，如果offset52值為0表⽰pointer-map（ptrmap）page將會被省略，並且
auto-vacuum和incremental-vacuum模式都不被⽀持。如果⾮0表⽰該database中的root page號次，
database file中將包含ptrmap page並且database為auto-vacuum（offset 64的值為false）
或者incremental -vacuum模式（offset 64的值為true）。如果offset 52的值是0那offset 64也必須為0
offset 56-59 Text encoding: 表⽰編碼格式，值為1表⽰UTF-8，值為2表⽰UTF-16 little-endian，值
為3表⽰UTF-16 big-endian，其他值均為⾮法
offset 60-63 User version number: 這個User number僅僅表⽰user的版本號次提供user使⽤，
SQLite本⾝不使⽤這個值，可以⽤user_version pragma讀取
offset 68-71 Application ID:此為應⽤程序的ID，可以透過PRAGMA application_id讀取知道該
Database file屬於哪個應⽤程序
offset 72-91: 此為保留空間，為之後可能的⽤途做準備，必須設定為0
offset 92-95 Write library version number and version-valid-for number
offset 96-99 Header space reserved for expansion: 表⽰最近修改過database file的版本號次
Page:
Lock-Byte page:
Lock-Byte page為1個page位於data file offset 1073741824 and 1073742335之間，⼩於
1073741824的database沒有Lock-Byte page，⼤於Lock-Byte page的database也僅有⼀個Lock-
Byte page。Lock-Byte page的存在我們可以把他理解為為了版本向前兼容，即使是古⽼的版本也可
以使⽤較新版本的軟或硬件，在window-95有強制⽂件鎖(也就是要利⽤Lock-Byte page實現對
database file進⾏鎖操作)，⽬前現代的系統都⽀持⾮強制⽂件鎖，所以Lock-Byte page不再有意義。
Freelist Page:
在database file中可能會含有沒有在使⽤(inactive)的空⽩page，可能是因為使⽤者刪除database中的
資料導致unused page的出現，這些unused page就存放於Freelist之中，如果需要額外page時這些
unused page就會被重新使⽤。
freelist是由freelist trunk pages所組成的linked list，每個freelist trunk pages包含0個以上的freelist
leaf page的page number.
freelist trunk page由4個byte 整數array組成，這個array的數量由page中可⽤空間⼤⼩決定。最⼩的
page可⽤空間⼤⼩是480Bytes，所以最少會有120個元素。
接著為了⽅便說明我畫了圖表⽰freelist trunk page的架構

⾸先第⼀個4Byte為下⼀個freelist trunk的page number，如果為0表⽰他是最後⼀個trunk
第⼆個4Byte表⽰leaf page 指標的數量，如果該整數⼤於0，假設為L，那每個在index第3到L+2的整
數都是表達Freelist leaf Page的page number，如第三個就是地⼀個Freelist leaf Page的編號以此類
推
B-Tree Page:
SQLite使⽤了兩種B-tree:分別為B* Tree，將所有data存在leaf中(SQLite稱他為table b-tree)。B Tree
則是將key與data保存在leaf與interior page中，僅僅存key不存data，SQLite稱他為index b-tree。
⼀個B-tree table若不是Table b-tree page就是index b-tree page，同⼀個B-tree中的table都是同⼀種
類型。
Table b-tree中的每個entry由⼀個64位的有號整數key與⻑度最多2147483647 bytes的任意⻑度data
組成，Interior table b-tree⾴中只有key和指向其⼦page的指標，所有的data都存放在table b-tree
leaf page。
Cell Payload Overflow Pages:
當⼀個B-tree cell太⼤裝不下時，多餘的部份將會放⼊Overflow Page中，Overflow Page會組成⼀個
Linked List，這些Cell Payload Overflow Page的結構如下圖:
⼀開始的4個Bytes是下⼀個page的page number，如果是0的話表⽰是最後⼀個
第5個Bytes開始表⽰溢出的內容
Pointer Map or Ptrmap Pages(Ptrmap):
ptrmap是datavase中額外的page，⽬的是使auto_vacuum和incremental_vacuum模式更加有效率，
通常其他的page都會有⽗node到⼦node的指標，例如上⾯提到的Cell Payload Overflow Pages⼀開
始的4個Bytes是下⼀個page的page number，那ptrmap提供另外⼀個⽅向，可以從⼦node快速到⽗
node，在release page時可以快速通過ptrmap找到指定的⽗node並修改他
架構如下圖:
其中每個ptrmap中的的5bytes由1個byte的page類型和其後的4byte page number組成。
有5種類型：
1. A b-tree root page: 後⾯4byte的page number應該為0
2. A freelist page: 後⾯4byte的page number應該為0
3. The first page of a cell payload overflow chain:後⾯4byte指向該cell payload overflow page的
page number
4. A page in an overflow chain other than the first page: cell payload overflow page list中的其他
page，後⾯4byte指向前⼀page的page number
5. A non-root b-tree page: 後⾯4byte表⽰其⽗page的page number

Schema Layer:
這邊開始將描述SQLite底層file format

Record Format:table b-tree中的data或是index b-tree中的key會被轉換為record格式，record使⽤
64-bit signed integers格式定義了對於table或是index中的cloumn的data，描述column的個數、data
type、內容
Serial Type Codes Of The Record Format:
1個record以⼀個varint開頭表⽰整個header的⻑度，接著是好幾個varint(這些varint被稱為serial
type)⽤來描述Column的datatype與⻑度，⽐較不⼀樣的是描述BLOB的時候會變成2-3的Byte，
接著在header後⾯的就是各個colume的值，但對於serial type為0、8、9、12和13的幾種類型的
column來說，column body的⻑度為0，如同下圖表達
record排序順序:
⽐較規則如下(由格式的左⽐到右):
1. NULL values(serial type 0)最優先
2. Numeric values(serial types 1 到9）排在NULL之後，整數間Numeric values排序
3. Text values 由column⽐對函數決定
4. BLOB values 由memcmp()決定
Representation Of SQL Tables
每個Database Schema中有rowid的SQL table在Disk上都是通過table B-tree表⽰，Table b-tree中的
每個Entry對應於SQL table中的⼀⾏，Entry中的64位有號整數key對應於rowid
⾸先將各個column組成⼀個record格式的byte array，array中各個值的排列順序和各個column在表中
的排列順序⼀致，接著將該array作為payload存放在table b-tree的entry中，如果⼀個SQL表包含
INTEGER PRIMARY KEY的column(此定義的column就是rowid，替代原先隱含的rowid)，那麼這個
column在record中的值為空值，SQLite包含了INTEGER PRIMARY KEY的column的時候，使⽤table
b-tree的key替代。
如果⼀個column的affinity（建議類型）是REAL類型，並且包含⼀個可以被轉換為整數的值（沒有⼩
數部分，並且沒有⼤到溢出整數所能表⽰範圍）那麼該值在record中可能被存儲為整數。當將之從
record中提取出來的時候，SQLite會⾃動轉換為浮點數。
Representation of WITHOUT ROWID Tables
如果⼀個SQL table在創建的時候⽤了“WITHOUT ROWID”，那麼該table就是⼀個WITHOUT ROWID
table，在disk上的存儲格式也和普通SQL表不同。WITHOUT ROWID表使⽤index b-tree⽽⾮table b-
tree來進⾏存儲。這個index b-tree中各個entry的key就是record，這些record由PRIMARY KEY 修飾
的column作為開頭（WITHOUT ROWID表必須要有primary key，⽤於替代rowid的作⽤），其它
column在其後⾯。
所以，WITHOUT ROWID表的的內容編碼和普通的rowid表編碼是⼀樣的，區別在於：

將PRIMARY KEY的column提前到最前
將整個record（⽐較的時候就看PRIMARY KEY，值唯⼀且不能為空）作為index b-tree中的key，⽽
⾮table b-tree中的data。
普通rowid表中對REAL建議類型存儲的特殊規則，在WITHOUT ROWID表中也同樣適⽤。
Representation Of SQL Indices
所有的SQL index，無論是通過CREATE INDEX語句顯性create的，還是通過UNIQUE或者
PRIMARY KEY隱性create的，都對應於數據庫⽂件中的⼀個index b-tree。Index b-tree中的每個
entry對應於相應SQL表的⼀⾏，index b-tree的key是⼀個record，這個record由對應表中被索引的
column（⼀個或者多個）以及對應表的key組成。對於普通rowid表來說，這個key就是rowid，對於
WITHOUT ROWID表來說，這個key是PRIMARY KEY，無論哪種情況，這個key都是表中唯⼀的
（這裡的key指的不是b-tree的key）。
在普通的index中，原table中的⾏，和相應的index中的entry是⼀⼀對應的，但是在⼀個partial
index（創建index的時候加WHERE從句，對table中部分row索引）中，index b -tree只包含WHERE
從句為真的⾏對應的entry。Index和table b-tree中對應的⾏使⽤同樣的rowid或primary key，並且所
有被索引的column都具有和原table中column同樣的值

b. describe how SQLite control file read, write, close etc. (a brief explanation is enough here)
Reference:
1. 官⽅⽂件:SQLite File IO Specification (https://www.sqlite.org/fileio.html)
2. 官⽅WAL說明⽂件(https://www.sqlite.org/wal.html)
3. 官⽅⽂件:File Locking And Concurrency In SQLite Version 3 (https://www.sqlite.org/lockingv3.html)
4. WAL-mode File Format (https://www.sqlite.org/walformat.html)
打開與關閉page分別使⽤sqlite3pagereopen與sqlite3pagereclose
並且使⽤sqlite3pagerReadFileheader讀取header
如果需要寫⼊的話為使⽤sqlute3PagerWrite的這個Function
以下擷取部份程式碼的部份

int sqlite3PagerOpen(
sqlite3_vfs *pVfs, /* The virtual file system to use */
Pager **ppPager, /* OUT: Return the Pager structure here */
const char *zFilename, /* Name of the database file to open */
int nExtra, /* Extra bytes append to each in-memory page */
int flags, /* flags controlling this file */
int vfsFlags, /* flags passed through to sqlite3_vfs.xOpen() */
void (*xReinit)(DbPage*) /* Function to reinitialize pages */
){
u8 *pPtr;
Pager *pPager = 0; /* Pager object to allocate and return */
int rc = SQLITE_OK; /* Return code */
int tempFile = 0; /* True for temp files (incl. in-memory files) */
int memDb = 0; /* True if this is an in-memory file */
int readOnly = 0; /* True if this is a read-only file */
int journalFileSize; /* Bytes to allocate for each journal fd */
char *zPathname = 0; /* Full path to database file */
int nPathname = 0; /* Number of bytes in zPathname */
int useJournal = (flags & PAGER_OMIT_JOURNAL)==0; /* False to omit journal *
int pcacheSize = sqlite3PcacheSize(); /* Bytes to allocate for PCache
u32 szPageDflt = SQLITE_DEFAULT_PAGE_SIZE; /* Default page size */
const char *zUri = 0; /* URI args to copy */
int nUriByte = 1; /* Number of bytes of URI args at *zUri */
int nUri = 0; /* Number of URI parameters */
/* Figure out how much space is required for each journal file-handle
** (there are two of them, the main journal and the sub-journal). */
journalFileSize = ROUND8(sqlite3JournalSize(pVfs));
/* Set the output variable to NULL in case an error occurs. */
*ppPager = 0;
/* Compute and store the full pathname in an allocated buffer pointed
** to by zPathname, length nPathname. Or, if this is a temporary file,
** leave both nPathname and zPathname set to 0.
*/
if( zFilename && zFilename[0] ){
const char *z;
nPathname = pVfs->mxPathname+1;
zPathname = sqlite3DbMallocRaw(0, nPathname*2);
if( zPathname==0 ){
return SQLITE_NOMEM_BKPT;
}
zPathname[0] = 0; /* Make sure initialized even if FullPathname() fails */
rc = sqlite3OsFullPathname(pVfs, zFilename, nPathname, zPathname);
if( rc!=SQLITE_OK ){
if( rc==SQLITE_OK_SYMLINK ){
if( vfsFlags & SQLITE_OPEN_NOFOLLOW ){
rc = SQLITE_CANTOPEN_SYMLINK;
}else{
rc = SQLITE_OK;
}
}
}
nPathname = sqlite3Strlen30(zPathname);
z = zUri = &zFilename[sqlite3Strlen30(zFilename)+1];
while( *z ){
z += strlen(z)+1;
z += strlen(z)+1;
nUri++;
}
nUriByte = (int)(&z[1] - zUri);
assert( nUriByte>=1 );
if( rc==SQLITE_OK && nPathname+8>pVfs->mxPathname ){
/* This branch is taken when the journal path required by
** the database being opened will be more than pVfs->mxPathname
** bytes in length. This means the database cannot be opened,
** as it will not be possible to open the journal file or even
** check for a hot-journal before reading.
*/
rc = SQLITE_CANTOPEN_BKPT;
}
sqlite3DbFree(0, zPathname);
return rc;
}
}
/* Allocate memory for the Pager structure, PCache object, the
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

/ ocate e o y o t e age st uctu e, Cac e object, t e
** three file descriptors, the database file name and the journal
** file name. The layout in memory is as follows:
**
** Pager object (sizeof(Pager) bytes)
** PCache object (sqlite3PcacheSize() bytes)
** Database file handle (pVfs->szOsFile bytes)
** Sub-journal file handle (journalFileSize bytes)
** Main journal file handle (journalFileSize bytes)
** Ptr back to the Pager (sizeof(Pager*) bytes)
** 0000 database prefix (4 bytes)
** Database file name (nPathname+1 bytes)
** URI query parameters (nUriByte bytes)
** Journal filename (nPathname+8+1 bytes)
** WAL filename (nPathname+4+1 bytes)
** 000 terminator (3 bytes)
**
** Some 3rd-party software, over which we have no control, depends on
** the specific order of the filenames and the 0 separators between them
** so that it can (for example) find the database filename given the WAL
** filename without using the sqlite3_filename_database() API. This is a
** misuse of SQLite and a bug in the 3rd-party software, but the 3rd-party
** software is in widespread use, so we try to avoid changing the filename
** order and formatting if possible. In particular, the details of the
** filename format expected by 3rd-party software should be as follows:
**
** - Main Database Path
** - 0
** - Multiple URI components consisting of:
** - Key
** - 0
** - Value
** - 0
** - 0
** - Journal Path
** - 0
** - WAL Path (zWALName)
** - 0
**
** The sqlite3_create_filename() interface and the databaseFilename() utilit
** that is used by sqlite3_filename_database() and kin also depend on the
** specific formatting and order of the various filenames, so if the format
** changes here, be sure to change it there as well.
*/
pPtr = (u8 *)sqlite3MallocZero(
ROUND8(sizeof(*pPager)) + /* Pager structure */
ROUND8(pcacheSize) + /* PCache object */
ROUND8(pVfs->szOsFile) + /* The main db file */
journalFileSize * 2 + /* The two journal files */
sizeof(pPager) + /* Space to hold a pointer */
4 + /* Database prefix */
nPathname + 1 + /* database filename */
nUriByte + /* query parameters */
nPathname + 8 + 1 + /* Journal filename */
3 /* Terminator */
);
assert( EIGHT_BYTE_ALIGNMENT(SQLITE_INT_TO_PTR(journalFileSize)) );
if( !pPtr ){
sqlite3DbFree(0, zPathname);
return SQLITE_NOMEM_BKPT;
}
pPager = (Pager*)pPtr; pPtr += ROUND8(sizeof(*pPager));
pPager->pPCache = (PCache*)pPtr; pPtr += ROUND8(pcacheSize);
pPager->fd = (sqlite3_file*)pPtr; pPtr += ROUND8(pVfs->szOsFile);
pPager->sjfd = (sqlite3_file*)pPtr; pPtr += journalFileSize;
pPager->jfd = (sqlite3_file*)pPtr; pPtr += journalFileSize;
assert( EIGHT_BYTE_ALIGNMENT(pPager->jfd) );
memcpy(pPtr, &pPager, sizeof(pPager)); pPtr += sizeof(pPager);
/* Fill in the Pager.zFilename and pPager.zQueryParam fields */
pPtr += 4; /* Skip zero prefix */
pPager->zFilename = (char*)pPtr;
if( nPathname>0 ){
memcpy(pPtr, zPathname, nPathname); pPtr += nPathname + 1;
if( zUri ){
memcpy(pPtr, zUri, nUriByte); pPtr += nUriByte;
}else{
pPtr++;
}
}
9
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
9

/* Fill in Pager.zJournal */
if( nPathname>0 ){
pPager->zJournal = (char*)pPtr;
memcpy(pPtr, zPathname, nPathname); pPtr += nPathname;
memcpy(pPtr, "-journal",8); pPtr += 8 + 1;
}else{
pPager->zJournal = 0;
}
if( nPathname ) sqlite3DbFree(0, zPathname);
pPager->pVfs = pVfs;
pPager->vfsFlags = vfsFlags;
/* Open the pager file.
*/
if( zFilename && zFilename[0] ){
int fout = 0; /* VFS flags returned by xOpen() */
rc = sqlite3OsOpen(pVfs, pPager->zFilename, pPager->fd, vfsFlags, &fout);
assert( !memDb );
readOnly = (fout&SQLITE_OPEN_READONLY)!=0;
/* If the file was successfully opened for read/write access,
** choose a default page size in case we have to create the
** database file. The default page size is the maximum of:
**
** + SQLITE_DEFAULT_PAGE_SIZE,
** + The value returned by sqlite3OsSectorSize()
** + The largest page size that can be written atomically.
*/
if( rc==SQLITE_OK ){
int iDc = sqlite3OsDeviceCharacteristics(pPager->fd);
if( !readOnly ){
setSectorSize(pPager);
assert(SQLITE_DEFAULT_PAGE_SIZE<=SQLITE_MAX_DEFAULT_PAGE_SIZE);
if( szPageDflt<pPager->sectorSize ){
if( pPager->sectorSize>SQLITE_MAX_DEFAULT_PAGE_SIZE ){
szPageDflt = SQLITE_MAX_DEFAULT_PAGE_SIZE;
}else{
szPageDflt = (u32)pPager->sectorSize;
}
}
}
pPager->noLock = sqlite3_uri_boolean(pPager->zFilename, "nolock", 0);
if( (iDc & SQLITE_IOCAP_IMMUTABLE)!=0
|| sqlite3_uri_boolean(pPager->zFilename, "immutable", 0) ){
vfsFlags |= SQLITE_OPEN_READONLY;
goto act_like_temp_file;
}
}
}else{
/* If a temporary file is requested, it is not opened immediately.
** In this case we accept the default page size and delay actually
** opening the file until the first call to OsWrite().
**
** This branch is also run for an in-memory database. An in-memory
** database is the same as a temp-file that is never written out to
** disk and uses an in-memory rollback journal.
**
** This branch also runs for files marked as immutable.
*/
act_like_temp_file:
tempFile = 1;
pPager->eState = PAGER_READER; /* Pretend we already have a lock */
pPager->eLock = EXCLUSIVE_LOCK; /* Pretend we are in EXCLUSIVE mode */
pPager->noLock = 1; /* Do no locking */
readOnly = (vfsFlags&SQLITE_OPEN_READONLY);
}
/* The following call to PagerSetPagesize() serves to set the value of
** Pager.pageSize and to allocate the Pager.pTmpSpace buffer.
*/
assert( pPager->memDb==0 );
rc = sqlite3PagerSetPagesize(pPager, &szPageDflt, -1);
testcase( rc!=SQLITE_OK );
}
/* Initialize the PCache object. */
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238

j
nExtra = ROUND8(nExtra);
assert( nExtra>=8 && nExtra<1000 );
rc = sqlite3PcacheOpen(szPageDflt, nExtra, !memDb,
!memDb?pagerStress:0, (void *)pPager, pPager->pPCache);
}
/* If an error occurred above, free the Pager structure and close the file.
*/
sqlite3OsClose(pPager->fd);
sqlite3PageFree(pPager->pTmpSpace);
sqlite3_free(pPager);
return rc;
}
PAGERTRACE(("OPEN %d %sn", FILEHANDLEID(pPager->fd), pPager->zFilename));
IOTRACE(("OPEN %p %sn", pPager, pPager->zFilename))
pPager->useJournal = (u8)useJournal;
/* pPager->stmtOpen = 0; */
/* pPager->stmtInUse = 0; */
/* pPager->nRef = 0; */
/* pPager->stmtSize = 0; */
/* pPager->stmtJSize = 0; */
/* pPager->nPage = 0; */
pPager->mxPgno = SQLITE_MAX_PAGE_COUNT;
/* pPager->state = PAGER_UNLOCK; */
/* pPager->errMask = 0; */
pPager->tempFile = (u8)tempFile;
assert( tempFile==PAGER_LOCKINGMODE_NORMAL
|| tempFile==PAGER_LOCKINGMODE_EXCLUSIVE );
assert( PAGER_LOCKINGMODE_EXCLUSIVE==1 );
pPager->exclusiveMode = (u8)tempFile;
pPager->changeCountDone = pPager->tempFile;
pPager->memDb = (u8)memDb;
pPager->readOnly = (u8)readOnly;
assert( useJournal || pPager->tempFile );
pPager->noSync = pPager->tempFile;
if( pPager->noSync ){
assert( pPager->fullSync==0 );
assert( pPager->extraSync==0 );
assert( pPager->syncFlags==0 );
assert( pPager->walSyncFlags==0 );
}else{
pPager->fullSync = 1;
pPager->extraSync = 0;
pPager->syncFlags = SQLITE_SYNC_NORMAL;
pPager->walSyncFlags = SQLITE_SYNC_NORMAL | (SQLITE_SYNC_NORMAL<<2);
}
/* pPager->pFirst = 0; */
/* pPager->pFirstSynced = 0; */
/* pPager->pLast = 0; */
pPager->nExtra = (u16)nExtra;
pPager->journalSizeLimit = SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT;
assert( isOpen(pPager->fd) || tempFile );
setSectorSize(pPager);
if( !useJournal ){
pPager->journalMode = PAGER_JOURNALMODE_OFF;
}else if( memDb || memJM ){
pPager->journalMode = PAGER_JOURNALMODE_MEMORY;
}
/* pPager->xBusyHandler = 0; */
/* pPager->pBusyHandlerArg = 0; */
pPager->xReiniter = xReinit;
setGetterMethod(pPager);
/* memset(pPager->aHash, 0, sizeof(pPager->aHash)); */
/* pPager->szMmap = SQLITE_DEFAULT_MMAP_SIZE // will be set by btree.c */
*ppPager = pPager;
return SQLITE_OK;
}
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310

int sqlite3PagerReadFileheader(Pager *pPager, int N, unsigned char *pDest){
int rc = SQLITE_OK;
memset(pDest, 0, N);
assert( isOpen(pPager->fd) || pPager->tempFile );
/* This routine is only called by btree immediately after creating
** the Pager object. There has not been an opportunity to transition
** to WAL mode yet.
*/
assert( !pagerUseWal(pPager) );
if( isOpen(pPager->fd) ){
IOTRACE(("DBHDR %p 0 %dn", pPager, N))
rc = sqlite3OsRead(pPager->fd, pDest, N, 0);
if( rc==SQLITE_IOERR_SHORT_READ ){
rc = SQLITE_OK;
}
}
return rc;
}
int sqlite3PagerClose(Pager *pPager, sqlite3 *db){
u8 *pTmp = (u8*)pPager->pTmpSpace;
assert( db || pagerUseWal(pPager)==0 );
assert( assert_pager_state(pPager) );
disable_simulated_io_errors();
sqlite3BeginBenignMalloc();
pagerFreeMapHdrs(pPager);
/* pPager->errCode = 0; */
pPager->exclusiveMode = 0;
pager_reset(pPager);
if( MEMDB ){
pager_unlock(pPager);
}else{
/* If it is open, sync the journal file before calling UnlockAndRollback.
** If this is not done, then an unsynced portion of the open journal
** file may be played back into the database. If a power failure occurs
** while this is happening, the database could become corrupt.
**
** If an error occurs while trying to sync the journal, shift the pager
** into the ERROR state. This causes UnlockAndRollback to unlock the
** database and close the journal file without attempting to roll it
** back or finalize it. The next database user will have to do hot-journal
** rollback before accessing the database file.
*/
if( isOpen(pPager->jfd) ){
pager_error(pPager, pagerSyncHotJournal(pPager));
}
pagerUnlockAndRollback(pPager);
}
sqlite3EndBenignMalloc();
enable_simulated_io_errors();
PAGERTRACE(("CLOSE %dn", PAGERID(pPager)));
IOTRACE(("CLOSE %pn", pPager))
sqlite3OsClose(pPager->jfd);
sqlite3OsClose(pPager->fd);
sqlite3PageFree(pTmp);
sqlite3PcacheClose(pPager->pPCache);
assert( !pPager->aSavepoint && !pPager->pInJournal );
assert( !isOpen(pPager->jfd) && !isOpen(pPager->sjfd) );
sqlite3_free(pPager);
return SQLITE_OK;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

// 將 data page設定為可寫的
// 這個function在真正對data做改變之前應該要先被呼叫
// 呼叫這個Function者除⾮回傳值是SQLITE_OK否則不能對資料操作
int sqlite3PagerWrite(PgHdr *pPg){
Pager *pPager = pPg->pPager;
assert( (pPg->flags & PGHDR_MMAP)==0 );
assert( pPager->eState>=PAGER_WRITER_LOCKED );
assert( assert_pager_state(pPager) );
if( (pPg->flags & PGHDR_WRITEABLE)!=0 && pPager->dbSize>=pPg->pgno ){
if( pPager->nSavepoint ) return subjournalPageIfRequired(pPg);
return SQLITE_OK;
}else if( pPager->errCode ){
return pPager->errCode;
}else if( pPager->sectorSize > (u32)pPager->pageSize ){
assert( pPager->tempFile==0 );
return pagerWriteLargeSector(pPg);
}else{
//與pager_write不同的是這FUnction還會處理⼀個特例:
//當兩個以上多個page fit on a single disk sector
return pager_write(pPg);
}
// 如果有錯誤發⽣Return SQLITE_NOMEM 或是 IO error code
// 否則的話回傳 SQLITE_OK
}

4. Describe the concurrency control of SQLite
a. describe how SQLite handle concurrency control (file locking, journal files etc.) in detail
Reference:
1. 官⽅說明⽂件Write-Ahead Logging (https://www.sqlite.org/wal.html)
2. 官⽅說明⽂件Atomic Commit In SQLite (https://www.sqlite.org/atomiccommit.html)
3. CSDN (https://blog.csdn.net/tianxuhong/article/details/78752357)
4. SQLite WAL 模式簡單介紹(https://xiaozhuanlan.com/topic/1754328960)
5. CSDN (https://blog.csdn.net/tianyeming/article/details/85763621)
6. WAL-mode File Format (https://www.sqlite.org/walformat.html)
7. sqlite:WAL模式(https://www.jianshu.com/p/c78cf4caceab)
8. 官⽅⽂件:File Locking And Concurrency In SQLite Version 3 (https://www.sqlite.org/lockingv3.html)
9. Page Cache之並發控制(https://www.cnblogs.com/hustcat/archive/2009/03/01/1400757.html)
10. Blog (https://my.oschina.net/u/587236/blog/129022)
11. 深⼊理解SQLite (https://www.kancloud.cn/kangdandan/sqlite/64358)
12. Isolation In SQLite (https://sqlite.org/isolation.html)
SQLite的Lock機制與型態：
locking
state
說明
unlocked
如字⾯上的意思就是Unlock，
該process沒有取得任何lock(需要注意的是這是SQLite的預設值)
shared
多個process可以同時對database讀取資料但不可以做寫⼊的動作
(因此允許多個shared lock)
reserved
表⽰某個process預計在之後要做寫⼊的動作，同⼀時間只能存在⼀個reserved
lock，但可以同時shared lock共存
pending
代表該process希望盡快拿到exclusive lock，但⽬前還有其他share lock還沒結束，
只要存在⼀個pending lock SQLite就不會允許新的shared lock
exclusive
process要寫⼊Databse前需要取得exclusive lock，同時只要有⼀個exclusive
lock存在(最多也只能有⼀個)，不可以存在其他lock，
因此SQLite要盡量縮短exclusive lock存在的時間
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

在將原本SQL語法解析完後，就要對database進⾏操作，SQLite會透過B-tree尋找需要的page，B-
tree的⼯作是維護(聯繫)各個page之間的關係，並不會直接讀寫Disk，是由Pager來負責獲得所需要的
page或是需要修改的page，所以可以說pager就是B-tree與Disk的中間⼈。
Atomic Commit In SQLite (https://www.sqlite.org/atomiccommit.html)這個官⽅⽂件中說明Atomic Commit表⽰
著⼀個Transaction database的變化會完成或是不完成，並不會write data到⼀半結果斷掉或系統崩潰
造成database損毀的現象，
初始狀態
在⼀個database剛打開時，最右邊Disk表⽰存在Disk中的內容，中間表⽰OS的Disk buffer cache，
左邊表⽰正在使⽤SQlite，User process的memory內容，因為還沒有任何data被讀⼊，所以是空的。
1.Acquiring A Read Lock
在SQlite write data前需要先拿到Read Lock，看他是否已經存在於database中了。為了從database
file中讀取資料第⼀步是獲得shared lock，如上⾯所提到的，shared lock允中多個process同時讀取同
個file，但不允許做任何的寫⼊。
需要注意的是shared lock是針對OS Buffer並⾮Disk本⾝。
2.Reading Information Out Of The Database
當Shared lock取得之後，就可以從file中讀取資訊，根據前⾯所假設的，我們已經假設User space是
空的，所以資訊必須先從Disk到OS Buffer，之後所需的部份或是全部資訊都可以OS Buffer取得，⼀
般⽽⾔只有database中的部份page讀取，因為database中的資料數通常不會太⼩，如上圖8個page只
讀3個

3.Obtaining A Reserved Lock
在開始修改database前，必須先拿到RESERVED LOCK，上述也有提到了，⼀個database file只能
有⼀個RESERVED LOCK，⽬的是宣告他即將去修改database了(還沒開始)，⽽其他process可以同
時做讀取
4.Creating A Rollback Journal File
在修改database前，SQLite會⽣成⼀個單獨的rollback journal file，並將被修改前的page寫⼊其中，
所以rollback journal file表⽰著我們恢復database的所有資訊
rollback journal file的頂部有⼀個header(如圖綠⾊標記部份)，紀錄原始database file的⼤⼩，因此就
算database被修改後變⼤了我們還是知道他原始的⼤⼩為何
當⼀個新 file被建⽴時⼤部分OS並不會⾺上寫⼊到Disk中，會有些延遲，如圖中所表⽰的Disk中的file
部份仍為空⽩狀態
5.Changing Database Pages In User Space
當修改前的狀態已經被存起來後，我們就可以放⼼的去做修改Database的動作

6.Flushing The Rollback Journal File To Mass Storage
接下來rollback journal file將會存回Disk中，因為要寫⼊Disk，這是⼀個相當耗時的動作(離CPU越近
的記憶體越⼩也越快，越遠記憶體越⼤也越慢)
7.Obtaining An Exclusive Lock
在修改database前，還需要取得⼀個Pending Lock，如同上述所提到的，Pending Lock允許其他
Process繼續讀取，但不允許繼續⽣成SHARED Lock。
Pending Lock存在的⽬的是:試想今天假如有多個Process都要讀取同⼀個file，不斷的有⼈申請要
SHARED Lock，有⼈完成讀取後釋放SHARED Lock，如果沒有PENDING LOCK，最後要寫⼊的
Process就等不到EXCLUSIVE LOCk，但有了PENDING Lock後就可以阻⽌不斷⽣成SHARED Lock
8.Writing Changes To The Database File

Process拿到EXL+CLUSIVE LOCK後代表此時不會並存其他LOCK，可以放⼼的更新或是寫⼊資料，
修改完後也會更新OS Buffers的內容
9.Flushing Changes To Mass Storage
將做了修改的內容再存回Disk
10.Deleting The Rollback Journal
當數據已經安全寫⼊到Disk後rollback journal file就沒有必要存在了，因此可以刪除，如果刪除之後
發⽣系統崩潰或是停電等情況，因為所有變化已經寫⼊Disk，並不影響，所以SQLite判斷Database
file是否完成變更要由rollback journal file是否存在來判斷
11.Releasing The Lock
最後⼀個步驟是釋放調EXCLUSIVE LOCK，這樣其他Process就⼜可以繼續訪問database file

上圖顯⽰鎖釋放的同時User space也被清空，但較新版本的SQLite並不會清空的，避免下⼀個
Process開始時⼜使⽤到相同的數據，如果清空了反⽽會降低效能，再次使⽤這些數據之前我們得先
獲得⼀個SHARED Lock，然後檢查database file的第⼀個page中的修改次數計數器(在a題中已提
到)，file每做⼀次修改該計數器就會加1，如果該計數器顯⽰這個file已經被修改過了那就要清空User
space並且重新讀⼊。

Rollback
這邊講說明如果發⽣電源發⽣問題、程式崩潰等等情況時，要如何恢復資料
1.When Something Goes Wrong…
假設在上述寫⼊資料到Disk的過程中發⽣問題(第9步)，
2.Hot Rollback Journals
在上述已經提過了，Hot Journal的存在是為了要使database恢復到故障前的狀態。當Process試圖去
訪問database file前，必須取得SHARED Lock，但現在卻被告知有Rollback Journal存在，SQLite會
去查看這個Journal是否為Hot Journal，Hot Journal的存在表⽰在前⼀步過程中發⽣了系統崩潰或是
故障等等
那，要如何判斷Journal是否為Hot Journal呢？
1. Rollback Journal 存在
2. Rollback Journal 並⾮空的
3. Database中⽬前不存在RESERVED lock
4. Rollback Journal 不含有 the name of a master journal file(稍後會說明) 或是包含了the name of
a master journal file 且 master journal file 存在
3.Obtaining An Exclusive Lock On The Database

為了防⽌其他Process rollback同⼀個hot journal ，要先取得⼀個Database的EXCLUSIVE Lock
4.Rolling Back Incomplete Changes
⼀旦Process取得的EXCLUSIVE Lock，他就被允許可以更新database file，然後從Journal中讀取原
始的內容，在上述有提到Database file原始⼤⼩已經寫⼊Journal的Header，所以SQLite可以⽤這個
訊息來中⽌Database file
5.Deleting The Hot Journal
當Journal 都被放回到Database File且FLush到Disk後，這個Journal就可以刪掉了
6.Continue As If The Uncompleted Writes Had Never Happened

最後的這個步驟將EXCLUSIVE Lock降回成SHARED lock，這個恢復動作就到這裡完成了
原本預設的⽅法都是使⽤上述的rollback journal，在版本3.7.0(2010年)開始SQLite引⼊新的Write-
Ahead Log(WAL)模式，與預設的⽅法相⽐有這些優勢:⼤部分情況下有更好的效能、更好的
concurrency control，當然也含有⼀些缺點，如:所有的database操作必須都在同⼀台電腦上，並且該
機器的OS需要⽀持VFS特性等等
WAL的⼯作原理是採⽤與rollback journal相反的概念，rollback journal在對database file操作前會先
對database file備份，然後才開始操作，如果過程中發⽣故障或是斷電等情況，會將Journal的內容
rollback到datavase中進⾏恢復，如果順利完成的話在完成時刪除Journal相關⽂件
(如此圖，引⽤⾃Reference中的網站)
WAL則是在Process對database操作時，先複製⼀份原本的database數據到Journal中，然後把操作
更新在Journal，原有的database不做任何改變，如果過程最後失敗了WAL的紀錄就會被忽略掉，反
之，如果成功了他在之後就會被寫回到database中(稱為checkpoint)
(如此圖，引⽤⾃Reference中的網站)

在看SQLite如何實做前可以總結上述情況為此圖(圖⽚取⾃Reference)
SQLite實現⽅式

static int winLock(sqlite3_file *id, int locktype){
int rc = SQLITE_OK; /* Return code from subroutines */
int res = 1; /* Result of a Windows lock call */
int newLocktype; /* 在退出前把此值設為 pFile->locktype */
int gotPendingLock = 0;/* True if we acquired a PENDING lock this time */
winFile *pFile = (winFile*)id;
DWORD lastErrno = NO_ERROR;
assert( id!=0 );
OSTRACE(("LOCK file=%p, oldLock=%d(%d), newLock=%dn",
pFile->h, pFile->locktype, pFile->sharedLockByte, locktype));
// 如果來申請的LOCK權限沒有⽐當前File的權限還⾼的話就直接Return
if( pFile->locktype>=locktype ){
OSTRACE(("LOCK-HELD file=%p, rc=SQLITE_OKn", pFile->h));
return SQLITE_OK;
}
// 不允許Read-onlt的 database申請Wirte Lock
if( (pFile->ctrlFlags & WINFILE_RDONLY)!=0 && locktype>=RESERVED_LOCK ){
return SQLITE_IOERR_LOCK;
}
/* Make sure the locking sequence is correct
*/
/*
檢查當前提出的申請是否適當或是有不合法的地⽅
1. 如果⽬前database的Lock種類為NO_LOCK，
那申請的鎖類別⼀定要為SHARED_LOCK
2: 申請的Lock不可以是PENDING LOCK，因為這是⼀種過渡鎖
3: 如果申請的LOCK種類是RESERVED LOCK
那當前database的Lock種類⼀定要是SHARED_LOCK
*/
assert( pFile->locktype!=NO_LOCK || locktype==SHARED_LOCK );
assert( locktype!=PENDING_LOCK );
assert( locktype!=RESERVED_LOCK || pFile->locktype==SHARED_LOCK );
/* 如果申請的鎖類型是PENDING_LOCK或是SHARED_LOCK的話就要LOCK住PENDING_LOCK byte
* 1. 如果當前database處於沒有LOCk的狀態，⾸先要先取得SHARED_LOCK，
* 上述也有提過的無論是要READ或是WRITE都要先取得SHARED_LOCK
* 2. 如果當前database處於SHARED_LOCK狀態，
* 要申請EXCLUSIVE LOCK的話要先獲得PENDING LOCK
*/
newLocktype = pFile->locktype;
if( pFile->locktype==NO_LOCK
|| (locktype==EXCLUSIVE_LOCK && pFile->locktype<=RESERVED_LOCK)
){
int cnt = 3;
// 取得PENDING LOCK
while( cnt-->0 && (res = winLockFile(&pFile->h
, SQLITE_LOCKFILE_FLAGS,PENDING_BYTE, 0, 1, 0))==0 ){
/* 嘗試3次取得PENDING LOCK，這是必要的步驟，
* 與WINDOWS系統本⾝的INDEX或是防毒軟體造成
* */
lastErrno = osGetLastError();
OSTRACE(("LOCK-PENDING-FAIL file=%p, count=%d, result=%dn",
pFile->h, cnt, res));
if( lastErrno==ERROR_INVALID_HANDLE ){
pFile->lastErrno = lastErrno;
rc = SQLITE_IOERR_LOCK;
OSTRACE(("LOCK-FAIL file=%p, count=%d, rc=%sn",
pFile->h, cnt, sqlite3ErrName(rc)));
return rc;
}
if( cnt ) sqlite3_win32_sleep(1);
//設定gotPendingLock為1 後⾯的程式可能會根據這個值來釋放PENDING LOCK
}
gotPendingLock = res;
if( !res ){
}
}
/* 獲得SHARED Lock
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

/ 獲得S oc
* 此時 TRANSICTION應該要有PENDING_LOCK
* PENDING_LOCK作為NO_LOCK到SHARED_LOCK的⼀個過渡
* 實際上此时同時屬於:PENDING LOCK和SHARED LOCk，
* ⼀直到最後 PENDING_LOCK 後，才真正的屬於SHARED LOCK的狀態
*/
if( locktype==SHARED_LOCK && res ){
assert( pFile->locktype==NO_LOCK );
res = winGetReadLock(pFile);
if( res ){
newLocktype = SHARED_LOCK;
}else{
}
}
/* 獲得 RESERVED lock
* TRANSICTION應該要拿到 SHARED_LOCK，變化過程為: SHARED->RESERVED
* RESERVED LOCK的作⽤是為了提⾼系統的Concurrency效能
*/
if( locktype==RESERVED_LOCK && res ){
assert( pFile->locktype==SHARED_LOCK );
res = winLockFile(&pFile->h, SQLITE_LOCKFILE_FLAGS
, RESERVED_BYTE, 0, 1, 0);
if( res ){
newLocktype = RESERVED_LOCK;
}else{
}
}
/* Acquire a PENDING lock
*/
if( locktype==EXCLUSIVE_LOCK && res ){
// 這裡實際上並沒有做加鎖的動作
// PENDING LOCk在前⾯已經加過了
// 單純只是把狀態更改為PENDING狀態
newLocktype = PENDING_LOCK;
gotPendingLock = 0;
}
/* 取得 EXCLUSIVE lock
* 此時的TRANSICTION應該已經具備了PENDING LOCK
* 變化: PENDING LOCK -> EXCLUSIVE LOCK
*/
if( locktype==EXCLUSIVE_LOCK && res ){
assert( pFile->locktype>=SHARED_LOCK );
// 先解除該Process 對 database 的共享鎖
res = winUnlockReadLock(pFile);
res = winLockFile(&pFile->h, SQLITE_LOCKFILE_FLAGS, SHARED_FIRST, 0,
SHARED_SIZE, 0);
if( res ){
newLocktype = EXCLUSIVE_LOCK;
}else{
winGetReadLock(pFile);
}
/* 如果加鎖成功，則說明此詞沒有其他Process正在讀可以安全地進⾏Write的動作
* 如果加鎖失敗，則說明此時仍有其他Process正在進⾏Read的操作，
* 無法獲得EXCLUSIVE LOCk -> 也因此沒辦法Write
* 但仍保有PENDING LOCK
* */
}
/* 如果我們申請的是SHARED_LOCK那應該這時替他做更改
* Lock的變化為: PENDING LOCK->SHARED LOCK
*/
if( gotPendingLock && locktype==SHARED_LOCK ){
winUnlockFile(&pFile->h, PENDING_BYTE, 0, 1, 0);
}
/* 改變file的Lock狀態並適當傳回適當的Return code
*/
if( res ){
rc = SQLITE_OK;
}else{
pFile->lastErrno = lastErrno;
rc = SQLITE_BUSY;
OS C ((" OC fil % t d %d t %d "
9
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
9

OSTRACE(("LOCK-FAIL file=%p, wanted=%d, got=%dn",
pFile->h, locktype, newLocktype));
}
pFile->locktype = (u8)newLocktype;
OSTRACE(("LOCK file=%p, lock=%d, rc=%sn",
pFile->h, pFile->locktype, sqlite3ErrName(rc)));
return rc;
}
b. use examples to explain concurrency control/isolation in SQLite
i. when is the change done by an operation visible to other operations(which operations)
ii. when will nondeterministic happen (that is to say, in what condition we can not know
what will happen in advance)
隔離性是交易的保證之⼀，表⽰交易與交易之間不互相⼲擾，就好像同⼀個時間就只有⾃⼰的交易存
在⼀樣，隔離性保證的基本⽅式是在資料庫層⾯，對資料庫或相關欄位鎖定，在同⼀時間內只允許⼀
個交易進⾏更新或讀取。
可能發⽣的問題有這些:
1. 資料遺失:某個Transaction對欄位進⾏更新，因為同時有另⼀個交易的介⼊⽽遺失。
2. Dirty Read:兩個Transaction同時進⾏，其中⼀個Transaction更新資料，另⼀個Transaction讀了
還沒COMMIT的資料，就有可能發⽣Dirty Read問題
3. 無法重複讀取(unrepeatable read)：某個Transaction兩次讀取同⼀欄位的資料並不⼀致，例如，
如果Transaction A在Transaction B前後進⾏資料的讀取，則會得到不同的結果：
4. phantom read：如果交易A進⾏兩次查詢，在兩次查詢之中有個Transaction B插⼊⼀筆新資料或
刪除⼀筆新資料，第⼆次查詢時得到的資料多了第⼀次查詢時所沒有的筆數，或者少了⼀筆
交易隔離層級有這四種:
1. read uncommited:A交易在更新但未提交，B交易的更新會被延後⾄A提交之後
2. read commited: 讀取的交易不會阻⽌其它的交易，⼀個未確認的更新交易會阻⽌其它所有的交
易，但這影響效能較⼤，另⼀個基本作法是交易正在更新，尚未確定前都先操作暫存表格
3. repeatable read: 讀取交易不會阻⽌其它讀取的交易，但阻⽌其它寫⼊的交易，但這影響效能較
⼤，另⼀基本作法是，⼀個交易正在讀取，尚未確認前，另⼀交易要更新給予暫存表格。
4. serializable:A交易讀取時，B交易若要更新，就必須循序，A交易更新時，B交易無論讀取或更新
都必須循序。
SQLite只⽀援兩種交易隔離層級:serializable、read uncommitted
serializable: SQLite的預設，這是最嚴格但同時也是最安全的隔離層級，只要⼀個A Transaction要讀
取，但B Transaction要寫⼊或是更新，就必須依序，在A Transaction更新時，B Transaction無論有
無寫⼊或是讀取都必須依序執⾏，但也因此效率較差，僅適合平⾏處理效能要求較低且寫⼊操作⽐較
多的情況
可避免：更新遺失（lost update）、dirty read、無法重複讀取（unrepeatable read）、phantom
read問題。
read uncommitted: 與上述相反，這是最不嚴格同時最有可能發⽣資料不⼀致的可能性，但相對地，
這個隔離層級的效率較佳，作法是A Transaction在更新但還沒commit，B Transaction的更新就會被
延後到A commit之後，適合⽤於平⾏處理效能要求較⾼，記憶體資源較少，且寫⼊操作較少的情況
可避免「資料遺失」的問題，但無法避免「dirty read」、「unrepeatable read」以及「phantom
read」問題。

2-B
In the last part you are required answer questions related to topics mentioned in the lectures.
There are still lots of components in the database not included in the questions. In this part,
you need to select a component (or several components if you want) you are interested in
(and not the component we asked for detail explanation in the last part) then describe the
component in detail. Include the following points in your explanation:
159
160
161
162
163
164
165
166

Description of the component you choose (its functionality and role in SQLite)
Related source code files (in “sqlite/src/”)
Detail explanation of how it works (code explanation, description in documentation)
我選擇的部份為探討SQLite如何處理⼀個Transaction的完整流程
Reference:
1. SQLite File IO Specification (https://www.sqlite.org/fileio.html)
2. Transaction (https://www.sqlite.org/lang_transaction.html)
3. SQLite B-Tree Module (https://www.sqlite.org/btreemodule.html)
4. Atomic Commit In SQLite (https://www.sqlite.org/atomiccommit.html)
Transaction的處理也有涉及到上⾯問題中的鎖機制、concurrency control、Atomic Commit等等的內
容
有關聯的component:
關聯的檔案有多個包含了:
1. vdbc.c
2. btree.c
3. pager.c
底下有程式碼說明
完整的流程為:
OP_Transaction(vdbe.c):虛擬機跑指令
sqlite3BtreeBeginTrans(btree.c):btree開始Transaction
sqlite3pager_begin(pager.c):獲取寫鎖並打開journal⽂件
pager_open_journal(pager.c):打開journal⽂件，並寫⼊journal header
OS_interface

//
//p1为database file的index
//0:main database
//1:temporary tables使⽤
//p2 不為0，代表⼀個write Transaction的開始
case OP_Transaction: {
//data file 的 index
int i = pOp->p1;
//database應該對應到的btree
Btree *pBt;
assert( i>=0 && i<db->nDb );
assert( (p->btreeMask & (1<<i))!=0 );
//設定btree pointer
pBt = db->aDb[i].pBt;
if( pBt ){
rc = sqlite3BtreeBeginTrans(pBt, pOp->p2);
//開始btree的Transaction,主要给file加上LOCK,且設定btree的狀態
if( rc==SQLITE_BUSY ){
p->pc = pc;
p->rc = rc = SQLITE_BUSY;
goto vdbe_return;
}
if( rc!=SQLITE_OK && rc!=SQLITE_READONLY /* && rc!=SQLITE_BUSY */ ){
goto abort_due_to_error;
}
}
break;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

//開始⼀個Transaction，如果第⼆個parameter不是0,
//则為⼀個write Transaction開始,
//否則的話為⼀個read Transaction
//如果wrflag>=2,⼀個exclusive Transaction開始，
//此時别的process不能對database做任何操作
int sqlite3BtreeBeginTrans(Btree *p, int wrflag){
BtShared *pBt = p->pBt;
int rc = SQLITE_OK;
btreeIntegrity(p);
/* If the btree is already in a write-transaction, or it
** is already in a read-transaction and a read-transaction
** is requested, this is a no-op.
*/
//如果b-tree處於⼀個write transactions;或者處於⼀個read transactions
//如果此時另⼀個read transactions⼜request,則return SQLITE_OK
if( p->inTrans==TRANS_WRITE || (p->inTrans==TRANS_READ && !wrflag) ){
return SQLITE_OK;
}
/* Write transactions are not possible on a read-only database */
//write Transaction 不能對⽬前read-only database的database做操作
if( pBt->readOnly && wrflag ){
return SQLITE_READONLY;
}
/* If another database handle has already opened a write transaction
** on this shared-btree structure and a second write transaction is
** requested, return SQLITE_BUSY.
*/
//如果database已經處於write transactions
//则該write transactions request時return SQLITE_BUSY
if( pBt->inTransaction==TRANS_WRITE && wrflag ){
return SQLITE_BUSY;
}
do {
//如果database對應的btree的第⼀個page還沒讀⼊
//則把該page讀⼊並且database也相對應的加read lock
if( pBt->pPage1==0 ){
//加read lock且讀取內容
rc = lockBtree(pBt);
}
if( rc==SQLITE_OK && wrflag ){
//對database file加上RESERVED_LOCK
rc = sqlite3pager_begin(pBt->pPage1->aData, wrflag>1);
rc = newDatabase(pBt);
}
}
if( wrflag ) pBt->inStmt = 0;
}else{
unlockBtreeIfUnused(pBt);
}
}while( rc==SQLITE_BUSY && pBt->inTransaction==TRANS_NONE &&
sqlite3InvokeBusyHandler(pBt->pBusyHandler) );
if( p->inTrans==TRANS_NONE ){
//btree的transactions數量加1
pBt->nTransaction++;
}
//設定⽬前btree transactions 狀態
p->inTrans = (wrflag?TRANS_WRITE:TRANS_READ);
if( p->inTrans>pBt->inTransaction ){
pBt->inTransaction = p->inTrans;
}
}
btreeIntegrity(p);
return rc;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77

/*
**獲得database的write lock,但發⽣以下情況時去除write lock:
** * sqlite3pager_commit() is called.
** * sqlite3pager_rollback() is called.
** * sqlite3pager_close() is called.
** * sqlite3pager_unref() is called to on every outstanding page.
** pData指向database的打開的page,這時還沒開始進⾏修改
** 相對應的pager檢查是否為read-lock狀態
** 如果打開的不是temp file,則打開⽇誌⽂件
** 如果database已经處於寫⼊的狀態，则什麼事都不做
*/
int sqlite3pager_begin(void *pData, int exFlag){
PgHdr *pPg = DATA_TO_PGHDR(pData);
Pager *pPager = pPg->pPager;
int rc = SQLITE_OK;
assert( pPg->nRef>0 );
assert( pPager->state!=PAGER_UNLOCK );
//pager已经处于share状态
if( pPager->state==PAGER_SHARED ){
assert( pPager->aInJournal==0 );
if( MEMDB ){
pPager->state = PAGER_EXCLUSIVE;
pPager->origDbSize = pPager->dbSize;
}else{
//对file 加 RESERVED_LOCK
rc = sqlite3OsLock(pPager->fd, RESERVED_LOCK);
//設定pager的狀態
pPager->state = PAGER_RESERVED;
if( exFlag ){
rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);
}
}
return rc;
}
pPager->dirtyCache = 0;
TRACE2("TRANSACTION %dn", PAGERID(pPager));
//使⽤⽇誌,不是temp file,則打開⽇誌⽂件
if( pPager->useJournal && !pPager->tempFile ){
//為pager打開⽇誌⽂件，pager應該處於RESERVED或EXCLUSIVE狀態
//向⽇誌⽂件寫⼊header
rc = pager_open_journal(pPager);
}
}
}
return rc;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49

//create⼀個⽇誌⽂件
//pager應該為RESERVED或EXCLUSIVE狀態
static int pager_open_journal(Pager *pPager){
int rc;
assert( !MEMDB );
assert( pPager->state>=PAGER_RESERVED );
assert( pPager->journalOpen==0 );
assert( pPager->useJournal );
assert( pPager->aInJournal==0 );
sqlite3pager_pagecount(pPager);
pPager->aInJournal = sqliteMalloc( pPager->dbSize/8 + 1 );
if( pPager->aInJournal==0 ){
rc = SQLITE_NOMEM;
goto failed_to_open_journal;
}
//打開⽇誌⽂件
rc = sqlite3OsOpenExclusive(pPager->zJournal, &pPager->jfd,
pPager->tempFile);
//⽇誌⽂件的位置pointer
pPager->journalOff = 0;
pPager->setMaster = 0;
pPager->journalHdr = 0;
}
/* os這時create的file位於disk儲存
* 並沒有實際存在在disk
** 下⾯三個操作就是為了把结果寫⼊Disk,⽽對於
** windows系统來說,並沒有提供相對應的API，所以實際上沒有意義
*/
//fullSync對windows沒有意義
sqlite3OsSetFullSync(pPager->jfd, pPager->full_fsync);
sqlite3OsSetFullSync(pPager->fd, pPager->full_fsync);
/* Attempt to open a file descriptor for the directory that contains a file.
**This file descriptor can be used to fsync() the directory
**in order to make sure the creation of a new file is actually written to disk
*/
sqlite3OsOpenDirectory(pPager->jfd, pPager->zDirectory);
pPager->journalOpen = 1;
pPager->journalStarted = 0;
pPager->needSync = 0;
pPager->alwaysRollback = 0;
pPager->nRec = 0;
if( pPager->errCode ){
rc = pPager->errCode;
}
pPager->origDbSize = pPager->dbSize;
//寫⼊⽇誌⽂件的header
rc = writeJournalHdr(pPager);
if( pPager->stmtAutoopen && rc==SQLITE_OK ){
rc = sqlite3pager_stmt_begin(pPager);
}
if( rc!=SQLITE_OK && rc!=SQLITE_NOMEM ){
rc = pager_unwritelock(pPager);
rc = SQLITE_FULL;
}
}
return rc;
failed_to_open_journal:
sqliteFree(pPager->aInJournal);
pPager->aInJournal = 0;
if( rc==SQLITE_NOMEM ){
/* If this was a malloc() failure,
* then we will not be closing the pager
** file. So delete any journal file we may have just created.
** Otherwise,the system will get confused,
** we have a read-lock on the file and a
** mysterious journal has appeared in the filesystem.
*/
sqlite3OsDelete(pPager->zJournal);
}else{
sqlite3OsUnlock(pPager->fd, NO_LOCK);
pPager->state = PAGER_UNLOCK;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

}
return rc;
}
/*寫⼊⽂件的header
**journal header的格式如下:
** - 8 bytes: 標誌⽇志⽂件的magic number
** - 4 bytes: ⽇誌⽂件中紀錄數
** - 4 bytes: Random number used for page hash.
** - 4 bytes: 原本database的⼤⼩(kb)
** - 4 bytes: sector⼤⼩512byte
*/
static int writeJournalHdr(Pager *pPager){
//⽇誌⽂件的header
char zHeader[sizeof(aJournalMagic)+16];
int rc = seekJournalHdr(pPager);
if( rc ) return rc;
pPager->journalHdr = pPager->journalOff;
if( pPager->stmtHdrOff==0 ){
pPager->stmtHdrOff = pPager->journalHdr;
}
//將⽂件pointer指向header
pPager->journalOff += JOURNAL_HDR_SZ(pPager);
/* FIX ME:
**
** Possibly for a pager not in no-sync mode, the journal magic should not
** be written until nRec is filled in as part of next syncJournal().
**
** Actually maybe the whole journal header should be delayed until that
** point. Think about this.
*/
memcpy(zHeader, aJournalMagic, sizeof(aJournalMagic));
/* The nRec Field. 0xFFFFFFFF for no-sync journals. */
put32bits(&zHeader[sizeof(aJournalMagic)], pPager->noSync ? 0xffffffff : 0);
/* The random check-hash initialiser */
sqlite3Randomness(sizeof(pPager->cksumInit), &pPager->cksumInit);
put32bits(&zHeader[sizeof(aJournalMagic)+4], pPager->cksumInit);
/* The initial database size */
put32bits(&zHeader[sizeof(aJournalMagic)+8], pPager->dbSize);
/* The assumed sector size for this process */
put32bits(&zHeader[sizeof(aJournalMagic)+12], pPager->sectorSize);
//寫⼊Header
rc = sqlite3OsWrite(pPager->jfd, zHeader, sizeof(zHeader));
/* The journal header has been written successfully. Seek the journal
** file descriptor to the end of the journal header sector.
*/
rc = sqlite3OsSeek(pPager->jfd, pPager->journalOff-1);
rc = sqlite3OsWrite(pPager->jfd, "000", 1);
}
}
return rc;
}
79
80
81
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54

如果在User space有修改page的情況:
sqlite3BtreeSync() (btree.c)
sqlite3pager_sync() (pager.c)

syncJournal()(pager.c) pager_write_paglist() (pager.c)
下⾯三部份程式碼說明所對應的圖為上⾯有提過這幾張(因排版關係統⼀放在這裡):

/*
**Sync⽇誌⽂件,保證所有的dirty page寫⼊disk⽇誌⽂件
*/
static int syncJournal(Pager *pPager){
PgHdr *pPg;
int rc = SQLITE_OK;
/* Sync the journal before modifying the main database
** (assuming there is a journal and it needs to be synced.)
*/
if( pPager->needSync ){
if( !pPager->tempFile ){
assert( pPager->journalOpen );
/* assert( !pPager->noSync ); // noSync might be set if synchronous
** was turned off after the transaction was started. Ticket #615 */
#ifndef NDEBUG
{
/* Make sure the pPager->nRec counter we are keeping agrees
** with the nRec computed from the size of the journal file.
*/
i64 jSz;
rc = sqlite3OsFileSize(pPager->jfd, &jSz);
if( rc!=0 ) return rc;
assert( pPager->journalOff==jSz );
}
#endif
{
/* Write the nRec value into the journal file header. If in
** full-synchronous mode, sync the journal first. This ensures that
** all data has really hit the disk before nRec is updated to mark
** it as a candidate for rollback.
*/
if( pPager->fullSync ){
TRACE2("SYNC journal of %dn", PAGERID(pPager));
//⾸先保證dirty page中所有的數據都已经寫⼊⽇誌⽂件
rc = sqlite3OsSync(pPager->jfd, 0);
}
rc = sqlite3OsSeek(pPager->jfd,
pPager->journalHdr + sizeof(aJournalMagic));
if( rc ) return rc;
//page的數量寫⼊⽇誌⽂件
rc = write32bits(pPager->jfd, pPager->nRec);
if( rc ) return rc;
rc = sqlite3OsSeek(pPager->jfd, pPager->journalOff);
if( rc ) return rc;
}
TRACE2("SYNC journal of %dn", PAGERID(pPager));
rc = sqlite3OsSync(pPager->jfd, pPager->full_fsync);
pPager->journalStarted = 1;
}
pPager->needSync = 0;
/* Erase the needSync flag from every page.
*/
//清楚needSync標誌位
for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){
pPg->needSync = 0;
}
pPager->pFirstSynced = pPager->pFirst;
}
#ifndef NDEBUG
/* If the Pager.needSync flag is clear then the PgHdr.needSync
** flag must also be clear for all pages. Verify that this
** invariant is true.
*/
else{
for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){
assert( pPg->needSync==0 );
}
assert( pPager->pFirstSynced==pPager->pFirst );
}
#endif
return rc;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

//把所有的dirty page寫⼊database
//到這裡開始獲取EXCLUSIVEQ LOCK，並將page寫回去OS file
static int pager_write_pagelist(PgHdr *pList){
Pager *pPager;
int rc;
if( pList==0 ) return SQLITE_OK;
pPager = pList->pPager;
/* At this point there may be either a RESERVED or EXCLUSIVE lock on the
** database file. If there is already an EXCLUSIVE lock, the following
** calls to sqlite3OsLock() are no-ops.
**
** Moving the lock from RESERVED to EXCLUSIVE actually involves going
** through an intermediate state PENDING. A PENDING lock prevents new
** readers from attaching to the database but is unsufficient for us to
** write. The idea of a PENDING lock is to prevent new readers from
** coming in while we wait for existing readers to clear.
**
** While the pager is in the RESERVED state, the original database file
** is unchanged and we can rollback without having to playback the
** journal into the original database file. Once we transition to
** EXCLUSIVE, it means the database file has been changed and any rollback
** will require a journal playback.
*/
//加EXCLUSIVE_LOCK锁
rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);
return rc;
}
while( pList ){
assert( pList->dirty );
rc = sqlite3OsSeek(pPager->fd, (pList->pgno-1)*(i64)pPager->pageSize);
if( rc ) return rc;
/* If there are dirty pages in the page cache with page numbers greater
** than Pager.dbSize, this means sqlite3pager_truncate() was called to
** make the file smaller (presumably by auto-vacuum code). Do not write
** any such pages to the file.
*/
if( pList->pgno<=pPager->dbSize ){
char *pData = CODEC2(pPager, PGHDR_TO_DATA(pList), pList->pgno, 6);
TRACE3("STORE %d page %dn", PAGERID(pPager), pList->pgno);
//寫⼊file
rc = sqlite3OsWrite(pPager->fd, pData, pPager->pageSize);
TEST_INCR(pPager->nWrite);
}
#ifndef NDEBUG
else{
TRACE3("NOSTORE %d page %dn", PAGERID(pPager), pList->pgno);
}
#endif
if( rc ) return rc;
//設定為dirty
pList->dirty = 0;
#ifdef SQLITE_CHECK_PAGES
pList->pageHash = pager_pagehash(pList);
#endif
//指向下⼀個dirty page
pList = pList->pDirty;
}
return SQLITE_OK;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63

//同步btree對應的database file
//该函数return之後，只需要提交write transaction,删除⽇誌⽂件
int sqlite3BtreeSync(Btree *p, const char *zMaster){
int rc = SQLITE_OK;
if( p->inTrans==TRANS_WRITE ){
BtShared *pBt = p->pBt;
Pgno nTrunc = 0;
#ifndef SQLITE_OMIT_AUTOVACUUM
if( pBt->autoVacuum ){
rc = autoVacuumCommit(pBt, &nTrunc);
return rc;
}
}
#endif
//呼叫pager進⾏同步
rc = sqlite3pager_sync(pBt->pPager, zMaster, nTrunc);
}
return rc;
}
//把pager所有dirty page寫回
int sqlite3pager_sync(Pager *pPager, const char *zMaster, Pgno nTrunc){
int rc = SQLITE_OK;
TRACE4("DATABASE SYNC: File=%s zMaster=%s nTrunc=%dn",
pPager->zFilename, zMaster, nTrunc);
/* If this is an in-memory db, or no pages have been written to, or this
** function has already been called, it is a no-op.
*/
//pager不處於PAGER_SYNCED狀態,dirtyCache為1,
//則進⾏sync
if( pPager->state!=PAGER_SYNCED && !MEMDB && pPager->dirtyCache ){
PgHdr *pPg;
assert( pPager->journalOpen );
/* If a master journal file name has already been written to the
** journal file, then no sync is required. This happens when it is
** written, then the process fails to upgrade from a RESERVED to an
** EXCLUSIVE lock. The next time the process tries to commit the
** transaction the m-j name will have already been written.
*/
if( !pPager->setMaster ){
//pager修改計數器
rc = pager_incr_changecounter(pPager);
if( rc!=SQLITE_OK ) goto sync_exit;
if( nTrunc!=0 ){
/* If this transaction has made the database smaller, then all pages
** being discarded by the truncation must be written to the journal
** file.
*/
Pgno i;
void *pPage;
int iSkip = PAGER_MJ_PGNO(pPager);
for( i=nTrunc+1; i<=pPager->origDbSize; i++ ){
if( !(pPager->aInJournal[i/8] & (1<<(i&7))) && i!=iSkip ){
rc = sqlite3pager_get(pPager, i, &pPage);
rc = sqlite3pager_write(pPage);
sqlite3pager_unref(pPage);
}
}
}
#endif
rc = writeMasterJournal(pPager, zMaster);
//sync⽇誌⽂件
rc = syncJournal(pPager);
}
if( nTrunc!=0 ){
rc = sqlite3pager truncate(pPager, nTrunc);
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

c sq te3page _t u cate(p age , u c);
}
#endif
/* Write all dirty pages to the database file */
pPg = pager_get_all_dirty_pages(pPager);
//把所有dirty page寫回OS file
rc = pager_write_pagelist(pPg);
/* Sync the database file. */
//sync database file
if( !pPager->noSync ){
rc = sqlite3OsSync(pPager->fd, 0);
}
pPager->state = PAGER_SYNCED;
}else if( MEMDB && nTrunc!=0 ){
rc = sqlite3pager_truncate(pPager, nTrunc);
}
sync_exit:
return rc;
}
9
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105

HW4_0711282.pdf

Recommended

Recommended

More Related Content

Similar to HW4_0711282.pdf

Similar to HW4_0711282.pdf (20)

HW4_0711282.pdf