1. Introduction to Database Systems HW4
(Explanations of the code are provided as comments inside the code itself.)
2-A
1. Describe the general structure of SQLite code base
Reference:
1.Architecture of SQLite (https://www.sqlite.org/arch.html)
2.A Look at SQLite (https://www.a2hosting.com/blog/sqlite-benefits/)
3.SQLite As An Application File Format (https://www.sqlite.org/aff_short.html)
4.SQLite Advantages (https://www.javatpoint.com/sqlite-advantages-and-disadvantages)
5.Stack overflow (https://stackoverflow.com/questions/19946298/what-is-the-advantage-of-using-sqlite-rather-than-file)
a.
b.
1.Tokenizer: When a string containing SQL is detected, it is first sent to the Tokenizer, which splits the SQL text
into tokens and hands them to the Parser. (The Tokenizer consists of tokenize.c.)
2.Parser: SQLite uses the Lemon parser generator. Lemon is similar to YACC/BISON but uses a different input
syntax that helps prevent coding errors. When a syntax error occurs, Lemon's non-terminal destructors ensure
that no memory is leaked. (The Parser consists of parse.y.) The Parser's main job is therefore to take the tokens
it receives and, using the Lemon-generated parser for the specified context-free grammar, assign the tokens concrete meaning.
Since I have not yet taken a compilers course, I looked this up online (Yacc 與 Lex 快速⼊⾨
(http://inspiregate.com/programming/other/483-yacc-and-lex-getting-started.html)) (簡單學 Parser - lex 與 yacc
(https://mropengate.blogspot.com/2015/05/parser-lex-yacc-1.html)); the YACC mentioned above is a compiler-compiler used to generate parsers
The Architecture covered in the course
SQLite's architecture is as described in part (a) above.
SQLite does not run as a standalone process; it runs directly as part of the application,
which makes it lightweight, fast, and efficient.
SQLite's advantages are:
1. Easy to set up; its serverless design makes installation simple.
2. Lightweight; it consumes few resources for both configuration and administration.
3. Highly portable; it can be used across operating systems (OSes).
This is largely the same as the architecture used in the course.
The differences are that SQLite provides no network access and cannot manage users,
so it is less suitable for larger applications
and better suited to embedded devices, IoT, and similar use cases.
2. Describe how SQLite process a SQL statement
Reference:
1.The SQLite Query Optimizer Overview (https://www.sqlite.org/optoverview.html)
2.Architecture of SQLite (https://www.sqlite.org/arch.html)
3.Query Planning (https://www.sqlite.org/queryplanner.html)
4.Indexes On Expressions (https://www.sqlite.org/expridx.html)
5.TEXT affinity (https://www.sqlite.org/datatype3.html#affinity)
a. list the components included in the procedure, describe their roles in the procedure (a brief
explanation is enough here)
Looking only at processing a SQL statement, roughly these components are involved:

Component (file): purpose
Tokenizer (tokenize.c): splits the SQL statement into tokens
Parser (parse.y): the parser, implemented with Lemon
Code Generator:
  update.c: handles UPDATE statements
  delete.c: handles DELETE statements
  insert.c: handles INSERT statements
  trigger.c: handles TRIGGER statements
  attach.c: handles ATTACH and DETACH statements
  select.c: handles SELECT statements
  where.c: handles WHERE clauses
  vacuum.c: handles VACUUM statements
  pragma.c: handles PRAGMA statements
  expr.c: handles the expressions inside SQL statements
  auth.c: implements sqlite3_set_authorizer()
  analyze.c: implements the ANALYZE command
  alter.c: implements ALTER TABLE
  build.c: handles the commands CREATE TABLE, DROP TABLE,
    CREATE INDEX, DROP INDEX, creating ID lists, BEGIN
    TRANSACTION, COMMIT, and ROLLBACK
  func.c: implements the SQL function part
  date.c: functions related to date and time conversion
b. describe how SQLite optimize the query execution in detail
i. explain how each term (where, like, between etc.) are optimized, how indexes are used
in the optimization
ii.explain the query planner as detail as possible
The code that optimizes LIKE, BETWEEN, OR, and so on is all located in exprAnalyze(…) in whereexpr.c.
1. WHERE clause analysis
The WHERE clause is split into terms at each AND. If the WHERE clause is composed of terms joined by OR,
the whole clause is treated as a single term for the OR optimization.
Each term is then analyzed to see whether it can satisfy the conditions for using an index. A term that uses an index must have one of the required forms.
Suppose an index is created with a statement like the one shown below.
If the initial columns a, b, and so on appear in the WHERE clause, the index can be used. The initial columns must be
used with IN, IS, or =. The right-most usable column may use inequalities, and it may have at most two of them,
which together must bound the allowed values between two extremes. Note also that the index columns cannot be
used with a gap: if, say, column c in the figure above cannot use the index, then only a and b can use it, and the remaining index columns cannot be used.
2. The BETWEEN optimization
If a term contains BETWEEN, the BETWEEN is converted into two inequalities.
For example, the expression in the figure below
can be converted into the form shown after it.
The two inequalities are used for analysis only and do not generate VDBE bytecode. If the BETWEEN term
itself has already been coded, the converted terms are ignored; if it has not been coded and the converted terms can use an index,
SQLite skips the original BETWEEN form.
The function that does this can be found in whereexpr.c:
#ifndef SQLITE_OMIT_BETWEEN_OPTIMIZATION
  /* If a term is the BETWEEN operator, create two new virtual terms
  ** that define the range that the BETWEEN implements. For example:
  **
  **      a BETWEEN b AND c
  **
  ** is converted into:
  **
  **      (a BETWEEN b AND c) AND (a>=b) AND (a<=c)
  **
  ** The two new terms are added onto the end of the WhereClause object.
  ** The new terms are "dynamic" and are children of the original BETWEEN
  ** term. That means that if the BETWEEN term is coded, the children are
  ** skipped. Or, if the children are satisfied by an index, the original
  ** BETWEEN term is skipped.
  */
  else if( pExpr->op==TK_BETWEEN && pWC->op==TK_AND ){
    // check whether this term is a BETWEEN-style clause
    ExprList *pList = pExpr->x.pList;
    int i;
    static const u8 ops[] = {TK_GE, TK_LE};
    assert( pList!=0 );
    assert( pList->nExpr==2 );
    for(i=0; i<2; i++){
      Expr *pNewExpr; // holds the converted expression
      int idxNew;     // index of the newly inserted expression in the WHERE clause
      pNewExpr = sqlite3PExpr(pParse, ops[i], // create the >= and <= expressions
                       sqlite3ExprDup(db, pExpr->pLeft, 0),
                       sqlite3ExprDup(db, pList->a[i].pExpr, 0));
      transferJoinMarkings(pNewExpr, pExpr);
      idxNew = whereClauseInsert(pWC, pNewExpr, TERM_VIRTUAL|TERM_DYNAMIC);
      testcase( idxNew==0 ); // check that the conversion succeeded
      exprAnalyze(pSrc, pWC, idxNew);
      pTerm = &pWC->a[idxTerm];
      // mark the new expression as a child (variant) of the BETWEEN clause
      markTermAsChild(pWC, idxNew, idxTerm);
    }
  }
#endif /* SQLITE_OMIT_BETWEEN_OPTIMIZATION */
3. OR optimizations
As shown above, when the split terms are connected by OR, there are two cases to consider:
1. If all the clauses connected by OR refer to the same column of the same table, the OR clause can be replaced with IN.
2. If the clauses connected by OR do not all refer to the same column, but each clause's operator is one of "=", "<", "<=", ">", ">=",
"IS NULL", or "IN", and each clause's column is a column of some index, then the OR can be rewritten
in a UNION form, and each clause can be optimized using its corresponding index.
If both cases apply, SQLite defaults to the first approach.
#if !defined(SQLITE_OMIT_OR_OPTIMIZATION) && !defined(SQLITE_OMIT_SUBQUERY)
  /* Analyze a term that is composed of two or more subterms connected by
  ** an OR operator.
  */
  else if( pExpr->op==TK_OR ){              // the expression is built from ORs
    assert( pWC->op==TK_AND );              // the WHERE clause is split on AND
    exprAnalyzeOrTerm(pSrc, pWC, idxTerm);  // optimize the expression
    pTerm = &pWC->a[idxTerm];
  }
#endif /* SQLITE_OMIT_OR_OPTIMIZATION */
4. The LIKE optimization
The LIKE optimization requires the following conditions:
1. The left side of LIKE must be an indexed column with TEXT affinity.
2. The right side of LIKE must be a string, or a parameter bound to a string, that does not begin with a wildcard character.
3. The LIKE must not have an ESCAPE clause.
4. If case_sensitive_like mode is enabled (case-sensitive), the column must use the BINARY collating
sequence. If the conditions above are met, inequalities can be added to narrow the range scanned by LIKE. For
example, x LIKE 'abc%' becomes x >= 'abc' AND x < 'abd' AND x LIKE 'abc%'; x can then
use the index, and the range scanned by the LIKE is reduced.
How the optimization is implemented
A single SQL statement can often be carried out in several ways, and the time taken depends on which plan SQLite ultimately chooses.
The purpose of where.c is therefore mainly to generate the bytecode for the WHERE clause and perform the optimization.
sqlite3WhereBegin() is the core of the whole query-optimization process; it performs the WHERE optimization and generates
the opcodes to be handed to the virtual machine.
#ifndef SQLITE_OMIT_LIKE_OPTIMIZATION
/* Add constraints to reduce the search space on a LIKE or GLOB
** operator.
**
** A like pattern of the form "x LIKE 'aBc%'" is changed into constraints
**
** x>='ABC' AND x<'abd' AND x LIKE 'aBc%'
**
** The last character of the prefix "abc" is incremented to form the
** termination condition "abd". If case is not significant (the default
** for LIKE) then the lower-bound is made all uppercase and the upper-
** bound is made all lowercase so that the bounds also work when comparing
** BLOBs.
*/
if( pWC->op==TK_AND
&& isLikeOrGlob(pParse, pExpr, &pStr1, &isComplete, &noCase)
){
Expr *pLeft; /* LHS of LIKE/GLOB operator */
Expr *pStr2; /* Copy of pStr1 - RHS of LIKE/GLOB operator */
Expr *pNewExpr1;
Expr *pNewExpr2;
int idxNew1;
int idxNew2;
const char *zCollSeqName; /* Name of collating sequence */
const u16 wtFlags = TERM_LIKEOPT | TERM_VIRTUAL | TERM_DYNAMIC;
pLeft = pExpr->x.pList->a[1].pExpr;
pStr2 = sqlite3ExprDup(db, pStr1, 0);
/* Convert the lower bound to upper-case and the upper bound to
** lower-case (upper-case is less than lower-case in ASCII) so that
** the range constraints also work for BLOBs
*/
if( noCase && !pParse->db->mallocFailed ){
int i;
char c;
pTerm->wtFlags |= TERM_LIKE;
for(i=0; (c = pStr1->u.zToken[i])!=0; i++){
pStr1->u.zToken[i] = sqlite3Toupper(c);
pStr2->u.zToken[i] = sqlite3Tolower(c);
}
}
if( !db->mallocFailed ){
u8 c, *pC; /* Last character before the first wildcard */
pC = (u8*)&pStr2->u.zToken[sqlite3Strlen30(pStr2->u.zToken)-1];
c = *pC;
if( noCase ){
/* The point is to increment the last character before the first
** wildcard. But if we increment '@', that will push it into the
** alphabetic range where case conversions will mess up the
** inequality. To avoid this, make sure to also run the full
** LIKE on all candidate expressions by clearing the isComplete flag
*/
if( c=='A'-1 ) isComplete = 0;
c = sqlite3UpperToLower[c];
}
*pC = c + 1;
}
zCollSeqName = noCase ? "NOCASE" : sqlite3StrBINARY;
pNewExpr1 = sqlite3ExprDup(db, pLeft, 0);
pNewExpr1 = sqlite3PExpr(pParse, TK_GE,
sqlite3ExprAddCollateString(pParse,pNewExpr1,zCollSeqName),
pStr1);
// create the >= expression
transferJoinMarkings(pNewExpr1, pExpr);
idxNew1 = whereClauseInsert(pWC, pNewExpr1, wtFlags);
// insert the >= expression into the WHERE clause
testcase( idxNew1==0 );
// verify the insertion succeeded
exprAnalyze(pSrc, pWC, idxNew1);
// recursively analyze the newly inserted expression,
// like any other term of the WHERE clause
pNewExpr2 = sqlite3ExprDup(db, pLeft, 0);
pNewExpr2 = sqlite3PExpr(pParse, TK_LT,
sqlite3ExprAddCollateString(pParse,pNewExpr2,zCollSeqName),
pStr2);
// create the < expression
transferJoinMarkings(pNewExpr2, pExpr);
idxNew2 = whereClauseInsert(pWC, pNewExpr2, wtFlags);
// insert the < expression into the WHERE clause
testcase( idxNew2==0 );
// verify the insertion succeeded
exprAnalyze(pSrc, pWC, idxNew2);
// recursively analyze the newly inserted expression,
// like any other term of the WHERE clause
pTerm = &pWC->a[idxTerm];
if( isComplete ){
markTermAsChild(pWC, idxNew1, idxTerm);
markTermAsChild(pWC, idxNew2, idxTerm);
}
}
#endif /* SQLITE_OMIT_LIKE_OPTIMIZATION */
struct Select {
u8 op;
/* One of: TK_UNION TK_ALL TK_INTERSECT TK_EXCEPT */
LogEst nSelectRow; /* Estimated number of result rows */
u32 selFlags; /* Various SF_* values */
int iLimit, iOffset;
/* Memory registers holding LIMIT & OFFSET counters */
u32 selId; /* Unique identifier number for this SELECT */
int addrOpenEphm[2]; /* OP_OpenEphem opcodes related to this select */
ExprList *pEList; /* The fields of the result */
SrcList *pSrc; /* Parse tree of the FROM clause */
Expr *pWhere; /* Parse tree of the WHERE clause */
ExprList *pGroupBy; /* Parse tree of the GROUP BY clause */
Expr *pHaving; /* Parse tree of the HAVING clause */
ExprList *pOrderBy; /* Parse tree of the ORDER BY clause */
Select *pPrior; /* Prior select in a compound select statement */
Select *pNext; /* Next select to the left in a compound */
Expr *pLimit; /* LIMIT expression. NULL if not used. */
With *pWith; /* WITH clause attached to this select. Or NULL. */
#ifndef SQLITE_OMIT_WINDOWFUNC
Window *pWin; /* List of window functions */
Window *pWinDefn; /* List of named window definitions */
#endif
};
Because the WHERE code uses the Select struct,
its definition can be found in sqliteInt.h.
The parameters of sqlite3WhereBegin():
WhereInfo *sqlite3WhereBegin(
Parse *pParse, /* The parser context */
SrcList *pTabList, /* FROM clause: A list of all tables to be scanned */
Expr *pWhere, /* The WHERE clause */
ExprList *pOrderBy, /* An ORDER BY (or GROUP BY) clause, or NULL */
ExprList *pResultSet, /* Query result set. Req'd for DISTINCT */
u16 wctrlFlags, /* The WHERE_* flags defined in sqliteInt.h */
int iAuxArg /* If WHERE_OR_SUBCLAUSE is set, index cursor number
** If WHERE_USE_LIMIT, then the limit amount */
// pTabList is the parse tree the parser generates for the FROM clause;
// it holds the information about the FROM tables.
// pWhere is the parse tree of the WHERE clause and
// contains the expressions inside WHERE.
// pOrderBy is the parse tree corresponding to ORDER BY.
)
WhereInfo *pWInfo;
// declared inside sqlite3WhereBegin;
// pWInfo becomes the return value of sqlite3WhereBegin
The first 4 bytes hold the page number of the next freelist trunk page; 0 means this is the last trunk.
The next 4 bytes give the number of leaf-page pointers on this trunk. If that integer is greater than 0, call it L;
then each of the integers at indexes 3 through L+2 is the page number of a freelist leaf page. For example, the
third integer is the page number of the first freelist leaf page, and so on.
B-Tree Page:
SQLite uses two kinds of b-trees. One stores all data in the leaves; SQLite calls it a table b-tree. The other
keeps only keys, no data, in both leaf and interior pages; SQLite calls it an index b-tree.
A b-tree page is either a table b-tree page or an index b-tree page, and all pages within one b-tree are of the same
kind.
Each entry in a table b-tree consists of a 64-bit signed integer key and arbitrary data of up to 2147483647 bytes.
Interior table b-tree pages contain only keys and pointers to their child pages; all data is stored in the table b-tree
leaf pages.
Cell Payload Overflow Pages:
When a b-tree cell is too large to fit, the excess is placed on overflow pages, which form a
linked list. The structure of a cell payload overflow page is shown below:
the first 4 bytes are the page number of the next page; 0 means it is the last one.
The overflowed content begins at the 5th byte.
Pointer Map or Ptrmap Pages (Ptrmap):
Ptrmap pages are extra pages in the database whose purpose is to make the auto_vacuum and incremental_vacuum modes more efficient.
Other pages normally carry pointers from parent to child; for example, the first 4 bytes of the cell payload overflow
pages mentioned above are the page number of the next page. The ptrmap provides the opposite direction: from a child page
one can quickly reach its parent, so when releasing a page the designated parent can be found and updated quickly through the ptrmap.
The structure is shown below:
Each 5-byte entry in the ptrmap consists of a 1-byte page type followed by a 4-byte page number.
There are five types:
1. A b-tree root page: the following 4-byte page number should be 0.
2. A freelist page: the following 4-byte page number should be 0.
3. The first page of a cell payload overflow chain: the following 4 bytes give the page number of the b-tree page
whose cell spills into this overflow chain.
4. A page in an overflow chain other than the first page: for the other pages in a cell payload overflow page list,
the following 4 bytes give the page number of the previous page in the chain.
5. A non-root b-tree page: the following 4 bytes give the page number of its parent page.
Schema Layer:
From here on we describe SQLite's low-level file format.
Record Format: the data in a table b-tree, or the key in an index b-tree, is converted into record format. A record,
built from variable-length integers (each encoding up to a 64-bit signed value), defines the data of the columns of a
table or index, describing the number of columns, their data types, and their contents.
Serial Type Codes Of The Record Format:
A record begins with a varint giving the length of the whole header, followed by a sequence of varints (called serial
types) describing each column's data type and length; somewhat unusually, the serial type describing a BLOB can grow to 2-3 bytes.
After the header come the values of the individual columns; for columns whose serial type is 0, 8, 9, 12, or 13,
the column body has length 0, as the figure below shows.
Record sort order:
Comparison proceeds from left to right through the record, as follows:
1. NULL values (serial type 0) sort first.
2. Numeric values (serial types 1 through 9) sort after NULLs, ordered by numeric value.
3. Text values are ordered by the column's collating function.
4. BLOB values are ordered by memcmp().
Representation Of SQL Tables
Every rowid SQL table in the database schema is represented on disk by a table b-tree. Each entry in the
table b-tree corresponds to one row of the SQL table, and the 64-bit signed integer key of the entry corresponds to the rowid.
The columns are first assembled into a byte array in record format, with the values ordered the same way the columns
are ordered in the table, and that array is stored as the payload of a table b-tree entry. If a SQL table contains an
INTEGER PRIMARY KEY column (such a column is the rowid, replacing the otherwise implicit rowid), the value of that
column in the record is NULL; when a table has an INTEGER PRIMARY KEY column, SQLite uses the table
b-tree key in its place.
If a column's affinity (suggested type) is REAL and it holds a value that can be converted to an integer (no fractional
part, and not so large as to overflow the range an integer can represent), that value may be stored in the record as an integer.
When it is extracted from the record, SQLite automatically converts it back to a floating-point number.
Representation of WITHOUT ROWID Tables
If a SQL table is created with "WITHOUT ROWID", it is a WITHOUT ROWID
table, and its on-disk storage format differs from an ordinary SQL table. A WITHOUT ROWID table is stored in an
index b-tree rather than a table b-tree. The key of each entry in that index b-tree is a record that begins with the
PRIMARY KEY columns (a WITHOUT ROWID table must have a primary key, which takes over the role of the rowid),
with the other columns following.
So the content encoding of a WITHOUT ROWID table is the same as that of an ordinary rowid table; the differences are:
the PRIMARY KEY columns are moved to the very front, and
the entire record (comparisons look at the PRIMARY KEY, whose values are unique and non-NULL) is used as the key
of the index b-tree, rather than as the data of a table b-tree.
The special rule ordinary rowid tables use for storing REAL-affinity values applies to WITHOUT ROWID tables as well.
Representation Of SQL Indices
Every SQL index, whether created explicitly with a CREATE INDEX statement or implicitly by UNIQUE or
PRIMARY KEY, corresponds to an index b-tree in the database file. Each entry in the index b-tree
corresponds to one row of the associated SQL table. The key of the index b-tree is a record made up of the indexed
column(s) of the corresponding table (one or more of them) followed by the key of that table. For an ordinary rowid table,
that key is the rowid; for a WITHOUT ROWID table, that key is the PRIMARY KEY. In either case the key is unique within
the table (the "key" here does not mean the b-tree key).
In an ordinary index, rows of the original table correspond one-to-one with entries of the index. But in a partial
index (an index created with a WHERE clause, indexing only part of the table's rows), the index b-tree contains entries
only for the rows for which the WHERE clause is true. Corresponding rows in the index and table b-trees use the same
rowid or primary key, and every indexed column has the same value as in the original table.
b. describe how SQLite control file read, write, close etc. (a brief explanation is enough here)
Reference:
1. 官⽅⽂件:SQLite File IO Specification (https://www.sqlite.org/fileio.html)
2. 官⽅WAL說明⽂件(https://www.sqlite.org/wal.html)
3. 官⽅⽂件:File Locking And Concurrency In SQLite Version 3 (https://www.sqlite.org/lockingv3.html)
4. WAL-mode File Format (https://www.sqlite.org/walformat.html)
Opening and closing the pager use sqlite3PagerOpen and sqlite3PagerClose respectively,
and sqlite3PagerReadFileheader is used to read the header.
If writing is needed, the function sqlite3PagerWrite is used.
Excerpts of the relevant code follow.
int sqlite3PagerOpen(
sqlite3_vfs *pVfs, /* The virtual file system to use */
Pager **ppPager, /* OUT: Return the Pager structure here */
const char *zFilename, /* Name of the database file to open */
int nExtra, /* Extra bytes append to each in-memory page */
int flags, /* flags controlling this file */
int vfsFlags, /* flags passed through to sqlite3_vfs.xOpen() */
void (*xReinit)(DbPage*) /* Function to reinitialize pages */
){
u8 *pPtr;
Pager *pPager = 0; /* Pager object to allocate and return */
int rc = SQLITE_OK; /* Return code */
int tempFile = 0; /* True for temp files (incl. in-memory files) */
int memDb = 0; /* True if this is an in-memory file */
int readOnly = 0; /* True if this is a read-only file */
int journalFileSize; /* Bytes to allocate for each journal fd */
char *zPathname = 0; /* Full path to database file */
int nPathname = 0; /* Number of bytes in zPathname */
int useJournal = (flags & PAGER_OMIT_JOURNAL)==0; /* False to omit journal */
int pcacheSize = sqlite3PcacheSize(); /* Bytes to allocate for PCache */
u32 szPageDflt = SQLITE_DEFAULT_PAGE_SIZE; /* Default page size */
const char *zUri = 0; /* URI args to copy */
int nUriByte = 1; /* Number of bytes of URI args at *zUri */
int nUri = 0; /* Number of URI parameters */
/* Figure out how much space is required for each journal file-handle
** (there are two of them, the main journal and the sub-journal). */
journalFileSize = ROUND8(sqlite3JournalSize(pVfs));
/* Set the output variable to NULL in case an error occurs. */
*ppPager = 0;
/* Compute and store the full pathname in an allocated buffer pointed
** to by zPathname, length nPathname. Or, if this is a temporary file,
** leave both nPathname and zPathname set to 0.
*/
if( zFilename && zFilename[0] ){
const char *z;
nPathname = pVfs->mxPathname+1;
zPathname = sqlite3DbMallocRaw(0, nPathname*2);
if( zPathname==0 ){
return SQLITE_NOMEM_BKPT;
}
zPathname[0] = 0; /* Make sure initialized even if FullPathname() fails */
rc = sqlite3OsFullPathname(pVfs, zFilename, nPathname, zPathname);
if( rc!=SQLITE_OK ){
if( rc==SQLITE_OK_SYMLINK ){
if( vfsFlags & SQLITE_OPEN_NOFOLLOW ){
rc = SQLITE_CANTOPEN_SYMLINK;
}else{
rc = SQLITE_OK;
}
}
}
nPathname = sqlite3Strlen30(zPathname);
z = zUri = &zFilename[sqlite3Strlen30(zFilename)+1];
while( *z ){
z += strlen(z)+1;
z += strlen(z)+1;
nUri++;
}
nUriByte = (int)(&z[1] - zUri);
assert( nUriByte>=1 );
if( rc==SQLITE_OK && nPathname+8>pVfs->mxPathname ){
/* This branch is taken when the journal path required by
** the database being opened will be more than pVfs->mxPathname
** bytes in length. This means the database cannot be opened,
** as it will not be possible to open the journal file or even
** check for a hot-journal before reading.
*/
rc = SQLITE_CANTOPEN_BKPT;
}
if( rc!=SQLITE_OK ){
sqlite3DbFree(0, zPathname);
return rc;
}
}
/* Allocate memory for the Pager structure, PCache object, the
** three file descriptors, the database file name and the journal
** file name. The layout in memory is as follows:
**
** Pager object (sizeof(Pager) bytes)
** PCache object (sqlite3PcacheSize() bytes)
** Database file handle (pVfs->szOsFile bytes)
** Sub-journal file handle (journalFileSize bytes)
** Main journal file handle (journalFileSize bytes)
** Ptr back to the Pager (sizeof(Pager*) bytes)
** 0000 database prefix (4 bytes)
** Database file name (nPathname+1 bytes)
** URI query parameters (nUriByte bytes)
** Journal filename (nPathname+8+1 bytes)
** WAL filename (nPathname+4+1 bytes)
** 000 terminator (3 bytes)
**
** Some 3rd-party software, over which we have no control, depends on
** the specific order of the filenames and the 0 separators between them
** so that it can (for example) find the database filename given the WAL
** filename without using the sqlite3_filename_database() API. This is a
** misuse of SQLite and a bug in the 3rd-party software, but the 3rd-party
** software is in widespread use, so we try to avoid changing the filename
** order and formatting if possible. In particular, the details of the
** filename format expected by 3rd-party software should be as follows:
**
** - Main Database Path
** - 0
** - Multiple URI components consisting of:
** - Key
** - 0
** - Value
** - 0
** - 0
** - Journal Path
** - 0
** - WAL Path (zWALName)
** - 0
**
** The sqlite3_create_filename() interface and the databaseFilename() utility
** that is used by sqlite3_filename_database() and kin also depend on the
** specific formatting and order of the various filenames, so if the format
** changes here, be sure to change it there as well.
*/
pPtr = (u8 *)sqlite3MallocZero(
ROUND8(sizeof(*pPager)) + /* Pager structure */
ROUND8(pcacheSize) + /* PCache object */
ROUND8(pVfs->szOsFile) + /* The main db file */
journalFileSize * 2 + /* The two journal files */
sizeof(pPager) + /* Space to hold a pointer */
4 + /* Database prefix */
nPathname + 1 + /* database filename */
nUriByte + /* query parameters */
nPathname + 8 + 1 + /* Journal filename */
3 /* Terminator */
);
assert( EIGHT_BYTE_ALIGNMENT(SQLITE_INT_TO_PTR(journalFileSize)) );
if( !pPtr ){
sqlite3DbFree(0, zPathname);
return SQLITE_NOMEM_BKPT;
}
pPager = (Pager*)pPtr; pPtr += ROUND8(sizeof(*pPager));
pPager->pPCache = (PCache*)pPtr; pPtr += ROUND8(pcacheSize);
pPager->fd = (sqlite3_file*)pPtr; pPtr += ROUND8(pVfs->szOsFile);
pPager->sjfd = (sqlite3_file*)pPtr; pPtr += journalFileSize;
pPager->jfd = (sqlite3_file*)pPtr; pPtr += journalFileSize;
assert( EIGHT_BYTE_ALIGNMENT(pPager->jfd) );
memcpy(pPtr, &pPager, sizeof(pPager)); pPtr += sizeof(pPager);
/* Fill in the Pager.zFilename and pPager.zQueryParam fields */
pPtr += 4; /* Skip zero prefix */
pPager->zFilename = (char*)pPtr;
if( nPathname>0 ){
memcpy(pPtr, zPathname, nPathname); pPtr += nPathname + 1;
if( zUri ){
memcpy(pPtr, zUri, nUriByte); pPtr += nUriByte;
}else{
pPtr++;
}
}
/* Fill in Pager.zJournal */
if( nPathname>0 ){
pPager->zJournal = (char*)pPtr;
memcpy(pPtr, zPathname, nPathname); pPtr += nPathname;
memcpy(pPtr, "-journal",8); pPtr += 8 + 1;
}else{
pPager->zJournal = 0;
}
if( nPathname ) sqlite3DbFree(0, zPathname);
pPager->pVfs = pVfs;
pPager->vfsFlags = vfsFlags;
/* Open the pager file.
*/
if( zFilename && zFilename[0] ){
int fout = 0; /* VFS flags returned by xOpen() */
rc = sqlite3OsOpen(pVfs, pPager->zFilename, pPager->fd, vfsFlags, &fout);
assert( !memDb );
readOnly = (fout&SQLITE_OPEN_READONLY)!=0;
/* If the file was successfully opened for read/write access,
** choose a default page size in case we have to create the
** database file. The default page size is the maximum of:
**
** + SQLITE_DEFAULT_PAGE_SIZE,
** + The value returned by sqlite3OsSectorSize()
** + The largest page size that can be written atomically.
*/
if( rc==SQLITE_OK ){
int iDc = sqlite3OsDeviceCharacteristics(pPager->fd);
if( !readOnly ){
setSectorSize(pPager);
assert(SQLITE_DEFAULT_PAGE_SIZE<=SQLITE_MAX_DEFAULT_PAGE_SIZE);
if( szPageDflt<pPager->sectorSize ){
if( pPager->sectorSize>SQLITE_MAX_DEFAULT_PAGE_SIZE ){
szPageDflt = SQLITE_MAX_DEFAULT_PAGE_SIZE;
}else{
szPageDflt = (u32)pPager->sectorSize;
}
}
}
pPager->noLock = sqlite3_uri_boolean(pPager->zFilename, "nolock", 0);
if( (iDc & SQLITE_IOCAP_IMMUTABLE)!=0
|| sqlite3_uri_boolean(pPager->zFilename, "immutable", 0) ){
vfsFlags |= SQLITE_OPEN_READONLY;
goto act_like_temp_file;
}
}
}else{
/* If a temporary file is requested, it is not opened immediately.
** In this case we accept the default page size and delay actually
** opening the file until the first call to OsWrite().
**
** This branch is also run for an in-memory database. An in-memory
** database is the same as a temp-file that is never written out to
** disk and uses an in-memory rollback journal.
**
** This branch also runs for files marked as immutable.
*/
act_like_temp_file:
tempFile = 1;
pPager->eState = PAGER_READER; /* Pretend we already have a lock */
pPager->eLock = EXCLUSIVE_LOCK; /* Pretend we are in EXCLUSIVE mode */
pPager->noLock = 1; /* Do no locking */
readOnly = (vfsFlags&SQLITE_OPEN_READONLY);
}
/* The following call to PagerSetPagesize() serves to set the value of
** Pager.pageSize and to allocate the Pager.pTmpSpace buffer.
*/
if( rc==SQLITE_OK ){
assert( pPager->memDb==0 );
rc = sqlite3PagerSetPagesize(pPager, &szPageDflt, -1);
testcase( rc!=SQLITE_OK );
}
/* Initialize the PCache object. */
int sqlite3PagerReadFileheader(Pager *pPager, int N, unsigned char *pDest){
int rc = SQLITE_OK;
memset(pDest, 0, N);
assert( isOpen(pPager->fd) || pPager->tempFile );
/* This routine is only called by btree immediately after creating
** the Pager object. There has not been an opportunity to transition
** to WAL mode yet.
*/
assert( !pagerUseWal(pPager) );
if( isOpen(pPager->fd) ){
IOTRACE(("DBHDR %p 0 %d\n", pPager, N))
rc = sqlite3OsRead(pPager->fd, pDest, N, 0);
if( rc==SQLITE_IOERR_SHORT_READ ){
rc = SQLITE_OK;
}
}
return rc;
}
int sqlite3PagerClose(Pager *pPager, sqlite3 *db){
u8 *pTmp = (u8*)pPager->pTmpSpace;
assert( db || pagerUseWal(pPager)==0 );
assert( assert_pager_state(pPager) );
disable_simulated_io_errors();
sqlite3BeginBenignMalloc();
pagerFreeMapHdrs(pPager);
/* pPager->errCode = 0; */
pPager->exclusiveMode = 0;
pager_reset(pPager);
if( MEMDB ){
pager_unlock(pPager);
}else{
/* If it is open, sync the journal file before calling UnlockAndRollback.
** If this is not done, then an unsynced portion of the open journal
** file may be played back into the database. If a power failure occurs
** while this is happening, the database could become corrupt.
**
** If an error occurs while trying to sync the journal, shift the pager
** into the ERROR state. This causes UnlockAndRollback to unlock the
** database and close the journal file without attempting to roll it
** back or finalize it. The next database user will have to do hot-journal
** rollback before accessing the database file.
*/
if( isOpen(pPager->jfd) ){
pager_error(pPager, pagerSyncHotJournal(pPager));
}
pagerUnlockAndRollback(pPager);
}
sqlite3EndBenignMalloc();
enable_simulated_io_errors();
PAGERTRACE(("CLOSE %d\n", PAGERID(pPager)));
IOTRACE(("CLOSE %p\n", pPager))
sqlite3OsClose(pPager->jfd);
sqlite3OsClose(pPager->fd);
sqlite3PageFree(pTmp);
sqlite3PcacheClose(pPager->pPCache);
assert( !pPager->aSavepoint && !pPager->pInJournal );
assert( !isOpen(pPager->jfd) && !isOpen(pPager->sjfd) );
sqlite3_free(pPager);
return SQLITE_OK;
}
// Mark a data page as writeable.
// This function must be called before any actual change is made to the data.
// The caller must not operate on the data unless the return value is SQLITE_OK.
int sqlite3PagerWrite(PgHdr *pPg){
  Pager *pPager = pPg->pPager;
  assert( (pPg->flags & PGHDR_MMAP)==0 );
  assert( pPager->eState>=PAGER_WRITER_LOCKED );
  assert( assert_pager_state(pPager) );
  if( (pPg->flags & PGHDR_WRITEABLE)!=0 && pPager->dbSize>=pPg->pgno ){
    if( pPager->nSavepoint ) return subjournalPageIfRequired(pPg);
    return SQLITE_OK;
  }else if( pPager->errCode ){
    return pPager->errCode;
  }else if( pPager->sectorSize > (u32)pPager->pageSize ){
    assert( pPager->tempFile==0 );
    // unlike pager_write, this path handles the special case where
    // two or more pages fit on a single disk sector
    return pagerWriteLargeSector(pPg);
  }else{
    return pager_write(pPg);
  }
  // On error, returns SQLITE_NOMEM or an IO error code;
  // otherwise returns SQLITE_OK
}
4. Describe the concurrency control of SQLite
a. describe how SQLite handle concurrency control (file locking, journal files etc.) in detail
Reference:
1. 官⽅說明⽂件Write-Ahead Logging (https://www.sqlite.org/wal.html)
2. 官⽅說明⽂件Atomic Commit In SQLite (https://www.sqlite.org/atomiccommit.html)
3. CSDN (https://blog.csdn.net/tianxuhong/article/details/78752357)
4. SQLite WAL 模式簡單介紹(https://xiaozhuanlan.com/topic/1754328960)
5. CSDN (https://blog.csdn.net/tianyeming/article/details/85763621)
6. WAL-mode File Format (https://www.sqlite.org/walformat.html)
7. sqlite:WAL模式(https://www.jianshu.com/p/c78cf4caceab)
8. 官⽅⽂件:File Locking And Concurrency In SQLite Version 3 (https://www.sqlite.org/lockingv3.html)
9. Page Cache之並發控制(https://www.cnblogs.com/hustcat/archive/2009/03/01/1400757.html)
10. Blog (https://my.oschina.net/u/587236/blog/129022)
11. 深⼊理解SQLite (https://www.kancloud.cn/kangdandan/sqlite/64358)
12. Isolation In SQLite (https://sqlite.org/isolation.html)
SQLite's lock mechanism and lock states:

Locking state: description
unlocked: as the name suggests, the process holds no lock at all
(note that this is SQLite's default state).
shared: multiple processes may read the database at the same time, but none may write
(so multiple shared locks are allowed to coexist).
reserved: indicates that a process intends to write at some later point. Only one reserved
lock may exist at a time, but it can coexist with shared locks.
pending: the process wants to obtain an exclusive lock as soon as possible, but other shared locks have not yet finished.
As long as a pending lock exists, SQLite allows no new shared locks.
exclusive: a process must obtain an exclusive lock before writing to the database. While an exclusive
lock exists (and at most one can exist), no other lock may coexist with it,
so SQLite tries to keep the time an exclusive lock is held as short as possible.
After the original SQL has been parsed, the database must be operated on. SQLite locates the needed pages through the B-tree.
The B-tree's job is to maintain the relationships among the pages; it does not read or write the disk directly. The Pager is
responsible for fetching the pages that are needed or that must be modified, so the Pager can be described as the middleman between the B-tree and the disk.
The official document Atomic Commit In SQLite (https://www.sqlite.org/atomiccommit.html) explains that atomic commit means
the changes of a transaction either complete or do not happen at all; a write interrupted halfway by a crash or power failure
will not corrupt the database.
Initial state
When a database has just been opened, the rightmost "Disk" in the figure represents the content stored on disk, the middle represents the OS disk buffer cache,
and the left represents the memory of the user process that is using SQLite. Since no data has been read yet, it is empty.
1. Acquiring A Read Lock
Before SQLite writes data, it first needs a read lock to see whether the data already exists in the database. The first step in reading
from the database file is to obtain a shared lock; as mentioned above, a shared lock allows multiple processes to read the same
file simultaneously, but allows no writing at all.
Note that the shared lock applies to the OS buffer, not the disk itself.
2. Reading Information Out Of The Database
Once the shared lock is obtained, information can be read from the file. Under the earlier assumption that user space is
empty, the information must first go from disk into the OS buffer, after which some or all of the needed information can be fetched from the OS buffer.
Generally only some of the database's pages are read, because the amount of data in a database is usually not small; in the figure above,
only 3 of the 8 pages are read.
3. Obtaining A Reserved Lock
Before modifying the database, a RESERVED lock must be obtained first. As mentioned above, a database file can only
have one RESERVED lock; its purpose is to announce that the process is about to modify the database (it has not started yet),
while other processes may keep reading at the same time.
4. Creating A Rollback Journal File
Before modifying the database, SQLite creates a separate rollback journal file and writes the pre-modification pages into it,
so the rollback journal file holds all the information needed to restore the database.
At the top of the rollback journal file is a header (marked in green in the figure) recording the original size of the database file, so even if
the database grows after being modified, we still know what its original size was.
When a new file is created, most operating systems do not write it to disk immediately; there is some delay, which is why the file
region of the disk in the figure is still blank.
5. Changing Database Pages In User Space
Once the pre-modification state has been saved, we can safely go ahead and modify the database.
6. Flushing The Rollback Journal File To Mass Storage
Next the rollback journal file is written back to disk. Because this writes to disk, it is a fairly time-consuming operation
(memory closer to the CPU is smaller and faster; farther away it is larger and slower).
7.Obtaining An Exclusive Lock
在修改database前,還需要取得⼀個Pending Lock,如同上述所提到的,Pending Lock允許其他
Process繼續讀取,但不允許繼續⽣成SHARED Lock。
Pending Lock存在的⽬的是:試想今天假如有多個Process都要讀取同⼀個file,不斷的有⼈申請要
SHARED Lock,有⼈完成讀取後釋放SHARED Lock,如果沒有PENDING LOCK,最後要寫⼊的
Process就等不到EXCLUSIVE LOCk,但有了PENDING Lock後就可以阻⽌不斷⽣成SHARED Lock
8.Writing Changes To The Database File
33. Process拿到EXL+CLUSIVE LOCK後代表此時不會並存其他LOCK,可以放⼼的更新或是寫⼊資料,
修改完後也會更新OS Buffers的內容
9.Flushing Changes To Mass Storage
將做了修改的內容再存回Disk
10.Deleting The Rollback Journal
當數據已經安全寫⼊到Disk後rollback journal file就沒有必要存在了,因此可以刪除,如果刪除之後
發⽣系統崩潰或是停電等情況,因為所有變化已經寫⼊Disk,並不影響,所以SQLite判斷Database
file是否完成變更要由rollback journal file是否存在來判斷
11.Releasing The Lock
最後⼀個步驟是釋放調EXCLUSIVE LOCK,這樣其他Process就⼜可以繼續訪問database file
The figure shows user space being cleared when the lock is released, but newer versions of SQLite do not clear it, in case the next operation uses the same data; clearing it would only hurt performance. Before reusing that cached data, the process must first obtain a SHARED lock and then check the change counter in the first page of the database file (mentioned in part a). The counter is incremented on every modification; if it shows the file has been modified, the user-space cache must be discarded and the data read in again.
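The life cycle of the rollback journal in steps 4-10 can be observed from outside SQLite. The sketch below uses Python's stdlib `sqlite3` module (a binding over the same library); the database path and table name are made up for the demo, while the `-journal` suffix and `journal_mode=DELETE` behaviour are SQLite's own:

```python
import os, sqlite3, tempfile

dbpath = os.path.join(tempfile.mkdtemp(), "demo.db")
journal = dbpath + "-journal"

conn = sqlite3.connect(dbpath, isolation_level=None)  # manage transactions manually
conn.execute("PRAGMA journal_mode=DELETE")            # classic rollback-journal mode
conn.execute("CREATE TABLE t(x)")

conn.execute("BEGIN")
conn.execute("INSERT INTO t VALUES (1)")
during = os.path.exists(journal)   # journal holds the pre-change pages (step 4)
conn.execute("COMMIT")
after = os.path.exists(journal)    # journal deleted once changes are durable (step 10)
conn.close()

print(during, after)
```

The journal file appears as soon as the first page is modified inside the transaction and disappears at commit, which is exactly the "journal existence marks an incomplete change" rule described above.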
Rollback
This section explains how data is recovered when something goes wrong, such as a power failure or a program crash.
1. When Something Goes Wrong…
Suppose a failure occurs while the changes are being written to disk (step 9 above).
2. Hot Rollback Journals
As mentioned above, a hot journal exists so that the database can be restored to its pre-failure state. Before a process may access the database file, it must obtain a SHARED lock; if it then discovers a rollback journal, SQLite checks whether that journal is hot. A hot journal means the previous operation was interrupted by a crash or failure.
So how does SQLite decide whether a journal is hot? All of the following must hold:
1. The rollback journal exists.
2. The rollback journal is not empty.
3. There is currently no RESERVED lock on the database.
4. The rollback journal does not contain the name of a master journal file (explained later), or it contains the name of a master journal file and that master journal file exists.
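The four conditions can be sketched as a single check. This is a simplified illustration only: real SQLite tests the RESERVED lock through the OS locking primitives and reads the master-journal name out of the journal header, both of which are stubbed as plain arguments here, and the function name is invented:

```python
import os

def looks_like_hot_journal(db_path, reserved_lock_held, master_journal_name=None):
    journal = db_path + "-journal"
    if not os.path.exists(journal):        # 1. the journal must exist
        return False
    if os.path.getsize(journal) == 0:      # 2. and must be non-empty
        return False
    if reserved_lock_held:                 # 3. no RESERVED lock on the database
        return False
    if master_journal_name is not None:    # 4. a named master journal must itself exist
        return os.path.exists(master_journal_name)
    return True

print(looks_like_hot_journal("/no/such/path.db", reserved_lock_held=False))
```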
3. Obtaining An Exclusive Lock On The Database
To prevent another process from rolling back the same hot journal, the process first obtains an EXCLUSIVE lock on the database.
4. Rolling Back Incomplete Changes
Once the process holds the EXCLUSIVE lock, it is allowed to update the database file, reading the original content back out of the journal. As mentioned above, the original size of the database file was written into the journal header, so SQLite can use that information to truncate the database file back to its original size.
5. Deleting The Hot Journal
Once all journal content has been played back into the database file and flushed to disk, the journal can be deleted.
6. Continue As If The Uncompleted Writes Had Never Happened
    OSTRACE(("LOCK-FAIL file=%p, wanted=%d, got=%d\n",
             pFile->h, locktype, newLocktype));
  }
  pFile->locktype = (u8)newLocktype;
  OSTRACE(("LOCK file=%p, lock=%d, rc=%s\n",
           pFile->h, pFile->locktype, sqlite3ErrName(rc)));
  return rc;
}
b. use examples to explain concurrency control/isolation in SQLite
i. when is the change done by an operation visible to other operations (which operations)
ii. when will nondeterminism happen (that is to say, under what conditions we can not know
what will happen in advance)
Isolation is one of the transaction guarantees: transactions must not interfere with one another, as if each transaction were the only one running at that moment. The basic way to provide isolation is at the database level, by locking the database or the affected fields so that only one transaction may update or read them at a time.
The problems that can occur are:
1. Lost update: one transaction's update to a field is lost because another transaction intervenes at the same time.
2. Dirty read: two transactions run concurrently; one updates data and the other reads that data before it has been committed.
3. Unrepeatable read: a transaction reads the same field twice and gets inconsistent results; for example, if transaction A reads the data before and after transaction B modifies it, the two reads differ.
4. Phantom read: transaction A runs the same query twice, and between the two queries transaction B inserts or deletes a row, so the second query returns rows the first did not, or is missing a row the first one had.
There are four transaction isolation levels:
1. read uncommitted: the weakest level; a transaction may read data that another transaction has written but not yet committed, so dirty reads are possible.
2. read committed: a reading transaction does not block other transactions, but an uncommitted updating transaction blocks all other transactions on that data. Since this hurts performance, an alternative implementation lets other transactions work on a temporary copy until the update is confirmed.
3. repeatable read: a reading transaction does not block other readers but does block writers. Again this affects performance; an alternative is to direct a transaction that wants to update data currently being read to a temporary copy until the read is confirmed.
4. serializable: if transaction A is reading and transaction B wants to update, they must run sequentially; while A is updating, B must run sequentially whether it reads or updates.
SQLite supports only two isolation levels: serializable and read uncommitted.
serializable: SQLite's default, the strictest but also the safest level. Whenever transaction A wants to read while transaction B wants to write or update, they must run in sequence; while A is updating, B must wait its turn whether it reads or writes. This costs concurrency, so it is best suited to workloads with low parallelism requirements and a relatively high proportion of writes.
It avoids lost updates, dirty reads, unrepeatable reads, and phantom reads.
read uncommitted: the opposite — the least strict level and the one most likely to produce inconsistent data, but comparatively efficient (in SQLite it only takes effect in shared-cache mode, via PRAGMA read_uncommitted). Writers still exclude one another: if transaction A is updating but has not committed, transaction B's update is deferred until after A commits. It suits workloads needing high parallelism, with limited memory and few writes.
It avoids lost updates, but cannot prevent dirty reads, unrepeatable reads, or phantom reads.
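The visibility rule under the default (serializable) behaviour — an uncommitted change is invisible to other connections, and becomes visible only after COMMIT — can be demonstrated with Python's stdlib `sqlite3` module; the file and table names are arbitrary:

```python
import os, sqlite3, tempfile

db = os.path.join(tempfile.mkdtemp(), "iso.db")

writer = sqlite3.connect(db, isolation_level=None)
writer.execute("CREATE TABLE t(x)")

reader = sqlite3.connect(db, isolation_level=None)

writer.execute("BEGIN")                      # writer takes a RESERVED lock on first write
writer.execute("INSERT INTO t VALUES (1)")

# RESERVED still permits SHARED locks, so the reader may read,
# but it sees the pre-transaction state: the uncommitted row is invisible.
before = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]

writer.execute("COMMIT")                     # the change becomes durable and visible

after = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(before, after)
```

No dirty read occurs here; with the default level the reader can never observe the writer's in-flight changes, only the state before or after the commit.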
2-B
In the last part you are required to answer questions related to topics mentioned in the lectures.
There are still lots of components in the database not included in the questions. In this part,
you need to select a component (or several components if you want) you are interested in
(and not the component we asked for detail explanation in the last part) then describe the
component in detail. Include the following points in your explanation:
Description of the component you choose (its functionality and role in SQLite)
Related source code files (in “sqlite/src/”)
Detail explanation of how it works (code explanation, description in documentation)
The component I chose to examine is the complete flow of how SQLite processes a Transaction.
Reference:
1. SQLite File IO Specification (https://www.sqlite.org/fileio.html)
2. Transaction (https://www.sqlite.org/lang_transaction.html)
3. SQLite B-Tree Module (https://www.sqlite.org/btreemodule.html)
4. Atomic Commit In SQLite (https://www.sqlite.org/atomiccommit.html)
Transaction processing also involves the locking mechanism, concurrency control, atomic commit, and so on from the questions above.
Related components / source files:
1. vdbe.c
2. btree.c
3. pager.c
The code is explained below.
The complete call flow is:
OP_Transaction (vdbe.c): the virtual machine executes the Transaction opcode
sqlite3BtreeBeginTrans (btree.c): the btree layer begins the transaction
sqlite3pager_begin (pager.c): acquires the write lock and opens the journal file
pager_open_journal (pager.c): opens the journal file and writes the journal header
OS interface
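The top of this flow can be inspected directly: EXPLAIN prints the VDBE program for a statement, and the `Transaction` opcode in that program is what dispatches to the OP_Transaction case in vdbe.c. A minimal check using Python's stdlib `sqlite3` (table name arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x)")

# EXPLAIN rows are (addr, opcode, p1, p2, p3, p4, p5, comment);
# column 1 is the opcode name.
opcodes = [row[1] for row in conn.execute("EXPLAIN INSERT INTO t VALUES (1)")]
print(opcodes)
```

For any write statement, `Transaction` appears near the start of the program, before the cursor-opening and row-writing opcodes, matching the flow above where the transaction is begun before any btree modification.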
/*
** Sync the journal file, ensuring that all dirty pages have been
** written to the journal file on disk.
*/
static int syncJournal(Pager *pPager){
PgHdr *pPg;
int rc = SQLITE_OK;
/* Sync the journal before modifying the main database
** (assuming there is a journal and it needs to be synced.)
*/
if( pPager->needSync ){
if( !pPager->tempFile ){
assert( pPager->journalOpen );
/* assert( !pPager->noSync ); // noSync might be set if synchronous
** was turned off after the transaction was started. Ticket #615 */
#ifndef NDEBUG
{
/* Make sure the pPager->nRec counter we are keeping agrees
** with the nRec computed from the size of the journal file.
*/
i64 jSz;
rc = sqlite3OsFileSize(pPager->jfd, &jSz);
if( rc!=0 ) return rc;
assert( pPager->journalOff==jSz );
}
#endif
{
/* Write the nRec value into the journal file header. If in
** full-synchronous mode, sync the journal first. This ensures that
** all data has really hit the disk before nRec is updated to mark
** it as a candidate for rollback.
*/
if( pPager->fullSync ){
TRACE2("SYNC journal of %d\n", PAGERID(pPager));
// first make sure all of the dirty-page data has been written to the journal file
rc = sqlite3OsSync(pPager->jfd, 0);
if( rc!=0 ) return rc;
}
rc = sqlite3OsSeek(pPager->jfd,
pPager->journalHdr + sizeof(aJournalMagic));
if( rc ) return rc;
// write the page count (nRec) into the journal file
rc = write32bits(pPager->jfd, pPager->nRec);
if( rc ) return rc;
rc = sqlite3OsSeek(pPager->jfd, pPager->journalOff);
if( rc ) return rc;
}
TRACE2("SYNC journal of %d\n", PAGERID(pPager));
rc = sqlite3OsSync(pPager->jfd, pPager->full_fsync);
if( rc!=0 ) return rc;
pPager->journalStarted = 1;
}
pPager->needSync = 0;
/* Erase the needSync flag from every page.
*/
// clear the needSync flag on every page
for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){
pPg->needSync = 0;
}
pPager->pFirstSynced = pPager->pFirst;
}
#ifndef NDEBUG
/* If the Pager.needSync flag is clear then the PgHdr.needSync
** flag must also be clear for all pages. Verify that this
** invariant is true.
*/
else{
for(pPg=pPager->pAll; pPg; pPg=pPg->pNextAll){
assert( pPg->needSync==0 );
}
assert( pPager->pFirstSynced==pPager->pFirst );
}
#endif
return rc;
}
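The ordering that syncJournal() enforces — journal content durable before the database is touched, database durable before the journal is deleted — can be sketched with plain file operations. This is a simplified illustration of steps 4-10, not SQLite's actual I/O path: the function name is invented, "pages" are just byte strings, and locking is omitted:

```python
import os, tempfile

def commit_with_journal(db_path, old_pages: bytes, new_pages: bytes):
    journal = db_path + "-journal"

    fd = os.open(journal, os.O_CREAT | os.O_WRONLY)
    os.write(fd, old_pages)      # save the pre-change content for rollback
    os.fsync(fd)                 # journal must be durable BEFORE the db changes
    os.close(fd)

    fd = os.open(db_path, os.O_CREAT | os.O_WRONLY)
    os.write(fd, new_pages)      # now it is safe to overwrite the database
    os.fsync(fd)                 # changes must be durable BEFORE...
    os.close(fd)

    os.unlink(journal)           # ...the journal is deleted: the commit point

d = tempfile.mkdtemp()
db = os.path.join(d, "demo.db")
with open(db, "wb") as f:
    f.write(b"old")
commit_with_journal(db, b"old", b"new")
print(open(db, "rb").read(), os.path.exists(db + "-journal"))
```

If a crash occurs before the unlink, the surviving journal marks the transaction as incomplete and supplies the bytes needed to roll the database back; after the unlink, the change is committed.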
// Write all dirty pages to the database file.
// From this point on, the EXCLUSIVE lock is acquired and the pages are written back to the OS file.
static int pager_write_pagelist(PgHdr *pList){
Pager *pPager;
int rc;
if( pList==0 ) return SQLITE_OK;
pPager = pList->pPager;
/* At this point there may be either a RESERVED or EXCLUSIVE lock on the
** database file. If there is already an EXCLUSIVE lock, the following
** calls to sqlite3OsLock() are no-ops.
**
** Moving the lock from RESERVED to EXCLUSIVE actually involves going
** through an intermediate state PENDING. A PENDING lock prevents new
** readers from attaching to the database but is insufficient for us to
** write. The idea of a PENDING lock is to prevent new readers from
** coming in while we wait for existing readers to clear.
**
** While the pager is in the RESERVED state, the original database file
** is unchanged and we can rollback without having to playback the
** journal into the original database file. Once we transition to
** EXCLUSIVE, it means the database file has been changed and any rollback
** will require a journal playback.
*/
// acquire the EXCLUSIVE lock
rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);
if( rc!=SQLITE_OK ){
return rc;
}
while( pList ){
assert( pList->dirty );
rc = sqlite3OsSeek(pPager->fd, (pList->pgno-1)*(i64)pPager->pageSize);
if( rc ) return rc;
/* If there are dirty pages in the page cache with page numbers greater
** than Pager.dbSize, this means sqlite3pager_truncate() was called to
** make the file smaller (presumably by auto-vacuum code). Do not write
** any such pages to the file.
*/
if( pList->pgno<=pPager->dbSize ){
char *pData = CODEC2(pPager, PGHDR_TO_DATA(pList), pList->pgno, 6);
TRACE3("STORE %d page %d\n", PAGERID(pPager), pList->pgno);
//write the page data to the database file
rc = sqlite3OsWrite(pPager->fd, pData, pPager->pageSize);
TEST_INCR(pPager->nWrite);
}
#ifndef NDEBUG
else{
TRACE3("NOSTORE %d page %d\n", PAGERID(pPager), pList->pgno);
}
#endif
if( rc ) return rc;
//clear the dirty flag
pList->dirty = 0;
#ifdef SQLITE_CHECK_PAGES
pList->pageHash = pager_pagehash(pList);
#endif
//advance to the next dirty page
pList = pList->pDirty;
}
return SQLITE_OK;
}
//Sync the database file that backs this btree.
//After this function returns, all that remains is to commit the write transaction and delete the journal file.
int sqlite3BtreeSync(Btree *p, const char *zMaster){
int rc = SQLITE_OK;
if( p->inTrans==TRANS_WRITE ){
BtShared *pBt = p->pBt;
Pgno nTrunc = 0;
#ifndef SQLITE_OMIT_AUTOVACUUM
if( pBt->autoVacuum ){
rc = autoVacuumCommit(pBt, &nTrunc);
if( rc!=SQLITE_OK ){
return rc;
}
}
#endif
//ask the pager layer to perform the sync
rc = sqlite3pager_sync(pBt->pPager, zMaster, nTrunc);
}
return rc;
}
//write all of the pager's dirty pages back to the database file
int sqlite3pager_sync(Pager *pPager, const char *zMaster, Pgno nTrunc){
int rc = SQLITE_OK;
TRACE4("DATABASE SYNC: File=%s zMaster=%s nTrunc=%d\n",
pPager->zFilename, zMaster, nTrunc);
/* If this is an in-memory db, or no pages have been written to, or this
** function has already been called, it is a no-op.
*/
//if the pager is not yet in the PAGER_SYNCED state and dirtyCache is set,
//perform the sync
if( pPager->state!=PAGER_SYNCED && !MEMDB && pPager->dirtyCache ){
PgHdr *pPg;
assert( pPager->journalOpen );
/* If a master journal file name has already been written to the
** journal file, then no sync is required. This happens when it is
** written, then the process fails to upgrade from a RESERVED to an
** EXCLUSIVE lock. The next time the process tries to commit the
** transaction the m-j name will have already been written.
*/
if( !pPager->setMaster ){
//increment the pager's change counter
rc = pager_incr_changecounter(pPager);
if( rc!=SQLITE_OK ) goto sync_exit;
#ifndef SQLITE_OMIT_AUTOVACUUM
if( nTrunc!=0 ){
/* If this transaction has made the database smaller, then all pages
** being discarded by the truncation must be written to the journal
** file.
*/
Pgno i;
void *pPage;
int iSkip = PAGER_MJ_PGNO(pPager);
for( i=nTrunc+1; i<=pPager->origDbSize; i++ ){
if( !(pPager->aInJournal[i/8] & (1<<(i&7))) && i!=iSkip ){
rc = sqlite3pager_get(pPager, i, &pPage);
if( rc!=SQLITE_OK ) goto sync_exit;
rc = sqlite3pager_write(pPage);
sqlite3pager_unref(pPage);
if( rc!=SQLITE_OK ) goto sync_exit;
}
}
}
#endif
rc = writeMasterJournal(pPager, zMaster);
if( rc!=SQLITE_OK ) goto sync_exit;
//sync the journal file
rc = syncJournal(pPager);
if( rc!=SQLITE_OK ) goto sync_exit;
}
#ifndef SQLITE_OMIT_AUTOVACUUM
if( nTrunc!=0 ){
rc = sqlite3pager_truncate(pPager, nTrunc);