FOSSASIA PGDAY ASIA 2017 presentation material.
In this presentation, I will talk about the following two topics:
* Considerations for securing a database system
* Current status of database auditing in PostgreSQL
FOSSASIA 2017
http://2017.fossasia.org/
PGDAY ASIA 2017
http://2017.pgday.asia/
NTT pgaudit
https://github.com/ossc-db/pgaudit
High availability is critical for PostgreSQL database systems, especially for organizations that depend on their databases to support their operations. In this presentation, we will explore the different options available for achieving high availability in PostgreSQL.
Learn more at www.opensourceschool.fr
This material is distributed under the Creative Commons CC BY-SA 3.0 FR license (Attribution-ShareAlike 3.0 France).
Outline:
1. Introduction
2. Installation
3. The psql client
4. Authentication and privileges
5. Backup and restoration
6. Internal Architecture
7. Performance optimization
8. Stats and monitoring
9. Logs
10. Replication
GTIDs were introduced to solve replication problems and improve database consistency in MySQL database replication.
When transactions are accidentally executed on a replica, they introduce GTIDs on that replica that do not exist on the master. If, on a master failover, that replica becomes the new master and the binlogs containing those errant GTIDs have already been purged, replication breaks on the replicas of the new master, because the missing GTIDs cannot be retrieved from its binlogs.
This presentation covers GTIDs, how to detect errant GTIDs on a replica (before the corresponding binlogs are purged), and how to inspect the corresponding transactions in the binlogs. I'll give some examples of transactions that could happen on a replica without originating from the primary node, explain how this is possible, and share some tips on how to avoid it.
Basic understanding of MySQL database replication is assumed.
This presentation was at Percona Live 2019 in Austin, Texas.
https://www.percona.com/live/19/sessions/errant-gtids-breaking-replication-how-to-detect-and-avoid-them
This deck was used by Devrim at pgDay Asia 2017. He talked about some important facts about WAL (the transaction logs, or xlogs) in PostgreSQL. Some of these can really come in handy on a bad day.
Using the new extended Berkeley Packet Filter (eBPF) capabilities in Linux to improve the performance of auditing security-relevant kernel events around network, file, and process actions.
How Safe is Asynchronous Master-Master Setup? (Sveta Smirnova)
It is common knowledge that built-in asynchronous master-master (active-active) replication is not safe. I remember times when the official MySQL User Reference Manual stated that such an installation is not recommended for production use. Some experts repeat this claim even now.
While this statement is generally true, I worked with thousands of shops that successfully avoided asynchronous replication limitations in active-active setups.
In this talk, I will show how they did it, demonstrate situations where asynchronous master-master replication is the best possible high-availability option and beats solutions such as Galera or InnoDB Cluster, and also cover common mistakes that lead to disasters.
Presented in the "MySQL, MariaDB and Friends" devroom at FOSDEM 2020: https://fosdem.org/2020/schedule/event/mysql_master_master/
Talk for SCaLE13x. Video: https://www.youtube.com/watch?v=_Ik8oiQvWgo . Profiling can show what your Linux kernel and applications are doing in detail, across all software stack layers. This talk shows how we use Linux perf_events (aka "perf") and flame graphs at Netflix to understand CPU usage in detail, to optimize our cloud usage, solve performance issues, and identify regressions. This is more than just an intro: profiling difficult targets, including Java and Node.js, is covered, including ways to resolve JIT-compiled symbols and broken stacks. Included are the easy examples, the hard, and the cutting edge.
Recently I noticed that in a Lua environment handling large amounts of data, GC takes up a lot of CPU time, roughly 20% of the total. I want to improve this, which first requires a thorough understanding of Lua's GC algorithm and its implementation. I have read Lua's source before, but since the source has changed across versions, that work needs to be done again. This time I re-read the source code of Lua 5.1.4. Starting today, I'll keep notes analyzing in detail how Lua's GC is implemented. Reading the code took me a full day; writing it up will probably take even longer. I'll publish it on this blog over several days.
Lua uses a simple mark-and-sweep GC. Lua has only nine data types: nil, boolean, lightuserdata, number, string, table, function, userdata, and thread. Of these, only four (string, table, function, and thread) are shared by reference inside the VM and are therefore objects that the GC must manage and reclaim; the other types exist as plain values.
In the implementation, however, two more kinds of objects are managed by the GC: proto (which can be thought of as a function with unbound upvalues) and upvalue (several upvalues may reference the same value).
Lua stores values as a union plus a type tag. See the definition at lines 56-75 of lobject.h:
/*
** Union of all Lua values
*/
typedef union {
  GCObject *gc;
  void *p;
  lua_Number n;
  int b;
} Value;

/*
** Tagged Values
*/
#define TValuefields	Value value; int tt

typedef struct lua_TValue {
  TValuefields;
} TValue;
As we can see, Value is defined as a union. An object that needs GC management is stored as a GCObject pointer; otherwise the value is stored directly. The rest of the code does not use the Value type directly but TValue, which adds a type tag to Value, recorded in int tt.
On typical systems each TValue is 12 bytes long. Incidentally, in The Implementation of Lua 5.0 the authors discuss why, on 32-bit systems, they did not use some trick to pack the type into the first 8 bytes.
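The union-plus-tag layout described above can be sketched in standalone C. This is a minimal illustration, not Lua's actual definitions: the names MyValue, MyTValue, and MY_TNUMBER are hypothetical stand-ins for Value, TValue, and Lua's type tags.

```c
#include <assert.h>

/* Stand-in for Lua's GCObject; collectable values hold a pointer to it. */
typedef struct GCHeader GCHeader;

/* Union of all value representations (mirrors lobject.h's Value). */
typedef union {
    GCHeader *gc;   /* collectable objects: string, table, function, thread */
    void *p;        /* light userdata */
    double n;       /* numbers (lua_Number is double by default) */
    int b;          /* booleans */
} MyValue;

enum { MY_TNIL, MY_TBOOLEAN, MY_TNUMBER };

/* A tagged value: the union plus an int type tag, like TValue's tt field. */
typedef struct {
    MyValue value;
    int tt;
} MyTValue;

static MyTValue make_number(double n) {
    MyTValue v;
    v.value.n = n;   /* non-collectable: stored directly, no GC pointer */
    v.tt = MY_TNUMBER;
    return v;
}
```

On a 32-bit system with 4-byte alignment this struct is the 12 bytes mentioned above (8 for the union, 4 for the tag); on 64-bit systems padding typically makes it 16.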
All GCObjects share a common data header, called CommonHeader, defined as a macro at line 43 of lobject.h. A macro is used for convenience, since C does not support struct inheritance.
#define CommonHeader GCObject *next; lu_byte tt; lu_byte marked
From this we can see that all GCObjects are strung together on a singly linked list. Each object identifies its type via tt, and the marked field is used for the mark-and-sweep work.
Mark-and-sweep is a simple GC algorithm. Each GC cycle starts from a number of root nodes and marks every node directly or indirectly reachable from them. For Lua this process is easy to implement: since all GCObjects are on the same linked list, once marking is complete we simply traverse the list and delete every unmarked node.
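The sweep just described can be sketched as a short C program. This is an illustrative toy, not Lua's actual sweeplist code; the Obj struct and function name are hypothetical, echoing only the next pointer and mark bit of CommonHeader.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy GC object: an intrusive singly linked list node with a mark bit. */
typedef struct Obj {
    struct Obj *next;
    int marked;          /* set by the mark phase if reachable */
} Obj;

/* Sweep: walk the whole list, free every unmarked node, and clear the
   mark on survivors so they start the next cycle unmarked again. */
static Obj *sweep(Obj *head) {
    Obj **p = &head;     /* pointer-to-pointer lets us unlink in place */
    while (*p) {
        Obj *o = *p;
        if (o->marked) {
            o->marked = 0;
            p = &o->next;
        } else {
            *p = o->next;  /* unlink the dead object */
            free(o);       /* and reclaim it */
        }
    }
    return head;
}
```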
In the actual implementation, Lua uses more than one linked list to keep track of all GCObjects, because the string type is special. All strings live in one large hash table, which must guarantee that no two strings with the same contents are ever created. Strings are therefore managed separately rather than strung onto the GCObject list.
Let's come back to the lua_State type. This is the data type used most when writing C/Lua interop code. As the name suggests, it represents a state of the Lua VM. In implementation terms it is closer to one Lua thread together with the data it contains (stack, environment, and so on). In fact, a lua_State is itself a GCObject of type thread. See its definition at line 97 of lstate.h.
/*
** `per thread' state
*/
struct lua_State {
  CommonHeader;
  lu_byte status;
  StkId top;  /* first free slot in the stack */
  StkId base;  /* base of current function */
  global_State *l_G;
  CallInfo *ci;  /* call info for current function */
  const Instruction *savedpc;  /* `savedpc' of current function */
  StkId stack_last;  /* last free slot in the stack */
  StkId stack;  /* stack base */
  CallInfo *end_ci;  /* points after end of ci array */
  CallInfo *base_ci;  /* array of CallInfo's */
  int stacksize;
  int size_ci;  /* size of array `base_ci' */
  unsigned short nCcalls;  /* number of nested C calls */
  unsigned short baseCcalls;  /* nested C calls when resuming coroutine */
  lu_byte hookmask;
  lu_byte allowhook;
  int basehookcount;
  int hookcount;
  lua_Hook hook;
  TValue l_gt;  /* table of globals */
  TValue env;  /* temporary place for environments */
  GCObject *openupval;  /* list of open upvalues in this stack */
  GCObject *gclist;
  struct lua_longjmp *errorJmp;  /* current error recover point */
  ptrdiff_t errfunc;  /* current error handling function (stack index) */
};
A running Lua VM can have several lua_States, i.e. several threads. They share some data, which lives in the global_State *l_G field, and that naturally includes the linked list of all GCObjects.
We can also see that while the automatic GC is in its sweep phase, the collector is in no particular hurry to free memory: freeing memory decreases totalbytes, and if no new objects are allocated, totalbytes will never grow enough to trigger the next luaC_step.
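The pacing rule described here, allocation drives the incremental collector forward while freeing never does, can be sketched as follows. The field names echo Lua's global_State (totalbytes, GCthreshold), but the step logic is a deliberately simplified assumption, not Lua's actual luaC_step:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified pacing state: only allocation grows totalbytes, so only
   allocation can ever reach GCthreshold and trigger the next GC step. */
typedef struct {
    size_t totalbytes;   /* bytes currently allocated */
    size_t GCthreshold;  /* next step fires when totalbytes reaches this */
    int steps;           /* how many GC steps have run (for the demo) */
} GCState;

static void gc_step(GCState *g) {
    g->steps++;
    g->GCthreshold = g->totalbytes + 1024;  /* push the trigger forward */
}

/* Called on every allocation, in the spirit of Lua's luaC_checkGC. */
static void on_alloc(GCState *g, size_t nbytes) {
    g->totalbytes += nbytes;
    if (g->totalbytes >= g->GCthreshold)
        gc_step(g);
}

/* Freeing shrinks totalbytes and can never trigger a step. */
static void on_free(GCState *g, size_t nbytes) {
    g->totalbytes -= nbytes;
}
```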
Manual stepwise GC is a different story. The data value passed to lua_gc with GCSTEP can, during the mark phase, be thought of as the number of bytes to scan multiplied by the step multiplier. Of course this is only a rough estimate, since memory is not actually scanned byte by byte as in the idealized model.
Once all gray nodes have been processed, can we start sweeping the remaining white nodes? Not yet. Because marking is performed incrementally, the object graph inside the Lua VM may have changed in the meantime. So before the sweep begins, one final scan is needed, and this pass must not be interrupted.
Let's look at the handler involved. See line 526 of lgc.c:
static void atomic (lua_State *L) {
  global_State *g = G(L);
  size_t udsize;  /* total size of userdata to be finalized */
  /* remark occasional upvalues of (maybe) dead threads */
  remarkupvals(g);
  /* traverse objects caught by write barrier and by 'remarkupvals' */
  propagateall(g);
  /* remark weak tables */
  g->gray = g->weak;
  g->weak = NULL;
  lua_assert(!iswhite(obj2gco(g->mainthread)));
  markobject(g, L);  /* mark running thread */
  markmt(g);  /* mark basic metatables (again) */
  propagateall(g);
  /* remark gray again */
  g->gray = g->grayagain;
  g->grayagain = NULL;
  propagateall(g);
  udsize = luaC_separateudata(L, 0);  /* separate userdata to be finalized */
  marktmu(g);  /* mark `preserved' userdata */
  udsize += propagateall(g);  /* remark, to propagate `preserveness' */
  cleartable(g->weak);  /* remove collected objects from weak tables */
  /* flip current white */
  g->currentwhite = cast_byte(otherwhite(g));
  g->sweepstrgc = 0;
  g->sweepgc = &g->rootgc;
  g->gcstate = GCSsweepstring;
  g->estimate = g->totalbytes - udsize;  /* first estimate */
}
The code is commented in detail. propagateall iterates over all gray nodes, so it must be called after each key step to ensure that every object that may still need marking has been processed. The first step handles the open TUPVALs mentioned earlier; then the weak tables are processed.
Weak tables are rather special. The reason they need to be re-marked repeatedly here has to do with the behavior of write barriers, which is beyond today's topic. Write barriers also cause some other objects to be marked again, namely the processing of the grayagain list. Both are left for tomorrow.
Next comes the userdata handling; luaC_separateudata was explained earlier.
Once everything has been marked, all unreferenced objects have been separated out, and the weak tables that are still referenced can be safely cleaned; that is what cleartable does. Its implementation is self-evident and needs no discussion, so I won't list the code here. Worth mentioning, though, is the predicate that decides whether an entry can be cleared: iscleared. See line 329 of lgc.c.
/*
** The next function tells whether a key or value can be cleared from
** a weak table. Non-collectable objects are never removed from weak
** tables. Strings behave as `values', so are never removed too. for
** other objects: if really collected, cannot keep them; for userdata
** being finalized, keep them in keys, but not in values
*/
static int iscleared (const TValue *o, int iskey) {
  printf("is cleared %p,%d\n", o, iskey);
  if (!iscollectable(o)) return 0;
  if (ttisstring(o)) {
    stringmark(rawtsvalue(o));  /* strings are `values', so are never weak */
    return 0;
  }
  return iswhite(gcvalue(o)) ||
    (ttisuserdata(o) && (!iskey && isfinalized(uvalue(o))));
}
If you are puzzled by the comment "for userdata being finalized, keep them in keys, but not in values", read Roberto's explanation of it.
Just like the userdata that get collected, once a userdata is marked white it is no longer considered to be in the isfinalized state in the next GC cycle, and the corresponding k/v entries in weak tables are then truly removed.
Now let's talk about write barriers. Because the GC scan is performed in steps, it is inevitable that objects already colored black get modified halfway through the scan and need to be marked again. This is why a write barrier must be set up when objects are written. During the scan, an operation that triggers the write barrier either colors the affected objects correctly right away, or records the objects that need re-coloring and leaves them to atomic, the final stage of marking.
There are four barrier-related APIs, defined at line 86 of lgc.h:
#define luaC_barrier(L,p,v) { if (valiswhite(v) && isblack(obj2gco(p)))  \
	luaC_barrierf(L,obj2gco(p),gcvalue(v)); }

#define luaC_barriert(L,t,v) { if (valiswhite(v) && isblack(obj2gco(t)))  \
	luaC_barrierback(L,t); }

#define luaC_objbarrier(L,p,o)  \
	{ if (iswhite(obj2gco(o)) && isblack(obj2gco(p)))  \
		luaC_barrierf(L,obj2gco(p),obj2gco(o)); }

#define luaC_objbarriert(L,t,o)  \
	{ if (iswhite(obj2gco(o)) && isblack(obj2gco(t))) luaC_barrierback(L,t); }
luaC_barrier and luaC_objbarrier are functionally identical, except that the former operates on a TValue and the latter on a GCObject. When v is linked to p, and v is white while p is black, luaC_barrierf is called.
luaC_barriert and luaC_objbarriert are used when linking v into a table t; they call luaC_barrierback instead, again on the precondition that v is white and t is black.
Why are tables treated specially? Because a table's reference relationship is one-to-many, and the same table is often modified frequently, while other objects have only a limited number of links. Handling tables separately reduces the overhead of the barrier itself.
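The two barrier strategies can be illustrated with a small tri-color sketch. The types and functions here are hypothetical simplifications of luaC_barrierf and luaC_barrierback, keeping only the coloring decision:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified tri-color state; names are illustrative, not Lua's. */
enum Color { WHITE, GRAY, BLACK };

typedef struct GObj {
    struct GObj *graynext;  /* link in the gray list */
    enum Color color;
} GObj;

static GObj *graylist = NULL;

static void push_gray(GObj *o) {
    o->color = GRAY;
    o->graynext = graylist;
    graylist = o;
}

/* Forward barrier (luaC_barrierf style): color the new white child now,
   so the invariant "black never points to white" is restored. */
static void barrier_forward(GObj *parent, GObj *child) {
    if (parent->color == BLACK && child->color == WHITE)
        push_gray(child);            /* child will be traversed later */
}

/* Backward barrier (luaC_barrierback style): re-gray the parent instead.
   Cheaper for tables, which are mutated many times: one re-gray covers
   any number of subsequent writes before the table is rescanned. */
static void barrier_back(GObj *parent) {
    if (parent->color == BLACK)
        push_gray(parent);           /* parent goes back to the gray set */
}
```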
luaC_barrierf marks the newly linked object immediately. Its implementation is at line 663 of lgc.c:
void luaC_barrierf (lua_State *L, GCObject *o, GCObject *v) {
  global_State *g = G(L);
  lua_assert(isblack(o) && iswhite(v) && !isdead(g, v) && !isdead(g, o));
  lua_assert(g->gcstate != GCSfinalize && g->gcstate != GCSpause);
  lua_assert(ttype(&o->gch) != LUA_TTABLE);
  /* must keep invariant? */
  if (g->gcstate == GCSpropagate)
    reallymarkobject(g, v);  /* restore invariant */
  else  /* don't mind */
    makewhite(g, o);  /* mark as white just to avoid other barriers */
}
In short, when the GC is in the GCSpropagate state (the mark phase), v is marked. Otherwise the node o it was linked to is flipped from black back to white. As mentioned before, Lua's GC uses two white marks, ping-ponged between cycles: once the mark phase ends, the current white is switched, so calling makewhite at this point does not cause the sweep phase to collect the node. At the same time, since o becomes white, barrier overhead is reduced: further modifications to o will no longer produce calls to luaC_barrierf.
luaC_barrierback puts the table being modified back into the gray set, leaving it to be marked again later. See line 676 of lgc.c:
void luaC_barrierback (lua_State *L, Table *t) {
  global_State *g = G(L);
  GCObject *o = obj2gco(t);