  1. Introduction to Voldemort 唐福林 <iMobile>
  2. Background <ul><li>LinkedIn: Bhupesh, Elias, and jaykreps </li></ul><ul><li>http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/ </li></ul><ul><li>Used along with Hadoop </li></ul>
  3. Requirements <ul><li>Large amounts of data (TB scale) </li></ul><ul><li>Offline computation (Hadoop) </li></ul><ul><li>Results served online (read-only in production) </li></ul><ul><li>Daily updates (large daily data cycles) </li></ul><ul><li>Project Voldemort </li></ul><ul><ul><li>the system built to deploy data to the live site </li></ul></ul><ul><ul><li>a key-value storage system </li></ul></ul>
  4. Requirements (cont.) <ul><li>The biggest limitation used to be insufficient offline computing capacity </li></ul><ul><li>Hadoop solved the offline computation problem </li></ul><ul><li>The problem now is how to deliver data to the site </li></ul><ul><li>"Hadoop has been quite helpful in removing scalability problems in the offline portion of the system; but in doing so it creates a huge bottleneck in our ability to actually deliver data to the site" </li></ul>
  5. Existing approaches <ul><li>rsync, ftp, jdbc batch </li></ul><ul><li>Problems: </li></ul><ul><ul><li>centralized, un-scalable </li></ul></ul><ul><ul><li>indexes must be built on the live machines, which hurts serving </li></ul></ul>
  6. Existing alternatives <ul><li>Memcache </li></ul><ul><ul><li>"mem": limited by RAM size </li></ul></ul><ul><ul><li>"cache": volatile, entries can be evicted </li></ul></ul><ul><ul><li>no bulk-operation support </li></ul></ul>
  7. Existing alternatives (cont.) <ul><li>MySQL </li></ul><ul><ul><li>InnoDB: too high space overhead </li></ul></ul><ul><ul><li>MyISAM: </li></ul></ul><ul><ul><ul><li>production is read-only, so table locks are not an issue </li></ul></ul></ul><ul><ul><ul><li>load data infile local for bulk loading </li></ul></ul></ul><ul><ul><ul><li>Problem 1: building indexes takes a long time and cannot be done on the live machines </li></ul></ul></ul><ul><ul><ul><li>Problem 2: MySQL's concurrency is insufficient </li></ul></ul></ul>
  8. Wishful thinking <ul><li>MySQL MyISAM </li></ul><ul><li>Insert offline and build the indexes in advance </li></ul><ul><li>Copy the database files to the live machines, which detect them immediately and put them into service (no restart required) </li></ul><ul><li>Problems: </li></ul><ul><ul><li>requires extra machines for index building </li></ul></ul><ul><ul><li>the data is copied multiple times </li></ul></ul><ul><ul><li>current MySQL does not support immediate detection </li></ul></ul><ul><ul><li>no compression support </li></ul></ul><ul><ul><li>... </li></ul></ul>
  9. Goals <ul><li>Protect the live servers </li></ul><ul><li>Horizontal scalability at each step </li></ul><ul><li>Ability to roll back </li></ul><ul><li>Failure tolerance </li></ul><ul><li>Support large ratios of data to RAM </li></ul>
  10. Project Voldemort - overview
  11. Project Voldemort - store <ul><li>Initially considered designing a custom storage engine with its own lookup and caching structures </li></ul><ul><li>Benchmarks showed the main bottleneck is whether the page cache is hit when fetching data </li></ul><ul><li>Lookup is not the bottleneck, so it is not worth optimizing </li></ul><ul><li>Therefore: simply mmap the data file </li></ul><ul><li>http://en.wikipedia.org/wiki/Amdahl%27s_law </li></ul><ul><ul><li>only the parts that account for a large share of total time are worth optimizing </li></ul></ul>
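The Amdahl's-law point above can be made concrete: if a step takes a fraction p of total request time and is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch (the fractions below are hypothetical, not measured):

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of total time is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# If lookup is a small fraction of request time, optimizing it barely helps:
lookup_gain = amdahl_speedup(p=0.05, s=10.0)  # ~1.05x overall
# If page-cache behavior dominates, improving the data fetch pays off:
fetch_gain = amdahl_speedup(p=0.90, s=2.0)    # ~1.82x overall
```

This is exactly why the slide concludes that a plain mmap of the data file is enough: the lookup structure was not where the time went.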
  12. Project Voldemort - store (2) <ul><li>my_store/ </li></ul><ul><li>version-0/ </li></ul><ul><li>0.index </li></ul><ul><li>0.data </li></ul><ul><li>... </li></ul><ul><li>n.index </li></ul><ul><li>n.data </li></ul><ul><li>version-1/ </li></ul><ul><li>0.index </li></ul><ul><li>0.data </li></ul><ul><li>... </li></ul>
  13. Project Voldemort - store (3) <ul><li>version-0: the current version </li></ul><ul><li>.index: index file; .data: raw data file </li></ul><ul><li>0 - n: data split into chunks </li></ul><ul><ul><li>each data file is at most 2G (Java mmap uses a 32-bit pointer) </li></ul></ul><ul><li>Update: </li></ul><ul><ul><li>upload a new copy of the data to a tmp directory </li></ul></ul><ul><ul><li>rename: v(n-1)->v(n), ... , v(0)->v(1), tmp->v(0) </li></ul></ul><ul><li>Multiple versions may coexist on disk </li></ul>
  14. Project Voldemort - store (4)
  15. Project Voldemort - store (5) <ul><li>each key/value pair has a fixed overhead of exactly 24 bytes in addition to the length of the value itself: 16-byte key md5 + 4-byte location + 4-byte size </li></ul><ul><li>the ith index entry is at offset 20 * i; no internal pointers </li></ul><ul><li>The only open question: how to organize the keys in the index file </li></ul><ul><ul><li>the index file is generated on Hadoop; the map/reduce job should use as little memory as possible while generating it </li></ul></ul><ul><ul><li>the index file is small (the data file is only 2G), so it should fit in the page cache, and its internal organization therefore hardly matters </li></ul></ul><ul><ul><li>conclusion: simply sort the keys, and use binary search at read time </li></ul></ul>
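The record layout and read path described above can be sketched in Python. This is illustrative only: the real store is Java and mmaps the files, and details such as byte order here are assumptions.

```python
import hashlib
import struct

RECORD = 20  # per index entry: 16-byte MD5 digest + 4-byte data-file offset

def build_store(pairs):
    """Build (index_bytes, data_bytes) with MD5-sorted fixed-width index records."""
    entries = []
    data = bytearray()
    for key, value in pairs:
        digest = hashlib.md5(key).digest()
        offset = len(data)
        data += struct.pack(">I", len(value)) + value  # 4-byte size prefix
        entries.append(digest + struct.pack(">I", offset))
    entries.sort()  # sorted by MD5 digest, so reads can binary-search
    return b"".join(entries), bytes(data)

def lookup(index, data, key):
    """Binary-search the fixed-width records; the real store mmaps both files."""
    digest = hashlib.md5(key).digest()
    lo, hi = 0, len(index) // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        rec = index[mid * RECORD:(mid + 1) * RECORD]
        if rec[:16] == digest:
            off = struct.unpack(">I", rec[16:20])[0]
            size = struct.unpack(">I", data[off:off + 4])[0]
            return data[off + 4:off + 4 + size]
        if rec[:16] < digest:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Fixed-width records are what make "the ith index entry is at offset 20 * i" hold, so no internal pointers are needed.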
  16. Project Voldemort - store (6) <ul><li>Additional considerations </li></ul><ul><ul><li>small values, many keys: with 100 million entries a binary search takes 27 comparisons, which is expensive relative to a single read and worth optimizing </li></ul></ul><ul><ul><li>when we have an entirely uncached index: after an update or a rollback </li></ul></ul><ul><ul><ul><li>To page the 100 million entry index for a chunk into memory will require 500k page faults no matter what the structure is </li></ul></ul></ul><ul><ul><ul><li>However it would be desirable to minimize the maximum number of page faults incurred on a given request, to minimize the variance of the request time </li></ul></ul></ul><ul><ul><ul><li>page-organized tree </li></ul></ul></ul><ul><li>An attempt </li></ul><ul><ul><li>exploit properties of md5 to modify the binary-search implementation (not yet implemented) </li></ul></ul>
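The numbers on this slide follow directly from the index geometry; a quick check (a 4 KiB page size is assumed):

```python
import math

entries = 100_000_000
# A binary search over n sorted entries needs ceil(log2(n)) comparisons:
comparisons = math.ceil(math.log2(entries))  # 27

# Paging a completely cold index into memory, at 20 bytes per entry:
index_bytes = entries * 20
page_faults = index_bytes // 4096            # ~488k, i.e. roughly 500k faults
```

The total fault count is fixed by the index size, which is why the slide focuses instead on bounding the faults per request (hence the page-organized tree idea).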
  17. Project Voldemort - build <ul><li>single-process command-line Java program, for testing </li></ul><ul><li>distributed Hadoop-based store builder </li></ul><ul><ul><li>A user-extensible Mapper extracts keys from the source data </li></ul></ul><ul><ul><li>A custom Hadoop Partitioner then applies the Voldemort consistent hashing function to the keys, and assigns all keys mapped to a given node and chunk to a single reduce task </li></ul></ul><ul><ul><li>the shuffle phase of the map/reduce copies all values with the same destination node and chunk to the same reduce task; values are sorted by Hadoop and grouped by key </li></ul></ul><ul><ul><li>each of the reduce tasks creates one .index and one .data file for a given chunk on a particular node </li></ul></ul>
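The routing step of the build can be sketched as follows. Note this uses a plain modulo hash as a stand-in for Voldemort's consistent hashing function, and the node/chunk counts are made up:

```python
import hashlib

NODES, CHUNKS = 4, 8  # hypothetical cluster layout

def route(key: bytes) -> tuple:
    """Assign a key to a (node, chunk) pair; one reduce task per pair."""
    h = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    node = h % NODES                # stand-in for the consistent hash
    chunk = (h // NODES) % CHUNKS
    reduce_task = node * CHUNKS + chunk
    return node, chunk, reduce_task
```

Because the mapping is deterministic, the shuffle phase delivers every value destined for the same node and chunk to the same reducer, which then writes exactly one .index/.data pair.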
  18. Project Voldemort - deployment <ul><li>Update: </li></ul><ul><ul><li>upload a new copy of the data to a tmp directory </li></ul></ul><ul><ul><li>rename: v(n-1)->v(n), ... , v(0)->v(1), tmp->v(0) </li></ul></ul><ul><ul><li>renaming guarantees atomicity (provided everything is on the same disk partition) </li></ul></ul><ul><li>Upload: </li></ul><ul><ul><li>rsync: the diff computation burns CPU on the live machines; HDFS is not supported, so the data must first be copied to something like ext3 </li></ul></ul><ul><ul><li>Push vs. pull: pull requires extra triggers </li></ul></ul><ul><ul><li>throttle the transfer rate </li></ul></ul>
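The rename-based swap can be sketched like this (illustrative Python; directory names follow the version-N convention above, and the keep parameter is an assumption about how many rollback versions are retained):

```python
import os
import shutil

def swap_in(store_dir: str, tmp_name: str = "tmp", keep: int = 3) -> None:
    """Promote store_dir/tmp to version-0, shifting version-k to version-(k+1).

    Each step is a single rename, which is atomic as long as the source and
    target are on the same filesystem partition.
    """
    versions = sorted(
        (int(name.split("-")[1]) for name in os.listdir(store_dir)
         if name.startswith("version-")),
        reverse=True,  # shift the oldest first so nothing is overwritten
    )
    for v in versions:
        src = os.path.join(store_dir, f"version-{v}")
        if v + 1 >= keep:
            shutil.rmtree(src)  # drop versions beyond the rollback window
        else:
            os.rename(src, os.path.join(store_dir, f"version-{v + 1}"))
    os.rename(os.path.join(store_dir, tmp_name),
              os.path.join(store_dir, "version-0"))
```

Rollback is the same trick in reverse: rename version-0 aside and promote version-1, which is why the store tolerates multiple versions on disk at once.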
  19. Project Voldemort - benchmarks <ul><li>benchmark </li></ul><ul><ul><li>build time for a store in Hadoop </li></ul></ul><ul><ul><ul><li>100GB: 28 mins (400 mappers, 90 reducers) </li></ul></ul></ul><ul><ul><ul><li>512GB: 2 hrs 16 mins (2313 mappers, 350 reducers) </li></ul></ul></ul><ul><ul><ul><li>1TB: 5 hrs 39 mins (4608 mappers, 700 reducers) </li></ul></ul></ul><ul><ul><li>request rate a node can sustain once live (MySQL vs. Voldemort) </li></ul></ul><ul><ul><ul><li>Reqs per sec: 727 vs. 1291 </li></ul></ul></ul><ul><ul><ul><li>Median req. time: 0.23 ms vs. 0.05 ms </li></ul></ul></ul><ul><ul><ul><li>Avg. req. time: 13.7 ms vs. 7.7 ms </li></ul></ul></ul><ul><ul><ul><li>99th percentile req. time: 127.2 ms vs. 100.7 ms </li></ul></ul></ul>
  20. Project Voldemort - benchmarks (2) <ul><li>Factors that affect performance: </li></ul><ul><ul><li>The ratio of data to memory </li></ul></ul><ul><ul><li>The performance of the disk subsystem </li></ul></ul><ul><ul><li>The entropy of the request stream (random vs. organized; it determines the cache-miss rate) </li></ul></ul>
  21. Project Voldemort - Future <ul><li>Incremental data updates </li></ul><ul><ul><li>ship a diff file to save network transfer </li></ul></ul><ul><ul><ul><li>index file: sorted, so the diff is large, but the file itself is small </li></ul></ul></ul><ul><ul><ul><li>data file: unordered, so new content can simply be appended at the end </li></ul></ul></ul><ul><ul><ul><ul><li>what about the 2G size limit? </li></ul></ul></ul></ul><ul><ul><ul><li>Option 1: version-0 = version-1 + diff patch </li></ul></ul></ul><ul><ul><ul><ul><li>costs disk I/O </li></ul></ul></ul></ul><ul><ul><ul><li>Option 2: version-0 = diff patch only; reads fall through to version-1, ..., version-n </li></ul></ul></ul><ul><ul><ul><ul><li>read logic becomes complex </li></ul></ul></ul></ul><ul><ul><ul><ul><li>keep a Bloom filter tracking which keys are in each day's patch </li></ul></ul></ul></ul><ul><ul><ul><ul><li>what about rollback? </li></ul></ul></ul></ul>
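The Bloom-filter idea in the last bullet can be sketched as follows (a deliberately tiny implementation; the sizing constants are arbitrary):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter to test whether a key might be in a day's patch."""

    def __init__(self, bits: int = 1 << 16, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions by salting the hash with the index i.
        for i in range(self.hashes):
            h = hashlib.md5(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True means probably present.
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

A reader would consult each day's filter from newest to oldest and only open the patch files whose filter reports a possible hit; a False answer is always definitive, so most patches are skipped without any disk I/O.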
  22. Project Voldemort - Future <ul><li>Improved key hashing </li></ul><ul><ul><li>replicating at the file level: redundancy factor 2 </li></ul></ul><ul><ul><li>replicating at the chunk level: redundancy factor less than 2 </li></ul></ul>
  23. Project Voldemort - Future <ul><li>Compression </li></ul><ul><ul><li>requirement: fast decompression speed </li></ul></ul><ul><ul><li>LZO compression </li></ul></ul>
  24. Project Voldemort - Future <ul><li>Better indexing </li></ul><ul><ul><li>probabilistic binary search </li></ul></ul><ul><ul><li>204-way page-aligned tree </li></ul></ul><ul><ul><li>cache-oblivious algorithms, van Emde Boas tree </li></ul></ul><ul><ul><li>on-disk hash-based lookup structure </li></ul></ul>
  25. Closing <ul><li>Not currently useful for imobile, since we have no such need </li></ul><ul><li>Some of the practices, e.g. the deployment considerations, could be borrowed for Search 2.0 </li></ul><ul><li>Hadoop is a good thing and worth keeping an eye on </li></ul><ul><li>Finding a system's bottleneck is important, though difficult </li></ul>
  26. About <ul><li>Imobile http://www.imobile.com.cn </li></ul><ul><li>Team: http://team.imobile.com.cn </li></ul><ul><li>Me: http://blog.fulin.org </li></ul><ul><li>My twitter: http://twitter.com/tangfl </li></ul>
