Voldemort Intro Tangfl
Transcript

  • 1. Introduction to Voldemort (唐福林 Tang Fulin, iMobile)
  • 2. Background
    - LinkedIn: Bhupesh, Elias, and Jay Kreps
    - http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/
    - Works alongside Hadoop
  • 3. Requirements
    - Large amounts of data (TB scale)
    - Offline computation (Hadoop)
    - Results served to the live site (read-only online)
    - Daily updates (large daily data cycles)
    - Project Voldemort
      - the system built to deploy data to the live site
      - a key-value storage system
  • 4. Requirements (cont.)
    - The biggest limitation used to be insufficient offline computing capacity
    - Hadoop solved the offline computation problem
    - The problem now is delivering data to the site
    - "Hadoop has been quite helpful in removing scalability problems in the offline portion of the system; but in doing so it creates a huge bottleneck in our ability to actually deliver data to the site"
  • 5. Current approaches
    - rsync, FTP, JDBC batch
    - Problems:
      - centralized, not scalable
      - indexes have to be built on the live machines, which disturbs serving
  • 6. Candidate alternatives
    - Memcached
      - "mem": limited by memory size
      - "cache": volatile
      - no bulk-load support
  • 7. Candidate alternatives (cont.)
    - MySQL
      - InnoDB: too high space overhead
      - MyISAM:
        - online access is read-only, so table locks are not a problem
        - LOAD DATA LOCAL INFILE for bulk loading
        - Problem 1: building the indexes takes a long time and cannot be done on the live machines
        - Problem 2: MySQL's concurrency is insufficient
  • 8. Wishful thinking
    - MySQL MyISAM
    - Insert the data offline and build the indexes there
    - Copy the database files to the live machines, which notice them immediately and start serving from them (no restart required)
    - Problems:
      - extra machines are needed for index building
      - the data gets copied several times
      - current MySQL does not support immediate pickup
      - no compression support
      - ...
  • 9. Goals
    - Protect the live servers
    - Horizontal scalability at each step
    - Ability to roll back
    - Failure tolerance
    - Support large ratios of data to RAM
  • 10. Project Voldemort - overview
  • 11. Project Voldemort - store
    - The initial plan was to design a custom storage engine with its own lookup and caching structures
    - Benchmarks showed that the main bottleneck is whether the page cache is hit when fetching data
    - Lookup is not the bottleneck, so it is not worth optimizing
    - Therefore: simply mmap the data file
    - http://en.wikipedia.org/wiki/Amdahl%27s_law
      - only the parts that account for a large share of total time are worth optimizing
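The "simply mmap the data file" decision can be sketched with Java NIO. This is a minimal illustration, not Voldemort's actual code; the file layout and names are assumptions:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapRead {
    // Map the data file read-only; a read then costs only a page-cache lookup
    // (or a page fault on a miss), with no user-space caching layer in between.
    static byte[] readValue(Path dataFile, int position, int size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(dataFile.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] value = new byte[size];
            map.position(position);
            map.get(value);
            return value;
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("store", ".data");
        Files.write(f, "hello voldemort".getBytes());
        // Slice 9 bytes starting at offset 6.
        System.out.println(new String(readValue(f, 6, 9))); // voldemort
    }
}
```

Note that `MappedByteBuffer` is addressed with an `int`, which is exactly the 2 GB per-file limit mentioned on a later slide.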
  • 12. Project Voldemort - store (2)
    my_store/
      version-0/
        0.index
        0.data
        ...
        n.index
        n.data
      version-1/
        0.index
        0.data
        ...
  • 13. Project Voldemort - store (3)
    - version-0: the current version
    - .index: index file; .data: raw data file
    - 0 to n: chunking
      - each data file is at most 2 GB (Java mmap uses a 32-bit index)
    - Update procedure:
      - upload a new data set into a tmp directory
      - rename: v(n-1)->v(n), ..., v(0)->v(1), tmp->v(0)
    - Multiple versions are allowed to coexist on disk
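The rename-based update can be sketched as follows. This is a simplified illustration (pruning of old versions is left out), not Voldemort's actual deployment code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class VersionSwap {
    // Promote a freshly uploaded tmp directory to version-0, shifting every
    // existing version down by one: v(n-1)->v(n), ..., v(0)->v(1), tmp->v(0).
    // Renames within one filesystem are atomic, so a reader sees either the
    // old version-0 or the new one, never a half state; old versions stay on
    // disk for rollback.
    static void swapIn(Path storeDir, Path tmp) throws IOException {
        // Find the highest existing version number.
        int max = -1;
        while (Files.exists(storeDir.resolve("version-" + (max + 1)))) {
            max++;
        }
        // Shift from the highest version down so no rename clobbers another.
        for (int v = max; v >= 0; v--) {
            Files.move(storeDir.resolve("version-" + v), storeDir.resolve("version-" + (v + 1)));
        }
        Files.move(tmp, storeDir.resolve("version-0"));
    }

    public static void main(String[] args) throws IOException {
        Path store = Files.createTempDirectory("my_store");
        Files.createDirectory(store.resolve("version-0"));
        Path tmp = Files.createTempDirectory(store, "tmp"); // same partition as the store
        swapIn(store, tmp);
        System.out.println(Files.exists(store.resolve("version-1"))); // true: old data kept for rollback
    }
}
```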
  • 14. Project Voldemort - store (4)
  • 15. Project Voldemort - store (5)
    - "each key/value pair has a fixed overhead of exactly 24 bytes in addition to the length of the value itself": 16-byte MD5 of the key + 4-byte location + 4-byte size
    - the i-th index entry is at offset 20 * i; no internal pointers
    - The only open question: how to organize the keys inside the index file
      - the index files are generated on Hadoop, and the map/reduce job should use as little memory as possible while building them
      - the index file is small (only the data file approaches 2 GB), so it should fit in the page cache, and its organization hardly matters
      - conclusion: simply sort it, and use binary search at read time
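The sorted-index-plus-binary-search conclusion can be sketched like this, assuming the 20-byte entry layout described above (16-byte MD5 followed by a 4-byte data-file position). The helper names are illustrative:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

public class IndexLookup {
    static final int KEY_BYTES = 16, ENTRY_BYTES = 20;

    static byte[] md5(String key) {
        try {
            return MessageDigest.getInstance("MD5").digest(key.getBytes());
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available
        }
    }

    // Unsigned lexicographic compare of the 16-byte MD5 at entry i vs. target.
    static int compareEntry(byte[] index, int i, byte[] target) {
        int base = i * ENTRY_BYTES;
        for (int j = 0; j < KEY_BYTES; j++) {
            int a = index[base + j] & 0xff, b = target[j] & 0xff;
            if (a != b) return a - b;
        }
        return 0;
    }

    // Build a sorted index: one 20-byte entry (md5 + data-file position) per key.
    static byte[] buildIndex(List<String> keys, List<Integer> positions) {
        List<byte[]> entries = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            entries.add(ByteBuffer.allocate(ENTRY_BYTES)
                    .put(md5(keys.get(i))).putInt(positions.get(i)).array());
        }
        entries.sort((a, b) -> {
            for (int j = 0; j < KEY_BYTES; j++) {
                int c = (a[j] & 0xff) - (b[j] & 0xff);
                if (c != 0) return c;
            }
            return 0;
        });
        ByteBuffer out = ByteBuffer.allocate(entries.size() * ENTRY_BYTES);
        entries.forEach(out::put);
        return out.array();
    }

    // Binary search the sorted index: ~log2(n) probes, no internal pointers.
    static int lookup(byte[] index, String key) {
        byte[] target = md5(key);
        int lo = 0, hi = index.length / ENTRY_BYTES - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareEntry(index, mid, target);
            if (c == 0) return ByteBuffer.wrap(index, mid * ENTRY_BYTES + KEY_BYTES, 4).getInt();
            if (c < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1; // key not in this chunk
    }

    public static void main(String[] args) {
        byte[] idx = buildIndex(List.of("a", "b", "c"), List.of(0, 100, 250));
        System.out.println(lookup(idx, "b"));  // 100
        System.out.println(lookup(idx, "zz")); // -1
    }
}
```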
  • 16. Project Voldemort - store (6)
    - Additional considerations
      - small values, many keys: with 100 million entries a binary search takes 27 comparisons, which is expensive compared with a single data read, so it is worth optimizing
      - when we have an entirely uncached index: after an update or a rollback
        - "To page the 100 million entry index for a chunk into memory will require 500k page faults no matter what the structure is"
        - "However it would be desirable to minimize the maximum number of page faults incurred on a given request to minimize the variance of the request time"
        - page-organized tree
    - Attempt
      - exploit the properties of MD5 to modify the binary search (not yet implemented)
  • 17. Project Voldemort - build
    - single-process command-line Java program, for testing
    - distributed Hadoop-based store builder
      - a user-extensible Mapper extracts keys from the source data
      - a custom Hadoop Partitioner then applies the Voldemort consistent hashing function to the keys, and assigns all keys mapped to a given node and chunk to a single reduce task
      - the shuffle phase of the map/reduce copies all values with the same destination node and chunk to the same reduce task; values are sorted by Hadoop and grouped by key
      - each reduce task creates one .index and one .data file for a given chunk on a particular node
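The partitioning step can be illustrated with a toy routing function. Voldemort's real consistent-hash ring is more involved, so treat everything below as a hypothetical stand-in that only shows the shape of the idea: every key deterministically maps to one (node, chunk) pair, and hence to one reduce task:

```java
public class ChunkPartitioner {
    // Toy routing: derive a positive hash from the first 4 bytes of the key's
    // MD5, then split it into a node id and a chunk id. All keys landing on
    // the same (node, chunk) pair go to the same reduce task, which writes
    // exactly one .index and one .data file for that chunk.
    static int reducerFor(byte[] keyMd5, int numNodes, int chunksPerNode) {
        int h = ((keyMd5[0] & 0xff) << 24 | (keyMd5[1] & 0xff) << 16
               | (keyMd5[2] & 0xff) << 8 | (keyMd5[3] & 0xff)) & 0x7fffffff;
        int node = h % numNodes;
        int chunk = (h / numNodes) % chunksPerNode;
        return node * chunksPerNode + chunk; // one reduce task per (node, chunk)
    }

    public static void main(String[] args) throws Exception {
        byte[] md5 = java.security.MessageDigest.getInstance("MD5").digest("some-key".getBytes());
        int r = reducerFor(md5, 4, 8);
        System.out.println(r >= 0 && r < 32); // always lands in one of the 4*8 tasks
    }
}
```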
  • 18. Project Voldemort - deployment
    - Update:
      - upload a new data set into a tmp directory
      - rename: v(n-1)->v(n), ..., v(0)->v(1), tmp->v(0)
      - renaming guarantees atomicity (provided everything is on the same disk partition)
    - Upload:
      - rsync: the diff computation burns CPU on the live machines; it cannot read from HDFS, so the data must first be copied to something like ext3
      - push vs. pull: pull requires extra triggers
      - throttle the transfer rate
  • 19. Project Voldemort - benchmarks
    - build time for a store in Hadoop
      - 100 GB: 28 min (400 mappers, 90 reducers)
      - 512 GB: 2 h 16 min (2313 mappers, 350 reducers)
      - 1 TB: 5 h 39 min (4608 mappers, 700 reducers)
    - request rate a node can sustain once live:

                                    MySQL       Voldemort
      Reqs per sec.                 727         1291
      Median req. time              0.23 ms     0.05 ms
      Avg. req. time                13.7 ms     7.7 ms
      99th percentile req. time     127.2 ms    100.7 ms
  • 20. Project Voldemort - benchmarks (2)
    - Influencing factors:
      - the ratio of data to memory
      - the performance of the disk subsystem
      - the entropy of the request stream (random vs. organized; this determines the cache miss rate)
  • 21. Project Voldemort - Future
    - Incremental data updates
      - ship a diff file to save network transfer
        - index file: sorted, so the diff is large, but the file itself is small
        - data file: unordered, so new content can simply be appended
          - what about the 2 GB limit?
        - option A: version-0 = version-1 + diff patch
          - costs disk I/O
        - option B: version-0 = the diff patch only; reads fall through to version-1 ... version-n
          - read logic becomes complex
          - keep a Bloom filter tracking which keys are in each day's patch
          - rollback?
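The Bloom-filter idea behind option B can be sketched as follows; the sizes and the hash scheme are illustrative assumptions, not Voldemort's design:

```java
import java.util.BitSet;

public class PatchBloom {
    private final BitSet bits;
    private final int size, hashes;

    PatchBloom(int size, int hashes) {
        this.size = size;
        this.hashes = hashes;
        this.bits = new BitSet(size);
    }

    // Derive the i-th probe from two base hashes (standard double-hashing trick).
    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(probe(key, i));
    }

    // false means the key is definitely not in this day's patch, so a read can
    // skip straight past this version; true may be a false positive, in which
    // case the read probes the patch and falls through on a miss.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(probe(key, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        PatchBloom patch = new PatchBloom(10_000, 3);
        patch.add("member:123");
        patch.add("member:456");
        System.out.println(patch.mightContain("member:123")); // true
    }
}
```

One small filter per daily patch keeps the multi-version read path cheap: most lookups touch only the versions whose filters answer "maybe".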
  • 22. Project Voldemort - Future
    - Improved key hashing
      - replicating at the file level: replication factor 2
      - replicating at the chunk level: replication factor < 2
  • 23. Project Voldemort - Future
    - Compression
      - requirement: fast decompression speed
      - LZO compression
  • 24. Project Voldemort - Future
    - Better indexing
      - probabilistic binary search
      - 204-way page-aligned tree
      - cache-oblivious algorithms, van Emde Boas tree
      - on-disk hash-based lookup structure
  • 25. Closing
    - Not directly useful for iMobile right now, because we have no such need
    - Some of the practices, such as the deployment considerations, can be borrowed for Search 2.0
    - Hadoop is a good thing and worth keeping an eye on
    - Finding a system's bottleneck is important, even though it is hard
  • 26. About
    - iMobile: http://www.imobile.com.cn
    - Team: http://team.imobile.com.cn
    - Me: http://blog.fulin.org
    - My twitter: http://twitter.com/tangfl
