SlideShare a Scribd company logo
1 of 26
What Every Data Programmer  Needs to Know about Disks OSCON Data – July, 2011 - Portland Ted Dziuba @dozba tjdziuba@gmail.com Not proprietary or confidential. In fact, you’re risking a career by listening to me.
Who are you and why are you talking? First job: Like college but they pay you to go. A few years ago: Technical troll for The Register. Recently: Co-founder of Milo.com, local shopping engine. Present: Senior Technical Staff for eBay Local
The Linux Disk Abstraction Volume /mnt/volume File System xfs, ext Block Device HDD, HW RAID array
What happens when you read from a file? f = open(“/home/ted/not_pirated_movie.avi”, “rb”) avi_header = f.read(56) f.close() Disk controller user buffer page cache platter
What happens when you read from a file? Disk controller user buffer page cache ,[object Object]
Latency: 100 nanoseconds
Throughput: 12GB/sec on good hardwareplatter
What happens when you read from a file? Disk controller user buffer page cache ,[object Object]
Latency: 10 milliseconds
Throughput: 768 MB/sec on SATA 3
(Faster if you have a lot of money)platter
Sidebar: The Horror of a 10ms Seek Latency A disk read is 100,000 times slower than a memory read. 100 nanoseconds Time it takes you to write a really clever tweet 10 milliseconds Time it takes to write a novel, working full time
What happens when you write to a file? f = open(“/home/ted/nosql_database.csv”, “wb”) f.write(key) f.write(“,”) f.write(value) f.close() Disk controller user buffer page cache platter
What happens when you write to a file? f = open(“/home/ted/nosql_database.csv”, “wb”) f.write(key) f.write(“,”) f.write(value) f.close() Disk controller user buffer page cache platter You need to make this part happen Mark the page dirty, call it a day and go have a smoke.
Aside: Stick your finger in the Linux Page Cache Pre-Linux 2.6 used “pdflush”, now per-Backing Device Info (BDI) flush threads Dirty pages: grep –i “dirty” /proc/meminfo /proc/sys/vmLove: ,[object Object]
dirty_ratio: flush after some percent of memory is used
dirty_writeback_centisecs: how often to wake up and start flushingClear your page cache: echo 1 > /proc/sys/vm/drop_caches Crusty sysadmin’s hail-Mary pass: sync; sync; sync
Fsync: force a flush to disk f = open(“/home/ted/nosql_database.csv”, “wb”) f.write(key) f.write(“,”) f.write(value) os.fsync(f.fileno()) f.close() Disk controller user buffer page cache platter Also note, fsync() has a cousin, fdatasync() that does not sync metadata.
Aside: point and laugh at MongoDB Mongo’s “fsync” command: > db.runCommand({fsync:1,async:true});  wat. Also supports “journaling”, like a WAL in the SQL world, however… ,[object Object]
It’s not enabled by default.,[object Object]
Fsync: bitter lies platter …it’s a cache! Disk controller page cache ,[object Object]
Writeback is the demon,[object Object]
(Just dropped in) to see what condition your caches are in A Good Server platter Writethrough cache on controller Writethrough cache on disk Disk controller
(Just dropped in) to see what condition your caches are in An Even Better Server platter Battery-backedwriteback cache on controller Writethrough cache on disk Disk controller
(Just dropped in) to see what condition your caches are in The Demon Setup platter Battery-backed writeback cache or Writethrough cache Writeback cache on disk Disk controller
Disks in a virtual environment The Trail of Tears to the Platter Host page cache Virtual controller user buffer page cache Physical controller Hypervisor platter

More Related Content

What's hot

[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano
[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano
[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano
Insight Technology, Inc.
 
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Takahiro Inoue
 

What's hot (20)

TLS, HTTP/2演習
TLS, HTTP/2演習TLS, HTTP/2演習
TLS, HTTP/2演習
 
Apache Pulsarの概要と近況
Apache Pulsarの概要と近況Apache Pulsarの概要と近況
Apache Pulsarの概要と近況
 
[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano
[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano
[B11] 基礎から知るSSD(いまさら聞けないSSDの基本) by Hironobu Asano
 
SpectreBustersあるいはLinuxにおけるSpectre対策
SpectreBustersあるいはLinuxにおけるSpectre対策SpectreBustersあるいはLinuxにおけるSpectre対策
SpectreBustersあるいはLinuxにおけるSpectre対策
 
20111015 勉強会 (PCIe / SR-IOV)
20111015 勉強会 (PCIe / SR-IOV)20111015 勉強会 (PCIe / SR-IOV)
20111015 勉強会 (PCIe / SR-IOV)
 
暗号技術の実装と数学
暗号技術の実装と数学暗号技術の実装と数学
暗号技術の実装と数学
 
Vacuum徹底解説
Vacuum徹底解説Vacuum徹底解説
Vacuum徹底解説
 
Paxos
PaxosPaxos
Paxos
 
メルカリ・ソウゾウでは どうGoを活用しているのか?
メルカリ・ソウゾウでは どうGoを活用しているのか?メルカリ・ソウゾウでは どうGoを活用しているのか?
メルカリ・ソウゾウでは どうGoを活用しているのか?
 
dm-writeboost-kernelvm
dm-writeboost-kernelvmdm-writeboost-kernelvm
dm-writeboost-kernelvm
 
分散システムについて語らせてくれ
分散システムについて語らせてくれ分散システムについて語らせてくれ
分散システムについて語らせてくれ
 
Javaのログ出力: 道具と考え方
Javaのログ出力: 道具と考え方Javaのログ出力: 道具と考え方
Javaのログ出力: 道具と考え方
 
いまさら聞けないselectあれこれ
いまさら聞けないselectあれこれいまさら聞けないselectあれこれ
いまさら聞けないselectあれこれ
 
initramfsについて
initramfsについてinitramfsについて
initramfsについて
 
Zaim 500万ユーザに向けて〜Aurora 編〜
Zaim 500万ユーザに向けて〜Aurora 編〜Zaim 500万ユーザに向けて〜Aurora 編〜
Zaim 500万ユーザに向けて〜Aurora 編〜
 
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
SQLアンチパターン 幻の第26章「とりあえず削除フラグ」
 
Linux女子部 systemd徹底入門
Linux女子部 systemd徹底入門Linux女子部 systemd徹底入門
Linux女子部 systemd徹底入門
 
Go入門
Go入門Go入門
Go入門
 
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
Map Reduce 〜入門編:仕組みの理解とアルゴリズムデザイン〜
 
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
Swagger ではない OpenAPI Specification 3.0 による API サーバー開発
 

Similar to What every data programmer needs to know about disks

Алексей Лесовский "Тюнинг Linux для баз данных. "
Алексей Лесовский "Тюнинг Linux для баз данных. "Алексей Лесовский "Тюнинг Linux для баз данных. "
Алексей Лесовский "Тюнинг Linux для баз данных. "
Tanya Denisyuk
 

Similar to What every data programmer needs to know about disks (20)

Linux Recovery
Linux RecoveryLinux Recovery
Linux Recovery
 
Memory management in Linux
Memory management in LinuxMemory management in Linux
Memory management in Linux
 
Let Me Pick Your Brain - Remote Forensics in Hardened Environments
Let Me Pick Your Brain - Remote Forensics in Hardened EnvironmentsLet Me Pick Your Brain - Remote Forensics in Hardened Environments
Let Me Pick Your Brain - Remote Forensics in Hardened Environments
 
Vmfs
VmfsVmfs
Vmfs
 
Logical volume manager xfs
Logical volume manager xfsLogical volume manager xfs
Logical volume manager xfs
 
Hardware
HardwareHardware
Hardware
 
Data hiding and finding on Linux
Data hiding and finding on LinuxData hiding and finding on Linux
Data hiding and finding on Linux
 
Tuning Linux for Databases.
Tuning Linux for Databases.Tuning Linux for Databases.
Tuning Linux for Databases.
 
Алексей Лесовский "Тюнинг Linux для баз данных. "
Алексей Лесовский "Тюнинг Linux для баз данных. "Алексей Лесовский "Тюнинг Linux для баз данных. "
Алексей Лесовский "Тюнинг Linux для баз данных. "
 
Live memory forensics
Live memory forensicsLive memory forensics
Live memory forensics
 
Clear cache memory
Clear cache memoryClear cache memory
Clear cache memory
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)
 
MySQL Oslayer performace optimization
MySQL  Oslayer performace optimizationMySQL  Oslayer performace optimization
MySQL Oslayer performace optimization
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
FreeBSD Portscamp, Kuala Lumpur 2016
FreeBSD Portscamp, Kuala Lumpur 2016FreeBSD Portscamp, Kuala Lumpur 2016
FreeBSD Portscamp, Kuala Lumpur 2016
 
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
Bringing up Android on your favorite X86 Workstation or VM (AnDevCon Boston, ...
 
Recipe of a linux Live CD (archived)
Recipe of a linux Live CD (archived)Recipe of a linux Live CD (archived)
Recipe of a linux Live CD (archived)
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Linux filesystemhierarchy
Linux filesystemhierarchyLinux filesystemhierarchy
Linux filesystemhierarchy
 
Unix fundamentals
Unix fundamentalsUnix fundamentals
Unix fundamentals
 

More from iammutex

Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011
iammutex
 

More from iammutex (20)

Scaling Instagram
Scaling InstagramScaling Instagram
Scaling Instagram
 
Redis深入浅出
Redis深入浅出Redis深入浅出
Redis深入浅出
 
深入了解Redis
深入了解Redis深入了解Redis
深入了解Redis
 
NoSQL误用和常见陷阱分析
NoSQL误用和常见陷阱分析NoSQL误用和常见陷阱分析
NoSQL误用和常见陷阱分析
 
MongoDB 在盛大大数据量下的应用
MongoDB 在盛大大数据量下的应用MongoDB 在盛大大数据量下的应用
MongoDB 在盛大大数据量下的应用
 
8 minute MongoDB tutorial slide
8 minute MongoDB tutorial slide8 minute MongoDB tutorial slide
8 minute MongoDB tutorial slide
 
skip list
skip listskip list
skip list
 
Thoughts on Transaction and Consistency Models
Thoughts on Transaction and Consistency ModelsThoughts on Transaction and Consistency Models
Thoughts on Transaction and Consistency Models
 
Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告Rethink db&tokudb调研测试报告
Rethink db&tokudb调研测试报告
 
redis 适用场景与实现
redis 适用场景与实现redis 适用场景与实现
redis 适用场景与实现
 
Introduction to couchdb
Introduction to couchdbIntroduction to couchdb
Introduction to couchdb
 
Ooredis
OoredisOoredis
Ooredis
 
Ooredis
OoredisOoredis
Ooredis
 
redis运维之道
redis运维之道redis运维之道
redis运维之道
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011
 
[译]No sql生态系统
[译]No sql生态系统[译]No sql生态系统
[译]No sql生态系统
 
Couchdb + Membase = Couchbase
Couchdb + Membase = CouchbaseCouchdb + Membase = Couchbase
Couchdb + Membase = Couchbase
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 
Hadoop introduction berlin buzzwords 2011
Hadoop introduction   berlin buzzwords 2011Hadoop introduction   berlin buzzwords 2011
Hadoop introduction berlin buzzwords 2011
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 

What every data programmer needs to know about disks

  • 1. What Every Data Programmer Needs to Know about Disks OSCON Data – July, 2011 - Portland Ted Dziuba @dozba tjdziuba@gmail.com Not proprietary or confidential. In fact, you’re risking a career by listening to me.
  • 2. Who are you and why are you talking? First job: Like college but they pay you to go. A few years ago: Technical troll for The Register. Recently: Co-founder of Milo.com, local shopping engine. Present: Senior Technical Staff for eBay Local
  • 3. The Linux Disk Abstraction Volume /mnt/volume File System xfs, ext Block Device HDD, HW RAID array
  • 4. What happens when you read from a file? f = open(“/home/ted/not_pirated_movie.avi”, “rb”) avi_header = f.read(56) f.close() Disk controller user buffer page cache platter
  • 5.
  • 7. Throughput: 12GB/sec on good hardwareplatter
  • 8.
  • 11. (Faster if you have a lot of money)platter
  • 12. Sidebar: The Horror of a 10ms Seek Latency A disk read is 100,000 times slower than a memory read. 100 nanoseconds Time it takes you to write a really clever tweet 10 milliseconds Time it takes to write a novel, working full time
  • 13. What happens when you write to a file? f = open(“/home/ted/nosql_database.csv”, “wb”) f.write(key) f.write(“,”) f.write(value) f.close() Disk controller user buffer page cache platter
  • 14. What happens when you write to a file? f = open(“/home/ted/nosql_database.csv”, “wb”) f.write(key) f.write(“,”) f.write(value) f.close() Disk controller user buffer page cache platter You need to make this part happen Mark the page dirty, call it a day and go have a smoke.
  • 15.
  • 16. dirty_ratio: flush after some percent of memory is used
  • 17. dirty_writeback_centisecs: how often to wake up and start flushingClear your page cache: echo 1 > /proc/sys/vm/drop_caches Crusty sysadmin’s hail-Mary pass: sync; sync; sync
  • 18. Fsync: force a flush to disk f = open(“/home/ted/nosql_database.csv”, “wb”) f.write(key) f.write(“,”) f.write(value) os.fsync(f.fileno()) f.close() Disk controller user buffer page cache platter Also note, fsync() has a cousin, fdatasync() that does not sync metadata.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. (Just dropped in) to see what condition your caches are in A Good Server platter Writethrough cache on controller Writethrough cache on disk Disk controller
  • 24. (Just dropped in) to see what condition your caches are in An Even Better Server platter Battery-backedwriteback cache on controller Writethrough cache on disk Disk controller
  • 25. (Just dropped in) to see what condition your caches are in The Demon Setup platter Battery-backed writeback cache or Writethrough cache Writeback cache on disk Disk controller
  • 26. Disks in a virtual environment The Trail of Tears to the Platter Host page cache Virtual controller user buffer page cache Physical controller Hypervisor platter
  • 27.
  • 30. How are the caches configured?
  • 31. How big are the caches?
  • 35. Aside: Amazon EBS MySQL Amazon EBS Please stop doing this.
  • 36. What’s Killing That Box? ted@u235:~$ iostat -x Linux 2.6.32-24-generic (u235) 07/25/2011 _x86_64_ (8 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 0.15 0.14 0.05 0.00 0.00 99.66 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz%util sda 0.00 3.27 0.01 2.38 0.58 45.23 19.21 0.24
  • 37.
  • 38. Negligible seek time vs 10ms seek time
  • 39.
  • 44.
  • 48.

Editor's Notes

  1. Note that the page is actually in memory twice. Mmaped files fix this, but it’s beyond the scope of this discussion.Also this is why read performance on a lot of memory only NoSQL databases beats disk-backed SQL. Duh.
  2. Equate 100 nanoseconds to about 100 seconds. Then 10 milliseconds is about 3 months.
  3. This is where a lot of NoSQL databases get their performance, but more on that in a few minutes.
  4. There are threads that wake up every now and then to flush pages to disk.
  5. Fsync blocks until the data has been written to disk.
  6. With a battery-backed RAID controller, fsync can return very quickly with little risk of data loss.
  7. You need to dive into your vendor’s control tool to find this out.
  8. VMWare server is faithful to fsync, VMWare workstation is not. Xen usually queues I/O requests after they have been issued. The point is that you have no way of knowing. Your visibility of what happens to your data after you write or fsync ends at the hypervisor.
  9. Newer intel chips have the northbridge controller on-die. Southbridge bandwidth is usually <= 10GB/sec, and you are sharing this with other customers’ network and disk I/O. That, and you may be sharing drive spindles.
  10. EBS lies about the result of fsync. This is why Reddit is down all the time. You have been warned.
  11. EBS lies about the result of fsync. This is why Reddit is down all the time. You have been warned.