© Hortonworks Inc. 2011 - 2015
Democratizing Memory Storage
Arpit Agarwal
arp@apache.org
@aagarw
HDFS Heterogeneous Storage Media
Architecting the Future of Big Data
HDFS Heterogeneous Storage (Continued)
• Introduced in Apache Hadoop 2.3
• Memory introduced as a storage medium
–RAM Disk provides retention across process restarts
• Memory is treated differently due to its transient nature
–More on this later
HDFS Heterogeneous Storage (Continued)
• Rich storage media policies introduced in Hadoop 2.6
• Applications can target different storage media
• Set policy of individual file or directory sub-tree
–setStoragePolicy API
HDFS Heterogeneous Storage (Continued)
• Example built-in policies
– DEFAULT – All replicas on DISK
– ONESSD – One replica on SSD, rest on DISK
– ALLSSD – All replicas on SSD
– COLD – All replicas on Archival Storage
– LAZY_PERSIST – 1 replica in local memory, lazy write to disk
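As a rough illustration of the policy list above, the following Java sketch models each policy as one medium for the first replica and one for the remaining replicas. It is a toy model, not HDFS code: the class StoragePolicySketch, the Medium enum, and the placement method are hypothetical names, and putting the remaining LAZY_PERSIST replicas on DISK is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the built-in policies listed above; not the HDFS implementation.
class StoragePolicySketch {
    enum Medium { MEMORY, SSD, DISK, ARCHIVE }

    enum Policy {
        DEFAULT(Medium.DISK, Medium.DISK),
        ONESSD(Medium.SSD, Medium.DISK),
        ALLSSD(Medium.SSD, Medium.SSD),
        COLD(Medium.ARCHIVE, Medium.ARCHIVE),
        // Assumption: replicas beyond the first go to DISK.
        LAZY_PERSIST(Medium.MEMORY, Medium.DISK);

        final Medium first; // medium for the first replica
        final Medium rest;  // medium for every remaining replica

        Policy(Medium first, Medium rest) {
            this.first = first;
            this.rest = rest;
        }

        // Returns the medium chosen for each of `replication` replicas.
        List<Medium> placement(int replication) {
            List<Medium> out = new ArrayList<>();
            for (int i = 0; i < replication; i++) {
                out.add(i == 0 ? first : rest);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + p.placement(3));
        }
    }
}
```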
• Why not rely on the OS page cache?
• Scan workloads invalidate the page cache
–HDFS uses buffered IO for reads and writes
• Control the eviction scheme
• Permit further optimizations
–Checksum computation off the hot path
–Collocate data and computation
Centralized Cache Management (CCM)
• Introduced in Hadoop 2.3
• Pin hot data to memory
CCM (Continued)
• Administrator configures cache pools
• User issues commands to manage the contents of pools
• Users specify which files or directories are hot
–HDFS loads file contents into memory
CCM (Continued)
• Eliminate checksum computations during read
–Checksums used to flag disk and network errors
–HDFS will pre-verify checksums when caching data from disk
• DataNode and the HDFS client use shared memory segments to communicate which blocks are shared
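The idea of verifying checksums once, at cache-load time, so that reads from the cache can skip them, can be sketched in plain Java. This is only an illustration under stated assumptions: PreVerifiedCacheSketch, cacheBlock, and read are hypothetical names, and CRC32 from the JDK stands in for whatever checksum HDFS is configured to use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Illustrative only: checksum is verified when a block enters the cache,
// so the read path does no checksum work.
class PreVerifiedCacheSketch {
    private final Map<String, byte[]> cache = new HashMap<>();

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    // Cache-load path: verify once, refuse to cache corrupt data.
    boolean cacheBlock(String blockId, byte[] data, long expectedCrc) {
        if (crc(data) != expectedCrc) {
            return false;
        }
        cache.put(blockId, data.clone());
        return true;
    }

    // Hot path: no checksum computation. Returns null if the block is not cached.
    byte[] read(String blockId) {
        return cache.get(blockId);
    }

    public static void main(String[] args) {
        byte[] block = "block-0".getBytes();
        PreVerifiedCacheSketch c = new PreVerifiedCacheSketch();
        System.out.println("cached: " + c.cacheBlock("b0", block, crc(block)));
        System.out.println("corrupt accepted: " + c.cacheBlock("b1", block, crc(block) + 1));
    }
}
```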
CCM (Continued)
• Enables short-circuit and zero-copy reads from memory to avoid RPC overhead
• Short-circuit reads are transparent to applications
• Zero-copy read API
–ByteBuffer read(ByteBufferPool factory, int maxLength, EnumSet<ReadOption> opts);
–void releaseBuffer(ByteBuffer buffer);
• E.g. Apache Hive uses ZCR for ORC files
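The acquire-then-release calling pattern behind the two signatures above can be shown with a self-contained Java model. It is a simplified stand-in, not the Hadoop API: ZeroCopySketch is a hypothetical class, its read method drops the ByteBufferPool and ReadOption parameters, and a wrapped byte array plays the role of a block cached in memory.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Simplified model of a zero-copy read: callers get a view of the cached
// bytes instead of a copy, and must hand the buffer back when done.
class ZeroCopySketch {
    private final ByteBuffer cached; // stands in for a block pinned in memory
    private int outstanding = 0;     // buffers handed out and not yet released

    ZeroCopySketch(byte[] data) {
        this.cached = ByteBuffer.wrap(data);
    }

    // Returns a read-only view of up to maxLength bytes, or null at end of data.
    ByteBuffer read(int maxLength) {
        if (!cached.hasRemaining()) {
            return null;
        }
        int n = Math.min(maxLength, cached.remaining());
        ByteBuffer view = cached.slice(); // shares the backing memory, no copy
        view.limit(n);
        cached.position(cached.position() + n);
        outstanding++;
        return view.asReadOnlyBuffer();
    }

    // Callers release every buffer obtained from read().
    void releaseBuffer(ByteBuffer buffer) {
        outstanding--;
    }

    int outstanding() {
        return outstanding;
    }

    public static void main(String[] args) {
        ZeroCopySketch in = new ZeroCopySketch("hello world".getBytes(StandardCharsets.US_ASCII));
        ByteBuffer buf;
        while ((buf = in.read(5)) != null) {
            try {
                System.out.println(StandardCharsets.US_ASCII.decode(buf));
            } finally {
                in.releaseBuffer(buf); // always release, even on failure
            }
        }
    }
}
```

Releasing in a finally block mirrors how a caller would be expected to pair each read with a releaseBuffer call.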
HDFS Lazy Persist Writes
• HDFS feature introduced in Apache Hadoop 2.6
• Exposed via Storage Policies
–Set the LAZY_PERSIST policy on a file or directory
HDFS Lazy Persist Writes (Continued)
• Applications can write to files in memory
• HDFS will write the data to persistent storage off the hot path
–Retain memory latency
• Expected to be used with single replica writes
–Latency benefits negated by pipeline replication over the network
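A minimal Java sketch of the write path described above: the write returns once the data is in memory, and a background thread persists it later. This is a conceptual model only. LazyPersistSketch and its methods are hypothetical names, and two in-process maps stand in for RAM disk and persistent storage.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Conceptual model of lazy persistence; not HDFS code.
class LazyPersistSketch {
    final Map<String, byte[]> memory = new ConcurrentHashMap<>();
    final Map<String, byte[]> disk = new ConcurrentHashMap<>(); // stands in for persistent storage
    private final ExecutorService persister = Executors.newSingleThreadExecutor();

    // Returns as soon as the data is in memory; persistence is queued off the hot path.
    void write(String path, byte[] data) {
        byte[] copy = data.clone();
        memory.put(path, copy);
        persister.execute(() -> disk.put(path, copy));
    }

    // Waits for queued persistence work; returns true if it all finished in time.
    boolean close(long timeoutMillis) {
        persister.shutdown();
        try {
            return persister.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        LazyPersistSketch fs = new LazyPersistSketch();
        fs.write("/tmp/scratch", "intermediate data".getBytes());
        System.out.println("in memory: " + fs.memory.containsKey("/tmp/scratch"));
        System.out.println("flushed: " + fs.close(5000));
        System.out.println("on disk: " + fs.disk.containsKey("/tmp/scratch"));
    }
}
```

In this model, anything still queued when the process dies is lost, which is the best-effort trade-off a caller accepts in exchange for memory latency.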
HDFS Lazy Persist Writes (Continued)
• Best-effort persistence with retention across process restarts
• Data loss rare but possible – node restart, network partition
–Recovery pushed to compute framework layers
• Adoption by Apache projects
–Hive in-memory tables
–Low latency persistence for Spark RDDs
Areas of Improvement
• Cache data on read as opposed to pinning on demand
• Short-circuit writes
–Eliminate Hadoop RPC overhead for writes
• Isolate applications from HDFS APIs
Areas of Improvement
• Challenging to fix computation frameworks to use memory storage
• Address use cases beyond intermediate data
–When to cache?
–Frameworks do not know
• The application context knows or the user knows
• Let the user decide
–E.g. jobfoo input=memfs://… tmp=memfs://… output=hdfs://…
Memfs – A Layered File System
• Planned for Apache Hadoop 2.9
• A thin HCFS that can layer over any other HCFS
• Transparently uses HDFS memory features when available
• HDFS has used layered FS approach before
–ViewFS, ChecksumFS
• Memfs paths correspond to underlying FS paths 1:1
–E.g. memfs://results.txt maps to hdfs://results.txt
• Reading a file via Memfs loads it into DataNode RAM
• Writing a file via Memfs transparently uses the Lazy Persist Storage Policy for low latency writes
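The 1:1 path correspondence above amounts to swapping the URI scheme and leaving everything else untouched. The sketch below shows that with java.net.URI. MemfsPathSketch and toBase are hypothetical names, the host nn1:8020 is a made-up example, and nothing here reflects how Memfs itself is implemented.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative scheme swap: memfs://X becomes <baseScheme>://X.
class MemfsPathSketch {
    static URI toBase(URI memfsUri, String baseScheme) {
        if (!"memfs".equals(memfsUri.getScheme())) {
            throw new IllegalArgumentException("not a memfs URI: " + memfsUri);
        }
        try {
            // Authority, path, query and fragment are carried over unchanged.
            return new URI(baseScheme, memfsUri.getAuthority(), memfsUri.getPath(),
                    memfsUri.getQuery(), memfsUri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // nn1:8020 is a hypothetical NameNode address.
        URI in = URI.create("memfs://nn1:8020/data/results.txt");
        System.out.println(in + " maps to " + toBase(in, "hdfs"));
    }
}
```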
Memfs Benefits
• Beyond the typical use case of intermediate data
• Isolate applications from HDFS APIs
–Let us evolve HDFS support over time
• Lightweight - no state maintained outside of the base FS
Memfs Benefits (Continued)
• All IO is channeled through the base FS in the user’s security context
• Behavior can be controlled by configuration
–E.g. Administrator configures separate cache pools for Memfs
–Move the pool selection logic to Memfs
• Future Memfs implementations using other base HCFS are possible
–May not be as lightweight
Spark RDD
• Spark Resilient Distributed Datasets
• Lineage Information for Fault Tolerance is recorded with the RDD
–Lost data recomputed via Lineage
• HDFS Lazy Persist writes can complement Spark RDD as a low latency backing store (SPARK-6479)
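Recomputing lost data from lineage can be illustrated with a small Java model. This is not Spark code: LineageSketch, source, map, get, and lose are hypothetical names, and a Supplier that rebuilds the data from its parent stands in for recorded lineage.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Toy dataset that remembers how it was derived and rebuilds itself on loss.
class LineageSketch<T> {
    private final Supplier<List<T>> lineage; // how to rebuild this dataset
    private List<T> data;                    // materialized copy; null once lost
    int recomputations = 0;

    private LineageSketch(Supplier<List<T>> lineage) {
        this.lineage = lineage;
    }

    static <T> LineageSketch<T> source(List<T> input) {
        return new LineageSketch<>(() -> input);
    }

    // The child records its parent and the function applied to it.
    <R> LineageSketch<R> map(Function<T, R> f) {
        return new LineageSketch<>(() -> get().stream().map(f).collect(Collectors.toList()));
    }

    // Materializes from lineage if the data is missing.
    List<T> get() {
        if (data == null) {
            data = lineage.get();
            recomputations++;
        }
        return data;
    }

    // Simulates losing the in-memory copy.
    void lose() {
        data = null;
    }

    public static void main(String[] args) {
        LineageSketch<Integer> doubled = source(Arrays.asList(1, 2, 3)).map(x -> x * 2);
        System.out.println(doubled.get());
        doubled.lose();
        System.out.println(doubled.get() + " after " + doubled.recomputations + " computations");
    }
}
```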
Tachyon
• Tachyon is also a layered file system
–Powerful idea
• Works best when data is guaranteed to fit in memory
• Introduces the concept of Lineage
–Optional but required for persistence and recovery
–Memfs is designed to rely on recovery built into the framework layers in case of rare failures
Credits
• Heterogeneous Storage Media
– Tsz Wo (Nicholas) Sze, Hortonworks (szetszwo@apache.org)
– Sanjay Radia, Hortonworks (sradia@apache.org)
– Suresh Srinivas, Hortonworks (suresh@apache.org)
– Junping Du, Hortonworks (junping_du@apache.org)
• Rich Storage Policies
– Jing Zhao, Hortonworks (jing9@apache.org)
– Tsz Wo (Nicholas) Sze, Hortonworks (szetszwo@apache.org)
• CCM
– Andrew Wang, Cloudera (wang@apache.org)
– Colin McCabe, Cloudera (cmccabe@apache.org)
– Chris Nauroth, Hortonworks (cnauroth@apache.org)
• Lazy Persist Writes
– Jitendra Pandey, Hortonworks (jitendra@apache.org)
– Sanjay Radia, Hortonworks (sradia@apache.org)
– Xiaoyu Yao, Hortonworks (xyao@apache.org)
– Gopal V, Hortonworks (gopalv@apache.org)
Slides URL
• http://s.apache.org/mem-2015
Apache Hadoop File Systems primer (Bonus)
• FileSystem interface captures common FS operations
• Any conforming implementation is a Hadoop Compatible File System (HCFS)
• HDFS is the canonical Hadoop FS
• Ships with Apache Hadoop and implements the complete set of features exposed by the FileSystem interface, e.g.
–Snapshots
–Heterogeneous Storage Media
–Extended Attributes
–POSIX ACLs
• Supports Kerberos Authentication in Secure Mode

08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 

Recently uploaded (20)

Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 

Democratizing Memory Storage

  • 1. © Hortonworks Inc. 2011 - 2015 Democratizing Memory Storage Arpit Agarwal arp@apache.org @aagarw Page 1
  • 2. HDFS Heterogeneous Storage Media (Architecting the Future of Big Data)
  • 3. Heterogeneous Storage (continued)
      • Introduced in Apache Hadoop 2.3
      • Memory introduced as a storage medium
        – RAM Disk provides retention across process restarts
      • Memory is treated differently due to its transient nature
        – More on this later
  • 4. HDFS Heterogeneous Storage (Continued)
      • Rich storage media policies introduced in Hadoop 2.6
      • Applications can target different storage media
      • Set policy of individual file or directory sub-tree
        – setStoragePolicy API
  • 5. HDFS Heterogeneous Storage (Continued)
      • Example built-in policies
        – DEFAULT: All replicas on DISK
        – ONESSD: One replica on SSD, rest on DISK
        – ALLSSD: All replicas on SSD
        – COLD: All replicas on Archival Storage
        – LAZY_PERSIST: 1 replica in local memory, lazy write to disk
  • 6. Why not rely on the OS page cache?
  • 7. (Why not rely on the OS page cache, continued)
      • Scan workloads invalidate the page cache
        – HDFS uses buffered IO for reads and writes
      • Control the eviction scheme
      • Permit further optimizations
        – Checksum computation off the hot path
        – Collocate data and computation
  • 8. Centralized Cache Management (CCM)
      • Introduced in Hadoop 2.3
      • Pin hot data to memory
  • 9. CCM (Continued)
      • Administrator configures cache pools
      • User issues commands to manage the contents of pools
      • Users specify which files or directories are hot
        – HDFS loads file contents into memory
  • 10. CCM (Continued)
  • 11. CCM (Continued)
      • Eliminate checksum computations during read
        – Checksums used to flag disk and network errors
        – HDFS will pre-verify checksums when caching data from disk
      • Data Node and the HDFS client use shared memory segments to communicate which blocks are shared
  • 12. CCM (Continued)
      • Enables short-circuit and zero-copy reads from memory to avoid RPC overhead
      • Short-circuit reads are transparent to applications
      • Zero-copy read API
        – ByteBuffer read(ByteBufferPool factory, int maxLength, EnumSet<ReadOption> opts);
        – void releaseBuffer(ByteBuffer buffer);
      • E.g. Apache Hive uses ZCR for ORC files
  • 13. HDFS Lazy Persist Writes
      • HDFS feature introduced in Apache Hadoop 2.6
      • Exposed via Storage Policies
        – Set the LAZY_PERSIST policy on a file or directory
  • 14. HDFS Lazy Persist Writes (continued)
      • Applications can write to files in memory
      • HDFS will write the data to persistent storage off the hot path
        – Retain memory latency
      • Expected to be used with single replica writes
        – Latency benefits negated by pipeline replication over the network
  • 15. HDFS Lazy Persist Writes (Continued)
  • 16. HDFS Lazy Persist Writes (Continued)
      • Best-effort persistence with retention across process restarts
      • Data loss rare but possible: node restart, network partition
        – Recovery pushed to compute framework layers
      • Adoption by Apache projects
        – Hive in-memory tables
        – Low latency persistence for Spark RDDs
  • 17. Areas of Improvement
      • Cache data on Read as opposed to pinning on demand
      • Short-circuit writes
        – Eliminate Hadoop RPC overhead for writes
      • Isolate applications from HDFS APIs
  • 18. Areas of Improvement
      • Challenging to fix computation frameworks to use memory storage
      • Address use cases beyond intermediate data
        – When to cache?
        – Frameworks do not know
      • The application context knows or the user knows
      • Let the user decide
        – E.g. jobfoo input=memfs://… tmp=memfs://… output=hdfs://…
  • 19. Memfs – A Layered File System
      • Planned for Apache Hadoop 2.9
      • A thin HCFS that can layer over any other HCFS
      • Transparently uses HDFS memory features when available
      • HDFS has used layered FS approach before
        – ViewFS, ChecksumFS
  • 20. (Memfs, continued)
      • Memfs paths correspond to underlying FS paths 1:1
        – E.g. memfs://results.txt corresponds to hdfs://results.txt
      • Reading a file via Memfs loads it into DataNode RAM
      • Writing a file via Memfs transparently uses the Lazy Persist Storage Policy for low latency writes
  • 21. Memfs Benefits
      • Beyond the typical use case of intermediate data
      • Isolate applications from HDFS APIs
        – Let us evolve HDFS support over time
      • Lightweight: no state maintained outside of the base FS
  • 22. Memfs Benefits (Continued)
      • All IO is channeled through the base FS in the user’s security context
      • Behavior can be controlled by configuration
        – E.g. Administrator configures separate cache pools for Memfs
        – Move the pool selection logic to Memfs
      • Future Memfs implementations using other base HCFS are possible
        – May not be as lightweight
  • 23. Spark RDD
      • Spark Resilient Distributed Datasets
      • Lineage information for fault tolerance is recorded with the RDD
        – Lost data recomputed via lineage
      • HDFS Lazy Persist writes can complement Spark RDD as a low latency backing store (SPARK-6479)
  • 24. Tachyon
      • Tachyon is also a layered file system
        – Powerful idea
      • Works best when data is guaranteed to fit in memory
      • Introduces the concept of Lineage
        – Optional but required for persistence and recovery
        – Memfs is designed to use the recovery built into framework layers in case of rare failures
  • 25. Credits
      • Heterogeneous Storage Media
        – Tsz Wo (Nicholas) Sze, Hortonworks (szetszwo@apache.org)
        – Sanjay Radia, Hortonworks (sradia@apache.org)
        – Suresh Srinivas, Hortonworks (suresh@apache.org)
        – Junping Du, Hortonworks (junping_du@apache.org)
      • Rich Storage Policies
        – Jing Zhao, Hortonworks (jing9@apache.org)
        – Tsz Wo (Nicholas) Sze, Hortonworks (szetszwo@apache.org)
      • CCM
        – Andrew Wang, Cloudera (wang@apache.org)
        – Colin McCabe, Cloudera (cmccabe@apache.org)
        – Chris Nauroth, Hortonworks (cnauroth@apache.org)
      • Lazy Persist Writes
        – Jitendra Pandey, Hortonworks (jitendra@apache.org)
        – Sanjay Radia, Hortonworks (sradia@apache.org)
        – Xiaoyu Yao, Hortonworks (xyao@apache.org)
        – Gopal V, Hortonworks (gopalv@apache.org)
  • 26. Slides URL
      • http://s.apache.org/mem-2015
  • 28. Apache Hadoop File Systems primer (Bonus)
      • FileSystem interface captures common FS operations
      • Any conforming implementation is a Hadoop Compatible File System (HCFS)
  • 29. (File Systems primer, continued)
      • HDFS is the canonical Hadoop FS
      • Ships with Apache Hadoop and implements the complete set of features exposed by the FileSystem interface, e.g.
        – Snapshots
        – Heterogeneous Storage Media
        – Extended Attributes
        – Posix ACLs
      • Supports Kerberos Authentication in Secure Mode
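
The built-in policies listed on slide 5 can be read as a rule that assigns each replica of a block to a storage medium. The sketch below is a toy model of that rule in plain Java, using the policy names as they appear on the slide. The class, enum, and method names (`StoragePolicyModel`, `Medium`, `Policy`, `place`) are hypothetical and chosen for illustration only; this is not how HDFS implements storage policies, and the lazy write to disk behind LAZY_PERSIST is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

public class StoragePolicyModel {

    // Storage media named on the slides (hypothetical enum, not an HDFS type).
    enum Medium { RAM_DISK, SSD, DISK, ARCHIVE }

    // Each policy is modeled as "medium for the first replica, medium for the rest".
    enum Policy {
        DEFAULT(Medium.DISK, Medium.DISK),
        ONESSD(Medium.SSD, Medium.DISK),
        ALLSSD(Medium.SSD, Medium.SSD),
        COLD(Medium.ARCHIVE, Medium.ARCHIVE),
        // One replica in memory; the slides expect single-replica writes here.
        LAZY_PERSIST(Medium.RAM_DISK, Medium.DISK);

        private final Medium first;
        private final Medium rest;

        Policy(Medium first, Medium rest) {
            this.first = first;
            this.rest = rest;
        }

        // Returns the medium chosen for each of `replication` replicas.
        List<Medium> place(int replication) {
            List<Medium> placement = new ArrayList<>();
            for (int i = 0; i < replication; i++) {
                placement.add(i == 0 ? first : rest);
            }
            return placement;
        }
    }

    public static void main(String[] args) {
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> " + policy.place(3));
        }
    }
}
```

On a real cluster the policy would be attached to a file or directory sub-tree through the setStoragePolicy API named on slide 4 rather than computed by application code.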

Editor's Notes

  1. Storage Policies can be set by unprivileged users. HDFS also supports quotas on storage media, which are set by the administrator.
  2. Memory-mapped files are another option. They work well for reads but do not work well with the existing HDFS write pipeline.
  3. Cache pools are analogous to HDFS Quotas, but not quite the same. Cache pools allow administrators to control which users can use memory resources.
  4. These two problems are relatively easy to solve.
  5. We don’t want to indiscriminately target all input or output data to memory. Frameworks lack application context, such as which data will be accessed often or the expected output size of a given job. Let’s say we have a hypothetical file system called memfs which performs caching IO on both the read and write paths.
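
The 1:1 correspondence between Memfs paths and base file system paths described on slide 20 amounts to swapping the URI scheme and leaving everything else untouched. The following is a minimal sketch of that mapping using only `java.net.URI`. Memfs is described in the slides as planned work, so the class and method names here (`MemfsPathSketch`, `toBasePath`) are hypothetical and do not correspond to any shipped Hadoop API.

```java
import java.net.URI;

public class MemfsPathSketch {

    // Scheme used in the slides' examples.
    static final String MEMFS_SCHEME = "memfs";

    // Maps a memfs:// URI to the same location on the base file system
    // by replacing only the scheme (assumed behavior, per the 1:1 rule).
    static URI toBasePath(URI memfsPath, String baseScheme) {
        if (!MEMFS_SCHEME.equals(memfsPath.getScheme())) {
            throw new IllegalArgumentException("Not a memfs path: " + memfsPath);
        }
        return URI.create(baseScheme + ":" + memfsPath.getRawSchemeSpecificPart());
    }

    public static void main(String[] args) {
        URI memfs = URI.create("memfs://results.txt");
        System.out.println(memfs + " -> " + toBasePath(memfs, "hdfs"));
    }
}
```

Because the mapping is a pure function of the path, a layer built this way would need no state of its own, which is consistent with the "lightweight" benefit claimed on slide 21.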