KFS aka KosmosFS

  Nicolae Florin Petrovici
About me
• C++ programmer, several years of experience
• PHP programmer, several years of experience
• Working at 1&1 since March 2011, in the backup department
• Interested in: Linux, open source, architecture
  design, Nokia Qt, frameworks, Web 2.0
• Personal projects: Web 2.0 search site (will launch
  in Feb. 2012), small contributions to open source
• florin.petrovici@1and1.ro


What's this all about?

• KosmosFS – distributed FS, written in C++
• Also known as CloudStore
• Developed by Kosmix (an American company)
• Open source: http://code.google.com/p/kosmosfs/
• Kosmix was so popular that it was acquired by
  Walmart, the largest retail chain in the world
• Several days of experience with the product

Sneak peek: A large variety of products inside a Walmart store




Fun facts
• KFS is modelled after HDFS (the Hadoop
  Distributed File System)
• C++ clone, version 0.5, unstable
• Bindings for Java, Python
• 4 bugs in the issue tracker, one submitted
  by yours truly
• Doesn't compile with g++ 4.6 (boost regex
  linking error)
• Has some potential
So what's a distributed filesystem?


• As the name says, it is a filesystem distributed across multiple machines
• Allows access to files from multiple hosts
• Usually, has a custom protocol based on TCP/UDP
• Clients do not have direct access to block storage
• It is NOT a shared filesystem (like NFS)




So what's a distributed filesystem?

• Data is usually stored in chunks, across the network on
  multiple nodes
• Chunks are replicated using a replication factor of 2 or 3
  (fault tolerance)
• Metadata (the chunk locations) is maintained by a
  metaserver (see the sketch below)
• Usually has some sort of single point of failure
• It is the next big thing advertised for handling Big Data
• The hype started with the Google File System and MapReduce

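To make the metadata idea concrete, here is a minimal C++ sketch (my
illustration, not KFS code; all names are made up) of the kind of table a
metaserver keeps: chunk ids mapped to the hosts holding replicas, with
under-replicated chunks flagged for re-replication.

  #include <cstddef>
  #include <cstdint>
  #include <map>
  #include <set>
  #include <string>

  struct ChunkInfo {
      std::set<std::string> replicaHosts; // nodes currently holding this chunk
  };

  int main() {
      const std::size_t kReplicationFactor = 3;
      // chunk id -> replica locations: the metaserver's core bookkeeping
      std::map<std::uint64_t, ChunkInfo> chunkTable;

      chunkTable[42].replicaHosts = {"node1", "node2"}; // one replica short

      // Chunks below the replication factor are candidates for re-replication.
      for (const auto &entry : chunkTable) {
          if (entry.second.replicaHosts.size() < kReplicationFactor) {
              // schedule a copy from a live replica to another node
          }
      }
      return 0;
  }
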
Google Filesystem in a pic

[GFS architecture diagram]

Chunks are usually 64 MB

Google Filesystem general overview
•   Cheap commodity hardware
•   Nodes are divided into: Master Node, Chunkservers
•   Chunkservers store the data, already broken up into chunks
•   Each chunk is assigned a unique 64-bit label (logical mapping of files to chunks)
•   Each chunk is replicated several times
•   Master Node: table mapping the 64-bit labels to chunk locations, the locations of the copies of
    chunks, and which processes are reading or writing
•   Metadata is kept in memory and flushed from time to time (checkpoints)
•   Permissions for modifications are handled by a system of time-limited expiring “leases”
•   Lease – a finite period of ownership of that chunk (see the sketch below)
•   The chunk is then propagated to the chunkservers holding the backup copies
•   ACK mechanism => operation atomicity
•   Single Point of Failure: the Master Node
•   Userspace library, no FUSE
•   Clones: HDFS (Java stack), KFS (C++ stack)

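As a rough illustration of those time-limited leases (my sketch, not actual
GFS or KFS code; the names and the 60-second window are invented):

  #include <chrono>
  #include <iostream>
  #include <string>

  using Clock = std::chrono::steady_clock;

  struct Lease {
      std::string holder;        // chunkserver that currently owns the chunk
      Clock::time_point expires; // end of the ownership window

      bool Valid() const { return Clock::now() < expires; }
  };

  // The master grants ownership of a chunk for a fixed time window.
  Lease GrantLease(const std::string &chunkserver) {
      return Lease{chunkserver, Clock::now() + std::chrono::seconds(60)};
  }

  int main() {
      Lease lease = GrantLease("chunkserver-1");
      if (lease.Valid()) {
          std::cout << lease.holder << " may accept writes for this chunk\n";
      } else {
          std::cout << "lease expired; ask the master for a new one\n";
      }
      return 0;
  }
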
What were we talking about? KFS
•   Modelled after GFS
•   MetaServer, ChunkServer, client library
•   ChunkServer stores chunks as files
•   To protect against corruption, checksums are computed on each 64KB block and saved in the
    chunk metadata
•   On reads, checksum verification is done using the saved data
•   Each chunk file is named: [file-id].[chunk-id].version
•   Each chunk has a 16K header that contains the chunk checksum information, updated during
    writes (see the sketch below)


The ChunkServer is only aware of its own chunks.
Upon restart, the metaserver validates the blocks and notifies the chunkserver of any stale
   blocks (not owned by any file in the system), resulting in the deletion of those chunks

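To illustrate the chunk file naming and the per-64KB-block checksums, a short
C++ sketch (only the [file-id].[chunk-id].version pattern and the 64 KB block
size come from the slides; std::hash stands in for whatever checksum KFS
actually uses, and the function names are mine):

  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <sstream>
  #include <string>
  #include <vector>

  const std::size_t kBlockSize = 64 * 1024; // checksum granularity: 64 KB

  // Build the on-disk chunk file name: [file-id].[chunk-id].version
  std::string ChunkFileName(std::uint64_t fileId, std::uint64_t chunkId,
                            int version) {
      std::ostringstream name;
      name << fileId << '.' << chunkId << '.' << version;
      return name.str();
  }

  // One checksum per 64 KB block; on reads, each block is re-hashed and
  // compared against the values saved in the chunk header.
  std::vector<std::size_t> BlockChecksums(const std::string &data) {
      std::vector<std::size_t> sums;
      for (std::size_t off = 0; off < data.size(); off += kBlockSize) {
          sums.push_back(std::hash<std::string>{}(data.substr(off, kBlockSize)));
      }
      return sums;
  }
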
Features: KFS
•   Per-file replication (of course)
•   Re-replication (if possible)
•   Data integrity (checksums)
•   File writes (lazy writing or force-flush)
•   Stale chunk detection
•   Bindings (already mentioned that)
•   FUSE (Filesystem in Userspace)

Advantages over HDFS

● File writing
  HDFS writes to a file once and reads it many times. KFS supports seeking in a file
  and writing multiple times.
● Data visibility
  In HDFS, data written to a file becomes visible to other apps only when the writer
  closes the file; if the process crashes before closing the file, the data
  is lost. KFS exposes data as soon as it gets pushed out to the chunkservers. It
  also has a caching mechanism which can be enabled/disabled on the client side.
● Data rebalancing
  Rudimentary support for automatic rebalancing (the system may migrate
  chunks from over-utilized nodes to under-utilized nodes).

Insert smiley face here




Java people – why should you be interested?
•       KFS can be integrated in the Hadoop chain
•       Instructions here: http://code.google.com/p/kosmosfs/wiki/UsingKFSWithHadoop
Still...
•       Actually, you shouldn't be. The Hadoop stack is much better:
         - Provides all the necessary tools for MapReduce jobs
         - Hadoop also has streaming support => clients can also be written in Python/C
         - Pig, Hive and other analysis frameworks
         - HDFS is widely used in conjunction with HBase
•       But in a few years?
•       Imagine a world with the C++ equivalent stack:
         - KFS – distributed filesystem
         - HyperTable – C++ equivalent of HBase (http://www.hypertable.org/)
         - MapReduce in C++, anyone?
         Major advantages: lower memory footprint, faster loading times, etc.

So it's basically

[photo: Moore]  vs  [photo: Lennon]

Building (for those interested)

• CMake-based build system
• Binaries are created inside a “build”
  directory (typical commands below)
• Option to build the JNI/Python bindings
  and the FUSE filesystem
• Doesn't compile with g++ 4.6 (boost_regex
  linking error)


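For reference, an out-of-source CMake build usually looks like this (my
sketch of the standard CMake workflow; the slides don't spell out the exact
commands, and the targets may differ):

  mkdir build && cd build
  cmake /path/to/kosmosfs
  make
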
Deployment

•   Very much an engineer's approach:
•   You must have passwordless SSH access to all machines
•   A script copies binaries and scripts to all nodes
•   Caveat: doesn't work across different architectures
•   Define a machine configuration file (machines.cfg)
•   Define a machines.txt file (names of the nodes from the machines.cfg file)
•   Deploy: python kfssetup.py -f machines.cfg -m machines.txt -b ../build -w ../webui
•   Start: python kfslaunch.py -f machines.cfg -m machines.txt -s
•   Stop: python kfslaunch.py -f machines.cfg -m machines.txt -S
•   Check status: kfsping -m -s <metaserver host> -p <metaserver port>
•   Uninstall: python kfssetup.py -f machines.cfg -m machines.txt -b ../build/bin -U
•   Also comes with a Python webserver which calls kfsping in the background for
    those interested




Example config file

[metaserver]
node: machine1
clusterkey: kfs-test-cluster
rundir: /mnt/kfs/meta
baseport: 20000
loglevel: INFO
numservers: 2

[chunkserver_defaults]
rundir: /mnt/kfs/chunk
chunkDir: /mnt/kfs/chunk/bin/kfschunk
baseport: 30000
space: 3400 G
loglevel: INFO

Cluster setup

• I proceeded enthusiastically to set up my own KFS cluster out of:
• x86_32 laptop (3 GB RAM) – masterNode
• x86_32 netbook (Atom 1.2 GHz, 1 GB RAM) – chunkServer
• armv7 SheevaPlug (ARM 1.2 GHz, 512 MB RAM) – chunkServer


Slow compile time on the SheevaPlug (over 1 hour)
Communication done via wireless


By the way, does anyone want to buy a netbook?



Mounting via FUSE

• If you want to mount via FUSE:
  kfs_mount <mount_point> -f   (-f forces foreground mode)
• You need to have a kfs.prp file containing:


• metaServer.name = localhost
• metaServer.port = 20000

Unfortunately, the FUSE mount sometimes produces a segfault in
 one of the chunkservers. C'est la vie. I already filed a bug for this.




Accessing via the client library

• Rich, undocumented OOP/OOD API
• Main entry point: KfsClientFactory
Something along the lines of:

client = getKfsClientFactory()->GetClient(kfsPropsFile);
client->Mkdirs(dirname);
fd = client->Create(filename.c_str(), numReplicas);
client->SetIoBufferSize(fd, cliBufSize);
client->Write(fd, buffer, sizeBytes);
client->Close(fd);   // release the fd when done


Other APIs: Enable/Disable Async write, CompareChunkReplicas, GetDirSummary, AtomicRecordAppend


Also available in the Java/Python distribution near you




Support open source
• This project needs your help in order to
  thrive and to provide a real alternative to
  the Java stack
• API is good, bindings are ready
• Still has some bugs to be fixed
• Not production ready, maybe in a few
  years?

