SlideShare a Scribd company logo
1 of 34
Download to read offline
Introduction to
Tokyo Products

          Mikio Hirabayashi
              <hirarin@gmail.com>
Tokyo Products
• Tokyo Cabinet
  – database library
• Tokyo Tyrant                                    applications
                                                                               Prome
  – database server              custom storage    Tyrant           Dystopia
                                                                                nade


• Tokyo Dystopia                                        Cabinet
  – full-text search engine
                                                      file system
• Tokyo Promenade
  – content management system


• open source
  – released under LGPL
• powerful, portable, practical
  – written in the standard C, optimized to POSIX
Tokyo Cabinet
 - database library -
Features
• modern implementation of DBM
 – key/value database
   • e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB
 – simple library = process embedded
 – Successor of QDBM
   • C99 and POSIX compatible, using Pthread, mmap, etc...
   • Win32 porting is work-in-progress

• high performance
 – insert: 0.4 sec/1M records (2,500,000 qps)
 – search: 0.33 sec/1M records (3,000,000 qps)
• high concurrency
  – multi-thread safe
  – read/write locking by records
• high scalability
  – hash and B+tree structure = O(1) and O(log N)
  – no actual limit size of a database file (to 8 exabytes)
• transaction
  – write ahead logging and shadow paging
  – ACID properties
• various APIs
  – on-memory list/hash/tree
  – file hash/B+tree/array/table
• script language bindings
  – Perl, Ruby, Java, Lua, Python, PHP, Haskell, Erlang, etc...
TCHDB: Hash Database
• static hashing                  bucket array

  – O(1) time complexity

• separate chaining                key           value

  – binary search tree             key           value

  – balances by the second hash    key           value

• free block pool                  key           value

                                   key           value
  – best fit allocation
  – dynamic defragmentation        key           value


• combines mmap and                key

                                   key
                                                 value

                                                 value
  pwrite/pread                     key           value
  – saves calling system calls
                                   key           value
• compression                      key           value
  – deflate(gzip)/bzip2/custom
TCBDB: B+ Tree Database
• B+ tree                                     key   value
  – O(log N) time complexity   B tree index   key   value

• page caching                                key
                                              key
                                                    value
                                                    value
  – LRU removing
                                              key   value
  – speculative search
                                              key   value
• stands on hash DB                           key   value
  – records pages in hash DB                  key   value
  – succeeds time and space                   key   value
    efficiency                                key   value
• custom comparison                           key   value

  function                                    key   value

  – prefix/range matching                     key   value
                                              key   value
• cursor                                      key   value
  – jump/next/prev                            key   value
TCFDB: Fixed-length Database
• array of fixed-
  length elements          array
  – O(1) time complexity     value   value   value   value
  – natural number keys      value   value   value   value
  – addresses records by     value   value   value   value
    multiple of key          value   value   value   value

• most effective             value

                             value
                                     value

                                     value
                                             value

                                             value
                                                     value

                                                     value
  – bulk load by mmap        value   value   value   value
  – no key storage per       value   value   value   value
    record                   value   value   value   value
  – extremely fast and       value   value   value   value
    concurrent
TCTDB: Table Database
• column based
  – the primary key and named
    columns
  – stands on hash DB                                                    bucket array

• flexible structure                   primary key   name value name value name value
  – no data scheme, no data type
  – various structure for each
    record                             primary key   name value name value name value


• query mechanism                      primary key   name value name value name value
  – various operators matching
    column values                      primary key   name value name value name value
  – lexical/decimal orders by column
    values                             primary key   name value name value name value

• column indexes
  –   implemented with B+ tree         primary key   name value name value name value

  –   typed as string/number
  –   inverted index of token/q-gram                            column index
  –   query optimizer
On-memory Structures
• TCXSTR: extensible string
  – concatenation, formatted allocation
• TCLIST: array list (dequeue)
  – random access by index
  – push/pop, unshift/shift, insert/remove
• TCMAP: map of hash table
  – insert/remove/search
  – iterator by order of insertion
• TCTREE: map of ordered tree
  – insert/remove/search
  – iterator by order of comparison function
Other Mechanisms
• abstract database
  – common interface of 6 schema
     • on-memory hash, on-memory tree
     • file hash, file B+tree, file array, file table
  – decides the concrete scheme in runtime

• remote database
  – network interface of the abstract database
  – yes, it's Tokyo Tyrant!

• miscellaneous utilities
  – string processing, filesystem operation
  – memory pool, encoding/decoding
Example Code
#include   <tcutil.h>
#include   <tchdb.h>
#include   <stdlib.h>
#include   <stdbool.h>
#include   <stdint.h>

int main(int argc, char **argv){

  TCHDB *hdb;
  int ecode;
  char *key, *value;

  /* create the object */
  hdb = tchdbnew();

  /* open the database */
  if(!tchdbopen(hdb, "casket.hdb", HDBOWRITER | HDBOCREAT)){
    ecode = tchdbecode(hdb);                                       /* traverse records */
    fprintf(stderr, "open error: %s¥n", tchdberrmsg(ecode));       tchdbiterinit(hdb);
  }                                                                while((key = tchdbiternext2(hdb)) != NULL){
                                                                     value = tchdbget2(hdb, key);
  /* store records */                                                if(value){
  if(!tchdbput2(hdb, "foo", "hop") ||                                  printf("%s:%s¥n", key, value);
     !tchdbput2(hdb, "bar", "step") ||                                 free(value);
     !tchdbput2(hdb, "baz", "jump")){                                }
    ecode = tchdbecode(hdb);                                         free(key);
    fprintf(stderr, "put error: %s¥n", tchdberrmsg(ecode));        }
  }
                                                                   /* close the database */
  /* retrieve records */                                           if(!tchdbclose(hdb)){
  value = tchdbget2(hdb, "foo");                                     ecode = tchdbecode(hdb);
  if(value){                                                         fprintf(stderr, "close error: %s¥n", tchdberrmsg(ecode));
    printf("%s¥n", value);                                         }
    free(value);
  } else {                                                         /* delete the object */
    ecode = tchdbecode(hdb);                                       tchdbdel(hdb);
    fprintf(stderr, "get error: %s¥n", tchdberrmsg(ecode));
  }                                                                return 0;
                                                               }
Tokyo Tyrant
- database server -
Features
• network server of Tokyo Cabinet
 – client/server model
 – multi applications can access one database
 – effective binary protocol
• compatible protocols
 – supports memcached protocol and HTTP
 – available from most popular languages
• high concurrency/performance
 – resolves "c10k" with epoll/kqueue/eventports
 – 17.2 sec/1M queries (58,000 qps)
• high availability
  – hot backup and update log
  – asynchronous replication between servers
• various database schema
  – using the abstract database API of Tokyo Cabinet
• effective operations
  – no-reply updating, multi-record retrieval
  – atomic increment
• Lua extension
  – defines arbitrary database operations
  – atomic operation by record locking
• pure script language interfaces
  – Perl, Ruby, Java, Python, PHP, Erlang, etc...
Asynchronous Replication
 master and slaves                                 dual master
 (load balancing)                                  (fail over)

                          write query

      master server                      client
                                                          client
            database                                                 query if the master is dead
                                                  query
                            read query
           update log                             active master                 standby master
                            with load balancing

                                                      database                        database
replicate
the difference
slave server            slave server                  update log                     update log
                                                                   replicate
      database              database
                                                                   the difference



     update log             update log
Thread Pool Model
                                 epoll/kqueue            listen
accept the client connection
if the event is about the listener           queue   first of all, the listening socket is enqueued into
                                                     the epoll queue
           accept           epoll_ctl(add)
                                                                           queue back if keep-alive
                            epoll_wait

                                                         task manager
                            epoll_ctl(del)
                                                                  queue
                                                     enqueue
          move the readable client socket
          from the epoll queue to the task queue

                                                       deque
                                                                                 worker thread

                                                                                 worker thread

                                                     do each task                worker thread
Lua Extention
• defines DB operations as Lua functions
  – clients send the function name and record data
  – the server returns the return value of the function

• options about atomicity
  – no locking / record locking / global locking


  front end           request           back end
                      - function name
                      - key data            Tokyo Tyrant
                      - value data
    Clients
                                            Tokyo Cabinet   Lua processor
                     response
                     - result data

                                                                  script
                                              database          user-defined
                                                                operations
case: Timestamp DB at mixi.jp
• 20 million records               mod_perl
  – each record size is 20 bytes
                                      home.pl          update
• more than 10,000                  show_friend.pl
                                                                 TT (active)


  updates per sec.                   view_diary.pl
                                                                  database

  – keeps 10,000 connections
                                         search.pl               replication
• dual master                            other pages

  replication                                                     TT (standby)

  – each server is only one                                         database
                                                         fetch
• memcached                         list_friend.pl



  compatible protocol
                                    list_bookmark.pl



  – reuses existing Perl clients
case: Cache for Big Storages
• works as proxy                   clients                        1. inserts to the storage
  – mediates insert/search                                        2. inserts to the cache

  – write through, read through

• Lua extension
                                                 Tokyo Tyrant          MySQL/hBase


  – atomic operation by record                    atomic insert
                                                                          database
    locking                                             Lua
  – uses LuaSocket to access
    the storage                                                           database
                                                 atomic search
• proper DB scheme                                      Lua
  – TCMDB: for generic cache                                              database

  – TCNDB: for biased access
  – TCHDB: for large records                           cache
                                                                          database
    such as image                 1. retrieves from the cache
  – TCFDB: for small records           if found, return
                                  2. retrieves from the storage
    such as timestamp             3. inserts to the cache
Example Code
#include   <tcrdb.h>
#include   <stdlib.h>
#include   <stdbool.h>
#include   <stdint.h>

int main(int argc, char **argv){

  TCRDB *rdb;
  int ecode;
  char *value;

  /* create the object */
  rdb = tcrdbnew();

  /* connect to the server */
  if(!tcrdbopen(rdb, "localhost", 1978)){
    ecode = tcrdbecode(rdb);
    fprintf(stderr, "open error: %s¥n", tcrdberrmsg(ecode));
  }

  /* store records */
  if(!tcrdbput2(rdb, "foo", "hop") ||
     !tcrdbput2(rdb, "bar", "step") ||
     !tcrdbput2(rdb, "baz", "jump")){
    ecode = tcrdbecode(rdb);
    fprintf(stderr, "put error: %s¥n", tcrdberrmsg(ecode));
  }                                                                /* close the connection */
                                                                   if(!tcrdbclose(rdb)){
  /* retrieve records */                                             ecode = tcrdbecode(rdb);
  value = tcrdbget2(rdb, "foo");                                     fprintf(stderr, "close error: %s¥n", tcrdberrmsg(ecode));
  if(value){                                                       }
    printf("%s¥n", value);
    free(value);                                                   /* delete the object */
  } else {                                                         tcrdbdel(rdb);
    ecode = tcrdbecode(rdb);
    fprintf(stderr, "get error: %s¥n", tcrdberrmsg(ecode));        return 0;
  }                                                            }
Tokyo Dystopia
- full-text search engine -
Features
• full-text search engine
  – manages databases of Tokyo Cabinet as an inverted
    index

• combines two tokenizers
  – character N-gram (bi-gram) method
     • perfect recall ratio
  – simple word by outer language processor
     • high accuracy and high performance

• high performance/scalability
  – handles more than 10 million documents
  – searches in milliseconds
• optimized to professional use
  – layered architecture of APIs
  – no embedded scoring system
    • to combine outer scoring system
  – no text filter, no crawler, no language
    processor
• convenient utilities
  – multilingualism with Unicode
  – set operations
  – phrase matching, prefix matching, suffix
    matching, and token matching
  – command line utilities
Inverted Index
• stands on key/value database
  – key = token
     • N-gram or simple word
  – value = occurrence data (posting list)
     • list of pairs of document number and offset in the document

• uses B+ tree database
  – reduces write operations into the disk device
  – enables common prefix search for tokens
  – delta encoding and deflate compression

       ID:21            text: "abracadabra"
          a    -   21:10          ca - 21:1, 21:8
          ab   -   21:0,21:7      da - 21:4
          ac   -   21:3           ra - 21:2, 21:9
          br   -   21:5
Layered Architecture
• character N-gram index
  – "q-gram index" (only index), and "indexed database"
  – uses embedded tokenizer

• word index
  – "word index" (only index), and "tagged index"
  – uses outer tokenizer
                                                Applications
                             Character N-gram Index            Tagging Index

                               indexed database         tagged database

                                 q-gram index              word index

                                              Tokyo Cabinet
case: friend search at mixi.jp
• 20 million records
  – each record size is 1K bytes                   query               user interface
  – name and self introduction
                                      merger
• more than 100 qps                       TT's        social               query

• attribute narrowing                    cache        graph

  – gender, address, birthday                                         searcher
  – multiple sort orders              copy the social graph
                                                                        inverted   attribute

• distributed processing
                                                                          index       DB


  – more than 10 servers              indexer
  – indexer, searchers, merger                                      copy the index and the DB
                                        inverted     attribute
• ranking by social                       index         DB


  graph                                                          dump profile data

  – the merger scores the result by
    following the friend links                                    profile DB
Example Code
#include   <dystopia.h>
#include   <stdlib.h>
#include   <stdbool.h>
#include   <stdint.h>

int main(int argc, char **argv){
  TCIDB *idb;
  int ecode, rnum, i;
  uint64_t *result;
  char *text;
                                                                   /* search records */
  /* create the object */                                          result = tcidbsearch2(idb, "john || thomas", &rnum);
  idb = tcidbnew();                                                if(result){
                                                                     for(i = 0; i < rnum; i++){
  /* open the database */                                              text = tcidbget(idb, result[i]);
  if(!tcidbopen(idb, "casket", IDBOWRITER | IDBOCREAT)){               if(text){
    ecode = tcidbecode(idb);                                             printf("%d¥t%s¥n", (int)result[i], text);
    fprintf(stderr, "open error: %s¥n", tcidberrmsg(ecode));             free(text);
  }                                                                    }
                                                                     }
  /* store records */                                                free(result);
  if(!tcidbput(idb, 1, "George Washington") ||                     } else {
     !tcidbput(idb, 2, "John Adams") ||                              ecode = tcidbecode(idb);
     !tcidbput(idb, 3, "Thomas Jefferson")){                         fprintf(stderr, "search error: %s¥n", tcidberrmsg(ecode));
    ecode = tcidbecode(idb);                                       }
    fprintf(stderr, "put error: %s¥n", tcidberrmsg(ecode));
  }                                                                /* close the database */
                                                                   if(!tcidbclose(idb)){
                                                                     ecode = tcidbecode(idb);
                                                                     fprintf(stderr, "close error: %s¥n", tcidberrmsg(ecode));
                                                                   }

                                                                   /* delete the object */
                                                                   tcidbdel(idb);

                                                                   return 0;
                                                               }
Tokyo Promenade
- content management system -
Features
• content management system
  – manages Web contents easily with a browser
  – available as BBS, Blog, and Wiki

• simple and logical interface
  – aims at conciseness like LaTeX
  – optimized for text browsers such as w3m and Lynx
  – complying with XHTML 1.0 and considering WCAG 1.0

• high performance/throughput
  – implemented in pure C
  – uses Tokyo Cabinet and supports FastCGI
  – 0.836ms/view (more than 1,000 qps)
• sufficient functionality
  – simple Wiki formatting
  – file uploader and manager
  – user authentication by the login form
  – guest comment authorization by a riddle
  – supports the sidebar navigation
  – full-text/attribute search, calendar view
  – Atom feed
• flexible customizability
  – thorough separation of logic and presentation
  – template file to generate the output
  – server side scripting by the Lua extension
  – post processing by outer commands
Example Code
#!   Introduction to Tokyo Cabinet
#c   2009-11-05T18:58:39+09:00
#m   2009-11-05T18:58:39+09:00
#o   mikio
#t   database,programming,tokyocabinet

This article describes what is [[Tokyo
Cabinet|http://1978th.net/tokyocabinet/]] and
how to use it.

@ upfile:1257415094-logo-ja.png

* Features

- modern implementation of DBM
-- key/value database
-- e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley
DB
- simple library = process embedded
- Successor of QDBM
-- C99 and POSIX compatible, using Pthread,
mmap, etc...
-- Win32 porting is work-in-progress
- high performance
- insert: 0.4 sec/1M records (2,500,000 qps)
- search: 0.33 sec/1M records (3,000,000 qps)
innovating more and yet more...
               http://1978th.net/
Introduction to Tokyo Products

More Related Content

What's hot

Cilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDPCilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDPThomas Graf
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 
[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅NAVER D2
 
Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링JANGWONSEO4
 
FD.io VPP事始め
FD.io VPP事始めFD.io VPP事始め
FD.io VPP事始めtetsusat
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요Jo Hoon
 
Understanding DPDK algorithmics
Understanding DPDK algorithmicsUnderstanding DPDK algorithmics
Understanding DPDK algorithmicsDenys Haryachyy
 
eBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceeBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceSUSE Labs Taipei
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京
OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京
OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京Kentaro Ebisawa
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughThomas Graf
 
DNSキャッシュサーバ チューニングの勘所
DNSキャッシュサーバ チューニングの勘所DNSキャッシュサーバ チューニングの勘所
DNSキャッシュサーバ チューニングの勘所hdais
 
Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례
Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례
Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례Jemin Huh
 
[2D4]Python에서의 동시성_병렬성
[2D4]Python에서의 동시성_병렬성[2D4]Python에서의 동시성_병렬성
[2D4]Python에서의 동시성_병렬성NAVER D2
 
tcpdumpとtcpreplayとtcprewriteと他。
tcpdumpとtcpreplayとtcprewriteと他。tcpdumpとtcpreplayとtcprewriteと他。
tcpdumpとtcpreplayとtcprewriteと他。(^-^) togakushi
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machineAlexei Starovoitov
 

What's hot (20)

Cilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDPCilium - Container Networking with BPF & XDP
Cilium - Container Networking with BPF & XDP
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅
 
Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링Apache kafka 관리와 모니터링
Apache kafka 관리와 모니터링
 
FD.io VPP事始め
FD.io VPP事始めFD.io VPP事始め
FD.io VPP事始め
 
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
왜 쿠버네티스는 systemd로 cgroup을 관리하려고 할까요
 
eBPF/XDP
eBPF/XDP eBPF/XDP
eBPF/XDP
 
Understanding DPDK algorithmics
Understanding DPDK algorithmicsUnderstanding DPDK algorithmics
Understanding DPDK algorithmics
 
Fluentd vs. Logstash for OpenStack Log Management
Fluentd vs. Logstash for OpenStack Log ManagementFluentd vs. Logstash for OpenStack Log Management
Fluentd vs. Logstash for OpenStack Log Management
 
eBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to UserspaceeBPF Trace from Kernel to Userspace
eBPF Trace from Kernel to Userspace
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京
OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京
OpenVZ - Linux Containers:第2回 コンテナ型仮想化の情報交換会@東京
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
 
DNSキャッシュサーバ チューニングの勘所
DNSキャッシュサーバ チューニングの勘所DNSキャッシュサーバ チューニングの勘所
DNSキャッシュサーバ チューニングの勘所
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례
Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례
Spring boot 를 적용한 전사모니터링 시스템 backend 개발 사례
 
[2D4]Python에서의 동시성_병렬성
[2D4]Python에서의 동시성_병렬성[2D4]Python에서의 동시성_병렬성
[2D4]Python에서의 동시성_병렬성
 
tcpdumpとtcpreplayとtcprewriteと他。
tcpdumpとtcpreplayとtcprewriteと他。tcpdumpとtcpreplayとtcprewriteと他。
tcpdumpとtcpreplayとtcprewriteと他。
 
eBPFを用いたトレーシングについて
eBPFを用いたトレーシングについてeBPFを用いたトレーシングについて
eBPFを用いたトレーシングについて
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 

Similar to Introduction to Tokyo Products

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Spring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataSpring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataRoger Xia
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemCloudera, Inc.
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsKorea Sdec
 
Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2John Ashmead
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014Avinash Ramineni
 
第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandra第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandraShun Nakamura
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentationMurat Çakal
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Chris Fregly
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014clairvoyantllc
 
Hash Functions FTW
Hash Functions FTWHash Functions FTW
Hash Functions FTWsunnygleason
 
Scaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of FilesScaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of FilesHaohui Mai
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 

Similar to Introduction to Tokyo Products (20)

Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Spring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataSpring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_data
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop Ecosystem
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
SDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and modelsSDEC2011 NoSQL concepts and models
SDEC2011 NoSQL concepts and models
 
Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2Ruby on Rails & PostgreSQL - v2
Ruby on Rails & PostgreSQL - v2
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandra第17回Cassandra勉強会: MyCassandra
第17回Cassandra勉強会: MyCassandra
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentation
 
Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015Spark After Dark - LA Apache Spark Users Group - Feb 2015
Spark After Dark - LA Apache Spark Users Group - Feb 2015
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
Hash Functions FTW
Hash Functions FTWHash Functions FTW
Hash Functions FTW
 
Scaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of FilesScaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of Files
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Hadoop
HadoopHadoop
Hadoop
 

Recently uploaded

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 

Recently uploaded (20)

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 

Introduction to Tokyo Products

  • 1. Introduction to Tokyo Products Mikio Hirabayashi <hirarin@gmail.com>
  • 2. Tokyo Products • Tokyo Cabinet – database library • Tokyo Tyrant applications Prome – database server custom storage Tyrant Dystopia nade • Tokyo Dystopia Cabinet – full-text search engine file system • Tokyo Promenade – content management system • open source – released under LGPL • powerful, portable, practical – written in the standard C, optimized to POSIX
  • 3. Tokyo Cabinet - database library -
  • 4. Features • modern implementation of DBM – key/value database • e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB – simple library = process embedded – Successor of QDBM • C99 and POSIX compatible, using Pthread, mmap, etc... • Win32 porting is work-in-progress • high performance – insert: 0.4 sec/1M records (2,500,000 qps) – search: 0.33 sec/1M records (3,000,000 qps)
  • 5. • high concurrency – multi-thread safe – read/write locking by records • high scalability – hash and B+tree structure = O(1) and O(log N) – no actual limit size of a database file (to 8 exabytes) • transaction – write ahead logging and shadow paging – ACID properties • various APIs – on-memory list/hash/tree – file hash/B+tree/array/table • script language bindings – Perl, Ruby, Java, Lua, Python, PHP, Haskell, Erlang, etc...
  • 6. TCHDB: Hash Database • static hashing bucket array – O(1) time complexity • separate chaining key value – binary search tree key value – balances by the second hash key value • free block pool key value key value – best fit allocation – dynamic defragmentation key value • combines mmap and key key value value pwrite/pread key value – saves calling system calls key value • compression key value – deflate(gzip)/bzip2/custom
  • 7. TCBDB: B+ Tree Database • B+ tree key value – O(log N) time complexity B tree index key value • page caching key key value value – LRU removing key value – speculative search key value • stands on hash DB key value – records pages in hash DB key value – succeeds time and space key value efficiency key value • custom comparison key value function key value – prefix/range matching key value key value • cursor key value – jump/next/prev key value
  • 8. TCFDB: Fixed-length Database • array of fixed- length elements array – O(1) time complexity value value value value – natural number keys value value value value – addresses records by value value value value multiple of key value value value value • most effective value value value value value value value value – bulk load by mmap value value value value – no key storage per value value value value record value value value value – extremely fast and value value value value concurrent
  • 9. TCTDB: Table Database • column based – the primary key and named columns – stands on hash DB bucket array • flexible structure primary key name value name value name value – no data scheme, no data type – various structure for each record primary key name value name value name value • query mechanism primary key name value name value name value – various operators matching column values primary key name value name value name value – lexical/decimal orders by column values primary key name value name value name value • column indexes – implemented with B+ tree primary key name value name value name value – typed as string/number – inverted index of token/q-gram column index – query optimizer
  • 10. On-memory Structures • TCXSTR: extensible string – concatenation, formatted allocation • TCLIST: array list (dequeue) – random access by index – push/pop, unshift/shift, insert/remove • TCMAP: map of hash table – insert/remove/search – iterator by order of insertion • TCTREE: map of ordered tree – insert/remove/search – iterator by order of comparison function
  • 11. Other Mechanisms • abstract database – common interface of 6 schema • on-memory hash, on-memory tree • file hash, file B+tree, file array, file table – decides the concrete scheme in runtime • remote database – network interface of the abstract database – yes, it's Tokyo Tyrant! • miscellaneous utilities – string processing, filesystem operation – memory pool, encoding/decoding
  • 12. Example Code #include <tcutil.h> #include <tchdb.h> #include <stdlib.h> #include <stdbool.h> #include <stdint.h> int main(int argc, char **argv){ TCHDB *hdb; int ecode; char *key, *value; /* create the object */ hdb = tchdbnew(); /* open the database */ if(!tchdbopen(hdb, "casket.hdb", HDBOWRITER | HDBOCREAT)){ ecode = tchdbecode(hdb); /* traverse records */ fprintf(stderr, "open error: %s¥n", tchdberrmsg(ecode)); tchdbiterinit(hdb); } while((key = tchdbiternext2(hdb)) != NULL){ value = tchdbget2(hdb, key); /* store records */ if(value){ if(!tchdbput2(hdb, "foo", "hop") || printf("%s:%s¥n", key, value); !tchdbput2(hdb, "bar", "step") || free(value); !tchdbput2(hdb, "baz", "jump")){ } ecode = tchdbecode(hdb); free(key); fprintf(stderr, "put error: %s¥n", tchdberrmsg(ecode)); } } /* close the database */ /* retrieve records */ if(!tchdbclose(hdb)){ value = tchdbget2(hdb, "foo"); ecode = tchdbecode(hdb); if(value){ fprintf(stderr, "close error: %s¥n", tchdberrmsg(ecode)); printf("%s¥n", value); } free(value); } else { /* delete the object */ ecode = tchdbecode(hdb); tchdbdel(hdb); fprintf(stderr, "get error: %s¥n", tchdberrmsg(ecode)); } return 0; }
  • 14. Features • network server of Tokyo Cabinet – client/server model – multi applications can access one database – effective binary protocol • compatible protocols – supports memcached protocol and HTTP – available from most popular languages • high concurrency/performance – resolves "c10k" with epoll/kqueue/eventports – 17.2 sec/1M queries (58,000 qps)
  • 15. • high availability – hot backup and update log – asynchronous replication between servers • various database schema – using the abstract database API of Tokyo Cabinet • effective operations – no-reply updating, multi-record retrieval – atomic increment • Lua extension – defines arbitrary database operations – atomic operation by record locking • pure script language interfaces – Perl, Ruby, Java, Python, PHP, Erlang, etc...
  • 16. Asynchronous Replication master and slaves dual master (load balancing) (fail over) write query master server client client database query if the master is dead query read query update log active master standby master with load balancing database database replicate the difference slave server slave server update log update log replicate database database the difference update log update log
  • 17. Thread Pool Model epoll/kqueue listen accept the client connection if the event is about the listener queue first of all, the listening socket is enqueued into the epoll queue accept epoll_ctl(add) queue back if keep-alive epoll_wait task manager epoll_ctl(del) queue enqueue move the readable client socket from the epoll queue to the task queue deque worker thread worker thread do each task worker thread
  • 18. Lua Extention • defines DB operations as Lua functions – clients send the function name and record data – the server returns the return value of the function • options about atomicity – no locking / record locking / global locking front end request back end - function name - key data Tokyo Tyrant - value data Clients Tokyo Cabinet Lua processor response - result data script database user-defined operations
  • 19. case: Timestamp DB at mixi.jp • 20 million records mod_perl – each record size is 20 bytes home.pl update • more than 10,000 show_friend.pl TT (active) updates per sec. view_diary.pl database – keeps 10,000 connections search.pl replication • dual master other pages replication TT (standby) – each server is only one database fetch • memcached list_friend.pl compatible protocol list_bookmark.pl – reuses existing Perl clients
  • 20. case: Cache for Big Storages • works as proxy clients 1. inserts to the storage – mediates insert/search 2. inserts to the cache – write through, read through • Lua extension Tokyo Tyrant MySQL/hBase – atomic operation by record atomic insert database locking Lua – uses LuaSocket to access the storage database atomic search • proper DB scheme Lua – TCMDB: for generic cache database – TCNDB: for biased access – TCHDB: for large records cache database such as image 1. retrieves from the cache – TCFDB: for small records if found, return 2. retrieves from the storage such as timestamp 3. inserts to the cache
  • 21. Example Code #include <tcrdb.h> #include <stdlib.h> #include <stdbool.h> #include <stdint.h> int main(int argc, char **argv){ TCRDB *rdb; int ecode; char *value; /* create the object */ rdb = tcrdbnew(); /* connect to the server */ if(!tcrdbopen(rdb, "localhost", 1978)){ ecode = tcrdbecode(rdb); fprintf(stderr, "open error: %s¥n", tcrdberrmsg(ecode)); } /* store records */ if(!tcrdbput2(rdb, "foo", "hop") || !tcrdbput2(rdb, "bar", "step") || !tcrdbput2(rdb, "baz", "jump")){ ecode = tcrdbecode(rdb); fprintf(stderr, "put error: %s¥n", tcrdberrmsg(ecode)); } /* close the connection */ if(!tcrdbclose(rdb)){ /* retrieve records */ ecode = tcrdbecode(rdb); value = tcrdbget2(rdb, "foo"); fprintf(stderr, "close error: %s¥n", tcrdberrmsg(ecode)); if(value){ } printf("%s¥n", value); free(value); /* delete the object */ } else { tcrdbdel(rdb); ecode = tcrdbecode(rdb); fprintf(stderr, "get error: %s¥n", tcrdberrmsg(ecode)); return 0; } }
  • 22. Tokyo Dystopia - full-text search engine -
  • 23. Features • full-text search engine – manages databases of Tokyo Cabinet as an inverted index • combines two tokenizers – character N-gram (bi-gram) method • perfect recall ratio – simple word by outer language processor • high accuracy and high performance • high performance/scalability – handles more than 10 million documents – searches in milliseconds
  • 24. • optimized to professional use – layered architecture of APIs – no embedded scoring system • to combine outer scoring system – no text filter, no crawler, no language processor • convenient utilities – multilingualism with Unicode – set operations – phrase matching, prefix matching, suffix matching, and token matching – command line utilities
  • 25. Inverted Index • stands on key/value database – key = token • N-gram or simple word – value = occurrence data (posting list) • list of pairs of document number and offset in the document • uses B+ tree database – reduces write operations into the disk device – enables common prefix search for tokens – delta encoding and deflate compression ID:21 text: "abracadabra" a - 21:10 ca - 21:1, 21:8 ab - 21:0,21:7 da - 21:4 ac - 21:3 ra - 21:2, 21:9 br - 21:5
  • 26. Layered Architecture • character N-gram index – "q-gram index" (only index), and "indexed database" – uses embedded tokenizer • word index – "word index" (only index), and "tagged index" – uses outer tokenizer Applications Character N-gram Index Tagging Index indexed database tagged database q-gram index word index Tokyo Cabinet
  • 27. case: friend search at mixi.jp • 20 million records – each record size is 1K bytes query user interface – name and self introduction merger • more than 100 qps TT's social query • attribute narrowing cache graph – gender, address, birthday searcher – multiple sort orders copy the social graph inverted attribute • distributed processing index DB – more than 10 servers indexer – indexer, searchers, merger copy the index and the DB inverted attribute • ranking by social index DB graph dump profile data – the merger scores the result by following the friend links profile DB
  • 28. Example Code #include <dystopia.h> #include <stdlib.h> #include <stdbool.h> #include <stdint.h> int main(int argc, char **argv){ TCIDB *idb; int ecode, rnum, i; uint64_t *result; char *text; /* search records */ /* create the object */ result = tcidbsearch2(idb, "john || thomas", &rnum); idb = tcidbnew(); if(result){ for(i = 0; i < rnum; i++){ /* open the database */ text = tcidbget(idb, result[i]); if(!tcidbopen(idb, "casket", IDBOWRITER | IDBOCREAT)){ if(text){ ecode = tcidbecode(idb); printf("%d¥t%s¥n", (int)result[i], text); fprintf(stderr, "open error: %s¥n", tcidberrmsg(ecode)); free(text); } } } /* store records */ free(result); if(!tcidbput(idb, 1, "George Washington") || } else { !tcidbput(idb, 2, "John Adams") || ecode = tcidbecode(idb); !tcidbput(idb, 3, "Thomas Jefferson")){ fprintf(stderr, "search error: %s¥n", tcidberrmsg(ecode)); ecode = tcidbecode(idb); } fprintf(stderr, "put error: %s¥n", tcidberrmsg(ecode)); } /* close the database */ if(!tcidbclose(idb)){ ecode = tcidbecode(idb); fprintf(stderr, "close error: %s¥n", tcidberrmsg(ecode)); } /* delete the object */ tcidbdel(idb); return 0; }
  • 29. Tokyo Promenade - content management system -
  • 30. Features • content management system – manages Web contents easily with a browser – available as BBS, Blog, and Wiki • simple and logical interface – aims at conciseness like LaTeX – optimized for text browsers such as w3m and Lynx – complying with XHTML 1.0 and considering WCAG 1.0 • high performance/throughput – implemented in pure C – uses Tokyo Cabinet and supports FastCGI – 0.836ms/view (more than 1,000 qps)
  • 31. • sufficient functionality – simple Wiki formatting – file uploader and manager – user authentication by the login form – guest comment authorization by a riddle – supports the sidebar navigation – full-text/attribute search, calendar view – Atom feed • flexible customizability – thorough separation of logic and presentation – template file to generate the output – server side scripting by the Lua extension – post processing by outer commands
  • 32. Example Code #! Introduction to Tokyo Cabinet #c 2009-11-05T18:58:39+09:00 #m 2009-11-05T18:58:39+09:00 #o mikio #t database,programming,tokyocabinet This article describes what is [[Tokyo Cabinet|http://1978th.net/tokyocabinet/]] and how to use it. @ upfile:1257415094-logo-ja.png * Features - modern implementation of DBM -- key/value database -- e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB - simple library = process embedded - Successor of QDBM -- C99 and POSIX compatible, using Pthread, mmap, etc... -- Win32 porting is work-in-progress - high performance - insert: 0.4 sec/1M records (2,500,000 qps) - search: 0.33 sec/1M records (3,000,000 qps)
  • 33. innovating more and yet more... http://1978th.net/