Introduction to Tokyo Products

Mikio Hirabayashi <hirarin@gmail.com>
Tokyo Products
• Tokyo Cabinet
  – database library
• Tokyo Tyrant
  – database server
• Tokyo Dystopia
  – full-text search engine
• Tokyo Promenade
  – content management system
• open source
  – released under LGPL
• powerful, portable, practical
  – written in standard C, optimized for POSIX
[Figure: software stack, top to bottom: applications; custom storage, Tyrant, Dystopia, Promenade; Cabinet; file system]
Tokyo Cabinet
 - database library -
Features
• modern implementation of DBM
 – key/value database
   • e.g. DBM, NDBM, GDBM, TDB, CDB, Berkeley DB
 – simple library = process embedded
 – Successor of QDBM
   • C99 and POSIX compatible, using Pthread, mmap, etc...
   • Win32 porting is work-in-progress

• high performance
 – insert: 0.4 sec/1M records (2,500,000 qps)
 – search: 0.33 sec/1M records (3,000,000 qps)
• high concurrency
  – multi-thread safe
  – read/write locking by records
• high scalability
  – hash and B+tree structure = O(1) and O(log N)
  – no practical limit on the size of a database file (up to 8 exabytes)
• transaction
  – write-ahead logging and shadow paging
  – ACID properties
• various APIs
  – on-memory list/hash/tree
  – file hash/B+tree/array/table
• script language bindings
  – Perl, Ruby, Java, Lua, Python, PHP, Haskell, Erlang, etc...
TCHDB: Hash Database
• static hashing
  – O(1) time complexity
• separate chaining
  – binary search tree
  – balanced by the second hash
• free block pool
  – best fit allocation
  – dynamic defragmentation
• combines mmap and pwrite/pread
  – reduces the number of system calls
• compression
  – deflate(gzip)/bzip2/custom
[Figure: a bucket array whose slots point to chains of key/value records]
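The bucket array and the chaining by a second hash can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not Tokyo Cabinet's code or file format: the names `HashDB`, `hdb_put`, and `hdb_get` are hypothetical, the two hash functions are generic djb2-style and FNV-1a-style toys, and the bucket count is fixed at 8. The first hash picks the bucket; inside a bucket, records form a binary search tree ordered by the second hash, so the tree tends to stay shallow without explicit rebalancing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8   /* fixed bucket count: static hashing never resizes */

typedef struct Node {
  uint32_t h2;                 /* second hash, used as the tree ordering key */
  char *key, *value;
  struct Node *left, *right;
} Node;

typedef struct { Node *buckets[NBUCKETS]; } HashDB;

/* Two independent toy hash functions (illustrative, not Tokyo Cabinet's). */
static uint32_t hash1(const char *s) {
  uint32_t h = 5381;
  while (*s) h = h * 33 + (unsigned char)*s++;
  return h;
}
static uint32_t hash2(const char *s) {
  uint32_t h = 2166136261u;
  while (*s) h = (h ^ (unsigned char)*s++) * 16777619u;
  return h;
}

/* Heap-allocated copy of a string. */
static char *dup_str(const char *s) {
  size_t n = strlen(s) + 1;
  char *p = malloc(n);
  assert(p != NULL);
  memcpy(p, s, n);
  return p;
}

/* Order by the second hash first, then by key to break hash ties. */
static int node_cmp(uint32_t h2, const char *key, const Node *n) {
  if (h2 != n->h2) return h2 < n->h2 ? -1 : 1;
  return strcmp(key, n->key);
}

/* Insert a record, or overwrite the value of an existing key. */
void hdb_put(HashDB *db, const char *key, const char *value) {
  uint32_t h2 = hash2(key);
  Node **link = &db->buckets[hash1(key) % NBUCKETS];
  while (*link) {
    int c = node_cmp(h2, key, *link);
    if (c == 0) {                       /* existing key: replace the value */
      free((*link)->value);
      (*link)->value = dup_str(value);
      return;
    }
    link = c < 0 ? &(*link)->left : &(*link)->right;
  }
  Node *n = calloc(1, sizeof *n);
  assert(n != NULL);
  n->h2 = h2;
  n->key = dup_str(key);
  n->value = dup_str(value);
  *link = n;
}

/* Return the stored value, or NULL if the key is absent. */
const char *hdb_get(const HashDB *db, const char *key) {
  uint32_t h2 = hash2(key);
  const Node *n = db->buckets[hash1(key) % NBUCKETS];
  while (n) {
    int c = node_cmp(h2, key, n);
    if (c == 0) return n->value;
    n = c < 0 ? n->left : n->right;
  }
  return NULL;
}
```

A zero-initialized `HashDB` is an empty database. Freeing the nodes, the free block pool, and on-disk layout are left out to keep the sketch short.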
TCBDB: B+ Tree Database
• B+ tree
  – O(log N) time complexity
• page caching
  – LRU eviction
  – speculative search
• stands on hash DB
  – records pages in hash DB
  – inherits its time and space efficiency
• custom comparison function
  – prefix/range matching
• cursor
  – jump/next/prev
[Figure: a B tree index pointing to pages of key/value records]
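Prefix matching with a cursor works because the keys are kept in comparison order: jump to the first key that is not less than the prefix, then step forward while the prefix still matches. The sketch below shows the idea over a plain sorted array instead of a B+ tree; `cursor_jump` and `prefix_range` are hypothetical names for illustration, not functions of the Tokyo Cabinet API, and the ordering is the byte order of `strcmp`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first key >= target in a sorted array (the cursor "jump").
   Runs in O(log N), like a descent through a B+ tree. */
size_t cursor_jump(const char *const *keys, size_t n, const char *target) {
  assert(keys != NULL || n == 0);
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (strcmp(keys[mid], target) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/* Count the keys that start with `prefix`.
   *first receives the index of the first match (or n if there is none). */
size_t prefix_range(const char *const *keys, size_t n,
                    const char *prefix, size_t *first) {
  size_t plen = strlen(prefix);
  size_t i = cursor_jump(keys, n, prefix);
  *first = i;
  /* The cursor "next": matching keys are contiguous in sorted order. */
  while (i < n && strncmp(keys[i], prefix, plen) == 0) i++;
  return i - *first;
}
```

For the sorted keys "apple", "bar", "baz", "foo", the prefix "ba" yields two matches starting at index 1.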
TCFDB: Fixed-length Database
• array of fixed-length elements
  – O(1) time complexity
  – natural number keys
  – addresses records by a multiple of the key
• most efficient
  – bulk load by mmap
  – no key storage per record
  – extremely fast and concurrent
[Figure: an array of fixed-width value slots]
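Addressing by a multiple of the key means the position of a record is pure arithmetic, so no key has to be stored or searched. A minimal sketch, assuming a hypothetical layout with a 256-byte header followed by equal-width slots and keys starting at 1 (the header size and the name `fdb_offset` are illustrative, not taken from the TCFDB file format):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed header size, for illustration only. */
enum { FDB_HEADER_SIZE = 256 };

/* Byte offset of the record for `key`, given the slot width in bytes.
   Keys are natural numbers starting at 1, so key k lives in slot k - 1. */
uint64_t fdb_offset(uint64_t key, uint64_t width) {
  assert(key >= 1 && width > 0);
  return FDB_HEADER_SIZE + (key - 1) * width;
}
```

With 20-byte slots, key 1 maps to offset 256 and key 3 to offset 256 + 2 × 20 = 296.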
TCTDB: Table Database
• column based
  – the primary key and named columns
  – stands on hash DB
• flexible structure
  – no data schema, no data types
  – each record can have a different structure
• query mechanism
  – various operators for matching column values
  – lexical/decimal ordering by column values
• column indexes
  – implemented with B+ tree
  – typed as string/number
  – inverted index of token/q-gram
  – query optimizer
[Figure: a bucket array pointing to records, each a primary key followed by name/value column pairs, with a separate column index]
On-memory Structures
• TCXSTR: extensible string
  – concatenation, formatted allocation
• TCLIST: array list (deque)
  – random access by index
  – push/pop, unshift/shift, insert/remove
• TCMAP: map of hash table
  – insert/remove/search
  – iterator by order of insertion
• TCTREE: map of ordered tree
  – insert/remove/search
  – iterator by order of comparison function
Other Mechanisms
• abstract database
  – common interface to 6 schemes
     • on-memory hash, on-memory tree
     • file hash, file B+tree, file array, file table
  – decides the concrete scheme at runtime

• remote database
  – network interface of the abstract database
  – yes, it's Tokyo Tyrant!

• miscellaneous utilities
  – string processing, filesystem operation
  – memory pool, encoding/decoding
Example Code
#include <tcutil.h>
#include <tchdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

int main(int argc, char **argv){

  TCHDB *hdb;
  int ecode;
  char *key, *value;

  /* create the object */
  hdb = tchdbnew();

  /* open the database */
  if(!tchdbopen(hdb, "casket.hdb", HDBOWRITER | HDBOCREAT)){
    ecode = tchdbecode(hdb);
    fprintf(stderr, "open error: %s\n", tchdberrmsg(ecode));
  }

  /* store records */
  if(!tchdbput2(hdb, "foo", "hop") ||
     !tchdbput2(hdb, "bar", "step") ||
     !tchdbput2(hdb, "baz", "jump")){
    ecode = tchdbecode(hdb);
    fprintf(stderr, "put error: %s\n", tchdberrmsg(ecode));
  }

  /* retrieve records */
  value = tchdbget2(hdb, "foo");
  if(value){
    printf("%s\n", value);
    free(value);
  } else {
    ecode = tchdbecode(hdb);
    fprintf(stderr, "get error: %s\n", tchdberrmsg(ecode));
  }

  /* traverse records */
  tchdbiterinit(hdb);
  while((key = tchdbiternext2(hdb)) != NULL){
    value = tchdbget2(hdb, key);
    if(value){
      printf("%s:%s\n", key, value);
      free(value);
    }
    free(key);
  }

  /* close the database */
  if(!tchdbclose(hdb)){
    ecode = tchdbecode(hdb);
    fprintf(stderr, "close error: %s\n", tchdberrmsg(ecode));
  }

  /* delete the object */
  tchdbdel(hdb);

  return 0;
}
Tokyo Tyrant
- database server -
Features
• network server of Tokyo Cabinet
 – client/server model
 – multiple applications can access one database
 – efficient binary protocol
• compatible protocols
 – supports memcached protocol and HTTP
 – available from most popular languages
• high concurrency/performance
 – resolves "c10k" with epoll/kqueue/event ports
 – 17.2 sec/1M queries (58,000 qps)
• high availability
  – hot backup and update log
  – asynchronous replication between servers
• various database schemes
  – using the abstract database API of Tokyo Cabinet
• efficient operations
  – no-reply updating, multi-record retrieval
  – atomic increment
• Lua extension
  – defines arbitrary database operations
  – atomic operation by record locking
• pure script language interfaces
  – Perl, Ruby, Java, Python, PHP, Erlang, etc...
Asynchronous Replication
[Figure, left: master and slaves (load balancing). The client sends write queries to the master server and read queries, with load balancing, to the slave servers; each server holds a database and an update log, and the slaves replicate the difference.]
[Figure, right: dual master (fail over). The client queries the active master, or the standby master if the active master is dead; each master holds a database and an update log, and the difference is replicated between them.]
Thread Pool Model
[Figure: an epoll/kqueue event queue, a task manager with a task queue, and a pool of worker threads]
• first of all, the listening socket is enqueued into the epoll queue (listen, epoll_ctl(add))
• epoll_wait: if the event is about the listener, accept the client connection and add it to the epoll queue (accept, epoll_ctl(add))
• otherwise, move the readable client socket from the epoll queue to the task queue (epoll_ctl(del), enqueue)
• worker threads dequeue and do each task
• the client socket is queued back into the epoll queue if keep-alive
Lua Extension
• defines DB operations as Lua functions
  – clients send the function name and record data
  – the server returns the return value of the function
• options about atomicity
  – no locking / record locking / global locking
[Figure: front-end clients send a request (function name, key data, value data) to the back end, where Tokyo Tyrant hands it to the Lua processor running a script of user-defined operations on the Tokyo Cabinet database; the response carries the result data.]
case: Timestamp DB at mixi.jp
• 20 million records
  – each record size is 20 bytes
• more than 10,000 updates per sec.
  – keeps 10,000 connections
• dual master replication
  – only one server in each role (active and standby)
• memcached compatible protocol
  – reuses existing Perl clients
[Figure: mod_perl pages (home.pl, show_friend.pl, view_diary.pl, search.pl, other pages) update TT (active), which replicates to TT (standby); list_friend.pl and list_bookmark.pl fetch from TT.]
case: Cache for Big Storages
• works as proxy
  – mediates insert/search
  – write through, read through
• Lua extension
  – atomic operation by record locking
  – uses LuaSocket to access the storage
• appropriate DB scheme
  – TCMDB: for generic cache
  – TCNDB: for biased access
  – TCHDB: for large records such as images
  – TCFDB: for small records such as timestamps
[Figure: clients talk to Tokyo Tyrant, whose Lua functions sit between the cache and the MySQL/HBase databases. Atomic insert: 1. inserts to the storage, 2. inserts to the cache. Atomic search: 1. retrieves from the cache and returns if found, 2. retrieves from the storage, 3. inserts to the cache.]
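The write-through and read-through steps of the proxy can be sketched in a few lines. On the slide this logic lives in Lua functions inside Tokyo Tyrant and the storage is MySQL or HBase; the version below is only a stand-in under stated assumptions, with a toy array as the storage, a small direct-mapped array as the cache, and hypothetical names (`Proxy`, `proxy_put`, `proxy_get`). Record locking is omitted, so it shows the order of operations, not the atomicity.

```c
#include <assert.h>
#include <stdbool.h>

#define STORE_SIZE 64   /* toy backing storage: keys are 0..63 */
#define CACHE_SLOTS 8   /* toy direct-mapped cache */

typedef struct { bool used; int key; int value; } Slot;

typedef struct {
  bool present[STORE_SIZE];   /* which keys exist in the storage */
  int  values[STORE_SIZE];
  int  reads;                 /* counts trips to the slow storage */
  Slot cache[CACHE_SLOTS];
} Proxy;

/* Place a record in its cache slot, evicting whatever was there. */
static void cache_set(Proxy *p, int key, int value) {
  Slot *s = &p->cache[key % CACHE_SLOTS];
  s->used = true;
  s->key = key;
  s->value = value;
}

/* Write through: 1. insert into the storage, 2. insert into the cache. */
void proxy_put(Proxy *p, int key, int value) {
  assert(key >= 0 && key < STORE_SIZE);
  p->present[key] = true;
  p->values[key] = value;
  cache_set(p, key, value);
}

/* Read through: 1. try the cache and return if found,
   2. retrieve from the storage, 3. insert into the cache. */
bool proxy_get(Proxy *p, int key, int *out) {
  assert(key >= 0 && key < STORE_SIZE);
  Slot *s = &p->cache[key % CACHE_SLOTS];
  if (s->used && s->key == key) {
    *out = s->value;
    return true;
  }
  p->reads++;
  if (!p->present[key]) return false;
  *out = p->values[key];
  cache_set(p, key, *out);
  return true;
}
```

A zero-initialized `Proxy` starts empty. The `reads` counter makes the effect visible: a second lookup of the same key is served from the cache and does not touch the storage.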
Example Code
#include <tcrdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

int main(int argc, char **argv){

  TCRDB *rdb;
  int ecode;
  char *value;

  /* create the object */
  rdb = tcrdbnew();

  /* connect to the server */
  if(!tcrdbopen(rdb, "localhost", 1978)){
    ecode = tcrdbecode(rdb);
    fprintf(stderr, "open error: %s\n", tcrdberrmsg(ecode));
  }

  /* store records */
  if(!tcrdbput2(rdb, "foo", "hop") ||
     !tcrdbput2(rdb, "bar", "step") ||
     !tcrdbput2(rdb, "baz", "jump")){
    ecode = tcrdbecode(rdb);
    fprintf(stderr, "put error: %s\n", tcrdberrmsg(ecode));
  }

  /* retrieve records */
  value = tcrdbget2(rdb, "foo");
  if(value){
    printf("%s\n", value);
    free(value);
  } else {
    ecode = tcrdbecode(rdb);
    fprintf(stderr, "get error: %s\n", tcrdberrmsg(ecode));
  }

  /* close the connection */
  if(!tcrdbclose(rdb)){
    ecode = tcrdbecode(rdb);
    fprintf(stderr, "close error: %s\n", tcrdberrmsg(ecode));
  }

  /* delete the object */
  tcrdbdel(rdb);

  return 0;
}
Tokyo Dystopia
- full-text search engine -
Features
• full-text search engine
  – manages databases of Tokyo Cabinet as an inverted
    index

• combines two tokenizers
  – character N-gram (bi-gram) method
     • perfect recall ratio
  – simple words from an external language processor
     • high accuracy and high performance

• high performance/scalability
  – handles more than 10 million documents
  – searches in milliseconds
• optimized for professional use
  – layered architecture of APIs
  – no embedded scoring system
    • to be combined with an external scoring system
  – no text filter, no crawler, no language processor
• convenient utilities
  – multilingualism with Unicode
  – set operations
  – phrase matching, prefix matching, suffix
    matching, and token matching
  – command line utilities
Inverted Index
• stands on key/value database
  – key = token
     • N-gram or simple word
  – value = occurrence data (posting list)
     • list of pairs of document number and offset in the document

• uses B+ tree database
  – reduces write operations to the disk device
  – enables common prefix search for tokens
  – delta encoding and deflate compression

       ID:21            text: "abracadabra"
          a    -   21:10          br - 21:1, 21:8
          ab   -   21:0, 21:7     ca - 21:4
          ac   -   21:3           da - 21:6
          ad   -   21:5           ra - 21:2, 21:9
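The "abracadabra" example can be reproduced with a short tokenizer: slide a two-character window over the text and emit one posting (document number and offset) per position, with the final character becoming a one-character token. The function name `bigram_postings` and the output format are assumptions made for this sketch, not part of the Tokyo Dystopia API, and it works on bytes, whereas a real engine would step over Unicode characters. Postings come out in document order; an index would then group them by token.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the postings of `text` for document `id` into `out`, one per line,
   formatted as "token<TAB>id:offset". Returns the number of postings written. */
int bigram_postings(int id, const char *text, char *out, size_t outsz) {
  assert(text != NULL && out != NULL && outsz > 0);
  size_t len = strlen(text), used = 0;
  int count = 0;
  out[0] = '\0';
  for (size_t i = 0; i < len; i++) {
    /* The final character has no successor, so it forms a 1-character token. */
    int toklen = (i + 1 < len) ? 2 : 1;
    int n = snprintf(out + used, outsz - used, "%.*s\t%d:%zu\n",
                     toklen, text + i, id, i);
    if (n < 0 || (size_t)n >= outsz - used) {
      out[used] = '\0';   /* drop the partial line: the buffer is full */
      break;
    }
    used += (size_t)n;
    count++;
  }
  return count;
}
```

For document 21 with the text "abracadabra" this produces 11 postings, including `ab` at offsets 0 and 7 and the trailing `a` at offset 10.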
Layered Architecture
• character N-gram index
  – "q-gram index" (index only), and "indexed database"
  – uses embedded tokenizer
• word index
  – "word index" (index only), and "tagged database"
  – uses external tokenizer
[Figure: layers, top to bottom: Applications; Character N-gram Index (indexed database over q-gram index) beside Tagging Index (tagged database over word index); Tokyo Cabinet]
case: friend search at mixi.jp
• 20 million records
  – each record size is 1K bytes
  – name and self introduction
• more than 100 qps
• attribute narrowing
  – gender, address, birthday
  – multiple sort orders
• distributed processing
  – more than 10 servers
  – indexer, searchers, merger
• ranking by social graph
  – the merger scores the result by following the friend links
[Figure: the user interface sends a query to the merger, which holds TT's cache and a copy of the social graph and queries the searchers; each searcher holds an inverted index and an attribute DB copied from the indexer, which builds them from profile data dumped from the profile DB.]
Example Code
#include <dystopia.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

int main(int argc, char **argv){
  TCIDB *idb;
  int ecode, rnum, i;
  uint64_t *result;
  char *text;

  /* create the object */
  idb = tcidbnew();

  /* open the database */
  if(!tcidbopen(idb, "casket", IDBOWRITER | IDBOCREAT)){
    ecode = tcidbecode(idb);
    fprintf(stderr, "open error: %s\n", tcidberrmsg(ecode));
  }

  /* store records */
  if(!tcidbput(idb, 1, "George Washington") ||
     !tcidbput(idb, 2, "John Adams") ||
     !tcidbput(idb, 3, "Thomas Jefferson")){
    ecode = tcidbecode(idb);
    fprintf(stderr, "put error: %s\n", tcidberrmsg(ecode));
  }

  /* search records */
  result = tcidbsearch2(idb, "john || thomas", &rnum);
  if(result){
    for(i = 0; i < rnum; i++){
      text = tcidbget(idb, result[i]);
      if(text){
        printf("%d\t%s\n", (int)result[i], text);
        free(text);
      }
    }
    free(result);
  } else {
    ecode = tcidbecode(idb);
    fprintf(stderr, "search error: %s\n", tcidberrmsg(ecode));
  }

  /* close the database */
  if(!tcidbclose(idb)){
    ecode = tcidbecode(idb);
    fprintf(stderr, "close error: %s\n", tcidberrmsg(ecode));
  }

  /* delete the object */
  tcidbdel(idb);

  return 0;
}
Tokyo Promenade
- content management system -
Features
• content management system
  – manages Web contents easily with a browser
  – available as BBS, Blog, and Wiki

• simple and logical interface
  – aims at conciseness like LaTeX
  – optimized for text browsers such as w3m and Lynx
  – complying with XHTML 1.0 and considering WCAG 1.0

• high performance/throughput
  – implemented in pure C
  – uses Tokyo Cabinet and supports FastCGI
  – 0.836ms/view (more than 1,000 qps)
• sufficient functionality
  – simple Wiki formatting
  – file uploader and manager
  – user authentication by the login form
  – guest comment authorization by a riddle
  – supports the sidebar navigation
  – full-text/attribute search, calendar view
  – Atom feed
• flexible customizability
  – thorough separation of logic and presentation
  – template file to generate the output
  – server side scripting by the Lua extension
  – post-processing by external commands
Example Code
#!   Introduction to Tokyo Cabinet
#c   2009-11-05T18:58:39+09:00
#m   2009-11-05T18:58:39+09:00
#o   mikio
#t   database,programming,tokyocabinet

This article describes what is [[Tokyo Cabinet|http://1978th.net/tokyocabinet/]] and how to use it.

@ upfile:1257415094-logo-ja.png

* Features

- modern implementation of DBM
-- key/value database
-- e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB
- simple library = process embedded
- Successor of QDBM
-- C99 and POSIX compatible, using Pthread, mmap, etc...
-- Win32 porting is work-in-progress
- high performance
- insert: 0.4 sec/1M records (2,500,000 qps)
- search: 0.33 sec/1M records (3,000,000 qps)
innovating more and yet more...
               http://1978th.net/
Introduction to Tokyo Products

More Related Content

What's hot

Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
iOS/macOSとAndroid/Linuxのサンドボックス機構について調べた
iOS/macOSとAndroid/Linuxのサンドボックス機構について調べたiOS/macOSとAndroid/Linuxのサンドボックス機構について調べた
iOS/macOSとAndroid/Linuxのサンドボックス機構について調べたYoshio Hanawa
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기NHN FORWARD
 
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개OpenStack Korea Community
 
Logging using ELK Stack for Microservices
Logging using ELK Stack for MicroservicesLogging using ELK Stack for Microservices
Logging using ELK Stack for MicroservicesVineet Sabharwal
 
MySQL Performance Schema in 20 Minutes
 MySQL Performance Schema in 20 Minutes MySQL Performance Schema in 20 Minutes
MySQL Performance Schema in 20 MinutesSveta Smirnova
 
MacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationMacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationAngel Boy
 
Distributed Locking in Kubernetes
Distributed Locking in KubernetesDistributed Locking in Kubernetes
Distributed Locking in KubernetesRafał Leszko
 
High Availability Content Caching with NGINX
High Availability Content Caching with NGINXHigh Availability Content Caching with NGINX
High Availability Content Caching with NGINXNGINX, Inc.
 
[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기NAVER D2
 
Monitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstatsMonitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstatsBen Mildren
 
Ceph アーキテクチャ概説
Ceph アーキテクチャ概説Ceph アーキテクチャ概説
Ceph アーキテクチャ概説Emma Haruka Iwao
 
Differences between MariaDB 10.3 & MySQL 8.0
Differences between MariaDB 10.3 & MySQL 8.0Differences between MariaDB 10.3 & MySQL 8.0
Differences between MariaDB 10.3 & MySQL 8.0Colin Charles
 
[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-Region
[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-Region[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-Region
[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-RegionJi-Woong Choi
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performancePostgreSQL-Consulting
 
MySQL GTID 시작하기
MySQL GTID 시작하기MySQL GTID 시작하기
MySQL GTID 시작하기I Goo Lee
 
A whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerA whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerNikita Popov
 
Understanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksUnderstanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksJignesh Shah
 

What's hot (20)

Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
iOS/macOSとAndroid/Linuxのサンドボックス機構について調べた
iOS/macOSとAndroid/Linuxのサンドボックス機構について調べたiOS/macOSとAndroid/Linuxのサンドボックス機構について調べた
iOS/macOSとAndroid/Linuxのサンドボックス機構について調べた
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기
 
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
[OpenInfra Days Korea 2018] (Track 2) Neutron LBaaS 어디까지 왔니? - Octavia 소개
 
Logging using ELK Stack for Microservices
Logging using ELK Stack for MicroservicesLogging using ELK Stack for Microservices
Logging using ELK Stack for Microservices
 
MySQL Performance Schema in 20 Minutes
 MySQL Performance Schema in 20 Minutes MySQL Performance Schema in 20 Minutes
MySQL Performance Schema in 20 Minutes
 
MacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationMacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) Exploitation
 
Distributed Locking in Kubernetes
Distributed Locking in KubernetesDistributed Locking in Kubernetes
Distributed Locking in Kubernetes
 
High Availability Content Caching with NGINX
High Availability Content Caching with NGINXHigh Availability Content Caching with NGINX
High Availability Content Caching with NGINX
 
[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기[234]멀티테넌트 하둡 클러스터 운영 경험기
[234]멀티테넌트 하둡 클러스터 운영 경험기
 
Monitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstatsMonitoring IO performance with iostat and pt-diskstats
Monitoring IO performance with iostat and pt-diskstats
 
Ceph アーキテクチャ概説
Ceph アーキテクチャ概説Ceph アーキテクチャ概説
Ceph アーキテクチャ概説
 
Differences between MariaDB 10.3 & MySQL 8.0
Differences between MariaDB 10.3 & MySQL 8.0Differences between MariaDB 10.3 & MySQL 8.0
Differences between MariaDB 10.3 & MySQL 8.0
 
Galera Cluster Best Practices for DBA's and DevOps Part 1
Galera Cluster Best Practices for DBA's and DevOps Part 1Galera Cluster Best Practices for DBA's and DevOps Part 1
Galera Cluster Best Practices for DBA's and DevOps Part 1
 
[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-Region
[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-Region[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-Region
[오픈소스컨설팅] Open Stack Ceph, Neutron, HA, Multi-Region
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
MySQL GTID 시작하기
MySQL GTID 시작하기MySQL GTID 시작하기
MySQL GTID 시작하기
 
A whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerA whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizer
 
PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"
 
Understanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksUnderstanding PostgreSQL LW Locks
Understanding PostgreSQL LW Locks
 

Similar to Introduction to Tokyo Products

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Spring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataSpring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataRoger Xia
 
Doug Cutting on the State of the Hadoop Ecosystem




Introduction to Tokyo Products

  • 1. Introduction to Tokyo Products Mikio Hirabayashi <hirarin@gmail.com>
  • 2. Tokyo Products • Tokyo Cabinet – database library • Tokyo Tyrant – database server • Tokyo Dystopia – full-text search engine • Tokyo Promenade – content management system • open source – released under LGPL • powerful, portable, practical – written in the standard C, optimized to POSIX [Diagram: applications, custom storage, Tyrant, Dystopia, and Promenade stand on Cabinet, which stands on the file system]
  • 3. Tokyo Cabinet - database library -
  • 4. Features • modern implementation of DBM – key/value database • e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB – simple library = process embedded – Successor of QDBM • C99 and POSIX compatible, using Pthread, mmap, etc... • Win32 porting is work-in-progress • high performance – insert: 0.4 sec/1M records (2,500,000 qps) – search: 0.33 sec/1M records (3,000,000 qps)
  • 5. • high concurrency – multi-thread safe – read/write locking by records • high scalability – hash and B+tree structure = O(1) and O(log N) – no practical limit on the size of a database file (up to 8 exabytes) • transaction – write ahead logging and shadow paging – ACID properties • various APIs – on-memory list/hash/tree – file hash/B+tree/array/table • script language bindings – Perl, Ruby, Java, Lua, Python, PHP, Haskell, Erlang, etc.
  • 6. TCHDB: Hash Database • static hashing – O(1) time complexity • separate chaining – binary search tree – balances by the second hash • free block pool – best fit allocation – dynamic defragmentation • combines mmap and pwrite/pread – saves calling system calls • compression – deflate(gzip)/bzip2/custom [Diagram: a bucket array whose slots point to chains of key/value records]
  • 7. TCBDB: B+ Tree Database • B+ tree – O(log N) time complexity • page caching – LRU removing – speculative search • stands on hash DB – records pages in hash DB – inherits its time and space efficiency • custom comparison function – prefix/range matching • cursor – jump/next/prev [Diagram: a B tree index over pages of key/value records]
  • 8. TCFDB: Fixed-length Database • array of fixed-length elements – O(1) time complexity – natural number keys – addresses records by multiple of key • most effective – bulk load by mmap – no key storage per record – extremely fast and concurrent [Diagram: an array of fixed-width value cells]
  • 9. TCTDB: Table Database • column based – the primary key and named columns – stands on hash DB • flexible structure – no data schema, no data type – various structure for each record • query mechanism – various operators matching column values – lexical/decimal orders by column values • column indexes – implemented with B+ tree – typed as string/number – inverted index of token/q-gram – query optimizer [Diagram: a bucket array pointing to records, each a primary key followed by name/value columns, with a separate column index]
  • 10. On-memory Structures • TCXSTR: extensible string – concatenation, formatted allocation • TCLIST: array list (deque) – random access by index – push/pop, unshift/shift, insert/remove • TCMAP: map of hash table – insert/remove/search – iterator by order of insertion • TCTREE: map of ordered tree – insert/remove/search – iterator by order of comparison function
  • 11. Other Mechanisms • abstract database – common interface of 6 schemes • on-memory hash, on-memory tree • file hash, file B+tree, file array, file table – decides the concrete scheme at runtime • remote database – network interface of the abstract database – yes, it's Tokyo Tyrant! • miscellaneous utilities – string processing, filesystem operation – memory pool, encoding/decoding
  • 12. Example Code
      #include <tcutil.h>
      #include <tchdb.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <stdbool.h>
      #include <stdint.h>

      int main(int argc, char **argv){
        TCHDB *hdb;
        int ecode;
        char *key, *value;

        /* create the object */
        hdb = tchdbnew();

        /* open the database */
        if(!tchdbopen(hdb, "casket.hdb", HDBOWRITER | HDBOCREAT)){
          ecode = tchdbecode(hdb);
          fprintf(stderr, "open error: %s\n", tchdberrmsg(ecode));
        }

        /* store records */
        if(!tchdbput2(hdb, "foo", "hop") ||
           !tchdbput2(hdb, "bar", "step") ||
           !tchdbput2(hdb, "baz", "jump")){
          ecode = tchdbecode(hdb);
          fprintf(stderr, "put error: %s\n", tchdberrmsg(ecode));
        }

        /* retrieve records */
        value = tchdbget2(hdb, "foo");
        if(value){
          printf("%s\n", value);
          free(value);
        } else {
          ecode = tchdbecode(hdb);
          fprintf(stderr, "get error: %s\n", tchdberrmsg(ecode));
        }

        /* traverse records */
        tchdbiterinit(hdb);
        while((key = tchdbiternext2(hdb)) != NULL){
          value = tchdbget2(hdb, key);
          if(value){
            printf("%s:%s\n", key, value);
            free(value);
          }
          free(key);
        }

        /* close the database */
        if(!tchdbclose(hdb)){
          ecode = tchdbecode(hdb);
          fprintf(stderr, "close error: %s\n", tchdberrmsg(ecode));
        }

        /* delete the object */
        tchdbdel(hdb);

        return 0;
      }
  • 14. Features • network server of Tokyo Cabinet – client/server model – multiple applications can access one database – effective binary protocol • compatible protocols – supports memcached protocol and HTTP – available from most popular languages • high concurrency/performance – resolves "c10k" with epoll/kqueue/eventports – 17.2 sec/1M queries (58,000 qps)
  • 15. • high availability – hot backup and update log – asynchronous replication between servers • various database schema – using the abstract database API of Tokyo Cabinet • effective operations – no-reply updating, multi-record retrieval – atomic increment • Lua extension – defines arbitrary database operations – atomic operation by record locking • pure script language interfaces – Perl, Ruby, Java, Python, PHP, Erlang, etc...
  • 16. Asynchronous Replication • master and slaves (load balancing) – clients send write queries to the master server – clients send read queries to the slave servers with load balancing – each slave server replicates the difference from the master's update log into its own database and update log • dual master (fail over) – clients query the active master – clients query the standby master if the active master is dead – the active master and the standby master each keep a database and an update log, and replicate the difference
  • 17. Thread Pool Model (epoll/kqueue) – first of all, the listening socket is enqueued into the epoll queue – the task manager calls epoll_wait – accept the client connection if the event is about the listener, then epoll_ctl(add) – move the readable client socket from the epoll queue to the task queue: epoll_ctl(del), then enqueue – worker threads dequeue and do each task – queue back into the epoll queue if keep-alive
  • 18. Lua Extension • defines DB operations as Lua functions – clients send the function name and record data – the server returns the return value of the function • options about atomicity – no locking / record locking / global locking [Diagram: clients (front end) send a request with the function name, key data, and value data to Tokyo Tyrant (back end); its Lua processor runs the user-defined operations in the script against the Tokyo Cabinet database and sends back a response with the result data]
  • 19. case: Timestamp DB at mixi.jp • 20 million records – each record size is 20 bytes • more than 10,000 updates per sec. – keeps 10,000 connections • dual master – each server has only one database • memcached compatible protocol – reuses existing Perl clients [Diagram: mod_perl pages (home.pl, show_friend.pl, view_diary.pl, search.pl, other pages) update the database of the active TT; list_friend.pl and list_bookmark.pl fetch from it; the active TT and the standby TT replicate to each other]
  • 20. case: Cache for Big Storages • works as proxy – mediates insert/search – write through, read through • Lua extension – atomic operation by record locking – uses LuaSocket to access the storage • proper DB scheme – TCMDB: for generic cache – TCNDB: for biased access – TCHDB: for large records such as image – TCFDB: for small records such as timestamp [Diagram: clients reach the MySQL/HBase storage through Tokyo Tyrant and its cache database. Atomic insert (Lua): 1. inserts to the storage, 2. inserts to the cache. Atomic search (Lua): 1. retrieves from the cache and returns if found, 2. retrieves from the storage, 3. inserts to the cache]
  • 21. Example Code
      #include <tcrdb.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <stdbool.h>
      #include <stdint.h>

      int main(int argc, char **argv){
        TCRDB *rdb;
        int ecode;
        char *value;

        /* create the object */
        rdb = tcrdbnew();

        /* connect to the server */
        if(!tcrdbopen(rdb, "localhost", 1978)){
          ecode = tcrdbecode(rdb);
          fprintf(stderr, "open error: %s\n", tcrdberrmsg(ecode));
        }

        /* store records */
        if(!tcrdbput2(rdb, "foo", "hop") ||
           !tcrdbput2(rdb, "bar", "step") ||
           !tcrdbput2(rdb, "baz", "jump")){
          ecode = tcrdbecode(rdb);
          fprintf(stderr, "put error: %s\n", tcrdberrmsg(ecode));
        }

        /* retrieve records */
        value = tcrdbget2(rdb, "foo");
        if(value){
          printf("%s\n", value);
          free(value);
        } else {
          ecode = tcrdbecode(rdb);
          fprintf(stderr, "get error: %s\n", tcrdberrmsg(ecode));
        }

        /* close the connection */
        if(!tcrdbclose(rdb)){
          ecode = tcrdbecode(rdb);
          fprintf(stderr, "close error: %s\n", tcrdberrmsg(ecode));
        }

        /* delete the object */
        tcrdbdel(rdb);

        return 0;
      }
  • 22. Tokyo Dystopia - full-text search engine -
  • 23. Features • full-text search engine – manages databases of Tokyo Cabinet as an inverted index • combines two tokenizers – character N-gram (bi-gram) method • perfect recall ratio – simple word by outer language processor • high accuracy and high performance • high performance/scalability – handles more than 10 million documents – searches in milliseconds
  • 24. • optimized for professional use – layered architecture of APIs – no embedded scoring system • to be combined with an outer scoring system – no text filter, no crawler, no language processor • convenient utilities – multilingualism with Unicode – set operations – phrase matching, prefix matching, suffix matching, and token matching – command line utilities
  • 25. Inverted Index • stands on key/value database – key = token • N-gram or simple word – value = occurrence data (posting list) • list of pairs of document number and offset in the document • uses B+ tree database – reduces write operations into the disk device – enables common prefix search for tokens – delta encoding and deflate compression [Example: ID:21, text: "abracadabra" gives a - 21:10; ab - 21:0, 21:7; ac - 21:3; ad - 21:5; br - 21:1, 21:8; ca - 21:4; da - 21:6; ra - 21:2, 21:9]
  • 26. Layered Architecture • character N-gram index – "q-gram index" (only index), and "indexed database" – uses embedded tokenizer • word index – "word index" (only index), and "tagged database" – uses outer tokenizer [Diagram: applications stand on two stacks, the character N-gram index (indexed database over q-gram index) and the tagging index (tagged database over word index), which both stand on Tokyo Cabinet]
  • 27. case: friend search at mixi.jp • 20 million records – each record size is 1K bytes – name and self introduction • more than 100 qps • attribute narrowing – gender, address, birthday – multiple sort orders • distributed processing – more than 10 servers – indexer, searchers, merger • ranking by social graph – the merger scores the result by following the friend links [Diagram: the user interface sends a query to the merger, which queries the searchers and TT's social graph cache (a copy of the social graph); the indexer builds an inverted index and an attribute DB from profile data dumped from the profile DB, and the index and the DB are copied to each searcher]
  • 28. Example Code
      #include <dystopia.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <stdbool.h>
      #include <stdint.h>

      int main(int argc, char **argv){
        TCIDB *idb;
        int ecode, rnum, i;
        uint64_t *result;
        char *text;

        /* create the object */
        idb = tcidbnew();

        /* open the database */
        if(!tcidbopen(idb, "casket", IDBOWRITER | IDBOCREAT)){
          ecode = tcidbecode(idb);
          fprintf(stderr, "open error: %s\n", tcidberrmsg(ecode));
        }

        /* store records */
        if(!tcidbput(idb, 1, "George Washington") ||
           !tcidbput(idb, 2, "John Adams") ||
           !tcidbput(idb, 3, "Thomas Jefferson")){
          ecode = tcidbecode(idb);
          fprintf(stderr, "put error: %s\n", tcidberrmsg(ecode));
        }

        /* search records */
        result = tcidbsearch2(idb, "john || thomas", &rnum);
        if(result){
          for(i = 0; i < rnum; i++){
            text = tcidbget(idb, result[i]);
            if(text){
              printf("%d\t%s\n", (int)result[i], text);
              free(text);
            }
          }
          free(result);
        } else {
          ecode = tcidbecode(idb);
          fprintf(stderr, "search error: %s\n", tcidberrmsg(ecode));
        }

        /* close the database */
        if(!tcidbclose(idb)){
          ecode = tcidbecode(idb);
          fprintf(stderr, "close error: %s\n", tcidberrmsg(ecode));
        }

        /* delete the object */
        tcidbdel(idb);

        return 0;
      }
  • 29. Tokyo Promenade - content management system -
  • 30. Features • content management system – manages Web contents easily with a browser – available as BBS, Blog, and Wiki • simple and logical interface – aims at conciseness like LaTeX – optimized for text browsers such as w3m and Lynx – complying with XHTML 1.0 and considering WCAG 1.0 • high performance/throughput – implemented in pure C – uses Tokyo Cabinet and supports FastCGI – 0.836ms/view (more than 1,000 qps)
  • 31. • sufficient functionality – simple Wiki formatting – file uploader and manager – user authentication by the login form – guest comment authorization by a riddle – supports the sidebar navigation – full-text/attribute search, calendar view – Atom feed • flexible customizability – thorough separation of logic and presentation – template file to generate the output – server side scripting by the Lua extension – post processing by outer commands
  • 32. Example Code
      #! Introduction to Tokyo Cabinet
      #c 2009-11-05T18:58:39+09:00
      #m 2009-11-05T18:58:39+09:00
      #o mikio
      #t database,programming,tokyocabinet

      This article describes what is [[Tokyo Cabinet|http://1978th.net/tokyocabinet/]] and how to use it.

      @ upfile:1257415094-logo-ja.png

      * Features

      - modern implementation of DBM
      -- key/value database
      -- e.g.) DBM, NDBM, GDBM, TDB, CDB, Berkeley DB
      - simple library = process embedded
      - Successor of QDBM
      -- C99 and POSIX compatible, using Pthread, mmap, etc...
      -- Win32 porting is work-in-progress
      - high performance
      - insert: 0.4 sec/1M records (2,500,000 qps)
      - search: 0.33 sec/1M records (3,000,000 qps)
  • 33. innovating more and yet more... http://1978th.net/