Growing in the Wild. The story by CUBRID database developers (Esen Sagynov, Eugene Stoyanovic)


  • Self introduction.
  • Eugen: When I started thinking about this presentation, this is the outcome I wanted from it. For the experienced people in the audience, these are the thoughts I want you to leave with: some guys talked about some cool stuff they encountered in applications (you won't remember what); there's a database they use for this type of application, it's open source and it saves a lot of trouble (you won't remember what trouble exactly); and they're really keen on doing things right. This is what I remember from every presentation I've attended, not the details. So I don't expect you to remember the technical details. What I want is for you to grasp the concept of what we will talk about.
  • CUBRID is a fully-featured Relational Database Management System.
  • When we were invited to speak at a Russian IT conference, the committee members asked us to explain WHY NHN started CUBRID development. And this is what I'm going to do now. We've never actually explained this at other conferences.
  • Why not use existing solutions? Why start from scratch? Why not fork existing solutions? Why not co-develop? These are the common questions asked by users. To answer them, we first need to understand who NHN is and what resources it possesses that allowed it to pull off a project like CUBRID.
  • Some of these services, such as online games and billing systems, already use Oracle databases, both Standard and Enterprise, and Microsoft SQL Server. Other services use MySQL. CUBRID is used in WRITE-intensive services such as logging and spam filtering, as well as in READ-intensive services such as commenting and monitoring systems. Oracle is a super reliable DBMS; we all know this. MySQL is great, too; our DBAs love it very much. But as a service provider, we have certain problems with all of them.
  • They are all commercial. At NHN we have over 10,000 servers. Annually we pay several million dollars to extend licenses and service support. No matter how much we earn from our mainstream business, it's a big chunk of expense we have to pay every year, and we want to cut our expenses. The second disadvantage of existing solutions for us is that they are third-party, so NHN has no control over the course of core development. For this reason we spend quite a lot of money on customizations to serve our needs. There is also another big problem: the communication problem we have with vendors. Most of them are located overseas, and many developers at NHN do not speak English well. They have trouble conveying their requirements to the vendors, which slows the entire development process.
  • Of course NHN considered these two options before developing CUBRID: fork MySQL or another open source solution, or start from scratch. To begin with, in 2006 NHN hired the best database experts and architects in Korea and built a team of 20 developers. Together they analyzed which was the better option: study every line of an existing open source product (e.g. MySQL), understand its philosophy, the reasons behind its architecture, and so on; or create a new DBMS from scratch with an architecture optimized for Web services, with native support for HA, sharding, load balancing, etc. In the end, NHN and the entire crew concluded that starting from scratch and optimizing for Web services was easier and cheaper in terms of time than studying the existing solution.
  • As you will see now, there are many reasons why NHN decided to create a new relational database solution. First, with an in-house solution NHN would cut the cost of ownership significantly. This is important, but it's not the main reason. The key reason is that the CUBRID Database is a core technological asset for NHN. By owning this technology, NHN controls the code base completely: no additional expenses are required for customizations, there are no communication problems with the database developers, and services don't suffer delayed updates, security patches, or bug fixes. Possessing such technology also allows NHN to grow its developers and their skills and knowledge of storage technology. By training the developers, NHN can "export" these skills to other services at NHN, thereby improving the staff quality of those services. I'll tell you what: NHN has invested 100 million dollars to establish a Software Engineering Institute in Korea, so NHN is very serious about nurturing engineering skills. Obviously, by developing CUBRID, NHN can provide a database solution to other platforms and service departments within NHN. Those services become so-called internal customers, and they benefit from fast updates and fixes. There is a synergy effect. However, this is not all. After developing a relational database management system, the company gained the knowledge to develop other recurring solutions such as high availability, sharding, rebalancing, and clustering. And in fact, after CUBRID, NHN developed its own owner-based file system, started the Cluster database solution, and even a distributed database system for petabyte-scale data. At this moment, I hope, you understand why NHN decided to create its own RDBMS.
  • When we started to develop CUBRID, we set a goal that our database solution should be: Stable, Fast, Scalable, and Easy to Use. Thereby, no matter what new feature we add, these four things must not be broken, just as ACID must not be violated in relational databases. In the next slides I will explain how at CUBRID we try to meet these criteria.
  • First, let’s see what we’ve done in CUBRID to meet the performance demands.
  • What every customer wants is a performance boost. They don't care how you do it, but they want it to be fast. At NHN we've identified 3 types of Web services. The first is READ-intensive: news services, wiki sites, blogs, and other user-generated content providers, where READ operations account for almost 99%. Then there are services like SNS sites and push services where around 70% of operations are READ and 30% are WRITE. The last type is INSERT-heavy services, where inserts account for over 90% of operations: log monitoring systems and analytics, for example. Over 90% of all services are READ-heavy. Then we thought about what we could do to provide a satisfactory level of performance. There are 4 CRUD operations commonly used by developers. In CUBRID we needed to provide fast searching and avoid table scans and ORDER BY; increase the performance of concurrent write operations; and improve the locking mechanism. What all of these have in common is that they can be achieved by optimizing the indexing algorithm, because indexes affect how fast the data can be searched, which in turn affects all operations. This is the approach we've taken in CUBRID: with super fast indexes we can satisfy our clients.
  • So we first focused on improving READ operations, because this is what 90% of all services need. We introduced a new concept of shared query plan caches, and the first customers were very happy. As the customer base grew, new clients asked for more WRITE performance. In the next version we improved the algorithm to achieve I/O load balancing when INSERT operations are concentrated at a certain point in time. Increased space reusability was another performance enhancement in that second phase. Phase 3 was the most promising for us. A few big services said they would replace their MySQL deployments with CUBRID if we further improved our READ operations. We redesigned how indexes are stored in CUBRID, which allowed us to reduce the data and index volume size by 70%: all of a sudden your database is half the size. The performance of the database engine increased almost 3 times. Moreover, we significantly improved our Windows binaries by better handling concurrent requests. Thus, we migrated more MySQL-based services to CUBRID. This was the first time the performance of SELECT queries surpassed that of MySQL. Our latest version, CUBRID 8.4.1, is another breakthrough. We received many requests from SNS service providers, which have heavy WRITE operations. They wanted to try CUBRID if we promised to improve WRITE performance. Therefore, we focused on improving INSERT and UPDATE operations by rethinking how the memory buffer and transaction logs were written to disk. We achieved a 70% performance increase over the previous version. So far it's the best CUBRID ever. The next version, code name Apricot, due this summer, will have several super smart improvements to indexes in CUBRID. After that, perhaps, we'll improve JOINs. So this whole performance improvement effort was completely led by our clients.
  • Over one hour, CUBRID manages an average QPS of 3685 with the maximum being 4469 and the minimum 2821. Both these values are close enough to the average and show a slow decrease of performance for CUBRID on the dataset.
  • Over one hour, MySQL manages an average QPS of 1796 with the maximum being 8951 and the minimum 1122. The performance of MySQL is very good at the beginning of the test but falls dramatically after the first few minutes.
  • Even though MySQL performed two times faster than CUBRID before it reached one million rows, by the end of the test CUBRID had inserted twice as much data into the table (~13 million rows for CUBRID versus ~6.5 million rows for MySQL). If you've worked with big data before, you know how important predictable performance is. High performance is good, but predictable performance is king. CUBRID's INSERT performance is predictable.
  • Apart from the must-have optimizations that all databases have, we have special cases which we optimize for the Web (see the list above). Each database has its own particular optimizations, so there's not much to generalize here. For example, MySQL optimizes special cases in inner joins, which gives them better performance over us (but worse on complicated joins), while we heavily optimize range scans and limits, which gives us better performance for those cases. We can't really go that far with inner joins because we have a much more complicated object model: during plan generation we don't really know whether we're dealing with a table, a class, a class hierarchy, or not even an object store at all but just a derived query. Not to mention hierarchical queries, which complicate things even more. Loosely speaking, all query planners are based on the same algorithms designed in the 1970s by some guys at IBM; all current databases keep adding particular cases to them, and we constantly search for common use cases that can be handled better in our database.
The bottom line is that any optimization is a trade-off: you want to have the best query plan, but you don't want to spend too much time generating it, so you have to compromise. To give a clearer example of what I mean by compromise, I'll tell you about the Filter Index feature that will be released in the following version of CUBRID (around mid-July this summer). Take the query plan cache, for example. CUBRID caches 10,000 plans by default. This takes a huge amount of memory; I think it can reach 50+ MB. It's OK for us to do this, but MySQL can't. It's huge considering that, for example, on my website I have 60+ MySQL databases that serve different things: 60 * 50 MB = 3 GB just for caching plans. It is OK for CUBRID because you would only have one database per machine, so 50 MB is not important. It's OK for Oracle too (they use the same technique) because you only have one instance of Oracle per machine. So again, caching plans is not a problem for us, for Oracle, or for MS SQL Server, but it's mostly unacceptable for PostgreSQL and MySQL. CUBRID is not created for every application. You cannot have 60 or 100 databases like hosting companies have for their shared hosting customers. CUBRID is not designed for hosting companies.
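The memory arithmetic above can be sketched in a few lines (the 10,000-plan default, the ~50 MB per-instance estimate, and the 60-database scenario all come from the text; everything else is illustrative):

```python
# Plan-cache memory math from the talk: CUBRID caches up to 10,000
# plans per instance, which the speaker estimates at roughly 50 MB.
PLANS_PER_INSTANCE = 10_000
CACHE_MB_PER_INSTANCE = 50  # speaker's rough estimate

# One CUBRID (or Oracle) instance per machine: the cost is paid once.
single_instance_mb = CACHE_MB_PER_INSTANCE

# A shared-hosting box with 60 separate MySQL-style databases would
# pay the cost per database, which is why the same cache strategy
# is unacceptable there.
shared_hosting_dbs = 60
shared_hosting_mb = shared_hosting_dbs * CACHE_MB_PER_INSTANCE

print(single_instance_mb)  # 50 (MB): fine for one instance per machine
print(shared_hosting_mb)   # 3000 (MB, ~3 GB): just for caching plans
```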
  • Filter indexes (or partial indexes, as they're called in PostgreSQL) allow you to create an index on a subset of the data table, the syntax being something like: (create index). This index will make sure that only the tuples which have 'open' status are in the index. With this index, if you want to find out which tickets are open, you will write a query like: SELECT title, component, assignee FROM tickets WHERE register_date > '2008-01-01' AND status = 'open'; You will naturally only look through tickets that are open, ignoring all the rest. Obviously this is faster (YAY!). You're bound to have only a few tickets open but many, many closed ones. However, the real improvement for this type of index is the performance of insert/update/delete. This is where the trade-off of indexes lies: they're fast for searching but not that good when manipulating data in them. Since this index might end up holding several orders of magnitude less data than the whole table, it is going to work magic on insert statements.
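The CUBRID syntax isn't shown on this slide, so as an illustration, here is the same idea using SQLite's partial indexes, which behave like the filter indexes described above (the table and data are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tickets (title TEXT, component TEXT, "
            "assignee TEXT, register_date TEXT, status TEXT)")

# Partial (filter) index: only rows with status = 'open' are indexed,
# so closed tickets never pay the index-maintenance cost.
cur.execute("CREATE INDEX idx_open ON tickets (register_date) "
            "WHERE status = 'open'")

cur.executemany("INSERT INTO tickets VALUES (?, ?, ?, ?, ?)", [
    ("crash on start", "core", "kim",  "2009-03-01", "open"),
    ("typo in docs",   "docs", "lee",  "2007-05-02", "closed"),
    ("slow query",     "sql",  "park", "2010-11-11", "open"),
])

# The literal predicate implies the index's WHERE clause, so the
# planner can answer this query from the small partial index alone.
rows = cur.execute(
    "SELECT title, component, assignee FROM tickets "
    "WHERE register_date > '2008-01-01' AND status = 'open'"
).fetchall()
print(sorted(rows))  # the two open tickets registered after 2008-01-01
```

The payoff is exactly the one the text describes: the index holds only the handful of open tickets, so both lookups and inserts of closed tickets stay cheap.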
  • This chart shows the penalty normal indexes suffer as the dataset grows. It shows the number of INSERT statements a server can process per second. The data here was generated on my computer; it's not an actual performance test, but it shows the actual trend. You can see that the filter index QPS stays pretty much stable regardless of the dataset size, while the full index QPS slowly starts to decrease as we reach the 10,000,000 mark. OK, so this is a filter index; it's useful in certain situations, and we decided to add support for it in CUBRID. However, there are some pitfalls to this type of index:
  • The first one is CUBRID-specific and has to do with the 3-tier architecture: the server component does not know how to handle expressions in their raw form (status < 'open' has no meaning to it), so what we had to do was extend the executable binary form of a query to be able to start execution anywhere in the execution plan (thus enabling us to execute only the binary form of status = 'open' rather than the whole SELECT … FROM tickets WHERE status = 'open'). This is like requiring a C compiler to be able to handle printing the value of x + 2 without having the main routine. More than this, we had to come up with a rather wicked way of cloning this section of binary code across multiple simultaneous transactions (rather than reinterpreting its serialized form over and over again, since we think disk reads are BAD!). We're really happy with the way this implementation turned out; we've started applying it to other areas like partition pruning and precomputed columns, which will become much faster with this addition.
  • The second pitfall is common to all databases: a FILTER index doesn't really work with parameterized queries: SELECT x FROM tickets WHERE register_date > ? AND status < ?; Obviously, there's no way to know what value '?' will have during plan generation, so we have to assume that the filter index is not enough. We can live with this and do what other databases do in this situation: PostgreSQL says that such a query will not use the partial index. Oracle (where you would use a function index) mentions that the expression of the index must appear in exactly the same form in the query (so even age < 17 will not use the filter index; age < ? is not even on the waiting list). MS SQL Server also says that it will not use the index, but if you want to have parameterized queries with filter indexes you can write: SELECT name, email FROM users WHERE register_date > ? AND age < ? AND age < 18; This is obvious, but MS makes it their motto to spell some things out for their users. OK, so problem solved, we will just do what other databases do and move on, right? No. It turns out that things are not as easy as that. The reason? The "shared" query plan cache.
  • PostgreSQL does not have this problem; they only cache a plan for the lifespan of a driver-level prepared statement. MySQL does not cache query plans at all. (Can you guess why? Yes, this is a trade-off decision also <<great for large scale, horrible for small applications; this could be extended into a nice talk too :P>>)
  • However, CUBRID implements what is called a "shared" query plan cache. A shared query plan cache means that all compiled query plans get cached in a memory area, and any session running the "same" query does not need to generate a new plan for it; it just uses the cached one. The default limit is 10,000 cached query plans. Considering that most applications do not have that many distinct queries (if they're parameterized), we very rarely generate query plans over the life span of an application. We got a huge improvement in overall performance when we added this feature (I think it was added before CUBRID became open source; it was already there when I joined the project).
  • To optimize things even more, before plan generation we convert any literals or constants into a parameter and use the parameterized printed version as a key for indexing the cache: SELECT name, email FROM users WHERE register_date < '2008-01-01' becomes SELECT name, email FROM users WHERE register_date < ? and we never need to regenerate the plan for this query for any value of '?'. OK, so why is this a trap for the filter index? Because caching and evicting plans is a costly process, so we have to be very careful. We simply cannot allow non-parameterized queries into the cache because, and I would put my life on the line for this statement, the first programmer who uses a filter index will be very careful to query all the values of age from -5 to 10 million, filling up the cache and complaining that CUBRID is really slow. Just leaving it like SELECT title, component FROM tickets WHERE register_date < ? AND status = 'open' is not an option. SELECT title, component FROM tickets WHERE register_date < ? AND status = ? is not an option either, because this matches any status (e.g. status = 'closed'), and if we generated a plan with a filter index for status = 'open', we would return fewer results than expected for status IN ('open', 'closed') (the cached plan would be used). So again, we had to come up with a really clever way of solving this issue without affecting people not using filter indexes (the performance of what was already implemented must not be affected).
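The literal-to-parameter normalization described above can be sketched roughly like this. Real SQL tokenization is considerably more involved; the regex and the `cache_key`/`get_plan` names are invented for the illustration:

```python
import re

# Replace string and numeric literals with '?' so that queries
# differing only in constant values share one plan-cache entry.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")

def cache_key(sql: str) -> str:
    """Parameterized printed form of the query, used as the cache key."""
    return _LITERAL.sub("?", sql)

plan_cache = {}

def get_plan(sql):
    """Return a cached plan for the normalized query, compiling once."""
    key = cache_key(sql)
    if key not in plan_cache:
        plan_cache[key] = f"PLAN<{key}>"  # stand-in for a compiled plan
    return plan_cache[key]

get_plan("SELECT name, email FROM users WHERE register_date < '2008-01-01'")
get_plan("SELECT name, email FROM users WHERE register_date < '2011-06-15'")
print(len(plan_cache))  # 1: both queries map to the same cached plan
```

This also makes the filter-index trap concrete: once `status = 'open'` is normalized to `status = ?`, the cached plan would be reused for `status = 'closed'` too, which is exactly why such plans cannot simply be cached under the normalized key.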
  • Built-in scalability is the second major field we've been working on.
  • These are the questions our clients asked when they first approached us. How will the nodes be synched? MySQL provides only asynchronous replication, which affects data consistency. How about a load balancer? How do we choose which slave will run a certain query and maintain the load balance? And fail-over? If the master fails, what will happen? Will the database provide native fail-over? How much will this solution cost?
  • There are various products on the market which provide High-Availability solutions. There is Oracle RAC, very expensive and very reliable; MySQL Cluster; or MySQL's own replication combined with a third-party solution such as MMM. CUBRID has a native transaction-log-based HA feature with 3 replication modes: fully synchronous, semi-synchronous, and asynchronous.
  • So our clients told us that if we could provide a solution which wouldn't halt the application and wouldn't lose any data during a transaction, then they were ready to use CUBRID. The first thing we implemented was one-way replication for READ load balancing. But that was not enough: application developers were still required to implement the fail-over logic within their applications. To remove the potential for human error, we moved further. Version 8.2.0 was revolutionary: we added native HA support based on transaction log replication, with the auto fail-over feature implemented on top of Linux Heartbeat. Our clients were really happy. At the time we migrated tens of services which used to run on MySQL and Microsoft SQL Server to CUBRID. That period was one of the most successful win-backs for CUBRID, covering more than 50 instances (with 10 or fewer, it's easy to do manually). Later on we kept receiving complaints from users that fail-over was often delayed for a long time, and sometimes didn't work at all. We analyzed their applications, created many test scenarios, and found that Linux Heartbeat was not very stable; its behavior was unpredictable. In the next version we developed CUBRID's native heartbeat technology and improved the failure detection algorithm. Then the existing clients started asking for HA monitoring tools, so we created a monitoring tool for them. Then they asked for more features; we developed them. Currently we are working on a few more important features requested by our clients. For example, in a large service with hundreds of database nodes, every day some slave nodes go offline. Of course the application developers never know about this, because the service is HA. So if not the developers, someone has to fix those slave DB nodes: the DBAs. So far they restore the slave nodes manually, and now they want a built-in script to auto-restore them.
And we're working on this. But even though we provide a slave auto-rebuilder, the slave nodes still hold several terabytes of data, so no matter what you do, replicating the data will take time. To solve this, we need to reduce the replication time: we'll introduce multi-threaded replication for slave nodes. This is a big difference from MySQL, whose statement-based replication can lead to data inconsistency. Currently, with 20-30 slaves, the slave DB rebuild time is about 14 hours for 700 GB. An easy admin script with multi-threaded, concurrent replication will make the slave rebuild time shorter and reduce the replication delay.
  • There are many configurations in CUBRID HA. You can build: standard 1:1 Master-Slave HA systems; extended 1:N Master-Slave systems; 1:1:N Master-Slave-Replica; N:N Master-Slave; or compact N:1 Master-Slave. Last year at the OSCON conference I presented HA in its entirety; you can check out the presentation at this link.
  • Right now CUBRID HA is one of the most stable High-Availability solutions on the market. Hundreds of large Web services in Korea use CUBRID HA. At NHN alone we have several big services which monitor over 300 DB instances each. It relies on the native heartbeat technology, has very stable and predictable fail-over, and provides native load balancing through CUBRID's own Broker. So we developed HA for redundancy. With the load balancing provided by the CUBRID Broker, HA provides READ distribution, which is great for READ-intensive services. But what about WRITE distribution?
  • We've developed database sharding in CUBRID! The difference between partitioning and sharding is that with partitioning you divide the data between multiple tables with identical schemas within one database, while with sharding you divide data between tables located in different databases. Sometimes the database gets so big that mere table partitioning is not enough; in fact, it will hinder the performance of the entire system. So we'd better add new databases, otherwise called shards. If HA is for READ distribution, sharding is for WRITE distribution, as you can write to different databases simultaneously. This is something most developers dream of having on the database side rather than in the application layer. Database sharding doesn't just simplify developers' lives; it also improves both application and database performance. The application gets rid of the sharding logic, and the database reduces the index size. Win-win!
  • Without built-in sharding, developers need to implement the sharding logic in the application layer using some sort of framework or the like. When the data in the existing databases has grown too much, the developers need to add a new database to partition the data. Since the sharding logic lives in the application layer, the developers also need to add a separate broker for it, as well as update the application code to relay some of the traffic to the new database. This is, in fact, a very common architecture.
  • With built-in database sharding, there is no work for developers; everything is on the DBAs' side. The developers see only one database. To determine which shard has to be queried, the developers can send the shard_id together with the query. When a new shard is added, all the DBA has to do is update the Metadata Directory at the Broker level. That's all. The application never knows whether the data is partitioned or not.
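The broker-side routing just described can be sketched as follows; the metadata dictionary, hostnames, and `route()` function are all invented for illustration, with modulo distribution standing in for the strategies CUBRID SHARD supports:

```python
# Broker-side shard routing: the application supplies a shard key with
# the query; the broker's metadata directory maps the computed shard_id
# to a physical database.
metadata_directory = {
    0: "shard_db_0.example.internal",
    1: "shard_db_1.example.internal",
    2: "shard_db_2.example.internal",
}

def route(shard_key: int) -> str:
    """Modulo distribution: shard_id = key mod number of shards."""
    shard_id = shard_key % len(metadata_directory)
    return metadata_directory[shard_id]

# The application only supplies the key. Adding a shard is a metadata
# update on the broker side, invisible to the application code.
print(route(7))  # user 7 -> shard 1
print(route(9))  # user 9 -> shard 0
```

Swapping the body of `route()` is where a custom SHARD_ID library would plug in: DATETIME buckets, hash/range calculations, or an application-specific algorithm.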
  • One of our clients, who operates the largest news service in Korea, requested this feature, saying that if we provide seamless database sharding, they will be eager to replace their MySQL servers with CUBRID. We've already completed the 1st phase. CUBRID SHARD allows creating an unlimited number of shards. The data can be distributed based on modulo, DATETIME, or hash/range calculations. Developers can even feed in their own library to calculate the SHARD_ID using a complicated custom algorithm. CUBRID SHARD will natively support connection and statement pooling as well as load balancing. These days we are performing QA on HA support for shards. Another unique feature of CUBRID sharding is that you can use it with a heterogeneous database backend, i.e. some data can be stored in CUBRID, some in MySQL or even Oracle. This first version of CUBRID SHARD, which will be released in the coming months, will require DBAs to create the necessary number of shards in advance. That is, this first version doesn't provide dynamic scale-out; otherwise it would become a full cluster solution and we would call it CUBRID Cluster instead. But sharding is not a cluster solution, so at this moment it doesn't allow you to dynamically add or remove shards. You have to decide that in advance. You know what? Didn't I tell you at the beginning that one of the reasons NHN started CUBRID development was that the project would enable recurring projects? Developers from other platforms at NHN have already created a stable data rebalancing technology. In the next phase we will merge that technology into CUBRID, which will allow DBAs to add shards and seamlessly redistribute the data among them. It's going to be a revolutionary product, all in one: a fast, stable RDBMS which provides seamless scalability. This is what CUBRID is, and this is what you can expect from it in the coming months. Once we roll out sharding, a few candidate Web services back in our country can try it for the first time as part of their migration to CUBRID.
  • Thus, with sharding the developers can eradicate the sharding logic from their applications. All they will see is a single database view. They will no longer need to change the application code to adjust to growing data. More than that, the developers can define various sharding strategies by feeding in their own libraries to calculate the SHARD_ID. The DBAs will no longer have to manually distribute data to the new shards; everything will be done automatically. DBAs will be able to combine CUBRID with MySQL or Oracle if they prefer. And the load will be automatically balanced with CUBRID's native Broker.
  • As I've said several times throughout this presentation, many of our corporate clients ask us to provide a certain solution to a certain problem, but they always ask us to make it easy to use, easier than what other third-party solutions provide. So we always focus on ease of use when developing CUBRID.
  • SQL compatibility is the key request when it comes to database migration. After we released the first version, one of the services in Korea running its system on Oracle asked us to support hierarchical queries. Now we support them. After that, a few major READ-heavy MySQL services listed their requirements for considering a migration to CUBRID. Since then we've implemented several phases of MySQL compatibility, and now we can proudly say that we support over 90% of MySQL's SQL syntax. Phase 2: extended CREATE/ALTER TABLE, INSERT, REPLACE, GROUP BY, ORDER BY; optional parts in SELECT; added LIMIT, TRUNCATE, PREPARE; operator and function extensions. Phase 3: API improvements (added UTF-8 support to the ODBC connector, new server-status functions, functions to obtain schema and FK info); usability (bulk SQL operations on multiple tables in CUBRID Tools). Phase 4: SQL syntax enhancements (SHOW statements, DATE/TIME functions, STRING functions, aggregate functions, DB-specific functions); implicit type conversion behaving very similarly to MySQL's; usability (all measurement units changed from pages and kilobytes to total volume and megabytes). Phase 5: MySQL data type MIN/MAX values; regular expressions.
  • To further improve the ease of use, we put much effort into improving our APIs and administration tools. Originally we had only CUBRID Manager, currently the most powerful and complete database administration tool, with server and HA monitoring dashboards. This tool is perfect for DBAs. However, later on we started to receive requests to create a new tool that is light and oriented toward developers, not DBAs. They said CM was too powerful for developers: they didn't need the backup, replication, and HA stuff. All they needed was a tool for easy and fast query execution, testing, and checking the query plan. That's all. So we created CUBRID Query Browser, the light version of CM created with developers in mind. For ease of migration, we developed the CUBRID Migration Tool. It migrates a database from MySQL, Oracle, or MSSQL to CUBRID automatically: all you need to do is confirm the target data types and hit the migrate button. Last year one of the big service providers asked for a tool to control an entire farm of database servers. At the time we didn't have such a tool, so we decided to work on CUNITOR, a powerful database monitoring tool. Likewise, starting from the first version we've kept supporting more and more programming languages. First, we rolled out C, Java, and OLE DB for Oracle and Microsoft SQL Server clients. Then we added the most popular scripting languages for general users. At the beginning of this year, in January and February, we announced two more APIs, for Perl and .NET.
  • Because you can run your service on cheaper hardware using proven open source solutions which are free. Last year we migrated the entire system monitoring service from Oracle to CUBRID. The key thing to notice here is how many servers were required to run the Oracle-based service versus how many were needed to operate the CUBRID servers. They had 40 Oracle Standard servers and 1 Enterprise server. After the migration we configured the entire system using only 25 CUBRID servers. They came and asked what they should do with the remaining hardware, because they were shocked to see how efficiently they could use CUBRID. After migration the service achieved 10,000 inserts per second with CUBRID, and the company saved a lot by staying away from Oracle licensing fees. That was an impressive success case. We've learnt that Oracle is not for every service. Your service can perfectly well run on CUBRID with no need for compromise; instead you will save a lot on license fees and support, and if necessary buy a few cheap servers and use CUBRID's built-in load balancer. There were around 30 databases, each storing some 1.5 to 2 terabytes of logs, and the service was mainly INSERT-intensive.
  • So what have we learnt so far, and where are we heading?
  • All developers and project managers have got so used to MySQL, Oracle, and MS SQL Server that it's really difficult to change their behavior. This can happen to any software vendor entering an already occupied market. But there is still a way to break this habit: you can achieve acceptance through responsive technical support. Maybe CUBRID is not as powerful as Oracle RAC, and maybe we don't support every feature our clients rely on, but with technical support we can solve anything. With technical support you can meet your clients' expectations. The third lesson we've learnt with CUBRID is that some services don't deserve Oracle; they don't even deserve Microsoft SQL Server. Why? Here is why!
  • So far at NHN we have deployed CUBRID in over 100 Web services. The red line shows how many CUBRID servers are actually running in these services; over 70% of the servers are configured in an HA environment.
  • Four things: Stability, Performance, Scalability, and Ease of Use. I wouldn't hesitate to say that we have successfully achieved all of these elements, though we still have a long way to go!
  • What's next? SELECT is already faster than in MySQL, but we will improve it further with more Web-optimized indexes, and we will also improve INSERT queries. More performance: index improvements and optimizations, INSERT improvements. More SQL compatibility: MySQL and Oracle. Better and more powerful tools: CM+, Web administrator. Sharding is coming very soon, with an auto rebalancer!

    1. Growing in the Wild. The Story by CUBRID Database Developers.
    4. Reasons Behind CUBRID Development
    5. Monitoring & Logging System: MySQL; Oracle, CUBRID; MSSQL; MSSQL, Oracle, MySQL; Oracle, MySQL, CUBRID; NoSQL; MySQL
    8. #1 Performance
    9. Example Web Services: News, Wiki, Blog, etc.; SNS, Push services, etc.; 90% of Log monitoring, Analytics.
    10. CREATE TABLE forum_posts(
          user_id INTEGER,
          post_moment INTEGER,
          post_text VARCHAR(64)
        );
        CREATE TABLE users(
          id INTEGER UNIQUE,
          username VARCHAR(255),
          last_posted INTEGER
        );
        CREATE INDEX i_forum_posts_post_moment ON forum_posts (post_moment);
        CREATE INDEX i_forum_posts_post_moment_user_id ON forum_posts (post_moment, user_id);
        SELECT username FROM users WHERE id = ?;
        INSERT INTO forum_posts(user_id, post_moment, post_text) VALUES (?, ?, ?);
        UPDATE users SET last_posted = ? WHERE id = ?;
    12. [Chart] CUBRID QPS decrease with data set size: queries per second over a growing data set; average = 3685, max = 4469, min = 2821.
    13. [Chart] MySQL QPS decrease with data set size: queries per second (0 to 10,000) as the data set grows from about 1.07 million to 6.33 million rows.
    14. [Chart] CUBRID vs MySQL QPS decrease with data set size: queries per second (0 to 12,000); series: CUBRID QPS, MySQL QPS.
    15. CREATE INDEX ON tickets(component, assignee) WHERE status = 'open';
        SELECT title, component, assignee FROM users WHERE register_date > '2008-01-01' AND status = 'open';
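The filtered index on this slide (an index whose CREATE statement carries a WHERE clause, so only matching rows are indexed) can be tried outside CUBRID as well; SQLite, for instance, supports the same syntax as "partial indexes". A minimal sketch, with the slide's table and column names but made-up data:

```python
import sqlite3

# Toy reproduction of the slide's filtered index using SQLite,
# which supports a WHERE clause on CREATE INDEX (partial indexes).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets(title TEXT, component TEXT, "
             "assignee TEXT, status TEXT)")
# Only rows with status = 'open' enter the index, so it stays small
# even when most tickets are closed.
conn.execute("CREATE INDEX i_open_tickets ON tickets(component, assignee) "
             "WHERE status = 'open'")
conn.executemany("INSERT INTO tickets VALUES (?, ?, ?, ?)",
                 [("t1", "core", "alice", "open"),
                  ("t2", "ui",   "bob",   "closed"),
                  ("t3", "core", "carol", "open")])
# The optimizer may use the partial index here because the query's
# WHERE clause implies the index's condition (status = 'open').
rows = conn.execute("SELECT title FROM tickets "
                    "WHERE component = 'core' AND status = 'open'").fetchall()
print(rows)
```

The payoff the next slide's chart shows is exactly this: the filtered index covers only the "hot" subset of rows, so it is smaller and cheaper to maintain than a full index.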
    16. [Chart] Queries per second (0 to 7,000) as the table grows from 0 to 10,000,000 rows; series: QPS Full Index vs. QPS Filter Index.
    17. SELECT title, component, assignee FROM users WHERE register_date > ? AND status = ?;
        • "Shared" Query Plan Cache
    18. Query execution without plan cache: Parse SQL, Name Resolving, Semantic Check, Query Optimize, Query Plan, Query Execution. Query execution with plan cache: Parse SQL, Get Cached Plan, Query Execution.
    19. SELECT title, component, assignee FROM users WHERE register_date > '2008-01-01' AND status = 'open';
        SELECT title, component, assignee FROM users WHERE register_date > ? AND status = ?;
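Slides 17 to 19 explain why the parameterized form of the query matters: a plan cache keyed by SQL text can reuse one compiled plan for every execution of the `?`-placeholder query, while each distinct literal produces a new text and a cache miss. A toy model of such a cache (this is an illustration of the idea, not CUBRID's implementation):

```python
# Toy "shared" query plan cache keyed by SQL text.
plan_cache = {}
compile_count = 0

def execute(sql, params=()):
    global compile_count
    plan = plan_cache.get(sql)
    if plan is None:
        compile_count += 1        # parsing + optimizing happens only here
        plan = f"PLAN({sql})"     # stand-in for a real query plan
        plan_cache[sql] = plan
    return plan, params

# Literal queries: every distinct date yields a new SQL text,
# so every execution compiles a new plan.
for day in ("2008-01-01", "2008-01-02", "2008-01-03"):
    execute(f"SELECT ... WHERE register_date > '{day}' AND status = 'open'")

# Parameterized query: one SQL text, one cached plan, any values.
for day in ("2008-01-01", "2008-01-02", "2008-01-03"):
    execute("SELECT ... WHERE register_date > ? AND status = ?", (day, "open"))

print(compile_count)  # 3 literal plans + 1 shared plan = 4
```

With literals the cache grows without bound and rarely hits; with placeholders the second and third executions skip straight to "Get Cached Plan", matching the shortened pipeline on slide 18.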
    20. #2 Scalability
    22. Replication topologies: 1:1 (M:S), 1:N (M:S), 1:1:N (M:S:R), N:N (M:S), N:1 (M:S)
    24. [Diagram] A single database holding X, Y, Z vs. shard databases DB X, DB Y, DB Z
    25. [Diagram] Tbl1, Tbl2, Tbl3, Tbl4
    26. [Diagram] Tbl1, Tbl2, Tbl3, Tbl4
    29. #3 Ease of Use
    30. > 90% MySQL SQL Compatibility
    31. [Step 1] Dual Write: the application writes to both MS SQL and CUBRID through a dual read/writer, but reads only from MS SQL.
        [Step 2] Dual Write and Read: the application reads from and writes to both MS SQL and CUBRID.
        [Step 3] Win-back Complete: the application reads from and writes to CUBRID only.
        • 16 Master/Slave servers and 1 Archive server
        • DB size: 0.4~0.5 billion records per DB, 4 billion records in total
        • Total 3.2 TB
        • Total 4,000 ~ 5,000 QPS
        • Saves money on MS SQL licenses and SAN storage
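The three migration steps on this slide can be sketched as a small wrapper that mirrors writes to both stores while the read side is switched over step by step. Plain dicts stand in for the MS SQL and CUBRID databases here; this illustrates the pattern, not NHN's actual migration code:

```python
class DualWriter:
    def __init__(self, old_db, new_db, read_from_new=False):
        self.old_db = old_db          # existing store (MS SQL)
        self.new_db = new_db          # target store (CUBRID)
        self.read_from_new = read_from_new

    def write(self, key, value):
        # Steps 1 and 2: every write is mirrored to both stores, so
        # the new database fills up while the old one stays correct.
        self.old_db[key] = value
        self.new_db[key] = value

    def read(self, key):
        # Step 1 reads from the old store; Step 2 switches reads to
        # the new one once its data has been verified.
        source = self.new_db if self.read_from_new else self.old_db
        return source.get(key)

mssql, cubrid = {}, {}
step1 = DualWriter(mssql, cubrid)                      # Step 1: read old
step1.write("user:1", "esen")
step2 = DualWriter(mssql, cubrid, read_from_new=True)  # Step 2: read new
print(step2.read("user:1"))
```

Once reads from the new store have been validated in production, the old store can be dropped entirely (Step 3, "win-back complete"), which is where the license and SAN storage savings on the slide come from.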
    33. What have we learnt so far, and where are we heading?
    35. [Chart] Cumulative growth of ∑ services (left axis, 0 to 140) and ∑ deployments (right axis, 0 to 500)
    37. CUBRID 8.4.x