Slick Data Sharding: Slides from DrupalCon London


Presented at DrupalCon London by Senior Developer Tobby Hagler, Slick Data Sharding teaches you how to develop scalable data applications with Drupal.

Published in: Technology
  • Just a reminder that the Official DrupalCon Party is tonight. Buses are leaving here starting at 4pm, but will be leaving continuously for a while, which is good since all of you have places to be for the next 50 minutes…
  • Discuss what data sharding is, when you might need to shard your data, and what effects this has on your site or application. HOW: Horizontal/Partitioning and Vertical/Federation.
  • Horizontal: more machines. Vertical: bigger machines. Vertical scaling will always eventually reach a limit.
  • What is it? I’ll cover the different types and ways you can shard your data. How does sharding help? How does it hurt? In short, WHEN is sharding right for me? Why not just keep scaling vertically?
  • Breaking apart your data is the easy part. The hard part is putting it back together again seamlessly. This was one of several broken plates that came from my wife’s great grandmother. I didn’t do it?
  • It’s easier to scale smaller pieces, which makes it easier to scale horizontally. Take one application that shares sensitive data and split it. Moving your cache to memcache IS sharding. So is using Varnish or a CDN like Akamai (forms of federated sharding).
  • Reduce your table indices. The more data you have, the larger your table index overhead will be. Reduce that and you gain performance. A table with a million rows will perform better than a table with 10 million rows. Share your data with other applications or users. Great for taking CVs or form data that will be processed by an internal (proprietary) system. Sometimes physically storing sensitive data (user information, credit card numbers, etc.) in a different database can be a good idea. Don’t store these things on a database that can be accessed via non-SSL web servers.
  • You guys are here to hear about scaling, so let’s talk about all the other things you do to scale. Load balancers: Apache’s mod_proxy and mod_proxy_balancer modules are a cheap way to load balance. There are plenty of cloud-based as well as hardware balancers you can use. Drupal 7 offers the concept of slave-safe queries (even in Views 3).
  • Have you performance tested? Is your problem data or application? Make sure that the size of your data is your problem… Compile PHP and Apache without default modules. Gentoo joke. Do you really need PDFLib or LibXML? Memory is cheap, DBAs are not.
  • Load balancers: Apache’s mod_proxy and load balancing modules are a cheap way to load balance. There are plenty of cloud-based as well as hardware balancers you can use. Drupal 7 offers the concept of slave-safe queries (even in Views 3).
  • Make the individual pieces smaller vs. make the whole smaller. A partition is a single piece split in half: even/odd IDs, or letters of the alphabet for user names. It reduces index size. A “federation” is defined as a “set of things”: logical divisions such as states, counties, countries. Federations tend to be discrete or atomic.
  • Reasons to choose horizontal partitioning. “Everything” includes memcached, load balanced web servers, and master/slave MySQL replication. This is the sharding technique of last resort.
  • The total number of rows in each table is reduced. This reduces index size, which generally improves search performance
  • This is why in theory horizontal scale sounds great – you have N-number of database clusters
  • Manageability: have you seen the number of tables in a Drupal install, especially in an install with tons of modules?
  • The secondary databases no longer need to be MySQL. Notice how the secondary database clusters are starting to look more like cache clusters.
  • Disqus for commenting. Edge-side includes for CDNs. These are examples of application sharding.
  • I want my website to collect resumes. I want to dump resumes into my HR database, but I don’t want all my HR data exposed to the web.
  • Suppose your corporation’s web site sees thousands of applications per month or week. It might be a good idea to shard this data for scale. But also, you can shard it for data repurposing with your HR department’s software. Maybe you don’t want those guys with administrative access on the site… Keep personal information secure and off your company’s main website
  • This takes place in settings.php. In this example we are sharing user data between multiple sites or applications. Profile field data will be available to both.
  • This takes place in settings.php. Since profiles are integrated as fields, you may not have those tables.
  • Note: This scheme will only work with databases of the same type. You can’t mix PostgreSQL and MySQL connections here. You’ll be able to use different connection strings with usernames, etc.
  • This does not HAVE to take place in settings.php, but it should be there if at all possible. moduleKey can be anything unique to your module.
  • Setting the schema is not part of this, but is strongly advised. drupal_get_schema() will static cache the table definition. After db_set_active() switches database connections, the schema is loaded from static cache first, then from the database cache, then from code. If it can’t find the cache tables after you’ve switched database connections, it tries to throw an error, which cascades down a dark path of errors after it can’t find the system table, etc.
  • What are the advantages to switching database connections? You can still use Drupal’s schema and database APIs. A smaller database for your website helps with master/slave replication (faster), backups are more manageable, and there is less overhead.
  • From Drupal’s perspective, here’s how that looks
  • Mongo abstracts the need to horizontally scale: Mongo does the horizontal partitioning for you. This scales the application vertically.
  • I’m not affiliated with 10gen, I just wanted to mention their conference since we’re all here in London. They’ll have several Drupal-related sessions.
  • Out of the box, MongoDB module already does some things to help speed up and scale your site
  • Here’s a sample document that contains resume data. It’s stored in BSON – binary JSON
  • This is a sample query to return all users with the last name of “Smith”. $applicants is a collection object. $applicant is a cursor object that you can loop through. You can use findOne() to get a single return: $user = $users->findOne(array('username' => 'Smith', 'ssn' => 1), array('first_name', 'last_name'));
  • THERE’S NO WEB SERVER INVOLVED AT ALL. In addition to performance, you can share your MongoDB data via REST for use in additional services. You can share your data using REST and JSON to display content without costly queries.
  • This gets a JSON object. Note the trailing slash after the collection name. You might need another REST interface like Sleepy.Mongoose for more advanced REST data.

    1. Slick Data Sharding: How to Develop Scalable Data Applications With Drupal
       • Tobby Hagler, Phase2 Technology
    2. Don't Forget...
       • Official DrupalCon London Party
       • Batman Live World Arena Tour
       • Buses leave main entrance
       • Fairfield Halls at 4pm
    3. Overview
       • Purpose – Reasons for sharding
       • Problems/Examples of a need for sharding
       • Types of scaling and sharding
       • Sharding options in Drupal
    4. Scale: Horizontal vs Vertical
       • Horizontal Scale
         • Add more machines of the same type
       • Vertical Scale
         • Bigger and badder machines
    5. Sharding
       • What is sharding?
       • Types of sharding – Partitioning and Federation
       • How sharding helps
       • Vs. typical monolithic Drupal database
    6. What Is Sharding?
       • Simply put, sharding is physically breaking large data into smaller pieces (shards) of data.
       • The trick is putting them back together again…
    7. Reasons for Sharding
       • Sharding for scaling your application
       • Sharding for shared application data
       • Leveraging specialized technologies
         • Caching is a form of federated sharding
    8. How Sharding Helps
       • Scale your applications by reducing data sets in any single database
       • Secure sensitive data by isolating it elsewhere
       • Segregates data
    9. Be Sure You've Tried Everything Else
       • Memcached
       • Boost Module
       • Load balanced web servers
       • MySQL Master/Slave replication
       • Turning Views into Custom Queries
    10. More Things To Try...
       • Moar memory!
       • Move .htaccess to vhost config
       • Apache tunes
       • MySQL tunes
       • Replace search with Apache Solr
       • Optimizing PHP (custom compile)
       • Apache Drupal module
       • Replace Apache with nginx
       • Switched to 3rd party services for comments
       • Replace contrib modules with custom development
    11. Typical Balanced Environment
    12. Types of Sharding
       • Partitioning
         • Horizontal
         • Divides something into two parts
         • Unshuffle
         • Reduced index size
         • Hard to do
       • Federation
         • Vertical
         • A set of things
         • Uses logical divisions
         • Split up across physically different machines
    13. Horizontal Partitioning
       • Scaling your application’s performance
       • Distributed data load
       • This is the Shard of Last Resort
    14. Even/Odd Partitions
       • This is not Master/Master replication
       • Rows are divided between physical databases
       • Will require custom database API to properly achieve split rows
       • Applies to node loads, entity loads, etc.
       • Achieved with auto_increment stepping by N and a different starting offset per database; the application distributes writes in round-robin fashion and uses keyed mechanisms to distribute reads and reassemble data
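The even/odd scheme on slide 14 can be sketched in plain PHP. This is a minimal illustration, not code from the talk: it assumes each MySQL shard sets `auto_increment_increment` to the shard count and `auto_increment_offset` to its own 1-based position, and the function names (`shard_for_id`, `ids_generated_by`, `group_ids_by_shard`) are hypothetical.

```php
<?php
// Number of shards; 2 gives the even/odd split (an assumption for this sketch).
const SHARD_COUNT = 2;

/**
 * Returns the 1-based shard number that owns a given row ID (ID >= 1).
 */
function shard_for_id(int $id): int {
  // Shard k hands out IDs k, k + N, k + 2N, ...; this inverts that mapping.
  return (($id - 1) % SHARD_COUNT) + 1;
}

/**
 * Lists the first IDs a shard would generate with the offsets described above.
 */
function ids_generated_by(int $shard, int $howMany): array {
  $ids = [];
  for ($i = 0; $i < $howMany; $i++) {
    $ids[] = $shard + $i * SHARD_COUNT;
  }
  return $ids;
}

/**
 * Groups IDs by owning shard so each database is queried only once on reads.
 */
function group_ids_by_shard(array $ids): array {
  $groups = [];
  foreach ($ids as $id) {
    $groups[shard_for_id($id)][] = $id;
  }
  ksort($groups);
  return $groups;
}
```

Writes can go to any shard in round-robin order because the offsets keep generated IDs from colliding; reads use `shard_for_id()` as the keyed lookup, and the application still has to merge the per-shard results itself.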
    15. Horizontally Partitioned Databases
    16. Federation
       • Vertically partitioning data by logical affiliation
       • Sharding for shared application data
       • Manageability – distributing data sets
       • Security – allows for exposing certain bits of data to other applications without exposing all
    17. Vertically Scaled Databases
    18. Application Sharding
       • Not just sharding data
       • Shard the components of your site
    19. Sample Use Cases
       • Collecting resumes within your existing site
       • Building an ideation tool
    20. Sharding Resume Data
       • Accepting resumes for a large corporation
       • Users submit resume via Webform
       • Submit and process data into separate database
       • Resume data is processed by internal HR software to evaluate potential employees
    21. Sharding Schemas
       • Same physical database, different schemas
         • Uses database prefixing in settings.php
       • ~ or ~
       • Different physical databases
         • Uses db_set_active() to switch db connections
    22. Database Prefixes
       • Handled in settings.php
       • Uses MySQL’s dot separator to target different schemas
       • Requires that the MySQL user used by Drupal has proper permissions
       • Ex: db_1.users and db_2.users
    23. Database Prefixes Drupal 6
          $db_prefix = array(
            'default' => '',
            'users' => 'shared_.',
            'sessions' => 'shared_.',
            'role' => 'shared_.',
            'authmap' => 'shared_.',
            'users_roles' => 'shared_.',
            'profile_fields' => 'shared_.',
            'profile_values' => 'shared_.',
          );
    24. Database Prefixes Drupal 7
          $databases = array(
            'default' => array(
              'default' => array(
                'prefix' => array(
                  'default' => '',
                  'users' => 'shared_.',
                  'sessions' => 'shared_.',
                  'role' => 'shared_.',
                  'authmap' => 'shared_.',
                  'users_roles' => 'shared_.',
                ),
              ),
            ),
          );
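To make the effect of the per-table prefix arrays on slides 23 and 24 concrete, here is a small stand-alone sketch of how a `{table}` placeholder could resolve to a prefixed name. It is an illustration only, not Drupal's own implementation; `prefix_tables()` and the sample query are hypothetical.

```php
<?php
// Per-table prefix map, mirroring the shape of the 'prefix' arrays above.
$prefixes = [
  'default' => '',
  'users' => 'shared_.',
  'sessions' => 'shared_.',
];

/**
 * Replaces {table} placeholders with prefixed table names.
 *
 * Tables without their own entry fall back to the 'default' prefix.
 */
function prefix_tables(string $sql, array $prefixes): string {
  return preg_replace_callback('/\{(\w+)\}/', function (array $m) use ($prefixes) {
    $table = $m[1];
    $prefix = $prefixes[$table] ?? $prefixes['default'];
    return $prefix . $table;
  }, $sql);
}

echo prefix_tables(
  'SELECT n.title, u.name FROM {node} n JOIN {users} u ON u.uid = n.uid',
  $prefixes
), "\n";
```

With the `shared_.` prefix, `{users}` becomes `shared_.users`, which MySQL reads as the `users` table in a schema named `shared_`, while `{node}` stays in the site's own database.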
    25. Database Prefixes Tips, Tricks, and Caveats
       • Can share user data between Drupal 6 and Drupal 7 with table alters and strict prevention of Drupal 7 logins or user saves
       • Should log in with the lower version of Drupal
    26. Different Physical Databases
       • Set up additional connections in settings.php
       • Change connections using db_set_active()
       • Use db_set_active() to switch back when done
       • Watch for schema caching and watchdog errors
    27. Different Databases Drupal 6
          $db_url = array(
            'default' => 'mysql://user:pass@host1/db1',
            'second' => 'mysql://user:pass@host2/db2',
            'third' => 'mysql://user:pass@host3/db3',
          );
    28. Different Databases Drupal 7
          $other_database = array(
            'database' => 'databasename',
            'username' => 'username',
            'password' => 'password',
            'host' => 'localhost',
            'driver' => 'mysql',
          );
          Database::addConnectionInfo('moduleKey', 'default', $other_database);
          db_set_active('moduleKey');
          // Execute queries
          db_set_active();
    29. Switching Databases
          $schema = drupal_get_schema('table_name');
          db_set_active('database_key');
          // Execute queries
          drupal_write_record('table_name', $data);
          db_set_active();
    30. Saving Data in Another Database
       • Hook_install_schema()
       • drupal_write_record()
       • Keeps web site database smaller
       • Can keep sensitive data offsite
       • Partitioned tables can limit/protect your web site database from internal users
    31. Saving Data in Another Database
       • Resume data is submitted via form
       • Form’s _submit function accepts final data
         • Schema loads table definition
         • Connects to the HR instance of MySQL
         • Writes new record
         • Uploads any files to private file space
         • Switches database back
       • HR Director can query new resumes
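The switch, write, switch-back sequence above is easy to get wrong if the write throws before the connection is restored. One way to guard against that is a small wrapper, sketched here under some assumptions: `with_database()` is a hypothetical helper name (not a Drupal API), `finally` needs PHP 5.5 or later, and the stub `db_set_active()` exists only so the sketch runs outside Drupal. Inside a bootstrapped Drupal 7 site the real function, which returns the previously active key, is used instead.

```php
<?php
// Stand-in for running outside Drupal; skipped when Drupal defines the real one.
if (!function_exists('db_set_active')) {
  function db_set_active($key = 'default') {
    static $active = 'default';
    $previous = $active;
    $active = $key;
    return $previous;
  }
}

/**
 * Runs $work against another connection and always switches back.
 */
function with_database($key, callable $work) {
  $previous = db_set_active($key);
  try {
    return $work();
  }
  finally {
    // Restore the earlier connection even if $work() throws, so later
    // lookups (cache, watchdog) do not hit the wrong database.
    db_set_active($previous);
  }
}

// Inside Drupal, usage might look like this (the table name is hypothetical):
//   drupal_get_schema('hr_resume');  // warm the static schema cache first
//   with_database('hr', function () use (&$record) {
//     drupal_write_record('hr_resume', $record);
//   });
```

Loading the schema before the switch follows the advice in the speaker notes: the table definition is already in the static cache by the time the active connection points at a database that has no Drupal cache tables.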
    32. Using MongoDB
       • MongoDB is a NoSQL database
       • “Schema-less” – data schema defined in code
       • Fast
       • Document-based
       • Simpler to scale vertically than MySQL
    33. MongoUK
       • 10gen Conference in London, UK
       • September 19, 2011
    34. MongoDB and Drupal
       • 7.x allows for field storage, cache, sessions, and blocks to be stored in MongoDB
       • Allows for connections to your own collections
    35. MongoDB Data
       • Four levels of objects
         • Connection
         • Database (schema)
         • Collection
         • Cursor (query results)
       • Non-relational database
       • Collections tend to be denormalized
    36. MongoDB Documents
          Resumes.Resume: {
            first_name: "John",
            last_name: "Smith",
            title: "Web Developer",
            address: {
              city: "London",
              country: "UK"
            },
            skills: ['PHP', 'Drupal', 'MySQL'],
            ssn: 123456789,
          }
    37. Querying MongoDB Documents
          $applicant = $applicants->find(
            array(
              'username' => 'Smith',
              'ssn' => 1,
            ),
            array(
              'first_name' => 1,
              'last_name' => 1,
            )
          );
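The two arrays passed to find() play different roles: the first is the match criteria, the second is the list of fields to return. The plain-PHP stand-in below mimics that meaning on an in-memory array so the idea can be tried without a MongoDB server. It is not the MongoDB driver; `find_documents()` and the sample data are hypothetical, and it handles only exact equality on top-level fields.

```php
<?php
/**
 * Returns documents matching every criterion, reduced to the projected fields.
 */
function find_documents(array $documents, array $criteria, array $fields): array {
  $results = [];
  foreach ($documents as $doc) {
    foreach ($criteria as $key => $expected) {
      if (!array_key_exists($key, $doc) || $doc[$key] !== $expected) {
        continue 2; // This document fails a criterion; move to the next one.
      }
    }
    // Keep only the fields flagged with 1 in the projection array.
    $results[] = array_intersect_key($doc, array_filter($fields));
  }
  return $results;
}

$applicants = [
  ['first_name' => 'John', 'last_name' => 'Smith', 'title' => 'Web Developer'],
  ['first_name' => 'Jane', 'last_name' => 'Jones', 'title' => 'Designer'],
];

$matches = find_documents(
  $applicants,
  ['last_name' => 'Smith'],
  ['first_name' => 1, 'last_name' => 1]
);
print_r($matches);
```

Only the first sample document matches, and the `title` field is dropped because it is not named in the projection.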
    38. MongoDB Sharing via REST
       • Simple REST – included as part of MongoDB
       • Sleepy Mongoose – REST interface for MongoDB (Python)
       • MongoDB REST (Node.js)
    39. Ideation REST Interface
       • Get a list of all idea documents
       • Get all comments for a specific idea
         • …
         • … ?filter__id=4a8acf6e7fbadc242de5b4f3…
         • … &limit=10&offset=20
       • Will likely need a dedicated MongoDB REST interface
    40. Applications on Separate Web Tiers
       • Application sharding is data sharding
       • Separate Drupal instances
       • Use mod_proxy as a pass-through
       • Can use multiple load-balanced environments
    41. Proxied Web Clusters
    42. Questions?
    43. Contact
       • thagler@phase2technology
       • @phase2tech
       • 703-548-6050
       • d.o: tobby
       • Slides: