Scalable PHP Applications With Cassandra

Developing a fast and scalable application for your fancy new startup is hard. Many factors can slow a website down, such as network latency, web server configuration, or large assets, but as any developer who has worked with high volumes knows, the real bottleneck is the database. In recent years a number of NoSQL solutions have come to the rescue, each with its own pros and cons. Apache Cassandra is one of the most widely used and mature "Big Data" NoSQL databases, and is deployed on several projects by tech giants like Twitter, eBay and Netflix thanks to its extremely high throughput, automatic replication and decentralization. In this session I'll talk about how to leverage Apache Cassandra's best features and data modeling best practices so that your web application projects can respond to huge peaks of traffic, using open source tools such as Zend Framework and phpcassa, and I'll describe a large e-commerce project currently using Cassandra.

Published in: Software, Technology

  1. @akira28 Scalable PHP web applications with Apache Cassandra, by Andrea De Pirro
  2. About me
     • Co-founder at Yameveo
     • 9+ years developing in PHP
     • 2+ years of experience with Apache Cassandra
     • Zend Framework Certified Engineer
  3. Yameveo: Founded in 2012 in Barcelona, Yameveo is a young, dynamic and international company specialised in e-commerce and web application development. www.yameveo.com, @Yameveo
  4. Yameveo Store: dozens of e-commerce modules at store.yameveo.com
  5. What we will talk about
     • Apache Cassandra
     • Data Modeling
     • Cassandra & PHP
     • Case study
  6. Apache Cassandra: "Apache Cassandra is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times." (Apache Cassandra documentation)
  7. Why Cassandra
     • Open Source (enterprise distribution also available)
     • Linearly scalable
     • Fault-tolerant
     • Fully distributed
     • Highly performant
     • Flexible data model
  8. Cassandra Uses
     • Web analytics
     • Web applications
     • Transaction logging
     • Data collection
     • …
  9. (no slide text)
  10. Architecture
  11. CAP Theorem: a distributed system can guarantee only two of the following three properties at once:
      1. Consistency: all nodes see the same data at the same time
      2. Availability: every request receives a response about whether it succeeded or failed
      3. Partition tolerance: the system continues to operate despite message loss or failure of part of the system
  12. CAP Theorem
  13. Architecture
      • Ring
      • Each node has a unique token; otherwise all nodes are identical
      • Intra-ring communication via the "Gossip" protocol
      • Tokens range from 0 to 2^127 - 1 (with the RandomPartitioner)
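To make the ring on slide 13 more concrete, here is a minimal PHP sketch of how a key's token could be mapped to the node responsible for it. It assumes a toy token space of 0 to 99 instead of the real range, so that plain PHP integers are enough. The node names, the tokens, and the `findOwner` helper are hypothetical. It illustrates only the lookup rule (a node owns the range from the previous node's token, exclusive, up to its own token, inclusive), not how Cassandra implements it.

```php
<?php
// Toy ring sketch (assumption: token space 0..99, hypothetical node names).
// Returns the name of the node that owns $keyToken.
function findOwner(array $nodeTokens, int $keyToken): string
{
    asort($nodeTokens); // order nodes by token, keeping node names as keys
    foreach ($nodeTokens as $node => $token) {
        if ($keyToken <= $token) {
            return $node; // first node whose token is >= the key's token
        }
    }
    // Past the highest token: wrap around to the node with the lowest token.
    return array_key_first($nodeTokens);
}

$ring = ['node-a' => 0, 'node-b' => 25, 'node-c' => 50, 'node-d' => 75];

echo findOwner($ring, 30), "\n"; // node-c
echo findOwner($ring, 75), "\n"; // node-d
echo findOwner($ring, 90), "\n"; // node-a (wraps around the ring)
```

Because every node can run the same lookup from the same token list, no central coordinator is needed to locate a row, which fits the "all nodes are identical" point above.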
  14. Partitioning
  15. Data Modeling
  16. Data Model
      • Cluster
      • Keyspace
      • Column Family
      • Super Column
      • Composite Columns
  17. Data Model
  18. Data Model
  19. Data Modeling Problems
      • Neither joins nor subqueries are supported
      • Limited support for aggregation
      • Ordering is done per partition
      • Ordering is specified at table creation time
  20. Data Modeling Best Practices
      • Don't think of a relational table
      • Model column families around query patterns
      • De-normalize and duplicate for read performance
      • Storing values in column names is perfectly OK
      • Leverage wide rows for ordering, grouping, and filtering
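The last two practices on slide 20 can be sketched with plain PHP arrays standing in for a wide row; no Cassandra client is involved. The column-name layout (`date|dealId`), the sample data, and the `sliceColumns` helper are assumptions made for illustration. The sketch shows data living in the column names themselves, and a range of columns being selected by name once the columns are kept in sorted order.

```php
<?php
// A wide row modeled as a PHP array: column name => value.
// Hypothetical layout: the column NAME carries the data ("date|dealId"),
// while the value is just an empty marker.
$row = [
    '20140622|12'  => '',
    '20140621|432' => '',
    '20140621|211' => '',
];

// Columns in a row are kept sorted by name; ksort() emulates that here.
ksort($row, SORT_STRING);

// Returns the columns whose names fall between $start and $finish (inclusive).
function sliceColumns(array $row, string $start, string $finish): array
{
    return array_filter(
        $row,
        fn ($name) => strcmp((string) $name, $start) >= 0
            && strcmp((string) $name, $finish) <= 0,
        ARRAY_FILTER_USE_KEY
    );
}

// All columns for 21 June 2014: '~' sorts after every digit, so it closes the range.
$june21 = sliceColumns($row, '20140621|', '20140621|~');
print_r(array_keys($june21)); // 20140621|211, 20140621|432
```

The point of the design is that ordering and filtering come from how the column names are built, which is why the model has to be shaped around the queries up front.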
  21. Some Numbers
  22. Some Numbers
  23. (no slide text)
  24. Cassandra & PHP
  25. Apache Thrift: "Thrift is an interface definition language and binary communication protocol that is used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for 'scalable cross-language services development'." (Wikipedia)
  26. Apache Thrift
  27. PhpCassa
      • Open Source
      • Uses the Thrift protocol
      • Compatible with Cassandra 0.7 through 1.2
      • Optional C extension for improved performance
      • https://github.com/thobbs/phpcassa
      • require: "thobbs/phpcassa": "v1.1.0"
  28. Examples
      Imports (phpcassa 1.x namespaces):
          use phpcassa\Connection\ConnectionPool;
          use phpcassa\ColumnFamily;
          use phpcassa\SuperColumnFamily;
      Opening connections:
          $pool = new ConnectionPool('Keyspace1');
      Creating a column family object:
          $users = new ColumnFamily($pool, 'Standard1');
          $super = new SuperColumnFamily($pool, 'Super1');
      Inserting:
          $users->insert('key', array('column1' => 'value1', 'column2' => 'value2'));
      Querying:
          $users->get('key'); // returns an array
          $users->multiget(array('key1', 'key2')); // returns an array of arrays
      Removing:
          $users->remove('key1'); // removes whole row
          $users->remove('key1', 'column1'); // removes 'column1'
  29. Case Study
  30. Flash Deals website
      • 5 Apache servers
      • 32 GB of RAM
      • 8 CPUs
      • 6 Cassandra nodes
      • 4+ million visits/month
      • 17+ million pages/month
      • 600 GB of data
  31. (no slide text)
  32. Requirement
      • The client wanted a new way to navigate the website: by deal attributes
      • Millions of deals (hundreds new and expiring every day)
      • Dozens of stores and categories
      • Performance is key!
  33. How We Solved It
      • New deals arrive every day, so queries are based on date and attributes
      • Leverage Cassandra wide rows to create indexes
      • Use Cassandra multiget whenever possible
  34. Deals CF
      RowKey  name                price  attributes   …
      211     Miyagi Sushi        29     [21,20,114]
      432     Mos Eisley Cantina  19     [21,20]
      12      iPhone 5 32GB       549    [7]
      …       …                   …
  35. Attributes CF
      RowKey  name         keyword
      21      Restaurants  restaurants
      114     Japanese     japanese
      20      Barcelona    barcelona
      7       Technology   tech
  36. Cities CF
      RowKey  name       attributeid  …
      1       Madrid     12
      8       Barcelona  20
      32      Amsterdam  81
  37. Urls CF
      RowKey                           attributes  city  …
      /restaurants/barcelona           [21]        8
      /restaurants/barcelona/japanese  [21,114]    8
      /tech                            [7]         -
      /restaurants                     [21]        -
      …                                …           …
  38. AttributesDeals CF
      RowKey        211   432   12    …
      21|20140621   true  true  -
      114|20140621  true  -     -
      20|20140621   true  true  -
      7|20140621    -     -     true
      …             …     …     …
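The AttributesDeals layout on slide 38 is effectively an inverted index of the Deals column family. The sketch below shows one way such an index could be derived, using plain PHP arrays in place of Cassandra and the sample rows from slides 34 and 38. The `buildAttributesDeals` helper is hypothetical, and the deck does not say how the real index is populated.

```php
<?php
// Builds an in-memory stand-in for the AttributesDeals column family:
// row key "attributeId|date" => [dealId => true, ...].
// Hypothetical helper; input is dealId => list of attribute ids.
function buildAttributesDeals(array $deals, string $date): array
{
    $index = [];
    foreach ($deals as $dealId => $attributeIds) {
        foreach ($attributeIds as $attributeId) {
            // One wide row per attribute and day; the deal id is the column name.
            $index["$attributeId|$date"][$dealId] = true;
        }
    }
    return $index;
}

// Sample deals taken from the Deals CF slide.
$deals = [
    211 => [21, 20, 114], // Miyagi Sushi
    432 => [21, 20],      // Mos Eisley Cantina
    12  => [7],           // iPhone 5 32GB
];

$index = buildAttributesDeals($deals, '20140621');
print_r(array_keys($index['21|20140621']));  // deal ids 211 and 432
print_r(array_keys($index['114|20140621'])); // deal id 211
```

Putting the date in the row key keeps each day's index in its own row, so expired deals drop out of the lookup simply because queries ask for the current date.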
  39. Code (Controller)
      /**
       * List deals action
       * e.g. /restaurants/barcelona/japanese
       */
      public function dealsAction()
      {
          $path = $this->getUrlPath(); // cleaned query string
          $url = $this->manager->getUrl($path);
          $attributes = Zend_Json::decode($url['attributes']);
          $cityId = $url['city'];
          $deals = $this->manager->getDeals($attributes, $cityId);
          $this->view->assign('deals', $deals);
          …
      }
  40. Code (Manager)
      /**
       * Retrieves the url containing attributes and city infos
       *
       * @param string $path
       * @return array $url
       */
      public function getUrl($path)
      {
          $pool = new ConnectionPool('Keyspace');
          $urls = new ColumnFamily($pool, 'Urls');
          try {
              $url = $urls->get($path); // the url path is the row key
          } catch (Exception $e) {
              …
          }
          return $url;
      }
  41. Code (Manager)
      /**
       * Retrieves the deals matching the given attributes and city
       *
       * @param array $attributes
       * @param int $cityId
       * @return array $deals
       */
      public function getDeals($attributes, $cityId)
      {
          $pool = new ConnectionPool('Keyspace');
          $dealsCF = new ColumnFamily($pool, 'Deals');
          if (!empty($cityId)) {
              // a city is just another attribute to filter on
              $attributes[] = $this->getAttributeIdByCity($cityId);
          }
          try {
              $dealsIds = $this->getDealsIdsByAttributes($attributes);
              $deals = $dealsCF->multiget($dealsIds); // one round trip for all deals
          } catch (Exception $e) {
              …
          }
          return $deals;
      }
  42. Code (Manager)
      /**
       * Retrieves an array of deals ids given an array of attribute ids
       *
       * @param array $attributes
       * @return array $dealsIds
       */
      protected function getDealsIdsByAttributes($attributes)
      {
          $dealsIds = array();
          $dealsGroups = array();
          $date = date('Ymd');
          $pool = new ConnectionPool('Keyspace'); // missing on the slide; same pool setup as the other methods
          $attributesDeals = new ColumnFamily($pool, 'AttributesDeals');
          foreach ($attributes as $attributeId) {
              $attributeKey = "$attributeId|$date";
              // the column names of the row are the deal ids
              $dealsGroups[] = array_keys($attributesDeals->get($attributeKey));
          }
          $countGroups = count($dealsGroups);
          if ($countGroups > 1) {
              // keep only the deals that have every requested attribute
              $dealsIds = call_user_func_array('array_intersect', $dealsGroups);
          } elseif ($countGroups == 1) {
              $dealsIds = reset($dealsGroups);
          }
          return $dealsIds;
      }
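As a worked example of the lookup on slide 42, the sketch below runs the same intersection logic against an in-memory copy of the sample AttributesDeals rows instead of a Cassandra column family. The `dealIdsFor` helper and the fixed date are assumptions for illustration. The request modeled here is /restaurants/barcelona/japanese, which resolves to attributes 21 and 114 plus attribute 20 for the city of Barcelona.

```php
<?php
// In-memory stand-in for the AttributesDeals CF (sample rows from slide 38).
$attributesDeals = [
    '21|20140621'  => [211 => true, 432 => true],
    '114|20140621' => [211 => true],
    '20|20140621'  => [211 => true, 432 => true],
    '7|20140621'   => [12 => true],
];

// Hypothetical helper: returns the ids of deals that carry ALL given attributes.
function dealIdsFor(array $attributesDeals, array $attributeIds, string $date): array
{
    $groups = [];
    foreach ($attributeIds as $attributeId) {
        // A missing row simply contributes an empty group.
        $columns = $attributesDeals["$attributeId|$date"] ?? [];
        $groups[] = array_keys($columns); // column names are the deal ids
    }
    if (count($groups) === 0) {
        return [];
    }
    $ids = count($groups) > 1 ? array_intersect(...$groups) : $groups[0];
    return array_values($ids); // reindex from 0
}

// /restaurants/barcelona/japanese => attributes 21 and 114, plus 20 for Barcelona
print_r(dealIdsFor($attributesDeals, [21, 114, 20], '20140621')); // [211]
// /restaurants/barcelona => attribute 21, plus 20 for Barcelona
print_r(dealIdsFor($attributesDeals, [21, 20], '20140621'));      // [211, 432]
```

Each attribute costs one row read and the intersection happens in PHP, so the number of Cassandra reads grows with the number of attributes in the URL rather than with the number of deals.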
  43. Cassandra future (and present)
      • New PHP driver wrapping the C++ driver
      • Cassandra 2.0
      • CQL 3.0
  44. Resources
      • www.yameveo.com
      • http://planetcassandra.org
      • https://github.com/thobbs/phpcassa
      • http://www.hakkalabs.co/articles/cassandra-data-modeling-guide
  45. Resources
      • http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
      • http://www.slideshare.net/DataStax/cassandra-community-webinar-introduction-to-apache-cassandra-12
      • http://www.geroba.com/cassandra/apache-cassandra-byteorderedpartitioner/
  46. Questions?
  47. yameveo@yameveo.com WE ARE HIRING!
  48. Thanks! joind.in/10865, lanyrd.com/scxyhk, www.yameveo.com, @akira28, @Yameveo, http://bit.ly/andreadepirro
