Cassandra - PHP
 


Presentation on integrating Cassandra in PHP projects.

12 November 2013 - PHPMeetup Amersfoort

Usage Rights

© All Rights Reserved

    Cassandra - PHP: Presentation Transcript

    • Cassandra: Integrating Cassandra into your project (Tuesday 12 November 2013)
    • Maurits Lawende
      • Working at Dutch Open Projects (DOP) since 2007
      • Development and technical design for challenging Drupal sites
      • Development of SaaS solutions in PHP & NodeJS
    • ToDoToDay
      • Data versus information
      • History and usage of Cassandra
      • How to use Cassandra
      • Developments
    • Data versus information (Celko, J. (1999). Data and databases)
    • SQL is designed for information: the DBMS knows how to use your data
    • SQL is designed for flexibility: not even a single line on scalability
    • SQL: nearly 40 years of experience
    • SQL: never designed for scalability
    • Alexa top 10
      • Google (BigTable)
      • Facebook (MySQL)
      • YouTube (MySQL)
      • Yahoo
      • Baidu (HyperTable)
      • Wikipedia (MySQL)
      • QQ.com
      • LinkedIn (Voldemort)
      • Live.com
      • Twitter (MySQL)
    • Cassandra users
      • Facebook (+ Redis & HBase & MySQL)
      • Twitter (+ MySQL)
      • Reddit (+ Postgres)
      • Digg (+ Redis)
      • Bit.ly (+ MongoDB)
      • Netflix
      • Jeff Hammerbacher left Facebook in 2008
    • Back to basics: don't think SQL
    • Key/value store: evolved towards tables
    • Just data
      • No joins
      • Limited sorting capabilities
      • No aggregation, grouping or subqueries whatsoever
    • Schemaless
      • Fixed column families (not tables), but
      • Dynamic column names
    • Operations in Cassandra 1.0
      • CREATE KEYSPACE name
      • USE name
      • CREATE COLUMN FAMILY name
      • DROP KEYSPACE name
      • DROP COLUMN FAMILY name
    • Operations in Cassandra 1.0
      • SET columnfamily['row']['column'] = 'value';
      • GET columnfamily['row']
      • LIST columnfamily
      • DEL columnfamily['row']
      • DEL columnfamily['row']['column']
    • Operations in Cassandra 1.0
      • post['uuid']['title'] = 'First post!';
      • user['mau']['firstname'] = 'Maurits';
      • user['mau']['lastname'] = 'Lawende';
      • Resulting rows, sorted by row key and column name (all ascending):
        • post, row uuid: title = First post!
        • user, row mau: firstname = Maurits, lastname = Lawende
    • Operations in Cassandra 1.0: how to get a list of blogs by "mau"?
      • post['uuid']['title'] = 'First post!';
      • post['uuid']['user'] = 'mau';
      • user['mau']['firstname'] = 'Maurits';
      • WHERE user = 'mau'
      • Bad Request: No indexed columns present in by-columns clause with Equal operator
      • Sequential scans are rejected
      • Bad Request: Order by is currently only supported on the clustered columns of the PRIMARY KEY
      • Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
      • WHERE user = 'mau' ORDER BY date DESC LIMIT 10: only possible when user and date are in the primary key
    • Predictable performance: no performance degradation after data growth
    • Operations in Cassandra 1.0
      • post['uuid']['title'] = 'First post!';
      • post['uuid']['user'] = 'mau';
      • user['mau']['firstname'] = 'Maurits';
      • user['mau']['post001'] = 'uuid';
      • user['mau']['post002'] = 'uuid';
      • Any order and limit
      • Join: no uuid IN (...) or ORs
    • Operations in Cassandra 1.0
      • post['uuid']['title'] = 'First post!';
      • user['mau']['firstname'] = 'Maurits';
      • user['mau']['post001:uuid'] = 'First post!';
      • user['mau']['post002:uuid'] = 'Second post!';
      • Only one query required to get the user profile with the latest posts
      • Limits: 64 KB per row key, 64 KB per column name, 2 GB per value, 2 billion cells per row
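The denormalised layout on these slides can be tried out without a cluster. The following is a minimal sketch, assuming a hypothetical in-memory `ColumnFamily` class (not part of Cassandra or any client library) that keeps the columns of each row sorted by name, the way the slides describe. The `uuid-a` and `uuid-b` suffixes are placeholders.

```php
<?php
// Hypothetical in-memory stand-in for a column family:
// row key => (column name => value). Not a real Cassandra API.
final class ColumnFamily
{
    /** @var array<string, array<string, string>> */
    private $rows = [];

    // Mirrors: SET columnfamily['row']['column'] = 'value';
    public function set(string $row, string $column, string $value): void
    {
        $this->rows[$row][$column] = $value;
        // Columns within a row are kept sorted by name, ascending.
        ksort($this->rows[$row], SORT_STRING);
    }

    // Return up to $limit columns of one row whose names start with $prefix.
    public function slice(string $row, string $prefix, int $limit): array
    {
        $result = [];
        foreach ($this->rows[$row] ?? [] as $column => $value) {
            if (strncmp($column, $prefix, strlen($prefix)) === 0) {
                $result[$column] = $value;
                if (count($result) === $limit) {
                    break;
                }
            }
        }
        return $result;
    }
}

$user = new ColumnFamily();
$user->set('mau', 'post002:uuid-b', 'Second post!');
$user->set('mau', 'firstname', 'Maurits');
$user->set('mau', 'post001:uuid-a', 'First post!');

// One lookup on the row key returns the post titles, ordered by column name.
$posts = $user->slice('mau', 'post', 10);
```

Because the post titles are copied into the user row, no second lookup in the post column family is needed to render a profile page.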
    • Beauty?
      • Dirty in the SQL world, but
      • It's a best practice in Big Data
      • Don't think of it as a relational database
      • No strict rules on how to use it, just push it to the limits
    • Each row is a snapshot of data meant to satisfy a given query, sort of like a materialized view.
    • Storage in a cluster
    • Cluster structures
    • Master-slave
    • Master-master
    • Sharding
    • HDFS / GlusterFS
    • HyperTable
    • Dynamo
    • No master or single point of failure: every node is (nearly) identical
    • Distribution and replication (ring diagram: tokens run from 0 to 2^127)
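How a row key ends up on a set of nodes can be sketched in a few lines. This is an illustration under stated assumptions, not Cassandra's implementation: the ring is shrunk to tokens 0..99 instead of 0..2^127, `crc32` stands in for the partitioner's hash, the node names are made up, and replicas are simply the next nodes clockwise on the ring.

```php
<?php
// Illustration only: a tiny ring with tokens 0..99.
// tokenFor() and the node names are assumptions for this sketch.
function tokenFor(string $rowKey): int
{
    return crc32($rowKey) % 100;
}

/**
 * @param array<string,int> $ring node name => token
 * @return string[] nodes that hold a replica for $token
 */
function replicasFor(int $token, array $ring, int $replicationFactor): array
{
    asort($ring);                  // walk the ring in token order
    $nodes = array_keys($ring);
    $count = count($nodes);

    // The first node whose token is >= the key's token owns the key;
    // if there is none, wrap around to the first node on the ring.
    $owner = 0;
    foreach ($nodes as $i => $node) {
        if ($ring[$node] >= $token) {
            $owner = $i;
            break;
        }
    }

    // Further replicas go to the next nodes clockwise.
    $replicas = [];
    for ($i = 0; $i < min($replicationFactor, $count); $i++) {
        $replicas[] = $nodes[($owner + $i) % $count];
    }
    return $replicas;
}

$ring = ['node-a' => 0, 'node-b' => 25, 'node-c' => 50, 'node-d' => 75];
$replicas = replicasFor(60, $ring, 3);
```

With this ring, token 60 falls in the range owned by `node-d`, and with a replication factor of 3 the copies wrap around to `node-a` and `node-b`.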
    • Client can connect to any node
    • Seed nodes
      • Required for bootstrapping nodes
      • Define 2 or 3 seed nodes per cluster
    • Extending the ring
      • Assign a token for new node
      • Configure seed node host
      • Start Cassandra on new node
    • Consistency
    • Writing data
      • Hinted handoff
      • Write to commit log
      • Write in memory
      • Write to disk (together with timestamp)
    • Write consistency
      • Choose from ANY, ONE, TWO, THREE, QUORUM, ALL
      • QUORUM = floor((replication factor / 2) + 1)
    • Read consistency
      • Choose from ONE, TWO, THREE, QUORUM, ALL
      • Most recent copy is returned
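The quorum formula is easy to check with a few values. A minimal sketch: `quorum()` follows the formula on the slide, and `readSeesLatestWrite()` is a helper name chosen for this example that encodes the usual rule that a read overlaps the latest write when the number of replicas read plus the number written exceeds the replication factor.

```php
<?php
// QUORUM = floor((replication factor / 2) + 1), as on the slide.
function quorum(int $replicationFactor): int
{
    return intdiv($replicationFactor, 2) + 1;
}

// Hypothetical helper: a read is guaranteed to include the latest write
// when read replicas + write replicas > replication factor.
function readSeesLatestWrite(int $readReplicas, int $writeReplicas, int $replicationFactor): bool
{
    return $readReplicas + $writeReplicas > $replicationFactor;
}

$rf = 3;
$q = quorum($rf);                                        // 2 replicas
$quorumBoth = readSeesLatestWrite($q, $q, $rf);          // QUORUM writes + QUORUM reads
$oneBoth = readSeesLatestWrite(1, 1, $rf);               // ONE writes + ONE reads
```

With a replication factor of 3, QUORUM on both sides overlaps in at least one replica, while ONE on both sides may read a replica the write has not reached yet.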
    • Read repair
      • Compares data with 2 other replicas in the background
      • Fixes inconsistent and missing data
      • At 10% of all reads
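Returning the most recent copy and repairing the others comes down to comparing timestamps. The sketch below assumes a made-up response shape (`value` and `timestamp` per replica, `null` when the replica has no copy) and hypothetical replica names; it only shows the comparison, not how Cassandra schedules the repair.

```php
<?php
// Each replica answers with ['value' => ..., 'timestamp' => ...],
// or null when it has no copy. This shape is an assumption for the sketch.
function resolveRead(array $responses): array
{
    // The copy with the highest timestamp wins.
    $newest = null;
    foreach ($responses as $response) {
        if ($response !== null
            && ($newest === null || $response['timestamp'] > $newest['timestamp'])) {
            $newest = $response;
        }
    }

    // Replicas with a missing or older copy need a repair write.
    $stale = [];
    foreach ($responses as $replica => $response) {
        if ($newest !== null
            && ($response === null || $response['timestamp'] < $newest['timestamp'])) {
            $stale[] = $replica;
        }
    }

    return ['result' => $newest, 'repair' => $stale];
}

$outcome = resolveRead([
    'node-a' => ['value' => 'old title', 'timestamp' => 100],
    'node-b' => ['value' => 'new title', 'timestamp' => 200],
    'node-c' => null,
]);
```

Here `node-b` holds the newest copy, so `node-a` (older) and `node-c` (missing) end up in the repair list.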
    • Node repair
      • Gradually compares all data in nodes with replicas
      • Required in conjunction with read repair to fix 'forgotten deletes'
    • ACID properties
      • Atomic: completed successfully or entirely rolled back
      • Consistent: transactions never invalidate the database state
      • Isolated: transactions are processed sequentially
      • Durable: completed actions are persistent
    • CAP theorem: impossible to achieve all three
      • Consistency
      • Availability
      • Partition tolerance
    • Eventual consistency
      • Not guaranteed to be consistent, but becomes consistent later
      • Best effort
      • Configurable consistency level, but no transaction support
      • Consistency is not always more important than speed and scalability (doesn't require locking)
    • Surrogate keys
      • Say bye to sequences: not consistent across cluster
      • Counters are for counting
      • Native support for UUIDs: f47ac10b-58cc-4372-a567-0e02b2c3d479
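A UUID can be generated on any application server without coordination, which is why it replaces the sequence. A minimal sketch of a random (version 4) UUID generator, assuming PHP 7 or later for `random_bytes()`; the function name `uuid4` is chosen for this example.

```php
<?php
// Generate a random (version 4) UUID; no shared counter is involved.
function uuid4(): string
{
    $bytes = random_bytes(16);
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40); // set version to 4
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80); // set RFC 4122 variant
    // 32 hex characters, grouped as 8-4-4-4-12.
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

$id = uuid4();
```

The result has the same shape as the example on the slide and can be used directly as a row key.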
    • Cassandra 1.2
      • No longer schemaless
      • Introduced CQL3
      • No wide tables anymore
    • Collections
      • Lists
      • Maps
      • Sets
    • Lists
      • user['mau']['posts'] = 'uuid';
      • CREATE TABLE user ( username text PRIMARY KEY, posts list<uuid> );
      • UPDATE user SET posts = posts + ['uuid'] WHERE username = 'mau';
      • UPDATE user SET posts = ['uuid'] + posts WHERE username = 'mau';
    • Sets
      • CREATE TABLE user ( username text PRIMARY KEY, emails set<text> );
      • UPDATE user SET emails = emails + {'mail@example.com'} WHERE username = 'mau';
    • Maps
      • CREATE TABLE user ( username text PRIMARY KEY, attending map<timestamp,text> );
      • UPDATE user SET attending['2013-11-12'] = 'PHPMeetup' WHERE username = 'mau';
      • DELETE attending['2013-12-05'] FROM user WHERE username = 'mau';
    • Limits on collections
      • 64K
      • No size check in CQL: SET list = list + ['...']
      • Whole collection loaded in memory when reading / writing
      • Not an alternative to wide tables!
    • Wide tables in CQL3
      • CREATE TABLE tweets ( tweet_id uuid PRIMARY KEY, author varchar, body varchar );
      • CREATE TABLE timeline ( user_id varchar, tweet_id uuid, author varchar, body varchar, PRIMARY KEY (user_id, tweet_id) );
      • Storage layout of timeline:
        • user_id mau: uuid:author = anne, uuid:body = Tweet from Anne
        • user_id mike: uuid:author = david, uuid:body = Tweet from David
      • For schemaless lovers: CREATE TABLE name ( rowkey varchar, columnname varchar, value blob, PRIMARY KEY (rowkey, columnname) );
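The storage layout on the slide can be reproduced with a small transformation. This is a sketch of the mapping only, assuming a hypothetical function `toWideRows()` and placeholder tweet ids: the first part of the primary key becomes the storage row key, and the second part is prefixed to each column name.

```php
<?php
// Map CQL3 rows of timeline(user_id, tweet_id, author, body) to wide rows:
// one storage row per user_id, one cell per "tweet_id:column name".
function toWideRows(array $cqlRows): array
{
    $storage = [];
    foreach ($cqlRows as $row) {
        foreach (['author', 'body'] as $column) {
            $storage[$row['user_id']][$row['tweet_id'] . ':' . $column] = $row[$column];
        }
    }
    return $storage;
}

// 'uuid-1' and 'uuid-2' are placeholders for real tweet ids.
$storage = toWideRows([
    ['user_id' => 'mau',  'tweet_id' => 'uuid-1', 'author' => 'anne',  'body' => 'Tweet from Anne'],
    ['user_id' => 'mike', 'tweet_id' => 'uuid-2', 'author' => 'david', 'body' => 'Tweet from David'],
]);
```

Every additional tweet for a user adds two more cells to that user's storage row, which is what makes the table "wide".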
    • Secondary index
      • CREATE INDEX name ON table (column);
      • High memory usage when used with high cardinality
    • Iteration
      • SELECT * FROM user
      • SELECT * FROM user LIMIT 10 OFFSET 100 (unpredictable performance)
      • SELECT token(username), username, country, age FROM user
      • SELECT token(username), username, country, age FROM user WHERE token(username) > 23947239 LIMIT 10
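Paging by token means remembering the token of the last row seen and asking for everything after it. A minimal sketch that only builds the CQL string; `nextPageQuery()` is a hypothetical helper, and the table and column names are taken from the slide.

```php
<?php
// Build the CQL for the next page. $lastToken is token(username) of the
// last row already seen, or null for the first page.
function nextPageQuery(?int $lastToken, int $pageSize): string
{
    $cql = 'SELECT token(username), username, country, age FROM user';
    if ($lastToken !== null) {
        $cql .= ' WHERE token(username) > ' . $lastToken;
    }
    return $cql . ' LIMIT ' . $pageSize;
}

$firstPage = nextPageQuery(null, 10);
$nextPage = nextPageQuery(23947239, 10);
```

Each page costs the same regardless of how far into the table it is, which is not the case for an offset-based query.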
    • Queries are always controlled by one node, even if data from 100 nodes is involved
    • MapReduce: or just 'MapRed'
    • MapReduce
      • array_map
      • array_reduce
    • map()
      • Processes a subset of the data
      • array_map(function($v) { return strtoupper($v); }, array('a', 'b'))
    • reduce()
      • Merges results from the mapping function
      • array_reduce(array(1, 2, 3), function($a, $b) { return $a + $b; });
    • MapReduce (diagram: many map() calls in parallel, combined into one result)
    • Wordcount
        $data = array('red green blue', 'orange blue', 'purple green');
        $data = array_map(function($v) {
          $words = array();
          foreach (explode(' ', $v) as $word)
            $words[$word] = isset($words[$word]) ? $words[$word] + 1 : 1;
          return $words;
        }, $data);
        $data = array_reduce($data, function($a, $b) {
          foreach ($a as $word => $count)
            $b[$word] = isset($b[$word]) ? $b[$word] + $count : $count;
          return $b;
        }, array());
        Result: array('red' => 1, 'green' => 2, 'blue' => 2, 'orange' => 1, 'purple' => 1)
    • ORDER BY value LIMIT 5
        $data = array(array(4,5,2), array(62,35,1), array(74,56,2,34));
        $data = array_map(function($v) {
          sort($v);
          return array_slice($v, 0, 5);
        }, $data);
        $data = array_reduce($data, function($a, $b) {
          $v = array_merge($a, $b);
          sort($v);
          return array_slice($v, 0, 5);
        }, array());
        Result: array(1, 2, 2, 4, 5)
    • Remember
      • Getting information is a bumpy road in big data
      • Use MapRed to transform data into information
    • MapReduce
      • No native support in Cassandra
      • MapReduce possible with Hadoop (requires Java programming)
    • Pig
        input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
        words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
        filtered_words = FILTER words BY word MATCHES '\\w+';
        word_groups = GROUP filtered_words BY word;
        word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
        ordered_word_count = ORDER word_count BY count DESC;
        STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
    • Hive
        SELECT v['ip'], COUNT(1) AS cnt
        FROM www_access
        GROUP BY v['ip']
        ORDER BY cnt DESC
        LIMIT 30
    • Pig and Hive
      • Using MapReduce
      • No(t very) predictable performance
      • Good for analysis
    • Hack your own
      • Not too difficult
      • Data can be split into subsets by filtering on tokens
      • Application must run on all MapRed nodes
      • Probably better performance than Pig / Hive
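Splitting the data into subsets by token can be sketched as dividing the token range into equal parts and giving each worker one part. The sketch assumes 64-bit PHP integers, hypothetical helper names (`splitTokenRange`, `subsetQuery`) and the `user` table from the earlier slides; each range is exclusive at its start and inclusive at its end.

```php
<?php
// Split [$min, $max] into $parts consecutive token ranges, one per worker.
function splitTokenRange(int $min, int $max, int $parts): array
{
    // Dividing before subtracting avoids overflow on the full 64-bit range.
    $step = intdiv($max, $parts) - intdiv($min, $parts);
    $ranges = [];
    $start = $min;
    for ($i = 0; $i < $parts; $i++) {
        // The last range always ends exactly at $max.
        $end = ($i === $parts - 1) ? $max : $start + $step;
        $ranges[] = [$start, $end];
        $start = $end;
    }
    return $ranges;
}

// CQL for one subset: start exclusive, end inclusive.
function subsetQuery(array $range): string
{
    return sprintf(
        'SELECT * FROM user WHERE token(username) > %d AND token(username) <= %d',
        $range[0],
        $range[1]
    );
}

$ranges = splitTokenRange(0, 100, 4);
$queries = array_map('subsetQuery', $ranges);
```

Each worker runs its own query and applies the map step to the rows it gets back; the reduce step then merges the partial results, as in the PHP examples on the earlier slides.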
    • Interfaces / protocols
      • Thrift
      • Binary protocol (1.2+)
      • Gossip (internode communication)
    • Thrift
      • Something like SOAP in a binary format
      • Tool which generates libraries based on definition files
      • Supports many languages (incl. PHP, JS, NodeJS, C, Java, Python, Ruby, ...)
      • Also used by HyperTable, HBase, Accumulo and ElasticSearch
      • Sole interface before 1.2
      • No support for collections
    • Binary protocol
      • Recommended protocol for Cassandra 1.2
      • Few client libraries available
      • No binary connectors were available for PHP: https://github.com/mauritsl/php-cassandra
    • php-cassandra
        require('lib/cassandra/Cassandra.php');
        use Cassandra\Connection as Cassandra;

        $connection = new Cassandra('localhost', 'keyspace');
        $rows = $connection->query('SELECT * FROM user');
        foreach ($rows as $row) {
          print $row->firstname;
          print $row->listfield[0];
        }
        $rows->count();
        $rows->getColumns();
    • Scaling applications
    • Rule 1: Don't ask for NoSQL drivers for a CMS
    • Cassandra does not fit all (same story for every NoSQL solution)
    • Every page (or API call) should require only a few queries, if not just one
    • Static versus Dynamic data
      • Static: information that doesn't change very often
        • E.g. translations
        • May go in an RDBMS or local storage (files?)
      • Dynamic: many changes
        • Changes must be visible on all nodes
        • Use Cassandra
    • Local versus Global data
      • Logging
        • Separate logs per node
      • Cache
        • Sometimes no need to share cache between nodes
      • Statistics
        • Can be kept local for a limited time
      • Sessions
        • Dependent on session stickiness
    • Caching
      • Memcache is recommended for local cache
      • Cassandra can be used for global cache
        • Has a TTL feature: INSERT INTO ... (...) VALUES (...) USING TTL 86400
    • What about files?
      • Use Hadoop Distributed File System (HDFS) or GlusterFS
      • Or use Cassandra
      • Split files in chunks to avoid hotspots and save the heap
      • Not uncommon to have files in Cassandra
        • github.com/Netflix/astyanax
        • GBs are OK, but do not store TBs
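Splitting a file into chunks can be sketched with plain string functions. This is one possible scheme, not the one any particular library uses: the row key format (`name:00000`), the chunk size and the function names are assumptions for the example. Each chunk gets its own row key, so the chunks are spread over the ring instead of landing in one large row.

```php
<?php
// Split file contents into fixed-size chunks, one row per chunk.
// The 'name:00000' row key scheme is an assumption for this sketch.
function chunkFile(string $name, string $contents, int $chunkSize): array
{
    $rows = [];
    foreach (str_split($contents, $chunkSize) as $index => $chunk) {
        $rows[sprintf('%s:%05d', $name, $index)] = $chunk;
    }
    return $rows;
}

// Zero-padded indexes sort correctly as strings, so joining the chunks
// in key order restores the original contents.
function joinChunks(array $rows): string
{
    ksort($rows, SORT_STRING);
    return implode('', $rows);
}

$rows = chunkFile('report.pdf', 'abcdefgh', 3);
$restored = joinChunks($rows);
```

In practice the chunk size would be chosen well below the value limit, so that a single read never has to pull a large blob into the heap.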
    • Maximum size of cluster?
      • No satisfactory answer
      • Probably more dependent on network equipment
      • Rack awareness helps here
      • Facebook: 150 node cluster, 50TB data (2010)
      • Easou: 400 node cluster, 300TB data (300 million images)
    • Minimum size of a cluster?
      • Can run on a single node
      • 4GB RAM recommended
      • Runs fine on 1GB RAM
      • "Hot data" should fit in RAM
    • Installing Cassandra
      • Install JDK (Oracle Java recommended, but OpenJDK works OK)
      • Add Cassandra repository
      • apt-get install cassandra
      • Set listen and seed address (IP address of node and seed)
      • (Re)start Cassandra
    • Last words...
    • Data versus information: data structure is naturally responsive for information (predictable performance)
    • History and usage: Jeff Hammerbacher
    • How to use it: schema design, CQL3 and limits
    • Developments: CQL3 and binary protocol
    • Thank you!
    • Questions?