
Cassandra for Python Developers

A high-level introduction to Apache Cassandra, followed by an introduction to pycassa, the Python client library for Cassandra.

Presented at PyTexas 2011 by Tyler Hobbs.

Published in: Technology

  1. Cassandra for Python Developers
     Tyler Hobbs
  2. History
     – Open-sourced by Facebook: 2008
     – Apache Incubator: 2009
     – Top-level Apache project: 2010
     – DataStax founded: 2010
  3. 3. Strengths Scalable – 2x Nodes == 2x Performance Reliable (Available) – Replication that works – Multi-DC support – No single point of failure
  4. 4. Strengths Fast – 10-30k writes/sec, 1-10k reads/sec Analytics – Integrated Hadoop support
  5. 5. Weaknesses No ACID transactions – Dont need these as often as youd think Limited support for ad-hoc queries – Youll give these up anyway when sharding an RDBMS Generally complements another system – Not intended to be one-size-fits-all
  6. 6. Clustering Every node plays the same role – No masters, slaves, or special nodes
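  7. Clustering
     [Ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50]
  8. Clustering
     [Ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50]
     Key: "www.google.com"
  9. Clustering
     [Ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50]
     Key: "www.google.com"
     md5("www.google.com") lands at token 14
  10. Clustering
      [Ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50]
      Key: "www.google.com"
      md5("www.google.com") lands at token 14
  11. Clustering
      [Ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50]
      Key: "www.google.com"
      md5("www.google.com") lands at token 14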
  12. Clustering
      [Ring diagram: nodes at tokens 0, 10, 20, 30, 40, 50]
      Key: "www.google.com"
      md5("www.google.com") lands at token 14
      Replication Factor = 3
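The ring slides can be mimicked with a few lines of plain Python. This is only a rough sketch of the idea, not Cassandra's partitioner: the toy token space of 60 and the helper names `token_for` and `replicas_for_token` are assumptions made for illustration, and replicas are simply taken as the next nodes clockwise on the ring. The slides place the example key at token 14; the real md5 value reduced to this toy ring will not necessarily match that number.

```python
import bisect
import hashlib

RING_SIZE = 60                          # assumed toy token space, matching the 0..50 ring on the slides
NODE_TOKENS = [0, 10, 20, 30, 40, 50]   # node tokens as drawn on the slides


def token_for(key, ring_size=RING_SIZE):
    """Hash a key with md5 and reduce it to a position on the toy ring."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % ring_size


def replicas_for_token(token, node_tokens=NODE_TOKENS, replication_factor=3):
    """Return the tokens of the nodes holding a key at `token`.

    The first replica is the first node whose token is >= the key's token
    (wrapping around the ring); the rest are the following nodes clockwise.
    """
    tokens = sorted(node_tokens)
    start = bisect.bisect_left(tokens, token) % len(tokens)
    return [tokens[(start + i) % len(tokens)] for i in range(replication_factor)]


if __name__ == "__main__":
    # Token 14 is the position used on the slides for "www.google.com".
    print(replicas_for_token(14))
    # The token computed here is for the toy ring only.
    print(token_for("www.google.com"))
```

With the slide's token of 14 and a replication factor of 3, the sketch picks the nodes at 20, 30, and 40.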
  13. 13. Clustering Client can talk to any node
  14. 14. Data Model Keyspace – A collection of Column Families – Controls replication settings Column Family – Kinda resembles a table
  15. 15. ColumnFamilies Static – Object data Dynamic – Pre-calculated query results
  16. Static Column Families
      Users
      – zznate: password: *, name: Nate
      – driftx: password: *, name: Brandon
      – thobbs: password: *, name: Tyler
      – jbellis: password: *, name: Jonathan, site: riptano.com
  17. Dynamic Column Families
      Following
      – zznate: driftx, thobbs
      – driftx:
      – thobbs: zznate
      – jbellis: driftx, mdennis, pcmanus, thobbs, xedin, zznate
  18. Dynamic Column Families
      – Timeline of tweets by a user
      – Timeline of tweets by all of the people a user is following
      – List of comments sorted by score
      – List of friends grouped by state
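One way to picture a dynamic column family such as "timeline of tweets by a user" is a single row whose column names are the sort key (here a timestamp) and whose values are the payload. The sketch below models that row as a sorted list in plain Python; it does not use Cassandra or pycassa, and the helper names, timestamps, and tweet IDs are hypothetical. It also assumes column names are unique.

```python
import bisect


def add_column(row, name, value):
    """Insert (name, value) into the row, keeping it sorted by column name.

    Assumes `name` is not already present (for example, unique timestamps).
    """
    bisect.insort(row, (name, value))


def newest(row, count):
    """Return up to `count` columns, highest column name first."""
    return row[::-1][:count]


if __name__ == "__main__":
    timeline = []  # one row: the tweets of a single (hypothetical) user
    add_column(timeline, 1300, "tweet-c")
    add_column(timeline, 1100, "tweet-a")
    add_column(timeline, 1200, "tweet-b")
    print(timeline)             # columns come back ordered by timestamp
    print(newest(timeline, 2))  # the "pre-calculated query result"
```

Because the row is kept sorted at write time, answering "the latest N tweets" is just a read from one end of the row, which is the sense in which the query result is pre-calculated.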
  19. 19. Pycassa Python client library for Cassandra Open Source (MIT License) – www.github.com/pycassa/pycassa Users – Reddit – ~10k github downloads of every version
  20. 20. Installing Pycassa easy_install pycassa – or pip
  21. 21. Basic Layout pycassa.pool – Connection pooling pycassa.columnfamily – Primary module for the data API pycassa.system_manager – Schema management
  22. 22. The Data API RPC-based API Rows are like a sorted list of (name,value) tuples – Like a dict, but sorted by the names – OrderedDicts are used to preserve sorting
  23. Inserting Data
      >>> from pycassa.pool import ConnectionPool
      >>> from pycassa.columnfamily import ColumnFamily
      >>>
      >>> pool = ConnectionPool("MyKeyspace")
      >>> cf = ColumnFamily(pool, "MyCF")
      >>>
      >>> cf.insert("key", {"col_name": "col_value"})
      >>> cf.get("key")
      {"col_name": "col_value"}
  24. Inserting Data
      >>> columns = {"aaa": 1, "ccc": 3}
      >>> cf.insert("key", columns)
      >>> cf.get("key")
      {"aaa": 1, "ccc": 3}
      >>>
      >>> # Updates are the same as inserts
      >>> cf.insert("key", {"aaa": 42})
      >>> cf.get("key")
      {"aaa": 42, "ccc": 3}
      >>>
      >>> # We can insert anywhere in the row
      >>> cf.insert("key", {"bbb": 2, "ddd": 4})
      >>> cf.get("key")
      {"aaa": 42, "bbb": 2, "ccc": 3, "ddd": 4}
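The insert behaviour shown on the last two slides (updates are the same as inserts, and new columns land in sorted position) can be imitated without a cluster. The following is a minimal in-memory sketch, not pycassa's implementation; the `upsert` helper is a hypothetical name, and the row is modelled as an `OrderedDict` because the deck says pycassa returns rows that way.

```python
from collections import OrderedDict


def upsert(row, columns):
    """Merge `columns` into `row` and return a new row sorted by column name.

    Existing names are overwritten (an update), new names are added
    (an insert); the caller does not need to distinguish the two.
    """
    merged = dict(row)
    merged.update(columns)
    return OrderedDict(sorted(merged.items()))


if __name__ == "__main__":
    row = upsert(OrderedDict(), {"aaa": 1, "ccc": 3})
    row = upsert(row, {"aaa": 42})            # overwrite an existing column
    row = upsert(row, {"bbb": 2, "ddd": 4})   # add columns in the middle and at the end
    print(list(row.items()))
```

Replaying the slide's three inserts through this helper ends with the columns ordered aaa, bbb, ccc, ddd, the same order the slides show.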
  25. Fetching Data
      >>> cf.get("key")
      {"aaa": 42, "bbb": 2, "ccc": 3, "ddd": 4}
      >>>
      >>> # Get a set of columns by name
      >>> cf.get("key", columns=["bbb", "ddd"])
      {"bbb": 2, "ddd": 4}
  26. Fetching Data
      >>> # Get a slice of columns
      >>> cf.get("key", column_start="bbb",
      ...        column_finish="ccc")
      {"bbb": 2, "ccc": 3}
      >>>
      >>> # Slice from "ccc" to the end
      >>> cf.get("key", column_start="ccc")
      {"ccc": 3, "ddd": 4}
      >>>
      >>> # Slice from "bbb" to the beginning
      >>> cf.get("key", column_start="bbb",
      ...        column_reversed=True)
      {"bbb": 2, "aaa": 42}
  27. Fetching Data
      >>> # Get the first two columns in the row
      >>> cf.get("key", column_count=2)
      {"aaa": 42, "bbb": 2}
      >>>
      >>> # Get the last two columns in the row
      >>> cf.get("key", column_reversed=True,
      ...        column_count=2)
      {"ddd": 4, "ccc": 3}
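The slice arguments on the "Fetching Data" slides are easier to reason about with a small model. The function below is a plain-Python approximation of the semantics the slides demonstrate, not pycassa code: `get_slice` is a hypothetical helper, an empty string is assumed to mean "no bound", both bounds are assumed inclusive, and the default count of 100 is an assumption.

```python
from collections import OrderedDict


def get_slice(row, column_start="", column_finish="",
              column_reversed=False, column_count=100):
    """Return a slice of `row` (a dict of column name -> value).

    Forward slices run from column_start up to column_finish; reversed
    slices run from column_start down to column_finish.
    """
    names = sorted(row, reverse=column_reversed)
    if column_reversed:
        selected = [n for n in names
                    if (column_start == "" or n <= column_start)
                    and (column_finish == "" or n >= column_finish)]
    else:
        selected = [n for n in names
                    if (column_start == "" or n >= column_start)
                    and (column_finish == "" or n <= column_finish)]
    return OrderedDict((n, row[n]) for n in selected[:column_count])


if __name__ == "__main__":
    row = {"aaa": 42, "bbb": 2, "ccc": 3, "ddd": 4}
    print(list(get_slice(row, column_start="bbb", column_finish="ccc").items()))
    print(list(get_slice(row, column_start="bbb", column_reversed=True).items()))
    print(list(get_slice(row, column_reversed=True, column_count=2).items()))
```

The model reproduces each result shown on the slides for the four-column example row, including the reversed cases where the start column is the upper end of the slice.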
  28. Fetching Multiple Rows
      >>> columns = {"col": "val"}
      >>> cf.batch_insert({"k1": columns,
      ...                  "k2": columns,
      ...                  "k3": columns})
      >>>
      >>> # Get multiple rows by name
      >>> cf.multiget(["k1", "k2"])
      {"k1": {"col": "val"}, "k2": {"col": "val"}}
      >>> # You can get slices of each row, too
      >>> cf.multiget(["k1", "k2"], column_start="bbb") …
  29. Fetching a Range of Rows
      >>> # Get a generator over all of the rows
      >>> for key, columns in cf.get_range():
      ...     print key, columns
      "k1" {"col": "val"}
      "k2" {"col": "val"}
      "k3" {"col": "val"}
      >>> # You can get slices of each row
      >>> cf.get_range(column_start="bbb") …
  30. Fetching Rows by Secondary Index
      >>> from pycassa.index import *
      >>>
      >>> # users is assumed to be a ColumnFamily with an index on "name"
      >>> # Build up our index clause to match
      >>> exp = create_index_expression("name", "Joe")
      >>> clause = create_index_clause([exp])
      >>> matches = users.get_indexed_slices(clause)
      >>>
      >>> # matches is a generator over matching rows
      >>> for key, columns in matches:
      ...     print key, columns
      "13" {"name": "Joe", "nick": "thatguy2"}
      "257" {"name": "Joe", "nick": "flowers"}
      "98" {"name": "Joe", "nick": "fr0d0"}
  31. Deleting Data
      >>> # Delete a whole row
      >>> cf.remove("key1")
      >>>
      >>> # Or selectively delete columns
      >>> cf.remove("key2", columns=["name", "date"])
  32. Connection Management
      pycassa.pool.ConnectionPool
      – Takes a list of servers
        • Can be any set of nodes in your cluster
      – pool_size, max_retries, timeout
      – Automatically retries operations against other nodes
        • Writes are idempotent!
      – Individual node failures are transparent
      – Thread safe
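The retry behaviour described on the connection management slide can be sketched generically. The code below is not how `ConnectionPool` is implemented; it is a simplified stand-in in which `call_with_failover`, `AllServersUnavailable`, and the server addresses used with it are hypothetical, and only `ConnectionError` is treated as a node failure. It shows why idempotent writes matter: an operation may be attempted on several nodes before one succeeds.

```python
class AllServersUnavailable(Exception):
    """Raised when every attempt against the server list has failed."""


def call_with_failover(servers, operation, max_retries=5):
    """Run `operation(server)`, moving to the next server after a failure.

    `operation` must be safe to repeat (idempotent), because a failed
    attempt may still have been applied on the node that was contacted.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        server = servers[attempt % len(servers)]  # round-robin over the list
        try:
            return operation(server)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next node
    raise AllServersUnavailable(
        "gave up after %d attempts" % (max_retries + 1)
    ) from last_error
```

A caller would pass a function that performs one request against the given server; from the caller's point of view a single failed node is invisible as long as another node answers within the retry budget.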
  33. Async Options
      eventlet
      – Just need to monkeypatch socket and threading
      Twisted
      – Use Telephus instead of Pycassa
      – www.github.com/driftx/telephus
      – Less friendly, less documented, etc.
  34. Tyler Hobbs
      @tylhobbs
      tyler@datastax.com