
Slide presentation pycassa_upload



Published in: Education, Technology


  1. 1. PYCON INDIA 2012 • Pycassa – Python Cassandrified • 28-30th September 2012, Dharmaram Vidya Kshetram, Bangalore • Ramesh Rajini, Infosys Limited, Education & Research, Bangalore, Karnataka
  2. 2. Session Plan • Need & Introduction to NoSQL DB • Cassandra Introduction • Data model creation • Pycassa in action
  3. 3. Heard of NoSQL? • Stands for Not Only SQL • Class of non-relational data storage systems • No fixed table schema • No joins! • Relaxes one or more of the ACID properties in favour of BASE, within the limits of the CAP theorem!
  4. 4. Do we “REALLY” need them? • RDBMS … so strong • so crisp • so vast • and WE know it well!
  5. 5. Trends shrends! – Gartner's 10 key IT trends for 2012 • Unstructured data will grow some 80% over the course of the next five years
  6. 6. What made some apps go No-SQLized? • Explosion of social media sites with large data needs • Open-source community • Upsurge of cloud-based solutions • Migration to dynamically-typed languages
  7. 7. RDBMS.. hmmm • Normalization => joins => slow queries / complications • Consistency => locks / transactions => performance issues in distributed environments • Scalability becomes a mess as our apps grow in size and demand
  8. 8. Current Approach to Scalability • Add hardware • Upgrade hardware • More machines • Turn off unwanted services • Caching • De-normalize…
  9. 9. RDBMS .. tends to • Massive [terabytes] • Elastic scalability • Easily achieve fault tolerance • Tunable consistency
  10. 10. But why.. • ACID – Transactions are slow under heavy load – In a distributed / replicated environment: two-phase commit => either a node or the coordinator can end up waiting indefinitely
  11. 11. But RDBMS is still holding up!! • Is • Will continue to co-exist with NoSQL • What if data is no longer a problem for me? • What new problems would I like to have?
  12. 12. Seeds of NoSQL • Three major papers – BigTable (Google) – Dynamo (Amazon) • Gossip protocol (discovery and error detection) • Distributed key-value data store • Eventual consistency – CAP Theorem
  13. 13. Brewer’s CAP Theorem • Properties of a system: – Consistency – Availability – Partition tolerance
  14. 14. Brewer’s CAP Theorem • You can have it good, you can have it fast, you can have it cheap: pick two
  15. 15. BASE vs ACID - Eventual Consistency • If there are no updates for a long duration => eventually all updates will propagate through the system => all the nodes will be consistent • For any accepted update and any given node, eventually either the update reaches the node or the node is removed from service • Known as BASE (Basically Available, Soft state, Eventual consistency)
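The eventual-consistency idea on this slide can be illustrated with a toy model. This is a minimal sketch, not how Cassandra itself propagates writes: the `Node` class, the `inbox` queue and the `write` helper are all hypothetical names invented for the illustration.

```python
from collections import deque


class Node:
    """A toy replica: local data plus updates it has not applied yet."""

    def __init__(self):
        self.data = {}
        self.inbox = deque()  # updates accepted elsewhere, still in flight

    def deliver_pending(self):
        # Apply every queued update in arrival order.
        while self.inbox:
            key, value = self.inbox.popleft()
            self.data[key] = value


def write(nodes, coordinator, key, value):
    """Accept the write on one node and queue it for all the others."""
    coordinator.data[key] = value
    for node in nodes:
        if node is not coordinator:
            node.inbox.append((key, value))


nodes = [Node(), Node(), Node()]
write(nodes, nodes[0], "tweet", "Hello")

# Right after the write, only the coordinator has the new value.
before = [node.data.get("tweet") for node in nodes]

# With no further updates, the pending one reaches every node.
for node in nodes:
    node.deliver_pending()
after = [node.data.get("tweet") for node in nodes]
```

Between `before` and `after` a reader could see a stale value on two of the three nodes; that window is the "soft state" in BASE.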
  16. 16. What kinds of NoSQL • 2 major areas: – Key/Value or 'the big hash table' • Dynamo • Voldemort • Scalaris – Schema-less • column-based, document-based or graph-based – Cassandra (column-based) – CouchDB (document-based) – Neo4J (graph-based) – HBase (column-based)
  17. 17. Any users?
  18. 18. Cassandra to the Rescue! • Open source • Distributed, decentralized • Elastically scalable • Highly available / fault-tolerant • Tunably consistent • Column-oriented database • Automatic sharding • Gossip architecture
  19. 19. Distributed and Decentralized • Distributed: can run on multiple machines while appearing to users as a single instance • Decentralized: there is no single point of failure; all the nodes in the cluster function exactly the same [server symmetry]
  20. 20. Elastic Scalability • Vertical scaling: – More hardware capacity / memory • Horizontal scaling: – More machines that have all or some of the data – So that no machine bears the complete load
  21. 21. Elastic Scalability, No Single Point of Failure • Elastic scalability: – The cluster is able to scale up & down • Master-slave issue
  22. 22. Scale UP & Scale down • Add nodes and they can start serving clients! – NO server restart / NO query change / NO balancing – JUST add another machine • Just unplug the system – Since Cassandra keeps multiple copies of the same data on more than one node [configurable], there won't be any loss of data
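Why unplugging a node need not lose data can be sketched with a small hash ring. This is a simplified illustration under stated assumptions, not Cassandra's implementation: the `Ring` class and the node names are hypothetical, each node gets a single MD5-derived token, and replicas are simply the next nodes clockwise.

```python
import bisect
import hashlib


def token(text):
    """Map a string to a position on the ring using MD5."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: one token per node, replicas placed clockwise."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        self.tokens = sorted((token(node), node) for node in nodes)

    def add_node(self, node):
        bisect.insort(self.tokens, (token(node), node))

    def remove_node(self, node):
        self.tokens.remove((token(node), node))

    def replicas(self, key):
        """Return the nodes holding `key`: its owner plus the next ones clockwise."""
        count = min(self.replication_factor, len(self.tokens))
        start = bisect.bisect_left(self.tokens, (token(key), ""))
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
before = ring.replicas("010053")   # two distinct nodes hold this row
ring.remove_node(before[0])        # "just unplug" the first replica
after = ring.replicas("010053")    # the surviving replica now owns the row
```

With a replication factor of 2, the second replica still has the row after the first one disappears. In a real cluster the newly responsible node would also have to receive a copy, which this sketch does not model.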
  23. 23. High Availability and Fault Tolerance • High availability + central-server-based system = problem – Internal hardware redundancy – Sounds cool but extremely costly
  24. 24. High Availability and Fault Tolerance – Cassandra lets you: • replace failed nodes with no downtime • replicate data to multiple data centers to prevent downtime [automatic]
  25. 25. Tuneable Consistency • Consistency: all reads return the most recently written value – Cassandra follows an “eventually consistent” model by default
  26. 26. But then! • Amazon, Facebook, Google and Twitter use this model – Data is their main sales item – High performance!
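"Tuneable" here means the client chooses, per operation, how many replicas must acknowledge a read or a write (Cassandra's levels include ONE, QUORUM and ALL). The usual rule of thumb is that a read is guaranteed to overlap the latest write when readers plus writers exceed the replication factor. The helper names below are hypothetical; the arithmetic is the standard quorum rule.

```python
def quorum(replication_factor):
    """Size of a majority of replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return write_replicas + read_replicas > replication_factor


rf = 3
quorum_both = reads_see_latest_write(rf, quorum(rf), quorum(rf))  # QUORUM / QUORUM
one_both = reads_see_latest_write(rf, 1, 1)                       # ONE / ONE
all_write_one_read = reads_see_latest_write(rf, rf, 1)            # ALL / ONE
```

Writing and reading at ONE is the fastest combination, but it only gives eventual consistency; QUORUM on both sides trades some latency for reads that always see the latest acknowledged write.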
  27. 27. Setting up Apache Cassandra • From the DataStax Community project • From the Apache Cassandra project • Believe it.. It's easy to install & set up!
  28. 28. Keyspace & Column Family creation • Column family 1: – Key1: ColumnName1 = Value, ColumnName2 = Value – Key2: ColumnName1 = Value, ColumnName2 = Value – Key3: ColumnName1 = Value, ColumnName2 = Value, ColumnName3 = Value • Column family 2: – Key1: ColumnName1 = Value, ColumnName2 = Value, ColumnName3 = Value
  29. 29. Data makes sense.. • Column family Close Friends: – 010051: Mail id = Ramesh_Rajini, tweets = Hello – 010052: Mail id = Vinz_Raj, tweets = I'm logged in! – 010053: Mail id = Ragh_Rao, tweet1 = Hey, how r u ?, tweet2 = Movie.. • Column family Colleagues: – 020061: Mail id = Puru_lal, City = Bangalore, Likes = Ladoos!
  30. 30. Cassandra Data Structure • key space – column family (Ex: Colony Name, UserIDs, EmpIDs) – column (Ex: Address, Tweets, Likes, Skill Set): name, value, timestamp
  31. 31. Key-in the Key space..
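Each column carries a timestamp next to its name and value, and Cassandra uses that timestamp to decide which version wins when replicas disagree. A minimal sketch of that idea follows; the `Column` tuple and `reconcile` helper are hypothetical names, the sample values are made up, and the tie-breaking rule is a simplification.

```python
from collections import namedtuple

# A column as described on the slide: name, value, timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])


def reconcile(first, second):
    """Return the newer of two versions of the same column."""
    if first.name != second.name:
        raise ValueError("can only reconcile versions of the same column")
    # Simplification: on equal timestamps this keeps `first`.
    return first if first.timestamp >= second.timestamp else second


older = Column("City", "Bangalore", timestamp=100)
newer = Column("City", "Mysore", timestamp=200)
winner = reconcile(older, newer)
```

Because the winner depends only on timestamps, the result is the same whichever order the two versions arrive in.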
  32. 32. Pycassa in action!
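Before any of the calls on the following slides can run, `col_fam` has to be bound to a column family. A minimal sketch of that step is below, assuming a Cassandra node listening on pycassa's default Thrift address `localhost:9160` and an already-created keyspace `FriendsInfo` with a column family `closefriends` (names taken from the later slides). The wrapper function `open_column_family` is a hypothetical helper, and pycassa is imported inside it so the snippet loads even where pycassa is not installed.

```python
def open_column_family(keyspace="FriendsInfo",
                       column_family="closefriends",
                       servers=("localhost:9160",)):
    """Return a pycassa ColumnFamily handle for an existing column family."""
    # Imported lazily: requires the pycassa package and a reachable node.
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # The pool manages connections to the listed servers for one keyspace.
    pool = ConnectionPool(keyspace, server_list=list(servers))
    return ColumnFamily(pool, column_family)
```

With a running node, `col_fam = open_column_family()` would give the object used in the insert and get examples.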
  33. 33. Multi-level Dictionary • {"FriendsInfo": {"closefriends": {"010053": OrderedDict([("MailId", "Ragh_Rao"), ("tweet1", "Hey, how r u ?"), ("tweet2", "Movie..")]), OrderedDict( .. }} • "FriendsInfo" = keyspace • "closefriends" = column family • "010053" = key • each OrderedDict holds the columns: column keys mapped to column values
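The same nesting can be built with plain Python objects, which may help when reasoning about what pycassa returns. This is only a mental model in ordinary dictionaries, not a pycassa structure; the variable names are hypothetical.

```python
from collections import OrderedDict

cluster = {
    "FriendsInfo": {                      # keyspace
        "closefriends": {                 # column family
            "010053": OrderedDict([       # row key -> ordered columns
                ("MailId", "Ragh_Rao"),
                ("tweet1", "Hey, how r u ?"),
                ("tweet2", "Movie.."),
            ]),
        },
    },
}

# Walking down the levels: keyspace -> column family -> row -> column.
row = cluster["FriendsInfo"]["closefriends"]["010053"]
column_names = list(row)
latest_tweet = row["tweet2"]
```

The innermost level is an `OrderedDict` rather than a plain `dict` because columns within a row come back in sorted order.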
  34. 34. Can I insert in bulk? • Yes, luckily as an ordered dict.. col_fam.batch_insert({'010054': {'Name': 'Vinayak', 'Id': '9308'}, '010057': {'Name': 'Poorvi'}}) • Or insert column by column in a loop: for i in range(1000, 1010): col_fam.insert('EmpIDs', {str(i): 'Hello'})
  35. 35. Is the data stored? • With key, get all details: col_fam.get('010052') returns OrderedDict([('MailId', 'Vinz_Raj'), ('tweets', "I'm logged in!")]) • With key, get specific details: col_fam.get('010053', columns=['MailId', 'tweet2']) returns OrderedDict([('MailId', 'Ragh_Rao'), ('tweet2', 'Movie..')]) • Specifying start & end columns: col_fam.get('EmpIDs', column_start='1002', column_finish='1006') returns OrderedDict([('1002', 'Hello'), ('1003', 'Hello'), ('1004', 'Hello'), ('1005', 'Hello'), ('1006', 'Hello')])
  36. 36. Can the columns be sliced? • Specifying the reverse way: col_fam.get('EmpIDs', column_reversed=True, column_count=3) returns OrderedDict([('1009', 'Hello'), ('1008', 'Hello'), ('1007', 'Hello')]) • Fetching multiple rows: col_fam.multiget(['010053', '010051']) returns OrderedDict([('010053', OrderedDict([('MailId', 'Ragh_Rao'), ('tweet1', 'Hey, how r u?'), ('tweet2', 'Movie..')])), ('010051', OrderedDict([('MailId', 'Ramesh_Rajini'), ('tweets', 'Hello')]))])
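The slicing parameters can be tried out without a cluster using an in-memory stand-in. The function below is a hypothetical emulation, not pycassa code: it assumes column names are strings compared as text, treats an empty start or finish as "unbounded", and, as in Cassandra slice ranges, expects `column_start` to be the upper end when `column_reversed` is set.

```python
from collections import OrderedDict


def slice_columns(row, column_start="", column_finish="",
                  column_reversed=False, column_count=100):
    """Emulate a column slice over one row held as a plain dict."""
    names = sorted(row, reverse=column_reversed)

    def in_range(name):
        # In reversed order the slice runs from start (high) down to finish (low).
        if column_reversed:
            low, high = column_finish, column_start
        else:
            low, high = column_start, column_finish
        return (not low or name >= low) and (not high or name <= high)

    selected = [name for name in names if in_range(name)][:column_count]
    return OrderedDict((name, row[name]) for name in selected)


# Same data as the EmpIDs row built in the insert loop.
emp_ids = {str(i): "Hello" for i in range(1000, 1010)}

forward = slice_columns(emp_ids, column_start="1002", column_finish="1006")
last_three = slice_columns(emp_ids, column_reversed=True, column_count=3)
```

Both bounds are inclusive, which is why the forward slice contains five columns.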
  37. 37. Counting.. • get_count() – Counts the number of columns in the row with the given key • multiget_count() – Performs a column count in parallel on a set of rows – Takes the same parameters as multiget(), except that a list of keys may be used – Returns a dictionary of the form {key: int}
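The shape of the two results can be seen on an in-memory copy of the Close Friends rows. The helpers below are hypothetical stand-ins named differently from the pycassa methods on purpose; they only mimic the return shapes described above and do nothing in parallel.

```python
close_friends = {
    "010051": {"MailId": "Ramesh_Rajini", "tweets": "Hello"},
    "010053": {"MailId": "Ragh_Rao", "tweet1": "Hey, how r u ?", "tweet2": "Movie.."},
}


def count_columns(column_family, key):
    """Number of columns in one row, like a single-row count."""
    return len(column_family[key])


def count_columns_many(column_family, keys):
    """Column counts for several rows, returned as {key: int}."""
    return {key: len(column_family[key]) for key in keys}


single = count_columns(close_friends, "010053")
several = count_columns_many(close_friends, ["010053", "010051"])
```

A count returns only a number per row, so it avoids transferring the column values that a full get would send back.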
  38. 38. What Next? • Explore more of the Pycassa modules.. • Start using it.. I'm sure you'll enjoy it because it is simply superb!
  39. 39. Recap • Need & Introduction to NoSQL DB • Cassandra Introduction • Data model creation • Pycassa in action
  40. 40. References • Cassandra: The Definitive Guide, Eben Hewitt, O'Reilly • !forum/py cassa-discuss
  41. 41. Time for R&R? - Requests & Responses
  42. 42. Thank you! - R&R • Ramesh Rajini • Disclaimer: All logos and images belong to the creators and companies which own them