



Published in: Technology


  1. 1. Demonstration
  2. 2. Outline
     ● Some comments on what we're trying to show
       ○ high level cluster configuration
       ○ an example application that might use this config
         ■ based on a Gowalla data set
     ● Launch cluster nodes on EC2
     ● Launch/configure Cassandra on cluster
     ● Demonstrate use of Cassandra
       ○ cassandra-cli, pycassa scripts to interact with db
     ● Demonstrate use of Hadoop
     ● Demonstrate use of Pig on the real data
  3. 3. Cluster configuration
     ● Four EC2 nodes
       ○ m1.medium instances
         ■ realistically a bit small for real world
     ● 3 nodes part of Cassandra
       ○ data can be input dynamically into db via Thrift API
     ● All nodes run Hadoop TaskTracker
     ● MapReduce runs close to (Cassandra) data
     ● JobTracker on separate node
  4. 4. Cluster config
     [Diagram: one node labelled Job Tracker; three nodes each labelled Cassandra and Task Tracker]
     All nodes m1.small for demo
  5. 5. Let's get the cluster up... ...over to Lamine!
  6. 6. Let's get Cassandra running... ...and show the basic cli...
  7. 7. Application data
     ● Used Gowalla data in this test application
     ● Gowalla provides anonymized data for test/research purposes:
       ○ Graph of UID connections
       ○ List of checkins - UID, LocID
     ● Size of data set:
       ○ 400MB checkins
         ■ 6.4m checkins
       ○ ~200k users
     ● Also generated simpler variant of this data for demonstration
       ○ more real user information
       ○ more real location information
  8. 8. Application data - User Graph
     Simple graph structure - unidirectional graph with UIDs as nodes
  9. 9. Application Data - Checkin info
  10. 10. How this data can be used
     ● Application interested in:
       ○ my checkins
       ○ list my friends
       ○ checkins at given location
       ○ my friends' checkins
     ● Analytics:
       ○ top ten most active users - most checkins
       ○ aggregate checkins per week
       ○ aggregate checkins per week per city
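
To make the analytics bullets above concrete, here is a minimal plain-Python sketch of the three aggregations. It is not the Hadoop/Pig code used in the demo. The record layout `(uid, loc_id, timestamp)`, the sample rows, and the `city_of` lookup are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records: (uid, loc_id, ISO timestamp).
# These rows are invented for illustration, not taken from the Gowalla data.
checkins = [
    (1, 100, "2010-10-04T09:00:00"),
    (1, 101, "2010-10-05T12:30:00"),
    (2, 100, "2010-10-05T18:45:00"),
    (3, 102, "2010-10-12T08:15:00"),
    (1, 102, "2010-10-13T20:00:00"),
]

# Hypothetical LocID -> city lookup (the deck does not give one).
city_of = {100: "Austin", 101: "Austin", 102: "Dallas"}


def top_users(records, n=10):
    """Return the n most active users as (uid, checkin_count) pairs."""
    counts = Counter(uid for uid, _loc, _ts in records)
    return counts.most_common(n)


def iso_week(ts):
    """Map an ISO timestamp string to its (ISO year, ISO week) pair."""
    year, week, _weekday = datetime.fromisoformat(ts).isocalendar()
    return (year, week)


def checkins_per_week(records):
    """Count checkins per (year, week)."""
    return dict(Counter(iso_week(ts) for _uid, _loc, ts in records))


def checkins_per_week_per_city(records, cities):
    """Count checkins per (year, week, city)."""
    return dict(
        Counter(iso_week(ts) + (cities[loc],) for _uid, loc, ts in records)
    )
```

Each function is a group-and-count over one key, which is the same shape a Pig `GROUP ... BY` followed by `COUNT` would take on the full data set.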
  11. 11. Cassandra data models
     ● The following data models were used:
       ○ User
       ○ Location
       ○ Checkin
       ○ FriendRels
         ■ graph of friend relationships
       ○ UserCheckins
         ■ checkins by user
       ○ LocationCheckins
         ■ checkins by location
       ○ FriendCheckins
         ■ checkins by friends
  12. 12. Cassandra data models
     ● Use of valueless columns
       ○ FriendRels, UserCheckins, LocationCheckins, FriendCheckins are just sets of valueless columns
     ● FriendRels:
       ○ row_key: {friendid1: , friendid2: , friendid3: , ...}
         ■ row_key is a uid
     ● UserCheckins:
       ○ row_key: {checkinid1: , checkinid2: , ...}
         ■ row_key is uid
     ● LocationCheckins use LocID as row key
     ● FriendCheckins use my UID to get my friends' checkins
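
The valueless-column layout above can be sketched with plain Python dicts standing in for column families. This is an illustration only, not the pycassa code from the demo: the helper names `add_friend` and `record_checkin` are hypothetical, and filling FriendCheckins by fan-out at write time is an assumption, since the slides do not say how that family is populated.

```python
from collections import defaultdict

# Each column family is modeled as row_key -> {column_name: value}.
# A "valueless" column stores an empty string: the column name is the data.
friend_rels = defaultdict(dict)        # uid    -> {friend_uid: ""}
user_checkins = defaultdict(dict)      # uid    -> {checkin_id: ""}
location_checkins = defaultdict(dict)  # loc_id -> {checkin_id: ""}
friend_checkins = defaultdict(dict)    # uid    -> {friends' checkin_id: ""}


def add_friend(uid, friend_uid):
    """Record that uid lists friend_uid as a friend (one direction only)."""
    friend_rels[uid][friend_uid] = ""


def record_checkin(checkin_id, uid, loc_id):
    """Index one checkin under its user, its location, and the user's followers."""
    user_checkins[uid][checkin_id] = ""
    location_checkins[loc_id][checkin_id] = ""
    # Assumed fan-out on write: copy the checkin id into the FriendCheckins
    # row of every user who lists uid as a friend.
    for other_uid, friends in friend_rels.items():
        if uid in friends:
            friend_checkins[other_uid][checkin_id] = ""


# Small hypothetical example: users 1 and 3 both list user 2 as a friend.
add_friend(1, 2)
add_friend(3, 2)
record_checkin("c1", uid=2, loc_id=100)
record_checkin("c2", uid=1, loc_id=100)
```

With this layout each application query from the earlier slide is a single row read: `user_checkins[uid]` for my checkins, `friend_rels[uid]` for my friends, `location_checkins[loc_id]` for checkins at a location, and `friend_checkins[uid]` for my friends' checkins.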
  13. 13. Let's import the data into Cassandra...
  14. 14. You deserve a coffee...
  15. 15. Using Hadoop and Pig ...and we can do some analytics...