Demonstration
Published in: Technology
  1. Demonstration
  2. Outline
     ● Some comments on what we're trying to show
       ○ high level cluster configuration
       ○ an example application that might use this config
         ■ based on a Gowalla data set
     ● Launch cluster nodes on EC2
     ● Launch/configure Cassandra on cluster
     ● Demonstrate use of Cassandra
       ○ cassandra-cli, pycassa scripts to interact with db
     ● Demonstrate use of Hadoop
     ● Demonstrate use of Pig on the real data
  3. Cluster configuration
     ● Four EC2 nodes
       ○ m1.medium instances
         ■ realistically a bit small for real world
     ● 3 nodes part of Cassandra
       ○ data can be input dynamically into db via Thrift API
     ● All nodes run Hadoop TaskTracker
     ● MapReduce runs close to (Cassandra) data
     ● JobTracker on separate node
  4. Cluster config
     [Diagram: one Job Tracker node and three Cassandra nodes, with three Task Trackers]
     All nodes m1.small for demo
  5. Let's get the cluster up... ...over to Lamine!
  6. Let's get Cassandra running... ...and show the basic cli...
  7. Application data
     ● Used Gowalla data in this test application
     ● Gowalla provide anonymized data for test/research purposes:
       ○ Graph of UID connections
       ○ List of checkins - UID, LocID
     ● Size of data set:
       ○ 400MB checkins
         ■ 6.4m checkins
       ○ ~200k users
     ● Also generated simpler variant of this data for demonstration
       ○ more real user information
       ○ more real location information
  8. Application data - User Graph
     Simple graph structure - unidirectional graph with UIDs as nodes
  9. Application data - Checkin info
  10. How this data can be used
     ● Application interested in:
       ○ my checkins
       ○ list my friends
       ○ checkins at given location
       ○ my friends' checkins
     ● Analytics:
       ○ top ten most active users - most checkins
       ○ aggregate checkins per week
       ○ aggregate checkins per week per city
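
The first two analytics items on this slide can be sketched in plain Python on a handful of records. This is only an illustration of the aggregations, not the code used in the demonstration. The record layout (uid, ISO timestamp, location id), the sample values, and the function names are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records: (uid, ISO-8601 timestamp, loc_id).
CHECKINS = [
    ("u1", "2010-10-18T09:15:00", "loc1"),
    ("u2", "2010-10-19T12:00:00", "loc1"),
    ("u1", "2010-10-26T18:30:00", "loc2"),
    ("u1", "2010-10-27T08:05:00", "loc1"),
]


def top_users(checkins, n=10):
    """Return the n users with the most checkins as (uid, count) pairs."""
    return Counter(uid for uid, _, _ in checkins).most_common(n)


def checkins_per_week(checkins):
    """Count checkins per ISO (year, week) bucket."""
    weeks = Counter()
    for _, ts, _ in checkins:
        year, week, _ = datetime.fromisoformat(ts).isocalendar()
        weeks[(year, week)] += 1
    return dict(weeks)
```

Calling `top_users(CHECKINS)` ranks users by checkin count, and `checkins_per_week(CHECKINS)` groups the same records by ISO week. On the full 6.4m-checkin data set the same aggregations would more likely run as Hadoop or Pig jobs than in a single process.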
  11. Cassandra data models
     ● The following data models were used:
       ○ User
       ○ Location
       ○ Checkin
       ○ FriendRels
         ■ graph of friend relationships
       ○ UserCheckins
         ■ checkins by user
       ○ LocationCheckins
         ■ checkins by location
       ○ FriendCheckins
         ■ checkins by friends
  12. Cassandra data models
     ● Use of valueless columns
       ○ FriendRels, UserCheckins, LocationCheckins, FriendCheckins are just sets of valueless columns
     ● FriendRels:
       ○ row_key: {friendid1: , friendid2: , friendid3: , ...}
         ■ row_key is a uid
     ● UserCheckins:
       ○ row_key: {checkinid1: , checkinid2: , ...}
         ■ row_key is uid
     ● LocationCheckins use LocID as row key
     ● FriendCheckins use my UID to get my friends' checkins
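
The valueless-column layout above can be mimicked with nested dictionaries, where each row key maps to a set of column names with empty values. This is a minimal in-memory sketch, not pycassa code and not the schema used in the demonstration. The function names are hypothetical, and the fan-out on write used to fill FriendCheckins is an assumption: the slides only say that FriendCheckins is read by my UID.

```python
from collections import defaultdict

# Each "column family" is modelled as {row_key: {column_name: value}}.
# Valueless columns store an empty string: the column name is the data.
friend_rels = defaultdict(dict)        # uid -> {friend_uid: ""}
user_checkins = defaultdict(dict)      # uid -> {checkin_id: ""}
location_checkins = defaultdict(dict)  # loc_id -> {checkin_id: ""}
friend_checkins = defaultdict(dict)    # uid -> {checkin_id: ""} made by that user's friends


def add_friend(uid, friend_uid):
    """Record that friend_uid is a friend of uid (one direction only)."""
    friend_rels[uid][friend_uid] = ""


def add_checkin(checkin_id, uid, loc_id):
    """Index one checkin under its user, its location, and each follower."""
    user_checkins[uid][checkin_id] = ""
    location_checkins[loc_id][checkin_id] = ""
    # Assumed fan-out on write: anyone who lists uid as a friend gets the checkin.
    for other, friends in friend_rels.items():
        if uid in friends:
            friend_checkins[other][checkin_id] = ""
```

After `add_friend("u1", "u2")` and `add_checkin("c1", "u2", "loc9")`, reading the row `friend_checkins["u1"]` yields the checkin ids of u1's friends without any join, which is the point of keeping a separate row per query.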
  13. Let's import the data into Cassandra...
  14. You deserve a coffee...
  15. Using Hadoop and Pig ...and we can do some analytics...
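
To give a feel for the kind of job Hadoop runs here, the sketch below simulates the map, shuffle, and reduce phases in a single Python process for the "checkins per week per city" aggregate. It is not the Pig script or Hadoop job from the demonstration. The records, the city names, and the function names are assumptions, and it presumes each checkin has already been joined with its location's city.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical checkin records already joined with a city: (uid, timestamp, city).
RECORDS = [
    ("u1", "2010-10-18T09:15:00", "Austin"),
    ("u2", "2010-10-19T12:00:00", "Austin"),
    ("u1", "2010-10-26T18:30:00", "Dublin"),
]


def map_checkin(record):
    """Map step: emit ((year, week, city), 1) for one checkin."""
    _, ts, city = record
    year, week, _ = datetime.fromisoformat(ts).isocalendar()
    yield (year, week, city), 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_checkin(record):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())
```

`run_job(RECORDS)` returns one count per (year, week, city) key. On the cluster, the map and reduce steps would run in parallel on the TaskTracker nodes, close to the Cassandra data.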