CSC 8101 Non Relational Databases

  • Information overload: we are creating too much data to be able to store it all. Sources include digital cameras, video cameras, CCTV, VoIP, sensors, and medical imaging.
  • Over time data has evolved to be more interlinked and connected: hypertext has links, blogs have pingbacks, tagging groups related data, and ontologies formalise it further. In the Giant Global Graph (GGG) the relationships contain the information rather than the data items. E.g. friends on Facebook: the data was there before, but the relationships are the important part.
  • Applications in the 70s and 80s were simple and rigid. That doesn't work now in the interconnected world: semi-structured data is a bad fit for an RDBMS, and Facebook, Twitter, etc. have had to build their own databases.
  • Used internally at Amazon in services like S3 and EC2. Quorum (N, R, W): N = number of replicas that will be written to; W = number of responses to wait for for a write to succeed; R = number of responses that must agree for a read to be returned. This means N-W nodes (for writes) or N-R nodes (for reads) can go down and the system still functions.

    1. 1. Part 1: Non Relational Databases. Part 2: Collaborative Filtering. Simon Woodman [s.j.woodman@ncl.ac.uk]
    2. 2. Outline
       • Part 1: Non-Relational Databases (NoSQL) – Trends forcing change – NoSQL database types – Graph Databases (Neo4J) – Demo
       • Part 2: Making Recommendations – Background/example – Pearson Score – User based – Item based
    3. 3. Credit: http://ecogreenliving.net/
    4. 4. Trend 1: Data Size [Chart: digital information created, captured, and replicated worldwide, measured in exabytes, rising steeply from 2006 towards roughly 3,000 exabytes by 2012. Source: IDC 2009]
    5. 5. Trend 2: Connectedness [Timeline: information connectivity increases over time, from text documents and hypertext (web 1.0, ~1990), through blogs, RSS, wikis, tagging, folksonomies, and user-generated content (web 2.0, ~2000), to RDF, ontologies, and the Giant Global Graph (GGG) ("web 3.0", 2010 onwards). Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome]
    6. 6. Trend 3: semi-structure• “The great majority of the data out there is not structured and [there’s] no way in the world you can force people to structure it.” [1]• Trend accelerated by the decentralization of content generation that is the hallmark of the age of participation (“web 2.0”)• Evolving applications [1] Stefano Mazzocchi, Apache and MIT
    7. 7. Types of Databases• Relational• Key-Value Stores• BigTable Clones• Document Databases• Graph Databases
    8. 8. Relational Databases• Data Model: Normalised, multi-table with referential integrity• Good for very static data – Payroll, accounts – Well understood – Not evolving• SQL Queries (joins etc.)• Good Tooling• Examples: Oracle, MySQL, Postgres, …
    9. 9. Key-Value Stores
       • Data model: (global) collection of K-V pairs
       • Massive distributed HashMap
       • Partitioning and replication usually ring based – Load balancer round-robins the requests – Hash(key) = partition – Partition map maintains the partition -> node mapping – Quorum system (N, R, W), usually (3, 2, 2)
       • Scales well (1000B rows)
       • How many apps need that? – Google, Amazon, Facebook etc. – <10 in the world
       • Examples: Dynomite, Voldemort, Tokyo
       [http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
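The ring-based lookup and quorum arithmetic in the bullets above can be sketched in a few lines of Python. This is an illustrative sketch, not Dynamo's actual implementation: the partition count and function names are invented.

```python
import hashlib

NUM_PARTITIONS = 16  # hypothetical ring size

def partition_for(key: str) -> int:
    """Hash the key and map it deterministically onto a partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def quorum_ok(n: int, r: int, w: int) -> bool:
    """With N replicas, a write waits for W acks and a read for R
    responses; R + W > N guarantees the two sets overlap."""
    return r + w > n

# The usual (3, 2, 2) configuration: N - W = 1 node can be down and
# writes still succeed; N - R = 1 node can be down for reads.
print(partition_for("user:42"))  # a stable partition in 0..15
print(quorum_ok(3, 2, 2))        # True
```

A separate partition map (partition -> node) then completes the lookup, which is why the load balancer can round-robin requests to any front-end node.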
    10. 10. BigTable Clones• Data model: single table, column families• Distributed storage of semi-structured data (column families)• Scale: “Petabyte range”• Supports MapReduce well• Example: HBase, Hypertable
       [http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
    11. 11. Document Databases• Inspired by Lotus Notes• Data model: collections of K-V collections• Document: – Collection of K-V pairs (often JSON) – Often versioned• Scale: dependent on implementation• Can (potentially) store an entire 3-tier web app in the database (probably NOT the best architecture!)• Example: CouchDB, MongoDB
    12. 12. Graph Databases• Inspired by Euler & graph theory• Data model: nodes, relationships, K-V on both• Scale: 10B entities• SPARQL Queries• No O/R Impedance mismatch• Semi Structured & Evolving Schema• Example: AllegroGraph, VertexDB, Neo4j
    13. 13. Social Network Problem• System stores people and friends• Find all “friends of friends”
    14. 14. RDBMS Solution
       • SQL: single join to get friends:
         SELECT p.name, p2.name
         FROM people AS p, people AS p2, friends AS f
         WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;
       • SQL: 2-3 joins or subqueries to get “friends of friends”
       • i.e. not trivial, and doesn’t scale
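The slide shows the single join for direct friends; the friends-of-friends query it alludes to needs a second self-join on the friends table. A toy in-memory SQLite sketch (the schema follows the slide; the sample people and edges are invented):

```python
import sqlite3

# In-memory toy schema mirroring the slide: people(id, name), friends(id1, id2).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friends (id1 INTEGER, id2 INTEGER);
    INSERT INTO people VALUES (1, 'Simon'), (2, 'Chris'), (3, 'Paul');
    INSERT INTO friends VALUES (1, 2), (2, 3);
""")

# Friends of friends of person 1: two self-joins on the friends table.
rows = conn.execute("""
    SELECT p2.name
    FROM friends f1
    JOIN friends f2 ON f1.id2 = f2.id1
    JOIN people p2 ON p2.id = f2.id2
    WHERE f1.id1 = 1
""").fetchall()
print(rows)  # [('Paul',)]
```

Each extra hop adds another self-join, which is the "not trivial and doesn't scale" point: depth-k reachability in SQL means k joins or a recursive query.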
    15. 15. Graph DB Solution• Graph Traversal• pathExists(a,b) limit depth 2
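The pathExists(a, b) traversal with a depth limit of 2 can be sketched as a bounded breadth-first search over an adjacency map. This is plain Python, not Neo4j's actual API:

```python
from collections import deque

def path_exists(graph, a, b, max_depth=2):
    """Breadth-first search from a, giving up beyond max_depth hops."""
    frontier = deque([(a, 0)])
    seen = {a}
    while frontier:
        node, depth = frontier.popleft()
        if node == b:
            return True
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return False

# Tiny friendship graph: 1-2 and 2-3 are friends.
friends = {1: [2], 2: [1, 3], 3: [2]}
print(path_exists(friends, 1, 3))               # True: friend of a friend
print(path_exists(friends, 1, 3, max_depth=1))  # False: too far
```

Unlike the SQL version, adding a hop just raises max_depth by one; no query rewrite is needed.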
    16. 16. Neo4J Model• Nodes• Relationships (edges)• Properties on both [Diagram: nodes 1, 2 and 3, with node properties such as name = “Simon”, job = “RA” and name = “Chris”, connected by a relationship with properties type = “KNOWS”, age = 4 years]
    17. 17. Live Demo!
    18. 18. Neo4J Model• Transactions• Reference Node• Indexes (Apache Lucene)• Visualisation – Neoclipse – The JIT
    19. 19. Neoclipse
    20. 20. Pros and Cons• “Whiteboard friendly” – fits domain models better• Scales up “enough”• Evolve Schema• Can represent semi-structured data• Good Performance for graph/network traversals• Lacks tool support• Harder to write ad-hoc queries (SPARQL vs. SQL)
    21. 21. Important Reminders• Other options exist apart from the Relational Database• Fit the technology to the domain model, not the domain model to the technology
    22. 22. Questions?• http://neo4j.org/• Some material from [http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome]
    23. 23. Part 2: Collaborative Filtering• Calculating Similarities• User based filtering• Item based filtering
    24. 24. Why?• Sell more items• Increase market share• Better targeted advertising• Up sell rather than new-sell• Make more £££• Not perfect – Bad recommendations – Inappropriate recommendations
    25. 25. It can go wrong
    26. 26. It will go wrong
    27. 27. Preference Data
       • Movie ratings: 5, 4, 3, 2, 1
       • Online shopping: Bought = 1, Didn’t Buy = 0
       • Site recommender: Like = 1, No vote = 0, Didn’t Like = -1
    28. 28. Recommending Items• Step 1: Calculate similarities – either user-user or item-item• Step 2: Predict scores for “unseen” items• Step 3: Normalise and order
    29. 29. Example Data: Movie Reviews
       Simon: Shawshank Redemption 5, The Ghost 4, Lock Stock 4, Love Actually 1
       Chris: Shawshank Redemption 1, The Ghost 3, Lock Stock 4, Love Actually 5, Titanic 4
       Paul: Shawshank Redemption 4, The Ghost 5, Titanic 2, Seven 4
    30. 30. Calculating Similarity• Method 1: Euclidean Distance Score• Compare common rankings• n-dimensional preference space• Score 0 – 1• 1 = identical• 0 = highly dissimilar
    31. 31. Calculating the Euclidean Distance Score• Done for each pair of people• Difference in each axis• Square• Add them together• Add 1 (avoids divide by zero)• Square root• Invert
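The steps above amount to inverting the square root of 1 plus the sum of squared rating differences. A minimal Python sketch; the sample ratings mirror the Chris/Simon worked example, where the differences are (5-1) and (4-3):

```python
from math import sqrt

def euclidean_score(prefs_a, prefs_b):
    """Similarity in (0, 1] over commonly rated items:
    square the differences, sum, add 1, square-root, invert."""
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return 0.0
    sum_sq = sum((prefs_a[item] - prefs_b[item]) ** 2 for item in common)
    return 1 / sqrt(1 + sum_sq)

simon = {'Shawshank': 5, 'The Ghost': 4}
chris = {'Shawshank': 1, 'The Ghost': 3}
print(round(euclidean_score(simon, chris), 8))  # 0.23570226
```

Identical ratings give sum_sq = 0 and hence a score of exactly 1; the +1 is what prevents division by zero in that case.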
    32. 32. Chris and Simon• Difference in each axis – (5-1), (4-3) = 4, 1• Square – 16, 1• Add them together – 17• Add 1 (avoids divide by zero) – = 18• Square Root – = 4.24264069• Invert – = 0.23570226
    33. 33. Euclidean Distance Score• Easy to calculate• Bad for people who are similar but consistently rate higher/lower
    34. 34. Pearson Correlation Coefficient• More Complicated• Line of Best Fit between commonly rated items• Deals with grade inflation• Other measures – Jaccard Coefficient – Manhattan Distance
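The slide names the measure but not the formula; below is the standard Pearson correlation coefficient over co-rated items, sketched in Python. The example ratings are invented to show the "grade inflation" point: a consistently harsher rater still correlates perfectly.

```python
from math import sqrt

def pearson_score(prefs_a, prefs_b):
    """Pearson correlation (-1 to 1) over items both people rated."""
    common = list(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n == 0:
        return 0.0
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Covariance over the product of standard deviations.
    num = sum_xy - sum_x * sum_y / n
    den = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return num / den if den else 0.0

a = {'Titanic': 5, 'Seven': 4, 'The Ghost': 3}
b = {'Titanic': 3, 'Seven': 2, 'The Ghost': 1}  # same shape, 2 lower
print(pearson_score(a, b))  # 1.0
```

The Euclidean score would penalise b's uniformly lower ratings; Pearson fits a line of best fit through the pairs, so a constant offset does not hurt the score.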
    35. 35. User based Filtering• Look at what similar people have liked but you haven’t seen – but what if a similar person likes something that has bad reviews from everyone else?• Use a weighted score that ranks the other people and takes similarity into account
    36. 36. Recommending Items
       Chris: similarity (ED) 0.23; Titanic 4, sim x Titanic = 0.92
       Paul: similarity (ED) 0.78; Titanic 2, sim x Titanic = 1.56; Seven 4, sim x Seven = 3.12
       Titanic: total = 2.48, sim sum = 1.01, total/sim sum = 2.455445545
       Seven: total = 3.12, sim sum = 0.78, total/sim sum = 4
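The worked example can be written as a small user-based recommender: for each unseen item, sum each neighbour's rating weighted by their similarity, then divide by the similarity sum. A sketch using the slide's figures; the function and variable names are illustrative:

```python
def user_based_recommend(ratings, similarity, target):
    """Predict scores for items the target user has not rated,
    weighting each neighbour's rating by their similarity."""
    totals, sim_sums = {}, {}
    for other, sim in similarity.items():
        if other == target or sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item in ratings[target]:
                continue  # already seen by the target user
            totals[item] = totals.get(item, 0.0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    return {item: totals[item] / sim_sums[item] for item in totals}

# Similarities to Simon and ratings for the two films he has not seen.
ratings = {
    'Simon': {'Shawshank': 5, 'The Ghost': 4, 'Lock Stock': 4, 'Love Actually': 1},
    'Chris': {'Titanic': 4},
    'Paul': {'Titanic': 2, 'Seven': 4},
}
similarity = {'Chris': 0.23, 'Paul': 0.78}
predictions = user_based_recommend(ratings, similarity, 'Simon')
print(predictions)  # Titanic ~ 2.46, Seven = 4.0
```

Dividing by the similarity sum normalises for items rated by different numbers of neighbours, so Seven (rated only by Paul) still lands on a 1-5 scale.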
    39. 39. User Based Filtering - Conclusions• Calculate similarity between users• Recommend based on similar users• Similarity – Euclidean Distance Score – Pearson Coefficient (better for non-normalised data)• Problem – need to compare every user/item to every other user/item
    40. 40. Item Based Filtering• Pre-compute most similar items for each item – Item similarities change less often than user similarities and can be re-used• Create a weighted list of items most similar to user’s top rated items
    41. 41. Recommending Items
       Shawshank (rating 5): Titanic (ED) 0.084, rat x Titanic = 0.42; Seven (ED) 0.366, rat x Seven = 1.83
       The Ghost (rating 4): Titanic (ED) 0.125, rat x Titanic = 0.5; Seven (ED) 0.487, rat x Seven = 1.948
       Lock Stock (rating 4): Titanic (ED) 0.091, rat x Titanic = 0.364; Seven (ED) 0.318, rat x Seven = 1.272
       Love Actually (rating 1): Titanic (ED) 0.737, rat x Titanic = 0.737; Seven (ED) 0.184, rat x Seven = 0.184
       Totals: Titanic similarity 1.037, rat x Titanic 2.021; Seven similarity 1.355, rat x Seven 5.234
       Normalised (rating/similarity): Titanic 1.948, Seven 3.862730627
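The item-based worked example follows the same weighted-average shape, but the weights are the pre-computed item-item similarities and the scores are the user's own ratings. A sketch using the slide's similarity figures; the function name and data layout are illustrative:

```python
def item_based_recommend(user_ratings, item_similarity, candidates):
    """Score each unseen candidate item from the user's own ratings,
    weighted by pre-computed item-item similarities."""
    scores = {}
    for candidate in candidates:
        total, sim_sum = 0.0, 0.0
        for item, rating in user_ratings.items():
            sim = item_similarity[candidate].get(item, 0.0)
            total += sim * rating
            sim_sum += sim
        scores[candidate] = total / sim_sum if sim_sum else 0.0
    return scores

# Euclidean similarities between each unseen film and Simon's films,
# taken from the slide's table.
item_similarity = {
    'Titanic': {'Shawshank': 0.084, 'The Ghost': 0.125,
                'Lock Stock': 0.091, 'Love Actually': 0.737},
    'Seven': {'Shawshank': 0.366, 'The Ghost': 0.487,
              'Lock Stock': 0.318, 'Love Actually': 0.184},
}
simon = {'Shawshank': 5, 'The Ghost': 4, 'Lock Stock': 4, 'Love Actually': 1}
print(item_based_recommend(simon, item_similarity, ['Titanic', 'Seven']))
# Titanic ~ 1.95, Seven ~ 3.86: Seven is the stronger recommendation
```

Because item_similarity changes slowly, it can be rebuilt offline in batch, which is the scaling advantage over the user-based approach.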
    44. 44. Item Based Filtering - Conclusions• Calculate similarity between items• Recommend based on user’s ratings for items• Similarity (as before) – Euclidean Distance Score – Pearson Coefficient (better for non-normalised data)• Problem – need to maintain the item similarity data set
    45. 45. Item vs. User Based Filtering• Item based scales better – Need to maintain the similarities data set• User based simpler to implement• May (or may not) want to show users who is similar in terms of habits• Perform equally on dense data sets• Item based performs better on sparse data sets
    46. 46. Questions?• Reference: Programming Collective Intelligence, Toby Segaran, O’Reilly 2007• s.j.woodman@ncl.ac.uk
