Talk to Techmeetup Aberdeen on big data and NoSQL.
Some links seem to be missing from the onscreen presentation, particularly http://www.dbshards.com/dbshards/ for the sharding diagram

Published in: Technology

  • http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
  • http://news.softpedia.com/news/Twitpocalypse-039-s-Aftermath-114084.shtml
  • http://www.dbshards.com/dbshards/
  • http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
    Picture: http://www.datacenterknowledge.com/archives/2009/11/04/inside-a-cloud-computing-data-center/
  • Larry Ellison must be mad that his “free” software, MySQL, is used on the biggest website in the world.
  • create keyspace test with strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor=1;
    use test;
    create columnfamily users (KEY varchar Primary key, password varchar, gender varchar);
    INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');
    Select * from users;
    INSERT INTO users (KEY, gender) VALUES ('jbrown', 'male');
    INSERT INTO users (KEY, phone) VALUES ('jbrown', '01382 345078');
    What are we going to get?
  • Why NoSQL

    1. 1. Andy Cobley, School of Computing, University of Dundee. Twitter: @andycobley
    2. 2. Who am I? Lecturer at the University of Dundee. Program director of Business Intelligence and of the new Data Science program (http://goo.gl/ljl6N and http://goo.gl/uwHSi). Geek and hacker.
    3. 3. So what is Big Data?
    4. 4. From evil Wikipedia: “In information technology, big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.” Which doesn’t tell us much. Any definition that relies on data “size” will become obsolete very quickly as data storage capabilities grow.
    5. 5. Let’s try something different: the Three V’s. Volume: how big is the data? Terabytes? Petabytes? Variety: is it the same sort of data? What about blobs? Does it change? Velocity: how fast is it coming in? Can we store it fast enough and then use it? http://nosql.mypopescu.com/post/5547192335/bigdata-the-three-vs-volume-variety-velocity
    6. 6. The Twitter problem: the Twitpocalypse. Status ids overflowed 32-bit signed integers. But beyond that, can we physically store data fast enough?
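The overflow behind the Twitpocalypse can be reproduced in a few lines. This is a minimal sketch, assuming a client that keeps status ids in a 32-bit signed field; `store_as_int32` is a hypothetical helper for illustration, not code from Twitter or any client library.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def store_as_int32(status_id: int) -> int:
    """Hypothetical helper: mimic a client that keeps ids in a 32-bit signed field."""
    # ctypes integer types do no overflow checking, so the value silently wraps.
    return ctypes.c_int32(status_id).value


last_safe_id = INT32_MAX
first_broken_id = INT32_MAX + 1

print(store_as_int32(last_safe_id))     # the id still fits
print(store_as_int32(first_broken_id))  # the id wraps around to a negative number
```

Storing ids in a 64-bit field, or as strings, avoids the wrap-around.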
    7. 7. Suppose we are storing 16 columns of 16 bytes (256 bytes per row). At 100 rows per second that’s about 0.7 terabytes per year. At 1 million rows per second that’s about 7 petabytes per year. This is volume.
    8. 8. Variability. Data is sparse and can be different sizes. Over time the type of data changes. Consider click-through data: as pages evolve, new data types and fields need to be stored.
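The volume figures can be checked with a short calculation. This sketch assumes a 365-day year, binary units (TiB and PiB), and raw row bytes only, with no storage overhead or replication; those assumptions are mine, not the slide's.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
ROW_BYTES = 16 * 16  # 16 columns of 16 bytes each

TIB = 2**40
PIB = 2**50


def bytes_per_year(rows_per_second: int) -> int:
    """Raw bytes written in one year at a constant insert rate."""
    return ROW_BYTES * rows_per_second * SECONDS_PER_YEAR


low = bytes_per_year(100) / TIB          # 100 rows per second
high = bytes_per_year(1_000_000) / PIB   # 1 million rows per second

print(f"{low:.2f} TiB per year at 100 rows/s")
print(f"{high:.2f} PiB per year at 1 million rows/s")
```

In decimal units the same rates come out slightly higher (roughly 0.8 TB and 8 PB), so the slide's figures are best read as order-of-magnitude estimates.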
    9. 9. What about: [table with columns id, MassSpec, Meta data, Meta data; rows 1 and 2]
    10. 10. We need UDFs (user-defined functions) inside the DB, or a different way of dealing with it, such as Hadoop or MRSQL.
    11. 11. So what is NoSQL? Throws away everything you know about databases. Is a family of different databases. Lots of different “products”. BUT! http://nosql.mypopescu.com/post/1016320617/mongodb-is-web-scale (warning: might offend). They should only be used when it’s sensible; they are not magic sauce.
    12. 12. NoSQL types. Key-value, column-family and document databases: allow sharding across nodes. Graph: fast for graph-like data and operations.
    13. 13. Some NoSQL databases: CouchDB, MongoDB, Cassandra, Riak, HBase, Neo4j. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
    14. 14. Sharding? Distribution of data across nodes. Allows the load to be spread across multiple machines. SQL databases can be sharded. Not all NoSQL databases can be sharded.
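One simple way to distribute data across nodes is to hash each key and take the result modulo the number of nodes. The sketch below illustrates that idea only; the node names and the `shard_for` function are hypothetical, and real databases use more elaborate placement schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def shard_for(key: str, nodes=NODES) -> str:
    """Pick the node that stores this key by hashing the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# The same key always lands on the same node.
placement = {key: shard_for(key) for key in ["jsmith", "jbrown", "acobley"]}
print(placement)
```

The weakness of plain modulo placement is that changing the number of nodes moves most keys to a different node, which is one reason distributed stores tend to use consistent hashing instead.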
    15. 15. CAP theorem. The CAP (or Brewer’s) theorem says it’s impossible for a web service to provide all three of the following at the same time: consistency, availability, partition tolerance. http://citeseerx.ist.psu.edu/viewdoc/download?doi= See: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed and http://codahale.com/you-cant-sacrifice-partition-tolerance/
    16. 16. http://blog.nahurst.com/visual-guide-to-nosql-systems
    17. 17. Partitions? Essentially, failing to achieve consistency within a set time causes a partition. You can sacrifice availability to ensure consistency. Partitions are rare and, if you have one server, almost never happen. Partitions are caused by network problems and failed nodes.
    18. 18. Eventual consistency. Eventually all nodes will tell the same story. Isn’t this a mad idea? Facebook (actually not). The Internet is based on an eventually consistent DB: DNS.
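"Eventually all nodes will tell the same story" can be sketched with two replicas that accept writes independently and reconcile later. This is a generic last-write-wins illustration under my own assumptions (a hypothetical `Replica` class, a `sync` step, and logical timestamps); it is not how DNS or Cassandra is actually implemented.

```python
class Replica:
    """Hypothetical replica holding key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the value with the newest timestamp (last write wins).
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions; the newest timestamp wins."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("www.example.org", "192.0.2.1", timestamp=1)
r2.write("www.example.org", "192.0.2.1", timestamp=1)
r1.write("www.example.org", "192.0.2.2", timestamp=2)  # update reaches r1 only

stale = r2.read("www.example.org")  # r2 still tells the old story
sync(r1, r2)                        # replicas reconcile
fresh = r2.read("www.example.org")  # now both agree
print(stale, fresh)
```

Between the update and the sync, a reader of `r2` sees stale data; that window is the price paid for staying available.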
    19. 19. Introducing Cassandra Distributed / Decentralized Column Orientated Key Value Store Fault Tolerant
    20. 20. Network topology of a Cassandra DB. Multiple nodes. Cassandra can be rack aware. Keys are replicated across nodes. It’s essentially a DHT (distributed hash table). Think BitTorrent.
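The DHT idea, with keys replicated across nodes, can be sketched as a hash ring: each node owns a position on the ring, and a key is stored on the next few nodes clockwise from its own position. This is a simplified illustration with a hypothetical `Ring` class and node names; Cassandra's real partitioners and rack-aware replication strategies are more involved.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Hypothetical hash ring with simple clockwise replication."""

    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        self.tokens = sorted((token(node), node) for node in nodes)

    def replicas_for(self, key: str):
        # Find the first node at or after the key's position, then walk clockwise.
        start = bisect.bisect_right(self.tokens, (token(key), ""))
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
for key in ["jsmith", "jbrown"]:
    print(key, ring.replicas_for(key))
```

Because placement depends only on ring positions, removing one node leaves the primary replica of every key it did not own unchanged.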
    21. 21. CQL. Cassandra 0.8 introduced CQL, the Cassandra Query Language. Almost looks like SQL! http://crlog.info/2011/09/17/cassandra-query-language-cql-v2-0-reference/ Language ref: http://www.datastax.com/docs/0.8/dml/using_cql
    22. 22. Demo. Start Cassandra. Open CQLSH. Create a keyspace. Create a columnfamily. Now we can insert!
    23. 23. So why does this work? jsmith: password: ch@ngem3a. jbrown: gender: male; phone: 01382 345078. Column store: keys with name: value pairs underneath.
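"Keys with name: value pairs underneath" can be modelled as a dictionary of dictionaries, which shows why a row can carry a column that no other row has. This is a conceptual sketch only; the `insert` helper is hypothetical and says nothing about Cassandra's actual storage engine.

```python
users = {}  # row key -> {column name: value}


def insert(table, key, **columns):
    """Hypothetical helper: add or update named columns under a row key."""
    table.setdefault(key, {}).update(columns)


# The same three inserts as in the demo.
insert(users, "jsmith", password="ch@ngem3a")
insert(users, "jbrown", gender="male")
insert(users, "jbrown", phone="01382 345078")

for key, columns in users.items():
    print(key, columns)
```

Each row stores only the columns it was given, so sparse data costs nothing for the columns that are absent.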
    24. 24. Interfacing to Cassandra. Based on Thrift: http://thrift.apache.org/ Large number of languages supported: http://wiki.apache.org/cassandra/ClientOptions I’ve used Java and Hector: http://prettyprint.me/ There is also a C# version: http://hectorsharp.com/
    25. 25. Cassandra JDBC. Very new, so it is difficult to know how stable it is. Needs compiling, and needs libraries that are not in Cassandra! http://code.google.com/a/apache-extras.org/p/cassandra-jdbc/
    26. 26. Astyanax From Netflix Based on Hector but said to be a lot simpler! https://github.com/Netflix/astyanax/wiki
    27. 27. jBloggyAppy, a demo app of Cassandra. All source code on GitHub: https://github.com/acobley/jBoggyAppy Feel free to use and abuse. Simple blogging app.
    28. 28. A word on using open source software. Versioning! Things change! Documentation is wrong! http://prettyprint.me/ You end up reading unit tests to actually program.
    29. 29. One last thing: Dundee DDD, 17th November, Big Data track. Anyone interested in speaking?