Talk to techmeetup Aberdeen on big data and NoSQL.
Some links seem to be missing from the onscreen presentation, particularly for the sharding diagram.

Published in: Technology

  • Larry Ellison must be mad that his “free” software MySQL is used on the biggest website in the world.
  • create keyspace test with strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor=1;
    use test;
    create columnfamily users (KEY varchar Primary key, password varchar, gender varchar);
    INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');
    Select * from users;
    INSERT INTO users (KEY, gender) VALUES ('jbrown', 'male');
    INSERT INTO users (KEY, phone) VALUES ('jbrown', '01382 345078');
    What are we going to get?
  • Transcript

    • 1. Andy Cobley, School of Computing, University of Dundee. Twitter: @andycobley
    • 2. Who am I? Lecturer at University of Dundee. Program director of Business Intelligence and new program Data Science. Geek and hacker.
    • 3. So what is Big Data?
    • 4. From evil Wikipedia: “In information technology, big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.” Which doesn’t tell us much. Any definition that relies on data “size” will become obsolete very quickly as data storage capabilities grow.
    • 5. Let’s try something different: the Three V’s. Volume: how big is the data, terabytes? Petabytes? Variety: is it the same sort of data, what about blobs? Does it change? Velocity: how fast is it coming in? Can we store it fast enough and then use it?
    • 6. The Twitter problem: the Twitpocalypse, an overflow of status ids for 32-bit signed integers. But beyond that, can we physically store data fast enough?
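The overflow on slide 6 can be sketched in a few lines. This is only an illustration of what happens to a client that keeps ids in a signed 32-bit field; the function names `fits_int32` and `wrap_int32` are made up for this sketch and are not from the talk or from Twitter's code.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def fits_int32(status_id: int) -> bool:
    """Return True if status_id can be packed as a 32-bit signed integer."""
    try:
        struct.pack("<i", status_id)
        return True
    except struct.error:
        return False


def wrap_int32(value: int) -> int:
    """Simulate the wrap-around a signed 32-bit field would show."""
    return (value + 2**31) % 2**32 - 2**31


last_safe_id = INT32_MAX
first_broken_id = INT32_MAX + 1
print(fits_int32(last_safe_id), fits_int32(first_broken_id))
print(wrap_int32(first_broken_id))
```

The id one past the maximum no longer fits, and a client that silently wraps would see a large negative number instead.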
    • 7. Suppose we are storing 16 columns of 16 bytes. At 100 rows per second that’s 0.7 terabytes per year. And at 1 million rows per second that’s 7 petabytes per year. This is volume.
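The arithmetic on slide 7 can be checked with a short sketch. It assumes a 365-day year and no storage overhead (no indexes, replication or compression), which the slide does not spell out; the helper name `bytes_per_year` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def bytes_per_year(columns: int, bytes_per_column: int, rows_per_second: int) -> int:
    """Raw payload written in one year, ignoring any storage overhead."""
    return columns * bytes_per_column * rows_per_second * SECONDS_PER_YEAR


low = bytes_per_year(16, 16, 100)
high = bytes_per_year(16, 16, 1_000_000)

# Binary units (2**40 bytes per terabyte, 2**50 bytes per petabyte).
print(f"{low / 2**40:.2f} terabytes per year at 100 rows per second")
print(f"{high / 2**50:.2f} petabytes per year at 1 million rows per second")
```

The slide's 0.7 and 7 match binary units rounded down; in decimal units the same totals come out nearer 0.8 terabytes and 8 petabytes.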
    • 8. Variability: data is sparse and can be different sizes. Over time the type of data changes. Consider click-through data: as pages evolve, new data types and fields need to be stored.
    • 9. What about: [table: columns id, MassSpec, Meta data, Meta data; rows 1, 2]
    • 10. We need UDFs (user-defined functions) inside the DB, or a different way of dealing with it, such as Hadoop or MRSQL.
    • 11. So what is NoSQL? Throws away everything you know about databases. Is a family of different databases. Lots of different “products”. BUT! db-is-web-scale (warning: might offend). They should only be used when it’s sensible; they are not magic sauce.
    • 12. NoSQL types: key-value, column-family and document databases allow sharding across nodes. Graph: fast for graph-like data and operations.
    • 13. Some NoSQL databases: CouchDB, MongoDB, Cassandra, Riak, HBase, Neo4j. couchdb-vs-redis
    • 14. Sharding? Distribution of data across nodes. Allows load to be spread across multiple machines. SQL databases can be sharded. Not all NoSQL databases can be sharded.
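As a rough idea of what slide 14 means by distributing data across nodes, here is a minimal sketch of hash-based sharding. It is one simple scheme among several, not the method of any particular database; `shard_for` and `distribute` are hypothetical names.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards: int) -> dict:
    """Group keys by the shard each one would be stored on."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


for shard, members in distribute(["jsmith", "jbrown", "alice", "bob"], 3).items():
    print(shard, members)
```

A weakness of plain modulo sharding is that changing the number of shards moves most keys, which is one reason ring-based schemes exist.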
    • 15. CAP theorem: the CAP (or Brewer’s) theorem says it’s impossible for a web service to provide all three of the following at once: consistency, availability, partition tolerance.
    • 16.
    • 17. Partitions? Essentially, failing to achieve consistency within a set time causes a partition. You can sacrifice availability to ensure consistency. Partitions are rare and, if you have one server, almost never happen. Partitions are caused by networks and failed nodes.
    • 18. Eventual consistency: eventually all nodes will tell the same story. Isn’t this a mad idea? Facebook (actually not). The Internet is based on an eventually consistent DB: DNS.
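"Eventually all nodes will tell the same story" can be sketched with two toy replicas that resolve conflicts by last-write-wins timestamps. This is an assumption made for illustration, not a description of how DNS or any specific database reconciles data; `Replica` and `anti_entropy` are hypothetical names.

```python
class Replica:
    """A toy node storing key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the write only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Exchange state so every replica ends up with the newest value per key."""
    merged = {}
    for replica in replicas:
        for key, (timestamp, value) in replica.data.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    for replica in replicas:
        for key, (timestamp, value) in merged.items():
            replica.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("status", "old", timestamp=1)
b.write("status", "new", timestamp=2)
before = (a.read("status"), b.read("status"))  # the nodes disagree
anti_entropy([a, b])
after = (a.read("status"), b.read("status"))   # the nodes agree
print(before, after)
```

Between the writes and the sync a reader can see either value, which is the trade-off the slide is pointing at.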
    • 19. Introducing Cassandra: distributed / decentralized; column orientated; key-value store; fault tolerant.
    • 20. Network topology of a Cassandra DB: multiple nodes. Cassandra can be rack aware. Keys are replicated across nodes. It’s essentially a DHT (distributed hash table). Think BitTorrent.
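The DHT idea on slide 20 can be sketched as a hash ring: each node owns a position on the ring, and a key is stored on the first node at or after its own position plus the next few nodes for replication. This is a simplified illustration and not Cassandra's actual code; the `Ring` class, the node names and the replication factor are assumptions, and the ring is assumed to have at least one node.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, replication_factor=2):
        pairs = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in pairs]
        self._nodes = [n for _, n in pairs]
        # Cannot keep more copies than there are nodes.
        self.replication_factor = min(replication_factor, len(pairs))

    def replicas(self, key: str) -> list:
        """Walk clockwise from the key's position, collecting replica nodes."""
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return [
            self._nodes[(start + i) % len(self._nodes)]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
print(ring.replicas("jsmith"))
print(ring.replicas("jbrown"))
```

Any node can compute where a key lives from the key alone, which is what lets the cluster work without a central coordinator.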
    • 21. CQL: version 0.8 introduced CQL, the Cassandra Query Language. Almost looks like SQL! Language ref: language-cql-v2-0-reference/
    • 22. Demo: start Cassandra; open CQLSH; create a keyspace; create a columnfamily. Now we can insert!
    • 23. So why does this work? jsmith → password: ch@ngem3a. jbrown → gender: Male; phone: 01382 345078. Column store: keys with name: value pairs underneath.
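The picture on slide 23 (keys with name: value pairs underneath) can be modelled as a dictionary of dictionaries. This only mirrors the slide's mental model; it says nothing about how Cassandra stores or validates columns, and the `insert` helper is hypothetical.

```python
from collections import defaultdict

# Row key -> {column name: value}; each row may hold a different set of columns.
users = defaultdict(dict)


def insert(column_family, key, **columns):
    """Add or overwrite named columns under one row key."""
    column_family[key].update(columns)


# The same three inserts as in the demo.
insert(users, "jsmith", password="ch@ngem3a")
insert(users, "jbrown", gender="male")
insert(users, "jbrown", phone="01382 345078")

for key, columns in users.items():
    print(key, columns)
```

Each row carries only the columns that were actually written, so jsmith has no gender or phone entry at all rather than empty ones.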
    • 24. Interfacing to Cassandra: based on Thrift. Large number of languages supported. I’ve used Java and Hector, although there is a C# version.
    • 25. Cassandra JDBC: very new, difficult to know how stable it is. Needs compiling, and the libraries are not in Cassandra!
    • 26. Astyanax: from Netflix. Based on Hector but said to be a lot simpler!
    • 27. jBloggyAppy, a demo app of Cassandra: all source code on GitHub. Feel free to use and abuse. Simple blogging app.
    • 28. A word on using open source software: Versioning! Things change! Documentation is wrong! End up reading unit tests to actually program.
    • 29. One last thing: Dundee DDD, 17th November, Big Data track. Anyone interested in speaking?