NoSQL
Transcript

  • 1. NOSQL, NO? Introductory presentation
  • 2. RELATIONAL SQL • ACID • Relational algebra • Tables, Columns, Rows • Metadata separate from data • Normalized data • Optimized storage • Optimal for ad-hoc queries • Sharding can be difficult
  • 3. POPULAR RDBMS • MySQL • SQL Server • Oracle • Postgres • DB2 • Informix • Progress • Pervasive • Sybase • Access • Interbase, Firebird …
  • 4. SQL Unified language to create and query both data and metadata Similar to English Verbose(!) Can get complex for non-trivial queries Does not expose execution plan – you say what you want it toreturn, not how
  • 5. SQL EXAMPLES If you can say what you mean, you can query the existing data
    Results are near-instant when querying based on a primary key:
      select * from valute where id=1 and sid=42
    Results are fast when querying based on a non-unique index:
      select valuta from valute where (id=1 and sid=42) and (valute.firma_id=123 and valute.firma__sid=1)
    Very readable for trivial queries:
      select r.customer, sum(rs.iznos) sveukupno
      from racuni r
      join racuni_stavke rs on r.id=rs.racun_id
      where r.id=5
      order by rs.ordinal
  • 6. SQL EXAMPLES Not so readable for non-trivial queries:
      select "MP" tip_prometa, mprac.broj broj_racuna, mprac_stavke.kolicina kolicina,
        (mprac.tecaj*mprac_stavke.kolicina*mprac_stavke.rabat_iznos) rabat_iznos,
        (round(mprac_stavke.cijena - mprac_stavke.rabat_iznos - mprac_stavke.rabat2_iznos - mprac_stavke.rabat3_iznos - mprac_stavke.porez1 - mprac_stavke.porez2 - mprac_stavke.porez_potrosnja,6)*mprac_stavke.kolicina) iznos,
        (mprac_stavke.kolicina * ifnull((select sum(pn_cijena*kolicina)/sum(kolicina) from mprac_skl left join skl_stavke on mprac_skl.skl_id=skl_stavke.skl_id and mprac_skl.skl__sid=skl_stavke.skl__sid where mprac_skl.mprac_id=mprac.id and mprac_skl.mprac__sid=mprac.sid and skl_stavke.artikl_id=mprac_stavke.artikl_id and skl_stavke.artikl__sid=mprac_stavke.artikl__sid), 0)) iznos_nabavno,
        ifnull((select sum(mprac_stavke.kolicina*ambalaze.naknada_kom) from artikli_ambalaze left join ambalaze on ambalaze.id=artikli_ambalaze.ambalaza_id and ambalaze.sid=artikli_ambalaze.ambalaza__sid where artikli_ambalaze.artikl_id=artikli.id and artikli_ambalaze.artikl__sid=artikli.sid and ambalaze.kalkulacija="N"), 0) naknada,
        radnici_komercijalisti.ime racun_komercijalist_ime,
        (select naziv from skladista where skladista.tip_skladista="M" and pj_id=mprac.pj_id limit 1) skladiste_naziv,
        pj.naziv pj_naziv, mprac.datum,
        cast(concat("(", if(DayOfWeek(mprac.datum)=1,7,DayOfWeek(mprac.datum)-1), ") ", if(DayOfWeek(mprac.datum)=1,"1 Nedjelja", if(DayOfWeek(mprac.datum)=2,"2 Ponedjeljak", if(DayOfWeek(mprac.datum)=3,"3 Utorak", if(DayOfWeek(mprac.datum)=4,"4 Srijeda", if(DayOfWeek(mprac.datum)=5,"5 Četvrtak", if(DayOfWeek(mprac.datum)=6,"6 Petak", if(DayOfWeek(mprac.datum)=7,"7 Subota","")))))))) as char(15)) dan_u_tjednu,
        cast(month(mprac.datum) as unsigned) mjesec, cast(week(mprac.datum) as unsigned) tjedan, cast(quarter(mprac.datum) as unsigned) kvartal, cast(year(mprac.datum) as unsigned) godina,
        cast(if(tipovi_komitenata.tip="F", trim(concat(partneri.ime," ",partneri.prezime)), partneri.naziv) as char(200)) kupac_naziv,
        partneri_mjesta.postanski_broj kupac_mjesto, partneri_mjesta.mjesto kupac_mjesto_naziv, partneri_grupe_mjesta.naziv …
  • 7. RDBMS SCALING Vertical scaling • Better CPU, more CPUs • More RAM • More disks • SAN Partitioning Sharding
  • 8. PARTITIONING With many rows and heavy usage, partitioning is a must What to partition • Tables • Indexes • Views Typical cases • Monthly data • Alphabetical keys
  • 9. RDBMS SHARDING Sharding means using several databases, where each holds part of the data (500 clients on one server, another 500 on another) Requires changing application code: connect(calculate_server_from(sharding_key)) Impossible to join data from different databases, so choose your sharding key wisely Very difficult to repartition your databases based on a new key
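To make the connect(calculate_server_from(sharding_key)) idea above concrete, here is a minimal Python sketch of application-side shard routing; the server list and helper names are hypothetical and only illustrate the pattern:

    # Hypothetical shard map: which database holds which slice of customers.
    SHARD_DSNS = [
        "mysql://db1.example.com/app",   # keys that hash to 0
        "mysql://db2.example.com/app",   # keys that hash to 1
    ]

    def calculate_server_from(sharding_key):
        """Derive the target database from the sharding key."""
        return SHARD_DSNS[hash(sharding_key) % len(SHARD_DSNS)]

    def connect(dsn):
        """Stand-in for whatever DB driver the application really uses."""
        print("connecting to", dsn)

    # Every data access must first route to the right shard:
    connect(calculate_server_from(sharding_key=12345))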
  • 10. RDBMS METADATA Metadata: data describing other data RDBMS structures are explicitly defined, and each data type is optimized for storage Lots of constraints Can get slow with a lot of data
  • 11. NOSQL “Not SQL”, “Not only SQL” Core NoSQL databases were invented mostly because RDBMS made life very hard for huge, heavy-traffic web databases NoSQL databases are the ones significantly different from relational databases
  • 12. NOSQL TYPES Wide Column Store / Column Families Document Store Key Value / Tuple Store Graph Databases Object Databases XML Databases Multivalue Databases
  • 13. 4 MAIN DATA MODELS Key-Value Stores BigTable Clones (aka "ColumnFamily") Document Databases Graph Databases Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 14. KEY/VALUE STORES Lineage: Amazon's Dynamo paper and Distributed Hash Tables. Data model: A global collection of key-value pairs. Example: Voldemort, Dynomite, Tokyo Cabinet Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 15. BIGTABLE CLONES Lineage: Google's BigTable paper. Data model: Column family, i.e. a tabular model where each row, at least in theory, can have an individual configuration of columns. Example: HBase, Hypertable, Cassandra Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 16. DOCUMENT DATABASES Lineage: Inspired by Lotus Notes. Data model: Collections of documents, which contain key-value collections (called "documents"). Example: CouchDB, MongoDB, Riak Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 17. GRAPH DATABASES Lineage: Draws from Euler and graph theory. Data model: Nodes & relationships, both of which can hold key-value pairs. Example: AllegroGraph, InfoGrid, Neo4j Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 18. POPULAR NOSQL • Hadoop / HBase • Cassandra • Amazon SimpleDB • MongoDB • CouchDB • Redis • MemcacheDB • Voldemort • Hypertable • Cloudata • IBM Lotus/Domino
  • 19. NOSQL CHARACTERISTICS Almost infinite horizontal scaling Very fast Performance doesn’t deteriorate with growth (much) No fixed table schemas No join operations Ad-hoc queries difficult or impossible Structured storage Almost everything happens in RAM
  • 20. REAL-WORLD USE Cassandra • Facebook (original developer, used it till late 2010) • Twitter • Digg • Reddit • Rackspace • Cisco BigTable • Google (open-source version is HBase) MongoDB • Foursquare • Craigslist • Bit.ly • SourceForge • GitHub
  • 21. WHY NOSQL? Handles huge databases (I know, I said it before) Redundancy, data is pretty safe on commodity hardware Super flexible queries using map/reduce Rapid development (no fixed schema, yeah!) Very fast for common use cases
  • 22. PERFORMANCE RDBMS uses a buffer to ensure ACID properties NoSQL does not guarantee ACID and is therefore much faster We don’t need ACID everywhere! I used MySQL and switched to MongoDB for my analytics app • Data processing (every minute) is 4x faster with MongoDB, despite being a lot more detailed (thanks to much simpler development)
  • 23. SCALING Simple web application with not much traffic • Application server, database server all on one machine
  • 24. SCALING More traffic comes in • Application server • Database server
  • 25. SCALING Even more traffic comes in • Load balancer • Application server x2 • Database server
  • 26. SCALING Even more traffic comes in • Load balancer x N • easy • Application server x N • easy • Database server x N • hard for SQL databases
  • 27. SQL SLOWDOWN Not linear! http://www.slideshare.net/rightscale/scaling-sql-and-nosql-databases-in-the-cloud
  • 28. NOSQL SCALING Need more storage? • Add more servers! Need higher performance? • Add more servers! Need better reliability? • Add more servers!
  • 29. SCALING SUMMARY You can scale SQL databases (Oracle, MySQL, SQL Server…) • This will cost you dearly • If you don’t have a lot of money, you will reach limits quickly You can scale NoSQL databases • Very easy horizontal scaling • Lots of open-source solutions • Scaling is one of the basic design goals, so it is handled well • Scaling drives the trade-offs that force you to use map/reduce
  • 30. RAM Why map/reduce? I just need some simple queries. Tomorrow I will need some other queries… SQL databases are optimized for very efficient disk access, but for significant scaling they need RAM caching (MySQL + memcached) NoSQL databases are designed to keep the whole working set in RAM
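As a sketch of the MySQL + memcached caching mentioned above (assuming the python-memcached client; the query helper is a stand-in for a real DB driver), the cache-aside pattern looks roughly like this:

    import json
    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def run_mysql_query(sql, params):
        # Placeholder for a real MySQL call (e.g. via MySQLdb or mysql.connector).
        return {"id": params[0], "name": "example"}

    def get_customer(customer_id):
        key = "customer:%d" % customer_id
        cached = mc.get(key)                    # 1. try the RAM cache first
        if cached is not None:
            return json.loads(cached)
        row = run_mysql_query("select * from customers where id = %s", (customer_id,))
        mc.set(key, json.dumps(row), time=300)  # 2. keep the working set in RAM
        return row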
  • 31. WORKING SET In real-world use the working set is much smaller than the complete database • For analytics, 99% of queries will concern the last 30 days As you need RAM only for the working set, you can use commodity servers or VPSes, and just add more as your app becomes more popular
  • 32. WORKING SET WOES Foursquare has millions of users and a working set the same size as the database They used a single 66GB Amazon EC2 High-Memory Quadruple Extra Large Instance (with cheese) for millions of users When their RAM usage was at 65GB, they decided to shard Too late, they started to have disk swaps Disk is much slower than RAM – a 100x slowdown The server could not keep up due to swapping 11-hour outage (ouch!)
  • 33. MAP/REDUCE Google’s framework for processing highly distributable problems across huge datasets using a large number of computers Let’s define “large number of computers” • Cluster if all of them have the same hardware • Grid unless Cluster (if !Cluster, for old-style programmers)
  • 34. MAP/REDUCE Process split into two phases • Map • Take the input, partition it and delegate to other machines • Other machines can repeat the process, leading to a tree structure • Each machine returns results to the machine that gave it the task • Reduce • collect results from the machines you gave tasks to • combine the results and return them to the requester • Slower than sequential data processing, but massively parallel • Sort a petabyte of data in a few hours • Input, Map, Shuffle, Reduce, Output
  • 35. MAP/REDUCE EXAMPLE You need to write two functions Example: count the occurrences of each word in a set of documents
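Here is a minimal, single-machine Python sketch of those two functions for the word-count example; a real map/reduce framework would run map() on many machines, shuffle the pairs by key, and then run reduce() in parallel:

    from collections import defaultdict

    def map_doc(doc):
        """Map: emit a (word, 1) pair for every word in a document."""
        for word in doc.split():
            yield word.lower(), 1

    def reduce_word(word, counts):
        """Reduce: combine all partial counts for one word."""
        return word, sum(counts)

    documents = ["NoSQL scales out", "SQL scales up", "NoSQL is not only SQL"]

    # Shuffle: group mapped pairs by key before the reduce phase.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_doc(doc):
            grouped[word].append(count)

    print(dict(reduce_word(w, c) for w, c in grouped.items()))
    # {'nosql': 2, 'scales': 2, 'out': 1, 'sql': 2, 'up': 1, 'is': 1, 'not': 1, 'only': 1}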
  • 36. MONGODB Document store Basic support for dynamic (ad hoc) queries Query by example (nice!)
  • 37. MONGODB Conditional Operators • <, <=, >, >= • $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type • Regular expressions
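A brief pymongo sketch of query-by-example and the conditional operators from slides 36–37 (collection and field names are made up for illustration):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").analytics

    # Query by example: find documents that look like this one.
    db.events.find({"type": "pageview", "country": "HR"})

    # Conditional operators and a regular expression.
    db.events.find({
        "visits": {"$gte": 10, "$lt": 100},     # range
        "tag": {"$in": ["nosql", "mongodb"]},   # set membership
        "referrer": {"$regex": "^https?://"},   # regular expression
    })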
  • 38. MONGODB Data is stored as BSON (binary JSON) • Makes it very well suited for languages with native JSON support Map/Reduce written in JavaScript • Slow! There is a single thread of execution in JavaScript Master/slave replication (auto failover with replica sets) Sharding built in Uses memory-mapped files for data storage Performance over features On 32-bit systems, limited to ~2.5GB An empty database takes up 192MB GridFS to store big data + metadata (not actually an FS) Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  • 39. CASSANDRA Written in: Java Protocol: Custom, binary (Thrift) Tunable trade-offs for distribution and replication (N, R, W) Querying by column, range of keys BigTable-like features: columns, column families Writes are much faster than reads (!) • Constant write time regardless of database size Map/reduce possible with Apache Hadoop Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
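As an illustration of the tunable N/R/W trade-off, here is a sketch using the DataStax Python driver and CQL (a newer interface than the Thrift protocol listed above; the keyspace and table names are hypothetical):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")

    # Require a quorum of replicas to answer this read (the R in N/R/W).
    stmt = SimpleStatement(
        "SELECT * FROM events WHERE user_id = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    rows = session.execute(stmt, ("user-42",))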
  • 40. HBASE Written in: Java Main point: Billions of rows x millions of columns Modeled after BigTable Map/reduce with Hadoop Query predicate push-down via server-side scan and get filters Optimizations for real-time queries A high-performance Thrift gateway HTTP API supports XML, Protobuf, and binary Cascading, Hive, and Pig source and sink modules No single point of failure While Hadoop streams data efficiently, it has overhead for starting map/reduce jobs; HBase is a column-oriented key/value store and allows for low-latency reads and writes Random-access performance is like MySQL Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
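A small sketch of reading and writing HBase through the Thrift gateway mentioned above, assuming the happybase Python client (table and column-family names are made up):

    import happybase

    connection = happybase.Connection("localhost", port=9090)  # Thrift gateway
    table = connection.table("events")

    # Write and read a cell in the 'stats' column family.
    table.put(b"user-42", {b"stats:pageviews": b"17"})
    print(table.row(b"user-42")[b"stats:pageviews"])

    # Range scan by row-key prefix, filtered server-side.
    for key, data in table.scan(row_prefix=b"user-"):
        print(key, data)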
  • 41. REDIS Written in: C Main point: Blazing fast Disk-backed in-memory database, master-slave replication Simple values or hash tables by key Has sets (also union/diff/inter) Has lists (also a queue; blocking pop) Has hashes (objects of multiple fields) Sorted sets (high-score table, good for range queries) Has transactions (!) Values can be set to expire (as in a cache) Pub/Sub lets one implement messaging (!) Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
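A short redis-py sketch touching the data types listed above (key names are invented for the example):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.incr("page:home:views")                     # simple counter value
    r.expire("page:home:views", 3600)             # values can expire, like a cache

    r.sadd("tags:post:1", "nosql", "redis")       # sets (union/diff/inter available)
    r.lpush("jobs", "resize-image")               # list used as a queue
    job = r.brpop("jobs", timeout=5)              # blocking pop

    r.zadd("highscores", {"alice": 4200, "bob": 3100})   # sorted set: high-score table
    top = r.zrevrange("highscores", 0, 9, withscores=True)

    r.publish("events", "user-signed-up")         # pub/sub messaging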
  • 42. COUCHDB Written in: Erlang Main point: DB consistency, ease of use Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus master-master replication (!) MVCC – write operations do not block reads Previous versions of documents are available Crash-only (reliable) design Needs compacting from time to time Views: embedded map/reduce Formatting views: lists & shows Server-side document validation possible Authentication possible Real-time updates via _changes (!) Attachment handling CouchApps (standalone JS apps) Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
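To show the embedded map/reduce views, here is a sketch against CouchDB's HTTP API using the requests library (the database, design document, and field names are hypothetical, and authentication is omitted):

    import requests

    base = "http://localhost:5984/analytics"

    # A design document: a JavaScript map function plus the built-in _count reducer.
    design = {
        "views": {
            "by_type": {
                "map": "function(doc) { emit(doc.type, 1); }",
                "reduce": "_count",
            }
        }
    }
    requests.put(base + "/_design/stats", json=design)

    # Query the view; group=true reduces per distinct key.
    resp = requests.get(base + "/_design/stats/_view/by_type", params={"group": "true"})
    print(resp.json())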
  • 43. HADOOP Apache project A framework that allows for the distributed processing of large data sets across clusters of computers Designed to scale up from single servers to thousands of machines Designed to detect and handle failures at the application layer, instead of relying on hardware for it
  • 44. HADOOP Created by Doug Cutting, who named it after his son’s toy elephant Hadoop subprojects • Cassandra • HBase • Pig Hive was a Hadoop subproject, but is now a top-level Apache project Used by many large & famous organizations • http://wiki.apache.org/hadoop/PoweredBy Scales to hundreds or thousands of computers, each with several processor cores Designed to efficiently distribute large amounts of work across a set of machines Hundreds of gigabytes of data constitute the low end of Hadoop scale Built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes
  • 45. HADOOP See http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig Uses Java, but allows streaming so other languages can easily send and accept data items to/from Hadoop
  • 46. HADOOP Uses a distributed file system (HDFS) • Designed to hold very large amounts of data (terabytes or even petabytes) • Files are stored in a redundant fashion across multiple machines to ensure durability against failure and high availability to highly parallel applications • Data organized into directories and files • Files are divided into blocks (64MB by default) and distributed across nodes The design of HDFS is based on the design of the Google File System
  • 47. HIVE A petabyte-scale data warehouse system for Hadoop Easy data summarization, ad-hoc queries Query the data using a SQL-like language called HiveQL Hive compiler generates map-reduce jobs for most queries
  • 48. PIG Platform for analyzing large data sets High-level language for expressing data analysis programs Compiler produces sequences of Map-Reduce programs Textual language called Pig Latin • Ease of programming • System optimizes task execution automatically • Users can create their own functions
  • 49. PIG LATIN Pig Latin – high-level Map/Reduce programming The equivalent of SQL for RDBMS systems Pig Latin can be extended using Java User Defined Functions “Word Count” script in Pig Latin
  • 50. MY MONGODB
  • 51. MY MONGODB
  • 52. SUMMARY NoSQL is a great problem solver if you need it Choose your NoSQL platform carefully, as each is designed for a specific purpose Get used to Map/Reduce It’s not a sin to use NoSQL alongside a (yes)SQL database I am really happy to work with MongoDB :) instead of MySQL
