 Introductory presentation

 SQL                            ACID

 Relational algebra             Optimal for ad-hoc queries

 Tables, Columns, Rows          Sharding can be difficult

 Metadata separate from data

 Normalized data

 Optimized storage

 MySQL                  Informix

 SQL Server             Progress

 Oracle                 Pervasive

 Postgres               Sybase

 DB2                    Access

 Interbase, Firebird   …

 Unified language to create and query both data and metadata

 Similar to English

 Verbose(!)

 Can get complex for non-trivial queries

 Does not expose execution plan – you say what you want it to
return, not how
 If you can say what you mean, you can query the existing data
 Results are near-instant when querying based on primary key
select * from valute where id=1 and sid=42

 Results are fast when querying based on non-unique index
select valuta from valute where ((id=1 and sid=42)) and (valute.firma_id=123 and

 Very readable for trivial queries
select r.customer,sum(rs.iznos) sveukupno from racuni r
join racuni_stavke rs on
order by rs.ordinal

 Not so readable for non-trivial queries
select "MP" tip_prometa, mprac.broj broj_racuna, mprac_stavke.kolicina kolicina, (mprac.tecaj*mprac_stavke.kolicina*mprac_stavke.rabat_iznos)
rabat_iznos, (round(mprac_stavke.cijena - mprac_stavke.rabat_iznos - mprac_stavke.rabat2_iznos - mprac_stavke.rabat3_iznos - mprac_stavke.porez1 -
mprac_stavke.porez2 - mprac_stavke.porez_potrosnja,6)*mprac_stavke.kolicina) iznos, (mprac_stavke.kolicina* ifnull((select
sum(pn_cijena*kolicina)/sum(kolicina) from mprac_skl left join skl_stavke on mprac_skl.skl_id=skl_stavke.skl_id and
mprac_skl.skl__sid=skl_stavke.skl__sid where and mprac_skl.mprac__sid=mprac.sid and
skl_stavke.artikl_id=mprac_stavke.artikl_id and skl_stavke.artikl__sid=mprac_stavke.artikl__sid ),0) ) iznos_nabavno, ifnull( (select
sum(mprac_stavke.kolicina*ambalaze.naknada_kom) from artikli_ambalaze left join ambalaze on and
ambalaze.sid=artikli_ambalaze.ambalaza__sid where and artikli_ambalaze.artikl__sid=artikli.sid and
ambalaze.kalkulacija="N" ),0) naknada, radnici_komercijalisti.ime racun_komercijalist_ime, (select naziv from skladista where skladista.tip_skladista="M"
and pj_id=mprac.pj_id limit 1) skladiste_naziv , pj.naziv pj_naziv, mprac.datum,
cast(concat("(",if(DayOfWeek(mprac.datum)=1,7,DayOfWeek(mprac.datum)-1),") ", if(DayOfWeek(mprac.datum)=1,"1 Nedjelja",
if(DayOfWeek(mprac.datum)=2,"2 Ponedjeljak", if(DayOfWeek(mprac.datum)=3,"3 Utorak", if(DayOfWeek(mprac.datum)=4,"4 Srijeda",
if(DayOfWeek(mprac.datum)=5,"5 Četvrtak", if(DayOfWeek(mprac.datum)=6,"6 Petak", if(DayOfWeek(mprac.datum)=7,"7 Subota","")))))))) as char(15))
dan_u_tjednu, cast(month(mprac.datum) as unsigned) mjesec, cast(week(mprac.datum) as unsigned) tjedan, cast(quarter(mprac.datum) as unsigned) kvartal,
cast(year(mprac.datum) as unsigned) godina, cast(if(tipovi_komitenata.tip="F",trim(concat(partneri.ime," ",partneri.prezime)),partneri.naziv) as char(200))
kupac_naziv, partneri_mjesta.postanski_broj kupac_mjesto, partneri_mjesta.mjesto kupac_mjesto_naziv, partneri_grupe_mjesta.naziv …

 Vertical scaling
     •   Better CPU, more CPUs
     •   More RAM
     •   More disks
     •   SAN

 Partitioning

 Sharding

 With many rows and heavy usage, partitioning is a must

 What to partition
     • Tables
     • Indexes
     • Views

 Typical cases
     • Monthly data
     • Alphabetical keys
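
The two typical cases above can be sketched as a routing function that picks a partition for each row. This is a minimal illustration, not how any particular RDBMS does it: the naming scheme, the function names and the letter ranges are all assumptions, and the table names are borrowed from the earlier queries purely as examples. Real databases do this routing internally once partitions are declared.

```python
from datetime import date

# Hypothetical letter ranges for alphabetical partitioning.
LETTER_RANGES = (("a", "h"), ("i", "p"), ("q", "z"))


def monthly_partition(table: str, day: date) -> str:
    """Return the name of the partition holding rows dated `day` (one per month)."""
    return f"{table}_{day.year:04d}_{day.month:02d}"


def alphabetical_partition(table: str, key: str) -> str:
    """Return the partition whose letter range covers the first letter of `key`."""
    first = key[0].lower()
    for low, high in LETTER_RANGES:
        if low <= first <= high:
            return f"{table}_{low}_{high}"
    # Digits, punctuation and letters outside a-z go to a catch-all partition.
    return f"{table}_other"


print(monthly_partition("racuni", date(2011, 3, 15)))
print(alphabetical_partition("partneri", "Horvat"))
```

A query restricted to one month or one letter range then only has to touch one partition instead of the whole table.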

 Sharding means using several databases where each represents part
of data (500 clients on one server, another 500 on another)

 Requires changing application code

 Impossible to join data from different databases, so choose your
sharding key wisely

 Very difficult to repartition your databases based on a new key
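
The points above can be sketched in application code. This is a minimal illustration assuming hash-based placement on a client id; the server names and function names are hypothetical, and a real deployment might use range-based or directory-based placement instead.

```python
import hashlib

# Hypothetical server names, one database per shard.
SHARDS = ["db0.example.internal", "db1.example.internal"]


def shard_for(client_id: int, shard_count: int) -> int:
    """Map a client id to a shard index using a stable hash."""
    digest = hashlib.sha256(str(client_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def server_for(client_id: int) -> str:
    """The application must call this before every query (the code change)."""
    return SHARDS[shard_for(client_id, len(SHARDS))]


def moved_fraction(ids, old_count: int, new_count: int) -> float:
    """Fraction of ids that land on a different shard after resizing."""
    moved = sum(1 for i in ids if shard_for(i, old_count) != shard_for(i, new_count))
    return moved / len(ids)


print(server_for(501))
print(moved_fraction(range(1000), 2, 3))
```

With this modulo placement, going from 2 to 3 shards typically relocates a majority of the ids, which is one way to see why repartitioning is painful and why the sharding key deserves care up front.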

 Metadata: data describing other data

 RDBMS structures are explicitly defined, and each data type is
optimized for storage

 Lots of constraints

 Can get slow with a lot of data

 “Not SQL”, “Not only SQL”

 Core NoSQL databases were invented mostly because RDBMSs made
life very hard for huge, heavy-traffic web databases

 NoSQL databases are the ones significantly different from
relational databases

 Wide Column Store / Column Families
 Document Store
 Key Value / Tuple Store
 Graph Databases
 Object Databases
 XML Databases
 Multivalue Databases

 Key-Value Stores

 BigTable Clones (aka "ColumnFamily")

 Document Databases

 Graph Databases

 Lineage: Amazon's Dynamo paper and Distributed HashTables.

 Data model: A global collection of key-value pairs.

 Example: Voldemort, Dynomite, Tokyo Cabinet

 Lineage: Google's BigTable paper.

 Data model: Column family, i.e. a tabular model where each row at
least in theory can have an individual configuration of columns.

 Example: HBase, Hypertable, Cassandra

 Lineage: Inspired by Lotus Notes.

 Data model: Collections of documents, which contain key-value
collections (called "documents").

 Example: CouchDB, MongoDB, Riak

 Lineage: Draws from Euler and graph theory.

 Data model: Nodes & relationships, both of which can hold key-value pairs

 Example: AllegroGraph, InfoGrid, Neo4j

 Hadoop / HBase     MemcacheDB

 Cassandra          Voldemort

 Amazon SimpleDB    Hypertable

 MongoDB            Cloudata

 CouchDB            IBM Lotus/Domino

 Redis

 Almost infinite horizontal scaling
 Very fast
 Performance doesn’t deteriorate with growth (much)
 No fixed table schemas
 No join operations
 Ad-hoc queries difficult or impossible
 Structured storage
 Almost everything happens in RAM
 Cassandra
      •   Facebook (original developer, used it till late 2010)
      •   Twitter
      •   Digg
      •   Reddit
      •   Rackspace
      •   Cisco
 BigTable
      •   Google (open-source version is HBase)
 MongoDB
      •   Foursquare
      •   Craigslist
      •   SourceForge
      •   GitHub

 Handles huge databases (I know, I said it before)

 Redundancy, data is pretty safe on commodity hardware

 Super flexible queries using map/reduce

 Rapid development (no fixed schema, yeah!)

 Very fast for common use cases

 RDBMS uses buffer to ensure ACID properties

 NoSQL does not guarantee ACID and is therefore much faster

 We don’t need ACID everywhere!

 I used MySQL and switched to MongoDB for my analytics app
     • Data processing (every minute) is 4x faster with MongoDB, despite
       being a lot more detailed (due to much simpler development)

 Simple web application with not much traffic
     • Application server, database server all on one machine

 More traffic comes in
     • Application server
     • Database server

 Even more traffic comes in
     • Load balancer
     • Application server x2
     • Database server

 Even more traffic comes in
     • Load balancer x N
         • easy
     • Application server x N
         • easy
     • Database server x N
         • hard for SQL databases

 Not linear!

 Need more storage?
     • Add more servers!

 Need higher performance?
     • Add more servers!

 Need better reliability?
     • Add more servers!

 You can scale SQL databases (Oracle, MySQL, SQL Server…)
     • This will cost you dearly
     • If you don’t have a lot of money, you will reach limits quickly

 You can scale NoSQL databases
     •   Very easy horizontal scaling
     •   Lots of open-source solutions
     •   Scaling is one of the basic incentives for design, so it is well handled
     •   Scaling is the cause of trade-offs causing you to have to use

 Why map/reduce? I just need some simple queries. Tomorrow I
will need some other queries….

 SQL databases are optimized for very efficient disk access, but for
significant scaling need RAM caching (MySQL+memcached)

 NoSQL databases are designed to keep the whole working set in RAM

 In real-world use, the working set is much smaller than the complete database
     • For analytics 99% of queries will be regarding last 30 days

 As you need RAM only for working set, you can use commodity
servers, VPS, and just add more as your app becomes more popular

 Foursquare has millions of users and a working set the same size as the database
 They used a single 66GB Amazon EC2 High-Memory Quadruple Extra Large
Instance (with cheese) for millions of users
 When their RAM usage reached 65GB, they decided to shard
 Too late: the server had already started swapping to disk
 Disk is much slower than RAM - 100x slowdown
 Server could not keep up due to swapping
 11-hour outage (ouch!)

 Google’s framework for processing highly distributable
problems across huge datasets using a large number of
computers
 Let’s define large number of computers
     • Cluster if all of them have same hardware
     • Grid unless Cluster (if !Cluster for old-style programmers)

 Process split into two phases
     • Map
          • Take the input, partition it, and delegate the parts to other machines
          • Other machines can repeat the process, leading to a tree structure
          • Each machine returns results to the machine that gave it the task
     • Reduce
          • Collect results from the machines you gave tasks to
          • Combine the results and return them to the requester
     • Slower than sequential data processing, but massively parallel
     • Can sort a petabyte of data in a few hours
     • Input, Map, Shuffle, Reduce, Output

 You need to write two functions

 Count different words in a set of documents
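
The two functions are presumably a map function and a reduce function, matching the two phases described earlier. A minimal single-process sketch of the word-count example follows; the function names and sample documents are illustrative, and a real framework would run the map and reduce calls on many machines and do the shuffle over the network.

```python
from collections import defaultdict


def map_words(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts):
    """Reduce: combine all values emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run Input -> Map -> Shuffle -> Reduce -> Output in one process."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


docs = ["NoSQL is not SQL", "SQL is verbose"]
print(word_count(docs))
```

Only `map_words` and `reduce_counts` are specific to the problem; the shuffle and the driver are what the framework provides.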

 Document store

 Basic support for dynamic (ad hoc) queries

 Query by example (nice!)

 Conditional Operators
     • <, <=, >, >=
     • $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type

  Regular expressions
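
Query by example means the query is itself a document that matching documents must resemble. The sketch below imitates that idea over plain Python dictionaries for a small subset of the operators; it is not MongoDB's implementation, the `matches` and `find` helpers and the sample data are hypothetical, and it assumes the comparison signs listed above correspond to the `$lt`, `$lte`, `$gt` and `$gte` operators.

```python
OPERATORS = {
    "$lt": lambda value, arg: value is not None and value < arg,
    "$lte": lambda value, arg: value is not None and value <= arg,
    "$gt": lambda value, arg: value is not None and value > arg,
    "$gte": lambda value, arg: value is not None and value >= arg,
    "$ne": lambda value, arg: value != arg,
    "$in": lambda value, arg: value in arg,
    "$nin": lambda value, arg: value not in arg,
}


def matches(document: dict, query: dict) -> bool:
    """Return True if `document` satisfies every condition in `query`."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            # Simplification: any dict condition is read as {operator: argument}.
            for op, arg in condition.items():
                if op == "$exists":
                    if (field in document) != arg:
                        return False
                elif not OPERATORS[op](value, arg):
                    return False
        elif value != condition:
            return False
    return True


def find(collection, query):
    """Return all documents in `collection` that match `query`."""
    return [doc for doc in collection if matches(doc, query)]


venues = [
    {"name": "Cafe", "city": "Zagreb", "checkins": 120},
    {"name": "Bar", "city": "Split", "checkins": 15},
    {"name": "Kiosk", "city": "Zagreb"},
]
print(find(venues, {"city": "Zagreb", "checkins": {"$gte": 100}}))
```

Because there is no fixed schema, the third document simply lacks `checkins`, and a query such as `{"checkins": {"$exists": False}}` can select it.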
    Data is stored as BSON (binary JSON)
         •    Makes it very well suited for languages with native JSON support
    Map/Reduce written in JavaScript
         •    Slow! There is a single thread of execution in JavaScript
    Master/slave replication (auto failover with replica sets)
    Sharding built-in
    Uses memory mapped files for data storage
    Performance over features
    On 32-bit systems, limited to ~2.5 GB
    An empty database takes up 192 MB
    GridFS to store big data + metadata (not actually an FS)
 Written in: Java
 Protocol: Custom, binary (Thrift)
 Tunable trade-offs for distribution and replication (N, R, W)
 Querying by column, range of keys
 BigTable-like features: columns, column families
 Writes are much faster than reads (!)
         • Constant write time regardless of database size
 Map/reduce possible with Apache Hadoop
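
The N, R, W trade-off mentioned above can be made concrete with the usual Dynamo-style quorum rule: with N replicas, a read that contacts R of them is guaranteed to overlap a write acknowledged by W of them when R + W > N. The sketch below only does that arithmetic; the function names are illustrative and nothing here is Cassandra's own API.

```python
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


def tolerated_failures(n: int, quorum: int) -> int:
    """How many replicas may be down while a quorum of this size still succeeds."""
    return n - quorum


for r, w in [(1, 1), (2, 2), (1, 3)]:
    print(r, w, read_sees_latest_write(3, r, w), tolerated_failures(3, w))
```

Small R and W favor speed and availability, larger values favor consistency, which is the tuning knob the slide refers to.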
     Written in: Java
     Main point: Billions of rows X millions of columns
     Modeled after BigTable
     Map/reduce with Hadoop
     Query predicate push down via server side scan and get filters
     Optimizations for real time queries
     A high performance Thrift gateway
     HTTP supports XML, Protobuf, and binary
     Cascading, hive, and pig source and sink modules
     No single point of failure
     While Hadoop streams data efficiently, it has overhead for starting map/reduce jobs. HBase is a column-oriented key/value store and
allows for low-latency reads and writes.
     Random access performance is like MySQL
   Written in: C/C++
   Main point: Blazing fast
   Disk-backed in-memory database
   Master-slave replication
   Simple values or hash tables by keys
   Has sets (also union/diff/inter)
   Has lists (also a queue; blocking pop)
   Has hashes (objects of multiple fields)
   Sorted sets (high score table, good for range queries)
   Has transactions (!)
   Values can be set to expire (as in a cache)
   Pub/Sub lets one implement messaging (!)

    Written in: Erlang
    Main point: DB consistency, ease of use
    Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)
    MVCC - write operations do not block reads
    Previous versions of documents are available
    Crash-only (reliable) design
    Needs compacting from time to time
    Views: embedded map/reduce
    Formatting views: lists & shows
    Server-side document validation possible
    Authentication possible
    Real-time updates via _changes (!)
    Attachment handling
    CouchApps (standalone JS apps)

 Apache project

 A framework that allows for the distributed processing of large
data sets across clusters of computers

 Designed to scale up from single servers to thousands of machines

 Designed to detect and handle failures at the application layer,
instead of relying on hardware for it
   Created by Doug Cutting, who named it after his son's toy elephant
   Hadoop-related Apache projects
        •    Cassandra
        •    HBase
        •    Pig
   Hive was a Hadoop subproject, but is now a top-level Apache project
   Used by many large & famous organizations
   Scales to hundreds or thousands of computers, each with several processor cores
   Designed to efficiently distribute large amounts of work across a set of machines
   Hundreds of gigabytes of data constitute the low end of Hadoop-scale
   Built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes

 See

 Uses Java, but allows streaming so other languages can easily send
and accept data items to/from Hadoop

 Uses distributed file system (HDFS)
     • Designed to hold very large amounts of data (terabytes or even
       petabytes)
     • Files are stored in a redundant fashion across multiple machines to
       ensure their durability to failure and high availability to very parallel
       applications
     • Data organized into directories and files
     • Files are divided into blocks (64 MB by default) and distributed across
       machines
 Design of HDFS is based on the design of the Google File System
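
The block arithmetic implied above can be sketched as follows. The 64 MB block size comes from the slide; the replication factor of 3 is an assumption used only for illustration, and the function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted above
REPLICATION = 3      # assumed replication factor, for illustration only


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def stored_block_copies(file_size_mb: float, replication: int = REPLICATION) -> int:
    """Total block copies kept across the cluster for one file."""
    return block_count(file_size_mb) * replication


print(block_count(1024))          # a 1 GB file
print(stored_block_copies(1024))  # copies of its blocks across machines
```

Each block can be processed by a different machine, which is what lets map tasks run close to the data they read.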

 A petabyte-scale data warehouse system for Hadoop

 Easy data summarization, ad-hoc queries

 Query the data using a SQL-like language called HiveQL

 Hive compiler generates map-reduce jobs for most queries

 Platform for analyzing large data sets

 High-level language for expressing data analysis programs

 Compiler produces sequences of Map-Reduce programs

 Textual language called Pig Latin
     • Ease of programming
     • System optimizes task execution automatically
     • Users can create their own functions

 Pig Latin – high level Map/Reduce programming

 Equivalent to SQL for RDBMS systems.

 Pig Latin can be extended using Java User Defined Functions

 “Word Count” script in Pig Latin

 NoSQL is a great problem solver if you need it

 Choose your NoSQL platform carefully as each is designed for
specific purpose

 Get used to Map/Reduce

 It’s not a sin to use NoSQL alongside (yes)SQL database

 I am really happy to work with MongoDB instead of MySQL

More Related Content

What's hot

Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
James Serra
Clustered Columnstore - Deep Dive
Clustered Columnstore - Deep DiveClustered Columnstore - Deep Dive
Clustered Columnstore - Deep Dive
Niko Neugebauer
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
Clarence J M Tauro
Amazon RedShift - Ianni Vamvadelis
Amazon RedShift - Ianni VamvadelisAmazon RedShift - Ianni Vamvadelis
Amazon RedShift - Ianni Vamvadelis
What's New in Amazon Aurora
What's New in Amazon AuroraWhat's New in Amazon Aurora
What's New in Amazon Aurora
Amazon Web Services
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
Jeremy Beard
NoSQL Intro with cassandra
NoSQL Intro with cassandraNoSQL Intro with cassandra
NoSQL Intro with cassandra
Brian Enochson
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
Amazon Web Services
NoSQL in Real-time Architectures
NoSQL in Real-time ArchitecturesNoSQL in Real-time Architectures
NoSQL in Real-time Architectures
Ronen Botzer
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
Amazon Web Services
Scaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQLScaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQL
Richard Schneeman
Simple Works Best
 Simple Works Best Simple Works Best
Simple Works Best
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
Remus Rusanu
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentation
Volodymyr Rovetskiy
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Skills Matter
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing Performance
Amazon Web Services
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...
DataStax Academy
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
Kellyn Pot'Vin-Gorman

What's hot (20)

Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
Clustered Columnstore - Deep Dive
Clustered Columnstore - Deep DiveClustered Columnstore - Deep Dive
Clustered Columnstore - Deep Dive
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
Amazon RedShift - Ianni Vamvadelis
Amazon RedShift - Ianni VamvadelisAmazon RedShift - Ianni Vamvadelis
Amazon RedShift - Ianni Vamvadelis
What's New in Amazon Aurora
What's New in Amazon AuroraWhat's New in Amazon Aurora
What's New in Amazon Aurora
Scaling MongoDB
Scaling MongoDBScaling MongoDB
Scaling MongoDB
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
NoSQL Intro with cassandra
NoSQL Intro with cassandraNoSQL Intro with cassandra
NoSQL Intro with cassandra
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
NoSQL in Real-time Architectures
NoSQL in Real-time ArchitecturesNoSQL in Real-time Architectures
NoSQL in Real-time Architectures
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
(SOV202) Choosing Among AWS Managed Database Services | AWS re:Invent 2014
Scaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQLScaling the Web: Databases & NoSQL
Scaling the Web: Databases & NoSQL
Simple Works Best
 Simple Works Best Simple Works Best
Simple Works Best
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
AWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentationAWS (Amazon Redshift) presentation
AWS (Amazon Redshift) presentation
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKIntroduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing Performance
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther...
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud

Viewers also liked

Knoldus Inc.
353 357
353 357353 357
IChresemo Technologies
IChresemo TechnologiesIChresemo Technologies
IChresemo Technologies
Chinna Chresemo
Google Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a BearGoogle Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a Bear
Lead Generation Websites
Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013
Anja Bonelli
Backbone js in action
Backbone js in actionBackbone js in action
Backbone js in action
Usha Guduri

Viewers also liked (7)

353 357
353 357353 357
353 357
IChresemo Technologies
IChresemo TechnologiesIChresemo Technologies
IChresemo Technologies
Google Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a BearGoogle Updates 2014 - Three Birds and a Bear
Google Updates 2014 - Three Birds and a Bear
TLA_ fuer_Drittsemester
TLA_ fuer_DrittsemesterTLA_ fuer_Drittsemester
TLA_ fuer_Drittsemester
Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013
Fachartikel "Die Rückkehr der Telefonie", Call Center Scout, 10/2013
Backbone js in action
Backbone js in actionBackbone js in action
Backbone js in action

Similar to NoSQL

Nosql seminar
Nosql seminarNosql seminar
001 hbase introduction
001 hbase introduction001 hbase introduction
001 hbase introduction
Scott Miao
NoSQL Seminer
NoSQL SeminerNoSQL Seminer
NoSQL Seminer
Partha Das
AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17
Neal Davis
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
PolarSeven Pty Ltd
Adi Challa
Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2
Amazon Web Services
How and when to use NoSQL
How and when to use NoSQLHow and when to use NoSQL
How and when to use NoSQL
Amazon Web Services
Database Choices
Database ChoicesDatabase Choices
Database Choices
Lynn Langit
If NoSQL is your answer, you are probably asking the wrong question.
If NoSQL is your answer, you are probably asking the wrong question.If NoSQL is your answer, you are probably asking the wrong question.
If NoSQL is your answer, you are probably asking the wrong question.
Lukas Smith
NO SQL: What, Why, How
NO SQL: What, Why, HowNO SQL: What, Why, How
NO SQL: What, Why, How
Igor Moochnick
Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5
Mike King
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skies
Introduction to asdfghjkln b vfgh n v
Introduction to asdfghjkln b vfgh n    vIntroduction to asdfghjkln b vfgh n    v
Introduction to asdfghjkln b vfgh n v
No sql
No sqlNo sql
No sql
Prateek Jain
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
Sql vs NO-SQL database differences explained
Sql vs NO-SQL database differences explainedSql vs NO-SQL database differences explained
Sql vs NO-SQL database differences explained
Satya Pal
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
Amazon Web Services
NoSQL and MongoDB
NoSQL and MongoDBNoSQL and MongoDB
NoSQL and MongoDB
Rajesh Menon
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Charley Hanania

Similar to NoSQL (20)

Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
001 hbase introduction
001 hbase introduction001 hbase introduction
001 hbase introduction
NoSQL Seminer
NoSQL SeminerNoSQL Seminer
NoSQL Seminer
AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17AWS Certified Cloud Practitioner Course S11-S17
AWS Certified Cloud Practitioner Course S11-S17
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2
How and when to use NoSQL
How and when to use NoSQLHow and when to use NoSQL
How and when to use NoSQL
Database Choices
Database ChoicesDatabase Choices
Database Choices
If NoSQL is your answer, you are probably asking the wrong question.
If NoSQL is your answer, you are probably asking the wrong question.If NoSQL is your answer, you are probably asking the wrong question.
If NoSQL is your answer, you are probably asking the wrong question.
NO SQL: What, Why, How
NO SQL: What, Why, HowNO SQL: What, Why, How
NO SQL: What, Why, How
Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skies
Introduction to asdfghjkln b vfgh n v
Introduction to asdfghjkln b vfgh n    vIntroduction to asdfghjkln b vfgh n    v
Introduction to asdfghjkln b vfgh n v
No sql
No sqlNo sql
No sql
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
Sql vs NO-SQL database differences explained
Sql vs NO-SQL database differences explainedSql vs NO-SQL database differences explained
Sql vs NO-SQL database differences explained
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
NoSQL and MongoDB
NoSQL and MongoDBNoSQL and MongoDB
NoSQL and MongoDB
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...
Pass chapter meeting dec 2013 - compression a hidden gem for io heavy databas...

Recently uploaded

Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
  • 1. NOSQL, NO? Introductory presentation
  • 2. RELATIONAL  SQL  ACID  Relational algebra  Optimal for ad-hoc queries  Tables, Columns, Rows  Sharding can be difficult  Metadata separate from data  Normalized data  Optimized storage
  • 3. POPULAR RDBMS  MySQL  Informix  SQL Server  Progress  Oracle  Pervasive  Postgres  Sybase  DB2  Access  Interbase, Firebird …
  • 4. SQL  Unified language to create and query both data and metadata  Similar to English  Verbose(!)  Can get complex for non-trivial queries  Does not expose execution plan – you say what you want it to return, not how
  • 5. SQL EXAMPLES  If you can say what you mean, you can query the existing data  Results are near-instant when querying based on primary key select * from valute where id=1 and sid=42  Results are fast when querying based on non-unique index select valuta from valute where ((id=1 and sid=42)) and (valute.firma_id=123 and valute.firma__sid=1)  Very readable for trivial queries select r.customer,sum(rs.iznos) sveukupno from racuni r join racuni_stavke rs on where order by rs.ordinal
  • 6. SQL EXAMPLES  Not so readable for non-trivial queries select "MP" tip_prometa, mprac.broj broj_racuna, mprac_stavke.kolicina kolicina, (mprac.tecaj*mprac_stavke.kolicina*mprac_stavke.rabat_iznos) rabat_iznos, (round(mprac_stavke.cijena - mprac_stavke.rabat_iznos - mprac_stavke.rabat2_iznos - mprac_stavke.rabat3_iznos - mprac_stavke.porez1 - mprac_stavke.porez2 - mprac_stavke.porez_potrosnja,6)*mprac_stavke.kolicina) iznos, (mprac_stavke.kolicina* ifnull((select sum(pn_cijena*kolicina)/sum(kolicina) from mprac_skl left join skl_stavke on mprac_skl.skl_id=skl_stavke.skl_id and mprac_skl.skl__sid=skl_stavke.skl__sid where and mprac_skl.mprac__sid=mprac.sid and skl_stavke.artikl_id=mprac_stavke.artikl_id and skl_stavke.artikl__sid=mprac_stavke.artikl__sid ),0) ) iznos_nabavno, ifnull( (select sum(mprac_stavke.kolicina*ambalaze.naknada_kom) from artikli_ambalaze left join ambalaze on and ambalaze.sid=artikli_ambalaze.ambalaza__sid where and artikli_ambalaze.artikl__sid=artikli.sid and ambalaze.kalkulacija="N" ),0) naknada, radnici_komercijalisti.ime racun_komercijalist_ime, (select naziv from skladista where skladista.tip_skladista="M" and pj_id=mprac.pj_id limit 1) skladiste_naziv , pj.naziv pj_naziv, mprac.datum, cast(concat("(",if(DayOfWeek(mprac.datum)=1,7,DayOfWeek(mprac.datum)-1),") ", if(DayOfWeek(mprac.datum)=1,"1 Nedjelja", if(DayOfWeek(mprac.datum)=2,"2 Ponedjeljak", if(DayOfWeek(mprac.datum)=3,"3 Utorak", if(DayOfWeek(mprac.datum)=4,"4 Srijeda", if(DayOfWeek(mprac.datum)=5,"5 Četvrtak", if(DayOfWeek(mprac.datum)=6,"6 Petak", if(DayOfWeek(mprac.datum)=7,"7 Subota","")))))))) as char(15)) dan_u_tjednu, cast(month(mprac.datum) as unsigned) mjesec, cast(week(mprac.datum) as unsigned) tjedan, cast(quarter(mprac.datum) as unsigned) kvartal, cast(year(mprac.datum) as unsigned) godina, cast(if(tipovi_komitenata.tip="F",trim(concat(partneri.ime," ",partneri.prezime)),partneri.naziv) as char(200)) kupac_naziv, partneri_mjesta.postanski_broj kupac_mjesto, 
partneri_mjesta.mjesto kupac_mjesto_naziv, partneri_grupe_mjesta.naziv …
  • 7. RDBMS SCALING  Vertical scaling • Better CPU, more CPUs • More RAM • More disks • SAN  Partitioning  Sharding
  • 8. PARTITIONING  With many rows and heavy usage, partitioning is a must  What to partition • Tables • Indexes • Views  Typical cases • Monthly data • Alphabetical keys
  • 9. RDBMS SHARDING  Sharding means using several databases where each represents part of data (500 clients on one server, another 500 on another)  Requires changing application code connect(calculate_server_from(sharding_key))  Impossible to join data from different databases, so choose your sharding key wisely  Very difficult to repartition your databases based on a new key
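The `connect(calculate_server_from(sharding_key))` call on the slide above could look roughly like the following Python sketch. It is only an illustration: the server names, the hash-modulo rule, and the stand-in `connect` function are assumptions made for this example, not any particular database driver's API.

```python
import hashlib

# Hypothetical shard list; a real deployment would hold connection strings here.
SHARD_SERVERS = ["db-shard-0", "db-shard-1"]

def calculate_server_from(sharding_key, servers=SHARD_SERVERS):
    """Map a sharding key (e.g. a client id) to one server, deterministically."""
    digest = hashlib.md5(str(sharding_key).encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def connect(server):
    """Stand-in for a real database connect call; only returns a description."""
    return f"connection to {server}"

# Every query for client 42 is routed to the same shard.
conn = connect(calculate_server_from(42))
```

With a hash-modulo rule like this one, adding a server changes the target shard for most keys, which is one way to see why repartitioning on a new key or server count is so painful.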
  • 10. RDBMS METADATA  Metadata: data describing other data  RDBMS structures are explicitly defined, and each data type is optimized for storage  Lots of constraints  Can get slow with a lot of data
  • 11. NOSQL  “Not SQL”, “Not only SQL”  Core NoSQL databases were invented mostly because RDBMS made life very hard for huge, heavy-traffic web databases  NoSQL databases are the ones significantly different from relational databases
  • 12. NOSQL TYPES  Wide Column Store / Column Families  Document Store  Key Value / Tuple Store  Graph Databases  Object Databases  XML Databases  Multivalue Databases
  • 13. 4 MAIN DATA MODELS  Key-Value Stores  BigTable Clones (aka "ColumnFamily")  Document Databases  Graph Databases Source:
  • 14. KEY/VALUE STORES  Lineage: Amazon's Dynamo paper and Distributed HashTables.  Data model: A global collection of key-value pairs.  Example: Voldemort, Dynomite, Tokyo Cabinet Source:
  • 15. BIGTABLE CLONES  Lineage: Google's BigTable paper.  Data model: Column family, i.e. a tabular model where each row at least in theory can have an individual configuration of columns.  Example: HBase, Hypertable, Cassandra Source:
  • 16. DOCUMENT DATABASES  Lineage: Inspired by Lotus Notes.  Data model: Collections of documents, which contain key-value collections (called "documents").  Example: CouchDB, MongoDB, Riak Source:
  • 17. GRAPH DATABASES  Lineage: Draws from Euler and graph theory.  Data model: Nodes & relationships, both of which can hold key-value pairs  Example: AllegroGraph, InfoGrid, Neo4j Source:
  • 18. POPULAR NOSQL  Hadoop / Hbase  MemcacheDB  Cassandra  Voldemort  Amazon SimpleDB  Hypertable  MongoDB  Cloudata  CouchDB  IBM Lotus/Domino  Redis
  • 19. NOSQL CHARACTERISTICS  Almost infinite horizontal scaling  Very fast  Performance doesn’t deteriorate with growth (much)  No fixed table schemas  No join operations  Ad-hoc queries difficult or impossible  Structured storage  Almost everything happens in RAM
  • 20. REAL-WORLD USE  Cassandra • Facebook (original developer, used it till late 2010) • Twitter • Digg • Reddit • Rackspace • Cisco  BigTable • Google (open-source version is HBase)  MongoDB • Foursquare • Craigslist • • SourceForge • GitHub
  • 21. WHY NOSQL?  Handles huge databases (I know, I said it before)  Redundancy, data is pretty safe on commodity hardware  Super flexible queries using map/reduce  Rapid development (no fixed schema, yeah!)  Very fast for common use cases
  • 22. PERFORMANCE  RDBMS uses buffer to ensure ACID properties  NoSQL does not guarantee ACID and is therefore much faster  We don’t need ACID everywhere!  I used MySQL and switched to MongoDB for my analytics app • Data processing (every minute) is 4x faster with MongoDB, despite being a lot more detailed (due to much simpler development)
  • 23. SCALING  Simple web application with not much traffic • Application server, database server all on one machine
  • 24. SCALING  More traffic comes in • Application server • Database server
  • 25. SCALING  Even more traffic comes in • Load balancer • Application server x2 • Database server
  • 26. SCALING  Even more traffic comes in • Load balancer x N • easy • Application server x N • easy • Database server x N • hard for SQL databases
  • 27. SQL SLOWDOWN  Not linear!  /scaling-sql-and-nosql-databases-in-the- cloud
  • 28. NOSQL SCALING  Need more storage? • Add more servers!  Need higher performance? • Add more servers!  Need better reliability? • Add more servers!
  • 29. SCALING SUMMARY  You can scale SQL databases (Oracle, MySQL, SQL Server…) • This will cost you dearly • If you don’t have a lot of money, you will reach limits quickly  You can scale NoSQL databases • Very easy horizontal scaling • Lots of open-source solutions • Scaling is one of the basic incentives for design, so it is well handled • The trade-offs made for scaling are the reason you have to use map/reduce
  • 30. RAM  Why map/reduce? I just need some simple queries. Tomorrow I will need some other queries….  SQL databases are optimized for very efficient disk access, but need RAM caching for significant scaling (MySQL+memcached)  NoSQL databases are designed to keep the whole working set in RAM
  • 31. WORKING SET  In real-world use, the working set is much smaller than the complete database • For analytics, 99% of queries will be about the last 30 days  As you need RAM only for the working set, you can use commodity servers or VPS, and just add more as your app becomes more popular
  • 32. WORKING SET WOES  Foursquare has millions of users and a working set the same size as the database  They used a single 66GB Amazon EC2 High-Memory Quadruple Extra Large Instance (with cheese) for millions of users  When their RAM usage was 65GB, they decided to shard  Too late, they started to have disk swaps  Disk is much slower than RAM - 100x slowdown  Server could not keep up due to swapping  11-hour outage (ouch!)
  • 33. MAP/REDUCE  Google’s framework for processing highly distributable problems across huge datasets using a large number of computers  Let’s define a large number of computers • Cluster if all of them have same hardware • Grid unless Cluster (if !Cluster for old-style programmers)
  • 34. MAP/REDUCE  Process split into two phases • Map • Take the input, partition it, and delegate the parts to other machines • Other machines can repeat the process, leading to a tree structure • Each machine returns results to the machine that gave it the task • Reduce • Collect results from the machines you gave the tasks to • Combine the results and return them to the requester • Slower than sequential data processing, but massively parallel • Sort a petabyte of data in a few hours • Input, Map, Shuffle, Reduce, Output
  • 35. MAP/REDUCE EXAMPLE  You need to write two functions  Count different words in a set of documents
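The two functions for the word-count task can be sketched in plain Python as below. This is a single-process illustration of the map, shuffle, and reduce steps under assumed names (`map_words`, `reduce_counts`, `word_count`); a real framework would run the map and reduce calls on many machines and do the grouping for you.

```python
from collections import defaultdict

def map_words(document):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_counts(word, counts):
    """Reduce phase: combine all counts emitted for one word."""
    return word, sum(counts)

def word_count(documents):
    # Shuffle: group the emitted values by key before reducing.
    grouped = defaultdict(list)
    for document in documents:
        for word, count in map_words(document):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())

result = word_count(["to be or not to be", "to do"])
```

Because each `map_words` call sees only one document and each `reduce_counts` call sees only one word, both phases can be spread over many machines without coordination beyond the shuffle.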
  • 36.
  • 37. MONGODB  Document store  Basic support for dynamic (ad hoc) queries  Query by example (nice!)
  • 38. MONGODB  Conditional Operators • <, <=, >, >= • $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type  Regular expressions
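To make "query by example" and the conditional operators concrete, here is a toy matcher in plain Python. It is not MongoDB's implementation and does not call any driver; it only mimics the meaning of a few operators ($lt, $lte, $gt, $gte, $ne, $in, $nin) on ordinary dictionaries. The `matches` function and the sample data are assumptions for this sketch, and treating a missing field as a non-match is a simplification.

```python
import operator

# A small subset of operators, named as in MongoDB's query language.
OPERATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field, condition in query.items():
        if field not in document:
            return False  # simplification: missing fields never match
        value = document[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"age": {"$gte": 18, "$lt": 65}}
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # Query by example: the field must equal the given value.
            return False
    return True

users = [
    {"name": "Ana", "age": 31, "city": "Zagreb"},
    {"name": "Ivo", "age": 17, "city": "Split"},
]
query = {"city": "Zagreb", "age": {"$gte": 18}}
adults_in_zagreb = [user for user in users if matches(user, query)]
```

The query itself is just a document-shaped value, which is what makes this style convenient in languages with native dictionary or JSON support.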
  • 39. MONGODB  Data is stored as BSON (binary JSON) • Makes it very well suited for languages with native JSON support  Map/Reduce written in JavaScript • Slow! There is one single thread of execution in JavaScript  Master/slave replication (auto failover with replica sets)  Sharding built-in  Uses memory mapped files for data storage  Performance over features  On 32-bit systems, limited to ~2.5 GB  An empty database takes up 192 MB  GridFS to store big data + metadata (not actually an FS) Source:
  • 40. CASSANDRA  Written in: Java  Protocol: Custom, binary (Thrift)  Tunable trade-offs for distribution and replication (N, R, W)  Querying by column, range of keys  BigTable-like features: columns, column families  Writes are much faster than reads (!) • Constant write time regardless of database size  Map/reduce possible with Apache Hadoop Source:
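The (N, R, W) trade-off can be illustrated with the usual quorum rule from Dynamo-style systems: with N replicas, W write acknowledgements, and R replicas consulted per read, a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > N. The helper below is a hypothetical sketch of that arithmetic, not part of Cassandra's API.

```python
def quorum_overlap(n, r, w):
    """True if any R replicas read must include one of the W replicas written."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n

# Three example configurations for a replication factor of 3.
settings = {
    "fast reads and writes (N=3, R=1, W=1)": quorum_overlap(3, 1, 1),
    "quorum (N=3, R=2, W=2)": quorum_overlap(3, 2, 2),
    "write-all (N=3, R=1, W=3)": quorum_overlap(3, 1, 3),
}
```

Lower R and W make individual operations cheaper but allow stale reads; raising them trades latency and availability for consistency.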
  • 41. HBASE  Written in: Java  Main point: Billions of rows X millions of columns  Modeled after BigTable  Map/reduce with Hadoop  Query predicate push down via server side scan and get filters  Optimizations for real time queries  A high performance Thrift gateway  HTTP supports XML, Protobuf, and binary  Cascading, hive, and pig source and sink modules  No single point of failure  While Hadoop streams data efficiently, it has overhead for starting map/reduce jobs. HBase is column oriented key/value store and allows for low latency read and writes.  Random access performance is like MySQL Source:
  • 42. REDIS  Written in: C/C++  Main point: Blazing fast  Disk-backed in-memory database,  Master-slave replication  Simple values or hash tables by keys,  Has sets (also union/diff/inter)  Has lists (also a queue; blocking pop)  Has hashes (objects of multiple fields)  Sorted sets (high score table, good for range queries)  Has transactions (!)  Values can be set to expire (as in a cache)  Pub/Sub lets one implement messaging (!) Source:
  • 43. COUCHDB  Written in: Erlang  Main point: DB consistency, ease of use  Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)  MVCC - write operations do not block reads  Previous versions of documents are available  Crash-only (reliable) design  Needs compacting from time to time  Views: embedded map/reduce  Formatting views: lists & shows  Server-side document validation possible  Authentication possible  Real-time updates via _changes (!)  Attachment handling  CouchApps (standalone JS apps) Source:
  • 44. HADOOP  Apache project  A framework that allows for the distributed processing of large data sets across clusters of computers  Designed to scale up from single servers to thousands of machines  Designed to detect and handle failures at the application layer, instead of relying on hardware for it
  • 45. HADOOP  Created by Doug Cutting, who named it after his son's toy elephant  Hadoop subprojects • Cassandra • HBase • Pig  Hive was a Hadoop subproject, but is now a top-level Apache project  Used by many large & famous organizations •  Scales to hundreds or thousands of computers, each with several processor cores  Designed to efficiently distribute large amounts of work across a set of machines  Hundreds of gigabytes of data constitute the low end of Hadoop-scale  Built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes
  • 46. HADOOP  See with-apache-hadoop-pig  Uses Java, but allows streaming so other languages can easily send and accept data items to/from Hadoop
  • 47. HADOOP  Uses distributed file system (HDFS) • Designed to hold very large amounts of data (terabytes or even petabytes) • Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications • Data organized into directories and files • Files are divided into blocks (64 MB by default) and distributed across nodes  Design of HDFS is based on the design of the Google File System
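As a rough worked example of the block layout described above, the sketch below computes how many blocks a file is split into and how much raw storage its redundant copies take. The 64 MB block size comes from the slide; the replication factor of 3 and the function name `hdfs_footprint` are assumptions for this illustration.

```python
import math

BLOCK_SIZE_MB = 64   # default block size stated on the slide
REPLICATION = 3      # assumed replication factor for this example

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, total block replicas, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies the bytes it actually holds,
    # so raw storage is the file size times the replication factor.
    return blocks, blocks * replication, file_size_mb * replication

blocks, replicas, raw_mb = hdfs_footprint(1000)
```

For a 1000 MB file this gives 16 blocks, 48 block replicas spread over the nodes, and 3000 MB of raw disk usage.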
  • 48. HIVE  A petabyte-scale data warehouse system for Hadoop  Easy data summarization, ad-hoc queries  Query the data using a SQL-like language called HiveQL  Hive compiler generates map-reduce jobs for most queries
  • 49. PIG  Platform for analyzing large data sets  High-level language for expressing data analysis programs  Compiler produces sequences of Map-Reduce programs  Textual language called Pig Latin • Ease of programming • System optimizes task execution automatically • Users can create their own functions
  • 50. PIG LATIN  Pig Latin – high level Map/Reduce programming  Equivalent to SQL for RDBMS systems.  Pig Latin can be extended using Java User Defined Functions  “Word Count” script in Pig Latin
  • 53. SUMMARY  NoSQL is a great problem solver if you need it  Choose your NoSQL platform carefully as each is designed for specific purpose  Get used to Map/Reduce  It’s not a sin to use NoSQL alongside (yes)SQL database  I am really happy to work with MongoDB  instead of MySQL