What Should I Know
                          about NoSQL?
                                                                                             Cris J. Holdorph
                                                                                                 Software Architect
                                                                                                       Unicon, Inc.

                                                                                                  Jasig Conference
                                                                                                  Westminster, CO
                                                                                                     May 24, 2011




© Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/
Lethal SQL




             2
3
Agenda
1. Definitions
2. History
3. Projects
4. Example Case Studies




                          4
Definitions




              5
Definitions
●
    RDBMS
●
    SQL
●
    CRUD
●
    ACID
    –   Atomicity, Consistency, Isolation, Durability
●
    BASE
    –   Basically Available, Soft state, Eventual
        consistency


                                                        6
7
Definitions
●
    Big Data
●
    Sharding
●
    Cloud Computing
●
    Distributed File System
●
    Key Value Store




                                8
History




          9
Map Reduce
●
    Patented software framework introduced by Google
    in 2004 to support distributed computing on large
    data sets on clusters of computers.
●
    Naming originally inspired by the map and reduce
    functions of functional programming (though their
    purpose in the framework differs from the originals)
●
    Map
    –   The master node takes the input, partitions it up into
        smaller sub-problems, and distributes those to worker nodes
●
    Reduce
    –   The master node then takes the answers to all the sub-
        problems and combines them in some way to get the output
                                                               10
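
The Map and Reduce steps above can be sketched as a word count. This is a minimal single-process illustration, not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and each input string simply stands in for the sub-problem a worker node would receive.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values that share a key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single answer."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the sub-problem handed to one worker node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["no sql", "not only sql", "not sql"])
```

In a real cluster the map calls would run in parallel on different machines; here they run in a loop so the data flow is easy to follow.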
What does NoSQL Stand For?
●
    NoSQL
●
    No SQL
●
    Not SQL
●
    Not Only SQL
●
    Not the RDBMS
●
    Wikipedia:
    –   Carlo Strozzi used the term "NoSQL" in 1998 to
        name his lightweight, open-source relational
        database that did not expose an SQL interface.

                                                         11
History
●
    Some techniques have existed for over 25
     years
●
    Teradata has been selling a product for more than 20
      years
●
    RDBMS dates back to 1970




                                                12
CAP Theorem
●
    A conjecture made by Eric Brewer at the
      Symposium on Principles of Distributed
      Computing (2000)
●
    States that it is only possible to achieve 2 of 3
    –   Consistency (all nodes see the same data at the
        same time)
    –   Availability (node failures do not prevent survivors
        from continuing to operate)
    –   Partition Tolerance (the system continues to
        operate despite arbitrary message loss)

                                                         13
CAP
●
    Consistent and Available
    –   ACID systems, MySQL cluster, Oracle Coherence,
        Drizzle
●
    Consistent and Partition Tolerant
    –   SCLA (strongly consistent, loosely available)
    –   HBase, Bigtable
●
    Available and Partition Tolerant
    –   BASE systems (CouchDB, SimpleDB, MongoDB)
●
    Cassandra (sits between SCLA/BASE
    systems)
                                                        14
Projects




           15
Hadoop
●
    Open-source software for reliable, scalable,
     distributed computing (Hadoop website)
    –   Hadoop Common
    –   HDFS
    –   MapReduce
●
    Created initially in early 2006 to support
     search engine project Nutch
●
    Inspired by the Google File System (Oct 2003) and
      MapReduce (Dec 2004) papers

                                                 16
Hadoop Related Projects
●
    HBase
    –   A scalable, distributed database that supports
        structured data storage for large tables
●
    Hive
    –   A data warehouse infrastructure that provides
        data summarization and ad hoc querying
●
    Pig
    –   A high-level data-flow language and execution
        framework for parallel computation
●
    Cassandra
    –   uses Hadoop for MapReduce
                                                         17
Who Uses Hadoop
●
    eBay (532 nodes, Search optimization)
●
    Facebook (1100x8 node cluster, 300x8 node cluster, more on
    this later)
●
    GumGum (Ken Weiner, 20+ node cluster on Amazon EC2)
●
    Hulu (log storage analysis)
●
    Last.fm (44x2 nodes log analysis, 20x2 nodes profile analysis)
●
    LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may
    know")
●
    Twitter (more on this later)
●
    Yahoo! (100,000 cpus running Hadoop, more on this later)



                                                                18
CouchDB
●
    Apache open source document oriented database
    written in Erlang (concurrent programming lang)
●
    Designed to scale horizontally
●
    Stores documents (one or more field value pairs
    expressed as JSON)
●
    ACID Semantics
●
    Map/Reduce Views and Indexes (written in server
    side JavaScript)
●
    Bi-directional replication (with conflict resolution)
●
    REST API

                                                          19
http://couchdb.apache.org/img/sketch.png

                                           20
CouchDB Sample Document

{
  "Subject": "I like Plankton",
  "Author": "Rusty",
  "PostedDate": "5/23/2006",
  "Tags": ["plankton", "baseball", "decisions"],
  "Body": "I decided today that I don't like baseball. I like plankton."
}




         http://couchdb.apache.org/docs/intro.html

                                                         21
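
The sample document and the Map/Reduce views mentioned earlier can be tied together with a small emulation. CouchDB views are written in server-side JavaScript; the Python below only mimics the idea, and the names map_by_tag and build_view are hypothetical, not part of any CouchDB API.

```python
import json

# The sample document from the slide, written as one JSON object.
doc = json.loads("""
{
  "Subject": "I like Plankton",
  "Author": "Rusty",
  "PostedDate": "5/23/2006",
  "Tags": ["plankton", "baseball", "decisions"],
  "Body": "I decided today that I don't like baseball. I like plankton."
}
""")


def map_by_tag(document):
    """Emit one (key, value) row per tag, like a view's map function."""
    for tag in document.get("Tags", []):
        yield tag, document["Subject"]


def build_view(documents, map_fn):
    """Run map_fn over every document and keep the rows sorted by key."""
    rows = [row for d in documents for row in map_fn(d)]
    return sorted(rows)


view = build_view([doc], map_by_tag)
```

Sorting the emitted rows by key is what makes the view usable as an index: looking up every post tagged "plankton" becomes a range scan over the sorted rows.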
Who uses CouchDB?
●
    Ubuntu One – cloud storage service
    –   http://ubuntuone.com/
●
    "I Play WoW" Facebook app
    –   http://blog.socklabs.com/2008/12/24/iplaywow_monthly_actives.html

●
    Wego - travel site
    –   http://www.wego.com/




                                                                            22
Cassandra
●
    Fault Tolerant (replication, failed nodes can
    be replaced with no downtime)
●
    Decentralized (every node in the cluster is
    identical, no bottlenecks)
●
    Supports either Synchronous or
    Asynchronous update replication
●
    Supports more than simple key/value pairs
●
    Elastic (read/write throughput increases
    linearly as machines are added)
●
    Durable (suitable for applications that can't
    afford to lose data)
                                                    23
Cassandra
●
    Initially developed by Facebook for Inbox
    Search (until replaced by HBase)
●
    Key-value store where each value can itself hold
    multiple values
●
    Some inspiration from Amazon's Dynamo
    (another key-value store)




                                                24
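
A key whose value holds multiple values can be pictured as a dictionary of dictionaries. The sketch below is an in-memory illustration only; the names users, insert and get_row are hypothetical and do not correspond to Cassandra's client API, and replication is left out entirely.

```python
from collections import defaultdict

# row key -> {column name -> value}; a stand-in for one group of rows
users = defaultdict(dict)


def insert(row_key, column, value):
    """Set one named value under a row key."""
    users[row_key][column] = value


def get_row(row_key, column=None):
    """Return the whole row as a dict, or a single column if one is named."""
    row = users.get(row_key, {})
    return dict(row) if column is None else row.get(column)


insert("user:42", "name", "Rusty")
insert("user:42", "email", "rusty@example.com")
```

Rows need not share the same columns, which is one way such a store "supports more than simple key/value pairs".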
Who uses Cassandra?
●
    Facebook (previously)
●
    Twitter
●
    Digg
●
    Cisco




                                    25
MongoDB
●
    Name is derived from "humongous"
●
    Document oriented database written in C++
●
    Manages collections of JSON-like documents
●
    Binaries available for Windows, Linux, OS X,
    Solaris
●
    Supports dates, regular expressions, code,
    binary data (all BSON types)
●
    Cursors for query results
●
    Any field can be queried at any time
                                                   26
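
Querying any field of JSON-like documents, with results returned through a cursor, can be emulated in a few lines. This is a pure-Python sketch, not the MongoDB driver: posts, matches and find are hypothetical names, and only equality, list membership and regular expression matching are covered.

```python
import re

posts = [
    {"author": "Rusty", "tags": ["plankton", "baseball"], "votes": 3},
    {"author": "Ann", "tags": ["nosql"], "votes": 7},
]


def matches(document, query):
    """True when every field named in the query matches the document."""
    for field, expected in query.items():
        actual = document.get(field)
        if isinstance(expected, re.Pattern):
            # Regular expressions only match string fields.
            if not isinstance(actual, str) or not expected.search(actual):
                return False
        elif isinstance(actual, list):
            # A list field matches if it contains the value or equals it.
            if expected not in actual and expected != actual:
                return False
        elif actual != expected:
            return False
    return True


def find(collection, query):
    """Yield matching documents lazily, loosely like a cursor."""
    return (d for d in collection if matches(d, query))


nosql_authors = [d["author"] for d in find(posts, {"tags": "nosql"})]
```

No schema or index has to be declared before a field is used in a query; an index would only make the scan faster.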
MongoDB
●
    Queries can include user-defined JavaScript
    functions
●
    Master/Slave (only master supports writes,
    slaves can be read from)
●
    Scales horizontally using sharding
●
    Support for Map/Reduce




                                                 27
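
Horizontal scaling through sharding comes down to routing each document to one of several servers by its key. The sketch below uses a simple hash of the key, which is an assumption made for brevity and not a description of how MongoDB assigns data to shards; SHARDS, shard_for and route_write are hypothetical names.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def shard_for(key, shards=SHARDS):
    """Pick a shard from a stable hash of the shard key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def route_write(store, key, document):
    """Store the document on whichever shard owns its key."""
    store.setdefault(shard_for(key), {})[key] = document


cluster = {}
for user in ("ann", "bob", "rusty", "zoe"):
    route_write(cluster, user, {"name": user})
```

Because the hash is stable, a later read for the same key goes to the same shard. The drawback of plain modulo hashing is that adding a shard moves most keys, which is why real systems use more careful placement schemes.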
Who uses MongoDB?
●
    New York Times
●
    Shutterfly
●
    Foursquare
●
    SourceForge
●
    Intuit




                                 28
Google Bigtable
●
    Built on GFS (Google File System)
●
    Can be used with Google App Engine
●
    Maps two arbitrary strings (row key and column key) and a timestamp to a value
●
    Designed to scale into the petabyte range
●
    Designed to scale across hundreds or
    thousands of machines
●
    Portions of a table (tablets) can be
    compressed
●
    HBase was modeled after Bigtable
                                                29
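
The mapping from two strings and a timestamp to a value can be sketched as a dictionary that keeps several timestamped versions per cell. This is an in-memory illustration under stated assumptions: table, put_cell and get_cell are hypothetical names, and the row and column keys are made-up examples.

```python
# (row key, column key) -> list of (timestamp, value), kept sorted by timestamp
table = {}


def put_cell(row, column, timestamp, value):
    """Record a new version of one cell."""
    versions = table.setdefault((row, column), [])
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0])


def get_cell(row, column, at=None):
    """Return the newest value at or before `at` (newest overall if at is None)."""
    versions = table.get((row, column), [])
    candidates = [v for t, v in versions if at is None or t <= at]
    return candidates[-1] if candidates else None


put_cell("com.example/index.html", "contents:", 1, "<html>v1</html>")
put_cell("com.example/index.html", "contents:", 5, "<html>v2</html>")
```

Keeping the timestamp in the key lets a reader ask for the current value or for the value as it stood at an earlier time.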
Who uses Bigtable?
●
    Google Reader
●
    Google Maps
●
    Google Book Search
●
    Google Earth
●
    Blogger.com
●
    Google Code
●
    Orkut
●
    YouTube
●
    Gmail
                                    30
Amazon SimpleDB
●
    Written in Erlang
●
    Used with Amazon EC2 and Amazon S3
●
    Easy access to lookup and query functions
●
    Without support for the less used complex database
    functions
●
    Do not need to pre-define data formats that will be stored
●
    Scalable (with size limitations)
     –   10 GB per domain, up to 250 domains
●
    Fast/Reliable
●
    Supports eventually consistent read and consistent read
●
    Potentially Inexpensive
                                                           31
SimpleDB Data Model




http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html
                                                                                 32
SimpleDB Data Model
●
    Customer Account (amazon web services account)
●
    Domains (similar to tables, or spreadsheet tabs)
●
    Items (similar to rows)
●
    Attributes (similar to columns)
●
    Values (similar to cells)
     –   Unlike a spreadsheet, however, multiple values can be
         associated with a cell
●
    One domain can contain different types of data
    (some attributes not filled in)


                                                                 33
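
The domain, item, attribute and value hierarchy can be modelled in memory to show how one attribute holds several values and how items in the same domain can carry different attributes. This is not the Amazon SimpleDB client: the Domain class, its method names and the sample items are hypothetical.

```python
from collections import defaultdict


class Domain:
    """In-memory stand-in for one domain: item -> attribute -> set of values."""

    def __init__(self):
        self.items = defaultdict(lambda: defaultdict(set))

    def put_attributes(self, item_name, attributes):
        # Adding a value to an existing attribute keeps the earlier values.
        for name, value in attributes:
            self.items[item_name][name].add(value)

    def get_attributes(self, item_name):
        item = self.items.get(item_name, {})
        return {name: sorted(values) for name, values in item.items()}


store = Domain()
store.put_attributes(
    "Item_01", [("Category", "Clothes"), ("Color", "Blue"), ("Color", "Red")]
)
store.put_attributes("Item_02", [("Category", "Car Parts"), ("Make", "Audi")])
```

Item_01 ends up with two values under Color, while Item_02 has no Color attribute at all, which is the "some attributes not filled in" case.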
SimpleDB API Summary
●
    CreateDomain
●
    DeleteDomain
●
    ListDomains
●
    PutAttributes
●
    BatchPutAttributes
●
    DeleteAttributes
●
    BatchDeleteAttributes
●
    GetAttributes
●
    Select
●
    DomainMetadata
                                    34
Who uses SimpleDB?
●
    Netflix
●
    Other Amazon EC2 customers...




                                    35
memcached
●
    General purpose distributed memory caching system
●
    Often used to cache data in RAM that might otherwise be
    obtained from an external data source
●
    LRU eviction (when cache is full)
●
    Can be distributed across multiple machines




                                                    36
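
Caching in RAM with least-recently-used eviction can be sketched with an ordered dictionary. This is a single-process illustration, not the memcached client or protocol: LRUCache, cached_lookup and load_from_database are hypothetical names, and the "database" is a stand-in function.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def set(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def cached_lookup(cache, key, load):
    """Try RAM first; fall back to the slow source on a miss."""
    value = cache.get(key)
    if value is None:
        value = load(key)
        cache.set(key, value)
    return value


cache = LRUCache(capacity=2)
slow_reads = []


def load_from_database(key):
    # Hypothetical slow source; records each call so misses are visible.
    slow_reads.append(key)
    return key.upper()


for key in ("a", "b", "a", "c"):
    cached_lookup(cache, key, load_from_database)
```

The second request for "a" is served from the cache, and storing "c" in a full cache evicts "b", the entry that was used least recently.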
Who uses memcached?
●
    YouTube
●
    Zynga
●
    Facebook
●
    Twitter




                                    37
Terracotta
●
    JVM in-memory distributed cache / store
●
    The object store can be persistent
●
    Distribution between nodes is handled through
    Terracotta server
●
    Supports multiple Terracotta servers
●
    Nodes only receive data they need/reference




                                                    38
Who uses Terracotta?
●
    Sakai (thanks to John Wiley & Sons)
●
    PartyGaming (PartyPoker.com)
●
    Adobe
●
    Pearson




                                          39
Example Case Studies




                       40
Yahoo!
●
    Hadoop
    –   http://developer.yahoo.com/blogs/hadoop
    –   More than 100,000 CPUs in >36,000 computers
        running Hadoop
    –   Our biggest cluster: 4000 nodes (2*4cpu boxes w
        4*1TB disk & 16GB RAM)
    –   Used to support research for Ad Systems and Web
        Search
    –   Also used to do scaling tests to support
        development of Hadoop on larger clusters
    –   >60% of Hadoop Jobs within Yahoo are Pig jobs
                                                        41
Twitter
●
    How Twitter Uses NoSQL
    –   http://goo.gl/Bwxoe
●
    Scribe
    –   Syslog stopped scaling
●
    Hadoop
    –   Needs to store more data per day than it can reliably write to a
        single hard drive
●
    Pig
    –   Used for interacting with Hadoop
●
    HBase
    –   People Search
●
    FlockDB
    –   Social Graph Analysis
                                                                           42
Netflix
●
    NoSQL at Netflix
    –   http://goo.gl/SDcsZ
●
    SimpleDB
    –   Highly durable, with writes automatically replicated across
        availability zones within a region
    –   Love it when others do heavy lifting for us
●
    Hadoop/HBase
    –   Convenient, high-performance column-oriented distributed
        database solution
    –   HBase makes it really easy to grow your cluster and re-distribute
        load across nodes at runtime
●
    Cassandra
    –   Adding more servers, without the need to re-shard
                                                                            43
Facebook
●
    http://goo.gl/J9EVW
●
    350 million users sending over 15 billion person-to-person messages
    per month
●
    Chat service supports over 300 million users who send over 120 billion
    messages per month
●
    Two patterns emerged
     –   A short set of temporal data that tends to be volatile
     –   An ever-growing set of data that rarely gets accessed
●
    Evaluated clusters of MySQL, Apache Cassandra, Apache HBase, and a
    couple of other systems
     –   MySQL proved to not handle the long tail of data well (as
         indexes/data grow large, performance suffers)
     –   Found Cassandra's eventual consistency model to be a difficult pattern to
         reconcile for our new Messages infrastructure.
                                                                         44
“There is a learning curve and an
operational overhead. Still, the scalability,
availability and performance advantages of
the NoSQL persistence model are evident
and are paying for themselves already, and
will be central to our long-term cloud
strategy.”
           Yury Izrailevsky, Netflix



                                           45
Questions & Answers




         Cris J. Holdorph
         Software Architect
         Unicon, Inc.

         Twitter: @holdorph

         holdorph@unicon.net
         www.unicon.net
                               46

More Related Content

What's hot

Using mruby in the nosql database Avocadodb
Using mruby in the nosql database AvocadodbUsing mruby in the nosql database Avocadodb
Using mruby in the nosql database Avocadodb
avocadodb
 
Drupal Migration
Drupal MigrationDrupal Migration
Drupal Migration
永对 陈
 
MySQL - NDB Cluster
MySQL - NDB ClusterMySQL - NDB Cluster
MySQL - NDB Cluster
Rajith Bhanuka Mahanama
 
GeoNetwork workshop introduction mapwindow conference 2012 Velp
GeoNetwork workshop introduction mapwindow conference 2012 VelpGeoNetwork workshop introduction mapwindow conference 2012 Velp
GeoNetwork workshop introduction mapwindow conference 2012 Velp
pvangenuchten
 
LDAP at Lightning Speed
 LDAP at Lightning Speed LDAP at Lightning Speed
LDAP at Lightning Speed
C4Media
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015
Zohar Elkayam
 

What's hot (6)

Using mruby in the nosql database Avocadodb
Using mruby in the nosql database AvocadodbUsing mruby in the nosql database Avocadodb
Using mruby in the nosql database Avocadodb
 
Drupal Migration
Drupal MigrationDrupal Migration
Drupal Migration
 
MySQL - NDB Cluster
MySQL - NDB ClusterMySQL - NDB Cluster
MySQL - NDB Cluster
 
GeoNetwork workshop introduction mapwindow conference 2012 Velp
GeoNetwork workshop introduction mapwindow conference 2012 VelpGeoNetwork workshop introduction mapwindow conference 2012 Velp
GeoNetwork workshop introduction mapwindow conference 2012 Velp
 
LDAP at Lightning Speed
 LDAP at Lightning Speed LDAP at Lightning Speed
LDAP at Lightning Speed
 
Big data for cio 2015
Big data for cio 2015Big data for cio 2015
Big data for cio 2015
 

Similar to No SQL Technologies

The NoSQL Ecosystem
The NoSQL Ecosystem The NoSQL Ecosystem
The NoSQL Ecosystem
yarapavan
 
HPTS 2011: The NoSQL Ecosystem
HPTS 2011: The NoSQL EcosystemHPTS 2011: The NoSQL Ecosystem
HPTS 2011: The NoSQL Ecosystem
Adam Marcus
 
Drop acid
Drop acidDrop acid
Drop acid
Mike Feltman
 
Introduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStackIntroduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStack
OpenStack_Online
 
Ceph Day New York: Ceph: one decade in
Ceph Day New York: Ceph: one decade inCeph Day New York: Ceph: one decade in
Ceph Day New York: Ceph: one decade in
Ceph Community
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
Demi Ben-Ari
 
DEVIEW 2013
DEVIEW 2013DEVIEW 2013
DEVIEW 2013
Patrick McGarry
 
Ceph: A decade in the making and still going strong
Ceph: A decade in the making and still going strongCeph: A decade in the making and still going strong
Ceph: A decade in the making and still going strong
Patrick McGarry
 
Open Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETOpen Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNET
Nikos Kormpakis
 
Ceph Day Seoul - Ceph: a decade in the making and still going strong
Ceph Day Seoul - Ceph: a decade in the making and still going strong Ceph Day Seoul - Ceph: a decade in the making and still going strong
Ceph Day Seoul - Ceph: a decade in the making and still going strong
Ceph Community
 
NoSQL on the move
NoSQL on the moveNoSQL on the move
NoSQL on the move
Codemotion
 
OSOM Operations in the Cloud
OSOM Operations in the CloudOSOM Operations in the Cloud
OSOM Operations in the Cloud
mstuparu
 
OSOM - Operations in the Cloud
OSOM - Operations in the CloudOSOM - Operations in the Cloud
OSOM - Operations in the Cloud
Marcela Oniga
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
Don Demcsak
 
MongoDB 2.4 and spring data
MongoDB 2.4 and spring dataMongoDB 2.4 and spring data
MongoDB 2.4 and spring data
Jimmy Ray
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js Tutorial
PHP Support
 
PostgreSQL and MySQL
PostgreSQL and MySQLPostgreSQL and MySQL
PostgreSQL and MySQL
PostgreSQL Experts, Inc.
 
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Community
 
Ceph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's CephCeph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's Ceph
Ceph Community
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
Edward Capriolo
 

Similar to No SQL Technologies (20)

The NoSQL Ecosystem
The NoSQL Ecosystem The NoSQL Ecosystem
The NoSQL Ecosystem
 
HPTS 2011: The NoSQL Ecosystem
HPTS 2011: The NoSQL EcosystemHPTS 2011: The NoSQL Ecosystem
HPTS 2011: The NoSQL Ecosystem
 
Drop acid
Drop acidDrop acid
Drop acid
 
Introduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStackIntroduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStack
 
Ceph Day New York: Ceph: one decade in
Ceph Day New York: Ceph: one decade inCeph Day New York: Ceph: one decade in
Ceph Day New York: Ceph: one decade in
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
DEVIEW 2013
DEVIEW 2013DEVIEW 2013
DEVIEW 2013
 
Ceph: A decade in the making and still going strong
Ceph: A decade in the making and still going strongCeph: A decade in the making and still going strong
Ceph: A decade in the making and still going strong
 
Open Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETOpen Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNET
 
Ceph Day Seoul - Ceph: a decade in the making and still going strong
Ceph Day Seoul - Ceph: a decade in the making and still going strong Ceph Day Seoul - Ceph: a decade in the making and still going strong
Ceph Day Seoul - Ceph: a decade in the making and still going strong
 
NoSQL on the move
NoSQL on the moveNoSQL on the move
NoSQL on the move
 
OSOM Operations in the Cloud
OSOM Operations in the CloudOSOM Operations in the Cloud
OSOM Operations in the Cloud
 
OSOM - Operations in the Cloud
OSOM - Operations in the CloudOSOM - Operations in the Cloud
OSOM - Operations in the Cloud
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
MongoDB 2.4 and spring data
MongoDB 2.4 and spring dataMongoDB 2.4 and spring data
MongoDB 2.4 and spring data
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js Tutorial
 
PostgreSQL and MySQL
PostgreSQL and MySQLPostgreSQL and MySQL
PostgreSQL and MySQL
 
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
 
Ceph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's CephCeph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's Ceph
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 

More from Cris Holdorph

Programming for Performance
Programming for PerformanceProgramming for Performance
Programming for Performance
Cris Holdorph
 
Clustering Made Easier: Using Terracotta with Hibernate and/or EHCache
Clustering Made Easier: Using Terracotta with Hibernate and/or EHCacheClustering Made Easier: Using Terracotta with Hibernate and/or EHCache
Clustering Made Easier: Using Terracotta with Hibernate and/or EHCache
Cris Holdorph
 
Developing JSR 286 Portlets
Developing JSR 286 PortletsDeveloping JSR 286 Portlets
Developing JSR 286 Portlets
Cris Holdorph
 
Adding Performance Testing to a Software Development Project
Adding Performance Testing to a Software Development ProjectAdding Performance Testing to a Software Development Project
Adding Performance Testing to a Software Development Project
Cris Holdorph
 
Sakai and IMS LIS Integration
Sakai and IMS LIS IntegrationSakai and IMS LIS Integration
Sakai and IMS LIS Integration
Cris Holdorph
 
Clustering Sakai with Terracotta
Clustering Sakai with TerracottaClustering Sakai with Terracotta
Clustering Sakai with Terracotta
Cris Holdorph
 
Introduction to Terracotta
Introduction to TerracottaIntroduction to Terracotta
Introduction to Terracotta
Cris Holdorph
 

More from Cris Holdorph (7)

Programming for Performance
Programming for PerformanceProgramming for Performance
Programming for Performance
 
Clustering Made Easier: Using Terracotta with Hibernate and/or EHCache
Clustering Made Easier: Using Terracotta with Hibernate and/or EHCacheClustering Made Easier: Using Terracotta with Hibernate and/or EHCache
Clustering Made Easier: Using Terracotta with Hibernate and/or EHCache
 
Developing JSR 286 Portlets
Developing JSR 286 PortletsDeveloping JSR 286 Portlets
Developing JSR 286 Portlets
 
Adding Performance Testing to a Software Development Project
Adding Performance Testing to a Software Development ProjectAdding Performance Testing to a Software Development Project
Adding Performance Testing to a Software Development Project
 
Sakai and IMS LIS Integration
Sakai and IMS LIS IntegrationSakai and IMS LIS Integration
Sakai and IMS LIS Integration
 
Clustering Sakai with Terracotta
Clustering Sakai with TerracottaClustering Sakai with Terracotta
Clustering Sakai with Terracotta
 
Introduction to Terracotta
Introduction to TerracottaIntroduction to Terracotta
Introduction to Terracotta
 

Recently uploaded

WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 

Recently uploaded (20)

WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)

No SQL Technologies

  • 1. What Should I Know about NoSQL? Cris J. Holdorph Software Architect Unicon, Inc. Jasig Conference Westminster, CO May 24, 2011 © Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/
  • 4. Agenda 1. Definitions 2. History 3. Projects 4. Example Case Studies 4
  • 6. Definitions ● RDBMS ● SQL ● CRUD ● ACID – Atomicity, Consistency, Isolation, Durability ● BASE – Basically Available, Soft state, Eventual consistency 6
  • 8. Definitions ● Big Data ● Sharding ● Cloud Computing ● Distributed File System ● Key Value Store 8
  • 10. Map Reduce ● Patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. ● Naming originally inspired by map and reduce functions of functional programming (but their purpose is not the same as it was there) ● Map – The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes ● Reduce – The master node then takes the answers to all the sub-problems and combines them in some way to get the output 10
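The map and reduce steps on the slide above can be sketched in a few lines of Python. This is a single-process illustration only, not Google's or Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example, and the "worker nodes" simply run one after another.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)

def word_count(chunks):
    # On a real cluster each chunk would be sent to a different worker node;
    # here the chunks are processed sequentially in one process.
    emitted = []
    for chunk in chunks:
        emitted.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

counts = word_count(["I like plankton", "I like baseball"])
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases could in principle be spread across many machines.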
  • 11. What does NoSQL Stand For? ● NoSQL ● No SQL ● Not SQL ● Not Only SQL ● Not the RDBMS ● Wikipedia: – Carlo Strozzi used the term "NoSQL" in 1998 to name his lightweight, open-source relational database that did not expose an SQL interface. 11
  • 12. History ● Some techniques have existed for over 25 years ● Teradata has been selling product for more than 20 years ● RDBMS dates back to 1970 12
  • 13. CAP Theorem ● A conjecture made by Eric Brewer at the Symposium on Principles of Distributed Computing (2000) ● States only possible to achieve 2 of 3 – Consistency (all nodes see the same data at the same time) – Availability (node failures do not prevent survivors from continuing to operate) – Partition Tolerance (the system continues to operate despite arbitrary message loss) 13
  • 14. CAP ● Consistent and Available – ACID systems, MySQL Cluster, Oracle Coherence, Drizzle ● Consistent and Partition Tolerant – SCLA (strongly consistent, loosely available) – HBase, Bigtable ● Available and Partition Tolerant – BASE systems (CouchDB, SimpleDB, MongoDB) ● Cassandra (sits between SCLA/BASE systems) 14
  • 15. Projects 15
  • 16. Hadoop ● Open-source software for reliable, scalable, distributed computing (Hadoop website) – Hadoop Common – HDFS – MapReduce ● Initially created in early 2006 to support search engine project Nutch ● Inspired by the Google File System and MapReduce papers (Oct 2003) 16
  • 17. Hadoop Related Projects ● HBase – A scalable, distributed database that supports structured data storage for large tables ● Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying ● Pig – A high-level data-flow language and execution framework for parallel computation ● Cassandra – uses Hadoop for MapReduce 17
  • 18. Who Uses Hadoop ● eBay (532 nodes, Search optimization) ● Facebook (1100x8 node cluster, 300x8 node cluster, more on this later) ● GumGum (Ken Weiner, 20+ node cluster on Amazon EC2) ● Hulu (log storage analysis) ● Last.fm (44x2 nodes log analysis, 20x2 nodes profile analysis) ● LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may know") ● Twitter (more on this later) ● Yahoo! (100,000 CPUs running Hadoop, more on this later) 18
  • 19. CouchDB ● Apache open source document-oriented database written in Erlang (a concurrent programming language) ● Designed to scale horizontally ● Stores documents (one or more field/value pairs expressed as JSON) ● ACID Semantics ● Map/Reduce Views and Indexes (written in server-side JavaScript) ● Bi-directional replication (with conflict resolution) ● REST API 19
  • 21. CouchDB Sample Document "Subject": "I like Plankton" "Author": "Rusty" "PostedDate": "5/23/2006" "Tags": ["plankton", "baseball", "decisions"] "Body": "I decided today that I don't like baseball. I like plankton." http://couchdb.apache.org/docs/intro.html 21
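The sample document above lists its fields without the surrounding braces and commas. Written as strict JSON (the punctuation is added here; the field names and values are taken from the slide), it can be loaded with Python's standard `json` module. This only illustrates the document shape and does not talk to a CouchDB server.

```python
import json

# The slide's fields, with braces and commas added to make valid JSON.
raw = '''
{
  "Subject": "I like Plankton",
  "Author": "Rusty",
  "PostedDate": "5/23/2006",
  "Tags": ["plankton", "baseball", "decisions"],
  "Body": "I decided today that I don't like baseball. I like plankton."
}
'''

doc = json.loads(raw)
tags = doc["Tags"]  # a JSON array becomes a Python list
print(doc["Author"], tags)
```

Note that a field such as `Tags` holds a list directly inside the document, with no separate join table.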
  • 22. Who uses CouchDB? ● Ubuntu One – cloud storage service – http://ubuntuone.com/ ● "I Play WoW" Facebook app – http://blog.socklabs.com/2008/12/24/iplaywow_monthly_actives.html ● Wego - travel site – http://www.wego.com/ 22
  • 23. Cassandra ● Fault Tolerant (replication, failed nodes can be replaced with no downtime) ● Decentralized (every node in cluster is identical, no bottlenecks) ● Supports either Synchronous or Asynchronous update replication ● Supports more than simple key/value pairs ● Elastic (read/write throughput increases linearly as machines are added) ● Durable (suitable for applications that can't afford to lose data) 23
  • 24. Cassandra ● Initially developed by Facebook for Inbox Search (until replaced by HBase) ● Key-value store where values can be multiple values ● Some inspiration from Amazon's Dynamo (another key-value store) 24
  • 25. Who uses Cassandra? ● Facebook (previously) ● Twitter ● Digg ● Cisco 25
  • 26. MongoDB ● Name is derived from "humongous" ● Document-oriented database written in C++ ● Manages collections of JSON-like documents ● Binaries available for Windows, Linux, OS X, Solaris ● Supports dates, regular expressions, code, binary data (all BSON types) ● Cursors for query results ● Any field can be queried at any time 26
  • 27. MongoDB ● Queries can include user-defined JavaScript functions ● Master/Slave (only master supports writes, slaves can be read from) ● Scales horizontally using sharding ● Support for Map/Reduce 27
  • 28. Who uses MongoDB? ● New York Times ● Shutterfly ● Foursquare ● SourceForge ● Intuit 28
  • 29. Google Big Table ● Built on GFS (Google File System) ● Can be used with Google App Engine ● Maps two arbitrary strings (row key and column key) and a timestamp to a value ● Designed to scale into the petabyte range ● Designed to scale across hundreds or thousands of machines ● Portions of a table (tablets) can be compressed ● HBase was modeled after BigTable 29
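The "two strings and a timestamp" mapping on the slide above can be pictured as a dictionary keyed by (row, column, timestamp). The sketch below is an in-memory toy under that assumption: the helpers `put` and `get_latest` and the sample keys are hypothetical and are not part of any Google API.

```python
# (row key, column key, timestamp) -> value
table = {}

def put(row, column, timestamp, value):
    """Store one versioned cell; older versions are kept alongside it."""
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the value with the highest timestamp for a cell, or None."""
    versions = [(ts, value) for (r, c, ts), value in table.items()
                if r == row and c == column]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]

put("row1", "contents", 1, "old page")
put("row1", "contents", 2, "new page")
latest = get_latest("row1", "contents")
print(latest)
```

Keeping the timestamp in the key is what lets several versions of the same cell coexist.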
  • 30. Who uses Big Table? ● Google Reader ● Google Maps ● Google Book Search ● Google Earth ● Blogger.com ● Google Code ● Orkut ● YouTube ● Gmail 30
  • 31. Amazon SimpleDB ● Written in Erlang ● Used with Amazon EC2 and Amazon S3 ● Easy access to lookup and query functions ● Without support for the less used complex database functions ● Do not need to pre-define data formats that will be stored ● Scalable (with size limitations) – 10 GB per domain, up to 250 domains ● Fast/Reliable ● Supports eventually consistent read and consistent read ● Potentially Inexpensive 31
  • 33. SimpleDB Data Model ● Customer Account (Amazon Web Services account) ● Domains (similar to tables, or spreadsheet tabs) ● Items (similar to rows) ● Attributes (similar to columns) ● Values (similar to cells) – Unlike a spreadsheet, however, multiple values can be associated with a cell ● One domain can contain different types of data (some attributes not filled in) 33
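The domain / item / attribute / value hierarchy on the slide above maps naturally onto nested dictionaries. The sketch below is a local illustration only: `put_attribute` is a hypothetical helper, not the SimpleDB `PutAttributes` call, and the domain and item names are made up.

```python
from collections import defaultdict

# domain name -> item name -> attribute name -> set of values
domains = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

def put_attribute(domain, item, attribute, value):
    """Add one value to an attribute; existing values are kept (multi-valued)."""
    domains[domain][item][attribute].add(value)

put_attribute("products", "item1", "color", "red")
put_attribute("products", "item1", "color", "blue")  # second value, same "cell"
put_attribute("products", "item2", "size", "large")  # item2 has no color at all

item1_colors = domains["products"]["item1"]["color"]
item2_has_color = "color" in domains["products"]["item2"]
print(sorted(item1_colors), item2_has_color)
```

The two items share a domain but carry different attributes, which is the "some attributes not filled in" case from the slide.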
  • 34. SimpleDB API Summary ● CreateDomain ● DeleteDomain ● ListDomains ● PutAttributes ● BatchPutAttributes ● DeleteAttributes ● BatchDeleteAttributes ● GetAttributes ● Select ● DomainMetadata 34
  • 35. Who uses SimpleDB? ● Netflix ● Other Amazon EC2 customers... 35
  • 36. memcached ● General purpose distributed memory caching system ● Often used to cache data in RAM that might otherwise be obtained from an external data source ● LRU (when cache is full) ● Can be distributed across multiple machines 36
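The LRU (least recently used) eviction mentioned on the slide above can be sketched with Python's `OrderedDict`. This is a minimal single-process illustration of the policy, not memcached's actual implementation; the `LRUCache` class and its method names are assumptions made for the example.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None  # cache miss: caller falls back to the data source
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.set("c", 3)  # cache is full, so "b" is evicted
print(list(cache.items))
```

On a miss the application would typically read from the external data source and then call `set` so the next lookup is served from RAM.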
  • 37. Who uses memcached? ● YouTube ● Zynga ● Facebook ● Twitter 37
  • 38. Terracotta ● JVM in-memory distributed cache / store ● The object store can be persistent ● Distribution between nodes is handled through Terracotta server ● Supports multiple Terracotta servers ● Nodes only receive data they need/reference 38
  • 39. Who uses Terracotta? ● Sakai (thanks to John Wiley & Sons) ● PartyGaming (PartyPoker.com) ● Adobe ● Pearson 39
  • 41. Yahoo! ● Hadoop – http://developer.yahoo.com/blogs/hadoop – More than 100,000 CPUs in >36,000 computers running Hadoop – Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) – Used to support research for Ad Systems and Web Search – Also used to do scaling tests to support development of Hadoop on larger clusters – >60% of Hadoop Jobs within Yahoo are Pig jobs 41
  • 42. Twitter ● How Twitter Uses NoSQL – http://goo.gl/Bwxoe ● Scribe – Syslog stopped scaling ● Hadoop – Needs to store more data per day than it can reliably write to a single hard drive ● Pig – Used for interacting with Hadoop ● HBase – People Search ● FlockDB – Social Graph Analysis 42
  • 43. Netflix ● NoSQL at Netflix – http://goo.gl/SDcsZ ● SimpleDB – Highly durable, with writes automatically replicated across availability zones within a region – Love it when others do heavy lifting for us ● Hadoop/HBase – Convenient, high-performance column-oriented distributed database solution – HBase makes it really easy to grow your cluster and re-distribute load across nodes at runtime ● Cassandra – Adding more servers, without the need to re-shard 43
  • 44. Facebook ● http://goo.gl/J9EVW ● 350 million users sending over 15 billion person-to-person messages per month ● Chat service supports over 300 million users who send over 120 billion messages per month ● Two patterns emerged – A short set of temporal data that tends to be volatile – An ever-growing set of data that rarely gets accessed ● Evaluated clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems – MySQL proved to not handle the long tail of data well (as indexes/data grow large, performance suffers) – Cassandra's eventual consistency model proved to be a difficult pattern to reconcile for our new Messages infrastructure. 44
  • 45. “There is a learning curve and an operational overhead. Still, the scalability, availability and performance advantages of the NoSQL persistence model are evident and are paying for themselves already, and will be central to our long-term cloud strategy.” Yury Izrailevsky, Netflix 45
  • 46. Questions & Answers Cris J. Holdorph Software Architect Unicon, Inc. Twitter: @holdorph holdorph@unicon.net www.unicon.net 46