Nick Dimiduk - @xefyr
Founder, Drawn to Scale
nick@drawntoscalehq.com

April 28, 2010
Agenda

 what NoSQL is not
 motivation
 Hadoop
 HBase
whoami
Computer Science & Engineering at Ohio State:
Artificial Intelligence, Programming Languages, Systems
Engineering
Ap...
Agenda

 what NoSQL is not
 motivation
 Hadoop
 HBase
What NoSQL is not.

movement - no ANSI NoSQL-2010
one-size-fits-all
It’s not Anti-RDBMS
It’s about Choice!




   http://www.flickr.com/photos/zakh/337938459/
What NoSQL is not.

movement - no ANSI NoSQL-2010
one-size-fits-all - it’s about choice
silver bullet - guarantees are hard
Agenda

 what NoSQL is not
 motivation
 Hadoop
 HBase
motivation
more, More, MORE Data!
ACID Burns
BASE is good enough
Life’s too short
“typical” application
“typical” application
Data Server                Village People




              App Server
growing pains
Data Server                       Villages of People




              App Servers
vertical partitioning
Data Server                   Villages of People




              App Servers




                 ...
vertical partitioning
Data Server                   Villages of People   Data Server                 Villages of People


...
vertical partitioning
Data Server                   Villages of People




              App Servers




                 ...
“typical” application
growing pains
Data Servers                       Villages of People




               App Servers
horizontal partitioning
                     Villages of People




   Data Layer   Application Layer
Agenda

 what NoSQL is not
 motivation
 Hadoop
 HBase
“open source, reliable, distributed
          computing”
MapReduce - API for parallel computing
HDFS - distributed, replicated file system
ZooKeeper - distributed synchronization
A...
Agenda

 what NoSQL is not
 motivation
 Hadoop
 HBase
structured, distributed database for your
         horizontally scalable FS
random access
real-time reads/writes
simple API
big table
references
           : http://www.nosql-database.org
Eventually Consistent: http://www.allthingsdistributed.com/2007/12/
...
Questions?



Nick Dimiduk - @xefyr
Founder, Drawn to Scale
nick@drawntoscalehq.com

April 28, 2010
Introduction to Hadoop, HBase, and NoSQL


  • I’m Not an RDBMS Guy!
  • squish the FUD
  • no central point of organization
    no committee or standardizing body
    no plan/strategy/illuminati to take down the RDBMS; lots of "in-fighting"
  • central tenet - there IS NO one-size-fits-all
    unlike RDBMS assumptions, each engineering effort must be evaluated for data needs

  • is it “anti-RDBMS”?
  • not so much

  • will not magically solve all your data or performance problems
    applications won’t magically stop crashing or corrupting data
    Big Data is still hard. These tools make it possible/affordable/approachable

  • data persistence comes down to guarantees
  • why are we here?
  • "web scale"
    more users, content, connections
    more trends, insight, knowledge

  • Atomicity: fault-tolerance is moving to the application layer - smaller atomic units
    Consistency: yes! but not necessarily immediate - "availability" (latency, reads) is more important.
    Isolation: smaller atomic units (multi-step transaction vs. compare-and-swap), greater availability, denormalization => reduced dependency on isolation
    Durability: some things are more important than getting every last detail, e.g. latency of response, view in aggregate
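
To make the "multi-step transaction vs. compare-and-swap" note concrete, here is a minimal sketch in Python. `TinyStore` and `increment` are hypothetical names invented for illustration, and the lock only stands in for whatever single-key atomicity a real store provides. The point is that the atomic unit shrinks to one key, and the retry logic moves up into the application.

```python
import threading


class TinyStore:
    """Hypothetical in-memory key-value store with an atomic compare-and-swap."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # stands in for the store's single-key atomicity

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        """Write `new` only if the current value still equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def increment(store, key):
    """Application-level retry loop: read, compute, swap, retry on conflict."""
    while True:
        current = store.get(key)
        new = 1 if current is None else current + 1
        if store.compare_and_swap(key, current, new):
            return new


store = TinyStore()
threads = [
    threading.Thread(target=increment, args=(store, "page_views"))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_count = store.get("page_views")
```

No writer ever holds a lock across several steps; a writer that loses the race simply re-reads and tries again.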

  • Basically Available: is the data layer up or not? are we serving content to our users or not?
    Soft State: shifting burden of "correctness" up to application layer. availability is more important than precision. accuracy (correct) vs. precision (repeatable).
    Eventual Consistency: all operations are recorded and ordered. played back as resources permit.
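
The "recorded and ordered, played back as resources permit" idea can be sketched as an ordered log that replicas apply at their own pace. This is a toy model under stated assumptions: `OpLog`, `Replica`, and `catch_up` are hypothetical names, and a real system would also have to handle log distribution and failures.

```python
class OpLog:
    """Hypothetical ordered log of writes; the list index is the sequence number."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))


class Replica:
    """Applies log entries strictly in order, but only when asked to catch up."""

    def __init__(self, log):
        self.log = log
        self.applied = 0  # number of log entries already applied
        self.state = {}

    def catch_up(self, max_ops=None):
        end = len(self.log.entries)
        if max_ops is not None:
            end = min(end, self.applied + max_ops)
        for key, value in self.log.entries[self.applied:end]:
            self.state[key] = value
        self.applied = end

    def read(self, key):
        return self.state.get(key)


log = OpLog()
fast, slow = Replica(log), Replica(log)
log.append("status", "draft")
log.append("status", "published")

fast.catch_up()            # has resources, applies everything
slow.catch_up(max_ops=1)   # busy, applies only the first write
stale = slow.read("status")
slow.catch_up()            # later, the slow replica converges
```

While the slow replica lags, it still answers reads (basically available) with a stale value (soft state); once it replays the rest of the log, both replicas agree.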

  • agile dev moves too fast for schema and constraints - this isn’t waterfall
    data models change quickly
    up-front schema modeling is akin to waterfall development - not always practical/feasible/possible
    data is messy - record what you have and leave constraints up to the application

  • at scale, data services look like a DHT anyway!
    isolated independent services
    introduced caching layers
    partitioned data by logical and range boundaries.
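
As a rough illustration of partitioning by "logical and range boundaries", the sketch below routes a key to a shard either by hash (DHT style) or by sorted split points. The function names and the `SPLITS` values are assumptions made up for this example, not something from the talk.

```python
import bisect
import hashlib


def hash_partition(key, num_shards):
    """DHT-style routing: a stable hash of the key picks the shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_partition(key, split_points):
    """Range routing: `split_points` is sorted; shard i holds keys below split i."""
    return bisect.bisect_right(split_points, key)


# Hypothetical split points:
# shard 0 holds keys < "h", shard 1 holds "h" <= key < "p", shard 2 holds key >= "p".
SPLITS = ["h", "p"]

shard_for_alice = range_partition("alice", SPLITS)
shard_for_henry = range_partition("henry", SPLITS)
shard_for_zed = range_partition("zed", SPLITS)
hashed_shard = hash_partition("alice", 4)
```

Hash routing spreads load evenly but scatters neighboring keys; range routing keeps neighboring keys together, which is what makes ordered scans cheap.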

  • webapp

  • app servers/session self-contained - load-balanced
    data’s in one spot - what do you do?

  • 37-signals approach - DHH “scaling is a good thing because scaling => users => $$$”
  • more users, more instances. easy!
  • doesn’t work for social applications:
    - users cannot interact
    - old MMOs vs. new social games

  • redesign data server as “data services”
    separate independent logical components
  • knowing each service by name becomes “vexing”

  • configuration/logistical nightmare!

  • abstractions!
    wouldn’t it be nice if...

  • Distributed Computing Made Easy Less Hard

  • programming model/API for parallel computing
    Google's MapReduce paper
  • replicated, high throughput, fairly UNIX-y (not POSIX).
    Google FS Paper
  • Distributed Group Services - coordination, synchronization, configuration, naming.
    Google Chubby Paper
  • efficient, cross-language messaging
    Facebook/Apache Thrift
    Google Protobufs
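
For the MapReduce bullet above, the programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not the Hadoop API; `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["more More MORE data", "data is messy"])
```

Because each map call sees one record and each reduce call sees one key, a framework is free to run them on different machines; that independence is the part the sketch is meant to show.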

  • Google BigTable
  • Addresses limitations of Raw M/R, HDFS access
  • request by key vs. HDFS sequential reads
  • low-latency, ms response times vs. M/R high latency
  • row/column concepts
    DHT semantics
    Java, REST, Thrift
  • Billions of rows, millions of columns
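
The row/column and request-by-key notes can be pictured with a toy sorted table. This is a mental model only, not the HBase client API: `ToyTable` and its methods are hypothetical, and the `family:qualifier` column names are illustrative.

```python
import bisect


class ToyTable:
    """Toy sorted row store: row key -> {column -> value}. Not the HBase API."""

    def __init__(self):
        self._row_keys = []  # kept sorted so range scans are cheap
        self._rows = {}      # row key -> {column name -> value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Random access by key: one row, or one cell of that row."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_row, stop_row):
        """Ordered scan over row keys in [start_row, stop_row)."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        for key in self._row_keys[lo:hi]:
            yield key, dict(self._rows[key])


table = ToyTable()
table.put("user:alice", "info:email", "alice@example.com")
table.put("user:bob", "info:email", "bob@example.com")
table.put("user:bob", "stats:logins", 3)
table.put("video:42", "info:title", "intro")

users = list(table.scan("user:", "user;"))  # ";" sorts right after ":"
```

Rows are sparse (each row stores only the columns it has), which is how a table can be wide without every row paying for every column.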


  • Introduction to Hadoop, HBase, and NoSQL

    1. Nick Dimiduk - @xefyr Founder, Drawn to Scale nick@drawntoscalehq.com April 28, 2010
    2. Agenda what NoSQL is not motivation Hadoop HBase
    3. whoami Computer Science & Engineering at Ohio State: Artificial Intelligence, Programming Languages, Systems Engineering Applied Technical Systems: Hierarchical, non-relational data storage and analysis systems (no-sql before there was NoSQL). Information Retrieval, Wire Serialization/RPC (before there was Thrift/Avro), Data Visualization (GB's) Visible Technologies: Social Media Storage, Processing, Analytics. Monitoring, Engagement, Warehousing, and BI. (TB's) Drawn to Scale: Big Data Storage, Processing, Retrieval, Analytics (TB's, PB's)
    4. Agenda what NoSQL is not motivation Hadoop HBase
    5. What NoSQL is not. movement
    6. What NoSQL is not. movement - no ANSI NoSQL-2010 one-size-fits-all
    7. It’s not Anti-RDBMS
    8. It’s about Choice! http://www.flickr.com/photos/zakh/337938459/
    9. What NoSQL is not. movement - no ANSI NoSQL-2010 one-size-fits-all - it’s about choice silver bullet
    10. What NoSQL is not. movement - no ANSI NoSQL-2010 one-size-fits-all - it’s about choice silver bullet - guarantees are hard
    11. Agenda what NoSQL is not motivation Hadoop HBase
    12. motivation more, More, MORE Data!
    13. motivation more, More, MORE Data! ACID Burns
    14. motivation more, More, MORE Data! ACID Burns BASE is good enough
    15. motivation more, More, MORE Data! ACID Burns BASE is good enough Life’s too short
    16. motivation more, More, MORE Data! ACID Burns BASE is good enough Life’s too short
    17. “typical” application
    18. “typical” application Data Server Village People App Server
    19. growing pains Data Server Villages of People App Servers
    20. vertical partitioning Data Server Villages of People App Servers Data Server Villages of People App Servers
    21. vertical partitioning Data Server Villages of People Data Server Villages of People App Servers App Servers Data Server Villages of People Data Server Villages of People App Servers App Servers
    22. vertical partitioning Data Server Villages of People App Servers Data Server Villages of People App Servers
    23. “typical” application
    24. growing pains Data Servers Villages of People App Servers
    25. horizontal partitioning Villages of People
    26. horizontal partitioning Villages of People
    27. horizontal partitioning Villages of People Data Layer Application Layer
    28. Agenda what NoSQL is not motivation Hadoop HBase
    29. “open source, reliable, distributed computing”
    30. “open source, reliable, distributed computing”
    31. MapReduce - API for parallel computing
    32. MapReduce - API for parallel computing HDFS - distributed, replicated file system
    33. MapReduce - API for parallel computing HDFS - distributed, replicated file system ZooKeeper - distributed synchronization
    34. MapReduce - API for parallel computing HDFS - distributed, replicated file system ZooKeeper - distributed synchronization Avro - Data Serialization / RPC
    35. Agenda what NoSQL is not motivation Hadoop HBase
    36. structured, distributed database for your horizontally scalable FS
    37. structured, distributed database for your horizontally scalable FS
    38. random access
    39. random access real-time reads/writes
    40. random access real-time reads/writes simple API
    41. random access real-time reads/writes simple API big table
    42. references : http://www.nosql-database.org Eventually Consistent: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html Soft State: http://mercury.lcs.mit.edu/~jnc/tech/hard_soft.html Accuracy and Precision: http://en.wikipedia.org/wiki/Accuracy_and_precision Compare and Swap: http://en.wikipedia.org/wiki/Compare-and-swap Apache Hadoop: http://hadoop.apache.org Google MapReduce: http://labs.google.com/papers/mapreduce.html Google FS: http://labs.google.com/papers/gfs.html Apache Thrift: http://incubator.apache.org/thrift/ Protobuf: http://code.google.com/p/protobuf/ Google BigTable: http://labs.google.com/papers/bigtable.html Google Chubby: http://labs.google.com/papers/chubby.html
    43. Questions? Nick Dimiduk - @xefyr Founder, Drawn to Scale nick@drawntoscalehq.com April 28, 2010