Introduction to Hadoop, HBase, and NoSQL

  • I’m Not an RDBMS Guy!
  • squish the FUD
  • no central point of organization
    no committee or standardizing body
    no plan/strategy/illuminati to take down the RDBMS; lots of "in-fighting"
  • central tenet - there IS NO one-size-fits-all
    unlike RDBMS assumptions, each engineering effort must be evaluated for data needs

  • is it “anti-RDBMS”?
  • not so much

  • will not magically solve all your data or performance problems
    applications won’t magically stop crashing, corrupting data, etc.
    Big Data is still hard. These tools make it possible/affordable/approachable

  • data persistence comes down to guarantees
  • why are we here?
  • "web scale"
    more users, content, connections
    more trends, insight, knowledge

  • Atomicity: fault-tolerance is moving to the application layer - smaller atomic units
    Consistency: yes! but not necessarily immediate - "availability" (latency, reads) is more important.
    Isolation: smaller atomic units (multi-step transaction vs. compare-and-swap), greater availability, denormalization => reduced dependency on isolation
    Durability: some things are more important than getting every last detail, e.g. latency of response, view in aggregate
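
The Isolation note above contrasts a multi-step transaction with compare-and-swap. A minimal sketch of that idea in Python, assuming a hypothetical in-memory `VersionedStore` (an illustration only, not the API of any system named in this talk): a write succeeds only if the stored value still matches what the caller last read, and the application retries otherwise.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory key/value store with a compare-and-swap primitive."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # stands in for the store's internal atomicity

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def compare_and_swap(self, key, expected, new):
        """Set key to `new` only if its current value equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False  # another writer got there first; caller must retry
            self._data[key] = new
            return True


def increment(store, key):
    """Application-level retry loop: the only atomic unit is a single CAS."""
    while True:
        current = store.get(key)
        new = (current or 0) + 1
        if store.compare_and_swap(key, current, new):
            return new


store = VersionedStore()
increment(store, "page_views")
increment(store, "page_views")
print(store.get("page_views"))
```

The retry loop lives in the application, which is the sense in which fault-tolerance "moves to the application layer": the store guarantees one small atomic step, and the caller composes the rest.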

  • Basically Available: is the data layer up or not? are we serving content to our users or not?
    Soft State: shifting burden of "correctness" up to application layer. availability is more important than precision. accuracy (correct) vs. precision (repeatable).
    Eventual Consistency: all operations are recorded and ordered. played back as resources permit.
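
The Eventual Consistency note ("recorded and ordered, played back as resources permit") can be sketched as an ordered operation log that replicas apply at their own pace. The `Replica` class, the `budget` parameter, and the sample keys are assumptions made for illustration, not part of any system named here.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that applies an ordered operation log lazily."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # index of the next log entry to apply

    def catch_up(self, log, budget):
        """Apply at most `budget` pending operations, in log order."""
        for key, value in log[self.applied:self.applied + budget]:
            self.state[key] = value
            self.applied += 1


# Ordered log of writes as (key, value). The ordering makes replay deterministic.
log = [("user:1", "alice"), ("user:2", "bob"), ("user:1", "alicia")]

fast, slow = Replica(), Replica()
fast.catch_up(log, budget=3)   # has the resources to apply everything now
slow.catch_up(log, budget=1)   # falls behind and will serve a stale read

stale_read = slow.state.get("user:1")
slow.catch_up(log, budget=10)  # later, resources permit: replay the rest
print(stale_read, slow.state == fast.state)
```

While the slow replica lags it still answers reads (basically available) with an out-of-date value (soft state); once it has replayed the full log, both replicas agree.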

  • agile dev moves too fast for schema and constraints - this isn’t waterfall
    data models change quickly
    up-front schema modeling is akin to waterfall development - not always practical/feasible/possible
    data is messy - record what you have and leave constraints up to the application

  • at scale, data services look like a DHT anyway!
    isolated independent services
    introduced caching layers
    partitioned data by logical and range boundaries.
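
As a rough illustration of partitioning "by logical and range boundaries", the sketch below routes a key to a node either by a stable hash (DHT-style) or by sorted range boundaries. The node names and split points are hypothetical.

```python
import bisect
import hashlib

NODES = ["data-0", "data-1", "data-2"]  # hypothetical node names


def hash_partition(key, nodes=NODES):
    """Pick a node from a stable hash of the key (DHT-style lookup)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Assumed split points: data-0 owns keys below "h", data-1 owns ["h", "q"),
# and data-2 owns everything from "q" upward.
RANGE_BOUNDS = ["h", "q"]


def range_partition(key, nodes=NODES, bounds=RANGE_BOUNDS):
    """Pick the node whose key range contains the key."""
    return nodes[bisect.bisect_right(bounds, key)]


print(hash_partition("user:42"))
print(range_partition("alice"), range_partition("mallory"), range_partition("zed"))
```

Hashing spreads keys evenly but scatters neighbouring keys across nodes; range boundaries keep neighbouring keys together, which makes sequential scans cheap at the cost of possible hot spots. Note that plain modulo hashing remaps most keys when the node count changes; consistent hashing is the usual remedy.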

  • webapp

  • app servers/session self-contained - load-balanced
    data’s in one spot - what do you do?

  • 37signals approach - DHH “scaling is a good thing because scaling => users => $$$”
  • more users, more instances. easy!
  • doesn’t work for social applications:
    - users cannot interact
    - old MMOs vs. new social games

  • redesign data server as “data services”
    separate independent logical components
  • knowing each service by name becomes “vexing”

  • configuration/logistical nightmare!

  • abstractions!
    wouldn’t it be nice if...

  • Distributed Computing Made Easy Less Hard

  • programming model/API for parallel computing
    Google's MapReduce paper
  • replicated, high throughput, fairly UNIX-y (not POSIX).
    Google FS Paper
  • Distributed Group Services - coordination, synchronization, configuration, naming.
    Google Chubby Paper
  • efficient, cross-language messaging
    Facebook/Apache Thrift
    Google Protobufs
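
To make the MapReduce programming model in the first note above concrete, here is a single-process sketch of the map, shuffle, and reduce phases using word count. It is plain Python for illustration, not Hadoop's API; the function names and sample documents are assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


documents = ["big data is hard", "big data is still hard"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map call sees one record and each reduce call sees one key, a framework is free to run them on different machines; the programmer writes only the two functions.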

  • Google BigTable
  • Addresses limitations of Raw M/R, HDFS access
  • request by key: vs. hdfs sequential reads
  • low-latency, ms response times vs. m/r high-latency
  • row/column concepts
    DHT semantics
    Java, REST, Thrift
  • Billions of rows, millions of columns
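
The row/column and request-by-key notes above can be pictured as a sorted map from row key to a sparse set of columns. The toy `TinyTable` below is a hypothetical in-memory sketch of that data model, not the HBase client API; the `family:qualifier` column names simply follow the BigTable-style naming convention.

```python
import bisect


class TinyTable:
    """Hypothetical in-memory table: row key -> {column -> value}, rows kept sorted."""

    def __init__(self):
        self._rows = {}
        self._sorted_keys = []

    def put(self, row, column, value):
        """Write one cell; rows are sparse and need not share columns."""
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        """Random access by row key."""
        return self._rows.get(row, {})

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._sorted_keys, start)
        hi = bisect.bisect_left(self._sorted_keys, stop)
        for key in self._sorted_keys[lo:hi]:
            yield key, self._rows[key]


table = TinyTable()
table.put("user:bob", "info:email", "bob@example.com")
table.put("user:alice", "info:email", "alice@example.com")
table.put("user:alice", "stats:logins", 3)
table.put("video:42", "info:title", "intro")

print(table.get("user:alice"))
print([key for key, _ in table.scan("user:", "user;")])
```

Keeping rows sorted by key is what allows both lookups by key and cheap range scans over a key prefix, in contrast to reading a whole file sequentially.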


Transcript

  • 1. Nick Dimiduk - @xefyr Founder, Drawn to Scale nick@drawntoscalehq.com April 28, 2010
  • 2. Agenda what NoSQL is not motivation Hadoop HBase
  • 3. whoami
    Computer Science & Engineering at Ohio State: Artificial Intelligence, Programming Languages, Systems Engineering
    Applied Technical Systems: Hierarchical, non-relational data storage and analysis systems (no-sql before there was NoSQL). Information Retrieval, Wire Serialization/RPC (before there was Thrift/Avro), Data Visualization (GB's)
    Visible Technologies: Social Media Storage, Processing, Analytics. Monitoring, Engagement, Warehousing, and BI. (TB's)
    Drawn to Scale: Big Data Storage, Processing, Retrieval, Analytics (TB's, PB's)
  • 4. Agenda what NoSQL is not motivation Hadoop HBase
  • 5. What NoSQL is not. movement
  • 6. What NoSQL is not. movement - no ANSI NoSQL-2010 one-size-fits-all
  • 7. It’s not Anti-RDBMS
  • 8. It’s about Choice! http://www.flickr.com/photos/zakh/337938459/
  • 9. What NoSQL is not. movement - no ANSI NoSQL-2010 one-size-fits-all - it’s about choice silver bullet
  • 10. What NoSQL is not. movement - no ANSI NoSQL-2010 one-size-fits-all - it’s about choice silver bullet - guarantees are hard
  • 11. Agenda what NoSQL is not motivation Hadoop HBase
  • 12. motivation more, More, MORE Data!
  • 13. motivation more, More, MORE Data! ACID Burns
  • 14. motivation more, More, MORE Data! ACID Burns BASE is good enough
  • 15. motivation more, More, MORE Data! ACID Burns BASE is good enough Life’s too short
  • 16. motivation more, More, MORE Data! ACID Burns BASE is good enough Life’s too short
  • 17. “typical” application
  • 18. “typical” application Data Server Village People App Server
  • 19. growing pains Data Server Villages of People App Servers
  • 20. vertical partitioning Data Server Villages of People App Servers Data Server Villages of People App Servers
  • 21. vertical partitioning Data Server Villages of People Data Server Villages of People App Servers App Servers Data Server Villages of People Data Server Villages of People App Servers App Servers
  • 22. vertical partitioning Data Server Villages of People App Servers Data Server Villages of People App Servers
  • 23. “typical” application
  • 24. growing pains Data Servers Villages of People App Servers
  • 25. horizontal partitioning Villages of People
  • 26. horizontal partitioning Villages of People
  • 27. horizontal partitioning Villages of People Data Layer Application Layer
  • 28. Agenda what NoSQL is not motivation Hadoop HBase
  • 29. “open source, reliable, distributed computing”
  • 30. “open source, reliable, distributed computing”
  • 31. MapReduce - API for parallel computing
  • 32. MapReduce - API for parallel computing HDFS - distributed, replicated file system
  • 33. MapReduce - API for parallel computing HDFS - distributed, replicated file system ZooKeeper - distributed synchronization
  • 34. MapReduce - API for parallel computing HDFS - distributed, replicated file system ZooKeeper - distributed synchronization Avro - Data Serialization / RPC
  • 35. Agenda what NoSQL is not motivation Hadoop HBase
  • 36. structured, distributed database for your horizontally scalable FS
  • 37. structured, distributed database for your horizontally scalable FS
  • 38. random access
  • 39. random access real-time reads/writes
  • 40. random access real-time reads/writes simple API
  • 41. random access real-time reads/writes simple API big table
  • 42. references:
    http://www.nosql-database.org
    Eventually Consistent: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
    Soft State: http://mercury.lcs.mit.edu/~jnc/tech/hard_soft.html
    Accuracy and Precision: http://en.wikipedia.org/wiki/Accuracy_and_precision
    Compare and Swap: http://en.wikipedia.org/wiki/Compare-and-swap
    Apache Hadoop: http://hadoop.apache.org
    Google MapReduce: http://labs.google.com/papers/mapreduce.html
    Google FS: http://labs.google.com/papers/gfs.html
    Apache Thrift: http://incubator.apache.org/thrift/
    Protobuf: http://code.google.com/p/protobuf/
    Google BigTable: http://labs.google.com/papers/bigtable.html
    Google Chubby: http://labs.google.com/papers/chubby.html
  • 43. Questions? Nick Dimiduk - @xefyr Founder, Drawn to Scale nick@drawntoscalehq.com April 28, 2010