Introduction to Hadoop, HBase, and NoSQL
 

Usage Rights

CC Attribution License

  • I’m Not an RDBMS Guy!
  • squish the FUD
  • no central point of organization
    no committee or standardizing body
    no plan/strategy/illuminati to take down the RDBMS; lots of "in-fighting"
  • central tenet - there IS NO one-size-fits-all
    unlike RDBMS assumptions, each engineering effort must be evaluated for its data needs
  • is it “anti-RDBMS”?
  • not so much
  • will not magically solve all your data or performance problems
    applications won’t magically stop crashing, corrupting data, etc.
    Big Data is still hard. These tools make it possible/affordable/approachable
  • data persistence comes down to guarantees
  • why are we here?
  • "web scale"
    more users, content, connections
    more trends, insight, knowledge
  • Atomicity: fault-tolerance is moving to the application layer - smaller atomic units
    Consistency: yes! but not necessarily immediate - "availability" (latency, reads) is more important.
    Isolation: smaller atomic units (multi-step transaction vs. compare-and-swap), greater availability, denormalization => reduced dependency on isolation
    Durability: some things are more important than getting every last detail, e.g. latency of response, view in aggregate
  • Basically Available: is the data layer up or not? are we serving content to our users or not?
    Soft State: shifting the burden of "correctness" up to the application layer. availability is more important than precision. accuracy (correct) vs. precision (repeatable).
    Eventual Consistency: all operations are recorded and ordered, then played back as resources permit.
  • agile dev moves too fast for schema and constraints - this isn’t waterfall
    data models change quickly
    up-front schema modeling is akin to waterfall development - not always practical/feasible/possible
    data is messy - record what you have and leave constraints up to the application
  • at scale, data services look like a DHT anyway!
    isolated independent services
    introduced caching layers
    partitioned data by logical and range boundaries.
  • webapp
  • app servers/session self-contained - load-balanced
    data’s in one spot - what do you do?
  • 37signals approach - DHH: “scaling is a good thing because scaling => users => $$$”
  • more users, more instances. easy!
  • doesn’t work for social applications:
    - users cannot interact
    - old MMOs vs. new social games
  • redesign data server as “data services”
    separate independent logical components
  • knowing each service by name becomes “vexing”
  • configuration/logistical nightmare!
  • abstractions!
    wouldn’t it be nice if...
  • Distributed Computing Made Easy Less Hard
  • programming model/API for parallel computing
    Google's MapReduce paper
  • replicated, high throughput, fairly UNIX-y (not POSIX).
    Google FS paper
  • Distributed Group Services - coordination, synchronization, configuration, naming.
    Google Chubby paper
  • efficient, cross-language messaging
    Facebook/Apache Thrift
    Google Protobufs
  • Google BigTable
  • Addresses limitations of raw M/R and HDFS access
  • request by key vs. HDFS sequential reads
  • low-latency, ms response times vs. M/R high latency
  • row/column concepts
    DHT semantics
    Java, REST, Thrift
  • Billions of rows, millions of columns
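
The row/column model in the HBase notes above (request by key, sparse columns, BigTable lineage) can be pictured as a sorted map from row key to columns to values. What follows is a minimal in-memory sketch in Python, not the HBase API: the class name `TinyTable`, its methods, and the sample row keys are all hypothetical, chosen only to illustrate lookup by key and range scans over sorted row keys.

```python
import bisect


class TinyTable:
    """Toy stand-in for a BigTable-style table (hypothetical, not the HBase API)."""

    def __init__(self):
        self._rows = {}         # row key -> {column name -> value}
        self._sorted_keys = []  # row keys kept sorted so range scans are cheap

    def put(self, row_key, column, value):
        """Store one cell; rows are sparse, so any row may hold any columns."""
        if row_key not in self._rows:
            bisect.insort(self._sorted_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access by key: return a copy of the row's columns."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key."""
        lo = bisect.bisect_left(self._sorted_keys, start_key)
        hi = bisect.bisect_left(self._sorted_keys, stop_key)
        for key in self._sorted_keys[lo:hi]:
            yield key, dict(self._rows[key])


table = TinyTable()
table.put("user:alice", "info:email", "alice@example.com")
table.put("user:bob", "info:email", "bob@example.com")
table.put("user:bob", "stats:logins", 3)
table.put("video:42", "meta:title", "intro")

print(table.get("user:bob"))
# ";" sorts immediately after ":", so this range covers every "user:" key.
print([key for key, _ in table.scan("user:", "user;")])
```

Keeping row keys sorted is what makes the range scan a contiguous slice; a real deployment would additionally split those key ranges across many servers, which this single-process sketch does not attempt.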

Introduction to Hadoop, HBase, and NoSQL: Presentation Transcript

  • Nick Dimiduk - @xefyr, Founder, Drawn to Scale, nick@drawntoscalehq.com, April 28, 2010
  • Agenda: what NoSQL is not, motivation, Hadoop, HBase
  • whoami
    Computer Science & Engineering at Ohio State: Artificial Intelligence, Programming Languages, Systems Engineering
    Applied Technical Systems: Hierarchical, non-relational data storage and analysis systems (no-sql before there was NoSQL). Information Retrieval, Wire Serialization/RPC (before there was Thrift/Avro), Data Visualization (GBs)
    Visible Technologies: Social Media Storage, Processing, Analytics. Monitoring, Engagement, Warehousing, and BI. (TBs)
    Drawn to Scale: Big Data Storage, Processing, Retrieval, Analytics (TBs, PBs)
  • What NoSQL is not.
    movement - no ANSI NoSQL-2010
    one-size-fits-all - it’s about choice
    silver bullet - guarantees are hard
  • It’s not Anti-RDBMS
  • It’s about Choice! http://www.flickr.com/photos/zakh/337938459/
  • motivation
    more, More, MORE Data!
    ACID Burns
    BASE is good enough
    Life’s too short
  • “typical” application (diagram labels: Village People, App Server, Data Server)
  • growing pains (diagram labels: Villages of People, App Servers, Data Server)
  • vertical partitioning (diagram labels: the Villages of People / App Servers / Data Server stack, repeated)
  • growing pains (diagram labels: Villages of People, App Servers, Data Servers)
  • horizontal partitioning (diagram labels: Villages of People, Application Layer, Data Layer)
  • “open source, reliable, distributed computing”
  • MapReduce - API for parallel computing
    HDFS - distributed, replicated file system
    ZooKeeper - distributed synchronization
    Avro - Data Serialization / RPC
  • structured, distributed database for your horizontally scalable FS
  • random access
    real-time reads/writes
    simple API
    big table
  • references
    http://www.nosql-database.org
    Eventually Consistent: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
    Soft State: http://mercury.lcs.mit.edu/~jnc/tech/hard_soft.html
    Accuracy and Precision: http://en.wikipedia.org/wiki/Accuracy_and_precision
    Compare and Swap: http://en.wikipedia.org/wiki/Compare-and-swap
    Apache Hadoop: http://hadoop.apache.org
    Google MapReduce: http://labs.google.com/papers/mapreduce.html
    Google FS: http://labs.google.com/papers/gfs.html
    Apache Thrift: http://incubator.apache.org/thrift/
    Protobuf: http://code.google.com/p/protobuf/
    Google BigTable: http://labs.google.com/papers/bigtable.html
    Google Chubby: http://labs.google.com/papers/chubby.html
  • Questions? Nick Dimiduk - @xefyr Founder, Drawn to Scale nick@drawntoscalehq.com April 28, 2010
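
The MapReduce programming model named in the Hadoop section of the transcript can be illustrated with the usual word-count example. This is a single-process sketch in Python under stated assumptions, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the grouping step that a real framework performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is hard", "big data is big"])
print(counts)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both phases could in principle run in parallel on separate machines, which is the property the model is built around.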