Intro to the Hadoop Stack @ April 2011 JavaMUG


Covers high level concepts of different pieces of the Hadoop project: HDFS, MapReduce, HBase, Hive, Pig & Zookeeper




    Intro to the Hadoop Stack @ April 2011 JavaMUG – Presentation Transcript

    • About Me
      - david.engfer@gmail.com / @engfer
      - Meetup organizer for DFWBigData.org > Hadoop, Cassandra, and all other things BigData and NoSQL > Join up!
      - Sr. Consultant @ a rapidly growing national IT consulting firm focused on career development while operating within a local-office project model
    • What is Hadoop?
      - "framework for running [distributed] applications on large clusters built of commodity hardware" – from the Hadoop wiki (Marty McFly?)
      - Originally created by Doug Cutting
        > Named the project after his son's toy
      - The name "Hadoop" has now evolved to cover a family of products, but at its core it's essentially just the MapReduce programming paradigm + a distributed file system
    • History
    • History: >_< Growing Pains (Jeffrey Dean): lots of data + tape backup + expensive servers + high network bandwidth + expensive databases + non-linear scalability + etc. (http://bit.ly/ec31VL + http://bit.ly/gq84Ot)
    • History: >_< Growing Pains + Solutions
    • History: >_< Growing Pains + Solutions → White Papers: Google File System (2003), MapReduce (2004), BigTable (2006)
    • History: White Papers (Google File System 2003, MapReduce 2004, BigTable 2006) → Hadoop Core, c. 2005
    • Hadoop Distributed File System (HDFS)
      - OSS implementation of Google File System (bit.ly/ihXkof)
      - Master/slave architecture
      - Designed to run on commodity hardware
      - Hardware failures assumed in design
      - Fault-tolerant via replication
      - Semi-POSIX compliance; relaxed for performance
      - Unix-like permissions; ties into host's users & groups
    • Hadoop Distributed File System (HDFS), continued
      - Written in Java
      - Optimized for larger files
      - Focus on streaming data (high throughput > low latency)
      - Rack-aware
      - Only *nix for production environments
      - Web consoles for stats
    • HDFS Client APIs
      - "Shell-like" commands (hadoop dfs [cmd])
        > cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get getmerge ls lsr mkdir moveFromLocal mv put rm rmr setrep stat tail test text touchz
      - Native Java API (a minimal sketch follows below)
      - APIs for other languages (http://bit.ly/fLgCJC)
        > C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml
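      A minimal sketch of the native Java API mentioned above, assuming a core-site.xml on the classpath; the path used is illustrative:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsExample {
          public static void main(String[] args) throws Exception {
            // Picks up fs.default.name from the Hadoop config on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file (illustrative path)
            Path path = new Path("/user/demo/hello.txt");
            FSDataOutputStream out = fs.create(path);
            out.writeUTF("hello HDFS");
            out.close();

            // List the parent directory
            for (FileStatus status : fs.listStatus(path.getParent())) {
              System.out.println(status.getPath() + " " + status.getLen());
            }
            fs.close();
          }
        }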
    • Other HDFS Admin Tools
      - hadoop dfsadmin [opts]
        > Basic admin utilities for the DFS cluster
        > Change file-level replication factors, set quotas, upgrade, safemode, reporting, etc.
      - hadoop fsck [opts]
        > Runs the distributed file system checking and fixing utility
      - hadoop balancer
        > Utility that rebalances block storage across the nodes
    • HDFS Node Types
      Master:
      - NameNode – single node responsible for:
        > Filesystem metadata operations on the cluster
        > Replication and locations of file blocks
        > SPOF =( (backups)
      - CheckpointNode or BackupNode – nodes responsible for:
        > NameNode backup mechanisms
      Slaves:
      - DataNode – nodes responsible for:
        > Storage of file blocks
        > Serving actual file data to clients
    • HDFS Architecture (diagram): the client performs filesystem/namespace/metadata operations and block-location lookups against the NameNode – no file data flows through it; the BackupNode receives namespace backups; DataNodes exchange heartbeats, balancing, and replication traffic with the NameNode, write blocks to local disk, and transfer data directly to and from clients.
    • Putting files on HDFS (diagram sequence): the client buffers the file into blocks (64MB) on local disk; the NameNode returns the block size and the set of DataNodes for each block (3 by default, based on the "replication factor"); while buffering to local disk, the client transfers each block directly to its assigned DataNodes (e.g. {node1, node2, node3}, {node1, node3, node5}, {node1, node4, node5}, …), and so on until the file is written.
    • Getting files from HDFS (diagram): the NameNode returns the locations of the blocks for the file, and the client streams the blocks directly from the DataNodes.
    • Fault Tolerance? (diagram sequence): the NameNode detects the loss of a DataNode, and blocks are auto-replicated on the remaining nodes to satisfy the replication factor. NameNode loss = FAIL (requires manual intervention), but not an EPIC fail because you have the BackupNode to replay any FS operations. **Automatic failover is in the works.
    • Live horizontal scaling and rebalancing (diagram sequence): the NameNode detects that a new DataNode has been added to the cluster; blocks are re-balanced and re-distributed; once the replication factor is satisfied, extra replicas are removed.
    • HDFS Demonstration
    • Other HDFS Utils
      - HDFS RAID (http://bit.ly/fqnzs5)
        > Uses distributed RAID instead of replication (useful at petabyte scale)
      - Flume/Scribe/Chukwa (diagram from the Flume wiki)
        > Log collection and aggregation frameworks that support streaming log data to HDFS
        > Flume = Cloudera (http://bit.ly/gX8LeO)
        > Scribe = Facebook (http://bit.ly/dIh3If)
    • MapReduce
      - Distributed programming paradigm and framework; the OSS implementation of Google's MapReduce (http://bit.ly/gXZbsk)
      - Modeled using the ideas behind the functional-programming map() and reduce() operations
        > Distributed on as many nodes as you would like
      - 2-phase process: map() [sub-divide & conquer] → reduce() [combine & reduce cardinality]
    • MapReduce ABCs
      - Essentially, it's…
        1. Take a large problem and divide it into sub-problems
        2. Perform the same function on all sub-problems
        3. Combine the output from all sub-problems
      - Ex: Searching
        1. Take a large problem and divide it into sub-problems
           # Different groups of rows in a DB; different parts of files; 1 user from a list of users; etc.
        2. Perform the same function on all sub-problems
           # Search for a key in the given partition of data for the sub-problem; count words; etc.
        3. Combine the output from all sub-problems
           # Combine the results into a result set and return it to the client
    • M/R Facts
      - M/R is excellent for problems where the "sub-problems" are not interdependent
        > For example, the output of one "mapper" should not depend on the output of, or communication with, another "mapper"
      - The reduce phase does not begin execution until all mappers have finished
      - Failed map and reduce tasks get auto-restarted
      - Rack/HDFS-aware
    • MapReduce Visualized (diagram): input is split across Mappers, each emitting <key, value> pairs; the pairs are sorted and grouped by key into <key, list(values)>; Reducers consume each group and produce the output.
    • Example: Word Count (diagram): each Mapper runs count() over its part of the input files and emits <word, count> pairs (e.g. <"foo", 3>, <"bar", 14>, <"baz", 6>); the pairs are sorted and grouped by key (e.g. <"foo", (3, 21, 11, 1)>); each Reducer runs sum() over a group and emits the totals: bar,155 baz,59 foo,36. A Java sketch of this job follows.
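      A minimal sketch of the word-count job in the Java MapReduce API of that era (org.apache.hadoop.mapreduce); class names and paths are illustrative:

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

          // map(): emit <word, 1> for every word in the input split
          public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
              }
            }
          }

          // reduce(): sum all counts for a word
          public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }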
    • Hadoop's MapReduce
      - MapReduce tasks are submitted as a "job"
        > Jobs can be assigned to a specified "queue" of jobs
          # By default, jobs are submitted to the "default" queue
        > Job submission is controlled by ACLs for each queue
      - Rack-aware and HDFS-aware
        > The JobTracker communicates with the HDFS NameNode and schedules map/reduce operations using input-data locality on HDFS DataNodes
    • M/R Nodes
      Master:
      - JobTracker – single node responsible for:
        > Coordinating all M/R tasks & events
        > Managing job queues and scheduling
        > Maintaining and controlling TaskTrackers
        > Moving/restarting map/reduce tasks if needed
        > SPOF =( – uses "checkpointing" to combat this
      Slaves:
      - TaskTracker – worker nodes responsible for:
        > Executing individual map and reduce tasks as assigned by the JobTracker (in a separate JVM)
    • Conceptual Overview (diagram): the JobTracker controls and heartbeats the TaskTracker nodes; TaskTrackers store temporary data on HDFS.
    • Job Submission (diagram sequence): M/R clients submit jobs to the JobTracker, where they get queued; map()'s are assigned to TaskTrackers (HDFS DataNode locality-aware); mappers are spawned in separate JVMs, execute, and store their results on HDFS; then the reduce phase begins and the reducers read the temporary data from HDFS.
    • MapReduce Tips
      - Keys and values can be any type of object
        > Can specify custom data splitters, partitioners, combiners, InputFormats, and OutputFormats
      - Use ToolRunner.run(Tool) to run your Java jobs (a sketch follows below)
        > It will use GenericOptionsParser and DistributedCache so that the -files, -libjars, & -archives options are available to distribute your mappers, reducers, and any other utilities
        > Without this, your mappers, reducers, and other utilities will not be propagated and added to the classpath of the other nodes (ClassNotFoundException)
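      A minimal sketch of a ToolRunner-based driver, reusing the WordCount mapper/reducer sketched earlier; the class name is illustrative:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        public class WordCountDriver extends Configured implements Tool {

          @Override
          public int run(String[] args) throws Exception {
            // getConf() already has any -D, -files, -libjars, -archives options applied
            Job job = new Job(getConf(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
          }

          public static void main(String[] args) throws Exception {
            // ToolRunner runs GenericOptionsParser before calling run()
            int exitCode = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
            System.exit(exitCode);
          }
        }

      An illustrative invocation: hadoop jar wordcount.jar WordCountDriver -libjars extra-utils.jar /in /out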
    • MapReduce Demonstration
    • Other M/R Utils
      - $HADOOP_HOME/contrib/*
        > PriorityScheduler & FairScheduler
        > HOD (Hadoop On Demand)
          # Uses the TORQUE resource manager to dynamically allocate, use, and destroy MapReduce clusters on an as-needed basis
          # Great for development and testing
        > Hadoop Streaming (next slide...)
      - Amazon's Elastic MapReduce (EMR)
        > Essentially production HOD for EC2 data/clusters
    • Hadoop Streaming
      - Allows you to write MapReduce jobs in languages other than Java by running any command-line process
        > Input data is partitioned and fed to the standard input (STDIN) of the specified command-line mappers and reducers
        > Output (STDOUT) from the command-line mappers and reducers gets combined back into the M/R pipeline
      - Can specify custom partitioners and combiners
      - Can specify files & archives to propagate to all nodes and unpack on the local file system (-archives & -file)

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
          -input "/foo/bar/input.txt" \
          -mapper splitz.py \
          -reducer /bin/wc \
          -output "/foo/baz/out" \
          -archives 'hdfs://hadoop1/foo/bar/cachedir.jar' \
          -file ~/scripts/splitz.py \
          -D mapred.job.name="Foo bar"
    • Pig
      - Framework and language (Pig Latin) for creating and submitting Hadoop MapReduce jobs
      - Common data operations (not supported by POJO M/R) like join, group, filter, sort, select, etc. are provided
      - Don't need to know Java
      - Removes the boilerplate aspect from M/R
        > 200 lines in Java → 15 lines in Pig!
      - Relational qualities (reads and feels SQL-ish)
    • Pig, continued
      - Fact from the wiki: 40% of Yahoo's M/R jobs are written in Pig
      - Interactive shell (grunt) exists
      - User Defined Functions (UDFs)
        > Allow you to specify Java code where the logic may be too complex for Pig Latin
        > UDFs can be part of most every operation in Pig Latin
        > Great for loading and storing custom formats as well as transforming data
    • Pig Relational Operations: COGROUP, CROSS, DISTINCT, FILTER, FOREACH, GROUP, JOIN, LIMIT, LOAD, MAPREDUCE, ORDER BY, SAMPLE, SPLIT, STORE, STREAM, UNION – most of these are pretty self-explanatory
    • Example Pig Script – taken from the Pig tutorial on the Pig wiki: the Temporal Query Phrase Popularity script processes a search query log file from the Excite search engine and compares the frequency of occurrence of search phrases across two time periods separated by twelve hours.

        REGISTER ./tutorial.jar;
        raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
        clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
        clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) AS query;
        houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) AS hour, query;
        ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) AS ngram;
        ngramed2 = DISTINCT ngramed1;
        hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
        hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) AS count;
        hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 AS ngram, $1 AS hour, $2 AS count;
        hour00 = FILTER hour_frequency2 BY hour eq '00';
        hour12 = FILTER hour_frequency3 BY hour eq '12';
        same = JOIN hour00 BY $0, hour12 BY $0;
        same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram AS ngram, $2 AS count00, $5 AS count12;
        STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
    • Example Pig Script (annotated repeat of the previous slide): callouts highlight the UDFs (org.apache.pig.tutorial.*) and add "Now... imagine this equivalent in Java..."
    • ZooKeeper
      - Centralized coordination service for use by distributed applications
        > Configuration, naming, synchronization (locks), ownership (master election), etc.
      - (diagram) ZooKeeper Service: an ensemble of servers with one elected Leader; clients connect to any server
      - Important system guarantees:
        > Sequential consistency (great for locking)
        > Atomicity – all or nothing at all
        > Data consistency – all clients view the same system state regardless of the server they connect to
    • ZooKeeper, continued
      - Hierarchical namespace of "znodes" (like directories); diagram: znode tree with leaf znodes
      - Operations (sketched below):
        > create a node at a location in the tree
        > delete a node
        > exists – tests if a node exists at a location
        > get data from a node
        > set data on a node
        > get children from a node
        > sync – waits for data to be propagated
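      A minimal sketch of these operations using the ZooKeeper Java client; the connection string and paths are illustrative:

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class ZkExample {
          public static void main(String[] args) throws Exception {
            // Connect to the ensemble (host:port list is illustrative)
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, new Watcher() {
              public void process(WatchedEvent event) {
                System.out.println("event: " + event);
              }
            });

            // create a znode at a location in the tree
            if (zk.exists("/demo", false) == null) {
              zk.create("/demo", "hello".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // get and set data on the znode
            byte[] data = zk.getData("/demo", false, null);
            System.out.println(new String(data));
            zk.setData("/demo", "world".getBytes(), -1);   // -1 = any version

            // list children, then delete
            System.out.println(zk.getChildren("/", false));
            zk.delete("/demo", -1);
            zk.close();
          }
        }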
    • HBase
      - Sparse, non-relational, column-oriented distributed database built on top of Hadoop Core (HDFS + MapReduce)
      - Modeled after Google's BigTable (http://bit.ly/fQ1NMA)
      - NoSQL = Not Only SQL... ...not "SQL is terrible"
      - HBase also has:
        > Strong consistency model
        > In-memory operation
        > LZO compression (optional)
        > Live migrations
        > MapReduce support for querying
    • What HBase Is…
      - Good at fast/streaming writes
      - Fault tolerant
      - Good at linear horizontal scalability
      - Very efficient at managing billions of rows and millions of columns
      - Good at keeping row history
      - Good at auto-balancing
      - A complement to a SQL DB/warehouse
      - Great with non-normalized data
    • What HBase Is NOT…
      - Made for table joins
      - Made for splitting into normalized tables (see previous)
      - A complete replacement for a SQL relational database
      - A complete replacement for a SQL data warehouse
      - Great for storing small amounts of data
      - Great for storing gobs of large binary data
      - The best way to do OLTP
      - The best way to do live ad-hoc querying of any column
      - A replacement for a proper caching mechanism
      - ACID compliant (http://bit.ly/hhFXCS)
    • HBase Facts
      - Written in Java
      - Uses ZooKeeper to store metadata and the -ROOT- region
      - Column-oriented store = flexible schema
        > Can alter the schema simply by adding the column name and data on insert ("put")
        > No schema migrations!
      - Every column has a timestamp associated with it
        > The same column with the most recent timestamp wins
      - Can export metrics for use with Ganglia, or as JMX
      - hbase hbck
        > Checks for errors and fixes them (like HDFS fsck)
    • HBase Client APIs
      - jRuby interactive shell (hbase shell)
        > DDL/DML commands
        > Admin commands
        > Cluster commands
      - Java API (http://bit.ly/ij0MgF) – a minimal sketch follows below
      - REST API
        > Provided using Stargate
      - APIs for other languages (http://bit.ly/fLgCJC)
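      A minimal sketch of the HBase 0.90-era Java client API; the table and column names are illustrative and the table is assumed to already exist:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseExample {
          public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.)
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");   // illustrative table name

            // "put" a cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("address"), Bytes.toBytes("line1"), Bytes.toBytes("hello"));
            table.put(put);

            // "get" it back; the most recent timestamp wins by default
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("address"), Bytes.toBytes("line1"));
            System.out.println(Bytes.toString(value));

            table.close();
          }
        }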
    • Column-Oriented?
      - Traditional RDBMSs use row-oriented storage, which stores entire rows sequentially on disk: [Row 1 – Cols 1-3] [Row 2 – Cols 1-3] [Row 3 – Cols 1-3]
      - Whereas column-oriented storage only stores the columns present for each row (or column families) sequentially on disk (diagram: per-column cells stored sequentially, with Row 2 – Col 2 absent)
      - Where's Row 2 – Col 2? Not needed: because columns are stored sequentially, rows have a flexible schema!
    • Think of HBase Tables As…
      - More like JSON, and less like spreadsheets:

        {                                            // row ids are the top-level keys
          "1" : {
            "A" : { v: "x", ts: 4282 },              // columns: value & timestamp (ts)
            "B" : { v: "z", ts: 4282 }
          },
          "aaaaa" : {
            "A" : { v: "y", ts: 4282 }
          },
          "xyz" : {
            "address" : {                            // column families allow grouping of columns (faster retrieval)
              "line1" : { v: "hello", ts: 4282 },
              "line2" : { v: "there", ts: 4282 },    // most recent ts = default column value
              "line2" : { v: "there", ts: 1234 }     // old ts
            },
            "fooo" : { v: "wow!", ts: 4282 }         // flexible schema
          },
          "zzzzz" : {
            "A" : { v: "woot", ts: 4282 },
            "B" : { v: "1337", ts: 4282 }
          }
        }

        Modified from http://bit.ly/hbGWIG
    • HBase Overview (diagrams from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV)
      - The Master server keeps track of the metadata for RegionServers and their Regions and stores it in ZooKeeper
      - The HBase client communicates with the ZooKeeper cluster only to get Region information; no data is sent through the Master
      - The actual row "data" (bytes) is sent directly to and from the RegionServers, so neither the Master server nor the ZooKeeper cluster serves as a data bottleneck
    • HBase Overview, continued (diagram from Lars George: http://goo.gl/wRLJP)
      - All HBase data (the HLog and HFiles) is stored on HDFS
      - HDFS breaks files into 64MB chunks and replicates the chunks N times (3 by default) onto "actual" disk, giving HBase its fault tolerance
    • Understanding HBase (diagram from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV)
      - Tables are split into groups of ~100 rows (configurable) called Regions
      - Regions are assigned to particular RegionServers by the Master server
      - The Master only contains region-location metadata and no "real" row data
    • Writing to HBase (diagram from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV)
      1. The HBase client gets the assigned RegionServers (and Regions) from the Master server for the particular keys (rows) in question and sends the commands/data
      2. The transaction is written to the write-ahead log on HDFS (disk) first
      3. The same data is written to the in-memory store for the assigned Region (row group)
      4. The in-memory store is periodically flushed to HDFS (disk) when its size reaches a threshold
    • HBase Scalability (diagram from Lars George: http://goo.gl/wRLJP & http://goo.gl/6ehnV)
      - Additional RegionServers can be added to the live system; the Master server will then rebalance the cluster and migrate Regions onto the new RegionServers
      - Moreover, additional HDFS DataNodes can be added to give more disk space to the HDFS cluster
    • HBase Demonstration
    • Hive
      - Data warehouse infrastructure on top of Hadoop Core
        > Stores data on HDFS
        > Allows you to add custom MapReduce plugins
      - HiveQL
        > SQL-like language pretty close to ANSI SQL
          # Supports joins
        > A JDBC driver exists
      - Has an interactive shell (like MySQL & PostgreSQL) to run interactive queries
    • Hive, continued
      - When running a HiveQL query/script, Hive creates and runs a series of MapReduce jobs in the background
        > BigData means it can take a long time to run queries
      - Therefore, it's good for offline BigETL, but not a good replacement for an OLTP/OLAP data warehouse (like Oracle)
      - Learn more from the wiki: http://bit.ly/epauio

        > SHOW TABLES;

        > CREATE TABLE rating (
            userid INT,
            movieid INT,
            rating INT,
            unixtime STRING)
          ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE;

        > DESCRIBE rating;
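      Since the previous slide notes that a JDBC driver exists, a minimal sketch of querying Hive from Java might look like this; the driver class and URL match the Hive 0.x HiveServer defaults, while the host name and query are illustrative:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveJdbcExample {
          public static void main(String[] args) throws Exception {
            // Hive 0.x JDBC driver (HiveServer1); requires the Hive JDBC jars on the classpath
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection("jdbc:hive://hive-host:10000/default", "", "");
            Statement stmt = con.createStatement();

            // Each HiveQL query is compiled into one or more MapReduce jobs behind the scenes
            ResultSet rs = stmt.executeQuery("SELECT movieid, AVG(rating) FROM rating GROUP BY movieid");
            while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
            con.close();
          }
        }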
    • Other useful utilities around Hadoop
      - Sqoop (http://bit.ly/eRfVEJ)
        > Loads SQL data from a table into HDFS or Hive
        > Generates Java classes to interact with the loaded data
      - Oozie (http://bit.ly/eNLi3B)
        > Orchestrates complex workflows around multiple MapReduce jobs
      - Mahout (http://bit.ly/hCXRjL)
        > Algorithm library for collaborative filtering, clustering, classifiers, and machine learning
      - Cascading (http://bit.ly/gyZNiI)
        > Data query abstraction layer similar to Pig
        > Java API that sits on top of the MapReduce framework
        > Since it's a Java API, you can use it with any JVM language: Groovy, Scala, Clojure, JRuby, Jython, etc.
    • What about support?
      - Community, wikis, forums, IRC
      - Cloudera provides enterprise support
        > Offerings:
          # Cloudera Enterprise
          # Support, professional services, training, management apps
        > Cloudera Distribution of Hadoop (CDH)
          # Tested and hardened version of the Hadoop products plus some other goodies (Oozie, Flume, Hue, Sqoop, Whirr)
            ~ Separate codebase, but patches flow to and from the Apache versions
          # Packages: Debian, Red Hat, EC2, VM
      - If you want to try Hadoop, CDH is probably the way to go. I recommend this instead of downloading each project individually.
    • Who uses this stuff? (logo wall) …and many more
    • Where the heck can I use this stuff?
      - The hardest part is finding the right use-cases to apply Hadoop (and any NoSQL system)
        > SQL databases are great for data that fits on one machine
        > Lots of tooling support for SQL; not as much for Hadoop (yet)
      - A few questions to think about:
        > How much data are you processing?
        > Are you throwing away valuable data due to space?
        > Are you processing data where steps aren't interdependent?
      - Log storage, log processing, utility data, research data, biological data, medical records, events, mail, tweets, market data, financial data
    • NoSQL ≠
    • NoSQL =
    • The Law of the Instrument: "It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail." – Abraham Maslow
    • ?’s
    • Thank You – david.engfer@gmail.com – submit feedback here!