Published in: Technology, News & Politics

  • Extended reading: higher-order functions (map, reduce); the Cluster Computing and MapReduce lecture, an online course provided by Google on YouTube that talks about "the cloud"; the MapReduce paper.
  • Extended reading: the problem of the NameNode, the famous single point of failure; Facebook's solution for Hadoop HA; The Next Generation of Apache Hadoop MapReduce; s4.io, another distributed computing platform; Twitter Storm, real-time processing, but not yet open source.
  • More detail: generally speaking, you can implement the InputFormat interface to do whatever you want, but be careful with the getSplits method. You might think you only need to implement InputSplit, but at a certain point in Hadoop's internal workflow the return value is cast to FileInputSplit, so if it is not one, the job submission will certainly fail. In most cases, just use TextInputFormat; it will cover most needs. A little tuning may apply to the split stage: since splits are based on file size and block size, a larger block size configuration *MAY* yield fewer splits and so reduce the total number of tasks.
  • Data writes are pipelined: the current node does not actually write to disk until the next node acks the block. Extended reading: Google File System, the original theoretical model for HDFS.
  • You can have multiple masters; ZooKeeper will select one as live and keep the others as backups until the current one dies. One CF per store file. The WAL can be disabled; in most of our cases it is worth doing so. 7k insertions in our cluster when writing to a single region. The region holds a big lock to ensure write consistency, and so blocks other concurrent requests. Making more regions *MAY* increase throughput.
  • No benchmark in our cluster. Some tuning may be worth considering: IO scheduler, fs journal mode, Hadoop read/write buffer, instance numbers per node, filesystem type, branch.
  • Read the source, and it will at least tell you why. Some considerations can be applied before setting up a cluster, such as separating the disk partitions for Hadoop to avoid complex calculation of disk reservation.
  • Hadoop
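
The note above about block size and split count can be made concrete with a rough sketch. It assumes the split-size rule used by Hadoop's FileInputFormat in the newer API, max(minSize, min(maxSize, blockSize)), together with its 10% slop on the last split. The function names, the `MB` constant, and the default limits are illustrative only, and the sketch simplifies the real behavior (for example, it ignores how empty files are handled).

```python
MB = 1024 * 1024   # illustrative constant, not a Hadoop setting
SPLIT_SLOP = 1.1   # assumed: the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size is the block size clamped between min_size and max_size."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Estimate how many splits (and so map tasks) one file produces."""
    split_size = compute_split_size(block_size, min_size, max_size)
    remaining = file_size
    splits = 0
    # Carve off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final (possibly slightly larger) split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    for block in (64 * MB, 128 * MB, 256 * MB):
        print(block // MB, "MB blocks:", count_splits(1024 * MB, block), "splits")
```

Under these assumptions, doubling the block size roughly halves the number of splits for a large file, which is why the note says a larger block size *MAY* reduce the task total. Small files still get at least one split each, so the effect depends on the input.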

    1. Hadoop Realtime Question  http://goo.gl/4F7Tj
    2. What is Hadoop: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model [1]. Execution overview [2]. Sources: [1] Apache Hadoop Home. [2] Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.
    3. Architecture: Implementation [1]. Source: [1] Wikipedia: Hadoop (image source).
    4. Architecture 2: Implementation [1]. Source: [1] Yahoo! Hadoop Tutorial.
    5. Architecture 3: Implementation [1]. Source: [1] Apache HDFS Design.
    6. Architecture 4: Implementation [1]. Source: [1] Hive Architecture.
    7. Architecture 5: Implementation [1]. Source: [1] HBase Architecture 101 - Storage.
    8. Performance: ?
    9. Troubleshooting: May the source be with you
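
The "simple programming model" from slide 2 builds on the higher-order functions map and reduce mentioned in the notes. The following is a minimal single-process sketch of that idea as a word count. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: fold all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(lines):
    # Apply the map step to every line, then flatten the emitted pairs.
    mapped = [pair for pairs in map(map_phase, lines) for pair in pairs]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop map reduce", "map reduce map"]))
```

In a real cluster the map and reduce steps run on different machines and the grouping happens over the network; here everything runs in one process to keep the data flow visible.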