Hadoop hbase mapreduce
A brief description of Hadoop, HDFS, MapReduce, Hive, and Pig

Published in: Technology

Transcript of "Hadoop hbase mapreduce"

  1. What is Big Data?
     ● How big is "Big Data"?
       ● Is 30 or 40 terabytes big data?
       ● ...
     ● Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools
     ● Today: terabyte, petabyte, exabyte
     ● Tomorrow?
  2. Enterprises & Big Data
     ● Most companies currently use traditional tools to store data
     ● Big data: the next frontier for innovation, competition, and productivity
     ● The use of big data will become a key basis of competition
     ● Organisations across the globe need to take the rising importance of big data more seriously
  3. Hadoop is an ecosystem, not a single product. When you deal with big data, the data center is your computer.
  4. • A Brief History of Hadoop
     • Contributors and Development
     • What is Hadoop
     • Why Hadoop
     • Hadoop Ecosystem
  5. A Brief History of Hadoop
     • Hadoop has its origins in Apache Nutch
     • Nutch was started in 2002
     • Challenge: the billions of pages on the Web
     • 2003: GFS (Google File System)
     • 2004: NDFS (Nutch Distributed File System)
     • 2004: Google published the MapReduce paper
     • 2005: Nutch developers started work on a MapReduce implementation
  6. • A Brief History of Hadoop
     • Contributors and Development
     • What is Hadoop
     • Why Hadoop
     • Hadoop Ecosystem
  7. Contributors and Development
     Lifetime patches contributed for all Hadoop-related projects: community members by current employer
     * Source: JIRA tickets
  8. Contributors and Development
  9. Contributors and Development
     * Source: Kerberos Konference (Yahoo) – 2010
  10. Development in ASF/Hadoop
      ● Resources
        ● Mailing list
        ● Wiki pages, blogs
        ● Issue tracking: JIRA
        ● Version control: SVN, Git
  11. • A Brief History of Hadoop
      • Contributors and Development
      • What is Hadoop
      • Why Hadoop
      • Hadoop Ecosystem
  12. What is Hadoop
      • Open-source project administered by the ASF
      • Data-intensive storage
      • and massively parallel processing (MPP)
      • Enables applications to work with thousands of nodes and petabytes of data
      • Suitable for applications with large data sets
  13. What is Hadoop?
      • Scalable
      • Fault tolerant
      • Reliable data storage using the Hadoop Distributed File System (HDFS)
      • High-performance parallel data processing using a technique called MapReduce
  14. What is Hadoop?
      • Hadoop is becoming the de facto standard for large-scale data processing
      • Becoming more than just MapReduce
      • The ecosystem is growing rapidly, with lots of great tools around it
  15. What is Hadoop?
      Yahoo Hadoop Cluster: 38,000 machines distributed across 20 different clusters. Source: Yahoo, 2010
      50,000 m : January 2012
      Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/
      [Image: SGI Hadoop Cluster]
  16. • A Brief History of Hadoop
      • Contributors and Development
      • What is Hadoop
      • Why Hadoop
      • Hadoop Ecosystem
  17. Why Hadoop?
  18. Why Hadoop?
  19. Why Hadoop?
  20. Why Hadoop?
      • Hadoop has its origins in Apache Nutch
      • Can process big data (petabytes and more)
      • Unlimited data storage and analysis
      • No licence cost: Apache License 2.0
      • Can be built out of commodity hardware
      • IT cost reduction
        • Results
          • Be one step ahead of the competition
          • Stay there
  21. Is Hadoop an alternative to an RDBMS?
      • At the moment Apache Hadoop is not a substitute for a database
      • No relations
      • Key-value pairs
      • Big data
        • Unstructured (text)
        • Semi-structured (sequence / binary files)
        • Structured (HBase, comparable to Google Bigtable)
      • Works fine together with an RDBMS
  22. • A Brief History of Hadoop
      • Contributors and Development
      • What is Hadoop
      • Why Hadoop
      • Hadoop Ecosystem
  23. Hadoop Ecosystem
      [Diagram of the stack:]
      ETL Tools | BI Reporting | RDBMS
      Pig (Data Flow) | Hive (SQL) | Sqoop
      MapReduce (Job Scheduling/Execution System)
      HBase (Key-Value store)
      HDFS (Hadoop Distributed File System)
  24. Hadoop Ecosystem
      Important components of Hadoop
      • HDFS: a distributed, fault-tolerant file system
      • MapReduce: a parallel data processing framework
      • Hive: a query framework (like SQL)
      • Pig: a query scripting tool
      • HBase: real-time read/write access to your big data
  25. Hadoop Ecosystem
      Hadoop is a distributed data computing platform
  26. HDFS
  27. HDFS
      NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes provide backup store of the blocks and constantly report to the NameNode to keep the metadata current.
  28. Hadoop Cluster
  29. Writing Files to HDFS
      • Client consults NameNode
      • Client writes block directly to one DataNode
      • DataNode replicates block
      • Cycle repeats for next block
  30. Reading Files from HDFS
      • Client consults NameNode
      • Client receives DataNode list for each block
      • Client picks first DataNode for each block
      • Client reads blocks sequentially
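
The write and read steps on slides 29 and 30 can be illustrated with a small in-memory simulation. This is only a sketch: the `NameNode` and `DataNode` classes, the 4-byte `BLOCK_SIZE`, and the round-robin choice of targets are assumptions made for illustration, and none of this is the real HDFS client API.

```python
BLOCK_SIZE = 4    # bytes; tiny on purpose so a short string spans several blocks
REPLICATION = 3   # number of copies kept of every block (assumed for this sketch)


class NameNode:
    """Holds metadata only: which blocks make up a file and where the replicas live."""

    def __init__(self, datanode_ids):
        self.datanode_ids = list(datanode_ids)
        self.block_map = {}   # filename -> list of (block_id, [datanode ids])
        self._next = 0

    def allocate_block(self, filename):
        # Hypothetical placement: REPLICATION distinct DataNodes, chosen round-robin.
        n = len(self.datanode_ids)
        targets = [self.datanode_ids[(self._next + i) % n] for i in range(REPLICATION)]
        self._next = (self._next + 1) % n
        blocks = self.block_map.setdefault(filename, [])
        block_id = f"{filename}#blk{len(blocks)}"
        blocks.append((block_id, targets))
        return block_id, targets

    def locate(self, filename):
        return self.block_map[filename]


class DataNode:
    """Stores block contents; knows nothing about files."""

    def __init__(self):
        self.blocks = {}


def write_file(namenode, datanodes, filename, data):
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk = data[offset:offset + BLOCK_SIZE]
        # 1. Client consults the NameNode.
        block_id, targets = namenode.allocate_block(filename)
        # 2. Client writes the block directly to one DataNode.
        first = datanodes[targets[0]]
        first.blocks[block_id] = chunk
        # 3. That DataNode replicates the block to the other targets.
        for replica in targets[1:]:
            datanodes[replica].blocks[block_id] = first.blocks[block_id]
        # 4. The cycle repeats for the next block.


def read_file(namenode, datanodes, filename):
    parts = []
    # Client consults the NameNode and receives a DataNode list for each block.
    for block_id, locations in namenode.locate(filename):
        # Client picks the first DataNode and reads the blocks sequentially.
        parts.append(datanodes[locations[0]].blocks[block_id])
    return b"".join(parts)


datanodes = {f"dn{i}": DataNode() for i in range(1, 5)}
namenode = NameNode(datanodes)
write_file(namenode, datanodes, "file.txt", b"hello hdfs!")
print(read_file(namenode, datanodes, "file.txt"))
```

Note that file data never passes through the `NameNode` object here; it only hands out block locations, which mirrors the division of labour described on slide 27.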
  31. Rack Awareness & Fault Tolerance
      [Diagram: NameNode with rack-aware metadata; racks Rack 1, Rack 5, ... Rack N holding DataNodes DN1, DN2, DN3, DN5, DN6, DN7, DN8]
      File.txt block locations:
        Blk A: DN1, DN5, DN6
        Blk B: DN1, DN2, DN9
        Blk C: DN5, DN9, DN10
      • Never lose all data if an entire rack fails
      • Within a rack: higher bandwidth, lower latency
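
A rack-aware placement rule can be sketched as follows. The sketch assumes the commonly described policy of one replica on the writer's node and two replicas on different nodes of one other rack; the rack layout in the usage lines and the function names are hypothetical and are not taken from the slide or from Hadoop's code.

```python
import random


def place_replicas(racks, writer_node, rng=random):
    """Return three nodes: the writer's node plus two nodes on one other rack.

    racks maps a rack name to the list of node names it contains.
    """
    node_rack = {node: rack for rack, nodes in racks.items() for node in nodes}
    local_rack = node_rack[writer_node]
    # Candidate remote racks need at least two nodes to hold two replicas.
    remote_racks = [r for r in racks if r != local_rack and len(racks[r]) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(racks[remote_rack], 2)
    return [writer_node, second, third]


def survives_rack_failure(replicas, racks, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(node not in racks[failed_rack] for node in replicas)


# Hypothetical layout, loosely following the DataNode names on the slide.
racks = {"rack1": ["DN1", "DN2", "DN3"], "rack5": ["DN5", "DN6", "DN7", "DN8"]}
replicas = place_replicas(racks, "DN1")
print(replicas, all(survives_rack_failure(replicas, racks, r) for r in racks))
```

Keeping two of the three replicas on the same remote rack reflects the trade-off in the two bullets above: the data survives the loss of a whole rack, while one of the replication hops stays inside a rack where bandwidth is higher.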
  32. Cluster Health
  33. Hadoop Ecosystem
      Important components of Hadoop
      • HDFS: a distributed, fault-tolerant file system
      • MapReduce: a parallel data processing framework
      • Hive: a query framework (like SQL)
      • Pig: a query scripting tool
      • HBase: a column-oriented database for OLTP
  34. MapReduce-Paradigm
      • Simplified data processing on large clusters
      • Splitting a big problem/data into little pieces
      • Key-value
  35. MapReduce-Batch Processing
      • Phases
        • Map
        • Sort/Shuffle
        • Reduce (aggregation)
      • Coordination
        • JobTracker
        • TaskTracker
  36. MapReduce-Map
      [Diagram: a MAP task on each of Datanode 1, 2 and 3 emits key-value (K, V) pairs, each with the value 1]
  37. MapReduce-Sort/Shuffle
      [Diagram: on each of Datanode 1, 2 and 3, a SORT step orders the emitted pairs]
  38. MapReduce-Reduce
      [Diagram: on each of Datanode 1, 2 and 3, the sorted pairs feed a REDUCE step that outputs one (K, V) total per key]
  39. MapReduce-All Phases
      [Diagram: the MAP, SORT and REDUCE steps shown end to end on all three datanodes]
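
The three phases can be illustrated with a word count run entirely in memory. This is a sketch of the paradigm only, not Hadoop's Java API; the function names and the three input splits are made up for the example.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (key, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_sort(mapped_outputs):
    """Sort/Shuffle: bring all values for the same key together, keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Reduce: aggregate the list of values collected for each key."""
    return {key: sum(values) for key, values in grouped}


# One split per (imaginary) datanode.
splits = ["hive pig hive", "hbase pig", "hive hbase hive"]
mapped = [map_phase(split) for split in splits]   # runs independently per split
counts = reduce_phase(shuffle_sort(mapped))
print(counts)
```

Each call to `map_phase` only sees its own split, which is what lets the map step run in parallel on the nodes that hold the data; only the shuffle needs to move pairs between nodes.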
  40. MapReduce-Job & Task Tracker
      [Diagram: Namenode and Datanodes]
      JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
  41. Summary of HDFS and MR
  42. Hadoop Ecosystem
      Important components of Hadoop
      • HDFS: a distributed, fault-tolerant file system
      • MapReduce: a parallel data processing framework
      • Hive: a query framework (like SQL)
      • Pig: a query scripting tool
      • HBase: a column-oriented database for OLTP
  43. Hive
  44. Hive
      • Data warehousing package built on top of Hadoop
      • It began its life at Facebook, processing large amounts of user and log data
      • Hadoop subproject with many contributors
      • Ad hoc queries, summarization, and data analysis on Hadoop-scale data
      • Directly query data in different formats (text/binary) and file formats (flat/sequence)
      • HiveQL: like SQL
  45. Hive Components
      [Diagram: Mgmt. Web UI, Hive CLI (browsing, queries, DDL), Thrift API, Hive QL (parser, planner, execution), MetaStore, MapReduce, HDFS]
      * Thrift: interface definition language
  46. Hadoop Ecosystem
      Important components of Hadoop
      • HDFS: a distributed, fault-tolerant file system
      • MapReduce: a parallel data processing framework
      • Hive: a query framework (like SQL)
      • Pig: a query scripting tool
      • HBase: a column-oriented database for OLTP
  47. Pig
      • The language used to express data flows is called Pig Latin
      • Pig Latin can be extended using UDFs (user-defined functions)
      • Was originally developed at Yahoo Research
      • PigPen is an Eclipse plug-in that provides an environment for developing Pig programs
      • Running Pig programs
        • Script: a script file that contains Pig commands
        • Grunt: interactive shell
        • Embedded: Java
  48. Pig
      grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
      >>     AS (year:chararray, temperature:int, quality:int);
      grunt> DUMP records;
      (1950,0,1)
      (1950,22,1)
      (1950,-11,1)
      (1949,111,1)
      (1949,78,1)
      grunt> DESCRIBE records;
      records: {year: chararray,temperature: int,quality: int}
      grunt> filtered_records = FILTER records BY temperature != 22;
      grunt> DUMP filtered_records;
      grunt> grouped_records = GROUP records BY year;
      grunt> DUMP grouped_records;
      (1949,{(1949,111,1),(1949,78,1)})
      (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
  49. Hadoop Ecosystem
      Important components of Hadoop
      • HDFS: a distributed, fault-tolerant file system
      • MapReduce: a parallel data processing framework
      • Hive: a query framework (like SQL)
      • Pig: a query scripting tool
      • HBase: a column-oriented database for OLTP
  50. HBase
      • Random, real-time read/write access to your big data
      • Billions of rows × millions of columns
      • Column-oriented store modeled after Google's Bigtable
      • Provides Bigtable-like capabilities on top of Hadoop and HDFS
      • HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format
  51. HBase-Datamodel
      • (Table, RowKey, Family, Column, Timestamp) → Value
      • Think of tags. Values can be any length, with no predefined names or widths
      • Column names carry info (just like tags)
  52. HBase-Datamodel
      • (Table, RowKey, Family, Column, Timestamp) → Value
  53. HBase-Datamodel
      • (Table, RowKey, Family, Column, Timestamp) → Value
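
The mapping (Table, RowKey, Family, Column, Timestamp) → Value can be pictured as a versioned dictionary. The `Table` class below is a toy model written for this transcript, not the HBase client API; its method names and the integer timestamps in the usage lines are assumptions.

```python
import time
from collections import defaultdict


class Table:
    """Toy model of the mapping (RowKey, Family, Column, Timestamp) -> Value."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)   # column families are fixed up front
        # (row_key, family, column) -> {timestamp: value}
        self.cells = defaultdict(dict)

    def put(self, row_key, family, column, value, timestamp=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if timestamp is None:
            timestamp = time.time_ns() // 1_000_000   # current time in milliseconds
        # Column names are free-form: any column may appear in any row.
        self.cells[(row_key, family, column)][timestamp] = value

    def get(self, row_key, family, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        history = self.cells.get((row_key, family, column), {})
        return sorted(history.items(), reverse=True)[:versions]


test = Table("test", ["cf"])
test.put("row1", "cf", "a", "value11", timestamp=1)
test.put("row1", "cf", "a", "value12", timestamp=2)
print(test.get("row1", "cf", "a"))               # newest version only
print(test.get("row1", "cf", "a", versions=3))   # full history of the cell
```

A second `put` to the same cell does not overwrite the first; it adds a newer version, and a read returns the newest one unless more versions are requested.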
  54. Create Sample Table
      hbase(main):003:0> create 'test', 'cf'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
      hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
      hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
      hbase(main):007:0> scan 'test'
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
      hbase(main):007:0> scan 'test', { VERSIONS => 3 }
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row1    column=cf:a, timestamp=1288380727188, value=value11
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
  55. HBase-Architecture
      • Splits
      • Auto-sharding
      • Master
      • Region servers
      • HFile
  56. Splits & RegionServers
      • Rows are grouped in regions and served by different servers
      • A table is dynamically split into "regions"
      • Each region contains values [startKey, endKey)
      • Regions are hosted on a RegionServer
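
Because every region covers a half-open key range [startKey, endKey), finding the region for a row key is a binary search over the sorted split keys. The `RegionMap` class, the split keys, and the server names below are hypothetical, chosen only to illustrate the lookup.

```python
from bisect import bisect_right


class RegionMap:
    """Maps row keys to regions delimited by sorted split keys."""

    def __init__(self, split_keys, servers):
        # N split keys cut the key space into N + 1 regions, one server entry each.
        if list(split_keys) != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server entry per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Return (server, (start_key, end_key)) for the region holding row_key."""
        # bisect_right places a key equal to a split key in the region that
        # starts at that key, which matches the [startKey, endKey) convention.
        index = bisect_right(self.split_keys, row_key)
        start = self.split_keys[index - 1] if index > 0 else ""
        end = self.split_keys[index] if index < len(self.split_keys) else None
        return self.servers[index], (start, end)


regions = RegionMap(["row2", "row3"], ["rs1", "rs2", "rs3"])
for key in ["row1", "row2", "row25", "row3"]:
    print(key, regions.locate(key))
```

In this sketch an empty start key and an end key of `None` stand for the open ends of the first and last region.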
  57. HBase-Architecture
  58. Other Components
      • Flume
      • Sqoop
  59. Commercial Products
      • Oracle Big Data Appliance
      • Microsoft Azure + Excel + MapReduce
      • Cloud computing, Amazon elastic computing
      • IBM Hadoop-based InfoSphere BigInsights
      • VMware Spring for Apache Hadoop
      • Toad for Cloud Databases
      • MapR, Cloudera, Hortonworks, Datameer
  60. Thank You
      Faruk Berksöz
      fberksoz@gmail.com