Published in: Technology

  1. Mayuri Agarwal
  2. Data Management!
  3. Big Data: What does it mean?
     • Velocity: Big data is often time-sensitive and must be used as it streams into the enterprise in order to maximize its value to the business. (Batch, near-time, real-time, streams)
     • Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information. (TB, records, transactions, tables, files)
     • Variety: Big data extends beyond structured data to semi-structured and unstructured data of all varieties: text, audio, video, click streams, log files, and more. (Structured, unstructured, semi-structured)
     • Veracity: Quality and provenance of received data. (Good, undefined, bad; inconsistency, incompleteness, ambiguity)
     • Value
  4. Big Data: worldwide data. [Chart: 90% created in the last 2 years; 10% since the beginning of time]
  5. What is Hadoop?
     • A software project that enables the distributed processing of large data sets across clusters of commodity servers
     • Works with structured and unstructured data
     • Open source software + commodity hardware = IT cost reduction
     • Designed to scale up from a single server to thousands of machines
     • Very high degree of fault tolerance, which comes from the software's ability to detect and handle failures at the application layer
  6. The origin of the name Hadoop: The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug Cutting, explains how the name came about: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."
  7. Hadoop Sub-projects
     • HDFS
     • MapReduce
  8. HDFS: Hadoop Distributed File System
     • Distributed, scalable, and portable file system
     • A Hadoop cluster typically has a single NameNode; a set of DataNodes forms the HDFS cluster
     • Asynchronous replication
     • Data is divided into 64 MB (default) or 128 MB blocks, and each block is replicated 3 times (default)
     • The NameNode holds the file system metadata
     • Files are broken up and spread across the DataNodes
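
The block and replication figures on the HDFS slide can be turned into a quick estimate. The following is a minimal sketch, not part of the original deck: the function name `hdfs_footprint` is hypothetical, and it assumes the final block of a file occupies only as much space as the data it holds.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Hypothetical helper for illustration; defaults mirror the slide
    (64 MB blocks, 3 replicas per block).
    """
    if file_size_bytes == 0:
        return 0, 0, 0
    # Integer ceiling division: a partly filled last block still counts.
    blocks = -(-file_size_bytes // block_size_bytes)
    replicas = blocks * replication
    # Assumption: the last block is not padded to the full block size.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes

blocks, replicas, raw = hdfs_footprint(1024 * MB)
print(blocks, replicas, raw // MB)  # 16 48 3072
```

Under these assumptions, a 1 GB file becomes 16 blocks, stored as 48 block replicas that take about 3 GB of raw disk across the DataNodes.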
  9. HDFS: Read & Write
  10. MapReduce
     • Software framework for distributed computation
     • Input | Map() | Copy/Sort | Reduce() | Output
     • The JobTracker schedules and manages jobs
     • A TaskTracker executes individual map() and reduce() tasks on each cluster node
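
The Input | Map() | Copy/Sort | Reduce() | Output pipeline can be imitated in a single process. This is only an illustrative sketch in plain Python, not the Hadoop API: the function names are hypothetical, and a real job would run the map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_sort(pairs):
    """Copy/Sort: group all values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce(): collapse all counts for one word into a total."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]

print(word_count(["big data big cluster", "data node"]))
# [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

Each stage only sees key/value pairs, which is what lets a framework distribute the map and reduce calls across machines.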
  11. Example: MapReduce
  12. Master-Slave Model
  13. Hadoop Ecosystem
  14. HBase
     • HBase is an open source, non-relational, distributed database
     • A key-value store
     • A value is identified by its key
     • Both key and value are byte arrays
     • The values are stored in key order
     • Thus accessing data by key is very fast
     • Users create tables in HBase
     • HBase tables have no fixed column schema; only column families are declared up front
     • Very good for sparse data
     • Takes lots of disk space
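
The points about byte-array keys and key-ordered storage can be illustrated with a toy in-memory model. This sketch is an assumption-laden stand-in, not the HBase client API: the class `SortedKV` and its methods are hypothetical names, and real HBase adds column families, versions, and on-disk files.

```python
import bisect

class SortedKV:
    """Toy key-value store that keeps byte-string keys in sorted order."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys ordered on insert
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]

store = SortedKV()
for row in (b"row3", b"row1", b"row2"):
    store.put(row, b"value-" + row)
print(store.scan(b"row1", b"row3"))
# [(b'row1', b'value-row1'), (b'row2', b'value-row2')]
```

Because the keys stay sorted, a range scan is a binary search followed by a sequential read, which is why key design matters so much for this kind of store.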
  15. HBase Architecture
     • Master: Responsible for coordinating the region servers
     • Region server: Serves data for reads and writes
     • ZooKeeper: Coordinates the HBase cluster
     • Low latency and random access to data
  16. Hive
     • A system for managing and querying structured data, built on Hadoop
     • SQL-like query language called HQL (HiveQL)
     • Main purpose is analysis and ad hoc querying
     • DDL operations on databases, tables, and partitions
     • Not for: small data sets, low-latency queries, OLTP
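
Hive stores each table partition in its own directory named `column=value`, so a query that filters on a partition column only needs to read the matching directories. The sketch below imitates that idea in plain Python; it is not Hive code, the `/warehouse` root and both function names are hypothetical, and the table and column names are made up for illustration.

```python
def partition_path(table, **partition_values):
    """Build a Hive-style partition directory such as .../dt=2013-01-01."""
    parts = "/".join(f"{col}={val}" for col, val in partition_values.items())
    return f"/warehouse/{table}/{parts}"  # hypothetical warehouse root

def prune(paths, column, wanted):
    """Keep only the partition directories where column equals wanted."""
    token = f"{column}={wanted}"
    return [p for p in paths if token in p.split("/")]

paths = [
    partition_path("sales", dt="2013-01-01", country="IN"),
    partition_path("sales", dt="2013-01-01", country="US"),
    partition_path("sales", dt="2013-01-02", country="IN"),
]
print(prune(paths, "dt", "2013-01-02"))
# ['/warehouse/sales/dt=2013-01-02/country=IN']
```

Skipping whole directories this way helps large scans, but each query still launches batch jobs, which fits the slide's point that Hive is not meant for low-latency queries.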
  17. Hadoop-Hive Architecture
  18. HBase-Hive configuration
     • HBase as ETL data sink
     • HBase as data source
     • Low-latency warehouse
  19. Hive and MySQL Database Structure
  20. Hadoop Limitations
     • Hadoop is not a high-speed SQL database.
     • Hadoop is not a particularly simple technology.
     • Hadoop is not easy to connect to legacy systems.
     • Hadoop is not a replacement for traditional data warehouses. It is an adjunct to data warehouses.
     • Normal DBAs will need to learn new skills before they can adopt Hadoop tools.
     • The architecture around the data (the way you store, de-normalize, ingest, and extract data) is different in Hadoop.
     • Linux and Java skills are critical for making a Hadoop environment a reality.
  21. Hadoop's Capability
     • Hadoop is a super-powerful environment that can transform your understanding of data.
     • Hadoop can store vast amounts of data.
     • Hadoop can run queries on huge data sets.
     • You can archive data on Hadoop and still query it.
     • Hadoop allows you to ingest data at incredible speeds and analyze it and report on it in near real-time.
     • Hadoop massively reduces the latency of data.
  22. Hadoop: Hot skill to acquire on the IT job circuit
     • The market for data technologies, such as databases, is a multi-billion dollar industry.
     • Many start-ups are working on technology extensions to Hadoop to make it both analytical and transactional. That would be big.
     • Major companies have a big data strategy and want to build their businesses on top of it.
     • Google, whose MapReduce and Google File System papers inspired Hadoop, has already moved on, suggesting that within a decade either the Hadoop framework will have to be developed beyond all recognition or something newer could be on the way to supplant it.
     • Major internet companies such as Twitter, LinkedIn, and Facebook use some form of Hadoop.
  23. mayuri.enggheads@gmail.com