Big Data & Hadoop Framework


  • Talk about Google Analytics
  • The ever-faster growth of data (in volume and variety), shown through detailed figures
    The value big data brings to various industries (especially retail, consulting, transportation, construction, …)
    The difficulties (57.6% of organizations report difficulties)
    The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[14] as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were created every day.[15] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.[16]
  • Common difficulties with big data (data complexity, data volume, performance, staff skills, data growth, cost)
  • The three big problems with big data
  • Summary: what big data is
    Big data[1][2] is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
  • Big data = Transactions + Interactions + Observations
    ERP (Enterprise Resource Planning)
    CRM (Customer Relationship Management)
    Drill into logs, user click streams, affiliate networks, behavioral targeting
  • Website
    Social Media
  • Detailed examples of each data type
  • The sources data is collected from
  • The organizations that collect data
  • The purposes of data collection
  • Extract (each source system may use a different data organization and/or format. Common source formats are relational databases and flat files, but they may also include non-relational structures such as Information Management System (IMS) or other data structures)
    Transform – PREPROCESS (selecting only certain columns to load (or selecting null columns not to load), translating coded values, encoding free-form values, deriving a new calculated value, sorting, joining data from multiple sources, aggregation, generating surrogate-key values, transposing or pivoting, splitting a column into multiple columns, disaggregating repeating columns into a separate detail table, looking up and validating the relevant data from tables or referential files for slowly changing dimensions, applying any form of simple or complex data validation)
    Load (depending on the requirements of the organization, this process varies widely. Some data warehouses overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals -- for example, hourly. To understand this, consider a data warehouse required to maintain sales records of the last year. It overwrites any data older than a year with newer data, but entries within any one-year window are kept in a historical manner)
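A minimal Python sketch of a few of the transform rules listed above (selecting columns, translating coded values, deriving a calculated value, aggregation); the code table and row fields are invented for illustration:

```python
# Hypothetical illustration of a few ETL "transform" steps:
# column selection, translating coded values, deriving a value, aggregation.

GENDER_CODES = {"M": "Male", "F": "Female"}  # assumed code table

def transform(rows):
    """Apply simple transform rules to extracted rows (list of dicts)."""
    out = []
    for row in rows:
        out.append({
            "name": row["name"],                                   # select only needed columns
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # translate coded values
            "total": row["qty"] * row["unit_price"],               # derive a calculated value
        })
    return out

def aggregate(rows):
    """Aggregate totals per gender."""
    sums = {}
    for row in rows:
        sums[row["gender"]] = sums.get(row["gender"], 0) + row["total"]
    return sums

extracted = [
    {"name": "An", "gender": "M", "qty": 2, "unit_price": 5.0},
    {"name": "Binh", "gender": "F", "qty": 1, "unit_price": 3.0},
    {"name": "Chi", "gender": "F", "qty": 4, "unit_price": 2.5},
]
print(aggregate(transform(extracted)))  # {'Male': 10.0, 'Female': 13.0}
```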
  • Reference data are data from outside the organization (often from standards organizations) which are, apart from occasional revisions, static. This non-dynamic data is sometimes also known as "standing data".[1] Examples would be currency codes and countries (the latter covered by the global standard ISO 3166-1). Reference data should be distinguished[2] from "master data", which is also relatively static but originates from within the organization, e.g. products, departments, even customers.
    A staging table is just a regular SQL table. For example, if you have a process that imports data from, say, .CSV files, you put this data in a staging table. You may then apply some data cleaning or business rules to the data and move it to different staging tables, etc.
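The staging-table flow described here can be sketched with an in-memory SQLite database; the table and column names are invented for illustration:

```python
# Sketch of the staging-table idea: load raw CSV-shaped rows into a staging
# table, then apply cleaning rules while moving them to the target table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staging_sales (name TEXT, amount TEXT)")  # raw, untyped data
cur.execute("CREATE TABLE sales (name TEXT, amount REAL)")          # cleaned target

# Step 1: load raw rows (as if read from a .CSV file) into the staging table.
raw_rows = [("An", " 10.5 "), ("Binh", "7"), ("", "3")]
cur.executemany("INSERT INTO staging_sales VALUES (?, ?)", raw_rows)

# Step 2: apply cleaning rules (trim, cast, drop empty names), then publish.
cur.execute("""
    INSERT INTO sales (name, amount)
    SELECT TRIM(name), CAST(TRIM(amount) AS REAL)
    FROM staging_sales
    WHERE TRIM(name) <> ''
""")
conn.commit()
print(cur.execute("SELECT * FROM sales ORDER BY name").fetchall())
# [('An', 10.5), ('Binh', 7.0)]
```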
  • Worst case:
    + No logs
    Common:
    + Only a small fraction of the current data can be analyzed
    + No data warehouse
  • Long-term storage in the data warehouse
    Serving data analysis
  • Problems with scaling up servers, fault tolerance, performance
  • Problems with failure management (devices, systems), RAID, network devices, performance
  • What Hadoop is used for
    The directions (core, distributions)
  • MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.[1]
    A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
    The model is inspired by the map and reduce functions commonly used in functional programming,[2] although their purpose in the MapReduce framework is not the same as in their original forms.[3] Furthermore, the key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine once.
    MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
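The model described above can be sketched in a few lines of Python, using the text's own example: map sorts students into queues by first name, and reduce counts each queue. The in-memory "shuffle" loop stands in for what the MapReduce system does between the two phases; all names and data are invented:

```python
# A minimal in-process sketch of the MapReduce programming model.
from collections import defaultdict

def map_fn(student):
    # Emit a (key, value) pair: the student's first name and a count of 1.
    yield (student.split()[0], 1)

def reduce_fn(name, counts):
    # Summarize one queue: total number of students with this first name.
    return (name, sum(counts))

def map_reduce(records, map_fn, reduce_fn):
    # "Shuffle" phase: group mapped values by key, as the framework would.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

students = ["An Nguyen", "Binh Tran", "An Le", "Chi Pham"]
print(map_reduce(students, map_fn, reduce_fn))  # {'An': 2, 'Binh': 1, 'Chi': 1}
```

In a real framework the groups would be partitioned across machines, with the map and reduce calls running in parallel; the scalability comes from that execution engine, not from the two functions themselves.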
  • View from task perspective
  • View from the scheduled machine's perspective
  • Introduce the sub-projects
  • Conclusion: the future of big data
    1. Big Data & Hadoop Framework

    2. Who am I? Name: Pham Phuong Tu. Works as an R&D developer at VC Corp. Related projects: … Interested in system & software architecture, big data, data statistics & analytics. Email:
    3. Statistic System
    4. Statistic System: Measuring Marketing Effectiveness
    5. User Activity Recorder: Mouse Click
    6. User Activity Recorder: Mouse Click
    7. User Activity Recorder: Mouse Scroll
    8. Mining System Log
    9. Table of Contents: Challenge; Big Data (overview; data types; what – who – why; extract, transform, load data; data operations; big data platform); Hadoop Framework (overview, history, users, architecture, MapReduce, Hadoop environment); Q&A; Demo
    10. CHALLENGE
    14. BIG DATA
    16. Analyze All Available Data: data warehouse, social media, website, billing, ERP, CRM, devices, network switches
    17. Types of Data: plain text data (web); semi-structured data (XML, JSON); relational data (tables/transactions/legacy data); graph data (social networks, Semantic Web (RDF), …); multimedia data (images, video, …); streaming data (you can only scan the data once)
    18. What is collecting all this data? Web browsers; web sites (search engines, social networks, e-commerce platforms, …); applications (computers, smartphones, tablets, game boxes); other systems (banking, phone, medical, GPS); Internet service providers
    19. Who is collecting your data? Government agencies; companies; service providers; big stores
    20. Why are they collecting your data? Search (Google Ads, Facebook Ads); recommendation systems (New York Times, Eyealike); target marketing (Facebook, AOL); video and image analysis (Facebook, Yahoo, Google); data warehousing (Facebook, Amazon); log analytics (Yahoo, Amazon, Zvents); business strategy (Walmart)
    21. ETL: Extract – convert the data into a single format appropriate for transformation processing. Transform – apply a series of rules or functions to the extracted data from the source. Load – load the data into the end target, usually the data warehouse.
    22. Real-life ETL cycle. The typical real-life ETL cycle consists of the following execution steps: cycle initiation; build reference data; extract (from sources); validate; transform (clean, apply business rules, check data integrity, create aggregates or disaggregates); stage (load into staging tables, if used); audit reports (for example, on compliance with business rules; in case of failure, they also help diagnose/repair); publish (to target tables); archive; clean up.
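A toy sketch of the cycle above as an ordered pipeline; every step is a placeholder function with invented data, not a real ETL engine:

```python
# Hypothetical illustration of orchestrating the ETL cycle as ordered steps.

def extract():            return [{"id": 1, "value": "10"}, {"id": 2, "value": "x"}]
def validate(rows):       return [r for r in rows if r["value"].isdigit()]
def transform(rows):      return [{"id": r["id"], "value": int(r["value"])} for r in rows]
def stage(rows):          return list(rows)           # stand-in for loading a staging table
def publish(rows):        return {"published": rows}  # stand-in for the target tables

def run_cycle():
    rows = extract()          # extract (from sources)
    rows = validate(rows)     # drop rows that fail integrity checks
    rows = transform(rows)    # clean and apply business rules
    rows = stage(rows)        # load into staging before publishing
    return publish(rows)      # publish (to target tables)

print(run_cycle())  # {'published': [{'id': 1, 'value': 10}]}
```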
    23. Operational data with the current implementation
    24. Operational data with a big data implementation
    25. Big Data Platform: understand and navigate federated big data sources (federated discovery and navigation); manage & store huge volumes of any data (Hadoop File System, MapReduce); structure and control data (data warehousing); manage streaming data (stream computing); analyze unstructured data (text analytics engine); integrate and govern all data sources (integration, data quality, security, lifecycle management, MDM)
    26. HADOOP FRAMEWORK
    27. Single-node architecture: CPU, memory, disk; machine learning, statistics; "classical" data mining
    28. Cluster architecture: each rack contains 16-64 nodes (CPU, memory, disk); 1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone between racks
    29. Hadoop History: Dec 2004 – Google GFS paper published; Jul 2005 – Nutch uses MapReduce; Feb 2006 – becomes a Lucene subproject; Apr 2007 – Yahoo! runs it on a 1000-node cluster; Jan 2008 – becomes an Apache top-level project; Jul 2008 – a 4000-node test cluster
    30. Who uses Hadoop? Google, Yahoo, Bing; Amazon, eBay, Alibaba; Facebook, Twitter; IBM, HP, Toshiba, Intel; New York Times, BBC; Line, WeChat; VC Corp, FPT, VNG, VTC
    31. Hadoop Components: a distributed file system (HDFS) – a single namespace for the entire cluster, replicating data 3x for fault tolerance; a MapReduce framework – executes user jobs specified as "map" and "reduce" functions and manages work distribution & fault tolerance
    33. Goals of HDFS: a very large distributed file system (10K nodes, 100 million files, 10 PB); assumes commodity hardware – files are replicated to handle hardware failure, and failures are detected and recovered from; optimized for batch processing – data locations are exposed so that computations can move to where the data resides, providing very high aggregate bandwidth; runs in user space on heterogeneous OSes
    34. HDFS Multi-Cluster
    35. Hadoop 1.0 vs 2.0
    36. NameNode: metadata in memory – the entire metadata is kept in main memory, with no demand paging of metadata; types of metadata – list of files, list of blocks for each file, list of DataNodes for each block, file attributes (e.g. creation time, replication factor); a transaction log – records file creations, file deletions, etc.
    37. DataNode: a block server – stores data in the local file system (e.g. ext3), stores the metadata of a block, serves data and metadata to clients; block report – periodically sends a report of all existing blocks to the NameNode; facilitates pipelining of data – forwards data to other specified DataNodes
    38. Block Placement – current strategy: one replica on the local node; the second replica on a remote rack; the third replica on the same remote rack; additional replicas are placed randomly; clients read from the nearest replica; the aim is to make this policy pluggable
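The placement strategy on this slide can be sketched as follows. Rack and node names are invented, and a real NameNode also considers node health and load before choosing targets:

```python
# Toy sketch of HDFS-style replica placement: first replica on the local node,
# second on a node in a remote rack, third on another node in that same rack.
import random

def place_replicas(local_rack, local_node, racks):
    """racks: dict mapping rack name -> list of node names (>= 2 per rack)."""
    replicas = [(local_rack, local_node)]                         # 1st: local node
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remote_nodes = random.sample(racks[remote_rack], 2)
    replicas.append((remote_rack, remote_nodes[0]))               # 2nd: remote rack
    replicas.append((remote_rack, remote_nodes[1]))               # 3rd: same remote rack
    return replicas

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
placement = place_replicas("rack1", "n1", racks)
print(placement)  # e.g. [('rack1', 'n1'), ('rack3', 'n5'), ('rack3', 'n6')]
```

Placing the second and third replicas in the same remote rack trades a little rack-level safety for less cross-rack write traffic, which is exactly the compromise the slide's strategy makes.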
    39. Data Correctness: use checksums (CRC32) to validate data; on file creation, the client computes a checksum per 512 bytes and the DataNode stores the checksums; on file access, the client retrieves the data and checksum from the DataNode, and if validation fails, the client tries other replicas
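The per-512-byte checksumming described here can be sketched with Python's `zlib.crc32`; this is only an illustration of the idea, not the HDFS wire format:

```python
# Sketch of per-chunk CRC32 checksumming: compute a checksum for every
# 512-byte chunk at write time, then re-verify the chunks on read.
import zlib

CHUNK = 512

def compute_checksums(data: bytes):
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, checksums):
    return compute_checksums(data) == checksums

block = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
sums = compute_checksums(block)
print(verify(block, sums))             # True
corrupted = b"\xff" + block[1:]        # flip the first byte
print(verify(corrupted, sums))         # False
```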
    40. NameNode Failure: a single point of failure; the transaction log is stored in multiple directories – a directory on the local file system and a directory on a remote file system (NFS/CIFS); a real HA solution still needs to be developed
    41. Data Pipelining: the client retrieves a list of DataNodes on which to place replicas of a block; the client writes the block to the first DataNode; the first DataNode forwards the data to the next DataNode in the pipeline; when all replicas are written, the client moves on to write the next block in the file
    42. Rebalancer – goal: the % disk full should be similar across DataNodes; usually run when new DataNodes are added; the cluster stays online while the Rebalancer is active; the Rebalancer is throttled to avoid network congestion; it is a command-line tool
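The rebalancer's goal can be illustrated with a toy simulation that moves blocks from the fullest DataNode to the emptiest until utilization is within a threshold; the real tool also respects replica placement rules and throttles transfers, which this sketch ignores:

```python
# Toy sketch of the rebalancer's goal: equalize % disk full across DataNodes.
# Simplified: all blocks are the same size, placement rules are ignored.

def rebalance(nodes, capacity, threshold=0.1):
    """nodes: dict name -> block count; capacity: blocks each node can hold."""
    while True:
        util = {n: c / capacity for n, c in nodes.items()}
        fullest = max(util, key=util.get)
        emptiest = min(util, key=util.get)
        if util[fullest] - util[emptiest] <= threshold:
            return nodes
        nodes[fullest] -= 1   # move one block from the fullest node...
        nodes[emptiest] += 1  # ...to the emptiest node

result = rebalance({"dn1": 90, "dn2": 10, "dn3": 20}, capacity=100)
print(result)  # utilization gap is now within the 10% threshold
```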
    43. What is MapReduce? A simple data-parallel programming model designed for scalability and fault tolerance; pioneered by Google (which processes 20 petabytes of data per day); popularized by the open-source Hadoop project (used at Yahoo!, Facebook, Amazon, …)
    44. MapReduce Data Flow
    45. Execution
    46. Parallel Execution
    47. Example: Word Count Process
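The word-count process on this slide can be sketched in-process: map emits a (word, 1) pair for every word, the shuffle groups pairs by word, and reduce sums each group. The input lines are invented, and the shuffle loop stands in for the framework's grouping:

```python
# Minimal word-count sketch following the map -> shuffle -> reduce phases.
from collections import defaultdict

lines = ["hadoop map reduce", "map reduce", "hadoop hadoop"]

# Map phase: one (word, 1) pair per word in each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: sum the counts for each word.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'hadoop': 3, 'map': 2, 'reduce': 2}
```

On a cluster, each phase would run in parallel over partitions of the input, but the per-key logic is exactly this.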
    48. Hadoop Environment