How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
1. Amr Awadallah CTO, Cloudera, Inc. August 5, 2009 How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
3. Our Older Systems Limited Raw Data Access. Diagram labels: Instrumentation; Collection; Storage Farm for Unstructured Data (20TB/day); ETL Grid; RDBMS (200GB/day); BI / Reports; Mostly Append; Ad hoc Queries & Data Mining; Non-Consumption; Filer heads are a bottleneck.
6. The Solution: A Store-Compute Grid. Diagram labels: Instrumentation; Collection; Storage + Computation; Mostly Append; ETL and Aggregations; RDBMS; Interactive Apps; “Batch” Apps; Ad hoc Queries & Data Mining.
10. HDFS: Hadoop Distributed File System. Block Size = 64MB; Replication Factor = 3; Cost/GB is a few ¢/month vs $/month.
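To make the block size and replication figures on this slide concrete, here is a minimal Python sketch of the arithmetic. The function name `hdfs_footprint` and the 1 GB example file are illustrative assumptions, not part of the talk; the sketch assumes that the final partial block only occupies as much space as the data it holds.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for one file.

    Hypothetical helper for illustration only; it is not an HDFS API.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # Assumes the last block is not padded, so raw usage is data size x replication.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


# A hypothetical 1 GB (1024 MB) file: 16 blocks, 48 block replicas, 3072 MB of raw disk.
print(hdfs_footprint(1024))
```

Raising the block size to 128MB would halve the number of blocks the Name Node has to track for the same file, while the raw disk usage stays the same.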
12. MapReduce Example for Word Count
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
Map (in_key, in_value) => list of (out_key, intermediate_value)
Reduce (out_key, list of intermediate_values) => out_value(s)
Data flow: Split 1 .. Split N (docid, text) -> Map 1 .. Map M -> (words, counts) -> Shuffle -> (sorted words, counts) -> Reduce 1 .. Reduce R -> Output File 1 .. Output File R (sorted words, sum of counts)
Example, from “To Be Or Not To Be?”: the partial counts Be, 5; Be, 12; Be, 7; Be, 6 are reduced to Be, 30.
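The Map, Shuffle, and Reduce signatures on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop's API: the function names (`map_fn`, `shuffle`, `reduce_fn`, `word_count`) and the punctuation handling are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(docid, text):
    """Map: (docid, text) => list of (word, 1)."""
    for token in text.lower().split():
        word = token.strip("?.,!\"'")  # crude punctuation cleanup, an assumption
        if word:
            yield word, 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(word, counts):
    """Reduce: (word, list of counts) => (word, sum of counts)."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle, and reduce in sequence over a dict of docid -> text."""
    intermediate = [
        pair for docid, text in docs.items() for pair in map_fn(docid, text)
    ]
    return dict(reduce_fn(w, c) for w, c in shuffle(intermediate).items())


print(word_count({"doc1": "To Be Or Not To Be?"}))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real cluster the map calls run in parallel over the input splits and the shuffle moves data between machines; here everything runs in one process so the data flow is easy to follow.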
24. Hadoop High-Level Architecture
Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
Name Node: Maintains mapping of file blocks to data node slaves
Job Tracker: Schedules jobs across task tracker slaves
Data Node: Stores and serves blocks of data
Task Tracker: Runs tasks (work units) within a job
Data Node and Task Tracker share a physical node
Editor's Notes
Data-As-Product is also referred to as Active DW, Operational BI, Online BI, etc.
The solution is to *augment* the current RDBMSes with a “smart” storage/processing system. The original event level data is kept in this smart storage layer and can be mined as needed. The aggregate data is kept in the RDBMSes for interactive reporting and analytics.
The system is self-healing in the sense that it automatically routes around failure: if a node fails, its workload and data are transparently shifted somewhere else. The system is intelligent in the sense that the MapReduce scheduler tries to run processing on the same node that stores the associated data (or on a node attached to the same leaf Ethernet switch); it also speculatively executes redundant tasks if certain nodes are detected to be slow.
One of the key benefits of Hadoop is the ability to upload any unstructured files to it without having to “schematize” them first. You can dump any type of data into Hadoop, and the input record readers will then abstract it as if it were structured (i.e. schema on read vs. on write).
Open source software allows for innovation by partners and customers. It also enables third-party inspection of the source code, which provides assurances on security and product quality.
1 HDD = 75 MB/sec, 1000 HDDs = 75 GB/sec; the “head of fileserver” bottleneck is eliminated.
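The disk throughput figures in the note above can be worked through with a short calculation. This is a best-case sketch under stated assumptions: decimal units (1 GB = 1000 MB), perfectly parallel reads, and no network, CPU, or replication overhead. The function names are hypothetical, and the 20 TB input is borrowed from the daily volume quoted on the "Our Older Systems" slide.

```python
HDD_MB_PER_SEC = 75  # per-disk sequential throughput quoted in the note


def aggregate_gb_per_sec(num_disks):
    """Ideal combined read throughput in GB/sec when all disks are read in parallel."""
    return num_disks * HDD_MB_PER_SEC / 1000


def scan_seconds(data_tb, num_disks):
    """Ideal time in seconds to scan data_tb terabytes spread evenly over num_disks."""
    data_mb = data_tb * 1_000_000
    return data_mb / (num_disks * HDD_MB_PER_SEC)


print(aggregate_gb_per_sec(1000))          # 75.0 GB/sec, matching the note
print(round(scan_seconds(20, 1000)))       # about 267 seconds for 20 TB on 1000 disks
print(round(scan_seconds(20, 1) / 86400, 1))  # about 3.1 days on a single disk
```

Real jobs will be slower than these ideal numbers, but the comparison shows why spreading both the data and the computation over many disks removes the single file-server head as the limiting factor.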
Speculative Execution, Data rebalancing, Background Checksumming, etc.
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node has eight cores, 16GB of RAM, and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is the RDBMS, which executes the queries, and SQL, which is the language for the queries. MapReduce can run on top of HDFS or a selection of other storage systems. Intelligent scheduling algorithms for locality, sharing, and resource optimization.
Think: SELECT word, count(*) FROM documents GROUP BY word. Check out ParBASH: http://cloud-dev.blogspot.com/2009/06/introduction-to-parbash.html
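The SQL analogy in the note above can be tried directly with Python's built-in `sqlite3` module. The `documents` table layout (one row per word occurrence) and the sample sentence are assumptions made for this sketch, and an `ORDER BY` clause is added to the note's query only so that the output order is deterministic.

```python
import sqlite3

# In-memory database with one row per word occurrence (an assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (docid TEXT, word TEXT)")

text = "to be or not to be"
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("doc1", word) for word in text.split()],
)

# GROUP BY plays the role of the shuffle; count(*) plays the role of the reducer.
rows = conn.execute(
    "SELECT word, count(*) FROM documents GROUP BY word ORDER BY word"
).fetchall()
print(rows)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
conn.close()
```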
Other uses like face recognition, document discovery, OCR, gene sequence alignment, etc.
Data Mining:
** Search and Text Analytics
** Clustering/Categorization
** Modeling/Machine Learning
** Optimization/Operations Research
** Response Prediction/Forecasting
** Simulation, Monte-Carlo like
** Random Walks of Connectivity Graphs
HBase: Low Latency Random-Access with per-row consistency for updates/inserts/deletes
First bullet is like assembly, then it gets higher level from there.
Query: SELECT, FROM, WHERE, JOIN, GROUP BY, SORT BY, LIMIT, DISTINCT, UNION ALL
Join: LEFT, RIGHT, FULL, OUTER, INNER
DDL: CREATE TABLE, ALTER TABLE, DROP TABLE, DROP PARTITION, SHOW TABLES, SHOW PARTITIONS
DML: LOAD DATA INTO, FROM INSERT
Types: TINYINT, INT, BIGINT, BOOLEAN, DOUBLE, STRING, ARRAY, MAP, STRUCT, JSON OBJECT
Query: Subqueries in FROM, User Defined Functions, User Defined Aggregates, Sampling (TABLESAMPLE)
Relational: IS NULL, IS NOT NULL, LIKE, REGEXP
Built in aggregates: COUNT, MAX, MIN, AVG, SUM
Built in functions: CAST, IF, REGEXP_REPLACE, …
Other: EXPLAIN, MAP, REDUCE, DISTRIBUTE BY
List and Map operators: array[i], map[k], struct.field
Hadoop is good for storing and processing large amounts of unstructured or structured data in batch form (i.e. full table scans). Hadoop with HBase (or Hypertable) can do inserts/updates/deletes with reasonable interactive response times (also see Cassandra).
Sports car is refined, accelerates very fast, and has a lot of addons/features. But it is pricey on a per bit basis and is expensive to maintain. Cargo train is rough, missing a lot of functionality, slow to start, but once it gets going it can carry a lot of stuff very economically.
Hadoop is efficient on a cost basis. Security: Need better integration with systems like LDAP or Kerberos. Also need better isolation against malicious users, though auditing can potentially catch those.
The Data Node slave and the Task Tracker slave can, and should, share the same server instance to leverage data locality whenever possible.
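Why co-locating the Data Node and Task Tracker helps can be illustrated with a toy scheduling rule. This is a simplified sketch and not the actual Job Tracker algorithm; the function `pick_task`, the node names, and the block placement below are all hypothetical.

```python
def pick_task(free_node, pending_tasks, block_locations):
    """Pick a task for a node that has a free slot.

    Prefers a task whose input block is stored on free_node (node-local);
    otherwise falls back to the first pending task, which must then read
    its block over the network (remote).
    """
    for task in pending_tasks:
        if free_node in block_locations[task]:
            return task, "node-local"
    if pending_tasks:
        return pending_tasks[0], "remote"
    return None, None


# Hypothetical placement: each task's input block lives on three data nodes.
block_locations = {
    "t1": {"n1", "n2", "n3"},
    "t2": {"n2", "n4", "n5"},
}

print(pick_task("n4", ["t1", "t2"], block_locations))  # ('t2', 'node-local')
print(pick_task("n9", ["t1", "t2"], block_locations))  # ('t1', 'remote')
```

With a replication factor of 3, each block gives the scheduler three candidate nodes for a node-local assignment, which is part of why replication helps throughput as well as fault tolerance.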