Data-As-Product is also referred to as Active DW, Operational BI, Online BI, etc.
The solution is to *augment* the current RDBMSes with a “smart” storage/processing system. The original event level data is kept in this smart storage layer and can be mined as needed. The aggregate data is kept in the RDBMSes for interactive reporting and analytics.
The system is self-healing in the sense that it automatically routes around failure: if a node fails, its workload and data are transparently shifted somewhere else.
The system is intelligent in the sense that the MapReduce scheduler optimizes for the processing to happen on the same node that stores the associated data (or on a node co-located on the same leaf Ethernet switch). It also speculatively executes redundant copies of tasks when certain nodes are detected to be slow.
One of the key benefits of Hadoop is the ability to upload any unstructured files to it without having to “schematize” them first. You can dump any type of data into Hadoop, and the input record readers will abstract it out as if it were structured (i.e. schema on read vs. schema on write; see the sketch below).
Open source software allows for innovation by partners and customers. It also enables third-party inspection of the source code, which provides assurances on security and product quality.
1 HDD = 75 MB/sec, while 1,000 HDDs in parallel = 75 GB/sec; the “head of fileserver” bottleneck is eliminated.
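To make schema-on-read concrete, here is a minimal Java mapper sketch: the file sitting in HDFS is just raw text, and structure is imposed at read time by parsing each line. The tab-separated layout (timestamp, user, url) and the page-view counting are hypothetical examples, not part of the original deck.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Schema-on-read: no schema was declared when the raw log files were
    // uploaded; the mapper imposes structure while reading each line.
    public class LogParseMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Hypothetical layout: timestamp \t user \t url
        String[] fields = line.toString().split("\t");
        if (fields.length < 3) return; // skip malformed records rather than fail
        context.write(new Text(fields[2]), ONE); // e.g. count page views per URL
      }
    }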
Speculative Execution, Data rebalancing, Background Checksumming, etc.
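As a small illustration, speculative execution is tunable per job through ordinary configuration properties; this sketch uses the classic MR1-era property names, which differ in newer Hadoop releases, so treat the names as assumptions:

    import org.apache.hadoop.conf.Configuration;

    // Sketch: toggling speculative execution through classic (MR1-era)
    // configuration properties; property names vary across Hadoop versions.
    public class SpeculationConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Launch redundant attempts of straggling map and reduce tasks.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
        System.out.println(conf.get("mapred.map.tasks.speculative.execution"));
      }
    }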
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node is eight cores with 16GB RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
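A minimal sketch of writing a file with an explicit replication factor of 3 and a 128MB block size through the public FileSystem API; the NameNode address and file path are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address (classic property name).
        conf.set("fs.default.name", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        short replication = 3;               // each block on 3 data nodes
        long blockSize = 128L * 1024 * 1024; // 128MB blocks
        FSDataOutputStream out = fs.create(
            new Path("/data/events/part-00000"), true, 4096, replication, blockSize);
        out.write("example event record\n".getBytes("UTF-8"));
        out.close();
      }
    }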
Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is to an RDBMS, which executes the queries, and SQL, which is the language for expressing the queries. MapReduce can run on top of HDFS or a selection of other storage systems. Intelligent scheduling algorithms optimize for locality, sharing, and resource utilization.
Think: SELECT word, count(*) FROM documents GROUP BY word. Check out ParBASH: http://cloud-dev.blogspot.com/2009/06/introduction-to-parbash.html
Other uses: face recognition, document discovery, OCR, gene sequence alignment, etc.
Data Mining:
  - Search and Text Analytics
  - Clustering/Categorization
  - Modeling/Machine Learning
  - Optimization/Operations Research
  - Response Prediction/Forecasting
  - Simulation (Monte Carlo-like)
  - Random Walks of Connectivity Graphs
HBase: Low Latency Random-Access with per-row consistency for updates/inserts/deletes
First bullet is like assembly, then it gets higher level from there.
Query: SELECT, FROM, WHERE, JOIN, GROUP BY, SORT BY, LIMIT, DISTINCT, UNION ALL
Join: LEFT, RIGHT, FULL OUTER, INNER
DDL: CREATE TABLE, ALTER TABLE, DROP TABLE, DROP PARTITION, SHOW TABLES, SHOW PARTITIONS
DML: LOAD DATA INTO, FROM INSERT
Types: TINYINT, INT, BIGINT, BOOLEAN, DOUBLE, STRING, ARRAY, MAP, STRUCT, JSON OBJECT
Query: Subqueries in FROM, User-Defined Functions, User-Defined Aggregates, Sampling (TABLESAMPLE)
Relational: IS NULL, IS NOT NULL, LIKE, REGEXP
Built-in aggregates: COUNT, MAX, MIN, AVG, SUM
Built-in functions: CAST, IF, REGEXP_REPLACE, …
Other: EXPLAIN, MAP, REDUCE, DISTRIBUTE BY
List and Map operators: array[i], map[k], struct.field
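As a usage sketch, HiveQL can be issued from Java through Hive's JDBC driver. The driver class and URL below are the classic pre-HiveServer2 ones, and the table (docs) and column (word) are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
      public static void main(String[] args) throws Exception {
        // Classic Hive JDBC driver; newer deployments use HiveServer2
        // (org.apache.hive.jdbc.HiveDriver with jdbc:hive2:// URLs).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT word, count(*) FROM docs GROUP BY word"); // hypothetical table
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
      }
    }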
Hadoop is good for storing and processing large amounts of unstructured or structured data in batch form (i.e. full table scans). Hadoop with HBase (or Hypertable) can do inserts/updates/deletes with reasonable interactive response times (also see Cassandra).
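A minimal sketch of that random-access path from Java, using the classic HTable client API; the table name, column family, and row key are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users"); // hypothetical table

        // Insert/update a single row (atomic per row).
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
                Bytes.toBytes("user42@example.com"));
        table.put(put);

        // Random read of the same row.
        Result row = table.get(new Get(Bytes.toBytes("user42")));
        System.out.println(Bytes.toString(
            row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
        table.close();
      }
    }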
The sports car is refined, accelerates very fast, and has a lot of add-ons/features, but it is pricey on a per-bit basis and expensive to maintain. The cargo train is rough, missing a lot of functionality, and slow to start, but once it gets going it can carry a lot of stuff very economically.
Hadoop is efficient on a cost basis. Security: needs better integration with systems like LDAP or Kerberos, and better isolation against malicious users, though auditing can potentially catch them.
The Data Node slave and the Task Tracker slave can, and should, share the same server instance to leverage data locality whenever possible.
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
Amr Awadallah, CTO, Cloudera, Inc.
August 5, 2009
Our Older Systems Limited Raw Data Access
[Slide diagram: Instrumentation → Collection → Storage Farm for Unstructured Data (20TB/day, mostly append) → ETL Grid → RDBMS (200GB/day) → BI / Reports and Ad hoc Queries & Data Mining. Annotations: the raw data largely sits in non-consumption; filer heads are a bottleneck.]
We encountered data errors that required reprocessing, which could happen a long time after the fact. “Tape data” was cost-prohibitive to reprocess, so we needed to retain raw data online for long periods of time.
Conversion of data from raw format to conformed dimensions causes some information loss. We needed access to the original data to recover lost information whenever needed (e.g., a new browser user agent).
Shrinking ETL Window
The storage filers for raw data started becoming a significant bottleneck as large amounts of data needed to be copied to the ETL grid for processing (e.g. 30 hours to process a day’s worth of data)
Ad Hoc Queries on Raw Data
We wanted to run ad hoc queries against the original raw event data, but the storage filers can only store data, not compute on it.
MapReduce Example for Word Count

cat *.txt | mapper.pl | sort | reducer.pl > out.txt

[Slide diagram: the input is divided into splits (Split 1 … Split i … Split N). Each split feeds a map task (Map 1 … Map i … Map M) that consumes (docid, text) records and emits (word, count) pairs. The shuffle phase sorts the pairs and routes them to reduce tasks (Reduce 1 … Reduce i … Reduce R), each of which writes an output file of (sorted words, sum of counts). Example: for the input “To Be Or Not To Be?”, the partial counts Be, 5 / Be, 12 / Be, 7 / Be, 6 are summed to Be, 30.]

Map: (in_key, in_value) => list of (out_key, intermediate_value)
Reduce: (out_key, list of intermediate_values) => out_value(s)
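The same word count in Hadoop's Java MapReduce API, as a minimal sketch: the mapper plays the role of mapper.pl, the framework's shuffle/sort replaces sort, and the reducer plays the role of reducer.pl.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          StringTokenizer it = new StringTokenizer(line.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE); // emit (word, 1) per occurrence
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get(); // e.g. Be: 5+12+7+6 = 30
          context.write(word, new IntWritable(sum));
        }
      }
    }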
Hadoop is a data grid operating system that augments current BI systems and improves their agility by providing an economically scalable solution for storing and processing large amounts of unstructured data over long periods of time.
Hadoop High-Level Architecture

Hadoop Client: contacts the Name Node for data, or the Job Tracker to submit jobs
Name Node: maintains the mapping of file blocks to Data Node slaves
Job Tracker: schedules jobs across the Task Tracker slaves
Data Node: stores and serves blocks of data
Task Tracker: runs tasks (work units) within a job
The Data Node and Task Tracker slaves share the same physical node.
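Tying the pieces together, a driver sketch for the client side of this picture, reusing the TokenizerMapper and SumReducer sketched earlier: submitting the job hands it to the Job Tracker, which schedules its tasks onto Task Trackers, ideally on the Data Nodes that already hold the input blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // classic constructor; newer code uses Job.getInstance
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // local pre-aggregation on map nodes
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }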