Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. It allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into their existing data infrastructure.
Current Architecture (Build)
In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high-speed transactional reading and writing.
Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this.
So architects introduced another database, called a data warehouse (you may also hear about data marts or operational data stores, or ODS), that was optimized for answering user queries.
The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it, such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating, and then Loaded it into the data warehouse. For short, we call this ETL.
As it matured, the data warehouse incorporated additional data sources.
Since the data warehouse was typically a very powerful database, some organizations also began performing some transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
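To make the ETL vs. ELT distinction concrete, here is a minimal Java/JDBC sketch; the connection URL, table names, and the cleansing steps are all hypothetical. In the ETL style the integration tool transforms records before loading them; in the ELT style raw rows sit in a staging table and the warehouse does the transformation in SQL.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class EtlVsElt {

        // ETL style: transform/cleanse in the integration tool, load only finished rows.
        static void etlLoad(Connection dw, Iterable<String[]> rawRows) throws Exception {
            PreparedStatement insert =
                dw.prepareStatement("INSERT INTO sales_fact (sale_date, amount) VALUES (?, ?)");
            for (String[] row : rawRows) {
                insert.setDate(1, java.sql.Date.valueOf(row[0].trim()));   // parse/validate
                insert.setBigDecimal(2, new java.math.BigDecimal(row[1])); // cleanse/convert
                insert.addBatch();
            }
            insert.executeBatch();
        }

        // ELT style: raw rows were bulk-loaded into a staging table; the warehouse transforms them.
        static void eltLoad(Connection dw) throws Exception {
            Statement stmt = dw.createStatement();
            stmt.execute("INSERT INTO sales_fact (sale_date, amount) "
                       + "SELECT CAST(sale_date AS DATE), CAST(amount AS DECIMAL(12,2)) "
                       + "FROM sales_staging WHERE amount IS NOT NULL");
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical warehouse connection for illustration only.
            Connection dw = DriverManager.getConnection("jdbc:postgresql://dw-host/warehouse", "etl", "secret");
            eltLoad(dw);
            dw.close();
        }
    }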
Issues (Build)
As data volumes and business complexity grow, ETL and ELT processing is unable to keep up, and critical business windows are missed. Databases are designed to load and query data, not transform it. Transforming data in the database consumes valuable CPU, making queries run slower.
Solution (Build)
Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high-value cleansed, conformed data. Reclaim valuable EDW capacity for the high-value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
Conventional databases are expensive to scale as data volumes grow. Therefore, most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is no longer in the warehouse, so business users cannot benefit from it.
Bank of America
A multinational bank saves millions by optimizing its EDW for analytics and reducing data storage costs by 99%.
Background: A multinational bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing, and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates have ballooned.
Challenge: The bank wanted to make effective use of all the data being generated, but its Teradata system quickly became maxed out. It could no longer handle current workloads, and the bank's business-critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata.
Solution: The bank deployed Cloudera to offload data processing, storage, and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics.
Results: By offloading data processing and storage onto Cloudera, which runs on industry-standard hardware, the bank avoided spending millions to expand its Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10 TB of data every day.
This is a simple example, but close to how a number of companies are using Hadoop now.
The full history of user browsing is stored in web logs. This is semi-structured data.
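To make "semi-structured" concrete, here is a small Java sketch that pulls a few fields out of one raw log line; the regular expression and the sample line assume the common Apache combined log format and are purely illustrative.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogLineParser {
        // Rough pattern for an Apache combined-format line: IP, timestamp, request, status, bytes.
        private static final Pattern LOG =
            Pattern.compile("^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+).*$");

        public static void main(String[] args) {
            String line = "10.0.0.1 - - [01/Feb/2013:10:15:32 -0800] \"GET /products/42 HTTP/1.1\" 200 5123 \"-\" \"Mozilla/5.0\"";
            Matcher m = LOG.matcher(line);
            if (m.matches()) {
                System.out.println("ip="     + m.group(1));
                System.out.println("time="   + m.group(2));
                System.out.println("method=" + m.group(3));
                System.out.println("url="    + m.group(4));
                System.out.println("status=" + m.group(5));
            }
        }
    }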
Most companies aren’t going to store raw logs in their DWH because of the expense and the low value of much of the data. This goes back to the ROI discussion: this data might have value in aggregate, but it may be very difficult to justify storing it in the typical data warehouse.
This is a very quick overview and glosses over many of the capabilities and functionality offered by Flume. This describes Flume 1.3, or “Flume NG”.
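For a small taste of what pushing events into Flume NG looks like from application code, here is a hedged Java sketch using Flume's client SDK; the host, port, and the agent setup they imply (an Avro source feeding, say, an HDFS sink) are assumptions for illustration, and the agent itself would be configured separately.

    import java.nio.charset.Charset;

    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeLogSender {
        public static void main(String[] args) throws Exception {
            // Connects to a Flume agent whose Avro source listens on flume-host:41414 (hypothetical).
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
            try {
                Event event = EventBuilder.withBody(
                        "10.0.0.1 - - [01/Feb/2013:10:15:32 -0800] \"GET / HTTP/1.1\" 200 512",
                        Charset.forName("UTF-8"));
                client.append(event); // delivered to the agent's channel, then on to its sink
            } finally {
                client.close();
            }
        }
    }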
The client executes a Sqoop job. Sqoop interrogates the database for column names, types, etc. Based on the extracted metadata, Sqoop generates source code for a table class and then kicks off a MapReduce job. This table class can be used for processing the extracted records. By default, Sqoop will guess at a column to use for splitting the data for distribution across the cluster; this can also be specified by the client.
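As a rough sketch of the client side of this flow (the JDBC URL, credentials, table, and split column below are all hypothetical), a Sqoop 1 import can be driven programmatically with the same arguments you would pass on the command line:

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // Equivalent to running `sqoop import ...` from the shell; all values are illustrative.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://db-host/sales",
                "--username", "etl",
                "--password", "secret",
                "--table", "orders",
                "--split-by", "order_id",          // column used to split the import across mappers
                "--target-dir", "/data/raw/orders",
                "--num-mappers", "4"
            };
            int exitCode = Sqoop.runTool(importArgs);
            System.exit(exitCode);
        }
    }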
It should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
This works well, and is representative of how most companies are doing these types of tasks now.
Very few database/ETL developers have Java or similar backgrounds. Many organizations do, however, have ETL and SQL developers who are familiar with common tools such as Informatica.
Pentaho also has integration with NoSQL databases (MongoDB, Cassandra, etc.).
Pentaho orchestrates the entire flow. Ratings data is ingested via a PDI job. Reference data is pre-processed (combined, cleansed, etc.) and then copied into HDFS. Pentaho MapReduce is then used to do extensive transformations, such as joins and aggregations, to create the final data sets that drive the analysis. The resulting data sets are loaded into Hive, and Hive queries drive the analysis and reporting. All processing, reporting, etc. in this example is performed in Hadoop.
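Behind a step like "joins and aggregations" is ordinary MapReduce. As a rough hand-written Java equivalent of one such aggregation step (the input layout, field positions, and paths are assumptions, not the actual PDI-generated job), here is a job that computes the average rating per item:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AverageRating {

        // Input lines assumed to look like: userId,itemId,rating
        public static class RatingMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length == 3) {
                    context.write(new Text(fields[1]), new DoubleWritable(Double.parseDouble(fields[2])));
                }
            }
        }

        public static class AverageReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text itemId, Iterable<DoubleWritable> ratings, Context context)
                    throws IOException, InterruptedException {
                double sum = 0;
                long count = 0;
                for (DoubleWritable rating : ratings) {
                    sum += rating.get();
                    count++;
                }
                context.write(itemId, new DoubleWritable(sum / count));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "average rating per item");
            job.setJarByClass(AverageRating.class);
            job.setMapperClass(RatingMapper.class);
            job.setReducerClass(AverageReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/raw/ratings"));
            FileOutputFormat.setOutputPath(job, new Path("/data/out/avg_rating"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }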
This provides an example of transforming raw input data into final records through the Pentaho UI.
That output then drives a number of reports and visualizations.
This is not a promotion for Informatica, but an example of how the largest enterprise vendors are adapting their products for Hadoop. It also shows out-of-cluster transformations.
It uses the same interface as existing PowerCenter. Transformations are converted to HQL. Existing Informatica jobs can be reused with Hadoop. It also provides data profiling, data lineage, etc.
Most of these tools integrate with existing data stores using the ODBC standard.
MSTR and Tableau are now tested and certified with the Cloudera driver, but other standard ODBC-based tools should also work, and more integrations will be supported soon.
JDBC/ODBC support: HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
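HiveServer2 addresses these gaps and exposes a standard JDBC interface. A minimal Java sketch of connecting to it follows; the host, database, user, and table are assumptions, and the hive-jdbc driver is assumed to be on the classpath (10000 is the default HiveServer2 port).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "analyst", "");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }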
Performing queries in Hive is basically the equivalent of a full table scan in a standard database. Not a good fit with most BI tools.
Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. It is also the system that I’m familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so it is well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
Each impalad is composed of three components: the planner, the coordinator, and the execution engine. The state store daemon isn’t shown here, but it maintains information on the Impala daemons running in the system.
Queries get sent to a single impalad, which is different from the HiveServer architecture.
Changes in CDH4 allow for short-circuit reads, which let impalads read directly from the file system rather than going through the DataNodes. Another change allows Impala to know which disk data blocks are on.
Impala makes it more practical to perform analysis with popular BI tools. You can now do exploratory analysis and quickly generate reports and visualizations with common tools. It integrates with MSTR, QlikView, Pentaho, etc.
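Since the BI connectors talk to an impalad over ODBC/JDBC, the same pattern shown earlier for Hive works here too. A minimal sketch, assuming a release that exposes the HiveServer2-compatible port (commonly 21050) and an unsecured cluster (hence ;auth=noSasl); the host, table, and query are illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Any impalad in the cluster can act as the query coordinator.
            Connection conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT referrer, COUNT(*) AS cnt FROM page_views GROUP BY referrer ORDER BY cnt DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }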
The data module provides logical abstractions on top of storage subsystems (e.g. HDFS) that let users think and operate in terms of records, datasets, and dataset repositories.
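To make "records, datasets, and dataset repositories" concrete, here is a purely illustrative Java sketch of the shape such an API takes; these interface and method names are stand-ins for the idea, not the data module's actual classes or signatures.

    // Illustrative sketch only: stand-ins for the abstractions the data module exposes.

    /** A collection of named datasets stored in some subsystem such as HDFS. */
    interface DatasetRepository {
        Dataset create(String name, String avroSchemaJson); // define a dataset by name + record schema
        Dataset load(String name);                          // open an existing dataset
        void drop(String name);                             // remove a dataset and its data
    }

    /** A named collection of records with a schema, independent of file layout. */
    interface Dataset {
        DatasetWriter newWriter();  // append records
        DatasetReader newReader();  // scan records
    }

    interface DatasetWriter extends AutoCloseable {
        void write(Object record); // a record, e.g. an Avro GenericRecord
    }

    interface DatasetReader extends AutoCloseable, Iterable<Object> {
    }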