Hadoop and Hive in Enterprises
Presentation by Mark Grover, delivered at San Jose State University, on how Hadoop and Hive are currently being leveraged in enterprises.

  • Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components: HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. This allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
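As a concrete sketch of that inexpensive, constraint-free storage (assuming a running cluster; the paths and file names here are hypothetical), landing raw data in HDFS is a couple of shell commands:

```bash
# Land raw, schema-less data in HDFS exactly as it arrived.
hadoop fs -mkdir -p /data/raw/weblogs/2013-11-25
hadoop fs -put /var/log/httpd/access_log /data/raw/weblogs/2013-11-25/

# Verify the upload; HDFS replicates blocks across commodity nodes.
hadoop fs -ls /data/raw/weblogs/2013-11-25
```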
  • Current Architecture: In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high-speed transactional reading and writing. Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this. So architects introduced another database, called a data warehouse (you may also hear about data marts or operational data stores, ODS), that was optimized for answering user queries. The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it (such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating) and then Loaded it into the data warehouse; for short, we call this ETL. As it matured, the data warehouse incorporated additional data sources. Since the data warehouse was typically a very powerful database, some organizations also began performing some transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
  • Issues: As data volumes and business complexity grow, ETL and ELT processing is unable to keep up, and critical business windows are missed. Databases are designed to load and query data, not transform it; transforming data in the database consumes valuable CPU, making queries run slower.
  • Solution: Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high-value cleansed, conformed data. Reclaim valuable EDW capacity for the high-value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
  • Conventional databases are expensive to scale as data volumes grow. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it.
  • Bank of America: A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%. Background: the bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing, and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates has ballooned. Challenge: the bank wanted to make effective use of all the data being generated, but their Teradata system quickly became maxed out. It could no longer handle current workloads, and the bank's business-critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata. Solution: the bank deployed Cloudera to offload data processing, storage, and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics. Results: by offloading data processing and storage onto Cloudera, which runs on industry-standard hardware, the bank avoided spending millions to expand their Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
  • This is a very quick overview and glosses over many of the capabilities and functionality offered by Flume. It describes Flume 1.3, a.k.a. "Flume NG".
  • The client executes a Sqoop job. Sqoop interrogates the database for column names, types, etc. Based on the extracted metadata, Sqoop generates source code for a table class and then kicks off a MapReduce job; this table class can be used for processing the extracted records. By default, Sqoop will guess at a column for splitting the data for distribution across the cluster; this can also be specified by the client.
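To make the code-generation step concrete, here is a hedged sketch using Sqoop's codegen tool, which performs just the metadata interrogation and table-class generation without running an import; the connection string, credentials, and table name are hypothetical:

```bash
# Ask Sqoop to inspect a hypothetical "orders" table over JDBC and
# generate the Java table class it would use during an import.
sqoop codegen \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --outdir /tmp/sqoop-codegen
```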
  • Should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
  • Most of these tools integrate to existing data stores using the ODBC standard.
  • MicroStrategy (MSTR) and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC-based tools should also work, and more integrations will be supported soon.
  • JDBC/ODBC support: the HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
  • Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. It is also the system that I'm familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so they are well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
  • Impala daemons (impalad) are composed of 3 components: planner, coordinator, and execution engine. The State Store daemon isn't shown here, but it maintains information on the Impala daemons running in the system.

Hadoop and Hive in Enterprises: Presentation Transcript

  • Hive in Enterprises. Mark Grover, Software Engineer, Cloudera (@mark_grover). Prasad Mujumdar, Apache Hive Committer, Software Engineer, Cloudera. November 25th, 2013.
  • What We Will Be Talking About: integration of Hive and Hadoop in enterprises (current challenges; how Hadoop is being leveraged with existing data infrastructures), and other tools and features in and around Hive (authentication and authorization, BI tools, user interface).
  • What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed. Core Hadoop system components: the Hadoop Distributed File System (HDFS), self-healing, high-bandwidth clustered storage; and MapReduce, a distributed computing framework that excels at processing complex data. Has the flexibility to store and mine any type of data: ask questions across structured and unstructured data that were previously impossible to ask or solve; not bound by a single schema. Scales economically: a scale-out architecture divides workloads across multiple nodes; can be deployed on commodity hardware; a flexible file system eliminates ETL bottlenecks; the open source platform guards against vendor lock-in.
  • Current Challenges: Limitations of Existing Data Management Systems.
  • The Transforming of Transformation. [Diagram: enterprise applications (OLTP) feed an Extract-Transform-Load pipeline into the data warehouse and ODS; business intelligence tools query the warehouse, and some transformation also runs inside the warehouse itself.]
  • Volume, Velocity, Variety Cause Capacity Problems. [Diagram: the same ETL architecture, annotated with two pain points: (1) slow data transformations = missed ETL SLAs; (2) slow queries = frustrated business users.]
  • Economics: Return on Byte. Return on Byte (ROB) = value of data / cost of storing data. High-ROB data contrasts with low-ROB data (but the latter still holds a ton of aggregate value).
  • Data Warehouse Optimization. [Diagram: enterprise applications (OLTP) and the ODS feed ETL into Hadoop, which stores raw data and runs the transformations; the high-$/byte data warehouse receives the cleansed data and serves business intelligence queries, and Hadoop can be queried as well.]
  • Data Warehouse Optimization (continued). [Diagram: the same flow, with Hadoop handling ETL, storage, transformation, and query.]
  • The Key Benefit: Agility/Flexibility. Schema-on-Write (RDBMS): prescriptive data modeling; create a static DB schema; transform data into RDBMS format; query data in RDBMS format; new columns must be added explicitly before new data can propagate into the system; good for Known Unknowns (repetition). Schema-on-Read (Hadoop): descriptive data modeling; copy data in its native format; create schema + parser; query data in its native format; new data can start flowing any time and will appear retroactively once the schema/parser properly describes it; good for Unknown Unknowns (exploration).
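A minimal sketch of schema-on-read in Hive, assuming tab-delimited logs already sit in HDFS under a hypothetical path; the table is just a schema plus parser laid over the raw files, which stay in their native format:

```bash
# Project a schema onto raw files already in HDFS; nothing is copied or transformed.
hive -e "
CREATE EXTERNAL TABLE weblogs (
  ip     STRING,
  ts     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- Query the data in place; redefining the table later re-describes the same files.
SELECT status, COUNT(*) FROM weblogs GROUP BY status;
"
```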
  • Not Just Transformation: Other Ways Hadoop is Being Leveraged.
  • Data Archiving Before Hadoop. [Diagram: the data warehouse offloads older data to a tape archive.]
  • Active Archiving with Hadoop. [Diagram: the data warehouse archives to Hadoop instead of tape.]
  • Offloading Analysis. [Diagram: business intelligence tools query both the data warehouse and Hadoop.]
  • Exploratory Analysis. [Diagram: developers, business users, and analysts explore data in Hadoop alongside the data warehouse.]
  • Use Case: A Major Financial Institution. The challenge: the current EDW is at capacity and cannot support growing data depth and width; performance issues in business-critical apps leave little room for innovation. Before: the data warehouse spent 44% of its resources on operational functions, 42% on ELT processing, and only 11% on analytics. The solution: Hadoop offloads data storage (S), processing (T), and some analytics (Q) from the EDW, so EDW resources can now be focused on repeatable operational analytics (roughly 50% operational, 50% analytics). A month of data scans in 4 seconds vs. 4 hours.
  • Hadoop Integration: The Big Picture.
  • [Diagram: the big picture. Hadoop sits at the center, connected to BI/analytics tools, the data warehouse/RDBMS, streaming data, data import/export tools, data integration tools, and NoSQL.]
  • Data Import/Export Tools. [Diagram: import/export tools connect Hadoop to the data warehouse/RDBMS and to streaming data sources.]
  • Flume in 2 Minutes. Or, why you shouldn't be using scripts for data movement. A reliable, distributed, and available system for efficient collection, aggregation, and movement of streaming data, e.g. logs. An open-source Apache project.
  • Flume in 2 Minutes. A Flume agent is a JVM process hosting components. An external source (web server, Twitter, JMS, system logs, …) sends events to the agent. Source: consumes events and forwards them to channels. Channel: stores events until consumed by sinks (file, memory, JDBC). Sink: removes events from the channel and puts them into an external destination.
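A minimal runnable sketch of that source-channel-sink wiring, using a netcat source and an HDFS sink; the agent name, port, and paths are hypothetical:

```bash
# Describe one agent: source -> channel -> sink.
cat > /tmp/agent1.conf <<'EOF'
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: consumes events (lines sent to a local port) and forwards them to the channel.
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: stores events until a sink consumes them.
agent1.channels.ch1.type = memory

# Sink: removes events from the channel and writes them to HDFS.
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs:///data/flume/events
agent1.sinks.sink1.channel = ch1
EOF

# Start the agent (Flume 1.x, i.e. "Flume NG").
flume-ng agent --conf /etc/flume-ng/conf --conf-file /tmp/agent1.conf --name agent1
```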
  • Sqoop Overview. An Apache project designed to ease import and export of data between Hadoop and relational databases. Provides functionality to do bulk imports and exports of data with HDFS, Hive, and HBase. Java based; leverages MapReduce to transfer data in parallel.
  • Sqoop Overview. Uses a "connector" abstraction. Two types of connectors: standard connectors are JDBC based; direct connectors use native database interfaces to improve performance. Direct connectors are available for many open-source and commercial databases: MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc.
  • Sqoop Import Flow. [Diagram: the client runs the import; Sqoop collects metadata from the database, generates code, and executes a MapReduce job; the map tasks pull data from the database in parallel and write it to Hadoop.]
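A hedged end-to-end sketch of the import flow above; the connection details are hypothetical, and --split-by makes explicit the split column Sqoop would otherwise guess:

```bash
# Import a table into HDFS with four parallel map tasks, splitting on the primary key.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /data/sales/orders
```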
  • Transformation/Processing. The standard interface is Java MapReduce, but higher-level interfaces are commonly used: Apache Hive provides a SQL-like interface to data in Hadoop; Apache Pig is a declarative language providing functionality to declare a sequence of transformations; Cloudera Impala is a real-time SQL query engine on Hadoop. Both Hive and Pig convert queries into MapReduce jobs and submit them to Hadoop for execution; Impala has its own execution engine.
  • Orchestration. Schedulers for Hadoop jobs: Oozie, Azkaban.
  • Data Flow with OSS Tools. [Diagram: web servers emit raw logs, which Flume, etc. loads into Hadoop; Hadoop transforms and processes them; Sqoop, etc. moves the results out; Oozie, etc. orchestrates the whole flow.]
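As a sketch of how such a flow is typically driven, here is the Oozie CLI submitting a workflow; the server URL and properties are hypothetical, and a workflow.xml defining the actual Flume/Hive/Sqoop actions is assumed to already exist in HDFS:

```bash
# Minimal submission properties pointing at a workflow stored in HDFS.
cat > job.properties <<'EOF'
nameNode=hdfs://nn.example.com:8020
jobTracker=jt.example.com:8021
oozie.wf.application.path=${nameNode}/apps/etl/workflow.xml
EOF

# Submit and start the workflow, then list workflow jobs to check status.
oozie job -oozie http://oozie.example.com:11000/oozie -config job.properties -run
oozie jobs -oozie http://oozie.example.com:11000/oozie -jobtype wf
```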
  • Hadoop Integration: Data Integration Tools.
  • [Diagram: the big-picture integration overview, repeated.]
  • Data Integration Tools.
  • Pentaho. Existing BI tools extended to support Hadoop. Provides data import/export, transformation, job orchestration, reporting, and analysis functionality. Supports integration with HDFS, Hive, and HBase. Community and Enterprise Editions offered.
  • Informatica: data import/export, metadata services, data lineage, transformation, …
  • Hadoop Integration: Business Intelligence/Analytics Tools.
  • [Diagram: the big-picture integration overview, repeated.]
  • Business Intelligence/Analytics Tools.
  • Business Intelligence/Analytics Tools. [Diagram: BI/analytics tools sit atop relational databases, data warehouses, …]
  • ODBC Driver. Most of these tools use the ODBC standard. Since Hive is a SQL-like system, it's a good fit for ODBC. Several vendors, including Cloudera, make ODBC drivers available for Hadoop. JDBC is also used by some products for Hive integration. [Diagram: BI/analytics tools connect through the ODBC driver, which speaks HiveQL to the Hive Server in front of Hive.]
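A hedged sketch of smoke-testing such a connection from the command line with unixODBC's isql, assuming a DSN named "Hive" has already been configured in odbc.ini for the vendor's driver:

```bash
# Open an interactive session against the hypothetical "Hive" DSN.
isql -v Hive
# At the SQL> prompt, issue whatever HiveQL the driver supports, e.g.:
#   SELECT COUNT(*) FROM weblogs;
```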
  • Hadoop Integration: Next-Generation BI/Analytics Tools.
  • New "Hadoop Native" Tools. You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools. [Diagram: BI/analytics tools and new query engines run alongside MapReduce on the shared platform.]
  • Hadoop Native Tools: Advantages. New data analysis tools: designed and optimized for working with Hadoop data and large data sets; remove reliance on Hive for accessing data and can work with any data in Hadoop. New query engines: provide the ability to do low-latency queries against Hadoop data; make it possible to do ad hoc, exploratory analysis of data in Hadoop.
  • Datameer. [Two image-only slides.]
  • What did the Hive community expect? An embedded Hive engine for batch or ad hoc queries. [Diagram: the Hive compiler and executor embedded with the client, backed by the metastore, Hadoop, and HDFS.]
  • What do industry users expect? Integration is the key requirement.
  • Need for server/proxy access. Facilitate remote clients: a server process to support concurrent clients. Standards-compliant connectors: JDBC, ODBC. Security and auditing.
  • HiveServer2
  • Hive Integration: HiveServer1 vs. HiveServer2. HiveServer1: no support for concurrent queries (requires running multiple HiveServers for multiple users); no support for security; the Thrift API doesn't support common JDBC/ODBC calls. HiveServer2: adds support for concurrent queries and can support multiple users; adds security support with Kerberos; better support for JDBC and ODBC.
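A minimal sketch of connecting to HiveServer2 with Beeline, its JDBC command-line client; the host, user, and database here are hypothetical:

```bash
# Connect over JDBC (HiveServer2's default port is 10000) and run a statement.
beeline -u "jdbc:hive2://hs2.example.com:10000/default" \
        -n etl_user \
        -e "SHOW TABLES;"
```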
  • Protecting Hadoop data and services. Kerberos-based authentication; POSIX-style file permissions; access control for job submission; encryption over the wire.
  • Securing Hive access. Restrict access to the service; supports Kerberos and LDAP authentication; encryption over the wire.
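A sketch of what Kerberos-secured access looks like from a client, assuming a kerberized cluster; the realm, principal, and host names are hypothetical:

```bash
# Obtain a Kerberos ticket, then pass the server's principal in the JDBC URL.
kinit analyst@EXAMPLE.COM
beeline -u "jdbc:hive2://hs2.example.com:10000/default;principal=hive/hs2.example.com@EXAMPLE.COM" \
        -e "SELECT COUNT(*) FROM weblogs;"
```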
  • Need for authorization. Secure authorization: enforce policy to control access to data for authenticated users. Fine-grained authorization: the ability to control access to a subset of the data. Role-based authorization: the ability to associate privileges with roles.
  • Current state of authorization. File-based authorization: control at the file level; insufficient for collaboration; no fine-grained access control. Sub-optimal built-in authorization: intended for preventing accidental changes, not for stopping malicious users.
  • Apache Sentry. A policy engine for authorization: fine-grained and role-based, with pluggable modules for Hadoop components; works out of the box with Hive.
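A hedged sketch of Sentry's file-based policy provider, which maps groups to roles and roles to fine-grained privileges; every group, role, and object name below is hypothetical:

```bash
# A minimal policy: "analysts" may only read one table; "etl" may read and
# write the whole database.
cat > /tmp/sentry-policy.ini <<'EOF'
[groups]
analysts = analyst_role
etl      = etl_role

[roles]
analyst_role = server=server1->db=sales->table=orders->action=select
etl_role     = server=server1->db=sales->action=select, server=server1->db=sales->action=insert
EOF
```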
  • Hue - Hadoop User Experience
  • Recap
  • Data Warehouse Optimization. [Recap of the earlier data warehouse optimization diagram.]
  • [Recap of the big-picture integration diagram.]
  • Questions? Slides at github.com/markgrover/hive-sjsu. Prasad: http://www.linkedin.com/pub/prasad-mujumdar/29/147/88b, prasadm@cloudera.com. Mark: www.linkedin.com/in/grovermark, mgrover@cloudera.com.
  • Flume in 2 Minutes. Reliable: events are stored in the channel until delivered to the next stage. Recoverable: events can be persisted to disk and recovered in the event of failure. [Diagram: source, channel, sink, and destination within a Flume agent.]
  • Flume in 2 Minutes. Supports multi-hop flows for more complex processing; also fan-out and fan-in. [Diagram: two chained Flume agents, the sink of the first feeding the source of the second.]
  • Flume in 2 Minutes. Declarative: no coding required; the configuration specifies how components are wired together.
  • Flume in 2 Minutes. Similar systems: Scribe, Chukwa.
  • Sqoop Limitations. Sqoop has some limitations, including: poor support for security, e.g. "$ sqoop import --username scott --password tiger ..." puts credentials on the command line (Sqoop can read command line options from an options file, but this still has holes); error-prone syntax; tight coupling to the JDBC model, which is not a good fit for non-RDBMS systems.
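A sketch of the less leaky alternatives the slide alludes to: prompting with -P, or keeping options in a permissions-restricted file; paths and credentials are hypothetical:

```bash
# Option 1: prompt interactively so the password never reaches the shell history.
sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
  --username scott -P

# Option 2: read options from a file readable only by the job owner
# (one option or value per line).
cat > /home/etl/sqoop.opts <<'EOF'
--connect
jdbc:mysql://db.example.com/sales
--username
scott
EOF
chmod 600 /home/etl/sqoop.opts
sqoop import --options-file /home/etl/sqoop.opts --table orders -P
```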
  • Fortunately… Sqoop 2 (incubating) will address many of these limitations: it adds a web-based GUI, centralized configuration, a more flexible model, and an improved security model.
  • New Query Engines – Impala. Fast, interactive queries on data stored in Hadoop (HDFS and HBase), but also designed to support long-running queries. Uses the familiar Hive Query Language and shares the metastore. Tight integration with Hadoop: reads common Hadoop file formats; runs on Hadoop DataNodes. High performance: C++, not Java; runtime code generation; an entirely re-designed execution engine that bypasses MapReduce.
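A minimal sketch of an interactive Impala query via impala-shell; because Impala shares the Hive metastore, a table defined in Hive (like the hypothetical weblogs table earlier) is queryable immediately:

```bash
# Point impala-shell at any impalad in the cluster (default port 21000)
# and run a low-latency aggregation.
impala-shell -i impalad.example.com:21000 \
  -q "SELECT status, COUNT(*) FROM weblogs GROUP BY status;"
```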
  • Impala Architecture. [Diagram: SQL applications connect over ODBC using common Hive SQL and interfaces; metadata and scheduling are unified through the Hive metastore, YARN, the HDFS NameNode, and the state store; each node runs a query planner, query coordinator, and query execution engine colocated with the HDFS DataNode and HBase; the query layer is fully MPP and distributed.]