Integrating hadoop - Big Data TechCon 2013


Presentation from Big Data TechCon 2013 Boston on integrating Hadoop with existing data infrastructures.


Speaker notes:
  • Open source system implemented (mostly) in Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware. Two primary components, HDFS and MapReduce. Based on software originally developed at Google. An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of the cost of other solutions) of any type of data, and places no constraints on how that data is processed. Allows companies to begin storing data that was previously thrown away. Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate it into existing data infrastructure.
  • Current Architecture: In the beginning, there were enterprise applications backed by relational databases. These databases were optimized for processing transactions, or Online Transaction Processing (OLTP), which required high speed transactional reading and writing. Given the valuable data in these databases, business users wanted to be able to query them in order to ask questions. They used Business Intelligence tools that provided features like reports, dashboards, scorecards, alerts, and more. But these queries put a tremendous burden on the OLTP systems, which were not optimized to be queried like this. So architects introduced another database, called a data warehouse – you may also hear about data marts or operational data stores (ODS) – that was optimized for answering user queries. The data warehouse was loaded with data from the source systems. Specialized tools Extracted the source data, applied some Transformations to it – such as parsing, cleansing, validating, matching, translating, encoding, sorting, or aggregating – and then Loaded it into the data warehouse. For short we call this ETL. As it matured, the data warehouse incorporated additional data sources. Since the data warehouse was typically a very powerful database, some organizations also began performing transformation workloads right in the database, choosing to load raw data for speed and letting the database do the heavy lifting of transformations. This model is called ELT. Many organizations perform both ETL and ELT for data integration.
  • Issues: As data volumes and business complexity grow, ETL and ELT processing is unable to keep up, and critical business windows are missed. Databases are designed to load and query data, not transform it. Transforming data in the database consumes valuable CPU, making queries run slower.
  • Solution: Offload slow or complex ETL/ELT transformation workloads to Cloudera in order to meet SLAs. Cloudera processes raw data to feed the warehouse with high value cleansed, conformed data. Reclaim valuable EDW capacity for the high value data and query workloads it was designed for, accelerating query performance and enabling new projects. Gain the ability to query ALL your data through your existing BI tools and practices.
  • Conventional databases are expensive to scale as data volumes grow. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it.
  • Bank of America: A multinational bank saves millions by optimizing their EDW for analytics and reducing data storage costs by 99%. Background: The bank has traditionally relied on a Teradata enterprise data warehouse for its data storage, processing and analytics. With the movement from in-person to online banking, the number of transactions and the data each transaction generates has ballooned. Challenge: The bank wanted to make effective use of all the data being generated, but their Teradata system quickly became maxed out. It could no longer handle current workloads and the bank's business critical applications were hitting performance issues. The system was spending 44% of its resources on operational functions and 42% on ELT processing, leaving only 11% for analytics and discovery of ROI from new opportunities. The bank was forced to either expand the Teradata system, which would be very expensive; restrict user access to the system in order to lessen the workload; or offload raw data to tape backup and rely on small data samples and aggregations for analytics in order to reduce the data volume on Teradata. Solution: The bank deployed Cloudera to offload data processing, storage and some analytics from the Teradata system, allowing the EDW to focus on its real purpose: performing operational functions and analytics. Results: By offloading data processing and storage onto Cloudera, which runs on industry standard hardware, the bank avoided spending millions to expand their Teradata infrastructure. Expensive CPU is no longer consumed by data processing, and storage costs are a mere 1% of what they were before. Meanwhile, data processing is 42% faster and data center power consumption has been reduced by 25%. The bank can now process 10TB of data every day.
  • This is a simple example, but close to how a number of companies are using Hadoop now.
  • Full history of users’ browsing is stored in web logs. This is semi-structured data.
  • Most companies aren’t going to store raw logs in their DWH because of the expense and the low value of much of the data. This goes back to the ROB discussion – this data might have value in aggregate, but may be very difficult to justify storing in the typical data warehouse.
  • This is a very quick overview and glosses over much of the capabilities and functionality offered by Flume. This is describing 1.3 or “Flume NG”.
  • Client executes a Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop creates source code for a table class, and then kicks off the MR job. This table class can be used for processing on extracted records. Sqoop by default will guess at a column for splitting data for distribution across the cluster. This can also be specified by the client.
  • Should be emphasized that with this system we maintain the raw logs in Hadoop, allowing new transformations as needed.
  • This works well, and is representative of how most companies are doing these types of tasks now.
  • Very few database/ETL devs have Java, etc. backgrounds. Many organizations do have ETL and SQL developers, though, familiar with common tools such as Informatica.
  • Pentaho also has integration with NoSQL DBs (Mongo, Cassandra, etc.)
  • Pentaho orchestrates the entire flow. Ratings data is ingested via a PDI job. Reference data is pre-processed – combined, cleansed, etc. Reference data is then copied into HDFS. Pentaho MapReduce is then used to do extensive transformations – joins, aggregations, etc. – to create the final data sets that drive analysis. Resulting data sets are loaded into Hive. Hive queries drive analysis and reporting. All processing, reporting, etc. in this example is performed in Hadoop.
  • This provides an example of transforming raw input data into final records through the Pentaho UI.
  • That output then drives a number of reports and visualizations.
  • Not a promotion for Informatica, but an example of how the largest enterprise vendors are adapting their products for Hadoop. Also shows out-of-cluster transformations.
  • Uses the same interface as existing PowerCenter. Transformations are converted to HQL. Existing Informatica jobs can be re-used with Hadoop. Also provides data profiling, data lineage, etc.
  • Most of these tools integrate to existing data stores using the ODBC standard.
  • MSTR and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC based tools should also work, and more integrations will be supported soon.
  • JDBC/ODBC support: HiveServer1 Thrift API lacks support for asynchronous query execution, the ability to cancel running queries, and methods for retrieving information about the capabilities of the remote server.
  • Performing a query in Hive is basically the equivalent of a full table scan in a standard database. Not a good fit with most BI tools.
  • Showing a definite bias here, but Impala is available now in beta, soon to be GA, and supported by major BI and analytics vendors. It’s also the system that I’m familiar with. Systems like Impala provide important new capabilities for performing data analysis with Hadoop, so well worth covering in this context. According to TDWI, lack of real-time query capabilities is an obstacle to Hadoop adoption for many companies.
  • Impalads are composed of 3 components – planner, coordinator, and execution engine. The State Store daemon isn’t shown here, but maintains information on the impala daemons running in the system.
  • Queries get sent to a single impalad, which is different from the HiveServer architecture.
  • Changes in CDH4 allow for short-circuit reads – allowing impalads to read directly from the file system rather than going through DataNodes. Another change allows Impala to know which disks data blocks are on.
  • Impala makes it more practical to perform analysis with popular BI tools. You can now do exploratory analysis and quickly generate reports and visualizations with common tools. Integration with MSTR, QlikView, Pentaho, etc.
  • The data module provides logical abstractions on top of storage subsystems (e.g. HDFS) that let users think and operate in terms of records, datasets, and dataset repositories

    1. Extending Your Data Infrastructure with Hadoop. Jonathan Seidman | Solutions Architect. Big Data TechCon, April 10, 2013. ©2013 Cloudera, Inc. All Rights Reserved.
    2. Who I Am • Solutions Architect, Partner Engineering Team. • Co-founder/organizer of Chicago Hadoop User Group and Chicago Big Data. • @jseidman
    3. What I’ll be Talking About • Big data challenges with current data integration approaches. • How is Hadoop being leveraged with existing data infrastructures? • Hadoop integration – the big picture. • Deeper dive into tool categories: data import/export, data integration, BI/analytics. • Putting the pieces together. • BI/analytics with Hadoop. • New approaches to data analysis with Hadoop.
    4. What is Apache Hadoop? Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed. Core components: HDFS (self-healing, high-bandwidth clustered storage) and MapReduce (distributed computing framework). Has the flexibility to store and mine any type of data – ask questions across structured and unstructured data that were previously impossible to ask or solve; not bound by a single schema. Excels at processing complex data – scale-out architecture divides workloads across multiple nodes; flexible file system eliminates ETL bottlenecks. Scales economically – can be deployed on commodity hardware; open source platform guards against vendor lock-in.
    5. Current Challenges: Limitations of Existing Data Management Systems
    6. The Transforming of Transformation. (Diagram: enterprise applications and OLTP systems feed the data warehouse and ODS via Extract/Transform/Load; business intelligence tools query the warehouse.)
    7. Volume, Velocity, Variety Cause Capacity Problems. (1) Slow data transformations = missed ETL SLAs. (2) Slow queries = frustrated business users.
    8. Data Warehouse Optimization. (Diagram: Hadoop sits between the enterprise applications/OLTP systems and the high-$/byte data warehouse, handling ETL, transform, query, and store workloads; business intelligence tools query both.)
    9. The Key Benefit: Agility/Flexibility. Schema-on-Write (RDBMS) – prescriptive data modeling: create a static DB schema, transform data into RDBMS format, query data in RDBMS format; new columns must be added explicitly before new data can propagate into the system. Good for known unknowns (repetition). Schema-on-Read (Hadoop) – descriptive data modeling: copy data in its native format, create a schema + parser, query data in its native format; new data can start flowing any time and will appear retroactively once the schema/parser properly describes it. Good for unknown unknowns (exploration).
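The schema-on-read idea on this slide can be sketched in a few lines. This is a hypothetical illustration (the field names and log format are made up, not from the talk): raw lines are stored untouched, and the "schema" is just a parser applied at read time, so a richer parser applies retroactively to all existing data.

```python
# Hypothetical schema-on-read sketch: raw lines are stored as-is,
# and a parser applied at query time defines the "schema".
raw_lines = [
    "2013-04-10T12:00:00|u123|/products/view/952",
    "2013-04-10T12:00:05|u456|/checkout",
]

def parse_v1(line):
    # Initial "schema": three pipe-delimited fields.
    ts, user, path = line.split("|")
    return {"ts": ts, "user": user, "path": path}

def parse_v2(line):
    # A later, richer parser derives a new field retroactively,
    # without reloading or migrating the stored raw data.
    rec = parse_v1(line)
    rec["is_product_view"] = rec["path"].startswith("/products/view/")
    return rec

records = [parse_v2(l) for l in raw_lines]
```

Contrast with schema-on-write, where adding `is_product_view` would require an explicit column change before any new data could carry it.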
    10. Not Just Transformation: Other Ways Hadoop is Being Leveraged
    11. Data Archiving Before Hadoop: data warehouse → tape archive.
    12. Active Archiving with Hadoop: data warehouse → Hadoop.
    13. Offloading Analysis: business intelligence tools query both the data warehouse and Hadoop.
    14. Exploratory Analysis: developers, analysts, and business users work against Hadoop alongside the data warehouse.
    15. The Common Themes? (1) Offload expensive storage and processing to Hadoop – complement, not replace. (2) Reduce strain on the data warehouse – let it focus on what it was designed to do: high speed queries on high value relational data. Increase ROI of existing relational stores.
    16. Economics: Return on Byte. Return on Byte (ROB) = value of data / cost of storing data. High ROB vs. low ROB (but still a ton of aggregate value).
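The ROB formula above is simple arithmetic, but a tiny worked example makes the slide's point concrete. The numbers below are purely illustrative assumptions (the talk only claims Hadoop storage is roughly an order of magnitude cheaper):

```python
def return_on_byte(value_of_data, cost_of_storing):
    # ROB = value of data / cost of storing data (formula from the slide)
    return value_of_data / cost_of_storing

# Made-up numbers: the same data yields a 10x higher ROB when its
# per-byte storage cost drops by 10x, so low-value data that is
# uneconomical to keep in the warehouse can still be worth keeping.
rob_edw = return_on_byte(5.0, 1.00)     # high-cost warehouse storage
rob_hadoop = return_on_byte(5.0, 0.10)  # ~10x cheaper Hadoop storage
```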
    17. Use Case: A Major Financial Institution. The Challenge: current EDW at capacity; cannot support growing data depth and width. Performance issues in business critical apps; little room for innovation. Before: operational functions 44%, ELT processing 42%, analytics 11% of data warehouse resources. The Solution: Hadoop offloads data storage (S), processing (T) and some analytics (Q) from the EDW; EDW resources can now be focused on repeatable operational analytics (operational 50%, analytics 50%). A month's data scan now takes 4 seconds vs. 4 hours.
    18. Hadoop Integration: Some Definitions
    19. Data Integration • Process in which heterogeneous data from multiple sources is retrieved and transformed to provide a unified view. • ETL (extract, transform and load) is a central component of DI.
    20. ETL – The Wikipedia Definition • Extract, transform and load (ETL) is a process in database usage, and especially in data warehousing, that involves: extracting data from outside sources; transforming it to fit operational needs; loading it into the end target (DB or data warehouse).
    21. BI – The Forrester Research Definition • "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making."
    22. Hadoop Integration: The Big Picture
    23. (Diagram: BI/analytics tools on top; data warehouse/RDBMS, streaming data, and NoSQL stores connected to Hadoop via data import/export and data integration tools.)
    24. Example Use Case
    25. Example Use Case • Online retailer. • Customer, order data stored in data warehouse.
    26. Example Use Case • Now wants to leverage behavioral (non-transactional) data, e.g. products viewed on-line, to drive recommendations, etc.
    27. So Where is This Data? • Record of page views is stored in session logs as users browse the site. • So how do we get it out? [2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (PlayBook; U; RIM Tablet OS 2.0.1; en-US) AppleWebKit/535.8+ (KHTML, like Gecko) Version/ Safari/535.8+" "age=63&gender=0&incomeCategory=4&session=51620033&user=-2118869394&region=9&userType=0"
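Getting the behavioral data out of a session-log line like the one above is mostly parsing. A minimal sketch (the talk doesn't show extraction code; the regexes and field handling here are illustrative, and the user-agent string is abbreviated):

```python
import re
from urllib.parse import parse_qs

# The sample session-log line from the slide (user agent abbreviated):
line = ('[2002/11/27 18:58:28.294 -0600] "GET /products/view/952 HTTP/1.1" '
        '200 701 "-" "Mozilla/5.0 (PlayBook; ...)" '
        '"age=63&gender=0&incomeCategory=4&session=51620033'
        '&user=-2118869394&region=9&userType=0"')

# Pull out the requested path and the product id embedded in it.
m = re.search(r'"GET (\S+) HTTP', line)
path = m.group(1)
product_id = int(path.rsplit("/", 1)[1])

# The trailing quoted payload is a query string; parse it into fields.
payload = re.findall(r'"([^"]*)"', line)[-1]
fields = {k: v[0] for k, v in parse_qs(payload).items()}
```

Applied across billions of such lines, this is exactly the kind of parse-heavy work the talk argues belongs in Hadoop rather than in the warehouse.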
    28. Load Raw Logs into Data Warehouse? (Web servers → logs → DWH.) • Very expensive to store. • Difficult to model and process semi-structured data. • Oh, and also, very expensive.
    29. ETL In/Into Data Warehouse? (Web servers → logs → ETL → DWH.) • Time and resource intensive with larger log sizes. • No archive of raw logs – potentially valuable data is thrown away. • How do you decide which fields have value? • Still, some companies are doing things like this.
    30. Hadoop Integration: Data Import/Export Tools
    31. (Diagram repeated: the big-picture integration architecture.)
    32. Data Import/Export Tools: data warehouse/RDBMS and streaming data sources feeding Hadoop.
    33. Flume in 2 Minutes. Or, why you shouldn’t be using scripts for data movement. • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs. • Open-source Apache project.
    34. Flume in 2 Minutes. A Flume agent is a JVM process hosting components: a source consumes events from an external source (web server, Twitter, JMS, system logs, …) and forwards them to channels; a channel (file, memory, JDBC) stores events until consumed by sinks; a sink removes events from the channel and puts them into an external destination.
    35. Flume in 2 Minutes • Reliable – events are stored in the channel until delivered to the next stage. • Recoverable – events can be persisted to disk and recovered in the event of failure. (Flume agent: source → channel → sink → destination.)
    36. Flume in 2 Minutes • Supports multi-hop flows for more complex processing. • Also fan-out, fan-in. (Two agents chained: source → channel → sink → source → channel → sink.)
    37. Flume in 2 Minutes • Declarative – no coding required. • Configuration specifies how components are wired together.
    38. Flume in 2 Minutes • Similar systems: Scribe, Chukwa.
    39. Sqoop Overview • Apache project designed to ease import and export of data between Hadoop and relational databases. • Provides functionality to do bulk imports and exports of data with HDFS, Hive and HBase. • Java based. Leverages MapReduce to transfer data in parallel.
    40. Sqoop Overview • Uses a “connector” abstraction. • Two types of connectors: standard connectors are JDBC based; direct connectors use native database interfaces to improve performance. • Direct connectors are available for many open-source and commercial databases – MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc.
    41. Sqoop Import Flow: the client runs an import; Sqoop collects metadata from the database, generates code, and executes a MapReduce job; map tasks pull data from the database in parallel and write it to Hadoop.
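The parallelism in the import flow comes from splitting the table on a column (as the speaker notes mention, Sqoop guesses a split column or takes one from the client). A rough sketch of how a numeric split column's min/max range might be divided among map tasks – this mimics the spirit of Sqoop's default numeric splitter, not its exact implementation:

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive range [lo, hi] of a numeric split column
    into num_mappers contiguous sub-ranges, one per map task.
    Illustrative only; Sqoop's actual splitter differs in details."""
    size = (hi - lo + 1) / num_mappers
    splits = []
    for i in range(num_mappers):
        start = lo + round(i * size)
        # Last mapper always ends exactly at hi to cover rounding.
        end = hi if i == num_mappers - 1 else lo + round((i + 1) * size) - 1
        splits.append((start, end))
    return splits

# Each tuple becomes a WHERE clause like: split_col >= start AND split_col <= end
ranges = split_ranges(1, 100, 4)
```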
    42. Sqoop Limitations. Sqoop has some limitations, including: • Poor support for security: $ sqoop import --username scott --password tiger … • Sqoop can read command line options from an options file, but this still has holes. • Error prone syntax. • Tight coupling to the JDBC model – not a good fit for non-RDBMS systems.
    43. Fortunately… Sqoop 2 (incubating) will address many of these limitations: • Adds a web-based GUI. • Centralized configuration. • More flexible model. • Improved security model.
    44. MapReduce For Transformation • Standard interface is Java, but higher-level interfaces are commonly used: • Apache Hive – provides an SQL-like interface to data in Hadoop. • Apache Pig – declarative language providing functionality to declare a sequence of transformations. • Both Hive and Pig convert queries into MapReduce jobs and submit them to Hadoop for execution.
    45. Example Implementation with OSS Tools. All the tools we need for moving and transforming data: • Hadoop provides HDFS for storage and MapReduce for processing. • Also components for process orchestration: Oozie, Azkaban. • And higher-level abstractions: Pig, Hive, etc.
    46. Data Flow with OSS Tools: web servers → Flume, etc. → raw logs in Hadoop → transform → Sqoop, etc. → load, with Oozie, etc. handling process orchestration.
    47. Flume Configuration for Example Use Case • Spooling source watches a directory for new files and moves them into channels; renames files when processed. • HDFS sink ingests files into HDFS.
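A minimal flume.conf sketch of the flow this slide describes – a spooling-directory source feeding an HDFS sink through a file channel. The agent name, directory paths, and channel choice are illustrative assumptions, not taken from the talk:

```properties
# Hypothetical Flume 1.x agent: spooling directory -> file channel -> HDFS
agent1.sources = spool1
agent1.channels = ch1
agent1.sinks = hdfs1

# Watches the directory for new files; completed files are renamed.
agent1.sources.spool1.type = spooldir
agent1.sources.spool1.spoolDir = /var/log/web/incoming
agent1.sources.spool1.channels = ch1

# Durable channel: events survive agent restarts.
agent1.channels.ch1.type = file

# Ingest into HDFS, partitioned by date.
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = /data/raw/weblogs/%Y/%m/%d
agent1.sinks.hdfs1.channel = ch1
```

This is the declarative wiring mentioned on slide 37: components and their connections are named in configuration, with no code required.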
    48. Pig Code for Example Use Case. (Code shown on slide.)
    49. Importing Final Data into DWH. Output from the Pig script stored in HDFS: 2012-09-16T23:03:16.294Z|1461333428|290 2012-09-20T04:48:52.294Z|772136124|749 2012-09-24T03:51:16.294Z|1144520081|222 2012-09-24T12:29:40.294Z|628304774|407 Moved into the destination table with Sqoop.
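Before these pipe-delimited records are exported with Sqoop, it's worth being clear about their structure. The slide doesn't name the columns; based on the use case they look like timestamp|user|product, but those field names are an assumption. A minimal parse:

```python
# Pipe-delimited records from the Pig output shown on the slide.
rows = [
    "2012-09-16T23:03:16.294Z|1461333428|290",
    "2012-09-20T04:48:52.294Z|772136124|749",
    "2012-09-24T03:51:16.294Z|1144520081|222",
    "2012-09-24T12:29:40.294Z|628304774|407",
]

# Field names are assumed from the use case, not stated on the slide.
FIELDS = ("ts", "user", "product")
parsed = [dict(zip(FIELDS, r.split("|"))) for r in rows]
```

A Sqoop export of such files would point at the HDFS directory and specify the same delimiter, so the record layout here is exactly what the destination table must match.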
    50. But… • Some DI services are not provided in this stack: metadata repository, Master Data Management, data lineage, …
    51. Also… • …very low level: requires knowledgeable developers to implement transformations. Not a whole lot of these right now. (Diagram: many ETL developers vs. few Hadoop data modelers and developers.)
    52. Hadoop Integration: Data Integration Tools
    53. (Diagram repeated: the big-picture integration architecture.)
    54. Data Integration Tools
    55. Pentaho • Existing BI tools extended to support Hadoop. • Provides data import/export, transformation, job orchestration, reporting, and analysis functionality. • Supports integration with HDFS, Hive and HBase. • Community and Enterprise Editions offered.
    56. Pentaho • Primary component is Pentaho Data Integration (PDI), also known as Kettle. • PDI provides a graphical drag-and-drop environment for defining ETL jobs, which interface with Java MapReduce to execute in-cluster transformations.
    57. Pentaho/Cloudera Demo • Ingest data into HDFS using Flume. • Pre-process the reference data. • Copy reference files into Hadoop. • Execute transformations in-cluster. • Load Hive. • Query Hive. • Discover, analyze and visualize.
    58. Pentaho MapReduce. Sample input log line: - - [31/Dec/2000:14:11:59 -0800] "GET /rate?movie=1207&rating=4 HTTP/1.1" 200 7 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" "USER=1" Sample output record: 5|Monty Pythons Life of Brian|1979|5794|M|35-44|Salesman|53703|Madison|WI|2000|5|5|43295|20th|false|false|false|false|true|false|false|false|false|false|false|false|false|false|false|false|false|false
    59. Structure → Analysis & Visualization. The same structured record (5|Monty Pythons Life of Brian|1979|…) now drives analysis and visualization.
    60. Informatica • Data import/export • Metadata services • Data lineage • Transformation • …
    61. Informatica – Data Import. (Diagram: PowerExchange accesses data from web servers, databases, data warehouses, message queues, email, social media, ERP, CRM, and mainframes via batch, CDC, and real-time modes; PowerCenter pre-processes it – e.g. filter, join, cleanse – and ingests it into HDFS and Hive.)
    62. Informatica – Data Export. (Diagram: PowerCenter extracts data from HDFS, post-processes it – e.g. transforms to the target schema – and PowerExchange delivers it to web servers, databases, the data warehouse, ERP, CRM, and mainframes in batch or real-time.)
    63. Informatica Data Import/Export: 1. Create ingest or extract mapping. 2. Create Hadoop connection. 3. Configure workflow. 4. Configure Hive properties.
    64. Informatica – Data Transformation. (Screenshot shown on slide.)
    65. Hadoop Integration: Business Intelligence/Analytic Tools
    66. (Diagram repeated: the big-picture integration architecture.)
    67. Business Intelligence/Analytics Tools
    68. Business Intelligence/Analytics Tools: relational databases, data warehouses, …
    69. ODBC Driver • Most of these tools use the ODBC standard. • Since Hive is an SQL-like system it’s a good fit for ODBC. • Several vendors, including Cloudera, make ODBC drivers available for Hadoop. • JDBC is also used by some products for Hive integration. (Stack: BI/analytics tools → ODBC driver → HiveQL → Hive Server → Hive.)
    70. Hive Integration. HiveServer1: no support for concurrent queries – requires running multiple HiveServers for multiple users; no support for security; the Thrift API in the Hive Server doesn’t support common JDBC/ODBC calls. HiveServer2: adds support for concurrent queries and multiple users; adds security support with Kerberos; better support for JDBC and ODBC.
    71. Still Some Limitations With This Model • Hive does not have full SQL support. • Dependent on Hive – data must be loaded in Hive to be available. • Queries are high-latency.
    72. Hadoop Integration: Next Generation BI/Analytics Tools
    73. New “Hadoop Native” Tools. You can think of Hadoop as becoming a shared execution environment supporting new data analysis tools – BI/analytics applications running on new query engines alongside MapReduce.
    74. Hadoop Native Tools – Advantages • New data analysis tools: designed and optimized for working with Hadoop data and large data sets; remove reliance on Hive for accessing data – can work with any data in Hadoop. • New query engines: provide the ability to do low latency queries against Hadoop data; make it possible to do ad-hoc, exploratory analysis of data in Hadoop.
    75. Datameer. (Screenshot shown on slide.)
    76. Datameer. (Screenshot shown on slide.)
    77. New Query Engines – Impala • Fast, interactive queries on data stored in Hadoop (HDFS and HBase). • But also designed to support long running queries. • Uses familiar Hive Query Language and shares the metastore. • Tight integration with Hadoop: reads common Hadoop file formats; runs on Hadoop DataNodes. • High performance: C++, not Java; runtime code generation; entirely re-designed execution engine bypasses MapReduce. • Currently in beta, GA expected in April.
    78. Impala Architecture. (Diagram: a common Hive SQL and ODBC interface; unified metadata via the Hive metastore; state store, YARN, and the HDFS NameNode; fully distributed MPP – each node runs a query planner, query coordinator, and query execution engine alongside an HDFS DataNode and HBase.)
    79. Cloudera Impala Details. The client submits a query through ODBC to a query planner.
    80. Cloudera Impala Details. The planner turns the request into a collection of plan fragments; the coordinator initiates execution on remote impalads.
    81. Cloudera Impala Details. Impalads participating in the query access local data in HDFS or HBase via local direct reads.
    82. Cloudera Impala Details. Intermediate results are streamed between impalads via in-memory transfers; final results are streamed back to the client.
    83. BI Example – Tableau with Impala. (Screenshot shown on slide.)
    84. Development Challenges • According to TDWI research*: 28% of users feel software tools are few and immature, and 25% note the lack of metadata management. *TDWI Best Practices Report: Integrating Hadoop Into Business Intelligence and Data Warehousing, Philip Russom, TDWI Research.
    85. The Cloudera Developer Kit • The CDK is an open-source collection of libraries, tools, examples, and documentation targeted at simplifying the most common tasks when working with Hadoop. • The first module released is the CDK Data module – APIs to drastically simplify working with datasets in Hadoop filesystems. The Data module handles: automatic serialization and deserialization of Java POJOs as well as Avro records; automatic compression; file and directory layout and management; automatic partitioning based on configurable functions; a metadata provider plugin interface to integrate with centralized metadata management systems.
    86. Cloudera Developer Kit • Source code, examples, documentation, etc.:
    87. Questions? • Or see me at the Cloudera booth – 11:00-1:00.