Integrating Hadoop Into the Enterprise
 

  • Common theme: moving time-, space-, or processor-intensive processing to Hadoop.
  • Flume provides ingestion of streaming data (e.g. logs) into Hadoop.
  • Client executes Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop generates source code for a table class and then kicks off an MR job. This table class can be used for processing the extracted records. By default, Sqoop guesses at a column to split the data on for distribution across the cluster; the client can also specify this column.
  • Pentaho also has integration with NoSQL DBs (Mongo, Cassandra, etc.).
  • Most of these tools integrate with existing data stores using the ODBC standard.
  • MicroStrategy (MSTR) and Tableau are now tested and certified with the Cloudera driver; other standard ODBC-based tools should also work, and more integrations will be supported soon.
  • Cloudera has also implemented a solution for multi-user access, which will soon support authentication as well.
  • In-memory model supports low-latency queries.

Integrating Hadoop Into the Enterprise: Presentation Transcript

  • Integrating Hadoop into the Enterprise
    Jonathan Seidman
    Hadoop Summit 2012
    June 14th, 2012
  • Who I Am
    • Solutions Architect, Partner Engineering Team.
    • Co-founder of Chicago Hadoop User Group and co-founder/organizer of Chicago Big Data.
    • jseidman@cloudera.com
    • @jseidman
    • cloudera.com/careers
    ©2012 Cloudera, Inc. All Rights Reserved.
  • What I’ll Be Talking About
    • Some background.
    • Common uses of Hadoop in an enterprise data infrastructure.
    • Hadoop integration – the big picture.
    • Deeper dive:
      – Data import/export: moving data between Hadoop and existing data stores.
      – ETL tools.
      – Business intelligence (BI) and analytic tools.
    • Example architectures and data flows.
    • Conclusions.
  • My Life Before Cloudera…
  • Hadoop at Orbitz
    [Chart with series “Queries” and “Searches”; y-axis 0% to 100%, x-axis 1 to 20; labeled values: 71.67%, 34.30%, 31.87%, 2.78%.]
  • But Hadoop Was An Isolated System
    [Diagram labels: Developers, Business Analysts, Normal Users, Humans.]
  • Hadoop + the Data Warehouse…
  • …Enabled New Analyses
  • “In our opinion, integration with existing IT systems and software is critical, as we know enterprises will not be replacing these technologies anytime soon. For Hadoop platforms this means integration with existing databases, data warehouses, and business-analytics and business-visualization tools.” *
    * A near-term outlook for big data, Jo Maitland, GigaOM Pro, March 2012
  • What Can We Do?
    • ETL
      – Scalable ETL – allows companies to meet SLAs (inexpensively).
      – Agile – facilitates rapid modifications.
    • Moving analysis off of existing systems.
    • Sandbox for exploratory analytics.
    • Using Hadoop as an active archive.
    • Joining transactional data from a DB with interaction data.
    • Common theme: freeing up existing systems for tasks they’re better suited for.
  • [Diagram labels: BI/Analytics Tools, Enterprise Data Warehouse, Relational Databases, Flume, Data Import/Export, ETL Tools, Appliances, NoSQL.]
  • Data Import/Export
    [Diagram labels: Enterprise Data Warehouse, Relational Databases.]
  • Sqoop Overview
    • Apache project designed to ease import and export of data between Hadoop and relational databases.
    • Provides functionality to do bulk imports and exports of data with HDFS, Hive and HBase.
    • Java based. Leverages MapReduce to transfer data in parallel.
  • Sqoop Overview
    • Uses a “connector” abstraction.
    • Two types of connectors:
      – Standard connectors are JDBC based.
      – Direct connectors use native database interfaces to improve performance.
    • Direct connectors are available for many open-source and commercial databases – MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc.
  • Sqoop Import Flow
    [Diagram labels: Client, Run import, Sqoop, Collect metadata, Generate code, Execute MR job, MapReduce (three Map tasks), Pull data, Write to Hadoop, Hadoop.]
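The parallel part of the import flow can be illustrated with a small sketch. It assumes an integer split column and follows the approach described in Sqoop's documentation, where the minimum and maximum of the split column are divided evenly among the map tasks. The function names are hypothetical and this is a simplification, not Sqoop's actual code.

```python
from typing import List, Tuple


def split_ranges(lo: int, hi: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [lo, hi] into half-open [start, end) slices."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    span = hi - lo
    # Evenly spaced lower bounds, plus one final bound just past the maximum
    # so that the row holding the maximum value is included.
    bounds = [lo + span * i // num_mappers for i in range(num_mappers)] + [hi + 1]
    # Drop empty slices, which appear when there are more mappers than keys.
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]


def where_clauses(column: str, lo: int, hi: int, num_mappers: int) -> List[str]:
    """Build one WHERE predicate per map task (hypothetical helper)."""
    return [f"{column} >= {a} AND {column} < {b}"
            for a, b in split_ranges(lo, hi, num_mappers)]


if __name__ == "__main__":
    # Assumed example: ids from 0 to 1000 spread over four map tasks.
    for clause in where_clauses("id", 0, 1000, 4):
        print(clause)
```

Each map task would then pull only the rows matching its own predicate, which is why a skewed split column can leave some tasks with far more work than others.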
  • Sqoop Limitations
    Sqoop has some limitations, including:
    • Poor support for security.
      $ sqoop import --username scott --password tiger…
      – Sqoop can read command line options from an option file, but this still has holes.
    • Error-prone syntax.
    • Tight coupling to JDBC model – not a good fit for non-RDBMS systems.
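A minimal sketch of the option-file approach follows, assuming Sqoop's `--options-file` argument with one option or value per line. The connection details, file name, and helper names are hypothetical. The file still holds the password in plain text, which is one of the holes the slide refers to. Restricting the file to its owner only narrows the exposure.

```python
import os
import tempfile
from typing import List


def write_options_file(path: str, options: List[str]) -> None:
    """Write one option or value per line to a file only the owner can read."""
    # O_EXCL refuses to overwrite an existing file; 0o600 = owner read/write only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(options) + "\n")


def build_command(options_path: str, table: str) -> List[str]:
    """Return an argv list that keeps credentials out of the process list."""
    return ["sqoop", "--options-file", options_path, "--table", table]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    options_path = os.path.join(workdir, "import.txt")
    # Hypothetical connection details, for illustration only.
    write_options_file(options_path, [
        "import",
        "--connect", "jdbc:mysql://db.example.com/sales",
        "--username", "scott",
        "--password", "tiger",
    ])
    # The command is only printed here, not executed.
    print(" ".join(build_command(options_path, "orders")))
```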
  • Fortunately…
    Sqoop 2 (incubating) will address many of these limitations:
    • Adds a web-based GUI.
    • Centralized configuration.
    • More flexible model.
    • Improved security model.
  • Informatica PowerExchange
    • Not just RDBMS integration – provides consistent, native integration between Hadoop and a range of data sources, databases, legacy systems, standard file formats, CRM…
    • Integrated with PowerCenter for pre/post-processing of data, administration, and metadata management.
  • PowerExchange – Data Import
    [Diagram: Access Data → Pre-Process → Ingest Data.
    Sources: web server; databases, data warehouse; message queues, email, social media; ERP, CRM; mainframe.
    Access via PowerExchange (batch, CDC, real-time); pre-processing in PowerCenter (e.g. filter, join, cleanse); ingest into HDFS and Hive.]
  • PowerExchange – Data Export
    [Diagram: Extract Data → Post-Process → Deliver Data.
    Extract from HDFS; post-processing in PowerCenter (e.g. transform to target schema); delivery via PowerExchange (batch, real-time).
    Targets: web server; databases, data warehouse; ERP, CRM; mainframe.]
  • Informatica PowerExchange
    1. Create Ingest or Extract Mapping
    2. Create Hadoop Connection
    3. Configure Workflow
    4. Configure Hive Properties
  • There’s Always the Low-Tech Way…
    [Diagram labels: Hadoop, Processing, Hive, Local Disk, GPLoad, GreenPlum.]
  • [Diagram labels: BI/Analytics Tools, Enterprise Data Warehouse, Relational Databases, Flume, Data Import/Export, ETL Tools, Appliances, NoSQL.]
  • ETL Tools
  • ETL Tools
  • ETL – The Wikipedia Definition
    • Extract, transform and load (ETL) is a process in database usage and especially in data warehousing that involves:
      – Extracting data from outside sources
      – Transforming it to fit operational needs
      – Loading it into the end target (DB or data warehouse)
    http://en.wikipedia.org/wiki/Extract,_transform,_load
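The three steps in this definition can be sketched in a few lines. This is a toy illustration only: the CSV input, the table layout, and the use of an in-memory SQLite database as the "end target" are all assumptions made for the example, not anything from the talk.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical raw input standing in for an "outside source".
RAW_CSV = """user_id,country,amount
1,us,19.99
2,DE,5.00
3,us,not-a-number
"""


def extract(text: str) -> Iterator[Dict[str, str]]:
    """Extract: read raw records from the source."""
    return csv.DictReader(io.StringIO(text))


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Tuple[int, str, float]]:
    """Transform: normalize country codes and skip rows with a bad amount."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        yield int(row["user_id"]), row["country"].upper(), amount


def load(conn: sqlite3.Connection, records: Iterable[Tuple[int, str, float]]) -> None:
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", records)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(RAW_CSV)))
    print(conn.execute(
        "SELECT country, SUM(amount) FROM purchases GROUP BY country"
    ).fetchall())
```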
  • ETL Tools
    • Very common use case for Hadoop.
    • Most ETL in Hadoop is still done through plain old MapReduce.
    • Companies want to leverage their existing developer skills – many enterprises have armies of SQL and ETL developers.
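To show what "plain old MapReduce" ETL looks like in shape, here is a single-process sketch of the map, shuffle, and reduce phases applied to log cleaning. The log format and function names are hypothetical, and a real job would run distributed on the cluster (for example in Java MapReduce) rather than in one Python process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical log lines: timestamp, HTTP status, URL path.
LOG_LINES = [
    "2012-06-14T10:00:01 200 /search",
    "2012-06-14T10:00:02 500 /booking",
    "malformed line",
    "2012-06-14T10:00:03 200 /search",
]


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map: parse one raw line and emit (path, 1); drop malformed input."""
    parts = line.split()
    if len(parts) == 3 and parts[1].isdigit():
        yield parts[2], 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single output row."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return [reduce_counts(key, grouped[key]) for key in sorted(grouped)]


if __name__ == "__main__":
    for path, hits in run_job(LOG_LINES):
        print(f"{path}\t{hits}")
```

Even this small job needs hand-written parsing, grouping, and aggregation code, which is part of why enterprises with many SQL and ETL developers look for tools that hide it.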
  • Informatica HParser
    • Not exactly ETL – provides data transformation and parsing optimized for parallel processing on Hadoop.
    • Supports deeply hierarchical data and complex data formats.
    • Transformations are defined in a Windows UI and then deployed to a Hadoop cluster for execution.
  • HParser – How does it work?
    hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
    [Diagram label: HDFS]
    1. Develop a DT transformation
    2. Deploy the transformation to Hadoop
    3. Run DT on Hadoop to produce tabular data
    4. Analyze the data with HIVE / PIG / MapReduce / Other…
  • Pentaho
    • Existing BI tools extended to support Hadoop.
    • Not just ETL – also provides data import/export, job orchestration, reporting, and analysis functionality.
    • Supports integration with HDFS, Hive and HBase.
    • Community and Enterprise Editions offered.
  • Pentaho
    • Primary component is Pentaho Data Integration (PDI), also known as Kettle.
    • PDI provides a graphical drag-and-drop environment for defining ETL jobs, which interface with Java MapReduce to execute in-cluster transformations.
  • Other ETL Solutions
    • Talend
      – Also following an open-source model.
      – Extending their existing data integration tools to support Hadoop.
    • Pervasive RushAnalyzer
      – Software to build and run big data ETL, data transformation, mining and visualization on Hadoop.
  • [Diagram labels: BI/Analytics Tools, Enterprise Data Warehouse, Relational Databases, Flume, Data Import/Export, ETL Tools, Appliances, NoSQL.]
  • Business Intelligence/Analytics Tools
  • BI – The Forrester Research Definition
    “Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.” *
    * http://en.wikipedia.org/wiki/Business_intelligence
  • Business Intelligence/Analytics Tools
    [Diagram labels: Relational Databases, Data Warehouses, …]
  • Cloudera ODBC Driver
    • Most of these tools use the ODBC standard.
    • Since Hive is an SQL-like system it’s a good fit for ODBC.
    • ODBC driver for Hive is available, but has licensing issues.
    • Because of this, Cloudera developed its own drivers, available for free download.
    [Diagram labels: ODBC DRIVER, HIVEQL, HIVE SERVER, HIVE.]
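From a client's point of view, ODBC access to Hive might look like the sketch below. It assumes the third-party `pyodbc` package and a data source name already configured in the ODBC driver manager. `HiveDSN`, the user name, and the helper names are hypothetical. Passing `autocommit=True` is also an assumption, made because Hive is not a transactional database.

```python
from typing import List, Optional


def build_connection_string(dsn: str, user: Optional[str] = None) -> str:
    """Assemble an ODBC connection string from standard DSN/UID keywords."""
    parts = [f"DSN={dsn}"]
    if user:
        parts.append(f"UID={user}")
    return ";".join(parts)


def fetch_rows(dsn: str, query: str, user: Optional[str] = None) -> List[tuple]:
    """Run a HiveQL query through ODBC and return all rows."""
    # Imported lazily so the rest of the sketch works without the package.
    import pyodbc  # third-party; assumed to be installed alongside a Hive ODBC driver

    conn = pyodbc.connect(build_connection_string(dsn, user), autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Only the connection string is built here; no cluster is contacted.
    print(build_connection_string("HiveDSN", user="analyst"))
```

A BI tool does the equivalent internally: it sees a data source name, and the driver translates its ODBC calls into HiveQL sent to the Hive server.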
  • Hive ODBC Limitations
    • Hive does not have full SQL support.
    • Multi-user is currently not supported by Hive Server.
    • Poor support for security.
    • Dependent on Hive – data must be loaded in Hive to be available.
    • The Thrift API in the Hive Server doesn’t support common ODBC calls.
  • Hive ODBC Limitations
    The Hive community is working on Hive Server 2 to address some of these limitations:
    • Improved support for multiple users.
    • Improved support for ODBC and JDBC drivers.
    • And better support for security is coming.
  • MicroStrategy
  • Tableau
  • Other BI Connectors
    • Microsoft ODBC Driver
      – Part of the Hadoop on Windows solution.
      – Provides connectivity for MS BI tools such as Excel, PowerPivot, etc.
    • MapR ODBC driver
      – Support for standard ODBC based tools.
  • Analytic Tools
    – RHadoop project.
    – Integration of SAS analytics with Hadoop.
    – Integration of SAP HANA with Hadoop.
    – Toad for Cloud.
  • Hadoop-Specific Tools – Karmasphere
  • Hadoop-Specific Tools – Datameer
  • Example Integration
    [Diagram labels: Event Logs, HParser, Hive, PowerCenter/PowerExchange, Data Warehouse.]
    https://community.informatica.com/mpresources/Communities/IW2012/Docs/bos_65.pdf
  • Example – Migration of ETL
    [Diagram, before: Logs → Raw Tables → ETL (SQL) → Target Tables, within the Data Warehouse.
    After: Logs → Flume → HDFS → ETL (MapReduce) → Sqoop → Target Tables in the Data Warehouse.]
  • What’s Missing?
    • Better tools for ETL without coding.
    • Better tools for data governance, data quality, etc.
      – Ensuring that data in Hadoop complies with policies, rules, etc.
    • Integration with commercial enterprise schedulers/workflow engines.
      – Although open-source workflow schedulers exist (e.g. Oozie).
  • Conclusions
    • Hadoop integration is still in the early stages.
      – Expect to see new/better tools coming from both vendors and the open-source community.
    • Despite the relative immaturity of this space, there’s already a dizzying array of solutions available.
      – Choose solutions based on existing skills and tools already in use by your organization.
    • If using current BI tools integrated with Hive, keep in mind that enhancements for multi-user, security, etc. are on the way.
    • And it bears repeating: always use the right tool for the job.
      – Hadoop won’t replace your data warehouses and databases, but will complement them.
  • Thank You! Questions?
    http://www.cloudera.com/partners/spotlight/
    +1 (888) 789-1488
    cloudera.com
    twitter.com/cloudera
    sales@cloudera.com
    facebook.com/cloudera
  • Lunch!
    Lunch takes place in the Community Showcase (Hall 2).
    Sessions will resume at 1:30pm.