Integrating Hadoop Into the Enterprise – Hadoop Summit 2012

A look at common patterns being applied to leverage Hadoop with traditional data management systems and the emerging landscape of tools which provide access and analysis of Hadoop data with existing systems such as data warehouses, relational databases, and business intelligence tools.

Transcript

  • 1. Integrating Hadoop into the Enterprise. Jonathan Seidman, Hadoop Summit 2012, June 14th, 2012
  • 2. Who I Am • Solutions Architect, Partner Engineering Team. • Co-founder of the Chicago Hadoop User Group and co-founder/organizer of Chicago Big Data. • jseidman@cloudera.com • @jseidman • cloudera.com/careers
  • 3. What I'll Be Talking About • Some background. • Common uses of Hadoop in an enterprise data infrastructure. • Hadoop integration – the big picture. • Deeper dive: – Data import/export: moving data between Hadoop and existing data stores. – ETL tools. – Business intelligence (BI) and analytic tools. • Example architectures and data flows. • Conclusions.
  • 4. My Life Before Cloudera…
  • 5. Hadoop at Orbitz [chart comparing queries vs. searches as a percentage over time; numeric axis values omitted]
  • 6. But Hadoop Was An Isolated System [diagram: Developers, Business Analysts, Normal Users, Humans]
  • 7. Hadoop + the Data Warehouse…
  • 8. …Enabled New Analyses
  • 9. "In our opinion, integration with existing IT systems and software is critical, as we know enterprises will not be replacing these technologies anytime soon. For Hadoop platforms this means integration with existing databases, data warehouses, and business-analytics and business-visualization tools." * A near-term outlook for big data, Jo Maitland, GigaOM Pro, March 2012
  • 10. What Can We Do? • ETL – Scalable ETL allows companies to meet SLAs (inexpensively). – Agile – facilitates rapid modifications. • Moving analysis off of existing systems. • Sandbox for exploratory analytics. • Using Hadoop as an active archive. • Joining transactional data from a DB with interaction data. • Common theme: freeing up existing systems for tasks they're better suited for.
  • 11. [Diagram: the big picture of Hadoop integration – BI/Analytics Tools, Enterprise Data Warehouse, Relational Databases, Flume, Data Import/Export, ETL Tools, Appliances, NoSQL]
  • 12. Data Import/Export [diagram highlighting the Enterprise Data Warehouse and Relational Databases]
  • 13. Sqoop Overview • Apache project designed to ease import and export of data between Hadoop and relational databases. • Provides functionality to do bulk imports and exports of data with HDFS, Hive and HBase. • Java based. Leverages MapReduce to transfer data in parallel.
  • 14. Sqoop Overview • Uses a "connector" abstraction. • Two types of connectors: – Standard connectors are JDBC based. – Direct connectors use native database interfaces to improve performance. • Direct connectors are available for many open-source and commercial databases – MySQL, PostgreSQL, Oracle, SQL Server, Teradata, etc.
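A minimal sketch of what such an import looks like from the command line; the connection string, table, and target directory are illustrative assumptions, not values from the talk:

    # Hypothetical example: pull a MySQL table into HDFS with four parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username scott -P \
      --table transactions \
      --target-dir /data/sales/transactions \
      --num-mappers 4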
  • 15. Sqoop Import Flow [diagram: the client runs the import; Sqoop collects metadata from the database, generates code, and executes a MapReduce job; the map tasks pull data in parallel and write it to Hadoop]
  • 16. Sqoop Limitations Sqoop has some limitations, including: • Poor support for security, e.g. $ sqoop import --username scott --password tiger … – Sqoop can read command-line options from an options file, but this still has holes. • Error-prone syntax. • Tight coupling to the JDBC model – not a good fit for non-RDBMS systems.
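For context, an options file only moves the credentials out of the command line and shell history; the password still sits in plain text on disk. A sketch, with a hypothetical file name and values:

    # import.txt: one option or value per line
    import
    --connect
    jdbc:mysql://dbhost/sales
    --username
    scott
    --password
    tiger

    $ sqoop --options-file import.txt --table transactions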
  • 17. Fortunately… Sqoop 2 (incubating) will address many of these limitations: • Adds a web-based GUI. • Centralized configuration. • More flexible model. • Improved security model.
  • 18. Informatica PowerExchange • Not just RDBMS integration – provides consistent, native integration between Hadoop and a range of data sources: databases, legacy systems, standard file formats, CRM… • Integrated with PowerCenter for pre/post-processing of data, administration, and metadata management.
  • 19. PowerExchange – Data Import [diagram: access data from web servers, databases, data warehouses, message queues, email, social media, ERP, CRM, and mainframes via PowerExchange (batch, CDC, real-time); pre-process in PowerCenter (e.g. filter, join, cleanse); ingest into HDFS and Hive]
  • 20. PowerExchange – Data Export [diagram: extract data from HDFS; post-process in PowerCenter (e.g. transform to target schema); deliver via PowerExchange (batch, real-time) to web servers, databases, data warehouses, ERP, CRM, and mainframes]
  • 21. Informatica PowerExchange 1. Create ingest or extract mapping. 2. Create Hadoop connection. 3. Configure workflow. 4. Configure Hive properties.
  • 22. There's Always the Low-Tech Way… [diagram: Hadoop/Hive processing writes to local disk, GPLoad loads the files into Greenplum]
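A minimal sketch of that low-tech flow, assuming a Hive query dumped to local disk and then loaded with Greenplum's gpload utility; the query, paths, and control file are hypothetical:

    # Dump a Hive result set to a local tab-separated file
    hive -e 'SELECT session_id, page, ts FROM web_logs' > /data/export/web_logs.tsv

    # Load it into Greenplum; sessions_load.yml is a hypothetical gpload control file
    # pointing at /data/export/web_logs.tsv and the target table
    gpload -f sessions_load.yml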
  • 23. [The big-picture integration diagram again, now focusing on ETL Tools]
  • 24. ETL Tools
  • 25. ETL Tools
  • 26. ETL – The Wikipedia Definition • Extract, transform and load (ETL) is a process in database usage and especially in data warehousing that involves: – Extracting data from outside sources. – Transforming it to fit operational needs. – Loading it into the end target (DB or data warehouse). http://en.wikipedia.org/wiki/Extract,_transform,_load
  • 27. ETL Tools • Very common use case for Hadoop. • Most ETL in Hadoop is still done through plain old MapReduce. • Companies want to leverage their existing developer skills – many enterprises have armies of SQL and ETL developers.
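To illustrate the SQL-skills point, here is a minimal HiveQL ETL sketch run from the shell; the tables, columns, and partition value are made up for illustration:

    # Hypothetical example: reshape raw logs into a partitioned, query-ready table with SQL-style ETL
    hive -e "
      INSERT OVERWRITE TABLE clean_logs PARTITION (dt='2012-06-14')
      SELECT user_id, lower(page), ts
      FROM raw_logs
      WHERE dt = '2012-06-14' AND user_id IS NOT NULL;
    "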
  • 28. Informatica HParser • Not exactly ETL – provides data transformation and parsing optimized for parallel processing on Hadoop. • Supports deeply hierarchical data and complex data formats. • Transformations are defined in a Windows UI and then deployed to a Hadoop cluster for execution.
  • 29. HParser – How does it work? hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt (run against HDFS) 1. Develop a DT transformation. 2. Deploy the transformation to Hadoop. 3. Run DT on Hadoop to produce tabular data. 4. Analyze the data with Hive / Pig / MapReduce / other…
  • 30. Pentaho • Existing BI tools extended to support Hadoop. • Not just ETL – also provides data import/export, job orchestration, reporting, and analysis functionality. • Supports integration with HDFS, Hive and HBase. • Community and Enterprise Editions offered.
  • 31. Pentaho • Primary component is Pentaho Data Integration (PDI), also known as Kettle. • PDI provides a graphical drag-and-drop environment for defining ETL jobs, which interface with Java MapReduce to execute in-cluster transformations.
  • 32. Other ETL Solutions • Talend – Also following an open-source model. – Extending their existing data integration tools to Hadoop. • Pervasive RushAnalyzer – Software to build and run big data ETL, data transformation, mining and visualization on Hadoop.
  • 33. [The big-picture integration diagram again, now focusing on BI/Analytics Tools]
  • 34. Business Intelligence/Analytics Tools
  • 35. BI – The Forrester Research Definition "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making." * http://en.wikipedia.org/wiki/Business_intelligence
  • 36. Business Intelligence/Analytics Tools [diagram: BI tools alongside relational databases, data warehouses, …]
  • 37. Cloudera ODBC Driver • Most of these tools use the ODBC standard. • Since Hive is an SQL-like system, it's a good fit for ODBC. • An ODBC driver for Hive is available, but has licensing issues. • Because of this, Cloudera developed its own drivers, available for free download. [diagram: ODBC driver → HiveQL → Hive Server → Hive]
  • 38. Hive ODBC Limitations • Hive does not have full SQL support. • Multi-user is currently not supported by Hive Server. • Poor support for security. • Dependent on Hive – data must be loaded into Hive to be available. • The Thrift API in the Hive Server doesn't support common ODBC calls.
  • 39. Hive ODBC Limitations The Hive community is working on Hive Server 2 to address some of these limitations: • Improved support for multiple users. • Improved support for ODBC and JDBC drivers. • And better support for security is coming.
  • 40. MicroStrategy
  • 41. Tableau
  • 42. Other BI Connectors • Microsoft ODBC Driver – Part of the Hadoop on Windows solution. – Provides connectivity for MS BI tools such as Excel, PowerPivot, etc. • MapR ODBC driver – Support for standard ODBC-based tools.
  • 43. Analytic Tools – RHadoop project. – Integration of SAS analytics with Hadoop. – Integration of SAP HANA with Hadoop. – Toad for Cloud.
  • 44. Hadoop-Specific Tools – Karmasphere
  • 45. Hadoop-Specific Tools – Datameer
  • 46. Example Integration [diagram: event logs → HParser → Hive → PowerCenter/PowerExchange → data warehouse] https://community.informatica.com/mpresources/Communities/IW2012/Docs/bos_65.pdf
  • 47. Example – Migration of ETL [diagram, before: logs → raw tables → ETL (SQL) → target tables in the data warehouse; after: logs → Flume → HDFS → ETL (MapReduce) → Sqoop → target tables in the data warehouse]
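For the final hop in that migrated flow, a minimal Sqoop export sketch; the table name, HDFS directory, and connection string are assumptions for illustration:

    # Hypothetical example: push the MapReduce ETL output from HDFS into a warehouse table
    sqoop export \
      --connect jdbc:postgresql://dwhost/warehouse \
      --username etl_user -P \
      --table target_sessions \
      --export-dir /data/etl/output/sessions \
      --input-fields-terminated-by '\t'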
  • 48. What's Missing? • Better tools for ETL without coding. • Better tools for data governance, data quality, etc. – Ensuring that data in Hadoop complies with policies, rules, etc. • Integration with commercial enterprise schedulers/workflow engines. – Although open-source workflow schedulers exist (e.g. Oozie).
  • 49. Conclusions • Hadoop integration is still in the early stages. – Expect to see new/better tools coming from both vendors and the open-source community. • Despite the relative immaturity of this space, there's already a dizzying array of solutions available. – Choose solutions based on existing skills and tools already in use by your organization. • If using current BI tools integrated with Hive, keep in mind that enhancements for multi-user support, security, etc. are on the way. • And it bears repeating: always use the right tool for the job. – Hadoop won't replace your data warehouses and databases, but will complement them.
  • 50. Thank You! Questions? http://www.cloudera.com/partners/spotlight/ • +1 (888) 789-1488 • cloudera.com • sales@cloudera.com • twitter.com/cloudera • facebook.com/cloudera