ODI11g, Hadoop and "Big Data" Sources



Presentation from the Rittman Mead BI Forum 2013 on ODI11g's Hadoop connectivity. Provides a background to Hadoop, HDFS and Hive, and explains how ODI11g and OBIEE use Hive to connect to "big data" sources.



  1. ODI11g, Hadoop and “Big Data”
     Mark Rittman, Technical Director, Rittman Mead
     Rittman Mead BI Forum 2013, Brighton & Atlanta
     T : +1 (888) 631-1410 E : inquiries@rittmanmead.com W: www.rittmanmead.com
     T : +44 (0) 8446 697 995 E : enquiries@rittmanmead.com W: www.rittmanmead.com
     Wednesday, 8 May 13
  2. Big Data, Hadoop and Unstructured Data Sources
     • “Big data” is the hot topic in BI, DW and Analytics circles
     • The ability to harness vast datasets, at a highly-granular level, by harnessing massively-parallel computing
     • Crunching loosely-structured and modelled datasets using simple algorithms: Map (project) + Reduce (agg)
     • Largely based around open-source projects, non-relational technologies
       ‣ Apache Hadoop
       ‣ MapReduce
       ‣ Hadoop Distributed File System
       ‣ Apache Hive, Sqoop, HBase etc
     • Emerging commercial vendors
       ‣ Cloudera
       ‣ Hortonworks etc
     • Can be used standalone, or linked to an enterprise DW/BI architecture
  3. Oracle’s Strategy for Business Analytics
     • Connect to all of your data, from all your sources,
     • Subject it to the full range of possible inquiry
     • Package solutions for known problems and fixed sources, and
     • Deploy to PCs and mobile devices, on premise or in the cloud
     [Diagram: On Premise, On Cloud, On Mobile | Any Data, Any Source | Full Range of Analytics | Integrated Analytic Apps]
  4. Connect to All of Your Data, From All of Your Sources
     • As well as traditional application and database file sources, unstructured sources and “big data” sources are within scope for business decision-making
       ‣ Data of great volume, great velocity and great variety
     [Diagram: Any Data, Any Source | Your Data: decisions based on your data | Big Data: decisions based on all data relevant to you | Transactions | Documents & Social Data | Machine-Generated Data]
  5. Oracle’s Big Data Products
     • Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing
       ‣ Cloudera Distribution of Hadoop
       ‣ Cloudera Manager
       ‣ Open-source R
       ‣ Oracle NoSQL Database Community Edition
       ‣ Oracle Enterprise Linux + Oracle JVM
     • Oracle Big Data Connectors
       ‣ Oracle Loader for Hadoop (Hadoop > Oracle RDBMS)
       ‣ Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS)
       ‣ Oracle Data Integration Adapter for Hadoop
       ‣ Oracle R Connector for Hadoop
     • Oracle NoSQL Database (column/key-store DB based on BerkeleyDB)
  6. ODI as Part of Oracle’s Big Data Strategy
     • ODI is the data integration tool for extracting data from Hadoop/MapReduce, and loading into Oracle Big Data Appliance, Oracle Exadata and Oracle Exalytics
     • Oracle Application Adapter for Hadoop provides required data adapters
       ‣ Load data into Hadoop from local filesystem, or HDFS (Hadoop clustered FS)
       ‣ Read data from Hadoop/MapReduce using Apache Hive (JDBC) and HiveQL, load into Oracle RDBMS using Oracle Loader for Hadoop
     • Supported by Oracle’s Engineered Systems
       ‣ Exadata
       ‣ Exalytics
       ‣ Big Data Appliance (w/Cloudera Hadoop Distrib)
  7. How ODI Accesses Hadoop and MapReduce
     • ODI accesses data in Hadoop clusters through Apache Hive
       ‣ Metadata and query layer over MapReduce
       ‣ Provides SQL-like language (HiveQL) and a metadata store (data dictionary)
       ‣ Provides a means to define “tables”, into which file data is loaded, and then queried via MapReduce
       ‣ Accessed via Hive JDBC driver (separate Hadoop install required on ODI server, for client libs)
     • Additional access through Oracle Direct Connector for HDFS and Oracle Loader for Hadoop
     [Diagram: ODI 11g sends HiveQL to the Hive Server on the Hadoop Cluster, which runs MapReduce; direct-path loads into Oracle RDBMS using Oracle Loader for Hadoop, transformation logic in MapReduce]
  8. Oracle Business Analytics and Big Data Sources
     • OBIEE 11g, and other Oracle Business Analytics tools, can also make use of big data sources
       ‣ Oracle Exalytics, through in-memory aggregates and InfiniBand connection to Exadata, can analyze vast (structured) datasets held in relational and OLAP databases
       ‣ Endeca Information Discovery can analyze unstructured and semi-structured sources
       ‣ InfiniBand connector to Big Data Appliance + Hadoop connector in OBIEE supports analysis via Map/Reduce
       ‣ Oracle R distribution + Oracle R Enterprise supports SAS-style statistical analysis of large data sets, as part of Oracle Advanced Analytics Option
       ‣ OBIEE can access a Hadoop datasource through another Apache technology called Hive
  9. OBIEE Access to Hadoop/Hive for BI Administration Tool RPD Creation
     • HiveODBC driver has to be installed into Windows environment, so that BI Administration tool can connect to Hive and return table metadata
     • Import as ODBC datasource, change physical DB type to Apache Hadoop afterwards
     • Note that OBIEE queries cannot span >1 Hive schema (no table prefixes)
  10. Set up ODBC Connection at the OBIEE Server (Linux Only)
      • OBIEE ships with HiveODBC drivers, need to use 7.x versions though
      • Configure the ODBC connection in odbc.ini, name needs to match RPD ODBC name
      • BI Server should then be able to connect to the Hive server, and Hadoop/MapReduce

          [ODBC Data Sources]
          AnalyticsWeb=Oracle BI Server
          Cluster=Oracle BI Server
          SSL_Sample=Oracle BI Server
          bigdatalite=Oracle 7.1 Apache Hive Wire Protocol

          [bigdatalite]
          Driver=/u01/app/Middleware/Oracle_BI1/common/ODBC/Merant/7.0.1/lib/ARhive27.so
          Description=Oracle 7.1 Apache Hive Wire Protocol
          ArraySize=16384
          Database=default
          DefaultLongDataBuffLen=1024
          EnableLongDataBuffLen=1024
          EnableDescribeParam=0
          Hostname=bigdatalite
          LoginTimeout=30
          MaxVarcharSize=2000
          PortNumber=10000
          RemoveColumnQualifiers=0
          StringDescribeType=12
          TransactionMode=0
          UseCurrentSchema=0
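The requirement that the DSN name in odbc.ini matches the ODBC name in the RPD can be checked before restarting the BI Server. The following is a minimal sketch, assuming a trimmed copy of the entries from the slide; the helper name `hive_dsn_settings` is hypothetical and not part of OBIEE.

```python
import configparser

# Trimmed copy of the odbc.ini entries shown on the slide.
ODBC_INI = """\
[ODBC Data Sources]
AnalyticsWeb=Oracle BI Server
bigdatalite=Oracle 7.1 Apache Hive Wire Protocol

[bigdatalite]
Driver=/u01/app/Middleware/Oracle_BI1/common/ODBC/Merant/7.0.1/lib/ARhive27.so
Database=default
Hostname=bigdatalite
PortNumber=10000
"""


def hive_dsn_settings(ini_text, rpd_dsn_name):
    """Return the settings of the DSN the RPD refers to, or raise KeyError.

    Hypothetical helper: it only checks that the name is listed under
    [ODBC Data Sources] and has its own section of the same name.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case; configparser lowercases by default
    parser.read_string(ini_text)
    if rpd_dsn_name not in parser["ODBC Data Sources"]:
        raise KeyError(f"DSN {rpd_dsn_name!r} is not listed under [ODBC Data Sources]")
    if not parser.has_section(rpd_dsn_name):
        raise KeyError(f"DSN {rpd_dsn_name!r} has no section of its own")
    return dict(parser[rpd_dsn_name])


settings = hive_dsn_settings(ODBC_INI, "bigdatalite")
print(settings["Hostname"], settings["PortNumber"])
```

A mismatch between the RPD connection pool name and the DSN then shows up as a `KeyError` rather than as a connection failure at query time.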
  11. Opportunities for OBIEE and ODI with Big Data Sources and Tools
      • Load data from a Hadoop/HDFS/NoSQL environment into a structured DW for analysis
      • Provide OBIEE as an alternative to Java coding or HiveQL for analysts
      • Leverage Hadoop & HDFS for massively-parallel staging-layer number crunching
      • Make use of low-cost, fault-tolerant hardware for parts of your BI platform
      • Provide the reporting and analysis for customers who have bought Oracle Big Data Appliance
  12. What is Hadoop?
      • Apache Hadoop is one of the most well-known Big Data technologies
      • Family of open-source products used to store and analyze distributed datasets
      • Hadoop is the enabling framework, automatically parallelises and co-ordinates jobs
        ‣ “Moves the compute to the data”
      • MapReduce is the programming framework for filtering, sorting and aggregating data
        ‣ Map : filter and interpret input data, create key/value pairs
        ‣ Reduce : summarise and aggregate
      • MapReduce jobs can be written in any language (Java etc), but it is complicated
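The Map and Reduce roles described above can be illustrated without a cluster. This is a single-process sketch in plain Python, not Hadoop code: the function names and the sample lines are assumptions made for illustration, and the explicit `shuffle` step stands in for the grouping by key that Hadoop performs between the two phases.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Summarise each key's list of values into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Map: interpret the input line and emit one (word, 1) pair per word.
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    # Reduce: aggregate the values emitted for one key.
    return sum(values)


lines = ["big data hadoop", "hadoop hive", "hive hadoop"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the data blocks, which is the point of "moving the compute to the data".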
  13. What is HDFS?
      • The filesystem behind Hadoop, used to store data for Hadoop analysis
        ‣ Unix-like, uses commands such as ls, mkdir, chown, chmod
      • Fault-tolerant, with rapid fault detection and recovery
      • High-throughput, with streaming data access and large block sizes
      • Designed for data-locality, placing data close to where it is processed
      • Accessed from the command-line, via internet (hdfs://), GUI tools etc

          [oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
          [oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle
          Found 5 items
          drwx------ - oracle hadoop 0 2013-04-27 16:48 /user/oracle/.staging
          drwxrwxrwx - oracle hadoop 0 2012-09-18 17:02 /user/oracle/moviedemo
          drwxrwxrwx - oracle hadoop 0 2012-10-17 15:58 /user/oracle/moviework
          drwxr-xr-x - oracle hadoop 0 2013-05-03 17:49 /user/oracle/my_stuff
          drwxr-xr-x - oracle hadoop 0 2012-08-10 16:08 /user/oracle/stage
  14. Hive as the Hadoop “Data Warehouse”
      • MapReduce jobs are typically written in Java, but Hive can make this simpler
      • Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
      • Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables
      • Approach used by ODI and OBIEE to gain access to Hadoop data
      • Allows Hadoop data to be accessed just like any other data source (sort of...)
  15. Hive Data and Metadata
      • Hive uses an RDBMS metastore to hold table and column definitions in schemas
      • Hive tables then map onto HDFS-stored files
        ‣ Managed tables
        ‣ External tables
      • Oracle-like query optimizer, compiler, executor
      • JDBC and ODBC drivers, plus CLI etc
      [Diagram: Hive Driver (Compile, Optimize, Execute) and Metastore sit over HDFS]
        ‣ Managed Tables (/user/hive/warehouse/): HDFS or local files loaded into Hive HDFS area, using HiveQL CREATE TABLE command
        ‣ External Tables (/user/oracle/, /user/movies/data/): HDFS files loaded into HDFS using external process, then mapped into Hive using CREATE EXTERNAL TABLE command
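The difference between the two table types comes down to the DDL used to define them. The sketch below builds both statements as strings; the table name, column list and tab delimiter are illustrative assumptions, while the `/user/movies/data/` location is the external path shown in the diagram.

```python
def create_table_ddl(name, columns, external_location=None):
    """Build a HiveQL CREATE TABLE statement for a tab-delimited text table.

    Hypothetical helper for illustration. With external_location set, the
    table is EXTERNAL and Hive only maps onto files already in that HDFS path;
    without it, the table is managed and lives under the Hive warehouse area.
    """
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    keyword = "CREATE EXTERNAL TABLE" if external_location else "CREATE TABLE"
    ddl = f"{keyword} {name} ({cols}) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
    if external_location:
        ddl += f" LOCATION '{external_location}'"
    return ddl


# Column names below are assumptions, not taken from the slides.
columns = [("user_id", "INT"), ("movie_id", "INT"), ("rating", "INT")]
managed = create_table_ddl("movie_ratings", columns)
external = create_table_ddl("movie_ratings_ext", columns, "/user/movies/data/")
print(managed)
print(external)
```

Dropping a managed table removes its files from the warehouse directory, whereas dropping an external table removes only the metastore entry and leaves the HDFS files in place.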
  16. Transforming HiveQL Queries into MapReduce Jobs
      • HiveQL queries are automatically translated into Java MapReduce jobs
      • Selection and filtering part becomes Map tasks
      • Aggregation part becomes the Reduce tasks

          SELECT a, sum(b)
          FROM myTable
          WHERE a<100
          GROUP BY a

      [Diagram: three Map Tasks feed two Reduce Tasks, which produce the Result]
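The split of the example query into map and reduce work can be mimicked in a few lines. This is a plain-Python sketch, not the Java that Hive generates: the three input splits and their rows are made-up sample data, and each split stands for the input of one Map task.

```python
from collections import defaultdict


def map_task(rows):
    """Map side: apply WHERE a < 100 and emit (a, b) pairs keyed on a."""
    return [(a, b) for a, b in rows if a < 100]


def reduce_task(pairs):
    """Reduce side: GROUP BY a with sum(b) over the pairs for each key."""
    totals = defaultdict(int)
    for a, b in pairs:
        totals[a] += b
    return dict(totals)


# Three input splits of myTable as (a, b) rows; values are illustrative.
my_table_splits = [
    [(1, 10), (2, 5), (150, 7)],
    [(1, 3), (2, 2)],
    [(200, 1), (3, 4)],
]

mapped = [pair for split in my_table_splits for pair in map_task(split)]
result = reduce_task(mapped)
print(result)
```

Rows with `a` of 150 and 200 never leave the map side, which is why pushing the filter into the Map tasks keeps the data shuffled to the reducers small.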
  17. An example Hive Query Session: Connect and Display Table List
      • Hive Server lists out all “tables” that have been defined within the Hive environment

          [oracle@bigdatalite ~]$ hive
          Hive history file=/tmp/oracle/hive_job_log_oracle_201304170403_1991392312.txt
          hive> show tables;
          OK
          dwh_customer
          dwh_customer_tmp
          i_dwh_customer
          ratings
          src_customer
          src_sales_person
          weblog
          weblog_preprocessed
          weblog_sessionized
          Time taken: 2.925 seconds
  18. An example Hive Query Session: Display Table Row Count
      • Request count(*) from table
      • Hive server generates MapReduce job to “map” table key/value pairs, and then reduce the results to table count
      • MapReduce job automatically run by Hive Server
      • Results returned to user

          hive> select count(*) from src_customer;
          Total MapReduce jobs = 1
          Launching Job 1 out of 1
          Number of reduce tasks determined at compile time: 1
          In order to change the average load for a reducer (in bytes):
            set hive.exec.reducers.bytes.per.reducer=
          In order to limit the maximum number of reducers:
            set hive.exec.reducers.max=
          In order to set a constant number of reducers:
            set mapred.reduce.tasks=
          Starting Job = job_201303171815_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0003
          Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
          2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
          2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
          2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
          2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
          Ended Job = job_201303171815_0003
          OK
          25
          Time taken: 22.21 seconds
  19. OBIEE and ODI Access to Hive, Leveraging MapReduce with no Java Coding
      • Requests in HiveQL arrive via HiveODBC, HiveJDBC or through the Hive command shell
      • JDBC and ODBC access requires Thrift server
        ‣ Provides RPC call interface over Hive for external procs
      • All queries then get parsed, optimized and compiled, then sent to Hadoop NameNode and Job Tracker
      • Then Hadoop processes the query, generating MapReduce jobs and distributing them to run in parallel across all data nodes
      • Hadoop access can still be performed procedurally if needed, typically coded by hand in Java, or through Pig, etc
        ‣ The equivalent of PL/SQL compared to SQL
        ‣ But Hive works well with the OBIEE/ODI paradigm
  20. Complementary Technologies: HDFS, Cloudera Manager, Hue, Beeswax etc
      • You can download your own Hive binaries, libraries etc from Apache Hadoop website
      • Or use pre-built VMs and distributions from the likes of Cloudera
        ‣ Cloudera CDH3/4 is used on Oracle Big Data Appliance
        ‣ Open-source + proprietary tools (Cloudera Manager)
      • Other tools for managing Hive, HDFS etc including
        ‣ Hue (HDFS file browser + management)
        ‣ Beeswax (Hive administration + querying)
      • Other complementary/required Hadoop tools
        ‣ Sqoop
        ‣ HDFS
        ‣ Thrift
  21. Demonstration
      Simple Data Selection and Querying using Hive on Cloudera CDH3
  22. ODI + Big Data Examples : Providing the Bridge Between Hadoop + OBIEE
      • OBIEE now has the ability to report against Hadoop data, via Hive
        ‣ Assumes that data is already loaded into the Hive warehouse tables
      • ODI therefore can be used to load the Hive tables, through either:
        ‣ Loading Hive from files
        ‣ Joining and loading from Hive-Hive
        ‣ Loading and transforming via shell scripts (python, perl etc)
      • ODI could also extract the Hive data and load into Oracle, if more appropriate
  23. Configuring ODI for Hadoop Connectivity
      • Obtain an installation of Hadoop/Hive from somewhere (Cloudera CDH3/4 for example)
      • Copy the following files into a temp directory, archive and transfer to ODI environment

          $HIVE_HOME/lib/*.jar
          $HADOOP_HOME/hadoop-*-core*.jar,
          $HADOOP_HOME/Hadoop-*-tools*.jar

        for example...

          /usr/lib/hive/lib/*.jar
          /usr/lib/hadoop-0.20/hadoop-*-core*.jar,
          /usr/lib/hadoop-0.20/Hadoop-*-tools*.jar

      • Copy JAR files into userlib directory and (standalone) agent lib directory

          c:\Users\Administrator\AppData\Roaming\odi\oracledi\userlib

      • Restart ODI Studio
  24. Registering HDFS and Hive Sources and Targets in the ODI Topology
      • For Hive sources and targets, use Hive technology
        ‣ JDBC Driver : Apache Hive JDBC Driver
        ‣ JDBC URL : jdbc:hive://[server_name]:10000/default
        ‣ (Flexfield Name) Hive Metastore URIs : thrift://[server_name]:10000
      • For HDFS sources, use File technology
        ‣ JDBC URL : hdfs://[server_name]:port
        ‣ Special HDFS “trick” to use File tech (no specific HDFS technology)
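The URL patterns above differ only in the server name and port, so they can be generated rather than typed by hand for each environment. A minimal sketch follows; the function names are hypothetical, the defaults of port 10000 and database `default` are the values shown on the slide, and the HDFS port 8020 in the usage line is an assumption because the slide leaves the port unspecified.

```python
def hive_topology_settings(server_name, port=10000, database="default"):
    """Return the Hive data server values in the patterns shown on the slide."""
    return {
        "jdbc_url": f"jdbc:hive://{server_name}:{port}/{database}",
        "metastore_uris": f"thrift://{server_name}:{port}",
    }


def hdfs_file_url(server_name, port):
    """Return the hdfs:// URL entered for a File technology data server."""
    return f"hdfs://{server_name}:{port}"


hive = hive_topology_settings("bigdatalite")
print(hive["jdbc_url"])
print(hive["metastore_uris"])
print(hdfs_file_url("bigdatalite", 8020))  # 8020 is an assumed NameNode port
```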
  25. Reverse Engineering Hive, HDFS and Local File Datastores + Models
      • Hive tables reverse-engineer just like regular tables
      • Define model in Designer navigator, uses Hive RKM to retrieve table metadata
      • Information on Hive-specific metadata stored in flexfields
        ‣ Hive Buckets
        ‣ Hive Partition Column
        ‣ Hive Cluster Column
        ‣ Hive Sort Column
  26. Demonstration
      ODI Configured for Hadoop Access, with Hive/HDFS sources and targets registered
  27. ODI Application Adapter for Hadoop KMs
      • Application Adapter (pay-extra option) for Hadoop connectivity
      • Works for both Windows and Linux installs of ODI Studio
        ‣ Need to source HiveJDBC drivers and JARs from separate Hadoop install
      • Provides six new knowledge modules
        ‣ IKM File to Hive (Load Data)
        ‣ IKM Hive Control Append
        ‣ IKM Hive Transform
        ‣ IKM File-Hive to Oracle (OLH)
        ‣ CKM Hive
        ‣ RKM Hive
  28. Oracle Loader for Hadoop
      • Oracle technology for accessing Hadoop data, and loading it into an Oracle database
      • Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce
      • Direct-path loads into Oracle Database, partitioned and non-partitioned
      • Online and offline loads
      • Key technology for fast load of Hadoop results into Oracle DB
  29. IKM File to Hive (Load Data): Loading of Hive Tables from Local File or HDFS
      • Uses the Hive Load Data command to load from local or HDFS files
        ‣ Calls Hadoop FS commands for simple copy/move into/around HDFS
        ‣ Commands generated by ODI through IKM File to Hive (Load Data)

          hive> load data inpath '/user/oracle/movielens_src/u.data'
              > overwrite into table movie_ratings;
          Loading data to table default.movie_ratings
          Deleted hdfs://localhost.localdomain/user/hive/warehouse/movie_ratings
          OK
          Time taken: 0.341 seconds
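The statement in the session above follows Hive's `LOAD DATA [LOCAL] INPATH ... [OVERWRITE] INTO TABLE ...` form. The sketch below assembles that statement from its parts to show which options change the result; it is an illustration only and not the script template the knowledge module actually uses, and the function name is hypothetical.

```python
def load_data_statement(path, table, local=False, overwrite=False):
    """Build a HiveQL LOAD DATA statement.

    local=True reads from the local filesystem of the Hive client;
    otherwise the path is an HDFS path and the file is moved into the table.
    overwrite=True replaces the table's existing contents.
    """
    parts = ["LOAD DATA"]
    if local:
        parts.append("LOCAL")
    parts.append(f"INPATH '{path}'")
    if overwrite:
        parts.append("OVERWRITE")
    parts.append(f"INTO TABLE {table}")
    return " ".join(parts)


statement = load_data_statement(
    "/user/oracle/movielens_src/u.data", "movie_ratings", overwrite=True
)
print(statement)
```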
  30. IKM File to Hive (Load Data): Loading of Hive Tables from Local File or HDFS
      • IKM File to Hive (Load Data) generates the required HiveQL commands using a script template
      • Executed over HiveJDBC interface
      • Success/Failure/Warning returned to ODI
  31. Load Data and Hadoop SerDe (Serializer-Deserializer) Transformations
      • Hadoop SerDe transformations can be accessed, for example to transform weblogs
      • Hadoop interface that contains:
        ‣ Deserializer - converts incoming data into Java objects for Hive manipulation
        ‣ Serializer - takes Hive Java objects & converts to output for HDFS
      • Library of SerDe transformations readily available for use with Hive
      • Use the OVERRIDE_ROW_FORMAT option in IKM to override regular column mappings in Mapping tab
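The two halves of a SerDe can be pictured with a weblog line. Real SerDes are Java classes loaded by Hive; the Python below is only an analogy, assuming the Apache common log format and a made-up column list, to show what "deserialize into columns" and "serialize back to a row" mean.

```python
import re

# Assumed layout: Apache common log format.
LOG_PATTERN = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')
COLUMNS = ("host", "identity", "user", "time", "request", "status", "size")


def deserialize(line):
    """Turn one raw log line into a column -> value mapping, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


def serialize(row):
    """Turn a column -> value mapping back into a tab-delimited output row."""
    return "\t".join(row[col] for col in COLUMNS)


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
row = deserialize(sample)
print(row["request"], row["status"])
print(serialize(row))
```

Lines that do not match come back as `None` here; how a real SerDe treats unparseable input depends on the SerDe in question.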
  32. IKM Hive Control Append: Loading, Joining & Filtering Between Hive Tables
      • Hive source and target, transformations according to HiveQL functionality (aggregations, functions etc)
      • Ability to join data sources
      • Other data sources can be used, but will involve staging tables and additional KMs (as per any multi-source join)
  33. IKM Hive Transform: Use Custom Shell Scripts to Integrate into Hive Table
      • Gives developer the ability to transform data programmatically using Python, Perl etc scripts
      • Options to map output of script to columns in Hive table
      • Useful for more programmatic and complex data transformations
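Scripts used with Hive's transform mechanism read rows from standard input as tab-separated text and write tab-separated rows to standard output, which Hive then maps onto the target columns. A minimal sketch of such a script follows; the three input columns and the high/low rule are assumptions for illustration, and the in-memory streams stand in for stdin and stdout so the example runs without Hive.

```python
import io


def transform_row(line):
    """Turn 'user_id<TAB>movie_id<TAB>rating' into 'user_id<TAB>movie_id<TAB>label'."""
    user_id, movie_id, rating = line.rstrip("\n").split("\t")
    label = "high" if int(rating) >= 4 else "low"
    return "\t".join((user_id, movie_id, label))


def run(stream_in, stream_out):
    """Process every non-empty input row; as a script this would be run(sys.stdin, sys.stdout)."""
    for line in stream_in:
        if line.strip():
            stream_out.write(transform_row(line) + "\n")


# Stand-ins for the rows Hive would stream to the script.
sample_in = io.StringIO("1\t50\t5\n2\t60\t2\n")
sample_out = io.StringIO()
run(sample_in, sample_out)
print(sample_out.getvalue(), end="")
```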
  34. IKM File-Hive to Oracle: Extract from Hive into Oracle Tables
      • Uses Oracle Loader for Hadoop (OLH) to process any filtering, aggregation, transformation in Hadoop, using MapReduce
      • OLH part of Oracle Big Data Connectors (additional cost)
      • High-performance loader into Oracle DB
      • Optional sort by primary key, pre-partitioning of data
      • Can utilise the two OLH loading modes:
        ‣ JDBC or OCI direct load into Oracle
        ‣ Unload to files, Oracle DP into Oracle DB
  35. Demonstration
      Data Integration Tasks using ODIAAH Hadoop KMs
  36. NoSQL Data Sources and Targets with ODI 11g
      • No specific technology or driver for NoSQL databases, but can use Hive external tables
      • Requires a specific “Hive Storage Handler” for key/value store sources
        ‣ Hive feature for accessing data from other DB systems, for example MongoDB, Cassandra
        ‣ For example, https://github.com/vilcek/HiveKVStorageHandler
      • Additionally needs Hive collect_set aggregation method to aggregate results
        ‣ Has to be defined in Languages panel in Topology
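Hive's `collect_set` aggregate gathers the distinct values seen for each group into a single collection. The plain-Python sketch below mimics that behaviour on a few made-up key/value rows; it keeps first-seen order for readability, whereas Hive itself makes no ordering promise.

```python
def collect_set(rows):
    """Group (key, value) rows by key and keep each distinct value once per key."""
    grouped = {}
    for key, value in rows:
        values = grouped.setdefault(key, [])
        if value not in values:
            values.append(value)
    return grouped


# Illustrative key/value rows, as they might come from a key/value store.
rows = [("cust1", "gold"), ("cust1", "gold"), ("cust1", "web"), ("cust2", "store")]
print(collect_set(rows))
```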
  37. Pig, Sqoop and other Hadoop Technologies, and Hive
      • Future versions of ODI might use other Hadoop technologies
        ‣ Apache Sqoop for bulk transfer between Hadoop and RDBMSs
      • Other technologies are not such an obvious fit
        ‣ Apache Pig - the equivalent of PL/SQL for Hive’s SQL
      • Commercial vendors may produce “better” versions of Hive, MapReduce etc
        ‣ Cloudera Impala - more “real-time” version of Hive
        ‣ MapR - solves many current issues with MapReduce, 100% Hadoop API compatibility
      • Watch this space...!
  38. ODI11g, Hadoop and “Big Data”
      Mark Rittman, Technical Director, Rittman Mead
      Rittman Mead BI Forum 2013, Brighton & Atlanta