Many examples of users using Hadoop for analysis/mining
Example - social networking data – need to identify the same user across multiple applications
Map/reduce functions – powerful – low level
Informatica is the leader
>>>>>>>>>>>> You may wonder: How? Why do so many people use Informatica tools?
Historical perspective – proliferation of data sources over time, data fragmentation
Developer productivity is due to higher-level abstraction, built-in transformations and re-use.
Vendor neutrality is not at the cost of performance. It gives the flexibility to move to a different vendor easily.
The challenge of being productive with Hadoop is similar.
>>>>>>> Let’s make this more concrete with a very simple example.
This could be sales data that you want to analyze.
PIG script calls the lookup as a UDF
Appealing for somebody familiar with PIG.
>>>>>>>>>>>>>> Or for somebody familiar with Informatica
This is more appealing for Informatica users.
>>>>>> We started prototyping with these ideas.
Choice of PIG
Appeal to 2 different user segments
>>>>>>>> Next I will go into some implementation details.
A mapplet may be treated as a source, target or a function
>>>>>> Just a few more low-level details for the Hadoop hackers
PIG should have array execution for UDFs
Ideally don’t want the runtime to access Informatica domain
Distcache seems like the right solution
Works for native libs
Some problems with jars
Address doctor supports 240 countries!
>>>>>>>>>> Next we will look at mapping deployment
Registering each individual jar is tedious & error prone; also, PIG re-packs everything together, which overwrites files
Top-level jar with other jars on class-path -- need to have the jars distributed preserving the dir structure
Rules out mapred.cache.files
Problem with mapred.cache.archives (can’t add top-level jar to classpath - mapred.job.classpath.files entries must be from mapred.cache.files)
Problem with mapred.child.java.opts (can’t add to java.class.path but can add to java.library.path)
Leveraging PIG – saves us a lot of work, avoids re-inventing the wheel
>>>>> Details of conversion
So many similarities
Note dummy files in load & store
The concept could be generalized – currently, the parallelism is a problem
>>>>>>> Let’s look at an example
Anybody curious what this translates into?
>>>>>>>>>>> Some implementation details.
>>>>>>>>>>>Where are we going with all this?
Sqoop adapters
Reader/writers to allow HDFS sources/targets
>>>>>>>> To summarize
Any quick questions?
>>>>>>>>>>> I didn’t mention some non-trivial extras
Data Integration on Hadoop<br />Sanjay Kaluskar<br />Senior Architect, Informatica<br />Feb 2011<br />
Introduction<br />Challenges<br />Results of analysis or mining are only as good as the completeness & quality of underlying data<br />Need for the right level of abstraction & tools<br />Data integration & data quality tools have tackled these challenges for many years!<br />More than 4,200 enterprises worldwide rely on Informatica<br />
Could use PIG/Hive to leverage the join operator
Implement Java code to look up the database table
Need to use access method based on the vendor<br />Or… leverage Informatica’s Lookup<br />a = load 'RelLookupInput.dat' as (deptid: double);<br />b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));<br />store b into 'RelLookupOutput.out';<br />
Or… you could start with a mapping<br />STORE<br />Filter<br />Load<br />
Goals of the prototype<br />Enable Hadoop developers to leverage Data Transformation and Data Quality logic<br />Ability to invoke mapplets from Hadoop<br />Lower the barrier to Hadoop entry by using Informatica Developer as the toolset<br />Ability to run a mapping on Hadoop<br />
Mapplet Invocation<br />Generation of the UDF of the right type<br />Output-only mapplet → Load UDF<br />Input-only mapplet → Store UDF<br />Input/output mapplet → Eval UDF<br />Packaging into a jar<br />Compiled UDF<br />Other meta-data: connections, reference tables<br />Invokes Informatica engine (DTM) at runtime<br />
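The rule for picking the UDF type can be written down in a few lines. The following is only a sketch of the mapping stated on the slide; the names `UdfKind` and `MappletUdfChooser` are hypothetical and belong to neither Informatica's nor PIG's API, and the real generator presumably inspects mapplet metadata rather than two booleans.

```java
// Hypothetical names throughout; this only illustrates the slide's rule.
enum UdfKind { LOAD, STORE, EVAL }

final class MappletUdfChooser {
    private MappletUdfChooser() {}

    /**
     * Picks the PIG UDF kind from the mapplet's ports:
     * output-only -> Load UDF, input-only -> Store UDF, input/output -> Eval UDF.
     */
    static UdfKind choose(boolean hasInputs, boolean hasOutputs) {
        if (hasInputs && hasOutputs) {
            return UdfKind.EVAL;   // behaves like a function on tuples
        }
        if (hasOutputs) {
            return UdfKind.LOAD;   // behaves like a source
        }
        if (hasInputs) {
            return UdfKind.STORE;  // behaves like a target
        }
        throw new IllegalArgumentException("mapplet has neither inputs nor outputs");
    }
}
```

For example, `MappletUdfChooser.choose(true, true)` yields `UdfKind.EVAL`, matching the lookup mapplet that the earlier PIG script calls as a function.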
Mapplet Invocation (contd.)<br />Challenges<br />UDF execution is per-tuple; mapplets are optimized for batch execution<br />Connection info/reference data need to be plugged in<br />Runtime dependencies: 280 jars, 558 native dependencies<br />Benefits<br />PIG user can leverage Informatica functionality<br />Connectivity to many (50+) data sources<br />Specialized transformations<br />Re-use of already developed logic<br />
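To make the per-tuple versus batch mismatch concrete, here is a minimal sketch of a buffering adapter that collects rows and hands them to a batch-oriented engine in groups. It is an illustration under stated assumptions, not the prototype's implementation: `BatchingAdapter` is a hypothetical name, and the engine is modelled as a plain function from a list of rows to a list of results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical adapter; the batch engine is a stand-in for a batch-optimized runtime.
final class BatchingAdapter<I, O> {
    private final int batchSize;
    private final Function<List<I>, List<O>> batchEngine;
    private final List<I> pending = new ArrayList<>();

    BatchingAdapter(int batchSize, Function<List<I>, List<O>> batchEngine) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.batchEngine = batchEngine;
    }

    /** Buffers one row; returns results only when a full batch has been processed. */
    List<O> accept(I row) {
        pending.add(row);
        return pending.size() >= batchSize ? flush() : new ArrayList<>();
    }

    /** Sends any buffered rows to the engine; call once more at end of input. */
    List<O> flush() {
        if (pending.isEmpty()) {
            return new ArrayList<>();
        }
        List<O> out = batchEngine.apply(new ArrayList<>(pending));
        pending.clear();
        return out;
    }
}
```

The sketch also shows why this is a challenge rather than a quick fix: results lag behind inputs, so a caller that expects one output per call (as a per-tuple UDF does) cannot use the adapter directly.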
Mapping Deployment: Idea<br />Leverage PIG<br />Map to equivalent operators where possible<br />Let the PIG compiler optimize & translate to Hadoop jobs<br />Wraps some transformations as UDFs<br />Transformations with no equivalents, e.g., standardizer, address validator<br />Transformations with richer functionality, e.g., case-insensitive sorter<br />
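The idea can be sketched as a small translator that walks a linear mapping and emits PIG Latin, using native operators where they exist and a UDF call otherwise. Everything here is an assumption made for illustration: the `Step`, `StepKind` and `PigScriptSketch` names, the generated aliases, and the UDF class name in the usage note are hypothetical, and a real mapping is a graph with schemas rather than a flat list.

```java
import java.util.List;

// Hypothetical step kinds for a toy linear mapping; not Informatica's object model.
enum StepKind { LOAD, FILTER, WRAPPED_UDF, STORE }

final class Step {
    final StepKind kind;
    final String arg; // path, filter condition, or UDF class name (all illustrative)

    Step(StepKind kind, String arg) {
        this.kind = kind;
        this.arg = arg;
    }
}

final class PigScriptSketch {
    private PigScriptSketch() {}

    /** Emits one PIG Latin statement per step, chaining generated aliases r0, r1, ... */
    static String translate(List<Step> steps) {
        StringBuilder script = new StringBuilder();
        String prev = null; // alias of the most recent relation
        int next = 0;       // counter for generated aliases
        for (Step s : steps) {
            if (s.kind != StepKind.LOAD && prev == null) {
                throw new IllegalStateException(s.kind + " needs an upstream relation");
            }
            String alias = "r" + next;
            switch (s.kind) {
                case LOAD:
                    script.append(alias).append(" = LOAD '").append(s.arg).append("';\n");
                    break;
                case FILTER: // native operator with a direct equivalent
                    script.append(alias).append(" = FILTER ").append(prev)
                          .append(" BY ").append(s.arg).append(";\n");
                    break;
                case WRAPPED_UDF: // no native equivalent: wrap the transformation as a UDF
                    script.append(alias).append(" = FOREACH ").append(prev)
                          .append(" GENERATE FLATTEN(").append(s.arg).append("(*));\n");
                    break;
                case STORE:
                    script.append("STORE ").append(prev)
                          .append(" INTO '").append(s.arg).append("';\n");
                    continue; // STORE produces no new relation
            }
            prev = alias;
            next++;
        }
        return script.toString();
    }
}
```

Feeding it a Load, a Filter on `$0 > 10`, a wrapped transformation such as the made-up `com.example.pig.AddressValidator`, and a Store produces four statements, after which the PIG compiler is left to optimize and translate them into Hadoop jobs.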
Mapping Deployment<br />Design<br />Leverages PIG operators where possible<br />Wraps other transformations as UDFs<br />Relies on optimization by the PIG compiler<br />Challenges<br />Finding equivalent operators and expressions<br />Limitations of the UDF model – no notion of a user defined operator<br />Benefits<br />Re-use of already developed logic<br />Easy way for Informatica users to start using Hadoop simultaneously; can also use the designer<br />
Hadoop Connector for Enterprise data access<br />Opens up all the connectivity available from Informatica for Hadoop processing<br />Sqoop-based connectors<br />Hadoop sources & targets in mappings<br />Benefits<br />Load data from Enterprise data sources into Hadoop<br />Extract summarized data from Hadoop to load into DW and other targets<br />Data federation<br />
Data Node<br />PIG<br />Script<br />HDFS<br />UDF<br />Informatica eDTM<br />Mapplets<br />Complex Transformations:<br /> Addr Cleansing<br /> Dedup/Matching<br /> Hierarchical data parsing<br />Enterprise Data Access<br />Informatica Developer tool for Hadoop<br />Metadata<br />Repository<br />Informatica developer builds Hadoop mappings and deploys to Hadoop cluster<br />Informatica Hadoop Developer<br />Mapping PIG<br />script<br />eDTM<br />Mapplets etc.<br />PIG UDF<br />Informatica Developer<br />Hadoop<br />Designer<br />