Data Integration on Hadoop
Sanjay Kaluskar
Senior Architect, Informatica
Feb 2011

Presented at the Hadoop User Summit, Bangalore, Feb 16, 2011.

Speaker notes

• Many examples of users using Hadoop for analysis/mining. Example: social networking data, where the same user must be identified across multiple applications. Map/reduce functions are powerful but low level. Informatica is the leader. (Segue: you may wonder how, and why so many people use Informatica tools.)
• Historical perspective: proliferation of data sources over time, and data fragmentation. Developer productivity comes from higher-level abstraction, built-in transformations, and re-use. Vendor neutrality is not at the cost of performance; it gives the flexibility to move to a different vendor easily. The challenge of being productive with Hadoop is similar. (Segue: a very simple example makes this concrete.)
• This could be sales data that you want to analyze.
• The PIG script calls the lookup as a UDF; appealing for somebody familiar with PIG. (Segue: or for somebody familiar with Informatica.)
• This is more appealing for Informatica users. (Segue: we started prototyping with these ideas.)
• Choice of PIG: appeals to two different user segments. (Segue: next, some implementation details.)
• A mapplet may be treated as a source, a target, or a function. (Segue: a few more low-level details for the Hadoop hackers.)
• PIG should have array execution for UDFs. Ideally the runtime should not access the Informatica domain; the distributed cache (distcache) seems like the right solution. It works for native libraries, with some problems for jars: registering each individual jar is tedious and error prone, and PIG re-packs everything together, which overwrites files. A top-level jar with the other jars on its class path needs the jars distributed with the directory structure preserved, which rules out mapred.cache.files. mapred.cache.archives has a problem too (the top-level jar cannot be added to the classpath, because mapred.job.classpath.files entries must come from mapred.cache.files), as does mapred.child.java.opts (java.class.path cannot be extended, though java.library.path can). AddressDoctor supports 240 countries! (Segue: next we will look at mapping deployment.)
• Leveraging PIG saves us a lot of work and avoids re-inventing the wheel. (Segue: details of the conversion.)
• So many similarities. Note the dummy files in load & store. The concept could be generalized; currently, the parallelism is a problem. (Segue: an example.)
• Anybody curious what this translates into? (Segue: some implementation details.)
• (Segue: where are we going with all this?)
• Sqoop adapters; readers/writers to allow HDFS sources and targets. (Segue: to summarize.)
• Any quick questions? (Segue: some non-trivial extras that weren't mentioned.)

Slides

Slide 1: Data Integration on Hadoop (title slide)

Slide 2: Introduction
Challenges:
- Results of analysis or mining are only as good as the completeness & quality of the underlying data
- Need for the right level of abstraction & tools
Data integration & data quality tools have tackled these challenges for many years!
More than 4,200 enterprises worldwide rely on Informatica.

Slide 3-4: Developer tools
Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
Access methods & languages: Transact-SQL, Java, C/C++, SQL, Web services, OCI, JMS, BAPI, JDBC, PL/SQL, ODBC, Hive, XQuery, PIG, Sqoop, CLI, vi, Word, Notepad, Excel
Key points:
- Developer productivity
- Vendor neutrality/flexibility

Slide 5-8: Lookup example
A database table and an HDFS file hold rows such as 'Bangalore', …, 234, …; 'Chennai', …, 82, …; 'Mumbai', …, 872, …; 'Delhi', …, 11, …; 'Chennai', …, 43, …; 'xxx', …, 2, …
Your choices:
- Move the table to HDFS using Sqoop and join
  - Could use PIG/Hive to leverage the join operator
- Implement Java code to look up the database table
  - Need to use an access method based on the vendor
Or… leverage Informatica's Lookup (a sketch of such a UDF follows the script):

register RelLookup.jar;  -- jar name illustrative; the UDF jar must be registered before use
a = load 'RelLookupInput.dat' as (deptid: double);
b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));
store b into 'RelLookupOutput.out';

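For readers new to PIG UDFs, here is a minimal hand-written sketch of what a lookup UDF such as com.mycompany.pig.RelLookup could look like (the class body, output fields, and the stub lookup are assumptions for illustration; the real UDF is generated and delegates to the Informatica engine):

package com.mycompany.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical stand-in for the generated lookup UDF.
public class RelLookup extends EvalFunc<Tuple> {
    private static final TupleFactory FACTORY = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;  // nothing to look up
        }
        Double deptId = (Double) input.get(0);
        Tuple out = FACTORY.newTuple(2);
        out.set(0, deptId);
        out.set(1, lookupDeptName(deptId));  // stands in for the database lookup
        return out;
    }

    // Stub; the generated UDF would hand the row to the engine instead.
    private String lookupDeptName(Double deptId) {
        return "dept-" + deptId;
    }
}
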
Slide 9: Or… you could start with a mapping
(Diagram: a mapping flowing Load → Filter → STORE.)

Slide 10: Goals of the prototype
- Enable Hadoop developers to leverage Data Transformation and Data Quality logic: the ability to invoke mapplets from Hadoop
- Lower the barrier to Hadoop entry by using Informatica Developer as the toolset: the ability to run a mapping on Hadoop

Slide 11: Mapplet Invocation
- Generation of a UDF of the right type:
  - Output-only mapplet → Load UDF
  - Input-only mapplet → Store UDF
  - Input/output mapplet → Eval UDF
- Packaging into a jar:
  - The compiled UDF
  - Other metadata: connections, reference tables
- Invokes the Informatica engine (DTM) at runtime

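A minimal sketch of that type dispatch (the enum and helper are invented names; only the three Pig base classes are real):

public class UdfTypeChooser {
    enum MappletShape { OUTPUT_ONLY, INPUT_ONLY, INPUT_OUTPUT }

    // Pick the Pig UDF base class matching the mapplet's shape.
    static Class<?> pigBaseClassFor(MappletShape shape) {
        switch (shape) {
            case OUTPUT_ONLY: return org.apache.pig.LoadFunc.class;   // produces rows: a source
            case INPUT_ONLY:  return org.apache.pig.StoreFunc.class;  // consumes rows: a sink
            default:          return org.apache.pig.EvalFunc.class;   // transforms rows
        }
    }
}
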
Slide 12: Mapplet Invocation (contd.)
Challenges:
- UDF execution is per-tuple; mapplets are optimized for batch execution
- Connection info and reference data need to be plugged in
- Runtime dependencies: 280 jars, 558 native dependencies
Benefits:
- A PIG user can leverage Informatica functionality:
  - Connectivity to many (50+) data sources
  - Specialized transformations
- Re-use of already developed logic

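One way to soften the per-tuple mismatch, sketched with an invented DtmEngine wrapper (not Informatica's actual API): start the engine lazily and reuse it across exec() calls, so the heavy start-up cost is paid once per map/reduce task rather than once per row:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Invented stand-in for the embedded engine (DTM); the identity stub
// below exists only to keep the sketch self-contained.
interface DtmEngine {
    Tuple processRow(Tuple row) throws IOException;
    static DtmEngine start(String descriptor) {
        return row -> row;  // the real engine would run the mapplet here
    }
}

public class MappletEval extends EvalFunc<Tuple> {
    private DtmEngine engine;  // cached across exec() calls within a task

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (engine == null) {
            engine = DtmEngine.start("mapplet-descriptor.xml");  // once per task
        }
        return engine.processRow(input);  // the per-tuple hand-off still remains
    }
}
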
Slide 13: Mapping Deployment: Idea
Leverage PIG:
- Map to equivalent operators where possible
- Let the PIG compiler optimize & translate to Hadoop jobs
- Wrap the remaining transformations as UDFs:
  - Transformations with no equivalents, e.g., standardizer, address validator
  - Transformations with richer functionality, e.g., case-insensitive sorter

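To make "map to equivalent operators" concrete, a toy emitter for the slide-9 mapping (relation names, file names, and the filter predicate are all invented for this sketch):

public class MappingToPig {
    public static void main(String[] args) {
        // What a translator might emit for Load -> Filter -> Store.
        String script =
              "a = load 'input.dat' as (deptid: double);\n"
            + "b = filter a by deptid is not null;\n"  // Filter transformation -> FILTER operator
            + "store b into 'output.out';\n";
        System.out.println(script);
    }
}
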
Slide 14: Leveraging PIG Operators
(The operator-mapping table on this slide did not survive extraction.)

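The table itself is gone, but based on the surrounding slides a plausible reconstruction of the kind of correspondence it showed is:
- Filter → FILTER
- Expression → FOREACH … GENERATE
- Joiner → JOIN
- Sorter → ORDER BY
- Aggregator → GROUP followed by FOREACH
- Union → UNION
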
Slide 15: Leveraging Informatica Transformations
(Diagram: a pipeline mixing native PIG operators with Informatica transformations translated to PIG UDFs; the UDF stages shown are Source, Lookup, Case converter, and Target.)

Slide 16: Mapping Deployment
Design:
- Leverages PIG operators where possible
- Wraps other transformations as UDFs
- Relies on optimization by the PIG compiler
Challenges:
- Finding equivalent operators and expressions
- Limitations of the UDF model: no notion of a user-defined operator
Benefits:
- Re-use of already developed logic
- An easy way for Informatica users to start using Hadoop; they can also use the designer

Slide 17: Informatica & Hadoop: Big Picture
(Architecture diagram.) A Hadoop cluster (Name Node, Job Tracker, Data Nodes with HDFS) sits between enterprise sources (weblogs, databases, semi-structured and un-structured data, enterprise applications) and targets (BI, DW/DM). Informatica adds:
- Enterprise connectivity for Hadoop programs
- A graphical IDE for Hadoop development, backed by a metadata repository
- A transformation engine for custom data processing

Slide 18-23: Developer tools (recap)
Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
Access methods & languages: Java, C/C++, SQL, Web services, JMS, OCI, BAPI, PL/SQL, XQuery, Hive, PIG, Sqoop, vi, Word, Notepad, Excel
Key points:
- Developer productivity
- Connectivity
- Rich transforms
- Designer tool
- Vendor neutrality/flexibility, without losing performance

Slide 24: Informatica Extras…
- Specialized transformations: matching, address validation, standardization
- Connectivity
- Other tools: data federation, analyst tool, administration, metadata manager, business glossary

Slide 25: (no text extracted)

Slide 26: Hadoop Connector for Enterprise data access
- Opens up all the connectivity available from Informatica for Hadoop processing:
  - Sqoop-based connectors
  - Hadoop sources & targets in mappings
- Benefits:
  - Load data from enterprise data sources into Hadoop
  - Extract summarized data from Hadoop to load into the DW and other targets
  - Data federation

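For context, a generic Sqoop 1 import looks like the line below (connection string, table, and target directory are invented; this shows stock Sqoop usage, not Informatica's connector):

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders --target-dir /data/orders
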
Slide 27: Informatica Developer tool for Hadoop
(Workflow diagram.) An Informatica developer builds Hadoop mappings and deploys them to the Hadoop cluster: the mapping is translated to a PIG script, and mapplets etc. become PIG UDFs executed by the embedded engine (eDTM) on each Data Node. The eDTM supplies complex transformations (address cleansing, dedup/matching, hierarchical data parsing) and enterprise data access; designs live in the metadata repository.

Slide 28: Reuse Informatica Components in Hadoop
(Workflow diagram.) A Hadoop developer invokes Informatica transformations from MapReduce/PIG scripts: mapplets built in the Informatica Developer tool are exported as PIG UDFs and executed by the embedded engine (eDTM) on the Data Nodes, providing complex transformations (dedupe/matching, hierarchical data parsing) and enterprise data access, backed by the metadata repository.
