Apache Hadoop India Summit 2011 talk "Data Integration on Hadoop" by Sanjay Kaluskar



  1. 1. Data Integration on Hadoop
     Sanjay Kaluskar
     Senior Architect, Informatica
     Feb 2011
  2. 2. Introduction
     Challenges
       • Results of analysis or mining are only as good as the completeness & quality of the underlying data
       • Need for the right level of abstraction & tools
     Data integration & data quality tools have tackled these challenges for many years!
     More than 4,200 enterprises worldwide rely on Informatica
  3. 3. [Diagram: data sources, access methods & languages, developer tools]
     Data sources: Files, Applications, Databases, Hadoop, HBase, HDFS
     Access methods & languages: Transact-SQL, Java, C/C++, SQL, Web services, OCI, JMS, BAPI, JDBC, PL/SQL, ODBC, Hive, XQuery, PIG, Sqoop, CLI
     Developer tools: vi, Word, Notepad, Excel
       • Developer productivity
       • Vendor neutrality/flexibility
  5. 5. Lookup example
     [Diagram: sample rows from a database table and an HDFS file]
       ‘Bangalore’, …, 234, …
       ‘Chennai’, …, 82, …
       ‘Mumbai’, …, 872, …
       ‘Delhi’, …, 11, …
       ‘Chennai’, …, 43, …
       ‘xxx’, …, 2, …
     Your choices
       • Move table to HDFS using Sqoop and join
           • Could use PIG/Hive to leverage the join operator
       • Implement Java code to look up the database table
           • Need to use access method based on the vendor
       • Or… leverage Informatica’s Lookup
           -- load the lookup keys from the input file
           a = load 'RelLookupInput.dat' as (deptid: double);
           -- call the RelLookup UDF for each tuple and flatten what it returns
           b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));
           -- write the looked-up rows out
           store b into 'RelLookupOutput.out';
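
To make the lookup choices above concrete, here is a minimal Python sketch of what a lookup does logically: each input record is matched against an in-memory index of the database table, keyed on the first column. The sample rows, the column layout, and the names `build_index` and `lookup_join` are hypothetical and only for illustration; this is not Informatica's Lookup implementation, and the inner-join behaviour is an assumption.

```python
from collections import defaultdict


def build_index(table_rows):
    """Index database-table rows by their first column (the lookup key)."""
    index = defaultdict(list)
    for row in table_rows:
        index[row[0]].append(row)
    return index


def lookup_join(records, index):
    """Yield each input record extended with every matching table row.

    Records without a match are dropped (inner-join behaviour, an assumption).
    """
    for rec in records:
        for match in index.get(rec[0], []):
            # Append the non-key columns of the matching table row.
            yield rec + match[1:]


# Hypothetical data: (deptid, employee) records and (deptid, dept_name) rows.
input_records = [(10.0, "asha"), (20.0, "ravi"), (99.0, "unknown")]
dept_table = [(10.0, "Sales"), (20.0, "Support")]

joined = list(lookup_join(input_records, build_index(dept_table)))
```

Whichever option from the slide is chosen, the result should be equivalent to this join; the options differ in where the table lives and which access method reaches it.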
  9. 9. Or… you could start with a mapping
     [Diagram: mapping with Load, Filter and STORE steps]
  10. 10. Goals of the prototype
      Enable Hadoop developers to leverage Data Transformation and Data Quality logic
        • Ability to invoke mapplets from Hadoop
      Lower the barrier to Hadoop entry by using Informatica Developer as the toolset
        • Ability to run a mapping on Hadoop
  11. 11. Mapplet Invocation
      Generation of the UDF of the right type
        • Output-only mapplet → Load UDF
        • Input-only mapplet → Store UDF
        • Input/output → Eval UDF
      Packaging into a jar
        • Compiled UDF
        • Other meta-data: connections, reference tables
      Invokes Informatica engine (DTM) at runtime
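
The rule for choosing the UDF type can be written down in a few lines. The sketch below is in Python for brevity; the function name `udf_type` and the two boolean flags are hypothetical, and the slides do not show how the prototype's generator is actually structured.

```python
def udf_type(has_input: bool, has_output: bool) -> str:
    """Pick the PIG UDF kind for a mapplet from the ports it exposes."""
    if has_input and has_output:
        return "Eval"   # transforms tuples flowing through the script
    if has_output:
        return "Load"   # output-only mapplet acts as a data source
    if has_input:
        return "Store"  # input-only mapplet acts as a data target
    raise ValueError("a mapplet needs at least one input or output")
```

For example, `udf_type(has_input=False, has_output=True)` returns `"Load"`, matching the first mapping on the slide.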
  12. 12. Mapplet Invocation (contd.)
      Challenges
        • UDF execution is per-tuple; mapplets are optimized for batch execution
        • Connection info/reference data need to be plugged in
        • Runtime dependencies: 280 jars, 558 native dependencies
      Benefits
        • PIG user can leverage Informatica functionality
        • Connectivity to many (50+) data sources
        • Specialized transformations
        • Re-use of already developed logic
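
The first challenge, per-tuple calls feeding a batch-oriented engine, is commonly handled by buffering. The following Python sketch shows one such adapter under stated assumptions: `BatchingAdapter`, `upper_batch`, and the batch size are hypothetical, and the slides do not say whether the prototype used this approach.

```python
class BatchingAdapter:
    """Buffer per-tuple calls and hand them to a batch function in chunks."""

    def __init__(self, batch_fn, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._batch_fn = batch_fn
        self._batch_size = batch_size
        self._pending = []

    def push(self, tup):
        """Accept one tuple; return results if this filled a batch, else []."""
        self._pending.append(tup)
        if len(self._pending) >= self._batch_size:
            return self.flush()
        return []

    def flush(self):
        """Process whatever is buffered; call once more at end of input."""
        if not self._pending:
            return []
        batch, self._pending = self._pending, []
        return self._batch_fn(batch)


calls = []  # size of every batch the stand-in engine receives


def upper_batch(batch):
    """Stand-in for a batch-oriented engine call: upper-cases the first field."""
    calls.append(len(batch))
    return [(t[0].upper(),) + t[1:] for t in batch]


adapter = BatchingAdapter(upper_batch, batch_size=2)
results = []
for tup in [("a", 1), ("b", 2), ("c", 3)]:
    results.extend(adapter.push(tup))
results.extend(adapter.flush())  # drain the final partial batch
```

The trade-off is that results no longer come back one per call, so the caller must tolerate delayed output and remember the final flush.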
  13. 13. Mapping Deployment: Idea
      Leverage PIG
        • Map to equivalent operators where possible
        • Let the PIG compiler optimize & translate to Hadoop jobs
      Wrap some transformations as UDFs
        • Transformations with no equivalents, e.g., standardizer, address validator
        • Transformations with richer functionality, e.g., case-insensitive sorter
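
The idea above can be illustrated with a toy translator that emits PIG Latin text: transformation kinds with a native operator use it, and anything else is wrapped as a UDF call. This is a Python sketch under stated assumptions, not the prototype's code generator; `NATIVE_TEMPLATES`, `to_pig`, the step format, and the class name `com.example.AddressValidator` are all hypothetical.

```python
# Hypothetical table of transformation kinds that have a native PIG operator.
NATIVE_TEMPLATES = {
    "filter": "{out} = FILTER {src} BY {arg};",
    "sorter": "{out} = ORDER {src} BY {arg};",
}


def to_pig(source_path, steps, target_path):
    """Translate a linear mapping into PIG Latin text.

    steps is a list of (kind, arg) pairs. Kinds in NATIVE_TEMPLATES use the
    PIG operator; any other kind is wrapped as a UDF call, with arg taken as
    the (hypothetical) UDF class name applied to the whole tuple.
    """
    lines = [f"r0 = LOAD '{source_path}';"]
    for i, (kind, arg) in enumerate(steps, start=1):
        src, out = f"r{i - 1}", f"r{i}"
        template = NATIVE_TEMPLATES.get(kind)
        if template is not None:
            lines.append(template.format(out=out, src=src, arg=arg))
        else:
            lines.append(f"{out} = FOREACH {src} GENERATE FLATTEN({arg}(*));")
    lines.append(f"STORE r{len(steps)} INTO '{target_path}';")
    return "\n".join(lines)


script = to_pig(
    "customers.dat",
    [
        ("filter", "$0 is not null"),
        ("address_validator", "com.example.AddressValidator"),
    ],
    "customers.out",
)
print(script)
```

Because the output is ordinary PIG Latin, optimization and translation to Hadoop jobs are left to the PIG compiler, as the slide proposes.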
  14. 14. Leveraging PIG Operators
  15. 15. Leveraging Informatica Transformations
      [Diagram: pipeline mixing native PIG operators with Informatica transformations translated to PIG UDFs]
      Nodes: Source UDFs, Case converter UDF, Lookup UDF, Target UDF, Native PIG
      Legend: Native PIG; Informatica Transformation (Translated to PIG UDFs)
  16. 16. Mapping Deployment
      Design
        • Leverages PIG operators where possible
        • Wraps other transformations as UDFs
        • Relies on optimization by the PIG compiler
      Challenges
        • Finding equivalent operators and expressions
        • Limitations of the UDF model – no notion of a user defined operator
      Benefits
        • Re-use of already developed logic
        • Easy way for Informatica users to start using Hadoop simultaneously; can also use the designer
  17. 17. Informatica & Hadoop Big Picture
      [Diagram]
      Informatica: Enterprise Connectivity for Hadoop programs; Graphical IDE for Hadoop Development; Transformation Engine for custom data processing; Metadata Repository
      Hadoop Cluster: Name Node, Job Tracker, Data Node, HDFS
      Surrounding systems and data: Weblogs, Databases, Enterprise Applications, Semi-structured, Un-structured, BI, DW/DM
  18. 18. [Diagram: data sources, access methods & languages, developer tools]
      Data sources: Files, Applications, Databases, Hadoop, HBase, HDFS
      Access methods & languages: Java, C/C++, SQL, Web services, JMS, OCI, BAPI, PL/SQL, XQuery, Hive, PIG, Sqoop
      Developer tools: vi, Word, Notepad, Excel
        • Developer productivity
        • Connectivity
        • Rich transforms
        • Designer tool
        • Vendor neutrality/flexibility
        • Without losing performance
  24. 24. Informatica Extras…
      Specialized transformations
        • Matching
        • Address validation
        • Standardization
      Connectivity
      Other tools
        • Data federation
        • Analyst tool
        • Administration
        • Metadata manager
        • Business glossary
  25. 25.
  26. 26. Hadoop Connector for Enterprise data access
      Opens up all the connectivity available from Informatica for Hadoop processing
        • Sqoop-based connectors
        • Hadoop sources & targets in mappings
      Benefits
        • Load data from Enterprise data sources into Hadoop
        • Extract summarized data from Hadoop to load into DW and other targets
        • Data federation
  27. 27. Informatica Developer tool for Hadoop
      Informatica developer builds Hadoop mappings and deploys to Hadoop cluster
      [Diagram]
      Informatica Hadoop Developer (Informatica Developer, Hadoop Designer, Metadata Repository): Mapping → PIG script; eDTM Mapplets etc. → PIG UDF
      Data Node: PIG Script, UDF, Informatica eDTM, Mapplets, HDFS
      Complex Transformations: Addr Cleansing, Dedup/Matching, Hierarchical data parsing
      Enterprise Data Access
  28. 28. Invoke Informatica Transformations from your Hadoop MapReduce/PIG scripts
      Hadoop developer invokes Informatica UDFs from PIG scripts
      [Diagram]
      Informatica Developer Tool (Metadata Repository): Mapplets → PIG UDF
      Hadoop Developer: Reuse Informatica Components in Hadoop
      Data Node: PIG Script, UDF, Informatica eDTM, Mapplets, HDFS
      Complex Transformations: Dedupe/Matching, Hierarchical data parsing
      Enterprise Data Access