Apache Hadoop India Summit 2011 talk "Data Integration on Hadoop" by Sanjay Kaluskar
Transcript

  • 1. Data Integration on Hadoop
       Sanjay Kaluskar, Senior Architect, Informatica
       Feb 2011
  • 2. Introduction
       Challenges:
         - Results of analysis or mining are only as good as the completeness & quality of the underlying data
         - Need for the right level of abstraction & tools
       Data integration & data quality tools have tackled these challenges for many years!
       More than 4,200 enterprises worldwide rely on Informatica
  • 3. Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
       Access methods & languages: Transact-SQL, Java, C/C++, SQL, Web services, OCI, JMS, BAPI, JDBC, PL/SQL, ODBC, Hive, XQuery, vi, PIG, Word, Notepad, Sqoop, CLI, Excel
       Developer tools:
         - Developer productivity
         - Vendor neutrality/flexibility
  • 5. Lookup example
       Database table: 'Bangalore', …, 234, …; 'Chennai', …, 82, …; 'Mumbai', …, 872, …; 'Delhi', …, 11, …
       HDFS file: 'Chennai', …, 43, …; 'xxx', …, 2, …
       Your choices:
         - Move the table to HDFS using Sqoop and join; could use PIG/Hive to leverage the join operator
         - Implement Java code to look up the database table; need to use an access method based on the vendor

  • 8. Or… leverage Informatica's Lookup
       a = load 'RelLookupInput.dat' as (deptid: double);
       b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));
       store b into 'RelLookupOutput.out';
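The "implement Java code" option from the choices above can be sketched as a map-side lookup: cache the small database table in memory and probe it once per HDFS record. This is an illustrative sketch only — the class and method names are hypothetical, and a plain `Map` stands in for the vendor-specific JDBC/OCI read that would normally fill the cache in a mapper's setup phase.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a hand-written lookup: the database table is cached in memory
// and probed for each record streamed from the HDFS file. In a real job the
// cache would be loaded once per mapper via a vendor-specific access method.
public class CityLookup {
    private final Map<String, Integer> cache = new HashMap<>();

    // Stand-in for loading one row of the database table (e.g. via JDBC).
    public void load(String city, int code) {
        cache.put(city, code);
    }

    // Probe the cached table with one HDFS record's key; -1 if unmatched.
    public int lookup(String city) {
        return cache.getOrDefault(city, -1);
    }

    public static void main(String[] args) {
        CityLookup lk = new CityLookup();
        lk.load("Chennai", 82);
        lk.load("Mumbai", 872);
        System.out.println(lk.lookup("Chennai")); // matched row from the table
        System.out.println(lk.lookup("xxx"));     // key only present in HDFS
    }
}
```

The per-vendor connection code this hides is exactly the burden the slide's Informatica Lookup alternative removes.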
  • 9. Or… you could start with a mapping
       Load → Filter → Store
  • 10. Goals of the prototype
        - Enable Hadoop developers to leverage Data Transformation and Data Quality logic: ability to invoke mapplets from Hadoop
        - Lower the barrier to Hadoop entry by using Informatica Developer as the toolset: ability to run a mapping on Hadoop
  • 11. Mapplet Invocation
        Generation of a UDF of the right type:
          - Output-only mapplet → Load UDF
          - Input-only mapplet → Store UDF
          - Input/output mapplet → Eval UDF
        Packaging into a jar:
          - Compiled UDF
          - Other metadata: connections, reference tables
        Invokes the Informatica engine (DTM) at runtime
  • 12. Mapplet Invocation (contd.)
        Challenges:
          - UDF execution is per-tuple; mapplets are optimized for batch execution
          - Connection info/reference data need to be plugged in
          - Runtime dependencies: 280 jars, 558 native dependencies
        Benefits:
          - PIG users can leverage Informatica functionality
          - Connectivity to many (50+) data sources
          - Specialized transformations
          - Re-use of already developed logic
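The first challenge on this slide — a per-tuple UDF calling convention feeding an engine optimized for batches — is commonly bridged by buffering. The sketch below shows the idea with hypothetical names; it is not Informatica's DTM API, just a minimal illustration of buffering per-tuple calls and flushing them to a batch-oriented engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of bridging per-tuple UDF execution to a batch-optimized engine:
// tuples are buffered as the UDF framework delivers them one at a time, and
// the (stand-in) engine is invoked once per full batch instead of per tuple.
public class BatchingUdf {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> out = new ArrayList<>();
    private final int batchSize;
    private final Function<List<String>, List<String>> engine;

    public BatchingUdf(int batchSize, Function<List<String>, List<String>> engine) {
        this.batchSize = batchSize;
        this.engine = engine;
    }

    // Called once per tuple by the UDF framework.
    public void exec(String tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) flush();
    }

    // Push the buffered tuples through the engine in a single call.
    public void flush() {
        if (buffer.isEmpty()) return;
        out.addAll(engine.apply(new ArrayList<>(buffer)));
        buffer.clear();
    }

    public List<String> results() { return out; }

    public static void main(String[] args) {
        BatchingUdf udf = new BatchingUdf(2, batch -> batch); // identity engine
        udf.exec("a");
        udf.exec("b");
        udf.exec("c");
        udf.flush(); // drain the partial final batch
        System.out.println(udf.results());
    }
}
```

The amortization matters because, per the slide, each engine invocation drags in heavy runtime state (hundreds of jars and native dependencies).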
  • 13. Mapping Deployment: Idea
        Leverage PIG:
          - Map to equivalent operators where possible
          - Let the PIG compiler optimize & translate to Hadoop jobs
        Wrap some transformations as UDFs:
          - Transformations with no equivalents, e.g., standardizer, address validator
          - Transformations with richer functionality, e.g., case-insensitive sorter
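The translation idea above — emit a native PIG operator when an equivalent exists, otherwise wrap the transformation as a UDF call — can be sketched as a tiny code generator. The operator table and naming are illustrative assumptions, not the prototype's actual translator.

```java
import java.util.Map;

// Sketch of mapping-to-PIG translation: transformations with a direct PIG
// equivalent become native operators; everything else falls back to a UDF
// invocation, leaving optimization to the PIG compiler either way.
public class MappingToPig {
    // Illustrative table of transformations with native PIG equivalents.
    private static final Map<String, String> NATIVE = Map.of(
        "Filter", "b = filter a by %s;",
        "Sorter", "b = order a by %s;");

    public static String translate(String transform, String arg) {
        String template = NATIVE.get(transform);
        if (template != null) {
            return String.format(template, arg);
        }
        // No equivalent operator (e.g. address validator): wrap as a UDF.
        return String.format("b = foreach a generate flatten(udf.%s(%s));",
                             transform, arg);
    }

    public static void main(String[] args) {
        System.out.println(translate("Filter", "deptid > 10"));
        System.out.println(translate("AddressValidator", "addr"));
    }
}
```

A real translator would also thread relation names through the script and handle multi-input operators; the point here is only the native-vs-UDF split.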
  • 14. Leveraging PIG Operators
  • 15. Leveraging Informatica Transformations
        Diagram: native PIG stages interleaved with Informatica transformations (source UDFs, Lookup UDF, case converter UDF, target UDF), each translated to a PIG UDF
  • 16. Mapping Deployment
        Design:
          - Leverages PIG operators where possible
          - Wraps other transformations as UDFs
          - Relies on optimization by the PIG compiler
        Challenges:
          - Finding equivalent operators and expressions
          - Limitations of the UDF model: no notion of a user-defined operator
        Benefits:
          - Re-use of already developed logic
          - Easy way for Informatica users to start using Hadoop; can also use the designer
  • 17. Informatica & Hadoop: Big Picture
        - Enterprise connectivity for Hadoop programs: weblogs, databases, BI, DW/DM, semi-structured and un-structured data, enterprise applications
        - Graphical IDE for Hadoop development, backed by a metadata repository
        - Transformation engine for custom data processing
        Hadoop cluster: Name Node, Job Tracker, Data Nodes with HDFS
  • 18. Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
        Access methods & languages: Java, C/C++, SQL, Web services, JMS, OCI, BAPI, PL/SQL, XQuery, vi, Hive, PIG, Word, Notepad, Sqoop, Excel
        Developer tools:
          - Developer productivity
          - Connectivity
          - Rich transforms
          - Designer tool
          - Vendor neutrality/flexibility
          - Without losing performance
  • 24. Informatica Extras…
        Specialized transformations:
          - Matching
          - Address validation
          - Standardization
        Connectivity
        Other tools:
          - Data federation
          - Analyst tool
          - Administration
          - Metadata manager
          - Business glossary
  • 26. Hadoop Connector for Enterprise data access
        Opens up all the connectivity available from Informatica for Hadoop processing:
          - Sqoop-based connectors
          - Hadoop sources & targets in mappings
        Benefits:
          - Load data from Enterprise data sources into Hadoop
          - Extract summarized data from Hadoop to load into DW and other targets
          - Data federation
  • 27. Informatica Developer tool for Hadoop
        An Informatica developer builds Hadoop mappings and deploys them to the Hadoop cluster:
          - Mapping → PIG script
          - Mapplets etc. → PIG UDFs (eDTM)
        Runtime: the PIG script and UDFs run on the data nodes against HDFS, invoking the Informatica eDTM for mapplets
        Complex transformations: address cleansing, dedup/matching, hierarchical data parsing, Enterprise data access
        Backed by the metadata repository
  • 28. Invoke Informatica transformations from your Hadoop MapReduce/PIG scripts
        A Hadoop developer invokes Informatica UDFs from PIG scripts (mapplets → PIG UDFs via the Informatica Developer tool), reusing Informatica components in Hadoop:
          - Complex transformations: dedupe/matching, hierarchical data parsing
          - Enterprise data access
        Runtime: the PIG script and UDFs run on the data nodes against HDFS, invoking the Informatica eDTM; backed by the metadata repository
