Data Integration on Hadoop

Presentation at the Hadoop user summit in Bangalore on Feb 16, 2011, by Sanjay Kaluskar.

Speaker notes
  • Many examples of users using Hadoop for analysis/mining. Example: social networking data, where you need to identify the same user across multiple applications. Map/reduce functions are powerful but low level. Informatica is the leader. You may wonder: how? Why do so many people use Informatica tools?
  • Historical perspective: proliferation of data sources over time, and data fragmentation. Developer productivity comes from higher-level abstraction, built-in transformations, and re-use. Vendor neutrality is not at the cost of performance; it gives the flexibility to move to a different vendor easily. The challenge of being productive with Hadoop is similar. Let's make this more concrete with a very simple example.
  • This could be sales data that you want to analyze.
  • The PIG script calls the lookup as a UDF. Appealing for somebody familiar with PIG. Or for somebody familiar with Informatica…
  • This is more appealing for Informatica users. We started prototyping with these ideas.
  • Choice of PIG: it appeals to two different user segments. Next I will go into some implementation details.
  • A mapplet may be treated as a source, target or a function. Just a few more low-level details for the Hadoop hackers.
  • PIG should have array execution for UDFs. Ideally we don't want the runtime to access the Informatica domain; distcache seems like the right solution. It works for native libs, with some problems for jars. Address Doctor supports 240 countries! Next we will look at mapping deployment. Registering each individual jar is tedious and error-prone; also, PIG re-packs everything together, which overwrites files. A top-level jar with the other jars on its class path needs the jars distributed preserving the directory structure, which rules out mapred.cache.files. mapred.cache.archives has a problem too (the top-level jar can't be added to the classpath, because mapred.job.classpath.files entries must come from mapred.cache.files), as does mapred.child.java.opts (it can't add to java.class.path, but it can add to java.library.path); see the sketch after this list.
  • Leveraging PIG saves us a lot of work and avoids re-inventing the wheel. Next: details of the conversion.
  • So many similarities. Note the dummy files in load & store. The concept could be generalized; currently, the parallelism is a problem. Let's look at an example.
  • Anybody curious what this translates into? Some implementation details follow.
  • Where are we going with all this?
  • Sqoop adapters: readers/writers to allow HDFS sources/targets. To summarize…
  • Any quick questions? I didn't mention some non-trivial extras.
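
A minimal sketch of the distribution workaround described in the notes above, as it might look at the top of a generated PIG script (the archive and library paths are illustrative assumptions, not the prototype's actual configuration):

      -- Hypothetical paths: ship the dependency archive through the
      -- distributed cache so its directory structure is preserved.
      set mapred.cache.archives 'hdfs:///infa/deps.tgz#infa';
      -- Native libraries can be reached via java.library.path
      -- (java.class.path cannot be extended this way).
      set mapred.child.java.opts '-Djava.library.path=./infa/lib';
      -- Only the top-level jar is registered directly with PIG.
      register infa-udfs.jar;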
  • Transcript

    • 1. Data Integration on Hadoop
      Sanjay Kaluskar
      Senior Architect, Informatica
      Feb 2011
    • 2. Introduction
      Challenges
      Results of analysis or mining are only as good as the completeness & quality of underlying data
      Need for the right level of abstraction & tools
      Data integration & data quality tools have tackled these challenges for many years!
      More than 4,200 enterprises worldwide rely on Informatica
    • 3. (Diagram) Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
      Access methods & languages: Transact-SQL, Java, C/C++, SQL, Web services, OCI, JMS, BAPI, JDBC, PL/SQL, ODBC, Hive, XQuery, vi, PIG, Word, Notepad, Sqoop, CLI, Excel
      • Developer productivity
      • 4. Vendor neutrality/flexibility
      Developer tools
    • 5. Lookup example
      (Diagram: sample records — ‘Bangalore’, …, 234, …; ‘Chennai’, …, 82, …; ‘Mumbai’, …, 872, …; ‘Delhi’, …, 11, …; ‘Chennai’, …, 43, …; ‘xxx’, …, 2, … — in an HDFS file, to be looked up against a database table.)
      Your choices:
      • Move table to HDFS using Sqoop and join
      • 6. Could use PIG/Hive to leverage the join operator
      • 7. Implement Java code to lookup the database table
      • 8. Need to use access method based on the vendor
    • Or… leverage Informatica’s Lookup
      a = load 'RelLookupInput.dat' as (deptid: double);
      b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));
      store b into 'RelLookupOutput.out';
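      For the script above to run, the jar packaging the generated UDF would first need to be registered; a minimal sketch (the jar name is an illustrative assumption):
      register RelLookup.jar;  -- hypothetical jar containing com.mycompany.pig.RelLookup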
    • 9. Or… you could start with a mapping
      (Mapping diagram: Load → Filter → Store)
    • 10. Goals of the prototype
      Enable Hadoop developers to leverage Data Transformation and Data Quality logic
      Ability to invoke mapplets from Hadoop
      Lower the barrier to Hadoop entry by using Informatica Developer as the toolset
      Ability to run a mapping on Hadoop
    • 11. Mapplet Invocation
      Generation of the UDF of the right type
      Output-only mapplet → Load UDF
      Input-only mapplet → Store UDF
      Input/output mapplet → Eval UDF
      Packaging into a jar
      Compiled UDF
      Other meta-data: connections, reference tables
      Invokes Informatica engine (DTM) at runtime
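      A sketch of how the three generated UDF flavors might be invoked from a PIG script (jar, class, and file names are illustrative assumptions, not the prototype's actual naming; note the dummy files standing in for load and store paths):
      register MyMapplets.jar;
      -- output-only mapplet exposed as a Load UDF (dummy input file)
      a = load 'dummy.dat' using com.mycompany.pig.MySrcMapplet();
      -- input/output mapplet exposed as an Eval UDF
      b = foreach a generate flatten(com.mycompany.pig.MyXformMapplet(*));
      -- input-only mapplet exposed as a Store UDF (dummy output file)
      store b into 'dummy.out' using com.mycompany.pig.MyTgtMapplet();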
    • 12. Mapplet Invocation (contd.)
      Challenges
      UDF execution is per-tuple; mapplets are optimized for batch execution
      Connection info/reference data need to be plugged in
      Runtime dependencies: 280 jars, 558 native dependencies
      Benefits
      PIG user can leverage Informatica functionality
      Connectivity to many (50+) data sources
      Specialized transformations
      Re-use of already developed logic
    • 13. Mapping Deployment: Idea
      Leverage PIG
      Map to equivalent operators where possible
      Let the PIG compiler optimize & translate to Hadoop jobs
      Wraps some transformations as UDFs
      Transformations with no equivalents, e.g., standardizer, address validator
      Transformations with richer functionality, e.g., case-insensitive sorter
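      As an illustration of this translation scheme (all names here are hypothetical), a mapping of source → filter → case converter → target might come out along these lines, with the filter mapped to PIG's native operator and the case converter, which has no PIG equivalent, wrapped as an Eval UDF:
      register GeneratedUdfs.jar;
      src  = load 'src.dat' using com.mycompany.pig.SrcUDF();
      -- Filter transformation -> native PIG FILTER operator
      filt = filter src by qty > 0;
      -- Case converter has no native equivalent -> Eval UDF
      conv = foreach filt generate flatten(com.mycompany.pig.CaseConverter(*));
      store conv into 'tgt.dat' using com.mycompany.pig.TgtUDF();
      Keeping native operators where possible lets the PIG compiler see and optimize the whole plan, instead of treating the mapping as one opaque job.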
    • 14. Leveraging PIG Operators
    • 15. Leveraging Informatica Transformations
      (Diagram: a PIG script mixing native PIG operators with Informatica transformations translated to PIG UDFs — a Source UDF feeding native PIG operators, transformations such as a Case converter and a Lookup wrapped as UDFs, ending in a Target UDF.)
    • 16. Mapping Deployment
      Design
      Leverages PIG operators where possible
      Wraps other transformations as UDFs
      Relies on optimization by the PIG compiler
      Challenges
      Finding equivalent operators and expressions
      Limitations of the UDF model – no notion of a user defined operator
      Benefits
      Re-use of already developed logic
      An easy way for Informatica users to start using Hadoop; they can also use the designer
    • 17. Informatica & Hadoop: Big Picture
      (Diagram: a Hadoop cluster — name node, job tracker, and data nodes with HDFS — fed by weblogs, databases, enterprise applications, and semi-structured and un-structured data, and feeding DW/DM and BI systems. Informatica supplies enterprise connectivity for Hadoop programs, a graphical IDE for Hadoop development, a transformation engine for custom data processing, and a metadata repository.)
    • 18. (Diagram, repeating slide 3) Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
      Access methods & languages: Java, C/C++, SQL, Web services, JMS, OCI, BAPI, PL/SQL, XQuery, vi, Hive, PIG, Word, Notepad, Sqoop, Excel
      • Developer productivity
      • 19. Connectivity
      • 20. Rich transforms
      • 21. Designer tool
      • 22. Vendor neutrality/flexibility
      • 23. Without losing performance
      Developer tools
    • 24. Informatica Extras…
      Specialized transformations
      Matching
      Address validation
      Standardization
      Connectivity
      Other tools
      Data federation
      Analyst tool
      Administration
      Metadata manager
      Business glossary
    • 25.
    • 26. Hadoop Connector for Enterprise data access
      Opens up all the connectivity available from Informatica for Hadoop processing
      Sqoop-based connectors
      Hadoop sources & targets in mappings
      Benefits
      Load data from Enterprise data sources into Hadoop
      Extract summarized data from Hadoop to load into DW and other targets
      Data federation
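      For example (file and field names invented for illustration), summary data could be computed on Hadoop in PIG and the result then exported to a DW target through the Sqoop-based connector:
      -- aggregate raw weblog data already sitting on HDFS
      logs    = load '/data/weblogs.dat' as (page: chararray, hits: long);
      grouped = group logs by page;
      summary = foreach grouped generate group as page, SUM(logs.hits) as total_hits;
      -- the summary written here is what the Sqoop-based connector
      -- would extract into the DW or another enterprise target
      store summary into '/data/weblog_summary';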
    • 27. Informatica Developer tool for Hadoop
      (Diagram: an Informatica developer builds Hadoop mappings in the Informatica Developer tool, backed by the metadata repository, and deploys them to the Hadoop cluster. The mapping is translated to a PIG script, and mapplets etc. become PIG UDFs; on each data node the Informatica eDTM executes them against HDFS, providing complex transformations — address cleansing, dedup/matching, hierarchical data parsing — and enterprise data access.)
    • 28. Reuse Informatica components in Hadoop
      (Diagram: a Hadoop developer invokes Informatica transformations from Hadoop MapReduce/PIG scripts. Mapplets built with the Informatica Developer tool, backed by the metadata repository, are turned into PIG UDFs; on each data node the Informatica eDTM executes them against HDFS, providing complex transformations — dedupe/matching, hierarchical data parsing — and enterprise data access.)