Apache Hadoop India Summit 2011 talk "Data Integration on Hadoop" by Sanjay Kaluskar
Presentation Transcript

  • 1. Data Integration on Hadoop
    Sanjay Kaluskar
    Senior Architect, Informatica
    Feb 2011
  • 2. Introduction
    Challenges
    Results of analysis or mining are only as good as the completeness & quality of underlying data
    Need for the right level of abstraction & tools
    Data integration & data quality tools have tackled these challenges for many years!
    More than 4,200 enterprises worldwide rely on Informatica
  • 3. Data sources & access methods today
    Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
    Access methods & languages: Transact-SQL, SQL, PL/SQL, Java, C/C++, Web services, OCI, JMS, BAPI, JDBC, ODBC, Hive, XQuery, PIG, Sqoop, CLI, vi, Notepad, Word, Excel
    • Developer productivity
  • 4. Vendor neutrality/flexibility
    Developer tools
  • 5. Lookup example
    ‘Bangalore’, …, 234, …
    ‘Chennai’, …, 82, …
    ‘Mumbai’, …, 872, …
    ‘Delhi’, …, 11, …
    ‘Chennai’, …, 43, …
    ‘xxx’, …, 2, …
    Database table
    HDFS file
    Your choices
    • Move the table to HDFS using Sqoop and join; could use PIG/Hive to leverage the join operator
    • Implement Java code to look up the database table; need to use an access method based on the vendor
  • Or… leverage Informatica’s Lookup
    a = load 'RelLookupInput.dat' as (deptid: double);
    b = foreach a generate flatten(com.mycompany.pig.RelLookup(deptid));
    store b into 'RelLookupOutput.out';
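To make the `RelLookup` call above concrete, here is a minimal stdlib-only sketch of the per-tuple lookup it performs. The `CityLookup` class and its contents are hypothetical; a real Pig UDF would extend `org.apache.pig.EvalFunc` and read the table over JDBC rather than from a hard-coded map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for com.mycompany.pig.RelLookup: a real Pig UDF
// would extend org.apache.pig.EvalFunc<Tuple> and fetch the table via JDBC.
class CityLookup {
    private Map<String, Integer> cache;   // lazily loaded copy of the table

    // Simulate reading the database table once, on first use.
    private Map<String, Integer> table() {
        if (cache == null) {
            cache = new HashMap<>();
            cache.put("Bangalore", 234);
            cache.put("Chennai", 82);
            cache.put("Mumbai", 872);
            cache.put("Delhi", 11);
        }
        return cache;
    }

    // exec() is called once per input tuple, mirroring Pig's UDF contract.
    Integer exec(String city) {
        return table().get(city);   // null when the key is absent ('xxx')
    }
}
```

The one-time cache load is what makes a per-tuple lookup against a remote table affordable at all; without it, every tuple would pay a round trip.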
  • 9. Or… you could start with a mapping: Load → Filter → Store
  • 10. Goals of the prototype
    Enable Hadoop developers to leverage Data Transformation and Data Quality logic
    Ability to invoke mapplets from Hadoop
    Lower the barrier to Hadoop entry by using Informatica Developer as the toolset
    Ability to run a mapping on Hadoop
  • 11. Mapplet Invocation
    Generation of the UDF of the right type
    Output-only mapplet → Load UDF
    Input-only mapplet → Store UDF
    Input/output → Eval UDF
    Packaging into a jar
    Compiled UDF
    Other meta-data: connections, reference tables
    Invokes Informatica engine (DTM) at runtime
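The mapplet-shape-to-UDF-type rule above reduces to a simple decision on the mapplet's ports. A sketch, with all names hypothetical (the slide does not show the generator's actual API):

```java
// Hypothetical sketch of slide 11's rule: the kind of Pig UDF generated
// depends on whether the mapplet has input ports, output ports, or both.
class UdfKindChooser {
    enum UdfKind { LOAD, STORE, EVAL }

    static UdfKind choose(boolean hasInput, boolean hasOutput) {
        if (!hasInput && hasOutput) return UdfKind.LOAD;   // output-only mapplet
        if (hasInput && !hasOutput) return UdfKind.STORE;  // input-only mapplet
        if (hasInput && hasOutput)  return UdfKind.EVAL;   // input/output mapplet
        throw new IllegalArgumentException("mapplet must have at least one port");
    }
}
```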
  • 12. Mapplet Invocation (contd.)
    Challenges
    UDF execution is per-tuple; mapplets are optimized for batch execution
    Connection info/reference data need to be plugged in
    Runtime dependencies: 280 jars, 558 native dependencies
    Benefits
    PIG user can leverage Informatica functionality
    Connectivity to many (50+) data sources
    Specialized transformations
    Re-use of already developed logic
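One common way to bridge the per-tuple-vs-batch mismatch noted above is to buffer tuples inside the UDF and hand them to the engine in blocks. A stdlib-only sketch; `BatchingBridge` and its `BatchEngine` interface are illustrative, not Informatica's API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of buffering per-tuple UDF calls into batches for an engine
// (such as the DTM) that is optimized for block execution.
class BatchingBridge<T> {
    interface BatchEngine<T> { void process(List<T> batch); }

    private final BatchEngine<T> engine;
    private final int batchSize;
    private final List<T> buffer = new ArrayList<>();

    BatchingBridge(BatchEngine<T> engine, int batchSize) {
        this.engine = engine;
        this.batchSize = batchSize;
    }

    // Called once per tuple by the UDF; flushes when the buffer fills.
    void accept(T tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) flush();
    }

    // Must also be called at end-of-input to drain the tail.
    void flush() {
        if (!buffer.isEmpty()) {
            engine.process(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

The end-of-input flush matters: without it, the last partial batch would be silently dropped.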
  • 13. Mapping Deployment: Idea
    Leverage PIG
    Map to equivalent operators where possible
    Let the PIG compiler optimize & translate to Hadoop jobs
    Wraps some transformations as UDFs
    Transformations with no equivalents, e.g., standardizer, address validator
    Transformations with richer functionality, e.g., case-insensitive sorter
  • 14. Leveraging PIG Operators
  • 15. Leveraging Informatica Transformations
    Diagram: a PIG data flow in which the Source, Lookup, Case Converter, and Target steps are Informatica transformations translated to PIG UDFs, interleaved with native PIG operators.
  • 16. Mapping Deployment
    Design
    Leverages PIG operators where possible
    Wraps other transformations as UDFs
    Relies on optimization by the PIG compiler
    Challenges
    Finding equivalent operators and expressions
    Limitations of the UDF model – no notion of a user defined operator
    Benefits
    Re-use of already developed logic
    Easy way for Informatica users to start using Hadoop; they can also use the designer
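The design above, native PIG operator where an equivalent exists, UDF wrapper otherwise, can be sketched as a lookup-then-fallback rule. The operator table and all names here are illustrative, not the actual translator:

```java
import java.util.Map;

// Sketch of slide 16's translation rule: a transformation compiles to a
// native PIG operator when one is equivalent; otherwise it is wrapped
// as a PIG UDF. The operator table is illustrative only.
class MappingTranslator {
    private static final Map<String, String> NATIVE = Map.of(
        "Filter", "FILTER %s BY cond",
        "Sorter", "ORDER %s BY key",
        "Joiner", "JOIN %s BY key, other BY key");

    static String translate(String transformation, String relation) {
        String template = NATIVE.get(transformation);
        if (template != null) {
            return String.format(template, relation);   // native PIG operator
        }
        // No equivalent (e.g., address validator): wrap as a UDF call.
        return String.format("FOREACH %s GENERATE %sUDF(*)", relation, transformation);
    }
}
```

Emitting native operators wherever possible is what lets the PIG compiler, rather than the translator, do the optimization and Hadoop-job planning.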
  • 17. Informatica & Hadoop: Big Picture
    Diagram: the Hadoop Cluster (Name Node, Job Tracker, Data Nodes with HDFS) at the center; Enterprise connectivity for Hadoop programs brings in weblogs, databases, enterprise applications, and semi-structured/un-structured data; a graphical IDE for Hadoop development is backed by a Metadata Repository; a transformation engine handles custom data processing; results flow out to DW/DM and BI.
  • 18. Data sources & access methods, revisited with Informatica
    Data sources: Files, Applications, Databases, Hadoop (HBase, HDFS)
    Access methods & languages: SQL, PL/SQL, Java, C/C++, Web services, OCI, JMS, BAPI, XQuery, Hive, PIG, Sqoop, vi, Notepad, Word, Excel
    • Developer productivity
  • 19. Connectivity
  • 20. Rich transforms
  • 21. Designer tool
  • 22. Vendor neutrality/flexibility
  • 23. Without losing performance
    Developer tools
  • 24. Informatica Extras…
    Specialized transformations
    Matching
    Address validation
    Standardization
    Connectivity
    Other tools
    Data federation
    Analyst tool
    Administration
    Metadata manager
    Business glossary
  • 26. Hadoop Connector for Enterprise data access
    Opens up all the connectivity available from Informatica for Hadoop processing
    Sqoop-based connectors
    Hadoop sources & targets in mappings
    Benefits
    Load data from Enterprise data sources into Hadoop
    Extract summarized data from Hadoop to load into DW and other targets
    Data federation
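A Sqoop-based connector ultimately boils down to issuing an import invocation. This sketch only assembles the argument list (Sqoop 1.x flags; the class name and all connection values are hypothetical) rather than running it:

```java
import java.util.List;

// Sketch: build the Sqoop 1.x command line a connector might issue to
// pull an enterprise table into HDFS. Host/db/table values are made up.
class SqoopImportCommand {
    static List<String> build(String jdbcUrl, String table, String targetDir) {
        return List.of(
            "sqoop", "import",
            "--connect", jdbcUrl,        // e.g. a vendor JDBC URL
            "--table", table,            // source table to import
            "--target-dir", targetDir,   // HDFS directory for the imported rows
            "--num-mappers", "4");       // parallel import tasks
    }
}
```

A real connector would additionally pass credentials and vendor-specific driver options; those are omitted here.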
  • 27. Informatica Developer tool for Hadoop
    Diagram: an Informatica developer builds Hadoop mappings in Informatica Developer (Mapping → PIG script; mapplets etc. → PIG UDFs) and deploys them to the Hadoop cluster; on each Data Node the PIG script runs UDFs backed by the Informatica eDTM, which supplies complex transformations (address cleansing, dedup/matching, hierarchical data parsing) and Enterprise data access; design metadata lives in the Metadata Repository.
  • 28. Invoke Informatica transformations from your Hadoop MapReduce/PIG scripts
    Diagram: a Hadoop developer invokes Informatica UDFs from PIG scripts; mapplets built in the Informatica Developer tool are compiled to PIG UDFs, reusing Informatica components in Hadoop; on each Data Node the UDF runs the Informatica eDTM for complex transformations (dedupe/matching, hierarchical data parsing) and Enterprise data access, with design metadata in the Metadata Repository.