Data Ingestion, Extraction, and Preparation for Hadoop

Sanjay Kaluskar, Sr. Architect, Informatica
David Teniente, Data Architect, Rackspace

Safe Harbor Statement
•   The information being provided today is for informational purposes only. The
    development, release and timing of any Informatica product or functionality described
    today remain at the sole discretion of Informatica and should not be relied upon in
    making a purchasing decision. Statements made today are based on
    currently available information, which is subject to change. Such statements should
    not be relied upon as a representation, warranty or commitment to deliver specific
    products or functionality in the future.
•   Some of the comments we will make today are forward-looking statements including
    statements concerning our product portfolio, our growth and operational strategies,
    our opportunities, customer adoption of and demand for our products and services,
    the use and expected benefits of our products and services by customers, the
    expected benefit from our partnerships and our expectations regarding future
    industry trends and macroeconomic development.
•   All forward-looking statements are based upon current expectations and beliefs.
    However, actual results could differ materially. There are many reasons why actual
    results may differ from our current expectations. These forward-looking statements
    should not be relied upon as representing our views as of any subsequent date and
    Informatica undertakes no obligation to update forward-looking statements to reflect
    events or circumstances after the date that they are made.
•   Please refer to our recent SEC filings including the Form 10-Q for the quarter ended
    September 30th, 2011 for a detailed discussion of the risk factors that may affect our
    results. Copies of these documents may be obtained from the SEC or by contacting
    our Investor Relations department.




The Hadoop Data Processing Pipeline
Informatica PowerCenter + PowerExchange

Legend: Available Today; 1H / 2012
Component shown: PowerCenter + PowerExchange

Data sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Targets: Sales & Marketing Data mart, Customer Service Portal

Options

Structured (e.g. OLTP, OLAP)
•   Ingest/Extract Data: Informatica PowerCenter + PowerExchange, Sqoop
•   Parse & Prepare Data: N/A
•   Transform & Cleanse Data: Hive, Pig, MR; Future: Informatica Roadmap

Unstructured, semi-structured (e.g. web logs, JSON)
•   Ingest/Extract Data: Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka
•   Parse & Prepare Data: Informatica HParser, Pig/Hive UDFs, MR
•   Transform & Cleanse Data: Hive, Pig, MR; Future: Informatica Roadmap

Unleash the Power of Hadoop
With High Performance Universal Data Access

•   Messaging and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
•   Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
•   Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
•   SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
•   Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, Binary Flat Files, Tape Formats…
•   Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX, Cargo IMP, MVR
•   Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP
•   XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
•   MPP Appliances: EMC/Greenplum, Vertica, AsterData
•   Social Media: Facebook, Twitter, LinkedIn

Ingest Data

Flow: Access Data (PowerExchange) -> Pre-Process (PowerCenter) -> Ingest Data

•   Sources: Web server; Databases, Data Warehouse; Message Queues, Email, Social Media; ERP, CRM; Mainframe
•   Access modes: Batch, CDC, Real-time
•   Pre-processing: e.g. Filter, Join, Cleanse; reuse PowerCenter mappings
•   Targets: HDFS, Hive

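The pre-process stage (filter, join, cleanse) can be illustrated with a small hand-written sketch. This is not PowerCenter and does not reproduce any Informatica mapping; the field names (`customer_id`, `name`, `amount`) and both function names are assumptions made up for the example. It only shows the kind of row-level logic that runs before data lands in HDFS or a Hive text table.

```python
import csv
import io


def preprocess(transactions, customers):
    """Filter, join, and cleanse rows before they are written for ingestion."""
    # Index the lookup side of the join by customer id (hypothetical field).
    by_id = {c["customer_id"]: c for c in customers}
    out = []
    for t in transactions:
        # Filter: drop rows with a missing, malformed, or non-positive amount.
        try:
            amount = float(t["amount"])
        except (KeyError, TypeError, ValueError):
            continue
        if amount <= 0:
            continue
        # Join: keep only transactions whose customer is known (inner join).
        customer = by_id.get(t.get("customer_id"))
        if customer is None:
            continue
        # Cleanse: trim whitespace, normalize case, fix the number format.
        out.append({
            "customer_id": t["customer_id"],
            "name": customer["name"].strip().title(),
            "amount": f"{amount:.2f}",
        })
    return out


def to_delimited(rows, delimiter="\t"):
    """Render rows as delimited text, a layout a Hive text table can read."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    for r in rows:
        writer.writerow([r["customer_id"], r["name"], r["amount"]])
    return buf.getvalue()
```

The delimited output could then be copied into HDFS with whatever transfer mechanism the environment already uses; that step is deliberately left out of the sketch.
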
Extract Data

Flow: Extract Data -> Post-Process (PowerCenter) -> Deliver Data (PowerExchange)

•   Source: HDFS
•   Post-processing: e.g. Transform to target schema; reuse PowerCenter mappings
•   Delivery mode: Batch
•   Targets: Web server; Databases, Data Warehouse; ERP, CRM; Mainframe

1. Create Ingest or Extract Mapping
2. Create Hadoop Connection
3. Configure Workflow
4. Create & Load Into Hive Table

The Hadoop Data Processing Pipeline
Informatica HParser

Legend: Available Today; 1H / 2012

Data sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop (HParser)
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Targets: Sales & Marketing Data mart, Customer Service Portal

Informatica HParser
Productivity: Data Transformation Studio

Supported formats:
•   Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI – V2.0 Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
•   Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
•   Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
•   B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI
•   Other: IATA-PADIS, PLMXML, NEIM

Callouts:
•   Out of the box transformations for all messages in all versions
•   Updates and new versions delivered from Informatica
•   Easy example-based visual enhancements and edits
•   Definition is done using business (industry) terminology and definitions
•   Enhanced validations

Informatica HParser
How does it work?

1. Develop an HParser transformation
2. Deploy the transformation
3. Run HParser on Hadoop to produce tabular data:
      hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
4. Analyze the data with Hive / Pig / MapReduce / Other

Diagram labels: Hadoop cluster, Svc Repository, HDFS

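To make "produce tabular data" concrete, here is a minimal hand-written sketch of the same idea for one simple input type, web server logs. It is a stand-in, not HParser: it uses a plain regular expression for the Apache common log format, and the names `LOG_PATTERN`, `FIELDS`, `parse_line`, and `run` are assumptions introduced for the example. The stdin-to-stdout shape is the one a Hadoop Streaming mapper would use.

```python
import re
import sys

# Apache "common log format": host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Column order of the tabular output.
FIELDS = ("host", "time", "method", "path", "status", "size")


def parse_line(line):
    """Return one tab-separated row, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return "\t".join(m.group(f) for f in FIELDS)


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read raw records line by line and write one tabular row per match."""
    for line in stdin:
        row = parse_line(line.rstrip("\n"))
        if row is not None:
            # Unparseable lines are skipped rather than failing the job.
            stdout.write(row + "\n")
```

Calling `run()` at the bottom of a script would turn this into a mapper-style filter. Rows in this tab-separated layout can be read by Hive or Pig as an ordinary delimited text table, which is the hand-off that step 4 describes.
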
The Hadoop Data Processing Pipeline
Informatica Roadmap

Legend: Available Today; 1H / 2012

Data sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Targets: Sales & Marketing Data mart, Customer Service Portal

Informatica Hadoop Roadmap – 1H 2012

• Process data on Hadoop
   • IDE, administration, monitoring, workflow
   • Data processing flow designed through IDE: Source/Target,
     Filter, Join, Lookup, etc.
   • Execution on Hadoop cluster (pushdown via Hive)

• Flexibility to plug in custom code
   • Hive and Pig UDFs
   • MR scripts

• Productivity with optimal performance
   • Exploit Hive performance characteristics
   • Optimize end-to-end data flow for performance


Mapping for Hive execution

•   Source
•   Logical representation of processing steps
•   Validate & configure for Hive translation
•   Preview generated Hive code

Generated Hive code (preview):

    INSERT INTO STG0
    SELECT * FROM StockAnalysis0;
    INSERT INTO STG1
    SELECT * FROM StockAnalysis1;
    INSERT INTO STG2
    SELECT * FROM StockAnalysis2;

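The idea of translating a logical mapping into Hive statements can be sketched in a few lines. This is an illustrative toy, not Informatica's translator: the `Mapping` class, its fields, and the `to_hive` and `preview` functions are all hypothetical. It emits statements in the same shape as the preview on the slide, with an optional filter rendered as a `WHERE` clause.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Mapping:
    """A hypothetical logical mapping: one source, one target, optional filter."""
    source: str
    target: str
    columns: List[str] = field(default_factory=lambda: ["*"])
    where: Optional[str] = None


def to_hive(mapping: Mapping) -> str:
    """Translate one logical mapping into a single Hive statement."""
    sql = (
        f"INSERT INTO {mapping.target}\n"
        f"SELECT {', '.join(mapping.columns)} FROM {mapping.source}"
    )
    if mapping.where:
        # The filter step of the mapping becomes a WHERE clause.
        sql += f"\nWHERE {mapping.where}"
    return sql + ";"


def preview(mappings) -> str:
    """Join the generated statements, as in a 'preview generated code' view."""
    return "\n".join(to_hive(m) for m in mappings)
```

For example, `preview([Mapping(f"StockAnalysis{i}", f"STG{i}") for i in range(3)])` yields the three statements listed in the slide's preview. A real translator would also have to handle joins, lookups, and type mapping, which this sketch leaves out.
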
Takeaways

• Universal connectivity
   • Completeness and enrichment of raw data for holistic analysis
   • Prevent Hadoop from becoming another silo accessible to a few
     experts

• Maximum productivity
   • Collaborative development environment
      • Right level of abstraction for data processing logic
      • Re-use of algorithms and data flow logic
   • Metadata-driven processing
      • Document data lineage for auditing and impact analysis
      • Deploy on any platform for optimal performance and utilization




Customer Sentiment - Reaching beyond
NPS (Net Promoter Score) and surveys

Gaining insight into our customers’ sentiment
will improve Rackspace’s ability to provide
Fanatical Support™
Objectives:
• What are “they” saying
• Gauge the level of sentiment
• Fanatical Support™ for the win
   • Increase NPS
   • Increase MRR
   • Decrease Churn
   • Provide the right products
   • Keep our promises


Customer Sentiment Use Cases
Pulling it all together

•   Case 1: Match social media posts with Customer. Determine a probable match.
•   Case 2: Determine the sentiment of a post, searching key words and scoring the post.
•   Case 3: Determine correlations between posts, ticket volume and NPS leading to negative or positive sentiments.
•   Case 4: Determine correlations in sentiments with products/configurations which lead to negative or positive sentiments.
•   Case 5: The ability to trend all inputs over time…
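
One way to picture Case 1 (a probable match between a social media handle and a customer) is plain string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; it is an assumption for illustration, not the matching method used at Rackspace, and the function names and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def probable_match(handle, customers, threshold=0.6):
    """Return (customer, score) for the best candidate, or None if too weak."""
    best = max(customers, key=lambda c: similarity(handle, c), default=None)
    if best is None:
        return None
    score = similarity(handle, best)
    # Below the threshold the match is not considered "probable".
    return (best, score) if score >= threshold else None
```

In practice a match would likely combine several signals (profile name, location, linked URLs) rather than a single string comparison; the threshold is the knob that trades missed matches against false ones.
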


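Case 2 (searching key words and scoring the post) can be sketched as a weighted keyword lookup. The keyword list and weights below are invented for the example and are far smaller than any usable lexicon; `score_post` and `label` are hypothetical names, and nothing here is taken from the Rackspace implementation.

```python
import re

# Hypothetical keyword weights: positive words score above zero,
# negative words below zero.
KEYWORDS = {
    "great": 2,
    "love": 2,
    "thanks": 1,
    "slow": -1,
    "down": -2,
    "outage": -3,
}


def score_post(text, keywords=KEYWORDS):
    """Sum the weights of every known keyword found in the post."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(keywords.get(w, 0) for w in words)


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because matching is on whole words, "downtime" does not count as "down"; whether that is desirable depends on the lexicon, and it is one reason simple keyword scoring tends to need tuning against real posts.
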
Rackspace Fanatical Support™
Big Data Environment

•   Data Sources (DBs, Flat files, Data Streams): Oracle, MySQL, MS SQL, Postgres, DB2, Excel, CSV, Flat File, XML, EDI, Binary, Sys Logs, Messaging, APIs
•   Ingestion: Message bus / port listening -> Hadoop HDFS
•   Direct Analytics over Hadoop: Search, Analytics, Algorithmic
•   Indirect Analytics over Hadoop: Greenplum DB, BI Analytics, BI Stack

Twitter Feed for Rackspace
Using Informatica

[Screenshots: Input Data; Output Data]

Source: Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sanjay Kaluskar, Informatica

Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoopskaluska
 
Informatica
InformaticaInformatica
Informaticamukharji
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Etu Solution
 
Hadoop India Summit, Feb 2011 - Informatica
Hadoop India Summit, Feb 2011 - InformaticaHadoop India Summit, Feb 2011 - Informatica
Hadoop India Summit, Feb 2011 - InformaticaSanjeev Kumar
 
Talk IT_ Oracle_김태완_110831
Talk IT_ Oracle_김태완_110831Talk IT_ Oracle_김태완_110831
Talk IT_ Oracle_김태완_110831Cana Ko
 
SnapLogic corporate presentation
SnapLogic corporate presentationSnapLogic corporate presentation
SnapLogic corporate presentationpbridges
 
Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems divjeev
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sanjay Kaluskar, Informatica

  • 1. Data Ingestion, Extraction, and Preparation for Hadoop
      Sanjay Kaluskar, Sr. Architect, Informatica
      David Teniente, Data Architect, Rackspace
  • 2. Safe Harbor Statement
      • The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.
      • Some of the comments we will make today are forward-looking statements including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships and our expectations regarding future industry trends and macroeconomic development.
      • All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made.
      • Please refer to our recent SEC filings including the Form 10-Q for the quarter ended September 30th, 2011 for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department.
  • 3. The Hadoop Data Processing Pipeline: Informatica PowerCenter + PowerExchange
      [Diagram] Legend: Available Today; 1H / 2012
      Sources: Product & Service Offerings; Customer Service Logs & Surveys; Marketing Campaigns; Customer Profile; Account Transactions; Social Media
      Steps:
        1. Ingest Data into Hadoop
        2. Parse & Prepare Data on Hadoop
        3. Transform & Cleanse Data on Hadoop
        4. Extract Data from Hadoop
      Targets: Sales & Marketing Data mart; Customer Service Portal
  • 4. Options
      Data type | Ingest/Extract Data | Parse & Prepare Data | Transform & Cleanse Data
      Structured (e.g. OLTP, OLAP) | Informatica PowerCenter + PowerExchange, Sqoop | N/A | Hive, PIG, MR; Future: Informatica Roadmap
      Unstructured, semi-structured (e.g. web logs, JSON) | Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka | Informatica HParser, PIG/Hive UDFs, MR | Hive, PIG, MR; Future: Informatica Roadmap
  • 5. Unleash the Power of Hadoop With High Performance Universal Data Access
      • Messaging and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
      • Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
      • Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
      • SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
      • Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, Binary Flat Files, Tape Formats…
      • Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX, Cargo IMP, MVR
      • Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP
      • XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
      • MPP Appliances: EMC/Greenplum, Vertica, AsterData
      • Social Media: Facebook, Twitter, LinkedIn
  • 6. Ingest Data
      [Diagram] Flow: Access Data → Pre-Process → Ingest Data
      Sources: Web server; Databases, Data Warehouse; Message Queues, Email, Social Media; ERP, CRM; Mainframe
      Access (PowerExchange): Batch, CDC, Real-time
      Pre-process (PowerCenter): e.g. Filter, Join, Cleanse; Reuse PowerCenter mappings
      Targets: HDFS, HIVE
  • 7. Extract Data
      [Diagram] Flow: Extract Data → Post-Process → Deliver Data
      Source: HDFS
      Post-process (PowerCenter): e.g. Transform to target schema; Reuse PowerCenter mappings
      Deliver (PowerExchange): Batch
      Targets: Web server; Databases, Data Warehouse; ERP, CRM; Mainframe
  • 8. Steps shown on the slide:
      1. Create Ingest or Extract Mapping
      2. Create Hadoop Connection
      3. Configure Workflow
      4. Create & Load Into Hive Table
  • 9. The Hadoop Data Processing Pipeline: Informatica HParser
      [Diagram] Legend: Available Today; 1H / 2012
      Sources: Product & Service Offerings; Customer Service Logs & Surveys; Marketing Campaigns; Customer Profile; Account Transactions; Social Media
      Steps:
        1. Ingest Data into Hadoop
        2. Parse & Prepare Data on Hadoop (HParser)
        3. Transform & Cleanse Data on Hadoop
        4. Extract Data from Hadoop
      Targets: Sales & Marketing Data mart; Customer Service Portal
  • 10. Options
      Data type | Ingest/Extract Data | Parse & Prepare Data | Transform & Cleanse Data
      Structured (e.g. OLTP, OLAP) | Informatica PowerCenter + PowerExchange, Sqoop | N/A | Hive, PIG, MR; Future: Informatica Roadmap
      Unstructured, semi-structured (e.g. web logs, JSON) | Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka | Informatica HParser, PIG/Hive UDFs, MR | Hive, PIG, MR; Future: Informatica Roadmap
  • 11. Informatica HParser Productivity: Data Transformation Studio
  • 12. Informatica HParser Productivity: Data Transformation Studio
      • Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI – V2.0 Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
      • Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
      • Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
      • B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI
      • Other: IATA-PADIS, PLMXML, NEIM
      Callouts:
      • Out of the box transformations for all messages in all versions
      • Easy example based visual enhancements and edits
      • Updates and new versions delivered from Informatica
      • Definition is done using Business (industry) terminology and definitions
      • Enhanced Validations
  • 13. Informatica HParser: How does it work?
      1. Develop an HParser transformation
      2. Deploy the transformation
      3. Run HParser on Hadoop to produce tabular data:
           hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
      4. Analyze the data with HIVE / PIG / MapReduce / Other
      Diagram labels: Hadoop cluster, Svc Repository, HDFS
  • 14. The Hadoop Data Processing Pipeline: Informatica Roadmap
      [Diagram] Legend: Available Today; 1H / 2012
      Sources: Product & Service Offerings; Customer Service Logs & Surveys; Marketing Campaigns; Customer Profile; Account Transactions; Social Media
      Steps:
        1. Ingest Data into Hadoop
        2. Parse & Prepare Data on Hadoop
        3. Transform & Cleanse Data on Hadoop
        4. Extract Data from Hadoop
      Targets: Sales & Marketing Data mart; Customer Service Portal
  • 15. Options
      Data type | Ingest/Extract Data | Parse & Prepare Data | Transform & Cleanse Data
      Structured (e.g. OLTP, OLAP) | Informatica PowerCenter + PowerExchange, Sqoop | N/A | Hive, PIG, MR; Future: Informatica Roadmap
      Unstructured, semi-structured (e.g. web logs, JSON) | Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka | Informatica HParser, PIG/Hive UDFs, MR | Hive, PIG, MR; Future: Informatica Roadmap
  • 16. Informatica Hadoop Roadmap – 1H 2012
      • Process data on Hadoop
          • IDE, administration, monitoring, workflow
          • Data processing flow designed through IDE: Source/Target, Filter, Join, Lookup, etc.
          • Execution on Hadoop cluster (pushdown via Hive)
      • Flexibility to plug-in custom code
          • Hive and PIG UDFs
          • MR scripts
      • Productivity with optimal performance
          • Exploit Hive performance characteristics
          • Optimize end-to-end data flow for performance
  • 17. Mapping for Hive execution
      • Logical representation of processing steps (diagram label: Source)
      • Validate & configure for Hive translation
      • Pre-view generated Hive code:
          INSERT INTO STG0 SELECT * FROM StockAnalysis0;
          INSERT INTO STG1 SELECT * FROM StockAnalysis1;
          INSERT INTO STG2 SELECT * FROM StockAnalysis2;
  • 18. Takeaways
      • Universal connectivity
          • Completeness and enrichment of raw data for holistic analysis
          • Prevent Hadoop from becoming another silo accessible to a few experts
      • Maximum productivity
          • Collaborative development environment
          • Right level of abstraction for data processing logic
          • Re-use of algorithms and data flow logic
      • Meta-data driven processing
          • Document data lineage for auditing and impact analysis
          • Deploy on any platform for optimal performance and utilization
  • 19. Customer Sentiment - Reaching beyond NPS (Net Promoter Score) and surveys
      Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™
      Objectives:
      • What are “they” saying
      • Gauge the level of sentiment
      • Fanatical Support™ for the win
      • Increase NPS
      • Increase MRR
      • Decrease Churn
      • Provide the right products
      • Keep our promises
  • 20. Customer Sentiment Use Cases: Pulling it all together
      • Case 1: Match social media posts with Customer. Determine a probable match.
      • Case 2: Determine the sentiment of a post, searching key words and scoring the post.
      • Case 3: Determine correlations between posts, ticket volume and NPS leading to negative or positive sentiments.
      • Case 4: Determine correlations in sentiments with products/configurations which lead to negative or positive sentiments.
      • Case 5: The ability to trend all inputs over time…
  • 21. Rackspace Fanatical Support™ Big Data Environment
      [Diagram] Data Sources (DBs, Flat files, Data Streams): Oracle, MySql, MS SQL, Postgres, DB2, Excel, CSV, Flat File, XML, EDI, Binary, Sys Logs
      Ingestion: Message bus / port listening
      Storage: Hadoop HDFS; Greenplum DB
      Analytics: Indirect Analytics over Hadoop (BI Analytics, BI Stack); Direct Analytics over Hadoop (Search, Analytics, Algorithmic)
      Other labels: Messaging APIs
  • 22. Twitter Feed for Rackspace Using Informatica (screenshot panels: Input Data, Output Data)
  • 23. 23
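
Case 2 on slide 20 (searching key words and scoring a post) can be illustrated with a minimal sketch. This is not the Rackspace or Informatica implementation, which the slides do not show. The keyword weights in `KEYWORD_SCORES` and the function names `score_post` and `label` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical keyword weights; the slides do not list the actual keywords.
KEYWORD_SCORES = {
    "great": 2, "love": 2, "fast": 1, "thanks": 1,
    "slow": -1, "down": -2, "outage": -2, "terrible": -2,
}


def score_post(text: str) -> int:
    """Sum the weights of every known keyword that appears in the post."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(KEYWORD_SCORES.get(word, 0) for word in words)


def label(score: int) -> str:
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for post in ("Love the fast support, thanks!",
                 "Site is down again, terrible outage"):
        s = score_post(post)
        print(f"{s:+d} {label(s)}: {post}")
```

A plain keyword sum ignores negation and sarcasm, so it is best read as a baseline for the scoring step rather than a complete sentiment model.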

Editor's Notes

  1. * EXAMPLE * Some talking points to cover over the next few slides on PowerExchange for Hadoop:
       • Access all data sources
       • Ability to pre-process (e.g. filter) before landing to HDFS and post-process to fit target schema
       • Performance of load via partitioning, native APIs, grid, pushdown to source or target, process offloading
       • Productivity via visual designer
       • Different latencies (batch, near real-time)
     One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. All too often developers resort to reinventing the wheel by building custom adapters and scripts that require expert knowledge of the source systems, applications, data structures and formats. Once they overcome this hurdle, they need to make sure their custom code will perform and scale as data volumes grow. In the push for speed, security and reliability are often overlooked, which increases the risk of non-compliance and system downtime. Needless to say, building a robust custom adapter takes time and can be costly to maintain as software versions change. Sometimes the end result is adapters that lack direct connectivity between the source systems and Hadoop, which means you need to temporarily stage the data before it can move into Hadoop, increasing storage costs. Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real-time, or near real-time) and deliver all your data directly into Hadoop (see Figure 2). Similarly, Informatica PowerExchange can deliver data from Hadoop to your enterprise applications and information management systems. You can schedule batch loads to move data from multiple source systems directly into Hadoop without any staging. Alternatively, you can move only changed data from relational and mainframe systems directly into Hadoop. For real-time data feeds, you can move data off message queues and deliver it into Hadoop.
Informatica PowerExchange accesses data through native APIs to ensure optimal performance and is designed to minimize the impact on source systems through caching and process offloading. To further increase the performance of data flows between the source systems and Hadoop, PowerCenter supports data partitioning to distribute the processing across CPUs. Informatica PowerExchange for Hadoop is integrated with PowerCenter so that you can pre-process data from multiple data sources before it lands in Hadoop. This enables you to leverage the source system metadata, since this information is not retained in the Hadoop Distributed File System (HDFS). For example, you can perform lookups, filters, or relational joins based on primary and foreign key relationships before data is delivered to HDFS. You can also push down the pre-processing to the source system to limit data movement and unnecessary data duplication to Hadoop. Common design patterns for data flows into or out of Hadoop can be generated in PowerCenter using parameterized templates built in Microsoft Visio to dramatically increase productivity. To securely and reliably manage the file transfer and collection of very large data files from both inside and outside the firewall, you can use Informatica Managed File Transfer (MFT).
  2. Sanjay’s notes: Flume and Scribe are options for streaming ingestion of log files. Kafka is for near real-time.
  3. See the PWX for Hadoop white paper. Does not require expert knowledge of source systems. Deliver data directly to Hadoop without any intermediate staging. Access data through native APIs for optimal performance. Bring in both un-modeled / un-structured and structured relational data to make the analysis complete. Use an example to illustrate combining both unstructured and structured data needed for analysis.
  4. Have lineage of where data came from
  5. Informatica announced on Nov 2 the industry’s first data parser for Hadoop. The solution is designed to provide a powerful data parsing alternative to organizations that are seeking to achieve the full potential of Big Data in Hadoop with efficiency and scale. This solution addresses the industry’s growing demand for turning unstructured, complex data into a structured or semi-structured format in Hadoop to drive insights and improve operations. Tapping our industry-leading experience in parsing unstructured data and handling industry formats and documents within and across the enterprise, Informatica pioneered the development of a data parser that exploits the parallelism of the MapReduce framework. Using an engine-based, interactive tool to simplify the data parsing process, Informatica HParser processes complex files and messages in Hadoop with the following three offerings: Informatica HParser for logs, Omniture, XML and JSON (community edition), free of charge; Informatica HParser for industry standards (commercial edition); Informatica HParser for documents (commercial edition). With HParser, organizations can: accelerate deployment using out-of-the-box, ready-to-use transformations and industry standards; increase productivity for tackling diverse complex formats, including proprietary log files; speed the development of parsing by exploiting the parallelism inside MapReduce; optimize performance in data parsing for large files, including logs, XML, JSON and industry standards. Informatica also provides a free 30-day trial of the commercial edition of HParser for Documents to users interested in learning about the design environment for data transformation.
  6. Define the extraction/transformation logic using the designer. Run the parser as a standalone MR job. Command-line arguments are the script, input, and output files. Parallelism is across files; there is no support for file splits.
  7. Describe each of the future capabilities in the bullets. You can design and specify the entire end-to-end flow of your data processing pipeline with the flexibility to insert custom code. Choose the right level of abstraction to define your data flow; don’t reinvent the wheel. Informatica provides the right level of abstraction for data processing for rapid development (e.g. a metadata-driven development environment) and easy maintenance (e.g. complete specification and lineage of data).
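
Note 6 says the parser parallelizes across files and does not support file splits. The sketch below illustrates that scheduling pattern in plain Python, assuming a made-up log format of `key=value` pairs. It is not HParser or MapReduce code: `parse_file` and `parse_all` are hypothetical names, and a thread pool stands in for the map tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_file(path):
    """Hypothetical parser: turns each 'key=value key=value' line into a dict."""
    records = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            # Split each whitespace-separated token on the first '=' only.
            records.append(dict(pair.split("=", 1) for pair in line.split()))
    return records


def parse_all(paths, workers=4):
    """Parse files in parallel, one whole file per task.

    A file is never divided between workers, so the number of input files
    caps the achievable parallelism.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(parse_file, paths)))
```

Under this pattern many moderately sized files spread well across workers, while a single very large file is handled by one task from start to finish.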