
Data Ingestion, Extraction & Parsing on Hadoop

  1. Data Ingestion, Extraction, and Preparation for Hadoop. Speakers: Sanjay Kaluskar, Sr. Architect, Informatica; David Teniente, Data Architect, Rackspace.
  2. Safe Harbor Statement
     • The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.
     • Some of the comments we will make today are forward-looking statements, including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships, and our expectations regarding future industry trends and macroeconomic development.
     • All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date, and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made.
     • Please refer to our recent SEC filings, including the Form 10-Q for the quarter ended September 30th, 2011, for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department.
  3. The Hadoop Data Processing Pipeline with Informatica PowerCenter + PowerExchange (available today / 1H 2012):
     • 1. Ingest Data into Hadoop
     • 2. Parse & Prepare Data on Hadoop
     • 3. Transform & Cleanse Data on Hadoop
     • 4. Extract Data from Hadoop
     Data sources include product & service offerings, customer service logs & surveys, marketing campaigns, customer profiles, account transactions, and social media; downstream consumers include the sales & marketing data mart and the customer service portal.
  4. Options by data type and pipeline stage:
     • Structured data (e.g. OLTP, OLAP): Ingest/Extract with Informatica PowerCenter + PowerExchange or Sqoop; Parse & Prepare: N/A; Transform & Cleanse with Hive, Pig, or MapReduce (future: Informatica roadmap).
     • Unstructured and semi-structured data (e.g. web logs, JSON): Ingest/Extract with Informatica PowerCenter + PowerExchange, file copies, Flume, Scribe, or Kafka; Parse & Prepare with Informatica HParser, Pig/Hive UDFs, or MapReduce; Transform & Cleanse with Hive, Pig, or MapReduce (future: Informatica roadmap).
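     To make the "Transform & Cleanse with Hive" option concrete, here is a minimal HiveQL sketch for cleansing semi-structured web logs that have already landed in HDFS. The table names, columns, and path (raw_weblogs, clean_weblogs, /data/raw/weblogs) are hypothetical, not from the deck.

        -- Hypothetical raw table over tab-delimited web logs already copied into HDFS
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_weblogs (
          log_ts     STRING,
          client_ip  STRING,
          url        STRING,
          status     STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION '/data/raw/weblogs';

        -- Cleanse: drop malformed rows, normalize the status code, trim the URL
        CREATE TABLE clean_weblogs AS
        SELECT
          log_ts,
          client_ip,
          trim(url)           AS url,
          cast(status AS INT) AS status_code
        FROM raw_weblogs
        WHERE status RLIKE '^[0-9]+$'
          AND client_ip IS NOT NULL;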
  5. Unleash the Power of Hadoop with High-Performance Universal Data Access:
     • Messaging and Web Services: WebSphere MQ, Web Services, JMS, TIBCO, MSMQ, webMethods, SAP NetWeaver XI.
     • Packaged Applications: JD Edwards, SAP NetWeaver, SAP NetWeaver BI, Lotus Notes, Oracle E-Business, SAS, PeopleSoft, Siebel.
     • Relational and Flat Files: Oracle, Informix, DB2 UDB, Teradata, DB2/400, Netezza, SQL Server, ODBC, Sybase, JDBC.
     • SaaS/BPO: Salesforce CRM, ADP, Hewitt, Force.com, RightNow, SAP By Design, Oracle OnDemand, NetSuite.
     • Mainframe and Midrange: ADABAS, VSAM, Datacom, C-ISAM, DB2, Binary Flat Files, IDMS, IMS, Tape Formats…
     • Industry Standards: EDI-X12, EDI-Fact, AST, FIX, RosettaNet, Cargo IMP, HL7, MVR, HIPAA.
     • Unstructured Data and Files: Word, Excel, Flat files, PDF, ASCII reports, StarOffice, HTML, WordPerfect, RPG, Email (POP, IMAP), HTTP, LDAP.
     • XML Standards: XML, ebXML, LegalXML, HL7 v3.0, IFX, ACORD (AL3, XML), ANSI, cXML.
     • MPP Appliances: EMC/Greenplum, AsterData, Vertica.
     • Social Media: Facebook, LinkedIn, Twitter.
  6. Ingest Data: PowerExchange accesses source data (web servers, databases, data warehouses, message queues, email, social media, ERP, CRM, mainframes) in batch, CDC, or real-time mode; PowerCenter pre-processes it (e.g. filter, join, cleanse), reusing existing PowerCenter mappings; the data is then ingested into HDFS and Hive.
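     On the Hive side of this flow, the ingest target might simply be a partitioned Hive table that each batch run loads. A minimal sketch, assuming a hypothetical customer_interactions table partitioned by load date; the columns and staging path are invented for illustration:

        -- Hypothetical Hive target for ingested, pre-processed records
        CREATE TABLE IF NOT EXISTS customer_interactions (
          customer_id  BIGINT,
          channel      STRING,   -- e.g. email, social, web
          event_ts     STRING,
          payload      STRING
        )
        PARTITIONED BY (load_date STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

        -- Each batch/CDC run lands a file set and registers it as a new partition
        LOAD DATA INPATH '/staging/interactions/2012-01-15'
        INTO TABLE customer_interactions
        PARTITION (load_date = '2012-01-15');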
  7. Extract Data: PowerCenter extracts data from HDFS and post-processes it (e.g. transforms it to the target schema), reusing existing PowerCenter mappings; PowerExchange then delivers the data in batch to databases, data warehouses, ERP, CRM, and mainframes.
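     If part of the post-processing is pushed into Hive before extraction, the "transform to target schema" step could be expressed as a query that writes an export-ready directory. A sketch under that assumption; the table, columns, and output path are hypothetical:

        -- Shape cleansed interactions into the target warehouse schema and
        -- write them to an HDFS directory the extract job can pick up
        INSERT OVERWRITE DIRECTORY '/export/dw/daily_interactions'
        SELECT
          customer_id,
          load_date  AS activity_date,
          count(*)   AS interaction_count
        FROM customer_interactions
        WHERE load_date = '2012-01-15'
        GROUP BY customer_id, load_date;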
  8. PowerCenter workflow steps: 1. Create the ingest or extract mapping; 2. Create the Hadoop connection; 3. Configure the workflow; 4. Create and load into the Hive table.
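     Step 4, "Create & Load Into Hive Table", corresponds to plain HiveQL DDL/DML like the following. This is a generic sketch, not the exact statements the product generates; stock_trades, its columns, and the landing path are made up for illustration:

        -- Create the Hive table the workflow will load
        CREATE TABLE IF NOT EXISTS stock_trades (
          symbol    STRING,
          trade_ts  STRING,
          price     DOUBLE,
          volume    BIGINT
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

        -- Load the files the ingest mapping wrote into HDFS
        LOAD DATA INPATH '/landing/stock_trades/2012-01-15'
        INTO TABLE stock_trades;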
  9. The Hadoop Data Processing Pipeline with Informatica HParser (available today / 1H 2012). The same four steps as above: 1. Ingest Data into Hadoop; 2. Parse & Prepare Data on Hadoop (HParser); 3. Transform & Cleanse Data on Hadoop; 4. Extract Data from Hadoop, fed by the same sources (product & service offerings, customer service logs & surveys, marketing campaigns, customer profiles, account transactions, social media) and consumers (sales & marketing data mart, customer service portal).
  10. Options by data type and pipeline stage (same matrix as slide 4): Informatica HParser is the Parse & Prepare option for unstructured and semi-structured data such as web logs and JSON.
  11. Informatica HParser productivity: Data Transformation Studio.
  12. Informatica HParser productivity: Data Transformation Studio.
     • Out-of-the-box transformations for all messages in all versions; easy example-based visual enhancements and edits; updates and new versions delivered from Informatica; definitions are expressed in business (industry) terminology; enhanced validations.
     • Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0, Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO.
     • Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML.
     • B2B standards: UN/EDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI.
     • Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC.
     • Other: IATA-PADIS, PLMXML, NEIM.
  13. Informatica HParser: how does it work? HParser runs on the Hadoop cluster alongside a service repository and is invoked from the command line, e.g. hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt.
     1. Develop an HParser transformation.
     2. Deploy the transformation.
     3. Run HParser on Hadoop to produce tabular data in HDFS.
     4. Analyze the data with Hive, Pig, MapReduce, or other tools.
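     Step 4 is where the tabular output becomes queryable. A minimal sketch, assuming HParser wrote tab-delimited records to a hypothetical /output/parsed_messages directory with the columns shown (both the path and the schema are assumptions):

        -- Expose the tabular output to Hive without moving it
        CREATE EXTERNAL TABLE IF NOT EXISTS parsed_messages (
          message_id    STRING,
          message_type  STRING,
          sender        STRING,
          amount        DOUBLE
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION '/output/parsed_messages';

        -- Simple analysis over the parsed records
        SELECT message_type, count(*) AS msg_count, sum(amount) AS total_amount
        FROM parsed_messages
        GROUP BY message_type;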
  14. The Hadoop Data Processing Pipeline: Informatica roadmap (available today / 1H 2012). The same four steps: 1. Ingest Data into Hadoop; 2. Parse & Prepare Data on Hadoop; 3. Transform & Cleanse Data on Hadoop; 4. Extract Data from Hadoop, with the same sources and consumers as above.
  15. Options by data type and pipeline stage (same matrix as slide 4): the Informatica roadmap covers the future Transform & Cleanse options alongside Hive, Pig, and MapReduce.
  16. Informatica Hadoop roadmap, 1H 2012:
     • Process data on Hadoop: IDE, administration, monitoring, and workflow; data processing flows designed in the IDE (source/target, filter, join, lookup, etc.); execution on the Hadoop cluster (pushdown via Hive).
     • Flexibility to plug in custom code: Hive and Pig UDFs, MapReduce scripts.
     • Productivity with optimal performance: exploit Hive performance characteristics and optimize the end-to-end data flow.
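     As an illustration of the "plug in custom code" point, a custom Hive UDF is typically registered and used as below. The jar path, Java class name, and customer_contacts table are placeholders, not Informatica artifacts:

        -- Register a user-defined function packaged in a jar (path and class are hypothetical)
        ADD JAR /udfs/text-cleanup.jar;
        CREATE TEMPORARY FUNCTION normalize_phone AS 'com.example.hive.NormalizePhoneUDF';

        -- Use the UDF inside an ordinary Hive query
        SELECT customer_id, normalize_phone(raw_phone) AS phone
        FROM customer_contacts
        WHERE raw_phone IS NOT NULL;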
  17. Mapping for Hive execution: the mapping is a logical representation of the processing steps; the source is validated and configured for Hive translation, and the generated Hive code can be previewed, e.g.:
     INSERT INTO STG0 SELECT * FROM StockAnalysis0;
     INSERT INTO STG1 SELECT * FROM StockAnalysis1;
     INSERT INTO STG2 SELECT * FROM StockAnalysis2;
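     For context, the staging-table pattern shown in the preview generally implies table definitions like the following before the INSERT statements run. This is a hedged sketch of the pattern only, not the code Informatica actually generates; the column choices are invented:

        -- Hypothetical source and staging tables matching the previewed INSERTs
        CREATE TABLE IF NOT EXISTS StockAnalysis0 (symbol STRING, price DOUBLE, volume BIGINT);
        CREATE TABLE IF NOT EXISTS STG0 (symbol STRING, price DOUBLE, volume BIGINT);

        -- Each generated step materializes one stage for the next operator in the flow
        INSERT INTO TABLE STG0
        SELECT symbol, price, volume
        FROM StockAnalysis0;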
  18. Takeaways:
     • Universal connectivity: completeness and enrichment of raw data for holistic analysis; prevent Hadoop from becoming another silo accessible only to a few experts.
     • Maximum productivity: a collaborative development environment; the right level of abstraction for data processing logic; re-use of algorithms and data flow logic; metadata-driven processing; documented data lineage for auditing and impact analysis; deployment on any platform for optimal performance and utilization.
  19. Customer sentiment: reaching beyond NPS (Net Promoter Score) and surveys. Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™. Objectives:
     • What are "they" saying?
     • Gauge the level of sentiment.
     • Fanatical Support™ for the win: increase NPS, increase MRR, decrease churn, provide the right products, keep our promises.
  20. Customer sentiment use cases, pulling it all together:
     • Case 1: Match social media posts with customers and determine a probable match.
     • Case 2: Determine the sentiment of a post by searching for key words and scoring the post.
     • Case 3: Determine correlations between posts, ticket volume, and NPS that lead to negative or positive sentiment.
     • Case 4: Determine correlations between sentiment and products/configurations that lead to negative or positive sentiment.
     • Case 5: The ability to trend all inputs over time…
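     Case 2 (keyword-based scoring) can be prototyped directly in Hive once posts are in HDFS. A deliberately simplistic sketch, assuming a hypothetical social_posts table; the keyword list and weights are illustrative only, not Rackspace's actual scoring:

        -- Naive keyword scoring of social media posts (illustrative, not a real model)
        SELECT
          post_id,
          customer_id,
          (CASE WHEN lower(post_text) LIKE '%love%'    THEN 1 ELSE 0 END
         + CASE WHEN lower(post_text) LIKE '%awesome%' THEN 1 ELSE 0 END
         - CASE WHEN lower(post_text) LIKE '%outage%'  THEN 1 ELSE 0 END
         - CASE WHEN lower(post_text) LIKE '%down%'    THEN 1 ELSE 0 END) AS sentiment_score
        FROM social_posts
        WHERE post_text IS NOT NULL;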
  21. Rackspace Fanatical Support™ big data environment: data sources (databases, flat files, and data streams, including Oracle, MySQL, MS SQL, Greenplum, Postgres, DB2, Excel, CSV, flat files, XML, EDI, binary, and sys logs) arrive over a message bus / port listening and land in Hadoop HDFS. Analytics run either directly over Hadoop (search, analytics, messaging APIs, algorithmic) or indirectly through the BI stack (BI analytics).
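     "Direct analytics over Hadoop" can be as simple as pointing Hive at the landed files. A minimal sketch, assuming syslog-derived records were landed as tab-delimited files under a hypothetical /data/syslogs path with the columns shown:

        -- External table over landed syslog records (no data movement)
        CREATE EXTERNAL TABLE IF NOT EXISTS syslog_events (
          host      STRING,
          severity  STRING,
          event_ts  STRING,
          message   STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION '/data/syslogs';

        -- Direct analytics: error volume per host
        SELECT host, count(*) AS error_count
        FROM syslog_events
        WHERE severity IN ('err', 'crit')
        GROUP BY host
        ORDER BY error_count DESC;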
  22. Twitter feed for Rackspace, using Informatica: sample input data and output data.
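     If the raw input is tweet JSON, field extraction of the kind shown on this slide could be expressed in Hive with the built-in get_json_object function. A sketch with a hypothetical raw_tweets table holding one JSON document per row; the table, path, and field choices are assumptions:

        -- One raw JSON tweet per line, loaded as a single string column
        -- (assumes the default text format, so the whole line lands in raw_json)
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_tweets (raw_json STRING)
        LOCATION '/data/twitter/raw';

        -- Extract a few fields from the JSON payload
        SELECT
          get_json_object(raw_json, '$.id_str')           AS tweet_id,
          get_json_object(raw_json, '$.user.screen_name') AS screen_name,
          get_json_object(raw_json, '$.text')             AS tweet_text,
          get_json_object(raw_json, '$.created_at')       AS created_at
        FROM raw_tweets;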