Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sanjay Kaluskar, Informatica
 


One of the first challenges Hadoop developers face is accessing all the data they need and getting it into Hadoop for analysis. Informatica PowerExchange accesses a variety of data types and structures at different latencies (e.g. batch, real-time, or near real-time) and ingests data directly into Hadoop. The next step is to parse the data in preparation for analysis in Hadoop. Informatica provides a visual IDE to deploy pre-built parsers or design specific parsers for complex data formats and deploy them on Hadoop. Once the analysis is complete, Informatica PowerExchange delivers the resulting output to other information management systems such as a data warehouse. In this session, learn from Informatica and one of their customers how to get all the data you need into Hadoop, parse a variety of data formats and structures, and egress the resultant output to other systems.

Statistics

  • Total views: 4,796 (4,554 on SlideShare, 242 via embeds)
  • Likes: 3
  • Downloads: 0
  • Comments: 0
  • Embeds (242 across 4 sites): http://www.cloudera.com (235), http://blog.cloudera.com (4), https://www.cloudera.com (2), http://cloudera.matt.dev (1)


Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Speaker Notes

  • * EXAMPLE * Some talking points to cover over the next few slides on PowerExchange for Hadoop:
    • Access all data sources
    • Ability to pre-process (e.g. filter) before landing to HDFS and post-process to fit the target schema
    • Performance of the load via partitioning, native APIs, grid, pushdown to source or target, and process offloading
    • Productivity via a visual designer
    • Different latencies (batch, near real-time)
    One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. All too often, developers resort to reinventing the wheel by building custom adapters and scripts that require expert knowledge of the source systems, applications, data structures, and formats. Once they overcome this hurdle, they need to make sure their custom code will perform and scale as data volumes grow. Along with the need for speed, security and reliability are often overlooked, which increases the risk of non-compliance and system downtime. Needless to say, building a robust custom adapter takes time and can be costly to maintain as software versions change. Sometimes the end result is adapters that lack direct connectivity between the source systems and Hadoop, which means the data must be temporarily staged before it can move into Hadoop, increasing storage costs. Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real-time, or near real-time) and deliver all your data directly into Hadoop (see Figure 2). Similarly, Informatica PowerExchange can deliver data from Hadoop to your enterprise applications and information management systems. You can schedule batch loads to move data from multiple source systems directly into Hadoop without any staging. Alternatively, you can move only changed data from relational and mainframe systems directly into Hadoop. For real-time data feeds, you can move data off of message queues and deliver it into Hadoop. Informatica PowerExchange accesses data through native APIs to ensure optimal performance and is designed to minimize the impact on source systems through caching and process offloading. To further increase the performance of data flows between the source systems and Hadoop, PowerCenter supports data partitioning to distribute the processing across CPUs. Informatica PowerExchange for Hadoop is integrated with PowerCenter so that you can pre-process data from multiple data sources before it lands in Hadoop. This lets you leverage the source system metadata, since this information is not retained in the Hadoop Distributed File System (HDFS). For example, you can perform lookups, filters, or relational joins based on primary and foreign key relationships before data is delivered to HDFS. You can also push down the pre-processing to the source system to limit data movement and unnecessary data duplication in Hadoop. Common design patterns for data flows into or out of Hadoop can be generated in PowerCenter using parameterized templates built in Microsoft Visio to dramatically increase productivity. To securely and reliably manage the file transfer and collection of very large data files from both inside and outside the firewall, you can use Informatica Managed File Transfer (MFT). (A toy sketch of this filter-and-enrich-before-landing idea appears after these notes.)
  • Sanjay’s notes: Flume and Scribe are options for streaming ingestion of log files; Kafka is for near real-time. (An illustrative sketch of streaming log ingestion appears after these notes.)
  • See the PWX for Hadoop white paper:
    • Does not require expert knowledge of source systems
    • Delivers data directly to Hadoop without any intermediate staging
    • Accesses data through native APIs for optimal performance
    • Brings in both un-modeled/unstructured data and structured relational data to make the analysis complete
    • Use an example to illustrate combining the unstructured and structured data needed for analysis
  • Have lineage of where data came from
  • Informatica announced on Nov 2 the industry’s first data parser for Hadoop. The solution is designed to provide a powerful data parsing alternative to organizations seeking to achieve the full potential of Big Data in Hadoop with efficiency and scale. It addresses the industry’s growing demand for turning unstructured, complex data into structured or semi-structured format in Hadoop to drive insights and improve operations. Tapping our industry-leading experience in parsing unstructured data and handling industry formats and documents within and across the enterprise, Informatica pioneered the development of a data parser that exploits the parallelism of the MapReduce framework. Using an engine-based, interactive tool to simplify the data parsing process, Informatica HParser processes complex files and messages in Hadoop with the following three offerings:
    • Informatica HParser for logs, Omniture, XML and JSON (community edition), free of charge
    • Informatica HParser for industry standards (commercial edition)
    • Informatica HParser for documents (commercial edition)
    With HParser, organizations can:
    • Accelerate deployment using out-of-the-box, ready-to-use transformations and industry standards
    • Increase productivity when tackling diverse complex formats, including proprietary log files
    • Speed the development of parsing by exploiting the parallelism inside MapReduce
    • Optimize data parsing performance for large files including logs, XML, JSON and industry standards
    Informatica also provides a free 30-day trial of the commercial edition of HParser for Documents to users interested in learning about the design environment for data transformation.
  • Define the extraction/transformation logic using the designer. Run the parser as a standalone MapReduce job; the command-line arguments are the script, input, and output files. Parallelism is across files, with no support for file splits. (A rough Hadoop Streaming analogue of this map-only parsing step is sketched after these notes.)
  • Describe each of the future capabilities in the bullets. You can design and specify the entire end-to-end flow of your data processing pipeline, with the flexibility to insert custom code. Choose the right level of abstraction to define your data flow; don’t reinvent the wheel. Informatica provides the right level of abstraction for data processing, for rapid development (e.g. a metadata-driven development environment) and easy maintenance (e.g. a complete specification and lineage of the data).
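The notes above describe filtering, joining, and enriching data before it lands in HDFS. Purely as a toy illustration of that pre-process-then-land idea (not Informatica's implementation), the following Python sketch filters and enriches rows from a source extract and then copies the result into Hadoop; the file paths, column names, and lookup table are hypothetical, and it assumes the hadoop command-line client is installed.

```python
# Sketch of "pre-process before landing in HDFS": filter rows, enrich them from a
# lookup, then copy the result into Hadoop. Paths, columns, and lookup values are
# placeholders; in the talk this step is done with PowerCenter mappings, not code.
import csv
import subprocess

REGION_LOOKUP = {"US": "NA", "CA": "NA", "DE": "EMEA", "FR": "EMEA"}  # hypothetical lookup

with open("transactions.csv", newline="") as src, \
        open("transactions_clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["txn_id", "customer_id", "amount", "region"])
    for row in reader:
        if float(row["amount"]) <= 0:                          # filter out bad rows
            continue
        region = REGION_LOOKUP.get(row["country"], "OTHER")    # lookup-style enrichment
        writer.writerow([row["txn_id"], row["customer_id"], row["amount"], region])

# Land the pre-processed file in HDFS (assumes the `hadoop` CLI is on the PATH).
subprocess.run(["hadoop", "fs", "-put", "-f", "transactions_clean.csv",
                "/ingest/transactions/"], check=True)
```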
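The streaming note above lists Flume, Scribe, and Kafka as ingestion options for log data. As an assumed illustration of the Kafka route only (the talk itself shows no code), here is a minimal Python sketch using the third-party kafka-python package; the broker address, topic name, and log path are placeholders.

```python
# Minimal sketch of near-real-time log ingestion into Kafka. Assumes the
# kafka-python package and a broker at localhost:9092; a downstream consumer
# (or a Kafka-to-HDFS job) would land the stream in Hadoop for parsing.
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def follow(path):
    """Yield lines appended to a log file, like a simple `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)                     # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)          # wait for the web server to write more
                continue
            yield line.rstrip("\n")

for line in follow("/var/log/webserver/access.log"):
    producer.send("weblogs", line.encode("utf-8"))   # one log line per message
```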
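The note on running the parser as a standalone, map-only job (script, input, and output as command-line arguments, parallelism across files) can be approximated outside HParser with plain Hadoop Streaming. The sketch below is not HParser code; the log layout and output columns are assumptions for illustration.

```python
#!/usr/bin/env python
# Map-only Hadoop Streaming mapper: parse raw (Apache-style) web-server log lines
# into tab-separated records that Hive or Pig can consume. The log format and the
# chosen output columns are assumptions made for this illustration.
import re
import sys

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

for raw in sys.stdin:
    m = LOG_RE.match(raw)
    if not m:
        continue  # skip unparseable lines instead of failing the whole job
    sys.stdout.write("\t".join(m.group("ip", "ts", "method", "url",
                                       "status", "bytes")) + "\n")
```

It could be submitted with something like `hadoop jar hadoop-streaming.jar -input /raw/weblogs -output /parsed/weblogs -mapper parse_logs.py -file parse_logs.py -numReduceTasks 0`; with no reducers, parallelism comes from the input files and splits, loosely mirroring the "parallelism across files" behaviour described above.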

Presentation Transcript

  • Data Ingestion, Extraction, and Preparation for Hadoop. Sanjay Kaluskar, Sr. Architect, Informatica; David Teniente, Data Architect, Rackspace (slide 1)
  • Safe Harbor Statement (slide 2):
    • The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future.
    • Some of the comments we will make today are forward-looking statements, including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships and our expectations regarding future industry trends and macroeconomic development.
    • All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date, and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made.
    • Please refer to our recent SEC filings, including the Form 10-Q for the quarter ended September 30th, 2011, for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department.
  • The Hadoop Data Processing Pipeline: Informatica PowerCenter + PowerExchange (diagram, slide 3). Legend: available today vs. 1H 2012. Pipeline steps: 1. Ingest Data into Hadoop; 2. Parse & Prepare Data on Hadoop; 3. Transform & Cleanse Data on Hadoop; 4. Extract Data from Hadoop, with PowerCenter + PowerExchange called out. Surrounding sources and targets include sales & marketing, customer service, data mart, portal, product & service offerings, marketing campaigns, customer profile, account transactions, social media, and logs & surveys.
  • Options (slide 4):
    • Structured data (e.g. OLTP, OLAP): Ingest/Extract Data with Informatica PowerCenter + PowerExchange or Sqoop; Parse & Prepare Data: N/A; Transform & Cleanse Data with Hive, PIG, MR (future: Informatica roadmap).
    • Unstructured and semi-structured data (e.g. web logs, JSON): Ingest/Extract Data with Informatica PowerCenter + PowerExchange, file copies, Flume, Scribe, or Kafka; Parse & Prepare Data with Informatica HParser, PIG/Hive UDFs, or MR; Transform & Cleanse Data with Hive, PIG, MR (future: Informatica roadmap).
  • Unleash the Power of Hadoop With High-Performance Universal Data Access (slide 5), a connectivity matrix grouped by source type:
    • Messaging and web services: WebSphere MQ, JMS, MSMQ, TIBCO, webMethods, Web Services
    • Packaged applications: SAP NetWeaver, SAP NetWeaver BI, SAP NetWeaver XI, JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, Siebel, SAS
    • Relational databases and flat files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
    • SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
    • Mainframe and midrange: ADABAS, VSAM, Datacom, C-ISAM, DB2, IMS, IDMS, binary flat files, tape formats…
    • Industry standards: EDI-X12, EDI-Fact, FIX, RosettaNet, Cargo IMP, AST, MVR, HL7, HIPAA
    • Unstructured data and files: Word, Excel, PDF, StarOffice, WordPerfect, flat files, ASCII reports, HTML, RPG, email (POP, IMAP), HTTP, LDAP
    • XML standards: XML, ebXML, LegalXML, HL7 v3.0, IFX, ACORD (AL3, XML), ANSI, cXML
    • MPP appliances: EMC/Greenplum, AsterData, Vertica
    • Social media: Facebook, LinkedIn, Twitter
  • Ingest Data (slide 6): Access Data -> Pre-Process -> Ingest Data. Sources (web server, databases, data warehouse, message queues, email, social media, ERP, CRM, mainframe) are accessed by PowerExchange (batch, CDC, real-time), pre-processed in PowerCenter (e.g. filter, join, cleanse), and ingested into HDFS and Hive; PowerCenter mappings can be reused.
  • Extract Data (slide 7): Extract Data -> Post-Process -> Deliver Data. Data in HDFS is post-processed in PowerCenter (e.g. transformed to the target schema) and delivered by PowerExchange in batch to databases, data warehouses, ERP, CRM, and mainframe targets; PowerCenter mappings can be reused.
  • (slide 8) 1. Create an ingest or extract mapping; 2. Create a Hadoop connection; 3. Configure the workflow; 4. Create and load into a Hive table.
  • The Hadoop Data Processing Pipeline: Informatica HParser (diagram, slide 9): the same four-step pipeline, with HParser called out for step 2, Parse & Prepare Data on Hadoop.
  • Options (slide 10): the same ingest/parse/transform options matrix as slide 4.
  • Informatica HParser Productivity: Data Transformation Studio (slide 11)
  • Informatica HParser Productivity: Data Transformation Studio (slide 12): out-of-the-box transformations for financial, insurance, B2B, and healthcare standards, including SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0, Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO, DTCC-NSCC, ACORD-AL3, ACORD XML, UN/EDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI, PLMXML, NIEM, HL7, HL7 V3, HIPAA, NCPDP, CDISC, and IATA-PADIS. Transformations cover all messages in all versions, with easy, example-based visual enhancements and edits, updates and new versions delivered from Informatica, definitions expressed in business (industry) terminology, and enhanced validations.
  • Informatica HParser: how does it work? (slide 13). On a Hadoop cluster (service repository, HDFS), invoked roughly as: hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt. 1. Develop an HParser transformation; 2. Deploy the transformation; 3. Run HParser on Hadoop to produce tabular data; 4. Analyze the data with HIVE / PIG / MapReduce / other.
  • The Hadoop Data Processing Pipeline: Informatica Roadmap (diagram, slide 14): the same four-step pipeline, with the Informatica roadmap items (1H 2012) called out.
  • Options (slide 15): the same ingest/parse/transform options matrix as slide 4.
  • Informatica Hadoop Roadmap – 1H 2012 (slide 16):
    • Process data on Hadoop: IDE, administration, monitoring, workflow; data processing flows designed through the IDE (source/target, filter, join, lookup, etc.); execution on the Hadoop cluster (pushdown via Hive).
    • Flexibility to plug in custom code: Hive and PIG UDFs; MR scripts.
    • Productivity with optimal performance: exploit Hive performance characteristics; optimize the end-to-end data flow for performance.
  • Mapping for Hive execution (slide 17): a logical representation of the processing steps; validate and configure the source for Hive translation; preview the generated Hive code, for example:
    INSERT INTO STG0 SELECT * FROM StockAnalysis0;
    INSERT INTO STG1 SELECT * FROM StockAnalysis1;
    INSERT INTO STG2 SELECT * FROM StockAnalysis2;
    (A hand-written sketch of this create-and-load pattern appears after the transcript.)
  • Takeaways (slide 18):
    • Universal connectivity: completeness and enrichment of raw data for holistic analysis; prevent Hadoop from becoming another silo accessible to only a few experts.
    • Maximum productivity: collaborative development environment; the right level of abstraction for data processing logic; re-use of algorithms and data flow logic; metadata-driven processing; documented data lineage for auditing and impact analysis; deploy on any platform for optimal performance and utilization.
  • Customer Sentiment: reaching beyond NPS (Net Promoter Score) and surveys (slide 19). Gaining insight into our customers’ sentiment will improve Rackspace’s ability to provide Fanatical Support™. Objectives:
    • What are “they” saying?
    • Gauge the level of sentiment
    • Fanatical Support™ for the win: increase NPS, increase MRR, decrease churn, provide the right products, keep our promises
  • Customer Sentiment Use Cases: pulling it all together (slide 20).
    • Case 1: Match social media posts with a customer; determine a probable match.
    • Case 2: Determine the sentiment of a post by searching key words and scoring the post. (A toy illustration of this kind of keyword scoring appears after the transcript.)
    • Case 3: Determine correlations between posts, ticket volume, and NPS that lead to negative or positive sentiments.
    • Case 4: Determine correlations between sentiments and products/configurations which lead to negative or positive sentiments.
    • Case 5: The ability to trend all inputs over time…
  • Rackspace Fanatical Support™ Big Data Environment (diagram, slide 21). Data sources (DBs, flat files, data streams): Oracle, MySQL, MS SQL, Postgres, DB2, Greenplum DB, Excel, CSV, flat files, XML, EDI, binary, sys logs, and a message bus / port listening feed Hadoop HDFS through messaging APIs. Analytics happen two ways: indirect analytics over Hadoop through a BI stack (BI analytics), and direct analytics over Hadoop (search, analytics, algorithmic).
  • Twitter Feed for Rackspace Using Informatica (slide 22): example input data and output data.
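Slides 8 and 17 describe creating and loading a Hive table and pushing mapping logic down as generated Hive statements (INSERT INTO STG0 SELECT ...). The following is a rough, hand-written sketch of that create-and-load pattern, not the code PowerCenter generates; the table names, columns, HDFS path, and filter predicate are assumptions, and it assumes the hive command-line client is installed.

```python
# Sketch of "create & load into Hive table" followed by a staged INSERT,
# loosely mirroring the generated-Hive pattern shown on slide 17.
# Table names, columns, paths, and the filter predicate are placeholders.
import subprocess

HQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs_parsed (
  ip STRING, ts STRING, method STRING, url STRING, status INT, bytes BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/parsed/weblogs';

CREATE TABLE IF NOT EXISTS weblogs_errors (ip STRING, url STRING, status INT);

-- Staging step analogous to the generated "INSERT INTO STG0 SELECT ..." code.
INSERT OVERWRITE TABLE weblogs_errors
SELECT ip, url, status FROM weblogs_parsed WHERE status >= 500;
"""

# Submit the script through the Hive command-line client.
subprocess.run(["hive", "-e", HQL], check=True)
```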
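Use case 2 on slide 20 scores a post's sentiment by searching for key words. As a toy illustration of that idea only, here is a small keyword scorer; the word lists and weights are invented for the example and are not Rackspace's actual model.

```python
# Toy keyword-based sentiment scorer in the spirit of use case 2.
# The keyword lists and weights below are made up for illustration.
POSITIVE = {"great": 1, "fast": 1, "fanatical": 2, "love": 2}
NEGATIVE = {"down": -1, "slow": -1, "outage": -2, "hate": -2}

def score_post(text):
    """Sum keyword weights over the lower-cased, punctuation-stripped tokens."""
    score = 0
    for token in text.lower().split():
        token = token.strip(".,!?#@")
        score += POSITIVE.get(token, 0) + NEGATIVE.get(token, 0)
    return score

print(score_post("Love the fanatical support, ticket resolved fast!"))    # > 0, positive
print(score_post("Site has been down all morning and support is slow."))  # < 0, negative
```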