Nov 2011 HUG: HParser



Organizations are increasingly interested in more efficient ways to handle deeply hierarchical data such as XML and JSON, as well as other complex formats like web logs, binaries, and machine-generated data, in Hadoop.

How are you currently setting up data-parsing tasks inside MapReduce? Are you interested in native streaming and splitting capabilities that allow effective handling of files of any size, regardless of format? In this session we will introduce HParser, which is optimized for parallel parsing in Hadoop, and give a technical demonstration.
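To see why that setup work is painful today: without a parsing tool, each hierarchical format needs a hand-written mapper. A minimal sketch of that status quo, as a Hadoop Streaming mapper that flattens one JSON record per line into tab-separated fields (the field names `user`, `event`, and `ts` are hypothetical, chosen only for illustration):

```python
import json
import sys

# Columns to emit; these names are hypothetical, for illustration only.
FIELDS = ["user", "event", "ts"]

def flatten(line):
    """Turn one JSON record into a tab-separated row, or None if malformed."""
    try:
        rec = json.loads(line)
    except ValueError:
        return None  # a real job would count or reroute bad records
    return "\t".join(str(rec.get(f, "")) for f in FIELDS)

def run_mapper(lines, out=sys.stdout):
    """Streaming-mapper loop; in a real job this would be run_mapper(sys.stdin)."""
    for line in lines:
        row = flatten(line)
        if row is not None:
            print(row, file=out)
```

Every new format or schema change means revisiting code like this, which is the gap HParser is aimed at.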



  1. Ronen Schwartz, VP Products, B2B Data Exchange BU, Informatica. November 2011.
  2. [Diagram: Hadoop MapReduce flow: records → M (map) → R (reduce) → results]
  3. [Diagram build: the input records are now labeled "real-world data"]
  4. [Diagram build: the same real-world data flow, repeated]
  5. [Diagram build: an "80%" annotation is added to the real-world data flow]
  6. HParser UI: any format, any complexity, easily, in MapReduce. [Diagram build: the HParser UI feeds the MapReduce flow; the "80%" annotation remains]
  7. [Diagram build: a "5%" annotation is added next to the HParser UI, alongside the earlier "80%"]
  8. Demo: construction (Windows) and execution (Linux). [Diagram: binary records in → HParser UI transform definition → MapReduce → text records out]
  9. Real-world data handled by HParser: flat files, logs, XML, JSON, industry standards (e.g. FIX, SWIFT, X12, ASN.1), documents (e.g. PDF, Excel).
  10. Demo
  11. Informatica HParser: tackling the diversity of Big Data. The broadest coverage for Big Data: flat files and XML industry standards, documents, device/sensor data, scientific data, social network data. Productivity: visual parsing environment, predefined translations, fits any DI/BI architecture (PIG, EDW, MDM). Speaker notes: the underlying Data Transformation (DT) engine is thread-safe and fully re-entrant, and can be invoked in several ways: (1) from PowerCenter via the Unstructured Data Transformation (UDT); (2) from the command line; (3) directly through its APIs (C, C++, Java, .NET, web services); (4) embedded in middleware platforms (WBIMB, webMethods, BizTalk). A transformation is developed once in Studio and deployed as a service to a shared repository folder (directory); transformation files can be moved there via FTP, copy, etc. On the input side, the calling application can pass filenames or memory buffers to DT for processing; on the output side, DT can write to output files or return buffers to the caller, which removes the overhead of passing data between processes and does not require writing to the filesystem. Multiple threads can invoke the engine simultaneously to increase throughput, and PowerCenter partitioning can scale up processing as needed.
  12. Universal Data Transformation. Data formats (a subset of supported formats):
      • Unstructured: Microsoft Word, Microsoft Excel, PDF, PowerPoint, ASCII reports, HTML, EBCDIC, custom binaries, flat files, print streams, AFP, PostScript
      • Semi-structured: HL7, SWIFT, AL3, HIPAA, EDI-X12, EDI-Fact, FIX, NACHA, ASTM, Cargo IMP, COBOL, RPG, PL1, ANSI UCS, WINS, VICS, ASN.1
      • XML/JSON: ACORD XML, LegalXML, IFX, cXML, ebXML, HL7 V3.0, RosettaNet, ISO 20022, xBRL, other
  13. Universal Data Transformation. Productivity: Data Transformation Studio.
  14. Universal Data Transformation. Productivity: Data Transformation Studio.
      • Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0, Lockbox, CREST DEX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
      • Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML, IFX
      • B2B standards: UN/EDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI
      • Healthcare: HL7, HL7 V3, HIPAA, NCPDP
      • Other: IATA-PADIS, PLMXML, CDISC, NEIM
      • Out-of-the-box transformations for all messages in all versions; easy example-based visual enhancements and edits; updates and new versions delivered from Informatica; definitions are written in business (industry) terminology; enhanced validations
  15. HParser: how does it work?
      hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt (over HDFS)
      1. Develop a DT transformation
      2. Deploy the transformation
      3. Run HParser to produce tabular data
      4. Analyze the data with Hive / Pig / MapReduce / other
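Step 4 above is ordinary analytics once the data is tabular. As an illustration only (not part of HParser), here is a minimal Python stand-in for what a Hive or Pig query would do with the tab-separated output: count rows by a key column. The column layout is assumed, not taken from the deck.

```python
from collections import Counter

def count_by_column(tsv_lines, col=0):
    """Count occurrences of the value in column `col` across TSV rows."""
    counts = Counter()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > col:  # skip short/malformed rows
            counts[fields[col]] += 1
    return dict(counts)
```

In practice this aggregation would be a one-line HiveQL GROUP BY over the HParser output directory; the point is that once parsing is done, the downstream tooling is standard.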
  16. Example use case: trade data. Why Hadoop?
      • Trade data represents extremely large data sets
      • We do not know in advance which trade patterns we will want to investigate
      • It can be compared against other large data sets: Bloomberg, Reuters, NYSE
  17. Example use case: trade data. Why is handling FIX data complex?
      • Variable length
      • Variations
      • Name-value pairs
      • Proprietary tags
      • Meaningful tags
      • Yearly releases
      • Hierarchy
      • FIXML, the XML version of FIX
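To make the "name-value pair" and "tags" points concrete: a FIX message is a flat stream of `tag=value` fields separated by the SOH character (`\x01`), and the numeric tags only acquire meaning from the FIX specification, which changes with each release. A minimal illustrative parser (a sketch of the problem, not HParser's implementation):

```python
SOH = "\x01"  # FIX field delimiter

def parse_fix(msg):
    """Split a FIX message into an ordered list of (tag, value) pairs.
    Order is preserved because the same tag may legally repeat inside
    repeating groups."""
    pairs = []
    for field in msg.strip(SOH).split(SOH):
        tag, _, value = field.partition("=")
        pairs.append((tag, value))
    return pairs
```

For example, `parse_fix("8=FIX.4.2\x0135=D\x0155=IBM\x01")` yields the pairs for BeginString, MsgType, and Symbol; a real parser would still need the per-version tag dictionaries, repeating-group structure, and vendor-specific tags the slide alludes to.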
  18. Example use case: call detail records. Why Hadoop?
      • CDRs are large data sets: every 7 seconds, every mobile phone in the region creates a record
      • Desire to analyze behavior and location in order to personalize and optimize pricing and marketing
  19. Example use case: call detail records. Why is handling CDR data complex?
      • Binary format
      • Vendor variations
      • ASN.1
      • Switch software updates
      • Meaningful tags
      • Hierarchy
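To illustrate the "binary format" point: decoding a CDR means unpacking vendor-defined binary layouts, often ASN.1-encoded, that shift with switch software updates. The sketch below uses a deliberately simplified, hypothetical fixed-width layout rather than real ASN.1, just to show the flavor of the problem:

```python
import struct

# Hypothetical CDR layout (NOT a real vendor format): 8-byte caller id,
# 8-byte callee id, 4-byte unsigned call duration, all big-endian.
CDR_FORMAT = ">8s8sI"
CDR_SIZE = struct.calcsize(CDR_FORMAT)  # 20 bytes per record

def decode_cdr(record):
    """Unpack one fixed-width binary CDR into a dict of named fields."""
    caller, callee, duration = struct.unpack(CDR_FORMAT, record)
    return {
        "caller": caller.decode("ascii").rstrip("\x00"),
        "callee": callee.decode("ascii").rstrip("\x00"),
        "duration_s": duration,
    }
```

Real CDRs add variable-length ASN.1 structures and per-vendor quirks on top of this, which is why the slide treats them as a parsing problem rather than a query problem.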
  20. Example use case: proprietary logs. Why Hadoop?
      • Extremely large data sets
      • Information is often split across multiple files
      • We are not sure what we are looking for
  21. Example use case: proprietary logs. Why is handling proprietary logs complex?
      • The data is often hierarchical: flat files, JSON, XML
      • Data logic is mixed with business/context logic
      • Variations
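A common first step with such hierarchical logs is flattening the nesting into dotted column names so the records become tabular. A minimal sketch (the key names in the example are hypothetical, and real logs would also need the format variations handled):

```python
def flatten_record(rec, prefix=""):
    """Recursively flatten a nested dict into one level, joining nested
    keys with dots, e.g. {"b": {"c": 2}} becomes {"b.c": 2}."""
    flat = {}
    for key, value in rec.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_record(value, name + "."))
        else:
            flat[name] = value
    return flat
```

Once flattened, records from JSON, XML, or flat-file sources can share one tabular schema downstream, which is the shape Hive and Pig expect.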
  22. Thank you