Data Analysis using Data Flux


  1. DATA ANALYSIS USING DATA FLUX (from Sunil Pai)
  2. TYPICAL USAGE - CUSTOMER DATA OPERATIONS

     • Data de-duping
     • Data standardization
     • Data analysis and data profiling
     • Data consolidation from various sources
     • Comparing multiple data sets against predefined parameters
     • Inserting data into target databases
     • At-a-glance match reports for various new acquisitions
  3. DF NODE - DATA INPUTS

     DF can read from various input sources: relational databases (using queries), Excel files, Access files, and text files. These sources are connected via ODBC.

     Examples:

     • Database or Access file: select the database or Access file in the node properties and enter a query in the SQL Query node.
     • Excel: use the Data Source Input node. The area to read must first be defined as a named range with Name Manager, under the Formulas tab in the Excel sheet.
  4. DF NODE - DATA OUTPUTS

     DF can write a job or result output to Excel, Access, a text file, or a relational database such as Oracle or SQL Server. DF uses the Insert/Update/Target/Output utilities for the data output stage.

     Examples:

     • The output can be inserted directly into a database table with the Data Target Insert node.
     • The output can also be written to a text file with the Text File Output node.
  5. DF NODE – QUALITY

     • Standardization

     dfPower Architect's Standardization node is used to make similar items the same.

     The available standardization definitions include Name, Address, Organization, Zip, Phone, Email Address, Country, State, Non-Alphanumeric Remover, Numeric Remover, Alphanumeric Remover, Space Remover, Quotation Remover, etc.

     Various schemas can also be selected; these are defined in the DataFlux QKB (Quality Knowledge Base).

     For example: using full company names instead of initials ("International Business Machines" vs. "IBM").
  6. DF NODE – QUALITY

     • Standardization (more examples): addresses

     • "1 Comcast Center" to "1 Comcast Ctr"
     • "10 Glenlake Pkwy north east" to "10 Glenlake Pkwy NE"
     • "North Dakota" vs. "ND"
     • "United States" vs. "USA"
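
The idea behind these standardization examples can be sketched in a few lines of Python. This is only an illustration of scheme-based replacement, not DataFlux's implementation: the `ADDRESS_SCHEME` table and the `standardize_address` function are hypothetical names, and a real QKB scheme would be much larger and locale-aware.

```python
import re

# Hypothetical lookup table (an assumption for illustration only).
ADDRESS_SCHEME = {
    "center": "Ctr",
    "parkway": "Pkwy",
    "north east": "NE",
    "northeast": "NE",
}

def standardize_address(address, scheme=ADDRESS_SCHEME):
    """Replace each known phrase with its standard form, longest phrase first."""
    result = address
    for phrase in sorted(scheme, key=len, reverse=True):
        # Match whole words only, ignoring case.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, scheme[phrase], result, flags=re.IGNORECASE)
    return result

print(standardize_address("1 Comcast Center"))
print(standardize_address("10 Glenlake Pkwy north east"))
```

Processing the longest phrases first keeps a multi-word entry such as "north east" from being split up by a shorter entry.
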
  7. DF NODE – QUALITY

     • Parsing

     dfPower Architect's Parsing node is a simple but intelligent tool for separating multi-part field values into multiple, single-part fields. For example, if you have a Name field that includes the value "Mr. Igor Bela Bonski III, Esq.," you can use parsing to create six separate fields:

     • Name Prefix: "Mr."
     • Given Name: "Igor"
     • Middle Name: "Bela"
     • Family Name: "Bonski"
     • Name Suffix: "III"
     • Name Appendage: "Esq."
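
A rough Python sketch of this kind of name parsing follows. It is a simplified, rule-based stand-in, not the Parsing node's actual logic: the `PREFIXES`, `SUFFIXES`, and `APPENDAGES` vocabularies and the `parse_name` function are assumptions made for the example.

```python
# Hypothetical vocabularies; a real parse definition would be far richer.
PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii", "iv"}
APPENDAGES = {"esq.", "phd", "md"}

def parse_name(full_name):
    """Split a full name into six single-part fields using simple token rules."""
    tokens = full_name.replace(",", " ").split()
    parsed = {
        "Name Prefix": "", "Given Name": "", "Middle Name": "",
        "Family Name": "", "Name Suffix": "", "Name Appendage": "",
    }
    if tokens and tokens[0].lower() in PREFIXES:
        parsed["Name Prefix"] = tokens.pop(0)
    if tokens and tokens[-1].lower() in APPENDAGES:
        parsed["Name Appendage"] = tokens.pop()
    if tokens and tokens[-1].lower() in SUFFIXES:
        parsed["Name Suffix"] = tokens.pop()
    if tokens:
        parsed["Given Name"] = tokens.pop(0)
    if tokens:
        parsed["Family Name"] = tokens.pop()
    # Whatever remains between given and family name is treated as middle name.
    parsed["Middle Name"] = " ".join(tokens)
    return parsed

parsed = parse_name("Mr. Igor Bela Bonski III, Esq.")
for field, value in parsed.items():
    print(f"{field}: {value}")
```
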
  8. DF NODE – INTEGRATION

     • Match Codes

     dfPower Architect's Match Codes node is used to identify duplicate records in your data. These steps create match codes that evaluate the quantity of duplicate fields in your data and eliminate the extra fields. Match code sensitivity can be set from 50% (lowest) to 100% (exact), and various schemas can be selected.

     | FieldName     | Definition          | Sensitivity                     |
     |---------------|---------------------|---------------------------------|
     | AccountName   | BusinessTitle       | 85%                             |
     | Address_Line1 | Address/AddressLong | 85%                             |
     | City          | City                | Exact-All, Exact-10 characters  |
     | Country       | Country             | Exact-All, Exact-10 characters  |
  9. DF NODE – INTEGRATION

     • Clustering

     dfPower Architect's Data Clustering node uses the clustering functionality to group matching duplicates, or sets of unique records, according to the conditions defined. See the cluster numbers in the example below.

     | Cluster | AccountName                             | AccountAddress1 | MatchCriteria              |
     |---------|-----------------------------------------|-----------------|----------------------------|
     | 7231    | NewJerseyManufacturersInsuranceCompany  | 301SullivanWay  | ExactCompanyName+Address-1 |
     | 7231    | NewJerseyManufacturersInsuranceCompany  | 301SullivanWay  | ExactCompanyName+Address-1 |
     | 7231    | NewJerseyManufacturersInsuranceCompany  | 301SullivanWay  | ExactCompanyName+Address-1 |
     | 7663    | Metlife,Incorporated                    | 27-01QueensPlzN | ExactCompanyName+Address-1 |
     | 7663    | Metlife,Incorporated                    | 27-01QueensPlzN | ExactCompanyName+Address-1 |
     | 7791    | EatonCorporation                        | 34899CurtisBlvd | ExactCompanyName+Address-1 |
     | 7791    | EatonCorporation                        | 34899CurtisBlvd | ExactCompanyName+Address-1 |
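
Match codes and clustering work together: records whose match codes agree end up in the same cluster. The Python sketch below shows that relationship under strong simplifications. The `match_key` function is a crude stand-in (uppercase, strip punctuation and spaces) and does not reproduce DataFlux's sensitivity-based match codes; `cluster_records` and the sample rows, which add spacing and case variations to values from the slide, are likewise assumptions for illustration.

```python
import re
from collections import defaultdict

def match_key(value):
    """Crude stand-in for a match code: uppercase and drop non-alphanumerics."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())

def cluster_records(records, fields):
    """Assign the same cluster number to records whose keys agree on all fields."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(match_key(record[f]) for f in fields)
        groups[key].append(record)
    clustered = []
    for cluster_id, members in enumerate(groups.values(), start=1):
        for record in members:
            clustered.append({**record, "Cluster": cluster_id})
    return clustered

# Illustrative rows only (variations added for the example).
records = [
    {"AccountName": "Metlife, Incorporated", "AccountAddress1": "27-01 Queens Plz N"},
    {"AccountName": "Eaton Corporation", "AccountAddress1": "34899 Curtis Blvd"},
    {"AccountName": "METLIFE INCORPORATED", "AccountAddress1": "27-01 Queens Plz. N"},
]
clustered = cluster_records(records, ["AccountName", "AccountAddress1"])
for row in clustered:
    print(row["Cluster"], row["AccountName"], row["AccountAddress1"])
```
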
  10. DF NODE – INTEGRATION

      • Surviving Record Identification

      dfPower Architect's Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster. This process lets you eliminate duplicate information in a data source. The surviving record is identified using one or more user-configurable record rules. The user may also enter field rules to perform automated field-level edits of the surviving record's data during SRI processing. The SRI step can be configured to keep all existing data, marking the surviving records with a flag or primary key value, or it can remove all data except for that associated with the surviving records.

      Example: suppose the system holds a set of duplicate accounts and addresses, and you need to keep one distinct record out of those duplicates, but the record kept should have a proper phone number. You can use the SRI node and define a selection rule in the properties of the node. See the example on the next slide.
  11. DF NODE – INTEGRATION

      • Surviving Record Identification

      Example (continued): see the Cluster column and the SurvivingRecord column below. Each cluster has only one surviving record.

      | Cluster | AccountName                            | AccountAddress1 | Phone          | SurvivingRecord |
      |---------|----------------------------------------|-----------------|----------------|-----------------|
      | 7231    | NewJerseyManufacturersInsuranceCompany | 301SullivanWay  | (609) 883-1300 | TRUE            |
      | 7231    | NewJerseyManufacturersInsuranceCompany | 301SullivanWay  | Null           | FALSE           |
      | 7231    | NewJerseyManufacturersInsuranceCompany | 301SullivanWay  | 987            | FALSE           |
      | 7663    | Metlife,Incorporated                   | 27-01QueensPlzN | 1-800-638-5000 | TRUE            |
      | 7663    | Metlife,Incorporated                   | 27-01QueensPlzN | Null           | FALSE           |
      | 7791    | EatonCorporation                       | 34899CurtisBlvd | 1-900-735-5674 | TRUE            |
      | 7791    | EatonCorporation                       | 34899CurtisBlvd | Null           | FALSE           |
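
A minimal Python sketch of one possible record rule is shown here. It assumes that "proper phone number" can be approximated as "the phone with the most digits", which is an assumption for this example and not how the SRI node defines its rules; `phone_digits` and `flag_survivors` are hypothetical helper names. The sample rows are the first two clusters from the table above.

```python
import re

def phone_digits(phone):
    """Count the digits in a phone value; None counts as zero."""
    return len(re.sub(r"\D", "", phone or ""))

def flag_survivors(records):
    """Mark one record per cluster: the one whose phone has the most digits.

    Ties go to the first record seen in the cluster.
    """
    best = {}  # cluster -> (digit count, record index)
    for index, record in enumerate(records):
        score = phone_digits(record["Phone"])
        current = best.get(record["Cluster"])
        if current is None or score > current[0]:
            best[record["Cluster"]] = (score, index)
    winners = {index for _, index in best.values()}
    return [{**r, "SurvivingRecord": i in winners} for i, r in enumerate(records)]

records = [
    {"Cluster": 7231, "Phone": "(609) 883-1300"},
    {"Cluster": 7231, "Phone": None},
    {"Cluster": 7231, "Phone": "987"},
    {"Cluster": 7663, "Phone": "1-800-638-5000"},
    {"Cluster": 7663, "Phone": None},
]
flagged = flag_survivors(records)
for row in flagged:
    print(row["Cluster"], row["Phone"], row["SurvivingRecord"])
```
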
  12. DF MATCH EXAMPLES

      • Standardization and match codes combined in a job flow give remarkable results, as shown below.

      Exact or 100% match results:

      | Input-COMPANYNAME                 | Matched/OutputCompanyName         | ADDRESS1(Input)              | ADDR(Matched)              |
      |-----------------------------------|-----------------------------------|------------------------------|----------------------------|
      | NetscapeCommunicationsCorporation | NetscapeCommunicationsCorporation | 501EMiddlefieldRd            | 501EMiddlefieldRd          |
      | Alston&BirdLLP                    | Alston&BirdLLP                    | 1201WPeachtreeSt             | 1201WPeachtreeSt           |
      | GeorgiaPerimeterCollege           | GeorgiaPerimeterCollege           | 3251PanthersvilleRd          | 3251PanthersvilleRd        |
      | CountyofOneida                    | CountyofOneida                    | 800ParkAve                   | 800ParkAve                 |
      | EliLillyandCompany                | EliLillyandCompany                | POBox6034                    | POBox6034                  |
      | ActuateCorporation                | ActuateCorporation                | 2207BridgepointePkwy.Ste.500 | 2207BridgepointePkwySte500 |
      | ShrinersHospitalsForChildren      | ShrinersHospitalsForChildren      | 3551NBroadSt                 | 3551NBroadSt               |
      | CatholicHealthInitiatives         | CatholicHealthInitiatives         | 440CreameryWay               | 440CreameryWay             |
      | ElPasoElectricCompany             | ElPasoElectricCompany             | 123WMillsAve                 | 123WMillsAve               |
  13. DF MATCH EXAMPLES

      • 75% match results

      | Input-Name                           | MatchedName                                 | Input-ADDRESS              | Matched-Address            |
      |--------------------------------------|---------------------------------------------|----------------------------|----------------------------|
      | ArizonaStateUniversity               | ArizonaStateUniversity                      | UniversityDrandalsoMillAve | UniversityDrive&MillAvenue |
      | CybernetSoftwareSystems,Inc.         | CybernetSoftwareSystemsIncorporated         | 3031TischWaySte.1002       | 3031TischWay               |
      | VertrueInc.                          | VertrueIncorporated                         | 20GloverAve.               | 20GloverAve                |
      | DollarBank,FSB                       | DollarBank                                  | 3GatewayCenter             | 3GatewayCenter8East        |
      | TextronInc.                          | TextronIncorporated                         | 40WestminsterStreet        | 40WestminsterSt            |
      | ArcherTechnologies                   | ArcherTechnologiesLLC                       | 13200Metcalf,Suite300      | 13200MetcalfAve            |
      | BMWFinancialServicesNA               | BMWFinancialServicesNAIncorporated          | 5515ParkCenterCircle       | 5515ParkcenterCir          |
      | GreatAmericanFinancialResources,Inc. | GreatAmericanFinancialResourcesIncorporated | 250E.5thSt.                | 250E5thSt                  |
      | CecEntertainment,Inc.                | CECEntertainmentIncorporated                | 4441WAirportFreeway        | 4441WAirportFwy            |
  14. DF MATCH EXAMPLES

      • Loose and tight contact matches (see the email addresses)

      100% matches:

      | EMAILADDRESS(InputSource)       | EMAIL_ADDRESS(Matched)          | NAME(InputSource) | FIRST_NAME-MatchedOutput |
      |---------------------------------|---------------------------------|-------------------|--------------------------|
      | adam.fenech@priorityhealth.com  | adam.fenech@priority-health.com | AdamFenech        | AdamFenech               |
      | braddpiontek@alliant-energy.com | braddpiontek@alliantenergy.com  | BraddPiontek      | BraddPiontek             |

      | EMAILADDRESS-Input           | CONTACT_EMAIL_ADDRESS-Matched | NAME-Input     | CONTACT_FIRST_NAME-Matched |
      |------------------------------|-------------------------------|----------------|----------------------------|
      | brent.alexander@cingular.com | brentalexander@cingular.com   | BrentAlexander | BrentAlexander             |
      | chris.sims@fiserv.com        | chris.sims@fiserv.com         | ChrisSims      | ChrisSims                  |
  15. DF NODE – UTILITIES

      • Data Joining

      This node is used to join data from various sources, such as two different databases, Excel files, or Access files. dfPower Architect's Data Joining job flow step is based on the SQL concept of JOIN. You can use Data Joining to combine two data sets in an intelligent way, so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
  16. DF NODE – UTILITIES

      • SQL Lookup

      SQL Lookup lets the user find rows in a database table that have one or more fields matching those in the job flow. It offers a clear performance advantage, especially with large databases, because the large database is not copied to the local hard drive in order to perform the operation (as is the case with joins).
  17. DF NODE – UTILITIES

      • SQL Execute

      This is a stand-alone node (no parents or children) that lets you construct and execute any valid SQL statement (or series of statements). It performs database-specific tasks before, after, or in between Architect job flows.

      Example: SQL statements such as UPDATE, DELETE, or COMMIT for a particular table can be used in this node.
  18. DF NODE – UTILITIES

      • Data Union

      dfPower Architect's Data Union node is based on the SQL concept of UNION. As with Data Joining, use the Data Union node to combine data from two data sets. Unlike Data Joining, however, Data Union does not perform an intelligent combination. Rather, Data Union simply adds the two data sets together; the resulting data set contains one record for each record in each of the original data sets.

      Example: when data from two or more sheets, databases, or DF job flows needs to be combined, this node performs the task.
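
The difference between joining and union can be sketched in plain Python. This is an illustration of the two concepts only: `inner_join` and `union_all` are hypothetical helpers, the sample rows are made up, and the join shown is an inner join, which is just one of the bases a join can use. Since the union described above keeps one record for each input record, the sketch behaves like SQL's UNION ALL (no duplicate removal).

```python
def inner_join(left, right, key):
    """Combine rows from both data sets where the key values are equal."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def union_all(left, right):
    """Append one data set to the other, keeping every record."""
    return list(left) + list(right)

# Made-up rows for illustration.
accounts = [
    {"id": 1, "name": "Eaton Corporation"},
    {"id": 2, "name": "Actuate Corporation"},
]
phones = [{"id": 1, "phone": "555-0100"}]

joined = inner_join(accounts, phones, "id")
combined = union_all(accounts, phones)
print(joined)
print(len(combined))
```
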
  19. DF NODE – UTILITIES

      • Branch

      This step lets multiple children (up to 32) simultaneously access data from a single source. Depending on the step's configuration and the children's access patterns, data can be passed from the parent directly to each of the children, or it may be temporarily stored in memory and/or disk caches before being passed to the children. In other words, the node has one input and multiple outputs (maximum 32).
  20. DF NODE – UTILITIES

      • Concatenate

      dfPower Architect's Concatenate node performs the opposite function of the Parse node. Rather than separate a single field into multiple fields, Concatenate combines one or more fields into a single field.

      Example:

      • Prefix: Mr
      • First Name: Rahul
      • Last Name: Jain
      • Concatenate output: Mr Rahul Jain
  21. DF NODE – UTILITIES

      • Expression

      Use dfPower Architect's Expression node to run a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio. The Expression language provides many statements, functions, and variables for manipulating data.

      Example: creating a column Match_Criteria in the middle of a job flow. The syntax would be:

      • Pre-Processing Expression: string Match_Criteria
      • Expression: Match_Criteria = " "
  22. DF NODE – UTILITIES

      • Data Sorting

      Use dfPower Architect's Data Sorting node to re-order your data set (ascending or descending) at any point in a job flow.
  23. DF NODE – PROFILING

      • Basic Statistics

      dfPower Architect's Basic Statistics node is used to calculate statistics about your data, such as value ranges, counts, or sums for any given field. The Basic Statistics node is typically used on numerical rather than text fields. However, statistics such as Count, Missing, MAX, and MIN can be useful on any field type. The node can also be used in the middle of a job for fault finding, by checking the counts at each step.

      Example: basic statistics of a Siebel table

      |            | Row_Id | Created | Created_By | Account Name | Partner Flag | Email Addr | Phone  | CSN    |
      |------------|--------|---------|------------|--------------|--------------|------------|--------|--------|
      | Records    | 267413 | 267413  | 267413     | 267413       | 267413       | 267413     | 267413 | 267413 |
      | Count      | 267413 | 267413  | 267413     | 267413       | 267413       | 5          | 72552  | 181643 |
      | Null Count | 0      | 0       | 0          | 0            | 0            | 267408     | 194861 | 85770  |
      | Distinct   | yes    | yes     | yes        | yes          | no           | yes        | yes    | yes    |

      Min: 1 | 0-5200 | 1/1/1980 0:00 | 0-1 | N | dllee@pentasoft.co.kr | ###iswrong | 1

      Max: 1 | O-2 | 9/9/2010 21:55 | 1-XVOET | ültje GmbH | Y | tloughran@infopath.net | xxxxxxxxx
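
The statistics in the table can be approximated per field with a short Python sketch. This is not the Basic Statistics node itself: `basic_statistics` is a hypothetical helper, the sample rows are made up, and treating both `None` and the empty string as null is an assumption.

```python
def basic_statistics(rows, field):
    """Compute record count, non-null count, null count, min, and max for one field."""
    values = [row.get(field) for row in rows]
    # Assumption: None and "" both count as null.
    present = [v for v in values if v not in (None, "")]
    return {
        "Records": len(values),
        "Count": len(present),
        "Null Count": len(values) - len(present),
        "Min": min(present) if present else None,
        "Max": max(present) if present else None,
    }

# Made-up rows for illustration.
rows = [
    {"Partner Flag": "N", "Phone": None},
    {"Partner Flag": "Y", "Phone": "1-800-638-5000"},
    {"Partner Flag": "N", "Phone": ""},
]
flag_stats = basic_statistics(rows, "Partner Flag")
phone_stats = basic_statistics(rows, "Phone")
print(flag_stats)
print(phone_stats)
```

Comparing the "Count" value before and after a step is one simple way to do the fault finding mentioned above.
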
  24. DF NODE – PROFILING

      • Pattern Analysis

      dfPower Architect's Pattern Analysis node is used to generate a new field containing alphanumeric patterns that represent each value in a selected field. You can specify whether these patterns represent each character or each word (as separated by spaces) in a field.
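
Character-level and word-level patterns can be illustrated with the Python sketch below. The pattern symbols used here ("A" and "a" for letters, "9" for digits, "*" for mixed words) are assumptions chosen for the example; the notation the Pattern Analysis node actually emits may differ, and `char_pattern` and `word_pattern` are hypothetical names.

```python
def char_pattern(value):
    """Map each character to a symbol: 9 for digits, A/a for letters, others kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)

def word_pattern(value):
    """Map each space-separated word to a symbol: 9 numeric, A alphabetic, * mixed."""
    symbols = []
    for word in value.split():
        if word.isdigit():
            symbols.append("9")
        elif word.isalpha():
            symbols.append("A")
        else:
            symbols.append("*")
    return " ".join(symbols)

print(char_pattern("(609) 883-1300"))
print(word_pattern("301 Sullivan Way"))
```

Grouping a column by its pattern is a quick way to spot values that do not follow the dominant format, such as a three-digit phone number.
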
  25. DF NODE – PROFILING

      • Frequency Distribution

      dfPower Architect's Frequency Distribution node is used to calculate the number of occurrences of each unique value in a field. For example, Frequency Distribution can determine how many customers in your customer database are in each of the 50 US states, the District of Columbia, and the 13 Canadian provinces and territories.

      | State | Count of Customers | % Total |
      |-------|--------------------|---------|
      | CA    | 19593              | 12      |
      | CO    | 4041               | 2       |
      | CT    | 2807               | 1       |
      | DC    | 2555               | 1       |
      | DE    | 746                | 0       |
      | FL    | 7105               | 4       |
      | GA    | 5198               | 3       |
      | GE    | 1                  | 0       |

      | GEO         | GEO_count | GEO % |
      |-------------|-----------|-------|
      | Americas    | 187235    | 57    |
      | AsiaPacific | 30642     | 9     |
      | EMEA        | 107412    | 33    |
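
A frequency distribution of this kind is straightforward to sketch with Python's standard `collections.Counter`. The `frequency_distribution` helper and the sample values are assumptions for illustration; percentages are rounded to whole numbers, which is only a guess at how the tables above were produced.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (value, count, percent of total) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, count, round(100 * count / total))
        for value, count in counts.most_common()
    ]

# Made-up sample values.
states = ["CA", "CA", "CO", "CA", "FL", "CO"]
distribution = frequency_distribution(states)
for value, count, percent in distribution:
    print(value, count, percent)
```

A value with a count of 1, like "GE" in the state table, often points to a data-entry error worth checking.
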
  26. DF NODE – PROFILING

      • Data Validation

      dfPower Architect's Data Validation node is used to analyze the content of data by setting validation conditions. These conditions create validation expressions that you can use to filter data for a more accurate view of that data.
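
Filtering by validation conditions can be sketched as follows. The conditions here are ordinary Python functions, not the node's validation expressions, and `validate`, `has_email`, and `has_full_phone` are hypothetical names with deliberately simple rules.

```python
import re

def has_email(row):
    """Assumed rule: the email field must contain an @ sign."""
    return "@" in (row.get("Email") or "")

def has_full_phone(row):
    """Assumed rule: the phone field must contain at least 10 digits."""
    return len(re.sub(r"\D", "", row.get("Phone") or "")) >= 10

def validate(rows, conditions):
    """Split rows into those passing every condition and those failing any."""
    valid, invalid = [], []
    for row in rows:
        target = valid if all(check(row) for check in conditions) else invalid
        target.append(row)
    return valid, invalid

# Made-up rows for illustration.
rows = [
    {"Email": "chris.sims@fiserv.com", "Phone": "(609) 883-1300"},
    {"Email": None, "Phone": "987"},
]
valid, invalid = validate(rows, [has_email, has_full_phone])
print(len(valid), len(invalid))
```
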
  27. DF NODE – ENRICHMENT

      • Address Verification

      Use dfPower Architect's Address Verification (US/Canada/World) node to verify, correct, and enhance any addresses in your existing data (QKB). Address Verification (US/Canada/World) uses geographic information from various reference databases to match and standardize addresses. You can also use Address Verification (US/Canada) for proper casing and CASS/SERP compliance. Addresses are classified by the result codes listed on the next slides, which indicate the status of each address, i.e. how valid it is.
  28. DF NODE – ENRICHMENT

      • For US addresses

      | Text Result Code | Numeric Result Code | Description |
      |------------------|---------------------|-------------|
      | OK               | 0                   | Address was verified successfully. |
      | PARSE            | 11                  | Error parsing address. Components of the address may be missing. |
      | CITY             | 12                  | Could not locate city/state or ZIP in the USPS database. At least (city and state) or ZIP must be present in the input. |
      | MULTI            | 13                  | Ambiguous address. There were two or more possible matches for this address with differing data. |
      | NOMATCH          | 14                  | No matching address found in the USPS data. |
      | OVER             | 15                  | One or more input strings is too long (maximum 100 characters). |
  29. • For Canada addresses

      | Result Code | Description                   |
      |-------------|-------------------------------|
      | 0           | No error occurred             |
      | 1           | Internal error                |
      | 2           | Cannot load database          |
      | 3           | Invalid - unspecified reason  |
      | 4           | Invalid civic number          |
      | 5           | Invalid street                |
      | 6           | Invalid unit                  |
      | 7           | Invalid delivery mode         |
      | 8           | Invalid delivery installation |
      | 9           | Invalid city                  |
      | 10          | Invalid province              |
      | 11          | Invalid postal code           |
      | 12          | Address is not Canadian       |
  30. • Rest of world (excluding US and Canada)

      | Result Code | Description |
      |-------------|-------------|
      | 0           | Address correct as entered. |
      | 1           | Address corrected automatically. |
      | 2           | Address needs to be corrected, but could not |
      | 3           | Address needs to be corrected, but could not be determined automatically. There is a fair |
      | 4           | Address needs to be corrected, but could not be determined automatically. There is a small |
  31. DF NODE – MONITORING

      • Data Monitoring

      The Data Monitoring node enables you to analyze data according to business rules you create using the Business Rule Manager. The business rules you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an email alert, when a condition is detected. By using the Data Monitoring node, you can insert these business rules in your job flow to analyze data at various points in the flow.
