Enterprise Intelligence
Data Analysis, Extraction and Validation
Technology Partners:
Presenter
Pete Zybrick – Enterprise Solutions Architect
• Over 30 years of experience designing and delivering complex software solutions.
• From Bell Labs to BMW to Big Data, Pete has architected, managed, tested and implemented large-scale, mission-critical systems directly responsible for billions of dollars in annual transactions.
• As the leader of the Big Data technical programs within IPC Global, Pete is responsible for building a framework of collaboration between IPC Global and our technology partners, Cloudera and AWS.
• Cloudera Certified Apache Hadoop Developer
• Amazon Web Services Certified Developer - Associate
Objectives
• IPC Global Experience with Analysis and Extraction of Big Data
• Techniques and Tools to Enable Business Users to Rapidly Access Subsets of Large Datasets
• Validation and Import
• Not Covered: Presentation Capabilities
Applications
• Federal Reserve Economic Data (FRED2)
• SiteCatalyst aka Adobe Analytics
FRED
• Overview - http://research.stlouisfed.org/fred2/
• 240K Discrete Series – README file
• Business Case: Consumer Price Index
• FRED: http://research.stlouisfed.org/fred2/series/CPIAUCSL/
• QlikView: CPI demo - Single
FRED Raw Structure
• FRED data is discrete
    • Consumer Price Index for All Urban Consumers: All Items
    • Consumer Price Index for All Urban Consumers: Apparel
    • Consumer Price Index for All Urban Consumers: Energy
• Analytics: Groupings/Categories, Drill Downs
    • Consumer Price Index
        • All Urban Consumers
            • All Items
            • Apparel
            • Energy
FRED Programmatic Analysis
• “Live the Data”
• Programs/Tools/Spreadsheets to Iteratively Analyze
• Pattern Analysis, Parsing Rules – Split into Categories
• Iterative Distinct Values Used to Generate Reference Files – Separate Code from Data, Standardize Values (example: Countries)
• Parsing Example
    Consumer Price Index for All Urban Consumers: All items in Atlanta, GA
    Consumer Price Index for All Urban Consumers: Energy in Atlanta, GA
    Consumer Price Index for All Urban Consumers: All items less food and energy in Atlanta, GA
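The parsing example above can be sketched in Python as follows. This is only an illustration, not the code used by IPC Global: the rule "<index> for <population>: <item>[ in <location>]" is inferred from the three sample titles, and the names `SeriesCategories`, `TITLE_RULE`, `parse_title` and `distinct_locations` are hypothetical. Titles that do not fit the rule return `None` so they can be reviewed in the next iteration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class SeriesCategories(NamedTuple):
    """Categories split out of one FRED series title (hypothetical structure)."""
    index: str
    population: str
    item: str
    location: Optional[str]


# Assumed rule, inferred from the sample titles only:
#   "<index> for <population>: <item>[ in <location>]"
TITLE_RULE = re.compile(
    r"^(?P<index>.+?) for (?P<population>.+?): "
    r"(?P<item>.+?)(?: in (?P<location>[^:]+))?$"
)


def parse_title(title: str) -> Optional[SeriesCategories]:
    """Split a series title into categories, or return None if no rule matches."""
    m = TITLE_RULE.match(title.strip())
    if m is None:
        return None
    return SeriesCategories(m["index"], m["population"], m["item"], m["location"])


def distinct_locations(titles: Iterable[str]) -> List[str]:
    """Collect the distinct location values, e.g. as input for a reference file."""
    parsed = (parse_title(t) for t in titles)
    return sorted({p.location for p in parsed if p is not None and p.location})
```

A title whose item text itself contains " in " would be split at the first occurrence, which is the kind of exception that iterative review of distinct values is meant to surface.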
FRED Programmatic Extraction
• Implementation of Parsing Rules/Reference Files
• Definition of Common Table Structure (example: CREATE TABLE statements)
• Output Files in Format Suitable for Database Bulk Import
• Gzipped Tab-Separated Values (HDFS for Impala, S3 for Redshift)
• Bulk Import into Impala and Redshift
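A minimal sketch of producing a gzipped tab-separated file of the kind described above, using only the Python standard library. The function name `write_gzip_tsv`, the column layout, and the choice to write missing values as empty strings and to replace embedded tabs and newlines with spaces are assumptions for illustration; the null marker and escaping a real bulk import expects depend on the target database and its load options.

```python
import gzip
from typing import Iterable, Sequence


def to_tsv_line(row: Sequence[object]) -> str:
    """Render one row as a tab-separated line.

    Assumption: None becomes an empty field, and tabs/newlines inside a
    value are replaced by a space so the field count stays stable.
    """
    fields = []
    for value in row:
        text = "" if value is None else str(value)
        fields.append(text.replace("\t", " ").replace("\n", " "))
    return "\t".join(fields) + "\n"


def write_gzip_tsv(path: str, rows: Iterable[Sequence[object]]) -> int:
    """Write rows to a gzip-compressed TSV file and return the row count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
        for row in rows:
            fh.write(to_tsv_line(row))
            count += 1
    return count
```

The resulting file would then be copied to HDFS or S3 before the bulk import step; that transfer is outside this sketch.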
FRED Productivity Enhancement Tools/Techniques
• Programmatically Generated
    • Distinct Categories Spreadsheet
    • QlikView Load/Select Scripts
• Demo:
    • Find “Home Price Index (High Tier)” in Distinct Categories
    • Find File in Generated Scripts based on Row Number
    • Copy/Paste the Generated Load/Select into new Dashboard
    • Manually modify Select for Middle Tier and Low Tier, Update Dashboard
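The generated Load/Select scripts are not shown in the slides, so the following is only a guess at how such generation might look. It assumes a hypothetical `fred_series` table with `category`, `observation_date` and `value` columns, and a hypothetical file naming scheme in which the file number matches the row number in the distinct categories spreadsheet.

```python
from typing import Dict, Iterable


def build_load_script(label: str, category: str,
                      source_table: str = "fred_series") -> str:
    """Return a QlikView-style LOAD / SQL SELECT pair for one category.

    Table and column names are placeholders, not the actual schema.
    """
    escaped = category.replace("'", "''")  # escape quotes for the SQL literal
    return (
        f"{label}:\n"
        "LOAD observation_date, value;\n"
        "SQL SELECT observation_date, value\n"
        f"FROM {source_table}\n"
        f"WHERE category = '{escaped}';\n"
    )


def generate_scripts(categories: Iterable[str]) -> Dict[str, str]:
    """Map a file name derived from the row number to its script text."""
    scripts = {}
    for row_number, category in enumerate(categories, start=1):
        file_name = f"load_{row_number:04d}.qvs"
        scripts[file_name] = build_load_script(f"Series_{row_number:04d}", category)
    return scripts
```

With this layout, a user who finds a category on row 1 of the spreadsheet would open `load_0001.qvs`, paste its contents into a dashboard, and edit the `WHERE` clause by hand for a related category.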
SiteCatalyst – Overview
• ~12-15MM Main Rows/Day, up to 554 Columns/Row
• ~300MM Event Rows/Day (~24 per Main Row, 12 Before, 12 After)
• Data Falls Into Logical Groupings (e.g. general, video, mobile)
• Reference Table Lookups of Varying Complexity
• 2-3% Error Rate
• Example of Inbound Data
SiteCatalyst – Programmatic Analysis
• Created Spreadsheets With Top 2000 Distinct Values for Each Column, Reviewed Every Column
• Identified Data Type and Criteria (e.g. range) for Each Column
• Defined/Developed Data Validation Framework
• Iterative Application of Data Validation
• Identified Reference Table Lookups, Implemented Simple and Complex
• Defined/Developed Test Data Generator
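The "top distinct values per column" step can be illustrated with a small in-memory sketch. It is not the tooling used in the project: the function names are hypothetical, rows are assumed to arrive as sequences of strings in a known column order, and at the volumes quoted for SiteCatalyst the counting would need to run as a distributed job rather than in a single process.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def top_distinct_values(
    rows: Iterable[Sequence[str]],
    column_names: Sequence[str],
    top_n: int = 2000,
) -> Dict[str, List[Tuple[str, int]]]:
    """Count values per column and keep the top_n most frequent for review."""
    counters = {name: Counter() for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            counters[name][value] += 1
    return {name: c.most_common(top_n) for name, c in counters.items()}


def guess_type(values: Iterable[str]) -> str:
    """Very rough type guess: 'integer' if every non-empty value is an integer."""
    non_empty = [v for v in values if v != ""]
    if non_empty and all(re.fullmatch(r"-?\d+", v) for v in non_empty):
        return "integer"
    return "string"
```

The per-column lists could then be written to a spreadsheet, and the guessed type used as a starting point for the validation criteria of each column.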
SiteCatalyst – Data Validation and Table Framework
• Specify the Rules for Inbound Columns
• Hadoop Map Program to Process Each Inbound Row
• Hadoop Distributed Cache Containing Target Column-to-Table Mappings, Validation Rules and Reference Lookups
• Data Validation Performed Against All Inbound Columns
• Rejects Written to Error Table
• Reference Lookups
• Valid Data Written to Separate Tables Based on Category (e.g. main, video, mobile)
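The flow above (rules per column, rejects to an error table, valid data routed by category) can be sketched in plain Python as a stand-in for the per-row logic of the Hadoop map program. Everything here is illustrative: the column names, rules, and lookup table are invented, and rejecting the whole row when any column fails is an assumption, since the slides do not say whether rejects are per row or per column.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical reference lookup (in the real job this would come from the
# distributed cache): raw code -> standardized value.
COUNTRY_LOOKUP = {"US": "United States", "DE": "Germany"}


def is_digits(value: str) -> bool:
    return value.isdigit()


def is_optional_digits(value: str) -> bool:
    return value == "" or value.isdigit()


def in_country_lookup(value: str) -> bool:
    return value in COUNTRY_LOOKUP


# Hypothetical rules: inbound column -> (target table, validation function).
COLUMN_RULES: Dict[str, Tuple[str, Callable[[str], bool]]] = {
    "visitor_id": ("main", is_digits),
    "country": ("main", in_country_lookup),
    "video_seconds": ("video", is_optional_digits),
}


def process_row(row: Dict[str, str]) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Validate one inbound row.

    Returns (outputs, failed_columns). If any column fails, outputs is empty
    and the caller would write the row to the error table.
    """
    failed = [col for col, (_, is_valid) in COLUMN_RULES.items()
              if not is_valid(row.get(col, ""))]
    if failed:
        return {}, failed

    outputs: Dict[str, Dict[str, str]] = {}
    for col, (table, _) in COLUMN_RULES.items():
        outputs.setdefault(table, {})[col] = row.get(col, "")
    # Apply the reference lookup to standardize the country value.
    outputs["main"]["country"] = COUNTRY_LOOKUP[row["country"]]
    return outputs, []
```

Keeping the rules in a data structure rather than in code mirrors the idea of shipping mappings, validation rules and lookups to every mapper, so that a rule change does not require changing the program.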
Summary
• All Code/Processes Developed Internally by IPC Global
• Multiple Sources, Multiple Target Databases
• Iterative Analysis Utilizing Rapidly Coded Tools
• Iterative Analyze/Extract/Validate/Apply
• Performance Insight – Reference Tables, Caching
• Increased Access to Business Data

IPC Data Analysis and Extraction
