IPC Data Analysis and Extraction

Enterprise Intelligence
Data Analysis,
Extraction and
Validation
Technology Partners:

Presenter
Pete Zybrick – Enterprise Solutions Architect
 Over 30 years of experience designing and delivering
complex software solutions.
 From Bell Labs to BMW to Big Data, Pete has architected,
managed, tested and implemented large scale mission
critical systems directly responsible for billions of dollars in
annual transaction.
 As the leader of the Big Data technical programs within IPC
Global, Pete is responsible for building a framework of
collaboration between IPC Global and our technology
partners, Cloudera and AWS.
 Cloudera Certified Apache Hadoop Developer
 Amazon Web Services Certified Developer - Associate
Data Analysis, Extraction and Validation

Objectives
• IPC Global Experience with Analysis and
Extraction of Big Data
• Techniques And Tools To Enable Business Users
To Rapidly Access Subsets Of Large Datasets
• Validation And Import
• Not Presentation Capabilities

• Federal Reserve Economic Database (FRED2)
• SiteCatalyst aka Adobe Analytics
Applications

• Overview - http://research.stlouisfed.org/fred2/
• 240K Discrete Series’ – README file
• Business Case: Consumer Price Index
• FRED: http://research.stlouisfed.org/fred2/series/CPIAUCSL/
• QlikView: CPI demo - Single
FRED

• FRED data is discrete
• Consumer Price Index for All Urban Consumers: All Items
• Consumer Price Index for All Urban Consumers: Apparel
• Consumer Price Index for All Urban Consumers: Energy
• Analytics: Groupings/Categories, Drill Downs
• Consumer Price Index
• All Urban Consumers
• All Items
• Apparel
• Energy
FRED Raw Structure

• “Live the Data”
• Programs/Tools/Spreadsheets to Iteratively Analyze
• Pattern Analysis, Parsing Rules – Split into Categories
• Iterative Distinct Values used to generate Reference Files – Separate Code
from Data, Standardize Values (example: Countries)
• Parsing Example
Consumer Price Index for All Urban Consumers: All items in Atlanta, GA
Consumer Price Index for All Urban Consumers: Energy in Atlanta, GA
Consumer Price Index for All Urban Consumers: All items less food and energy in Atlanta, GA
FRED Programmatic Analysis

• Implementation of Parsing Rules/Reference Files
• Definition of Common Table Structure (example: Create Table’s)
• Output Files in Format Suitable for Database Bulk Import
• GZip’ed Tab Separated Values (HDFS for Impala, S3 for Redshift)
• Bulk Import into Impala and Redshift
FRED Programmatic Extraction

• Programmatically Generated
• Distinct Categories Spreadsheet
• QlikView Load/Select Scripts
• Demo:
• Find “Home Price Index (High Tier)” in Distinct Categories
• Find File in Generated Scripts based on Row Number
• Copy/Paste the Generated Load/Select into new Dashboard
• Manually modify Select for Middle Tier and Low Tier, Update Dashboard
FRED Productivity Enhancement Tools/Techniques

• ~12-15MM Main Rows/Day, up to 554 Columns/Row
• ~300MM Event Rows/Day (~24 per Main Row, 12 Before, 12 After)
• Data Falls Into Logical Groupings (i.e. general, video, mobile, etc.)
• Reference Table Lookups Of Varying Complexity
• 2-3% Error Rate
• Example of Inbound Data
SiteCatalyst – Overview

• Created Spreadsheets With Top 2000 Distinct Values For Each
Column, Reviewed Every Column
• Identified Data Type and Criteria (i.e. range) For Each Column
• Defined/Developed Data Validation Framework
• Iterative Application of Data Validation
• Identified Reference Table Lookups, Implemented Simple and
Complex
• Defined/Developed Test Data Generator
SiteCatalyst – Programmatic Analysis

• Specify the Rules For Inbound Columns
• Hadoop Map Program to Process Each Inbound Row
• Hadoop Distributed Cache containing Target Columns->Tables, Validation
Rules and Reference Lookups
• Data Validation Performed Against All Inbound Columns
• Rejects Written To Error Table
• Reference Lookups
• Valid Data Written To Separate Tables Based On Category (main, video,
mobile, etc.)
SiteCatalyst – Data Validation and Table Framework

• All Code/Processes Developed Internally by IPC Global
• Multiple Sources, Multiple Target Databases
• Iterative Analysis Utilizing Rapidly Coded Tools
• Iterative Analyze/Extract/Validate/Apply
• Performance Insight – Reference Tables, Caching
• Increased Access to Business Data
Summary

Technology Partners:
Data Analysis,
Extraction and
Validation

IPC Data Analysis and Extraction

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to IPC Data Analysis and Extraction

Similar to IPC Data Analysis and Extraction (20)

IPC Data Analysis and Extraction