Hadoop Powered
Corporate Data
How to Produce and Manage Meaningful Data and
Analytics
Dr. Geoffrey Malafsky
Phasic Systems Inc.
Governance
Warehouse
Analytics
NoSQL Streaming
BI
Integration
Architecture
Modeling
Big Data Hadoop Velocity,
Volume,
Variety
Veracity
What does this
really mean for
my corporate
data?
Disruption
Organizational Issues
Technology Issues
Business Issues
Are we discovering new knowledge?
Are we analyzing business and
operations for decisions, audit,
compliance, consolidation?
Are we fulfilling required reports?
Veracity, Meaningful
Does it matter?
Topic             Should   Does
BI                Yes      Sometimes
Required Reports  Yes      Sometimes
Audit             Yes      Yes
Compliance        Yes      Yes
Consolidation     Yes      Sometimes
Marketing         Yes      Sometimes
Financial         Yes      Yes, but...
Decision Making   Yes      Yes, but...
TechLab by InsideAnalysis
Normalizing Corporate Small Data With Hadoop and Data Science
By Dr. Geoffrey P Malafsky
In part one of this discussion series (Hadoop for Small Data), I introduced the idea that Small Data is the mission-critical data management challenge. To reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence.”
I am excluding what I call stochastic data use cases, which can succeed even if there is error in the source data and uncertainty in the results, since the business objective is getting trends or making general associations. Most Big Data examples are of this type. In stark contrast are deterministic use cases, which I am focusing on here and in the next TechLab in September, where the ramifications of wrong results are severely negative. This is the realm of executive decision making, accounting, risk management, regulatory compliance, and security, to name a few.
Corporate Small Data is
structured data that is the
fuel of its main activities
Data Normalization combines
subject matter knowledge,
governance, business rules,
and raw data to make it
meaningful.
Hadoop was created to handle extraordinarily large and constantly changing data sets. It is a very well-engineered software framework and set of tools for distributed storage and cluster computing. But can it help solve the intractable challenges with key corporate data?
The Challenge of Corporate Small Data
multiple sources multiple definitions multiple copies
variable structures
different data values
hidden conflicts in data
definitions
which to use
different model types &
standards
more storage more data flows
Many DW & marts different ETL
complex dependencies
conflicting
business rules
analyses restricted
by inconsistencies
An example of embedded errors that defy traditional tools and methods: two authoritative data systems have many occurrences of conflicts, errors, and quantitative discrepancies. Finding these has been too difficult with common tools. But using a small Hadoop cluster (this is Corporate Data, not Big Data) allows us to iteratively detect, learn, and adjust. Once the discrepancies are detected, investigated, and understood, we can identify the single answer needed from the business to correct them.
136666505 adese genc petrol
136666505 amy lily chung
136666505 anderson erin ruth
136666505 andrew william knef
136666505 anduaga-arias laura
136666505 angelica m. de la cruz
136666505 anthony o'brien, 330531-5100194
136666505 batac belle
136666505 bottesini beth ms.
136666505 bouck shannon
136666505 bunn amy b.
136666505 carlene clark
136666505 cho, boong haeng
136666505 choe, sun young
136666505 christina michajlyszyn
136666505 christopher cannon
136666505 christopher l. booth
136666505 chun, kil mo
136666505 conflict + transition consultancies
136666505 cozzone elaine
136666505 deborah p. carney
136666505 denihan patricia joann
136666505 dong sook mcgeorge, 690525-2716816
136666505 dorene d.lukewalton,pharm d.
136666505 dr. terry a. klein
[Figure: Proportion of DUNS Matched by Transform Type. Vertical axis: percent of DUNS with >= 50% of names matched (0 to 100). Categories: WhiteSpace, Transpose, Acronym, NoiseWord, LowSim, Punctuation. Series: FPDS, FPDS-WAWF, FPDS-WAWF-GDUNS.]
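The name list and the transform types listed above suggest a simple workflow: normalize each name recorded under a DUNS number, then measure how many of the variants agree. The following is a minimal sketch of that idea, not the method used in the deck. The noise-word list, the token-sorting treatment of transposed names, the similarity threshold, and all function names are assumptions made for illustration, and the Acronym transform is not implemented.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical noise words; a real list would come from subject matter experts.
NOISE_WORDS = {"inc", "llc", "corp", "co", "ltd", "dr", "ms", "mr"}


def strip_punctuation(name):
    # Punctuation transform: replace anything that is not a word character or space.
    return re.sub(r"[^\w\s]", " ", name)


def collapse_whitespace(name):
    # WhiteSpace transform: trim and squeeze runs of whitespace.
    return " ".join(name.split())


def drop_noise_words(name):
    # NoiseWord transform: remove tokens that carry no identifying meaning.
    return " ".join(t for t in name.split() if t not in NOISE_WORDS)


def sort_tokens(name):
    # Transpose transform (assumed): ignore word order, e.g. "bunn amy" vs "amy bunn".
    return " ".join(sorted(name.split()))


def normalize(name):
    name = name.lower()
    for step in (strip_punctuation, collapse_whitespace, drop_noise_words, sort_tokens):
        name = step(name)
    return name


def similar(a, b, threshold=0.85):
    # Assumed stand-in for a similarity cutoff; the threshold is arbitrary.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def fraction_matched(names, reference):
    # Share of name variants that match the reference name after normalization.
    ref = normalize(reference)
    hits = sum(1 for n in names if similar(normalize(n), ref))
    return hits / len(names) if names else 0.0


def conflicting_duns(records):
    # Map each DUNS to its distinct normalized names; keep only those with conflicts.
    by_duns = defaultdict(set)
    for duns, name in records:
        by_duns[duns].add(normalize(name))
    return {d: names for d, names in by_duns.items() if len(names) > 1}
```

Running `conflicting_duns` over (DUNS, name) pairs from two systems gives a short list of identifiers to investigate, and `fraction_matched` can be recomputed as transforms are added or adjusted, which fits the iterative detect, learn, adjust loop described earlier.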
Requirements for Data Analytics
1. Data must be understood
2. The right definitions must apply at the right time for the right user
3. Data’s lineage and provenance must be clear
4. Data integrity must be preserved
5. Data must be accurate, consistent, complete, timely, unique and valid
6. Data and system access must be secure
7. Data must be provided in multiple arrangements to meet different user needs and analytical
processing requirements
8. Data must be prepared and tracked to support meaningful analysis for different user needs
9. Data processing must be flexible to adapt to new knowledge and discoveries on data already
being used
10. Data must be normalized using authoritative or best known sets of codes, lookup values, and
source adjudication knowledge and rules
11. High speed, low maintenance techniques and tools are needed to be cost and time effective
12. Lifecycle audits and data maintenance must be performed, including maintaining and
documenting data from raw source, through intermediate transformations, to the fully normalized form
13. Common data models must be used that align, correct, and semantically unify data from multiple
sources to enforce meaningful and consistent analysis
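Some of these requirements can be checked mechanically. As a rough illustration of requirement 5, the sketch below counts completeness, uniqueness, and validity problems in a batch of records. The function name, the record layout, and the choice of checks are assumptions for illustration only; accuracy, consistency, and timeliness need reference data and business knowledge that a generic check cannot supply.

```python
def quality_report(rows, key, required, valid_values):
    """Count basic completeness, uniqueness, and validity problems.

    rows         -- list of dicts, one per record
    key          -- field expected to be unique across rows
    required     -- fields that must be present and non-empty
    valid_values -- dict mapping a field name to its set of allowed codes
    """
    seen = set()
    report = {"incomplete": 0, "duplicate_key": 0, "invalid": 0}
    for row in rows:
        # Completeness: every required field has a value.
        if any(row.get(f) in (None, "") for f in required):
            report["incomplete"] += 1
        # Uniqueness: the key has not appeared in an earlier row.
        k = row.get(key)
        if k in seen:
            report["duplicate_key"] += 1
        seen.add(k)
        # Validity: populated coded fields use an allowed value.
        for field, allowed in valid_values.items():
            value = row.get(field)
            if value not in (None, "") and value not in allowed:
                report["invalid"] += 1
    return report
```

For example, calling `quality_report(rows, "id", ["state", "amount"], {"state": {"VA", "MD"}})` on a small batch returns one counter per problem type, which can be tracked across the raw, intermediate, and normalized stages named in requirement 12.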
An Example of Hidden Business Rules and Logic
• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
• If (x1='0') v_modification_number = '0' else v_modification_number = x2
• where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
• where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
• where x2: if (x4=NULL) x2 = '0' else x2 = x4
• where x4: x4 = LTRIM(x5)
• where x5: x5 = x1
• Essentially, this first tries to use ACO_MOD; if that is NULL it tries PCO_MOD, and it sets the value to '0' if both are NULL
• If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
• where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
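Written out as ordinary code, the rules above become much easier to read and test. The following is a sketch under stated assumptions, not the original stored procedure: the function names and the dict-based row are hypothetical, SQL NULL is modeled as Python `None`, and LTRIM is assumed to strip leading spaces only.

```python
def ltrim(value):
    # Assumed equivalent of SQL LTRIM: strip leading spaces, pass NULL through.
    return value.lstrip(" ") if value is not None else None


def derive_fields(row):
    """Apply the condensed rules to one source row (a dict; None means NULL)."""
    contract = row.get("CONTRACT")
    delivery_order = row.get("DELIVERY_ORDER")
    aco_mod = row.get("ACO_MOD")
    pco_mod = row.get("PCO_MOD")
    ref = row.get("REF_PROC_INSTRUMENT")

    # v_piid: the delivery order wins when present, otherwise the contract.
    v_piid = contract if delivery_order is None else delivery_order

    # Modification number: ACO_MOD, then PCO_MOD, then '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod
    x4 = ltrim(x1)  # x5 = x1 in the original chain
    x2 = "0" if x4 is None else x4
    v_modification_number = "0" if x1 == "0" else x2

    # v_idv_piid: dash-stripped reference instrument when there is no delivery order.
    y1 = ref.replace("-", "") if ref is not None else None
    v_idv_piid = y1 if delivery_order is None else contract

    return {
        "v_piid": v_piid,
        "v_modification_number": v_modification_number,
        "v_idv_piid": v_idv_piid,
    }
```

Once the logic is in this form, each rule can be reviewed with the business owners and covered by small test rows, instead of being rediscovered inside a database procedure.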
Key business logic as buried in a database stored procedure (condensed)
Flexible, Fast, Adaptive, Multi-Tool Data Analytics Environment
[Figure: FPDS Hadoop Query Times, Text Field (secs). Vertical axis: 0 to 400. Labels: Hive, Impala, SQLServer; Text, Parquet, Parquet Partitioned.]
Parallel Jobs in Hadoop
Phasic Systems Inc. 19
Phasic Systems Inc. 20
Phasic Systems Inc. 21

More Related Content

Similar to Phasic Systems - Dr. Geoffrey Malafsky

Enterprise Integration in a nutshell (16:9)
Enterprise Integration in a nutshell (16:9)Enterprise Integration in a nutshell (16:9)
Enterprise Integration in a nutshell (16:9)Dmytro Golodiuk
 
Analyst Webinar: Discover how a logical data fabric helps organizations avoid...
Analyst Webinar: Discover how a logical data fabric helps organizations avoid...Analyst Webinar: Discover how a logical data fabric helps organizations avoid...
Analyst Webinar: Discover how a logical data fabric helps organizations avoid...Denodo
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategyHimanshu Bari
 
Data summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data opsData summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data opsRyan Gross
 
Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...
Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...
Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...FindWhitePapers
 
Whitepaper Building Power BI Solutions with Power Query
Whitepaper  Building Power BI Solutions with Power QueryWhitepaper  Building Power BI Solutions with Power Query
Whitepaper Building Power BI Solutions with Power QueryMILL5
 
1Running head BIG DATA6BIG DATAMIT 681 MSIT.docx
1Running head BIG DATA6BIG DATAMIT 681  MSIT.docx1Running head BIG DATA6BIG DATAMIT 681  MSIT.docx
1Running head BIG DATA6BIG DATAMIT 681 MSIT.docxaulasnilda
 
Unleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptxUnleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptxGolu187360
 
Unleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptxUnleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptxGolu187360
 
The Data Warehouse Essays
The Data Warehouse EssaysThe Data Warehouse Essays
The Data Warehouse EssaysMelissa Moore
 
Solix Common Data Platform: Advanced Analytics and the Data-Driven Enterprise
Solix Common Data Platform: Advanced Analytics and the Data-Driven EnterpriseSolix Common Data Platform: Advanced Analytics and the Data-Driven Enterprise
Solix Common Data Platform: Advanced Analytics and the Data-Driven EnterpriseLindaWatson19
 
Building an API for EHR integration at scale
Building an API for EHR integration at scaleBuilding an API for EHR integration at scale
Building an API for EHR integration at scaleRedox Engine
 
How to Scale your Analytics in a Maturing Organization
How to Scale your Analytics in a Maturing OrganizationHow to Scale your Analytics in a Maturing Organization
How to Scale your Analytics in a Maturing OrganizationKissmetrics on SlideShare
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And FootballAmanda Gray
 
Data and the Changing Role of the Tech Savvy CFO
Data and the Changing Role of the Tech Savvy CFOData and the Changing Role of the Tech Savvy CFO
Data and the Changing Role of the Tech Savvy CFODamian R. Mingle, MBA
 
Decision Point AI, plan around what will happen instead of what has happened?
Decision Point AI, plan around what will happen instead of what has happened?Decision Point AI, plan around what will happen instead of what has happened?
Decision Point AI, plan around what will happen instead of what has happened?Karl Smith
 
From Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringFrom Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringRy Walker
 
Semantic Applications for Financial Services
Semantic Applications for Financial ServicesSemantic Applications for Financial Services
Semantic Applications for Financial ServicesDavidSNewman
 

Similar to Phasic Systems - Dr. Geoffrey Malafsky (20)

Enterprise Integration in a nutshell (16:9)
Enterprise Integration in a nutshell (16:9)Enterprise Integration in a nutshell (16:9)
Enterprise Integration in a nutshell (16:9)
 
Analyst Webinar: Discover how a logical data fabric helps organizations avoid...
Analyst Webinar: Discover how a logical data fabric helps organizations avoid...Analyst Webinar: Discover how a logical data fabric helps organizations avoid...
Analyst Webinar: Discover how a logical data fabric helps organizations avoid...
 
Mighty Guides- Data Disruption
Mighty Guides- Data DisruptionMighty Guides- Data Disruption
Mighty Guides- Data Disruption
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
Data summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data opsData summit connect fall 2020 - rise of data ops
Data summit connect fall 2020 - rise of data ops
 
Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...
Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...
Data Integration: Creating a Trustworthy Data Foundation for Business Intelli...
 
Whitepaper Building Power BI Solutions with Power Query
Whitepaper  Building Power BI Solutions with Power QueryWhitepaper  Building Power BI Solutions with Power Query
Whitepaper Building Power BI Solutions with Power Query
 
1Running head BIG DATA6BIG DATAMIT 681 MSIT.docx
1Running head BIG DATA6BIG DATAMIT 681  MSIT.docx1Running head BIG DATA6BIG DATAMIT 681  MSIT.docx
1Running head BIG DATA6BIG DATAMIT 681 MSIT.docx
 
Unleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptxUnleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptx
 
Unleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptxUnleashing the Power of Cloud-Based Big Data Analytics.pptx
Unleashing the Power of Cloud-Based Big Data Analytics.pptx
 
The Data Warehouse Essays
The Data Warehouse EssaysThe Data Warehouse Essays
The Data Warehouse Essays
 
Solix Common Data Platform: Advanced Analytics and the Data-Driven Enterprise
Solix Common Data Platform: Advanced Analytics and the Data-Driven EnterpriseSolix Common Data Platform: Advanced Analytics and the Data-Driven Enterprise
Solix Common Data Platform: Advanced Analytics and the Data-Driven Enterprise
 
Building an API for EHR integration at scale
Building an API for EHR integration at scaleBuilding an API for EHR integration at scale
Building an API for EHR integration at scale
 
How to Scale your Analytics in a Maturing Organization
How to Scale your Analytics in a Maturing OrganizationHow to Scale your Analytics in a Maturing Organization
How to Scale your Analytics in a Maturing Organization
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
 
IT Ready - DW: 1st Day
IT Ready - DW: 1st Day IT Ready - DW: 1st Day
IT Ready - DW: 1st Day
 
Data and the Changing Role of the Tech Savvy CFO
Data and the Changing Role of the Tech Savvy CFOData and the Changing Role of the Tech Savvy CFO
Data and the Changing Role of the Tech Savvy CFO
 
Decision Point AI, plan around what will happen instead of what has happened?
Decision Point AI, plan around what will happen instead of what has happened?Decision Point AI, plan around what will happen instead of what has happened?
Decision Point AI, plan around what will happen instead of what has happened?
 
From Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringFrom Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data Engineering
 
Semantic Applications for Financial Services
Semantic Applications for Financial ServicesSemantic Applications for Financial Services
Semantic Applications for Financial Services
 

More from Inside Analysis

An Ounce of Prevention: Forging Healthy BI
An Ounce of Prevention: Forging Healthy BIAn Ounce of Prevention: Forging Healthy BI
An Ounce of Prevention: Forging Healthy BIInside Analysis
 
Agile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessAgile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessInside Analysis
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
Fit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownFit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownInside Analysis
 
To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security Inside Analysis
 
The Hadoop Guarantee: Keeping Analytics Running On Time
The Hadoop Guarantee: Keeping Analytics Running On TimeThe Hadoop Guarantee: Keeping Analytics Running On Time
The Hadoop Guarantee: Keeping Analytics Running On TimeInside Analysis
 
Introducing: A Complete Algebra of Data
Introducing: A Complete Algebra of DataIntroducing: A Complete Algebra of Data
Introducing: A Complete Algebra of DataInside Analysis
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionInside Analysis
 
Ahead of the Stream: How to Future-Proof Real-Time Analytics
Ahead of the Stream: How to Future-Proof Real-Time AnalyticsAhead of the Stream: How to Future-Proof Real-Time Analytics
Ahead of the Stream: How to Future-Proof Real-Time AnalyticsInside Analysis
 
All Together Now: Connected Analytics for the Internet of Everything
All Together Now: Connected Analytics for the Internet of EverythingAll Together Now: Connected Analytics for the Internet of Everything
All Together Now: Connected Analytics for the Internet of EverythingInside Analysis
 
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETLGoodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETLInside Analysis
 
The Biggest Picture: Situational Awareness on a Global Level
The Biggest Picture: Situational Awareness on a Global LevelThe Biggest Picture: Situational Awareness on a Global Level
The Biggest Picture: Situational Awareness on a Global LevelInside Analysis
 
Structurally Sound: How to Tame Your Architecture
Structurally Sound: How to Tame Your ArchitectureStructurally Sound: How to Tame Your Architecture
Structurally Sound: How to Tame Your ArchitectureInside Analysis
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataInside Analysis
 
A Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data WarehouseA Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data WarehouseInside Analysis
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopInside Analysis
 
Rethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile WorldRethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile WorldInside Analysis
 
DisrupTech - Dave Duggal
DisrupTech - Dave DuggalDisrupTech - Dave Duggal
DisrupTech - Dave DuggalInside Analysis
 
Red Hat - Sarangan Rangachari
Red Hat - Sarangan RangachariRed Hat - Sarangan Rangachari
Red Hat - Sarangan RangachariInside Analysis
 

More from Inside Analysis (20)

An Ounce of Prevention: Forging Healthy BI
An Ounce of Prevention: Forging Healthy BIAn Ounce of Prevention: Forging Healthy BI
An Ounce of Prevention: Forging Healthy BI
 
Agile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessAgile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for Success
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Fit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownFit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data Letdown
 
To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security To Serve and Protect: Making Sense of Hadoop Security
To Serve and Protect: Making Sense of Hadoop Security
 
The Hadoop Guarantee: Keeping Analytics Running On Time
The Hadoop Guarantee: Keeping Analytics Running On TimeThe Hadoop Guarantee: Keeping Analytics Running On Time
The Hadoop Guarantee: Keeping Analytics Running On Time
 
Introducing: A Complete Algebra of Data
Introducing: A Complete Algebra of DataIntroducing: A Complete Algebra of Data
Introducing: A Complete Algebra of Data
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop Adoption
 
Ahead of the Stream: How to Future-Proof Real-Time Analytics
Ahead of the Stream: How to Future-Proof Real-Time AnalyticsAhead of the Stream: How to Future-Proof Real-Time Analytics
Ahead of the Stream: How to Future-Proof Real-Time Analytics
 
All Together Now: Connected Analytics for the Internet of Everything
All Together Now: Connected Analytics for the Internet of EverythingAll Together Now: Connected Analytics for the Internet of Everything
All Together Now: Connected Analytics for the Internet of Everything
 
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETLGoodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
Goodbye, Bottlenecks: How Scale-Out and In-Memory Solve ETL
 
The Biggest Picture: Situational Awareness on a Global Level
The Biggest Picture: Situational Awareness on a Global LevelThe Biggest Picture: Situational Awareness on a Global Level
The Biggest Picture: Situational Awareness on a Global Level
 
Structurally Sound: How to Tame Your Architecture
Structurally Sound: How to Tame Your ArchitectureStructurally Sound: How to Tame Your Architecture
Structurally Sound: How to Tame Your Architecture
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big Data
 
A Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data WarehouseA Revolutionary Approach to Modernizing the Data Warehouse
A Revolutionary Approach to Modernizing the Data Warehouse
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
 
Rethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile WorldRethinking Data Availability and Governance in a Mobile World
Rethinking Data Availability and Governance in a Mobile World
 
DisrupTech - Dave Duggal
DisrupTech - Dave DuggalDisrupTech - Dave Duggal
DisrupTech - Dave Duggal
 
Modus Operandi
Modus OperandiModus Operandi
Modus Operandi
 
Red Hat - Sarangan Rangachari
Red Hat - Sarangan RangachariRed Hat - Sarangan Rangachari
Red Hat - Sarangan Rangachari
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Phasic Systems - Dr. Geoffrey Malafsky

  • 1. Hadoop Powered Corporate Data: How to Produce and Manage Meaningful Data and Analytics. Dr. Geoffrey Malafsky, Phasic Systems Inc.
  • 2. Governance; Warehouse; Analytics; NoSQL; Streaming; BI; Integration; Architecture; Modeling; Big Data; Hadoop; Velocity, Volume, Variety, Veracity
  • 3. Governance; Warehouse; Analytics; NoSQL; Streaming; BI; Integration; Architecture; Modeling; Big Data; Hadoop; Velocity, Volume, Variety, Veracity. What does this really mean for my corporate data? Disruption
  • 4. Organizational Issues; Technology Issues; Business Issues
  • 5. Are we discovering new knowledge? Are we analyzing business and operations for decisions, audit, compliance, consolidation? Are we fulfilling required reports?
  • 6. Veracity, Meaningful: Does it matter?
      Topic | Should | Does
      BI | Yes | Sometimes
      Required Reports | Yes | Sometimes
      Audit | Yes | Yes
      Compliance | Yes | Yes
      Consolidation | Yes | Sometimes
      Marketing | Yes | Sometimes
      Financial | Yes | Yes but….
      Decision Making | Yes | Yes but….
  • 7. TechLab by InsideAnalysis. Normalizing Corporate Small Data With Hadoop and Data Science, by Dr. Geoffrey P Malafsky. In part one of this discussion series (Hadoop for Small Data), I introduced the idea that Small Data is the mission-critical data management challenge. To reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence.” I am excluding what I call stochastic data use cases, which can succeed even if there is error in the source data and uncertainty in the results, since the business objective is getting trends or making general associations. Most Big Data examples are of this type. In stark contrast are deterministic use cases, which I am focusing on here and in the next TechLab in September, where the ramifications of wrong results are severely negative. This is the realm of executive decision making, accounting, risk management, regulatory compliance, and security, to name a few. Callouts: Corporate Small Data is structured data that is the fuel of its main activities. Data Normalization combines subject matter knowledge, governance, business rules, and raw data to make it meaningful.
  • 8. Hadoop was created to handle extraordinarily large and constantly changing data sets. It is a very well-engineered software framework and set of tools for distributed storage and cluster computing. But can it help solve the intractable challenges with key corporate data?
  • 9. The Challenge of Corporate Small Data: multiple sources; multiple definitions; multiple copies; variable structures; different data values; hidden conflicts in data definitions; which to use; different model types & standards; more storage; more data flows; many DW & marts; different ETL; complex dependencies; conflicting business rules; analyses restricted by inconsistencies
  • 10. An example of embedded errors that defy traditional tools and methods. Two authoritative data systems have many occurrences of conflicts, errors, and quantitative discrepancies. Finding these has been too difficult with common tools. But using a small Hadoop cluster (this is Corporate Data, not Big Data) allows us to iteratively detect, learn, and adjust. Once the discrepancies are detected, investigated, and understood, we can find just the one answer from the business needed to correct them.
  • 11. Names listed against the single identifier 136666505: adese genc petrol; amy lily chung; anderson erin ruth; andrew william knef; anduaga-arias laura; angelica m. de la cruz; anthony o'brien, 330531-5100194; batac belle; bottesini beth ms.; bouck shannon; bunn amy b.; carlene clark; cho, boong haeng; choe, sun young; christina michajlyszyn; christopher cannon; christopher l. booth; chun, kil mo; conflict + transition consultancies; cozzone elaine; deborah p. carney; denihan patricia joann; dong sook mcgeorge, 690525-2716816; dorene d.lukewalton,pharm d.; dr. terry a. klein. [Chart: Proportion of DUNS Matched by Transform Type. Y-axis: Percent of DUNS With >=50% Names Matched. Transform types: WhiteSpace, Transpose, Acronym, NoiseWord, LowSim, Punctuation. Series: FPDS, FPDS-WAWF, FPDS-WAWF-GDUNS.]
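The transform types named in the chart suggest comparing names only after successive normalizations. The following is a minimal sketch of that idea, assuming simple definitions for four of the transforms (whitespace, punctuation, noise words, and transposed tokens). The function names, the noise-word list, and the definitions themselves are illustrative guesses, not the method used in the slides.

```python
import re

# Hypothetical noise-word list; a real one would come from subject matter experts.
NOISE_WORDS = {"inc", "llc", "ltd", "corp", "co", "dr", "ms", "mr"}


def canonical(name: str) -> str:
    """Reduce a name to a key that ignores case, punctuation, extra
    whitespace, noise words, and token order (assumed transform meanings)."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)              # Punctuation
    tokens = name.split()                             # WhiteSpace
    tokens = [t for t in tokens if t not in NOISE_WORDS]  # NoiseWord
    return " ".join(sorted(tokens))                   # Transpose


def names_match(a: str, b: str) -> bool:
    """True when two names agree after all of the transforms above."""
    return canonical(a) == canonical(b)


def share_matched(names_a: list[str], names_b: list[str]) -> float:
    """Fraction of names in names_a that match some name in names_b."""
    if not names_a:
        return 0.0
    keys_b = {canonical(n) for n in names_b}
    return sum(canonical(n) in keys_b for n in names_a) / len(names_a)
```

An identifier could then be counted as matched when `share_matched` reaches 0.5 for its two name lists, mirroring the ">=50% names matched" threshold on the chart's axis. Applying one transform at a time, rather than all at once as here, would be needed to attribute matches to a specific transform type.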
  • 12. Requirements for Data Analytics
      1. Data must be understood
      2. The right definitions must apply at the right time for the right user
      3. Data’s lineage and provenance must be clear
      4. Data integrity must be preserved
      5. Data must be accurate, consistent, complete, timely, unique and valid
      6. Data and system access must be secure
      7. Data must be provided in multiple arrangements to meet different user needs and analytical processing requirements
      8. Data must be prepared and tracked to support meaningful analysis for different user needs
      9. Data processing must be flexible to adapt to new knowledge and discoveries on data already being used
      10. Data must be normalized using authoritative or best known sets of codes, lookup values, and source adjudication knowledge and rules
      11. High speed, low maintenance techniques and tools are needed to be cost and time effective
      12. Lifecycle audits and data maintenance must be performed including maintaining and documenting data from raw source to intermediate transformed to full normalized
      13. Use Common data models that align, correct, and semantically unify data from multiple sources to enforce meaningful and consistent analysis
  • 14. An Example of Hidden Business Rules and Logic (key business logic as buried in a database stored procedure, condensed)
      • If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
      • If (x1='0') v_modification_number = '0' else v_modification_number = x2
      • where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
      • where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
      • where x2: if (x4=NULL) x2 = '0' else x2 = x4
      • where x4: x4 = LTRIM(x5)
      • where x5: x5 = x1
      • Essentially, this first tries to use ACO_MOD; if that is NULL it tries PCO_MOD, and it sets the value to '0' if both are NULL
      • If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
      • where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
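Written out as an ordinary function, the rules on this slide become easier to read, test, and govern. The sketch below is one possible Python transcription, not the original stored procedure. The function name, the `ContractKeys` type, and the parameter names are hypothetical. It also assumes that a value which is empty after left-trimming counts as NULL, which depends on the database.

```python
from typing import NamedTuple, Optional


class ContractKeys(NamedTuple):
    piid: Optional[str]
    modification_number: str
    idv_piid: Optional[str]


def derive_contract_keys(
    contract: Optional[str],
    delivery_order: Optional[str],
    aco_mod: Optional[str],
    pco_mod: Optional[str],
    ref_proc_instrument: Optional[str],
) -> ContractKeys:
    """Transcribe the slide's rules; None stands in for SQL NULL."""
    # v_piid: the delivery order wins when present, else the contract.
    piid = contract if delivery_order is None else delivery_order

    # x3, x1: prefer ACO_MOD, fall back to PCO_MOD, then to '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod

    # x5 = x1; x4 = LTRIM(x5); x2 = '0' when x4 is NULL.
    x4 = x1.lstrip(" ")
    x2 = x4 if x4 else "0"  # assumption: empty after trimming means NULL
    modification_number = "0" if x1 == "0" else x2

    # y1: REF_PROC_INSTRUMENT with all '-' characters removed.
    y1 = None if ref_proc_instrument is None else ref_proc_instrument.replace("-", "")
    idv_piid = y1 if delivery_order is None else contract

    return ContractKeys(piid, modification_number, idv_piid)
```

For example, `derive_contract_keys("C1", "D9", "  A5", "P2", "X-1")` takes the delivery-order branch and trims the leading spaces from the ACO modification value. Keeping each intermediate (`x1` to `x4`, `y1`) visible makes it possible to check the transcription against the slide line by line.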
  • 15. Flexible, Fast, Adaptive, Multi-Tool Data Analytics Environment
  • 17. [Chart: FPDS Hadoop Query Times, Text Field (secs). Engines: Hive, Impala, SQLServer. Storage formats: Text, Parquet, Parquet Partitioned.]
  • 18. Parallel Jobs in Hadoop