#BDWmeetup @joe_Caserta 
Big Data Warehousing: Big ETL
September 17, 2014
• Traditional Tools 
• Map Reduce 
• Pig 
• Hive 
• Python 
What to use when
Agenda
7:00 Networking
Grab some food and drink... Make some friends.
7:15 Joe Caserta, President, Caserta Concepts
Welcome + Intro
About the Meetup, about Caserta Concepts
Overview of evolution and future of ETL
7:35 Elliott Cordo, Chief Architect, Caserta Concepts
Deeper dive into Pig, Hive, Spark, etc.
Demo of Spark!
8:00 Kyle Hubert, Principal Data Architect, Simulmedia
Hadoop Streaming with Python
8:45 Q&A, More Networking
Tell us what you’re up to…
About the BDW Meetup
Twitter: #BDWmeetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts
• November 10, 2012 – HAPPY ANNIVERSARY!!!
• Next BDW Meetup: Want to present?
@joe_Caserta @CasertaConcepts @Simulmedia
About Caserta Concepts 
• Technology services company with expertise in data analysis: 
• Big Data Solutions 
• Data Warehousing 
• Business Intelligence 
• Data Science & Analytics 
• Data on the Cloud 
• Data Interaction & Visualization 
• Core focus in the following industries: 
• eCommerce / Retail / Marketing 
• Financial Services / Insurance 
• Healthcare / Ad Tech / Higher Ed 
• Established in 2001: 
• Increased growth year-over-year 
• Industry-recognized workforce
• Strategy, Implementation 
• Writing, Education, Mentoring 
Help Wanted
Does this word cloud excite you?
[Word cloud: Cassandra, Storm, Big Data Architect, HBase]
Speak with us about our open positions: leslie@casertaconcepts.com
The Evolution of the Enterprise Data Hub POC
[Diagram]
• Sources: Enrollments, Claims, Finance, Others…
• Components connected by ETL: Traditional EDW, NoSQL Databases, Enterprise Data Hub
• Enterprise Data Hub: Spark, MapReduce, Pig/Hive on the Hadoop Distributed File System (HDFS), nodes N1 to N5
• Horizontally Scalable Environment - Optimized for Analytics
ETL for the (Big Data) Enterprise Data Hub 
• Convergence of 
• Data quality 
• Data Management and policies 
• All data in an organization 
• Set of processes 
• Ensure data assets are formally managed throughout the enterprise.
• Ensure data can be trusted 
• EDH - Backbone of business 
• Production environment 
• Agile 
Components of a Mature Enterprise Data Hub
Organization
• This is the ‘people’ part: establishing an Enterprise Data Council, Data Stewards, etc.
  • Add Big Data to the overall framework and assign responsibility
  • Add data scientists to the Stewardship program
  • Assign stewards to new data sets (Twitter, call center logs, etc.)
Metadata
• Definitions, lineage (where does this data come from), business definitions, technical metadata
  • Larger scale
  • New datatypes
  • Integrate with Hive Metastore, HCatalog, home-grown tables
Privacy/Security
• Identify and control sensitive data, regulatory compliance
  • Data detection and masking on unstructured data upon ingest
Data Quality and Monitoring
• Data must be complete and correct. Measure, improve, certify
  • Data Quality and Monitoring (probably home-grown, Drools?)
  • Quality checks not only SQL: machine learning, Pig and Map Reduce
  • Acting on large dataset quality checks may require distribution
Business Process Integration
• Policies around data frequency, source availability, etc.
  • Near-zero latency, DevOps, core component of business operations
Master Data Management
• Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
  • Graph databases are more flexible than relational
  • Lower latency service required
  • Distributed data quality and matching algorithms
Information Lifecycle Management (ILM)
• Data retention, purge schedule, storage/archiving
  • Secure and mask multiple data types (not just tabular)
  • Deletes are more uncommon (unless there is a regulatory requirement)
  • Take advantage of compression and archiving (like AWS Glacier)
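As a rough illustration of the Data Quality and Privacy/Security components above, here is a minimal Python sketch of a completeness measure and of masking sensitive values in free text on ingest. It is not from the talk: the field names, the sample records, and the simplified SSN pattern are all assumptions made for the example, and a real hub would likely run such checks in a distributed job.

```python
import re

# Hypothetical required fields for an incoming record (illustrative only).
REQUIRED_FIELDS = ("member_id", "provider_id", "claim_amount")

# Simplified pattern for US Social Security numbers; real detection needs more rules.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def completeness(records, required=REQUIRED_FIELDS):
    """Return the fraction of records in which every required field is non-empty."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required)
    )
    return complete / len(records)


def mask_sensitive(text):
    """Replace anything that looks like an SSN in free text with a fixed mask."""
    return SSN_PATTERN.sub("***-**-****", text)


# Made-up sample data: the second record is missing its provider.
records = [
    {"member_id": "M1", "provider_id": "P9", "claim_amount": 120.0},
    {"member_id": "M2", "provider_id": "", "claim_amount": 75.5},
]
print(completeness(records))  # 0.5
print(mask_sensitive("Caller gave SSN 123-45-6789 to the agent."))
```

The completeness ratio is the kind of number that could be measured per load and tracked over time ("measure, improve, certify"); masking at ingest means the raw value never lands in the lower tiers.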
Enterprise Data Pyramid
• ETL cleans, conforms, consolidates, enriches each tier.
• Only the top (trusted) tier of the pyramid is fully accessible by the masses.
[Pyramid diagram, tiers from top to bottom, with ETL between tiers]
• Big Data Warehouse: fully data governed (trusted); user community arbitrary queries and reporting
• Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts
• Data Lake – Integrated Sandbox: data is ready to be turned into information: organized, well defined, complete
• Landing Area – Source Data in “Full Fidelity”: raw machine data collection, collect everything
Governance labels on the lower tiers:
• Metadata → Catalog
• ILM → who has access, how long do we “manage it”
• Data Quality and Monitoring → monitoring of completeness of data
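The "cleans, conforms, consolidates, enriches" steps on this slide can be sketched as four small functions applied in sequence. This is only a toy Python illustration under assumed data: the `id` and `state` fields, the region lookup, and the "last row wins" rule are hypothetical, and at scale the same steps would be written in Pig, Hive, Spark, or MapReduce.

```python
def clean(rows):
    """Strip whitespace from string values and drop rows without an id."""
    out = []
    for row in rows:
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if row.get("id"):
            out.append(row)
    return out


def conform(rows):
    """Standardize representation: here, upper-case the state code."""
    return [{**row, "state": row.get("state", "").upper()} for row in rows]


def consolidate(rows):
    """Keep one row per id; the last occurrence wins (an assumed rule)."""
    by_id = {}
    for row in rows:
        by_id[row["id"]] = row
    return list(by_id.values())


def enrich(rows, regions):
    """Add a derived attribute from a lookup table."""
    return [{**row, "region": regions.get(row["state"], "UNKNOWN")} for row in rows]


# Made-up landing-area rows with stray whitespace, a duplicate id, and a missing id.
landing = [
    {"id": " 1 ", "state": "ny"},
    {"id": "1", "state": "nj "},
    {"id": "", "state": "ca"},
    {"id": "2", "state": "CA"},
]
regions = {"NY": "Northeast", "NJ": "Northeast", "CA": "West"}

trusted = enrich(consolidate(conform(clean(landing))), regions)
print(trusted)
```

Keeping each step as its own function mirrors the tiers: the output of one stage can be stored and inspected before the next stage promotes it further up the pyramid.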
Thank You 
Joe Caserta 
President, Caserta Concepts 
joe@casertaconcepts.com 
(914) 261-3648 
@joe_Caserta 
Free Python books, Courtesy of: 
RAFFLE!!! 

Big Data Warehousing Meetup: BigETL: Trad Tool vs Pig vs Hive vs Python. What to use when. (Slide Set #1)


Editor's Notes

  1. Robotman was actually the first cyborg superhero. Robert Crane was fatally shot and had his brain placed in a super strong robot body. The cybernetic Robotman lived on, using a rubber mask and flesh-like body suit to disguise himself as Paul Dennis. The new hero used his cyborg might to smash crime during DC’s Golden Age. First Appearance: Star Spangled Comics #7 (1942)