#BDWmeetup @joe_Caserta 
Big Data Warehousing: September 17, 2014
Big ETL: What to use when
• Traditional Tools
• Map Reduce
• Pig
• Hive
• Python
Agenda 
7:00 Networking
Grab some food and drink... Make some friends.
7:15 Welcome + Intro: Joe Caserta, President, Caserta Concepts
About the Meetup, about Caserta Concepts
Overview of evolution and future of ETL
7:35 Elliott Cordo, Chief Architect, Caserta Concepts
Deeper dive into Pig, Hive, Spark, etc.
Demo of Spark!
8:00 Kyle Hubert, Principal Data Architect, Simulmedia
Hadoop Streaming with Python
8:45 Q&A, More Networking
Tell us what you’re up to…
About the BDW Meetup
Twitter: #BDWmeetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Opportunities to collaborate on exciting projects
• Founded by Caserta Concepts
• November 10, 2012 - HAPPY ANNIVERSARY!!!
• Next BDW Meetup: Want to present?
@joe_Caserta @CasertaConcepts @Simulmedia
About Caserta Concepts
• Technology services company with expertise in data analysis:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
  • Data Science & Analytics
  • Data on the Cloud
  • Data Interaction & Visualization
• Core focus in the following industries:
  • eCommerce / Retail / Marketing
  • Financial Services / Insurance
  • Healthcare / Ad Tech / Higher Ed
• Established in 2001:
  • Increased growth year-over-year
  • Industry-recognized workforce
  • Strategy, Implementation
  • Writing, Education, Mentoring
Help Wanted
Does this word cloud excite you?
[Word cloud: Cassandra, Storm, Big Data Architect, HBase]
Speak with us about our open positions: leslie@casertaconcepts.com
The Evolution of the Enterprise Data Hub
[Diagram. Labels: POC; sources Enrollments, Claims, Finance, Others…; ETL (three instances); Traditional EDW; NoSQL Databases; Enterprise Data Hub with Spark, MapReduce, Pig/Hive over nodes N1 N2 N3 N4 N5 on the Hadoop Distributed File System (HDFS); "Horizontally Scalable Environment - Optimized for Analytics".]
ETL for the (Big Data) Enterprise Data Hub
• Convergence of
  • Data quality
  • Data Management and policies
  • All data in an organization
• Set of processes
  • Ensure data assets are formally managed throughout the enterprise.
  • Ensure data can be trusted
• EDH - Backbone of business
  • Production environment
  • Agile
Components of a Mature Enterprise Data Hub 
Organization
• This is the ‘people’ part: establishing an Enterprise Data Council, Data Stewards, etc.
  • Add Big Data to the overall framework and assign responsibility
  • Add data scientists to the Stewardship program
  • Assign stewards to new data sets (Twitter, call center logs, etc.)
Metadata
• Definitions, lineage (where does this data come from), business definitions, technical metadata
  • Larger scale
  • New datatypes
  • Integrate with Hive Metastore, HCatalog, home grown tables
Privacy/Security
• Identify and control sensitive data, regulatory compliance
  • Data detection and masking on unstructured data upon ingest
Data Quality and Monitoring
• Data must be complete and correct. Measure, improve, certify
  • Data Quality and Monitoring (probably home grown, Drools?)
  • Quality checks not only SQL: machine learning, Pig and Map Reduce
  • Acting on large dataset quality checks may require distribution
Business Process Integration
• Policies around data frequency, source availability, etc.
  • Near-zero latency, DevOps, core component of business operations
Master Data Management
• Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
  • Graph databases are more flexible than relational
  • Lower latency service required
  • Distributed data quality and matching algorithms
Information Lifecycle Management (ILM)
• Data retention, purge schedule, storage/archiving
  • Secure and mask multiple data types (not just tabular)
  • Deletes are more uncommon (unless there is a regulatory requirement)
  • Take advantage of compression and archiving (like AWS Glacier)
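The "measure, improve, certify" idea for data quality can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `completeness` and `certify`, the 95% threshold, and the sample claims records are all hypothetical and do not come from the slides.

```python
def completeness(records, required_fields):
    """Measure: fraction of records with a non-empty value per required field."""
    records = list(records)
    total = len(records)
    report = {}
    for field in required_fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


def certify(report, threshold=0.95):
    """Certify: every field must meet the (assumed) completeness threshold."""
    return all(score >= threshold for score in report.values())


# Hypothetical claims records, invented for this example.
claims = [
    {"member_id": "M1", "provider_id": "P9", "amount": 120.0},
    {"member_id": "M2", "provider_id": "", "amount": 75.5},
    {"member_id": None, "provider_id": "P3", "amount": 10.0},
    {"member_id": "M4", "provider_id": "P3", "amount": 42.0},
]

report = completeness(claims, ["member_id", "provider_id", "amount"])
print(report)
print("certified:", certify(report))
```

On a large data set the same per-field counts would likely be computed in a distributed job (Pig, Hive, MapReduce, or Spark), as the slide suggests, with only the small report brought back for certification.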
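Detection and masking of sensitive values in unstructured data upon ingest might look like the following sketch. The two regular expressions (email addresses and US-style SSNs), the placeholder tokens, and the name `mask_sensitive` are assumptions chosen for illustration; a production detector would need far broader coverage.

```python
import re

# Hypothetical patterns; real detection rules would be broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_sensitive(text):
    """Replace detected values with tokens and count what was found."""
    masked, n_email = EMAIL.subn("[EMAIL]", text)
    masked, n_ssn = SSN.subn("[SSN]", masked)
    return masked, {"email": n_email, "ssn": n_ssn}


# Invented call-center log line.
line = "Caller jane.doe@example.com gave SSN 123-45-6789 to the agent."
masked, counts = mask_sensitive(line)
print(masked)
print(counts)
```

Returning the counts alongside the masked text gives the monitoring side something to track, for example how often sensitive values appear in a given source.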
Enterprise Data Pyramid
• ETL cleans, conforms, consolidates, enriches each tier.
• Only top (trusted) tier of the pyramid is fully accessible by the masses.
[Pyramid diagram, top tier first, with ETL between adjacent tiers]
• Big Data Warehouse: Fully Data Governed (trusted). User community arbitrary queries and reporting.
• Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts.
• Data Lake - Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete.
• Landing Area - Source Data in “Full Fidelity”: Raw machine data collection, collect everything.
Labels repeated beside the tiers:
• Metadata → Catalog
• ILM → who has access, how long do we “manage it”
• Data Quality and Monitoring → Monitoring of completeness of data
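As a rough sketch of the ETL that moves data up a tier, the steps "cleans, conforms, consolidates" can be written as three small Python functions. Everything here is hypothetical: the field names, the rules (trim whitespace, upper-case state codes, last row wins per member), and the sample rows are invented, and the "enriches" step is left out.

```python
def clean(rows):
    """Trim whitespace from string values and drop rows without a member id."""
    out = []
    for row in rows:
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if row.get("member_id"):
            out.append(row)
    return out


def conform(rows):
    """Standardise state codes to one convention (upper case)."""
    return [{**row, "state": row["state"].upper()} for row in rows]


def consolidate(rows):
    """Keep one row per member_id; a later row replaces an earlier one."""
    merged = {}
    for row in rows:
        merged[row["member_id"]] = row
    return list(merged.values())


# Invented landing-area rows in "full fidelity", warts included.
landing = [
    {"member_id": " M1 ", "state": "ny"},
    {"member_id": "", "state": "nj"},
    {"member_id": "M1", "state": "NY "},
    {"member_id": "M2", "state": "ct"},
]

lake = consolidate(conform(clean(landing)))
print(lake)
```

Keeping each step a separate function makes it easy to port one step at a time to Pig, Hive, or Spark when the data no longer fits on one machine.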
Thank You 
Joe Caserta 
President, Caserta Concepts 
joe@casertaconcepts.com 
(914) 261-3648 
@joe_Caserta 
Free Python books, Courtesy of: 
RAFFLE!!! 

Big Data Warehousing Meetup: BigETL: Trad Tool vs Pig vs Hive vs Python. What to use when. (Slide Set #1)


Editor's Notes

  1. Robotman was actually the first cyborg superhero. Robert Crane was fatally shot and had his brain placed in a super strong robot body. The cybernetic Robotman lived on, using a rubber mask and flesh-like body suit to disguise himself as Paul Dennis. The new hero used his cyborg might to smash crime during DC’s Golden Age. First Appearance: Star Spangled Comics #7 (1942)