SlideShare a Scribd company logo
1 of 37
Download to read offline
Transitioning to a
Modern Data Platform
Michael Ghen
Benefits Data Trust
Big Data Sherpa, exploring the edges of info tech
● B.S. Computer Engineer (‘14)
– Penn State
● M.S. Strategic Analytics (‘16)
– Brandeis University
● Adjunct Instructor
– Saint Joseph's University
● Systems Engineer
– Brandeis University
● Data Platform Engineer
– Cohealo, Inc.
● Data Science Project Manager
– Benefits Data Trust
● Entrepreneurship
● Python
● Computers
Academic Professional Personal
Overview
Transitioning to a Modern
Data Platform
Data Science at BDT
Legacy Data Science Infrastructure
Framework for a Modern Data Platform
BDT Data Platform Infrastructure
BDT Data Platform Operations
Questions
Benefits Data Trust?
BDT seeks to make benefits access
more simple, comprehensive, and
cost-effective through policy and
systems change
BDT builds comprehensive and
person-centered programs that
connect individuals to all the
benefits and services they need,
while treating each person with
dignity and respect
Data Science at Benefits Data Trust: data-driven initiatives
across multiple states to connect low-income individuals to
critical benefit programs
Main Things:
● Mail Outreach: Process data needed
to run outreach operations
● Outcomes Reporting: Summarize
and report impact of our work on the
clients and communities we serve
● Internal Reporting: Extract insights
from data to empower tactical and
strategic decision making
Data Science at Benefits Data Trust: data-driven initiatives
across multiple states to connect low-income individuals to
critical benefit programs
Main Things:
● Mail Outreach: Process data needed
to run outreach operations
● Outcomes Reporting: Summarize
and report impact of our work on the
clients and communities we serve
● Internal Reporting: Extract insights
from data to empower tactical and
strategic decision making
New Main Things
● Strategic Data Mining
● A/B Testing
● Machine Learning
● Automation
● Simulation
Data Science at Benefits Data Trust: data-driven initiatives
across multiple states to connect low-income individuals to
critical benefit programs
New Programs
● Nudges: Text message and outbound
call campaigns to encourage clients
to apply or finish applying for
benefits
● Software as a Service: Sharing our
proprietary software with partners
● Social Determinants of Health:
Make sure patients are accessing
benefits as part of their healthcare
New Main Things
● Strategic Data Mining
● A/B Testing
● Machine Learning
● Automation
● Simulation
Challenges with the
Legacy System
Michael Ghen
Benefits Data Trust
Legacy Infrastructure: SAS Scripts Export Excel Reports
Challenges stem from the tools we are using but also from
the processes and culture we operate within
Ambiguity
● What outcomes are we concern with? How they are measured?
Centralization
● Data Science serves as the single point of contact between everyone and the answers to questions
Repetition
● Solution to a problem is recycled or slightly modified; small defects snowball
Myopic
● Data Science is formalized as a final step; conversations about the optimal solution are deferred
Primitive
● The tactics used, like creating multiple flat files to service requests, are primitive; do not scale well
Unreliable: CC Manager finds an error, Analyst runs it down
!%#?
Confusion: External Relations wonders why the numbers are different than
the numbers provided last year, Analyst runs it down
!%#?
Framework for a
Modern Data Platform
How do we support the
organization's data needs?
Data Platforms are composed of technology, process, and people
● Definitions:
● Platform: A platform is a group of technologies, processes, and people that are used as
a base upon which other applications, processes or technologies are developed.
● The real challenge was trying to establish a vision for what our
infrastructure should look like
Data Platform: A group of technologies, processes, and people that
are used as the base upon which data-driven application, processes,
and technologies are developed upon.
Oil Refinery Analogy: Data, Information, Knowledge
Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
Understand theory and put it into practice
Organizational Memory
● Represents an organization's
accumulated history, including
data, information, and
knowledge
● Focuses on the storage of
intellectual sources (data,
information, and explicit
knowledge) in such form that
they can later be accessed and
used
● Capabilities
● Operational Databases
● Data Lake
● Data Warehouse
● Knowledge Repositories
Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
Internal Databases, External Data from Partners
Understand theory and put it into practice
Information Integration
● Represents the ability to link past
structured and unstructured
content from a variety of sources
that comprise organizational
memory with the new, real-time,
content
● Capabilities
● Integrating
External/Internal Structured
Data
● Integrating
External/Internal
Unstructured Data
● Environmental Scanning
● Text/Web Mining
Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
All information flows into a central location
All information flows into a central location
Understand theory and put it into practice
Insight Creation
● Focuses on the utilization of “raw
materials” to produce valuable
new insights and enable effective
decisions making based on
continual rather than periodic
analysis.
● Capabilities
● Data Mining
● Business Analytics
● Real-time Decision Support
Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
Information is
integrated and
distilled into
smaller,
subject-specific
data sets
Understand theory and put it into practice
Presentation
● The point of contact between BI
and the end user
● Focuses on presenting the
appropriate information in a
user-friendly fashion based on
the user's role, the specific task,
and the user's inputs regarding
the nature of the presentation
● Capabilities
● Enterprise OLAP
● Visual Analytics
● Performance Dashboards
● Scorecards
● Enterprise Key Performance
Indicators
Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
End users are
supported by the
technologies,
processes, and
people that
support the
platform
Technologies for Creating a
Modern Data Platform
Business Intelligence Practice
Technology
Operations
● Workflow Management with Airflow
● Data Warehousing with BigQuery
● Visual Analytics with Looker
Data Platform
Apache Airflow manages workflows written in Python
● Why we choose it:
● Workflows are defined as code, are
maintainable, versionable, testable
● Uses a python programming model
● How we use it:
● ELT Jobs; Scheduled Reporting
● Application Integrations (e.g. SFTP to
Ruby on Rails applications)
Airflow is a platform to programmatically author, schedule and monitor workflows
Airflow is used to author workflows as directed acyclic graphs (DAGs) of tasks
Apache Airflow manages workflows written in Python
● Challenges:
● Deployment, management
● Testing workflows
● Training analysts
● Advantages:
● Everything is done with Python
● Logging, monitoring
● Simple to make improvements to
the underlying codebase if needed
Airflow is a platform to programmatically author, schedule and monitor workflows
Airflow is used to author workflows as directed acyclic graphs (DAGs) of tasks
BigQuery allows us to access all our data with SQL
● Why we choose it:
● Fully managed, always available
● Queries are just fast
● Easy import/export
● Dirt Cheap
● How we use it:
● Data Warehousing
● All data flows into BigQuery, all
reports based on BigQuery
● Extract, Load, Transform
BigQuery is Google's serverless, highly scalable, low cost enterprise data
warehouse designed to make our data scientists productive
BigQuery allows us to access all our data with SQL
● Challenges:
● Still requires data that is clean and
structured before import
● Does not support transactions
● Requires strong SQL skills
● Advantages:
● No database administration
● No servers
● Leverages SQL
● Great documentation
● Strong Software Development Kits
BigQuery is Google's serverless, highly scalable, low cost enterprise data
warehouse designed to make our data scientists productive
Looker provides control over how we are reporting
● Why we choose it:
● Data model as code, version
control, transparency with respect
to data model/etl changes
● Centralized data definitions
● Usage tracking
● How we use it:
● Dashboards
● Internal and external reporting
● Embedded reporting
Looker is a business intelligence software and big data analytics platform that
helps you explore, analyze and share real-time business analytics easily.
Looker provides control over how we are reporting
● Challenges:
● Server-side/in-memory
transformations, slow queries
● Setting up drill-down
● Advantages:
● Easy to troubleshoot
(everything is SQL)
● Dimensions and measures are
centralized, sharable
● We like to code, we like version
control
Looker is a business intelligence software and big data analytics platform that
helps you explore, analyze and share real-time business analytics easily.
Final Remarks
Organizational Challenges
& Benefits
Things got worse before they got better
● Departments need to own some of the analysts' tasks
○ Data Science needs to provide the tooling to support self-services
● Departments need data literacy and technical training
○ Data Science needs to be the experts supporting this
● IT needs to provide the infrastructure to support analytics technologies
○ Data Science needs to have the expertise to run these technologies
● Programs need to provide clarity about what they will report on
○ Data Science needs to seek clarity before building reporting
Thank you!
Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011

More Related Content

What's hot

Oracle primavera p6 and ebs projects where do they meet ppt
Oracle primavera p6 and ebs projects   where do they meet pptOracle primavera p6 and ebs projects   where do they meet ppt
Oracle primavera p6 and ebs projects where do they meet pptp6academy
 
How to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceHow to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceDatabricks
 
Use of unifier with primavera
Use of unifier with primaveraUse of unifier with primavera
Use of unifier with primaveragjmstleonards
 
Copy of Alok_Singh_CV
Copy of Alok_Singh_CVCopy of Alok_Singh_CV
Copy of Alok_Singh_CVAlok Singh
 
AdvancedMiner predictive analytics platform overview
AdvancedMiner predictive analytics platform overviewAdvancedMiner predictive analytics platform overview
AdvancedMiner predictive analytics platform overviewAlgolytics (old account)
 
Connect SAP Business One using Service Layer (HANA)
Connect SAP Business One using Service Layer (HANA)Connect SAP Business One using Service Layer (HANA)
Connect SAP Business One using Service Layer (HANA)APPSeCONNECT
 
Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...
Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...
Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...CLTConsultingService
 
Integrating primavera p6 with oracle erp which technology path is right for...
Integrating primavera p6 with oracle erp   which technology path is right for...Integrating primavera p6 with oracle erp   which technology path is right for...
Integrating primavera p6 with oracle erp which technology path is right for...p6academy
 
Achievo ATK - A Business Framework (DPC 2007)
Achievo ATK - A Business Framework (DPC 2007)Achievo ATK - A Business Framework (DPC 2007)
Achievo ATK - A Business Framework (DPC 2007)Ivo Jansch
 
Machine Learning - Eine Challenge für Architekten
Machine Learning - Eine Challenge für ArchitektenMachine Learning - Eine Challenge für Architekten
Machine Learning - Eine Challenge für ArchitektenHarald Erb
 
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14p6academy
 
Alok_Patle_Resume
Alok_Patle_ResumeAlok_Patle_Resume
Alok_Patle_Resumealok patle
 

What's hot (12)

Oracle primavera p6 and ebs projects where do they meet ppt
Oracle primavera p6 and ebs projects   where do they meet pptOracle primavera p6 and ebs projects   where do they meet ppt
Oracle primavera p6 and ebs projects where do they meet ppt
 
How to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceHow to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-Source
 
Use of unifier with primavera
Use of unifier with primaveraUse of unifier with primavera
Use of unifier with primavera
 
Copy of Alok_Singh_CV
Copy of Alok_Singh_CVCopy of Alok_Singh_CV
Copy of Alok_Singh_CV
 
AdvancedMiner predictive analytics platform overview
AdvancedMiner predictive analytics platform overviewAdvancedMiner predictive analytics platform overview
AdvancedMiner predictive analytics platform overview
 
Connect SAP Business One using Service Layer (HANA)
Connect SAP Business One using Service Layer (HANA)Connect SAP Business One using Service Layer (HANA)
Connect SAP Business One using Service Layer (HANA)
 
Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...
Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...
Oracle Cloud ERP Report and Analytics | What are Cloud ERP reporting Options ...
 
Integrating primavera p6 with oracle erp which technology path is right for...
Integrating primavera p6 with oracle erp   which technology path is right for...Integrating primavera p6 with oracle erp   which technology path is right for...
Integrating primavera p6 with oracle erp which technology path is right for...
 
Achievo ATK - A Business Framework (DPC 2007)
Achievo ATK - A Business Framework (DPC 2007)Achievo ATK - A Business Framework (DPC 2007)
Achievo ATK - A Business Framework (DPC 2007)
 
Machine Learning - Eine Challenge für Architekten
Machine Learning - Eine Challenge für ArchitektenMachine Learning - Eine Challenge für Architekten
Machine Learning - Eine Challenge für Architekten
 
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14
 
Alok_Patle_Resume
Alok_Patle_ResumeAlok_Patle_Resume
Alok_Patle_Resume
 

Similar to Transitioning to a Modern Data Platform with Airflow, BigQuery, and Looker

Business intelligence- Components, Tools, Need and Applications
Business intelligence- Components, Tools, Need and ApplicationsBusiness intelligence- Components, Tools, Need and Applications
Business intelligence- Components, Tools, Need and Applicationsraj
 
BizTrans SysTech_Analytics_Serv_SAP_v1.0
BizTrans SysTech_Analytics_Serv_SAP_v1.0BizTrans SysTech_Analytics_Serv_SAP_v1.0
BizTrans SysTech_Analytics_Serv_SAP_v1.0BizTrans SysTech
 
Destination Digital: Tracking Progress to Continue First Class Performance
Destination Digital: Tracking Progress to Continue First Class PerformanceDestination Digital: Tracking Progress to Continue First Class Performance
Destination Digital: Tracking Progress to Continue First Class PerformanceNGA Human Resources
 
SG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptxSG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptxssuser57f752
 
Business Analytics Training
Business Analytics TrainingBusiness Analytics Training
Business Analytics TrainingNatalija Pavic
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makerszekeLabs Technologies
 
Aayush Sinha_8.4Yrs_PO_BA
Aayush Sinha_8.4Yrs_PO_BAAayush Sinha_8.4Yrs_PO_BA
Aayush Sinha_8.4Yrs_PO_BAaayush sinha
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellIT Arena
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfprevota
 
Architecting for analytics
Architecting for analyticsArchitecting for analytics
Architecting for analyticsRob Winters
 
Self service BI for humans
Self service BI for humansSelf service BI for humans
Self service BI for humansAdrian Brudaru
 
Pratik Patel resume
Pratik Patel  resumePratik Patel  resume
Pratik Patel resumePratik Patel
 
Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Kent Graziano
 
Big data and other buzzwords
Big data and other buzzwordsBig data and other buzzwords
Big data and other buzzwordsAndrew Clark
 
IBM's Business Analytics Portfolio for Training Purposes
IBM's Business Analytics Portfolio for Training PurposesIBM's Business Analytics Portfolio for Training Purposes
IBM's Business Analytics Portfolio for Training PurposesNatalija Pavic
 

Similar to Transitioning to a Modern Data Platform with Airflow, BigQuery, and Looker (20)

Business intelligence- Components, Tools, Need and Applications
Business intelligence- Components, Tools, Need and ApplicationsBusiness intelligence- Components, Tools, Need and Applications
Business intelligence- Components, Tools, Need and Applications
 
BizTrans SysTech_Analytics_Serv_SAP_v1.0
BizTrans SysTech_Analytics_Serv_SAP_v1.0BizTrans SysTech_Analytics_Serv_SAP_v1.0
BizTrans SysTech_Analytics_Serv_SAP_v1.0
 
Destination Digital: Tracking Progress to Continue First Class Performance
Destination Digital: Tracking Progress to Continue First Class PerformanceDestination Digital: Tracking Progress to Continue First Class Performance
Destination Digital: Tracking Progress to Continue First Class Performance
 
SG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptxSG Data Mgt - Findings and Recommendations.pptx
SG Data Mgt - Findings and Recommendations.pptx
 
Business Analytics Training
Business Analytics TrainingBusiness Analytics Training
Business Analytics Training
 
The Manulife Journey
The Manulife JourneyThe Manulife Journey
The Manulife Journey
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
Aayush Sinha_8.4Yrs_PO_BA
Aayush Sinha_8.4Yrs_PO_BAAayush Sinha_8.4Yrs_PO_BA
Aayush Sinha_8.4Yrs_PO_BA
 
Ashish cv2016
Ashish cv2016Ashish cv2016
Ashish cv2016
 
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a NutshellNadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
Nadine Schöne, Dataiku. The Complete Data Value Chain in a Nutshell
 
Remin's Resume
Remin's ResumeRemin's Resume
Remin's Resume
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdf
 
Architecting for analytics
Architecting for analyticsArchitecting for analytics
Architecting for analytics
 
Dhrub_Resume_New
Dhrub_Resume_NewDhrub_Resume_New
Dhrub_Resume_New
 
Self service BI for humans
Self service BI for humansSelf service BI for humans
Self service BI for humans
 
Pratik Patel resume
Pratik Patel  resumePratik Patel  resume
Pratik Patel resume
 
Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)
 
Subrat K Panigrahi Resume
Subrat K Panigrahi ResumeSubrat K Panigrahi Resume
Subrat K Panigrahi Resume
 
Big data and other buzzwords
Big data and other buzzwordsBig data and other buzzwords
Big data and other buzzwords
 
IBM's Business Analytics Portfolio for Training Purposes
IBM's Business Analytics Portfolio for Training PurposesIBM's Business Analytics Portfolio for Training Purposes
IBM's Business Analytics Portfolio for Training Purposes
 

More from Michael Ghen

Customer segmentation and marketing automation with Apache Unomi
Customer segmentation and marketing automation with Apache UnomiCustomer segmentation and marketing automation with Apache Unomi
Customer segmentation and marketing automation with Apache UnomiMichael Ghen
 
Simulating Patient Populations
Simulating Patient PopulationsSimulating Patient Populations
Simulating Patient PopulationsMichael Ghen
 
Big Data Readiness & Business Intelligence Capabilities Matrix
Big Data Readiness & Business Intelligence Capabilities MatrixBig Data Readiness & Business Intelligence Capabilities Matrix
Big Data Readiness & Business Intelligence Capabilities MatrixMichael Ghen
 
AWS Machine Learning Workshp
AWS Machine Learning WorkshpAWS Machine Learning Workshp
AWS Machine Learning WorkshpMichael Ghen
 
Influencer marketing: Buying and Selling Audience Impressions
Influencer marketing: Buying and Selling Audience ImpressionsInfluencer marketing: Buying and Selling Audience Impressions
Influencer marketing: Buying and Selling Audience ImpressionsMichael Ghen
 
Decoding healthcare codes: ICD-10, DRG, CPT, HCPCS
Decoding healthcare codes: ICD-10, DRG, CPT, HCPCSDecoding healthcare codes: ICD-10, DRG, CPT, HCPCS
Decoding healthcare codes: ICD-10, DRG, CPT, HCPCSMichael Ghen
 

More from Michael Ghen (6)

Customer segmentation and marketing automation with Apache Unomi
Customer segmentation and marketing automation with Apache UnomiCustomer segmentation and marketing automation with Apache Unomi
Customer segmentation and marketing automation with Apache Unomi
 
Simulating Patient Populations
Simulating Patient PopulationsSimulating Patient Populations
Simulating Patient Populations
 
Big Data Readiness & Business Intelligence Capabilities Matrix
Big Data Readiness & Business Intelligence Capabilities MatrixBig Data Readiness & Business Intelligence Capabilities Matrix
Big Data Readiness & Business Intelligence Capabilities Matrix
 
AWS Machine Learning Workshp
AWS Machine Learning WorkshpAWS Machine Learning Workshp
AWS Machine Learning Workshp
 
Influencer marketing: Buying and Selling Audience Impressions
Influencer marketing: Buying and Selling Audience ImpressionsInfluencer marketing: Buying and Selling Audience Impressions
Influencer marketing: Buying and Selling Audience Impressions
 
Decoding healthcare codes: ICD-10, DRG, CPT, HCPCS
Decoding healthcare codes: ICD-10, DRG, CPT, HCPCSDecoding healthcare codes: ICD-10, DRG, CPT, HCPCS
Decoding healthcare codes: ICD-10, DRG, CPT, HCPCS
 

Recently uploaded

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 

Recently uploaded (20)

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 

Transitioning to a Modern Data Platform with Airflow, BigQuery, and Looker

  • 1. Transitioning to a Modern Data Platform Michael Ghen Benefits Data Trust
  • 2. Big Data Sherpa, exploring the edges of info tech ● B.S. Computer Engineer (‘14) – Penn State ● M.S. Strategic Analytics (‘16) – Brandeis University ● Adjunct Instructor – Saint Joseph's University ● Systems Engineer – Brandeis University ● Data Platform Engineer – Cohealo, Inc. ● Data Science Project Manager – Benefits Data Trust ● Entrepreneurship ● Python ● Computers Academic Professional Personal
  • 3. Overview Transitioning to a Modern Data Platform Data Science at BDT Legacy Data Science Infrastructure Framework for a Modern Data Platform BDT Data Platform Infrastructure BDT Data Platform Operations Questions
  • 4. Benefits Data Trust? BDT seeks to make benefits access more simple, comprehensive, and cost-effective through policy and systems change BDT builds comprehensive and person-centered programs that connect individuals to all the benefits and services they need, while treating each person with dignity and respect
  • 5. Data Science at Benefits Data Trust: data-driven initiatives across multiple states to connect low-income individuals to critical benefit programs Main Things: ● Mail Outreach: Process data needed to run outreach operations ● Outcomes Reporting: Summarize and report impact of our work on the clients and communities we serve ● Internal Reporting: Extract insights from data to empower tactical and strategic decision making
  • 6. Data Science at Benefits Data Trust: data-driven initiatives across multiple states to connect low-income individuals to critical benefit programs Main Things: ● Mail Outreach: Process data needed to run outreach operations ● Outcomes Reporting: Summarize and report impact of our work on the clients and communities we serve ● Internal Reporting: Extract insights from data to empower tactical and strategic decision making New Main Things ● Strategic Data Mining ● A/B Testing ● Machine Learning ● Automation ● Simulation
  • 7. Data Science at Benefits Data Trust: data-driven initiatives across multiple states to connect low-income individuals to critical benefit programs New Programs ● Nudges: Text message and outbound call campaigns to encourage clients to apply or finish applying for benefits ● Software as a Service: Sharing our proprietary software with partners ● Social Determinants of Health: Make sure patients are accessing benefits as part of their healthcare New Main Things ● Strategic Data Mining ● A/B Testing ● Machine Learning ● Automation ● Simulation
  • 8. Challenges with the Legacy System Michael Ghen Benefits Data Trust
  • 9. Legacy Infrastructure: SAS Scripts Export Excel Reports
  • 10.
  • 11. Challenges stem from the tools we are using but also from the processes and culture we operate within Ambiguity ● What outcomes are we concern with? How they are measured? Centralization ● Data Science serves as the single point of contact between everyone and the answers to questions Repetition ● Solution to a problem is recycled or slightly modified; small defects snowball Myopic ● Data Science is formalized as a final step; conversations about the optimal solution are deferred Primitive ● The tactics used, like creating multiple flat files to service requests, are primitive; do not scale well
  • 12. Unreliable: CC Manager finds an error, Analyst runs it down !%#?
  • 13. Confusion: External Relations wonders why the numbers are different than the numbers provided last year, Analyst runs it down !%#?
  • 14. Framework for a Modern Data Platform How do we support the organization's data needs?
  • 15. Data Platforms are composed of technology, process, and people ● Definitions: ● Platform: A platform is a group of technologies, processes, and people that are used as a base upon which other applications, processes or technologies are developed. ● The real challenge was trying to establish a vision for what our infrastructure should look like Data Platform: A group of technologies, processes, and people that are used as the base upon which data-driven application, processes, and technologies are developed upon.
  • 16. Oil Refinery Analogy: Data, Information, Knowledge Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
  • 17. Understand theory and put it into practice Organizational Memory ● Represents an organization's accumulated history, including data, information, and knowledge ● Focuses on the storage of intellectual sources (data, information, and explicit knowledge) in such form that they can later be accessed and used ● Capabilities ● Operational Databases ● Data Lake ● Data Warehouse ● Knowledge Repositories Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
  • 18. Internal Databases, External Data from Partners
  • 19. Understand theory and put it into practice Information Integration ● Represents the ability to link past structured and unstructured content from a variety of sources that comprise organizational memory with the new, real-time, content ● Capabilities ● Integrating External/Internal Structured Data ● Integrating External/Internal Unstructured Data ● Environmental Scanning ● Text/Web Mining Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
  • 20. All information flows into a central location
  • 21. All information flows into a central location
  • 22. Understand theory and put it into practice Insight Creation ● Focuses on the utilization of “raw materials” to produce valuable new insights and enable effective decisions making based on continual rather than periodic analysis. ● Capabilities ● Data Mining ● Business Analytics ● Real-time Decision Support Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
  • 23. Information is integrated and distilled into smaller, subject-specific data sets
  • 24. Understand theory and put it into practice Presentation ● The point of contact between BI and the end user ● Focuses on presenting the appropriate information in a user-friendly fashion based on the user's role, the specific task, and the user's inputs regarding the nature of the presentation ● Capabilities ● Enterprise OLAP ● Visual Analytics ● Performance Dashboards ● Scorecards ● Enterprise Key Performance Indicators Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011
  • 25. End users are supported by the technologies, processes, and people that support the platform
  • 26. Technologies for Creating a Modern Data Platform Business Intelligence Practice
  • 27. Technology Operations ● Workflow Management with Airflow ● Data Warehousing with BigQuery ● Visual Analytics with Looker
  • 29. Apache Airflow manages workflows written in Python ● Why we choose it: ● Workflows are defined as code, are maintainable, versionable, testable ● Uses a python programming model ● How we use it: ● ELT Jobs; Scheduled Reporting ● Application Integrations (e.g. SFTP to Ruby on Rails applications) Airflow is a platform to programmatically author, schedule and monitor workflows Airflow is used to author workflows as directed acyclic graphs (DAGs) of tasks
  • 30. Apache Airflow manages workflows written in Python ● Challenges: ● Deployment, management ● Testing workflows ● Training analysts ● Advantages: ● Everything is done with Python ● Logging, monitoring ● Simple to make improvements to the underlying codebase if needed Airflow is a platform to programmatically author, schedule and monitor workflows Airflow is used to author workflows as directed acyclic graphs (DAGs) of tasks
  • 31. BigQuery allows us to access all our data with SQL ● Why we choose it: ● Fully managed, always available ● Queries are just fast ● Easy import/export ● Dirt Cheap ● How we use it: ● Data Warehousing ● All data flows into BigQuery, all reports based on BigQuery ● Extract, Load, Transform BigQuery is Google's serverless, highly scalable, low cost enterprise data warehouse designed to make our data scientists productive
  • 32. BigQuery allows us to access all our data with SQL ● Challenges: ● Still requires data that is clean and structured before import ● Does not support transactions ● Requires strong SQL skills ● Advantages: ● No database administration ● No servers ● Leverages SQL ● Great documentation ● Strong Software Development Kits BigQuery is Google's serverless, highly scalable, low cost enterprise data warehouse designed to make our data scientists productive
  • 33. Looker provides control over how we are reporting ● Why we choose it: ● Data model as code, version control, transparency with respect to data model/etl changes ● Centralized data definitions ● Usage tracking ● How we use it: ● Dashboards ● Internal and external reporting ● Embedded reporting Looker is a business intelligence software and big data analytics platform that helps you explore, analyze and share real-time business analytics easily.
  • 34. Looker provides control over how we are reporting ● Challenges: ● Server-side/in-memory transformations, slow queries ● Setting up drill-down ● Advantages: ● Easy to troubleshoot (everything is SQL) ● Dimensions and measures are centralized, sharable ● We like to code, we like version control Looker is a business intelligence software and big data analytics platform that helps you explore, analyze and share real-time business analytics easily.
  • 36. Things got worse before they got better ● Departments need to own some of the analysts' tasks ○ Data Science needs to provide the tooling to support self-services ● Departments need data literacy and technical training ○ Data Science needs to be the experts supporting this ● IT needs to provide the infrastructure to support analytics technologies ○ Data Science needs to have the expertise to run these technologies ● Programs need to provide clarity about what they will report on ○ Data Science needs to seek clarity before building reporting
  • 37. Thank you! Business intelligence: practices, technologies, and management, Rajiv Sabherwal - Irma Becerra-Fernandez - Wiley - 2011