Confidential and Proprietary to Daugherty Business Solutions
Retooling on the Modern Data and Analytics Stack
2/2020
Confidential and Proprietary to Daugherty Business Solutions 2
What is the Modern Tech Stack?
The tools and technologies needed to solve problems that are difficult due to their size, speed, and complexity.
Confidential and Proprietary to Daugherty Business Solutions 3
Competencies
Information Management
Data Solutions
Modern Data Architectures
Data Science
Data Governance
Confidential and Proprietary to Daugherty Business Solutions 4
Information Management
Data loading
Data modeling
Querying
Confidential and Proprietary to Daugherty Business Solutions 5
IM Focus: NoSQL
Confidential and Proprietary to Daugherty Business Solutions 6
IM Focus: Platforms & Services
Confidential and Proprietary to Daugherty Business Solutions 7
IM Focus: Serialization
JSON
Confidential and Proprietary to Daugherty Business Solutions 8
Data Solutions
Tell me a story…
Confidential and Proprietary to Daugherty Business Solutions 9
Data Solutions Focus
Profiling
Sampling
Aggregation
Confidential and Proprietary to Daugherty Business Solutions 10
Data Governance
Governance outputs are eternal.
Confidential and Proprietary to Daugherty Business Solutions 11
Focus On…
Scale
Confidential and Proprietary to Daugherty Business Solutions 12
Data Engineering
Data Science
Decision Science
Confidential and Proprietary to Daugherty Business Solutions 13
Data Science
Confidential and Proprietary to Daugherty Business Solutions 14
Focus On…
Confidential and Proprietary to Daugherty Business Solutions 15
Modern Data Architecture
Programmatic data manipulation
Confidential and Proprietary to Daugherty Business Solutions 16
Cloud
Confidential and Proprietary to Daugherty Business Solutions 17
Big Data
Confidential and Proprietary to Daugherty Business Solutions 18
Streaming
Kafka for Publish/Subscribe
KSQL – Kafka + SQL
Debezium – Change Data Capture
Confidential and Proprietary to Daugherty Business Solutions 19
Data Engineering
https://www.logicalclocks.com/blog/feature-store-the-missing-data-layer-in-ml-pipelines
Confidential and Proprietary to Daugherty Business Solutions 20
Focus on…
Confidential and Proprietary to Daugherty Business Solutions 21
Five Steps to Retooling
Awareness
Exposure
Guided Practice
Evolving Practice
Growing Expertise
Confidential and Proprietary to Daugherty Business Solutions
AWARENESS
AWARE-ISH
Confidential and Proprietary to Daugherty Business Solutions
AWARENESS
Podcasts
Data science / advanced tech groups
Major tech companies
Confidential and Proprietary to Daugherty Business Solutions
EXPOSURE
Use case studies
Blogs
Try-it-for-free
Confidential and Proprietary to Daugherty Business Solutions
GUIDED PRACTICE
Online courses
Free AWS and Azure accounts
Open-source downloads
Confidential and Proprietary to Daugherty Business Solutions
EVOLVING PRACTICE
1. Pick something familiar
2. Make it a little strange
3. Rinse & repeat
Confidential and Proprietary to Daugherty Business Solutions
[Diagram: five-step pipeline: Ingest People Data → Clean People Data → Find Match Candidates → Pick Best Candidate → Save Results. Every step runs as Python (Local); match data lives in MySQL (Local); input and output are local files.]
Confidential and Proprietary to Daugherty Business Solutions
[Diagram: the same pipeline; every step still Python (Local) with MySQL (Local), but the input and output files now live in AWS S3.]
Confidential and Proprietary to Daugherty Business Solutions
[Diagram: the same pipeline; steps still Python (Local), input and output in AWS S3, and the match data moved from local MySQL to AWS RDS.]
Confidential and Proprietary to Daugherty Business Solutions
[Diagram: the same pipeline; each of the five steps is now a Python Lambda, with input and output in AWS S3 and match data in AWS RDS.]
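The deck doesn't include code for these steps, but a single step rewritten as a Python Lambda might look roughly like the sketch below. The event shape, bucket layout, and cleaning logic are illustrative assumptions, not the original code.

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The caller (a test event, or later a Step Function state) passes
    # the S3 location of this step's input.
    bucket = event["bucket"]   # hypothetical event shape
    key = event["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    rows = json.loads(obj["Body"].read())

    # Clean People Data: trim and lowercase every field.
    cleaned = [{k: str(v).strip().lower() for k, v in r.items()} for r in rows]

    out_key = key.replace("raw/", "clean/")
    s3.put_object(Bucket=bucket, Key=out_key,
                  Body=json.dumps(cleaned).encode("utf-8"))

    # Return the output location so the next step can consume it.
    return {"bucket": bucket, "key": out_key}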
Confidential and Proprietary to Daugherty Business Solutions
[Diagram: the same pipeline of five Python Lambdas, now orchestrated by an AWS Step Function; input and output in AWS S3, match data in AWS RDS.]
Confidential and Proprietary to Daugherty Business Solutions
GROWING EXPERTISE
1. Add new features
Confidential and Proprietary to Daugherty Business Solutions
[Diagram, "Add New Features": the Step Function pipeline of five Python Lambdas, with DynamoDB added to the architecture; input and output in AWS S3, match data in AWS RDS.]
Confidential and Proprietary to Daugherty Business Solutions
GROWING EXPERTISE
1. Add new features
2. Improve scalability
Confidential and Proprietary to Daugherty Business Solutions
[Diagram, "Improve Scalability": the first and last steps remain Python Lambdas while the three middle steps move to ECS Tasks; AWS S3, AWS RDS, and DynamoDB as before, all orchestrated by the AWS Step Function.]
Confidential and Proprietary to Daugherty Business Solutions
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
Confidential and Proprietary to Daugherty Business Solutions
[Diagram, "Improve Performance": the matching workload moves from an ECS Task to EMR running Spark (Scala); the remaining steps stay on ECS Tasks and Python Lambdas, with AWS S3 and DynamoDB, orchestrated by the AWS Step Function.]
Confidential and Proprietary to Daugherty Business Solutions
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
4. Batch vs. stream
Confidential and Proprietary to Daugherty Business Solutions
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
4. Batch vs. stream
5. Automation
Confidential and Proprietary to Daugherty Business Solutions 40
Conclusion
Don’t do too much at once!
Confidential and Proprietary to Daugherty Business Solutions 41
Questions?
Confidential and Proprietary to Daugherty Business Solutions
Resources
General
• https://www.analyticsvidhya.com/blog/2018/11/data-engineer-comprehensive-list-resources-get-started/
• https://towardsdatascience.com/who-is-a-data-engineer-how-to-become-a-data-engineer-1167ddc12811
• https://www.dataquest.io/path/data-engineer/
• https://dataengweekly.com/
Podcasts
• https://towardsdatascience.com/our-podcast-c5c1129bc5cf
• https://www.stitcher.com/podcast/httpanalyticshourlibsyncom/the-digital-analytics-power-hour
• https://www.stitcher.com/podcast/data-stories-podcast/data-stories
• https://www.stitcher.com/podcast/data-skeptic-podcast/the-data-skeptic-podcast (Data Science focused)
• https://www.stitcher.com/podcast/oreilly-media-2/the-oreilly-data-show-podcast?refid=stpr
• https://www.dataengineeringpodcast.com/
Reference Architectures
• https://medium.com/refraction-tech-everything/how-netflix-works-the-hugely-simplified-complex-stuff-that-happens-every-time-you-hit-play-3a40c9be254b
• http://highscalability.com/blog/2015/11/9/a-360-degree-view-of-the-entire-netflix-stack.html (older but interesting)
• https://medium.com/airbnb-engineering/airbnb-engineering-infrastructure/home
• https://towardsdatascience.com/how-linkedin-uber-lyft-airbnb-and-netflix-are-solving-data-management-and-discovery-for-machine-9b79ee9184bb
Confidential and Proprietary to Daugherty Business Solutions
Resources – continued
Use Cases
• https://www.mongodb.com/use-cases
• https://www.confluent.io/blog/category/use-cases/
• https://kafka.apache.org/uses
• https://aws.amazon.com/big-data/use-cases/
• https://www.dataversity.net/eight-big-data-analytics-options-on-microsoft-azure/
• https://www.toptal.com/spark/introduction-to-apache-spark
Try it for Free
• https://neo4j.com/sandbox/ (Neo4J)
• https://www.mongodb.com/cloud/atlas/lp/general/try (MongoDB)
• https://www.postman.com/ + https://www.guru99.com/postman-tutorial.html (trying out APIs)
• https://databricks.com/try-databricks (Spark)
• https://jupyter.org/try
Confidential and Proprietary to Daugherty Business Solutions
Resources – continued
Open Source Downloads + Guides
• https://spark.apache.org/docs/latest/index.html
• https://kafka.apache.org/documentation/#gettingStarted
• https://www.mongodb.com/download-center/community
• https://www.python.org/about/gettingstarted/
Free Cloud Accounts
• https://aws.amazon.com/free/
• https://azure.microsoft.com/en-us/free/
Online Training
• www.acloud.guru *Recommended
• www.coursera.org
• www.udemy.com
Editor's Notes
  1. The modern tech stack is the set of tools and technologies needed to solve problems that are difficult due to their size, speed, and complexity.
  2. That set of tools and technologies is broad. To make it more manageable, it’s useful to divide it into areas of focus, or competencies: Information Management, Data Solutions, Modern Data Architecture, Data Science, and Data Governance.
  3. Information management is the competency that deals with databases: loading, modeling, storing, querying. SQL is eternal. As you’ll see, many of the more modern technologies expose SQL interfaces.
  4. Traditional relational databases work well when your data is predictable and fits well into tables, columns, and rows, and where queries are not very join-intensive. But if your data is not predictable or structured, or when it is highly connected or you need lightning-fast performance, you may consider a NoSQL database. NoSQL databases are an important part of the modern Information Management landscape. They fall into roughly four categories: Key-Value, Columnar, Document, and Graph. It’s a good idea to have a high-level understanding of each kind of NoSQL database, and to know the use cases for each. Broadly speaking: Key-value stores are great for caching. Columnar stores such as Cassandra are great when you’re dealing with big, big data. Document databases do a great job with storing semi-structured data. Graph databases are suitable when you are dealing with a rich, highly-connected data domain.
  5. Platforms such as Snowflake DB provide “warehouse as a service” and accommodate both structured data from relational sources and semi-structured data. Open-source search tools (or services) such as ElasticSearch, and search data ingestion tools such as Logstash, allow you to ingest and search data in almost any format. These are available as open-source downloads for you to deploy, as cloud deployments, and as SaaS.
  6. Data serialization is the process of converting structured data to a format that can be shared or stored and later recovered in its original structure. In some cases, a secondary goal of serialization is to minimize the data’s size, which reduces disk space or bandwidth requirements. Serialization converts data objects into a byte stream for storage, transfer, and distribution across physical devices. Computer systems vary in their hardware architecture, OS, and addressing mechanisms, and internal binary representations of data vary accordingly in every environment. Storing and exchanging data between such varying environments requires a platform- and language-neutral data format that all systems understand. The choice of serialization format for an application depends on factors such as data complexity, the need for human readability, speed, and storage space constraints. Three common formats are Avro, JSON, and Parquet. Of these three, only JSON is human-readable. The biggest difference between Avro and Parquet is how they store the data: Parquet stores data in columns, while Avro stores data in a row-based format. Column-oriented data stores are optimized for read-heavy analytical workloads, while row-based formats are best for write-heavy transactional workloads.
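To make the row-oriented vs. column-oriented distinction concrete, here is a minimal Python sketch, assuming pandas and pyarrow are installed; the records and file names are hypothetical, and Avro is omitted because it needs a third-party library such as fastavro.

import json
import pandas as pd

records = [
    {"id": 1, "name": "Ada", "score": 0.91},
    {"id": 2, "name": "Grace", "score": 0.87},
]

# JSON: human-readable, row-based; good for interchange and debugging.
with open("people.json", "w") as f:
    json.dump(records, f)

# Parquet: binary, column-oriented; good for read-heavy analytics.
pd.DataFrame(records).to_parquet("people.parquet")  # pyarrow under the hood

# Both round-trip back to the original structure.
print(json.load(open("people.json")))
print(pd.read_parquet("people.parquet"))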
  7. Data Solutions is the competency that deals with the visual expression of data. Dashboard and Infographic development would fall into this competency. The real skill is story telling with data, or unearthing information hidden in all of the data. New tools in this space include Tableau, PowerBI, and Google Looker.
  8. As the size of datasets has exploded, the challenge of telling a story with those datasets has become more complex. A modern practitioner must master the skills of data profiling, sampling, and aggregation. Data profiling simply helps the visualization expert understand the dataset. Tools such as Talend and Informatica come with data profiling capabilities. An understanding of sampling strategies, such as probability-based vs. non-probability-based sampling, and when to use each, is an important skill for data visualization. (https://towardsdatascience.com/sampling-techniques-a4e34111d808) This is a crucial step, since the accuracy of insights from data analysis depends heavily on the amount and quality of data used. It is important to gather high-quality, accurate data, and a large enough amount to create relevant results. Finally, what aggregations are meaningful to your industry and/or project? What aggregations might obfuscate vs. illustrate? How important are outliers? (Health Effects Story – Susan)
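A minimal pandas sketch of the three skills in sequence; the file name and the "region" and "value" columns are hypothetical.

import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical input

# Profiling: understand the shape and quality of the data first.
print(df.describe(include="all"))
print(df.isna().mean())  # fraction of missing values per column

# Sampling: a simple probability-based sample; fix the seed for reproducibility.
sample = df.sample(frac=0.10, random_state=42)

# Aggregation: choose groupings that illustrate rather than obfuscate,
# and decide deliberately how to treat outliers.
print(sample.groupby("region")["value"].agg(["mean", "median", "count"]))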
  9. Data Governance is the competency that changes least in terms of outputs but most in terms of scale.
  10. Cheap storage enables sophisticated data lineage storage and supports data stewardship at scale.
  11. Data Engineering combines with Data Science to form Decision Science.
  12. Data Science is the competency that garners the most hype: Rumplestiltskin-ing data into insights. Advanced analytics. Tools are commodifying data science to a degree.
  13. Machine learning is not just running Data Robot and picking the winner. Operationalizing data science requires foundational math and statistical knowledge.
  14. SUSAN: Modern Data Architecture is the competency most directly associated with the modern tech stack. MDA is broad: it includes Cloud, Big Data, Streaming, and Data Engineering. At its heart, MDA is the practice of accomplishing business tasks using programmatic manipulations of the data.
  15. The major cloud platform players continue to be Amazon with AWS, Microsoft with Azure, and Google Cloud. In November 2019, a Goldman Sachs report concluded that AWS, Azure, Google Cloud, and Alibaba Cloud made up 56% of the total cloud market, with that projected to grow to 84% in 2020. The report shows AWS in the considerable lead with 47% of the market projected for this year, with Azure and Google trailing at 22% and 8% market share, respectively. With that said, it will be interesting to see how the actual numbers play out, especially as Google is positioning itself for multi-cloud support, and Azure shows aggressive growth rates. The number of services available in each cloud platform is rapidly expanding, and it is impossible to be familiar with them all. AWS offers over 190 services at this time. I would recommend starting with the foundational services: for AWS that would be EC2 (virtual instances); S3 (object storage); Virtual Private Cloud (setting up your private network in the cloud); and IAM (identity and access management). Take a look at the core services of one platform, then port that knowledge to the others. They all contain roughly the same capabilities, at least at this point in time, just packaged differently and aimed at users of different experience levels.
  16. So you may wonder: is Hadoop dead? Not entirely, but it is steadily in decline. It is down to one major vendor: Cloudera. Hadoop has serious issues processing smaller datasets, has security weaknesses, and is limited to batch processing. It also requires mid-level programming skills. More and more, programmers are finding workarounds or fixes to Hadoop’s problems of security and medium-skill programming. For instance, new tools speed up Hadoop’s MapReduce functionality: Apache Spark processes data up to 100 times faster, provides APIs for Python, Scala, Java, and R as well as a SQL interface, and also supports streaming. Another popular option is Kubernetes, which clusters containers across public, private, and hybrid clouds. The open-source container orchestration technology is picking up major traction as developers overwhelmingly embrace container technology. Kubernetes’ speed offers near real-time data analysis, something that Hadoop and MapReduce just can’t offer. A comparison of Google search results indicates that Kubernetes is on the rise just as sharply as Hadoop is in decline. https://trends.google.com/trends/explore?date=all&geo=US&q=hadoop,kubernetes NOTE to self: Remember that Hadoop includes a number of components: the Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster; YARN, a scheduler that coordinates application runtimes; and MapReduce, the algorithm that actually processes the data in parallel. Hadoop also includes Hive, a SQL-like interface allowing users to run queries on HDFS.
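As a taste of that lower barrier to entry, a minimal PySpark sketch follows; the S3 path and user_id column are hypothetical, and the same query could be written with the DataFrame API instead of SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Spark parallelizes the read across the cluster.
df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path

# The same engine exposes a SQL interface alongside the Python API.
df.createOrReplaceTempView("events")
spark.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""").show()

spark.stop()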
  17. The ability to perform near real-time analytics is now a basic enterprise expectation, and that often involves data streaming. One extremely popular technology for near real-time data transfer is Kafka. Like many of the technologies mentioned here today, it started as an Apache project. It is so broadly used that it really dominates messaging and pub-sub, so it’s a good technology to understand. And also like many of the technologies mentioned here today, as it has gained in popularity a SQL-like interface has been exposed to lower the barrier to entry: KSQL is the open-source streaming SQL engine on top of Apache Kafka. Another technology that is actually built on top of Kafka is Debezium, an open-source, distributed platform for change data capture.
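A minimal publish/subscribe sketch using the kafka-python client, assuming a broker on localhost:9092; the topic name and payload are hypothetical.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish JSON events to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("people-events", {"id": 1, "name": "Ada"})
producer.flush()

# Consumer: subscribe and read from the beginning of the topic.
consumer = KafkaConsumer(
    "people-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # demonstrate a single record, then stop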
  18. Data Engineering is a relatively new area within Modern Data Architectures. Data Engineers are responsible for the creation and maintenance of analytics infrastructures that enable other functions, commonly (but not only) Data Science. Data Engineers use many of the technologies we’ve just discussed. One notable difference between data engineering and traditional information management is that data engineering manages the creation and maintenance of analytics infrastructures within a software engineering framework. Two common programming languages used are Python and Scala, with Python having a considerably lower barrier to entry. Another difference is the kind of work that data engineering enables. Data engineers enable the delivery of machine learning solutions in production and at large scale. They work to reduce model training time and infrastructure costs, and to make it easier and cheaper to build new models. One way to understand how data engineering approaches problem solving is to look at the Feature Store. The feature store is a central place to store curated features within an organization. A feature is a measurable property of some data sample. Features can be extracted directly from files and database tables, or can be derived values, computed from one or more data sources. Importantly, the feature store holds data in a format that is understandable for predictive models. Data engineers expose APIs for reading, searching, and adding features. Data scientists can search for features and use them to build models with minimal data engineering. In addition, features can be cached and reused by other models, reducing model training time and infrastructure costs. Features become a managed, governed asset in the enterprise.
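To make the add/search/read API concrete, here is a toy in-memory sketch; the class and method names are invented for illustration, and a real feature store adds persistence, versioning, lineage, and access control.

from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    # name -> {"description": str, "values": {entity_id: value}}
    _features: dict = field(default_factory=dict)

    def add(self, name, description, values):
        # Register a curated feature for reuse across models.
        self._features[name] = {"description": description, "values": values}

    def search(self, term):
        return [n for n, f in self._features.items()
                if term in n or term in f["description"]]

    def read(self, name, entity_id):
        return self._features[name]["values"][entity_id]

store = FeatureStore()
store.add("days_since_signup", "Days since the person signed up", {101: 42})
print(store.search("signup"))                 # ['days_since_signup']
print(store.read("days_since_signup", 101))   # 42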
  19. For MDA, focus on acquiring skills with Python, Spark, Kafka, containerization (whether that’s Docker or Kubernetes), and at least one cloud platform. If you have an interest in data engineering, understanding software engineering practices is also helpful.
  20. Given the modern data and analytics landscape, how do you retool yourself? Retooling is a five-step process: Awareness, Exposure, Guided Practice, Evolving Practice, Growing Expertise. Awareness is the first step. You can’t learn something that you don’t know that you don’t know. Unconscious incompetence. Awareness can be achieved by reading up on the state of the art in modern data. Recommendations: Data Engineering Weekly, the Data Engineering Podcast, meetups. Exposure is learning about the tool. Conscious incompetence. What does it do? What problems does it solve? Does it seem like the kind of thing that you want to learn? That you need to learn? Reviews, positive and negative. Guided Practice. Conscious competence. Color-by-numbers exercises. Tutorials. Get someone to help you. Consulting? Evolving Practice. Start with a small problem, possibly something you’ve solved before. Point a bazooka at an ant hill. Scott’s CD organizer. Retrospective. Then a more complex problem. POC. Growing Expertise. Unconscious competence. Start on eventual development. Start probing for the edges of the technology. Not every tool is appropriate for every problem.
  21. As Shakespeare would say, I had awareness thrust upon me. I was moved to a new team that would be using all new tech.
  22. So if you don’t have awareness thrust upon you, or if you want to find ways of increasing your awareness, what do you do? There is an EXPLOSION of new tech out there. It can be overwhelming. If you want to narrow it down a bit and figure out where to concentrate, I recommend: Listening to Podcasts: these often feature new technology that “has legs” If your organization has an advanced technology or data science group, ask and see what they are using or looking at using Read up on some of the big companies (Amazon, Netflix, AirBnB). There are a lot of interesting articles out there on their data architectures.
  23. The team had already made a few tech selections: Kafka, Neo4J, Spark Streaming. They had an idea for how these would be used together, and I began to understand – at a high level – what the capabilities of these technologies were.
TIP: I also read a fair number of use cases on these technologies. Do this before deciding where to invest your time or money.
TIP: Many use cases are in blogs.
TIP: Books are expensive and become outdated quickly.
TIP: There are LOTS of try-it-for-frees… these are also useful for the next step: Guided Practice.
  24. Many of the technologies in use today are open-source and have a low barrier to entry. I was able to download Kafka and follow the getting-started guide to get everything working on my local machine – that worked great because it’s open-source and the guide is right on the site. Through the course of working through these guides, I also got an understanding of JSON and Avro serialization. Much of the modern tech stack is either 1) available for free or 2) in the cloud, where you can sign up for a free account – and there are getting-started guides for just about any technology.
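As a flavor of what that local getting-started exercise looks like, here is a minimal JSON produce/consume round trip in Python. It assumes a local broker on localhost:9092 and the kafka-python package; the topic name and record contents are made up:

```python
# Minimal Kafka round trip, assuming a broker on localhost:9092 and
# kafka-python installed (pip install kafka-python).
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # JSON on the wire
)
producer.send("people", {"name": "Ada Lovelace", "city": "London"})
producer.flush()

consumer = KafkaConsumer(
    "people",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```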
  25. I was able to get a fair amount of exposure to Kafka, RESTful services, and Neo4J, but I didn’t start getting any depth until I moved to a position at a new company. In this position, I would be building data pipelines, and all the tech would be cloud-native – specifically AWS. The key here is that old saying: Jack of All Trades, Master of None. Don’t get hung up on mastery unless you are in a position where you can really afford to pursue it; much of the power of the modern tech stack is only realized when you use these tools in conjunction with one another. I decided that in order to do this job, I was going to need to really understand the tech I was using. But it was going to be overwhelming to build it all in a new environment, and I am also someone who works better within a known framework or frame of reference. So here’s what I did: I picked a data pipeline that was familiar to me and decided to build it on my own, using only one new piece of tech at a time. That would give me a little depth of practice while building out the larger picture.
  26. I picked a simple pipeline: ingest a file with some people data, clean that people data, match it against some known people, pick the best match for each input, and save the results. I knew that picking the best “match” for a person was going to be a feature our data scientists would want to use. In the past I had done this sort of thing with a basic ETL process developed using ETL tools. Data would be ingested, staged in tables, joined to other tables, etc. But one of the key features of the modern tech stack is a movement away from strictly SQL-based ETL, and a move towards code-based ETL. More specifically, JVM-based ETL (for performance reasons on large datasets). I knew that Python would be a valuable skill going forward and had a lower barrier to entry, so instead of going for a JVM-based language I started with Python. I had some experience with Python from tooling around with it at home. I wrote very simple Python scripts that did a bare minimum of each step, one calling the other. This got me experience with manipulating data in Python, and also introduced me to a variety of Python libraries.
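Here is a toy version of those five steps in Python. The matching heuristic (exact-name candidates, same-city tiebreak) is deliberately simplistic and illustrative only, and the column names are assumptions:

```python
# Toy sketch of the five pipeline steps; the matching logic is
# illustrative, not the real matching algorithm.
import csv

def ingest(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean(people):
    # Normalize whitespace and case on every field.
    return [{k: v.strip().lower() for k, v in p.items()} for p in people]

def find_candidates(person, known_people):
    return [k for k in known_people if k["name"] == person["name"]]

def pick_best(person, candidates):
    # Trivial scoring: prefer candidates in the same city.
    same_city = [c for c in candidates if c.get("city") == person.get("city")]
    return (same_city or candidates or [None])[0]

def save(results, path):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "city"])
        writer.writeheader()
        writer.writerows(r for r in results if r)
```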
  27. Next, I decided to replace the local input and output files with files read from and written to Amazon’s Simple Storage Service, or S3. S3 is a cornerstone AWS service offering, and it is basically an object store. By deciding to read from and write to S3, I got experience using not just S3, but also the AWS APIs for interacting with S3 (the boto3 library for Python). Additionally, it got me accustomed to working with AWS security credentials, understanding how role-based security works in AWS Identity and Access Management (IAM), and setting security policies on S3. TIP: Don’t skimp on security. Consider it from the beginning. You can’t afford to put it in last.
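A sketch of what that switch looks like with boto3 follows. The bucket and key names are placeholders, and credentials are assumed to come from the environment or an IAM role rather than being hard-coded:

```python
# Reading the input file from S3 and writing results back with boto3.
# Bucket/key names are placeholders; credentials come from the
# environment or an IAM role.
import boto3

s3 = boto3.client("s3")

# Download the raw people file.
obj = s3.get_object(Bucket="my-pipeline-bucket", Key="incoming/people.csv")
raw_bytes = obj["Body"].read()

# ... clean / match / pick best ...

# Upload the results.
s3.put_object(
    Bucket="my-pipeline-bucket",
    Key="results/matches.csv",
    Body=b"name,city\nada lovelace,london\n",
)
```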
  28. Next I decided to replace my local MySQL database used for lookups with an AWS Relational Database Service (RDS) instance, PostgreSQL flavor. This got me experience creating RDS instances, accessing them, setting up security on them, and accessing data via the AWS APIs.
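The lookup code itself barely changes – you point a PostgreSQL driver at the RDS endpoint. A minimal sketch, assuming the psycopg2 driver; every connection detail below is a placeholder:

```python
# Looking up known people in an RDS PostgreSQL instance instead of a
# local MySQL database. All connection details are placeholders; in
# practice the password should come from a secrets manager.
import psycopg2

conn = psycopg2.connect(
    host="mypipeline.abc123.us-east-1.rds.amazonaws.com",  # RDS endpoint
    dbname="people",
    user="pipeline_user",
    password="read-from-secrets-manager-not-source",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT name, city FROM known_people WHERE name = %s",
        ("ada lovelace",),
    )
    candidates = cur.fetchall()
```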
  29. Next I took my Python code and used it to create AWS Lambdas – small, serverless functions you can deploy in AWS in a variety of languages. Lambdas can be easily triggered by an object landing in an S3 bucket, but they can only run for 15 minutes. This got me experience with event triggering and with Lambda creation and deployment.
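The skeleton of an S3-triggered Lambda is small. The event shape below is the standard S3 notification format; the processing step is a placeholder for one piece of the pipeline:

```python
# Skeleton Lambda handler fired by an S3 "object created" event.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        # Standard S3 notification event shape.
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        data = obj["Body"].read()
        # ... run one pipeline step (e.g. cleaning) on `data` ...
    return {"status": "ok"}
```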
  30. All my components were now in AWS, but I was having to manually run each Lambda function. The last piece to consider was orchestration. I opted to use an AWS Step Function as it allowed me to orchestrate my workflow, as well as to maintain a state machine that would catch errors.
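Step Functions workflows are defined in the Amazon States Language; here is the general shape of the orchestration, sketched as a Python dict. The Lambda ARNs are placeholders, and the Catch clauses route any failure to a single error-handling state:

```python
# Amazon States Language definition for the pipeline, held in a Python
# dict. Lambda ARNs are placeholders.
state_machine = {
    "StartAt": "CleanPeopleData",
    "States": {
        "CleanPeopleData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:clean",
            "Next": "FindMatchCandidates",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
        },
        "FindMatchCandidates": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:match",
            "Next": "PickBestCandidate",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
        },
        "PickBestCandidate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:pick",
            "End": True,
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
        },
        "HandleFailure": {"Type": "Fail", "Cause": "Pipeline step failed"},
    },
}
```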
  31. In order to grow your expertise, you can take your MVP and build upon it. Consider adding new features. New features may introduce new tools, or they may get you more experience with the tools you already use.
  32. For example, I decided to make the format of the input files configurable and used a DynamoDB key/value store in AWS to keep track of my configuration data.
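The configuration lookup is a small amount of code. A sketch with boto3, where the table name and attribute names are placeholders:

```python
# Fetching a per-source file format from a DynamoDB table; the table
# and attribute names are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("pipeline-config")

item = config_table.get_item(Key={"source": "people-feed"})
file_format = item.get("Item", {}).get("format", "csv")  # default to CSV
```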
  33. You can also grow your expertise by subjecting your MVP to increasing scale, or by deciding you want to support increasing scalability.
  34. For example, AWS Lambdas can only run for 15 minutes. If I wanted to clean, find candidate matches for, and pick the best candidate for a number of inputs beyond, say, 30K, I was going to have to find ways to make my processes run longer. So I containerized my Python code (Docker!), pushed my containers to Amazon’s container registry (ECR!), and set up an ECS cluster where the containers could run. By doing cleansing, candidate generation, and candidate selection in ECS, I could support longer-running processes on larger input files.
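Containerizing one pipeline step can be as simple as a few-line Dockerfile. This is a minimal sketch; the base image, file names, and entrypoint are all illustrative:

```dockerfile
# Minimal container for one pipeline step; file names and entrypoint
# are placeholders.
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY clean_people.py .
CMD ["python", "clean_people.py"]
```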
  35. So it’s great that my processes could now run longer on larger inputs, but I really wanted my slowest tasks to become more performant, and I was limited by doing lookups against an RDS instance. What if I could parallelize the candidate generation process by using Spark?
  36. I looked at spinning up an AWS Elastic MapReduce (EMR, AWS’s managed Hadoop solution) cluster and submitting candidate-match selection jobs to it. I knew that Scala was more performant with Spark than Python, so I wrote a matching app using Scala. I didn’t even end up using it! But the experience I got using Spark, Scala, and EMR was put to good use on other data pipelines.
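The actual app was Scala, but to stay consistent with the other examples, here is the same candidate-generation idea sketched in PySpark. Paths and column names are placeholders, and the scoring is the same toy same-city heuristic as before:

```python
# Parallelized candidate matching sketched in PySpark (the real app was
# Scala). Paths, column names, and scoring are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("candidate-matching").getOrCreate()

people = spark.read.csv("s3://my-pipeline-bucket/incoming/people.csv", header=True)
known = spark.read.csv("s3://my-pipeline-bucket/reference/known_people.csv", header=True)

# Join on name to generate candidates, then prefer same-city matches.
candidates = people.join(known.withColumnRenamed("city", "known_city"), on="name")
scored = candidates.withColumn(
    "score", F.when(F.col("city") == F.col("known_city"), 1).otherwise(0)
)

# Keep the highest-scoring candidate per person.
w = Window.partitionBy("name").orderBy(F.col("score").desc())
best = (
    scored.withColumn("rank", F.row_number().over(w))
    .filter("rank = 1")
    .drop("rank")
)

best.write.csv("s3://my-pipeline-bucket/results/matches", header=True)
```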
  37. Where would I go next? What if I wanted to stream in records and match them as they stream in, rather than waiting for a file to land? How would that affect my solution? Are any of my components re-usable? How could I re-architect to make them re-usable? At this point, this is a thought exercise.
  38. Introducing automation such as test automation, continuous integration and deployment pipelines (Jenkins, Groovy), and infrastructure as code (Terraform) can also expose you to a wider variety of tools and technologies, as well as deepen your knowledge and discipline.
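Test automation is the cheapest place to start. A minimal pytest sketch against the hypothetical cleaning function from the earlier pipeline example – the module and function names are assumptions:

```python
# pytest sketch for the cleaning step; `clean` is the hypothetical
# function from the earlier pipeline sketch (clean_people.py).
from clean_people import clean

def test_clean_lowercases_and_strips():
    raw = [{"name": "  Ada Lovelace ", "city": "LONDON"}]
    assert clean(raw) == [{"name": "ada lovelace", "city": "london"}]
```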
  39. Conclusion – Boiling the Ocean. Choose tools that you are going to get to use and that you want to use. Concentrate all firepower on the Super Star Destroyer. Don’t try to do too much at once.