The Simple 5-Step Process for Creating a
Winning Data Pipeline
In 2009, Hal Varian, Google's Chief Economist, offered a prescient insight: the
ability to harvest massive amounts of data and extract value from it would change
modern business. He was right. Today, data science helps create machine learning
algorithms that solve business problems and support decision-making. For data
practitioners, it all starts with creating data pipelines. This article is a primer on
these building blocks of data science.
Running a business in the 21st century makes hiring a data scientist all but
inevitable. If some business owners do not yet feel this need, the newness of the field
may be the reason: data science was introduced only in 2001, by William S. Cleveland,
as an extension of statistics.
The benefits of data science are many, and they are made possible by data pipeline
architecture, an important part of data and analytics.
In this article, we discuss what data pipelines are, why they are needed, types of data
passing through them, how to create a data pipeline, and the roles played by data
engineers and data scientists in their making.
Data pipelines: A brief overview
A data pipeline is a series of steps that move raw data from a source to a
destination, where it can be handled and consumed. It is the sum of the processes and
tools that enable data integration. In business intelligence, the source can be a
transactional database, and the destination is usually a data warehouse or a data
lake. The destination is the platform where data is analyzed to produce business
insights.
Two major benefits of data pipelines are:
• They consolidate data from various disparate sources into a single
common destination. This helps in quick data analysis for the purpose of
finding business insights.
• They also ensure consistency in data quality, which is critical for gaining
reliable insights.
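The consolidation benefit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample sources (`CRM_CSV`, `WEB_JSON`), their field names, and the common schema are all hypothetical, and a real pipeline would read from files, APIs, or databases rather than inline strings.

```python
import csv
import io
import json

# Hypothetical sample sources; real pipelines would read from files, APIs, or databases.
CRM_CSV = "customer_id,email\n1,ana@example.com\n2,bo@example.com\n"
WEB_JSON = '[{"id": 3, "contact": {"email": "cy@example.com"}}]'


def extract_csv(text):
    """Read CSV rows and map them onto the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"customer_id": int(row["customer_id"]), "email": row["email"], "source": "crm"}


def extract_json(text):
    """Read JSON records and map them onto the same schema."""
    for rec in json.loads(text):
        yield {"customer_id": rec["id"], "email": rec["contact"]["email"], "source": "web"}


def consolidate(*sources):
    """Merge all sources into one destination list, keyed by customer_id."""
    destination = {}
    for source in sources:
        for record in source:
            destination[record["customer_id"]] = record
    return [destination[key] for key in sorted(destination)]


warehouse = consolidate(extract_csv(CRM_CSV), extract_json(WEB_JSON))
```

Because every extractor emits the same record shape, anything downstream only has to understand one format, which is where the consistency benefit comes from.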
In a broader sense, two types of data pass through a data pipeline:
Structured data: This category of data can be stored and retrieved in a fixed format.
It includes email addresses, device-specific statistics, phone numbers,
locations, IP addresses, and banking information.
Unstructured data: This category of data is difficult to fit into a fixed format.
It includes social media content, email content, images, mobile phone
searches, and online reviews.
A typical data pipeline
Stages in Data Pipeline
1. Obtaining your data
Data is of paramount importance in data science; without it, there is nothing to
analyze. The first and most crucial step is therefore to get the data. But not just
any data will do. You must obtain reliable and authentic data, for a simple reason:
garbage in, garbage out.
A rule of thumb is to apply strict checks when obtaining the data. All the available
datasets (from internal or external databases, third parties, or the internet)
must be gathered, and the data should be extracted into an appropriate format (CSV,
XML, JSON, and many more).
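As a rough sketch of what "strict checks" might look like at extraction time, the example below validates each CSV row before accepting it. The sample data, the column names, and the rules in `validate` are assumptions made for illustration; real checks depend on the dataset.

```python
import csv
import io

# Hypothetical raw extract; the column names are illustrative only.
RAW = "order_id,amount,country\n1001,25.50,US\n1002,,DE\n1003,abc,FR\n1004,12.00,\n"

REQUIRED = ("order_id", "amount", "country")


def validate(row):
    """Return a list of problems found in one row (empty list means the row passes)."""
    problems = [f"missing {name}" for name in REQUIRED if not row.get(name)]
    if row.get("amount"):
        try:
            float(row["amount"])
        except ValueError:
            problems.append("amount is not a number")
    return problems


accepted, rejected = [], []
for row in csv.DictReader(io.StringIO(RAW)):
    problems = validate(row)
    if problems:
        rejected.append((row["order_id"], problems))
    else:
        accepted.append(row)
```

Keeping the rejected rows together with the reason for rejection, rather than dropping them silently, makes it easier to trace quality problems back to the source.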
Skills required for obtaining the data
• Distributed Storage: Flink/Apache Spark, Hadoop.
• Database Management: PostgreSQL, MySQL, MongoDB.
• Querying Relational Databases.
• Unstructured data retrieval: videos, text, documents, audio files.
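To make the "querying relational databases" skill concrete, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite stands in for PostgreSQL or MySQL only so the example runs without a server; the `orders` table, its columns, and the threshold are hypothetical.

```python
import sqlite3

# In-memory database so the sketch runs anywhere; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("bo", 35.0), ("ana", 15.0)],
)

# Parameterized query: total spend per customer at or above a threshold.
threshold = 30.0
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer HAVING SUM(amount) >= ? ORDER BY customer",
    (threshold,),
).fetchall()
conn.close()
```

Passing `threshold` as a bound parameter instead of formatting it into the SQL string avoids injection problems when the value comes from outside the program.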
2. Cleaning and preparing your data
This part of the data pipeline is laborious and time-consuming. Data often arrives
with anomalies such as duplicated values, missing parameters, and irrelevant
features, which is why cleaning the data is crucial. The results obtained from a
machine learning model are only as good as its input. It bears repeating: garbage in,
garbage out.
This step may consider only those aspects of the data that matter for the targeted
problem, so it helps to examine the data thoroughly in light of the operations to be
performed on it later. The objectives here are identifying errors, filling gaps in the
data, removing duplicate entries, and removing corrupt records, among others. Domain
expertise is important for judging the impact of any value or feature.
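The cleaning objectives listed above (removing duplicates, dropping corrupt records, filling gaps) can be sketched with plain Python. The records, the 0 to 120 validity rule for `age`, and the choice of the median as a fill value are all assumptions for illustration; in practice a library such as Pandas would usually do this work.

```python
from statistics import median

# Hypothetical raw records showing the anomalies named above.
raw = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 1, "age": 34, "city": "Pune"},      # duplicate entry
    {"id": 2, "age": None, "city": "Delhi"},   # missing value
    {"id": 3, "age": -5, "city": "Mumbai"},    # corrupt record (impossible age)
    {"id": 4, "age": 41, "city": "Chennai"},
]


def clean(records):
    # 1. Remove duplicates while keeping first-seen order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # 2. Drop corrupt records (assumed rule: age must be between 0 and 120).
    valid = [r for r in unique if r["age"] is None or 0 <= r["age"] <= 120]
    # 3. Fill missing ages with the median of the known ages.
    fill = median(r["age"] for r in valid if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in valid]


cleaned = clean(raw)
```

Whether a missing value should be filled, and with what, is exactly the kind of decision where domain knowledge matters; the median here is only a placeholder choice.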
Skills required for cleaning or preparing the data
• Programming languages: R, Python.
• Tools for data modification: NumPy, Pandas, other Python libraries, R.
• Distributed processing: Spark/MapReduce, Hadoop.
3. Visualization or exploration of data
In the data exploration phase, you examine the values and patterns in the data.
Apply different kinds of visualization and statistical tools to support your results.
Domain knowledge is essential at this stage so that the visualizations are
interpreted correctly.
The objective of this stage is to explore patterns using visualizations and charts.
Combined with statistical techniques, this leads to feature extraction, that is, the
identification and testing of significant variables.
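One simple statistical technique for spotting significant variables is to rank candidate features by their correlation with the target. The sketch below assumes a tiny made-up dataset (`sales`, `ad_spend`, `day_of_month`) and uses a hand-written Pearson coefficient so it needs only the standard library; it is an exploratory aid, not a substitute for proper feature testing.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical cleaned dataset: one target and two candidate features.
sales = [10.0, 12.0, 15.0, 19.0, 24.0]
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
day_of_month = [3.0, 28.0, 11.0, 7.0, 19.0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


summary = {"mean": mean(sales), "stdev": stdev(sales), "min": min(sales), "max": max(sales)}
correlations = {
    "ad_spend": pearson(ad_spend, sales),
    "day_of_month": pearson(day_of_month, sales),
}
# Rank features by absolute correlation with the target.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
```

With this data, `ad_spend` ranks well ahead of `day_of_month`. A scatter plot of each feature against the target would normally accompany such numbers.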
Skills required for visualization or exploration of data
• Python: Pandas, Matplotlib, SciPy, NumPy.
• R: dplyr, ggplot2.
• Statistical tools: inferential statistics, random sampling.
• Data visualization: Tableau.
4. Data modeling in Machine Learning
With the data obtained and cleaned in the initial steps, the features most crucial to
the problem are identified, and relevant models are then used as predictive tools.
This makes decision-making data-driven and so improves it.
The objective of data modeling is an in-depth analysis that mainly involves creating
machine learning models, for instance, a predictive model, to give predictive power
to your data.
Once a model is created, it must be tested for error rate and performance. Data
modeling also involves evaluating and refining the model over multiple cycles of
evaluation and optimization, because no machine learning model is at its best on the
very first attempt. These cycles increase accuracy by training on newly ingested
data, minimizing data loss, and so on.
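The create-then-test cycle can be illustrated with the simplest possible predictive model, a straight line fitted by ordinary least squares. The data, the 6/2 train/test split, and the helper names are assumptions for this sketch; a real project would typically use a library such as scikit-learn.

```python
# Hypothetical data: y is roughly 2*x + 1 with small noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 17.2]

# Hold out the last two points for testing; train on the rest.
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def mean_squared_error(x, y, slope, intercept):
    """Average squared difference between predictions and actual values."""
    return sum((slope * a + intercept - b) ** 2 for a, b in zip(x, y)) / len(x)


slope, intercept = fit_line(train_x, train_y)
train_mse = mean_squared_error(train_x, train_y, slope, intercept)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)
```

Comparing `train_mse` with `test_mse` is the basic check in each refinement cycle: a test error much larger than the training error suggests the model does not generalize.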
Evaluation metrics commonly applied at this stage:
• Logarithmic loss
• Classification accuracy
• Confusion matrix
• F1 score
• Area under the curve (AUC)
• Mean squared error
• Mean absolute error
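The metrics above can each be computed in a line or two. The following sketch uses made-up labels and predicted probabilities, and the 0.5 decision threshold is an assumption; libraries such as scikit-learn provide tested versions of all of these.

```python
from math import log

# Hypothetical binary labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # 0.5 threshold is an assumption

# Confusion matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Logarithmic loss penalizes confident but wrong probabilities.
log_loss = -sum(t * log(p) + (1 - t) * log(1 - p) for t, p in zip(y_true, y_prob)) / len(y_true)

# AUC as the share of (positive, negative) pairs ranked correctly; ties count half.
pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]
auc = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))

# Regression-style errors, computed here on the probabilities for illustration.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_prob)) / len(y_true)
```

Accuracy, F1, and the confusion matrix depend on the chosen threshold, whereas log loss and AUC are computed from the probabilities directly.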
Skills required in data modeling
• Machine learning: supervised or unsupervised algorithms.
• Methods of evaluation.
• Libraries for machine learning: Python (NumPy, scikit-learn).
• Multivariate calculus and linear algebra.
5. Interpretations of the data findings
Interpretation of the data entails communicating the findings to the stakeholders.
Without a proper explanation of the findings for the interested parties, whatever
work a data scientist has done is of little use. This is why this step is crucial.
Data interpretation first aims at understanding the business goals and then linking
them correctly to the data findings. A domain expert may be required to correlate
business problems with the findings. They can help visualize the findings and
communicate the facts to non-technical stakeholders.
Skills required for interpretation of the data
• Domain knowledge of the business.
• Tools for data visualization: D3.js, Tableau, ggplot2, Matplotlib,
Seaborn.
• Communication skills: speaking, presenting, writing, and reporting.
Updating the model
Once the model is deployed in production, it is essential to revisit and update it
periodically. How often depends on how frequently you receive new data and on
whether anything changes in the business.
For instance, suppose you are a data professional at a transportation company, and
the company decides to open a division for electric vehicles in response to new
consumer trends. If your old model does not account for this sector, you must revisit
it and include data on these new types of vehicles. If you do not, the model will
degrade over time and will no longer perform as required. Adding new information or
features will change the model's performance and keep it relevant.
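One way to notice that a model needs revisiting is to check incoming data for category values it was never trained on, as in the electric vehicle example. The sketch below is an assumption-laden illustration: the `vehicle_type` field, the set of trained categories, and the 5% threshold are all hypothetical.

```python
# Hypothetical category values seen when the model was trained.
TRAINED_VEHICLE_TYPES = {"petrol", "diesel", "hybrid"}


def unseen_categories(new_records, known, field="vehicle_type"):
    """Return category values in new data that the deployed model never saw."""
    return {rec[field] for rec in new_records} - known


def needs_retraining(new_records, known, max_unseen_share=0.05):
    """Flag retraining when too many incoming rows carry unseen categories.

    The 5% default is an arbitrary illustrative threshold, not a recommendation.
    """
    unseen = unseen_categories(new_records, known)
    share = sum(rec["vehicle_type"] in unseen for rec in new_records) / len(new_records)
    return share > max_unseen_share


incoming = [
    {"vehicle_type": "petrol"},
    {"vehicle_type": "electric"},
    {"vehicle_type": "diesel"},
    {"vehicle_type": "electric"},
]
```

Here `unseen_categories(incoming, TRAINED_VEHICLE_TYPES)` returns `{"electric"}`, and since half of the incoming rows carry that value, `needs_retraining` returns `True`.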
Role of a Data Engineer in creating data pipelines
SKILLS AND TECHNOLOGIES DATA ENGINEERS NEED FOR DATA PIPELINES
In a multidisciplinary team (data engineers, BI users, and data scientists), the role
of a data engineer in creating data pipelines is mainly to ensure the availability and
quality of data. In addition, a data engineer can collaborate with others on the team
to design or implement a data-related product or feature, such as the refinement of an
existing data source or the deployment of a machine learning model.
• Data engineers must have a sound knowledge of programming, in at least
Python or Java/Scala.
• Data engineers must know different types of databases (SQL and
NoSQL), data platforms, concepts like MapReduce and stream and batch
processing, and some basic theory of data itself, for instance, descriptive
statistics and data types.
• Data engineers must have experience with several data storage
frameworks and technologies, which they can put together to create data
pipelines.
• Data engineers must gain proficiency in data warehouse tools, such as
Teradata Data Warehouse, SAP Data Warehouse, IBM Db2, Oracle
Exadata, Amazon Redshift (cloud-based), and Google BigQuery
(cloud-based).
• Big data tools that data engineers must know for data pipelines include
Hadoop, Elasticsearch, and ETL and data platforms.
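MapReduce, one of the concepts named in the list above, is easiest to grasp through the classic word count. The sketch below imitates the map, shuffle, and reduce phases in a single Python process; the log lines and function names are hypothetical, and a framework such as Hadoop would distribute the same phases across many machines.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical log lines standing in for a large distributed dataset.
lines = ["error disk full", "info job done", "error network down"]


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values)


mapped = [pair for line in lines for pair in map_phase(line)]
counts = {word: reduce_phase(values) for word, values in shuffle(mapped).items()}
```

Because `map_phase` handles each line independently and `reduce_phase` handles each key independently, both phases can run in parallel, which is what makes the pattern suitable for batch processing at scale.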
Data Engineer Vs Data Scientist: A comparative analysis
of their roles in creating data pipelines
Both data engineers and data scientists perform tasks related to data, but they solve
quite a different set of problems. They bring different skills to the table, and make use
of different tools.
Source : Ryan Swanstrom
• Data engineers build and maintain massive data storage. They apply
engineering skills such as ETL techniques, programming languages,
database languages, and data warehouses.
• Data scientists, on the other hand, clean and analyze the data, look for
insights in it, deploy models for predictive and forecasting analytics,
and often apply their mathematical skills, machine learning tools, and
algorithms.
Endnotes
Identifying the business problem and asking relevant questions are important in
creating robust data pipelines. This involves searching for reliable and authentic
sources and determining the stages through which the data will pass.
Are you ready to design and deploy a data pipeline for your organization? The effort
will pay off when hard work and sound logic come together. Step up in your career
with DASCA’s Data Engineering certifications! To learn more, check our
certifications.

How AI, OpenAI, and ChatGPT impact business and software.
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

The 5-Step Process for Creating Winning Data Pipelines

In 2009, Hal Varian, Google's Chief Economist, offered a prescient insight: the ability to harvest massive amounts of data and extract value from it would change modern business. He was right. Today, data science helps create machine learning algorithms that solve business problems and enable decision-making. For data practitioners, it all starts with creating data pipelines. What follows is a primer on these building blocks of data science.

Running a business in the 21st century makes hiring a data scientist inevitable. If some business owners do not yet feel this need, the newness of the field is to blame: data science was introduced only in 2001, by William S. Cleveland, as an extension of statistics.

The benefits of data science are many, and data pipeline architecture is what makes them possible. Pipelines are an important part of data and analytics.

In this article, we discuss what data pipelines are, why they are needed, the types of data that pass through them, how to create a data pipeline, and the roles data engineers and data scientists play in building them.

Data pipelines: A brief

A data pipeline is a series of steps that moves raw data from a source to a destination, where it can be handled and consumed. It is the sum of the processes and tools that enable data integration. In business intelligence, the source can be a transactional database, and the destination is usually a data warehouse or a data lake. The destination is the platform where the data is analyzed to yield business insights.

Two major benefits of data pipelines are:

• They consolidate data from various disparate sources into a single common destination. This speeds up the analysis needed to find business insights.
• They also ensure consistency in data quality, which is critical for gaining reliable insights.

In a broader sense, two types of data pass through a data pipeline:
• Structured data: data that can be stored and retrieved in a fixed format. It includes email addresses, device-specific statistics, phone numbers, locations, IP addresses, and banking information.
• Unstructured data: data that is difficult to track in a fixed format. It includes social media content, email content, images, mobile phone searches, and online reviews.

Figure: A typical data pipeline

Stages in a Data Pipeline

1. Obtaining your data

Data is of paramount importance in data science; without it, there is nothing to apply data science to. The first and most crucial step is therefore to get the data. But not just any data will do: you must obtain reliable and authentic data. The reason is simple: garbage in, garbage out.

A rule of thumb is to apply strict checks when obtaining the data. Gather all the available datasets (from internal or external databases, third parties, or the internet) and extract the data into an appropriate format (CSV, XML, JSON, and so on).

Skills required for obtaining the data

• Distributed storage: Apache Flink/Spark, Hadoop
• Database management: PostgreSQL, MySQL, MongoDB
• Querying relational databases
• Unstructured data retrieval: videos, text, documents, audio files

2. Cleaning and preparing your data
This part of the data pipeline is laborious and time-consuming. Data often arrives with anomalies such as duplicate values, missing parameters, and irrelevant features, which is why cleaning it is crucial. The results of a machine learning model are only as good as its input. It bears repeating: garbage in, garbage out.

This step can focus on only those aspects of the data that matter for the targeted problem, so it helps to examine the data thoroughly in light of the operations that will be performed on it later. The objectives here include identifying errors, filling data holes, removing duplicate entries, and removing corrupt records. Domain expertise is important for judging the impact of any value or feature.

Skills required for cleaning or preparing the data

• Programming languages: R, Python
• Tools for data modification: NumPy, Pandas, other Python libraries, R
• Distributed processing: Spark, MapReduce, Hadoop

3. Visualization or exploration of data

In the data exploration phase, you examine the values and patterns in the data, applying different kinds of visualization and statistical tools to support the results. Domain knowledge is essential at this stage so that the visualizations are interpreted correctly.

The objective of this stage is to explore patterns through visualizations and charts. Applying statistical techniques then leads to feature extraction, that is, to identifying and testing the significant variables.

Skills required for visualization or exploration of data

• Python: Pandas, Matplotlib, SciPy, NumPy
• R: dplyr, ggplot2
• Statistical tools: inferential statistics, random sampling
• Data visualization: Tableau

4. Data modeling in Machine Learning

With the data obtained and cleaned in the initial steps, the features most crucial to the given problem are identified, using relevant models as predictive tools. This improves decision-making by making it data-driven.

The objective of data modeling is an in-depth analysis that mainly involves creating machine learning models, such as a predictive model, to give your data predictive power. Once a model has been created, it must be tested for error rate and performance.

Data modeling also includes evaluating and refining the model. This takes multiple cycles of evaluation and optimization, because no machine learning model is at its best on the first attempt. The process increases accuracy by, for example, training on newly ingested data and minimizing data loss.

Evaluation methods commonly applied at this stage:

• Logarithmic loss
• Classification accuracy
• Confusion matrix
• F1 score
• Area under the curve (AUC)
• Mean squared error
• Mean absolute error

Skills required in data modeling

• Machine learning: supervised and unsupervised algorithms
• Methods of evaluation
• Libraries for machine learning: Python (NumPy, scikit-learn)
• Multivariate calculus and linear algebra

5. Interpretation of the data findings

Interpreting the data means communicating the findings to the stakeholders. Without a proper explanation for the interested parties, whatever work a data scientist has done is of little use, which is why this step is crucial.

Data interpretation first aims to understand the business goals and then links them correctly to the data findings. A domain expert may be required to correlate business problems with the findings, help visualize them, and communicate the facts to non-technical stakeholders.

Skills required for interpretation of the data

• Domain knowledge of the business
• Tools for data visualization: D3.js, Tableau, ggplot2, Matplotlib, Seaborn
• Communication abilities: speaking, presenting, writing, and reporting

Updating the model

Once the model is deployed in production, it becomes essential to revisit and update it periodically. How often depends on how frequently you receive new data and on whether the business changes.

For instance, suppose you are a data professional at a transportation company that decides to open an electric vehicle division in response to new consumer trends. If your old model does not cover this sector, you must revisit it and include data on these new types of vehicles.

If you do not revisit or update the model, it will degrade over time and stop performing as required. Adding new information or features changes the model's performance and keeps it relevant.

Role of a Data Engineer in creating data pipelines

Figure: Skills and technologies data engineers need for data pipelines

In a multidisciplinary team (data engineers, BI users, and data scientists), the data engineer's role in creating data pipelines is mainly to ensure the availability and quality of data. A data engineer can also collaborate with others on the team to design or implement a data-related product or feature, such as refining an existing data source or deploying a machine learning model.
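As a rough illustration of how the stages described above fit together, the sketch below walks a tiny dataset through them using only the Python standard library. Everything specific in it is an assumption made for illustration: the trip data, the column names (`trip_id`, `distance_km`, `fuel_l`), the median fill for missing values, and the straight-line model. A real pipeline would draw on the storage, processing, and machine learning tools named in this article and would evaluate the model on held-out data rather than on the training rows.

```python
import csv
import io
import statistics

# Hypothetical raw extract (not from the article): trips with distance and fuel used.
RAW_CSV = """trip_id,distance_km,fuel_l
1,100,8.0
2,200,16.5
2,200,16.5
3,150,
4,abc,9.0
5,300,24.0
"""


def obtain(text):
    """Stage 1: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Stage 2: drop duplicates and corrupt records, then fill data holes."""
    seen, parsed = set(), []
    for row in rows:
        if row["trip_id"] in seen:  # duplicate entry
            continue
        seen.add(row["trip_id"])
        try:
            distance = float(row["distance_km"])
        except ValueError:  # corrupt record: distance is not a number
            continue
        fuel = float(row["fuel_l"]) if row["fuel_l"] else None
        parsed.append({"distance_km": distance, "fuel_l": fuel})
    # Assumption: fill missing fuel values with the median of the known ones.
    known = [r["fuel_l"] for r in parsed if r["fuel_l"] is not None]
    fill_value = statistics.median(known)
    for r in parsed:
        if r["fuel_l"] is None:
            r["fuel_l"] = fill_value
    return parsed


def explore(rows):
    """Stage 3: summarize the cleaned values."""
    return {
        "mean_distance_km": statistics.fmean([r["distance_km"] for r in rows]),
        "mean_fuel_l": statistics.fmean([r["fuel_l"] for r in rows]),
    }


def fit_line(rows):
    """Stage 4a: least-squares fit of fuel_l = slope * distance_km + intercept."""
    xs = [r["distance_km"] for r in rows]
    ys = [r["fuel_l"] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def evaluate(rows, slope, intercept):
    """Stage 4b: mean squared error and mean absolute error on the given rows."""
    errors = [r["fuel_l"] - (slope * r["distance_km"] + intercept) for r in rows]
    mse = statistics.fmean([e * e for e in errors])
    mae = statistics.fmean([abs(e) for e in errors])
    return mse, mae


raw_rows = obtain(RAW_CSV)
cleaned = clean(raw_rows)
summary = explore(cleaned)
slope, intercept = fit_line(cleaned)
mse, mae = evaluate(cleaned, slope, intercept)

# Stage 5: report the findings in plain language for stakeholders.
print(f"Kept {len(cleaned)} of {len(raw_rows)} rows; "
      f"each extra km adds about {slope:.3f} l of fuel (MSE {mse:.2f}, MAE {mae:.2f}).")
```

Keeping each stage in its own function means a stage can be swapped out later, for example by replacing `obtain` with a database query or `fit_line` with a library model, without touching the rest of the pipeline.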
• Data engineers must have sound knowledge of a programming language: at least Python or Java/Scala.
• Data engineers must know different types of databases (SQL and NoSQL), data platforms, concepts such as MapReduce and stream and batch processing, and some basic theory of data itself, such as descriptive statistics and data types.
• Data engineers must have experience with several data storage frameworks and technologies that they can combine to create data pipelines.
• Data engineers must gain proficiency in data warehouse tools such as Teradata Data Warehouse, SAP Data Warehouse, IBM Db2, Oracle Exadata, Amazon Redshift (cloud-based), and Google BigQuery (cloud-based).
• The Big Data tools data engineers need to know for data pipelines include Hadoop, Elasticsearch, and ETL and data platforms.

Data Engineer vs. Data Scientist: A comparative analysis of their roles in creating data pipelines

Both data engineers and data scientists perform tasks related to data, but they solve quite different sets of problems. They bring different skills to the table and use different tools.
Source: Ryan Swanstrom

• Data engineers build and maintain massive data storage. They apply engineering skills such as ETL techniques, programming languages, database languages, and data warehousing.
• Data scientists, on the other hand, clean and analyze the data, look for insights in it, and deploy models for predictive and forecasting analytics, often applying their mathematical skills, machine learning tools, and algorithms.

Endnotes

Identifying a business problem and asking relevant questions are important in creating robust data pipelines. This involves scrounging for reliable and authentic sources and determining the stages through which the data will pass.

Are you ready to design and deploy a data pipeline for your organization? The attempt will shine when hard work and sound logic go into it.

Step up in your career with DASCA's Data Engineering certifications! To learn more, check our certifications.