This document outlines best practices for processing raw data into tidy datasets. It covers preparation (validating variables against a codebook), organization (planning the steps and labeling variables), quality control (reproducible cleaning code), and communication (comments, a codebook, and delivery of both raw and tidy datasets). The presentation demonstrates these practices with examples from agriculture and education data, showing how to reshape data, generate variables, and comment code for clarity.
1. Wrangling Data From Raw to Tidy: Preparation, Organization, Quality Control, and Communication
August 14, 2015
Professional Education and Knowledge Seminar
2. Purpose
• The purpose of this presentation is to identify and demonstrate best practices for processing raw data into tidy data sets.
• This presentation will demonstrate these practices using Stata and R.
• The examples are taken from a Department of Agriculture project and an assignment from Johns Hopkins University’s Getting and Cleaning Data course on Coursera.1
1 https://www.coursera.org/specialization/jhudatascience/1/certificate
4. Cheat Sheet
• Preparation
– Do you have a code book?
– Have you reviewed and validated the variables?
• Organization
– Plan out the final product and the steps necessary to create that output.
– Label variables and values.
• Quality Control
– Create reproducible cleaning code.
– Rerun code from start to finish in a clean directory.
• Communication
– Comment, Comment, Comment.
– Create a codebook and provide raw and tidy datasets.
5. Preparation
• Do you have a codebook for the raw data?
• Do you have the ability to read through and validate (review, summarize, graph, etc.) every variable?
• If the answer is no, return to your client (or data provider) and ask!
6. Organization
• Plan out steps for investigating the data.
• Create a code shell to include basic commands necessary for organized and logical coding.
• Add variable and value labels to make data understandable.
7. Organizing Your Thoughts
• What form do you want your data set to be in? Should the data be long or wide?
• Who will be using this data and how will it be used?
• Will the output require a particular format employed for specific projects or tasks?
Raw Data vs. Desired Format (side-by-side screenshot comparison in the original slides)
8. Set Up a “Shell” Document with a Header
• Outline the major functions performed in the code in the header.
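As an illustration of such a shell, here is a minimal sketch of an R script header; the project details, file names, and step list are hypothetical placeholders, not taken from the original slides.

################################################################
# Project : Installment data cleaning (hypothetical example)
# Author  : <analyst name>
# Created : 2015-08-14
# Purpose : Read raw installment data, reshape it, fill equal-payment
#           schedules, and export a tidy data set
# Inputs  : raw_installments.csv (placeholder file name)
# Outputs : tidy_installments.csv, codebook.csv
################################################################

# 1. Read raw data
# 2. Validate variables against the codebook
# 3. Reshape and clean
# 4. Label variables and values
# 5. Export the tidy data set and codebook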
9. Plan Out How to Move from One Step to Another
• You may not know exactly how to get from one step to another.
• Take a step back and think about where you are going.
• Write out small sections of your code in the text editor the way you think it would produce your expected results.
• Run the code and test that it produces those outputs.
• Think critically about your inputs and the desired form you want your data to be in.
10. Data Cleaning Example: Installment Data
• Challenge: Generate a wide-formatted loan payment schedule given the loan tenor, installment amounts, and equal payment indicator.
• Rushed Solution: Use sophisticated and complex coding to extend the wide-formatted schedule by replacing missing installment values.
• Planned Solution: Reshape data into long format and replace values.
11. Data Cleaning Example: Installment Data
• Stata code to reshape data and extend loan schedule:
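The Stata code itself appears only as a screenshot in the original slides and is not reproduced here. As a rough sketch of the planned long-reshape-and-fill approach, the following R version uses made-up column names (loan_id, tenor, equal_pay, inst_1–inst_3) and the tidyr/dplyr packages; it is an illustration of the idea, not the original code.

library(dplyr)
library(tidyr)

# Hypothetical wide-format input: one row per loan, one column per year
loans <- data.frame(
  loan_id   = c("A", "B"),
  tenor     = c(3, 2),        # number of annual installments
  equal_pay = c(1, 0),        # 1 = equal payments every year
  inst_1    = c(100, 250),
  inst_2    = c(NA, 150),
  inst_3    = c(NA, NA)
)

schedule <- loans %>%
  gather(year, installment, inst_1:inst_3) %>%            # reshape long
  mutate(year = as.integer(sub("inst_", "", year))) %>%
  group_by(loan_id) %>%
  arrange(year, .by_group = TRUE) %>%
  # for equal-payment loans, carry the first installment through the tenor
  mutate(installment = ifelse(equal_pay == 1 & year <= tenor & is.na(installment),
                              installment[1], installment)) %>%
  ungroup() %>%
  spread(year, installment, sep = "_")                     # reshape back wide

Reshaping long first means the fill rule is a single replacement on one column, rather than a separate fix for every year column in the wide layout.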
12. Data Cleaning Example: Installment Data
• Sample of observations after reshaping data to long format (screenshot showing the Tenor, Installments, and "Year" columns):
13. Data Cleaning Example: Installment Data
• Sample of final dataset after reshaping loan schedule into wide format (i.e., the desired loan schedule):
14. Label Variable Names and Values
• Why?
• It helps you!
– If you need to find variables quickly and easily and you don’t want to have to store the variable descriptions in your brain’s memory.
• It helps your peers!
– If you spend a good amount of time with the data, variables become second nature, but they will not necessarily be obvious to your colleagues.
• Try to interpret these variables: flp_asst_type_cd, dir_loan_pgm_cd, tBodyGyro-arCoeff, tBodyAccJerkMag-mean.
15. Label Variable Names and Values in Stata
• label variable varname "variable label"
• label define valuename number "value label"
• label values varname valuename
16. Label Variable Names in R
• Functions: names() and sub():
names(dataset) <- sub("find", "replace", names(dataset))
• Useful sub() arguments: perl = TRUE and ignore.case = TRUE
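A small worked sketch of this renaming pattern, using made-up column names in the style of the phone data from slide 14:

# Hypothetical data frame with terse phone-data column names
dataset <- data.frame("tBodyAcc-mean()-X" = 1:3, "tBodyAcc-std()-X" = 4:6,
                      check.names = FALSE)

# Replace the leading "t" with a more readable prefix in every name
names(dataset) <- sub("^t", "Time", names(dataset), perl = TRUE)

# Case-insensitive replacement of the "-mean()" fragment
names(dataset) <- sub("-mean\\(\\)", "Mean", names(dataset), ignore.case = TRUE)

names(dataset)
# "TimeBodyAccMean-X"  "TimeBodyAcc-std()-X"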
17. Label Values in R
• Function: mutate() from the dplyr package:
mutate(dataset, varname = factor(varname, levels = c(levels), labels = c(labels)))
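For example, with a hypothetical numeric activity code:

library(dplyr)

# Hypothetical data with a numeric activity code
tidy_data <- data.frame(subject = 1:3, activity = c(1, 2, 1))

tidy_data <- mutate(tidy_data,
                    activity = factor(activity,
                                      levels = c(1, 2),
                                      labels = c("Walking", "Sitting")))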
18. Limit Each Column to One Type of Data
• Example of data with multiple variables in each column:
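The example table is a screenshot in the original slides. As an illustration of the problem and one way to fix it, the sketch below uses a made-up column that mixes signal, statistic, and axis, and splits it with tidyr::separate():

library(tidyr)

# Hypothetical data: one column mixes signal, statistic, and axis
messy <- data.frame(subject = 1:2,
                    feature = c("BodyAcc_mean_X", "BodyGyro_std_Y"),
                    value   = c(0.12, -0.45))

# Give each variable its own column
tidy <- separate(messy, feature, into = c("signal", "statistic", "axis"), sep = "_")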
19. Data Cleaning Example: Phone Health Data
• R code to reshape phone health data and generate variables to separate out each observation’s characteristics:
# Reshape Data Long and Group Related Variables #
reshape_tidy_data <- tidy_data %>%
  gather(Variable, Average_Value, 3:ncol(tidy_data))
Note: This code uses the tidyr and dplyr packages.
20. Data Cleaning Example: Phone Health Data
# Reshape Data Long and Group Related Variables #
reshape_tidy_data <- mutate(reshape_tidy_data,
  Axis = ifelse(grepl("X$", Variable, perl = TRUE), 1,
         ifelse(grepl("Y$", Variable, perl = TRUE), 2, 3)) ... )
Note: This code uses the tidyr and dplyr packages.
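Because the slides show only fragments, here is a small self-contained version of the same gather()-then-mutate() pattern on a made-up data frame:

library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per signal/axis combination
phone <- data.frame(subject = 1:2, activity = c("walk", "sit"),
                    BodyGyro_X = c(0.10, 0.20), BodyGyro_Y = c(0.30, 0.40))

reshaped <- phone %>%
  gather(Variable, Average_Value, 3:ncol(phone)) %>%   # stack feature columns
  mutate(Axis = ifelse(grepl("X$", Variable, perl = TRUE), 1,
                ifelse(grepl("Y$", Variable, perl = TRUE), 2, 3)))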
21. Quality Control
• Create reproducible cleaning code.
• Rerun code from start to finish in a clean directory (a minimal sketch follows this slide).
– Running code line by line can result in accidental manual errors or unexpected results.
– Also start in a completely blank directory to ensure the code produces the results without referencing files you previously generated.
• Provide code, documentation, and expected output to a colleague for feedback.
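One way to make the start-to-finish rerun routine is a short master script that clears the workspace and sources each step in order; the file names below are placeholders:

# run_all.R -- hypothetical master script
rm(list = ls())                      # start from an empty workspace

source("01_read_raw_data.R")
source("02_clean_and_reshape.R")
source("03_label_and_export.R")

Running it with Rscript --vanilla run_all.R from a freshly created directory also guards against hidden state from .Rprofile files or saved workspaces.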
22. Communication
In almost every circumstance, commenting is superior to not commenting.
• If your code will be reviewed by a colleague (and I hope it will), then your time upfront will save them time during review.
• If anyone, including yourself, will ever use this code again, it allows you and them to understand it more easily.
• If you require multiple validations or cleanings of similar data sets, commenting your code can indicate what code to reuse.
24. Provide a Processed Dataset
• Create a data dictionary containing the variable names, values, labels, and units (a small sketch follows this slide).
• Components of a Processed Dataset1:
– The raw data set
– A tidy data set
– A codebook describing each variable and its values in the tidy data.
– An explicit and exact recipe used to get from the raw to the tidy data.
• If there are steps that cannot be coded, they need to be explicitly described.
1 Taken from the Getting and Cleaning Data Coursera course from Johns Hopkins University.
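A minimal sketch of generating a codebook skeleton from a tidy data frame; the data frame here is made up, and the descriptions and units still have to be written by hand:

# Hypothetical tidy data set
tidy <- data.frame(subject = 1:2, activity = c("Walking", "Sitting"),
                   mean_acc = c(0.12, -0.45))

# Codebook skeleton: one row per variable
codebook <- data.frame(
  variable    = names(tidy),
  class       = vapply(tidy, function(x) class(x)[1], character(1)),
  description = "",      # fill in manually
  units       = "",      # fill in manually
  stringsAsFactors = FALSE
)

write.csv(codebook, "codebook.csv", row.names = FALSE)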
25. Summary
• Preparation
– Do you have a code book?
– Have you reviewed and validated the variables?
• Organization
– Plan out the final product and the steps necessary to create that output.
– Label variables and values.
• Quality Control
– Create reproducible cleaning code.
– Rerun code from start to finish in a clean directory.
• Communication
– Comment, Comment, Comment.
– Create a codebook and provide raw and tidy datasets.