Talk given by Mike Skarlinski and Brian Graham from WW (new Weight Watchers) data science team in 5th NYC RecSys meetup, June 20, 2019, hosted at WW HQ
Data Quality: principles, approaches, and best practicesCarl Anderson
The document discusses principles and best practices for data quality. It outlines key facets of data quality including accuracy, coherence, completeness, consistency, being defined and timely. It provides examples of how to measure these facets through metrics like percentage of records quarantined or missing fields. The document advocates establishing data governance practices like publishing schemas, adhering to definitions, and integrating data quality checks and monitoring into normal workflows. It promotes a culture where data quality is a shared responsibility across teams.
Setting up Data Science for Success: The Data LayerCarl Anderson
This document discusses setting up data science projects for success by focusing on the importance of data preparation. It notes that 76% of data scientists view data preparation as the least enjoyable part of their work. The document outlines various facets of data preparation, including collecting, understanding, cleaning, and reshaping data. It emphasizes that data quality is important and a shared responsibility across data engineering, data science, and business intelligence teams. It recommends creating a single source of truth for data through techniques like data dictionaries to define data for all teams.
Creating a Data-Driven Organization (Data Day Seattle 2015)Carl Anderson
Creating a Data-Driven Organization
The document discusses how to create a data-driven organization. It argues that being data-driven requires having strong analytics, a data-focused culture, and using data to drive impact and business results. Some key aspects of a data-driven culture discussed are having a testing mindset, open data sharing, self-service analytics access for business units, broad data literacy, and visible data leadership. The presentation provides examples of actions organizations can take to promote a data-driven culture, such as improving analyst competencies and linking metrics to strategic goals. It cautions that becoming complacent once progress is made can undermine data-driven efforts, as demonstrated by Tesco's experience.
Targeting towards the health and human services communities, this presentation covers the importance of a data-driven culture, how to identify areas where data can be used to innovate and how to recognize the operational processes you must have in place to fully utilize your data.
Data is becoming an engine for many businesses in the information age, and every company needs to consider look at how that feels in their business model.
This an introductory guest lecture for students at Stockholm School of Entrepreneurship.
In times of digitalization, every aspect of our life is connected to data. To leverage this data, companies need to understand and master analytics. In this presentation, Leo Marose will guide you through the world of big data & data science and show you his approach of how to build a data-driven organization.
Creating a Data-Driven Organization, Crunchconf, October 2015Carl Anderson
Creating a data-driven organization requires developing a data-driven culture. Key aspects of a data-driven culture include having a strong testing culture that encourages hypothesis generation and experimentation, an open and sharing culture without data silos, a self-service culture where business units have necessary data access and analytical skills, and broad data literacy across all decision makers. Ultimately, an organization is data-driven when it uses data to drive impact and business results by pushing data through an analytics value chain from collection to analysis to decisions and actions. Maintaining a data-driven culture requires continuous effort as well as data leadership from a chief data or analytics officer.
Data Quality: principles, approaches, and best practicesCarl Anderson
The document discusses principles and best practices for data quality. It outlines key facets of data quality including accuracy, coherence, completeness, consistency, being defined and timely. It provides examples of how to measure these facets through metrics like percentage of records quarantined or missing fields. The document advocates establishing data governance practices like publishing schemas, adhering to definitions, and integrating data quality checks and monitoring into normal workflows. It promotes a culture where data quality is a shared responsibility across teams.
Setting up Data Science for Success: The Data LayerCarl Anderson
This document discusses setting up data science projects for success by focusing on the importance of data preparation. It notes that 76% of data scientists view data preparation as the least enjoyable part of their work. The document outlines various facets of data preparation, including collecting, understanding, cleaning, and reshaping data. It emphasizes that data quality is important and a shared responsibility across data engineering, data science, and business intelligence teams. It recommends creating a single source of truth for data through techniques like data dictionaries to define data for all teams.
Creating a Data-Driven Organization (Data Day Seattle 2015)Carl Anderson
Creating a Data-Driven Organization
The document discusses how to create a data-driven organization. It argues that being data-driven requires having strong analytics, a data-focused culture, and using data to drive impact and business results. Some key aspects of a data-driven culture discussed are having a testing mindset, open data sharing, self-service analytics access for business units, broad data literacy, and visible data leadership. The presentation provides examples of actions organizations can take to promote a data-driven culture, such as improving analyst competencies and linking metrics to strategic goals. It cautions that becoming complacent once progress is made can undermine data-driven efforts, as demonstrated by Tesco's experience.
Targeting towards the health and human services communities, this presentation covers the importance of a data-driven culture, how to identify areas where data can be used to innovate and how to recognize the operational processes you must have in place to fully utilize your data.
Data is becoming an engine for many businesses in the information age, and every company needs to consider look at how that feels in their business model.
This an introductory guest lecture for students at Stockholm School of Entrepreneurship.
In times of digitalization, every aspect of our life is connected to data. To leverage this data, companies need to understand and master analytics. In this presentation, Leo Marose will guide you through the world of big data & data science and show you his approach of how to build a data-driven organization.
Creating a Data-Driven Organization, Crunchconf, October 2015Carl Anderson
Creating a data-driven organization requires developing a data-driven culture. Key aspects of a data-driven culture include having a strong testing culture that encourages hypothesis generation and experimentation, an open and sharing culture without data silos, a self-service culture where business units have necessary data access and analytical skills, and broad data literacy across all decision makers. Ultimately, an organization is data-driven when it uses data to drive impact and business results by pushing data through an analytics value chain from collection to analysis to decisions and actions. Maintaining a data-driven culture requires continuous effort as well as data leadership from a chief data or analytics officer.
Predictive Analytics - How to get stuff out of your Crystal BallDATAVERSITY
Everyone wants to leverage data. The optimal implementation of analytics is an organization-wide set of capabilities. These are called advantageous organizational analytic capabilities in that a clear ROI is demonstrable from these efforts. Turns out that there are a number of prerequisites to advantageous organizational analytics. These include:
Adopting a crawl, walk, run strategy
Understanding current and potential organizational maturity and corresponding capabilities
Achieving an appropriate technology/human capability balance
Implementing useful IT systems development practices
Installing necessary non-IT leadership
This webinar will explore these and other topics using examples drawn from DOD, healthcare researchers, and donation center operations.
Data Driven Strategy Analytics Technology Approach CorporateSlideTeam
This complete deck can be used to present to your team. It has PPT slides on various topics highlighting all the core areas of your business needs. This complete deck focuses on Data Driven Strategy Analytics Technology Approach Corporate and has professionally designed templates with suitable visuals and appropriate content. This deck consists of total of thirteen slides. All the slides are completely customizable for your convenience. You can change the colour, text and font size of these templates. You can add or delete the content if needed. Get access to this professionally designed complete presentation by clicking the download button below. https://bit.ly/3yjusdQ
This document discusses best practices for working in the gig economy as an independent contractor on data and analytics projects. It recommends finding the right fit between contractor skills and project needs, committing to an agile or waterfall project management approach, setting quantitative goals, creating extensible code and documentation, and over-communicating through frequent updates rather than relying on emails. The document concludes with case studies comparing two different broadcaster clients' projects that illustrates these principles in action and contrasts their outcomes.
Reinventing the Modern Information Pipeline: Paxata and MapRLilia Gutnik
(Presented at MapR's Big Data Everywhere event in Redwood City, CA in December 2016)
The relationship between business teams and IT has changed as the complexity of data has increased. A traditional data pipeline designed for an IT-centered approach to information management is not designed for the data demands of today's business decisions. Designing a big data strategy requires modernizing previous approaches. Self-service data preparation in a collaborative, intuitive, governed, and secure environment is the key to a nimble and decisive business unit.
Four Key Considerations for your Big Data Analytics StrategyArcadia Data
This document discusses considerations for big data analytics strategies. It covers how big data analytics have evolved from focusing on structured data and batch processing to also including real-time, multi-structured data from various sources. It emphasizes that discovery is key and requires visual exploration of granular data details. Native big data analytics platforms are needed that can handle real-time streaming data and provide self-service capabilities through customizable applications. The document provides examples of how various companies are using big data analytics for applications like cybersecurity, customer analytics, and supply chain optimization.
This document discusses how to build a data-driven organization by collecting and analyzing metrics. It emphasizes that data is important for making decisions, hitting goals, and knowing if systems are working properly. The author promotes their tool called Larimar, which aims to automate data collection and analysis at the application level to provide insights without configuration. Building a data culture where employees are inspired to act on insights is key to success.
This document discusses the importance of data quality and data governance. It states that poor data quality can lead to wrong decisions, bad reputation, and wasted money. It then provides examples of different dimensions of data quality like accuracy, completeness, currency, and uniqueness. It also discusses methods and tools for ensuring data quality, such as validation, data merging, and minimizing human errors. Finally, it defines data governance as a set of policies and standards to maintain data quality and provides examples of data governance team missions and a sample data quality scorecard.
Dataiku is a collaborative data science platform that allows teams to prototype, design, and run data science projects at scale across various technologies and locations. It has over 300% growth in users and is used by many leading companies. Dataiku was named a "visionary" in Gartner's 2017 Magic Quadrant for data science platforms based on its completeness of vision.
Stop searching for that elusive data scientistYogita Bansal
Companies are increasingly seeking data scientists to drive data-based decision making, but there is a lack of qualified candidates. To address this, companies should build effective teams by coordinating existing resources, promoting a data-focused culture, and encouraging all members to contribute insights from available data. Even small groups can draw meaningful conclusions and make informed decisions by maximizing their current capabilities.
Data-Ed Webinar: The Seven Deadly Data Sins - Emerging from Management PurgatoryDATAVERSITY
While wrath and envy are best left for human resources to address, overcoming the numerous obstacles that often inhibit successful data management must be a full organizational effort. The difficulty of implementing a new data strategy often goes underappreciated, particularly the multi-faceted nature of the challenges that need to be met. Deficiencies in organizational readiness and core competence represent clearly visible problems faced by data managers, but beyond that there are several cultural and structural barriers common to virtually all organizations that must be eliminated in order to facilitate effective management of data.
In this webinar, we will discuss these barriers—the titular “Seven Deadly Data Sins”, and in the process will also:
Elaborate upon the three critical factors that lead to strategy failure
Demonstrate a two-stage data strategy implementation process
Explore the sources and rationales behind the “Seven Deadly Data Sins”, and recommend solutions and alternative approaches
Analytics Strategy and Roadmap Offering v2 (1)Joey Amanchukwu
The document outlines a 4-step methodology for developing a data and analytics strategy: 1) Aligning to business priorities through stakeholder interviews, 2) Assessing the current state of data and analytics capabilities and identifying gaps, 3) Creating a future state blueprint with recommendations and a technology architecture, and 4) Prioritizing opportunities into a phased roadmap for implementation. The goal is to leverage data and analytics capabilities to create business value.
Webinar: Data Quality, Data Engineering, and Data ScienceDATAVERSITY
This webinar explores the organizational constructs and processes for enabling business to build better insights through Data Quality, Data Engineering, and Data Science. In particular, it examines the needs for:
A Data Lab to foster an open, questioning, and collaborative environment to develop the right data principles, patterns, and standards.
A Data Factory to implement those standards developed in the Data Lab.
Different Data Quality requirements in the Lab and Factory, how Data Engineering aims to meet both needs.
Data Engineering, in advance of the sexier Data Science, to create the right environments in both the lab and the factory and to actually examine the data.
All of the above to provide the data needed to create more efficient processes for the Data Scientists to be more effective in their roles.
Join this webinar to hear Tom “The Data Doc” Redman discuss with Dr. Prashanth Southekal, recent author of Data for Business Performance, the details of achieving better insights with examples of a case study from an Oil and Gas company.
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoSri Ambati
The document discusses advanced analytics at Macys.com. It outlines the challenges of big data predictive modeling such as scaling models, ensuring timely models, integrating models, and testing models. It describes Macys.com's advanced analytics team which includes data scientists with backgrounds in quantitative fields. The team works on projects such as personalized site recommendations, response propensity models, customer acquisition/retention modeling, and experimentation platforms. It provides examples of Macys.com's real-time site personalization and customer segmentation work.
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...DATAVERSITY
This document summarizes a presentation on self-service data analysis, data wrangling, data munging, and how they fit together with data modeling. It discusses how these techniques allow business stakeholders and data scientists to prepare and transform data for analysis without extensive technical expertise. While these tools increase flexibility, they can also decrease governance if not used properly. The document advocates finding a balance between managed data assets and exploratory analysis to maximize insights while maintaining data quality.
The document outlines five questions to consider when analyzing data from courses: 1) What does the data tell you? 2) What does the data not tell you? 3) What are the celebrations about the data? 4) What opportunities for improvement does the data allow? 5) Based on your analysis, what are the next steps and timeline? It provides guidance on focusing the analysis to find both positive and negative trends, missing information, areas for celebration or improvement, and developing an action plan.
RWDG Slides: Using Agile to Justify Data GovernanceDATAVERSITY
The Agile development methodology is here to stay. Data Governance is not going away any time soon. These two discipline share some common ground but often compete when it comes to the “right” thing to do when it comes to managing the data. The disciplines need to learn to play well together. The old mantra of “do unto others” applies here in a big way.
In this month’s Real-World Data Governance webinar, Bob Seiner will share tips and techniques to take advantage of the Agile methodology to justify the need for, and practice of, Data Governance. The two disciplines are the core of delivering on-time quality data through timely applications. You will walk away from this session inspired to try ideas on your own organization.
This webinar will cover:
• The governance aspects of Agile
• Why Data Governance Practitioners Should Embrace Agile
• Agile considerations for Data Governance
• The audience of both Agile and Data Governance
• How to Use Agile to Justify Data Governance
Data analytics is the need of any organization using any branded erp software, home grown erp or using MS Excel. To grow business to new verticals Data Analytics show the insights of business!
Data-Ed Slides: Exorcising the Seven Deadly Data SinsDATAVERSITY
The difficulty of implementing a new data strategy often goes underappreciated, particularly the multi-faceted procedural challenges that need to be met while doing so. Deficiencies in organizational readiness and core competence represent clearly visible problems faced by data managers, but beyond that there are several cultural and structural barriers common to virtually all organizations that must be eliminated in order to facilitate effective management of data. This webinar will discuss these barriers--as well as the titular "Seven Deadly Data Sins"--and in the process will also:
- Elaborate upon the three critical factors that lead to strategy failure
- Demonstrate a two-stage data strategy implementation process
- Explore the sources and rationales behind the “Seven Deadly Data Sins”, and recommend solutions and alternative approaches
The document summarizes activities of the Digital Analytics Association (DAA). It discusses the DAA's history and growth over time, including an increasing number of corporate members, individual members, and local chapters. It outlines the DAA's educational programs and certifications, events, resources like reports and guides, and goals for further expanding its offerings in 2017.
#Datacaeer - AI Guild workshop on data roles in industry with Adam GreenAI Guild
Based on AI Guild career coaching this workshop looks at roles such as Data Analyst, Data Scientist, and Data Engineer in industry and startups. We discuss emerging specialization, and how to upgrade your competence profile. Also included, tips and tricks from practitioners on how to find your next role.
Please find the event series on aiguild.eventbrite.com
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Christopher Gutknecht
In this data management session, Christopher describes how to build robust and reliable data products in BigQuery and dbt, for PPC and SEO use cases. After an introduction to the modern data stack, six principles of reliable data products are presented, followed by the following use cases:
- Google Ads Conversion upload
- SEO sitemap efficiency report
- Google Shopping product rating sync
- Large-Scale link checker with advertools
- Inventory-based PPC campaigns with dbt
Here is the referenced selection of gists on github: https://gist.github.com/ChrisGutknecht
Predictive Analytics - How to get stuff out of your Crystal BallDATAVERSITY
Everyone wants to leverage data. The optimal implementation of analytics is an organization-wide set of capabilities. These are called advantageous organizational analytic capabilities in that a clear ROI is demonstrable from these efforts. Turns out that there are a number of prerequisites to advantageous organizational analytics. These include:
Adopting a crawl, walk, run strategy
Understanding current and potential organizational maturity and corresponding capabilities
Achieving an appropriate technology/human capability balance
Implementing useful IT systems development practices
Installing necessary non-IT leadership
This webinar will explore these and other topics using examples drawn from DOD, healthcare researchers, and donation center operations.
Data Driven Strategy Analytics Technology Approach CorporateSlideTeam
This complete deck can be used to present to your team. It has PPT slides on various topics highlighting all the core areas of your business needs. This complete deck focuses on Data Driven Strategy Analytics Technology Approach Corporate and has professionally designed templates with suitable visuals and appropriate content. This deck consists of total of thirteen slides. All the slides are completely customizable for your convenience. You can change the colour, text and font size of these templates. You can add or delete the content if needed. Get access to this professionally designed complete presentation by clicking the download button below. https://bit.ly/3yjusdQ
This document discusses best practices for working in the gig economy as an independent contractor on data and analytics projects. It recommends finding the right fit between contractor skills and project needs, committing to an agile or waterfall project management approach, setting quantitative goals, creating extensible code and documentation, and over-communicating through frequent updates rather than relying on emails. The document concludes with case studies comparing two different broadcaster clients' projects that illustrates these principles in action and contrasts their outcomes.
Reinventing the Modern Information Pipeline: Paxata and MapRLilia Gutnik
(Presented at MapR's Big Data Everywhere event in Redwood City, CA in December 2016)
The relationship between business teams and IT has changed as the complexity of data has increased. A traditional data pipeline designed for an IT-centered approach to information management is not designed for the data demands of today's business decisions. Designing a big data strategy requires modernizing previous approaches. Self-service data preparation in a collaborative, intuitive, governed, and secure environment is the key to a nimble and decisive business unit.
Four Key Considerations for your Big Data Analytics StrategyArcadia Data
This document discusses considerations for big data analytics strategies. It covers how big data analytics have evolved from focusing on structured data and batch processing to also including real-time, multi-structured data from various sources. It emphasizes that discovery is key and requires visual exploration of granular data details. Native big data analytics platforms are needed that can handle real-time streaming data and provide self-service capabilities through customizable applications. The document provides examples of how various companies are using big data analytics for applications like cybersecurity, customer analytics, and supply chain optimization.
This document discusses how to build a data-driven organization by collecting and analyzing metrics. It emphasizes that data is important for making decisions, hitting goals, and knowing if systems are working properly. The author promotes their tool called Larimar, which aims to automate data collection and analysis at the application level to provide insights without configuration. Building a data culture where employees are inspired to act on insights is key to success.
This document discusses the importance of data quality and data governance. It states that poor data quality can lead to wrong decisions, bad reputation, and wasted money. It then provides examples of different dimensions of data quality like accuracy, completeness, currency, and uniqueness. It also discusses methods and tools for ensuring data quality, such as validation, data merging, and minimizing human errors. Finally, it defines data governance as a set of policies and standards to maintain data quality and provides examples of data governance team missions and a sample data quality scorecard.
Dataiku is a collaborative data science platform that allows teams to prototype, design, and run data science projects at scale across various technologies and locations. It has over 300% growth in users and is used by many leading companies. Dataiku was named a "visionary" in Gartner's 2017 Magic Quadrant for data science platforms based on its completeness of vision.
Stop searching for that elusive data scientistYogita Bansal
Companies are increasingly seeking data scientists to drive data-based decision making, but there is a lack of qualified candidates. To address this, companies should build effective teams by coordinating existing resources, promoting a data-focused culture, and encouraging all members to contribute insights from available data. Even small groups can draw meaningful conclusions and make informed decisions by maximizing their current capabilities.
Data-Ed Webinar: The Seven Deadly Data Sins - Emerging from Management PurgatoryDATAVERSITY
While wrath and envy are best left for human resources to address, overcoming the numerous obstacles that often inhibit successful data management must be a full organizational effort. The difficulty of implementing a new data strategy often goes underappreciated, particularly the multi-faceted nature of the challenges that need to be met. Deficiencies in organizational readiness and core competence represent clearly visible problems faced by data managers, but beyond that there are several cultural and structural barriers common to virtually all organizations that must be eliminated in order to facilitate effective management of data.
In this webinar, we will discuss these barriers—the titular “Seven Deadly Data Sins”, and in the process will also:
Elaborate upon the three critical factors that lead to strategy failure
Demonstrate a two-stage data strategy implementation process
Explore the sources and rationales behind the “Seven Deadly Data Sins”, and recommend solutions and alternative approaches
Analytics Strategy and Roadmap Offering v2 (1)Joey Amanchukwu
The document outlines a 4-step methodology for developing a data and analytics strategy: 1) Aligning to business priorities through stakeholder interviews, 2) Assessing the current state of data and analytics capabilities and identifying gaps, 3) Creating a future state blueprint with recommendations and a technology architecture, and 4) Prioritizing opportunities into a phased roadmap for implementation. The goal is to leverage data and analytics capabilities to create business value.
Webinar: Data Quality, Data Engineering, and Data ScienceDATAVERSITY
This webinar explores the organizational constructs and processes for enabling business to build better insights through Data Quality, Data Engineering, and Data Science. In particular, it examines the needs for:
A Data Lab to foster an open, questioning, and collaborative environment to develop the right data principles, patterns, and standards.
A Data Factory to implement those standards developed in the Data Lab.
Different Data Quality requirements in the Lab and Factory, how Data Engineering aims to meet both needs.
Data Engineering, in advance of the sexier Data Science, to create the right environments in both the lab and the factory and to actually examine the data.
All of the above to provide the data needed to create more efficient processes for the Data Scientists to be more effective in their roles.
Join this webinar to hear Tom “The Data Doc” Redman discuss with Dr. Prashanth Southekal, recent author of Data for Business Performance, the details of achieving better insights with examples of a case study from an Oil and Gas company.
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoSri Ambati
The document discusses advanced analytics at Macys.com. It outlines the challenges of big data predictive modeling such as scaling models, ensuring timely models, integrating models, and testing models. It describes Macys.com's advanced analytics team which includes data scientists with backgrounds in quantitative fields. The team works on projects such as personalized site recommendations, response propensity models, customer acquisition/retention modeling, and experimentation platforms. It provides examples of Macys.com's real-time site personalization and customer segmentation work.
Self-Service Data Analysis, Data Wrangling, Data Munging, and Data Modeling –...DATAVERSITY
This document summarizes a presentation on self-service data analysis, data wrangling, data munging, and how they fit together with data modeling. It discusses how these techniques allow business stakeholders and data scientists to prepare and transform data for analysis without extensive technical expertise. While these tools increase flexibility, they can also decrease governance if not used properly. The document advocates finding a balance between managed data assets and exploratory analysis to maximize insights while maintaining data quality.
The document outlines five questions to consider when analyzing data from courses: 1) What does the data tell you? 2) What does the data not tell you? 3) What are the celebrations about the data? 4) What opportunities for improvement does the data allow? 5) Based on your analysis, what are the next steps and timeline? It provides guidance on focusing the analysis to find both positive and negative trends, missing information, areas for celebration or improvement, and developing an action plan.
RWDG Slides: Using Agile to Justify Data GovernanceDATAVERSITY
The Agile development methodology is here to stay. Data Governance is not going away any time soon. These two discipline share some common ground but often compete when it comes to the “right” thing to do when it comes to managing the data. The disciplines need to learn to play well together. The old mantra of “do unto others” applies here in a big way.
In this month’s Real-World Data Governance webinar, Bob Seiner will share tips and techniques to take advantage of the Agile methodology to justify the need for, and practice of, Data Governance. The two disciplines are the core of delivering on-time quality data through timely applications. You will walk away from this session inspired to try ideas on your own organization.
This webinar will cover:
• The governance aspects of Agile
• Why Data Governance Practitioners Should Embrace Agile
• Agile considerations for Data Governance
• The audience of both Agile and Data Governance
• How to Use Agile to Justify Data Governance
Data analytics is the need of any organization using any branded erp software, home grown erp or using MS Excel. To grow business to new verticals Data Analytics show the insights of business!
Data-Ed Slides: Exorcising the Seven Deadly Data SinsDATAVERSITY
The difficulty of implementing a new data strategy often goes underappreciated, particularly the multi-faceted procedural challenges that need to be met while doing so. Deficiencies in organizational readiness and core competence represent clearly visible problems faced by data managers, but beyond that there are several cultural and structural barriers common to virtually all organizations that must be eliminated in order to facilitate effective management of data. This webinar will discuss these barriers--as well as the titular "Seven Deadly Data Sins"--and in the process will also:
- Elaborate upon the three critical factors that lead to strategy failure
- Demonstrate a two-stage data strategy implementation process
- Explore the sources and rationales behind the “Seven Deadly Data Sins”, and recommend solutions and alternative approaches
The document summarizes activities of the Digital Analytics Association (DAA). It discusses the DAA's history and growth over time, including an increasing number of corporate members, individual members, and local chapters. It outlines the DAA's educational programs and certifications, events, resources like reports and guides, and goals for further expanding its offerings in 2017.
#Datacaeer - AI Guild workshop on data roles in industry with Adam GreenAI Guild
Based on AI Guild career coaching this workshop looks at roles such as Data Analyst, Data Scientist, and Data Engineer in industry and startups. We discuss emerging specialization, and how to upgrade your competence profile. Also included, tips and tricks from practitioners on how to find your next role.
Please find the event series on aiguild.eventbrite.com
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Christopher Gutknecht
In this data management session, Christopher describes how to build robust and reliable data products in BigQuery and dbt, for PPC and SEO use cases. After an introduction to the modern data stack, six principles of reliable data products are presented, followed by the following use cases:
- Google Ads Conversion upload
- SEO sitemap efficiency report
- Google Shopping product rating sync
- Large-Scale link checker with advertools
- Inventory-based PPC campaigns with dbt
Here is the referenced selection of gists on github: https://gist.github.com/ChrisGutknecht
Big Data for Data Scientists - Info SessionWeCloudData
In this talk, WeCloudData introduces the Hadoop/Spark ecosystem and how businesses use big data tools and platforms. For more detail about WeCloudData's big data for data scientist course please visit: https://weclouddata.com/data-science/
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between data stores. AWS Glue simplifies and automates the difficult and time consuming tasks of data discovery, conversion mapping, and job scheduling so you can focus more of your time querying and analyzing your data using Amazon Redshift Spectrum and Amazon Athena. In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
This document discusses how data science models have transitioned to the cloud to take advantage of greater computing resources. It notes that data science models are resource-intensive and traditionally required powerful local machines. The cloud allows data scientists to run models on cloud infrastructure for lower costs than high-end laptops and with access to many GPUs. Several major cloud platforms - Azure, AWS, and Google Cloud - are discussed and compared in terms of their machine learning offerings. The document also introduces Microsoft's Team Data Science Process, which aims to help data science teams collaborate more effectively on projects in the cloud.
Using Compass to Diagnose Performance Problems MongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Level: 200 (Intermediate)
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
Using Compass to Diagnose Performance Problems in Your ClusterMongoDB
Using Compass to Diagnose Performance Problems in Your Cluster
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Date/Time: June 20, 1:50 PM
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
While the adoption of machine learning and deep learning techniques continue to grow, many organizations find it difficult to actually deploy these sophisticated models into production. It is common to see data scientists build powerful models, yet these models are not deployed because of the complexity of the technology used or lack of understanding related to the process of pushing these models into production.
As part of this talk, I will review several deployment design patterns for both real-time and batch use cases. I’ll show how these models can be deployed as scalable, distributed deployments within the cloud, scaled across hadoop clusters, as APIs, and deployed within streaming analytics pipelines. I will also touch on topics related to security, end-to-end governance, pitfalls, challenges, and useful tools across a variety of platforms. This presentation will involve demos and sample code for the the deployment design patterns.
There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.
But where are the data science and data engineering patterns?
Sometimes, data engineering reminds me of cowboy coding - many workarounds, immature technologies and lack of market best practices.
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
This is a presentation made on the 13th August 2014 at the SF Data Mining Meetup at Trulia. It's about Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex
Architecting an Open Source AI Platform 2018 editionDavid Talby
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
Introduction to Machine Learning - WeCloudDataWeCloudData
WeCloudData offers data science training programs and customized corporate training. They have 21 part-time instructors and 2 full-time instructors with expertise in tools like Python, Spark, and AWS. WeCloudData organizes data science meetup events and conferences, and provides workshops at various conferences. Their Applied Machine Learning course teaches tools and techniques over 12 sessions, includes a hands-on project, and helps with interview preparation.
Introduction to Machine Learning - WeCloudDataWeCloudData
In this talk, WeCloudData introduces the lifecycle of machine learning and its tools/ecosystems. For more detail about WeCloudData's machine learning course please visit: https://weclouddata.com/data-science/
Data Scientists and Machine Learning practitioners, nowadays, seem to be churning out models by the dozen and they continuously experiment to find ways to improve their accuracies. They also use a variety of ML and DL frameworks & languages , and a typical organization may find that this results in a heterogenous, complicated bunch of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production" ? How does an organization scale inference engines out & make them available for real-time applications without significant latencies ? There needs to be different techniques for batch (offline) inferences and instant, online scoring. Data needs to be accessed from various sources and cleansing, transformations of data needs to be enabled prior to any predictions. In many cases, there maybe no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorizations built in, approval processes and still support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes based offering for the Private Cloud and optimized for the HortonWorks Hadoop Data Platform. DSX essentially brings in typical software engineering development practices to Data Science, organizing the dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies and even rollback models & custom scorers as well as how API based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
This document discusses principles for applying continuous delivery practices to machine learning models. It begins with background on the speaker and their company Indix, which builds location and product-aware software using machine learning. The document then outlines four principles for continuous delivery of machine learning: 1) Automating training, evaluation, and prediction pipelines using tools like Go-CD; 2) Using source code and artifact repositories to improve reproducibility; 3) Deploying models as containers for microservices; and 4) Performing A/B testing using request shadowing rather than multi-armed bandits. Examples and diagrams are provided for each principle.
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
This document provides a tutorial on how to develop a digital library using the Greenstone open source software (OSS). It discusses what digital information is, the purpose of digital libraries, and walks through the steps to create a new collection in Greenstone including gathering content, enriching metadata, designing search types, indexes, cross-collection searching and browsing classifiers. It emphasizes properly linking metadata fields to collection data and provides tips for each step.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your companies big data solution.
Similar to Leveraging an in-house modeling framework for fun and profit (20)
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Leveraging an in-house modeling framework for fun and profit
1. Leveraging an in-house modeling
framework for fun and profit
Mike Skarlinski & Brian Graham
{michael.skarlinski, brian.graham}@weightwatchers.com
June 2019
2. Outline
• Introduction: data science at WW – the new Weight Watchers
• Problem: scalable, simple modeling and recommendation systems with a small team
• Solution: design and benefits of building a framework
• Implementation: Examples of deployed recommenders
3.
4. WW is a data driven application to help members
on their wellness journeys
Member Social
Network
Activity & Food
tracking
Weight progress &
goals
Recipe & food
database
5. As a new team, we are tasked with building a
foundation of data products
Social
Network:
Connect
Growth
WW
Program
Infra-
structure
Churn model
Return model
LTV models
Single Member View
Recipe recommender
Similar recipes
Composite foods ontology
Personalized feed
Groups search
Who to follow
APIs
Primrose
6. Data science team’s success hinges on effectively
sharing work and knowledge
openopen
Brian
Graham
Reka
Daniel-Weiner
Yameng
(Eliza) Zhang
Kevin
Zecchini
Carl
Anderson
Michael (Mike)
Skarlinski
open
Dec.
2019
May
2018
Jan.
2019
Mar.
2019
Feb.
2019
...
(Hint hint)
How can we build software that helps us grow and develop as a team?
8. Taking stock of our own challenges at WW
What would make a good recommender system at WW?
Slow serialization
but our medium data
can be kept in RAM...
No live features
but we know Docker, k8s...
Easy onboarding
mono repo with config as code...
9. We built a framework to solve our challenges and
enforce our design decisions
(Open source coming soon!!!!!)
11. Primrose has features to address each design
consideration
Python in-memory DAG runner, with no
serialization between nodes of the DAG.
DAG is defined as configuration-as-code
approach -- one container for all models
Abstract ML and data manipulation operations,
data scientists can easily extend the framework
Data science Infrastructure People
Primrose: (Production In-Memory Solution) framework for solving
WW’s most common use cases, caching batched predictions with
machine-learning engineering baked-in.
12. Primrose jobs are executed as Directed Acyclic
Graphs (DAG)s in python
Flexibility: any number of operations
allowed in a single DAG, across any
python library
Data and functions are passed between
nodes in an object that understands how
to extract the correct data for each node
13. DAGs are composed of implementation agnostic,
extensible nodes for data science
Data scientists can write any class that
matches the abstract interface &
incorporate in their DAGs
Data scientists can write individual nodes using
any Python framework or library they choose
14. Primrose is run like an ETL pipeline in a single
docker container for each configuration
15. For simpler deployments: Primrose uses a
“configuration as code” approach
Object configuration and DAG structure
are build in a configuration JSON
Primrose validates the configuration
and instantiates the correct classes at
runtime
Different outputs and results for each
DAG
Recipe recommender DAG JSON
Churn Model DAG JSON
Connect Feed DAG JSON
Primrose container Success, fame, money...
16. The framework has helped our team grow
and develop production models
Deployed 3 production
models and 3 production
recommenders
Onboarded 6 members in less
than a year, everyone is working
in the framework!
We’re going to open-source Primrose !!! Keep on the lookout or contact us!
19. We know you and meet you where you are.
coffee
croissant
fish tacos
apple
cobb salad
pasta with red sauce
ice cream
Personalize your
experience using your data
21. Similar Recipes Flow
US WW Recipes
Similar Ingredients
Similar Names
Filters
dietary
course
cuisine
main ingredient
document = ingredient list or name string
lemmatize, tokenize, TF-IDF
Cosine similarity
Rank
*Only recipes with images*
22. Business Logic (filters)
Productionalize in Primrose DAG
Google BigQuery Data lake Reader
NLTK + Custom Lemmatization
Sklearn TF-IDF + cosine similarity
Write to GCS Bucket and Google MemoryStore
Success!
logging.info(‘Your newbie DS has written production quality code.’)
23. Business Logic (filters)
Productionalize in Primrose DAG
Google BigQuery Data lake Reader
NLTK + Custom Lemmatization
Sklearn TF-IDF + cosine similarity
Write to GCS Bucket and Google MemoryStore
Success!
logging.info(‘Your newbie DS has written production quality code.’)
24. Business Logic (filters)
Productionalize in Primrose DAG
Google BigQuery Data lake Reader
NLTK + Custom Lemmatization
Sklearn TF-IDF + cosine similarity
Write to GCS Bucket and Google MemoryStore
Success!
logging.info(‘Your newbie DS has written production quality code.’)
25. Dinner Recommendations Flow
US WW Recipes
Similar Ingredients
Similar Names Business Logic
Eligible Members
2 weeks of tracking history
Tracked >= 1 recipe
US members
Potential Recs
tracked
most similar
X XX
X
2nd most sim.
n = 4 recommendations
26. Productionalizing is easier the second time
Same BQ reader class,
different SQL input file
New postprocess class to sort, filter and interleave potential recommendations
Success!
logging.warning(‘Data Scientist is developing software engineering skills.’)