The Other 99% of a Data Science Project

Slides from my talk at Open Data Science Conference 2016. Algorithms and models are an important (and cool) part of data science. This talk is about all the other steps it takes to deploy a data science project that makes a product slightly smarter: the things you hear from practitioners that books do not cover well enough.

THE OTHER 99% OF A DATA SCIENCE PROJECT
Open Data Science Conference
Santa Clara | November 4-6th 2016
Eugene Mandel
@eugmandel

ABOUT ME
∎ @eugmandel
∎ lead of data science at Directly
∎ formerly:
□ data science team at Jawbone
□ co-founder of Qualaroo and Jaxtr

DATA SCIENCE NEEDS PRODUCT MANAGEMENT
success of a data science project has as much to do with product management as with data science

2 KINDS OF DATA SCIENCE
A: ANALYZE
B: BUILD

SMART PRODUCTS
∎ “don’t you know me?!” -> “you get me!”
∎ get smarter with every interaction
∎ reduce search space

PARKING APP
ON DEMAND CUSTOMER SUPPORT

PROBLEM: choose support tickets that expert users can resolve

CHOOSE RESOLVABLE TICKETS WITH MACHINE LEARNING

CLEAN YOUR DATA
Automated bug reports
Surveys
Bounced emails
Internal tickets
Email metadata
Email threads
...
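
A minimal sketch of what this cleaning step could look like, assuming each ticket is a dict with hypothetical `sender`, `subject` and `source` fields. The field names, the company domain and the filter rules are illustrative assumptions, not the pipeline described in the talk.

```python
import re

# Hypothetical rules: each one flags a category of ticket the slide lists as noise.
BOUNCE_SUBJECT = re.compile(
    r"(undeliverable|delivery status notification|mail delivery failed)", re.I
)
INTERNAL_DOMAIN = "example.com"  # assumed company domain


def noise_reason(ticket):
    """Return why a ticket should be dropped, or None if it looks like a real customer ticket."""
    sender = ticket.get("sender", "").lower()
    subject = ticket.get("subject", "")
    if ticket.get("source") == "crash_reporter":
        return "automated bug report"
    if ticket.get("source") == "survey":
        return "survey"
    if BOUNCE_SUBJECT.search(subject) or sender.startswith("mailer-daemon@"):
        return "bounced email"
    if sender.endswith("@" + INTERNAL_DOMAIN):
        return "internal ticket"
    return None


def clean(tickets):
    """Keep only tickets that match none of the noise rules."""
    return [t for t in tickets if noise_reason(t) is None]


tickets = [
    {"sender": "ann@customer.org", "subject": "Cannot reset password", "source": "email"},
    {"sender": "MAILER-DAEMON@mx.customer.org", "subject": "Undeliverable: Re: your ticket", "source": "email"},
    {"sender": "bob@example.com", "subject": "Test ticket", "source": "email"},
    {"sender": "app@customer.org", "subject": "Crash in v2.1", "source": "crash_reporter"},
]
cleaned = clean(tickets)
```

Returning a reason rather than a bare boolean makes it easy to count how many tickets each rule removes, which helps when explaining the cleaning to others.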

TRAINING - COLD START PROBLEM
all tickets
tickets seen by expert

TRAINING - GET LABELS
“Is there a cat in this picture?” “Is this support ticket resolvable?”

TRAINING - GET LABELS
∎ label manually
∎ derive labels from user behavior
∎ derive labels from external sources
∎ mix
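
As a rough illustration of deriving labels from user behavior and mixing them with manual ones, the sketch below maps a ticket's event history to a label. The event names (`expert_resolved`, `expert_skipped`, `rerouted_to_agent`, `negative_rating`) and the precedence rule are assumptions made for this example.

```python
def derive_label(events):
    """Map a ticket's event history to 1 (resolvable), 0 (not resolvable) or None (no evidence).

    `events` is a list of hypothetical event names recorded for one ticket.
    """
    if "expert_resolved" in events and "negative_rating" not in events:
        return 1
    if "expert_skipped" in events or "rerouted_to_agent" in events or "negative_rating" in events:
        return 0
    return None  # never acted on by an expert: leave unlabeled rather than guess


def build_training_set(history, manual_labels=None):
    """Mix behavior-derived labels with manual ones; manual labels win on conflict."""
    manual_labels = manual_labels or {}
    labeled = {}
    for ticket_id, events in history.items():
        label = manual_labels.get(ticket_id, derive_label(events))
        if label is not None:
            labeled[ticket_id] = label
    return labeled


history = {
    "t1": ["shown_to_expert", "expert_resolved"],
    "t2": ["shown_to_expert", "expert_skipped"],
    "t3": ["shown_to_expert", "expert_resolved", "negative_rating"],
    "t4": [],
}
labels = build_training_set(history, manual_labels={"t4": 0})
```

Tickets with no evidence stay out of the training set unless someone labels them by hand, which is one way to avoid treating "never seen by an expert" as "not resolvable".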

My favorite data science algorithm is division.
Monica Rogati
Former VP of Data, Jawbone & LinkedIn data scientist

MODEL
Tokenization
Bag of words (BOW)
Tf–idf
Random Forest Classifier
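
To make the first three steps concrete, here is a small standard-library sketch of tokenization, bag of words and one common tf-idf variant (raw count times log(N / document frequency)). The example tickets are invented, and the classifier step is left out: in practice the resulting vectors would be fed to a random forest, for instance scikit-learn's `RandomForestClassifier`, whose `TfidfVectorizer` uses a smoothed idf by default and so gives different numbers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(docs):
    """One Counter of token counts per document."""
    return [Counter(tokenize(d)) for d in docs]


def tfidf(bows):
    """Weight each count by log(N / document frequency)."""
    n_docs = len(bows)
    # Iterating a Counter yields each distinct token once, so this counts documents.
    doc_freq = Counter(token for bow in bows for token in bow)
    return [
        {tok: count * math.log(n_docs / doc_freq[tok]) for tok, count in bow.items()}
        for bow in bows
    ]


docs = [
    "How do I reset my password?",
    "Password reset email never arrived",
    "Refund for a duplicate charge",
]
vectors = tfidf(bag_of_words(docs))
```

Words that appear in many tickets ("password", "reset") get a lower weight than words that appear in only one ("refund"), which is the point of the idf term.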

PLAYING WELL WITH ENGINEERING
∎ gaining trust
∎ development process

POINTS OF INTEGRATION
online or offline?

IS IT WORKING?
evaluating data products
Image source: https://themouseandthewindmill.wordpress.com

EVALUATION METRICS
accuracy
precision/recall
driven by business
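
A short worked example of why accuracy alone can mislead and why the choice of metric ends up being driven by the business. The toy labels below are invented: 2 resolvable tickets out of 10, scored once by a baseline that always predicts "not resolvable" and once by a hypothetical model.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = resolvable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Accuracy, precision and recall, with 0.0 when a denominator is empty."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_no = [0] * 10                      # baseline: never route anything
model = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]    # hypothetical model output

baseline_scores = metrics(y_true, always_no)
model_scores = metrics(y_true, model)
```

Both predictors reach 0.8 accuracy here, but the baseline has zero precision and recall. Whether a false positive (an expert gets a ticket they cannot resolve) or a false negative (a resolvable ticket is never offered) costs more is a business question, not a modeling one.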

IS IT WORKING?
QA’ing data products
Image source: https://themouseandthewindmill.wordpress.com

THE KNOBS: HOW TO CONTROL THE PRODUCT
∎ on/off switch per customer
∎ prediction threshold
∎ exclusions
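
The three knobs could be wired together roughly as follows. This is a sketch under assumptions: the `CustomerConfig` class, its defaults, and the idea of expressing exclusions as a set of terms are all hypothetical, and the whitespace split is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerConfig:
    """Per-customer controls; all names and defaults here are hypothetical."""
    enabled: bool = False                               # on/off switch per customer
    threshold: float = 0.8                              # minimum predicted probability
    excluded_terms: set = field(default_factory=set)    # exclusions, e.g. sensitive topics


def should_route_to_expert(ticket_text, score, config):
    """Apply the three knobs in order: switch, exclusions, threshold."""
    if not config.enabled:
        return False
    words = set(ticket_text.lower().split())
    if words & config.excluded_terms:
        return False
    return score >= config.threshold


config = CustomerConfig(enabled=True, threshold=0.7, excluded_terms={"refund", "legal"})
routed = should_route_to_expert("How do I pair my device", 0.9, config)
```

Keeping these controls outside the model means the product can be tuned or switched off for one customer without retraining anything.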

NAMING THINGS
“... SMART ...”
“... AI ...”
“... MACHINE LEARNING ...”
“... INTELLIGENT ...”

UPDATING THE MODEL
∎ input data changes
∎ user behavior changes
∎ dataset grows
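
One simple way to notice that input data has changed is to track how much of the incoming text the model has never seen. The sketch below uses the share of out-of-vocabulary tokens as a retraining signal; the metric, the 0.3 cutoff and the example texts are assumptions for illustration, not a method from the talk.

```python
def oov_rate(training_vocab, new_texts):
    """Share of tokens in new_texts that never appeared in the training vocabulary."""
    tokens = [tok for text in new_texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in training_vocab)
    return unseen / len(tokens)


def needs_retraining(training_vocab, new_texts, max_oov=0.3):
    """Flag the model for retraining when too much of the input is unfamiliar."""
    return oov_rate(training_vocab, new_texts) > max_oov


vocab = {"reset", "password", "email", "my", "login"}
same_topic = ["reset my password", "login email"]
new_product = ["band will not sync", "reset my band"]

stable = needs_retraining(vocab, same_topic)
drifted = needs_retraining(vocab, new_product)
```

A check like this only covers changes in the input text. Changes in user behavior show up in the labels instead, so they need their own monitoring.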

NEGATIVE SAMPLING
send small % of predicted negative as if they were positive
predicted positive

NEGATIVE LABELING
send small % of predicted negative for manual labeling
predicted positive
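
Both negative sampling and negative labeling come down to the same routing decision: predicted positives go to experts, and a small random share of predicted negatives is let through anyway so the model keeps getting labels for tickets it would otherwise never see. A minimal sketch, where the threshold, the 5% rate and the decision names are assumptions:

```python
import random


def route(score, threshold, explore_rate, rng):
    """Decide what to do with one ticket given its predicted probability.

    Returns "expert" for predicted positives, "explore" for the small share of
    predicted negatives that is sent onward anyway, and "default" otherwise.
    """
    if score >= threshold:
        return "expert"
    if rng.random() < explore_rate:
        return "explore"
    return "default"


rng = random.Random(0)  # seeded so the sketch is reproducible
scores = [rng.random() for _ in range(10_000)]  # stand-in for model scores
decisions = [route(s, threshold=0.8, explore_rate=0.05, rng=rng) for s in scores]

predicted_negative = sum(1 for s in scores if s < 0.8)
explored = decisions.count("explore")
```

An "explore" ticket can then either be shown to experts as if it were positive (negative sampling) or placed in a queue for manual labeling (negative labeling). The rate is a trade-off between label coverage and the cost of deliberately sending tickets the model expects to fail.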

LABELING - HOW TO PHRASE THE QUESTION?
∎ “Would you be able to resolve this ticket successfully?”
∎ “Would an expert user be able to resolve this ticket successfully?”
∎ “Would an expert user be able to resolve this ticket successfully without getting a negative rating?”

MESSAGING
∎ customers
∎ sales
∎ account managers
∎ marketing
∎ execs

INTERPRETABILITY
Image source: https://en.wikipedia.org/wiki/File:Blue_Poles_(Jackson_Pollock_painting).jpg

CREDITS
∎ Presentation template by SlidesCarnival
∎ Images:
□ http://jedismedicine.blogspot.com/
□ Jawbone
□ Directly
□ Wikipedia
□ https://themouseandthewindmill.wordpress.com
□ http://www.imdb.com/

The Other 99% of a Data Science Project

1. THE OTHER 99% OF A DATA SCIENCE PROJECT Open Data Science Conference Santa Clara | November 4-6th 2016 Eugene Mandel @eugmandel

2. ABOUT ME ∎ @eugmandel ∎ lead of data science at Directly ∎ formerly: □ data science team at Jawbone □ co-founder of Qualaroo and Jaxtr

3. DATA SCIENCE NEEDS PRODUCT MANAGEMENT success of a data science project has as much to do with product management as with data science

4. 2 KINDS OF DATA SCIENCE A: ANALYZE B: BUILD

5. PAY FOR PARKING WITH YOUR PHONE

6. DON’T YOU KNOW ME?!

7. ∎ “don’t you know me?!” -> “you get me!” ∎ get smarter with every interaction ∎ reduce search space SMART PRODUCTS

8. SMART PRODUCTS BUT NOT THAT SMART...

9. SMART PRODUCTS GO PROBABILISTIC

10. THE OTHER 99% PERCENT algorithms

11. Show and explain your web, app or software projects using these gadget templates. PARKING APP ON DEMAND CUSTOMER SUPPORT

12. LOOKING FOR OPPORTUNITIES

13. PROBLEM: choose support tickets that expert users can resolve

14. LOOKING FOR OPPORTUNITIES

15. CHOOSE RESOLVABLE TICKETS WITH MACHINE LEARNING

16. GETTING THE DATA

17. GETTING ALLIES

18. GETTING THE DATA

19. CLEAN YOUR DATA Automated bug reports Surveys Bounced emails Internal tickets Email metadata Email threads ...

20. GUYS CLEAN A DATASET, GET RICH

21. FEATURE ENGINEERING

22. TRAINING - COLD START PROBLEM all tickets tickets seen by expert

23. TRAINING -GET LABELS “Is there a cat in this picture?” “Is this support ticket resolvable?”

24. TRAINING -GET LABELS ∎ label manually ∎ derive labels from user behavior ∎ derive labels from external sources ∎ mix

25. My favorite data science algorithm is division. Monica Rogati Former VP of Data, Jawbone & LinkedIn data scientist

26. Tokenization Bag of words (BOW) Tf–idf Random Forest Classifier MODEL

27. DEVELOPMENT

28. PLAYING WELL WITH ENGINEERING ∎ gaining trust ∎ development process

29. POINTS OF INTEGRATION online or offline?

30. DEVELOPMENT integration - broad APIs

31. “NAPKIN ARCHITECTURE”

32. IS IT WORKING? evaluating data products Image source: https://themouseandthewindmill.wordpress.com

33. accuracy precision/recall driven by business EVALUATION METRICS

34. IS IT WORKING? QA’ing data products Image source: https://themouseandthewindmill.wordpress.com

35. PLAYING WELL WITH DEVOPS

36. BRIDGING TECH STACKS

37. IN PRODUCTION

38. THE KNOBS: HOW TO CONTROL THE PRODUCT ∎ on/off switch per customer ∎ prediction threshold ∎ exclusions

39. “... SMART…” “... AI …” “...MACHINE LEARNING…” “...INTELLIGENT…” NAMING THINGS

40. UPDATING THE MODEL ∎ input data changes ∎ users behaviour changes ∎ dataset grows

41. NEGATIVE SAMPLING send small % of predicted negative as if they were positive predicted positive

42. NEGATIVE LABELING send small % of predicted negative for manual labeling predicted positive

43. ∎ “Would you be able to resolve this ticket successfully?” ∎ “Would an expert user be able to resolve this ticket successfully?” ∎ “Would an expert user be able to resolve this ticket successfully without getting a negative rating?” LABELING - HOW TO PHRASE THE QUESTION?

44. ∎ customers ∎ sales ∎ account managers ∎ marketing ∎ execs MESSAGING

45. CUSTOMER ENGAGEMENT PLAYBOOK

46. DATA ETHICS

47. INTERPRETABILITY Image source:https://en.wikipedia.org/wiki/File:Blue_Poles_(Jackson_Pollock_painting).jpg

48. THANKS! Eugene Mandel @eugmandel

49. ∎ Presentation template by SlidesCarnival ∎ Images: □ http://jedismedicine.blogspot.com/ □ Jawbone □ Directly □ Wikipedia □ https://themouseandthewindmill.wordpress.com □ http://www.imdb.com/ CREDITS

Editor's Notes

There are 2 big areas of data science - A for “analyze” and B for “build”. A is product development informed by data. It has been adopted pretty widely by now. Having analytics, running A/B tests, doing cohort and funnel analysis have become part of the product management culture. The “build” kind of data science is about building smarts into the product itself and this is the kind I want to talk about. Implementing some of this requires machine learning and it is important for product managers to understand the level of complexity of some techniques that apply to their products. However, when machine learning is discussed, too much emphasis is put on the algorithms. More needs to be said about how a smart product gains humans’ trust and makes them feel good about using it.
An app that allows you to pay for parking. You fire it up, it shows 3 choices - start a new parking session, see your old sessions. Choose “Start a new session”, go to next screen, there are several options here - select a parking zone. Done. I would not give this a second thought on a desktop.
But when I use this app, I’m late, I hold the phone in one hand and am trying to pay for parking while running to the ferry. I am running and fumbling with the phone and thinking - DON’T YOU KNOW ME?! It’s a weekday morning, I am at the parking lot next to the ferry terminal, you have seen me here before. More than once. Just give me one button - PAY NOW. And a small link to all the other features.
Every time a user has this “DON’T YOU KNOW ME?!” moment, it is an opportunity to make a product just a little bit smarter. Smart products convert DON’T YOU KNOW ME?! into YOU GET ME! Even when they don’t know my next step exactly, they reduce the search space.
Smarter products - new problems. Complexity goes way beyond the algorithms.
Take the Nest smart thermostat - great visual design, easy to install, it is powered by machine learning that learns your preferences. It’s a good product, but even they can’t get it quite right. Got it when we just had our baby. We both like it pretty cool, but my wife felt cold after birth. This is just when Nest was learning. Once it did, for some reason it was very tough for it to adjust. Another thing - when it turns the heater on, there is no indicator of whether it was a human in the house or the software. I am OK correcting Nest. But not my wife. Making products smarter introduces probabilistic behavior. Because probabilistic behavior feels kind of like life, you start having different expectations. Northern California has some very hot days with cold mornings. On a day like that I would not turn the heater on in the morning. But Nest would. It just knows - get to 68 degrees. But it has no context - something that is easy and intuitive to a human is not easy for software.
Getting the relationship of the user with a smart product right is tricky. Product managers are the best people in a company to get the tradeoffs right. Just like a PM does not have to be a developer to manage a software product, she does not have to be a mathematician or a data scientist to manage a data product. But it is necessary to understand some core concepts. I'll use 2 data products to demonstrate some of these necessary concepts.
Here is the second data product. This one is B2B and is working in the background. Directly helps companies like Airbnb, LinkedIn, Pinterest with on-demand customer support. When users submit support tickets, some of these are sent to Directly, which distributes them to a network of expert users that are ready to answer them. If experts resolve a question successfully, they get paid and Directly takes a cut. Otherwise, the experts can reroute the ticket back to the customer’s call center.
When questions are created in the helpdesk, how do we find the ones that the expert users can (and want to) solve? Initially, we relied on our customers to configure some categories that their users chose when they were filling out the support form. Users are not great about categorizing their issues. We tried keywords. Very cumbersome to manage. We need to pick as many tickets as we can, but not create too much noise for the experts.
Solution: let us look at ALL your tickets as they come in and a machine learning model will choose which ones will be sent to the expert users. Here is how it works: ….. Explain the image. The model is a classifier and it needs examples to learn what a good ticket looks like. It can do so from watching how the experts respond to tickets they have seen earlier. If the experts took a ticket and resolved it successfully, it becomes a positive example. If they send the question back, or resolve it but the user reviews their answer negatively, this question becomes a negative example.
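The labeling rule described above can be sketched in a few lines of Python. This is only an illustration, not Directly's actual code: the `TicketOutcome` fields and the `derive_label` name are hypothetical, and treating a resolved but unrated ticket as a success is an assumption the notes do not settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TicketOutcome:
    """What happened to one ticket that an expert saw (hypothetical schema)."""
    taken_by_expert: bool       # an expert picked the ticket up
    rerouted: bool              # the expert sent it back to the call center
    resolved: bool              # the expert marked it resolved
    user_rating: Optional[str]  # "positive", "negative", or None if unrated


def derive_label(outcome: TicketOutcome) -> Optional[int]:
    """Return 1 (positive example), 0 (negative example), or None (no label)."""
    # Sent back to the call center: negative example.
    if outcome.rerouted:
        return 0
    # Resolved, but the user reviewed the answer negatively: negative example.
    if outcome.resolved and outcome.user_rating == "negative":
        return 0
    # Taken and resolved without a negative review: positive example.
    # Assumption: an unrated resolution counts as a success.
    if outcome.taken_by_expert and outcome.resolved:
        return 1
    # Never seen by an expert, or still open: no label can be derived yet.
    return None
```

Tickets that return `None` are the ones no expert has acted on, which is the cold start problem from slide 22: the model only gets labels for tickets that were routed to experts in the first place.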
ML startups ask companies “give us all your data”. I was preparing for a tough conversation. Getting access to more and better data… “Is it a yes?” Think of getting data early, before you need it. Legal. Stripping out anything personal. Insist on storing.
Customer success (Account managers) - interested. One of the main metrics they are responsible for is our ticket share - the percentage of tickets we are handling at a customer.
The improvements that you can get from cleaning your data are great. The plot of the movie The Big Short can be summarized as “guys clean a dataset, get rich”. In the case of Jawbone meal logging, the biggest lift in performance came from realizing that breakfasts are different from other meals. Spinach in the morning was probably a part of an omelet. Spinach at lunch was most likely a salad. Sometimes, cleaning your data requires a good understanding of the domain you are working with. Which properties of your data you do and don’t use is to a significant degree a product management decision. For example, different cuisines disagree on what foods are eaten best together. Do you use this knowledge somehow? Depends what you know about your users.
Monica Rogati, who used to be VP of data at Jawbone, has this saying:... Yes, you could use a much more advanced algorithm, but this simple one can get you pretty far. The biggest improvements were achieved by cleaning the data and understanding it deeply.
Account managers - interested
How do we know if a model is good? When "normal software” breaks, it breaks with high visibility. An issue with ML is that it will ALWAYS give you an answer. How do we compare models? An obvious metric is accuracy. Basically the percentage of predictions that the algorithm gets right. However, in product data science this is a very bad metric, because what it tells you depends on how balanced or unbalanced the classes that you are predicting are. Example: fraud detection, rare disease testing. If 0.1% of transactions are fraudulent, you can create a “very sophisticated” predictive model. When asked “Is this transaction fraudulent?” it will always say “no”. The accuracy of this model will be about 99.9%. Thinking through this is exactly the PM’s job. In this case you don’t need to know the math that underlies the predictive model. How do we QA data products?
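The fraud example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 10,000-transaction dataset is made up to match the 0.1% figure, and the function names are not from the talk. It shows why slide 33 pairs accuracy with precision and recall.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # With no positive predictions precision is undefined; 0.0 is used here by convention.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up data: 10,000 transactions, 10 of them fraudulent (0.1%).
y_true = [1] * 10 + [0] * 9990
# The "very sophisticated" model that always answers "no".
always_no = [0] * len(y_true)

print("accuracy :", accuracy(y_true, always_no))
print("precision:", precision(y_true, always_no))
print("recall   :", recall(y_true, always_no))
```

The always-"no" model scores 99.9% accuracy while its recall is zero: it never catches a single fraudulent transaction. Which of precision and recall matters more is the business decision the slide refers to.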
How do we QA data products? When "normal software” breaks, it breaks with high visibility. An issue with ML is that it will ALWAYS give you an answer. Monitoring in production
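Slides 38 and 41 list the production controls: an on/off switch per customer, a prediction threshold, exclusions, and sending a small percentage of predicted negatives to experts as if they were positive. A minimal sketch of how these could combine into one routing decision follows. Every name here (`CustomerConfig`, `route_ticket`, the `explore_rate` field, the default values) is hypothetical and not taken from the talk.

```python
import random
from dataclasses import dataclass, field
from typing import Set


@dataclass
class CustomerConfig:
    """Per-customer knobs (hypothetical names and defaults)."""
    enabled: bool = True                 # on/off switch per customer
    threshold: float = 0.5               # prediction threshold
    excluded_categories: Set[str] = field(default_factory=set)  # exclusions
    explore_rate: float = 0.02           # share of predicted negatives sent anyway


def route_ticket(score: float, category: str,
                 cfg: CustomerConfig, rng: random.Random) -> str:
    """Return 'experts', 'experts_explore', or 'call_center' for one ticket.

    `score` is the model's predicted probability that the ticket is resolvable.
    """
    if not cfg.enabled:
        return "call_center"
    if category in cfg.excluded_categories:
        return "call_center"
    if score >= cfg.threshold:
        return "experts"
    # Negative sampling: route a small fraction of predicted negatives to
    # experts so the model keeps getting labels for tickets it would reject.
    if rng.random() < cfg.explore_rate:
        return "experts_explore"
    return "call_center"


cfg = CustomerConfig(threshold=0.7, excluded_categories={"legal"})
rng = random.Random(42)  # seeded only to make the example repeatable
for score, category in [(0.9, "billing"), (0.9, "legal"), (0.3, "billing")]:
    print(score, category, "->", route_ticket(score, category, cfg, rng))
```

Tagging explored tickets separately ("experts_explore") keeps them distinguishable in later analysis. The negative labeling variant on slide 42 would send the same sample to manual labeling instead of to experts.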
Unless you are making the ultimate data product - a make-money-while-you-sleep fund runner :) - your system lives in the world and interacts with people. Once the product is out, other people carry the message and you cannot control it. Listen to how an account manager talks about this with a client, how a salesperson talks with a prospect. ML/DS is uniquely susceptible to BS - how to control it?
"Why did you show me ‘french fries’?" Well, because this is the item that is logged together most frequently with a burger. "Why did you decide that this transaction is fraudulent? Why did you decide that this customer support ticket is resolvable?" The simpler the model, the more interpretable it is. When a model is not easily interpreted, but it performs well, it’s your task to manage expectations.
