Yulan Lin @y3 2n
Justin Gosses @JustinGosses
Data Science & Software Engineering
Valador Inc.
Supporting NASA OCIO
Rice Data Science Conference, Oct. 2017
Practical Considerations for Data Science Consulting and Innovation in a Large Organization
Why practical
There’s a lot of conversation around what mathematical models are good, what
technologies to buy, or even what open-source libraries have the best
implementations of machine learning algorithms. However, there’s a lot more that
goes into effective data science and how it integrates into an organization’s
day-to-day than the math or the models. Here are some of the lessons that we’ve
learned working on an internal consulting team that serves a large organization
whose mission just might happen to include putting people in space.
Now might be a good time to introduce who we are and what we do. You might notice
that I keep using the word “we” here. My colleague and I submitted this talk because
we wanted to share with you some of our hard-earned lessons about the parts of data
science that have very little to do with code, feature selection, and model choice. He’s
currently in Virginia at a poster session, but will be presenting his portion via video,
and hopefully will also be online for Q&A.
This talk is NOT from the perspective of...
● Startup
● Built around a software product
● Small companies
Both Justin and I are scientists by training, so we like to make sure that our biases,
assumptions, and perspective are made explicit and clear. Let’s begin with the
caveats; the hedging as my thesis advisor would call it.
This talk IS from the perspective of
● Established organization (> 10 years old)
● Large organization (> 10,000 people)
● Core function is not software/technical
Clarification: not just NASA! Oil & Gas, Banking, Health, etc.
Now that we’ve gotten the caveats and what we are not out of the way, let’s talk about
what this talk is from the perspective of.
This talk is from the perspective of an organization that is established, has been
around for a while, and has ingrained in it culture, habits, and patterns from before data science
was the hot new buzzword. In addition, we want to make it clear that we’re talking
from the perspective of data scientists who are operating in an organization whose
sole focus isn’t building software products.
This isn’t necessarily NASA-specific. It draws not just on our own experiences, but also on
conversations with data scientists at other large companies in oil & gas, health, and
banking.
What is Data Science?
A grab bag at the intersection of math + code that includes
● Machine Learning
● Deep Learning
● Statistics
● Data Visualization
If we’re going to talk about practical considerations of Data Science, we’re going to
need to define it here. There are some really great Venn diagrams out there, and blog
posts expounding on what Data Science is as a field. For the purposes of this talk,
we’re going to talk about data science as a grab bag at the intersection of math and
code that includes machine learning, deep learning, statistics, and data visualization.
Roadmap of our talk
● Data access
● How data science adds value
● Influence of procurement constraints
● Communication & narratives
● How data scientists are distributed
In some ways, we’re trying to start small, and then zoom out. Data access is about
the “smallest” level: can we even get to the data we need? Adding value is convincing
stakeholders that data science matters. Procurement constraints start to reach
beyond data scientists and data science clients into the larger organization.
Communication is about communicating out the results of the data science. Finally,
the distribution of data scientists is a big-picture question about how we structure the
organization. With that, I’d like to pass it off to Justin, to talk about data access and the
value-add of data science.
Data Access
I’m (Justin) going to talk about data access in this section of the talk.
- It is common in data science talks and articles for someone to say data
science can be 80% data cleaning, and 20% actual analytics and modeling.
While that can sometimes be true, it assumes that you can get to the data in
the first place.
- For data science teams acting as an internal consultancy in a large
organization, obtaining access to actionable data can be a substantial task by
itself. Often the data owners are not you, your team, or even the people you’re
partnering with on your project, so getting good data is a common pain point.
- In this section, I’m going to talk about a few data access topics, including:
creating data that does not yet exist, whether data is programmatically
accessible, data cleaning concerns, data compliance, data as political power,
and finally the importance of having a good local partner on any data project.
Does the data exist?
● Is it actually useful for your problem?
● If you’re going to collect it or make it:
welcome to data engineering
- “Does the data exist yet?” is not always a straightforward question. Sometimes
the data exists but it would take too long to get it, or sometimes it exists but
doesn’t have all the attributes you need.
- If the data doesn’t exist yet, and you’re going to collect real
data, you’re now into data engineering,
- which involves data collection, reliability engineering, database engineering,
etc. These often have a tendency to slowly grow in scope as expectations
rise over time, even for toy prototype systems, so this is one area where you need to be
careful of scope creep.
- Another option is to say, “oh, I’ll just create some fake data”. There are a
variety of helpful libraries for creating fake data. However, in my experience a
product built on fake data almost always has to be rebuilt, to some extent,
once you get the real data, so that probable additional work should be built
into your schedule.
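As a rough illustration of the “fake data” option above, here is a minimal sketch that uses only Python’s standard library rather than any particular fake-data library. The record fields (`record_id`, `site`, `reading`, `recorded_on`) and the site names are hypothetical and are not taken from any real system; the point is only that a seeded generator gives a repeatable stand-in dataset to prototype against until the real data arrives.

```python
import random
from datetime import date, timedelta

# Hypothetical schema for illustration only; real data will differ.
SITES = ["site_a", "site_b", "site_c"]


def make_fake_records(n, seed=0):
    """Return n fake records; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    start = date(2017, 1, 1)
    records = []
    for i in range(n):
        records.append({
            "record_id": i,
            "site": rng.choice(SITES),
            # Normally distributed reading around an arbitrary mean of 50.
            "reading": round(rng.gauss(50.0, 10.0), 2),
            "recorded_on": start + timedelta(days=rng.randrange(365)),
        })
    return records


fake_records = make_fake_records(100)
print(fake_records[0])
```

Keeping the generator behind one function makes it easier to swap in the real data source later, which is where the rebuild effort mentioned above tends to show up.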
Is the data programmatically accessible?
- Getting the data out of legacy systems and into a form you can work with is
another common hurdle in a large organization.
- Large organizations have probably been around for a while. This means most of their
systems were made before data science was a thing, and many systems were
made before APIs or microservices were common. Basically, you might be
dealing with something designed for human entry and retrieval of data, one file
at a time, not programmatic access, which might be a problem if you need to
get out thousands of files, not just one.
- Often, if you’re working on a small study or prototype, you might be able to start with
scraping some of the data. By scraping, I mean accessing the system by
logging in as a human and then using a combination of the Selenium and Beautiful
Soup libraries to write a program to “pretend to be a person”, automating the
process of getting at least some of the data out.
- With this data, you can develop a small demo, and then use that demo to
argue to the data owners: hey look, we made something
cool that’s really useful for you. Therefore you should give us better data.
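The parsing half of the scraping approach described above can be sketched as follows. This is a hedged, minimal example: it leaves out the Selenium login and navigation step entirely, and it uses the standard library’s `html.parser` in place of Beautiful Soup so that it runs without extra installs. The page content, the `.csv` suffix, and the `legacy.example.org` URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FileLinkParser(HTMLParser):
    """Collect href targets of <a> tags that end with a given suffix."""

    def __init__(self, base_url, suffix=".csv"):
        super().__init__()
        self.base_url = base_url
        self.suffix = suffix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(self.suffix):
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_file_links(html, base_url, suffix=".csv"):
    """Return absolute URLs of all matching file links in an HTML page."""
    parser = FileLinkParser(base_url, suffix)
    parser.feed(html)
    return parser.links


# Hypothetical search-results page from a legacy system.
SAMPLE_PAGE = """
<html><body>
  <a href="/files/report_001.csv">Report 1</a>
  <a href="/files/report_002.csv">Report 2</a>
  <a href="/help">Help</a>
</body></html>
"""

links = extract_file_links(SAMPLE_PAGE, "https://legacy.example.org/search")
for link in links:
    print(link)
```

In a real prototype the HTML would come from the logged-in browser session rather than a string, and the compliance questions raised later in this talk apply before any such script is pointed at a production system.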
How “clean” is the data?
(or: how much data translation services do you require from subject
matter experts?)
- Data cleaning is another issue.
- Often you need a subject matter expert not so much to help clean the data
as to translate all of its warts and weird “features” that have developed over a
significant amount of time as people and systems change.
- Getting a subject matter expert with enough time to help you is another area
that can really make or break a project.
- If you get all the way to the presentation phase and there is some
non-intuitive part of the data that subject matter experts all know but
you don’t, it can make the audience skeptical of your entire project.
- Worse yet, if you don’t have access to a subject matter expert during
the project work who responds quickly to you, you can end up in
email question-and-answer purgatory, where work gets stalled while you
wait for answers.
- Understanding how much help you’ll need is critical when setting up a new
project. If you don’t have the right organizational support, a project can get
slowed to a crawl and potentially never finish.
Compliance: all. of. the. rules.
● Data Access
● Data Transfer
● Data Storage
● Data Anonymization
● Data Sharing
- Even in public-facing university or government positions, where you expect a
lot of data to be public, there can be a lot of data that needs to be kept secure.
- Financial data, people data, and data not yet finalized all may have security
protocols. Who approves that you’re doing it correctly? Who set up the
rules, and did they have you and your use case in mind? Can approval processes
actually get done on your timeframe? Is there a well-established approval
process?
- These are all things to find out as early as possible, as they can push back
your timeline.
- Also, asking forgiveness rather than permission does not work well here.
Data is currency:
● Power & Politics (at some level)
● Empathy is useful
- While it is not necessary to go Machiavellian here, it is important to remember
that data has human data owners, producers, cleaners, and users. This
means, at some level, data is power.
- When you try to do a new thing with data, especially data that already exists in
a legacy system, you may be disrupting an already well-balanced system.
- If you experience pushback, being able to step back and understand another
person’s perspective can sometimes be useful.
A good partner helps you navigate:
● If data exists
● Data access
● Data oddities
● Compliance processes
● Data politics
- It is critically important for data science consulting groups to have good
partners whose interests and resources align with their goals.
- An ideal partner will already have data access permissions and knowledge of the
“data reality”, and will be better equipped to fight organizational battles around the
data, as they often already know the players, the turf, and the priors of the
context the project is operating within.
- A good partner will also have the time and permission to contribute to the
project at a level necessary for its success and is capable of carrying on
ownership of the project into the future.
How Data Scientists Add Value
- Data science is a new field. There’s lots of hype, lots of new teams, lots of
real but ambiguous possibilities, and low management familiarity with how to
apply this skillset.
- This all means the potential for a mismatch between expectations and
delivery is high, so being able to demonstrate how you’ve added value to the
larger organization is the currency needed to continue operating.
- Data science teams add value in at least three ways. They can create a new
data product, build organizational skills and capabilities, or expand awareness
of technologies and capabilities that then leads other individuals or teams to
take the next step themselves.
- Products can include data visualizations, predictions, recommendation
engines, speech recognition capabilities, image recognition, natural language
processing, and all the different types of web applications and Internet of
Things methods for user interface and deployment.
- More abstractly, you’re very often speeding up a previously human-driven
process or enabling something to be done that wasn’t done before, largely
because it was too costly to be done as a human-driven process. Because of
this, your products are often compared against human-level accuracy as a
first reaction.
- Additionally, you might rightly get different reactions to outreach about
potential projects from people based on how secure they feel in their
position, and whether you’re
- providing a new capability they would have never had time to do themselves,
- automating 10% of their job that they don’t like anyway,
- or automating 75% of their job that makes up the bulk of the reason for
them being there.
- Again, this goes back to understanding how your projects fit within the larger
organization.
Spreading knowledge of what is possible
- Today there is a lot of centralization of data scientists in organizations. Part of
the reason for this is that the skills are in limited supply and the best ways to apply
those skills are still being developed. However, as more and more staff know
how to program and have machine-learning experience, more activities under
data science will happen throughout the organization and not on centralized
teams.
- If you’re on a centralized data science team, it is likely that part of your job
(even if not written down) is to speed up that transition by expanding the
number of people who know what is, isn’t, or might be possible, and
increasing the number of people wanting to use new technology or
brainstorming new applications.
- For example, sometimes this takes the form of one-off meetings with people
to talk about the range of potential software solutions or open-source code
libraries that might apply to their problem. By saving a team a week’s worth of
understanding the landscape, you’ve increased the probability they’ll make
something useful with their limited time.
Building skills & Bringing in new tools
- Just like awareness, building the capability of the larger organization is also
part of a data science team’s job.
- This includes both skills and IT infrastructure.
- Project partners and subject matter experts are often where this capability
building happens informally.
- Data science products are often completed and then handed off to
another organization.
- Building that organization’s staff’s ability to maintain, understand, and
use solutions is often necessary for the longer-term success of
projects, and it builds the capability of the larger organization in
terms of skills and establishing the first use of new tools.
In large, established, organizations, there are often policy and culture constraints
around procurement, many of which directly or indirectly influence the ability of data
scientists to effectively do their job. When we’re talking about procurement, what
we’re trying to describe here is a collection of considerations around purchases of
software that go beyond just one or two people demo-ing a product on their local
machines, or a couple people running a trial of some enterprise software. Large
organizations often have policy and procedures that major software purchases have
to go through, some to make sure things get paid for, and other times because of
concerns that include security and reliability.
Does the proposed product fit the organization now and in the future?
● Skill development
● Workflow
● Tech stack
There are a lot of products out there, but in a large and established organization,
procurement and adoption are a big process. It’s not enough to consider whether or
not a product is the right one for you right now; you also need to consider whether or not it
will work over the course of time. There are a few things to consider:
- where this fits into your organization’s plan in terms of skill development and
distribution across your workforce;
- how easy it will be to integrate with the workflows you are phasing out
and the workflows you are phasing in;
- the technology stack you’re using: what are the common tools you’re using, does the
product integrate with your data sources now, and does it integrate with
productivity tools that you currently use (Office)?
- how the licensing model works (per install or per concurrent user? One-time or per year?);
- whether it is desktop or browser based (desktop-based comes with limitations).
What is the official process?
What kinds of considerations, paperwork, and approvals need to be made? Who is
your audience for each of these justifications? Who is the point of contact for each
part of the process? Which regulations are applicable, and where is the
documentation about the process stored?
What is the culture?
And unofficial process…
There’s the process on paper, and then there is navigating the people and their
motivations. It’s essentially a very complicated people optimization problem.
Open-source vs. proprietary
Open source can be scary, and you might have to talk about actual risk. It is important to
also consider the fact that for people who sign off on things, proprietary software from a big
name is usually going to be their preference. Choosing Microsoft (or maybe Amazon) will not
harm anyone’s career. On the other hand, for prototyping, culture
change, and skill-building, open source is great because it’s free and easily available,
and you don’t have to go out and buy anything to get started. There are also
organizations that have internal repositories of open-source software that has been
vetted by security, which mitigates the risk of malicious code mimicking your
dependencies in a repository somewhere. One last thing: if your company does
depend on a lot of open-source software, please find ways to give back. Ask your
developers to contribute code and documentation, and donate monetarily to projects; it
helps to keep the ecosystem alive!
Communication and Design
Effective communication and design at the start, middle, and end of a project is
important for a variety of reasons that I will get into over the next few slides.
Data science:
What does that even mean?
(or why managing expectations is
- First, effective communication is important because data science is new,
ill-defined, and sometimes even magical. This is fun, but it also means there
is high potential for a mismatch of expectations, as this great XKCD comic
demonstrates and as I’ve mentioned on previous slides.
- Clients, partners, and managers aren’t always going to have an understanding of
what things are technically hard, easy, or even doable. You’ll need to tell
them.
- They are also going to have different definitions in their heads for what does or
doesn’t fall into data science.
- This is both because data-scientist-driven teams haven’t existed, at least in
their current form, for very long, and because what is possible is
changing very quickly. For example, this comic was written only three
years ago and I bet there are many people in this audience right now
thinking to themselves “is bird recognition really a five-year problem?”
- This situation is different from, say, an oil company’s exploration team of
geologists and geophysicists, where very similar products have been produced
for similar questions for many years.
- Because of this, expectation management needs to be a constant part of
any project.
Effective Narratives:
Don’t let the
Buzzwords + math + programming
get in the way of the
Business value + project schedule + uncertainty
- When describing your work, the details of the mathematics, statistics, and
algorithms are not always going to be understood in detail, and are not always
feasible to communicate.
- In order to correctly manage expectations, the emphasis is better put on
business value, business decision logic, and range of uncertainty in terms of
organizational resources required, time required, and range of possible
prediction accuracy.
- A good general rule when presenting or explaining is you should be able to
drop any of the math or programming or buzzwords and still have what you’re
saying make sense.
Understand as early as possible
● What’s the real problem?
● Does the data exist?
● Can you access the data?
● How clean is the data?
● What is the business value?
● What is the organizational context?
- At the beginning of projects, there is a significant need to define the problem,
understand the data, how that data is accessed, how this project might be
used in workflows, and how it fits into the larger data and decision space
ecosystem in the organization.
- These are the questions that we’ve found important to answer as early as
possible.
- Bad assumptions at the start of a project can cause you to waste time working
on dead ends, which then cause you to go back and repeat work.
- Being able to ask the right questions, and get the right level of detail back, is a
very important skill to develop, and one that doesn’t just happen automatically.
When delivering something that will be used by people:
Consider user-centered design
- A product that delivers a high-accuracy prediction but is a pain to use is still
a poor project.
- User-centered design is a range of techniques that can be used to better
understand user problems and build effective, easy-to-use tools to meet those
needs.
- I won’t attempt to go into detail about it here, except to say:
- I would encourage anyone working on data science products that will
eventually be used by humans (and not another program) to explore this
philosophy and use it not just on a user-facing front end but throughout the
project design as well.
Data Visualization:
You’re likely undervaluing it
- My personal opinion on this is that data visualization is undervalued.
- With machine learning, success is easy to measure as we can define it in
terms of accuracy, false negatives, etc.
- It is very difficult for us to look at a data visualization and recognize how well
or badly it is performing, because our comprehension of it happens so
fast, often faster than we’re consciously thinking about it.
- Therefore, we’re more attracted to the insights from machine learning,
especially supervised learning, in a way we aren’t to data visualization.
- I think this is unfortunate, because the number of problems that can be
impacted by data visualization is probably larger than the number for machine learning in any
given organization, and we’re probably missing a lot of opportunities.
- This could be its own talk, so I’ll end with: learn web development and
JavaScript.
- JavaScript is where the really interesting data visualization is
happening, and web development is the most effective way to share
complex high-dimensional information with a large number of people with
minimal requirements on the user end.
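The earlier remark that machine-learning success is easy to measure in terms of accuracy and false negatives can be made concrete with a small sketch. The labels below are made-up illustrative values, not results from any project, and the function names are hypothetical; the definitions used are the standard ones for binary classification.

```python
def classification_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_negative_rate(tp, fn):
    """Fraction of actual positives that the model missed."""
    return fn / (tp + fn)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, tn, fn = classification_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, tn, fn))
print("false negative rate:", false_negative_rate(tp, fn))
```

There is no comparably simple number for how well a chart communicates, which is part of why visualization quality is harder to argue for.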
The distribution of data scientists in an organization
Let’s talk now about how and where you distribute data scientists in an organization.
As a reminder of the perspective we’re taking, we want to specifically focus on
organizations where data science and software are not the main focus, and in a large
organization. Another way of framing it, particularly for this part of the discussion, is to
think about what it takes for a large, established organization to start investing in data
science, and what that organizational structure can/should look like. To begin with,
here are some key concepts.
Distribution of Data Scientists
[Diagram: Team A and Team B; Org 1, Org 2, Org 3, Org 4; Data Science]
Something to note as we talk about different organizational structures is this: what is
the relationship of the data science team to the rest of the organization?
Organizational fence
Is there an organizational fence, or boundary between them?
How are data, problems, finished products, training, and best practices passed
between “Data Scientists” and the “rest of the organization”? Another thing to
consider: who “pays” for the data science team? Are they a core service, just like a
basic IT seat? Is their time billed to specific internal clients?
Innovation Lab
Innovation lab:
Central tank of data scientists working on longer term projects inside the group that
eventually transition outside once “fully grown.” Datasets and problems are tossed by
the organization “over the fence” at the Data Science team, which has training and
best practices ready to go.
When a project is complete, the data science team tosses the fully formed product
back over the fence to the rest of the organization.
Pros: This is really great for honing a team with really great knowledge, training, and
fully formed best practices.
Cons: This silos the training and knowledge within the innovation lab, and depending on
your funding model for the innovation lab
Embedded + Rotations
A smaller central team trains up and deploys data scientists who then sit embedded
within the organization. Work varies from consultancy to longer-term projects
depending on local needs. Data scientists typically work on teams with other
experts and less with larger groups of other data scientists and software engineers.
Centralized Consultancy
Project Working Group
Different organizations contribute people at different levels of
commitment and capacity to a project working group that then produces a deliverable.
How to grow data science in an org?
Top-down vs. grassroots
Data / Systems Skills / Culture
Top-down = focus is on big projects to transform systems, data, and workflows to
enable new data science approaches. Whole teams are stood up to assist in this
transformation. Data engineering is seen as critical.
Bottom-up = focus is on building skills and giving permission for innovation. No single
“approved way of doing things” - Different groups within the organization might pick
different languages and vendors based on individual needs. Flexible funding and time
given to try new data science methods by individuals who have skills but might not
have data scientist, data engineer, or software engineer on their business card.
Data scientists need to manage “outward”
into many parts of an organization
Because data science is a new field, it isn’t embedded into all parts of an
organization, and data science skills are sparsely distributed. In 5-10 years, most
of this presentation will need to be updated as Data Science becomes more and more
All of these can make-or-break a project
Data access: Will you need to navigate legacy systems and/or data owners?
Value of data science: Is the project’s business value well defined?
Procurement constraints: Can a project operationalize/grow within the org?
Communication & design: Is the right information flowing effectively?
Organizational structure: What are the pros/cons of your structures/workflow?
To summarize, all of these can make or break a project, and I’ve seen this happen.
- Data Access: do you know whether navigating legacy systems will slow you
down? Do you have buy-in from data owners who you’re dependent on?
- Value of data science: Can you describe the project in terms of its business
value, business context, and schedule uncertainty to anyone who asks?
- Procurement constraints: Can that project be started, grown, and eventually
left in that organization, considering the organization’s range of skills, funds,
human resources, etc.?
- Communication & Design: Are you communicating effectively to get the
information you need at the start of the project? Are you designing
user-interfaces at the end of the project that are effective and maximize the
value of all the other work you’ve done?
- Organizational structure: How does your team fit within the wider
organization? Do you understand the strengths and weaknesses that come
from that placement? How are you dealing with those constraints? What sorts
of projects fit well or don’t fit well with that type of organizational position?
Thanks, and keep in touch!
Justin Gosses @JustinGosses
Yulan Lin @y3 2n

More Related Content

Recently uploaded

BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls - Grow your wealth with trading signals - Grow your wealth with trading - Grow your wealth with trading signals - Grow your wealth with trading signalsInvezz1
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam





Practical considrations of data science cleaned.pptx

  • 1. Yulan Lin @y3 2n Justin Gosses @JustinGosses Data Science & Software Engineering Valador Inc. Supporting NASA OCIO Rice Data Science Conference, Oct. 2017 Practical Considerations for Data Science Consulting and Innovation in a Large Organization
  • 2. Why practical considerations? There’s a lot of conversation around what mathematical models are good, what technologies to buy, or even what open-source libraries have the best implementations of machine learning algorithms. However, there’s a lot more that goes into effective data science and how it integrates into an organization’s day-to-day than the math or the models. Here are some of the lessons that we’ve learned working on an internal consulting team that serves a large organization whose mission just might happen to include putting people in space. Now might be a good time to introduce who we are and what we do. You might notice that I keep using the word “we” here. My colleague and I submitted this talk because we wanted to share with you some of our hard-earned lessons about the parts of data science that have very little to do with code, feature selection, and model choice. He’s currently in Virginia at a poster session, but will be presenting his portion via video, and hopefully will also be online for Q&A.
  • 3. ● Startup ● Built around a software product ● Small companies This talk is NOT from the perspective of... Both Justin and I are scientists by training, so we like to make sure that our biases, assumptions, and perspective are made explicit and clear. Let’s begin with the caveats; the hedging as my thesis advisor would call it.
  • 4. ● Established organization (> 10 years old) ● Large organization (> 10,000 people) ● Core function is not software/technical Clarification: not just NASA! Oil & Gas, Banking, Health, etc. This talk IS from the perspective of Now that we’ve gotten the caveats and what we are not out of the way, let’s talk about what this talk is from the perspective of. This talk is from the perspective of an organization that is established, has been around, and has ingrained in it culture, habits, and patterns from before data science was the hot new buzzword. In addition, we want to make it clear that we’re talking from the perspective of data scientists who are operating in an organization whose sole focus isn’t building software products. This isn’t necessarily NASA-specific: it draws not just on our own experiences, but also on conversations with data scientists at other large companies in oil & gas, health, and banking.
  • 5. A grab bag at the intersection of math + code that includes ● Machine Learning ● Deep Learning ● Statistics ● Data Visualization What is Data Science? If we’re going to talk about practical considerations of Data Science, we’re going to need to define it here. There are some really great Venn diagrams out there, and blog posts expounding on what Data Science is as a field. For the purposes of this talk, we’re going to talk about data science as a grab bag at the intersection of math and code that includes machine learning, deep learning, statistics, and data visualization.
  • 6. ● Data access ● How data science adds value ● Influence of procurement constraints ● Communication & narratives ● How data scientists are distributed Roadmap of our talk In some ways, we’re trying to start small, and then zoom out. Data access is about the “smallest” level: can we even get to the data we need? Adding value is about convincing stakeholders that data science matters. Procurement constraints start getting beyond data scientists and data science clients into the larger organization. Communication is about communicating the results of the data science work. Finally, the distribution of data scientists is a big-picture question of how we structure the organization. With that, I’d like to pass it off to Justin, to talk about data access and the value-add of data science.
  • 7. Data Access I’m (Justin) going to talk about data access in this section of the talk. - It is common in data science talks and articles for someone to say data science can be 80% data cleaning, and 20% actual analytics and modeling. While that can sometimes be true, it assumes that you can get to the data in the first place. - For data science teams acting as an internal consultancy in a large organization, obtaining access to actionable data can be a substantial task by itself. Often the data owners are not you, your team, or even the people you’re partnering with on your project, so getting good data is a common pain point. - In this section, I’m going to talk about a few data access topics, including: creating data that does not yet exist, whether data is programmatically accessible, data cleaning concerns, data compliance, data as political power, and finally the importance of having a good local partner on any data project.
  • 8. Does the data exist? ● Is it actually useful for your problem? ● If you’re going to collect it or make it: welcome to data engineering - Whether the data exists yet is not always a straightforward question. Sometimes the data exists but it would take too long to get it, or sometimes it exists but doesn’t have all the attributes you need. - If the data doesn’t exist yet and you’re going to collect real data, you’re now into data engineering, which involves data collection, reliability engineering, database engineering, etc. These tend to slowly grow in scope as expectations rise over time, even for toy prototype systems, so this is one area where you need to be careful of scope creep. - Another option is to say, “oh, I’ll just create some fake data”. There are a variety of helpful libraries for creating fake data; however, in my experience a product built on fake data almost always has to be rebuilt, to some extent, once you get the real data, so that probable additional work should be built into your schedule.
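As a rough illustration of the "create some fake data" option on this slide, here is a minimal Python sketch that uses only the standard library rather than any particular fake-data library. The record fields (`ticket_id`, `site`, `hours_open`), the `SITES` values, and the `make_fake_records` helper are all hypothetical and not from the talk.

```python
import random

# Hypothetical category values; real data would have its own vocabulary.
SITES = ["site_a", "site_b", "site_c"]


def make_fake_records(n, seed=0):
    """Return n fake records; a fixed seed keeps the prototype reproducible."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        records.append({
            "ticket_id": i + 1,
            "site": rng.choice(SITES),
            # Uniform values are a placeholder; real data is rarely this tidy.
            "hours_open": round(rng.uniform(0.5, 72.0), 1),
        })
    return records


if __name__ == "__main__":
    for record in make_fake_records(3):
        print(record)
```

Keeping the generator behind one function makes it easier to swap in a loader for the real data later, which is where the rebuild effort mentioned above tends to show up.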
  • 9. Is the data programmatically accessible? - Getting the data out of legacy systems and into a form you can work with is another common hurdle in a large organization. - Large organizations have probably been around for a while. This means most of their systems were made before data science was a thing, and many were made before APIs or microservices were common. Basically, you might be dealing with something designed for human entry and retrieval of data, one file at a time, not programmatic access, which might be a problem if you need to get out thousands of files, not just one. - Often, if you’re working on a small study or prototype, you might be able to start with scraping some of the data. By scraping, I mean accessing the system by logging in as a human and then using a combination of the Selenium and Beautiful Soup libraries to write a program that “pretends to be a person”, automating the process of getting at least some of the data out. - With this data, you can develop a small demo, and then use that demo to argue to the data owners: hey look, we made something cool that’s really useful for you, therefore you should give us better data access.
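The parsing half of the scraping approach described on this slide can be sketched roughly as follows. To stay self-contained, this sketch swaps in Python's standard-library `html.parser` in place of Beautiful Soup, and it assumes the page HTML has already been retrieved (for example by a browser-automation tool such as Selenium after logging in). The `FileLinkParser` class, the `extract_file_links` helper, the file extensions, and the example URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FileLinkParser(HTMLParser):
    """Collect absolute URLs of links that look like downloadable data files."""

    def __init__(self, base_url, extensions=(".csv", ".xlsx")):
        super().__init__()
        self.base_url = base_url
        self.extensions = extensions
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose path ends in one of the wanted extensions.
        if href and href.lower().endswith(self.extensions):
            self.links.append(urljoin(self.base_url, href))


def extract_file_links(html, base_url):
    """Return the data-file links found in one page of HTML."""
    parser = FileLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = '<a href="/files/a.csv">A</a> <a href="about.html">About</a>'
    print(extract_file_links(page, "https://example.invalid/reports/"))
```

Whether scraping a given system is permitted at all is a compliance question, so a sketch like this only makes sense once that has been checked with the data owners.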
  • 10. How “clean” is the data? (or: how much data translation services do you require from subject matter experts?) - Data cleaning is another issue. - Often you need a subject matter expert not so much to help clean the data as to translate all of its warts and weird “features” that have developed over a significant amount of time as people and systems change. - Getting a subject matter expert with enough time to help you is another area that can really make or break a project. - If you get all the way to the presentation phase and there is some non-intuitive part of the data that subject matter experts all know but you don’t, it can make the audience skeptical of your entire project. - Worse yet, if you don’t have access to a subject matter expert during the project work who responds quickly to you, you can end up in email question-and-answer purgatory, where work gets stalled while you wait for answers. - Understanding how much help you’ll need is critical when setting up a new project. If you don’t have the right organizational support, a project can get slowed to a crawl and potentially never finish.
  • 11. Compliance: all. of. the. rules. ● Data Access ● Data Transfer ● Data Storage ● Data Anonymization ● Data Sharing - Even in public-facing university or government positions, where you expect a lot of data to be public, there can be a lot of data that needs to be kept secure. - Financial data, people data, and data not yet finalized may all have security protocols. Who approves that you’re doing it correctly? Who set up the rules, and did they have you and your use case in mind? Can approval processes actually get done on your timeframe? Is there a well-established approval process? - These are all things to find out as early as possible, as they can push back your timeline. - Also, asking forgiveness rather than permission does not work well here.
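To make the "Data Anonymization" item on this slide a little more concrete, here is a minimal sketch of one common technique, keyed-hash pseudonymization, using Python's standard `hmac` and `hashlib` modules. This is an illustrative assumption rather than anything the talk prescribes: the `pseudonymize` helper, the 16-character truncation, and the example key are hypothetical, and whether this approach satisfies a given organization's rules is exactly the kind of question to take to whoever approves data handling.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a stable token derived from a secret key.

    The same value and key always give the same token, so records can still
    be joined, but the original value is not stored in the output.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    # Truncation is a readability choice here, not a security requirement.
    return digest.hexdigest()[:16]


if __name__ == "__main__":
    key = b"example-key-do-not-reuse"  # hypothetical; store real keys securely
    print(pseudonymize("employee-12345", key))
```

Note that pseudonymized data can often still be re-identified from the remaining columns, so this is a starting point for the compliance conversation rather than a substitute for it.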
  • 12. Data is currency: ● Power & Politics (at some level) ● Empathy is useful - While it is not necessary to go Machiavellian here, it is important to remember that data has human data owners, producers, cleaners, and users. This means, at some level, data is power. - When you try to do a new thing with data, especially data that already exists in a legacy system, you may be disrupting an already well-balanced system. - If you experience pushback, being able to step back and understand another person’s perspective can sometimes be useful.
  • 13. A good partner helps you navigate: ● If data exists ● Data access ● Data oddities ● Compliance processes ● Data politics - It is critically important for data science consulting groups to have good partners whose interests and resources align with their goals. - An ideal partner will already have data access permissions, knowledge of the “data reality”, and is better equipped to fight organizational battles around the data, as they often already know the players, the turf, and the priors of the context the project is operating within. - A good partner will also have the time and permission to contribute to the project at a level necessary for its success and is capable of carrying on ownership of the project into the future.
  • 14. How Data Scientists Add Value - Data science is a new field. There’s lots of hype, lots of new teams, lots of real but ambiguous possibilities, and low management familiarity with how to apply this skillset. - This all means the potential for a mismatch between expectations and delivery is high, so being able to demonstrate how you’ve added value to the larger organization is the currency needed to continue operating. - Data science teams add value in at least three ways. They can create a new data product, build organizational skills and capabilities, or expand awareness of technologies and capabilities, which then leads other individuals or teams to take the next step themselves.
  • 15. Products - Products can include data visualizations, predictions, recommendation engines, speech recognition capabilities, image recognition, natural language processing, and all the different types of web applications and Internet of Things methods for user interface and deployment. - More abstractly, you’re very often speeding up a previously human-driven process or enabling something to be done that wasn’t done before, largely because it was too costly as a human-driven process. Because of this, your products are often compared against human-level accuracy as a first reaction. - Additionally, you might rightly get different reactions to outreach about potential projects depending on how secure people feel in their position, and whether you’re providing - a new capability they would never have had time to do themselves, - automating 10% of their job that they don’t like anyway, - or automating 75% of their job that makes up the bulk of the reason for them being there. - Again, this goes back to understanding how your projects fit within the large organization.
  • 16. Awareness: Spreading knowledge of what is possible - Today there is a lot of centralization of data scientists in organizations. Part of the reason for this is that the skills are in limited supply and the best ways to apply those skills are still being developed. However, as more and more staff know how to program and have machine-learning experience, more data science activities will happen throughout the organization and not on centralized teams. - If you’re on a centralized data science team, it is likely that part of your job (even if not written down) is to speed up that transition by expanding the number of people who know what is, isn’t, or might be possible, and by increasing the number of people wanting to use new technology or brainstorming new applications. - For example, sometimes this takes the form of one-off meetings with people to talk about the range of potential software solutions or open-source code libraries that might apply to their problem. By saving a team a week’s worth of understanding the landscape, you’ve increased the probability they’ll make something useful with their limited time.
  • 17. Capability: Building skills & Bringing in new tools - Just like awareness, building the capability of the larger organization is also part of a data science team’s job. - This includes both skills and IT infrastructure. - Project partners and subject matter experts are often where this capability building happens informally. - Data science products are often completed and then handed off to another organization. - Building that organization’s staff’s ability to maintain, understand, and use solutions is often necessary for the longer-term success of projects, and it also builds the capability of the larger organization in terms of skills and establishing the first use of new software, libraries, etc.
  • 18. Procurement Constraints In large, established, organizations, there are often policy and culture constraints around procurement, many of which directly or indirectly influence the ability of data scientists to effectively do their job. When we’re talking about procurement, what we’re trying to describe here is a collection of considerations around purchases of software that go beyond just one or two people demo-ing a product on their local machines, or a couple people running a trial of some enterprise software. Large organizations often have policy and procedures that major software purchases have to go through, some to make sure things get paid for, and other times because of concerns that include security and reliability.
  • 19. Does the proposed product fit the organization now and in the future? Consider: ● Skill development ● Workflow ● Tech stack There are a lot of products out there, but in a large and established organization, procurement and adoption are a big process. It’s not enough to consider whether or not a product is the right one for you right now; you also need to consider whether or not it will work over the course of time. There are a couple of things to consider - where this fits into your organization’s plan in terms of skill development/distribution across your workforce, how easy it will be to integrate with the workflows you are phasing out and the workflows you are phasing in, as well as the technology stack you’re using: what are the common tools you’re using, does it integrate with your data sources now, does it integrate with productivity tools that you currently use (Office) - how the licensing model works (per install or per concurrent user? One-time or per year?), and whether it is desktop or browser based (desktop-based comes with limitations)
  • 20. What is the official process? What kinds of considerations, paperwork, and approvals need to be made? Who is your audience for each of these justifications? Who is the point of contact for each part of the process? Which regulations are applicable, and where is the documentation about the process stored?
  • 21. What is the culture? And unofficial process… There’s the process on paper, and then there’s navigating the people and their motivations. It’s essentially a very complicated people optimization problem.
  • 22. Open-source vs. proprietary Open source can be scary and you might have to talk about actual risk. It is also important to consider that for people who sign off on things, proprietary software from a big name is always going to be their preference. Choosing Microsoft (maybe Amazon) will not harm anyone’s career. On the other hand, for prototyping, culture change, and skill-building, open source is great because it’s free and easily available, and you don’t have to go out and buy anything to get started. There are also organizations that have internal repositories of open-source software that has been vetted by security, which mitigates the risk of malicious code mimicking your dependencies in a repository somewhere. One last thing - if your company does depend on a lot of open-source software, please find ways to give back. Ask your developers to contribute code/documentation, and donate monetarily to projects; it helps to keep the ecosystem alive!
  • 23. Communication and Design Effective communication and design at the start, middle, and end of a project is important for a variety of reasons that I will get into over the next few slides.
  • 24. Data science: What does that even mean? (or why managing expectations is important) Credit: - First, effective communication is important because, data science is new, ill-defined and sometimes even magical..…this is fun... but it also means there is high potential for a mismatch of expectations …. as this great XKCD comic demonstrates..and as I’ve mentioned on previous slides. - Clients, partners, and managers aren’t always going to have understanding of what things are technically hard, easy, or even doable. You’ll need to tell them. - They also are going to have different definitions in their head for what does or doesn’t fall into data science. - Both because data scientist driven teams haven’t existed,,, at least in their current form very long...and also because the what is possible is changing very quickly. For example, this comic was written only three years ago and I bet there are many people in this audience right now thinking to themselves “is bird recognition really a five year problem?” - This situation is different than say an oil companies’ exploration team of geologists and geophysicists where very similar products have been produced for similar questions for many years. - Because of this…. expectation management needs to be a constant part of any project…
  • 25. Effective Narratives: Don’t let the Buzzwords + math + programming get in the way of the Business value + project schedule + uncertainty story - When describing your work, the details of the mathematics, statistics, and algorithms are not always going to be understood in detail, and are not always not feasible to communicate. - In order to correctly manage expectations, the emphasis is better put on business value, business decision logic, and range of uncertainty in terms of organizational resources required, time required, and range of possible prediction accuracy. - A good general rule when presenting or explaining is you should be able to drop any of the math or programming or buzzwords and still have what you’re saying make sense.
  • 26. Understand as early as possible ● What’s the real problem? ● Does the data exist? ● Can you access the data? ● How clean is the data? ● What is the business value? ● What is the organizational context? - At beginning of projects, there is a significant need to define the problem, understand the data, how that data is accessed, how this project might be used in workflows, and how it fits into the larger data and decision space ecosystem in the organization. - These are the questions that we’ve found important to answer as early as possible. - Bad assumptions at the start of a project can cause you to waste time working on dead-ends, which then cause you to go back and repeat work. - Being able to ask the right questions, and get the right level of detail back, is a very important skill to develop … that doesn’t just happen automatically.
  • 27. When delivering something that will be used by people: Consider user-centered design - A product that delivers a high-accuracy prediction but is a pain to use is still a poor product. - User-centered design is a range of techniques for better understanding user problems and building effective, easy-to-use tools to meet those needs. - I won’t attempt to go into detail about it here, except to say this: - I would encourage anyone working on data science products that will eventually be used by humans (and not by another program) to explore this philosophy and use it not just on the user-facing front end but throughout the project design as well.
  • 28. Data Visualization: You’re likely undervaluing it - My personal opinion is that data visualization is undervalued. - With machine learning, success is easy to measure because we can define it in terms of accuracy, false negatives, etc. - It is very difficult for us to look at a data visualization and recognize how well or badly it is performing, because our comprehension of it happens so fast, often faster than we’re consciously thinking about it. - Therefore, we’re attracted to the insights from machine learning, especially supervised learning, in a way we aren’t to those from data visualization. - I think this is unfortunate, because the number of problems that data visualization can impact is probably larger than the number machine learning can impact in any given organization, and we’re probably missing a lot of opportunities. - This could be its own talk, so I’ll end with this: learn web development and JavaScript. - JavaScript is where the really interesting data visualization is happening, and web development is the most effective way to share complex, high-dimensional information with a large number of people with minimal requirements on the user end.
  • 29. The distribution of data scientists in an organization Let’s talk now about how and where you distribute data scientists in an organization. As a reminder of the perspective we’re taking, we want to focus specifically on large organizations where data science and software are not the main focus. Another way of framing it, particularly for this part of the discussion, is to think about what it takes for a large, established organization to start investing in data science, and what that organizational structure can or should look like. To begin with, here are some key concepts.
  • 30. Distribution of Data Scientists [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team] Something to note as we talk about different organizational structures is this: what is the relationship of the data science team to the rest of the organization?
  • 31. Distribution of Data Scientists [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team, organizational fence] Is there an organizational fence, or boundary, between them?
  • 32. Distribution of Data Scientists [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team, organizational fence; flows: Data, Problems, Finished Product, Training, Best Practices] How are data, problems, finished products, training, and best practices passed between “Data Scientists” and the “rest of the organization”? Another thing to consider: who “pays” for the data science team? Are they a core service, just like a basic IT seat? Is their time billed to specific internal clients?
  • 33. Innovation Lab [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team; flows: Data, Problems, Training, Best Practices] Innovation lab: a central tank of data scientists working on longer-term projects inside the group, which eventually transition outside once “fully grown.” The organization tosses datasets and problems “over the fence” to the data science team, which has training and best practices ready to go.
  • 34. Innovation Lab [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team; flows: Data, Problems, Training, Best Practices, Finished Product] When a project is complete, the data science team tosses the fully formed product back over the fence to the rest of the organization. Pros: This is great for honing a team with strong knowledge, training, and fully formed best practices. Cons: This silos the training and knowledge within the innovation lab, and depending on your funding model for the innovation lab
  • 35. Embedded + Rotations [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team; flows: Training, Best Practices] A smaller central team trains up and deploys data scientists, who then sit embedded within the organization. Work varies from consultancy to longer-term projects depending on local needs. Data scientists typically work on teams with other experts and less with larger groups of other data scientists and software engineers.
  • 36. Embedded + Rotations [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team; flows: Training, Best Practices, Projects, Learning]
  • 37. Centralized Consultancy [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team, Project Working Group; flows: Data, Problems, Training, Best Practices]
  • 38. Centralized Consultancy [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team, Project Working Group; flows: Data, Problems, Training, Best Practices] Different organizations contribute people at different levels of commitment and capacity to a project working group, which then produces a deliverable.
  • 39. Centralized Consultancy [Diagram: Team A, Team B, Org 1, Org 2, Org 3, Org 4, Executives, Data Science Team, Project Working Group; flows: Data, Problems, Training, Best Practices, Finished Product]
  • 40. How to grow data science in an org? Top-down vs. grassroots [Diagram labels: Data / Systems, Skills / Culture] Top-down = the focus is on big projects that transform systems, data, and workflows to enable new data science approaches. Whole teams are stood up to assist in this transformation. Data engineering is seen as critical. Bottom-up (grassroots) = the focus is on building skills and giving permission for innovation. There is no single “approved way of doing things”: different groups within the organization might pick different languages and vendors based on individual needs. Flexible funding and time are given to try new data science methods to individuals who have the skills but might not have “data scientist,” “data engineer,” or “software engineer” on their business card.
  • 42. Data scientists need to manage “outward” into many parts of an organization Because data science is a new field, it isn’t embedded into all parts of an organization, and data science skills are sparsely distributed. In 5-10 years, most of this presentation will need to be updated as data science becomes more and more prevalent.
  • 43. All of these can make-or-break a project Data access: Will you need to navigate legacy systems and/or data owners? Value of data science: Is the project’s business value well defined? Procurement constraints: Can a project operationalize/grow within the org? Communication & design: Is the right information flowing effectively? Organizational structure: What are the pros/cons of your structures/workflow? To summarize, all of these can make or break a project, and I’ve seen this happen. - Data access: Do you know whether navigating legacy systems will slow you down? Do you have buy-in from the data owners you depend on? - Value of data science: Can you describe the project in terms of its business value, business context, and schedule uncertainty to anyone who asks? - Procurement constraints: Can the project be started, grown, and eventually left in the organization, considering the organization’s range of skills, funds, human resources, etc.? - Communication & design: Are you communicating effectively to get the information you need at the start of the project? Are you designing user interfaces at the end of the project that are effective and maximize the value of all the other work you’ve done? - Organizational structure: How does your team fit within the wider organization? Do you understand the strengths and weaknesses that come from that placement? How are you dealing with those constraints? What sorts of projects fit well or don’t fit well with that type of organizational position?
  • 44. Thanks, and keep in touch! Justin Gosses @JustinGosses Yulan Lin @y3 2n