THIS VERSION INCLUDES SPEAKER NOTES IN SLIDES: It was delivered at the 2017 Rice Data Science Conference. As opposed to the typical data science talk on math, models, or frameworks, this talk discusses the need to successfully manage people and relationships when doing data science consulting and prototyping in a large organization. Common traps to avoid, key questions to answer early, how organizational procurement patterns influence tool selection, and the importance of having a good local partner close to the data are all discussed. The in-person presenter of this talk at Rice Data Science Day was Yulan Lin - https://www.linkedin.com/in/yulanlin/ Justin's slides were recorded in advance. The version without speaker notes is here: https://www.slideshare.net/JustinGosses/practical-considerations-of-data-science-consulting-in-large-organizations-oct-12-2017
1. Yulan Lin @y3l2n
Justin Gosses @JustinGosses
Data Science & Software Engineering
Valador Inc.
Supporting NASA OCIO
Rice Data Science Conference, Oct. 2017
Practical Considerations for Data
Science Consulting and Innovation
in a Large Organization
2. Why practical
considerations?
There’s a lot of conversation around what mathematical models are good, what
technologies to buy, or even what open-source libraries have the best
implementations of machine learning algorithms. However, there’s a lot more that
goes into effective data science and how it integrates into an organization’s
day-to-day than the math or the models. Here are some of the lessons that we’ve
learned working on an internal consulting team that serves a large organization
whose mission just might happen to include putting people in space.
Now might be a good time to introduce who we are and what we do. You might notice
that I keep using the word “we” here. My colleague and I submitted this talk because
we wanted to share with you some of our hard-earned lessons about the parts of data
science that have very little to do with code, feature selection, and model choice. He’s
currently in Virginia at a poster session, but will be presenting his portion via video,
and hopefully will also be online for Q&A.
3. ● Startup
● Built around a software product
● Small companies
This talk is NOT from the perspective of...
Both Justin and I are scientists by training, so we like to make sure that our biases,
assumptions, and perspective are made explicit and clear. Let’s begin with the
caveats; the hedging as my thesis advisor would call it.
4. ● Established organization (> 10 years old)
● Large organization (> 10,000 people)
● Core function is not software/technical
Clarification: not just NASA! Oil & Gas, Banking, Health, etc.
This talk IS from the perspective of
Now that we’ve gotten the caveats and what we are not out of the way, let’s talk about
what this talk is from the perspective of.
This talk is from the perspective of an organization that is established, has been
around, and has ingrained in it culture, habits, and patterns from before data science
was the hot new buzzword. In addition, we want to make it clear that we’re talking
from the perspective of data scientists who are operating in an organization whose
sole focus isn’t building software products.
This isn't necessarily NASA-specific; it draws not just from our own experiences, but
also from conversations with data scientists at other large companies in oil & gas,
health, and banking.
5. A grab bag at the intersection of math + code that includes
● Machine Learning
● Deep Learning
● Statistics
● Data Visualization
What is Data Science?
If we’re going to talk about practical considerations of Data Science, we’re going to
need to define it here. There’s some really great Venn Diagrams out there, and blog
posts expounding on what Data Science is as a field. For the purposes of this talk,
we’re going to talk about data science as a grab bag at the intersection of math and
code that includes machine learning, deep learning, statistics, and data visualization.
6. ● Data access
● How data science adds value
● Influence of procurement constraints
● Communication & narratives
● How data scientists are distributed
Roadmap of our talk
In some ways, we’re trying to start small, and then zoom out. Data access is about
the “smallest” level: can we even get to the data we need? Adding value is convincing
stakeholders that data science matters. Procurement constraints starts getting
beyond data scientists and data science clients into the larger organization.
Communication is about communicating out the results of the data science. Finally,
the distribution of data scientists is a big-picture question about how we structure the
organization. With that, I'd like to pass it off to Justin, to talk about data access and the
value-add of data science.
7. Data Access
I’m (Justin) going to talk about Data access in this section of the talk.
- It is common in data science talks and articles for someone to say data
science can be 80% data cleaning, and 20% actual analytics and modeling.
While that can sometimes be true, it assumes that you can get to the data in
the first place.
- For data science teams acting as an internal consultancy in a large
organization, obtaining access to actionable data can be a substantial task by
itself. Often the data owners are not you, your team, or even the people you're
partnering with on your project, so getting good data is a common pain point.
- In this section, I'm going to talk about a few data access topics, including:
creating data that does not yet exist, whether data is programmatically
accessible, data cleaning concerns, data compliance, data as political power,
and finally the importance of having a good local partner on any data project.
8. Does the data exist?
● Is it actually useful for your problem?
● If you’re going to collect it or make it:
welcome to data engineering
- "Does the data exist yet?" is not always a straightforward question. Sometimes
the data exists but it would take too long to get it, or it exists but doesn't have
all the attributes you need.
- If the data doesn't exist yet… and you're going to collect real
data… congratulations… you're now into data engineering,
- which involves data collection, reliability engineering, database engineering,
etc. These efforts have a tendency to slowly grow in scope as expectations
rise over time, even for toy prototype systems, so this is one area to watch for
scope creep.
- Another option is to say, "oh, I'll just create some fake data." There are a
variety of helpful libraries for creating fake data; however, in my experience a
product built on fake data almost always has to be re-built, to some extent,
once you get the real data, so that probable additional work should be built
into your schedule.
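The fake-data route can be sketched with nothing but the standard library. This is a minimal, hypothetical example (the field names, sites, and value ranges are made up for illustration, not a real schema); dedicated libraries such as Faker offer much richer generators.

```python
import random

# Minimal sketch: generate reproducible fake records with the standard
# library. Field names and ranges here are illustrative assumptions.
def make_fake_records(n, seed=42):
    rng = random.Random(seed)  # fixed seed so the fake dataset is reproducible
    sites = ["Houston", "Greenbelt", "Pasadena"]
    return [
        {
            "id": i,
            "site": rng.choice(sites),
            "uptime_hours": round(rng.uniform(0, 24), 1),
        }
        for i in range(n)
    ]

print(make_fake_records(3))
```

Because the generator is seeded, a prototype built on it behaves deterministically, which makes the eventual rebuild against real data easier to verify.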
9. Is the data programmatically accessible?
- Getting the data out of legacy systems and into a form you can work with is
another common hurdle when in a large organization.
- Large organizations have probably been around for a while. This means most of
their systems were made before data science was a thing, and many were
made before APIs or microservices were common. Basically, you might be
dealing with something designed for human entry and retrieval of data, one file
at a time, not programmatic access… which might be a problem if you need to
get out thousands of files, not just one.
- Often, if you're working on a small study or prototype, you might be able to start
by scraping some of the data. By scraping, I mean accessing the system by
logging in as a human and then using a combination of the selenium and beautiful
soup libraries to write a program that "pretends to be a person," automating the
process of getting at least some of the data out.
- With this data, you can develop a small demo, and then use that demo to
argue to the data owners… that hey look, we made something
cool that's really useful for you… therefore you should give us better data
access.
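The parsing half of that scraping pattern can be sketched with the standard library alone. In practice you would often pair selenium (to log in and navigate as a human would) with Beautiful Soup (to parse the pages); the stdlib-only sketch below, with made-up markup, just illustrates pulling values out of HTML that was designed for human viewing.

```python
from html.parser import HTMLParser

# Minimal sketch: extract table cells from an HTML page built for humans.
# The markup below is a made-up example, not a real legacy system's output.
class CellExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # keep only non-empty text found inside <td> cells
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

page = "<table><tr><td>file_001.csv</td><td>2017-10-12</td></tr></table>"
parser = CellExtractor()
parser.feed(page)
print(parser.cells)  # the extracted cell values
```

Looping a parser like this over many pages is what turns a one-file-at-a-time human interface into something a demo can be built on.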
10. How “clean” is the data?
(or: how much data translation services do you require from subject
matter experts?)
- Data cleaning is another issue.
- Often you need a subject matter expert not so much to help clean the data
as to translate all of its warts and the weird "features" that have developed over a
significant amount of time as people and systems change.
- Getting a subject matter expert with enough time to help you is another area
that can really make or break a project.
- If you get all the way to the presentation phase and there is some
non-intuitive part of the data that subject matter experts all know... but
you don't, it can make the audience skeptical of your entire project.
- Worse yet, if you don't have access to a subject matter expert who
responds quickly during the project work… you can end up in
email question/answer purgatory, where work stalls while you
wait for answers.
- Understanding how much help you’ll need is critical when setting up a new
project. If you don’t have the right organizational support, a project can get
slowed to a crawl and potentially never finish.
11. Compliance: all. of. the. rules.
● Data Access
● Data Transfer
● Data Storage
● Data Anonymization
● Data Sharing
- Even in public facing university or government positions, where you expect a
lot of data to be public, there can be a lot of data that needs to be kept secure.
- Financial data, people data, and data not yet finalized may all have security
protocols. Who approves that you're doing it correctly? Who set up the
rules, and did they have you and your use case in mind? Can approval processes
actually get done on your timeframe? Is there a well-established approval
process?
- These are all things to find out as early as possible, as they can push back
your timeline.
- Also, forgiveness rather than permission… does not work well here.
12. Data is currency:
● Power & Politics (at some level)
● Empathy is useful
- While it is not necessary to go Machiavellian here, it is important to remember
that data has human owners, producers, cleaners, and users. This
means, at some level, data is power.
- When you try to do a new thing with data, especially data that already exists in
a legacy system, you may be disrupting an already well-balanced system.
- If you experience pushback, being able to step back and understand another
person’s perspective can sometimes be useful.
13. A good partner helps you navigate:
● If data exists
● Data access
● Data oddities
● Compliance processes
● Data politics
- It is critically important for data science consulting groups to have good
partners whose interests and resources align with their goals.
- An ideal partner will already have data access permissions and knowledge of the
"data reality," and will be better equipped to fight organizational battles around the
data, as they often already know the players, the turf, and the priors of the
context the project is operating within.
- A good partner will also have the time and permission to contribute to the
project at a level necessary for its success and is capable of carrying on
ownership of the project into the future.
14. How Data Scientists
Add Value
- Data science is a new field. There's lots of hype, lots of new teams, lots of
real but ambiguous possibilities, and low management familiarity with how to
apply this skillset.
- All this means the potential for a mismatch between expectations and
delivery is high, so being able to demonstrate how you've added value to the
larger organization is the currency needed to continue operating.
- Data science teams add value in at least three ways. They can create a new
data product, build organizational skills and capabilities, or expand awareness
of technologies and capabilities that then leads other individuals or teams to
take the next step themselves.
15. Products
- Products…. can include data visualizations, predictions, recommendation
engines, speech recognition capabilities, image recognition, natural language
processing, and all the different types of web applications and Internet of
things methods for user interface and deployment.
- More abstractly, you're very often speeding up a previously human-driven
process, or enabling something that wasn't done before largely
because it was too costly as a human-driven process… Because of
this… your products are often compared against human-level accuracy as a
first reaction.
- Additionally… you rightly might get different reactions to outreach about
potential projects from people based on how secure they feel in their
position, and whether you're providing
- a new capability they would never have had time to build themselves,
- automation of 10% of their job that they don't like anyway,
- or automation of 75% of their job that makes up the bulk of the reason for
them being there.
- Again, this goes back to understanding how your projects fit within the larger
organization.
16. Awareness:
Spreading knowledge of what is possible
- Today there is a lot of centralization of data scientists in organizations. Part of
the reason for this is that the skills are in limited supply and the best ways to apply
those skills are still being developed. However, as more and more staff know
how to program and have machine-learning experience, more of the activities under
data science will happen throughout the organization and not on centralized
teams.
- If you're on a centralized data science team, it is likely that part of your job
(even if not written down) is to speed up that transition by expanding the
number of people who know what is, isn't, or might be possible… and
increasing the number of people wanting to use new technology or
brainstorming new applications.
- For example, sometimes this takes the form of one-off meetings with people
to talk about the range of potential software solutions or open-source code
libraries that might apply to their problem. By saving a team a week's worth of
understanding the landscape, you've increased the probability they'll make
something useful with their limited time.
17. Capability:
Building skills & Bringing in new tools
- Just like awareness, building the capability of the larger organization is also
part of a data science team's job.
- This includes both skills and IT infrastructure.
- Project partners and subject matter experts are often where this capability
building happens informally.
- Data science products are often completed and then handed off to
another organization.
- Building that organization's staff's ability to maintain, understand, and
use solutions is often necessary for the longer-term success of
projects… and builds the capability of the larger organization in
terms of skills and establishing the first use of new
software/libraries/etc.
18. Procurement
Constraints
In large, established, organizations, there are often policy and culture constraints
around procurement, many of which directly or indirectly influence the ability of data
scientists to effectively do their job. When we’re talking about procurement, what
we’re trying to describe here is a collection of considerations around purchases of
software that go beyond just one or two people demo-ing a product on their local
machines, or a couple people running a trial of some enterprise software. Large
organizations often have policy and procedures that major software purchases have
to go through, some to make sure things get paid for, and other times because of
concerns that include security and reliability.
19. Does the proposed
product fit the
organization now and
in the future?
Consider:
● Skill development
● Workflow
● Tech stack
There are a lot of products out there, but in a large and established organization,
procurement and adoption are a big process. It's not enough to consider whether or
not a product is the right one for you right now; you also have to consider whether it
will work over the course of time. A few things to consider:
- Where does this fit into your organization's plan for skill development and
distribution across your workforce?
- How easily will it integrate with the workflows you are phasing out and the
workflows you are phasing in?
- How does it fit the technology stack you're using: what are the common tools,
does it integrate with your data sources now, does it integrate with the
productivity tools you currently use (e.g., Office)?
- How does the licensing model work (per install or per concurrent user?
one-time or per year?)
- Is it desktop- or browser-based (desktop-based comes with limitations)?
20. What is the official process?
What kinds of considerations, paperwork, and approvals need to be made. Who is
your audience for each of these justifications? Who is the point of contact for each
part of the process? Which regulations are applicable, and where is that
documentation about the process stored?
21. What is the culture?
And unofficial process…
There's the process on paper, and then there's navigating the people and their
motivations. It's essentially a very complicated people-optimization problem.
22. Open-source vs. proprietary
Open source can be scary, and you might have to talk about actual risk. It's important
to also consider that, for the people who sign off on things, proprietary software from a
big name is always going to be their preference: Microsoft (maybe Amazon) will not
harm anyone's career. On the other hand, for prototyping, culture
change, and skill-building, open source is great because it's free and easily available,
and you don't have to go out and buy anything to get started. There are also
organizations that keep internal repositories of open-source software vetted
by security, which mitigates the risk of malicious code mimicking your
dependencies in a repository somewhere. One last thing: if your company does
depend on a lot of open-source software, please find ways to give back. Ask your
developers to contribute code and documentation, and donate monetarily to projects; it
helps keep the ecosystem alive!
24. Data science:
What does that even mean?
(or why managing expectations is
important)
Credit: https://xkcd.com
- First, effective communication is important because data science is new,
ill-defined, and sometimes even magical… this is fun, but it also means there
is high potential for a mismatch of expectations, as this great XKCD comic
demonstrates and as I've mentioned on previous slides.
- Clients, partners, and managers aren't always going to understand which
things are technically hard, easy, or even doable. You'll need to tell
them.
- They also are going to have different definitions in their head for what does or
doesn’t fall into data science.
- This is both because data-scientist-driven teams haven't existed very long,
at least in their current form, and because what is possible is
changing very quickly. For example, this comic was written only three
years ago, and I bet there are many people in this audience right now
thinking to themselves, "is bird recognition really a five-year problem?"
- This situation is different from, say, an oil company's exploration team of
geologists and geophysicists, where very similar products have been produced
for similar questions for many years.
- Because of this…. expectation management needs to be a constant part of
any project…
25. Effective Narratives:
Don’t let the
Buzzwords + math + programming
get in the way of the
Business value + project schedule + uncertainty
story
- When describing your work, the details of the mathematics, statistics, and
algorithms are not always going to be understood in detail, and are not always
feasible to communicate.
- In order to correctly manage expectations, the emphasis is better put on
business value, business decision logic, and the range of uncertainty in terms of
organizational resources required, time required, and the range of possible
prediction accuracy.
- A good general rule when presenting or explaining is you should be able to
drop any of the math or programming or buzzwords and still have what you’re
saying make sense.
26. Understand as early as possible
● What’s the real problem?
● Does the data exist?
● Can you access the data?
● How clean is the data?
● What is the business value?
● What is the organizational context?
- At beginning of projects, there is a significant need to define the problem,
understand the data, how that data is accessed, how this project might be
used in workflows, and how it fits into the larger data and decision space
ecosystem in the organization.
- These are the questions that we’ve found important to answer as early as
possible.
- Bad assumptions at the start of a project can cause you to waste time working
on dead-ends, which then cause you to go back and repeat work.
- Being able to ask the right questions, and get the right level of detail back, is a
very important skill to develop … that doesn’t just happen automatically.
27. When delivering something that will be used by people:
Consider user-centered design
- A product that delivers a high-accuracy prediction but is a pain to use is still a
poor product.
- User-centered design is a range of techniques that can be used to better
understand user problems and build effective easy to use tools to meet those
needs.
- I won’t attempt to go into detail about it here, except to say..
- I would encourage anyone working on data science products that will
eventually be used by humans (and not another program) to explore this
philosophy and use it not just on a user-facing front-end but throughout the
project design as well.
28. Data Visualization:
You’re likely undervaluing it
- My personal opinion is that data visualization is undervalued.
- With machine learning, success is easy to measure, as we can define it in
terms of accuracy, false negatives, etc.
- It is very difficult for us to look at a data visualization and recognize how well
or poorly it is performing, because our comprehension of it happens so
fast… often faster than we're consciously thinking about it.
- Therefore, we're more attracted to the insights from machine learning,
especially supervised learning, in a way we aren't to data visualization.
- I think this is unfortunate, because the number of problems that can be
impacted by data visualization is probably larger than machine learning in any
given organization… and we’re probably missing a lot of opportunities.
- This could be its own talk, so I’ll end with..…learn web development and
JavaScript.
- JavaScript is where the really interesting data visualization is
happening, and web development is the most effective way to share
complex, high-dimensional information with a large number of people with
minimal requirements on the user end.
29. The distribution of
data scientists in an
organization
Let’s talk now about how and where you distribute data scientists in an organization.
As a reminder of the perspective we're taking, we want to specifically focus on large
organizations where data science and software are not the main focus. Another way
of framing it, particularly for this part of the discussion, is to
think about what it takes for a large, established organization to start investing in data
science, and what that organizational structure can/should look like. To begin with,
here are some key concepts.
30. Distribution of Data Scientists
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Something to note as we talk about different organizational structures is this: what is
the relationship of the data science team to the rest of the organization?
31. Organizational fence
Distribution of Data Scientists
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Is there an organizational fence, or boundary between them?
32. Data
Problems
Finished
Product
Training
Best
Practices
Distribution of Data Scientists
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Organizational fence
How are data, problems, finished products, training, and best practices passed
between “Data Scientists” and the “rest of the organization”? Another thing to
consider: who “pays” for the data science team? Are they a core service, just like a
basic IT seat? Is their time billed to specific internal clients?
33. Innovation Lab
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Data
Problems
Training
Best
Practices
Innovation lab:
Central tank of data scientists working on longer term projects inside the group that
eventually transition outside once “fully grown.” Datasets and problems are tossed by
the organization “over the fence” at the Data Science team, which has training and
best practices ready to go.
34. Innovation Lab
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Data
Problems
Training
Best
Practices
Finished
Product
When a project is complete, the data science team tosses the fully formed product
back over the fence to the rest of the organization.
Pros: This is really great for honing a team with really great knowledge, training, and
fully formed best practices.
Cons: This silos the training and knowledge within the innovation lab, and there are
tradeoffs depending on your funding model for the innovation lab.
35. Embedded + Rotations
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Training
Best
Practices
A smaller central team trains up and deploys data scientists who then sit embedded
within the organization. Work varies from consultancy to longer-term projects
depending on local needs. Data scientists typically work on teams with other
experts and less with larger groups of other data scientists and software engineers.
36. Embedded + Rotations
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Training
Best
Practices
Projects
Learning
Training
Best
Practices
37. Centralized Consultancy
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Project Working Group
Data
Problems
Training
Training
Best
Practices
38. Centralized Consultancy
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Project Working Group
Data
Problems
Training
Training
Best
Practices
Different organizations contribute people, at different levels of commitment and
capacity, to a project working group that then produces a deliverable.
39. Centralized Consultancy
Team A Team B
Org 1 Org 2 Org 3 Org 4
Executives
Data Science
Team
Project Working Group
Data
Problems
Training
Training
Best
Practices
Finished
Product
40. How to grow data science in an org?
Top-down vs. grassroots
Data / Systems Skills / Culture
Top-down = focus is on big projects to transform systems, data, and workflows to
enable new data science approaches. Whole teams are stood up to assist in this
transformation. Data engineering is seen as critical.
Bottom-up = focus is on building skills and giving permission for innovation. There is no
single "approved way of doing things": different groups within the organization might pick
different languages and vendors based on individual needs. Flexible funding and time
are given to individuals who have the skills to try new data science methods but might
not have data scientist, data engineer, or software engineer on their business card.
42. Data scientists need to manage “outward”
into many parts of an organization
Because data science is a new field, it isn't embedded into all parts of an
organization, and data science skills are sparsely distributed. In 5-10 years, most
of this presentation will need to be updated as data science becomes more and more
prevalent.
43. All of these can make-or-break a project
Data access: Will you need to navigate legacy systems and/or data owners?
Value of data science: Is the project’s business value well defined?
Procurement constraints: Can a project operationalize/grow within the org?
Communication & design: Is the right information flowing effectively?
Organizational structure: What are the pros/cons of your structures/workflow?
To summarize, all of these can make or break a project, and I’ve seen this happen.
- Data Access: do you know whether navigating legacy systems will slow you
down? Do you have buy-in from data owners who you’re dependent on?
- Value of data science: Can you describe the project in terms of its business
value, business context, and schedule uncertainty to anyone who asks?
- Procurement constraints: Can the project be started, grown, and eventually
left in that organization, considering the organization's range of skills, funds,
human resources, etc.?
- Communication & Design: Are you communicating effectively to get the
information you need at the start of the project? Are you designing
user-interfaces at the end of the project that are effective and maximize the
value of all the other work you’ve done?
- Organizational structure: How does your team fit within the wider
organization? Do you understand the strengths and weaknesses that come
from that placement? How are you dealing with those constraints? What sorts
of projects fit well or don’t fit well with that type of organizational position?
44. Thanks, and keep in touch!
Justin Gosses @JustinGosses
Yulan Lin @y3l2n