Practical considrations of data science cleaned.pptx

THIS VERSION INCLUDES SPEAKER NOTES IN SLIDES: It was delivered at the 2017 Rice Data Science Conference. As opposed to the typical data science talk on math, models, or frameworks, this talk discusses the need to successfully manage people relationships when doing data science consulting and prototyping in a large organization. Common traps to avoid, key questions to answer early, how organizational procurement patterns influence tool selection, and the importance of having a good local partner close to the data are all discussed. The in-person presenter of this talk at Rice Data Science Day was Yulan Lin - https://www.linkedin.com/in/yulanlin/ Justin's slides were recorded in advance. The version without speaker notes is here: https://www.slideshare.net/JustinGosses/practical-considerations-of-data-science-consulting-in-large-organizations-oct-12-2017

Published in: Data & Analytics

  1. 1. Yulan Lin @y3 2n Justin Gosses @JustinGosses Data Science & Software Engineering Valador Inc. Supporting NASA OCIO Rice Data Science Conference, Oct. 2017 Practical Considerations for Data Science Consulting and Innovation in a Large Organization
  2. 2. Why practical considerations? There’s a lot of conversation around what mathematical models are good, what technologies to buy, or even what open-source libraries have the best implementations of machine learning algorithms. However, there’s a lot more that goes into effective data science and how it integrates into an organization’s day-to-day than the math or the models. Here are some of the lessons that we’ve learned working on an internal consulting team that serves a large organization whose mission just might happen to include putting people in space. Now might be a good time to introduce who we are and what we do. You might notice that I keep using the word “we” here. My colleague and I submitted this talk because we wanted to share with you some of our hard-earned lessons about the parts of data science that have very little to do with code, feature selection, and model choice. He’s currently in Virginia at a poster session, but will be presenting his portion via video, and hopefully will also be online for Q&A.
  3. 3. ● Startup ● Built around a software product ● Small companies This talk is NOT from the perspective of... Both Justin and I are scientists by training, so we like to make sure that our biases, assumptions, and perspective are made explicit and clear. Let’s begin with the caveats; the hedging as my thesis advisor would call it.
  4. 4. ● Established organization (> 10 years old) ● Large organization (> 10,000 people) ● Core function is not software/technical Clarification: not just NASA! Oil & Gas, Banking, Health, etc. This talk IS from the perspective of Now that we’ve gotten the caveats and what we are not out of the way, let’s talk about what this talk is from the perspective of. This talk is from the perspective of an organization that is established, has been around, and has ingrained in it culture, habits, and patterns from before data science was the hot new buzzword. In addition, we want to make it clear that we’re talking from the perspective of data scientists who are operating in an organization whose sole focus isn’t building software products. This isn’t necessarily NASA-specific, and it draws not just on our own experiences but also on conversations with data scientists at other large companies in oil & gas, health, and banking.
  5. 5. A grab bag at the intersection of math + code that includes ● Machine Learning ● Deep Learning ● Statistics ● Data Visualization What is Data Science? If we’re going to talk about practical considerations of Data Science, we’re going to need to define it here. There are some really great Venn diagrams out there, and blog posts expounding on what Data Science is as a field. For the purposes of this talk, we’re going to talk about data science as a grab bag at the intersection of math and code that includes machine learning, deep learning, statistics, and data visualization.
  6. 6. ● Data access ● How data science adds value ● Influence of procurement constraints ● Communication & narratives ● How data scientists are distributed Roadmap of our talk In some ways, we’re trying to start small, and then zoom out. Data access is about the “smallest” level: can we even get to the data we need? Adding value is about convincing stakeholders that data science matters. Procurement constraints start getting beyond data scientists and data science clients into the larger organization. Communication is about communicating out the results of the data science. Finally, the distribution of data scientists is a big-picture question about how we structure the organization. With that, I’d like to pass it off to Justin, to talk about data access and the value-add of data science.
  7. 7. Data Access I’m (Justin) going to talk about data access in this section of the talk. - It is common in data science talks and articles for someone to say data science can be 80% data cleaning and 20% actual analytics and modeling. While that can sometimes be true, it assumes that you can get to the data in the first place. - For data science teams acting as an internal consultancy in a large organization, obtaining access to actionable data can be a substantial task by itself. Often the data owners are not you, your team, or even the people you’re partnering with on your project, so getting good data is a common pain point. - In this section, I’m going to talk about a few data access topics, including: creating data that does not yet exist, whether data is programmatically accessible, data cleaning concerns, data compliance, data as political power, and finally the importance of having a good local partner on any data project.
  8. 8. Does the data exist? ● Is it actually useful for your problem? ● If you’re going to collect it or make it: welcome to data engineering - Whether the data exists yet is not always a straightforward question. Sometimes the data exists but would take too long to get, and sometimes it exists but doesn’t have all the attributes you need. - If the data doesn’t exist yet and you’re going to collect real data, congratulations: you’re now doing data engineering, which involves data collection, reliability engineering, database engineering, etc. These tasks tend to slowly grow in scope as expectations rise over time, even for toy prototype systems, so this is one area where you need to watch for scope creep. - Another option is to say, “oh, I’ll just create some fake data”. There are a variety of helpful libraries for creating fake data; however, in my experience a product built on fake data almost always has to be rebuilt, to some extent, once you get the real data, so that probable additional work should be built into your schedule.
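As a rough illustration of the "just create some fake data" option, here is a minimal sketch using only the Python standard library. The schema (`record_id`, `site`, `logged_on`, `reading`), the category values, and the distribution parameters are all hypothetical placeholders, not anything from the talk; a dedicated fake-data library could be swapped in.

```python
import random
from datetime import date, timedelta


def make_fake_records(n, seed=0):
    """Return n synthetic records using a made-up schema (an assumption)."""
    rng = random.Random(seed)  # seeded so the fake data is reproducible
    start = date(2017, 1, 1)
    sites = ["site_a", "site_b", "site_c"]  # hypothetical category values
    records = []
    for i in range(n):
        records.append({
            "record_id": i,
            "site": rng.choice(sites),
            # random day within one year of the start date
            "logged_on": start + timedelta(days=rng.randrange(365)),
            # roughly bell-shaped numeric field; parameters are arbitrary
            "reading": round(rng.gauss(50.0, 10.0), 2),
        })
    return records


fake = make_fake_records(5)
```

Seeding the generator keeps demos repeatable, but the caveat from the notes still applies: real data will likely have gaps and quirks that a generator like this never produces.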
  9. 9. Is the data programmatically accessible? - Getting the data out of legacy systems and into a form you can work with is another common hurdle in a large organization. - Large organizations have probably been around for a while. This means most of their systems were made before data science was a thing, and many were made before APIs or microservices were common. Basically, you might be dealing with something designed for human entry and retrieval of data, one file at a time, not programmatic access, which might be a problem if you need to get out thousands of files, not just one. - Often, if you’re working on a small study or prototype, you might be able to start by scraping some of the data. By scraping, I mean accessing the system by logging in as a human and then using a combination of the Selenium and Beautiful Soup libraries to write a program that “pretends to be a person”, automating the process of getting at least some of the data out. - With this data, you can develop a small demo, and then use that demo to argue to the data owners: hey look, we made something cool that’s really useful for you, therefore you should give us better data access.
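The parsing half of the scraping workflow described above can be sketched as follows. This is an assumption-laden illustration: it uses the standard library's `html.parser` in place of Beautiful Soup so that it runs without extra installs, it skips the Selenium login step entirely, and the sample page, the `legacy.example.org` URL, and the file extensions are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FileLinkParser(HTMLParser):
    """Collect absolute URLs of links that point at downloadable files."""

    def __init__(self, base_url, extensions=(".csv", ".pdf")):
        super().__init__()
        self.base_url = base_url
        self.extensions = extensions  # hypothetical file types of interest
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(self.extensions):
            # resolve relative links against the page's base URL
            self.links.append(urljoin(self.base_url, href))


# Stand-in for HTML that a logged-in browser session would have returned.
SAMPLE_PAGE = """
<html><body>
<a href="/reports/2016_q4.csv">Q4 2016</a>
<a href="/reports/2017_q1.csv">Q1 2017</a>
<a href="/help">Help</a>
</body></html>
"""


def extract_file_links(html, base_url):
    parser = FileLinkParser(base_url)
    parser.feed(html)
    return parser.links


links = extract_file_links(SAMPLE_PAGE, "https://legacy.example.org")
```

In a real prototype the HTML would come from the authenticated session rather than a string constant, and the resulting list of links would feed a download loop.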
  10. 10. How “clean” is the data? (or: how much data translation services do you require from subject matter experts?) - Data cleaning is another issue. - Often you need a subject matter expert not so much to help clean the data as to translate all of its warts and weird “features” that have developed over a significant amount of time as people and systems change. - Getting a subject matter expert with enough time to help you is another area that can really make or break a project. - If you get all the way to the presentation phase and there is some non-intuitive part of the data that subject matter experts all know but you don’t, it can make the audience skeptical of your entire project. - Worse yet, if you don’t have access to a subject matter expert who responds quickly during the project work, you can end up in email question-and-answer purgatory, where work gets stalled while you wait for answers. - Understanding how much help you’ll need is critical when setting up a new project. If you don’t have the right organizational support, a project can get slowed to a crawl and potentially never finish.
  11. 11. Compliance: all. of. the. rules. ● Data Access ● Data Transfer ● Data Storage ● Data Anonymization ● Data Sharing - Even in public-facing university or government positions, where you expect a lot of data to be public, there can be a lot of data that needs to be kept secure. - Financial data, people data, and data not yet finalized may all have security protocols. Who approves that you’re doing it correctly? Who set up the rules, and did they have you and your use case in mind? Can approval processes actually get done on your timeframe? Is there a well-established approval process? - These are all things to find out as early as possible, as they can push back your timeline. - Also, asking forgiveness rather than permission does not work well here.
  12. 12. Data is currency: ● Power & Politics (at some level) ● Empathy is useful - While it is not necessary to go Machiavellian here, it is important to remember that data has human data owners, producers, cleaners, and users. This means, at some level, data is power. - When you try to do a new thing with data, especially data that already exists in a legacy system, you may be disrupting an already well-balanced system. - If you experience pushback, being able to step back and understand another person’s perspective can sometimes be useful.
  13. 13. A good partner helps you navigate: ● If data exists ● Data access ● Data oddities ● Compliance processes ● Data politics - It is critically important for data science consulting groups to have good partners whose interests and resources align with their goals. - An ideal partner will already have data access permissions, knowledge of the “data reality”, and is better equipped to fight organizational battles around the data, as they often already know the players, the turf, and the priors of the context the project is operating within. - A good partner will also have the time and permission to contribute to the project at a level necessary for its success and is capable of carrying on ownership of the project into the future.
  14. 14. How Data Scientists Add Value - Data science is a new field. There’s lots of hype, lots of new teams, lots of real but ambiguous possibilities, and low management familiarity with how to apply this skillset. - This all means the potential for a mismatch between expectations and delivery is high, so being able to demonstrate how you’ve added value to the larger organization is the currency needed to continue operating. - Data science teams add value in at least three ways. They can create a new data product, build organizational skills and capabilities, or expand awareness of technologies and capabilities, which then leads other individuals or teams to take the next step themselves.
  15. 15. Products - Products can include data visualizations, predictions, recommendation engines, speech recognition capabilities, image recognition, natural language processing, and all the different types of web applications and Internet of Things methods for user interface and deployment. - More abstractly, you’re very often speeding up a previously human-driven process or enabling something that wasn’t done before, largely because it was too costly as a human-driven process. Because of this, your products are often compared against human-level accuracy as a first reaction. - Additionally, you might rightly get different reactions to outreach about potential projects from people based on how secure they feel in their position, and on whether you’re providing a new capability they would never have had time to build themselves, automating the 10% of their job that they don’t like anyway, or automating the 75% of their job that makes up the bulk of the reason for them being there. - Again, this goes back to understanding how your projects fit within the large organization.
  16. 16. Awareness: Spreading knowledge of what is possible - Today there is a lot of centralization of data scientists in organizations. Part of the reason for this is that the skills are in limited supply and the best ways to apply those skills are still being developed. However, as more and more staff know how to program and have machine-learning experience, more data science activities will happen throughout the organization and not on centralized teams. - If you’re on a centralized data science team, it is likely that part of your job (even if not written down) is to speed up that transition by expanding the number of people who know what is, isn’t, or might be possible, and by increasing the number of people wanting to use new technology or brainstorming new applications. - For example, sometimes this takes the form of single meetings with people to talk about the range of potential software solutions or open-source code libraries that might apply to their problem. By saving a team a week’s worth of understanding the landscape, you’ve increased the probability they’ll make something useful with their limited time.
  17. 17. Capability: Building skills & Bringing in new tools - Just like awareness, building the capability of the larger organization is also part of a data science team’s job. - This includes both skills and IT infrastructure. - Project partners and subject matter experts are often where this capability building happens informally. - Data science products are often completed and then handed off to another organization. - Building that organization’s staff’s ability to maintain, understand, and use solutions is often necessary for the longer-term success of projects, and it also builds the capability of the larger organization in terms of skills and establishing the first use of new software/libraries/etc.
  18. 18. Procurement Constraints In large, established, organizations, there are often policy and culture constraints around procurement, many of which directly or indirectly influence the ability of data scientists to effectively do their job. When we’re talking about procurement, what we’re trying to describe here is a collection of considerations around purchases of software that go beyond just one or two people demo-ing a product on their local machines, or a couple people running a trial of some enterprise software. Large organizations often have policy and procedures that major software purchases have to go through, some to make sure things get paid for, and other times because of concerns that include security and reliability.
  19. 19. Does the proposed product fit the organization now and in the future? Consider: ● Skill development ● Workflow ● Tech stack There are a lot of products out there, but in a large and established organization, procurement and adoption are a big process. It’s not enough to consider whether or not a product is the right one for you right now; you also need to consider whether or not it will work over the course of time. There are a couple of things to consider - where does this fit into your organization’s plan in terms of skill development/distribution across your workforce, how easy it will be to integrate with the workflows you are phasing out and the workflows you are phasing in, as well as the technology stack you’re using; what are the common tools you’re using, does it integrate with your data sources now, does it integrate with productivity tools that you currently use (Office) - how the licensing model works (per install or per concurrent user? One-time or per year?); is it desktop or browser based (desktop-based comes with limitations)
  20. 20. What is the official process? What kinds of considerations, paperwork, and approvals need to be made? Who is your audience for each of these justifications? Who is the point of contact for each part of the process? Which regulations are applicable, and where is the documentation about the process stored?
  21. 21. What is the culture? And unofficial process… There’s the process on paper, and then there is navigating the people and their motivations. It’s essentially a very complicated people optimization problem.
  22. 22. Open-source vs. proprietary Open source can be scary, and you might have to talk about actual risk. It’s also important to consider that for people who sign off on things, proprietary software from a big name is always going to be their preference. Choosing Microsoft (maybe Amazon) will not harm anyone’s career. On the other hand, for prototyping, culture change, and skill-building, open source is great because it’s free and easily available, and you don’t have to go out and buy anything to get started. There are also organizations that have internal repositories of open-source software that’s been vetted by security, so that you mitigate the risk of malicious code mimicking your dependencies in a repository somewhere. One last thing - if your company does depend on a lot of open-source software, please find ways to give back. Ask your developers to contribute code/documentation, and donate monetarily to projects; it helps to keep the ecosystem alive!
  23. 23. Communication and Design Effective communication and design at the start, middle, and end of a project is important for a variety of reasons that I will get into over the next few slides.
  24. 24. Data science: What does that even mean? (or why managing expectations is important) Credit: https://xkcd.com - First, effective communication is important because data science is new, ill-defined, and sometimes even magical. This is fun, but it also means there is high potential for a mismatch of expectations, as this great XKCD comic demonstrates and as I’ve mentioned on previous slides. - Clients, partners, and managers aren’t always going to have an understanding of what things are technically hard, easy, or even doable. You’ll need to tell them. - They are also going to have different definitions in their heads for what does or doesn’t fall into data science. - This is both because data-scientist-driven teams haven’t existed very long, at least in their current form, and because what is possible is changing very quickly. For example, this comic was written only three years ago and I bet there are many people in this audience right now thinking to themselves “is bird recognition really a five year problem?” - This situation is different from, say, an oil company’s exploration team of geologists and geophysicists, where very similar products have been produced for similar questions for many years. - Because of this, expectation management needs to be a constant part of any project.
  25. 25. Effective Narratives: Don’t let the Buzzwords + math + programming get in the way of the Business value + project schedule + uncertainty story - When describing your work, the details of the mathematics, statistics, and algorithms are not always going to be understood in detail, and are not always feasible to communicate. - In order to correctly manage expectations, the emphasis is better put on business value, business decision logic, and range of uncertainty in terms of organizational resources required, time required, and range of possible prediction accuracy. - A good general rule when presenting or explaining is that you should be able to drop any of the math or programming or buzzwords and still have what you’re saying make sense.
  26. 26. Understand as early as possible ● What’s the real problem? ● Does the data exist? ● Can you access the data? ● How clean is the data? ● What is the business value? ● What is the organizational context? - At the beginning of projects, there is a significant need to define the problem, understand the data, how that data is accessed, how this project might be used in workflows, and how it fits into the larger data and decision space ecosystem in the organization. - These are the questions that we’ve found important to answer as early as possible. - Bad assumptions at the start of a project can cause you to waste time working on dead ends, which then causes you to go back and repeat work. - Being able to ask the right questions, and get the right level of detail back, is a very important skill to develop, and one that doesn’t just happen automatically.
  27. 27. When delivering something that will be used by people: Consider user-centered design - A product that delivers a high-accuracy prediction but is a pain to use is still a poor product. - User-centered design is a range of techniques that can be used to better understand user problems and build effective, easy-to-use tools to meet those needs. - I won’t attempt to go into detail about it here, except to say that I would encourage anyone working on data science products that will eventually be used by humans (and not another program) to explore this philosophy and use it not just on a user-facing front-end but throughout the project design as well.
  28. 28. Data Visualization: You’re likely undervaluing it - My personal opinion is that data visualization is undervalued. - With machine learning, success is easy to measure, as we can define it in terms of accuracy, false negatives, etc. - It is very difficult for us to look at a data visualization and recognize how well or badly it is performing, because our comprehension of it happens so fast, often faster than we’re consciously thinking about it. - Therefore, we’re more attracted to the insights from machine learning, especially supervised learning, in a way we aren’t to data visualization. - I think this is unfortunate, because the number of problems that can be impacted by data visualization is probably larger than the number that can be impacted by machine learning in any given organization, and we’re probably missing a lot of opportunities. - This could be its own talk, so I’ll end with: learn web development and JavaScript. - JavaScript is where the really interesting data visualization is happening, and web development is the most effective way to share complex, high-dimensional information with a large number of people with minimal requirements on the user end.
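To make concrete why machine learning success is "easy to measure", here is a minimal sketch of the two measures named in the notes, accuracy and false negatives, for a binary classifier. The function name and the example labels are invented for illustration; libraries such as scikit-learn provide equivalent metrics.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy and false negative rate for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    # share of actual positives that the model missed
    false_negative_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return {
        "accuracy": accuracy,
        "false_negatives": fn,
        "false_negative_rate": false_negative_rate,
    }


# Made-up labels: 4 actual positives, one of which the model misses.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

There is no comparably compact function for scoring how well a chart communicates, which is the asymmetry the notes point to.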
  29. 29. The distribution of data scientists in an organization Let’s talk now about how and where you distribute data scientists in an organization. As a reminder of the perspective we’re taking, we want to specifically focus on large organizations where data science and software are not the main focus. Another way of framing it, particularly for this part of the discussion, is to think about what it takes for a large, established organization to start investing in data science, and what that organizational structure can/should look like. To begin with, here are some key concepts.
  30. 30. Distribution of Data Scientists [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team] Something to note as we talk about different organizational structures is this: what is the relationship of the data science team to the rest of the organization?
  31. 31. Distribution of Data Scientists [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Organizational fence] Is there an organizational fence, or boundary between them?
  32. 32. Distribution of Data Scientists [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Organizational fence; Data, Problems, Finished Product, Training, Best Practices] How are data, problems, finished products, training, and best practices passed between “Data Scientists” and the “rest of the organization”? Another thing to consider: who “pays” for the data science team? Are they a core service, just like a basic IT seat? Is their time billed to specific internal clients?
  33. 33. Innovation Lab [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Data, Problems, Training, Best Practices] Innovation lab: Central tank of data scientists working on longer-term projects inside the group that eventually transition outside once “fully grown.” Datasets and problems are tossed by the organization “over the fence” at the Data Science team, which has training and best practices ready to go.
  34. 34. Innovation Lab [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Data, Problems, Training, Best Practices, Finished Product] When a project is complete, the data science team tosses the fully formed product back over the fence to the rest of the organization. Pros: This is really great for honing a team with really great knowledge, training, and fully formed best practices. Cons: This silos the training and knowledge within the innovation lab, and depending on your funding model for the innovation lab
  35. 35. Embedded + Rotations [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Training, Best Practices] A smaller central team trains up and deploys data scientists who then sit embedded within the organization. Work varies from consultancy to longer-term projects depending on local needs. Data scientists typically work on teams with other experts and less with larger groups of other data scientists and software engineers.
  36. 36. Embedded + Rotations [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Training, Best Practices, Projects, Learning]
  37. 37. Centralized Consultancy [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Project Working Group; Data, Problems, Training, Best Practices]
  38. 38. Centralized Consultancy [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Project Working Group; Data, Problems, Training, Best Practices] Different organizations contribute people at different levels of commitment and capacity to a project working group that then produces a deliverable.
  39. 39. Centralized Consultancy [Diagram labels: Executives; Org 1, Org 2, Org 3, Org 4; Team A, Team B; Data Science Team; Project Working Group; Data, Problems, Training, Best Practices, Finished Product]
  40. 40. How to grow data science in an org? Top-down vs. grassroots Data / Systems Skills / Culture Top-down = focus is on big projects to transform systems, data, and workflows to enable new data science approaches. Whole teams are stood up to assist in this transformation. Data engineering is seen as critical. Bottom-up = focus is on building skills and giving permission for innovation. No single “approved way of doing things” - Different groups within the organization might pick different languages and vendors based on individual needs. Flexible funding and time given to try new data science methods by individuals who have skills but might not have data scientist, data engineer, or software engineer on their business card.
  41. 41. Wrap-up
  42. 42. Data scientists need to manage “outward” into many parts of an organization Because data science is a new field, it isn’t embedded into all parts of an organization, and data science skills are sparsely distributed. In 5-10 years, most of this presentation will need to be updated as data science becomes more and more prevalent.
  43. 43. All of these can make-or-break a project Data access: Will you need to navigate legacy systems and/or data owners? Value of data science: Is the project’s business value well defined? Procurement constraints: Can a project operationalize/grow within the org? Communication & design: Is the right information flowing effectively? Organizational structure: What are the pros/cons of your structures/workflow? To summarize, all of these can make or break a project, and I’ve seen this happen. - Data Access: Do you know whether navigating legacy systems will slow you down? Do you have buy-in from the data owners you depend on? - Value of data science: Can you describe the project in terms of its business value, business context, and schedule uncertainty to anyone who asks? - Procurement constraints: Can the project be started, grown, and eventually left in that organization, considering the organization’s range of skills, funds, human resources, etc.? - Communication & Design: Are you communicating effectively to get the information you need at the start of the project? Are you designing user interfaces at the end of the project that are effective and maximize the value of all the other work you’ve done? - Organizational structure: How does your team fit within the wider organization? Do you understand the strengths and weaknesses that come from that placement? How are you dealing with those constraints? What sorts of projects fit well or don’t fit well with that type of organizational position?
  44. 44. Thanks, and keep in touch! Justin Gosses @JustinGosses Yulan Lin @y3 2n
