
What the Aspiring or New Data Scientist Needs to Know About the Enterprise

Many data scientists are well grounded in delivering results in the enterprise, but many come from outside: from academia, PhD programs and research. They have the necessary technical skills, but the work doesn’t count until their product gets to production and is in use. The speaker recently helped a struggling data scientist understand his organization and how to create success in it. That turned into this presentation, because many new data scientists struggle with the complexities of an enterprise.

Published in: Data & Analytics

  1. 1. What the Aspiring or New Data Scientist Needs to Know About the Enterprise Presented by: William McKnight “#1 Global Influencer in Data Warehousing” OnAlytica President, McKnight Consulting Group An Inc. 5000 Company in 2018 and 2017 @williammcknight www.mcknightcg.com (214) 514-1444 Second Thursday of Every Month, at 2:00 ET
  2. 2. Data Science Pioneers Are Locking In • Data Science Pioneers – Let the Data Speak – Use Statistical Models – Use Machine Learning – Generate Deep Business Implications to Work – Deal in Algorithm Management • Some fake-it-till-you-make-it Data Scientists make it • First wave of Data Science leaders are emerging – And reap the exponential benefits 2
  3. 3. AI in Action in the Enterprise • Enhance in-car navigation using computer vision • Reduce cost of handling misplaced items • Improve call center experiences with chatbots • Improve financial fraud detection and reduce costly false positives • Automate paper-based, human-intensive processes and reduce document verification • Predict flight delays based on maintenance records and past flights, in order to reduce the cost associated with delays
  4. 4. Grayscale of Analytics and Data Science • Reports • Correlations • Predictions • Recommendations • Interactions 4
  5. 5. Data Scientists • Part business analyst, part high-skilled programmer, high-level statistician, and industry & company domain expert • Difficult to find • Lengthy non-linear recruitment process • Difficult to retain • Top Jobs – High-skill data analysis and interpreting – Data Architecture – Data modeling – AI/ML – Top Job 5
  6. 6. Work on Your Specific Challenge(s) • Organization: not ready for Data Scientist, or ready for Data Scientist • Data Scientist: trained in data science, or from analyst position • Organizational realities: Shield or Deal • Organizational realities: Shield or Deal and Grow Organizational Readiness • Grow Organizational Readiness and Grow Data Science Skills • Grow Data Science Skills
  7. 7. Data Scientists’ Backgrounds • Physical Sciences • Computer Science • Finance & Economics • Math & Statistics 7
  8. 8. Skill Requirements for AI/Data Scientist • The split of the necessary AI/ML between the 'edge' of corporate users and the software itself is still to be determined • Math and Statistics – floating point arithmetic, deep statistics, and linear algebra • Python – easy to program and it is good enough – NumPy and pandas libraries are available • TensorFlow – adds a computational/symbolic graph to Python • R and MATLAB – optimized for math with features such as direct slice and dice of matrices and rich libraries to draw from • Java and Scala – work well with Hadoop and Spark respectively • Soft Skills 8
  9. 9. Soft Skills of a Data Scientist • Business Goal Focus • Communication • Detail-Orientation • Creativity • Curiosity These are harder to learn than the hard skills. 9
  10. 10. Ideas for Growing Data Science • Shadow Existing Data Scientist • University Data Science Training • Online Education • Hackathons 10
  11. 11. AI Pattern 1. Hire/Grow Data Science 2. Uncouple AI from Organizational Constraints – While Conforming the Organization 3. Ideation 4. Compile Data! (Data Lake!) – Internal and External 5. Label Data 6. Build Model 7. Prototype 8. Iterate 9. Productionalize 10. Scale 11
  12. 12. Advice to the Data Scientist 12
  13. 13. There are Lots of Data Platforms • Operational Database • Operational Real-Time • Operational Big Data • Operational Data Hub • Master Data Management • A Data Warehouse • A Data Mart – Dependent – Independent • A Data Lake • Analytic Big Data Application • Archive Storage • A Staging Area 13
  14. 14. Inefficient Information Architecture Exists • Mostly exist for historical reasons • Enterprise data can be a casualty • Swelling budgets • Several paths forward
  15. 15. Data Architecture 2019 [diagram; labels: RDBMS; Legacy Sources; Data Integration; Users/Reports; Master Data; Operational; Analytical; Columnar Databases; Data Lake; Data Warehouse; Data Marts; In the Cloud; Syndicated Data; Data Stream Processing; NoSQL; Operational Applications and Users; Data Scientists] 15
  16. 16. Data Warehouses Have Flavors ● The Customer Experience Transformation Data Warehouse focuses on customer attributes and touchpoints to improve the value of customers. ● The Asset Maximization with IoT Data Warehouse deals with the high volume of edge data tracking the physical assets of the organization. ● The Operational Extension Data Warehouse supports company operations directly with real-time analytics. ● The Risk Management Data Warehouse supports the ever-growing compliance and reporting requirements and corporate risk. ● The Finance Modernization Data Warehouse handles the voluminous financial reporting and ensures the bottom line is considered in every aspect of the business. ● The Product Innovation Data Warehouse delivers all product-related information into the decisions of the product life cycle.
  17. 17. You’ll Spend Most of Your Time in the Data Lake [diagram; labels: Data Scientist Workbench and Data Warehouse Staging; OLTP Systems; Data Lake; Data Scientists; ERP; CRM; Supply Chain; MDM; …; Data Warehouse; Data Mart; Stream or Batch Updates; DI; Real-Time, Event-Driven Apps] 17
  18. 18. You’ll Need Many Data Domains • Marketing – segmentation analysis, campaign effectiveness • Cybersecurity – proactive data collection and analysis of threats • Smart Cities – track vehicle movements, traffic data, environmental factors to optimize traffic lights, ensure smooth flow and manage tolling • Oil and Gas - determine drilling patterns, ensure maximum utilization of assets, manage operational expenses, ensure safety, predictive maintenance • Life Sciences – study human genome (100s MB/person) for improving health • Customer • Employee • Partner • Patient • Supplier • Product • Bill of Materials • Assets • Equipment • Media • Agencies • Branches • Facilities • Franchises • Stores • Account • Certifications • Contracts • Financials • Policies Typical Data Domains
  19. 19. You need an architecture that delivers • Management of data sets both in learning and inference • Manage identities and access controls over data sets, models and insights • Ability to deploy easily to custom CPU/GPU infrastructure and libraries • Operational monitoring of ML jobs, for efficiency and quality of insights • Transparency and trust (e.g. through audit trails) to build confidence in the results 19
  20. 20. Data is ready when it is… • In a leverageable platform • In an appropriate platform for its profile and usage • With high non-functionals (availability, performance, scalability, stability, durability, security) • Captured at the most granular level • At a data quality standard (as defined by Data Governance) 20
  21. 21. The Data May Be of Suspect Quality • Referential integrity • Uniqueness • Cardinality • Subtype/supertype constructs • Value reasonability • Consistent value sets • Formatting • Data derivation • Completeness • Correctness • Conformance to a clean set of values
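A few of the quality dimensions on this slide (uniqueness, referential integrity, completeness, value reasonability) can be checked with very little code before any modeling starts. The following is a minimal sketch using only the Python standard library. The `customers` and `orders` tables, their column names, and the 0 to 10,000 amount range are hypothetical and exist only for illustration.

```python
# Hypothetical sample data: table names, columns and values are illustrative only.
customers = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "CA"},
    {"customer_id": 2, "country": "US"},  # duplicate key
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 3, "amount": 40.0},  # no matching customer
    {"order_id": 12, "customer_id": 2, "amount": None},  # missing value
    {"order_id": 13, "customer_id": 1, "amount": -5.0},  # unreasonable value
]


def duplicate_keys(rows, key):
    """Uniqueness: return key values that occur more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


def orphan_rows(child_rows, fk, parent_rows, pk):
    """Referential integrity: child rows whose foreign key has no parent."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def missing_values(rows, column):
    """Completeness: rows where the column is empty."""
    return [row for row in rows if row[column] is None]


def out_of_range(rows, column, low, high):
    """Value reasonability: non-empty values outside an assumed valid range."""
    return [
        row for row in rows
        if row[column] is not None and not (low <= row[column] <= high)
    ]


report = {
    "uniqueness": duplicate_keys(customers, "customer_id"),
    "referential_integrity": [
        r["order_id"]
        for r in orphan_rows(orders, "customer_id", customers, "customer_id")
    ],
    "completeness": [r["order_id"] for r in missing_values(orders, "amount")],
    "value_reasonability": [
        r["order_id"] for r in out_of_range(orders, "amount", 0.0, 10_000.0)
    ],
}
print(report)
```

In practice the thresholds and valid value sets would come from the organization's Data Governance standards rather than being hard-coded as they are here.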
  22. 22. Somebody will focus on the Non-Functional Requirements • Availability, Performance, Reliability, Scalability, Maintenance, Capacity, Security, Usability, Connectivity, Systems management, Disaster recovery • Used to describe the quality attributes of the system and the constraints which the design options will be required to satisfy in order to deliver the business goals, objectives or capabilities. • These requirements assist in determining sizing, cost and viability of the proposed system as well as service level requirements for operational management of the solution.
  23. 23. Find an Application to Ride
  24. 24. Organizations are trying to run as agile • Agile roles • Sprints • Commitments • You may be on or lead an agile team(s) 24
  25. 25. Organizations will need to do Organizational Change Management to get your work accepted • Some OCM tasks may be done at release or product levels, e.g., – Stakeholder Management – Training – Future State Job Roles • Others are done for epics/features, e.g., – Deployment Readiness – Deployment Communications – Future State Job Roles [diagram labels: Product, Release, Sprint]
  26. 26. Data Science Modeling • Evaluate various models and algorithms – Classification – Clustering – Regression – Others • Tune parameters • Iterative experimentation • Data preparation • May discover additional data needs or DQ issues 26
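The "evaluate various models" and "tune parameters" steps on this slide are usually done with cross-validation: each candidate is scored on data it was not fitted on, and the best score wins. Below is a minimal sketch in standard-library Python, assuming a toy one-feature regression data set, a mean-value baseline, and a hand-written k-nearest-neighbors regressor. None of these choices come from the presentation; in a real project a library such as scikit-learn would normally take their place.

```python
from statistics import mean

# Toy regression data (an assumption for illustration): y is roughly 2 * x.
xs = [float(x) for x in range(20)]
ys = [2.0 * x + (1.0 if int(x) % 2 else -1.0) for x in xs]


def mean_baseline(train_x, train_y, x):
    """Baseline model: always predict the mean of the training targets."""
    return mean(train_y)


def make_knn(k):
    """Build a k-nearest-neighbors regressor; k is the parameter being tuned."""
    def predict(train_x, train_y, x):
        nearest = sorted(
            zip(train_x, train_y), key=lambda pair: abs(pair[0] - x)
        )[:k]
        return mean(y for _, y in nearest)
    return predict


def cross_val_mse(predict, xs, ys, folds=5):
    """Mean squared error over held-out folds (rows assigned by index)."""
    errors = []
    for fold in range(folds):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds != fold]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds == fold]
        train_x = [x for x, _ in train]
        train_y = [y for _, y in train]
        errors.extend((predict(train_x, train_y, x) - y) ** 2 for x, y in test)
    return mean(errors)


# Candidate models: one baseline plus three settings of the tuned parameter.
candidates = {"mean baseline": mean_baseline}
for k in (1, 3, 5):
    candidates[f"knn(k={k})"] = make_knn(k)

scores = {name: cross_val_mse(model, xs, ys) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)
print(best_name, scores)
```

Keeping a trivial baseline among the candidates is a deliberate choice: if a tuned model cannot beat it, that often points to the additional data needs or DQ issues the slide mentions.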
  27. 27. You May Feel/Be Alone • Allow Ample Time For Skill Building • Stay Close To Business Strategy – Eventually you will set business strategy • Look to Digital Leaders • Pair With Data Scientists Across Different Domains To Collaborate 27
  28. 28. Machine Learning Ethics • Elon Musk: “AI is our biggest threat” • Weapons • Bias • Generating Training Data • Transparency • Fake News • Jobs • Surveillance • Birth traits • AI Rights 28
  29. 29. Be a Leader. Shoot for this… Analytics Strategy Analytics Architecture Analytics Modeling Analytics Processes Analytics Ethics Multiple data scientists on staff. New team members brought up to speed in weeks, not quarters. Analytics contributions to all major projects are considered. Central catalog to track all models along their lifecycle. Enterprise data is cataloged, accessible, well-performing and managed. Hard to make manual errors. Logic within analytics is transparent. Model expansion in the enterprise. Output from analytics is predictable and consistent, with auditable outcomes. Models are reproducible. Unused and redundant settings are detectable. Access restrictions applied to models. Data is tested for model applicability. Easy to specify a configuration as a small change from a previous configuration. Analytic applications monitored for operational issues. Production analytic flow includes packaging, deployment, serving and monitoring. Scoring runs on a periodic basis. Good faith attempts to remove biased variables from models. Potential for malicious use of analytics considered in analytics lifecycle.
  30. 30. …and beyond. Business is fundamentally different than 2 years ago due to ML. ML is driving company initiatives. Engineers & researchers are embedded on same teams (and perhaps the same person). Full ML code reviews. ML can be deployed from anywhere. Automated end-to-end ML lifecycle supports frequent model updates, model testing. Dozens to hundreds of models running simultaneously. Impact of small changes to ML can be measured. New algorithmic approaches tested at full scale. Visual model configuration changes. Cybersecurity experts engaged in ML operations. ML systems protected from manipulation and corruption; incorruptibility highly considered in all models. Model transparency, actions can be explained. End-to-end audit trail for ML – who, why, when. Only fully vetted models are used. 30 Analytics Strategy Analytics Architecture Analytics Modeling Analytics Processes Analytics Ethics
  31. 31. Benefits of MLOps • MLOps draws on DevOps principles and practices. Built upon notions of continuous integration, delivery and deployment, DevOps responds to the needs of the agile business – in summary, to be able to deliver innovation at scale. Principles include: • Continuous integration and delivery (CI/CD): initiatives follow iterative models that can create value quickly, while building understanding and experience. • Collaborative development: solutions are defined, created and optimised based on input from multiple stakeholder groups • Business value focus: measurement and management look at both the efficiency and effectiveness of solutions • Governance by design: Quality, security, compliance and other factors are to be considered at the outset and across the project. 31
  32. 32. Focuses • Templates and automation for model development • Model Monitoring • Deployment Management • Quality Assurance/SDLC • Human Alerting 32
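Model monitoring and human alerting can start very simply: compare recent production scores against a reference window and notify a person when the average moves too far. The sketch below assumes a hypothetical `drift_alert` helper and an arbitrary threshold of three standard errors. It is one possible check, not a description of any particular MLOps product.

```python
from statistics import mean, stdev


def drift_alert(baseline_scores, recent_scores, max_shift=3.0):
    """Flag drift when the recent mean score moves too far from the baseline.

    The shift is measured in standard errors of the recent mean, estimated
    from the baseline spread. max_shift is an assumed, tunable threshold.
    Returns (alert, shift).
    """
    base_mean = mean(baseline_scores)
    standard_error = stdev(baseline_scores) / len(recent_scores) ** 0.5
    difference = abs(mean(recent_scores) - base_mean)
    if standard_error == 0:
        # Baseline had no spread: any change at all counts as drift.
        return difference > 0, float("inf") if difference > 0 else 0.0
    shift = difference / standard_error
    return shift > max_shift, shift


# Illustrative numbers only: model scores from a reference period and two recent windows.
baseline = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
stable_window = [0.51, 0.49, 0.50, 0.52]
drifted_window = [0.70, 0.72, 0.69, 0.71]

for name, window in (("stable", stable_window), ("drifted", drifted_window)):
    alert, shift = drift_alert(baseline, window)
    if alert:
        print(f"ALERT ({name}): mean score shifted by {shift:.1f} standard errors")
    else:
        print(f"ok ({name}): shift of {shift:.1f} standard errors")
```

In a production flow the print call would be replaced by whatever paging or ticketing channel the operations team already uses.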
  33. 33. Challenges • These are still very early days for Data Science, and practices are still being ironed out • Many Data Science initiatives work in isolation from each other and the broader business • Data Science can require massive volumes of data, which needs to be accessed scalably • It is difficult to measure and manage the value of Data Science projects • Senior management does not yet see Data Science as strategic 33
