Creating an Omni-Channel Customer Experience with ML, Apache Spark, and Azure Databricks

  1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  2. Todd Dube – CarMax Technology Creating an Omni-Channel Customer Experience (Spark / ML) #UnifiedAnalytics #SparkAISummit
  3. About CarMax • Original used car industry disruptor 25 Years • Nation’s largest retailer of used cars (over $18b Revenue) • Sold 5+ million wholesale vehicles • 200+ stores in 41 states • Top 10 used car loan originator • #174 on 2018 Fortune 500 List • 25,000+ associates nationwide • FORTUNE 100 Best Companies to Work For 15 years in a row (Feb 2019) 3#UnifiedAnalytics #SparkAISummit
  4. CarMax and Omni-Channel • Omni-Channel’s focus is on the Customer – Convenience, seamlessness, and personalization • We customize the experience for how the customer wants to buy a vehicle • Online, In Store, Express Pickup, and Delivery • Data Science and its enablement are part of our growth strategy.
  5. CarMax and Data Science – history • Data Science (DS) has grown tremendously Year after Year • Key Foundational DS Assets are now critical to our current and future growth • Data Scientists create/repurpose SQL/Python code on our Data Warehouse or their Laptops • Limited Compute/Space on Laptops / On-Prem Limitations • Ad-Hoc datasets everywhere = governance and truth-in-data issues • Models have to be rewritten for integration w/ other apps/services
  6. Real Example of prior work involved • Recommender flow involved manually pulling prior data • Work done on Datasets via local Laptop, exported CSV for import • C#/.NET Application with Logic and Coefficients for Model Service • Data could change but Model couldn’t change without planning and effort (once a Month if planned) • Need for streaming and real-time ingestion of vehicle information [Diagram: Teradata (SQL) → MMT Matrix CSV → Recommender Generator (.NET) → Cosmos DB → Recommender Service API (.NET) → CarMax.com, serving stock-to-stock Vehicle Recommendations]
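The stock-to-stock recommendations above can be sketched as a similarity lookup over vehicle feature vectors. A minimal illustration, assuming cosine similarity over hypothetical encoded attributes (this is not CarMax's actual model or matrix):

```python
import numpy as np

# Hypothetical vehicle feature vectors: rows = stock items, columns =
# encoded attributes (e.g. make/model/trim, price band, mileage band).
vehicles = np.array([
    [1.0, 0.0, 0.9, 0.2],
    [1.0, 0.0, 0.8, 0.3],
    [0.0, 1.0, 0.1, 0.9],
], dtype=float)

def stock_to_stock(features: np.ndarray, top_k: int = 1) -> np.ndarray:
    """Return, for each vehicle, the indices of its top_k most similar
    peers by cosine similarity (self-matches excluded)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # never recommend the same stock item
    return np.argsort(-sim, axis=1)[:, :top_k]

recs = stock_to_stock(vehicles)
# vehicles 0 and 1 are near-duplicates, so they recommend each other
```

In the old flow this matrix was exported to CSV and loaded into the .NET service; the model could only change on that export cadence.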
  7. We had to Define Goals • CarMax needs a set of tools and a platform for Data Science and ML – Model Development, Testing, and Deployment – Data Accessibility, Research, and Development – 3rd-party datasets (Acxiom, NuStar, LiveRamp, Adobe, etc.) – Scalable / Affordable Storage • Develop richer, faster-changing models (Real-Time) • Drive Enhanced Customer and Associate Experience – Omni-Channel – Key Business Areas (Marketing, Finance, Pricing, etc.) [Diagram: Model Lifecycle – Data (Raw) → Data Prep → Develop → Test → Train → Evaluate, with Governance throughout]
  8. We had to define a Data Scientist • We had to define new roles for Technology and Business – you need both types: Business: • Data Scientist Type-A (Analyst): produces meaningful insights from the data. Best suited for statisticians with engineering knowledge. Technology: • Data Scientist Type-B (Build): implements production models that interact directly with users. Best suited for engineers with statistics knowledge.
  9. Set Technology Goals and Use Case • Enable our Data Scientists: – Enable CarMax Data Scientists to more autonomously build, test, and deploy models – Leverage data of varied structures for research and on-demand, self-service machine learning – Support familiar data science tools and libraries (Python, Spark, Jupyter, etc.) and their packages/frameworks – Spend less time wrangling data • Support a Key Use Case to Prove out the Platform and Value to CarMax – 1st: Recommender System, then Bidding and Others…
  10. CarMax Technology Requirements • Centralized Hosted Data Lake Storage • “Catalog” for Managing Data Assets • Defined Ingest and Management Patterns • Performant and Easy Management of Compute – Support for Tools and Technologies, new and emerging • Support for Spark, Python, Scala, Python DS/ML Packages • Managed Platform in Cloud utilizing PaaS/SaaS Resources • Architecture to Support Batch and Real-Time Model Build/Test/Deployments – Real-Time Model Serving and A/B Testing for Data Scientists
  11. Data Lake Zones – Curation and Flow • Raw – “Landing zone” of the data lake; data is loaded in its natural state without applying transformations • Valid – Converted to a standardized file format to reduce storage and improve processing; metadata validation confirms data is in the expected format • Refined – Aggregation and/or consolidation of one or multiple valid data sets for use as input to a model; enrichment of data [Diagram: Pipelines move data through Production data lake zones – RAW → VALID → REFINED, with an ERROR zone]
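The zone flow above comes down to two steps: validation routes records into the Valid or Error zone, and refinement aggregates valid records into model input. A minimal sketch in plain Python (on the real platform each step would be a Spark pipeline writing between ADLS zones; the schema and fields here are hypothetical):

```python
# Hypothetical required schema for the metadata-validation step.
EXPECTED_FIELDS = {"stock_id", "make", "price"}

def to_valid(raw_records):
    """Raw -> Valid/Error: route each record by schema check."""
    valid, error = [], []
    for rec in raw_records:
        (valid if EXPECTED_FIELDS <= rec.keys() else error).append(rec)
    return valid, error

def to_refined(valid_records):
    """Valid -> Refined: e.g. aggregate average price per make."""
    totals = {}
    for rec in valid_records:
        n, s = totals.get(rec["make"], (0, 0.0))
        totals[rec["make"]] = (n + 1, s + rec["price"])
    return {make: s / n for make, (n, s) in totals.items()}
```

Keeping each zone transition as a pure function of the previous zone is what lets pipelines re-run a zone without touching the landing data.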
  12. DS / ML Platform Phase 1 Phase 1: 4 Months Starting in July 2018 – DONE Modernize batch Recommender: – Architecture and Solution POC (Evaluated Knime, Dataiku, Databricks, Azure ML Studio, H2O.ai) – New Daily Batch Recommender System – model refresh any time • Batch Daily, based on prior history of clicks, sales, and other relevant data sources – Pure Agile Approach w/ 2-week Sprints – Utilized Vendor partner to bring expertise in Data Lake, Spark, Azure, and Data Science – Framework for Metadata-Driven Data Ingestion • November 2018: Deployed Batch Recommender in 5 MONTHS! [Timeline: July – Requirements, Design, Vendor Candidates; August – Vendor POC Selection, Build Infrastructure; September – Framework, Ingest, Catalog, Replatform Model; October – Finalize Ingest, Model Testing and Refinement; November – Finalize Model, API, and Measurement]
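The metadata-driven ingestion framework mentioned above can be sketched as a catalog of dataset descriptions interpreted by one generic loader, instead of per-source pipeline code. A minimal illustration (catalog entries, fields, and formats are hypothetical, not CarMax's actual framework):

```python
import csv
import io
import json

# One parser per declared format; adding a source means adding catalog
# metadata, not writing a new pipeline.
READERS = {
    "json": lambda text: [json.loads(line) for line in text.splitlines()],
    "csv": lambda text: list(csv.DictReader(io.StringIO(text))),
}

# Hypothetical ingest catalog: each dataset is described by metadata.
INGEST_CATALOG = {
    "clicks": {"format": "json", "path": "raw/clicks/"},
    "sales": {"format": "csv", "path": "raw/sales/"},
}

def ingest(dataset: str, payload: str):
    """Parse a payload for a cataloged dataset using its declared format."""
    meta = INGEST_CATALOG[dataset]
    return READERS[meta["format"]](payload)
```

In the real platform the catalog would live in Azure Data Catalog and the loader in Data Factory / Databricks jobs, but the driving idea is the same: metadata in, pipeline behavior out.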
  13. DS / ML Platform Phase 2 / 3 Phase 2: 3-4 Months / Phase 3: 2 Months Real-Time Recommender Model and Architecture – Model Development, Deployment, and Testing in Real-Time – High SLA for Web/Mobile – Prove out Architecture for Hosting, Testing, and Deployment of Models – Real-Time Streaming of Input Data – Real-Time Serving of Recommender Model Request/Responses Broader Business Unit Support for other Models – Bidding, Propensity, Lead, and other models deployed in Phase 2 – User Adoption and Roll Out to Other Data Scientist Teams
  14. Wait, you did what and how? Really?
  15. CarMax Technologies Chosen on Azure Azure Kubernetes Service – Build, Test, Deploy, Monitor Models Azure Data Factory – Data Pipeline service to orchestrate and automate data movement and data transformation. Azure Data Lake Storage (ADLS) – Hadoop-compatible scalable storage for big data analytic workloads. Azure Data Catalog – Metadata service for registration and discovery of enterprise data assets. Azure Functions – “Serverless” compute service that can run code on demand without having to explicitly provision or manage infrastructure Azure Event Hubs – Data streaming and event ingestion service, capable of receiving and processing millions of events per second Databricks – Apache Spark-based analytics platform as PaaS w/ Full Support for: Python, PySpark, SparkSQL, etc. MLflow – End-to-End ML Lifecycle (easy to use) Azure ML – Open-source Python/.NET/Java SDKs for building, deploying, and monitoring Models
  16. Unified Data Processing and ML: Batch Recommender
  17. Real-Time Architecture Proposed
  18. Outcomes and Results • Batch Recommender created 10%+ more engagement with recommendations • Model / Data and Recommendations now updated Daily • We had to tune our model on Vehicle Inventory status in Real-Time (changed ingestion, not the model) • Refined model and ingest numerous times without outages or issues 18#UnifiedAnalytics #SparkAISummit
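The inventory-status tuning above, fixing the problem in ingestion rather than retraining the model, can be sketched as filtering precomputed recommendations against live inventory at serving time (all identifiers and data here are hypothetical):

```python
# Precomputed stock-to-stock recommendations from the daily batch model.
RECOMMENDATIONS = {"stock-123": ["stock-456", "stock-789"]}

# Inventory status kept fresh by real-time ingestion (e.g. Event Hubs).
INVENTORY = {"stock-123": "available", "stock-456": "sold",
             "stock-789": "available"}

def serve(stock_id: str, k: int = 2) -> list:
    """Return up to k recommendations, dropping vehicles no longer
    available; the model output itself is untouched."""
    recs = RECOMMENDATIONS.get(stock_id, [])
    return [r for r in recs if INVENTORY.get(r) == "available"][:k]
```

The design choice matters: a sold vehicle disappears from results as soon as the inventory feed updates, without waiting for the next model refresh.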
  19. Why Databricks and Spark Data Scientists • Manage Code and Python Notebooks similar to Jupyter • Model Management w/ MLflow • Move way beyond prior limitations (compute, storage, datasets) • Centralized place for everyone, from casual exploration to hardcore DS/ML Spark Development • Full Support for All Python Libraries • Deep ML (Horovod, TensorFlow, anything…) • Collaboration / Sharing of Notebooks with others • Detractor – hard to get the old “guard” up to speed on all the new things…
  20. Why Databricks and Spark – cont’d Technology • Easily Manage Scalable Compute – (no Hadoop/cluster skills) • Spark is a Go-Forward Platform – Databricks is far and away the biggest committer to the Spark project – Product reflects their knowledge and enablement of the Platform • Spark is complicated, but Databricks helps make it very easy • Easily Fits in Azure Architecture (ADLS, AAD, ADF, etc.) • Orchestration of Pipelines in Notebooks
  21. Things about me • Apple/Mac Purist • 25+ Years in Technology • Reading / Learning: – Gartner – yes, really – iPad Pro – Python, C#, Others? – Dedicate Time to learning and reading as part of the job… – Podcasts! Favorite Podcast – MPU (Mac Power Users) – YouTube – Deep Learning SIMPLIFIED, Azure Everything
  22. Questions? Connect with me on LinkedIn – search ‘Todd Dube’. WE ARE HIRING!