Deconstructing a Machine Learning Pipeline with Virtual Data Lake

Aug. 26, 2022
Alluxio Product School Webinar
August 25, 2022

For more Alluxio events: https://alluxio.io/events/

Speaker: Jingwen Ouyang

As more and more companies turn to AI / ML / DL to unlock insight, AI has become a mythical word that raises unnecessary barriers for new adopters. It is often regarded as a luxury reserved for big tech companies, but this should not be the case.

In this talk, Jingwen will first dissect the ML life cycle into five stages: data collection, data cleansing, model training, model validation, and model inference / deployment. For each stage, Jingwen will then go over its concept, functionality, characteristics, and use cases to demystify ML operations. Finally, Jingwen will showcase how Alluxio, a virtual data lake, could help simplify each stage.
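
To make the five stages concrete, here is a minimal sketch of the lifecycle as five plain Python functions. Everything in it is an illustrative assumption rather than material from the talk: the toy in-memory data, the one-parameter linear model, and the function names are all hypothetical, and a real pipeline would read from storage and use an ML framework.

```python
from statistics import mean

def collect():
    # Stage 1, data collection: hypothetical (x, y) records; None marks a missing value.
    return [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]

def preprocess(rows):
    # Stage 2, data preprocessing: drop rows with missing values (one form of cleansing).
    return [(x, y) for x, y in rows if y is not None]

def train(rows):
    # Stage 3, model training: least-squares fit of y = w * x (toy model, assumption).
    numerator = sum(x * y for x, y in rows)
    denominator = sum(x * x for x, _ in rows)
    return numerator / denominator

def evaluate(w, rows):
    # Stage 4, model evaluation: mean absolute error on held-out rows.
    return mean(abs(y - w * x) for x, y in rows)

def infer(w, x):
    # Stage 5, model inference: apply the trained model to a new data point.
    return w * x

rows = preprocess(collect())
train_rows, test_rows = rows[:-2], rows[-2:]  # simple hold-out split
w = train(train_rows)
mae = evaluate(w, test_rows)
prediction = infer(w, 7.0)
print(f"w={w:.3f} mae={mae:.3f} prediction={prediction:.3f}")
```

Each stage only consumes the output of the previous one, which is why a shared data access layer between stages is a natural place to simplify the pipeline.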


  1. 1. Deconstructing Machine Learning Pipeline and Case Study w/ Virtual Data Lake
  2. 2. Intro to AI and A Virtual Data Lake
  3. 3. AI vs ML vs DL. Model performance scales with data. AI: intelligence demonstrated by machines rather than humans or animals. ML: giving computers the skills to learn without explicit programming. DL: a subset of ML that examines algorithms that learn and improve on their own (typically a neural network with more than three layers).
  4. 4. Many Use Cases of AI. AI has a wide variety of use cases in different industries: health care, retail, automotive, manufacturing, financial services, government and defense. Image source: page 7
  5. 5. The common learning process: ML lifecycle stages. 1. Data Collection 2. Data Preprocessing 3. Model Training 4. Model Evaluation 5. Model Inference
  6. 6. Intro to Alluxio, a virtual data lake layer. Did you know? Alluxio shines in all AI lifecycle stages! • Unified namespace provides a single point of access and eliminates silos in the data lake • Server-side API translation from a standard client-side interface to any storage interface enables any compute to access any storage (portability) • Cache layer accelerates data access and offloads stress from the underlying storage
  7. 7. ML Lifecycle Stage 1: Data Collection
  8. 8. Data comes from everywhere and in different forms. Dictated by business. Source: lenovo/netapp
  9. 9. Moving Data • Data often flows from edge to core data centers / cloud for training Image source: page 10
  10. 10. Data Collection Case Study w/ Alluxio. About the company: ● The world's leading online travel service ● The eighth largest travel agency in the U.S. Before: moved PBs of data from merged subsidiaries to the parent company for analysis • Poor performance • Error-prone • High S3 egress cost • Needs synchronization. With Alluxio: no-copy solution with unified namespace • Eliminates data silos and improves manageability • Reduces S3 egress cost (50%). Read more: blog
  11. 11. ML Lifecycle Stage 2: Data Preprocessing
  12. 12. “Garbage in, garbage out.” Background: What Is Involved in Data Preprocessing. Many approaches: • Data formatting • Data cleansing ○ Missing data ○ Duplicates ○ Structural errors ○ Outliers • Data aggregation • Data sampling • Feature engineering • Handling categorical data • Feature scaling • Dimensionality reduction • Feature selection ○ Filter ○ Wrapper ○ Embedded • Feature creation. Characteristics: lots of data, very complex, compute expensive. MLEs spend most of their time on this stage; for 30%-40% of companies, the pain point is data cleansing.
  13. 13. Feature Extraction Case Study w/ Alluxio. ● “Honor of Kings”: world’s largest mobile game (MOBA) ● Highest-grossing mobile game of all time ● Upward of 80 million people play it each day (high concurrency). Architecture diagram: 1000 application pods (Spark: feature extraction), Alluxio worker pods with an Alluxio HA master, and an under file system (CephFS). Read more: blog
  14. 14. ML Lifecycle Stage 3: Model Training
  15. 15. A Lightweight Intro to Training and Data. Typical test data split: one portion for training iterations**, one portion for evaluation. ** Cross-validation is meant to cover all the data when validating the model, but for DL the iteration is sometimes too expensive, so practitioners may simply assume the data is random enough and skip it. Image source
  16. 16. Optimization Goal of Training • Infra team: GPU utilization rate (electricity = money) => Reduce IO stall • Machine learning engineer: accuracy => more data, better data, bigger model, available resources
  17. 17. Model Training Case Study w/ Alluxio. ● Video sharing (China’s YouTube) ● Almost 80 million DAU. Diagram notes: ● On restart, needs to redownload data ● No more redownload, but a single machine has limited capacity ● Distributed layer is very scalable. API labels in the diagram: S3, HDFS; POSIX; POSIX. Compute simplification and portability. Read more: blog
  18. 18. ML Lifecycle Stage 4: Model Evaluation
  19. 19. Intro to Model Evaluation • What is model evaluation? ○ A method of assessing the correctness of models on test data • Challenges ○ Methodology (statistical) ○ Data quality and quantity ○ Compute intensive. Figure: different aspects of model evaluation (data for training iterations, data for evaluation). Image source
  20. 20. Model Evaluation Case Study w/ Alluxio x Future Plan
  21. 21. ML Lifecycle Stage 5: Inference
  22. 22. Model Inference: Offline Inference vs. Online Inference. Intro: ● The process of running data points into a machine learning model to calculate an output, such as a single numerical score ● Similar data flow as training, with the same feature extractor. Characteristics of offline inference: ● In batch ● Large amount of data, so it can take advantage of big data tools like Spark ● Latency is acceptable ● Result is stored, then served. Characteristics of online inference: ● At run time, upon request ● Needs real-time result (SLA) ● Streamed data ● Interactive. Examples: ● Amazon product recommendation ● Microsoft Bing search result ● Tesla autonomous driving on the road ● Manufacturing robotic arm (QA) ● Uber Eats estimated time
  23. 23. Offline Model Inference Case Study w/ Alluxio. About the company: ● Largest vendor of computer software in the world ● Leading provider of cloud computing services, video games, computer and gaming hardware, search and other online services. “By implementing Alluxio, we are able to speed up the inference job, reduce I/O stall, and improve performance by about 18%.” • Prefetching with a scheduler into the Alluxio cache allows jobs to execute immediately without I/O stall • Alluxio provides read retry • Alluxio allows customized cache replacement policies, making the inference job more efficient. Read more: blog
  24. 24. Summary
  25. 25. (Diagram: the five ML lifecycle stages, numbered 1 to 5)
  26. 26. Alluxio as a common layer. Focus for Alluxio: data volume + data silo / need for speed • Large amount of data • Heterogeneous compute / storage systems • Heterogeneous topology (hybrid / multi-cloud + on prem) • I/O becomes the bottleneck (GPU utilization, caching). Alluxio can be in all the stages of the ML life cycle! Read more: blog
  27. 27. Thanks. Slack: slackin.alluxio.io. Website: www.alluxio.io. Social media: twitter.com/alluxio, linkedin.com/alluxio
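
Slide 15 notes that cross-validation is meant to cover all the data when validating a model. A minimal sketch of how k-fold index splitting achieves that coverage is below. The function name `k_fold_splits` is hypothetical, and the sketch assumes the samples are already in random order, which is the same "data is random enough" assumption the slide mentions.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) once per fold.

    Assumes samples are already shuffled; indices are split contiguously.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

folds = list(k_fold_splits(10, 3))
for training, validation in folds:
    print("train:", training, "validate:", validation)
```

Every sample lands in exactly one validation fold, so the model is validated against all the data over k training runs. That k-fold cost in training runs is what makes the approach expensive for deep learning, as the slide points out.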

