Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big Data & AI

- Speaker: Hervé Vũ Roussel - CEO & Co-founder @ QuodAI

- About the speaker: Hervé Vũ Roussel was previously CTO of a software company in Silicon Valley, USA. He has been, and continues to be, an advisor and mentor to many organizations, including IBM AI XPRIZE, PlatoHQ (YC'16), RMIT, AngelHack, ... He is also a regular speaker on AI and software engineering, and has advised many universities and companies on computer science and software engineering training programs. Hervé is currently CEO of Quod AI, a platform that explains source code in natural language.


In this talk, he shares his experience designing a highly scalable architecture that can handle heavy load for AI platforms, including:
- Foundational principles of software architecture
- How to choose data storage technologies
- Building asynchronous data pipelines

Published in: Technology

Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big Data & AI

  1. 1. Architectures of AI systems: Engineering for Big Data & AI. HCMC, Sep 6th 2019. Herve Roussel, herve@quod.ai
  2. 2. What is Data Engineering ?
  3. 3. Is this data engineering? UploadData.java upload_data.py
  4. 4. Is this data engineering? cat console.log | grep "ERROR" > errors.log
  5. 5. Data engineering? Event data → Program → Transformed data
  6. 6. Backend vs Data?
  7. 7. Is this data engineering? cat console.log | grep "ERROR" > errors.log. Event data → Transform → Transformed data
  8. 8. What is Big Data Engineering ?
  9. 9. Where is Big Data?
  10. 10. How to query news feed? SELECT * FROM posts INNER JOIN friends WHERE ... ORDER BY posts.timestamp DESC
  11. 11. Notify? Web, mobile? Who can see this? Racist? Vulgar? Is this a face? Who’s this? Friend? Celebrity? Courtney likes. Is that good or bad? Paddy commented. Is that good or bad? Chris posted. Is that good or bad? Anybody tagged? What rank in feed?
  12. 12. Copyright violation?
  13. 13. Is Big Data just for big companies? 300K QPS [R] 6K QPS [W] As of JULY 8, 2013 1B+ QPM [P] 250M+ QPM [R] 400M LOC [P] 1.8 TB per year [P]
  14. 14. Data Engineering: Event data → Program → Augmented data
  15. 15. Big Data Engineering + AI: Event data → Transform → Augmented data
  16. 16. Source (Event data) → Pipeline (Transform) → Sink (Augmented data)
  17. 17. What is a source ?
  18. 18. Where is data coming from? Synchronous (10-100 ms), Asynchronous (3-5 s). Main data, Event source. Why split?
  19. 19. What’s in an event data? Post { id: 12345, content: "hello world", created_at: …, updated_at: …, author_id: 67890, … } PostCreatedEvent { story_id: 12345, type: "story_posted", … }
  20. 20. Job 1 Job 2 Scheduler What’s batch processing?
  21. 21. Which DB for event source?
  22. 22. How to store events?
      ● Volume?
      ● Velocity? QPS reads? QPS writes?
      ● Latency?
      ● Cost? Storage & R/W
      ● How to write?
        ○ Integrity?
        ○ Consistency?
        ○ Durability?
        ○ Version?
      ● How to read?
        ○ Random access or sequential?
        ○ Full text search?
        ○ Geo distance?
  23. 23. How to store events?
                     MySQL   MongoDB   JSON on S3 (or GCS)
      30 GB          OK      Good      Very good
      10K WPS        OK      Good      Very good
      1K RPS         OK      Good      Very good
      Range read     OK      Good      Very good
      Cost           $$      $$$       $
      A second, two-column version of the table repeats the MySQL and MongoDB ratings and labels the fourth row "Sequential read".
  24. 24. Who wants to become architect?
  25. 25. Job 1 Job 2 Scheduler What’s the problem with batch? LATENCY
  26. 26. How to process real-time? Stream processing
  27. 27. How can 2 processes talk?
  28. 28. QUEUE Why not use database?
  29. 29. Why not database?
                        Importance   MySQL              Kafka               Redis
      10K WPS           1.0          5                  10                  10
      1K RPS            1.0          5                  10                  10
      Sequential read   1.0          10 (with B-TREE)   10                  10 (using Lists)
      Order guarantee   0.2          10                 0                   10
      Durability        0.1          10                 5 (but perf. hit)   0
      Deployability     0.5          10                 5                   7.5
      Score                          5.6 / 10           6.6 / 10            7.15 / 10
  30. 30. What is a transform ?
  31. 31. Transforms Source Sink
  32. 32. Functional vs OOP. OOP: Librarian.startShift(), Catalog.open(), Library.close(), Books.create() (operations on things; add more things). Functional: find(book), assign(book), remove(book), load_cover(book) (things with operations; add more operations).
  33. 33. Functional vs OOP. find_similar(vid_uploaded), transcribe_captions(vid_uploaded), alert_subscribers(vid_uploaded), generate_thumbnails(vid_uploaded) (things with operations; add more operations).
  34. 34. What’s supporting data? event { id: 12345, type: "story_posted", user_id: 67890, coordinates: [ 10.76, 106.66 ] } Transform. Supporting data: Friends or city DB
  35. 35. Who uses ext. supporting data?
  36. 36. API vs Pipeline: availability? Requests in thread Long running
  37. 37. API vs Pipeline: performance? 100 ms ⇓ 10 ms. 100 ms * 300,000 / 60 / 60 ≈ 8.3 h ⇓ 10 ms * 300,000 / 60 / 60 ≈ 0.83 h (50 min)
  38. 38. Where is the data coming from? Is this a face? Who’s this? Friend? Celebrity?
  39. 39. Data pipelines & AI: Transform, AI model
  40. 40. How can 2 processes talk? Transform AI model
  41. 41. What is a sink ?
  42. 42. Which DB to sink to?
  43. 43. What to do with the sink? Write Read Data scientist Sales
  44. 44. What are the read use cases? Aggregation: "Give me summary report of last month’s activity". Full text search: "Give me posts that contain the words Donald Trump, Trump or President". Bulk data, filtered: "Give me all posts by female, age 18-35".
  45. 45. ACID
  46. 46. Denormalization: good or bad?
  47. 47. What is BCNF?
  48. 48. What are distributed data systems?
  49. 49. Why re-run the pipeline? Transform, AI model, Transform v2
  50. 50. Idempotency & backfill: f(f(x)) = f(x). POST "/BankAccount/AddFunds" { value: 1000, token: TX123 }
  51. 51. Another reason for backfill?
  52. 52. What if the AI model improves? Transform, AI model v2
  53. 53. AI systems ≠ traditional systems? Deterministic vs. probabilistic (93.2%)
  54. 54. Store output of model v1 or v2? AI Model v1 (accuracy: 83.1%), AI Model v2 (accuracy: ??)
  55. 55. What have we learned ?
  56. 56. Source: Uber Engineering [DE] Collect data [DE] Process data [DS] Build DL model [BE/FE] Use DL model in app [DA] Validate DL model
  57. 57. Which NFR for Big Data? • Scalability • Availability • Interoperability • Portability • Modifiability • Maintainability • Testability • Usability • Buildability • Deployability • Ease of Development • Performance • Security • Localization • Legal • Reusability • Supportability • Monitorability
  58. 58. • Deployability • Ease of Development • Performance • Security • Localization • Legal • Reusability • Supportability • Monitorability Which NFR for Big Data? • Scalability • Availability • Interoperability • Portability • Modifiability • Maintainability • Testability • Usability • Buildability
  59. 59. What have we learned? Main data + Materialized view. Event data ⇓ Pipeline ⇓ Augmented data
  60. 60. Want to learn more about AI & Big Data? We’re hiring: ● Big Data Engineer, in training (Java) ● Big Data Engineer (Java) ● Data Scientist (Python) http://bit.ly/quod-ai-join Herve Roussel, herve@quod.ai
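
Slide 16 describes a pipeline as a source of event data, a transform, and a sink of augmented data, and slides 4 and 7 use a grep over console.log as the running example. A minimal Python sketch of that shape might look like the following; the function names (`source`, `keep_errors`, `run_pipeline`) and the sample log lines are illustrative assumptions, not taken from the talk.

```python
from typing import Callable, Iterable, Iterator, List


def source(lines: Iterable[str]) -> Iterator[str]:
    """Source: emit raw event data one record at a time."""
    for line in lines:
        yield line.rstrip("\n")


def keep_errors(events: Iterable[str]) -> Iterator[str]:
    """Transform: keep only records containing "ERROR", like grep "ERROR"."""
    return (event for event in events if "ERROR" in event)


def run_pipeline(
    events: Iterable[str],
    transform: Callable[[Iterable[str]], Iterator[str]],
    sink: List[str],
) -> List[str]:
    """Pipeline: move records from the source through the transform into the sink."""
    for record in transform(events):
        sink.append(record)
    return sink


# Hypothetical stand-in for the contents of console.log.
console_log = [
    "INFO server started\n",
    "ERROR disk full\n",
    "INFO request served\n",
    "ERROR timeout\n",
]

# The list passed as the sink plays the role of errors.log.
errors_log = run_pipeline(source(console_log), keep_errors, [])
print(errors_log)
```

Because the transform is a plain function over an iterable, swapping in a different transform or a different sink does not change the rest of the pipeline.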
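
Slides 27 and 28 ask how two processes can talk and answer with a queue. As a rough sketch only, the standard-library `queue.Queue` can stand in for the queue, with two threads standing in for the two processes; the event fields and the `STOP` sentinel are assumptions made for illustration.

```python
import queue
import threading

events: "queue.Queue[object]" = queue.Queue()
processed = []
STOP = object()  # hypothetical sentinel telling the consumer to finish


def producer() -> None:
    """Writes events to the queue without waiting for them to be processed."""
    for event_id in range(3):
        events.put({"id": event_id, "type": "story_posted"})
    events.put(STOP)


def consumer() -> None:
    """Reads events in the order they were written and processes them."""
    while True:
        event = events.get()
        if event is STOP:
            break
        processed.append(event["id"])


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(processed)
```

The producer never calls the consumer directly, so either side can be slowed down, restarted, or replaced without the other knowing.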
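
The scores on slide 29 can be reproduced from the table's importance weights and ratings. The sketch below assumes the total is the weighted sum divided by 5; that divisor is not stated on the slide and is inferred only because it yields the slide's figures of 5.6, 6.6, and 7.15.

```python
# Importance weights and 0-10 ratings copied from the slide 29 table.
WEIGHTS = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}

RATINGS = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}

# Assumption: dividing the weighted sum by 5 matches the totals on the slide.
DIVISOR = 5


def weighted_score(ratings: dict) -> float:
    """Sum of weight * rating over all criteria, scaled by the assumed divisor."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS) / DIVISOR


scores = {system: weighted_score(ratings) for system, ratings in RATINGS.items()}
for system, value in scores.items():
    print(f"{system}: {value:.2f} / 10")
```

Changing a weight, for example raising the importance of durability, is a quick way to see how sensitive the ranking is to the priorities chosen.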
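
The arithmetic on slide 37 can be worked through directly. The sketch below assumes that 300,000 is the number of sequential calls a pipeline run makes to supporting data, which the slide does not spell out; worked exactly, the formula gives about 8.3 hours at 100 ms per call and 50 minutes at 10 ms per call.

```python
def total_seconds(latency_ms: int, calls: int) -> float:
    """Total time for `calls` sequential lookups at `latency_ms` each."""
    return latency_ms * calls / 1000


# Assumption: 300,000 sequential calls, following the slide's formula.
CALLS = 300_000

slow = total_seconds(100, CALLS)  # 100 ms per call
fast = total_seconds(10, CALLS)   # 10 ms per call

print(f"100 ms per call: {slow / 3600:.1f} h")
print(f"10 ms per call: {fast / 60:.0f} min")
```

A 90 ms saving is barely noticeable on a single API request, but across a whole pipeline run it is the difference between most of a working day and under an hour.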
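
Slide 50 pairs idempotency, f(f(x)) = f(x), with an AddFunds request that carries a token. One common way to get that property, shown here as a sketch rather than as the speaker's implementation, is to remember which tokens have already been applied; the `BankAccount` class and its in-memory token set are hypothetical.

```python
class BankAccount:
    """Toy account whose add_funds call is safe to replay."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen_tokens = set()  # tokens of requests already applied

    def add_funds(self, value: int, token: str) -> int:
        """Apply the deposit once per token; replays leave the balance unchanged."""
        if token not in self._seen_tokens:
            self._seen_tokens.add(token)
            self.balance += value
        return self.balance


account = BankAccount()
request = {"value": 1000, "token": "TX123"}  # payload from the slide

account.add_funds(**request)
account.add_funds(**request)  # replay, for example during a backfill
print(account.balance)
```

With this guard, re-running a pipeline over events it has already processed does not double-count them, which is what makes a backfill safe.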
