[Coupang] Journey to the Continuous and Scalable Big Data Platform : 지속적으로 확장 가능한 빅데이터 플랫폼을 향한 여정

Here is the deck I presented last Friday at the "2019 Future Conference", hosted by Fast Campus, under the title "Journey to the Continuous and Scalable Big Data Platform". Because the conference does not focus on data engineering, I prepared the slides so that a general audience could follow them as easily as possible.

https://www.fastcampus.co.kr/2019future_tech/

Published in: Technology

  1. Coupang Confidential and Proprietary (this document is confidential and the intellectual property of Coupang)
     Journey to the Continuous and Scalable Big Data Platform
     Matthew (정재화), Coupang
  2. About me
     • Software Development Manager of the BigData & DW Platform team
     • 8+ years of Hadoop experience
     • Apache Tajo Committer and PMC member
     • blrunner78@gmail.com
     • Blog: https://blrunner.tistory.com
     • Author of a Hadoop technical handbook
  3. Agenda
     1. On-Premise
     2. Cloud 1.0
     3. Cloud 2.0
     4. Airflow as a Service
     5. Zeppelin as a Service
  4. Motivation
     "The purpose of a business is to create and keep a customer" - Peter Drucker
  5. Section 1: On-Premise
  6. Architecture (diagram)
     • Inputs: Logs (Client Logs, Server Logs), RDBMS, External Data
     • ETL Cluster: Aggregations and Joins, MapReduce, Hive/Pig/Spark, Oozie
     • Read-Only Cluster: Adhoc Query, Hive
  7. Team's Responsibility
     • Architect, build and operate our data infrastructure and tools
     • Create and maintain the company-wide data pipeline
     • Troubleshoot and resolve all user issues as they arise
  8. Areas of Improvements
     • Pros
       • A wide variety of workloads
       • Continuous increase in users
     • Cons
       • Multiple copies of data
       • Lack of elasticity
       • Operation overhead
  9. Section 2: Cloud 1.0
  10. Architecture: Decouple compute and storage (diagram)
      • Centralized Resources: Hive Meta store, Cloud Storage
      • Batch Cluster (HiveServer2): batch jobs, high throughput, fault tolerant, ETL
      • Ad-hoc Cluster (HiveServer2): ad-hoc queries, low latency, interactive analysis, in-memory
      • Domain Cluster #1 (HiveServer2), Domain Cluster #2, Domain Cluster #N
  11. Team's Responsibility
      • Architect, build and operate our data infrastructure and tools
      • Troubleshoot and resolve all user issues as they arise
      • Implement company-wide data pipelines
  12. Areas of Improvements
      • Pros
        • Allows parsing and enriching of data for custom needs
        • Independent scaling of CPU and storage capacity
      • Cons
        • Learning curve for cloud infrastructure
        • Operation overhead
        • Users want the latest tools and more features
  13. Section 3: Cloud 2.0
  14. High Level Architecture (diagram)
      • Layer labels: Storage, Data Processing, Tools, Scheduler, Security, Monitoring
      • Component labels: Cloud Storage, Computing Clusters, Zeppelin, Airflow, LDAP Authentication, Apache Ranger (ACL & Audit), Data Platform Portal
  15. Various types of Computing Clusters (diagram)
      • Centralized Resource: Hive Meta Store, Cloud Storage
      • Transient Cluster: batch jobs
      • Persistent Cluster: interactive queries
      • Workload Specific Cluster
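The split between transient, persistent, and workload-specific clusters can be read as a simple routing rule. The following is a minimal Python sketch of that idea, not the platform's actual logic: the `Workload` fields, the `needs_custom_setup` flag, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ClusterType(Enum):
    TRANSIENT = "transient"                  # created for a batch job, terminated afterwards
    PERSISTENT = "persistent"                # long-running, serves interactive queries
    WORKLOAD_SPECIFIC = "workload-specific"  # tuned for one particular workload


@dataclass(frozen=True)
class Workload:
    name: str
    interactive: bool = False
    needs_custom_setup: bool = False  # hypothetical flag, e.g. special libraries


def choose_cluster(workload: Workload) -> ClusterType:
    """Pick a cluster type for a workload (illustrative rule only)."""
    if workload.needs_custom_setup:
        return ClusterType.WORKLOAD_SPECIFIC
    if workload.interactive:
        return ClusterType.PERSISTENT
    # Default: batch jobs run on a cluster that exists only for the job.
    return ClusterType.TRANSIENT
```

Because every cluster type reads from the same cloud storage and Hive metastore, a rule like this only decides where compute runs; the data does not have to move.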
  16. Team's Responsibility
      • Architect, build and operate our data infrastructure and tools
      • Create data APIs and data services
      • Support users using SLA policies
      • Maintain security and data privacy
      • Application knowledge, support artifacts, etc.
  17. Areas of Improvements
      • Pros
        • Onboarded lots of users and a variety of jobs
        • Easier management and added features
      • Cons
        • Unintended infrastructure costs have increased
        • A wide variety of client tools and dev environments
        • Various types of users
  18. Lessons & Learnings
      • Distribute traffic instead of concentrating it in one place
      • Optimize all types of system resources in clusters
      • Enforce the lifecycle of Hadoop clusters
      • Monitor clusters and send alarms from the efficiency perspective
      • Train users continuously and build a community culture
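Two of the lessons above, enforcing a cluster lifecycle and alerting on efficiency, can be sketched as a periodic check. This is a hedged illustration in plain Python: the `ClusterStatus` fields and the `MAX_AGE` and `MIN_CPU` thresholds are hypothetical values chosen for the example, not figures from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ClusterStatus:
    name: str
    created_at: datetime
    avg_cpu_utilization: float  # fraction between 0.0 and 1.0


# Hypothetical policy values, for illustration only.
MAX_AGE = timedelta(days=7)
MIN_CPU = 0.20


def review_cluster(status: ClusterStatus, now: datetime) -> List[str]:
    """Return the findings that should trigger an alarm for this cluster."""
    findings = []
    if now - status.created_at > MAX_AGE:
        findings.append("expired")        # lifecycle: cluster outlived its allowed age
    if status.avg_cpu_utilization < MIN_CPU:
        findings.append("underutilized")  # efficiency: paid-for capacity sits idle
    return findings
```

A scheduler could run `review_cluster` over every active cluster and send an alarm to the owner whenever the returned list is not empty.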
  19. Section 4: Airflow as a Service
  20. Why we love Airflow
      • Define workflows as code
      • Makes workflows more maintainable, versionable, and testable
      • More flexible execution and workflow generation
      • Lots of features
      • Sensor
      • Workflow profiling
      • SLA alert
      • Rich web interface
      • Scalable worker processes
      • In-house Airflow
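"Workflows as code" means the dependency graph is ordinary program data that can be versioned and unit-tested. The sketch below shows the idea with the Python standard library only; it deliberately does not use the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it starts.
# Task names are made up for this example.
WORKFLOW = {
    "extract_logs": set(),
    "load_to_storage": {"extract_logs"},
    "aggregate": {"load_to_storage"},
    "publish_report": {"aggregate"},
}


def execution_order(workflow):
    """Return the tasks in an order that respects every dependency.

    Raises graphlib.CycleError if the workflow contains a cycle.
    """
    return list(TopologicalSorter(workflow).static_order())
```

Because the workflow is plain data, a test can assert that `aggregate` never runs before `load_to_storage`, and a code review can catch an accidental cycle before deployment.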
  21. Airflow: deployment process (diagram; the only surviving label is "Cloud Storage")
  22. Section 5: Zeppelin as a Service
  23. Why we love Zeppelin
      • Easy Spark development on a personal computer
      • Customized Presto interpreter
      • Run Presto queries easily without complex JDBC configuration
      • Export heavy data files to the local machine without exceptions
      • Persistent storage for notebooks
  24. Zeppelin Architecture (diagram)
  25. Areas of Improvements
      • Users
        • Loading all notebooks on the main page is too slow
        • A big notebook can consume most resources, leaving Zeppelin pending
      • Platform team
        • The Spark interpreter doesn’t support YARN cluster mode
        • No lifecycle support for notebooks
        • Difficult to upgrade and improve existing Zeppelin instances gracefully
  26. Resolution
      • Upgrade Zeppelin to 0.8.1
      • Main page improvements
      • YARN cluster mode for the Spark interpreter
      • Interpreter lifecycle manager
      • Interpreter recovery
      • Containerized Zeppelin on Kubernetes
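An interpreter lifecycle manager addresses the resource problem from slide 25 by shutting down interpreters that have been idle for too long. The following plain-Python sketch illustrates that idle-timeout idea only; it is not Zeppelin's implementation, and the class name, method names, and interpreter ids are hypothetical.

```python
import time
from typing import Dict, List, Optional


class IdleTimeoutManager:
    """Tracks last-use times and reports interpreters idle past a timeout."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_used: Dict[str, float] = {}

    def touch(self, interpreter_id: str, now: Optional[float] = None) -> None:
        """Record that an interpreter was just used."""
        self._last_used[interpreter_id] = time.monotonic() if now is None else now

    def reap(self, now: Optional[float] = None) -> List[str]:
        """Forget and return the ids of interpreters idle longer than the timeout."""
        current = time.monotonic() if now is None else now
        expired = sorted(
            interpreter_id
            for interpreter_id, last in self._last_used.items()
            if current - last > self.timeout_seconds
        )
        for interpreter_id in expired:
            del self._last_used[interpreter_id]
        return expired
```

A caller would invoke `touch` whenever a paragraph runs and call `reap` on a timer, stopping the process behind each returned id so that one forgotten notebook cannot hold cluster resources indefinitely.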
  27. Summary
      • Understand who the immediate customer is
      • Focus on the truly important things
      • Detect and solve problems immediately
      • Leverage the identity of infrastructure
      • Best practice is not best for you
  28. SELECT question FROM you
      https://boards.greenhouse.io/coupang/
  29. Thank you