Data(?)Ops with CircleCI
CircleCI Korea User Group 2nd Meetup
김진웅
About Me
김진웅 @ddiiwoong
Cloud Engineer @SK C&C
Interested in Kubernetes and Serverless(FaaS), Dev(Data)Ops, SRE, ML/DL
Today
• Data Lake, DataOps
• AWS Native CI/CD
• Why CircleCI?
• DataOps with CircleCI
• Summary
Big Data
https://blogs.gartner.com/doug-laney/big-datas-10-biggest-vision-and-strategy-questions/
https://kr.cloudera.com/products/open-source/apache-hadoop.html
Data Lake
• Centralized repository that allows you to store all your structured and
unstructured data at any scale.
• Run different types of analytics, from dashboards and visualizations to big
data processing, real-time analytics, and machine learning, to guide better decisions.
https://aws.amazon.com/ko/big-data/datalakes-and-analytics/what-is-a-data-lake/
Data Lake Management
Reference - https://www.samsungsds.com/global/ko/support/insights/data_lake.html
What is DataOps
@Wikipedia
DataOps is an automated, process-oriented methodology, used by analytic
and data teams, to improve the quality and reduce the cycle time of data
analytics.
@The DataOps Manifesto
Whether referred to as Data Science, Data Engineering, Data Management, Big Data,
Business Intelligence, or the like, through our work we have come to value in
analytics
@My point of view
People with a data-centric mindset coming together to get work done
Dev + Ops + Data Engineer + Data Scientist
DataOps Principles
https://www.dataopsmanifesto.org/dataops-manifesto.html
1. Continually satisfy your customer
2. Value working analytics
3. Embrace change
4. It's a team sport - embrace diverse roles, skills, and tools
5. Daily interactions - collaborate every day
6. Self-organize
7. Reduce heroism
8. Reflect
9. Analytics is code
10. Orchestrate - tie the pieces together
11. Make it reproducible
12. Disposable environments - minimize cost
13. Simplicity
14. Analytics is manufacturing
15. Quality is paramount
16. Monitor quality and performance
17. Reuse
18. Improve cycle times
DataOps Principles
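• Start from the Agile methodology
• Continuously deliver analytic insights to satisfy internal and external customers
• Measure analytic performance, pursue change, and keep up with changing
customer requirements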
DataOps Organization
• Self-organize around goals
• No heroes; sustainable, scalable, process-oriented
• Requires full command of data, tools, code, and environments
• Reproducible results -> analytics pipeline
• Cross-functional team
• Includes Dev, Architect, Ops, Data Science, and Data Engineer roles
• Developers, operators, and data specialists (three-way collaboration)
Our Project
Goal
• No-Ops : Remove existing management (Serverless)
• GitOps : All infrastructure, code, and scripts are managed in an immutable state
• Automation : Communications, Approvals, SRs, Issues
Requirements
• Key Management
  • IAM Role
  • Access/SecretKey
• Code Repository
  • GitHub, Bitbucket (Public Access) Account
  • CodeCommit
• CI
  • CircleCI (GitHub Auth)
  • CodeBuild
• Container Registry
  • AWS ECR (Elastic Container Registry)
  • Docker Hub
• CD
  • Terraform, CircleCI
  • CloudFormation, CodeDeploy
• Notification
  • Approval, SR, Issue, Collaboration
AWS Native CI/CD for Web Service
AWS Native CI/CD for Data Preparation
AWS Native CI/CD for Data Ingest
Code Repository
GitHub
• Rich third-party ecosystem
• Private repositories with collaboration features
• GitOps
Container Registry
ECR
• Fully Managed
• Security (IAM) integration
• CircleCI Orbs available
• Easy integration with EKS, ECS, and Batch
CI/CD
Terraform
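• CloudFormation : a service used as a backend rather than a tool you work with directly
• Hard to maintain, reuse, and modularize
• Widely used as a declarative infrastructure management tool
• Manages options and settings (State), supports reuse, and lets Dev and Ops review changes together
• Easy management of VPC, Security Group, and IAM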
CI/CD
CircleCI
• Fully Managed (Serverless)
• Caching, Debugging, Context
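• Minimizes dependency on AWS
• Leaves room to extend the Git, registry, and CD layers
• Minimizes AWS Console access
• Easy, simple, and fast build environment setup
• Small projects can get started quickly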
Portal Development Environment
Terraform Pipeline
Job flow
• Infrastructure changes and IAM account changes
• terraform apply runs only on the master branch
• Uses CircleCI version 2.1 features
• slack notification
• executors
• Save terraform plan output : persist_to_workspace
• terraform apply : attach_workspace
Checkout -> Lint -> Plan -> Master? -> Approval -> Apply
https://github.com/ddiiwoong/ecs-tf-template/blob/master/.circleci/config.yml
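The gating in this job flow (plan on every branch, approval and apply only on master) can be sketched as a small Python model. This is a hypothetical illustration, not the config in the linked repository: the job names are invented for the example, and only the keys `requires`, `type`, and `filters` are borrowed from CircleCI 2.1 workflow syntax.

```python
# Hypothetical workflow model; job names are illustrative only.
# The keys "requires", "type", and "filters" mirror CircleCI 2.1 workflow syntax.
WORKFLOW = [
    {"name": "checkout"},
    {"name": "lint", "requires": ["checkout"]},
    {"name": "plan", "requires": ["lint"]},
    {"name": "approval", "type": "approval", "requires": ["plan"],
     "filters": {"branches": {"only": ["master"]}}},
    {"name": "apply", "requires": ["approval"],
     "filters": {"branches": {"only": ["master"]}}},
]


def jobs_for_branch(workflow, branch):
    """Return the names of the jobs that would run for `branch`, in order."""
    selected = []
    for job in workflow:
        only = job.get("filters", {}).get("branches", {}).get("only")
        if only is not None and branch not in only:
            continue  # filtered out by the branch rule
        if any(req not in selected for req in job.get("requires", [])):
            continue  # a prerequisite was filtered out, so this job cannot run
        selected.append(job["name"])
    return selected


print(jobs_for_branch(WORKFLOW, "master"))
print(jobs_for_branch(WORKFLOW, "feature/x"))
```

In this model a feature branch stops after `plan`, so a reviewer can inspect the plan output before anything is applied from master.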
Batch Code Pipeline
Job flow
• Script update pipeline for the crawling Batch job (AWS ECS)
• Docker build, then push the script image to AWS ECR (registry)
• Update the AWS Batch job definition (image change)
Reference : https://ddii.dev/devops/circleci-ecs/
https://github.com/ddiiwoong/batch-cicd-demo/blob/master/.circleci/config.yml
Pipeline stages: Checkout, Docker Build & Push, S3 Upload, Slack Noti., Approval, Deploy Batch
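The "image change" step can be sketched in Python as building the new ECR image URI and swapping it into a job definition payload. This is a minimal sketch under assumptions: the account ID, region, repository, job name, and resource values are all hypothetical, and the payload only uses fields that the AWS Batch job definition format is known to accept (`jobDefinitionName`, `type`, `containerProperties`).

```python
import copy


def ecr_image_uri(account_id, region, repository, tag):
    """Build an ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def with_new_image(job_definition, image_uri):
    """Return a copy of a Batch job definition payload that points at image_uri."""
    updated = copy.deepcopy(job_definition)
    updated["containerProperties"]["image"] = image_uri
    return updated


# Hypothetical values, for illustration only.
current = {
    "jobDefinitionName": "crawler-job",
    "type": "container",
    "containerProperties": {
        "image": "123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/crawler:old",
        "vcpus": 1,
        "memory": 512,
        "command": ["python", "crawl.py"],
    },
}

new_uri = ecr_image_uri("123456789012", "ap-northeast-2", "crawler", "abc1234")
payload = with_new_image(current, new_uri)
print(payload["containerProperties"]["image"])
```

A deploy job could then hand `payload` to boto3's Batch client via `register_job_definition(**payload)`, which should register a new revision under the same name. That call is left out here so the sketch runs without AWS credentials.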
Portal Development Environment Pipeline
Job flow
• Build pipeline for the service application
• Docker build, then push the image to AWS ECR (registry)
• Image archive and caching
• Build/deploy only for specific tags or branches
• Landing page is hosted on S3 (S3 upload)
Reference : https://yunsangjun.github.io/blog/cicd/2019/07/03/circleci.html
https://github.com/ddiiwoong/circleci-demo/blob/master/.circleci/config.yml
Pipeline stages: Checkout, Build & Test, Archive & Caching, Image Push, ECS update
Static Hosting stages: Checkout, S3(Dev) Sync, Approval, S3(Prod) Sync, ECS update
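The "build/deploy only for specific tags or branches" rule can be sketched in Python as a pair of pattern lists. The branch name and the version-tag pattern below are assumptions chosen for the example, not values from the linked config. The sketch treats a pattern as matching only when it covers the whole name, which is how CircleCI filter regular expressions are understood to behave.

```python
import re

# Hypothetical filter rules; the branch name and tag pattern are examples only.
DEPLOY_BRANCHES = [r"master"]
DEPLOY_TAGS = [r"v\d+\.\d+\.\d+"]


def matches(patterns, name):
    """True if `name` fully matches any pattern (the whole string must match)."""
    return any(re.fullmatch(p, name) for p in patterns)


def should_deploy(branch=None, tag=None):
    """Decide whether a build triggered by a branch or tag push should deploy."""
    if tag is not None:
        return matches(DEPLOY_TAGS, tag)
    if branch is not None:
        return matches(DEPLOY_BRANCHES, branch)
    return False


print(should_deploy(branch="master"))
print(should_deploy(tag="v1.2"))
```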
In Progress
Glue Job Code Update
• Python/Scala code updates for ETL/ELT job processing
• Scripts Location, Filename Update
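One way to picture the script location update is as a change to the `Command.ScriptLocation` field of a Glue job. The Python sketch below only builds the update payload. The role ARN, bucket, and file names are hypothetical, and since this work is still in progress it should be read as an assumption about the approach rather than the project's implementation.

```python
import copy


def with_new_script(job_update, bucket, key):
    """Return a copy of a Glue JobUpdate payload whose script points at s3://bucket/key."""
    updated = copy.deepcopy(job_update)
    updated["Command"]["ScriptLocation"] = f"s3://{bucket}/{key}"
    return updated


# Hypothetical job settings, for illustration only.
current = {
    "Role": "arn:aws:iam::123456789012:role/glue-etl-role",
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://example-scripts/etl/job_v1.py",
    },
}

job_update = with_new_script(current, "example-scripts", "etl/job_v2.py")
print(job_update["Command"]["ScriptLocation"])
```

A pipeline step could pass the result to boto3's Glue client as `update_job(JobName=..., JobUpdate=job_update)` after uploading the new script to S3.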
Lambda Code Update
• https://github.com/ddiiwoong/serverless-example-monorepo-with-circleci/blob/master/.circleci/config.yml
EMR, SageMaker provisioning configuration
• Bootstrap, Lifecycle Scripts
• Shutdown-actions
Preparing for multi-cloud
• Azure, IBM Data Pipeline Integration
• Crawler setup for data migration
DevSecOps
• Scan images, Secrets Management
• https://circleci.com/integrations/devsecops/
Summary
Challenge
• Dev buy-in - Easy!
• Information security buy-in - Hard!!
• Secret Environment, Context (DevSecOps)
• https://circleci.com/blog/protect-secrets-with-restricted-contexts/
• Ops (TA) buy-in - Very Hard!!!
• CloudFormation vs Terraform
• S3 vs Git
• Role, Policy
Remember
• executor
• caching and workspaces (persist_to_workspace)
• ECR immutable image tags
• https://aws.amazon.com/ko/about-aws/whats-new/2019/07/amazon-ecr-now-supports-immutable-image-tags/
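With immutable image tags enabled, pushing an image under a tag that already exists in the repository is rejected, so each build needs its own tag. A minimal Python sketch of one possible scheme follows. The tag format (short commit SHA plus build number) is an assumption made for this example. `CIRCLE_SHA1` and `CIRCLE_BUILD_NUM` are CircleCI built-in environment variables.

```python
import os


def image_tag(env=None):
    """Derive a per-build image tag from CircleCI's built-in variables."""
    env = os.environ if env is None else env
    sha = env.get("CIRCLE_SHA1", "")
    build_num = env.get("CIRCLE_BUILD_NUM", "")
    if not sha or not build_num:
        raise ValueError("CIRCLE_SHA1 and CIRCLE_BUILD_NUM must be set")
    # Short SHA identifies the commit; the build number keeps reruns unique.
    return f"{sha[:7]}-{build_num}"


# Example with explicit values instead of the real environment.
print(image_tag({"CIRCLE_SHA1": "0123456789abcdef", "CIRCLE_BUILD_NUM": "42"}))
```

Including the build number means a rerun of the same commit does not collide with the tag pushed by the earlier run.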
Q&A
@ddiiwoong
@ddiiwoong
ddiiwoong@gmail.com
https://ddii.dev
