A Startup's Solo Data Engineer: Building a Data Analysis Environment - 천지은 (Tappytoon) :: AWS Community Day Online 2021


AWSKRUG - AWS Korea User Group

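This talk introduces a data analysis pipeline built from the ground up with no prior experience. It covers projects that may help startups getting started with data analysis, along with data analysis architectures built on AWS.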


A Startup's
Solo Data Engineer:
Building a Data Analysis Environment
천지은
Data Engineer
Tappytoon

Data Engineer @ Tappytoon
genie@tappytoon.com
linkedin.com/in/g471000/
천 지은 (Genie)

Project List
- Server Data Analysis Pipeline
- Building a DW (Redshift) with DMS
- Analyzing multiple resources with Athena Federated Query
- Appsflyer Data Analysis Pipeline
- Building a data lake (S3) with Kinesis
- Partitioning and Parquet conversion with Athena and Lambda

Business Intelligence Tool

Visualization with QuickSight
- Meets the need to see trends
- AWS infrastructure that is reliable for data security
- Low cost
- Easy to connect
- 24/7 Amazon Support
Considerations:
- QuickSight's learning curve
- Delays caused by RDS overload
- Dataset management

Redshift (with DMS)
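- Fast analysis through parallel processing
- Low learning curve (based on PostgreSQL)
- Unnecessary data excluded
- 24/7 Support
Considerations
- Other DW or RDS options (choose what fits the situation)
- Server load caused by using the binlog
- Data delay
- and...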

Redshift Lob vs RDS Text
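- Re-load cost on DMS errors (expensive)
- Unnecessary Text columns are excluded
- Tables that need a Text field are migrated in a separate Task
- The two Tasks use different settings
- If the Lob-Task fails, only the tables containing Text are reprocessed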
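
The two-Task split above can be sketched as a pair of DMS table mappings and task settings. This is a minimal illustration, not the configuration used in the talk: the schema name `app`, the table names `comments` and `posts`, the column name `body`, and the LOB size are all hypothetical placeholders. The keys follow the DMS table-mapping and task-settings JSON formats.

```python
import json

SCHEMA = "app"              # hypothetical source schema
TEXT_TABLES = ["comments"]  # hypothetical tables whose Text column is needed


def selection_rule(rule_id, name, table, action):
    """Build one DMS selection rule that includes or excludes a table."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": name,
        "object-locator": {"schema-name": SCHEMA, "table-name": table},
        "rule-action": action,
    }


def drop_column_rule(rule_id, table, column):
    """Build one DMS transformation rule that removes an unneeded column."""
    return {
        "rule-type": "transformation",
        "rule-id": str(rule_id),
        "rule-name": f"drop-{table}-{column}",
        "rule-target": "column",
        "object-locator": {
            "schema-name": SCHEMA,
            "table-name": table,
            "column-name": column,
        },
        "rule-action": "remove-column",
    }


def main_task_mappings():
    """Main Task: all tables except the Text tables, minus one unneeded Text column."""
    rules = [selection_rule(1, "include-all", "%", "include")]
    for offset, table in enumerate(TEXT_TABLES):
        rules.append(selection_rule(2 + offset, f"exclude-{table}", table, "exclude"))
    rules.append(drop_column_rule(100, "posts", "body"))  # hypothetical table/column
    return {"rules": rules}


def lob_task_mappings():
    """Lob-Task: only the tables whose Text column must be migrated."""
    rules = [
        selection_rule(1 + offset, f"include-{table}", table, "include")
        for offset, table in enumerate(TEXT_TABLES)
    ]
    return {"rules": rules}


# The two Tasks get different settings: LOB support is off for the main Task
# and limited-size LOB mode is on for the Lob-Task.
MAIN_TASK_SETTINGS = {"TargetMetadata": {"SupportLobs": False}}
LOB_TASK_SETTINGS = {
    "TargetMetadata": {
        "SupportLobs": True,
        "FullLobMode": False,
        "LimitedSizeLobMode": True,
        "LobMaxSize": 64,  # in KB; placeholder value
    }
}

if __name__ == "__main__":
    print(json.dumps(lob_task_mappings(), indent=2))
```

With this layout, a failure in the Lob-Task only requires reloading the tables listed in `TEXT_TABLES`, which is the point the slide makes.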

Aurora MySQL Migration
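- Existing infrastructure modules reused almost as-is
- Only the Endpoint is created separately
- Settings can also be used without changes
Considerations
- Additional cost from setting up a Slave Cluster
- Possible performance degradation from using the binlog
- Redshift data delay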

Athena Federated Query
- Data in Redshift and data in RDS can be JOINed
- Only SELECT from the sources; run the analysis query in Athena
- Query results can be connected to QuickSight for viewing
Considerations
- Additional cost ($5/TB)
- Lambda management
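
A federated JOIN of this kind, and the $5/TB cost, can be sketched as follows. The catalog names, schemas, tables, and columns are hypothetical, and treating 1 TB as 1024**4 bytes is an assumption made for the arithmetic only.

```python
PRICE_PER_TB_USD = 5.0      # the $5/TB figure from the slide
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def federated_join_query(redshift_catalog, rds_catalog):
    """Return an Athena SQL string joining a Redshift table with an RDS table.

    Each catalog is assumed to be an Athena data source backed by a
    connector Lambda; all schema, table, and column names are illustrative.
    """
    return (
        "SELECT u.country, COUNT(*) AS purchases\n"
        f'FROM "{redshift_catalog}"."public"."purchases" AS p\n'
        f'JOIN "{rds_catalog}"."app"."users" AS u\n'
        "  ON p.user_id = u.id\n"
        "GROUP BY u.country"
    )


def estimated_cost_usd(bytes_scanned):
    """Estimate the query cost from the number of bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    print(federated_join_query("redshift_dw", "rds_mysql"))
    print(estimated_cost_usd(200 * 1024 ** 3))  # cost of scanning 200 GiB
```

Selecting only the needed columns from each source keeps the scanned bytes, and so the cost, down.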

EC2 Server
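- Server created on EC2
- Appsflyer data loaded into MongoDB
- Real-time dashboard built
Problems
- Monitoring and maintenance are difficult
- Not suited to analysis
- Visualization is difficult
- Instance overload caused by queries
- Dashboard updates are difficult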

Kinesis to DataLake
- Improved reliability using a Gateway
- Streams separated to minimize operating cost
- First-pass parsing with Lambda
- Loaded into the data lake
- Integrated with Athena & QuickSight
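
The first-pass parsing step can be sketched as a Lambda handler for a Kinesis event. This is an illustration under assumptions: the payloads are taken to be JSON objects, malformed records are simply dropped, and the write to the data lake is left out. The event shape (`Records[].kinesis.data`, base64-encoded) is the standard Kinesis-to-Lambda format.

```python
import base64
import json


def parse_record(record):
    """Decode one Kinesis record; return the JSON object, or None if malformed."""
    try:
        raw = base64.b64decode(record["kinesis"]["data"])
        event = json.loads(raw)
    except (KeyError, ValueError):
        # ValueError covers bad base64, bad UTF-8, and invalid JSON.
        return None
    return event if isinstance(event, dict) else None


def handler(event, context):
    """Parse every record and return newline-delimited JSON ready for storage."""
    parsed = []
    for record in event.get("Records", []):
        item = parse_record(record)
        if item is not None:
            parsed.append(item)
    body = "\n".join(json.dumps(item, sort_keys=True) for item in parsed)
    # Writing `body` to S3 is omitted here; the bucket and key layout
    # would be deployment-specific.
    return {"count": len(parsed), "body": body}
```

Returning newline-delimited JSON is a design choice for this sketch, picked because Athena can read that layout directly.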

Parquet and partitioning
- Hourly job triggered by Cloudwatch Event Bridge
- Partitioned by platform (iOS, Android), year, month, day, hour
- Merged into Parquet format
- Integrated with Athena & QuickSight
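
The partition layout above can be sketched as an S3 prefix plus the matching Athena `ALTER TABLE ... ADD PARTITION` statement. The table name, bucket name, and lowercase platform values are hypothetical, and the Hive-style `key=value` prefix is an assumed convention rather than the layout confirmed in the talk.

```python
from datetime import datetime, timezone

PLATFORMS = ("ios", "android")  # assumed lowercase spellings


def partition_prefix(platform, ts):
    """Return the Hive-style S3 prefix for one platform and hour."""
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform}")
    return (
        f"platform={platform}/year={ts:%Y}/month={ts:%m}/"
        f"day={ts:%d}/hour={ts:%H}/"
    )


def add_partition_sql(table, bucket, platform, ts):
    """Return the Athena DDL that registers one hourly partition."""
    location = f"s3://{bucket}/{partition_prefix(platform, ts)}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (platform='{platform}', year='{ts:%Y}', month='{ts:%m}', "
        f"day='{ts:%d}', hour='{ts:%H}') "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    hour = datetime(2021, 10, 5, 7, tzinfo=timezone.utc)
    print(add_partition_sql("appsflyer_events", "example-datalake", "ios", hour))
```

An hourly job would call `add_partition_sql` once per platform for the hour just completed, after the merged Parquet file has been written under that prefix.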

AWS Korea PaceMaker
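- Customers having difficulty building Search, Recommendation System, Data Analysis System, AI/ML Services, and similar systems
- Customers who want to expand and improve data-driven services
- Customers short on Data Engineer, Scientist, Analyst, or AI/ML Engineer teams
- Customers who want to complete a data project within 2 to 3 months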
김성민, AWS Solutions Architect, sungmk@amazon.com
박진우, AWS Solutions Architect, jinwoop@amazon.com

Next….
- Event Data Logging design and build
- Other 3rd party Data Pipeline
- Daily SNS metrics system
- DB Migrations
- Building a recommendation system
- Building a real-time analysis system
- Setting up a new BI tool
See the job postings
Hiring inquiries 👉 recruit@tappytoon.com

Thank you
천 지은 (Genie)
Data Engineer @ Tappytoon
genie@tappytoon.com
linkedin.com/in/g471000/

A Startup's Lone Data Engineer: Building a Data Analysis Environment - 천지은 (Tappytoon) :: AWS Community Day Online 2021

1. SEOUL 2021

2. Startup, Lone Data Engineer: Building a Data Analysis Environment. 천지은, Data Engineer, Tappytoon

3. Data Engineer @ Tappytoon genie@tappytoon.com linkedin.com/in/g471000/ 천 지은 (Genie)

8. Project List
- Server Data Analysis Pipeline
- Building a DW (Redshift) with DMS
- Analyzing multiple resources with Athena Federated Query
- Appsflyer Data Analysis Pipeline
- Building a DataLake (S3) with Kinesis
- Partitioning and Parquet conversion with Athena and Lambda

9. Server Data Analysis Pipeline

10. Business Intelligence Tool

11. Visualization with Quicksight
- Meets the need to see trends
- AWS infrastructure that is reliable for data security
- Low cost
- Easy to connect
- 24/7 Amazon Support
Considerations:
- Quicksight's learning curve
- Delays caused by RDS overload
- Dataset management

12. Quicksight Learning Curve

13. Redshift (with DMS)
- Fast analysis through parallel processing
- Low learning curve (PostgreSQL-based)
- Unnecessary data excluded
- 24/7 Support
Considerations:
- Other DW or RDS options (choose what fits the situation)
- Server load from binlog usage
- Data delay
- and...

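14. Redshift Lob vs RDS Text
- Re-load cost on DMS errors & high cost
- Unnecessary Text columns are excluded
- Tables that need Text fields are migrated in a separate Task
- The two Tasks use different settings
- If the Lob Task fails, only the tables containing Text are reprocessed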
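
The two-Task split on slide 14 could be sketched roughly as below. This is a minimal illustration, not the speaker's actual configuration: the schema, table, and column names (`service`, `users`, `orders`, `memo`, `comments`) and the `LobMaxSize` value are assumptions. The rule and setting keys follow the AWS DMS table-mapping and task-settings JSON formats.

```python
import json


def table_mapping(schema, tables, drop_columns=None):
    """Build a DMS table-mapping document.

    tables: table names to include in the task.
    drop_columns: (table, column) pairs to exclude from migration.
    """
    rules = []
    rule_id = 1
    for table in tables:
        rules.append({
            "rule-type": "selection",
            "rule-id": str(rule_id),
            "rule-name": f"include-{table}",
            "object-locator": {"schema-name": schema, "table-name": table},
            "rule-action": "include",
        })
        rule_id += 1
    for table, column in (drop_columns or []):
        rules.append({
            "rule-type": "transformation",
            "rule-id": str(rule_id),
            "rule-name": f"drop-{table}-{column}",
            "rule-target": "column",
            "object-locator": {
                "schema-name": schema,
                "table-name": table,
                "column-name": column,
            },
            "rule-action": "remove-column",
        })
        rule_id += 1
    return {"rules": rules}


# Main Task: no LOB handling, unneeded Text column dropped (hypothetical names).
main_mapping = table_mapping("service", ["users", "orders"],
                             drop_columns=[("orders", "memo")])
main_settings = {"TargetMetadata": {"SupportLobs": False}}

# Lob Task: only the tables whose Text fields are needed, with limited LOB mode.
lob_mapping = table_mapping("service", ["comments"])
lob_settings = {"TargetMetadata": {
    "SupportLobs": True,
    "FullLobMode": False,
    "LimitedSizeLobMode": True,
    "LobMaxSize": 64,  # KB; assumed value, tune to the largest Text field
}}

print(json.dumps(lob_mapping, indent=2))
```

Keeping the Text tables in their own Task means a failure there only forces a reload of those tables, which matches the last bullet of the slide.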

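15. Aurora MySQL Migration
- Existing infrastructure modules reused almost as-is
- Only a separate Endpoint created
- Settings usable without changes
Considerations:
- Additional cost from the Slave Cluster setup
- Possible performance degradation from binlog usage
- Redshift data delay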

16. Athena Federated Query
- Data in Redshift and data in RDS can be JOINed
- Only the SELECT goes to the sources; the analysis query runs in Athena
- Query results can be linked to Quicksight for viewing
Considerations:
- Additional cost ($5/TB)
- Lambda management
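
A federated JOIN like the one on slide 16 might look roughly like this. The data source names (`redshift_dw`, `rds_mysql`), schemas, tables, and columns are all hypothetical, and the query is only an illustration of the pattern. `run_query` uses the boto3 Athena client and assumes AWS credentials and an S3 output location are available.

```python
def build_federated_join(redshift_catalog, rds_catalog):
    """Return an Athena SQL string joining a Redshift table with an RDS table.

    Each catalog name refers to an Athena data source backed by a
    connector Lambda; table and column names here are made up.
    """
    return f"""
SELECT u.country, COUNT(*) AS purchases
FROM "{redshift_catalog}"."public"."purchases" AS p
JOIN "{rds_catalog}"."service"."users" AS u
  ON p.user_id = u.id
GROUP BY u.country
""".strip()


def run_query(sql, output_location):
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported lazily so the SQL builder works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


sql = build_federated_join("redshift_dw", "rds_mysql")
print(sql)
```

The connectors fetch rows from each source, and the join and aggregation run in Athena, so the sources mainly serve plain SELECTs.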

17. Server Data Analysis Pipeline

18. Appsflyer Data Analysis Pipeline

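19. EC2 Server
- Server created on EC2
- Appsflyer data loaded into MongoDB
- Real-time dashboard built
Problems:
- Hard to monitor and maintain
- Unsuitable for analysis
- Hard to visualize
- Instance overload from queries
- Hard to update dashboards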

20. Kinesis to DataLake
- Improved reliability using a Gateway
- Streams separated to minimize operating cost
- First-pass parsing with Lambda
- Loaded into the Datalake
- Integrated with Athena & Quicksight
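
The "first-pass parsing with Lambda" step on slide 20 could be sketched as a handler over a Kinesis event batch. The record envelope (`Records[].kinesis.data`, base64-encoded) is the standard Kinesis-to-Lambda event shape. The payload fields (`platform`, `event_name`, `event_time`) are assumptions, not the actual Appsflyer schema. Writing the rows to S3 is left out.

```python
import base64
import json


def parse_record(raw):
    """Decode one payload and keep only the fields needed downstream.

    Returns None for payloads that are not JSON objects, so one bad
    record does not fail the whole batch.
    """
    try:
        event = json.loads(raw)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(event, dict):
        return None
    return {
        "platform": str(event.get("platform", "unknown")).lower(),
        "event_name": event.get("event_name"),
        "event_time": event.get("event_time"),
    }


def handler(event, context):
    """Lambda entry point: parse every Kinesis record in the batch."""
    rows = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        row = parse_record(raw)
        if row is not None:
            rows.append(row)
    return rows


def _encode(payload):
    """Helper for the local example: wrap bytes like a Kinesis record."""
    return {"kinesis": {"data": base64.b64encode(payload).decode("ascii")}}


sample_event = {"Records": [
    _encode(json.dumps({"platform": "iOS", "event_name": "install",
                        "event_time": "2021-10-05 07:15:00"}).encode()),
    _encode(b"not json"),
]}
print(handler(sample_event, None))
```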

21. Parquet and partitioning
- Hourly job via Cloudwatch Event Bridge
- Partitioned by platform (iOS, Android), year, month, day, hour
- Merged into parquet format
- Integrated with Athena & Quicksight
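
The partition layout on slide 21 can be illustrated with a small helper that derives a Hive-style S3 prefix and the matching Athena `ALTER TABLE ... ADD PARTITION` statement for one hour of data. The bucket, table name, and base prefix are hypothetical, and string-typed partition columns are an assumption. The Parquet merge itself is not shown.

```python
from datetime import datetime, timezone


def partition_prefix(platform, ts, base="events"):
    """Return the Hive-style key prefix for one platform and hour."""
    return (
        f"{base}/platform={platform}/year={ts:%Y}/month={ts:%m}"
        f"/day={ts:%d}/hour={ts:%H}/"
    )


def add_partition_sql(table, platform, ts, bucket):
    """Return the Athena DDL that registers the hour's partition."""
    location = f"s3://{bucket}/{partition_prefix(platform, ts)}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (platform='{platform}', year='{ts:%Y}', "
        f"month='{ts:%m}', day='{ts:%d}', hour='{ts:%H}') "
        f"LOCATION '{location}'"
    )


hour = datetime(2021, 10, 5, 7, tzinfo=timezone.utc)
for platform in ("ios", "android"):
    print(add_partition_sql("appsflyer_events", platform, hour,
                            "example-datalake"))
```

An hourly scheduled job could call this once per platform for the previous hour, so queries that filter on these columns scan only the matching prefixes.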

26. AWS Korea PaceMaker
- Customers struggling to build Search, Recommendation System, Data Analysis System, AI/ML Services, and similar systems
- Customers who want to expand and improve data-driven services
- Customers short on Data Engineer, Scientist, Analyst, or AI/ML Engineer teams
- Customers who want to complete a data project within 2 to 3 months
김성민, AWS Solutions Architect, sungmk@amazon.com
박진우, AWS Solutions Architect, jinwoop@amazon.com

27. View job postings. Hiring inquiries 👉 recruit@tappytoon.com

28. Next...
- Event Data Logging design and build
- Other 3rd party Data Pipeline
- Daily SNS metrics system
- DB Migrations
- Building a Recommendation system
- Building a real-time analysis system
- Building a new BI tool
View job postings. Hiring inquiries 👉 recruit@tappytoon.com

29. Thank you 천 지은 (Genie) Data Engineer @ Tappytoon genie@tappytoon.com linkedin.com/in/g471000/
