- SmartNews uses stream processing to deliver news quickly as the lifetime of news articles is very short. Kinesis Streams play an important role in processing user activity streams and metrics in near real-time.
- Data is ingested using Kinesis Producer and Consumer Libraries and processed using Spark Streaming to generate metrics for ranking articles. Metrics are stored in DynamoDB.
- An ETL workflow is used to transform log data and perform machine learning tasks to cluster users. PipelineDB is also used for real-time analytics on streams.
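The metric-generation step described above can be illustrated with a small sketch. It is plain Python rather than Spark Streaming, and it is not SmartNews' actual code: the event fields (`timestamp`, `article_id`, `action`), the 60-second window, and the function names are all assumptions made for illustration. It only shows the general idea of grouping activity events into tumbling windows and counting views and clicks per article.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size, not taken from the talk


def aggregate_metrics(events, window_seconds=WINDOW_SECONDS):
    """Count views and clicks per (window start, article_id) pair."""
    metrics = defaultdict(lambda: {"view": 0, "click": 0})
    for event in events:
        # Align the event timestamp to the start of its tumbling window.
        window_start = event["timestamp"] - event["timestamp"] % window_seconds
        if event["action"] in ("view", "click"):
            metrics[(window_start, event["article_id"])][event["action"]] += 1
    return dict(metrics)


def click_through_rate(counts):
    """Clicks divided by views, or 0.0 when there are no views."""
    return counts["click"] / counts["view"] if counts["view"] else 0.0


# Hypothetical activity events (timestamps in seconds).
events = [
    {"timestamp": 5, "article_id": "a1", "action": "view"},
    {"timestamp": 12, "article_id": "a1", "action": "click"},
    {"timestamp": 30, "article_id": "a1", "action": "view"},
    {"timestamp": 61, "article_id": "a1", "action": "view"},
    {"timestamp": 65, "article_id": "a2", "action": "view"},
]
metrics = aggregate_metrics(events)
```

In a real pipeline the per-window counts would be written to a store such as DynamoDB, as the summary says; the in-memory dictionary here only stands in for that step.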
SmartNews has evolved its use of AWS over time, moving from a monolithic application to microservices as its scale increased. It now uses over 300 EC2 instances, 80 ELBs, and many other AWS services. Configuration management has moved from pull-style deploys to tools such as CodeDeploy, Auto Scaling Groups, and infrastructure as code. Future plans include further containerization and event aggregation to improve scalability, safety, and measurability across services.
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ... (SmartNews, Inc.)
This document discusses the data management platform (DMP) used for ad targeting and delivery in SmartNews Ads. The DMP collects, cleans, and aggregates over 14 million user profiles and ad data from multiple sources. It uses this first-party data to perform user clustering, CTR and CVR prediction using machine learning models, and lookalike targeting. Future work may include targeting based on user interests and collecting negative feedback to optimize the user experience.
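As a rough illustration of lookalike targeting, the sketch below ranks candidate users by cosine similarity to the centroid of a seed audience. This is one common way to approach the problem, not necessarily the method used in the SmartNews DMP; the feature vectors, user IDs, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def lookalike_users(seed_vectors, candidates, top_k):
    """Return the top_k candidate IDs most similar to the seed centroid."""
    center = centroid(seed_vectors)
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(center, item[1]),
        reverse=True,
    )
    return [user_id for user_id, _ in ranked[:top_k]]


# Hypothetical feature vectors for seed users (e.g. past converters).
seeds = [[1.0, 0.0, 1.0], [0.8, 0.2, 1.0]]
# Hypothetical candidate users to score against the seed audience.
candidates = {
    "u1": [0.9, 0.1, 0.9],
    "u2": [0.0, 1.0, 0.0],
    "u3": [0.5, 0.5, 0.5],
}
top_two = lookalike_users(seeds, candidates, top_k=2)
```

With these made-up vectors, `u1` is closest to the seed centroid and `u2` is farthest, so `u2` is the user left out of the top two.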
This document discusses Apache Spark on EMR and best practices for using Spark. It introduces the speaker and their experience with Spark at SmartNews. It then covers recent Spark updates, how SmartNews uses Spark for tasks such as ad targeting and recommendation, and 10 best practices for using Spark on EMR, including running Spark on YARN, tuning memory settings, minimizing data shuffle, and using dynamic scaling with Spark Streaming.
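A few of these practices map to standard Spark configuration properties. The sketch below assembles a `spark-submit` command line that runs on YARN with explicit executor memory and dynamic allocation enabled. The property names are regular Spark settings, but the values, the helper name `build_spark_submit_args`, and the application file `metrics_job.py` are illustrative assumptions, not the settings recommended in the talk.

```python
def build_spark_submit_args(app_path, conf):
    """Build a spark-submit argument list for a YARN cluster-mode job."""
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    # Sort keys so the generated command is deterministic.
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app_path)
    return args


# Illustrative values only; appropriate sizes depend on the cluster and workload.
EXAMPLE_CONF = {
    "spark.executor.memory": "4g",              # memory tuning
    "spark.executor.cores": "2",
    "spark.dynamicAllocation.enabled": "true",  # scale executors with load
    "spark.shuffle.service.enabled": "true",    # used with dynamic allocation on YARN
    "spark.sql.shuffle.partitions": "200",      # controls shuffle parallelism
}

args = build_spark_submit_args("metrics_job.py", EXAMPLE_CONF)
command = " ".join(args)
```

Keeping the settings in a dictionary makes it easy to review and version them alongside the job, which fits the infrastructure-as-code approach mentioned earlier.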