Evolving s3 story

This document summarizes how AppsFlyer's raw data product evolved from a simple Spark script into a premium data service over 3 months. It began as a prototype for large clients with BI systems that needed very large files or large numbers of files. Challenges included scaling, monitoring, security and schema. Improvements such as the Parquet format and a stateful S3 bucket structure reduced costs and improved performance. The script was turned into a microservice with automated task creation, search, and notifications. Monitoring, cost optimization, and job prioritization further refined the product. It ended up as a premium, self-serve offering with onboarding and well-defined schema fields.

Evolving a premium raw data product
from a simple Spark script in 3 months
Avi Perez, Big Data Team Leader @AppsFlyer

AppsFlyer
• 28M raised from top VCs
• 200M to 13B daily events [3 years]
• 40GB to 5TB [gz] daily text data
• 25 → 60 people in R&D during 2016
• Top 15 Israeli startups by Inc.com

What We Do
Diagram: Media Sources, App Developers, App Users
x10^9
$4B in media payments measured annually

AppsFlyer Raw Data Channels
Raw vs Aggregated
• Real-time stream from Kafka
• Online query Data API (CSV)
Diagram: HTTP, Columnar DB, S3

New Use Case
• Big Clients with BI Systems
• Very large files / large number of files
• Tackling current limitations

Rapid Prototype ...
Diagram: Secor, Amazon S3 (write, notify, read)

Challenges ...
• Scale in number of clients
• Client monitoring
• Security
• Schema
• Flow & Control

Requests keep coming...
• More Clients
• More Event Types
• Customizable Columns

Improving Data Format
• Scanning a lot of data is easy... but not that fast
• Being a big data company does not necessarily mean you need to read all your data fast

Moving to Parquet . . .
Twitter & Cloudera
• Columnar storage (load only what you need)
• Space efficient (50% improvement)
• Read-time efficient (98% improvement)
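
The "load only what you need" point can be illustrated with a toy sketch in plain Python that contrasts a row layout with a column layout. This is not Parquet itself, only the idea behind columnar storage; the event fields (`app_id`, `event_name`, `revenue`) and all function names are hypothetical.

```python
# Toy comparison of row vs. columnar layout (not real Parquet).
# Field names below are hypothetical examples.

rows = [
    {"app_id": "a1", "event_name": "install", "revenue": 0.0},
    {"app_id": "a1", "event_name": "purchase", "revenue": 4.99},
    {"app_id": "b2", "event_name": "purchase", "revenue": 1.50},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_revenue_row_layout(records):
    """Row layout: every field of every record is touched."""
    values_read = sum(len(record) for record in records)
    return sum(record["revenue"] for record in records), values_read


def total_revenue_column_layout(columns):
    """Columnar layout: only the 'revenue' column is touched."""
    revenue = columns["revenue"]
    return sum(revenue), len(revenue)


columns = to_columns(rows)
row_total, row_reads = total_revenue_row_layout(rows)
col_total, col_reads = total_revenue_column_layout(columns)
```

Both layouts give the same total, but the columnar version reads 3 values instead of 9. With wide event schemas the gap grows with the number of columns the query does not need.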

Stateful S3 Bucket Structure
For automatic parsing by bots
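
The slide does not show the actual layout. One possible reading of a "stateful" structure is that the key path itself carries the state a bot needs (client, report type, date, hour), with a marker object signalling that a batch is complete. The layout, the `_SUCCESS` marker name, and every function name below are assumptions for illustration, not AppsFlyer's actual scheme.

```python
from datetime import datetime
from typing import Dict, Iterable

# Hypothetical layout: <client>/<report>/dt=YYYY-MM-DD/h=HH/<file>
COMPLETION_MARKER = "_SUCCESS"  # assumed marker name


def build_prefix(client: str, report: str, ts: datetime) -> str:
    """Build the key prefix for one hourly batch."""
    return f"{client}/{report}/dt={ts:%Y-%m-%d}/h={ts:%H}/"


def parse_prefix(prefix: str) -> Dict[str, str]:
    """Recover client, report, date and hour from a prefix."""
    client, report, dt_part, hour_part = prefix.strip("/").split("/")
    return {
        "client": client,
        "report": report,
        "date": dt_part.split("=", 1)[1],
        "hour": hour_part.split("=", 1)[1],
    }


def batch_is_complete(keys: Iterable[str], prefix: str) -> bool:
    """A batch is safe to read once its marker object exists."""
    return prefix + COMPLETION_MARKER in set(keys)


prefix = build_prefix("client-a", "installs", datetime(2017, 3, 1, 9))
```

A client-side bot could list keys under a prefix, check `batch_is_complete`, and only then download the data files, without needing any out-of-band coordination.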

View Layer
• Flatten fields mapping
• Versions
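
A view layer of this kind could be sketched as a versioned mapping from nested source paths to flat output columns. The field names, the version labels, and the `apply_view` helper are hypothetical and only show the general shape of the idea.

```python
from typing import Any, Dict

# Hypothetical versioned mappings: flat column name -> dotted source path.
VIEWS: Dict[str, Dict[str, str]] = {
    "v1": {"app_id": "app.id", "country": "device.geo.country"},
    "v2": {
        "app_id": "app.id",
        "country_code": "device.geo.country",
        "os": "device.os",
    },
}


def get_path(event: Dict[str, Any], dotted_path: str) -> Any:
    """Walk a nested dict along a dotted path; None if any part is missing."""
    current: Any = event
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def apply_view(event: Dict[str, Any], version: str) -> Dict[str, Any]:
    """Flatten one nested event into the columns of the chosen view version."""
    mapping = VIEWS[version]
    return {column: get_path(event, path) for column, path in mapping.items()}


event = {"app": {"id": "a1"}, "device": {"os": "ios", "geo": {"country": "IL"}}}
flat_v1 = apply_view(event, "v1")
flat_v2 = apply_view(event, "v2")
```

Keeping old versions in the mapping lets existing clients keep their column set while new clients pick up renamed or added fields.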

From Script to Microservice
• Task creation (buckets, IAM, credentials, etc.)
• Search on task executions
• Access to the report files
• Get statuses from the job over HTTP
• Highly available

Moving Toward A Product . . .
• Clients want SLA . . .

Service transparency
Push notification to Slack once there is an issue
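
A minimal sketch of such a notification, assuming a Slack incoming webhook (which accepts a JSON body with a `text` field). The job name, client name, message wording, and both helper functions are hypothetical; the webhook URL is deliberately left for the reader to supply.

```python
import json
import urllib.request


def build_slack_payload(job_name: str, client: str, error: str) -> bytes:
    """Build the JSON body for a Slack incoming webhook message."""
    text = f":warning: Job `{job_name}` failed for client `{client}`: {error}"
    return json.dumps({"text": text}).encode("utf-8")


def notify_slack(webhook_url: str, payload: bytes, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_slack_payload("daily-export", "client-a", "S3 write timed out")
# notify_slack(webhook_url, payload) is not called here; it needs a real webhook URL.
```

Separating payload construction from the HTTP call keeps the message format testable without network access.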

Results
Loading data for specific clients
• Load specific clients' raw data from a 2.5TB compressed topic: 1.5 min
• Same load with partitioning: 30 sec
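
A speed-up of this kind is consistent with reading only the requested client's partition instead of scanning the whole topic. The toy sketch below shows the difference in work done; the event tuples and function names are hypothetical, and the counts it produces are not related to the timings on the slide.

```python
from collections import defaultdict

# Hypothetical events: (client_id, payload)
events = [
    ("client-a", "e1"),
    ("client-b", "e2"),
    ("client-a", "e3"),
    ("client-c", "e4"),
    ("client-b", "e5"),
]


def load_by_scan(all_events, client_id):
    """Unpartitioned: inspect every event and keep the matching ones."""
    scanned = 0
    matched = []
    for owner, payload in all_events:
        scanned += 1
        if owner == client_id:
            matched.append(payload)
    return matched, scanned


def partition_by_client(all_events):
    """Write-time step: group events into one partition per client."""
    partitions = defaultdict(list)
    for owner, payload in all_events:
        partitions[owner].append(payload)
    return partitions


def load_from_partition(partitions, client_id):
    """Partitioned: read only the requested client's partition."""
    matched = list(partitions.get(client_id, []))
    return matched, len(matched)


partitions = partition_by_client(events)
scan_result, scan_cost = load_by_scan(events, "client-a")
part_result, part_cost = load_from_partition(partitions, "client-a")
```

The partitioning cost is paid once at write time, while every later per-client load touches only that client's records.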

From hard-coded list to RDS
Diagram: Client A, Client B, ...

Secured Email Notifications
click Get link

Vault
• Secure Secret Storage
• Dynamic Secrets
• Data Encryption
• Leasing and renewal
• Revocation

Cost Optimization
Helping our clients with download
• Daily sessions output file for one of the clients: 60GB
• The same report compressed (.gz): 2.1GB
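
The effect of gzip on a repetitive text report can be sketched with Python's standard `gzip` module. The report content below is synthetic and hypothetical, so the ratio it produces will differ from real data; repetitive synthetic rows compress far better than a typical report.

```python
import gzip

# Hypothetical CSV-like report with highly repetitive rows.
header = "app_id,event_name,country\n"
body = "com.example.app,session,IL\n" * 10_000
raw = (header + body).encode("utf-8")

compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)

# Clients can decompress with any standard gzip tool or library.
restored = gzip.decompress(compressed)
```

Smaller files reduce both storage and transfer volume, which is where the cost saving for the provider and the faster download for the client come from.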

Support keeps asking the same
questions….

Monitoring . . .
Monitor, monitor, monitor….
• Metrics
• Retries
• PDs
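
Retries are often implemented with exponential backoff. The sketch below is one common pattern, not necessarily what the team used; `run_with_retries`, its parameters, and the flaky task are hypothetical.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run task(); on failure wait base_delay * 2**(attempt - 1) and retry.

    Returns (result, attempts_used). Re-raises the last error when all
    attempts fail, so that a caller can raise an alert.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky task: fails twice, then succeeds.
calls = {"count": 0}


def flaky_export():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient S3 error")
    return "ok"


result, attempts = run_with_retries(flaky_export)
```

The number of attempts used is itself a useful metric: a job that regularly needs all its retries is worth investigating before it starts failing outright.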

Going premium . . .
• Onboarding
• Well-defined schema fields
• Self-serve and pricing

Thank you
And… we are hiring!!
avi@appsflyer.com

The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟

AWS Cloud Cost Optimization Presentation.pptx

HarisZaheer8

This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.

Programming Foundation Models with DSPy - Meetup Slides

Zilliz

Fueling AI with Great Data with Airbyte Webinar

Zilliz

Letter and Document Automation for Bonterra Impact Management (fka Social Sol...

Jeffrey Haguewood

Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows. We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases. This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365. Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.

HCL Notes and Domino License Cost Reduction in the World of DLAU

panagenda

Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/ The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this! We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model. Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward. These topics will be covered - Reducing license cost by finding and fixing misconfigurations and superfluous accounts - How do CCB and CCX licenses really work? - Understanding the DLAU tool and how to best utilize it - Tips for common problem areas, like team mailboxes, functional/test users, etc - Practical examples and best practices to implement right away

Building Production Ready Search Pipelines with Spark and Milvus

Zilliz

Recently uploaded (20)

Best 20 SEO Techniques To Improve Website Visibility In SERP

Nordic Marketo Engage User Group_June 13_ 2024.pptx

GraphRAG for Life Science to increase LLM accuracy

Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...

Taking AI to the Next Level in Manufacturing.pdf

Presentation of the OECD Artificial Intelligence Review of Germany

Recommendation System using RAG Architecture

Digital Marketing Trends in 2024 | Guide for Staying Ahead

Main news related to the CCS TSI 2023 (2023/1695)

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...

Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack

GenAI Pilot Implementation in the organizations

Ocean lotus Threat actors project by John Sitima 2024 (1).pptx

Operating System Used by Users in day-to-day life.pptx

AWS Cloud Cost Optimization Presentation.pptx

Programming Foundation Models with DSPy - Meetup Slides

Fueling AI with Great Data with Airbyte Webinar

Letter and Document Automation for Bonterra Impact Management (fka Social Sol...

HCL Notes and Domino License Cost Reduction in the World of DLAU

Building Production Ready Search Pipelines with Spark and Milvus

Evolving s3 story

1. Evolving a premium raw data product from a simple Spark script in 3 months Avi Perez, Big Data Team Leader @AppsFlyer

2. AppsFlyer • 28M raised from top VCs • 200M to 13B daily events [3 years] • 40GB to 5TB [gz] daily text data • 25 → 60 people in R&D during 2016 • Top 15 Israeli startups by inc.com

3. What We Do • Media Sources • App Developers • App Users • x10^9 • 4B$ in media payments measured annually

4. AppsFlyer Raw Data Channels Raw vs Aggregated • Real Time Stream From Kafka • Online Query Data API (csv) HTTP Columnar DB S3

5. New Use Case • Big Clients with BI Systems • Very large files / large number of files • Tackling current limitations

6. secor Amazon S3 Rapid Prototype ... write notify read

7. Naive Spark SQL

8. Challenges ... • Scale in #clients • Client monitoring • Security • Schema • Flow & Control

9. Requests keep coming... • More Clients • More Event Types • Customizable Columns

10. What are we facing here...

11. What was missing?

12. Improving Data Format • Scanning a lot of data is easy... but not that fast • Being a big data company does not necessarily mean you need to read all your data fast

13. Moving to Parquet . . . Twitter & Cloudera • Columnar storage (load only what you need) • Space efficient (50% improvement) • Read time efficient (98% improvement)

14. Stateful S3 Bucket Structure For automatic bots parsing

15. View Layer • Flatten fields mapping • Versions

16. From script to Micro Service • Tasks creation (Buckets, IAM, Credentials etc) • Search on Task Executions • Access to the report files • Get statuses from the Job HTTP • Highly available

17. Abstraction . . .

18. Moving Toward A Product . . . • Clients want SLA . . .

19. Service transparency • Push notification to Slack when there is an issue

20. Data Segregation

21. Results: Loading data for specific clients • Load a specific client's raw data from a 2.5TB compressed topic: 1.5 min • Same load with partitioning: 30 sec

22. From hard coded List to RDS Client A Client B ... ...

23. Secured Email Notifications click Get link

24. Vault • Secure Secret Storage • Dynamic Secrets • Data Encryption • Leasing and renewal • Revocation

25. Cost Optimization: Helping our clients with download • Daily sessions output file for one of the clients: 60 GB • The same report compressed (.gz): 2.1 GB

26. Moving to YARN

27. Prioritizing Spark Jobs

28. Support keeps asking the same questions...

29. Monitoring . . . Monitor, monitor, monitor…. • Metrics • Re-tries • PDs

30. Going premium . . . • Onboarding • Well-defined schema fields • Self-serve and pricing

31. What we learned . . .

32. Thank you And… We are hiring!! avi@appsflyer.com

Editor's Notes

Sales came to R&D and asked for a way to get organic data.
Big data analytics: we give our users tools to measure how high-quality the traffic they bring from different advertising channels is and where that quality traffic comes from, and tools to help them make decisions, based on enormous amounts of data, that directly affect their revenue.
Raw vs aggregate
Clients do not always use our dashboard. We asked them what they do with our APIs: ETL jobs to run on S3. High load on AF systems; how can we solve it? Many queries per day. We have an inherent limit of 200k rows. CMS big clients: remove limitations. Very large companies want all their data. A script “issue” that cost us 50k.
Solution: we had to make hard decisions in R&D, knowing that we would pay for them with manual maintenance, but there was no real choice and we wanted to win this strategic client. This is the solution we presented: 13B events → Kafka → secor (a service for persisting Kafka logs to S3) as sequence files, with Spark SQL on top of that. We manually created a bucket on our production S3 for that account with only List \ Read permissions, manually created a specific IAM user and provided the client with the credentials, and ran the process with crons \ Mesos each morning. We went to production within a few days and allowed access only to the smallest topic, installs. The client signed.
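The List \ Read-only bucket access described above could be written down as an IAM policy document. The following is a minimal sketch that only builds the JSON; the bucket name `af-reports-client-a` and the function name are hypothetical, and the notes do not show the policy that was actually used.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy granting only List and Read on one bucket.

    The bucket name is a placeholder supplied by the caller; nothing
    here is taken from the real AppsFlyer setup.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


policy = read_only_bucket_policy("af-reports-client-a")  # hypothetical name
print(json.dumps(policy, indent=2))
```

Generating the document from a function, rather than editing it by hand per client, is one way to reduce the manual maintenance the notes mention.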
Mobile App Letgo Raises $100 Million From Naspers To Take Over Classifieds In The U.S.
reports/<Home Folder>/account/<event-type>-<date YYYY-MM-dd>
reports/<Home Folder>/apps/app-id/<event-type>-<date YYYY-MM-dd>
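A predictable key layout like the one above is what lets automated clients find and parse report files. As a rough sketch, the helpers below build and parse such keys; the function names and the sample values (`client-a`, `com.example.app`) are hypothetical, and `app-id` is assumed to stand for the application identifier.

```python
from datetime import date


def account_report_key(home_folder: str, event_type: str, day: date) -> str:
    """Key for an account-level report, dated as YYYY-MM-dd."""
    return f"reports/{home_folder}/account/{event_type}-{day.isoformat()}"


def app_report_key(home_folder: str, app_id: str, event_type: str, day: date) -> str:
    """Key for an app-level report, dated as YYYY-MM-dd."""
    return f"reports/{home_folder}/apps/{app_id}/{event_type}-{day.isoformat()}"


def parse_report_name(key: str) -> tuple:
    """Split the last path segment into (event_type, date).

    The date is always the last 10 characters, so event types that
    contain hyphens are still parsed correctly.
    """
    name = key.rsplit("/", 1)[-1]
    return name[:-11], date.fromisoformat(name[-10:])


key = app_report_key("client-a", "com.example.app", "installs", date(2016, 11, 1))
print(key)
print(parse_report_name(key))
```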
Flatten the schema. Schema on write. Schema on read. Code reuse. Versioning. Readability \ simplification.
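Flattening the schema can be illustrated with a short sketch that turns a nested event into a single level of columns. The event shape, the field names, and the underscore separator are all assumptions made for illustration, not the actual AppsFlyer schema.

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Recursively flatten nested dicts into one level of column names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Descend into nested objects, prefixing with the parent name.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested event, for illustration only.
event = {
    "app_id": "com.example.app",
    "device": {"os": "android", "geo": {"country": "IL"}},
}
print(flatten(event))
# {'app_id': 'com.example.app', 'device_os': 'android', 'device_geo_country': 'IL'}
```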
Tell the story of Letgo, which built its entire marketing team workflow based on the dashboards they are creating.
An analytics process that calculates the partition keys (X apps) each day and saves them as metadata on the files bucket.
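The idea behind partitioning by app can be sketched in a few lines: once events are grouped by a partition key, a load for one client touches only that client's partitions instead of scanning the whole topic. This is an in-memory illustration with made-up app IDs, not the Spark job itself.

```python
from collections import defaultdict


def partition_by_app(events: list) -> dict:
    """Group events by app_id, the assumed partition key."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event["app_id"]].append(event)
    return dict(partitions)


def load_for_apps(partitions: dict, app_ids: list) -> list:
    """Read only the partitions that belong to the requested apps."""
    return [e for app_id in app_ids for e in partitions.get(app_id, [])]


# Hypothetical events, for illustration only.
events = [
    {"app_id": "com.example.a", "event": "install"},
    {"app_id": "com.example.b", "event": "install"},
    {"app_id": "com.example.a", "event": "session"},
]
partitions = partition_by_app(events)
partition_keys = sorted(partitions)  # could be stored as metadata next to the files
print(partition_keys)
print(load_for_apps(partitions, ["com.example.a"]))
```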
To protect our clients, and also to protect us from our own mistakes, we deployed a service that gave us ways to generate keys \ secrets arbitrarily.
Helping our clients to improve their download time from our S3
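The cost-optimization slide reports a daily sessions file shrinking from 60 GB to 2.1 GB once gzipped. The sketch below works out that ratio and shows the same effect on a synthetic CSV payload; the row content is made up, and highly repetitive data like this compresses far better than real reports would.

```python
import gzip


def compression_ratio(original_size: float, compressed_size: float) -> float:
    """How many times smaller the compressed file is."""
    return original_size / compressed_size


# Figures from the slide: 60 GB raw, 2.1 GB as .gz.
print(round(compression_ratio(60, 2.1), 1))

# Synthetic CSV rows, for illustration only.
row = b"2016-11-01 10:00:00,com.example.app,session,IL,android\n"
raw = row * 100_000
compressed = gzip.compress(raw)
print(len(raw), len(compressed))
```

A smaller object means less data transferred per download, which is where the improvement in client download time comes from.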
Scheduled tasks were not executed. The same job was executed twice. Not trivial to maintain a DAG. Dynamic allocation.
