Evolving a premium raw data product
from a simple Spark script in 3 months
Avi Perez, Big Data Team Leader @AppsFlyer
AppsFlyer
• $28M raised from top VCs
• 200M To 13B Daily Events [3 Years]
• 40GB To 5TB [gz] daily text data
• R&D grew from 25 to 60 people during 2016
• Top 15 Israeli startups by inc.com
What We Do
Media Sources | App Developers | App Users
X10
9
$4B in media payments measured annually
AppsFlyer Raw Data Channels
Raw vs Aggregated
• Real-time stream from Kafka
• Online Query Data API (CSV)
HTTP
Columnar DB
S3
New Use Case
• Big Clients with BI Systems
• Very large files / large number of files
• Tackling current limitations
secor
Amazon S3
Rapid Prototype ...
write
notify
read
Naive Spark SQL
Challenges ...
• Scale in number of clients
• Client monitoring
• Security
• Schema
• Flow & Control
Requests keep coming...
• More Clients
• More Event Types
• Customizable Columns
What are we facing here...
What was missing?
Improving Data Format
• Scanning a lot of data is easy... but not that fast
• Being a big data company does not necessarily
mean you need to read all of your data, fast
Moving to Parquet . . .
Twitter & Cloudera
• Columnar storage (load only what you need)
• Space efficient (50% improvement)
• Read-time efficient (98% improvement)
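The "load only what you need" point can be illustrated with a toy, in-memory comparison of row and column layouts. This is a minimal Python sketch and not Parquet itself; the event fields (`app_id`, `event_type`, `payload`) and the helper names are made up for illustration.

```python
from typing import Any

# Hypothetical events; the field names are illustrative, not a real schema.
ROWS = [
    {"app_id": "a1", "event_type": "install", "payload": "x" * 100},
    {"app_id": "a2", "event_type": "session", "payload": "y" * 100},
    {"app_id": "a1", "event_type": "session", "payload": "z" * 100},
]


def to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into one list per column."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_touched_row_layout(rows: list[dict[str, Any]]) -> int:
    """A row layout walks every field of every row, whatever columns are wanted."""
    return sum(len(row) for row in rows)


def values_touched_column_layout(columns: dict[str, list[Any]], wanted: list[str]) -> int:
    """A column layout reads only the requested columns."""
    return sum(len(columns[name]) for name in wanted)


COLUMNS = to_columns(ROWS)
print(values_touched_row_layout(ROWS))                    # 9
print(values_touched_column_layout(COLUMNS, ["app_id"]))  # 3
```

The gap widens with wide schemas and large payload columns, which is one plausible reading of why read time improved far more than storage size.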
Stateful S3 Bucket Structure
For parsing by automated bots
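A predictable key layout is what lets a client's bot locate a file without listing the bucket. The sketch below builds keys that follow the `reports/<Home Folder>/account/<event-type>-<date>` and `reports/<Home Folder>/apps/<app-id>/<event-type>-<date>` patterns given in the speaker notes; the function name, its arguments and the sample values are hypothetical.

```python
from datetime import date
from typing import Optional


def report_key(home_folder: str, event_type: str, day: date,
               app_id: Optional[str] = None) -> str:
    """Build an object key for an account-level or app-level report."""
    stamp = day.strftime("%Y-%m-%d")
    if app_id is None:
        # Account-level report: one file per event type per day.
        return f"reports/{home_folder}/account/{event_type}-{stamp}"
    # App-level report: same naming, nested under the app id.
    return f"reports/{home_folder}/apps/{app_id}/{event_type}-{stamp}"


print(report_key("client-a", "installs", date(2017, 1, 31)))
print(report_key("client-a", "sessions", date(2017, 1, 31), app_id="com.example.app"))
```

Because the key depends only on the client, event type and date, a consumer can compute tomorrow's key today and poll for it.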
View Layer
• Flatten fields mapping
• Versions
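One way to picture a view layer with flattened field mappings and versions is a table that maps each output column to a path in the nested raw event, with one table per version. This is a sketch under assumptions: the column names, paths and version numbers below are invented for illustration.

```python
from typing import Any

# Hypothetical versioned views: output column -> dotted path in the raw event.
VIEWS = {
    1: {"app_id": "app.id", "country": "geo.country"},
    2: {"app_id": "app.id", "country": "geo.country", "city": "geo.city"},
}


def get_path(event: dict[str, Any], dotted: str) -> Any:
    """Walk a nested dict along a dotted path; return None if a key is missing."""
    current: Any = event
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def apply_view(event: dict[str, Any], version: int) -> dict[str, Any]:
    """Project a nested raw event onto the flat columns of one view version."""
    return {column: get_path(event, path) for column, path in VIEWS[version].items()}


raw = {"app": {"id": "com.example.app"}, "geo": {"country": "IL", "city": "Herzliya"}}
print(apply_view(raw, 1))
print(apply_view(raw, 2))
```

Keeping old versions in the table means a client pinned to version 1 keeps receiving the same columns after version 2 adds new ones.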
From script to Micro Service
• Tasks creation (Buckets, IAM, Credentials etc)
• Search on Task Executions
• Access to the report files
• Get job statuses over HTTP
• Highly available
Abstraction . . .
Moving Toward A Product . . .
• Clients want SLA . . .
Service transparency
Push notification to Slack as soon as
there is an issue
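As a rough illustration, a failure alert can be posted to a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below uses only the Python standard library; the webhook URL, task name and message wording are placeholders, not details from the talk.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, task: str, error: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": f"Task {task} failed: {error}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url: str, task: str, error: str) -> int:
    """Send the notification and return the HTTP status code."""
    request = build_slack_request(webhook_url, task, error)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Placeholder URL; a real one is issued by Slack per channel.
req = build_slack_request("https://hooks.slack.com/services/T000/B000/XXXX",
                          "daily-installs", "timeout")
print(req.get_method(), req.data.decode("utf-8"))
```

Separating request construction from sending keeps the message format testable without network access.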
Data Segregation
Results
Loading data for specific clients
• Load specific clients' raw data from a 2.5TB compressed topic: 1.5 min
• Same load with partitioning: 30 sec
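The effect of partitioning on a per-client load can be sketched with a toy model: without partitions every record is read and then filtered, while with partitions only the requested clients' groups are read. The data and function names are hypothetical, and the final line assumes the slide's figures are 1.5 min without partitioning and 30 sec with it.

```python
from collections import defaultdict

# Hypothetical raw events; in the talk the source is a 2.5TB compressed topic.
EVENTS = [
    {"client": "A", "event": "install"},
    {"client": "B", "event": "session"},
    {"client": "A", "event": "session"},
    {"client": "C", "event": "install"},
]


def load_by_scan(events, clients):
    """Unpartitioned load: every record is read, then filtered."""
    records_read = len(events)
    return [e for e in events if e["client"] in clients], records_read


def partition_by_client(events):
    """Group records by client, as a partitioned layout would on disk."""
    partitions = defaultdict(list)
    for e in events:
        partitions[e["client"]].append(e)
    return partitions


def load_by_partition(partitions, clients):
    """Partitioned load: only the requested clients' partitions are read."""
    selected = [e for c in clients for e in partitions.get(c, [])]
    return selected, len(selected)


rows, read = load_by_scan(EVENTS, {"A"})
print(len(rows), read)  # 2 4
rows, read = load_by_partition(partition_by_client(EVENTS), {"A"})
print(len(rows), read)  # 2 2

# Speedup implied by the slide's figures.
print((1.5 * 60) / 30)  # 3.0
```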
From hard-coded list to RDS
Client A, Client B, ...
Secured Email Notifications
click Get link
Vault
• Secure Secret Storage
• Dynamic Secrets
• Data Encryption
• Leasing and renewal
• Revocation
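The leasing, renewal and revocation ideas can be pictured with a small in-memory model. This is only a conceptual sketch: the class and method names are invented and do not reflect Vault's actual API, and the injected `clock` is there purely to make the behaviour easy to follow.

```python
import secrets
from dataclasses import dataclass
from typing import Callable


@dataclass
class Lease:
    secret: str
    expires_at: float
    revoked: bool = False


class ToySecretStore:
    """In-memory stand-in for a secret service; NOT Vault's API."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._leases: dict[str, Lease] = {}

    def issue(self, ttl: float) -> str:
        """Generate a fresh secret on demand and lease it for ttl seconds."""
        lease_id = secrets.token_hex(8)
        self._leases[lease_id] = Lease(secrets.token_urlsafe(32), self._clock() + ttl)
        return lease_id

    def is_valid(self, lease_id: str) -> bool:
        lease = self._leases[lease_id]
        return not lease.revoked and self._clock() < lease.expires_at

    def renew(self, lease_id: str, ttl: float) -> None:
        """Extend a lease that is still valid; expired or revoked leases stay dead."""
        if not self.is_valid(lease_id):
            raise ValueError("lease is expired or revoked")
        self._leases[lease_id].expires_at = self._clock() + ttl

    def revoke(self, lease_id: str) -> None:
        """Invalidate a lease immediately, regardless of remaining time."""
        self._leases[lease_id].revoked = True


now = [0.0]  # fake clock so the example is deterministic
store = ToySecretStore(clock=lambda: now[0])
lease = store.issue(ttl=60)
now[0] = 30.0
store.renew(lease, ttl=60)    # now valid until t=90
now[0] = 80.0
print(store.is_valid(lease))  # True
store.revoke(lease)
print(store.is_valid(lease))  # False
```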
Cost Optimization
Helping our clients with download
• Daily sessions output file for one of the clients: 60GB
• The same report compressed (.gz): 2.1GB
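The slide's figures imply a compression ratio of roughly 28x. The snippet below works out that ratio and shows gzip on a tiny stand-in report using Python's standard `gzip` module; the CSV content is invented, and real ratios depend on the data.

```python
import gzip

# Ratio implied by the slide: 60 GB raw vs 2.1 GB gzipped.
print(round(60 / 2.1, 1))  # 28.6

# Tiny stand-in for a report: repetitive CSV text compresses well.
header = "app_id,event_type,country\n"
body = "com.example.app,session,IL\n" * 10_000
raw = (header + body).encode("utf-8")
packed = gzip.compress(raw)

assert gzip.decompress(packed) == raw  # lossless round trip
print(len(raw), len(packed))
```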
Moving to YARN
Prioritizing Spark jobs
Support keeps asking the same
questions...
Monitoring . . .
Monitor, monitor, monitor….
• Metrics
• Re-tries
• PDs
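Retries are commonly implemented as a wrapper that re-runs a failing task with growing delays and surfaces the error once attempts run out, at which point metrics and alerts take over. A minimal sketch, assuming exponential backoff; the function name and defaults are hypothetical rather than taken from the talk.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], attempts: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run task, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the failure to alerting
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


calls = {"n": 0}


def flaky() -> str:
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays: list[float] = []
print(run_with_retries(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the backoff schedule observable without actually waiting.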
Going premium . .
• Onboarding
• Well defined schema fields
• Self Serve and pricing
What we learned . . .
Thank you
And…
We are
hiring!!
avi@appsflyer.com


Editor's Notes

  1. Sales came to R&D and asked for a way to get organic data.
  2. Big data analytics: we give our users tools to measure the quality of the traffic they bring in from different advertising channels and where that quality traffic comes from, plus tools that help them make decisions, from huge amounts of data, that directly affect their revenue.
  3. Raw vs aggregate
  4. Clients are not always using our dashboard. We asked them what they do with our APIs: ETL jobs that run on S3. High load on AF systems; how can we solve it? Many queries per day. We have an inherent limit of 200k rows. CMS big clients: remove limitations. Very large companies want all their data. Script "issue" that cost us 50k.
  5. Solution: we had to make hard decisions in R&D, knowing that we would pay for them in manual maintenance, but there was not really a choice and we wanted to win this strategic client. This is the solution we presented: 13B events → Kafka → secor (a service for persisting Kafka logs to S3) as sequence files, with Spark SQL on top of that. We manually created a bucket on our production S3 for that account with only List / Read permissions, manually created a specific IAM user and provided the client with the credentials, and ran the process with chrons / Mesos each morning. We went to production within a few days and enabled access only to the smallest topic, installs. The client signed.
  6. Mobile App Letgo Raises $100 Million From Naspers To Take Over Classifieds In The U.S.
  7. reports/<Home Folder>/account/<event-type>-<date YYYY-MM-dd>
     reports/<Home Folder>/apps/app-id/<event-type>-<date YYYY-MM-dd>
  8. Flatten the schema; schema on write; schema on read; code reuse; versioning; readability / simplification.
  9. Tell the story of Letgo, which built its entire marketing team's workflow on the dashboards they are creating.
  10. An analytics process that runs each day, calculating for X apps (partition keys), and saves the result as metadata on the files bucket.
  11. To protect our clients, and also to protect ourselves from our own mistakes, we integrated a service that gave us ways to generate keys / secrets arbitrarily.
  12. Helping our clients to improve their download time from our S3
  13. Scheduled tasks were not executed; the same job was executed twice; maintaining a DAG is not trivial; dynamic allocation.