1. Evolving a premium raw data product
from a simple Spark script in 3 months
Avi Perez, Big Data Team Leader @AppsFlyer
2. AppsFlyer
• $28M raised from top VCs
• 200M → 13B daily events [3 years]
• 40GB → 5TB (gz) daily text data
• 25 → 60 people in R&D during 2016
• Top 15 Israeli startups by inc.com
3. What We Do
Media Sources → App Developers → App Users
$4B in media payments measured annually
4. AppsFlyer Raw Data Channels
Raw vs Aggregated
• Real-time stream from Kafka
• Online query data API (CSV)
HTTP | Columnar DB | S3
5. New Use Case
• Big Clients with BI Systems
• Very large files / large number of files
• Tackling current limitations
12. Improving Data Format
• Scanning a lot of data is easy... but not that fast
• Being a big data company doesn't necessarily mean you can read all your data fast
13. Moving to Parquet . . .
(developed by Twitter & Cloudera)
• Columnar storage (load only the columns you need)
• Space efficient (50% improvement)
• Read-time efficient (98% improvement)
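Why a columnar format like Parquet reads less data can be sketched in plain Python. This is a toy illustration only, not Parquet itself; the data and field names are made up.

```python
# Toy illustration of row-oriented vs column-oriented layouts (not
# Parquet itself): a query that needs one column touches far fewer
# fields in the columnar layout. All names and values are hypothetical.

rows = [
    {"app_id": "com.example", "event": "install", "cost": 1.2},
    {"app_id": "com.example", "event": "click",   "cost": 0.1},
    {"app_id": "org.demo",    "event": "install", "cost": 1.5},
]

# Row layout: answering "sum of cost" scans every field of every record.
row_scan_fields = sum(len(r) for r in rows)   # 9 fields read

# Columnar layout: the same data pivoted into one array per column.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# The same query now touches only the "cost" column.
col_scan_fields = len(columns["cost"])        # 3 fields read

total_cost = sum(columns["cost"])
print(total_cost, row_scan_fields, col_scan_fields)
```

With only 3 of 9 fields scanned, the toy query reads a third of the data; on wide event tables with dozens of columns the pruning is far more dramatic, which is where the read-time improvement comes from.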
16. From script to Micro Service
• Task creation (buckets, IAM, credentials, etc.)
• Search on task executions
• Access to the report files
• Get job statuses over HTTP
• Highly available
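The service responsibilities above can be sketched as a minimal in-memory task registry. The names here are hypothetical; the real service exposes these operations over HTTP and persists state.

```python
import uuid
from dataclasses import dataclass, field

# Minimal sketch of the report-task service (hypothetical names; the
# real service is an HTTP microservice with durable storage).

@dataclass
class Task:
    task_id: str
    client: str
    status: str = "pending"   # pending -> running -> done / failed
    report_files: list = field(default_factory=list)

class TaskRegistry:
    def __init__(self):
        self._tasks = {}

    def create(self, client):
        """Task creation (bucket/IAM/credential provisioning hooks in here)."""
        task = Task(task_id=str(uuid.uuid4()), client=client)
        self._tasks[task.task_id] = task
        return task.task_id

    def status(self, task_id):
        """Get the status of a job."""
        return self._tasks[task_id].status

    def search(self, client):
        """Search task executions by client."""
        return [t for t in self._tasks.values() if t.client == client]

registry = TaskRegistry()
tid = registry.create("big-client")
print(registry.status(tid))                 # pending
print(len(registry.search("big-client")))   # 1
```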
24. Vault
• Secure Secret Storage
• Dynamic Secrets
• Data Encryption
• Leasing and renewal
• Revocation
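The dynamic-secrets model (credentials generated on demand, valid for a lease period, renewable, and revocable) can be sketched conceptually. This is NOT Vault's API, only an illustration of the lease/renew/revoke semantics.

```python
import time
import secrets

# Conceptual sketch of Vault-style dynamic secrets with leases.
# (Not Vault's API; it only illustrates lease / renew / revoke semantics.)

class Lease:
    def __init__(self, ttl_seconds):
        self.secret = secrets.token_hex(16)   # generated per request, per client
        self.expires_at = time.time() + ttl_seconds
        self.revoked = False

    def valid(self):
        return not self.revoked and time.time() < self.expires_at

    def renew(self, ttl_seconds):
        """Extend the lease before it expires."""
        self.expires_at = time.time() + ttl_seconds

    def revoke(self):
        """Immediately invalidate the credential."""
        self.revoked = True

lease = Lease(ttl_seconds=60)
assert lease.valid()
lease.renew(ttl_seconds=120)
lease.revoke()
print(lease.valid())   # False
```

Because every client gets its own short-lived secret, a leaked credential expires on its own and can be revoked without rotating anything shared.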
25. Cost Optimization
Helping our clients with downloads
Daily sessions output file for one of the clients: 60GB
The same report compressed (.gz): 2.1GB
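A roughly 28x reduction like the one above is plausible because session logs are highly repetitive text. A stdlib sketch with synthetic CSV-like rows (the sample data and its ratio are made up, not the client report):

```python
import gzip

# Stdlib sketch: gzip on repetitive text (like a sessions CSV) shrinks
# it dramatically. The rows below are synthetic; the 60GB -> 2.1GB
# figure on the slide is from a real client report.

row = "2016-05-01,com.example.app,session,US,android,1.0\n"
data = (row * 100_000).encode()   # a few MB of repetitive rows

compressed = gzip.compress(data)
ratio = len(data) / len(compressed)
print(f"{len(data)} -> {len(compressed)} bytes ({ratio:.0f}x smaller)")
```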
Sales came to R&D and asked for a way to get organic data
Big data analytics
We give our users tools to measure the quality of the traffic they bring from different advertising channels,
to see where that quality traffic comes from,
and to make decisions from huge amounts of data that directly affect their revenue
Raw vs aggregate
Not always using our dashboard
We asked them what we do with our APIs
ETL jobs to run on S3
High load on AF systems
How we can solve
Many queries per day
We have an inherent limit of 200k rows
CMS big clients, remove limitations. Very large companies want all their data
A script “issue” that cost us $50k
Solution: we had to make hard decisions in R&D, knowing we would pay for it in manual maintenance, but there was really no choice and we wanted this strategic client.
And this is the solution we presented:
13B events → Kafka → Secor (a service for persisting Kafka logs to S3)
as sequence files
SparkSQL on top of that
Manually creating a bucket on our production S3 for that account with List/Read-only permissions, manually creating a specific IAM user, and providing them the credentials
And running the process each morning with cron / Mesos
We went to production within a few days, and allowed access only to the smallest topic, installs. The client signed.