Grega Kešpret, Director of Engineering, Analytics – Celtra
Denny Lee, Technology Evangelist – Databricks
December 9th, 2015
How Celtra Optimizes its
Advertising Platform
with Databricks
About Me: Grega Kešpret
Grega Kešpret is the Director of Engineering for Analytics. He has worked at
Celtra since 2012, where he helped build the analytics pipeline and
optimization systems. Grega also leads the team of engineers and data
scientists in San Francisco and Ljubljana, working on their analytics
platform. Prior to Celtra, Grega worked at IBM, helping enterprise
customers adopt WebSphere Application Server, and before that did an
8-month internship at SANYO (Panasonic) in Japan, working on battery
systems. His current technical interests include databases, distributed
systems, functional programming and machine learning.
About Me: Denny Lee
Denny is a Technology Evangelist with Databricks; he is a hands-on data
sciences engineer with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and distributed systems for
both on-premises and cloud. His key focuses surround solving complex
large scale data problems – providing not only architectural direction but
the hands-on implementation of these systems. He has extensive
experience in building greenfield teams as well as acting as a turnaround / change
catalyst. Prior to joining Databricks, Denny worked as a Senior Director of
Data Sciences Engineering at Concur and was part of the incubation team
that built Hadoop on Windows and Azure (currently known as HDInsight).
•  About Celtra & AdCreator
•  Event Data
•  AdTech Problems (Sessionization, Funnel Analytics)
•  Databricks Use Cases
•  Data Analysis Evolutionary Path
•  Main pain points from the past
•  Demo
•  Q&A
Agenda
Motivation
The analytics platform at Celtra has experienced tremendous growth
over the past few years in terms of size, complexity, number of users,
and variety of use cases.
Business, Data Volume, Data Complexity
Leading creative technology
company for data-driven brand
display and video advertising
across mobile, tablets and desktop
Powering the creative side of
advertising campaigns
SaaS platform that lets clients easily
create, manage, and traffic data-driven
dynamic ads, optimize them
on the go, and track their
performance with insightful
analytics
Celtra AdCreator
About Celtra & AdCreator
Trusted by 5,000+ brands
and agencies.
Unparalleled Distribution Options
Reach across an endless list of large publishers and media owners, DSPs and ad networks.
Certified and partnered with the top 25 global Ad Networks and
over 50 premium publishers
AdCreator is the most widely used creative, analytics and
optimization technology for display and video advertising
•  Bread and butter of our analytics data
•  Facts about what happened
•  creative X was served on placement Y at time Z
•  user A interacted with creative X at time Q
•  Automatically tracked
•  Behavioral data from users
•  ~2 billion events per day
•  1 TB new data per day (uncompressed)
•  Very sparse
•  Complex relationships between events
Event data
•  Combining discrete events into sessions
•  In the first version, we computed simple counts of events
•  But no “holistic” view over the whole session:
•  Very hard to troubleshoot/debug
•  Cannot check for/enforce causality (if X happened, Y must also have
happened)
•  Duplication leading to skewed rates (e.g. one user interacting)
•  Later events help you interpret earlier-arriving events (e.g.
session duration, attribution, identity)
Sessionization
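The sessionization idea above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Celtra's Spark implementation: the Event fields, the session_id key, and the event names "served" and "interaction" are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    session_id: str   # hypothetical key tying events to one ad session
    name: str         # e.g. "served", "interaction"
    timestamp: int    # seconds since epoch


def sessionize(events):
    """Group discrete events into time-ordered sessions, dropping exact duplicates."""
    sessions = defaultdict(set)
    for event in events:
        sessions[event.session_id].add(event)  # a set discards verbatim duplicates
    return {
        sid: sorted(evts, key=lambda e: (e.timestamp, e.name))
        for sid, evts in sessions.items()
    }


def violates_causality(session, cause="served", effect="interaction"):
    """True if `effect` occurs but no `cause` occurs at or before the first `effect`."""
    cause_times = [e.timestamp for e in session if e.name == cause]
    effect_times = [e.timestamp for e in session if e.name == effect]
    if not effect_times:
        return False
    return not cause_times or min(cause_times) > min(effect_times)


def duration(session):
    """Session duration is only known once the later events have arrived."""
    return session[-1].timestamp - session[0].timestamp


events = [
    Event("s1", "served", 100),
    Event("s1", "interaction", 130),
    Event("s1", "interaction", 130),   # duplicate delivery of the same event
    Event("s2", "interaction", 205),   # no matching "served" event
]
sessions = sessionize(events)
```

With the whole session in hand, duplicates, causality violations and session duration become simple per-session checks, which is what simple per-event counts cannot provide.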
Why Spark for Event Analytics
One of the first companies to use Spark in production (v0.5)
Sessionization
Nice API
Expressive computation layer
Speed of innovation
Seamless integration with S3
[Diagram: Trackers; Event data + Operational data; Aggregated analytics data]
Funnel Analytics
Celtra Analytics Funnel View
•  Multi-step process
•  Enabled by sessionization
•  Originally developed in the context of e-commerce sites
•  Ad was requested, then served, then shown, then interacted with, then the
user expanded the ad, then watched a video, …
•  Not just whether X happened; but whether A, B and X happened
Funnel Analytics
[Diagram: A, B, X]
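A funnel over sessionized data can be sketched as follows. This is a minimal sketch under assumptions: the step names are hypothetical, each session is reduced to the list of event names it contains, and only the presence of the preceding steps is checked, not their timing.

```python
FUNNEL_STEPS = ["requested", "served", "shown", "interacted"]  # hypothetical step names


def funnel_depth(session_events, steps=FUNNEL_STEPS):
    """Return how many consecutive funnel steps are present, starting from the first."""
    seen = set(session_events)
    depth = 0
    for step in steps:
        if step not in seen:
            break
        depth += 1
    return depth


def funnel_counts(sessions, steps=FUNNEL_STEPS):
    """Count sessions that reached each step and all of the steps before it."""
    counts = [0] * len(steps)
    for session_events in sessions:
        for i in range(funnel_depth(session_events, steps)):
            counts[i] += 1
    return dict(zip(steps, counts))


sessions = [
    ["requested", "served", "shown", "interacted"],
    ["requested", "served", "shown"],
    ["requested", "served"],
    ["requested", "interacted"],  # "interacted" without "served"/"shown" is not counted
]
counts = funnel_counts(sessions)
```

The last session shows the difference between "X happened" and "A, B and X happened": it contains an interaction, but it does not count toward the "interacted" step of the funnel.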
Production Use Case
But that’s just
the beginning
Consider the following sample questions:
•  When do users engage with creatives?
•  Do different groups of users behave differently?
•  Why is a certain rate low or high?
•  What is the adoption of a recently rolled-out feature?
•  Does it correlate with engagement?
•  Which features are important when detecting the environment we run in?
In a traditional data warehouse, you’d figure out all the needed reports/questions in
advance, design the schema and that’s it.
Questions & Answers
Explanation
Insight
Question
Insight generation lifecycle
1. Ad-hoc queries for campaigns
•  Databricks allows us to easily run our increasingly complex ad-hoc queries
2. Exploratory data analysis
•  Derive value from data as soon as possible
3. Troubleshooting
•  The bar for reliability and correctness is very high
•  Why did something break? Why is a certain rate low/high?
•  Quickly identify the root cause of production failures and minimize system downtime
Use cases
4. Compliance jobs
•  Regularly scheduled jobs that make various checks for compliance purposes (e.g. non-human
traffic)
5. Supporting product decisions with data
•  Some of this is not “Big Data”, but visualizations, precomputing different views, etc.
•  Example: pricing model analysis
•  Connecting to Databricks with Tableau
6. Predictive analytics
•  Example: dynamically detect environment in which
our tags run (in-app, mobile web)
Use cases
•  Need flexibility, not provided by precomputed aggregates (uniques, order
statistics, etc.)
•  Answers to questions that the existing data model does not support (Demo)
•  Short development cycles and faster experimentation
•  Complex ecosystem + Wide creative capabilities = Diverse data
•  Data focused on the engagements of consumers with our clients’ ads
•  Constantly exploring new ways to leverage this information to improve our offering
•  Visualizations
•  Important aspect of big data mining
•  In 9 out of 10 notebooks, there is at least one visualization
Exploratory data analysis
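Why precomputed aggregates fall short for uniques and order statistics can be shown with a small worked example. The data below is made up for illustration; the point is only that daily unique counts and daily medians cannot be combined into the correct multi-day values without going back to the raw events.

```python
from collections import defaultdict
from statistics import median

# Hypothetical raw events: (day, user_id, session_duration_seconds)
raw_events = [
    ("mon", "u1", 5), ("mon", "u2", 7), ("mon", "u3", 9),
    ("tue", "u1", 1), ("tue", "u4", 2), ("tue", "u5", 60),
]

# What a precomputed daily aggregate table would hold
users_by_day = defaultdict(set)
durations_by_day = defaultdict(list)
for day, user, seconds in raw_events:
    users_by_day[day].add(user)
    durations_by_day[day].append(seconds)

daily_uniques = {day: len(users) for day, users in users_by_day.items()}
daily_medians = {day: median(vals) for day, vals in durations_by_day.items()}

# Combining the aggregates gives the wrong answer ...
summed_uniques = sum(daily_uniques.values())       # u1 is counted on both days
median_of_medians = median(daily_medians.values())

# ... while the raw events give the right one
true_uniques = len({user for _, user, _ in raw_events})
true_median = median(seconds for _, _, seconds in raw_events)
```

Here the summed daily uniques overcount the returning user, and the median of the daily medians differs from the median over all events, so both questions need access to the raw data.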
Complex ecosystem / environment
•  Support all major mobile
platforms (iOS, Android,
Windows Phone) and desktop
•  Work on mobile web & in-app
•  Device fragmentation,
browser specifics
•  Errors will happen
•  Error in log parsing
•  Wrong tags being trafficked
•  Custom (wrong) code being executed
•  Internal service call failures
•  When stuff doesn’t work, we need to be able to
dig into the data easily and quickly
•  To understand the problem, we need to check:
ELB logs, event logs, operational data, creative structure, machine logs
Troubleshooting
Resulting in
Bad user experience
Wrong insights (wrong metrics)
Data analysis evolutionary path
Solution progression
Bash → Logcat → Spark → Spark-Shell → Databricks
•  Downloading event logs from S3
on a single machine, “grepping”,
doing simple counts in bash. Using
R or Python for analysis.
•  But: Slow download (single core,
one machine)
•  Solution: wrote Logcat (Scala,
multi-threaded downloading from
S3, scales horizontally)
Version 1 (with Bash)
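The step from single-threaded downloads to multi-threaded ones can be illustrated with a short Python sketch. This is not Logcat itself (which was written in Scala); the fetch function, the key names and the in-memory "bucket" are stand-ins invented for the example, and a real version would wrap an S3 client call instead.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(keys, fetch, max_workers=8):
    """Fetch every key concurrently and return {key: content}.

    `fetch` is a caller-supplied function (hypothetical here) that downloads
    one object, e.g. a thin wrapper around an S3 client.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys
        return dict(zip(keys, pool.map(fetch, keys)))


def count_matching(contents, needle):
    """Count lines containing `needle` across all downloaded logs (like grep -c)."""
    return sum(
        needle in line
        for body in contents.values()
        for line in body.splitlines()
    )


# In-memory stand-in for a bucket of event logs
fake_bucket = {
    "logs/part-0": "served a\ninteraction a\n",
    "logs/part-1": "served b\n",
}
contents = download_all(list(fake_bucket), fake_bucket.__getitem__)
served = count_matching(contents, "served")
```

Threads help here because downloading is I/O-bound; the counting step remains a simple scan, which is why grouping and joining events still needed something more.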
•  Can download large amounts of
logs
•  But: no shuffle, i.e. no
grouping/joining events
•  Solution: use Spark
Version 2 (with Logcat)
•  Can download large amounts,
do shuffle and perform data
analysis
•  But: need to package code into a jar
and submit it to the cluster; not
really interactive
•  Solution: spark-shell to the
rescue
Version 3 (with Spark)
•  Can download large amounts,
do shuffle and perform data
analysis interactively
•  But: need to provision clusters,
cannot do any statistical
processing, cannot visualize
results without moving data
(again)
•  Solution: we had some ideas, but then
saw the Databricks demo at Spark
Summit 2014
Version 4 (with spark-shell)
•  Can download large amounts,
do shuffle and perform data
analysis interactively, visualize
results and perform statistical
processing
•  We can focus on drilling down
into the raw events, quickly
verify hypotheses and visualize
results, all of which would have
been very difficult, if not impossible,
a year ago.
Version 5 (Databricks)
•  Complex setup and configuration required
•  Requires effort and experience
•  Analyses not reproducible and repeatable
•  No collaboration
•  Moving data between different stages in troubleshooting/analysis lifecycle
•  For example: Scala for aggregations, R for statistics and visualization
•  Heterogeneity of the various components (Spark in production, something
else for exploratory data analysis)
•  Analytics team (3-5 people) was a bottleneck
Main pain points from the past
•  Reduced the load on the analytics engineering team by increasing
the number of people able to work with the data directly by
a factor of four.
•  Before: Engineers, Analytics team (5 people)
•  Today: Engineers, Analysts, Ad ops, Support team, QA, Product (35 people, 20 active)
•  Increased the amount of ad-hoc analysis done six-fold, leading to
better informed product design and quicker issue detection and
resolution.
•  Increased collaboration and improved reproducibility and
repeatability of analyses.
Scaling Big Data Analysis Projects Six-Fold
Demo
[Architecture diagram: Event data, Operational data, Access logs, Aggregated analytics data, SumoLogic, BambooHR, CircleCI, CloudWatch, DynamoDB, MongoDB, Snowflake; Databricks; Job/analysis results; Tableau (Business intelligence)]
Q&A
Appendix
•  Notebook sharing
•  public/ and private/ folders within each user’s home directory
•  Database default, tables called u_<username>_*
•  Problems with per-user databases (only works with Hive tables)
•  Always prefer Spot instances for ad-hoc clusters
•  Name the cluster <username>, <username>2, <username>3, etc.
•  Prefer DBFS mounts over s3a:// URLs to get the benefits of Tachyon
Tips for using Databricks
•  If you want to preserve state, save it to a table (e.g. by CREATE TABLE
AS SELECT ... FROM ...) or to Parquet files on S3
•  Another option is to downscale the cluster to 1 node (all variables,
temporary tables etc. will be retained).
•  Prefer using Jobs for longer/larger tasks
•  Avoid high-cardinality joins in Tableau; instead, join the data in
Spark/Databricks
Tips for using Databricks

More Related Content

What's hot

Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODSAgile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODSKent Graziano
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesDrew Hansen
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Databricks
 
AWS User Group: Building Cloud Analytics Solution with AWS
AWS User Group: Building Cloud Analytics Solution with AWSAWS User Group: Building Cloud Analytics Solution with AWS
AWS User Group: Building Cloud Analytics Solution with AWSDmitry Anoshin
 
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...Kent Graziano
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analyticsIke Ellis
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsEduardo Castro
 
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWSAWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWSAmazon Web Services
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Kent Graziano
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dwelephantscale
 
New! Real-Time Data Replication to Snowflake
New! Real-Time Data Replication to SnowflakeNew! Real-Time Data Replication to Snowflake
New! Real-Time Data Replication to SnowflakePrecisely
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfRob Winters
 
Does it only have to be ML + AI?
Does it only have to be ML + AI?Does it only have to be ML + AI?
Does it only have to be ML + AI?Harald Erb
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsThomas Sykes
 
Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Kent Graziano
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data FactoryBizTalk360
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation Brett VanderPlaats
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 

What's hot (20)

Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODSAgile Data Warehousing: Using SDDM to Build a Virtualized ODS
Agile Data Warehousing: Using SDDM to Build a Virtualized ODS
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD Pipelines
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
 
AWS User Group: Building Cloud Analytics Solution with AWS
AWS User Group: Building Cloud Analytics Solution with AWSAWS User Group: Building Cloud Analytics Solution with AWS
AWS User Group: Building Cloud Analytics Solution with AWS
 
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...HOW TO SAVE  PILEs of $$$BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
HOW TO SAVE PILEs of $$$ BY CREATING THE BEST DATA MODEL THE FIRST TIME (Ksc...
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
 
Data warehouse con azure synapse analytics
Data warehouse con azure synapse analyticsData warehouse con azure synapse analytics
Data warehouse con azure synapse analytics
 
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWSAWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
New! Real-Time Data Replication to Snowflake
New! Real-Time Data Replication to SnowflakeNew! Real-Time Data Replication to Snowflake
New! Real-Time Data Replication to Snowflake
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
 
Does it only have to be ML + AI?
Does it only have to be ML + AI?Does it only have to be ML + AI?
Does it only have to be ML + AI?
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
 
Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)Agile Methods and Data Warehousing (2016 update)
Agile Methods and Data Warehousing (2016 update)
 
A lap around Azure Data Factory
A lap around Azure Data FactoryA lap around Azure Data Factory
A lap around Azure Data Factory
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 

Similar to How Celtra Optimizes its Advertising Platform with Databricks

Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsLooker
 
Bridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudBridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudInside Analysis
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 
Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with MicrosoftCaserta
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesRob Winters
 
Tableau Drive, A new methodology for scaling your analytic culture
Tableau Drive, A new methodology for scaling your analytic cultureTableau Drive, A new methodology for scaling your analytic culture
Tableau Drive, A new methodology for scaling your analytic cultureTableau Software
 
SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?Nicolas Georgeault
 
Before vs After: Redesigning a Website to be Useful and Informative for Devel...
Before vs After: Redesigning a Website to be Useful and Informative for Devel...Before vs After: Redesigning a Website to be Useful and Informative for Devel...
Before vs After: Redesigning a Website to be Useful and Informative for Devel...Teresa Giacomini
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsDenodo
 
Top Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwareTop Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwarePanorama Software
 
Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution  Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution Sirinporn Setworaya
 
Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31
Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31
Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31TechSoup
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise AnalyticsDATAVERSITY
 
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATION
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATIONLogitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATION
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATIONAvinash Deshpande
 
Jan 2017 Investment Recommendation for Tableau
Jan 2017 Investment Recommendation for TableauJan 2017 Investment Recommendation for Tableau
Jan 2017 Investment Recommendation for Tableaupaulchenuva
 
OC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBMOC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBMBig Data Joe™ Rossi
 
SD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBMSD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBMBig Data Joe™ Rossi
 
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...Memoori
 

Similar to How Celtra Optimizes its Advertising Platform with Databricks (20)

Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
 
Bridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the CloudBridging the Gap: Analyzing Data in and Below the Cloud
Bridging the Gap: Analyzing Data in and Below the Cloud
 
Agile data science
Agile data scienceAgile data science
Agile data science
 
Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with Microsoft
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
Big Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil GamesBig Data at a Gaming Company: Spil Games
Big Data at a Gaming Company: Spil Games
 
Tableau Drive, A new methodology for scaling your analytic culture
Tableau Drive, A new methodology for scaling your analytic cultureTableau Drive, A new methodology for scaling your analytic culture
Tableau Drive, A new methodology for scaling your analytic culture
 
SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?
 
Before vs After: Redesigning a Website to be Useful and Informative for Devel...
Before vs After: Redesigning a Website to be Useful and Informative for Devel...Before vs After: Redesigning a Website to be Useful and Informative for Devel...
Before vs After: Redesigning a Website to be Useful and Informative for Devel...
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
Top Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwareTop Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama Software
 
Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution  Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution
 
Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31
Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31
Webinar: 5 Clear Steps to Get Your Nonprofit Cloud Ready - 2018-5-31
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics
 
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATION
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATIONLogitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATION
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATION
 
Jan 2017 Investment Recommendation for Tableau
Jan 2017 Investment Recommendation for TableauJan 2017 Investment Recommendation for Tableau
Jan 2017 Investment Recommendation for Tableau
 
resume4
resume4resume4
resume4
 
OC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBMOC Big Data Monthly Meetup #6 - Session 1 - IBM
OC Big Data Monthly Meetup #6 - Session 1 - IBM
 
SD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBMSD Big Data Monthly Meetup #4 - Session 1 - IBM
SD Big Data Monthly Meetup #4 - Session 1 - IBM
 
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed...
 

Recently uploaded

Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
How Celtra Optimizes its Advertising Platform with Databricks

  • 2. How Celtra Optimizes its Advertising Platform with Databricks. Grega Kešpret, Director of Engineering, Analytics – Celtra; Denny Lee, Technology Evangelist – Databricks. December 9th, 2015
  • 3. About Me: Grega Kešpret. Grega Kešpret is the Director of Engineering for Analytics. He has worked at Celtra since 2012, where he helped build the analytics pipeline and optimization systems. Grega also leads the team of engineers and data scientists in San Francisco and Ljubljana, working on the analytics platform. Prior to Celtra, Grega worked at IBM, helping enterprise customers adopt WebSphere Application Server, and before that did an 8-month internship at SANYO (Panasonic) in Japan, working on battery systems. His current technical interests include databases, distributed systems, functional programming and machine learning.
  • 4. About Me: Denny Lee. Denny is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. His key focus is solving complex large-scale data problems, providing not only architectural direction but also the hands-on implementation of these systems. He has extensive experience in building greenfield teams as well as acting as a turnaround / change catalyst. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
  • 6. Agenda
      • About Celtra & AdCreator
      • Event Data
      • AdTech Problems (Sessionization, Funnel Analytics)
      • Databricks Use Cases
      • Data Analysis Evolutionary Path
      • Main pain points from the past
      • Demo
      • Q&A
  • 7. Motivation: The analytics platform at Celtra has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. (Diagram labels: Business, Data Volume, Data Complexity)
  • 8. About Celtra & AdCreator
      • Leading creative technology company for data-driven brand display and video advertising across mobile, tablets and desktop
      • Powering the creative side of advertising campaigns
      • SaaS platform that allows clients to easily create, manage, and traffic data-driven dynamic ads, optimize them on the go, and track their performance with insightful analytics
  • 9. Trusted by +5000 brands and agencies.
  • 10. Unparalleled Distribution Options
      • Reach across an endless list of large publishers and media owners, DSPs and ad networks
      • Certified and partnered with the top 25 global ad networks and over 50 premium publishers
      • AdCreator is the most widely used creative, analytics and optimization technology for display and video advertising
  • 11. Event data
      • Bread and butter of our analytics data
      • Facts about what happened
          • creative X was served on placement Y at time Z
          • user A interacted with creative X at time Q
      • Automatically tracked
      • Behavioral data from users
      • ~2 billion events per day
      • 1 TB new data per day (uncompressed)
      • Very sparse
      • Complex relationships between events
  • 12. Sessionization
      • Combining discrete events into sessions
      • In the first version, we used to compute simple counts of events
      • But, no “holistic” view over the whole session:
          • Very hard to troubleshoot/debug
          • Cannot check for/enforce causality (if X happened, Y must also have happened)
          • Duplication leading to skewed rates (e.g. one user interacting)
      • Later events make you understand earlier arriving events better (e.g. session duration, attribution, identity, etc.)
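
The sessionization points above can be illustrated with a minimal sketch in plain Python. This is not Celtra's Spark pipeline: the event fields `session_id`, `name` and `ts`, the `REQUIRES` rule table and the `sessionize` helper are all assumptions made up for the example. It groups events by session, drops duplicates, derives the session duration from the first and last events, and flags causality violations.

```python
from collections import defaultdict

# Hypothetical causality rules: if the key event happened,
# the value event must also have happened in the same session.
REQUIRES = {"interaction": "served"}


def sessionize(events):
    """Group discrete events into per-session summaries."""
    by_session = defaultdict(list)
    for event in events:
        by_session[event["session_id"]].append(event)

    sessions = {}
    for session_id, group in by_session.items():
        group.sort(key=lambda e: e["ts"])

        # Drop exact duplicates (same name and timestamp) so rates are not skewed.
        seen = set()
        unique = []
        for e in group:
            key = (e["name"], e["ts"])
            if key not in seen:
                seen.add(key)
                unique.append(e)

        names = {e["name"] for e in unique}
        violations = sorted(
            effect
            for effect, cause in REQUIRES.items()
            if effect in names and cause not in names
        )
        sessions[session_id] = {
            "events": [e["name"] for e in unique],
            # Only known once the last event of the session has arrived.
            "duration": unique[-1]["ts"] - unique[0]["ts"],
            "causality_violations": violations,
        }
    return sessions


events = [
    {"session_id": "s1", "name": "served", "ts": 100},
    {"session_id": "s1", "name": "interaction", "ts": 130},
    {"session_id": "s1", "name": "interaction", "ts": 130},  # duplicate
    {"session_id": "s2", "name": "interaction", "ts": 200},  # no "served" event
]
sessions = sessionize(events)
```

Simple per-event counts would report three interactions here; the session view reports one interaction in `s1` and marks `s2` as violating the "interaction requires served" rule.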
  • 13. Why Spark for Event Analytics
      • One of the first companies to use Spark in production (v0.5)
      • Nice API
      • Expressive computation layer
      • Speed of innovation
      • Seamless integration with S3
      • (Diagram labels: Trackers, Event data + Operational data, Sessionization, Aggregated Analytics data)
  • 15. Funnel Analytics
      • Multi-step process
      • Enabled by sessionization
      • Originally developed in the context of e-commerce sites
      • Ad was requested, then served, then shown, then interacted with, then the user expanded the ad, then watched a video, …
      • Not just whether X happened; but whether A, B and X happened
      • (Diagram labels: A, B, X)
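
As a rough illustration of "not just whether X happened; but whether A, B and X happened", the sketch below counts how many sessions reach each funnel step in order. It is plain Python under stated assumptions: the step names in `FUNNEL` are taken loosely from the slide, and `funnel_depth` and `funnel_counts` are hypothetical helpers, not part of any Celtra or Spark API.

```python
# Assumed funnel steps, in the order they must occur.
FUNNEL = ["requested", "served", "shown", "interacted"]


def funnel_depth(event_names, steps=FUNNEL):
    """Return how many funnel steps a session completed, in order."""
    depth = 0
    for name in event_names:
        if depth < len(steps) and name == steps[depth]:
            depth += 1
    return depth


def funnel_counts(sessions, steps=FUNNEL):
    """Count the sessions that reached each funnel step."""
    counts = [0] * len(steps)
    for event_names in sessions:
        for i in range(funnel_depth(event_names, steps)):
            counts[i] += 1
    return dict(zip(steps, counts))


sessions = [
    ["requested", "served", "shown", "interacted"],
    ["requested", "served"],
    ["requested", "interacted"],  # interaction without served/shown is not counted
]
counts = funnel_counts(sessions)
```

A step only counts when every earlier step has already occurred in the same session, which is why the computation depends on sessionized data rather than raw event counts.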
  • 18. Questions & Answers
      • Consider the following sample questions:
          • When do users engage with creatives?
          • Do different groups of users behave differently?
          • Why is a certain rate low or high?
          • What is the adoption of a recently rolled-out feature?
          • Does it correlate with engagement?
          • What features are important when detecting the environment we run in?
      • In a traditional data warehouse, you’d figure out all the needed reports/questions in advance, design the schema and that’s it.
  • 20. Use cases
      1. Ad-hoc queries for campaigns
          • Databricks allows us to easily run our increasingly complex ad-hoc queries
      2. Exploratory data analysis
          • Derive value from data as soon as possible
      3. Troubleshooting
          • The bar for reliability and correctness is very high
          • Why did something break? Why is a certain rate low/high?
          • Quickly identify the root cause of production failures and minimize system downtime
  • 21. Use cases
      4. Compliance jobs
          • Regularly scheduled jobs that make various checks for compliance purposes (e.g. non-human traffic)
      5. Supporting product decisions with data
          • Some of this is not “Big Data”, but visualizations, precomputing different views, etc.
          • Example: pricing model analysis
          • Connecting to Databricks with Tableau
      6. Predictive analytics
          • Example: dynamically detect environment in which our tags run (in-app, mobile web)
  • 22. Exploratory data analysis
      • Need flexibility, not provided by precomputed aggregates (uniques, order statistics, etc.)
      • Answers to questions that existing data model does not support (Demo)
      • Short development cycles and faster experimentation
      • Complex ecosystem + Wide creative capabilities = Diverse data
      • Data focused on the engagements of consumers with our clients’ ads
      • Constantly exploring new ways to leverage this information to improve our offering
      • Visualizations
          • Important aspect of big data mining
          • In 9 out of 10 notebooks, there is at least one visualization
  • 23. Complex ecosystem / environment
      • Support all major mobile platforms (iOS, Android, Windows Phone) and desktop
      • Work on mobile web & in-app
      • Device fragmentation, browser specifics
  • 24. Troubleshooting
      • Errors will happen:
          • Error in log parsing
          • Wrong tags being trafficked
          • Custom (wrong) code being executed
          • Internal service call failures
      • Resulting in:
          • Bad user experience
          • Wrong insights (wrong metrics)
      • When stuff doesn’t work, we need to be able to dig into the data easily and quickly
      • To understand the problem, we need to check: ELB logs, event logs, operational data, creative structure, machine logs
  • 25. Data analysis evolutionary path: Version 1 (with Bash)
      • Solution progression: Bash → Logcat → Spark → Spark-Shell → Databricks
      • Downloading event logs from S3 on a single machine, “grepping”, doing simple counts in bash. Using R or Python for analysis.
      • But: slow download (single core, one machine)
      • Solution: wrote Logcat (Scala, multi-threaded downloading from S3, scale horizontally)
  • 26. Data analysis evolutionary path: Version 2 (with Logcat)
      • Can download large amounts of logs
      • But: no shuffle, i.e. no grouping/joining events
      • Solution: use Spark
  • 27. Data analysis evolutionary path: Version 3 (with Spark)
      • Can download large amounts, do shuffle and perform data analysis
      • But: package code into jar, submit jar to the cluster, not really interactive
      • Solution: spark-shell to the rescue
  • 28. Data analysis evolutionary path: Version 4 (with spark-shell)
      • Can download large amounts, do shuffle and perform data analysis interactively
      • But: need to provision clusters, cannot do any statistical processing, cannot visualize results without moving data (again)
      • Solution: some ideas, but then saw Databricks demo at Spark Summit 2014
  • 29. Data analysis evolutionary path: Version 5 (Databricks)
      • Can download large amounts, do shuffle and perform data analysis interactively, visualize results and perform statistical processing
      • We can focus on drilling down into the raw events, quickly verify hypotheses and visualize results, all of which would have been very difficult if not impossible one year ago.
  • 30. Main pain points from the past
      • Complex setup and configuration required
          • Requires effort and experience
      • Analyses not reproducible and repeatable
      • No collaboration
      • Moving data between different stages in troubleshooting/analysis lifecycle
          • For example: Scala for aggregations, R for statistics and visualization
      • Heterogeneity of the various components (Spark in production, something else for exploratory data analysis)
      • Analytics team (3-5 people) bottleneck
  • 31. Scaling Big Data Analysis Projects Six-Fold
      • Reduced the load on the analytics engineering team by increasing the number of people able to work with the data directly by a factor of four.
          • Before: Engineers, Analytics team (5 people)
          • Today: Engineers, Analysts, Ad ops, Support team, QA, Product (35 people, 20 active)
      • Increased the amount of ad-hoc analysis done six-fold, leading to better informed product design and quicker issue detection and resolution.
      • Increased collaboration and improved reproducibility and repeatability of analyses.
  • 32. Demo (Diagram labels: Event data, Operational data, Access logs, Aggregated analytics data, Job/analysis results, Business intelligence; tools shown: Databricks, Tableau, SumoLogic, BambooHR, CircleCI, CloudWatch, DynamoDB, MongoDB, Snowflake)
  • 33. Q&A
  • 35. Tips for using Databricks
      • Notebook sharing
          • public/ and private/ folders within each user’s home directory
      • Database default, tables called u_<username>_*
          • Problems with per-user databases (only works with Hive tables)
      • Always prefer Spot instances for ad-hoc clusters
      • Name the cluster <username>, <username>2, <username>3, etc.
      • Prefer DBFS mounts over s3a:// URLs to get the benefits of Tachyon
  • 36. Tips for using Databricks
      • If you want to preserve state, save it to a table (by e.g. CREATE TABLE AS SELECT ... FROM ...) or to Parquet files on S3
      • Another option is to downscale the cluster to 1 node (all variables, temporary tables etc. will be retained).
      • Prefer using Jobs for longer/larger tasks
      • Avoid high cardinality joins in Tableau, instead join the data in Spark/Databricks
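
The "save it to a table" tip can be sketched as follows. To keep the example runnable anywhere, Python's built-in `sqlite3` module stands in for Spark SQL on Databricks; only the shape of the `CREATE TABLE ... AS SELECT` statement is meant to carry over. The `events` table, its columns and the username `alice` are made up for illustration, with the result table named after the `u_<username>_*` convention from the tips.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("s1", "served"), ("s1", "interaction"), ("s2", "served")],
)

# Save an intermediate result as a named table instead of keeping it
# only in notebook variables (the username "alice" is hypothetical).
conn.execute(
    """
    CREATE TABLE u_alice_events_per_session AS
    SELECT session_id, COUNT(*) AS n_events
    FROM events
    GROUP BY session_id
    """
)

rows = conn.execute(
    "SELECT session_id, n_events FROM u_alice_events_per_session "
    "ORDER BY session_id"
).fetchall()
```

The in-memory database used here disappears when the connection closes; on a real cluster the point of the tip is that a saved table or Parquet files on S3 outlive the cluster, whereas in-memory variables do not.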