ABOUT ME
➤ EE/CS/CB
➤ BS EE, Robotics, Shanghai Univ
➤ MS Computational Biology,
School of Computer Science, CMU
ABOUT ME
Tech~
➤ Large Genome Dataset ~ Open Big
data world
➤ Interested in data and open source
➤ Familiar with Hadoop ecosystem
and various open source data tools
➤ Insight Data Engineer Fellow
ABOUT ME
Non-tech~
➤ Enjoy most indoor, non-physical activities ~
➤ Like drawing small picture stories to record the fun things that happen around me
➤ Like dressing little yellow and
observing his love life
Search Ads Playground
Online Search Ads Platform
&
Real Time Campaign Monitoring
Grace Lu
MOTIVATION
➤ Every website or application with a search engine can run a search advertising business
url:searchad.fun or http://107.23.229.58/
SEARCH ADVERTISEMENT RANKING
“Don’t be evil.” —— Google
How are these ads selected?
[Diagram: Relevancy and P-Click make up Quality; Quality and Bid Price rank each Ad in the Search Result]
ONLINE SEARCH AD PLATFORM
➤ Quality Score = Relevancy * Click_Prediction (p-click)
➤ Final Ranking Score = Quality Score * Bid Price
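The two formulas above can be sketched in a few lines of Python. This is only an illustration: the `AdCandidate` class, the field names, and the sample numbers are assumptions, not part of the original platform.

```python
from dataclasses import dataclass


@dataclass
class AdCandidate:
    ad_id: int
    relevancy: float   # query/ad match score (assumed to lie in [0, 1])
    p_click: float     # predicted click probability
    bid_price: float   # advertiser bid


def quality_score(ad):
    # Quality Score = Relevancy * Click_Prediction (p-click)
    return ad.relevancy * ad.p_click


def ranking_score(ad):
    # Final Ranking Score = Quality Score * Bid Price
    return quality_score(ad) * ad.bid_price


def top_ads(candidates, k):
    # Highest final ranking score first, keep the first k.
    return sorted(candidates, key=ranking_score, reverse=True)[:k]


# Hypothetical candidates for one query.
candidates = [
    AdCandidate(1, relevancy=0.9, p_click=0.10, bid_price=2.0),
    AdCandidate(2, relevancy=0.5, p_click=0.20, bid_price=4.0),
    AdCandidate(3, relevancy=0.8, p_click=0.05, bid_price=1.0),
]
ranked = top_ads(candidates, k=2)
```

Ad 2 wins here despite lower relevancy, because its higher p-click and bid outweigh it.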
[Diagram: product data and user click log -> batch ETL -> Ad Engine; search query “black chair” -> Ad Engine -> top ads]
ETL
➤ Clean raw product data - store in MySQL
➤ Train word2vec - generate synonyms - store in Memcached
➤ Build inverted index - store in Memcached
➤ Process log file to extract features
➤ Process log file to generate regression model
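A minimal sketch of the inverted-index step, assuming each ad carries a list of keywords. A plain dict stands in for Memcached, and the ad ids, keywords, and the `synonyms` table are hypothetical; the real synonyms would come from the trained word2vec model.

```python
from collections import defaultdict


def build_inverted_index(ads):
    """Map each keyword to the sorted list of ad ids that contain it."""
    index = defaultdict(set)
    for ad_id, keywords in ads.items():
        for word in keywords:
            index[word.lower()].add(ad_id)
    return {word: sorted(ids) for word, ids in index.items()}


def candidate_ads(index, query, synonyms=None):
    """Look up every query term (and its synonyms) and union the matches."""
    synonyms = synonyms or {}
    matched = set()
    for term in query.lower().split():
        for word in [term] + synonyms.get(term, []):
            matched.update(index.get(word, []))
    return sorted(matched)


# Hypothetical cleaned product data: ad id -> keywords.
ads = {
    101: ["black", "office", "chair"],
    102: ["wooden", "stool"],
    103: ["black", "desk"],
}
index = build_inverted_index(ads)
synonyms = {"chair": ["stool"]}  # stand-in for word2vec output
```

Calling `candidate_ads(index, "black chair", synonyms)` returns every ad that matches any query term or synonym; ranking then decides which of them are shown.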
CTR FEATURES
➤ pClick = probability of a click, i.e. the estimated click-through rate (CTR)
➤ pClick features are extracted from the search log and stored in a key-value store
log: Device IP, Device id,Session id,Query,AdId,CampaignId,Ad_category_Query_category(0/1),clicked(0/1)
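As a rough sketch of the feature extraction, the snippet below parses log lines in the field order shown above and keeps impression and click counts per key. The comma-separated layout without embedded commas, the key naming (`ad:`, `campaign:`, `query_ad:`), and the sample lines are assumptions; the slides do not list the actual features.

```python
from collections import defaultdict

FIELDS = ["device_ip", "device_id", "session_id", "query",
          "ad_id", "campaign_id", "category_match", "clicked"]


def parse_log_line(line):
    """Split one comma-separated log line into a dict of named fields."""
    record = dict(zip(FIELDS, (v.strip() for v in line.split(","))))
    record["category_match"] = int(record["category_match"])
    record["clicked"] = int(record["clicked"])
    return record


def count_features(lines):
    """Return {feature_key: [impressions, clicks]} for a few example keys."""
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        r = parse_log_line(line)
        keys = ("ad:" + r["ad_id"],
                "campaign:" + r["campaign_id"],
                "query_ad:" + r["query"] + "|" + r["ad_id"])
        for key in keys:
            counts[key][0] += 1
            counts[key][1] += r["clicked"]
    return dict(counts)


def ctr(counts, key):
    """Historical click-through rate for one key, 0.0 if never seen."""
    impressions, clicks = counts.get(key, (0, 0))
    return clicks / impressions if impressions else 0.0


# Hypothetical log lines.
log = [
    "10.0.0.1,d1,s1,black chair,101,7,1,1",
    "10.0.0.2,d2,s2,black chair,101,7,1,0",
    "10.0.0.3,d3,s3,desk lamp,103,8,0,0",
]
features = count_features(log)
```

The resulting counts are small enough per key to sit in a key-value store and be read at query time.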
REALTIME MONITORING
➤ Let advertiser know …
➤ Can be extended to monitor activities in other systems
REALTIME MONITORING
➤ Log events
Ad Impression Events
Ad Click Events
Chance Events
AD Server
ANALYTICS DASHBOARD
➤ Query by one product category
➤ Query by one campaign (an ad set follows similar strategy)
Top Keywords Cloud, Top Winning Brands, Average Winning Bid
Top Impression Ads, Top Click Ads, Top CTR Ads
Current Budget, Current Spend, Total Impressions
Total Clicks, Cost per Click, Clicks/Impressions
FURTHER CHALLENGE - ONLINE SYSTEM
Scaling up
➤ Sharding by campaignId
➤ Select the top k in a distributed way
➤ More replicas
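One way to picture the sharded top-k selection: each shard returns its own top k, and a coordinator merges those short lists. The modulo sharding rule and the sample scores are assumptions made for this sketch.

```python
import heapq


def shard_for(campaign_id, num_shards):
    # Assumed sharding rule: campaign id modulo the number of shards.
    return campaign_id % num_shards


def local_top_k(scored_ads, k):
    """Top k (score, ad_id) pairs on a single shard."""
    return heapq.nlargest(k, scored_ads)


def global_top_k(shard_results, k):
    """Merge per-shard winners; the global top k is always among them."""
    merged = [item for result in shard_results for item in result]
    return heapq.nlargest(k, merged)


# Hypothetical scored ads: (score, ad_id, campaign_id).
ads = [(0.40, 2, 10), (0.18, 1, 11), (0.04, 3, 10),
       (0.33, 4, 13), (0.25, 5, 12)]
num_shards = 2
shards = [[] for _ in range(num_shards)]
for score, ad_id, campaign_id in ads:
    shards[shard_for(campaign_id, num_shards)].append((score, ad_id))

partial = [local_top_k(shard, 2) for shard in shards]
best = global_top_k(partial, 2)
```

Each shard sends at most k items over the network, so the merge cost depends on k and the shard count, not on the total number of ads.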
A/B Testing
➤ Test different algorithms, give 10% load to server with
alternative algorithm
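A small sketch of the 10% split, assuming requests are bucketed by a stable hash of a request key such as the session id. The key format and the variant names are hypothetical.

```python
import zlib


def assign_variant(request_key, treatment_percent=10):
    """Deterministically send about treatment_percent% of keys to the alternative."""
    bucket = zlib.crc32(request_key.encode("utf-8")) % 100
    return "alternative" if bucket < treatment_percent else "control"


# Hypothetical session ids.
assignments = [assign_variant("session-%d" % i) for i in range(10000)]
share = assignments.count("alternative") / len(assignments)
```

Hashing the key instead of drawing a random number keeps one session on the same algorithm for its whole lifetime.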
FURTHER CHALLENGE - REAL TIME MONITOR
➤ Log raw records to S3
➤ Other charging options (CPM) - optimize counting
FURTHER CHALLENGE
➤ Feed the real time click stream to update the prediction model
➤ Fast reaction to some special event
➤ More click, more income (advertiser charged by cost per click)
THANK YOU!
TECHNICAL INSIGHTS
➤ Increase search speed: Inverted index and features in cache
➤ Feature space design
➤ Join event streams
➤ Design Cassandra schema for query efficiency
CampaignId(Partition), TimeStamp(Cluster), Chance, Impression, Click, AvgPrice
WHY CASSANDRA
➤ High write performance
➤ Easy scaling
➤ Flexible columnar schema
➤ Schema for query efficiency
CampaignId(Partition), TimeStamp(Cluster), Chance, Impression, Click, AvgPrice
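To show why this key layout suits the dashboard queries, here is an in-memory stand-in (not Cassandra itself): rows are grouped by campaign, kept sorted by timestamp inside each group, and a time-range read touches only one group. The class and method names are made up for the sketch.

```python
import bisect
from collections import defaultdict


class CampaignStats:
    """In-memory stand-in for a table keyed by (campaign_id, timestamp)."""

    def __init__(self):
        # campaign_id (partition key) -> timestamps kept sorted (clustering order)
        self._partitions = defaultdict(lambda: {"ts": [], "rows": {}})

    def write(self, campaign_id, ts, chance, impression, click, avg_price):
        part = self._partitions[campaign_id]
        if ts not in part["rows"]:
            bisect.insort(part["ts"], ts)
        # Writing the same (campaign_id, ts) again overwrites the row.
        part["rows"][ts] = (chance, impression, click, avg_price)

    def read_range(self, campaign_id, start_ts, end_ts):
        """Rows of one campaign with start_ts <= ts <= end_ts, oldest first."""
        part = self._partitions.get(campaign_id)
        if part is None:
            return []
        lo = bisect.bisect_left(part["ts"], start_ts)
        hi = bisect.bisect_right(part["ts"], end_ts)
        return [(ts, part["rows"][ts]) for ts in part["ts"][lo:hi]]


# Hypothetical per-window counters.
store = CampaignStats()
store.write(7, 110, chance=5, impression=3, click=1, avg_price=0.40)
store.write(7, 100, chance=4, impression=2, click=0, avg_price=0.35)
store.write(8, 100, chance=9, impression=9, click=2, avg_price=0.10)
store.write(7, 120, chance=6, impression=4, click=2, avg_price=0.50)
recent = store.read_range(7, 100, 110)
```

A "last N minutes for campaign X" query becomes a single contiguous slice, which is the access pattern the dashboard needs.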
WHY MEMCACHED
➤ Fast key value look up for online system
➤ Store - inverted index, rewritten queries, CTR features
WHY SQL
➤ Clearly defined, discrete items with exact specifications
➤ Easy to do complex relational query
➤ Transactional workload
WHY SPARK STREAMING
➤ Micro batch, pre-defined time window
➤ Process events every 10 seconds
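The micro-batch idea can be imitated in plain Python (this is not the Spark Streaming API): every event falls into the 10-second window that contains its timestamp, and counts are kept per campaign and window. The event tuple layout and the sample events are assumptions.

```python
from collections import defaultdict

BATCH_SECONDS = 10


def batch_start(ts):
    """Start of the 10-second window that contains ts."""
    return ts - ts % BATCH_SECONDS


def aggregate(events):
    """events: iterable of (timestamp, campaign_id, event_type)."""
    counts = defaultdict(lambda: {"chance": 0, "impression": 0, "click": 0})
    for ts, campaign_id, event_type in events:
        counts[(campaign_id, batch_start(ts))][event_type] += 1
    return dict(counts)


# Hypothetical events from the ad server.
events = [
    (100, 7, "chance"), (101, 7, "impression"), (104, 7, "click"),
    (109, 7, "chance"), (110, 7, "chance"), (112, 8, "impression"),
]
stats = aggregate(events)
```

Each `(campaign_id, window_start)` entry corresponds to one row that could be written to the monitoring store.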
MODELS
➤ Word2Vec: Query Rewrite
➤ Logistic Regression: Predict pClick
➤ Gradient Boosting Tree: Predict pClick
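For the logistic regression case, prediction reduces to a weighted sum passed through a sigmoid. The feature names, weights, and bias below are invented for illustration; real values would come from training on the click log.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_p_click(features, weights, bias):
    """pClick = sigmoid(bias + sum of weight * feature value)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Hypothetical trained parameters and one feature vector.
weights = {"ad_ctr": 3.0, "query_ad_ctr": 2.0, "category_match": 0.5}
bias = -3.0
features = {"ad_ctr": 0.5, "query_ad_ctr": 0.5, "category_match": 1.0}
p = predict_p_click(features, weights, bias)
```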
ONLINE SEARCH AD PLATFORM
[Diagram: pipeline from Search Query to allocated ads, with stages Query Understand, Filter Ads, Select Ads, Rank Ads, Top K Ads, Pricing, Allocate Ads; ranking inputs: Relevancy, P-click, Quality, bid]
Craft Demo
Grace Lu
OVERVIEW
➤ Problem
➤ Transform -> Join -> Reduce By State and Time
Customer information
<customer_id, name, street, city,state,zip>
123#AAA Inc#1 First Ave#Mountain View#CA#94040
Sales information
<timestamp,customer_id,sales_price>
1454313600#123#123456
State Sales
<State, Time, Total Sales>
State - hourly sales
State - daily sales
State - monthly sales
State - yearly sales
BASIC IDEA - TRANSFORM TO PAIR RDD
➤ Map Transformation - customer_rdd
➤ Map Transformation - sales_rdd
Customer information
<customer_id, name, street, city,state,zip>
123#AAA Inc#1 First Ave#Mountain View#CA#94040
(customer_id, state)
Sales information
<timestamp,customer_id,sales_price>
1454313600#123#123456
(customer_id, (year, month, date, hour, price))
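The two map functions might look like the sketch below. It assumes every customer line has six `#`-separated fields, that timestamps are Unix seconds interpreted in UTC, and that prices are integers; none of these is stated on the slides, and the function names are made up.

```python
from datetime import datetime, timezone


def parse_customer(line):
    """customer_id#name#street#city#state#zip -> (customer_id, state)"""
    customer_id, name, street, city, state, zip_code = line.split("#")
    return (customer_id, state)


def parse_sale(line):
    """timestamp#customer_id#sales_price -> (customer_id, (year, month, date, hour, price))"""
    timestamp, customer_id, sales_price = line.split("#")
    t = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)  # UTC is an assumption
    return (customer_id, (t.year, t.month, t.day, t.hour, int(sales_price)))


customer_pair = parse_customer("123#AAA Inc#1 First Ave#Mountain View#CA#94040")
sale_pair = parse_sale("1454313600#123#123456")
```

With RDDs these would be applied as `customer_rdd.map(parse_customer)` and `sales_rdd.map(parse_sale)`.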
BASIC IDEA - JOIN TWO RDD
➤ Join customer and sales on customer id
(customer_id, state) join (customer_id, (year, month, date, hour, price))
-> (customer_id, (year, month, date, hour, price), state)
BASIC IDEA - REDUCE BY STATE AND TIME
➤ Extract
(customer_id, (year, month, date, hour, price), state)
Hourly: (state#year#month#date#hour, price)
Daily: (state#year#month#date#, price)
Monthly: (state#year#month##, price)
Yearly: (state#year###, price)
BASIC IDEA - REDUCE BY STATE AND HOUR
➤ Reduce by state and hourly time to sum the sales
""" key: AL#2017#08#01#09 """
hourly = joined_rdd.map(hourly_state_records).reduceByKey(lambda v1, v2: v1 + v2)
""" hourly_state_records maps each pair:
(customer_id, (year, month, date, hour, price), state) -> (state#year#month#date#hour, price) """
BASIC IDEA - REDUCE BY STATE AND DAY/MONTH/YEAR
➤ Further aggregation - daily, monthly, yearly
""" kv is a (key, value) pair; key: AL#2017#08#01# """
daily = hourly.map(lambda kv: ('#'.join(kv[0].split('#')[:4]) + '#', kv[1])) \
    .reduceByKey(lambda v1, v2: v1 + v2)
""" key: AL#2017#08## """
monthly = daily.map(lambda kv: ('#'.join(kv[0].split('#')[:3]) + '##', kv[1])) \
    .reduceByKey(lambda v1, v2: v1 + v2)
""" key: AL#2017### """
yearly = monthly.map(lambda kv: ('#'.join(kv[0].split('#')[:2]) + '###', kv[1])) \
    .reduceByKey(lambda v1, v2: v1 + v2)
OUTPUT ACTION - SAVEASTEXTFILE
➤ Union -> SaveAsTextFile
""" construct output file """
result = hourly.union(daily).union(monthly).union(yearly) \
    .map(lambda kv: kv[0] + '#' + str(kv[1]))
""" output format: state#year#month#day#hour#sales """
result.saveAsTextFile(output_file)
FINAL LINEAGE
➤ RDD dependency
customer, sales -> joined -> hourly -> daily -> monthly -> yearly
hourly, daily, monthly, yearly -> union
REDUCE SHUFFLING - BROADCAST JOIN
➤ Broadcast HashMap - Map side join
(customer_id, state) -> collectAsMap -> broadcast map
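The map-side join can be sketched without a cluster: the small customer table becomes a dict, and each sales partition looks its records up locally, so no sales data has to be shuffled. In PySpark the dict would come from `customer_rdd.collectAsMap()`, be shipped with `sc.broadcast(...)`, and be read through `.value` inside `mapPartitions`; the lists below only stand in for partitions, and the sample rows are hypothetical.

```python
def map_side_join(sales_partition, state_by_customer):
    """Join one partition of sales pairs against the in-memory customer map."""
    for customer_id, sale in sales_partition:
        state = state_by_customer.get(customer_id)
        if state is not None:  # inner join: skip sales with no known customer
            yield (customer_id, sale, state)


customer_pairs = [("123", "CA"), ("456", "AL")]
state_by_customer = dict(customer_pairs)  # what collectAsMap() would return

# Two stand-in partitions of (customer_id, (year, month, date, hour, price)).
sales_partitions = [
    [("123", (2016, 2, 1, 8, 100)), ("999", (2016, 2, 1, 8, 5))],
    [("456", (2017, 8, 1, 9, 40))],
]
joined = [row
          for partition in sales_partitions
          for row in map_side_join(partition, state_by_customer)]
```

This only works while the customer map fits in each executor's memory.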
(JOIN) IF NEITHER DATASET FITS INTO MEMORY
➤ Hard to avoid shuffling without prior knowledge
➤ Take advantage of a pre-existing data distribution, e.g. data that is already sorted
➤ Many missing values in inner join - pre-filter
REDUCE SHUFFLING - PARTITIONER
➤ Custom Partitioner
[Figure: hourly, daily, and monthly RDDs, four partitions each]
REDUCE SHUFFLING - PARTITIONER
➤ Data Locality
.reduceByKey(lambda v1,v2: v1 + v2, 4, state_partitioner)
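A possible `state_partitioner`, written as a guess at what the slide refers to: it hashes only the state prefix of the key, so hourly, daily, monthly, and yearly keys of one state get the same partition index. PySpark takes the returned integer modulo the number of partitions; `zlib.crc32` is used here because it gives the same value in every Python process.

```python
import zlib

NUM_PARTITIONS = 4


def state_partitioner(key):
    """Hash only the state part of 'state#year#month#date#hour'."""
    state = key.split("#")[0]
    return zlib.crc32(state.encode("utf-8"))


def partition_index(key, num_partitions=NUM_PARTITIONS):
    # Mirrors how the partition function's result is turned into an index.
    return state_partitioner(key) % num_partitions


hourly_key = "AL#2017#08#01#09"
daily_key = "AL#2017#08#01#"
yearly_key = "AL#2017###"
indexes = {partition_index(k) for k in (hourly_key, daily_key, yearly_key)}
```

The trade-off is skew: a state with far more sales than the others makes its partition correspondingly larger.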
[Figure: hourly, daily, and monthly RDDs, four partitions each]
REDUCE SHUFFLING - PERSIST
➤ Don’t recompute the shared parent RDDs every time they are reused
customer, sales -> joined -> hourly -> daily -> monthly -> yearly
hourly, daily, monthly, yearly -> union
THANK YOU
➤ FURTHER DISCUSSION

Intuit0810
