SlideShare a Scribd company logo
1 of 34
Download to read offline
Automatic Detection
of Web Trackers
Vasia Kalavri
Apache Flink PMC, PhD student @KTH
kalavri@kth.se, @vkalavri
Telefonica Research, Barcelona
Computer Networks, Multimedia,
Online Social Networks, Security &
Privacy, Recommender Systems,
HCI & Mobile Computing,
Distributed Systems…
2
Ads
Recommendations
Browsing the Web
3
Tracker
Tracker
Ad Server
display relevant
ads
cookie exchange
profiling
Tracking
4
5
The study's authors defined "creepiness" by the feeling
consumers get when they sense an ad is too personal
because it uses data the consumer did not agree to
provide, such as online-search and browsing history.
Consumers are even more creeped out by this because
they don't know how and where that information will
be used.
6
amazon.com imdb.com facebook.com
X Y
X
Y
IP 1.1.1.1
ID-A = “aaa”
IP 1.1.1.1
ID-X = “xxx”
IP 2.2.2.2
ID-B = “bbb”
IP 2.2.2.2
ID-Y = “yyy”
IP 3.3.3.3
ID-C = “ccc”
IP 3.3.3.3
ID-X = “xxx”
IP 3.3.3.3
ID-Y = “yyy”
Linking Tracker Information
7
Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give
You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
Can’t we block them?
proxy
Tracker
Tracker
Ad Server
8
Legitimate site
● not frequently updated
● not sure who or based on what criteria URLs are
blacklisted
● miss “hidden” trackers or dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?
Available Solutions
AdBlock, DoNotTrack, EasyPrivacy:
crowd-sourced “black lists” of tracker URLs
9
10
Our Goal
Exploit fundamental
properties necessary for
tracker operation
Use existing data
to build a trackers classifier
● structural attributes:
connections, network
positions
● operational aspects: data
volume exchange,
communication patterns
Can we detect Trackers automatically?
● Are Trackers similar? How?
○ network structure
○ data received/sent
○ response times
○ latency
● Are Trackers different from normal sites? How?
● Are Trackers mainly connected to other Trackers?
12
The Road to our Goal
● algorithms
● tuning
● features
● combinations of
algorithms and features
and parameters...
13
The Dataset
172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you
-lucky- gal#.gnJbE8EDDK 3 45 20150203:17080345 34 200 GET www.buzzfeed.
com/ HTTP/1.1 Host: www.google-analytics.com User-Agent: Mozilla/5.0
(Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR
3.5.30729) Accept: text/html,application/xhtml+xml, application/xml;q=0.9,
*/*;q=0.8 Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES
#records: ~80m
#users: ~3k
#URLs: ~2m
#Trackers: ~4k
14
Basic Dataset Analysis
● How many requests to
Trackers?
DataSet API
● Do Tracker requests have
larger latency than other
requests?
15
● How many Trackers
○ per user?
○ per request?
○ per website?
● Do popular websites embed
more Trackers than others?
● Do same-topic websites share
Trackers?
● Do different users visiting the
same website end up on
different Trackers?
● Do Trackers send / receive
more / less bytes?
● Do they have more / less
connections on average?
Main Idea
Model the data as a
referer → host bipartite
graph and exploit the
graph structure to identify
Trackers
facebook.com
youtube.com
google-analytics.com
b.scorecardresearch.com
embedded URLsURLs explicitly
visited by the user
16
Attempt#1
Relevance Search
Iterative, random walk-like
algorithm for bipartite graphs
Given an input source node,
assign a “relevance score” to other
nodes, based on how similar their
network position is
Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search
and anomaly detection in bipartite graphs. ACM SIGKDD Explorations
Newsletter,7(2), 48-55.
Relevance Search Algorithm
google-analytics
b.scorecardresearch
xzy/logo_small.jpg
0.9
0.1
source
In each iteration, a vertex:
- sends a score to out-neighbors
- sums up received scores and
updates value
18
Relevance Search Implementation
● single-source relevance search
○ similar to pagerank
○ easily mapped to vertex-centric iterations
● multi-source relevance search
○ each vertex keeps a vector of scores
○ compute top-k relevant nodes per source
○ merge the top-k lists
19
Gelly API
Data Pipeline
top-k
relevant
nodes
www.google-analytics.com: T
www.bscored-research.com:
T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...
Bipartite graph
creation
Multi-source
Relevance Search
Classification
20
Relevance Search Tuning
● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weigh the input graph?
● How to define the relevance score threshold?
21
Relevance Search Problems
● Easy to find the few very similar and the few very
different pages
● Popular trackers are similar to other popular trackers,
but not to not-so-popular ones
● We might keep re-discovering what we already know
22
Relevance Search doesn’t seem to completely solve the problem…
Where do we go now?
23
Attempt#2...N-1
Combining Relevance Search
with other algorithms
Several Clustering algorithms
k-nn Classification
Random Forest
Data Pipeline(s)
top-k
relevant
nodes
www.google-analytics.
com: T
www.bscored-research.
com: T
www.facebook.com:
NT
www.github.com: NT
cdn.cxense.com: NT
...
Bipartite graph
creation
Multi-source
Relevance Search
[feature extraction]
Classification
[your clustering,
classification, etc.
algorithm here]
[evaluation]
25
r1
r2
r3
r5
r6
r7
h1
h2
h3
h4
h5
h6
h7
h8
NT
NT
T
T
?
T
NT
NT
r4
referer-hosts graph
h2
h3 h4
h5 h6
h8
h7
h1
r1
r2
r3
r3
r3
r4
r5r6
r7
hosts-projection graph
: referer
: non-tracker host
: tracker host
: unlabeled host
The Projection Graph
27
Attempt#N
Community Detection on the
Projection Graph
The Projection Graph captures
implicit connections between
trackers, through other sites
Do Trackers form communities in
the Projection Graph?
● Do they form connected
components?
Basic Analysis of the Projection Graph
● Do Trackers have unusually
high degrees?
DataSet & Gelly APIs
29
● Are they mainly connected to
other Trackers?
Visualization
30
Final Data Pipeline
raw logs
cleaned
logs
1: logs pre-
processing
2: bipartite graph
creation
3: largest
connected
component
extraction
4: hosts-
projection graph
creation
5: community
detection
google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...
6: results
DataSet API
Gelly
DataSet API
31
Very high accuracy and
very low FPR :-)
Start simple
Lessons Learned
Choose features incrementally
Visualize your data
Re-evaluate your models
Try different data representations
Use a flexible system
Automatic Detection
of Web Trackers
Vasia Kalavri
Apache Flink PMC, PhD student @KTH
kalavri@kth.se, @vkalavri
Optimizing the Pipeline
Flink Optimizer
to the rescue :-)
34

More Related Content

What's hot

Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Timo Walther
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiSlim Baltagi
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Henry Saputra
 
Implementing BigPetStore with Apache Flink
Implementing BigPetStore with Apache FlinkImplementing BigPetStore with Apache Flink
Implementing BigPetStore with Apache FlinkMárton Balassi
 
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...Flink Forward
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobDatabricks
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Pat Patterson
 
20120907 microbiome-intro
20120907 microbiome-intro20120907 microbiome-intro
20120907 microbiome-introLeo Lahti
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceTao Feng
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Tin Ho
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patternsLars Albertsson
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big dataTrieu Nguyen
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseBig Data Spain
 
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)Databricks
 
Fast Data processing with RFX
Fast Data processing with RFXFast Data processing with RFX
Fast Data processing with RFXTrieu Nguyen
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Databricks
 

What's hot (20)

Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim Baltagi
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015
 
Implementing BigPetStore with Apache Flink
Implementing BigPetStore with Apache FlinkImplementing BigPetStore with Apache Flink
Implementing BigPetStore with Apache Flink
 
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the Job
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
20120907 microbiome-intro
20120907 microbiome-intro20120907 microbiome-intro
20120907 microbiome-intro
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patterns
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas Weise
 
Hadoop and friends
Hadoop and friendsHadoop and friends
Hadoop and friends
 
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
 
Fast Data processing with RFX
Fast Data processing with RFXFast Data processing with RFX
Fast Data processing with RFX
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
 

Viewers also liked

Ted Dunning-Faster and Furiouser- Flink Drift
Ted Dunning-Faster and Furiouser- Flink DriftTed Dunning-Faster and Furiouser- Flink Drift
Ted Dunning-Faster and Furiouser- Flink DriftFlink Forward
 
Eron Wright - Introducing Flink on Mesos
Eron Wright - Introducing Flink on MesosEron Wright - Introducing Flink on Mesos
Eron Wright - Introducing Flink on MesosFlink Forward
 
Julian Hyde - Streaming SQL
Julian Hyde - Streaming SQLJulian Hyde - Streaming SQL
Julian Hyde - Streaming SQLFlink Forward
 
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkSanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkFlink Forward
 
Aljoscha Krettek - The Future of Apache Flink
Aljoscha Krettek - The Future of Apache FlinkAljoscha Krettek - The Future of Apache Flink
Aljoscha Krettek - The Future of Apache FlinkFlink Forward
 
Jamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache FlinkJamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache FlinkFlink Forward
 
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...Flink Forward
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionFlink Forward
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Flink Forward
 
Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Flink Forward
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Flink Forward
 
Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...
Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...
Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...Flink Forward
 
Gyula Fóra - RBEA- Scalable Real-Time Analytics at King
Gyula Fóra - RBEA- Scalable Real-Time Analytics at KingGyula Fóra - RBEA- Scalable Real-Time Analytics at King
Gyula Fóra - RBEA- Scalable Real-Time Analytics at KingFlink Forward
 
Alexander Kolb - Flinkspector – Taming the squirrel
Alexander Kolb - Flinkspector – Taming the squirrelAlexander Kolb - Flinkspector – Taming the squirrel
Alexander Kolb - Flinkspector – Taming the squirrelFlink Forward
 
Trevor Grant - Apache Zeppelin - A friendlier way to Flink
Trevor Grant - Apache Zeppelin - A friendlier way to FlinkTrevor Grant - Apache Zeppelin - A friendlier way to Flink
Trevor Grant - Apache Zeppelin - A friendlier way to FlinkFlink Forward
 
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in FlinkMaxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in FlinkFlink Forward
 
Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...
Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...
Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...Flink Forward
 
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIsFabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIsFlink Forward
 
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Flink Forward
 

Viewers also liked (20)

Ted Dunning-Faster and Furiouser- Flink Drift
Ted Dunning-Faster and Furiouser- Flink DriftTed Dunning-Faster and Furiouser- Flink Drift
Ted Dunning-Faster and Furiouser- Flink Drift
 
Eron Wright - Introducing Flink on Mesos
Eron Wright - Introducing Flink on MesosEron Wright - Introducing Flink on Mesos
Eron Wright - Introducing Flink on Mesos
 
Julian Hyde - Streaming SQL
Julian Hyde - Streaming SQLJulian Hyde - Streaming SQL
Julian Hyde - Streaming SQL
 
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with FlinkSanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink
 
Aljoscha Krettek - The Future of Apache Flink
Aljoscha Krettek - The Future of Apache FlinkAljoscha Krettek - The Future of Apache Flink
Aljoscha Krettek - The Future of Apache Flink
 
Jamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache FlinkJamie Grier - Robust Stream Processing with Apache Flink
Jamie Grier - Robust Stream Processing with Apache Flink
 
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
Kostas Tzoumas_Stephan Ewen - Keynote -The maturing data streaming ecosystem ...
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
 
Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink- Márton Balassi Streaming ML with Flink-
Márton Balassi Streaming ML with Flink-
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...
 
Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...
Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...
Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet...
 
Gyula Fóra - RBEA- Scalable Real-Time Analytics at King
Gyula Fóra - RBEA- Scalable Real-Time Analytics at KingGyula Fóra - RBEA- Scalable Real-Time Analytics at King
Gyula Fóra - RBEA- Scalable Real-Time Analytics at King
 
Alexander Kolb - Flinkspector – Taming the squirrel
Alexander Kolb - Flinkspector – Taming the squirrelAlexander Kolb - Flinkspector – Taming the squirrel
Alexander Kolb - Flinkspector – Taming the squirrel
 
Trevor Grant - Apache Zeppelin - A friendlier way to Flink
Trevor Grant - Apache Zeppelin - A friendlier way to FlinkTrevor Grant - Apache Zeppelin - A friendlier way to Flink
Trevor Grant - Apache Zeppelin - A friendlier way to Flink
 
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in FlinkMaxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
Maxim Fateev - Beyond the Watermark- On-Demand Backfilling in Flink
 
Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...
Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...
Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with...
 
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIsFabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs
 
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
 

Similar to Automatic Detection of Web Trackers by Vasia Kalavri

Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with sparkMarissa Saunders
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)Jeremy Cabral
 
Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014Bastian Grimm
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchYury Lifshits
 
Automation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report GenerationAutomation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report GenerationDmitry Grapov
 
Master in Big Data Analytics and Social Mining 20015
Master in Big Data Analytics and Social Mining 20015Master in Big Data Analytics and Social Mining 20015
Master in Big Data Analytics and Social Mining 20015Andrea Gigli
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData SolutionsTravis Oliphant
 
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...MITRE - ATT&CKcon
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & AnalysisScott Sanders
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
CML's Presentation at FengChia University
CML's Presentation at FengChia UniversityCML's Presentation at FengChia University
CML's Presentation at FengChia UniversityTunghai University
 
Web2.0.2012 - lesson 8 - Google world
Web2.0.2012 - lesson 8 - Google worldWeb2.0.2012 - lesson 8 - Google world
Web2.0.2012 - lesson 8 - Google worldCarlo Vaccari
 

Similar to Automatic Detection of Web Trackers by Vasia Kalavri (20)

Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with spark
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014Structured Data & Schema.org - SMX Milan 2014
Structured Data & Schema.org - SMX Milan 2014
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! Research
 
Automation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report GenerationAutomation of (Biological) Data Analysis and Report Generation
Automation of (Biological) Data Analysis and Report Generation
 
Master in Big Data Analytics and Social Mining 20015
Master in Big Data Analytics and Social Mining 20015Master in Big Data Analytics and Social Mining 20015
Master in Big Data Analytics and Social Mining 20015
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Anaconda and PyData Solutions
Anaconda and PyData SolutionsAnaconda and PyData Solutions
Anaconda and PyData Solutions
 
Diadem 1.0
Diadem 1.0Diadem 1.0
Diadem 1.0
 
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & Analysis
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
CML's Presentation at FengChia University
CML's Presentation at FengChia UniversityCML's Presentation at FengChia University
CML's Presentation at FengChia University
 
Web2.0.2012 - lesson 8 - Google world
Web2.0.2012 - lesson 8 - Google worldWeb2.0.2012 - lesson 8 - Google world
Web2.0.2012 - lesson 8 - Google world
 

More from Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

More from Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Recently uploaded

Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Automatic Detection of Web Trackers by Vasia Kalavri

  • 1. Automatic Detection of Web Trackers Vasia Kalavri Apache Flink PMC, PhD student @KTH kalavri@kth.se, @vkalavri
  • 2. Telefonica Research, Barcelona Computer Networks, Multimedia, Online Social Networks, Security & Privacy, Recommender Systems, HCI & Mobile Computing, Distributed Systems… 2
  • 5. 5 The study's authors defined "creepiness" by the feeling consumers get when they sense an ad is too personal because it uses data the consumer did not agree to provide, such as online-search and browsing history. Consumers are even more creeped out by this because they don't know how and where that information will be used.
  • 6. 6
  • 7. amazon.com imdb.com facebook.com X Y X Y IP 1.1.1.1 ID-A = “aaa” IP 1.1.1.1 ID-X = “xxx” IP 2.2.2.2 ID-B = “bbb” IP 2.2.2.2 ID-Y = “yyy” IP 3.3.3.3 ID-C = “ccc” IP 3.3.3.3 ID-X = “xxx” IP 3.3.3.3 ID-Y = “yyy” Linking Tracker Information 7 Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
  • 8. Can’t we block them? proxy Tracker Tracker Ad Server 8 Legitimate site
  • 9. ● not frequently updated ● not sure who or based on what criteria URLs are blacklisted ● miss “hidden” trackers or dual-role nodes ● blocking requires manual matching against the list ● can you buy your way into the whitelist? Available Solutions AdBlock, DoNotTrack, EasyPrivacy: crowd-sourced “black lists” of tracker URLs 9
  • 10. 10
  • 11. Our Goal Exploit fundamental properties necessary for tracker operation Use existing data to build a trackers classifier ● structural attributes: connections, network positions ● operational aspects: data volume exchange, communication patterns
  • 12. Can we detect Trackers automatically? ● Are Trackers similar? How? ○ network structure ○ data received/sent ○ response times ○ latency ● Are Trackers different from normal sites? How? ● Are Trackers mainly connected to other Trackers? 12
  • 13. The Road to our Goal ● algorithms ● tuning ● features ● combinations of algorithms and features and parameters... 13
  • 14. The Dataset 172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you -lucky- gal#.gnJbE8EDDK 3 45 20150203:17080345 34 200 GET www.buzzfeed. com/ HTTP/1.1 Host: www.google-analytics.com User-Agent: Mozilla/5.0 (Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Accept: text/html,application/xhtml+xml, application/xml;q=0.9, */*;q=0.8 Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES #records: ~80m #users: ~3k #URLs: ~2m #Trackers: ~4k 14
  • 15. Basic Dataset Analysis ● How many requests to Trackers? DataSet API ● Do Tracker requests have larger latency than other requests? 15 ● How many Trackers ○ per user? ○ per request? ○ per website? ● Do popular websites embed more Trackers than others? ● Do same-topic websites share Trackers? ● Do different users visiting the same website end up on different Trackers? ● Do Trackers send / receive more / less bytes? ● Do they have more / less connections on average?
  • 16. Main Idea Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers facebook.com youtube.com google-analytics.com b.scorecardresearch.com embedded URLsURLs explicitly visited by the user 16
  • 17. Attempt#1 Relevance Search Iterative, random walk-like algorithm for bipartite graphs Given an input source node, assign a “relevance score” to other nodes, based on how similar their network position is Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter,7(2), 48-55.
  • 18. Relevance Search Algorithm google-analytics b.scorecardresearch xzy/logo_small.jpg 0.9 0.1 source In each iteration, a vertex: - sends a score to out-neighbors - sums up received scores and updates value 18
  • 19. Relevance Search Implementation ● single-source relevance search ○ similar to pagerank ○ easily mapped to vertex-centric iterations ● multi-source relevance search ○ each vertex keeps a vector of scores ○ compute top-k relevant nodes per source ○ merge the top-k lists 19 Gelly API
  • 20. Data Pipeline top-k relevant nodes www.google-analytics.com: T www.bscored-research.com: T www.facebook.com: NT www.github.com: NT cdn.cxense.com: NT ... Bipartite graph creation Multi-source Relevance Search Classification 20
  • 21. Relevance Search Tuning ● How many and which sources to give as input? ● How to define convergence? ● Does initialization matter? ● How to weigh the input graph? ● How to define the relevance score threshold? 21
  • 22. Relevance Search Problems ● Easy to find the few very similar and the few very different pages ● Popular trackers are similar to other popular trackers, but not to not-so-popular ones ● We might keep re-discovering what we already know 22
  • 23. Relevance Search doesn’t seem to completely solve the problem… Where do we go now? 23
  • 24. Attempt#2...N-1 Combining Relevance Search with other algorithms Several Clustering algorithms k-nn Classification Random Forest
  • 25. Data Pipeline(s) top-k relevant nodes www.google-analytics. com: T www.bscored-research. com: T www.facebook.com: NT www.github.com: NT cdn.cxense.com: NT ... Bipartite graph creation Multi-source Relevance Search [feature extraction] Classification [your clustering, classification, etc. algorithm here] [evaluation] 25
  • 26.
  • 27. r1 r2 r3 r5 r6 r7 h1 h2 h3 h4 h5 h6 h7 h8 NT NT T T ? T NT NT r4 referer-hosts graph h2 h3 h4 h5 h6 h8 h7 h1 r1 r2 r3 r3 r3 r4 r5r6 r7 hosts-projection graph : referer : non-tracker host : tracker host : unlabeled host The Projection Graph 27
  • 28. Attempt#N Community Detection on the Projection Graph The Projection Graph captures implicit connections between trackers, through other sites Do Trackers form communities in the Projection Graph?
  • 29. ● Do they form connected components? Basic Analysis of the Projection Graph ● Do Trackers have unusually high degrees? DataSet & Gelly APIs 29 ● Are they mainly connected to other Trackers?
  • 31. Final Data Pipeline raw logs cleaned logs 1: logs pre- processing 2: bipartite graph creation 3: largest connected component extraction 4: hosts- projection graph creation 5: community detection google-analytics.com: T bscored-research.com: T facebook.com: NT github.com: NT cdn.cxense.com: NT ... 6: results DataSet API Gelly DataSet API 31 Very high accuracy and very low FPR :-)
  • 32. Start simple Lessons Learned Choose features incrementally Visualize your data Re-evaluate your models Try different data representations Use a flexible system
  • 33. Automatic Detection of Web Trackers Vasia Kalavri Apache Flink PMC, PhD student @KTH kalavri@kth.se, @vkalavri
  • 34. Optimizing the Pipeline Flink Optimizer to the rescue :-) 34