How to create self-service analytics tool from activity logs garbage
1. 1
How to create self-service analytics tool from activity logs garbage
Wrike Tech Hub Aleksei Pupyshev, Aleksei Smirnov 14.09.2016
2016 Sep 14
Wrike Tech Hub
Aleksei Smirnov
Data Analyst at Wrike Inc.
Aleksei Pupyshev
Data Scientist at Wrike Inc.
2. 2
Wrike is ...
Workspace (web application), iOS & Android apps
Many integrations and a public API
We're releasing new products and features, as well as changing old ones, very quickly.
3. 3
Wrike is a data-driven development company
4. 4
Wrike Analytics Tools Evolution: What about logs?
So here we’ve implemented a log processing infrastructure based on Spark SQL
Presentation from SPbDSM, Sep 2015
[Diagram labels: UI events, Web Requests, Backend Services, ETL, Thrift interface]
More about the Parquet file structure:
https://habrahabr.ru/company/wrike/blog/279797/
5. 5
Wrike Analytics Tools Evolution: Problems
Spark-submit python jobs
● More and more ETLs and PySpark jobs for specific tasks and dashboards
● No common standard or knowledge (code) base for metric extraction and computation
● Many different, specific input and output sources, separate for each analysis
● It’s hell to generate datasets for ML (predictions, lead scoring, personalization, etc.) or ad hoc analyses
● No way to build a single monitoring and alerting system for Wrike events and KPIs
● Hundreds of dashboards for Wrike data stakeholders, which makes it difficult to get any insights about product and business development
● No metric naming convention
6. 6
Wrike Analytics Tools Evolution: Problems
7. 7
Wrike Analytics Tools Evolution: Solution
● Unification of log-format data: converting different event timestamp formats to one, converting different production tables to a log-structured format, and unifying user_id across all sources
● Unification of the grouping format: (in our case) user_id and day
● Standardisation of metric naming principles with a position-based naming schema:
entity__event__source__path__measure__unit__details
● Unification of the process for creating auto-updatable metrics and features and for testing metrics: via Jupyter Notebook, using any of the following syntaxes: Python, Pandas, SQL (PandasSQL)
● Generation of one data source that contains all user activity metrics and features, with an updatable schema: the Daily User Activity Data Mart (Vitrina)
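The position-based naming schema can be illustrated with a small helper that builds and parses metric names. This is only a sketch: the helper names are hypothetical, and treating "x" as the placeholder for an unused position is an assumption inferred from names such as ses__x__x__x__avg__mn__x that appear later in the deck.

```python
from typing import NamedTuple

# Field order follows the positional schema from the slide.
FIELDS = ("entity", "event", "source", "path", "measure", "unit", "details")
SEPARATOR = "__"
PLACEHOLDER = "x"  # assumption: "x" marks a position that is not used


class MetricName(NamedTuple):
    entity: str
    event: str
    source: str
    path: str
    measure: str
    unit: str
    details: str


def parse_metric_name(name: str) -> MetricName:
    """Split a metric name into its seven positional fields."""
    parts = name.split(SEPARATOR)
    if len(parts) != len(FIELDS) or any(not part for part in parts):
        raise ValueError(f"expected {len(FIELDS)} non-empty fields in {name!r}")
    return MetricName(*parts)


def build_metric_name(**fields: str) -> str:
    """Join the given fields in schema order, filling gaps with the placeholder."""
    unknown = set(fields) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return SEPARATOR.join(fields.get(field, PLACEHOLDER) for field in FIELDS)
```

Because every name has the same seven positions, a dashboard or alerting tool could filter metrics by any single position (for example, all metrics whose unit is "ev") without a separate metadata table.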
8. 8
Wrike User Activity Data Mart: Tech Stack
9. 9
Wrike User Activity Data Mart: Under the Hood
10. 10
UADataMart Under the Hood: Concatenating logs
● Unification of log-format data: converting different event timestamp formats to one, converting different production tables to a log-structured format, and unifying user_id across all sources
Logs:
● Client log (UI)
● Web log (Requests)
● Email log
● Event log (Invitations, Registrations, etc.)
● Search log
● Mobile log
● ...
Production databases (from many shards):
● Delta table
● File attachments
● Task changes
● ...
Union of Spark data frames with schema merging.
We should also rename columns by adding a source prefix (except user_id and timestamp).
This operation isn’t expensive and is very useful!
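The union with schema merging and source prefixes can be sketched in plain Python. This is not the authors' Spark code: lists of dicts stand in for Spark data frames, and the function names, the single-underscore prefix format, and the sample column names are all assumptions made for illustration.

```python
from typing import Any, Dict, List

# Columns shared by every source; these keep their names.
SHARED_COLUMNS = ("user_id", "timestamp")


def add_source_prefix(record: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Prefix every source-specific column with the source name."""
    return {
        (key if key in SHARED_COLUMNS else f"{source}_{key}"): value
        for key, value in record.items()
    }


def union_with_merged_schema(sources: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Concatenate all sources into one log with a single merged schema.

    Columns that a source does not have are filled with None.
    """
    prefixed = [
        add_source_prefix(record, source)
        for source, records in sources.items()
        for record in records
    ]
    # Merged schema: every column seen in any source, in first-seen order.
    columns = list(dict.fromkeys(col for record in prefixed for col in record))
    return [{col: record.get(col) for col in columns} for record in prefixed]
```

The prefix keeps columns with the same name in different sources (say, an "event" column in both the client log and the email log) from colliding after the union.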
11. 11
UADataMart Under the Hood: Grouping by User
This is an expensive operation!
Then we apply the “magic” map function.
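The grouping step can be sketched in plain Python as follows. The function name is hypothetical, and the sketch assumes timestamps have already been unified to ISO 8601 strings. In Spark, a grouping like this has to bring all rows for a key together across partitions, which is one likely reason the step is expensive.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, List


def group_by_user_and_day(rows: List[Dict[str, Any]]) -> Dict[tuple, List[Dict[str, Any]]]:
    """Collect all log rows of one user on one calendar day under a single key."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: "timestamp" is an ISO 8601 string after unification.
        day = datetime.fromisoformat(row["timestamp"]).date()
        groups[(row["user_id"], day)].append(row)
    return dict(groups)
```

Each resulting (user_id, day) group is the unit of work that the map function receives.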
12. 12
UADataMart Under the Hood: “magic” map function
13. 13
UADataMart Under the Hood: “magic” map function
● Create a Pandas DataFrame from the grouped Row object
● Apply each “Metrics Module Function” to a copy of the Pandas DataFrame; each function generates a dictionary of metric (KPI) names and values
● If an exception occurs (an error inside a module function), generate a dictionary with default KPI values
● Merge the returned dictionaries and convert the result to a Row
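The steps above can be sketched in plain Python. This is a simplified stand-in, not the authors' implementation: a list of dicts plays the role of the Pandas DataFrame, a plain dict plays the role of the Spark Row, and the module functions, the registry layout, and the metric name act__x__x__x__cnt__ev__x are hypothetical.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Rows = List[Dict[str, Any]]
MetricsFn = Callable[[Rows], Dict[str, float]]


def count_events(rows: Rows) -> Dict[str, float]:
    """Hypothetical metrics module: total number of events in the group."""
    return {"act__x__x__x__cnt__ev__x": float(len(rows))}


def broken_metric(rows: Rows) -> Dict[str, float]:
    """Hypothetical metrics module that always fails, to show the fallback."""
    raise RuntimeError("simulated bug inside a metrics module")


# Each entry pairs a module function with its default KPI values.
MODULES: List[Tuple[MetricsFn, Dict[str, float]]] = [
    (count_events, {"act__x__x__x__cnt__ev__x": 0.0}),
    (broken_metric, {"act__x__ws__dashb__cnt__ev__x": 0.0}),
]


def magic_map(key: tuple, rows: Rows, modules=MODULES) -> Dict[str, Any]:
    """Compute all metrics for one (user_id, day) group."""
    user_id, day = key
    result: Dict[str, Any] = {"user_id": user_id, "day": day}
    for fn, defaults in modules:
        try:
            # Each module gets its own copy so it cannot corrupt the data
            # seen by the modules that run after it.
            metrics = fn(copy.deepcopy(rows))
        except Exception:
            # One failing module must not break the whole row.
            metrics = dict(defaults)
        result.update(metrics)
    return result
```

Isolating each module behind a copy and a try/except means a newly added, buggy metric degrades to its default value instead of failing the whole daily job.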
14. 14
UADataMart Under the Hood: Metrics Module Functions
Example: based on PandasSQL syntax
Note: here we can use any syntax we like: SQL, Python, or Pandas!
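The slide's original code example did not survive extraction. As a rough illustration of a SQL-style metrics module function, the sketch below uses Python's standard sqlite3 module instead of PandasSQL, so it is a substitute technique rather than the authors' code. The column name client_path, the function name, and the reading of act__x__ws__dashb__cnt__ev__x as "count of dashboard events in the workspace" are assumptions.

```python
import sqlite3
from typing import Any, Dict, List


def dashboard_event_count(rows: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count one user's dashboard events for the day using a SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Load the group's rows into a temporary in-memory table.
        conn.execute("CREATE TABLE events (user_id INTEGER, client_path TEXT)")
        conn.executemany(
            "INSERT INTO events VALUES (?, ?)",
            [(row["user_id"], row.get("client_path")) for row in rows],
        )
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM events WHERE client_path = 'dashboard'"
        ).fetchone()
    finally:
        conn.close()
    # The returned key follows the positional naming schema.
    return {"act__x__ws__dashb__cnt__ev__x": count}
```

Whatever syntax a module uses internally, the contract stays the same: it takes one user's rows for one day and returns a dictionary of metric names and values.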
15. 15
UADataMart Under the Hood: Modules Structure
16. 16
Wrike User Activity Data Mart: Under the Hood
17. 17
Wrike User Activity Data Mart: Under the Hood
Dimensions: apply UDFs (converting to a categorical value) to each dimension column
Categorical dimensions: group by the categorical dimensions and aggregate (across all users) within each group

Registration Period      Paid Details  Country  KPI Name                       Sum of KPI  Day
From 1 year to 2 years   Paid          US       ses__x__x__x__avg__mn__x       1000000     2016.09.01
From 6 months to 1 year  Free          BR       act__x__ws__dashb__cnt__ev__x  20000       2016.09.01
From 2 weeks to 1 month  Free          GB       act__x__ws__tlist__cnt__ev__x  100000      2016.09.02

~1 million rows
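The two steps (converting dimensions to categories, then aggregating across users) can be sketched in plain Python. The bucket boundaries, the input field names, and the function names are hypothetical; only some of the bucket labels are modelled on the table above.

```python
from collections import defaultdict
from datetime import date
from typing import Any, Dict, List


def registration_period(registered: date, day: date) -> str:
    """Convert account age into a categorical value (boundaries are assumed)."""
    age_days = (day - registered).days
    if age_days < 14:
        return "Less than 2 weeks"
    if age_days < 30:
        return "From 2 weeks to 1 month"
    if age_days < 182:
        return "From 1 month to 6 months"
    if age_days < 365:
        return "From 6 months to 1 year"
    if age_days < 730:
        return "From 1 year to 2 years"
    return "More than 2 years"


def aggregate_by_dimensions(user_rows: List[Dict[str, Any]]) -> Dict[tuple, float]:
    """Sum every KPI over all users that share the same categorical dimensions."""
    totals = defaultdict(float)
    for row in user_rows:
        dimensions = (
            registration_period(row["registered"], row["day"]),
            "Paid" if row["is_paid"] else "Free",
            row["country"],
        )
        for kpi_name, value in row["kpis"].items():
            totals[dimensions + (kpi_name, row["day"])] += value
    return dict(totals)
```

Each key of the result corresponds to one row of the aggregated table: the categorical dimensions, the KPI name, and the day, with the summed KPI as the value.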
18. 18
Wrike User Activity Data Mart: For Wrike Data Stakeholders
● entity__event__source__path__measure__unit__details
19. 19
Demo!
20. 20
Flow:
21. 21
Wrike Analytics Tools Evolution: Problems
Spark-submit python jobs
● More and more ETLs and PySpark jobs for specific tasks and dashboards
● No common standard or knowledge (code) base for metric extraction and computation
● Many different, specific input and output sources, separate for each analysis
● It’s hell to generate datasets for ML (predictions, lead scoring, personalization, etc.) or ad hoc analyses
● No way to build a single monitoring and alerting system for Wrike events and KPIs
● Hundreds of dashboards for Wrike data stakeholders, which makes it difficult to get any insights about product and business development
● No metric naming convention
22. 22
Other Applications:
● Alarm system (notifications when something goes wrong with metric values)
● Email personalization
● Recommendation system (e.g. Wrike feature recommendations, search quality improvements, user churn predictions, lead scoring)
23. 23
Questions!
Thank you!