Four years ago we started building a data analytics platform to learn more about how our customers use our on-premises software and how it behaves in the field in terms of function usage, exception rates, and overall performance.
This talk is about the journey we took from an existing web app statistics tracking system to our current, still evolving Hadoop-based ETL system. It covers the technologies we currently use and our approach to supporting reporting and dashboards.
The new Hadoop platform collects, transforms, enriches with data warehouse data, and analyzes millions of log files every day. The resulting insights help us make data-driven decisions for portfolio management, UX design, and overall software quality improvements with real business value.
You will hear about the dos and don'ts we learned, what we consider best practices, and the new challenges we face while data volume and management awareness are still growing.
Data Beats Emotions – How DATEV Generates Business Value with Data-driven Decisions
1. DATEV eG
Data Beats Emotions
How DATEV Generates Business Value with Data-driven Decisions
matthias.mueller@datev.de / @bicaluv
2. DATEV eG
Agenda
About the data
Processing
Business values
What’s next
22.04.2019 Data Beats Emotions 2
3. DATEV eG
DATEV – Company
Founded in 1966 as a co-operative organization
Main business is software for tax consulting, accounting, and law business
Our customers are mostly tax consultants and their clients
B2B market
7,500 employees (1,800 devs)
1 billion euro annual revenue in 2018
A typical tax consultancy has around 10 employees; a few have up to 1,500
40,000 co-operative members
160,000 companies using our software on behalf of their tax consultants
4. DATEV eG
DATEV – Software
On-premises, running at the customer's site
We also have data center applications, but they are not the focus of this talk
MS Windows based, incl. MS SQL Server
250 different applications
5. DATEV eG
About the data
Based on in-memory logs generated for every on-prem application
Logs include
Clicks / Tracked User Interactions
Exceptions
Performance data
+ metadata: OS, screen resolution, touch device, UI themes; no IP address!
6. DATEV eG
About the data – General Data Protection Regulation (GDPR) Compliance
Personal data tracking requires agreement / consent management
Dialog shown to each user: no agreement, no tracking data
2 data schemas from client
actual data with GUID (Globally Unique Identifier, generated at client site)
agreement with GUID and User ID (for data warehouse joins)
Essential for handling the right to be forgotten without requiring big data deletes
Click: { GUID, [data] }
Agreement: { GUID, UserID, [ true | false ] }
7. DATEV eG
About the data – GDPR Compliance
Click1: { A1, „File.Open“ }
Agreement: { A1, User42, true }
Click2: { A1, „File.Quit“ }
Big Data World
…
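A minimal sketch of how the two-schema design could support the right to be forgotten. It assumes, as the slides suggest but do not spell out, that removing the small agreement record is enough to unlink a user from the event data. The names `clicks`, `agreements`, `events_for_user`, and `forget_user` are hypothetical and not taken from the DATEV system.

```python
# Event data carries only a client-generated GUID; the agreement record is the
# only place where that GUID is linked to a user ID (assumption of this sketch).
clicks = [
    {"guid": "A1", "data": "File.Open"},
    {"guid": "A1", "data": "File.Quit"},
]
agreements = [
    {"guid": "A1", "user_id": "User42", "consent": True},
]


def events_for_user(user_id, clicks, agreements):
    """Join events to a user via that user's consenting agreement records."""
    guids = {
        a["guid"]
        for a in agreements
        if a["user_id"] == user_id and a["consent"]
    }
    return [c for c in clicks if c["guid"] in guids]


def forget_user(user_id, agreements):
    """Drop the user's agreement rows; the bulk event data is left untouched."""
    return [a for a in agreements if a["user_id"] != user_id]


before = events_for_user("User42", clicks, agreements)
agreements = forget_user("User42", agreements)
after = events_for_user("User42", clicks, agreements)
```

After `forget_user`, the two click records still exist, but nothing ties GUID `A1` to `User42` any more, so no large delete on the event data is needed.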
9. DATEV eG
About the data – Current Figures
Consent rate (agreements): 83%
Log files per day (decompressed): 60 GB
Events per day: 200 million (6,000/s)
Unique users per day: 200,000
Startup of every application
Components with 1,250 dynamic trace points
Total client events in Hadoop cluster (billions)
Further figures on the slide: 1 2, 30, approx. 50
14. DATEV eG
Actual Processing
Data Center
HTTPS
On-premises, Internet, Tracking Server, Hadoop Cluster, Reporting
ISA
DMZ
DEV: Team of 7, including Devs, Data Scientist, Master of Ceremony, Requirements Engineer, and Product Owner
OP: Team of 2, operates the data center platforms
16. DATEV eG
Actual Processing – Client
Continuous monitoring of client logs using ring buffer
(remember: no individual agreement, no data)
on-premises clients send data every 3 hours
(random distribution of sending time based on installation time)
continuous flow of data
BTW: We do dogfooding for client site data tracking, like buffer overruns, CPU, and memory usage
HTTPS
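The client behaviour described above can be sketched as follows. This is an illustration under assumptions, not DATEV's client code: it assumes a fixed-capacity buffer that overwrites the oldest entries, and a send schedule on a 3-hour grid anchored at the installation time, which is one simple way to spread clients over time. `LogRingBuffer` and `next_send_time` are hypothetical names.

```python
from collections import deque
from datetime import datetime, timedelta

SEND_INTERVAL = timedelta(hours=3)  # send interval from the slide


class LogRingBuffer:
    """Fixed-size in-memory buffer; the oldest entry is dropped when full."""

    def __init__(self, capacity):
        self._entries = deque(maxlen=capacity)
        self.overruns = 0  # number of dropped entries, worth tracking itself

    def append(self, entry):
        if len(self._entries) == self._entries.maxlen:
            self.overruns += 1
        self._entries.append(entry)

    def drain(self):
        """Return all buffered entries and clear the buffer."""
        entries = list(self._entries)
        self._entries.clear()
        return entries


def next_send_time(installed_at, now):
    """Next send slot on a 3-hour grid anchored at the installation time."""
    intervals_done = (now - installed_at) // SEND_INTERVAL  # integer count
    return installed_at + (intervals_done + 1) * SEND_INTERVAL
```

Because installation times differ between clients, the send slots differ too, which would give the server a roughly continuous flow of data instead of a burst every 3 hours.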
17. DATEV eG
Actual Processing – Ingestion
Proprietary protocol to get from ISA to Cluster (DMZ)
Transfers incoming unsecured data to the secure data center every 5 minutes
continuous flow of data to Hadoop Edge Node
18. DATEV eG
Actual Processing – Ingestion
CRON & Batch: Once every night, data gets processed
Decompress
Filter (valid timestamp, test data)
Store and upload to HDFS in file chunks of 100 MB
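The three nightly steps could look roughly like this. The sketch assumes gzip compression, dictionary records with a `timestamp` and an `is_test` flag, and newline-delimited payloads; none of these details are stated on the slide. The HDFS upload itself is left out.

```python
import gzip
from datetime import datetime

CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB, as on the slide


def decompress_lines(payload):
    """Decompress a gzip payload (assumed format) into lines, newlines kept."""
    return gzip.decompress(payload).splitlines(keepends=True)


def is_valid(record, now):
    """Keep records with a plausible timestamp that are not test data."""
    ts = record.get("timestamp")
    if not isinstance(ts, datetime) or ts > now:
        return False
    return not record.get("is_test", False)


def chunk_lines(lines, chunk_size=CHUNK_SIZE):
    """Group lines into chunks of at most chunk_size bytes.

    A single line longer than chunk_size still becomes its own chunk.
    """
    chunk, size = [], 0
    for line in lines:
        if chunk and size + len(line) > chunk_size:
            yield b"".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield b"".join(chunk)
```

Splitting on line boundaries keeps every record whole inside one chunk, so each file can be processed independently later.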
19. DATEV eG
Actual Processing – ETL Phase 1
CRON & Batch: Once every night, data gets processed
Start Spark job for agreement data
Start Spark jobs for hot data (window of 5 days)
– De-duplicate data
– Add late-arriving data
– Generate ORC files with data partitioned by day
– Optimize partitions (e.g. delete outdated partitions due to retention policy)
– Automated check of internal compliance regulations
(data must not contain customer-confidential data)
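The hot-data steps can be illustrated in plain Python (the talk uses Spark jobs; this sketch only shows the logic). It assumes each event has a unique `event_id` and a `timestamp`, both hypothetical field names. Rebuilding the last 5 daily partitions on every run is what lets duplicates and late arrivals be handled in the same pass.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta

HOT_WINDOW_DAYS = 5  # window size from the slide


def rebuild_hot_partitions(events, today):
    """Return {day: [events]} for the hot window, de-duplicated by event ID."""
    oldest = today - timedelta(days=HOT_WINDOW_DAYS - 1)
    seen = set()
    partitions = defaultdict(list)
    for event in events:
        day = event["timestamp"].date()
        if not (oldest <= day <= today):
            continue  # outside the hot window; older partitions are not rebuilt
        if event["event_id"] in seen:
            continue  # duplicate delivery of the same event
        seen.add(event["event_id"])
        partitions[day].append(event)  # late data lands in its original day
    return dict(partitions)
```

An event received today but stamped three days ago ends up in the partition for that earlier day, which is only possible because that partition is still inside the hot window.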
20. DATEV eG
Actual Processing – ETL Phase 2
Start Spark jobs to update data for reports
Generate ORC files for Star Schema (facts and dimensions)
Aggregations and calculations for reporting
Update files of report tool incrementally by reading ORC files using Hive ODBC
(external tables)
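As a small illustration of the star schema step, the following plain-Python sketch splits flat events into one dimension table and one aggregated fact table. The field names `app` and `day`, the surrogate key scheme, and the function name are assumptions for illustration; the real jobs run in Spark and write ORC files.

```python
from collections import Counter


def build_star_schema(events):
    """Split flat events into an application dimension and a click fact table."""
    app_dim = {}             # application name -> surrogate key
    fact_counts = Counter()  # (day, app_key) -> number of clicks
    for event in events:
        # Existing names keep their key; new names get the next free key.
        app_key = app_dim.setdefault(event["app"], len(app_dim) + 1)
        fact_counts[(event["day"], app_key)] += 1
    facts = [
        {"day": day, "app_key": app_key, "clicks": clicks}
        for (day, app_key), clicks in sorted(fact_counts.items())
    ]
    return app_dim, facts
```

The fact rows hold only keys and pre-aggregated counts, so a report tool reading them has far less data to scan than the raw events.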
21. DATEV eG
HDP 2.6.5 Production Cluster
Data Center: Rack 1, Rack 2
Roles in each rack: Edge, Master, Workers
Node IDs: …0001 …0003 …000 …0015 …0016 …0002 …0004 …0006 …0013 …0014
Two node configurations in each rack:
each 48 Cores, 512 GB RAM, 16 TB HDD, RHEL 7
each 48 Cores, 512 GB RAM, 1 TB HDD, RHEL 7
23. DATEV eG
Actual Processing - Reporting
UX (including click counts)
Exceptions
Performance
22 different default reports
24. DATEV eG
Actual Processing – Reporting Example
Top 10 Screen Resolution
25. DATEV eG
Actual Processing – Reporting Example
Top 10 Screen Resolution by Target Market
Data Warehouse
Target markets: Clients / Companies, Tax Consultants, Lawyers, Other
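A report like this combines tracked metadata with data warehouse attributes. The sketch below assumes the user ID has already been resolved via the agreement record and that the data warehouse lookup is available as a simple mapping; `user_market` and the field names are hypothetical.

```python
from collections import Counter, defaultdict


def top_resolutions_by_market(events, user_market, n=10):
    """Count screen resolutions per target market and keep the n most common."""
    counts = defaultdict(Counter)
    for event in events:
        # Users without a data warehouse entry fall into "Other".
        market = user_market.get(event["user_id"], "Other")
        counts[market][event["resolution"]] += 1
    return {market: c.most_common(n) for market, c in counts.items()}
```

For example, calling `top_resolutions_by_market(events, user_market, n=10)` yields one ranked list of `(resolution, count)` pairs per target market.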
26. DATEV eG
Actual Processing – Reporting Example
Program Usage by Target Market
Chart axes: Member Type, Member Count (ticks from 0 to 25,000)
Data Warehouse
Member types: Clients / Companies, Tax Consultants, Education Institutes, Public Sector, Lecturer, Other
28. DATEV eG
Business Values
UX, e.g. optimized screen resolution
Check the actual program usage of „Paid Beta Testers“
A/B comparison (usage and performance)
Verification of sold license bundles
Performance anomaly detection, e.g. based on OSs
29. DATEV eG
Business Values
Discontinuation of over 10 applications and over 30 features within apps
saves hours in dev and support (€)
Detailed field analysis for a new application
„saved trouble“ with 4,500 customers that missing features would have caused
Counting of real SQL Server licenses in use
saves €
32. DATEV eG
Evolve from Guided Analytics…
Data: On-Prem Statistics Data, Program Statistics, Add. Data Warehouse
Producer: Statistics Team only
Consumers: POs
Standard Reports
34. DATEV eG
…to Self-Service Analytics
Data: On-Prem and Online Statistics Data, Program Statistics, Add. Data Warehouse, Source A, Source B, Source …
Data Abstraction, Data Catalog, Reporting Environment
Producers and Consumers: Data Scientist, Power User, Manager
Data Governance Process, Publishing Workflow
36. DATEV eG
Self-Service Analytics PoC Example
Exception Path Analysis
using Kibana + Elasticsearch
38. DATEV eG
Self-Service Analytics PoC Example
Number of Exceptions on DVD after Release using Qlik Sense
Example Data only
39. DATEV eG
Self-Service Analytics PoC Example
Top 5 Exceptions by DVDs using Qlik Sense
Example Data only