7. Enterprise Business ?
• Many different definitions and discussions about "Enterprise"... :(
• MY DEFINITION IN THIS TALK:
"Businesses NOT about IT"
• Thus, most businesses are "Enterprise", everywhere, not only in Tokyo
8. Data Analytics Service ?
• Provides ways to know:
• How many people are reaching our products?
• How many times are they seeing our advertisements?
• And how many times do they buy our products?
• When do they use our products?
• When did they buy our products?
• Where did they buy our products?
• ...
• Something that helps our business by using data
9. Data Analytics Service for Enterprise Business ?
• Something that helps "Business not about IT", using data (IT)
• Staff (using the data analytics service) don't know about IT
• and also don't care about IT
• but "need" the results of analytics
• Everyone checks the report about yesterday at 10:00 AM
• We need results before 10:00AM
• 10:10 AM is too late, but 2:00 AM is too early...
10. Deadline and Retries
[Figure: three cases on a timeline with labels 00:00AM, 01:00AM, 05:30AM and 10:00AM]
• Big Job, Power 1: Crash! → Delay...
• Big Job, Power 2: Crash! → OK!
• Small Jobs, Power 1: Crash! → OK!
11. Missions of Data Analytics Service for Enterprise Business
• Fast "enough"
• Cheap "enough"
• Stable
• Easy to use "enough"
12. Technologies for Data Analytics Service
• Data Management System
• Distributed Processing System
• Queue and Scheduler
• Connecting Systems and Services
• Controlling Jobs, Tasks and Workflows
• Managing Retries
13. Data Management Systems
• Data Collecting Systems
• Fluentd, Embulk, ...
• Distributed Database and Storage
• Storing data in efficient format (MPC1, MessagePack columnar format)
• Managing index
• Managing schema
• Providing transactional operations
14. Distributed Processing System
• Running Analytics Queries
• MapReduce engines: Hadoop + Hive
• MPP (Massively Parallel Processing) systems: Presto
• Running Data Management Jobs
• Converting data formats, re-indexing, detecting schema, ...
• Computing Resource Management
• Customer queries (and internal use) must be separated!
15. Queue and Scheduler
• Queuing Queries
• Allows queries to be enqueued and run one after another
[Figure: queue diagram with labels "Customer", "Request" and "Power 1"]
• Scheduling Queries
• Run queries when it is OK to run them
[Figure: timeline with labels "Data for Queries", 01:00AM and 03:00AM]
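The two ideas on this slide can be sketched in a few lines. This is only an illustration, not the service's actual implementation: `Query`, `QueryQueue` and the `data_ready_at` field are hypothetical names, and time is simplified to an hour number. Queries are enqueued and run one after another, and a query is held back until the data it needs is available.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Query:
    name: str
    data_ready_at: int  # hypothetical: hour at which the input data is available
    run: Callable[[], str]


class QueryQueue:
    """FIFO queue that runs queries one at a time (illustrative only)."""

    def __init__(self) -> None:
        self._pending: Deque[Query] = deque()

    def enqueue(self, query: Query) -> None:
        self._pending.append(query)

    def run_ready(self, now: int) -> List[str]:
        """Run, in queue order, every query whose data is ready at `now`."""
        results: List[str] = []
        waiting: Deque[Query] = deque()
        while self._pending:
            query = self._pending.popleft()
            if query.data_ready_at <= now:
                results.append(query.run())
            else:
                waiting.append(query)  # keep it for a later pass
        self._pending = waiting
        return results


queue = QueryQueue()
queue.enqueue(Query("daily_report", data_ready_at=3, run=lambda: "daily_report done"))
queue.enqueue(Query("ad_hoc", data_ready_at=0, run=lambda: "ad_hoc done"))

first_pass = queue.run_ready(now=1)   # only the ad hoc query can run at 01:00
second_pass = queue.run_ready(now=3)  # the data has arrived by 03:00
```

Here the daily report waits in the queue during the first pass and runs in the second pass, once its data is ready.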
16. Connecting Systems and Services
• Non-"connected" Data Analytics Service
[Figure: "Ultra Super Great Analytics Service" with labels Database, Query and Result]
• Not "easy enough"
17. Connecting Systems and Services
• Data Analytics Service MUST be "connected"
[Figure: Treasure Data with labels Database, Query and Result]
18. Control Jobs/Tasks
• A Job needs results of other jobs
[Figure: time-based schedule A,B,C -> D,E -> F, marked "Risky". Timeline labels: 01:00AM, 03:10AM ?, 03:30AM, 06:30AM ?, 07:00AM, 10:00AM]
• "Risk" for failures
[Figure: the same time-based schedule, marked "Risky", with a "Crash!" after 01:00AM. Timeline labels: 01:00AM, 03:30AM ("Oops, No Data..."), 07:00AM ("Oops, No Data..."), 08:15AM ?, 10:00AM]
19. Control Jobs/Tasks
• A Job needs results of other jobs
[Figure: time-based schedule A,B,C -> D,E -> F with "Space for Retries" between stages. Timeline labels: 01:00AM, 03:10AM ?, 06:00AM, 08:30AM ?, 11:00AM ???]
• "Time based schedule" needs
• Wide space for retries
• Big resources for fast results (not cheap!)
20. Control Jobs/Tasks
• Workflow pattern
[Figure: workflow execution A,B,C -> D,E -> F with workflow control barriers between stages. Timeline labels: 01:00AM, 07:15AM ?, 10:00AM]
• Workflow pattern with retries
[Figure: the same workflow execution with workflow control barriers and a "Crash!". Timeline labels: 01:00AM, 10:00AM]
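The workflow pattern can be sketched as follows. This is a minimal illustration under assumptions, not the design of any particular workflow engine: `run_with_retries` and `run_workflow` are hypothetical helpers, tasks inside a stage run sequentially for simplicity (a real system would likely run them in parallel), and every exception is treated as retry-able. The point is that the next stage starts as soon as the previous stage has finished, instead of at a fixed clock time.

```python
from typing import Callable, Dict, List

Task = Callable[[], None]


def run_with_retries(name: str, task: Task, max_attempts: int, log: List[str]) -> None:
    """Run one task, retrying on any exception up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            log.append(f"{name}: ok (attempt {attempt})")
            return
        except Exception:
            log.append(f"{name}: crash (attempt {attempt})")
    raise RuntimeError(f"{name} failed after {max_attempts} attempts")


def run_workflow(stages: List[Dict[str, Task]], max_attempts: int = 3) -> List[str]:
    """Run stages in order; a stage starts only after the previous one finished."""
    log: List[str] = []
    for stage in stages:
        for name, task in stage.items():
            run_with_retries(name, task, max_attempts, log)
        # Barrier: reaching this point means every task in the stage succeeded.
    return log


attempts = {"B": 0}


def flaky_b() -> None:
    """Simulated task that crashes on its first attempt only."""
    attempts["B"] += 1
    if attempts["B"] == 1:
        raise ConnectionError("simulated crash")


def ok() -> None:
    pass


stages = [
    {"A": ok, "B": flaky_b, "C": ok},
    {"D": ok, "E": ok},
    {"F": ok},
]
log = run_workflow(stages)
```

In this run, B crashes once and is retried before D and E start, so no downstream job ever sees missing data.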
22. Retry-able Failures or Not
• "Retry-able Failures"
• Crash of compute nodes
• Communication errors
• Outages of "connected" services
• ...
• Non-"Retry-able Failures"
• SQL syntax error
• Missing data sources / Missing tables
• Wrong API key for "connected" services
• ...
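One way to express this distinction in code is to use separate exception types and retry only the transient kind. The sketch below is an assumption-laden illustration: `RetryableError`, `NonRetryableError` and `run_job` are hypothetical names, and the failures are simulated.

```python
class RetryableError(Exception):
    """Transient failure, e.g. a node crash or a communication error."""


class NonRetryableError(Exception):
    """Permanent failure, e.g. a SQL syntax error or a wrong API key."""


def run_job(job, max_attempts=3):
    """Retry only on RetryableError; any other error propagates at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(attempt)
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts, report the failure


def flaky_job(attempt):
    """Simulated job that fails twice with a transient error."""
    if attempt < 3:
        raise RetryableError("communication error (simulated)")
    return f"succeeded on attempt {attempt}"


calls = []


def broken_query(attempt):
    """Simulated job that can never succeed, however often it is retried."""
    calls.append(attempt)
    raise NonRetryableError("SQL syntax error (simulated)")


flaky_result = run_job(flaky_job)

try:
    run_job(broken_query)
    gave_up_immediately = False
except NonRetryableError:
    gave_up_immediately = True
```

The flaky job succeeds on its third attempt, while the broken query is executed exactly once: retrying it would only waste time before the deadline.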
23. Retry-able Operations ?
• For example.... :
• Run Query A
• Append result of A into B
• Count rows of B
• Failures?:
• Run Query A
• Append result of A into B ... (Failed!)
• Retry Query A
• Retry appending the result of A into B
• Count rows of B
[Figure: row diagrams of Table B for the two sequences above]
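Why a retried append is a problem can be shown with a small simulation. The sketch assumes that the append is not transactional, so a crash can leave part of the rows already written; `query_a`, `append` and the `fail_after` parameter are hypothetical and exist only for this illustration.

```python
def query_a():
    """Hypothetical Query A: always returns the same two rows."""
    return ["row-1", "row-2"]


class AppendFailed(Exception):
    pass


def append(table, rows, fail_after=None):
    """Append rows one by one; optionally crash after `fail_after` rows."""
    for i, row in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise AppendFailed("simulated crash during append")
        table.append(row)


table_b = []
try:
    append(table_b, query_a(), fail_after=1)  # crashes after writing one row
except AppendFailed:
    append(table_b, query_a())  # retry the whole step

row_count = len(table_b)
```

Query A returns two rows, but Table B ends up with three because the row written before the crash is appended again by the retry, so the final count is wrong.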
24. Idempotent Operations
• "Idempotent" (Japanese: 冪等, read "bekitō") operation
• gives the "same" result when it is executed twice or more
[Figure: Table B with rows 1, 2, 3, 4]
• Idempotent Operation:
• Run Query A
• "Replace" table B with result of A
• Count rows of B
[Figure: Table B with rows 1, 2]
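The idempotent variant can be sketched the same way. Here `query_a` and `replace_table` are hypothetical helpers and the table is just an in-memory list, so this only illustrates the idea: replacing the whole content of B gives the same state no matter how many times the step runs.

```python
def query_a():
    """Hypothetical Query A: always returns the same two rows."""
    return ["row-1", "row-2"]


def replace_table(tables, name, rows):
    """Replace the whole content of a table in one step."""
    tables[name] = list(rows)


# Table B starts with four old rows.
tables = {"B": ["old-1", "old-2", "old-3", "old-4"]}

replace_table(tables, "B", query_a())
count_after_first_run = len(tables["B"])

replace_table(tables, "B", query_a())  # retry or replay of the same step
count_after_retry = len(tables["B"])
```

Both counts are 2, so the step can be retried after a failure, or replayed later, without changing the result.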
25. Replay-able Data Analytics Workflow
• Need to do a lot of "trial-and-error"
• w/ updated queries
• w/ updated data...
• Idempotent operations make workflows "Replay-able"
• Fast trial-and-error (PDCA!) cycles
• → Fast business growth!