1. Min Tu Pradhan Cadabam
Gobblin Configuration
Management
Gobblin Meetup June 2016
2. Agenda
1. Current Solutions and Motivation – Why we built Gobblin config?
2. Architecture – Gobblin config internals
3. Retention Example – How retention is configured using Gobblin config?
4. Job Configs Vs. Dataset Configs
Copy Job
Source:
/events/loginEvent
/events/logoutEvent
Destination:
/events/loginEvent - 700
/events/logoutEvent - 777
- Permission for loginEvent 700
- Permission for logoutEvent 777
Option 1 : One job per dataset
Copy Job 1
dest.permission = 700
whitelist = loginEvent
Copy Job 2
dest.permission = 777
whitelist = logoutEvent
- Too many jobs
- Long whitelist
- Difficult to maintain
Option 2 : Prefix
Copy Job with prefix
loginEvent.dest.permission = 700
logoutEvent.dest.permission = 777
- Too many configs
- Can not have single config for all datasets with same permissions
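The drawback of the prefix option can be sketched in a few lines of Python. This is only an illustration: the `dest_permission` and `keys_needed` helpers are hypothetical and are not part of Gobblin; the property names are the ones shown on the slide.

```python
# Flat job properties, as in "Copy Job with prefix".
job_props = {
    "loginEvent.dest.permission": "700",
    "logoutEvent.dest.permission": "777",
}

def dest_permission(props, dataset):
    """Return the dataset-prefixed permission, or None if no key exists."""
    return props.get(f"{dataset}.dest.permission")

def keys_needed(datasets):
    """One prefixed key per dataset, even when the permissions are identical."""
    return [f"{d}.dest.permission" for d in datasets]

login_permission = dest_permission(job_props, "loginEvent")
click_permission = dest_permission(job_props, "clickEvent")  # not configured
required = keys_needed(["loginEvent", "logoutEvent", "clickEvent"])
```

Every new dataset adds another key to the job, and datasets with the same permission cannot share a single entry.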
5. Data Life Cycle Management Configs
Conversion Job: /events/loginEvent_Avro -> /events/loginEvent_Orc
Copy Job: /events/loginEvent_Orc (source) -> /events/loginEvent_Orc (destination)
Retention Job: runs on /events/loginEvent_Orc
• Shared configs across jobs
• Destination path of conversion job is source path of copy job
• Retention job works on destination path of copy job
• Dataset needs to be enabled in all jobs
6. Other Motivations
• New version of configs should be deployable
without deploying new binaries
• Should be easy to roll back to the previous stable
version of configs
• Config changes should have an audit trail
• Complex value types and substitution resolution
support
8. Dataset Configuration Management
At a very high-level, we extend typesafe config with:
• Abstraction of a Config Store
• Config versioning
• Support for logical “import” URIs
• Ability to traverse the “import” relationships
14. Example of a config store on HDFS
ROOT
├── _CONFIG_STORE               // contents = latest non-rolled-back version
└── 1.0.53                      // version directory
    ├── events
    │   ├── main.conf
    │   ├── loginEvent
    │   │   ├── main.conf       // configuration file for /events/loginEvent
    │   │   └── includes.conf   // specify import links for /events/loginEvent
    │   ├── shareEvent
    │   │   └── includes.conf
    │   └── clickEvent
    │       └── includes.conf
    └── tags
        ├── highPriority
        │   ├── main.conf       // configuration file for /tags/highPriority
        │   └── includes.conf   // specify import links for /tags/highPriority
        ├── blacklist
        └── 10Days
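As a rough sketch of how a logical node URI could map onto this layout, the Python below builds the two file paths for a node. The `config_paths` helper is hypothetical (it is not a Gobblin API), and it assumes the version directory sits directly under the store root.

```python
import posixpath

def config_paths(store_root, version, node_uri):
    """Map a logical node URI to its main.conf and includes.conf paths.

    Assumption: <store_root>/<version>/<node path>/ holds both files.
    """
    node_dir = posixpath.join(store_root, version, node_uri.strip("/"))
    return {
        "main": posixpath.join(node_dir, "main.conf"),
        "includes": posixpath.join(node_dir, "includes.conf"),
    }

paths = config_paths("/ROOT", "1.0.53", "/events/loginEvent")
```

Under these assumptions, rolling back means resolving the same URI against an older version directory.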
16. Retention
• Deleting data that is not required
• Most common retention policy is to delete data older than a given number of days
Example
• Retention policy of 10 days for loginEvent
• Retention policy of 30 days for logoutEvent
Before Retention
└── events
    ├── loginEvent
    │   ├── 2016-06-20.avro
    │   └── 2016-06-25.avro
    └── logoutEvent
        ├── 2016-05-10.avro
        └── 2016-06-10.avro
After Retention
└── events
    ├── loginEvent
    │   └── 2016-06-25.avro
    └── logoutEvent
        └── 2016-06-10.avro
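The time-based policy in this example can be sketched in Python as follows. The `expired` helper is hypothetical, and the run date of 2016-07-01 is an assumption: the slide does not say when retention ran, only which files survived.

```python
from datetime import date, timedelta

def expired(files, retention_days, today):
    """Return the files whose date-stamped name is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [
        name for name in files
        if date.fromisoformat(name.split(".")[0]) < cutoff
    ]

run_date = date(2016, 7, 1)  # assumed run date, not given in the slides

login_deleted = expired(["2016-06-20.avro", "2016-06-25.avro"], 10, run_date)
logout_deleted = expired(["2016-05-10.avro", "2016-06-10.avro"], 30, run_date)
```

With that run date, the 10-day policy removes only 2016-06-20.avro and the 30-day policy removes only 2016-05-10.avro, which matches the before and after listings.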
17. More complex use cases in Production
• Default retention policy of 30 days for all events
• Retention policy of 10 days for loginEvent
• Blacklist retention for clickEvent
• 3 years retention for high priority events like shareEvent
18. Retention Config
● “events” is the common parent block for “shareEvent”, “loginEvent”, “logoutEvent”, “clickEvent”
● Each block implicitly imports configs from its parent block; for example, “logoutEvent” implicitly imports “events” (dashed lines)
● Any block can explicitly import any other block (solid lines)
● A child block overrides any key-value pairs specified in the parent block
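One way to picture the traversal of implicit and explicit imports is the Python sketch below. It is not Gobblin's implementation: `parent_of` and `resolution_order` are hypothetical helpers, and giving explicit imports priority over the implicit parent import is an assumption that the slides do not state.

```python
def parent_of(node):
    """Implicit import: the enclosing block, e.g. /events for /events/loginEvent."""
    head = node.rsplit("/", 1)[0]
    return head or None

def resolution_order(node, explicit_imports):
    """List the blocks consulted for a node, highest priority first."""
    order = []

    def visit(current):
        if current is None or current in order:
            return
        order.append(current)
        # Assumption: explicit imports are consulted before the implicit parent.
        for imported in explicit_imports.get(current, []):
            visit(imported)
        visit(parent_of(current))

    visit(node)
    return order

imports = {"/events/shareEvent": ["/tags/highPriority"]}
logout_chain = resolution_order("/events/logoutEvent", imports)
share_chain = resolution_order("/events/shareEvent", imports)
```

Here `logout_chain` contains the block itself followed by "/events", while `share_chain` also passes through "/tags/highPriority" and its parent "/tags" before reaching "/events".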
19. logoutEvent 30 Days
● “logoutEvent” inherits the default retention of 30 days from its implicit import, “events”
20. loginEvent 10 Days
● “loginEvent” inherits the default retention of 30 days from its implicit import, “events”
● “loginEvent” defines a 10 days policy which overrides the 30 days inherited from “events”
21. Retention Config for share/clickEvent
● “shareEvent” explicitly imports a high priority tag which has retention of 3 years
● “clickEvent” explicitly imports blacklist tag which disables retention for “clickEvent”
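The four outcomes from the retention slides can be reproduced with a small in-memory model. This is a sketch under assumptions, not the actual store: the key names `retention.days` and `retention.enabled` are invented for illustration, 3 years is approximated as 3 * 365 days, and explicit imports are assumed to override the implicit parent.

```python
# Hypothetical in-memory config store: node -> own config and explicit imports.
store = {
    "/events": {"config": {"retention.days": 30}, "imports": []},
    "/events/logoutEvent": {"config": {}, "imports": []},
    "/events/loginEvent": {"config": {"retention.days": 10}, "imports": []},
    "/events/shareEvent": {"config": {}, "imports": ["/tags/highPriority"]},
    "/events/clickEvent": {"config": {}, "imports": ["/tags/blacklist"]},
    "/tags/highPriority": {"config": {"retention.days": 3 * 365}, "imports": []},
    "/tags/blacklist": {"config": {"retention.enabled": False}, "imports": []},
}

def resolve(node):
    """Merge parent config, then explicit imports, then the node's own keys."""
    if node not in store:
        return {}
    entry = store[node]
    parent = node.rsplit("/", 1)[0]
    merged = dict(resolve(parent)) if parent else {}
    for imported in entry["imports"]:
        merged.update(resolve(imported))  # assumed to override the parent
    merged.update(entry["config"])        # own keys always win
    return merged

logout_cfg = resolve("/events/logoutEvent")
login_cfg = resolve("/events/loginEvent")
share_cfg = resolve("/events/shareEvent")
click_cfg = resolve("/events/clickEvent")
```

In this model logoutEvent resolves to 30 days, loginEvent to 10 days, shareEvent to the high priority value, and clickEvent carries the flag that disables retention.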
Config versioning – For a stable config store, once a version has been deployed it should not be changed
- two logical types: dataset, tags
- both in a tree hierarchy
- inherit/override parent/imported tags
CROSS_JVM_STABILITY
STRONG_LOCAL_STABILITY
WEAK_LOCAL_STABILITY
READ_FRESHEST
Handle case: config updated while version released (configStore is NOT stable)
Calling getConfig multiple times: get the same value?
Each directory is one config node (dataset, tags)