Gobblin Configuration Management
Min Tu, Pradhan Cadabam
Gobblin Meetup June 2016
Agenda
1. Current Solutions and Motivation: Why we built Gobblin config
2. Architecture: Gobblin config internals
3. Retention Example: How retention is configured using Gobblin config
Job Configs vs. Dataset Configs

Copy Job (Source -> Destination)
/events/loginEvent -> /events/loginEvent - 700
/events/logoutEvent -> /events/logoutEvent - 777

Requirement
- Permission for loginEvent 700
- Permission for logoutEvent 777

Option 1: One job per dataset
Copy Job 1
dest.permission = 700
whitelist = loginEvent
Copy Job 2
dest.permission = 777
whitelist = logoutEvent
- Too many jobs
- Long whitelist
- Difficult to maintain

Option 2: Prefix
Copy Job with prefix
loginEvent.dest.permission = 700
logoutEvent.dest.permission = 777
- Too many configs
- Cannot have a single config for all datasets with the same permissions
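To make the cost of the prefix option concrete, here is a minimal sketch of how a job might look up a dataset-prefixed key. The class name PrefixedJobConfig, the lookup helper, and the fallback to an un-prefixed job-wide key are assumptions made for illustration; they are not taken from Gobblin.

```java
import java.util.Properties;

// Hypothetical illustration of "Option 2: Prefix"; not Gobblin code.
final class PrefixedJobConfig {

    // Looks up "<dataset>.<key>" first, then falls back to the un-prefixed "<key>".
    // The fallback is an assumption added for this sketch.
    static String lookup(Properties jobProps, String dataset, String key) {
        String prefixed = jobProps.getProperty(dataset + "." + key);
        return prefixed != null ? prefixed : jobProps.getProperty(key);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("dest.permission", "755"); // hypothetical job-wide default
        props.setProperty("loginEvent.dest.permission", "700");
        props.setProperty("logoutEvent.dest.permission", "777");

        System.out.println(lookup(props, "loginEvent", "dest.permission"));
        System.out.println(lookup(props, "logoutEvent", "dest.permission"));
    }
}
```

Every dataset that deviates from the default needs its own prefixed line, so the job file grows with the number of datasets, and two datasets that share a permission still need two entries.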
Data Life Cycle Management Configs
[Diagram: a Conversion Job reads /events/loginEvent_Avro and writes /events/loginEvent_Orc; a Copy Job copies /events/loginEvent_Orc to its destination; Retention Jobs run on the /events/loginEvent_Orc paths.]
• Shared configs across jobs
• Destination path of conversion job is source path of copy job
• Retention job works on destination path of copy job
• Dataset needs to be enabled in all jobs
Other Motivations
• A new version of configs should be deployable without deploying new binaries
• It should be easy to roll back to a previous stable version of configs
• Config changes should have an audit trail
• Support for complex value types and substitution resolution
Dataset Configuration Management
At a very high level, we extend typesafe config with:
• Abstraction of a Config Store
• Config versioning
• Support for logical “import” URIs
• Ability to traverse the “import” relationships
Architecture (layers, top to bottom)
• Client Application
• ConfigClient API
• ConfigStore API
• Store implementations: HadoopFS Store, Hive MetaStore Adapter, MySQL Adapter, Zookeeper Adapter, …
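Data Model
Config Store
Dataset config key (URI): /events
Dataset config key (URI): /events/loginEvent
  Key1: value1
  Key2: value2
  …
  KeyM: valueM
Tag config key (URI): /tags
Tag config key (URI): /tags/highPriority
  keyA: valueX
  keyB: valueY
Relationships in the diagram:
• Implicit import: /events/loginEvent implicitly imports /events, and /tags/highPriority implicitly imports /tags
• imports / imported by: a dataset config key imports a tag config key, and the tag config key is imported by that dataset config key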
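The import relationships in the data model can be sketched as a small graph traversal. This is an in-memory illustration only: the class ImportGraphSketch and its methods are hypothetical, the explicit link from /events/shareEvent to /tags/highPriority is borrowed from the retention example later in the deck, and the choice to give top-level keys no parent is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of config keys and their imports; not Gobblin code.
final class ImportGraphSketch {
    // Explicit imports declared per config key (store-relative paths).
    private final Map<String, List<String>> explicitImports;

    ImportGraphSketch(Map<String, List<String>> explicitImports) {
        this.explicitImports = explicitImports;
    }

    // Parent path of a key, or null for a top-level key such as "/events".
    static String parentOf(String key) {
        int idx = key.lastIndexOf('/');
        return idx <= 0 ? null : key.substring(0, idx);
    }

    // Direct imports: explicit ones first, then the implicit import of the parent.
    List<String> directImports(String key) {
        List<String> explicit = explicitImports.getOrDefault(key, Collections.<String>emptyList());
        List<String> result = new ArrayList<>(explicit);
        String parent = parentOf(key);
        if (parent != null) {
            result.add(parent);
        }
        return result;
    }

    // With recursive=true, also follows the imports of each imported key (depth-first).
    Set<String> imports(String key, boolean recursive) {
        Set<String> seen = new LinkedHashSet<>();
        collect(key, recursive, seen);
        return seen;
    }

    private void collect(String key, boolean recursive, Set<String> seen) {
        for (String imported : directImports(key)) {
            // seen.add returns false for keys already visited, which also guards against cycles.
            if (seen.add(imported) && recursive) {
                collect(imported, true, seen);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> explicit = new HashMap<>();
        explicit.put("/events/shareEvent", Arrays.asList("/tags/highPriority"));
        ImportGraphSketch graph = new ImportGraphSketch(explicit);

        System.out.println(graph.imports("/events/shareEvent", false)); // [/tags/highPriority, /events]
        System.out.println(graph.imports("/events/shareEvent", true));  // [/tags/highPriority, /tags, /events]
    }
}
```

The reverse direction (which keys are imported by a tag) would walk the same edges backwards; it is left out to keep the sketch short.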
HOCON format
• Supports Java Properties files
• Supports JSON files
• Value substitution
• “+=” syntax to append elements to arrays, e.g. path += "/bin"
• …
Example:
gobblin.retention : {
  selection {
    timeBased.lookbackTime=3y
  }
}
Using Configs in code
ConfigClient client =
ConfigClient.createConfigClient(VersionStabilityPolicy policy);
Config config = client.getConfig(URI uri);
Collection<URI> imports = client.getImports(URI dataset, boolean recursive);
Collection<URI> importedBy = client.getImportedBy(URI tag, boolean recursive);
Config lifecycle at LinkedIn
Example of a config store on HDFS
ROOT
├── _CONFIG_STORE              // contents = latest non-rolled-back version
└── 1.0.53                     // version directory
    ├── events
    │   ├── main.conf
    │   ├── loginEvent
    │   │   ├── main.conf      // configuration file for /events/loginEvent
    │   │   └── includes.conf  // specify import links for /events/loginEvent
    │   ├── shareEvent
    │   │   └── includes.conf
    │   └── clickEvent
    │       └── includes.conf
    │
    └── tags
        ├── highPriority
        │   ├── main.conf      // configuration file for /tags/highPriority
        │   └── includes.conf  // specify import links for /tags/highPriority
        ├── blacklist
        └── 10Days
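The layout above implies a simple mapping from a version plus a config key to files on disk. The sketch below shows that mapping with java.nio.file paths; the class StoreLayoutSketch and its method names are hypothetical, and it only builds paths without touching HDFS or reading _CONFIG_STORE.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical path arithmetic for the directory layout shown on the slide; not Gobblin code.
final class StoreLayoutSketch {

    // Directory of a config node: <root>/<version>/<config key without its leading slash>.
    static Path nodeDir(Path storeRoot, String version, String configKey) {
        String relative = configKey.startsWith("/") ? configKey.substring(1) : configKey;
        return storeRoot.resolve(version).resolve(relative);
    }

    // File holding the node's own key/value pairs.
    static Path mainConf(Path storeRoot, String version, String configKey) {
        return nodeDir(storeRoot, version, configKey).resolve("main.conf");
    }

    // File listing the node's explicit imports.
    static Path includesConf(Path storeRoot, String version, String configKey) {
        return nodeDir(storeRoot, version, configKey).resolve("includes.conf");
    }

    public static void main(String[] args) {
        Path root = Paths.get("ROOT");
        System.out.println(mainConf(root, "1.0.53", "/events/loginEvent"));
        System.out.println(includesConf(root, "1.0.53", "/tags/highPriority"));
    }
}
```

Because the version is a path component, pointing readers at an older version directory is enough to roll back, which matches the versioning motivation earlier in the deck.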
Retention
• Deleting data that is not required
• Most common retention policy is to delete data older than some days
Example
• Retention policy of 10 days for loginEvent
• Retention policy of 30 days for logoutEvent
Before Retention
events
├── loginEvent
│   ├── 2016-06-20.avro
│   └── 2016-06-25.avro
└── logoutEvent
    ├── 2016-05-10.avro
    └── 2016-06-10.avro
After Retention
events
├── loginEvent
│   └── 2016-06-25.avro
└── logoutEvent
    └── 2016-06-10.avro
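The before/after example can be reproduced with a few lines of date arithmetic. This is a minimal sketch of a "delete data older than N days" rule, not Gobblin's implementation: the class TimeBasedRetentionSketch is hypothetical, and the run date of 2016-07-01 is an assumption chosen so that the result matches the slide.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of time-based selection; not Gobblin code.
final class TimeBasedRetentionSketch {

    // Returns the versions dated strictly before (today - lookbackDays): the deletion candidates.
    static List<LocalDate> selectForDeletion(List<LocalDate> versions, LocalDate today, int lookbackDays) {
        LocalDate cutoff = today.minusDays(lookbackDays);
        List<LocalDate> selected = new ArrayList<>();
        for (LocalDate version : versions) {
            if (version.isBefore(cutoff)) {
                selected.add(version);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2016, 7, 1); // assumed run date
        List<LocalDate> login = Arrays.asList(LocalDate.of(2016, 6, 20), LocalDate.of(2016, 6, 25));
        List<LocalDate> logout = Arrays.asList(LocalDate.of(2016, 5, 10), LocalDate.of(2016, 6, 10));

        System.out.println(selectForDeletion(login, today, 10));  // [2016-06-20]
        System.out.println(selectForDeletion(logout, today, 30)); // [2016-05-10]
    }
}
```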
More complex use cases in Production
• Default retention policy of 30 days for all events
• Retention policy of 10 days for loginEvent
• Blacklist retention for clickEvent
• 3 years retention for high priority events like shareEvent
Retention Config
● “events” is the common parent block for “shareEvent”, “loginEvent”, “logoutEvent”, “clickEvent”
● Each block implicitly imports configs from its parent block, e.g. “logoutEvent” implicitly imports “events” (dashed lines)
● Any block can explicitly import any other block (solid lines)
● A child block overrides any key-value pairs specified in the parent block
● “logoutEvent” inherits the default retention of 30 days from its implicit import, “events”
Result: logoutEvent 30 Days
● “loginEvent” inherits the default retention of 30 days from its implicit import, “events”
● “loginEvent” defines a 10-day policy, which overrides the 30 days inherited from “events”
Result: loginEvent 10 Days
Retention Config for share/clickEvent
● “shareEvent” explicitly imports a high priority tag which has a retention of 3 years
● “clickEvent” explicitly imports the blacklist tag, which disables retention for “clickEvent”
HDFS Config store
├── events
│   ├── main.conf              // Default 30 Days
│   ├── loginEvent
│   │   └── main.conf          // 10 Days
│   ├── shareEvent
│   │   └── includes.conf      // Import /tags/highPriority
│   └── clickEvent
│       └── includes.conf      // Import /tags/blacklist
│
└── tags
    ├── highPriority
    │   └── main.conf          // Define 3 Years retention
    └── blacklist
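The override rules described on the preceding slides can be sketched as a lookup that checks a block's own values, then its explicit imports, then its implicit parent. The class RetentionResolutionSketch is hypothetical, and the precedence of explicit imports over the parent is an assumption that is consistent with shareEvent ending up at 3 years rather than 30 days. The blacklist tag is left out because the slides do not show which keys it sets.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resolution of one property across own values, explicit imports and parent; not Gobblin code.
final class RetentionResolutionSketch {
    static final String LOOKBACK = "gobblin.retention.selection.timeBased.lookbackTime";

    final Map<String, Map<String, String>> ownConfig = new HashMap<>();
    final Map<String, List<String>> explicitImports = new HashMap<>();

    // Own value first, then explicit imports in declaration order, then the implicit parent.
    // Assumes the import graph has no cycles.
    String resolve(String key, String property) {
        String own = ownConfig.getOrDefault(key, Collections.<String, String>emptyMap()).get(property);
        if (own != null) {
            return own;
        }
        for (String imported : explicitImports.getOrDefault(key, Collections.<String>emptyList())) {
            String value = resolve(imported, property);
            if (value != null) {
                return value;
            }
        }
        int idx = key.lastIndexOf('/');
        return idx <= 0 ? null : resolve(key.substring(0, idx), property);
    }

    // Builds the store contents described on the slides (without the blacklist tag).
    static RetentionResolutionSketch slideExample() {
        RetentionResolutionSketch store = new RetentionResolutionSketch();
        store.ownConfig.put("/events", Collections.singletonMap(LOOKBACK, "30d"));
        store.ownConfig.put("/events/loginEvent", Collections.singletonMap(LOOKBACK, "10d"));
        store.ownConfig.put("/tags/highPriority", Collections.singletonMap(LOOKBACK, "3y"));
        store.explicitImports.put("/events/shareEvent", Arrays.asList("/tags/highPriority"));
        return store;
    }

    public static void main(String[] args) {
        RetentionResolutionSketch store = slideExample();
        System.out.println(store.resolve("/events/logoutEvent", LOOKBACK)); // 30d
        System.out.println(store.resolve("/events/loginEvent", LOOKBACK));  // 10d
        System.out.println(store.resolve("/events/shareEvent", LOOKBACK));  // 3y
    }
}
```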
Retention Config Examples
/events/main.conf
gobblin.retention : {
  dataset : {
    finder.class=gobblin.data.management.retention.CleanableDatasetFinder
    pattern="/events/*"
  }
  selection {
    policy.class = gobblin.data.management.SelectBeforeTimeBasedSelectionPolicy
    timeBased.lookbackTime=30d
  }
  version : {
    finder.class=gobblin.data.management.DateTimeDatasetVersionFinder
  }
}
/tags/highPriority/main.conf
gobblin.retention : {
  selection {
    timeBased.lookbackTime=3y
  }
}
Supported Policies
• SelectBeforeTimeBasedSelectionPolicy
• NewestKSelectionPolicy
• DailyDependentHourlyPolicy
• CombineSelectionPolicy
More policies - http://gobblin.readthedocs.io/en/latest/data-management/Gobblin-Retention/
Future work
• Config stores other than the HDFS-based config store
• Improve tooling, validation and UI for config store deployment
Questions


Editor's Notes

  • #9 Config versioning: for a stable config store, once a version has been deployed it should not be changed
  • #11 Two logical types: datasets and tags. Both are in a tree hierarchy and inherit/override values from the parent and from imported tags
  • #12 Fix font; add HOCON example
  • #13 CROSS_JVM_STABILITY, STRONG_LOCAL_STABILITY, WEAK_LOCAL_STABILITY, READ_FRESHEST. Handle the case where the config is updated while a version is released (configStore is NOT stable). Does calling getConfig multiple times return the same value?
  • #15 Each directory is one config node (dataset or tag)