Speaker: Nitin Motgi, Cask
Big Data Applications Meetup, 09/20/2017
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to video: https://www.youtube.com/watch?v=FnQwDaKii2U
About the talk:
Business Rules are statements that describe business policies or procedures to process data. Rules engines, or inference engines, execute business rules in a runtime production environment and have become commonplace in many IT applications. The world of big data is the exception: there has been a gap for a horizontally scalable, lightweight, inference-based business rules engine for big data processing.
In this session, you learn about Cask’s new business Rules Engine built on top of CDAP, a sophisticated if-then-else statement interpreter that runs natively on big data systems such as Spark, Hadoop, Amazon EMR, Azure HDInsight and GCE. It provides an alternative computational model for transforming your data while empowering business users to specify and manage the transformations and policy enforcements.
In his talk, Nitin Motgi, Cask co-founder and CTO, demonstrates this new, distributed rules engine and explains how business users in big data environments can make decisions on their data, enforce policies, and be an integral part of the data ingestion and ETL process. He also shows how business users can write, manage, deploy, execute and monitor business data transformations and policy enforcements.
Check out http://bdam.io/ for more info on the Big Data Apps meetup!
Introducing a horizontally scalable, inference-based business Rules Engine for Big Data processing
1. Big Data on Tap
cask.co
rule Distributed-Rules-Engine-DRE {
description ‘Presentation of Distributed Rules Engine(DRE)’
when(presenter ~= “Nitin Motgi” && title ~= “Co-founder & CTO” &&
datetime = today())
then {
welcome;
present;
question-and-answer;
}
}
2. 2
Who is Cask
AT&T, Cloudera and Ericsson
Strategic Investors
First Unified Integration Platform for
rapid time-to-value from Big Data
Unique Value Proposition
AT&T, Cloudera, Ericsson, Hortonworks, IBM,
MapR, Microsoft, Salesforce, Thomson Reuters, …
Key Customers & Partners
By early Hadoop engineers from
Facebook and Yahoo!
Founded in 2011
Andreessen Horowitz, Safeguard,
Battery Ventures and Ignition Partners
Raised $37+ Million
Featuring Cask Market,
the “big data app store”
Mature Platform: CDAP 4
A Container Architecture that puts
Big Data on Tap
Why “Cask” ?
3. 3
What is Cask Data Application Platform (CDAP)
Runtime
A unified platform for
building integrated data analytics tools and applications and
delivering specialized frameworks and solutions that
enable enterprises to rapidly extract value from data
APIs Tools Frameworks Market
4. 4
Pre-Built Tools and Frameworks
Data Prep — Framework
Data Pipeline — Framework
• Data Preparation for on-boarding new sources and datasets
• Perform Data Transformations, Data Quality checks with visual
feedback
• Extend the Data Prep by building new user defined directives
• Integrates with Data Pipeline for operationalizing transformations
• User interface for building complex data workflows
• Join, Lookup, Aggregate, Filtering data in-flight
• Building complex workflows with 100s of connectors
• Extend Data Pipeline using simple APIs
• Integrates with Data Prep, Rules Engine and Metadata Aggregator
5. 5
Pre-Built Tools and Frameworks
Rules Engine — Tool
Metadata Aggregator — Tool
• Business Data Transformations and checks codified for business users
• Define complex rules using an intuitive, simple-to-use user interface
• Logically group Rules in Rulebook and trigger or schedule
processing.
• Integrate with Data Pipeline for operationalizing Rules
• Aggregate Business, Technical and Operational Metadata
• Track the flow of data (Lineage) for richer data needed for governance
• Create Data Dictionary and Metadata Repository
• Integrate with enterprise MDM solutions.
• Integrates with Data Pipeline, Rules Engine
6. 6
Pre-Built Tools and Frameworks
Microservice — Framework
Event Condition Action (ECA) — Application
• Build specialized logic for processing data
• Create loosely coupled network for processing events
• Connect them using Amazon SQS, Websocket, MapR Streams and
Kafka
• Delivers specialized solutions for IoT event processing
• Parses any event, triggers conditions and executes actions.
• Real-time notification system, with easy-to-use user interface for
configuring event parsing, condition and actions
7. 7
Realizing Value from Data
Empower
Business Users
On-Board New
Data Sources and
Types
Integrate with
Cloud and
Enterprise
Ecosystem
Build Data
Processing
Workflows
Integrate
Machine Learning
and Model
Management
Expand and
Customize to
build new Data
Analytics
Applications
Data is
Your Asset
Cloud Connectors +
Run on Public Cloud +
Integrate with Existing Apps
Platform
Automate
Operationalize
Intelligence
Integration
Extend & Innovate
Business Value
ML Integration
Data Prep
Rules Engine
Data Pipelines
8. 8
What is a Rules Engine?
“Externalize the business
logic that is dynamic”
“Make code Intuitive and readable,
easily understood by business people”
“Simplify complicated requirements with the
declarative logic, raise level of abstraction”
9. 9
Why Rules Engine?
Rules Engines are a great way to collect all complex
decision-making logic involved in Data Integration
and work with large datasets
10. 10
When to use a Rules Engine?
“When transformation logic changes often or new feeds
are constantly on-boarded into the Data Lake”
“Want to involve non-technical domain experts in
Ingestion and Integration of Big Data”
“Separate code and logic”
13. 13
Data Transformation Flow with Cask DRE
• Involve business analysts or non-
technical domain experts in Big
Data
• Transform data using declarative
language instead of imperative
• Centralize data transformation
into a self-documented
knowledge base.
• Real-time or Batch - Do it exactly
the same way at scale on Hadoop
or Spark
• Integrate with workflow to
operationalize at scale.
15. 15
Rule Execution
age vin model
32 … van
45 … sports
16 … truck
17 … sports
19 … sedan
13 … van
56 … sedan
46 … sports
28 … van
Input Table
A record is an individual row in the
table and is made up of columns.
Each cell contains the value
associated with its column.
Working Set DRE
Rule Repository
Each record is treated as the
working set on which the DRE
operates
DRE updates working set based
on the Actions of Rules fired
Outcome
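The execution model on this slide can be sketched in a few lines of Python. This is only an illustration, not the DRE implementation: the `Rule` type, the `run_rules` function, the `high_risk` column and the sample rule are hypothetical names introduced here. Each record is copied into a working set, and the action of every rule whose condition holds updates that working set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]  # one row of the input table: column name -> value


@dataclass
class Rule:
    """Hypothetical stand-in for a rule: a condition plus an action."""
    name: str
    when: Callable[[Record], bool]
    then: Callable[[Record], None]


def run_rules(rules: List[Rule], table: List[Record]) -> List[Record]:
    """Treat each record as a working set and apply every rule that fires."""
    outcome = []
    for record in table:
        working_set = dict(record)      # copy so the input table is untouched
        for rule in rules:
            if rule.when(working_set):
                rule.then(working_set)  # the action updates the working set
        outcome.append(working_set)
    return outcome


# Illustrative rule and rows; the vin column is omitted for brevity.
rules = [
    Rule("flag-young-sports",
         when=lambda r: r["age"] < 25 and r["model"] == "sports",
         then=lambda r: r.update(high_risk=True)),
]
table = [
    {"age": 45, "model": "sports"},
    {"age": 17, "model": "sports"},
    {"age": 19, "model": "sedan"},
]
outcome = run_rules(rules, table)
```

Only the second row satisfies the condition, so only its working set gains the extra column. The input table itself is left unchanged.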
16. 16
Benefits of Cask DRE
Cask Distributed Rules Engine (DRE) is a tool that helps implement
a production rule system with forward chaining that
Makes it easier to program data transformations for Big Data
Enables non-technical domain experts
Integration with Apache
Spark/Spark Streaming
& MapReduce
Insights into Rule
Execution
~ 1K pre-built
Actions
Interactive
User Interface
17. 17
Details of DRE
• Records represented as in-memory working set
• Set of Rules that declaratively define conditions
• Actions executed or inference derived based on the rules.
• Planner to execute Rules in map() phase or reduce() phase.
Forward Chaining
Starts with data or records and
triggers actions or generates an
outcome — Target State.
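Forward chaining in general can be sketched as a loop that keeps firing rules until nothing changes. The code below is a textbook-style illustration under stated assumptions, not Cask's planner: the `forward_chain` function, the rule tuples and the `risk` and `surcharge` columns are hypothetical, and each rule is allowed to fire at most once.

```python
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, object]
# Each rule is (name, condition, action); all names here are illustrative.
Rule = Tuple[str, Callable[[Facts], bool], Callable[[Facts], None]]


def forward_chain(facts: Facts, rules: List[Rule]) -> List[str]:
    """Fire rules until no rule changes the working set; return firing order."""
    fired: List[str] = []
    changed = True
    while changed:
        changed = False
        for name, condition, action in rules:
            if name in fired or not condition(facts):
                continue            # fire each rule at most once
            action(facts)
            fired.append(name)
            changed = True          # new facts may enable other rules
    return fired


rules: List[Rule] = [
    ("assign-surcharge",
     lambda f: f.get("risk") == "high",
     lambda f: f.update(surcharge=0.15)),
    ("classify-risk",
     lambda f: f["age"] < 25 and f["model"] == "sports",
     lambda f: f.update(risk="high")),
]
facts = {"age": 17, "model": "sports"}
order = forward_chain(facts, rules)
```

Although `assign-surcharge` is listed first, it can only fire after `classify-risk` has added the `risk` fact, which is the data-driven behaviour that forward chaining describes.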
18. 18
Rules
Allows users to specify the requirements or knowledge
about processing using
— Declarative (say what should happen, not how to do it),
— Logic based languages.
19. 19
Rules
When condition is satisfied
based on data in working set.
Condition Action
Then, apply the actions on the
data in the working set.
If dark, then turn on the lights
If thirsty, then drink some water
If sleepy, then go to bed
20. 20
Anatomy of a Rule
rule <id> {
description ‘<description>’
when(<LHS>)
then {
<RHS>
}
}
When LHS is satisfied, RHS is executed
Version of Rulebook - Any modifications
in Rule will update Rulebook version
Describes what the rule
actually does.
Defines the condition to be met
or satisfied before actions are
executed
Actions to be executed when LHS
evaluates to true.
21. 21
Example of a Rule
rule Minimum-Age {
description ‘Increase the premium by 15% if driver is less than 25 and drives a sports car’
when(age < 25 && model == “sports”)
then {
set-column premium (premium + premium * 0.15)
}
}
The following Rule increases the premium by 15% when the driver is younger than
25 and drives a sports car.
When a record has age less than 25 and model is sports, the action that updates the premium is triggered
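As a worked example of the arithmetic, the same rule can be mirrored in plain Python. This is a sketch only: `apply_minimum_age` and the sample premium of 1000 are illustrative and not part of the rule language.

```python
def apply_minimum_age(record: dict) -> dict:
    """Mirror the Minimum-Age rule: raise the premium by 15% when it fires."""
    updated = dict(record)  # leave the caller's record untouched
    if updated["age"] < 25 and updated["model"] == "sports":
        updated["premium"] = updated["premium"] + updated["premium"] * 0.15
    return updated


# A 17-year-old with a sports car pays 15% more; a 45-year-old does not.
young = apply_minimum_age({"age": 17, "model": "sports", "premium": 1000.0})
older = apply_minimum_age({"age": 45, "model": "sports", "premium": 1000.0})
```

With a starting premium of 1000, the first record ends up at 1000 + 150 = 1150 and the second stays at 1000.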
23. 23
Rule Patterns
rule Minimum-Age {
description ‘Filter Records where the age of person is less than 17’
when(age < 17)
then {
filter-row-if-true true;
}
}
The following Rule rejects records, or sends them to error, where the person's age is less than 17.
Conditions support <, >, ==, <=, >=, matches / not matches, contains / not contains.
The following Rule provides a discount if the customer is married and older than 25.
rule Discount {
description ‘Provide discount if customer is married and is older than 25’
when(married == true && age > 25)
then {
set-column discount 10;
}
}
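One way to picture the condition operators listed on this slide is as a lookup table of comparison functions. The sketch below is an assumption-laden illustration, not the engine's evaluator: `OPERATORS` and `check` are hypothetical names, and "matches" is assumed here to mean a full regular-expression match.

```python
import operator
import re
from typing import Any, Callable, Dict

OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    # Assumption: "matches" means a full regular-expression match.
    "matches": lambda value, pattern: re.fullmatch(pattern, str(value)) is not None,
    "not matches": lambda value, pattern: re.fullmatch(pattern, str(value)) is None,
    "contains": lambda value, part: part in str(value),
    "not contains": lambda value, part: part not in str(value),
}


def check(record: Dict[str, Any], column: str, op: str, expected: Any) -> bool:
    """Evaluate one `column op expected` comparison against a record."""
    return OPERATORS[op](record[column], expected)


# The Discount rule's condition, expressed with the lookup table.
customer = {"married": True, "age": 31, "email": "a@example.com"}
gets_discount = (check(customer, "married", "==", True)
                 and check(customer, "age", ">", 25))
```

Keeping the operators in a table means a condition can be stored as data (column, operator, value), which fits the idea of rules being edited through a user interface.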
24. 24
Rule Actions
• Any Data Prep Directive — Micro Data Transformation
Instructions
• Operations on working set
• Update Column — Will tell the engine the column has
changed
• Insert new Column — Will add a new column to the working
set
• Delete Column — Will remove a column from the working set
• Stateful - Temporary variables available across Rules.
rule <Rule-Id> {
…
then {
<data prep directive>;
<data prep directive>;
<data prep directive>;
<data prep directive>;
}
}
25. 25
Rule Example 1
If the dataset contains an SSN field, always mask it so that only the last four
digits of the SSN are preserved
rule Mask-SSN {
description ‘Mask SSN to format xxx-xx-####’
when(present(ssn))
then {
mask-number ssn xxx-xx-####
}
}
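The effect of the mask pattern can be illustrated with a small Python function. This is a sketch under an assumption drawn from the slide's description: `#` keeps a digit, `x` hides one, and any other character is copied literally. The `mask_number` function below is a hypothetical re-implementation for illustration, not the Data Prep directive itself.

```python
def mask_number(value: str, pattern: str) -> str:
    """Apply a mask: '#' keeps a digit, 'x' hides one, others are literal."""
    digits = [ch for ch in value if ch.isdigit()]  # ignore existing separators
    out = []
    for ch in pattern:
        if ch == "#":
            out.append(digits.pop(0) if digits else "")
        elif ch == "x":
            if digits:
                digits.pop(0)                      # consume but do not reveal
            out.append("x")
        else:
            out.append(ch)                         # literal such as '-'
    return "".join(out)


masked = mask_number("123-45-6789", "xxx-xx-####")
```

Because separators are stripped before the pattern is applied, an unformatted input such as `123456789` yields the same masked value.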
26. 26
Rule Example 2
If a work location defined in the dataset being processed is not one of the allowed locations or
location is empty — send the record to error to be collected for investigation
rule Location-Validator {
description ‘Check to see the work location state is empty or one of the valid states.’
when(present(workloc) && (dq:isEmpty(workloc) || !(workloc =~ [ 'AK', 'AL', 'AR', 'AZ',
'CA' ])))
then {
send-to-error true
}
}
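The intent stated for this rule (send the record to error when the location is empty or not in the allowed list) can be written out in Python as a cross-check. This is a sketch only: `ALLOWED_STATES` repeats the sample states from the slide, and `should_send_to_error` is a hypothetical helper name.

```python
ALLOWED_STATES = {"AK", "AL", "AR", "AZ", "CA"}  # sample list from the slide


def should_send_to_error(record: dict) -> bool:
    """True when workloc is present but empty or not an allowed state."""
    if "workloc" not in record:
        return False                      # rule only applies when field exists
    workloc = record["workloc"]
    is_empty = workloc is None or str(workloc).strip() == ""
    return is_empty or workloc not in ALLOWED_STATES


results = [should_send_to_error(r) for r in (
    {"workloc": "CA"},   # allowed
    {"workloc": ""},     # empty
    {"workloc": "NY"},   # not in the sample list
    {},                  # field absent
)]
```

Note that the empty and the disallowed cases are joined with "or": either one alone is enough to route the record to error.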
27. 27
Rule Example 3
Check if Subscriber id is a number and is of length 8. If invalid Subscriber id, filter the record
out of processing
rule SubscriberId-Validator {
description ‘Validate Subscriber Id.’
when(present(subscriberId) && (!dq:isNumber(subscriberId) || subscriberId.length != 8))
then {
filter-row-if-true true
}
}
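The validity check described for this rule (a number of length 8) can be spelled out in Python. This is an illustrative sketch: `is_valid_subscriber_id` and `should_filter` are hypothetical helpers, and "is a number" is assumed here to mean ASCII digits only.

```python
def is_valid_subscriber_id(value: str) -> bool:
    """Valid means exactly eight ASCII digits (assumed reading of the slide)."""
    return value.isascii() and value.isdigit() and len(value) == 8


def should_filter(record: dict) -> bool:
    """Filter the record when subscriberId is present but invalid."""
    if "subscriberId" not in record:
        return False
    return not is_valid_subscriber_id(str(record["subscriberId"]))


checks = {
    "12345678": should_filter({"subscriberId": "12345678"}),
    "1234567": should_filter({"subscriberId": "1234567"}),
    "1234567a": should_filter({"subscriberId": "1234567a"}),
}
```

An id is invalid if it fails either test, so a non-numeric value of length 8 and a numeric value of the wrong length are both filtered.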
28. 28
Rule Example 4
If a customer is above 40 and has purchased more than 10 items from Pantry, then give them
a 9.8% discount on the total price they pay.
rule Customer-Pantry-Discount {
description ‘Customer 9.8% discount, when age above 40, purchased 10 items from pantry’
when(age > 40 && items > 10)
then {
set-column discount 9.8
set-column net_amount (amount - (amount * 0.098))
}
}
29. 29
Anatomy of a Rulebook
Specifies an Id for the Rulebook
Version of Rulebook - Any modifications
in Rule will update Rulebook version
Metadata that defines additional
information for the Rulebook.
Collection of Rules
rulebook <Id> {
version <number>
meta {
description ‘<description>’
created-date <date>
updated-date <date>
source ‘<source>’
user ‘<user>’
}
<rule>
<rule>
<rule>
. . .
<rule>
}
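The rulebook layout and its versioning behaviour can be modelled with a small Python data class. This is a hypothetical in-memory model for illustration only: the `Rulebook` class, its `put_rule` method and the sample values are assumptions, and the real storage format is not shown in the slides.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class Rulebook:
    """Hypothetical in-memory model of the rulebook layout shown above."""
    id: str
    description: str
    source: str
    user: str
    version: int = 1
    created_date: date = field(default_factory=date.today)
    updated_date: date = field(default_factory=date.today)
    rules: Dict[str, str] = field(default_factory=dict)  # rule id -> rule text

    def put_rule(self, rule_id: str, text: str) -> None:
        """Add or modify a rule; any change bumps the rulebook version."""
        if self.rules.get(rule_id) == text:
            return                      # nothing changed, keep the version
        self.rules[rule_id] = text
        self.version += 1
        self.updated_date = date.today()


book = Rulebook(id="insurance", description="Premium rules",
                source="slides", user="analyst")
book.put_rule("Minimum-Age", "when(age < 25) then { ... }")
book.put_rule("Minimum-Age", "when(age < 25) then { ... }")  # no-op
book.put_rule("Minimum-Age", "when(age < 21) then { ... }")
```

In this sketch, re-saving an identical rule leaves the version alone, while adding or editing a rule increments it, which matches the note that any modification to a rule updates the rulebook version.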
30. 30
Demo
31. 31
Technical Webinar Series
Live Technical Webinar Series: Moving Big Data Forward with Cask
RSVP at: https://cask.co/company/events
Oct 5, 2017 @ 11am PT / 2pm ET
Oct 19, 2017 @ 11am PT / 2pm ET
Oct 31, 2017 @ 11am PT / 2pm ET
Watch on demand @
cask.co/resources/webinars/
32. 32
To learn more, go to cask.co
or contact us at info@cask.co
Questions?