Big Data from the trenches
Advice from the FSI industry
By: Azrul MADISA
About me…
• VP – Enterprise Data Architect @ Maybank
• Take care of Maybank’s data worldwide
• Nuts about data, analytics and software dev.
• Very hands-on, love to read
• Teach aikido to kids
Big Data landscape today
https://www.linkedin.com/pulse/big-data-still-thing-2016-landscape-matt-turck
Too many big data tech?
Wait … what? I have to know ALL that?
Let’s change the game a bit…
Use case
The data journey
• Acquisition
• Dumping
• Tidy data
• Real-time analytics
• Analytical model
• Sandbox
Example: credit scoring and loan origination
Journey stages: Acquisition, Dumping, Tidy data, Real-time analytics, Analytical model
Diagram labels: Screens, Data staging area, Data warehouse, Score card builder, Decisioning, Sandbox, Data scientist
Acquisition with quality
• Manage data quality up front
• Human-factor data quality
Diagram labels: Data entry, Data staging, Application, Over-night, Audit trail, Weekly
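Managing quality up front can be as simple as rejecting bad records at the data-entry screen, before they reach staging. The sketch below is only an illustration: the field names (`name`, `id_number`, `monthly_income`) are hypothetical, and the ID layout is inferred from the masked example later in this deck rather than from any official specification.

```python
import re

# Hypothetical rule set: the ID layout (6 digits, 2 digits, 4 digits) is inferred
# from the example in these slides, not from an official specification.
ID_PATTERN = re.compile(r"^(\d{2})(\d{2})(\d{2})-(\d{2})-(\d{4})$")


def validate_entry(record):
    """Return a list of data quality problems found in one data-entry record."""
    problems = []
    name = record.get("name", "").strip()
    if not name:
        problems.append("name is empty")
    match = ID_PATTERN.match(record.get("id_number", ""))
    if match is None:
        problems.append("id_number does not match NNNNNN-NN-NNNN")
    else:
        # The first six digits are assumed to encode a date as YYMMDD.
        month, day = int(match.group(2)), int(match.group(3))
        if not 1 <= month <= 12:
            problems.append("id_number has an impossible month")
        if not 1 <= day <= 31:
            problems.append("id_number has an impossible day")
    income = record.get("monthly_income")
    if not isinstance(income, (int, float)) or income < 0:
        problems.append("monthly_income must be a non-negative number")
    return problems
```

An application could refuse to save a record while this list is non-empty, and write the rejected attempts to the audit trail for the periodic review.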
Acquisition with quality
• Non-human error
• Use the PEWMA (probabilistic exponentially weighted moving average) algorithm to flag anomalies
https://aws.amazon.com/blogs/iot/anomaly-detection-using-aws-iot-and-aws-lambda/
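A minimal sketch of PEWMA-style anomaly detection, for readers who want to see the moving parts. The class name, the default parameters and the threshold of three standard deviations are assumptions made for illustration; they are not taken from the linked article.

```python
import math


class Pewma:
    """Probabilistic EWMA anomaly detector (a sketch; the defaults are assumptions)."""

    def __init__(self, alpha=0.98, beta=0.5, training_period=30, threshold=3.0):
        self.alpha = alpha            # base weight given to history
        self.beta = beta              # strength of the probability-based adjustment
        self.training_period = training_period
        self.threshold = threshold    # flag points this many std devs from the mean
        self.t = 0
        self.s1 = 0.0                 # running estimate of E[x]
        self.s2 = 0.0                 # running estimate of E[x^2]

    def update(self, x):
        """Score x against the current estimate, then fold it in. True means anomalous."""
        self.t += 1
        if self.t == 1:
            self.s1, self.s2 = x, x * x
            return False
        mean = self.s1
        std = math.sqrt(max(self.s2 - self.s1 ** 2, 0.0))
        z = (x - mean) / std if std > 0 else 0.0
        if self.t <= self.training_period:
            # During training, behave like a plain cumulative average.
            a = 1.0 - 1.0 / self.t
        else:
            # Likely points (high density p) get more weight; unlikely ones get less.
            p = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
            a = self.alpha * (1.0 - self.beta * p)
        self.s1 = a * self.s1 + (1.0 - a) * x
        self.s2 = a * self.s2 + (1.0 - a) * x * x
        return self.t > self.training_period and abs(z) > self.threshold
```

Feeding each incoming value to `update` keeps the running mean and deviation current, so the detector adapts to slow drift while an outlier has only a limited pull on the estimates.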
Data sandbox
Creating a sandbox on the cloud
• Why cloud:
– Scale data discovery as needed
– Merging private with public data
– Less bureaucratic
• But…
– Customer data on the cloud is a no-no
Creating a sandbox on the cloud
• Masking
– Non-numerical data => No sweat!
– E.g.
• En. Abdul Jalil => 837x2unxy237e832!@
• 720324-03-8891 => 472376-84-8732
• Masking numerical data?
What if there were a way to mask numerical data while keeping its statistical properties intact?
Easier for the regulators to digest
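The non-numerical masking shown on this slide could be sketched as a keyed one-way substitution. This is an assumption about how such masking might be done, not the method actually used: `SECRET_KEY`, `mask_text` and `mask_digits` are hypothetical names, and HMAC-SHA256 is one reasonable choice among several.

```python
import hashlib
import hmac
import string

# Assumption: the key stays on-premises and never reaches the cloud sandbox.
SECRET_KEY = b"replace-with-a-real-secret"
ALPHABET = string.ascii_lowercase + string.digits


def _keystream(value, length):
    """Derive `length` pseudo-random bytes from the value with HMAC-SHA256."""
    out = b""
    counter = 0
    while len(out) < length:
        msg = f"{counter}:{value}".encode("utf-8")
        out += hmac.new(SECRET_KEY, msg, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def mask_text(value):
    """Replace a free-text value (e.g. a name) with a fixed-length token."""
    return "".join(ALPHABET[b % len(ALPHABET)] for b in _keystream(value, 16))


def mask_digits(value):
    """Replace every digit with a derived digit, keeping separators in place."""
    stream = _keystream(value, len(value))
    return "".join(
        str(b % 10) if ch.isdigit() else ch
        for ch, b in zip(value, stream)
    )
```

Because the same input always yields the same token, masked columns can still be joined in the sandbox. The mapping is one-way, so two different IDs could in principle mask to the same digits; a format-preserving encryption scheme would be the choice where that matters.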
Creating a sandbox on the cloud
• Random projection
• Usually used for dimensionality reduction
Original data (M x N) x Random matrix (N x N) = Masked data (M x N)
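A small sketch of the multiplication above. The slide only says "random matrix"; choosing an orthogonal random matrix is an assumption made here, because it preserves row norms and pairwise distances exactly, which makes the "statistical properties intact" idea easy to check. The function names are illustrative.

```python
import math
import random


def random_orthogonal_matrix(n, seed=None):
    """Build an n x n orthogonal matrix by Gram-Schmidt on random Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:  # remove the components along rows already chosen
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip a degenerate draw
            rows.append([a / norm for a in v])
    return rows


def mask(data, matrix):
    """Multiply the M x N data by the N x N matrix, giving M x N masked data."""
    n = len(matrix)
    return [
        [sum(row[k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for row in data
    ]


def distance(a, b):
    """Euclidean distance between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical customer rows: income, age, number of existing loans.
data = [[5200.0, 35.0, 2.0], [8100.0, 41.0, 0.0], [3000.0, 29.0, 5.0]]
masked = mask(data, random_orthogonal_matrix(3, seed=7))
```

Distance-based methods such as clustering or nearest neighbours see the same geometry in `masked` as in `data`, while the individual column values no longer correspond to real incomes or ages. In practice the columns would usually be scaled first so that one large-valued column does not dominate.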
Fast real-time vs. batch analytics
Fast real-time analytics
• ‘Batch’ analytics:
Diagram labels: User, Application, Over-night batch, Data warehouse, Predictive analytics, Descriptive analytics, Analytical model, Monthly
Fast real-time analytics
• ‘Batch’ analytics:
Diagram labels: User, Application, Over-night batch, Data warehouse, Predictive analytics, Descriptive analytics, Real-time decisioning, Monthly
Fast real-time analytics
• So what is real-time analytics?
Diagram labels: User, Application, Real-time analytics and decisioning, Analytical model updated in real time, Predictive analytics, Batch analytical model, Real-time analytical model
Fast real-time analytics
• Q-learning
• E.g. SMS advertisement campaign
Diagram labels: Real-time analytical marketing system; Location, user info; SMS campaign
Fast real-time analytics
• Q-learning
• E.g. SMS advertisement campaign
Diagram labels: Real-time analytical marketing system; Change behaviour (e.g. buy something else); Learn new behaviour
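The SMS campaign loop can be sketched with tabular Q-learning. Everything named here is illustrative: the class, the campaign categories, the state label and the reward of 1.0 for a user response are assumptions, and a production system would derive the state from real location and user information.

```python
import random
from collections import defaultdict


class SmsCampaignAgent:
    """Tabular Q-learning over (state, campaign) pairs; all names are illustrative."""

    def __init__(self, campaigns, learning_rate=0.1, discount=0.5, epsilon=0.1, seed=None):
        self.campaigns = list(campaigns)
        self.learning_rate = learning_rate
        self.discount = discount
        self.epsilon = epsilon          # share of messages used to explore
        self.rng = random.Random(seed)
        self.q = defaultdict(float)     # (state, campaign) -> estimated value

    def best_campaign(self, state):
        return max(self.campaigns, key=lambda c: self.q[(state, c)])

    def choose_campaign(self, state):
        """Epsilon-greedy: mostly exploit the best campaign, sometimes try another."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.campaigns)
        return self.best_campaign(state)

    def learn(self, state, campaign, reward, next_state):
        """Standard Q-learning update after observing the user's response."""
        best_next = max(self.q[(next_state, c)] for c in self.campaigns)
        target = reward + self.discount * best_next
        self.q[(state, campaign)] += self.learning_rate * (target - self.q[(state, campaign)])


# One pass through the loop: pick a campaign, send it, learn from the response.
agent = SmsCampaignAgent(["sports", "concerts", "movies"], seed=1)
state = "near_stadium"  # hypothetical state built from location and user info
campaign = agent.choose_campaign(state)
agent.learn(state, campaign, reward=1.0, next_state=state)  # 1.0: the user responded
```

Because every response updates the table immediately, a shift in what users respond to gradually changes which campaign is preferred, without waiting for a batch model rebuild.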
Fast real-time analytics: Real-time analytics in action
[Diagram: interest in concerts, interest in movies and interest in sports, over time]
Fast real-time analytics: Real-time analytics in action
[Chart: INTEREST (y-axis, 0 to 0.5) against MESSAGES (x-axis, 1 to 10…) for SPORTS, CONCERTS and MOVIES, annotated with interest in concerts, interest in movies and interest in sports]
Real-time analytical tracking and learning of people’s interest
Putting it all together under one architecture
Data architecture
• Some difficult questions around big data and analytics
– How can I invest in big data while managing cost?
– How can I “experiment” with big data while mitigating risks?
– How can I create a 360-degree view of data without boiling the ocean?
– How can I use overseas data without violating regulations?
Tiered data architecture
Diagram labels:
• Data consumer
• Data virtualization (SQL / REST / SOAP / MQ)
• Official data model
• Data warehouse (staging, SQL access)
• Big Data infra (e.g. Hadoop)
• Real-time store
• Master / reference data
• Social / cloud public data
• Overseas data
• Feeds: data sources (batch and real-time), overseas data sources, social network (batch)
Tiered data architecture
• Investment / level of support
– Master data: support Level 1
– Fast data: support Level 1
– Hot data: support Level 2
– Cold data: support Level 3
– Data virtualization: support Level 1
– Investment axes: CPU / memory, storage
Tiered data architecture
• Invest where it matters
– Defer investment if needed
– Refocus investment without disrupting business
• Data virtualization
– Create a façade for data access
– Provide standard interface for data
– Single data model, single access, single quality checkpoint
• Allow ‘experimentation’
– E.g. cut-off point for hot / cold
• Overseas data access
– Data stays where it is; only aggregated data is transferred back
– More palatable to regulators
• 360-degree view
– Data can be ‘joined’ through the data virtualization layer, so no laborious ETL is needed
• Single place to check for data quality
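The façade idea in the points above can be sketched in a few lines. This is an illustration only: the class, the source names and the row fields are hypothetical, and a real data virtualization product would push queries down to the underlying stores rather than pull rows into memory.

```python
class DataVirtualizationLayer:
    """Single access point over several stores; source names and fields are hypothetical."""

    def __init__(self):
        self.sources = {}   # source name -> callable returning a list of row dicts
        self.checks = []    # quality checks applied to every row leaving the layer

    def register_source(self, name, fetch):
        self.sources[name] = fetch

    def add_quality_check(self, check):
        self.checks.append(check)

    def query(self, name):
        """Fetch rows from one source, keeping only rows that pass every check."""
        rows = self.sources[name]()
        return [row for row in rows if all(check(row) for check in self.checks)]

    def join(self, left, right, key):
        """Join two sources on `key` at query time, with no ETL copy in between."""
        index = {}
        for row in self.query(right):
            index.setdefault(row[key], []).append(row)
        return [
            {**l_row, **r_row}
            for l_row in self.query(left)
            for r_row in index.get(l_row[key], [])
        ]


layer = DataVirtualizationLayer()
layer.register_source("warehouse_segments", lambda: [
    {"segment": "retail", "customers": 120},
    {"segment": "sme", "customers": 15},
])
# The overseas source exposes aggregates only, so row-level data stays where it is.
layer.register_source("overseas_aggregates", lambda: [
    {"segment": "retail", "overseas_balance": 1200000.0},
    {"segment": None, "overseas_balance": 90.0},
])
layer.add_quality_check(lambda row: row.get("segment") is not None)
view = layer.join("warehouse_segments", "overseas_aggregates", "segment")
```

Consumers only ever talk to `layer`, so a source can be moved between tiers (for example when the hot / cold cut-off changes) by re-registering it, without touching the consumers.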
That’s all folks…
• LinkedIn:
– https://www.linkedin.com/in/azrul-madisa-6052419
