© 2014 MapR Technologies
Practical Computing with Chaos
Ted Dunning, Chief Applications Architect MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
e-book available courtesy of MapR
Also at MapR booth
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take into account real world
business settings:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of team to develop and to maintain system
• Look for approaches with big benefits for the effort expended
Agenda
• Monty Hall
• Randomized geo-coding
• Thompson sampling
– Bayesian Bandits
– Targeting
– Bayesian ranking
• Dithering (sound, signals)
• Synthetic data (preview)
Let’s Start with Trouble
• Monty Hall problem (oops, done)
• Three doors, one with a fabulous prize
• You pick one
• Monty shows you that one of the remaining doors is empty
• You can switch at this point to the other door or not
• Should you switch?
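A simulation is often more convincing than the math. The sketch below is one way to set it up, assuming the host always opens an empty door that the contestant did not pick; the function names `play` and `win_rate` are illustrative and do not come from the talk.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the single door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, trials=100_000, seed=42):
    """Fraction of rounds won under a fixed strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials

if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Staying wins about one round in three and switching about two in three, which matches the analytical answer.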
The Real Problem
• Doing the math isn’t too hard
• Convincing somebody you have the right answer is really hard
Live Coding
With REAL Chaos
Geo-coding
• Some databases have disk locality that follows key locality
• The primary key is totally ordered
• Embedding a total ordering of the points in a plane is possible
– But loses some distance information
– A line is not a square!
• We want to do proximity searches
– This gets harder in the polar regions for most codings
Space Filling Curve
[Figures: a square recursively divided into quadrants numbered 0 to 3, shown at two levels of subdivision]
Z-coding – Interleave Bits
[Figure: grid whose cells are labeled with 2-bit codes (00, 01, 10, 11) and 4-bit codes (0100, 1100, 1110)]
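The bit interleaving behind Z-coding can be sketched as follows. This is a minimal illustration, not the code from the talk: it assumes integer grid coordinates and assumes the y bit precedes the x bit in each 2-bit group (the convention used on the slides is not recoverable from the text). The names `interleave` and `z_key` are hypothetical.

```python
def interleave(x, y, bits):
    """Interleave the low `bits` bits of x and y into one integer code.

    Assumption: within each 2-bit group the y bit comes first.
    """
    code = 0
    for i in reversed(range(bits)):
        code = (code << 1) | ((y >> i) & 1)
        code = (code << 1) | ((x >> i) & 1)
    return code

def z_key(x, y, bits):
    """Format the code as dot-separated 2-bit groups, coarsest level first."""
    code = interleave(x, y, bits)
    groups = [(code >> (2 * i)) & 3 for i in reversed(range(bits))]
    return ".".join(format(g, "02b") for g in groups)

if __name__ == "__main__":
    print(z_key(5, 3, 3))  # 01.10.11
```

Because the coarsest quadrant comes first in the key, sorting by key keeps most cells of the same quadrant next to each other in a totally ordered key space.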
Neighbors Often Share Prefix
[Figure: the Z-coded grid with example keys 00.11.11, 10.01.01 and 00.11.01]
Often, not always
Close Far
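The "often, not always" caveat can be checked directly. The sketch below assumes an 8 x 8 grid (3 bits per axis) and the same y-before-x bit order as an illustrative convention; `z_key` and `shared_prefix_groups` are hypothetical helper names.

```python
def z_key(x, y, bits):
    """Dot-separated Z-order key; assumes the y bit precedes the x bit."""
    groups = []
    for i in reversed(range(bits)):
        groups.append(f"{(y >> i) & 1}{(x >> i) & 1}")
    return ".".join(groups)

def shared_prefix_groups(a, b):
    """Count the leading 2-bit groups that two keys have in common."""
    count = 0
    for ga, gb in zip(a.split("."), b.split(".")):
        if ga != gb:
            break
        count += 1
    return count

# Two cells inside the same small block share most of their key.
near_same_block = shared_prefix_groups(z_key(2, 2, 3), z_key(3, 3, 3))

# Two cells that touch across the vertical midline of the grid share nothing.
near_across_midline = shared_prefix_groups(z_key(3, 3, 3), z_key(4, 3, 3))

if __name__ == "__main__":
    print(z_key(2, 2, 3), z_key(3, 3, 3), near_same_block)
    print(z_key(3, 3, 3), z_key(4, 3, 3), near_across_midline)
```

Cells (3, 3) and (4, 3) are direct neighbors, yet their keys differ in the very first group, so a single prefix scan cannot answer a proximity query near a quadrant boundary.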
Random Sampling to Derive Keys
"00.01.01"
"00.01.10"
"00.01.11"
"00.11.00"
"00.11.01"
"00.11.10"
"00.11.11"
"01.00.10"
"01.10.00"
"01.10.10"
"00.01.10" - "00.01.11"
"00.11.00" - "00.11.11"
"01.00.10"
"01.10.00" - "01.10.10"
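One way to turn the random-sampling idea on the preceding slides into key ranges can be sketched as follows. The details are assumptions rather than the method shown in the talk: the query region is taken to be a circle, points are sampled uniformly inside it, and sorted codes are merged into a range whenever the gap between them is at most `max_gap`. All function names are hypothetical.

```python
import math
import random

def z_code(x, y, bits):
    """Integer Z-order code; assumes the y bit precedes the x bit."""
    code = 0
    for i in reversed(range(bits)):
        code = (code << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    return code

def to_key(code, bits):
    """Format a code as dot-separated 2-bit groups."""
    return ".".join(format((code >> (2 * i)) & 3, "02b") for i in reversed(range(bits)))

def sample_codes(cx, cy, radius, bits, samples, rng):
    """Sorted Z-codes of the grid cells hit by random points inside a circle."""
    size = 1 << bits
    codes = set()
    for _ in range(samples):
        # The square root keeps the samples uniform over the disc area.
        r = radius * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x = math.floor(cx + r * math.cos(theta))
        y = math.floor(cy + r * math.sin(theta))
        if 0 <= x < size and 0 <= y < size:
            codes.add(z_code(x, y, bits))
    return sorted(codes)

def merge_ranges(codes, max_gap=1):
    """Collapse sorted codes into inclusive (start, end) ranges."""
    ranges = []
    for c in codes:
        if ranges and c - ranges[-1][1] <= max_gap:
            ranges[-1] = (ranges[-1][0], c)
        else:
            ranges.append((c, c))
    return ranges

if __name__ == "__main__":
    rng = random.Random(1)
    codes = sample_codes(4.0, 4.0, 1.5, 3, 200, rng)
    for start, end in merge_ranges(codes, max_gap=2):
        print(to_key(start, 3), "-", to_key(end, 3))
```

Each resulting range becomes one scan over the ordered primary key. A larger `max_gap` gives fewer scans at the cost of reading some cells outside the query region.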
Dithering
• 4 bit sine wave (listen for artifacts as volume decreases)
• White dithering (artifacts gone, we hear through the noise)
• Noise shaping (noise is easier to hear through)
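The effect of white dithering can be sketched numerically. The example below is an illustration under stated assumptions, not the audio demo from the talk: the quantizer rounds to whole-number levels, the dither is uniform noise exactly one quantization step wide, and the input value 0.3 is arbitrary. Noise shaping is not covered.

```python
import math
import random

def quantize(value):
    """Round to the nearest integer level (a very coarse quantizer)."""
    return math.floor(value + 0.5)

def dithered(value, rng):
    """Add uniform noise one step wide, then quantize."""
    return quantize(value + rng.uniform(-0.5, 0.5))

def average_dithered(value, rng, repeats=2000):
    """Average many independently dithered quantizations of the same value."""
    return sum(dithered(value, rng) for _ in range(repeats)) / repeats

if __name__ == "__main__":
    rng = random.Random(0)
    quiet = 0.3  # smaller than half a step, so plain rounding loses it entirely
    print("plain quantization:", quantize(quiet))
    print("dithered, averaged:", average_dithered(quiet, rng))
```

Without dither, any signal below half a step is quantized to a constant zero. With dither, each individual sample is noisier, but the average of the quantized values tracks the original value, which is why the signal can still be heard through the noise.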
[Plot: waveform over Time (0 to 6), vertical axis −4 to 4]
The Shape of the Noise
[Histogram: Frequency (0 to 3000) by Noise (−0.4 to 0.4)]
The Effect After Averaging
[Plot: waveform over Time (0 to 6), vertical axis −4 to 4]
Thompson Sampling
Learning in the Real World
• In the real world we get to pick our training examples
– Do we try this restaurant or not?
• Learning has real and opportunity costs
• Not learning has real and opportunity costs as well
• Every sub-optimal choice we make incurs regret
– We would like to minimize this
– But we can’t quantify regret without incurring regret!
An Example
• Pick one of five options
– Purple, blue, green, red, yellow
– Each has a random payoff
• If you pick a bad option, regret = mean(best) – mean(yours)
• The best known algorithm uses randomization
– Best = minimal regret + minimal code complexity
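A randomized algorithm of this kind (Thompson sampling, as named in the agenda) can be sketched in a few lines. The sketch assumes each option pays 0 or 1 with a fixed unknown probability and uses a uniform Beta(1, 1) prior; the payoff rates in the usage example are invented for illustration, and the function name is hypothetical.

```python
import random

def thompson_sampling(true_rates, rounds, rng):
    """Run a Bernoulli bandit; return (pulls per arm, total expected regret)."""
    n = len(true_rates)
    wins = [0] * n
    losses = [0] * n
    pulls = [0] * n
    best = max(true_rates)
    regret = 0.0
    for _ in range(rounds):
        # Draw one plausible payoff rate per arm from its Beta posterior,
        # then play the arm whose draw is largest.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
        arm = max(range(n), key=lambda i: draws[i])
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
        pulls[arm] += 1
        # Regret as defined on the slide: mean(best) - mean(yours).
        regret += best - true_rates[arm]
    return pulls, regret

if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical payoff rates for purple, blue, green, red, yellow.
    pulls, regret = thompson_sampling([0.1, 0.2, 0.3, 0.4, 0.9], 2000, rng)
    print("pulls per option:", pulls)
    print("total regret:", round(regret, 1))
```

Arms with little data have wide posteriors and still win the draw occasionally, so exploration happens automatically and fades as evidence accumulates.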
Demo – The Algorithm
Synthetic Data
select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD
,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr
,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC
,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD
,IR.cms_provider_type ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc
,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd
,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC
FROM (SELECT distinct enc.encounter_key as ENC_KEY,
enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE,
bt.bill_type, cnt.contract_nbr as CONTR_,
ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD,
enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS,
enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD,
eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type,
prv.PROVIDER_SOURCE_CD, diag.cms_provider_type,
sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd,
rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY,
st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD,
derr.error_desc as DG_ERRDESC
FROM oicpcuhg.ir_encounter enc
Can You See the Problem?
INNER JOIN oicpcuhg.ir_encountertype typ
ON (typ.encounter_type_key = enc.encounter_type_key)
LEFT OUTER JOIN oicpcuhg.ir_billtype bt
ON (bt.bill_type_key = enc.bill_type_key)
LEFT OUTER JOIN oicpcuhg.ir_contract cnt
ON (cnt.contract_key = enc.contract_key)
LEFT OUTER JOIN oicpcuhg.ir_datasource ds
ON (ds.source_key = enc.data_source_key)
LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob
ON (lob.lob_key = enc.lob_key)
INNER JOIN oicpcuhg.ir_member m
ON (
m.hp_cd = enc.hp_cd
AND m.member_source_cd = enc.member_source_cd
AND m.member_nbr = enc.member_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
ON (eerror.encounter_key = enc.encounter_key and
eerror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error eerr
ON (eerr.error_key = eerror.error_key)
LEFT OUTER JOIN oicpcuhg.ir_provider prv
ON (prv.hp_cd = enc.hp_cd and
prv.provider_source_cd = enc.provider_source_cd and
prv.provider_nbr = enc.provider_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp
ON (esp.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_specialty sp
ON (sp.specialty_key = esp.specialty_key)
LEFT OUTER JOIN oicpcuhg.ir_service svc
ON (svc.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_revenue rev
ON (rev.rev_cd = svc.rev_cd)
LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag
ON (diag.encounter_key = enc.encounter_key)
INNER JOIN oicpcuhg.ir_diagcd dgcd
ON (dgcd.diag_cd_key = diag.diag_cd_key)
INNER JOIN oicpcuhg.ir_recordstate st
ON (st.rec_state_key = diag.rec_state_key)
INNER JOIN oicpcuhg.ir_recordstatus sts
ON (sts.rec_status_key = diag.rec_status_key)
LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
ON (derror.diagnosis_key = diag.diagnosis_key and
derror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error derr
ON (derr.error_key = derror.error_key)) IR
INNER JOIN oicpcuhg.umr_req_inbound umr
ON (trim(umr.member_nbr) = IR.member_Nbr AND
trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND
trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND
trim(umr.diag1) = IR.diag_cd)
One Attack
• The customer can’t give you the data
– They can’t trust you, by law
• But they can probably summarize the data
– How many columns
– What types
– Perhaps statistical summaries
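Generating stand-in rows from such a summary can be sketched as below. Everything in `SUMMARY` is hypothetical: the column names `member_nbr` and `diag_cd` are borrowed from the query above, while the other columns, the types, the code values, and the statistics are invented for illustration. The distributions are deliberately simple; realistic data usually needs richer generators.

```python
import random
import string

# Hypothetical summary a customer might share: names, types, rough statistics.
SUMMARY = [
    {"name": "member_nbr", "type": "string", "length": 9, "alphabet": string.digits},
    {"name": "diag_cd", "type": "choice",
     "values": ["A10", "B20", "C30"], "weights": [0.7, 0.2, 0.1]},
    {"name": "amount", "type": "float", "mean": 120.0, "stddev": 35.0},
    {"name": "visits", "type": "int", "min": 0, "max": 12},
]

def fake_value(col, rng):
    """Draw one value that matches the column's summarized type and statistics."""
    kind = col["type"]
    if kind == "string":
        return "".join(rng.choice(col["alphabet"]) for _ in range(col["length"]))
    if kind == "choice":
        return rng.choices(col["values"], weights=col["weights"], k=1)[0]
    if kind == "float":
        return rng.gauss(col["mean"], col["stddev"])
    if kind == "int":
        return rng.randint(col["min"], col["max"])
    raise ValueError(f"unknown column type: {kind}")

def fake_rows(summary, count, seed=0):
    """Generate `count` fake rows; a fixed seed makes the data reproducible."""
    rng = random.Random(seed)
    return [{col["name"]: fake_value(col, rng) for col in summary} for _ in range(count)]

if __name__ == "__main__":
    for row in fake_rows(SUMMARY, 3):
        print(row)
```

Because the seed fixes the output, the customer and the developer can generate the same fake table independently and compare behavior without any real records changing hands.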
Bug Replication Without Security Violation
[Diagram: Customer side with Data, your side with Fake Data, and the parameters x, y, α, ξ shown on both sides]
The Upshot
• So random numbers are useful
• But simple distributions not so much
• How can YOU generate cool data?
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Last October: Time Series Databases
by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
Coming in February: Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
Thank you for coming today!

Editor's Notes

  • #4  Talk track: 2nd in series, first was on how to build a simple recommender. This one on anomaly detection is being sold by O’Reilly on Amazon, but for a limited time MapR is giving away the e-book for free. Here’s the link where you can register to get one.
  • #5 Talk track: ELLEN New ways to do it that take into account real world business goals, realistic resources, new types of data and best time to value…