SlideShare a Scribd company logo
1 of 42
© 2014 MapR Technologies 1© 2014 MapR Technologies
© 2014 MapR Technologies 2
Practical Computing with Chaos
Ted Dunning, Chief Applications Architect MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
© 2014 MapR Technologies 3
e-book available courtesy of MapR
Also at MapR booth
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
© 2014 MapR Technologies 4
Practical Machine Learning series (O’Reilly)
• Machine learning is becoming mainstream
• Need pragmatic approaches that take into account real world
business settings:
– Time to value
– Limited resources
– Availability of data
– Expertise and cost of team to develop and to maintain system
• Look for approaches with big benefits for the effort expended
© 2014 MapR Technologies 5
Agenda
• Monty Hall
• Randomized geo-coding
• Thompson sampling
– Bayesian Bandits
– Targeting
– Bayesian ranking
• Dithering (sound, signals)
• Synthetic data (preview)
© 2014 MapR Technologies 6
Let’s Start with Trouble
• Monty Hall problem (oops, done)
• Three doors, one with a fabulous prize
• You pick one
• Monte shows you one of the remaining doors is empty
• You can switch at this point to the other door or not
• Should you switch?
© 2014 MapR Technologies 7
© 2014 MapR Technologies 8
© 2014 MapR Technologies 9
© 2014 MapR Technologies 10
The Real Problem
• Doing the math isn’t too hard
• Convincing somebody you have the right answer is really hard
© 2014 MapR Technologies 11
Live Coding
With REAL Chaos
© 2014 MapR Technologies 12
Geo-coding
© 2014 MapR Technologies 13
Geo-coding
• Some databases have disk locality  key locality
• The primary key is totally ordered
• Embedding a total ordering of the points in a plane is possible
– But loses some distance information
– A line is not a square!
• We want to do proximity searches
– This gets harder in the polar regions for most codings
© 2014 MapR Technologies 14
Space Filling Curve
0 1
23 01
2 3
0
1 2
3 0
1 2
3
0
1 2
3
© 2014 MapR Technologies 15
Space Filling Curve
0123
2
3
3
1
0
2
2
3
1
1
0
0 3
20
1
© 2014 MapR Technologies 16
Z-coding – Interleave Bits
1110
0100
00
1110
11
01
01
10
00
00
11
01
10
01
1100
10
© 2014 MapR Technologies 17
Neighbors Often Share Prefix
1110
0100
00
1110
11
01
01
10
00
00
11
01
10
01
1100
10
00. 11.11
10. 01.01
00. 11.01
© 2014 MapR Technologies 18
Often, not always
Close Far
© 2014 MapR Technologies 19
Random Sampling to Derive Keys
1110
0100
00
1110
11
01
01
10
00
00
11
01
10
01
1100
10
© 2014 MapR Technologies 20
"00.01.01"
"00.01.10"
"00.01.11"
"00.11.00"
"00.11.01"
"00.11.10"
"00.11.11"
"01.00.10"
"01.10.00"
"01.10.10”
1110
0100
00
1110
11
01
01
10
00
00
11
01
10
01
1100
10
© 2014 MapR Technologies 21
"00.01.01"
"00.01.10"
"00.01.11"
"00.11.00"
"00.11.01"
"00.11.10"
"00.11.11"
"01.00.10"
"01.10.00"
"01.10.10”
1110
0100
00
1110
11
01
01
10
00
00
11
01
10
01
1100
10
© 2014 MapR Technologies 22
"00.01.10" - "00.01.11"
"00.11.00" - "00.11.11"
"01.00.10"
"01.10.00" - "01.10.10”
1110
0100
00
1110
11
01
01
10
00
00
11
01
10
01
1100
10
© 2014 MapR Technologies 23
Dithering
© 2014 MapR Technologies 24
• 4 bit sine wave (listen for artifacts as volume decreases)
• White dithering (artifacts gone, we hear through the noise)
• Noise shaping (noise is easier to hear through)
© 2014 MapR Technologies 25
0 1 2 3 4 5 6
−4−2024
Time
© 2014 MapR Technologies 26
The Shape of the Noise
Noise
Frequency
−0.4 −0.2 0.0 0.2 0.4
010003000
© 2014 MapR Technologies 27
The Effect After Averaging
0 1 2 3 4 5 6
−4−2024
Time
© 2014 MapR Technologies 28
Thompson Sampling
© 2014 MapR Technologies 29
Learning in the Real World
• In the real world we get to pick our training examples
– Do we try this restaurant or not?
• Learning has real and opportunity costs
• Not learning has real and opportunity costs as well
• Every sub-optimal choice we make incurs regret
– We would like to minimize this
– But we can’t quantify regret without incurring regret!
© 2014 MapR Technologies 30
An Example
• Pick one of five options
– Purple, blue, green, red, yellow
– Each has a random payoff
• If you pick a bad option, regret = mean(best) – mean(yours)
• The best known algorithm uses randomization
– Best = minimal regret + minimal code complexity
© 2014 MapR Technologies 31
Demo – The Algorithm
© 2014 MapR Technologies 32
Synthetic Data
© 2014 MapR Technologies 33
select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD
,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr
,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC
,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD
,IR.cms_provider_ty e ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc
,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd
,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC
FROM (SELECT distinct enc.encounter_key as ENC_KEY,
enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE,
bt.bill_type, cnt.contract_nbr as CONTR_,
ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD,
enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS,
enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD,
eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type,
prv.PROVIDER_SOURCE_CD, diag.cms_provider_type,
sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd,
rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY,
st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD,
derr.error_desc as DG_ERRDESC
FROM oicpcuhg.ir_encounter enc
`
Can You See the Problem?
© 2014 MapR Technologies 34
INNER JOIN oicpcuhg.ir_encountertype typ
ON (typ.encounter_type_key = enc.encounter_type_key)
LEFT OUTER JOIN oicpcuhg.ir_billtype bt
ON (bt.bill_type_key = enc.bill_type_key)
LEFT OUTER JOIN oicpcuhg.ir_contract cnt
ON (cnt.contract_key = enc.contract_key)
LEFT OUTER JOIN oicpcuhg.ir_datasource ds
ON (ds.source_key = enc.data_source_key)
LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob
ON (lob.lob_key = enc.lob_key)
INNER JOIN oicpcuhg.ir_member m
ON (
m.hp_cd = enc.hp_cd
AND m.member_source_cd = enc.member_source_cd
AND m.member_nbr = enc.member_nbr)
LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror
ON (eerror.encounter_key = enc.encounter_key and
eerror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error eerr
ON (eerr.error_key = eerror.error_key)
LEFT OUTER JOIN oicpcuhg.ir_provider prv
ON (prv.hp_cd = enc.hp_cd and
prv.provider_source_cd = enc.provider_source_cd and
prv.provider_nbr = enc.provider_nbr)
© 2014 MapR Technologies 35
LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp
ON (esp.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_specialty sp
ON (sp.specialty_key = esp.specialty_key)
LEFT OUTER JOIN oicpcuhg.ir_service svc
ON (svc.encounter_key = enc.encounter_key)
LEFT OUTER JOIN oicpcuhg.ir_revenue rev
ON (rev.rev_cd = svc.rev_cd)
LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag
ON (diag.encounter_key = enc.encounter_key)
INNER JOIN oicpcuhg.ir_diagcd dgcd
ON (dgcd.diag_cd_key = diag.diag_cd_key)
INNER JOIN oicpcuhg.ir_recordstate st
ON (st.rec_state_key = diag.rec_state_key)
INNER JOIN oicpcuhg.ir_recordstatus sts
ON (sts.rec_status_key = diag.rec_status_key)
LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror
ON (derror.diagnosis_key = diag.diagnosis_key and
derror.active_flg = 'Y')
LEFT OUTER JOIN oicpcuhg.ir_error derr
ON (derr.error_key = derror.error_key)) IR
INNER JOIN oicpcuhg.umr_req_inbound umr
ON (trim(umr.member_nbr) = IR.member_Nbr AND
trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND
trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND
trim(umr.diag1) = IR.diag_cd)
© 2014 MapR Technologies 36
One Attack
• The customer can’t give you the data
– They can’t trust you, by law
• But they can probably summarize the data
– How many columns
– What types
– Perhaps statistical summaries
© 2014 MapR Technologies 37
Bug Replication Without Security Violation
Customer You
DataData
DataFake
DataFake
x y α ξ
x y α ξ
© 2014 MapR Technologies 38
The Upshot
• So random numbers are useful
• But simple distributions not so much
• How can YOU generate cool data?
© 2014 MapR Technologies 39
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
© 2014 MapR Technologies 40
Last October: Time Series Databases
by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
© 2014 MapR Technologies 41
Coming in February: Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
© 2014 MapR Technologies 42
Thank you for coming today!

More Related Content

Viewers also liked

Learning Linear Models with Hadoop
Learning Linear Models with HadoopLearning Linear Models with Hadoop
Learning Linear Models with HadoopDataWorks Summit
 
Omid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseOmid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseDataWorks Summit
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopDataWorks Summit
 
Enterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesEnterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesDataWorks Summit
 
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterEnterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterDataWorks Summit
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the MoviesDataWorks Summit
 
Safer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science PlatformSafer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science PlatformDataWorks Summit
 
Hadoop-scale Search with Solr
Hadoop-scale Search with SolrHadoop-scale Search with Solr
Hadoop-scale Search with SolrDataWorks Summit
 
Building and Improving Products with Hadoop
Building and Improving Products with Hadoop Building and Improving Products with Hadoop
Building and Improving Products with Hadoop DataWorks Summit
 
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...DataWorks Summit
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNDataWorks Summit
 
Hive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingHive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingDataWorks Summit
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Analyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitAnalyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitDataWorks Summit
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)DataWorks Summit
 
Experiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte ScaleExperiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte ScaleDataWorks Summit
 
A Hadoop-enabled Ship Tracking Application for the Port of Rotterdam
A Hadoop-enabled Ship Tracking Application for the Port of RotterdamA Hadoop-enabled Ship Tracking Application for the Port of Rotterdam
A Hadoop-enabled Ship Tracking Application for the Port of RotterdamDataWorks Summit
 

Viewers also liked (20)

Learning Linear Models with Hadoop
Learning Linear Models with HadoopLearning Linear Models with Hadoop
Learning Linear Models with Hadoop
 
Omid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBaseOmid Efficient Transaction Mgmt and Processing for HBase
Omid Efficient Transaction Mgmt and Processing for HBase
 
Hadoop Puzzlers
Hadoop PuzzlersHadoop Puzzlers
Hadoop Puzzlers
 
Scalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for HadoopScalding: Twitter's New DSL for Hadoop
Scalding: Twitter's New DSL for Hadoop
 
Enterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive TechnologiesEnterprise Integration of Disruptive Technologies
Enterprise Integration of Disruptive Technologies
 
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop ClusterEnterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
 
Hadoop and R Go to the Movies
Hadoop and R Go to the MoviesHadoop and R Go to the Movies
Hadoop and R Go to the Movies
 
Safer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science PlatformSafer on the road with Hadoop! Setting up a Data Science Platform
Safer on the road with Hadoop! Setting up a Data Science Platform
 
Hadoop-scale Search with Solr
Hadoop-scale Search with SolrHadoop-scale Search with Solr
Hadoop-scale Search with Solr
 
Building and Improving Products with Hadoop
Building and Improving Products with Hadoop Building and Improving Products with Hadoop
Building and Improving Products with Hadoop
 
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
Big Data Management System: Smart SQL Processing Across Hadoop and your Data ...
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Fast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARNFast, Scalable Graph Processing: Apache Giraph on YARN
Fast, Scalable Graph Processing: Apache Giraph on YARN
 
Hive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingHive & HBase For Transaction Processing
Hive & HBase For Transaction Processing
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Analyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and ProfitAnalyzing Historical Data of Applications on YARN for Fun and Profit
Analyzing Historical Data of Applications on YARN for Fun and Profit
 
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)SAS on Your (Apache) Cluster, Serving your Data (Analysts)
SAS on Your (Apache) Cluster, Serving your Data (Analysts)
 
Experiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte ScaleExperiences Streaming Analytics at Petabyte Scale
Experiences Streaming Analytics at Petabyte Scale
 
A Hadoop-enabled Ship Tracking Application for the Port of Rotterdam
A Hadoop-enabled Ship Tracking Application for the Port of RotterdamA Hadoop-enabled Ship Tracking Application for the Port of Rotterdam
A Hadoop-enabled Ship Tracking Application for the Port of Rotterdam
 

Similar to Practical Computing With Chaos

Practical Computing Wiith Chaos
Practical Computing Wiith ChaosPractical Computing Wiith Chaos
Practical Computing Wiith ChaosMapR Technologies
 
How to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detectionHow to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detectionDataWorks Summit
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFMLconf
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15MLconf
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matterDataWorks Summit
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentDataWorks Summit
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentMapR Technologies
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedTed Dunning
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
Architecting R into Storm Application Development Process
Architecting R into Storm Application Development ProcessArchitecting R into Storm Application Development Process
Architecting R into Storm Application Development ProcessDataWorks Summit
 

Similar to Practical Computing With Chaos (20)

Practical Computing Wiith Chaos
Practical Computing Wiith ChaosPractical Computing Wiith Chaos
Practical Computing Wiith Chaos
 
How to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detectionHow to find what you didn't know to look for, oractical anomaly detection
How to find what you didn't know to look for, oractical anomaly detection
 
Deep Learning for Fraud Detection
Deep Learning for Fraud DetectionDeep Learning for Fraud Detection
Deep Learning for Fraud Detection
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Dunning ml-conf-2014
Dunning ml-conf-2014Dunning ml-conf-2014
Dunning ml-conf-2014
 
Ted Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SFTed Dunning, Chief Application Architect, MapR at MLconf SF
Ted Dunning, Chief Application Architect, MapR at MLconf SF
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
Goto amsterdam-2013-skinned
Goto amsterdam-2013-skinnedGoto amsterdam-2013-skinned
Goto amsterdam-2013-skinned
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Architecting R into Storm Application Development Process
Architecting R into Storm Application Development ProcessArchitecting R into Storm Application Development Process
Architecting R into Storm Application Development Process
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Practical Computing With Chaos

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies
  • 2. © 2014 MapR Technologies 2 Practical Computing with Chaos Ted Dunning, Chief Applications Architect MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning
  • 3. © 2014 MapR Technologies 3 e-book available courtesy of MapR Also at MapR booth http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
  • 4. © 2014 MapR Technologies 4 Practical Machine Learning series (O’Reilly) • Machine learning is becoming mainstream • Need pragmatic approaches that take into account real world business settings: – Time to value – Limited resources – Availability of data – Expertise and cost of team to develop and to maintain system • Look for approaches with big benefits for the effort expended
  • 5. © 2014 MapR Technologies 5 Agenda • Monty Hall • Randomized geo-coding • Thompson sampling – Bayesian Bandits – Targeting – Bayesian ranking • Dithering (sound, signals) • Synthetic data (preview)
  • 6. © 2014 MapR Technologies 6 Let’s Start with Trouble • Monty Hall problem (oops, done) • Three doors, one with a fabulous prize • You pick one • Monte shows you one of the remaining doors is empty • You can switch at this point to the other door or not • Should you switch?
  • 7. © 2014 MapR Technologies 7
  • 8. © 2014 MapR Technologies 8
  • 9. © 2014 MapR Technologies 9
  • 10. © 2014 MapR Technologies 10 The Real Problem • Doing the math isn’t too hard • Convincing somebody you have the right answer is really hard
  • 11. © 2014 MapR Technologies 11 Live Coding With REAL Chaos
  • 12. © 2014 MapR Technologies 12 Geo-coding
  • 13. © 2014 MapR Technologies 13 Geo-coding • Some databases have disk locality  key locality • The primary key is totally ordered • Embedding a total ordering of the points in a plane is possible – But loses some distance information – A line is not a square! • We want to do proximity searches – This gets harder in the polar regions for most codings
  • 14. © 2014 MapR Technologies 14 Space Filling Curve 0 1 23 01 2 3 0 1 2 3 0 1 2 3 0 1 2 3
  • 15. © 2014 MapR Technologies 15 Space Filling Curve 0123 2 3 3 1 0 2 2 3 1 1 0 0 3 20 1
  • 16. © 2014 MapR Technologies 16 Z-coding – Interleave Bits 1110 0100 00 1110 11 01 01 10 00 00 11 01 10 01 1100 10
  • 17. © 2014 MapR Technologies 17 Neighbors Often Share Prefix 1110 0100 00 1110 11 01 01 10 00 00 11 01 10 01 1100 10 00. 11.11 10. 01.01 00. 11.01
  • 18. © 2014 MapR Technologies 18 Often, not always Close Far
  • 19. © 2014 MapR Technologies 19 Random Sampling to Derive Keys 1110 0100 00 1110 11 01 01 10 00 00 11 01 10 01 1100 10
  • 20. © 2014 MapR Technologies 20 "00.01.01" "00.01.10" "00.01.11" "00.11.00" "00.11.01" "00.11.10" "00.11.11" "01.00.10" "01.10.00" "01.10.10” 1110 0100 00 1110 11 01 01 10 00 00 11 01 10 01 1100 10
  • 21. © 2014 MapR Technologies 21 "00.01.01" "00.01.10" "00.01.11" "00.11.00" "00.11.01" "00.11.10" "00.11.11" "01.00.10" "01.10.00" "01.10.10” 1110 0100 00 1110 11 01 01 10 00 00 11 01 10 01 1100 10
  • 22. © 2014 MapR Technologies 22 "00.01.10" - "00.01.11" "00.11.00" - "00.11.11" "01.00.10" "01.10.00" - "01.10.10” 1110 0100 00 1110 11 01 01 10 00 00 11 01 10 01 1100 10
  • 23. © 2014 MapR Technologies 23 Dithering
  • 24. © 2014 MapR Technologies 24 • 4 bit sine wave (listen for artifacts as volume decreases) • White dithering (artifacts gone, we hear through the noise) • Noise shaping (noise is easier to hear through)
  • 25. © 2014 MapR Technologies 25 0 1 2 3 4 5 6 −4−2024 Time
  • 26. © 2014 MapR Technologies 26 The Shape of the Noise Noise Frequency −0.4 −0.2 0.0 0.2 0.4 010003000
  • 27. © 2014 MapR Technologies 27 The Effect After Averaging 0 1 2 3 4 5 6 −4−2024 Time
  • 28. © 2014 MapR Technologies 28 Thompson Sampling
  • 29. © 2014 MapR Technologies 29 Learning in the Real World • In the real world we get to pick our training examples – Do we try this restaurant or not? • Learning has real and opportunity costs • Not learning has real and opportunity costs as well • Every sub-optimal choice we make incurs regret – We would like to minimize this – But we can’t quantify regret without incurring regret!
  • 30. © 2014 MapR Technologies 30 An Example • Pick one of five options – Purple, blue, green, red, yellow – Each has a random payoff • If you pick a bad option, regret = mean(best) – mean(yours) • The best known algorithm uses randomization – Best = minimal regret + minimal code complexity
  • 31. © 2014 MapR Technologies 31 Demo – The Algorithm
  • 32. © 2014 MapR Technologies 32 Synthetic Data
  • 33. © 2014 MapR Technologies 33 select IR.ENC_KEY ,IR.ENCOUNTER_ ,IR.ETYPE ,IR.bill_type ,IR.CONTR_ ,IR.SOURCE_CD ,IR.sub_source_cd ,IR.HP_CD ,IR.LOB_CD ,IR.FDO ,IR.TDOS ,IR.member_Nbr ,IR.HIC_NBR ,IR.MEMBER_SOURCE_CD ,IR.HDR_ERRCD ,IR.HDR_ERRDESC ,IR.PROVIDER_NBR ,IR.provider_type ,IR.PROVIDER_SOURCE_CD ,IR.cms_provider_ty e ,IR.SPEC_CD ,IR.SPEC_DESC ,IR.rev_cd ,IR.rev_cd_desc ,IR.proc_cd ,IR.diag_cd ,IR.DIAG_CD_KEY ,IR.DIAGNOSIS_KEY ,IR.rec_state_cd ,IR.rec_status_cd ,IR.DG_ERRCD ,IR.DG_ERRDESC FROM (SELECT distinct enc.encounter_key as ENC_KEY, enc.encounter_nbr as ENCOUNTER_, typ.encounter_type_cd as ETYPE, bt.bill_type, cnt.contract_nbr as CONTR_, ds.SOURCE_CD, enc.sub_source_cd, enc.HP_CD, lob.LOB_CD, enc.new_min_dt as FDOS, substr(enc.new_max_dt, 1, 10) as TDOS, enc.member_Nbr, m.HIC_NBR, m.MEMBER_SOURCE_CD, eerr.error_cd as HDR_ERRCD, eerr.ERROR_DESC as HDR_ERRDESC, enc.PROVIDER_NBR, prv.provider_type, prv.PROVIDER_SOURCE_CD, diag.cms_provider_type, sp.specialty_cd as SPEC_CD, sp.specialty_desc as SPEC_DESC, svc.rev_cd, rev.rev_cd_desc, svc.proc_cd, dgcd.diag_cd, dgcd.DIAG_CD_KEY, diag.DIAGNOSIS_KEY, st.rec_state_cd, sts.rec_status_cd, derr.error_cd as DG_ERRCD, derr.error_desc as DG_ERRDESC FROM oicpcuhg.ir_encounter enc ` Can You See the Problem?
  • 34. © 2014 MapR Technologies 34 INNER JOIN oicpcuhg.ir_encountertype typ ON (typ.encounter_type_key = enc.encounter_type_key) LEFT OUTER JOIN oicpcuhg.ir_billtype bt ON (bt.bill_type_key = enc.bill_type_key) LEFT OUTER JOIN oicpcuhg.ir_contract cnt ON (cnt.contract_key = enc.contract_key) LEFT OUTER JOIN oicpcuhg.ir_datasource ds ON (ds.source_key = enc.data_source_key) LEFT OUTER JOIN oicpcuhg.ir_lineofbusiness lob ON (lob.lob_key = enc.lob_key) INNER JOIN oicpcuhg.ir_member m ON ( m.hp_cd = enc.hp_cd AND m.member_source_cd = enc.member_source_cd AND m.member_nbr = enc.member_nbr) LEFT OUTER JOIN oicpcuhg.ir_encountererror eerror ON (eerror.encounter_key = enc.encounter_key and eerror.active_flg = 'Y') LEFT OUTER JOIN oicpcuhg.ir_error eerr ON (eerr.error_key = eerror.error_key) LEFT OUTER JOIN oicpcuhg.ir_provider prv ON (prv.hp_cd = enc.hp_cd and prv.provider_source_cd = enc.provider_source_cd and prv.provider_nbr = enc.provider_nbr)
  • 35. © 2014 MapR Technologies 35 LEFT OUTER JOIN oicpcuhg.ir_encounterspecialty esp ON (esp.encounter_key = enc.encounter_key) LEFT OUTER JOIN oicpcuhg.ir_specialty sp ON (sp.specialty_key = esp.specialty_key) LEFT OUTER JOIN oicpcuhg.ir_service svc ON (svc.encounter_key = enc.encounter_key) LEFT OUTER JOIN oicpcuhg.ir_revenue rev ON (rev.rev_cd = svc.rev_cd) LEFT OUTER JOIN oicpcuhg.ir_diagnosis diag ON (diag.encounter_key = enc.encounter_key) INNER JOIN oicpcuhg.ir_diagcd dgcd ON (dgcd.diag_cd_key = diag.diag_cd_key) INNER JOIN oicpcuhg.ir_recordstate st ON (st.rec_state_key = diag.rec_state_key) INNER JOIN oicpcuhg.ir_recordstatus sts ON (sts.rec_status_key = diag.rec_status_key) LEFT OUTER JOIN oicpcuhg.ir_diagnosiserror derror ON (derror.diagnosis_key = diag.diagnosis_key and derror.active_flg = 'Y') LEFT OUTER JOIN oicpcuhg.ir_error derr ON (derr.error_key = derror.error_key)) IR INNER JOIN oicpcuhg.umr_req_inbound umr ON (trim(umr.member_nbr) = IR.member_Nbr AND trim(umr.hhc_from_ccyymmdd) = IR.TDOS AND trim(umr.sub_mcare_mbr) = IR.HIC_NBR AND trim(umr.diag1) = IR.diag_cd)
  • 36. © 2014 MapR Technologies 36 One Attack • The customer can’t give you the data – They can’t trust you, by law • But they can probably summarize the data – How many columns – What types – Perhaps statistical summaries
  • 37. © 2014 MapR Technologies 37 Bug Replication Without Security Violation Customer You DataData DataFake DataFake x y α ξ x y α ξ
  • 38. © 2014 MapR Technologies 38 The Upshot • So random numbers are useful • But simple distributions not so much • How can YOU generate cool data?
  • 39. © 2014 MapR Technologies 39 e-book available courtesy of MapR http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
  • 40. © 2014 MapR Technologies 40 Last October: Time Series Databases by Ted Dunning and Ellen Friedman © Oct 2014 (published by O’Reilly)
  • 41. © 2014 MapR Technologies 41 Coming in February: Real World Hadoop by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
  • 42. © 2014 MapR Technologies 42 Thank you for coming today!

Editor's Notes

  1. Talk track: 2nd in series, first was on how to build a simple recommender. This one on anomaly detection is being sold by O’Reilly on Amazon, but for a limited time MapR is giving away the e-book for free. Here’s the link where you can register to get one.
  2. Talk track: ELLEN New ways to do it that take into account real world business goals, realistic resources, new types of data and best time to value…