SlideShare a Scribd company logo
1 of 79
© 2016 MapR Technologies 1© 2014 MapR Technologies
© 2016 MapR Technologies 2© 2014 MapR Technologies
© 2016 MapR Technologies 3
Me, Us
• Ted Dunning, MapR Chief Application Architect, Apache Member
– Committer PMC member Zookeeper, Drill, others
– Mentor for Flink, Beam (nee Dataflow), Drill, Storm, Zeppelin
– VP Incubator
– Bought the beer at the first HUG
• MapR
– Produces first converged platform for big and fast data
– Includes data platform (files, streams, tables) + open source
– Adds major technology for performance, HA, industry standard API’s
• Contact
@ted_dunning, ted.dunning@gmail.com, tdunning@mapr.com
© 2016 MapR Technologies 4
Agenda
• Rationale
• Why cheap isn't the same as simple-minded
• Some techniques
• Examples
© 2016 MapR Technologies 5
Outline
• We have a revolution on our hands
• This leads to a green-field situation
• That implies that many important problems are easy to solve
• The limiting factor is fielding good enough solutions
– Quickly
– With available workforce
• Examples
© 2016 MapR Technologies 6
Is this really a
revolutionary moment?
© 2016 MapR Technologies 7
Big is the next big thing
• Data scale is exploding
• Companies are being funded
• Books are being written
• Applications sprouting up everywhere
© 2016 MapR Technologies 8
Why Now?
• But Moore’s law has applied for a long time
• Why is data exploding now?
• Why not 10 years ago?
• Why not 20?
© 2016 MapR Technologies 9
Size Matters, but …
• If it were just availability of data then existing big companies
would adopt big data technology first
© 2016 MapR Technologies 10
Size Matters, but …
• If it were just availability of data then existing big companies
would adopt big data technology first
They didn’t
© 2016 MapR Technologies 11
Or Maybe Cost
• If it were just a net positive value then finance companies should
adopt first because they have higher opportunity value / byte
© 2016 MapR Technologies 12
Or Maybe Cost
• If it were just a net positive value then finance companies should
adopt first because they have higher opportunity value / byte
They didn’t
© 2016 MapR Technologies 13
Backwards adoption
• Under almost any threshold argument startups would not adopt
big data technology first
© 2016 MapR Technologies 14
Backwards adoption
• Under almost any threshold argument startups would not adopt
big data technology first
They did
© 2016 MapR Technologies 15
Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
© 2016 MapR Technologies 16
Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
© 2016 MapR Technologies 17
Analytics Scaling Laws
• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
© 2016 MapR Technologies 18
Most data isn’t worth much in isolation
First data is valuable
Later data is dregs
© 2016 MapR Technologies 19
Suddenly worth processing
First data is valuable
Later data is dregs
But has high aggregate value
© 2016 MapR Technologies 20
If we can handle the scale
It’s really big
© 2016 MapR Technologies 21
So what makes
that possible?
© 2016 MapR Technologies 22
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
© 2016 MapR Technologies 23
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
Net value optimum has
a sharp peak well
before maximum effort
© 2016 MapR Technologies 24
But scaling laws are
changing both slope and
shape
© 2016 MapR Technologies 25
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
More than just a
little
© 2016 MapR Technologies 26
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
They are changing a
LOT!
© 2016 MapR Technologies 27
© 2016 MapR Technologies 28
© 2016 MapR Technologies 29
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
© 2016 MapR Technologies 30
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
© 2016 MapR Technologies 31
2,0000 500 1000 1500
1
0
0.25
0.5
0.75
Scale
Value
Initially, linear cost scaling
actually makes things
worse
Then a tipping point is
reached and things change
radically …
© 2016 MapR Technologies 32
Pre-requisites for Tipping
• To reach the tipping point,
• Algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
© 2016 MapR Technologies 33
With great scale comes great opportunity
• Increasing scale by 1000x changes the game
• We essentially have green fields opening up all around
• Most of the opportunities don’t require advanced learning
© 2016 MapR Technologies 34
OK.
We have a
bona fide revolution
© 2016 MapR Technologies 35
Greenfield Problem Landscape
© 2016 MapR Technologies 36
Mature Problem Landscape
© 2016 MapR Technologies 37
Why is cheap better than deep (sometimes)?
When we have a greenfield, problems can be
– Easy (large number of these)
– Impossible (large number of these)
– Hard but possible (right on the boundary)
In a mature field, problems can be
– Easy (these are already done)
– Impossible (still a large number of these)
– Hard but possible (now the majority of the effort)
© 2016 MapR Technologies 38
Some examples
© 2016 MapR Technologies 39
A simple example - security monitoring
• “Small” data
– Capture IDS logs
– Detect what you already know
• “Big” data
– Capture switch, server, firewall logs as well
– New patterns emerge immediately
© 2016 MapR Technologies 40
Another example – fraud detection
• “Small” data
– Maintain card profiles
– Segment models
– Evaluate all transactions
• “Big” Data
– Maintain card profiles, full 90 day transaction history
– Evaluate all transactions
© 2016 MapR Technologies 41
Another example – indicator-based recommendation
• “Advanced” approach
– Use matrix completion techniques (LDA, NNM, ALS)
– Tune meta-parameters
– Ensembles galore
• “Simple” approach
– Count cooccurrences and cross-occurrences
– Finding “interesting” pairs
– Use standard search engine to recommend
© 2016 MapR Technologies 42
Easy != Stupid
• You still have to do things reasonably well
– Techniques that are not well founded are still problems
• Heuristic frequency ratios still fail
– Coincidences still dominate the data
– Accidental 100% correlations abound
• Related techniques still broken for coincidence
– Pearson’s χ2
– Simple correlations
© 2016 MapR Technologies 43
Scale does not cure wrong
It just makes easy more common
© 2016 MapR Technologies 44
A core technique
• Many of these easy problems reduce to finding interesting
coincidences
• This can be summarized as a 2 x 2 table
• Actually, many of these tables
A Other
B k11 k12
Other k21 k22
© 2016 MapR Technologies 45
How do you do that?
• This is well handled using G2-test
– See wikipedia
– See http://bit.ly/surprise-and-coincidence
• Original application in linguistics now cited > 2000 times
• Available in ElasticSearch, in Solr, in Mahout
• Available in R, C, Java, Python
© 2016 MapR Technologies 46
Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2
© 2016 MapR Technologies 47
Which one is the anomalous co-occurrence?
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
A not A
B 1 0
not B 0 2
0.90 1.95
4.52 14.3
Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence,
Computational Linguistics vol 19 no. 1 (1993)
© 2016 MapR Technologies 48
So we can find
interesting coincidences.
That gets us exactly what?
© 2016 MapR Technologies 49
Operation Ababil – Brobots on Parade
• Dork attack to find unpatched default Joomla sites
– Especially web servers with high bandwidth connections
– Basically just Google searches for default strings
– Joomla compromised into attack Brobot
• C&C network checks in occasionally
– Note C&C is incoming request and looks like normal web requests
• Later, on command, multiple Brobots direct 50-75 Gb/s of attack
– Attacks come from white-listed sites
© 2016 MapR Technologies 50
Attack Sequence
Source
First level
C&C
Second
level C&C
© 2016 MapR Technologies 51
Google
Attack Sequence
Source
First level
C&C
Second
level C&C
© 2016 MapR Technologies 52
Brobot
Brobot
Brobot
Attack Sequence
Source
First level
C&C
Second
level C&C
© 2016 MapR Technologies 53
Target
Brobot
Brobot
Brobot
Attack Sequence
Source
First level
C&C
Second
level C&C
© 2016 MapR Technologies 54
Outline of an Advanced Persistent Threat
• Advanced
– Common use of zero-day for preliminary attacks
– Often attributed to state-level actors
– Modern privateers blur the line
• Persistent
– Result of first attack is heavily muffled, no immediate exploit
– Remote access toolset installed (RAT)
• Threat
– On command, data is exfiltrated covertly or en masse
– Or the compromised host is used for other nefarious purpose
© 2016 MapR Technologies 55
APT in Summary
• Attack, penetrate, pivot, exfiltrate or exploit
• If you are a high-value target, attack is likely and stealthy
– High-value = telecom, banks, utilities, retail targets, web100
– … and all their vendors
– Conventional multi-factor auth is easily breached
• Penetration and pivot are critical counter-measure opportunities
– In 2010, RAT would contact command and control (C&C)
– In 2016, C&C looks like normal traffic
• Once exfiltration or exploit starts, you may no longer have a
business
© 2016 MapR Technologies 56
Target
Brobot
Brobot
Brobot
Example 1 - Ababil
Source
First level
C&C
Second
level C&C
Defense has to
happen here
© 2016 MapR Technologies 57
Spot the Important Difference?
GET /personal/comparison-table.jsp?iODg2OQ=51a90 HTTP/1.1
Host: www.sometarget.com
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;)
Accept-Encoding: deflate
Accept-Charset: UTF-8
Accept-Language: fr
Cache-Control: no-cache
Pragma: no-cache
Connection: Keep-Alive
GET /photo.jpg HTTP/1.1
Host: lh4.googleusercontent.com
User-Agent: Mozilla/5.0 (Macint
Accept: image/png,image/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate,
Referer: https://www.google.com
Connection: keep-alive
If-None-Match: "v9”
Cache-Control: max-age=0
Attacker request Real request
© 2016 MapR Technologies 58
Spot the Important Difference?
GET /personal/comparison-table.jsp?iODg2OQ=51a90 HTTP/1.1
Host: www.sometarget.com
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;)
Accept-Encoding: deflate
Accept-Charset: UTF-8
Accept-Language: fr
Cache-Control: no-cache
Pragma: no-cache
Connection: Keep-Alive
GET /photo.jpg HTTP/1.1
Host: lh4.googleusercontent.com
User-Agent: Mozilla/5.0 (Macint
Accept: image/png,image/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate,
Referer: https://www.google.com
Connection: keep-alive
If-None-Match: "v9”
Cache-Control: max-age=0
Attacker request Real request
© 2016 MapR Technologies 59
This could only be found at scale
© 2016 MapR Technologies 60
This could only be found at scale
But at scale, it is stupidly simple
to find
© 2016 MapR Technologies 61
Target
Brobot
Brobot
Brobot
Overall Outline Again
Source
First level
C&C
Second
level C&C
Tradecraft error!
© 2016 MapR Technologies 62
Large corpus analysis of source
IP’s wins big
© 2016 MapR Technologies 63
© 2016 MapR Technologies 64
Example 2 - Common Point of Compromise
• Scenario:
– Merchant 0 is compromised, leaks account data during compromise
– Fraud committed elsewhere during exploit
– High background level of fraud
– Limited detection rate for exploits
• Goal:
– Find merchant 0
• Meta-goal:
– Screen algorithms for this task without leaking sensitive data
© 2016 MapR Technologies 65
Example 2 - Common Point of Compromise
skim exploit
Merchant 0
Skimmed
data
Merchant n
Card data is stolen
from Merchant 0
That data is used
in frauds at other
merchants
© 2016 MapR Technologies 66
Simulation Setup
0 20 40 60 80 100
0100300500
day
count
Compromise period
Exploit period
compromises
frauds
© 2016 MapR Technologies 67
Detection Strategy
• Select histories that precede non-fraud
• And histories that precede fraud detection
• Analyze 2x2 cooccurrence of merchant n versus fraud detection
© 2016 MapR Technologies 68
© 2016 MapR Technologies 69
What about the
real world?
© 2016 MapR Technologies 70
●●●●●●●●●●●●●●●●●●●● ● ●● ●●● ●●● ●●●●● ●●●●● ●●● ●●● ●● ● ●● ●● ●● ● ●●●● ●●●● ●● ●●●● ●●●● ●●● ●● ●● ● ●● ● ●●●● ●● ● ●●●● ●●●●●● ●● ●● ●●● ●●● ●●●●● ● ●●● ●● ●●● ●●● ●● ●●●● ●
●● ●●● ●●● ●
●
● ●●
●
●
●
●●
020406080
LLR score for real data
Number of Merchants
BreachScore(LLR)
Real truly bad guys
100
101
102
103
104
105
106
Really truly bad guys
© 2016 MapR Technologies 71
Historical cooccurrence gives high
S/N
© 2016 MapR Technologies 72
Historical cooccurrence gives high
S/N
(we win)
© 2016 MapR Technologies 73
Cooccurrence AnalysisCooccurrence Analysis
© 2016 MapR Technologies 74
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
© 2016 MapR Technologies 75
Real-life example
© 2016 MapR Technologies 76
So …
• There are suddenly lots of these problems
• Simple techniques have surprising power at scale
– Cooccurrence via G2 / LLR
– Distributional anomaly detection via t-digest
• These simple techniques are largely unsuitable for academic
research
• But they are highly applicable in resource constrained industrial
settings
© 2016 MapR Technologies 77
●●●●●●●●●●●●●●●●●●●● ● ●● ●●● ●●● ●●●●● ●●●●● ●●● ●●● ●● ● ●● ●● ●● ● ●●●● ●●●● ●● ●●●● ●●●● ●●● ●● ●● ● ●● ● ●●●● ●● ● ●●●● ●●●●●● ●● ●● ●●● ●●● ●●●●● ● ●●● ●● ●●● ●●● ●● ●●●● ●
●● ●●● ●●● ●
●
● ●●
●
●
●
●●
020406080
LLR score for real data
Number of Merchants
BreachScore(LLR)
Real truly bad guys
100
101
102
103
104
105
106
Cooccurrence An
Summary
• We live in a golden age of newly achieved scale
• That scale has lowered the tree
– Hard problems are much easier
– Lots of low-hanging fruit all around us
• Cheap learning has huge value
• Code available at
http://github.com/tdunning 0 20 40 60 80 100
0100300500
day
count
Compromise period
Exploit period
compromises
frauds
© 2016 MapR Technologies 78
Me, Us
• Ted Dunning, MapR Chief Application Architect, Apache Member
– Committer PMC member Zookeeper, Drill, others
– Mentor for Flink, Beam (nee Dataflow), Drill, Storm, Zeppelin
– VP Incubator
– Bought the beer at the first HUG
• MapR
– Produces a converged platform for big and fast data
– Includes data platform (files, streams, tables) + open source
– Adds major technology for performance, HA, industry standard API’s
• Contact
@ted_dunning, ted.dunning@gmail.com, tdunning@mapr.com
© 2016 MapR Technologies 79
Q & A

More Related Content

What's hot

Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0MapR Technologies
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsMapR Technologies
 
MapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data PlatformMapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data PlatformMapR Technologies
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Insight Platforms Accelerate Digital Transformation
Insight Platforms Accelerate Digital TransformationInsight Platforms Accelerate Digital Transformation
Insight Platforms Accelerate Digital TransformationMapR Technologies
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainMapR Technologies
 
MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -MapR Technologies
 
MapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR Technologies
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 

What's hot (20)

Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for Genomics
 
MapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data PlatformMapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data Platform
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Insight Platforms Accelerate Digital Transformation
Insight Platforms Accelerate Digital TransformationInsight Platforms Accelerate Digital Transformation
Insight Platforms Accelerate Digital Transformation
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -MapR on Azure: Getting Value from Big Data in the Cloud -
MapR on Azure: Getting Value from Big Data in the Cloud -
 
MapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data PlatformMapR 5.2: Getting More Value from the MapR Converged Data Platform
MapR 5.2: Getting More Value from the MapR Converged Data Platform
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 

Viewers also liked

SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill WorkshopCharles Givre
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Charles Givre
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...The Hive
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1Charles Givre
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 

Viewers also liked (14)

SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Gta San Andreas CODECS
Gta San Andreas CODECSGta San Andreas CODECS
Gta San Andreas CODECS
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2Data Exploration with Apache Drill: Day 2
Data Exploration with Apache Drill: Day 2
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 

Similar to Deep Learning vs. Cheap Learning

Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Using Sequence Statistics to Fight Advanced Persistent Threats
Using Sequence Statistics to Fight Advanced Persistent ThreatsUsing Sequence Statistics to Fight Advanced Persistent Threats
Using Sequence Statistics to Fight Advanced Persistent ThreatsDataWorks Summit/Hadoop Summit
 
Ted Dunning - Keynote: How Can We Take Flink Forward?
Ted Dunning -  Keynote: How Can We Take Flink Forward?Ted Dunning -  Keynote: How Can We Take Flink Forward?
Ted Dunning - Keynote: How Can We Take Flink Forward?Flink Forward
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matterDataWorks Summit
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop DataWorks Summit/Hadoop Summit
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterDataWorks Summit
 
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell YouBig Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell YouMatt Stubbs
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Mathieu Dumoulin
 
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleIan Downard
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
 

Similar to Deep Learning vs. Cheap Learning (20)

Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming in the Extreme
Streaming in the ExtremeStreaming in the Extreme
Streaming in the Extreme
 
Using Sequence Statistics to Fight Advanced Persistent Threats
Using Sequence Statistics to Fight Advanced Persistent ThreatsUsing Sequence Statistics to Fight Advanced Persistent Threats
Using Sequence Statistics to Fight Advanced Persistent Threats
 
Ted Dunning - Keynote: How Can We Take Flink Forward?
Ted Dunning -  Keynote: How Can We Take Flink Forward?Ted Dunning -  Keynote: How Can We Take Flink Forward?
Ted Dunning - Keynote: How Can We Take Flink Forward?
 
How to tell which algorithms really matter
How to tell which algorithms really matterHow to tell which algorithms really matter
How to tell which algorithms really matter
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
How to Determine which Algorithms Really Matter
How to Determine which Algorithms Really MatterHow to Determine which Algorithms Really Matter
How to Determine which Algorithms Really Matter
 
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell YouBig Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
Big Data LDN 2017: Machine Learning: What Works And What They Won’t Tell You
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
 
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating Example
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
Handling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in FinanceHandling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in FinanceMapR Technologies
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataMapR Technologies
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital TransformationMapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
Handling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in FinanceHandling the Extremes: Scaling and Streaming in Finance
Handling the Extremes: Scaling and Streaming in Finance
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big Data
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital Transformation
 

Recently uploaded

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 

Recently uploaded (20)

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 

Deep Learning vs. Cheap Learning

  • 1. © 2016 MapR Technologies 1© 2014 MapR Technologies
  • 2. © 2016 MapR Technologies 2© 2014 MapR Technologies
  • 3. © 2016 MapR Technologies 3 Me, Us • Ted Dunning, MapR Chief Application Architect, Apache Member – Committer PMC member Zookeeper, Drill, others – Mentor for Flink, Beam (nee Dataflow), Drill, Storm, Zeppelin – VP Incubator – Bought the beer at the first HUG • MapR – Produces first converged platform for big and fast data – Includes data platform (files, streams, tables) + open source – Adds major technology for performance, HA, industry standard API’s • Contact @ted_dunning, ted.dunning@gmail.com, tdunning@mapr.com
  • 4. © 2016 MapR Technologies 4 Agenda • Rationale • Why cheap isn't the same as simple-minded • Some techniques • Examples
  • 5. © 2016 MapR Technologies 5 Outline • We have a revolution on our hands • This leads to a green-field situation • That implies that many important problems are easy to solve • The limiting factor is fielding good enough solutions – Quickly – With available workforce • Examples
  • 6. © 2016 MapR Technologies 6 Is this really a revolutionary moment?
  • 7. © 2016 MapR Technologies 7 Big is the next big thing • Data scale is exploding • Companies are being funded • Books are being written • Applications sprouting up everywhere
  • 8. © 2016 MapR Technologies 8 Why Now? • But Moore’s law has applied for a long time • Why is data exploding now? • Why not 10 years ago? • Why not 20?
  • 9. © 2016 MapR Technologies 9 Size Matters, but … • If it were just availability of data then existing big companies would adopt big data technology first
  • 10. © 2016 MapR Technologies 10 Size Matters, but … • If it were just availability of data then existing big companies would adopt big data technology first They didn’t
  • 11. © 2016 MapR Technologies 11 Or Maybe Cost • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
  • 12. © 2016 MapR Technologies 12 Or Maybe Cost • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte They didn’t
  • 13. © 2016 MapR Technologies 13 Backwards adoption • Under almost any threshold argument startups would not adopt big data technology first
  • 14. © 2016 MapR Technologies 14 Backwards adoption • Under almost any threshold argument startups would not adopt big data technology first They did
  • 15. © 2016 MapR Technologies 15 Everywhere at Once? • Something very strange is happening – Big data is being applied at many different scales – At many value scales – By large companies and small
  • 16. © 2016 MapR Technologies 16 Everywhere at Once? • Something very strange is happening – Big data is being applied at many different scales – At many value scales – By large companies and small Why?
  • 17. © 2016 MapR Technologies 17 Analytics Scaling Laws • Analytics scaling is all about the 80-20 rule – Big gains for little initial effort – Rapidly diminishing returns • The key to net value is how costs scale – Old school – exponential scaling – Big data – linear scaling, low constant • Cost/performance has changed radically – IF you can use many commodity boxes
  • 18. © 2016 MapR Technologies 18 Most data isn’t worth much in isolation First data is valuable Later data is dregs
  • 19. © 2016 MapR Technologies 19 Suddenly worth processing First data is valuable Later data is dregs But has high aggregate value
  • 20. © 2016 MapR Technologies 20 If we can handle the scale It’s really big
  • 21. © 2016 MapR Technologies 21 So what makes that possible?
  • 22. © 2016 MapR Technologies 22 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value
  • 23. © 2016 MapR Technologies 23 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value Net value optimum has a sharp peak well before maximum effort
  • 24. © 2016 MapR Technologies 24 But scaling laws are changing both slope and shape
  • 25. © 2016 MapR Technologies 25 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value More than just a little
  • 26. © 2016 MapR Technologies 26 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value They are changing a LOT!
  • 27. © 2016 MapR Technologies 27
  • 28. © 2016 MapR Technologies 28
  • 29. © 2016 MapR Technologies 29 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value
  • 30. © 2016 MapR Technologies 30 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value
  • 31. © 2016 MapR Technologies 31 2,0000 500 1000 1500 1 0 0.25 0.5 0.75 Scale Value Initially, linear cost scaling actually makes things worse Then a tipping point is reached and things change radically …
  • 32. © 2016 MapR Technologies 32 Pre-requisites for Tipping • To reach the tipping point, • Algorithms must scale out horizontally – On commodity hardware – That can and will fail • Data practice must change – Denormalized is the new black – Flexible data dictionaries are the rule – Structured data becomes rare
  • 33. © 2016 MapR Technologies 33 With great scale comes great opportunity • Increasing scale by 1000x changes the game • We essentially have green fields opening up all around • Most of the opportunities don’t require advanced learning
  • 34. © 2016 MapR Technologies 34 OK. We have a bona fide revolution
  • 35. © 2016 MapR Technologies 35 Greenfield Problem Landscape
  • 36. © 2016 MapR Technologies 36 Mature Problem Landscape
  • 37. © 2016 MapR Technologies 37 Why is cheap better than deep (sometimes)? When we have a greenfield, problems can be – Easy (large number of these) – Impossible (large number of these) – Hard but possible (right on the boundary) In a mature field, problems can be – Easy (these are already done) – Impossible (still a large number of these) – Hard but possible (now the majority of the effort)
  • 38. © 2016 MapR Technologies 38 Some examples
  • 39. © 2016 MapR Technologies 39 A simple example - security monitoring • “Small” data – Capture IDS logs – Detect what you already know • “Big” data – Capture switch, server, firewall logs as well – New patterns emerge immediately
  • 40. © 2016 MapR Technologies 40 Another example – fraud detection • “Small” data – Maintain card profiles – Segment models – Evaluate all transactions • “Big” Data – Maintain card profiles, full 90 day transaction history – Evaluate all transactions
  • 41. © 2016 MapR Technologies 41 Another example – indicator-based recommendation • “Advanced” approach – Use matrix completion techniques (LDA, NNM, ALS) – Tune meta-parameters – Ensembles galore • “Simple” approach – Count cooccurrences and cross-occurrences – Finding “interesting” pairs – Use standard search engine to recommend
  • 42. © 2016 MapR Technologies 42 Easy != Stupid • You still have to do things reasonably well – Techniques that are not well founded are still problems • Heuristic frequency ratios still fail – Coincidences still dominate the data – Accidental 100% correlations abound • Related techniques still broken for coincidence – Pearson’s χ2 – Simple correlations
  • 43. © 2016 MapR Technologies 43 Scale does not cure wrong It just makes easy more common
  • 44. © 2016 MapR Technologies 44 A core technique • Many of these easy problems reduce to finding interesting coincidences • This can be summarized as a 2 x 2 table • Actually, many of these tables A Other B k11 k12 Other k21 k22
  • 45. © 2016 MapR Technologies 45 How do you do that? • This is well handled using G2-test – See wikipedia – See http://bit.ly/surprise-and-coincidence • Original application in linguistics now cited > 2000 times • Available in ElasticSearch, in Solr, in Mahout • Available in R, C, Java, Python
  • 46. © 2016 MapR Technologies 46 Which one is the anomalous co-occurrence? A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 A not A B 1 0 not B 0 2
  • 47. © 2016 MapR Technologies 47 Which one is the anomalous co-occurrence? A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 A not A B 1 0 not B 0 2 0.90 1.95 4.52 14.3 Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics vol 19 no. 1 (1993)
  • 48. © 2016 MapR Technologies 48 So we can find interesting coincidences. That gets us exactly what?
  • 49. © 2016 MapR Technologies 49 Operation Ababil – Brobots on Parade • Dork attack to find unpatched default Joomla sites – Especially web servers with high bandwidth connections – Basically just Google searches for default strings – Joomla compromised into attack Brobot • C&C network checks in occasionally – Note C&C is incoming request and looks like normal web requests • Later, on command, multiple Brobots direct 50-75 Gb/s of attack – Attacks come from white-listed sites
  • 50. © 2016 MapR Technologies 50 Attack Sequence Source First level C&C Second level C&C
  • 51. © 2016 MapR Technologies 51 Google Attack Sequence Source First level C&C Second level C&C
  • 52. © 2016 MapR Technologies 52 Brobot Brobot Brobot Attack Sequence Source First level C&C Second level C&C
  • 53. © 2016 MapR Technologies 53 Target Brobot Brobot Brobot Attack Sequence Source First level C&C Second level C&C
  • 54. © 2016 MapR Technologies 54 Outline of an Advanced Persistent Threat • Advanced – Common use of zero-day for preliminary attacks – Often attributed to state-level actors – Modern privateers blur the line • Persistent – Result of first attack is heavily muffled, no immediate exploit – Remote access toolset installed (RAT) • Threat – On command, data is exfiltrated covertly or en masse – Or the compromised host is used for other nefarious purpose
  • 55. © 2016 MapR Technologies 55 APT in Summary • Attack, penetrate, pivot, exfiltrate or exploit • If you are a high-value target, attack is likely and stealthy – High-value = telecom, banks, utilities, retail targets, web100 – … and all their vendors – Conventional multi-factor auth is easily breached • Penetration and pivot are critical counter-measure opportunities – In 2010, RAT would contact command and control (C&C) – In 2016, C&C looks like normal traffic • Once exfiltration or exploit starts, you may no longer have a business
  • 56. © 2016 MapR Technologies 56 Target Brobot Brobot Brobot Example 1 - Ababil Source First level C&C Second level C&C Defense has to happen here
  • 57. © 2016 MapR Technologies 57 Spot the Important Difference? GET /personal/comparison-table.jsp?iODg2OQ=51a90 HTTP/1.1 Host: www.sometarget.com User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;) Accept-Encoding: deflate Accept-Charset: UTF-8 Accept-Language: fr Cache-Control: no-cache Pragma: no-cache Connection: Keep-Alive GET /photo.jpg HTTP/1.1 Host: lh4.googleusercontent.com User-Agent: Mozilla/5.0 (Macint Accept: image/png,image/*;q=0.8 Accept-Language: en-US,en;q=0.5 Accept-Encoding: gzip, deflate, Referer: https://www.google.com Connection: keep-alive If-None-Match: "v9” Cache-Control: max-age=0 Attacker request Real request
  • 58. © 2016 MapR Technologies 58 Spot the Important Difference? GET /personal/comparison-table.jsp?iODg2OQ=51a90 HTTP/1.1 Host: www.sometarget.com User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;) Accept-Encoding: deflate Accept-Charset: UTF-8 Accept-Language: fr Cache-Control: no-cache Pragma: no-cache Connection: Keep-Alive GET /photo.jpg HTTP/1.1 Host: lh4.googleusercontent.com User-Agent: Mozilla/5.0 (Macint Accept: image/png,image/*;q=0.8 Accept-Language: en-US,en;q=0.5 Accept-Encoding: gzip, deflate, Referer: https://www.google.com Connection: keep-alive If-None-Match: "v9” Cache-Control: max-age=0 Attacker request Real request
  • 59. © 2016 MapR Technologies 59 This could only be found at scale
  • 60. © 2016 MapR Technologies 60 This could only be found at scale But at scale, it is stupidly simple to find
  • 61. © 2016 MapR Technologies 61 Target Brobot Brobot Brobot Overall Outline Again Source First level C&C Second level C&C Tradecraft error!
  • 62. © 2016 MapR Technologies 62 Large corpus analysis of source IP’s wins big
  • 63. © 2016 MapR Technologies 63
  • 64. © 2016 MapR Technologies 64 Example 2 - Common Point of Compromise • Scenario: – Merchant 0 is compromised, leaks account data during compromise – Fraud committed elsewhere during exploit – High background level of fraud – Limited detection rate for exploits • Goal: – Find merchant 0 • Meta-goal: – Screen algorithms for this task without leaking sensitive data
  • 65. © 2016 MapR Technologies 65 Example 2 - Common Point of Compromise skim exploit Merchant 0 Skimmed data Merchant n Card data is stolen from Merchant 0 That data is used in frauds at other merchants
  • 66. © 2016 MapR Technologies 66 Simulation Setup 0 20 40 60 80 100 0100300500 day count Compromise period Exploit period compromises frauds
  • 67. © 2016 MapR Technologies 67 Detection Strategy • Select histories that precede non-fraud • And histories that precede fraud detection • Analyze 2x2 cooccurrence of merchant n versus fraud detection
  • 68. © 2016 MapR Technologies 68
  • 69. © 2016 MapR Technologies 69 What about the real world?
  • 70. © 2016 MapR Technologies 70 ●●●●●●●●●●●●●●●●●●●● ● ●● ●●● ●●● ●●●●● ●●●●● ●●● ●●● ●● ● ●● ●● ●● ● ●●●● ●●●● ●● ●●●● ●●●● ●●● ●● ●● ● ●● ● ●●●● ●● ● ●●●● ●●●●●● ●● ●● ●●● ●●● ●●●●● ● ●●● ●● ●●● ●●● ●● ●●●● ● ●● ●●● ●●● ● ● ● ●● ● ● ● ●● 020406080 LLR score for real data Number of Merchants BreachScore(LLR) Real truly bad guys 100 101 102 103 104 105 106 Really truly bad guys
  • 71. © 2016 MapR Technologies 71 Historical cooccurrence gives high S/N
  • 72. © 2016 MapR Technologies 72 Historical cooccurrence gives high S/N (we win)
  • 73. © 2016 MapR Technologies 73 Cooccurrence AnalysisCooccurrence Analysis
  • 74. © 2016 MapR Technologies 74 Real-life example • Query: “Paco de Lucia” • Conventional meta-data search results: – “hombres de paco” times 400 – not much else • Recommendation based search: – Flamenco guitar and dancers – Spanish and classical guitar – Van Halen doing a classical/flamenco riff
  • 75. © 2016 MapR Technologies 75 Real-life example
  • 76. © 2016 MapR Technologies 76 So … • There are suddenly lots of these problems • Simple techniques have surprising power at scale – Cooccurrence via G2 / LLR – Distributional anomaly detection via t-digest • These simple techniques are largely unsuitable for academic research • But they are highly applicable in resource constrained industrial settings
  • 77. © 2016 MapR Technologies 77 ●●●●●●●●●●●●●●●●●●●● ● ●● ●●● ●●● ●●●●● ●●●●● ●●● ●●● ●● ● ●● ●● ●● ● ●●●● ●●●● ●● ●●●● ●●●● ●●● ●● ●● ● ●● ● ●●●● ●● ● ●●●● ●●●●●● ●● ●● ●●● ●●● ●●●●● ● ●●● ●● ●●● ●●● ●● ●●●● ● ●● ●●● ●●● ● ● ● ●● ● ● ● ●● 020406080 LLR score for real data Number of Merchants BreachScore(LLR) Real truly bad guys 100 101 102 103 104 105 106 Cooccurrence An Summary • We live in a golden age of newly achieved scale • That scale has lowered the tree – Hard problems are much easier – Lots of low-hanging fruit all around us • Cheap learning has huge value • Code available at http://github.com/tdunning 0 20 40 60 80 100 0100300500 day count Compromise period Exploit period compromises frauds
  • 78. © 2016 MapR Technologies 78 Me, Us • Ted Dunning, MapR Chief Application Architect, Apache Member – Committer PMC member Zookeeper, Drill, others – Mentor for Flink, Beam (nee Dataflow), Drill, Storm, Zeppelin – VP Incubator – Bought the beer at the first HUG • MapR – Produces a converged platform for big and fast data – Includes data platform (files, streams, tables) + open source – Adds major technology for performance, HA, industry standard API’s • Contact @ted_dunning, ted.dunning@gmail.com, tdunning@mapr.com
  • 79. © 2016 MapR Technologies 79 Q & A

Editor's Notes

  1. Why is big data sooo fashionable with big and small companies from different industries? What has suddenly changed?
  2. But we have seen constant growth for a long time. And simple growth would only explain some kinds of companies starting with big data (probably big ones) and then slow adoption. Databases started with big companies and took 20 years or more to reach everywhere because the need exceeded cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now and why is it exploding at all?
  3. The different kinds of scaling laws have different shape and I think that shape is the key.
  4. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  5. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  6. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  7. In classical analytics, the cost of doing analytics increases sharply.
  8. The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
  9. New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
  10. This next sequence shows how the net value changes with different slope linear cost models.
  11. Notice how the best net value has jumped up significantly
  12. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
  13. Laugh on tech term in American English = garbage 10:38