Big Data : Bits of History, Words of Advice

Venu Vasudevan
GLSEC Big Data Meetup

Big Data Past
• big, fast, intelligent
• media, IoT, satellites

Big Data : Behavioral
• The ‘V’ view of Big Data challenges
• Number of V’s up for debate

Big Data : Architectural
• untidy data firehose → clean analytics
• fast & good vs. slower & much better
• Lambda architecture
• Lake architecture
• Stream architecture
Stream architecture

This Talk
• Behavioral view
• Technology solution stack
• ‘Middleware’ (benefit of hindsight)
• governance, culture (gap)
• data economics
• ownership food fights

3 data points
• big, fast, intelligent
• media, IoT, satellites

Iridium
• mobile routers (10K mph), fixed people
• no repeated patterns
• satellites N-S movement
• earth E-W movement
• regular topology, irregular exceptions
• solar flares
• military satellite presence

Fast Data Problem
• cellular frequency allocation (graph coloring problem)
• frequent fast recalculations (fast routers + semi-fast earth)
• transmit / no transmit (solar flares, military satellite presence)
• moving ‘seam’
(diagram: seam, irregularities)
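
The frequency allocation bullet above can be illustrated with a minimal greedy graph-coloring sketch. It assumes cells are nodes, an edge joins any two cells that would interfere, and a color stands for a frequency channel. The cell names and the `interference` map are hypothetical, and greedy coloring is a generic heuristic chosen for illustration, not necessarily what Iridium used.

```python
from typing import Dict, List, Set


def allocate_frequencies(interference: Dict[str, Set[str]]) -> Dict[str, int]:
    """Greedy graph coloring: give each cell the lowest channel index
    not already used by an interfering neighbor."""
    channels: Dict[str, int] = {}
    # Color the most constrained cells (most neighbors) first; break ties by name.
    order: List[str] = sorted(interference, key=lambda c: (-len(interference[c]), c))
    for cell in order:
        taken = {channels[n] for n in interference[cell] if n in channels}
        channel = 0
        while channel in taken:
            channel += 1
        channels[cell] = channel
    return channels


# Hypothetical interference graph: four cells in a ring plus one isolated cell.
interference = {
    "A": {"B", "D"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"A", "C"},
    "E": set(),
}
plan = allocate_frequencies(interference)
```

Because the satellites and the earth both move, the interference graph keeps changing, so a function like this would have to be rerun (or updated incrementally) often, which is the ‘frequent fast recalculations’ point.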

Fast Data Problem
• cellular frequency allocation (graph coloring problem)
• frequent fast recalculations (fast routers + semi-fast earth)
• transmit / no transmit (solar flares, military satellite presence)
• moving ‘seam’
• + ‘France’
(diagram: seam, irregularities; broadcast = +$$$; broadcast = -$$$ (lawsuit))

Fast Data Problem
• quest for (OO)DB technology to address ‘France’ as make-or-break use case
• query expressive power
• complex constraint satisfaction
• query handling throughput
• 3-4 month benchmarking effort
(diagram: seam; broadcast = +$$$; broadcast = -$$$ (lawsuit))

Fast Data Problem
• quest for (OO)DB technology to address ‘France’
• query expressive power
• query handling throughput
• 3-4 month benchmarking effort
• France solved ‘out-of-band’ (legally)
(diagram: seam; broadcast = +$$$; broadcast = -$$$ (lawsuit))
don’t overfit your architecture to an extreme requirement, unless it’s from an extreme (paying) user

Big Data Problem
• systems management
• manage 66 ‘nodes’
• nodes moving at 10K mph
• ‘seam’ moving at 20K mph
• sounds harder than trivial, but not too hard

‘Pre’ Lambda Solution
• Dumb edge | smart core approach
• 15K events/sec/satellite
• ~1M events/sec in total (66 satellites × 15K)
• Fast & Approximate - FMEA (failure mode and effects analysis): ‘compiled’ lookup table for failure modes
• Slow & Precise - Model-based reasoning on satellite models
(diagram: untidy satellite firehose (1M events/sec) → ‘Pre’ Lambda architecture (FMEA, Model-Based Reasoning) → actionable insights)
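
A minimal sketch of the two-path idea above, assuming a hypothetical event format with one `symptom` string per event. The fast path is a precompiled lookup table standing in for FMEA; the slow path is a placeholder standing in for model-based reasoning. All names, symptoms, and diagnoses are illustrative and not taken from the Iridium system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    satellite_id: int
    symptom: str  # hypothetical symptom code carried by a telemetry event


# Fast & approximate: a 'compiled' table from symptom to likely failure mode.
FMEA_TABLE: Dict[str, str] = {
    "battery_undervolt": "power subsystem degradation",
    "gyro_drift": "attitude control fault",
}


def fast_path(events: Iterable[Event]) -> Dict[Event, str]:
    """Immediate answer for every event; 'unknown' when the table has no entry."""
    return {e: FMEA_TABLE.get(e.symptom, "unknown") for e in events}


def slow_path(events: Iterable[Event]) -> Dict[Event, str]:
    """Placeholder: a real system would reason over a satellite model here."""
    return {e: f"model-based diagnosis for {e.symptom}" for e in events}


def merged_view(fast: Dict[Event, str], slow: Dict[Event, str]) -> Dict[Event, str]:
    """Serve fast answers, replaced by precise ones as they become available."""
    merged = dict(fast)
    merged.update(slow)
    return merged


events = [Event(1, "gyro_drift"), Event(2, "thermal_spike")]
quick = fast_path(events)
precise = slow_path(events[:1])  # pretend only the first event has finished
view = merged_view(quick, precise)
```

The design choice shown is that the approximate answer is always available first, and the precise answer overrides it only once the slower reasoning has completed.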

‘Pre’ Lambda Solution
• Dumb edge | smart core approach
• 15K events/sec/satellite
• Fast & Approximate - FMEA: ‘compiled’ lookup table for failure modes
• Slow & Precise - Model-based reasoning on satellite models
• Simple, straightforward & wrong.
(diagram: untidy satellite firehose (1M events/sec) → ‘Pre’ Lambda architecture (FMEA, Model-Based Reasoning, real-time expert system) → actionable insights)
Yet, an architecture that is ‘rinsed and repeated’ over the years

why does ‘dumb edge, smart cloud’ endure?
• edges are expensive ($2B)
• when edges go wrong (break / blow up / collide), they make headlines
(diagram: $ vs. $$$$$)

why ‘dumb edge, smart cloud’
• edges go wrong (break / blow up / collide) and make headlines
• nobody messes with an ‘edge’ once it works
• clouds don’t make for good news headlines
(diagram: $ at T-0; $$$$$ at T-30 yrs)

why ‘dumb edge, smart cloud’
• edges go wrong (break / blow up / collide) and make news headlines
• nobody messes with an ‘edge’ once it works
• thus, implementing an end-to-end architecture causes culture clashes
(diagram: ‘over my dead body’ vs. ‘iterate & refine’)

an almost repeat (Industrial IoT)
• edges are messy & domain specific
• creating them means dealing with culture clashes
• but .. an ounce of edge is worth a pound of cloud
(diagram: $$$$$ at T-30 yrs; $ at T-0)

Things to consider
• Problem statement. What’s your ‘France’?
  • colorful sub-problem. strategy overfit.
• Architecture. small fixes to IT/OT gap can go a long way to a simpler problem
• Technology Choices. best practices & the risk of ‘rewardless risk’
  • right - make average programmers productive with new tech
  • frequent - turn great programmers into average

Big Data to Deep Metadata
streaming video (TV) ~ 1 petabyte/day
• timescales: second, minute, hour, day/week, epochal
• detect & replace ads
• create playlists by Player, Play, Sentiment
• identify minor characters with rabid fan following
• rejuvenate old content
• derive new content
• ‘chapterize’ by Player, Play, Sentiment

Platform Triage Challenge
new product, new market
• one core technology, many markets
• platform triaging challenge. what drives the platform?
  • highest (but uncertain) $ potential?
  • ‘extreme’ requirement?
  • sparsest competition?
• use case outlier is your biggest customer
(diagram: deep metadata technology; SaaS data platform; Advertising, Search, Video, concept maps)

ad replacement use case
• speed
  • few days (on-demand content)
  • few seconds (real-time rebroadcast with new ads)
• precision
  • low - best effort, for low cost international content for niche audiences
  • high - frame level for expensive content, e.g. Sports/$10M/episode programming
• errors
  • 90% accuracy - ok for long tail content
  • ‘five nines’ for premium content
(diagram: ad replacement opportunity space, with axes precision, accuracy, speed; largest customer marked)

occam’s razor works (again)
• build to simplicity
• loose coupling between data engg & equipment engg
• modularize complexity
  • ‘differentiate your product’ changes
  • ‘necessary evil’ changes
(diagram: data-only approach; + 1st party integration (dynamically configure ad splicers); 3rd party knobs (dynamically refresh CDN))

but, what if ..
• Data is untidy
• Interpretation is subjective/cultural
• Automation is aspirational but quixotic

human-powered analytics
• some analytics tasks are too ‘slippery’ for machines
• data hard to characterize
  • uneven video quality of ‘old’ archives
  • untidy
• insights are subjective

human-powered analytics
• some analytics tasks are too ‘slippery’ for machines
• need for human augmentation
  • humans generate ‘training’ sets to bootstrap machine learning
  • humans completely take over some tasks

machines vs humans
• crowdsourcing & human-powered computing
  • has been the ‘next big thing’ for a while
  • checkered history:
    • uneven output
    • fraud
    • uneven throughput

Machines | Humans
fast | slow
brittle | malleable
objective | subjective
clear | nuanced

machines vs humans
• much of that has changed
• Amazon Mechanical Turk
  • 500K active users
  • the ‘human machine’ can return substantial jobs in under 30 mins
  • quantifiable as a machine for many media tasks - latency, quality, error rate, throughput
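
One way to treat a crowd workforce as a measurable ‘machine’ is to compute the same figures one would report for a service. The sketch below assumes a hypothetical batch of labeling answers with known gold labels; the field names, labels, and timings are illustrative and are not drawn from Mechanical Turk.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Answer:
    task_id: str
    label: str
    seconds: float  # time from posting the batch to receiving this answer


def majority_vote(answers: List[Answer]) -> Dict[str, str]:
    """Combine redundant answers per task into one consensus label."""
    votes: Dict[str, Counter] = {}
    for a in answers:
        votes.setdefault(a.task_id, Counter())[a.label] += 1
    return {task: c.most_common(1)[0][0] for task, c in votes.items()}


def crowd_metrics(answers: List[Answer], gold: Dict[str, str]) -> Dict[str, float]:
    """Error rate, latency, and throughput for one batch of tasks."""
    consensus = majority_vote(answers)
    wrong = sum(1 for task, label in consensus.items() if gold[task] != label)
    elapsed = max(a.seconds for a in answers)  # batch finishes with the last answer
    return {
        "error_rate": wrong / len(consensus),
        "latency_s": elapsed,
        "throughput_tasks_per_min": len(consensus) / (elapsed / 60),
    }


# Hypothetical batch: two video clips, three workers each.
answers = [
    Answer("clip1", "goal", 40), Answer("clip1", "goal", 75), Answer("clip1", "foul", 90),
    Answer("clip2", "foul", 60), Answer("clip2", "foul", 100), Answer("clip2", "goal", 120),
]
gold = {"clip1": "goal", "clip2": "goal"}
metrics = crowd_metrics(answers, gold)
```

Redundant answers with majority voting are one common way to smooth out uneven output from individual workers; the level of redundancy is a cost versus quality trade-off.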

Things to consider
• Beware ‘France’ in other forms:
  • customer with loudest voice & ‘holy grail’ hairball
• Dealing with data quality & variability
  • crowdsourcing has come a long way as credible ‘engine’
• If big data is the answer, what is the question? (have strong opinions, weakly held)
  • decision rationalization
  • process automation
  • human ‘power tool’ (e.g. compelling visualization) vs imperfect automation

startup data jiu-jitsu
• How to create a data-driven strategy before the data shows up?
  • rationalize future SaaS revenue models
  • justify product decisions in a data-driven manner
(diagram: need data for product ⇄ need product for data)

startup data jiu-jitsu
• How to create a data-driven strategy before the data shows up?
  • how ‘intelligent’ can lighting control be with 50-100K users?
  • how do people use dimmers (continuous or quantized), and what are the UX implications?

data set dilemma
• standard sources (e.g. Kaggle & UCI) insufficient
• few ‘physical world’ datasets
  • expensive to collect
  • may be specialized (vendor-specific)
• dataset proxies for IoT actuation may not work
  • energy utilization != switch usage

big data, small start
• physical world data likely to be smaller (1-10 homes, few months)
• setup costs limit size of public datasets
  • e.g. UMass Smart* light switch dataset

big data, small start
• consider data ‘augmentation’
  • standard practice in AI (deep learning) - horizontally flipping, random crops …
  • under-used in data space
  • may need some thought on perturbation models for your domain
(diagram: real vs. synthesized images; source: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html)
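
As a hedged illustration of a domain-specific perturbation model, the sketch below augments a small light-switch event log by jittering event times. The event format (seconds since midnight plus an on/off state), the jitter size, and the assumption that shifting events by a few minutes still yields realistic behavior are all hypothetical and would need checking against real usage.

```python
import random
from typing import List, Tuple

SwitchEvent = Tuple[int, str]  # (seconds since midnight, "on" or "off")
DAY = 24 * 60 * 60


def jitter_day(day: List[SwitchEvent], max_shift_s: int,
               rng: random.Random) -> List[SwitchEvent]:
    """Shift each event time by a random offset, clamp to the day,
    and keep the original on/off sequence. Assumes `day` is time-ordered."""
    shifted = sorted(
        min(DAY - 1, max(0, t + rng.randint(-max_shift_s, max_shift_s)))
        for t, _ in day
    )
    return [(t, state) for t, (_, state) in zip(shifted, day)]


def augment(days: List[List[SwitchEvent]], copies: int,
            max_shift_s: int = 300, seed: int = 0) -> List[List[SwitchEvent]]:
    """Return the real days followed by `copies` synthesized variants of each."""
    rng = random.Random(seed)
    out = list(days)
    for day in days:
        for _ in range(copies):
            out.append(jitter_day(day, max_shift_s, rng))
    return out


# Hypothetical single day of switch events for one home.
real_days = [[(7 * 3600, "on"), (8 * 3600, "off"), (19 * 3600, "on"), (23 * 3600, "off")]]
augmented = augment(real_days, copies=3)
```

Time jitter is only one possible perturbation; which perturbations are plausible (dropping events, shifting whole days, scaling durations) depends on the domain.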

In short ..
• big data success - equal parts tech & non-tech
• solving the right problem, not just solving the problem right
• revisit problem, and what success means

@venuv62
venu.vasudevan@nextio.co
