Big Data: Bits of History, Words of Advice by Venu Vasudevan discusses the history and challenges of big data. The document provides examples from Iridium satellite communications and an industrial IoT project to illustrate issues with fast, unpredictable data streams and the challenges of integrating IT and OT systems. It advocates for hybrid human-machine approaches and emphasizes starting simply before over-engineering architectures to fit extreme needs. The key lessons are to focus on the core problem, allow for data variability, and consider lightweight data augmentation when starting with limited real-world datasets.
4. Big Data : Behavioral
Big Data
- The ‘V’ view of Big Data challenges
- Number of V’s up for debate
5. Big Data : Architectural
Diagram labels: untidy data firehose, clean analytics, fast & good, slower & much better, Lambda architecture, Lake architecture, Stream architecture
10. Iridium
• mobile routers (10K mph), fixed people
• no repeated patterns
• satellites N-S movement
• earth E-W movement
• regular topology, irregular exceptions
• solar flares
• military satellite presence
11. Fast Data Problem
• cellular frequency allocation (graph coloring problem)
• frequent fast recalculations (fast routers + semi-fast earth)
• transmit-no transmit (solar flares, military satellite presence)
• moving ‘seam’
Diagram labels: seam, irregularities
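The graph coloring framing above can be sketched in a few lines. This is a minimal illustration, not the Iridium implementation: cells are graph nodes, an edge means two cells would interfere on the same frequency, and a greedy heuristic hands each cell the lowest channel its neighbors are not using. The cell names, the interference map and the function names are hypothetical.

```python
from typing import Dict, Hashable, Set


def greedy_frequency_assignment(
    interference: Dict[Hashable, Set[Hashable]]
) -> Dict[Hashable, int]:
    """Give each cell the lowest channel index not used by an interfering neighbor."""
    # Visit the most constrained (highest-degree) cells first.
    # This is a common heuristic; it does not guarantee a minimal channel count.
    order = sorted(interference, key=lambda cell: len(interference[cell]), reverse=True)
    channel: Dict[Hashable, int] = {}
    for cell in order:
        taken = {channel[n] for n in interference[cell] if n in channel}
        c = 0
        while c in taken:
            c += 1
        channel[cell] = c
    return channel


def is_valid(interference, channel) -> bool:
    """True when no two interfering cells share a channel."""
    return all(channel[a] != channel[b] for a in interference for b in interference[a])


# Hypothetical layout: five cells, each mapped to the cells it interferes with.
cells = {
    "A": {"B", "D", "E"},
    "B": {"A", "C", "E"},
    "C": {"B", "D"},
    "D": {"A", "C"},
    "E": {"A", "B"},
}
plan = greedy_frequency_assignment(cells)
```

Because the satellites and the earth both move, the interference map changes continuously, so in this framing the assignment would have to be recomputed often. A greedy pass is cheap enough to rerun, at the cost of sometimes using more channels than strictly necessary.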
12. Fast Data Problem
• cellular frequency allocation (graph coloring problem)
• frequent fast recalculations (fast routers + semi-fast earth)
• transmit-no transmit (solar flares, military satellite presence)
• moving ‘seam’
• + ‘France’
Diagram labels: seam, irregularities; broadcast = +$$$; broadcast = -$$$ (lawsuit)
13. Fast Data Problem
• quest for (OO)DB technology to address ‘France’ as make-or-break use case
• query expressive power
• complex constraint satisfaction
• query handling throughput
• 3-4 month benchmarking effort
Diagram labels: seam; broadcast = +$$$; broadcast = -$$$ (lawsuit)
14. Fast Data Problem
• quest for (OO)DB technology to address ‘France’
• query expressive power
• query handling throughput
• 3-4 month benchmarking effort
• France solved ‘out-of-band’ (legally)
Diagram labels: seam; broadcast = +$$$; broadcast = -$$$ (lawsuit)
don’t overfit your architecture to an extreme requirement unless it’s from an extreme (paying) user
15. Big Data Problem
• systems management
• manage 66 ‘nodes’
• nodes moving at 10K mph
• ‘seam’ moving at 20K mph
• sounds harder than trivial, but not too hard
17. ‘Pre’ Lambda Solution
• Dumb edge | smart core approach
• 15K events/sec/satellite
• Fast & Approximate - FMEA: ‘compiled’ lookup table for failure modes
• Slow & Precise - Model-based reasoning on satellite models
• Simple, straightforward & wrong.
Diagram (‘Pre’ Lambda architecture): untidy satellite firehose (1M events/sec); Model-Based Reasoning; real-time expert system; FMEA; actionable insights
Yet, an architecture that is ‘rinsed and repeated’ over the years
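The fast/slow split above can be sketched as a two-tier triage. This is an illustration under stated assumptions, not the original system: the table contents, the event fields and the function names are hypothetical, and the slow model-based reasoner is represented only by a queue of deferred events.

```python
from collections import deque

# Hypothetical 'compiled' FMEA table: (subsystem, symptom) -> known failure mode.
FMEA_TABLE = {
    ("power", "undervoltage"): "battery degradation",
    ("antenna", "signal_loss"): "pointing error",
}


def fast_path(event):
    """Constant-time lookup; returns None when the symptom is not in the table."""
    return FMEA_TABLE.get((event["subsystem"], event["symptom"]))


def triage(events):
    """Diagnose what the lookup table covers and defer the rest to the slow path."""
    diagnosed = []
    deferred = deque()  # events waiting for slow, precise model-based reasoning
    for event in events:
        mode = fast_path(event)
        if mode is None:
            deferred.append(event)
        else:
            diagnosed.append((event["satellite"], mode))
    return diagnosed, deferred


# Hypothetical events from the firehose.
sample = [
    {"satellite": 12, "subsystem": "power", "symptom": "undervoltage"},
    {"satellite": 40, "subsystem": "thermal", "symptom": "overheat"},
]
diagnosed, deferred = triage(sample)
```

The point of the split is that the lookup table answers the common cases at firehose speed, while anything it cannot classify waits for the slower reasoner instead of blocking the stream.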
18. why does dumb edge smart cloud endure?
• edges are expensive ($2B)
• when edges go wrong (break/blow up/collide), they make headlines
Diagram labels: $, $$$$$
19. why dumb edge smart cloud
• edges are expensive ($2B)
• when edges go wrong (break/blow up/collide), they make headlines
• nobody messes with an ‘edge’ once it works
• clouds don’t make for good news headlines
Diagram labels: $, T-0; $$$$$, T-30 yrs
20. why dumb edge smart cloud
• edges are expensive ($2B)
• when edges go wrong (break/blow up/collide), they make news headlines
• nobody messes with an ‘edge’ once it works
• thus, implementing an end-to-end architecture causes culture clashes
Diagram labels: over my dead body; iterate & refine
21. an almost repeat (Industrial IoT)
• edges are messy & domain specific
• creating them means dealing with culture clashes
• but .. an ounce of edge is worth a pound of cloud
Diagram labels: $$$$$, T-30 yrs; $, T-0
22. Things to consider
• Problem statement. What’s your ‘France’?
• colorful sub-problem. strategy overfit.
• Architecture. small fixes to IT/OT gap can go a long way to a simpler problem
• Technology Choices. best practices & the risk of ‘rewardless risk’
• right - make average programmers productive with new tech
• frequent - turn great programmers into average
23. Big Data to Deep Metadata
streaming video (TV) ~ 1 petabyte/day
Timescales: second, minute, hour, day/week, epochal
• detect & replace ads
• create playlists by Player, Play, Sentiment
• identify minor characters with rabid fan following
• rejuvenate old content
• derive new content
• ‘chapterize’ by Player, Play, Sentiment
24. Platform Triage Challenge
new Product, new market
• one core technology, many markets
• platform triaging challenge. what drives the platform?
• highest (but uncertain) $ potential?
• ‘extreme’ requirement?
• sparsest competition?
• use case outlier is your biggest customer
Diagram labels: deep metadata technology, SaaS data platform, Advertising, Search, Video, concept maps
25. ad replacement use case
• speed
• few days (on-demand content)
• few seconds (real-time rebroadcast with new ads)
• precision
• low - best effort, for low cost international content for niche audiences
• high - frame level for expensive content. e.g. Sports/$10M/episode programming
• errors
• 90% accuracy - ok for long tail content
• ‘five nines’ for premium content
Diagram labels: precision, accuracy, speed; ad replacement opportunity space; largest customer
26. occam’s razor works (again)
• build to simplicity
• loose coupling between data engg & equipment engg
• modularize complexity
• ‘differentiate your product’ changes
• ‘necessary evil’ changes
Diagram labels: data-only approach; + 1st party integration (dynamically configure ad splicers); 3rd party knobs (dynamically refresh CDN)
28. but, what if ..
• Data is untidy
• Interpretation is subjective/cultural
• Automation is aspirational but quixotic
29. human-powered analytics
• some analytics tasks are too ‘slippery’ for machines
• data hard to characterize
• uneven video quality of ‘old’ archives
• untidy
• insights are subjective
30. human-powered analytics
• some analytics tasks are too ‘slippery’ for machines
• need for human augmentation
• humans generate ‘training’ sets to bootstrap machine learning
• humans completely take over some tasks
31. machines vs humans
• crowdsourcing & human-powered computing
• has been the ‘next big thing’ for a while
• checkered history:
• uneven output
• fraud
• uneven throughput
Machines  | Humans
fast      | slow
brittle   | malleable
objective | subjective
clear     | nuanced
32. machines vs humans
• much of that has changed
• Amazon Mechanical Turk
• 500K active users
• the ‘human machine’ can return substantial jobs in under 30 mins
• quantifiable as a machine for many media tasks - latency, quality, error rate, throughput
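One common way to put a number on crowd quality is to collect several labels per item, take the majority vote, and track how strongly workers agree. The slides do not say which method was used; the sketch below is an assumption, with hypothetical item IDs, labels and function name.

```python
from collections import Counter


def aggregate_labels(votes):
    """Map each item to (majority label, share of workers who chose it).

    votes: dict of item_id -> list of labels from different workers.
    On a tie, the label that appears first in the list wins.
    """
    result = {}
    for item, labels in votes.items():
        label, count = Counter(labels).most_common(1)[0]
        result[item] = (label, count / len(labels))
    return result


# Hypothetical sentiment labels for two video clips.
votes = {
    "clip-1": ["positive", "positive", "negative"],
    "clip-2": ["neutral", "neutral", "neutral"],
}
consensus = aggregate_labels(votes)
```

Items with a low agreement share can be routed back for more labels, which is one way to trade latency for quality when output is uneven.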
34. Things to consider
• Beware ‘France’ in other forms:
• customer with loudest voice & ‘holy grail’ hairball
• Dealing with data quality & variability
• crowdsourcing has come a long way as credible ‘engine’
• If big data is the answer, what is the question? (have strong opinions, weakly held)
• decision rationalization
• process automation
• human ‘power tool’ (e.g. compelling visualization) vs imperfect automation
35. startup data jiu-jitsu
• How to create a data-driven strategy before the data shows up?
• rationalize future SaaS revenue models
• justify product decisions in a data-driven manner
Diagram labels: need data for product; need product for data
36. startup data jiu-jitsu
• How to create a data-driven strategy before the data shows up?
• how ‘intelligent’ can lighting control be with 50-100K users?
• how do people use dimmers (continuous or quantized) - UX implications
37. data set dilemma
• standard sources (e.g. Kaggle & UCI) insufficient
• few ‘physical world’ datasets
• expensive to collect
• may be specialized (vendor-specific)
• dataset proxies for IoT actuation may not work
• energy utilization != switch usage
38. big data, small start
• physical world data likely to be smaller (1-10 homes, few months)
• setup costs limit size of public datasets
• e.g. UMass Smart* light switch dataset
39. big data, small start
• consider data ‘augmentation’
• standard practice in AI (deep learning) - horizontally flipping, random crops …
• under-used in data space
• may need some thought on perturbation models for your domain
Diagram labels: real, synthesized
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
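As a rough sketch of what a perturbation model for a non-image domain might look like, the snippet below synthesizes variants of a small dimmer-usage trace by jittering event times and dim levels. The event format, jitter sizes and function name are assumptions for illustration; a real model would need domain thought (for example, whether people really shift their routines by a few minutes).

```python
import random


def augment_switch_trace(events, n_copies=5, time_jitter_s=300.0,
                         level_jitter=0.05, seed=0):
    """Return n_copies perturbed versions of a trace.

    events: list of (timestamp_seconds, dim_level) with dim_level in [0, 1].
    Each copy shifts every timestamp by up to +/- time_jitter_s and adds
    Gaussian noise to the dim level, clamped back into [0, 1].
    """
    rng = random.Random(seed)  # seeded so the synthetic set is reproducible
    synthesized = []
    for _ in range(n_copies):
        copy = []
        for t, level in events:
            new_t = t + rng.uniform(-time_jitter_s, time_jitter_s)
            new_level = min(1.0, max(0.0, level + rng.gauss(0.0, level_jitter)))
            copy.append((new_t, new_level))
        copy.sort(key=lambda e: e[0])  # keep each trace in time order
        synthesized.append(copy)
    return synthesized


# Hypothetical one-evening trace: (seconds since midnight, dim level).
real_trace = [(64800.0, 0.8), (72000.0, 0.4), (79200.0, 0.0)]
synthetic_traces = augment_switch_trace(real_trace, n_copies=3)
```

Synthetic traces like these only stretch a small dataset; they cannot add behaviors that the real homes never exhibited.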
40. In short ..
• big data success - equal parts tech & non-tech
• solving the right problem, not just solving the problem right
• revisit problem, and what success means