Sensors, networks, and massive data
Michael W. Mahoney
Stanford University
May 2012
(For more info, see:
http://cs.stanford.edu/people/mmahoney/
or Google on “Michael Mahoney”)
Lots of types of “sensors”
Examples:
• Physical/environmental: temperature, air quality, oil, etc.
• Consumer: RFID chips, SmartPhone, Store Video, etc.
• Health care: Patient Records, Images & Surgery Videos, etc.
• Financial: Transactions for regulations, HFT, etc.
• Internet/e-commerce: clicks, email, etc. for user modeling, etc.
• Astronomical/HEP: images, experiments, etc.
Common theme: easy to generate A LOT of data
Questions:
• What are the similarities/differences, in terms of funding drivers, customer
demands, questions of interest, time sensitivity, etc., for “sensing”
in these different applications?
• What can we learn from one area and apply to another area?
BIG data??? MASSIVE data????
NYT, Feb 11, 2012: “The Age of Big Data”
• “What is Big Data? A meme and a marketing term, for sure, but also
shorthand for advancing trends in technology that open the door to a new
approach to understanding the world and making decisions. …”
Why are big data big?
• Generate data at different places/times and different resolutions
• Factor of 10 more data is not just more data, but different data
BIG data??? MASSIVE data????
MASSIVE data:
• Internet, Customer Transactions, Astronomy/HEP = “Petascale”
• One Petabyte = watching 20 years of movies (HD) = listening to roughly 2,000
years of MP3 (128 kbit/s) = way too much to browse or comprehend
massive data:
• 10^5 people typed at 10^6 DNA SNPs; 10^6 or 10^9 node social network; etc.
In either case, main issues:
• Memory management issues, e.g., push computation to the data
• Hard to answer even basic questions about what data “looks like”
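The petabyte equivalences can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes) and a 365.25-day year; the function names are illustrative only. Under these assumptions, 20 years of video per petabyte implies a bitrate of about 12.7 Mbit/s, and 128 kbit/s audio fills a petabyte in about 2,000 years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: 365.25-day year
PETABYTE_BYTES = 10**15                # assumption: decimal petabyte


def years_of_playback(total_bytes: float, bitrate_bits_per_s: float) -> float:
    """Years of continuous playback that total_bytes holds at a given bitrate."""
    seconds = total_bytes * 8 / bitrate_bits_per_s
    return seconds / SECONDS_PER_YEAR


def implied_bitrate(total_bytes: float, years: float) -> float:
    """Bitrate (bits/s) at which total_bytes lasts exactly `years` of playback."""
    return total_bytes * 8 / (years * SECONDS_PER_YEAR)


mp3_years = years_of_playback(PETABYTE_BYTES, 128_000)
hd_mbit_per_s = implied_bitrate(PETABYTE_BYTES, 20) / 1e6
print(f"1 PB of 128 kbit/s MP3: about {mp3_years:,.0f} years")
print(f"1 PB spread over 20 years of video: about {hd_mbit_per_s:.1f} Mbit/s")
```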
How do we view BIG data?
Algorithmic vs. Statistical Perspectives
Computer Scientists
• Data: are a record of everything that happened.
• Goal: process the data to find interesting patterns and associations.
• Methodology: Develop approximation algorithms under different
models of data access since the goal is typically computationally hard.
Statisticians (and Natural Scientists)
• Data: are a particular random instantiation of an underlying process
describing unobserved patterns in the world.
• Goal: is to extract information about the world from noisy data.
• Methodology: Make inferences (perhaps about unseen events) by
positing a model that describes the random variability of the data
around the deterministic model.
Lambert (2000), Mahoney (2010)
Thinking about large-scale data
Data generation is the modern version of the microscope/telescope:
• See things we couldn't see before: e.g., movement of people, clicks and
interests; tracking of packages; fine-scale measurements of temperature,
chemicals, etc.
• Those inventions ushered in new scientific eras, new understanding of
the world, and new technologies to do stuff
Easy things become hard and hard things become easy:
• Easier to see the other side of the universe than the bottom of the ocean
• Means, sums, medians, correlations are easy with small data
Our ability to generate data far exceeds our
ability to extract insight from data.
Many challenges ...
• Tradeoffs between prediction & understanding
• Tradeoffs between computation & communication
• Balancing heat dissipation & energy requirements
• Scalable, interactive, & inferential analytics
• Temporal constraints in real-time applications
• Understanding “structure” and “noise” at large-scale (*)
• Even meaningfully answering “What does the data look like?”
Micro-markets in sponsored search
[Figure: bipartite graph of 10 million keywords and 1.4 million advertisers, with example clusters labeled Gambling, Sports, Sports Gambling, Movies Media, and Sport videos]
What is the CTR and advertiser ROI of sports gambling keywords?
Goal: Find isolated markets/clusters (in an advertiser-bidded phrase bipartite graph)
with sufficient money/clicks and sufficient coherence.
Question: Is this even possible?
What about sensors?
Vector space model - analogous to “bag-of-words” model for documents/terms.
• Each sensor is a “document,” a vector in a high-dimensional Euclidean space
• Each measurement is a “term”, describing the elements of that vector
• (Advertisers and bidded-phrases--and many other things--are also analogous.)
Can also define sensor-measurement graphs:
• Sensors are nodes, and edges are between sensors with similar measurements
A is an m × n matrix: m documents (sensors) by n terms (measurements).
A_ij = frequency of the j-th term in the i-th document (value of the j-th measurement
at the i-th sensor)
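The sensor-by-measurement matrix and the sensor similarity graph can be sketched as follows. The readings, the choice of cosine similarity, and the 0.99 threshold are all assumptions made for illustration; the slides do not specify how "similar measurements" is defined.

```python
import math

# Hypothetical readings: rows are sensors ("documents"),
# columns are measurements ("terms"). Values are made up for illustration.
A = [
    [20.1, 0.30, 5.0],   # sensor 0
    [20.4, 0.31, 5.2],   # sensor 1
    [3.0, 0.90, 40.0],   # sensor 2
    [2.8, 0.88, 41.5],   # sensor 3
]


def cosine(u, v):
    """Cosine similarity between two measurement vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_graph(rows, threshold):
    """Edges (i, j), i < j, between sensors whose cosine similarity >= threshold."""
    edges = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if cosine(rows[i], rows[j]) >= threshold:
                edges.append((i, j))
    return edges


edges = similarity_graph(A, threshold=0.99)
print(edges)
```

In a real application the columns would likely need rescaling first, since measurements in different units otherwise dominate the similarity.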
Cluster-quality Score: Conductance
[Figure: a node set S and its complement S’]
• How cluster-like is a set of nodes?
Idea: balance “boundary” of cluster
with “volume” of cluster
• Need a natural intuitive measure:
Conductance (normalized cut)
φ(S) ≈ # edges cut / # edges inside
• Small φ(S) corresponds to better
clusters of nodes
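A minimal sketch of the conductance score, assuming the usual normalized-cut definition φ(S) = cut(S, S’) / min(vol(S), vol(S’)), where vol is the sum of node degrees. That is one common way to make the slide's informal ratio precise; the function name and the toy graph are illustrative.

```python
def conductance(edges, S):
    """Conductance of node set S in an undirected, unweighted graph.

    phi(S) = cut(S, S') / min(vol(S), vol(S')), with vol = sum of degrees.
    """
    S = set(S)
    cut = 0
    vol_S = 0
    vol_rest = 0
    for u, v in edges:
        # Each edge endpoint adds one unit of degree to its side.
        for node in (u, v):
            if node in S:
                vol_S += 1
            else:
                vol_rest += 1
        if (u in S) != (v in S):
            cut += 1
    denom = min(vol_S, vol_rest)
    if denom == 0:
        raise ValueError("S and its complement must both touch at least one edge")
    return cut / denom


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
good = conductance(edges, {0, 1, 2})  # one cut edge over volume 7
poor = conductance(edges, {0, 3})     # every incident edge is cut
print(good, poor)
```

The triangle scores about 0.14 while the scattered pair scores 1.0, matching the rule that smaller φ(S) means a better cluster.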
Graph partitioning
A family of combinatorial optimization problems - want to
partition a graph’s nodes into two sets s.t.:
• Not much edge weight across the cut (cut quality)
• Both sides contain a lot of nodes
Standard formalizations of the bi-criterion are NP-hard!
Approximation algorithms:
• Spectral methods* - (compute eigenvectors)
• Local improvement - (important in practice)
• Multi-resolution - (important in practice)
• Flow-based methods* - (mincut-maxflow)
* comes with strong underlying theory to guide heuristics
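As an illustration of the spectral approach, here is a small sketch of spectral bisection in pure Python. It approximates the eigenvector of the second-smallest eigenvalue of the graph Laplacian by power iteration and splits nodes by sign. The power-iteration shortcut, the fixed iteration count, and the sign-based split are simplifying assumptions; production codes use sparse eigensolvers and sweep cuts.

```python
import math


def fiedler_vector(n, edges, iters=500):
    """Approximate the eigenvector of the second-smallest Laplacian eigenvalue.

    Power iteration on (c*I - L), projecting out the all-ones vector each step.
    Assumes a connected, undirected, unweighted graph on nodes 0..n-1.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    deg = [len(a) for a in adj]
    c = 2 * max(deg)  # upper bound on the largest Laplacian eigenvalue

    x = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # y = (c*I - L) x, where (L x)_i = deg_i * x_i - sum of neighbors' x
        y = [c * x[i] - (deg[i] * x[i] - sum(x[j] for j in adj[i]))
             for i in range(n)]
        mean = sum(y) / n
        y = [yi - mean for yi in y]  # remove the all-ones component
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return x


def spectral_bisection(n, edges):
    """Split nodes by the sign of the approximate Fiedler vector."""
    f = fiedler_vector(n, edges)
    left = {i for i in range(n) if f[i] < 0}
    return left, set(range(n)) - left


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
left, right = spectral_bisection(6, edges)
print(sorted(left), sorted(right))
```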
Comparison of “spectral” versus “flow”
Spectral:
• Compute an eigenvector
• “Quadratic” worst-case bounds
• Worst-case achieved -- on “long
stringy” graphs
• Embeds you on a line (or K_n)
Flow:
• Compute an LP
• O(log n) worst-case bounds
• Worst-case achieved -- on
expanders
• Embeds you in L1
Two methods:
• Complementary strengths and weaknesses
• What we compute will depend on approximation
algorithm as well as objective function.
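The “quadratic” and O(log n) labels above refer to standard worst-case guarantees. They are stated here in notation that is not on the slide: λ_2 is the second-smallest eigenvalue of the normalized Laplacian, φ(G) is the minimum conductance over node sets, and φ_spectral and φ_flow are the conductances of the cuts the two methods return. Constants vary with the exact normalization, so read this as a rough statement.

```latex
% Cheeger's inequality: the spectral sweep cut is within a quadratic factor.
\frac{\lambda_2}{2} \;\le\; \phi(G) \;\le\; \phi_{\mathrm{spectral}}
  \;\le\; \sqrt{2\lambda_2} \;\le\; 2\sqrt{\phi(G)}

% Flow-based (Leighton-Rao style) guarantee: within a logarithmic factor.
\phi(G) \;\le\; \phi_{\mathrm{flow}} \;\le\; O(\log n)\,\phi(G)
```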
Analogy: What does a protein look like?
Experimental Procedure:
• Generate a bunch of output data by using
the unseen object to filter a known input
signal.
• Reconstruct the unseen object given the
output signal and what we know about the
artifactual properties of the input signal.
Three possible representations (all-atom;
backbone; and solvent-accessible
surface) of the three-dimensional
structure of the protein triose
phosphate isomerase.
Popular small networks
[Figures: Zachary’s karate club; Newman’s Network Science; Meshes and RoadNet-CA]
Large Social and Information Networks
Typical example of our findings
General relativity collaboration network
(pretty small: 4,158 nodes, 13,422 edges)
[Plot: community score versus community size]
Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
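A plot of community score against community size can be read as a profile: for each size k, the best (lowest) score among node sets of that size. The brute-force sketch below assumes conductance as the score and a graph with no isolated nodes. It only runs on toy graphs, since the number of candidate sets grows combinatorially; the function names are illustrative, not the authors' code.

```python
from itertools import combinations


def conductance(edges, S):
    """cut(S, S') / min(vol(S), vol(S')) for an undirected, unweighted graph."""
    S = set(S)
    cut = sum((u in S) != (v in S) for u, v in edges)
    vol_S = sum((u in S) + (v in S) for u, v in edges)
    vol_rest = 2 * len(edges) - vol_S
    return cut / min(vol_S, vol_rest)


def community_profile(nodes, edges):
    """Map each size k (up to half the nodes) to the lowest conductance
    achieved by any k-node set, found by exhaustive search."""
    nodes = sorted(nodes)
    profile = {}
    for k in range(1, len(nodes) // 2 + 1):
        profile[k] = min(conductance(edges, S) for S in combinations(nodes, k))
    return profile


# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
profile = community_profile(range(6), edges)
for k, score in profile.items():
    print(k, round(score, 3))
```

On this toy graph the profile drops as k grows toward the size of a triangle, which is the natural community here.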
Large Social and Information Networks
[Plots: LiveJournal; Epinions]
Focus on the red curves (local spectral algorithm); blue (Metis+Flow), green (Bag of
whiskers), and black (randomly rewired network) are shown for consistency and cross-validation.
Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
Interpretation: “Whiskers” and the
“core” of large informatics graphs
• “Whiskers”
• maximal sub-graph detached
from network by removing a
single edge
• contains 40% of nodes and 20%
of edges
• “Core”
• the rest of the graph, i.e., the
2-edge-connected core
• Global minimum of the NCPP (network community profile plot) is a whisker
• BUT, core itself has nested whisker-core structure
Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
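The whisker/core split can be sketched with a bridge search: a bridge is an edge whose removal disconnects the graph, so whiskers hang off the core by bridges. The sketch assumes a simple, connected, undirected graph with integer node labels, and takes the core to be the largest piece left once all bridges are removed. It is an illustration of the definition above, not the authors' implementation.

```python
from collections import defaultdict


def find_bridges(adj):
    """Return bridge edges (as sorted tuples) of a simple undirected graph."""
    disc, low, bridges = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:  # nothing below v reaches u or above
                    bridges.add((min(u, v), max(u, v)))

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return bridges


def components(nodes, adj, blocked=frozenset()):
    """Connected components among `nodes`, ignoring edges in `blocked`."""
    nodes = set(nodes)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                edge = (min(u, v), max(u, v))
                if v in nodes and v not in seen and edge not in blocked:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def whiskers_and_core(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    bridges = find_bridges(adj)
    # Assumption: the core is the largest piece once every bridge is removed.
    core = max(components(set(adj), adj, bridges), key=len)
    # Whiskers: connected pieces of everything outside the core.
    whiskers = components(set(adj) - core, adj)
    return whiskers, core


# Core: square 0-1-2-3 with a chord; whiskers: path 4-5 and triangle 6-7-8.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2),
         (3, 4), (4, 5), (0, 6), (6, 7), (7, 8), (8, 6)]
whiskers, core = whiskers_and_core(edges)
print(sorted(core), [sorted(w) for w in whiskers])
```

The recursive search is fine for small examples; large graphs would need an iterative version to avoid Python's recursion limit.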
Local “structure” and global “noise”
Many (most/all?) large informatics graphs (& massive data in general?)
• have local structure that is meaningfully geometric/low-dimensional
• do not have analogous meaningful global structure
Intuitive example:
• What does the graph of you and your 10^2 closest Facebook friends “look like”?
• What does the graph of you and your 10^5 closest Facebook friends “look like”?
Many lessons ...
This is problematic for MANY things people want to do:
• statistical analysis that relies on asymptotic limits
• recursive clustering algorithms
• analysts who want a few meaningful clusters
More data need not be better if you:
• don’t have control over the noise
• want “islands of insight” in the “sea of data”
How does this manifest itself in your “sensor” application?
• Needles in haystack; correlations; time series -- “scientific” apps
• Historically, CS & database apps did more summaries & aggregates
Big changes in the past ... and future
Consider the creation of:
• Modern Physics
• Computer Science
• Molecular Biology
These were driven by new measurement techniques and
technological advances, but they led to:
• big new (academic and applied) questions
• new perspectives on the world
• lots of downstream applications
We are in the middle of a similarly big shift!
• OR and Management Science
• Transistors and Microelectronics
• Biotechnology
Conclusions
HUGE range of “sensors” are generating A LOT of data:
• will lead to a very different world in many ways
Large-scale data are very different than small-scale data.
• Easy things become hard, and hard things become easy
• Types of questions that are meaningful to ask are different
• Structure, noise, etc. properties are often deeply counterintuitive
Different applications are driven by different considerations
• next-user-interaction, qualitative insight, failure modes, false
positives versus false negatives, time sensitivity, etc.
Algorithms can compute answers to known questions
• but algorithms can also be used as “experimental probes” of the data
to form questions!
MMDS Workshop on
“Algorithms for Modern Massive Data Sets”
(http://mmds.stanford.edu)
at Stanford University, July 10-13, 2012
Objectives:
- Address algorithmic, statistical, and mathematical challenges in modern statistical
data analysis.
- Explore novel techniques for modeling and analyzing massive, high-dimensional, and
nonlinearly-structured data.
- Bring together computer scientists, statisticians, mathematicians, and data analysis
practitioners to promote cross-fertilization of ideas.
Organizers: M. W. Mahoney, A. Shkolnik, G. Carlsson, and P. Drineas
Registration is available now!

  • 1. Sensors, networks, and massive data Michael W. Mahoney Stanford University May 2012 ( For more info, see: http:// cs.stanford.edu/people/mmahoney/ or Google on “Michael Mahoney”)
  • 2. Lots of types of “sensors” Examples: • Physical/environmental: temperature, air quality, oil, etc. • Consumer: RFID chips, SmartPhone, Store Video, etc. • Health care: Patient Records, Images & Surgery Videos, etc. • Financial: Transactions for regulations, HFT, etc. • Internet/e-commerce: clicks, email, etc. for user modeling, etc. • Astronomical/HEP: images, experiments, etc. Common theme: easy to generate A LOT of data Questions: • What are similarities/differences i.t.o. funding drivers, customer demands, questions of interest, time sensitivity, etc. about “sensing” in these different applications? • What can we learn from one area and apply to another area?
  • 3. BIG data??? MASSIVE data???? NYT, Feb 11, 2012: “The Age of Big Data” • “What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. …” Why are big data big? • Generate data at different places/times and different resolutions • Factor of 10 more data is not just more data, but different data
  • 4. BIG data??? MASSIVE data???? MASSIVE data: • Internet, Customer Transactions, Astronomy/HEP = “Petascale” • One Petabyte = watching 20 years of movies (HD) = listening to 20,000 years of MP3 (128 kbits/sec) = way too much to browse or comprehend massive data: • 105 people typed at 106 DNA SNPs; 106 or 109 node social network; etc. In either case, main issues: • Memory management issues, e.g., push computation to the data • Hard to answer even basic questions about what data “looks like”
  • 5. How do we view BIG data?
  • 6. Algorithmic vs. Statistical Perspectives Computer Scientists • Data: are a record of everything that happened. • Goal: process the data to find interesting patterns and associations. • Methodology: Develop approximation algorithms under different models of data access since the goal is typically computationally hard. Statisticians (and Natural Scientists) • Data: are a particular random instantiation of an underlying process describing unobserved patterns in the world. • Goal: is to extract information about the world from noisy data. • Methodology: Make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model. Lambert (2000), Mahoney (2010)
  • 7. Thinking about large-scale data Data generation is modern version of microscope/telescope: • See things couldn't see before: e.g., movement of people, clicks and interests; tracking of packages; fine-scale measurements of temperature, chemicals, etc. • Those inventions ushered new scientific eras and new understanding of the world and new technologies to do stuff Easy things become hard and hard things become easy: • Easier to see the other side of universe than bottom of ocean • Means, sums, medians, correlations is easy with small data Our ability to generate data far exceeds our ability to extract insight from data.
  • 8. Many challenges ... • Tradeoffs between prediction & understanding • Tradeoffs between computation & communication, • Balancing heat dissipation & energy requirements • Scalable, interactive, & inferential analytics • Temporal constraints in real-time applications • Understanding “structure” and “noise” at large-scale (*) • Even meaningfully answering “What does the data look like?”
  • 9. Micro-markets in sponsored search 10 million keywords 1.4MillionAdvertisers Gambling Sports Sports Gambling Movies Media Sport videos What is the CTR and advertiser ROI of sports gambling keywords? Goal: Find isolated markets/clusters (in an advertiser-bidded phrase bipartite graph) with sufficient money/clicks with sufficient coherence. Ques: Is this even possible?
  • 10. What about sensors? Vector space model - analogous to “bag-of-words” model for documents/terms. • Each sensor is a “document,” a vector in a high-dimensional Euclidean space • Each measurement is a “term”, describing the elements of that vector • (Advertisers and bidded-phrases--and many other things--are also analogous.) Can also define sensor-measurement graphs : • Sensors are nodes, and edges are between sensors with similar measurements m documents (sensors) n terms (measurements) A ij= frequency of j-th term in i-th document (value of j-th measurement at i-th sensor) = =
  • 11. Cluster-quality Score: Conductance S S’ 11  How cluster-like is a set of nodes? Idea: balance “boundary” of cluster with “volume” of cluster  Need a natural intuitive measure: Conductance (normalized cut) φ(S) ≈ # edges cut / # edges inside  Small φ(S) corresponds to better clusters of nodes
  • 12. Graph partitioning A family of combinatorial optimization problems - want to partition a graph’s nodes into two sets s.t.: • Not much edge weight across the cut (cut quality) • Both sides contain a lot of nodes Standard formalizations of the bi-criterion are NP-hard! Approximation algorithms: • Spectral methods* - (compute eigenvectors) • Local improvement - (important in practice) • Multi-resolution - (important in practice) • Flow-based methods* - (mincut-maxflow) * comes with strong underlying theory to guide heuristics
  • 13. Comparison of “spectral” versus “flow” Spectral: • Compute an eigenvector • “Quadratic” worst-case bounds • Worst-case achieved -- on “long stringy” graphs • Embeds you on a line (or Kn) Flow: • Compute a LP • O(log n) worst-case bounds • Worst-case achieved -- on expanders • Embeds you in L1 Two methods: • Complementary strengths and weaknesses • What we compute will depend on approximation algorithm as well as objective function.
  • 14. Analogy: What does a protein look like? Experimental Procedure: • Generate a bunch of output data by using the unseen object to filter a known input signal. • Reconstruct the unseen object given the output signal and what we know about the artifactual properties of the input signal. Three possible representations (all-atom; backbone; and solvent-accessible surface) of the three-dimensional structure of the protein triose phosphate isomerase.
  • 15. Popular small networks Zachary’s karate club Newman’s Network Science Meshes and RoadNet-CA
  • 16. Large Social and Information Networks
  • 17. Typical example of our findings General relativity collaboration network (pretty small: 4,158 nodes, 13,422 edges) (Plot: community score versus community size.) Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
  • 18. Large Social and Information Networks (Panels: LiveJournal and Epinions.) Focus on the red curves (local spectral algorithm); the blue (Metis+Flow), green (Bag of whiskers), and black (randomly rewired network) curves are shown for consistency and cross-validation. Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
  • 19. Interpretation: “Whiskers” and the “core” of large informatics graphs • “Whiskers” • maximal sub-graph detached from network by removing a single edge • contains 40% of nodes and 20% of edges • “Core” • the rest of the graph, i.e., the 2-edge-connected core • Global minimum of NCPP is a whisker • BUT, core itself has nested whisker-core structure Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
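The whisker/core split can be illustrated on a toy graph. The sketch below is one plausible reading of the definition, not the procedure from the cited papers: it assumes a connected simple graph, finds bridges by brute force (remove each edge and test connectivity), treats the largest bridge-free component as the core, and reports the connected pieces that remain outside the core as whiskers. All function names are hypothetical.

```python
from collections import defaultdict

def components(nodes, edges):
    """Connected components of the graph (nodes, edges), as a list of sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def whiskers_and_core(nodes, edges):
    """Split a connected simple graph into whiskers and a core.

    Assumption: the core is the largest component left after deleting
    every bridge; each whisker then hangs off the core by a single edge.
    """
    nodes = set(nodes)
    # Brute-force bridge test: an edge is a bridge if removing it disconnects the graph.
    bridges = [e for e in edges
               if len(components(nodes, [f for f in edges if f != e])) > 1]
    non_bridge = [e for e in edges if e not in bridges]
    core = max(components(nodes, non_bridge), key=len)
    rest = nodes - core
    rest_edges = [(u, v) for u, v in edges if u in rest and v in rest]
    return components(rest, rest_edges), core

# A 4-cycle core with a path whisker {4, 5} and a triangle whisker {6, 7, 8}.
edges = [(0, 1), (1, 2), (2, 3), (3, 0),
         (0, 4), (4, 5),
         (2, 6), (6, 7), (7, 8), (6, 8)]
whiskers, core = whiskers_and_core(range(9), edges)
print(sorted(core), sorted(sorted(w) for w in whiskers))
```

Note that the triangle {6, 7, 8} counts as a whisker even though it contains a cycle, because removing the single edge (2, 6) detaches it from the rest of the network.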
  • 20. Local “structure” and global “noise” Many (most/all?) large informatics graphs (& massive data in general?) • have local structure that is meaningfully geometric/low-dimensional • do not have analogous meaningful global structure Intuitive example: • What does the graph of you and your $10^2$ closest Facebook friends “look like”? • What does the graph of you and your $10^5$ closest Facebook friends “look like”?
  • 21. Many lessons ... This is problematic for MANY things people want to do: • statistical analysis that relies on asymptotic limits • recursive clustering algorithms • analysts who want a few meaningful clusters More data need not be better if you: • don’t have control over the noise • want “islands of insight” in the “sea of data” How does this manifest itself in your “sensor” application? • Needles in haystack; correlations; time series -- “scientific” apps • Historically, CS & database apps did more summaries & aggregates
  • 22. Big changes in the past ... and future Consider the creation of: • Modern Physics • Computer Science • Molecular Biology These were driven by new measurement techniques and technological advances, but they led to: • big new (academic and applied) questions • new perspectives on the world • lots of downstream applications We are in the middle of a similarly big shift! • OR and Management Science • Transistors and Microelectronics • Biotechnology
  • 23. Conclusions HUGE range of “sensors” are generating A LOT of data: • will lead to a very different world in many ways Large-scale data are very different than small-scale data. • Easy things become hard, and hard things become easy • Types of questions that are meaningful to ask are different • Structure, noise, etc. properties are often deeply counterintuitive Different applications are driven by different considerations • next-user-interaction, qualitative insight, failure modes, false positives versus false negatives, time sensitivity, etc. Algorithms can compute answers to known questions • but algorithms can also be used as “experimental probes” of the data to form questions!
  • 24. MMDS Workshop on “Algorithms for Modern Massive Data Sets” (http://mmds.stanford.edu) at Stanford University, July 10-13, 2012 Objectives: - Address algorithmic, statistical, and mathematical challenges in modern statistical data analysis. - Explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured data. - Bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to promote cross-fertilization of ideas. Organizers: M. W. Mahoney, A. Shkolnik, G. Carlsson, and P. Drineas. Registration is available now!