Knowledge Will Propel Machine Understanding of Big Data - Amit Sheth
Preview video: https://youtu.be/4e0dtV7CTWM
CCKS Keynote, August 2017: http://www.ccks2017.com/?page_id=358
SEAS Summer School, July 2017
https://sites.google.com/view/seasschool2017/talks
Related paper: http://knoesis.org/node/2835
The CCKS conference had over 500 attendees; some photos: https://photos.app.goo.gl/5CdlfAX1uYwvgqsQ2
Semantic, Cognitive, and Perceptual Computing – three intertwined strands of ... - Amit Sheth
Keynote at Web Intelligence 2017: http://webintelligence2017.com/program/keynotes/
Video: https://youtu.be/EIbhcqakgvA Paper: http://knoesis.org/node/2698
Abstract: While Bill Gates, Stephen Hawking, Elon Musk, Peter Thiel, and others engage in OpenAI discussions of whether or not AI, robots, and machines will replace humans, proponents of human-centric computing continue to extend work in which humans and machines partner in contextualized and personalized processing of multimodal data to derive actionable information.
In this talk, we discuss how maturing towards the emerging paradigms of semantic computing (SC), cognitive computing (CC), and perceptual computing (PC) provides a continuum through which to exploit the ever-increasing and growing diversity of data that could enhance people’s daily lives. SC and CC sift through raw data to personalize it according to context and individual users, creating abstractions that move the data closer to what humans can readily understand and apply in decision-making. PC, which interacts with the surrounding environment to collect data that is relevant and useful in understanding the outside world, is characterized by interpretative and exploratory activities that are supported by the use of prior/background knowledge. Using the examples of personalized digital health and a smart city, we will demonstrate how this trio of computing paradigms forms complementary capabilities that will enable the development of the next generation of intelligent systems. For background: http://bit.ly/PCSComputing
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact... - Matthew Lease
Presented at the 31st ACM User Interface Software and Technology Symposium (UIST), 2018. Paper: https://www.ischool.utexas.edu/~ml/papers/nguyen-uist18.pdf
Smart Data for you and me: Personalized and Actionable Physical Cyber Social ... - Amit Sheth
Featured Keynote at Worldcomp'14, July 2014: http://www.world-academy-of-science.org/worldcomp14/ws/keynotes/keynote_sheth
Video of the talk at: http://youtu.be/2991W7OBLqU
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is human health, fitness, and well-being. Consider, for instance, understanding the reasons for and avoiding an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or Internet of Things around humans, on the humans, and inside/within the humans), public health signals (information coming from the healthcare system such as hospital admissions), and population health signals (such as Tweets by people related to asthma occurrences and allergens, Web services providing pollen and smog information, etc.). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!
In this talk, I will put forward the concept of Smart Data, which is realized by extracting value from Big Data to benefit not just large companies but each individual. If I am an asthma patient, for all the data relevant to me with the four V-challenges, what I care about is simply, “How is my current health, and what is the risk of having an asthma attack in my personal situation, especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP.
For harnessing volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss somewhat more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships, using them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city. I will present examples from a couple of these.
Philosophy of Big Data: Big Data, the Individual, and Society - Melanie Swan
Philosophical concepts elucidate the impact the Big Data Era (exabytes per year of scientific, governmental, corporate, and personal data being created) is having on our sense of ourselves as individuals in society, as information generators in constant dialogue with the pervasive information climate.
Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014 - StampedeCon
At StampedeCon 2014, Rob Peglar (EMC Isilon) presented "Big Data Past, Present and Future – Where are we Headed?"
Rob Peglar was one of the speakers at the very first StampedeCon. Following that talk two years ago, Rob will present an overview of and insight into the technologies and system approaches to computing, transport and storage of big data – where we’ve been, are now and are headed. There is a major ‘fork in the road’ upcoming in the treatment and business application of big data and the technology that surrounds it, one that is important enough to change the course of the methodologies and approaches used by large and small business alike, especially for the infrastructure required either on premise or in the cloud.
A key contemporary trend emerging in big data science is the Quantified Self (QS) - individuals engaged in the deliberate self-tracking of any kind of biological, physical, behavioral, or transactional information as n=1 individuals or in groups. This is giving rise to interesting pools of individual data, group data, and big data which can be interlinked to create a new era of highly-targeted value-specific consumer applications. There are significant opportunities in big data to develop models to support QS data collection, integration, analysis, and use for personal lifestyle and consumption management. There are also opportunities to provide leadership in designing consumer-friendly standards and etiquette regarding the use of personal and collective data. Next-generation QS big data applications and services could include tools for rendering QS data meaningful in behavior change, establishing baselines and variability in objective metrics, applying new kinds of pattern recognition techniques, and aggregating multiple self-tracking data streams from wearable electronics, biosensors, mobile phones, genomic data, and cloud-based services. Potential limitations regarding QS activity need to be considered including consumer non-adoption, data privacy and sharing concerns, the digital divide, ease-of-use, and social acceptance.
Talk given at Delft University speaker series on "Crowd Computing & Human-Centered AI" (https://www.academicfringe.org/). November 23, 2020. Covers two 2020 works:
(1) Anubrata Das, Brandon Dang, and Matthew Lease. Fast, Accurate, and Healthier: Interactive Blurring Helps Moderators Reduce Exposure to Harmful Content. In Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2020.
(2) Alexander Braylan and Matthew Lease. Modeling and Aggregation of Complex Annotations via Annotation Distances. In Proceedings of the Web Conference, pages 1807-1818, 2020.
Presented at SW2012 @ ISWC2012.
http://amitsheth.blogspot.com/2012/08/semantics-empowered-physical-cyber.html
This is an old version of this talk; for more recent information on this topic (e.g., talks, papers, events), see: http://wiki.knoesis.org/index.php/PCS
CODATA International Training Workshop in Big Data for Science for Researcher... - Johann van Wyk
Presentation at NeDICC Meeting on 16 July 2014. Feedback from CODATA International Training Workshop in Big Data for Science for Researchers from Emerging and Developing Countries, Beijing, China, 5-20 June 2014
The Philosophy of Big Data is the branch of philosophy concerned with the foundations, methods, and implications of big data; the definitions, meaning, conceptualization, knowledge possibilities, truth standards, and practices in situations involving very-large data sets that are big in volume, velocity, variety, veracity, and variability
BIG DATA | How to explain it & how to use it for your career? - Tuan Yang
If you ask people what BIG DATA is they often say it is about a lot of data. But the world has ALWAYS had a lot of data. It is about datafication – a word so new even spellcheck functions don’t know it is a real word!
Learn more about:
» How BIG DATA changes the career paths of even the most unsuspecting
» How BIG DATA changes the way business decisions are made
» How BIG DATA changes who makes those decisions & the reshuffle of the balance of power it causes
» What BIG DATA skills you can bring to the office tomorrow to increase your value to the firm
This tutorial presents tools and techniques for effectively utilizing the Internet of Things (IoT) for building advanced applications, including Physical-Cyber-Social (PCS) systems. The issues and challenges related to IoT, semantic data modelling, annotation, knowledge representation (e.g. modelling for constrained environments, complexity issues and time/location dependency of data), integration, analysis, and reasoning will be discussed. The tutorial will describe recent developments on creating annotation models and semantic description frameworks for IoT data (e.g. the W3C Semantic Sensor Network ontology). A review of enabling technologies and common scenarios for IoT applications from the data and knowledge engineering point of view will be discussed. Information processing, reasoning, and knowledge extraction, along with existing solutions related to these topics, will be presented. The tutorial summarizes state-of-the-art research and developments on PCS systems, IoT-related ontology development, linked data, domain knowledge integration and management, querying large-scale IoT data, and AI applications for automated knowledge extraction from real-world data. (A minimal annotation sketch follows the related links below.)
Related: Semantic Sensor Web: http://knoesis.org/projects/ssw
Physical-Cyber-Social Computing: http://wiki.knoesis.org/index.php/PCS
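As a concrete illustration of the kind of semantic annotation such a tutorial covers, here is a minimal sketch, assuming Python with the rdflib package and the W3C SOSA/SSN observation vocabulary; the sensor and observation names (roomSensor42, obs1, airTemperature) are invented for illustration:

from rdflib import Graph, Literal, Namespace, RDF

SOSA = Namespace("http://www.w3.org/ns/sosa/")   # SOSA/SSN observation vocabulary
EX = Namespace("http://example.org/iot/")        # hypothetical namespace for our own sensors

g = Graph()
g.bind("sosa", SOSA)
g.add((EX.obs1, RDF.type, SOSA.Observation))             # one sensor observation...
g.add((EX.obs1, SOSA.madeBySensor, EX.roomSensor42))     # ...made by a specific device
g.add((EX.obs1, SOSA.observedProperty, EX.airTemperature))
g.add((EX.obs1, SOSA.hasSimpleResult, Literal(21.5)))    # the measured value

print(g.serialize(format="turtle"))   # annotated observation, ready for integration and querying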
Using big data and implementing Hadoop is a trend that people jump to all too quickly. Instead, understanding the run-time complexity of one's algorithms, reducing said complexity, and managing the process from start to finish in a lean and agile way can yield massive cost savings - or save your organization.
Over the past ten years, data on the Internet has grown enormously, and we are the fuel for this increase. Business owners produce apps for us, and we feed these companies with our data; unfortunately, it is all our private data. In the end, through our private data, we become a commodity that is sold to the highest bidder.
Without security, there is not even privacy. Ethical oversight and constraints are needed to ensure an appropriate balance. This article will cover: the contents of big data, what it includes, how data is collected, and the process of involving it on the Internet. In addition, it discusses the analysis of data, methods of collecting it, and the factors behind ethical challenges. Furthermore, it covers the user's rights, which must be observed, and the privacy the user has.
Presented at the Panel on Sensor, Data, Analytics and Integration in Advanced Manufacturing, at the Connected Manufacturing track of the Bosch-USA-organized "Leveraging Public-Private Partnerships for Regional Growth Summit". Panel statement: Sensors, data and analytics are the core of any smart manufacturing system. What are the main challenges to create actionable outputs, replicate systems and scale efficiency gains across industries?
Moderator: Thomas Stiedl, Bosch
Panelists:
1. Amit Sheth, Wright State University
2. Howie Choset, Carnegie Mellon University
3. Nagi Gebraeel, Georgia Institute of Technology
4. Brian Anthony, Massachusetts Institute of Technology
5. Yarom Polsky, Oak Ridge National Laboratory
For an in-depth look:
Smart IoT: IoT as a human agent, human extension, and human complement
http://amitsheth.blogspot.com/2015/03/smart-iot-iot-as-human-agent-human.html
Semantic Gateway: http://knoesis.org/library/resource.php?id=2154
SSN Ontology: http://knoesis.org/library/resource.php?id=1659
Applications of Multimodal Physical (IoT), Cyber and Social Data for Reliable and Actionable Insights: http://knoesis.org/library/resource.php?id=2018
Smart Data: Transforming Big Data into Smart Data...: http://wiki.knoesis.org/index.php/Smart_Data
Historic use of the term Smart Data (2004): http://www.scribd.com/doc/186588820
Smart Data - How you and I will exploit Big Data for personalized digital hea... - Amit Sheth
Amit Sheth's keynote at IEEE BigData 2014, Oct 29, 2014.
Abstract from:
http://cci.drexel.edu/bigdata/bigdata2014/keynotespeech.htm
Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for and avoiding an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or Internet of Things around humans, on the humans, and inside/within the humans), public health signals (e.g., information coming from the healthcare system such as hospital admissions), and population health signals (such as Tweets by people related to asthma occurrences and allergens, Web services providing pollen and smog information). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!
In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, for all the data relevant to my child with the four V-challenges, what I care about is simply, “How is her current health, and what is the risk of having an asthma attack in her current situation (now and today), especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques similar to the close interworking of the top brain and the bottom brain in cognitive models.
For harnessing volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss somewhat more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships, using them to better understand new cues in the data that capture rapidly evolving events and situations.
Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city.
Quantified Self Ideology: Personal Data becomes Big DataMelanie Swan
A key contemporary trend emerging in big data science is the quantified self: individuals engaged in the deliberate self-tracking of any kind of biological, physical, behavioral, or transactional information, as n=1 individuals or in groups. The quantified self is one dimension of the bigger trend to integrate and apply a variety of personal information streams including big health data (genome, transcriptome, environmentome, diseasome), quantified self data streams (biosensor, fitness, sleep, food, mood, heart rate, glucose tracking, etc.), traditional data streams (personal and family health history, prescription history) and IOT (Internet of things) activity data streams (smart home, smart car, environmental sensors, community data). This talk looks at how personal data and group data are becoming big data as individuals and communities share, collaborate, and work with large personalized data sets using novel discovery methods such as anomaly detection and exception reporting, longitudinal baseline analysis, episodic triggers, and hierarchical machine learning.
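As a toy illustration of the "longitudinal baseline analysis" and "anomaly detection" methods mentioned above, the sketch below (plain Python/NumPy, with made-up resting heart-rate numbers) flags days that deviate strongly from an individual's own n=1 baseline:

import numpy as np

# Hypothetical daily resting heart rate (bpm) from a wearable, for one person.
resting_hr = np.array([61, 63, 60, 62, 64, 61, 59, 62, 63, 78, 61, 60])

baseline = resting_hr.mean()           # the individual's longitudinal baseline
spread = resting_hr.std()              # day-to-day variability
z_scores = (resting_hr - baseline) / spread

# Flag days more than 2 standard deviations above the personal baseline.
anomalous_days = np.where(z_scores > 2)[0]
print("baseline:", round(float(baseline), 1), "anomalous days:", anomalous_days)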
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ... - Amit Sheth
Keynote given at ICDE2014, April 2014. Details at: http://ieee-icde2014.eecs.northwestern.edu/keynotes.html
A video of a version of this talk is available here: http://youtu.be/8RhpFlfpJ-A
(download to see many hidden slides).
Two versions of this talk, targeted at Smart Energy and Personalized Digital Health domains/apps at: http://wiki.knoesis.org/index.php/Smart_Data
Previous (older) version replaced by this version: http://www.slideshare.net/apsheth/big-data-to-smart-data-keynote
This is a brief review of current multi-disciplinary and collaborative projects at Kno.e.sis led by Prof. Amit Sheth. They cover research in big social data, IoT, semantic web, semantic sensor web, health informatics, personalized digital health, social data for social good, smart city, crisis informatics, digital data for the Materials Genome Initiative, etc. Dec 2015 edition.
Data Science - An emerging Stream of Science with its Spreading Reach & Impact - Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An emerging Stream of Science with its Spreading Reach & Impact". I have compiled and collected different statistics and data from different sources. This may be useful for students and those who might be interested in this field of study.
Abstract:
Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.
Data Science Institutes: Kelly Technologies is the best Data Science training institute in Hyderabad, providing Data Science training by real-time faculty in Hyderabad.
Understand the Idea of Big Data in the Present Scenario - AI Publications
Big data analytics and deep learning are two of data science's most promising areas of convergence. The importance of Big Data has grown recently as several organizations, both public and commercial, have been amassing large amounts of region-specific data that may provide useful information on topics such as national information, advanced security, fraud detection, development, and health informatics. For Big Data Analytics, where data is often unstructured and unlabeled, Deep Learning's ability to analyze and learn from large amounts of data on its own is a crucial feature. In this review, we look at how Deep Learning can be used to solve some of the most pressing problems in Big Data Analytics, including model isolation from large data sets, semantic querying, data marking, smart data recovery, and the automation of discriminative tasks.
Is big data just a buzzword? - Big data simply explained - Vivek Srivastava
Big data helps us to uncover and discover those facets of data which we are not aware of. Using predictive science, it provides insights on which actions can be taken and suggests those actions which will impact the business significantly, boosting revenue or market reach. For example, using large amounts of data and appropriate tools, we can categorize different strata of the population and build customized products. So whether companies deploy it or not all depends on what factors constitute the value of the company and where the center of value creation lies. It may be money or it may be geographic reach. - Watch this video at https://www.youtube.com/watch?v=ELyOl0fkqNM
What should organizations be concerned about when using Machine Learning for Predictive Modeling techniques? Divergence Academy and Divergence.AI are leading efforts to bring Algorithmic Accountability awareness to the masses.
Notes from the Observation Deck // A Data Revolution - gngeorge
Notes from the Observation Deck will provide you with an examined look at the interesting phenomena and trends taking place around us today. We present them to you with the hope of sparking broader conversations, debates and ideas. Please use this as a resource for knowledge, inspiration and enjoyment.
How open data contributes to improving the world: the life science use case, and the technical, social, and ethical issues.
This was a talk given within the iGEM 2020 programme by the Imperial College London students group (https://2020.igem.org/Team:Imperial_College), in a webinar organised by the SOAPLab group on the topic of Ethics of Automation. The excellent Dr Brandon Sepulvado was the other speaker of the day.
It has been said that Mobile + Cloud + Social + Big Data = Better Run The World. IBM has invested over $20 billion since 2005 to grow its analytics business, and many companies will invest more than $120 billion by 2015 on analytics hardware, software and services, critical in almost every industry: healthcare, media, sports, finance, government, etc.
It has been estimated that there is a shortage of 140,000-190,000 people with deep analytical skills to fill the demand for jobs in the U.S. by 2018.
Decoding the human genome originally took 10 years to process; now it can be achieved in one week with the power of analytics and BI (Business Intelligence). This lecture's key message is that analytics provide a competitive edge to individuals, companies and institutions, and that analytics and BI are often critical to the success of any organization.
The methodology used is to teach analytic techniques through real-world examples and real data, with the goal of convincing the audience of the Analytics Edge and the power of BI, and inspiring them to use analytics and BI in their career and their life.
Similar to "2019 June 27 - Big data and data science"
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk at the Journées Nationales du GDR GPL 2024
Seminar on U.V. Spectroscopy - SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... - Wasswaderrick3
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity, and from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
Richard's adventures in two entangled wonderlands - Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... - Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4−0.9 µm) and novel JWST images with 14 filters spanning 0.8−5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at > 2.3 µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and 30.3−31.0 AB mag (5σ, r = 0.1" circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5−15. These objects show compact half-light radii of R_1/2 ∼ 50−200 pc, stellar masses of M⋆ ∼ 10^7−10^8 M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr^−1. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... - Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space, studied in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction, as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... - Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides a means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
What are greenhouse gases and how many gases are there that affect the Earth? - moosaasad1975
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how are the weather and the climate affected?
1. International School of Physics "Enrico Fermi", Workshop 205: Big Data Analytics, 27-29 June 2019, Villa Monastero, Varenna. Big Data and Data Science. Fabio Stella, University of Milano-Bicocca, Department of Informatics, Systems and Communication
2. The 100,000 Genomes Project
▪ complete sequencing of 70,000 individuals
▪ 21 PetaBytes of data
▪ 1 PetaByte of music would require about 2,000 years to be listened to
The complexity increases with digital data:
▪ EHR
▪ Life style
▪ Nutrition
▪ Job type
▪ Geographical
▪ Social networks
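The "2,000 years" figure on this slide is easy to sanity-check with back-of-the-envelope arithmetic; the sketch below (Python, assuming roughly 1 MB per minute of compressed audio, which is an assumption rather than a figure from the slides) reproduces the order of magnitude:

# Rough check of "1 PetaByte of music takes about 2,000 years to listen to".
MB_PER_MINUTE = 1.0           # assumed size of one minute of compressed audio
petabyte_in_mb = 1e9          # 1 PB = 10^9 MB (decimal units)

minutes = petabyte_in_mb / MB_PER_MINUTE
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years of continuous listening")   # roughly 1,900 years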
3. Smart Cities
▪ Automate
▪ Control
▪ See real-time data
▪ Home automation
▪ Environment monitoring
▪ Medical and healthcare
▪ Smart transportation
▪ Smart manufacturing
▪ Energy resource management
4. Large Hadron Collider
▪ 2017 June 29 at CERN: 200 PB of data permanently stored.
▪ 1 billion collisions per second happening at ATLAS, CMS, ALICE, and LHCb.
5. According to [1],
• Big Data and Data Science are being used as buzzwords and are composites of many concepts.
• Big data appears frequently in the press and academic journals.
• In the last five years: birth and growth of many data science programs in academia.
• In 2012, the White House Office of Science and Technology Policy announced the Big Data Research and Development Initiative, which builds upon federal initiatives ranging from
✓ computer architecture and networking technologies,
✓ algorithms,
✓ data management,
✓ artificial intelligence,
✓ machine learning,
✓ advanced cyber-infrastructure.
[1] Brady H. E., Annual Review of Political Science, 22 (2019) 297.
7. Volume: the amount of data to be stored and managed.
▪ It is probably the first thing people think of when they hear the phrase big data.
▪ We typically talk about big data in the case where ExaBytes or PetaBytes have to be stored and managed.
Big Data 5 V’s
8. Velocity: defines the speed of increase in big data volume and its relative accessibility.
▪ This dimension helps organizations to realize the relative growth of their big data and how quickly that data reaches sourcing users, applications and systems.
Big Data 5 V’s
9. Variety: we no longer have control over the data format, i.e. we must manage heterogeneous data with different formats, semantics and structures.
▪ In other words, we could be asked to combine
✓ pure text,
✓ photo,
✓ audio,
✓ video,
✓ web,
✓ GPS data,
✓ sensor data,
✓ relational databases,
✓ and documents
to answer a given question.
Big Data 5 V’s
10. Veracity: how accurate or truthful a data set may be.
▪ In the context of big data, however, it takes on a bit more meaning.
▪ When it comes to the accuracy of big data, it is not just the quality of the data itself but how trustworthy the data source, type, and processing of it are.
▪ Removing things like bias, abnormalities or inconsistencies, duplication, and volatility are just a few aspects that factor into improving the accuracy of big data.
Big Data 5 V’s
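A tiny, hypothetical illustration of the clean-up steps this slide lists (duplicates, inconsistencies, missing values), sketched with pandas; the column names and values are invented purely for illustration:

import pandas as pd

# Raw records with a duplicate row, an impossible value, and a missing entry.
raw = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "age":        [34, 34, -5, 51],       # -5 is an obvious inconsistency
    "peak_flow":  [410, 410, 380, None],  # None is a missing measurement
})

clean = (raw.drop_duplicates()               # remove exact duplicate records
            .query("age >= 0")               # drop rows with inconsistent ages
            .dropna(subset=["peak_flow"]))   # drop rows missing the key measurement
print(clean)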
11. Value: the worth of the data being extracted.
▪ Having endless amounts of data is one thing, but unless it can be turned into value it is useless.
▪ While there is a clear link between data and insights, this does not always mean there is value in Big Data.
▪ The most important part of embarking on a big data initiative is to understand the costs and benefits of collecting and analyzing the data to ensure that ultimately the data that is reaped can be monetized.
Big Data 5 V’s
12. Artificial Intelligence: intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans.
▪ Colloquially, the term Artificial Intelligence (AI) is used to describe machines/computers that mimic "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving".
▪ Two kinds of AI:
✓ Weak
✓ Strong
13. Artificial Intelligence
Bayesian Networks
▪ A type of statistical model
that represents a set of
variables and their
conditional dependencies
via a directed acyclic
graph (DAG).
▪ Bayesian networks are
ideal for taking an event
that occurred and
predicting the likelihood
that any one of several
possible known causes
was the contributing
factor.
▪ Structural Causal Models.
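A minimal sketch of the "reason backwards from an observed event to its likely causes" use of a Bayesian network, using the classic sprinkler/rain example with made-up probabilities (not taken from the slides):
```python
# Tiny Bayesian network (illustrative numbers): Rain -> WetGrass <- Sprinkler.
# Given the event "grass is wet", which known cause is more likely?
P_rain = 0.2
P_sprinkler = 0.4
# P(WetGrass = True | Rain, Sprinkler)
P_wet = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.85, (False, False): 0.05}

def joint(rain, sprinkler, wet=True):
    # joint probability P(rain, sprinkler, wet) under the DAG above
    p = (P_rain if rain else 1 - P_rain) * (P_sprinkler if sprinkler else 1 - P_sprinkler)
    return p * (P_wet[(rain, sprinkler)] if wet else 1 - P_wet[(rain, sprinkler)])

P_wet_true = sum(joint(r, s) for r in (True, False) for s in (True, False))
print("P(Rain | wet)      =", sum(joint(True, s) for s in (True, False)) / P_wet_true)
print("P(Sprinkler | wet) =", sum(joint(r, True) for r in (True, False)) / P_wet_true)
```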
14. Artificial Intelligence
Knowledge Graphs
▪ A knowledge base used
by Google and its
services to enhance its
search engine's results
with information gathered
from a variety of sources.
▪ The information is
presented to users in an
infobox.
15. Machine Learning;
algorithms and statistical
models that computer
systems use in order to
perform a specific task
effectively without using
explicit instructions, relying
on patterns and inference
instead.
Three kinds of ML;
▪ Supervised
▪ Self-Supervised
▪ Reinforcement Learning
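As a concrete illustration of the supervised case, a minimal sketch with scikit-learn; the library and dataset are my choices for illustration, not the speaker's:
```python
# Supervised learning: fit a model to labelled examples instead of writing explicit rules.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```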
19. Machine Learning
Reinforcement
Learning
▪ Learn by interacting with the
environment
▪ The environment reacts to
our decisions/actions
▪ Sequential learning: only at
the end of the game do we
know our performance
(reward/punishment)
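A minimal sketch of learning by interacting with an environment that rewards us only at the end: tabular Q-learning on a toy corridor (the environment and all constants are illustrative assumptions, not from the slides):
```python
# Toy problem: a 1-D corridor with states 0..4; the agent is rewarded only upon reaching state 4.
import random

n_states, actions = 5, (-1, +1)               # -1 = step left, +1 = step right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def pick_action(s):
    if random.random() < epsilon:             # explore
        return random.choice(actions)
    best = max(Q[(s, a)] for a in actions)    # exploit, breaking ties at random
    return random.choice([a for a in actions if Q[(s, a)] == best])

for _ in range(500):                          # episodes
    s = 0
    for _ in range(100):                      # step limit per episode
        a = pick_action(s)
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next
        if s == n_states - 1:
            break

print([max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)])  # learned policy: all +1
```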
21. Deep Learning; is part
of a broader family of
machine learning methods
based on Artificial Neural
Networks.
Three kinds of DL;
▪ Supervised
▪ Self-Supervised
▪ Reinforcement Learning
22. Deep Learning
Feedforward Neural
Networks
▪ The first and simplest type
of artificial neural network
devised.
▪ The information moves in
only one direction,
forward, from the input
nodes, through the hidden
nodes (if any) and to the
output nodes.
▪ There are no cycles or
loops in the network.
[Figure: a feedforward network mapping an 8×8-pixel image (inputs X1 … X64) to ten outputs Y0 … Y9]
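A sketch matching the slide's figure, an 8×8-pixel digit classified by a feedforward network with 64 inputs and 10 outputs; scikit-learn's digits dataset and MLPClassifier are used here as stand-ins:
```python
# Feedforward network: information flows only forward, from 64 pixel inputs to 10 digit outputs.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)            # 8x8 images flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```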
23. Deep Learning
Convolutional
Networks
▪ Regularized versions of
multilayer perceptrons
which are fully connected
and thus prone to
overfitting the data.
▪ The usual form of
regularization adds some
measure of the magnitude of
the weights to the loss
function.
▪ Convolutional networks take
a different approach to
regularization: they exploit
the hierarchical pattern in data
and assemble more complex
patterns from smaller and
simpler patterns.
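A sketch of the two regularization strategies the slide contrasts, written with TensorFlow/Keras; the 28×28 input shape and the layer sizes are illustrative assumptions:
```python
# (a) fully connected net with an L2 penalty on its weights vs
# (b) a convolutional net, whose small shared filters assemble complex patterns from simple ones.
import tensorflow as tf

dense_net = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

conv_net = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
print(dense_net.count_params(), "parameters vs", conv_net.count_params())
```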
24. Deep Learning
LSTM Networks
▪ An artificial Recurrent
Neural Network
architecture.
▪ Unlike standard
feedforward neural
networks, LSTM has
feedback connections that
make it a "general
purpose computer" (it can
compute anything that a
Turing machine can).
▪ LSTM started to
revolutionize speech
recognition, outperforming
traditional models in
certain speech
applications.
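A minimal Keras sketch of an LSTM-based sequence classifier, in the spirit of the speech-recognition remark; the 100×13 input shape (e.g. 100 time frames of 13 acoustic features) and the 10 output classes are illustrative assumptions:
```python
# Recurrent model: an LSTM layer reads a sequence of feature vectors, a dense layer classifies it.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 13)),                    # 100 time steps, 13 features per step
    tf.keras.layers.LSTM(64),                           # recurrent layer with feedback connections
    tf.keras.layers.Dense(10, activation="softmax"),    # e.g. 10 word/phoneme classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```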
25. “The vision systems of the eagle and the snake outperform
everything that we can make in the laboratory, but snakes
and eagles cannot build an eyeglass or a telescope or a
microscope."
— Judea Pearl
27. Data Science
▪ Applied Mathematics
▪ Computer Science
▪ Neuroscience
▪ Engineering
▪ Statistics
▪ Biology
▪ Physics
Data science is the application of computational and statistical
techniques to address or gain insight into some problem in the
real world (J. Zico Kolter, Carnegie Mellon University, 2018)
Data science is an extraordinary multidisciplinary and
interdisciplinary challenge to apply the scientific method
empowered by
• data,
• models,
• computational resources,
• open source software, and
… the most sophisticated technology we have today, i.e. the
human being.
28. Causal Inference and the
data-fusion problem
Fabio Stella
University of Milan-Bicocca
Department of Informatics, Systems and Communication
International School of Physics
“Enrico Fermi”
Workshop 205
Big Data Analytics
27-29 June 2019
Villa Monastero, Varenna
29. What can truly
be achieved?
▪ No matter how much
data you collect, the
question cannot be
answered using the
data alone
Does Exercise affect
Cholesterol?
(a small simulation of the paradox follows this slide)
Simpson’s Paradox
[Causal diagram relating Age, Exercise, and Cholesterol]
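A small synthetic simulation of the paradox; all coefficients, and the assumption that Age drives both Exercise and Cholesterol, are invented purely for illustration:
```python
# Within every age group exercise lowers cholesterol, yet pooling the groups
# reverses the sign, because Age drives both Exercise and Cholesterol.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 80, n)                          # confounder
exercise = 0.2 * age + rng.normal(0, 5, n)            # toy assumption: older people exercise more
cholesterol = 3.0 * age - 2.0 * exercise + rng.normal(0, 5, n)

pooled = np.polyfit(exercise, cholesterol, 1)[0]      # slope ignoring age
young = age < 40
stratified = np.polyfit(exercise[young], cholesterol[young], 1)[0]
print(f"pooled slope: {pooled:+.2f}   (exercise looks harmful)")
print(f"slope within age<40: {stratified:+.2f}   (exercise is actually protective)")
```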
30. Big Data and the
data-fusion problem
▪ Piecing together multiple
datasets collected under
heterogeneous conditions
(i.e., different populations,
regimes, and sampling
methods) to obtain valid
answers to queries of
interest.
▪ The availability of multiple
heterogeneous datasets
presents new
opportunities to big data
analysts, because the
knowledge that can be
acquired from combined
data would not be
possible from any
individual source alone.
32. Seeing; we are looking for
regularities in observations.
It is characterized by the question “What if I see …?”
and calls for predictions based on passive observations.
For instance, imagine a marketing director at a
department store who asks,
“How likely is a customer who bought toothpaste to
also buy dental floss?”
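At this first rung, answering the marketing question only requires estimating a conditional probability from passively observed data; the transactions below are invented for illustration:
```python
# Rung one ("seeing"): estimate P(floss | toothpaste) from observed purchases.
transactions = [
    {"toothpaste", "floss"}, {"toothpaste"}, {"floss", "soap"},
    {"toothpaste", "floss", "soap"}, {"toothpaste"}, {"soap"},
]
bought_tp = [t for t in transactions if "toothpaste" in t]
p_floss_given_tp = sum("floss" in t for t in bought_tp) / len(bought_tp)
print(f"P(floss | toothpaste) = {p_floss_given_tp:.2f}")   # 2 of 4 -> 0.50
```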
33. “What if do …?” & “How?”
We step up to the next level of causal queries
when we begin to change the world. A typical
question for this level is
“What will happen to our floss sales if we double
the price of toothpaste?”
This already calls for a new
kind of knowledge, absent
from the data, which we find at
rung two of the Ladder of
Causation, Intervention.
Intervention; ranks
higher than association
because it involves not just
seeing but changing what is.
Many scientists have been quite traumatized to learn
that none of the methods they learned in statistics is
sufficient even to articulate, let alone answer, a simple
question like
“What happens if we double the price?”
34. “What if I had done …?” & “Why?”
Counterfactuals; ranks
higher than intervention
because it involves
imagining, retrospection
and understanding.
We might wonder: my
headache is gone now, but
• Why?
• Was it the aspirin I took?
• The food I ate?
• The good news I heard?
These queries take us to the top rung of the Ladder of
Causation, the level of Counterfactuals, because to
answer them we must go back in time, change history,
and ask,
“What would have happened if I had not taken the
aspirin?”
No experiment in the world can deny treatment to an
already treated person and compare the two outcomes,
so we must import a whole new kind of knowledge.
35. Cause
A variable X is a cause of a
variable Y if Y “listens” to X
and decides its value in
response to what it hears.
▪ Direct cause
▪ Indirect cause
In the example graph, Z1 is a direct cause of W1 and Z3,
and an indirect cause of X, Y, and W3.
36. Causal Graph
Exogenous variables;
background conditions for
which no explanatory
mechanism is encoded in
model M.
Every instantiation U=u of the
exogenous variables uniquely
determines the values of all
variables in V and, hence, if
we assign a probability P(u)
to U, it induces a probability
function P(v) on V.
The vector U=u can also be
interpreted as an
experimental “unit” that can
stand for an individual
subject, time of day, …
Exogenous Variables
U = {Z1, Z2}
39. A structural model M consists of two sets of variables U and V, and a set
F of functions that determine or simulate how values are assigned to each
variable Vi ∈V.
Thus, for example, the equation
vi = fi(u, v)
describes a physical process by which variable Vi is assigned the value
vi = fi(u, v) in response to the current values, u and v, of the variables in
U and V.
Formally, the triplet <U,V,F> defines a Structural Causal Model, and the
diagram that captures the relationships among the variables is called the
Causal Graph G (of M).
Cause
A variable X is a cause of a
variable Y if Y “listens” to X
and decides its value in
response to what it hears.
▪ Direct cause
▪ Indirect cause
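A tiny structural causal model written out as code, with exogenous U, endogenous V, and functions F; the particular three-variable graph and its coefficients are illustrative assumptions, not the slide's:
```python
# SCM <U, V, F>: exogenous U = (Uz, Ux, Uy); endogenous V = (Z, X, Y) with Z -> X -> Y and Z -> Y.
import random

def sample(do_x=None):
    u_z, u_x, u_y = (random.gauss(0, 1) for _ in range(3))   # exogenous "background" terms
    z = u_z                                                   # f_Z
    x = 0.8 * z + u_x if do_x is None else do_x               # f_X, or the intervention do(X = x)
    y = 2.0 * x - 1.0 * z + u_y                               # f_Y
    return z, x, y

# Every draw of U = u uniquely determines all variables in V, inducing P(v).
print(sample())           # one observational sample
print(sample(do_x=1.0))   # one sample under the intervention do(X = 1)
```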
41. Gender
Discrimination?
You want to know whether
and to what degree a
company discriminates by
Gender in its Hiring
practices.
Such discrimination would
constitute a direct effect of
Gender on Hiring, which is
illegal in many cases.
However, women are more or
less likely to go into a
particular field than men, or to
have achieved advanced
degrees in that field.
So Gender may also have an
indirect effect on Hiring
through the mediating
variable of Qualification.
Qualification is a mediating variable
42. Gender
Discrimination?
In order to find the direct
effect of Gender on Hiring,
we need to somehow hold
Qualification steady, and
measure the remaining
relationship between Gender
and Hiring;
with Qualification
unchanging, any change in
Hiring would have to be due
to Gender alone.
P(hired | female, hq) < P(hired | male, hq)
where hq stands for “highly qualified”
43. The Confounder
What happens if there are
confounders of the
mediating variable and the
outcome variable?
For instance, Income:
People from higher income
backgrounds are more likely
to have gone to college and
more likely to have
connections that would help
them get hired.
44. The Confounder
Now, if we condition on
Qualification, we are
conditioning on a Collider.
45. The Confounder
Now, if we condition on
Qualification, we are
conditioning on a Collider.
Gender and Income
communicate and thus it is
no longer possible to judge
the effect of Gender on
Hiring alone.
Income is a confounder of Qualification and Hiring
46. The Confounder
However, if we don’t
condition on Qualification,
indirect dependence can
pass from Gender to Hiring
through the path to the right.
No matter how you look at it,
we’re not getting the true
direct effect of Gender on
Hiring.
47. The Confounder
No matter how you look at it,
we’re not getting the true
direct effect of Gender on
Hiring.
Traditionally, therefore,
statistics has had to abandon
a huge class of potential
mediation problems, where
the concept of “direct effect”
could not be defined, let
alone estimated.
48. The “do” operator
P(hired | do(female), do(hq)) < P(hired | do(male), do(hq))
Luckily, we now have a
conceptual way of holding
the mediating variable steady
without conditioning on it:
We can intervene on it.
If, instead of conditioning, we
fix the qualifications, the
arrow between gender and
qualifications (and the one
between income and
qualifications) disappears,
and no spurious dependence
can pass through it.
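A simulation sketch of the contrast between conditioning on Qualification and intervening on it, with a known direct effect built in; all structural equations and numbers are invented for illustration:
```python
# The true direct effect of Gender on Hiring is set to -0.3. Conditioning on Qualification
# (a collider for Gender and Income) gives a biased answer; intervening on it recovers the truth.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
income = rng.normal(0, 1, n)                     # confounder of Qualification and Hiring
gender = rng.integers(0, 2, n)                   # 1 = female (toy encoding)

def qualification(g, i, do=None):
    return np.full(n, do) if do is not None else i + 0.5 * g + rng.normal(0, 0.5, n)

def hiring(g, i, q):
    return q + 0.8 * i - 0.3 * g + rng.normal(0, 0.5, n)   # -0.3 = true direct effect

# (a) condition on Qualification: regress Hiring on Gender and Qualification
q_obs = qualification(gender, income)
h_obs = hiring(gender, income, q_obs)
X = np.column_stack([np.ones(n), gender, q_obs])
beta = np.linalg.lstsq(X, h_obs, rcond=None)[0]
print(f"conditioning on qualification: {beta[1]:+.2f}   (biased)")

# (b) intervene: fix Qualification for everyone with do(Q = 1) and compare genders
q_do = qualification(gender, income, do=1.0)
h_do = hiring(gender, income, q_do)
print(f"intervening with do(q):        {h_do[gender == 1].mean() - h_do[gender == 0].mean():+.2f}")
```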
50. Conclusions
▪ Big Data offers an extraordinary opportunity to understand how things
work; this holds in almost any discipline and application domain.
▪ Causality will play a central role in Big Data; one of the most
challenging problems is data fusion (how to combine heterogeneous
data collected from different sources?).
▪ Data Science can play a major role in the coming revolution: “The
revolution hasn't happened yet.” (Michael Jordan).
▪ More interdisciplinary and multidisciplinary efforts are needed.
▪ We need to think about possible alternative models for education,
at both basic and advanced levels.
International School of Physics
“Enrico Fermi”
Workshop 205
Big Data Analytics
27-29 June 2019
Villa Monastero, Varenna