Present: Our lives, as well as any field of business and society, are continuously transformed by our ability to collect meaningful data in a systematic fashion and turn that into value. We are increasingly more connected to data sources, have unprecedented distributed infrastructure capabilities and continuously improve our scientific and analytical capabilities. A new interest in an evolved field of data science has emerged as a response to the push from these advances.
Potential: The state of the art and present challenges come with many opportunities. They not only push for new and innovative capabilities in composable data management and analytical methods that can run anytime, anywhere but also require methods to bridge the gap between applications and such capabilities. However, we often lack collaborative culture and effective methodologies to translate these newest advances into impactful solution architectures that can transform science, society, and education.
Future: A Collaborative Networked World as a Part of the Data Science Process: Any solution architecture for data science today depends on the effectivity of a multi-disciplinary data science team, not only with humans but also with analytical systems and infrastructure which are inter-related parts of the solution. Focusing on collaboration and communication between people, and dynamic, predictable and programmable interfaces to systems and scalable infrastructure from the beginning of any activity is critical. This talk will provide an overview of some of our recent work on networked application architectures for dynamic data-driven wildfire modeling and smart cities. It will also explain how focusing on (1) some P’s in the planning phases of a data science activity and (2) creating a measurable process that spans multiple perspectives and success metrics was effective in making these solutions scalable. Lastly, it will introduce the PPODS methodology and family of composable tools for a team-based data science process management and training.
The document discusses the rise of big data and cloud computing as driving forces of the future economy. It notes that an ever-growing amount of data is being generated from transactions and sensors, and this data is stored and analyzed in large cloud infrastructures. Cloud computing provides analytics capabilities to transform data into information and insights. This data revolution is creating many new jobs in data science and transforming industries and the research model.
Keynote talk at the 2021 Australasian Conference on AI. A summary of Australia's global standing in AI, a bit of history, and where Australian AI is going next
The document summarizes the scope and application of public "big data". It discusses the value chain of data to information to knowledge to action. It describes shifts in governments from process-focused to data-driven approaches. Examples of public health and transportation data are provided. Overall, the document outlines the growing importance of data as a public resource and infrastructure for decision making.
This document provides an overview of a lecture on big data analytics given by Dr. Ching-Yung Lin. The key points covered in the lecture include:
- Definitions and characteristics of big data based on the 3V's of volume, velocity and variety.
- Techniques used for big data such as massive parallelism, distributed storage and processing, machine learning and data visualization.
- Factors that have enabled big data to become prominent in recent years like greater data collection, open source software and commodity hardware.
- Examples of big data platforms, databases and analytics techniques including Hadoop, Spark, NoSQL databases and graph databases.
- The large and growing market for big data
Top 5 Deep Learning and AI Stories - August 31, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: Microsoft Azure now supports NVIDIA GPU Cloud for AI/HPC workloads, Pinterest uses AI to enhance its recommendations system, Johns Hopkins researchers use deep learning to combat pancreatic cancer, MIT researchers train neural networks with music videos to separate sounds from each other, and AI bots are now designing chairs (and they're surprisingly good).
Celebrating and Supporting the Medical Imaging CommunityNVIDIA
This year’s MICCAI conference had record-breaking attendance. If you missed it, view this SlideShare to catch up on all the highlights and NVIDIA news.
Presentation of the research activities of IMU (Information Management Unit) a multi-disciplinary research lab of the Institute of Communication and Computer Systems (ICCS) at the National Technical University of Athens, Greece.
See http://imu.iccs.gr
Top 5 Deep Learning and AI Stories - September 28, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: Automakers look to virtual training to simulate billions of miles in driving, five Gordon Bell prize finalists leveraged Summit, the world's fastest supercomputer, Toronto celebrates NVIDIA's new Toronto AI lab and Canada's top researchers, scientists turn to simulated health data to train AI and preserve patient privacy, and two researchers leverage deep learning to create new levels for DOOM.
The document discusses the rise of big data and cloud computing as driving forces of the future economy. It notes that an ever-growing amount of data is being generated from transactions and sensors, and this data is stored and analyzed in large cloud infrastructures. Cloud computing provides analytics capabilities to transform data into information and insights. This data revolution is creating many new jobs in data science and transforming industries and the research model.
Keynote talk at the 2021 Australasian Conference on AI. A summary of Australia's global standing in AI, a bit of history, and where Australian AI is going next
The document summarizes the scope and application of public "big data". It discusses the value chain of data to information to knowledge to action. It describes shifts in governments from process-focused to data-driven approaches. Examples of public health and transportation data are provided. Overall, the document outlines the growing importance of data as a public resource and infrastructure for decision making.
This document provides an overview of a lecture on big data analytics given by Dr. Ching-Yung Lin. The key points covered in the lecture include:
- Definitions and characteristics of big data based on the 3V's of volume, velocity and variety.
- Techniques used for big data such as massive parallelism, distributed storage and processing, machine learning and data visualization.
- Factors that have enabled big data to become prominent in recent years like greater data collection, open source software and commodity hardware.
- Examples of big data platforms, databases and analytics techniques including Hadoop, Spark, NoSQL databases and graph databases.
- The large and growing market for big data
Top 5 Deep Learning and AI Stories - August 31, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: Microsoft Azure now supports NVIDIA GPU Cloud for AI/HPC workloads, Pinterest uses AI to enhance its recommendations system, Johns Hopkins researchers use deep learning to combat pancreatic cancer, MIT researchers train neural networks with music videos to separate sounds from each other, and AI bots are now designing chairs (and they're surprisingly good).
Celebrating and Supporting the Medical Imaging CommunityNVIDIA
This year’s MICCAI conference had record-breaking attendance. If you missed it, view this SlideShare to catch up on all the highlights and NVIDIA news.
Presentation of the research activities of IMU (Information Management Unit) a multi-disciplinary research lab of the Institute of Communication and Computer Systems (ICCS) at the National Technical University of Athens, Greece.
See http://imu.iccs.gr
Top 5 Deep Learning and AI Stories - September 28, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: Automakers look to virtual training to simulate billions of miles in driving, five Gordon Bell prize finalists leveraged Summit, the world's fastest supercomputer, Toronto celebrates NVIDIA's new Toronto AI lab and Canada's top researchers, scientists turn to simulated health data to train AI and preserve patient privacy, and two researchers leverage deep learning to create new levels for DOOM.
Ryan Goode is a U.S. citizen living in South Holland, Illinois who received a Bachelor of Science in Engineering Physics from Chicago State University in May 2015. He has extensive experience in mechanical systems, hydraulics, engines, and other technical areas. His research experience includes work at Fermilab, CERN, and Chicago State University. He has presented his research at several conferences.
In this presentation, Wes Eldridge will provide a general overview on data science. The talk will cover a variety of topics, Wes will start with the dirty history of the field which will help add context. After learning about the history of data and data science Wes will discuss the common roles a data scientist holds in business and organizations. Next, he will talk about how to use data in your organization and products. Finally, he'll cover some tools to help you get started in data science. After the presentation, Wes will stick around for Q/A and data discussion.
The document provides an introduction and agenda for a course on big data and data science. It defines big data as large, complex data sets that are difficult to process using traditional data processing applications. It notes that 90% of data in the world today was created in the last two years alone. It also defines the four V's of big data: volume, variety, velocity, and veracity. The document defines data science as an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It notes that data scientists work with hypothesis generation, data analysis, and data visualization to gather insights that inform decisions. The document outlines some of the day-to-day responsibilities
Grid Computing in a Commodity World (KCCMG, 2005)Lorin Olsen
The document discusses the evolution of grid computing from early introductions to definitions and examples of different types of grids. It describes computational grids which focus on computing power, scavenging grids which utilize unused desktop resources, and data grids which allow sharing and access to data across organizations. Specific examples of computational grids include those used for grand challenge problems in fields like fluid dynamics, environmental modeling, and molecular biology.
My presentation on Data Mining, Lessons from Competitions, and Public Data looks at the Data Mining/Data Science/Big Data evolution, reviews lessons from KDD Cup 1997, Netflix Prize, and Kaggle, presents a big list of Public and Government data APIs, Marketplaces, Portals, and Platforms, and examines Big Data Hype. This talk was given at BPDM-2013, (Broadening Participation in Data Mining), Aug 10, 2013 held at KDD-2013, Chicago.
Top 5 Deep Learning and AI Stories - November 30, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: 75 healthcare companies partner with NVIDIA to power the future of radiology, NeurIPS conference showcases the latest in AI research, NVIDIA's new research lab pushes machine learning boundaries, Israeli AI startup restores speech abilities to stroke victims and others with impaired language, and radiologists can detect anomalies in medical images with deep learning.
Advancing Medical Imaging with Deep LearningNVIDIA
The inaugural MIDL conference in Amsterdam brought together nearly 300 deep learning researchers, clinicians, and healthcare companies. There were 4 keynote speakers from industry leaders, 21 oral presentations, and over 60 poster presentations. Awards were given for the best paper and poster, which focused on medical image analysis using deep learning. The conference aimed to advance the application of deep learning in medical imaging and discuss collaboration across industry, academia, and clinicians.
This document discusses analytics education in the era of big data. It begins with an overview of different terms used such as analytics, data mining, data science, and knowledge discovery. It then discusses trends in big data including the 3 V's of volume, velocity, and variety. It notes that skills and jobs in analytics are in high demand but there is a shortage of people with deep analytical skills. The document provides an overview of analytics education including various certificate programs and online courses available. It emphasizes that analytics education works best when combined with learning by doing through competitions and hands-on projects.
This document discusses the key factors that contributed to the recent boom in deep learning. It identifies better neural network algorithms/techniques, large datasets, massive parallelization using GPUs, and industry investment as major enabling factors. In particular, it highlights how the availability of large, labeled datasets like ImageNet; developments in CNNs, autoencoders, and other neural network architectures; the use of GPUs to enable efficient parallel training; and large-scale research at tech companies like Google were central to recent advances in deep learning.
Top 5 Deep Learning and AI Stories - November 3, 2017NVIDIA
The document discusses insights into deep learning and artificial intelligence. It provides the top 5 headlines: 1) Pentagon official discusses how AI and machine learning will revolutionize the US intelligence community. 2) Startup is working on an AI system to detect lung cancer earlier from chest X-rays to save lives. 3) NVIDIA's GPU Cloud gives developers access to optimized deep learning tools in the cloud. 4) Non-profit AI4ALL partners with NVIDIA to increase students' access to AI resources and careers. 5) NVIDIA expands its Deep Learning Institute to address the growing need for AI experts.
Accumulo Summit 2014: Addressing big data challenges through innovative archi...Accumulo Summit
The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. The Massachusetts Institute of Technology, Lincoln Laboratory (MIT LL) is not immune to these challenges and has developed a set of tools that address many of these challenges.
Big data volume stresses the storage, memory, and compute capacity of a computing system and requires access to a computing cloud. Choosing the right cloud is problem specific. Currently, there are four multi-billion dollar ecosystems that dominate the cloud computing environment: enterprise clouds, big data clouds, SQL database clouds, and supercomputing clouds. Each cloud ecosystem has its own hardware, software, conferences, and business markets. The broad nature of business big data challenges make it unlikely that one cloud ecosystem can meet its needs and solutions are likely to require the tools and techniques from more than one cloud ecosystem. The MIT Supercloud was developed to address this challenge. To our knowledge, the MIT SuperCloud is the only deployed cloud system that allows all four ecosystems to co-exist without sacrificing performance or functionality.
The velocity of big data velocity stresses the rate at which data can be absorbed and meaningful answers produced. Led by the NSA, a Common Big Data Architecture (CBDA) was developed for the U.S. government based on the Google Big Table NoSQL approach and is now in wide use. MIT/LL played a leading role in developing the CBDA and is a leader in adapting the CBDA to a variety of big data challenges.
Big data variety may present the largest challenge and greatest opportunities. The promise of big data is the ability to correlate diverse and heterogeneous data to form new insights. The centerpiece of the CBDA is the NSA developed Apache Accumulo database (capable of millions of entries/second) and the MIT/LL developed D4M schema. These technologies allow vast quantities of highly diverse data (text, computer logs, and social media data, etc.) to be automatically ingested into a common schema that enables rapid query and correlation of every element.
The talk will concentrate on how we utilize the aforementioned technologies in our mission to apply advanced technology to problems of national security.
This document discusses machine learning and topological data analysis using Ayasdi software. It provides an introduction to Ayasdi and topological data analysis, and explains how Ayasdi uses machine learning techniques like neural networks and random forests along with topological data analysis to better understand, segment, and act on data. Specifically, Ayasdi provides clearer segmentation of data and better identification of errors compared to traditional machine learning approaches. The goal is to help people leverage their data more effectively.
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...Hong-Linh Truong
This is a lecture from the advanced service engineering course from the Vienna University of Technology. See http://dsg.tuwien.ac.at/teaching/courses/ase/
Top 5 Deep Learning and AI Stories - September 14, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: NVIDIA’s Clara Smartens up medical instruments, Fujifilm and NVIDIA bring radiology AI to Japan, Cisco boosts its deep learning capabilities, "I am AI" docuseries episode 8: Taking AI to new heights and How a Stanford PhD student is using deep learning to create “dank memes”.
Transforming Operations Using the Results of the Tech WaveDavid Blankinship
Brian Collins presented at the CalChiefs Fire Operations Technology Summit hosted by Esri. Attendees learned the history of how chiefs have transformed operations by using the results of the #TechWave
Are you interested in hearing more about his presentation? Email: marie.marshall@intterragroup.com
Transforming Healthcare at GTC Silicon ValleyNVIDIA
The GPU Technology Conference (GTC) brings together the leading minds in AI and healthcare that are driving advances in the industry - from top radiology departments and medical research institutions to the hottest startups from around the world. Can't miss panels and trainings at GTC Silicon Valley
Towards a better measure of business proximity: Topic modeling for industry i...Gene Moo Lee
The document presents a new approach for measuring business proximity between firms using topic modeling. It aims to overcome limitations of existing approaches by developing a data-driven, scalable method that provides finer-grained analysis with limited data requirements. The approach applies latent Dirichlet allocation to uncover topics from company descriptions in the CrunchBase dataset. Business proximity is then measured as the cosine similarity between the topic distributions of firm pairs. The method is shown to outperform a baseline of using common industry membership and provides a validated measure of firms' technological and business relatedness.
This document discusses the application of artificial intelligence and machine learning in petroleum engineering. It provides an overview of topics including:
- Petroleum data analytics which uses data as the foundation for analysis and modeling rather than assumptions.
- How artificial intelligence and machine learning can be used to model complex physical phenomena by learning from data rather than using mathematical equations.
- The importance of domain expertise when applying AI/ML to solve engineering problems compared to non-engineering problems.
- The differences between traditional statistical analysis and AI/ML, with the latter discovering patterns in data inductively rather than deductively fitting data to predetermined models.
- The importance of explainable artificial intelligence (XAI) for petroleum
The document discusses challenges and opportunities related to big data and high performance computing. It notes that computational power is increasing exponentially according to Moore's Law, but clock speeds have plateaued forcing a shift to multi-core processors. This is driving the need for parallel programming and new software approaches. Big data is also growing dramatically from various sources, such as sensors and social media. Analyzing this large, heterogeneous data requires new techniques in data mining, machine learning, and visualization. High performance computing and citizen science initiatives can help extract insights from big data to address important problems in health, environment, and other domains.
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Ilkay Altintas, Ph.D.
The new era of data science is here. Our lives and society are continuously transformed by our ability to collect data in a systematic fashion and turn that into value. The opportunities created by this change also comes with challenges that push for new and innovative data management and analytical methods as well as translating these new methods to applications in many areas that impact science, society, and education. Collaboration and ability of multi-disciplinary teams to work together and communicate to bring together the best of their knowledge in business, data and computing is vital for impactful solutions. This talk will discusses a reference ecosystem and question-driven methodology, called PPODS, to make impactful data science applications in many fields with specific examples in hazards, smart cities and biomedical research.
From AirBox to Smart City: where are we and what's next?Ling-Jyh Chen
The document discusses the AirBox project, which aims to monitor PM2.5 levels through participatory citizen sensing. It describes how over 1,600 AirBox devices have been deployed across 24 countries to measure air quality. The data is openly available through APIs and dashboards. The project also focuses on education and community engagement around air pollution issues. Applications of the large sensor network data include tracking emission sources, anomaly detection, and informing government policymaking to help make cities smarter.
Ryan Goode is a U.S. citizen living in South Holland, Illinois who received a Bachelor of Science in Engineering Physics from Chicago State University in May 2015. He has extensive experience in mechanical systems, hydraulics, engines, and other technical areas. His research experience includes work at Fermilab, CERN, and Chicago State University. He has presented his research at several conferences.
In this presentation, Wes Eldridge will provide a general overview on data science. The talk will cover a variety of topics, Wes will start with the dirty history of the field which will help add context. After learning about the history of data and data science Wes will discuss the common roles a data scientist holds in business and organizations. Next, he will talk about how to use data in your organization and products. Finally, he'll cover some tools to help you get started in data science. After the presentation, Wes will stick around for Q/A and data discussion.
The document provides an introduction and agenda for a course on big data and data science. It defines big data as large, complex data sets that are difficult to process using traditional data processing applications. It notes that 90% of data in the world today was created in the last two years alone. It also defines the four V's of big data: volume, variety, velocity, and veracity. The document defines data science as an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It notes that data scientists work with hypothesis generation, data analysis, and data visualization to gather insights that inform decisions. The document outlines some of the day-to-day responsibilities
Grid Computing in a Commodity World (KCCMG, 2005)Lorin Olsen
The document discusses the evolution of grid computing from early introductions to definitions and examples of different types of grids. It describes computational grids which focus on computing power, scavenging grids which utilize unused desktop resources, and data grids which allow sharing and access to data across organizations. Specific examples of computational grids include those used for grand challenge problems in fields like fluid dynamics, environmental modeling, and molecular biology.
My presentation on Data Mining, Lessons from Competitions, and Public Data looks at the Data Mining/Data Science/Big Data evolution, reviews lessons from KDD Cup 1997, Netflix Prize, and Kaggle, presents a big list of Public and Government data APIs, Marketplaces, Portals, and Platforms, and examines Big Data Hype. This talk was given at BPDM-2013, (Broadening Participation in Data Mining), Aug 10, 2013 held at KDD-2013, Chicago.
Top 5 Deep Learning and AI Stories - November 30, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: 75 healthcare companies partner with NVIDIA to power the future of radiology, NeurIPS conference showcases the latest in AI research, NVIDIA's new research lab pushes machine learning boundaries, Israeli AI startup restores speech abilities to stroke victims and others with impaired language, and radiologists can detect anomalies in medical images with deep learning.
Advancing Medical Imaging with Deep LearningNVIDIA
The inaugural MIDL conference in Amsterdam brought together nearly 300 deep learning researchers, clinicians, and healthcare companies. There were 4 keynote speakers from industry leaders, 21 oral presentations, and over 60 poster presentations. Awards were given for the best paper and poster, which focused on medical image analysis using deep learning. The conference aimed to advance the application of deep learning in medical imaging and discuss collaboration across industry, academia, and clinicians.
This document discusses analytics education in the era of big data. It begins with an overview of different terms used such as analytics, data mining, data science, and knowledge discovery. It then discusses trends in big data including the 3 V's of volume, velocity, and variety. It notes that skills and jobs in analytics are in high demand but there is a shortage of people with deep analytical skills. The document provides an overview of analytics education including various certificate programs and online courses available. It emphasizes that analytics education works best when combined with learning by doing through competitions and hands-on projects.
This document discusses the key factors that contributed to the recent boom in deep learning. It identifies better neural network algorithms/techniques, large datasets, massive parallelization using GPUs, and industry investment as major enabling factors. In particular, it highlights how the availability of large, labeled datasets like ImageNet; developments in CNNs, autoencoders, and other neural network architectures; the use of GPUs to enable efficient parallel training; and large-scale research at tech companies like Google were central to recent advances in deep learning.
Top 5 Deep Learning and AI Stories - November 3, 2017NVIDIA
The document discusses insights into deep learning and artificial intelligence. It provides the top 5 headlines: 1) Pentagon official discusses how AI and machine learning will revolutionize the US intelligence community. 2) Startup is working on an AI system to detect lung cancer earlier from chest X-rays to save lives. 3) NVIDIA's GPU Cloud gives developers access to optimized deep learning tools in the cloud. 4) Non-profit AI4ALL partners with NVIDIA to increase students' access to AI resources and careers. 5) NVIDIA expands its Deep Learning Institute to address the growing need for AI experts.
Accumulo Summit 2014: Addressing big data challenges through innovative archi...Accumulo Summit
The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. The Massachusetts Institute of Technology, Lincoln Laboratory (MIT LL) is not immune to these challenges and has developed a set of tools that address many of these challenges.
Big data volume stresses the storage, memory, and compute capacity of a computing system and requires access to a computing cloud. Choosing the right cloud is problem specific. Currently, there are four multi-billion dollar ecosystems that dominate the cloud computing environment: enterprise clouds, big data clouds, SQL database clouds, and supercomputing clouds. Each cloud ecosystem has its own hardware, software, conferences, and business markets. The broad nature of business big data challenges make it unlikely that one cloud ecosystem can meet its needs and solutions are likely to require the tools and techniques from more than one cloud ecosystem. The MIT Supercloud was developed to address this challenge. To our knowledge, the MIT SuperCloud is the only deployed cloud system that allows all four ecosystems to co-exist without sacrificing performance or functionality.
The velocity of big data velocity stresses the rate at which data can be absorbed and meaningful answers produced. Led by the NSA, a Common Big Data Architecture (CBDA) was developed for the U.S. government based on the Google Big Table NoSQL approach and is now in wide use. MIT/LL played a leading role in developing the CBDA and is a leader in adapting the CBDA to a variety of big data challenges.
Big data variety may present the largest challenge and greatest opportunities. The promise of big data is the ability to correlate diverse and heterogeneous data to form new insights. The centerpiece of the CBDA is the NSA developed Apache Accumulo database (capable of millions of entries/second) and the MIT/LL developed D4M schema. These technologies allow vast quantities of highly diverse data (text, computer logs, and social media data, etc.) to be automatically ingested into a common schema that enables rapid query and correlation of every element.
The talk will concentrate on how we utilize the aforementioned technologies in our mission to apply advanced technology to problems of national security.
This document discusses machine learning and topological data analysis using Ayasdi software. It provides an introduction to Ayasdi and topological data analysis, and explains how Ayasdi uses machine learning techniques like neural networks and random forests along with topological data analysis to better understand, segment, and act on data. Specifically, Ayasdi provides clearer segmentation of data and better identification of errors compared to traditional machine learning approaches. The goal is to help people leverage their data more effectively.
Emerging Dynamic TUW-ASE Summer 2015 - Distributed Systems and Challenges for...Hong-Linh Truong
This is a lecture from the advanced service engineering course from the Vienna University of Technology. See http://dsg.tuwien.ac.at/teaching/courses/ase/
Top 5 Deep Learning and AI Stories - September 14, 2018NVIDIA
Read this week's top 5 news updates in deep learning and AI: NVIDIA’s Clara Smartens up medical instruments, Fujifilm and NVIDIA bring radiology AI to Japan, Cisco boosts its deep learning capabilities, "I am AI" docuseries episode 8: Taking AI to new heights and How a Stanford PhD student is using deep learning to create “dank memes”.
Transforming Operations Using the Results of the Tech WaveDavid Blankinship
Brian Collins presented at the CalChiefs Fire Operations Technology Summit hosted by Esri. Attendees learned the history of how chiefs have transformed operations by using the results of the #TechWave
Are you interested in hearing more about his presentation? Email: marie.marshall@intterragroup.com
Transforming Healthcare at GTC Silicon ValleyNVIDIA
The GPU Technology Conference (GTC) brings together the leading minds in AI and healthcare that are driving advances in the industry - from top radiology departments and medical research institutions to the hottest startups from around the world. Can't miss panels and trainings at GTC Silicon Valley
Towards a better measure of business proximity: Topic modeling for industry i...Gene Moo Lee
The document presents a new approach for measuring business proximity between firms using topic modeling. It aims to overcome limitations of existing approaches by developing a data-driven, scalable method that provides finer-grained analysis with limited data requirements. The approach applies latent Dirichlet allocation to uncover topics from company descriptions in the CrunchBase dataset. Business proximity is then measured as the cosine similarity between the topic distributions of firm pairs. The method is shown to outperform a baseline of using common industry membership and provides a validated measure of firms' technological and business relatedness.
This document discusses the application of artificial intelligence and machine learning in petroleum engineering. It provides an overview of topics including:
- Petroleum data analytics which uses data as the foundation for analysis and modeling rather than assumptions.
- How artificial intelligence and machine learning can be used to model complex physical phenomena by learning from data rather than using mathematical equations.
- The importance of domain expertise when applying AI/ML to solve engineering problems compared to non-engineering problems.
- The differences between traditional statistical analysis and AI/ML, with the latter discovering patterns in data inductively rather than deductively fitting data to predetermined models.
- The importance of explainable artificial intelligence (XAI) for petroleum
The document discusses challenges and opportunities related to big data and high performance computing. It notes that computational power is increasing exponentially according to Moore's Law, but clock speeds have plateaued forcing a shift to multi-core processors. This is driving the need for parallel programming and new software approaches. Big data is also growing dramatically from various sources, such as sensors and social media. Analyzing this large, heterogeneous data requires new techniques in data mining, machine learning, and visualization. High performance computing and citizen science initiatives can help extract insights from big data to address important problems in health, environment, and other domains.
Creating a Data Science Ecosystem for Scientific, Societal and Educational Im...Ilkay Altintas, Ph.D.
The new era of data science is here. Our lives and society are continuously transformed by our ability to collect data in a systematic fashion and turn that into value. The opportunities created by this change also comes with challenges that push for new and innovative data management and analytical methods as well as translating these new methods to applications in many areas that impact science, society, and education. Collaboration and ability of multi-disciplinary teams to work together and communicate to bring together the best of their knowledge in business, data and computing is vital for impactful solutions. This talk will discusses a reference ecosystem and question-driven methodology, called PPODS, to make impactful data science applications in many fields with specific examples in hazards, smart cities and biomedical research.
From AirBox to Smart City: where are we and what's next?Ling-Jyh Chen
The document discusses the AirBox project, which aims to monitor PM2.5 levels through participatory citizen sensing. It describes how over 1,600 AirBox devices have been deployed across 24 countries to measure air quality. The data is openly available through APIs and dashboards. The project also focuses on education and community engagement around air pollution issues. Applications of the large sensor network data include tracking emission sources, anomaly detection, and informing government policymaking to help make cities smarter.
[2017/07/11 AI-SOCD] AirBox: a Participatory Ecosystem for PM2.5 MonitoringLing-Jyh Chen
AirBox is a participatory ecosystem for PM2.5 air quality monitoring in Taiwan that has grown from 76 devices in 2015 to over 2,000 devices across 29 countries in 2017. It is based on an open hardware, open source, open data and open community approach. The system provides real-time air quality data visualization and analytics to empower citizens and inform environmental policymaking. Continued expansion of the monitoring network, development of new sensors and data applications, and interdisciplinary collaboration are priorities for the future.
Dr Alisdair Ritchie | Research: The Answer to the Problem of IoT SecurityPro Mrkt
The document discusses the growing issues surrounding security of internet of things (IoT) devices. It notes that cyber attacks cost businesses hundreds of billions annually and vulnerabilities often exist for over a year before being addressed. With the rapid growth of connected devices, addressing IoT security is increasingly important. The PETRAS research hub involves over 50 projects across 11 UK universities to better understand social and technical challenges around privacy, ethics, trust, reliability, and security of IoT systems. The goal is to make the UK a leader in trusted IoT expertise and help ensure security does not solely rely on consumer burden.
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
Motivating Introduction to MOOC on Big Data from an applications point of view https://bigdatacoursespring2014.appspot.com/course
Course says:
Geoffrey motivates the study of X-informatics by describing data science and clouds. He starts with striking examples of the data deluge with examples from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trend in both clouds and big data.
He introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described with growing importance of data oriented version. He covers 3 major X-informatics areas: Physics, e-Commerce and Web Search followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on a data science education and the benefits of using MOOC's.
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Center...Geoffrey Fox
Motivating Introduction to MOOC on Big Data from an applications point of view https://bigdatacoursespring2014.appspot.com/course
Course says:
Geoffrey motivates the study of X-informatics by describing data science and clouds. He starts with striking examples of the data deluge with examples from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trend in both clouds and big data.
He introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described with growing importance of data oriented version. He covers 3 major X-informatics areas: Physics, e-Commerce and Web Search followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on a data science education and the benefits of using MOOC's.
The AirBox Project aims to create an ecosystem focused on participatory PM2.5 air pollution monitoring. It involves micro air pollution sensing using low-cost devices, open data analysis, app and firmware development, public awareness campaigns, and collaborations with experts, communities, and governments. Over 2,000 AirBox sensor devices have been deployed across 29 countries to measure PM2.5 levels and provide real-time data online through open data portals and dashboards.
Digitalisation and the future of research environmentsJisc
The document discusses how higher education is embracing digitalization through technologies like digitization, digitalization, and transformation. It also discusses how the digitalization journey can vary across different parts of an organization. Any future research environment will need to consider administration/management/support, the research process itself, and the community being served. A digital twin represents a technology that could help with a future research environment by providing a virtual representation to facilitate decision making and scenario testing across the research lifecycle.
Emerging Technologies in Synthetic Representation and Digital TwinLiming Zhu
This document discusses emerging technologies in synthetic representation and digital twins presented by Dr. Liming Zhu from CSIRO's Data61. It covers digital twins, synthetic representations, emerging technologies like federated learning and simulation, and examples of spatial digital twins in Australia. It emphasizes securely and privately connecting digital twins through techniques like federated analytics, sharing without access, desensitized and synthetic data. Future focus areas discussed include trusted data sharing, federated data and models, cross-domain security, and synthetic representation of supply chains.
e-SIDES workshop at ICT 2018, Vienna 5/12/2018e-SIDES.eu
This document summarizes a session discussing how to build the next privacy and security research agenda for big data. The session included an introduction, a discussion of the e-SIDES community position paper and process for providing input, a mentimeter voting activity, and a panel on ensuring responsible research and innovation responds to real needs. The panel featured representatives from universities and research organizations discussing issues like integrating privacy from the start, understanding cultural and regional differences, and ensuring research aligns with societal values and needs. The position paper and future research agenda aim to provide recommendations for an ethically sound approach to big data.
NUS-ISS Learning Day 2018-Painting Today's digital landscapeNUS-ISS
This document provides an overview of the current digital landscape presented by Mark Wee Kai Lie. It discusses the history of industrial revolutions and factors driving digital transformation. Key technology trends are examined, including artificial intelligence, quantum computing, and technologies impacting the digital workplace and how people live and work. The document also touches on challenges in digital delivery and approaches to address them, such as design thinking, business process reengineering, and change management. It concludes by listing Mark Wee's credentials and contact information.
Visual Information Analysis for Crisis and Natural Disasters Management and R...Yiannis Kompatsiaris
Invited talk at the Ninth International Conference on Image Processing Theory, Tools and Applications IPTA 2019 (http://www.ipta-conference.com/ipta19/)
Crises and natural disasters are unwelcome, but also unavoidable features of modern society, affecting more communities than ever. Visual information analysis plays an important role in efficient pre-event (e.g. risk modeling), during the event (response) and post-event (recovery) emergency situation management. This talk will describe the potential role of visual information sources including satellite images, surveillance and traffic cameras, social multimedia and aerial video in applications such as floods, fires, and oil spills. Multimodal and fusion techniques will be presented combining satellite and social data and how deep neural networks can be applied in this domain. The talks will include demos and results from the relevant BeAware and EOPEN projects and from our participation in the 2018 Multimedia Satellite Task of the MediaEval Benchmarking Initiative.
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
This document discusses trends in data science in 2016, including how data science is moving into new use cases such as medicine, politics, government, and neuroscience. It also covers trends in hardware, generalized libraries, leveraging workflows, and frameworks that could enable a big leap ahead. The document discusses learning trends like MOOCs, inverted classrooms, collaborative learning, and how O'Reilly Media is embracing Jupyter notebooks. It also covers measuring distance between learners and subject communities, and the importance of both people and automation working together.
The document introduces the concepts of the Internet of Things (IoT) and discusses its applications and architecture models. It aims to discuss semantic technologies, service oriented solutions, and networking technologies that enable the integration of IoT data and services into the cyber world. Sources and videos are provided on topics relating to IoT security risks, definitions, and business trends.
This document provides an introduction to an "Introduction to IoT" course being taught in spring 2022. It outlines the instructor's details, grading breakdown, reference material, research areas, and course outline. The outline includes topics like the history of IoT, definitions of IoT, applications, challenges, and a case study on IoT in connected vehicles. The document also describes an IoT living lab setup at IIIT Hyderabad including sensor nodes for monitoring air quality, weather, energy use, crowds, and more.
Opportunities and Challenges of Large-scale IoT Data AnalyticsPayamBarnaghi
The document discusses opportunities and challenges of large-scale IoT data analytics. It provides an overview of the evolution of IoT from early technologies to current applications and future directions. It describes the types of heterogeneous and real-time data generated by IoT devices and challenges in analyzing this data. Examples of applications discussed include smart cities, transportation, healthcare, and event analysis. The document also summarizes work done in the EU CityPulse project on extracting events from social media and demonstrating IoT data analytics techniques.
Internet of things_by_economides_keynote_speech_at_ccit2014_finalAnastasios Economides
Internet of Things forecast, economics, applications, technology, research challenges, sensor networks security, attack models, countermeasures, network security visualization
Similar to Collaborative Data Science In A Highly Networked World (20)
Workflow-Driven Geoinformatics Applications and Training in the Big Data EraIlkay Altintas, Ph.D.
My slides from the Big Data and The Earth Sciences: Grand Challenges Workshop on May 31st, 2017. Workshop link: http://prp.ucsd.edu/events/big-data-and-the-earth-science-grand-challenges-workshop
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
SDSC is a leader in high performance computing, data-intensive computing, and scientific data management. It focuses on "Big Data", "versatile computing", and "life sciences applications". The SDSC Data Science Office provides expertise, systems, and training for data science applications. Genomic analysis poses big data and computing challenges including data management, integration, and coordination and workflow management. New tools are needed to address these challenges. bioKepler is an example of a Kepler module for data-parallel bioinformatics. Training is also needed at the interface of domains to build the next generation of interdisciplinary scientists. SDSC works with industry partners through various strategies like sponsored research and providing access to systems and expertise.
The document describes WIFIRE (Wildfire Infrastructure for Resilience), a project funded by the National Science Foundation to develop a cyberinfrastructure for wildfire monitoring, dynamic prediction, and resilience. The goals of the project are to integrate real-time sensor data, satellite imagery, data management tools, wildfire simulation tools, and emergency response systems to improve wildfire disaster management before, during, and after fires. The cyberinfrastructure will analyze large amounts of heterogeneous sensor data and combine it with physical models to provide predictive capabilities and risk assessments to firefighters and the public.
WIFIRE is a project funded by the National Science Foundation to develop a cyberinfrastructure for scalable data-driven wildfire monitoring, dynamic prediction, and resilience. It involves collecting and integrating data from various sources like sensor networks, satellite imagery, and weather data to help with wildfire disaster management, prediction of fire spread, and resilience planning. The cyberinfrastructure being developed aims to make large amounts of real-time sensor data useful for analysis, combine data with physical models, and connect emergency response centers with predictive and preventative capabilities.
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
cientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publications of the results. This talk summarizes varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and present a methodology for workflow-driven science based on these maturing requirements.
WorDS of Data Science in the Presence of Heterogenous Computing ArchitecturesIlkay Altintas, Ph.D.
ISUM 2015 Keynote
Summary: Computational and Data Science is about extracting knowledge from data and modeling. This end goal can only be achieved through a craft that combines people, processes, computational and Big Data platforms, application-specific purpose and programmability. Publications and provenance of the data products products leading to these publications are also important. With this in mind, this talk defines a terminology for computational and data science applications, and discuss why focusing on these concepts is important for executability and reproducibility in computational and data science.
This document discusses bridging big data and data science using scalable workflows. It describes how scientific workflows can integrate various data science tools and processes to analyze large datasets. Workflows allow standardized, programmable, and reproducible analysis at scale. Examples are provided of workflows developed at the San Diego Supercomputer Center for applications in bioinformatics, wildfire management, and other domains. The document advocates conceptualizing computational analyses as workflows to facilitate collaboration between data scientists and developers.
Scientific workflows help facilitate research by making complex processes easier to assemble, access diverse resources transparently, incorporate multiple tools, and ensure reproducibility. However, new challenges have emerged such as analyzing large amounts of sensor and genomic data. Workflows need to be more programmable, optimize resource usage across computing systems, and integrate with the full scientific process from data generation to publication. Next steps include specializing workflows for different domains and standards, treating workflows as publications, and catering to various hardware architectures.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Collaborative Data Science In A Highly Networked World
1. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Collaborative Data Science in a Highly Networked World
İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Division Director, Cyberinfrastructure Research, Education and Development
Founder and Director, Workflows for Data Science Center of Excellence
2. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
What is a network
useful for?
?
3. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Making connections
• People and communities
• Data and applications
• People and information
• People and services
• Learners and classes
• Ideas and masses
4. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Advancing
Communication
and
Collaboration
5. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Any technology and
application built on
networking should be built
around these concepts.
6. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
How do we conduct and
teach data science in a
highly networked world?
?
7. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
What is Data Science?
?
9. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
How does successful data science happen?
Insight Data Product
“Big” Data
Question
Exploratory
Analysis
and
Modeling
Insight
10. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Customer
Demographic
Previous
Purchases
Book reviews
What kind of
books does this
customer like?
Book
recommendations
Example: Book Recommendations
11. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Model of customer’s
book preferences
New book
information
Who is likely to
like this book?
Find Potential Audience for a New Book
12. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Action to market
the book to the
right audience
Who is likely to
like this book?
Market a New Book
13. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Action to market
the book to the
right audience
Who is likely to
like this book?
Insight Action
Market a New Book
14. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Historical data Near real-time data
Prediction
Creating Actionable Information
16. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Why is the increased interest
in Data Science?
?
17. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
+
Big Data
Scalable Computing
Anywhere Anytime
18. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Data Science Today is Both a Big Data and a Big Compute Discipline
BIG DATA
COMPUTING AT
SCALE
Enables dynamic data-driven applications
Smart Manufacturing
Computer-Aided Drug Discovery
Personalized Precision Medicine
Smart Cities
Smart Grid and Energy Management
Disaster Resilience and Response
Requires:
• Data management
• Data-driven methods
• Scalable & dynamic
process coordination
• Resource optimization
• Skilled interdisciplinary
workforce
New era of
data science!
19. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Nearly every problem today is
transformed by big data.
20. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Example: Geospatial Big Data
• Flood of new data sources and types
• Needs new data management, storage and analysis
methods
• Too big for a single server, fast growing data volume
• Requires special database structures that can handle
data variety
• Too continuous for analysis at a later time, with
increasing streaming rate, i.e., velocity
• Varying degrees of uncertainty in measurements, and
other veracity issues
• Provides opportunities for scientific understanding at
different scales more than ever, i.e., potential high value
Real-time sensors
Weather forecast
Satellite imagery
Sea Surface Temperature
Measurements
Drone imagery
21. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Example: Biomedical Big Data http://nbcr.ucsd.edu
23. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
How do we amplify the value of Big Data?
24. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
How do we find the connections
and answer questions that
benefit the society?
“We are drowning in
information and
starving for knowledge”
– John Naisbitt
Source: Megatrends, 1982
25. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Create an Ecosystem that Enables
Needs and Best Practices
• data-driven
• scalable
• dynamic
• process-driven
• collaborative
• accountable
• reproducible
• interactive
• heterogeneous
• includes many different expertise
26. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
What would such an
ecosystem look like?
?
27. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
D
ata
M
anagem
ent
Advanced
Infrastructure
D
ata
Analytics
C
om
putational
Science
A Typical Collaborative Data Science Ecosystem
28. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Amplifying the
Value of Data
Related to X
Benefit Y for
Science,
Business,
Society or
Education
What if X was wildfires?
30. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
How do we Better Predict Wildfire Behavior?
• Wildfires are critical for ecology, but volatile
• Fuel load is high due to fire suppression over the
last century
• Drought, higher temperatures
• Better prevention, prediction and maintenance of
wildfires is needed
Photo of Harris Fire (2007) by former Fire Captain Bill
Clayton
Disaster management of (ongoing) wildfires heavily relies on
understanding their Direction and Rate of Spread (RoS).
Fire is Part of the Natural Ecology….
… but requires Monitoring, Prediction and Resilience
31. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Big Data Fire Modeling
Visualization
Monitoring
WIFIRE: A Scalable Data-Driven Monitoring, Dynamic
Prediction and Resilience Cyberinfrastructure for Wildfires
32. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
A dynamic system integration of
real-time sensor networks, satellite imagery, near-real
time data management tools, wildfire simulation tools,
and connectivity to emergency command centers
.
…. before, during and after a firestorm.
34. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
High Performance Wireless
Research and Education Network FARSITE
http://hpwren.ucsd.edu/cameras
>160 Meteorological Sensors and Growing
Major success to bring
internet to incident
command in the field. Used
in over 20 fires over time.
Most popular
operational fire
behavior
modeling system.
35. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Closing the Loop using Big Data
-- Wildfire Behavior Modeling and Data Assimilation --
• Computational costs for existing
models too high for real-time
analysis
• a priori -> a posteriori
• Parameter estimation to make
adjustments to the (input) parameters
• State estimation to adjust the
simulated fire front location with an a
posteriori update/measurement of the
actual fire front locationConceptual Data Assimilation Workflow with
Prediction and Update Steps using Sensor Data
36. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Fire Modeling Workflows in WIFIRE
Real-time sensors
Weather forecast
Fire perimeter
Landscape data
Monitoring &
fire mapping
37. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Firemap Tool
• A web-based GIS environment:
• access information related to fire
behavior
• analyze what-if scenarios
• model real-time fire behavior
• generate reports
• Powered by WIFIRE
Firemap
Web Interface
WIFIRE Data Interfaces WIFIRE Workflows
Computing Infrastructure
http://firemap.sdsc.edu
38. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Data-Driven Fire Progression
Prediction Over Three Hours
Collaboration with LA and
SD Fire Departments
http://firemap.sdsc.edu
August 2016 – Blue Cut Fire
Tahoe and Nevada Bureau
of Land Management
Cameras: 20 cameras added
with field-of-view
39. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
CA Fires 10/2017 through 12/2017
800K+ unique visitors and 8M+ hits
http://firemap.sdsc.edu
40. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
San Diego Airborne Intelligence
Reconnaissance System (SDAIRS)
Lilac Fire Perimeter and
WIFIRE Fire Progression
Model in SCOUT
41. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Thomas Fire: 12/04/2017- 01/12/2018
December 10, 2017
December 17, 2017
42. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Real-time Satellite Detections During
Thomas Fire: 12/04/2017- 01/12/2018
43. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Some Machine Learning Case Studies
• Smoke and fire perimeter detection based on imagery
• Prediction of Santa Ana and fire conditions specific to location
• Prediction of fuel build up based on fire and weather history
• NLP for understanding local conditions based on radio
communications
• Deep learning on multi-spectra imagery for high resolution fuel maps
• Classification project to generate more accurate fuel maps (using
Planet Labs satellite data)
All require periodic,
dynamic and
programmatic
access to data!
44. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Classification project to generate more
accurate fuel maps
• Accurate and up-to-date fuel maps are critical for
modeling wildfire rate of speed and potential burn
areas.
• Challenge:
• USGS Landfire provides the best available fuel maps
every two years.
• The WIFIRE system is limited by these potentially 2-year
old inputs. Fuel maps created at a higher temporal
frequency is desired.
• Approach:
• Using high-resolution satellite imagery and deep
learning methods, produce surface fuel maps of San
Diego County and other regions in Southern California.
• Use LandFire fuel maps as the target variable, the
objective is create a classification model that will
provide fuel maps at greater frequency with a measure
of uncertainty.
Cluster 1: Short Grass
45. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
WIFIRE Team: It takes a village!
• PhD level researchers
• Professional software
developers
• 29 undergraduate students
• UC San Diego
• UC Merced
• MURPA University
• University of Queensland
• 1 high school student
• 5 MSc and 5 MAS students
• 2 PhD students (UMD)
• 1 postdoctoral researcher
• Partners from fire departments
• Advisory board with diverse
expertise and affiliations
UMD - Fire modeling
UCSD MAE - Data assimilation
SDSC -
Cyberinfrastructure,
Workflows,
Data engineering,
Machine Learning,
Information Visualization,
HPWREN
Calit2/QI-
Cyberinfrastructure, GIS,
Advanced Visualization,
Machine Learning,
Urban Sustainability,
HPWREN
SIO - HPWREN
46. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
ACQUIRE PREPARE ANALYZE REPORT ACT
Focus on the Process and Team Work
to Answer a Question
…
47. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Scalable Drug Discovery
medium
Prima-1
Sticticacid
35ZWF
25KKL
22LSV
32CTM
26RQZ
27WT9
33AG6
33BAZ
28NZ6
27TGR
27VFS
35LWZ
36EB5
27UDP
32LDE
0
0.2
0.4
0.6
0.8
1
1.2
0
0.2
0.4
0.6
0.8
1
1.2
0
0.2
0.4
0.6
0.8
1
1.2
no p53
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
no
com
poundPrim
a-1
35ZW
F25KKL25PW
S24M
LP26YYG22LSV24M
NR32CTM
22KTV24M
Y424LBC24NPU24NW
3
Series1"
Series2"
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
no
com
poundPrim
a-1
35ZW
F25KKL25PW
S24M
LP26YYG22LSV24M
NR32CTM
22KTV24M
Y424LBC24NPU24NW
3
Series1"
Series2"cancer cell with p53-R175H mutant
cellproliferation
15 new reactivation compounds
reactivation
compounds kill
cells with p53
cancer mutant
Ieong et al., 2014
AMBER GPU
MD Tool
Minimization Actor
BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution,
problem & solution
• Compare methods
• Train students
48. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Using workflows for process integration…
D
ata
M
anagem
ent
Advanced
Infrastructure
D
ata
Analytics
C
om
putational
Science
52. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Sample Variance Plotting and Storage
Workflow for Real-time Data
2006,
ROADNet
Project
53. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Workflows for Data Science
Center of Excellence at SDSC
Goal: Methodology and tool
development to build automated
and operational workflow-driven
solution architectures on big data
and HPC platforms.
Focus on the
question,
not the
technology!
Real-Time Hazards Management
wifire.ucsd.edu
Data-Parallel Bioinformatics
bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery
nbcr.ucsd.edu
WorDS.sdsc.edu
• Access and query data
• Support exploratory design
• Scale computational analysis
• Increase reuse and reproducibility
• Save time, energy and money
• Formalize and standardize
• Train
54. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Balance of:
• team building
• process management
• performance optimization
• provenance tracking
• training and education
55. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
While working with experts on…
• domain expertise
• data modeling and integration
• data management services
• analytical methods
• communication and visualization
56. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
“The” Data Science Team
• Data engineer
• Data analyst
• Methods expert
• Scalability and operations expert
• Business manager
• Business analyst
• Scientist
• Visualization and dashboard developer
• Solution architect
• Story teller/coordinator
• Project manager
Expertise and skills often overlap,
but nobody has it all!
57. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
How can I get smart people
to collaborate and
communicate?
…to utilize data and infrastructure to
generate insights and solve a question.
Focus on the
question,
not the
technology!
Team Building
58. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Purpose to Lead to Insight
Focus on the
question,
not the
technology!
Purpose
LEAN METHOD
Minimize the
total time through the loop
CODE
LEARN BUILD
MEASURE
DATA
IDEAS
?
59. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Data Science Process
60. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
ACQUIRE PREPARE ANALYZE REPORT ACT
Basic Steps
in a Data
Science
Process
• Import raw dataset into your analytics
platform
• Explore & Visualize
• Perform Data Cleaning
• Feature Selection
• Model Selection
• Analyze the results
• Present your findings
• Use them
ACQUIRE
PREPARE
ANALYZE
REPORT
ACT
61. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Computational Data ScienceData Engineering
ACQUIRE PREPARE ANALYZE REPORT ACT
Scale Scale Scale Scale
Many iterations and rollbacks between steps.
64. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Process for Practice of
Data Science
Programmability
Ease of use, iteration, interaction, re-use, re-purpose
Scalability
From local experiments to large-scale runs
Reproducibility
Ability to validate, re-run, re-play
Data
Product
65. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Some P’s in PPoDS
Platforms
Process
People
Problem
or
Purpose
?
Programmability
66. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
The insights need to be evaluated to
turn them into action.
Platforms
Process
People
Purpose?
Programmability
Metrics Product
Insight
Action
?
67. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Pod è sub-process
Treat Each Step in the Solution
Process as a Conceptual Pod
Defined by:
• Purpose and goal
• Stakeholders
• Expectations
• Key questions to be answered,
production/consumption relationships, needs,
dependencies, limits, …
• Contracts
• Performance, economic, accuracy, policy, privacy,
reproducibility, political, …
• Knowns
• Known unknowns
Metrics for accountability should be built into
the process.
Timeline
Purpose Expectations
Planning of deliverables
Cost
Using the PPODS Approach
• Each step in your data pipelines is a
separate pod
• Define success metrics for calling
each pod done
• Pods can be atomic or hierarchical
68. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Zooming into a simple example…
PREPARE ANALYZE
Data
Exploration
Schema
Integration
Query
Processing
Machine
Learning
…
69. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Creating A Solution Architecture for
Networked Science Applications
70. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
COORDINATION AND
WORKFLOW MANAGEMENT
DATA INTEGRATION
AND PROCESSING
DATA MANAGEMENT
AND STORAGE
Process-driven
Solution
Architectures
and the Role of
Workflows
71. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
…
COORDINATION AND
WORKFLOW MANAGEMENT
DATA INTEGRATION
AND PROCESSING
DATA MANAGEMENT
AND STORAGE
COMMUNICATION AND FEEDBACK
EXPLORATION
SCALABILITY
PROVENANCE
SECURITY
ACQUIRE PREPARE ANALYZE REPORT ACT
72. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Utilizing “Advanced Cyberinfrastructure”
D
ata
M
anagem
ent
Advanced
Infrastructure
D
ata
Analytics
C
om
putational
Science
Compute
+
Storage
+
Network
73. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer
resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing,
and scientific data management
• Current strategic focus on “Big Data”, “versatile
computing”, and “life sciences applications”
Recent Innovative Architectures
• Gordon: First Flash-based
Supercomputer for Data-intensive
Apps
• Comet: Serving the Long Tail of
Science
74. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
The Pacific Research Platform Creates
a Regional End-to-End Science-Driven “Big Data Superhighway” System
Letters of Commitment from:
• 50 Researchers from 15 Campuses
• 32 IT/Network Organization Leaders
NSF CC*DNI Grant
$5M 10/2015-10/2020
PI: Larry Smarr, UC San Diego Calit2
Co-Pis:
• Camille Crittenden, UC Berkeley CITRIS,
• Tom DeFanti, UC San Diego Calit2,
• Philip Papadopoulos, UCSD SDSC,
• Frank Wuerthwein, UCSD Physics and SDSC
Disk-to-Disk: 10-100 Gbps
Source: John Hess, CENIC
Larry Smarr, UCSD
75. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
New NSF CHASE-CI Grant Creates a Community Cyberinfrastructure
Adding a Machine Learning Layer Built on Top of the Pacific Research Platform
Caltech
UCB
UCI UCR
UCSD
UCSC
Stanford
MSU
UCM
SDSU
NSF Grant for High Speed “Cloud” of 256 GPUs
For 30 ML Faculty & Their Students at 10 Campuses
for Training AI Algorithms on Big Data Slide Source: Larry Smarr, UCSD
76. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Next Step: Surrounding the CHASE-CI Machine Learning Platform
With Clouds of GPUs and Non Von Neumann Processors
Microsoft Installs FPGAs into Bing Servers &
432 into TAAC for Academic Access
64-TrueNorth
Cluster
CHASE-CI
Slide Source: Larry Smarr, UCSD
82. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Parts of the Solution
• Stakeholders
• Datasets
• Compliance requirements
• Defined actions
• Analytical methods
• Technical infrastructure
Bias
Transparency
Verification
Accuracy
Ethics
Reproducibility
Cost
83. CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
To summarize…
• Data science is a collaborative activity
• Focus on collaboration and communication from problem definition stage
• Apply process management techniques where necessary
• Incorporate and formalize definition of success from different perspectives
• Measurable automation should be the end goal
• Requires built in programmable and scalable data pipelines
• Includes measurable and programmable networks
• Iterations based on pre-defined metrics help
• PPODS is a methodology for collaborative data science application
integration and iteration
• Toolkits for process automation, scalable execution, provenance tracking and
reporting