Michael Mahoney discusses the rise of massive data from various sensors. He notes there are many types of sensors that generate large amounts of data, including physical, consumer, health, financial, internet, and astronomical sensors. While there are similarities between sensor applications, there are also differences in funding, customer demands, questions of interest, time sensitivity, and more. Analyzing massive data presents challenges due to its size, variability, and noise. New algorithms and statistical methods are needed to gain insights from these large and complex data sets. Mahoney advocates cross-disciplinary work to address the opportunities and difficulties presented by modern massive data.
Big Data and Data Mining - Lecture 3 in Introduction to Computational Social ... (by Lauri Eloranta)
Third lecture of the course CSS01: Introduction to Computational Social Science at the University of Helsinki, Spring 2015 (http://blogs.helsinki.fi/computationalsocialscience/).
Lecturer: Lauri Eloranta
Questions & Comments: https://twitter.com/laurieloranta
Presentation given at the HEA Social Sciences learning and teaching summit "Exploring the implications of 'the era of big data' for learning and teaching".
A blog post outlining the issues discussed at the summit is available via: http://bit.ly/1lCBUIB
This document summarizes machine learning techniques used at NASA's Jet Propulsion Laboratory. It discusses how machine learning can be used to analyze large datasets that are too complex for humans to fully examine alone. Examples include identifying features in hyperspectral images and discovering patterns in genetic and meteorological data. Both supervised and unsupervised machine learning algorithms are covered.
This document discusses nanotechnology and its growth potential. It notes that nanotechnology is projected to become a trillion dollar global industry by 2015, employing over 2 million workers. Currently, there are only about 20,000 trained nanotechnologists worldwide. The document outlines different types of nanomaterials and generations of nanotechnology development. It also lists many universities, research centers, and companies in New York that are involved in nanotechnology research and commercialization.
Jana Diesner, "Words and Networks: Considering the Content of Text Data for N... (by summersocialwebshop)
The document discusses extracting network data from text documents. It outlines extracting named entities like people, places, events as nodes and linking them based on proximity, syntax, and statistics to build networks. Examples are provided of networks constructed from news articles about Sudan showing prominent individuals and organizations and how their centrality changes over time. The goal is to understand socio-technical networks and their co-evolution with knowledge and structure from large-scale text data.
Examples of how to inspire the next generation to pursue computational chemis... (by Sean Ekins)
This document provides examples of how to inspire the next generation to pursue computational chemistry and cheminformatics. It suggests starting engagement through social media like tweets that are relevant to how younger generations communicate. It describes the fields of computational chemistry and cheminformatics and highlights how the areas have evolved from using catalogs and 2D representations to 3D visualization and remote control of labs. The document advocates increasing diversity in the fields and engaging people with different backgrounds. It proposes developing educational apps and games to teach chemistry concepts in a fun way. The document also stresses the need to lower barriers to entry, make research more visible, and consider how to fund disruptive ideas to imagine and attract the future of computational chemistry.
Semantics for Bioinformatics: What, Why and How of Search, Integration and An... (by Amit Sheth)
Amit Sheth's Keynote at Semantic Web Technologies for Science and Engineering Workshop (held in conjunction with ISWC2003), Sanibel Island, FL, October 20, 2003.
Localized methods in graph mining exploit local structures in a graph instead of attempting to find global structure. They have been widely successful on problems such as community detection and label propagation.
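As an illustration of what "localized" means here, the following is a minimal sketch of one local community detection strategy: greedily grow a set around a seed node, picking whichever frontier neighbour keeps conductance lowest, and stop when no addition improves it. The function names and the toy edge-list representation are illustrative, not from any specific library.

```python
from collections import defaultdict

def conductance(adj, members):
    """Cut edges leaving `members` divided by the total degree of `members`."""
    volume = sum(len(adj[m]) for m in members)
    cut = sum(1 for m in members for n in adj[m] if n not in members)
    return cut / volume if volume else 1.0

def local_community(edges, seed):
    """Greedy local sweep: grow the community one neighbour at a time,
    always choosing the candidate that minimises conductance, stopping
    when no candidate improves it. Only the frontier of the growing set
    is ever examined, so the full graph need not be explored."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    community = {seed}
    while True:
        frontier = {n for m in community for n in adj[m]} - community
        if not frontier:
            break
        best = min(frontier, key=lambda c: conductance(adj, community | {c}))
        if conductance(adj, community | {best}) >= conductance(adj, community):
            break
        community.add(best)
    return community
```

On two triangles joined by a single bridge edge, seeding at one triangle recovers exactly that triangle, since crossing the bridge raises conductance.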
This document provides an introduction to data mining concepts and techniques. It discusses why data mining is needed due to the massive growth of data. It defines data mining as the extraction of interesting patterns from large datasets. The document outlines the key steps in the knowledge discovery process and how data mining fits within business intelligence applications. It also describes different types of data that can be mined and popular data mining algorithms.
The document discusses a Faculty Development Program (FDP) on database management systems that was held on December 6, 2018 at the University College of Engineering Tindivanam in Tindivanam, India. The FDP covered recent research perspectives in different database management systems and the importance of database management systems in Digital India. It was conducted by Dr. A. Karthirvel, Professor and Head of the Computer Science and Engineering Department at MNM Jain Engineering College in Chennai.
The document provides an overview of data science and what it entails. It discusses the hype around big data and data science, and how data science has evolved due to improvements in technology that allow for large-scale data processing. It defines data science as a process that involves collecting, cleaning, analyzing and extracting meaningful insights from data. Data scientists come from a variety of academic backgrounds and work in both industry and academia developing solutions to real-world problems using data-driven approaches.
Data Science: Origins, Methods, Challenges and the future? (by Cagatay Turkay)
Slides for my talk at City Unrulyversity on 18.03.15 in London. Discuss the term Data Science, touch upon the origins and the data scientist types. A longer discussion on the Data Science process and challenges analysts face.
And here is the abstract of the talk:
Data Science ... the term is everywhere now: on the news, on recruitment sites, on technology boards. "Data scientist" has even been named the sexiest job title of the century. But what is it, really? Is it just hype, or a term that will be with us for some time?
This session will investigate where the term is originating from and how it relates to decades of research in established fields such as statistics, data mining, visualisation and machine learning. We will investigate how the field is evolving with the emergence of large, heterogeneous data resources. We will discuss the objectives, tools and challenges of data science as a practice, and look at examples from research and industrial applications.
The document discusses the MESUR (Making Use and Sense of Scholarly Usage Data) project which aims to develop new metrics for scholarly impact and prestige based on usage data from digital scholarly resources rather than just citations. The key points are:
1) MESUR analyzes over 1 billion usage events of scholarly articles and develops network-based metrics from usage patterns to map the structure of science.
2) Preliminary results show relevant structure in usage-based network maps that correlate with traditional citation-based metrics.
3) MESUR has produced a variety of usage and citation-based metrics and developed online tools for exploring these metrics.
In search of lost knowledge: joining the dots with Linked Data (by jonblower)
These slides are from my seminar to the University of Reading Department of Meteorology, November 2013. They contain a (hopefully not very technical) introduction to the concepts of Linked Data and how we are applying them in the CHARMe project (http://www.charme.org.uk). In CHARMe we are using Open Annotation to connect users of climate data with community-generated "commentary information" that helps them to understand a dataset's strengths and weaknesses.
The slide notes contain some helpful context, so you might like to download the PPT file!
The slides are licensed as "Creative Commons Attribution 3.0", meaning that you can do what you like with these slides provided that you credit the University of Reading for their creation. See http://creativecommons.org/licenses/by/3.0/.
The document provides an overview of the data mining concepts and techniques course offered at the University of Illinois at Urbana-Champaign. It discusses the motivation for data mining due to abundant data collection and the need for knowledge discovery. It also describes common data mining functionalities like classification, clustering, association rule mining and the most popular algorithms used.
1. Developing a unifying theory of data mining that connects different tasks and approaches could help advance the field by providing a theoretical framework.
2. Scaling data mining methods to handle high dimensional and streaming data at massive scales is challenging due to limitations in current approaches for problems like concept drift.
3. Efficiently mining sequential, time series, and noisy time series data remains an important open problem, particularly for applications like financial and seismic predictions.
This document summarizes several research projects related to big data and social science knowledge. It discusses projects that analyzed large social media platforms like Facebook, Twitter, and Wikipedia to study information diffusion and social influences. It also discusses challenges like securing access to commercial data and ensuring replicability of findings. Examples demonstrate how big data can provide novel insights but are limited by the objects studied and incomplete representation of populations. The document discusses debates around the implications of big data for privacy, prediction, exclusion, and manipulation. It argues that the knowledge gained depends on how research technologies develop within ethical and legal frameworks.
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial (by eswcsummerschool)
The document discusses big data techniques, tools, and applications. It describes how big data is enabled by increases in storage capacity, processing power, and data availability. It outlines common approaches to distributed processing, storage, and programming models for big data, including MapReduce, NoSQL databases, and cloud computing. It also provides examples of applications involving log file analysis, network alarm monitoring, media content analysis, and social network analysis.
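The MapReduce model mentioned above can be illustrated with a single-process sketch of its three phases (map, shuffle, reduce) applied to word counting, the canonical example. This is a toy stand-in for a distributed framework; the function names are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    # In a real framework the mappers and reducers run on different
    # machines; here the phases simply run one after another.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return reduce_phase(shuffle(mapped))
```

The key design point is that mappers and reducers are pure functions over key-value pairs, which is what lets the framework distribute them across a cluster.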
This document summarizes a lecture on network science given by Madhav Marathe at Lawrence Livermore National Laboratory in December 2010. It provides an overview of network science, including definitions of networks and their unique properties. It also discusses mathematical and computational approaches to modeling complex networks and applications to infrastructure planning, energy systems, and national security. The lecture acknowledges prior work that contributed to its material from various researchers and textbooks.
This document discusses considerations for collecting social network data through surveys. It addresses research design elements like defining the relevant population boundaries and sampling approaches. For surveys specifically, it covers informed consent, name generator questions to identify social ties, response formats, and balancing depth of network detail collected versus sample size. The key challenges are defining the theoretical population of interest, collecting a sufficiently large and representative network sample, and designing survey questions that accurately capture social ties within time and resource constraints.
This document discusses considerations for collecting social network data through surveys. It addresses research design elements like defining the boundaries of the relevant population, sampling approaches for collecting local, global or complete network data, and sources of network data including surveys, archives, and secondary data sources. The document also provides guidance on survey elements like name generators, response formats, and balancing breadth versus depth of network data collection given time constraints of surveys.
Share and analyze genomic data at scale, by Andy Petrella and Xavier Tordoir (Spark Summit)
This document discusses analyzing genomic data at scale using distributed machine learning tools like Spark, ADAM, and the Spark Notebook. It outlines challenges with genomic data like its large size and need for distributed teams in research projects. The document proposes sharing data, processes, and results more efficiently through tools like Shar3 that can streamline the data analysis lifecycle and allow distributed collaboration on genomic research projects and datasets.
This document provides an introduction to data mining concepts and techniques. It discusses why data mining is needed due to the massive growth of data, defines data mining as the extraction of patterns from large data sets, and outlines the data mining process. A variety of data types that can be mined are described, including relational, transactional, time-series, text and web data. The document also covers major data mining functionalities like classification, clustering, association rule mining and trend analysis. Top 10 popular data mining algorithms are listed.
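Of the functionalities listed, association rule mining is easy to sketch concretely: a rule a -> b is kept when support (the fraction of transactions containing both items) and confidence (support of the pair divided by support of the antecedent) clear chosen thresholds. This brute-force pairwise version is only a teaching sketch, not the Apriori algorithm; names and thresholds are illustrative.

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_conf=0.7):
    """Enumerate single-item rules x -> y over a list of transaction sets.
    support(X) = fraction of transactions containing X;
    confidence(x -> y) = support({x, y}) / support({x})."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = set().union(*transactions)
    rules = []
    for a, b in combinations(sorted(items), 2):
        for x, y in ((a, b), (b, a)):
            s = support({x, y})
            if s >= min_support and s / support({x}) >= min_conf:
                rules.append((x, y, s, s / support({x})))
    return rules
```

For example, if "milk" and "bread" co-occur in three of four baskets and "bread" never appears without "milk", the rule bread -> milk has support 0.75 and confidence 1.0.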
This document discusses challenges with the current scientific publishing system and proposes a vision for next generation scientific publishing (NGSP). Some key problems include retractions due to misconduct, lack of reproducibility, and non-reusable data and methods. NGSP would feature transparent and computable data and methods, open annotation of narratives and objects, and no restrictions on text mining or remixing. It would move information more quickly and allow verification through an open, service-oriented system without walled gardens. Taking NGSP forward will require collaboration across stakeholders in research communications.
This document discusses issues related to science research data. It notes that practices in science research drive institutional approaches to supporting research. The data lifecycle is discussed, including data management planning, storage, publishing, and more. Challenges with science data are also addressed, such as reproducibility and sharing practices. New tools and initiatives are emerging to help address these challenges, including crowd-funding of science, reproducibility initiatives, unique researcher identifiers, sharing code and data, and altmetrics.
This document discusses big data and analytics, outlining five trends and five research challenges. It begins by defining big data in terms of volume, velocity, variety, veracity and value. It then discusses the origins and evolution of big data, from early statistics to modern data science. Analytics is defined as using data to make empirically-derived, statistically valid decisions. The document outlines how hardware choices led to scaling out data processing across clusters rather than scaling up on single machines. It also provides examples of fields that generate huge volumes of data from billion dollar instruments like CERN's Large Hadron Collider and genomic sequencing facilities.
Climate Impact of Software Testing at Nordic Testing Days (by Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help mitigate climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Securing your Kubernetes cluster: a step-by-step guide to success! (by KatiaHIMEUR1)
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features trade security for convenience and capability. This best-practices guide outlines steps users can take to better protect personal devices and information.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Sensors1(1)
1. Sensors, networks, and massive data
Michael W. Mahoney
Stanford University
May 2012
(For more info, see http://cs.stanford.edu/people/mmahoney/ or Google “Michael Mahoney”)
2. Lots of types of “sensors”
Examples:
• Physical/environmental: temperature, air quality, oil, etc.
• Consumer: RFID chips, SmartPhone, Store Video, etc.
• Health care: Patient Records, Images & Surgery Videos, etc.
• Financial: Transactions for regulations, HFT, etc.
• Internet/e-commerce: clicks, email, etc. for user modeling, etc.
• Astronomical/HEP: images, experiments, etc.
Common theme: easy to generate A LOT of data
Questions:
• What are the similarities/differences in terms of funding drivers, customer
demands, questions of interest, time sensitivity, etc. for “sensing”
in these different applications?
• What can we learn from one area and apply to another area?
3. BIG data??? MASSIVE data????
NYT, Feb 11, 2012: “The Age of Big Data”
• “What is Big Data? A meme and a marketing term, for sure, but also
shorthand for advancing trends in technology that open the door to a new
approach to understanding the world and making decisions. …”
Why are big data big?
• Generate data at different places/times and different resolutions
• Factor of 10 more data is not just more data, but different data
4. BIG data??? MASSIVE data????
MASSIVE data:
• Internet, Customer Transactions, Astronomy/HEP = “Petascale”
• One Petabyte = watching 20 years of movies (HD) = listening to 20,000
years of MP3 (128 kbits/sec) = way too much to browse or comprehend
massive data:
• 105
people typed at 106
DNA SNPs; 106
or 109
node social network; etc.
In either case, main issues:
• Memory management issues, e.g., push computation to the data
• Hard to answer even basic questions about what data “looks like”
6. Algorithmic vs. Statistical Perspectives
Computer Scientists
• Data: are a record of everything that happened.
• Goal: process the data to find interesting patterns and associations.
• Methodology: Develop approximation algorithms under different
models of data access since the goal is typically computationally hard.
Statisticians (and Natural Scientists)
• Data: are a particular random instantiation of an underlying process
describing unobserved patterns in the world.
• Goal: is to extract information about the world from noisy data.
• Methodology: Make inferences (perhaps about unseen events) by
positing a model that describes the random variability of the data
around the deterministic model.
Lambert (2000), Mahoney (2010)
7. Thinking about large-scale data
Data generation is the modern version of the microscope/telescope:
• See things we couldn't see before: e.g., movement of people, clicks and
interests; tracking of packages; fine-scale measurements of temperature,
chemicals, etc.
• Those inventions ushered in new scientific eras, new understandings of
the world, and new technologies
Easy things become hard and hard things become easy:
• Easier to see the other side of the universe than the bottom of the ocean
• Means, sums, medians, and correlations are easy with small data
Our ability to generate data far exceeds our
ability to extract insight from data.
8. Many challenges ...
• Tradeoffs between prediction & understanding
• Tradeoffs between computation & communication,
• Balancing heat dissipation & energy requirements
• Scalable, interactive, & inferential analytics
• Temporal constraints in real-time applications
• Understanding “structure” and “noise” at large-scale (*)
• Even meaningfully answering “What does the data look like?”
9. Micro-markets in sponsored search
Setting: ~1.4 million advertisers bidding on ~10 million keywords, with
overlapping topical micro-markets (gambling, sports, movies, media,
sports videos, ...).
Example question: What is the CTR and advertiser ROI of sports gambling
keywords?
Goal: Find isolated markets/clusters (in an advertiser-bidded phrase bipartite graph)
with sufficient money/clicks with sufficient coherence.
Ques: Is this even possible?
10. What about sensors?
Vector space model - analogous to “bag-of-words” model for documents/terms.
• Each sensor is a “document,” a vector in a high-dimensional Euclidean space
• Each measurement is a “term”, describing the elements of that vector
• (Advertisers and bidded-phrases--and many other things--are also analogous.)
Can also define sensor-measurement graphs :
• Sensors are nodes, and edges are between sensors with similar measurements
The data form an m × n matrix A: m documents (sensors) by n terms
(measurements), where A_ij = frequency of the j-th term in the i-th document
(i.e., the value of the j-th measurement at the i-th sensor).
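The vector-space setup above can be sketched in a few lines. This is a minimal illustration, not from the slides: the sensor readings and the similarity threshold are invented for demonstration.

```python
import numpy as np

# m sensors ("documents") x n measurements ("terms"):
# A[i, j] = value of the j-th measurement at the i-th sensor.
A = np.array([
    [20.1, 0.30, 101.0],   # sensor 0
    [20.3, 0.31, 100.8],   # sensor 1 (similar to sensor 0)
    [35.0, 0.90,  80.2],   # sensor 2 (an outlier)
])

# Sensor-measurement graph: connect sensors whose measurement
# vectors are similar (cosine similarity above a chosen threshold).
norms = np.linalg.norm(A, axis=1, keepdims=True)
similarity = (A / norms) @ (A / norms).T

threshold = 0.99
edges = [(i, j) for i in range(len(A)) for j in range(i + 1, len(A))
         if similarity[i, j] > threshold]
print(edges)  # sensors 0 and 1 are linked; sensor 2 stays isolated
```

The same construction applies to advertisers and bidded phrases: rows are advertisers, columns are phrases, and entries are bids or clicks.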
11. Cluster-quality Score: Conductance
How cluster-like is a set of nodes S?
Idea: balance the “boundary” of the cluster with the “volume” of the cluster.
A natural, intuitive measure: conductance (normalized cut)
φ(S) ≈ # edges cut / # edges inside
Small φ(S) corresponds to better clusters of nodes
12. Graph partitioning
A family of combinatorial optimization problems - want to
partition a graph’s nodes into two sets s.t.:
• Not much edge weight across the cut (cut quality)
• Both sides contain a lot of nodes
Standard formalizations of the bi-criterion are NP-hard!
Approximation algorithms:
• Spectral methods* - (compute eigenvectors)
• Local improvement - (important in practice)
• Multi-resolution - (important in practice)
• Flow-based methods* - (mincut-maxflow)
* comes with strong underlying theory to guide heuristics
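As a rough sketch of the spectral approach, one can split a graph by the sign of the Fiedler vector, the eigenvector of the second-smallest eigenvalue of the graph Laplacian. The toy graph here is invented for illustration:

```python
import numpy as np

# Toy graph: two triangles joined by a single "bridge" edge.
edges = [(0, 1), (1, 2), (0, 2),
         (3, 4), (4, 5), (3, 5),
         (2, 3)]
n = 6

# Build the adjacency matrix and (unnormalized) Laplacian L = D - A.
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Fiedler vector: eigenvector for the second-smallest eigenvalue
# (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]

# Partition nodes by sign; the bridge edge is the natural cut.
side = [i for i in range(n) if fiedler[i] < 0]
other = [i for i in range(n) if fiedler[i] >= 0]
print(sorted(side), sorted(other))  # the two triangles separate
```

The sign of the eigenvector is arbitrary, so either triangle may come out as `side`; in practice one sweeps a threshold along the sorted eigenvector entries and keeps the cut with the best conductance.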
13. Comparison of “spectral” versus “flow”
Spectral:
• Compute an eigenvector
• “Quadratic” worst-case bounds
• Worst case achieved on “long stringy” graphs
• Embeds you on a line (or in K_n)
Flow:
• Solve an LP
• O(log n) worst-case bounds
• Worst case achieved on expanders
• Embeds you in L1
Two methods:
• Complementary strengths and weaknesses
• What we compute will depend on approximation
algorithm as well as objective function.
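The flow side of the comparison rests on mincut-maxflow duality. A bare-bones Edmonds–Karp sketch (graph and terminal nodes invented for illustration) shows how a single bridge edge is found as the minimum s-t cut:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp: max s-t flow equals the minimum s-t cut value."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return total, flow
        # Augment along the path by its bottleneck residual capacity.
        bottleneck = float("inf")
        v = t
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, capacity[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck

# Two triangles joined by one bridge edge (2, 3); unit capacities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
cap = [[0] * n for _ in range(n)]
for u, v in edges:
    cap[u][v] = cap[v][u] = 1
value, _ = max_flow(cap, 0, 5)
print(value)  # 1: the min cut is the single bridge edge
```

On an expander this routine would need many augmenting paths, which is exactly where the flow method's worst case bites, complementing the spectral method's weakness on long stringy graphs.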
14. Analogy: What does a protein look like?
Experimental Procedure:
• Generate a bunch of output data by using
the unseen object to filter a known input
signal.
• Reconstruct the unseen object given the
output signal and what we know about the
artifactual properties of the input signal.
Three possible representations (all-atom;
backbone; and solvent-accessible
surface) of the three-dimensional
structure of the protein triose
phosphate isomerase.
17. Typical example of our findings
General relativity collaboration network
(pretty small: 4,158 nodes, 13,422 edges)
(NCP plot: community score vs. community size)
Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
18. Large Social and Information Networks
LiveJournal Epinions
Focus on the red curves (local spectral algorithm); the blue (Metis+Flow), green (bag of whiskers), and black (randomly rewired network) curves are shown for consistency and cross-validation.
Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
19. Interpretation: “Whiskers” and the
“core” of large informatics graphs
• “Whiskers”: maximal sub-graphs that can be detached from the network by
removing a single edge; they contain 40% of the nodes and 20% of the edges
• “Core”: the rest of the graph, i.e., the 2-edge-connected core
• The global minimum of the NCPP (network community profile plot) is a whisker
• BUT the core itself has nested whisker-core structure
Leskovec, Lang, Dasgupta, and Mahoney (WWW 2008, 2010 & IM 2009)
20. Local “structure” and global “noise”
Many (most/all?) large informatics graphs (& massive data in general?)
• have local structure that is meaningfully geometric/low-dimensional
• does not have analogous meaningful global structure
Intuitive example:
• What does the graph of you and your 10^2 closest Facebook friends “look like”?
• What does the graph of you and your 10^5 closest Facebook friends “look like”?
21. Many lessons ...
This is problematic for MANY things people want to do:
• statistical analysis that relies on asymptotic limits
• recursive clustering algorithms
• analysts who want a few meaningful clusters
More data need not be better if you:
• don’t have control over the noise
• want “islands of insight” in the “sea of data”
How does this manifest itself in your “sensor” application?
• Needles in haystack; correlations; time series -- “scientific” apps
• Historically, CS & database apps did more summaries & aggregates
22. Big changes in the past ... and future
Consider the creation of:
• Modern Physics
• Computer Science
• Molecular Biology
These were driven by new measurement techniques and
technological advances, but they led to:
• big new (academic and applied) questions
• new perspectives on the world
• lots of downstream applications
We are in the middle of a similarly big shift!
• OR and Management Science
• Transistors and Microelectronics
• Biotechnology
23. Conclusions
HUGE range of “sensors” are generating A LOT of data:
• will lead to a very different world in many ways
Large-scale data are very different than small-scale data.
• Easy things become hard, and hard things become easy
• Types of questions that are meaningful to ask are different
• Structure, noise, etc. properties are often deeply counterintuitive
Different applications are driven by different considerations
• next-user-interaction, qualitative insight, failure modes, false
positives versus false negatives, time sensitivity, etc.
Algorithms can compute answers to known questions
• but algorithms can also be used as “experimental probes” of the data
to form questions!
24. MMDS Workshop on
“Algorithms for Modern Massive Data Sets”
(http://mmds.stanford.edu)
at Stanford University, July 10-13, 2012
Objectives:
- Address algorithmic, statistical, and mathematical challenges in modern statistical
data analysis.
- Explore novel techniques for modeling and analyzing massive, high-dimensional, and
nonlinearly-structured data.
- Bring together computer scientists, statisticians, mathematicians, and data analysis
practitioners to promote cross-fertilization of ideas.
Organizers: M. W. Mahoney, A. Shkolnik, G. Carlsson, and P. Drineas.
Registration is available now!