This presentation was provided by Jim Hahn of The University of Pennsylvania, during the NISO event "Transforming Search: What the Information Community Can and Should Build." The virtual conference was held on August 26, 2020.
Use of Open Data in Hong Kong (LegCo 2014), by Sammy Fung
Presentation on the use of open data in Hong Kong, given to the Legislative Council Secretariat. The content is drawn from my presentations at startmeup 2013 and the opendatahk meetup.
The paper describes the work being conducted in the Cross-institutional Authority Collaboration (Institutionenübergreifende Integration von Normdaten, IN2N) project. This pilot project, executed in cooperation with the German National Library and the German Film Institute, aims to establish new collaboration models to improve cross-domain authority maintenance. The paper outlines applied strategies for providing a shared infrastructure as well as workflows for exchanging data about persons; interface enhancements permitting the exploitation of innovative web approaches; and cross-institutional data search and representation solutions. Furthermore, we discuss specific boundary conditions, such as disparities in the level of data granularity, for an interoperable cataloguing environment.
Embedded Files: Risks, Challenges and Options, by Tim Allison
This document discusses the risks, challenges, and options related to embedded files. It notes that every file may contain embedded files and attachments, and systems need to be prepared to handle them. Specific file formats like PDF, Microsoft Office files, TIFF, HTML and more can all contain embedded files, including full file paths and sensitive data. Proper handling requires awareness of incremental updates, metadata, and going beyond surface-level parsing to fully extract all content.
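As a minimal illustration of going beyond surface-level parsing (a sketch, not code from the presentation): modern Office documents are ZIP containers, so embedded objects can be enumerated with nothing more than the Python standard library. The file name report.docx is a hypothetical example; production systems would use a dedicated extractor such as Apache Tika and recurse into the extracted objects themselves.

import zipfile

# Modern Office formats (.docx/.xlsx/.pptx) are ZIP containers; embedded
# OLE objects and attachments usually sit under a */embeddings/ folder.
# "report.docx" is a hypothetical example file.
with zipfile.ZipFile("report.docx") as container:
    embedded = [name for name in container.namelist()
                if "/embeddings/" in name.lower()]
    for name in embedded:
        print(name, container.getinfo(name).file_size, "bytes")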
Ultra-Broadband and Peta-Scale Collaboration Opportunities Between UC and Canada, by Larry Smarr
Summary talk from the ICT/Broadband Internet Session of the Canada - California Strategic Innovation Partnership Summit, Vancouver, Canada, June 6, 2012.
Title: Ultra-Broadband and Peta-Scale Collaboration Opportunities Between UC and Canada
NIST Special Publication 500-293: US Government Cloud Computing Technology R..., by David Sweigert
This document presents the US Government Cloud Computing Technology Roadmap. It identifies 10 high-priority requirements that must be met for US government agencies to further adopt cloud computing. These requirements relate to standards, security, interoperability, portability, and other areas. The roadmap is intended to define and communicate what is needed to advance cloud computing technology and adoption. It incorporates input from public workshops, working groups, and comments. The overall goal is to help accelerate secure and effective cloud adoption within the US government.
This document provides a summary of a presentation on government big data. It introduces the speakers and discusses topics like on-premises vs hosted services, the Recovery Accountability and Transparency Board's (RATB) cloud services, and EMC Isilon's scale-out storage solutions for big data in the federal sector. Specific solutions mentioned include RATB websites, their logical system architecture in the cloud, data governance, analytics as a service, and examples of federal agencies using Isilon for healthcare, life sciences, and other big data applications.
STI 2022 - Generating large-scale network analyses of scientific landscapes i..., by Michele Pasin
The growth of large, programmatically accessible bibliometrics databases presents new opportunities for complex analyses of publication metadata. In addition to providing a wealth of information about authors and institutions, databases such as those provided by Dimensions also provide conceptual information and links to entities such as grants, funders and patents. However, data is not the only challenge in evaluating patterns in scholarly work: these large datasets can be challenging to integrate, particularly for those unfamiliar with the complex schemas necessary for accommodating such heterogeneous information, and those most comfortable with data mining may not be as experienced in data visualisation. Here, we present an open-source Python library that streamlines the process of accessing and diagramming subsets of the Dimensions on Google BigQuery database and demonstrate its use on the freely available Dimensions COVID-19 dataset. We are optimistic that this tool will expand access to this valuable information by streamlining what would otherwise be multiple complex technical tasks, enabling more researchers to examine patterns in research focus and collaboration over time.
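As a hedged sketch of what such programmatic access looks like (not the library's own API, which the paper describes), the snippet below queries the public Dimensions COVID-19 dataset directly with the google-cloud-bigquery client; the dataset path and field names are assumptions and may differ from the actual BigQuery schema.

from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud credentials and project
sql = """
SELECT year, COUNT(*) AS n_publications
FROM `covid-19-dimensions-ai.data.publications`   -- assumed public dataset path
GROUP BY year
ORDER BY year
"""
df = client.query(sql).to_dataframe()  # requires pandas (and db-dtypes) installed
print(df.head())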
The document discusses several US grid projects including campus and regional grids like Purdue and UCLA that provide tens of thousands of CPUs and petabytes of storage. It describes national grids like TeraGrid and Open Science Grid that provide over a petaflop of computing power through resource sharing agreements. It outlines specific communities and projects using these grids for sciences like high energy physics, astronomy, biosciences, and earthquake modeling through the Southern California Earthquake Center. Software providers and toolkits that enable these grids are also mentioned like Globus, Virtual Data Toolkit, and services like Introduce.
The document discusses tools for analyzing dark data and dark matter, including DeepDive and Apache Spark. DeepDive is highlighted as a system that helps extract value from dark data by creating structured data from unstructured sources and integrating it into existing databases. It allows for sophisticated relationships and inferences about entities. Apache Spark is also summarized as providing high-level abstractions for stream processing, graph analytics, and machine learning on big data.
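For readers who have not used Spark's high-level abstractions, here is a minimal, self-contained PySpark sketch (not code from the document); the input file events.json and its category column are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session; "events.json" and the "category" column are placeholders.
spark = SparkSession.builder.appName("dark-data-sketch").getOrCreate()
df = spark.read.json("events.json")
counts = df.groupBy("category").count().orderBy(F.desc("count"))
counts.show()
spark.stop()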
- Calit2 is a research institute at UCSD and UCI that opened two new buildings housing over 1000 researchers working on projects like nanotechnology and virtual reality.
- Calit2 has over $100 million in capital and partnerships with 93 industry partners generating over $63 million in funding.
- The OptIPuter project is a $13.5 million NSF-funded project led by Calit2 to create high-resolution portals over dedicated optical networks for sharing science data globally.
This document discusses duplicate and near duplicate detection at scale. It begins with an outline of the topics to be covered, including a search system assessment, text extraction assessment, duplicates and near duplicates case study, and exploration of near duplicates using minhash. It then discusses experiments with detecting duplicates and near duplicates in a web index of over 12.5 million documents. The results found over 2.8 million documents had the same text digest, and different techniques like text profiles and minhash were able to further group documents. However, minhash was found to be prohibitively slow at this scale. The document explores reasons for this and concludes more work is needed to efficiently detect near duplicates at large scale.
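To make the minhash idea concrete, here is a small sketch built on the open-source datasketch library (one common implementation, not necessarily what the case study used); the sample documents are placeholders.

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # Word shingles for brevity; real systems typically normalize and use n-grams.
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different piece of text about near duplicates",
}
sigs = {key: minhash(text) for key, text in docs.items()}

lsh = MinHashLSH(threshold=0.6, num_perm=128)
for key, sig in sigs.items():
    lsh.insert(key, sig)

print(lsh.query(sigs["a"]))          # candidate near-duplicates of document "a"
print(sigs["a"].jaccard(sigs["b"]))  # estimated Jaccard similarity

Note that lsh.query returns the probe key itself along with its candidates, and that LSH trades a little accuracy for avoiding the all-pairs comparisons that make exact approaches slow at the scale described above.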
GigaByte Chief Editor Scott Edmunds presents on how to prepare a data paper for the TDR- and WHO-sponsored call for data papers describing datasets on vectors of human diseases, launched in November 2021. Presented at the GBIF webinar on 25 January 2022 and aimed at authors interested in submitting a manuscript to the series.
Evaluating Text Extraction at Scale: A case study from Apache Tika, by Tim Allison
The document discusses Apache Tika, a framework for extracting text and metadata from various file formats. It describes some ways Tika parsing can fail, such as missing or garbled text. The author introduces the tika-eval module for comparing Tika extractions and identifying issues. Public corpora are being used to test Tika at scale and find regressions. The goal is improving Tika's robustness through large-scale evaluation.
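As a toy illustration of the kind of comparison tika-eval automates (this is not the tika-eval code itself), the sketch below profiles two extractions of the same document and reports token counts and overlap, where a low overlap ratio hints at missing or garbled text.

from collections import Counter

def profile(text):
    return Counter(text.lower().split())

def compare(extraction_a, extraction_b):
    a, b = profile(extraction_a), profile(extraction_b)
    common = sum((a & b).values())                      # tokens shared by both extractions
    total = max(sum(a.values()), sum(b.values()), 1)
    return {
        "tokens_a": sum(a.values()),
        "tokens_b": sum(b.values()),
        "overlap_ratio": common / total,
    }

print(compare("The quick brown fox", "Th e qu ick br own fo x"))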
How to expand the Galaxy from genes to Earth in six simple steps (and live sm..., by Raffaele Montella
FACE-IT is an effort to develop a new IT infrastructure to accelerate existing disciplinary research and enable information transfer among traditionally separate fields. At present, finding data and processing it into usable form can dominate research efforts. By providing ready access to not only data but also the software tools used to process it for specific uses (e.g., climate impact and economic model inputs), FACE-IT allows researchers to concentrate their efforts on analysis. Lowering barriers to data access allows researchers to stretch in new directions and to learn from and respond to the needs of other fields. FACE-IT builds on the Globus Galaxies platform, which has been developed over the past several years at the University of Chicago. FACE-IT also benefits from substantial software development undertaken by the communities who have developed most of the domain-specific tools required to populate FACE-IT with useful capabilities. The FACE-IT Galaxy manages earth system datatypes (such as NetCDF), new tool parameters (dates, map, opendap), aggregated datatypes (RAFT), service providers, and map visualizers.
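For readers unfamiliar with the NetCDF datatype mentioned above, here is a minimal sketch of opening such a file with the xarray library (a common choice for earth system data, not necessarily what FACE-IT itself uses); the file name, variable name, and dimension names are assumptions.

import xarray as xr

# "tasmax_monthly.nc", the "tasmax" variable, and the lat/lon dimensions are
# hypothetical examples of a gridded earth system dataset.
ds = xr.open_dataset("tasmax_monthly.nc")
print(ds)                                               # dimensions, coordinates, variables
area_mean = ds["tasmax"].mean(dim=["lat", "lon"])       # collapse to a time series
area_mean.to_dataframe().to_csv("tasmax_area_mean.csv")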
The title of this talk is a crass attempt to be catchy and topical, by referring to the recent victory of Watson in Jeopardy.
My point (perhaps confusingly) is not that new computer capabilities are a bad thing. On the contrary, these capabilities represent a tremendous opportunity for science. The challenge that I speak to is how we leverage these capabilities without computers and computation overwhelming the research community in terms of both human and financial resources. The solution, I suggest, is to get computation out of the lab—to outsource it to third party providers.
Abstract follows:
We have made much progress over the past decade toward effective distributed cyberinfrastructure. In big-science fields such as high energy physics, astronomy, and climate, thousands benefit daily from tools that enable the distributed management and analysis of vast quantities of data. But we now face a far greater challenge. Exploding data volumes and new research methodologies mean that many more--ultimately most?--researchers will soon require similar capabilities. How can we possibly supply information technology (IT) at this scale, given constrained budgets? Must every lab become filled with computers, and every researcher an IT specialist?
I propose that the answer is to take a leaf from industry, which is slashing both the costs and complexity of consumer and business IT by moving it out of homes and offices to so-called cloud providers. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity, empowering investigators with new capabilities and freeing them to focus on their research.
I describe work we are doing to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date, and suggest a path towards large-scale delivery of these capabilities. I also suggest that these developments are part of a larger "revolution in scientific affairs," as profound in its implications as the much-discussed "revolution in military affairs" resulting from more capable, low-cost IT. I conclude with some thoughts on how researchers, educators, and institutions may want to prepare for this revolution.
Guest Lecture: Introduction to Big Data at Indian Institute of Technology, by Nishant Gandhi
This document provides an introduction to big data, including definitions of big data and why it is important. It discusses characteristics of big data like volume, velocity, variety and veracity. It provides examples of big data applications in various industries like GE, Boeing, social media, finance, CERN, journalism, politics and more. It also introduces NoSQL and the CAP theorem, and concludes that big data is changing business and technology by enabling new insights from data to reduce costs and optimize operations.
Red Hat, Green Energy Corp & Magpie - Open Source Smart Grid Plataform - ..., by impodgirl
The Pacific Northwest smart grid demonstration project led by Battelle Memorial Institute aims to validate the costs and benefits of smart grid technology. The $88.8 million project involves 12 utilities across 5 northwest states and will test technologies like dynamic pricing signals and demand response. It seeks to better integrate renewable energy and improve system efficiency over its 5-year duration. Red Hat is also entering the smart grid industry through a partnership with Grid Exchange Corporation to develop an open-source smart grid software integration platform applying standards like ICCP.
1105 Media - 2014 Core Market Capabilities Presentation, by Christina Langer
This document provides information about marketing programs to reach government technology decision makers. It discusses magazine advertising, inserts and outserts which provide direct delivery of marketing messages to readers' desks. It also outlines market capture programs for 2014 including Download Reports, Strategic Reports, and custom programs. Download Reports involve in-depth research on technology trends converted into content across multiple platforms over 6-9 months, providing sponsors exclusivity, ads, banners and leads. The programs aim to engage audiences and position sponsors as thought leaders.
The document provides information about David Cieslak and his company RKL eSolutions. It introduces Cieslak as the Chief Cloud Officer of RKL eSolutions, a subsidiary of RKL LLP that provides ERP sales and consulting. It then lists Cieslak's experience and accomplishments in the accounting industry. The rest of the document is an agenda for a presentation on technology trends that will discuss topics like AI, new devices, and connectivity standards.
The document provides an overview of big data analytics. It defines big data as high-volume, high-velocity, and high-variety information assets that require cost-effective and innovative forms of processing for insights and decision making. Big data is characterized by the 3Vs - volume, velocity, and variety. The emergence of big data is driven by the massive amount of data now being generated and stored, availability of open source tools, and commodity hardware. The course will cover Apache Hadoop, Apache Spark, streaming analytics, visualization, linked data analysis, and big data systems and AI solutions.
Slides from a webinar on webware presented by Mike Qaissaunee and Gordon F. Snyder, Jr. (both of nctt.org). The webinar was hosted by MATEC NetWorks (http://www.matecnetworks.org/) and delivered via Elluminate. Visit MATEC NetWorks to watch the webinar.
Hadoop World 2011: The Hadoop Award for Government Excellence - Bob Gourley -..., by Cloudera, Inc.
Federal, State and Local governments and the development community surrounding them are busy creating solutions leveraging the Apache Foundation Hadoop capabilities. This session will highlight the top five solutions selected by an all-star panel of judges. Who will take home the coveted Hadoop Award for Government Excellence (Haggie)? Nominations for Haggies are being accepted now at http://CTOlabs.com.
A talk at the RPI-NSF Workshop on Multiscale Modeling of Complex Data, September 12, 2011, Troy NY, USA.
We have made much progress over the past decade toward effectively harnessing the collective power of IT resources distributed across the globe. In fields such as high-energy physics, astronomy, and climate, thousands benefit daily from tools that manage and analyze large quantities of data produced and consumed by large collaborative teams. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that far more--ultimately most?--researchers will soon require capabilities not so different from those used by these big-science teams. How is the general population of researchers and institutions to meet these needs? Must every lab be filled with computers loaded with sophisticated software, and every researcher become an information technology (IT) specialist? Can we possibly afford to equip our labs in this way, and where would we find the experts to operate them?

Consumers and businesses face similar challenges, and industry has responded by moving IT out of homes and offices to so-called cloud providers (e.g., GMail, Google Docs, Salesforce), slashing costs and complexity. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity. More importantly, we can free researchers from the burden of managing IT, giving them back their time to focus on research and empowering them to go beyond the scope of what was previously possible.

I describe work we are doing at the Computation Institute to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date and suggest a path towards large-scale delivery of these capabilities.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
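As one concrete point of comparison (a sketch under the assumption of a Lambda proxy integration behind API Gateway or a function URL, not a recommendation for any particular workload), a minimal Python handler that serves a web page looks like this:

import json

def handler(event, context):
    # Minimal HTML response in the proxy-integration format Lambda expects.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/html"},
        "body": "<h1>Hello from Lambda</h1>",
    }

if __name__ == "__main__":
    # Quick local smoke test; in AWS, the runtime supplies event and context.
    print(json.dumps(handler({}, None), indent=2))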
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf, by Chart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Letter and Document Automation for Bonterra Impact Management (fka Social Sol..., by Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on automated letter generation for Bonterra Impact Management using Google Workspace or Microsoft 365.
Interested in deploying letter generation automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
TrustArc Webinar - 2024 Global Privacy Survey, by TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers, by akankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Fueling AI with Great Data with Airbyte Webinar, by Zilliz
This talk will focus on how to collect data from a variety of sources, leverage it for RAG and other GenAI use cases, and finally chart your course to production.
Dive into the realm of operating systems (OS) with Pravash Chandra Das, a seasoned Digital Forensic Analyst, as your guide. 🚀 This comprehensive presentation illuminates the core concepts, types, and evolution of OS, essential for understanding modern computing landscapes.
Beginning with the foundational definition, Das clarifies the pivotal role of OS as system software orchestrating hardware resources, software applications, and user interactions. Through succinct descriptions, he delineates the diverse types of OS, from single-user, single-task environments like early MS-DOS iterations, to multi-user, multi-tasking systems exemplified by modern Linux distributions.
Crucial components like the kernel and shell are dissected, highlighting their indispensable functions in resource management and user interface interaction. Das elucidates how the kernel acts as the central nervous system, orchestrating process scheduling, memory allocation, and device management. Meanwhile, the shell serves as the gateway for user commands, bridging the gap between human input and machine execution. 💻
The narrative then shifts to a captivating exploration of prominent desktop OSs, Windows, macOS, and Linux. Windows, with its globally ubiquitous presence and user-friendly interface, emerges as a cornerstone in personal computing history. macOS, lauded for its sleek design and seamless integration with Apple's ecosystem, stands as a beacon of stability and creativity. Linux, an open-source marvel, offers unparalleled flexibility and security, revolutionizing the computing landscape. 🖥️
Moving to the realm of mobile devices, Das unravels the dominance of Android and iOS. Android's open-source ethos fosters a vibrant ecosystem of customization and innovation, while iOS boasts a seamless user experience and robust security infrastructure. Meanwhile, discontinued platforms like Symbian and Palm OS evoke nostalgia for their pioneering roles in the smartphone revolution.
The journey concludes with a reflection on the ever-evolving landscape of OS, underscored by the emergence of real-time operating systems (RTOS) and the persistent quest for innovation and efficiency. As technology continues to shape our world, understanding the foundations and evolution of operating systems remains paramount. Join Pravash Chandra Das on this illuminating journey through the heart of computing. 🌟
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
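One practical guardrail when letting AI generate or enrich XML is to validate the output mechanically before accepting it. The sketch below, which is an illustration rather than part of the presentation, validates a generated document against an XSD using lxml; the file names are placeholders, and Schematron rules could be layered on in a similar way.

from lxml import etree

# "generated.xml" and "books.xsd" are hypothetical file names.
schema = etree.XMLSchema(etree.parse("books.xsd"))
doc = etree.parse("generated.xml")

if schema.validate(doc):
    print("AI-generated XML is valid against the schema")
else:
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")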
Monitoring and Managing Anomaly Detection on OpenShift.pdf, by Tosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system (a minimal Python sketch follows this list).
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
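As a small illustration of topic 8, the sketch below exposes an anomaly score as a Prometheus gauge from a Python process; the metric name and the scoring function are placeholders, and in the tutorial's setup Prometheus on OpenShift would scrape this endpoint.

import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric; a real detector would compute the score from sensor data.
ANOMALY_SCORE = Gauge("edge_anomaly_score", "Latest anomaly score from the edge model")

def score_latest_window():
    return random.random()  # placeholder for a real model inference

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://<host>:8000/metrics
    while True:
        ANOMALY_SCORE.set(score_latest_window())
        time.sleep(5)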
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx, by SitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Generating privacy-protected synthetic data using Secludy and Milvus, by Zilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Programming Foundation Models with DSPy - Meetup Slides, by Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.