Single-slide PowerPoint animation showing the total size of the journal literature in the context of the data produced by the LHC from 2009-2013. Needs to be viewed in Slideshow mode.
Kuhan Wang developed a machine learning pipeline to analyze textual features on content URLs and optimize advertisement placement for a company. The pipeline involved scraping URL data, processing text, modeling feature importance, and extracting top keywords. It was delivered to the company in Python code. Wang's dissertation background involved searching for signatures of microscopic black holes and exotic gravity states using data from the Large Hadron Collider particle detector.
- The document describes a machine learning pipeline developed by Insight Data Science to analyze textual features on content URLs and predict user engagement for optimal advertisement placement.
- Keywords were extracted from URLs and used to build a logistic regression classification model to predict whether users would click on advertisements based on URL text.
- The model was validated on a held-out test set and achieved precision between 0.55 and 0.85 and recall between 0.4 and 0.85 when the data was randomly split 50/50 into training and test sets (see the pipeline sketch below).
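As a concrete illustration of the kind of pipeline the summary describes, here is a minimal sketch using scikit-learn. The toy URLs, labels, tokenizer, and model settings are illustrative assumptions, not details from the original work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for URL-derived text and binary click labels.
urls = [
    "news/sports/soccer-final-highlights",
    "shop/deals/discount-flights-travel",
    "tech/review/smartphone-camera-test",
    "lifestyle/recipes/easy-weeknight-dinner",
] * 25
clicked = [1, 0, 1, 0] * 25

# Random 50/50 split, as in the summary above.
X_train, X_test, y_train, y_test = train_test_split(
    urls, clicked, test_size=0.5, random_state=0
)

model = make_pipeline(
    TfidfVectorizer(token_pattern=r"[a-z]+"),  # crude URL tokenizer
    LogisticRegression(),
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))

# "Top keywords" here are the terms with the largest positive weights.
vec = model.named_steps["tfidfvectorizer"]
clf = model.named_steps["logisticregression"]
top = sorted(zip(clf.coef_[0], vec.get_feature_names_out()), reverse=True)[:5]
print("top keywords:", [term for _, term in top])
```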
The document discusses LHCb's use of RHEA and T-Systems cloud resources for Monte Carlo simulation jobs during the HNSciCloud pilot phase. Key points: LHCb used only the CPU resources for these jobs and had to manage RHEA and T-Systems as independent sites. Jobs were submitted via HTCondor CEs to provide flexibility and reduce overhead (a submission sketch follows below). On RHEA, LHCb had varying levels of resources and achieved better success rates than on T-Systems, where jobs suffered high failure rates due to lost heartbeats.
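For readers unfamiliar with HTCondor CEs, here is a minimal sketch of submitting a job to one with the htcondor Python bindings. The CE hostname, payload script, and settings are placeholders, and this is a bare-bones illustration rather than LHCb's actual pilot-based submission machinery.

```python
import htcondor

# Submit description for a job routed to a remote HTCondor CE via the
# grid universe; ce.example.org:9619 is a placeholder endpoint.
submit = htcondor.Submit({
    "universe": "grid",
    "grid_resource": "condor ce.example.org ce.example.org:9619",
    "executable": "run_mc_simulation.sh",  # hypothetical MC payload
    "output": "mc.$(Cluster).$(Process).out",
    "error": "mc.$(Cluster).$(Process).err",
    "log": "mc.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(submit, count=1)  # queue one Monte Carlo job
print("submitted cluster", result.cluster())
```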
Kuhan Wang developed a machine learning pipeline to analyze textual features on content URLs and optimize advertisement placement for user engagement. The pipeline involved scraping URL data, processing text, modeling feature importance, and extracting top keywords. It was delivered to a company and evaluated using precision-recall metrics from a logistic regression model trained on engagement data. Additionally, Wang's dissertation research involved using data from the Large Hadron Collider to search for signatures of microscopic black holes and constraints on theories of extra spatial dimensions.
Smart Scalable Feature Reduction with Random Forests with Erik Erlandson (Databricks)
Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.
In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data having many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data.
Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.
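The clustering-based approach in the talk is unsupervised; as a simpler stand-in, the sketch below shows the related supervised technique of pruning features by random forest importance scores, using stock Spark ML APIs. The telemetry columns and the importance threshold are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("rf-feature-reduction").getOrCreate()

# Hypothetical telemetry: two informative columns plus two noise columns.
rows = [(float(i % 2), float(i % 3), float(i * 7 % 5), float(i * 3 % 11),
         float(i % 2)) for i in range(200)]
cols = ["cpu_load", "mem_used", "noise_a", "noise_b", "label"]
df = spark.createDataFrame(rows, cols)

assembler = VectorAssembler(inputCols=cols[:-1], outputCol="features")
rf = RandomForestClassifier(labelCol="label", numTrees=50)
model = rf.fit(assembler.transform(df))

# Keep only the features the forest actually found informative.
keep = [c for c, imp in zip(cols[:-1], model.featureImportances.toArray())
        if imp > 0.05]
print("retained features:", keep)
```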
In this slidecast, Jason Stowe from Cycle Computing describes the company's recent record-breaking Petascale CycleCloud HPC production run.
"For this big workload, a 156,314-core CycleCloud behemoth spanning 8 AWS regions, totaling 1.21 petaFLOPS (RPeak, not RMax) of aggregate compute power, to simulate 205,000 materials, crunched 264 compute years in only 18 hours. Thanks to Cycle's software and Amazon's Spot Instances, a supercomputing environment worth $68M if you had bought it, ran 2.3 Million hours of material science, approximately 264 compute-years, of simulation in only 18 hours, cost only $33,000, or $0.16 per molecule."
Learn more: http://blog.cyclecomputing.com/2013/11/back-to-the-future-121-petaflopsrpeak-156000-core-cyclecloud-hpc-runs-264-years-of-materials-science.html
Watch the video presentation: http://wp.me/p3RLHQ-aO9
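The quoted figures are internally consistent, as a quick arithmetic check shows:

```python
# Sanity-check the quoted CycleCloud numbers (rounding accounts for
# the small discrepancies).
print(2.3e6 / (24 * 365))  # 2.3M core-hours ~= 262.6 years; quoted as ~264
print(156_314 * 18)        # ~2.81M core-hours of raw 18-hour capacity
print(33_000 / 205_000)    # ~$0.161 per material; quoted as $0.16
```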
2014.4 Journal of Literature and Art Studies (Doris Carly)
The document summarizes Charles Dickens' portrayal of self-damaging behavior in two female characters from his novel Bleak House: Lady Dedlock and Mademoiselle Hortense. It analyzes how their low self-esteem stems from various causes and manifests in self-imposed isolation, madness, deliberately dangerous acts, physical self-abuse, and destructive relationships with men. While Dickens did not intend to malign women, his depiction reflected the Victorian era's ambivalent attitudes towards strong female characters. The document aims to examine Dickens' exploration of self-damaging traits in women and to understand the psychological forces driving such behavior.
The document provides guidance on conducting a literature review. It explains that a literature review aims to convey the knowledge and facts previously established on a topic by summarizing, evaluating, and integrating primary sources. The review is conducted in five stages: annotating relevant sources, organizing sources thematically, additional reading, writing individual sections, and integrating all sections. A literature review should include an introduction defining the topic, a body summarizing and grouping sources thematically, and a conclusion evaluating the current state of research and identifying gaps.
Provenance in Support of the ANDS Four Transformations (Andrew Treloar)
The document discusses the Australian National Data Service (ANDS) and how it uses provenance information to support its four transformations of research data. ANDS aims to make Australian research data more discoverable, accessible, and reusable. It focuses on adding value to data through re-use rather than storing data itself. Provenance capture is important for managing data, connecting related data, improving discoverability, and enabling re-analysis. ANDS has funded projects involving provenance services and integration. Future work includes developing domain-specific extensions to the PROV-O standard and strengthening connections with the Research Data Alliance Interest Group on Provenance.
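As a small illustration of the kind of provenance capture discussed above, here is a sketch using the W3C PROV data model via the `prov` Python package. The namespace, entities, and activity are hypothetical; domain-specific PROV-O extensions of the sort ANDS envisages would add vocabulary on top of these basic relations.

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# Hypothetical lineage: a cleaning activity turns raw data into a dataset.
raw = doc.entity("ex:raw-sensor-data")
clean = doc.entity("ex:cleaned-dataset")
cleaning = doc.activity("ex:data-cleaning")
researcher = doc.agent("ex:researcher")

doc.used(cleaning, raw)                      # the activity consumed the input
doc.wasGeneratedBy(clean, cleaning)          # the output came from the activity
doc.wasAssociatedWith(cleaning, researcher)  # who ran it
doc.wasDerivedFrom(clean, raw)               # dataset-to-dataset lineage

print(doc.get_provn())                       # serialize in PROV-N notation
```

Relations like these support exactly the re-analysis and connection use cases the summary mentions: given the graph, one can walk from any published dataset back to its inputs and responsible agents.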
ANDS Applications Program: Building Tools to Facilitate Data Reuse (Andrew Treloar)
Presentation accompanying a talk on the ANDS Applications program at IDCC 2016. Discusses the outputs of the program, but also focuses on the sustainability of such eResearch tools.
Introductory talk for ANDS workshop on Institutional Repositories and data. The talk situates the topic within the field of scholarly communication before comparing the relative technical simplicity of running repositories of publications with the complexities that accompany a shift to data. The most-retweeted slide is the one viewing the response of repository managers to data through the lens of Elizabeth Kübler-Ross' stages of grieving.
Closing comments at the #iPres 2014 conference (Andrew Treloar)
This document summarizes the author's observations from attending the iPres 2014 conference. The author notes that some presentations focused on recreating existing infrastructure instead of building on what is already there. Several talks emphasized the importance of preserving data and processes to maintain the scholarly record. Overall, the conference provided useful reflections on digital preservation practice and experience, though some theoretical papers seemed detached from real-world challenges.
The document discusses changes in scholarly research and outputs that have implications for data archives. It notes that the research process itself is becoming more visible and dynamic as scholars use websites and tools to record and share their work. This results in a more extensive and heterogeneous scholarly record. However, many of these recording platforms are not designed for long-term archiving. Therefore, archives will need to develop new approaches to account for the characteristics of research recorded on the web, and to trigger the transfer of outputs from recording platforms to long-term archives.
The universe of identifiers and how ANDS is using them (Andrew Treloar)
Presentation on identifiers in general, and ANDS' approach to identifiers for objects and people in particular. Given at ODIP 3rd Workshop on August 7, 2014.
The document discusses adding value to researchers' data through the Australian National Data Service (ANDS). ANDS aims to transform unmanaged, disconnected data into structured collections that are managed, connected, findable, and reusable. It does this through nationally coordinated engagement with institutions and disciplines. The goal is to help researchers easily publish, discover, access and reuse research data.
The life-sciences as a pathfinder in data-intensive research practice (Andrew Treloar)
Presentation given at UQ Winterschool 2014. The advent of the Internet is bringing about fundamental changes in the ways that research is performed and communicated. These have been particularly driven by the growing importance of data, as well as the tools available to work with this data. This presentation will examine this shift, drawing on examples from the life-sciences, and try to make some predictions about the next five years.
Past, present, and future of scholarly technology and practices (Andrew Treloar)
Thoughts about the past, present, and future of scholarly technologies and scholarly practices. Based on work done with @hvdsomp at #DANS, as well as discussions with @scharnhorsta.
Talk given by @atreloar and @hvdsomp at workshop sponsored by http://dans.knaw.nl/ with title "Riding the Wave and the Scholarly Archive of the Future". NOTE: This reflects thinking in progress which may well change in the future.
Data Infrastructure and the Scholarly Ecosystem of the Future (Andrew Treloar)
Talk delivered at a forum at SURF in the Netherlands, under the hashtag #disef. The talk gives an overview of some current thinking about elements of the ecosystem for scholarship, along with slides on the Australian National Data Service (ands.org.au) and the Research Data Alliance (rd-alliance.org); these latter slides were used during a Q&A session as part of the talk.
Research data and the ANDS agenda in Australia (Andrew Treloar)
This document discusses research data and the agenda of the Australian National Data Service (ANDS) in Australia. ANDS was established in 2009 to enable Australian researchers to more easily publish, discover, access and reuse research data. It provides several national services and has funded over 200 projects. The document also outlines relevant national policies and ANDS's involvement in international organizations like the Research Data Alliance.
This document discusses how data is driving decisions in research. It notes that the amount of data being generated is growing exponentially and researchers are now in the data business. It outlines four transformations needed - from unmanaged to managed data, disconnected to connected data, invisible to findable data, and single-use to reusable data. National strategies in Australia are aiming to support these transformations through initiatives like the Australian National Data Service which provides resources and expertise to help researchers manage, connect, and enable reuse of research data.
Building on the Atlas (of Living Australia) (Andrew Treloar)
Presentation given at the Atlas of Living Australia Science Symposium 2013. Discusses the Australian National Data Service Applications program and two specific projects: Soils to Satellites (also involving TERN) and Edgar bird species distribution.
The document discusses ANDS' efforts to augment data discovery through repurposing DataCite metadata. It describes ANDS' goals of making data more findable, accessible, and reusable. It outlines a three stage plan to provide "See Also" suggestions for datasets: 1) internal suggestions, 2) suggestions from searching DataCite metadata, and 3) potentially integrating additional sources like the National Library of Australia. The "See Also" feature aims to support serendipity in discovery. Future work may include ranking searches and expanding the types of related results provided.
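Stage 2 of the plan amounts to querying DataCite's metadata for related records. A rough sketch of that idea against the current public DataCite REST API (which postdates the talk, so ANDS' own implementation surely differed) might look like this:

```python
import requests

def see_also(query, limit=5):
    """Return (title, DOI) pairs for DataCite records matching a query."""
    resp = requests.get(
        "https://api.datacite.org/dois",
        params={"query": query, "page[size]": limit},
        timeout=30,
    )
    resp.raise_for_status()
    results = []
    for record in resp.json()["data"]:
        attrs = record["attributes"]
        titles = attrs.get("titles") or [{}]
        results.append((titles[0].get("title", ""), attrs.get("doi", "")))
    return results

# Example: suggest datasets related to a hypothetical search.
for title, doi in see_also("coral reef temperature"):
    print(f"{title} (doi:{doi})")
```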
From Data to Data: One version of a History of Scholarly Communication (Andrew Treloar)
1) Scholarly communication has evolved from early written works and data to modern digital scholarship that generates vast amounts of data.
2) Issues with data preservation, accessibility, and selective publication have impacted the completeness of the evidence base over time.
3) As data-intensive research increases, standardization and data federation are needed to aggregate data from multiple sources and answer new questions.
4) Initiatives like institutional repositories, researcher workflows, and national programs aim to improve data sharing, access, and reuse to support new discoveries.
Data management: international challenges, national infrastructure, and insti... (Andrew Treloar)
This document discusses challenges and responses to data management from an Australian perspective. It outlines international challenges around inconvenient, imprisoned, invisible, and inaccessible data. It then discusses the importance of data reuse for efficiency, validation, and value. Two case studies on astronomy and cancer research demonstrate increased citations when data is publicly shared. The document also outlines Australia's national data service, ANDS, which aims to make data managed, connected, findable, and reusable. ANDS is building national data services and working with institutions to improve data management policies, capture, and metadata. Ongoing issues include balancing local vs national needs, sustainability, and encouraging data sharing cultures.
Journal literature size in the context of the LHC data
1. LHC output from 2009-2013 = 100 PB (www.symmetrymagazine.org/article/february-2013/achievement-unlocked-100-petabytes-of-data)
Journal Literature size in context…
@atreloar
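For scale, a rough back-of-envelope comparison along the lines of the slide. The article count and per-article size are outside assumptions (on the order of 50 million scholarly articles at roughly 1 MB each), not figures taken from the slide:

```python
# Compare LHC 2009-2013 output with a rough estimate of the entire
# journal literature; both inputs below the LHC figure are assumptions.
lhc_bytes = 100e15                # 100 PB, per the symmetry magazine article
journal_bytes = 50e6 * 1e6        # ~50 million articles x ~1 MB each = 50 TB
print(lhc_bytes / journal_bytes)  # LHC output ~2000x the journal literature
```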