In the Programming in Data Analytics project, global warming big data is analyzed and processed using Pig, Hive, MapReduce, HDFS, and HBase. Results are stored in HBase and visualized using Tableau.
Using the Data Cube vocabulary for Publishing Environmental Linked Data on la... (Laurent Lefort)
Canberra Semantic Web Meetup.
Initiatives have been launched to develop semantic vocabularies representing statistical classifications and discovery metadata. Tools are also being created by statistical organizations to support the publication of dimensional data conforming to the Data Cube specification, now in Last Call at W3C.
The meeting will be an opportunity to hear about two Semantic Web and Linked Data initiatives for statistical data driven by the Australian Government. The Bureau of Meteorology and CSIRO have recently released a Linked Data version of the ACORN-SAT historical climate data at http://lab.environment.data.gov.au, and the ABS has released the Census data modelled in the Data Cube vocabulary as part of a challenge the ABS is organising in the context of the SemStats Workshop (http://www.datalift.org/en/event/semstats2013/challenge) at the International Semantic Web Conference (ISWC) in Sydney (http://iswc2013.semanticweb.org).
Come along to hear about these two projects, the challenges encountered and the solutions developed.
ResourceSync core team members Bernhard Haslhofer and Simeon Warner will present on the ResourceSync specification and provide practical examples and scenarios for its application.
GlobusWorld 2021: Arecibo Observatory Data Movement (Globus)
The story of how Globus helped move petabytes of data from the Arecibo Observatory to TACC, and thereby save 50+ years of data for posterity and future research. Presented at the GlobusWorld 2021 conference by George Robb III.
We are living in the world of “Big Data”. “Big Data” is mainly characterised by three Vs – Volume, Velocity and Variety. The presentation will discuss how Big Data impacts us and how SAS programmers can apply their SAS skills in a Big Data environment.
The presentation will introduce Big Data storage solutions – Hadoop and NoSQL. For Hadoop, it will discuss two major capabilities – the Hadoop Distributed File System (HDFS) and Map/Reduce (parallel computing in Hadoop) – and show how SAS can work with Hadoop using the HDFS LIBNAME and FILENAME statements, SAS/ACCESS to Hadoop Hive, and SAS Grid Manager with Hadoop YARN. It will also introduce the concepts of a NoSQL database as a big data solution.
The presentation will also cover how SAS handles a variety of data formats, especially XML and JSON. It will show a use case of converting XML documents to SAS data sets using the LIBNAME XMLV2 statement with an XMLMAP, introduce REST APIs for extracting data over the internet, and demonstrate how SAS PROC HTTP can retrieve data through a REST API.
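The Map/Reduce pattern mentioned above can be illustrated with a minimal, generic sketch (pure Python, not SAS; the records and words are made up for illustration):

```python
from collections import defaultdict
from itertools import chain

def map_phase(record: str):
    # Mapper: emit a (word, 1) pair for every word in the input record.
    for word in record.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Reducer: sum the counts per key (the shuffle/sort step is implicit here).
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = ["big data big", "data velocity"]
counts = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
print(counts)  # {'big': 2, 'data': 2, 'velocity': 1}
```

In a real Hadoop job the mappers and reducers run on different nodes and the framework handles the shuffle; the sketch only shows the programming model.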
The story of how Globus helped the Arecibo Observatory save 50+ years of data for posterity and future research. Presented at the GlobusWorld 2021 conference by Julio Alvarado Negron.
Presented at SWIB13 in Hamburg, 2013-11-27. ResourceSync slides excerpted from the full tutorial at http://www.slideshare.net/OpenArchivesInitiative/resourcesync-tutorial
While much of the recent literature in spatial statistics has evolved around addressing the big data issue, practical implementations of these methods on high performance computing systems for truly large data are still rare. We discuss our explorations in this area at the National Center for Atmospheric Research for a range of applications, which can benefit from large scale computing infrastructure. These applications include extreme value analysis, approximate spatial methods, spatial localization methods and statistically-based data compression and are implemented in different programming languages. We will focus on timing results and practical considerations, such as speed vs. memory trade-offs, limits of scaling and ease of use.
Forensic examiners are in a continual battle with criminals over the use of the Hadoop platform, and forensic investigation of composite Hadoop platforms is an emerging field for forensic practitioners. The major challenge in this environment is generating effective evidence from the sheer volume of Hadoop backlogs awaiting analysis to establish criminal activity; extracting evidence from such a backlog can be arduously time- and resource-consuming. Forensic readiness can assist practitioners in addressing these challenges. This paper makes a two-fold contribution: (i) it identifies the forensically important artifacts on a Hadoop platform, namely a non-Ambari Hortonworks Data Platform (HDP), and (ii) it proposes forensic readiness based on analysis of those residual artifacts. The outcome is a Hadoop platform forensic readiness that supports effective evidence generation in real-world Hadoop forensics.
Linked geospatial data has recently received attention, as researchers and practitioners have started tapping the wealth of geospatial information available on the Web. Incomplete geospatial information, although appearing often in the applications captured by such datasets, is not represented and queried properly due to the lack of appropriate data models and query languages. We discuss our recent work on the model RDFi, an extension of RDF with the ability to represent property values that exist, but are unknown or partially known, using constraints, and an extension of the query language SPARQL with qualitative and quantitative geospatial querying capabilities. We demonstrate the usefulness of RDFi in geospatial Semantic Web applications by giving examples and comparing the modeling capabilities of RDFi with the ones of related Semantic Web systems.
Dr. Frank Würthwein of the University of California, San Diego, presenting at the International Supercomputing Conference on Big Data, 2013, US. Until recently, the large CERN experiments, ATLAS and CMS, owned and controlled the computing infrastructure they operated on in the US, and accessed data only when it was locally available on the hardware they operated. However, Würthwein explains, with data-taking rates set to increase dramatically by the end of LS1 in 2015, the current operational model is no longer viable for satisfying peak processing needs. Instead, he argues, large-scale processing centers need to be created dynamically to cope with spikes in demand. To this end, Würthwein and colleagues carried out a successful proof-of-concept study in which the Gordon Supercomputer at the San Diego Supercomputer Center was dynamically and seamlessly integrated into the CMS production system to process a 125-terabyte data set.
Big Data to SMART Data: Process scenario
A scenario implementing a process that transforms raw data into exploitable, representative data, with stream processing, distributed systems, messaging, storage in a NoSQL environment, and management within a Big Data ecosystem, plus graphical visualisation of the data, using the following technologies:
Apache Storm, Apache ZooKeeper, Apache Kafka, Apache Cassandra, Apache Spark and Data-Driven Documents (D3.js).
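As a rough conceptual sketch of such a pipeline (pure-Python stand-ins, not actual Storm/Kafka/Cassandra code: a queue plays the Kafka broker, a consumer thread plays the Storm/Spark job, and a dict plays the NoSQL store; the sensor readings are invented):

```python
import queue
import threading

broker = queue.Queue()   # stand-in for a Kafka-like message broker
store = {}               # stand-in for a Cassandra-like key-value store

def consumer():
    # Stand-in for a streaming job: read events and keep a running aggregate.
    while True:
        event = broker.get()
        if event is None:          # sentinel: shut the consumer down
            break
        key, value = event
        store[key] = store.get(key, 0) + value

worker = threading.Thread(target=consumer)
worker.start()
for reading in [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3)]:
    broker.put(reading)            # producer side of the pipeline
broker.put(None)
worker.join()
print(store)  # {'sensor-a': 4, 'sensor-b': 2}
```

The real stack replaces each stand-in with a distributed, fault-tolerant component, but the flow (produce, broker, consume/aggregate, store, then visualise) is the same.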
Experiences as a producer, consumer and observer of open data (ProgCity)
Peter Mooney is an Environmental Protection Agency (EPA) funded Research Fellow at the Department of Computer Science, NUI Maynooth. He has been working with the EPA on making environmental data publicly accessible for the last ten years.
Presentation was part of The 1st Seminar of the ERC Funded Programmable City Project based at NIRSA, NUI Maynooth, Republic of Ireland.
Analysis of National Footprint Accounts using MapReduce, Hive, Pig and Sqoop (sushantparte)
The footprint data is analyzed on various bases, and this research applies big data techniques to national footprint data fetched from publicly available internet sources. The proposed project uses the data set from 1962 to 2013, with 2018 data taken from kaggle.com. To handle the large data set, the project utilizes a Hadoop environment. The distributed environment of Hadoop and MySQL is used to process the data; HBase is used for post-processing, and Sqoop for data transfer between HDFS and MySQL. The monitoring data is processed through case studies with MapReduce, Pig and Hive, and can be statistically analyzed and visualized in Tableau and Microsoft Power BI.
Dealing with Semantic Heterogeneity in Real-Time Information (Edward Curry)
Tutorial at the EarthBiAs 2014 Summer School on Dealing with Semantic Heterogeneity in Real-Time Information
Part I: Large Scale Open Environments
Part II: Computational Paradigms
Part III: RDF Event Processing
Part IV: Theory of Event Exchange
Part V: Approaches to Semantic Decoupling
Part VI: Example Application: Linked Energy Intelligence
Department of Geography and Geoinformation Science Seminar, George Mason University, Falls Church, VA, September 2015.
Increasingly, GIS is part of the collaboration between computer scientists, information scientists, and domain scientists to solve complex scientific questions. Successfully addressing scientific problems, such as informing regional decision- and policy-making for coastal zone management and marine spatial planning, requires integrative and innovative approaches to analyzing, modeling, and developing extensive and diverse data sets. The current chaotic distribution of available data sets, lack of documentation about them, and lack of easy-to-use access tools and computer modeling and analysis codes are still major obstacles for scientists and educators alike. Contributing solutions to these problems is part of an emerging science agenda at Esri for a range of environmental, conservation, climate and ocean sciences that will be discussed. The talk will highlight some recent projects in progress, including a new global map of ecological land units, new tools to support multidimensional scientific data, continued work on an ocean basemap, and more.
AAG Session
4204 Data-based living: peopling and placing ‘big data’
Tampa, Florida, April 11 2014
Tracey P. Lauriault and Rob Kitchin
National Institute for Regional and Spatial Analysis (NIRSA)
National University of Ireland at Maynooth (NUIM)
The Earth System Grid Federation: Origins, Current State, Evolution (Ian Foster)
I describe the origins, current state and potential future directions for the Earth System Grid Federation, an international consortium that develops infrastructure for sharing of climate simulation and related datasets.
The EGI Federation of clusters and research clouds are components of the European Open Science Cloud, and they offer technical solutions and an infrastructure to support the EuroGEOSS pilots, GEOSS and EO data exploitation platforms.
Learn how, by looking at the collaboration of EGI with NextGEOSS, the production support of the Geohazards TEP of Terradue and the EOSC-hub collaboration with GEOSS.
The AGINFRA+ Virtual Research Environment (VRE) (AGINFRA)
Massimiliano Assante from CNR on The AGINFRA+ Virtual Research Environment (VRE).
Joint Workshop on Food Risk Assessment Research & Practice
24th November 2017, Wageningen University & Research, Netherlands
This PDF is about schizophrenia.
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market – which includes products such as functional foods, beverages, and dietary supplements that provide health benefits beyond basic nutrition – is growing significantly. As healthcare costs rise, the population ages, and demand for natural and preventative health solutions increases, this industry is expanding quickly. Innovations in product formulation and the use of cutting-edge technology for personalised nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to offer significant opportunities for research and investment across categories including vitamins, minerals, probiotics, and herbal supplements.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink on how data can be made machine- and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
Seminar on U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing by which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) is reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi (non-coding RNA)
miRNA
Length: 23-25 nt
Trans-acting
Binds its target mRNA with mismatches
Causes translational inhibition
siRNA
Length: 21 nt
Cis-acting
Binds its target mRNA as a perfectly complementary sequence
piRNA (Piwi-interacting RNA)
Length: 25 to 36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAi:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
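The sequence-specific pairing step can be sketched in a few lines of Python (the sequences are hypothetical and this is a conceptual aid only, not a bioinformatics tool):

```python
# Watson-Crick pairing rules for RNA.
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return "".join(PAIRS[base] for base in reversed(rna))

def is_perfect_target(sirna_guide: str, mrna_site: str) -> bool:
    """True if the guide strand pairs base-for-base with the mRNA site,
    the perfect complementarity required for siRNA-directed cleavage."""
    return reverse_complement(sirna_guide) == mrna_site

mrna_site = "AUGGCUUACGAUCGAUAGGCA"    # hypothetical 21-nt target site
guide = reverse_complement(mrna_site)  # a perfectly matched guide strand
print(is_perfect_target(guide, mrna_site))  # True
```

A miRNA, by contrast, would pair with mismatches, which is why it tends to inhibit translation rather than trigger cleavage.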
THE RISC COMPLEX:
RISC is a large (>500 kDa) RNA-multiprotein complex which triggers mRNA degradation in response to siRNA.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease which cleaves the target mRNA.
DICER: an endonuclease (RNase III family)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN :
1. PAZ (PIWI/Argonaute/Zwille): recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H-like activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Cancer cell metabolism: special Reference to Lactate Pathway (AADYARAJPANDEY1)
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to “burn” the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis, Krebs cycle, oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
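The ATP arithmetic above implies a roughly 18-fold difference in glucose demand, which can be checked with a one-line calculation (using the approximate figures quoted):

```python
GLYCOLYSIS_ATP = 2         # ATP per glucose from glycolysis alone
FULL_RESPIRATION_ATP = 36  # approximate ATP per glucose with Krebs + Ox-Phos

# How many glucose molecules a purely glycolytic cell must consume
# to match the ATP yield of one fully respired glucose:
glucose_ratio = FULL_RESPIRATION_ATP // GLYCOLYSIS_ATP
print(glucose_ratio)  # 18
```

This ratio is the quantitative reason, discussed below, that glycolytic cancer cells need far more sugar to survive.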
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Introduction to the Warburg phenomenon:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the 1931 Nobel Prize in Physiology or Medicine for his "discovery of the nature and mode of action of the respiratory enzyme."
WARBURG EFFECT: The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg observed that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Predicting property prices with machine learning algorithms.pdf
Improving access to geospatial Big Data in the hydrology domain
1. Improving access to geospatial Big Data in the hydrology domain
Claudia Vitolo1,2 and Wouter Buytaert1
1 Imperial College London
2 Brunel University London
Big Data and Spatial Analytics - Business and Industrial Section
Royal Statistical Society, London, UK - 18.11.2015
4. What is Hydrology?
Hydrology is the scientific study of the movement, distribution, and quality of water on Earth.
Source: Hydrology. In Wikipedia, The Free Encyclopedia.
5. What do (river) hydrologists do?
▣ Collect data on climate, soil, geology, topography, etc.
▣ Set up model
▣ Calibrate model with observed water levels and stream flows
□ locations
□ time intervals
▣ Use models to analyse scenarios and make predictions
6. Big Data in Hydrology
Information:
▣ Topography & bathymetry
▣ Geology
▣ Soil & Moisture
▣ Land cover
▣ Weather & Climate
▣ Hydrometry
▣ Quality samples
▣ Groundwater
▣ Infrastructures
Format:
▣ Plain text
▣ Raster
▣ Vector
▣ Binary
▣ Markup Languages
▣ Graphs & networks
▣ CAD drawings
10. Big Data challenges:
▣ Get large volume of heterogeneous data
▣ Mash up information and use it to make decisions
12. Open Data
“Open data and content can be freely used, modified, and shared by anyone for any purpose”
Source: http://opendefinition.org/
17. The National River Flow Archive (NRFA)
River flow data from gauging station networks across the UK
including networks operated by:
● Environment Agency (England),
● Natural Resources Wales,
● Scottish Environment Protection Agency,
● Rivers Agency (Northern Ireland).
http://nrfa.ceh.ac.uk/
18. Point & click (GUI) vs programmatic (API) data retrieval
GUI
PROS: simple and intuitive
CONS: not scalable, not flexible
API
PROS: scalable, fast and flexible
CONS: requires programming skills
20. The NRFA’s API
▣ metadata catalogue,
▣ catalogue filters,
▣ time series of gauged daily data,
▣ time series of catchment monthly rainfall.
21. How does an API work?
server/format/service?X=1&Y=2&Z=3
23. How does an API work?
server/format/service?X=1&Y=2&Z=3
QUESTION A:
How do I get information on station “18019” from the NRFA catalogue?
ANSWER:
nrfaapps.ceh.ac.uk/nrfa/json/stationSummary?db=nrfa_public&stn=18019
25. How does an API work?
server/format/service?X=1&Y=2&Z=3
QUESTION B:
How do I get the time series of gauged daily data for station “18019”?
ANSWER:
nrfaapps.ceh.ac.uk/nrfa/xml/waterml2?db=nrfa_public&stn=18019&dt=gdf
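Both answers follow the same server/format/service?X=1&Y=2&Z=3 pattern. A minimal Python sketch of building these request URLs (the actual fetch is left commented out to avoid a live network call; only the URL structure is taken from the slides):

```python
from urllib.parse import urlencode

BASE = "http://nrfaapps.ceh.ac.uk/nrfa"

def nrfa_url(fmt, service, **params):
    """Build an NRFA request URL following the server/format/service?X=1&Y=2 pattern."""
    return f"{BASE}/{fmt}/{service}?{urlencode(params)}"

# Question A: station metadata from the catalogue
summary_url = nrfa_url("json", "stationSummary", db="nrfa_public", stn=18019)

# Question B: gauged daily flow as WaterML 2.0
gdf_url = nrfa_url("xml", "waterml2", db="nrfa_public", stn=18019, dt="gdf")

print(summary_url)
print(gdf_url)

# A real retrieval would then be, e.g.:
# from urllib.request import urlopen
# body = urlopen(summary_url).read()
```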
27. R libraries to interface APIs
▣ raincpc: download and process the Climate Prediction Center's (CPC) daily rainfall data.
▣ rnoaa: an interface to the NOAA climate data API.
▣ soilDB: read data from USDA-NCSS soil databases.
▣ waterData: retrieve, analyse, and calculate anomalies of daily hydrologic time series data.
▣ rnrfa: an interface to the UK National River Flow Archive data API.
29. The R package RNRFA
API interface:
▣ make request
▣ parse response
▣ retrieve and filter metadata catalogue
▣ get time series of gauged daily data and catchment monthly
rainfall
API interface + external libraries:
▣ make maps
▣ create interactive tables and plots
▣ simplify and speed up reporting!
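The "make request, parse response" steps above translate directly outside R too. A sketch of parsing a station-summary response in Python; the JSON field names below are illustrative assumptions, not the documented NRFA schema:

```python
import json

# Hypothetical response body; the real NRFA field names may differ.
response_body = """
{
  "station": {"id": "18019", "name": "Example Gauge",
              "lat": 56.1, "lon": -3.9},
  "gdf": [["2020-01-01", 1.2], ["2020-01-02", 1.5]]
}
"""

payload = json.loads(response_body)
station = payload["station"]                      # metadata record
series = {date: flow for date, flow in payload["gdf"]}  # date -> daily flow

print(station["name"], len(series))
```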
30. Example of dynamic report
▣ Find all the stations operated by Natural Resources Wales
▣ Retrieve time series of daily flows
▣ Run a basic analysis
▣ Create interactive plot, table and map
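The report steps above (filter the catalogue by operator, retrieve flows, run a basic analysis) can be sketched as follows; the station records and flow values here are invented stand-ins for the NRFA responses, not real data:

```python
from statistics import mean

# Synthetic catalogue and flow data standing in for NRFA responses.
catalogue = [
    {"id": "18019", "operator": "Natural Resources Wales"},
    {"id": "39001", "operator": "Environment Agency (England)"},
    {"id": "60002", "operator": "Natural Resources Wales"},
]
flows = {"18019": [1.2, 1.5, 0.9], "60002": [3.4, 2.8, 3.1]}

# Step 1: filter the catalogue by operator.
welsh = [s["id"] for s in catalogue
         if s["operator"] == "Natural Resources Wales"]

# Steps 2-3: retrieve each series and compute a basic statistic (mean flow).
report = {stn: round(mean(flows[stn]), 2) for stn in welsh}

print(report)
```

From a dict like this, the plotting/table/map step would hand the results to whatever visualisation library the report uses.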
32. Summary
Big Data
Large volumes of heterogeneous spatio-temporal data are becoming increasingly open in the hydrology domain.
GUIs vs APIs
GUIs may be the easiest way to browse data, but not the most efficient. APIs are fast and scalable.
Hardware/software
The hardware & software burden is on the data provider side. No need to update your datasets: you always access the latest version.
R as interface
R is an easy-to-learn language, widely used by statisticians and scientists. It provides a number of libraries to obtain and parse data from the web.
Reproducible workflows
Query databases, filter information, convert coordinates, and generate plots and maps for reproducible reporting.
Scalability & Interoperability
An approach to gather information for single as well as multiple sites. At larger scale, computing can be made more efficient by using cloud facilities.