The Python Programming Language and HDF5: H5py.
Application to Satellite Remote Sensing Data Processing.
Needs of satellite remote sensing data processing and
programming languages.
Introduction to some Python elements.
Introduction to Numerical Python with Numpy.
Tying them together with H5py.
Examples.
Daniel Kahn, Science Systems and Applications Inc.
HDF/HDF-EOS Workshop XIII, Riverdale, MD
3 Nov 2009
Where credit is due:

Andrew Collette, UCLA, author of h5py
Francesc Alted, author of PyTables and the
HDF5 1.6 Pyrex definitions (i.e. Python-C defs
used in H5py)
Andrew also requested acknowledgment for Darren
Dale (Cornell) and Armando Sole (ESRF) for their
suggestions, patches and encouragement.
Modern remote sensing satellite data and data processing
needs have several characteristics.
1) The data need to be subjected to numerical algorithms. The
algorithms vary in their sophistication but few, if any, require
supercomputer architectures. E.g. Computing a zonal mean.
(Fortran, C, C++)
2) The data need to be subjected to non-numerical “bookkeeping”
algorithms. Associated metadata needs to be read, matched, sliced,
diced, acted upon and committed to databases. (Perl)
3) Modern systems automatically schedule processing onto
lots of inexpensive, commodity CPUs. This complicates using
managed (i.e. licensed) commercial runtime environments, e.g.
IDL, Matlab, Mathematica.
4) Satellite processing systems are dynamic. They need
customization several times over the course of each instrument's
life. Development time needs to be short. (Perl, IDL, Matlab,
Mathematica)
5) Data processing systems need to track data provenance.
A Remote Sensing Data Processing Challenge...

Problem:
We lack a programming language which spans the breadth of
our remote sensing data processing needs, i.e. one which can
handle both numerical and non-numerical processing
(including HDF5) and which has a short development cycle.
Goal:
Find a programming language which allows fast development
and is adept at numerical processing as well as non-numerical
processing. “Adept” doesn't mean “best”, just pretty good.
...and the Python Programming Language
Why Python fits the bill better than:
Fortran
As an interpreted language with dynamic variable bindings,
Python offers more flexibility during development. The
multidimensional array extension (numpy) provides efficient and
competitive array operations. It is more similar to IDL or
Matlab...
IDL or Matlab
No license is required, which can save money on several levels.
The Python language can be better suited to non-numerical
data processing tasks such as those encountered in production
processing of remote sensing data. It is more similar to Perl...
Perl/PDL
Nicer syntax for array operations.
More complete HDF5 support (h5py vs PDL HDF5).
Some elements of Python
Python is an interpreted language. Python
statements can easily be run from an interactive
prompt. Python programs are written in ASCII files.

Python has many of the same elements as programming
languages used for numerical processing (i.e. loops,
conditionals, etc.) plus...
Name spaces
Lists
Dictionaries
Python has name spaces
Name spaces allow related functions and data to be grouped together
in a way that will not conflict with unrelated functions and data which
happen to be named the same.
Name spaces are very useful, but are only relevant here because
Python name spaces correspond to Python modules. The
numerical and HDF5 routines we will see later are implemented
as Python modules.
For example, to open an HDF5 file in Python we first import the h5py
module and then we open the file using the File function of the h5py
module.

import h5py                            # First import the module.
FileID = h5py.File('hdf5_test.h5','r') # Now use the File function;
                                       # note the h5py module prefix.
...

We could import other modules simultaneously which have an unrelated
function called File and there would be no conflict or error.
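As a minimal sketch of why this matters (not from the original slides; assumes hdf5_test.h5 exists in the current directory): the name File exists only inside the h5py name space, so a bare File is undefined in our own name space.

import h5py

fid = h5py.File('hdf5_test.h5','r') # Found via the h5py name space.

try:
    File('hdf5_test.h5','r')        # No bare File in our name space...
except NameError:
    print 'File only exists inside the h5py module'

fid.close()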
Python has Lists:
These are 1D arrays which are useful for solving many problems.
List index notation looks like 1D array notation in many languages.

>>> MyList = [1.0,70] # initialize variable to 2 element list
>>> MyList.append("Demo Data!") # Append an element, NOT like Fortran
>>> MyList.append(MyList[0] + MyList[1]) # Append sum of first 2 elements
>>> print MyList
[1.0, 70, 'Demo Data!', 71.0]

Python has Dictionaries:
(A.k.a. hash tables) These map symbols to objects in analogy to
how a 1D list maps an integer index value to an object.
>>> MyDictionary = {} # Initialize empty Dictionary
>>> MyDictionary[0] = 1
>>> MyDictionary['Gadzooks'] = 70
>>> MyDictionary['Kind of Data'] = 'Demo Data!'
>>> MyDictionary['Result'] = MyDictionary[0] + MyDictionary['Gadzooks']
>>> print MyDictionary
{0: 1, 'Kind of Data': 'Demo Data!', 'Result': 71, 'Gadzooks': 70}
Python Lists vs Python Dictionaries vs Arrays in Fortran
Index to a List must be an integer, but a Dictionary can be
indexed by an integer or string.
MyList[Integer] vs. MyDictionary[Integer or String]

The objects referenced by a List or Dictionary can be any Python
object. This is in contrast to more traditional languages; e.g. the
elements of a Fortran array must be of a particular type such as
real*4.
MyList = [34,"String Datum"]
or
MyDictionary={'FirstKey':34,'SecondKey':"String Datum"}
The List and Dictionary data structures make it possible to write
flexible programs very quickly. However, they are not good for
numerical computation. The variety of objects (number, string, etc)
which they can reference makes computation slow.
For example, adding array elements: MyList[0] + MyList[1]
Python must check at run time if MyList[0] and MyList[1] are
objects that can be added together, and not, say, a number and a
string. In a loop this check is performed at every iteration!
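To make the cost concrete, here is a rough timing sketch (not from the original slides; assumes numpy is installed) comparing a list loop against the vectorized numpy sum introduced below:

import time
import numpy

MyList = range(1000000)          # one million Python objects
MyArray = numpy.arange(1000000)  # one million machine integers

Start = time.time()
Total = 0
for x in MyList:                 # type check on every iteration
    Total = Total + x
print 'List loop:', time.time() - Start, 'seconds'

Start = time.time()
Total = MyArray.sum()            # one call, no per-element checks
print 'Numpy sum:', time.time() - Start, 'seconds'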
What to do?

Enter Numpy
Numpy is a package for use with Python which provides
multidimensional arrays of numeric (and other) types and extends
Python syntax to provide a familiar interface to the arrays.
Numpy extends Python syntax to allow the expression of vector
operations similar to those in IDL, Matlab, or Fortran (>=90).
Numpy Example: Create, Print, and add two 2D arrays
Build from Python lists of lists of elements (the module name is numpy)...
>>> import numpy
>>> a = numpy.array([[1,2,3],[4,5,6]])
>>> print a
[[1 2 3]
 [4 5 6]]

Build from dimension sizes...
>>> b = numpy.ones([2,3])
>>> print b
[[ 1. 1. 1.]
[ 1. 1. 1.]]

Print a selected element...
>>> print a[1,2]
6

Example: Add the two 2D arrays.
>>> print a+b
[[ 2. 3. 4.]
[ 5. 6. 7.]]
>>>
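A few more vector operations in the same session show the IDL/Matlab flavor (a sketch, not from the original slides, continuing with the arrays a and b above):

>>> print a * 2       # element-wise multiply
[[ 2  4  6]
 [ 8 10 12]]
>>> print a[:,1]      # slice out the second column
[2 5]
>>> print (a + b).sum()
27.0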
The HDF5 connection: H5py
H5py is a Python-HDF5 interface module
written by Andrew Collette. Its design allows the use of
familiar Python structures for working with HDF5 files.

The interface maps Python syntax for familiar Python
objects to similar objects in the HDF5 file.
Here is a familiar example HDF5 file
from the HDFView distribution:
Here is how to read the 3D int array using
h5py.

>>> import h5py
>>> fid = h5py.File('hdf5_test.h5','r')
>>> group = fid['arrays']
>>> The3DArray = group['3D int array'].value
>>> fid.close()
>>> The3DArray
array([[[   174,    27,     0, ...,   102,    71, 100009],
        [   171,    27,     0, ...,   194,    79, 100109],
        [   172,    27,     0, ...,   102,    55, 100209],
        ...,
Equivalence of HDF5 Groups and Python Dictionaries
Print value of dictionary entry:
>>> MyDictionary = {'RandArray':numpy.random.random([2,2])}
>>> print MyDictionary['RandArray']
[[ 0.82066938  0.39219683]
 [ 0.86546443  0.91276533]]
Print value of HDF5 file entry:
>>> fid = h5py.File('RandomArray.h5','r')
>>> print fid['RandArray'].value
[[ 0.1         3.14152908]
 [ 2.71828008  0.        ]]
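Writing uses the same dictionary-like syntax. As a sketch (not from the slides), this is one way the RandomArray.h5 file above could have been created; assigning a numpy array to a name creates the dataset:

>>> import h5py, numpy
>>> fid = h5py.File('RandomArray.h5','w')
>>> fid['RandArray'] = numpy.random.random([2,2]) # creates the dataset
>>> fid.close()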
Simple Real world example:
Goal: Retrieve HDF5 file from Configuration Management (CM)
and insert CM metadata into HDF5 file.
We want to place the CM path and revision number inside the
HDF5 file as an attribute.
Python script to retrieve file from CM and store Rev
number as attribute.
#! /usr/bin/env python
import sys
import os
import h5py
Rev = sys.argv[1]         # Specify revision number on command line
SVNFilepath = sys.argv[2] # Specify CM path on command line
command = 'svn export -r ' + Rev + ' ' + SVNFilepath # Subversion command
InStream = os.popen(command,'r')
ExportString = InStream.read()
ExportReturnCode = InStream.close()
Elements = SVNFilepath.split('/')
# HDF5 code
fid = h5py.File(Elements[-1]) # Elements[-1] is file name
fid.attrs['SVN Path and Revision'] = SVNFilepath + '@' + Rev
fid.close()

The h5py code is just the last three lines. Note the minimal effort coding the HDF5 calls.
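Assuming the script were saved as GetFromCM.py (a hypothetical name), a run might look like:

$ ./GetFromCM.py 1234 svn/trunk/Data/hdf5_test.h5

This exports revision 1234 of the file from Subversion and stamps 'svn/trunk/Data/hdf5_test.h5@1234' into the SVN Path and Revision attribute.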
Another real world example (if NPOESS is your real world).
We have had several occasions to do data aggregation on HDF5 files for the
OMPS Limb instrument.
Our retrieval code (Fortran) processes an orbit of data as 480 distinct pieces
and places the results into 480 distinct HDF5 files. We wish to aggregate the
files such that an N-dimensional array in an unaggregated file becomes an
(N+1)-dimensional array in the aggregated HDF5 file.
The Fortran code is (mostly) the product of the retrieval scientist, while the
aggregation is a requirement of the production data processing system. It makes
sense to aggregate as a post-processing step to the Fortran code so as to
minimize changes to the original Fortran code.
Aggregation Algorithm
1) Input is list of HDF5 files
2) Analyze structure of one file to generate list of fully qualified dataset names,
dimensions, and type (In the code I use a Python Dictionary and not a list).
3) Assume all files have that structure. Read corresponding datasets (of dim N)
from each file into aggregation variable (of dim N+1).
4) After corresponding datasets have been read from all files write aggregation
variable to HDF5 file.
5) Repeat steps 3 and 4 until all datasets have been aggregated.
[Figure: Schematic of Array Aggregation for One Dataset. The
corresponding array from each input file (File 1 Array, File 2 Array,
File 3 Array, File 4 Array) is stacked (+) into the aggregated dataset
as Array 1 through Array 4 along the new dimension.]
# Data Aggregator. This script takes a set of HDF5 files as input.
# The files are expected to have identical structure. All the
# corresponding arrays in the input files are combined into an array
# which has dimensions N+1, where N is the number of dimensions in the
# original, constituent arrays.
import sys
import h5py
import numpy

Files = sys.argv[1:] # Get file names from command line

# First step is to select one file and create a map of the shape and
# data type of all datasets. This is naturally done via a recursive
# function, called VisitAllObjects.
FirstFid = h5py.File(Files[0],'r') # Open first HDF5 file
FileInfo = {} # Initialize FileInfo to be a dictionary. We will use it to
              # build a mapping from dataset name to a tuple containing
              # shape and type of dataset.

# EVALUATE HDF5 HIERARCHY
# Evaluating a hierarchy is naturally a recursive process so we define a function....
def VisitAllObjects(Group,Path):
    for i in Group.items():
        if isinstance(i[1],h5py.Group):
            VisitAllObjects(i[1],Path + '/' + i[0])
        else:
            DatasetName = Path + '/' + i[0]
            FileInfo[DatasetName] = (Group[DatasetName].shape, Group[DatasetName].dtype)

VisitAllObjects(FirstFid,'')
FirstFid.close()

# Print dataset paths and info to screen
for (k,v) in FileInfo.items():
    print k,v

# AGGREGATE DATA
# Now that we understand the file structure we can perform the aggregation.
OutputFileID = h5py.File('AggregatedData.h5','w')
NumberOfFiles = len(Files)

# Here is the meat of the code. The outer loop is over datasets, the
# inner over all files.
for Dataset in FileInfo.keys():
    AggregatedData = numpy.ndarray(FileInfo[Dataset][0]+(NumberOfFiles,),
                                   dtype = FileInfo[Dataset][1])
    for FileNumber in range(NumberOfFiles):
        # Open file, read data into aggregation array, and close
        fid = h5py.File(Files[FileNumber],'r')
        AggregatedData[...,FileNumber] = fid[Dataset].value
        fid.close()
    Path = Dataset.split('/')
    map((lambda(x): OutputFileID.require_group(x)), Path[1:-1])
    #OutputFileID[Dataset] = AggregatedData
    OutputFileID.create_dataset(Dataset,data=AggregatedData,compression=5,
                                chunks=FileInfo[Dataset][0]+(1,))
OutputFileID.close()
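Assuming the script were saved as Aggregate.py (a hypothetical name), aggregating one orbit's worth of files might look like:

$ ./Aggregate.py Data/OMPS_LP_SDR_20041121_*.h5

The aggregated result is written to AggregatedData.h5 in the current directory (the output name is hard-coded in the script).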
Original, Unaggregated Data Field

$ h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \
Data/OMPS_LP_SDR_20041121_55_146_-69_-119.h5
HDF5 "Data/OMPS_LP_SDR_20041121_55_146_-69_-119.h5" {
DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {
   DATATYPE H5T_IEEE_F32LE
   DATASPACE SIMPLE { ( 5, 3, 21 ) / ( 5, 3, 21 ) }
   ATTRIBUTE "Title" {
      DATATYPE H5T_STRING {
         STRSIZE 25;
      }
      DATASPACE SCALAR
   }
   ATTRIBUTE "Units" {
      DATATYPE H5T_STRING {
         STRSIZE 2;
      }
      DATASPACE SCALAR
   }
   ATTRIBUTE "_FillValue" {
      DATATYPE H5T_IEEE_F32LE
      DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
   }
}
}

Note the 3 dimensions in the dataspace.
Aggregated Data Field

$ h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \
AggregatedData.h5
HDF5 "AggregatedData.h5" {
DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {
   DATATYPE H5T_IEEE_F32LE
   DATASPACE SIMPLE { ( 5, 3, 21, 4 ) / ( 5, 3, 21, 4 ) }
}
}

Now we have 4 dimensions. The new
dimension has extent 4, corresponding to
the number of input files.
Note that none of the attributes were copied.
This is a bug, but easily fixed.
To fix the bug, I assume attributes (like "Units") are not
aggregated, and I take attribute names and values from the first
file. The new code consists of only 2+ additional lines, shown
below.

def VisitAllObjects(Group,Path):
    for i in Group.items():
        if isinstance(i[1],h5py.Group):
            VisitAllObjects(i[1],Path + '/' + i[0])
        else:
            DatasetName = Path + '/' + i[0]
            FileInfo[DatasetName] = (Group[DatasetName].shape,
                                     Group[DatasetName].dtype,
                                     Group[DatasetName].attrs.listitems())

And also....

DS = OutputFileID.create_dataset(Dataset,data=AggregatedData,compression=5,
                                 chunks=FileInfo[Dataset][0]+(1,))
[DS.attrs.__setitem__(Attribute[0],Attribute[1]) for Attribute in
    FileInfo[Dataset][2]]
Fixed output:
$ h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \
AggregatedDataWithAttributes.h5
HDF5 "AggregatedDataWithAttributes.h5" {
DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {
   DATATYPE H5T_IEEE_F32LE
   DATASPACE SIMPLE { ( 5, 3, 21, 4 ) / ( 5, 3, 21, 4 ) }
   ATTRIBUTE "Title" {
      DATATYPE H5T_STRING {
         STRSIZE 25;
         CTYPE H5T_C_S1;
      }
      DATASPACE SCALAR
   }
   ATTRIBUTE "Units" {
      DATATYPE H5T_STRING {
         STRSIZE 2;
         CTYPE H5T_C_S1;
      }
      DATASPACE SCALAR
   }
   ATTRIBUTE "_FillValue" {
      DATATYPE H5T_IEEE_F32LE
      DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
   }
}
}

Now we have attributes.
Summary:
Python offers a high degree of flexibility for code
development; combined with the ability to do easy
text, numerical array, and HDF5 coding, this makes it a
good candidate for solving problems in satellite
remote sensing data processing.
Few, if any, other computer languages offer this
combination of benefits, making Python uniquely suited
to this task.
Future Work:
Tracking module version provenance is likely one of the
outstanding questions for Python use in production.
Acknowledgment:
Curt Tilmes at NASA Goddard funded this work via
contract NNG06HX18C task 614.5-01-07
