DataFinder: A Python Application for Scientific Data Management

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

2 comments

Comments 1 - 2 of 2 previous next Post a comment

  • + guestcef2ba guestcef2ba 2 years ago
    Q1: Use cases? What are the steps to get a simple file in. And then use the search to get it back? (this would be like a simple 'hello world' equivalent)



    Q2: How do I install the system, is it opensource?
  • + guest6a731a guest6a731a 2 years ago
    Will you be giving this talk at PyCon UK? Michael Foord
Post a comment
Embed Video
Edit your comment Cancel

2 Favorites & 1 Group

DataFinder: A Python Application for Scientific Data Management - Presentation Transcript

  1. DataFinder: A Python Application for Scientific Data Management EuroPython 2008 (July 9th 2008, Vilnius) Andreas Schreiber < Andreas.Schreiber@dlr.de> German Aerospace Center (DLR), Cologne http://www.dlr.de/sc
    • The DLR
    • German Aerospace Research Center
    • Space Agency of the Federal Republic of Germany
    • 5,600 employees working in 28 research institutes and facilities
      •  at 13 sites .
    • Offices in Brussels, Paris and Washington.
    Sites and employees  Köln  Lampoldshausen  Stuttgart  Oberpfaffenhofen Braunschweig   Göttingen Berlin -   Bonn Trauen   Hamburg  Neustrelitz Weilheim  Bremen - 
  2. Short Overview
    • DataFinder is a software for efficient management of scientific and technical data
    • Focus on huge data sets
    • Development of the DataFinder by DLR
    • Primary functionality
      • Structuring of data through assignment of meta information and self-defined data models
      • Flexible usage of heterogeneous storage resources
      • Integration in the working environment
  3. Introduction
    • DataFinder founded by DLR
    • National Grid project AeroGrid
  4. Introduction Background
    • Large-scale simulations
    • aerodynamics
    • material science
    • climate
    • Tons of measured data
    • wind-tunnel experiments
    • earth observations
    • traffic data
  5. Introduction Data Management Problem
    • Typical organizational situations
    • No central data management policy
    • Every employee organizes his/her data individually
    • Researchers spend about 30% of their time searching for data
    • Problem with data left behind by temporary staff
    • Increase of data size and regulations
    • Rapidly growing volume of simulation and experimental data
    • Legal requirements for long-term availability of data (up to 50 years!)
    • Situation similar at many organizations
    • All ~30 DLR institutes
    • Other research labs and agencies
    • Industry
  6. DataFinder History Search for solution for scientific data management
    • Definition of “standard problem” (helicopter simulation)
    • Test case for evaluation of software
    • Evaluation of commercial product data management (PDM) systems
    • PDM systems could manage data but with huge amount of costs
    • PDM systems have many unneeded functionalities
    • PDM systems have self-defined or unreadable scripting languages for extension and customization (Tcl etc.)
    • Development of DataFinder
    • Lightweight data management client and existing server solution
    • Just enough functionality for our problems (no paid but unused features!)
  7. DataFinder Development From Java Prototype to Python Product…
    • Development of prototype in Java
    • Data could be manages with prototype successfully
    • Drawbacks: Java problems on important platforms (e.g., SGI IRIX)
    • Embedded Jython interpreter great feature for users
    • User: “ The Java GUI is like shit, but the Python scripting is great. We want a pure Python solution! ”
    • Development of DataFinder product from scratch in Python
  8. Python for Scientists and Engineers Reasons for Python in Research and Industry
    • Observations :
    • Scientists and engineers don’t want to write software but just solve their problems
    • If they have to write code, it must be as easy as possible
    • Why Python is perfect?
    • Very easy to learn and easy to use ( = steep learning curve )
    • Allows rapid development ( = short development time )
    • Inherent great maintainability
    “ Python has the cleanest, most-scientist- or engineer friendly syntax and semantics.” (Paul F. Dubois. Ten good practices in scientific programming. Comp. In Sci. Eng., Jan/Feb 1999, pp.7-11) “ I want to design planes, not software!”
  9. DataFinder Overview Basic Concept
    • Client-Server solution
    • Based on open and stable standards , such as XML and WebDAV
    • Extensive use of standard software components (open source / commercial), limited own development at client side
  10. WebDAV Web-based Distributed Authoring & Versioning
    • Extension of HTTP
    • Allows to manage files on remote servers collaboratively
    • WebDAV supports
      • Resources (“files”)
      • Collections (“directories”)
      • Properties (“meta data”, in XML format)
      • Locking
    • WebDAV extensions
      • Versioning (DeltaV)
      • Access control (ACP)
      • Search (DASL)
  11. DataFinder Overview Client and Server
    • Client
    • User client
    • Administrator client
    • Implementation: Python with Qt
    • Server
    • WebDAV server for meta data and data structure
    • Data Store concept
      • Abstracts access to managed data
      • Flexible usage of heterogeneous storage resources
    • Implementation: Various existing server solutions (third-party)
  12. DataFinder Client Graphical User Interfaces User Client Administrator Client Implementation in Python with Qt/PyQt
  13. DataFinder Server Supported WebDAV servers
    • Commercial Server Solution
    • Tamino XML database (Software AG)
    • Open Source Server Solutions
    • Apache HTTP Web server and module mod_dav
      • Default storage: file system (mod_dav_fs)
      • Module Catacomb (mod_dav_repos) + Relational database ( http://catacomb.tigris.org )
  14. WebDAV / Meta Data Server (1) Tamino WebDAV Server
    • Commercial Server Solution (Software AG)
      • WebDAV Server
      • Tamino XML database backend
    • Advantages
      • Implements many WebDAV extensions (DASL, DeltaV, ACLs)
      • Fast XML processing
    • Good, but not free 
    • Used in DLR for use with DataFinder
      • One installation sufficient for many institutes
  15. WebDAV / Meta Data Server (2) Apache + mod_dav
    • Open Source solution (Apache Group)
      • Apache HTTP Web server
      • WebDAV extension module mod_dav
      • File system + (G)DBM database
    • Advantage: Free and easy to install 
    • … but some WebDAV features are not supported
      • No searching and versioning 
    Apache Core Server mod_http mod_auth_ldap mod_dav mod_dav_fs File system
  16. WebDAV / Meta Data Server (3) Catacomb
    • Open Source solution
      • Apache HTTP Web server + mod_dav
      • Module Catacomb (replacement for file system)
      • Relational database
    • Search and versioning implemented: Uses database search features
    • Open Source development at DLR ( http://catacomb.tigris.org )
    Apache Core Server mod_http mod_auth_ldap mod_dav mod_dav_fs File system DB (MySQL) Catacomb mod_dav_repos
  17. Mass Data Storage Data Stores Logical View User Client Storage Locations
  18. DataFinder Technical Aspects
    • Access privilege management
    • Authentication using WebDAV and LDAP
    • Authorization for users and groups based on WebDAV (ACP)
    • Client available on many platforms
    • Linux, Windows, …
    • Restricted by availability of Python 2.5 and Qt 3 + PyQt
    • Extensible through Python scripts
    • Python application programming interface (API)
    • Accessing data and meta data
  19. Python API User Client Extension with GUI import threading from datafinder.application import search_support from datafinder.gui.user import facade def searchAndDisplayResult(): &quot;&quot;&quot;Searches and displays the result in the search result logging window. &quot;&quot;&quot; query = &quot;displayname contains ‘test’ OR displayname == ‘ab’&quot; result = search_support.performSearch(query) resultLogger = facade.getSearchResultLogger() for path in result.keys(): resultLogger.info( &quot;Found item %s.&quot; % path) thread = threading.Thread(target=searchAndDisplayResult) thread.start()
  20. Python API Command Line Example (without GUI) # Get API from datafinder.application import ExternalFacade externalFacade = ExternalFacade.getInstance() # Connect to a repository externalFacade.performBasicDatafinderSetup(username, password, startUrl) # Download the whole content rootItem = externalFacade.getRootWebdavServerItem() items = externalFacade.getCollectionContents(rootItem) for item in items: externalFacade.downloadFile(item, baseDirectory)
  21. Additional “Batteries”… Used Libraries beyond the Python Standard Library (1)
    • PyQt (http://www.riverbankcomputing.co.uk/software/pyqt)
      • Interface to the Qt GUI framework (currently Qt 3)
      • Used for DataFinder UI layer
    • Pyparsing (http://pyparsing.wikispaces.com/)
      • Creating and executing simple grammars
      • Used for highlighting search expressions
    • python-ldap (http://python-ldap.sourceforge.net/)
      • Object-oriented API to access LDAP servers
      • Authentication against LDAP / ActiveDirectory server
    • paramiko (http://www.lag.net/paramiko)
      • SSH2 protocol implementation
  22. Additional “Batteries”… Used Libraries beyond the Python Standard Library (2)
    • PyGlobus (http://www-itg.lbl.gov/gtg/projects/pyGlobus)
      • Interface to The Globus Toolkit
      • Used for GridFTP Data Store
    • Boto (http://code.google.com/p/boto)
      • Interfaces to Amazon Web Services
      • Used for S3 (Simple Storage Service) Data Store
    • davlib (http://www.webdav.org/mod_dav/ davlib.py )
      • WebDAV client library
      • Used for core WebDAV functions
  23. WebDAV Client Library Support for DAV Extensions
    • Provides an object-oriented interface for accessing WebDAV server
      • Extracted from DataFinder source
    • WebDAV client-side library supports
      • Core WebDAV specification
      • Access Control Protocol
      • Basic Versioning (experimental)
      • DAV Searching and Locating
      • Secure HTTP connections
    • Implementation based on davlib and standard httplib
    • Apache License Version 2
    • Project Site: http://sourceforge.net/projects/pythonwebdavlib
  24. Working with DataFinder…
  25. Configuration and Customization Preparing DataFinder for certain “use cases”
    • Requirements Analysis
    • Analyze data, working environment, and users workflows
    • Configuration
    • Define and configure data model
    • Configure distributed storage resources (Data Stores)
    • Customization
    • Write functional extensions with Python scripts
  26. DataFinder Configuration Data Model and Data Stores
    • Logical view to data
    • Definition of data structuring and meta data (“data model”)
    • Separated storage of data structure / meta data and actual data files
    • Flexible use of (distributed) storage resources
      • File system, WebDAV, FTP, GridFTP
      • Amazon S3 (Simple Storage Service)
      • Tivoli Storage Manager (TSM)
      • Storage Resource Broker (SRB)
    • Complex search mechanism to find data
  27. Data Structure Mapping of Organizational Data Structures User Object (collection) Object (file) Relation Attributes (meta data) Project A Project B Project C File 1 File 2 Simulation I Experiment Simulation II Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value Project Mega Code Ultra User Eddie Key Value
  28. Meta Data
    • Describe and annotate data (“files”) and collections (“directories”)
    • Different levels of meta data
      • Required attributes defined by administrator
      • User is free to choose additional ones
    • Different types of meta data
      • String
      • Numbers (float, double, …)
      • Lists
      • Pictures
      • Links
    • Stored in XML format
    • User can search in meta data
  29. Impact for Users
    • DataFinder restricts the rights of users!
    • Enforcement of “good behavior”
    • User must comply to organizational standards
    • Data is stored in defined (directory) hierarchy on data server
    • Required meta data must be set prior upload
    • User have certain access rights within hierarchy
    “ Damn! I’m a great scientist! I want freedom to have my own directory layout…”
  30. Customization Python-Scripting for Extension and Automation
    • Integration of DataFinder with environment
    • User, infrastructure, software, …
    • Extension of DataFinder by Python scripts
    • Actions for resources (i.e., files, directories)
    • User interface extensions
    • Typical automations and customizations
    • Data migration and data import
    • Start of external application (with downloaded data files)
    • Extraction of meta data from result files
    • Automation of recurring tasks (“workflows”)
  31. DataFinder Scripting Downloading File and Starting Application # Download the selected file and try to execute it. from datafinder.application import ExternalFacade from guitools.easygui import * import os from tempfile import * from win32api import ShellExecute # Get instance of ExternalFacade to access DataFinder API facade = ExternalFacade.getInstance() # Get currently selected collection in DataFinder Server-View resource = facade.getSelectedResource() if resource != None: tmpFile = mktemp(ressource.name) facade.downloadFile(resource, tmpFile) if os.path.exists(tmpFile): ShellExecute(0, None, tmpFile, &quot;&quot; , &quot;&quot; , 1) else : msgbox( &quot;No file selected to execute.&quot; )
  32. Examples…
  33. Example 1: Turbine Simulation
  34. Example 1: Fluid Dynamics Simulation Turbine Simulation
    • Design of new turbine engines
    • High-resolution simulation of flow
      • Computational Fluid Dynamics (CFD)
      • Use of high-performance computing resources (Cluster / Grid)
      • Huge amounts of data (>100 GByte)
    • DataFinder used for
      • Management of results
      • Automation of simulation runs
      • Starting pre-/post processing
    • Used for CFD-code TRACE (DLR)
    • See http://www.aero-grid.de
    • Simulation steps (example):
    • splitCGNS
      • Preparing data for TRACE
    • TRACE (CFD solver)
      • Main computation
    • fillCGNS
      • Conflating results
    • Post Processing
      • Data reduction and visualization
    • Automation with customized DataFinder
    Turbine Simulation Data Model
  35. Turbine Simulation: Graphical User Interface
  36. Turbine Simulation: Customized GUI Extensions
    • Create new simulation
    • Start a simulation
    • Query status
    • Cancel simulation
    • Project overview
    1 2 3 4 5
  37. Turbine Simulation Starting External Applications
    • CGNS Infos / ADFview / CGNS Plot
    • TRACE GUI
    • Gnuplot
    1 2 3
  38. Example 2: Automobile Supplier
  39. Example 2: Automobile Supplier DataFinder for Simulation and Data Management
    • Tasks
    • Automation and management of simulation of customers
    • Mapping of specific work sequence
    • High flexibility regarding customers requirements
  40. Automobile Supplier Data Model
  41. Automobile Supplier Configuration of Customers Parameters
  42. Automobile Supplier Management of Simulations
    • Status overview
    • Create, change, and delete data sets
    • Manage versions of data files
    • Parameter overview
  43. Automobile Supplier Upload, Download, and Versioning of Files
    • Upload/download of results
    • Versioning of results
    • Script store results in DataFinder data structures
  44. Example 3: Air Traffic Management
  45. Example 3: Air Traffic Monitoring Database for Air Traffic Monitoring
    • Air traffic monitoring is important for research
      • Predictions of air traffic
      • New traffic management approaches
    • Usage of DataFinder
      • Database for traffic data and reports
      • Project oriented view
  46. Database for Air Traffic Monitoring Data Model and Data Migration
  47. Database for Air Traffic Monitoring Data Import Wizard
    • Import of all data sources (PDF/Word/text files, Excel, Access, …)
    • Classification into multiple categories
    • Prevention of duplicated data and consistent naming
  48. Database for Air Traffic Monitoring Search Results
  49. Current Work and Future Plans
    • Current work
    • Migration to Qt 4
    • Improved usage (e.g., search dialogs)
    • Integration with Shibboleth
    • Future
    • Web interfaces
    • Jython
      • Embedding in Java/Eclipse applications
      • Reuse of custom GUI dialogs
    • Migration to Py3k
  50. Am Ende… Hinweise
    • pyCologne: Python User Group Köln
      • Monatliche Treffen von Python-Interessierten aus dem Großraum Köln
      • http://www.pycologne.de
    • Interesse an spannenden Tätigkeiten in Luft- und Raumfahrt?
      • Feste Mitarbeit
      • Diplomarbeiten, Praktika
      • https://wiki.sistec.dlr.de/StellenAusschreibungen
  51. Links
    • DataFinder Web site
    • http://www.dlr.de/datafinder
    • Python WebDAV library
    • http://sourceforge.net/projects/pythonwebdavlib
    • Catacomb
    • http://catacomb.tigris.org
    • AeroGrid Project
    • http://www.aero-grid.de
  52.  
  53. Questions?

+ Andreas SchreiberAndreas Schreiber, 2 years ago

custom

2454 views, 2 favs, 2 embeds more stats

EuroPython 2008 (09.07.2008, Vilnius)

More info about this document

© All Rights Reserved

Go to text version

  • Total Views 2454
    • 2450 on SlideShare
    • 4 from embeds
  • Comments 2
  • Favorites 2
  • Downloads 66
Most viewed embeds
  • 3 views on https://wiki.sistec.dlr.de
  • 1 views on http://wildfire.gigya.com

more

All embeds
  • 3 views on https://wiki.sistec.dlr.de
  • 1 views on http://wildfire.gigya.com

less

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

Cancel
File a copyright complaint
Having problems? Go to our helpdesk?

Categories

Groups / Events