Maria Esteva


  • Today I am going to present
  • We are going through a shift in research paradigm, one that emphasizes the discovery of patterns and phenomena in large quantities of data, and the possibility of integrating data from different domains in order to relate, cross-reference, and contextualize those phenomena. Data-intensive science combines theory, experimentation, and simulation in relation to massive data. We know that posing a hypothesis opens up the dimensions of the research. Since these dimensions cannot be explored in advance, one must be prepared to carry out many experiments and simulations, which may in turn provide answers or open new questions. The big difference is that the data available for studying a phenomenon are not limited to what is assumed to be relevant to the original hypothesis, but …. To practice science … Data-intensive science draws on inductive reasoning, which turns a simple observation or thought into a general theory. In other words, it takes one piece of information and tries to generalise from there. A researcher's thought path goes from the specific to the general, and a hypothesis is formed. One of the basic concepts of informatics is the 'future value of primary data'. It is envisioned that the primary data (and, if possible, the actual samples) collected by one investigator will be archived and made available to other investigators, who may re-analyse the data from a different point of view, employ part of the data set not relevant to the first investigator, pool data with other studies, or conduct new measurements on the original samples (Koslow, 2000). This is entirely compatible with hypothesis-driven research. Indeed, a good hypothesis is not one that is likely to be correct, but one that opens up a new arena of investigation. Since this arena cannot be fully perceived in advance, one must be prepared to carry out new analyses not included in the original hypothesis. 
Yet most current experimental design simply ignores this fact: the investigator collects only those data that are deemed relevant to the original hypothesis, and when new information causes the original hypothesis to change, the investigator must plan a new experiment from scratch.
  • Could be long. Decentralized. Research data and processes are complex, evolving systems. Expertise and technologies supporting data change at a fast pace. The research lifecycle may include technologies to aid each stage. Transition from research data to data collection. Data risk increases as data is being processed. Data can easily become disorganized during the research process. Interoperability issues emerge. Security. Provenance and metadata. Analysis and interpretation. Sharing and access. Overlapping processes. Data-intensive research processes are complex, dynamic, and constantly evolving. On top of changing technologies and expertise comes the economic instability of …. It is in this process
  • The evolving collections cyberinfrastructure proposed here could be considered a cloud service (a cloud whose location we know). It gives users the flexibility to develop and archive collections in a continuum and at any point in their research lifecycle, while they, other users, or both perform data analysis and visualization tasks across seamless storage and computing environments. In addition, we can provide the same interfaces and/or functionality as DuraCloud on top of our resources. It incorporates the theories and practices of the scientific domain into data management activities. Map the collections' resources, networks, services, and functions to …. Encoded texts. Works of art. C. Borgman: call for action; challenges due to lack of funding. IPR. Evolving collection's architecture: map resources, network, services, and metadata to the collection's goals and functions. Multi-disciplinary: Paleo, CSR.
  • Collections activities are centralized in a data applications facility that provides different technical environments as collection building blocks, which users may select according to the functionality they need. It consists of 1.2 petabytes of online disk and a number of servers providing high-performance storage for all types of digital data. A high-performance parallel file system is directly accessible from all of TACC's High Performance Computing (HPC) resources, enabling mathematical computation and visual analysis of petabyte-scale datasets. (IMPORTANCE). It has web-based access and other network protocols for storing and retrieving data. For database collections, the facility provides flexibility in the RDBMSs that users may choose from. We also maintain open source, domain-specific databases such as ARK [7] and Specify [8], which support archaeology and natural history collections respectively. Collections requiring long-term preservation are managed within iRODS [9]. Off-site replication is done in a mass storage tape system with a capacity of 10 PB, and geographical replication is accomplished through an agreement with Indiana University's Research Computing Division [10]. Close supervision, parts replacement contracts, and a frequent schedule of upgrades are in place for maintaining the infrastructure. This model is based on TACC's experience managing systems to assure 24/7 services and data security. High Performance Computing (HPC); petabytes of distributed storage; networks with parallel access and processes; integration with compute nodes; remote access; mechanisms for sharing and for restricting data access; RDBMS, GIS, web; flexibility to configure systems; open source libraries; automation and generalization; 24/7 administration. Corral consists of 6 petabytes of online disk and a number of servers providing high-performance storage for all types of digital data. 
It supports MySQL and Postgres databases, a high-performance parallel file system, web-based access, and other network protocols for storing and retrieving data to and from sophisticated instruments, HPC simulations, and visualization laboratories. The parallel file system is accessible directly from TACC's world-class computational resources, Stampede and Lonestar, as well as Stallion, the world's largest tiled display, enabling both mathematical and visual analysis of petabyte-scale datasets.
  • An example of what such infrastructure allows is the work of the UT Center for Space Research (CSR), which stores very large sensor, satellite, aerial, and radar datasets that it curates for dissemination. Within three days of the 2010 earthquake in Haiti, in collaboration with TACC, the repository and file system used for managing CSR data in Corral was turned into a web repository for sharing data. This allowed CSR to access, organize, retrieve, and post the data required by the emergency operations in the region [11]. This kind of quick repurposing allowed a multi-terabyte collection managed through one application to become accessible almost immediately through a password-protected web application on another server. Improve here.
  • Collections that
  • When possible, to facilitate collections organization and avoid manual metadata entry, descriptive metadata is automatically extracted from the collection's record-keeping system at ingest. This process requires an informative and regular file-naming and/or directory-labeling convention. It also involves prior work mapping the descriptive data points to standard metadata schemas such as Dublin Core (DC) or Visual Resources Association (VRA) Core. That mapping results from consulting with our team and training users on the required standards and practices. Implemented as an iRODS rule, a Jython script parses directory labels and file-naming conventions as files are ingested into iRODS. The extracted descriptive metadata is packaged along with the technical metadata as a METS document and registered with the iRODS metadata catalog [13]. This process is being implemented for the ICA collection and for McFarland's collection, which for a long time has used a systematic naming convention that includes image title or terms, geographical location, camera type codes, and a version control number. To access his files, he may search by any of these elements.
  • Preservation services for the collections stored in iRODS include rules to generate file checksums, automatic off-site and geographical replication, bulk extraction of metadata using FITS [12], encoded as Preservation Metadata (PREMIS) within a METS document, and finally registration of the metadata in the iRODS catalogue. Beyond basic bit-level preservation and the services mentioned above, we address the domain scientists' conception of data preservation. In the case of archaeology collections, preservation is associated with maintaining the relationships between the objects found in the same context in the excavation. To ensure that the archived data can render a representation of the site, the Institute of Classical Archaeology chose to have two collection instances within the storage facility. A presentation instance resides in the ARK database and web site, which provides interactive features and lets users study data objects in relation to their geospatial location and to the researchers' interpretations. The archival instance, stored in a hierarchical directory structure in iRODS and replicated off site, preserves the contextual relationships between the raw images of the objects found in the excavation and the corresponding image versions and documentation generated by researchers on the site and throughout the research lifecycle. These relationships are gathered and preserved through a complex metadata system reflected in METS documents, so that when one object is retrieved, all of the related objects are retrieved as well. In this way, if the ARK database ceases to be supported, the archival instance will serve to reconstruct the site. Workflow of the Benson. IPRES.
  • As a service and research organization, TACC offers up to five terabytes of free storage space and basic collection services to researchers on campus, and there is a fee structure for collections requiring more storage space and/or consulting services. We are currently considering following Princeton's institutional repository business model and …. To support complex collection services, the group faces the same limitations and possibilities as the researchers who create the collections. Thus, the group participates in grant proposals with research partners and provides services in exchange for funding staff hours. In addition, campus organizations use the data facilities as a dark archive for annual fees, which are used to purchase hardware. TACC's cyberinfrastructure aims to overcome the uncertainties of future research funding by embracing the notion that if a collection is built soundly, it will be used and supported, or it can be easily transferred to other archives or managed within other systems. After 4 years we are expanding to 5 petabytes of data. We have so much … $500/TB/year. "Data Consulting": this would involve some simple assessment efforts and some recommendations on proper data management practices, perhaps a willingness to help customize a template data management plan, setup of data replication scenarios, and provision of tools for metadata extraction and/or format conversion (but not ongoing involvement of staff in such metadata or format-related efforts). The basic idea here is that we help with the initial setup, and perhaps with periodic re-assessments, but don't actively manage data on behalf of users. I would propose this go at a rate of $750 per TB per year. 
"Data Management or Curation": this would be something more along the lines of what we've done for ICA, where we really get into the guts of the collection, help with reorganizing the data, and develop tools or workflows for metadata extraction and/or format conversion, derivative generation, etc. It would again be heavily frontloaded in terms of the work, but could involve things like monthly checksum verification and other reporting efforts, active collaboration with researchers to help categorize new data types and improve metadata or search mechanisms, and so on. I would propose this go for a rate of $1000/TB/year. In addition to these two levels of data service, I think we also need another category for database and/or web applications, which can be relatively small in absolute size but very complex in terms of data structures, and which require a lot more time on our part to help develop schemas, improve existing web frameworks, import external sources of data, and so on. Here I would propose that we start at a rate of $5000 for the basic service, with costs going up from there based on the number of tasks we need to perform, the number of "moving parts" in the overall application framework, or other factors. I think for most projects we'd have to do an assessment and figure out what it will really cost, so the goal of the base price is really just to establish a level where "you must be at least this tall to ride", i.e. if you aren't willing to contemplate an investment of at least $5K, you should be looking to do it yourself or find someone else. 
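The proposed rates in the notes above can be turned into a rough annual estimate. The following is only a sketch under stated assumptions: the names `RATES_PER_TB_YEAR`, `FREE_TB`, `APP_BASE_FEE`, and `annual_estimate` are hypothetical, the $500/TB/year figure is assumed to be the plain storage rate, the 5 TB free allowance is assumed to be deducted in every tier, and the $5000 application fee is assumed to be a flat starting amount added once. The notes do not confirm how these figures combine.

```python
# Rates quoted in the notes, in US dollars per terabyte per year.
# Treating $500 as the plain storage rate is an assumption.
RATES_PER_TB_YEAR = {
    "storage": 500,
    "consulting": 750,
    "curation": 1000,
}
FREE_TB = 5          # free allowance for campus researchers
APP_BASE_FEE = 5000  # starting price for database/web application work


def annual_estimate(size_tb, tier="storage", with_app=False, free_tb=FREE_TB):
    """Return a rough yearly cost in dollars for a collection of size_tb terabytes."""
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    if tier not in RATES_PER_TB_YEAR:
        raise ValueError(f"unknown tier: {tier!r}")
    # Only the storage above the free allowance is billed (assumption).
    billable_tb = max(0.0, size_tb - free_tb)
    total = billable_tb * RATES_PER_TB_YEAR[tier]
    if with_app:
        # Flat starting fee; the notes say real costs rise after an assessment.
        total += APP_BASE_FEE
    return total


if __name__ == "__main__":
    for tier in RATES_PER_TB_YEAR:
        print(tier, annual_estimate(12, tier))
```

Since the notes stress that most projects need an individual assessment, a function like this would at best give a starting figure for that conversation.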

    1. Maria Esteva, Texas Advanced Computing Center, University of Texas at Austin. PANEL
    2. Cyberinfrastructure for research data management. Maria Esteva, Texas Advanced Computing Center, University of Texas at Austin. 2Eie, May 2013, Cali, Colombia
    3. Data & research • Data-intensive science – Theory, experiments, and simulations in the context of massive data • Sustainable data – Documented, stable, authentic • Data for disseminating knowledge, citing, and reuse
    4. Building collections • Complex, constantly evolving research projects • Technology and knowledge change continuously • Unstable research funding • Collections are most vulnerable during the research process • A collection's architecture and functionality may involve several technologies
    5. Perspectives • Data curation is centered on the problem that the research addresses • Information science approach • Infrastructure approach – Consider infrastructure and services from the planning of the research project and throughout the project lifecycle
    6. Data infrastructure @TACC • Multidisciplinary team • Corral • 6 petabytes of online disk • Lustre parallel file system • Data transfer at 1-10 GB/s • Web access • Configuration flexibility • Open source libraries • 24/7 system security and maintenance
    7. Databases • Relational databases: MySQL, PostgreSQL, SQL Server – Pecan Street Project • ARK and Specify • GIS (geographic information system) – FASTI – Institute of Classical Archaeology
    8. Flexibility • Center for Space Research (CSR) – Storage of data from satellites, radars, and sensors – Haiti earthquake, 2010 – The CSR data repository was turned into a web repository for sharing data with rescue workers.
    9. Multiple possibilities • Data management during the research project • Temporary data storage for computational processes • Access to research collections • Dark archive • The researcher is the curator • The TACC team offers and implements technical solutions for the curation process and collaborates on organizing, standardizing, and providing access to data
    10. Implementing collections • TACC administers access to the systems and installs the servers, databases, libraries, and dependencies. • Users have access to their code • Collections triage – ICA, 5 petabytes of disorganized data • Users from different domains • Users with different levels of technical knowledge
    11. Workflows – Different data flows – Seamless transition between storage and analysis systems.
    12. Metadata and integration
    13. Access • Open web access for the public • Closed access during the embargo period • WebDAV • Password protected • Access restricted to the research team • From TACC's visualization systems
    14. Preservation • iRODS: distributed file broker • File replication to Ranch, a tape archive, and geographical replication • Security and maintenance • Data authenticity checks • Automatic capture of technical metadata • Perspective on what
    15. Administrative model • 5 TB of free storage for University of Texas researchers • Annual cost structure based on staff fees – Consulting, data curation, databases, and web applications • Serves as a dark archive to cover hardware costs • We participate in research grants
    16. Data@TACC • Weijia Xu • Christopher Jordan • David Walling • Tomislav Urban • Siva Kulaskerian
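The notes for the metadata and preservation slides describe two ingest-time steps: deriving descriptive metadata from a regular file-naming convention, and generating checksums so that authenticity can be checked later. The sketch below illustrates both in plain Python. It is not the Jython script or the iRODS rules used at TACC. The file-name pattern `<title>_<location>_<camera>_v<version>.<ext>`, the Dublin Core field mapping, and the function names `describe`, `sha256_of`, and `verify` are all assumptions made for illustration.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical naming convention: <title>_<location>_<camera>_v<version>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<title>[^_]+)_(?P<location>[^_]+)_(?P<camera>[^_]+)"
    r"_v(?P<version>\d+)\.(?P<ext>\w+)$"
)


def describe(filename):
    """Map a conforming file name to a few Dublin Core style fields.

    Returns None when the name does not follow the assumed convention.
    """
    match = NAME_PATTERN.match(Path(filename).name)
    if match is None:
        return None
    return {
        "dc:title": match["title"].replace("-", " "),
        "dc:coverage": match["location"].replace("-", " "),
        "dc:format": match["ext"].lower(),
        "dc:description": (
            f"camera code {match['camera']}, version {int(match['version'])}"
        ),
    }


def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_checksum):
    """Return True when the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == expected_checksum


if __name__ == "__main__":
    print(describe("temple-column_metaponto_D70_v02.tif"))
```

Returning None for non-conforming names reflects the point made in the notes that automatic extraction only works when the naming convention is informative and regular. Files that fail the pattern would need manual description or a revised convention.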