Oracle Database is a relational database management system produced by Oracle Corporation. It stores data logically in tables, tablespaces, and schemas, and physically in datafiles. The database, SGA (containing the buffer cache, redo log buffer, and shared pool), and background processes like SMON, PMON, and DBWR work together for high performance and reliability. Backup methods and administrative tasks help maintain the database.
Oracle stores data logically in tablespaces and physically in datafiles associated with the corresponding tablespace. Tablespaces can be created, altered by resizing datafiles, have additional datafiles added, and dropped along with their contents. Users are created with a default tablespace assigned and granted privileges like connect and resource privileges.
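A minimal sketch of the DDL described above, issued here through the python-oracledb driver; the connection details, tablespace name, and datafile paths are assumptions for illustration, not values from the original presentation.

```python
# Sketch: create a tablespace, resize/add datafiles, and create a user with
# CONNECT/RESOURCE privileges. Connection details and paths are hypothetical.
import oracledb  # pip install oracledb

conn = oracledb.connect(user="system", password="manager", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Logical storage (tablespace) backed by a physical datafile
cur.execute("""
    CREATE TABLESPACE app_data
    DATAFILE '/u01/app/oracle/oradata/ORCL/app_data01.dbf' SIZE 100M
""")

# Resize the datafile, or add another datafile to the same tablespace
cur.execute("""
    ALTER DATABASE DATAFILE
    '/u01/app/oracle/oradata/ORCL/app_data01.dbf' RESIZE 200M
""")
cur.execute("""
    ALTER TABLESPACE app_data
    ADD DATAFILE '/u01/app/oracle/oradata/ORCL/app_data02.dbf' SIZE 100M
""")

# Create a user with a default tablespace and grant basic privileges
cur.execute("CREATE USER app_user IDENTIFIED BY app_pwd DEFAULT TABLESPACE app_data")
cur.execute("GRANT CONNECT, RESOURCE TO app_user")

# Dropping a tablespace together with its contents and files:
# cur.execute("DROP TABLESPACE app_data INCLUDING CONTENTS AND DATAFILES")
conn.close()
```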
This document provides an introduction to big data and related technologies. It defines big data as datasets that are too large to be processed by traditional methods. The motivation for big data is the massive growth in data volume and variety. Technologies like Hadoop and Spark were developed to process this data across clusters of commodity servers. Hadoop uses HDFS for storage and MapReduce for processing. Spark improves on MapReduce with its use of resilient distributed datasets (RDDs) and lazy evaluation. The document outlines several big data use cases and projects involving areas like radio astronomy, particle physics, and engine sensor data. It also discusses when Hadoop and Spark are suitable technologies.
This document provides an overview of an Oracle DBA walkthrough presentation. It includes a table of contents covering topics like the duties of database administrators, memory and process architecture, instance startup and shutdown, and tools for DBAs. It also introduces the presenter, Akash Pramanik, who is an Oracle DBA by profession and freelance trainer.
Hadoop Distributed File System (HDFS) is the storage layer of Apache Hadoop, an open-source software framework that provides distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop has two main components: HDFS for storage and MapReduce for distributed processing. HDFS uses a master-slave architecture with a NameNode master and DataNode slaves. The NameNode manages the file system namespace and metadata, while DataNodes store data blocks and report to the NameNode.
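To make the NameNode/DataNode split concrete, here is a small sketch that copies a local file into HDFS and inspects its block placement using the standard `hdfs` command-line tools; the file and directory paths are assumed.

```python
# Sketch: basic HDFS operations via the standard `hdfs` CLI (paths are hypothetical).
# The NameNode resolves the namespace; the DataNodes hold the replicated blocks.
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Copy a local file into HDFS; it is split into blocks and replicated across DataNodes
hdfs("-put", "measurements.csv", "/data/measurements.csv")

# List the directory (the metadata comes from the NameNode)
print(hdfs("-ls", "/data"))

# Report block locations and replication for the file
report = subprocess.run(
    ["hdfs", "fsck", "/data/measurements.csv", "-files", "-blocks"],
    capture_output=True, text=True, check=True)
print(report.stdout)
```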
In this presentation from the DDN User Meeting at SC13, Erik Deumans from SSERCA describes how the institution is sharing data with WOS from DDN.
Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/
This document summarizes a student project on developing a decentralized file sharing application. It presents an overview of the project including the problem statement, abstract, design details, experimental results, performance evaluation, test cases, and conclusion. The project aims to create a platform that allows users to share and store data in a distributed network without a centralized server. The key aspects covered are file segmentation, encryption using AES, generation of distributed hashes using SHA-256, and a peer-to-peer (P2P) network architecture with distributed hash tables. Test cases are provided to validate user accounts, file encryption/decryption, file segmentation, and distributed hash table generation. Future work proposed includes creating digital coins, maintaining wallets, and converting coins to currencies.
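A rough sketch of the pipeline that summary describes: split a file into segments, encrypt each segment, and derive a SHA-256 digest per segment to key a distributed hash table. Fernet (an AES-based recipe from the `cryptography` package) stands in for the project's actual AES scheme, and the segment size is an assumption.

```python
# Sketch: segment a file, encrypt each segment, and hash it for a DHT-style index.
# Fernet (AES-128-CBC + HMAC under the hood) is a stand-in for the project's AES setup.
import hashlib
from cryptography.fernet import Fernet  # pip install cryptography

SEGMENT_SIZE = 256 * 1024  # 256 KiB per segment (assumed)

def segment_and_encrypt(path: str):
    key = Fernet.generate_key()
    cipher = Fernet(key)
    entries = []
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(SEGMENT_SIZE):
            encrypted = cipher.encrypt(chunk)
            digest = hashlib.sha256(encrypted).hexdigest()  # key for the distributed hash table
            entries.append({"segment": index, "sha256": digest, "payload": encrypted})
            index += 1
    return key, entries

if __name__ == "__main__":
    key, entries = segment_and_encrypt("example.bin")
    for entry in entries:
        print(entry["segment"], entry["sha256"])
```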
Jagadish Venkatesh has over 10 years of experience in system administration and database administration. He has extensive experience administering Oracle databases, Linux servers, and Windows servers. Currently he works as a Senior Engineer providing remote infrastructure services to Oracle clients at Cambridge Technology India. Previously he has worked as a technical consultant and in various roles supporting banking infrastructure including an ATM switch. He has certifications in Oracle Database Administration and expertise across a range of technologies.
Cloudera Impala - HUG Karlsruhe, July 04, 2013 (Alexander Alten)
Low latency data processing with Impala
Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), JDBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
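A minimal sketch of issuing an interactive query against an Impala daemon from Python with the impyla client; the host, port, and table name are assumptions.

```python
# Sketch: run a Hive-SQL query against Impala over HDFS/HBase-backed tables.
# Host, port, and table name are hypothetical.
from impala.dbapi import connect  # pip install impyla

conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cur.fetchall():
    print(page, hits)
conn.close()
```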
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for data storage, which partitions data into blocks and replicates them across nodes for fault tolerance. The master node tracks where data blocks are stored, and worker nodes execute tasks like mapping and reducing data. Hadoop provides scalability and fault tolerance but is slower for iterative jobs than Spark, which keeps data in memory. In the Lambda architecture, Hadoop typically serves the batch layer while a separate speed layer handles real-time processing, letting the two scale independently.
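A minimal word-count mapper and reducer in the style used with Hadoop Streaming, to make the map/reduce split concrete; the streaming jar path and input/output paths mentioned in the comments are assumptions.

```python
#!/usr/bin/env python3
# Sketch: word-count mapper and reducer in the Hadoop Streaming style.
# Typically submitted as two scripts via something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
# (jar and data paths are hypothetical).
import sys

def mapper(lines):
    # Emit one (word, 1) pair per word; Hadoop shuffles and sorts by key.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Lines arrive sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```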
S. Prabhu is a highly experienced Oracle Database Administrator with over 13 years of experience installing, configuring, maintaining and tuning Oracle databases from versions 8i through 11g on various operating systems. He has extensive expertise in areas such as Oracle RAC, Data Guard, Golden Gate, ASM, backup strategies using RMAN, performance tuning using AWR/ADDM and SQL tuning. Prabhu has worked with global clients in both Fortune 500 companies and service providers on large, complex databases and holds an MCA from Madras University.
2014 CrossRef Annual Meeting: CrossRef System Update (Crossref)
The document summarizes system updates made by Crossref in 2014, including improvements to infrastructure like hardware, network resiliency and production systems that reduced DNS latency. Core system changes enhanced performance and call-back notifications. Features were added for books, standards, metadata queries and schema. Planned future updates involve integrating ORCIDs, cleaning article titles, modeling relations, redesigning stored queries and adding new content types.
The document provides an overview of Oracle Database including its architecture, components, and functions. It discusses Oracle's three-level database architecture consisting of the external, conceptual, and internal levels. It also describes Oracle's memory structure including the shared pool, database buffer cache, and redo log buffer. Key Oracle background processes like DBWR, LGWR, PMON, SMON, and CKPT are summarized.
This document provides an introduction to Oracle 10g, including its architecture and components. It discusses the Oracle instance, System Global Area (SGA) and Program Global Area (PGA). It describes the key background processes like SMON, PMON, DBWn, LGWR, CKPT and ARCn. It also explains the critical Oracle files - parameter file, control files, redo log files and data files. Finally, it outlines Oracle's logical data structures of tablespaces, segments, extents and data blocks.
Construindo Data Lakes - Visão Prática com Hadoop e BigData (Marco Garcia)
My presentation on building data lakes for big data using Hadoop as the data platform. Learn more about our consulting and training work in Hadoop Hortonworks, BigData, Data Warehousing, and Business Intelligence.
This document summarizes the skills and experience of an Oracle DBA named Bashapattan, who has over 3.8 years of experience providing 24/7 support for Oracle 10g and 11g databases. His responsibilities have included database creation, configuration of users and privileges, backups using RMAN, performance tuning, and work with RAC clusters. Previous work experience is described for two projects supporting Oracle databases, with responsibilities such as installation, administration, monitoring, patching, and upgrades.
Hadoop, Evolution of Hadoop, Features of Hadoop (Dr Neelesh Jain)
The presentation explains Hadoop, the evolution of Hadoop, and the features of Hadoop, following the RGPV, BU, and MCU syllabus for BCA, MCA, and B.Tech students.
This document provides an introduction to HDFS (Hadoop Distributed File System). It discusses what HDFS is, its core components, architecture, and key elements like the NameNode, metadata, and blocks. HDFS is designed for storing very large files across commodity hardware in a fault-tolerant manner and allows for streaming access. While HDFS can handle small datasets, its real power is with large and distributed data.
20160922 Materials Data Facility TMS Webinar (Ben Blaiszik)
Fall 2016 TMS Webinar on Data Curation Tools. Slides for the Materials Data Facility presentation on data services (publish and discover) as described by Ben Blaiszik. See http://www.materialsdatafacility.org for more information.
An Oracle database consists of physical files on disk that store data and logical memory structures that manage the files. The database is made up of data files that contain tables and indexes, control files that track the physical components, and redo log files that record changes. The instance in memory associates with one database and manages access through background processes. The database is divided into logical storage units called tablespaces that map to the physical data files. Common tablespaces include SYSTEM, SYSAUX, undo and temporary tablespaces.
Oracle architecture with details (Yogiji Creations)
Oracle is a database management system with a multi-tiered architecture. It consists of a database on disk that contains tables, indexes and other objects. An Oracle instance contains a memory area called the System Global Area that services requests from client applications. Background processes facilitate communication between the memory structures and database files on disk. Logical database structures like tablespaces, segments, extents and blocks help organize and manage the physical storage of data.
IOUG Collaborate 18 - ASM Concepts, Architecture and Best Practices (Pini Dibask)
Pini Dibask presented on Oracle ASM concepts, architecture, and best practices. Some key points:
- ASM is Oracle's recommended storage management solution and provides high performance storage for single-instance and RAC databases.
- ASM uses disk groups and stripes and mirrors data across disks for redundancy and load balancing. It also rebalances data automatically during storage changes.
- Administering ASM involves tasks like starting and stopping the ASM instance, managing disk groups and disks, and monitoring storage usage and I/O balance.
- Best practices for ASM include using separate disk groups for data and recovery files, ensuring consistent disk performance, monitoring I/O balance, and in
Enabling ABAC with Accumulo and Ranger integration (DataWorks Summit)
This talk will cover the topics of attribute-based access control (ABAC), Apache Ranger, and Apache Accumulo.
Attribute-based access control (ABAC) is a relatively new standard from NIST that provides a flexible framework that replaces the complex matrix nightmare scenario of user/group/role mappings in enterprise role-based access control (RBAC) systems. ABAC provides the ability to manage and enforce authorizations for both person and non-person entities and makes policy decisions based on subject, action, resource, and environment attributes.
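A toy sketch of an ABAC policy decision: the decision function looks only at subject, action, resource, and environment attributes rather than a role matrix. The attribute names and the rule itself are illustrative and are not Ranger's or Accumulo's actual model.

```python
# Toy ABAC check: permit/deny is computed from subject, action, resource, and
# environment attributes. Attribute names and the rule itself are illustrative.
from datetime import time

def is_permitted(subject, action, resource, environment) -> bool:
    return (
        subject.get("clearance", 0) >= resource.get("sensitivity", 0)
        and action in resource.get("allowed_actions", set())
        and environment.get("network") == "corporate"
        and time(8, 0) <= environment.get("time_of_day", time(0, 0)) <= time(18, 0)
    )

decision = is_permitted(
    subject={"id": "analyst-42", "clearance": 3},   # person or non-person entity
    action="read",
    resource={"cell": "row7:colA", "sensitivity": 2, "allowed_actions": {"read"}},
    environment={"network": "corporate", "time_of_day": time(10, 30)},
)
print("PERMIT" if decision else "DENY")
```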
Ranger and Accumulo are two technologies that, when combined, allow creation of systems that support ABAC at the cell-level. Ranger provides an extensible framework for distributed policy decision and enforcement with centralized administration as well as auditing authorization decisions within the Apache Hadoop ecosystem. Accumulo's pluggable security model enables the integration of Ranger providing GUI- and REST-driven authorization management, user and group synchronization with LDAP endpoints, and a centralized authorization audit repository.
The combination of Ranger and Accumulo enables alignment with NIST ABAC standards for the Hadoop ecosystem. This talk will cover why that matters, the mechanics of Ranger's authorization model, and demonstrate an integration of the two systems.
Speakers
John Highcock, Systems Architect, Hortonworks
Marcus Waineo, Principal Solutions Engineer, Hortonworks
The document provides an overview of the Oracle DBA course, including its objectives to identify the various components of the Oracle architecture and learn how to perform tasks like starting and shutting down a database. It then describes the key components of the Oracle architecture, including the Oracle database (physical files), Oracle instance (memory structures and processes), System Global Area (SGA) used to store shared database information, and database buffer cache which stores recently used data blocks retrieved from data files.
Dr. Edward (Eddie) Bortnikov (Senior Director of Research) @ Verizon Media:
Ingestion and queries of real-time data in Druid are performed by a core software component named Incremental Index (I^2).
I^2’s scalability is paramount to the speed of the ingested data becoming queryable as well as to the operational efficiency of the Druid cluster.
The current I^2 implementation is based on the traditional ordered JDK key-value (KV) map.
We present an experimental I^2 implementation that is based on a novel data structure named OakMap - a scalable thread-safe off-heap KV-map for Big Data applications in Java.
With OakMap, I^2 can ingest data at almost 2x speed while using 30% less RAM.
The project is expected to become GA in 2020.
The document describes the architecture and design of the Hadoop Distributed File System (HDFS). It discusses key aspects of HDFS including its master/slave architecture with a single NameNode and multiple DataNodes. The NameNode manages the file system namespace and regulates client access, while DataNodes store and retrieve blocks of data. HDFS is designed to reliably store very large files across machines by replicating blocks of data and detecting/recovering from failures.
The document discusses view_hdf, a visualization and analysis tool developed to access data from HDF products generated by NASA's CERES Data Management System. view_hdf allows users to select and plot variables from CERES Science Data Sets without needing knowledge of HDF formats. It provides capabilities such as 2D and 3D graphics, geographic mapping, statistics computation, and saving/printing plots. Contact information is provided for accessing the CERES data center and documentation for view_hdf.
HDF is a file format for managing scientific data in heterogeneous environments. It provides data interoperability through I/O software, utilities, and search/access tools. HDF supports a variety of data types and structures, large datasets, metadata, portability across systems, fast I/O, and efficient storage. HDF-EOS extends HDF to define standard profiles for organizing Earth science remote sensing and in-situ data.
The document summarizes a workshop between NASA, software developers, science communities, and data centers to discuss HDF and HDF-EOS tools. Key topics included interactions between these groups, technical details of EOSDIS and HDF-EOS, available and needed tools, resources for developers, and next steps to continue engagement through websites and future meetings.
The HDF Group provides software for managing large, complex data and services to support users of this technology. It derives most of its revenue from projects related to earth science, including supporting HDF-EOS, JPSS, and other earth science projects. It maintains various tools for working with HDF files and conducts maintenance, support, and development activities to support new versions and capabilities of HDF libraries and software.
Aashish Chaudhary gave a presentation on Kitware's work with scientific computing and visualization using HDF. HDF is a widely used data format at Kitware for domains like climate modeling, geospatial visualization, and information visualization. Kitware is looking to improve HDF support for cloud and web environments to enable streaming analytics and web-based data analysis. The company also aims to further open source collaboration and scientific computing.
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
The presentation will provide an overview of subsetting software development activity at UAH. Updates have been made to all packages, reflecting the latest versions of HDF5 and HE5. The library of tools (HSE) for subsetting HDF-EOS data is up-to-date for SGI, Sun, and Linux platforms. Subsetting software is operational at NSIDC DAAC and GDAAC, in testing at LPDAAC. Ongoing work and plans will also be described, including row/column subsetting and index subsampling.
This document provides an overview and examples of accessing cloud data and services using the Earthdata Login (EDL), Pydap, and MATLAB. It discusses some common problems users encounter, such as being unable to access HDF5 data on AWS S3 using MATLAB or read data from OPeNDAP servers using Pydap. Solutions presented include using EDL to get temporary AWS tokens for S3 access in MATLAB and providing code examples on the HDFEOS website to help users access S3 data and OPeNDAP services. The document also notes some limitations, such as tokens being valid for only 1 hour, and workarounds like requesting new tokens or using the MATLAB HDF5 API instead of the netCDF API.
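For the OPeNDAP side of the workflow described above, here is a minimal Pydap sketch; the dataset URL, variable name, and the Earthdata Login session helper usage are assumptions standing in for whatever endpoint a user actually needs.

```python
# Sketch: open an OPeNDAP dataset with Pydap behind Earthdata Login.
# The URL, credentials, and variable name are placeholders.
from pydap.client import open_url
from pydap.cas.urs import setup_session  # Earthdata Login (URS) helper

url = "https://opendap.example.nasa.gov/opendap/hyrax/example/granule.h5"  # hypothetical
session = setup_session("my_urs_username", "my_urs_password", check_url=url)

dataset = open_url(url, session=session)
print(list(dataset.keys()))               # variables advertised by the server

# Request only a subset of one variable; the server performs the slicing
temperature = dataset["Temperature"][0:10, 0:10]
print(temperature.data)
```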
This document summarizes the fifth annual HDF workshop sponsored by ESDIS and NCSA. It provides an overview of the status of ESDIS, HDF/HDF-EOS, and plans for the future. Over 750 terabytes of Terra and Landsat 7 data have been processed and made available. Some instruments like ASTER and CERES now have validated data while others like MODIS are still being reprocessed. Future plans include installing data pools at DAACs and procuring an EMD contract to support ongoing EOS operations. The community advisory process involves groups like UWGs, DAWG, and SWGD to provide feedback. HDF is a file format for scientific data while HDF-EOS is the
This document provides an overview of HDF-EOS, which is an extension to HDF that defines standard data structures for remote sensing and in-situ data with tightly coupled geolocation information. It describes the core components of HDF-EOS files, including Grid, Swath, and Point structures, and provides examples. It also outlines the development of an HDF5-based version to overcome limitations of the HDF4-based library and allow for larger files.
The SEEDS (Strategic Evolution of ESE Data Systems) process is a strategy for maximizing the utility of Earth science data within NASA. It involves formulating best practices for data systems through a study of past and present systems. The SEEDS process engages data providers and users to incorporate lessons learned. It focuses on adopting common standards to increase flexibility rather than developing new systems. The document outlines the status of the SEEDS formulation studies, including the standards process study which recommends adopting the IETF process for approving and developing interoperability standards. It notes that standards working groups will be established to evaluate existing standards like HDF for adoption.
The document provides an introduction to PREMIS (Preservation Metadata: Implementation Strategies) and its application in audiovisual archives. It discusses the challenges of digital preservation and the need for preservation metadata to ensure long-term access. It then summarizes the key aspects of PREMIS, including the PREMIS Data Dictionary, its relationship to the OAIS reference model, the five interacting entities in the PREMIS data model, and issues around implementing PREMIS in archives.
Minerva is a storage plugin of Drill that connects IPFS's decentralized storage and Drill's flexible query engine. Any data file stored on IPFS can be easily accessed from Drill's query interface, just like a file stored on a local disk.
Visit https://github.com/bdchain/Minerva to learn more and try it out!
(1) The document discusses challenges of managing large and complex datasets for interdisciplinary research projects. It presents Hadoop and the Etosha data catalog as solutions.
(2) Etosha aims to publish and link metadata about datasets to enable discovery and sharing across distributed research clusters. It focuses on descriptive, structural and administrative metadata rather than just technical metadata.
(3) Etosha's architecture includes a distributed metadata service and context browser that can query metadata from different Hadoop clusters to support federated querying and subquery delegation.
This document introduces an HDF-EOS workshop for data producers, users, and tool developers. HDF-EOS is the baseline standard data format for EOS data, based on HDF but adding features for EOS. The workshop aims to provide in-depth information about HDF and HDF-EOS through hands-on tutorials for tools, utilities, and programming. It is sponsored by NASA and the ECS team, and provides individual consultation. HDF-EOS supports a large community that will need various analysis tools, and resources like documentation, sample data, and experts are available to support the HDF-EOS format.
We will summarize current status of HDF-EOS and associated tools. Update on HDF-EOS, HDFView plug-in and The HDF-EOS to GeoTIFF (HEG) conversion tool, including recent changes to the software, ongoing maintenance, upcoming releases, future plans, and issues will be discussed.
We will also summarize the status of HDF-EOS RFC. The HDF-EOS plug-in for the THG-developed tool, HDFView, has been enhanced. The plug-in offers browse capability for both HDF4 and HDF5 - based HDF-EOS files. HDFView can also process vanilla HDF4 and HDF5 files. New features including support for Point and Zonal Average objects have been added. A port to Mac OS X version will be available in next release.
The HDF-EOS to GeoTIFF (HEG) conversion tool has been augmented to include new projections, and support for additional AMSR-E and AIRS products. Subsetting features have also been augmented. The tool is available in both stand-alone and EOS DAAC online versions.
This document provides an overview of the Earth Science Markup Language (ESML). ESML is an XML-based interchange format that allows applications and services to access heterogeneous earth science data regardless of the underlying data format. It provides syntactic, semantic, and content metadata that describe data structures, meanings, and contents in a machine-readable way. The document outlines the need for such an interchange format, describes the components of ESML including the schema, libraries, and tools, and provides examples of writing ESML descriptions for different types of data files.
Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe... ({code} by Dell EMC)
New digital business models facilitated by containers require collecting and analyzing device data. Apache Mesos removes the need to build separate stacks and combines optimized application containers and data analytics into a single platform. In this session, we will explore new approaches to data analytics using REX-Ray as a container persistence tool and the SMACK stack - Spark, Mesos, Akka, Cassandra, Kafka – a set of tools for building data and messaging layers for digital engagement apps.
The document discusses technical challenges and approaches for building an open ecosystem of heterogeneous heritage collections. It describes Echoes, an open-source project that provides integrated access to digital cultural assets from different institutions. The key challenges addressed include dealing with different metadata schemas, poor data quality, data deduplication, and automatic enrichment. Technical approaches used by Echoes to overcome these challenges include modular tools for data analysis, transformation to a common schema, quality assurance, and enrichment.
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo... (Ricard de la Vega)
Echoes provides open, easy and innovative access to digital cultural assets from different institutions and is available in several languages. Within a single and integrated platform, users have access to a wide range of information on archaeology, architecture, books, monuments, people, photography etc. This can be explored using different criteria: concepts, digital objects, people, places and time. The platform can be installed for a region or a theme.
Echoes has developed tools to analyze, clean, and transform data collections to the Europeana Data Model (EDM), as well as tools to validate, enrich, and publish heterogeneous data into a normalized data lake that can be exploited as linked open data and used with different data visualizations.
Product Keynote: Advancing Denodo’s Logical Data Fabric with AI and Advanced ... (Denodo)
Watch full webinar here: https://bit.ly/3r4wEVw
During this session, Denodo CTO Alberto Pan will discuss how a logical data fabric, together with the associated technologies of machine learning, artificial intelligence, and data virtualization, is the right approach to help organizations unify their data. He will discuss how a logical data fabric reduces time to value, thereby increasing the overall business value of your data assets.
The document provides an overview of Hadoop including:
- A brief history of Hadoop and its origins in the Nutch project.
- An overview of the Hadoop architecture including HDFS and MapReduce.
- Examples of how companies like Yahoo, Facebook and Amazon use Hadoop at large scales to process petabytes of data.
The document provides an overview and status update of the Earth Science Data and Information System (ESDIS). ESDIS has successfully supported numerous Earth science satellite missions and currently manages over 2 petabytes of science data. In fiscal year 2002, ESDIS delivered over 16 million data products to more than 1.8 million users. ESDIS is working to enhance its capabilities through initiatives like Data Pools and the EOS ClearingHOuse (ECHO) metadata broker. HDF-EOS 5 development is nearly complete and the workshop will discuss next steps for HDF-EOS tools and community adoption of HDF-EOS 5.
The document discusses fuzzy matching and describes Fuzzy Table, a scalable solution for performing fuzzy matching on large multimedia databases using Hadoop. Fuzzy Table uses Hadoop for bulk processing tasks like clustering and indexing data. It then enables low-latency fuzzy searches by caching HDFS metadata and performing searches in parallel across data servers, with average query times scaling linearly as servers are added. Future work involves optimizations to reduce I/O latency and reliance on the HDFS Namenode.
Similar to Metadata Requirements for EOSDIS Data Providers (20)
This document discusses how to optimize HDF5 files for efficient access in cloud object stores. Key optimizations include using large dataset chunk sizes of 1-4 MiB, consolidating internal file metadata, and minimizing variable-length datatypes. The document recommends creating files with paged aggregation and storing file content information in the user block to enable fast discovery of file contents when stored in object stores.
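A sketch of those recommendations with h5py: megabyte-scale chunks, consolidated metadata via the paged file-space strategy, and fixed-size datatypes. The file name, shapes, and page size are assumptions, and the paged-aggregation keywords assume a reasonably recent h5py/HDF5 build.

```python
# Sketch: create an HDF5 file tuned for object-store access.
# Large chunks (~2 MiB), paged file-space aggregation, and fixed-size dtypes.
import numpy as np
import h5py

with h5py.File(
    "cloud_optimized.h5",
    "w",
    fs_strategy="page",        # aggregate internal metadata into pages
    fs_page_size=4 * 1024**2,  # 4 MiB pages (assumed value)
) as f:
    data = np.random.random((4096, 4096)).astype("float32")
    f.create_dataset(
        "temperature",
        data=data,
        chunks=(1024, 512),    # ~2 MiB per chunk at float32
        compression="gzip",
    )
    # Prefer fixed-size strings over variable-length datatypes
    f.create_dataset("station_id",
                     data=np.array([b"KORD", b"KDEN"], dtype="S8"))
```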
This document provides an overview of HSDS (Highly Scalable Data Service), which is a REST-based service that allows accessing HDF5 data stored in the cloud. It discusses how HSDS maps HDF5 objects like datasets and groups to individual cloud storage objects to optimize performance. The document also describes how HSDS was used to improve access performance for NASA ICESat-2 HDF5 data on AWS S3 by hyper-chunking datasets into larger chunks spanning multiple original HDF5 chunks. Benchmark results showed that accessing the data through HSDS provided over 2x faster performance than other methods like ROS3 or S3FS that directly access the cloud storage.
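A sketch of reading HSDS-hosted data with h5pyd, which mirrors the h5py API; the endpoint, domain path, and dataset path below are placeholders rather than the actual ICESat-2 layout.

```python
# Sketch: read a slice from an HSDS-hosted HDF domain with h5pyd.
# Endpoint, domain path, and dataset path are placeholders.
import h5pyd  # pip install h5pyd

f = h5pyd.File(
    "/shared/example/icesat2_granule.h5",          # HSDS "domain", not a local file path
    "r",
    endpoint="http://hsds.example.org:5101",
)
dset = f["/gt1l/heights/h_ph"]                     # hypothetical dataset path
print(dset.shape, dset.dtype)
subset = dset[:1000]                               # only the requested chunks are fetched
print(subset.mean())
f.close()
```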
This document summarizes the current status and focus of the HDF Group. It discusses that the HDF Group is located in Champaign, IL and is a non-profit organization focused on developing and maintaining HDF software and data formats. It provides an overview of recent HDF5, HDF4 and HDFView releases and notes areas of focus for software quality improvements, increased transparency, strengthening the community, and modernizing HDF products. It invites support and participation in upcoming user group meetings.
This document provides an overview of HSDS (HDF Server and Data Service), which allows HDF5 files to be stored and accessed from the cloud. Key points include:
- HSDS maps HDF5 objects like datasets and groups to individual cloud storage objects for scalability and parallelism.
- Features include streaming support, fancy indexing for complex queries, and caching for improved performance.
- HSDS can be deployed on Docker, Kubernetes, or AWS Lambda depending on needs.
- Case studies show HSDS is used by organizations like NREL and NSF to make petabytes of scientific data publicly accessible in the cloud.
This document discusses creating cloud-optimized HDF5 files by rearranging internal structures for more efficient data access in cloud object stores. It describes cloud-native and cloud-optimized storage formats, with the latter involving storing the entire HDF5 file as a single object. The benefits of cloud-optimized HDF5 include fast scanning and using the HDF5 library. Key aspects covered include using optimal chunk sizes, compression, and minimizing variable-length datatypes.
This document discusses updates and performance improvements to the HDF5 OPeNDAP data handler. It provides a history of the handler since 2001 and describes recent updates including supporting DAP4, new data types, and NetCDF data models. A performance study showed that passing compressed HDF5 data through the handler without decompressing/recompressing led to speedups of around 17-30x by leveraging HDF5 direct I/O APIs. This allows outputting HDF5 files as NetCDF files much faster through the handler.
This document provides instructions for using the Hyrax software to serve scientific data files stored on Amazon S3 using the OPeNDAP data access protocol. It describes how to generate ancillary metadata files called DMR++ files using the get_dmrpp tool that provide information about the data file structure and locations. The document explains how to run get_dmrpp inside a Docker container to process data files on S3 and generate customized DMR++ files that the Hyrax server can use to serve the files to clients.
The HDF5 Roadmap and New Features document outlines upcoming changes and improvements to the HDF5 library. Key points include:
- HDF5 1.13.x releases will include new features like selection I/O, the Onion VFD for versioned files, improved VFD SWMR for single-writer multiple-reader access, and subfiling for parallel I/O.
- The Virtual Object Layer allows customizing HDF5 object storage and introduces terminal and pass-through connectors.
- The Onion VFD stores versions of HDF5 files in a separate onion file for versioned access.
- VFD SWMR improves on legacy SWMR by implementing single-writer multiple-reader capabilities
This document discusses user analysis of the HDFEOS.org website and plans for future improvements. It finds that the majority of the site's 100 daily users are "quiet", not posting on forums or other interactive elements. The main user types are locators, who search for examples or data; mergers, who combine or mosaic datasets; and converters, who change file formats. The document outlines recent updates focused on these user types, like adding Python examples for subsetting and calculating latitude and longitude. It proposes future work on artificial intelligence/machine learning uses of HDF files and examples for processing HDF data in the cloud.
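In the spirit of those examples, here is a small sketch that computes per-pixel latitude and longitude for a regular grid from corner coordinates and step sizes, then subsets by a bounding box; the grid parameters and the box are made-up values.

```python
# Sketch: compute lat/lon for a regular grid and subset by a bounding box.
# Corner coordinates, step sizes, and the box are made-up values.
import numpy as np

nrows, ncols = 180, 360
lat0, lon0 = 89.5, -179.5          # center of the upper-left cell
dlat, dlon = -1.0, 1.0             # step per row / column

lat = lat0 + dlat * np.arange(nrows)
lon = lon0 + dlon * np.arange(ncols)
lon2d, lat2d = np.meshgrid(lon, lat)   # per-pixel coordinates

# Subset everything inside a bounding box (lat 10..40, lon -30..20)
mask = (lat2d >= 10) & (lat2d <= 40) & (lon2d >= -30) & (lon2d <= 20)
rows, cols = np.where(mask)
print(rows.min(), rows.max(), cols.min(), cols.max())
```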
This document summarizes a presentation about the current status and future directions of the Hierarchical Data Format (HDF) software. It provides updates on recent HDF5 releases, development efforts including new compression methods and ways to access HDF5 data, and outreach resources. It concludes by inviting the audience to share wishes for future HDF development.
The document describes H5Coro, a new C++ library for reading HDF5 files from cloud storage. H5Coro was created to optimize HDF5 reading for cloud environments by minimizing I/O operations through caching and efficient HTTP requests. Performance tests showed H5Coro was 77-132x faster than the previous HDF5 library at reading HDF5 data from Amazon S3 for NASA's SlideRule project. H5Coro supports common HDF5 elements but does not support writing or some complex HDF5 data types and messages to focus on optimized read-only performance for time series data stored sequentially in memory.
This document summarizes MathWorks' work to modernize MATLAB's support for HDF5. Key points include:
1) MATLAB now supports HDF5 1.10.7 features like single-writer/multiple-reader access and virtual datasets through new and updated low-level functions.
2) Performance benchmarks show some improvements but also regressions compared to the previous HDF5 version, and work continues to optimize code and support future versions.
3) There are compatibility considerations for Linux filter plugins, but interim solutions are provided until MathWorks can ship a single HDF5 version.
HSDS provides HDF as a service through a REST API that can scale across nodes. New releases will enable serverless operation using AWS Lambda or direct client access without a server. This allows HDF data to be accessed remotely without managing servers. HSDS stores each HDF object separately, making it compatible with cloud object storage. Performance on AWS Lambda is slower than a dedicated server but has no management overhead. Direct client access has better performance but limits collaboration between clients.
HDF5 and Zarr are data formats that can be used to store and access scientific data. This presentation discusses approaches to translating between the two formats. It describes how HDF5 files were translated to the Zarr format by creating a separate Zarr store to hold HDF5 file chunks, and storing chunk location metadata. It also discusses an implementation that translates Zarr data to the HDF5 format by using a special chunking layout and storing chunk information in an HDF5 compound dataset. Limitations of the translations include lack of support for some HDF5 dataset properties in Zarr, and lack of support for some Zarr compression methods in the HDF5 implementation.
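This is not the translation machinery described in that presentation, but a naive copy from a chunked HDF5 dataset into a Zarr array makes the chunk-for-chunk correspondence between the two formats visible; file names, dataset paths, and chunk shapes are assumptions.

```python
# Sketch: copy one chunked HDF5 dataset into a Zarr array, slab by slab.
# This is a naive re-write, not the metadata-preserving translation discussed above.
import h5py
import zarr

with h5py.File("input.h5", "r") as src:
    dset = src["/data/temperature"]              # hypothetical dataset path
    store = zarr.open_group("output.zarr", mode="w")
    zarray = store.create_dataset(
        "temperature",
        shape=dset.shape,
        chunks=dset.chunks or dset.shape,        # reuse the HDF5 chunk shape if present
        dtype=dset.dtype,
    )
    # Stream the data in chunk-sized slabs along the first axis
    step = (dset.chunks or dset.shape)[0]
    for start in range(0, dset.shape[0], step):
        zarray[start:start + step] = dset[start:start + step]

print(zarray.shape, zarray.chunks)
```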
The document discusses HDF for the cloud, including new features of the HDF Server and what's next. Key points:
- HDF Server uses a "sharded schema" that maps HDF5 objects to individual storage objects, allowing parallel access and updates without transferring entire files.
- Implementations include HSDS software that uses the sharded schema with an API and SDKs for different languages like h5pyd for Python.
- New features of HSDS 0.6 include support for POSIX, Azure, AWS Lambda, and role-based access control.
- Future work includes direct access to storage without a server intermediary for some use cases.
This document compares different methods for accessing HDF and netCDF files stored on Amazon S3, including Apache Drill, THREDDS Data Server (TDS), and HDF5 Virtual File Driver (VFD). A benchmark test of accessing a 24GB HDF5/netCDF-4 file on S3 from Amazon EC2 found that TDS performed the best, responding within 2 minutes, while Apache Drill failed after 7 minutes. The document concludes that TDS 5.0 is the clear winner based on performance and support for role-based access control and HDF4 files, but the best solution depends on use case and software.
This document discusses STARE-PODS, a proposal to NASA/ACCESS-19 to develop a scalable data store for earth science data using the SpatioTemporal Adaptive Resolution Encoding (STARE) indexing scheme. STARE allows diverse earth science data to be unified and indexed, enabling the data to be partitioned and stored in a Parallel Optimized Data Store (PODS) for efficient analysis. The HDF Virtual Object Layer and Virtual Data Set technologies can then provide interfaces to access the data in STARE-PODS in a familiar way. The goal is for STARE-PODS to organize diverse data for alignment and parallel/distributed storage and processing to enable integrative analysis at scale.
This document provides an overview and update on HDF5 and its ecosystem. Key points include:
- HDF5 1.12.0 was recently released with new features like the Virtual Object Layer and external references.
- The HDF5 library now supports accessing data in the cloud using connectors like S3 VFD and REST VOL without needing to modify applications.
- Projects like HDFql and H5CPP provide additional interfaces for querying and working with HDF5 files from languages like SQL, C++, and Python.
- The HDF5 community is moving development to GitHub and improving documentation resources on the HDF wiki site.
This document summarizes new features in HDF5 1.12.0, including support for storing references to objects and attributes across files, new storage backends using a virtual object layer (VOL), and virtual file drivers (VFDs) for Amazon S3 and HDFS. It outlines the HDF5 roadmap for 2019-2022, which includes continued support for HDF5 1.8 and 1.10, and new features in future 1.12.x releases like querying, indexing, and provenance tracking.
The document discusses leveraging cloud resources like Amazon Web Services to improve software testing for the HDF group. Currently HDF software is tested on various in-house systems, but moving more testing to the cloud could provide better coverage of operating systems and distributions at a lower cost. AWS spot instances are being used to run HDF5 build and regression tests across different Linux distributions in around 30 minutes for approximately $0.02 per hour.
More from The HDF-EOS Tools and Information Center (20)
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an... (Jason Yip)
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc... (DanBrown980551)
This LF Energy webinar took place June 20, 2024. It featured:
- Alex Thornton, LF Energy
- Hallie Cramer, Google
- Daniel Roesler, UtilityAPI
- Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
- Discovery and client registration, emphasizing transparent processes and secure and private access
- Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
- Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the presentation accompanying the talk I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and followed online by another 200.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
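To make the idea of a chatbot mutation operator concrete, here is a minimal, purely illustrative sketch; the data model and names are hypothetical and are not taken from the paper or its Eclipse plugin. The operator deletes one training phrase from an intent, emulating the fault of an under-trained intent.

# Illustrative sketch of one chatbot mutation operator (hypothetical data model,
# not the paper's implementation): delete a training phrase from an intent.
import copy

def delete_training_phrase(chatbot_design, intent_name, phrase_index):
    """Return a mutant copy of the chatbot design with one training phrase removed."""
    mutant = copy.deepcopy(chatbot_design)
    phrases = mutant["intents"][intent_name]["training_phrases"]
    if 0 <= phrase_index < len(phrases):
        del phrases[phrase_index]
    return mutant

# Toy usage: one mutant per training phrase of a hypothetical "book_flight" intent.
design = {
    "intents": {
        "book_flight": {
            "training_phrases": ["book a flight", "I need a plane ticket", "fly to Paris"],
            "response": "Sure, where do you want to fly?",
        }
    }
}
mutants = [delete_training_phrase(design, "book_flight", i)
           for i in range(len(design["intents"]["book_flight"]["training_phrases"]))]
print(len(mutants), "mutants generated")

In a mutation-testing workflow, each such mutant is deployed and the existing test scenarios are executed against it; scenarios that fail on the mutant "kill" it, and the fraction of killed mutants measures the strength of the test suite.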
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk we will first analyze scaling approaches and then select the proper ones for our system.
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime run $5-$10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for developing highly loaded fintech solutions. We will focus on using queues and streaming to work with and manage large amounts of data efficiently in real time and to minimize latency.
We will pay special attention to the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
Must Know Postgres Extension for DBA and Developer during Migration - Mydbops
Mydbops Opensource Database Meetup 16
Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration
Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting
Date & Time: 8th June | 10 AM - 1 PM IST
Venue: Bangalore International Centre, Bangalore
Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle.
Key Takeaways:
* Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities.
* Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom.
* Discover how these key extensions can empower both developers and DBAs during the migration process.
* Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends.
Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL.
Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability.
Contact us: info@mydbops.com
Visit: https://www.mydbops.com/
Follow us on LinkedIn: https://in.linkedin.com/company/mydbops
For more details and updates, please follow the links below.
Meetup Page : https://www.meetup.com/mydbops-databa...
Twitter: https://twitter.com/mydbopsofficial
Blogs: https://www.mydbops.com/blog/
Facebook(Meta): https://www.facebook.com/mydbops/
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the "Temporal Event Neural Networks: A More Efficient Alternative to the Transformer" tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
What is an RPA CoE? Session 1 – CoE Vision - DianaGray10
In the first session, we will review the organization's vision and how it impacts the CoE structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Dandelion Hashtable: beyond billion requests per second on a commodity server - Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite optimization efforts that go as far as sacrificing core functionality, state-of-the-art hashtable designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open addressing and adopts a fully featured and memory-aware closed-addressing design based on bounded cache-line chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
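For intuition only, here is a toy, single-threaded sketch of the closed-addressing, bounded-bucket idea mentioned above. The real DLHT is lock-free, cache-line aware, uses software prefetching, and resizes in parallel; none of that is reproduced here, and all names below are hypothetical.

# Toy sketch of closed addressing with bounded buckets (overflow chains of
# fixed-size buckets). Single-threaded and illustrative only.
BUCKET_SLOTS = 7  # slots per bucket, standing in for one cache line of entries

class BoundedChainTable:
    def __init__(self, num_buckets=1024):
        # Each bucket holds a fixed number of (key, value) slots plus an overflow link.
        self.buckets = [{"slots": [None] * BUCKET_SLOTS, "next": None}
                        for _ in range(num_buckets)]
        self.num_buckets = num_buckets

    def _bucket(self, key):
        return self.buckets[hash(key) % self.num_buckets]

    def put(self, key, value):
        node = self._bucket(key)
        first_free = None  # (bucket, slot index) of the first empty slot seen
        while True:
            for i, slot in enumerate(node["slots"]):
                if slot is not None and slot[0] == key:
                    node["slots"][i] = (key, value)  # key already present: update in place
                    return
                if slot is None and first_free is None:
                    first_free = (node, i)
            if node["next"] is None:
                break
            node = node["next"]
        if first_free is None:  # whole chain is full: chain one more bounded bucket
            node["next"] = {"slots": [None] * BUCKET_SLOTS, "next": None}
            first_free = (node["next"], 0)
        first_free[0]["slots"][first_free[1]] = (key, value)

    def get(self, key):
        node = self._bucket(key)
        while node is not None:
            for slot in node["slots"]:
                if slot is not None and slot[0] == key:
                    return slot[1]
            node = node["next"]
        return None

    def delete(self, key):
        node = self._bucket(key)
        while node is not None:
            for i, slot in enumerate(node["slots"]):
                if slot is not None and slot[0] == key:
                    node["slots"][i] = None  # the slot is freed immediately
                    return True
            node = node["next"]
        return False

t = BoundedChainTable()
t.put("a", 1); t.put("b", 2)
print(t.get("a"), t.delete("b"), t.get("b"))  # 1 True None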
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
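As a rough illustration of the trigram technique mentioned above (not the speakers' code; the connection string, table name, and column name are hypothetical placeholders), pg_trgm can be combined with a similarity search so that a misspelled drug name still matches:

# Hedged sketch of fuzzy drug-name search with PostgreSQL's pg_trgm extension.
# Connection settings, table name, and column name are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("dbname=drugsearch user=app password=secret host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
    # A GIN trigram index keeps similarity searches fast on large tables.
    cur.execute("CREATE INDEX IF NOT EXISTS drugs_name_trgm_idx "
                "ON drugs USING gin (name gin_trgm_ops);")
    # similarity() ranks near-matches, so the misspelling 'ibuprofin' still finds 'ibuprofen'.
    cur.execute("""
        SELECT name, similarity(name, %s) AS score
        FROM drugs
        WHERE name %% %s  -- the doubled %% becomes pg_trgm's similarity operator
        ORDER BY score DESC
        LIMIT 10;
    """, ("ibuprofin", "ibuprofin"))
    for name, score in cur.fetchall():
        print(f"{name}: {score:.2f}")
conn.close()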
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
"What does it really mean for your system to be available, or how to define w...Fwdays
We will talk about system monitoring from a few different angles. We will start by covering the basics, then discuss SLOs, how to define them, and why understanding the business well is crucial for success in this exercise.
QA or the Highway - Component Testing: Bridging the gap between frontend appl... - zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
High performance Serverless Java on AWS - GoTo Amsterdam 2024 - Vadym Kazulkin
Java has been one of the most popular programming languages for many years, but it used to have a hard time in the serverless community. Java is known for its high cold-start times and high memory footprint compared to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption and cold-start times for Java serverless development on AWS, including GraalVM (Native Image) and AWS's own offering SnapStart, which is based on Firecracker microVM snapshot and restore, as well as CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide a lot of benchmarking on Lambda functions, trying out various deployment package sizes, Lambda memory settings, Java compilation options, and HTTP (a)synchronous clients, and measure their impact on cold and warm start times.
2. Topics
•Why metadata is important
•Types of metadata in HDF-EOS files
•Required metadata
•How metadata is encoded and delivered
3. What is Metadata?
•Metadata is information that identifies and characterizes an information product.
•Sometimes called “data about data”
4. Users Need Metadata
•Metadata is needed to answer questions such as:
- What time and location does this data apply to?
- What type of instrument and processing produced the data?
- What other inputs were used to generate the data?
- What QA has been performed on this data?
- Who do I contact if I have questions about this data?
5. Metadata is Essential
•Large data archive systems cannot function without metadata.
•Metadata is used to keep track of such things as:
- where the data is
- what type of operations are possible on the data
- whether there are any access restrictions on the data
- how individual data files are logically grouped into "collections"
6. Key Concepts
•A granule is the smallest aggregation of data that is independently described and inventoried by the ECS. A granule consists of 1 or more physical files.
•A collection is a logical grouping of granules.
•The ECS Data Model allows for:
- “Core” attributes
- “Product-Specific” Attributes (PSAs)
7. Types of Metadata
•Metadata in HDF files
- stored as global text attributes
•Types of Metadata used in HDF-EOS files:
- Structural Metadata
- Core Metadata (inventory, can include PSAs)
- Archive Metadata (non-searchable, product-specific)
•Collection level metadata
- core and product-specific
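For illustration only: the slide above describes metadata carried as global text attributes, and the Editor's Notes below add that this metadata is written in ODL across sequentially numbered attributes. The following is a minimal sketch of that idea using h5py/HDF5; the file name, attribute name, ODL content, and values are assumptions made for the example (real HDF-EOS granules of this era were HDF4 files populated via the SDP Toolkit and an MCF, not hand-written strings).

# Minimal sketch (not the SDP Toolkit interface): core metadata written in ODL and
# stored as a numbered global text attribute. File name, attribute name, and values
# are illustrative assumptions.
import h5py

odl_core_metadata = """GROUP = INVENTORYMETADATA
  OBJECT = SHORTNAME
    VALUE = "EXAMPLE01"
  END_OBJECT = SHORTNAME
  OBJECT = VERSIONID
    VALUE = 1
  END_OBJECT = VERSIONID
END_GROUP = INVENTORYMETADATA
END
"""

with h5py.File("example_granule.h5", "w") as f:
    # Long metadata can span several sequentially numbered attributes
    # (CoreMetadata.0, CoreMetadata.1, ...); one attribute is enough here.
    f.attrs["CoreMetadata.0"] = odl_core_metadata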
8. Required Metadata
•Origins of metadata requirements:
- what is required to archive and retrieve files
- what is required to provide search and other services on data
- what is federally mandated (FGDC)
•There are 287 attributes in the ECS data model
- only a subset are used for any given product
- 101 are applicable at the granule level
9. Metadata Coverage
•Science Data that are delivered for archiving in ECS must meet what is called the Intermediate level of metadata coverage. This involves as few as:
- 31 collection level attributes
- 4 granule level attributes
•Compliance at this level is not enforced by the system.
11. Granule-Level Metadata for Intermediate Coverage
•There are only four granule-level metadata attributes required:
- ShortName
- VersionID
- SizeMBECSDataGranule
- ProductionDateTime
•ShortName and VersionID are identical to the collection-level attributes with these names.
•For granules coming into ECS, SizeMBECSDataGranule and ProductionDateTime are supplied by the system upon insertion.
12. How is Metadata Supplied?
•Collection-level metadata is carried in an Earth Science Data Type (ESDT) Descriptor file.
•Granule-level metadata is defined in the descriptor file and populated using a Metadata Configuration File (MCF).
•Granule-level metadata is delivered in the HDF-EOS granule *or* in a populated MCF accompanying a non-HDF granule.
•The DAAC where a collection will reside is responsible for descriptors and ingest routines.
13. Metadata Work Flow for External Data Providers
The slide is a workflow diagram split between data-provider and DAAC responsibilities. On the data-provider side, population analysis with MDWorks against the ECS data model yields collection core attributes, granule core value definitions, and PSA definitions, while science software (DLL coding) built with the SDP Toolkit writes granule core values, PSA values, and structural metadata into the HDF-EOS file. On the DAAC side, PSA registration tools, type and format checks on data and documentation, ODL syntax checks with an ODL parser, MCF build, and test and validation constraint checks produce a validated descriptor and a database load file, after which the Ingest Subsystem performs the ESDT insert into the DAAC data archive.
14. Metadata Resources on the Web
•ECS Metadata Homepage
http://ecsinfo.hitc.com/metadata/metadata.html
•Metadata Works (ESDT Descriptor Tool)
http://et3ws1.HITC.COM/metadata_works/
•EOSDIS Information Architecture
http://spsosun.gsfc.nasa.gov/InfoArch.html
•Federal Geographic Data Committee
http://www.fgdc.gov/
15. Q&A w/ Experts Panel
•Q: "If you are a new data provider, how do you get your data into an HDF-EOS granule, given the bewildering array of utilities and tools available? What is the simplest solution for this?"
•A: The recommended solution is to obtain the HCR package, which includes the HDF-EOS and HDF libraries. For populating the required metadata in the granule, obtain the Metadata/Time Toolkit_MDT. The steps would be:
1. Write an HCR and use the tools to turn this into a skeletal HDF-EOS granule. (This step is optional.)
2. Use the HDF-EOS library to create a granule. (If starting with a skeletal HDF-EOS file generated from an HCR, then plain HDF calls can be used to insert data into the granule.)
3. Use Toolkit_MDT calls to insert metadata into the granule. This requires generation of an MCF in ODL. Metadata_Works is available for doing this. As an alternative, a simple HDF call can be used to attach minimum metadata (in ODL) to an HDF file.
Note: if the data are going to reside in a DAAC, or in an archive that must be interoperable with ECS, you will need to generate collection-level metadata. Metadata_Works is the recommended tool for this.
Editor's Notes
In short, without metadata, a user of the data is in the dark.
Not all metadata is used in searching. Some metadata is merely informative and will not be used in database queries. This metadata can be viewed to assist data consumers in deciding whether to order data or not.
Metadata is needed to identify a data product once it is archived in the system.
Without metadata, users could never find a file unless they knew the precise ID of the file (like a filename in some systems, or in ECS a UR).
By supplying a rich set of metadata attributes for the data, users will be able to find the data more easily and in a greater variety of routes or search methods.
All textual metadata (i.e. excluding things that are specifically provided for by HDF like scales and units) should be contained in HDF text attributes.
ECS compliant metadata must be written to HDF text attributes with specific names, and may span multiple attributes, numbered sequentially, to accommodate all metadata.
This metadata must also be written in ODL, or Object Description Language.
These tasks are best handled by using the SDP Toolkit.
Collection level metadata is delivered separately from the granules and will be discussed later.
ECS requires only two attributes to insert and acquire granules: ShortName and VersionID. Upon granule generation, ProductionDateTime is generated by the system, and this can also be used to identify granules belonging to a collection.
Temporal coverage can also be designated by a range or by periodic attributes.
Spatial coverage can also be designated by a single point, a point and circle, or a polygon.
ECS needs to be made aware of a data set prior to the arrival of the first “granule” of data, so that the archives that will hold the data and the database tables that will hold the metadata can be set up.
This is done by defining an Earth Science Data Type (ESDT). An ESDT “descriptor” file contains all the metadata values that describe the entire “collection” of data granules.
The ESDT descriptor also identifies the metadata that will pertain to the individual granules and whose values will be supplied as each granule is “inserted” into the system.
The Distributed Active Archive Centers (DAACs) are responsible for generating ESDT descriptor files, DLLs and any custom code necessary to ingest granules into the system.
(is it appropriate to say this?)