Recent Upgrades to ARM Data Transfer and Delivery Using Globus – Globus
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Giri Prakash from the ARM Data Center at Oak Ridge National Laboratory.
Enabling Secure Data Discoverability (SC21 Tutorial) – Globus
Major research instruments are generating orders of magnitude more data in relatively short timeframes. As a result, the research enterprise is increasingly challenged by what should be mundane tasks: describing data for discovery and making data securely accessible to the broader research community. The ad hoc methods currently employed place undue burden on scientists and system administrators alike, and it is clear that a more robust, scalable approach is required.
Bespoke data portals (and science gateways/data commons) are becoming more prominent as a means of enabling access to large datasets. In this tutorial we demonstrate how services for authentication, authorization, metadata management, and search may be integrated with popular web frameworks, and used in combination with fast, well-architected networks to make data discoverable and accessible. Outcomes: build a simple but functional data portal that facilitates flexible data description, faceted data search, and secure data access.
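Faceted search of the kind the tutorial builds reduces to counting attribute values across a metadata index and filtering on the user's selections. A minimal pure-Python sketch, with hypothetical record fields (`instrument`, `year`) that are not taken from the tutorial itself:

```python
from collections import Counter

def facet_counts(records, facet_fields):
    """Count occurrences of each value of each facet field across records."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            if field in record:
                counts[field][record[field]] += 1
    return counts

def apply_filters(records, filters):
    """Keep only records matching every selected facet value."""
    return [r for r in records if all(r.get(f) == v for f, v in filters.items())]

datasets = [
    {"title": "Radar scan A", "instrument": "radar", "year": 2020},
    {"title": "Radar scan B", "instrument": "radar", "year": 2021},
    {"title": "Lidar sweep", "instrument": "lidar", "year": 2021},
]

counts = facet_counts(datasets, ["instrument", "year"])
# counts["instrument"] == Counter({"radar": 2, "lidar": 1})
filtered = apply_filters(datasets, {"year": 2021})
# two records remain after selecting year == 2021
```

A real portal would back these counts with a search service rather than an in-memory list, but the facet/filter contract is the same.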
The document discusses India's Aadhaar identity system, which collects biometric data on 1.2 billion Indian residents. MongoDB is used to store and search this identity data across multiple shards because of its auto-sharding, replication, and evolving-schema capabilities. The implementation shards data across 8 shards, each holding over 2 TB, with performance and reliability addressed through replica sets, write-concern configuration, and manual monitoring processes.
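The summary does not say how records are routed, but hash-based placement onto 8 shards, the idea behind MongoDB's hashed shard keys, can be sketched in a few lines. The 12-digit identifier format and the use of MD5 here are illustrative assumptions, not details of the Aadhaar deployment:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # the deployment described runs 8 shards

def shard_for(resident_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard by hashing the record's key: the same idea as
    MongoDB's hashed shard keys (MongoDB uses its own hash function)."""
    digest = hashlib.md5(resident_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# hypothetical 12-digit identifiers standing in for real records
ids = [f"{n:012d}" for n in range(10_000)]
placement = Counter(shard_for(i) for i in ids)
# all 8 shards are used, and the hash spreads records roughly evenly
```

Hashing the key avoids the hot spots that monotonically increasing identifiers would create under range-based sharding.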
Since 1962, ICPSR has been an integral part of the infrastructure of social science research with its vast digital archive supporting over 700 member institutions worldwide. With the release of our new digital assets management system “Archonnex,” ICPSR continues this tradition by extending our expertise and digital technology capabilities as a service to the larger community. For the first time researchers, institutions, organizations, and even nations will be able to host their own repositories and set up data services for their members. We call it RaaS – Repository as a Service.
Live Geoinformation with Standardized Geoprocessing Services – Theodor Foerster
This document proposes using HTTP Live Streaming to enable streaming web processing services. This allows for asynchronous and progressive transfer of geodata between a client and server. It improves performance, scalability, and the user experience over traditional synchronous WPS. The approach was implemented using 52North WPS and evaluated using a use case of generalizing OpenStreetMap data streams. Results demonstrated this streaming approach reduces memory footprint and improves processing time compared to a reference implementation.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
WoSC19: Serverless Workflows for Indexing Large Scientific Data – University of Chicago
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific “data lakes” quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function-as-a-service models to enable scalable metadata extraction by orchestrating the execution of many short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near to the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
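Xtract's orchestration of short-running extractors can be pictured as a dispatch from file type to extractor function. The extractors and file formats below are toy illustrations of the pattern, not Xtract's actual API:

```python
import json
import os
import tempfile

def extract_text(path):
    """Tiny text extractor: word count only."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return {"type": "text", "word_count": len(f.read().split())}

def extract_json(path):
    """Tiny JSON extractor: top-level keys only."""
    with open(path, encoding="utf-8") as f:
        return {"type": "json", "top_level_keys": sorted(json.load(f))}

# dispatch table: file extension -> short-running extractor function
EXTRACTORS = {".txt": extract_text, ".json": extract_json}

def extract_metadata(path):
    """Route a file to the matching extractor, as an orchestrator would."""
    extractor = EXTRACTORS.get(os.path.splitext(path)[1])
    if extractor is None:
        return {"file": os.path.basename(path), "type": "unknown"}
    return {"file": os.path.basename(path), **extractor(path)}

# demo on a throwaway "repository" of two files
repo = tempfile.mkdtemp()
with open(os.path.join(repo, "notes.txt"), "w", encoding="utf-8") as f:
    f.write("three little words")
with open(os.path.join(repo, "run.json"), "w", encoding="utf-8") as f:
    json.dump({"beamline": "8-ID", "energy_kev": 7.35}, f)
index = [extract_metadata(os.path.join(repo, name))
         for name in sorted(os.listdir(repo))]
```

In the real system each extractor invocation would be a serverless function call, so thousands of files can be processed in parallel without a long-lived worker per file.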
Krishnan Raman presented on LinkedIn's data obfuscation pipeline. The pipeline aims to analyze LinkedIn data to improve machine learning models, discover data quickly for analysis, and access data efficiently while complying with privacy regulations. It determines which files contain personally identifiable information (PII) to obfuscate, handles schema evolution, and preserves file names and types. WhereHows is used to track dataset lineage and locations. Obfuscated data is emitted with metrics on job progress captured as time series for monitoring the data pipeline. Challenges include unclean data, complex schemas, balancing failures vs. dropped rows, and accounting for changing data and schemas. Auditing data, metadata, robust monitoring systems, and re-ob…
This document proposes a log management solution using Logstash, Elasticsearch, and Kibana. Logstash is used to collect, parse, and index logs into Elasticsearch for centralized storage and real-time search. Kibana provides visualization and analytics dashboards. The solution offers scalability, reliability, searchability, and a low-cost and flexible open source approach to solving the challenges of gathering, analyzing, and gaining insights from large volumes of log data from diverse sources.
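The parse step of such a pipeline turns raw log lines into structured documents before indexing. A grok-like sketch in Python, assuming an Apache-style access-log format (the format and field names are illustrative):

```python
import re

# an Apache-style access-log line (format assumed for illustration)
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    """Parse one raw line into the kind of structured document Logstash
    would ship to Elasticsearch; return None for unparseable lines."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    doc = m.groupdict()
    doc["status"] = int(doc["status"])
    doc["bytes"] = int(doc["bytes"])
    return doc

line = '10.0.0.1 - - [12/Mar/2020:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512'
doc = parse_line(line)
# doc["status"] == 200, doc["path"] == "/index.html"
```

Once fields like `status` are typed rather than embedded in free text, Elasticsearch can aggregate on them and Kibana can chart them directly.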
From R Script to Production Using rsparkling with Navdeep Gill – Databricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C... – Khai Tran
This document discusses LinkedIn's transition from an offline metrics platform to a near real-time "nearline" architecture using Apache Calcite and Apache Samza. It overviews LinkedIn's metrics platform and needs, and then details how the new nearline architecture works by translating Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. An example production use case for analyzing storylines on the LinkedIn platform is also presented. The nearline architecture allows metrics to be computed with latencies of 5-30 minutes rather than 3-6 hours previously.
This document provides an introduction to OpenStack, an open source software platform for building private and public clouds. It describes the key OpenStack components for compute (Nova), storage (Cinder, Glance, Swift), networking (Neutron), and identity (Keystone). It then discusses how organizations like CERN and PayPal use OpenStack to manage large amounts of data and computing resources in a scalable, distributed manner. The document concludes by outlining various ways that individuals can get involved and contribute to the OpenStack community.
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018 – Charles Allen
Charles Allen covers data processing, analytics, and insights systems at Snap. Strengths of the Druid use cases are called out, as are differences among some of the processing systems used.
This is the slide collection from the second talk at:
https://www.meetup.com/druidio-la/events/254080924/
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
The Past, Present and Future of Big Data @LinkedIn – Suja Viswesan
LinkedIn processes huge amounts of data from user events across the globe at scale. They collect 2.3 trillion messages per day, totaling 2.5 PB of data, and process it using highly reliable, fault-tolerant batch and stream processing. They access this data by persisting it durably across 120 PB of HDFS storage and make it searchable and available for online services. Their analytics infrastructure includes data ingestion using Gobblin, dataset management using Dali, storage using HDFS and Voldemort, and compute engines like YARN. They use solutions like federated HDFS, Dali, Hadoop OrgQueue, and elasticity tuning to scale their system, cluster management, and computation across an infrastructure of tens of thousands of nodes.
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J... – Dataconomy Media
The document discusses Valo, a big data analytics engine built from scratch focusing on simplicity and distributed capabilities. It describes Valo's architecture including time-series and semi-structured data repositories, REST API, and execution engine. It also discusses challenges of building distributed systems including cluster failures, data distribution, algorithms, and more.
Check out the webinar: https://imply.io/videos/whats-new-imply-3-3-apache-druid-0-18
The most recent Imply 3.3 release, based on Apache Druid 0.18, brings several major new features, including joins, query laning, and Clarity Alerts. These features increase design flexibility, improve ingestion performance, and deliver sub-second response times, helping accelerate data warehouse and data lake deployments and add real-time analytics more generally.
Improve your SQL workload with observability – OVHcloud
Most of OVH's information system runs on relational databases (PostgreSQL, MySQL, MariaDB). In terms of volume, that represents 400 databases holding more than 20 TB of data, spread across 60 clusters in two geographic zones, powering 3,000 applications in all.
How can we see everything in our fleet? Better still, how can we let everyone follow the activity of their own database? That is the challenge we set ourselves, and one year on we can share our experience.
And what if observability were not just a buzzword, but had a real impact on production?
Archmage, Pinterest’s Real-time Analytics Platform on Druid – Imply
In this talk, we will cover:
1) the motivation for switching from an HBase-backed analytics system to Druid;
2) the architecture of Druid as a platform at Pinterest (Archmage, Hadoop, Kafka). The query interface, Archmage, is a Thrift service in front of Druid that exposes a Thrift API to clients across the company, handles Druid broker host discovery, serves as a relay to the broker hosts to abstract away the async HTTP connection, and provides query optimizations that are transparent to clients, such as directly translating fixed-pattern SQL to Druid native JSON queries to save planning time. We’ll also cover the production Hadoop batch and Kafka real-time ingestion pipeline setup and why we picked a pull-based rather than a push-based solution for real-time ingestion;
3) the use cases currently running in production on this platform, including their data volume, QPS, and Druid cluster setup; the unique challenges we met while onboarding and how we addressed them with extensive tuning to meet SLAs; and lessons learned. The use cases include partner insights, which provides partners with stats on organic Pins; real-time spam detection, which detects user-login anomalies and Pin-related spam events such as Pin creation and repins; and migrating the backend for Ads experiment data analysis from Presto to Druid.
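The fixed-pattern SQL translation mentioned in point 2 can be illustrated: match one known SQL shape and emit the equivalent Druid native JSON query directly, skipping general-purpose planning. The SQL pattern below is invented for illustration; the output follows Druid's native timeseries query shape:

```python
import re

# one fixed SQL shape: a total count over a time interval
PATTERN = re.compile(
    r"SELECT COUNT\(\*\) FROM (?P<source>\w+) "
    r"WHERE __time BETWEEN '(?P<start>[\d-]+)' AND '(?P<end>[\d-]+)'",
    re.IGNORECASE,
)

def sql_to_druid(sql):
    """Translate a fixed-pattern SQL string straight to a Druid native
    timeseries query, bypassing a general SQL planner; None if no match."""
    m = PATTERN.match(sql.strip())
    if not m:
        return None
    return {
        "queryType": "timeseries",
        "dataSource": m.group("source"),
        "intervals": [f"{m.group('start')}/{m.group('end')}"],
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "count"}],
    }

q = sql_to_druid(
    "SELECT COUNT(*) FROM pins WHERE __time BETWEEN '2020-01-01' AND '2020-02-01'"
)
```

Because the shape is known in advance, the translation is a constant-time template fill rather than a full parse-and-plan, which is where the planning-time savings come from.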
This document discusses various techniques for optimizing application performance, including reducing latency and increasing throughput. It covers strategies such as using data structures like linked lists, Bloom filters, and Merkle trees efficiently. Other topics include removing contention through approaches like the Disruptor pattern, optimizing for network performance, and leveraging the reactor pattern. The performance of transports such as XML/JSON and SOAP/REST is also evaluated, and monitoring tools like Java Flight Recorder are mentioned.
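Of the data structures listed, the Bloom filter is worth a sketch: a space-efficient membership test that answers "definitely absent" or "probably present", with tunable false positives but never false negatives. A minimal pure-Python version, with bit-array size and hash scheme chosen arbitrarily:

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership: 'definitely absent' or 'probably present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item):
        # derive k bit positions by salting the item with the hash index
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for word in ["alpha", "beta", "gamma"]:
    bf.add(word)

# false-positive rate stays tiny while the filter is nearly empty
false_positives = sum(1 for i in range(1000) if f"absent-{i}" in bf)
```

The latency win is that a negative Bloom lookup avoids a disk or network round-trip entirely, at the cost of a rare wasted lookup on a false positive.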
This document discusses Hitachi Universal Replicator software, which asynchronously replicates data between Hitachi storage systems over any distance. It satisfies demanding business continuity and disaster recovery requirements by maintaining integrity of replicated data even during network outages. The software optimizes storage resources, improves bandwidth utilization, and supports heterogeneous storage environments for maximum data protection flexibility.
The document outlines the key steps in an online training program for Hadoop including setting up a virtual Hadoop cluster, loading and parsing payment data from XML files into databases incrementally using scheduling, building a migration flow from databases into Hadoop and Hive, running Hive queries and exporting data back to databases, and visualizing output data in reports. The training will be delivered online over 20 hours using tools like GoToMeeting.
Modern Scientific Data Management Practices: The Atmospheric Radiation Measur... – Globus
These slides were presented by Giri Prakash from Oak Ridge National Lab at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
This document discusses Druid in production at Fyber, a company that indexes 5 terabytes of data daily into Druid from various sources. It describes the hardware used, including 30 historical nodes and 2 broker nodes. Issues addressed include slow query times when querying many dimensions, some of them lists, and data cleanup steps that reduce cardinality, such as replacing values. Segment sizing and partitioning are also discussed, covering the hardware, data ingestion, querying, and optimizations used to scale Druid for Fyber's analytics needs.
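One common cardinality-reduction cleanup, replacing all but the most frequent values of a dimension with a placeholder before ingestion, can be sketched as follows. The column name and thresholds are illustrative, not Fyber's actual rules:

```python
from collections import Counter

def reduce_cardinality(rows, column, keep_top=2, placeholder="other"):
    """Replace all but the most frequent values of a dimension with a
    placeholder, shrinking the value dictionary Druid must index."""
    freq = Counter(row[column] for row in rows)
    keep = {value for value, _ in freq.most_common(keep_top)}
    return [
        {**row, column: row[column] if row[column] in keep else placeholder}
        for row in rows
    ]

rows = [{"country": c} for c in
        ["US", "US", "US", "DE", "DE", "TV", "NR", "VA"]]
cleaned = reduce_cardinality(rows, "country")
# cardinality drops from 5 distinct values to 3: US, DE, other
```

Smaller dictionaries mean smaller segments and faster group-by queries, which is why this kind of cleanup pays off at ingestion time.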
This document provides an overview of the HathiTrust Research Center (HTRC) architecture. It describes the key components including a portal for access, an agent for application submission, a registry for storing metadata, a secure API for programmatic access, and storage of data in a Cassandra cluster with indexing in Solr. It also outlines use cases and discusses how the architecture enables secure, non-consumptive research on copyrighted works stored in the HathiTrust digital library.
Proactive ops for container orchestration environments – Docker, Inc.
This document discusses different approaches to monitoring systems from manual and reactive to proactive monitoring using container orchestration tools. It provides examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers. The document emphasizes applying the principles of observability including structured logging, events and tracing with metadata, and monitoring the monitoring systems themselves. Speakers provide best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
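The structured-logging principle the talk emphasizes, emitting machine-parseable events with metadata rather than free-text lines, can be sketched in a few lines of Python. The field names here are illustrative:

```python
import json
import sys
import time

def log_event(event, **fields):
    """Emit one structured log record as a single JSON line, carrying
    metadata (service, trace id, timestamp) alongside the event name."""
    record = {"ts": time.time(), "event": event, **fields}
    sys.stdout.write(json.dumps(record, sort_keys=True) + "\n")
    return record

rec = log_event("container_restart",
                service="web", container_id="abc123", trace_id="t-42",
                restarts_last_hour=3)
# a downstream collector can filter on rec["service"] or alert when
# rec["restarts_last_hour"] crosses a threshold
```

Because every record is one JSON object, the same line feeds log search, metric extraction, and trace correlation without per-consumer parsing rules.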
Krishnan Raman presented on LinkedIn's data obfuscation pipeline. The pipeline aims to analyze LinkedIn data to improve machine learning models, discover data quickly for analysis, and access data efficiently while complying with privacy regulations. It determines which files contain personally identifiable information (PII) to obfuscate, handles schema evolution, and preserves file names and types. WhereHows is used to track dataset lineage and locations. Obfuscated data is emitted with metrics on job progress captured in timeseries for monitoring the data pipeline. Challenges include unclean data, complex schemas, balancing failures vs dropped rows, and accounting for changing data and schemas. Auditing data, metadata, robust monitoring systems, and re-ob
This document proposes a log management solution using Logstash, Elasticsearch, and Kibana. Logstash is used to collect, parse, and index logs into Elasticsearch for centralized storage and real-time search. Kibana provides visualization and analytics dashboards. The solution offers scalability, reliability, searchability, and a low-cost and flexible open source approach to solving the challenges of gathering, analyzing, and gaining insights from large volumes of log data from diverse sources.
From R Script to Production Using rsparkling with Navdeep GillDatabricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
This document discusses LinkedIn's transition from an offline metrics platform to a near real-time "nearline" architecture using Apache Calcite and Apache Samza. It overviews LinkedIn's metrics platform and needs, and then details how the new nearline architecture works by translating Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. An example production use case for analyzing storylines on the LinkedIn platform is also presented. The nearline architecture allows metrics to be computed with latencies of 5-30 minutes rather than 3-6 hours previously.
This document provides an introduction to OpenStack, an open source software platform for building private and public clouds. It describes the key OpenStack components for compute (Nova), storage (Cinder, Glance, Swift), networking (Neutron), and identity (Keystone). It then discusses how organizations like CERN and PayPal use OpenStack to manage large amounts of data and computing resources in a scalable, distributed manner. The document concludes by outlining various ways that individuals can get involved and contribute to the OpenStack community.
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
Charles Allen covers data processing, analytics, and insights systems at Snap. Strength points for Druid use cases are called out as are differences in some of the processing systems used.
This is the slide collection from the second talk from:
https://www.meetup.com/druidio-la/events/254080924/
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
The Past, Present and Future of Big Data @LinkedInSuja Viswesan
LinkedIn processes huge amounts of data from user events across the globe at scale. They collect 2.3 trillion messages per day totaling 2.5 PB of data and process it using highly reliable fault tolerant batch and stream processing. They access this data by persisting it durably across 120 PB of HDFS storage and make it searchable and available for online services. Their analytics infrastructure includes data ingestion using Gobblin, dataset management using Dali, storage using HDFS and Voldemort, and compute engines like YARN. They use solutions like federated HDFS, Dali, Hadoop OrgQueue and elasticity tuning to scale their system, cluster management and computation across their infrastructure of tens of thousands of nodes
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...Dataconomy Media
The document discusses Valo, a big data analytics engine built from scratch focusing on simplicity and distributed capabilities. It describes Valo's architecture including time-series and semi-structured data repositories, REST API, and execution engine. It also discusses challenges of building distributed systems including cluster failures, data distribution, algorithms, and more.
Check out the webinar: https://imply.io/videos/whats-new-imply-3-3-apache-druid-0-18
The most recent Imply 3.3 release, based on Apache 0.18 brings several major new features, including joins, query laning and Clarity Alerts. These new features deliver increased design flexibility during design, and provide improved ingestion performance, and sub-second response times to help accelerate data warehouse and data lake deployments, and add real-time analytics in general.
Improve your SQL workload with observabilityOVHcloud
La majeure partie du SI d'OVH repose sur des bases de données relationnelles (PostgreSQL, MySQL, MariaDB). En termes de volumétrie cela représente 400 bases pesants plus de 20To de données réparties sur 60 clusters dans deux zones géographiques le tout propulsant 3000 applications.
Comment tout voir dans notre parc ? Mieux encore, comment faire pour que tout le monde puisse suivre l'activité de sa base de données ? C'est le challenge que nous nous sommes fixés, un an après nous pouvons partager notre expérience.
Et si l'observability n'était pas juste un buzzword, mais avait un réel impact sur la production ?
Archmage, Pinterest’s Real-time Analytics Platform on DruidImply
In this talk, we will talk about:
1) the motivation of switching from Hbase backed analytics system to Druid
2) the architecture design of Druid as a platform in Pinterest (Archmage, Hadoop, Kafka) including a query interface, Archmage, a thrift service in front of Druid which exposes a thrift api to company-wise clients, handles Druid broker hosts discovery, serves as a relay to broker hosts to abstract the async HTTP connection and provides query optimizations transparent to clients including directly translating fixed pattern SQL to Druid native JSON queries to save planning time. In addition, we’ll cover the production Hadoop batch and Kafka real time ingestion pipeline setup and the reason we picked a pull-based solution instead of a push-based solution for real time ingestion.
3) the use cases currently running in production on this platform, including their data volume, QPS, and Druid cluster setup; the unique challenges we met while onboarding them and how we addressed them with extensive tuning to meet SLAs; and lessons learned. Use cases include partner insights, which provides partners with stats on organic Pins; real-time spam detection, which detects user-login anomalies and Pin-related spamming events such as Pin creation and repins; and migrating the backend for Ads experiment data analysis from Presto to Druid.
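The fixed-pattern SQL translation described above can be sketched in a few lines: recognize one known query shape with a regex and emit the equivalent Druid native JSON query, bypassing the SQL planner. The pattern, datasource name, and field names here are assumptions for illustration, not Pinterest's actual implementation.

```python
import re

# One fixed SQL shape this relay knows how to translate (illustrative).
PATTERN = re.compile(
    r"SELECT COUNT\(\*\) FROM (\w+) "
    r"WHERE __time BETWEEN '([^']+)' AND '([^']+)'",
    re.IGNORECASE,
)

def sql_to_native(sql: str):
    """Translate a fixed-pattern SQL string to a Druid native query dict.

    Returns None when the SQL does not match, so the caller can fall
    back to the regular SQL planner.
    """
    m = PATTERN.fullmatch(sql.strip())
    if not m:
        return None
    datasource, start, end = m.groups()
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "intervals": [f"{start}/{end}"],
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "count"}],
    }

q = sql_to_native(
    "SELECT COUNT(*) FROM pins WHERE __time BETWEEN "
    "'2020-01-01' AND '2020-01-02'"
)
print(q["queryType"])
```

The design point is that only queries matching a known shape take the fast path; everything else still goes through normal planning.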
This document discusses techniques for optimizing application performance, including reducing latency and increasing throughput. It covers using data structures such as linked lists, bloom filters, and Merkle trees efficiently; removing contention through approaches like the disruptor pattern; optimizing for network performance; and leveraging the reactor pattern. The performance of transports such as XML/JSON and SOAP/REST is also evaluated, and monitoring tools like Java Flight Recorder are mentioned.
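Of the data structures named above, the Bloom filter is the easiest to show compactly: constant-time membership tests with a small false-positive rate and no false negatives, which is how it cuts latency on "definitely not present" lookups. This is a minimal sketch with illustrative sizes, not a production implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a fixed bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python big int used as the bit array

    def _positions(self, item: str):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # → True: no false negatives
```

A caller would consult the filter before a slower lookup (disk, network) and skip it whenever the filter says False.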
This document discusses Hitachi Universal Replicator software, which asynchronously replicates data between Hitachi storage systems over any distance. It satisfies demanding business continuity and disaster recovery requirements by maintaining integrity of replicated data even during network outages. The software optimizes storage resources, improves bandwidth utilization, and supports heterogeneous storage environments for maximum data protection flexibility.
The document outlines the key steps in an online training program for Hadoop including setting up a virtual Hadoop cluster, loading and parsing payment data from XML files into databases incrementally using scheduling, building a migration flow from databases into Hadoop and Hive, running Hive queries and exporting data back to databases, and visualizing output data in reports. The training will be delivered online over 20 hours using tools like GoToMeeting.
Modern Scientific Data Management Practices: The Atmospheric Radiation Measur...Globus
These slides were presented by Giri Prakash from Oak Ridge National Lab at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
This document discusses Druid in production at Fyber, a company that indexes 5 terabytes of data daily from various sources into Druid. It describes the hardware used, including 30 historical nodes and 2 broker nodes. Issues addressed include slow query times with many dimensions, some of them lists, and data cleanup steps taken to reduce cardinality, such as replacing values. Segment sizing and partitioning are also discussed, along with the data ingestion, querying, and optimizations used to scale Druid for Fyber's analytics needs.
This document provides an overview of the HathiTrust Research Center (HTRC) architecture. It describes the key components including a portal for access, an agent for application submission, a registry for storing metadata, a secure API for programmatic access, and storage of data in a Cassandra cluster with indexing in Solr. It also outlines use cases and discusses how the architecture enables secure, non-consumptive research on copyrighted works stored in the HathiTrust digital library.
Proactive ops for container orchestration environmentsDocker, Inc.
This document discusses different approaches to monitoring systems from manual and reactive to proactive monitoring using container orchestration tools. It provides examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers. The document emphasizes applying the principles of observability including structured logging, events and tracing with metadata, and monitoring the monitoring systems themselves. Speakers provide best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
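The structured-logging principle mentioned above can be sketched simply: emit each event as one JSON line carrying metadata (service, trace id) so the monitoring pipeline can index and correlate records. The field names below are assumptions for illustration, not a standard.

```python
import json
import time

def log_event(service: str, event: str, trace_id: str, **fields) -> dict:
    """Emit one structured log record as a JSON line and return it.

    Field names (service, event, trace_id) are illustrative conventions.
    """
    record = {
        "ts": time.time(),      # numeric timestamp for easy sorting
        "service": service,
        "event": event,
        "trace_id": trace_id,   # lets downstream tools join related events
        **fields,
    }
    print(json.dumps(record, sort_keys=True))
    return record

rec = log_event("scheduler", "container_restarted", "abc123", node="host-7")
```

Because every line is valid JSON with stable keys, a log pipeline can filter by `service` or follow a `trace_id` across layers without fragile text parsing.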
Getting real-time analytics for device, application, and business monitoring from trillions of events and petabytes of data, as companies like Netflix, Uber, Alibaba, PayPal, eBay, and Metamarkets do.
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
- IoT devices generate large streams of data that need to be collected and processed in real-time. MQTT and Kafka are common protocols for collecting IoT data streams. MQTT is lightweight but lacks scalability while Kafka is highly scalable.
- Stream processing platforms like Flink, Storm and Spark can be used to analyze the IoT data streams. Flink supports both batch and stream processing while Storm is best for low-latency streaming. Spark is better for machine learning on streams.
- An example use case is real-time equipment monitoring in a factory where IoT sensors stream data to Kafka, which is then processed by Flink to detect abnormalities and enable predictive maintenance. Performance is evaluated based on latency and throughput.
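The abnormality detection in that use case can be illustrated with a toy version of the logic a Flink job might run over the Kafka stream: flag a reading that far exceeds the mean of a sliding window of recent readings. The window size and threshold factor are made-up values for illustration.

```python
from collections import deque

def detect_anomalies(readings, window: int = 5, factor: float = 2.0):
    """Flag readings more than `factor` times the sliding-window mean.

    Returns (index, value) pairs for each flagged reading.
    """
    recent = deque(maxlen=window)  # last `window` readings
    anomalies = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            if value > factor * mean:
                anomalies.append((i, value))
        recent.append(value)
    return anomalies

# A stable sensor signal with one spike at index 5.
print(detect_anomalies([10, 11, 9, 10, 10, 41, 10]))  # → [(5, 41)]
```

A real deployment would express the same windowed comparison with Flink's keyed windows and emit alerts instead of returning a list.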
This document discusses Infobip's journey towards enabling real-time querying of aggregated data. Initially, Infobip had a monolithic architecture with a single database that became a bottleneck. They introduced multiple databases and microservices but querying spanned databases and results had to be joined. A data warehouse (GREEN) provided reporting but was not real-time. To enable real-time queries, Infobip implemented a lambda architecture using Kafka as the real-time data pipeline and Druid for real-time querying and aggregations, achieving sub-second responses and less than 2 seconds of data delay. This allows real-time insights from ingested messaging data while GREEN remains the batch/serving layer.
To understand an application’s performance, first you have to know what to measure. That’s the easy part. How do you take those measurements? Store them? Analyze them? Get them to the people who need them? Well, that’s where things get complicated, especially in the high-traffic distributed systems of the modern web! Like careful scientists, we must observe our subjects without altering them, and we must report our findings quickly so that we have the data necessary to make smart choices about the health and growth of the system.
Let’s explore the lessons learned by engineers at one of the world’s top web companies in their quest to find meaning at 5 MB/s. We’ll discuss the tools and techniques that enable the collection, indexing, and analysis of billions or more datapoints each hour, and learn how these same approaches can empower your applications and your business, no matter the scale.
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
Machine learning is becoming a must for many business domains and applications. H2O is a best-of-breed, open source, distributed machine learning library written in Java. The presentation shows how to create and train machine learning models easily using the H2O Flow web interface, including Deep Learning Neural Networks (DNNs). The session provides a tutorial on how to develop and deploy a fullstack-reactive face recognition demo using a React + RxJS WebSocket front-end, OpenCV, a Caffe CNN for image segmentation, an OpenFace CNN for feature extraction, and H2O Flow for interactive face recognition model training and export as a POJO. The trained POJO model is incorporated in a real-time streaming web service implemented using Spring 5 WebFlux and Spring Boot. The whole demo is 100% Java!
This document provides an introduction to big data and related technologies. It defines big data as datasets that are too large to be processed by traditional methods. The motivation for big data is the massive growth in data volume and variety. Technologies like Hadoop and Spark were developed to process this data across clusters of commodity servers. Hadoop uses HDFS for storage and MapReduce for processing. Spark improves on MapReduce with its use of resilient distributed datasets (RDDs) and lazy evaluation. The document outlines several big data use cases and projects involving areas like radio astronomy, particle physics, and engine sensor data. It also discusses when Hadoop and Spark are suitable technologies.
Sharding is a technique for partitioning and distributing data across multiple servers to enable scaling to large data volumes and workloads. It involves defining a shard key to partition data into chunks that are distributed across shards. The document discusses different types of sharding strategies like range, hash, and tag-aware sharding and how they apply to different use cases around scale, geo-distribution, and hardware optimization. It also covers best practices for building a sharded cluster like pre-splitting data, capacity planning, and using tools like MongoDB Management Service for production operations.
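The range and hash strategies described above differ only in how a shard key maps to a shard. This is a hedged sketch of both routings in miniature; the chunk boundaries and shard count are illustrative, not any particular database's defaults.

```python
import hashlib

def range_shard(key: str, boundaries: list) -> int:
    """Range routing: boundaries are sorted upper bounds, one per chunk.

    Keeps adjacent keys together, which helps range queries but can
    create hot spots on monotonically increasing keys.
    """
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # the last shard holds the tail of the keyspace

def hash_shard(key: str, num_shards: int) -> int:
    """Hashed routing: even distribution at the cost of range locality."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

print(range_shard("m-user", ["g", "n", "t"]))  # falls in the [g, n) chunk
```

Pre-splitting, as the document notes, amounts to choosing those boundary values up front so chunks land on shards before the data arrives.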
Edge computing and the Internet of Things bring great promise, but often just getting data from the edge requires moving mountains. Let's learn how to make edge data ingestion and analytics easier using StreamSets Data Collector edge, an ultralight, platform independent and small-footprint Open Source solution written in Go for streaming data from resource-constrained sensors and personal devices (like medical equipment or smartphones) to Apache Kafka, Amazon Kinesis and many others. This talk includes an overview of the SDC Edge main features, supported protocols and available processors for data transformation, insights on how it solves some challenges of traditional approaches to data ingestion, pipeline design basics, a walk-through some practical applications (Android devices and Raspberry Pi) and its integration with other technologies such as Streamsets Data Collector, Apache Kafka, Apache Hadoop, InfluxDB and Grafana. The goal here is to make attendees ready to quickly become IoT data intake and SDC Edge Ninjas.
Speaker
Guglielmo Iozzia, Big Data Delivery Manager, Optum (United Health)
Evolution from EDA to Data Mesh: Data in Motionconfluent
Thoughtworks' Zhamak Dehghani's observations on these traditional approaches' failure modes inspired her to develop an alternative big data management architecture that she aptly named the Data Mesh. It represents a paradigm shift that draws from modern distributed architecture and is founded on the principles of domain-driven design, a self-serve platform, and product thinking applied to data. Over the last decade, Apache Kafka has established a new category of data management infrastructure for data in motion that has been leveraged in modern distributed data architectures.
This document provides an overview and introduction to reactive robotics and the Internet of Things (IoT). It discusses several key concepts including reactive programming, functional reactive programming, and high-performance reactive Java. It also covers topics like concurrency, parallelism, queues, and the LMAX Disruptor design pattern. Code examples are provided to demonstrate reactive programming concepts using tools like RxJava. The document aims to explain reactive approaches that can help address complexity in robotics and IoT systems.
Stream Processing – Concepts and FrameworksGuido Schmutz
More and more data sources today provide a constant stream of data, from IoT devices to social media streams. It is one thing to collect these events at the velocity they arrive without losing a single message; an event hub and a data flow engine can help here. It's another thing to do some (complex) analytics on the data. There is always the option to first store the data in a sink of choice and analyze it later. Storing even a high-volume event stream is feasible and no longer a challenge, but this adds to the end-to-end latency, and it takes minutes if not hours to present results. If you need to react fast, you simply can't afford to store the data first; you need to process it directly on the data stream. This is called Stream Processing or Stream Analytics. In this talk I will present the important concepts a stream processing solution should support, then dive into some of the most popular frameworks available on the market and how they compare.
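Processing directly on the stream, as described above, usually means windowed aggregation. This is a minimal sketch of a tumbling-window count, the simplest such concept; events are (timestamp-in-seconds, key) pairs and the window size is illustrative.

```python
from collections import defaultdict

def tumbling_counts(events, window_secs: int = 60) -> dict:
    """Count events per (window start, key) over fixed, non-overlapping windows.

    Each event falls into exactly one window, keyed by the window's
    start timestamp rounded down to a multiple of window_secs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_secs) * window_secs
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "click"), (42, "click"), (61, "click"), (70, "view")]
print(tumbling_counts(events))
# → {(0, 'click'): 2, (60, 'click'): 1, (60, 'view'): 1}
```

Real frameworks add the parts this sketch omits: sliding and session windows, event-time versus processing-time semantics, and handling late-arriving events with watermarks.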
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operated on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple uses cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
This document discusses stream computing and various real-time analytics platforms for processing streaming data. It describes key concepts of stream computing like analyzing data in motion before storing, scaling to process large data volumes, and making faster decisions. Popular open-source platforms are explained briefly, including their architecture and uses - Spark, Storm, Kafka, Flume, and Amazon Kinesis.
Grid computing enables sharing of geographically distributed computing resources through a network. It allows for virtual organizations to collaborate on common goals without central control. The document discusses the types of grid computing including computational, data, and scavenging grids. It also outlines the key components of a grid including protocols, architecture, security, and resource management. Examples of existing grid projects are provided such as SETI@Home, EGEE, and BeINGrid.
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
Think you have big data? What about high availability requirements? At DataDog we process billions of data points every day, including metrics and events, as we help the world monitor their applications and infrastructure. Being the world's monitoring system is a big responsibility, and thanks to Redis we are up to the task. Join us as we discuss how the DataDog team monitors and scales Redis to power our SaaS-based monitoring offering. We will discuss our usage and deployment patterns, as well as dive into monitoring best practices for production Redis workloads.
This document discusses big data, including the large amounts of data being collected daily, challenges with traditional DBMS solutions, the need for new approaches like Hadoop and Aster Data to handle large volumes of structured and unstructured data, techniques for analyzing big data, and case studies of companies like Mobclix and Yahoo using big data solutions.
Similar to HathiTrust Research Center: The Fast Version (20)
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...Robert H. McDonald
The presentation provided an overview of the HathiTrust Research Center (HTRC) and its services. HTRC provides access to over 13 million digitized book volumes and facilitates text mining and analysis through its extracted features dataset, data capsule, and other tools. It discussed challenges of text mining copyrighted works and demonstrated use cases using distant reading techniques. HTRC also works on outreach, education, and developing new interfaces and tools to enable scholarly research using its collections and infrastructure.
This document provides an agenda and information about a tutorial on topic exploration using the HathiTrust Research Center (HTRC) Data Capsule. The agenda includes an overview of HTRC, an introduction to the Data Capsule and topic modeling, and hands-on sessions. Information is also provided about HTRC, including its mission to enable non-consumptive research on HathiTrust's digital library, its organizational structure, goals for the future, and important URLs.
The HathiTrust Research Center: An Overview of Advanced Computational ServicesRobert H. McDonald
These are my slides from the DPLAFest 2015 held in Indianapolis, IN on 04/17/2015-04/18/2015.
For more see - https://dplafest2015.sched.org/event/a1cfbaca67fd71a2409d28d9b27b1351
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterRobert H. McDonald
This document summarizes a presentation about scaling storage for the HathiTrust Research Center. The HTRC is a collaborative research center between Indiana University and University of Illinois that enables text data mining of the HathiTrust Digital Library. It discusses the mission and goals of HTRC, its partnerships with HathiTrust universities, and the services and tools it provides researchers. It also outlines the large amount of content in HathiTrust, HTRC's non-consumptive research paradigm, and its data and storage architecture to support terabyte-scale analysis of public domain and in-copyright texts.
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...Robert H. McDonald
This is the slide deck for my ACRL 2015 TechConnect presentation with Nicole Vasilevsky (OHSU). For more on the program see http://bit.ly/1xcQbCr.
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkRobert H. McDonald
This is the presentation on the HTRC given at the Indiana University booth at Supercomputing 2014 by Beth Plale - Co-Director HTRC and Robert McDonald - HTRC Executive Management Group.
This is the slide deck for the presentation that was given with Kate Lawrence (VP User Experience EBSCO), Courtney McDonald (Indiana University), and Esther Onega (University of Virginia) at the 2014 Charleston Conference on Thursday Nov 6, 2014.
Kuali OLE is an open source library services platform developed by librarians for flexibility and integration. It has 66 members from 10 institutions and is funded by partners and the Mellon Foundation. The platform has four modules and provides selection/acquisition, ERM and linked data functionality. It offers hosted, local or hybrid implementation options and seeks to expand consortial support and full ERM functions.
Charleston Seminar Being Earnest with our Collections - Legacy to CloudRobert H. McDonald
These are my slides for the 2014 Charleston Conference Seminar, "Being Earnest with our Collections," that I presented with Jill Grogg on moving libraries to the cloud.
The HathiTrust Research Center (HTRC): An Overview and DemoRobert H. McDonald
The session will provide an overview of the HathiTrust Research Center including its mission and current status. It will also include a demonstration of current HTRC phase one technology and services. Additionally, the speakers will address the HTRC's role in supporting humanities research at scale.
SEAD is a NSF DataNet project that aims to provide cyberinfrastructure for long tail data in sustainability science research. It develops tools for active and social curation of data including an Active Curation Repository (ACR) and VIVO profiles. It also creates a Virtual Archive to facilitate long-term access and preservation of datasets across multiple institutional repositories. The presentation provides an overview of SEAD's approach and highlights pilots with the National Center for Earth Surface Dynamics, including ingesting their data collections into the ACR and Virtual Archive and building a social network in VIVO.
New Perspectives for Business Intelligence: Library and Research Technologies...Robert H. McDonald
This is our presentation for Educause 2012 entitled New Perspectives for Business Intelligence: Library and Research Technologies and Research Collaboration for New Data Models held on Nov 8, 2012.
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...Robert H. McDonald
This document summarizes a presentation about Kuali OLE, an open source library management system created through collaboration between multiple universities. It describes the journey to create a collaborative community to develop the system, including establishing functional councils, technical architecture choices, and community organization. It also discusses plans for deployment, creating an ecosystem of vendors, investing in the community, and expanding globally.
GOKb & KB+: An International Partnership to leverage Open Access and Communit...Robert H. McDonald
GOKb & KB+ is an international partnership between Kuali OLE and JISC to leverage open access and community participation to enhance eContent metadata. The partnership aims to create a freely available global open knowledgebase (GOKb) of publication information about electronic resources. GOKb will integrate with Kuali OLE and JISC's Knowledge Base+ to reduce duplication of effort and improve the sustainability and quality of metadata.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
It describes the bony anatomy of the hip, including the femoral head, acetabulum, and labrum, and also discusses the capsule and ligaments. The muscles that act on the hip joint and its range of motion are outlined, and factors affecting hip joint stability and weight transmission through the joint are summarized.
Bangladesh Economic Review 2024 [Bangladesh Economic Review 2024 Bangla.pdf]: a complete Bangla e-book/PDF for computer, tablet, and smartphone, with a table of contents plus bookmark and hyperlink menus included.
A very important book for all of us: a key topic for BCS, bank, and university admission exams and for any competitive exam. It also contains the latest data and information about Bangladesh.
As a citizen, you should know this information.
Useful for the BCS and bank written exams, and also very helpful for secondary and higher-secondary students.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
1. HathiTrust Research Center
The Fast Version
Robert H. McDonald | @mcdonald
Executive Committee-HathiTrust Research Center (HTRC)
Deputy Director-Data to Insight Center
Associate Dean-University Libraries
Indiana University
2. HTRC Mission
The HathiTrust Research Center (HTRC) is a collaborative research center launched jointly by Indiana University and the University of Illinois to act as the public-facing research arm of the massive HathiTrust Digital Library. The HTRC is mandated to help researchers from around the world surmount the difficulties associated with processing and analyzing terascale amounts of digital text. Thus, the scholarly developers at HTRC work to develop cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge. HTRC began its efforts in July 2011.
3. HTRC Non-Consumptive Research Paradigm
• No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the collection of copyrighted works to reassemble pages from the collection.
• The definition disallows collusion between users, or the accumulation of material over time. It differentiates the human researcher from a proxy, which is not a user. Users are human beings.
4. HTRC Current Infrastructure
• Servers
– 14 production-level quad-core servers (virtual machines)
• 16–32 GB of memory
• 250–500 GB of local disk each
– 6-node Cassandra cluster for volume store
– Ingest service and secure Data API access point
• Storage (IU University Infrastructure)
– 13 TB of 15,000 RPM SAS disk storage
– Increase up to 17 TB by end of 2012
– 500 TB available in late year 2–year 3
5. HTRC Architecture
[Architecture diagram] Components shown: portal access (Blacklight); an agent handling job submission and collection building; direct programmatic access by programs running on HTRC machines; security (OAuth2); the Data API access interface with a Solr proxy; a registry (WSO2) with auditing; Meandre algorithms and workflows; a Cassandra cluster serving as the volume store; result sets, collections, and a Solr index; and the underlying compute and storage resources.
7. Contact Information
• Robert H. McDonald
– Email: robert@indiana.edu
– Chat: rhmcdonald on googletalk | skype
– Twitter: @mcdonald
– Blog: http://www.rmcdonald.net
– Twitter hashtag: #HTRC12 http://slidesha.re/QCOrIX
– Web: http://www.hathitrust.org/htrc
Editor's Notes
Registry – the agent can deploy any service listed in this diagram and can run it with the computational resources. The original plan is to use XSEDE – not using this on the IIS machine, but are using ODIN (a 128-node cluster; each node has 4 GB of memory and 4 computation cores) and smoketree (a D2I server with 24 physical cores, 48 logical cores, and 128 GB of memory) – these are not long term, just in use for now.