Recent Upgrades to ARM Data Transfer and Delivery Using Globus – Globus
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Giri Prakash from the ARM Data Center at Oak Ridge National Laboratory.
Enabling Secure Data Discoverability (SC21 Tutorial) – Globus
Major research instruments are generating orders of magnitude more data in relatively short timeframes. As a result, the research enterprise is increasingly challenged by what should be mundane tasks: describing data for discovery and making data securely accessible to the broader research community. The ad hoc methods currently employed place undue burden on scientists and system administrators alike, and it is clear that a more robust, scalable approach is required.
Bespoke data portals (and science gateways/data commons) are becoming more prominent as a means of enabling access to large datasets. In this tutorial we demonstrate how services for authentication, authorization, metadata management, and search may be integrated with popular web frameworks, and used in combination with fast, well-architected networks to make data discoverable and accessible. Outcomes: build a simple but functional data portal that facilitates flexible data description, faceted data search, and secure data access.
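Faceted search of the kind the tutorial builds reduces to counting attribute values across a metadata index and filtering on the user's selections. A minimal pure-Python sketch, with hypothetical record fields (`instrument`, `year`) that are not taken from the tutorial itself:

```python
from collections import Counter

def facet_counts(records, facet_fields):
    """Count occurrences of each value of each facet field across records."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            if field in record:
                counts[field][record[field]] += 1
    return counts

def apply_filters(records, filters):
    """Keep only records matching every selected facet value."""
    return [r for r in records if all(r.get(f) == v for f, v in filters.items())]

datasets = [
    {"title": "Radar scan A", "instrument": "radar", "year": 2020},
    {"title": "Radar scan B", "instrument": "radar", "year": 2021},
    {"title": "Lidar sweep", "instrument": "lidar", "year": 2021},
]

counts = facet_counts(datasets, ["instrument", "year"])
# counts["instrument"] == Counter({"radar": 2, "lidar": 1})
filtered = apply_filters(datasets, {"year": 2021})
# two records remain after selecting year == 2021
```

A real portal would back these counts with a search service rather than an in-memory list, but the facet/filter contract is the same.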
The document discusses India's Aadhaar identity system, which collects biometric data on 1.2 billion Indian residents. MongoDB is used to store and search this identity data across multiple shards because of its auto-sharding, replication, and evolving-schema capabilities. The implementation shards data across 8 shards, each holding over 2 TB, with performance and reliability addressed through replica sets, write-concern configuration, and manual monitoring processes.
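The summary does not say how records are routed, but hash-based placement onto 8 shards, the idea behind MongoDB's hashed shard keys, can be sketched in a few lines. The 12-digit identifier format and the use of MD5 here are illustrative assumptions, not details of the Aadhaar deployment:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # the deployment described runs 8 shards

def shard_for(resident_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard by hashing the record's key: the same idea as
    MongoDB's hashed shard keys (MongoDB uses its own hash function)."""
    digest = hashlib.md5(resident_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# hypothetical 12-digit identifiers standing in for real records
ids = [f"{n:012d}" for n in range(10_000)]
placement = Counter(shard_for(i) for i in ids)
# all 8 shards are used, and the hash spreads records roughly evenly
```

Hashing the key avoids the hot spots that monotonically increasing identifiers would create under range-based sharding.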
Since 1962, ICPSR has been an integral part of the infrastructure of social science research with its vast digital archive supporting over 700 member institutions worldwide. With the release of our new digital assets management system “Archonnex,” ICPSR continues this tradition by extending our expertise and digital technology capabilities as a service to the larger community. For the first time researchers, institutions, organizations, and even nations will be able to host their own repositories and set up data services for their members. We call it RaaS – Repository as a Service.
Live Geoinformation with Standardized Geoprocessing Services – Theodor Foerster
This document proposes using HTTP Live Streaming to enable streaming web processing services. This allows for asynchronous and progressive transfer of geodata between a client and server. It improves performance, scalability, and the user experience over traditional synchronous WPS. The approach was implemented using 52North WPS and evaluated using a use case of generalizing OpenStreetMap data streams. Results demonstrated this streaming approach reduces memory footprint and improves processing time compared to a reference implementation.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
WoSC19: Serverless Workflows for Indexing Large Scientific Data – University of Chicago
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific “data lakes” quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function-as-a-service models to enable scalable metadata extraction by orchestrating the execution of many short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near to the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
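Xtract's orchestration of short-running extractors can be pictured as a dispatch from file type to extractor function. The extractors and file formats below are toy illustrations of the pattern, not Xtract's actual API:

```python
import json
import os
import tempfile

def extract_text(path):
    """Tiny text extractor: word count only."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return {"type": "text", "word_count": len(f.read().split())}

def extract_json(path):
    """Tiny JSON extractor: top-level keys only."""
    with open(path, encoding="utf-8") as f:
        return {"type": "json", "top_level_keys": sorted(json.load(f))}

# dispatch table: file extension -> short-running extractor function
EXTRACTORS = {".txt": extract_text, ".json": extract_json}

def extract_metadata(path):
    """Route a file to the matching extractor, as an orchestrator would."""
    extractor = EXTRACTORS.get(os.path.splitext(path)[1])
    if extractor is None:
        return {"file": os.path.basename(path), "type": "unknown"}
    return {"file": os.path.basename(path), **extractor(path)}

# demo on a throwaway "repository" of two files
repo = tempfile.mkdtemp()
with open(os.path.join(repo, "notes.txt"), "w", encoding="utf-8") as f:
    f.write("three little words")
with open(os.path.join(repo, "run.json"), "w", encoding="utf-8") as f:
    json.dump({"beamline": "8-ID", "energy_kev": 7.35}, f)
index = [extract_metadata(os.path.join(repo, name))
         for name in sorted(os.listdir(repo))]
```

In the real system each extractor invocation would be a serverless function call, so thousands of files can be processed in parallel without a long-lived worker per file.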
Krishnan Raman presented on LinkedIn's data obfuscation pipeline. The pipeline aims to analyze LinkedIn data to improve machine learning models, discover data quickly for analysis, and access data efficiently while complying with privacy regulations. It determines which files contain personally identifiable information (PII) to obfuscate, handles schema evolution, and preserves file names and types. WhereHows is used to track dataset lineage and locations. Obfuscated data is emitted with metrics on job progress captured as time series for monitoring the data pipeline. Challenges include unclean data, complex schemas, balancing failures vs. dropped rows, and accounting for changing data and schemas. Auditing data, metadata, robust monitoring systems, and re-ob…
This document proposes a log management solution using Logstash, Elasticsearch, and Kibana. Logstash is used to collect, parse, and index logs into Elasticsearch for centralized storage and real-time search. Kibana provides visualization and analytics dashboards. The solution offers scalability, reliability, searchability, and a low-cost and flexible open source approach to solving the challenges of gathering, analyzing, and gaining insights from large volumes of log data from diverse sources.
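The parse step of such a pipeline turns raw log lines into structured documents before indexing. A grok-like sketch in Python, assuming an Apache-style access-log format (the format and field names are illustrative):

```python
import re

# an Apache-style access-log line (format assumed for illustration)
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    """Parse one raw line into the kind of structured document Logstash
    would ship to Elasticsearch; return None for unparseable lines."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    doc = m.groupdict()
    doc["status"] = int(doc["status"])
    doc["bytes"] = int(doc["bytes"])
    return doc

line = '10.0.0.1 - - [12/Mar/2020:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512'
doc = parse_line(line)
# doc["status"] == 200, doc["path"] == "/index.html"
```

Once fields like `status` are typed rather than embedded in free text, Elasticsearch can aggregate on them and Kibana can chart them directly.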
From R Script to Production Using rsparkling with Navdeep Gill – Databricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C... – Khai Tran
This document discusses LinkedIn's transition from an offline metrics platform to a near real-time "nearline" architecture using Apache Calcite and Apache Samza. It overviews LinkedIn's metrics platform and needs, and then details how the new nearline architecture works by translating Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. An example production use case for analyzing storylines on the LinkedIn platform is also presented. The nearline architecture allows metrics to be computed with latencies of 5-30 minutes rather than 3-6 hours previously.
This document provides an introduction to OpenStack, an open source software platform for building private and public clouds. It describes the key OpenStack components for compute (Nova), storage (Cinder, Glance, Swift), networking (Neutron), and identity (Keystone). It then discusses how organizations like CERN and PayPal use OpenStack to manage large amounts of data and computing resources in a scalable, distributed manner. The document concludes by outlining various ways that individuals can get involved and contribute to the OpenStack community.
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018 – Charles Allen
Charles Allen covers data processing, analytics, and insights systems at Snap. Strengths of the Druid use cases are called out, as are differences among some of the processing systems used.
This is the slide collection from the second talk at:
https://www.meetup.com/druidio-la/events/254080924/
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
The Past, Present and Future of Big Data @LinkedIn – Suja Viswesan
LinkedIn processes huge amounts of data from user events across the globe at scale. They collect 2.3 trillion messages per day, totaling 2.5 PB of data, and process it using highly reliable, fault-tolerant batch and stream processing. They access this data by persisting it durably across 120 PB of HDFS storage and make it searchable and available for online services. Their analytics infrastructure includes data ingestion using Gobblin, dataset management using Dali, storage using HDFS and Voldemort, and compute engines like YARN. They use solutions like federated HDFS, Dali, Hadoop OrgQueue, and elasticity tuning to scale their system, cluster management, and computation across an infrastructure of tens of thousands of nodes.
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J... – Dataconomy Media
The document discusses Valo, a big data analytics engine built from scratch focusing on simplicity and distributed capabilities. It describes Valo's architecture including time-series and semi-structured data repositories, REST API, and execution engine. It also discusses challenges of building distributed systems including cluster failures, data distribution, algorithms, and more.
Check out the webinar: https://imply.io/videos/whats-new-imply-3-3-apache-druid-0-18
The most recent Imply 3.3 release, based on Apache Druid 0.18, brings several major new features, including joins, query laning, and Clarity Alerts. These features increase design flexibility, improve ingestion performance, and deliver sub-second response times, helping accelerate data warehouse and data lake deployments and add real-time analytics more generally.
Improve your SQL workload with observability – OVHcloud
Most of OVH's information system runs on relational databases (PostgreSQL, MySQL, MariaDB). In terms of volume, that represents 400 databases holding more than 20 TB of data, spread across 60 clusters in two geographic zones, powering 3,000 applications in all.
How can we see everything in our fleet? Better still, how can we let everyone follow the activity of their own database? That is the challenge we set ourselves, and one year on we can share our experience.
And what if observability were not just a buzzword, but had a real impact on production?
Archmage, Pinterest’s Real-time Analytics Platform on Druid – Imply
In this talk, we will cover:
1) the motivation for switching from an HBase-backed analytics system to Druid;
2) the architecture of Druid as a platform at Pinterest (Archmage, Hadoop, Kafka). The query interface, Archmage, is a Thrift service in front of Druid that exposes a Thrift API to clients across the company, handles Druid broker host discovery, serves as a relay to the broker hosts to abstract away the async HTTP connection, and provides query optimizations that are transparent to clients, such as directly translating fixed-pattern SQL to Druid native JSON queries to save planning time. We’ll also cover the production Hadoop batch and Kafka real-time ingestion pipeline setup and why we picked a pull-based rather than a push-based solution for real-time ingestion;
3) the use cases currently running in production on this platform, including their data volume, QPS, and Druid cluster setup; the unique challenges we met while onboarding and how we addressed them with extensive tuning to meet SLAs; and lessons learned. The use cases include partner insights, which provides partners with stats on organic Pins; real-time spam detection, which detects user-login anomalies and Pin-related spam events such as Pin creation and repins; and migrating the backend for Ads experiment data analysis from Presto to Druid.
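The fixed-pattern SQL translation mentioned in point 2 can be illustrated: match one known SQL shape and emit the equivalent Druid native JSON query directly, skipping general-purpose planning. The SQL pattern below is invented for illustration; the output follows Druid's native timeseries query shape:

```python
import re

# one fixed SQL shape: a total count over a time interval
PATTERN = re.compile(
    r"SELECT COUNT\(\*\) FROM (?P<source>\w+) "
    r"WHERE __time BETWEEN '(?P<start>[\d-]+)' AND '(?P<end>[\d-]+)'",
    re.IGNORECASE,
)

def sql_to_druid(sql):
    """Translate a fixed-pattern SQL string straight to a Druid native
    timeseries query, bypassing a general SQL planner; None if no match."""
    m = PATTERN.match(sql.strip())
    if not m:
        return None
    return {
        "queryType": "timeseries",
        "dataSource": m.group("source"),
        "intervals": [f"{m.group('start')}/{m.group('end')}"],
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "count"}],
    }

q = sql_to_druid(
    "SELECT COUNT(*) FROM pins WHERE __time BETWEEN '2020-01-01' AND '2020-02-01'"
)
```

Because the shape is known in advance, the translation is a constant-time template fill rather than a full parse-and-plan, which is where the planning-time savings come from.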
This document discusses various techniques for optimizing application performance, including reducing latency and increasing throughput. It covers strategies such as using data structures like linked lists, Bloom filters, and Merkle trees efficiently. Other topics include removing contention through approaches like the Disruptor pattern, optimizing for network performance, and leveraging the reactor pattern. The performance of transports such as XML/JSON and SOAP/REST is also evaluated, and monitoring tools like Java Flight Recorder are mentioned.
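Of the data structures listed, the Bloom filter is worth a sketch: a space-efficient membership test that answers "definitely absent" or "probably present", with tunable false positives but never false negatives. A minimal pure-Python version, with bit-array size and hash scheme chosen arbitrarily:

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership: 'definitely absent' or 'probably present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item):
        # derive k bit positions by salting the item with the hash index
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
for word in ["alpha", "beta", "gamma"]:
    bf.add(word)

# false-positive rate stays tiny while the filter is nearly empty
false_positives = sum(1 for i in range(1000) if f"absent-{i}" in bf)
```

The latency win is that a negative Bloom lookup avoids a disk or network round-trip entirely, at the cost of a rare wasted lookup on a false positive.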
This document discusses Hitachi Universal Replicator software, which asynchronously replicates data between Hitachi storage systems over any distance. It satisfies demanding business continuity and disaster recovery requirements by maintaining integrity of replicated data even during network outages. The software optimizes storage resources, improves bandwidth utilization, and supports heterogeneous storage environments for maximum data protection flexibility.
The document outlines the key steps in an online training program for Hadoop including setting up a virtual Hadoop cluster, loading and parsing payment data from XML files into databases incrementally using scheduling, building a migration flow from databases into Hadoop and Hive, running Hive queries and exporting data back to databases, and visualizing output data in reports. The training will be delivered online over 20 hours using tools like GoToMeeting.
Modern Scientific Data Management Practices: The Atmospheric Radiation Measur... – Globus
These slides were presented by Giri Prakash from Oak Ridge National Lab at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
This document discusses Druid in production at Fyber, a company that indexes 5 terabytes of data daily into Druid from various sources. It describes the hardware used, including 30 historical nodes and 2 broker nodes. Issues addressed include slow query times when querying many dimensions, some of them lists, and data cleanup steps that reduce cardinality, such as replacing values. Segment sizing and partitioning are also discussed, covering the hardware, data ingestion, querying, and optimizations used to scale Druid for Fyber's analytics needs.
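One common cardinality-reduction cleanup, replacing all but the most frequent values of a dimension with a placeholder before ingestion, can be sketched as follows. The column name and thresholds are illustrative, not Fyber's actual rules:

```python
from collections import Counter

def reduce_cardinality(rows, column, keep_top=2, placeholder="other"):
    """Replace all but the most frequent values of a dimension with a
    placeholder, shrinking the value dictionary Druid must index."""
    freq = Counter(row[column] for row in rows)
    keep = {value for value, _ in freq.most_common(keep_top)}
    return [
        {**row, column: row[column] if row[column] in keep else placeholder}
        for row in rows
    ]

rows = [{"country": c} for c in
        ["US", "US", "US", "DE", "DE", "TV", "NR", "VA"]]
cleaned = reduce_cardinality(rows, "country")
# cardinality drops from 5 distinct values to 3: US, DE, other
```

Smaller dictionaries mean smaller segments and faster group-by queries, which is why this kind of cleanup pays off at ingestion time.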
This document provides an overview of the HathiTrust Research Center (HTRC) architecture. It describes the key components including a portal for access, an agent for application submission, a registry for storing metadata, a secure API for programmatic access, and storage of data in a Cassandra cluster with indexing in Solr. It also outlines use cases and discusses how the architecture enables secure, non-consumptive research on copyrighted works stored in the HathiTrust digital library.
Proactive ops for container orchestration environments – Docker, Inc.
This document discusses different approaches to monitoring systems from manual and reactive to proactive monitoring using container orchestration tools. It provides examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers. The document emphasizes applying the principles of observability including structured logging, events and tracing with metadata, and monitoring the monitoring systems themselves. Speakers provide best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
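The structured-logging principle the talk emphasizes, emitting machine-parseable events with metadata rather than free-text lines, can be sketched in a few lines of Python. The field names here are illustrative:

```python
import json
import sys
import time

def log_event(event, **fields):
    """Emit one structured log record as a single JSON line, carrying
    metadata (service, trace id, timestamp) alongside the event name."""
    record = {"ts": time.time(), "event": event, **fields}
    sys.stdout.write(json.dumps(record, sort_keys=True) + "\n")
    return record

rec = log_event("container_restart",
                service="web", container_id="abc123", trace_id="t-42",
                restarts_last_hour=3)
# a downstream collector can filter on rec["service"] or alert when
# rec["restarts_last_hour"] crosses a threshold
```

Because every record is one JSON object, the same line feeds log search, metric extraction, and trace correlation without per-consumer parsing rules.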
Krishnan Raman presented on LinkedIn's data obfuscation pipeline. The pipeline aims to analyze LinkedIn data to improve machine learning models, discover data quickly for analysis, and access data efficiently while complying with privacy regulations. It determines which files contain personally identifiable information (PII) to obfuscate, handles schema evolution, and preserves file names and types. WhereHows is used to track dataset lineage and locations. Obfuscated data is emitted with metrics on job progress captured in timeseries for monitoring the data pipeline. Challenges include unclean data, complex schemas, balancing failures vs dropped rows, and accounting for changing data and schemas. Auditing data, metadata, robust monitoring systems, and re-ob
This document proposes a log management solution using Logstash, Elasticsearch, and Kibana. Logstash is used to collect, parse, and index logs into Elasticsearch for centralized storage and real-time search. Kibana provides visualization and analytics dashboards. The solution offers scalability, reliability, searchability, and a low-cost and flexible open source approach to solving the challenges of gathering, analyzing, and gaining insights from large volumes of log data from diverse sources.
From R Script to Production Using rsparkling with Navdeep GillDatabricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
This document discusses LinkedIn's transition from an offline metrics platform to a near real-time "nearline" architecture using Apache Calcite and Apache Samza. It overviews LinkedIn's metrics platform and needs, and then details how the new nearline architecture works by translating Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. An example production use case for analyzing storylines on the LinkedIn platform is also presented. The nearline architecture allows metrics to be computed with latencies of 5-30 minutes rather than 3-6 hours previously.
This document provides an introduction to OpenStack, an open source software platform for building private and public clouds. It describes the key OpenStack components for compute (Nova), storage (Cinder, Glance, Swift), networking (Neutron), and identity (Keystone). It then discusses how organizations like CERN and PayPal use OpenStack to manage large amounts of data and computing resources in a scalable, distributed manner. The document concludes by outlining various ways that individuals can get involved and contribute to the OpenStack community.
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
Charles Allen covers data processing, analytics, and insights systems at Snap. Strength points for Druid use cases are called out as are differences in some of the processing systems used.
This is the slide collection from the second talk from:
https://www.meetup.com/druidio-la/events/254080924/
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
The Past, Present and Future of Big Data @LinkedInSuja Viswesan
LinkedIn processes huge amounts of data from user events across the globe at scale. They collect 2.3 trillion messages per day totaling 2.5 PB of data and process it using highly reliable fault tolerant batch and stream processing. They access this data by persisting it durably across 120 PB of HDFS storage and make it searchable and available for online services. Their analytics infrastructure includes data ingestion using Gobblin, dataset management using Dali, storage using HDFS and Voldemort, and compute engines like YARN. They use solutions like federated HDFS, Dali, Hadoop OrgQueue and elasticity tuning to scale their system, cluster management and computation across their infrastructure of tens of thousands of nodes
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...Dataconomy Media
The document discusses Valo, a big data analytics engine built from scratch focusing on simplicity and distributed capabilities. It describes Valo's architecture including time-series and semi-structured data repositories, REST API, and execution engine. It also discusses challenges of building distributed systems including cluster failures, data distribution, algorithms, and more.
Check out the webinar: https://imply.io/videos/whats-new-imply-3-3-apache-druid-0-18
The most recent Imply 3.3 release, based on Apache 0.18 brings several major new features, including joins, query laning and Clarity Alerts. These new features deliver increased design flexibility during design, and provide improved ingestion performance, and sub-second response times to help accelerate data warehouse and data lake deployments, and add real-time analytics in general.
Improve your SQL workload with observabilityOVHcloud
La majeure partie du SI d'OVH repose sur des bases de données relationnelles (PostgreSQL, MySQL, MariaDB). En termes de volumétrie cela représente 400 bases pesants plus de 20To de données réparties sur 60 clusters dans deux zones géographiques le tout propulsant 3000 applications.
Comment tout voir dans notre parc ? Mieux encore, comment faire pour que tout le monde puisse suivre l'activité de sa base de données ? C'est le challenge que nous nous sommes fixés, un an après nous pouvons partager notre expérience.
Et si l'observability n'était pas juste un buzzword, mais avait un réel impact sur la production ?
Archmage, Pinterest’s Real-time Analytics Platform on DruidImply
In this talk, we will talk about:
1) the motivation of switching from Hbase backed analytics system to Druid
2) the architecture design of Druid as a platform in Pinterest (Archmage, Hadoop, Kafka) including a query interface, Archmage, a thrift service in front of Druid which exposes a thrift api to company-wise clients, handles Druid broker hosts discovery, serves as a relay to broker hosts to abstract the async HTTP connection and provides query optimizations transparent to clients including directly translating fixed pattern SQL to Druid native JSON queries to save planning time. In addition, we’ll cover the production Hadoop batch and Kafka real time ingestion pipeline setup and the reason we picked a pull-based solution instead of a push-based solution for real time ingestion.
3) the use cases currently running in production on this platform, including their data volume, QPS, and Druid cluster setup; the unique challenges we met while onboarding them and how we addressed them with extensive tuning to meet SLAs; and lessons learned. Use cases include partner insights, which provides partners with stats on organic Pins; real-time spam detection, which detects user-login anomalies and Pin-related spamming events such as Pin creation and repins; and migrating the backend for Ads experiment data analysis from Presto to Druid.
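The fixed-pattern SQL translation described above can be sketched in a few lines: recognize one known query shape with a regex and emit the equivalent Druid native JSON query, bypassing the SQL planner. The pattern, datasource name, and field names here are assumptions for illustration, not Pinterest's actual implementation.

```python
import re

# One fixed SQL shape this relay knows how to translate (illustrative).
PATTERN = re.compile(
    r"SELECT COUNT\(\*\) FROM (\w+) "
    r"WHERE __time BETWEEN '([^']+)' AND '([^']+)'",
    re.IGNORECASE,
)

def sql_to_native(sql: str):
    """Translate a fixed-pattern SQL string to a Druid native query dict.

    Returns None when the SQL does not match, so the caller can fall
    back to the regular SQL planner.
    """
    m = PATTERN.fullmatch(sql.strip())
    if not m:
        return None
    datasource, start, end = m.groups()
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "intervals": [f"{start}/{end}"],
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "count"}],
    }

q = sql_to_native(
    "SELECT COUNT(*) FROM pins WHERE __time BETWEEN "
    "'2020-01-01' AND '2020-01-02'"
)
print(q["queryType"])
```

The design point is that only queries matching a known shape take the fast path; everything else still goes through normal planning.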
This document discusses techniques for optimizing application performance, including reducing latency and increasing throughput. It covers using data structures such as linked lists, bloom filters, and Merkle trees efficiently; removing contention through approaches like the disruptor pattern; optimizing for network performance; and leveraging the reactor pattern. The performance of transports such as XML/JSON and SOAP/REST is also evaluated, and monitoring tools like Java Flight Recorder are mentioned.
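Of the data structures named above, the Bloom filter is the easiest to show compactly: constant-time membership tests with a small false-positive rate and no false negatives, which is how it cuts latency on "definitely not present" lookups. This is a minimal sketch with illustrative sizes, not a production implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over a fixed bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python big int used as the bit array

    def _positions(self, item: str):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # → True: no false negatives
```

A caller would consult the filter before a slower lookup (disk, network) and skip it whenever the filter says False.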
This document discusses Hitachi Universal Replicator software, which asynchronously replicates data between Hitachi storage systems over any distance. It satisfies demanding business continuity and disaster recovery requirements by maintaining integrity of replicated data even during network outages. The software optimizes storage resources, improves bandwidth utilization, and supports heterogeneous storage environments for maximum data protection flexibility.
The document outlines the key steps in an online training program for Hadoop including setting up a virtual Hadoop cluster, loading and parsing payment data from XML files into databases incrementally using scheduling, building a migration flow from databases into Hadoop and Hive, running Hive queries and exporting data back to databases, and visualizing output data in reports. The training will be delivered online over 20 hours using tools like GoToMeeting.
Modern Scientific Data Management Practices: The Atmospheric Radiation Measur...Globus
These slides were presented by Giri Prakash from Oak Ridge National Lab at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
This document discusses Druid in production at Fyber, a company that indexes 5 terabytes of data daily from various sources into Druid. It describes the hardware used, including 30 historical nodes and 2 broker nodes. Issues addressed include slow query times with many dimensions, some of them lists, and data cleanup steps taken to reduce cardinality, such as replacing values. Segment sizing and partitioning are also discussed, along with the data ingestion, querying, and optimizations used to scale Druid for Fyber's analytics needs.
This document provides an overview of the HathiTrust Research Center (HTRC) architecture. It describes the key components including a portal for access, an agent for application submission, a registry for storing metadata, a secure API for programmatic access, and storage of data in a Cassandra cluster with indexing in Solr. It also outlines use cases and discusses how the architecture enables secure, non-consumptive research on copyrighted works stored in the HathiTrust digital library.
Proactive ops for container orchestration environmentsDocker, Inc.
This document discusses different approaches to monitoring systems from manual and reactive to proactive monitoring using container orchestration tools. It provides examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers. The document emphasizes applying the principles of observability including structured logging, events and tracing with metadata, and monitoring the monitoring systems themselves. Speakers provide best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
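The structured-logging principle mentioned above can be sketched simply: emit each event as one JSON line carrying metadata (service, trace id) so the monitoring pipeline can index and correlate records. The field names below are assumptions for illustration, not a standard.

```python
import json
import time

def log_event(service: str, event: str, trace_id: str, **fields) -> dict:
    """Emit one structured log record as a JSON line and return it.

    Field names (service, event, trace_id) are illustrative conventions.
    """
    record = {
        "ts": time.time(),      # numeric timestamp for easy sorting
        "service": service,
        "event": event,
        "trace_id": trace_id,   # lets downstream tools join related events
        **fields,
    }
    print(json.dumps(record, sort_keys=True))
    return record

rec = log_event("scheduler", "container_restarted", "abc123", node="host-7")
```

Because every line is valid JSON with stable keys, a log pipeline can filter by `service` or follow a `trace_id` across layers without fragile text parsing.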
Getting real-time analytics for device, application, and business monitoring from trillions of events and petabytes of data, as companies like Netflix, Uber, Alibaba, PayPal, eBay, and Metamarkets do.
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
- IoT devices generate large streams of data that need to be collected and processed in real-time. MQTT and Kafka are common protocols for collecting IoT data streams. MQTT is lightweight but lacks scalability while Kafka is highly scalable.
- Stream processing platforms like Flink, Storm and Spark can be used to analyze the IoT data streams. Flink supports both batch and stream processing while Storm is best for low-latency streaming. Spark is better for machine learning on streams.
- An example use case is real-time equipment monitoring in a factory where IoT sensors stream data to Kafka, which is then processed by Flink to detect abnormalities and enable predictive maintenance. Performance is evaluated based on latency and throughput.
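The abnormality detection in that use case can be illustrated with a toy version of the logic a Flink job might run over the Kafka stream: flag a reading that far exceeds the mean of a sliding window of recent readings. The window size and threshold factor are made-up values for illustration.

```python
from collections import deque

def detect_anomalies(readings, window: int = 5, factor: float = 2.0):
    """Flag readings more than `factor` times the sliding-window mean.

    Returns (index, value) pairs for each flagged reading.
    """
    recent = deque(maxlen=window)  # last `window` readings
    anomalies = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            if value > factor * mean:
                anomalies.append((i, value))
        recent.append(value)
    return anomalies

# A stable sensor signal with one spike at index 5.
print(detect_anomalies([10, 11, 9, 10, 10, 41, 10]))  # → [(5, 41)]
```

A real deployment would express the same windowed comparison with Flink's keyed windows and emit alerts instead of returning a list.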
This document discusses Infobip's journey towards enabling real-time querying of aggregated data. Initially, Infobip had a monolithic architecture with a single database that became a bottleneck. They introduced multiple databases and microservices but querying spanned databases and results had to be joined. A data warehouse (GREEN) provided reporting but was not real-time. To enable real-time queries, Infobip implemented a lambda architecture using Kafka as the real-time data pipeline and Druid for real-time querying and aggregations, achieving sub-second responses and less than 2 seconds of data delay. This allows real-time insights from ingested messaging data while GREEN remains the batch/serving layer.
To understand an application’s performance, first you have to know what to measure. That’s the easy part. How do you take those measurements? Store them? Analyze them? Get them to the people who need them? Well, that’s where things get complicated, especially in the high-traffic distributed systems of the modern web! Like careful scientists, we must observe our subjects without altering them, and we must report our findings quickly so that we have the data necessary to make smart choices about the health and growth of the system.
Let’s explore the lessons learned by engineers at one of the world’s top web companies in their quest to find meaning at 5 MB/s. We’ll discuss the tools and techniques that enable the collection, indexing, and analysis of billions or more datapoints each hour, and learn how these same approaches can empower your applications and your business, no matter the scale.
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
Machine learning is becoming a must for many business domains and applications. H2O is a best-of-breed, open source, distributed machine learning library written in Java. The presentation shows how to create and train machine learning models easily using the H2O Flow web interface, including Deep Learning Neural Networks (DNNs). The session provides a tutorial on how to develop and deploy a fullstack-reactive face recognition demo using a React + RxJS WebSocket front-end, OpenCV, a Caffe CNN for image segmentation, an OpenFace CNN for feature extraction, and H2O Flow for interactive face recognition model training and export as a POJO. The trained POJO model is incorporated in a real-time streaming web service implemented using Spring 5 WebFlux and Spring Boot. The whole demo is 100% Java!
This document provides an introduction to big data and related technologies. It defines big data as datasets that are too large to be processed by traditional methods. The motivation for big data is the massive growth in data volume and variety. Technologies like Hadoop and Spark were developed to process this data across clusters of commodity servers. Hadoop uses HDFS for storage and MapReduce for processing. Spark improves on MapReduce with its use of resilient distributed datasets (RDDs) and lazy evaluation. The document outlines several big data use cases and projects involving areas like radio astronomy, particle physics, and engine sensor data. It also discusses when Hadoop and Spark are suitable technologies.
Sharding is a technique for partitioning and distributing data across multiple servers to enable scaling to large data volumes and workloads. It involves defining a shard key to partition data into chunks that are distributed across shards. The document discusses different types of sharding strategies like range, hash, and tag-aware sharding and how they apply to different use cases around scale, geo-distribution, and hardware optimization. It also covers best practices for building a sharded cluster like pre-splitting data, capacity planning, and using tools like MongoDB Management Service for production operations.
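The range and hash strategies described above differ only in how a shard key maps to a shard. This is a hedged sketch of both routings in miniature; the chunk boundaries and shard count are illustrative, not any particular database's defaults.

```python
import hashlib

def range_shard(key: str, boundaries: list) -> int:
    """Range routing: boundaries are sorted upper bounds, one per chunk.

    Keeps adjacent keys together, which helps range queries but can
    create hot spots on monotonically increasing keys.
    """
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # the last shard holds the tail of the keyspace

def hash_shard(key: str, num_shards: int) -> int:
    """Hashed routing: even distribution at the cost of range locality."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

print(range_shard("m-user", ["g", "n", "t"]))  # falls in the [g, n) chunk
```

Pre-splitting, as the document notes, amounts to choosing those boundary values up front so chunks land on shards before the data arrives.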
Edge computing and the Internet of Things bring great promise, but often just getting data from the edge requires moving mountains. Let's learn how to make edge data ingestion and analytics easier using StreamSets Data Collector edge, an ultralight, platform independent and small-footprint Open Source solution written in Go for streaming data from resource-constrained sensors and personal devices (like medical equipment or smartphones) to Apache Kafka, Amazon Kinesis and many others. This talk includes an overview of the SDC Edge main features, supported protocols and available processors for data transformation, insights on how it solves some challenges of traditional approaches to data ingestion, pipeline design basics, a walk-through some practical applications (Android devices and Raspberry Pi) and its integration with other technologies such as Streamsets Data Collector, Apache Kafka, Apache Hadoop, InfluxDB and Grafana. The goal here is to make attendees ready to quickly become IoT data intake and SDC Edge Ninjas.
Speaker
Guglielmo Iozzia, Big Data Delivery Manager, Optum (United Health)
Evolution from EDA to Data Mesh: Data in Motionconfluent
Thoughtworks' Zhamak Dehghani's observations on these traditional approaches' failure modes inspired her to develop an alternative big data management architecture that she aptly named the Data Mesh. It represents a paradigm shift that draws from modern distributed architecture and is founded on the principles of domain-driven design, a self-serve platform, and product thinking applied to data. Over the last decade, Apache Kafka has established a new category of data management infrastructure for data in motion that has been leveraged in modern distributed data architectures.
This document provides an overview and introduction to reactive robotics and the Internet of Things (IoT). It discusses several key concepts including reactive programming, functional reactive programming, and high-performance reactive Java. It also covers topics like concurrency, parallelism, queues, and the LMAX Disruptor design pattern. Code examples are provided to demonstrate reactive programming concepts using tools like RxJava. The document aims to explain reactive approaches that can help address complexity in robotics and IoT systems.
Stream Processing – Concepts and FrameworksGuido Schmutz
More and more data sources today provide a constant stream of data, from IoT devices to social media streams. It is one thing to collect these events at the velocity they arrive without losing a single message; an event hub and a data flow engine can help here. It's another thing to do some (complex) analytics on the data. There is always the option to first store the data in a sink of choice and analyze it later. Storing even a high-volume event stream is feasible and no longer a challenge, but this adds to the end-to-end latency, and it takes minutes if not hours to present results. If you need to react fast, you simply can't afford to store the data first; you need to process it directly on the data stream. This is called Stream Processing or Stream Analytics. In this talk I will present the important concepts a stream processing solution should support, then dive into some of the most popular frameworks available on the market and how they compare.
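Processing directly on the stream, as described above, usually means windowed aggregation. This is a minimal sketch of a tumbling-window count, the simplest such concept; events are (timestamp-in-seconds, key) pairs and the window size is illustrative.

```python
from collections import defaultdict

def tumbling_counts(events, window_secs: int = 60) -> dict:
    """Count events per (window start, key) over fixed, non-overlapping windows.

    Each event falls into exactly one window, keyed by the window's
    start timestamp rounded down to a multiple of window_secs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_secs) * window_secs
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5, "click"), (42, "click"), (61, "click"), (70, "view")]
print(tumbling_counts(events))
# → {(0, 'click'): 2, (60, 'click'): 1, (60, 'view'): 1}
```

Real frameworks add the parts this sketch omits: sliding and session windows, event-time versus processing-time semantics, and handling late-arriving events with watermarks.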
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operated on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple uses cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
This document discusses stream computing and various real-time analytics platforms for processing streaming data. It describes key concepts of stream computing like analyzing data in motion before storing, scaling to process large data volumes, and making faster decisions. Popular open-source platforms are explained briefly, including their architecture and uses - Spark, Storm, Kafka, Flume, and Amazon Kinesis.
Grid computing enables sharing of geographically distributed computing resources through a network. It allows for virtual organizations to collaborate on common goals without central control. The document discusses the types of grid computing including computational, data, and scavenging grids. It also outlines the key components of a grid including protocols, architecture, security, and resource management. Examples of existing grid projects are provided such as SETI@Home, EGEE, and BeINGrid.
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
Think you have big data? What about high availability requirements? At DataDog we process billions of data points every day, including metrics and events, as we help the world monitor their applications and infrastructure. Being the world's monitoring system is a big responsibility, and thanks to Redis we are up to the task. Join us as we discuss how the DataDog team monitors and scales Redis to power our SaaS-based monitoring offering. We will discuss our usage and deployment patterns, as well as dive into monitoring best practices for production Redis workloads.
This document discusses big data, including the large amounts of data being collected daily, challenges with traditional DBMS solutions, the need for new approaches like Hadoop and Aster Data to handle large volumes of structured and unstructured data, techniques for analyzing big data, and case studies of companies like Mobclix and Yahoo using big data solutions.
Similar to HathiTrust Research Center: The Fast Version (20)
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...Robert H. McDonald
The presentation provided an overview of the HathiTrust Research Center (HTRC) and its services. HTRC provides access to over 13 million digitized book volumes and facilitates text mining and analysis through its extracted features dataset, data capsule, and other tools. It discussed challenges of text mining copyrighted works and demonstrated use cases using distant reading techniques. HTRC also works on outreach, education, and developing new interfaces and tools to enable scholarly research using its collections and infrastructure.
This document provides an agenda and information about a tutorial on topic exploration using the HathiTrust Research Center (HTRC) Data Capsule. The agenda includes an overview of HTRC, an introduction to the Data Capsule and topic modeling, and hands-on sessions. Information is also provided about HTRC, including its mission to enable non-consumptive research on HathiTrust's digital library, its organizational structure, goals for the future, and important URLs.
The HathiTrust Research Center: An Overview of Advanced Computational ServicesRobert H. McDonald
These are my slides from the DPLAFest 2015 held in Indianapolis, IN on 04/17/2015-04/18/2015.
For more see - https://dplafest2015.sched.org/event/a1cfbaca67fd71a2409d28d9b27b1351
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterRobert H. McDonald
This document summarizes a presentation about scaling storage for the HathiTrust Research Center. The HTRC is a collaborative research center between Indiana University and University of Illinois that enables text data mining of the HathiTrust Digital Library. It discusses the mission and goals of HTRC, its partnerships with HathiTrust universities, and the services and tools it provides researchers. It also outlines the large amount of content in HathiTrust, HTRC's non-consumptive research paradigm, and its data and storage architecture to support terabyte-scale analysis of public domain and in-copyright texts.
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...Robert H. McDonald
This is the slide deck for my ACRL 2015 TechConnect presentation with Nicole Vasilevsky (OHSU). For more on the program see http://bit.ly/1xcQbCr.
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkRobert H. McDonald
This is the presentation on the HTRC given at the Indiana University booth at Supercomputing 2014 by Beth Plale - Co-Director HTRC and Robert McDonald - HTRC Executive Management Group.
This is the slide deck for the presentation that was given with Kate Lawrence (VP User Experience EBSCO), Courtney McDonald (Indiana University), and Esther Onega (University of Virginia) at the 2014 Charleston Conference on Thursday Nov 6, 2014.
Kuali OLE is an open source library services platform developed by librarians for flexibility and integration. It has 66 members from 10 institutions and is funded by partners and the Mellon Foundation. The platform has four modules and provides selection/acquisition, ERM and linked data functionality. It offers hosted, local or hybrid implementation options and seeks to expand consortial support and full ERM functions.
Charleston Seminar Being Earnest with our Collections - Legacy to CloudRobert H. McDonald
These are my slides for the 2014 Charleston Conference Seminar, "Being Earnest with our Collections," that I presented with Jill Grogg on moving libraries to the cloud.
The HathiTrust Research Center (HTRC): An Overview and DemoRobert H. McDonald
The session will provide an overview of the HathiTrust Research Center including its mission and current status. It will also include a demonstration of current HTRC phase one technology and services. Additionally, the speakers will address the HTRC's role in supporting humanities research at scale.
SEAD is a NSF DataNet project that aims to provide cyberinfrastructure for long tail data in sustainability science research. It develops tools for active and social curation of data including an Active Curation Repository (ACR) and VIVO profiles. It also creates a Virtual Archive to facilitate long-term access and preservation of datasets across multiple institutional repositories. The presentation provides an overview of SEAD's approach and highlights pilots with the National Center for Earth Surface Dynamics, including ingesting their data collections into the ACR and Virtual Archive and building a social network in VIVO.
New Perspectives for Business Intelligence: Library and Research Technologies...Robert H. McDonald
This is our presentation for Educause 2012 entitled New Perspectives for Business Intelligence: Library and Research Technologies and Research Collaboration for New Data Models held on Nov 8, 2012.
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...Robert H. McDonald
This document summarizes a presentation about Kuali OLE, an open source library management system created through collaboration between multiple universities. It describes the journey to create a collaborative community to develop the system, including establishing functional councils, technical architecture choices, and community organization. It also discusses plans for deployment, creating an ecosystem of vendors, investing in the community, and expanding globally.
GOKb & KB+: An International Partnership to leverage Open Access and Communit...Robert H. McDonald
GOKb & KB+ is an international partnership between Kuali OLE and JISC to leverage open access and community participation to enhance eContent metadata. The partnership aims to create a freely available global open knowledgebase (GOKb) of publication information about electronic resources. GOKb will integrate with Kuali OLE and JISC's Knowledge Base+ to reduce duplication of effort and improve the sustainability and quality of metadata.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
It describes the bony anatomy of the hip, including the femoral head, acetabulum, and labrum, and also discusses the capsule and ligaments. The muscles that act on the hip joint and its range of motion are outlined, and factors affecting hip joint stability and weight transmission through the joint are summarized.
Bangladesh Economic Review 2024 [Bangladesh Economic Review 2024 Bangla.pdf]: a complete Bangla e-book/PDF for computer, tablet, and smartphone, with a table of contents plus bookmark and hyperlink menus included.
A very important book for all of us: a key topic for BCS, bank, and university admission exams and for any competitive exam. It also contains the latest data and information about Bangladesh.
As a citizen, you should know this information.
Useful for the BCS and bank written exams, and also very helpful for secondary and higher-secondary students.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
1. HathiTrust Research Center
The Fast Version
Robert H. McDonald | @mcdonald
Executive Committee-HathiTrust Research Center (HTRC)
Deputy Director-Data to Insight Center
Associate Dean-University Libraries
Indiana University
2. HTRC Mission
The HathiTrust Research Center (HTRC) is a collaborative research center launched jointly by Indiana University and the University of Illinois to act as the public-facing research arm of the massive HathiTrust Digital Library. The HTRC is mandated to help researchers from around the world surmount the difficulties associated with processing and analyzing terascale amounts of digital text. Thus, the scholarly developers at HTRC work to develop cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge. HTRC began its efforts in July 2011.
3. HTRC Non-Consumptive Research Paradigm
• No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the collection of copyrighted works to reassemble pages from the collection.
• The definition disallows collusion between users, or the accumulation of material over time. It differentiates the human researcher from a proxy, which is not a user. Users are human beings.
4. HTRC Current Infrastructure
• Servers
– 14 production-level quad-core servers (virtual machines)
• 16–32 GB of memory
• 250–500 GB of local disk each
– 6-node Cassandra cluster for volume store
– Ingest service and secure Data API access point
• Storage (IU University Infrastructure)
– 13 TB of 15,000 RPM SAS disk storage
– Increase up to 17 TB by end of 2012
– 500 TB available in late year 2–year 3
5. HTRC Architecture
[Architecture diagram] Components shown: portal access (Blacklight); an agent handling job submission and collection building; direct programmatic access by programs running on HTRC machines; security (OAuth2); the Data API access interface with a Solr proxy; a registry (WSO2) with auditing; Meandre algorithms and workflows; a Cassandra cluster serving as the volume store; result sets, collections, and a Solr index; and the underlying compute and storage resources.
7. Contact Information
• Robert H. McDonald
– Email: robert@indiana.edu
– Chat: rhmcdonald on googletalk | skype
– Twitter: @mcdonald
– Blog: http://www.rmcdonald.net
– Twitter hashtag: #HTRC12 http://slidesha.re/QCOrIX
– Web: http://www.hathitrust.org/htrc
Editor's Notes
Registry – the agent can deploy any service listed in this diagram and can run it with the computational resources. The original plan is to use XSEDE – not using this on the IIS machine, but are using ODIN (a 128-node cluster; each node has 4 GB of memory and 4 computation cores) and smoketree (a D2I server with 24 physical cores, 48 logical cores, and 128 GB of memory) – these are not long term, just in use for now.