This presentation addresses the challenge of processing big data in a cloud-based data repository. Using the Hydra Project's Hydra and Sufia Ruby gems, and working with the Hydra community, we created a dedicated repository for the project and set up background jobs. Our approach is to create the metadata with these jobs, which are distributed across multiple computing cores. This allows us to scale our infrastructure out on an as-needed basis and decouples automatic metadata creation from the response times seen by the user. While the metadata is not immediately available after ingestion, the object itself is. By distributing the jobs, we can compute complex properties without impacting the repository server. Hydra and Sufia gave us a head start by providing a simple self-deposit repository, complete with background job support via Redis and Resque.
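Below is a minimal sketch of the kind of Resque background job this approach relies on; the class name, queue name, and placeholder body are illustrative assumptions for this write-up, not Sufia's actual job API.

```ruby
require 'resque'

# Illustrative job: derive metadata for a newly ingested object off the web
# process, so ingest can return to the user before characterization finishes.
class CharacterizeJob
  @queue = :characterize   # Resque reads the target queue from this instance variable

  # Resque workers call `perform` with the arguments given at enqueue time.
  def self.perform(object_id)
    # Placeholder for fetching the object and computing its derived properties.
    puts "characterizing #{object_id}"
  end
end

# Enqueued right after ingest (requires a running Redis); any idle worker on
# any machine pointed at the same Redis may pick the job up.
Resque.enqueue(CharacterizeJob, 'vt:12345')
```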
The Lambda Architecture was implemented at Mayo Clinic to optimize an existing natural language processing pipeline and replace a free-text search facility for colorectal cancer. The architecture uses Storm for real-time processing of up to 1.5 million documents per hour with average latency of 60 milliseconds. It provides a foundation for event-based, real-time, and batch processing as well as data discovery and analytics delivery. The implementation delivers operational benefits like faster annotations and search capabilities.
Starfish: A Self-Tuning System for Big Data Analytics (sai Pramoda)
Starfish is a self-tuning system for improving performance in Hadoop big data analytics. It collects execution profiles from Hadoop clusters, then uses a what-if engine and optimizers to search for and estimate the impact of different tuning configurations on jobs, workflows, and workloads. The goal of Starfish is to enable users and applications to get good performance automatically throughout the data lifecycle in Hadoop.
Big Data LDN 2016: Out of the Data Warehouses, and into the Data Lakes and St... (Matt Stubbs)
This document discusses using data streams and lakes for big data analytics in utilities. It begins by explaining how smart meter rollouts have increased meter reads from 80 million to 350 billion per year. Traditional data warehousing is challenged by the time criticality, resource demands, and scale of this unbounded data. The document then introduces data streams for real-time analytics and data lakes for flexible storage of huge amounts of structured, semi-structured, and unstructured data. It describes Valo, an open lambda architecture that uses multiple repositories and stream processing to enable both real-time and historical analysis across disparate data sources.
This presentation provides an overview of big data open source technologies. It defines big data as large amounts of data from various sources in different formats that traditional databases cannot handle. It discusses that big data technologies are needed to analyze and extract information from extremely large and complex data sets. The top technologies are divided into data storage, analytics, mining and visualization. Several prominent open source technologies are described for each category, including Apache Hadoop, Cassandra, MongoDB, Apache Spark, Presto and ElasticSearch. The presentation provides details on what each technology is used for and its history.
Visualizing Austin's data with Elasticsearch and Kibana (ObjectRocket)
This document provides an introduction to Elasticsearch and Kibana. It describes what Elasticsearch is and how it can scale to handle large amounts of data and queries. It also describes Kibana and how it is used for data visualization. The document then demonstrates how to use Elasticsearch and Kibana together to visualize and analyze Austin transportation and restaurant inspection data.
This document provides an overview of Big Data training. It defines key concepts like volume, velocity, variety and veracity in Big Data. It discusses how Big Data is growing exponentially in terms of content, videos watched, and people online. It then introduces Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key components of Hadoop like HDFS and MapReduce are explained. The document concludes with a discussion of Hadoop distributions and demonstrations of Cloudera, Cassandra and MongoDB.
This document summarizes a presentation about Spring, Querydsl, and MongoDB. It introduces Spring and Spring Data frameworks, which make it easier to build Java applications and access data. It also describes Querydsl, a query building tool that works with Spring Data. The presentation demonstrates how to use Spring Data and Querydsl with MongoDB, a non-relational database, to build applications that can query and retrieve data from MongoDB in a type-safe way. Examples of building queries, entities, and repositories are provided.
This document provides an overview of building a serverless data lake architecture on AWS. It discusses using AWS S3 for storage, AWS Glue for data cataloging and ETL processing, AWS Athena for running SQL queries, and Jupyter Notebooks for exploratory analysis. The full architecture shown brings these services together to allow for ingesting, storing, processing, and analyzing large amounts of data in a serverless and cost-effective manner.
This document discusses big data and Hadoop. It notes that 90% of data created in the last two years is unstructured and difficult to analyze with traditional databases. Hadoop is an open source framework that stores and processes large datasets across clusters of commodity hardware. It works by dividing data into blocks, running map tasks on smaller portions in parallel, and then combining results with reduce tasks.
On Friday, September 25th, Devin Hopps led us through a presentation on an introduction to Big Data and how technology has evolved to harness the power of Big Data.
In this webinar you'll learn about the best practices for Google BigQuery—and how Matillion ETL makes loading your data faster and easier. Find out from our experts how to leverage one of the largest, fastest, and most capable cloud data warehouses to improve your business and save money.
In this webinar:
- Discover how to work fast and efficiently with Google BigQuery
- Find out the best ways to monitor and control costs
- Learn to leverage Matillion ETL and optimize Google BigQuery
- Get tips and tricks for better performance
Presentation at the MOC Workshop, at Boston University.
Cloud Dataverse will be a new service for accessing and processing public data sets in the Massachusetts Open Cloud (MOC). It is based on Dataverse, a popular software framework for sharing, archiving, and analyzing research data. Cloud Dataverse extends Dataverse to replicate datasets from institutional repositories to a cloud-based repository and store their data files in Swift, making data processing faster for in-situ applications running in the cloud.
Cloud Dataverse is a collaborative effort between two open source projects: Massachusetts Open Cloud (MOC) and Dataverse. The Dataverse software is being developed at Harvard's Institute for Quantitative Social Science (IQSS) with contributors worldwide, and there are 21 Dataverse installations. The Harvard Dataverse installation alone hosts more than 60,000 datasets from 300 institutions by 15,000 data authors. The MOC is a collaboration between higher education (BU, NEU, Harvard, MIT and UMass), government, and industry. Its mission is to create a self-sustaining at-scale public cloud based on the Open Cloud eXchange model.
GACS (Global Agricultural Concept Scheme) is a project between FAO, NAL, and CABI to create a common base of agricultural terms by merging and mapping their thesauri. Currently it contains a core of around 13,000 commonly used terms. The goal is for all data repositories to be open and interoperable by linking terms and concepts between different knowledge organization systems using GACS as a common reference point. Ontologies and thesauri can consult GACS when being developed to reuse existing related terms and concepts, and add new terms to GACS if they are commonly needed.
The aim of the webinar is to present the new online service desk, powered by the AKstem service, which facilitates the submission of AGRIS data providers' collections to AGRIS and provides improved interaction with the AGRIS data processing unit. Additionally, in this webinar we present the ways and methods by which AGRIS data providers can contribute their bibliographic information (metadata) to AGRIS. The webinar addresses both new and existing AGRIS data providers, as it will be an opportunity to gain a better understanding of the new service and its functionalities.
Big Data Processing in the Cloud: a Hydra/Sufia Experience
Zhiwu Xie, Ph.D., Associate Professor and Technology Development Librarian, Center for Digital Research and Scholarship University Libraries, Virginia Tech
Big data introduction - Big Data from a Consulting perspective - Sogeti (Edzo Botjes)
Big data introduction - Sogeti - Consulting Services - Business Technology - 20130628 v5
This is a small introduction to the topic of Big Data and a brief vision of how to enable a (big) company to use big data and embed it into the organisation.
Introductory Big Data presentation given during one of our Sizing Servers Lab user group meetings. The presentation is targeted at an audience of about 20 SME employees. It also contains a short description of the work packages for our Big Data project proposal that was submitted in March.
This presentation introduces the concepts of Big Data in layman's language. The author does not claim originality of the content; the presentation was compiled from various sources. The author claims no copyright over the material.
Big data is rising exponentially in today's age of information and digital shrinkage. This presentation aims to clarify the concept and the hype revolving around it.
Author: Stefan Papp, Data Architect at "The unbelievable Machine Company". An overview of Big Data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources. It explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It discusses issues in machine learning like overfitting and underfitting data and the importance of testing algorithms. The document concludes that machine learning has vast potential, but that realizing that potential is very difficult, as it requires strong mathematics skills.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
(STG308) How EA, State Of Texas & H3 Biomedicine Protect Data (Amazon Web Services)
In this session, learn how enterprise customers use AWS storage services to address different storage requirements. Learn how Electronic Arts and H3 Biomedicine manage their data flow from on-premises systems to the cloud, giving them a centralized build system and storage flexibility by leveraging enterprise storage gateways. The State of Texas uses AWS and partner solutions to modernize and secure their office file services, and backup and recovery systems, achieving dramatic savings and productivity gains without compromising IT efficiency.
Three Steps to Modern Media Asset Management with Active Archive (Avere Systems)
This document discusses a three step approach to modern media asset management with an active archive:
1) Using object storage like Cleversafe for scalable, low-cost archive storage that is geo-dispersed for resilience.
2) Making the archive easily accessible using tools like Avere to provide NAS simplicity and performance.
3) Managing large quantities of media assets using asset management tools like CatDV for ingest, metadata, search, collaboration and workflows.
This document discusses how big data assumptions and requirements have changed dramatically, necessitating an evolution in big data solutions. Specifically, it notes that big data now needs to address volume, velocity, and variety as well as real-time response. It also must run over virtualized cloud infrastructure while providing availability, security, and efficiency. The document recommends that big data solutions use infinitely scalable, high-performance data lakes rather than directly attached storage, as well as technologies like containers, network virtualization, and automated deployment and operation. It positions OpenStack as well-suited for big data given its ability to address these needs through integrated services for shared storage, deployment, job scheduling, and more.
Data Pipelines with Spark & DataStax Enterprise (DataStax)
This document discusses building data pipelines for both static and streaming data using Apache Spark and DataStax Enterprise (DSE). For static data, it recommends using optimized data storage formats, distributed and scalable technologies like Spark, interactive analysis tools like notebooks, and DSE for persistent storage. For streaming data, it recommends using scalable distributed technologies, Kafka to decouple producers and consumers, and DSE for real-time analytics and persistent storage across datacenters.
A presentation discussing how to deploy big data solutions, and the difference between structured reporting systems, which feed business processes, and data science systems, which do the cool stuff.
20160331 sa introduction to big data pipelining berlin meetup 0.3 (Simon Ambridge)
This document discusses building data pipelines with Apache Spark and DataStax Enterprise (DSE) for both static and real-time data. It describes how DSE provides a scalable, fault-tolerant platform for distributed data storage with Cassandra and real-time analytics with Spark. It also discusses using Kafka as a messaging queue for streaming data and processing it with Spark. The document provides examples of using notebooks, Parquet, and Akka for building pipelines to handle both large static datasets and fast, real-time streaming data sources.
Harness the Power of Data in a Big Data Lake discusses strategies for ingesting and processing data in a data lake. It describes how to design a data ingestion framework that accounts for factors like data format, source, size, and location. The document contrasts ETL vs ELT approaches and discusses techniques for batched and change data capture ingestion of both structured and unstructured data. It also provides an overview of tools like Sqoop that can be used to ingest data from relational databases into a data lake.
Don't Be Scared. Data Don't Bite. Introduction to Big Data. (KGMGROUP)
This document provides an introduction to big data, including definitions, characteristics, examples, and challenges. It defines big data as high-volume, high-velocity, and high-variety information assets that require new processing methods. Examples discussed include the Sloan Digital Sky Survey, Human Genome Project, and Large Hadron Collider experiments. Challenges of big data include storage, networking, data integrity, and the need for new technologies to handle the volume, velocity and variety. Emerging solutions involve distributed storage, local computation near data, and frameworks like Hadoop and MapReduce.
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310)... (Amazon Web Services)
"In this talk, hear about two high-performant research services developed and operated by the Computation Institute at the University of Chicago running on AWS. Globus.org, a high-performance, reliable, robust file transfer service, has over 10,000 registered users who have moved over 25 petabytes of data using the service. The Globus service is operated entirely on AWS, leveraging Amazon EC2, Amazon EBS, Amazon S3, Amazon SES, Amazon SNS, etc. Globus Genomics is an end-to-end next-gen sequencing analysis service with state-of-art research data management capabilities. Globus Genomics uses Amazon EC2 for scaling out analysis, Amazon EBS for persistent storage, and Amazon S3 for archival storage. Attend this session to learn how to move data quickly at any scale as well as how to use genomic analysis tools and pipelines for next generation sequencers using Globus on AWS.
"
This document summarizes a presentation about providing next-generation sequencing analysis capabilities using Globus Genomics. It outlines challenges with current manual approaches to sequencing data analysis, including difficulties moving large datasets between locations and maintaining complex analysis scripts. The presentation introduces Globus Genomics, which uses Globus data transfer services integrated with Galaxy to provide a workflow-based system for sequencing analysis without requiring local installation or configuration. Key benefits include on-demand access to scalable cloud resources, ability to easily modify and reuse analysis workflows, and integration with data sources. The system aims to accelerate genomic research by automating and simplifying analysis.
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost (Zilliz)
If you are building a RAG application that serves millions of users, you should consider how to scale your system seamlessly and cost-efficiently. The Zilliz Serverless tier represents a significant innovation in the field of vector search, enabling you to rapidly scale to millions of tenants and billions of vectors, while fully leveraging the hot/cold characteristics across tenants to reduce data storage costs. It enables vector storage at costs comparable to S3 and facilitates vector search times in the hundreds of milliseconds for tens of millions of data points!
In this talk, we will delve into the implementation details, usage patterns, and performance metrics of Zilliz Serverless. We will discuss how it empowers AI-native applications to achieve rapid business growth by providing a cost-effective and scalable vector storage and search solution.
Cassandra is used for real-time bidding in online advertising. It processes billions of bid requests per day with low latency requirements. Segment data, which assigns product or service affinity to user groups, is stored in Cassandra to reduce calculations and allow users to be bid on sooner. Tuning the cache size and understanding the active dataset helps optimize performance.
In the past few years, the term "data lake" has leaked into our lexicon. But what exactly IS a data lake? Some IT managers confuse data lakes with data warehouses. Some people think data lakes replace data warehouses. Both of these conclusions are false. There is room in your data architecture for both data lakes and data warehouses. They have different use cases, and those use cases can be complementary.
Todd Reichmuth, Solutions Engineer with Snowflake Computing, has spent the past 18 years in the world of Data Warehousing and Big Data. He spent that time at Netezza and later at IBM Data, before making the jump to the cloud at Snowflake Computing earlier in 2018.
Mike Myer, Sales Director with Snowflake Computing, has spent the past 6 years in the world of Security and is looking to drive awareness of better Data Warehousing and Big Data solutions. He was previously at local tech companies FireMon and Lockpath and decided to join Snowflake because of its disruptive technology that is truly helping folks in the Big Data world on a day-to-day basis.
This document provides an introduction to big data, including:
- Big data is characterized by its volume, velocity, and variety, which makes it difficult to process using traditional databases and requires new technologies.
- Technologies like Hadoop, MongoDB, and cloud platforms from Google and Amazon can provide scalable storage and processing of big data.
- Examples of how big data is used include analyzing social media and search data to gain insights, enabling personalized experiences and targeted advertising.
- As data volumes continue growing exponentially from sources like sensors, simulations, and digital media, new tools and approaches are needed to effectively analyze and make sense of "big data".
This document summarizes research on trade-offs in data integration systems. It discusses three main contributions:
1. A method to estimate response freshness using existing data summaries, which was able to estimate freshness with 6% error.
2. A maintenance process to maximize consistency under latency constraints by querying cached entries and maintaining stale or slowly changing entries. This outperformed baseline policies.
3. An extension of the maintenance policy to consider both latency and space constraints, including cache replacement policies. This outperformed state-of-the-art replacement policies when implemented in CSPARQL.
The document concludes that balancing latency and consistency in data integration is challenging due to their trade-off relationship, and discusses
Infofarm provides data science and artificial intelligence services, including building and maintaining big data architectures using Apache Spark and Hadoop. They help organizations leverage data through training and workshops on data science techniques. Their passion is to extract business value from data by ingesting it from various sources into a data lake, processing it to generate information, and harvesting the value through use cases like personalization. A data lake involves storing raw and processed data in a file system for querying, while use cases may involve predictive analytics using the processed data. Infofarm can help organizations address challenges like data governance for GDPR through architecture best practices.
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017 (AWS Chicago)
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
7. DATA SHARING
• Encourage exploratory and multidisciplinary research
• Foster open and inclusive communities around
  • modeling of dynamic systems
  • structural health monitoring and damage detection
  • occupancy studies
  • sensor evaluation
  • data fusion
  • energy reduction
  • evacuation management
  • …
9. COMPUTE INTENSIVE
• About 6GB raw data per hour
• Must be continuously processed, ingested, and further processed
• User-generated computations
• Must not interfere with data retrieval
10. STORAGE INTENSIVE
• SEB will accumulate about 60TB of raw data per year
• To facilitate researchers, we must keep raw data for an extended period of time, e.g., >= 5 years
• VT currently does not have an affordable storage facility to hold this much data
• Within XSEDE, only TACC's Ranch can allocate this much storage
The work reported here is a collaboration between the University Libraries’ Center for Digital Research and Scholarship and the Smart Infrastructure Laboratory at Virginia Tech.
The project centers around the Virginia Tech Signature Engineering Building, or SEB.
This new, one-hundred-and-sixty-thousand square-foot building will house a portion of Virginia Tech’s College of Engineering.
The Smart Infrastructure Laboratory, or VT-SIL, also wants to turn this building into a full-scale living laboratory.
Which is why during the construction, VT-SIL mounted over two hundred and forty vibration-monitoring accelerometers and hundreds of temperature, air flow, and other sensors, in one hundred and thirty six different locations throughout the building.
Upon completion, the SEB will be the most instrumented building for vibrations in the world.
VT-SIL will utilize the collected data to improve the design, monitoring, and daily operation of civil and mechanical infrastructure.
The data will also be used to investigate how humans interact with the built environment.
Moreover, VT-SIL wants to openly share much of the data with the public.
The objective is to encourage exploratory and multidisciplinary research, and to foster an open and inclusive community of researchers and educators.
The VT library’s involvement in this project focuses on data sharing and reuse, in particular, how to make the process more effective and efficient.
This is a big data problem that presents many distinctive challenges.
Now let’s step back a little bit. Forget the specific nature of the data and instead focus on the more abstract but also more generalizable characteristics of the problem we face.
We believe there are at least five distinct characteristics that separate this problem from many other data related projects done in libraries, and we believe similar characteristics will be seen more and more often as libraries are involved in more data intensive research.
First, big data problems require intensive computing power. Take SEB data as an example: the SEB generates about six gigabytes of raw data per hour.
This may not sound like much, but realize that we may need to do complicated processing to transform the raw data, to ingest it into the repository, and to extract various metadata and features. All while the data keeps pouring in.
As the data grows larger, fewer end users will have the resources to process it, and will naturally expect us to do at least some preliminary processing for them.
For example, seismologists researching earthquakes will only be interested in the portion of the data that involves earthquakes. These researchers will want us to identify the earthquake data segments for them, instead of downloading many years worth of data archives just to figure it out by themselves. Such user-generated computations will demand even more processing power.
Also, processing new data must not interfere with serving the ingested data.
Big data also poses a storage challenge.
For example, the SEB will accumulate roughly sixty terabytes of raw data each year.
In order to facilitate multidisciplinary research, for example to detect structural deterioration over time, we must keep raw data for an extended period of time, e.g., at least five years.
VT does not currently have an affordable storage facility to hold this much data. Even for universities that have already built massive storage systems, sharing data across institutional boundaries is still very problematic.
Now let’s take a look at the existing national R&D infrastructure.
XSEDE, the consortium including all NSF funded supercomputer centers, has a list of storage allocations. From the list we can easily figure out that the Texas Advanced Computer Center’s Ranch is the only storage system that can allocate sufficient long-term storage for the SEB project. But getting the allocation approved isn’t easy.
Of course big data also poses the challenge of big data transfer.
Even if we don't have to pay for the bandwidth, imagine how crowded the network will be if hundreds of researchers around the world each try to download hundreds of terabytes of data from us. It's not very practical. It will take weeks, if not months, to move the data sets around. Is it really worth the trouble?
A more efficient and effective way to deal with this problem is to help the researchers reduce the data to more manageable sizes before sharing. But this, again, goes back to the first challenge of user-generated computation load.
We also predict much of the data processing will be on-demand.
This is because exploratory and multidisciplinary research cannot predict data usage beforehand.
New ideas will pop up from time to time that will require the data being manipulated in totally different ways from before.
And it will be very hard to predict how much processing power is enough.
All this leads to the fifth challenge: how can this scale?
We believe the cloud is a viable, and for now, probably the only feasible solution to move forward.
The cloud is affordable, can cope with on-demand workloads, and scales well without the high initial investment in hardware.
Bandwidth cost is the major drawback, which we hope to mitigate by processing the data where it is stored.
Those characteristics became framework requirements. The chosen framework needed to mix local and remote content…
… support background processing…
…and be distributable.
Let’s start with mixing local and remote content. This supports the storage intensive characteristic. If we can’t store data remotely, we can’t store all the data.
So, instead of keeping everything locally…
…we keep a pointer to the remote file. In effect, we are keeping a way of getting the remote data.
This is another way of looking at it. The local repository is pointing to the data somewhere in Amazon.
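As a rough illustration of the pointer idea (hypothetical names throughout, not Sufia's or Fedora's actual external-content API), the repository record can hold a URL instead of the bytes and resolve it only when the content is requested:

```ruby
require 'open-uri'

# Hypothetical repository object that mixes local and remote content.
class DataFile
  attr_accessor :label, :local_path, :remote_url

  # Return the bytes, resolving the remote pointer on demand.
  def content
    if local_path
      File.read(local_path)          # small file kept in the local repository
    else
      URI.open(remote_url, &:read)   # large file kept remotely, e.g. in Amazon S3
    end
  end
end

seb_hour = DataFile.new
seb_hour.label      = 'SEB accelerometer data, one hour'
seb_hour.remote_url = 'https://s3.amazonaws.com/example-bucket/seb/hour-0001.h5'  # illustrative URL
```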
Next, the framework needs to be able to process data asynchronously in the background. This helps fulfill the compute intensive characteristic.
Here, the workers on the right are the important bit. They're going to do all the data processing for us.
Now, I’m going to show a quick demonstration of the workers and the queuing system. Here’s some data we’re going to be working with.
Some of the data is queued up into three queues. Some of the data is in multiple queues, and some is just in one. The queues here represent different kinds of processing that the workers will do.
And here’s our worker.
Here it’s picking up its first job off a queue. Which queue it chooses depends on how the worker was created. It may prefer or avoid certain queues.
Now it has the data, and is ready to work.
So it works, and creates the new metadata, and updates the item in the database.
We’re back to the beginning.
Choose a queue…
… pick up data…
… and process.
Repeat.
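In plain Resque terms, the worker in this walk-through can be sketched roughly as follows (the queue names are illustrative); in practice the same loop is usually started with Resque's rake task, e.g. QUEUE=checksum,characterize rake resque:work, on each worker machine.

```ruby
require 'resque'

# Watch three queues, in order of preference; the queues a worker is given at
# startup are how it comes to "prefer or avoid" certain kinds of jobs.
worker = Resque::Worker.new(:checksum, :characterize, :thumbnail)

# Poll every 5 seconds: choose a queue, pick up a job, perform it, repeat.
worker.work(5)
```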
These screens are pulled from the demo application I created. Here’s what it looks like with nothing going on. Nothing in the queues (on the side), and no workers running.
Now we’re working! There are plenty of jobs queued up to keep the one worker busy. Unfortunately, trying to do all this data crunching on a single server will bog down all the other tasks the server is trying to do, like serve web pages.
So, background workers speed up the server by allowing web pages to be served while work is going on, but they still slow the server down, as the hardware has limits. In short, this won’t scale.
But if we can distribute the workload to multiple servers, we can get the work done faster, with less impact to our patrons. This meets the scalability characteristic.
Let’s visit our worker again. It used to be able to keep up with the jobs as they came in.
But now it's overwhelmed. In our case, six gigabytes of data per hour will do that.
So we start up new workers on new hardware to help. But we're not going to buy more hardware! We're already using Amazon for storage; they can handle our hardware too.
The load on our system is going to change, though, and we’re going to want more and more workers to deal with longer and longer queues.
Now that they are not on our public server, this is easier to accommodate.
And since Amazon still charges us for idle workers, we wind them down if demand tapers off.
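A minimal sketch of that scale-up/scale-down decision, assuming Resque's queue-introspection calls (Resque.size, Resque.workers); the queue names, the threshold, and the placeholder messages standing in for actual EC2 provisioning calls are all assumptions.

```ruby
require 'resque'

QUEUES          = %w[checksum characterize thumbnail]  # illustrative queue names
JOBS_PER_WORKER = 50                                    # illustrative threshold

backlog = QUEUES.sum { |q| Resque.size(q) }  # jobs still waiting, across all queues
running = Resque.workers.size                # workers currently registered
wanted  = (backlog / JOBS_PER_WORKER.to_f).ceil

if wanted > running
  # Here we would ask the cloud provider (e.g. the EC2 API) for more worker instances.
  puts "scale up: start #{wanted - running} more worker(s)"
elsif wanted < running
  # Idle workers cost money, so wind the extras down when the backlog shrinks.
  puts "scale down: retire #{running - wanted} worker(s)"
end
```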
In our demo, it looks like this. Here’s the one worker from before.
Now we’ve scaled up, and the average time spent in a queue is falling.
Sufia checks two of our framework requirements out of the box.
Fedora lets us mix local and remote content, and Resque gives us background processing.