BIG DATA
PROCESSING
IN THE CLOUD:
A HYDRA/SUFIA
EXPERIENCE
Helsinki
June 2014
Collin Brittle
Zhiwu Xie
WHO?
WHAT?
WHY?
SENSORS
SMART INFRASTRUCTURE
DATA SHARING
• Encourage exploratory and multidisciplinary
research
• Foster open and inclusive communities around
    • modeling of dynamic systems
    • structural health monitoring and damage detection
    • occupancy studies
    • sensor evaluation
    • data fusion
    • energy reduction
    • evacuation management
    • …
CHARACTERIZATION
• Compute intensive
• Storage intensive
• Communication intensive
• On-demand
• Scalability challenge
COMPUTE INTENSIVE
• About 6GB raw data per hour
• Must be continuously processed,
ingested, and further processed
• User-generated computations
• Must not interfere with data retrieval
STORAGE INTENSIVE
• SEB will accumulate about 60TB of raw data
per year
• To facilitate researchers, we must keep raw
data for an extended period of time, e.g.,
>= 5 years
• VT currently does not have an affordable
storage facility to hold this much data
• Within XSEDE, only TACC’s Ranch can
allocate this much storage
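The storage figures above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a constant rate of 6 GB per hour, a 365-day year, and decimal units (1 TB = 1000 GB); those assumptions are for illustration only and are not stated in the slides.

```python
# Back-of-the-envelope storage estimate for the SEB sensor data.
# Assumptions (not from the slides): constant ingest rate, 365-day year,
# decimal units (1 TB = 1000 GB).

GB_PER_HOUR = 6            # raw data rate quoted on the slide
HOURS_PER_YEAR = 24 * 365
RETENTION_YEARS = 5        # minimum retention period quoted on the slide


def terabytes_per_year(gb_per_hour: float) -> float:
    """Convert an hourly rate in GB to a yearly volume in TB."""
    return gb_per_hour * HOURS_PER_YEAR / 1000


def retained_terabytes(gb_per_hour: float, years: int) -> float:
    """Total raw data held after `years` of continuous collection."""
    return terabytes_per_year(gb_per_hour) * years


if __name__ == "__main__":
    print(f"{terabytes_per_year(GB_PER_HOUR):.1f} TB per year")
    print(f"{retained_terabytes(GB_PER_HOUR, RETENTION_YEARS):.1f} TB "
          f"over {RETENTION_YEARS} years")
```

Under these assumptions the rate works out to roughly 52.6 TB per year, in line with the "about 60TB" on the slide, and to roughly 263 TB over five years.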
COMMUNICATION
INTENSIVE
• What if hundreds of researchers around
the world each tried to download
hundreds of TB of our data?
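To get a feel for the question above, the following sketch estimates how long one bulk download would take. The link speeds are hypothetical, and the calculation assumes a fully sustained rate with no protocol overhead and decimal units (1 TB = 10^12 bytes).

```python
# Rough transfer-time estimate for bulk downloads.
# Assumptions (hypothetical): a sustained link speed with no protocol
# overhead, and decimal units (1 TB = 10**12 bytes).


def transfer_days(terabytes: float, gigabits_per_second: float) -> float:
    """Days needed to move `terabytes` over a link sustaining the given rate."""
    bits = terabytes * 10**12 * 8
    seconds = bits / (gigabits_per_second * 10**9)
    return seconds / 86_400


if __name__ == "__main__":
    for tb in (100, 300):
        print(f"{tb} TB at 1 Gbit/s: {transfer_days(tb, 1.0):.1f} days")
```

At a sustained 1 Gbit/s, 100 TB takes a little over nine days for a single researcher; a slower or shared link stretches that proportionally.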
ON DEMAND
• Explorative and multidisciplinary
research cannot predict the data usage
beforehand
SCALABILITY
• How to deal with these challenges in a
scalable manner?
BIG DATA + CLOUD
• Affordable
• Elastic
• Scalable
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
OBJECTS AND
DATASTREAMS
Local Object
Meta Meta File
REMOTE
STORAGE
Local
Repository
EC2 Glacier S3
Amazon
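One way to picture a local object that points at remote content is a record holding small metadata locally and only a URL for the large file. The sketch below is illustrative only: the class names, field names, and example URL are hypothetical and do not reflect the actual Fedora or Sufia data model.

```python
# Minimal sketch of a repository object that mixes local and remote content.
# All class names, field names, and the example URL are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Datastream:
    """One named piece of content attached to an object."""
    name: str
    local_bytes: Optional[bytes] = None   # content held in the local repository
    remote_url: Optional[str] = None      # pointer to content held elsewhere

    def is_remote(self) -> bool:
        return self.local_bytes is None and self.remote_url is not None


@dataclass
class LocalObject:
    """A locally stored object whose large file lives in remote storage."""
    identifier: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add(self, stream: Datastream) -> None:
        self.datastreams[stream.name] = stream

    def locate(self, name: str) -> str:
        """Say where a datastream's content can be fetched from."""
        stream = self.datastreams[name]
        return stream.remote_url if stream.is_remote() else "local"


obj = LocalObject("example:1")
obj.add(Datastream("descMetadata", local_bytes=b"<title>Sensor run</title>"))
obj.add(Datastream("content",
                   remote_url="https://storage.example.org/raw/run-001.dat"))
```

Here the metadata stays in the local repository while the `content` datastream only records where the raw file can be retrieved.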
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
Worker
Worker
Worker
Database
Public
Server
Clients
Redis
BACKGROUND
PROCESSING
FROM QUEUES
TO THE CLOUD
[Diagram sequence: blocks of data wait in queues; a worker picks one up, processes it, and repeats.]
QUEUEING
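The choose-a-queue, pick-up-data, process, repeat cycle can be sketched in a few lines. This is a simplified in-memory illustration, not Resque itself: the queue names, job payloads, and the `Worker` class are hypothetical.

```python
# Sketch of a worker that polls named queues in a preferred order.
# Queue names and job payloads are hypothetical; this is not Resque's API.
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple

Queues = Dict[str, Deque[str]]


class Worker:
    def __init__(self, queue_order: List[str],
                 handler: Callable[[str, str], str]):
        # The order decides which queue is checked first; queues that are
        # not listed are never served by this worker.
        self.queue_order = queue_order
        self.handler = handler

    def reserve(self, queues: Queues) -> Optional[Tuple[str, str]]:
        """Take the next job from the first non-empty preferred queue."""
        for name in self.queue_order:
            if queues.get(name):
                return name, queues[name].popleft()
        return None

    def run(self, queues: Queues) -> List[str]:
        """Choose a queue, pick up data, process, repeat until idle."""
        results = []
        while (job := self.reserve(queues)) is not None:
            queue_name, item = job
            results.append(self.handler(queue_name, item))
        return results


queues: Queues = {
    "characterize": deque(["a", "b"]),
    "derivatives": deque(["a"]),
    "audit": deque(["c"]),
}
worker = Worker(["derivatives", "characterize"], lambda q, item: f"{q}:{item}")
done = worker.run(queues)
```

Because this worker was created without the `audit` queue in its list, that job stays queued for some other worker, which mirrors the idea that a worker may prefer or avoid certain queues.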
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
FROM QUEUES
TO THE CLOUD
[Diagram sequence: jobs pile up in the queues faster than a single worker can process them.]
Database
Public
Server
Clients
Redis
Master
Redis
Slave
Private
Server
Private
Server
Private
Server
DISTRIBUTED
PROCESSING
SCALE UP
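Scaling workers up and down with demand can be reduced to a simple rule based on queue length. The sketch below is one possible rule, not the one used in the demo; the jobs-per-worker ratio and the minimum and maximum worker counts are hypothetical tuning values.

```python
# Sketch of a queue-length-based scaling rule for background workers.
# The thresholds and limits are hypothetical tuning values.
import math


def desired_workers(pending_jobs: int, jobs_per_worker: int = 50,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Return how many workers to run for the current backlog."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(pending_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_action(current: int, pending_jobs: int) -> int:
    """Positive result: start that many workers; negative: stop that many."""
    return desired_workers(pending_jobs) - current
```

For example, `scaling_action(1, 120)` asks for two more workers, while `scaling_action(5, 0)` asks to stop four, so idle cloud workers are not left running.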
WE CHOSE SUFIA
WHAT IS SUFIA?
• Ruby on Rails framework…
• Based on Hydra…
• Using Fedora Commons…
• And Resque
FRAMEWORK
REQUIREMENTS
• Mix local and remote content
• Support background processing
• Be distributable
QUESTIONS?
rotated8 (who works at) vt.edu

Big Data Processing in the Cloud: A Hydra/Sufia Experience

Editor's Notes

  1. The work reported here is a collaboration between the University Libraries’ Center for Digital Research and Scholarship and the Smart Infrastructure Laboratory at Virginia Tech.
  2. The project centers around the Virginia Tech Signature Engineering Building, or SEB.
  3. This new, one-hundred-and-sixty-thousand square-foot building will house a portion of Virginia Tech’s College of Engineering. The Smart Infrastructure Laboratory, or VT-SIL, also wants to turn this building into a full-scale living laboratory.
  4. This is why, during construction, VT-SIL mounted over two hundred and forty vibration-monitoring accelerometers and hundreds of temperature, air flow, and other sensors in one hundred and thirty-six different locations throughout the building. Upon completion, the SEB will be the most instrumented building for vibrations in the world.
  5. VT-SIL will utilize the collected data to improve the design, monitoring, and daily operation of civil and mechanical infrastructure. The data will also be used to investigate how humans interact with the built environment.
  6. Moreover, VT-SIL wants to openly share much of the data with the public. The objective is to encourage exploratory and multidisciplinary research, and to foster an open and inclusive community of researchers and educators. The VT library’s involvement in this project focuses on data sharing and reuse, in particular, how to make the process more effective and efficient. This is a big data problem that presents many distinctive challenges.
  7. Now let’s step back a little bit. Forget the specific nature of the data and instead focus on the more abstract but also more generalizable characteristics of the problem we face. We believe there are at least five distinct characteristics that separate this problem from many other data related projects done in libraries, and we believe similar characteristics will be seen more and more often as libraries are involved in more data intensive research.
  8. First, big data problems require intensive computing power. Take the SEB data as an example: the SEB generates about six gigabytes of raw data per hour. This may not sound like much, but we may need to do complicated processing to transform the raw data, to ingest it into the repository, and to extract various metadata and features, all while the data keeps pouring in. As the data grows larger, fewer end users will have the resources to process it, and they will naturally expect us to do at least some preliminary processing for them. For example, seismologists researching earthquakes will only be interested in the portion of the data that involves earthquakes. These researchers will want us to identify the earthquake data segments for them, instead of downloading many years’ worth of data archives just to find those segments themselves. Such user-generated computations will demand even more processing power. Also, processing new data must not interfere with serving the ingested data.
  9. Big data also poses a storage challenge. For example, the SEB will accumulate roughly sixty terabytes of raw data each year. In order to facilitate multidisciplinary research to detect, for example, structural deterioration over time, we must keep raw data for an extended period of time, e.g., >= 5 years. VT does not currently have an affordable storage facility to hold this much data. Even for universities that have already built massive storage systems, sharing data across institutional boundaries is still very problematic. Now let’s take a look at the existing national R&D infrastructure. XSEDE, the consortium including all NSF-funded supercomputer centers, has a list of storage allocations. From the list we can easily figure out that the Texas Advanced Computing Center’s Ranch is the only storage system that can allocate sufficient long-term storage for the SEB project. But getting the allocation approved isn’t easy.
  10. Of course, big data also poses the challenge of big data transfer. Even if we don’t have to pay for the bandwidth, imagine how crowded the network would be if hundreds of researchers around the world each tried to download hundreds of terabytes of data from us. It’s not very practical. It will take weeks, if not months, to move the data sets around. Is it really worth the trouble? A more efficient and effective way to deal with this problem is to help the researchers reduce the data to more manageable sizes before sharing. But this, again, goes back to the first challenge of user-generated computation load.
  11. We also predict that much of the data processing will be on-demand. This is because explorative and multidisciplinary research cannot predict the data usage beforehand. New ideas will pop up from time to time that will require the data to be manipulated in totally different ways from before. And it will be very hard to predict how much processing power is enough.
  12. All this leads to the fifth challenge. How can this scale?
  13. We believe the cloud is a viable and, for now, probably the only feasible solution to move forward. The cloud is affordable, can cope with on-demand workloads, and scales well without needing a high initial investment in hardware. Bandwidth cost is the major drawback, which we hope to mitigate by processing the data where it is stored.
  14. Those characteristics became framework requirements. The chosen framework needed to mix local and remote content, support background processing, and be distributable.
  15. Let’s start with mixing local and remote content. This supports the storage intensive characteristic. If we can’t store data remotely, we can’t store all the data.
  16. So, instead of keeping everything locally…
  17. …we keep a pointer to the remote file. In effect, we are keeping a way of getting the remote data.
  18. This is another way of looking at it. The local repository is pointing to the data somewhere in Amazon.
  19. Next, the framework needs to be able to process data asynchronously in the background. This helps fulfill the compute intensive characteristic.
  20. Here, the workers on the right are the important bit. They’re going to do all the data processing for us.
  21. Now, I’m going to show a quick demonstration of the workers and the queuing system. Here’s some data we’re going to be working with.
  22. Some of the data is queued up into three queues. Some of the data is in multiple queues, and some is just in one. The queues here represent different kinds of processing that the workers will do.
  23. And here’s our worker.
  24. Here it’s picking up its first job off a queue. Which queue it chooses depends on how the worker was created. It may prefer or avoid certain queues.
  25. Now it has the data, and is ready to work.
  26. So it works, and creates the new metadata, and updates the item in the database.
  27. We’re back to the beginning.
  28. Choose a queue…
  29. … pick up data…
  30. … and process.
  31. Repeat.
  32. These screens are pulled from the demo application I created. Here’s what it looks like with nothing going on. Nothing in the queues (on the side), and no workers running.
  33. Now we’re working! There are plenty of jobs queued up to keep the one worker busy. Unfortunately, trying to do all this data crunching on a single server will bog down all the other tasks the server is trying to do, like serve web pages. So, background workers speed up the server by allowing web pages to be served while work is going on, but they still slow the server down, as the hardware has limits. In short, this won’t scale.
  34. But if we can distribute the workload to multiple servers, we can get the work done faster, with less impact to our patrons. This meets the scalability characteristic.
  35. Let’s visit our worker again. It used to be able to keep up with the jobs as they came in.
  36. But now it’s overwhelmed. In our case, 6 gigabytes of data per hour will do that.
  37. So we start up new workers on new hardware to help. But we’re not going to buy more hardware! We’re already using Amazon for storage, they can handle our hardware too.
  38. The load on our system is going to change, though, and we’re going to want more and more workers to deal with longer and longer queues. Now that they are not on our public server, this is easier to accommodate. And since Amazon still charges us for idle workers, we wind down if demand tapers off.
  39. In our demo, it looks like this. Here’s the one worker from before.
  40. Now we’ve scaled up, and the average time spent in a queue is falling.
  41. Sufia checks two of our framework requirements out of the box. Fedora lets us mix local and remote content, and Resque gives us background processing.