SlideShare a Scribd company logo
Data Intensive Computing
Adam Carter
EPCC, The University of Edinburgh
EUDAT Training Coordinator
APARSEN Advanced Practitioners Course
Glasgow, July 2013
Introduction
• Data Intensive Computing and its Relationship to
Data Curation/Preservation
• The talk is slightly tangential…
– but there are many overlaps in the subjects,
technologies and aims of the data preservation and
data intensive computing/research
• Data is being preserved so that it can be re-used
Data Preservation Lifecycles
• Most data preservation lifecycles include that idea that
data is “unarchived” or “awakened”
• In the future data is likely to be more active more of the
time
– A good thing:
• Active curation
• Annotation, tagging and linking
– A challenge:
• Can best practices of archiving be maintained on a “live” dataset?
• You could certainly still look on this as a different use-
case, but in this case it’s still important to understand
what is going on elsewhere
Data Intensive Computing
Computing applications which devote most of their execution
time to computational requirements are deemed compute-
intensive and typically require small volumes of data, whereas
computing applications which require large volumes of data and
devote most of their processing time to I/O and manipulation of
data are deemed data-intensive. – Wikipedia
• My working definition:
– I/O-bound computations
• Data is (generally) too big to fit in memory
– Efficient disk access is required to get the data to the CPU on time
– Having the data in the right place at the right time is vital
Cluster Computing
Grid Computing
Supercomputing
Cloud Computing
The Role of Data Infrastructures in Data
Intensive Computing
• Traditionally, we bring the data to the compute
• In the future, we’ll want to bring the compute to
the data
– So where is the data?
– More than likely it’s in a repository…
– Maybe an “archive”…
What will the data in the archive look like?
• Files?
• Rows & Tables in a Relational Database?
• Tuples in a Triple Store?
How might you bring in compute?
• As (relational) database queries (SQL)
• As queries against an RDF store (SPARQL)
• As VMs which can mount local disks
• As scripts or executables that you allow a user to
run
• As services that you as a data service offer with
some kind of API (e.g. as a web service)
• These are the approaches that will need to be
offered by “repositories” holding large amounts
of “live” data.
– Many will probably also be relevant to archives
• How can a user get the information back out of
the archive?
– As complete files?
– Over the Internet (and your network!)?
Back to the Compute…
• Need to understand the performance of your
computations and your data transfers
• Do you know how fast your program runs?
• Do you know if it’s spending all of its time on
compute or if it’s spending its time waiting for
data?
• Where is your data bottleneck?
• Benchmarking is key
Amdahl’s Other Laws
• Gene Amdahl’s quantification of the balance
required for data-intensive applications:
– One bit of I/O per compute cycle
– Memory Size (bytes) / instructions per second = 1
– One IOop per 50,000 instructions
A Data Intensive Computer
GPU 4GB DDR3 Memory
Gb Ethernet
USB2
× 120
…or use whole datacentre(s)
Making best use of a machine designed for
data-intensive computing
• Work on streams of data, not files
– Not (so easily) searchable
– Not (so easily) sortable
– Not all programs can benefit from this approach
• and those that can, might require work
• Use multiple threads and asynchronous I/O
• If you’re using files, use a library that does some of
the hard work for you, e.g. MPI-IO
Some Data Intensive Technologies
• MapReduce/Hadoop (described earlier in the
week)
• Storm (http://storm-project.net)
– Low latency
– Real-time
– No writes to disk at intermediate stages
– Reportedly not quite such good scalability in terms of
throughput
…or use whole datacentre(s)
…or the internet?
Conclusions
• Data-Intensive is a new(ish) kind of computing
– necessitated by the huge amounts of data
– and offering new opportunities
• Need to think about new ways of doing computing
– It’s usually parallel computing, but not “traditional HPC”
• Matters for data preservation. Either:
– you’re preserving huge amounts of data that need to be
easily reused
– you need to process large amounts of data to do a
meaningful reduction so that the stored data retains its
value

More Related Content

Similar to DataIntensiveComputing.pdf

Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
Sudarsun Santhiappan
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Data warehouse 26 exploiting parallel technologies
Data warehouse  26 exploiting parallel technologiesData warehouse  26 exploiting parallel technologies
Data warehouse 26 exploiting parallel technologies
Vaibhav Khanna
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
MongoDB
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in Java
Ruben Badaró
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptx
RATISHKUMAR32
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
David Martínez Rego
 
Development of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data GridsDevelopment of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data Grids
jlorenzocima
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
Lucas Jellema
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
Nisha Talagala
 
MongoDB Capacity Planning
MongoDB Capacity PlanningMongoDB Capacity Planning
MongoDB Capacity Planning
Norberto Leite
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
Zohar Elkayam
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
Caserta
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
Ilkay Altintas, Ph.D.
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Vaibhav Jain
 
try
trytry
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
Ahmed Misbah
 

Similar to DataIntensiveComputing.pdf (20)

Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Data warehouse 26 exploiting parallel technologies
Data warehouse  26 exploiting parallel technologiesData warehouse  26 exploiting parallel technologies
Data warehouse 26 exploiting parallel technologies
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in Java
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptx
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Development of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data GridsDevelopment of concurrent services using In-Memory Data Grids
Development of concurrent services using In-Memory Data Grids
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
 
Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
 
MongoDB Capacity Planning
MongoDB Capacity PlanningMongoDB Capacity Planning
MongoDB Capacity Planning
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
try
trytry
try
 
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
 

Recently uploaded

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
Madan Karki
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
jpsjournal1
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
shadow0702a
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
Dr Ramhari Poudyal
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
KrishnaveniKrishnara1
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
Mahmoud Morsy
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
Nada Hikmah
 
Hematology Analyzer Machine - Complete Blood Count
Hematology Analyzer Machine - Complete Blood CountHematology Analyzer Machine - Complete Blood Count
Hematology Analyzer Machine - Complete Blood Count
shahdabdulbaset
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
MIGUELANGEL966976
 
gray level transformation unit 3(image processing))
gray level transformation unit 3(image processing))gray level transformation unit 3(image processing))
gray level transformation unit 3(image processing))
shivani5543
 
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
amsjournal
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 

Recently uploaded (20)

Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
 
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTCHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECT
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Literature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptxLiterature Review Basics and Understanding Reference Management.pptx
Literature Review Basics and Understanding Reference Management.pptx
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
 
Hematology Analyzer Machine - Complete Blood Count
Hematology Analyzer Machine - Complete Blood CountHematology Analyzer Machine - Complete Blood Count
Hematology Analyzer Machine - Complete Blood Count
 
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdfBPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
BPV-GUI-01-Guide-for-ASME-Review-Teams-(General)-10-10-2023.pdf
 
gray level transformation unit 3(image processing))
gray level transformation unit 3(image processing))gray level transformation unit 3(image processing))
gray level transformation unit 3(image processing))
 
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
UNLOCKING HEALTHCARE 4.0: NAVIGATING CRITICAL SUCCESS FACTORS FOR EFFECTIVE I...
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 

DataIntensiveComputing.pdf

  • 1. Data Intensive Computing Adam Carter EPCC, The University of Edinburgh EUDAT Training Coordinator APARSEN Advanced Practitioners Course Glasgow, July 2013
  • 2. Introduction • Data Intensive Computing and its Relationship to Data Curation/Preservation • The talk is slightly tangential… – but there are many overlaps in the subjects, technologies and aims of the data preservation and data intensive computing/research • Data is being preserved so that it can be re-used
  • 3. Data Preservation Lifecycles • Most data preservation lifecycles include that idea that data is “unarchived” or “awakened” • In the future data is likely to be more active more of the time – A good thing: • Active curation • Annotation, tagging and linking – A challenge: • Can best practices of archiving be maintained on a “live” dataset? • You could certainly still look on this as a different use- case, but in this case it’s still important to understand what is going on elsewhere
  • 4. Data Intensive Computing Computing applications which devote most of their execution time to computational requirements are deemed compute- intensive and typically require small volumes of data, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive. – Wikipedia • My working definition: – I/O-bound computations • Data is (generally) too big to fit in memory – Efficient disk access is required to get the data to the CPU on time – Having the data in the right place at the right time is vital
  • 6. The Role of Data Infrastructures in Data Intensive Computing • Traditionally, we bring the data to the compute • In the future, we’ll want to bring the compute to the data – So where is the data? – More than likely it’s in a repository… – Maybe an “archive”…
  • 7. What will the data in the archive look like? • Files? • Rows & Tables in a Relational Database? • Tuples in a Triple Store?
  • 8. How might you bring in compute? • As (relational) database queries (SQL) • As queries against an RDF store (SPARQL) • As VMs which can mount local disks • As scripts or executables that you allow a user to run • As services that you as a data service offer with some kind of API (e.g. as a web service)
  • 9. • These are the approaches that will need to be offered by “repositories” holding large amounts of “live” data. – Many will probably also be relevant to archives • How can a user get the information back out of the archive? – As complete files? – Over the Internet (and your network!)?
  • 10. Back to the Compute… • Need to understand the performance of your computations and your data transfers • Do you know how fast your program runs? • Do you know if it’s spending all of its time on compute or if it’s spending its time waiting for data? • Where is your data bottleneck? • Benchmarking is key
  • 11. Amdahl’s Other Laws • Gene Amdahl’s quantification of the balance required for data-intensive applications: – One bit of I/O per compute cycle – Memory Size (bytes) / instructions per second = 1 – One IOop per 50,000 instructions
  • 12. A Data Intensive Computer GPU 4GB DDR3 Memory Gb Ethernet USB2 × 120
  • 13. …or use whole datacentre(s)
  • 14. Making best use of a machine designed for data-intensive computing • Work on streams of data, not files – Not (so easily) searchable – Not (so easily) sortable – Not all programs can benefit from this approach • and those that can, might require work • Use multiple threads and asynchronous I/O • If you’re using files, use a library that does some of the hard work for you, e.g. MPI-IO
  • 15. Some Data Intensive Technologies • MapReduce/Hadoop (described earlier in the week) • Storm (http://storm-project.net) – Low latency – Real-time – No writes to disk at intermediate stages – Reportedly not quite such good scalability in terms of throughput
  • 16. …or use whole datacentre(s)
  • 18. Conclusions • Data-Intensive is a new(ish) kind of computing – necessitated by the huge amounts of data – and offering new opportunities • Need to think about new ways of doing computing – It’s usually parallel computing, but not “traditional HPC” • Matters for data preservation. Either: – you’re preserving huge amounts of data that need to be easily reused – you need to process large amounts of data to do a meaningful reduction so that the stored data retains its value