Data-intensive computing
Inf-2202 Concurrent and Data-intensive Programming
University of Tromsø, Fall 2015
Lars Ailo Bongo (larsab@cs.uit.no)
Outline
• Today:
– Introduction to data-intensive computing
– Data-intensive computing platforms
• Google File System, MapReduce
• 15/10: Guest lecture (Inge Alexander Raknes)
– Scala
– Spark
– AWS
• 22/10: Spark ecosystem
– GraphX, Shark, MLlib
• 3/11: Hadoop ecosystem
– HBase, Impala? Storm?
Data-intensive Computing
• Big data
• + Machine learning / statistics
– FYS-3012 Pattern recognition
– (Linear algebra & statistics)
• + Distributed systems
– INF-3200, INF-3203, INF-3201, and more
• (= Data analytics)
Big Data Sources
• Human-produced content
– Videos, photos, audio…
• Human activity
– Online activity, GPS traces, tax records…
• Scientific instruments
– CERN LHC, Sloan Digital Sky Survey, DNA sequencers…
• Sensor data
Big data players
• Industry:
– Google, Facebook, Twitter, Amazon, Netflix, Visa, …
– Use data to provide services
– Use data to make money
– Has developed (most of the) technology for managing and processing peta-scale datasets
• Government:
– NSA, Skatteetaten, Kartverket, e-resept, …
– Use data to make (hopefully) informed decisions
– Make data available for public and commercial services
• Science
– Biology, physics, medicine, social sciences,…
– Use data for novel scientific insights
– Should be open access, indexed, reusable, …
Big Data
• How big?
Dataset Size: < 4 GB, < 512 GB, TBs, PBs
Statistical Analysis (N x M)
• Billions of samples & few dimensions, or
• Billions of samples & thousands of dimensions, or
• Thousands of samples & thousands of dimensions
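To get a feel for these three shapes, here is a rough sketch of how much memory a dense N x M matrix would need. It assumes 8-byte floating point values and decimal units, neither of which is stated in the slides, and the concrete values of N and M are purely illustrative.

```python
# Rough memory footprint of a dense N x M matrix.
# Assumption: every value is an 8-byte float (not stated in the slides).
BYTES_PER_VALUE = 8

def dense_matrix_bytes(n_samples: int, n_dimensions: int) -> int:
    """Return the bytes needed to hold an n_samples x n_dimensions matrix."""
    return n_samples * n_dimensions * BYTES_PER_VALUE

def human_readable(n_bytes: float) -> str:
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit in ("B", "kB", "MB", "GB", "TB", "PB"):
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} EB"

# Illustrative shapes only; the numbers are hypothetical.
shapes = {
    "billions x few": (10**9, 10),
    "billions x thousands": (10**9, 10**3),
    "thousands x thousands": (10**3, 10**3),
}
for name, (n, m) in shapes.items():
    print(name, human_readable(dense_matrix_bytes(n, m)))
```

Under these assumptions the first shape needs tens of gigabytes, the second several terabytes, and the third only a few megabytes, so the three cases can call for quite different tools.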
Data Analysis Tool
Computation Time: < 100 ms, seconds, minutes, hours, weeks
Optimizations
• R or Matlab implementation
• Algorithm parameter tuning
• C++ / Java / … implementation
• Data structure optimization
• Multi-threaded parallelization (single machine)
• Distributed parallelization (multiple machines)
Outline
• History of Big Data + Biology
• My research
– Interactive data analytics
– Elixir infrastructure
– Other interesting stuff
• Google File System
• MapReduce
Jim Gray's Talk
“Data, data everywhere”
Source: The Economist [http://www.economist.com/node/15557443?story_id=15557443]
Scientific Storage Systems
Source: http://www.usenix.org/events/lisa10/tech/slides/cass.pdf
Data growth in the life sciences
PB
Increase in bioinformaticians?
@UiT
My Lab
• Biological Data Processing Systems Lab
• 3 + 1 PhD students
– Edvard Pedersen, Einar Holsbø, Bjørn Fjukstad + Espen Mikal Robertsen
• 2 engineers
– Inge Alexander Raknes, Giacomo Tartari
• 3 master students
– Kenneth Knudsen, Morten Grønnesby, Jarl Fagerli
• http://bdps.cs.uit.no
Research Goal
Norwegian Women and Cancer (NOWAC)
• Large and unique biobank of blood samples
• Understand development of cancer (and how to avoid it)
• Develop diagnosis approaches
• Develop or improve treatment
• http://site.uit.no/nowac/
Center for Bioinformatics (SfB)
• Interdisciplinary research and services
– Computer science
– Biotechnology
– Bioinformatics
• Special focus on marine metagenomics
• Commercial exploitation of marine resources
• http://sfb.cs.uit.no
Interactive Data Exploration Components
• Human experts for data analysis
• Interactive user interface
• Analysis methods and models
• Data management and backend processing
• Compute and storage resources
Data-intensive computing platforms
Outline (part 2)
• Hardware platforms
• Infrastructure systems
– Google File System
– MapReduce
– Ecosystems
Hardware Requirements
• Process 1TB of data?
• Process 1PB of data?
Single Computer
Supercomputer
• Disadvantages:
– Centralized storage has limited bandwidth
– High cost of interconnect
[Figure: compute nodes connected by InfiniBand; link labels 56 Gbit/s and 164 Gbit/s]
Commodity Component Distributed System
[Figure: commodity nodes; SATA 6 Gbit/s]
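The "Process 1TB / 1PB of data?" question can be approached with back-of-envelope arithmetic. The sketch below estimates how long it takes just to read the data over SATA at 6 Gbit/s. It assumes decimal units, that the full link rate is sustained (an optimistic upper bound; real disks are slower), and that data is spread evenly over the disks. The disk counts are hypothetical.

```python
# Back-of-envelope scan times for the "Process 1 TB / 1 PB?" question.
# Assumptions (not from the slides): decimal units, the full 6 Gbit/s SATA
# link rate is sustained, and data is spread evenly over the disks.
SATA_BITS_PER_SECOND = 6e9

def scan_seconds(n_bytes: float, n_disks: int = 1) -> float:
    """Seconds to read n_bytes when n_disks are read in parallel."""
    return (n_bytes * 8) / (SATA_BITS_PER_SECOND * n_disks)

TB = 1e12
PB = 1e15

print(f"1 TB, 1 disk:     {scan_seconds(TB) / 60:.1f} minutes")
print(f"1 PB, 1 disk:     {scan_seconds(PB) / 86400:.1f} days")
print(f"1 PB, 1000 disks: {scan_seconds(PB, 1000) / 60:.1f} minutes")
```

Even under these optimistic assumptions, a single disk needs roughly 22 minutes per terabyte, so a petabyte is only practical to scan when many disks are read in parallel, which is the case for a distributed system built from commodity components.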
Hadoop
Google File System (GFS)
• https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/gfs.pdf
• Hadoop Distributed File System implements the GFS design
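A minimal sketch of the core GFS idea: a file is split into fixed-size chunks and each chunk is stored on several chunkservers. The 64 MB chunk size and three replicas are the defaults described in the GFS paper. The round-robin placement and the server names below are simplifying assumptions for illustration; the real master uses more elaborate placement heuristics.

```python
# Sketch of GFS-style chunking. The 64 MB chunk size and 3 replicas are the
# defaults described in the GFS paper; the round-robin placement and the
# server names are simplifying assumptions for illustration only.
CHUNK_SIZE = 64 * 1024 * 1024
REPLICAS = 3

def chunk_count(file_size: int) -> int:
    """Number of fixed-size chunks needed for a file (last one may be partial)."""
    return -(-file_size // CHUNK_SIZE)  # ceiling division

def place_chunks(file_size: int, servers: list) -> dict:
    """Map each chunk index to REPLICAS distinct servers, round-robin."""
    if len(servers) < REPLICAS:
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for chunk in range(chunk_count(file_size)):
        placement[chunk] = [
            servers[(chunk + r) % len(servers)] for r in range(REPLICAS)
        ]
    return placement

servers = ["cs0", "cs1", "cs2", "cs3"]  # hypothetical chunkserver names
layout = place_chunks(200 * 1024 * 1024, servers)  # a 200 MiB file
for chunk, replicas in layout.items():
    print(chunk, replicas)
```

Because every chunk lives on several machines, losing one chunkserver does not lose data, and different clients can read different chunks of the same file in parallel.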
MapReduce
• http://research.google.com/archive/mapreduce-osdi04-slides/index.html
• Hadoop MapReduce implements the Google MapReduce design
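The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in a single process and only mimics the three phases (map, shuffle, reduce). It is not the Hadoop or Google API, and the function names and the toy input are assumptions made for illustration.

```python
# Single-process sketch of the MapReduce programming model (word count).
# A real framework such as Hadoop MapReduce runs map and reduce tasks on many
# machines; here the "shuffle" is just an in-memory group-by-key.
from collections import defaultdict

def map_fn(document: str):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def reduce_fn(word: str, counts: list) -> tuple:
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

def map_reduce(documents):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn once per key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["big data big systems", "data intensive computing"]  # toy input
print(map_reduce(docs))
```

The user only writes the map and reduce functions. Because each map call and each per-key reduce call is independent, a framework is free to distribute them across many machines and to rerun them after failures.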
Spark Ecosystem
