High Performance Distributed Computing and Data Science

Henri Bal describes how high performance computing is driven by the demands of large-scale data problems. He also describes his links to other computer science disciplines within the DSRC.

Published in: Technology, Education

  1. DSRC (Data Science Research Center): High Performance Distributed Computing. Henri Bal, Vrije Universiteit Amsterdam
  2. Outline
     1. Development of the field
     2. Highlights VU-HPDC group
     3. Links to data science cycle
     4. Conclusions
  3. Developments
     • Multiple types of data explosions:
       – Big data: huge processing/transportation demands (10-100× global internet traffic per year, exascale processing)
       – Complex data: complex heterogeneous data
  4. Developments
     • Infrastructure explosion
       – High complexity: heterogeneous systems with diversity of processors, systems, networks
  5. VU HPDC Group
     • Bridge the gap between demanding applications and complex infrastructure
     • Distributed programming systems for
       – Clusters, grids, clouds
       – Heterogeneous systems ("Jungles")
       – Accelerators (GPUs)
       – Clouds & mobile devices
     • Applications: multimedia, semantic web, model checking, games, astronomy, astrophysics, climate modeling, ...
  6. Highlights VU-HPDC group
     • 2002: solved Awari (889 billion game states)
     • AAAI-VC 2007 (multimedia data)
     • 3rd Prize: ISWC 2008 (semantic web)
     • DACH 2008 - BS and DACH 2008 - FT (astronomy data)
     • 1st Prize: SCALE 2008
     • 1st Prize: SCALE 2010
     • EYR 2011 Sustainability award
     Data categories shown on the slide: multimedia data, semantic web, astronomy data
  7. Links to data science cycle
     [Figure: the data science cycle]
     • Stages: Understand and decide; Store and process; Analyze and model
     • Disciplines: Visual Analytics, Perception, Cognition, Decision Theory, Distributed Processing, Reasoning, Knowledge Representation, Large Scale Databases, Software Eng., System / Network Eng., Multimedia Retrieval, Modeling and Simulation, Information Retrieval, Machine Learning
     • Link marked on the figure: Distributed reasoning
  8. Reasoning – Semantic Web
     • Make the Web smarter by injecting meaning so that machines can "understand" it
       – Initial idea by Tim Berners-Lee in 2001
     • Has now attracted the interest of big IT companies
  9. Google Example
  10. Google Example
  11. Distributed Reasoning
      • WebPIE: web-scale distributed reasoner doing full materialization
      • QueryPIE: distributed reasoning with backward-chaining + pre-materialization of schema-triples
      • DynamiTE: maintains materialization after updates (additions & removals)
      → Challenge: real-time incremental reasoning on web scale, combining new (streaming) data & existing historic data
      With: Jacopo Urbani, Alessandro Margara, Frank van Harmelen (COMMIT/)
  12. Distributed Computing
      • Jungle computing with Ibis
        – Distributed, heterogeneous, hierarchical systems
      • Programming accelerators
      With: NLeSC (Frank Seinstra, Rob van Nieuwpoort et al.)
  13. Ibis
      • Computational Astrophysics (Leiden): AMUSE (gravitational dynamics, stellar evolution, radiative transport, hydrodynamics)
      • Climate Modeling (Utrecht)
      • Multimedia Content Analysis (UvA)
  14. Accelerators (GPUs)
      [Figure: GPU architecture block diagram showing Host Interface, GigaThread Engine, GPCs containing SMs, Polymorph Engines and Raster Engines, L2 Cache, and Memory Controllers]
      • Methodology for efficient GPU programming
        – Stepwise refinement, different levels of hardware abstraction
        – Compiler feedback at each level
      • Use cases
        – Multimedia content analysis
        – Climate modeling
        – LOFAR (pulsar pipelines)
      → Challenge: getting a grip on performance
  15. Glasswing: MapReduce on Accelerators
      • Use accelerators (OpenCL) as mainstream feature
      • Massive out-of-core data sets
      • Scale vertically & horizontally
      • Maintain MapReduce abstraction
      With: Ismail El Helw, Rutger Hofman, UvA-SNE
  16. Glasswing Pipeline
      • Overlaps computation, communication & disk access
      • Supports multiple buffering levels
  17. Evaluation (DAS-4, EC2)
      • Compute-bound applications benefit dramatically from GPUs (up to 107×)
      • Better scalability than Hadoop
      • Runs on a variety of accelerators & clouds
      → Challenge: real-world (compute-intensive) applications
  18. Conclusions
      • Strong links with Big data & Complex data
      [Figure: the data science cycle]
      • Stages: Understand and decide; Store and process; Analyze and model
      • Disciplines: Visual Analytics, Perception, Cognition, Decision Theory, Distributed Processing, Reasoning, Knowledge Representation, Large Scale Databases, Software Eng., System / Network Eng., Multimedia Retrieval, Modeling and Simulation, Information Retrieval, Machine Learning
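
Slide 11 describes WebPIE as a reasoner that does "full materialization": deriving every triple implied by the data up front. The sketch below illustrates that idea on a single machine with forward chaining to a fixpoint. It is not WebPIE's implementation; the two RDFS-style rules, the `materialize` function, and the `ex:` triples are all assumptions chosen for illustration.

```python
# Illustrative only: a tiny forward-chaining reasoner over (subject, predicate, object) triples.
# The rule set and all names here are assumptions, not taken from the slides.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(triples):
    """Apply two RDFS-style rules repeatedly until no new triples appear."""
    closure = set(triples)
    while True:
        subclass_pairs = {(s, o) for s, p, o in closure if p == SUBCLASS}
        derived = set()
        # Rule 1: subClassOf is transitive.
        for a, b in subclass_pairs:
            for b2, c in subclass_pairs:
                if b == b2:
                    derived.add((a, SUBCLASS, c))
        # Rule 2: an instance of a class is an instance of its superclasses.
        for s, p, o in closure:
            if p == TYPE:
                for a, b in subclass_pairs:
                    if o == a:
                        derived.add((s, TYPE, b))
        new = derived - closure
        if not new:
            return closure  # fixpoint reached: nothing left to derive
        closure |= new


triples = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", TYPE, "ex:Student"),
}
closure = materialize(triples)
```

After materialization, a query such as "is `ex:alice` an `ex:Agent`?" becomes a plain set lookup. Backward chaining, as mentioned for QueryPIE, would instead derive only the triples needed to answer a given query.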

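Slide 15 states that Glasswing maintains the MapReduce abstraction. As a reminder of what that abstraction asks of the programmer, here is a minimal single-process sketch: the user supplies a map function and a reduce function, and the framework handles grouping by key. The `map_reduce` helper and the word-count example are hypothetical and say nothing about Glasswing's actual API, which targets OpenCL devices.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle" step, done in memory here
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words_map(line):
    """Emit (word, 1) for each word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def sum_reduce(key, values):
    """Add up all counts emitted for one word."""
    return sum(values)


counts = map_reduce(["big data", "complex data"], count_words_map, sum_reduce)
```

In a distributed or accelerated setting the map and reduce functions stay the same; only the execution of the map, shuffle, and reduce phases changes.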
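
Slide 16 says the Glasswing pipeline overlaps computation, communication, and disk access and supports multiple buffering levels. The general technique can be sketched with one thread per stage connected by bounded queues, where the queue size plays the role of the buffering level. This is an assumption-laden illustration, not Glasswing's design: `run_pipeline`, `stage`, and the three toy stages are hypothetical, and Python threads only overlap genuinely blocking work such as I/O.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(fn, inbox, outbox):
    """Repeatedly take an item, process it, and hand it to the next stage."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))


def run_pipeline(chunks, stages, buffers=2):
    """Run each stage in its own thread; `buffers` bounds the items in flight per link."""
    queues = [queue.Queue(maxsize=buffers) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]

    def feed():
        for chunk in chunks:
            queues[0].put(chunk)  # blocks when the first buffer is full
        queues[0].put(SENTINEL)

    feeder = threading.Thread(target=feed)
    for t in workers + [feeder]:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in workers + [feeder]:
        t.join()
    return results


# Toy stand-ins for "read from disk", "compute", and "send/write".
results = run_pipeline(
    [1, 2, 3],
    [
        lambda n: list(range(n)),
        lambda xs: sum(x * x for x in xs),
        lambda total: f"result={total}",
    ],
)
```

With `buffers=2` each link behaves like double buffering: a stage can fill one slot while the next stage drains the other, so slow stages throttle fast ones instead of exhausting memory.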