What is the "Big Data" version of the Linpack 
Benchmark? 
What is “Big Data” version of Berkeley Dwarfs 
and NAS Parallel Benchmarks? 
Based on Presentation at Clusters, Clouds, and Data for 
Scientific Computing CCDSC 2014 
September 6 2014 
Geoffrey Fox, Judy Qiu 
School of Informatics and Computing 
Digital Science Center 
Indiana University Bloomington 
Shantenu Jha 
Radical Group 
Rutgers University
Summary 
• Advances in high-performance/parallel computing in the 1980s 
and 90s were spurred by the development of quality high-performance 
libraries, e.g., ScaLAPACK, as well as by well-established 
benchmarks, such as Linpack. 
• Similar efforts to develop libraries for high-performance data 
analytics are underway. In this talk we argue that benchmarks for 
such libraries should be motivated by frequent patterns 
encountered in high-performance analytics, which we call Ogres. 
• Based upon earlier work, we propose that doing so will enable 
adequate coverage of the "Apache" bigdata stack as well as most 
common application requirements, whilst building upon parallel 
computing experience. 
• Given the spectrum of analytic requirements and applications, 
there are multiple "facets" that need to be covered, and thus we 
propose an initial set of benchmarks - by no means currently 
complete - that covers these characteristics. 
– We hope this will encourage debate
The Answer
Linpack for data? 
• There is a simple solution – use Linpack 
• The core of many data analytics algorithms is often linear 
algebra and involves full, not sparse, matrices, although 
– Not always matrix solvers but rather large matrix multiplication 
– Matrix solution can be done (much faster) with conjugate 
gradient in cases I’ve looked at (200 iterations for matrix size of 
a million) 
• Big Data can be dominated by analytics but also by other 
aspects of the problem such as datastore access and data 
transport. 
• We expand “topic of presentation” to “broad based 
benchmark set” in the spirit of the Berkeley Dwarfs, i.e. “capture key 
features” and “grand challenges” in (academic) Big Data
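The conjugate gradient remark above can be illustrated with a minimal sketch. This is not the code behind the quoted 200-iteration figure. It is a plain-Python version, assuming a symmetric positive-definite matrix that is reached only through a matrix-vector product. The names `conjugate_gradient` and `dense_matvec` are hypothetical.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x for the start x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for iteration in range(max_iter):
        if rs_old ** 0.5 < tol:      # residual norm small enough: done
            return x, iteration
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


def dense_matvec(A):
    """Wrap a full (dense) matrix, stored as a list of rows, as v -> A v."""
    return lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = conjugate_gradient(dense_matvec(A), b)
print(x, iterations)
```

Each iteration costs one matrix-vector product, so the work is dominated by the same dense kernel that the slide identifies as the core of many analytics algorithms.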
Proposed Spectrum of Benchmarks/Features 
• Classic Database: TPC benchmarks 
• NoSQL Data systems: store, index, query (e.g. on Tweets) 
• Hard core commercial: Web Search, Collaborative 
Filtering (different structure and defer to Google!) 
• Streaming: Gather in Pub-Sub(Kafka) + Process (Apache 
Storm) solution (e.g. gather tweets, Internet of Things) 
• Pleasingly parallel (Local Analytics): as in initial steps of 
LHC, Astronomy, Pathology, Bioimaging (differ in type of 
data analysis) 
• “Global” Analytics: Deep Learning, SVM, 
Multidimensional Scaling, Graph Community finding 
(~Clustering) to Shortest Path (? Shared memory) 
• Workflow linking above
Why? Cover Software Stack 
Stress different components 
Combines HPC and Apache 
140 packages but still incomplete
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies 
Cross-Cutting Functionalities 
Message Protocols: Thrift, Protobuf 
Distributed Coordination: Zookeeper, JGroups 
Security & Privacy: InCommon, OpenStack Keystone, LDAP, Sentry 
Monitoring: Ambari, Ganglia, Nagios, Inca 
Workflow-Orchestration: Oozie, ODE, Airavata, OODT (Tools), Pegasus, 
Kepler, Swift, Taverna, Trident, ActiveBPEL, BioKepler, Galaxy, IPython 
Application and Analytics: Mahout , MLlib , MLbase, CompLearn, R, 
Bioconductor, ImageJ, Scalapack, PetSc 
High level Programming: Hive, HCatalog, Pig, Shark, MRQL, Impala, Sawzall, 
Drill 
Basic Programming model and runtime, SPMD, Streaming, MapReduce: 
Hadoop, Spark, Twister, Stratosphere, Tez, Hama, Storm, S4, Samza, Giraph, 
Pregel, Pegasus, Reef 
Inter process communication Collectives, point-to-point, publish-subscribe: 
Harp, MPI, Netty, ZeroMQ, ActiveMQ, RabbitMQ, QPid, Kafka, Kestrel 
In-memory databases/caches: GORA (general object from NoSQL), 
Memcached, Redis (key value), Hazelcast, Ehcache 
Object-relational mapping: Hibernate, OpenJPA and JDBC Standard 
Extraction Tools: UIMA, Tika 
SQL: Oracle, MySQL, Phoenix, SciDB, Apache Derby 
NoSQL: HBase, Accumulo, Cassandra, Solandra, MongoDB, CouchDB, Lucene, 
Solr, Berkeley DB, Azure Table, Dynamo, Riak, Voldemort, Neo4J, Yarcdata, 
Jena, Sesame, AllegroGraph, RYA, Parquet 
File management: iRODS 
Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP) 
Cluster Resource Management: Mesos, Yarn, Helix, Llama, Condor, SGE, 
OpenPBS, Moab, Slurm, Torque 
File systems: HDFS, Swift, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS 
Interoperability: Whirr, JClouds, OCCI, CDMI 
DevOps: Docker, Puppet, Chef, Ansible, Boto, Libcloud, Cobbler, CloudMesh 
IaaS Management from HPC to hypervisors: OpenStack, OpenNebula, 
Eucalyptus, CloudStack, vCloud, Amazon, Azure, Google
HPC-ABDS Layers 
1) Message Protocols 
2) Distributed Coordination: 
3) Security & Privacy: 
4) Monitoring: 
5) IaaS Management from HPC to hypervisors: 
6) DevOps: 
7) Interoperability: 
8) File systems: 
9) Cluster Resource Management: 
10) Data Transport: 
11) SQL / NoSQL / File management: 
12) In-memory databases&caches / Object-relational mapping / Extraction Tools 
13) Inter process communication Collectives, point-to-point, publish-subscribe 
14) Basic Programming model and runtime, SPMD, Streaming, MapReduce, MPI: 
15) High level Programming: 
16) Application and Analytics: 
17) Workflow-Orchestration: 
Here are 17 functionalities. Technologies are presented in this order: 
the 4 cross-cutting functionalities at top, then the other 13 in order of 
the layered diagram, starting at the bottom
Maybe a Big Data Initiative would include 
• We don’t need 140 software packages, so we can choose e.g. 
• Workflow: Python, Pegasus or Kepler 
• Data Analytics: Mahout, R, ImageJ, Scalapack 
• High level Programming: Hive, Pig 
• Parallel Programming model: Hadoop, Spark, Giraph (Twister4Azure, Harp), 
Storm 
• Communication: MPI; Kafka or RabbitMQ (Streaming) 
• In-memory: Memcached 
• Data Management: Hbase, MongoDB, MySQL or Derby 
• Distributed Coordination: Zookeeper 
• Cluster Management: Yarn, Slurm 
• File Systems: HDFS, Lustre 
• DevOps: Cloudmesh, Chef, Puppet, Docker, Cobbler 
• IaaS: Amazon, Azure, OpenStack, Libcloud 
• Monitoring: Inca, Ganglia, Nagios
Why? Build on Parallel 
Computing Experience 
Benchmarks Instantiate Key Features
HPC Benchmark Classics 
• Linpack or HPL: Parallel LU factorization for solution of 
linear equations 
• NPB version 1: Mainly classic HPC solver kernels 
– MG: Multigrid 
– CG: Conjugate Gradient 
– FT: Fast Fourier Transform 
– IS: Integer sort 
– EP: Embarrassingly Parallel 
– BT: Block Tridiagonal 
– SP: Scalar Pentadiagonal 
– LU: Lower-Upper symmetric Gauss Seidel
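To make the Linpack entry concrete, here is a small serial sketch of the kernel it is built on: factorization by Gaussian elimination with partial pivoting, followed by back substitution. It is not HPL and uses none of HPL's blocking or parallel distribution. The function name `lu_solve` is hypothetical.

```python
def lu_solve(A, b):
    """Solve A x = b by elimination with partial pivoting (works on copies)."""
    n = len(A)
    M = [row[:] for row in A]
    rhs = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k up.
        pivot = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[pivot] = M[pivot], M[k]
        rhs[k], rhs[pivot] = rhs[pivot], rhs[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            M[i][k] = factor                  # keep the L multiplier in place
            for j in range(k + 1, n):
                M[i][j] -= factor * M[k][j]   # update the trailing submatrix
            rhs[i] -= factor * rhs[k]
    # Back substitution on the upper-triangular factor U.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rhs[i] - s) / M[i][i]
    return x


A = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
b = [5.0, -2.0, 9.0]
solution = lu_solve(A, b)
print(solution)
```

The triple loop over the trailing submatrix is where the leading-order (2/3)n^3 floating point operations come from, which is why the benchmark mostly stresses dense arithmetic rather than data movement from storage.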
13 Berkeley Dwarfs 
• Dense Linear Algebra 
• Sparse Linear Algebra 
• Spectral Methods 
• N-Body Methods 
• Structured Grids 
• Unstructured Grids 
• MapReduce 
• Combinational Logic 
• Graph Traversal 
• Dynamic Programming 
• Backtrack and Branch-and-Bound 
• Graphical Models 
• Finite State Machines 
First 6 of these correspond to Colella’s original. 
Monte Carlo dropped. 
N-body methods are a subset of Particle in Colella. 
Note a little inconsistent in that MapReduce is a programming model and spectral method is a numerical method. 
NO clean solution likely for Big Data. Need multiple facets!
7 Computational Giants of 
NRC Massive Data Analysis Report 
1) G1: Basic Statistics (see MRStat later) 
2) G2: Generalized N-Body Problems 
3) G3: Graph-Theoretic Computations 
4) G4: Linear Algebraic Computations 
5) G5: Optimizations e.g. Linear Programming 
6) G6: Integration e.g. LDA and other GML 
7) G7: Alignment Problems e.g. BLAST
Why? Cover Big Data 
Application Survey 
Performed by NIST Big Data Working Group 
Leads to Ogres covering Big Data Application 
features. Here we focus on benchmarks that 
cover the Ogres
51 Detailed Use Cases: Contributed July-September 2013 
Covers goals, data features such as 3 V’s, software, hardware 
26 Features for each use case; biased to science 
• http://bigdatawg.nist.gov/usecases.php 
• https://bigdatacoursespring2014.appspot.com/course (Section 5) 
• Government Operation(4): National Archives and Records Administration, Census Bureau 
• Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, 
Digital Materials, Cargo shipping (as in UPS) 
• Defense(3): Sensors, Image surveillance, Situation Assessment 
• Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis, 
Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity 
• Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd 
Sourcing, Network Science, NIST benchmark datasets 
• The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source 
experiments 
• Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron 
Collider at CERN, Belle Accelerator II in Japan 
• Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake, 
Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate 
simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry 
(microbes to watersheds), AmeriFlux and FLUXNET gas sensors 
• Energy(1): Smart grid
Features of 51 Use Cases I 
• PP (26) Pleasingly Parallel or Map Only 
• MR (18) Classic MapReduce MR (add MRStat below for full count) 
• MRStat (7) Simple version of MR where the key computations are 
simple reductions, as found in statistical summaries such as histograms 
and averages 
• MRIter (23) Iterative MapReduce or MPI (Spark, Twister) 
• Graph (9) Complex graph data structure needed in analysis 
• Fusion (11) Integrate diverse data to aid discovery/decision making; 
could involve sophisticated algorithms or could just be a portal 
• Streaming (41) Some data comes in incrementally and is processed 
this way 
• Classify (30) Classification: divide data into categories 
• S/Q (12) Index, Search and Query
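The MRStat feature above can be sketched in a few lines. Each partition builds a local histogram (the map step), and the results are merged by a bin-wise sum (the reduce step). This is a plain-Python illustration rather than Hadoop or Spark code. The function names, the bin width and the sample values are all made up for the example.

```python
from collections import Counter
from functools import reduce


def map_histogram(partition, bin_width):
    """Map step: one partition produces its own local histogram."""
    return Counter(int(value // bin_width) for value in partition)


def reduce_histograms(left, right):
    """Reduce step: merging is an associative bin-wise sum."""
    return left + right


# Three partitions standing in for data held on three workers.
partitions = [[0.5, 1.2, 1.9], [2.1, 0.1], [1.5, 2.9, 2.2]]
local = [map_histogram(p, bin_width=1.0) for p in partitions]
total = reduce(reduce_histograms, local, Counter())
print(sorted(total.items()))
```

Because the merge is associative and the merged object is small compared with the input, the reduction can be done in any order or as a tree. That is what makes this class simpler than general MapReduce.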
Features of 51 Use Cases II 
• CF (4) Collaborative Filtering for recommender engines 
• LML (36) Local Machine Learning (Independent for each parallel 
entity) 
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, 
PLSI, MDS 
– Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief 
Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can 
call this EGO or Exascale Global Optimization with scalable parallel algorithm 
• Workflow (51) Universal 
• GIS (16) Geotagged data, often displayed in ESRI, Microsoft 
Virtual Earth, Google Earth, GeoServer etc. 
• HPC (5) Classic large-scale simulation of cosmos, materials, etc. 
generating (visualization) data 
• Agent (2) Simulations of models of data-defined macroscopic 
entities represented as agents
Data Source and Style Facet I 
• (i) SQL or NoSQL: NoSQL includes Document, Column, Key-value, 
Graph, Triple store 
• (ii) Other Enterprise data systems: e.g. Warehouses 
• (iii) Set of Files: as managed in iRODS and extremely common in 
scientific research 
• (iv) File, Object, Block and Data-parallel (HDFS) raw storage: 
Separated from computing? 
• (v) Internet of Things: 24 to 50 Billion devices on Internet by 
2020 
• (vi) Streaming: Incremental update of datasets with new 
algorithms to achieve real-time response (G7) 
• (vii) HPC simulations: generate major (visualization) output that 
often needs to be mined 
• (viii) Involve GIS: Geographical Information Systems provide 
attractive access to geospatial data
2. Perform real time analytics on data source streams and 
notify users when specified events occur 
[Diagram: Streaming Data sources; Fetch streamed Data; Filter Identifying Events (Specify filter); Post Selected Events; Repository (Posted Data, Identified Events); Archive. Software: Storm, Kafka, Hbase, Zookeeper]
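The access pattern above (fetch streamed data, filter for events, post the selected events) can be sketched without any messaging system. The generator below stands in for a pub-sub consumer, and it uses no Kafka or Storm API. Archiving every record before filtering is an assumption of this sketch. The function names and the sensor readings are hypothetical.

```python
def fetch_streamed_data(readings):
    """Stand-in for a pub-sub consumer: yields records one at a time."""
    for reading in readings:
        yield reading


def run_filter(stream, is_event, archive, post_event):
    """Archive every record; post only those the user-specified filter selects."""
    for record in stream:
        archive.append(record)
        if is_event(record):
            post_event(record)


archive, notifications = [], []
readings = [
    {"sensor": "a", "temp": 21.0},
    {"sensor": "b", "temp": 95.5},
    {"sensor": "a", "temp": 22.4},
]
run_filter(
    fetch_streamed_data(readings),
    is_event=lambda record: record["temp"] > 90.0,   # the "specify filter" step
    archive=archive,
    post_event=notifications.append,                 # the "post selected events" step
)
print(notifications)
```

A benchmark for this pattern would measure the sustained record rate and the delay between a record arriving and its notification, not the arithmetic inside the filter.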
5. Perform interactive analytics on data in analytics-optimized 
data system 
[Diagram: Data, Streaming, Batch ...; Data Storage: HDFS, Hbase; Hadoop, Spark, Giraph, Pig ...; Mahout, R]
Data Source and Style Facet II 
• Before data gets to compute system, there is often an 
initial data gathering phase which is characterized by a 
block size and timing. Block size varies from month 
(Remote Sensing, Seismic) to day (genomic) to seconds or 
lower (Real time control, streaming) 
• There are storage/compute system styles: Shared, 
Dedicated, Permanent, Transient 
• Other characteristics are needed for permanent 
auxiliary/comparison datasets and these could be 
interdisciplinary, implying nontrivial data 
movement/replication 
• 10 Data Access/Use Styles from Bob Marcus at NIST (you 
have seen his patterns 2 and 5 and my extension for 
science 5A follows)
5A. Perform interactive analytics on observational scientific data 
[Diagram: Record Scientific Data in “field”; Local Accumulate and initial computing; Direct Transfer; Transport batch of data to primary analysis data system; Streaming Twitter data for Social Networking; Data Storage: HDFS, Hbase, File Collection (Lustre); Grid or Many Task Software, Hadoop, Spark, Giraph, Pig ...; Science Analysis Code, Mahout, R] 
NIST Examples include LHC, Remote Sensing, Astronomy and Bioinformatics
Why? Typical Big Data Analytics 
See Mahout, MLLib, R, usage in 
application survey
Core Analytics I 
• Map-Only 
• Pleasingly parallel - Local Machine Learning 
• MapReduce: Search/Query/Index 
• Summarizing statistics as in LHC Data analysis (histograms) (G1) 
• Recommender Systems (Collaborative Filtering) 
• Linear Classifiers (Bayes, Random Forests) 
• Alignment and Streaming (G7) 
• Genomic Alignment, Incremental Classifiers 
• Global Analytics: Nonlinear Solvers (structure depends on 
objective function) (G5,G6) 
– Stochastic Gradient Descent SGD 
– (L-)BFGS approximation to Newton’s Method 
– Levenberg-Marquardt solver
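As an illustration of the first solver listed above, here is a minimal stochastic gradient descent sketch. It fits a straight line to noise-free points, a deliberately simple objective chosen so that the result is easy to check; the slide's point is that the structure depends on the objective function. The learning rate, epoch count, seed and function name are assumptions of this sketch.

```python
import random


def sgd_linear_fit(points, learning_rate=0.1, epochs=2000, seed=0):
    """Fit y ~ w * x + b by updating on one sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)                     # visit samples in random order
        for x, y in data:
            error = (w * x + b) - y           # derivative of 0.5 * error**2 w.r.t. the prediction
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Noise-free samples of y = 2x + 1.
points = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
w, b = sgd_linear_fit(points)
print(round(w, 4), round(b, 4))
```

Each update touches a single sample, so a parallel version has to decide how often workers exchange `w` and `b`. That synchronization choice is where the "global" character of this analytics class shows up.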
Core Analytics II 
• Global Analytics: Map-Collective (See Mahout, 
MLlib) (G2,G4,G6) 
• Often use matrix-matrix,-vector operations, solvers 
(conjugate gradient) 
• Clustering (many methods), Mixture Models, LDA 
(Latent Dirichlet Allocation), PLSI (Probabilistic Latent 
Semantic Indexing) 
• SVM and Logistic Regression 
• Outlier Detection (several approaches) 
• PageRank, (find leading eigenvector of sparse matrix) 
• SVD (Singular Value Decomposition) 
• MDS (Multidimensional Scaling) 
• Learning Neural Networks (Deep Learning) 
• Hidden Markov Models
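The PageRank item above (find the leading eigenvector of a sparse matrix) can be sketched with power iteration over an adjacency list. The damping factor 0.85, the uniform treatment of dangling nodes and the tiny three-page graph are assumptions of this sketch, and `pagerank` is a hypothetical helper rather than a Mahout or MLlib call.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on the link graph; links maps page -> list of targets."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(ranks)
```

Each sweep is a sparse matrix-vector product followed by a global sum over all pages, which matches the Map-Collective structure named at the top of this slide.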
Core Analytics III 
• Global Analytics – Map-Communication (targets 
for Giraph) (G3) 
• Graph Structure (Communities, subgraphs/motifs, 
diameter, maximal cliques, connected components) 
• Network Dynamics - Graph simulation Algorithms 
(epidemiology) 
• Global Analytics – Asynchronous Shared Memory 
(may be distributed algorithms) 
• Graph Structure (Betweenness centrality, shortest 
path) (G3) 
• Linear/Quadratic Programming, Combinatorial 
Optimization, Branch and Bound (G5)
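The Map-Communication style above can be sketched with connected components computed by label propagation. In each superstep every vertex takes the smallest label among itself and its neighbors, reading only labels from the previous superstep. This is a serial imitation of the vertex-centric model and not Giraph code. The function name and the edge list are hypothetical.

```python
def connected_components(edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    label = {v: v for v in neighbors}
    supersteps = 0
    changed = True
    while changed:
        changed = False
        supersteps += 1
        new_label = {}
        for v, nbrs in neighbors.items():
            # "Messages" are the neighbors' labels from the previous superstep.
            best = min([label[v]] + [label[u] for u in nbrs])
            new_label[v] = best
            if best != label[v]:
                changed = True
        label = new_label
    return label, supersteps


labels, supersteps = connected_components([(1, 2), (2, 3), (4, 5)])
print(labels, supersteps)
```

The number of supersteps grows with the graph diameter, and communication follows the edges instead of a global collective. That is the main difference from the Map-Collective algorithms on the previous slide.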
Proposed Spectrum of Benchmarks/Features 
• Classic Database: TPC benchmarks 
• NoSQL Data systems: store, index, query (e.g. on Tweets) 
• Hard core commercial: Web Search, Collaborative 
Filtering (different structure and defer to Google!) 
• Streaming: Gather in Pub-Sub(Kafka) + Process (Apache 
Storm) solution (e.g. gather tweets, Internet of Things) 
• Pleasingly parallel (Local Analytics): as in initial steps of 
LHC, Astronomy, Pathology, Bioimaging (differ in type of 
data analysis) 
• “Global” Analytics: Deep Learning, SVM, 
Multidimensional Scaling, Graph Community finding 
(~Clustering) to Shortest Path (? Shared memory) 
• Workflow linking above

More Related Content

What's hot

51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data StackGeoffrey Fox
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Geoffrey Fox
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Geoffrey Fox
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...inside-BigData.com
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersSaliya Ekanayake
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleJulius Remigio, CBIP
 
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...DataWorks Summit
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitGanesan Narayanasamy
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputinginside-BigData.com
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...Ian Foster
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Studies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning PerspectivesStudies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning PerspectivesHPCC Systems
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detectionMostafaAliAbbas
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKAbhi Jit
 

What's hot (20)

51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data Style
 
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a...
 
Scientific Application Development and Early results on Summit
Scientific Application Development and Early results on SummitScientific Application Development and Early results on Summit
Scientific Application Development and Early results on Summit
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputing
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Studies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning PerspectivesStudies of HPCC Systems from Machine Learning Perspectives
Studies of HPCC Systems from Machine Learning Perspectives
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detection
 
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORKMACHINE LEARNING ON MAPREDUCE FRAMEWORK
MACHINE LEARNING ON MAPREDUCE FRAMEWORK
 

Similar to What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data” version of Berkeley Dwarfs and NAS Parallel Benchmarks?

High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data Geoffrey Fox
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackTuri, Inc.
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
The Evolution of Big Data Frameworks
The Evolution of Big Data FrameworksThe Evolution of Big Data Frameworks
The Evolution of Big Data FrameworkseXascale Infolab
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCHimanshu Bedi
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...
Geospatial Analytics at Scale with Deep Learning and Apache Spark with Tim hu...Databricks
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...Felix Gessert
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 

Similar to What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data” version of Berkeley Dwarfs and NAS Parallel Benchmarks? (20)

High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Dibbs spidal april6-2016
Dibbs spidal april6-2016Dibbs spidal april6-2016
Dibbs spidal april6-2016
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013Big data analytics_7_giants_public_24_sep_2013
Big data analytics_7_giants_public_24_sep_2013
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
What is the "Big Data" version of the Linpack Benchmark? What is the “Big Data” version of Berkeley Dwarfs and NAS Parallel Benchmarks?

  • 1. What is the "Big Data" version of the Linpack Benchmark? What is “Big Data” version of Berkeley Dwarfs and NAS Parallel Benchmarks? Based on Presentation at Clusters, Clouds, and Data for Scientific Computing CCDSC 2014 September 6 2014 Geoffrey Fox, Judy Qiu School of Informatics and Computing Digital Science Center Indiana University Bloomington Shantenu Jha Radical Group Rutgers University
  • 2. Summary • Advances in high-performance/parallel computing in the 1980s and 90s were spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack. • Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres. • Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" big data stack as well as most common application requirements, whilst building upon parallel computing experience. • Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks, by no means currently complete, that covers these characteristics. – We hope this will encourage debate
  • 4. Linpack for data? • There is a simple solution – use Linpack • The core of many data analytics algorithms is often linear algebra and involves full rather than sparse matrices, although – it is not always matrix solvers but rather large matrix multiplication – matrix solution can be done (much faster) with conjugate gradient in the cases I’ve looked at (200 iterations for a matrix size of a million) • Big Data can be dominated by analytics but also by other aspects of the problem such as datastore access and data transport. • We expand “topic of presentation” to “broad based benchmark set” in the spirit of Berkeley Dwarfs, i.e. “capture key features” and “grand challenges” in (academic) Big Data
  • 5. Proposed Spectrum of Benchmarks/Features • Classic Database: TPC benchmarks • NoSQL Data systems: store, index, query (e.g. on Tweets) • Hard core commercial: Web Search, Collaborative Filtering (different structure and defer to Google!) • Streaming: Gather in Pub-Sub (Kafka) + Process (Apache Storm) solution (e.g. gather tweets, Internet of Things) • Pleasingly parallel (Local Analytics): as in initial steps of LHC, Astronomy, Pathology, Bioimaging (differ in type of data analysis) • “Global” Analytics: Deep Learning, SVM, Multidimensional Scaling, Graph Community finding (~Clustering) to Shortest Path (? Shared memory) • Workflow linking above
  • 6. Why? Cover Software Stack: stress different components; combines HPC and Apache; 140 packages but still incomplete
  • 7. Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies Cross-Cutting Functionalities Message Protocols: Thrift, Protobuf Distributed Coordination: Zookeeper, JGroups Security & Privacy: InCommon, OpenStack Keystone, LDAP, Sentry Monitoring: Ambari, Ganglia, Nagios, Inca Workflow-Orchestration: Oozie, ODE, Airavata, OODT (Tools), Pegasus, Kepler, Swift, Taverna, Trident, ActiveBPEL, BioKepler, Galaxy, IPython Application and Analytics: Mahout , MLlib , MLbase, CompLearn, R, Bioconductor, ImageJ, Scalapack, PetSc High level Programming: Hive, HCatalog, Pig, Shark, MRQL, Impala, Sawzall, Drill Basic Programming model and runtime, SPMD, Streaming, MapReduce: Hadoop, Spark, Twister, Stratosphere, Tez, Hama, Storm, S4, Samza, Giraph, Pregel, Pegasus, Reef Inter process communication Collectives, point-to-point, publish-subscribe: Harp, MPI, Netty, ZeroMQ, ActiveMQ, RabbitMQ, QPid, Kafka, Kestrel In-memory databases/caches: GORA (general object from NoSQL), Memcached, Redis (key value), Hazelcast, Ehcache Object-relational mapping: Hibernate, OpenJPA and JDBC Standard Extraction Tools: UIMA, Tika SQL: Oracle, MySQL, Phoenix, SciDB, Apache Derby NoSQL: HBase, Accumulo, Cassandra, Solandra, MongoDB, CouchDB, Lucene, Solr, Berkeley DB, Azure Table, Dynamo, Riak, Voldemort. Neo4J, Yarcdata, Jena, Sesame, AllegroGraph, RYA, Parquet File management: iRODS Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Condor, SGE, OpenPBS, Moab, Slurm, Torque File systems: HDFS, Swift, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Interoperability: Whirr, JClouds, OCCI, CDMI DevOps: Docker, Puppet, Chef, Ansible, Boto, Libcloud, Cobbler, CloudMesh IaaS Management from HPC to hypervisors: OpenStack, OpenNebula, Eucalyptus, CloudStack, vCloud, Amazon, Azure, Google
  • 8. HPC-ABDS Layers 1) Message Protocols 2) Distributed Coordination 3) Security & Privacy 4) Monitoring 5) IaaS Management from HPC to hypervisors 6) DevOps 7) Interoperability 8) File systems 9) Cluster Resource Management 10) Data Transport 11) SQL / NoSQL / File management 12) In-memory databases & caches / Object-relational mapping / Extraction Tools 13) Inter process communication: Collectives, point-to-point, publish-subscribe 14) Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI 15) High level Programming 16) Application and Analytics 17) Workflow-Orchestration. Here are 17 functionalities. Technologies are presented in this order: the 4 cross-cutting ones at the top, then 13 in the order of the layered diagram, starting at the bottom
  • 9. Maybe a Big Data Initiative would include • We don’t need 140 software packages, so we can choose e.g. • Workflow: Python, Pegasus or Kepler • Data Analytics: Mahout, R, ImageJ, Scalapack • High level Programming: Hive, Pig • Parallel Programming model: Hadoop, Spark, Giraph (Twister4Azure, Harp), Storm • Communication: MPI; Kafka or RabbitMQ (Streaming) • In-memory: Memcached • Data Management: HBase, MongoDB, MySQL or Derby • Distributed Coordination: Zookeeper • Cluster Management: Yarn, Slurm • File Systems: HDFS, Lustre • DevOps: Cloudmesh, Chef, Puppet, Docker, Cobbler • IaaS: Amazon, Azure, OpenStack, Libcloud • Monitoring: Inca, Ganglia, Nagios
  • 10. Why? Build on Parallel Computing Experience Benchmarks Instantiate Key Features
  • 11. HPC Benchmark Classics • Linpack or HPL: Parallel LU factorization for solution of linear equations • NPB version 1: Mainly classic HPC solver kernels – MG: Multigrid – CG: Conjugate Gradient – FT: Fast Fourier Transform – IS: Integer sort – EP: Embarrassingly Parallel – BT: Block Tridiagonal – SP: Scalar Pentadiagonal – LU: Lower-Upper symmetric Gauss Seidel
  • 12. 13 Berkeley Dwarfs • Dense Linear Algebra • Sparse Linear Algebra • Spectral Methods • N-Body Methods • Structured Grids • Unstructured Grids • MapReduce • Combinational Logic • Graph Traversal • Dynamic Programming • Backtrack and Branch-and-Bound • Graphical Models • Finite State Machines. The first 6 of these correspond to Colella’s original list. Monte Carlo was dropped. N-body methods are a subset of Particle in Colella. Note that the list is a little inconsistent, in that MapReduce is a programming model and spectral method is a numerical method. No clean solution is likely for Big Data. Need multiple facets!
  • 13. 7 Computational Giants of NRC Massive Data Analysis Report 1) G1: Basic Statistics (see MRStat later) 2) G2: Generalized N-Body Problems 3) G3: Graph-Theoretic Computations 4) G4: Linear Algebraic Computations 5) G5: Optimizations e.g. Linear Programming 6) G6: Integration e.g. LDA and other GML 7) G7: Alignment Problems e.g. BLAST
  • 14. Why? Cover Big Data Application Survey Performed by NIST Big Data Working Group Leads to Ogres covering Big Data Application features. Here we focus on benchmarks that cover the Ogres
  • 15. 51 Detailed Use Cases: Contributed July-September 2013. Covers goals, data features such as 3 V’s, software, hardware. 26 Features for each use case. Biased to science • http://bigdatawg.nist.gov/usecases.php • https://bigdatacoursespring2014.appspot.com/course (Section 5) • Government Operation(4): National Archives and Records Administration, Census Bureau • Commercial(8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS) • Defense(3): Sensors, Image surveillance, Situation Assessment • Healthcare and Life Sciences(10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity • Deep Learning and Social Media(6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets • The Ecosystem for Research(4): Metadata, Collaboration, Language Translation, Light source experiments • Astronomy and Physics(5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle II Accelerator in Japan • Earth, Environmental and Polar Science(10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors • Energy(1): Smart grid
  • 16. Features of 51 Use Cases I • PP (26) Pleasingly Parallel or Map Only • MR (18) Classic MapReduce (add MRStat below for full count) • MRStat (7) Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages • MRIter (23) Iterative MapReduce or MPI (Spark, Twister) • Graph (9) Complex graph data structure needed in analysis • Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal • Streaming (41) Some data comes in incrementally and is processed this way • Classify (30) Classification: divide data into categories • S/Q (12) Index, Search and Query
  • 17. Features of 51 Use Cases II • CF (4) Collaborative Filtering for recommender engines • LML (36) Local Machine Learning (Independent for each parallel entity) • GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS – Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call this EGO, or Exascale Global Optimization, with scalable parallel algorithms • Workflow (51) Universal • GIS (16) Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer etc. • HPC (5) Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data • Agent (2) Simulations of models of data-defined macroscopic entities represented as agents
  • 18. Data Source and Style Facet I • (i) SQL or NoSQL: NoSQL includes Document, Column, Key-value, Graph, Triple store • (ii) Other Enterprise data systems: e.g. Warehouses • (iii) Set of Files: as managed in iRODS and extremely common in scientific research • (iv) File, Object, Block and Data-parallel (HDFS) raw storage: Separated from computing? • (v) Internet of Things: 24 to 50 Billion devices on Internet by 2020 • (vi) Streaming: Incremental update of datasets with new algorithms to achieve real-time response (G7) • (vii) HPC simulations: generate major (visualization) output that often needs to be mined • (viii) Involve GIS: Geographical Information Systems provide attractive access to geospatial data
  • 19. 2. Perform real time analytics on data source streams and notify users when specified events occur [Diagram labels: Streaming Data (three sources); Fetch streamed Data; Filter Identifying Events; Specify filter; Post Selected Events; Posted Data; Identified Events; Archive; Repository. Technologies: Storm, Kafka, Hbase, Zookeeper]
  • 20. 5. Perform interactive analytics on data in analytics-optimized data system [Diagram labels: Hadoop, Spark, Giraph, Pig …; Data Storage: HDFS, Hbase; Data, Streaming, Batch …; Mahout, R]
  • 21. Data Source and Style Facet II • Before data gets to compute system, there is often an initial data gathering phase which is characterized by a block size and timing. Block size varies from month (Remote Sensing, Seismic) to day (genomic) to seconds or lower (Real time control, streaming) • There are storage/compute system styles: Shared, Dedicated, Permanent, Transient • Other characteristics are needed for permanent auxiliary/comparison datasets and these could be interdisciplinary, implying nontrivial data movement/replication • 10 Data Access/Use Styles from Bob Marcus at NIST (you have seen his patterns 2 and 5 and my extension for science 5A follows)
  • 22. 5A. Perform interactive analytics on observational scientific data [Diagram labels: Record Scientific Data in “field”; Local Accumulate and initial computing; Direct Transfer; Transport batch of data to primary analysis data system; Data Storage: HDFS, Hbase, File Collection (Lustre); Grid or Many Task Software, Hadoop, Spark, Giraph, Pig …; Science Analysis Code, Mahout, R; Streaming Twitter data for Social Networking] NIST Examples include LHC, Remote Sensing, Astronomy and Bioinformatics
  • 23. Why? Typical Big Data Analytics See Mahout, MLLib, R, usage in application survey
  • 24. Core Analytics I • Map-Only • Pleasingly parallel - Local Machine Learning • MapReduce: Search/Query/Index • Summarizing statistics as in LHC Data analysis (histograms) (G1) • Recommender Systems (Collaborative Filtering) • Linear Classifiers (Bayes, Random Forests) • Alignment and Streaming (G7) • Genomic Alignment, Incremental Classifiers • Global Analytics: Nonlinear Solvers (structure depends on objective function) (G5,G6) – Stochastic Gradient Descent SGD – (L-)BFGS approximation to Newton’s Method – Levenberg-Marquardt solver
  • 25. Core Analytics II • Global Analytics: Map-Collective (See Mahout, MLlib) (G2,G4,G6) • Often use matrix-matrix and matrix-vector operations, solvers (conjugate gradient) • Clustering (many methods), Mixture Models, LDA (Latent Dirichlet Allocation), PLSI (Probabilistic Latent Semantic Indexing) • SVM and Logistic Regression • Outlier Detection (several approaches) • PageRank (find leading eigenvector of sparse matrix) • SVD (Singular Value Decomposition) • MDS (Multidimensional Scaling) • Learning Neural Networks (Deep Learning) • Hidden Markov Models
  • 26. Core Analytics III • Global Analytics – Map-Communication (targets for Giraph) (G3) • Graph Structure (Communities, subgraphs/motifs, diameter, maximal cliques, connected components) • Network Dynamics - Graph simulation Algorithms (epidemiology) • Global Analytics – Asynchronous Shared Memory (may be distributed algorithms) • Graph Structure (Betweenness centrality, shortest path) (G3) • Linear/Quadratic Programming, Combinatorial Optimization, Branch and Bound (G5)
  • 27. Proposed Spectrum of Benchmarks/Features • Classic Database: TPC benchmarks • NoSQL Data systems: store, index, query (e.g. on Tweets) • Hard core commercial: Web Search, Collaborative Filtering (different structure and defer to Google!) • Streaming: Gather in Pub-Sub(Kafka) + Process (Apache Storm) solution (e.g. gather tweets, Internet of Things) • Pleasingly parallel (Local Analytics): as in initial steps of LHC, Astronomy, Pathology, Bioimaging (differ in type of data analysis) • “Global” Analytics: Deep Learning, SVM, Multidimensional Scaling, Graph Community finding (~Clustering) to Shortest Path (? Shared memory) • Workflow linking above
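
Slide 4 remarks that matrix solution can be done with conjugate gradient rather than a direct (Linpack-style) factorization. A minimal sketch of that iteration follows, assuming a symmetric positive-definite matrix stored densely as a list of rows. The function names, the small test matrix, and the tolerance are illustrative assumptions and are not taken from the talk.

```python
def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive-definite A.

    Returns the approximate solution and the number of iterations used.
    Assumes b is not the zero vector.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for iteration in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:
            return x, iteration
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


# Hypothetical test problem: diagonally dominant, hence positive definite.
n = 50
A = [[3.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x, iters = conjugate_gradient(A, b)
residual = max(abs(ri - bi) for ri, bi in zip(matvec(A, x), b))
```

Each iteration costs one matrix-vector product, so the total work is proportional to the iteration count times the square of the matrix size for a full matrix, which is the reason a few hundred iterations can beat a cubic-cost factorization on large problems.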

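Slide 25 describes PageRank as finding the leading eigenvector of a sparse matrix. One common way to do that is power iteration over an adjacency list, sketched below. The damping factor of 0.85, the handling of nodes without outgoing links, and the four-node graph are assumptions made for illustration. The talk does not prescribe them.

```python
def pagerank(links, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank.

    links maps each node to the list of nodes it links to; every target
    must also appear as a key. Only the nonzero entries of the link
    matrix are stored, so each sweep costs time proportional to the
    number of links.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node with no outgoing links spreads its rank uniformly.
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical four-page graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

The inner loop is a sparse matrix-vector product followed by a global sum, which is the Map-Collective pattern the slide assigns to this group of algorithms.
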
Editor's Notes

  1. Big data dwarfs are Ogres. Implement Ogres in HPC-ABDS.
  2. Big data dwarfs are Ogres. Implement Ogres in ABDS+.
  3. BFGS: Broyden–Fletcher–Goldfarb–Shanno algorithm; L = limited memory.