Spark on the ARC
Big data analytics frameworks on an HPC cluster (PEARC17)
Mark E. DeYoung
Himanshu Bedi
Mohammed Salman
Dr. David Raymond
Dr. Joseph G. Tront
Agenda
• Motivation
• HPC vs Hadoop
• Spark on the ARC
• VT ARC
• Current Work
• Implementation Details
• Results
• Future Work
Log Archiving and Analysis (LAARG)
• Ingestion of logs into Elasticsearch
• Setting up infrastructure for big data analytics
• Machine learning algorithms on network logs
Network Security Data Analytics
• Application of data science approaches to the problem domain of network security
• Solution domains:
  • Data Mining (DM)
  • Machine Learning (ML)
  • Natural Language Processing (NLP)
• ML requires an underlying analytic infrastructure capable of efficiently applying ML algorithms (fine-grained parallelism) to big data (coarse-grained parallelism)
Parallelism & Data Separability
• Parallelism – doing lots of things at the same time:
  • Granularity:
    • Fine – large number of small tasks; requires low-latency communication and frequent coordination
    • Coarse – small number of large tasks; requires high-throughput communication and less frequent coordination
  • Levels: Instruction (ILP), Task (TLP), Data (DLP), Transaction
• The ability to achieve parallelism is shaped by properties of the data (and of the algorithms used to process it).
• Data separability – data co-dependency determines the ability to segment data:
  • Uniform – identically sized segments
  • Modular – a priori segmentation according to some extrinsic knowledge about the data
  • Arbitrary – arbitrarily sized segments
• Generalizations:
  • Machine Learning – fine grained; data has low temporal and/or spatial separability
  • Big Data – coarse grained; data has high temporal and/or spatial separability
  • Machine Learning from Big Data – we need both fine- and coarse-grained parallelism to process uniform and modular data
HPC vs Hadoop

Point of Interest           | HPC                               | Hadoop
Data storage & processing   | Centralized storage               | Data distributed across nodes
Hardware infrastructure     | High-end computing components     | Mostly commodity hardware
File system                 | Lustre                            | HDFS
Cluster resource management | SLURM/TORQUE                      | YARN/Mesos
Programming languages       | Compiled languages such as C/C++  | Higher-level languages such as Java/Scala/Python
Typical applications        | Scientific research               | Commercial analytics use cases
What is Spark?
• In-memory data processing engine
• Interface for programming clusters with data parallelism
• Facilitates iterative algorithms (good for ML) and interactive exploratory data analysis (EDA)
• Source – www.databricks.com
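As a quick, hedged illustration (not from the original slides), the SparkPi example that ships with every Spark distribution can be run on a single node; $SPARK_HOME is assumed to point at an unpacked Spark install:

    # Sanity check: run the bundled SparkPi example on 4 local cores.
    # $SPARK_HOME is an assumption: the root of an unpacked Spark distribution.
    $SPARK_HOME/bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master "local[4]" \
      $SPARK_HOME/examples/jars/spark-examples_*.jar 100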
Spark on the ARC
• The ARC environment can support both the fine- and coarse-grained parallelism necessary for machine learning from big data.
• Spark provides a framework to orchestrate algorithm execution and distributed data management (programming model).
• Pros:
  • No need for dedicated hardware, and no need to administer that hardware
  • Deployment is fairly straightforward – script based at the moment (a minimal sketch follows this list)
  • Spark (unlike MPI) provides a distributed data model – Resilient Distributed Datasets (RDDs), Datasets, and DataFrames
• Cons:
  • ARC is batch oriented – not appropriate for long-running services, interactive work, or incremental/streaming tasks
  • Shared resource – might have to wait for a specific compute node type
  • Loss of control – uptime and maintenance actions are controlled by ARC
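A minimal sketch of such a deployment script appears below. It is illustrative, not the authors' actual script: the resource request, walltime, Spark version, and application name are all assumptions that would vary per cluster.

    #!/bin/bash
    #PBS -l nodes=3:ppn=24        # hypothetical request: 3 nodes, 24 cores each
    #PBS -l walltime=01:00:00

    # Assumed location of an unpacked Spark 2.x distribution.
    export SPARK_HOME=$HOME/spark-2.1.0-bin-hadoop2.7

    # Assumes entries in $PBS_NODEFILE match the output of hostname.
    MASTER_HOST=$(hostname)

    # Start the standalone master on the head node of the allocation.
    $SPARK_HOME/sbin/start-master.sh

    # Start a worker daemon on every other node in the allocation.
    for node in $(sort -u "$PBS_NODEFILE" | grep -v "$MASTER_HOST"); do
      ssh "$node" "$SPARK_HOME/sbin/start-slave.sh spark://$MASTER_HOST:7077"
    done

    # Run the application (myapp.py is a placeholder), then tear down.
    $SPARK_HOME/bin/spark-submit --master spark://$MASTER_HOST:7077 $HOME/myapp.py
    for node in $(sort -u "$PBS_NODEFILE" | grep -v "$MASTER_HOST"); do
      ssh "$node" "$SPARK_HOME/sbin/stop-slave.sh"
    done
    $SPARK_HOME/sbin/stop-master.sh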
VT ARC
• Set of HPC clusters: NewRiver, Cascades, DragonsTooth
• Several compute node configurations:
  • Processing: multi-core, multi-CPU, some with GPUs
  • Storage:
    • Node local – HDD, SSD, NVMe, memory
    • Centralized storage on IBM General Parallel File System (GPFS)
  • Interconnect: low-latency 100 Gbps (MPI), throughput-oriented 10 Gbps (data movement)
• Resource management / scheduling:
  • TORQUE Resource Manager – modified open-source Portable Batch System (PBS)
  • Moab Cluster Workload Management – billing for allocations
• Environment configuration: Lmod environment modules system (example below)
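For example, a job script might pull in its toolchain with Lmod like this; the module names and versions here are illustrative, and the actual ones on each cluster will differ:

    module avail spark           # discover what the cluster provides
    module load jdk/1.8.0        # hypothetical module names and versions
    module load spark/2.1.0
    module list                  # confirm the loaded environment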
Current Work
• Developed and evaluated deployment models of the Apache Hadoop and Spark frameworks on existing batch-oriented HPC clusters.
• Created a framework to automate the creation of deployment variations and monitor the execution of evaluation iterations, accommodating dynamic resource allocations (a simplified sketch follows).
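The framework itself is not reproduced here; a simplified sketch of the idea – generate deployment variations and submit one batch job per variation – might look like the following, where run_benchmark.sh is a hypothetical job script that reads the exported variable:

    #!/bin/bash
    # Sweep over cluster sizes and scheduler modes, one batch job each.
    for nodes in 2 4 8; do
      for mode in standalone yarn; do
        qsub -l nodes=${nodes}:ppn=24 \
             -v SPARK_MODE=${mode} \
             -N spark-${mode}-${nodes}n \
             run_benchmark.sh
      done
    done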
Batch Job with 3 compute nodes
• Figure 1 shows the Hadoop NameNode (NN) and YARN ResourceManager (RM) service daemons running on the head node.
• Figure 2 shows the Hadoop DataNode (DN) and YARN NodeManager (NM) running on each of the worker compute nodes allocated to the job (a sketch of starting these daemons appears below).
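A sketch of how these roles might be started from inside a batch job is shown below. It assumes a Hadoop 2.x layout under $HADOOP_HOME with per-job configuration in $HADOOP_CONF_DIR, and that HDFS has already been formatted; it is not the authors' exact script:

    # Head node: HDFS NameNode and YARN ResourceManager.
    $HADOOP_HOME/sbin/hadoop-daemon.sh --config "$HADOOP_CONF_DIR" start namenode
    $HADOOP_HOME/sbin/yarn-daemon.sh --config "$HADOOP_CONF_DIR" start resourcemanager

    # Worker nodes: HDFS DataNode and YARN NodeManager on each.
    for node in $(sort -u "$PBS_NODEFILE" | grep -v "$(hostname)"); do
      ssh "$node" "$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode"
      ssh "$node" "$HADOOP_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager"
    done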
Implementation Workflow
Evaluation
• Evaluation was carried out on two clusters maintained by VT ARC, namely Cascades and NewRiver.
• A dynamic Spark and Hadoop cluster is instantiated, and scheduling is carried out both in standalone mode and with YARN (see the spark-submit examples below).
• Two benchmark suites – Spark-Bench and HiBench – were run to test the Spark and Hadoop configurations.
• Collected telemetry data from the telemetry framework provided by VT ARC as part of the TORQUE/Moab installation; this data includes queuing delay, time to completion, CPU utilization, and memory consumption.
• Investigated the effects of horizontal versus vertical scaling by comparing resource utilization in each case.
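The two scheduling modes differ mainly in the --master argument passed to spark-submit, as in the sketch below ($MASTER_HOST, the resource sizes, and myapp.py are placeholders):

    # Standalone mode: Spark's own master schedules the executors.
    spark-submit --master spark://$MASTER_HOST:7077 \
      --total-executor-cores 48 --executor-memory 4g myapp.py

    # YARN mode: the per-job YARN ResourceManager allocates containers.
    spark-submit --master yarn --deploy-mode client \
      --num-executors 8 --executor-memory 4g myapp.py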
Results
Future Work
• Examine the overhead incurred in allocating resources from the HPC scheduler.
• Evaluate the impact of user contention when compute nodes are shared between users.
• Run realistic, real-world workloads, primarily machine learning algorithms on network logs collected from access points distributed across the campus.
• Analyze the performance of this framework for streaming data.
THANK YOU
