Presenters: Amy Apon, Linh Ngo and Michael Payne, Clemson University
Big Data research at Clemson University includes the development of computational science applications, evaluation of software and hardware infrastructure systems to support Big Data, and design and development of data analytics platforms. HPCC Systems has been designed to support production quality data storage and analytics in an enterprise environment. In this talk we describe how we are using the HPCC Systems platform in an academic research environment to support research in Big Data.
Like many major research universities, Clemson University hosts a shared high performance supercomputing cluster. The Palmetto cluster at Clemson ranks in the top five among university-owned supercomputers in the U.S. and is a significant research resource for faculty, students, and staff. As a shared resource, computing nodes are allocated to users via a batch scheduler system, and users do not have administrative privileges. In this talk we describe how we use HPCC Systems on the shared campus resource using user privileges only. This work opens doors for use of HPCC Systems to a broad community of academic users who have access to shared computing resources.
HPCC Systems is a key tool in several analytics projects at Clemson University. In this talk we describe how we use the HPCC Systems platform in our study of the research productivity of academic institutions. This research requires the use of detailed data over decades of time from various sources and formats without any consistency or a predefined data structure. LexisNexis and Reed Elsevier are key partners in the acquisition of data for this research. The strengths of HPCC Systems include its ability to support easy ingestion of various data formats, dynamic transformation and integration of ingested data, and a seamless interface between the data processing interface and the analytic tools written in R.
NOTE: The video of this presentation is the 2nd one shown in the accompanying YouTube link.
Designing IA for AI - Information Architecture Conference 2024
HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University
1. Applications of HPCC Systems at Clemson University Amy Apon, PhD ● Linh Ngo, PhD ● Michael Payne Big Data Systems Laboratory Clemson University
2. Applications of HPCC Systems at Clemson University Clemson Strengths and Opportunities
PhD-level faculty & research staff Talented students Significant industry collaborators
Palmetto – Top 5 in US Academic Supercomputers ~2000 nodes, 20K cores, 600 GPUs 100Gb Internet connectivity
Facilities
People
3. Applications of HPCC Systems at Clemson University Big Data Systems Lab Overview
Perform World Class Research on the Systems and Enabling Information Technology for Advanced Data Analytics
Big Data Systems Lab Research Areas
Systems and Architectures
Tools and Operations
Data Analytics and Applications
Big Data Systems Lab Vision
4. Applications of HPCC Systems at Clemson University Effect of High Performance Computing on Academic Research Productivity
Motivation: There is a lot of pressure on federal funding
We propose efficiency as a measure from which to gain insights on return on investment
We show that locally- available HPC has a positive effect on the ability of a university to do research
5. Motivation: Government and business need information about public sentiment. Research: We develop and apply methods to analyze large amounts of textual data to enable inquiry of social and business problems.
Applications of HPCC Systems at Clemson University Text mining of news reports and social media for business intelligence
6. Shared Execution Environment
Temporary Local Storage
User Privileges Only
Applications of HPCC Systems at Clemson University Shared Computing Resources among Researchers
7. Applications of HPCC Systems at Clemson University
Linh Ngo, PhD
HPCC Systems in a Shared Research Computing Environment
8. Shared Execution Environment
Temporary Local Storage
User Privileges Only
Applications of HPCC Systems at Clemson University Shared Computing Resources among Researchers
How to provision and configure an HPCC cluster dynamically for research purposes?
•
Step 1: Configure, install, and deploy HPCC as a non-root user
•
Step 2: Dynamically provision HPCC cluster in a shared research environment
9. binutils ICU XALAN APR …
/
/usr
/lib64
/lib
…
/opt
/
/home
$USER
…
/parallel_scratch
/local_scratch
Applications of HPCC Systems at Clemson University Installation and Configuration of Dependencies
10. Administrative privileges
Non-administrative privileges
…
/
etc
init.d
HPCCSystems
opt
HPCCSystems
var
lib
HPCCSystems
log
HPCCSystems
configmgr
mydafilesrv
mydafilesrv
…
/
home
$USER
hpcc
local_scratch
hpcc
parallel_scratch
lib
HPCCSystems
log
$USER
lock
pid
$USER
Applications of HPCC Systems at Clemson University Resolving Non-default Installation Path Conflicts
11. Remove/relax root-level settings:
i.e.: is_root
Reduce default configuration settings for resource requirements:
depended on resource allocation requests
Applications of HPCC Systems at Clemson University Non-root Deployment
12. mydafilesrv mydafile myeclc mythor myroxie …
PBS_NODEFILE
environment.xml
1
2
3
4
5
user.palmetto.clemson.edu
Applications of HPCC Systems at Clemson University
Dynamic Provisioning
Deploy to /local_scratch or /parallel_scratch?
13. Applications of HPCC Systems at Clemson University
Michael Payne Using HPCC Systems to Manage Academic Data LexisNexis Summer 2014 Internship
14. •
Research in Scholarly Data requires academic data from many different sources, which store data under various formats
•
Aggregating these sources into a useful and cohesive structure requires a data-intensive approach to preprocessing, integration and analysis
•
HPCC Systems is a platform to streamline this process
Applications of HPCC Systems at Clemson University
Using HPCC Systems to Manage Academic Data
15. Higher Education Institutions
Research
High Performance Computing Capability
Funding Support
Applications of HPCC Systems at Clemson University Categories of Scholarly Data
16. Higher Education Institutions
Research
High Performance Computing Capability
Funding Support
Detailed Award (XML) Federal Funding (tab- delimited) Expenditures (tab-delimited) Institution Patent (Excel) Degrees Conferred (Excel)
NIH Award Data (CSV/Excel)
Top500 Supercomputer affiliated with academic institutions (XML)
Institutions from States with EPSCoR status (tab-delimited)
Institutional Information with Carnegie’s Research Classifications (multi-sheet Excel)
Enrollment (multi-sheet Excel) Financial (multi-sheet Excel) Faculty (multi-sheet Excel)
Detailed list of articles by discipline with abstracts, references, and disambiguated authors. (XML)
Applications of HPCC Systems at Clemson University Scholarly Data Description
17. Institution name/email
Name similarity Address
PI/Author name Institution name
Name similarity
Match name with WoS ‘s Organization-Enhanced name
Acknowledgment attributes (2008 on ward, automatic)
Applications of HPCC Systems at Clemson University Examples of Scholarly Data Links
18. •
Porting data analytic processes to ECL
•
Applying Machine Learning techniques for article abstract classification
Applications of HPCC Systems at Clemson University
Ongoing Work
19. LexisNexis Internship Machine Learning Manager Timothy Humphrey Mentor Arjuna Chala
Applications of HPCC Systems at Clemson University Summer 2014 Internship - Logistic Regression for Dense Matrices
20. •
Prediction using continuous and discrete values
•
No distributional assumptions on the predictors
•
May not be normally distributed or linearly related
•
Relationship between the discrete variable and the predictor is non-linear
Applications of HPCC Systems at Clemson University Logistic Regression
21. Matrices can be partitioned Schemes must be compatible There are multiple choices!
X
=
4 x 4
4 x 1
4 x 1
2 x 3
X
=
2 x 1
1 x 1
2 x 1
3 x 1
Applications of HPCC Systems at Clemson University Parallel Block Basic Linear Algebra Subprograms (PB-BLAS)
22. •
Logistic Runtimes
•
Hard Coded Mapping
•
Full Higgs Dataset 11,000,000 x 28
0
5
10
15
20
25
30
35
Higgs 1,000
Higgs 10,000
Higgs 100,000
Time in Minutes
PB-BLAS
Non PB-BLAS
Applications of HPCC Systems at Clemson University Machine Learning in ECL
23. •
Logistic Runtimes
•
Auto Mapping
•
Full Elsevier Dataset 100,000 x 3,291
0
5
10
15
20
25
Elsevier 100
Time in Minutes
PB-BLAS
Non PB-BLAS
Applications of HPCC Systems at Clemson University Machine Learning in ECL
24. 0
50
100
150
200
250
300
350
400
450
Elsevier 1,000
Time in Minutes
PB-BLAS
Non PB-BLAS
Applications of HPCC Systems at Clemson University Machine Learning in ECL
•
Logistic Runtimes
•
Auto Mapping
•
Full Elsevier Dataset 100,000 x 3,291
25. •
Logistic Regression code and supporting functions have been documented and merged to ECL-ML GitHub repository
•
Auto block vector mapping function for any user that wants to use PB-BLAS
•
Ready to use element wise multiplication in PB-BLAS
•
Updated debugging statements that a clear understanding of errors
•
Test functions for both block vector mapping function
•
Sample code for using logistic regression
•
Currently working on K-means implementation that utilizes PB-BLAS
Applications of HPCC Systems at Clemson University
Project Summary
26. Linh Ngo, PhD ● Alex Herzog, PhD ● Michael Payne ● Amy Apon, PhD {lngo, aherzog, mpayne3, aapon}@clemson.edu Big Data Systems Laboratory Clemson University