This document summarizes the work distribution for a team of 3 PhD students on a social media project. It outlines 18 tasks performed in chronological order, covering data collection, cleaning, modeling, and analysis. It notes that the team worked collaboratively on calls to discuss methods and develop code, and divided large datasets and files among the members to complete processing and analysis in parallel, since the datasets were too large for one person. The summary concludes that it is difficult to attribute specific tasks to individuals, since the work was distributed and all members contributed equally to each task.
Slides from 'Stay Calm & Keep Current': how to filter machine-learning-related academic papers, and an introduction to our open-source project for this purpose.
An Interactive Visual Analytics Dashboard for the Employment Situation Report, by Benjamin Bengfort
The Employment Situation Report is a monthly news release by the Bureau of Labor Statistics which describes the results of the Current Population Survey. Its release is widely anticipated by economists, journalists, and politicians, as it is used to forecast the economic condition of the United States by describing ongoing trends, and it has a broad impact on public and corporate economic confidence, leading directly to investment decisions. The report itself is a PDF composed primarily of text and tabular information. Quickly and correctly interpreting the results of the jobs report is vital for quality reporting and decision making, but the report is better suited to longer study than to deriving rapid insights. In this project we explore the use of an interactive dashboard for visual analytics on the released BLS data. Using an application demonstration and a usability study, we will show that by visually interacting with the most current employment data, users can rapidly achieve rich insights similar to those reported in the text of the Employment Situation Report.
A talk presented to the US Networking and Information Technology Research and Development (NITRD) Program's High End Computing Interagency Working Group, 16 January 2020
The Materials Data Facility: A Distributed Model for the Materials Data Commu..., by Ben Blaiszik
Presentation given at the UIUC Workshop on Materials Computation: data science and multiscale modeling. Materials Data Facility data publication, discovery, Globus, and associated python and REST interfaces are discussed. Video available soon.
When developing topic maps and their applications, key challenges are how to pick out the main subjects in the targeted domains and how to systematize those subjects. This paper introduces a topic map developed from topic map case examples. It also describes what kinds of subjects were extracted, how the identifiers of those subjects were assigned, and how those subjects were classified in the first version. The difficulties that emerged during development are then discussed. To promote sharing of the case examples and make good use of them, I provide some considerations and outline future work.
Big Data HPC Convergence and a bunch of other things, by Geoffrey Fox
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
If you want this project, entitled "JPS-School Management System":
Contact - Sarthak Khabiya
Email - sarthakkhabiya@gmail.com
Contact Number - +91-8717912597
Research data spring: streamlining deposit, by Jisc RDM
The research data spring project "Streamlining deposit: an OJS to repository plugin" slides for the third sandpit workshop. Project led by Ernesto Priego of City University London.
Working towards Sustainable Software for Science (an NSF and community view), by Daniel S. Katz
This talk looks at the goal of sustainable scientific software from the point-of-view of an NSF program officer who funds software as infrastructure, meaning software that enables a community beyond the developers to perform research, and from the point-of-view of the attendees of the First Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE1, http://wssspe.researchcomputing.org.uk/wssspe1/). Issues to be discussed include what sustainability means, funding, incentives, career paths, and communities.
Vade Meccum_Book reading and publishing software NetBeans.docx, by GeetaShreeprabha
This is an Informatics Practices project submitted for the AISSCE IP exam in 2017-18. Using Java and MySQL, an application was created in NetBeans for book reading and publishing. The purpose of this project was to create a repository of books and a cloud storage account for written content which can be accessed at a later stage.
Analyzing Big Data's Weakest Link (hint: it might be you), by HPCC Systems
Tim Menzies, NC State University, presents at the 2015 HPCC Systems Engineering Summit Community Day.
For Big Data applications, there is a lack of any gold standard for "good analysis" or methods to assess our certification programs. Hence, we are still in the dark about whether or not our human analysts are making the best possible use of the tools of Big Data. While much progress has been made in the systems aspects of Big Data, certain critical human-centered aspects remain an open issue. Regardless of the sophistication of the analysis tools and environment, all that architecture can still be used incorrectly by users. If this issue were confined to a small number of inexperienced users, it could be addressed via process improvements such as better training. But is it? What do we know about our analysts? Where are the studies that mine the people doing the data mining?
This presentation offers some preliminary results on tools that combine ECL with other methods that recognize the code generated by experienced or inexperienced developers. While the results are preliminary, they do raise the possibility that we can better characterize what it means to be experienced (or inexperienced) at Big Data applications.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
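As an illustration of the automated data validation idea above, here is a minimal sketch. It assumes pandas is available; the column names and the particular checks are hypothetical, not a prescribed implementation:

```python
import pandas as pd

def validate(df, required_columns, non_negative_columns):
    """Run basic automated data-quality checks; return a list of issues found."""
    issues = []
    for col in required_columns:
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif df[col].isna().any():
            issues.append(f"null values in column: {col}")
    for col in non_negative_columns:
        if col in df.columns and (df[col] < 0).any():
            issues.append(f"negative values in column: {col}")
    return issues

# Hypothetical records with two quality problems: a null region and a negative revenue.
records = pd.DataFrame({"region": ["NE", "SW", None], "revenue": [120.0, -5.0, 80.0]})
print(validate(records, ["region", "revenue"], ["revenue"]))
```

Checks like these can run at ingestion time so errors are rejected at the source rather than surfacing downstream.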
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computation and thus can also reduce iteration time. Road networks often have chains that can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
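The skip-converged optimization above can be sketched as follows. This is a minimal illustrative Python sketch, not the STICD implementation: it assumes the graph is a dict of out-neighbour lists with no dangling vertices, and freezing a vertex's rank once it stops changing is an approximation:

```python
def pagerank_skip_converged(graph, alpha=0.85, eps=1e-10, iters=100):
    """PageRank that stops recomputing vertices whose rank has converged.

    graph: dict mapping each vertex to its list of out-neighbours
           (every vertex must have at least one out-edge).
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    converged = set()
    for _ in range(iters):
        # Accumulate rank contributions along out-edges.
        incoming = {v: 0.0 for v in graph}
        for u, outs in graph.items():
            share = rank[u] / len(outs)
            for v in outs:
                incoming[v] += share
        changed = False
        for v in graph:
            if v in converged:
                continue  # skip work on vertices that have already converged
            new_rank = (1 - alpha) / n + alpha * incoming[v]
            if abs(new_rank - rank[v]) < eps:
                converged.add(v)
            else:
                changed = True
            rank[v] = new_rank
        if not changed:
            break
    return rank

# On a 3-cycle every vertex ends up with rank 1/3; the ranks sum to ~1.0.
ranks = pagerank_skip_converged({0: [1], 1: [2], 2: [0]})
print(ranks)
```

In a production version the per-vertex update loop is where most of the savings from pruning (converged, in-identical, chain, and dead components) would apply.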
Adjusting primitives for graph : SHORT REPORT / NOTES, by Subhajit Sahu
Graph algorithms like PageRank commonly operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation that stores the graph compactly in two flat arrays: per-vertex offsets and edge destinations.
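A minimal sketch of building a CSR structure from an edge list (illustrative Python, not the C++/CUDA code these notes benchmark):

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays: the out-neighbours of vertex v are
    targets[offsets[v]:offsets[v + 1]]."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    # Prefix-sum the degrees to get per-vertex offsets.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    # Scatter edge destinations into place.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next write position per source vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

offsets, targets = build_csr(3, [(0, 1), (0, 2), (2, 0)])
print(offsets, targets)  # -> [0, 2, 2, 3] [1, 2, 0]
```

The two flat arrays make neighbour iteration cache-friendly, which is why CSR is the usual input format for both the OpenMP and CUDA kernels compared below.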
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Appendix
Social Media (IS 735)
Fall 2020
Department of Information Systems
New Jersey Institute of Technology
Work Distribution
December 15, 2020
Team
We are a team of 3 PhD students as follows:
1. Kaustav Bhattacharjee (kb526@njit.edu)
2. Sahaj Vaidya (ssv47@njit.edu)
3. Soumyadeep Basu (sb2356@njit.edu)
Tasks
The following tasks were performed in chronological order:
1. Searched for Problem Statement
2. Developed the Project Proposal
3. Identified the datasets and expert users
4. Collected the datasets
5. Explored the datasets and cleaned the data
6. Collected missing data attributes
7. Developed the Data Collection Report
8. Implemented ElasticSearch in order to ingest the data (discontinued later)
9. Performed Topic Modelling using LDA on the expert dataset
10. Performed Subjectivity Analysis on the expert dataset
11. Developed Project Intermediate Report
12. Performed Topic Modelling using NMF on the expert dataset
13. Performed Sentiment Analysis using BERT on the expert and non-expert datasets
14. Performed Subjectivity Analysis on the non-expert dataset (after feedback from Intermediate Report)
15. Performed Sentiment Analysis using Flair on the expert and non-expert datasets
16. Performed Topic Modelling using LDA on the non-expert dataset
17. Performed Topic Modelling using NMF on the non-expert dataset
18. Developed the Project Final Report on Medium
Summary
We performed the above-mentioned tasks together as a team. We would get on a call, discuss the methods to be used, and develop the code/algorithms together. Since we had to deal with a large number of files (e.g., 600 files of 6,000 records each for one of the data collection tasks), we divided the files into equal subsets (e.g., 200 files per team member) and distributed them among ourselves to be processed on our individual computers. Hence, it would be difficult to pinpoint exactly what each person did, since all team members contributed equally to each task and to making the project a success.
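The file-splitting scheme described above can be sketched in a few lines of Python; the file names here are hypothetical:

```python
def split_among(files, num_members):
    """Assign files round-robin so each member gets a near-equal share."""
    return [files[i::num_members] for i in range(num_members)]

# 600 files divided among 3 team members -> 200 files each.
files = [f"batch_{i:03d}.json" for i in range(600)]
shares = split_among(files, 3)
print([len(s) for s in shares])  # -> [200, 200, 200]
```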
✦✦✦