The document describes a Big Data project analyzing the Stack Overflow dataset. Key points:
- The dataset was obtained from the Stack Exchange data dump and consists of XML files totaling around 20 GB that were parsed and loaded into HDFS.
- The data was analyzed to identify trending questions, unanswered questions, closed questions, dead questions, top tags, and more. A PageRank analysis was also performed to rank posts.
- The results were visualized in a web application built on Hive, HBase, Pig, and a Mahout recommender. Performance issues were encountered during the MapReduce parsing of the large XML files.
Details regarding the workings of ChatGPT and basic use cases can be found in this presentation. The presentation also covers other OpenAI products and their usability, as well as ways in which ChatGPT can be implemented in existing apps and websites.
Inheritance in Java is a mechanism by which one object acquires all the properties and behaviors of a parent object. The idea behind inheritance in Java is that you can create new classes that are built upon existing classes.
This presentation walks through the essential points of developing and working with REST APIs, or web services that communicate across various platforms. It also explains HTTP methods.
In this C# REST tutorial, beginners first learn what a C# REST API is and what the HTTP verbs in a C# REST API are. Learn about HTTP status codes and become acquainted with the constraints of the C# REST API. After that, for a better learning experience, we will see a practical demonstration of a C# REST API. Finally, we wind up the session with a few takeaways on the C# REST API.
Slides from our CodeMash 2013 Precompiler session, "Web Development with Python and Django", including a breezy introduction to the Python programming language and the Django web framework. The example code repository is available at https://github.com/finiteloopsoftware/django-precompiler/
Polymorphism is the ability of an object to take more than one form. It is one of the important concepts of object-oriented programming. Java is an object-oriented programming language that supports polymorphism.
Alternatively, it is defined as the ability of a reference variable to change behavior according to the object instance it is holding.
Project report for Twitter sentiment analysis done using Apache Flume, with the data analyzed using Hive.
I intend to address the following questions:
How can raw tweets be used to find an audience's perception of, or sentiment about, a person?
How can Hadoop be used to solve this problem?
How can Apache Hive be used to organize the final data in tabular format and query it?
How can a data visualization tool be used to display the findings?
GENERATIVE AI, THE FUTURE OF PRODUCTIVITY (Andre Muscat)
Discuss the impact and opportunity of using Generative AI to support your development and creative teams
* Explore business challenges in content creation
* Cost-per-unit of different types of content
* Use AI to reduce cost-per-unit
* New partnerships being formed that will have a material impact on the way we search and engage with content
Part 4 of a 9-part research series, "What matters in AI", published on www.andremuscat.com
For this plenary talk at the Charlotte AI Institute for Smarter Learning, Dr. Cori Faklaris introduces her fellow college educators to the exciting world of generative AI tools. She gives a high-level overview of the generative AI landscape and how these tools use machine learning algorithms to generate creative content such as music, art, and text. She then shares some examples of generative AI tools and demonstrates how she has used some of these tools to enhance teaching and learning in the classroom and to boost her productivity in other areas of academic life.
The super keyword is a reference variable that is used to refer to the parent class object. In Java, super is used at three levels: the variable level, the method level, and the constructor level.
- Study the architecture and design
- Compare Old & New Technology stack
- Analyze evolution of architecture and scalability
- Lessons learned over time
Trend analysis of Stack Overflow post and user data. Predicting time to answer (classification) using Weka. CSCI 599 final project on social media data analytics.
In this talk we briefly discuss some of our recent studies of Stack Overflow, a popular Q&A site targeting software developers. As opposed to studies of software artefacts discussed on Stack Overflow (e.g., APIs or programming examples), we focus on studying the individuals active on Stack Overflow: who are they, what motivates them, and what affects their participation in Stack Overflow discussions.
Our findings indicate that Stack Overflow is no different from other communities of software developers in terms of gender representation, but is significantly different from them in terms of gender engagement: controlling for engagement duration, women and men ask and answer comparable numbers of questions, but women disengage faster. We conjecture that the faster disengagement of women is the less pretty consequence of the gamification mechanisms embedded in Stack Overflow, the same gamification mechanisms that provide developers with faster answers than ever before, attract numerous contributors, and ultimately catalyse software development.
As an additional contribution we present genderComputer, a tool that infers the gender of an individual based on her/his name and location.
The talk is based on the following papers:
* Gender, representation and online participation: A quantitative study, Vasilescu, B., Capiluppi, A. and Serebrenik, A., Interacting with Computers. 2013, Oxford University Press.
* How social Q&A sites are changing knowledge sharing in open source software communities, Vasilescu, B., Serebrenik, A., Devanbu, P. T. and Filkov, V., In CSCW, 2014, ACM.
* StackOverflow and GitHub: Associations between software development and crowdsourced knowledge, Vasilescu, B., Filkov, V. and Serebrenik, A., In Social Computing, 2013, IEEE.
Stack Overflow - It's all about performance / Marco Cecconi (Stack Overflow) (Ontico)
Stack Overflow, and its Q&A network Stack Exchange, have been growing exponentially for the last five years. They now encompass
~150 Q&A sites
~9 million users
~13 million questions
~22 million answers
In this talk, I will describe:
+ The physical architecture of Stack Overflow. How many servers are there? What is their purpose and what are their specs?
+ The logical architecture of the software. How do we scale up? What are the main building blocks of our software?
+ The tooling system. What supports our extreme optimization philosophy?
+ The development team. What are our core values? What footprint do we want to leave as developers?
Marco Cecconi, Software Developer @ Stack Exchange - The architecture of Stac... (How to Web)
The Stack Exchange network is a huge success story, counting 109 sites and many millions of visitors per month. What software architecture powers a global top-100 website? How is our software structured? How many servers are there? Come find out!
More details on: http://2013.howtoweb.co/
Filippo Lanubile: Social Software as Key Enabler of Collaborative Development Environments.
Keynote speech at the 5th International Workshop on Social Software Engineering (SSE 2013), August 18, 2013, Saint Petersburg, Russia, colocated with ESEC/FSE 2013
Towards the Social Programmer (MSR 2012 Keynote by M. Storey) (Margaret-Anne Storey)
Audio+slide video is posted at http://margaretannestorey.wordpress.com.
Slides from a Keynote at Mining Software Repository Conference 2012, co-located with ICSE 2012 in Zurich, Switzerland.
Mining Sociotechnical Information From Software Repositories (Marco Aurelio Gerosa)
A large amount of data is produced during collaborative software development. The analysis of such data is a great opportunity to better understand Software Engineering from the perspective of evidence-based research. Mining software repositories studies have explored both the technical and social aspects of software development and have contributed to the discovery of important information about how software development evolves and how developers collaborate. Several repositories store data regarding source code production (version control systems), communication between developers and users (forums and mailing lists), and coordination of activities (issue trackers, task managers, etc.). In the open source world, such data is available in large ecosystems of software development. Platforms such as GitHub host millions of repositories, which receive contributions from millions of developers worldwide. Some project repositories register data from more than a decade of development, enabling the analysis of projects from a historical perspective. In this talk, I will discuss some of the uses and challenges of mining software repositories, focusing on work conducted in our group, such as: identification of change dependencies, evaluation of architectural degradation from commit metadata, core-periphery analysis of developer participation, change-proneness prediction, analysis of the impact of refactoring on code quality, and relations between quality attributes of the tests and the code being tested.
Presentation, MoodleMoot 2014 Colombia - Moodle integration with a repositor... (Paola Amadeo)
Connecting Moodle with a digital repository of open learning objects. An experience at the Facultad de Informática of the Universidad Nacional de La Plata, Argentina.
Authors: Javier Díaz, Alejandra Schiavoni, Alejandra Osorio, Paola Amadeo, M. Emilia Charnelli, José Schultz, Alex Humar, Agustina Reynoso
Here at MRM, we are delivering new and exciting work with HTML5, CSS3, responsive web, and cross-platform solutions, but what does that really entail?
Hear the tech team explain recent work, current trends, and future capabilities.
What Rated Ranking Evaluator is and how to use it (for both Software Engineers and IT Managers). A talk given during the Chorus workshops at Plainschwarz Salon.
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad... (Databricks)
Amundsen is the data discovery metadata platform that originated at Lyft and was recently donated to Linux Foundation AI. Since it was open-sourced, Amundsen has been used and extended by many different companies in our community.
professional fuzzy type-ahead rummage around in xml type-ahead search techni... (Kumar Goud)
Abstract – This is a research venture on the new information-access paradigm called type-ahead search, in which systems find answers to a keyword query on the fly as users type in the query. In this paper we study how to support fuzzy type-ahead search in XML. Fuzzy search is important when users have limited knowledge about the exact representation of the entities they are looking for, such as people records in an online directory. We have developed and deployed several such systems, some of which have been used by many people on a daily basis. The systems received overwhelmingly positive feedback from users due to their friendly interfaces with the fuzzy-search feature. We describe the design and implementation of the systems and demonstrate several of them. We show that our efficient techniques can indeed allow this search paradigm to scale to large amounts of data.
Index Terms - type-ahead, large data set, server side, online directory, search technique.
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit... (Sease)
RRE is an open-source search quality evaluation tool that can be used to produce a set of reports about the quality of a system, iteration after iteration, and that can be integrated within a continuous integration infrastructure to monitor quality metrics after each release.
Many aspects remained problematic, though:
– how to directly evaluate a middle-layer search API that communicates with Apache Solr or Elasticsearch?
– how to easily generate explicit and implicit ratings without spending hours on tedious JSON files?
– how to better explore the evaluation results, with nice widgets and interesting insights?
Rated Ranking Evaluator Enterprise solves these problems and much more.
Join us as we introduce the next generation of open-source search quality evaluation tools, exploring the internals and real-world scenarios!
FLOWER VOICE: VIRTUAL ASSISTANT FOR OPEN DATA (IJwest)
Open Data is now attracting attention for innovative service creation, mainly in the areas of government, bioscience, and smart-X projects. However, to promote its application further for consumer services, a search engine for Open Data that reveals what kinds of data exist would be of help. This paper presents a voice assistant which uses Open Data as its knowledge source. It features improved accuracy based on user feedback, and acquisition of unregistered data through user participation. We also show an application to support field work and confirm its effectiveness.
Search Quality Evaluation to Help Reproducibility: An Open-source Approach (Alessandro Benedetti)
Every information retrieval practitioner ordinarily struggles with the task of evaluating how well a search engine is performing and with reproducing the performance achieved at a specific point in time.
Improving the correctness and effectiveness of a search system requires a set of tools which help measure the direction in which the system is going.
Additionally, it is extremely important to track the evolution of the search system over time and to be able to reproduce and measure the same performance (through metrics of interest such as precision@k, recall, NDCG@k...).
The talk will describe the Rated Ranking Evaluator from a researcher's and software engineer's perspective.
RRE is an open-source search quality evaluation tool that can be used to produce a set of reports about the quality of a system, iteration after iteration, and that can be integrated within a continuous integration infrastructure to monitor quality metrics after each release.
The focus of the talk will be to raise public awareness of the topic of search quality evaluation and reproducibility, describing how RRE can help the industry.
Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q... (OpenSource Connections)
Every team working on Information Retrieval software struggles with the task of evaluating how well their system performs in terms of search quality (at a specific point in time and historically).
Evaluating search quality is important both to understand and size the improvement or regression of your search application across development cycles, and to communicate such progress to the relevant stakeholders.
To satisfy these requirements a helpful tool must be:
- flexible and highly configurable for a technical user
- immediate, visual and concise for optimal business utilization
In the industry, and especially in the open-source community, the landscape is quite fragmented: such requirements are often met with ad-hoc partial solutions that each time require a considerable amount of development and customization effort.
To provide a standard, unified and approachable technology, we developed the Rated Ranking Evaluator (RRE), an open-source tool for evaluating and measuring the search quality of a given search infrastructure. RRE is modular, compatible with multiple search technologies and easy to extend. It is composed of a core library and a set of modules and plugins that give it the flexibility to be integrated into automated evaluation processes and continuous integration flows.
This talk will introduce RRE, describe its latest developments, and demonstrate how it can be integrated into a project to measure and assess the search quality of your search application.
The focus of the presentation will be a live demo showing an example project with a set of initial relevancy issues that we solve iteration after iteration, using RRE's output feedback to gradually drive the improvement process until we reach an optimal balance between quality evaluation measures.
STACK OVERFLOW DATASET ANALYSIS
1. Big Data Project Presentation
Team Members: Shrinivasaragav Balasubramanian, Shelley Bhatnagar
STACK OVERFLOW DATASET ANALYSIS
2. The dataset is obtained from the Stack Exchange Data Dump at the Internet Archive.
The link to the dataset is as follows:
https://archive.org/details/stackexchange
Each site under Stack Exchange is published as a separate archive of XML files compressed with 7-Zip.
We chose the Stack Overflow data segment of the Stack Exchange dump, which originally is around 20 GB; we brought it down to 3 GB for performing the analysis.
Dataset Overview:
3. The Stack Overflow dataset consists of the following files, which are treated as tables in our database design:
Posts
PostLinks
Tags
Users
Votes
Badges
Comments
Dataset Overview:
5. Since our dataset is in XML format, we designed a parser for each file (i.e., each table) to process the data easily and load it into HDFS.
The parsers were built into a Java application implementing a Mapper and Reducer, with a Hadoop job configured to parse the data.
The JAR is run in Hadoop distributed mode and the parsed data is written into HDFS.
Each file in the dataset consists of 12+ million entries.
Each table had 6-7 attributes on average, while also containing missing attributes and empty fields, hence inconsistent data entries, which the parser took care of; a sketch of such a parser appears below.
Mission:
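The deck does not include the parser source; the following is a minimal map-only sketch of the approach, assuming one <row .../> element per input line. The attribute names come from the Posts.xml schema, while the tab-separated output format is an assumption for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PostsXmlMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Pull one attribute value out of a <row .../> line; returns "" when the
    // attribute is absent, which also covers the missing/empty fields noted above.
    private static String attr(String line, String name) {
        int i = line.indexOf(name + "=\"");
        if (i < 0) return "";
        int start = i + name.length() + 2;
        int end = line.indexOf('"', start);
        return end < 0 ? "" : line.substring(start, end);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (!line.startsWith("<row")) return; // skip the XML header/footer lines
        String record = String.join("\t",
                attr(line, "Id"),
                attr(line, "PostTypeId"),
                attr(line, "CreationDate"),
                attr(line, "Score"));
        context.write(NullWritable.get(), new Text(record));
    }
}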
6. The Posts table contains an attribute named PostTypeId, which is 1 if the post is a question and 2 if the post is an answer to a question.
Since most of our analysis was centered on this table, we divided it into PostQuestions and PostAnswers to make the analysis simple.
E.g. <row Id="1258222" PostTypeId="2" ParentId="1238775" CreationDate="2009-08-11T02:29:20.380" Score="1" Body="&lt;p&gt;Lisp. There are so many Lisp systems out there defined in terms of rules not imperative commands. Google ahoy...&lt;/p&gt;" OwnerUserId="16709" LastActivityDate="2009-08-11T02:29:20.380" CommentCount="0" />
Posts Table:
7. The trending questions that are viewed and scored highly by users.
The questions that don't have any answers.
The questions that have been marked closed, for each category.
The questions that are dead, with no activity in the past 2 years.
The most viewed questions in each category.
The most scored questions in each category.
The count of posted questions in each category over a timeframe (say, 2 years).
The list of tags other than standard tags.
The top posted questions in each category.
Analysis using Posts
8. The rank of each post in the dataset.
The approximate time for a user's post in a category to receive a correct answer or a working solution.
Analysis on Posts (cont.)
9. The user profile with the maximum views.
The top users with the maximum reputation points.
The most valuable users in the dataset.
The number of users that have been awarded badges.
The count of users creating accounts in a given timeframe (say, 6 months).
Recommending users to contribute an answer for a similarly liked category.
The inactive accounts over a range of time.
The total number of dead accounts.
The number of users bearing various badges.
Analysis on Users:
10. The comments that have a count greater than the average count.
The users posting the maximum number of comments.
The question posts that have the highest number of comments.
Analysis on Comments
11. The number of spam comments in the dataset.
The users that contribute to spam posts.
The posts that are scheduled to be deleted from the data dump over a period of, say, 6 months.
The top users carrying votes marked as favorite.
Analysis on Votes
12. A PageRank is calculated to find the weight of a posted query contributed by a user to the dump.
Each post written as a question may be linked to several other similar posts, posted by users having similar doubts.
Similarly, each answer to a post can be referred to by another post.
Hence, PageRank is a "vote" by all the other posts in the dataset.
A link to a post counts as a vote of support; its absence indicates a lack of support.
Overview of Internal Page Rank Analysis:
13. Thus, if we have a post with PostId = A, which has posts T1...Tn pointing to it, we take a damping factor d between 0 and 1 and define C(T) as the number of links going out of post T. The PageRank of a post is then given as follows:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Page Rank Formula:
14. The PageRank of each post depends on the posts linked to it.
It is calculated without knowing the final PageRank values.
Thus we run the calculation repeatedly, each pass taking us closer to the estimated final value; a minimal sketch of this iteration follows.
How is Page Rank Calculated?
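As an illustration of the repeated calculation, here is a small in-memory sketch. It is a simplified stand-in, not the Pig-based implementation described later: inLinks maps each post to the posts pointing at it, and outDegree gives C(T) from the formula.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PostPageRank {

    // inLinks.get(a)  : the posts T1..Tn linking TO post a
    // outDegree.get(t): C(t), the number of links going out of post t
    public static Map<String, Double> compute(Map<String, List<String>> inLinks,
                                              Map<String, Integer> outDegree,
                                              double d, int iterations) {
        Map<String, Double> pr = new HashMap<>();
        for (String post : inLinks.keySet()) pr.put(post, 1.0); // any starting guess settles
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, List<String>> e : inLinks.entrySet()) {
                double sum = 0.0;
                for (String t : e.getValue()) {
                    sum += pr.getOrDefault(t, 1.0) / outDegree.getOrDefault(t, 1);
                }
                // PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
                next.put(e.getKey(), (1 - d) + d * sum);
            }
            pr = next;
        }
        return pr;
    }
}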
15. The "damping factor" is quite subtle.
If it's too high, it takes ages for the numbers to settle;
if it's too low, you get repeated overshoot.
We performed analysis to find the optimal damping factor.
The damping factor chosen for this dataset is 0.25.
No matter where we start the guess, once settled, the average PageRank of all pages will be 1.0.
Choosing the Damping Factor:
18. The analysis predicts and provides an estimated time within which a user can expect activity on a post.
The analysis involved categorizing the dataset according to tags.
For each posted question, the fastest reply was taken into consideration, and the time difference between posting a question and getting the first reply was calculated.
This difference was averaged over all the posts belonging to a category, thereby predicting the activity on a post; a sketch of the averaging step follows.
Predicting First Activity Time On A Post
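A tiny in-memory sketch of the per-category averaging step, assuming each question is represented by its category, posting time, and first-reply time (the project actually computed this over the full dataset on Hadoop).

import java.time.Duration;
import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FirstActivityTime {

    static class QA {
        final String category;
        final LocalDateTime asked, firstReply;
        QA(String category, LocalDateTime asked, LocalDateTime firstReply) {
            this.category = category; this.asked = asked; this.firstReply = firstReply;
        }
    }

    // Average hours between posting a question and its first reply, per category.
    static Map<String, Double> averageHoursByCategory(List<QA> posts) {
        Map<String, Double> totalHours = new HashMap<>();
        Map<String, Integer> count = new HashMap<>();
        for (QA qa : posts) {
            double hours = Duration.between(qa.asked, qa.firstReply).toMinutes() / 60.0;
            totalHours.merge(qa.category, hours, Double::sum);
            count.merge(qa.category, 1, Integer::sum);
        }
        Map<String, Double> avg = new HashMap<>();
        totalHours.forEach((cat, sum) -> avg.put(cat, sum / count.get(cat)));
        return avg;
    }
}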
19. In the application, a user can provide the tags he/she would be using for their posts.
Based on the tags provided, the application calculates the average time taken for activity on each tag and then averages those results.
How This Works In The Application
20. Creating a graph structure based on posts and related posts.
The graph comprises nodes and edges.
Each node has several edges, and each edge leads to another node that again has several edges.
We created a Pig UDF to which all the posts and related posts are sent as a group.
Based on that input, a graph gets created.
Rank is calculated based on how many incoming links each node has.
The more incoming links a node has, the higher its PageRank; a hedged sketch of such a UDF follows.
How We Did It
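The slides do not show the UDF itself; this is a hedged sketch of a Pig EvalFunc along the lines described, counting incoming links per post from a bag of (postId, relatedPostId) tuples. The class name and the assumption that post IDs arrive as chararrays are invented for illustration.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class InboundLinkCount extends EvalFunc<DataBag> {

    // Input: a bag of (postId, relatedPostId) tuples for one group.
    // Output: a bag of (postId, inboundLinkCount) tuples.
    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag links = (DataBag) input.get(0);
        Map<String, Long> inbound = new HashMap<>();
        for (Tuple t : links) {
            String target = String.valueOf(t.get(1)); // the post being linked to
            inbound.merge(target, 1L, Long::sum);     // one more incoming edge
        }
        DataBag out = BagFactory.getInstance().newDefaultBag();
        for (Map.Entry<String, Long> e : inbound.entrySet()) {
            Tuple row = TupleFactory.getInstance().newTuple(2);
            row.set(0, e.getKey());
            row.set(1, e.getValue());
            out.add(row);
        }
        return out;
    }
}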
21. Integrated Hive with the existing HBase table.
We need to provide hbase.columns.mapping, whereas hbase.table.name is optional.
We use the HBaseStorageHandler to allow Hive to interact with HBase; an example of the mapping DDL appears below.
Hive HBase Integration
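A hedged example of the mapping DDL, held as a string for the JDBC sketch after the next slide. The table and column names are invented for illustration; hbase.columns.mapping is mandatory, while hbase.table.name defaults to the Hive table name when omitted.

public class HiveHBaseDdl {
    // hbase.columns.mapping is required; hbase.table.name is optional and
    // defaults to the Hive table name.
    public static final String CREATE_POSTS =
        "CREATE EXTERNAL TABLE posts_hbase(id INT, title STRING, score INT) "
      + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
      + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,p:title,p:score') "
      + "TBLPROPERTIES ('hbase.table.name' = 'posts')";
}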
22. HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results.
We used the Hive Thrift Server to connect to the Hive tables from the web application.
Starting the Hive Thrift Server: hive --service hiveserver
Connection String:
Hive Thrift Server
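The slide's actual connection string is not shown, so here is a hedged sketch of connecting to the HiveServer1-era Thrift service over JDBC; the host, port, and queried table are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveThriftClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver"); // HiveServer1 JDBC driver
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive://localhost:10000/default", "", "");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, title FROM posts_hbase LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        }
    }
}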
23. Providing suggestions to users regarding the various questions they can answer from other categories.
We have taken the user ID, category ID, and interaction level as the input to the Mahout user recommender.
Mahout User Based Recommender
24. We used Pig queries to join the various tables and produce an output containing user ID, category ID, and interaction level.
We used this output as the input to the Mahout user-based recommender.
We converted the interaction-level values to the range 0 to 5.
We used PearsonCorrelationSimilarity with a nearest-N user neighborhood.
We then used the user-based recommender to provide 3 suggestions of other categories to which the user can contribute by answering questions; a sketch of this wiring appears below.
How Did We Implement It
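A hedged sketch of the Mahout wiring described above, using the Taste API's NearestNUserNeighborhood for the nearest-N neighborhood; the input file name, the neighborhood size of 10, and the user ID are assumptions. The input CSV rows are userId,categoryId,interactionLevel with levels in 0-5 as on the slide.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class CategoryRecommender {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("user_category_level.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Three category suggestions for user 42, as on the slide.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}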
26. We were able to incorporate our analysis into a web application.
The web application retrieves the required data using HBase and Hive.
Attached below are screenshots of the application and the analysis that has been performed.
We used Google Charts for displaying our analysis in graphs.
Web Application
39. Performance depends upon input sizes and the MapReduce/HDFS chunk size.
Where queries required sorting of data, many temporary files were created and written to disk.
The performance of MapReduce is evaluated by reviewing the counters for the map task.
In the parser implemented to read the XML files, significant problems were faced.
The number of spilled records was significantly higher than what the map task read, which resulted in a NullPointerException with the message:
INFO mapreduce.Job: Job job_local1747290386_0001 failed with state FAILED due to: NA
Problem Faced:
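When spilled records dominate as described, a common remedy (though not necessarily the fix the team applied) is to enlarge the map-side sort buffer so fewer spill files are written; a hedged sketch using Hadoop 2.x property names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParserJobTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 512);            // larger sort buffer -> fewer spills
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // start spilling later
        return Job.getInstance(conf, "stackoverflow-xml-parser");
    }
}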