Exploring Language Communities on Github
2. Introduction
This study explores underlying patterns and detects communities in the programming languages used by GitHub users, via network analysis.
Two graphs are derived from the whole dataset, plus two location-specific graphs, in order to study both the general GitHub audience and the trends in some sample locations.
Goal: Understand how languages are practically grouped by the way developers use them, and discover trends both worldwide and in specific locations.
Nodes → Languages
Edges → Language co-occurrence in User Profiles (based on the user repositories)
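The node/edge scheme above can be sketched with networkx; the user-to-languages mapping here is hypothetical toy data standing in for the real dataset:

```python
from itertools import combinations

import networkx as nx

# Hypothetical mapping of users to the languages found in their repositories
user_languages = {
    "alice": ["JavaScript", "CSS", "Python"],
    "bob": ["JavaScript", "CSS"],
    "carol": ["C", "Python"],
}

G = nx.Graph()
for langs in user_languages.values():
    # One edge per unordered pair of languages in the same profile;
    # the edge weight counts how many users share that pair
    for a, b in combinations(sorted(set(langs)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

print(G["CSS"]["JavaScript"]["weight"])  # 2 users pair CSS with JavaScript
```

The resulting weighted graph can be exported (e.g. with `nx.write_gexf`) and explored in Gephi as in the slides that follow.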
3. GitHub
● GitHub is a web-based Git repository hosting service
● It offers distributed revision control and source code
management (SCM)
● It is the largest host of source code in the world! [1]
Why Github?
“The introduction of social features in a code hosting site has drawn
particular attention from researchers while the integrated social
features, and the availability of metadata through an accessible api
have made GitHub very attractive for software engineering
researchers” [3]
Top Image Source: https://goo.gl/CWBMqb
Bottom Image Source: https://github.com/logos
4. Pros
● Developers will get a hint of which languages are used jointly, and thus perhaps serve the same purpose.
● Language creators will get a hint of what their audience prefers and trusts.
● Language communities might actually be another way to explore developer communities.
Challenges
● Programming language categorization ambiguity
● GitHub bias towards Web Development
● Locations and users have a power-law distribution: there are numerous developers from a few locations (such as California, London etc.) and a significant number of locations with few users
5. Fundamentals
Dataset Features
➔ ID, Username, Location, Followers, Public Repos, Languages & Bytes of code
Network Structure
➔ Nodes: Languages
◆ Attribute: Total Bytes of Code
➔ Edges: Pairs of Languages that co-occurred in at least one user profile
◆ Weight: Number of users that use both languages
Challenges in the Data
➔ Only public repositories are accessible (users mainly work on private ones!)
➔ Locations are added by the user (may be empty, not real, or not written consistently)
PyGithub [2]
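Gathering the dataset features with PyGithub [2] might look like the sketch below. Only the aggregation helper is concrete; the commented PyGithub calls (`Github`, `get_user`, `get_repos`, `get_languages`) use placeholder token and login values, and `get_languages()` returns a language-to-bytes mapping per repository:

```python
from collections import Counter

def aggregate_languages(repo_languages):
    """Sum bytes of code per language over a user's repositories.

    repo_languages: iterable of dicts shaped like PyGithub's
    Repository.get_languages() result, e.g. {"Python": 1234, ...}.
    """
    total = Counter()
    for langs in repo_languages:
        total.update(langs)
    return dict(total)

profile = aggregate_languages([{"JavaScript": 200, "CSS": 100}, {"JavaScript": 50}])
# profile == {"JavaScript": 250, "CSS": 100}

# With PyGithub (placeholders, not run here):
#   from github import Github
#   g = Github("YOUR_TOKEN")
#   user = g.get_user("octocat")
#   langs = aggregate_languages(r.get_languages() for r in user.get_repos())
#   # user.location, user.followers, user.public_repos give the other features
```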
6. Final Datasets
❏ 4,000 users since GitHub's foundation + 150,000 from 2012
❏ Filter: Get only users with locations!
❏ Final: 2,300 users since GitHub's foundation + 37,000 from 2012
9. Methodology
Create graph (as described):
● Filters: Degree Range
● Layout: Force Atlas 2
● Node size: “Bytes of Code” Range
● Label size: Degree Range
Compute Modularity & get communities:
● Sometimes using edge weights, sometimes not
Visualize pairs of languages and the number of developers that use both
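The modularity step above could also be reproduced outside Gephi, for instance with networkx's greedy modularity communities (a different implementation of the same idea, shown here on a toy co-occurrence graph with made-up weights):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy co-occurrence graph: a web-oriented cluster and a systems cluster,
# loosely bridged through Python
G = nx.Graph()
G.add_weighted_edges_from([
    ("JavaScript", "CSS", 9), ("JavaScript", "HTML", 8), ("CSS", "HTML", 7),
    ("C", "C++", 6), ("C", "Assembly", 5), ("C++", "Assembly", 4),
    ("Python", "JavaScript", 1), ("Python", "C", 1),
])

# Use edge weights, as the methodology sometimes does
communities = greedy_modularity_communities(G, weight="weight")
for c in communities:
    print(sorted(c))
```

On this toy input the web-oriented and systems languages separate into distinct communities, mirroring the kind of grouping the study looks for.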
19. Conclusions
Language-Oriented
➔ “Web-oriented” is the most robust category of languages used on GitHub
➔ “JavaScript - CSS” is the leading pair of languages, always outnumbering all other pairs
➔ Even though JavaScript almost always dominates the pairs of languages, C is consistently the
most-used language in terms of bytes of code [perhaps C users are not language-extroverts…]
Scheme-Oriented
➔ With a user-based scheme we can understand the general preferences of developers and the
patterns between languages. [difficult when the dataset is big!]
➔ With a repo-based scheme we can uncover hidden (or at least not widely known)
patterns of languages that are used for the same purposes.
➔ General purpose: repo-based scheme
Location purpose: user-based scheme
20. Future Work
● More Data !
● More Locations and Comparisons
● Language Graphs based on Top/Most influential Users [using followers or stars]
● Association Rules on Languages for community detection
● User Graph to detect user communities per Location (e.g. web developers, game
developers) and compare with Language Graph of Location
21. References
1. GitHub on Wikipedia: https://en.wikipedia.org/wiki/GitHub
2. PyGithub library: https://github.com/PyGithub/PyGithub
3. Kalliamvakou, Eirini, et al. "The promises and perils of mining GitHub." Proceedings of the
11th Working Conference on Mining Software Repositories. ACM, 2014.
4. Thung, Ferdian, et al. "Network structure of social coding in GitHub." 2013 17th European
Conference on Software Maintenance and Reengineering (CSMR). IEEE, 2013.
5. Takhteyev, Yuri, and Andrew Hilts. "Investigating the geography of open source software
through GitHub." (2010).
6. Figueira Filho, Fernando, et al. "A study on the geographical distribution of Brazil's
prestigious software developers." Journal of Internet Services and Applications 6.1 (2015): 1.
Image Source: http://wifflegif.com/tags/58347-octocat-gifs
22. Thank you for your attention! Any questions?
Image Source: https://octodex.github.com/images/heisencat.png